Consistency Of Semi-Supervised Learning Algorithms On Graphs
Franca Hoffmann
Computing and Mathematical Sciences, California Institute of Technology
Joint work with: Bamdad Hosseini, Zhi Ren and Andrew M. Stuart
arXiv:1906.07658
Research Challenges and Opportunities at the Interface of Machine Learning and Uncertainty Quantification
July 25th, 2019
Overview
1 The Setting
2 Inverse Problem
3 Posterior Consistency
4 Numerical Results
5 Conclusions
1. The Setting
Semi-Supervised Learning On Graphs
- Vertices: $Z = \{1, \dots, n\}$.
- Unlabelled data $X : Z \to \mathbb{R}^d$, i.e. $X = \{x_1, \dots, x_n\} \in \mathbb{R}^{d \times n}$.
- Labelled data $y : Z' \to \{\pm 1\}$, with $Z' \subset Z$, $|Z'| = J \ll n$.
- Extensions: to multi-class; to Bayesian; to large data $n \to \infty$.
(Geometry Of Graph) + (Observed Labels) → Label All Of Z.
Clustering And Semi-Supervised Learning
Key Idea
(Observed Labels) + (Graph Laplacian Spectral Geometry)
[Figure: two panels, "Labels" and "Fiedler Vector"]
The Graph Laplacian
Unlabelled Data
- $\{x_j\}_{j=1}^{n} \subset \mathbb{R}^d$.
Weight Function
- Suitable kernel $\eta : \mathbb{R}^+ \to \mathbb{R}^+$.
- Example: $\eta(r) = \delta^{-d} \exp(-r^2/2\delta^2)$.
- Example: $\eta(r) = \mathbb{1}_{[0,\delta)}(r)$.
Graph Laplacian $L = D - W$.
- Weighted adjacency matrix $W : Z \times Z \to \mathbb{R}^+$, with edge weights $W_{ij} = \eta(|x_i - x_j|)$.
- Weighted degree matrix $D : Z \times Z \to \mathbb{R}^+$, with $D_{ij} = \big(\sum_k W_{ik}\big)\,\delta_{ij}$.
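As a concrete sketch, this construction takes a few lines of numpy; the bandwidth value and the zeroed diagonal below are illustrative choices, not prescribed by the slides:

```python
import numpy as np

def graph_laplacian(X, delta=0.5):
    """Unnormalised graph Laplacian L = D - W from a point cloud X (n x d),
    using the Gaussian kernel eta(r) = delta^{-d} exp(-r^2 / (2 delta^2))."""
    n, d = X.shape
    # Pairwise Euclidean distances |x_i - x_j|.
    diffs = X[:, None, :] - X[None, :, :]
    r = np.linalg.norm(diffs, axis=-1)
    # Weighted adjacency matrix W_ij = eta(|x_i - x_j|); no self-loops.
    W = delta ** (-d) * np.exp(-r**2 / (2 * delta**2))
    np.fill_diagonal(W, 0.0)
    # Degree matrix D_ij = (sum_k W_ik) delta_ij.
    D = np.diag(W.sum(axis=1))
    return D - W
```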
2. Inverse Problem
Inverse Problem
Model
$$ y_j = \mathrm{sign}(u_j + \eta_j), \qquad j \in Z'. $$
- Find $u : Z \to \mathbb{R}$;
- label via $\mathrm{sign}(u) : Z \to \{\pm 1\}$.
Problem Statement (Bayesian Formulation)
Semi-Supervised Learning
- Prior: $u : Z \to \mathbb{R}$ Gaussian; unlabelled data $\{x_j \in \mathbb{R}^d,\ j \in Z := \{1, \dots, n\}\}$.
- Likelihood: $y \mid u$; labelled data $\{y_j \in \{\pm 1\},\ j \in Z' \subseteq Z\}$.
- Posterior: $u \mid y$; labels everywhere $\{\mathrm{sign}(u_j) \in \{\pm 1\},\ j \in Z\}$.
Connection between probability and optimization:
$$ \mathbb{P}(u \mid y) \propto \mathbb{P}(y \mid u) \times \mathbb{P}(u) \propto \exp\big(-\Phi_\gamma(u; y)\big) \times N(0, C) \propto \exp\big(-J(u; y)\big), $$
$$ J(u; y) = \frac{1}{2} \langle u, C^{-1} u \rangle_{\mathbb{R}^n} + \Phi_\gamma(u; y). $$
Probit
Probit Model
- Objective function: $J(u; y) = \frac{1}{2} \langle u, C^{-1} u \rangle_{\mathbb{R}^n} + \Phi_\gamma(u; y)$.
- Covariance: $C = \tau^{2\alpha} (L + \tau^2 I)^{-\alpha}$.
- Likelihood: $y_j = \mathrm{sign}(u_j + \eta_j)$, $\eta_j \sim N(0, \gamma^2)$.
- Misfit: $\Phi_\gamma(u; y) := -\sum_{j \in Z'} \log\big(\Psi_\gamma(y_j u_j)\big)$.
- $\Psi_\gamma$ is the cdf of $N(0, \gamma^2)$.
Rasmussen and Williams, 2006. (MIT Press)
Bertozzi, Luo, Stuart and Zygalakis, 2018. (SIAM JUQ)
[Figure: plots of $\Psi_\gamma(x)$ and $-\log \Psi_\gamma(x)$]
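For illustration, a minimal Python sketch of computing this MAP estimator; the function name, the eigendecomposition route to $C^{-1}$, and the use of a generic optimiser are my own choices:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def probit_map(L, labels, alpha=1.0, tau=0.1, gamma=0.1):
    """MAP estimator u* = argmin J(u; y) for the probit model.
    `labels` maps vertex index j -> y_j in {+1, -1} for j in Z'."""
    N = L.shape[0]
    # C^{-1} = tau^{-2 alpha} (L + tau^2 I)^{alpha}, via eigendecomposition of L.
    lam, phi = np.linalg.eigh(L)
    C_inv = phi @ np.diag(((lam + tau**2) / tau**2) ** alpha) @ phi.T

    idx = np.array(sorted(labels))
    y = np.array([labels[j] for j in idx], dtype=float)

    def J(u):
        prior = 0.5 * u @ C_inv @ u
        # Misfit: -sum_j log Psi_gamma(y_j u_j), Psi_gamma the N(0, gamma^2) cdf.
        misfit = -np.sum(norm.logcdf(y * u[idx], scale=gamma))
        return prior + misfit

    return minimize(J, np.zeros(N), method="L-BFGS-B").x
```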
Modelling Assumptions
$\tau = 0$, $u \perp \mathbf{1}$.
Modelling Assumptions
$\tau > 0$, $u \in \mathbb{R}^N$.
A Representer Theorem For Probit
$$ u^* := \operatorname*{arg\,inf}_u J(u; y). $$
Theorem: Sparse Representation Of $u^*$
Suppose $\alpha, \gamma, \tau^2 > 0$. Then $u^* \in \mathbb{R}^N$ is unique and has representation
$$ u^* = \sum_{k=1}^{J} a_k c_k, $$
where $\{c_k\}_{k=1}^{J} \subset \mathbb{R}^N$ comprise $J$ columns of $C$ and $\{a_k\}_{k=1}^{J} \subset \mathbb{R}$ solve
$$ a_j = F_j\Big( \sum_{k=1}^{J} a_k (c_j)_k \Big), \qquad F_j(t) := -\partial_t \log \Psi_\gamma(t y_j). $$
Schölkopf et al, 2001. (MIT Press)
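One way to exploit this representation numerically is a damped fixed-point iteration on the coefficients $a$, sketched below under the assumption that $\{c_k\}$ are the columns of $C$ at the labelled indices `idx`; the damping factor and iteration count are illustrative additions:

```python
import numpy as np
from scipy.stats import norm

def representer_coefficients(C, idx, y, gamma=0.1, n_iter=500, damping=0.5):
    """Fixed-point iteration a_j = F_j(sum_k a_k (c_j)_k), with
    F_j(t) = -d/dt log Psi_gamma(t y_j) = -y_j pdf(t y_j) / cdf(t y_j)."""
    Cs = C[np.ix_(idx, idx)]      # inner products (c_j)_k over labelled j, k
    a = np.zeros(len(idx))
    for _ in range(n_iter):
        t = Cs @ a                # t_j = sum_k a_k (c_j)_k
        # Stable pdf/cdf ratio via log-space subtraction.
        F = -y * np.exp(norm.logpdf(y * t, scale=gamma)
                        - norm.logcdf(y * t, scale=gamma))
        a = (1 - damping) * a + damping * F
    return a

# The MAP estimator is then recovered as u* = sum_k a_k c_k = C[:, idx] @ a.
```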
3. Posterior Consistency
Dashti et al, 2013. (Inverse Problems)
Nickl et al, 2018. (arXiv)
Posterior Consistency For Probit (MAP)
$$ y_j = \mathrm{sign}(u^\dagger_j + \eta_j), \qquad \eta_j \sim N(0, \gamma^2), $$
$$ u^* = \operatorname*{arg\,min}_{u \in \mathbb{R}^N} \frac{1}{2} \langle u, C^{-1} u \rangle + \Phi_\gamma(u; y). $$
Asymptotic Consistency Of Probit
Conditions on graph $G$, $C$ and $u^\dagger$ to ensure that, as $\gamma \downarrow 0$,
$$ \mathrm{sign}(u^*_j) \to \mathrm{sign}(u^\dagger_j), \qquad \forall j \in \{1, \dots, N\}, $$
in some sense.
Condition On Graph G
- $G = \{Z, W\}$: $Z = \bigcup_k Z_k$.
- $G_0 = \{Z, W_0\}$: $W_0 = \mathrm{diag}\{W_1, \dots, W_K\}$ and $L_0 = \mathrm{diag}\{L_1, \dots, L_K\}$.
- $G_\varepsilon = \{Z, W_\varepsilon\}$: $W_\varepsilon = W_0 + O(\varepsilon)$ and $L_\varepsilon = L_0 + O(\varepsilon)$.
Condition on graph $G_\varepsilon$: an $O(\varepsilon)$ perturbation of perfectly clustered data.
Condition on $u^\dagger$: $\mathrm{sign}(u^\dagger)$ is constant on each cluster $Z_k$.
Condition On Cε: Carefully Chosen τ
Eigenstructure Of Lε
- $L_\varepsilon \varphi_{j,\varepsilon} = \lambda_{j,\varepsilon} \varphi_{j,\varepsilon}$.
- $\lambda_{k,0} = 0$ and $\varphi_{k,0} = \frac{1}{\sqrt{N_k}} \chi_k$ for $k = 1, \dots, K$.
- $\{\chi_k\}_{k=1}^{K} \subset \mathbb{R}^N$, with $\chi_k$ the indicator of $Z_k$.

Implications For $C_0$
- $C_0 := \tau^{2\alpha} (L_0 + \tau^2 I)^{-\alpha}$.
- $C_0 = \sum_{k=1}^{K} \frac{1}{N_k} \chi_k \chi_k^T + O(\tau^{2\alpha})$.
Condition on $C_\varepsilon$: $\varepsilon = o(\tau^2)$.

Implications For $C_\varepsilon$
- $C_\varepsilon := \tau^{2\alpha} (L_\varepsilon + \tau^2 I)^{-\alpha}$.
- $C_\varepsilon = \sum_{k=1}^{K} \frac{1}{N_k} \chi_k \chi_k^T + O\big(\varepsilon + \tau^{2\alpha} + \tfrac{\varepsilon}{\tau^2}\big)$.
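A quick numerical sanity check of this expansion might look as follows; the cluster sizes, the complete-graph blocks, and the generic symmetric perturbation standing in for the $O(\varepsilon)$ graph perturbation are all illustrative assumptions:

```python
import numpy as np

def covariance(L, alpha, tau):
    """C = tau^{2 alpha} (L + tau^2 I)^{-alpha} via eigendecomposition of L."""
    lam, phi = np.linalg.eigh(L)
    return phi @ np.diag((tau**2 / (lam + tau**2)) ** alpha) @ phi.T

rng = np.random.default_rng(0)
sizes = [10, 15]                                   # illustrative cluster sizes N_k
N, K = sum(sizes), len(sizes)

# L0 = diag{L_1, ..., L_K}: block-diagonal Laplacian of two complete graphs.
blocks = [m * np.eye(m) - np.ones((m, m)) for m in sizes]
L0 = np.block([[blocks[0], np.zeros((sizes[0], sizes[1]))],
               [np.zeros((sizes[1], sizes[0])), blocks[1]]])

# L_eps = L0 + O(eps): a small symmetric perturbation.
eps = 1e-4
E = rng.standard_normal((N, N))
L_eps = L0 + eps * (E + E.T) / 2

# Leading-order prediction: sum_k (1/N_k) chi_k chi_k^T.
chi = np.zeros((N, K))
chi[:sizes[0], 0], chi[sizes[0]:, 1] = 1, 1
C_lead = sum(np.outer(chi[:, k], chi[:, k]) / sizes[k] for k in range(K))

tau = 0.1                                          # note eps = o(tau^2) here
C_eps = covariance(L_eps, alpha=1.0, tau=tau)
print(np.linalg.norm(C_eps - C_lead))              # small: O(eps + tau^2 + eps/tau^2)
```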
Consistency of Generalized Probit
$$ u^* = \operatorname*{arg\,min}_{u \in \mathbb{R}^N} \frac{1}{2} \langle u, \tau^{-2\alpha} (L_\varepsilon + \tau^2 I)^{\alpha} u \rangle - \sum_{j=1}^{J} \log \Psi_\gamma(u_j y_j). $$
Theorem: Probit Is Asymptotically Consistent
Suppose the preceding conditions on $G$, $\tau$ and $u^\dagger$ hold, and that at least one label is observed in each cluster $Z_k$. Then, for any sequence $(\varepsilon, \tau, \gamma) \downarrow 0$ along which $\varepsilon = o(\tau^2)$, it holds that
$$ \mathrm{sign}(u^*_j) \xrightarrow{\text{a.s.}} \mathrm{sign}(u^\dagger_j), \qquad \forall j \in \{1, \dots, N\}. $$
Implication: $\tau$ should be tuned; this motivates a hierarchical Bayesian treatment.
4. Numerical Results
Set-Up
- Three clusters.
- $N = 3 \times 50 = 150$ vertices.
- Two classes.
[Figure: 3D scatter plots (axes $x_1, x_2, x_3$) of the perfectly clustered graph $G_0$ and the true labels $u^\dagger$]
Accuracy of Probit
- Observe one perfect label in each cluster, no noise: $y_j = \mathrm{sign}(u^\dagger_j)$.
- Perturb the graph Laplacian: $L_\varepsilon = L_0 + \varepsilon L'$.
- Logistic cdf: $\Psi_\gamma(t) = (1 + \exp(-t/\gamma))^{-1}$.
- Probit (modified to logistic likelihood): $u^* = \operatorname*{arg\,min}_{u \in \mathbb{R}^N} J(u; y)$.
- Report the % of points mislabelled by $u^*$ (figure below).
[Figure: % of mislabelled points (vertical axis, 0 to 30) as the perturbation grows, with 3D scatter plots (axes $x_1, x_2, x_3$) of the resulting labellings]
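A sketch of how one run of this experiment might be scripted; the function and variable names are mine, with `u_dagger` holding the true latent values and `labelled_idx` the one labelled vertex per cluster:

```python
import numpy as np
from scipy.optimize import minimize

def mislabelled_pct(L_eps, u_dagger, labelled_idx, alpha=1.0, tau=0.1, gamma=0.1):
    """Fit the MAP estimator with logistic likelihood and report the
    percentage of vertices whose sign disagrees with u_dagger."""
    N = L_eps.shape[0]
    lam, phi = np.linalg.eigh(L_eps)
    C_inv = phi @ np.diag(((lam + tau**2) / tau**2) ** alpha) @ phi.T
    y = np.sign(u_dagger[labelled_idx])           # perfect labels, no noise

    def J(u):
        # Logistic cdf: -log Psi_gamma(t) = log(1 + exp(-t / gamma)).
        t = y * u[labelled_idx]
        return 0.5 * u @ C_inv @ u + np.sum(np.logaddexp(0.0, -t / gamma))

    u_star = minimize(J, np.zeros(N), method="L-BFGS-B").x
    return 100.0 * np.mean(np.sign(u_star) != np.sign(u_dagger))
```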
5. Conclusions
Summary:
- Formulate the probit MAP estimator problem.
- Probit: an optimal solution exists and is unique.
- Probit: can be solved by dimension reduction to the labelled set.
- Probit: asymptotically consistent.
- $\varepsilon = o(\tau^2)$ is crucial for success.
Generalizations:
- Multi-class using one-hot encoding (this work).
- Different weightings of the graph Laplacian (this and future work); see the formula below.
- Continuum limit (with Assad A. Oberai); see the PDE below.
Different weightings of the graph Laplacian:
$$ L := \begin{cases} D^{\frac{1-p}{q-1}} (D - W) D^{-\frac{r}{q-1}}, & \text{if } q \neq 1, \\ D - W, & \text{if } q = 1, \end{cases} $$
for parameters $p, q, r \in \mathbb{R}$, and re-weighted $W$.
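In code, this family might be realised as follows, assuming the reconstruction of the exponents above; `W` is the (possibly re-weighted) adjacency matrix:

```python
import numpy as np

def weighted_laplacian(W, p, q, r):
    """Weighted graph Laplacian L = D^{(1-p)/(q-1)} (D - W) D^{-r/(q-1)}
    for q != 1, and the unnormalised L = D - W for q == 1."""
    d = W.sum(axis=1)                 # weighted degrees
    L = np.diag(d) - W
    if q == 1:
        return L
    return np.diag(d ** ((1 - p) / (q - 1))) @ L @ np.diag(d ** (-r / (q - 1)))
```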
Continuum limit (with Assad A. Oberai): continuum using $L \approx \mathcal{L}$, where
$$ \mathcal{L} u = -\frac{1}{\rho^p} \nabla \cdot \Big( \rho^q \nabla \Big( \frac{u}{\rho^r} \Big) \Big), \quad x \in D, \qquad \frac{\partial u}{\partial n} = 0, \quad x \in \partial D. $$
Dunlop et al, 2019. (ACHA) for $(p, q, r) = (1, 2, 0)$ and $(p, q, r) = (3/2, 2, 1/2)$.
Confocal microscope image: two neurons
Bamdad Hosseini, Assad A. Oberai, Andrew M. Stuart (ongoing)
Source: http://cellimagelibrary.org/images/6195
Confocal microscope image: First eigenfunction
Confocal microscope image: Second eigenfunction
References
[1] F. Hoffmann, B. Hosseini, Z. Ren, A.M. Stuart. Consistency of semi-supervised learning algorithms on graphs: Probit and one-hot methods. arXiv:1906.07658.
[2] B. Schölkopf and A.J. Smola. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT Press, 2001.
[3] C.E. Rasmussen and C.K. Williams. Gaussian processes for machine learning. MIT Press, 2006.
[4] A.L. Bertozzi, X. Luo, A.M. Stuart and K.C. Zygalakis. Uncertainty quantification in graph-based classification of high dimensional data. SIAM JUQ, 6 (2018).
[5] M. Dashti, K.J.H. Law, A.M. Stuart, J. Voss. MAP estimators and their consistency in Bayesian nonparametric inverse problems. Inverse Problems, 29 (2013).
[6] R. Nickl, S. van de Geer, S. Wang. Convergence rates for penalised least squares estimators in PDE-constrained regression problems. arXiv:1809.08818.
[7] M.M. Dunlop, D. Slepčev, A.M. Stuart, M. Thorpe. Large data and zero noise limits of graph-based semi-supervised learning algorithms. Applied and Computational Harmonic Analysis, 2019.
SSL for UQ: Five clusters & two labels
Bamdad Hosseini, Assad A. Oberai, Andrew M. Stuart (ongoing)
SSL for UQ: Fiedler vector