Consistency Of Semi-Supervised Learning Algorithms On Graphs
Franca Hoffmann
Computing and Mathematical Sciences, California Institute of Technology
Joint work with: Bamdad Hosseini, Zhi Ren and Andrew M. Stuart
arXiv:1906.07658
Research Challenges and Opportunities at the Interface of Machine Learning and Uncertainty Quantification
July 25th, 2019
Overview
1 The Setting
2 Inverse Problem
3 Posterior Consistency
4 Numerical Results
5 Conclusions
1. The Setting
Semi-Supervised Learning On Graphs
- Vertices: $Z = \{1, \dots, n\}$.
- Unlabelled data $X : Z \to \mathbb{R}^d$, i.e. $X = \{x_1, \dots, x_n\} \in \mathbb{R}^{d \times n}$.
- Labelled data $y : Z' \to \{\pm 1\}$, with $Z' \subset Z$, $|Z'| = J \ll n$.
- Extensions: to multi-class; to Bayesian; to large data $n \to \infty$.
(Geometry Of Graph) + (Observed Labels) → Label All Of Z.
Clustering And Semi-Supervised Learning
Key Idea
(Observed Labels) + (Graph Laplacian Spectral Geometry)
[Figure: two panels, "Labels" and "Fiedler Vector"]
The Graph Laplacian
Unlabelled Data
- $\{x_j\}_{j=1}^{n} \subset \mathbb{R}^d$.
Weight Function
- Suitable kernel $\eta : \mathbb{R}^+ \to \mathbb{R}^+$.
- Example: $\eta(r) = \delta^{-d} \exp(-r^2/2\delta^2)$.
- Example: $\eta(r) = \mathbb{1}_{[0,\delta)}(r)$.
Graph Laplacian $L = D - W$.
- Weighted adjacency matrix $W : Z \times Z \to \mathbb{R}^+$, with edge weights $W_{ij} = \eta(|x_i - x_j|)$.
- Weighted degree matrix $D : Z \times Z \to \mathbb{R}^+$, with $D_{ij} = \big(\sum_k W_{ik}\big)\,\delta_{ij}$.
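As a concrete sketch, this construction takes a few lines of numpy; the bandwidth value and the zeroed diagonal below are illustrative choices, not prescribed by the slides:

```python
import numpy as np

def graph_laplacian(X, delta=0.5):
    """Unnormalised graph Laplacian L = D - W from a point cloud X (n x d),
    using the Gaussian kernel eta(r) = delta^{-d} exp(-r^2 / (2 delta^2))."""
    n, d = X.shape
    # Pairwise Euclidean distances |x_i - x_j|.
    diffs = X[:, None, :] - X[None, :, :]
    r = np.linalg.norm(diffs, axis=-1)
    # Weighted adjacency matrix W_ij = eta(|x_i - x_j|); no self-loops.
    W = delta ** (-d) * np.exp(-r**2 / (2 * delta**2))
    np.fill_diagonal(W, 0.0)
    # Degree matrix D_ij = (sum_k W_ik) delta_ij.
    D = np.diag(W.sum(axis=1))
    return D - W
```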
2. Inverse Problem
Inverse Problem
Model
$$ y_j = \mathrm{sign}(u_j + \eta_j), \qquad j \in Z'. $$
- Find $u : Z \to \mathbb{R}$;
- label via $\mathrm{sign}(u) : Z \to \{\pm 1\}$.
Problem Statement (Bayesian Formulation)
Semi-Supervised Learning
- Prior: $u : Z \to \mathbb{R}$ Gaussian; unlabelled data $\{x_j \in \mathbb{R}^d,\ j \in Z := \{1, \dots, n\}\}$.
- Likelihood: $y \mid u$; labelled data $\{y_j \in \{\pm 1\},\ j \in Z' \subseteq Z\}$.
- Posterior: $u \mid y$; labels everywhere $\{\mathrm{sign}(u_j) \in \{\pm 1\},\ j \in Z\}$.
Connection between probability and optimization:
$$ \mathbb{P}(u \mid y) \propto \mathbb{P}(y \mid u) \times \mathbb{P}(u) \propto \exp\big(-\Phi_\gamma(u; y)\big) \times N(0, C) \propto \exp\big(-J(u; y)\big), $$
$$ J(u; y) = \frac{1}{2} \langle u, C^{-1} u \rangle_{\mathbb{R}^n} + \Phi_\gamma(u; y). $$
Probit
Probit Model
- Objective function: $J(u; y) = \frac{1}{2} \langle u, C^{-1} u \rangle_{\mathbb{R}^n} + \Phi_\gamma(u; y)$.
- Covariance: $C = \tau^{2\alpha} (L + \tau^2 I)^{-\alpha}$.
- Likelihood: $y_j = \mathrm{sign}(u_j + \eta_j)$, $\eta_j \sim N(0, \gamma^2)$.
- Misfit: $\Phi_\gamma(u; y) := -\sum_{j \in Z'} \log\big(\Psi_\gamma(y_j u_j)\big)$.
- $\Psi_\gamma$ is the cdf of $N(0, \gamma^2)$.
Rasmussen and Williams, 2006. (MIT Press)
Bertozzi, Luo, Stuart and Zygalakis, 2018. (SIAM JUQ)
[Figure: plots of $\Psi_\gamma(x)$ and $-\log \Psi_\gamma(x)$]
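For illustration, a minimal Python sketch of computing this MAP estimator; the function name, the eigendecomposition route to $C^{-1}$, and the use of a generic optimiser are my own choices:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def probit_map(L, labels, alpha=1.0, tau=0.1, gamma=0.1):
    """MAP estimator u* = argmin J(u; y) for the probit model.
    `labels` maps vertex index j -> y_j in {+1, -1} for j in Z'."""
    N = L.shape[0]
    # C^{-1} = tau^{-2 alpha} (L + tau^2 I)^{alpha}, via eigendecomposition of L.
    lam, phi = np.linalg.eigh(L)
    C_inv = phi @ np.diag(((lam + tau**2) / tau**2) ** alpha) @ phi.T

    idx = np.array(sorted(labels))
    y = np.array([labels[j] for j in idx], dtype=float)

    def J(u):
        prior = 0.5 * u @ C_inv @ u
        # Misfit: -sum_j log Psi_gamma(y_j u_j), Psi_gamma the N(0, gamma^2) cdf.
        misfit = -np.sum(norm.logcdf(y * u[idx], scale=gamma))
        return prior + misfit

    return minimize(J, np.zeros(N), method="L-BFGS-B").x
```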
Modelling Assumptions
$\tau = 0$, $u \perp \mathbf{1}$.
Modelling Assumptions
$\tau > 0$, $u \in \mathbb{R}^N$.
A Representer Theorem For Probit
$$ u^* := \operatorname*{arg\,inf}_u J(u; y). $$
Theorem: Sparse Representation Of $u^*$
Suppose $\alpha, \gamma, \tau^2 > 0$. Then $u^* \in \mathbb{R}^N$ is unique and has representation
$$ u^* = \sum_{k=1}^{J} a_k c_k, $$
where $\{c_k\}_{k=1}^{J} \subset \mathbb{R}^N$ comprise $J$ columns of $C$ and $\{a_k\}_{k=1}^{J} \subset \mathbb{R}$ solve
$$ a_j = F_j\Big( \sum_{k=1}^{J} a_k (c_j)_k \Big), \qquad F_j(t) := -\partial_t \log \Psi_\gamma(t y_j). $$
Schölkopf et al, 2001. (MIT Press)
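One way to exploit this representation numerically is a damped fixed-point iteration on the coefficients $a$, sketched below under the assumption that $\{c_k\}$ are the columns of $C$ at the labelled indices `idx`; the damping factor and iteration count are illustrative additions:

```python
import numpy as np
from scipy.stats import norm

def representer_coefficients(C, idx, y, gamma=0.1, n_iter=500, damping=0.5):
    """Fixed-point iteration a_j = F_j(sum_k a_k (c_j)_k), with
    F_j(t) = -d/dt log Psi_gamma(t y_j) = -y_j pdf(t y_j) / cdf(t y_j)."""
    Cs = C[np.ix_(idx, idx)]      # inner products (c_j)_k over labelled j, k
    a = np.zeros(len(idx))
    for _ in range(n_iter):
        t = Cs @ a                # t_j = sum_k a_k (c_j)_k
        # Stable pdf/cdf ratio via log-space subtraction.
        F = -y * np.exp(norm.logpdf(y * t, scale=gamma)
                        - norm.logcdf(y * t, scale=gamma))
        a = (1 - damping) * a + damping * F
    return a

# The MAP estimator is then recovered as u* = sum_k a_k c_k = C[:, idx] @ a.
```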
3. Posterior Consistency
Dashti et al, 2013. (Inverse Problems)
Nickl et al, 2018. (arXiv)
Posterior Consistency For Probit (MAP)
$$ y_j = \mathrm{sign}(u^\dagger_j + \eta_j), \qquad \eta_j \sim N(0, \gamma^2), $$
$$ u^* = \operatorname*{arg\,min}_{u \in \mathbb{R}^N} \frac{1}{2} \langle u, C^{-1} u \rangle + \Phi_\gamma(u; y). $$
Asymptotic Consistency Of Probit
Conditions on graph $G$, $C$ and $u^\dagger$ to ensure that, as $\gamma \downarrow 0$,
$$ \mathrm{sign}(u^*_j) \to \mathrm{sign}(u^\dagger_j), \qquad \forall j \in \{1, \dots, N\}, $$
in some sense.
Condition On Graph G
- $G = \{Z, W\}$: $Z = \bigcup_k Z_k$.
- $G_0 = \{Z, W_0\}$: $W_0 = \mathrm{diag}\{W_1, \dots, W_K\}$ and $L_0 = \mathrm{diag}\{L_1, \dots, L_K\}$.
- $G_\varepsilon = \{Z, W_\varepsilon\}$: $W_\varepsilon = W_0 + O(\varepsilon)$ and $L_\varepsilon = L_0 + O(\varepsilon)$.
Condition on graph $G_\varepsilon$: an $O(\varepsilon)$ perturbation of perfectly clustered data.
Condition on $u^\dagger$: $\mathrm{sign}(u^\dagger)$ is constant on each cluster $Z_k$.
Condition On Cε: Carefully Chosen τ
Eigenstructure Of Lε
- $L_\varepsilon \varphi_{j,\varepsilon} = \lambda_{j,\varepsilon} \varphi_{j,\varepsilon}$.
- $\lambda_{k,0} = 0$ and $\varphi_{k,0} = \frac{1}{\sqrt{N_k}} \chi_k$ for $k = 1, \dots, K$.
- $\{\chi_k\}_{k=1}^{K} \subset \mathbb{R}^N$, with $\chi_k$ the indicator of $Z_k$.

Implications For $C_0$
- $C_0 := \tau^{2\alpha} (L_0 + \tau^2 I)^{-\alpha}$.
- $C_0 = \sum_{k=1}^{K} \frac{1}{N_k} \chi_k \chi_k^T + O(\tau^{2\alpha})$.
Condition on $C_\varepsilon$: $\varepsilon = o(\tau^2)$.

Implications For $C_\varepsilon$
- $C_\varepsilon := \tau^{2\alpha} (L_\varepsilon + \tau^2 I)^{-\alpha}$.
- $C_\varepsilon = \sum_{k=1}^{K} \frac{1}{N_k} \chi_k \chi_k^T + O\big(\varepsilon + \tau^{2\alpha} + \tfrac{\varepsilon}{\tau^2}\big)$.
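A quick numerical sanity check of this expansion might look as follows; the cluster sizes, the complete-graph blocks, and the generic symmetric perturbation standing in for the $O(\varepsilon)$ graph perturbation are all illustrative assumptions:

```python
import numpy as np

def covariance(L, alpha, tau):
    """C = tau^{2 alpha} (L + tau^2 I)^{-alpha} via eigendecomposition of L."""
    lam, phi = np.linalg.eigh(L)
    return phi @ np.diag((tau**2 / (lam + tau**2)) ** alpha) @ phi.T

rng = np.random.default_rng(0)
sizes = [10, 15]                                   # illustrative cluster sizes N_k
N, K = sum(sizes), len(sizes)

# L0 = diag{L_1, ..., L_K}: block-diagonal Laplacian of two complete graphs.
blocks = [m * np.eye(m) - np.ones((m, m)) for m in sizes]
L0 = np.block([[blocks[0], np.zeros((sizes[0], sizes[1]))],
               [np.zeros((sizes[1], sizes[0])), blocks[1]]])

# L_eps = L0 + O(eps): a small symmetric perturbation.
eps = 1e-4
E = rng.standard_normal((N, N))
L_eps = L0 + eps * (E + E.T) / 2

# Leading-order prediction: sum_k (1/N_k) chi_k chi_k^T.
chi = np.zeros((N, K))
chi[:sizes[0], 0], chi[sizes[0]:, 1] = 1, 1
C_lead = sum(np.outer(chi[:, k], chi[:, k]) / sizes[k] for k in range(K))

tau = 0.1                                          # note eps = o(tau^2) here
C_eps = covariance(L_eps, alpha=1.0, tau=tau)
print(np.linalg.norm(C_eps - C_lead))              # small: O(eps + tau^2 + eps/tau^2)
```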
Consistency of Generalized Probit
$$ u^* = \operatorname*{arg\,min}_{u \in \mathbb{R}^N} \frac{1}{2} \langle u, \tau^{-2\alpha} (L_\varepsilon + \tau^2 I)^{\alpha} u \rangle - \sum_{j=1}^{J} \log \Psi_\gamma(u_j y_j). $$
Theorem: Probit Is Asymptotically Consistent
Suppose the preceding conditions on $G$, $\tau$ and $u^\dagger$ hold, and that at least one label is observed in each cluster $Z_k$. Then, for any sequence $(\varepsilon, \tau, \gamma) \downarrow 0$ along which $\varepsilon = o(\tau^2)$, it holds that
$$ \mathrm{sign}(u^*_j) \xrightarrow{\text{a.s.}} \mathrm{sign}(u^\dagger_j), \qquad \forall j \in \{1, \dots, N\}. $$
Implication: $\tau$ should be tuned; this motivates a hierarchical Bayesian treatment.
4. Numerical Results
Set-Up
- Three clusters.
- $N = 3 \times 50 = 150$ vertices.
- Two classes.
[Figure: 3D scatter plots (axes $x_1, x_2, x_3$) of the perfectly clustered graph $G_0$ and the true labels $u^\dagger$]
Accuracy of Probit
- Observe one perfect label in each cluster, no noise: $y_j = \mathrm{sign}(u^\dagger_j)$.
- Perturb the graph Laplacian: $L_\varepsilon = L_0 + \varepsilon L'$.
- Logistic cdf: $\Psi_\gamma(t) = (1 + \exp(-t/\gamma))^{-1}$.
- Probit (modified to logistic likelihood): $u^* = \operatorname*{arg\,min}_{u \in \mathbb{R}^N} J(u; y)$.
- Report the % of points mislabelled by $u^*$ (figure below).
[Figure: % of mislabelled points (vertical axis, 0 to 30) as the perturbation grows, with 3D scatter plots (axes $x_1, x_2, x_3$) of the resulting labellings]
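A sketch of how one run of this experiment might be scripted; the function and variable names are mine, with `u_dagger` holding the true latent values and `labelled_idx` the one labelled vertex per cluster:

```python
import numpy as np
from scipy.optimize import minimize

def mislabelled_pct(L_eps, u_dagger, labelled_idx, alpha=1.0, tau=0.1, gamma=0.1):
    """Fit the MAP estimator with logistic likelihood and report the
    percentage of vertices whose sign disagrees with u_dagger."""
    N = L_eps.shape[0]
    lam, phi = np.linalg.eigh(L_eps)
    C_inv = phi @ np.diag(((lam + tau**2) / tau**2) ** alpha) @ phi.T
    y = np.sign(u_dagger[labelled_idx])           # perfect labels, no noise

    def J(u):
        # Logistic cdf: -log Psi_gamma(t) = log(1 + exp(-t / gamma)).
        t = y * u[labelled_idx]
        return 0.5 * u @ C_inv @ u + np.sum(np.logaddexp(0.0, -t / gamma))

    u_star = minimize(J, np.zeros(N), method="L-BFGS-B").x
    return 100.0 * np.mean(np.sign(u_star) != np.sign(u_dagger))
```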
5. Conclusions
Summary:
- Formulate the probit MAP estimator problem.
- Probit: an optimal solution exists and is unique.
- Probit: can be solved by dimension reduction to the labelled set.
- Probit: asymptotically consistent.
- $\varepsilon = o(\tau^2)$ is crucial for success.
Generalizations:
- Multi-class using one-hot encoding (this work).
- Different weightings of the graph Laplacian (this and future work); see the formula below.
- Continuum limit (with Assad A. Oberai); see the PDE below.
Different weightings of the graph Laplacian:
$$ L := \begin{cases} D^{\frac{1-p}{q-1}} (D - W) D^{-\frac{r}{q-1}}, & \text{if } q \neq 1, \\ D - W, & \text{if } q = 1, \end{cases} $$
for parameters $p, q, r \in \mathbb{R}$, and re-weighted $W$.
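In code, this family might be realised as follows, assuming the reconstruction of the exponents above; `W` is the (possibly re-weighted) adjacency matrix:

```python
import numpy as np

def weighted_laplacian(W, p, q, r):
    """Weighted graph Laplacian L = D^{(1-p)/(q-1)} (D - W) D^{-r/(q-1)}
    for q != 1, and the unnormalised L = D - W for q == 1."""
    d = W.sum(axis=1)                 # weighted degrees
    L = np.diag(d) - W
    if q == 1:
        return L
    return np.diag(d ** ((1 - p) / (q - 1))) @ L @ np.diag(d ** (-r / (q - 1)))
```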
Continuum limit (with Assad A. Oberai): continuum using $L \approx \mathcal{L}$, where
$$ \mathcal{L} u = -\frac{1}{\rho^p} \nabla \cdot \Big( \rho^q \nabla \Big( \frac{u}{\rho^r} \Big) \Big), \quad x \in D, \qquad \frac{\partial u}{\partial n} = 0, \quad x \in \partial D. $$
Dunlop et al, 2019. (ACHA) for $(p, q, r) = (1, 2, 0)$ and $(p, q, r) = (3/2, 2, 1/2)$.
Confocal microscope image: two neurons
Bamdad Hosseini, Assad A. Oberai, Andrew M. Stuart (ongoing)
Source: http://cellimagelibrary.org/images/6195
Confocal microscope image: First eigenfunction
Confocal microscope image: Second eigenfunction
References
[1] F. Hoffmann, B. Hosseini, Z. Ren, A.M. Stuart. Consistency of semi-supervised learning algorithms on graphs: Probit and one-hot methods. arXiv:1906.07658.
[2] B. Schölkopf and A.J. Smola. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT Press, 2001.
[3] C.E. Rasmussen and C.K. Williams. Gaussian processes for machine learning. MIT Press, 2006.
[4] A.L. Bertozzi, X. Luo, A.M. Stuart and K.C. Zygalakis. Uncertainty quantification in graph-based classification of high dimensional data. SIAM JUQ, 6 (2018).
[5] M. Dashti, K.J.H. Law, A.M. Stuart, J. Voss. MAP estimators and their consistency in Bayesian nonparametric inverse problems. Inverse Problems, 29 (2013).
[6] R. Nickl, S. van de Geer, S. Wang. Convergence rates for penalised least squares estimators in PDE-constrained regression problems. arXiv:1809.08818.
[7] M.M. Dunlop, D. Slepčev, A.M. Stuart, M. Thorpe. Large data and zero noise limits of graph-based semi-supervised learning algorithms. Applied and Computational Harmonic Analysis, 2019.
SSL for UQ: Five clusters & two labels
Bamdad Hosseini, Assad A. Oberai, Andrew M. Stuart (ongoing)
SSL for UQ: Fiedler vector