TRANSCRIPT
Riemannian adaptive stochastic gradient algorithms on matrix manifolds
Hiroyuki Kasai (The University of Electro-Communications, Japan), Pratik Jawanpuria (Microsoft, India), and Bamdev Mishra (Microsoft, India)
Problem of interest
• Consider the problem [1] of min_{x ∈ M} f(x), where M is a Riemannian manifold.
• Elements of M are represented as matrices of size n × r.
• Promising applications include, e.g., matrix/tensor completion and subspace tracking.
Contributions
• Propose modeling adaptive weight matrices for the row and column subspaces, exploiting the geometry of the manifold.
• Develop efficient Riemannian adaptive stochastic gradient algorithms (RASA).
• Achieve a convergence rate of order O(log(T)/√T) for non-convex stochastic optimization under mild conditions.
• Show the efficiency of RASA through numerical experiments on several applications.
Preliminaries
• Riemannian stochastic gradient update:
  (RSGD)  x_{t+1} = R_{x_t}(−α_t grad f_t(x_t)),
  where R_{x_t} is the retraction and grad f_t(x_t) is the Riemannian stochastic gradient.
• R_x(ζ) maps ζ ∈ T_xM (tangent space) onto M.
• When M = R^d with the standard Euclidean inner product, the RSGD update reduces to
  (SGD)  x_{t+1} = x_t − α_t ∇f_t(x_t).
• Euclidean adaptive stochastic gradient updates rescale the learning rate based on past gradients as
  x_{t+1} = x_t − α_t V_t^{−1/2} ∇f_t(x_t).
• V_t = Diag(v_t) is a diagonal matrix such as
  (AdaGrad)  v_t = ∑_{k=1}^{t} ∇f_k(x_k) ◦ ∇f_k(x_k),
  (RMSProp)  v_t = β v_{t−1} + (1 − β) ∇f_t(x_t) ◦ ∇f_t(x_t).
  (A minimal MATLAB sketch of an RSGD step follows this list.)
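As a concrete illustration of the RSGD update above, the following minimal MATLAB sketch runs RSGD on the unit sphere S^{n−1} (tangent-space projection followed by a normalization retraction) for a stochastic leading-eigenvector, PCA-type objective. The synthetic data, the objective, and the step-size schedule are illustrative assumptions, not the poster's experimental setup.

% Minimal RSGD sketch on the unit sphere S^{n-1} = {x in R^n : ||x||_2 = 1}.
% Illustrative objective: f(x) = -(1/2N) * sum_i (z_i' * x)^2, whose minimizer
% is the leading eigenvector of the sample covariance (a PCA-type problem).
n = 50;  N = 1000;
Z = randn(n, N);                        % synthetic data samples (assumption)
x = randn(n, 1);  x = x / norm(x);      % initialize on the manifold
alpha0 = 1e-2;                          % base step size (assumption)
for t = 1:5000
    z  = Z(:, randi(N));                % draw one sample
    eg = -(z' * x) * z;                 % Euclidean stochastic gradient of f_t
    rg = eg - (x' * eg) * x;            % Riemannian gradient: project onto T_x S^{n-1}
    alpha = alpha0 / sqrt(t);           % decaying step size alpha_t
    x  = x - alpha * rg;                % move along -alpha_t * grad f_t(x_t)
    x  = x / norm(x);                   % retraction R_x: renormalize onto the sphere
end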
MATLAB source code
The code, which is compliant with Manopt (https://www.manopt.org/), is available at https://github.com/hiroyuki-kasai/RSOpt/.
RASA: Riemannian Adaptive Stochastic gradient Algorithms
Exploit the matrix structure of the Riemannian gradient G_t (= grad f_t(x_t) ∈ R^{n×r}) via separate adaptive weight matrices corresponding to the row subspace L_t and the column subspace R_t.
cf. [2], which views G_t as a vector in R^{nr}.
• Exponentially weighted matrices:
  L_t = β L_{t−1} + (1 − β) G_t G_t^⊤ / r,  (∈ R^{n×n})
  R_t = β R_{t−1} + (1 − β) G_t^⊤ G_t / n,  (∈ R^{r×r})
  (β ∈ (0, 1): hyper-parameter).
• Adaptive Riemannian gradient G̃_t:
  G̃_t = L_t^{−1/4} G_t R_t^{−1/4}.
  (A MATLAB sketch of this full-matrix scaling is given after the diagonal modeling below.)
• Full-matrix update:
  x_{t+1} = R_{x_t}(−α_t P_{x_t}(G̃_t)).
• P_x, a linear operator, projects onto the tangent space T_xM.
• Diagonal modeling of {L_t, R_t} as vectors {l_t, r_t}:
  l_t = β l_{t−1} + (1 − β) diag(G_t G_t^⊤),  (∈ R^n)
  r_t = β r_{t−1} + (1 − β) diag(G_t^⊤ G_t).  (∈ R^r)
• diag(·) returns the diagonal vector of a square matrix.
• Maximum operator for convergence:
  l̂_t = max(l̂_{t−1}, l_t),  r̂_t = max(r̂_{t−1}, r_t).
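As referenced above, here is a minimal MATLAB sketch of the full-matrix scaling: it forms L_t and R_t for a single gradient G_t and computes G̃_t = L_t^{−1/4} G_t R_t^{−1/4} via a symmetric eigendecomposition. The small shift epsv before the fractional power is our numerical-stability choice, not part of the poster's formulation.

% Illustrative full-matrix RASA scaling of one Riemannian gradient G (n x r).
n = 20;  r = 5;  beta = 0.99;  epsv = 1e-12;
G  = randn(n, r);                            % stand-in for G_t = grad f_t(x_t)
Lw = zeros(n);  Rw = zeros(r);               % L_{t-1} and R_{t-1} (zero init here)
Lw = beta * Lw + (1 - beta) * (G * G') / r;  % row-subspace weight matrix L_t
Rw = beta * Rw + (1 - beta) * (G' * G) / n;  % column-subspace weight matrix R_t
Gt = frac_pow(Lw, -1/4, epsv) * G * frac_pow(Rw, -1/4, epsv);  % adaptive gradient G~_t

function P = frac_pow(S, p, epsv)
% Fractional power of a symmetric PSD matrix via eigendecomposition;
% epsv is a small shift for numerical stability (our assumption).
[V, D] = eig((S + S') / 2);
P = V * diag((max(diag(D), 0) + epsv) .^ p) * V';
end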
Alg.1: RASA
Require: Step sizes {α_t}_{t=1}^T, hyper-parameter β.
1: Initialize x_1 ∈ M, l_0 = l̂_0 = 0_n, r_0 = r̂_0 = 0_r.
2: for t = 1, 2, . . . , T do
3:   Compute the Riemannian stochastic gradient G_t = grad f_t(x_t).
4:   Update l_t = β l_{t−1} + (1 − β) diag(G_t G_t^⊤)/r.
5:   Calculate l̂_t = max(l̂_{t−1}, l_t).
6:   Update r_t = β r_{t−1} + (1 − β) diag(G_t^⊤ G_t)/n.
7:   Calculate r̂_t = max(r̂_{t−1}, r_t).
8:   x_{t+1} = R_{x_t}(−α_t P_{x_t}(Diag(l̂_t^{−1/4}) G_t Diag(r̂_t^{−1/4}))).
9: end for
• RASA variants (a MATLAB sketch of one RASA-LR iteration follows this list):
  • RASA-L adapts only the row subspace.
  • RASA-R adapts only the column subspace.
  • RASA-LR adapts both the row and column subspaces.
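The following MATLAB sketch spells out one iteration of Alg.1 (RASA-LR, diagonal modeling). The Stiefel-manifold tangent projection and QR retraction, the random stand-in gradient, and the small epsv guard against zero entries are illustrative assumptions; the released RSOpt/Manopt code should be consulted for the actual implementation.

% One iteration of Alg.1 (RASA-LR) on the Stiefel manifold
% St(n, r) = {X in R^{n x r} : X'X = I}; the manifold tools below are
% standard choices, used here only for illustration.
n = 100;  r = 5;  beta = 0.99;  t = 1;  alpha = 1e-2 / sqrt(t);  epsv = 1e-16;
X  = qr_retr(randn(n, r), zeros(n, r));       % x_1 on the manifold
l  = zeros(n, 1);  lh = zeros(n, 1);          % l_0 and l-hat_0
rv = zeros(r, 1);  rh = zeros(r, 1);          % r_0 and r-hat_0

EG = randn(n, r);                             % stand-in Euclidean stochastic gradient
G  = proj_st(X, EG);                          % step 3: G_t = grad f_t(x_t)
l  = beta * l  + (1 - beta) * diag(G * G') / r;   % step 4
lh = max(lh, l);                                  % step 5
rv = beta * rv + (1 - beta) * diag(G' * G) / n;   % step 6
rh = max(rh, rv);                                 % step 7
D  = diag((lh + epsv) .^ (-1/4)) * G * diag((rh + epsv) .^ (-1/4));
X  = qr_retr(X, -alpha * proj_st(X, D));          % step 8

function U = proj_st(X, Z)
% Orthogonal projection onto the tangent space of St(n, r) at X.
S = X' * Z;  U = Z - X * (S + S') / 2;
end
function Y = qr_retr(X, V)
% QR-based retraction on St(n, r), with a sign fix for a unique factor.
[Q, Rq] = qr(X + V, 0);
d = sign(diag(Rq));  d(d == 0) = 1;  Y = Q * diag(d);
end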
Convergence rate analysis
Extend existing convergence analyses in the Euclidean space, e.g., [3], to the Riemannian setting. Additionally, we need to take care of
(i) the upper bound of v̂_t (Lem.4.3) for the update, and
(ii) the projection P_x of the weighted gradient onto T_xM.
• For the analysis, we use the following additional notation:
  • x_{t+1} = R_{x_t}(−α_t P_{x_t}(V̂_t^{−1/2} g_t(x_t))) for step 8 in Alg.1,
  • V̂_t = Diag(v̂_t), where v̂_t = r̂_t^{1/2} ⊗ l̂_t^{1/2}, and
  • g_t(x) as the vectorized representation of grad f_t(x).
  (The equivalence of this vectorized form with step 8 is checked numerically below.)
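As referenced above, a quick MATLAB check (illustrative; random strictly positive accumulators) confirms that step 8 of Alg.1 and the vectorized form with v̂_t = r̂_t^{1/2} ⊗ l̂_t^{1/2} coincide, via the identity vec(AXB) = (B^⊤ ⊗ A) vec(X).

% Numerical check that Diag(l-hat^{-1/4}) G Diag(r-hat^{-1/4}) equals the
% vectorized update g ./ sqrt(v-hat) with v-hat = kron(r-hat.^(1/2), l-hat.^(1/2)).
n = 6;  r = 3;
G  = randn(n, r);
lh = rand(n, 1) + 0.1;  rh = rand(r, 1) + 0.1;     % strictly positive accumulators
M  = diag(lh .^ (-1/4)) * G * diag(rh .^ (-1/4));  % matrix form (step 8)
vh = kron(rh .^ (1/2), lh .^ (1/2));               % v-hat (column-major ordering)
disp(norm(M(:) - G(:) ./ sqrt(vh)))                % ~ 0 up to round-off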
• Definition, assumptions, and lemma:
Def.4.1 (Upper-Hessian bounded). There exists a constant L > 0 such that d²f(R_x(tη))/dt² ≤ L for x ∈ U ⊂ M and η ∈ T_xM with ∥η∥_x = 1, and for all t such that R_x(τη) ∈ U for τ ∈ [0, t].
Asm.1.1. f is continuously differentiable and is lower bounded, i.e., f(x*) > −∞.
Asm.1.2. f has an H-bounded Riemannian stochastic gradient, i.e., ∥grad f_i(x)∥_F ≤ H or ∥g_i(x)∥_2 ≤ H.
Asm.1.3. f is upper-Hessian bounded (Def.4.1).
Lem.4.2. Under Asm.1 and L > 0 in Def.4.1, we have f(z) ≤ f(x) + ⟨grad f(x), ξ⟩_2 + (L/2)∥ξ∥²_2 for x ∈ M, where ξ ∈ T_xM and R_x(ξ) = z. (A one-line derivation sketch is given below.)
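Lem.4.2 follows from Def.4.1 by a standard second-order Taylor argument along the retraction curve; a sketch, assuming the first-order retraction property DR_x(0) = id (so that (f ∘ R_x)'(0) = ⟨grad f(x), ·⟩_2): for ξ ≠ 0, set η = ξ/∥ξ∥_2 and φ(t) = f(R_x(tη)), so that φ''(t) ≤ L, and then
\[
f(z) \;=\; \varphi(\|\xi\|_2)
\;\le\; \varphi(0) + \|\xi\|_2\,\varphi'(0) + \tfrac{L}{2}\,\|\xi\|_2^2
\;=\; f(x) + \langle \operatorname{grad} f(x), \xi\rangle_2 + \tfrac{L}{2}\,\|\xi\|_2^2 .
\]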
• Obtained results:
Thm.4.4. Let {x_t} and {v̂_t} be the sequences generated by Alg.1. Then, under Asm.1, we have

  E[ ∑_{t=2}^{T} α_{t−1} ⟨g(x_t), g(x_t)/√v̂_{t−1}⟩_2 ]
  ≤ C + E[ (L/2) ∑_{t=1}^{T} ∥α_t g_t(x_t)/√v̂_t∥²_2 + H² ∑_{t=2}^{T} ∥α_t/√v̂_t − α_{t−1}/√v̂_{t−1}∥_1 ],

where C is a constant term independent of T.
Cor.4.5. Let α_t = 1/√t and let min_{j∈[d]} √((v̂_1)_j) be lower-bounded by a constant c > 0, where d is the dimension of M. Then, under Asm.1, the output x_t of Alg.1 satisfies

  min_{t∈[2,...,T]} E∥grad f(x_t)∥²_F ≤ (1/√(T−1)) (Q_1 + Q_2 log(T)),

where Q_2 = LH³/(2c²) and Q_1 = Q_2 + 2dH³/c + H·E[f(x_1) − f(x*)].
(A short sketch of where the log(T) factor arises is given below.)
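A brief sketch of how Cor.4.5 follows from Thm.4.4, assuming (as in the corollary) (v̂_t)_j ≥ (v̂_1)_j ≥ c² by the max operator and the H-bound of Asm.1.2: with α_t = 1/√t, the first sum on the right-hand side of Thm.4.4 is where the log(T) factor originates,
\[
\frac{L}{2}\sum_{t=1}^{T}\Big\|\frac{\alpha_t\, g_t(x_t)}{\sqrt{\hat v_t}}\Big\|_2^2
\;\le\; \frac{L}{2}\sum_{t=1}^{T}\frac{\|g_t(x_t)\|_2^2}{c^2\, t}
\;\le\; \frac{L H^2}{2c^2}\,(1+\log T),
\]
while the second sum telescopes coordinate-wise because α_t/√((v̂_t)_j) is non-increasing. Combining this with ∑_{t=2}^{T} α_{t−1} ≥ √(T−1) on the left-hand side and the upper bound of v̂_t (Lem.4.3) yields the (Q_1 + Q_2 log(T))/√(T−1) form above.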
Numerical evaluations
• PCA problem
[Figure: optimality gap versus number of iterations; compared algorithms (best-tuned step sizes in parentheses in the original plots): RSGD, cRMSProp, cRMSProp-M, Radagrad, Radam, Ramsgrad, RASA-R, RASA-L, RASA-LR.]
(a) Case P1: Synthetic dataset. (b) Case P2: MNIST dataset. (c) Case P3: COIL100 dataset.
• Matrix completion problem
[Figure: root mean squared error on the training and test sets versus number of iterations; compared algorithms (step sizes in parentheses in the original plots): RSGD, Radagrad, Radam, Ramsgrad, RASA-LR.]
(a) MovieLens-1M (train). (b) MovieLens-1M (test). (c) MovieLens-10M (train). (d) MovieLens-10M (test).
• ICA problem
[Figure: relative optimality gap versus number of iterations; compared algorithms (best-tuned step sizes in parentheses in the original plots): RSGD, cRMSProp, cRMSProp-M, Radagrad, Radam, Ramsgrad, RASA-R, RASA-L, RASA-LR.]
(a) Case I1: YaleB dataset. (b) Case I2: COIL100 dataset.
References
[1] P.-A. Absil, R. Mahony, and R. Sepulchre, Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2008.
[2] S. K. Roy, Z. Mhammedi, and M. Harandi, Geometry aware constrained optimization techniques for deep learning, CVPR, 2018.
[3] X. Chen, S. Liu, R. Sun, and M. Hong, On the convergence of a class of Adam-type algorithms for non-convex optimization, ICLR, 2019.