TRANSCRIPT
Riemannian adaptive stochastic gradient algorithms on matrix manifolds
Hiroyuki Kasai (The University of Electro-Communications, Japan), Pratik Jawanpuria (Microsoft, India), and Bamdev Mishra (Microsoft, India)
Problem of interest
• Consider the problem [1] of min_{x ∈ M} f(x), where M is a Riemannian manifold.
• Elements of M are represented as matrices of size n × r.
• Promising applications include, e.g., matrix/tensor completion and subspace tracking.
Contributions
• Propose modeling adaptive weight matrices for the row and column subspaces, exploiting the geometry of the manifold.
• Develop efficient Riemannian adaptive stochastic gradient algorithms (RASA).
• Achieve a convergence rate of order O(log(T)/√T) for non-convex stochastic optimization under mild conditions.
• Show the efficiency of RASA through numerical experiments on several applications.
Preliminaries
• Riemannian stochastic gradient update:
  (RSGD)  x_{t+1} = R_{x_t}(−α_t grad f_t(x_t)),
  where R_{x_t} is the retraction and grad f_t(x_t) is the Riemannian stochastic gradient.
• R_x(ζ) maps ζ ∈ T_xM (tangent space) onto M.
• When M = R^d with the standard Euclidean inner product, the RSGD update reduces to
  (SGD)  x_{t+1} = x_t − α_t ∇f_t(x_t).
• Euclidean adaptive stochastic gradient updates rescale the learning rate based on past gradients as
  x_{t+1} = x_t − α_t V_t^{−1/2} ∇f_t(x_t).
• V_t = Diag(v_t) is a diagonal matrix such as
  (AdaGrad)  v_t = ∑_{k=1}^{t} ∇f_k(x_k) ◦ ∇f_k(x_k),
  (RMSProp)  v_t = β v_{t−1} + (1 − β) ∇f_t(x_t) ◦ ∇f_t(x_t).
  (A minimal MATLAB sketch of an RSGD step follows this list.)
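As a concrete illustration of the RSGD update above, the following minimal MATLAB sketch runs RSGD on the unit sphere S^{n−1} (tangent-space projection followed by a normalization retraction) for a stochastic leading-eigenvector, PCA-type objective. The synthetic data, the objective, and the step-size schedule are illustrative assumptions, not the poster's experimental setup.

% Minimal RSGD sketch on the unit sphere S^{n-1} = {x in R^n : ||x||_2 = 1}.
% Illustrative objective: f(x) = -(1/2N) * sum_i (z_i' * x)^2, whose minimizer
% is the leading eigenvector of the sample covariance (a PCA-type problem).
n = 50;  N = 1000;
Z = randn(n, N);                        % synthetic data samples (assumption)
x = randn(n, 1);  x = x / norm(x);      % initialize on the manifold
alpha0 = 1e-2;                          % base step size (assumption)
for t = 1:5000
    z  = Z(:, randi(N));                % draw one sample
    eg = -(z' * x) * z;                 % Euclidean stochastic gradient of f_t
    rg = eg - (x' * eg) * x;            % Riemannian gradient: project onto T_x S^{n-1}
    alpha = alpha0 / sqrt(t);           % decaying step size alpha_t
    x  = x - alpha * rg;                % move along -alpha_t * grad f_t(x_t)
    x  = x / norm(x);                   % retraction R_x: renormalize onto the sphere
end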
MATLAB source code
The code, which is compliant with Manopt (https://www.manopt.org/), is available at https://github.com/hiroyuki-kasai/RSOpt/.
RASA: Riemannian Adaptive Stochastic gradient Algorithms
Exploit the matrix structure of the Riemannian gradient G_t (= grad f_t(x_t) ∈ R^{n×r}) via separate adaptive weight matrices corresponding to the row subspace L_t and the column subspace R_t.
cf. [2], which views G_t as a vector in R^{nr}.
• Exponentially weighted matrices:
  L_t = β L_{t−1} + (1 − β) G_t G_t^⊤ / r,  (∈ R^{n×n})
  R_t = β R_{t−1} + (1 − β) G_t^⊤ G_t / n,  (∈ R^{r×r})
  (β ∈ (0, 1): hyper-parameter).
• Adaptive Riemannian gradient G̃_t:
  G̃_t = L_t^{−1/4} G_t R_t^{−1/4}.
  (A MATLAB sketch of this full-matrix scaling is given after the diagonal modeling below.)
• Full-matrix update:
  x_{t+1} = R_{x_t}(−α_t P_{x_t}(G̃_t)).
• P_x, a linear operator, projects onto the tangent space T_xM.
• Diagonal modeling of {L_t, R_t} as vectors {l_t, r_t}:
  l_t = β l_{t−1} + (1 − β) diag(G_t G_t^⊤),  (∈ R^n)
  r_t = β r_{t−1} + (1 − β) diag(G_t^⊤ G_t).  (∈ R^r)
• diag(·) returns the diagonal vector of a square matrix.
• Maximum operator for convergence:
  l̂_t = max(l̂_{t−1}, l_t),  r̂_t = max(r̂_{t−1}, r_t).
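As referenced above, here is a minimal MATLAB sketch of the full-matrix scaling: it forms L_t and R_t for a single gradient G_t and computes G̃_t = L_t^{−1/4} G_t R_t^{−1/4} via a symmetric eigendecomposition. The small shift epsv before the fractional power is our numerical-stability choice, not part of the poster's formulation.

% Illustrative full-matrix RASA scaling of one Riemannian gradient G (n x r).
n = 20;  r = 5;  beta = 0.99;  epsv = 1e-12;
G  = randn(n, r);                            % stand-in for G_t = grad f_t(x_t)
Lw = zeros(n);  Rw = zeros(r);               % L_{t-1} and R_{t-1} (zero init here)
Lw = beta * Lw + (1 - beta) * (G * G') / r;  % row-subspace weight matrix L_t
Rw = beta * Rw + (1 - beta) * (G' * G) / n;  % column-subspace weight matrix R_t
Gt = frac_pow(Lw, -1/4, epsv) * G * frac_pow(Rw, -1/4, epsv);  % adaptive gradient G~_t

function P = frac_pow(S, p, epsv)
% Fractional power of a symmetric PSD matrix via eigendecomposition;
% epsv is a small shift for numerical stability (our assumption).
[V, D] = eig((S + S') / 2);
P = V * diag((max(diag(D), 0) + epsv) .^ p) * V';
end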
Alg.1: RASA
Require: Step sizes {α_t}_{t=1}^T, hyper-parameter β.
1: Initialize x_1 ∈ M, l_0 = l̂_0 = 0_n, r_0 = r̂_0 = 0_r.
2: for t = 1, 2, . . . , T do
3:   Compute the Riemannian stochastic gradient G_t = grad f_t(x_t).
4:   Update l_t = β l_{t−1} + (1 − β) diag(G_t G_t^⊤)/r.
5:   Calculate l̂_t = max(l̂_{t−1}, l_t).
6:   Update r_t = β r_{t−1} + (1 − β) diag(G_t^⊤ G_t)/n.
7:   Calculate r̂_t = max(r̂_{t−1}, r_t).
8:   x_{t+1} = R_{x_t}(−α_t P_{x_t}(Diag(l̂_t^{−1/4}) G_t Diag(r̂_t^{−1/4}))).
9: end for
• RASA variants (a MATLAB sketch of one RASA-LR iteration follows this list):
  • RASA-L adapts only the row subspace.
  • RASA-R adapts only the column subspace.
  • RASA-LR adapts both the row and column subspaces.
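The following MATLAB sketch spells out one iteration of Alg.1 (RASA-LR, diagonal modeling). The Stiefel-manifold tangent projection and QR retraction, the random stand-in gradient, and the small epsv guard against zero entries are illustrative assumptions; the released RSOpt/Manopt code should be consulted for the actual implementation.

% One iteration of Alg.1 (RASA-LR) on the Stiefel manifold
% St(n, r) = {X in R^{n x r} : X'X = I}; the manifold tools below are
% standard choices, used here only for illustration.
n = 100;  r = 5;  beta = 0.99;  t = 1;  alpha = 1e-2 / sqrt(t);  epsv = 1e-16;
X  = qr_retr(randn(n, r), zeros(n, r));       % x_1 on the manifold
l  = zeros(n, 1);  lh = zeros(n, 1);          % l_0 and l-hat_0
rv = zeros(r, 1);  rh = zeros(r, 1);          % r_0 and r-hat_0

EG = randn(n, r);                             % stand-in Euclidean stochastic gradient
G  = proj_st(X, EG);                          % step 3: G_t = grad f_t(x_t)
l  = beta * l  + (1 - beta) * diag(G * G') / r;   % step 4
lh = max(lh, l);                                  % step 5
rv = beta * rv + (1 - beta) * diag(G' * G) / n;   % step 6
rh = max(rh, rv);                                 % step 7
D  = diag((lh + epsv) .^ (-1/4)) * G * diag((rh + epsv) .^ (-1/4));
X  = qr_retr(X, -alpha * proj_st(X, D));          % step 8

function U = proj_st(X, Z)
% Orthogonal projection onto the tangent space of St(n, r) at X.
S = X' * Z;  U = Z - X * (S + S') / 2;
end
function Y = qr_retr(X, V)
% QR-based retraction on St(n, r), with a sign fix for a unique factor.
[Q, Rq] = qr(X + V, 0);
d = sign(diag(Rq));  d(d == 0) = 1;  Y = Q * diag(d);
end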
Convergence rate analysis
Extend existing convergence analyses in the Euclidean space, e.g., [3], to the Riemannian setting. Additionally, we need to take care of
(i) the upper bound of v̂_t (Lem.4.3) for the update, and
(ii) the projection P_x of the weighted gradient onto T_xM.
• For the analysis, we use the following additional notation:
  • x_{t+1} = R_{x_t}(−α_t P_{x_t}(V̂_t^{−1/2} g_t(x_t))) for step 8 in Alg.1,
  • V̂_t = Diag(v̂_t), where v̂_t = r̂_t^{1/2} ⊗ l̂_t^{1/2}, and
  • g_t(x) as the vectorized representation of grad f_t(x).
  (The equivalence of this vectorized form with step 8 is checked numerically below.)
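As referenced above, a quick MATLAB check (illustrative; random strictly positive accumulators) confirms that step 8 of Alg.1 and the vectorized form with v̂_t = r̂_t^{1/2} ⊗ l̂_t^{1/2} coincide, via the identity vec(AXB) = (B^⊤ ⊗ A) vec(X).

% Numerical check that Diag(l-hat^{-1/4}) G Diag(r-hat^{-1/4}) equals the
% vectorized update g ./ sqrt(v-hat) with v-hat = kron(r-hat.^(1/2), l-hat.^(1/2)).
n = 6;  r = 3;
G  = randn(n, r);
lh = rand(n, 1) + 0.1;  rh = rand(r, 1) + 0.1;     % strictly positive accumulators
M  = diag(lh .^ (-1/4)) * G * diag(rh .^ (-1/4));  % matrix form (step 8)
vh = kron(rh .^ (1/2), lh .^ (1/2));               % v-hat (column-major ordering)
disp(norm(M(:) - G(:) ./ sqrt(vh)))                % ~ 0 up to round-off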
• Definition, assumptions, and lemma:
Def.4.1 (Upper-Hessian bounded). There exists a constant L > 0 such that d²f(R_x(tη))/dt² ≤ L for x ∈ U ⊂ M and η ∈ T_xM with ∥η∥_x = 1, and for all t such that R_x(τη) ∈ U for τ ∈ [0, t].
Asm.1.1. f is continuously differentiable and is lower bounded, i.e., f(x*) > −∞.
Asm.1.2. f has an H-bounded Riemannian stochastic gradient, i.e., ∥grad f_i(x)∥_F ≤ H or ∥g_i(x)∥_2 ≤ H.
Asm.1.3. f is upper-Hessian bounded (Def.4.1).
Lem.4.2. Under Asm.1 and L > 0 in Def.4.1, we have f(z) ≤ f(x) + ⟨grad f(x), ξ⟩_2 + (L/2)∥ξ∥²_2 for x ∈ M, where ξ ∈ T_xM and R_x(ξ) = z. (A one-line derivation sketch is given below.)
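Lem.4.2 follows from Def.4.1 by a standard second-order Taylor argument along the retraction curve; a sketch, assuming the first-order retraction property DR_x(0) = id (so that (f ∘ R_x)'(0) = ⟨grad f(x), ·⟩_2): for ξ ≠ 0, set η = ξ/∥ξ∥_2 and φ(t) = f(R_x(tη)), so that φ''(t) ≤ L, and then
\[
f(z) \;=\; \varphi(\|\xi\|_2)
\;\le\; \varphi(0) + \|\xi\|_2\,\varphi'(0) + \tfrac{L}{2}\,\|\xi\|_2^2
\;=\; f(x) + \langle \operatorname{grad} f(x), \xi\rangle_2 + \tfrac{L}{2}\,\|\xi\|_2^2 .
\]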
• Obtained results:
Thm.4.4. Let {x_t} and {v̂_t} be the sequences generated by Alg.1. Then, under Asm.1, we have

  E[ ∑_{t=2}^{T} α_{t−1} ⟨g(x_t), g(x_t)/√v̂_{t−1}⟩_2 ]
  ≤ C + E[ (L/2) ∑_{t=1}^{T} ∥α_t g_t(x_t)/√v̂_t∥²_2 + H² ∑_{t=2}^{T} ∥α_t/√v̂_t − α_{t−1}/√v̂_{t−1}∥_1 ],

where C is a constant term independent of T.
Cor.4.5. Let α_t = 1/√t and let min_{j∈[d]} √((v̂_1)_j) be lower-bounded by a constant c > 0, where d is the dimension of M. Then, under Asm.1, the output x_t of Alg.1 satisfies

  min_{t∈[2,...,T]} E∥grad f(x_t)∥²_F ≤ (1/√(T−1)) (Q_1 + Q_2 log(T)),

where Q_2 = LH³/(2c²) and Q_1 = Q_2 + 2dH³/c + H·E[f(x_1) − f(x*)].
(A short sketch of where the log(T) factor arises is given below.)
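A brief sketch of how Cor.4.5 follows from Thm.4.4, assuming (as in the corollary) (v̂_t)_j ≥ (v̂_1)_j ≥ c² by the max operator and the H-bound of Asm.1.2: with α_t = 1/√t, the first sum on the right-hand side of Thm.4.4 is where the log(T) factor originates,
\[
\frac{L}{2}\sum_{t=1}^{T}\Big\|\frac{\alpha_t\, g_t(x_t)}{\sqrt{\hat v_t}}\Big\|_2^2
\;\le\; \frac{L}{2}\sum_{t=1}^{T}\frac{\|g_t(x_t)\|_2^2}{c^2\, t}
\;\le\; \frac{L H^2}{2c^2}\,(1+\log T),
\]
while the second sum telescopes coordinate-wise because α_t/√((v̂_t)_j) is non-increasing. Combining this with ∑_{t=2}^{T} α_{t−1} ≥ √(T−1) on the left-hand side and the upper bound of v̂_t (Lem.4.3) yields the (Q_1 + Q_2 log(T))/√(T−1) form above.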
Numerical evaluations
• PCA problem
[Figure: optimality gap versus number of iterations; compared algorithms (best-tuned step sizes in parentheses in the original plots): RSGD, cRMSProp, cRMSProp-M, Radagrad, Radam, Ramsgrad, RASA-R, RASA-L, RASA-LR.]
(a) Case P1: Synthetic dataset. (b) Case P2: MNIST dataset. (c) Case P3: COIL100 dataset.
• Matrix completion problem
[Figure: root mean squared error on the training and test sets versus number of iterations; compared algorithms (step sizes in parentheses in the original plots): RSGD, Radagrad, Radam, Ramsgrad, RASA-LR.]
(a) MovieLens-1M (train). (b) MovieLens-1M (test). (c) MovieLens-10M (train). (d) MovieLens-10M (test).
• ICA problem
[Figure: relative optimality gap versus number of iterations; compared algorithms (best-tuned step sizes in parentheses in the original plots): RSGD, cRMSProp, cRMSProp-M, Radagrad, Radam, Ramsgrad, RASA-R, RASA-L, RASA-LR.]
(a) Case I1: YaleB dataset. (b) Case I2: COIL100 dataset.
References
[1] P.-A. Absil, R. Mahony, and R. Sepulchre, Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2008.
[2] S. K. Roy, Z. Mhammedi, and M. Harandi, Geometry aware constrained optimization techniques for deep learning, CVPR, 2018.
[3] X. Chen, S. Liu, R. Sun, and M. Hong, On the convergence of a class of Adam-type algorithms for non-convex optimization, ICLR, 2019.