
Page 1:

Ergodic Subgradient Descent

John Duchi, Alekh Agarwal, Mikael Johansson, Michael Jordan

University of California, Berkeley and Royal Institute of Technology (KTH), Sweden

Allerton Conference, September 2011

Page 2:

Stochastic Gradient Descent

Goal: solve

    minimize f(x) subject to x ∈ X.

Repeat: at iteration t

◮ Receive stochastic gradient g(t):

    E[g(t) | g(1), ..., g(t−1)] = ∇f(x(t))    (unbiased)

◮ Update

    x(t+1) = x(t) − α(t) g(t)
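
A minimal sketch of this projected stochastic gradient loop. The Euclidean ball as the constraint set X, the projection radius, and the α(t) ∝ 1/√t schedule are illustrative assumptions, not specifics from this slide:

    import numpy as np

    def project_ball(x, radius=1.0):
        """Euclidean projection onto {x : ||x||_2 <= radius} (an assumed choice of X)."""
        norm = np.linalg.norm(x)
        return x if norm <= radius else x * (radius / norm)

    def sgd(stochastic_grad, x0, T, alpha0=1.0):
        """Projected SGD: x(t+1) = Proj_X[x(t) - alpha(t) g(t)]."""
        x = np.asarray(x0, dtype=float)
        for t in range(1, T + 1):
            g = stochastic_grad(x)          # E[g(t) | past] = grad f(x(t))  (unbiased)
            alpha = alpha0 / np.sqrt(t)     # assumed step size alpha(t) ~ 1/sqrt(t)
            x = project_ball(x - alpha * g)
        return x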

Page 4:

Peer-to-peer distributed optimization

Setup (Johansson et al. 09): n processors, each possesses a function f_i(x). Objective:

    f(x) = (1/n) ∑_{i=1}^{n} f_i(x)

Token walks randomly according to transition matrix P; processor i(t) has the token at time t. Update:

    x(t+1) = Proj_X( x(t) − α ∇f_{i(t)}(x(t)) )

[Figure: network of n = 6 processors holding local functions f_1(x), ..., f_6(x)]

Unbiased gradient estimate: choose i ∈ {1, ..., n} uniformly at random,

    g(t) = ∇f_i(x(t))
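
A small simulation sketch of the token-passing scheme above. The quadratic local functions, the lazy random-walk matrix P on a ring, the step size, and the projection radius are all illustrative assumptions rather than details from the slide:

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, T, alpha = 6, 3, 5000, 0.01

    # Assumed local objectives f_i(x) = 0.5 * ||x - c_i||^2, so grad f_i(x) = x - c_i.
    centers = rng.normal(size=(n, d))
    grad_f = lambda i, x: x - centers[i]

    # Random-walk transition matrix P on a ring: the token stays or moves to a neighbor.
    P = np.zeros((n, n))
    for i in range(n):
        P[i, i] = P[i, (i - 1) % n] = P[i, (i + 1) % n] = 1.0 / 3.0

    def project(x, radius=10.0):               # Proj_X onto a Euclidean ball (assumed X)
        nrm = np.linalg.norm(x)
        return x if nrm <= radius else x * (radius / nrm)

    x, i = np.zeros(d), 0                       # processor 0 holds the token initially
    for t in range(T):
        x = project(x - alpha * grad_f(i, x))   # local update by processor i(t)
        i = rng.choice(n, p=P[i])               # token walks according to P

    print("final iterate:", x, "  minimizer of (1/n) sum f_i:", centers.mean(axis=0))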

Page 11:

Unbiasedness?

Where does the data come from? The noise?

◮ Financial data
◮ Autoregressive processes
◮ Markov chains
◮ Queuing systems
◮ Machine learning

Page 12:

Related Work

◮ Stochastic approximation methods (e.g. Robbins and Monro 1951, Polyak and Juditsky 1992)
◮ ODE and perturbation methods with dependent noise (e.g. Kushner and Yin 2003)
◮ Robust approaches and finite-sample rates for independent-noise settings (Nemirovski and Yudin 1983, Nemirovski et al. 2009)

What we do: finite-time convergence rates of stochastic optimization procedures with dependent noise

Page 14:

Stochastic Optimization

Goal: solve the following problem

    min_x  f(x) = E_Π[F(x; ξ)] = ∫_Ξ F(x; ξ) dΠ(ξ)
    subject to x ∈ X

Here ξ comes from distribution Π on sample space Ξ.

Example: Distributed Optimization

    f(x) = (1/n) ∑_{i=1}^{n} F(x; ξ_i) = (1/n) ∑_{i=1}^{n} f_i(x),

where ξ_i is the data on machine i.

[Figure: network of processors holding local functions f_1(x), ..., f_6(x)]

Page 16:

Difficulties

Goal: solve

    min_{x ∈ X}  f(x) = E_Π[F(x; ξ)] = ∫_Ξ F(x; ξ) dΠ(ξ)

◮ Cannot compute the expectation in closed form
◮ May not know the distribution Π
◮ Might not even be able to get samples ξ ∼ Π

Solution: stochastic gradient descent, but with samples ξ from a distribution P^t, where P^t → Π

Page 21:

Ergodic Gradient Descent

Algorithm: receive ξ_t ∼ P(· | ξ_1, ..., ξ_{t−1}), compute a stochastic (sub)gradient:

    g(t) ∈ ∂F(x(t); ξ_t),

then take a projected gradient step:

    x(t+1) = Proj_X[ x(t) − α(t) g(t) ]
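
A generic sketch of this update loop: the samples ξ_t come from a caller-supplied Markov kernel rather than i.i.d. draws, and the routine returns the averaged iterate x̄(T) used in the results that follow. The subgradient oracle, kernel, projection, and the α(t) ∝ 1/√t schedule are placeholders/assumptions:

    import numpy as np

    def ergodic_subgradient_descent(subgrad_F, markov_step, project, x0, xi0, T, alpha0=1.0):
        """Projected subgradient descent driven by dependent (Markovian) samples.

        subgrad_F(x, xi) -> an element of the subdifferential of F(.; xi) at x
        markov_step(xi)  -> next sample xi_t ~ P(. | xi_{t-1})  (the ergodic noise process)
        project(x)       -> Euclidean projection of x onto the constraint set X
        Returns xbar(T) = (1/T) * sum_t x(t).
        """
        x = np.asarray(x0, dtype=float)
        xi = xi0
        xbar = np.zeros_like(x)
        for t in range(1, T + 1):
            xi = markov_step(xi)                          # dependent sample, not i.i.d. from Pi
            g = subgrad_F(x, xi)                          # g(t) in dF(x(t); xi_t)
            x = project(x - (alpha0 / np.sqrt(t)) * g)    # alpha(t) ~ 1/sqrt(t)
            xbar += (x - xbar) / t                        # running average of the iterates
        return xbar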

Page 22:

Stochastic Assumption

Ergodicity: the stochastic process ξ_1, ξ_2, ..., ξ_t ∼ P is sufficiently mixing, i.e.

    d_TV( P(ξ_t | ξ_1, ..., ξ_s), Π ) → 0   as t − s ↑ ∞.

Specifically, define the mixing time τ_mix such that

    t − s ≥ τ_mix(P, ε)  ⇒  d_TV( P(ξ_t | ξ_1, ..., ξ_s), Π ) ≤ ε.

[Figure: timeline ξ_1, ξ_2, ξ_3, ..., ξ_t illustrating weakened dependence as the gap grows]
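
A small numerical illustration of this definition for a finite-state Markov chain: compute the total variation distance between the k-step distribution and the stationary distribution Π, and read off the smallest k with distance at most ε. The 3-state matrix is an arbitrary example chosen only for illustration:

    import numpy as np

    def stationary_distribution(P):
        """Left eigenvector of P for eigenvalue 1, normalized to a probability vector."""
        vals, vecs = np.linalg.eig(P.T)
        pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
        return pi / pi.sum()

    def mixing_time(P, eps, max_steps=10_000):
        """Smallest k with max over start states of d_TV(P^k(x0, .), Pi) <= eps."""
        pi = stationary_distribution(P)
        Pk = np.eye(P.shape[0])
        for k in range(1, max_steps + 1):
            Pk = Pk @ P
            d_tv = 0.5 * np.abs(Pk - pi).sum(axis=1).max()   # worst case over start states
            if d_tv <= eps:
                return k
        return None

    P = np.array([[0.90, 0.10, 0.00],
                  [0.05, 0.90, 0.05],
                  [0.00, 0.10, 0.90]])     # illustrative slowly mixing 3-state chain
    print(mixing_time(P, eps=0.1))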

Page 23:

Main Results

For the algorithm

    g(t) ∈ ∂F(x(t); ξ_t),   x(t+1) = Proj_X[ x(t) − α(t) g(t) ],

we have the following guarantees on the averaged iterate x̄(T) = (1/T) ∑_{t=1}^{T} x(t):

Theorem (Expected convergence). With the choice α(t) ∝ 1/√t,

    E[f(x̄(T))] − f(x*) ≤ C √τ_mix(P) / √T

Theorem (High-probability convergence). With probability at least 1 − e^{−κ},

    f(x̄(T)) − f(x*) ≤ C √τ_mix(P) / √T                       [expected rate]
                       + C √( κ τ_mix(P) log τ_mix(P) ) / √T   [deviation]

Page 26:

Examples

◮ Peer-to-peer distributed optimization
◮ Ranking algorithms (optimization on combinatorial spaces)
◮ Slowly mixing Markov chains
◮ Any mixing process

Page 27:

Peer-to-peer distributed optimization

Setup (Johansson et al. 09): n processors, each possesses a function f_i(x). Objective:

    f(x) = (1/n) ∑_{i=1}^{n} f_i(x)

Token walks randomly according to transition matrix P; processor i(t) has the token at time t. Update:

    x(t+1) = Proj_X( x(t) − α ∇f_{i(t)}(x(t)) )

[Figure: network of n = 6 processors holding local functions f_1(x), ..., f_6(x)]

Page 32:

Peer-to-peer distributed optimization: Convergence

Convergence rate of

    x(t+1) = Proj_X( x(t) − α(t) ∇f_{i(t)}(x(t)) )

is governed by the spectral gap of the transition matrix P (here ρ_2(P) is the second singular value of P):

    f(x̄(T)) − f(x*) = O( √( log(Tn) / (1 − ρ_2(P)) ) · 1/√T )

in expectation and w.h.p., where log(Tn) / (1 − ρ_2(P)) plays the role of τ_mix.

[Figure: network of processors holding local functions f_1(x), ..., f_6(x)]
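
A sketch of how the quantities on this slide could be computed for a given transition matrix P: take the second-largest singular value ρ_2(P) and form the log(Tn)/(1 − ρ_2(P)) factor. The lazy random walk on a ring is only an illustrative choice of P:

    import numpy as np

    def second_singular_value(P):
        """rho_2(P): second-largest singular value of the transition matrix P."""
        s = np.linalg.svd(P, compute_uv=False)    # singular values, sorted descending
        return s[1]

    def mixing_factor(P, T):
        """The log(T n) / (1 - rho_2(P)) term that plays the role of tau_mix above."""
        n = P.shape[0]
        return np.log(T * n) / (1.0 - second_singular_value(P))

    # Lazy random walk on a ring of n = 6 processors (illustrative, doubly stochastic).
    n = 6
    P = np.zeros((n, n))
    for i in range(n):
        P[i, i] = 0.5
        P[i, (i - 1) % n] = P[i, (i + 1) % n] = 0.25

    print(second_singular_value(P), mixing_factor(P, T=10_000))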

Page 33:

Optimization over combinatorial spaces

General problem: we want samples from the uniform distribution Π over a combinatorial space. This is hard to do directly, so use a random walk P → Π (Jerrum & Sinclair).

Example: learn a ranking. We receive pairwise user preferences between items and would like to be oblivious to the order of the remainder. P is a partial order on the ranked items, and {σ ∈ P} is the set of permutations of [n] consistent with P.

[Figure: Hasse diagram of a partial order over items 1–8]

Objective:

    f(x) := (1 / card{σ ∈ P}) ∑_{σ ∈ P} F(x; σ)
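
The slides do not specify the per-permutation loss F(x; σ). Purely as a hypothetical instance to make the objective concrete, one could use a pairwise hinge loss on linear item scores; neither this loss nor the scoring model comes from the talk:

    import numpy as np

    def pairwise_hinge_loss(x, sigma, features):
        """Hypothetical F(x; sigma): hinge penalty whenever the score ordering disagrees
        with the permutation sigma (sigma[k] is the item ranked in position k)."""
        scores = features @ x                 # one linear score per item (assumed model)
        loss = 0.0
        for a in range(len(sigma)):
            for b in range(a + 1, len(sigma)):
                hi, lo = sigma[a], sigma[b]   # item hi should be ranked above item lo
                loss += max(0.0, 1.0 - (scores[hi] - scores[lo]))
        n_pairs = len(sigma) * (len(sigma) - 1) / 2
        return loss / n_pairs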

Page 35:

Partial-order Permutation Markov Chain

Markov chain (Karzanov and Khachiyan 91): pick a random adjacent pair (i, j), swap if j ≺ i is consistent with P.

[Figure: adjacent transposition of items i and j in a permutation of [n]]

Mixing time (Wilson 04): τ_mix(P, ε) ≤ (4/π²) n³ log(n/ε)

Convergence rate:

    f(x̄(T)) − f(x*) = O( n^{3/2} √log(Tn) / √T )
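
A sketch of one step of the chain described above: pick a random adjacent pair of positions and swap the two items there only if the resulting order still respects the partial order. Representing the partial order as a set of "must come before" pairs is an implementation assumption:

    import random

    def chain_step(sigma, before):
        """One step of the Karzanov-Khachiyan chain: propose swapping a random adjacent
        pair, and accept the swap only if the new order is still consistent with P.

        sigma  : list of items; sigma[k] is the item in rank position k
        before : set of pairs (a, b) meaning the partial order requires a above b
        """
        k = random.randrange(len(sigma) - 1)     # random adjacent pair of positions
        a, b = sigma[k], sigma[k + 1]
        if (a, b) in before:                     # swap would violate the partial order: stay
            return sigma
        new_sigma = sigma[:]                     # otherwise perform the transposition
        new_sigma[k], new_sigma[k + 1] = b, a
        return new_sigma

    # Example: items 0..3 with the single constraint that item 0 must be ranked above item 3.
    sigma = [0, 1, 2, 3]
    for _ in range(1000):
        sigma = chain_step(sigma, before={(0, 3)})
    print(sigma)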

Page 36:

Slowly mixing Markov chains

◮ Might not have such fast mixing rates (i.e. log(1/ε))
◮ Examples:
  ◮ Physical simulations of natural phenomena
  ◮ Autoregressive processes
  ◮ Monte Carlo sampling-based variants of the expectation-maximization (EM) algorithm
◮ A possible mixing rate is algebraic:

    τ_mix(P, ε) ≤ M ε^{−β}   for some β > 0.

◮ Consequence: expected and high-probability convergence

    f(x̄(T)) − f(x*) = O( T^{−1/(2β+2)} )

Page 39:

Arbitrary mixing process

General convergence guarantee:

    f(x̄(T)) − f(x*) ≤ C_1/√T + C inf_{ε>0} { ε + τ_mix(P, ε)/√T }.

Consequence: if τ_mix(P, ε) < ∞ for every ε > 0, then

    f(x̄(T)) − f(x*) → 0 with probability 1.
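
To connect this bound with the algebraic-mixing rate on the previous slide, one can plug τ_mix(P, ε) ≤ M ε^{−β} into the infimum and balance the two terms. A sketch of that calculation in LaTeX, ignoring constants:

    \inf_{\epsilon>0}\Bigl\{\epsilon+\frac{\tau_{\mathrm{mix}}(P,\epsilon)}{\sqrt{T}}\Bigr\}
      \;\le\; \inf_{\epsilon>0}\Bigl\{\epsilon+\frac{M\epsilon^{-\beta}}{\sqrt{T}}\Bigr\},
    \qquad
    \text{balanced at } \epsilon^{1+\beta}=\frac{M}{\sqrt{T}},
    \text{ i.e. } \epsilon = M^{\frac{1}{\beta+1}}\,T^{-\frac{1}{2\beta+2}}.

At this choice both terms inside the infimum are of order T^{−1/(2β+2)}, which dominates the C_1/√T term, recovering f(x̄(T)) − f(x*) = O(T^{−1/(2β+2)}).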

Page 40:

Conclusions and Discussion

◮ Finite-sample convergence rates for stochastic gradient algorithms with dependent noise
◮ Convergence rates dependent on the mixing time τ_mix
◮ In companion work, we extend these results to all stable online learning algorithms (Agarwal and Duchi, 2011)
◮ Future work: dynamic adaptation for unknown mixing rates
◮ Weaken uniformity of mixing-time assumptions

Page 42:

Thanks!