Ergodic Subgradient Descent
John Duchi, Alekh Agarwal, Mikael Johansson, Michael Jordan
University of California, Berkeley and Royal Institute of Technology (KTH), Sweden
Allerton Conference, September 2011
Duchi, Agarwal, Johansson, Jordan (UCB) Ergodic Subgradient Descent September 2011 1 / 19
Stochastic Gradient Descent
Goal: solve

    minimize f(x)   subject to x ∈ X.

Repeat: at iteration t,
◮ Receive stochastic gradient g(t) with E[g(t) | g(1), …, g(t−1)] = ∇f(x(t))  (unbiased)
◮ Update: x(t+1) = x(t) − α(t) g(t)
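The update above can be sketched in a few lines. Everything concrete here is an illustrative assumption, not from the slides: a quadratic objective f(x) = ½‖x − x*‖², Gaussian gradient noise, and X = R² (so no projection is needed).

```python
import numpy as np

rng = np.random.default_rng(0)
x_star = np.array([1.0, -2.0])

def stochastic_gradient(x):
    # grad f(x) = x - x_star, plus zero-mean noise => unbiased estimate
    return (x - x_star) + 0.1 * rng.standard_normal(x.shape)

x = np.zeros(2)
for t in range(1, 5001):
    alpha = 1.0 / np.sqrt(t)              # step size alpha(t) ∝ 1/sqrt(t)
    x = x - alpha * stochastic_gradient(x)

print(x)  # close to x_star despite the noisy gradients
```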
Peer-to-peer distributed optimization

Setup (Johansson et al. 09): n processors, each possessing a function f_i(x). Objective:

    f(x) = (1/n) ∑_{i=1}^n f_i(x)

A token walks randomly according to transition matrix P; processor i(t) holds the token at time t. Update:

    x(t+1) = Proj_X( x(t) − α ∇f_{i(t)}(x(t)) )

[Figure: six networked processors holding f1(x), …, f6(x)]

Unbiased gradient estimate: choose i ∈ {1, …, n} uniformly at random, then

    g(t) = ∇f_i(x(t))
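A minimal sketch of the token-passing scheme, under illustrative assumptions not in the slides: a lazy random walk on a ring of n = 6 processors, scalar local objectives f_i(x) = ½(x − c_i)², X = R, and step size α(t) = 1/√t with iterate averaging.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
c = np.arange(n, dtype=float)             # f_i(x) = 0.5 * (x - c_i)^2
# Lazy random walk on a ring: stay w.p. 1/2, else move to a neighbor.
P = np.zeros((n, n))
for i in range(n):
    P[i, i] = 0.5
    P[i, (i - 1) % n] = 0.25
    P[i, (i + 1) % n] = 0.25

x, i, avg = 0.0, 0, 0.0
T = 20000
for t in range(1, T + 1):
    g = x - c[i]                          # grad f_i(x) at the token holder
    x = x - g / np.sqrt(t)                # alpha(t) = 1/sqrt(t)
    avg += (x - avg) / t                  # running average of iterates
    i = rng.choice(n, p=P[i])             # token moves according to P

# Minimizer of f(x) = (1/n) sum_i f_i(x) is the mean of the c_i.
print(avg, c.mean())
```

The walk is doubly stochastic, so its stationary distribution is uniform over processors, matching the equal weights 1/n in the objective.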
Unbiasedness?
Where does the data come from? The noise?
◮ Financial data
◮ Autoregressive processes
◮ Markov chains
◮ Queuing systems
◮ Machine learning
Related Work
◮ Stochastic approximation methods (e.g. Robbins and Monro 1951, Polyak and Juditsky 1992)
◮ ODE and perturbation methods with dependent noise (e.g. Kushner and Yin 2003)
◮ Robust approaches and finite-sample rates for independent-noise settings (Nemirovski and Yudin 1983, Nemirovski et al. 2009)

What we do: finite-time convergence rates for stochastic optimization procedures with dependent noise
Stochastic Optimization
Goal: solve the following problem

    min_x  f(x) = E_Π[F(x; ξ)] = ∫_Ξ F(x; ξ) dΠ(ξ)
    subject to x ∈ X

Here ξ comes from a distribution Π on sample space Ξ.
Example: Distributed Optimization

    f(x) = (1/n) ∑_{i=1}^n F(x; ξ_i) = (1/n) ∑_{i=1}^n f_i(x),

where ξ_i is the data on machine i.

[Figure: six networked processors holding f1(x), …, f6(x)]
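When ξ can be sampled from Π, the expectation f(x) = E_Π[F(x; ξ)] can at least be estimated by Monte Carlo. A small check under illustrative assumptions (F(x; ξ) = (x − ξ)² with Π = N(0, 1), so f(x) = x² + 1 in closed form):

```python
import numpy as np

rng = np.random.default_rng(2)
x = 0.5
xi = rng.standard_normal(200000)          # samples xi ~ Pi
f_hat = np.mean((x - xi) ** 2)            # Monte Carlo estimate of f(x)
f_true = x ** 2 + 1.0                     # E[(x - xi)^2] = x^2 + Var(xi)
print(f_hat, f_true)
```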
Difficulties

Goal: solve

    min_{x ∈ X}  f(x) = E_Π[F(x; ξ)] = ∫_Ξ F(x; ξ) dΠ(ξ)

◮ Cannot compute the expectation in closed form
◮ May not know the distribution Π
◮ Might not even be able to get samples ξ ∼ Π

Solution: stochastic gradient descent, but with samples ξ from a distribution P^t, where P^t → Π.
Ergodic Gradient Descent
Algorithm: receive ξ_t ∼ P(· | ξ_1, …, ξ_{t−1}), compute a stochastic (sub)gradient

    g(t) ∈ ∂F(x(t); ξ_t),

then take a projected gradient step:

    x(t+1) = Proj_X[x(t) − α(t) g(t)]
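One pass of this method with Markovian (not i.i.d.) samples and a nonsmooth F. All specifics are illustrative assumptions: X = [−1, 1] with projection by clipping, F(x; ξ) = |x − ξ|, and a sticky two-state chain over ξ ∈ {−0.5, 0.5} whose stationary distribution Π is uniform, so f is minimized on [−0.5, 0.5].

```python
import numpy as np

rng = np.random.default_rng(3)
states = np.array([-0.5, 0.5])
# Sticky two-state chain: stays put w.p. 0.9 => strongly dependent samples,
# but ergodic with uniform stationary distribution.
P = np.array([[0.9, 0.1],
              [0.1, 0.9]])

def subgradient(x, xi):
    return np.sign(x - xi)                # g in the subdifferential of |x - xi|

x, s, avg = 1.0, 0, 0.0
T = 50000
for t in range(1, T + 1):
    g = subgradient(x, states[s])
    x = np.clip(x - g / np.sqrt(t), -1.0, 1.0)   # Proj_X with X = [-1, 1]
    avg += (x - avg) / t                  # averaged iterate
    s = rng.choice(2, p=P[s])             # xi_{t+1} ~ P(. | xi_t)

# f(x) = 0.5|x + 0.5| + 0.5|x - 0.5| attains its minimum 0.5 on [-0.5, 0.5].
print(avg)
```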
Stochastic Assumption
Ergodicity: the stochastic process ξ_1, ξ_2, …, ξ_t ∼ P is sufficiently mixing, i.e.

    d_TV(P(ξ_t | ξ_1, …, ξ_s), Π) → 0

as t − s ↑ ∞. Specifically, define the mixing time τ_mix so that

    t − s ≥ τ_mix(P, ε)  ⇒  d_TV(P(ξ_t | ξ_1, …, ξ_s), Π) ≤ ε.

[Figure: timeline ξ_1, ξ_2, ξ_3, …, ξ_t, with dependence on the past weakening as the gap grows]
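For a finite-state Markov chain these quantities are directly computable. A sketch with an illustrative 3-state transition matrix (an assumption, not from the slides), checking that d_TV decays and finding τ_mix(P, ε) by search:

```python
import numpy as np

P = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.4, 0.5]])

# Stationary distribution: left eigenvector of P for eigenvalue 1.
w, V = np.linalg.eig(P.T)
pi = np.real(V[:, np.argmax(np.real(w))])
pi = pi / pi.sum()

def d_tv(t, start=0):
    row = np.linalg.matrix_power(P, t)[start]   # law of xi_t given xi_0 = start
    return 0.5 * np.abs(row - pi).sum()

def tau_mix(eps):
    # Smallest t with worst-case-over-starts total variation <= eps.
    t = 1
    while max(0.5 * np.abs(np.linalg.matrix_power(P, t)[s] - pi).sum()
              for s in range(3)) > eps:
        t += 1
    return t

print([round(d_tv(t), 4) for t in (1, 2, 5, 10)], tau_mix(0.01))
```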
Main Results

For the algorithm

    g(t) ∈ ∂F(x(t); ξ_t),   x(t+1) = Proj_X[x(t) − α(t) g(t)],

we have the following guarantees on the averaged iterate x̄(T) = (1/T) ∑_{t=1}^T x(t):

Theorem (expected convergence): with the choice α(t) ∝ 1/√t,

    E[f(x̄(T))] − f(x*) ≤ C √τ_mix(P) / √T

Theorem (high-probability convergence): with probability at least 1 − e^{−κ},

    f(x̄(T)) − f(x*) ≤ C √τ_mix(P) / √T + C √(κ τ_mix(P)) log τ_mix(P) / T

(the first term is the expected rate, the second the deviation term).
Examples:
◮ Peer-to-peer Distributed Optimization
◮ Ranking algorithms (optimization on combinatorial spaces)
◮ Slowly mixing Markov chains
◮ Any mixing process
Peer-to-peer distributed optimization
Setup (Johansson et al. 09): n processors, each possessing a function f_i(x). Objective:

    f(x) = (1/n) ∑_{i=1}^n f_i(x)

A token walks randomly according to transition matrix P; processor i(t) holds the token at time t. Update:

    x(t+1) = Proj_X( x(t) − α ∇f_{i(t)}(x(t)) )

[Figure: six networked processors holding f1(x), …, f6(x)]
Peer-to-peer distributed optimization: Convergence
The convergence rate of

    x(t+1) = Proj_X( x(t) − α(t) ∇f_{i(t)}(x(t)) )

is governed by the spectral gap of the transition matrix P (here ρ_2(P) is the second singular value of P):

    f(x̄(T)) − f(x*) = O( √( log(Tn) / (1 − ρ_2(P)) ) · 1/√T ),

where 1/(1 − ρ_2(P)) plays the role of τ_mix, in expectation and w.h.p.

[Figure: six networked processors holding f1(x), …, f6(x)]
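The quantity ρ_2(P) is easy to compute numerically. A check on the lazy ring walk (an illustrative choice of P, echoing the six-processor figure); this walk is symmetric, so its singular values equal the absolute values of its eigenvalues:

```python
import numpy as np

n = 6
P = np.zeros((n, n))
for i in range(n):
    P[i, i] = 0.5
    P[i, (i - 1) % n] = 0.25
    P[i, (i + 1) % n] = 0.25

sv = np.linalg.svd(P, compute_uv=False)   # singular values, sorted descending
rho2 = sv[1]                              # second singular value rho_2(P)
# Circulant eigenvalues are 0.5 + 0.5*cos(2*pi*k/n), k = 0..n-1,
# so rho_2(P) = 0.5 + 0.5*cos(2*pi/6) = 0.75 here.
print(rho2, 1.0 / (1.0 - rho2))           # spectral gap factor in the rate
```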
Optimization over combinatorial spaces
General problem: we want samples from the uniform distribution Π over a combinatorial space. This is hard to do directly, so use a random walk with P → Π (Jerrum & Sinclair).

Example: learn a ranking. We receive pairwise user preferences between items and would like to be oblivious to the order of the remainder. P is the partial order on the ranked items, and {σ ∈ P} is the set of permutations of [n] consistent with P.

[Figure: Hasse diagram of a partial order over eight items]

Objective:

    f(x) := (1 / card{σ ∈ P}) ∑_{σ ∈ P} F(x; σ)
Partial-order Permutation Markov Chain
Markov chain (Karzanov and Khachiyan 91): pick a random adjacent pair (i, j), and swap it if j ≺ i is consistent with P.

[Figure: a permutation 1, …, i, j, …, n and the result of swapping the adjacent pair to 1, …, j, i, …, n]

Mixing time (Wilson 04): τ_mix(P, ε) ≤ (4/π²) n³ log(n/ε)

Convergence rate:

    f(x̄(T)) − f(x*) = O( n^{3/2} √(log(Tn)) / √T )
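A sketch of this chain on linear extensions: propose a random adjacent swap and reject it only when it would violate the partial order. The toy partial order below (1 before 3, 2 before 4, on n = 4 items) is an illustrative assumption; in the long run the chain visits each consistent permutation approximately uniformly.

```python
import random

random.seed(0)
precedes = {(1, 3), (2, 4)}               # constraints a ≺ b of the partial order

def step(sigma):
    i = random.randrange(len(sigma) - 1)  # random adjacent pair (positions i, i+1)
    a, b = sigma[i], sigma[i + 1]
    if (a, b) not in precedes:            # swap unless it would put b before a with a ≺ b
        sigma[i], sigma[i + 1] = b, a
    return sigma

sigma = [1, 2, 3, 4]                      # start from a linear extension
counts = {}
for t in range(200000):
    sigma = step(sigma)
    counts[tuple(sigma)] = counts.get(tuple(sigma), 0) + 1

# Every visited permutation respects the partial order.
assert all(s.index(1) < s.index(3) and s.index(2) < s.index(4) for s in counts)
print(len(counts))                        # number of linear extensions reached
```

For this partial order there are exactly 6 linear extensions (24 permutations, halved by each independent constraint), and the chain visits all of them.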
Slowly mixing Markov chains
◮ Might not have such fast mixing rates (i.e. scaling as log(1/ε))
◮ Examples:
  ◮ Physical simulations of natural phenomena
  ◮ Autoregressive processes
  ◮ Monte Carlo sampling-based variants of the expectation-maximization (EM) algorithm
◮ A possible mixing rate is algebraic:

    τ_mix(P, ε) ≤ M ε^{−β}   for some β > 0.

◮ Consequence: expected and high-probability convergence

    f(x̄(T)) − f(x*) = O( T^{−1/(2β+2)} )
Arbitrary mixing process
General convergence guarantee:

    f(x̄(T)) − f(x*) ≤ C₁/√T + C inf_{ε>0} { ε + τ_mix(P, ε)/√T }.

Consequence: if τ_mix(P, ε) < ∞ for every ε > 0, then

    f(x̄(T)) − f(x*) → 0 with probability 1.
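The algebraic-mixing rate for slowly mixing chains follows from this general bound by optimizing over ε; a sketch:

```latex
\inf_{\varepsilon>0}\Big\{\varepsilon + \frac{\tau_{\mathrm{mix}}(P,\varepsilon)}{\sqrt{T}}\Big\}
\;\le\; \inf_{\varepsilon>0}\Big\{\varepsilon + \frac{M\varepsilon^{-\beta}}{\sqrt{T}}\Big\},
\qquad
\frac{d}{d\varepsilon}\Big(\varepsilon + \frac{M\varepsilon^{-\beta}}{\sqrt{T}}\Big) = 0
\;\Rightarrow\;
\varepsilon^* = \Big(\frac{\beta M}{\sqrt{T}}\Big)^{\frac{1}{\beta+1}} = O\big(T^{-\frac{1}{2\beta+2}}\big).
```

At ε*, both terms scale as T^{−1/(2β+2)}, recovering the rate stated for algebraically mixing chains.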
Conclusions and Discussion
◮ Finite-sample convergence rates for stochastic gradient algorithms with dependent noise
◮ Convergence rates depend on the mixing time τ_mix
◮ In companion work, we extend these results to all stable online learning algorithms (Agarwal and Duchi, 2011)
◮ Future work: dynamic adaptation to unknown mixing rates
◮ Future work: weaken the uniformity of the mixing-time assumptions
Thanks!