Ergodic Subgradient Descent
John Duchi, Alekh Agarwal, Mikael Johansson, Michael Jordan
University of California, Berkeley and Royal Institute of Technology (KTH), Sweden
Allerton Conference, September 2011
Duchi, Agarwal, Johansson, Jordan (UCB) Ergodic Subgradient Descent September 2011 1 / 19
Stochastic Gradient Descent
Goal: solve

    minimize f(x)   subject to x ∈ X.

Repeat: at iteration t,
◮ Receive stochastic gradient g(t) with E[g(t) | g(1), …, g(t−1)] = ∇f(x(t))  (unbiased)
◮ Update: x(t+1) = x(t) − α(t) g(t)
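The update above can be sketched in a few lines. Everything concrete here is an illustrative assumption, not from the slides: a quadratic objective f(x) = ½‖x − x*‖², Gaussian gradient noise, and X = R² (so no projection is needed).

```python
import numpy as np

rng = np.random.default_rng(0)
x_star = np.array([1.0, -2.0])

def stochastic_gradient(x):
    # grad f(x) = x - x_star, plus zero-mean noise => unbiased estimate
    return (x - x_star) + 0.1 * rng.standard_normal(x.shape)

x = np.zeros(2)
for t in range(1, 5001):
    alpha = 1.0 / np.sqrt(t)              # step size alpha(t) ∝ 1/sqrt(t)
    x = x - alpha * stochastic_gradient(x)

print(x)  # close to x_star despite the noisy gradients
```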
Peer-to-peer distributed optimization

Setup (Johansson et al. 09): n processors, each possessing a function f_i(x). Objective:

    f(x) = (1/n) ∑_{i=1}^n f_i(x)

A token walks randomly according to transition matrix P; processor i(t) holds the token at time t. Update:

    x(t+1) = Proj_X( x(t) − α ∇f_{i(t)}(x(t)) )

[Figure: six networked processors holding f1(x), …, f6(x)]

Unbiased gradient estimate: choose i ∈ {1, …, n} uniformly at random, then

    g(t) = ∇f_i(x(t))
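A minimal sketch of the token-passing scheme, under illustrative assumptions not in the slides: a lazy random walk on a ring of n = 6 processors, scalar local objectives f_i(x) = ½(x − c_i)², X = R, and step size α(t) = 1/√t with iterate averaging.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
c = np.arange(n, dtype=float)             # f_i(x) = 0.5 * (x - c_i)^2
# Lazy random walk on a ring: stay w.p. 1/2, else move to a neighbor.
P = np.zeros((n, n))
for i in range(n):
    P[i, i] = 0.5
    P[i, (i - 1) % n] = 0.25
    P[i, (i + 1) % n] = 0.25

x, i, avg = 0.0, 0, 0.0
T = 20000
for t in range(1, T + 1):
    g = x - c[i]                          # grad f_i(x) at the token holder
    x = x - g / np.sqrt(t)                # alpha(t) = 1/sqrt(t)
    avg += (x - avg) / t                  # running average of iterates
    i = rng.choice(n, p=P[i])             # token moves according to P

# Minimizer of f(x) = (1/n) sum_i f_i(x) is the mean of the c_i.
print(avg, c.mean())
```

The walk is doubly stochastic, so its stationary distribution is uniform over processors, matching the equal weights 1/n in the objective.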
Unbiasedness?
Where does the data come from? The noise?
◮ Financial data
◮ Autoregressive processes
◮ Markov chains
◮ Queuing systems
◮ Machine learning
Related Work
◮ Stochastic approximation methods (e.g. Robbins and Monro 1951, Polyak and Juditsky 1992)
◮ ODE and perturbation methods with dependent noise (e.g. Kushner and Yin 2003)
◮ Robust approaches and finite-sample rates for independent-noise settings (Nemirovski and Yudin 1983, Nemirovski et al. 2009)

What we do: finite-time convergence rates for stochastic optimization procedures with dependent noise
Stochastic Optimization
Goal: solve the following problem

    min_x  f(x) = E_Π[F(x; ξ)] = ∫_Ξ F(x; ξ) dΠ(ξ)
    subject to x ∈ X

Here ξ comes from a distribution Π on sample space Ξ.
Example: Distributed Optimization

    f(x) = (1/n) ∑_{i=1}^n F(x; ξ_i) = (1/n) ∑_{i=1}^n f_i(x),

where ξ_i is the data on machine i.

[Figure: six networked processors holding f1(x), …, f6(x)]
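When ξ can be sampled from Π, the expectation f(x) = E_Π[F(x; ξ)] can at least be estimated by Monte Carlo. A small check under illustrative assumptions (F(x; ξ) = (x − ξ)² with Π = N(0, 1), so f(x) = x² + 1 in closed form):

```python
import numpy as np

rng = np.random.default_rng(2)
x = 0.5
xi = rng.standard_normal(200000)          # samples xi ~ Pi
f_hat = np.mean((x - xi) ** 2)            # Monte Carlo estimate of f(x)
f_true = x ** 2 + 1.0                     # E[(x - xi)^2] = x^2 + Var(xi)
print(f_hat, f_true)
```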
Difficulties

Goal: solve

    min_{x ∈ X}  f(x) = E_Π[F(x; ξ)] = ∫_Ξ F(x; ξ) dΠ(ξ)

◮ Cannot compute the expectation in closed form
◮ May not know the distribution Π
◮ Might not even be able to get samples ξ ∼ Π

Solution: stochastic gradient descent, but with samples ξ from a distribution P^t, where P^t → Π.
Ergodic Gradient Descent
Algorithm: receive ξ_t ∼ P(· | ξ_1, …, ξ_{t−1}), compute a stochastic (sub)gradient

    g(t) ∈ ∂F(x(t); ξ_t),

then take a projected gradient step:

    x(t+1) = Proj_X[x(t) − α(t) g(t)]
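One pass of this method with Markovian (not i.i.d.) samples and a nonsmooth F. All specifics are illustrative assumptions: X = [−1, 1] with projection by clipping, F(x; ξ) = |x − ξ|, and a sticky two-state chain over ξ ∈ {−0.5, 0.5} whose stationary distribution Π is uniform, so f is minimized on [−0.5, 0.5].

```python
import numpy as np

rng = np.random.default_rng(3)
states = np.array([-0.5, 0.5])
# Sticky two-state chain: stays put w.p. 0.9 => strongly dependent samples,
# but ergodic with uniform stationary distribution.
P = np.array([[0.9, 0.1],
              [0.1, 0.9]])

def subgradient(x, xi):
    return np.sign(x - xi)                # g in the subdifferential of |x - xi|

x, s, avg = 1.0, 0, 0.0
T = 50000
for t in range(1, T + 1):
    g = subgradient(x, states[s])
    x = np.clip(x - g / np.sqrt(t), -1.0, 1.0)   # Proj_X with X = [-1, 1]
    avg += (x - avg) / t                  # averaged iterate
    s = rng.choice(2, p=P[s])             # xi_{t+1} ~ P(. | xi_t)

# f(x) = 0.5|x + 0.5| + 0.5|x - 0.5| attains its minimum 0.5 on [-0.5, 0.5].
print(avg)
```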
Stochastic Assumption
Ergodicity: the stochastic process ξ_1, ξ_2, …, ξ_t ∼ P is sufficiently mixing, i.e.

    d_TV(P(ξ_t | ξ_1, …, ξ_s), Π) → 0

as t − s ↑ ∞. Specifically, define the mixing time τ_mix so that

    t − s ≥ τ_mix(P, ε)  ⇒  d_TV(P(ξ_t | ξ_1, …, ξ_s), Π) ≤ ε.

[Figure: timeline ξ_1, ξ_2, ξ_3, …, ξ_t, with dependence on the past weakening as the gap grows]
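For a finite-state Markov chain these quantities are directly computable. A sketch with an illustrative 3-state transition matrix (an assumption, not from the slides), checking that d_TV decays and finding τ_mix(P, ε) by search:

```python
import numpy as np

P = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.4, 0.5]])

# Stationary distribution: left eigenvector of P for eigenvalue 1.
w, V = np.linalg.eig(P.T)
pi = np.real(V[:, np.argmax(np.real(w))])
pi = pi / pi.sum()

def d_tv(t, start=0):
    row = np.linalg.matrix_power(P, t)[start]   # law of xi_t given xi_0 = start
    return 0.5 * np.abs(row - pi).sum()

def tau_mix(eps):
    # Smallest t with worst-case-over-starts total variation <= eps.
    t = 1
    while max(0.5 * np.abs(np.linalg.matrix_power(P, t)[s] - pi).sum()
              for s in range(3)) > eps:
        t += 1
    return t

print([round(d_tv(t), 4) for t in (1, 2, 5, 10)], tau_mix(0.01))
```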
Main Results

For the algorithm

    g(t) ∈ ∂F(x(t); ξ_t),   x(t+1) = Proj_X[x(t) − α(t) g(t)],

we have the following guarantees on the averaged iterate x̄(T) = (1/T) ∑_{t=1}^T x(t):

Theorem (expected convergence): with the choice α(t) ∝ 1/√t,

    E[f(x̄(T))] − f(x*) ≤ C √τ_mix(P) / √T

Theorem (high-probability convergence): with probability at least 1 − e^{−κ},

    f(x̄(T)) − f(x*) ≤ C √τ_mix(P) / √T + C √(κ τ_mix(P)) log τ_mix(P) / T

(the first term is the expected rate, the second the deviation term).
Examples:
◮ Peer-to-peer Distributed Optimization
◮ Ranking algorithms (optimization on combinatorial spaces)
◮ Slowly mixing Markov chains
◮ Any mixing process
Peer-to-peer distributed optimization
Setup (Johansson et al. 09): n processors, each possessing a function f_i(x). Objective:

    f(x) = (1/n) ∑_{i=1}^n f_i(x)

A token walks randomly according to transition matrix P; processor i(t) holds the token at time t. Update:

    x(t+1) = Proj_X( x(t) − α ∇f_{i(t)}(x(t)) )

[Figure: six networked processors holding f1(x), …, f6(x)]
Peer-to-peer distributed optimization: Convergence
The convergence rate of

    x(t+1) = Proj_X( x(t) − α(t) ∇f_{i(t)}(x(t)) )

is governed by the spectral gap of the transition matrix P (here ρ_2(P) is the second singular value of P):

    f(x̄(T)) − f(x*) = O( √( log(Tn) / (1 − ρ_2(P)) ) · 1/√T ),

where 1/(1 − ρ_2(P)) plays the role of τ_mix, in expectation and w.h.p.

[Figure: six networked processors holding f1(x), …, f6(x)]
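The quantity ρ_2(P) is easy to compute numerically. A check on the lazy ring walk (an illustrative choice of P, echoing the six-processor figure); this walk is symmetric, so its singular values equal the absolute values of its eigenvalues:

```python
import numpy as np

n = 6
P = np.zeros((n, n))
for i in range(n):
    P[i, i] = 0.5
    P[i, (i - 1) % n] = 0.25
    P[i, (i + 1) % n] = 0.25

sv = np.linalg.svd(P, compute_uv=False)   # singular values, sorted descending
rho2 = sv[1]                              # second singular value rho_2(P)
# Circulant eigenvalues are 0.5 + 0.5*cos(2*pi*k/n), k = 0..n-1,
# so rho_2(P) = 0.5 + 0.5*cos(2*pi/6) = 0.75 here.
print(rho2, 1.0 / (1.0 - rho2))           # spectral gap factor in the rate
```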
Optimization over combinatorial spaces
General problem: we want samples from the uniform distribution Π over a combinatorial space. This is hard to do directly, so use a random walk with P → Π (Jerrum & Sinclair).

Example: learn a ranking. We receive pairwise user preferences between items and would like to be oblivious to the order of the remainder. P is the partial order on the ranked items, and {σ ∈ P} is the set of permutations of [n] consistent with P.

[Figure: Hasse diagram of a partial order over eight items]

Objective:

    f(x) := (1 / card{σ ∈ P}) ∑_{σ ∈ P} F(x; σ)
Partial-order Permutation Markov Chain
Markov chain (Karzanov and Khachiyan 91): pick a random adjacent pair (i, j), and swap it if j ≺ i is consistent with P.

[Figure: a permutation 1, …, i, j, …, n and the result of swapping the adjacent pair to 1, …, j, i, …, n]

Mixing time (Wilson 04): τ_mix(P, ε) ≤ (4/π²) n³ log(n/ε)

Convergence rate:

    f(x̄(T)) − f(x*) = O( n^{3/2} √(log(Tn)) / √T )
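A sketch of this chain on linear extensions: propose a random adjacent swap and reject it only when it would violate the partial order. The toy partial order below (1 before 3, 2 before 4, on n = 4 items) is an illustrative assumption; in the long run the chain visits each consistent permutation approximately uniformly.

```python
import random

random.seed(0)
precedes = {(1, 3), (2, 4)}               # constraints a ≺ b of the partial order

def step(sigma):
    i = random.randrange(len(sigma) - 1)  # random adjacent pair (positions i, i+1)
    a, b = sigma[i], sigma[i + 1]
    if (a, b) not in precedes:            # swap unless it would put b before a with a ≺ b
        sigma[i], sigma[i + 1] = b, a
    return sigma

sigma = [1, 2, 3, 4]                      # start from a linear extension
counts = {}
for t in range(200000):
    sigma = step(sigma)
    counts[tuple(sigma)] = counts.get(tuple(sigma), 0) + 1

# Every visited permutation respects the partial order.
assert all(s.index(1) < s.index(3) and s.index(2) < s.index(4) for s in counts)
print(len(counts))                        # number of linear extensions reached
```

For this partial order there are exactly 6 linear extensions (24 permutations, halved by each independent constraint), and the chain visits all of them.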
Slowly mixing Markov chains
◮ Might not have such fast mixing rates (i.e. scaling as log(1/ε))
◮ Examples:
  ◮ Physical simulations of natural phenomena
  ◮ Autoregressive processes
  ◮ Monte Carlo sampling-based variants of the expectation-maximization (EM) algorithm
◮ A possible mixing rate is algebraic:

    τ_mix(P, ε) ≤ M ε^{−β}   for some β > 0.

◮ Consequence: expected and high-probability convergence

    f(x̄(T)) − f(x*) = O( T^{−1/(2β+2)} )
Arbitrary mixing process
General convergence guarantee:

    f(x̄(T)) − f(x*) ≤ C₁/√T + C inf_{ε>0} { ε + τ_mix(P, ε)/√T }.

Consequence: if τ_mix(P, ε) < ∞ for every ε > 0, then

    f(x̄(T)) − f(x*) → 0 with probability 1.
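The algebraic-mixing rate for slowly mixing chains follows from this general bound by optimizing over ε; a sketch:

```latex
\inf_{\varepsilon>0}\Big\{\varepsilon + \frac{\tau_{\mathrm{mix}}(P,\varepsilon)}{\sqrt{T}}\Big\}
\;\le\; \inf_{\varepsilon>0}\Big\{\varepsilon + \frac{M\varepsilon^{-\beta}}{\sqrt{T}}\Big\},
\qquad
\frac{d}{d\varepsilon}\Big(\varepsilon + \frac{M\varepsilon^{-\beta}}{\sqrt{T}}\Big) = 0
\;\Rightarrow\;
\varepsilon^* = \Big(\frac{\beta M}{\sqrt{T}}\Big)^{\frac{1}{\beta+1}} = O\big(T^{-\frac{1}{2\beta+2}}\big).
```

At ε*, both terms scale as T^{−1/(2β+2)}, recovering the rate stated for algebraically mixing chains.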
Conclusions and Discussion
◮ Finite-sample convergence rates for stochastic gradient algorithms with dependent noise
◮ Convergence rates depend on the mixing time τ_mix
◮ In companion work, we extend these results to all stable online learning algorithms (Agarwal and Duchi, 2011)
◮ Future work: dynamic adaptation to unknown mixing rates
◮ Future work: weaken the uniformity of the mixing-time assumptions
Thanks!