Adaptive Subgradient Methods for Online Learning and Stochastic Optimization

John C. Duchi¹,²  Elad Hazan³  Yoram Singer²
¹University of California, Berkeley
²Google Research
³Technion
International Symposium on Mathematical Programming 2012
Duchi et al. (UC Berkeley) Adaptive Subgradient Methods ISMP 2012 1 / 32
Setting: Online Convex Optimization

Online learning task; repeat:

• Learner plays point $x_t$
• Receive function $f_t$
• Suffer loss $f_t(x_t) + \varphi(x_t)$

Running example ($x_t$ is a parameter vector for features):

• Receive label $y_t$, features $\phi_t$
• Suffer regularized logistic loss $\log\left[1 + \exp(-y_t \langle \phi_t, x_t\rangle)\right] + \lambda \|x_t\|_1$

Goal: attain small regret

$$\sum_{t=1}^{T} f_t(x_t) + \varphi(x_t) - \inf_{x \in X}\left[\sum_{t=1}^{T} f_t(x) + \varphi(x)\right]$$
Motivation

Text data:

> The most unsung birthday in American business and technological history this year may be the 50th anniversary of the Xerox 914 photocopier. (The Atlantic, July/August 2010)

High-dimensional image features

Other motivation: selecting advertisements in online advertising, document ranking, problems with parameterizations of many magnitudes...
Goal? Flipping around the usual sparsity game

$$\min_x \|Ax - b\|, \qquad A = [a_1\; a_2\; \cdots\; a_n]^\top \in \mathbb{R}^{n \times d}$$

Bounds in sparsity-focused settings usually depend on

$$\underbrace{\|a_i\|_\infty}_{\text{dense}} \cdot \underbrace{\|x\|_1}_{\text{sparse}}$$

What we would like:

$$\underbrace{\|a_i\|_1}_{\text{sparse}} \cdot \underbrace{\|x\|_\infty}_{\text{dense}}$$

(In general, impossible.)
Approaches: Gradient Descent and Dual Averaging

Let $g_t \in \partial f_t(x_t)$:

$$x_{t+1} = \operatorname*{argmin}_{x \in X} \left\{ \frac{1}{2}\|x - x_t\|^2 + \eta_t \langle g_t, x\rangle \right\}$$

or

$$x_{t+1} = \operatorname*{argmin}_{x \in X} \left\{ \frac{\eta_t}{t} \sum_{\tau=1}^{t} \langle g_\tau, x\rangle + \frac{1}{2t}\|x\|^2 \right\}$$

(Figure: the linear lower bound $f(x_t) + \langle g_t, x - x_t\rangle$ beneath $f(x)$.)
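Over $X = \mathbb{R}^d$ both updates have closed forms, which makes them easy to compare. A minimal sketch, assuming a toy 1-D objective $f_t(x) = |x - 3|$ and step sizes $\eta_t = 1/\sqrt{t}$ (both made up for illustration):

```python
import numpy as np

# Unconstrained (X = R) versions of the two updates on f_t(x) = |x - 3|:
#   gradient step:  x_{t+1} = x_t - eta_t * g_t
#   dual averaging: x_{t+1} = -eta_t * sum_{tau <= t} g_tau
#                   (minimizer of (eta_t/t) sum <g_tau, x> + ||x||^2 / (2t))

def subgrad(x, target=3.0):
    return np.sign(x - target)  # a subgradient of |x - target|

x_gd = x_da = 0.0
grad_sum = 0.0
for t in range(1, 201):
    eta = 1.0 / np.sqrt(t)
    x_gd -= eta * subgrad(x_gd)   # gradient-descent iterate
    grad_sum += subgrad(x_da)
    x_da = -eta * grad_sum        # dual-averaging iterate

# both iterates hover near the minimizer x = 3
```

Both methods converge to a neighborhood of the minimizer whose size shrinks with the step size; they differ in how they aggregate past gradients, not in the toy limit they reach.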
What is the problem?
• Gradient steps treat all features as equal
• They are not!
Adapting to Geometry of Space
Why adapt to geometry?

(Figure: a "hard" elongated problem geometry vs. a "nice" round one.)

| $y_t$ | $\phi_{t,1}$ | $\phi_{t,2}$ | $\phi_{t,3}$ |
| --- | --- | --- | --- |
| 1 | 1 | 0 | 0 |
| -1 | .5 | 0 | 1 |
| 1 | -.5 | 1 | 0 |
| -1 | 0 | 0 | 0 |
| 1 | .5 | 0 | 0 |
| -1 | 1 | 0 | 0 |
| 1 | -1 | 1 | 0 |
| -1 | -.5 | 0 | 1 |

Feature 1: frequent, irrelevant
Feature 2: infrequent, predictive
Feature 3: infrequent, predictive
Adapting to Geometry of the Space

• Receive $g_t \in \partial f_t(x_t)$
• Earlier:

$$x_{t+1} = \operatorname*{argmin}_{x\in X}\left\{\frac{1}{2}\|x-x_t\|^2 + \eta\langle g_t, x\rangle\right\}$$

• Now: let $\|x\|_A^2 = \langle x, Ax\rangle$ for $A \succeq 0$. Use

$$x_{t+1} = \operatorname*{argmin}_{x\in X}\left\{\frac{1}{2}\|x-x_t\|_A^2 + \eta\langle g_t, x\rangle\right\}$$
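For $X = \mathbb{R}^d$ the $A$-weighted update also has a closed form, $x_{t+1} = x_t - \eta A^{-1} g_t$. A small sketch, with arbitrary made-up values for $A$, $x_t$, $g_t$, $\eta$, checking the first-order optimality condition at that point:

```python
import numpy as np

# Minimizing (1/2)||x - x_t||_A^2 + eta <g_t, x> over R^d sets the gradient
# A(x - x_t) + eta * g_t to zero, giving x_{t+1} = x_t - eta * A^{-1} g_t.
A = np.diag([4.0, 1.0])          # an arbitrary positive-definite metric
x_t = np.array([1.0, -2.0])
g_t = np.array([2.0, 0.5])
eta = 0.1

x_next = x_t - eta * np.linalg.solve(A, g_t)

# first-order optimality: the objective's gradient vanishes at x_next
residual = A @ (x_next - x_t) + eta * g_t
```

Note how the large-curvature coordinate (weight 4 in $A$) receives a step four times smaller than the other, which is exactly the per-coordinate scaling the adaptive methods below exploit.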
Regret Bounds

What does adaptation buy?

• Standard regret bound:

$$\sum_{t=1}^{T} f_t(x_t) - f_t(x^*) \le \frac{1}{2\eta}\|x_1 - x^*\|_2^2 + \frac{\eta}{2}\sum_{t=1}^{T} \|g_t\|_2^2$$

• Regret bound with matrix:

$$\sum_{t=1}^{T} f_t(x_t) - f_t(x^*) \le \frac{1}{2\eta}\|x_1 - x^*\|_A^2 + \frac{\eta}{2}\sum_{t=1}^{T} \|g_t\|_{A^{-1}}^2$$
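The standard bound can be sanity-checked numerically. A sketch, assuming a toy 1-D sequence $f_t(x) = |x - c_t|$ with made-up targets and step size, that runs fixed-step subgradient descent and verifies the inequality:

```python
import numpy as np

# Toy check of the standard bound for fixed-eta subgradient descent on
# f_t(x) = |x - c_t| over X = R (targets c_t and eta are made up).
rng = np.random.default_rng(0)
c = rng.uniform(-1.0, 1.0, size=50)

eta, x = 0.2, 2.0
iterates, grads = [], []
for c_t in c:
    iterates.append(x)
    g = np.sign(x - c_t)          # subgradient of |x - c_t|
    grads.append(g)
    x -= eta * g

x_star = np.median(c)             # minimizer of sum_t |x - c_t|
regret = sum(abs(xt - ct) for xt, ct in zip(iterates, c)) - np.abs(x_star - c).sum()
bound = (iterates[0] - x_star) ** 2 / (2 * eta) + eta / 2 * np.sum(np.square(grads))
# the theorem guarantees regret <= bound
```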
Meta Learning Problem

• Have regret:

$$\sum_{t=1}^{T} f_t(x_t) - f_t(x^*) \le \frac{1}{2\eta}\|x_1 - x^*\|_A^2 + \frac{\eta}{2}\sum_{t=1}^{T} \|g_t\|_{A^{-1}}^2$$

• What happens if we minimize over $A$ in hindsight?

$$\min_A \sum_{t=1}^{T} \left\langle g_t, A^{-1} g_t\right\rangle \quad \text{subject to } A \succeq 0,\ \operatorname{tr}(A) \le C$$
• Solution is of the form

$$A = c\,\operatorname{diag}\left(\sum_{t=1}^{T} g_t g_t^\top\right)^{1/2} \text{ (diagonal)} \qquad A = c\left(\sum_{t=1}^{T} g_t g_t^\top\right)^{1/2} \text{ (full)}$$

(where $c$ is chosen to satisfy the trace constraint)

• Let $g_{1:T,j}$ be the vector of $j$th gradient components. Optimal:

$$A_{j,j} \propto \|g_{1:T,j}\|_2$$
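The diagonal solution can be computed directly from the per-coordinate gradient norms. A sketch, assuming made-up gradients and a hypothetical trace budget $C$:

```python
import numpy as np

# Diagonal hindsight solution: A_jj proportional to ||g_{1:T,j}||_2,
# rescaled so that tr(A) = C (gradients and budget are made up).
G = np.array([[ 1.0, 0.0, 0.0],
              [ 0.5, 0.0, 1.0],
              [-0.5, 1.0, 0.0]])          # rows are g_1, ..., g_T

col_norms = np.sqrt((G ** 2).sum(axis=0))  # ||g_{1:T,j}||_2 per coordinate
C = 3.0                                    # hypothetical trace budget
A_diag = C * col_norms / col_norms.sum()   # c chosen so the trace constraint binds
```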
Low regret to the best A

• Let $g_{1:t,j}$ be the vector of $j$th gradient components. At time $t$, use

$$s_t = \big[\|g_{1:t,j}\|_2\big]_{j=1}^{d} \quad\text{and}\quad A_t = \operatorname{diag}(s_t)$$

$$x_{t+1} = \operatorname*{argmin}_{x\in X}\left\{\frac{1}{2}\|x - x_t\|_{A_t}^2 + \eta\langle g_t, x\rangle\right\}$$

• Example:

| $y_t$ | $g_{t,1}$ | $g_{t,2}$ | $g_{t,3}$ |
| --- | --- | --- | --- |
| 1 | 1 | 0 | 0 |
| -1 | .5 | 0 | 1 |
| 1 | -.5 | 1 | 0 |
| -1 | 0 | 0 | 0 |
| 1 | 1 | 1 | 0 |
| -1 | 1 | 0 | 0 |

$$s_1 = \sqrt{3.5} \qquad s_2 = \sqrt{2} \qquad s_3 = 1$$
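The $s$ values in the example can be recomputed directly. A sketch that checks them and takes one unconstrained diagonal-AdaGrad step (the step size and starting point are arbitrary):

```python
import numpy as np

# Recompute the per-coordinate norms from the example gradients:
# s_j = ||g_{1:t,j}||_2 should give sqrt(3.5), sqrt(2), 1.
G = np.array([[ 1.0, 0.0, 0.0],
              [ 0.5, 0.0, 1.0],
              [-0.5, 1.0, 0.0],
              [ 0.0, 0.0, 0.0],
              [ 1.0, 1.0, 0.0],
              [ 1.0, 0.0, 0.0]])
s = np.sqrt((G ** 2).sum(axis=0))

# one diagonal-AdaGrad step with X = R^d (closed form x - eta * A_t^{-1} g_t);
# eta and the starting point are arbitrary
eta, x = 0.5, np.zeros(3)
x_next = x - eta * G[-1] / np.maximum(s, 1e-12)
# the frequent coordinate 1 gets the smallest effective step
```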
Final Convergence Guarantee

Algorithm: at time $t$, set

$$s_t = \big[\|g_{1:t,j}\|_2\big]_{j=1}^{d} \quad\text{and}\quad A_t = \operatorname{diag}(s_t)$$

$$x_{t+1} = \operatorname*{argmin}_{x\in X}\left\{\frac{1}{2}\|x - x_t\|_{A_t}^2 + \eta\langle g_t, x\rangle\right\}$$

Define the radius

$$R_\infty := \max_t \|x_t - x^*\|_\infty \le \sup_{x\in X}\|x - x^*\|_\infty.$$

Theorem. The final regret bound of AdaGrad:

$$\sum_{t=1}^{T} f_t(x_t) - f_t(x^*) \le 2 R_\infty \sum_{j=1}^{d} \|g_{1:T,j}\|_2.$$
Understanding the convergence guarantees I

• Stochastic convex optimization:

$$f(x) := \mathbb{E}_P[f(x;\xi)]$$

Sample $\xi_t$ according to $P$, define $f_t(x) := f(x;\xi_t)$. Then

$$\mathbb{E}\left[f\left(\frac{1}{T}\sum_{t=1}^{T} x_t\right)\right] - f(x^*) \le \frac{2R_\infty}{T}\sum_{j=1}^{d} \mathbb{E}\big[\|g_{1:T,j}\|_2\big]$$
Understanding the convergence guarantees II

Support vector machine example: define

$$f(x;\xi) = [1 - \langle x, \xi\rangle]_+, \quad \text{where } \xi \in \{-1,0,1\}^d$$

• If $\xi_j \ne 0$ with probability $\propto j^{-\alpha}$ for $\alpha > 1$:

$$\mathbb{E}\left[f\left(\frac{1}{T}\sum_{t=1}^{T} x_t\right)\right] - f(x^*) = O\left(\frac{\|x^*\|_\infty}{\sqrt{T}} \cdot \max\left\{\log d,\ d^{1-\alpha/2}\right\}\right)$$

• Previously best-known method:

$$\mathbb{E}\left[f\left(\frac{1}{T}\sum_{t=1}^{T} x_t\right)\right] - f(x^*) = O\left(\frac{\|x^*\|_\infty}{\sqrt{T}} \cdot \sqrt{d}\right).$$
Understanding the convergence guarantees III

Back to regret minimization

• Convergence almost as good as that of the best geometry matrix:

$$\sum_{t=1}^{T} f_t(x_t) - f_t(x^*) \le 2\sqrt{d}\,\|x^*\|_\infty \sqrt{\inf_s\left\{\sum_{t=1}^{T} \|g_t\|^2_{\operatorname{diag}(s)^{-1}} : s \succeq 0,\ \langle \mathbf{1}, s\rangle \le d\right\}}$$

• This bound (and others) is minimax optimal
The AdaGrad Algorithms

The analysis applies to several algorithms. With

$$s_t = \big[\|g_{1:t,j}\|_2\big]_{j=1}^{d}, \qquad A_t = \operatorname{diag}(s_t):$$

• Forward-backward splitting (Lions and Mercier 1979, Nesterov 2007, Duchi and Singer 2009, others)

$$x_{t+1} = \operatorname*{argmin}_{x\in X}\left\{\frac{1}{2}\|x-x_t\|_{A_t}^2 + \langle g_t, x\rangle + \varphi(x)\right\}$$

• Regularized Dual Averaging (Nesterov 2007, Xiao 2010)

$$x_{t+1} = \operatorname*{argmin}_{x\in X}\left\{\frac{1}{t}\sum_{\tau=1}^{t} \langle g_\tau, x\rangle + \varphi(x) + \frac{1}{2t}\|x\|_{A_t}^2\right\}$$
An Example and Experimental Results

• $\ell_1$-regularization
• Text classification
• Image ranking
• Neural network learning
AdaGrad with composite updates

Recall the more general problem:

$$\sum_{t=1}^{T} f_t(x_t) + \varphi(x_t) - \inf_{x\in X}\left[\sum_{t=1}^{T} f_t(x) + \varphi(x)\right]$$

• Must solve updates of the form

$$\operatorname*{argmin}_{x\in X}\left\{\langle g, x\rangle + \varphi(x) + \frac{1}{2}\|x\|_A^2\right\}$$

• Luckily, often still simple
AdaGrad with $\ell_1$ regularization

Set $\bar g_t = \frac{1}{t}\sum_{\tau=1}^{t} g_\tau$. Need to solve

$$\min_x\ \langle \bar g_t, x\rangle + \lambda\|x\|_1 + \frac{1}{2t}\langle x, \operatorname{diag}(s_t)\,x\rangle$$

• The coordinate-wise update yields sparsity and adaptivity:

$$x_{t+1,j} = \operatorname{sign}(-\bar g_{t,j})\,\frac{t}{\|g_{1:t,j}\|_2}\big[|\bar g_{t,j}| - \lambda\big]_+$$

(Figure: a coordinate of $\bar g_t$ plotted against $t$, with the $\lambda$ truncation threshold.)
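The coordinate-wise update is a scaled soft-threshold. A sketch, assuming made-up averaged gradients, column norms, and $\lambda$:

```python
import numpy as np

# Coordinate-wise AdaGrad/RDA update with l1: soft-threshold the averaged
# gradient at lam, then scale by t / ||g_{1:t,j}||_2 (all values made up).
def l1_rda_step(gbar, col_norms, t, lam):
    """x_{t+1,j} = sign(-gbar_j) * (t / ||g_{1:t,j}||_2) * [|gbar_j| - lam]_+"""
    shrunk = np.maximum(np.abs(gbar) - lam, 0.0)
    return np.sign(-gbar) * (t / np.maximum(col_norms, 1e-12)) * shrunk

gbar = np.array([-0.4, 0.05, 0.0])     # averaged gradients gbar_t
col_norms = np.array([2.0, 1.0, 1.0])  # ||g_{1:t,j}||_2 per coordinate
x_next = l1_rda_step(gbar, col_norms, t=10, lam=0.1)
# coordinates with |gbar_j| <= lam are set exactly to zero (sparsity)
```

Note the adaptivity: a coordinate with a large historical gradient norm gets a smaller multiplier, while rarely-active coordinates keep larger effective steps.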
Text Classification

Reuters RCV1 document classification task: $d = 2\cdot 10^6$ features, approximately 4000 non-zero features per document.

$$f_t(x) := [1 - \langle x, \xi_t\rangle]_+ \quad\text{where } \xi_t \in \{-1,0,1\}^d \text{ is the data sample}$$

| | FOBOS | AdaGrad | PA¹ | AROW² |
| --- | --- | --- | --- | --- |
| Economics | .058 (.194) | .044 (.086) | .059 | .049 |
| Corporate | .111 (.226) | .053 (.105) | .107 | .061 |
| Government | .056 (.183) | .040 (.080) | .066 | .044 |
| Medicine | .056 (.146) | .035 (.063) | .053 | .039 |

Test set classification error rate (sparsity of final predictor in parentheses).

¹Crammer et al., 2006. ²Crammer et al., 2009.
Image Ranking

ImageNet (Deng et al., 2009), a large-scale hierarchical image database.

(Figure: a fragment of the hierarchy: Animal splits into Vertebrate and Invertebrate; Vertebrate into Fish and Mammal; Mammal into Cow.)

Train 15,000 rankers/classifiers to rank images for each noun (as in Grangier and Bengio, 2008).

Data $\xi = (z^1, z^2) \in \{0,1\}^d \times \{0,1\}^d$ is a pair of images:

$$f(x; z^1, z^2) = \big[1 - \langle x, z^1 - z^2\rangle\big]_+$$
Image Ranking Results
Precision at k: proportion of examples in top k that belong tocategory. Average precision is average placement of all positiveexamples.
| Algorithm | Avg. Prec. | P@1 | P@5 | P@10 | Nonzero |
|-----------|-----------|--------|--------|--------|---------|
| AdaGrad   | 0.6022    | 0.8502 | 0.8130 | 0.7811 | 0.7267  |
| AROW      | 0.5813    | 0.8597 | 0.8165 | 0.7816 | 1.0000  |
| PA        | 0.5581    | 0.8455 | 0.7957 | 0.7576 | 1.0000  |
| Fobos     | 0.5042    | 0.7496 | 0.6950 | 0.6545 | 0.8996  |
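Precision at k as defined above is straightforward to compute from a ranked list of 0/1 relevance labels; a minimal sketch (numpy, hypothetical helper name):

```python
import numpy as np

def precision_at_k(ranked_labels, k):
    # ranked_labels: 0/1 relevance of examples in ranked order;
    # returns the fraction of the top k that are positive
    labels = np.asarray(ranked_labels, dtype=float)
    return float(labels[:k].mean())
```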
Neural Network Learning
Wildly non-convex problem:
$$f(x; \xi) = \log\bigl(1 + \exp\bigl(\langle [\,p(\langle x_1, \xi_1\rangle)\ \cdots\ p(\langle x_k, \xi_k\rangle)\,],\ \xi_0\rangle\bigr)\bigr)$$

where

$$p(\alpha) = \frac{1}{1 + \exp(\alpha)}$$

[Figure: one-layer network with inputs ξ₁, …, ξ₅, weight vectors x₁, …, x₅, and hidden units p(⟨xᵢ, ξᵢ⟩)]
Idea: Use stochastic gradient methods to solve it anyway
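For reference, the diagonal AdaGrad update used in these experiments can be sketched in a few lines. This is a simplified single-machine numpy sketch over an arbitrary subgradient oracle, not the distributed Downpour implementation; the toy quadratic objective and `eps` guard are illustrative additions:

```python
import numpy as np

def adagrad(subgrad, x0, eta=0.1, steps=500, eps=1e-8):
    # x_{t+1} = x_t - eta * g_t / sqrt(sum_{s<=t} g_s^2), coordinate-wise
    x = np.array(x0, dtype=float)
    G = np.zeros_like(x)              # accumulated squared gradients
    for _ in range(steps):
        g = np.asarray(subgrad(x), dtype=float)
        G += g * g
        x -= eta * g / (np.sqrt(G) + eps)
    return x

# illustrative use on a smooth convex toy objective f(x) = ||x||^2
x_final = adagrad(lambda x: 2.0 * x, [1.0, -2.0])
```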
Neural Network Learning
[Figure: average frame accuracy (%) on the test set vs. time (hours), with curves for SGD on a GPU, Downpour SGD, Downpour SGD w/AdaGrad, and Sandblaster L-BFGS (Dean et al., 2012)]
Distributed, d = 1.7 · 10⁹ parameters. SGD and AdaGrad use 80 machines (1,000 cores); L-BFGS uses 800 machines (10,000 cores)
Duchi et al. (UC Berkeley) Adaptive Subgradient Methods ISMP 2012 26 / 32
![Page 54: Adaptive Subgradient Methods for Online Learning and Stochastic …stanford.edu/~jduchi/projects/DuchiHaSi12_ismp.pdf · 2014-09-05 · Adaptive Subgradient Methods for Online Learning](https://reader035.vdocuments.us/reader035/viewer/2022070910/5f9e8a029cae4004e5196150/html5/thumbnails/54.jpg)
Conclusions and Discussion
• Family of algorithms that adapt to geometry of data
• Extendable to full matrix case to handle feature correlation
• Can derive many efficient algorithms for high-dimensional problems, especially with sparsity
• Future: Efficient full-matrix adaptivity, other types of adaptation
Thanks!
OGD Sketch: "Almost" Contraction

• Have gₜ ∈ ∂fₜ(xₜ) (ignore ϕ and X for simplicity)

• Before: xₜ₊₁ = xₜ − η gₜ

$$\frac{1}{2}\|x_{t+1} - x^*\|_2^2 \le \frac{1}{2}\|x_t - x^*\|_2^2 + \eta\bigl(f_t(x^*) - f_t(x_t)\bigr) + \frac{\eta^2}{2}\|g_t\|_2^2$$

• Now: xₜ₊₁ = xₜ − η A⁻¹ gₜ

$$\begin{aligned}
\frac{1}{2}\|x_{t+1} - x^*\|_A^2 &= \frac{1}{2}\|x_t - x^*\|_A^2 + \eta\langle g_t,\, x^* - x_t\rangle + \frac{\eta^2}{2}\|g_t\|_{A^{-1}}^2 \\
&\le \frac{1}{2}\|x_t - x^*\|_A^2 + \eta\bigl(f_t(x^*) - f_t(x_t)\bigr) + \frac{\eta^2}{2}\|g_t\|_{A^{-1}}^2
\end{aligned}$$

(here ‖·‖_{A⁻¹} is the dual norm to ‖·‖_A)
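The first equality in the preconditioned case is pure algebra (expand the square in the ‖·‖_A norm). A quick numerical check on random data (numpy; the random positive-definite A and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, eta = 5, 0.3
B = rng.standard_normal((d, d))
A = B @ B.T + np.eye(d)               # random positive definite A
x_t, x_star, g_t = rng.standard_normal((3, d))

x_next = x_t - eta * np.linalg.solve(A, g_t)   # x_{t+1} = x_t - eta * A^{-1} g_t

def sq_norm(v, M):
    # squared Mahalanobis norm ||v||_M^2 = <v, M v>
    return float(v @ M @ v)

lhs = 0.5 * sq_norm(x_next - x_star, A)
rhs = (0.5 * sq_norm(x_t - x_star, A)
       + eta * float(g_t @ (x_star - x_t))
       + 0.5 * eta**2 * sq_norm(g_t, np.linalg.inv(A)))
```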
Hindsight minimization
• Focus on the diagonal case (the full-matrix case is similar)
$$\min_s\ \sum_{t=1}^T \bigl\langle g_t,\, \operatorname{diag}(s)^{-1} g_t\bigr\rangle \quad \text{subject to}\quad s \succeq 0,\ \langle \mathbf{1}, s\rangle \le C$$

• Let g₁:T,j denote the vector of jth components (g₁,ⱼ, …, g_T,ⱼ). The solution has the form

$$s_j \propto \|g_{1:T,j}\|_2$$
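That the per-coordinate gradient norms give the hindsight minimizer is easy to check numerically (numpy sketch; the random data and comparison loop are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
T, d, C = 50, 8, 1.0
g = rng.standard_normal((T, d))            # row t is the subgradient g_t

col_norms = np.linalg.norm(g, axis=0)      # ||g_{1:T,j}||_2 for each j
def objective(s):
    # sum_t <g_t, diag(s)^{-1} g_t> = sum_j ||g_{1:T,j}||_2^2 / s_j
    return float(np.sum(col_norms**2 / s))

s_star = C * col_norms / col_norms.sum()   # s_j ∝ ||g_{1:T,j}||_2, <1, s> = C
best = objective(s_star)                   # equals (sum_j ||g_{1:T,j}||_2)^2 / C

for _ in range(100):                       # no random feasible s does better
    s = rng.random(d)
    s = C * s / s.sum()
    assert objective(s) >= best - 1e-9
```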
Low regret to the best A
$$\sum_{t=1}^T f_t(x_t) + \varphi(x_t) - f_t(x^*) - \varphi(x^*) \le \underbrace{\frac{1}{2\eta}\sum_{t=1}^T \Bigl(\|x_t - x^*\|_{A_t}^2 - \|x_{t+1} - x^*\|_{A_t}^2\Bigr)}_{\text{Term I}} + \underbrace{\frac{\eta}{2}\sum_{t=1}^T \|g_t\|_{A_t^{-1}}^2}_{\text{Term II}}$$
Bounding Terms

Define $D_\infty = \max_t \|x_t - x^*\|_\infty \le \sup_{x\in\mathcal X}\|x - x^*\|_\infty$

• Term I:

$$\sum_{t=1}^T \Bigl(\|x_t - x^*\|_{A_t}^2 - \|x_{t+1} - x^*\|_{A_t}^2\Bigr) \le D_\infty^2 \sum_{j=1}^d \|g_{1:T,j}\|_2$$

• Term II:

$$\sum_{t=1}^T \|g_t\|_{A_t^{-1}}^2 \le 2\sum_{t=1}^T \|g_t\|_{A_T^{-1}}^2 = 2\sum_{j=1}^d \|g_{1:T,j}\|_2 = 2\inf_s\left\{\sum_{t=1}^T \bigl\langle g_t,\, \operatorname{diag}(s)^{-1} g_t\bigr\rangle \ \Bigm|\ s \succeq 0,\ \langle \mathbf{1}, s\rangle \le \sum_{j=1}^d \|g_{1:T,j}\|_2\right\}$$
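The middle equality for Term II, with A_T = diag(‖g₁:T,1‖₂, …, ‖g₁:T,d‖₂), can likewise be sanity-checked on random data (numpy; illustrative only):

```python
import numpy as np

rng = np.random.default_rng(2)
T, d = 40, 6
g = rng.standard_normal((T, d))          # row t is g_t

col_norms = np.linalg.norm(g, axis=0)    # ||g_{1:T,j}||_2
# sum_t ||g_t||^2_{A_T^{-1}} with A_T = diag(col_norms):
term_II_at_T = float(np.sum(g**2 / col_norms))
```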