Adaptive Subgradient Methods for Online Learning and Stochastic Optimization

John C. Duchi¹,²  Elad Hazan³  Yoram Singer²
¹University of California, Berkeley
²Google Research
³Technion
International Symposium on Mathematical Programming 2012
Duchi et al. (UC Berkeley) Adaptive Subgradient Methods ISMP 2012 1 / 32
Setting: Online Convex Optimization

Online learning task; repeat:

• Learner plays point $x_t$
• Receive function $f_t$
• Suffer loss $f_t(x_t) + \varphi(x_t)$

Running example ($x_t$ is a parameter vector for features):

• Receive label $y_t$, features $\phi_t$
• Suffer regularized logistic loss $\log\left[1 + \exp(-y_t \langle \phi_t, x_t\rangle)\right] + \lambda \|x_t\|_1$

Goal: attain small regret

$$\sum_{t=1}^{T} f_t(x_t) + \varphi(x_t) - \inf_{x \in X}\left[\sum_{t=1}^{T} f_t(x) + \varphi(x)\right]$$
Motivation

Text data:

> The most unsung birthday in American business and technological history this year may be the 50th anniversary of the Xerox 914 photocopier. (The Atlantic, July/August 2010)

High-dimensional image features

Other motivation: selecting advertisements in online advertising, document ranking, problems with parameterizations of many magnitudes...
Goal? Flipping around the usual sparsity game

$$\min_x \|Ax - b\|, \qquad A = [a_1\; a_2\; \cdots\; a_n]^\top \in \mathbb{R}^{n \times d}$$

Bounds in sparsity-focused settings usually depend on

$$\underbrace{\|a_i\|_\infty}_{\text{dense}} \cdot \underbrace{\|x\|_1}_{\text{sparse}}$$

What we would like:

$$\underbrace{\|a_i\|_1}_{\text{sparse}} \cdot \underbrace{\|x\|_\infty}_{\text{dense}}$$

(In general, impossible.)
Approaches: Gradient Descent and Dual Averaging

Let $g_t \in \partial f_t(x_t)$:

$$x_{t+1} = \operatorname*{argmin}_{x \in X} \left\{ \frac{1}{2}\|x - x_t\|^2 + \eta_t \langle g_t, x\rangle \right\}$$

or

$$x_{t+1} = \operatorname*{argmin}_{x \in X} \left\{ \frac{\eta_t}{t} \sum_{\tau=1}^{t} \langle g_\tau, x\rangle + \frac{1}{2t}\|x\|^2 \right\}$$

(Figure: the linear lower bound $f(x_t) + \langle g_t, x - x_t\rangle$ beneath $f(x)$.)
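Over $X = \mathbb{R}^d$ both updates have closed forms, which makes them easy to compare. A minimal sketch, assuming a toy 1-D objective $f_t(x) = |x - 3|$ and step sizes $\eta_t = 1/\sqrt{t}$ (both made up for illustration):

```python
import numpy as np

# Unconstrained (X = R) versions of the two updates on f_t(x) = |x - 3|:
#   gradient step:  x_{t+1} = x_t - eta_t * g_t
#   dual averaging: x_{t+1} = -eta_t * sum_{tau <= t} g_tau
#                   (minimizer of (eta_t/t) sum <g_tau, x> + ||x||^2 / (2t))

def subgrad(x, target=3.0):
    return np.sign(x - target)  # a subgradient of |x - target|

x_gd = x_da = 0.0
grad_sum = 0.0
for t in range(1, 201):
    eta = 1.0 / np.sqrt(t)
    x_gd -= eta * subgrad(x_gd)   # gradient-descent iterate
    grad_sum += subgrad(x_da)
    x_da = -eta * grad_sum        # dual-averaging iterate

# both iterates hover near the minimizer x = 3
```

Both methods converge to a neighborhood of the minimizer whose size shrinks with the step size; they differ in how they aggregate past gradients, not in the toy limit they reach.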
What is the problem?
• Gradient steps treat all features as equal
• They are not!
Adapting to Geometry of Space
Why adapt to geometry?

(Figure: a "hard" elongated problem geometry vs. a "nice" round one.)

| $y_t$ | $\phi_{t,1}$ | $\phi_{t,2}$ | $\phi_{t,3}$ |
| --- | --- | --- | --- |
| 1 | 1 | 0 | 0 |
| -1 | .5 | 0 | 1 |
| 1 | -.5 | 1 | 0 |
| -1 | 0 | 0 | 0 |
| 1 | .5 | 0 | 0 |
| -1 | 1 | 0 | 0 |
| 1 | -1 | 1 | 0 |
| -1 | -.5 | 0 | 1 |

Feature 1: frequent, irrelevant
Feature 2: infrequent, predictive
Feature 3: infrequent, predictive
Adapting to Geometry of the Space

• Receive $g_t \in \partial f_t(x_t)$
• Earlier:

$$x_{t+1} = \operatorname*{argmin}_{x\in X}\left\{\frac{1}{2}\|x-x_t\|^2 + \eta\langle g_t, x\rangle\right\}$$

• Now: let $\|x\|_A^2 = \langle x, Ax\rangle$ for $A \succeq 0$. Use

$$x_{t+1} = \operatorname*{argmin}_{x\in X}\left\{\frac{1}{2}\|x-x_t\|_A^2 + \eta\langle g_t, x\rangle\right\}$$
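For $X = \mathbb{R}^d$ the $A$-weighted update also has a closed form, $x_{t+1} = x_t - \eta A^{-1} g_t$. A small sketch, with arbitrary made-up values for $A$, $x_t$, $g_t$, $\eta$, checking the first-order optimality condition at that point:

```python
import numpy as np

# Minimizing (1/2)||x - x_t||_A^2 + eta <g_t, x> over R^d sets the gradient
# A(x - x_t) + eta * g_t to zero, giving x_{t+1} = x_t - eta * A^{-1} g_t.
A = np.diag([4.0, 1.0])          # an arbitrary positive-definite metric
x_t = np.array([1.0, -2.0])
g_t = np.array([2.0, 0.5])
eta = 0.1

x_next = x_t - eta * np.linalg.solve(A, g_t)

# first-order optimality: the objective's gradient vanishes at x_next
residual = A @ (x_next - x_t) + eta * g_t
```

Note how the large-curvature coordinate (weight 4 in $A$) receives a step four times smaller than the other, which is exactly the per-coordinate scaling the adaptive methods below exploit.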
Regret Bounds

What does adaptation buy?

• Standard regret bound:

$$\sum_{t=1}^{T} f_t(x_t) - f_t(x^*) \le \frac{1}{2\eta}\|x_1 - x^*\|_2^2 + \frac{\eta}{2}\sum_{t=1}^{T} \|g_t\|_2^2$$

• Regret bound with matrix:

$$\sum_{t=1}^{T} f_t(x_t) - f_t(x^*) \le \frac{1}{2\eta}\|x_1 - x^*\|_A^2 + \frac{\eta}{2}\sum_{t=1}^{T} \|g_t\|_{A^{-1}}^2$$
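The standard bound can be sanity-checked numerically. A sketch, assuming a toy 1-D sequence $f_t(x) = |x - c_t|$ with made-up targets and step size, that runs fixed-step subgradient descent and verifies the inequality:

```python
import numpy as np

# Toy check of the standard bound for fixed-eta subgradient descent on
# f_t(x) = |x - c_t| over X = R (targets c_t and eta are made up).
rng = np.random.default_rng(0)
c = rng.uniform(-1.0, 1.0, size=50)

eta, x = 0.2, 2.0
iterates, grads = [], []
for c_t in c:
    iterates.append(x)
    g = np.sign(x - c_t)          # subgradient of |x - c_t|
    grads.append(g)
    x -= eta * g

x_star = np.median(c)             # minimizer of sum_t |x - c_t|
regret = sum(abs(xt - ct) for xt, ct in zip(iterates, c)) - np.abs(x_star - c).sum()
bound = (iterates[0] - x_star) ** 2 / (2 * eta) + eta / 2 * np.sum(np.square(grads))
# the theorem guarantees regret <= bound
```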
Meta Learning Problem

• Have regret:

$$\sum_{t=1}^{T} f_t(x_t) - f_t(x^*) \le \frac{1}{2\eta}\|x_1 - x^*\|_A^2 + \frac{\eta}{2}\sum_{t=1}^{T} \|g_t\|_{A^{-1}}^2$$

• What happens if we minimize over $A$ in hindsight?

$$\min_A \sum_{t=1}^{T} \left\langle g_t, A^{-1} g_t\right\rangle \quad \text{subject to } A \succeq 0,\ \operatorname{tr}(A) \le C$$
• Solution is of the form

$$A = c\,\operatorname{diag}\left(\sum_{t=1}^{T} g_t g_t^\top\right)^{1/2} \text{ (diagonal)} \qquad A = c\left(\sum_{t=1}^{T} g_t g_t^\top\right)^{1/2} \text{ (full)}$$

(where $c$ is chosen to satisfy the trace constraint)

• Let $g_{1:T,j}$ be the vector of $j$th gradient components. Optimal:

$$A_{j,j} \propto \|g_{1:T,j}\|_2$$
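The diagonal solution can be computed directly from the per-coordinate gradient norms. A sketch, assuming made-up gradients and a hypothetical trace budget $C$:

```python
import numpy as np

# Diagonal hindsight solution: A_jj proportional to ||g_{1:T,j}||_2,
# rescaled so that tr(A) = C (gradients and budget are made up).
G = np.array([[ 1.0, 0.0, 0.0],
              [ 0.5, 0.0, 1.0],
              [-0.5, 1.0, 0.0]])          # rows are g_1, ..., g_T

col_norms = np.sqrt((G ** 2).sum(axis=0))  # ||g_{1:T,j}||_2 per coordinate
C = 3.0                                    # hypothetical trace budget
A_diag = C * col_norms / col_norms.sum()   # c chosen so the trace constraint binds
```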
Low regret to the best A

• Let $g_{1:t,j}$ be the vector of $j$th gradient components. At time $t$, use

$$s_t = \big[\|g_{1:t,j}\|_2\big]_{j=1}^{d} \quad\text{and}\quad A_t = \operatorname{diag}(s_t)$$

$$x_{t+1} = \operatorname*{argmin}_{x\in X}\left\{\frac{1}{2}\|x - x_t\|_{A_t}^2 + \eta\langle g_t, x\rangle\right\}$$

• Example:

| $y_t$ | $g_{t,1}$ | $g_{t,2}$ | $g_{t,3}$ |
| --- | --- | --- | --- |
| 1 | 1 | 0 | 0 |
| -1 | .5 | 0 | 1 |
| 1 | -.5 | 1 | 0 |
| -1 | 0 | 0 | 0 |
| 1 | 1 | 1 | 0 |
| -1 | 1 | 0 | 0 |

$$s_1 = \sqrt{3.5} \qquad s_2 = \sqrt{2} \qquad s_3 = 1$$
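The $s$ values in the example can be recomputed directly. A sketch that checks them and takes one unconstrained diagonal-AdaGrad step (the step size and starting point are arbitrary):

```python
import numpy as np

# Recompute the per-coordinate norms from the example gradients:
# s_j = ||g_{1:t,j}||_2 should give sqrt(3.5), sqrt(2), 1.
G = np.array([[ 1.0, 0.0, 0.0],
              [ 0.5, 0.0, 1.0],
              [-0.5, 1.0, 0.0],
              [ 0.0, 0.0, 0.0],
              [ 1.0, 1.0, 0.0],
              [ 1.0, 0.0, 0.0]])
s = np.sqrt((G ** 2).sum(axis=0))

# one diagonal-AdaGrad step with X = R^d (closed form x - eta * A_t^{-1} g_t);
# eta and the starting point are arbitrary
eta, x = 0.5, np.zeros(3)
x_next = x - eta * G[-1] / np.maximum(s, 1e-12)
# the frequent coordinate 1 gets the smallest effective step
```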
Final Convergence Guarantee

Algorithm: at time $t$, set

$$s_t = \big[\|g_{1:t,j}\|_2\big]_{j=1}^{d} \quad\text{and}\quad A_t = \operatorname{diag}(s_t)$$

$$x_{t+1} = \operatorname*{argmin}_{x\in X}\left\{\frac{1}{2}\|x - x_t\|_{A_t}^2 + \eta\langle g_t, x\rangle\right\}$$

Define the radius

$$R_\infty := \max_t \|x_t - x^*\|_\infty \le \sup_{x\in X}\|x - x^*\|_\infty.$$

Theorem. The final regret bound of AdaGrad:

$$\sum_{t=1}^{T} f_t(x_t) - f_t(x^*) \le 2 R_\infty \sum_{j=1}^{d} \|g_{1:T,j}\|_2.$$
Understanding the convergence guarantees I

• Stochastic convex optimization:

$$f(x) := \mathbb{E}_P[f(x;\xi)]$$

Sample $\xi_t$ according to $P$, define $f_t(x) := f(x;\xi_t)$. Then

$$\mathbb{E}\left[f\left(\frac{1}{T}\sum_{t=1}^{T} x_t\right)\right] - f(x^*) \le \frac{2R_\infty}{T}\sum_{j=1}^{d} \mathbb{E}\big[\|g_{1:T,j}\|_2\big]$$
Understanding the convergence guarantees II

Support vector machine example: define

$$f(x;\xi) = [1 - \langle x, \xi\rangle]_+, \quad \text{where } \xi \in \{-1,0,1\}^d$$

• If $\xi_j \ne 0$ with probability $\propto j^{-\alpha}$ for $\alpha > 1$:

$$\mathbb{E}\left[f\left(\frac{1}{T}\sum_{t=1}^{T} x_t\right)\right] - f(x^*) = O\left(\frac{\|x^*\|_\infty}{\sqrt{T}} \cdot \max\left\{\log d,\ d^{1-\alpha/2}\right\}\right)$$

• Previously best-known method:

$$\mathbb{E}\left[f\left(\frac{1}{T}\sum_{t=1}^{T} x_t\right)\right] - f(x^*) = O\left(\frac{\|x^*\|_\infty}{\sqrt{T}} \cdot \sqrt{d}\right).$$
Understanding the convergence guarantees III

Back to regret minimization

• Convergence almost as good as that of the best geometry matrix:

$$\sum_{t=1}^{T} f_t(x_t) - f_t(x^*) \le 2\sqrt{d}\,\|x^*\|_\infty \sqrt{\inf_s\left\{\sum_{t=1}^{T} \|g_t\|^2_{\operatorname{diag}(s)^{-1}} : s \succeq 0,\ \langle \mathbf{1}, s\rangle \le d\right\}}$$

• This bound (and others) is minimax optimal
The AdaGrad Algorithms

The analysis applies to several algorithms. With

$$s_t = \big[\|g_{1:t,j}\|_2\big]_{j=1}^{d}, \qquad A_t = \operatorname{diag}(s_t):$$

• Forward-backward splitting (Lions and Mercier 1979, Nesterov 2007, Duchi and Singer 2009, others)

$$x_{t+1} = \operatorname*{argmin}_{x\in X}\left\{\frac{1}{2}\|x-x_t\|_{A_t}^2 + \langle g_t, x\rangle + \varphi(x)\right\}$$

• Regularized Dual Averaging (Nesterov 2007, Xiao 2010)

$$x_{t+1} = \operatorname*{argmin}_{x\in X}\left\{\frac{1}{t}\sum_{\tau=1}^{t} \langle g_\tau, x\rangle + \varphi(x) + \frac{1}{2t}\|x\|_{A_t}^2\right\}$$
An Example and Experimental Results

• $\ell_1$-regularization
• Text classification
• Image ranking
• Neural network learning
AdaGrad with composite updates

Recall the more general problem:

$$\sum_{t=1}^{T} f_t(x_t) + \varphi(x_t) - \inf_{x\in X}\left[\sum_{t=1}^{T} f_t(x) + \varphi(x)\right]$$

• Must solve updates of the form

$$\operatorname*{argmin}_{x\in X}\left\{\langle g, x\rangle + \varphi(x) + \frac{1}{2}\|x\|_A^2\right\}$$

• Luckily, often still simple
AdaGrad with $\ell_1$ regularization

Set $\bar g_t = \frac{1}{t}\sum_{\tau=1}^{t} g_\tau$. Need to solve

$$\min_x\ \langle \bar g_t, x\rangle + \lambda\|x\|_1 + \frac{1}{2t}\langle x, \operatorname{diag}(s_t)\,x\rangle$$

• The coordinate-wise update yields sparsity and adaptivity:

$$x_{t+1,j} = \operatorname{sign}(-\bar g_{t,j})\,\frac{t}{\|g_{1:t,j}\|_2}\big[|\bar g_{t,j}| - \lambda\big]_+$$

(Figure: a coordinate of $\bar g_t$ plotted against $t$, with the $\lambda$ truncation threshold.)
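The coordinate-wise update is a scaled soft-threshold. A sketch, assuming made-up averaged gradients, column norms, and $\lambda$:

```python
import numpy as np

# Coordinate-wise AdaGrad/RDA update with l1: soft-threshold the averaged
# gradient at lam, then scale by t / ||g_{1:t,j}||_2 (all values made up).
def l1_rda_step(gbar, col_norms, t, lam):
    """x_{t+1,j} = sign(-gbar_j) * (t / ||g_{1:t,j}||_2) * [|gbar_j| - lam]_+"""
    shrunk = np.maximum(np.abs(gbar) - lam, 0.0)
    return np.sign(-gbar) * (t / np.maximum(col_norms, 1e-12)) * shrunk

gbar = np.array([-0.4, 0.05, 0.0])     # averaged gradients gbar_t
col_norms = np.array([2.0, 1.0, 1.0])  # ||g_{1:t,j}||_2 per coordinate
x_next = l1_rda_step(gbar, col_norms, t=10, lam=0.1)
# coordinates with |gbar_j| <= lam are set exactly to zero (sparsity)
```

Note the adaptivity: a coordinate with a large historical gradient norm gets a smaller multiplier, while rarely-active coordinates keep larger effective steps.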
Text Classification

Reuters RCV1 document classification task: $d = 2\cdot 10^6$ features, approximately 4000 non-zero features per document.

$$f_t(x) := [1 - \langle x, \xi_t\rangle]_+ \quad\text{where } \xi_t \in \{-1,0,1\}^d \text{ is the data sample}$$

| | FOBOS | AdaGrad | PA¹ | AROW² |
| --- | --- | --- | --- | --- |
| Economics | .058 (.194) | .044 (.086) | .059 | .049 |
| Corporate | .111 (.226) | .053 (.105) | .107 | .061 |
| Government | .056 (.183) | .040 (.080) | .066 | .044 |
| Medicine | .056 (.146) | .035 (.063) | .053 | .039 |

Test set classification error rate (sparsity of final predictor in parentheses).

¹Crammer et al., 2006. ²Crammer et al., 2009.
Image Ranking

ImageNet (Deng et al., 2009), a large-scale hierarchical image database.

(Figure: a fragment of the hierarchy: Animal splits into Vertebrate and Invertebrate; Vertebrate into Fish and Mammal; Mammal into Cow.)

Train 15,000 rankers/classifiers to rank images for each noun (as in Grangier and Bengio, 2008).

Data $\xi = (z^1, z^2) \in \{0,1\}^d \times \{0,1\}^d$ is a pair of images:

$$f(x; z^1, z^2) = \big[1 - \langle x, z^1 - z^2\rangle\big]_+$$
Image Ranking Results
Precision at k: proportion of examples in top k that belong tocategory. Average precision is average placement of all positiveexamples.
| Algorithm | Avg. Prec. | P@1 | P@5 | P@10 | Nonzero |
|-----------|-----------|--------|--------|--------|---------|
| AdaGrad   | 0.6022    | 0.8502 | 0.8130 | 0.7811 | 0.7267  |
| AROW      | 0.5813    | 0.8597 | 0.8165 | 0.7816 | 1.0000  |
| PA        | 0.5581    | 0.8455 | 0.7957 | 0.7576 | 1.0000  |
| Fobos     | 0.5042    | 0.7496 | 0.6950 | 0.6545 | 0.8996  |
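Precision at k as defined above is straightforward to compute from a ranked list of 0/1 relevance labels; a minimal sketch (numpy, hypothetical helper name):

```python
import numpy as np

def precision_at_k(ranked_labels, k):
    # ranked_labels: 0/1 relevance of examples in ranked order;
    # returns the fraction of the top k that are positive
    labels = np.asarray(ranked_labels, dtype=float)
    return float(labels[:k].mean())
```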
Neural Network Learning
Wildly non-convex problem:
$$f(x; \xi) = \log\bigl(1 + \exp\bigl(\langle [\,p(\langle x_1, \xi_1\rangle)\ \cdots\ p(\langle x_k, \xi_k\rangle)\,],\ \xi_0\rangle\bigr)\bigr)$$

where

$$p(\alpha) = \frac{1}{1 + \exp(\alpha)}$$

[Figure: one-layer network with inputs ξ₁, …, ξ₅, weight vectors x₁, …, x₅, and hidden units p(⟨xᵢ, ξᵢ⟩)]
Idea: Use stochastic gradient methods to solve it anyway
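For reference, the diagonal AdaGrad update used in these experiments can be sketched in a few lines. This is a simplified single-machine numpy sketch over an arbitrary subgradient oracle, not the distributed Downpour implementation; the toy quadratic objective and `eps` guard are illustrative additions:

```python
import numpy as np

def adagrad(subgrad, x0, eta=0.1, steps=500, eps=1e-8):
    # x_{t+1} = x_t - eta * g_t / sqrt(sum_{s<=t} g_s^2), coordinate-wise
    x = np.array(x0, dtype=float)
    G = np.zeros_like(x)              # accumulated squared gradients
    for _ in range(steps):
        g = np.asarray(subgrad(x), dtype=float)
        G += g * g
        x -= eta * g / (np.sqrt(G) + eps)
    return x

# illustrative use on a smooth convex toy objective f(x) = ||x||^2
x_final = adagrad(lambda x: 2.0 * x, [1.0, -2.0])
```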
Neural Network Learning
[Figure: average frame accuracy (%) on the test set vs. time (hours), with curves for SGD on a GPU, Downpour SGD, Downpour SGD w/AdaGrad, and Sandblaster L-BFGS (Dean et al., 2012)]
Distributed, d = 1.7 · 10⁹ parameters. SGD and AdaGrad use 80 machines (1,000 cores); L-BFGS uses 800 machines (10,000 cores)
Duchi et al. (UC Berkeley) Adaptive Subgradient Methods ISMP 2012 26 / 32
![Page 54: Adaptive Subgradient Methods for Online Learning and Stochastic …stanford.edu/~jduchi/projects/DuchiHaSi12_ismp.pdf · 2014-09-05 · Adaptive Subgradient Methods for Online Learning](https://reader035.vdocuments.us/reader035/viewer/2022070910/5f9e8a029cae4004e5196150/html5/thumbnails/54.jpg)
Conclusions and Discussion
• Family of algorithms that adapt to geometry of data
• Extendable to full matrix case to handle feature correlation
• Can derive many efficient algorithms for high-dimensional problems, especially with sparsity
• Future: Efficient full-matrix adaptivity, other types of adaptation
Thanks!
OGD Sketch: "Almost" Contraction

• Have gₜ ∈ ∂fₜ(xₜ) (ignore ϕ and X for simplicity)

• Before: xₜ₊₁ = xₜ − η gₜ

$$\frac{1}{2}\|x_{t+1} - x^*\|_2^2 \le \frac{1}{2}\|x_t - x^*\|_2^2 + \eta\bigl(f_t(x^*) - f_t(x_t)\bigr) + \frac{\eta^2}{2}\|g_t\|_2^2$$

• Now: xₜ₊₁ = xₜ − η A⁻¹ gₜ

$$\begin{aligned}
\frac{1}{2}\|x_{t+1} - x^*\|_A^2 &= \frac{1}{2}\|x_t - x^*\|_A^2 + \eta\langle g_t,\, x^* - x_t\rangle + \frac{\eta^2}{2}\|g_t\|_{A^{-1}}^2 \\
&\le \frac{1}{2}\|x_t - x^*\|_A^2 + \eta\bigl(f_t(x^*) - f_t(x_t)\bigr) + \frac{\eta^2}{2}\|g_t\|_{A^{-1}}^2
\end{aligned}$$

(here ‖·‖_{A⁻¹} is the dual norm to ‖·‖_A)
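The first equality in the preconditioned case is pure algebra (expand the square in the ‖·‖_A norm). A quick numerical check on random data (numpy; the random positive-definite A and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, eta = 5, 0.3
B = rng.standard_normal((d, d))
A = B @ B.T + np.eye(d)               # random positive definite A
x_t, x_star, g_t = rng.standard_normal((3, d))

x_next = x_t - eta * np.linalg.solve(A, g_t)   # x_{t+1} = x_t - eta * A^{-1} g_t

def sq_norm(v, M):
    # squared Mahalanobis norm ||v||_M^2 = <v, M v>
    return float(v @ M @ v)

lhs = 0.5 * sq_norm(x_next - x_star, A)
rhs = (0.5 * sq_norm(x_t - x_star, A)
       + eta * float(g_t @ (x_star - x_t))
       + 0.5 * eta**2 * sq_norm(g_t, np.linalg.inv(A)))
```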
Hindsight minimization
• Focus on the diagonal case (the full-matrix case is similar)
$$\min_s\ \sum_{t=1}^T \bigl\langle g_t,\, \operatorname{diag}(s)^{-1} g_t\bigr\rangle \quad \text{subject to}\quad s \succeq 0,\ \langle \mathbf{1}, s\rangle \le C$$

• Let g₁:T,j denote the vector of jth components (g₁,ⱼ, …, g_T,ⱼ). The solution has the form

$$s_j \propto \|g_{1:T,j}\|_2$$
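That the per-coordinate gradient norms give the hindsight minimizer is easy to check numerically (numpy sketch; the random data and comparison loop are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
T, d, C = 50, 8, 1.0
g = rng.standard_normal((T, d))            # row t is the subgradient g_t

col_norms = np.linalg.norm(g, axis=0)      # ||g_{1:T,j}||_2 for each j
def objective(s):
    # sum_t <g_t, diag(s)^{-1} g_t> = sum_j ||g_{1:T,j}||_2^2 / s_j
    return float(np.sum(col_norms**2 / s))

s_star = C * col_norms / col_norms.sum()   # s_j ∝ ||g_{1:T,j}||_2, <1, s> = C
best = objective(s_star)                   # equals (sum_j ||g_{1:T,j}||_2)^2 / C

for _ in range(100):                       # no random feasible s does better
    s = rng.random(d)
    s = C * s / s.sum()
    assert objective(s) >= best - 1e-9
```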
Low regret to the best A
$$\sum_{t=1}^T f_t(x_t) + \varphi(x_t) - f_t(x^*) - \varphi(x^*) \le \underbrace{\frac{1}{2\eta}\sum_{t=1}^T \Bigl(\|x_t - x^*\|_{A_t}^2 - \|x_{t+1} - x^*\|_{A_t}^2\Bigr)}_{\text{Term I}} + \underbrace{\frac{\eta}{2}\sum_{t=1}^T \|g_t\|_{A_t^{-1}}^2}_{\text{Term II}}$$
Bounding Terms

Define $D_\infty = \max_t \|x_t - x^*\|_\infty \le \sup_{x\in\mathcal X}\|x - x^*\|_\infty$

• Term I:

$$\sum_{t=1}^T \Bigl(\|x_t - x^*\|_{A_t}^2 - \|x_{t+1} - x^*\|_{A_t}^2\Bigr) \le D_\infty^2 \sum_{j=1}^d \|g_{1:T,j}\|_2$$

• Term II:

$$\sum_{t=1}^T \|g_t\|_{A_t^{-1}}^2 \le 2\sum_{t=1}^T \|g_t\|_{A_T^{-1}}^2 = 2\sum_{j=1}^d \|g_{1:T,j}\|_2 = 2\inf_s\left\{\sum_{t=1}^T \bigl\langle g_t,\, \operatorname{diag}(s)^{-1} g_t\bigr\rangle \ \Bigm|\ s \succeq 0,\ \langle \mathbf{1}, s\rangle \le \sum_{j=1}^d \|g_{1:T,j}\|_2\right\}$$
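The middle equality for Term II, with A_T = diag(‖g₁:T,1‖₂, …, ‖g₁:T,d‖₂), can likewise be sanity-checked on random data (numpy; illustrative only):

```python
import numpy as np

rng = np.random.default_rng(2)
T, d = 40, 6
g = rng.standard_normal((T, d))          # row t is g_t

col_norms = np.linalg.norm(g, axis=0)    # ||g_{1:T,j}||_2
# sum_t ||g_t||^2_{A_T^{-1}} with A_T = diag(col_norms):
term_II_at_T = float(np.sum(g**2 / col_norms))
```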