
Page 1

Simple Techniques for Improving SGD

CS6787 Lecture 2 — Fall 2017

Page 2

Step Sizes and Convergence

Page 3

Where we left off

• Stochastic gradient descent

• Much faster per iteration than gradient descent
  • Because we don't have to process the entire training set

• But converges to a noise ball

$$x_{t+1} = x_t - \alpha \nabla f(x_t; y_{i_t})$$

$$\lim_{T \to \infty} \mathbb{E}\left[\|x_T - x^*\|^2\right] \le \frac{\alpha M}{2\mu - \alpha\mu^2}$$

Page 4

Controlling the accuracy

• Want the noise ball to be as small as possible for accurate solutions

• Noise ball proportional to the step size/learning rate

• So should we make the step size as small as possible?

$$\lim_{T \to \infty} \mathbb{E}\left[\|x_T - x^*\|^2\right] \le \frac{\alpha M}{2\mu - \alpha\mu^2} = O(\alpha)$$

Page 5

Effect of step size on convergence

• Let's go back to the convergence rate proof for SGD
• From the previous lecture, we have

$$\mathbb{E}\left[\|x_{t+1} - x^*\|^2 \,\middle|\, x_t\right] \le (1 - \alpha\mu)^2 \|x_t - x^*\|^2 + \alpha^2 M.$$

• If we're far from the noise ball, i.e. $\|x_t - x^*\|^2 \ge \frac{2\alpha M}{\mu}$, then

$$\mathbb{E}\left[\|x_{t+1} - x^*\|^2 \,\middle|\, x_t\right] \le (1 - \alpha\mu)^2 \|x_t - x^*\|^2 + \frac{\alpha\mu}{2}\|x_t - x^*\|^2.$$

Page 6

Effect of step size on convergence (continued)

$$\begin{aligned}
\mathbb{E}\left[\|x_{t+1} - x^*\|^2 \,\middle|\, x_t\right] &\le (1 - \alpha\mu)^2 \|x_t - x^*\|^2 + \frac{\alpha\mu}{2}\|x_t - x^*\|^2 \\
&\le \left(1 - \frac{\alpha\mu}{2}\right)\|x_t - x^*\|^2 \quad (\text{if } \alpha\mu < 1) \\
&\le \exp\left(-\frac{\alpha\mu}{2}\right)\|x_t - x^*\|^2.
\end{aligned}$$

• So to contract by a factor of C, we need to run T steps, where

$$\frac{1}{C} = \exp\left(-\frac{\alpha\mu T}{2}\right) \;\Leftrightarrow\; T = \frac{2}{\alpha\mu}\log C$$
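For a concrete sense of scale (numbers ours, not from the slides): with $\mu = 1$ and $\alpha = 0.1$, contracting the initial error by $C = 100$ takes $T = \frac{2}{0.1 \cdot 1}\ln 100 \approx 20 \times 4.6 \approx 92$ steps, and halving the step size to $\alpha = 0.05$ doubles this to roughly 184 steps.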

Page 7

The Full Effect of Step Size

• Noise ball proportional to the step size

• Convergence time inversely proportional to the step size

• So there’s a trade-off !

$$\lim_{T \to \infty} \mathbb{E}\left[\|x_T - x^*\|^2\right] \le \frac{\alpha M}{2\mu - \alpha\mu^2} = O(\alpha) \qquad\qquad T = \frac{2}{\alpha\mu}\log C$$

Page 8

Demo
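The demo itself isn't preserved in this transcript. As a stand-in, here is a minimal sketch (ours, not the course demo) of the trade-off on a toy 1-D least-squares problem, whose minimizer x* is the sample mean:

```python
# Minimal sketch (ours): SGD on h(x) = (1/2N) * sum_i (x - y_i)^2,
# showing how the step size trades convergence speed for noise-ball size.
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=3.0, scale=1.0, size=1000)

def noise_ball(alpha, steps=20000, x0=0.0):
    x, dists = x0, []
    for _ in range(steps):
        i = rng.integers(len(y))
        x -= alpha * (x - y[i])          # gradient of f(x; y_i) = (x - y_i)^2 / 2
        dists.append((x - y.mean()) ** 2)
    return np.mean(dists[steps // 2:])   # long-run squared distance to x*

for alpha in (0.1, 0.01):
    print(f"alpha={alpha}: E||x - x*||^2 ~ {noise_ball(alpha):.2e}")
# Larger alpha reaches the neighborhood of x* faster but settles in a
# noise ball roughly 10x bigger, matching the O(alpha) bound above.
```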

Page 9

Can we get the best of both worlds?

• When do we want the step size to be large?
  • At the beginning of execution? Near the end? Both?
  • Answer: at the beginning of execution!

• When do we want the step size to be small?
  • At the beginning of execution? Near the end? Both?
  • Answer: near the end!

• What about using a decreasing step size scheme?

Page 10

SGD with Varying Step Size

• Allow the step size to vary over time

• Turns out this is the standard in basically all machine learning!

• Two ways to do it:
  • Chosen a priori step sizes — step size doesn't depend on measurements
  • Adaptive step sizes — choose step size based on measurements & heuristics

$$x_{t+1} = x_t - \alpha_t \nabla f(x_t; y_{i_t})$$

Page 11

Optimal Step Sizes for Convex Objectives

• Can we use math to choose a step size?
• Start with our previous bound:

$$\begin{aligned}
\mathbb{E}\left[\|x_{t+1} - x^*\|^2\right] &\le (1 - \alpha_t\mu)^2\, \mathbb{E}\left[\|x_t - x^*\|^2\right] + \alpha_t^2 M \\
&\le (1 - \alpha_t\mu)\, \mathbb{E}\left[\|x_t - x^*\|^2\right] + \alpha_t^2 M \quad (\text{for } \alpha_t\mu < 1)
\end{aligned}$$

• Right side is minimized when

$$0 = -\mu\, \mathbb{E}\left[\|x_t - x^*\|^2\right] + 2\alpha_t M \;\Leftrightarrow\; \alpha_t = \frac{\mu}{2M}\, \mathbb{E}\left[\|x_t - x^*\|^2\right]$$

Page 12

Optimal Step Sizes (continued)

Let $\rho_t = \mathbb{E}\left[\|x_t - x^*\|^2\right]$. Then

$$\begin{aligned}
\rho_{t+1} &\le (1 - \alpha_t\mu)\rho_t + \alpha_t^2 M \\
&= \left(1 - \left(\frac{\mu}{2M}\rho_t\right)\mu\right)\rho_t + \left(\frac{\mu}{2M}\rho_t\right)^2 M \\
&= \rho_t - \frac{\mu^2}{2M}\rho_t^2 + \frac{\mu^2}{4M}\rho_t^2 \\
&= \rho_t - \frac{\mu^2}{4M}\rho_t^2
\end{aligned}$$

Page 13

Optimal Step Sizes (continued)

Let $\rho_t = \mathbb{E}\left[\|x_t - x^*\|^2\right]$. Taking reciprocals,

$$\begin{aligned}
\frac{1}{\rho_{t+1}} &\ge \left(\rho_t - \frac{\mu^2}{4M}\rho_t^2\right)^{-1} \\
&= \frac{1}{\rho_t}\left(1 - \frac{\mu^2}{4M}\rho_t\right)^{-1} \\
&\ge \frac{1}{\rho_t}\left(1 + \frac{\mu^2}{4M}\rho_t\right) \quad \left(\text{since } (1 - z)^{-1} \ge 1 + z\right) \\
&= \frac{1}{\rho_t} + \frac{\mu^2}{4M}.
\end{aligned}$$

Page 14

Optimal Step Sizes (continued)

Let $\rho_t = \mathbb{E}\left[\|x_t - x^*\|^2\right]$. Summing the previous inequality over T steps,

$$\frac{1}{\rho_T} \ge \frac{1}{\rho_0} + \frac{\mu^2 T}{4M} \;\Rightarrow\; \rho_T \le \frac{4M\rho_0}{4M + \mu^2\rho_0 T}$$

• Sometimes called a 1/T rate.
• Slower than the linear rate of gradient descent.

Page 15

Optimal Step Sizes (continued)

• Substitute back in to find how to set the step size:

$$\alpha_t = \frac{\mu}{2M} \cdot \frac{4M\rho_0}{4M + \mu^2\rho_0 t} = \frac{2\mu\rho_0}{4M + \mu^2\rho_0 t} = \frac{1}{\Theta(t)}$$

• This is a pretty common simple scheme
• General form is

$$\alpha_t = \frac{\alpha_0}{1 + \gamma t}$$
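As an illustration (parameter values ours, not from the lecture), here is this schedule plugged into the same kind of toy 1-D least-squares problem as the earlier sketch:

```python
# Illustration (ours): SGD with the a-priori decaying schedule
# alpha_t = alpha_0 / (1 + gamma * t).
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=3.0, scale=1.0, size=1000)   # minimizer x* = mean(y)

def sgd_decay(alpha0=0.5, gamma=0.01, steps=20000, x0=0.0):
    x = x0
    for t in range(steps):
        alpha_t = alpha0 / (1.0 + gamma * t)    # decreasing step size
        i = rng.integers(len(y))
        x -= alpha_t * (x - y[i])
    return x

print(f"x = {sgd_decay():.4f}, x* = {y.mean():.4f}")
# Big early steps give fast initial progress; the 1/t decay then keeps
# shrinking the noise ball instead of stalling at a fixed one.
```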

Page 16

Demo

Page 17

Have we solved step sizes for SGD forever?

• No.

• We don't usually know what $\mu$, $L$, $M$, and $\rho_0$ are
  • Even if the problem is convex

• This “optimal” rate optimizes the upper bound on the expected distance-squared to the optimum
  • But sometimes this bound is loose, and other step size schemes might do better

Page 18

What if we don’t know the parameters?

• One idea: still use a step size scheme of the form

$$\alpha_t = \frac{\alpha_0}{1 + \gamma t}$$

• Choose the parameters $\alpha_0$ and $\gamma$ via some other method
  • For example, hand-tuning — which doesn't scale

• Can we do this automatically?
  • Yes! This is an example of metaparameter optimization

Page 19

Other Techniques

• Decrease the step size in epochs
  • Still asymptotically $\alpha_t = \frac{1}{\Theta(t)}$, but the step size decreases in discrete steps
  • Useful for parallelization and saves a little compute

• Per-parameter learning rates — e.g. AdaGrad
  • Replace the scalar ⍺ with a diagonal matrix A:

$$x_{t+1} = x_t - A \nabla f(x_t; y_{i_t})$$

  • Helps with poorly scaled problems
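A simplified sketch (ours) of the AdaGrad-style idea: the scalar step size becomes a diagonal scaling that divides each coordinate by the root of its accumulated squared gradients. This is the core mechanism only, not the full algorithm from the AdaGrad paper.

```python
import numpy as np

def adagrad_step(x, grad, accum, alpha=0.1, eps=1e-8):
    accum += grad ** 2                            # per-coordinate history
    x -= alpha * grad / (np.sqrt(accum) + eps)    # diagonal "A" scaling
    return x, accum

# Toy poorly scaled quadratic: h(x) = 0.5 * (100 * x_0^2 + x_1^2)
x, accum = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(500):
    grad = np.array([100.0 * x[0], x[1]])
    x, accum = adagrad_step(x, grad, accum)
print(x)   # both coordinates make progress despite the 100x scale gap
```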

Page 20

Mini-Batching

Page 21

Gradient Descent vs. SGD

• Gradient descent: all examples at once

$$x_{t+1} = x_t - \alpha_t \frac{1}{N} \sum_{i=1}^N \nabla f(x_t; y_i)$$

• Stochastic gradient descent: one example at a time

$$x_{t+1} = x_t - \alpha_t \nabla f(x_t; y_{i_t})$$

• Is it really all or nothing? Can we do something intermediate?
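The two extremes, side by side as code (a sketch of ours; `grad_f(x, y_i)` is a hypothetical per-example gradient function):

```python
import numpy as np

def gd_step(x, ys, grad_f, alpha):
    # Gradient descent: average the gradient over all N examples.
    return x - alpha * np.mean([grad_f(x, y) for y in ys], axis=0)

def sgd_step(x, ys, grad_f, alpha, rng):
    # SGD: one uniformly sampled example per step.
    return x - alpha * grad_f(x, ys[rng.integers(len(ys))])
```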

Page 22

Mini-Batch Stochastic Gradient Descent

• An intermediate approach:

$$x_{t+1} = x_t - \alpha_t \frac{1}{|B_t|} \sum_{i \in B_t} \nabla f(x_t; y_i)$$

where $B_t$ is sampled uniformly from the set of all subsets of {1, …, N} of size b.
  • The b parameter is the batch size
  • Typically choose b << N.

• Also called mini-batch gradient descent
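A minimal sketch (ours) of this update on the toy 1-D least-squares problem from earlier, with $B_t$ drawn uniformly without replacement:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=3.0, scale=1.0, size=1000)   # x* = mean(y)

def minibatch_sgd(b, alpha=0.1, steps=20000, x0=0.0):
    x, dists = x0, []
    for _ in range(steps):
        batch = rng.choice(len(y), size=b, replace=False)   # sample B_t
        x -= alpha * np.mean(x - y[batch])   # average of b per-example grads
        dists.append((x - y.mean()) ** 2)
    return np.mean(dists[steps // 2:])       # long-run squared distance

for b in (1, 8, 32):
    print(f"b={b:2d}: noise ball ~ {minibatch_sgd(b):.2e}")
# The long-run error shrinks roughly like 1/b, which is exactly what
# the variance analysis on the next few slides predicts.
```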

Page 23

Advantages of Mini-Batch

• Takes less time to compute each update than gradient descent
  • Only needs to sum up b gradients, rather than N

• But takes more time for each update than SGD
  • So what's the benefit?

• It's more like gradient descent, so maybe it converges faster than SGD?

$$x_{t+1} = x_t - \alpha_t \frac{1}{|B_t|} \sum_{i \in B_t} \nabla f(x_t; y_i)$$

Page 24

Mini-Batch SGD Converges

• Start by breaking up the update rule into expected update and noise:

$$x_{t+1} - x^* = x_t - x^* - \alpha_t\left(\nabla h(x_t) - \nabla h(x^*)\right) - \alpha_t \frac{1}{|B_t|} \sum_{i \in B_t} \left(\nabla f(x_t; y_i) - \nabla h(x_t)\right)$$

• Variance analysis:

$$\operatorname{Var}\left(x_{t+1} - x^*\right) = \operatorname{Var}\left(\alpha_t \frac{1}{|B_t|} \sum_{i \in B_t} \left(\nabla f(x_t; y_i) - \nabla h(x_t)\right)\right)$$

Page 25

Mini-Batch SGD Converges (continued)

Let $\Delta_i = \nabla f(x_t; y_i) - \nabla h(x_t)$, and $\beta_i = \begin{cases} 1 & i \in B_t \\ 0 & i \notin B_t \end{cases}$

Note that $\sum_{i=1}^N \Delta_i = 0$, so the noise term has mean zero and its variance is just its second moment:

$$\begin{aligned}
\operatorname{Var}\left(x_{t+1} - x^*\right) &= \frac{\alpha_t^2}{|B_t|^2} \operatorname{Var}\left(\sum_{i \in B_t} \left(\nabla f(x_t; y_i) - \nabla h(x_t)\right)\right) \\
&= \frac{\alpha_t^2}{|B_t|^2} \operatorname{Var}\left(\sum_{i=1}^N \beta_i \Delta_i\right) \\
&= \frac{\alpha_t^2}{|B_t|^2} \sum_{i=1}^N \sum_{j=1}^N \mathbb{E}\left[\beta_i \beta_j\right] \Delta_i \Delta_j
\end{aligned}$$

Page 26

Mini-Batch SGD Converges (continued)

• Because we sampled B uniformly at random, for i ≠ j:

$$\mathbb{E}\left[\beta_i \beta_j\right] = P(i \in B \wedge j \in B) = P(i \in B)\, P(j \in B \,|\, i \in B) = \frac{b}{N} \cdot \frac{b-1}{N-1}$$

$$\mathbb{E}\left[\beta_i^2\right] = P(i \in B) = \frac{b}{N}$$

• So we can write the variance as

$$\operatorname{Var}\left(x_{t+1} - x^*\right) = \frac{\alpha_t^2}{|B_t|^2} \left( \sum_{i \ne j} \frac{b(b-1)}{N(N-1)} \Delta_i \Delta_j + \sum_{i=1}^N \frac{b}{N} \Delta_i^2 \right)$$

Page 27

Mini-Batch SGD Converges (continued)

$$\begin{aligned}
\operatorname{Var}\left(x_{t+1} - x^*\right) &= \frac{\alpha_t^2}{b^2} \left( \sum_{i \ne j} \frac{b(b-1)}{N(N-1)} \Delta_i \Delta_j + \sum_{i=1}^N \frac{b}{N} \Delta_i^2 \right) \\
&= \frac{\alpha_t^2}{bN} \left( \frac{b-1}{N-1} \sum_{i=1}^N \sum_{j=1}^N \Delta_i \Delta_j + \sum_{i=1}^N \left(1 - \frac{b-1}{N-1}\right) \Delta_i^2 \right) \\
&= \frac{\alpha_t^2}{bN} \left( \frac{b-1}{N-1} \left(\sum_{i=1}^N \Delta_i\right)^2 + \frac{N-b}{N-1} \sum_{i=1}^N \Delta_i^2 \right) \\
&= \frac{\alpha_t^2 (N-b)}{bN(N-1)} \sum_{i=1}^N \Delta_i^2
\end{aligned}$$

where the last step uses $\sum_{i=1}^N \Delta_i = 0$, since each $\Delta_i$ is a deviation from the average gradient.

Page 28

Mini-Batch SGD Converges (continued)

• Compared with SGD, variance decreased by a factor of b:

$$\begin{aligned}
\operatorname{Var}\left(x_{t+1} - x^*\right) &= \frac{\alpha_t^2 (N-b)}{b(N-1)} \cdot \frac{1}{N} \sum_{i=1}^N \Delta_i^2 \\
&= \frac{\alpha_t^2 (N-b)}{b(N-1)}\, \mathbb{E}\left[\left\|\nabla f(x_t; y_{i_t}) - \nabla h(x_t)\right\|^2 \,\middle|\, x_t\right] \\
&\le \frac{\alpha_t^2 (N-b)}{b(N-1)} M \\
&\le \frac{\alpha_t^2 M}{b}
\end{aligned}$$
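This finite-population correction is easy to get wrong, so here is a quick Monte Carlo check (ours) of the factor derived above, for a mean of b values sampled without replacement:

```python
import numpy as np

rng = np.random.default_rng(0)
N, b = 1000, 32
delta = rng.normal(size=N)
delta -= delta.mean()                    # enforce sum(Delta_i) = 0, as in the proof

means = [delta[rng.choice(N, size=b, replace=False)].mean()
         for _ in range(100_000)]
empirical = np.var(means)
predicted = (N - b) / (b * (N - 1)) * np.mean(delta ** 2)
print(f"empirical: {empirical:.3e}  predicted: {predicted:.3e}")
# The two should agree closely, confirming the correction factor.
```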

Page 29

Mini-Batch SGD Converges (continued)

• Recall that SGD converged to a noise ball of size

$$\lim_{T \to \infty} \mathbb{E}\left[\|x_T - x^*\|^2\right] \le \frac{\alpha M}{2\mu - \alpha\mu^2}$$

• Since mini-batching decreases the variance by a factor of b, it will have

$$\lim_{T \to \infty} \mathbb{E}\left[\|x_T - x^*\|^2\right] \le \frac{\alpha M}{(2\mu - \alpha\mu^2)\, b}$$

• Noise ball smaller by the same factor!

Page 30

Advantages of Mini-Batch (reprise)

• Takes less time to compute each update than gradient descent
  • Only needs to sum up b gradients, rather than N

$$x_{t+1} = x_t - \alpha_t \frac{1}{|B_t|} \sum_{i \in B_t} \nabla f(x_t; y_i)$$

• Converges to a smaller noise ball than stochastic gradient descent

$$\lim_{T \to \infty} \mathbb{E}\left[\|x_T - x^*\|^2\right] \le \frac{\alpha M}{(2\mu - \alpha\mu^2)\, b}$$

Page 31

How to choose the batch size?

• Mini-batching is not a free win
  • Naively, compared with SGD, it takes b times as much effort to get a b-times-as-accurate answer
  • But we could have gotten a b-times-as-accurate answer by just running SGD for b times as many steps with a step size of ⍺/b.

• But it still makes sense to run it for systems and statistical reasons
  • Mini-batching exposes more parallelism
  • Mini-batching lets us estimate statistics about the full gradient more accurately

• Another use case for metaparameter optimization

Page 32

Mini-Batch SGD is very widely used

• Including in basically all neural network training

• b = 32 is a typical default value for batch size
  • From “Practical Recommendations for Gradient-Based Training of Deep Architectures,” Bengio 2012.

Page 33

Overfitting, Generalization Error, and Regularization

Page 34

Minimizing Training Loss is Not our Real Goal

• Training loss looks like

$$h(x) = \frac{1}{N} \sum_{i=1}^N f(x; y_i)$$

• What we actually want to minimize is expected loss on new examples
  • Drawn from some real-world distribution ɸ

$$h(x) = \mathbb{E}_{y \sim \phi}\left[f(x; y)\right]$$

• Typically, assume the training examples were drawn from this distribution

Page 35

Overfitting

• Minimizing the training loss doesn't generally minimize the expected loss on new examples
  • They are two different objective functions after all

• Difference between the empirical loss on the training set and the expected loss on new examples is called the generalization error

• Even a model that has high accuracy on the training set can have terrible performance on new examples
  • Phenomenon is called overfitting

Page 36

Demo

Page 37

How to address overfitting

• Many, many techniques to deal with overfitting
  • Have varying computational costs

• But this is a systems course… so what can we do with little or no extra computational cost?

• Notice from the demo that some loss functions do better than others
  • Can we modify our loss function to prevent overfitting?

Page 38

Regularization

• Add an extra regularization term to the objective function

• Most popular type: L2 regularization

$$h(x) = \frac{1}{N} \sum_{i=1}^N f(x; y_i) + \frac{\lambda}{2}\|x\|_2^2 = \frac{1}{N} \sum_{i=1}^N f(x; y_i) + \frac{\lambda}{2} \sum_{k=1}^d x_k^2$$

• Also popular: L1 regularization

$$h(x) = \frac{1}{N} \sum_{i=1}^N f(x; y_i) + \lambda \|x\|_1 = \frac{1}{N} \sum_{i=1}^N f(x; y_i) + \lambda \sum_{k=1}^d |x_k|$$

Page 39

Benefits of Regularization

• Cheap to compute
  • For SGD and L2 regularization, there's just an extra scaling:

$$x_{t+1} = (1 - \alpha_t\lambda)\, x_t - \alpha_t \nabla f(x_t; y_{i_t})$$

• Makes the objective strongly convex
  • This makes it easier to get and prove bounds on convergence

• Helps with overfitting
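As a sketch (ours), the extra scaling really is just one line:

```python
import numpy as np

def sgd_step_l2(x, grad_f, alpha, lam):
    """One SGD step on f + (lam/2)*||x||^2: shrink x, then step."""
    return (1.0 - alpha * lam) * x - alpha * grad_f

# Toy usage with an arbitrary per-example gradient:
x = np.array([2.0, -3.0])
x = sgd_step_l2(x, grad_f=np.array([0.5, -0.1]), alpha=0.1, lam=0.01)
print(x)
```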

Page 40

Motivation

• One way to think about regularization is as a Bayesian prior

• MLE interpretation of learning problem:

$$P(y_i | x) = \frac{1}{Z} \exp(-f(x; y_i))$$

$$P(x | y) = \frac{P(y | x)\, P(x)}{P(y)} = \frac{P(x)}{P(y)} \prod_{i=1}^N \frac{1}{Z} \exp(-f(x; y_i))$$

• Taking the logarithm:

$$\log P(x | y) = -N \log(Z) - \sum_{i=1}^N f(x; y_i) + \log P(x) - \log P(y)$$

Page 41

Motivation (continued)

• So the MLE problem becomes

$$\max_x \log P(x | y) = -\sum_{i=1}^N f(x; y_i) + \log P(x) + \text{(constants)}$$

• Now we need to pick a prior probability distribution for x
  • Say we choose a Gaussian prior:

$$\begin{aligned}
\max_x \log P(x | y) &= -\sum_{i=1}^N f(x; y_i) + \log\left(\frac{1}{\sqrt{2\pi\sigma}} \exp\left(-\frac{\sigma}{2}\|x\|_2^2\right)\right) + \text{(C)} \\
&= -\sum_{i=1}^N f(x; y_i) - \frac{\sigma}{2}\|x\|_2^2 + \text{(C)}
\end{aligned}$$

There's our regularization term!

Page 42

Motivation (continued)

• By setting a prior, we limit the values we think the model can have

• This prevents the model from doing bad things to try to fit the data
  • Like the polynomial-fitting example from the demo

Page 43

Demo

Page 44

How to choose the regularization parameter?

• One way is to use an independent validation set to estimate the test error, and set the regularization parameter manually so that it is high enough to avoid overfitting
  • This is what we saw in the demo

• But doing this naively can be computationally expensive
  • Need to re-run learning algorithm many times

• Yet another use case for metaparameter optimization
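A minimal sketch (ours) of that naive procedure; `train` and `val_loss` are hypothetical stand-ins for the learning algorithm and validation metric:

```python
import math

def pick_lambda(train, val_loss, lambdas=(0.0, 1e-3, 1e-2, 1e-1, 1.0)):
    best_lam, best_loss = None, math.inf
    for lam in lambdas:
        model = train(lam)        # a full training run per candidate...
        loss = val_loss(model)    # ...which is why this is expensive
        if loss < best_loss:
            best_lam, best_loss = lam, loss
    return best_lam
```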

Page 45

Early Stopping

Page 46

Asymptotically large training sets

• Setting 1: we have a distribution ɸ and we sample a very large (asymptotically infinite) number of points from it, then run stochastic gradient descent on that training set for only N iterations.

• Can our algorithm in this setting overfit?
  • No, because its training set is asymptotically equal to the true distribution.

• Can we compute this efficiently?
  • No, because its training set is asymptotically infinitely large.

Page 47

Consider a second setting

• Setting 1: we have a distribution ɸ and we sample a very large (asymptotically infinite) number of points from it, then run stochastic gradient descent on that training set for only N iterations.

• Setting 2: we have a distribution ɸ and we sample N points from it, then run stochastic gradient descent using each of these points exactly once.

• What is the difference between the output of SGD in these two settings?
  • Asymptotically, there's no difference!
  • So SGD in Setting 2 will also never overfit

Page 48

Early Stopping

• Motivation: if we only use each training example once for SGD, then we can't overfit.

• So if we only use each example a few times, we probably won't overfit too much.

• Early stopping: just stop running SGD before it converges.
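The slide itself only says to stop before convergence; one common way to pick the stopping point is to monitor a held-out validation set, as in this sketch (ours; `sgd_epoch` and `val_loss` are hypothetical stand-ins):

```python
def train_early_stop(model, sgd_epoch, val_loss, max_epochs=100, patience=3):
    best_loss, best_model, bad = float("inf"), model, 0
    for _ in range(max_epochs):
        model = sgd_epoch(model)          # one pass of SGD over the data
        loss = val_loss(model)
        if loss < best_loss:
            best_loss, best_model, bad = loss, model, 0
        else:
            bad += 1
            if bad >= patience:           # validation stopped improving
                break
    return best_model
```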

Page 49

Benefits of Early Stopping

• Cheap to compute
  • Literally just does less work
  • It seems like the technique was designed to make systems run faster

• Helps with overfitting

Page 50

How Early to Stop

• You’ll see this in detail in next Wednesday’s paper.

• Yet another application of metaparameter tuning.

Page 51

Questions?

• Upcoming things
  • Labor Day next Monday — no lecture
  • Paper Presentation #1 on Wednesday — read the paper before class