
Stochastic Gradient Descent


Weihang Chen, Xingchen Chen, Jinxiu Liang, Cheng Xu,Zehao Chen and Donglin He


March 26, 2017

Outline

- What is Stochastic Gradient Descent
- Comparison between BGD and SGD
- Analysis on SGD
- Extensions and Variants

What is Stochastic Gradient Descent? I

- Structural Risk Minimization in Machine Learning
- Given samples $\{(x_i, y_i)\}_{i=1}^{n}$ and a loss function $\ell(h, y)$
- Find a prediction function $h(x; w)$ by minimizing a risk measure

$$R(w) = \sum_{i=1}^{n} \ell\left(h(x_i; w), y_i\right) = \sum_{i=1}^{n} f_i(w)$$

- Update via BGD:

$$w^{(k+1)} = w^{(k)} - t_k \nabla R\left(w^{(k)}\right) = w^{(k)} - t_k \sum_{i=1}^{n} \nabla f_i\left(w^{(k)}\right)$$

What is Stochastic Gradient Descent? II

- Update via SGD: replace the full gradient $\nabla R\left(w^{(k)}\right)$ with the gradient of a single sampled term,

$$w^{(k+1)} = w^{(k)} - t_k \nabla f_{i_k}\left(w^{(k)}\right)$$

- Suppose we want to minimize a sum of functions

$$\min_{w} \sum_{i=1}^{m} f_i(w)$$

- BGD would sum all the gradients:

$$w^{(k+1)} = w^{(k)} - t_k \sum_{i=1}^{m} \nabla f_i\left(w^{(k)}\right), \quad k = 1, 2, \ldots$$

What is Stochastic Gradient Descent? III

- SGD instead looks at one gradient at a time:

$$w^{(k+1)} = w^{(k)} - t_k \nabla f_{i_k}\left(w^{(k)}\right), \quad k = 1, 2, \ldots$$

where $i_k \in \{1, \ldots, m\}$ is an index chosen at iteration $k$.

- Random rule: choose $i_k \in \{1, \ldots, m\}$ uniformly at random (more common)
- Cyclic rule: choose $i_k = 1, 2, \ldots, m, 1, 2, \ldots, m, \ldots$
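To make the two update rules concrete, here is a minimal NumPy sketch on a toy objective; the component functions $f_i(w) = \frac{1}{2}(w - c_i)^2$ are made up for illustration and are not from the slides.

```python
import numpy as np

# Toy components f_i(w) = 0.5 * (w - c_i)^2, with gradients w - c_i.
centers = np.array([1.0, 3.0])
grads = [lambda w, c=c: w - c for c in centers]

def bgd_step(w, t):
    # Batch step: sum the gradients of all m component functions.
    return w - t * sum(g(w) for g in grads)

def sgd_step(w, t, rng):
    # Stochastic step: use the gradient of one randomly chosen f_{i_k}.
    i_k = rng.integers(len(grads))
    return w - t * grads[i_k](w)

rng = np.random.default_rng(0)
w_batch, w_stoch = 0.0, 0.0
for k in range(1000):
    w_stoch = sgd_step(w_stoch, t=0.01, rng=rng)
for k in range(100):
    w_batch = bgd_step(w_batch, t=0.1)
print(w_batch, w_stoch)  # both land near the minimizer w* = 2 of f_1 + f_2
```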

Comparison between BGD and SGD

Figure: the "classic picture" of BGD vs. SGD convergence paths

Gradient computation:

- Batch step: O(np)
  - Doable when n is moderate, but not when n ≈ 5 × 10^8
- Stochastic step: O(p)
  - So clearly, e.g., 10K stochastic steps are much more affordable
- Rule of thumb: SGD thrives far from the optimum and struggles close to the optimum

Comparison between BGD and SGD

- Update via BGD
  - More expensive steps
  - Opportunities for parallelism
- Update via SGD
  - Very cheap iterations
  - Descent in expectation
- Intuition
  - Using all the sample data in every iteration is inefficient
  - Data involves a good deal of redundancy in many applications
  - Suppose the data is 10 copies of a set S: an iteration of BGD is 10 times more expensive, while SGD performs the same computation
  - Sometimes working with half of the training set is sufficient

Learning Rate Analysis

Figure: effects of the learning rate on the loss

Figure: an example of a typical loss function

Convergence Analysis

- Computationally, m stochastic steps ≈ one batch step
- But what about progress?
- BGD (one step):

$$w^{(k+1)} = w^{(k)} - t \sum_{i=1}^{m} \nabla f_i\left(w^{(k)}\right)$$

- SGD (cyclic rule, $i_k = i$, m steps):

$$w^{(k+m)} = w^{(k)} - t \sum_{i=1}^{m} \nabla f_i\left(w^{(k+i-1)}\right)$$

- The difference in direction is

$$\sum_{i=1}^{m} \left[ \nabla f_i\left(w^{(k+i-1)}\right) - \nabla f_i\left(w^{(k)}\right) \right]$$

- So SGD should converge if each $\nabla f_i(w)$ does not vary wildly with $w$

Example

Problem: fit a linear regression model with SGD.

Solution:

- The linear regression loss:

$$\min_{w} \sum_{i=1}^{m} \frac{1}{2} \left( y_i - x_i^{\top} w \right)^2$$

- Update via BGD:

$$w^{(k+1)} = w^{(k)} - t_k \sum_{i=1}^{m} \left( x_i^{\top} w^{(k)} - y_i \right) x_i$$

- Update via SGD:

$$w^{(k+1)} = w^{(k)} - t_k \left( x_{i_k}^{\top} w^{(k)} - y_{i_k} \right) x_{i_k}$$
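A runnable sketch of this SGD update on synthetic data; the data, step size, and iteration count are assumptions, not the slides' experiment.

```python
import numpy as np

rng = np.random.default_rng(0)
m, p = 1000, 5
X = rng.normal(size=(m, p))
w_true = rng.normal(size=p)
y = X @ w_true + 0.1 * rng.normal(size=m)  # noisy linear targets

w = np.zeros(p)
t = 0.01  # constant step size t_k = t
for k in range(10 * m):
    i = rng.integers(m)                    # random rule: pick i_k uniformly
    w -= t * (X[i] @ w - y[i]) * X[i]      # SGD step on one sample

print(np.linalg.norm(w - w_true))          # should be small
```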


Example

Figure: SGD loss vs. iterations

Figure: result of linear regression via SGD

Mini-Batch Gradient Descent

- Batch Gradient Descent:

$$w^{(k+1)} = w^{(k)} - t_k \sum_{i=1}^{m} \nabla f_i\left(w^{(k)}\right)$$

- Stochastic Gradient Descent:

$$w^{(k+1)} = w^{(k)} - t_k \nabla f_{i_k}\left(w^{(k)}\right)$$

- Mini-batch Gradient Descent:

$$w^{(k+1)} = w^{(k)} - t_k \sum_{j=1}^{m'} \nabla f_{i_j}\left(w^{(k)}\right)$$

where $\{i_1, \ldots, i_{m'}\} \subseteq \{1, \ldots, m\}$ is a mini-batch of size $m' < m$.
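One mini-batch step under the same linear-regression setup as above, as a sketch; the batch size m' and sampling without replacement are assumptions.

```python
import numpy as np

def minibatch_step(w, X, y, t, batch_size, rng):
    # Sample a mini-batch of m' indices, then sum their gradients.
    idx = rng.choice(len(y), size=batch_size, replace=False)
    grad = (X[idx] @ w - y[idx]) @ X[idx]  # sum of (x_i^T w - y_i) x_i over the batch
    return w - t * grad
```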

Challenges

Figure: problems with the learning rate

Figure: local minima

SGD with Momentum

- Accelerates SGD in the relevant direction and dampens oscillations
- Takes a big jump in the direction of the updated accumulated gradient
- Computes the gradient at the current location

$$\nu^{(k+1)} = \gamma \nu^{(k)} + t_k \nabla f_{i_k}\left(w^{(k)}\right)$$

$$w^{(k+1)} = w^{(k)} - \nu^{(k+1)}$$

Compare plain SGD: $w^{(k+1)} = w^{(k)} - t_k \nabla f_{i_k}\left(w^{(k)}\right)$
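One momentum update as a sketch; γ = 0.9 is a common default, not a value from the slides.

```python
def momentum_step(w, v, grad_w, t, gamma=0.9):
    # Accumulate a velocity from past gradients, then step along it.
    v_new = gamma * v + t * grad_w
    return w - v_new, v_new
```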

Nesterov Accelerated Gradient

- Accelerates SGD in the relevant direction and dampens oscillations
- Takes a big jump in the direction of the previous accumulated gradient
- Measures the gradient where you end up and makes a correction

$$\tilde{w}^{(k)} = w^{(k)} - \gamma \nu^{(k)}$$

$$\nu^{(k+1)} = \gamma \nu^{(k)} + t_k \nabla f_{i_k}\left(\tilde{w}^{(k)}\right)$$

$$w^{(k+1)} = w^{(k)} - \nu^{(k+1)}$$

Compare SGD with momentum:

$$\nu^{(k+1)} = \gamma \nu^{(k)} + t_k \nabla f_{i_k}\left(w^{(k)}\right)$$

$$w^{(k+1)} = w^{(k)} - \nu^{(k+1)}$$
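The same sketch with the Nesterov lookahead; grad_fn evaluates $\nabla f_{i_k}$ at a given point, and the $w - \gamma\nu$ lookahead convention is assumed to match the update above.

```python
def nag_step(w, v, grad_fn, t, gamma=0.9):
    # Jump ahead along the previous velocity, measure the gradient
    # there, then correct.
    lookahead = w - gamma * v
    v_new = gamma * v + t * grad_fn(lookahead)
    return w - v_new, v_new
```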

Adaptive Gradient Algorithm (AdaGrad)

- Adapts the learning rate to the parameters
  - Performs larger updates for infrequent parameters
  - Performs smaller updates for frequent parameters
- Well-suited for dealing with sparse data

Compare SGD: $w^{(k+1)} = w^{(k)} - t_k \nabla f_{i_k}\left(w^{(k)}\right)$

$$\nu^{(k+1)} = \nu^{(k)} + \left( \nabla f_{i_k}\left(w^{(k)}\right) \right)^2$$

$$w^{(k+1)} = w^{(k)} - \frac{\alpha}{\sqrt{\nu^{(k+1)} + \varepsilon}} \nabla f_{i_k}\left(w^{(k)}\right)$$

ε is a smoothing term that avoids division by zero (usually on the order of 1e-8).
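The element-wise, per-coordinate scaling as a sketch; α = 0.01 is an assumed default.

```python
import numpy as np

def adagrad_step(w, v, grad_w, alpha=0.01, eps=1e-8):
    # v accumulates squared gradients element-wise, so frequently
    # updated coordinates get smaller effective step sizes.
    v_new = v + grad_w ** 2
    return w - alpha / np.sqrt(v_new + eps) * grad_w, v_new
```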

Adadelta

- Restricts the window of accumulated past gradients to a fixed size
- Reduces AdaGrad's aggressive, monotonically decreasing learning rate

The accumulator decays as a fraction γ, similarly to the momentum term:

$$\nu^{(k+1)} = \gamma \nu^{(k)} + (1 - \gamma) \left( \nabla f_{i_k}\left(w^{(k)}\right) \right)^2$$

$$w^{(k+1)} = w^{(k)} - \frac{\alpha}{\sqrt{\nu^{(k+1)} + \varepsilon}} \nabla f_{i_k}\left(w^{(k)}\right)$$

The running average $\nu^{(k)}$ at step k depends only on the previous average and the current gradient (as a fraction γ, similarly to the momentum term).
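A sketch of the decaying-average accumulator shown on the slide; the γ and α values are assumptions.

```python
import numpy as np

def adadelta_like_step(w, v, grad_w, alpha=0.01, gamma=0.9, eps=1e-8):
    # Replace AdaGrad's ever-growing sum with an exponential running
    # average, so old gradients decay away instead of shrinking the
    # step size forever.
    v_new = gamma * v + (1 - gamma) * grad_w ** 2
    return w - alpha / np.sqrt(v_new + eps) * grad_w, v_new
```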

Adaptive Moment Estimation (Adam)

- Additionally keeps an average of past gradients
- Similar to momentum
- $m^{(k)}$ and $\nu^{(k)}$ are biased toward zero
  - in the initial time steps, as they are initialized as vectors of 0's
  - when the decay rates are small (i.e., β1 and β2 are close to 1)

$$m^{(k+1)} = \beta_1 m^{(k)} + (1 - \beta_1) \nabla f_{i_k}\left(w^{(k)}\right)$$

$$\nu^{(k+1)} = \beta_2 \nu^{(k)} + (1 - \beta_2) \left( \nabla f_{i_k}\left(w^{(k)}\right) \right)^2$$

$$\hat{m}^{(k+1)} = \frac{m^{(k+1)}}{1 - \beta_1^{k+1}}, \qquad \hat{\nu}^{(k+1)} = \frac{\nu^{(k+1)}}{1 - \beta_2^{k+1}}$$

$$w^{(k+1)} = w^{(k)} - \frac{t_k}{\sqrt{\hat{\nu}^{(k+1)}} + \varepsilon} \hat{m}^{(k+1)}$$

where $\hat{m}$ and $\hat{\nu}$ are the bias-corrected first and second moment estimates.
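All the pieces together as a sketch; the β1, β2, and ε defaults follow common practice, and k is the 1-based step counter used for bias correction.

```python
import numpy as np

def adam_step(w, m, v, grad_w, k, t=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Decaying averages of the gradient and its element-wise square...
    m_new = beta1 * m + (1 - beta1) * grad_w
    v_new = beta2 * v + (1 - beta2) * grad_w ** 2
    # ...bias-corrected because m and v start at zero.
    m_hat = m_new / (1 - beta1 ** k)
    v_hat = v_new / (1 - beta2 ** k)
    return w - t / (np.sqrt(v_hat) + eps) * m_hat, m_new, v_new
```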

Which Optimizer to Use?

- Use one of the adaptive learning-rate methods:
  - if the input data is sparse
  - for faster convergence when training deep or complex neural networks
- So far, Adadelta and Adam are very similar algorithms that do well in similar circumstances
- Adam slightly outperforms Adadelta toward the end of optimization as gradients become sparser
- So far, Adam might be the best overall choice

References

- Hongmin Cai (2016): Sub-gradient Method, Lecture 7
- Cnblogs, Murongxixi (2013): Stochastic Gradient Descent
- Leon Bottou (2016): Optimization Methods for Large-Scale Machine Learning
- Abdelkrim Bennar (2007): Almost Sure Convergence of a Stochastic Approximation Process in a Convex Set
- A. Shapiro, Y. Wardi (1996): Convergence Analysis of Gradient Descent Stochastic Algorithms
- Wikipedia: Stochastic gradient descent
- Sebastian Ruder (2016): An Overview of Gradient Descent Optimization Algorithms
- Zhihua Zhou (2016): Machine Learning, Chapter 6

Thank you for your time!
