
Stochastic Gradient Descent


Weihang Chen, Xingchen Chen, Jinxiu Liang, Cheng Xu,Zehao Chen and Donglin He


March 26, 2017

Outline

- What is Stochastic Gradient Descent
- Comparison between BGD and SGD
- Analysis on SGD
- Extensions and Variants

What is Stochastic Gradient Descent? I

- Structural Risk Minimization in Machine Learning
- Given samples $\{(x_i, y_i)\}_{i=1}^{n}$ and a loss function $\ell(h, y)$
- Find a prediction function $h(x; w)$ by minimizing a risk measure

$$R(w) = \sum_{i=1}^{n} \ell\left(h(x_i; w), y_i\right) = \sum_{i=1}^{n} f_i(w)$$

- Update via BGD:

$$w^{(k+1)} = w^{(k)} - t_k \nabla R\left(w^{(k)}\right) = w^{(k)} - t_k \sum_{i=1}^{n} \nabla f_i\left(w^{(k)}\right)$$

What is Stochastic Gradient Descent? II

- Update via SGD: replace the full gradient $\nabla R\left(w^{(k)}\right)$ with the gradient of a single sampled term,

$$w^{(k+1)} = w^{(k)} - t_k \nabla f_{i_k}\left(w^{(k)}\right)$$

- Suppose we want to minimize a sum of functions

$$\min_{w} \sum_{i=1}^{m} f_i(w)$$

- BGD would sum all the gradients:

$$w^{(k+1)} = w^{(k)} - t_k \sum_{i=1}^{m} \nabla f_i\left(w^{(k)}\right), \quad k = 1, 2, \ldots$$

What is Stochastic Gradient Descent? III

- SGD instead looks at one gradient at a time:

$$w^{(k+1)} = w^{(k)} - t_k \nabla f_{i_k}\left(w^{(k)}\right), \quad k = 1, 2, \ldots$$

where $i_k \in \{1, \ldots, m\}$ is an index chosen at iteration $k$.

- Random rule: choose $i_k \in \{1, \ldots, m\}$ uniformly at random (more common)
- Cyclic rule: choose $i_k = 1, 2, \ldots, m, 1, 2, \ldots, m, \ldots$
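To make the two update rules concrete, here is a minimal NumPy sketch on a toy objective; the component functions $f_i(w) = \frac{1}{2}(w - c_i)^2$ are made up for illustration and are not from the slides.

```python
import numpy as np

# Toy components f_i(w) = 0.5 * (w - c_i)^2, with gradients w - c_i.
centers = np.array([1.0, 3.0])
grads = [lambda w, c=c: w - c for c in centers]

def bgd_step(w, t):
    # Batch step: sum the gradients of all m component functions.
    return w - t * sum(g(w) for g in grads)

def sgd_step(w, t, rng):
    # Stochastic step: use the gradient of one randomly chosen f_{i_k}.
    i_k = rng.integers(len(grads))
    return w - t * grads[i_k](w)

rng = np.random.default_rng(0)
w_batch, w_stoch = 0.0, 0.0
for k in range(1000):
    w_stoch = sgd_step(w_stoch, t=0.01, rng=rng)
for k in range(100):
    w_batch = bgd_step(w_batch, t=0.1)
print(w_batch, w_stoch)  # both land near the minimizer w* = 2 of f_1 + f_2
```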

Comparison between BGD and SGD

Figure: the "classic picture" of BGD vs. SGD convergence paths

Gradient computation:

- Batch step: O(np)
  - Doable when n is moderate, but not when n ≈ 5 × 10^8
- Stochastic step: O(p)
  - So clearly, e.g., 10K stochastic steps are much more affordable
- Rule of thumb: SGD thrives far from the optimum and struggles close to the optimum

Comparison between BGD and SGD

- Update via BGD
  - More expensive steps
  - Opportunities for parallelism
- Update via SGD
  - Very cheap iterations
  - Descent in expectation
- Intuition
  - Using all the sample data in every iteration is inefficient
  - Data involves a good deal of redundancy in many applications
  - Suppose the data is 10 copies of a set S: an iteration of BGD is 10 times more expensive, while SGD performs the same computation
  - Sometimes working with half of the training set is sufficient

Learning Rate Analysis

Figure: effects of the learning rate on the loss

Figure: an example of a typical loss function

Convergence Analysis

- Computationally, m stochastic steps ≈ one batch step
- But what about progress?
- BGD (one step):

$$w^{(k+1)} = w^{(k)} - t \sum_{i=1}^{m} \nabla f_i\left(w^{(k)}\right)$$

- SGD (cyclic rule, $i_k = i$, m steps):

$$w^{(k+m)} = w^{(k)} - t \sum_{i=1}^{m} \nabla f_i\left(w^{(k+i-1)}\right)$$

- The difference in direction is

$$\sum_{i=1}^{m} \left[ \nabla f_i\left(w^{(k+i-1)}\right) - \nabla f_i\left(w^{(k)}\right) \right]$$

- So SGD should converge if each $\nabla f_i(w)$ does not vary wildly with $w$

Example

Problem: fit a linear regression model with SGD.

Solution:

- The linear regression loss:

$$\min_{w} \sum_{i=1}^{m} \frac{1}{2} \left( y_i - x_i^{\top} w \right)^2$$

- Update via BGD:

$$w^{(k+1)} = w^{(k)} - t_k \sum_{i=1}^{m} \left( x_i^{\top} w^{(k)} - y_i \right) x_i$$

- Update via SGD:

$$w^{(k+1)} = w^{(k)} - t_k \left( x_{i_k}^{\top} w^{(k)} - y_{i_k} \right) x_{i_k}$$
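A runnable sketch of this SGD update on synthetic data; the data, step size, and iteration count are assumptions, not the slides' experiment.

```python
import numpy as np

rng = np.random.default_rng(0)
m, p = 1000, 5
X = rng.normal(size=(m, p))
w_true = rng.normal(size=p)
y = X @ w_true + 0.1 * rng.normal(size=m)  # noisy linear targets

w = np.zeros(p)
t = 0.01  # constant step size t_k = t
for k in range(10 * m):
    i = rng.integers(m)                    # random rule: pick i_k uniformly
    w -= t * (X[i] @ w - y[i]) * X[i]      # SGD step on one sample

print(np.linalg.norm(w - w_true))          # should be small
```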


Example

Figure: SGD loss vs. iterations

Figure: result of linear regression via SGD

Mini-Batch Gradient Descent

- Batch Gradient Descent:

$$w^{(k+1)} = w^{(k)} - t_k \sum_{i=1}^{m} \nabla f_i\left(w^{(k)}\right)$$

- Stochastic Gradient Descent:

$$w^{(k+1)} = w^{(k)} - t_k \nabla f_{i_k}\left(w^{(k)}\right)$$

- Mini-batch Gradient Descent:

$$w^{(k+1)} = w^{(k)} - t_k \sum_{j=1}^{m'} \nabla f_{i_j}\left(w^{(k)}\right)$$

where $\{i_1, \ldots, i_{m'}\} \subseteq \{1, \ldots, m\}$ is a mini-batch of size $m' < m$.
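One mini-batch step under the same linear-regression setup as above, as a sketch; the batch size m' and sampling without replacement are assumptions.

```python
import numpy as np

def minibatch_step(w, X, y, t, batch_size, rng):
    # Sample a mini-batch of m' indices, then sum their gradients.
    idx = rng.choice(len(y), size=batch_size, replace=False)
    grad = (X[idx] @ w - y[idx]) @ X[idx]  # sum of (x_i^T w - y_i) x_i over the batch
    return w - t * grad
```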

Challenges

Figure: problems with the learning rate

Figure: local minima

SGD with Momentum

- Accelerates SGD in the relevant direction and dampens oscillations
- Takes a big jump in the direction of the updated accumulated gradient
- Computes the gradient at the current location

$$\nu^{(k+1)} = \gamma \nu^{(k)} + t_k \nabla f_{i_k}\left(w^{(k)}\right)$$

$$w^{(k+1)} = w^{(k)} - \nu^{(k+1)}$$

Compare plain SGD: $w^{(k+1)} = w^{(k)} - t_k \nabla f_{i_k}\left(w^{(k)}\right)$
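One momentum update as a sketch; γ = 0.9 is a common default, not a value from the slides.

```python
def momentum_step(w, v, grad_w, t, gamma=0.9):
    # Accumulate a velocity from past gradients, then step along it.
    v_new = gamma * v + t * grad_w
    return w - v_new, v_new
```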

Nesterov Accelerated Gradient

- Accelerates SGD in the relevant direction and dampens oscillations
- Takes a big jump in the direction of the previous accumulated gradient
- Measures the gradient where you end up and makes a correction

$$\tilde{w}^{(k)} = w^{(k)} - \gamma \nu^{(k)}$$

$$\nu^{(k+1)} = \gamma \nu^{(k)} + t_k \nabla f_{i_k}\left(\tilde{w}^{(k)}\right)$$

$$w^{(k+1)} = w^{(k)} - \nu^{(k+1)}$$

Compare SGD with momentum:

$$\nu^{(k+1)} = \gamma \nu^{(k)} + t_k \nabla f_{i_k}\left(w^{(k)}\right)$$

$$w^{(k+1)} = w^{(k)} - \nu^{(k+1)}$$
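The same sketch with the Nesterov lookahead; grad_fn evaluates $\nabla f_{i_k}$ at a given point, and the $w - \gamma\nu$ lookahead convention is assumed to match the update above.

```python
def nag_step(w, v, grad_fn, t, gamma=0.9):
    # Jump ahead along the previous velocity, measure the gradient
    # there, then correct.
    lookahead = w - gamma * v
    v_new = gamma * v + t * grad_fn(lookahead)
    return w - v_new, v_new
```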

Adaptive Gradient Algorithm (AdaGrad)

- Adapts the learning rate to the parameters
  - Performs larger updates for infrequent parameters
  - Performs smaller updates for frequent parameters
- Well-suited for dealing with sparse data

Compare SGD: $w^{(k+1)} = w^{(k)} - t_k \nabla f_{i_k}\left(w^{(k)}\right)$

$$\nu^{(k+1)} = \nu^{(k)} + \left( \nabla f_{i_k}\left(w^{(k)}\right) \right)^2$$

$$w^{(k+1)} = w^{(k)} - \frac{\alpha}{\sqrt{\nu^{(k+1)} + \varepsilon}} \nabla f_{i_k}\left(w^{(k)}\right)$$

ε is a smoothing term that avoids division by zero (usually on the order of 1e-8).
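The element-wise, per-coordinate scaling as a sketch; α = 0.01 is an assumed default.

```python
import numpy as np

def adagrad_step(w, v, grad_w, alpha=0.01, eps=1e-8):
    # v accumulates squared gradients element-wise, so frequently
    # updated coordinates get smaller effective step sizes.
    v_new = v + grad_w ** 2
    return w - alpha / np.sqrt(v_new + eps) * grad_w, v_new
```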

Adadelta

- Restricts the window of accumulated past gradients to a fixed size
- Reduces AdaGrad's aggressive, monotonically decreasing learning rate

The accumulator decays as a fraction γ, similarly to the momentum term:

$$\nu^{(k+1)} = \gamma \nu^{(k)} + (1 - \gamma) \left( \nabla f_{i_k}\left(w^{(k)}\right) \right)^2$$

$$w^{(k+1)} = w^{(k)} - \frac{\alpha}{\sqrt{\nu^{(k+1)} + \varepsilon}} \nabla f_{i_k}\left(w^{(k)}\right)$$

The running average $\nu^{(k)}$ at step k depends only on the previous average and the current gradient (as a fraction γ, similarly to the momentum term).
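A sketch of the decaying-average accumulator shown on the slide; the γ and α values are assumptions.

```python
import numpy as np

def adadelta_like_step(w, v, grad_w, alpha=0.01, gamma=0.9, eps=1e-8):
    # Replace AdaGrad's ever-growing sum with an exponential running
    # average, so old gradients decay away instead of shrinking the
    # step size forever.
    v_new = gamma * v + (1 - gamma) * grad_w ** 2
    return w - alpha / np.sqrt(v_new + eps) * grad_w, v_new
```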

Adaptive Moment Estimation (Adam)

- Additionally keeps an average of past gradients
- Similar to momentum
- $m^{(k)}$ and $\nu^{(k)}$ are biased toward zero
  - in the initial time steps, as they are initialized as vectors of 0's
  - when the decay rates are small (i.e., β1 and β2 are close to 1)

$$m^{(k+1)} = \beta_1 m^{(k)} + (1 - \beta_1) \nabla f_{i_k}\left(w^{(k)}\right)$$

$$\nu^{(k+1)} = \beta_2 \nu^{(k)} + (1 - \beta_2) \left( \nabla f_{i_k}\left(w^{(k)}\right) \right)^2$$

$$\hat{m}^{(k+1)} = \frac{m^{(k+1)}}{1 - \beta_1^{k+1}}, \qquad \hat{\nu}^{(k+1)} = \frac{\nu^{(k+1)}}{1 - \beta_2^{k+1}}$$

$$w^{(k+1)} = w^{(k)} - \frac{t_k}{\sqrt{\hat{\nu}^{(k+1)}} + \varepsilon} \hat{m}^{(k+1)}$$

where $\hat{m}$ and $\hat{\nu}$ are the bias-corrected first and second moment estimates.
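All the pieces together as a sketch; the β1, β2, and ε defaults follow common practice, and k is the 1-based step counter used for bias correction.

```python
import numpy as np

def adam_step(w, m, v, grad_w, k, t=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Decaying averages of the gradient and its element-wise square...
    m_new = beta1 * m + (1 - beta1) * grad_w
    v_new = beta2 * v + (1 - beta2) * grad_w ** 2
    # ...bias-corrected because m and v start at zero.
    m_hat = m_new / (1 - beta1 ** k)
    v_hat = v_new / (1 - beta2 ** k)
    return w - t / (np.sqrt(v_hat) + eps) * m_hat, m_new, v_new
```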

Which Optimizer to Use?

- Use one of the adaptive learning-rate methods:
  - if the input data is sparse
  - for faster convergence when training deep or complex neural networks
- So far, Adadelta and Adam are very similar algorithms that do well in similar circumstances
- Adam slightly outperforms Adadelta toward the end of optimization as gradients become sparser
- So far, Adam might be the best overall choice

References

- Hongmin Cai (2016): Sub-gradient Method, Lecture 7
- Cnblogs, Murongxixi (2013): Stochastic Gradient Descent
- Leon Bottou (2016): Optimization Methods for Large-Scale Machine Learning
- Abdelkrim Bennar (2007): Almost Sure Convergence of a Stochastic Approximation Process in a Convex Set
- A. Shapiro, Y. Wardi (1996): Convergence Analysis of Gradient Descent Stochastic Algorithms
- Wikipedia: Stochastic gradient descent
- Sebastian Ruder (2016): An Overview of Gradient Descent Optimization Algorithms
- Zhihua Zhou (2016): Machine Learning, Chapter 6

Thank you for your time!
