TRANSCRIPT
Adam: A Method for Stochastic Optimization
Diederik P. Kingma, Jimmy Ba. Presented by Dor Ringel
Content
• Background
  • Supervised ML theory and the importance of optimum finding
  • Gradient descent and its variants
  • Limitations of SGD
• Earlier approaches / Building blocks
  • Momentum
  • Nesterov accelerated gradient (NAG)
  • AdaGrad (Adaptive Gradient)
  • AdaDelta and RMSProp
• Adam
  • Update rule
  • Bias correction
  • AdaMax
• Post-Adam innovations
  • Improving Adam
  • Additional approaches
  • Shampoo
Basic Supervised Machine Learning Terminology
Notation      Explanation
X             Instance space
Y             Label space
x ~ D         Unknown probability distribution over X
f: X ⟶ Y      True mapping between instance and label spaces, unknown
Examples of instance and label spaces
X                                                       Y
The space of all RGB images of some dimension           Is there a cat in the image? ({0,1})
The space of a stock's historical price sequences       The stock's next-day closing price ([0,∞))
The space of all finite Chinese sentences               The set of all finite English sentences
The space of all information regarding two companies    The probability of a merger being successful
The space of all MRI images                             A probability, location and type of a tumor
The space of all finite-length voice sequences          The corresponding Amazon product being referred to
...                                                     ...
Basic Supervised Machine Learning Terminology (continued)
Notation                                      Explanation
{(x_i, y_i)}_{i=1}^{n},  x_i ∈ X,  y_i ∈ Y    Training set
h: X ⟶ Y,  h ∈ H                              Hypothesis, the object we wish to learn
Basic Supervised Machine Learning theory
• The goal is to find a hypothesis h that approximates f as well as possible, but approximates in what sense?
• We need a way of evaluating a hypothesis' quality.
Basic Supervised Machine Learning theory
Let us define a loss function ℓ: Y × Y ⟶ [0, ∞)
• For example:
  • Zero-one loss: 1(h(x_i) ≠ y_i)
  • Quadratic loss: (h(x_i) − y_i)²
  (here we use y_i = f(x_i))
• A measure of "how bad" the hypothesis did on a single sample.
Basic Supervised Machine Learning theory
Let us also define the Generalization error (a.k.a. Risk) of a hypothesis h:
R_D(h) = E_{x~D}[ ℓ(h(x), f(x)) ]
• A measure of "how bad" the hypothesis did on the entire instance space.
• A good hypothesis is one with a low Risk value.
Basic Supervised Machine Learning theory
Unfortunately, the true Risk R_D(h) cannot be computed, because the distribution D is unknown to the learning algorithm.
We can, however, compute a proxy of the true Risk, called the Empirical Risk.
Basic Supervised Machine Learning theory
Let us define the Empirical Risk of a hypothesis h:
R_emp(h) = (1/n) Σ_{i=1}^{n} ℓ(h(x_i), y_i)
• Recall: {(x_i, y_i)}_{i=1}^{n}, x_i ∈ X, y_i ∈ Y is the Training set
• R_D(h) = E_{x~D}[ ℓ(h(x), f(x)) ]
Empirical Risk Minimization (ERM) strategy
• After mountains of theory (PAC Learning, VC Theory, etc.), the following theorem is proven (this is very informal):
The “best” strategy for a learning algorithm is to minimize the Empirical Risk
Empirical Risk Minimization (ERM) strategy
This means that the learning algorithm defined by the ERM principle boils down to solving the following optimization problem:
Find ĥ such that:
ĥ = argmin_{h ∈ H} R_emp(h) = argmin_{h ∈ H} (1/n) Σ_{i=1}^{n} ℓ(h(x_i), y_i)
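To make the ERM principle concrete, here is a minimal NumPy sketch (not part of the original slides); the synthetic data and the tiny grid-search hypothesis class are illustrative assumptions only:

```python
# A minimal sketch of ERM: evaluate R_emp(h) with the quadratic loss and pick
# the hypothesis that minimizes it over a small, finite hypothesis class H.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 100)                   # instances, X = R
y = 3.0 * x + 0.5 + rng.normal(0.0, 0.1, 100)     # labels,    Y = R

def empirical_risk(h, x, y):
    # R_emp(h) = (1/n) * sum_i loss(h(x_i), y_i), here with the quadratic loss
    return np.mean((h(x) - y) ** 2)

# a tiny hypothesis class: lines y = m*x + b over a coarse grid of (m, b)
H = [(m, b) for m in np.linspace(-5, 5, 21) for b in np.linspace(-2, 2, 9)]
best = min(H, key=lambda mb: empirical_risk(lambda t: mb[0] * t + mb[1], x, y))
print(best)  # should be close to (3.0, 0.5)
```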
The hypothesis class can be very simple
• When x ∈ ℝ³ and y ∈ {−1, 1}
• H is the class of all three-dimensional hyperplanes
• h = w₀ + w₁x₁ + w₂x₂ + w₃x₃
• θ = (w₀, w₁, w₂, w₃) ∈ ℝ⁴
The hypothesis class can be very complex
• Recent years' deep learning architectures result in models with tens of millions of parameters, θ ∈ ℝ^O(10⁷).
The bottom line
Machine Learning applications require us to find good local minima of extremely high-dimensional, noisy, non-convex functions.
ĥ = argmin_{h ∈ H} (1/n) Σ_{i=1}^{n} ℓ(h(x_i), y_i)
We need a principled method for achieving this goal.
Introducing – The Gradient method
Questions?
A couple of notes before we head on
• I stick to the papers' notations.
• From here on we'll use J(θ), which is the same as R_emp(h).
Introducing – The Gradient method
Input: learning rate η, tolerance parameter ε > 0
Initialization: pick θ₀ ∈ ℝ^d arbitrarily
General Step:
• Set θ_{k+1} = θ_k − η·∇J(θ_k)
• If ‖∇J(θ_{k+1})‖ ≤ ε, then STOP and θ_{k+1} is the output
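A minimal Python sketch of the gradient method above, assuming the objective's gradient is available as a function (the toy quadratic objective is an illustrative assumption, not from the slides):

```python
import numpy as np

def gradient_descent(grad_J, theta0, lr=0.1, tol=1e-6, max_iter=10_000):
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        theta = theta - lr * grad_J(theta)          # theta_{k+1} = theta_k - eta * grad J(theta_k)
        if np.linalg.norm(grad_J(theta)) <= tol:    # stopping criterion
            break
    return theta

# toy objective J(theta) = ||theta||^2, whose gradient is 2 * theta
print(gradient_descent(lambda t: 2 * t, theta0=[3.0, -4.0]))
```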
Gradient descent example – simple Linear regression
X = ℝ,  Y = ℝ
{(x_i, y_i)}_{i=1}^{n},  x_i ∈ X,  y_i ∈ Y
H = {mx + b | m ∈ ℝ, b ∈ ℝ},  θ = (m, b)
h = mx + b
The goal is to find "good" m and b values.
Gradient descent example – simple Linear regression
h = mx + b
ℓ(h(x), y) = (y − h(x))²
R_emp(h) = (1/n) Σ_{i=1}^{n} ℓ(h(x_i), y_i)
         = (1/n) Σ_{i=1}^{n} (y_i − h(x_i))²
         = (1/n) Σ_{i=1}^{n} (y_i − (m·x_i + b))²
GD example – computing the gradient
For m:
∂/∂m R_emp(h) = ∂/∂m (1/n) Σ_{i=1}^{n} (y_i − (m·x_i + b))²
             = (1/n) Σ_{i=1}^{n} ∂/∂m (y_i − (m·x_i + b))²
             = (1/n) Σ_{i=1}^{n} 2·(y_i − (m·x_i + b))·(−x_i)
so that  ∂J/∂m = −(2/n) Σ_{i=1}^{n} x_i·(y_i − (m·x_i + b))
Similarly for b:
∂J/∂b = −(2/n) Σ_{i=1}^{n} (y_i − (m·x_i + b))
GD example – computing the gradient
So the gradient vector is:
∇J(θ) = (∂J/∂m, ∂J/∂b)
      = ( −(2/n) Σ_{i=1}^{n} x_i·(y_i − (m·x_i + b)),  −(2/n) Σ_{i=1}^{n} (y_i − (m·x_i + b)) )
GD example – the complete algorithm
Input: learning rate η, tolerance parameter ε > 0
Initialization: pick θ₀ = (m, b) ∈ ℝ² arbitrarily
General Step:
• Set θ_{k+1} = θ_k − η·∇J(θ_k), i.e.
  • m_{k+1} = m_k + η·(2/n) Σ_{i=1}^{n} x_i·(y_i − (m·x_i + b))
  • b_{k+1} = b_k + η·(2/n) Σ_{i=1}^{n} (y_i − (m·x_i + b))
• If ‖∇J(θ_{k+1})‖ ≤ ε, then STOP and θ_{k+1} is the output
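A minimal NumPy sketch of the derived update for simple linear regression; the synthetic data and hyperparameters below are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
y = 2.0 * x + 1.0 + rng.normal(0, 0.1, 200)        # true m = 2, b = 1

m, b = 0.0, 0.0
lr, tol = 0.1, 1e-6
for _ in range(5000):
    resid = y - (m * x + b)
    grad_m = -(2 / len(x)) * np.sum(x * resid)     # dJ/dm
    grad_b = -(2 / len(x)) * np.sum(resid)         # dJ/db
    m, b = m - lr * grad_m, b - lr * grad_b        # simultaneous update of (m, b)
    if np.hypot(grad_m, grad_b) <= tol:            # stop when the gradient is small
        break
print(m, b)   # should approach (2.0, 1.0)
```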
GD example – visualized
Variants of Gradient descent
• Differ in how much data we use to compute the gradients of the objective function.
• A trade-off between the accuracy of the update and the computation time per update.
Batch Gradient descent
• Computes the gradients of the function w.r.t. the parameters θ for the entire training dataset.
θ = θ − η·∇_θ J(θ; x^(1:n), y^(1:n))
Batch Gradient descent – Pros and Cons
Pros
• Guaranteed to converge to a global/local minimum.
• An unbiased estimate of the gradients.
Cons
• Possibly slow or impossible to compute.
• Some examples may be redundant.
• Converges to the minimum of the basin the parameters are placed in.
Stochastic Gradient descent (SGD)
• Computes the gradients of the function w.r.t. the parameters θ for a single training sample.
θ = θ − η·∇_θ J(θ; x^(i), y^(i))
Stochastic Gradient descent – Pros and Cons
Pros
• Much faster to compute.
• Potential to jump to better basins (and better local minima).
Cons
• High variance that causes the objective to fluctuate heavily.
Mini-batch Gradient descent
• Computes the gradients of the function w.r.t. the parameters θ for a mini-batch of k training samples.
• k is usually 32 to 256.
θ = θ − η·∇_θ J(θ; x^(i:i+k), y^(i:i+k))
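A minimal sketch of one epoch of mini-batch gradient descent; grad_J here is a placeholder for a function returning ∇_θ J(θ; x_batch, y_batch) and is an assumption, not code from the slides:

```python
import numpy as np

def minibatch_epoch(theta, grad_J, X, y, lr=0.01, batch_size=64, rng=None):
    rng = rng or np.random.default_rng()
    idx = rng.permutation(len(X))                   # shuffle the training set once per epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        theta = theta - lr * grad_J(theta, X[batch], y[batch])   # one mini-batch update
    return theta
```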
Mini-batch Gradient descent – Pros and Cons
• The "best" of both worlds: fast, exploratory, and allows for stable convergence.
• Makes use of highly optimized matrix libraries and hardware.
• The method of choice for most Supervised Machine Learning scenarios.
Variants of Gradient descent - visualizations
It's all Mini-batch from here
• The remainder of the presentation will focus on variants of the Mini-batch version.
• From here on, Gradient descent, SGD, and Gradient step all refer to the Mini-batch variant.
• We'll leave out the parameters x^(i:i+k), y^(i:i+k) for simplicity.
Challenges and limitations of the plain SGD
• Choosing a proper learning rate.
• Sharing the learning rate for all parameters.
• Optimization in the face of highly non-convex functions.
Questions?
Novelties over the plain SGD
• Momentum
• Nesterov accelerated gradient (NAG)
• AdaGrad (Adaptive Gradient)
• AdaDelta and RMSProp
We will focus only on algorithms that are feasible to compute in practice for high-dimensional data sets (and will ignore second-order methods such as Newton's method).
Momentum (Qian, N. 1999)
• Plain SGD can make erratic updates on non-smooth loss functions.
• Consider an outlier example which "throws off" the learning process.
• Need to maintain some history of updates.
• Physics example: a moving ball acquires "momentum", at which point it becomes less sensitive to the direct force (gradient).
Momentum (Qian, N. 1999)
• Add a fraction γ (usually about 0.9) of the update vector of the past time step to the current update vector.
• Faster convergence and reduced oscillations.
v_t = γ·v_{t−1} + η·∇_θ J(θ)
θ = θ − v_t = θ − γ·v_{t−1} − η·∇_θ J(θ)
Notation: J(θ) - objective function; θ ∈ ℝ^d - parameters; ∇_θ J(θ) - gradient vector; η - learning rate; vanilla update: θ = θ − η·∇_θ J(θ)
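A minimal sketch of the momentum step above; grad_J stands for the mini-batch gradient at the current parameters (an assumed helper, not from the slides):

```python
import numpy as np

def momentum_step(theta, v, grad_J, lr=0.01, gamma=0.9):
    v = gamma * v + lr * grad_J(theta)   # v_t = gamma * v_{t-1} + eta * grad J(theta)
    theta = theta - v                    # theta = theta - v_t
    return theta, v
```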
Momentum (Qian, N. 1999)
(a) SGD without momentum (b) SGD with momentum
Nesterov accelerated gradient (Nesterov, Y. 1983)
• Momentum is usually pretty high once we get near our goal point.
• The algorithm has no idea when to slow down and therefore might miss the goal point.
• We would like our momentum to have a kind of foresight.
Momentum has no idea when to slow down
Nesterov accelerated gradient (Nesterov, Y. 1983)
• First make a jump based on our previous momentum, calculate the gradients, and then make a correction.
• Look ahead by calculating the gradient not w.r.t. our current parameters but w.r.t. the approximate future position of our parameters.
v_t = γ·v_{t−1} + η·∇_θ J(θ − γ·v_{t−1})
θ = θ − v_t
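A minimal sketch of the Nesterov step above; the only change from plain momentum is that the gradient is evaluated at the look-ahead point θ − γ·v_{t−1} (grad_J is again an assumed helper):

```python
def nag_step(theta, v, grad_J, lr=0.01, gamma=0.9):
    lookahead = theta - gamma * v            # approximate future position of the parameters
    v = gamma * v + lr * grad_J(lookahead)   # v_t = gamma*v_{t-1} + eta*grad J(theta - gamma*v_{t-1})
    theta = theta - v
    return theta, v
```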
Nesterov accelerated gradient (Nesterov, Y. 1983)
Adagrad (Duchi, J., Hazan, E., & Singer, Y. 2011)
• Now we are able to adapt our updates to the slope.
• But updates are the same for all the parameters being updated.
• We would like to adapt our updates to each individual parameter, depending on its importance.
Adagrad (Duchi, J., Hazan, E., & Singer, Y. 2011)
• Perform larger updates for infrequent parameters and smaller updates for frequent ones.
• Use a different learning rate for every parameter θ_i, at every time step t.
• Well-suited for dealing with sparse data.
• Eliminates the need to manually tune the learning rate (most just use 0.01).
Adagrad (Duchi, J., Hazan, E., & Singer, Y. 2011)
Per-parameter update:
g_{t,i} = ∇_θ J(θ_{t,i})
θ_{t+1,i} = θ_{t,i} − η/√(G_{t,ii} + ε) · g_{t,i}
Vectorized form:
θ_{t+1} = θ_t − η/√(G_t + ε) ⊙ g_t
• g_{t,i} - the gradient w.r.t. the parameter θ_i, at time step t.
• G_t ∈ ℝ^{d×d} - a diagonal matrix, where G_{t,ii} is the sum of squared gradients w.r.t. θ_i up to time step t.
• ε - prevents division by zero (on the order of 1e-8).
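A minimal sketch of the element-wise Adagrad update, keeping only the diagonal of G_t as a per-parameter accumulator (g is the current gradient vector):

```python
import numpy as np

def adagrad_step(theta, G, g, lr=0.01, eps=1e-8):
    G = G + g ** 2                               # running sum of squared gradients (diagonal of G_t)
    theta = theta - lr * g / np.sqrt(G + eps)    # per-parameter effective learning rate
    return theta, G
```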
Adagrad vs. Plain SGD
Adagrad:   θ_{t+1,i} = θ_{t,i} − η/√(G_{t,ii} + ε) · ∇_θ J(θ_{t,i})
Plain SGD: θ_{t+1,i} = θ_{t,i} − η · ∇_θ J(θ_{t,i})
Adadelta (Zeiler, M. D. 2012) and RMSProp (Hinton)
• In Adagrad, the accumulated sum keeps growing. This causes the learning rate to shrink and become infinitesimally small, impeding convergence.
• We need an efficient way of counteracting this aggressive, monotonically decreasing learning rate.
Adadelta (Zeiler, M. D. 2012) and RMSProp (Hinton)
• Recursively define a decaying average of past squared gradients.
• The running average at time step t depends only on the previous time step and the current gradient.
Adadelta (Zeiler, M. D. 2012) and RMSProp (Hinton)
E[g²]_t = γ·E[g²]_{t−1} + (1 − γ)·g_t²
θ_{t+1} = θ_t − η/√(E[g²]_t + ε) · g_t
• E[g²]_t - the running average of squared gradients at time step t.
• g_t = ∇_θ J(θ_t)
• γ - decay (momentum) coefficient.
• ε - prevents division by zero (on the order of 1e-8).
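A minimal sketch of the RMSProp update above, with E[g²]_t kept as a running array Eg2 (g is the current gradient vector):

```python
import numpy as np

def rmsprop_step(theta, Eg2, g, lr=0.001, gamma=0.9, eps=1e-8):
    Eg2 = gamma * Eg2 + (1 - gamma) * g ** 2     # decaying average of squared gradients
    theta = theta - lr * g / np.sqrt(Eg2 + eps)
    return theta, Eg2
```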
Adadelta (Zeiler, M. D. 2012)
Visualizations of the discussed algorithms
Questions?
Adam (Kingma, D. P., & Ba, J. 2014)
• RMSProp allows for an adaptive per-parameter update, but the update itself is still done using the current, "noisy", gradient.
• We would like the gradient itself to be replaced by a similar exponentially decaying average of past gradients.
Adam – Update rule
m_t = β₁·m_{t−1} + (1 − β₁)·g_t
v_t = β₂·v_{t−1} + (1 − β₂)·g_t²
θ_{t+1} = θ_t − η/(√v_t + ε) · m_t
• m_t and v_t are estimates of the first moment (the mean) and the second moment (the uncentered variance) of the gradients, respectively.
• Recommended values in the paper are β₁ = 0.9, β₂ = 0.999, ε = 1e-8.
Adam – Bias towards zero
• As m_t and v_t are initialized as 0's, they are biased towards zero.
• Most significant in the initial steps.
• Most significant when β₁ and β₂ are close to 1.
• A correction is required.
Adam – Bias correction
m̂_t = m_t / (1 − β₁ᵗ),   v̂_t = v_t / (1 − β₂ᵗ)
θ_{t+1} = θ_t − η/(√v̂_t + ε) · m̂_t
• Correct both moments to get the final update rule.
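A minimal sketch of a single Adam step with bias correction, following the update rule above (default hyperparameters as recommended in the paper; g is the current gradient vector, t the step count starting at 1):

```python
import numpy as np

def adam_step(theta, m, v, g, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g          # first-moment estimate
    v = beta2 * v + (1 - beta2) * g ** 2     # second-moment estimate
    m_hat = m / (1 - beta1 ** t)             # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)             # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```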
Adam vs. Adadelta
Adam:     θ_{t+1} = θ_t − η/(√v̂_t + ε) · m̂_t
Adadelta: θ_{t+1} = θ_t − η/√(E[g²]_t + ε) · g_t
where
m_t = β₁·m_{t−1} + (1 − β₁)·g_t
v_t = β₂·v_{t−1} + (1 − β₂)·g_t²
g_t = ∇_θ J(θ_t)
Adam - Performance
Questions?
Improving Adam
• NAdam - incorporating Nesterov momentum into Adam.
• AdamW - decoupling weight decay (λ is the weight-decay coefficient):
  θ_{t+1} = θ_t − η/(√v̂_t + ε) · m̂_t − η·λ·θ_t
• AMSGrad - fixing the exponential moving average (a sketch follows this list):
  v̂_t = max(v̂_{t−1}, v_t)
  θ_{t+1} = θ_t − η/(√v̂_t + ε) · m_t
  • Maximum of past second-moment estimates instead of their exponential moving average.
• Adam with warm restarts.
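A minimal sketch of the AMSGrad variant mentioned above; the only change relative to the Adam step is keeping the element-wise maximum of the second-moment estimates (g is the current gradient vector):

```python
import numpy as np

def amsgrad_step(theta, m, v, v_hat, g, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    v_hat = np.maximum(v_hat, v)                       # non-decreasing second-moment estimate
    theta = theta - lr * m / (np.sqrt(v_hat) + eps)
    return theta, m, v, v_hat
```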
Additional approaches
• Snapshot ensembles
• Learning to optimize
Shampoo (Gupta, V., Koren, T., & Singer, Y. 2018)
Summary
• A brief walkthrough of Supervised Machine Learning.
• An argument for the importance and relevance of Gradient methods.
• An overview of modern Gradient descent optimization algorithms.
• The contribution of Adam.
• Innovations that followed Adam.
Questions?
The End
Links
• http://ruder.io/optimizing-gradient-descent/index.html
• http://ruder.io/deep-learning-optimization-2017/
• https://imgur.com/a/Hqolp
• https://towardsdatascience.com/gradient-descent-algorithm-and-its-variants-10f652806a3
• https://stats.stackexchange.com/questions/153531/what-is-batch-size-in-neural-network
• https://mathematica.stackexchange.com/questions/9928/identifying-critical-points-lines-of-2-3d-image-cubes
• https://towardsdatascience.com/stochastic-gradient-descent-with-momentum-a84097641a5d
• https://distill.pub/2017/momentum/
• https://towardsdatascience.com/applied-deep-learning-part-1-artificial-neural-networks-d7834f67a4f6
• https://meetshah1995.github.io/semantic-segmentation/deep-learning/pytorch/visdom/2017/06/01/semantic-segmentation-over-the-years.html
• https://github.com/mattnedrich/GradientDescentExample
References
• Ruder, S. (2016). An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747.
• Dauphin, Y. N., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., & Bengio, Y. (2014). Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. arXiv, pages 1-14.
• Dozat, T. (2016). Incorporating Nesterov Momentum into Adam. ICLR Workshop, (1):2013-2016.
• Kingma, D. P., & Ba, J. L. (2015). Adam: A Method for Stochastic Optimization. International Conference on Learning Representations, pages 1-13.
• Nesterov, Y. A method for unconstrained convex minimization problem with the rate of convergence O(1/k²). Doklady AN SSSR (translated as Soviet Math. Dokl.), 269:543-547.
• Qian, N. (1999). On the momentum term in gradient descent learning algorithms. Neural Networks: The Official Journal of the International Neural Network Society, 12(1):145-151.
• Zeiler, M. D. (2012). ADADELTA: An Adaptive Learning Rate Method. arXiv preprint arXiv:1212.5701.
• Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul), 2121-2159.