Adam: A Method for Stochastic Optimization
Diederik P. Kingma, Jimmy Ba
Presented by Dor Ringel

Page 1:

Adam: A Method for Stochastic Optimization

Diederik P. Kingma, Jimmy Ba
Presented by Dor Ringel

Page 2:

Content

• Background
  • Supervised ML theory and the importance of optimum finding
  • Gradient descent and its variants
  • Limitations of SGD
• Earlier approaches / Building blocks
  • Momentum
  • Nesterov accelerated gradient (NAG)
  • AdaGrad (Adaptive Gradient)
  • AdaDelta and RMSProp
• Adam
  • Update rule
  • Bias correction
  • AdaMax
• Post-Adam innovations
  • Improving Adam
  • Additional approaches
  • Shampoo

Page 3:

Content

• Background
  • Supervised ML theory and the importance of optimum finding
  • Gradient descent and its variants
  • Limitations of SGD
• Earlier approaches / Building blocks
  • Momentum
  • Nesterov accelerated gradient (NAG)
  • AdaGrad (Adaptive Gradient)
  • AdaDelta and RMSProp
• Adam
  • Update rule
  • Bias correction
  • AdaMax
• Post-Adam innovations
  • Improving Adam
  • Additional approaches
  • Shampoo

Page 4:

Page 5:

Page 6:

Basic Supervised Machine Learning Terminology

Notation        Explanation
X               Instance space
Y               Label space
X ∼ D           Unknown probability distribution D over the instance space
f : X ⟶ Y       True mapping between instance and label spaces, unknown

Page 7:

Examples of instance and label spaces

X                                                       Y
The space of all RGB images of some dimension           Is there a cat in the image? ({0, 1})
The space of a stock's historical price sequences       The stock's next-day closing price ([0, ∞))
The space of all finite Chinese sentences               The set of all finite English sentences
The space of all information regarding two companies    The probability that a merger will be successful
The space of all MRI images                             A probability, location and type of a tumor
The space of all finite-length voice sequences          The corresponding Amazon product being referred to
...                                                     ...

Page 8:

Basic Supervised Machine Learning Terminology

Notation        Explanation
X               Instance space
Y               Label space
X ∼ D           Unknown probability distribution D over the instance space
f : X ⟶ Y       True mapping between instance and label spaces, unknown

Page 9:

Basic Supervised Machine Learning Terminology

Notation                                     Explanation
X                                            Instance space
Y                                            Label space
X ∼ D                                        Unknown probability distribution D over the instance space
f : X ⟶ Y                                    True mapping between instance and label spaces, unknown
{(x_i, y_i)}_{i=1}^{n}, x_i ∈ X, y_i ∈ Y     Training set
h : X ⟶ Y, h ∈ H                             Hypothesis, the object we wish to learn

Page 10:

Basic Supervised Machine Learning theory

• The goal is to find a hypothesis h that approximates f as well as possible, but approximates in what sense?

• We need a way for evaluating a hypothesis’ quality

Page 11:

Basic Supervised Machine Learning theory

Let us define a loss function ℓ : Y × Y ⟶ [0, ∞)

• For example:
  • Zero-one loss: 1[h(x_i) ≠ y_i]
  • Quadratic loss: (h(x_i) − y_i)²

Here we use y_i = f(x_i)

• A measure of “how bad” the hypothesis did on a single sample

Page 12:

Basic Supervised Machine Learning theory

Let us also define the Generalization error (aka Risk) of a hypothesis h:

L_D(h) = E_{x∼D}[ ℓ( h(x), f(x) ) ]

• A measure of "how bad" the hypothesis did on the entire instance space.
• A good hypothesis is one with a low Risk value.

Page 13:

Basic Supervised Machine Learning theory

Let us also define the Generalization error (aka Risk) of a hypothesis h:

L_D(h) = E_{x∼D}[ ℓ( h(x), f(x) ) ]

• A measure of "how bad" the hypothesis did on the entire instance space.
• A good hypothesis is one with a low Risk value.
• Unfortunately…

Page 14:

Basic Supervised Machine Learning theory

Unfortunately, the true Risk L_D(h) cannot be computed because the distribution D is unknown to the learning algorithm.

Page 15:

Basic Supervised Machine Learning theory

Unfortunately, the true Risk L_D(h) cannot be computed because the distribution D is unknown to the learning algorithm.

We can, however, compute a proxy of the true Risk, called the Empirical Risk.

Page 16:

Basic Supervised Machine Learning theory

Let us define the Empirical Risk of a hypothesis h:

L_emp(h) = (1/n) Σ_{i=1}^{n} ℓ( h(x_i), f(x_i) )

• Recall: {(x_i, y_i)}_{i=1}^{n}, x_i ∈ X, y_i ∈ Y is the training set
• L_D(h) = E_{x∼D}[ ℓ( h(x), f(x) ) ]

Page 17:

Empirical Risk Minimization (ERM) strategy

• After mountains of theory (PAC learning, VC theory, etc.), the following theorem is proven (stated here very informally):

The “best” strategy for a learning algorithm is to minimize the Empirical Risk

Page 18:

Empirical Risk Minimization strategy (ERM)

This means that the learning algorithm defined by the ERM principle boils down to solving the following optimization problem:

Find ĥ such that:

ĥ = argmin_{h ∈ H} L_emp(h) = argmin_{h ∈ H} (1/n) Σ_{i=1}^{n} ℓ( h(x_i), f(x_i) )
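To make the ERM objective above concrete, here is a minimal NumPy sketch that scores a handful of candidate hypotheses by their empirical risk and returns the argmin; the toy data, the candidate slopes, and the squared loss are illustrative assumptions, not part of the slides.

import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])   # toy training inputs
y = np.array([0.1, 0.9, 2.1, 2.9])   # toy training labels

def empirical_risk(h, x, y):
    # L_emp(h) = (1/n) * sum of the (squared) loss over the training set
    return np.mean((y - h(x)) ** 2)

# A tiny, finite hypothesis class: h_m(x) = m * x for a few candidate slopes.
hypotheses = [lambda x, m=m: m * x for m in (0.5, 1.0, 1.5)]
risks = [empirical_risk(h, x, y) for h in hypotheses]
h_hat = hypotheses[int(np.argmin(risks))]   # ERM: pick the empirical-risk minimizer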

Page 19:

The hypothesis class can be very simple

• When x ∈ ℝ³ and y ∈ {−1, 1}
• H is the class of all three-dimensional hyperplanes
• h = w₀ + w₁x₁ + w₂x₂ + w₃x₃
• θ = {w₀, w₁, w₂, w₃} ∈ ℝ⁴

Page 20:

The hypothesis class can be very complex

• Recent years' deep learning architectures result in models with tens of millions of parameters, θ ∈ ℝ^{O(10⁷)}

Page 21:

The bottom line

Machine Learning applications require us to find good local minima of extremely high-dimensional, noisy, non-convex functions.

ĥ = argmin_{h ∈ H} (1/n) Σ_{i=1}^{n} ℓ( h(x_i), f(x_i) )

Page 22:

The bottom line

Machine Learning applications require us to find good local minima of extremely high-dimensional, noisy, non-convex functions.

We need a principled method for achieving this goal.

ĥ = argmin_{h ∈ H} (1/n) Σ_{i=1}^{n} ℓ( h(x_i), f(x_i) )

Page 23:

Introducing – The Gradient method

Page 24:

Questions?

Page 25:

Content

• Background
  • Supervised ML theory and the importance of optimum finding
  • Gradient descent and its variants
  • Limitations of SGD
• Earlier approaches / Building blocks
  • Momentum
  • Nesterov accelerated gradient (NAG)
  • AdaGrad (Adaptive Gradient)
  • AdaDelta and RMSProp
• Adam
  • Update rule
  • Bias correction
  • AdaMax
• Post-Adam innovations
  • Improving Adam
  • Additional approaches
  • Shampoo

Page 26:

A couple of notes before we head on

• I stick to the papers' notations.
• From here on we'll use J(θ), which is the same as L_emp(h).

Page 27:

Introducing – The Gradient method

Input: learning rate η, tolerance parameter ε > 0

Initialization: pick θ₀ ∈ ℝᵈ arbitrarily

General Step:

• Set θ_{t+1} = θ_t − η ∇J(θ_t)
• If ‖∇J(θ_{t+1})‖ ≤ ε, then STOP and θ_{t+1} is the output
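A minimal NumPy sketch of the general step above, run on a simple quadratic objective; the objective, learning rate, and tolerance are illustrative choices, not values taken from the slides.

import numpy as np

def gradient_descent(grad, theta0, eta=0.1, eps=1e-6, max_iter=10_000):
    # Plain gradient method: theta_{t+1} = theta_t - eta * grad(theta_t),
    # stopping once the gradient norm falls below the tolerance eps.
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        theta = theta - eta * grad(theta)
        if np.linalg.norm(grad(theta)) <= eps:
            break
    return theta

# Example: J(theta) = 0.5 * ||theta||^2, whose gradient is theta itself.
print(gradient_descent(lambda th: th, theta0=[3.0, -2.0]))   # converges to ~[0, 0]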

Page 28:

Gradient descent example – simple Linear regression

X = ℝ, Y = ℝ

{(x_i, y_i)}_{i=1}^{n}, x_i ∈ X, y_i ∈ Y

H = { mx + b | m ∈ ℝ, b ∈ ℝ }, θ = (m, b)

h = mx + b

Page 29:

Gradient descent example – simple Linear regression

X = ℝ, Y = ℝ

{(x_i, y_i)}_{i=1}^{n}, x_i ∈ X, y_i ∈ Y

H = { mx + b | m ∈ ℝ, b ∈ ℝ }, θ = (m, b)

h = mx + b

The goal is to find "good" m and b values

Page 30:

Gradient descent example – simple Linear regression

h = mx + b

ℓ( h(x), y ) = ( y − h(x) )²

L_emp(h) = (1/n) Σ_{i=1}^{n} ℓ( h(x_i), f(x_i) ) = (1/n) Σ_{i=1}^{n} ( y_i − h(x_i) )² = (1/n) Σ_{i=1}^{n} ( y_i − (m x_i + b) )²

Page 31:

GD example – computing the gradient

• For m:

  ∂/∂m L_emp(h) = ∂/∂m (1/n) Σ_{i=1}^{n} ( y_i − (m x_i + b) )²
                = (1/n) Σ_{i=1}^{n} ∂/∂m ( y_i − (m x_i + b) )²
                = (1/n) Σ_{i=1}^{n} 2 ( y_i − (m x_i + b) ) (−x_i)

  ∂J/∂m = −(2/n) Σ_{i=1}^{n} x_i ( y_i − (m x_i + b) )

• Similarly for b:

  ∂J/∂b = −(2/n) Σ_{i=1}^{n} ( y_i − (m x_i + b) )

Page 32:

GD example – computing the gradient

So the gradient vector is:

∇J(θ) = ( ∂J/∂m, ∂J/∂b ) = ( −(2/n) Σ_{i=1}^{n} x_i ( y_i − (m x_i + b) ), −(2/n) Σ_{i=1}^{n} ( y_i − (m x_i + b) ) )

Page 33:

GD example – the complete algorithm

Input: learning rate η, tolerance parameter ε > 0

Initialization: pick θ₀ = (m, b) ∈ ℝ² arbitrarily

General Step:

• Set θ_{t+1} = θ_t − η ∇J(θ_t), i.e.:
  • m_{t+1} = m_t + η (2/n) Σ_{i=1}^{n} x_i ( y_i − (m x_i + b) )
  • b_{t+1} = b_t + η (2/n) Σ_{i=1}^{n} ( y_i − (m x_i + b) )
• If ‖∇J(θ_{t+1})‖ ≤ ε, then STOP and θ_{t+1} is the output
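A minimal NumPy sketch of the complete algorithm above, plugging in the gradient expressions for m and b that were just derived; the synthetic data and hyperparameter values are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=100)   # noisy samples of y = 2x + 1

m, b, eta = 0.0, 0.0, 0.01
for _ in range(5000):
    resid = y - (m * x + b)
    grad_m = -(2.0 / len(x)) * np.sum(x * resid)   # dJ/dm
    grad_b = -(2.0 / len(x)) * np.sum(resid)       # dJ/db
    m, b = m - eta * grad_m, b - eta * grad_b
    if np.hypot(grad_m, grad_b) <= 1e-6:           # stop when the gradient is tiny
        break

print(m, b)   # should end up close to 2.0 and 1.0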

Page 34:

GD example – visualized

Page 35:

Variants of Gradient descent

• Differ in how much data we use to compute the gradients of the objective function.
• A trade-off between the accuracy of the update and the computation time per update.

Page 36:

Batch Gradient descent

• Computes the gradients of the objective w.r.t. the parameters θ for the entire training dataset.

θ = θ − η · ∇_θ J( θ; x^{(1:n)}, y^{(1:n)} )

Page 37:

Batch Gradient descent – Pros and Cons

Pros
• Guaranteed to converge to the global minimum for convex objectives, and to a local minimum otherwise.
• An unbiased estimate of the gradients.

Cons
• Possibly slow or impossible to compute.
• Some examples may be redundant.
• Converges to the minimum of the basin the parameters are placed in.

Page 38:

Stochastic Gradient descent (SGD)

• Computes the gradients of the objective w.r.t. the parameters θ for a single training sample.

θ = θ − η · ∇_θ J( θ; x^{(i)}, y^{(i)} )

Page 39:

Stochastic Gradient descent – Pros and Cons

Pros
• Much faster to compute.
• Potential to jump to better basins (and better local minima).

Cons
• High variance that causes the objective to fluctuate heavily.

Page 40:

Mini-batch Gradient descent

• Computes the gradients of the objective w.r.t. the parameters θ for a mini-batch of k training samples.
• k is usually 32 to 256.

θ = θ − η · ∇_θ J( θ; x^{(i:i+k)}, y^{(i:i+k)} )
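A minimal sketch of the mini-batch loop, reshuffling the data each epoch and updating on k examples at a time; the grad callback, batch size, and learning rate are placeholders for whatever model is actually being trained.

import numpy as np

def minibatch_sgd(grad, theta, X, Y, eta=0.01, k=32, epochs=10, seed=0):
    # grad(theta, X_batch, Y_batch) should return the mini-batch gradient of J.
    rng = np.random.default_rng(seed)
    n = len(X)
    for _ in range(epochs):
        order = rng.permutation(n)               # reshuffle every epoch
        for start in range(0, n, k):
            batch = order[start:start + k]
            theta = theta - eta * grad(theta, X[batch], Y[batch])
    return theta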

Page 41:

Mini-batch Gradient descent – Pros and Cons

• The "best" of both worlds: fast, exploratory, and allows for stable convergence.
• Makes use of highly optimized matrix-operation libraries and hardware.
• The method of choice for most supervised machine learning scenarios.

Page 42:

Variants of Gradient descent - visualizations

Page 43:

It's all mini-batch from here

• The remainder of the presentation will focus on variants of the mini-batch version.
• From here on, Gradient descent, SGD, and Gradient step all refer to the mini-batch variant.
• We'll leave out the arguments x^{(i:i+k)}, y^{(i:i+k)} for simplicity.

Page 44:

Challenges and limitations of the plain SGD

• Choosing a proper learning rate.
• Sharing the learning rate for all parameters.
• Optimization in the face of highly non-convex functions.

Page 45:

Questions?

Page 46:

Content

• Background
  • Supervised ML theory and the importance of optimum finding
  • Gradient descent and its variants
  • Limitations of SGD
• Earlier approaches / Building blocks
  • Momentum
  • Nesterov accelerated gradient (NAG)
  • AdaGrad (Adaptive Gradient)
  • AdaDelta and RMSProp
• Adam
  • Update rule
  • Bias correction
  • AdaMax
• Post-Adam innovations
  • Improving Adam
  • Additional approaches
  • Shampoo

Page 47:

Novelties over the plain SGD

• Momentum
• Nesterov accelerated gradient (NAG)
• AdaGrad (Adaptive Gradient)
• AdaDelta and RMSProp

We will focus only on algorithms that are feasible to compute in practice for high dimensional data sets (and will ignore second-order methods such as Newton’s method).

Page 48:

Page 49:

Momentum (Qian, N. 1999)

• Plain SGD can make erratic updates on non-smooth loss functions.
  • Consider an outlier example which "throws off" the learning process.
• Need to maintain some history of updates.
• Physics example:
  • A moving ball acquires "momentum", at which point it becomes less sensitive to the direct force (gradient).

Page 50:

Momentum (Qian, N. 1999)

• Add a fraction γ (usually about 0.9) of the update vector of the past time step to the current update vector.
• Faster convergence and reduced oscillations.

v_t = γ v_{t−1} + η · ∇_θ J(θ)
θ = θ − v_t
  = θ − γ v_{t−1} − η · ∇_θ J(θ)

Legend: J(θ) − objective function; θ ∈ ℝᵈ − parameters; ∇_θ J(θ) − gradient vector; η − learning rate; plain SGD: θ = θ − η · ∇_θ J(θ)
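A minimal sketch of one momentum update, kept as a pure function so the velocity v is carried between steps by the caller; the default η and γ are just the commonly quoted values, not mandated by the slide.

import numpy as np

def momentum_step(theta, v, grad, eta=0.01, gamma=0.9):
    # v_t = gamma * v_{t-1} + eta * grad(theta); theta = theta - v_t
    v = gamma * v + eta * grad(theta)
    return theta - v, v

# Usage: start with v = np.zeros_like(theta) and feed the returned v back in.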

Page 51:

Momentum (Qian, N. 1999)

(a) SGD without momentum (b) SGD with momentum

Page 52:

Nesterov accelerated gradient (Nesterov, Y. 1983)

• Momentum is usually pretty high once we get near our goal point.
• The algorithm has no idea when to slow down and therefore might miss the goal point.
• We would like our momentum to have a kind of foresight.

Page 53:

Momentum has no idea when to slow down

Page 54:

Nesterov accelerated gradient (Nesterov, Y. 1983)

• First make a jump based on our previous momentum, calculate the gradients, and then make a correction.
• Look ahead by calculating the gradient not w.r.t. our current parameters but w.r.t. the approximate future position of our parameters.

v_t = γ v_{t−1} + η · ∇_θ J(θ − γ v_{t−1})
θ = θ − v_t

Legend: J(θ) − objective function; θ ∈ ℝᵈ − parameters; ∇_θ J(θ) − gradient vector; η − learning rate; plain SGD: θ = θ − η · ∇_θ J(θ)
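The same sketch with the Nesterov look-ahead: the only change from the momentum step is where the gradient is evaluated. Again, the defaults are illustrative.

import numpy as np

def nag_step(theta, v, grad, eta=0.01, gamma=0.9):
    # Evaluate the gradient at the look-ahead point theta - gamma * v_{t-1}.
    v = gamma * v + eta * grad(theta - gamma * v)
    return theta - v, v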

Page 55:

Nesterov accelerated gradient (Nesterov, Y. 1983)

Page 56:

Adagrad (Duchi, J., Hazan, E., & Singer, Y. 2011)

• Now we are able to adapt our updates to the slope.
• But updates are the same for all the parameters being updated.
• We would like to adapt our updates to each individual parameter depending on their importance.

Page 57:

Adagrad (Duchi, J., Hazan, E., & Singer, Y. 2011)

• Perform larger updates for infrequent parameters and smaller updates for frequent ones.
• Use a different learning rate for every parameter θ_i, at every time step t.
• Well-suited for dealing with sparse data.
• Eliminates the need to manually tune the learning rate (most implementations just use 0.01).

Page 58:

Adagrad (Duchi, J., Hazan, E., & Singer, Y. 2011)

g_{t,i} = ∇_θ J(θ_{t,i})

θ_{t+1,i} = θ_{t,i} − η / √(G_{t,ii} + ε) · g_{t,i}

Legend: J(θ) − objective function; θ ∈ ℝᵈ − parameters; ∇_θ J(θ) − gradient vector; η − learning rate; plain SGD: θ = θ − η · ∇_θ J(θ)

Page 59:

Adagrad (Duchi, J., Hazan, E., & Singer, Y. 2011)

g_{t,i} = ∇_θ J(θ_{t,i})

θ_{t+1,i} = θ_{t,i} − η / √(G_{t,ii} + ε) · g_{t,i}

• g_{t,i} − the gradient w.r.t. the parameter θ_i at time step t.
• G_t ∈ ℝ^{d×d} − a diagonal matrix whose entry G_{t,ii} is the sum of squared gradients w.r.t. θ_i up to time step t.
• ε − prevents division by zero (on the order of 1e-8).

Legend: J(θ) − objective function; θ ∈ ℝᵈ − parameters; ∇_θ J(θ) − gradient vector; η − learning rate; plain SGD: θ = θ − η · ∇_θ J(θ)

Page 60:

Adagrad (Duchi, J., Hazan, E., & Singer, Y. 2011)

g_{t,i} = ∇_θ J(θ_{t,i})

θ_{t+1,i} = θ_{t,i} − η / √(G_{t,ii} + ε) · g_{t,i}

θ_{t+1} = θ_t − η / √(G_t + ε) ⊙ g_t

• g_{t,i} − the gradient w.r.t. the parameter θ_i at time step t.
• G_t ∈ ℝ^{d×d} − a diagonal matrix whose entry G_{t,ii} is the sum of squared gradients w.r.t. θ_i up to time step t.
• ε − prevents division by zero (on the order of 1e-8).

Legend: J(θ) − objective function; θ ∈ ℝᵈ − parameters; ∇_θ J(θ) − gradient vector; η − learning rate; plain SGD: θ = θ − η · ∇_θ J(θ)
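A minimal sketch of the vectorized Adagrad update above; G holds the per-parameter running sum of squared gradients (the diagonal of G_t), and the defaults mirror the values quoted on the slides.

import numpy as np

def adagrad_step(theta, G, grad, eta=0.01, eps=1e-8):
    g = grad(theta)
    G = G + g ** 2                                   # accumulate squared gradients
    return theta - eta / np.sqrt(G + eps) * g, G     # per-parameter step sizes

# Usage: start with G = np.zeros_like(theta) and carry it between steps.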

Page 61:

Adagrad vs. Plain SGD

Adagrad:   θ_{t+1,i} = θ_{t,i} − η / √(G_{t,ii} + ε) · ∇_θ J(θ_{t,i})

Plain SGD: θ_{t+1,i} = θ_{t,i} − η · ∇_θ J(θ_{t,i})

Page 62:

Adadelta (Zeiler, M. D. 2012) and RMSProp (Hinton)

• In Adagrad, the accumulated sum keeps growing. This causes the learning rate to shrink and become infinitesimally small, impeding convergence.
• We need an efficient way of counteracting this aggressive, monotonically decreasing learning rate.

Page 63:

Adadelta (Zeiler, M. D. 2012) and RMSProp (Hinton)

• Recursively define a decaying average of past squared gradients.
• The running average at time step t depends only on the previous time step and the current gradient.

Page 64:

Adadelta (Zeiler, M. D. 2012) and RMSProp (Hinton)

E[g²]_t = γ E[g²]_{t−1} + (1 − γ) g_t²

θ_{t+1} = θ_t − η / √(E[g²]_t + ε) · g_t

Legend: J(θ) − objective function; θ ∈ ℝᵈ − parameters; ∇_θ J(θ) − gradient vector; η − learning rate; γ − momentum coefficient; g_t = ∇_θ J(θ_t); plain SGD: θ = θ − η · ∇_θ J(θ; x, y)

Page 65:

Adadelta (Zeiler, M. D. 2012) and RMSProp (Hinton)

E[g²]_t = γ E[g²]_{t−1} + (1 − γ) g_t²

θ_{t+1} = θ_t − η / √(E[g²]_t + ε) · g_t

• E[g²]_t − the running average of squared gradients at time step t.
• ε − prevents division by zero (on the order of 1e-8).

Legend: J(θ) − objective function; θ ∈ ℝᵈ − parameters; ∇_θ J(θ) − gradient vector; η − learning rate; γ − momentum coefficient; g_t = ∇_θ J(θ_t); plain SGD: θ = θ − η · ∇_θ J(θ; x, y)

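A minimal sketch of the RMSProp/Adadelta-style update above, replacing Adagrad's ever-growing sum with the decaying average E[g²]; the defaults are illustrative.

import numpy as np

def rmsprop_step(theta, Eg2, grad, eta=0.001, gamma=0.9, eps=1e-8):
    g = grad(theta)
    Eg2 = gamma * Eg2 + (1 - gamma) * g ** 2         # decaying average of g^2
    return theta - eta / np.sqrt(Eg2 + eps) * g, Eg2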
Page 66:

Adadelta (Zeiler, M. D. 2012)

Page 67:

Visualizations of the discussed algorithms

Page 68:

Questions?

Page 69:

Content

• Background
  • Supervised ML theory and the importance of optimum finding
  • Gradient descent and its variants
  • Limitations of SGD
• Earlier approaches / Building blocks
  • Momentum
  • Nesterov accelerated gradient (NAG)
  • AdaGrad (Adaptive Gradient)
  • AdaDelta and RMSProp
• Adam
  • Update rule
  • Bias correction
  • AdaMax
• Post-Adam innovations
  • Improving Adam
  • Additional approaches
  • Shampoo

Page 70:

Adam (Kingma D. P, & Ba. J 2014)

• RMSProp allows for an adaptive per-parameter update, but the update itself is still done using the current, "noisy", gradient.

• We would like the gradient itself to be replaced by a similar exponential decaying average of past gradients.

Page 71:

Adam – Update rule

m_t = β₁ m_{t−1} + (1 − β₁) g_t

v_t = β₂ v_{t−1} + (1 − β₂) g_t²

θ_{t+1} = θ_t − η / (√v_t + ε) · m_t

• m_t and v_t are estimates of the first moment (the mean) and the second moment (the uncentered variance) of the gradients, respectively.
• Recommended values in the paper are β₁ = 0.9, β₂ = 0.999, ε = 1e-8.

Legend: J(θ) − objective function; θ ∈ ℝᵈ − parameters; ∇_θ J(θ) − gradient vector; η − learning rate; g_t = ∇_θ J(θ_t); plain SGD: θ = θ − η · ∇_θ J(θ; x, y)

Page 72:

Adam – Bias towards zero

• As m_t and v_t are initialized as 0's, they are biased towards zero.

• Most significant in the initial steps.

• Most significant when β₁ and β₂ are close to 1.

• A correction is required.

Page 73:

Adam – Bias correction

m̂_t = m_t / (1 − β₁ᵗ),  v̂_t = v_t / (1 − β₂ᵗ)

θ_{t+1} = θ_t − η / (√v̂_t + ε) · m̂_t

• Correct both moments to get the final update rule.

Legend: J(θ) − objective function; θ ∈ ℝᵈ − parameters; ∇_θ J(θ) − gradient vector; η − learning rate; g_t = ∇_θ J(θ_t)
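A minimal sketch of one full Adam step with the bias correction above; t is the 1-based step counter, and the β and ε defaults follow the values quoted on the update-rule slide, with an illustrative η.

import numpy as np

def adam_step(theta, m, v, t, grad, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g            # first-moment estimate
    v = beta2 * v + (1 - beta2) * g ** 2       # second-moment estimate
    m_hat = m / (1 - beta1 ** t)               # bias-corrected estimates
    v_hat = v / (1 - beta2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Usage: m = v = np.zeros_like(theta), then call with t = 1, 2, 3, ...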

Page 74:

Adam vs. Adadelta

Adam:     θ_{t+1} = θ_t − η / (√v̂_t + ε) · m̂_t

Adadelta: θ_{t+1} = θ_t − η / √(E[g²]_t + ε) · g_t

m_t = β₁ m_{t−1} + (1 − β₁) g_t
v_t = β₂ v_{t−1} + (1 − β₂) g_t²

Legend: J(θ) − objective function; θ ∈ ℝᵈ − parameters; ∇_θ J(θ) − gradient vector; η − learning rate; g_t = ∇_θ J(θ_t)

Page 75:

Adam - Performance

Page 76:

Adam - Performance

Page 77:

Adam - Performance

Page 78:

Questions?

Page 79:

Content

• Background
  • Supervised ML theory and the importance of optimum finding
  • Gradient descent and its variants
  • Limitations of SGD
• Earlier approaches / Building blocks
  • Momentum
  • Nesterov accelerated gradient (NAG)
  • AdaGrad (Adaptive Gradient)
  • AdaDelta and RMSProp
• Adam
  • Update rule
  • Bias correction
  • AdaMax
• Post-Adam innovations
  • Improving Adam
  • Additional approaches
  • Shampoo

Page 80:

Improving Adam

• Nadam – incorporating Nesterov momentum into Adam.

• AdamW – decoupling weight decay:
  θ_{t+1} = θ_t − η / (√v̂_t + ε) · m̂_t − η w_t θ_t

• AMSGrad – fixing the exponential moving average:
  v̂_t = max(v̂_{t−1}, v_t)
  θ_{t+1} = θ_t − η / (√v̂_t + ε) · m_t
  • Maximum of past squared gradients instead of their exponential moving average.

• Adam with warm restarts
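A minimal sketch of the AMSGrad variant as stated above: identical to an Adam step except that the denominator uses the running maximum of v, so the effective step size can never grow back. Whether the moments are also bias-corrected varies between write-ups; this sketch follows the slide and skips that.

import numpy as np

def amsgrad_step(theta, m, v, v_max, grad, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    v_max = np.maximum(v_max, v)                       # v_hat_t = max(v_hat_{t-1}, v_t)
    return theta - eta * m / (np.sqrt(v_max) + eps), m, v, v_max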

Page 81:

Additional approaches

• Snapshot ensembles
• Learning to optimize

Page 82:

Shampoo (Gupta, V., Koren, T., & Singer, Y. 2018)

Page 83:

Summary

• A brief walkthrough of supervised machine learning.
• A conviction of the importance and relevance of gradient methods.
• An overview of modern gradient descent optimization algorithms.
• The contribution of Adam.
• Innovations subsequent to Adam.

Page 84:

Page 85:

Questions?

Page 86:

The End

Page 87:

Links

• http://ruder.io/optimizing-gradient-descent/index.html
• http://ruder.io/deep-learning-optimization-2017/
• https://imgur.com/a/Hqolp
• https://towardsdatascience.com/gradient-descent-algorithm-and-its-variants-10f652806a3
• https://stats.stackexchange.com/questions/153531/what-is-batch-size-in-neural-network
• https://mathematica.stackexchange.com/questions/9928/identifying-critical-points-lines-of-2-3d-image-cubes
• https://towardsdatascience.com/stochastic-gradient-descent-with-momentum-a84097641a5d
• https://distill.pub/2017/momentum/
• https://towardsdatascience.com/applied-deep-learning-part-1-artificial-neural-networks-d7834f67a4f6
• https://meetshah1995.github.io/semantic-segmentation/deep-learning/pytorch/visdom/2017/06/01/semantic-segmentation-over-the-years.html
• https://github.com/mattnedrich/GradientDescentExample

Page 88:

References

• Ruder, S. (2016). An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747.
• Dauphin, Y. N., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., & Bengio, Y. (2014). Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. arXiv, pages 1–14.
• Dozat, T. (2016). Incorporating Nesterov Momentum into Adam. ICLR Workshop, (1):2013–2016.
• Kingma, D. P., & Ba, J. L. (2015). Adam: A Method for Stochastic Optimization. International Conference on Learning Representations, pages 1–13.
• Nesterov, Y. A method for unconstrained convex minimization problem with the rate of convergence O(1/k²). Doklady AN SSSR (translated as Soviet Math. Dokl.), 269:543–547.
• Qian, N. (1999). On the momentum term in gradient descent learning algorithms. Neural Networks: The Official Journal of the International Neural Network Society, 12(1):145–151.
• Zeiler, M. D. (2012). ADADELTA: An Adaptive Learning Rate Method. arXiv preprint arXiv:1212.5701.
• Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul), 2121–2159.