TRANSCRIPT
Adam: A Method for Stochastic Optimization
Diederik P. Kingma, Jimmy Ba. Presented by Dor Ringel
Content
• Background
  • Supervised ML theory and the importance of optimum finding
  • Gradient descent and its variants
  • Limitations of SGD
• Earlier approaches / Building blocks
  • Momentum
  • Nesterov accelerated gradient (NAG)
  • AdaGrad (Adaptive Gradient)
  • AdaDelta and RMSProp
• Adam
  • Update rule
  • Bias correction
  • AdaMax
• Post-Adam innovations
  • Improving Adam
  • Additional approaches
  • Shampoo
Basic Supervised Machine Learning Terminology
Notation      Explanation
X             Instance space
Y             Label space
x ~ D         Unknown probability distribution over X
f: X ⟶ Y      True mapping between instance and label spaces, unknown
Examples of instance and label spaces
X                                                       Y
The space of all RGB images of some dimension           Is there a cat in the image? ({0,1})
The space of a stock's historical price sequences       The stock's next-day closing price ([0,∞))
The space of all finite Chinese sentences               The set of all finite English sentences
The space of all information regarding two companies    The probability of a merger being successful
The space of all MRI images                             A probability, location and type of a tumor
The space of all finite-length voice sequences          The corresponding Amazon product being referred to
...                                                     ...
Basic Supervised Machine Learning Terminology (continued)
Notation                                      Explanation
{(x_i, y_i)}_{i=1}^{n},  x_i ∈ X,  y_i ∈ Y    Training set
h: X ⟶ Y,  h ∈ H                              Hypothesis, the object we wish to learn
Basic Supervised Machine Learning theory
• The goal is to find a hypothesis h that approximates f as well as possible, but approximates in what sense?
• We need a way of evaluating a hypothesis' quality.
Basic Supervised Machine Learning theory
Let us define a loss function ℓ: Y × Y ⟶ [0, ∞)
• For example:
  • Zero-one loss: 1(h(x_i) ≠ y_i)
  • Quadratic loss: (h(x_i) − y_i)²
  (here we use y_i = f(x_i))
• A measure of "how bad" the hypothesis did on a single sample.
Basic Supervised Machine Learning theory
Let us also define the Generalization error (a.k.a. Risk) of a hypothesis h:
R_D(h) = E_{x~D}[ ℓ(h(x), f(x)) ]
• A measure of "how bad" the hypothesis did on the entire instance space.
• A good hypothesis is one with a low Risk value.
Basic Supervised Machine Learning theory
Unfortunately, the true Risk R_D(h) cannot be computed, because the distribution D is unknown to the learning algorithm.
We can, however, compute a proxy of the true Risk, called the Empirical Risk.
Basic Supervised Machine Learning theory
Let us define the Empirical Risk of a hypothesis h:
R_emp(h) = (1/n) Σ_{i=1}^{n} ℓ(h(x_i), y_i)
• Recall: {(x_i, y_i)}_{i=1}^{n}, x_i ∈ X, y_i ∈ Y is the Training set
• R_D(h) = E_{x~D}[ ℓ(h(x), f(x)) ]
Empirical Risk Minimization (ERM) strategy
• After mountains of theory (PAC Learning, VC Theory, etc.), the following theorem is proven (this is very informal):
The “best” strategy for a learning algorithm is to minimize the Empirical Risk
Empirical Risk Minimization (ERM) strategy
This means that the learning algorithm defined by the ERM principle boils down to solving the following optimization problem:
Find ĥ such that:
ĥ = argmin_{h ∈ H} R_emp(h) = argmin_{h ∈ H} (1/n) Σ_{i=1}^{n} ℓ(h(x_i), y_i)
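To make the ERM principle concrete, here is a minimal NumPy sketch (not part of the original slides); the synthetic data and the tiny grid-search hypothesis class are illustrative assumptions only:

```python
# A minimal sketch of ERM: evaluate R_emp(h) with the quadratic loss and pick
# the hypothesis that minimizes it over a small, finite hypothesis class H.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 100)                   # instances, X = R
y = 3.0 * x + 0.5 + rng.normal(0.0, 0.1, 100)     # labels,    Y = R

def empirical_risk(h, x, y):
    # R_emp(h) = (1/n) * sum_i loss(h(x_i), y_i), here with the quadratic loss
    return np.mean((h(x) - y) ** 2)

# a tiny hypothesis class: lines y = m*x + b over a coarse grid of (m, b)
H = [(m, b) for m in np.linspace(-5, 5, 21) for b in np.linspace(-2, 2, 9)]
best = min(H, key=lambda mb: empirical_risk(lambda t: mb[0] * t + mb[1], x, y))
print(best)  # should be close to (3.0, 0.5)
```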
The hypothesis class can be very simple
• When x ∈ ℝ³ and y ∈ {−1, 1}
• H is the class of all three-dimensional hyperplanes
• h = w₀ + w₁x₁ + w₂x₂ + w₃x₃
• θ = (w₀, w₁, w₂, w₃) ∈ ℝ⁴
The hypothesis class can be very complex
• Recent years' deep learning architectures result in models with tens of millions of parameters, θ ∈ ℝ^O(10⁷).
The bottom line
Machine Learning applications require us to find good local minima of extremely high-dimensional, noisy, non-convex functions.
ĥ = argmin_{h ∈ H} (1/n) Σ_{i=1}^{n} ℓ(h(x_i), y_i)
We need a principled method for achieving this goal.
Introducing – The Gradient method
Questions?
A couple of notes before we head on
• I stick to the papers' notations.
• From here on we'll use J(θ), which is the same as R_emp(h).
Introducing – The Gradient method
Input: learning rate η, tolerance parameter ε > 0
Initialization: pick θ₀ ∈ ℝ^d arbitrarily
General Step:
• Set θ_{k+1} = θ_k − η·∇J(θ_k)
• If ‖∇J(θ_{k+1})‖ ≤ ε, then STOP and θ_{k+1} is the output
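A minimal Python sketch of the gradient method above, assuming the objective's gradient is available as a function (the toy quadratic objective is an illustrative assumption, not from the slides):

```python
import numpy as np

def gradient_descent(grad_J, theta0, lr=0.1, tol=1e-6, max_iter=10_000):
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        theta = theta - lr * grad_J(theta)          # theta_{k+1} = theta_k - eta * grad J(theta_k)
        if np.linalg.norm(grad_J(theta)) <= tol:    # stopping criterion
            break
    return theta

# toy objective J(theta) = ||theta||^2, whose gradient is 2 * theta
print(gradient_descent(lambda t: 2 * t, theta0=[3.0, -4.0]))
```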
Gradient descent example – simple Linear regression
X = ℝ,  Y = ℝ
{(x_i, y_i)}_{i=1}^{n},  x_i ∈ X,  y_i ∈ Y
H = {mx + b | m ∈ ℝ, b ∈ ℝ},  θ = (m, b)
h = mx + b
The goal is to find "good" m and b values.
Gradient descent example – simple Linear regression
h = mx + b
ℓ(h(x), y) = (y − h(x))²
R_emp(h) = (1/n) Σ_{i=1}^{n} ℓ(h(x_i), y_i)
         = (1/n) Σ_{i=1}^{n} (y_i − h(x_i))²
         = (1/n) Σ_{i=1}^{n} (y_i − (m·x_i + b))²
GD example – computing the gradient
For m:
∂/∂m R_emp(h) = ∂/∂m (1/n) Σ_{i=1}^{n} (y_i − (m·x_i + b))²
             = (1/n) Σ_{i=1}^{n} ∂/∂m (y_i − (m·x_i + b))²
             = (1/n) Σ_{i=1}^{n} 2·(y_i − (m·x_i + b))·(−x_i)
so that  ∂J/∂m = −(2/n) Σ_{i=1}^{n} x_i·(y_i − (m·x_i + b))
Similarly for b:
∂J/∂b = −(2/n) Σ_{i=1}^{n} (y_i − (m·x_i + b))
GD example – computing the gradient
So the gradient vector is:
∇J(θ) = (∂J/∂m, ∂J/∂b)
      = ( −(2/n) Σ_{i=1}^{n} x_i·(y_i − (m·x_i + b)),  −(2/n) Σ_{i=1}^{n} (y_i − (m·x_i + b)) )
GD example – the complete algorithm
Input: learning rate η, tolerance parameter ε > 0
Initialization: pick θ₀ = (m, b) ∈ ℝ² arbitrarily
General Step:
• Set θ_{k+1} = θ_k − η·∇J(θ_k), i.e.
  • m_{k+1} = m_k + η·(2/n) Σ_{i=1}^{n} x_i·(y_i − (m·x_i + b))
  • b_{k+1} = b_k + η·(2/n) Σ_{i=1}^{n} (y_i − (m·x_i + b))
• If ‖∇J(θ_{k+1})‖ ≤ ε, then STOP and θ_{k+1} is the output
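A minimal NumPy sketch of the derived update for simple linear regression; the synthetic data and hyperparameters below are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
y = 2.0 * x + 1.0 + rng.normal(0, 0.1, 200)        # true m = 2, b = 1

m, b = 0.0, 0.0
lr, tol = 0.1, 1e-6
for _ in range(5000):
    resid = y - (m * x + b)
    grad_m = -(2 / len(x)) * np.sum(x * resid)     # dJ/dm
    grad_b = -(2 / len(x)) * np.sum(resid)         # dJ/db
    m, b = m - lr * grad_m, b - lr * grad_b        # simultaneous update of (m, b)
    if np.hypot(grad_m, grad_b) <= tol:            # stop when the gradient is small
        break
print(m, b)   # should approach (2.0, 1.0)
```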
GD example – visualized
Variants of Gradient descent
• Differ in how much data we use to compute the gradients of the objective function.
• A trade-off between the accuracy of the update and the computation time per update.
Batch Gradient descent
• Computes the gradients of the function w.r.t. the parameters θ for the entire training dataset.
θ = θ − η·∇_θ J(θ; x^(1:n), y^(1:n))
Batch Gradient descent – Pros and Cons
Pros
• Guaranteed to converge to a global/local minimum.
• An unbiased estimate of the gradients.
Cons
• Possibly slow or impossible to compute.
• Some examples may be redundant.
• Converges to the minimum of the basin the parameters are placed in.
Stochastic Gradient descent (SGD)
• Computes the gradients of the function w.r.t. the parameters θ for a single training sample.
θ = θ − η·∇_θ J(θ; x^(i), y^(i))
Stochastic Gradient descent – Pros and Cons
Pros
• Much faster to compute.
• Potential to jump to better basins (and better local minima).
Cons
• High variance that causes the objective to fluctuate heavily.
Mini-batch Gradient descent
• Computes the gradients of the function w.r.t. the parameters θ for a mini-batch of k training samples.
• k is usually 32 to 256.
θ = θ − η·∇_θ J(θ; x^(i:i+k), y^(i:i+k))
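A minimal sketch of one epoch of mini-batch gradient descent; grad_J here is a placeholder for a function returning ∇_θ J(θ; x_batch, y_batch) and is an assumption, not code from the slides:

```python
import numpy as np

def minibatch_epoch(theta, grad_J, X, y, lr=0.01, batch_size=64, rng=None):
    rng = rng or np.random.default_rng()
    idx = rng.permutation(len(X))                   # shuffle the training set once per epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        theta = theta - lr * grad_J(theta, X[batch], y[batch])   # one mini-batch update
    return theta
```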
Mini-batch Gradient descent – Pros and Cons
• The "best" of both worlds: fast, exploratory, and allows for stable convergence.
• Makes use of highly optimized matrix libraries and hardware.
• The method of choice for most Supervised Machine Learning scenarios.
Variants of Gradient descent - visualizations
It's all Mini-batch from here
• The remainder of the presentation will focus on variants of the Mini-batch version.
• From here on, Gradient descent, SGD, and Gradient step all refer to the Mini-batch variant.
• We'll leave out the parameters x^(i:i+k), y^(i:i+k) for simplicity.
Challenges and limitations of the plain SGD
• Choosing a proper learning rate.
• Sharing the learning rate for all parameters.
• Optimization in the face of highly non-convex functions.
Questions?
Novelties over the plain SGD
• Momentum
• Nesterov accelerated gradient (NAG)
• AdaGrad (Adaptive Gradient)
• AdaDelta and RMSProp
We will focus only on algorithms that are feasible to compute in practice for high-dimensional data sets (and will ignore second-order methods such as Newton's method).
Momentum (Qian, N. 1999)
• Plain SGD can make erratic updates on non-smooth loss functions.
• Consider an outlier example which "throws off" the learning process.
• Need to maintain some history of updates.
• Physics example: a moving ball acquires "momentum", at which point it becomes less sensitive to the direct force (gradient).
Momentum (Qian, N. 1999)
• Add a fraction γ (usually about 0.9) of the update vector of the past time step to the current update vector.
• Faster convergence and reduced oscillations.
v_t = γ·v_{t−1} + η·∇_θ J(θ)
θ = θ − v_t = θ − γ·v_{t−1} − η·∇_θ J(θ)
Notation: J(θ) - objective function; θ ∈ ℝ^d - parameters; ∇_θ J(θ) - gradient vector; η - learning rate; vanilla update: θ = θ − η·∇_θ J(θ)
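A minimal sketch of the momentum step above; grad_J stands for the mini-batch gradient at the current parameters (an assumed helper, not from the slides):

```python
import numpy as np

def momentum_step(theta, v, grad_J, lr=0.01, gamma=0.9):
    v = gamma * v + lr * grad_J(theta)   # v_t = gamma * v_{t-1} + eta * grad J(theta)
    theta = theta - v                    # theta = theta - v_t
    return theta, v
```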
Momentum (Qian, N. 1999)
(a) SGD without momentum (b) SGD with momentum
Nesterov accelerated gradient (Nesterov, Y. 1983)
• Momentum is usually pretty high once we get near our goal point.
• The algorithm has no idea when to slow down and therefore might miss the goal point.
• We would like our momentum to have a kind of foresight.
Momentum has no idea when to slow down
Nesterov accelerated gradient (Nesterov, Y. 1983)
• First make a jump based on our previous momentum, calculate the gradients, and then make a correction.
• Look ahead by calculating the gradient not w.r.t. our current parameters but w.r.t. the approximate future position of our parameters.
v_t = γ·v_{t−1} + η·∇_θ J(θ − γ·v_{t−1})
θ = θ − v_t
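A minimal sketch of the Nesterov step above; the only change from plain momentum is that the gradient is evaluated at the look-ahead point θ − γ·v_{t−1} (grad_J is again an assumed helper):

```python
def nag_step(theta, v, grad_J, lr=0.01, gamma=0.9):
    lookahead = theta - gamma * v            # approximate future position of the parameters
    v = gamma * v + lr * grad_J(lookahead)   # v_t = gamma*v_{t-1} + eta*grad J(theta - gamma*v_{t-1})
    theta = theta - v
    return theta, v
```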
Nesterov accelerated gradient (Nesterov, Y. 1983)
Adagrad (Duchi, J., Hazan, E., & Singer, Y. 2011)
• Now we are able to adapt our updates to the slope.
• But updates are the same for all the parameters being updated.
• We would like to adapt our updates to each individual parameter, depending on its importance.
Adagrad (Duchi, J., Hazan, E., & Singer, Y. 2011)
• Perform larger updates for infrequent parameters and smaller updates for frequent ones.
• Use a different learning rate for every parameter θ_i, at every time step t.
• Well-suited for dealing with sparse data.
• Eliminates the need to manually tune the learning rate (most just use 0.01).
Adagrad (Duchi, J., Hazan, E., & Singer, Y. 2011)
Per-parameter update:
g_{t,i} = ∇_θ J(θ_{t,i})
θ_{t+1,i} = θ_{t,i} − η/√(G_{t,ii} + ε) · g_{t,i}
Vectorized form:
θ_{t+1} = θ_t − η/√(G_t + ε) ⊙ g_t
• g_{t,i} - the gradient w.r.t. the parameter θ_i, at time step t.
• G_t ∈ ℝ^{d×d} - a diagonal matrix, where G_{t,ii} is the sum of squared gradients w.r.t. θ_i up to time step t.
• ε - prevents division by zero (on the order of 1e-8).
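A minimal sketch of the element-wise Adagrad update, keeping only the diagonal of G_t as a per-parameter accumulator (g is the current gradient vector):

```python
import numpy as np

def adagrad_step(theta, G, g, lr=0.01, eps=1e-8):
    G = G + g ** 2                               # running sum of squared gradients (diagonal of G_t)
    theta = theta - lr * g / np.sqrt(G + eps)    # per-parameter effective learning rate
    return theta, G
```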
Adagrad vs. Plain SGD
Adagrad:   θ_{t+1,i} = θ_{t,i} − η/√(G_{t,ii} + ε) · ∇_θ J(θ_{t,i})
Plain SGD: θ_{t+1,i} = θ_{t,i} − η · ∇_θ J(θ_{t,i})
Adadelta (Zeiler, M. D. 2012) and RMSProp (Hinton)
• In Adagrad, the accumulated sum keeps growing. This causes the learning rate to shrink and become infinitesimally small, impeding convergence.
• We need an efficient way of counteracting this aggressive, monotonically decreasing learning rate.
Adadelta (Zeiler, M. D. 2012) and RMSProp (Hinton)
• Recursively define a decaying average of past squared gradients.
• The running average at time step t depends only on the previous time step and the current gradient.
Adadelta (Zeiler, M. D. 2012) and RMSProp (Hinton)
E[g²]_t = γ·E[g²]_{t−1} + (1 − γ)·g_t²
θ_{t+1} = θ_t − η/√(E[g²]_t + ε) · g_t
• E[g²]_t - the running average of squared gradients at time step t.
• g_t = ∇_θ J(θ_t)
• γ - decay (momentum) coefficient.
• ε - prevents division by zero (on the order of 1e-8).
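A minimal sketch of the RMSProp update above, with E[g²]_t kept as a running array Eg2 (g is the current gradient vector):

```python
import numpy as np

def rmsprop_step(theta, Eg2, g, lr=0.001, gamma=0.9, eps=1e-8):
    Eg2 = gamma * Eg2 + (1 - gamma) * g ** 2     # decaying average of squared gradients
    theta = theta - lr * g / np.sqrt(Eg2 + eps)
    return theta, Eg2
```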
Adadelta (Zeiler, M. D. 2012)
Visualizations of the discussed algorithms
Questions?
Adam (Kingma, D. P., & Ba, J. 2014)
• RMSProp allows for an adaptive per-parameter update, but the update itself is still done using the current, "noisy", gradient.
• We would like the gradient itself to be replaced by a similar exponentially decaying average of past gradients.
Adam – Update rule
m_t = β₁·m_{t−1} + (1 − β₁)·g_t
v_t = β₂·v_{t−1} + (1 − β₂)·g_t²
θ_{t+1} = θ_t − η/(√v_t + ε) · m_t
• m_t and v_t are estimates of the first moment (the mean) and the second moment (the uncentered variance) of the gradients, respectively.
• Recommended values in the paper are β₁ = 0.9, β₂ = 0.999, ε = 1e-8.
Adam – Bias towards zero
• As m_t and v_t are initialized as 0's, they are biased towards zero.
• Most significant in the initial steps.
• Most significant when β₁ and β₂ are close to 1.
• A correction is required.
Adam – Bias correction
m̂_t = m_t / (1 − β₁ᵗ),   v̂_t = v_t / (1 − β₂ᵗ)
θ_{t+1} = θ_t − η/(√v̂_t + ε) · m̂_t
• Correct both moments to get the final update rule.
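A minimal sketch of a single Adam step with bias correction, following the update rule above (default hyperparameters as recommended in the paper; g is the current gradient vector, t the step count starting at 1):

```python
import numpy as np

def adam_step(theta, m, v, g, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g          # first-moment estimate
    v = beta2 * v + (1 - beta2) * g ** 2     # second-moment estimate
    m_hat = m / (1 - beta1 ** t)             # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)             # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```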
Adam vs. Adadelta
Adam:     θ_{t+1} = θ_t − η/(√v̂_t + ε) · m̂_t
Adadelta: θ_{t+1} = θ_t − η/√(E[g²]_t + ε) · g_t
where
m_t = β₁·m_{t−1} + (1 − β₁)·g_t
v_t = β₂·v_{t−1} + (1 − β₂)·g_t²
g_t = ∇_θ J(θ_t)
Adam - Performance
Questions?
Improving Adam
• NAdam - incorporating Nesterov momentum into Adam.
• AdamW - decoupling weight decay (λ is the weight-decay coefficient):
  θ_{t+1} = θ_t − η/(√v̂_t + ε) · m̂_t − η·λ·θ_t
• AMSGrad - fixing the exponential moving average (a sketch follows this list):
  v̂_t = max(v̂_{t−1}, v_t)
  θ_{t+1} = θ_t − η/(√v̂_t + ε) · m_t
  • Maximum of past second-moment estimates instead of their exponential moving average.
• Adam with warm restarts.
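A minimal sketch of the AMSGrad variant mentioned above; the only change relative to the Adam step is keeping the element-wise maximum of the second-moment estimates (g is the current gradient vector):

```python
import numpy as np

def amsgrad_step(theta, m, v, v_hat, g, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    v_hat = np.maximum(v_hat, v)                       # non-decreasing second-moment estimate
    theta = theta - lr * m / (np.sqrt(v_hat) + eps)
    return theta, m, v, v_hat
```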
Additional approaches
• Snapshot ensembles
• Learning to optimize
Shampoo (Gupta, V., Koren, T., & Singer, Y. 2018)
Summary
• A brief walkthrough of Supervised Machine Learning.
• An argument for the importance and relevance of Gradient methods.
• An overview of modern Gradient descent optimization algorithms.
• The contribution of Adam.
• Innovations that followed Adam.
Questions?
The End
Links
• http://ruder.io/optimizing-gradient-descent/index.html
• http://ruder.io/deep-learning-optimization-2017/
• https://imgur.com/a/Hqolp
• https://towardsdatascience.com/gradient-descent-algorithm-and-its-variants-10f652806a3
• https://stats.stackexchange.com/questions/153531/what-is-batch-size-in-neural-network
• https://mathematica.stackexchange.com/questions/9928/identifying-critical-points-lines-of-2-3d-image-cubes
• https://towardsdatascience.com/stochastic-gradient-descent-with-momentum-a84097641a5d
• https://distill.pub/2017/momentum/
• https://towardsdatascience.com/applied-deep-learning-part-1-artificial-neural-networks-d7834f67a4f6
• https://meetshah1995.github.io/semantic-segmentation/deep-learning/pytorch/visdom/2017/06/01/semantic-segmentation-over-the-years.html
• https://github.com/mattnedrich/GradientDescentExample
References
• Ruder, S. (2016). An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747.
• Dauphin, Y. N., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., & Bengio, Y. (2014). Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. arXiv, pages 1-14.
• Dozat, T. (2016). Incorporating Nesterov Momentum into Adam. ICLR Workshop, (1):2013-2016.
• Kingma, D. P., & Ba, J. L. (2015). Adam: A Method for Stochastic Optimization. International Conference on Learning Representations, pages 1-13.
• Nesterov, Y. A method for unconstrained convex minimization problem with the rate of convergence O(1/k²). Doklady AN SSSR (translated as Soviet Math. Dokl.), 269:543-547.
• Qian, N. (1999). On the momentum term in gradient descent learning algorithms. Neural Networks: The Official Journal of the International Neural Network Society, 12(1):145-151.
• Zeiler, M. D. (2012). ADADELTA: An Adaptive Learning Rate Method. arXiv preprint arXiv:1212.5701.
• Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul), 2121-2159.