question about gradient descent hung-yi lee. larger gradient, larger steps? best step:
TRANSCRIPT
Question about Gradient Descent
Hung-yi Lee
Larger gradient, larger steps?
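$y = ax^2 + bx + c$
$\left|\frac{\partial y}{\partial x}\right| = |2ax + b|$
Minimum at $x = -\frac{b}{2a}$
Distance from $x_0$ to the minimum: $\left|x_0 + \frac{b}{2a}\right|$
Gradient magnitude at $x_0$: $|2ax_0 + b|$
Best step (for $a > 0$): $\left|x_0 + \frac{b}{2a}\right| = \frac{|2ax_0 + b|}{2a}$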
Contradiction
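Original Gradient descent (larger gradient, larger step):
$w^{t+1} \leftarrow w^t - \eta g^t$, where $g^t = \frac{\partial C(w^t)}{\partial w}$
Adagrad (divided by first derivative):
$w^{t+1} \leftarrow w^t - \frac{\eta}{\sqrt{\sum_{i=0}^{t} (g^i)^2}} g^t$
RMSprop (divided by first derivative):
$w^{t+1} \leftarrow w^t - \frac{\eta}{\sigma^t} g^t$, where $\sigma^t = \sqrt{\alpha (\sigma^{t-1})^2 + (1 - \alpha)(g^t)^2}$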
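To make the contrast concrete, the three update rules can be sketched for a single scalar parameter. This is a minimal sketch, not a reference implementation: the function names, the zero initial accumulator values, and the sample numbers are assumptions for illustration, and practical implementations usually add a small epsilon to the denominator to avoid division by zero.

```python
import math

def gradient_descent_step(w, g, eta):
    """Original gradient descent: the step is proportional to the gradient."""
    return w - eta * g

def adagrad_step(w, g, eta, sum_sq):
    """Adagrad: divide by the root of the sum of all squared gradients so far.

    sum_sq is the running sum of squared gradients (assumed 0.0 before step 0).
    Assumes the updated sum is nonzero.
    """
    sum_sq = sum_sq + g * g
    return w - eta / math.sqrt(sum_sq) * g, sum_sq

def rmsprop_step(w, g, eta, sigma, alpha):
    """RMSprop: divide by a decaying root mean square of past gradients.

    sigma is the previous sigma (assumed 0.0 before step 0 in this sketch).
    Assumes the updated sigma is nonzero.
    """
    sigma = math.sqrt(alpha * sigma ** 2 + (1 - alpha) * g ** 2)
    return w - eta / sigma * g, sigma

# Hypothetical values: same starting point and gradient for all three rules.
w, g, eta = 5.0, 10.0, 0.1
w_gd = gradient_descent_step(w, g, eta)
w_ada, sum_sq = adagrad_step(w, g, eta, 0.0)
w_rms, sigma = rmsprop_step(w, g, eta, 0.0, 0.9)
print(w_gd, w_ada, w_rms)
```

On the very first Adagrad step the denominator equals |g|, so the step has size eta regardless of how large the gradient is, which is exactly the apparent contradiction with "larger gradient, larger step".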
Second Derivative
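$y = ax^2 + bx + c$
$\left|\frac{\partial y}{\partial x}\right| = |2ax + b|$
$\frac{\partial^2 y}{\partial x^2} = 2a$
Minimum at $x = -\frac{b}{2a}$
Best step from $x_0$: $\left|x_0 + \frac{b}{2a}\right| = \frac{|2ax_0 + b|}{2a}$
The best step is $\frac{|\text{First derivative}|}{\text{Second derivative}}$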
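A quick numerical check of this claim, as a sketch: for a quadratic with a > 0, the ratio |first derivative| / second derivative at x0 should equal the distance from x0 to the minimum. The coefficients and the starting point below are hypothetical values chosen only for illustration.

```python
def best_step(a, b, x0):
    """|first derivative| / second derivative for y = a*x^2 + b*x + c (a > 0)."""
    first = 2 * a * x0 + b   # dy/dx at x0
    second = 2 * a           # d2y/dx2, constant for a quadratic
    return abs(first) / second

# Hypothetical quadratic and starting point.
a, b, c = 3.0, -12.0, 5.0
x0 = 10.0

x_min = -b / (2 * a)          # location of the minimum
distance = abs(x0 - x_min)    # how far x0 is from the minimum
step = best_step(a, b, x0)

print(distance, step)
```

The constant c does not appear in either derivative, so it has no effect on the best step.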
More than one parameter
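The best step is $\frac{|\text{First derivative}|}{\text{Second derivative}}$
[Figure: curves along $w_1$ and $w_2$, one with a larger second derivative and one with a smaller second derivative, with marked points a, b, c, d]
a > b
c > d
c < a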
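The cross-parameter comparison can be illustrated with a small sketch. It assumes a hypothetical separable cost with one quadratic term per parameter, each minimized at 0; which parameter gets the larger curvature, and all numeric values, are assumptions for illustration and are not read off the figure.

```python
# Hypothetical cost: C(w1, w2) = k1 * w1^2 + k2 * w2^2, minimum at (0, 0).
curvature = {"w1": 0.5, "w2": 10.0}   # k1 (flat direction), k2 (sharp direction)
position = {"w1": 4.0, "w2": 1.0}     # current point; distances to the minimum

results = {}
for name in ("w1", "w2"):
    first = 2 * curvature[name] * position[name]   # first derivative
    second = 2 * curvature[name]                   # second derivative
    results[name] = {
        "first": abs(first),
        "second": second,
        "best_step": abs(first) / second,
    }

for name, r in results.items():
    print(name, r)
```

In this example the sharp direction has the larger gradient even though it is closer to its minimum, so gradient magnitude alone is not comparable across parameters. Dividing by the second derivative recovers the true distance in each direction.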
What to do with Adagrad and RMSprop?
The best step is $\frac{|\text{First derivative}|}{\text{Second derivative}}$
Use first derivative to estimate second derivative: $\sqrt{(\text{first derivative})^2}$
[Figure: curves along $w_1$ and $w_2$, one with a larger second derivative and one with a smaller second derivative]
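The estimation idea can be checked numerically with a sketch. It assumes two hypothetical curves of the form y = a*x^2 and evaluates the first derivative at the same randomly sampled points on both; the sample range, sample count, and coefficients are arbitrary choices for illustration.

```python
import math
import random

def rms_gradient(a, samples):
    """Root mean square of dy/dx = 2*a*x over the sampled points, for y = a*x^2."""
    return math.sqrt(sum((2 * a * x) ** 2 for x in samples) / len(samples))

random.seed(0)
# Same sample points for both curves, so only the curvature differs.
samples = [random.uniform(-1.0, 1.0) for _ in range(1000)]

a_sharp, a_flat = 5.0, 0.5   # hypothetical: larger vs. smaller second derivative

ratio_rms = rms_gradient(a_sharp, samples) / rms_gradient(a_flat, samples)
ratio_second = (2 * a_sharp) / (2 * a_flat)

print(ratio_rms, ratio_second)
```

Because the first derivative of each curve scales linearly with its second derivative, the root mean square of sampled first derivatives is proportional to the second derivative. This is why the denominators in Adagrad and RMSprop can stand in for it without computing second derivatives directly.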
Acknowledgement
• This question was raised by 李廣和
Thanks for your attention!