6b gradient descent — seas.upenn.edu, cis520/lectures/gradient_descent.pdf
Gradient Descent
Lyle Ungar, University of Pennsylvania
In part from slides written jointly with Zack Ives

Learning objectives
• Know standard, coordinate, stochastic gradient, and mini-batch gradient descent
• Adagrad: core idea
Gradient Descent
• We almost always want to minimize some loss function
• Example: Sum of Squared Errors (SSE):

$$SSE(\theta) = \frac{1}{n}\sum_{i=1}^{n} r_i(\theta)^2, \qquad r_i(\theta) = y^{(i)} - \hat{y}(\boldsymbol{x}^{(i)}; \theta)$$
Mean Squared Error
http://www.math.uah.edu/stat/expect/Variance.html

In one dimension, the loss looks like a parabola centered around the optimal value μ (this generalizes to d dimensions).

$$SSE(\theta) = \frac{1}{n}\sum_{i=1}^{n} r_i(\theta)^2$$
Getting Closer

What if we use the slope of the tangent at the current value θ to decide where to "go next"? That slope is the gradient:

$$\nabla SSE(\theta) = \lim_{d \to 0} \frac{SSE(\theta + d) - SSE(\theta)}{d}$$

theory.stanford.edu/~tim/s15/l/l15.pdf

$$\theta := \theta - \eta\, \nabla SSE(\theta)$$
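The limit above is exactly what a numerical derivative approximates. Here is a minimal NumPy sketch (ours, not from the slides; `sse` and `numerical_gradient` are made-up names) that estimates the gradient of the SSE by central finite differences:

```python
import numpy as np

def sse(theta, X, y):
    """SSE(theta) = (1/n) * sum of squared residuals for a linear model."""
    r = y - X @ theta
    return np.mean(r ** 2)

def numerical_gradient(f, theta, d=1e-6):
    """Approximate the gradient of f at theta by central differences."""
    grad = np.zeros_like(theta)
    for j in range(len(theta)):
        e = np.zeros_like(theta)
        e[j] = d
        grad[j] = (f(theta + e) - f(theta - e)) / (2 * d)
    return grad
```

Central differences are accurate to O(d²), which makes this a handy sanity check on an analytic gradient.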
Getting Closer

We can compute the gradient numerically… but it is often better to use an analytic derivative (calculus)!

Differentiating each term of the loss with the chain rule:

$$\nabla SSE(\theta) = \frac{1}{n}\sum_{i=1}^{n} \frac{d}{d\theta}\, r_i(\theta)^2 = \frac{1}{n}\sum_{i=1}^{n} 2\, r_i(\theta)\, \frac{\partial r_i(\theta)}{\partial \theta}$$

For a linear model $\hat{y}(\boldsymbol{x};\theta) = \theta^\top \boldsymbol{x}$, the residual $r_i = y^{(i)} - \theta^\top \boldsymbol{x}^{(i)}$ gives $\dfrac{\partial r_i}{\partial \theta_j} = -x_j^{(i)}$, so

$$\nabla SSE(\theta) = -\frac{2}{n}\sum_{i=1}^{n} r_i(\theta)\, \boldsymbol{x}^{(i)}$$

Take a step of size η against the gradient:

$$\theta := \theta - \eta\, \nabla SSE(\theta)$$
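Putting the pieces together, full-batch gradient descent for a linear model might be sketched as follows. This is our illustration, not code from the course; `gradient_descent` and its defaults are made up. Note the minus sign that comes from ∂r_i/∂θ = −x^{(i)} when r_i = y^{(i)} − θᵀx^{(i)}.

```python
import numpy as np

def gradient_descent(X, y, eta=0.01, n_iters=1000):
    """Full-batch gradient descent minimizing SSE(theta) = mean(r_i^2),
    with residuals r = y - X @ theta for a linear model."""
    theta = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(n_iters):
        r = y - X @ theta
        grad = -2.0 / n * X.T @ r   # analytic gradient of mean((y - X theta)^2)
        theta -= eta * grad
    return theta
```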
Key questions
• How big a step η to take?
  - Too small and it takes a long time
  - Too big and it will be unstable
• "Optimal": scale $\eta \sim 1/\sqrt{\text{iteration}}$
• Adaptive (a simple version):
  - E.g. each time, increase the step size by 10%
  - If the error ever increases, cut the step size in half

$$\theta := \theta - \eta\, \nabla SSE(\theta)$$
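The simple adaptive rule (grow the step 10%, halve it when the error increases) might be sketched as below. This is our interpretation, not the course's code; in particular, undoing the rejected step is an assumption the slide does not spell out.

```python
import numpy as np

def adaptive_gradient_descent(X, y, eta=0.01, n_iters=1000):
    """Gradient descent with a simple adaptive step size: grow eta by 10%
    after each accepted step; if the error increases, halve eta (and, as
    an assumption, undo the bad step)."""
    theta = np.zeros(X.shape[1])
    n = len(y)
    err = np.mean((y - X @ theta) ** 2)
    for _ in range(n_iters):
        grad = -2.0 / n * X.T @ (y - X @ theta)
        candidate = theta - eta * grad
        new_err = np.mean((y - X @ candidate) ** 2)
        if new_err > err:
            eta *= 0.5                  # too big: cut the step size in half
        else:
            theta, err = candidate, new_err
            eta *= 1.1                  # accept the step and grow eta by 10%
    return theta
```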
For $\|w\|_1$ or $\|y-\hat{y}\|_1$ penalties, use coordinate descent
https://en.wikipedia.org/wiki/Coordinate_descent

Repeat:
  For j = 1…p:
    $\theta_j := \theta_j - \eta\, \partial Err / \partial \theta_j$
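The pseudocode above might look like this in NumPy; `coordinate_descent` and its defaults are our own. Note that for a genuinely non-smooth L1 objective one would minimize exactly over each coordinate (soft-thresholding) rather than take a fixed gradient step; this sketch only mirrors the slide's update on the smooth SSE part.

```python
import numpy as np

def coordinate_descent(X, y, eta=0.1, n_sweeps=300):
    """Cyclic coordinate descent on SSE: sweep over coordinates j = 1..p,
    stepping each one against its own partial derivative dErr/dtheta_j."""
    n, p = X.shape
    theta = np.zeros(p)
    for _ in range(n_sweeps):
        for j in range(p):
            r = y - X @ theta
            d_j = -2.0 / n * (X[:, j] @ r)   # partial derivative for coordinate j
            theta[j] -= eta * d_j
    return theta
```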
Elastic net parameter search

[Figure: coefficient paths — size of coefficients vs. (inverse) regularization penalty; from Zou and Hastie]
Stochastic Gradient Descent
• If we have a very large data set, update the model after observing each single observation
  - "online" or "streaming" learning

$$SSE(\theta) = \frac{1}{n}\sum_{i=1}^{n} r_i(\theta)^2 \qquad\qquad \nabla SSE_i(\theta) = \frac{d}{d\theta}\, r_i(\theta)^2$$

$$\theta := \theta - \eta\, \nabla SSE_i(\theta)$$
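The per-observation update can be sketched as below (our code; `sgd` and its defaults are made up; shuffling each epoch is a common choice the slide does not mandate):

```python
import numpy as np

def sgd(X, y, eta=0.01, n_epochs=50, seed=0):
    """Stochastic gradient descent: update theta after each single
    observation, using the gradient of the one term r_i(theta)^2."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    theta = np.zeros(p)
    for _ in range(n_epochs):
        for i in rng.permutation(n):
            r_i = y[i] - X[i] @ theta
            grad_i = -2.0 * r_i * X[i]   # gradient of the single term r_i^2
            theta -= eta * grad_i
    return theta
```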
Mini-batch
• Update the model every k observations
  - Batch size k (e.g. 50)
• More efficient than pure stochastic gradient or full gradient descent

$$SSE(\theta) = \frac{1}{n}\sum_{i=1}^{n} r_i(\theta)^2 \qquad\qquad \nabla SSE_k(\theta) = \frac{1}{k}\sum_{i=1}^{k} \frac{d}{d\theta}\, r_i(\theta)^2$$

$$\theta := \theta - \eta\, \nabla SSE_k(\theta)$$
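A mini-batch version might be sketched as follows (again our own names and defaults, not the course's code):

```python
import numpy as np

def minibatch_gd(X, y, k=50, eta=0.05, n_epochs=100, seed=0):
    """Mini-batch gradient descent: average the per-example gradients
    over batches of size k, then take one step per batch."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    theta = np.zeros(p)
    for _ in range(n_epochs):
        order = rng.permutation(n)
        for start in range(0, n, k):
            batch = order[start:start + k]
            r = y[batch] - X[batch] @ theta
            grad = -2.0 / len(batch) * X[batch].T @ r  # batch-averaged gradient
            theta -= eta * grad
    return theta
```

Averaging over a batch reduces the variance of each step relative to pure SGD while still making many cheap updates per pass over the data.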
Adagrad
• Define a per-feature learning rate for feature j as:

$$\eta_{t,j} = \frac{\eta}{\sqrt{G_{t,j}}}, \qquad G_{t,j} = \sum_{k=1}^{t} g_{k,j}^2, \qquad g_{k,j} = \frac{\partial}{\partial \theta_j}\, \mathrm{cost}_\theta(x_k, y_k)$$

• $G_{t,j}$ is the sum of squares of the gradients of feature j over time t
• Frequently occurring features get small learning rates; rare features get higher ones
• Key idea: "learn slowly" from frequent features but "pay attention" to rare but informative features
Adagrad

In practice, add a small constant $\zeta > 0$ to prevent dividing by zero:

$$\theta_j \leftarrow \theta_j - \frac{\eta}{\sqrt{G_{t,j} + \zeta}}\, g_{t,j}$$
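Applied to the SSE objective, Adagrad might be sketched as below. This is our illustration (the `adagrad` name and all defaults are made up); the slides define the rule generically, not for this particular loss.

```python
import numpy as np

def adagrad(X, y, eta=0.5, zeta=1e-8, n_epochs=200, seed=0):
    """Adagrad on SSE: each coordinate j gets its own learning rate
    eta / sqrt(G_j + zeta), where G_j accumulates the squared
    per-example gradients seen so far for that coordinate."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    theta = np.zeros(p)
    G = np.zeros(p)
    for _ in range(n_epochs):
        for i in rng.permutation(n):
            r_i = y[i] - X[i] @ theta
            g = -2.0 * r_i * X[i]               # per-example gradient
            G += g ** 2                          # accumulate squared gradients
            theta -= eta / np.sqrt(G + zeta) * g
    return theta
```

Coordinates that see large or frequent gradients accumulate a big G and so take small steps, while rarely-active coordinates keep a comparatively large learning rate, matching the "learn slowly from frequent features" idea above.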
Recap: Gradient Descent
• "Follow the slope" towards a minimum
  - Analytical or numerical derivative
  - Need to pick a step size: larger = faster convergence, but instability
• Lots of variations
  - Coordinate descent
  - Stochastic gradient descent or mini-batch
• Can get caught in local minima
  - An alternative, simulated annealing, uses randomness to escape them