The Marginal Value of Adaptive Gradient Methods in Machine Learning
TRANSCRIPT

Does deep learning really do any generalization?
presented by Jamie Seol, IDS Lab

Preface
• Toy problem: smooth, strongly convex quadratic optimization
• Let the objective f be the following, where WLOG we take A to be symmetric and nonsingular:
  f(x) = (1/2) x^T A x − b^T x
• why WLOG? symmetric because only the symmetric part of A contributes to a quadratic form, and a singular curvature (the curvature of a quadratic is A itself) reduces to a lower-dimensional quadratic problem
• moreover, strong convexity = positive definite curvature
• meaning that all eigenvalues of A are positive

Preface
• Note that A is a real symmetric matrix, so by the spectral theorem, A has an eigendecomposition with an orthonormal basis: A = QΛQ^T
• In this simple objective, we can compute the optimum explicitly: setting ∇f(x) = Ax − b = 0 gives x* = A^(−1)b

Preface
• We'll apply gradient descent! Let the superscript denote the iteration:
  x^(k+1) = x^(k) − α∇f(x^(k)) = x^(k) − α(Ax^(k) − b)
• Will it converge to the optimum? Let's check it out!
• We use a handy trick: a change of basis, w^(k) = Q^T(x^(k) − x*)
• This new sequence w^(k) should converge to 0
• But when?
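
To fill in the change-of-basis step, here is the computation written out (a standard derivation under the definitions above):

```latex
\begin{aligned}
w^{(k+1)} &= Q^\top\bigl(x^{(k+1)} - x^*\bigr)
           = Q^\top\bigl(x^{(k)} - \alpha(Ax^{(k)} - b) - x^*\bigr) \\
          &= Q^\top\bigl(x^{(k)} - x^*\bigr) - \alpha Q^\top A\bigl(x^{(k)} - x^*\bigr)
             && (\text{since } Ax^* = b) \\
          &= (I - \alpha\Lambda)\,w^{(k)}
             && (\text{since } Q^\top A = \Lambda Q^\top)
\end{aligned}
```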

Preface
• This holds: w^(k) = (I − αΛ)^k w^(0)
• [homework: prove it]
• Rewriting in element-wise notation: w_i^(k) = (1 − αλ_i)^k w_i^(0)

Preface
• So, gradient descent converges only if |1 − αλ_i| < 1 for all i
• In summary, it converges when 0 < α < 2/𝜎(A)
• And the optimal step size is α = 2/(λ_min + λ_max)
• where 𝜎(A) denotes the spectral radius of A, meaning the maximal absolute value among its eigenvalues [homework: the n = 1 case]
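
A minimal numerical check of these conditions (my own sketch, not from the slides): on a random strongly convex quadratic, gradient descent converges for a step size just below 2/𝜎(A) and blows up just above it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random symmetric positive definite curvature A and linear term b.
M = rng.normal(size=(10, 10))
A = M @ M.T + np.eye(10)          # positive definite => strongly convex
b = rng.normal(size=10)
x_star = np.linalg.solve(A, b)    # the unique optimum x* = A^{-1} b

sigma = np.max(np.abs(np.linalg.eigvalsh(A)))  # spectral radius of A

def run_gd(alpha, steps=5000):
    x = np.zeros(10)
    for _ in range(steps):
        x = x - alpha * (A @ x - b)   # gradient of (1/2)x^T A x - b^T x
    return np.linalg.norm(x - x_star)

print(run_gd(1.9 / sigma))   # ~0: inside the 0 < alpha < 2/sigma(A) range
print(run_gd(2.1 / sigma))   # overflows to inf/nan: just outside the range
```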

Preface (appendix)
• Actually, this result is rather intuitive
• Note that A is the curvature of the objective, and the spectral radius, i.e. the largest eigenvalue, measures the "stretching" along A's principal axis
• curvature ← see differential geometry
• principal axis ← see linear algebra
• So it is natural that the learning rate must stay in a safe range relative to this "stretching", which can be achieved with simple normalization

Preface
• Similarly, the optimal momentum decay can also be derived using the condition number 𝜅: β = ((√𝜅 − 1)/(√𝜅 + 1))²
• the condition number of a matrix is the ratio between its maximal and minimal (absolute) eigenvalues
• Therefore, if we can bound the spectral radius of the objective's curvature, we can approximate the optimal parameters for gradient descent, as sketched below
• this is the main idea of the YellowFin optimizer
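
A sketch of what "approximating the optimal parameters" looks like in the quadratic case, using the closed forms from the cited Distill article (the function name and structure are my own):

```python
import numpy as np

def optimal_params(A):
    """Closed-form step size and momentum for f(x) = (1/2) x^T A x - b^T x.

    Formulas follow Goh, "Why Momentum Really Works" (Distill, 2017).
    """
    eig = np.linalg.eigvalsh(A)            # A is symmetric positive definite
    lam_min, lam_max = eig[0], eig[-1]
    kappa = lam_max / lam_min              # condition number

    alpha_gd  = 2.0 / (lam_min + lam_max)  # optimal step for plain GD
    alpha_mom = (2.0 / (np.sqrt(lam_min) + np.sqrt(lam_max))) ** 2
    beta_mom  = ((np.sqrt(kappa) - 1.0) / (np.sqrt(kappa) + 1.0)) ** 2
    return alpha_gd, alpha_mom, beta_mom
```

YellowFin's trick, roughly, is to estimate the needed spectral quantities on the fly during training and keep plugging them into formulas like these.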

Preface
• So what?
• We understand the behavior of gradient descent quite well
• if the objective is a smooth, strongly convex quadratic...
• but the objectives of deep learning are not that nice!
• We just don't know much about the characteristics of deep learning objectives yet
• more research required

Preface 2
• Here's a typical linear regression problem: minimize ||Xw − y||²
• If the number of features d is bigger than the number of samples m, then the system is underdetermined
• So it has (possibly infinitely) many solutions
• Let's use stochastic gradient descent (SGD)
• which solution will SGD find?

Preface 2
• Actually, we already discussed this in a previous seminar
• Anyway, even if the system is underdetermined, SGD always converges to a unique solution, one that lies in the span of the rows of X (see the sketch below)
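
A small empirical illustration of this claim (my sketch): start SGD at zero on an underdetermined least-squares problem; every update adds a multiple of some row of X, so the iterates stay in the row span, and the limit matches the minimum-norm solution computed by the pseudoinverse.

```python
import numpy as np

rng = np.random.default_rng(1)
m, d = 20, 100                       # d > m: underdetermined system
X = rng.normal(size=(m, d))
y = rng.normal(size=m)

w = np.zeros(d)                      # zero init lies in the row span of X
lr = 1e-3
for _ in range(200_000):
    i = rng.integers(m)              # pick one sample at random
    w -= lr * (X[i] @ w - y[i]) * X[i]   # SGD step on (1/2)(x_i^T w - y_i)^2

w_min = np.linalg.pinv(X) @ y        # minimum-norm interpolating solution
print(np.linalg.norm(w - w_min))     # ~0: SGD found the minimum-norm solution
```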

Preface 2
• Moreover, experiments show that SGD's solution has small norm
• We know that l2-regularization helps generalization
• l2-regularization: keeping the parameters' norm small
• So, we can say that SGD has an implicit regularization effect
• but there is also evidence that l2-regularization does not help at all...
• see the previous seminar presented by me
• it works, but actually not that well; it is still decent, yet not all that good...

Introduction
• In summary,
• adaptive gradient descent methods
• might be poor
• at generalization

Preliminaries
• Famous non-adaptive gradient descent methods (update rules sketched below):
• Stochastic Gradient Descent [SGD]
• Heavy-Ball [HB] (Polyak, 1964)
• Nesterov's Accelerated Gradient [NAG] (Nesterov, 1983)
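
The update rules behind these three names, written as a sketch in their standard textbook forms (`grad` stands for any function returning the (stochastic) gradient of f at a point):

```python
def sgd_step(w, grad, lr):
    # SGD: w_{k+1} = w_k - lr * grad(w_k)
    return w - lr * grad(w)

def heavy_ball_step(w, w_prev, grad, lr, beta):
    # Heavy-Ball: SGD plus a momentum term beta * (w_k - w_{k-1})
    return w - lr * grad(w) + beta * (w - w_prev)

def nesterov_step(w, w_prev, grad, lr, beta):
    # NAG: same momentum, but the gradient is evaluated at a look-ahead point
    lookahead = w + beta * (w - w_prev)
    return lookahead - lr * grad(lookahead)
```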

Preliminaries
• Adaptive methods can be summarized as (sketched below):
• AdaGrad (Duchi, 2011)
• RMSProp (Tieleman and Hinton, 2012, in Coursera!)
• Adam (Kingma and Ba, 2015)
• In short, these methods adaptively change the learning rate (and momentum decay) per coordinate
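
The corresponding per-coordinate updates, again sketched in their standard forms (`g` is the current stochastic gradient as a numpy array; `state` carries the running statistics between calls):

```python
import numpy as np

def adagrad_step(w, g, state, lr, eps=1e-8):
    # Accumulate squared gradients over the whole history
    state["G"] = state.get("G", 0.0) + g**2
    return w - lr * g / (np.sqrt(state["G"]) + eps)

def rmsprop_step(w, g, state, lr, rho=0.9, eps=1e-8):
    # Exponential moving average of squared gradients instead of a full sum
    state["G"] = rho * state.get("G", 0.0) + (1 - rho) * g**2
    return w - lr * g / (np.sqrt(state["G"]) + eps)

def adam_step(w, g, state, lr, b1=0.9, b2=0.999, eps=1e-8):
    # Moving averages of the gradient and its square, with bias correction
    state["t"] = state.get("t", 0) + 1
    state["m"] = b1 * state.get("m", 0.0) + (1 - b1) * g
    state["v"] = b2 * state.get("v", 0.0) + (1 - b2) * g**2
    m_hat = state["m"] / (1 - b1 ** state["t"])
    v_hat = state["v"] / (1 - b2 ** state["t"])
    return w - lr * m_hat / (np.sqrt(v_hat) + eps)
```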

Preliminaries
• All together: the methods above fit a single general update template
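
As far as I recall the paper's summary, the template looks as follows, with H_k a diagonal positive definite matrix built from the gradient history; H_k = I recovers the non-adaptive methods, and the choices of α_k, β_k, γ_k, H_k distinguish SGD, HB, NAG, AdaGrad, RMSProp, and Adam:

```latex
w_{k+1} = w_k
        - \alpha_k\, H_k^{-1}\, \nabla f\bigl(w_k + \gamma_k (w_k - w_{k-1})\bigr)
        + \beta_k\, H_k^{-1} H_{k-1}\, (w_k - w_{k-1})
```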

Synopsis
• For a system with multiple solutions, which solution does an algorithm find, and how well does it generalize to unseen data?
• Claim: there exists a constructed problem (dataset) in which
• non-adaptive methods work well
• they find a solution with good generalization power
• adaptive methods work poorly
• they find a solution with poor generalization power
• we can even make this arbitrarily poor while the non-adaptive solution keeps working

Problem settings
• Think of a simple binary least-squares classification problem: minimize ||Xw − y||² with labels y ∈ {−1, +1}ⁿ
• When d > n, if there is an optimum with loss 0, then there are infinitely many optima
• But as shown in Preface 2, SGD converges to a unique solution
• known to be the minimum-norm solution
• which generalizes well
• why? because here it is also the largest-margin solution
• All the other non-adaptive methods also converge to the same solution

Lemma
• Let sign(x) denote the function that maps each component of x to its sign
• ex) sign([2, -3]) = [1, -1]
• If there exists a solution proportional to sign(X^T y), this is precisely the unique solution to which all adaptive methods converge
• quite an interesting lemma!
• pf) use induction; the base case is checked below
• Note that X^T y is just the sum of positively labeled vectors minus the sum of negatively labeled vectors (for balanced classes, proportional to the difference of the class means)
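
The base case of that induction fits in a few lines of numpy (my sketch): starting from w = 0, the first full-batch gradient of the squared loss is −X^T y, and AdaGrad's preconditioner divides each coordinate by its own magnitude, so the very first iterate is already proportional to sign(X^T y).

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(6, 10))
y = rng.choice([-1.0, 1.0], size=6)

w0 = np.zeros(10)
g = X.T @ (X @ w0 - y)           # first gradient of (1/2)||Xw - y||^2: -X^T y
G = g**2                         # AdaGrad accumulator after one gradient
lr = 0.1
w1 = w0 - lr * g / np.sqrt(G)    # first AdaGrad iterate (entries of g are
                                 # nonzero almost surely for Gaussian data)

print(np.allclose(w1, lr * np.sign(X.T @ y)))   # True: w1 = lr * sign(X^T y)
```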

Funny dataset
• Let's fool the adaptive methods
• first, assign each label y_i to 1 with probability p > 1/2 (and to −1 otherwise)
• the feature matrix X is then built from y (the slide shows the resulting X for y = [-1, -1, -1, -1] and for y = [1, 1, 1, 1])

Funny dataset
• Note that for such a dataset, the only truly discriminative feature is the first one!
• if y = [1, -1, -1, 1, -1], then X takes the block form constructed in the sketch below
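
A sketch of the construction as I read it in the paper: x_{i,1} = y_i, x_{i,2} = x_{i,3} = 1, and each example i additionally owns a private block of features, with one active entry if y_i = +1 and five if y_i = −1. The code builds such an X and compares the adaptive (sign-based) solution against the minimum-norm one on fresh test points.

```python
import numpy as np

def make_funny_dataset(y, stride=5):
    """Paper-style construction: a label feature, two always-on features,
    and a private block per example (1 one if y_i = +1, 5 ones if y_i = -1)."""
    n = len(y)
    X = np.zeros((n, 3 + stride * n))
    for i, yi in enumerate(y):
        X[i, 0] = yi
        X[i, 1] = X[i, 2] = 1.0
        start = 3 + stride * i
        X[i, start:start + (1 if yi > 0 else 5)] = 1.0
    return X

y_tr = np.array([1.0, -1.0, -1.0, 1.0, -1.0, 1.0, 1.0])   # more +1s, p > 1/2
X_tr = make_funny_dataset(y_tr)

w_ada = 0.25 * np.sign(X_tr.T @ y_tr)   # adaptive solution (the lemma, tau = 1/4)
w_sgd = np.linalg.pinv(X_tr) @ y_tr     # minimum-norm (non-adaptive) solution
print(np.allclose(X_tr @ w_ada, y_tr),  # True: both interpolate the
      np.allclose(X_tr @ w_sgd, y_tr))  # training data perfectly

# New data reuses features 1-3 but brings its own private block (all zeros
# against the training weights), so only the first three coordinates matter.
for y_te in (+1.0, -1.0):
    x_te = np.zeros(X_tr.shape[1])
    x_te[0], x_te[1], x_te[2] = y_te, 1.0, 1.0
    print(y_te, np.sign(w_ada @ x_te), np.sign(w_sgd @ x_te))
# The adaptive solution answers +1 for both labels; the minimum-norm
# solution classifies both test labels correctly.
```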

Funny dataset
• Let b = Σ_i y_i and assume b > 0 (this is what p > 1/2 buys us on average)
• Suppose w = τ·sign(X^T y); then, as computed below, Xw = 4τy, so τ = 1/4 gives an exact zero-loss solution
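
Writing out the inner product under the construction above makes this concrete (my reconstruction; u_i denotes the size of example i's private block):

```latex
\langle x_i,\ \tau\,\mathrm{sign}(X^\top y)\rangle
  = \tau\bigl(\underbrace{y_i}_{j=1} + \underbrace{2}_{j=2,3}
      + \underbrace{u_i\,y_i}_{\text{private block}}\bigr),
\qquad
u_i = \begin{cases} 1, & y_i = +1,\\ 5, & y_i = -1, \end{cases}
```

which equals 4τ·y_i in both cases, so τ = 1/4 solves Xw = y exactly.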

Funny dataset
• So, such a solution exists, and by the lemma the adaptive methods converge exactly to it!
• Take a closer look at what it predicts on new data
• the first three coordinates of w are all positive, and every other coordinate meets a 0 in a fresh test point, so ⟨w, x_test⟩ = τ(y_test + 2) > 0 for either label
• this solution is bad!
• it will classify every new data point as positive!!!
• what horrible generalization!

Funny dataset
• How about the non-adaptive methods? They converge to the minimum-norm solution instead
• So, provided the positive labels are frequent enough, the minimum-norm solution makes no errors on new data
• wow

Funny dataset
• Think this is too extreme?
• Well, even in real datasets, the following are rather common:
• a few frequent features (j = 2, 3)
• some features that are good indicators but hard to identify (j = 1)
• many other sparse features (the rest)

Experiments
• (the authors said they downloaded the models from the internet...)
• Results in summary:
• adaptive methods generalize poorly
• even when they reach a lower training loss than the non-adaptive ones!!!
• adaptive methods look fast early on, but that's it
• adaptive methods promise "no more tuning", but tuning their initial values still mattered significantly
• and it takes as much time as tuning the non-adaptive methods...

Experiments
• CIFAR-10
• takeaway: use the non-adaptive methods

Experiments
• lower training loss, yet more test error (Adam vs. HB)

Experiments
• Character-level language modeling
• AdaGrad looks very fast, but is in fact not good
• surprisingly, RMSProp closely trails SGD on the test set

Experiments
• Parsing
• well, it is true that the non-adaptive methods are slow here

Conclusion
• Adaptive methods are not advantageous for optimization
• they might be fast, but they generalize poorly
• then why is Adam so popular?
• because it's popular...?
• especially, it is known to be popular in GANs and Q-learning
• these are not exactly standard optimization problems
• we don't know much about the nature of the objectives in those two yet

References
• Wilson, Ashia C., et al. "The Marginal Value of Adaptive Gradient Methods in Machine Learning." arXiv preprint arXiv:1705.08292 (2017).
• Zhang, Jian, Ioannis Mitliagkas, and Christopher Ré. "YellowFin and the Art of Momentum Tuning." arXiv preprint arXiv:1706.03471 (2017).
• Zhang, Chiyuan, et al. "Understanding Deep Learning Requires Rethinking Generalization." arXiv preprint arXiv:1611.03530 (2016).
• Polyak, Boris T. "Some Methods of Speeding up the Convergence of Iteration Methods." USSR Computational Mathematics and Mathematical Physics 4.5 (1964): 1-17.
• Goh, Gabriel. "Why Momentum Really Works." Distill, 2017. http://doi.org/10.23915/distill.00006