TRANSCRIPT
Semi-Stochastic Gradient Descent Methods
Jakub Konečný, University of Edinburgh
Lehigh University, December 3, 2014
Based on
Basic method: S2GD
Konečný and Richtárik. Semi-Stochastic Gradient Descent Methods, December 2013
Mini-batching (& proximal setting): mS2GD
Konečný, Liu, Richtárik and Takáč. mS2GD: Minibatch Semi-Stochastic Gradient Descent in the Proximal Setting, October 2014
Coordinate descent variant: S2CD
Konečný, Qu and Richtárik. S2CD: Semi-Stochastic Coordinate Descent, October 2014
Introduction
Large scale problem setting
Minimize $f(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x)$
Problems are often structured
Frequently arising in machine learning
Structure – sum of functions; $n$ is BIG
Examples
Linear regression (least squares)
Logistic regression (classification)
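For concreteness, both examples fit the structured form above; the standard per-example losses (the precise constants on the slide may differ) are

$$f_i(x) = \tfrac{1}{2}\big(a_i^\top x - b_i\big)^2 \qquad \text{and} \qquad f_i(x) = \log\big(1 + \exp(-b_i\, a_i^\top x)\big), \quad b_i \in \{-1, +1\},$$

for least squares and logistic regression respectively, where $(a_i, b_i)$ is the $i$-th training example.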
Assumptions
Lipschitz continuity of the derivative of each $f_i$: $\|\nabla f_i(x) - \nabla f_i(y)\| \le L\,\|x - y\|$ for all $x, y$
Strong convexity of $f$: $f(y) \ge f(x) + \nabla f(x)^\top (y - x) + \frac{\mu}{2}\|y - x\|^2$ for all $x, y$
Gradient Descent (GD)
Update rule: $x_{k+1} = x_k - h\,\nabla f(x_k)$
Fast (linear) convergence rate: for accuracy $\varepsilon$ we need $O(\kappa \log(1/\varepsilon))$ iterations, where $\kappa = L/\mu$ is the condition number
Complexity of a single iteration – $n$ (measured in gradient evaluations)
Stochastic Gradient Descent (SGD)
Update rule: $x_{k+1} = x_k - h_k\,\nabla f_i(x_k)$, with $i$ chosen uniformly at random from $\{1, \dots, n\}$ and $h_k$ a step-size parameter
Why it works: $\mathbb{E}[\nabla f_i(x)] = \nabla f(x)$, so the step is an unbiased estimate of the GD direction
Slow (sublinear) convergence
Complexity of a single iteration – $1$ (measured in gradient evaluations)
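As a minimal sketch, the two update rules side by side in Python (`full_grad` and `grad_i` are assumed callables returning $\nabla f(x)$ and $\nabla f_i(x)$; they are not part of the slides):

```python
import numpy as np

def gd_step(x, full_grad, h):
    # GD: x <- x - h * grad f(x); one step costs n gradient evaluations.
    return x - h * full_grad(x)

def sgd_step(x, grad_i, n, h_k, rng):
    # SGD: x <- x - h_k * grad f_i(x) with i uniform on {0, ..., n-1};
    # one step costs a single gradient evaluation, and the direction is
    # unbiased: E[grad f_i(x)] = grad f(x).
    i = rng.integers(n)
    return x - h_k * grad_i(x, i)
```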
Goal
GD: fast convergence, but $n$ gradient evaluations in each iteration
SGD: slow convergence, but complexity of each iteration is independent of $n$
Combine the strengths of both in a single algorithm
Semi-Stochastic Gradient Descent
S2GD
Intuition
The gradient does not change drastically between nearby iterates
We could reuse the information from an "old" gradient
Modifying the "old" gradient
Imagine someone gives us a "good" point $x$ and the full gradient $\nabla f(x)$
The gradient at a point $y$, near $x$, can be expressed as
$\nabla f(y) = \nabla f(x) + \big(\nabla f(y) - \nabla f(x)\big)$
i.e., an already computed gradient plus a gradient change, which we can try to estimate from a single sample, giving the approximation $\nabla f(y) \approx \nabla f(x) + \nabla f_i(y) - \nabla f_i(x)$ for a randomly chosen $i$
The S2GD Algorithm
In each epoch: compute the full gradient $g = \nabla f(x)$ once, then take $t$ cheap inner steps $y \leftarrow y - h\,\big(g + \nabla f_i(y) - \nabla f_i(x)\big)$ with $i$ sampled uniformly at random, and set $x \leftarrow y$
Simplification: the size of the inner loop is random, following a geometric rule
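A compact sketch of one possible implementation (assumptions: `grad_i(x, i)` returns $\nabla f_i(x)$; `nu` is a lower bound on the strong convexity constant, and `nu = 0` makes the inner-loop length uniform on $\{1, \dots, m\}$; this is an illustration, not the authors' reference code):

```python
import numpy as np

def s2gd(grad_i, n, x0, epochs=10, m=1000, h=0.01, nu=0.0, seed=0):
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for _ in range(epochs):
        # Outer loop: one full (expensive) gradient at the reference point.
        g = np.mean([grad_i(x, i) for i in range(n)], axis=0)
        # Inner-loop length t is random, following the geometric-type rule
        # P(t) proportional to (1 - nu*h)^(m - t), t = 1, ..., m.
        w = (1.0 - nu * h) ** (m - np.arange(1, m + 1))
        t = rng.choice(np.arange(1, m + 1), p=w / w.sum())
        y = x.copy()
        for _ in range(t):
            i = rng.integers(n)
            # Cheap step with the variance-reduced gradient estimate.
            y -= h * (g + grad_i(y, i) - grad_i(x, i))
        x = y
    return x
```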
Theorem
For suitable parameters, S2GD converges linearly in expectation: $\mathbb{E}\big[f(x_j) - f(x_*)\big] \le c^j\,\big(f(x_0) - f(x_*)\big)$, where the contraction factor $c < 1$ depends on $h$, $m$, $L$ and the strong convexity constant, and is a sum of two terms (the exact expression is given in the paper)
Convergence rate
How to set the parameters $m$ (inner loop size) and $h$ (stepsize)?
One term of $c$ can be made arbitrarily small by decreasing $h$
For any fixed $h$, the other term can be made arbitrarily small by increasing $m$
Setting the parameters
Fix target accuracy $\varepsilon$
The accuracy is achieved by setting the # of epochs $j = O(\log(1/\varepsilon))$, the stepsize $h = O(1/L)$, and the # of inner iterations $m = O(\kappa)$
Total complexity (in gradient evaluations): # of epochs $\times$ (full gradient evaluation + cheap iterations), roughly $j \times (n + 2m) = O\big((n + \kappa)\log(1/\varepsilon)\big)$
Complexity
S2GD complexity: $O\big((n + \kappa)\log(1/\varepsilon)\big)$ gradient evaluations
GD complexity: $O(\kappa \log(1/\varepsilon))$ iterations $\times$ $n$ (complexity of a single iteration) $= O\big(n\kappa \log(1/\varepsilon)\big)$ in total
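To see the size of the gap (illustrative numbers, not from the slides): with $n = 10^6$ examples and $\kappa = 10^4$, GD needs on the order of $n\kappa = 10^{10}$ gradient evaluations per $\log(1/\varepsilon)$ factor, while S2GD needs on the order of $n + \kappa \approx 10^6$ – roughly a $10^4\times$ saving.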
Related Methods
SAG – Stochastic Average Gradient (Mark Schmidt, Nicolas Le Roux, Francis Bach, 2013)
Refreshes a single stochastic gradient in each iteration; needs to store $n$ gradients; similar convergence rate; cumbersome analysis
SAGA (Aaron Defazio, Francis Bach, Simon Lacoste-Julien, 2014)
Refined analysis
MISO – Minimization by Incremental Surrogate Optimization (Julien Mairal, 2014)
Similar to SAG, slightly worse performance; elegant analysis
Related Methods
SVRG – Stochastic Variance Reduced Gradient (Rie Johnson, Tong Zhang, 2013)
Arises as a special case of S2GD
Prox-SVRG (Lin Xiao, Tong Zhang, 2014)
Extended to the proximal setting
EMGD – Epoch Mixed Gradient Descent (Lijun Zhang, Mehrdad Mahdavi, Rong Jin, 2013)
Handles simple constraints; worse convergence rate
Experiment (logistic regression on: ijcnn, rcv, real-sim, url)
Extensions
Extensions summary
S2GD:
Efficient handling of sparse data
Pre-processing with SGD (S2GD+)
Non-strongly convex losses
High-probability result
Minibatching: mS2GD
Konečný, Liu, Richtárik and Takáč. mS2GD: Minibatch Semi-Stochastic Gradient Descent in the Proximal Setting, October 2014
Coordinate descent variant: S2CD
Konečný, Qu and Richtárik. S2CD: Semi-Stochastic Coordinate Descent, October 2014
Sparse data
For linear/logistic regression, the stochastic gradient $\nabla f_i(x)$ copies the sparsity pattern of the example (SPARSE)
But the update direction $g + \nabla f_i(y) - \nabla f_i(x)$ is fully DENSE, since the old full gradient $g$ is dense
Can we do something about it?
Sparse data
Yes we can! To compute $\nabla f_i(y)$, we only need the coordinates of $y$ corresponding to nonzero elements of example $i$
For each coordinate $j$, remember when it was last updated. Before computing $\nabla f_i(y)$ in inner iteration number $s$, bring the required coordinates up to date: each of the iterations in which coordinate $j$ was not updated contributed only the "old gradient" step $-h\,g_j$, so those steps can be applied all at once. Then compute the direction and make a single sparse update
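A minimal sketch of this lazy-update trick for one epoch. The helpers `support(i)` (the nonzero coordinates of example $i$, which for linear/logistic models is also the support of $\nabla f_i$) and `grad_vals(y, i)` (the values of $\nabla f_i(y)$ on that support) are hypothetical names introduced for illustration:

```python
import numpy as np

def s2gd_epoch_sparse(support, grad_vals, x, g, n, t, h, rng):
    y = x.copy()
    last = np.zeros(len(x), dtype=int)  # inner iteration of last real update
    for s in range(1, t + 1):
        i = rng.integers(n)
        idx = support(i)
        # Catch up: each iteration in which these coordinates were not
        # updated contributed only the constant "old gradient" step -h*g[j].
        y[idx] -= h * (s - 1 - last[idx]) * g[idx]
        # The s-th step, sparse on support(i): full variance-reduced update.
        y[idx] -= h * (g[idx] + grad_vals(y, i) - grad_vals(x, i))
        last[idx] = s
    # Flush the remaining lazy steps so every coordinate is current.
    y -= h * (t - last) * g
    return y
```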
Sparse data implementation
S2GD+
Observing that SGD can make reasonable progress while S2GD computes its first full gradient (in case we are starting from an arbitrary point), we can formulate the following algorithm (S2GD+): first run a pass of SGD, then switch to S2GD
S2GD+ Experiment
High Probability Result
The convergence result holds only in expectation. Can we say anything about the concentration of the result in practice?
For any target accuracy $\varepsilon$ and confidence level $\rho \in (0, 1)$, we have: $O(\log(1/(\varepsilon\rho)))$ epochs suffice to guarantee $f(x_j) - f(x_*) \le \varepsilon$ with probability at least $1 - \rho$
Paying just the logarithm of the probability, independent of the other parameters
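The reasoning behind this (a one-line sketch via Markov's inequality, using the theorem's in-expectation bound):

$$\Pr\big[f(x_j) - f(x_*) \ge \varepsilon\big] \;\le\; \frac{\mathbb{E}\big[f(x_j) - f(x_*)\big]}{\varepsilon} \;\le\; \frac{c^j\,\big(f(x_0) - f(x_*)\big)}{\varepsilon},$$

so $j \ge \log\!\big(\tfrac{f(x_0) - f(x_*)}{\varepsilon\rho}\big) \big/ \log(1/c)$ epochs push the failure probability below $\rho$ – only an additive $\log(1/\rho)$ more than the in-expectation requirement.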
Code
An efficient implementation for logistic regression is available at MLOSS:
http://mloss.org/software/view/556/
mS2GD (mini-batch S2GD)
How does mini-batching influence the algorithm? Replace the single-sample direction $g + \nabla f_i(y) - \nabla f_i(x)$ by its mini-batch average (see the sketch below)
Provides a two-fold speedup: provably fewer gradient evaluations are needed (up to a certain number of mini-batches), and there is an easy possibility of parallelism
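A sketch of the mini-batch replacement (our notation: $A \subseteq \{1, \dots, n\}$ a uniformly random mini-batch of size $b$):

$$y \;\leftarrow\; y - h\Big(g + \frac{1}{b}\sum_{i \in A}\big(\nabla f_i(y) - \nabla f_i(x)\big)\Big)$$

The estimate stays unbiased, and its variance shrinks with $b$, which is where the savings in gradient evaluations come from.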
S2CD (Semi-Stochastic Coordinate Descent)
SGD-type methods: sampling rows (training examples) of the data matrix
Coordinate descent type methods: sampling columns (features) of the data matrix
Question: Can we do both? Sample both columns and rows
S2CD (Semi-Stochastic Coordinate Descent)
Complexity: an S2GD-type bound of the form $O\big((n + \hat{\kappa})\log(1/\varepsilon)\big)$, where $\hat{\kappa}$ is a coordinate-wise analogue of the condition number $\kappa$ appearing in the S2GD bound
S2GD as a Learning Algorithm
Problem with "us" optimizers
Optimizers care about optimization; statisticians care about statistics – each in isolation
The practical need to control both statistical predictive power and the effort spent on optimization is not well understood
Optimizers should be aware of… The following framework is mostly due to Bottou and Bousquet, 2007
Machine Learning Setting
Space of input-output pairs $(x, y) \in \mathcal{X} \times \mathcal{Y}$
Unknown distribution $P(x, y)$ – a relationship between inputs and outputs
Loss function $\ell(\hat{y}, y)$ to measure the discrepancy between predicted and real output
Define the Expected Risk: $E(f) = \mathbb{E}_{(x,y)\sim P}\big[\ell(f(x), y)\big]$
Machine Learning Setting
Ideal goal: find $f^*$ such that $E(f^*) = \min_f E(f)$
But you cannot even evaluate the Expected Risk $E(f)$, since $P$ is unknown
Machine Learning Setting
We at least have i.i.d. samples $(x_1, y_1), \dots, (x_n, y_n)$ from $P$
Define the Empirical Risk: $E_n(f) = \frac{1}{n}\sum_{i=1}^{n} \ell(f(x_i), y_i)$
Machine Learning Setting
First learning principle: fix a family $\mathcal{F}$ of candidate prediction functions
Find the Empirical Minimizer: $f_n = \arg\min_{f \in \mathcal{F}} E_n(f)$
Machine Learning Setting
Since the optimal $f^*$ is unlikely to belong to $\mathcal{F}$, we also define $f^*_{\mathcal{F}} = \arg\min_{f \in \mathcal{F}} E(f)$
Machine Learning Setting
Finding $f_n$ by minimizing the Empirical Risk exactly is often computationally expensive
Instead, run an optimization algorithm that returns $\tilde{f}_n$ such that $E_n(\tilde{f}_n) \le E_n(f_n) + \rho$
Recapitulation
$f^*$ – ideal optimum
$f^*_{\mathcal{F}}$ – "best" from our family
$f_n$ – Empirical Minimizer
$\tilde{f}_n$ – from approximate optimization
Machine Learning Goal
The big goal is to minimize the Excess Risk $\mathcal{E} = \mathbb{E}\big[E(\tilde{f}_n) - E(f^*)\big]$
It splits into approximation error, estimation error, and optimization error (see the decomposition below)
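The standard decomposition (following Bottou and Bousquet, with the quantities defined above):

$$\mathcal{E} \;=\; \underbrace{\mathbb{E}\big[E(f^*_{\mathcal{F}}) - E(f^*)\big]}_{\text{approximation}} \;+\; \underbrace{\mathbb{E}\big[E(f_n) - E(f^*_{\mathcal{F}})\big]}_{\text{estimation}} \;+\; \underbrace{\mathbb{E}\big[E(\tilde{f}_n) - E(f_n)\big]}_{\text{optimization}}$$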
Generic Machine Learning Problem
All this leads to a complicated compromise
Three variables: family of functions $\mathcal{F}$; number of examples $n$; optimization accuracy $\rho$
Two constraints: maximal number of examples; maximal computational time available
Generic Machine Learning Problem
Small scale learning problem: the first constraint (number of examples) is tight. We can reduce $\mathcal{E}_{\mathrm{opt}}$ to insignificant levels and recover the approximation–estimation tradeoff (well studied)
Large scale learning problem: the second constraint (computational time) is tight. A more complicated compromise
Solving Large Scale ML Problem
Several simplifications are needed
Do not carefully balance the three terms; instead only ensure that, asymptotically, they are of comparable order
Consider a fixed family of functions $\mathcal{F}$, linearly parameterized by a vector $w$ – effectively setting $\mathcal{E}_{\mathrm{app}}$ to be a constant
This simplifies the problem to the Estimation–Optimization tradeoff
Estimation–Optimization tradeoff
Using uniform convergence bounds, one can obtain a bound of the form $\mathcal{E}_{\mathrm{est}} = O\Big(\sqrt{\tfrac{d}{n}\log\tfrac{n}{d}}\Big)$, with $d$ the dimension of the parameterization
Such bounds are often considered weak
Estimation–Optimization tradeoff
Using Localized Bounds (Bousquet, PhD thesis, 2004) or Isomorphic Coordinate Projections (Bartlett and Mendelson, 2006), we get faster rates of the form $\mathcal{E}_{\mathrm{est}} = O\Big(\big(\tfrac{d}{n}\log\tfrac{n}{d}\big)^{\alpha}\Big)$ with $\alpha \in [\tfrac{1}{2}, 1]$
… if we can establish a variance condition bounding the variance of the excess loss by a power of its expectation
Often $\alpha = 1$, for example under strong convexity, or when making assumptions on the data distribution
Estimation–Optimization tradeoff
Using the previous bounds yields a bound of the form $\mathcal{E} \le c\,\Big(\mathcal{E}_{\mathrm{app}} + \big(\tfrac{d}{n}\log\tfrac{n}{d}\big)^{\alpha} + \rho\Big)$, where $c$ is an absolute constant
We want to push this term below the target accuracy $\varepsilon$. Choosing $n$ and $\rho$ accordingly, and using each algorithm's iteration cost, we get the following table