semi-stochastic gradient descent peter richtárik anc/dtc seminar, school of informatics, university...
TRANSCRIPT
![Page 1: Semi-Stochastic Gradient Descent Peter Richtárik ANC/DTC Seminar, School of Informatics, University of Edinburgh Edinburgh - November 4, 2014](https://reader038.vdocuments.us/reader038/viewer/2022103015/551c1864550346a84f8b5771/html5/thumbnails/1.jpg)
Semi-Stochastic Gradient Descent
Peter Richtárik
ANC/DTC Seminar, School of Informatics, University of Edinburgh Edinburgh - November 4, 2014
![Page 2: Semi-Stochastic Gradient Descent Peter Richtárik ANC/DTC Seminar, School of Informatics, University of Edinburgh Edinburgh - November 4, 2014](https://reader038.vdocuments.us/reader038/viewer/2022103015/551c1864550346a84f8b5771/html5/thumbnails/2.jpg)
Based on
Basic method: S2GD & S2GD+ Konecny and Richtarik. Semi-Stochastic Gradient
Descent Methods, arXiv:1312.1666, December 2013
Mini-batching (& proximal setup): mS2GD Konecny, Liu, Richtarik and Takac. mS2GD:
Minibatch Semi-Stochastic Coordinate Descent in the Proximal Setting, October 2014
Coordinate descent variant: S2CD Konecny, Qu and Richtarik. S2CD: Semi-
Stochastic Coordinate Descent, October 2014
![Page 3: Semi-Stochastic Gradient Descent Peter Richtárik ANC/DTC Seminar, School of Informatics, University of Edinburgh Edinburgh - November 4, 2014](https://reader038.vdocuments.us/reader038/viewer/2022103015/551c1864550346a84f8b5771/html5/thumbnails/3.jpg)
The Problem
![Page 4: Semi-Stochastic Gradient Descent Peter Richtárik ANC/DTC Seminar, School of Informatics, University of Edinburgh Edinburgh - November 4, 2014](https://reader038.vdocuments.us/reader038/viewer/2022103015/551c1864550346a84f8b5771/html5/thumbnails/4.jpg)
Minimizing Average Loss Problems are often structured
Frequently arising in machine learning
Structure – sum of functions is BIG
![Page 5: Semi-Stochastic Gradient Descent Peter Richtárik ANC/DTC Seminar, School of Informatics, University of Edinburgh Edinburgh - November 4, 2014](https://reader038.vdocuments.us/reader038/viewer/2022103015/551c1864550346a84f8b5771/html5/thumbnails/5.jpg)
Examples Linear regression (least squares)
Logistic regression (classification)
![Page 6: Semi-Stochastic Gradient Descent Peter Richtárik ANC/DTC Seminar, School of Informatics, University of Edinburgh Edinburgh - November 4, 2014](https://reader038.vdocuments.us/reader038/viewer/2022103015/551c1864550346a84f8b5771/html5/thumbnails/6.jpg)
Assumptions Lipschitz continuity of the gradient of
Strong convexity of
![Page 7: Semi-Stochastic Gradient Descent Peter Richtárik ANC/DTC Seminar, School of Informatics, University of Edinburgh Edinburgh - November 4, 2014](https://reader038.vdocuments.us/reader038/viewer/2022103015/551c1864550346a84f8b5771/html5/thumbnails/7.jpg)
Applications
![Page 8: Semi-Stochastic Gradient Descent Peter Richtárik ANC/DTC Seminar, School of Informatics, University of Edinburgh Edinburgh - November 4, 2014](https://reader038.vdocuments.us/reader038/viewer/2022103015/551c1864550346a84f8b5771/html5/thumbnails/8.jpg)
SPAM DETECTION
![Page 9: Semi-Stochastic Gradient Descent Peter Richtárik ANC/DTC Seminar, School of Informatics, University of Edinburgh Edinburgh - November 4, 2014](https://reader038.vdocuments.us/reader038/viewer/2022103015/551c1864550346a84f8b5771/html5/thumbnails/9.jpg)
PAGE RANKING
![Page 10: Semi-Stochastic Gradient Descent Peter Richtárik ANC/DTC Seminar, School of Informatics, University of Edinburgh Edinburgh - November 4, 2014](https://reader038.vdocuments.us/reader038/viewer/2022103015/551c1864550346a84f8b5771/html5/thumbnails/10.jpg)
FA’K’E DETECTION
![Page 11: Semi-Stochastic Gradient Descent Peter Richtárik ANC/DTC Seminar, School of Informatics, University of Edinburgh Edinburgh - November 4, 2014](https://reader038.vdocuments.us/reader038/viewer/2022103015/551c1864550346a84f8b5771/html5/thumbnails/11.jpg)
RECOMMENDER SYSTEMS
![Page 12: Semi-Stochastic Gradient Descent Peter Richtárik ANC/DTC Seminar, School of Informatics, University of Edinburgh Edinburgh - November 4, 2014](https://reader038.vdocuments.us/reader038/viewer/2022103015/551c1864550346a84f8b5771/html5/thumbnails/12.jpg)
GEOTAGGING
![Page 13: Semi-Stochastic Gradient Descent Peter Richtárik ANC/DTC Seminar, School of Informatics, University of Edinburgh Edinburgh - November 4, 2014](https://reader038.vdocuments.us/reader038/viewer/2022103015/551c1864550346a84f8b5771/html5/thumbnails/13.jpg)
Gradient Descentvs
Stochastic Gradient Descent
![Page 14: Semi-Stochastic Gradient Descent Peter Richtárik ANC/DTC Seminar, School of Informatics, University of Edinburgh Edinburgh - November 4, 2014](https://reader038.vdocuments.us/reader038/viewer/2022103015/551c1864550346a84f8b5771/html5/thumbnails/14.jpg)
http://madeincalifornia.blogspot.co.uk/2012/11/gradient-descent-algorithm.html
![Page 15: Semi-Stochastic Gradient Descent Peter Richtárik ANC/DTC Seminar, School of Informatics, University of Edinburgh Edinburgh - November 4, 2014](https://reader038.vdocuments.us/reader038/viewer/2022103015/551c1864550346a84f8b5771/html5/thumbnails/15.jpg)
Gradient Descent (GD) Update rule
Fast convergence rate
Alternatively, for accuracy we need
iterations
Complexity of single iteration: (measured in gradient evaluations)
![Page 16: Semi-Stochastic Gradient Descent Peter Richtárik ANC/DTC Seminar, School of Informatics, University of Edinburgh Edinburgh - November 4, 2014](https://reader038.vdocuments.us/reader038/viewer/2022103015/551c1864550346a84f8b5771/html5/thumbnails/16.jpg)
Stochastic Gradient Descent (SGD) Update rule
Why it works
Slow convergence
Complexity of single iteration – (measured in gradient evaluations)
a step-size parameter
![Page 17: Semi-Stochastic Gradient Descent Peter Richtárik ANC/DTC Seminar, School of Informatics, University of Edinburgh Edinburgh - November 4, 2014](https://reader038.vdocuments.us/reader038/viewer/2022103015/551c1864550346a84f8b5771/html5/thumbnails/17.jpg)
Dream…
GD SGD
Fast convergence
gradient evaluations in each iteration
Slow convergence
Complexity of iteration independent of
Combine in a single algorithm
![Page 18: Semi-Stochastic Gradient Descent Peter Richtárik ANC/DTC Seminar, School of Informatics, University of Edinburgh Edinburgh - November 4, 2014](https://reader038.vdocuments.us/reader038/viewer/2022103015/551c1864550346a84f8b5771/html5/thumbnails/18.jpg)
S2GD:Semi-Stochastic Gradient Descent
![Page 19: Semi-Stochastic Gradient Descent Peter Richtárik ANC/DTC Seminar, School of Informatics, University of Edinburgh Edinburgh - November 4, 2014](https://reader038.vdocuments.us/reader038/viewer/2022103015/551c1864550346a84f8b5771/html5/thumbnails/19.jpg)
Why dream may come true…
The gradient does not change drastically We could reuse old information
![Page 20: Semi-Stochastic Gradient Descent Peter Richtárik ANC/DTC Seminar, School of Informatics, University of Edinburgh Edinburgh - November 4, 2014](https://reader038.vdocuments.us/reader038/viewer/2022103015/551c1864550346a84f8b5771/html5/thumbnails/20.jpg)
Modifying “old” gradient Imagine someone gives us a “good” point
and
Gradient at point , near , can be expressed as
Approximation of the gradient
Already computed gradientGradient changeWe can try to estimate
![Page 21: Semi-Stochastic Gradient Descent Peter Richtárik ANC/DTC Seminar, School of Informatics, University of Edinburgh Edinburgh - November 4, 2014](https://reader038.vdocuments.us/reader038/viewer/2022103015/551c1864550346a84f8b5771/html5/thumbnails/21.jpg)
The S2GD Algorithm
Simplification; size of the inner loop is random, following a geometric rule
![Page 22: Semi-Stochastic Gradient Descent Peter Richtárik ANC/DTC Seminar, School of Informatics, University of Edinburgh Edinburgh - November 4, 2014](https://reader038.vdocuments.us/reader038/viewer/2022103015/551c1864550346a84f8b5771/html5/thumbnails/22.jpg)
Theorem
![Page 23: Semi-Stochastic Gradient Descent Peter Richtárik ANC/DTC Seminar, School of Informatics, University of Edinburgh Edinburgh - November 4, 2014](https://reader038.vdocuments.us/reader038/viewer/2022103015/551c1864550346a84f8b5771/html5/thumbnails/23.jpg)
Convergence Rate
How to set the parameters ?
Can be made arbitrarily small, by decreasing
For any fixed , can be made arbitrarily small by increasing
![Page 24: Semi-Stochastic Gradient Descent Peter Richtárik ANC/DTC Seminar, School of Informatics, University of Edinburgh Edinburgh - November 4, 2014](https://reader038.vdocuments.us/reader038/viewer/2022103015/551c1864550346a84f8b5771/html5/thumbnails/24.jpg)
Setting the Parameters
The accuracy is achieved by setting
Total complexity (in gradient evaluations)
# of epochs
full gradient evaluation cheap iterations
# of epochs
stepsize
# of iterations
Fix target accuracy
![Page 25: Semi-Stochastic Gradient Descent Peter Richtárik ANC/DTC Seminar, School of Informatics, University of Edinburgh Edinburgh - November 4, 2014](https://reader038.vdocuments.us/reader038/viewer/2022103015/551c1864550346a84f8b5771/html5/thumbnails/25.jpg)
Complexity S2GD complexity
GD complexity iterations complexity of a single iteration Total
![Page 26: Semi-Stochastic Gradient Descent Peter Richtárik ANC/DTC Seminar, School of Informatics, University of Edinburgh Edinburgh - November 4, 2014](https://reader038.vdocuments.us/reader038/viewer/2022103015/551c1864550346a84f8b5771/html5/thumbnails/26.jpg)
Experiment (logistic regression on: ijcnn, rcv, real-sim, url)
![Page 27: Semi-Stochastic Gradient Descent Peter Richtárik ANC/DTC Seminar, School of Informatics, University of Edinburgh Edinburgh - November 4, 2014](https://reader038.vdocuments.us/reader038/viewer/2022103015/551c1864550346a84f8b5771/html5/thumbnails/27.jpg)
Related Methods SAG – Stochastic Average Gradient
(Mark Schmidt, Nicolas Le Roux, Francis Bach, 2013)
Refresh single stochastic gradient in each iteration Need to store gradients. Similar convergence rate Cumbersome analysis
SAGA (Aaron Defazio, Francis Bach, Simon Lacoste-Julien, 2014)
Refined analysis MISO - Minimization by Incremental Surrogate
Optimization (Julien Mairal, 2014)
Similar to SAG, slightly worse performance Elegant analysis
![Page 28: Semi-Stochastic Gradient Descent Peter Richtárik ANC/DTC Seminar, School of Informatics, University of Edinburgh Edinburgh - November 4, 2014](https://reader038.vdocuments.us/reader038/viewer/2022103015/551c1864550346a84f8b5771/html5/thumbnails/28.jpg)
Related Methods
SVRG – Stochastic Variance Reduced Gradient(Rie Johnson, Tong Zhang, 2013)
Arises as a special case in S2GD Prox-SVRG
(Tong Zhang, Lin Xiao, 2014)
Extended to proximal setting EMGD – Epoch Mixed Gradient Descent
(Lijun Zhang, Mehrdad Mahdavi , Rong Jin, 2013)
Handles simple constraints, Worse convergence rate
![Page 29: Semi-Stochastic Gradient Descent Peter Richtárik ANC/DTC Seminar, School of Informatics, University of Edinburgh Edinburgh - November 4, 2014](https://reader038.vdocuments.us/reader038/viewer/2022103015/551c1864550346a84f8b5771/html5/thumbnails/29.jpg)
Extensions
![Page 30: Semi-Stochastic Gradient Descent Peter Richtárik ANC/DTC Seminar, School of Informatics, University of Edinburgh Edinburgh - November 4, 2014](https://reader038.vdocuments.us/reader038/viewer/2022103015/551c1864550346a84f8b5771/html5/thumbnails/30.jpg)
Extensions S2GD:
Efficient handling of sparse data Pre-processing with SGD (-> S2GD+) Inexact computation of gradients Non-strongly convex losses High-probability result
Mini-batching: mS2GD Konecny, Liu, Richtarik and Takac. mS2GD:
Minibatch Semi-Stochastic Coordinate Descent in the Proximal Setting, October 2014
Coordinate variant: S2CD Konecny, Qu and Richtarik. S2CD: Semi-Stochastic
Coordinate Descent, October 2014
![Page 31: Semi-Stochastic Gradient Descent Peter Richtárik ANC/DTC Seminar, School of Informatics, University of Edinburgh Edinburgh - November 4, 2014](https://reader038.vdocuments.us/reader038/viewer/2022103015/551c1864550346a84f8b5771/html5/thumbnails/31.jpg)
Semi-Stochastic Coordinate Descent
Complexity:
S2GD:
![Page 32: Semi-Stochastic Gradient Descent Peter Richtárik ANC/DTC Seminar, School of Informatics, University of Edinburgh Edinburgh - November 4, 2014](https://reader038.vdocuments.us/reader038/viewer/2022103015/551c1864550346a84f8b5771/html5/thumbnails/32.jpg)
Mini-batch Semi-Stochastic Gradient Descent
![Page 33: Semi-Stochastic Gradient Descent Peter Richtárik ANC/DTC Seminar, School of Informatics, University of Edinburgh Edinburgh - November 4, 2014](https://reader038.vdocuments.us/reader038/viewer/2022103015/551c1864550346a84f8b5771/html5/thumbnails/33.jpg)
Sparse Data For linear/logistic regression, gradient copies
sparsity pattern of example.
But the update direction is fully dense
Can we do something about it?
DENSESPARSE
![Page 34: Semi-Stochastic Gradient Descent Peter Richtárik ANC/DTC Seminar, School of Informatics, University of Edinburgh Edinburgh - November 4, 2014](https://reader038.vdocuments.us/reader038/viewer/2022103015/551c1864550346a84f8b5771/html5/thumbnails/34.jpg)
Sparse Data (Continued) Yes we can! To compute , we only need coordinates
of corresponding to nonzero elements of
For each coordinate , remember when was it updated last time – Before computing in inner iteration
number , update required coordinates Step being Compute direction and make a single update
Number of iterations when the coordinate was not updatedThe “old gradient”
![Page 35: Semi-Stochastic Gradient Descent Peter Richtárik ANC/DTC Seminar, School of Informatics, University of Edinburgh Edinburgh - November 4, 2014](https://reader038.vdocuments.us/reader038/viewer/2022103015/551c1864550346a84f8b5771/html5/thumbnails/35.jpg)
S2GD: Implementation for Sparse Data
![Page 36: Semi-Stochastic Gradient Descent Peter Richtárik ANC/DTC Seminar, School of Informatics, University of Edinburgh Edinburgh - November 4, 2014](https://reader038.vdocuments.us/reader038/viewer/2022103015/551c1864550346a84f8b5771/html5/thumbnails/36.jpg)
S2GD+ Observing that SGD can make reasonable
progress while S2GD computes the first full gradient,we can formulate the following algorithm (S2GD+)
![Page 37: Semi-Stochastic Gradient Descent Peter Richtárik ANC/DTC Seminar, School of Informatics, University of Edinburgh Edinburgh - November 4, 2014](https://reader038.vdocuments.us/reader038/viewer/2022103015/551c1864550346a84f8b5771/html5/thumbnails/37.jpg)
S2GD+ Experiment
![Page 38: Semi-Stochastic Gradient Descent Peter Richtárik ANC/DTC Seminar, School of Informatics, University of Edinburgh Edinburgh - November 4, 2014](https://reader038.vdocuments.us/reader038/viewer/2022103015/551c1864550346a84f8b5771/html5/thumbnails/38.jpg)
High Probability Result The result holds only in expectation Can we say anything about the concentration
of the result in practice?
For any
we have:
Paying just logarithm of probabilityIndependent from other parameters
![Page 39: Semi-Stochastic Gradient Descent Peter Richtárik ANC/DTC Seminar, School of Informatics, University of Edinburgh Edinburgh - November 4, 2014](https://reader038.vdocuments.us/reader038/viewer/2022103015/551c1864550346a84f8b5771/html5/thumbnails/39.jpg)
Inexact Case Question: What if we have access to inexact
oracle? Assume we can get the same update direction
with error :
S2GD algorithm in this setting gives
with
![Page 40: Semi-Stochastic Gradient Descent Peter Richtárik ANC/DTC Seminar, School of Informatics, University of Edinburgh Edinburgh - November 4, 2014](https://reader038.vdocuments.us/reader038/viewer/2022103015/551c1864550346a84f8b5771/html5/thumbnails/40.jpg)
Code Efficient implementation for logistic
regression -available at MLOSS
http://mloss.org/software/view/556/