![Page 1: Peter Richtárik Randomized Dual Coordinate Ascent with Arbitrary Sampling 1 st UCL Workshop on the Theory of Big Data – London– January 2015](https://reader038.vdocuments.us/reader038/viewer/2022110320/56649cb15503460f94976367/html5/thumbnails/1.jpg)
Peter Richtárik
Randomized Dual Coordinate Ascent with Arbitrary Sampling
1st UCL Workshop on the Theory of Big Data, London, January 2015
Coauthors
Zheng Qu (Edinburgh, Mathematics)
Tong Zhang (Rutgers, Statistics; Baidu, Big Data Lab)
Zheng Qu, P.R. and Tong Zhang, Randomized dual coordinate ascent with arbitrary sampling, arXiv:1411.5873, 2014
Part 1: MACHINE LEARNING
Statistical nature of data
Data (e.g., image, text, measurements, …)
Label
Prediction of labels from data
Find a linear predictor such that, when a (data, label) pair is drawn from the distribution, the predicted label is close to the true label.
Measure of Success
We want the expected loss (= risk) of the predictor on a random (data, label) pair to be small.
Finding a Linear Predictor via Empirical Risk Minimization
Draw i.i.d. data (samples) from the distribution.
Output the predictor that minimizes the empirical risk, i.e., the average loss over the drawn samples.
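The empirical risk of a linear predictor can be sketched in a few lines. This is an illustration only: the squared loss, the synthetic data, and the names `empirical_risk`, `squared_loss` and `w_true` are assumptions made here, not taken from the slides.

```python
import numpy as np

def empirical_risk(w, X, y, loss):
    """Average loss of the linear predictor x -> <w, x> over the sample."""
    predictions = X @ w  # one predicted label per row (sample) of X
    return float(np.mean(loss(predictions, y)))

def squared_loss(pred, label):
    # Illustrative choice; the slides do not fix a particular loss.
    return 0.5 * (pred - label) ** 2

# Small synthetic sample whose labels are produced exactly by w_true.
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true

risk_at_truth = empirical_risk(w_true, X, y, squared_loss)
risk_at_zero = empirical_risk(np.zeros(3), X, y, squared_loss)
```

On this noiseless sample the generating vector attains zero empirical risk, while the zero predictor does not.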
Part 2: OPTIMIZATION
Primal Problem
d = # features (parameters)
n = # samples
Regularizer: 1-strongly convex function
Loss functions: smooth & convex
Regularization parameter
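For reference, a regularized empirical risk problem matching these annotations can be written as below. The symbols are assumptions chosen to follow the usual notation of arXiv:1411.5873: $A_i$ is the $i$-th sample, $\phi_i$ the $i$-th loss, $g$ the regularizer, and $\lambda > 0$ the regularization parameter.

```latex
\min_{w \in \mathbb{R}^d} \; P(w) := \frac{1}{n} \sum_{i=1}^{n} \phi_i\left(A_i^\top w\right) + \lambda\, g(w)
```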
Assumption 1
Loss functions have Lipschitz gradient
Lipschitz constant
Assumption 2
Regularizer is 1-strongly convex
subgradient
Dual Problem
Conjugate-loss term: strongly convex
Conjugate-regularizer term: 1-smooth & convex
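A sketch of the corresponding dual, under assumed notation: $\phi_i^*$ and $g^*$ denote the convex conjugates of the losses $\phi_i$ and the regularizer $g$, $A_i$ is the $i$-th sample, $\lambda$ the regularization parameter, and $\alpha = (\alpha_1, \dots, \alpha_n)$ collects one dual variable per sample. The conjugate of a function with Lipschitz gradient is strongly convex, and the conjugate of a 1-strongly convex function is 1-smooth, which is where the two annotations come from.

```latex
\max_{\alpha} \; D(\alpha) := -\lambda\, g^*\!\left( \frac{1}{\lambda n} \sum_{i=1}^{n} A_i \alpha_i \right) - \frac{1}{n} \sum_{i=1}^{n} \phi_i^*(-\alpha_i)
```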
Part 3: ALGORITHM
Quartz
Fenchel Duality
Weak duality
Optimality conditions
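Weak duality and the optimality conditions can be checked numerically on a small instance. The sketch below assumes ridge regression (squared loss, regularizer 0.5*||w||^2) and the standard Fenchel primal-dual pair for regularized empirical risk; the function names `primal` and `dual` and the synthetic data are hypothetical.

```python
import numpy as np

def primal(w, X, y, lam):
    # P(w) = (1/n) sum_i 0.5*(x_i.w - y_i)^2 + lam * 0.5*||w||^2
    return 0.5 * np.mean((X @ w - y) ** 2) + 0.5 * lam * (w @ w)

def dual(alpha, X, y, lam):
    # D(alpha) = -lam * g*(alpha_bar) - (1/n) sum_i phi_i*(-alpha_i),
    # with g*(s) = 0.5*||s||^2 and phi_i*(u) = 0.5*u^2 + u*y_i.
    n = X.shape[0]
    alpha_bar = X.T @ alpha / (lam * n)
    conj_losses = 0.5 * alpha ** 2 - alpha * y  # phi_i*(-alpha_i)
    return -0.5 * lam * (alpha_bar @ alpha_bar) - np.mean(conj_losses)

rng = np.random.default_rng(1)
n, d, lam = 30, 4, 0.1
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Weak duality: P(w) >= D(alpha) for arbitrary w and alpha.
gaps = [
    primal(rng.standard_normal(d), X, y, lam)
    - dual(rng.standard_normal(n), X, y, lam)
    for _ in range(100)
]

# Optimality conditions: w = grad g*(alpha_bar) and alpha_i = -phi_i'(x_i.w).
w_star = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)
alpha_star = -(X @ w_star - y)
gap_at_optimum = primal(w_star, X, y, lam) - dual(alpha_star, X, y, lam)
```

Every random pair gives a nonnegative gap, and the pair satisfying the optimality conditions closes the gap up to rounding.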
The Algorithm
Quartz: Bird’s Eye View
STEP 1: PRIMAL UPDATE
STEP 2: DUAL UPDATE
The Algorithm
STEP 1
STEP 2
Convex combination constant
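The two steps can be sketched in code. This is a minimal sketch under stated assumptions, not the authors' implementation: it takes ridge regression (squared loss, regularizer 0.5*||w||^2), serial uniform sampling, and the update rules and the constant `theta` in the form given in arXiv:1411.5873. The names `quartz_ridge` and `duality_gap` are hypothetical.

```python
import numpy as np

def duality_gap(w, alpha, X, y, lam):
    n = X.shape[0]
    alpha_bar = X.T @ alpha / (lam * n)
    primal = 0.5 * np.mean((X @ w - y) ** 2) + 0.5 * lam * (w @ w)
    dual = -0.5 * lam * (alpha_bar @ alpha_bar) - np.mean(0.5 * alpha ** 2 - alpha * y)
    return primal - dual

def quartz_ridge(X, y, lam, iters, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    gamma = 1.0                      # the squared loss has a 1-Lipschitz gradient
    p = np.full(n, 1.0 / n)          # serial uniform sampling (assumption)
    v = np.sum(X ** 2, axis=1)       # v_i = ||x_i||^2 for serial sampling
    theta = np.min(p * lam * gamma * n / (v + lam * gamma * n))

    w = np.zeros(d)
    alpha = np.zeros(n)
    alpha_bar = X.T @ alpha / (lam * n)
    for _ in range(iters):
        # STEP 1 (primal update): convex combination with grad g*(alpha_bar),
        # which equals alpha_bar for g = 0.5*||.||^2.
        w = (1 - theta) * w + theta * alpha_bar
        # STEP 2 (dual update): convex combination of the old dual variable
        # and minus the loss derivative at the current prediction.
        i = rng.choice(n, p=p)
        step = theta / p[i]
        new_alpha_i = (1 - step) * alpha[i] - step * (X[i] @ w - y[i])
        alpha_bar += (new_alpha_i - alpha[i]) * X[i] / (lam * n)
        alpha[i] = new_alpha_i
    return w, alpha

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # normalize each sample
y = rng.standard_normal(50)
lam = 0.1

gap_start = duality_gap(np.zeros(5), np.zeros(50), X, y, lam)
w, alpha = quartz_ridge(X, y, lam, iters=3000)
gap_end = duality_gap(w, alpha, X, y, lam)
```

The duality gap serves as a computable stopping criterion here, since it upper-bounds the primal suboptimality.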
Randomized Primal-Dual Methods
SDCA: SS Shwartz & T Zhang, 09/2012
mSDCA: M Takáč, A Bijral, P R & N Srebro, 03/2013
ASDCA: SS Shwartz & T Zhang, 05/2013
AccProx-SDCA: SS Shwartz & T Zhang, 10/2013
DisDCA: T Yang, 2013
Iprox-SDCA: P Zhao & T Zhang, 01/2014
APCG: Q Lin, Z Lu & L Xiao, 07/2014
SPDC: Y Zhang & L Xiao, 09/2014
Quartz: Z Qu, P R & T Zhang, 11/2014
Part 4: MAIN RESULT
Assumption 3 (Expected Separable Overapproximation)
inequality must hold for all
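A sketch of the ESO inequality in the form used in arXiv:1411.5873, with assumed notation: $\hat{S}$ is the random subset of $\{1,\dots,n\}$ produced by the sampling, $p_i = \mathbf{P}(i \in \hat{S})$, $A_i$ is the $i$-th sample, and $v_1, \dots, v_n > 0$ are the ESO parameters. The inequality is required for all $h_1, \dots, h_n$.

```latex
\mathbf{E}\left[ \Big\| \sum_{i \in \hat{S}} A_i h_i \Big\|^2 \right] \;\le\; \sum_{i=1}^{n} p_i\, v_i\, \|h_i\|^2
```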
Complexity Theorem (QRZ’14)
Part 5a: UPDATING ONE DUAL VARIABLE AT A TIME
Complexity of Quartz specialized to serial sampling
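To make the effect of the sampling concrete, the sketch below compares uniform and importance (optimal) serial sampling. It assumes that the iteration bound has the form max_i (1/p_i + v_i / (p_i * lam * gamma * n)) times a logarithmic factor, with v_i = ||A_i||^2 for serial samplings, as in arXiv:1411.5873. The name `iteration_factor` and the values in `L` are hypothetical.

```python
import numpy as np

def iteration_factor(p, v, lam, gamma):
    """Factor multiplying log(1/eps): max_i (1/p_i + v_i / (p_i*lam*gamma*n))."""
    n = len(p)
    return float(np.max(1.0 / p + v / (p * lam * gamma * n)))

rng = np.random.default_rng(0)
n, lam, gamma = 1000, 1e-3, 1.0
L = rng.uniform(0.1, 10.0, size=n)   # hypothetical values of ||A_i||^2

uniform_p = np.full(n, 1.0 / n)
# Importance sampling: p_i proportional to L_i + lam*gamma*n.
importance_p = (L + lam * gamma * n) / np.sum(L + lam * gamma * n)

uniform_factor = iteration_factor(uniform_p, L, lam, gamma)        # n + max(L)/(lam*gamma)
importance_factor = iteration_factor(importance_p, L, lam, gamma)  # n + mean(L)/(lam*gamma)
```

Under these assumptions, importance sampling replaces the maximum of the L_i by their average, which helps most when the sample norms are very uneven.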
Data
Experiment: Quartz vs SDCA, uniform vs optimal sampling
Standard primal update
“Aggressive” primal update
Part 5b: TAU-NICE SAMPLING (STANDARD MINIBATCHING)
Data sparsity
A normalized measure of average sparsity of the data
“Fully sparse data” “Fully dense data”
Complexity of Quartz
Speedup
Assume the data is normalized:
Then:
Linear speedup up to a certain data-independent minibatch size:
Further data-dependent speedup, up to the extreme case:
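The two speedup regimes can be explored numerically. The sketch below assumes normalized data and a leading complexity term of the form n/tau + (1 + (omega - 1)(tau - 1)/(n - 1)) / (lam * gamma * tau), where `omega` stands for the sparsity measure (1 for fully sparse data, n for fully dense data). The formula is recalled from arXiv:1411.5873 and should be treated as an assumption, as should the names `tau_nice_factor` and `speedup`.

```python
def tau_nice_factor(tau, n, omega, lam, gamma):
    """Assumed leading term of the iteration bound for tau-nice sampling."""
    return n / tau + (1 + (omega - 1) * (tau - 1) / (n - 1)) / (lam * gamma * tau)

def speedup(tau, n, omega, lam, gamma):
    """Ratio of the serial bound (tau = 1) to the minibatch bound."""
    return tau_nice_factor(1, n, omega, lam, gamma) / tau_nice_factor(tau, n, omega, lam, gamma)

n, lam, gamma = 10**6, 1e-3, 1.0          # so that lam * gamma * n = 1000
taus = [1, 10, 100, 1000]

dense = [speedup(t, n, omega=n, lam=lam, gamma=gamma) for t in taus]
sparse = [speedup(t, n, omega=1, lam=lam, gamma=gamma) for t in taus]
```

With these numbers, fully dense data still achieves at least half of the ideal speedup for minibatch sizes up to about lam * gamma * n, while fully sparse data achieves the ideal speedup tau.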
Speedup: sparse data
Speedup: denser data
Speedup: fully dense data
astro_ph: n = 29,882 density = 0.08%
CCAT: n = 781,265 density = 0.16%
Primal-dual methods with tau-nice sampling
SS-Shwartz & T Zhang ‘13
SS-Shwartz & T Zhang ‘13
Y Zhang & L Xiao ‘14
For sufficiently sparse data, Quartz wins even when compared against accelerated methods
Accelerated
Part 6: ESO
Zheng Qu and P.R., Coordinate Descent with Arbitrary Sampling II: Expected Separable Overapproximation, arXiv:1412.8063, 2014
Computation of ESO parameters
Lemma (QR’14b) {For simplicity, assume that m = 1}
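For the case m = 1, ESO parameters for tau-nice sampling can be checked by brute force on a tiny instance. The sketch assumes the formula v_i = sum_j (1 + (omega_j - 1)(tau - 1)/(n - 1)) * A_ji^2, where omega_j is the number of nonzeros in feature row j. This formula is recalled from arXiv:1412.8063 and is an assumption here, as are the names `tau_nice_eso_params` and `expected_sq_norm`.

```python
import itertools
import numpy as np

def tau_nice_eso_params(A, tau):
    # A has shape (d, n): column i is sample A_i (case m = 1).
    d, n = A.shape
    omega = np.count_nonzero(A, axis=1)              # nonzeros per feature row
    weights = 1 + (omega - 1) * (tau - 1) / (n - 1)
    return (weights[:, None] * A ** 2).sum(axis=0)   # one v_i per column

def expected_sq_norm(A, h, tau):
    # Exact expectation of ||sum_{i in S} A_i h_i||^2 over all subsets S of
    # size tau, each equally likely (tau-nice sampling).
    d, n = A.shape
    subsets = list(itertools.combinations(range(n), tau))
    total = 0.0
    for S in subsets:
        idx = list(S)
        total += np.sum((A[:, idx] @ h[idx]) ** 2)
    return total / len(subsets)

rng = np.random.default_rng(0)
d, n, tau = 4, 6, 3
A = rng.standard_normal((d, n)) * (rng.random((d, n)) < 0.5)   # sparse data
v = tau_nice_eso_params(A, tau)

h = rng.standard_normal(n)
lhs = expected_sq_norm(A, h, tau)
rhs = (tau / n) * np.sum(v * h ** 2)    # p_i = tau / n for tau-nice sampling
```

The enumeration is exponential in n and only meant as a sanity check; the point of the closed-form parameters is that they avoid it.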
ESO
Theorem (QR’14b)
For any sampling, ESO holds with
where
ESO
Part 7: DISTRIBUTED QUARTZ
Distributed Quartz: Perform the Dual Updates in a Distributed Manner
Quartz STEP 2: DUAL UPDATE
Data required to compute the update
Distribution of Data
n = # dual variables
Data matrix
Distributed sampling
Distributed sampling
Random set of dual variables
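One way to generate such a random set is sketched below. It assumes that the n dual variables are split into c equal contiguous blocks, one per node, and that each node independently picks tau of its own variables uniformly at random. The partitioning scheme and the name `distributed_sampling` are assumptions, not taken from the slides.

```python
import numpy as np

def distributed_sampling(n, c, tau, rng):
    """Each of c nodes picks tau of its n/c dual variables uniformly at random."""
    if n % c != 0:
        raise ValueError("this sketch assumes c divides n")
    per_node = n // c
    picks = []
    for node in range(c):
        owned = np.arange(node * per_node, (node + 1) * per_node)
        picks.append(rng.choice(owned, size=tau, replace=False))
    return np.concatenate(picks)

rng = np.random.default_rng(0)
n, c, tau = 12, 3, 2
per_node = n // c
S = distributed_sampling(n, c, tau, rng)   # c * tau indices in total
```

Since every node samples only from the variables it owns, no communication is needed to form the set.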
Distributed sampling & distributed coordinate descent
Previously studied (not in the primal-dual setup):
P.R. and Martin Takáč, Distributed coordinate descent method for learning with big data, arXiv:1310.2059, 2013
Olivier Fercoq, Zheng Qu, P.R. and Martin Takáč, Fast distributed coordinate descent for minimizing non-strongly convex losses, 2014 IEEE Int Workshop on Machine Learning for Signal Processing, May 2014
Jakub Mareček, P.R. and Martin Takáč, Fast distributed coordinate descent for minimizing partially separable functions, arXiv:1406.0238, June 2014
Settings covered: strongly convex & smooth; convex & smooth
Complexity of distributed Quartz
Reallocating load: theoretical speedup
n = 1,000,000, density = 100%
n = 1,000,000, density = 0.01%
Reallocating load: experiment
Data: webspam, n = 350,000, density = 33.51%
Experiment
Machine: 128 nodes of HECToR Supercomputer (4096 cores)
Problem: LASSO, n = 1 billion, d = 0.5 billion, 3 TB
Algorithm: with c = 512
P.R. and Martin Takáč, Distributed coordinate descent method for learning with big data, arXiv:1310.2059, 2013
LASSO: 3TB data + 128 nodes
Experiment
Machine: 128 nodes of ARCHER Supercomputer
Problem: LASSO, n = 5 million, d = 50 billion, 5 TB (60,000 nnz per row of A)
Algorithm: Hydra² with c = 256
Olivier Fercoq, Zheng Qu, P.R. and Martin Takáč, Fast distributed coordinate descent for minimizing non-strongly convex losses, IEEE Int Workshop on Machine Learning for Signal Processing, 2014
LASSO: 5TB data (d = 50b) + 128 nodes
Related Work
Zheng Qu and P.R., Coordinate descent with arbitrary sampling II: expected separable overapproximation, arXiv:1412.8063, 2014
Zheng Qu and P.R., Coordinate descent with arbitrary sampling I: algorithms and complexity, arXiv:1412.8060, 2014
P.R. and Martin Takáč, On optimal probabilities in stochastic coordinate descent methods, arXiv:1310.3438, 2013
END