Parallel coordinate descent methods


Page 1: descent methods

Parallel coordinate descent methods

Peter Richtárik

Simons Institute for the Theory of Computing, Berkeley
Parallel and Distributed Algorithms for Inference and Optimization, October 23, 2013

Page 2: descent methods
Page 3: descent methods

Randomized Coordinate Descent in 2D

Page 4: descent methods

2D Optimization

Contours of a function.

Goal: find the minimizer of the function.

[Figure: contour plot of the function in 2D.]

Pages 5–12: descent methods

Randomized Coordinate Descent in 2D

[Figure sequence, one frame per slide: starting from an initial point on the contour plot, each step moves along a single coordinate direction (the axes are marked N, S, E, W); the iterates 1 through 7 zig-zag toward the minimizer, and after step 7 the slide reads "SOLVED!"]

Page 13: descent methods

Convergence of Randomized Coordinate Descent

[Table of convergence results by problem class: strongly convex F; smooth or ‘simple’ nonsmooth F; ‘difficult’ nonsmooth F.]

Focus on the dependence on n (big data = big n).

Page 14: descent methods

Parallelization Dream

Serial vs. Parallel: what we WANT vs. what we actually get.

Whether the dream is realized depends on the extent to which we can add up the individual updates, which in turn depends on the properties of F and on the way coordinates are chosen at each iteration.
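In leading order, the dream can be written as follows (a sketch in my own notation, assuming a convex objective and writing τ for the number of coordinates updated per iteration; not the slide's exact formula):

% Serial randomized coordinate descent: on the order of n/epsilon iterations.
% Parallel dream (WANT): updating tau coordinates per iteration should
% divide the iteration count by tau.
\text{Serial: } \; k = O\!\left(\frac{n}{\epsilon}\right)
\qquad\qquad
\text{Parallel (WANT): } \; k = O\!\left(\frac{n}{\tau\,\epsilon}\right)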

Page 15: descent methods

How (not) to Parallelize Coordinate Descent

Page 16: descent methods

“Naive” parallelization

Do the same thing as before, but for MORE or ALL coordinates, and ADD UP the updates.

Pages 17–21: descent methods

Failure of naive parallelization

[Figure sequence, one frame per slide: from the starting point 0, two coordinate-wise updates 1a and 1b are computed independently and ADDED UP, giving iterate 1, which misses the minimizer; repeating with updates 2a and 2b gives iterate 2, which misses again. OOPS!]

Page 22: descent methods

Idea: averaging updates may help

[Figure: starting again from point 0 with the same coordinate-wise updates 1a and 1b, taking their average instead of their sum lands iterate 1 on the minimizer. SOLVED!]

Page 23: descent methods

Averaging can be too conservative

[Figure sequence: on another problem, averaging the coordinate-wise updates 1a and 1b moves the iterate only part of the way; averaging 2a and 2b does the same, so the iterates 1, 2, ... only creep toward the minimizer, and so on...]

Page 24: descent methods

Averaging may be too conservative 2

[Formula residue: the averaged update falls short of the summed update we actually wanted (WANT). BAD!!!]

Page 25: descent methods

What to do?

Averaging: $x_{\text{new}} = x + \frac{1}{|\hat{S}|} \sum_{i \in \hat{S}} h^{(i)} e_i$

Summation: $x_{\text{new}} = x + \sum_{i \in \hat{S}} h^{(i)} e_i$

Here $h^{(i)}$ is the update to coordinate $i$ and $e_i$ is the $i$-th unit coordinate vector.

Figure out when one can safely use summation.
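A small self-contained sketch (my own toy example, not code from the talk) reproducing the behaviour of the preceding slides: summing exact coordinate-wise updates oscillates forever on a coupled objective, averaging solves that case in one step, yet averaging is needlessly slow on a separable objective where summation would be ideal.

import numpy as np

def updates_coupled(x):
    """Exact coordinate-wise updates for f(x) = (x1 + x2)^2."""
    # Minimizing over x1 with x2 fixed gives x1 = -x2, i.e. h1 = -x2 - x1;
    # symmetrically, h2 = -x1 - x2.
    return np.array([-x[1] - x[0], -x[0] - x[1]])

def updates_separable(x):
    """Exact coordinate-wise updates for f(x) = x1^2 + x2^2."""
    return -x  # each coordinate jumps straight to its own minimizer, 0

def iterate(update_fn, weight, x0, steps=5):
    """weight = 1.0 reproduces summation, weight = 0.5 reproduces averaging."""
    x = np.array(x0, dtype=float)
    trajectory = [x.copy()]
    for _ in range(steps):
        x = x + weight * update_fn(x)
        trajectory.append(x.copy())
    return trajectory

print(iterate(updates_coupled, 1.0, [1.0, 0.0]))    # summation: oscillates, never converges (OOPS!)
print(iterate(updates_coupled, 0.5, [1.0, 0.0]))    # averaging: minimizer reached in one step (SOLVED!)
print(iterate(updates_separable, 1.0, [1.0, 1.0]))  # summation: minimizer reached in one step
print(iterate(updates_separable, 0.5, [1.0, 1.0]))  # averaging: error only halves per step (too conservative)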

Page 26: descent methods

Optimization Problems

Page 27: descent methods

Problem

Minimize $F(x) = f(x) + \Omega(x)$ over $x \in \mathbb{R}^n$.

Loss $f$: convex (smooth or nonsmooth).

Regularizer $\Omega$: convex (smooth or nonsmooth), separable, and allowed to take the value $+\infty$ (so constraints can be modeled).

Page 28: descent methods

Regularizer: examples

• No regularizer
• Weighted L1 norm (e.g., LASSO)
• Weighted L2 norm
• Box constraints (e.g., SVM dual)
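Separability of the regularizer is what makes coordinate updates cheap: the one-dimensional subproblem for each coordinate has a closed-form solution. A minimal sketch (illustrative helper names, not from the talk):

import numpy as np

def prox_weighted_l1(t, lam):
    """argmin_h 0.5*(h - t)^2 + lam*|h|: soft-thresholding (weighted L1 norm, e.g. LASSO)."""
    return np.sign(t) * np.maximum(np.abs(t) - lam, 0.0)

def prox_box(t, lo, hi):
    """argmin_h 0.5*(h - t)^2 subject to lo <= h <= hi: projection (box constraints, e.g. SVM dual)."""
    return np.clip(t, lo, hi)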

Page 29: descent methods

Loss: examples

• Quadratic loss
• L-infinity
• L1 regression
• Exponential loss
• Logistic loss
• Square hinge loss

References: [BKBG’11], [RT’11b], [TBRS’13], [RT’13a], [FR’13]

Page 30: descent methods

3 models for f with small

1. Smooth partially separable f [RT’11b]
2. Nonsmooth max-type f [FR’13]
3. f with ‘bounded Hessian’ [BKBG’11, RT’13a]

Page 31: descent methods

General Theory

Page 32: descent methods

Randomized Parallel Coordinate Descent Method

$x_{k+1} = x_k + \sum_{i \in \hat{S}_k} h^{(i)} e_i$

Here $\hat{S}_k$ is a random set of coordinates (a sampling), $x_k$ the current iterate, $x_{k+1}$ the new iterate, $e_i$ the $i$-th unit coordinate vector, and $h^{(i)}$ the update to the $i$-th coordinate.
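To make the update rule concrete, here is a minimal sketch of how the method might be instantiated for LASSO, F(x) = 0.5*||Ax - b||^2 + lam*||x||_1, with a τ-nice sampling. The β-scaled step anticipates the ESO development on the following slides; the function names and details are my reconstruction, not the talk's code.

import numpy as np

def soft_threshold(t, a):
    return np.sign(t) * np.maximum(np.abs(t) - a, 0.0)

def parallel_cd_lasso(A, b, lam, tau, iters=1000, seed=0):
    rng = np.random.default_rng(seed)
    m, n = A.shape
    L = (A ** 2).sum(axis=0)                        # coordinate-wise curvature constants w_i
    L[L == 0.0] = 1.0                               # guard against all-zero columns
    omega = max(1, int((A != 0).sum(axis=1).max()))           # degree of partial separability
    beta = 1.0 + (omega - 1) * (tau - 1) / max(n - 1, 1)      # ESO parameter for tau-nice sampling
    x = np.zeros(n)
    residual = A @ x - b
    for _ in range(iters):
        S = rng.choice(n, size=tau, replace=False)  # tau-nice sampling
        grad_S = A[:, S].T @ residual               # partial gradients of the smooth loss
        step = 1.0 / (beta * L[S])
        new_xS = soft_threshold(x[S] - step * grad_S, lam * step)
        residual += A[:, S] @ (new_xS - x[S])       # all sampled coordinates updated "in parallel"
        x[S] = new_xS
    return x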

Page 33: descent methods

ESO: Expected Separable Overapproximation

Definition [RT’11b]. An expected separable overapproximation of $f$ with respect to a sampling $\hat{S}$: an upper bound on $\mathbf{E}[f(x + h_{[\hat{S}]})]$ which

1. is separable in h,
2. can be minimized in parallel,
3. requires computing updates only for the coordinates in $\hat{S}$.

Minimize the overapproximation in h.

Shorthand: $(f, \hat{S}) \sim \mathrm{ESO}(\beta, w)$.
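For reference, the ESO inequality is commonly written as follows in the notation of the talk's [RT’11b]/[RT’12] references (a reconstruction; consult the papers for the exact statement):

% (f, S) ~ ESO(beta, w): expected separable overapproximation of f
% with respect to the sampling \hat{S}.
\mathbf{E}\!\left[ f\!\left(x + h_{[\hat{S}]}\right) \right]
\;\le\;
f(x) + \frac{\mathbf{E}[|\hat{S}|]}{n}
\left( \left\langle \nabla f(x), h \right\rangle + \frac{\beta}{2}\,\|h\|_w^2 \right),
\qquad
\|h\|_w^2 = \sum_{i=1}^{n} w_i \left(h^{(i)}\right)^2 .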

Page 34: descent methods

Convergence rate: convex f

Theorem [RT’11b]. A lower bound on the number of iterations k, expressed in terms of the # coordinates n, the average # of updated coordinates per iteration τ, the stepsize parameter β, and the error tolerance ε, implies an ε-accurate solution with high probability.
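Ignoring constants and logarithmic factors, the bound has the following shape (a sketch of the dependence only, not the theorem's exact statement):

% Convex case: iterations scale as n*beta/(tau*epsilon); a beta close to 1
% therefore means a nearly linear speedup in tau.
k \;\gtrsim\; \frac{n\,\beta}{\tau} \cdot \frac{1}{\epsilon}
\quad\Longrightarrow\quad
\mathbf{P}\!\left( F(x_k) - F^{*} \le \epsilon \right) \ge 1 - \rho .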

Page 35: descent methods

Convergence rate: strongly convex f

Theorem [RT’11b]. A lower bound on the number of iterations implies an ε-accurate solution with high probability; the bound involves the strong convexity constant of the loss f and the strong convexity constant of the regularizer.

Page 36: descent methods

Partial Separability and Doubly Uniform Samplings

Page 37: descent methods

Serial uniform sampling

Probability law:

Page 38: descent methods

τ-nice sampling

Probability law:

Good for shared memory systems

Page 39: descent methods

Doubly uniform sampling

Probability law:

Can model unreliable processors / machines
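The three samplings introduced on the last three slides are usually defined by the following probability laws (my reconstruction, in standard notation):

% Serial uniform sampling: one coordinate, chosen uniformly at random.
\mathbf{P}\big(\hat{S} = \{i\}\big) = \frac{1}{n}, \qquad i = 1, \dots, n.

% tau-nice sampling: a subset of exactly tau coordinates, chosen uniformly at random.
\mathbf{P}\big(\hat{S} = S\big) = \binom{n}{\tau}^{-1} \quad \text{for all } S \text{ with } |S| = \tau.

% Doubly uniform sampling: the probability of a set depends only on its cardinality.
\mathbf{P}\big(\hat{S} = S\big) = \frac{q_{|S|}}{\binom{n}{|S|}}, \qquad q_j = \mathbf{P}\big(|\hat{S}| = j\big).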

Page 40: descent methods

ESO for partially separable functions and doubly uniform samplings

Theorem [RT’11b]

1. Smooth partially separable f [RT’11b]
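The headline constant, in the special case of a τ-nice sampling and a smooth f that is partially separable of degree ω, is (my reconstruction; the theorem also covers general doubly uniform samplings):

% ESO parameter for a tau-nice sampling of a partially separable f of degree omega.
\beta \;=\; 1 + \frac{(\omega - 1)(\tau - 1)}{\max(1,\, n - 1)} .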

Page 41: descent methods

Theoretical speedup

ω = degree of partial separability, n = # coordinates, τ = # coordinate updates / iteration.

WEAK OR NO SPEEDUP: non-separable (dense) problems.

LINEAR OR GOOD SPEEDUP: nearly separable (sparse) problems. Much of Big Data is here!
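A quick numerical illustration of the speedup factor τ/β implied by this slide, using the β formula above (toy numbers of my choosing):

def speedup(n, omega, tau):
    """Theoretical parallelization speedup tau / beta,
    with beta = 1 + (omega - 1) * (tau - 1) / (n - 1)."""
    beta = 1.0 + (omega - 1) * (tau - 1) / (n - 1)
    return tau / beta

print(speedup(n=10**6, omega=10, tau=1000))      # nearly separable (sparse): ~991x, almost linear
print(speedup(n=10**6, omega=10**6, tau=1000))   # non-separable (dense): 1x, no speedup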

Page 42: descent methods
Page 43: descent methods

Theory

[Plot, n = 1000 (# coordinates).]

Page 44: descent methods

Practice

[Plot, n = 1000 (# coordinates).]

Page 45: descent methods

Experiment with a 1 billion-by-2 billion LASSO problem

Page 46: descent methods

Optimization with Big Data = Extreme* Mountain Climbing

* in a billion dimensional space on a foggy day

Pages 47–49: descent methods

[Plots from the LASSO experiment, one per slide: Coordinate Updates, Iterations, Wall Time.]
Page 50: descent methods

Distributed-Memory Coordinate Descent

Page 51: descent methods

Distributed τ-nice sampling

Probability law:

[Figure: the coordinates are partitioned across Machine 1, Machine 2 and Machine 3; each machine samples from its own block.]

Good for a distributed version of coordinate descent.

Page 52: descent methods

ESO: Distributed setting

Theorem [RT’13b]. The ESO constant involves the spectral norm of the data.

3. f with ‘bounded Hessian’ [BKBG’11, RT’13a]

Page 53: descent methods

Bad partitioning at most doubles the number of iterations

Theorem [RT’13b]. A lower bound on the number of iterations (equivalently, on the # of updates per node), expressed in terms of the # of nodes and the spectral norm of the partitioning, implies an ε-accurate solution.

Page 54: descent methods
Page 55: descent methods

LASSO with a 3TB data matrix

128 Cray XE6 nodes, each running 4 MPI processes (c = 512). Each node: 2 x 16 cores with 32GB RAM.

[The slide also gives n, the # of coordinates.]

Page 56: descent methods

Conclusions

• Coordinate descent methods scale very well to big data problems of special structure:
  – partial separability (sparsity)
  – small spectral norm of the data
  – Nesterov separability, ...

• Care is needed when combining updates (add them up? average?).

Page 57: descent methods

References: serial coordinate descent

• Shai Shalev-Shwartz and Ambuj Tewari, Stochastic methods for L1-regularized loss minimization. JMLR 2011.

• Yurii Nesterov, Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341-362, 2012.

• [RT’11b] P.R. and Martin Takáč, Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 2012.

• Rachael Tappenden, P.R. and Jacek Gondzio, Inexact coordinate descent: complexity and preconditioning. arXiv:1304.5530, 2013.

• Ion Necoara, Yurii Nesterov, and Francois Glineur, Efficiency of randomized coordinate descent methods on optimization problems with linearly coupled constraints. Technical report, Politehnica University of Bucharest, 2012.

• Zhaosong Lu and Lin Xiao, On the complexity analysis of randomized block-coordinate descent methods. Technical report, Microsoft Research, 2013.

Page 58: descent methods

References: parallel coordinate descent

• [BKBG’11] Joseph Bradley, Aapo Kyrola, Danny Bickson and Carlos Guestrin, Parallel Coordinate Descent for L1-Regularized Loss Minimization. ICML 2011.

• [RT’12] P.R. and Martin Takáč, Parallel coordinate descent methods for big data optimization. arXiv:1212.0873, 2012.

• Martin Takáč, Avleen Bijral, P.R., and Nathan Srebro, Mini-batch primal and dual methods for SVMs. ICML 2013.

• [FR’13] Olivier Fercoq and P.R., Smooth minimization of nonsmooth functions with parallel coordinate descent methods. arXiv:1309.5885, 2013.

• [RT’13a] P.R. and Martin Takáč, Distributed coordinate descent method for big data learning. arXiv:1310.2059, 2013.

• [RT’13b] P.R. and Martin Takáč, On optimal probabilities in stochastic coordinate descent methods. arXiv:1310.3438, 2013.

Good entry point to the topic (4p paper)

Page 59: descent methods

References: parallel coordinate descent

• P.R. and Martin Takáč, Efficient serial and parallel coordinate descent methods for huge-scale truss topology design. Operations Research Proceedings 2012.

• Rachael Tappenden, P.R. and Burak Buke, Separable approximations and decomposition methods for the augmented Lagrangian. arXiv:1308.6774, 2013.

• Indranil Palit and Chandan K. Reddy, Scalable and parallel boosting with MapReduce. IEEE Transactions on Knowledge and Data Engineering, 24(10):1904-1916, 2012.

• Shai Shalev-Shwartz and Tong Zhang, Accelerated mini-batch stochastic dual coordinate ascent. NIPS 2013.

TALK TOMORROW