CoCoA: Communication-Efficient Coordinate Ascent
DESCRIPTION
"COCOA: Communication-Efficient Coordinate Ascent" presentation from AMPCamp 5 by Virginia SmithTRANSCRIPT
CoCoA: Communication-Efficient Coordinate Ascent
Virginia Smith, Martin Jaggi, Martin Takáč, Jonathan Terhorst, Sanjay Krishnan, Thomas Hofmann, & Michael I. Jordan
LARGE-SCALE OPTIMIZATION
COCOA
RESULTS
Machine Learning with Large Datasets
image/music/video tagging, document categorization, item recommendation, click-through rate prediction, sequence tagging, protein structure prediction, sensor data prediction, spam classification, fraud detection
Machine Learning Workflow
DATA & PROBLEM: classification, regression, collaborative filtering, …
MACHINE LEARNING MODEL: logistic regression, lasso, support vector machines, …
OPTIMIZATION ALGORITHM: gradient descent, coordinate descent, Newton’s method, …
Example: SVM Classification

[Figure: linearly separable data with separating hyperplane w·x − b = 0, margin hyperplanes w·x − b = ±1, normal vector w, and margin width 2/‖w‖]

$$\min_{w \in \mathbb{R}^d} \; \frac{\lambda}{2}\|w\|^2 + \frac{1}{n}\sum_{i=1}^{n} \ell_{\mathrm{hinge}}\!\left(y_i\, w^T x_i\right)$$
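To make the objective concrete, here is a minimal NumPy sketch that evaluates this regularized hinge-loss objective for a given w; the function name svm_primal and the array layout (X of shape (n, d), labels y in ±1) are illustrative assumptions, not part of the talk.

```python
# Minimal sketch (assumed names): evaluate the SVM primal objective
# (lambda/2)*||w||^2 + (1/n) * sum_i hinge(y_i * w^T x_i).
import numpy as np

def svm_primal(w, X, y, lam):
    margins = y * (X @ w)                       # y_i * w^T x_i for all i
    hinge = np.maximum(0.0, 1.0 - margins)      # hinge loss per example
    return 0.5 * lam * np.dot(w, w) + hinge.mean()
```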
Descent algorithms and line search methods
Acceleration, momentum, and conjugate gradients
Newton and quasi-Newton methods
Coordinate descent
Stochastic and incremental gradient methods
SMO, SVMlight, LIBLINEAR
Machine Learning Workflow
DATA & PROBLEM: classification, regression, collaborative filtering, …
MACHINE LEARNING MODEL: logistic regression, lasso, support vector machines, …
OPTIMIZATION ALGORITHM: gradient descent, coordinate descent, Newton’s method, …
SYSTEMS SETTING: multi-core, cluster, cloud, supercomputer, …
Open Problem: efficiently solving the objective when data is distributed
Distributed Optimization

"Always communicate"
reduce: $w = w - \alpha \sum_k \Delta w_k$
✔ convergence guarantees
✗ high communication

"Never communicate" [ZDWJ, 2012]
average: $w := \frac{1}{K} \sum_k w_k$
✔ low communication
✗ convergence not guaranteed
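For intuition, here is a toy single-process sketch of the two extremes on a least-squares problem, with the data held as a list of per-machine NumPy blocks; the function names and the choice of least squares are illustrative assumptions.

```python
# Toy sketch (assumed setup): blocks = [(X_1, y_1), ..., (X_K, y_K)] simulates
# the data partitioned across K machines; least squares stands in for the objective.
import numpy as np

def always_communicate(blocks, w, alpha=0.1, T=100):
    """Reduce a gradient contribution from every machine at every step."""
    for _ in range(T):
        grads = [Xk.T @ (Xk @ w - yk) / len(yk) for Xk, yk in blocks]
        w = w - alpha * sum(grads)              # reduce: w = w - alpha * sum_k dw_k
    return w

def never_communicate(blocks, w, alpha=0.1, T=100):
    """Each machine solves locally; results are averaged once at the end."""
    local_ws = []
    for Xk, yk in blocks:
        wk = w.copy()
        for _ in range(T):
            wk -= alpha * Xk.T @ (Xk @ wk - yk) / len(yk)
        local_ws.append(wk)
    return sum(local_ws) / len(blocks)          # average: w := (1/K) * sum_k w_k
```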
Mini-batch

reduce: $w = w - \frac{\alpha}{|b|} \sum_{i \in b} \Delta w_i$
✔ convergence guarantees
✔ tunable communication
a natural middle-ground
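As an illustration of the mini-batch reduce step, here is a small sketch of one round of mini-batch (sub)gradient descent for the hinge-loss SVM above; the function name and sampling scheme are illustrative assumptions.

```python
# Small sketch (assumed name): one mini-batch step,
# reduce: w = w - (alpha/|b|) * sum_{i in b} dw_i, using hinge-loss subgradients.
# rng is a NumPy Generator, e.g. np.random.default_rng().
import numpy as np

def minibatch_step(w, X, y, lam, alpha, batch_size, rng):
    b = rng.choice(len(y), size=batch_size, replace=False)   # sample the batch b
    margins = y[b] * (X[b] @ w)
    active = margins < 1.0                                    # examples with nonzero hinge loss
    dw = -(y[b] * active)[:, None] * X[b] + lam * w           # per-example (sub)gradients
    return w - (alpha / batch_size) * dw.sum(axis=0)
```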
Are we done?
Mini-batch Limitations
1. METHODS BEYOND SGD
2. STALE UPDATES
3. AVERAGE OVER BATCH SIZE
LARGE-SCALE OPTIMIZATION
COCOA
RESULTS
Mini-batch Limitations
1. METHODS BEYOND SGD: use a primal-dual framework
2. STALE UPDATES: immediately apply local updates
3. AVERAGE OVER BATCH SIZE: average over K << batch size
Communication-Efficient Distributed Dual Coordinate Ascent (CoCoA)
1. Primal-Dual Framework

PRIMAL ≥ DUAL

$$\min_{w \in \mathbb{R}^d} \left[\, P(w) := \frac{\lambda}{2}\|w\|^2 + \frac{1}{n}\sum_{i=1}^{n} \ell_i\!\left(w^T x_i\right) \right]$$

$$\max_{\alpha \in \mathbb{R}^n} \left[\, D(\alpha) := -\frac{\lambda}{2}\|A\alpha\|^2 - \frac{1}{n}\sum_{i=1}^{n} \ell_i^{*}(-\alpha_i) \right], \qquad A_i = \frac{1}{\lambda n}\, x_i$$

Stopping criteria given by the duality gap
Good performance in practice
Default in software packages, e.g. liblinear
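As a concrete illustration of using the duality gap as a stopping criterion, here is a small sketch for the hinge-loss case, where the conjugate satisfies ℓᵢ*(−αᵢ) = −yᵢαᵢ on the feasible box yᵢαᵢ ∈ [0, 1]; the function names are illustrative assumptions.

```python
# Sketch (assumed names): primal P(w), dual D(alpha), and duality gap for the
# hinge-loss SVM, with A_i = x_i / (lambda * n) so that w(alpha) = A alpha.
import numpy as np

def primal(w, X, y, lam):
    return 0.5 * lam * np.dot(w, w) + np.maximum(0.0, 1.0 - y * (X @ w)).mean()

def dual(alpha, X, y, lam):
    n = len(y)
    w = X.T @ alpha / (lam * n)                 # w(alpha) = A alpha
    # valid when y_i * alpha_i lies in [0, 1] for every i
    return -0.5 * lam * np.dot(w, w) + (alpha * y).mean()

def duality_gap(alpha, X, y, lam):
    w = X.T @ alpha / (lam * len(y))
    return primal(w, X, y, lam) - dual(alpha, X, y, lam)   # stop when small enough
```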
2. Immediately Apply Updates

STALE (mini-batch):
  for i ∈ b
    Δw ← Δw − α ∇_i P(w)
  end
  w ← w + Δw

FRESH (CoCoA):
  for i ∈ b
    Δw ← Δw − α ∇_i P(w)
    w ← w + Δw
  end
  Output: Δα_[k] and Δw := A_[k] Δα_[k]
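A minimal single-machine sketch of the difference, using a generic per-example gradient function grad_i supplied by the caller (an illustrative assumption):

```python
# Sketch (assumed names): the same pass over a batch, with stale vs. fresh updates.
import numpy as np

def stale_pass(w, batch, grad_i, alpha):
    """Mini-batch style: accumulate updates, apply them once at the end (stale)."""
    dw = np.zeros_like(w)
    for i in batch:
        dw -= alpha * grad_i(w, i)     # every update is computed at the same old w
    return w + dw

def fresh_pass(w, batch, grad_i, alpha):
    """CoCoA-style local solver: each update is applied immediately (fresh)."""
    for i in batch:
        w = w - alpha * grad_i(w, i)   # later updates see the effect of earlier ones
    return w
```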
3. Average over K

reduce: $w = w + \frac{1}{K}\sum_k \Delta w_k$
CoCoA

Algorithm 1: CoCoA
  Input: T ≥ 1, scaling parameter 1 ≤ β_K ≤ K (default: β_K := 1)
  Data: {(x_i, y_i)}_{i=1}^n distributed over K machines
  Initialize: α^{(0)}_{[k]} ← 0 for all machines k, and w^{(0)} ← 0
  for t = 1, 2, …, T
    for all machines k = 1, 2, …, K in parallel
      (Δα_{[k]}, Δw_k) ← LocalDualMethod(α^{(t−1)}_{[k]}, w^{(t−1)})
      α^{(t)}_{[k]} ← α^{(t−1)}_{[k]} + (β_K / K) Δα_{[k]}
    end
    reduce: w^{(t)} ← w^{(t−1)} + (β_K / K) Σ_{k=1}^{K} Δw_k
  end

Procedure A: LocalDualMethod (dual algorithm on machine k)
  Input: local α_{[k]} ∈ R^{n_k}, and w ∈ R^d consistent with the other coordinate blocks of α, s.t. w = Aα
  Data: local {(x_i, y_i)}_{i=1}^{n_k}
  Output: Δα_{[k]} and Δw := A_{[k]} Δα_{[k]}

✔ <10 lines of code in Spark
✔ primal-dual framework allows for any internal optimization method
✔ local updates applied immediately
✔ average over K
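To connect the algorithm box to running code, here is a minimal single-process sketch that simulates the CoCoA outer loop over K data blocks held as NumPy arrays, using a few hinge-loss SDCA steps as LocalDualMethod; the function names (cocoa, local_sdca) and the serial simulation of the parallel loop are illustrative assumptions, not the authors' Spark implementation.

```python
# Sketch (assumed names/setup): a serial simulation of CoCoA, with LocalDualMethod
# instantiated as H SDCA steps for the hinge-loss SVM (A_i = x_i / (lambda * n)).
import numpy as np

def local_sdca(alpha_k, w, X_k, y_k, lam, n, H, rng):
    """Return (delta_alpha_k, delta_w_k) after H local coordinate updates."""
    alpha_new = alpha_k.copy()
    w_local = w.copy()
    for _ in range(H):
        i = rng.integers(len(y_k))
        x_i, y_i = X_k[i], y_k[i]
        # Closed-form SDCA update for the hinge loss, keeping y_i*alpha_i in [0, 1].
        q = y_i * alpha_new[i] + (1.0 - y_i * w_local.dot(x_i)) / (x_i.dot(x_i) / (lam * n) + 1e-12)
        delta = y_i * min(1.0, max(0.0, q)) - alpha_new[i]
        alpha_new[i] += delta
        w_local += delta * x_i / (lam * n)      # update applied immediately (fresh)
    d_alpha = alpha_new - alpha_k
    return d_alpha, d_alpha @ X_k / (lam * n)   # delta_w = A_[k] * delta_alpha_[k]

def cocoa(X_blocks, y_blocks, lam, T=50, H=1000, beta_K=1.0, seed=0):
    rng = np.random.default_rng(seed)
    K = len(X_blocks)
    n = sum(len(y) for y in y_blocks)
    w = np.zeros(X_blocks[0].shape[1])
    alphas = [np.zeros(len(y)) for y in y_blocks]
    for _ in range(T):
        deltas_w = []
        for k in range(K):                      # runs in parallel on a real cluster
            d_alpha, d_w = local_sdca(alphas[k], w, X_blocks[k], y_blocks[k], lam, n, H, rng)
            alphas[k] += (beta_K / K) * d_alpha
            deltas_w.append(d_w)
        w += (beta_K / K) * sum(deltas_w)       # the only communication per round
    return w
```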
Convergence

$$\mathbb{E}\big[D(\alpha^{*}) - D(\alpha^{(T)})\big] \;\le\; \left(1 - (1-\Theta)\,\frac{1}{K}\,\frac{\lambda n \gamma}{\sigma + \lambda n \gamma}\right)^{T}\left(D(\alpha^{*}) - D(\alpha^{(0)})\right)$$

Assumptions: the losses $\ell_i$ are $(1/\gamma)$-smooth, and LocalDualMethod makes a $\Theta$-approximate improvement per step; e.g. for SDCA run for $H$ local iterations,

$$\Theta = \left(1 - \frac{\lambda n \gamma}{1 + \lambda n \gamma}\,\frac{1}{\tilde n}\right)^{H}$$

where $\tilde n = n/K$ is the local dataset size. Here $\sigma$ is a measure of the difficulty of the data partition, with $0 \le \sigma \le n/K$. The same guarantee applies also to the duality gap.

✔ it converges!
✔ convergence rate matches current state-of-the-art mini-batch methods
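To make the geometric rate concrete, here is a tiny numeric illustration of the bound with assumed example values for n, K, λ, γ, σ, and Θ (all hypothetical, chosen only to show the shape of the rate):

```python
# Hypothetical example values, only to illustrate the shape of the bound.
n, K, lam, gamma = 100_000, 16, 1e-4, 1.0
sigma = n / K                        # worst-case partition difficulty (sigma <= n/K)
Theta = 0.5                          # assumed local-solver improvement per round
rate = 1 - (1 - Theta) * (1.0 / K) * (lam * n * gamma) / (sigma + lam * n * gamma)
print(rate)           # per-round contraction factor of the dual suboptimality bound
print(rate ** 100)    # factor multiplying the initial gap after T = 100 rounds
```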
LARGE-SCALE OPTIMIZATION
COCOA
RESULTS
Empirical Results in Spark
Dataset Training (n) Features (d) Sparsity Workers (K)
Cov 522,911 54 22.22% 4
Rcv1 677,399 47,236 0.16% 8
Imagenet 32,751 160,000 100% 32
[Plots: log primal suboptimality vs. time (s), one panel per dataset]
Imagenet: COCOA (H=1e3), mini-batch-CD (H=1), local-SGD (H=1e3), mini-batch-SGD (H=10)
Cov: COCOA (H=1e5), mini-batch-CD (H=100), local-SGD (H=1e5), batch-SGD (H=1)
RCV1: COCOA (H=1e5), mini-batch-CD (H=100), local-SGD (H=1e4), batch-SGD (H=100)
COCOA Take-Aways
A framework for distributed optimization
Uses state-of-the-art primal-dual methods
Reduces communication:
 - applies updates immediately
 - averages over # of machines
Strong empirical results & convergence guarantees

To appear in NIPS ’14
www.berkeley.edu/~vsmith