CoCoA: Communication-Efficient Coordinate Ascent
DESCRIPTION
"COCOA: Communication-Efficient Coordinate Ascent" presentation from AMPCamp 5 by Virginia SmithTRANSCRIPT
CoCoA: Communication-Efficient Coordinate Ascent
Virginia Smith, Martin Jaggi, Martin Takáč, Jonathan Terhorst, Sanjay Krishnan, Thomas Hofmann, & Michael I. Jordan
LARGE-SCALE OPTIMIZATION
COCOA
RESULTS
Machine Learning with Large Datasets
image/music/video tagging, document categorization, item recommendation, click-through rate prediction, sequence tagging, protein structure prediction, sensor data prediction, spam classification, fraud detection
Machine Learning Workflow
DATA & PROBLEM: classification, regression, collaborative filtering, …
MACHINE LEARNING MODEL: logistic regression, lasso, support vector machines, …
OPTIMIZATION ALGORITHM: gradient descent, coordinate descent, Newton’s method, …
Example: SVM Classification

[Figure: linearly separable data with separating hyperplane w·x − b = 0, margin hyperplanes w·x − b = ±1, normal vector w, and margin width 2/‖w‖]

$$\min_{w \in \mathbb{R}^d} \; \frac{\lambda}{2}\|w\|^2 + \frac{1}{n}\sum_{i=1}^{n} \ell_{\mathrm{hinge}}\!\left(y_i\, w^T x_i\right)$$
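To make the objective concrete, here is a minimal NumPy sketch that evaluates this regularized hinge-loss objective for a given w; the function name svm_primal and the array layout (X of shape (n, d), labels y in ±1) are illustrative assumptions, not part of the talk.

```python
# Minimal sketch (assumed names): evaluate the SVM primal objective
# (lambda/2)*||w||^2 + (1/n) * sum_i hinge(y_i * w^T x_i).
import numpy as np

def svm_primal(w, X, y, lam):
    margins = y * (X @ w)                       # y_i * w^T x_i for all i
    hinge = np.maximum(0.0, 1.0 - margins)      # hinge loss per example
    return 0.5 * lam * np.dot(w, w) + hinge.mean()
```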
Descent algorithms and line search methods
Acceleration, momentum, and conjugate gradients
Newton and quasi-Newton methods
Coordinate descent
Stochastic and incremental gradient methods
SMO, SVMlight, LIBLINEAR
Machine Learning Workflow
DATA & PROBLEM: classification, regression, collaborative filtering, …
MACHINE LEARNING MODEL: logistic regression, lasso, support vector machines, …
OPTIMIZATION ALGORITHM: gradient descent, coordinate descent, Newton’s method, …
SYSTEMS SETTING: multi-core, cluster, cloud, supercomputer, …
Open Problem: efficiently solving the objective when data is distributed
Distributed Optimization

"Always communicate"
reduce: $w = w - \alpha \sum_k \Delta w_k$
✔ convergence guarantees
✗ high communication

"Never communicate" [ZDWJ, 2012]
average: $w := \frac{1}{K} \sum_k w_k$
✔ low communication
✗ convergence not guaranteed
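For intuition, here is a toy single-process sketch of the two extremes on a least-squares problem, with the data held as a list of per-machine NumPy blocks; the function names and the choice of least squares are illustrative assumptions.

```python
# Toy sketch (assumed setup): blocks = [(X_1, y_1), ..., (X_K, y_K)] simulates
# the data partitioned across K machines; least squares stands in for the objective.
import numpy as np

def always_communicate(blocks, w, alpha=0.1, T=100):
    """Reduce a gradient contribution from every machine at every step."""
    for _ in range(T):
        grads = [Xk.T @ (Xk @ w - yk) / len(yk) for Xk, yk in blocks]
        w = w - alpha * sum(grads)              # reduce: w = w - alpha * sum_k dw_k
    return w

def never_communicate(blocks, w, alpha=0.1, T=100):
    """Each machine solves locally; results are averaged once at the end."""
    local_ws = []
    for Xk, yk in blocks:
        wk = w.copy()
        for _ in range(T):
            wk -= alpha * Xk.T @ (Xk @ wk - yk) / len(yk)
        local_ws.append(wk)
    return sum(local_ws) / len(blocks)          # average: w := (1/K) * sum_k w_k
```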
Mini-batch

reduce: $w = w - \frac{\alpha}{|b|} \sum_{i \in b} \Delta w_i$
✔ convergence guarantees
✔ tunable communication
a natural middle-ground
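As an illustration of the mini-batch reduce step, here is a small sketch of one round of mini-batch (sub)gradient descent for the hinge-loss SVM above; the function name and sampling scheme are illustrative assumptions.

```python
# Small sketch (assumed name): one mini-batch step,
# reduce: w = w - (alpha/|b|) * sum_{i in b} dw_i, using hinge-loss subgradients.
# rng is a NumPy Generator, e.g. np.random.default_rng().
import numpy as np

def minibatch_step(w, X, y, lam, alpha, batch_size, rng):
    b = rng.choice(len(y), size=batch_size, replace=False)   # sample the batch b
    margins = y[b] * (X[b] @ w)
    active = margins < 1.0                                    # examples with nonzero hinge loss
    dw = -(y[b] * active)[:, None] * X[b] + lam * w           # per-example (sub)gradients
    return w - (alpha / batch_size) * dw.sum(axis=0)
```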
Are we done?
Mini-batch Limitations
1. METHODS BEYOND SGD
2. STALE UPDATES
3. AVERAGE OVER BATCH SIZE
LARGE-SCALE OPTIMIZATION
COCOA
RESULTS
Mini-batch Limitations
1. METHODS BEYOND SGD: use a primal-dual framework
2. STALE UPDATES: immediately apply local updates
3. AVERAGE OVER BATCH SIZE: average over K << batch size
Communication-Efficient Distributed Dual Coordinate Ascent (CoCoA)
1. Primal-Dual Framework

PRIMAL ≥ DUAL

$$\min_{w \in \mathbb{R}^d} \left[\, P(w) := \frac{\lambda}{2}\|w\|^2 + \frac{1}{n}\sum_{i=1}^{n} \ell_i\!\left(w^T x_i\right) \right]$$

$$\max_{\alpha \in \mathbb{R}^n} \left[\, D(\alpha) := -\frac{\lambda}{2}\|A\alpha\|^2 - \frac{1}{n}\sum_{i=1}^{n} \ell_i^{*}(-\alpha_i) \right], \qquad A_i = \frac{1}{\lambda n}\, x_i$$

Stopping criteria given by the duality gap
Good performance in practice
Default in software packages, e.g. liblinear
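As a concrete illustration of using the duality gap as a stopping criterion, here is a small sketch for the hinge-loss case, where the conjugate satisfies ℓᵢ*(−αᵢ) = −yᵢαᵢ on the feasible box yᵢαᵢ ∈ [0, 1]; the function names are illustrative assumptions.

```python
# Sketch (assumed names): primal P(w), dual D(alpha), and duality gap for the
# hinge-loss SVM, with A_i = x_i / (lambda * n) so that w(alpha) = A alpha.
import numpy as np

def primal(w, X, y, lam):
    return 0.5 * lam * np.dot(w, w) + np.maximum(0.0, 1.0 - y * (X @ w)).mean()

def dual(alpha, X, y, lam):
    n = len(y)
    w = X.T @ alpha / (lam * n)                 # w(alpha) = A alpha
    # valid when y_i * alpha_i lies in [0, 1] for every i
    return -0.5 * lam * np.dot(w, w) + (alpha * y).mean()

def duality_gap(alpha, X, y, lam):
    w = X.T @ alpha / (lam * len(y))
    return primal(w, X, y, lam) - dual(alpha, X, y, lam)   # stop when small enough
```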
2. Immediately Apply Updates

STALE (mini-batch):
  for i ∈ b
    Δw ← Δw − α ∇_i P(w)
  end
  w ← w + Δw

FRESH (CoCoA):
  for i ∈ b
    Δw ← Δw − α ∇_i P(w)
    w ← w + Δw
  end
  Output: Δα_[k] and Δw := A_[k] Δα_[k]
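A minimal single-machine sketch of the difference, using a generic per-example gradient function grad_i supplied by the caller (an illustrative assumption):

```python
# Sketch (assumed names): the same pass over a batch, with stale vs. fresh updates.
import numpy as np

def stale_pass(w, batch, grad_i, alpha):
    """Mini-batch style: accumulate updates, apply them once at the end (stale)."""
    dw = np.zeros_like(w)
    for i in batch:
        dw -= alpha * grad_i(w, i)     # every update is computed at the same old w
    return w + dw

def fresh_pass(w, batch, grad_i, alpha):
    """CoCoA-style local solver: each update is applied immediately (fresh)."""
    for i in batch:
        w = w - alpha * grad_i(w, i)   # later updates see the effect of earlier ones
    return w
```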
3. Average over K

reduce: $w = w + \frac{1}{K}\sum_k \Delta w_k$
CoCoA

Algorithm 1: CoCoA
  Input: T ≥ 1, scaling parameter 1 ≤ β_K ≤ K (default: β_K := 1)
  Data: {(x_i, y_i)}_{i=1}^n distributed over K machines
  Initialize: α^{(0)}_{[k]} ← 0 for all machines k, and w^{(0)} ← 0
  for t = 1, 2, …, T
    for all machines k = 1, 2, …, K in parallel
      (Δα_{[k]}, Δw_k) ← LocalDualMethod(α^{(t−1)}_{[k]}, w^{(t−1)})
      α^{(t)}_{[k]} ← α^{(t−1)}_{[k]} + (β_K / K) Δα_{[k]}
    end
    reduce: w^{(t)} ← w^{(t−1)} + (β_K / K) Σ_{k=1}^{K} Δw_k
  end

Procedure A: LocalDualMethod (dual algorithm on machine k)
  Input: local α_{[k]} ∈ R^{n_k}, and w ∈ R^d consistent with the other coordinate blocks of α, s.t. w = Aα
  Data: local {(x_i, y_i)}_{i=1}^{n_k}
  Output: Δα_{[k]} and Δw := A_{[k]} Δα_{[k]}

✔ <10 lines of code in Spark
✔ primal-dual framework allows for any internal optimization method
✔ local updates applied immediately
✔ average over K
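To connect the algorithm box to running code, here is a minimal single-process sketch that simulates the CoCoA outer loop over K data blocks held as NumPy arrays, using a few hinge-loss SDCA steps as LocalDualMethod; the function names (cocoa, local_sdca) and the serial simulation of the parallel loop are illustrative assumptions, not the authors' Spark implementation.

```python
# Sketch (assumed names/setup): a serial simulation of CoCoA, with LocalDualMethod
# instantiated as H SDCA steps for the hinge-loss SVM (A_i = x_i / (lambda * n)).
import numpy as np

def local_sdca(alpha_k, w, X_k, y_k, lam, n, H, rng):
    """Return (delta_alpha_k, delta_w_k) after H local coordinate updates."""
    alpha_new = alpha_k.copy()
    w_local = w.copy()
    for _ in range(H):
        i = rng.integers(len(y_k))
        x_i, y_i = X_k[i], y_k[i]
        # Closed-form SDCA update for the hinge loss, keeping y_i*alpha_i in [0, 1].
        q = y_i * alpha_new[i] + (1.0 - y_i * w_local.dot(x_i)) / (x_i.dot(x_i) / (lam * n) + 1e-12)
        delta = y_i * min(1.0, max(0.0, q)) - alpha_new[i]
        alpha_new[i] += delta
        w_local += delta * x_i / (lam * n)      # update applied immediately (fresh)
    d_alpha = alpha_new - alpha_k
    return d_alpha, d_alpha @ X_k / (lam * n)   # delta_w = A_[k] * delta_alpha_[k]

def cocoa(X_blocks, y_blocks, lam, T=50, H=1000, beta_K=1.0, seed=0):
    rng = np.random.default_rng(seed)
    K = len(X_blocks)
    n = sum(len(y) for y in y_blocks)
    w = np.zeros(X_blocks[0].shape[1])
    alphas = [np.zeros(len(y)) for y in y_blocks]
    for _ in range(T):
        deltas_w = []
        for k in range(K):                      # runs in parallel on a real cluster
            d_alpha, d_w = local_sdca(alphas[k], w, X_blocks[k], y_blocks[k], lam, n, H, rng)
            alphas[k] += (beta_K / K) * d_alpha
            deltas_w.append(d_w)
        w += (beta_K / K) * sum(deltas_w)       # the only communication per round
    return w
```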
Convergence

$$\mathbb{E}\big[D(\alpha^{*}) - D(\alpha^{(T)})\big] \;\le\; \left(1 - (1-\Theta)\,\frac{1}{K}\,\frac{\lambda n \gamma}{\sigma + \lambda n \gamma}\right)^{T}\left(D(\alpha^{*}) - D(\alpha^{(0)})\right)$$

Assumptions: the losses $\ell_i$ are $(1/\gamma)$-smooth, and LocalDualMethod makes a $\Theta$-approximate improvement per step; e.g. for SDCA run for $H$ local iterations,

$$\Theta = \left(1 - \frac{\lambda n \gamma}{1 + \lambda n \gamma}\,\frac{1}{\tilde n}\right)^{H}$$

where $\tilde n = n/K$ is the local dataset size. Here $\sigma$ is a measure of the difficulty of the data partition, with $0 \le \sigma \le n/K$. The same guarantee applies also to the duality gap.

✔ it converges!
✔ convergence rate matches current state-of-the-art mini-batch methods
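To make the geometric rate concrete, here is a tiny numeric illustration of the bound with assumed example values for n, K, λ, γ, σ, and Θ (all hypothetical, chosen only to show the shape of the rate):

```python
# Hypothetical example values, only to illustrate the shape of the bound.
n, K, lam, gamma = 100_000, 16, 1e-4, 1.0
sigma = n / K                        # worst-case partition difficulty (sigma <= n/K)
Theta = 0.5                          # assumed local-solver improvement per round
rate = 1 - (1 - Theta) * (1.0 / K) * (lam * n * gamma) / (sigma + lam * n * gamma)
print(rate)           # per-round contraction factor of the dual suboptimality bound
print(rate ** 100)    # factor multiplying the initial gap after T = 100 rounds
```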
LARGE-SCALE OPTIMIZATION
COCOA
RESULTS
Empirical Results in Spark
Dataset Training (n) Features (d) Sparsity Workers (K)
Cov 522,911 54 22.22% 4
Rcv1 677,399 47,236 0.16% 8
Imagenet 32,751 160,000 100% 32
[Plots: log primal suboptimality vs. time (s), one panel per dataset]
Imagenet: COCOA (H=1e3), mini-batch-CD (H=1), local-SGD (H=1e3), mini-batch-SGD (H=10)
Cov: COCOA (H=1e5), mini-batch-CD (H=100), local-SGD (H=1e5), batch-SGD (H=1)
RCV1: COCOA (H=1e5), mini-batch-CD (H=100), local-SGD (H=1e4), batch-SGD (H=100)
COCOA Take-Aways
A framework for distributed optimization
Uses state-of-the-art primal-dual methods
Reduces communication:
 - applies updates immediately
 - averages over # of machines
Strong empirical results & convergence guarantees

To appear in NIPS ’14
www.berkeley.edu/~vsmith