Sparse Optimization
Lecture: Parallel and Distributed Sparse Optimization
Instructor: Wotao Yin
July 2013
online discussions on piazza.com
Those who complete this lecture will know
• basics of parallel computing
• how to parallelize a bottleneck of an existing sparse optimization method
• primal and dual decomposition
• GRock: greedy coordinate-block descent, and its parallel implementation
1 / 74
Outline
1. Background of parallel computing: benefits, speedup, and overhead
2. Parallelize existing algorithms
3. Primal decomposition / dual decomposition
• Parallel dual gradient ascent (linearized Bregman)
• Parallel ADMM
4. Parallel greedy coordinate descent
5. Numerical results with big data
2 / 74
Serial computing
Traditional computing is serial
• a single CPU, one instruction is executed at a time
• a problem is broken into a series of instructions, executed one after another
[Diagram: one CPU executes the instructions t1, t2, . . . , tN of a problem one after another]
7 / 74
Parallel computing
• a problem is broken into concurrent parts, executed simultaneously,
coordinated by a controller
• uses multiple cores/CPUs/networked computers, or their combinations
• expected to solve a problem in less time, or a larger problem
[Diagram: a problem broken into concurrent parts t1, t2, . . . , tN, executed simultaneously on multiple CPUs]
8 / 74
Commercial applications
Examples: internet data mining, logistics/supply chain, finance, online advertising,
recommendation, sales data analysis, virtual reality, computer graphics (cartoon
movies), drug design, medical imaging, network video, ...
9 / 74
Applications in science and engineering
• simulation: many real-world events happen concurrently, interrelated
examples: galaxy, ocean, weather, traffic, assembly line, queues
• use in science/engineering: environment, physics, bioscience, chemistry,
geoscience, mechanical engineering, mathematics (computerized proofs),
defense, intelligence
10 / 74
Benefits of being parallel
• break single-CPU limits
• save time
• process big data
• access non-local resource
• handle concurrent data streams / enable collaboration
11 / 74
Parallel overhead
• computing overhead: start-up time, synchronization wait time, data
communication time, termination (data collection) time
• I/O overhead: slow read and write of non-local data
• algorithm overhead: extra variables, data duplication
• coding overhead: language, library, operating system, debugging, maintenance
12 / 74
Different parallel computers
• the basic element is like a single computer, a 70-year-old architecture
components: CPU (control/arithmetic units), memory, input/output
• SIMD: single instruction multiple data, GPU-like
applications: image filtering, PDE on grid, ...
• MISD: multiple instruction single data, rarely seen
conceivable applications: cryptography attack, global optimization
• MIMD: multiple instruction multiple data
most of today’s supercomputers, clusters, clouds, and multi-core PCs
13 / 74
Tech jargon
• HPC: high-performance computing
• node/CPU/processor/core
• shared memory: either all processors access a common physical memory,
or parallel tasks can directly address a common logical memory
• SMP: symmetric multiprocessing; multiple processors share a single memory
(shared-memory computing)
• distributed memory: parallel tasks see local memory and use
communication to access remote memory
• communication: exchange of data or synchronization controls; either
through shared memory or network
• synchronization: coordination, one processor awaits others
14 / 74
Tech jargon
• granularity: coarse – more work between communication events; fine –
less work between communication events
• Embarrassingly parallel: main tasks are independent, little need of
coordination
• speed scalability: how speedup can increase with additional resources
data scalability: how problem size can increase with additional resources
scalability affected by: problem model, algorithm, memory bandwidth,
network speed, parallel overhead
sometimes, adding more processors/memory may not save time!
• memory architectures: uniform or non-uniform shared, distributed, hybrid
• parallel models: shared memory, message passing, data parallel, SPMD,
MPMD, etc.
15 / 74
Parallel speedup
• definition:
speedup = serial time / parallel time
time is in the wall-clock sense
• Amdahl’s Law: assume infinite processors and no overhead
ideal max speedup = 1 / (1 − ρ)
where ρ = percentage of parallelized code, which can depend on input size
[Plot: ideal max speedup vs. parallel portion of code ρ]
16 / 74
Parallel speedup
• assume N processors and no overhead
ideal speedup = 1 / (ρ/N + (1 − ρ))
• ρ may also depend on N
[Plot: ideal speedup vs. number of processors, for parallel portions ρ = 25%, 50%, 90%, 95%]
17 / 74
Parallel speedup
• ε := parallel overhead
• in the real world
actual speedup = 1 / (ρ/N + (1 − ρ) + ε)
[Plots: actual speedup vs. number of processors for ρ = 25%, 50%, 90%, 95%; left: when ε = N, right: when ε = log(N)]
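A minimal Python sketch (not from the slides) that evaluates the three speedup models above; the overhead model for ε used below is an assumption for illustration only.

import math

def amdahl_limit(rho):
    """Ideal max speedup with infinitely many processors: 1 / (1 - rho)."""
    return 1.0 / (1.0 - rho)

def ideal_speedup(rho, N):
    """Ideal speedup with N processors: 1 / (rho/N + (1 - rho))."""
    return 1.0 / (rho / N + (1.0 - rho))

def actual_speedup(rho, N, eps):
    """Actual speedup with parallel overhead eps: 1 / (rho/N + (1 - rho) + eps)."""
    return 1.0 / (rho / N + (1.0 - rho) + eps)

if __name__ == "__main__":
    rho = 0.95                                       # 95% of the code is parallelized
    for N in (1, 8, 64, 512):
        eps = 1e-4 * math.log2(N) if N > 1 else 0.0  # assumed log-shaped overhead
        print(N, amdahl_limit(rho), ideal_speedup(rho, N), actual_speedup(rho, N, eps))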
18 / 74
Basic types of communication
Broadcast, Scatter, Gather, Reduce
[Diagrams of the four collectives; the Reduce example sums 1 + 2 + 3 + 3 = 9]
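These collectives map directly onto MPI calls. A minimal sketch using mpi4py (an assumed library choice, not prescribed by the slides); run with, e.g., mpiexec -n 4 python collectives.py:

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# broadcast: rank 0 sends the same value to everyone
x = comm.bcast(42 if rank == 0 else None, root=0)
# scatter: rank 0 hands one piece of a list to each rank
piece = comm.scatter(list(range(size)) if rank == 0 else None, root=0)
# gather: every rank sends its value to rank 0
gathered = comm.gather(rank, root=0)
# reduce: sum of everyone's value, delivered to rank 0
total = comm.reduce(rank, op=MPI.SUM, root=0)

if rank == 0:
    print(x, piece, gathered, total)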
19 / 74
Allreduce
Allreduce sum via butterfly communication
processors:        0  1  2  3  4  5  6  7
initial values:    2  3  6  4  8  1  7  5
after layer 1:     5  5 10 10  9  9 12 12
after layer 2:    15 15 15 15 21 21 21 21
after layer 3:    36 36 36 36 36 36 36 36
log(N) layers, N parallel communications per layer
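A serial Python simulation (a sketch, not MPI code) of the butterfly pattern: in layer s, processor p exchanges its partial sum with partner p XOR 2^s, so after log2(N) layers every processor holds the total.

def butterfly_allreduce(values):
    """Simulate allreduce-sum over N processors, N a power of two."""
    n = len(values)
    vals = list(values)
    s = 1
    while s < n:
        # every processor p adds in the value held by its partner p ^ s
        vals = [vals[p] + vals[p ^ s] for p in range(n)]
        s *= 2
    return vals

print(butterfly_allreduce([2, 3, 6, 4, 8, 1, 7, 5]))   # -> [36, 36, ..., 36]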
20 / 74
Additional important topics
• synchronization types: barrier, lock, synchronous communication
• load balancing: static or dynamic distribution of tasks/data so that all
processors stay busy at all times
• granularity: fine grain vs coarse grain
• I/O: a traditional parallelism killer, but modern databases and parallel file
systems alleviate the problem (e.g., the Hadoop Distributed File System
(HDFS))
21 / 74
Example: π, area computation
[Figure: points sampled uniformly in the unit square, with the inscribed disc of radius 1/2]
• box area is 1 × 1
• disc area is π/4
• ratio of areas ≈ fraction of points inside the disc
• π ≈ 4 · (# points in the disc) / (# all points)
22 / 74
Serial pseudo-code
1: total = 100000
2: in_disc = 0
3: for k = 1 to total do
4: x = a uniformly random point in [−1/2, 1/2]²
5: if x is in the disc then
6: in_disc++
7: end if
8: end for
9: return 4 * in_disc / total
23 / 74
Parallel π computation
[Figure: points sampled uniformly in the unit square, with the inscribed disc]
• each sample is independent of others
• use multiple processors to run the for-loop
• use SPMD; process #1 collects the results and returns π
24 / 74
SPMD pseudo-code
1: total = 100000
2: in_disc = 0
3: P = find total parallel processors
4: i = find my processor id
5: total = total/P
6: for k = 1 to total do
7: x ← a uniformly random point in [−1/2, 1/2]²
8: if x is in the disc then
9: in_disc++
10: end if
11: end for
12: if i == 1 then
13: total_in_disc = reduce(in_disc, SUM)
14: return 4 * total_in_disc / (total * P)
15: end if
• requires only one collective communication;
• speedup = 1 / (1/N + O(log(N))) = N / (1 + cN·log(N)), where c is a very small number;
• the main loop doesn’t require any synchronization.
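A runnable sketch of the SPMD pseudo-code using mpi4py (an assumed library choice). Note that reduce is a collective call, so every rank participates; only rank 0 receives the sum. Run with, e.g., mpiexec -n 8 python pi_mc.py:

import random
from mpi4py import MPI

comm = MPI.COMM_WORLD
P, i = comm.Get_size(), comm.Get_rank()

total = 100000 // P              # each process draws its share of samples
in_disc = 0
for _ in range(total):
    x = random.uniform(-0.5, 0.5)
    y = random.uniform(-0.5, 0.5)
    if x * x + y * y <= 0.25:    # inside the disc of radius 1/2
        in_disc += 1

total_in_disc = comm.reduce(in_disc, op=MPI.SUM, root=0)   # one collective communication
if i == 0:
    print("pi ~", 4.0 * total_in_disc / (total * P))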
25 / 74
Outline
1. Background of parallel computing: benefits, speedup, and overhead
2. Parallelize existing algorithms
3. Primal decomposition / dual decomposition
• Parallel dual gradient ascent (linearized Bregman)
• Parallel ADMM
4. Parallel greedy coordinate descent
5. Numerical results with big data
26 / 74
Decomposable objectives and constraints enable parallelism
• variables x = [x_1, x_2, . . . , x_N]^T
• embarrassingly parallel (not an interesting case): min ∑_i f_i(x_i)
• consensus minimization, no constraints:
min f(x) = ∑_{i=1}^N f_i(x) + r(x)
LASSO example: f(x) = (λ/2) ∑_i ‖A_(i)x − b_i‖²_2 + ‖x‖_1
• coupling variable z:
min_{x,z} f(x, z) = ∑_{i=1}^N [f_i(x_i, z) + r_i(x_i)] + r(z)
27 / 74
• coupling constraints:
min f(x) = ∑_{i=1}^N f_i(x_i) + r_i(x_i)
subject to
• equality constraints: ∑_{i=1}^N A_i x_i = b
• inequality constraints: ∑_{i=1}^N A_i x_i ≤ b
• graph-coupling constraints Ax = b, where A is sparse
• combinations of independent variables, coupling variables, and constraints
example:
min_{x,w,z} ∑_{i=1}^N f_i(x_i, z)
s.t. ∑_i A_i x_i ≤ b,  B_i x_i ≤ w ∀i,  w ∈ Ω
• variable coupling ←→ constraint coupling:
min ∑_{i=1}^N f_i(x_i, z)  −→  min ∑_{i=1}^N f_i(x_i, z_i)  s.t.  z_i = z ∀i
28 / 74
Approaches toward parallelism
Keep an existing serial algorithm, parallelize its expensive part(s)
Introduce a new parallel algorithm by
• formulating an equivalent model
• replacing the model by an approximate model
where the new model lends itself to parallel computing
29 / 74
Outline
1. Background of parallel computing: benefits, speedup, and overhead
2. Parallelize existing algorithms
3. Primal decomposition / dual decomposition
• Parallel dual gradient ascent (linearized Bregman)
• Parallel ADMM
4. Parallel greedy coordinate descent
5. Numerical results with big data
30 / 74
Example: parallel ISTA
minimize_x  λ‖x‖_1 + f(x)
Prox-linear approximation:
arg min_x  λ‖x‖_1 + ⟨∇f(x^k), x − x^k⟩ + (1/(2δ_k)) ‖x − x^k‖²_2
Iterative soft-thresholding algorithm:
x^{k+1} = shrink(x^k − δ_k ∇f(x^k), λδ_k)
• shrink(y, t) has a simple closed form: max(abs(y)-t,0).*sign(y)
• the dominating computation is ∇f(x)
examples: f(x) = (1/2)‖Ax − b‖²_2, logistic loss of Ax − b, hinge loss, ... where
A is often dense and can be very large-scale
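A NumPy sketch (an assumed implementation, not from the slides) of the shrink operator and a single ISTA step; grad_f stands for a user-supplied gradient of f:

import numpy as np

def shrink(y, t):
    """Soft-thresholding: max(|y| - t, 0) * sign(y), applied element-wise."""
    return np.maximum(np.abs(y) - t, 0.0) * np.sign(y)

def ista_step(x, grad_f, lam, delta):
    """One iterate x^{k+1} = shrink(x^k - delta * grad_f(x^k), lam * delta)."""
    return shrink(x - delta * grad_f(x), lam * delta)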
31 / 74
Parallel gradient computing of f(x) = g(Ax+ b): row partition
Assumption: A is big and dense; g(y) = ∑_i g_i(y_i) is decomposable
Then f(x) = g(Ax + b) = ∑_i g_i(A_(i)x + b_i)
• let A_(i) and b_i be the i-th row block of A and b, respectively
[Diagram: A partitioned into row blocks A_(1), A_(2), . . . , A_(M)]
• store A_(i), b_i on node i
• goal: compute ∇f(x) = ∑_i A_(i)^T ∇g_i(A_(i)x + b_i), or similarly ∂f(x)
32 / 74
Parallel gradient computing of f(x) = g(Ax+ b): row partition
Steps
1. compute A_(i)x + b_i ∀i in parallel;
2. compute g_i = ∇g_i(A_(i)x + b_i) ∀i in parallel;
3. compute A_(i)^T g_i ∀i in parallel;
4. allreduce ∑_{i=1}^M A_(i)^T g_i.
• steps 1, 2, and 3 are local computations and communication-free;
• step 4 requires an allreduce sum communication.
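A serial NumPy sketch of steps 1–4 for the assumed example g(y) = (1/2)‖y‖²; in a distributed run the final sum over i would be an allreduce, here it is an ordinary sum.

import numpy as np

def grad_row_partition(A_blocks, b_blocks, x):
    partials = []
    for A_i, b_i in zip(A_blocks, b_blocks):   # per node i, communication-free
        y_i = A_i @ x + b_i                    # step 1: A_(i) x + b_i
        g_i = y_i                              # step 2: grad g_i is the identity here
        partials.append(A_i.T @ g_i)           # step 3: A_(i)^T g_i
    return sum(partials)                       # step 4: allreduce-sum in a real run

# tiny usage example with M = 3 row blocks (sizes are assumptions)
rng = np.random.default_rng(0)
A, b, x = rng.standard_normal((6, 4)), rng.standard_normal(6), rng.standard_normal(4)
A_blocks, b_blocks = np.split(A, 3), np.split(b, 3)
assert np.allclose(grad_row_partition(A_blocks, b_blocks, x), A.T @ (A @ x + b))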
33 / 74
Parallel gradient computing of f(x) = g(Ax+ b): column partition
[Diagram: A partitioned into column blocks A_1, A_2, . . . , A_N]
Ax = ∑_{i=1}^N A_i x_i  =⇒  f(x) = g(Ax + b) = g(∑_i A_i x_i + b)
∇f(x) = [A_1^T ∇g(Ax + b); A_2^T ∇g(Ax + b); . . . ; A_N^T ∇g(Ax + b)], similarly for ∂f(x)
34 / 74
Parallel gradient computing of f(x) = g(Ax + b): column partition
Steps
1. compute A_i x_i + (1/N) b ∀i in parallel;
2. allreduce Ax + b = ∑_{i=1}^N (A_i x_i + (1/N) b);
3. evaluate ∇g(Ax + b) in each process;
4. compute A_i^T ∇g(Ax + b) ∀i in parallel.
• each processor keeps A_i and x_i, as well as a copy of b;
• step 2 requires an allreduce sum communication;
• step 3 involves duplicated computation in each process (hopefully, g is
simple).
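The analogous serial NumPy sketch for the column partition, again with the assumed example g(y) = (1/2)‖y‖²; the sum in step 2 stands in for the allreduce, and every "process" evaluates ∇g on the full vector.

import numpy as np

def grad_col_partition(A_blocks, x_blocks, b):
    N = len(A_blocks)
    parts = [A_i @ x_i + b / N for A_i, x_i in zip(A_blocks, x_blocks)]  # step 1
    Axb = sum(parts)                            # step 2: allreduce of Ax + b in a real run
    g = Axb                                     # step 3: grad g(Ax + b), duplicated per process
    return [A_i.T @ g for A_i in A_blocks]      # step 4: block i of the gradient

# tiny usage example with N = 3 column blocks (sizes are assumptions)
rng = np.random.default_rng(1)
A, x, b = rng.standard_normal((5, 6)), rng.standard_normal(6), rng.standard_normal(5)
A_blocks, x_blocks = np.split(A, 3, axis=1), np.split(x, 3)
assert np.allclose(np.concatenate(grad_col_partition(A_blocks, x_blocks, b)),
                   A.T @ (A @ x + b))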
35 / 74
Two-way partition
[Diagram: A partitioned two ways into M × N blocks A_{i,j}, i = 1, . . . , M, j = 1, . . . , N]
f(x) = ∑_i g_i(A_(i)x + b_i) = ∑_i g_i(∑_j A_{i,j} x_j + b_i)
∇f(x) = ∑_i A_(i)^T ∇g_i(A_(i)x + b_i) = ∑_i A_(i)^T ∇g_i(∑_j A_{i,j} x_j + b_i)
36 / 74
Two-way partition
Steps
1. compute A_{i,j} x_j + (1/N) b_i for all (i, j) in parallel;
2. allreduce ∑_{j=1}^N A_{i,j} x_j + b_i for every i;
3. compute g_i = ∇g_i(A_(i)x + b_i) ∀i in parallel;
4. compute A_{i,j}^T g_i for all (i, j) in parallel;
5. allreduce ∑_{i=1}^M A_{i,j}^T g_i for every j.
• processor (i, j) keeps A_{i,j}, x_j, and b_i;
• steps 2 and 5 require parallel allreduce sum communications;
37 / 74
Example: parallel ISTA for LASSO
LASSO
min f(x) = λ‖x‖_1 + (1/2)‖Ax − b‖²_2
algorithm
x^{k+1} = shrink(x^k − δ_k A^T A x^k + δ_k A^T b, λδ_k)
Serial pseudo-code
1: initialize x = 0 and δ;
2: pre-compute A^T b
3: while not converged do
4: g = A^T A x − A^T b
5: x = shrink(x − δg, λδ);
6: end while
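A runnable NumPy version of the serial pseudo-code (a sketch; the step size δ = 1/‖A‖²_2 and the iteration count are assumed choices):

import numpy as np

def shrink(y, t):
    return np.maximum(np.abs(y) - t, 0.0) * np.sign(y)

def ista_lasso(A, b, lam, n_iter=500):
    delta = 1.0 / np.linalg.norm(A, 2) ** 2   # step size 1 / ||A||_2^2 (assumed)
    x = np.zeros(A.shape[1])
    Atb = A.T @ b                             # pre-compute A^T b
    for _ in range(n_iter):
        g = A.T @ (A @ x) - Atb               # g = A^T A x - A^T b
        x = shrink(x - delta * g, lam * delta)
    return x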
38 / 74
Example: parallel ISTA for LASSO
distribute blocks of rows to M nodes
[Diagram: A partitioned into row blocks A_(1), A_(2), . . . , A_(M)]
1: initialize x = 0 and δ;
2: i = find my processor id
3: processor i loads A_(i)
4: pre-compute δA^T b
5: while not converged do
6: processor i computes c_i = A_(i)^T A_(i) x
7: c = allreduce(c_i, SUM)
8: x = shrink(x − δc + δA^T b, λδ);
9: end while
• assume A ∈ R^{m×n}
• one allreduce per iteration
• speedup ≈ 1 / (ρ/M + (1 − ρ) + O(log(M)))
• ρ is close to 1
• requires synchronization
39 / 74
Example: parallel ISTA for LASSO
distribute blocks of columns to N nodes
[Diagram: A partitioned into column blocks A_1, A_2, . . . , A_N]
1: initialize x = 0 and δ;
2: i = find my processor id
3: processor i loads A_i
4: pre-compute δA_i^T b
5: while not converged do
6: processor i computes y_i = A_i x_i
7: y = allreduce(y_i, SUM)
8: processor i computes z_i = A_i^T y
9: x_i^{k+1} = shrink(x_i^k − δz_i + δA_i^T b, λδ);
10: end while
• one allreduce per iteration
• speedup ≈ 1 / (ρ/N + (1 − ρ) + O((m/n) log(N)))
• ρ is close to 1
• requires synchronization
• if m ≪ n, this approach is faster
40 / 74
Outline
1. Background of parallel computing: benefits, speedup, and overhead
2. Parallelize existing algorithms
3. Primal decomposition / dual decomposition3
• Parallel dual gradient ascent (linearized Bregman)
• Parallel ADMM
4. Parallel greedy coordinate descent
5. Numerical results with big data
3Dantzig and Wolfe [1960], Spingarn [1985], Bertsekas and Tsitsiklis [1997]
41 / 74
Unconstrained minimization with coupling variables
x = [x_1, x_2, . . . , x_N]^T.
z is the coupling/bridging/complicating variable.
min_{x,z} f(x, z) = ∑_{i=1}^N f_i(x_i, z)
equivalently, a separable objective with coupling constraints:
min_{x,z} ∑_{i=1}^N f_i(x_i, z_i)  s.t.  z_i = z ∀i
Examples
• network utility maximization (NUM), extended to multiple layers4
• domain decomposition (z are overlapping boundary variables)
• systems of equations: A is block-diagonal but with a few dense columns
4see tutorial: Palomar and Chiang [2006]
42 / 74
Primal decomposition (bi-level optimization)
Introduce easier subproblems
g_i(z) = min_{x_i} f_i(x_i, z),  i = 1, . . . , N
then
min_{x,z} ∑_{i=1}^N f_i(x_i, z)  ⇐⇒  min_z ∑_i g_i(z)
the RHS is called the master problem
parallel computing: iterate
1. fix z, update x_i and compute g_i(z) in parallel
2. update z using a standard method (typically, subgradient descent)
• good for small-dimensional z
• fast if the master problem converges fast
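A minimal sketch of this scheme on an assumed toy problem f_i(x_i, z) = (1/2)(x_i − c_i)² + (1/2)(x_i − z)², whose subproblems have the closed form x_i = (c_i + z)/2 with g_i(z) = (1/4)(z − c_i)², and whose master problem is solved by gradient descent:

import numpy as np

c = np.array([1.0, 4.0, -2.0, 3.0])      # assumed data defining each f_i
z = 0.0
for k in range(200):
    x = (c + z) / 2.0                    # step 1: solve the subproblems (parallel in principle)
    grad_master = np.sum((z - c) / 2.0)  # gradient of the master objective sum_i g_i(z)
    z -= 0.1 * grad_master               # step 2: master update (gradient step)

print(z, x)                              # z tends to mean(c); each x_i to (c_i + z)/2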
43 / 74
Dual decomposition
Consider
min_{x,z} ∑_i f_i(x_i, z_i)   s.t.  ∑_i A_i z_i = b
Introduce
g_i(z) = min_{x_i} f_i(x_i, z)
g_i*(y) = max_z  y^T z − g_i(z)   // convex conjugate
Dualization:
⇔ min_{x,z} max_y ∑_i f_i(x_i, z_i) + y^T (b − ∑_i A_i z_i)
(generalized minimax thm.) ⇔ max_y min_{x_i,z_i} ∑_i (f_i(x_i, z_i) − y^T A_i z_i) + y^T b
⇔ max_y min_{z_i} ∑_i (g_i(z_i) − y^T A_i z_i) + y^T b
⇔ min_y ∑_i g_i*(A_i^T y) − y^T b
where g_i(z) = min_{x_i} f_i(x_i, z) and g_i*(y) = max_z {y^T z − g_i(z)} is the convex
conjugate of g_i.
44 / 74
Example: consensus and exchange problems
Consensus problem
min_z ∑_i g_i(z)
Reformulate as
min ∑_i g_i(z_i)  s.t.  z_i = z ∀i
Its dual problem is the exchange problem
min_y ∑_i g_i*(y_i)  s.t.  ∑_i y_i = 0
45 / 74
Primal-dual objective properties
Constrained separable problem
min_z ∑_i g_i(z_i)  s.t.  ∑_i A_i z_i = b
Dual problem
min_y ∑_i g_i*(A_i^T y) − b^T y
If g_i is proper, closed, and convex, g_i*(a_i^T y) is sub-differentiable
If g_i is strictly convex, g_i*(a_i^T y) is differentiable
If g_i is strongly convex, g_i*(a_i^T y) is differentiable and has a Lipschitz gradient
Obtaining z* from y* generally requires strict convexity of g_i or strict
complementary slackness
Examples:
• g_i(z_i) = |z_i| and g_i*(a_i^T y) = ι{−1 ≤ a_i^T y ≤ 1}
• g_i(z_i) = (1/2)|z_i|² and g_i*(a_i^T y) = (1/2)|a_i^T y|²
• g_i(z_i) = exp(z_i) and g_i*(a_i^T y) = (a_i^T y) ln(a_i^T y) − a_i^T y + ι{a_i^T y ≥ 0}
46 / 74
Outline
1. Background of parallel computing: benefits, speedup, and overhead
2. Parallelize existing algorithms
3. Primal decomposition / dual decomposition
• Parallel dual gradient ascent (linearized Bregman5)
• Parallel ADMM
4. Parallel greedy coordinate descent
5. Numerical results with big data
5Yin, Osher, Goldfarb, and Darbon [2008]
47 / 74
Example: augmented `1
Change |z_i| into the strongly convex g_i(z_i) = |z_i| + (1/(2α))|z_i|²
Dual problem
min_y ∑_i g_i*(A_i^T y) − b^T y = ∑_i (α/2)|shrink(A_i^T y)|² − b^T y,   with h_i(y) := (α/2)|shrink(A_i^T y)|²
Dual gradient iteration (equivalent to linearized Bregman):
y^{k+1} = y^k − t_k (∑_i ∇h_i(y^k) − b),   where ∇h_i(y) = α A_i shrink(A_i^T y).
The dual is C¹ and unconstrained. One can apply (accelerated) gradient
descent, line search, quasi-Newton methods, etc.
Recover x_i* = α shrink(A_i^T y*), i.e., x* = α shrink(A^T y*).
Easy to parallelize. Node i computes ∇h_i(y). One collective
communication per iteration.
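A serial NumPy sketch of this dual gradient iteration (step size, α, and iteration count are assumptions); in a distributed run each node i would compute ∇h_i(y) and one allreduce would sum them:

import numpy as np

def shrink(y, t=1.0):
    return np.maximum(np.abs(y) - t, 0.0) * np.sign(y)

def dual_gradient_bp(A, b, alpha=5.0, n_iter=2000):
    t = 1.0 / (alpha * np.linalg.norm(A, 2) ** 2)     # assumed step size (shrink is 1-Lipschitz)
    y = np.zeros(A.shape[0])
    for _ in range(n_iter):
        grad = alpha * (A @ shrink(A.T @ y)) - b      # sum_i grad h_i(y) - b
        y -= t * grad
    return alpha * shrink(A.T @ y), y                 # recover x*, return y* as well

# usage: for A (m x n), b = A @ x_sparse, the returned x is approximately the
# basis pursuit solution when alpha is large enough (exact regularization).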
48 / 74
Outline
1. Background of parallel computing: benefits, speedup, and overhead
2. Parallelize existing algorithms
3. Primal decomposition / dual decomposition
• Parallel dual gradient ascent (linearized Bregman)
• Parallel ADMM8
4. Parallel greedy coordinate descent
5. Numerical results with big data
8Bertsekas and Tsitsiklis [1997]
56 / 74
Alternating direction method of multipliers (ADMM)
step 1: turn the problem into the form
min_{x,y} f(x) + g(y)
s.t. Ax + By = b.
f and g are convex, maybe nonsmooth, and can incorporate constraints
step 2: iterate
1. x^{k+1} ← arg min_x f(x) + g(y^k) + (β/2)‖Ax + By^k − b − z^k‖²_2,
2. y^{k+1} ← arg min_y f(x^{k+1}) + g(y) + (β/2)‖Ax^{k+1} + By − b − z^k‖²_2,
3. z^{k+1} ← z^k − (Ax^{k+1} + By^{k+1} − b).
history: dates back to the 1960s, formalized in the 1980s, parallel versions appeared in the
late 1980s, revived very recently, with new convergence results appearing recently
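A minimal sketch instantiating the three steps above for LASSO, written as min (1/2)‖Dx − c‖²_2 + λ‖y‖_1 s.t. x − y = 0 (so A = I, B = −I, b = 0 in the generic form; the names D, c and the parameter choices are assumptions, not the slides' notation):

import numpy as np

def shrink(v, t):
    return np.maximum(np.abs(v) - t, 0.0) * np.sign(v)

def admm_lasso(D, c, lam=0.1, beta=1.0, n_iter=300):
    n = D.shape[1]
    x, y, z = np.zeros(n), np.zeros(n), np.zeros(n)
    DtD, Dtc = D.T @ D, D.T @ c
    L = np.linalg.cholesky(DtD + beta * np.eye(n))            # factor once, reuse
    for _ in range(n_iter):
        rhs = Dtc + beta * (y + z)
        x = np.linalg.solve(L.T, np.linalg.solve(L, rhs))     # step 1: x-update
        y = shrink(x - z, lam / beta)                         # step 2: y-update
        z = z - (x - y)                                       # step 3: z-update
    return y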
57 / 74
ADMM and parallel/distributed computing
Two approaches:
• parallelize the serial algorithm(s) for ADMM subproblem(s),
• turn problem into a parallel-ready form9
min_{x,y} ∑_i f_i(x_i) + g(y)
s.t.  A_i x_i + B_i y = b_i,  i = 1, . . . , M
(i.e., block-diagonal A = diag(A_1, . . . , A_M), stacked B = [B_1; . . . ; B_M] and b = [b_1; . . . ; b_M])
consequently, computing x^{k+1} reduces to parallel subproblems
x_i^{k+1} ← arg min_{x_i} f_i(x_i) + (β/2)‖A_i x_i + B_i y^k − b_i − z_i^k‖²_2,  i = 1, . . . , M
Issue: to have a block-diagonal A and a simple B, additional variables are often
needed, thus increasing the problem size and parallel overhead.
9a recent survey Boyd, Parikh, Chu, Peleato, and Eckstein [2011]
58 / 74
Parallel Linearized Bregman vs ADMM on basis pursuit
Basis pursuit:
min_x ‖x‖_1  s.t.  Ax = b.
• example with dense A ∈ R1024×2048;
• distribute A to N computing nodes
A = [A1 A2 · · · AN ];
• compare
1. parallel ADMM in the survey paper Boyd, Parikh, Chu, Peleato, and
Eckstein [2011];
2. parallel linearized Bregman (dual gradient descent), un-accelerated.
• tested N = 1, 2, 4, . . . , 32 computing nodes;
59 / 74
parallel ADMM vs. parallel linearized Bregman
parallel linearized Bregman scales much better
60 / 74
Outline
1. Background of parallel computing: benefits, speedup, and overhead
2. Parallelize existing algorithms
3. Primal decomposition / dual decomposition
• Parallel dual gradient ascent (linearized Bregman)
• Parallel ADMM
4. Parallel greedy coordinate descent
5. Numerical results with big data
61 / 74
(Block)-coordinate descent/update
Definition: update one block of variables each time, keeping others fixed
Advantage: update is simple
Disadvantages:
• more iterations (with exceptions)
• may get stuck at non-stationary points if the problem is non-convex or
non-smooth (with exceptions)
Block selection: cycle (Gauss-Seidel), parallel (Jacobi), random, greedy
Specific to sparse optimization: the greedy approach10 can be exceptionally
effective (because most of the time the correct variables are selected for update; most zero
variables are never touched, “reducing” the problem to a much smaller one.)
10Li and Osher [2009]
62 / 74
GRock: greedy coordinate-block descent
Example:
min_x λ‖x‖_1 + f(Ax − b)
decompose Ax = ∑_j A_j x_j; block A_j and x_j are kept on node j
Parallel GRock:
1. (parallel) compute a merit value for each coordinate i:
d_i = arg min_d λ·r(x_i + d) + g_i d + (1/2)d²,  where g_i = a_i^T ∇f(Ax − b) and a_i is the i-th column of A
2. (parallel) compute a merit value for each block j:
m_j = max{|d| : d is an element of d_j};
let s_j be the index of the maximal coordinate within block j
3. (allreduce) select the P blocks with the largest m_j, 2 ≤ P ≤ N
4. (parallel) update x_{s_j} ← x_{s_j} + d_{s_j} for each selected block j
5. (allreduce) update Ax
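A serial NumPy sketch of one GRock iteration for min λ‖x‖_1 + (1/2)‖Ax − b‖²_2 (an assumed choice of f, with unit-norm columns so the per-coordinate quadratic is (1/2)d²); the allreduce steps become ordinary array operations here:

import numpy as np

def shrink(v, t):
    return np.maximum(np.abs(v) - t, 0.0) * np.sign(v)

def grock_step(A, b, x, Ax, lam, blocks, P):
    g = A.T @ (Ax - b)                           # step 1: g_i = a_i^T grad f(Ax - b)
    d = shrink(x - g, lam) - x                   #         closed-form coordinate merits d_i
    m = np.array([np.max(np.abs(d[blk])) for blk in blocks])   # step 2: block merits m_j
    s = [blk[np.argmax(np.abs(d[blk]))] for blk in blocks]     #         best coordinate s_j
    chosen = np.argsort(-m)[:P]                  # step 3: the P blocks with largest m_j
    for j in chosen:                             # step 4: update the selected coordinates
        x[s[j]] += d[s[j]]
        Ax += d[s[j]] * A[:, s[j]]               # step 5: maintain Ax incrementally
    return x, Ax

# usage (assumed setup): blocks = np.array_split(np.arange(A.shape[1]), N); Ax = A @ x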
63 / 74
How large can P be?
P depends on the block spectral radius
ρ_P = max_{M ∈ 𝓜} ρ(M),
where 𝓜 is the set of all P × P submatrices that we can obtain from A^T A,
corresponding to selecting exactly one column from each of the P blocks
Theorem
Assume each column of A has unit 2-norm. If ρ_P < 2, GRock with P parallel
updates per iteration gives
F(x + d) − F(x) ≤ ((ρ_P − 2) / (2β)) ‖d‖²_2;
moreover,
F(x^k) − F(x*) ≤ (2C² (2L + β√(NP))²) / ((2 − ρ_P) β) · (1/k).
64 / 74
Compare different block selection rules
[Plots: f − f_min and relative error vs. normalized # coordinates calculated, comparing Cyclic CD, Greedy CD, Mixed CD, and GRock]
• LASSO test with A ∈ R512×1024, N = 64 column blocks
• GRock uses P = 8 updates each iteration
• greedy CD11 uses P = 1
• mixed CD12 selects P = 8 random blocks and best coordinate in each
iteration
11Li and Osher [2009]
12Scherrer, Tewari, Halappanavar, and Haglin [2012]
65 / 74
Outline
1. Background of parallel computing: benefits, speedup, and overhead
2. Parallelize existing algorithms
3. Primal decomposition / dual decomposition
• Parallel dual gradient ascent (linearized Bregman)
• Parallel ADMM
4. Parallel greedy coordinate descent
5. Numerical results with big data
66 / 74
Test on Rice cluster STIC
specs:
• 170 Appro Greenblade E5530 nodes, each with two quad-core 2.4GHz Xeon
(Nehalem) CPUs
• each node has 12GB of memory shared by all 8 cores
• # of processes used on each node = 8
test dataset:
             A type     A size        λ      sparsity of x*
dataset I    Gaussian   1024 × 2048   0.1    100
dataset II   Gaussian   2048 × 4096   0.01   200
67 / 74
iterations vs cores
[Plots: number of iterations vs. number of cores for p D-ADMM, p FISTA, and GRock; left: dataset I, right: dataset II]
Note: we set P = number of cores
68 / 74
time vs cores
[Plots: wall-clock time (s) vs. number of cores for p D-ADMM, p FISTA, and GRock; left: dataset I (annotation: P = 16), right: dataset II]
Note: we set P = number of cores
69 / 74
(% communication time) vs cores
[Plots: communication time percentage vs. number of cores for p D-ADMM, p FISTA, and GRock; left: dataset I, right: dataset II]
Note: we set P = number of cores
70 / 74
big data LASSO on Amazon EC2
Amazon EC2 is an elastic, pay-as-you-use cluster;
advantage: no hardware investment, everyone can have an account
test dataset
• A: dense matrix with 20 billion entries, 170GB in size
• x: 200K entries, 4K nonzeros, Gaussian values
requested system
• 20 “high-memory quadruple extra-large instances”
• each instance has 8 cores and 60GB memory
code
• written in C using GSL (for matrix-vector multiplication) and MPI
71 / 74
parallel Dual-ADMM vs FISTA13 vs GRock
                              p D-ADMM   p FISTA   GRock
estimate stepsize (min.)      n/a        1.6       n/a
matrix factorization (min.)   51         n/a       n/a
iteration time (min.)         105        40        1.7
number of iterations          2500       2500      104
communication time (min.)     30.7       9.5       0.5
stopping relative error       1E-1       1E-3      1E-5
total time (min.)             156        41.6      1.7
Amazon charge                 $85        $22.6     $0.93
• ADMM’s performance depends on the penalty parameter β;
we picked β as the best out of only a few trials (we cannot afford more)
• parallel Dual ADMM and FISTA were capped at 2500 iterations
• GRock used adaptive P and stopped at relative error 1E-5
13Beck and Teboulle [2009]
72 / 74
Software code can be found on the instructor’s website.
Acknowledgements: Zhimin Peng (Rice), Ming Yan (UCLA). Also, Qing
Ling (USTC) and Zaiwen Wen (SJTU).
References:
B. K. Natarajan. Sparse approximate solutions to linear systems. SIAM Journal on
Computing, 24:227–234, 1995.
D. Donoho and M. Elad. Optimally sparse representation in general (nonorthogonal)
dictionaries via ℓ1 minimization. Proceedings of the National Academy of Sciences,
100:2197–2202, 2003.
D.L. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):
1289–1306, 2006.
E.J. Candès, J. Romberg, and T. Tao. Robust uncertainty principles: Exact signal
reconstruction from highly incomplete frequency information. IEEE Transactions on
Information Theory, 52(2):489–509, 2006.
G.B. Dantzig and P. Wolfe. Decomposition principle for linear programs. Operations
Research, 8:101–111, 1960.
J.E. Spingarn. Applications of the method of partial inverses to convex programming:
decomposition. Mathematical Programming, 32(2):199–223, 1985.
73 / 74
D. Bertsekas and J. Tsitsiklis. Parallel and Distributed Computation: Numerical
Methods, Second Edition. Athena Scientific, 1997.
D.P. Palomar and M. Chiang. A tutorial on decomposition methods for network utility
maximization. IEEE Journal on Selected Areas in Communications, 24(8):
1439–1451, 2006.
W. Yin, S. Osher, D. Goldfarb, and J. Darbon. Bregman iterative algorithms for
l1-minimization with applications to compressed sensing. SIAM Journal on Imaging
Sciences, 1(1):143–168, 2008.
M.P. Friedlander and P. Tseng. Exact regularization of convex programs. SIAM
Journal on Optimization, 18(4):1326–1350, 2007.
W. Yin. Analysis and generalizations of the linearized Bregman method. SIAM
Journal on Imaging Sciences, 3(4):856–877, 2010.
S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and
statistical learning via the alternating direction method of multipliers. Foundations
and Trends in Machine Learning, 3(1):1–122, 2011.
Y. Li and S. Osher. Coordinate descent optimization for `1 minimization with
application to compressed sensing: a greedy algorithm. Inverse Problems and
Imaging, 3(3):487–503, 2009.
C. Scherrer, A. Tewari, M. Halappanavar, and D. Haglin. Feature clustering for
accelerating parallel coordinate descent. In NIPS, pages 28–36, 2012.
A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear
inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
74 / 74