Sparse Optimization
Lecture: Parallel and Distributed Sparse Optimization
Instructor: Wotao Yin
July 2013
online discussions on piazza.com
Those who complete this lecture will know
• basics of parallel computing
• how to parallelize a bottleneck of an existing sparse optimization method
• primal and dual decomposition
• GRock: greedy coordinate-block descent, and its parallel implementation
1 / 74
Outline
1. Background of parallel computing: benefits, speedup, and overhead
2. Parallelize existing algorithms
3. Primal decomposition / dual decomposition
• Parallel dual gradient ascent (linearized Bregman)
• Parallel ADMM
4. Parallel greedy coordinate descent
5. Numerical results with big data
2 / 74
Serial computing
Traditional computing is serial
• a single CPU, one instruction is executed at a time
• a problem is broken into a series of instructions, executed one after another
[Diagram: one CPU executes the instructions t1, t2, . . . , tN of a problem one after another]
7 / 74
Parallel computing
• a problem is broken into concurrent parts, executed simultaneously,
coordinated by a controller
• uses multiple cores/CPUs/networked computers, or their combinations
• expected to solve a problem in less time, or a larger problem
[Diagram: a problem broken into concurrent parts t1, t2, . . . , tN, executed simultaneously on multiple CPUs]
8 / 74
Commercial applications
Examples: internet data mining, logistics/supply chain, finance, online advertising,
recommendation, sales data analysis, virtual reality, computer graphics (cartoon
movies), drug design, medical imaging, network video, ...
9 / 74
Applications in science and engineering
• simulation: many real-world events happen concurrently, interrelated
examples: galaxy, ocean, weather, traffic, assembly line, queues
• use in science/engineering: environment, physics, bioscience, chemistry,
geoscience, mechanical engineering, mathematics (computerized proofs),
defense, intelligence
10 / 74
Benefits of being parallel
• break single-CPU limits
• save time
• process big data
• access non-local resource
• handle concurrent data streams / enable collaboration
11 / 74
Parallel overhead
• computing overhead: start-up time, synchronization wait time, data
communication time, termination (data collection) time
• I/O overhead: slow read and write of non-local data
• algorithm overhead: extra variables, data duplication
• coding overhead: language, library, operating system, debugging, maintenance
12 / 74
Different parallel computers
• the basic element is like a single computer, a 70-year-old architecture
components: CPU (control/arithmetic units), memory, input/output
• SIMD: single instruction multiple data, GPU-like
applications: image filtering, PDE on grid, ...
• MISD: multiple instruction single data, rarely seen
conceivable applications: cryptography attack, global optimization
• MIMD: multiple instruction multiple data
most of today’s supercomputers, clusters, clouds, and multi-core PCs
13 / 74
Tech jargon
• HPC: high-performance computing
• node/CPU/processor/core
• shared memory: either all processors access a common physical memory,
or parallel tasks can directly address a common logical memory
• SMP: symmetric multiprocessing; multiple processors share a single memory
(shared-memory computing)
• distributed memory: parallel tasks see local memory and use
communication to access remote memory
• communication: exchange of data or synchronization controls; either
through shared memory or network
• synchronization: coordination, one processor awaits others
14 / 74
Tech jargon
• granularity: coarse – more work between communication events; fine –
less work between communication events
• Embarrassingly parallel: main tasks are independent, little need of
coordination
• speed scalability: how speedup can increase with additional resources
data scalability: how problem size can increase with additional resources
scalability affected by: problem model, algorithm, memory bandwidth,
network speed, parallel overhead
sometimes, adding more processors/memory may not save time!
• memory architectures: uniform or non-uniform shared, distributed, hybrid
• parallel models: shared memory, message passing, data parallel, SPMD,
MPMD, etc.
15 / 74
Parallel speedup
• definition:
speedup = serial time / parallel time
time is in the wall-clock sense
• Amdahl’s Law: assume infinite processors and no overhead
ideal max speedup = 1 / (1 − ρ)
where ρ = percentage of parallelized code, which can depend on input size
[Plot: ideal max speedup vs. parallel portion of code ρ]
16 / 74
Parallel speedup
• assume N processors and no overhead
ideal speedup = 1 / (ρ/N + (1 − ρ))
• ρ may also depend on N
[Plot: ideal speedup vs. number of processors, for parallel portions ρ = 25%, 50%, 90%, 95%]
17 / 74
Parallel speedup
• ε := parallel overhead
• in the real world
actual speedup = 1 / (ρ/N + (1 − ρ) + ε)
[Plots: actual speedup vs. number of processors for ρ = 25%, 50%, 90%, 95%; left: when ε = N, right: when ε = log(N)]
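A minimal Python sketch (not from the slides) that evaluates the three speedup models above; the overhead model for ε used below is an assumption for illustration only.

import math

def amdahl_limit(rho):
    """Ideal max speedup with infinitely many processors: 1 / (1 - rho)."""
    return 1.0 / (1.0 - rho)

def ideal_speedup(rho, N):
    """Ideal speedup with N processors: 1 / (rho/N + (1 - rho))."""
    return 1.0 / (rho / N + (1.0 - rho))

def actual_speedup(rho, N, eps):
    """Actual speedup with parallel overhead eps: 1 / (rho/N + (1 - rho) + eps)."""
    return 1.0 / (rho / N + (1.0 - rho) + eps)

if __name__ == "__main__":
    rho = 0.95                                       # 95% of the code is parallelized
    for N in (1, 8, 64, 512):
        eps = 1e-4 * math.log2(N) if N > 1 else 0.0  # assumed log-shaped overhead
        print(N, amdahl_limit(rho), ideal_speedup(rho, N), actual_speedup(rho, N, eps))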
18 / 74
Basic types of communication
Broadcast, Scatter, Gather, Reduce
[Diagrams of the four collectives; the Reduce example sums 1 + 2 + 3 + 3 = 9]
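These collectives map directly onto MPI calls. A minimal sketch using mpi4py (an assumed library choice, not prescribed by the slides); run with, e.g., mpiexec -n 4 python collectives.py:

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# broadcast: rank 0 sends the same value to everyone
x = comm.bcast(42 if rank == 0 else None, root=0)
# scatter: rank 0 hands one piece of a list to each rank
piece = comm.scatter(list(range(size)) if rank == 0 else None, root=0)
# gather: every rank sends its value to rank 0
gathered = comm.gather(rank, root=0)
# reduce: sum of everyone's value, delivered to rank 0
total = comm.reduce(rank, op=MPI.SUM, root=0)

if rank == 0:
    print(x, piece, gathered, total)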
19 / 74
Allreduce
Allreduce sum via butterfly communication
processors:        0  1  2  3  4  5  6  7
initial values:    2  3  6  4  8  1  7  5
after layer 1:     5  5 10 10  9  9 12 12
after layer 2:    15 15 15 15 21 21 21 21
after layer 3:    36 36 36 36 36 36 36 36
log(N) layers, N parallel communications per layer
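A serial Python simulation (a sketch, not MPI code) of the butterfly pattern: in layer s, processor p exchanges its partial sum with partner p XOR 2^s, so after log2(N) layers every processor holds the total.

def butterfly_allreduce(values):
    """Simulate allreduce-sum over N processors, N a power of two."""
    n = len(values)
    vals = list(values)
    s = 1
    while s < n:
        # every processor p adds in the value held by its partner p ^ s
        vals = [vals[p] + vals[p ^ s] for p in range(n)]
        s *= 2
    return vals

print(butterfly_allreduce([2, 3, 6, 4, 8, 1, 7, 5]))   # -> [36, 36, ..., 36]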
20 / 74
Additional important topics
• synchronization types: barrier, lock, synchronous communication
• load balancing: static or dynamic distribution of tasks/data so that all
processors stay busy at all times
• granularity: fine grain vs coarse grain
• I/O: a traditional parallelism killer, but modern databases and parallel file
systems alleviate the problem (e.g., the Hadoop Distributed File System
(HDFS))
21 / 74
Example: π, area computation
[Figure: points sampled uniformly in the unit square, with the inscribed disc of radius 1/2]
• box area is 1 × 1
• disc area is π/4
• ratio of areas ≈ fraction of points inside the disc
• π ≈ 4 · (# points in the disc) / (# all points)
22 / 74
Serial pseudo-code
1: total = 100000
2: in_disc = 0
3: for k = 1 to total do
4: x = a uniformly random point in [−1/2, 1/2]²
5: if x is in the disc then
6: in_disc++
7: end if
8: end for
9: return 4 * in_disc / total
23 / 74
Parallel π computation
[Figure: points sampled uniformly in the unit square, with the inscribed disc]
• each sample is independent of others
• use multiple processors to run the for-loop
• use SPMD; process #1 collects the results and returns π
24 / 74
SPMD pseudo-code
1: total = 100000
2: in_disc = 0
3: P = find total parallel processors
4: i = find my processor id
5: total = total/P
6: for k = 1 to total do
7: x ← a uniformly random point in [−1/2, 1/2]²
8: if x is in the disc then
9: in_disc++
10: end if
11: end for
12: if i == 1 then
13: total_in_disc = reduce(in_disc, SUM)
14: return 4 * total_in_disc / (total * P)
15: end if
• requires only one collective communication;
• speedup = 1 / (1/N + O(log(N))) = N / (1 + cN·log(N)), where c is a very small number;
• the main loop doesn’t require any synchronization.
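A runnable sketch of the SPMD pseudo-code using mpi4py (an assumed library choice). Note that reduce is a collective call, so every rank participates; only rank 0 receives the sum. Run with, e.g., mpiexec -n 8 python pi_mc.py:

import random
from mpi4py import MPI

comm = MPI.COMM_WORLD
P, i = comm.Get_size(), comm.Get_rank()

total = 100000 // P              # each process draws its share of samples
in_disc = 0
for _ in range(total):
    x = random.uniform(-0.5, 0.5)
    y = random.uniform(-0.5, 0.5)
    if x * x + y * y <= 0.25:    # inside the disc of radius 1/2
        in_disc += 1

total_in_disc = comm.reduce(in_disc, op=MPI.SUM, root=0)   # one collective communication
if i == 0:
    print("pi ~", 4.0 * total_in_disc / (total * P))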
25 / 74
Outline
1. Background of parallel computing: benefits, speedup, and overhead
2. Parallelize existing algorithms
3. Primal decomposition / dual decomposition
• Parallel dual gradient ascent (linearized Bregman)
• Parallel ADMM
4. Parallel greedy coordinate descent
5. Numerical results with big data
26 / 74
Decomposable objectives and constraints enable parallelism
• variables x = [x_1, x_2, . . . , x_N]^T
• embarrassingly parallel (not an interesting case): min ∑_i f_i(x_i)
• consensus minimization, no constraints:
min f(x) = ∑_{i=1}^N f_i(x) + r(x)
LASSO example: f(x) = (λ/2) ∑_i ‖A_(i)x − b_i‖²_2 + ‖x‖_1
• coupling variable z:
min_{x,z} f(x, z) = ∑_{i=1}^N [f_i(x_i, z) + r_i(x_i)] + r(z)
27 / 74
• coupling constraints:
min f(x) = ∑_{i=1}^N f_i(x_i) + r_i(x_i)
subject to
• equality constraints: ∑_{i=1}^N A_i x_i = b
• inequality constraints: ∑_{i=1}^N A_i x_i ≤ b
• graph-coupling constraints Ax = b, where A is sparse
• combinations of independent variables, coupling variables, and constraints
example:
min_{x,w,z} ∑_{i=1}^N f_i(x_i, z)
s.t. ∑_i A_i x_i ≤ b,  B_i x_i ≤ w ∀i,  w ∈ Ω
• variable coupling ←→ constraint coupling:
min ∑_{i=1}^N f_i(x_i, z)  −→  min ∑_{i=1}^N f_i(x_i, z_i)  s.t.  z_i = z ∀i
28 / 74
Approaches toward parallelism
Keep an existing serial algorithm, parallelize its expensive part(s)
Introduce a new parallel algorithm by
• formulating an equivalent model
• replacing the model by an approximate model
where the new model lends itself to parallel computing
29 / 74
Outline
1. Background of parallel computing: benefits, speedup, and overhead
2. Parallelize existing algorithms
3. Primal decomposition / dual decomposition
• Parallel dual gradient ascent (linearized Bregman)
• Parallel ADMM
4. Parallel greedy coordinate descent
5. Numerical results with big data
30 / 74
Example: parallel ISTA
minimize_x  λ‖x‖_1 + f(x)
Prox-linear approximation:
arg min_x  λ‖x‖_1 + ⟨∇f(x^k), x − x^k⟩ + (1/(2δ_k)) ‖x − x^k‖²_2
Iterative soft-thresholding algorithm:
x^{k+1} = shrink(x^k − δ_k ∇f(x^k), λδ_k)
• shrink(y, t) has a simple closed form: max(abs(y)-t,0).*sign(y)
• the dominating computation is ∇f(x)
examples: f(x) = (1/2)‖Ax − b‖²_2, logistic loss of Ax − b, hinge loss, ... where
A is often dense and can be very large-scale
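A NumPy sketch (an assumed implementation, not from the slides) of the shrink operator and a single ISTA step; grad_f stands for a user-supplied gradient of f:

import numpy as np

def shrink(y, t):
    """Soft-thresholding: max(|y| - t, 0) * sign(y), applied element-wise."""
    return np.maximum(np.abs(y) - t, 0.0) * np.sign(y)

def ista_step(x, grad_f, lam, delta):
    """One iterate x^{k+1} = shrink(x^k - delta * grad_f(x^k), lam * delta)."""
    return shrink(x - delta * grad_f(x), lam * delta)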
31 / 74
Parallel gradient computing of f(x) = g(Ax+ b): row partition
Assumption: A is big and dense; g(y) = ∑_i g_i(y_i) is decomposable
Then f(x) = g(Ax + b) = ∑_i g_i(A_(i)x + b_i)
• let A_(i) and b_i be the i-th row block of A and b, respectively
[Diagram: A partitioned into row blocks A_(1), A_(2), . . . , A_(M)]
• store A_(i), b_i on node i
• goal: compute ∇f(x) = ∑_i A_(i)^T ∇g_i(A_(i)x + b_i), or similarly ∂f(x)
32 / 74
Parallel gradient computing of f(x) = g(Ax+ b): row partition
Steps
1. compute A_(i)x + b_i ∀i in parallel;
2. compute g_i = ∇g_i(A_(i)x + b_i) ∀i in parallel;
3. compute A_(i)^T g_i ∀i in parallel;
4. allreduce ∑_{i=1}^M A_(i)^T g_i.
• steps 1, 2, and 3 are local computations and communication-free;
• step 4 requires an allreduce sum communication.
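A serial NumPy sketch of steps 1–4 for the assumed example g(y) = (1/2)‖y‖²; in a distributed run the final sum over i would be an allreduce, here it is an ordinary sum.

import numpy as np

def grad_row_partition(A_blocks, b_blocks, x):
    partials = []
    for A_i, b_i in zip(A_blocks, b_blocks):   # per node i, communication-free
        y_i = A_i @ x + b_i                    # step 1: A_(i) x + b_i
        g_i = y_i                              # step 2: grad g_i is the identity here
        partials.append(A_i.T @ g_i)           # step 3: A_(i)^T g_i
    return sum(partials)                       # step 4: allreduce-sum in a real run

# tiny usage example with M = 3 row blocks (sizes are assumptions)
rng = np.random.default_rng(0)
A, b, x = rng.standard_normal((6, 4)), rng.standard_normal(6), rng.standard_normal(4)
A_blocks, b_blocks = np.split(A, 3), np.split(b, 3)
assert np.allclose(grad_row_partition(A_blocks, b_blocks, x), A.T @ (A @ x + b))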
33 / 74
Parallel gradient computing of f(x) = g(Ax+ b): column partition
[Diagram: A partitioned into column blocks A_1, A_2, . . . , A_N]
Ax = ∑_{i=1}^N A_i x_i  =⇒  f(x) = g(Ax + b) = g(∑_i A_i x_i + b)
∇f(x) = [A_1^T ∇g(Ax + b); A_2^T ∇g(Ax + b); . . . ; A_N^T ∇g(Ax + b)], similarly for ∂f(x)
34 / 74
Parallel gradient computing of f(x) = g(Ax + b): column partition
Steps
1. compute A_i x_i + (1/N) b ∀i in parallel;
2. allreduce Ax + b = ∑_{i=1}^N (A_i x_i + (1/N) b);
3. evaluate ∇g(Ax + b) in each process;
4. compute A_i^T ∇g(Ax + b) ∀i in parallel.
• each processor keeps A_i and x_i, as well as a copy of b;
• step 2 requires an allreduce sum communication;
• step 3 involves duplicated computation in each process (hopefully, g is
simple).
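The analogous serial NumPy sketch for the column partition, again with the assumed example g(y) = (1/2)‖y‖²; the sum in step 2 stands in for the allreduce, and every "process" evaluates ∇g on the full vector.

import numpy as np

def grad_col_partition(A_blocks, x_blocks, b):
    N = len(A_blocks)
    parts = [A_i @ x_i + b / N for A_i, x_i in zip(A_blocks, x_blocks)]  # step 1
    Axb = sum(parts)                            # step 2: allreduce of Ax + b in a real run
    g = Axb                                     # step 3: grad g(Ax + b), duplicated per process
    return [A_i.T @ g for A_i in A_blocks]      # step 4: block i of the gradient

# tiny usage example with N = 3 column blocks (sizes are assumptions)
rng = np.random.default_rng(1)
A, x, b = rng.standard_normal((5, 6)), rng.standard_normal(6), rng.standard_normal(5)
A_blocks, x_blocks = np.split(A, 3, axis=1), np.split(x, 3)
assert np.allclose(np.concatenate(grad_col_partition(A_blocks, x_blocks, b)),
                   A.T @ (A @ x + b))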
35 / 74
Two-way partition
[Diagram: A partitioned two ways into M × N blocks A_{i,j}, i = 1, . . . , M, j = 1, . . . , N]
f(x) = ∑_i g_i(A_(i)x + b_i) = ∑_i g_i(∑_j A_{i,j} x_j + b_i)
∇f(x) = ∑_i A_(i)^T ∇g_i(A_(i)x + b_i) = ∑_i A_(i)^T ∇g_i(∑_j A_{i,j} x_j + b_i)
36 / 74
Two-way partition
Steps
1. compute A_{i,j} x_j + (1/N) b_i for all (i, j) in parallel;
2. allreduce ∑_{j=1}^N A_{i,j} x_j + b_i for every i;
3. compute g_i = ∇g_i(A_(i)x + b_i) ∀i in parallel;
4. compute A_{i,j}^T g_i for all (i, j) in parallel;
5. allreduce ∑_{i=1}^M A_{i,j}^T g_i for every j.
• processor (i, j) keeps A_{i,j}, x_j, and b_i;
• steps 2 and 5 require parallel allreduce sum communications;
37 / 74
Example: parallel ISTA for LASSO
LASSO
min f(x) = λ‖x‖_1 + (1/2)‖Ax − b‖²_2
algorithm
x^{k+1} = shrink(x^k − δ_k A^T A x^k + δ_k A^T b, λδ_k)
Serial pseudo-code
1: initialize x = 0 and δ;
2: pre-compute A^T b
3: while not converged do
4: g = A^T A x − A^T b
5: x = shrink(x − δg, λδ);
6: end while
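A runnable NumPy version of the serial pseudo-code (a sketch; the step size δ = 1/‖A‖²_2 and the iteration count are assumed choices):

import numpy as np

def shrink(y, t):
    return np.maximum(np.abs(y) - t, 0.0) * np.sign(y)

def ista_lasso(A, b, lam, n_iter=500):
    delta = 1.0 / np.linalg.norm(A, 2) ** 2   # step size 1 / ||A||_2^2 (assumed)
    x = np.zeros(A.shape[1])
    Atb = A.T @ b                             # pre-compute A^T b
    for _ in range(n_iter):
        g = A.T @ (A @ x) - Atb               # g = A^T A x - A^T b
        x = shrink(x - delta * g, lam * delta)
    return x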
38 / 74
Example: parallel ISTA for LASSO
distribute blocks of rows to M nodes
[Diagram: A partitioned into row blocks A_(1), A_(2), . . . , A_(M)]
1: initialize x = 0 and δ;
2: i = find my processor id
3: processor i loads A_(i)
4: pre-compute δA^T b
5: while not converged do
6: processor i computes c_i = A_(i)^T A_(i) x
7: c = allreduce(c_i, SUM)
8: x = shrink(x − δc + δA^T b, λδ);
9: end while
• assume A ∈ R^{m×n}
• one allreduce per iteration
• speedup ≈ 1 / (ρ/M + (1 − ρ) + O(log(M)))
• ρ is close to 1
• requires synchronization
39 / 74
Example: parallel ISTA for LASSO
distribute blocks of columns to N nodes
[Diagram: A partitioned into column blocks A_1, A_2, . . . , A_N]
1: initialize x = 0 and δ;
2: i = find my processor id
3: processor i loads A_i
4: pre-compute δA_i^T b
5: while not converged do
6: processor i computes y_i = A_i x_i
7: y = allreduce(y_i, SUM)
8: processor i computes z_i = A_i^T y
9: x_i^{k+1} = shrink(x_i^k − δz_i + δA_i^T b, λδ);
10: end while
• one allreduce per iteration
• speedup ≈ 1 / (ρ/N + (1 − ρ) + O((m/n) log(N)))
• ρ is close to 1
• requires synchronization
• if m ≪ n, this approach is faster
40 / 74
Outline
1. Background of parallel computing: benefits, speedup, and overhead
2. Parallelize existing algorithms
3. Primal decomposition / dual decomposition3
• Parallel dual gradient ascent (linearized Bregman)
• Parallel ADMM
4. Parallel greedy coordinate descent
5. Numerical results with big data
3Dantzig and Wolfe [1960], Spingarn [1985], Bertsekas and Tsitsiklis [1997]
41 / 74
Unconstrained minimization with coupling variables
x = [x_1, x_2, . . . , x_N]^T.
z is the coupling/bridging/complicating variable.
min_{x,z} f(x, z) = ∑_{i=1}^N f_i(x_i, z)
equivalently, a separable objective with coupling constraints:
min_{x,z} ∑_{i=1}^N f_i(x_i, z_i)  s.t.  z_i = z ∀i
Examples
• network utility maximization (NUM), extended to multiple layers4
• domain decomposition (z are overlapping boundary variables)
• systems of equations: A is block-diagonal but with a few dense columns
4see tutorial: Palomar and Chiang [2006]
42 / 74
Primal decomposition (bi-level optimization)
Introduce easier subproblems
g_i(z) = min_{x_i} f_i(x_i, z),  i = 1, . . . , N
then
min_{x,z} ∑_{i=1}^N f_i(x_i, z)  ⇐⇒  min_z ∑_i g_i(z)
the RHS is called the master problem
parallel computing: iterate
1. fix z, update x_i and compute g_i(z) in parallel
2. update z using a standard method (typically, subgradient descent)
• good for small-dimensional z
• fast if the master problem converges fast
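A minimal sketch of this scheme on an assumed toy problem f_i(x_i, z) = (1/2)(x_i − c_i)² + (1/2)(x_i − z)², whose subproblems have the closed form x_i = (c_i + z)/2 with g_i(z) = (1/4)(z − c_i)², and whose master problem is solved by gradient descent:

import numpy as np

c = np.array([1.0, 4.0, -2.0, 3.0])      # assumed data defining each f_i
z = 0.0
for k in range(200):
    x = (c + z) / 2.0                    # step 1: solve the subproblems (parallel in principle)
    grad_master = np.sum((z - c) / 2.0)  # gradient of the master objective sum_i g_i(z)
    z -= 0.1 * grad_master               # step 2: master update (gradient step)

print(z, x)                              # z tends to mean(c); each x_i to (c_i + z)/2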
43 / 74
Dual decomposition
Consider
min_{x,z} ∑_i f_i(x_i, z_i)   s.t.  ∑_i A_i z_i = b
Introduce
g_i(z) = min_{x_i} f_i(x_i, z)
g_i*(y) = max_z  y^T z − g_i(z)   // convex conjugate
Dualization:
⇔ min_{x,z} max_y ∑_i f_i(x_i, z_i) + y^T (b − ∑_i A_i z_i)
(generalized minimax thm.) ⇔ max_y min_{x_i,z_i} ∑_i (f_i(x_i, z_i) − y^T A_i z_i) + y^T b
⇔ max_y min_{z_i} ∑_i (g_i(z_i) − y^T A_i z_i) + y^T b
⇔ min_y ∑_i g_i*(A_i^T y) − y^T b
where g_i(z) = min_{x_i} f_i(x_i, z) and g_i*(y) = max_z {y^T z − g_i(z)} is the convex
conjugate of g_i.
44 / 74
Example: consensus and exchange problems
Consensus problem
min_z ∑_i g_i(z)
Reformulate as
min ∑_i g_i(z_i)  s.t.  z_i = z ∀i
Its dual problem is the exchange problem
min_y ∑_i g_i*(y_i)  s.t.  ∑_i y_i = 0
45 / 74
Primal-dual objective properties
Constrained separable problem
min_z ∑_i g_i(z_i)  s.t.  ∑_i A_i z_i = b
Dual problem
min_y ∑_i g_i*(A_i^T y) − b^T y
If g_i is proper, closed, and convex, g_i*(a_i^T y) is sub-differentiable
If g_i is strictly convex, g_i*(a_i^T y) is differentiable
If g_i is strongly convex, g_i*(a_i^T y) is differentiable and has a Lipschitz gradient
Obtaining z* from y* generally requires strict convexity of g_i or strict
complementary slackness
Examples:
• g_i(z_i) = |z_i| and g_i*(a_i^T y) = ι{−1 ≤ a_i^T y ≤ 1}
• g_i(z_i) = (1/2)|z_i|² and g_i*(a_i^T y) = (1/2)|a_i^T y|²
• g_i(z_i) = exp(z_i) and g_i*(a_i^T y) = (a_i^T y) ln(a_i^T y) − a_i^T y + ι{a_i^T y ≥ 0}
46 / 74
Outline
1. Background of parallel computing: benefits, speedup, and overhead
2. Parallelize existing algorithms
3. Primal decomposition / dual decomposition
• Parallel dual gradient ascent (linearized Bregman5)
• Parallel ADMM
4. Parallel greedy coordinate descent
5. Numerical results with big data
5Yin, Osher, Goldfarb, and Darbon [2008]
47 / 74
Example: augmented `1
Change |z_i| into the strongly convex g_i(z_i) = |z_i| + (1/(2α))|z_i|²
Dual problem
min_y ∑_i g_i*(A_i^T y) − b^T y = ∑_i (α/2)|shrink(A_i^T y)|² − b^T y,   with h_i(y) := (α/2)|shrink(A_i^T y)|²
Dual gradient iteration (equivalent to linearized Bregman):
y^{k+1} = y^k − t_k (∑_i ∇h_i(y^k) − b),   where ∇h_i(y) = α A_i shrink(A_i^T y).
The dual is C¹ and unconstrained. One can apply (accelerated) gradient
descent, line search, quasi-Newton methods, etc.
Recover x_i* = α shrink(A_i^T y*), i.e., x* = α shrink(A^T y*).
Easy to parallelize. Node i computes ∇h_i(y). One collective
communication per iteration.
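A serial NumPy sketch of this dual gradient iteration (step size, α, and iteration count are assumptions); in a distributed run each node i would compute ∇h_i(y) and one allreduce would sum them:

import numpy as np

def shrink(y, t=1.0):
    return np.maximum(np.abs(y) - t, 0.0) * np.sign(y)

def dual_gradient_bp(A, b, alpha=5.0, n_iter=2000):
    t = 1.0 / (alpha * np.linalg.norm(A, 2) ** 2)     # assumed step size (shrink is 1-Lipschitz)
    y = np.zeros(A.shape[0])
    for _ in range(n_iter):
        grad = alpha * (A @ shrink(A.T @ y)) - b      # sum_i grad h_i(y) - b
        y -= t * grad
    return alpha * shrink(A.T @ y), y                 # recover x*, return y* as well

# usage: for A (m x n), b = A @ x_sparse, the returned x is approximately the
# basis pursuit solution when alpha is large enough (exact regularization).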
48 / 74
Outline
1. Background of parallel computing: benefits, speedup, and overhead
2. Parallelize existing algorithms
3. Primal decomposition / dual decomposition
• Parallel dual gradient ascent (linearized Bregman)
• Parallel ADMM8
4. Parallel greedy coordinate descent
5. Numerical results with big data
8Bertsekas and Tsitsiklis [1997]
56 / 74
Alternating direction method of multipliers (ADMM)
step 1: turn the problem into the form
min_{x,y} f(x) + g(y)
s.t. Ax + By = b.
f and g are convex, maybe nonsmooth, and can incorporate constraints
step 2: iterate
1. x^{k+1} ← arg min_x f(x) + g(y^k) + (β/2)‖Ax + By^k − b − z^k‖²_2,
2. y^{k+1} ← arg min_y f(x^{k+1}) + g(y) + (β/2)‖Ax^{k+1} + By − b − z^k‖²_2,
3. z^{k+1} ← z^k − (Ax^{k+1} + By^{k+1} − b).
history: dates back to the 1960s, formalized in the 1980s, parallel versions appeared in the
late 1980s, revived very recently, with new convergence results appearing recently
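A minimal sketch instantiating the three steps above for LASSO, written as min (1/2)‖Dx − c‖²_2 + λ‖y‖_1 s.t. x − y = 0 (so A = I, B = −I, b = 0 in the generic form; the names D, c and the parameter choices are assumptions, not the slides' notation):

import numpy as np

def shrink(v, t):
    return np.maximum(np.abs(v) - t, 0.0) * np.sign(v)

def admm_lasso(D, c, lam=0.1, beta=1.0, n_iter=300):
    n = D.shape[1]
    x, y, z = np.zeros(n), np.zeros(n), np.zeros(n)
    DtD, Dtc = D.T @ D, D.T @ c
    L = np.linalg.cholesky(DtD + beta * np.eye(n))            # factor once, reuse
    for _ in range(n_iter):
        rhs = Dtc + beta * (y + z)
        x = np.linalg.solve(L.T, np.linalg.solve(L, rhs))     # step 1: x-update
        y = shrink(x - z, lam / beta)                         # step 2: y-update
        z = z - (x - y)                                       # step 3: z-update
    return y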
57 / 74
ADMM and parallel/distributed computing
Two approaches:
• parallelize the serial algorithm(s) for ADMM subproblem(s),
• turn problem into a parallel-ready form9
min_{x,y} ∑_i f_i(x_i) + g(y)
s.t.  A_i x_i + B_i y = b_i,  i = 1, . . . , M
(i.e., block-diagonal A = diag(A_1, . . . , A_M), stacked B = [B_1; . . . ; B_M] and b = [b_1; . . . ; b_M])
consequently, computing x^{k+1} reduces to parallel subproblems
x_i^{k+1} ← arg min_{x_i} f_i(x_i) + (β/2)‖A_i x_i + B_i y^k − b_i − z_i^k‖²_2,  i = 1, . . . , M
Issue: to have a block-diagonal A and a simple B, additional variables are often
needed, thus increasing the problem size and parallel overhead.
9a recent survey Boyd, Parikh, Chu, Peleato, and Eckstein [2011]
58 / 74
Parallel Linearized Bregman vs ADMM on basis pursuit
Basis pursuit:
min_x ‖x‖_1  s.t.  Ax = b.
• example with dense A ∈ R1024×2048;
• distribute A to N computing nodes
A = [A1 A2 · · · AN ];
• compare
1. parallel ADMM in the survey paper Boyd, Parikh, Chu, Peleato, and
Eckstein [2011];
2. parallel linearized Bregman (dual gradient descent), un-accelerated.
• tested N = 1, 2, 4, . . . , 32 computing nodes;
59 / 74
parallel ADMM vs. parallel linearized Bregman
parallel linearized Bregman scales much better
60 / 74
Outline
1. Background of parallel computing: benefits, speedup, and overhead
2. Parallelize existing algorithms
3. Primal decomposition / dual decomposition
• Parallel dual gradient ascent (linearized Bregman)
• Parallel ADMM
4. Parallel greedy coordinate descent
5. Numerical results with big data
61 / 74
(Block)-coordinate descent/update
Definition: update one block of variables each time, keeping others fixed
Advantage: update is simple
Disadvantages:
• more iterations (with exceptions)
• may get stuck at non-stationary points if the problem is non-convex or
non-smooth (with exceptions)
Block selection: cycle (Gauss-Seidel), parallel (Jacobi), random, greedy
Specific to sparse optimization: the greedy approach10 can be exceptionally
effective (because most of the time the correct variables are selected for update; most zero
variables are never touched, “reducing” the problem to a much smaller one.)
10Li and Osher [2009]
62 / 74
GRock: greedy coordinate-block descent
Example:
min_x λ‖x‖_1 + f(Ax − b)
decompose Ax = ∑_j A_j x_j; block A_j and x_j are kept on node j
Parallel GRock:
1. (parallel) compute a merit value for each coordinate i:
d_i = arg min_d λ·r(x_i + d) + g_i d + (1/2)d²,  where g_i = a_i^T ∇f(Ax − b) and a_i is the i-th column of A
2. (parallel) compute a merit value for each block j:
m_j = max{|d| : d is an element of d_j};
let s_j be the index of the maximal coordinate within block j
3. (allreduce) select the P blocks with the largest m_j, 2 ≤ P ≤ N
4. (parallel) update x_{s_j} ← x_{s_j} + d_{s_j} for each selected block j
5. (allreduce) update Ax
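A serial NumPy sketch of one GRock iteration for min λ‖x‖_1 + (1/2)‖Ax − b‖²_2 (an assumed choice of f, with unit-norm columns so the per-coordinate quadratic is (1/2)d²); the allreduce steps become ordinary array operations here:

import numpy as np

def shrink(v, t):
    return np.maximum(np.abs(v) - t, 0.0) * np.sign(v)

def grock_step(A, b, x, Ax, lam, blocks, P):
    g = A.T @ (Ax - b)                           # step 1: g_i = a_i^T grad f(Ax - b)
    d = shrink(x - g, lam) - x                   #         closed-form coordinate merits d_i
    m = np.array([np.max(np.abs(d[blk])) for blk in blocks])   # step 2: block merits m_j
    s = [blk[np.argmax(np.abs(d[blk]))] for blk in blocks]     #         best coordinate s_j
    chosen = np.argsort(-m)[:P]                  # step 3: the P blocks with largest m_j
    for j in chosen:                             # step 4: update the selected coordinates
        x[s[j]] += d[s[j]]
        Ax += d[s[j]] * A[:, s[j]]               # step 5: maintain Ax incrementally
    return x, Ax

# usage (assumed setup): blocks = np.array_split(np.arange(A.shape[1]), N); Ax = A @ x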
63 / 74
How large can P be?
P depends on the block spectral radius
ρ_P = max_{M ∈ 𝓜} ρ(M),
where 𝓜 is the set of all P × P submatrices that we can obtain from A^T A,
corresponding to selecting exactly one column from each of the P blocks
Theorem
Assume each column of A has unit 2-norm. If ρ_P < 2, GRock with P parallel
updates per iteration gives
F(x + d) − F(x) ≤ ((ρ_P − 2) / (2β)) ‖d‖²_2;
moreover,
F(x^k) − F(x*) ≤ (2C² (2L + β√(NP))²) / ((2 − ρ_P) β) · (1/k).
64 / 74
Compare different block selection rules
[Plots: f − f_min and relative error vs. normalized # coordinates calculated, comparing Cyclic CD, Greedy CD, Mixed CD, and GRock]
• LASSO test with A ∈ R512×1024, N = 64 column blocks
• GRock uses P = 8 updates each iteration
• greedy CD11 uses P = 1
• mixed CD12 selects P = 8 random blocks and best coordinate in each
iteration
11Li and Osher [2009]
12Scherrer, Tewari, Halappanavar, and Haglin [2012]
65 / 74
Outline
1. Background of parallel computing: benefits, speedup, and overhead
2. Parallelize existing algorithms
3. Primal decomposition / dual decomposition
• Parallel dual gradient ascent (linearized Bregman)
• Parallel ADMM
4. Parallel greedy coordinate descent
5. Numerical results with big data
66 / 74
Test on Rice cluster STIC
specs:
• 170 Appro Greenblade E5530 nodes, each with two quad-core 2.4GHz Xeon
(Nehalem) CPUs
• each node has 12GB of memory shared by all 8 cores
• # of processes used on each node = 8
test dataset:
             A type     A size        λ      sparsity of x*
dataset I    Gaussian   1024 × 2048   0.1    100
dataset II   Gaussian   2048 × 4096   0.01   200
67 / 74
iterations vs cores
[Plots: number of iterations vs. number of cores for p D-ADMM, p FISTA, and GRock; left: dataset I, right: dataset II]
Note: we set P = number of cores
68 / 74
time vs cores
[Plots: wall-clock time (s) vs. number of cores for p D-ADMM, p FISTA, and GRock; left: dataset I (annotation: P = 16), right: dataset II]
Note: we set P = number of cores
69 / 74
(% communication time) vs cores
[Plots: communication time percentage vs. number of cores for p D-ADMM, p FISTA, and GRock; left: dataset I, right: dataset II]
Note: we set P = number of cores
70 / 74
big data LASSO on Amazon EC2
Amazon EC2 is an elastic, pay-as-you-use cluster;
advantage: no hardware investment, everyone can have an account
test dataset
• A: dense matrix with 20 billion entries, 170GB in size
• x: 200K entries, 4K nonzeros, Gaussian values
requested system
• 20 “high-memory quadruple extra-large instances”
• each instance has 8 cores and 60GB memory
code
• written in C using GSL (for matrix-vector multiplication) and MPI
71 / 74
parallel Dual-ADMM vs FISTA13 vs GRock
                              p D-ADMM   p FISTA   GRock
estimate stepsize (min.)      n/a        1.6       n/a
matrix factorization (min.)   51         n/a       n/a
iteration time (min.)         105        40        1.7
number of iterations          2500       2500      104
communication time (min.)     30.7       9.5       0.5
stopping relative error       1E-1       1E-3      1E-5
total time (min.)             156        41.6      1.7
Amazon charge                 $85        $22.6     $0.93
• ADMM’s performance depends on the penalty parameter β;
we picked β as the best out of only a few trials (we cannot afford more)
• parallel Dual ADMM and FISTA were capped at 2500 iterations
• GRock used adaptive P and stopped at relative error 1E-5
13Beck and Teboulle [2009]
72 / 74
Software code can be found on the instructor’s website.
Acknowledgements: Zhimin Peng (Rice), Ming Yan (UCLA). Also, Qing
Ling (USTC) and Zaiwen Wen (SJTU).
References:
B. K. Natarajan. Sparse approximate solutions to linear systems. SIAM Journal on
Computing, 24:227–234, 1995.
D. Donoho and M. Elad. Optimally sparse representation in general (nonorthogonal)
dictionaries via ℓ1 minimization. Proceedings of the National Academy of Sciences,
100:2197–2202, 2003.
D.L. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):
1289–1306, 2006.
E.J. Candès, J. Romberg, and T. Tao. Robust uncertainty principles: Exact signal
reconstruction from highly incomplete frequency information. IEEE Transactions on
Information Theory, 52(2):489–509, 2006.
G.B. Dantzig and P. Wolfe. Decomposition principle for linear programs. Operations
Research, 8:101–111, 1960.
J.E. Spingarn. Applications of the method of partial inverses to convex programming:
decomposition. Mathematical Programming, 32(2):199–223, 1985.
73 / 74
D. Bertsekas and J. Tsitsiklis. Parallel and Distributed Computation: Numerical
Methods, Second Edition. Athena Scientific, 1997.
D.P. Palomar and M. Chiang. A tutorial on decomposition methods for network utility
maximization. IEEE Journal on Selected Areas in Communications, 24(8):
1439–1451, 2006.
W. Yin, S. Osher, D. Goldfarb, and J. Darbon. Bregman iterative algorithms for
l1-minimization with applications to compressed sensing. SIAM Journal on Imaging
Sciences, 1(1):143–168, 2008.
M.P. Friedlander and P. Tseng. Exact regularization of convex programs. SIAM
Journal on Optimization, 18(4):1326–1350, 2007.
W. Yin. Analysis and generalizations of the linearized Bregman method. SIAM
Journal on Imaging Sciences, 3(4):856–877, 2010.
S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and
statistical learning via the alternating direction method of multipliers. Foundations
and Trends in Machine Learning, 3(1):1–122, 2011.
Y. Li and S. Osher. Coordinate descent optimization for `1 minimization with
application to compressed sensing: a greedy algorithm. Inverse Problems and
Imaging, 3(3):487–503, 2009.
C. Scherrer, A. Tewari, M. Halappanavar, and D. Haglin. Feature clustering for
accelerating parallel coordinate descent. In NIPS, pages 28–36, 2012.
A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear
inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
74 / 74