L. Vandenberghe ECE236C (Spring 2020)
10. Dual proximal gradient method
• proximal gradient method applied to the dual
• examples
• alternating minimization method
10.1
Dual methods
Subgradient method: converges slowly, step size selection is difficult
Gradient method: requires differentiable dual cost function
• often the dual cost function is not differentiable, or has a nontrivial domain
• dual function can be smoothed by adding small strongly convex term to primal
Augmented Lagrangian method
• equivalent to gradient ascent on a smoothed dual problem
• quadratic penalty in augmented Lagrangian destroys separable primal structure
Proximal gradient method (this lecture): dual cost split in two terms
• one term is differentiable with Lipschitz continuous gradient
• other term has an inexpensive prox operator
Dual proximal gradient method 10.2
Composite primal and dual problem
primal: minimize f(x) + g(Ax)
dual: maximize −g*(z) − f*(−Aᵀz)
the dual problem has the right structure for the proximal gradient method if
• f is strongly convex: this implies f ∗(−AT z) has a Lipschitz continuous gradient
‖A∇f*(−Aᵀu) − A∇f*(−Aᵀv)‖₂ ≤ (‖A‖₂²/µ)‖u − v‖₂

where µ is the strong convexity constant of f (see page 5.19)
• prox operator of g (or g∗) is inexpensive (closed form or simple algorithm)
Dual proximal gradient method 10.3
Dual proximal gradient update
minimize g∗(z) + f ∗(−AT z)
• proximal gradient update:
z+ = prox_{tg*}(z + tA∇f*(−Aᵀz))
• ∇ f ∗ can be computed by minimizing partial Lagrangian (from p. 5.15, p. 5.19):
x = argmin_x ( f(x) + zᵀAx )
z+ = prox_{tg*}(z + tAx)
• partial Lagrangian is a separable function of x if f is separable
• step size t is constant (t ≤ µ/‖A‖₂²) or adjusted by backtracking
• faster variant uses accelerated proximal gradient method of lecture 7
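As a minimal numerical sketch of this update (not part of the lecture), take f(x) = (1/2)‖x − a‖₂² (so µ = 1 and the partial-Lagrangian step has the closed form x = a − Aᵀz) and g = ‖·‖₁, whose conjugate is the indicator of the ℓ∞ unit ball, so prox_{tg*} is a componentwise clip:

```python
import numpy as np

def dual_prox_grad(A, a, t, iters=500):
    """Dual proximal gradient for: minimize (1/2)||x - a||_2^2 + ||A x||_1.
    Here g = l1-norm, so prox_{t g*} is projection onto the l_inf unit ball."""
    z = np.zeros(A.shape[0])
    for _ in range(iters):
        x = a - A.T @ z                      # x = argmin_x f(x) + z^T A x
        z = np.clip(z + t * (A @ x), -1, 1)  # z+ = prox_{t g*}(z + t A x)
    return a - A.T @ z

# sanity check: with A = I the minimizer is soft-thresholding of a at level 1
a = np.array([3.0, -0.4, 1.5])
x = dual_prox_grad(np.eye(3), a, t=1.0)      # t <= mu / ||A||_2^2 = 1
soft = np.sign(a) * np.maximum(np.abs(a) - 1.0, 0.0)
```

With A = I the iteration reaches the soft-thresholding solution immediately; for general A the same loop applies with t ≤ µ/‖A‖₂².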
Dual proximal gradient method 10.4
Dual proximal gradient update
x = argmin_x ( f(x) + zᵀAx )
z+ = prox_{tg*}(z + tAx)
• Moreau decomposition gives an alternate expression for the z-update:

z+ = z + tAx − t prox_{t⁻¹g}(t⁻¹z + Ax)

• the right-hand side can be written as z + t(Ax − y) where

y = prox_{t⁻¹g}(t⁻¹z + Ax)
  = argmin_y ( g(y) + (t/2)‖Ax + t⁻¹z − y‖₂² )
  = argmin_y ( g(y) + zᵀ(Ax − y) + (t/2)‖Ax − y‖₂² )
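A quick numerical check of the Moreau decomposition used here (illustrative, for g = ‖·‖₁): prox_{tg*} is the clip onto the ℓ∞ unit ball, prox_{t⁻¹g} is soft-thresholding at level 1/t, and the two expressions coincide.

```python
import numpy as np

# check prox_{t g*}(w) = w - t * prox_{(1/t) g}(w / t) for g = l1-norm
t = 0.7
w = np.array([2.5, -0.3, 0.9, -4.0])

lhs = np.clip(w, -1.0, 1.0)              # prox_{t g*}: project onto l_inf ball

u = w / t                                # prox_{(1/t) g}: soft-threshold at 1/t
soft = np.sign(u) * np.maximum(np.abs(u) - 1.0 / t, 0.0)
rhs = w - t * soft                       # Moreau decomposition right-hand side
```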
Dual proximal gradient method 10.5
Alternating minimization interpretation
x = argmin_x ( f(x) + zᵀAx )
y = argmin_y ( g(y) − zᵀy + (t/2)‖Ax − y‖₂² )
z+ = z + t(Ax − y)
• first minimize Lagrangian over x, then augmented Lagrangian over y
• compare with augmented Lagrangian method:
(x, y) = argmin_{x,y} ( f(x) + g(y) + zᵀ(Ax − y) + (t/2)‖Ax − y‖₂² )
• requires strongly convex f (in contrast to augmented Lagrangian method)
Dual proximal gradient method 10.6
Outline
• proximal gradient method applied to the dual
• examples
• alternating minimization method
Regularized norm approximation
primal: minimize f(x) + ‖Ax − b‖
dual: maximize −bᵀz − f*(−Aᵀz)
      subject to ‖z‖∗ ≤ 1
(see page 5.23)
• we assume f is strongly convex with constant µ, not necessarily differentiable
• we assume projections on unit ‖ · ‖∗-ball are simple
• this is a special case of the problem on page 10.3 with g(y) = ‖y − b‖:
g*(z) = bᵀz if ‖z‖∗ ≤ 1, +∞ otherwise;    prox_{tg*}(z) = P_C(z − tb)

(C is the unit ‖·‖∗-ball)
Dual proximal gradient method 10.7
Dual gradient projection
primal: minimize f(x) + ‖Ax − b‖
dual: maximize −bᵀz − f*(−Aᵀz)
      subject to ‖z‖∗ ≤ 1
• dual gradient projection update (C = {z | ‖z‖∗ ≤ 1}):
z+ = P_C(z + t(A∇f*(−Aᵀz) − b))

• gradient of f* can be computed by minimizing the partial Lagrangian:

x = argmin_x ( f(x) + zᵀAx )
z+ = P_C(z + t(Ax − b))
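A small sketch of this projection update (illustrative, not from the lecture), with f(x) = (1/2)‖x − a‖₂² and the Euclidean norm, which is self-dual, so C is the unit 2-norm ball:

```python
import numpy as np

def dual_grad_proj(a, b, t=1.0, iters=100):
    """Dual gradient projection for: minimize (1/2)||x - a||_2^2 + ||x - b||_2
    (the composite problem with A = I and mu = 1, so t = 1 is admissible)."""
    z = np.zeros_like(a)
    for _ in range(iters):
        x = a - z                            # minimize the partial Lagrangian
        v = z + t * (x - b)
        z = v / max(np.linalg.norm(v), 1.0)  # project onto ||z||_2 <= 1
    return a - z

# closed-form check: block soft-thresholding of a around b at level 1
a = np.array([4.0, 0.0, 3.0])
b = np.array([1.0, 0.0, -1.0])
x = dual_grad_proj(a, b)
d = a - b
x_ref = b + d * max(1.0 - 1.0 / np.linalg.norm(d), 0.0)
```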
Dual proximal gradient method 10.8
Example
primal: minimize f(x) + ∑_{i=1}^p ‖Bi x‖₂
dual: maximize −f*(−B1ᵀz1 − · · · − Bpᵀzp)
      subject to ‖zi‖₂ ≤ 1, i = 1, . . . , p
Dual gradient projection update (for strongly convex f ):
x = argmin_x ( f(x) + (∑_{i=1}^p Biᵀzi)ᵀx )
zi+ = P_{Ci}(zi + tBi x), i = 1, . . . , p

• Ci is the unit Euclidean norm ball in R^{mi}, if Bi ∈ R^{mi×n}
• x-calculation decomposes if f is separable
Dual proximal gradient method 10.9
Example
• we take f(x) = (1/2)‖Cx − d‖₂²
• each iteration requires the solution of a linear equation with coefficient CᵀC
• randomly generated C ∈ R2000×1000, Bi ∈ R10×1000, p = 500
[figure: relative dual suboptimality versus iteration (0–500), comparing projected gradient and FISTA]
Dual proximal gradient method 10.10
Minimization over intersection of convex sets
minimize f(x)
subject to x ∈ C1 ∩ · · · ∩ Cp
• f is strongly convex with constant µ
• we assume each set Ci is closed, convex, and easy to project onto
• this is a special case of the problem on page 10.3 with
g(y1, . . . , yp) = δ_{C1}(y1) + · · · + δ_{Cp}(yp),    A = [I I · · · I]ᵀ
with this choice of g and A,
f (x) + g(Ax) = f (x) + δC1(x) + · · · + δCp(x)
Dual proximal gradient method 10.11
Dual problem
primal: minimize f(x) + δ_{C1}(x) + · · · + δ_{Cp}(x)
dual: maximize −δ*_{C1}(z1) − · · · − δ*_{Cp}(zp) − f*(−z1 − · · · − zp)
• proximal mapping of δ*_{Ci}: from the Moreau decomposition (page 6.18),

prox_{tδ*_{Ci}}(u) = u − tP_{Ci}(u/t)
• gradient of h(z1, . . . , zp) = f*(−z1 − · · · − zp):

∇h(z) = −A∇f*(−Aᵀz) = −[I; . . . ; I] ∇f*(−z1 − · · · − zp)

• ∇h(z) is Lipschitz continuous with constant ‖A‖₂²/µ = p/µ
Dual proximal gradient method 10.12
Dual proximal gradient method
primal: minimize f(x) + δ_{C1}(x) + · · · + δ_{Cp}(x)
dual: maximize −δ*_{C1}(z1) − · · · − δ*_{Cp}(zp) − f*(−z1 − · · · − zp)
• dual proximal gradient update
s = −z1 − · · · − zp
zi+ = zi + t∇f*(s) − tP_{Ci}(t⁻¹zi + ∇f*(s)), i = 1, . . . , p
• gradient of f ∗ can be computed by minimizing the partial Lagrangian
x = argmin_x ( f(x) + (z1 + · · · + zp)ᵀx )
zi+ = zi + tx − tP_{Ci}(zi/t + x), i = 1, . . . , p
• stepsize is fixed (t ≤ µ/p) or adjusted by backtracking
Dual proximal gradient method 10.13
Euclidean projection on intersection of convex sets
minimize (1/2)‖x − a‖₂²
subject to x ∈ C1 ∩ · · · ∩ Cp
• special case of previous problem with
f(x) = (1/2)‖x − a‖₂²,    f*(u) = (1/2)‖u‖₂² + aᵀu
• strong convexity constant µ = 1; hence stepsize t = 1/p works
• dual proximal gradient update (with change of variables wi = pzi):

x = a − (1/p)(w1 + · · · + wp)
wi+ = wi + x − P_{Ci}(wi + x), i = 1, . . . , p
• the p projections in the second step can be computed in parallel
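The update above can be sketched directly in code (illustrative, not from the lecture); the sets below are chosen so the projection onto the intersection is known in closed form:

```python
import numpy as np

def project_onto_intersection(a, projections, iters=300):
    """Projection of a onto C_1 ∩ ... ∩ C_p by the dual proximal gradient
    update above (step size t = 1/p, with the change of variables w_i = p z_i)."""
    p = len(projections)
    w = [np.zeros_like(a) for _ in range(p)]
    for _ in range(iters):
        x = a - sum(w) / p
        # the p projections below are independent and could run in parallel
        w = [wi + x - P(wi + x) for wi, P in zip(w, projections)]
    return a - sum(w) / p

# illustrative sets: C1 = {x >= 0} and C2 = {x <= c} intersect in the box [0, c],
# whose Euclidean projection is a componentwise clip
c = np.array([1.0, 2.0, 0.5])
a = np.array([-1.0, 3.0, 0.2])
x = project_onto_intersection(a, [lambda v: np.maximum(v, 0.0),
                                  lambda v: np.minimum(v, c)])
```

The same loop works for any list of easy projections (halfspaces, norm balls, the semidefinite cone); only the `projections` callables change.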
Dual proximal gradient method 10.14
Nearest positive semidefinite unit-diagonal Z-matrix
projection in Frobenius norm of A ∈ S¹⁰⁰ onto the intersection of two sets:

C1 = S¹⁰⁰₊,    C2 = {X ∈ S¹⁰⁰ | diag(X) = 1, Xij ≤ 0 for i ≠ j}
[figure: relative dual suboptimality versus iteration (0–150), comparing proximal gradient and FISTA]
Dual proximal gradient method 10.15
Euclidean projection on polyhedron
• intersection of p halfspaces Ci = {x | aiᵀx ≤ bi}, with projection

P_{Ci}(x) = x − (max{aiᵀx − bi, 0}/‖ai‖₂²) ai
• example with p = 2000 inequalities and n = 1000 variables
[figure: relative dual suboptimality versus iteration (0–4000), comparing proximal gradient and FISTA]
Dual proximal gradient method 10.16
Decomposition of primal-dual separable problems
minimize ∑_{j=1}^n fj(xj) + ∑_{i=1}^m gi(Ai1 x1 + · · · + Ain xn)
• special case of f (x) + g(Ax) with (block-)separable f and g
• for example,

minimize ∑_{j=1}^n fj(xj)
subject to ∑_{j=1}^n A1j xj ∈ C1
           . . .
           ∑_{j=1}^n Amj xj ∈ Cm

• we assume each fj is strongly convex and each gi has an inexpensive prox operator
Dual proximal gradient method 10.17
Decomposition of primal-dual separable problems
primal: minimize ∑_{j=1}^n fj(xj) + ∑_{i=1}^m gi(Ai1 x1 + · · · + Ain xn)

dual: maximize −∑_{i=1}^m g*_i(zi) − ∑_{j=1}^n f*_j(−A1jᵀz1 − · · · − Amjᵀzm)
Dual proximal gradient update
xj = argmin_{xj} ( fj(xj) + ∑_{i=1}^m ziᵀAij xj ), j = 1, . . . , n

zi+ = prox_{tg*_i}(zi + t ∑_{j=1}^n Aij xj), i = 1, . . . , m
Dual proximal gradient method 10.18
Outline
• proximal gradient method applied to the dual
• examples
• alternating minimization method
Separable structure with one strongly convex term
minimize f1(x1) + f2(x2) + g(A1x1 + A2x2)
• composite problem with separable f (two terms, for simplicity)
• if f1 and f2 are strongly convex, dual method of page 10.4 applies
x1 = argmin_{x1} ( f1(x1) + zᵀA1 x1 )
x2 = argmin_{x2} ( f2(x2) + zᵀA2 x2 )
z+ = prox_{tg*}(z + t(A1 x1 + A2 x2))
• we now assume that one function ( f2) is not strongly convex
Dual proximal gradient method 10.19
Separable structure with one strongly convex term
primal: minimize f1(x1) + f2(x2) + g(A1 x1 + A2 x2)
dual: maximize −g*(z) − f*_1(−A1ᵀz) − f*_2(−A2ᵀz)

• we split the dual objective in the components −f*_1(−A1ᵀz) and −g*(z) − f*_2(−A2ᵀz)
• the component f*_1(−A1ᵀz) is differentiable with Lipschitz continuous gradient
• proximal mapping of h(z) = g*(z) + f*_2(−A2ᵀz) was discussed on page 8.7:
prox_{th}(w) = w + t(A2 x2 − y)

where x2, y minimize a partial augmented Lagrangian

(x2, y) = argmin_{x2,y} ( f2(x2) + g(y) + (t/2)‖A2 x2 − y + w/t‖₂² )
Dual proximal gradient method 10.20
Dual proximal gradient method
z+ = prox_{th}(z + tA1∇f*_1(−A1ᵀz))
• evaluate ∇ f ∗1 by minimizing partial Lagrangian:
x1 = argmin_{x1} ( f1(x1) + zᵀA1 x1 )
z+ = prox_{th}(z + tA1 x1)

• evaluate prox_{th}(z + tA1 x1) by minimizing the augmented Lagrangian:

(x2, y) = argmin_{x2,y} ( f2(x2) + g(y) + (t/2)‖A1 x1 + A2 x2 − y + z/t‖₂² )
z+ = z + t(A1 x1 + A2 x2 − y)
Dual proximal gradient method 10.21
Alternating minimization method
starting at some initial z, repeat the following iteration

1. minimize the Lagrangian over x1:

   x1 = argmin_{x1} ( f1(x1) + zᵀA1 x1 )

2. minimize the augmented Lagrangian over x2, y:

   (x2, y) = argmin_{x2,y} ( f2(x2) + g(y) + (t/2)‖A1 x1 + A2 x2 − y + z/t‖₂² )

3. update the dual variable:

   z+ = z + t(A1 x1 + A2 x2 − y)
Dual proximal gradient method 10.22
Comparison with augmented Lagrangian method
Augmented Lagrangian method (for problem on page 10.19)
1. compute minimizer x1, x2, y of the augmented Lagrangian
f1(x1) + f2(x2) + g(y) + (t/2)‖A1 x1 + A2 x2 − y + z/t‖₂²
2. update dual variable:
z+ = z + t(A1 x1 + A2 x2 − y)
Differences with alternating minimization (dual proximal gradient method)
• augmented Lagrangian method does not require strong convexity of f1
• there is no upper limit on the step size t in augmented Lagrangian method
• quadratic term in step 1 of AL method destroys separability of f1(x1) + f2(x2)
Dual proximal gradient method 10.23
Example
minimize (1/2)x1ᵀP x1 + q1ᵀx1 + q2ᵀx2
subject to B1 x1 ⪯ d1, B2 x2 ⪯ d2
           A1 x1 + A2 x2 = b
• without the equality constraint, the problem would separate into an independent QP and LP
• we assume P ≻ 0
Formulation for dual decomposition
minimize f1(x1) + f2(x2)
subject to A1 x1 + A2 x2 = b
• the first function is strongly convex:

f1(x1) = (1/2)x1ᵀP x1 + q1ᵀx1,    dom f1 = {x1 | B1 x1 ⪯ d1}

• the second function is not: f2(x2) = q2ᵀx2 with domain {x2 | B2 x2 ⪯ d2}
Dual proximal gradient method 10.24
Example
Alternating minimization algorithm
1. compute the solution x1 of the QP

   minimize (1/2)x1ᵀP x1 + (q1 + A1ᵀz)ᵀx1
   subject to B1 x1 ⪯ d1

2. compute the solution x2 of the QP

   minimize (q2 + A2ᵀz)ᵀx2 + (t/2)‖A1 x1 + A2 x2 − b‖₂²
   subject to B2 x2 ⪯ d2

3. dual update:

   z+ = z + t(A1 x1 + A2 x2 − b)
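The iteration can be sketched numerically on a simplified instance with the inequality constraints B1 x1 ⪯ d1 and B2 x2 ⪯ d2 dropped, so both subproblems have closed-form solutions; the matrices below are illustrative stand-ins, and the result is checked against the KKT system of the equality-constrained problem.

```python
import numpy as np

# small deterministic instance (illustrative stand-ins, not from the lecture)
P = 2 * np.eye(3)                        # f1 strongly convex with mu = 2
q1 = np.array([1.0, -1.0, 0.0])
q2 = np.array([0.5, -0.5])
A1 = np.eye(3)
A2 = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
b = np.array([1.0, 2.0, 0.0])

mu = 2.0
t = mu / np.linalg.norm(A1, 2) ** 2      # step size bound t <= mu / ||A1||_2^2

z = np.zeros(3)
for _ in range(100):
    # step 1: minimize the Lagrangian over x1 (unconstrained since B1 is dropped)
    x1 = np.linalg.solve(P, -(q1 + A1.T @ z))
    # step 2: minimize the augmented Lagrangian over x2; the equality constraint
    # makes g the indicator of {b}, so y = b and only the x2 minimization remains
    x2 = np.linalg.solve(A2.T @ A2, A2.T @ (b - A1 @ x1 - z / t) - q2 / t)
    # step 3: dual update
    z = z + t * (A1 @ x1 + A2 @ x2 - b)

# reference: solve the KKT system of the equality-constrained problem directly
K = np.block([[P, np.zeros((3, 2)), A1.T],
              [np.zeros((2, 3)), np.zeros((2, 2)), A2.T],
              [A1, A2, np.zeros((3, 3))]])
ref = np.linalg.solve(K, np.concatenate([-q1, -q2, b]))
```

At a fixed point, steps 1–3 reproduce exactly the KKT conditions of the problem, which is what the comparison with `ref` verifies.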
Dual proximal gradient method 10.25
References
• P. Tseng, Applications of a splitting algorithm to decomposition in convex programming and variational inequalities, SIAM J. Control and Optimization (1991).
• P. Tseng, Further applications of a splitting algorithm to decomposition in variational inequalities and convex programming, Mathematical Programming (1990).
Dual proximal gradient method 10.26