TRANSCRIPT
ORIE 6326: Convex Optimization
Operators
Professor Udell
Operations Research and Information Engineering, Cornell
May 15, 2017
Beating the lower bound
problem: lower bound for subgradient method is slow!
solution: find a more powerful oracle
Proximal operator
define the proximal operator of the function f : R^d → R:

    prox_f(x) = argmin_z ( f(z) + (1/2)‖z − x‖_2^2 )

- prox_f : R^d → R^d
- generalized projection: if 1_C is the indicator of the set C,

      prox_{1_C}(w) = Π_C(w)

- implicit gradient step: if z = prox_f(x),

      0 ∈ ∂f(z) + z − x
      z = x − g   for some subgradient g ∈ ∂f(z)
Maps from functions to functions
there is no consistent notation for maps from functions to functions.
for a function f : R^d → R,

- prox maps f to a new function prox_f : R^d → R^d
- prox_f(x) evaluates this function at the point x
- ∇ maps f to a new function ∇f : R^d → R^d
- ∇f(x) evaluates this function at the point x
Let’s evaluate some proximal operators!
define the proximal operator of the function f : R^d → R:

    prox_f(x) = argmin_z ( f(z) + (1/2)‖z − x‖_2^2 )

- f(x) = 0 (identity)
- f(x) = x^2 (shrinkage)
- f(x) = |x| (soft-thresholding)
- f(x) = 1(x ≥ 0) (projection)
- f(x) = ∑_{i=1}^d f_i(x_i) (separable)
- f(x) = ‖x‖_1 (soft-thresholding on each index)
- f(X) = ‖X‖_* (soft-thresholding on singular values)
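As a concrete check of one entry in this list, here is a minimal numerical sketch of the ℓ1 prox (the scaling t and the grid-search verification are illustrative choices, not from the slides):

```python
import numpy as np

def prox_l1(x, t=1.0):
    # prox of f(z) = t*||z||_1: soft-thresholding, applied entrywise
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

# sanity check on one coordinate: compare against a brute-force argmin
# of |z| + (1/2)(z - x0)^2 over a fine grid
x0 = 3.0
grid = np.linspace(-10, 10, 200001)
z_grid = grid[np.argmin(np.abs(grid) + 0.5 * (grid - x0) ** 2)]
print(prox_l1(np.array([x0]))[0], z_grid)  # both close to 2.0
```

The separable rule on this slide is what justifies applying the scalar formula entrywise.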
Proxable functions
we say a function f is proxable if it’s easy to evaluate proxf (x)
all examples from previous slide are proxable
Roadmap
suppose f is smooth and g is non-smooth. can we solve

    minimize f(x) + g(x)

using proximal operators together with gradient steps?

- the proximal operator gives a fast method to step towards the minimum of g
- the gradient method works well to step towards the minimum of f
- put the two together to make fast optimization algorithms

to do this elegantly, we will need more theory. . .
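As a preview of where this roadmap leads, here is a hedged sketch of a "gradient step, then prox step" loop on a lasso-style problem; the data A, b, the penalty lam, and the step size 1/L are all illustrative assumptions, and the update rule itself (a proximal gradient step) is developed later, not on this slide:

```python
import numpy as np

# illustrative problem: minimize 0.5*||Ax - b||^2 + lam*||x||_1
# (smooth f plus proxable g), with made-up data
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
lam = 0.1

L = np.linalg.norm(A, 2) ** 2     # Lipschitz constant of grad f
t = 1.0 / L                       # step size

def prox_l1(x, s):
    return np.sign(x) * np.maximum(np.abs(x) - s, 0.0)

x = np.zeros(5)
for _ in range(500):
    x = prox_l1(x - t * A.T @ (A @ x - b), lam * t)  # gradient step, then prox

obj = 0.5 * np.linalg.norm(A @ x - b) ** 2 + lam * np.abs(x).sum()
```

With step size 1/L each iteration cannot increase the objective, so obj ends below its starting value 0.5*||b||^2 at x = 0.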
Outline
Relations
Fixed points
Averaged operators
Cocoercive and coercive operators
Operators and functions
Resolvents and proxs
Functions
in much of what follows, we'll need to assume functions are

- closed: epi(f) is a closed set
- convex: f is convex
- proper: dom f is non-empty

which we abbreviate as CCP
Relations
(x, ∂f(x)) and (x, prox_f(x)) define relations on R^n

- a relation R on R^n is a subset of R^n × R^n
- dom R = {x : (x, y) ∈ R for some y}
- let R(x) = {y : (x, y) ∈ R}
- if R(x) is always empty or a singleton, we say R is a function
- any function f : R → R is a relation
Relations: examples
- empty relation: ∅
- full relation: R^n × R^n
- identity: {(x, x) : x ∈ R^n}
- zero: {(x, 0) : x ∈ R^n}
- subdifferential: {(x, g) : x ∈ dom f, g ∈ ∂f(x)}
Operations on relations
if R and S are relations, define

- composition: RS = {(x, z) : (x, y) ∈ R, (y, z) ∈ S}
- addition: R + S = {(x, y + z) : (x, y) ∈ R, (x, z) ∈ S}
- inverse: R^{-1} = {(y, x) : (x, y) ∈ R}

we use an inequality on sets to mean the inequality holds for every element of the set, e.g.,

    f(y) ≥ f(x) + ∂f(x)^T (y − x)
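These operations can be modeled directly as sets of pairs; a toy sketch (the particular relations R and S are made up for illustration):

```python
# relations as Python sets of ordered pairs
R = {(1, 2), (1, 3), (2, 4)}
S = {(2, 5), (3, 5), (4, 6)}

def compose(R, S):
    # RS = {(x, z) : (x, y) in R, (y, z) in S}
    return {(x, z) for (x, y) in R for (y2, z) in S if y == y2}

def inverse(R):
    # R^{-1} = {(y, x) : (x, y) in R}
    return {(y, x) for (x, y) in R}

print(compose(R, S))  # {(1, 5), (2, 6)}
print(inverse(R))     # {(2, 1), (3, 1), (4, 2)}
```

Note that R here is not a function (R(1) = {2, 3} has two elements), which is exactly the situation the relation formalism is designed to handle.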
Example: Fenchel conjugates and the subdifferential

if f is CCP, (f^*)^* = f^{**} = f, so

    (u, v) ∈ (∂f)^{-1} ⟺ (v, u) ∈ ∂f
                       ⟺ u ∈ ∂f(v)
                       ⟺ 0 ∈ ∂f(v) − u
                       ⟺ v ∈ argmin_x ( f(x) − u^T x )
                       ⟺ v ∈ argmax_x ( u^T x − f(x) )
                       ⟺ f(v) + f^*(u) = u^T v
                       ⟺ u ∈ argmax_y ( y^T v − f^*(y) )
                       ⟺ 0 ∈ v − ∂f^*(u)
                       ⟺ (u, v) ∈ ∂f^*

this shows ∂f^* = (∂f)^{-1}
Zeros of a relation
- x is a zero of R if 0 ∈ R(x)
- the zero set of R is R^{-1}(0) = {x : (x, 0) ∈ R}

x is a zero of ∂f iff x solves: minimize f(x)
Lipschitz operators
a relation F has Lipschitz constant L if for all (x, u) ∈ F and (y, v) ∈ F,

    ‖u − v‖ ≤ L‖x − y‖

fact: if F is Lipschitz, then F is a function.
proof: if (x, u) ∈ F and (x, v) ∈ F,

    ‖u − v‖ ≤ L‖x − x‖ = 0

- the relation F is nonexpansive if L ≤ 1
- the relation F is contractive if L < 1
Fixed points
x is a fixed point of F if x = F(x)

examples:

- F(x) = x: every point is a fixed point
- F(x) = 0: only 0 is a fixed point
- a contractive operator on R^n has at most one fixed point
  proof: if x and y are fixed points, ‖x − y‖ = ‖F(x) − F(y)‖ < ‖x − y‖, a contradiction
- a nonexpansive operator F need not have a fixed point
  proof: translation
Fixed point iteration
to find a fixed point of F , try the fixed point iteration
    x^(k+1) = F(x^(k))
Q: when does this converge?
Fixed point iteration: contractive
Banach fixed point theorem: if F is a contraction, the iteration

    x^(k+1) = F(x^(k))

converges to the unique fixed point of F

properties: if L is the Lipschitz constant of F,

- distance to the fixed point decreases monotonically:

      ‖x^(k+1) − x⋆‖ = ‖F(x^(k)) − F(x⋆)‖ ≤ L‖x^(k) − x⋆‖

  (the iteration is Fejér-monotone)
- linear convergence with rate L
Proof
if F is a contraction, x (k+1) = F (x (k)) converges to unique fixed point
proof: if F has Lipschitz constant L < 1,

- the sequence x^(k) is Cauchy:

      ‖x^(k+ℓ) − x^(k)‖ ≤ ‖x^(k+ℓ) − x^(k+ℓ−1)‖ + · · · + ‖x^(k+1) − x^(k)‖
                        ≤ (L^(ℓ−1) + · · · + 1)‖x^(k+1) − x^(k)‖
                        ≤ (1 / (1 − L)) ‖x^(k+1) − x^(k)‖
                        ≤ (L^k / (1 − L)) ‖x^(1) − x^(0)‖

- so it converges to a point x⋆, which must be the (unique) fixed point
- it converges to x⋆ linearly with rate L:

      ‖x^(k) − x⋆‖ = ‖F(x^(k−1)) − F(x⋆)‖ ≤ L‖x^(k−1) − x⋆‖ ≤ L^k ‖x^(0) − x⋆‖
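A quick numerical sketch of the theorem (the particular contraction F(x) = 0.5 cos x and the starting point are illustrative choices, not from the slides):

```python
import numpy as np

# F(x) = 0.5*cos(x) is a contraction on R with Lipschitz constant L = 0.5
# (since |F'(x)| = 0.5|sin(x)| <= 0.5), so the fixed point iteration
# x^(k+1) = F(x^(k)) converges linearly to its unique fixed point.
F = lambda x: 0.5 * np.cos(x)

x = 10.0      # arbitrary starting point
gaps = []     # successive step lengths |x^(k+1) - x^(k)|
for _ in range(40):
    x_next = F(x)
    gaps.append(abs(x_next - x))
    x = x_next

# each step is at most half the previous one: linear convergence with rate 0.5
assert all(g1 <= 0.5 * g0 + 1e-12 for g0, g1 in zip(gaps, gaps[1:]))
print(x, abs(x - F(x)))  # x near the fixed point, residual tiny
```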
Fixed point iteration: nonexpansive
if F is nonexpansive, the iteration

    x^(k+1) = F(x^(k))

need not converge to a fixed point even if one exists.

proof:

- let F rotate its argument by θ degrees around the origin
- then F is nonexpansive and has a fixed point at x⋆ = 0
- but if ‖x^(0)‖ = r, then ‖F(x^(k))‖ = r for all k
Averaged operators
an operator F is averaged if

    F = θG + (1 − θ)I

for some θ ∈ (0, 1) and nonexpansive G

fact: if F is averaged, then x is a fixed point of F ⟺ x is a fixed point of G
proof:

    x = Fx = θGx + (1 − θ)Ix = θGx + (1 − θ)x
    θx = θGx
    x = Gx

⟹ if G is nonexpansive, F = (1/2)I + (1/2)G is averaged with the same fixed points
Fixed point iteration: averaged
if F = θG + (1 − θ)I is averaged (θ ∈ (0, 1), G nonexpansive), the iteration

    x^(k+1) = F(x^(k))

converges to a fixed point if one exists.

(also called the damped, averaged, or Mann-Krasnosel'skii iteration)

properties:

- distance to the fixed point decreases monotonically (Fejér-monotone)
- sublinear convergence of the fixed point residual:

      ‖Gx^(k) − x^(k)‖^2 ≤ (1 / ((k + 1)θ(1 − θ))) ‖x^(0) − x⋆‖^2
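The rotation counterexample and the averaged fix can both be checked numerically; a sketch (the 90-degree rotation and the starting point are illustrative choices):

```python
import numpy as np

# G rotates by 90 degrees: nonexpansive, fixed point at 0, but the plain
# iteration x <- G x just circles at constant radius. The averaged iteration
# x <- 0.5*x + 0.5*G x (theta = 1/2) converges toward the fixed point.
G = np.array([[0.0, -1.0],
              [1.0,  0.0]])

x_plain = np.array([1.0, 0.0])
x_avg = np.array([1.0, 0.0])
for _ in range(100):
    x_plain = G @ x_plain
    x_avg = 0.5 * x_avg + 0.5 * (G @ x_avg)

print(np.linalg.norm(x_plain))  # still 1.0: no progress toward 0
print(np.linalg.norm(x_avg))    # tiny: converged toward the fixed point 0
```

The averaged map 0.5*I + 0.5*G has eigenvalues of modulus sqrt(1/2), which is why the norm collapses geometrically.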
Proof
proof follows [Ryu and Boyd, 2015]

use ‖(1 − θ)a + θb‖^2 = (1 − θ)‖a‖^2 + θ‖b‖^2 − θ(1 − θ)‖a − b‖^2
(proof by expanding), and set

    x^(k+1) − x⋆ = (1 − θ)(x^(k) − x⋆) + θ(Gx^(k) − x⋆) = (1 − θ)a + θb

so a − b = (x^(k) − x⋆) − (Gx^(k) − x⋆) = x^(k) − Gx^(k). then

    ‖x^(k+1) − x⋆‖^2
      = (1 − θ)‖x^(k) − x⋆‖^2 + θ‖Gx^(k) − x⋆‖^2 − θ(1 − θ)‖x^(k) − Gx^(k)‖^2
      ≤ (1 − θ)‖x^(k) − x⋆‖^2 + θ‖x^(k) − x⋆‖^2 − θ(1 − θ)‖x^(k) − Gx^(k)‖^2
      = ‖x^(k) − x⋆‖^2 − θ(1 − θ)‖x^(k) − Gx^(k)‖^2

(using the fact that G is nonexpansive for the inequality)

this shows the iteration is Fejér monotone
Proof (II)
sum the last inequality over iterations k. it telescopes:

    ‖x^(k+1) − x⋆‖^2 ≤ ‖x^(0) − x⋆‖^2 − θ(1 − θ) ∑_{i=0}^k ‖x^(i) − Gx^(i)‖^2

    ∑_{i=0}^k ‖x^(i) − Gx^(i)‖^2 ≤ (1 / (θ(1 − θ))) ‖x^(0) − x⋆‖^2

    min_{i=0,...,k} ‖x^(i) − Gx^(i)‖^2 ≤ (1 / ((k + 1)θ(1 − θ))) ‖x^(0) − x⋆‖^2

    ‖x^(k) − Gx^(k)‖^2 ≤ (1 / ((k + 1)θ(1 − θ))) ‖x^(0) − x⋆‖^2

where the last inequality uses the fact that G is nonexpansive.
How to design an algorithm
- look for an operator F
  - whose fixed points solve the optimization problem
  - which is contractive or averaged
- iterate x^(k+1) = F(x^(k))
- get convergence
  - if F is contractive, get linear convergence
  - if F is averaged, get sublinear convergence
Properties of operators
we’ll need these properties of operators:
- Lipschitz
  - contractive
  - nonexpansive
  - averaged
- monotone
  - maximal
  - strongly monotone (= coercive)
- cocoercive
Properties of optimization operators
we'll show the connection to optimization:

- the gradient of a convex function is monotone
- the gradient of a strongly convex function is strongly monotone
- the gradient of a smooth function is cocoercive
- the prox of a convex function is (1/2)-averaged
- the gradient step I − (2/β)∇f of a β-smooth function is nonexpansive
  (and so smaller stepsizes are averaged)
- the prox of a strongly convex function is contractive
- the gradient step of a smooth, strongly convex (SSC) function is contractive
Monotone operators
- a relation F is monotone if, for all (x, u) ∈ F and (y, v) ∈ F,

      (u − v)^T (x − y) ≥ 0

  example: ∂f is monotone
- a relation F is α-strongly monotone if, for all (x, u) ∈ F and (y, v) ∈ F,

      (u − v)^T (x − y) ≥ α‖x − y‖^2

  (also called α-coercive)
  example: ∂f is strongly monotone iff f is strongly convex
- a relation F is maximal monotone if there is no monotone relation that properly contains it (as a subset of R^n × R^n)
  examples:
  - if F : R^n → R^n is continuous and monotone, then F is maximal monotone
  - if f is CCP, then ∂f is maximal monotone

  useful: maximality makes sure the fixed point iteration doesn't leave the domain of F
Cocoercive operators
an operator F is (1/β)-cocoercive if

    (F(x) − F(y))^T (x − y) ≥ (1/β)‖F(x) − F(y)‖^2

example: we've already seen that ∇f is (1/β)-cocoercive if f is β-smooth

- F is (1/β)-cocoercive ⟹ F is β-Lipschitz
  (proof by Cauchy-Schwarz)
- F is α-coercive ⟺ F^{-1} is α-cocoercive
  (proof by interchanging x and F(x))
Gradient step is nonexpansive (or averaged)
F is (1/β)-cocoercive ⟺ I − (2/β)F is nonexpansive

(and smaller step sizes are averaged)

proof:

    ‖(I − (2/β)F)y − (I − (2/β)F)x‖^2
      = ‖y − x − (2/β)(Fy − Fx)‖^2
      = ‖y − x‖^2 − (4/β)(Fy − Fx)^T (y − x) + (4/β^2)‖Fy − Fx‖^2
      = ‖y − x‖^2 − (4/β)( (Fy − Fx)^T (y − x) − (1/β)‖Fy − Fx‖^2 )
      ≤ ‖y − x‖^2

the inequality is the definition of (1/β)-cocoercivity
∂f as an operator
if f is convex, ∂f is maximal monotone

if f is β-smooth,

- ∂f = ∇f is (1/β)-cocoercive
- ∂f = ∇f is β-Lipschitz

if f is α-strongly convex,

- ∂f is α-strongly monotone
Gradient method converges
if f is convex and β-smooth,

- I − (2/β)∇f is nonexpansive
- the gradient mapping I − t∇f is averaged for t ∈ (0, 2/β)
- so the fixed point iteration

      x^(k+1) = (I − (2/β)∇f) x^(k)

  converges
- from the convergence guarantee for the damped iteration,

      (1 / ((k + 1)θ(1 − θ))) ‖x^(0) − x⋆‖^2 ≥ min_{i=0,...,k} ‖(I − (2/β)∇f) x^(i) − x^(i)‖^2
                                             ≥ min_{i=0,...,k} (2/β)^2 ‖∇f(x^(i))‖^2
Strong convexity and smoothness
f is α-strongly convex iff f^* is (1/α)-smooth

proof: (∂f)^{-1} = ∂f^*.
∂f α-coercive ⟺ ∂f^* α-cocoercive ⟺ f^* (1/α)-smooth.

moral: strong convexity and (strong) smoothness are dual
Gradient update is contractive for SSC functions
suppose f is α-strongly convex and β-smooth

the relation

    I − t∇f = {(x, x − t∇f(x)) : x ∈ dom f}

is contractive if t ∈ (0, 2α/β^2), with the best contraction factor at t = α/β^2

proof:

    ‖x − t∇f(x) − (y − t∇f(y))‖^2
      = ‖x − y‖^2 + t^2 ‖∇f(x) − ∇f(y)‖^2 − 2t (∇f(x) − ∇f(y))^T (x − y)
      ≤ ‖x − y‖^2 + t^2 β^2 ‖x − y‖^2 − 2tα ‖x − y‖^2
      = (1 − 2tα + t^2 β^2) ‖x − y‖^2

note: a stronger proof (using the coercive + cocoercive inequality from slide 28 of the GD lecture) shows I − t∇f is Lipschitz with parameter L = max{|1 − tα|, |1 − tβ|}. if t = 2/(α + β), then L = (κ − 1)/(κ + 1)
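The stronger contraction factor is easy to observe on a quadratic; a sketch (the diagonal quadratic and the test points are illustrative assumptions):

```python
import numpy as np

# f(x) = 0.5 * x^T diag(alpha, beta) x is alpha-strongly convex and
# beta-smooth, and grad f(x) = H x, so the gradient step I - t*grad f
# scales the coordinates by (1 - t*alpha) and (1 - t*beta).
alpha, beta = 1.0, 10.0
t = 2.0 / (alpha + beta)
H = np.diag([alpha, beta])

def step(x):
    return x - t * (H @ x)   # gradient step on f

# predicted Lipschitz constant: max(|1 - t*alpha|, |1 - t*beta|)
L = max(abs(1 - t * alpha), abs(1 - t * beta))   # = (kappa-1)/(kappa+1) here

x = np.array([1.0, 1.0])
y = np.array([-2.0, 0.5])
print(L, np.linalg.norm(step(x) - step(y)) / np.linalg.norm(x - y))
```

With kappa = beta/alpha = 10 the factor is 9/11, strictly less than 1, so the iteration is a contraction.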
Resolvent operator
for a relation F, define the resolvent of F:

    R_F = (I + F)^{-1}

unpacking the definition:

- (I + F) = {(x, x + y) : (x, y) ∈ F}
- R_F = (I + F)^{-1} = {(x + y, x) : (x, y) ∈ F}
- R_F = {(u, v) : (u − v) ∈ F(v)}
Prox is the resolvent of ∂f
- prox_f = R_∂f = (I + ∂f)^{-1}
  proof: let z ∈ prox_f(x). then

      z = argmin_z ( f(z) + (1/2)‖z − x‖^2 )
      0 ∈ ∂f(z) + z − x
      (x − z) ∈ ∂f(z)

- prox_f = ∇h^* where h(x) = f(x) + (1/2)‖x‖^2
  proof: h is CCP and ∂h = ∂f + I, so

      ∇h^* = (∂h)^{-1} = (I + ∂f)^{-1}

- prox_f is a function
  proof: h is strongly convex, so h^* is smooth
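To make the resolvent identity concrete, here is a small check (f(z) = |z| and the test points are illustrative choices) that z = prox_f(x) satisfies the inclusion x − z ∈ ∂f(z):

```python
import numpy as np

def prox_abs(x):
    # prox of f(z) = |z|: scalar soft-thresholding
    return np.sign(x) * max(abs(x) - 1.0, 0.0)

for x in [3.0, 0.5, -0.2, -4.0]:
    z = prox_abs(x)
    u = x - z                # the subgradient the resolvent selects
    if z != 0:
        # the subdifferential of |.| at z != 0 is the singleton {sign(z)}
        assert np.isclose(u, np.sign(z))
    else:
        # the subdifferential of |.| at 0 is the interval [-1, 1]
        assert -1.0 <= u <= 1.0
print("resolvent identity holds at all test points")
```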
Projection is resolvent of indicator function
let f = 1_C, the indicator function of the convex set C

- ∂f is the normal cone operator:

      ∂f(x) = N_C(x) = ∅ if x ∉ C;   {w : w^T (z − x) ≤ 0 for all z ∈ C} if x ∈ C

- R_∂f = prox_f is

      (I + ∂1_C)^{-1}(x) = argmin_u ( 1_C(u) + (1/2)‖u − x‖^2 ) = Π_C(x)
Resolvent and Cayley operators
for a relation F, recall the resolvent of F:

    R_F = (I + F)^{-1}

and define the Cayley operator (sometimes called the reflection operator):

    C_F = 2R_F − I = 2(I + F)^{-1} − I

properties:

- if F is monotone, then R_F and C_F are nonexpansive (and hence are functions)
- if F is maximal monotone, then R_F and C_F have full domain
- zeros of F are fixed points of R_F and C_F

we'll prove the first and last properties; the second is Minty's theorem
Fixed points of R and C are zeros of F
fact: fixed points of R and C are zeros of F

proof:

    0 ∈ F(x) ⟺ x ∈ (I + F)(x)
             ⟺ (I + F)^{-1}(x) ∋ x
             ⟺ x = R(x)

(using the fact that R is a function)

and if x = R(x),

    C(x) = 2R(x) − I(x) = 2x − x = x

in particular, this means fixed points of prox_f are zeros of ∂f
Resolvent is nonexpansive
fact: if F is monotone, R = R_F is nonexpansive

proof: if (x, u) ∈ R and (y, v) ∈ R,

    x ∈ u + F(u),   y ∈ v + F(v)

so for some f_u ∈ F(u), f_v ∈ F(v),

    x − y = u − v + f_u − f_v
    (u − v)^T (x − y) = ‖u − v‖^2 + (u − v)^T (f_u − f_v)
    (u − v)^T (x − y) ≥ ‖u − v‖^2

so R is 1-cocoercive. this implies R is nonexpansive, too:

    ‖u − v‖ ‖x − y‖ ≥ ‖u − v‖^2
    ‖x − y‖ ≥ ‖u − v‖

note: this also proves projections and proxs are nonexpansive
Cayley operator is nonexpansive
fact: C = C_F is nonexpansive

proof: with u = R(x) and v = R(y),

    ‖C(x) − C(y)‖^2 = ‖2(u − v) − (x − y)‖^2
                    = 4‖u − v‖^2 − 4(x − y)^T (u − v) + ‖x − y‖^2
                    ≤ ‖x − y‖^2

using 1-cocoercivity of R (from above): ‖u − v‖^2 ≤ (u − v)^T (x − y)

note:

- this also shows R = (1/2)C + (1/2)I is (1/2)-averaged
- more generally, θC + (1 − θ)I is θ-averaged for any θ ∈ (0, 1)
More contractions!
- if F is α-strongly monotone
- then (I + F) is (1 + α)-strongly monotone
- so R_F = (I + F)^{-1} is (1 + α)-cocoercive
- and hence R_F is 1/(1 + α)-Lipschitz

moral: we now have two ways to manufacture contractions:

- as a gradient update of a smooth, strongly convex function
- as a proximal update of a strongly convex function
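A tiny concrete instance of this chain (the quadratic f and the test points are assumed for illustration): for f(z) = (α/2)z^2, the subdifferential ∂f(z) = αz is α-strongly monotone, and the resolvent has the closed form z = x/(1 + α):

```python
# resolvent of F(z) = alpha*z: solve z + alpha*z = x, so z = x/(1 + alpha)
alpha = 3.0
R = lambda x: x / (1.0 + alpha)

# check the predicted Lipschitz constant 1/(1 + alpha) on a pair of points
x, y = 5.0, -2.0
ratio = abs(R(x) - R(y)) / abs(x - y)
print(ratio)  # 0.25, i.e. 1/(1 + alpha): a contraction
```

So the proximal update of this strongly convex function contracts by exactly 1/(1 + α) per step, matching the chain of implications above.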