Lecture 15
Newton Method and Self-Concordance
October 23, 2008
Outline
• Self-concordance Notion
• Self-concordant Functions
• Operations Preserving Self-concordance
• Properties of Self-concordant Functions
• Implications for Newton’s Method
• Equality Constrained Minimization
Classical Convergence Analysis of Newton’s Method
• Given a level set L0 = {x | f(x) ≤ f(x0)}, it requires for all x, y ∈ L0:
  ‖∇²f(x) − ∇²f(y)‖ ≤ L‖x − y‖,   mI ⪯ ∇²f(x) ⪯ MI
for some constants L > 0, m > 0, and M > 0
• Given a desired error level ε, we are interested in an ε-solution of the problem, i.e., a vector x such that f(x) ≤ f* + ε
• An upper bound on the number of iterations needed to generate an ε-solution is given by
  (f(x0) − f*)/γ + log₂ log₂(ε₀/ε)
where γ = σβη²m/M², η ∈ (0, m²/L), and ε₀ = 2m³/L²
• This follows from
  (L/(2m²))‖∇f(xk+K)‖ ≤ ((L/(2m²))‖∇f(xk)‖)^(2^K) ≤ (1/2)^(2^K)
where k is such that ‖∇f(xk)‖ ≤ η and η is small enough so that ηL/(2m²) ≤ 1/2. By the above relation and strong convexity of f, we have
  f(xk+K) − f* ≤ (1/(2m))‖∇f(xk+K)‖² ≤ (2m³/L²)(1/2)^(2^(K+1))
• The bound is conceptually informative, but not practical
• Furthermore, the constants L, m, and M change with affine transformations of the space, while Newton's method is affine invariant
• Can a bound be obtained in terms of problem data that is affine invariant
and, moreover, practically verifiable?
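To get a feel for how loose the classical bound is, here is a quick numeric sketch in Python; every constant below (m, M, L, σ, β, the initial gap) is an illustrative stand-in, since in practice m, M, and L are rarely known:

```python
import math

# All constants below are illustrative stand-ins; in practice m, M, L are unknown.
m, M, L = 1.0, 100.0, 10.0          # strong convexity / Hessian bound / Lipschitz
sigma, beta = 0.1, 0.5              # backtracking parameters
eta = 0.9 * m**2 / L                # any eta in (0, m^2/L)
gamma = sigma * beta * eta**2 * m / M**2
eps0 = 2 * m**3 / L**2
gap, eps = 10.0, 1e-10              # assumed f(x0) - f* and target accuracy
print(gap / gamma + math.log2(math.log2(eps0 / eps)))   # ~2.5e8 iterations(!)
```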
Self-concordance
• Nesterov and Nemirovski introduced a notion of self-concordance and a
class of self-concordant functions
• Importance of self-concordance:
  • Possesses the affine invariance property
  • Provides a new tool for analyzing Newton's method that exploits the affine invariance of the method
  • Results in a practical upper bound on the number of Newton iterations
  • Plays a crucial role in the performance analysis of interior-point methods
Def. A function f : R → R is self-concordant when f is convex and
  |f′′′(x)| ≤ 2f′′(x)^{3/2} for all x ∈ dom f
The rate of change in curvature of f is bounded by the curvature
Note: One can use a constant κ other than 2 in the definition
Examples
• Linear and quadratic functions are self-concordant [f′′′(x) = 0 for all x]
• Negative logarithm f(x) = −ln x, x > 0, is self-concordant:
  f′′(x) = 1/x²,   f′′′(x) = −2/x³,   |f′′′(x)|/f′′(x)^{3/2} = 2 for all x > 0
• Exponential function eˣ is not self-concordant:
  f′′(x) = f′′′(x) = eˣ,   |f′′′(x)|/f′′(x)^{3/2} = eˣ/e^{3x/2} = e^{−x/2},
  so |f′′′(x)|/f′′(x)^{3/2} → ∞ as x → −∞
• Even-power monomial x^{2p}, p > 2, is not self-concordant:
  f′′(x) = 2p(2p−1)x^{2p−2},   f′′′(x) = 2p(2p−1)(2p−2)x^{2p−3},
  |f′′′(x)|/f′′(x)^{3/2} ∝ |x|^{2p−3}/x^{3(p−1)} = |x|^{−p},
  so |f′′′(x)|/f′′(x)^{3/2} → ∞ as x ↓ 0
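As a sanity check, the ratio |f′′′(x)|/f′′(x)^{3/2} can be evaluated numerically for the examples above; this Python sketch (the helper name is ours, not from the slides) confirms the ratio is exactly 2 for −ln x and unbounded for eˣ:

```python
import numpy as np

def ratio(f2, f3, x):
    """Self-concordance ratio |f'''(x)| / f''(x)**1.5 (helper name is ours)."""
    return abs(f3(x)) / f2(x) ** 1.5

# Negative logarithm: f''(x) = 1/x^2, f'''(x) = -2/x^3 -> ratio is exactly 2
xs = np.linspace(0.1, 5.0, 50)
print(max(ratio(lambda x: 1 / x**2, lambda x: -2 / x**3, x) for x in xs))

# Exponential: f'' = f''' = e^x -> ratio e^{-x/2} is unbounded as x -> -inf
for x in [0.0, -5.0, -20.0]:
    print(x, ratio(np.exp, np.exp, x))
```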
Scaling and Affine Invariance
• In the definition of a self-concordant function, one can use any other constant κ
• Let f be self-concordant with constant κ. Then f̃(x) = (κ²/4)f(x) is self-concordant with constant 2:
  |f̃′′′(x)|/f̃′′(x)^{3/2} = (8κ²/(4κ³)) · |f′′′(x)|/f′′(x)^{3/2} = (2/κ) · |f′′′(x)|/f′′(x)^{3/2} ≤ 2
• Self-concordance is affine invariant: for f self-concordant and x = ay + b, the function g(y) = f(ay + b) is self-concordant with the same κ:
  • Convexity is preserved: g is convex
  • g′′′(y) = a³f′′′(x), g′′(y) = a²f′′(x), so
    |g′′′(y)|/g′′(y)^{3/2} = (|a|³/|a|³) · |f′′′(x)|/f′′(x)^{3/2} ≤ κ
Self-concordant Function in Rn
Def. A function f : Rⁿ → R is self-concordant when it is self-concordant along every line, i.e.,
• f is convex
• g(t) = f(x + tv) is self-concordant for all x ∈ dom f and all v
Note: The constant κ of self-concordance is independent of the choice of x and v, i.e.,
  |g′′′(t)| ≤ 2g′′(t)^{3/2}
for all x ∈ dom f, v ∈ Rⁿ, and t ∈ R such that x + tv ∈ dom f
Operations Preserving Self-concordance
Note: To start with, these operations have to preserve convexity
• Scaling with a factor of at least 1: when f is self-concordant and a ≥ 1, then af is also self-concordant
• The sum f₁ + f₂ of two self-concordant functions is self-concordant (extends to any finite sum)
• Composition with an affine mapping: when f(y) is self-concordant, then f(Ax + b) is self-concordant
• Composition with the ln-function: Let g : R → R be a convex function with dom g = R₊₊ and
  |g′′′(x)| ≤ 3g′′(x)/x for all x > 0
Then f(x) = −ln(−g(x)) − ln(x) over {x > 0 | g(x) < 0} is self-concordant
HW7
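A small numeric illustration of the sum and affine-composition rules, assuming as an example (ours, not from the slides) the interval barrier f(x) = −ln(x) − ln(1 − x), where the second term is −ln composed with x ↦ 1 − x:

```python
import numpy as np

xs = np.linspace(1e-3, 1 - 1e-3, 1000)
f2 = 1 / xs**2 + 1 / (1 - xs) ** 2        # f''(x)
f3 = -2 / xs**3 + 2 / (1 - xs) ** 3       # f'''(x)
print(np.max(np.abs(f3) / f2**1.5))       # <= 2, consistent with the sum rule
```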
Implications
• When f′′(x) > 0 for all x (then f is strictly convex), the self-concordance condition can be written as
  |d/dx (f′′(x)^{−1/2})| ≤ 1 for all x ∈ dom f.
(Indeed, d/dx (f′′(x)^{−1/2}) = −f′′′(x)/(2f′′(x)^{3/2}), so this is exactly |f′′′(x)| ≤ 2f′′(x)^{3/2}.)
• Assuming that for some t > 0 the interval [0, t] lies in dom f, we can integrate from 0 to t and obtain
  −t ≤ ∫₀ᵗ d/dx (f′′(x)^{−1/2}) dx ≤ t.
• This implies the following lower and upper bounds on f′′(t):
  f′′(0)/(1 + t√f′′(0))² ≤ f′′(t) ≤ f′′(0)/(1 − t√f′′(0))²,   (1)
where the lower bound is valid for all t ≥ 0 with t ∈ dom f, and the upper bound is valid for all t ∈ dom f with 0 ≤ t < f′′(0)^{−1/2}.
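A quick numeric check of Eq. (1), assuming the barrier f(x) = −ln x − ln(1 − x) along the direction v = 1 starting from x0 = 0.3 (our choice of example):

```python
import numpy as np

def f2(x):                                 # f''(x) for f(x) = -ln(x) - ln(1-x)
    return 1 / x**2 + 1 / (1 - x) ** 2

x0 = 0.3
c = np.sqrt(f2(x0))                        # sqrt(f''(0)) in the notation of Eq. (1)
ts = np.linspace(0.0, 0.99 / c, 200)       # upper bound needs t < f''(0)^{-1/2}
vals = f2(x0 + ts)                         # direction v = 1 stays inside (0, 1)
lower = f2(x0) / (1 + ts * c) ** 2
upper = f2(x0) / (1 - ts * c) ** 2
print(np.all(lower <= vals), np.all(vals <= upper))   # True True
```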
Bounds on f
Consider f : Rⁿ → R with ∇²f(x) ≻ 0. Let v be a descent direction, i.e., such that ∇f(x)ᵀv < 0, and consider g(t) = f(x + tv) − f(x) for t > 0. Integrating the lower bound of Eq. (1) twice, we obtain
  g(t) ≥ g(0) + tg′(0) + t√g′′(0) − ln(1 + t√g′′(0))
Consider now v = Newton's direction, i.e., v = −[∇²f(x)]⁻¹∇f(x), and let h(t) = f(x + tv). We have
  h′(0) = ∇f(x)ᵀv = −λ²(x),   h′′(0) = vᵀ∇²f(x)v = λ²(x).
Integrating the upper bound of Eq. (1) twice, we obtain for 0 ≤ t < 1/λ(x)
  h(t) ≤ h(0) − tλ²(x) − tλ(x) − ln(1 − tλ(x))   (2)
Newton Direction for Self-concordant Functions
Let f : Rⁿ → R be self-concordant with ∇²f(x) ≻ 0 for all x ∈ L0
• Using self-concordance, it can be seen that the Newton decrement λ(x) satisfies
  λ(x) = sup_{v ≠ 0} (−vᵀ∇f(x))/(vᵀ∇²f(x)v)^{1/2}
with the supremum attained at v = −[∇²f(x)]⁻¹∇f(x)
HW7
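The two expressions for λ(x) can be compared directly; this sketch uses a random positive definite H and gradient g as stand-ins for ∇²f(x) and ∇f(x):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
M = rng.standard_normal((n, n))
H = M @ M.T + n * np.eye(n)               # stand-in positive definite Hessian
g = rng.standard_normal(n)                # stand-in gradient

lam = np.sqrt(g @ np.linalg.solve(H, g))  # closed form (g^T H^{-1} g)^{1/2}
v = -np.linalg.solve(H, g)                # claimed maximizer
print(lam, -v @ g / np.sqrt(v @ H @ v))   # equal: the supremum is attained at v
V = rng.standard_normal((10000, n))       # random directions stay below lam
print(np.max(-V @ g / np.sqrt(np.einsum('ij,jk,ik->i', V, H, V))) <= lam + 1e-12)
```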
Self-Concordant Functions: Newton Method Analysis
• Consider Newton’s method, started at x0, with backtracking line search.
• Assume that f : Rⁿ → R is self-concordant and strictly convex.
  (If dom f ≠ Rⁿ, assume the level set {x | f(x) ≤ f(x0)} is closed.)
• Also, assume that f is bounded from below.
Under the preceding assumptions, an optimizer x* of f over Rⁿ exists. HW7
• Analysis is similar to the classical one, except that
  • Self-concordance replaces the strong convexity and Lipschitz Hessian assumptions
  • The Newton decrement replaces the role of the gradient norm
Convergence Result
Theorem 1 Assume that f is a self-concordant, strictly convex function that is bounded below over Rⁿ. Then, there exist η ∈ (0, 1/4) and γ > 0 such that
• Damped phase: If λk > η, we have f(xk+1) − f(xk) ≤ −γ.
• Pure Newton phase: If λk ≤ η, the backtracking line search selects α = 1 and
  2λk+1 ≤ (2λk)²
where λk = λ(xk).
Proof
Let h(α) = f(xk + αdk). As seen in Eq. (2), we have for 0 ≤ α < 1/λk
  h(α) ≤ h(0) − αλk² − αλk − ln(1 − αλk)   (3)
We use this relation to show that the stepsize ᾱ = 1/(1 + λk) satisfies the exit condition of the line search. In particular, letting α = ᾱ, we have
  h(ᾱ) ≤ h(0) − ᾱλk² − ᾱλk − ln(1 − ᾱλk) = h(0) − λk + ln(1 + λk)
Using the inequality −t + ln(1 + t) ≤ −t²/(2(1 + t)) for t ≥ 0, and σ ≤ 1/2, we obtain
  h(ᾱ) ≤ h(0) − λk²/(2(1 + λk)) ≤ h(0) − σλk²/(1 + λk) = h(0) − σᾱλk².
Therefore, at the exit of the line search, we have αk ≥ β/(1 + λk), implying
  h(αk) ≤ h(0) − σβλk²/(1 + λk).
When λk > η, since h(αk) = f(xk+1) and h(0) = f(xk), we have
  f(xk+1) ≤ f(xk) − γ,   γ = σβη²/(1 + η).
Assume now that λk ≤ (1 − 2σ)/2. From Eq. (3) and the inequality
  −x − ln(1 − x) ≤ x²/2 + x³ for 0 ≤ x ≤ 0.81,
we have
  h(1) ≤ h(0) − λk²/2 + λk³ ≤ h(0) − σλk²,
where the last inequality follows from λk ≤ (1 − 2σ)/2.
Thus, the unit step satisfies the sufficient decrease condition (no damping).
The proof that 2λk+1 ≤ (2λk)² is left for HW7.
The preceding material is from Section 9.6 of the book by Boyd and Vandenberghe.
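A sketch of damped Newton with backtracking, illustrating the two phases of Theorem 1 on an analytic-centering-type objective; the problem data and the parameters σ, β are our choices, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 20, 5
A = rng.standard_normal((m, n))          # random problem data (illustrative)

def f(x):                                # self-concordant: sum of -ln of affine
    s = 1 - A @ x
    if np.any(s <= 0) or np.any(np.abs(x) >= 1):
        return np.inf                    # outside dom f
    return -np.log(s).sum() - np.log(1 - x**2).sum()

def grad_hess(x):
    s = 1 - A @ x
    g = A.T @ (1 / s) + 2 * x / (1 - x**2)
    H = (A / s[:, None]).T @ (A / s[:, None]) \
        + np.diag((2 + 2 * x**2) / (1 - x**2) ** 2)
    return g, H

sigma, beta = 0.1, 0.5                   # backtracking parameters
x = np.zeros(n)
for k in range(50):
    g, H = grad_hess(x)
    d = -np.linalg.solve(H, g)           # Newton direction
    lam2 = -g @ d                        # lambda(x)^2, Newton decrement squared
    if lam2 / 2 <= 1e-10:
        break
    fx, alpha = f(x), 1.0
    while f(x + alpha * d) > fx - sigma * alpha * lam2:
        alpha *= beta                    # backtracking line search
    x = x + alpha * d
    print(k, np.sqrt(lam2), alpha)       # alpha hits 1 once lambda_k is small,
                                         # then lambda_k decreases quadratically
```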
Newton’s Method for Self-concordant Functions
Let f : Rⁿ → R be self-concordant with ∇²f(x) ≻ 0 for all x ∈ L0
• The analysis of Newton's method with backtracking line search gives:
  • For an ε-solution, i.e., x with f(x) ≤ f* + ε,
  • an upper bound on the number of iterations:
    (f(x0) − f*)/Γ + log₂ log₂(1/ε),   Γ = σβη²/(1 + η),   η = (1 − 2σ)/4
• Explicitly:
  (f(x0) − f*)/Γ + log₂ log₂(1/ε) = (20 − 8σ)/(σβ(1 − 2σ)²) · [f(x0) − f*] + log₂ log₂(1/ε)
• The bound is practical (when f* can be underestimated)
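For instance, with the common backtracking parameters σ = 0.1 and β = 0.5, the leading constant evaluates to 600 iterations per unit of initial gap (the gap of 10 below is a hypothetical value):

```python
import math

sigma, beta, eps = 0.1, 0.5, 1e-10
C = (20 - 8 * sigma) / (sigma * beta * (1 - 2 * sigma) ** 2)
print(C)                                         # 600.0
gap = 10.0                                       # assumed initial gap f(x0) - f*
print(C * gap + math.log2(math.log2(1 / eps)))   # ~6005 iterations, worst case
```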
Equality Constrained Minimization
  minimize   f(x)
  subject to Ax = b
Assumptions:
• The function f is convex and twice continuously differentiable
• The matrix A ∈ R^{p×n} has rank A = p
• The optimal value f* is finite and attained
• If Ax = b for some x ∈ relint(dom f), there is no duality gap and there exists an optimal dual solution
• KKT optimality conditions: x* is optimal if and only if there exists λ* such that
  Ax* = b,   ∇f(x*) + Aᵀλ* = 0
• Solving the problem is equivalent to solving the KKT conditions: n + p equations in n + p variables, x* ∈ Rⁿ and λ* ∈ Rᵖ
Equality Constrained Quadratic Minimization
  minimize   (1/2)xᵀPx + qᵀx + r
  subject to Ax = b
with P ∈ Sⁿ₊
Important in extending Newton's method to equality constraints
KKT optimality conditions:
  [ P  Aᵀ ] [ x* ]   [ −q ]
  [ A  0  ] [ λ* ] = [  b ]
• The coefficient matrix is referred to as the KKT matrix
• The KKT matrix is nonsingular if and only if
  Ax = 0, x ≠ 0  ⟹  xᵀPx > 0   [P is positive definite on the null space of A]
• Equivalent conditions for nonsingularity of the KKT matrix:
  • P + AᵀA ≻ 0
  • N(A) ∩ N(P) = {0}
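A minimal numpy sketch of solving the KKT system directly; P, q, A, b are random stand-ins, not data from the slides:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 6, 2
M = rng.standard_normal((n, n))
P = M @ M.T                          # P positive definite (hence in S^n_+)
q = rng.standard_normal(n)
A = rng.standard_normal((p, n))      # full row rank with probability 1
b = rng.standard_normal(p)

KKT = np.block([[P, A.T], [A, np.zeros((p, p))]])
sol = np.linalg.solve(KKT, np.concatenate([-q, b]))
x_star, lam_star = sol[:n], sol[n:]
print(np.allclose(A @ x_star, b))                        # primal feasibility
print(np.allclose(P @ x_star + q + A.T @ lam_star, 0))   # stationarity
```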
Newton’s Method with Equality Constraints
Almost the same as Newton's method, except for two differences:
• The initial iterate x0 has to be feasible: Ax0 = b
• The Newton directions dk have to be feasible: Adk = 0
Given iterate xk, Newton's direction dk is determined as a solution to
  minimize   f(xk) + ∇f(xk)ᵀd + (1/2)dᵀ∇²f(xk)d
  subject to A(xk + d) = b
• The KKT conditions (wk is the optimal dual for the quadratic minimization):
  [ ∇²f(xk)  Aᵀ ] [ dk ]   [ −∇f(xk) ]
  [    A      0 ] [ wk ] = [     0    ]
Newton’s Method with Equality Constraints
Given a starting point x ∈ dom f with Ax = b and tolerance ε > 0.
Repeat:
1. Compute Newton's direction dN(x) by solving the corresponding KKT system.
2. Compute the decrement λ(x).
3. Stopping criterion: quit if λ(x)²/2 ≤ ε.
4. Line search: choose stepsize α by backtracking line search.
5. Update: x := x + αdN(x).
• When f is strictly convex and self-concordant, the bound on the
number of iterations required to achieve an ε-accuracy is the same
as in the “unconstrained case”
Newton's Method with Infeasible Points
Useful when determining an initial feasible iterate x0 is difficult
Linearize the optimality conditions Ax* = b and ∇f(x*) + Aᵀλ* = 0 at some x + d ≈ x*, where x is possibly infeasible, and use w ≈ λ*
• We have
  ∇f(x*) ≈ ∇f(x + d) ≈ ∇f(x) + ∇²f(x)d
• Approximate KKT conditions:
  A(x + d) = b,   ∇f(x) + ∇²f(x)d + Aᵀw = 0
• At an infeasible xk, the system of linear equations determining dk and wk:
  [ ∇²f(xk)  Aᵀ ] [ dk ]     [ ∇f(xk)  ]
  [    A      0 ] [ wk ] = − [ Axk − b ]
Note: When xk is feasible, the system of equations is the same as in the
feasible point Newton’s method
Equality Constrained Analytic Centering
  minimize   f(x) = −∑ᵢ₌₁ⁿ ln xᵢ
  subject to Ax = b
Feasible point Newton's method: g = ∇f(x), H = ∇²f(x)
  [ H  Aᵀ ] [ d ]   [ −g ]
  [ A  0  ] [ w ] = [  0 ]
with
  g = (−1/x₁, ..., −1/xₙ)ᵀ,   H = diag[1/x₁², ..., 1/xₙ²]
• The Hessian is positive definite
• KKT matrix, first row: Hd + Aᵀw = −g ⟹ d = −H⁻¹(g + Aᵀw)
• KKT matrix, second row: Ad = 0, which combined with the first row gives AH⁻¹(g + Aᵀw) = 0
• The matrix A has full row rank, thus AH⁻¹Aᵀ is invertible, hence
  w = −(AH⁻¹Aᵀ)⁻¹AH⁻¹g,   H⁻¹ = diag[x₁², ..., xₙ²]
• The matrix −AH⁻¹Aᵀ is known as the Schur complement of H in the KKT matrix (for any nonsingular H)
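The elimination above gives a closed-form Newton step; a minimal numpy sketch (the helper name is ours, with a random A and a feasible x > 0 as stand-ins, taking b := Ax):

```python
import numpy as np

def centering_step(A, x):
    """Newton direction d and dual w at a feasible x > 0 (helper name is ours)."""
    g = -1 / x                               # gradient of -sum_i ln(x_i)
    Hinv = x**2                              # H^{-1} = diag(x_i^2)
    S = (A * Hinv) @ A.T                     # A H^{-1} A^T
    w = -np.linalg.solve(S, (A * Hinv) @ g)  # w = -(A H^{-1} A^T)^{-1} A H^{-1} g
    d = -Hinv * (g + A.T @ w)                # d = -H^{-1}(g + A^T w)
    return d, w

rng = np.random.default_rng(3)
p, n = 2, 6
A = rng.standard_normal((p, n))
x = rng.uniform(0.5, 1.5, n)                 # feasible for b := A @ x
d, w = centering_step(A, x)
print(np.allclose(A @ d, 0))                 # the direction preserves feasibility
```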
Network Flow Optimization
  minimize   ∑ₗ₌₁ⁿ φₗ(xₗ)
  subject to Ax = b
• Directed graph with n arcs and p + 1 nodes
• Variable xₗ: flow through arc l
• Cost φₗ: flow cost function for arc l, with φₗ′′(t) > 0
• Node-incidence matrix Ã ∈ R^{(p+1)×n} defined as
  Ãᵢₗ = 1 if arc l originates at node i,  −1 if arc l ends at node i,  0 otherwise
• Reduced node-incidence matrix A ∈ R^{p×n} is Ã with the last row removed
• b ∈ Rᵖ is the (reduced) source vector
• Rank A = p when the graph is connected
KKT System for Infeasible Newton's Method
  [ H  Aᵀ ] [ d ]     [ g ]
  [ A  0  ] [ w ] = − [ h ]
where h = Ax − b is a measure of infeasibility at the current point x
• g = [φ₁′(x₁), ..., φₙ′(xₙ)]ᵀ
• H = diag[φ₁′′(x₁), ..., φₙ′′(xₙ)] with positive diagonal entries
• Solve via elimination:
  w = (AH⁻¹Aᵀ)⁻¹[h − AH⁻¹g],   d = −H⁻¹(g + Aᵀw)
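A sketch of the eliminated infeasible-start step, assuming the cost φₗ(t) = t ln t − t (so φ′(t) = ln t and φ′′(t) = 1/t > 0); A, b here are random stand-ins for a reduced incidence matrix and source vector:

```python
import numpy as np

def infeasible_step(A, b, x):
    """One infeasible-start Newton step via the elimination formulas above."""
    g = np.log(x)                        # phi'(t) = ln t for phi(t) = t ln t - t
    Hinv = x                             # phi''(t) = 1/t, so H^{-1} = diag(x_l)
    h = A @ x - b                        # infeasibility at the current point
    S = (A * Hinv) @ A.T                 # A H^{-1} A^T
    w = np.linalg.solve(S, h - (A * Hinv) @ g)
    d = -Hinv * (g + A.T @ w)
    return d, w

rng = np.random.default_rng(4)
p, n = 3, 8
A = rng.standard_normal((p, n))          # stand-in for a reduced incidence matrix
b = rng.standard_normal(p)
x = np.ones(n)                           # infeasible start: A @ x != b in general
d, w = infeasible_step(A, b, x)
print(np.allclose(A @ (x + d), b))       # feasibility is restored in one step,
                                         # since the constraint is linear
```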