Numerical Optimization
Erik Taflin, EISTI
MASEF, February 2012, Version 2012-01-22
Outline I

References

Introduction
  Recall of some existence results
  Examples of Optimization Problems in Finance
    Optimal portfolios
    Model Calibration
    Variance reduction and Monte-Carlo price calculation

Deterministic Methods; Optimization problems without constraints
Outline II

Gradient Methods
  Successive Approximations
  Steepest descent
Inverse mapping Th.
Conjugate gradient
Step Size; Line-Searchers
Newton-Raphson Methods
The Newton-Kantorovich theorem
Outline III

Quasi-Newton Algorithms
Convergence of Quasi-Newton BFGS
References I
[1] Bonnans, J.F., Gilbert, J.C., Lemaréchal, C. and Sagastizábal, C.A.: Numerical Optimization, Springer, 2006.
[2] Del Moral, Pierre and Doucet, Arnaud: Particle Methods: An introduction with applications, HAL-INRIA RR-6991 [50p] (2009); 2008 Machine Learning Summer School, Springer LNCS/LNAI Tutorial book no. 6368 (2010-2011). http://hal.inria.fr/docs/00/23/92/49/PDF/RR-6437.pdf
[3] Duflo, M.: Random Iterative Models, Springer-Verlag, Berlin and New York, 1997.
[4] Ekeland, I. and Temam, R.: Convex Analysis and Variational Problems, Classics in Applied Mathematics 28, SIAM, 1999.
[5] Hamida, S.B. and Cont, R.: Recovering volatility from option prices by evolutionary optimization, Journal of Computational Finance, Vol. 8, No. 4, Summer 2005.
[6] Kortchemski, I.: Optimisation non linéaire, Algorithmes numériques, Notes de cours, EISTI, 2012.
[7] Lelong, J.: Étude asymptotique des algorithmes stochastiques et calcul du prix des options Parisiennes, Thèse, ENPC, 2007. http://tel.archives-ouvertes.fr/docs/00/20/13/73/PDF/these lelong.pdf
[8] Lelong, J.: Almost sure convergence of randomly truncated stochastic algorithms under verifiable conditions, Statistics & Probability Letters, 78(16), 2008. http://hal.archives-ouvertes.fr/docs/00/15/22/55/PDF/chen ps.pdf
[9] Marti, K.: Stochastic Optimization Methods, 2nd ed., Springer, 2010.
[10] Nocedal, J. and Wright, S.J.: Numerical Optimization, 2nd ed., Springer, 2006.
[11] Ortega, J.M.: Newton-Kantorovich Theorem, Classroom Notes, The American Mathematical Monthly, 75, 658–660 (1968).
References II
[12] Rheinboldt, W.C.: A Unified Convergence Theory for a Class of Iterative Processes, SIAM J. Numer. Anal. 5, 42–63 (1968).
[13] Zhigljavsky, A. and Zilinskas, A.: Stochastic Global Optimization, Springer, 2008.
2. Introduction

2.1 Some existence results

• Optimization Problem: Given a function

  f : E → R ∪ {∞}, s.t. f(x) < ∞ for some x ∈ E, (1)

find x* such that

  x* ∈ E and ∀x ∈ E, f(x*) ≤ f(x). (PI)

• Typically E is a TVS (topological vector space), a Banach space, R^n, C^n, ...
• Frequent conditions on f:
a) convex,
b) l.s.c. (lower semi-continuous),
c) for some c ∈ R, f⁻¹(]−∞, c]) is a non-empty bounded subset of E,
d) coercive (i.e. lim_{‖x‖→∞} f(x) = ∞), when E is a Banach space.
We note that d) is a particular case of c).
Theorem 2.1 (cf. Proposition II.1.2 of [4])
Let E be a reflexive Banach space and let f be a convex, l.s.c. function satisfying (1). If for some c ∈ R, f⁻¹(]−∞, c]) is a non-empty bounded subset of E, then there exists a solution x* of (PI). Moreover, this solution is unique if f is strictly convex.
When f is C¹ we have the following necessary condition:

Theorem 2.2
Let E be a Banach space, f ∈ C¹(E, R) and x* satisfy (PI). Then f′(x*) = 0.

When E and E_1 are Banach spaces, the following problem generalizes the equation in the necessary condition of Theorem 2.2:

  Given g ∈ C(E, E_1), find x* ∈ E s.t. g(x*) = 0. (PII)
Example 2.3
Let E = H¹(R^n), y ∈ H¹(R^n) and

  f(x) = ∫_{R^n} ( (1/2) Σ_i (∂x(t)/∂t_i)² + (1/2) x(t)² + x(t)y(t) ) dt. (2)

Then Theorem 2.1 applies and the unique solution x* satisfies −Δx*(t) + x*(t) + y(t) = 0.
Example 2.4 (cf. [4])
Without the convexity condition on f, Theorem 2.1 is no longer true, as is seen from the example

  f(x) = ∫_0^1 ((x′(t)² − 1)² + x(t)²) dt. (3)

Here we define the Banach space E by the norm ‖x‖_E = |x(0)| + ‖x′‖_{L⁴}. (The infimum of f is 0, approached by small sawtooth functions with slopes ±1, but it is not attained: f(x) = 0 would force x = 0 and |x′| = 1 simultaneously.)
In the finite dimensional case one can relax the convexity condition of Th. 2.1:
Theorem 2.5
Let E be a finite dimensional vector space and let f be a l.s.c. function satisfying (1). If for some c ∈ R, f⁻¹(]−∞, c]) is a non-empty bounded subset of E, then there exists a solution x* of (PI). Moreover, this solution is unique if f is strictly convex.
2.2 Examples of Optimization Problems in Finance

Example 1: Optimal portfolios.
• Utility function: U : R → {−∞} ∪ R is a u.s.c., increasing, strictly concave function, which is C¹ on the interior ]x̲, ∞[ (x̲ ≤ 0) of its effective domain and for which the Inada conditions are satisfied.
• Consider for simplicity a mono-period market: t ∈ {0, T}, r interest rate, S price vector of the risky assets, (Ω, F, P), H ∈ R^N risky part of the portfolio, x initial investment. Portfolio problem: find H* ∈ R^N s.t.

  H* ∈ R^N and ∀H ∈ R^N, f(H) ≤ f(H*), where f(H) := E[U(x(1 + r) + H · (S_T − S_0))].

• For "interior solutions", if they exist, solve

  f′(H) = 0.

Algorithms: Successive approximations, Steepest descent, Newton-Raphson, ...
• Markowitz Portfolio: "non-admissible" utility function U(x) = −(1/2)x² + ax. Algorithm: Conjugate gradient. (A numerical sketch follows below.)
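For the Markowitz case the first-order condition f′(H) = 0 is linear in H, so it can be solved directly on simulated scenarios. A minimal numpy sketch, assuming a toy lognormal model for S_T (all parameter values illustrative):

```python
import numpy as np

# Quadratic ("Markowitz") utility U(w) = -w^2/2 + a*w. With D = S_T - S_0,
# f(H) = E[U(x(1+r) + H.D)] and f'(H) = 0 becomes the linear system
# E[D D^T] H = E[(a - x(1+r)) D], solved here on sampled scenarios.
rng = np.random.default_rng(0)
n_assets, n_scen = 3, 100_000
x, r, a = 1.0, 0.02, 2.0                     # initial wealth, rate, utility slope
S0 = np.ones(n_assets)
ST = S0 * np.exp(rng.normal(0.03, 0.2, (n_scen, n_assets)))  # toy price model
D = ST - S0                                  # risky gains per scenario

M = D.T @ D / n_scen                         # sample E[D D^T]
rhs = (a - x * (1 + r)) * D.mean(axis=0)     # sample E[(a - x(1+r)) D]
H = np.linalg.solve(M, rhs)                  # optimal risky positions
print("H* =", H)
```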
Example 2: Model Calibration; in general not convex.
• Generalized B-S model (Heston, Local Volatility, Dupire, ...):

  dS_t = S_t (r dt + σ_t(a) dW_t),

where σ_t(a) is a r.v. depending on a ∈ U ⊂ R^N.
• P(K; a) is the price at t = 0, in this model, of a Call with strike K.
• One observes at t = 0 the price P_i of a Call with strike K_i, i = 1, ..., n.
• Calibration problem: minimize f(a) = Σ_i w_i |P(K_i; a) − P_i|² (weights w_i).
• Algorithms: deterministic, but mainly Stochastic Algorithms, since

  P(K; a) = E_Q[e^{−rT}(S_T − K)⁺].

(Robbins-Monro, Kiefer-Wolfowitz, Simulated Annealing (recuit simulé), Evolutionary Algorithms, ...) A toy calibration sketch follows below.
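A toy sketch of the least-squares objective, assuming a one-parameter constant-volatility model and a plain Monte-Carlo pricer in place of P(K; a); the crude grid search merely stands in for the stochastic algorithms named above:

```python
import numpy as np

rng = np.random.default_rng(1)
S0, r, T = 100.0, 0.01, 1.0
K = np.array([90.0, 100.0, 110.0])
w = np.ones_like(K)                       # weights w_i

def price(K, sigma, n=100_000):
    """Monte-Carlo price of calls in a constant-sigma model (toy stand-in
    for P(K; a) in a Heston/local-volatility model); noisy by design."""
    Z = rng.standard_normal(n)
    ST = S0 * np.exp((r - 0.5 * sigma**2) * T + sigma * np.sqrt(T) * Z)
    return np.exp(-r * T) * np.maximum(ST - K[:, None], 0.0).mean(axis=1)

P_obs = price(K, 0.25)                    # synthetic "observed" prices

def f(sigma):                             # calibration objective f(a)
    return np.sum(w * (price(K, sigma) - P_obs) ** 2)

grid = np.linspace(0.05, 0.6, 23)         # crude deterministic search
sigma_star = grid[np.argmin([f(s) for s in grid])]
print("calibrated sigma ~", sigma_star)   # should land near 0.25
```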
Example 3: Variance reduction and Monte-Carlo price calculation. In general not convex.
• Stock price S, dS_t = S_t σ(t, S_t) dW_t, t ∈ [0, T], dim W = n.
• Derivative pay-off h(S_T) at T, where h : R_+ → R_+. Then h(S_T) = H_T(W) for some real function defined on martingales.
• We want to find a M-C approximation of the price at t = 0, p_0 = E[h(S_T)].
• Approximation by discretisation, 0 = t_0 < t_1 < ... < t_m = T:

  p_0 = E[H_T(W)] ≈ P_m := E[φ_m(W_{t_1}, ..., W_{t_m})], for some φ_m : R^m → R.

• Girsanov's transf. dP′/dP = exp(−a·W_T − (1/2)|a|²T) gives

  P_m = E[X_a], X_a = φ_m(W_{t_1} + a t_1, ..., W_{t_m} + a t_m) exp(−a·W_T − (1/2)|a|²T).

• The variance v(a) of X_a is obtained after a minor calculation (shifting back by Girsanov):

  v(a) = E[(φ_m(W_{t_1}, ..., W_{t_m}))² exp(−a·W_T + (1/2)|a|²T)] − P_m².

Stochastic Algorithms to minimize v(a): Robbins-Monro variants (cf. [7]). A Monte-Carlo illustration follows below.
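A minimal Monte-Carlo check that E[X_a] is independent of a while Var(X_a) is not, assuming one time step (m = 1), a constant-volatility driftless model and a scalar shift a (all values illustrative); a positive a pushes paths towards the out-of-the-money strike and lowers the variance:

```python
import numpy as np

rng = np.random.default_rng(2)
S0, sigma, T, Kstrike = 100.0, 0.2, 1.0, 120.0
n = 200_000

def X_a(a, W):
    """Shifted estimator X_a = phi(W_T + aT) * exp(-a W_T - a^2 T / 2),
    with one time step (m = 1) and phi the discounted... here undiscounted
    (r = 0 assumed) call payoff of S_T."""
    ST = S0 * np.exp(-0.5 * sigma**2 * T + sigma * (W + a * T))
    payoff = np.maximum(ST - Kstrike, 0.0)
    return payoff * np.exp(-a * W - 0.5 * a**2 * T)

W = np.sqrt(T) * rng.standard_normal(n)   # samples of W_T
for a in [0.0, 0.5, 1.0, 1.5]:
    x = X_a(a, W)
    print(f"a={a:3.1f}  price~{x.mean():8.4f}  var~{x.var():10.4f}")
```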
3. Deterministic Methods; Optimization problems without constraints

• The purpose here is to construct sequences (x_k)_{k∈N} converging to x*, minimizing f in case of (PI) or being a zero of g in case of (PII). In the simplest algorithms x_{k+1} depends only on x_k:

  x_{k+1} = F(x_k), k ∈ N, where F(x) = x + t(x) d(x). (4)

• Here d(x) ∈ E is the direction and t(x)‖d(x)‖ ∈ R is the step size at x ∈ E. We write d̂(x) = d(x)/‖d(x)‖ when d(x) ≠ 0.
• More generally x_{k+1} = F_k(x_1, ..., x_k), for example with direction and step size depending on x_1, ..., x_k:

  F_k(x_1, ..., x_k) = x_k + t_k(x_1, ..., x_k) d_k(x_1, ..., x_k). (5)

(A generic driver for scheme (4) is sketched below.)
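A generic driver for the one-step scheme (4), as a minimal Python sketch; the stopping rule on the step norm is an illustrative choice, not part of the scheme itself:

```python
import numpy as np

def iterate(F, x0, tol=1e-10, maxit=1000):
    """Generic fixed-point driver for x_{k+1} = F(x_k), scheme (4).
    Stops when the step norm falls below tol; returns (x, iterations)."""
    x = np.asarray(x0, dtype=float)
    for k in range(1, maxit + 1):
        x_new = F(x)
        if np.linalg.norm(x_new - x) < tol:
            return x_new, k
        x = x_new
    return x, maxit

# usage: F is a contraction with fixed point x* = 2
print(iterate(lambda x: 0.5 * x + 1.0, np.array([0.0])))
```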
• Let E be a Hilbert space and consider an algorithm as in (4). d(x) ∈ E is called a descent direction at x ∈ E in case of (PI) (resp. (PII)) when (f′(x), d(x)) < 0 (resp. (g(x), d(x)) < 0). The generalization to an algorithm as in (5) is obvious.
• Next we shall first consider Gradient Methods, including
  – Successive Approximations
  – Steepest descent
• and then more general descent algorithms, including
  – Conjugate gradient
  – Inverse mapping theorem algorithm (also called modified Newton method)
  – Newton-Raphson Method, with the convergence result: the Newton-Kantorovich theorem
3.1 Gradient Methods

• The iteration is called a Gradient Method when d(x) ∼ f′(x) for problem (PI) or d(x) ∼ g(x) for problem (PII), i.e.

  x_{k+1} = F(x_k), k ∈ N, where F(x) = x − t(x) g(x). (6)

If the sequence (x_k)_{k∈N} converges to x*, and F is continuous at x* with t(x*) ≠ 0, then g(x*) = 0, since

  x* = F(x*) = x* − t(x*) g(x*).
3.1.1 Successive Approximations

• Solve the problem (PI) of minimizing f, or (PII) of finding a zero of g ∈ C(E, E), with t(x) = t constant in (6):

  F(x) = x − t g(x). (7)

Convergence is ensured if there is a neighborhood S of x_0 s.t. F : S → S is a contraction. However, in general there is no t ≠ 0 s.t. F is a contraction. In fact F(x) − F(y) = T(x, y)(x − y), where T(x, y) ∈ L(E, E) "cannot be made uniformly small" in general. Here (with I denoting the identity operator)

  T(x, y) = I − t ∫_0^1 g′(sx + (1−s)y) ds.
• In the case of (PI) in a Hilbert space, with f satisfying a condition of uniform strict convexity, the iteration (7) converges for all t ∈ ]0, t_0[, for some t_0 > 0:

Theorem 3.1
Let E be a Hilbert space, f ∈ C²(E) a strictly convex function such that for some c ∈ R, f⁻¹(]−∞, c]) is a non-empty bounded subset of E; let x* be the unique solution of problem (PI) given by Theorem 2.1 and let R > ‖x*‖. Suppose that there exist m, M ∈ R s.t. 0 < m ≤ M and, ∀x ∈ S = {x ∈ E : ‖x‖ ≤ R},

  m I ≤ f″(x) ≤ M I, in L(E, E).

Let S* = {x ∈ E : ‖x − x*‖ ≤ R − ‖x*‖}. Then ∃t_0 > 0 s.t. ∀t ∈ ]0, t_0[, F restricted to S* is a contraction. So the iteration (7), with x_0 ∈ S*, converges.
Proof: ‖F(x) − F(y)‖ ≤ ‖T(x, y)‖ ‖x − y‖ for x, y ∈ E. Since g′ = f″,

  ∀x, y ∈ S, m I ≤ ∫_0^1 g′(sx + (1−s)y) ds ≤ M I.

For t > 0, it follows that (1 − tM)I ≤ T(x, y) ≤ (1 − tm)I. So for t ∈ ]0, 1/M[,

  0 < (1 − tM)I ≤ T(x, y) ≤ (1 − tm)I < I and ρ := 1 − tm < 1.

Hence

  ∀x, y ∈ S, ‖F(x) − F(y)‖ ≤ ρ‖x − y‖.

Note that S* ⊂ S. Since F(x*) = x*, it follows that ∀x ∈ S*,

  ‖F(x) − x*‖ ≤ ρ‖x − x*‖ ≤ ρ(R − ‖x*‖) < R − ‖x*‖,

so F maps S* into S*. Consequently, F restricted to S* is a contraction. □
Example 3.2
F : R → R with fixed point x*. a) If |F′(x*)| < 1, the iteration x_{k+1} = F(x_k) converges to x* for x_0 close enough to x*. b) If |F′(x*)| > 1, x* is repelling and the iteration does not converge to x* (unless some x_k = x*). (A numerical illustration follows below.)
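A small numerical illustration of Theorem 3.1 and Example 3.2, assuming a two-dimensional quadratic f (so m and M are the extreme eigenvalues of f″): the iteration (7) contracts when |1 − tλ| < 1 for every eigenvalue λ of Q, i.e. for 0 < t < 2/M, and diverges beyond.

```python
import numpy as np

# Successive approximations F(x) = x - t*g(x) for g = f', with
# f(x) = 0.5 x^T Q x, so x* = 0 and m, M are the eigenvalues of Q.
Q = np.diag([1.0, 10.0])            # m = 1, M = 10
g = lambda x: Q @ x                 # g(x) = f'(x)

def run(t, x0=np.array([1.0, 1.0]), steps=60):
    x = x0.copy()
    for _ in range(steps):
        x = x - t * g(x)            # iteration (7)
    return np.linalg.norm(x)

# contraction iff |1 - t*lambda| < 1 for both eigenvalues, i.e. t < 2/M = 0.2
for t in [0.05, 0.15, 0.25]:
    print(f"t={t}: |x_60| = {run(t):.3e}")   # last value blows up
```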
3.1.2 Steepest descent

• In Problem (PI), let E be a Hilbert space and let f ∈ C¹(E).
• Iteration by F(x) := x + t(x)e(x), where t(x) and e(x) will be defined.
• Set also F_t(x) = x + t e(x).
• If f′(x) = 0 then t(x) = 0 and e(x) = 0.
• If f′(x) ≠ 0, then e(x) ∈ E, ‖e(x)‖ = 1, is the Steepest Descent Direction of f at x:

  e(x) = −f′(x)/‖f′(x)‖,

and t(x) is the optimal step size, in the sense that it solves inf_{t>0} f(F_t(x)). So

  (∂/∂t) f(F_t(x)) = 0 at t = t(x) ⇒ (f′(F_{t(x)}(x)), f′(x)) = 0. (8)

• If x* exists: the iteration F is not always convergent (this follows using (8)). Remedy: take a sufficiently short step size t(x); but then the method, when convergent, is slow. (A sketch on a quadratic follows below.)
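A numpy sketch on a quadratic f(x) = (1/2)xᵀQx − b·x, where the optimal step of (8) is available in closed form, t(x) = ‖f′(x)‖²/(f′(x), Q f′(x)); the ill-conditioned Q makes the characteristic zig-zag visible:

```python
import numpy as np

Q = np.array([[3.0, 0.0], [0.0, 30.0]])   # ill-conditioned => zig-zag
b = np.array([1.0, 1.0])
grad = lambda x: Q @ x - b                # f'(x) for f = 0.5 x^T Q x - b.x

x = np.zeros(2)
for k in range(50):
    g = grad(x)
    if np.linalg.norm(g) < 1e-12:
        break
    t = g @ g / (g @ Q @ g)               # solves inf_{t>0} f(F_t(x))
    x = x - t * g
    # consecutive gradients are orthogonal, as (8) predicts:
    # (grad(x), g) == 0 up to rounding
print("x* ~", x, " exact:", np.linalg.solve(Q, b))
```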
3.2 Inverse mapping Th.

• The contraction mapping in the proof of the Inverse Mapping Theorem is an example of an algorithm that converges (linearly) without supposing the existence of a solution. It also points towards Newton Methods.

Theorem 3.3 (cf. Inverse mapping Th.)
Suppose that E, E_1 are Banach spaces, g ∈ C¹(E, E_1), c_1 > 0, ‖g(0)‖ < c_1 and g′(0) ∈ L(E, E_1) is invertible. Set F(x) = x − (g′(0))⁻¹ g(x) and x_{k+1} = F(x_k). If c_1 is sufficiently small, then there exists an open neighborhood O of 0 s.t.

  g(x*) = 0 has a unique solution x* ∈ O,

F[O] ⊂ O, the restriction of F to O is a contraction mapping, and ∀x_0 ∈ O, lim_{k→∞} x_k = x*.

Proof: Replace g by (g′(0))⁻¹ g in the proof of Th. 3.1. □

Remark 3.4 (cf. Th. 3.1)
If we suppose the existence of x* and only c_1 < ∞, then the algorithm defined by F(x) = x − t (g′(0))⁻¹ g(x), with t > 0 sufficiently close to 0, converges to x*. (A sketch with a frozen Jacobian follows below.)
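A sketch of the modified-Newton iteration of Theorem 3.3, with the Jacobian factored once at the starting point and then frozen; the 2-d system g is an illustrative example, with x_0 playing the role of the point 0 of the theorem:

```python
import numpy as np

def g(x):
    return np.array([x[0]**2 + x[1] - 1.0, x[0] - x[1]**2])

def jac(x):
    return np.array([[2 * x[0], 1.0], [1.0, -2 * x[1]]])

x0 = np.array([0.8, 0.8])
J0 = jac(x0)                         # assembled once and frozen
x = x0.copy()
for k in range(50):
    x = x - np.linalg.solve(J0, g(x))   # F(x) = x - g'(x0)^{-1} g(x)
print("x* ~", x, " g(x*) =", g(x))      # linear convergence to a zero of g
```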
3.3 Conjugate gradient

• The Conjugate Gradient algorithm solves (in its standard form) quadratic optimization problems (PI) in E = R^n. Let A ∈ L(E, E) be invertible and a ∈ E. Define

  f(x) = (1/2)‖Ax − a‖². (9)

Note that f′(x) = A*(Ax − a), f′(x + y) = f′(x) + A*A y and

  f′(x*) = 0 ⇔ x* = A⁻¹a. (10)

The solution x* of (PI) is obtained in at most n iterations:

1. Take x_0 ∈ E and define E_{−1} = {0}. If f′(x_0) = 0 then x* = x_0.
2. For given x_k and E_{k−1}, define E_k = lh({f′(x_k)} ∪ E_{k−1}) (lh for linear hull).
3. y_k is the unique solution in E_k of (see Lemma 3.5 below)

  f′(x_k) + A*A y_k ∈ E_k^⊥ (i.e. f′(x_k + y_k) ∈ E_k^⊥). (11)

4. Iteration: x_{k+1} = x_k + y_k. If f′(x_{k+1}) = 0 then x* = x_{k+1}; if f′(x_{k+1}) ≠ 0, continue the iteration. (The classical recursive form of these steps is sketched below.)
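The subspace steps 1–4 are usually implemented by the classical two-term recursion; a sketch, applied to the normal equations A*A x = A*a, which have the same solution (10):

```python
import numpy as np

def cg_least_squares(A, a, tol=1e-12):
    """Conjugate gradient for f(x) = 0.5*||Ax - a||^2, run on the SPD
    system M x = b with M = A^T A, b = A^T a; at most n iterations."""
    n = A.shape[1]
    x = np.zeros(n)
    M, b = A.T @ A, A.T @ a
    r = b - M @ x                      # r_k = -f'(x_k)
    d = r.copy()
    for k in range(n):
        if np.linalg.norm(r) < tol:
            break
        Md = M @ d
        alpha = (r @ r) / (d @ Md)     # exact minimization along d
        x = x + alpha * d
        r_new = r - alpha * Md
        beta = (r_new @ r_new) / (r @ r)
        d = r_new + beta * d           # next direction, M-conjugate to d
        r = r_new
    return x

A = np.array([[2.0, 1.0], [0.0, 3.0]])
a = np.array([1.0, 2.0])
print(cg_least_squares(A, a), " exact:", np.linalg.solve(A, a))
```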
Lemma 3.5
Let E be a Hilbert space, P an orthogonal projection in E and A ∈ L(E, E) such that A⁻¹ ∈ L(E, E). If v_1 ∈ PE then there exists a unique v_2 ∈ PE such that

  v_1 + A*A v_2 ∈ (I − P)E. (12)

Proof: Eq. (12) is equivalent to v_1 + P A*A v_2 = 0. This equation has a unique solution v_2 ∈ PE, since P A*A restricted to PE has a bounded inverse. In fact, if B is the restriction of P A*A to PE and 0 ≠ x ∈ PE, then

  ‖Bx‖ = sup_{‖y‖≤1} |(y, Bx)| ≥ (x/‖x‖, Bx) = (1/‖x‖)‖Ax‖² ≥ ‖A⁻¹‖⁻² ‖x‖.

It follows that ‖B⁻¹‖ ≤ ‖A⁻¹‖². □
3.4 Step Size; Line-Searchers

• We consider here the problem of determining t_k in x_{k+1} = x_k + t_k d_k for a given d_k ∈ E.
• Suppose that d is a descent direction at x, i.e. q′(0) < 0 where q(t) = f(x + td).
• Example: as seen, in the case of Steepest Descent we can choose d(x) = −f′(x) at x and the "step size" t*(x) to be the solution of inf_{t>0} q(t).
• We shall here only consider Wolfe's Rule. Given 0 < m_1 < m_2 < 1 and the initial data t̲ = 0, t̄ = ∞ and t > 0, it is:
• Wolfe's Rule: for the current t ∈ ]t̲, t̄[,
1. If q(t) ≤ q(0) + m_1 t q′(0) and q′(t) ≥ m_2 q′(0), then t is the step size and the algorithm stops.
2. If q(t) > q(0) + m_1 t q′(0), set t̄ = t.
3. If q(t) ≤ q(0) + m_1 t q′(0) and q′(t) < m_2 q′(0), set t̲ = t.
4. If one of the points 2 or 3 applied, choose a new t ∈ ]t̲, t̄[ and go back to point 1.
• Step 4 can for example be realized by t = (t̲ + t̄)/2 (when t̄ < ∞; see the sketch below).
• For E = R^n we have the following theorem (cf. [1] Theorem 3.7): if q is C¹ and bounded from below, then Wolfe's line-search algorithm terminates in a finite number of iterations.
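A sketch of steps 1–4, assuming bisection for step 4 and, while t̄ = ∞, doubling of t (a common concrete choice not fixed by the rule itself); the values of m_1 and m_2 are illustrative:

```python
import numpy as np

def wolfe(q, qp, m1=1e-4, m2=0.9, t=1.0, maxit=60):
    """Wolfe's rule for q(t) = f(x + t d) with q'(0) < 0.
    q is the merit function along d, qp its derivative."""
    q0, qp0 = q(0.0), qp(0.0)
    t_lo, t_hi = 0.0, np.inf               # the bounds called t-underline, t-bar
    for _ in range(maxit):
        if q(t) > q0 + m1 * t * qp0:       # step 2: sufficient decrease fails
            t_hi = t
        elif qp(t) < m2 * qp0:             # step 3: curvature condition fails
            t_lo = t
        else:                              # step 1: both conditions hold
            return t
        t = 2.0 * t if np.isinf(t_hi) else 0.5 * (t_lo + t_hi)
    return t

# usage on q(t) = f(x + t d) for f(u) = u^4, x = 1, d = -1
q = lambda t: (1.0 - t) ** 4
qp = lambda t: -4.0 * (1.0 - t) ** 3
print("accepted step:", wolfe(q, qp))
```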
3.5 Newton-Raphson Methods

• Consider Problem (PII), i.e. for g ∈ C(E, E_1), find x* ∈ E s.t. g(x*) = 0. Suppose moreover that g ∈ C¹(E, E_1) and that O is an open convex neighborhood in E s.t.

  ∀x ∈ O, (g′(x))⁻¹ ∈ L(E_1, E). (13)

The iteration algorithm (4) is here defined by x_{k+1} = N(x_k), where

  N(x) = x − (g′(x))⁻¹ g(x), x ∈ O. (14)

• Advantage of the N-R Method: often the convergence is quadratic. Under strong hypotheses this follows easily. For example, if (x_k) ⊂ O converges to x* ∈ O, ‖(g′(x))⁻¹‖ is bounded on O, and g : O → E_1 and its derivatives up to order 3 are bounded, then g(x*) = 0, ‖(g′(x*))⁻¹‖ < ∞ and N′(x*) = 0 give

  x_{k+1} − x* = N(x_k) − N(x*) = ∫_0^1 (1 − s) N″(s x_k + (1−s) x*)(x_k − x*, x_k − x*) ds.

Then, for some C only depending on O,

  ‖x_{k+1} − x*‖ ≤ C‖x_k − x*‖², k ∈ N.

(Still true under much weaker hypotheses; Newton-Kantorovich Th. 3.8.) A numerical illustration follows below.
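A minimal sketch of iteration (14) on an illustrative 2-d system; the printed errors roughly square at each step, exhibiting the quadratic convergence:

```python
import numpy as np

def g(x):
    return np.array([x[0]**2 + x[1]**2 - 1.0, x[0] - x[1]])

def jac(x):
    return np.array([[2 * x[0], 2 * x[1]], [1.0, -1.0]])

x = np.array([1.0, 0.5])
x_star = np.array([1.0, 1.0]) / np.sqrt(2.0)   # known zero of g
for k in range(6):
    x = x - np.linalg.solve(jac(x), g(x))      # x_{k+1} = N(x_k)
    print(k, np.linalg.norm(x - x_star))       # errors ~ square each step
```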
Example 3.6
Apply the Newton-Raphson Method to find the zeros of a quadratic polynomial p, p(t) = t² − 2at + b, for given a, b ∈ R satisfying 0 < b ≤ a².

Solution: The zeros t* and t** satisfy

  0 < t* = a − √(a² − b) ≤ t** = a + √(a² − b). (15)

Let O₋ = ]−∞, a[, O₊ = ]a, ∞[ and O = O₋ ∪ O₊. Then p′(t) < 0 for t ∈ O₋ and p′(t) > 0 for t ∈ O₊. With n instead of N, the iteration formula (14) reads, for t ∈ O, n(t) = t − p(t)/p′(t). If b = a², define by continuity n(a) = a. One has

  n(t) − a = t − a − p(t)/p′(t) = ((t − a)² + a² − b)/(2(t − a)), t ∈ O. (16)

So, if b = a², then ∀t ∈ R, n(t) − a = (1/2)(t − a), and if 0 < b ≤ a², then

  n : O → O is C^∞ and n(O_ε) ⊂ O_ε, ε = ±. (17)
From p(t) = (t − t*)(t − t**) we get n(t) − t* = t − t* − (t − t*)(t − t**)/(2(t − a)), i.e.

  n(t) − t* = C*(t)(t − t*), t ≠ a, where C*(t) = 1 − (t − t**)/(2(t − a)) = (t* − t)/(2(a − t)). (18)

C*(t)(t − t*) < 0 for t ∈ O₋ and C*(t) ≤ 1/2 for t ∈ ]−∞, t*[ give

  n(O₋) ⊂ ]−∞, t*[ and 0 < t* − n(t) ≤ (1/2)(t* − t), for t < t*. (19)

Similarly,

  n(t) − t** = C**(t)(t − t**), t ≠ a, where C**(t) = 1 − (t − t*)/(2(t − a)) = (t − t**)/(2(t − a)). (20)

C**(t)(t − t**) > 0 for t ∈ O₊ and C**(t) ≤ 1/2 for t ∈ ]t**, ∞[ give

  n(O₊) ⊂ ]t**, ∞[ and 0 < n(t) − t** ≤ (1/2)(t − t**), for t > t**. (21)
The sequence (t_k) is defined by

  t_0 ∈ R and t_{k+1} = n(t_k), k ≥ 0. (22)

We sum up this discussion, supplemented by the convergence speed:

Proposition 3.7
Let 0 < b ≤ a² and denote t₋ := t* and t₊ := t**. The sequence (t_k) defined by (22) converges if and only if t_0 ∈ R \ {a} (resp. t_0 ∈ R) when b < a² (resp. b = a²). For ε = ±, t_ε is a fixed point of n, and if t_0 ∈ O_ε the sequence is strictly monotone and converges to t_ε. If b < a², then the convergence rate is quadratic, i.e. for ε = ±,

  ∃C > 0, depending on t_0 ∈ O_ε, s.t. |t_{k+1} − t_ε| ≤ C|t_k − t_ε|², k ≥ 0. (23)

If b = a², then the convergence rate is linear; in fact |t_{k+1} − t_ε| = (1/2)|t_k − t_ε|.

Proof: Inequality (23) follows directly from the expressions for C*(t) and C**(t). □ (A short numerical illustration follows below.)
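A short check of the two rates, with illustrative values a = 2 and b = 3, resp. b = 4 = a²:

```python
import math

def rate_demo(a, b, t0, steps=8):
    """Iterate n(t) = t - p(t)/p'(t) for p(t) = t^2 - 2at + b and print
    the error |t_k - t*| at each step."""
    t_star = a - math.sqrt(a * a - b)
    t = t0
    for k in range(steps):
        t = t - (t * t - 2 * a * t + b) / (2 * t - 2 * a)
        print(k, abs(t - t_star))

rate_demo(2.0, 3.0, 0.0)   # b < a^2: the error roughly squares (quadratic)
rate_demo(2.0, 4.0, 0.0)   # b = a^2: the error halves each step (linear)
```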
• For further reference we note that

  p(n(t)) = (1/2) p″(t)(n(t) − t)² (24)

and, writing n²(t) := n(n(t)),

  n²(t) − n(t) = −(1/p′(n(t))) (1/2) p″(t)(n(t) − t)². (25)

In fact, n(t) = t − p(t)/p′(t) gives p(t) + p′(t)(n(t) − t) = 0. Then, using that p is of second degree, formula (24) follows:

  p(n(t)) = p(n(t)) − (p(t) + p′(t)(n(t) − t))
          = p′(t)(n(t) − t) + (1/2) p″(t)(n(t) − t)² − p′(t)(n(t) − t)
          = (1/2) p″(t)(n(t) − t)².

By the definition of n, n²(t) − n(t) = −p(n(t))/p′(n(t)). Then (24) gives

  n²(t) − n(t) = −(1/p′(n(t))) (1/2) p″(t)(n(t) − t)²,

which proves formula (25). End of Example 3.6.
3.6 The Newton-Kantorovich theorem

• Notations in this section:
  – E and E_1 are Banach spaces.
  – O ⊂ E is an open convex set; x_0 ∈ O.
  – g ∈ C¹(E, E_1) and g′ : O → L(E, E_1) is Lipschitz continuous:

  ∃K > 0 such that ∀x, y ∈ O, ‖g′(x) − g′(y)‖ ≤ K‖x − y‖. (26)

  – A_0 = g′(x_0).
Theorem 3.8 (Newton-Kantorovich)
Suppose that A_0⁻¹ ∈ L(E_1, E) and that γ := αβK ≤ 1/2, where α, β ∈ R_+ are s.t.

  ‖A_0⁻¹‖ ≤ α and ‖A_0⁻¹ g(x_0)‖ ≤ β.

Let the roots of t² − (2/(αK)) t + 2β/(αK) = 0 be t* ≤ t**, so

  0 < t* = (1/(αK))(1 − √(1 − 2γ)) ≤ t** = (1/(αK))(1 + √(1 − 2γ)), (27)

and suppose that S ⊂ O, where S := B_E(x_0, t*).

Then the sequence (x_k)_{k∈N} of Newton iterates x_{k+1} = N(x_k), where

  N(x) = x − g′(x)⁻¹ g(x),

is well-defined, {x_k}_{k∈N} ⊂ S, and it converges to an element x* ∈ S. Moreover, x* is the unique element in O ∩ B_E(x_0, t**) satisfying g(x*) = 0, and if γ < 1/2 then (x_k)_{k∈N} converges quadratically to x*.

We shall give a proof which closely follows [11] and [12].
Lemma 3.9
Let the hypotheses of Theorem 3.8 be satisfied. For all x ∈ Q := B(x_0, 1/(αK)) ∩ O one has g′(x)⁻¹ ∈ L(E_1, E) and

  ‖g′(x)⁻¹‖ ≤ α/(1 − αK‖x − x_0‖). (28)

If x, N(x) ∈ Q, then

  ‖N²(x) − N(x)‖ ≤ (αK/2)/(1 − αK‖N(x) − x_0‖) ‖N(x) − x‖². (29)

Proof: The subset of invertible operators in L(E, E_1), endowed with the operator norm, is open, so ∃ε > 0 s.t. g′(x)⁻¹ ∈ L(E_1, E) for x ∈ B(x_0, ε) ∩ O. For such x,

  g′(x)⁻¹ − g′(x_0)⁻¹ = g′(x)⁻¹ (g′(x_0) − g′(x)) g′(x_0)⁻¹

shows that

  ‖g′(x)⁻¹ − g′(x_0)⁻¹‖ ≤ ‖g′(x)⁻¹‖ ‖g′(x_0)⁻¹‖ ‖g′(x) − g′(x_0)‖.

By (26) and the hypotheses of Th. 3.8: ‖g′(x)⁻¹ − g′(x_0)⁻¹‖ ≤ αK ‖g′(x)⁻¹‖ ‖x − x_0‖.
This gives ‖g′(x)⁻¹‖ − ‖g′(x_0)⁻¹‖ ≤ αK‖g′(x)⁻¹‖ ‖x − x_0‖, which combined with ‖g′(x_0)⁻¹‖ ≤ α proves (28). We can now take ε = 1/(αK).

To prove (29), let x, N(x) ∈ Q. The definition of N gives

  ‖N²(x) − N(x)‖ ≤ ‖g′(N(x))⁻¹‖ ‖g(N(x))‖. (30)

The first factor on the r.h.s. satisfies, according to (28), ‖g′(N(x))⁻¹‖ ≤ α/(1 − αK‖N(x) − x_0‖).

To estimate the second factor, note that g(x) + g′(x)(N(x) − x) = 0 implies g(N(x)) = g(N(x)) − g(x) − g′(x)(N(x) − x). For y ∈ O,

  g(y) − g(x) − g′(x)(y − x) = ∫_0^1 (g′(sy + (1−s)x) − g′(x))(y − x) ds.

According to (26), ‖g′(sy + (1−s)x) − g′(x)‖ ≤ Ks‖y − x‖, so ‖g(N(x))‖ ≤ (1/2)K‖N(x) − x‖². These results and inequality (30) give

  ‖N²(x) − N(x)‖ ≤ (αK/2)/(1 − αK‖N(x) − x_0‖) ‖N(x) − x‖²,

which proves (29). □
Lemma 3.10
Let the hypotheses of Theorem 3.8 be satisfied. The sequence (x_k)_{k∈N} satisfies {x_k : k ∈ N} ⊂ S and is majorized by the sequence (t_k)_{k∈N}, in the sense that

  ‖x_{k+1} − x_k‖ ≤ t_{k+1} − t_k, k ∈ N,

where t_{k+1} = n(t_k), n(t) = t − p(t)/p′(t), p(t) = t² − 2at + b, a = 1/(αK), b = 2β/(αK) and t_0 = 0.

Proof: We can apply Proposition 3.7, since 0 < b = 2γa² ≤ a². As t_0 < t*, (t_k)_{k∈N} is strictly increasing and converges to t*. For k ≥ 1 we make the following induction hypothesis:

  H_k: x_0, ..., x_k ∈ S and ‖x_i − x_{i−1}‖ ≤ t_i − t_{i−1}, for i = 1, ..., k. (31)

H_1 is true since, according to the hypotheses of Theorem 3.8, x_0 ∈ S and ‖x_1 − x_0‖ = ‖A_0⁻¹ g(x_0)‖ ≤ β = t_1 − t_0 < t*. In particular x_1 ∈ S.
If H_k is true, then according to Lemma 3.9 and formula (25),

  ‖x_{k+1} − x_k‖ = ‖N²(x_{k−1}) − N(x_{k−1})‖ ≤ (αK/2)/(1 − αK‖x_k − x_0‖) ‖x_k − x_{k−1}‖²
                  ≤ (αK/2)/(1 − αK t_k) (t_k − t_{k−1})² = t_{k+1} − t_k. (32)

This proves that the inequality in H_{k+1} is satisfied. Moreover, since t_0 = 0 and (t_k) is strictly increasing,

  ‖x_{k+1} − x_0‖ ≤ Σ_{i=0}^{k} ‖x_{i+1} − x_i‖ ≤ t_{k+1} ≤ t*.

So x_{k+1} ∈ S. □
In the proof of the uniqueness in the Newton-Kantorovich theorem we will use the following result on the iteration in the Inverse Mapping Th.:

Exercise 3.11 (cf. [12], Corollary 3.3)
Let the hypotheses of Theorem 3.8 be satisfied and let F : O → E be given by

  F(x) = x − A_0⁻¹ g(x), x ∈ O.

Then the sequence (y_k)_{k∈N} of iterates y_{k+1} = F(y_k), where y_0 = x_0, satisfies {y_k}_{k∈N} ⊂ S and converges to an element y* ∈ S. Moreover, y* is the unique element in O ∩ B_E(x_0, t**) satisfying g(y*) = 0.
Proof of Theorem 3.8: For 0 ≤ m ≤ n, Lemma 3.10 gives

  ‖x_n − x_m‖ ≤ t_n − t_m < t* − t_m → 0 as m → ∞,

so (x_k)_{k∈N} is a Cauchy sequence in S ⊂ E, E being a Banach space; let x* ∈ S be its (unique) limit. Since g(x_k) = −g′(x_k)(x_{k+1} − x_k) = −(g′(x_0) + g′(x_k) − g′(x_0))(x_{k+1} − x_k), we obtain according to (26)

  ‖g(x_k)‖ ≤ (‖g′(x_0)‖ + ‖g′(x_k) − g′(x_0)‖) ‖x_{k+1} − x_k‖ ≤ (‖g′(x_0)‖ + K‖x_k − x_0‖) ‖x_{k+1} − x_k‖.

The r.h.s. converges to 0, which proves that g(x*) = 0.

When γ < 1/2, (t_k) converges at least quadratically to t*, since (αK/2)/(1 − αK t*) < ∞. Since ‖x_k − x*‖ ≤ Σ_{i=k}^{∞} ‖x_{i+1} − x_i‖ ≤ t* − t_k, (x_k) also converges at least quadratically to x*. The uniqueness property follows from Exercise 3.11. □
3.7 Quasi-Newton Algorithms

• A problem with the Newton Algorithm: the Hessian f″(x) and its inverse (f″(x))⁻¹ have to be calculated at each step.
• Quasi-Newton Algorithms: replace f″(x_k) by an approximation M_k and use the algorithm

  x_{k+1} = x_k + t_k d_k, where d_k = −(M_k)⁻¹ g(x_k) and lim_{k→∞} t_k = 1, (QN-1)

where M_k satisfies the Quasi-Newton equation

  g(x_k) − g(x_{k−1}) = M_k (x_k − x_{k−1}) (QN-2)

and

  M_k is symmetric. (QN-3)

• Motivation of (QN-2) and (QN-3):
  – The average G_k = ∫_0^1 g′(x_{k−1} + s(x_k − x_{k−1})) ds of g′ over the line-segment [x_{k−1}, x_k] satisfies g(x_k) − g(x_{k−1}) = G_k(x_k − x_{k−1}); M_k shall also satisfy this equation.
  – g′(x) is symmetric when g = f′.
• Updating of M:

  M_{k+1} = M_k + B_k,

where one chooses rank(B_k) ≤ 2, just in order to satisfy (QN-2) and to keep some freedom in the definition of x_{k+2} by (QN-1).
• Here we shall only consider the BFGS method (Broyden, Fletcher, Goldfarb, Shanno):

  M_{k+1} = M₊(M_k, x_{k+1} − x_k, g(x_{k+1}) − g(x_k)), (33)

where, for all symmetric positive definite M ∈ R^{n×n} and s, y ∈ R^n such that (s, y) ≠ 0 (here Aᵀ is the transpose of the matrix A),

  M₊(M, s, y) = M + (1/(s, y)) y yᵀ − (1/(s, Ms)) M s sᵀ M. (34)

Often, when there is no risk of confusion, the argument M in M₊(M, s, y) will be omitted. (A sketch of the update follows below.)
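A direct numpy transcription of the update (34), together with a quick check of the secant equation M₊s = y of Proposition 3.12 i):

```python
import numpy as np

def bfgs_update(M, s, y):
    """BFGS update M_+(M, s, y) of (34); M symmetric positive definite,
    (s, y) != 0."""
    Ms = M @ s
    return M + np.outer(y, y) / (y @ s) - np.outer(Ms, Ms) / (s @ Ms)

rng = np.random.default_rng(3)
B = rng.standard_normal((4, 4))
M = B @ B.T + 4 * np.eye(4)            # symmetric positive definite
s, y = rng.standard_normal(4), rng.standard_normal(4)
M_plus = bfgs_update(M, s, y)
print(np.allclose(M_plus @ s, y))       # secant equation: True
```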
Proposition 3.12
Let M ∈ R^{n×n} be symmetric positive definite and let s, y ∈ R^n satisfy (s, y) ≠ 0. Then
i) M₊(s, y)s = y,
ii) M₊(s, y) is symmetric,
iii) M₊(s, y) is positive definite iff (s, y) > 0.

Proof: i) and ii) are trivial.
iii) A. Suppose that M₊(s, y) is positive definite. Since s ≠ 0, it then follows from i) that (y, s) = (M₊(s, y)s, s) > 0.
B. Suppose that (y, s) > 0. With x = M⁻¹y we have (s, Mx) > 0 and

  M₊(s, y) = M + (1/(s, Mx)) M x xᵀ M − (1/(s, Ms)) M s sᵀ M.

So, ∀u ∈ R^n,

  (u, M₊(s, y)u) = (u, Mu) + (1/(s, Mx))(Mx, u)² − (1/(s, Ms))(Ms, u)² ≥ (u, Mu) − (1/(s, Ms))(Ms, u)².
Since M is symmetric positive definite, a scalar product (·, ·)_M is defined by (u, v)_M = (Mu, v). The last inequality and the Schwarz inequality then give

  (u, M₊(s, y)u) ≥ (u, u)_M − (s, u)_M²/(s, s)_M = (1/(s, s)_M)((s, s)_M (u, u)_M − (s, u)_M²) ≥ 0.

If there is equality, then u = ks, and in this case it follows from i) that (u, M₊(s, y)u) = k²(s, M₊(s, y)s) = k²(s, y). So, by hypothesis, if u ≠ 0 then (u, M₊(s, y)u) > 0. □

Remark 3.13
Let s = x_{k+1} − x_k and y = g(x_{k+1}) − g(x_k).
i) If (s, y) ≠ 0 then (QN-2) is satisfied.
ii) If d_k is a descent direction, x_{k+1} = x_k + t d_k and t is given by Wolfe's line-search, then (s, y) > 0.

Proof of ii): Let q(t) = f(x_k + t d_k), g_k = g(x_k) and g_{k+1} = g(x_{k+1}).
• d_k is a descent direction iff q′(0) = (g_k, d_k) < 0.
• According to Wolfe's Rule 1, q′(t) ≥ m_2 q′(0) for some 0 < m_2 < 1.
• Then (y, s) = t((g_{k+1}, d_k) − (g_k, d_k)) = t(q′(t) − q′(0)) ≥ t(m_2 − 1)q′(0) > 0. □
3.8 Convergence of Quasi-Newton BFGS

We will here closely follow Reference [1], pp. 57–66. We have the following global convergence result (cf. Th. 4.9 of [1]):

Theorem 3.14
Let f ∈ C²(R^n) be coercive, convex and bounded below. Then the BFGS algorithm, with Wolfe's line-search and M_1 symmetric positive definite, satisfies lim_k |g(x_k)| = 0.

For the proof see [1], which we here complete with the following

Lemma 3.15
Let M ∈ R^{n×n} be positive definite and symmetric, let s, y ∈ R^n satisfy (s, y) ≠ 0, and let M₊(M, s, y) be given by (34). Then
i) tr(M₊(M, s, y)) = tr(M) + |y|²/(y, s) − |Ms|²/(Ms, s),
ii) det(M₊(M, s, y)) = det(M) (y, s)/(Ms, s),
iii) if f ∈ C²(R^n) is convex, then |f′(x + s) − f′(x)|² ≤ C (s, f′(x + s) − f′(x)), where C is bounded when x and s stay in a bounded set in R^n.
(A numerical check of i) and ii) follows below.)
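A quick numerical check of the trace and determinant identities i) and ii) on random data:

```python
import numpy as np

rng = np.random.default_rng(4)
B = rng.standard_normal((5, 5))
M = B @ B.T + 5 * np.eye(5)                 # symmetric positive definite
s, y = rng.standard_normal(5), rng.standard_normal(5)

Ms = M @ s
M_plus = M + np.outer(y, y) / (y @ s) - np.outer(Ms, Ms) / (s @ Ms)  # (34)

lhs_tr = np.trace(M_plus)
rhs_tr = np.trace(M) + y @ y / (y @ s) - Ms @ Ms / (Ms @ s)          # i)
lhs_det = np.linalg.det(M_plus)
rhs_det = np.linalg.det(M) * (y @ s) / (Ms @ s)                      # ii)
print(np.isclose(lhs_tr, rhs_tr), np.isclose(lhs_det, rhs_det))      # True True
```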
Proof: i) Trivial.
ii) • Introduce ⟨x, z⟩ = (Mx, z), which defines a scalar product since M is positive definite and symmetric. Set E = R^n, e = M⁻¹y, f = s, |x|² = (x, x), ‖x‖² = ⟨x, x⟩ and M₊ = M₊(M, s, y). This gives, since ⟨e, f⟩ = (y, s) ≠ 0,

  M₊ = M + (1/(s, y)) y yᵀ − (1/(s, Ms)) M s sᵀ M = M + (1/⟨e, f⟩) M e eᵀ M − (1/⟨f, f⟩) M f fᵀ M.

So M₊ = MA, with A x = x + (⟨e, x⟩/⟨e, f⟩) e − (⟨f, x⟩/‖f‖²) f. We shall prove that det(A) = (y, s)/⟨f, f⟩.

• Eigenvalues λ of A. Let E_1 = {x ∈ E : ⟨f, x⟩ = ⟨e, x⟩ = 0} and E_2 = lh{e, f}, so that E_1 is the orthogonal complement of E_2 w.r.t. ⟨·, ·⟩. Then

  E = E_1 ⊕ E_2 and A E_i = E_i, i = 1, 2.

1. For x ∈ E_1, Ax = x, so E_1 is an eigenspace with eigenvalue λ = 1.
2. Let x ∈ E_2. First suppose that {e, f} is linearly independent and set x = ae + bf. Then

  0 = Ax − λx = (1 − λ)(ae + bf) + (⟨e, ae + bf⟩/⟨e, f⟩) e − (⟨f, ae + bf⟩/‖f‖²) f.
This is equivalent to the linear system

  ((1 − λ)⟨e, f⟩ + ‖e‖²) a + ⟨e, f⟩ b = 0,
  ⟨e, f⟩ a + λ‖f‖² b = 0.

This linear system in (a, b) has vanishing determinant iff

  λ² − (1 + ‖e‖²/⟨e, f⟩) λ + ⟨e, f⟩/‖f‖² = 0.

Using the Schwarz inequality, it follows that the two roots λ_ε, ε = ±1, are real:

  λ_ε = (1/2)(1 + ‖e‖²/⟨e, f⟩) + ε √( (1/4)(1 + ‖e‖²/⟨e, f⟩)² − ⟨e, f⟩/‖f‖² ).

So A restricted to E_2 has the two real eigenvalues λ_ε, ε = ±1, when {e, f} is linearly independent.
Secondly, suppose that e = kf, k ∈ R. Then ⟨e, f⟩ = k⟨f, f⟩ and Af = kf, so A restricted to E_2 has the eigenvalue k = ⟨e, f⟩/‖f‖².

• It now follows from 1. and 2., in the first case, that det(A) = λ_1 λ_{−1} = ⟨e, f⟩/‖f‖² = (y, s)/⟨f, f⟩, and in the second case that det(A) = k = ⟨e, f⟩/‖f‖², which proves statement ii) of the lemma.
iii) Let 0 ≠ s ∈ R^n and set a(x, s) = f′(x + s) − f′(x). Since f ∈ C²(R^n),

  a(x, s) = A(x, s) s, where A(x, s) = ∫_0^1 f″(x + us) du.

The operator norm |A(x, s)| is bounded when x and s stay in a bounded set. A(x, s) is positive semi-definite and symmetric; in fact

  (b, A(x, s) b) = ∫_0^1 (b, f″(x + us) b) du ≥ 0.

It follows that

  |a(x, s)|² = (A(x, s)s, A(x, s)s) = (A(x, s)(A(x, s))^{1/2} s, (A(x, s))^{1/2} s)
             ≤ |A(x, s)(A(x, s))^{1/2} s| |(A(x, s))^{1/2} s| ≤ |A(x, s)| |(A(x, s))^{1/2} s|²
             = |A(x, s)| ((A(x, s))^{1/2} s, (A(x, s))^{1/2} s) = |A(x, s)| (A(x, s)s, s) = |A(x, s)| (a(x, s), s). □
Concerning local convergence we have the following result (cf. Th. 4.11 of [1]):

Theorem 3.16 (Dennis-Moré criterion)
Let E = R^n, g ∈ C¹(E, E), g(x*) = 0 and g′(x*)⁻¹ ∈ L(E, E). Let the sequence (x_k) be defined by invertible linear operators A_k and the iteration

  x_0 ∈ E and x_{k+1} = x_k − A_k⁻¹ g(x_k).

Then (x_k) converges super-linearly to x*, i.e. lim_k |x_{k+1} − x*|/|x_k − x*| = 0, iff

  lim_k (A_k − g′(x*)) (x_{k+1} − x*)/|x_k − x*| = 0.

This criterion leads to the following result (cf. Th. 4.17 of [1]):

Theorem 3.17
Let E = R^n and let O be an open neighborhood of x* ∈ E. Suppose that f ∈ C²(O) has Lipschitz continuous f″, f′(x*) = 0 and f″(x*)⁻¹ ∈ L(E, E). Let the sequence (x_k) in O be generated by the BFGS algorithm together with Wolfe's line-search with 0 < m_1 < 1/2 < m_2 < 1, and assume that (x_k) converges to x*. Then the convergence is super-linear.