
An Introduction to Numerical Optimization

General form of the optimization problem

minimize f(x) subject to x ∈ Ω

Here f(x) is a real valued scalar function often called an objective or cost function. Ω is the feasible set and is a subset of some space E^n. This includes the subclasses:

1. The Linear Programming Problem: minimize c^t x, x ∈ R^n, subject to a_i^t x ≤ b_i, i = 1, ..., m

2. Least-Squares Problems: minimize ||Ax − b||_2^2

3. Convex Optimization Problems: minimize f(x) subject to g_i(x) ≤ b_i, i = 1, ..., m

Rosenbrock Banana function

f(x) = 10(x_2 − x_1^2)^2 + (1 − x_1)^2

Today we are going to focus on the specific problem of unconstrained optimization with vectors of real numbers. So our problem can be written as

minimize f(x), x ∈ R^n, where f(x) is a scalar function

Conditions for x* to be a minimum:

(a) It is a local minimum if f(x) ≥ f(x*) for all ||x − x*|| < ε

(b) It is a global minimum if f(x) ≥ f(x*) for all x ∈ R^n

(c) The gradient of f(x) is zero at x*: g(x*) = ∇f(x*) = [∂f/∂x_1, ..., ∂f/∂x_n]^t = 0

(d) The Hessian, H(x*) (second derivative), is positive definite at x*: H(x*) = [∂^2 f/∂x_i ∂x_j] > 0

(e) If f(x) is convex and (c) & (d) hold then x* is a global minimum

Convex f(x) means f(αx_1 + (1 − α)x_2) ≤ αf(x_1) + (1 − α)f(x_2) for 0 ≤ α ≤ 1

These conditions are important to understanding optimization problems and algorithms. But usually they do not give a means for solution. Consider the Rosenbrock Banana function

f(x) = 10(x_2 − x_1^2)^2 + (1 − x_1)^2

1. Is it convex?

[Surface plot of f(x) over −2 ≤ x_1 ≤ 2, −1 ≤ x_2 ≤ 3, with function values ranging from 0 to 120.]

2. Zero gradient at the minimum gives

g_1 = ∂f/∂x_1 = 40x_1^3 − 40x_1x_2 + 2x_1 − 2 = 0

g_2 = ∂f/∂x_2 = −20x_1^2 + 20x_2 = 0
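As a quick check of these stationarity equations, a small sketch (not part of the original slides) that evaluates the function and its gradient at the known minimum (1, 1):

```python
# Verify that the gradient of the Rosenbrock function
# f(x) = 10*(x2 - x1^2)^2 + (1 - x1)^2 vanishes at the minimum (1, 1).

def f(x1, x2):
    return 10 * (x2 - x1**2) ** 2 + (1 - x1) ** 2

def grad(x1, x2):
    g1 = 40 * x1**3 - 40 * x1 * x2 + 2 * x1 - 2  # df/dx1
    g2 = -20 * x1**2 + 20 * x2                   # df/dx2
    return g1, g2

print(f(1.0, 1.0))     # -> 0.0 at the minimum
print(grad(1.0, 1.0))  # -> (0.0, 0.0)
```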

The Quadratic Optimization Problem

An understanding of computational techniques to solve the quadratic problem

f(x) = (1/2) x^t A x + b^t x + c, with A = A^t > 0

is important because:

1. This formulation describes the Least-Squares problem.

2. There are important engineering problems that can be formulated as quadratic problems.

3. The general optimization problem can be expressed as a quadratic problem in the vicinity of the minimum:

f(x* + h) = f(x*) + h^t ∇f(x*) + (1/2) h^t H(x*) h

So the development of algorithms to solve the general problem has usually sought to be efficient on quadratic problems as a start. Note that for the quadratic problem an analytical solution exists. The gradient is g(x) = ∇f(x) = Ax + b and the Hessian is H = A. The conditions for optimality give

x* = −A^{-1} b
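A minimal sketch of the analytical solution x* = −A^{-1}b on a small example of my own (2×2, solved by Cramer's rule so no library is needed):

```python
# Analytical minimizer of f(x) = 0.5*x'Ax + b'x + c for a 2x2 example.
A = [[4.0, 1.0],
     [1.0, 3.0]]   # symmetric positive definite
b = [-1.0, -2.0]

det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
# x* solves A x = -b; Cramer's rule for the 2x2 system:
x_star = [(-b[0] * A[1][1] - (-b[1]) * A[0][1]) / det,
          (A[0][0] * (-b[1]) - A[1][0] * (-b[0])) / det]

# The gradient g(x) = Ax + b should vanish (to rounding) at x*.
g = [A[0][0] * x_star[0] + A[0][1] * x_star[1] + b[0],
     A[1][0] * x_star[0] + A[1][1] * x_star[1] + b[1]]
print(x_star)  # -> [1/11, 7/11] ~ [0.0909..., 0.6363...]
print(g)       # -> essentially [0, 0]
```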

Iterative solutions, line searches and convergence

Algorithmic solutions to our optimization problem can usually be expressed as

x(k+1) = Φ[x(k)]

where Φ[ ] is a computational procedure that yields an x(k+1) such that f(x(k+1)) ≤ f(x(k)). Typically we will allow this iteration to continue until say ||x(k+1) − x(k)|| < ε {here ε is some user-supplied level}.

Analyses of the performance of such algorithms will typically explore, for a quadratic problem, how the ratio

||x(k+1) − x*|| / ||x(k) − x*||

behaves. For example, if ||x(k+1) − x*|| / ||x(k) − x*|| = β < 1 as k gets ``large'', then the algorithm has linear convergence with convergence ratio β.

The algorithm Φ[ ] will usually be of the form

x(k+1) = x(k) + α(k)d(k)

Here α(k) is a positive scalar and d(k) is a search direction. So what the algorithm does is take the current value x(k) and compute a search direction d(k). A univariate, i.e. scalar, search is then performed in this direction to find the scalar value α(k) for which f(x(k) + αd(k)) is a minimum. We then check some termination criterion and, if this is not satisfied, we set x(k+1) = x(k) + α(k)d(k), set k = k + 1 and do another iteration.

Univariate Searches

Consider a search along a feasible direction d

We want to find out how far we should go in the direction d so as to minimize f(x(k) + αd). This is a univariate or scalar optimization problem.

To explore this problem imagine a plot of f(α) =f(x(k)+ αd) along the direction d

Assume we have determined that the minimum is in the interval [0, 2]; we then evaluate at 2 internal points u1 and v1. Here a1 = 0 and b1 = 2.

f(v1) > f(u1) → minimum is within [a1, v1]

We can now reset the analysis interval to [a2, b2] with internal points u2 and v2.

We would like a sub-division strategy which only requires one new calculation at each reset and this can be achieved using the Golden Section.

Known to the early Greeks, the number Γ = (1 + √5)/2 ≈ 1.618, called the Golden Section Ratio, can be obtained from the Fibonacci sequence.

A division of an interval X into X1 and X2 uses τ = 1/Γ = 0.618 and 1 − τ = 0.382.

For our section search this gives, with a1 = 0 and b1 = 2:

u1 = a1 + 0.382(b1 − a1) = 0.764, v1 = a1 + 0.618(b1 − a1) = 1.236

So the first interval is [0, 0.764, 1.236, 2]. Then, since f(v1) > f(u1),

a2 = a1 = 0, b2 = v1 = 1.236, u2 = a2 + 0.382(b2 − a2) = 0.472, v2 = a2 + 0.618(b2 − a2) = 0.764 = u1

Our interval is now [0, 0.472, 0.764, 1.236] and we only have one new function evaluation to perform. The Golden Section subdivision reduces the search interval by a factor of (1/Γ)^N in N iterations.
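The subdivision above can be sketched as a complete scalar search; the test function and tolerance here are my own illustrative choices:

```python
# Minimal golden-section line search: each iteration keeps one of the two
# interior points, so only one new function evaluation is needed.
import math

TAU = 2 / (1 + math.sqrt(5))  # 1/Gamma ~= 0.618

def golden_section(f, a, b, tol=1e-6):
    u = a + (1 - TAU) * (b - a)
    v = a + TAU * (b - a)
    fu, fv = f(u), f(v)
    while b - a > tol:
        if fv > fu:          # minimum lies in [a, v]
            b, v, fv = v, u, fu
            u = a + (1 - TAU) * (b - a)
            fu = f(u)
        else:                # minimum lies in [u, b]
            a, u, fu = u, v, fv
            v = a + TAU * (b - a)
            fv = f(v)
    return 0.5 * (a + b)

# Example: minimize (alpha - 1.3)^2 on the bracket [0, 2]
alpha = golden_section(lambda t: (t - 1.3) ** 2, 0.0, 2.0)
print(round(alpha, 4))  # -> 1.3
```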

Most univariate searches will combine a number of section subdivisions with a final polynomial fit, e.g. fit a quadratic through the three points

[χ_i, f(χ_i)], where {χ_1, χ_2, χ_3} = {x(k) + a_N d, x(k) + u_N d, x(k) + v_N d}

which we know encompass the minimum. Then we take the minimum of this quadratic function as the minimum along the search direction d. This minimum occurs at

α* = (1/2) (β_23 f_1 + β_31 f_2 + β_12 f_3) / (δ_23 f_1 + δ_31 f_2 + δ_12 f_3)

where β_ij = α_i^2 − α_j^2, δ_ij = α_i − α_j, with α_1 = a_N, α_2 = u_N, α_3 = v_N and f_i = f(χ_i).
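The three-point formula above can be checked directly; the sample function here is mine, chosen to be exactly quadratic so the fit recovers the true minimizer:

```python
# Minimum of the parabola through (a_i, f_i), i = 1..3, using the
# beta/delta formula from the slide.

def quad_min(a1, a2, a3, f1, f2, f3):
    beta = (a2**2 - a3**2) * f1 + (a3**2 - a1**2) * f2 + (a1**2 - a2**2) * f3
    delta = (a2 - a3) * f1 + (a3 - a1) * f2 + (a1 - a2) * f3
    return 0.5 * beta / delta

# Check on an exact quadratic f(t) = (t - 0.7)^2, sampled at 0, 0.5, 1:
f = lambda t: (t - 0.7) ** 2
print(round(quad_min(0.0, 0.5, 1.0, f(0.0), f(0.5), f(1.0)), 6))  # -> 0.7
```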

Methods for deciding a feasible direction

Direct search methods – These are methods not requiring evaluation of any derivatives of the objective function, either explicitly or by approximation.

1. Tabulation Methods

Assume you have an objective function f(x), with x ∈ R^n, for which we seek a minimum. Also assume we know the range X_i ≤ x_i ≤ X_i + d_i of each component x_i within which we can seek the minimum. Now we can divide each range into r equal sub-intervals and this gives a grid of (r + 1)^n points on which to evaluate f(x).

Method (i): We can evaluate the function at all these points and take the smallest function value as the minimum.

Method (ii): Instead of evaluating the function at all the grid points we use a random choice of point and maintain a record of the minimum achieved by the search to date. Such a random search may require 2.3(r + 1)^n selections to yield 90% confidence in finding the minimum on the grid.
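Method (ii) can be sketched as follows; the objective, grid ranges and seed are my own illustrative choices, and the loop count follows the 2.3 × (number of grid points) rule of thumb above:

```python
# Random sampling of a tabulation grid, keeping a running record of the
# best value seen so far.
import random

random.seed(0)

def f(x1, x2):
    return (x1 - 0.5) ** 2 + (x2 + 0.25) ** 2

r = 10                                          # sub-intervals per axis
grid = [i / r * 2 - 1 for i in range(r + 1)]    # each axis spans [-1, 1]

best_val, best_pt = float("inf"), None
for _ in range(int(2.3 * (r + 1) ** 2)):        # ~90% confidence selection count
    x1, x2 = random.choice(grid), random.choice(grid)
    if f(x1, x2) < best_val:
        best_val, best_pt = f(x1, x2), (x1, x2)

print(best_pt, best_val)
```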

2. Compass Search

This is essentially a series of univariate searches. Starting at iteration k with x(k), a search is done on x1(k) + α to find a minimum. Call this point x(k+1). Then we move on to x2(k+1) to search on x2(k+1) + α. This continues until we search along xn(k+n−1) + α, when on completion we have x(k+n) and we return to search along x1(k+n).

Note that the ‘Compass Search’ is a sequential search along the axes.
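The sequential axis-by-axis search can be sketched as below; the per-axis golden-section scalar search, step bracket and quadratic test function are my own illustrative choices:

```python
# Compass (coordinate) search: repeated scalar minimizations along each axis.

def line_min(f, x, axis, lo=-5.0, hi=5.0, tol=1e-8):
    # Golden-section search for the best step t along one coordinate axis.
    tau = 0.6180339887498949
    def g(t):
        y = list(x); y[axis] += t
        return f(y)
    a, b = lo, hi
    u, v = b - tau * (b - a), a + tau * (b - a)
    while b - a > tol:
        if g(u) < g(v):
            b, v = v, u
            u = b - tau * (b - a)
        else:
            a, u = u, v
            v = a + tau * (b - a)
    y = list(x); y[axis] += 0.5 * (a + b)
    return y

def f(x):
    return (x[0] - 1.0) ** 2 + 2.0 * (x[1] + 0.5) ** 2

x = [0.0, 0.0]
for _ in range(3):            # a few sweeps over both axes
    for axis in (0, 1):
        x = line_min(f, x, axis)
print([round(c, 4) for c in x])  # -> [1.0, -0.5]
```

For this separable quadratic one sweep already lands on the minimum; non-separable functions need repeated sweeps, which is why the slide's search cycles back to x1.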


3. The Simplex Method

Originally proposed by Spendley et al. in 1962, this method evaluates f(x) at (n+1) mutually equidistant points in R^n. These points are said to form a simplex. The basic iteration proceeds according to rules such as: determine the vertex at which f(x) takes the largest value, reflect this vertex in the centroid of the remaining n vertices, and form a new simplex. There is also an extended set of rules to deal with outcomes such as (i) the new point obtained by reflection gives a function value greater than the maximum in the original simplex, and (ii) if one vertex remains unchanged in a run of reflections then shrink the size of the simplex. In 1965 Nelder and Mead modified the method to allow expansion in the direction of the reflection.


Methods for deciding a feasible direction

Gradient search methods – These are methods requiring evaluation of the gradient of the objective function in determining the search direction d. The gradient direction at any point is the direction whose components are proportional to the partial derivatives of f(x) at that point.

1. Steepest Descent, SD, algorithm

The SD algorithm is defined by the iteration

x(k+1) = x(k) − α(k)g(k)

where α(k) is a nonnegative scalar minimizing f(x(k) − αg(k)).

On a quadratic optimization space the SD algorithm exhibits a well known and understood zig-zag path.

The convergence rate of the SD algorithm, when applied to a quadratic problem, depends on the ratio r = A/a of the largest eigenvalue of H to the smallest. The convergence is bounded by

((r − 1)/(r + 1))^2

so if r is large convergence will be slow.
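A sketch of SD with exact line search on a small quadratic of my own; for a quadratic the exact minimizing step along −g is α = g^t g / g^t A g, which is used below:

```python
# Steepest descent with exact line search on f(x) = 0.5 x'Ax + b'x.
A = [[4.0, 1.0],
     [1.0, 3.0]]
b = [-1.0, -2.0]

def grad(x):
    return [A[0][0] * x[0] + A[0][1] * x[1] + b[0],
            A[1][0] * x[0] + A[1][1] * x[1] + b[1]]

x = [0.0, 0.0]
for _ in range(50):
    g = grad(x)
    gg = g[0] ** 2 + g[1] ** 2
    if gg < 1e-24:
        break                                    # gradient ~ 0: converged
    Ag = [A[0][0] * g[0] + A[0][1] * g[1],
          A[1][0] * g[0] + A[1][1] * g[1]]
    alpha = gg / (g[0] * Ag[0] + g[1] * Ag[1])   # exact step for a quadratic
    x = [x[0] - alpha * g[0], x[1] - alpha * g[1]]

print([round(c, 4) for c in x])  # -> [0.0909, 0.6364], i.e. x* = -A^{-1} b
```

Plotting the iterates would show the zig-zag path described above; with this well-conditioned A (r ≈ 1.9) convergence is fast.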

2. Conjugate Gradient, CG, algorithm

Definition: Given a symmetric matrix Q, two vectors d1, d2 are said to be Q-orthogonal or conjugate with respect to Q if d1^t Q d2 = 0. We will abbreviate this to saying they are conjugate.

If Q > 0 then a finite set of conjugate vectors {d1, ..., dn} are linearly independent. The CG algorithm, when applied to a quadratic optimization, generates a sequence of search directions {d1, ..., dk, ...} that are conjugate. It is not too difficult to show that the CG algorithm optimizes the quadratic over the expanding sub-space described by the linear variety

x(0) + φ_k, where φ_k is the space spanned by {d1, ..., dk}

The CG algorithm is defined by the iterative process:

At k = 0: x(1) = x(0) − α(0)g(0) [WE START WITH AN SD STEP]

then for k ≥ 1 we use x(k+1) = x(k) + α(k)d(k), where the search direction is

d(k) = −g(k) + β(k)d(k−1)

and the scalar β(k) is

β(k) = g(k)^t g(k) / g(k−1)^t g(k−1)
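The iteration above (the Fletcher-Reeves form of β) can be sketched on the same kind of 2×2 quadratic used earlier (example data is mine); with exact line searches it should reach the minimizer of a quadratic on R^2 in at most 2 steps:

```python
# Conjugate Gradient on f(x) = 0.5 x'Ax + b'x with exact line searches.
A = [[4.0, 1.0],
     [1.0, 3.0]]
b = [-1.0, -2.0]

def matvec(M, v):
    return [M[0][0] * v[0] + M[0][1] * v[1],
            M[1][0] * v[0] + M[1][1] * v[1]]

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

x = [0.0, 0.0]
g = [matvec(A, x)[i] + b[i] for i in range(2)]   # g = Ax + b
d = [-g[0], -g[1]]                                # first step is an SD step
for _ in range(2):                                # n = 2 steps suffice
    Ad = matvec(A, d)
    alpha = -dot(g, d) / dot(d, Ad)               # exact line search on a quadratic
    x = [x[i] + alpha * d[i] for i in range(2)]
    g_new = [matvec(A, x)[i] + b[i] for i in range(2)]
    beta = dot(g_new, g_new) / dot(g, g)          # Fletcher-Reeves beta(k)
    d = [-g_new[i] + beta * d[i] for i in range(2)]
    g = g_new

print([round(c, 4) for c in x])  # -> [0.0909, 0.6364], i.e. -A^{-1} b
```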

3. Newton and Quasi-Newton algorithms

The idea behind Newton's method is that if the function f(x) is a quadratic then the iteration

x(k+1) = x(k) − H(k)^{-1} g(k)

will minimize the function. If f(x) is not quadratic we are not assured that the Hessian H(k) is positive definite (or indeed invertible). Also, on large problems this algorithm requires we compute the inverse Hessian, which can be a computational challenge. Davidon in 1959 proposed a clever scheme, later refined by Fletcher and Powell, that generates successive approximations to the inverse Hessian. In the case where the function is quadratic the sequence converges to the true inverse Hessian, and furthermore the sequence of search directions is the same as that generated by the Conjugate Gradient algorithm. This Quasi-Newton, QN, algorithm is described by:

At k = 0 select an initial estimate Q(0) to H^{-1}, which should be symmetric and positive definite [e.g. Q(0) = I].

Step 1. Set d(k) = −Q(k)g(k)

Step 2. Minimize f(x(k) + αd(k)) with respect to α ≥ 0 to obtain x(k+1), α(k), g(k+1)

Step 3. Set q(k) = g(k+1) − g(k), p(k) = α(k)d(k) and

Q(k+1) = Q(k) + p(k)p(k)^t / (p(k)^t q(k)) − Q(k)q(k)q(k)^t Q(k) / (q(k)^t Q(k)q(k))

then set k = k + 1 and return to step 1.

Both CG and QN methods offer the theoretical possibility of finding the minimum of a quadratic function on R^n in at most n steps. However, to achieve this requires exact scalar searches along each search direction. But for ill-conditioned systems these methods should be substantially faster than SD.

Discussion

1. Gradient based methods are much faster, especially when the problem is or becomes approximately quadratic. However, a mix of early searches using a direct search method followed by a swap to a gradient method may be a possibility.

2. To use a gradient based algorithm one should have analytical expressions for the components of the gradient. It is, however, possible to use numerical approximations to the derivatives, but this is at the expense of extra function evaluations and uncertainty about the accuracy of the estimates.

Things we did not explore

1. Constraints

minimize f(x), x ∈ R^n, where f(x) is a scalar function, subject to c_i(x) ≥ 0

2. Non-stationary problems and stochastic problems. There are many engineering problems where the cost function of an optimization problem changes say as we move through a scene. That is, the problem has a cost function fk(x) and we have to solve a sequence of optimization problems.

[Block diagram of the adaptive noise canceller: x(k) is filtered by the FIR h(k) and subtracted from y(k) to form the error e(k).]

The adaptive noise canceller – Minimize E{e(k)^2} with respect to the FIR h(k)

3. Recursive least squares estimators and the Kalman Filter

4. Global optimization using Simulated Annealing or Genetic Algorithms

Exercises

1. Explore how to select a fixed step size implementation of the Steepest Descent algorithm for computing the minimum of the quadratic function

f(x) = (1/2) x^t A x + b^t x + c, with A = A^t > 0

Find the allowable range for the constant α and investigate if there is an optimum choice for α in this range.

2. Polak and Ribiere proposed a variation on the Conjugate Gradient algorithm with

β(k) = (g(k+1) − g(k))^t g(k+1) / (g(k)^t g(k))

This variant has been

found to be effective in non-quadratic problems, especially with large numbers of variables, i.e. large n.

References

Luenberger, David. Linear and Nonlinear Programming, 2nd Edition, Addison-Wesley, 1984.

Boyd, Stephen. Convex Optimization, Lecture Notes and Videos, http://www.stanford.edu/class/ee364a/index.html

O'Leary, Dianne P. "Survival Guide for Optimization", September 2008, http://www.cs.umd.edu/~oleary/a607/survivalo.html

Kolda, T. G., Lewis, R. M., Torczon, V. "Optimization by Direct Search: New Perspectives on Some Classical and Modern Methods", SIAM Review, Vol. 45, No. 3, pp. 385-482, 2003.

Gould, N., Leyffer, S. "An Introduction to Algorithms for Nonlinear Optimization". In J. F. Blowey, A. W. Craig, and T. Shardlow, Frontiers in Numerical Analysis, pages 109-197, Springer Verlag, Berlin, 2003.
