

Computacion Inteligente

Derivative-Based Optimization


Contents

• Optimization problems

• Mathematical background

• Descent Methods

• The Method of Steepest Descent

• Conjugate Gradient


OPTIMIZATION PROBLEMS

Terms in Mathematical Optimization

1. Objective function – the mathematical function that is optimized by changing the values of the design variables.

2. Design variables – those variables which we, as designers, can change.

3. Constraints – functions of the design variables that establish limits on individual variables or combinations of design variables.

Problem Formulation

Three basic ingredients:
– an objective function,
– a set of decision variables,
– a set of equality/inequality constraints.

The problem is to search for the values of the decision variables that minimize the objective function while satisfying the constraints.

Mathematical Definition

– Design Variables: the decision vector x
– Constraints: equality and inequality
– Bounds: feasible ranges for the variables
– Objective Function: maximization can be converted to minimization due to the duality principle:

max f(x) = −min(−f(x))

The general form of the problem is

min_x y = f(x)    (objective y, decision vector x)
subject to  h(x) = 0,  g(x) ≤ 0    (constraints)
            x_L ≤ x ≤ x_U    (bounds)

Steps in the Optimization Process

1. Identify the quantity or function, f, to be optimized.

2. Identify the design variables: x1, x2, x3, …,xn.

3. Identify the constraints, if any exist:

a. Equalities

b. Inequalities

4. Adjust the design variables (x’s) until f is optimized and all of the constraints are satisfied.

Local and Global Optimum Designs

1. Objective functions may be unimodal or multimodal.

a. Unimodal – only one optimum
b. Multimodal – more than one optimum

2. Most search schemes are based on the assumption of a unimodal surface. The optimum determined in such cases is called a local optimum design.

3. The global optimum is the best of all local optimum designs.

Weierstrass Theorem

• Existence of global minimum

• If f(x) is continuous on the feasible set S which is closed and bounded, then f(x) has a global minimum in S

– A set S is closed if it contains all its boundary points.

– A set S is bounded if it is contained in the interior of some sphere, i.e. ‖x‖ ≤ c for all x in S, with c a finite number.

compact = closed and bounded

Example of an Objective Function

[Figure: surface plot of an objective function over x1 ∈ [−1, 1], x2 ∈ [−1, 1].]

Multimodal Objective Function

[Figure: surface over x1 ∈ [0, 1.5], x2 ∈ [0, 1], showing a local max and a saddle point.]

Optimization Approaches

• Derivative-based optimization (gradient based)

– Capable of determining "search directions" according to an objective function's derivative information

– steepest descent method;
– Newton's method (Newton–Raphson);
– conjugate gradient, etc.

• Derivative-free optimization

– random search method;
– genetic algorithm;
– simulated annealing, etc.


MATHEMATICAL BACKGROUND

Positive Definite Matrices

• A square matrix M is positive definite if

x^T M x > 0 for all x ≠ 0

• It is positive semidefinite if

x^T M x ≥ 0 for all x

The scalar x^T M x = ⟨x, Mx⟩ is called a quadratic form.

Positive Definite Matrices

• A symmetric matrix M = M^T is positive definite if and only if its eigenvalues λ_i > 0. (Semidefinite ↔ λ_i ≥ 0.)

– Proof (→): let v_i be the eigenvector for the i-th eigenvalue λ_i, so that

M v_i = λ_i v_i

– Then

0 < v_i^T M v_i = λ_i v_i^T v_i = λ_i ‖v_i‖²

– which implies λ_i > 0.

Exercise: prove that positive eigenvalues imply positive definiteness.
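The eigenvalue criterion is easy to check numerically. A minimal sketch in Python with NumPy (the library choice and the example matrix are our additions, not from the slides):

```python
import numpy as np

def is_positive_definite(M, tol=1e-12):
    """Symmetric M is positive definite iff all its eigenvalues are > 0."""
    eigenvalues = np.linalg.eigvalsh(M)  # eigvalsh: for symmetric matrices
    return bool(np.all(eigenvalues > tol))

# Example matrix with eigenvalues 1 and 3, hence positive definite
M = np.array([[2.0, -1.0],
              [-1.0, 2.0]])
print(is_positive_definite(M))   # True
print(is_positive_definite(-M))  # False: -M has eigenvalues -1 and -3

# Cross-check against the definition x^T M x > 0 for a nonzero x
x = np.array([0.3, -1.7])
print(x @ M @ x > 0)             # True
```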

Positive Definite Matrices

• Theorem: if a matrix M = U^T U, then it is positive definite.

• Proof: let f be defined as

f = x^T M x = x^T U^T U x

• If we can show that f is always positive, then M must be positive definite. Writing b = Ux, we have

f = (Ux)^T (Ux) = b^T b = Σ_i b_i² ≥ 0

• Provided that Ux gives a nonzero vector for all values of x except x = 0, f must always be positive.

Quadratic Functions

• f: R^n → R is a quadratic function if

f(x) = ½ x^T Q x − b^T x + c

– where Q is symmetric.

Quadratic Functions

• It is not necessary for Q to be symmetric. Suppose the matrix P is non-symmetric:

f(x) = Σ_{i=1..n} Σ_{j=1..n} p_ij x_i x_j = x^T P x

• The same quadratic is represented by a symmetric matrix:

f(x) = x^T Q x,  where q_ij = ½(p_ij + p_ji),  i.e.  Q = ½(P + P^T)

Q is symmetric.

Quadratic Functions

– Suppose the matrix P is non-symmetric. Example:

f(x) = ½ (2x_1² + 2x_1x_2 + 4x_1x_3 + 6x_2² + 4x_2x_3 + 5x_3²)

f(x) = ½ x^T P x,    P = | 2  2  4 |
                         | 0  6  4 |
                         | 0  0  5 |

     = ½ x^T Q x,    Q = | 2  1  2 |
                         | 1  6  2 |
                         | 2  2  5 |

Q is symmetric.
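The example can be verified in a few lines of Python with NumPy (our addition, not part of the slides): the non-symmetric P and the symmetrized Q = ½(P + P^T) give the same quadratic form.

```python
import numpy as np

P = np.array([[2.0, 2.0, 4.0],
              [0.0, 6.0, 4.0],
              [0.0, 0.0, 5.0]])   # non-symmetric matrix from the example
Q = 0.5 * (P + P.T)               # q_ij = (p_ij + p_ji)/2

# Q matches the symmetric matrix from the example
assert np.allclose(Q, [[2.0, 1.0, 2.0],
                       [1.0, 6.0, 2.0],
                       [2.0, 2.0, 5.0]])

# Both matrices represent the same quadratic form x^T P x = x^T Q x
rng = np.random.default_rng(1)
for _ in range(5):
    x = rng.standard_normal(3)
    assert np.isclose(x @ P @ x, x @ Q @ x)
print("P and Q define the same quadratic form")
```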

Quadratic functions

• Given the quadratic function

f(x) = ½ x^T Q x − b^T x + c

If Q is positive definite, then f is a parabolic “bowl.”

Quadratic functions

• Two other shapes can result from the quadratic form.

– If Q is negative definite, then f is an upside-down parabolic “bowl.”

– If Q is indefinite, then f describes a saddle.

Quadratic Functions

• Quadratics are useful in the study of optimization.

– Often, objective functions are “close to” quadratic near the solution.

– It is easier to analyze the behavior of algorithms when applied to quadratics.

– Analysis of algorithms for quadratics gives insight into their behavior in general.

One-Dimensional Derivative

• The derivative of f: R → R is the function f′: R → R given by

f′(x) = df(x)/dx = lim_{h→0} [f(x + h) − f(x)] / h

• if the limit exists.

Directional Derivatives

• Along the axes:

∂f(x, y)/∂x,    ∂f(x, y)/∂y

Directional Derivatives

• In a general direction v ∈ R², with ‖v‖ = 1:

∂f(x, y)/∂v

Directional Derivatives

[Figure: the slopes ∂f(x, y)/∂x and ∂f(x, y)/∂y of the surface along the coordinate axes.]

Directional Derivatives

• Definition: a real-valued function f: R^n → R is said to be continuously differentiable if the partial derivatives

∂f/∂x_1, …, ∂f/∂x_n

• exist for each x in R^n and are continuous functions of x.

• In this case, we say f ∈ C¹ (a smooth function, C¹).

The Gradient vector

• Definition: the gradient of f: R² → R in the plane is the function ∇f: R² → R² given by

∇f(x, y) := ( ∂f/∂x , ∂f/∂y )^T

The Gradient vector

• Definition: the gradient of f: R^n → R is a function ∇f: R^n → R^n given by

∇f(x_1, …, x_n) := ( ∂f/∂x_1 , …, ∂f/∂x_n )^T
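When no closed form is available, the gradient can be approximated component-wise by central differences. A small Python sketch (the function f and the test point are illustrative choices, not from the slides):

```python
import numpy as np

def numerical_gradient(f, x, h=1e-6):
    """Approximate grad f(x) by central differences, one component at a time."""
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = h
        grad[i] = (f(x + step) - f(x - step)) / (2.0 * h)
    return grad

# Illustrative function: the exact gradient is (2*x1 + 3*x2, 3*x1)
f = lambda x: x[0] ** 2 + 3.0 * x[0] * x[1]
print(numerical_gradient(f, [1.0, 2.0]))  # close to [8., 3.]
```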

The Gradient Properties

• The gradient defines the (hyper)plane approximating the function infinitesimally:

Δz = (∂f/∂x) Δx + (∂f/∂y) Δy

The Gradient Properties

• By the chain rule, for ‖v‖ = 1,

∂f/∂v (p) = ⟨∇f_p, v⟩

The Gradient Properties

• Proposition 1: ∂f/∂v (p) is maximal when we choose

v = ∇f_p / ‖∇f_p‖

intuitive: the gradient points at the greatest change direction.

Prove it!

The Gradient properties

• Proof:

– Assign

v = ∇f_p / ‖∇f_p‖

– by the chain rule:

∂f/∂v (p) = ⟨∇f_p, v⟩ = ⟨∇f_p, ∇f_p/‖∇f_p‖⟩ = ‖∇f_p‖² / ‖∇f_p‖ = ‖∇f_p‖

The Gradient properties

• Proof (continued):

– On the other hand, for a general v with ‖v‖ = 1:

∂f/∂v (p) = ⟨∇f_p, v⟩ ≤ ‖∇f_p‖ ‖v‖ = ‖∇f_p‖

– so the maximum is attained at v = ∇f_p / ‖∇f_p‖.

The Gradient Properties

• Proposition 2: let f: R^n → R be a smooth function (C¹) around p.

• If f has a local minimum (maximum) at p, then

∇f_p = 0

Intuitive: this is necessary for a local min (max).

The Gradient Properties

• Proof: intuitive

The Gradient Properties

• We found the best INFINITESIMAL DIRECTION at each point,

• Looking for minimum: “blind man” procedure

• How can we derive the way to the minimum using this knowledge?

Jacobian

• The derivative of f: R^n → R^m is the function Df: R^n → R^{m×n} whose entries are the partial derivatives ∂f_i/∂x_j; it is called the Jacobian.

Note that for f: R^n → R, we have ∇f(x) = Df(x)^T.

Derivatives

• If the derivative of ∇f exists, we say that f is twice differentiable.

– Write the second derivative as D²f (or F), and call it the Hessian of f.

Level Sets and Gradients

• The level set of a function f: Rn → R at level c is the set of points S = {x: f(x) = c}.

Level Sets and Gradients

• Fact: ∇f(x0) is orthogonal to the level set at x0

Level Sets and Gradients

• Proof of fact:

– Imagine a particle traveling along the level set.

– Let g(t) be the position of the particle at time t, with g(0) = x0.

– Note that f(g(t)) = constant for all t.

– Velocity vector g′(t) is tangent to the level set.

– Consider F(t) = f(g(t)). We have F′(0) = 0. By the chain rule,

0 = F′(0) = g′(0)^T ∇f(g(0)) = g′(0)^T ∇f(x0)

– Hence, ∇f(x0) and g′(0) are orthogonal.

Taylor's Formula

• Suppose f: R → R is in C¹. Then

f(x) = f(x0) + f′(x0)(x − x0) + o(x − x0)

– o(h) is a term such that o(h)/h → 0 as h → 0.

– At x0, f can be approximated by a linear function, and the approximation gets better the closer we are to x0.
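The o(·) property can be illustrated numerically: the first-order remainder divided by the step size goes to zero as the step shrinks. A Python sketch with an arbitrary C¹ function (our choice, not from the slides):

```python
import math

f, df = math.sin, math.cos   # an arbitrary smooth function and its derivative
x0 = 0.7

for h in (1e-1, 1e-2, 1e-3):
    remainder = f(x0 + h) - (f(x0) + df(x0) * h)   # f(x) minus its linear model
    print(h, abs(remainder) / h)                   # the ratio shrinks with h
```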

Taylor's Formula

• Suppose f: R → R is in C². Then

f(x) = f(x0) + f′(x0)(x − x0) + ½ f″(x0)(x − x0)² + o((x − x0)²)

– At x0, f can be approximated by a quadratic function.

Taylor's Formula

• Suppose f: R^n → R.

– If f is in C¹, then

f(x) = f(x0) + ∇f(x0)^T (x − x0) + o(‖x − x0‖)

– If f is in C², then

f(x) = f(x0) + ∇f(x0)^T (x − x0) + ½ (x − x0)^T F(x0) (x − x0) + o(‖x − x0‖²)

In What Direction does a Gradient Point?

• We already know that ∇f(x0) is orthogonal to the level set at x0.

– Suppose ∇f(x0) ≠ 0.

• Fact: ∇f points in the direction of increasing f.

Proof of Fact

• Consider xα = x0 + α∇f(x0), α > 0.

– By Taylor's formula,

f(xα) = f(x0) + α ∇f(x0)^T ∇f(x0) + o(α) = f(x0) + α ‖∇f(x0)‖² + o(α)

• Therefore, for sufficiently small α,

f(xα) > f(x0)


DESCENT METHODS

The Wolfe Theorem

• This theorem is the link from the previous gradient properties to the constructive algorithm.

• The problem:

min_x f(x)

The Wolfe Theorem

• We introduce a model algorithm:

Data: x_0 ∈ R^n
Step 0: set i = 0
Step 1: if ∇f(x_i) = 0 stop; else, compute a search direction h_i ∈ R^n
Step 2: compute the step size λ_i = argmin_{λ≥0} f(x_i + λh_i)
Step 3: set x_{i+1} = x_i + λ_i h_i and go to Step 1

The Wolfe Theorem

• The Theorem:

– Suppose f: R^n → R is C¹ smooth, and there exists a continuous function k: R^n → [0, 1] with

∀x: ∇f(x) ≠ 0 ⇒ k(x) > 0

– and the search vectors constructed by the model algorithm satisfy:

⟨h_i, ∇f(x_i)⟩ ≤ −k(x_i) ‖∇f(x_i)‖ ‖h_i‖

The Wolfe Theorem

– And

∇f(x_i) ≠ 0 ⇒ h_i ≠ 0

• Then

– if {x_i}_{i≥0} is the sequence constructed by the algorithm model,

– any accumulation point y of this sequence satisfies:

∇f(y) = 0

The Wolfe Theorem

• The theorem has a very intuitive interpretation: always go in a descent direction h_i, one with ⟨h_i, ∇f(x_i)⟩ < 0.

The principal differences between various descent algorithms lie in the procedure for determining the successive directions.


STEEPEST DESCENT

The Method of Steepest Descent

• We now use what we have learned to implement the most basic minimization technique.

• First we introduce the algorithm, which is a version of the model algorithm.

• The problem:

min_x f(x)

The Method of Steepest Descent

• Steepest descent algorithm:

Data: x_0 ∈ R^n
Step 0: set i = 0
Step 1: if ∇f(x_i) = 0 stop; else, compute the search direction h_i = −∇f(x_i)
Step 2: compute the step size λ_i = argmin_{λ≥0} f(x_i + λh_i)
Step 3: set x_{i+1} = x_i + λ_i h_i and go to Step 1

The Method of Steepest Descent

• Theorem:

– If {x_i}_{i≥0} is a sequence constructed by the SD algorithm, then every accumulation point y of the sequence satisfies:

∇f(y) = 0

– Proof: from the Wolfe theorem.

Remark: the Wolfe theorem gives us numerical stability even if the derivatives aren't given analytically (i.e. are calculated numerically).

The Method of Steepest Descent

• How long a step to take?

x_{i+1} = x_i + λh_i,  with search direction h_i = −∇f(x_i)

– We are limited to a line search: choose λ to minimize f along the ray, i.e. the point where the directional derivative is equal to zero.

The Method of Steepest Descent

• How long a step to take?

– From the chain rule:

d/dλ f(x_i + λh_i) = ⟨∇f(x_i + λh_i), h_i⟩ = 0

• Therefore ∇f(x_{i+1}) and h_i are orthogonal: each steepest-descent step is perpendicular to the previous one.
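For a quadratic f(x) = ½ x^T A x − b^T x the exact line search has a closed form, λ = (r^T r)/(r^T A r) with r = −∇f(x) = b − Ax, which makes the orthogonality of successive steps easy to see in code. A Python/NumPy sketch using the 2D matrix from the conjugate-gradient section (the start point is our choice):

```python
import numpy as np

A = np.array([[3.0, 2.0], [2.0, 6.0]])
b = np.array([2.0, -8.0])

x = np.array([-2.0, -2.0])       # arbitrary start point
prev_r = None
for _ in range(200):
    r = b - A @ x                # r = -grad f(x): the steepest-descent direction
    if np.linalg.norm(r) < 1e-10:
        break
    if prev_r is not None:
        assert abs(prev_r @ r) < 1e-6   # successive directions are orthogonal
    lam = (r @ r) / (r @ A @ r)  # exact line search for a quadratic
    x = x + lam * r
    prev_r = r

print(x)  # converges to the solution of Ax = b, close to [2., -2.]
```

The zig-zag created by these mutually perpendicular steps is exactly the inefficiency that conjugate gradient removes.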

The Method of Steepest Descent

Gradient Descent Example

Given:

f(x_1, x_2) = 2 sin(x_1) + 1.47 sin(x_2) + 0.34 sin(x_1) sin(x_2) + 1.9

Find the minimum when x_1 is allowed to vary from 0.5 to 1.5 and x_2 is allowed to vary from 0 to 2. (λ arbitrary.)

Optimum Steepest Descent Example

Given:

f(x_1, x_2) = 2 sin(x_1) + 1.47 sin(x_2) + 0.34 sin(x_1) sin(x_2) + 1.9

Find the minimum when x_1 is allowed to vary from 0.5 to 1.5 and x_2 is allowed to vary from 0 to 2.


CONJUGATE GRADIENT

Conjugate Gradient

• From now on we assume we want to minimize the quadratic function:

f(x) = ½ x^T A x − b^T x + c

• This is equivalent to solving the linear problem Ax = b:

∇f(x) = ½ A^T x + ½ A x − b = Ax − b = 0    (if A is symmetric)

Example: 2D linear system

• The solution is the intersection of the lines:

A = | 3  2 |      b = |  2 |      c = 0
    | 2  6 |          | −8 |

Example: 2D linear system

– Each ellipse is a level set on which f(x) is constant.

In general, the solution x lies at the intersection point of n hyperplanes, each having dimension n − 1.
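As a quick check (a Python/NumPy sketch, not part of the slides), solving the 2D system directly shows where the lines intersect:

```python
import numpy as np

A = np.array([[3.0, 2.0], [2.0, 6.0]])
b = np.array([2.0, -8.0])

x_star = np.linalg.solve(A, b)     # the minimizer of f solves Ax = b
print(x_star)                      # [ 2. -2.]
print(np.allclose(A @ x_star, b))  # True
```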

Conjugate Gradient

• What is the problem with steepest descent?

– We can repeat the same directions over and over…

• Wouldn’t it be better if, every time we took a step, we got it right the first time?

Conjugate Gradient

• What is the problem with steepest descent?

– We can repeat the same directions over and over…

• Conjugate gradient requires n gradient evaluations and n line searches.

Conjugate Gradient

• First, let's define the error. With x̃ the solution of

A x̃ = b

the error is

e_i = x_i − x̃

• e_i is a vector that indicates how far we are from the solution.

[Figure: iterates moving from the start point toward the solution.]

Conjugate Gradient

• Let's pick a set of orthogonal search directions d_0, d_1, …, d_{n−1} (they should span R^n).

– In each search direction, we'll take exactly one step,

x_{i+1} = x_i + α_i d_i

and that step will be just the right length to line up evenly with x̃.

Conjugate Gradient

– Unfortunately, this method only works if you already know the answer.

• Using the coordinate axes as search directions…

Conjugate Gradient

• We have A x̃ = b and x_{i+1} = x_i + α_i d_i, so

∇f(x) = Ax − b = Ax − A x̃

and, with e_i = x_i − x̃,

∇f(x_i) = A (x_i − x̃) = A e_i

Conjugate Gradient

• Given x_{i+1} = x_i + α_i d_i, how do we calculate α_i?

• e_{i+1} should be orthogonal to d_i:

d_i^T e_{i+1} = 0

Conjugate Gradient

• Given x_{i+1} = x_i + α_i d_i, how do we calculate α_i?

– That is, we require

d_i^T ∇f(x_{i+1}) = 0,  i.e.  d_i^T A e_{i+1} = 0

– Expanding e_{i+1} = e_i + α_i d_i gives d_i^T A (e_i + α_i d_i) = 0, so

α_i = − (d_i^T A e_i)/(d_i^T A d_i) = − (d_i^T ∇f(x_i))/(d_i^T A d_i)

Conjugate Gradient

• How do we find the coefficients of the expansion?

– Since the search vectors form a basis, the initial error can be expanded as

e_0 = Σ_{i=0}^{n−1} δ_i d_i

On the other hand, after j steps,

e_j = e_0 + α_0 d_0 + α_1 d_1 + … + α_{j−1} d_{j−1} = Σ_{i=0}^{n−1} δ_i d_i + Σ_{i=0}^{j−1} α_i d_i

Conjugate Gradient

• We want the error after n steps to be 0.

– Here is an idea: if α_j = −δ_j, then

e_j = Σ_{i=0}^{n−1} δ_i d_i + Σ_{i=0}^{j−1} α_i d_i = Σ_{i=j}^{n−1} δ_i d_i

So if α_j = −δ_j for j = 0, …, n−1:

e_n = 0

Conjugate Gradient

• So we look for search directions d_j such that

d_j^T A d_i = 0,  i ≠ j

(A-orthogonal, or conjugate, directions).

– A simple calculation shows that with such directions the step sizes α_i from the line search satisfy α_i = −δ_i, and the correct choice of directions is built from the gradients −∇f(x_i).

Conjugate gradient

• Conjugate gradient algorithm for minimizing f:

Data: x_0 ∈ R^n
Step 0: set d_0 := r_0 = −∇f(x_0), i = 0
Step 1: α_i = (r_i^T r_i)/(d_i^T A d_i)
Step 2: x_{i+1} = x_i + α_i d_i
Step 3: r_{i+1} := −∇f(x_{i+1}),  β_{i+1} = (r_{i+1}^T r_{i+1})/(r_i^T r_i),  d_{i+1} = r_{i+1} + β_{i+1} d_i
Step 4: set i := i + 1 and repeat n times
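The algorithm above can be transcribed almost line for line into Python/NumPy (a sketch, with r computed as b − Ax = −∇f(x)); on the 2D example system it reaches the solution in n = 2 steps:

```python
import numpy as np

def conjugate_gradient(A, b, x0):
    """Minimize f(x) = 1/2 x^T A x - b^T x for symmetric positive definite A."""
    x = np.asarray(x0, dtype=float)
    r = b - A @ x                        # r_0 = -grad f(x_0)
    d = r.copy()                         # d_0 = r_0
    for _ in range(len(b)):              # at most n steps in exact arithmetic
        if np.linalg.norm(r) < 1e-12:
            break
        alpha = (r @ r) / (d @ A @ d)    # step size
        x = x + alpha * d
        r_new = r - alpha * (A @ d)      # residual update, equals -grad f(x_new)
        beta = (r_new @ r_new) / (r @ r)
        d = r_new + beta * d             # next A-conjugate direction
        r = r_new
    return x

A = np.array([[3.0, 2.0], [2.0, 6.0]])
b = np.array([2.0, -8.0])
print(conjugate_gradient(A, b, [0.0, 0.0]))  # close to [2., -2.]
```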


Sources

• Jyh-Shing Roger Jang, Chuen-Tsai Sun and Eiji Mizutani, Slides for Ch. 5 of “Neuro-Fuzzy and Soft Computing: A Computational Approach to Learning and Machine Intelligence”, First Edition, Prentice Hall, 1997.

• Djamel Bouchaffra. Soft Computing. Course materials. Oakland University. Fall 2005

• Lecture slides, Soft Computing. Course materials. Dipartimento di Elettronica e Informazione, Politecnico di Milano, 2004.

• Jeen-Shing Wang, Course: Introduction to Neural Networks. Lecture notes. Department of Electrical Engineering. National Cheng Kung University. Fall, 2005


Sources

• Carlo Tomasi, Mathematical Methods for Robotics and Vision. Stanford University. Fall 2000

• Petros Ioannou, Jing Sun, Robust Adaptive Control. Prentice-Hall, Inc, Upper Saddle River: NJ, 1996

• Jonathan Richard Shewchuk, An Introduction to the Conjugate Gradient Method Without the Agonizing Pain. Edition 11/4. School of Computer Science. Carnegie Mellon University. Pittsburgh. August 4, 1994

• Gordon C. Everstine, Selected Topics in Linear Algebra. The George Washington University. 8 June 2004.
