
Page 1

Application of Matrix Differential Calculus for Optimization in Statistical Algorithm

By

Dr. Md. Nurul Haque Mollah, Professor,

Dept. of Statistics, University of Rajshahi,

Bangladesh

01/10/2011

Page 2

Outline

Introduction to the Optimization Problems in Statistics

Test of Convexity of Objective Function for Optimization

Nonlinear Optimization for Regression Analysis

Nonlinear Optimization for Maximum Likelihood Estimation

Nonlinear Optimization for PCA

Nonlinear Objective Function for Information Theory

Page 3

The main problem in parametric statistics is to find the unconditional or conditional optimizer of a nonlinear objective function.

Two classical nonlinear conditional optimization problems are the equality constrained problem

(ECP) optimize f(x)

subject to h(x) = 0

and the inequality constrained problem

(ICP) optimize f(x)

subject to g(x) ≤ 0 ( ≥0)

where f : Rn→R, h : Rn →Rm, g : Rn →Rr are given functions.

1. Introduction to the Optimization Problems in Statistics

Page 4

If the objective function is a profit function, then the optimizers are the maximizers of the objective function.

If the objective function is a loss function, then the optimizers are the minimizers of the objective function.

For example, the likelihood function of a statistical model can be regarded as a profit function, while any error or distance function, such as the sum of squared errors (SSE) in regression problems or the K-L (or any other) divergence between two distributions, can be regarded as a loss function.

Matrix derivatives play the key role in locating the optimizer of the objective function.

1. Introduction to the Optimization Problems in Statistics (Cont.)

Page 5

An overview of analytical and computational methods for the solution of the unconstrained problem (UP):

optimize f(x),

where f : Rn → R is a given function and x є Rn.

A vector x* is a local minimum for UP if there exists an ε>0 such that

f(x*) ≤ f(x) , for all x є S(x*; ε)

It is a strict local minimum if there exists an ε >0 such that

f(x*) < f(x) , for all x є S(x*; ε), x ≠ x*

1.1 Unconstrained Optimization

Page 6

Note: S(x*; ε) is an open sphere centered at x* є Rn with radius ε > 0 . That is

S(x*; ε)={x: |x-x*|< ε}, where x є Rn

A vector x* is a local maximum for UP if there exists an ε>0 such that

f(x) ≤ f(x*) , for all x є S(x*; ε)

It is a strict local maximum if there exists an ε > 0 such that

f(x) < f(x*) , for all x є S(x*; ε), x ≠ x*

1.1 Unconstrained Optimization (Cont.)

Page 7

We have the following well-known optimality conditions.

If f : Rn → R is differentiable (i.e., f є C1 on S(x*; ε) for some ε > 0) and x* is a (local) minimum, then

Df(x*) = 0,

i.e. the Jacobian (gradient) of f vanishes at x*.

If f is twice differentiable (i.e., f є C2 on S(x*; ε) for some ε > 0) and x* is a (local) minimum, then

x′ D²f(x*) x ≥ 0 for all x є Rn,

i.e. the Hessian D²f(x*) is positive semidefinite; if it is positive definite, then x* is a strict local minimum and f is convex near x*.

If x* is a (local) maximizer of f(x), then Df(x*) = 0 and

x′ D²f(x*) x ≤ 0 for all x є Rn,

i.e. the Hessian D²f(x*) is negative semidefinite; if it is negative definite, f is concave near x*.

1.1 Unconstrained Optimization (Cont.)
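As a quick illustration of these first- and second-order conditions (a minimal sketch with an assumed quadratic function of my own, not taken from the slides), the gradient and Hessian can be checked symbolically in Python with SymPy:

import sympy as sp

# Assumed example function f(x1, x2) = x1^2 + 2*x2^2 - 2*x1*x2 (not from the slides)
x1, x2 = sp.symbols('x1 x2')
f = x1**2 + 2*x2**2 - 2*x1*x2

grad = [sp.diff(f, v) for v in (x1, x2)]        # Jacobian Df(x)
hess = sp.hessian(f, (x1, x2))                  # Hessian D^2 f(x)

x_star = sp.solve(grad, (x1, x2))               # stationary point where Df(x*) = 0
print(x_star)                                   # {x1: 0, x2: 0}
print(hess.eigenvals())                         # both eigenvalues positive, so the
                                                # Hessian is positive definite and
                                                # x* = (0, 0) is a strict minimum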

Page 8

We consider first the following equality constrained optimization problem

(ECP) minimize f(x)

subject to h(x)=0,

where f : Rn → R and h : Rn → Rm are given functions and m ≤ n. The components of h are denoted h1, h2, …, hm.

Definition: Let x* be a vector such that h(x*) = 0 and, for some ε > 0, h є C1 on S(x*; ε). We say that x* is a regular point if the gradients Dh1(x*), …, Dhm(x*) are linearly independent.

Consider the Lagrangian function L : Rn+m → R defined by

L(x, λ) = f(x) + λ′h(x).

1.2 Constrained Optimization

Page 9

Let x* be a local minimum for (ECP), and assume that, for some ε > 0, f є C1, h є C1 on S(x*; ε), and x* is a regular point. Then there exists a unique vector λ* є Rm such that

Dx L(x*, λ*) = 0.

If in addition f є C2 and h є C2 on S(x*; ε), then

x′ D²xx L(x*, λ*) x ≥ 0 for all x є Rn with Dh(x*)′x = 0.

1.2 Constrained Optimization (Cont.)
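For a concrete (assumed) instance of (ECP), the sketch below minimizes f(x) = x1² + x2² subject to h(x) = x1 + x2 − 1 = 0 by solving Dx L(x, λ) = 0 together with the constraint; the example and variable names are mine, not the slides':

import sympy as sp

x1, x2, lam = sp.symbols('x1 x2 lam')
f = x1**2 + x2**2                      # assumed objective
h = x1 + x2 - 1                        # assumed equality constraint h(x) = 0
L = f + lam*h                          # Lagrangian L(x, lambda) = f(x) + lambda*h(x)

# First-order conditions: Dx L = 0 together with h(x) = 0
sol = sp.solve([sp.diff(L, x1), sp.diff(L, x2), h], (x1, x2, lam), dict=True)[0]
print(sol)                             # {x1: 1/2, x2: 1/2, lam: -1}

# Second-order check: D^2_xx L = 2*I is positive definite on the whole space,
# hence on {x : Dh(x*)'x = 0}, so (1/2, 1/2) is the constrained minimizer.
print(sp.hessian(L, (x1, x2)))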

Page 10

We now consider the following inequality constrained optimization problem

(ICP) minimize f(x)

subject to h(x) = 0, g(x) ≤ 0,

where f : Rn → R, h : Rn → Rm and g : Rn → Rr are given functions and m ≤ n.

Definition: Let x* be a vector such that h(x*) = 0, g(x*) ≤ 0 and, for some ε > 0, h є C1 and g є C1 on S(x*; ε). We say that x* is a regular point if the gradients Dh1(x*), …, Dhm(x*) and Dgj(x*), j є A(x*), are linearly independent, where A(x*) = { j : gj(x*) = 0 } is the set of active inequality constraints at x*.

Consider the Lagrangian function L : Rn+m+r → R defined by

L(x, λ, µ) = f(x) + λ′h(x) + µ′g(x).

1.2 Constrained Optimization (Cont.)

Page 11

Let x* be a local minimum for (ICP), and assume that, for some ε > 0, f є C1, h є C1, g є C1 on S(x*; ε), and x* is a regular point. Then there exist unique vectors λ* є Rm, µ* є Rr such that

Dx L(x*, λ*, µ*) = 0.

If in addition f є C2, h є C2 and g є C2 on S(x*; ε), then

x′ D²xx L(x*, λ*, µ*) x ≥ 0 for all x є Rn with Dh(x*)′x = 0 and Dgj(x*)′x = 0, j є A(x*).

1.2 Constrained Optimization (Cont.)

Page 12

For an optimization problem, it is important to check the convexity (or concavity) of the objective function f(x).

The Hessian matrix H = D²f(x) can be used to test the convexity of the objective function.

If all principal minors of H are non-negative (non-positive), then the function f(x) is convex (concave).

If all leading principal minors of H are positive, then f(x) is strictly convex; if they alternate in sign starting with |H1| < 0 (i.e., (−1)^k |Hk| > 0), then f(x) is strictly concave.

Note: H will always be a square symmetric matrix.

2. Test of Convexity of Objective Function for Optimization

Page 13

2. Test of Convexity of Objective Function for Optimization (Cont.)

Example: Examine the convexity of a quadratic function f(x1, x2, x3).

Ans: Form the Hessian H = D²f(x) and compute its leading principal minors |H1|, |H2| and |H3|. Since the principal minors here are not all positive, f(x) is not convex; their signs follow the alternating pattern required for a negative definite Hessian, so f(x) is concave.
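The leading-principal-minor test is easy to automate. The sketch below (an assumed Hessian of my own, not the slide's worked numbers) classifies a quadratic objective in Python:

import numpy as np

def leading_principal_minors(H):
    """Return [|H1|, |H2|, ..., |Hn|] for a square matrix H."""
    return [np.linalg.det(H[:k, :k]) for k in range(1, H.shape[0] + 1)]

# Assumed Hessian of a quadratic objective (not the slide's example)
H = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])

minors = leading_principal_minors(H)
print(minors)                                         # [4.0, 11.0, 18.0] -- all positive

if all(m > 0 for m in minors):
    print("all leading minors positive: f is strictly convex")
elif all((-1) ** (i + 1) * m > 0 for i, m in enumerate(minors)):
    print("leading minors alternate in sign starting negative: f is strictly concave")
else:
    print("neither pattern: check the eigenvalues of H for semidefiniteness")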

Page 14

3. Nonlinear Optimization for Regression Analysis

Unconditional Optimization

Let us consider a regression model

Y_{n×1} = X_{n×m} β_{m×1} + E_{n×1},

where
Y: vector of responses,
X: design matrix (matrix of explanatory variables),
β: vector of regression parameters,
E: vector of error terms.

The objective is to find the OLSE of β by minimizing the error sum of squares

EᵀE = (Y − Xβ)ᵀ(Y − Xβ)

with respect to β.

Page 15

Solution: Setting the derivative of the error sum of squares with respect to β to zero,

d(EᵀE)/dβ = 0,

gives

β̂ = (XᵀX)⁻¹XᵀY,

which is also known as the minimizer of f(β | Y, X) = EᵀE.

Sufficient condition: the Hessian matrix

H = ∂²(EᵀE)/∂β∂βᵀ |_{β=β̂} = 2XᵀX

should be positive definite.

It is an example of unconditional optimization.
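A numerical sketch of this unconditional optimization (synthetic data of my own, not from the slides): compute β̂ = (XᵀX)⁻¹XᵀY by solving the normal equations and confirm that the Hessian 2XᵀX is positive definite.

import numpy as np

rng = np.random.default_rng(0)
n, m = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, m - 1))])   # design matrix with intercept
beta_true = np.array([1.0, 2.0, -0.5])
Y = X @ beta_true + rng.normal(scale=0.3, size=n)                 # Y = X*beta + E

# Normal equations: (X'X) beta_hat = X'Y  (solve rather than forming an explicit inverse)
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
print("OLSE:", beta_hat)

# Second-order condition: the Hessian 2*X'X should be positive definite
H = 2 * X.T @ X
print("Hessian eigenvalues:", np.linalg.eigvalsh(H))              # all positive => minimum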

3. Nonlinear Optimization for Regression Analysis (Cont.)

Page 16

Conditional Optimization

Recall the previous linear regression model

Y_{n×1} = X_{n×m} β_{m×1} + E_{n×1}.

The objective is to find the OLSE of β by minimizing the error sum of squares

EᵀE = (Y − Xβ)ᵀ(Y − Xβ)

with respect to β, subject to the linear restriction

C_{k×m} β_{m×1} = d_{k×1},  k ≤ m.

3. Nonlinear Optimization for Regression Analysis (Cont.)

Page 17

Solution: Since the restriction is of the equality form, we can use the method of Lagrange's undetermined multipliers and minimize the Lagrangian function

F(β, λ) = (Y − Xβ)ᵀ(Y − Xβ) + λᵀ(Cβ − d),  λ = (λ1, λ2, …, λk)ᵀ,

with respect to β and λ. Setting ∂F/∂β = 0 and ∂F/∂λ = 0 gives the restricted estimator

β̃ = β̂_OLSE + (XᵀX)⁻¹Cᵀ[ C(XᵀX)⁻¹Cᵀ ]⁻¹(d − Cβ̂_OLSE).

Sufficient condition: the Hessian H = ∂²F/∂β∂βᵀ should be a positive definite matrix. That is,

UᵀHU > 0 for every nonzero U = (u1, u2, …, um)ᵀ є Rm.
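A numerical sketch of the restricted estimator (the data and the restriction Cβ = d are assumed for illustration, not taken from the slides):

import numpy as np

rng = np.random.default_rng(1)
n, m = 200, 3
X = rng.normal(size=(n, m))
Y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.5, size=n)

C = np.array([[1.0, 1.0, 1.0]])          # assumed restriction: beta1 + beta2 + beta3 = 6
d = np.array([6.0])

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y                                      # unrestricted OLSE
adjust = XtX_inv @ C.T @ np.linalg.solve(C @ XtX_inv @ C.T, d - C @ beta_hat)
beta_tilde = beta_hat + adjust                                    # restricted OLSE

print("unrestricted:", beta_hat)
print("restricted:  ", beta_tilde)
print("C @ beta_tilde =", C @ beta_tilde)                         # equals d (up to rounding)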

3. Nonlinear Optimization for Regression Analysis (Cont.)

Page 18

The likelihood function for the parameter vector θ = (θ1, θ2, …, θd) formed from the observed data X = (x1, x2, …, xn) is

L(θ | X) = ∏_{j=1}^{n} f(xj; θ).

The problem is to estimate the vector θ, that is, to find θ̂ = θ̂(x1, x2, …, xn), by maximizing the likelihood function.

Ans: The normal equations

∂L(θ | X)/∂θ = 0, or equivalently ∂log L(θ | X)/∂θ = 0,

produce θ̂, which will be the MLE if the Hessian matrix H(θ) = ∂²log L(θ | X)/∂θ∂θᵀ is negative definite.

4. Nonlinear Optimization for Maximum Likelihood Estimation (MLE)

Page 19

Under regularity conditions, the expected (Fisher) information matrix is given by

I(θ) = E[ S(θ; Y) S(θ; Y)ᵀ ] = −E[ ∂²log L(θ; Y)/∂θ∂θᵀ ],

where S(θ; y) = ∂log L(θ; y)/∂θ is the score function. Let I(θ; Y) = −∂²log L(θ; Y)/∂θ∂θᵀ denote the observed information.

The standard error for the estimate of θi is given by

SE(θ̂i) = [ I(θ̂)⁻¹ ]ii^{1/2}, i = 1, 2, …, d,

or

SE(θ̂i) = [ I(θ̂; Y)⁻¹ ]ii^{1/2}, i = 1, 2, …, d (for our convenience).

4. Nonlinear Optimization for MLE (Cont.)

Page 20

If the normal equations

S(θ; Y) = ∂log L(θ | Y)/∂θ = 0

take an implicit form, then the Newton-Raphson method or another iterative procedure is necessary for solving the normal equations. The Newton-Raphson iteration in this case is updated as follows:

θ^(k+1) = θ^(k) + [ I(θ^(k); Y) ]⁻¹ S(θ^(k); Y),

for k = 0, 1, 2, 3, …, until convergence. The stopping rule is as follows:

||θ^(k+1) − θ^(k)|| ≤ 0.00005 (say).

4. Nonlinear Optimization for MLE (Cont.)
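A minimal numerical sketch of this iteration (my own example, not from the slides): the normal equation for the shape parameter θ of a Gamma(θ, 1) sample is implicit, so it is solved by the update θ^(k+1) = θ^(k) + I(θ^(k); y)⁻¹ S(θ^(k); y).

import numpy as np
from scipy.special import digamma, polygamma

rng = np.random.default_rng(2)
y = rng.gamma(shape=3.0, scale=1.0, size=500)        # data with assumed true theta = 3
n, sum_log_y = len(y), np.log(y).sum()

def score(theta):
    return sum_log_y - n * digamma(theta)             # S(theta; y)

def info(theta):
    return n * polygamma(1, theta)                    # I(theta; y) = -d2 log L / d theta2

theta = y.mean()                                      # starting value theta^(0)
for k in range(100):
    theta_new = theta + score(theta) / info(theta)    # Newton-Raphson update
    if abs(theta_new - theta) <= 5e-5:                # stopping rule
        break
    theta = theta_new

print("MLE of theta:", theta_new, "after", k + 1, "iterations")
print("approx SE:", 1.0 / np.sqrt(info(theta_new)))   # [I(theta_hat)^{-1}]^{1/2}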

Page 21

In statistics, an expectation-maximization (EM) algorithm is a method for finding MLE or maximum a posteriori (MAP) estimates of parameters in statistical models, where the model depends on unobserved latent variables.

EM is an iterative method which alternates between performing an expectation (E) step, which computes the expectation of the log-likelihood evaluated using the current estimate for the parameters, and a maximization (M) step, which computes parameters maximizing the expected log-likelihood found on the E step.

These parameter-estimates are then used to determine the distribution of the latent variables in the next E step.

4.1 Nonlinear Optimization for MLE using EM algorithm

Page 22

Given a statistical model consisting of a set X of observed data, a set of unobserved latent data or missing values Z, and a vector of unknown parameters θ, along with a likelihood function

L(θ; X, Z) = p(X, Z | θ),

the MLE of the unknown parameters θ is determined by the marginal likelihood of the observed data,

L(θ; X) = p(X | θ) = Σ_Z p(X, Z | θ).

However, this quantity is often intractable. The EM algorithm seeks to find the MLE of the marginal likelihood by iteratively applying the following two steps:

4.1 Nonlinear Optimization for MLE using EM algorithm (Cont.)

Page 23

The EM algorithm seeks to find the MLE of the marginal likelihood by iteratively applying the following two steps:

Expectation step (E-step): Calculate the expected value of the log-likelihood function, with respect to the conditional distribution of Z given X under the current estimate of the parameters θ(t):

Q(θ | θ(t)) = E_{Z | X, θ(t)}[ log L(θ; X, Z) ] = Σ_z g(z | x, θ(t)) log L(θ; x, z).

Maximization step (M-step): Find the parameter that maximizes this quantity:

θ(t+1) = argmax_θ Q(θ | θ(t)).

Note that in typical models to which EM is applied, the complete-data likelihood can be written in exponential family form, so the M-step maximization is available in closed form.

4.1 Nonlinear Optimization for MLE using EM algorithm (Cont.)

Page 24

Let x = (x1, x2, …, xn) be a sample of n independent observations from a mixture of two multivariate normal distributions of dimension d, and let z = (z1, z2, …, zn) be the latent variables that determine the component from which each observation originates.

Xi | (Zi = 1) ~ Nd(μ1, Σ1) and Xi | (Zi = 2) ~ Nd(μ2, Σ2),

i = 1, 2, …, n.

Example: Gaussian mixture

4.1 Nonlinear Optimization for MLE using EM algorithm (Cont.)

Page 25

where P(Zi = 1) = τ1 and P(Zi = 2) = τ2 = 1 − τ1.

The aim is to estimate the unknown parameters θ = (τ, μ1, μ2, Σ1, Σ2), representing the "mixing" value between the Gaussians and the means and covariances of each, where the likelihood function is

L(θ; x, z) = p(x, z | θ) = ∏_{i=1}^{n} Σ_{j=1}^{2} I(zi = j) τj f(xi; μj, Σj),

with I(zi = j) an indicator function and f the density of a multivariate normal distribution.

4.1 Nonlinear Optimization for MLE using EM algorithm (Cont.)

Page 26

This may be rewritten in exponential family form.

E-step: Given our current estimate of the parameters θ(t), the conditional distribution of the Zi is determined by Bayes' theorem to be the proportional height of the normal density weighted by τ:

T_{j,i}^(t) = P(Zi = j | Xi = xi; θ(t)) = τj(t) f(xi; μj(t), Σj(t)) / [ τ1(t) f(xi; μ1(t), Σ1(t)) + τ2(t) f(xi; μ2(t), Σ2(t)) ], j = 1, 2.

Thus, the E-step results in the function Q(θ | θ(t)) = E_{Z | X, θ(t)}[ log L(θ; x, Z) ].

4.1 Nonlinear Optimization for MLE using EM algorithm (Cont.)

Page 27

M-step: Maximize Q(θ | θ(t)) with respect to θ, which contains τ, (μ1, Σ1) and (μ2, Σ2). Note that τ, (μ1, Σ1) and (μ2, Σ2) may all be maximized independently of each other, since they all appear in separate linear terms.

Firstly, consider τ, which has the constraint τ1 + τ2 = 1. This has the same form as the MLE for the binomial distribution, so

τj(t+1) = (1/n) Σ_{i=1}^{n} T_{j,i}^(t), j = 1, 2.

4.1 Nonlinear Optimization for MLE using EM algorithm (Cont.)

Page 28

For the next estimates of (μ1, Σ1), the terms involving μ1 have the same form as a weighted MLE for a normal distribution, so

μ1(t+1) = Σ_{i=1}^{n} T_{1,i}^(t) xi / Σ_{i=1}^{n} T_{1,i}^(t)

4.1 Nonlinear Optimization for MLE using EM algorithm (Cont.)

Page 29

and

Σ1(t+1) = Σ_{i=1}^{n} T_{1,i}^(t) (xi − μ1(t+1))(xi − μ1(t+1))ᵀ / Σ_{i=1}^{n} T_{1,i}^(t),

and, by symmetry,

μ2(t+1) = Σ_{i=1}^{n} T_{2,i}^(t) xi / Σ_{i=1}^{n} T_{2,i}^(t)  and  Σ2(t+1) = Σ_{i=1}^{n} T_{2,i}^(t) (xi − μ2(t+1))(xi − μ2(t+1))ᵀ / Σ_{i=1}^{n} T_{2,i}^(t).
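The following sketch (synthetic one-dimensional data; all names are mine, not the slides') puts the E-step weights T_{j,i} and these M-step updates together for a two-component Gaussian mixture:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.5, 200)])

tau = np.array([0.5, 0.5])             # mixing proportions
mu = np.array([-1.0, 1.0])             # component means (initial guesses)
sigma = np.array([1.0, 1.0])           # component standard deviations

for t in range(200):
    # E-step: posterior membership weights T[j, i] by Bayes' theorem
    dens = np.vstack([tau[j] * norm.pdf(x, mu[j], sigma[j]) for j in range(2)])
    T = dens / dens.sum(axis=0)

    # M-step: weighted MLEs for tau, mu and sigma
    n_j = T.sum(axis=1)
    tau_new = n_j / len(x)
    mu_new = (T @ x) / n_j
    sigma_new = np.sqrt((T * (x - mu_new[:, None]) ** 2).sum(axis=1) / n_j)

    if np.max(np.abs(mu_new - mu)) < 1e-6:             # stop when the means settle
        break
    tau, mu, sigma = tau_new, mu_new, sigma_new

print("tau:", tau_new, "mu:", mu_new, "sigma:", sigma_new)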

4.1 Nonlinear Optimization for MLE using EM algorithm (Cont.)

Page 30

Louis (1982) showed that the observed information I_obs is the difference of the complete information I_oc and the missing information I_om. That is,

I_obs(θ* | y_obs) = I_oc − I_om,

where

I_oc = − Σ_{j=1}^{n} E[ ∂²log L(θ | y_obs,j, z_j)/∂θ∂θᵀ | y_obs,j, θ ], evaluated at θ = θ*,

I_om is the missing information, obtained from the conditional expectations of products of the complete-data score vectors ∂log L_com,j(θ)/∂θ given the observed data, also evaluated at θ = θ*,

and θ* denotes the MLE of θ estimated using EM.

4.1 Nonlinear Optimization for MLE using EM algorithm (Cont.)

Page 31

Let Cov(X_{p×1}) = Σ of order p × p.

Let β be the p-component column vector such that β′β = 1.

Then the variance of β′X is

Var(β′X) = E(β′X X′β) = β′Σβ.              (1)

To determine the normalized linear combination β′X with maximum variance, we must find a vector β satisfying β′β = 1 which maximizes (1). Let

φ = β′Σβ − λ(β′β − 1),                      (2)

where λ is a Lagrange multiplier. The vector of partial derivatives is

∂φ/∂β = 2Σβ − 2λβ.                          (3)

5. Nonlinear Optimization for PCA

Page 32

Since β′Σβ and β′β have derivatives everywhere in a region containing β′β = 1, a vector β maximizing β′Σβ must satisfy expression (3) set equal to 0; that is,

(Σ − λI)β = 0.                              (4)

In order to get a solution of (4) with β′β = 1 we must have Σ − λI singular; in other words, λ must satisfy

|Σ − λI| = 0.                               (5)

The function |Σ − λI| is a polynomial in λ of degree p. Therefore (5) has p roots; let these be λ1 ≥ λ2 ≥ … ≥ λp. If we multiply (4) on the left by β′, we obtain

β′Σβ = λβ′β = λ.                            (6)

5. Nonlinear Optimization for PCA (Cont.)

Page 33

This shows that if β satisfies (4) (and β′β = 1), then the variance of β′X [given by (1)] is λ. Thus for the maximum variance we should use in (4) the largest root λ1. Let β(1) be a normalized solution of (Σ − λ1I)β = 0. Then U1 = β(1)′X is a normalized linear combination with maximum variance. [If Σ − λ1I is of rank p − 1, then there is only one solution to (Σ − λ1I)β = 0 with β′β = 1.]

Now let us find a normalized combination β′X that has maximum variance among all linear combinations uncorrelated with U1. Lack of correlation means

0 = E(β′X U1) = E(β′X X′β(1)) = β′Σβ(1) = λ1β′β(1),          (7)

since Σβ(1) = λ1β(1). Thus β′X is orthogonal to U1 in both the statistical sense (of lack of correlation) and the geometric sense (of the inner product of the vectors β and β(1) being zero). (That is, λ1β′β(1) = 0 only if β′β(1) = 0 when λ1 ≠ 0, and λ1 ≠ 0 if Σ ≠ 0; the case of Σ = 0 is trivial and is not treated.)

5. Nonlinear Optimization for PCA (Cont.)

Page 34

We now want to maximize

φ2 = β′Σβ − λ(β′β − 1) − 2ν1 β′Σβ(1),                        (8)

where λ and ν1 are Lagrangian multipliers. The vector of partial derivatives is

∂φ2/∂β = 2Σβ − 2λβ − 2ν1Σβ(1),                               (9)

and we set this equal to 0. From (9) we obtain, by multiplying on the left by β(1)′,

0 = 2β(1)′Σβ − 2λβ(1)′β − 2ν1β(1)′Σβ(1) = −2ν1λ1              (10)

by (7). Therefore ν1 = 0, β must satisfy (4), and therefore λ must satisfy (5). Let λ(2) be the maximum of λ1, λ2, …, λp such that there is a vector β satisfying (Σ − λ(2)I)β = 0, β′β = 1 and (7); call this vector β(2) and the corresponding linear combination U2 = β(2)′X. (It will be shown eventually that λ(2) = λ2. We define λ(1) = λ1.)

5. Nonlinear Optimization for PCA (Cont.)

Page 35

This procedure is continued; at the (r + 1)st step, we want to find a vector β such that β′X has maximum variance among all normalized linear combinations which are uncorrelated with U1, U2, …, Ur, that is, such that

0 = E(β′X Ui) = E(β′X X′β(i)) = β′Σβ(i) = λ(i)β′β(i), i = 1, 2, …, r.          (11)

We want to maximize

φ_{r+1} = β′Σβ − λ(β′β − 1) − 2 Σ_{i=1}^{r} νi β′Σβ(i),                          (12)

where λ and ν1, ν2, …, νr are Lagrangian multipliers. The vector of partial derivatives is

∂φ_{r+1}/∂β = 2Σβ − 2λβ − 2 Σ_{i=1}^{r} νi Σβ(i),                                (13)

and we set this equal to 0. Multiplying (13) on the left by β(j)′, we obtain

0 = 2β(j)′Σβ − 2λβ(j)′β − 2νj β(j)′Σβ(j) = −2νj λ(j).                             (14)

If λ(j) ≠ 0, this gives −2νj λ(j) = 0 and νj = 0. If λ(j) = 0, then Σβ(j) = λ(j)β(j) = 0 and the j-th term in the sum in (13) vanishes. Thus β must satisfy (4), and therefore λ must satisfy (5). Thus we can compute all principal and minor components sequentially by nonlinear optimization using Lagrangian multipliers.
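Since the stationarity condition (4) says that each β(r) is an eigenvector of Σ with eigenvalue λ(r), the whole sequence of components can be read off from an eigendecomposition. A minimal numerical sketch (assumed covariance matrix, not from the slides):

import numpy as np

Sigma = np.array([[4.0, 2.0, 0.6],
                  [2.0, 3.0, 0.5],
                  [0.6, 0.5, 1.0]])          # assumed covariance matrix of X (p = 3)

eigvals, eigvecs = np.linalg.eigh(Sigma)     # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]            # sort so lambda_1 >= lambda_2 >= ... >= lambda_p
lambdas, betas = eigvals[order], eigvecs[:, order]

for r in range(Sigma.shape[0]):
    beta_r = betas[:, r]                                      # beta^(r), with beta'beta = 1
    print(f"U_{r+1}: variance = {lambdas[r]:.4f}, beta = {np.round(beta_r, 3)}")
    # Check the stationarity condition (Sigma - lambda*I) beta = 0 from (4)
    assert np.allclose(Sigma @ beta_r, lambdas[r] * beta_r)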

5. Nonlinear Optimization for PCA (Cont.)

Page 36

Differential Entropy:

The differential entropy H of a random vector y with density p(y) is defined as

H(y) = −∫ p(y) log p(y) dy.

A normalized version of entropy is given by the negentropy J, which is defined as follows:

J(y) = H(y_gauss) − H(y) ≥ 0,

where y_gauss is a Gaussian random vector with the same covariance matrix as y, V(y_gauss) = V(y). Equality holds only for Gaussian random vectors.

6. Nonlinear Objective Function for Information Theory

Page 37

Mutual Information:

Mutual information between m random variables yi, i = 1, 2, …, m, is defined as follows:

I(y1, y2, …, ym) = ∫ p(y) log [ p(y) / ∏_{i=1}^{m} pi(yi) ] dy = D_KL( p(y), ∏_{i=1}^{m} pi(yi) ) = Σ_{i=1}^{m} H(yi) − H(y) ≥ 0,

where y = (y1, y2, …, ym)ᵀ. Equality holds if y1, y2, …, ym are independent.

Let Y = B_{m×m} X_{m×1}. Then

I(y1, y2, …, ym) = Const. − Σ_{i=1}^{m} J(yi),

where the constant term does not depend on B.

6. Nonlinear Objective Function for Information Theory (Cont.)

Page 38

In signal processing, blind source signals (BSS) generally follow non-Gaussian distributions and are independent of each other.

So maximization of negentropy, or equivalently minimization of mutual information, is a popular technique for recovering blind source signals in image processing and audio signal processing.

6. Nonlinear Objective Function for Information Theory (Cont.)

Page 39

Gradient Descent:

To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient of the function at the current point. If instead one takes steps proportional to the positive of the gradient, one approaches a local maximum of that function; the procedure is then known as gradient ascent. Gradient descent is also known as steepest descent.

In gradient descent, we minimize a function J(w) iteratively by starting from some initial point w(0), computing the gradient of J(w) at this point, and then moving in the direction of the negative gradient, or the steepest descent, by a suitable distance. Once there, we repeat the same procedure at the new point, and so on.

6.1 An Unconstrained Optimization for Information Theory

Page 40

Gradient Descent:

In gradient descent, we minimize a function J(w) iteratively by starting from some initial point w(0), computing the gradient of J(w) at this point, and then moving in the direction of the negative gradient, or the steepest descent, by a suitable distance. Once there, we repeat the same procedure at the new point, and so on. For t = 1, 2, … we have the update rule

w(t) = w(t−1) − α(t) ∂J(w)/∂w |_{w = w(t−1)}.          (4.1)

We can then write the rule (4.1) either as

∆w = −α ∂J(w)/∂w

or even shorter as

∆w ∝ −∂J(w)/∂w.
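A minimal sketch of the update rule (4.1) on an assumed quadratic objective J(w) = ½ wᵀAw − bᵀw (my own example, not from the slides):

import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])                    # positive definite => J has a unique minimum
b = np.array([1.0, 1.0])

def grad_J(w):
    return A @ w - b                          # dJ(w)/dw

w = np.zeros(2)                               # initial point w(0)
alpha = 0.1                                   # learning rate
for t in range(1, 1001):
    w_new = w - alpha * grad_J(w)             # steepest-descent update (4.1)
    if np.linalg.norm(w_new - w) < 1e-10:     # stopping rule
        break
    w = w_new

print("converged w:", w_new)
print("exact minimizer:", np.linalg.solve(A, b))   # gradient zero at A w = b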

6.1 An Unconstrained Optimization for Information Theory

Page 41

The symbol ∝ is read "is proportional to"; it is then understood that the vector on the left-hand side, ∆w, has the same dimension as the gradient vector on the right-hand side, but there is a positive scalar coefficient by which the length can be adjusted.

In the upper version of the update rule, this coefficient is denoted by α. In many cases, this learning rate can and should in fact be time dependent. Yet a third very convenient way to write such update rules, in conformity with programming languages, is

w ← w − α ∂J(w)/∂w,

where the symbol ← means substitution, i.e., the value of the right-hand side is computed and substituted in w.

6.1 An Unconstrained Optimization for Information Theory (Cont.)

Page 42

Generally, the speed of convergence can be quite low close to the minimum point, because the gradient approaches zero there. The speed can be analyzed as follows. Let us denote by w* the local or global minimum point where the algorithm will eventually converge. From (4.1) we have

w(t) − w* = w(t−1) − w* − α ∂J(w)/∂w |_{w = w(t−1)}.

Let us expand the gradient vector ∂J(w)/∂w element by element as a Taylor series around the point w*, using only the zeroth- and first-order terms. We have for the ith element

∂J(w)/∂wi |_{w = w(t−1)} ≈ ∂J(w)/∂wi |_{w = w*} + Σ_{j=1}^{m} ∂²J(w)/∂wi∂wj |_{w = w*} [ wj(t−1) − wj* ].

6.1 An Unconstrained Optimization for Information Theory (Cont.)

Page 43

Now, because w* is the point of convergence, the partial derivatives of the cost function must be zero at w*. Using this result and compiling the above expansion into vector form yields

∂J(w)/∂w |_{w = w(t−1)} ≈ H(w*)[ w(t−1) − w* ],

where H(w*) is the Hessian matrix computed at the point w = w*. Substituting this into the expression for w(t) − w* gives

w(t) − w* ≈ [ I − αH(w*) ][ w(t−1) − w* ].

This kind of convergence, which is essentially equivalent to multiplying a matrix many times with itself, is called linear. The speed of convergence depends on the learning rate and the size of the Hessian matrix.
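This contraction can be checked numerically; in the sketch below (assumed Hessian and learning rate, not from the slides) the error norm shrinks each step by roughly the spectral radius of I − αH(w*):

import numpy as np

H = np.array([[3.0, 1.0],
              [1.0, 2.0]])                       # assumed Hessian at w*
alpha = 0.1

M = np.eye(2) - alpha * H
rate = np.max(np.abs(np.linalg.eigvalsh(M)))     # spectral radius of I - alpha*H
print("per-step contraction factor:", rate)      # < 1 => linear convergence

err = np.array([1.0, -1.0])                      # initial error w(0) - w*
for t in range(1, 6):
    err = M @ err
    print(t, np.linalg.norm(err))                # shrinks by roughly 'rate' each iteration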

6.1 An Unconstrained Optimization for Information Theory (Cont.)