
Page 1: Part  3:  Estimation  of  Parameters

Part 3: Estimation of Parameters

METU Informatics Institute

Min720

Pattern Classification with Bio-Medical Applications

Lecture Notes

by

Neşe Yalabık

Spring 2011

Page 2: Part  3:  Estimation  of  Parameters

Estimation of Parameters

• Most of the time we have random samples, but the densities themselves are not given.

• If the parametric form of the densities is given or assumed, then the parameters can be estimated using the labeled samples (supervised learning).

Maximum Likelihood Estimation of Parameters

• Assume we have a sample set $D_j = \{X_1, X_2, \ldots, X_n\}$ of samples belonging to a given class, drawn independently from $P(X \mid \omega_j)$ (i.i.d.: independently drawn, identically distributed random variables).

Page 3: Part  3:  Estimation  of  Parameters

• $\theta_j = [t_1, t_2, \ldots, t_p]^T$ (the unknown parameter vector); for a Gaussian, $\theta_j = (M_j, \Sigma_j)$.

• The density function $P(X \mid \omega_j, \theta_j)$ is assumed to be of known form.

• So our problem: estimate $\theta_j$ using the sample set $D_j = \{X_1, X_2, \ldots, X_n\}$.

• Now drop the subscript $j$ and assume a single density function $P(X \mid \theta)$.

• $\hat{\theta}$: the estimate of $\theta$.

• Anything can be an estimate. What is a good estimate?

• It should converge to the actual values.

• It should be unbiased, etc.

• Consider the joint density of the sample set (which factors due to statistical independence):

$L(\theta) = P(D \mid \theta) = \prod_{i=1}^{n} P(X_i \mid \theta)$

• $L(\theta)$ is called the "likelihood function".

• $\hat{\theta}$ is the value of $\theta$ that maximizes $L(\theta)$ (it best agrees with the drawn samples).

Page 4: Part  3:  Estimation  of  Parameters

If $\theta$ is a scalar (a single unknown parameter), then find $\hat{\theta}$ such that

$\frac{dL}{d\theta} = 0$

and solve for $\theta$.

When $\theta = (t_1, t_2, \ldots, t_p)$ is a vector, then

$\nabla_\theta L = \left[ \frac{\partial L}{\partial t_1}, \ldots, \frac{\partial L}{\partial t_p} \right]^T = 0$

where $\nabla_\theta L$ is the gradient of $L$ with respect to $\theta$.

Therefore

$\hat{\theta} = \arg\max_\theta L(\theta)$

or, using the log-likelihood $l(\theta) = \ln L(\theta)$,

$\hat{\theta} = \arg\max_\theta \ln L(\theta) = \arg\max_\theta l(\theta)$

(Be careful not to find a minimum with the derivatives.)
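When no convenient closed form exists, the same argmax can be found numerically. The sketch below is illustrative and not from the lecture notes: it assumes NumPy/SciPy are available and uses a univariate Gaussian whose negative log-likelihood is minimized; all names and data are made up for the example.

```python
# A minimal numerical MLE sketch (an illustration, not the lecture's method):
# maximize l(theta) = ln L(theta) by minimizing its negative with SciPy.
# The univariate Gaussian model and all names here are assumptions.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=1.5, size=200)          # sample set D = {x_1, ..., x_n}

def neg_log_likelihood(theta, x):
    mu, sigma = theta
    if sigma <= 0:                                     # keep the parameters valid
        return np.inf
    # l(theta) = sum_i ln p(x_i | theta) for a univariate Gaussian
    return -np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2))

res = minimize(neg_log_likelihood, x0=[0.0, 1.0], args=(X,), method="Nelder-Mead")
print("numerical MLE :", res.x)
print("analytic MLE  :", X.mean(), X.std())            # sample mean and (biased) std
```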

Page 5: Part  3:  Estimation  of  Parameters

Example 1:

Consider an exponential distribution (single feature, single parameter):

$f(x; \theta) = \theta e^{-\theta x}$ for $x \ge 0$, and $f(x; \theta) = 0$ otherwise.

With a random sample $\{X_1, X_2, \ldots, X_n\}$:

$L(\theta) = f(X_1, X_2, \ldots, X_n; \theta) = \prod_{i=1}^{n} \theta e^{-\theta X_i} = \theta^n e^{-\theta \sum_{i=1}^{n} X_i}$

$l(\theta) = \ln L(\theta) = n \ln \theta - \theta \sum_{i=1}^{n} X_i$

$\frac{dl}{d\theta} = \frac{n}{\theta} - \sum_{i=1}^{n} X_i = 0$

$\hat{\theta} = \frac{n}{\sum_{i=1}^{n} X_i} = \frac{1}{\bar{x}}$ (the inverse of the sample average).
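A quick check of Example 1 (a sketch under assumed, simulated data; not part of the notes): draw samples from an exponential density and compare the closed-form estimate $1/\bar{x}$ with a brute-force search over the log-likelihood.

```python
# Example 1 sketch (illustrative data; the true rate below is an assumption):
# the MLE of an exponential rate is theta_hat = n / sum(x) = 1 / mean(x).
import numpy as np

rng = np.random.default_rng(1)
true_theta = 0.5
X = rng.exponential(scale=1.0 / true_theta, size=500)   # f(x; theta) = theta * exp(-theta * x)

theta_hat = 1.0 / X.mean()                               # closed-form MLE

# brute-force check of l(theta) = n * ln(theta) - theta * sum(x)
grid = np.linspace(0.01, 2.0, 2000)
log_lik = len(X) * np.log(grid) - grid * X.sum()
print("closed form:", theta_hat, "  grid argmax:", grid[np.argmax(log_lik)])
```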

Page 6: Part  3:  Estimation  of  Parameters

Example 2:

• Multivariate Gaussian with unknown mean vector $M$. Assume $\Sigma$ is known.

• $k$ samples drawn i.i.d. from the same distribution: $X_1, X_2, \ldots, X_k$.

$L(M) = \prod_{i=1}^{k} \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \, e^{-\frac{1}{2}(X_i - M)^T \Sigma^{-1} (X_i - M)}$

$l(M) = \log L(M) = \sum_{i=1}^{k} \left( -\frac{n}{2}\log(2\pi) - \frac{1}{2}\log|\Sigma| - \frac{1}{2}(X_i - M)^T \Sigma^{-1} (X_i - M) \right)$

Taking the gradient with respect to $M$ (linear algebra):

$\nabla_M l = \sum_{i=1}^{k} \Sigma^{-1}(X_i - M) = 0$

Page 7: Part  3:  Estimation  of  Parameters

• Solving gives $\hat{M} = \frac{1}{k}\sum_{i=1}^{k} X_i$ (the sample average, or sample mean).

• Estimation of $\Sigma$ when it is unknown (do it yourself: not so simple):

$\hat{\Sigma} = \frac{1}{k}\sum_{i=1}^{k} (X_i - \hat{M})(X_i - \hat{M})^T$ : the sample covariance,

where $\hat{M}$ is the same as above.

• This is a biased estimate: $E[\hat{\sigma}^2] = \frac{k-1}{k}\,\sigma^2$.

• Use $\frac{1}{k-1}$ in place of $\frac{1}{k}$ for an unbiased estimate.
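A minimal sketch of Example 2 (simulated data, not from the notes): the ML estimates are the sample mean and the biased sample covariance; dividing by $k-1$ instead gives the unbiased version.

```python
# Example 2 sketch (the data and parameters below are assumptions):
# ML estimates of a Gaussian are the sample mean and the biased sample covariance.
import numpy as np

rng = np.random.default_rng(2)
X = rng.multivariate_normal(mean=[1.0, -2.0], cov=[[2.0, 0.5], [0.5, 1.0]], size=300)
k = X.shape[0]

M_hat = X.mean(axis=0)                               # (1/k) * sum of X_i
centered = X - M_hat
Sigma_mle = (centered.T @ centered) / k              # biased ML estimate (divide by k)
Sigma_unbiased = (centered.T @ centered) / (k - 1)   # unbiased estimate (divide by k-1)

print(M_hat)
print(Sigma_mle)                                     # same as np.cov(X.T, bias=True)
```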

Page 8: Part  3:  Estimation  of  Parameters

Example 3:

Binary variables with unknown parameters $p_i = P(x_i = 1)$, $1 \le i \le n$ ($n$ parameters).

So,

$P(X) = \prod_{i=1}^{n} p_i^{x_i} (1 - p_i)^{1 - x_i}$

$\log P(X) = \sum_{i=1}^{n} \left[ x_i \log p_i + (1 - x_i) \log(1 - p_i) \right]$

For $k$ samples $\{X_1, \ldots, X_k\}$:

$l = \log L = \sum_{j=1}^{k} \log P(X_j) = \sum_{j=1}^{k} \sum_{i=1}^{n} \left[ x_{ij} \log p_i + (1 - x_{ij}) \log(1 - p_i) \right]$

where $x_{ij}$ is the $i$'th element of the $j$'th sample $X_j$.

Page 9: Part  3:  Estimation  of  Parameters

• So, setting the gradient to zero,

$\nabla_p \log L = \left[ \frac{\partial \log L}{\partial p_1}, \ldots, \frac{\partial \log L}{\partial p_n} \right]^T = 0$

$\frac{\partial \log L}{\partial p_i} = \sum_{j=1}^{k} \left( \frac{x_{ij}}{p_i} - \frac{1 - x_{ij}}{1 - p_i} \right) = 0$

$\frac{1}{\hat{p}_i}\sum_{j=1}^{k} x_{ij} - \frac{1}{1 - \hat{p}_i}\sum_{j=1}^{k} (1 - x_{ij}) = 0$

$\hat{p}_i = \frac{1}{k}\sum_{j=1}^{k} x_{ij}$

• $\hat{p}_i$ is the sample average of the $i$'th feature.

Page 10: Part  3:  Estimation  of  Parameters

• Since $x_i$ is binary, $\sum_{j=1}^{k} x_{ij}$ is the same as counting the occurrences of '1'.

• Consider a character recognition problem with binary matrices.

• For each pixel, count the number of 1's over the $k$ samples (divided by $k$): this is the estimate of $p_i$.

(Figure: binary pixel matrices of sample characters 'A' and 'B'.)
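A minimal sketch of Example 3 (the grid size and simulated data are assumptions, not from the notes): with binary pixel features, $\hat{p}_i$ is simply the fraction of training images in which pixel $i$ equals 1.

```python
# Example 3 sketch (illustrative 7x5 binary character grid; data is simulated):
# the MLE of p_i is the fraction of samples in which pixel i equals 1.
import numpy as np

rng = np.random.default_rng(3)
k, n = 50, 7 * 5                                 # 50 binary images on a 7x5 grid
true_p = rng.uniform(0.1, 0.9, size=n)
X = (rng.random((k, n)) < true_p).astype(int)    # each row is one sample X_j

p_hat = X.mean(axis=0)                           # p_i_hat = (1/k) * sum_j x_ij
print(p_hat.reshape(7, 5).round(2))              # estimated per-pixel probabilities
```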

Page 11: Part  3:  Estimation  of  Parameters

Part 4: Features and Feature Extraction

METU Informatics Institute

Min720

Pattern Classification with Bio-Medical Applications

Lecture Notes

by

Neşe Yalabık

Spring 2011

Page 12: Part  3:  Estimation  of  Parameters

Problems of Dimensionality and Feature Selection

• Are all features independent? Especially with binary features, we might have more than 100 of them.

• The classification accuracy vs. size of feature set.

• Consider the Gaussian case with the same $\Sigma$ for both categories (and assuming the a priori probabilities are equal). The probability of error is

$P(e) = \frac{1}{\sqrt{2\pi}} \int_{r/2}^{\infty} e^{-u^2/2} \, du$

where $r^2$ is the square of the Mahalanobis distance between the class means $M_1$ and $M_2$:

$r^2 = (M_1 - M_2)^T \Sigma^{-1} (M_1 - M_2)$
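A small sketch of this error bound (the means and covariance below are assumptions): the integral is the standard normal tail beyond $r/2$, available as `scipy.stats.norm.sf`.

```python
# Bayes error sketch for two equal-covariance Gaussian classes with equal priors
# (the means and covariance here are assumptions, not from the notes).
import numpy as np
from scipy.stats import norm

M1 = np.array([0.0, 0.0])
M2 = np.array([2.0, 1.0])
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])

r2 = (M1 - M2) @ np.linalg.inv(Sigma) @ (M1 - M2)   # squared Mahalanobis distance
P_e = norm.sf(np.sqrt(r2) / 2.0)                    # (1/sqrt(2*pi)) * integral_{r/2}^inf e^{-u^2/2} du
print(r2, P_e)
```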

Page 13: Part  3:  Estimation  of  Parameters

• P(e) decreases as r increases (the distance between the means).

• If $\Sigma = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_d^2)$ (all features statistically independent), then

$r^2 = \sum_{i=1}^{d} \left( \frac{m_{i1} - m_{i2}}{\sigma_i} \right)^2$

• We conclude from here that:

• 1- The most useful features are the ones with widely separated means and small variance.

• 2- Each feature contributes to reducing the probability of error.
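A small follow-up sketch (the feature values below are assumptions): with independent features each term adds to $r^2$, so every additional informative feature lowers the error bound.

```python
# Independent-feature sketch (illustrative means and variances): each feature's
# standardized mean difference adds to r^2, and P(e) drops as features are added.
import numpy as np
from scipy.stats import norm

m1 = np.array([0.0, 0.0, 0.0])
m2 = np.array([1.0, 0.5, 0.2])
sigma = np.array([1.0, 1.0, 2.0])                 # per-feature standard deviations

terms = ((m1 - m2) / sigma) ** 2                  # each feature's contribution to r^2
for d in range(1, len(terms) + 1):
    r = np.sqrt(terms[:d].sum())
    print(f"first {d} feature(s): r = {r:.3f}, P(e) = {norm.sf(r / 2):.4f}")
```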

Page 14: Part  3:  Estimation  of  Parameters

• When r increases, probability of error decreases.

• Best features are the ones with distant means and small variances.

So add new features if the ones we already have are not adequate (more features, lower probability of error).

But in practice it has been shown that adding new features beyond some point leads to worse performance.

Page 15: Part  3:  Estimation  of  Parameters

• Find statistically independent features

• Find discriminating features

• Find computationally feasible features

• Principal Component Analysis (PCA) (Karhunen-Loeve Transform)

• Finds (reduces the set to) statistically independent features.

• Given $n$ vectors $X_1, X_2, \ldots, X_n$, find a representative $X_0$.

• Use a squared-error criterion.

Page 16: Part  3:  Estimation  of  Parameters

Eliminating Redundant Features

• A smaller feature vector $X = [x_1, \ldots, x_d]^T$ is to be found using a larger set $Y = [y_1, \ldots, y_m]^T$.

• Example: two features $y_1$ and $y_2$ that are linearly dependent. So we either

– throw one away,

– generate a new feature using $y_1$ and $y_2$ (e.g., project the points onto a line, giving a new sample in 1-d space), or

– form a linear combination of the features.

(Figure: scatter plot of linearly dependent features $y_1$ and $y_2$, with the points projected onto a line.)

Page 17: Part  3:  Estimation  of  Parameters

• In general the new features are functions of the old ones:

$x_1 = f_1(y_1, \ldots, y_m), \quad x_2 = f_2(y_1, \ldots, y_m), \quad \ldots, \quad x_d = f_d(y_1, \ldots, y_m)$

• If the $f_i$ are linear functions, this is a linear transformation $X = WY$.

• How can $W$ be found? By the Karhunen-Loeve (K-L) expansion / Principal Component Analysis.

• $W$ is chosen so that the goals above are satisfied (class discrimination and independent $x_1, x_2, \ldots$).

• Zero-degree representation: $X_1, \ldots, X_n$ represented with a single vector $X_0$.

• Find a vector $X_0$ so that the sum of the squared distances from the samples to $X_0$ is minimum.

Page 18: Part  3:  Estimation  of  Parameters

• The squared-error criterion:

$J_0(X_0) = \sum_{k=1}^{n} \| X_0 - X_k \|^2$

• Find the $X_0$ that minimizes $J_0$.

• The solution is given by the sample mean $M = \frac{1}{n}\sum_{k=1}^{n} X_k$:

$J_0(X_0) = \sum_{k=1}^{n} \| (X_0 - M) - (X_k - M) \|^2 = \sum_{k=1}^{n} \| X_0 - M \|^2 - 2\sum_{k=1}^{n} (X_0 - M)^T (X_k - M) + \sum_{k=1}^{n} \| X_k - M \|^2$

• The cross term vanishes, since $\sum_{k=1}^{n} (X_k - M) = 0$.

Page 19: Part  3:  Estimation  of  Parameters

• The last term is independent of $X_0$ and $\| X_0 - M \|^2 \ge 0$, so this expression is minimized by $X_0 = M$.

• Consider now a 1-d representation from 2-d.

• The line should pass through the sample mean:

$X = M + a e$

where $e$ is the unit vector in the direction of the line, and $a_k = e^T(X_k - M)$ is the distance of $X_k$ from $M$ along the line, so $X_k \approx M + a_k e$.

Page 20: Part  3:  Estimation  of  Parameters

• Now, how do we find the best $e$, the one that minimizes

$J_1(a_1, \ldots, a_n, e) = \sum_{k=1}^{n} \| (M + a_k e) - X_k \|^2 \, ?$

• It turns out that, given the scatter matrix

$S = \sum_{k=1}^{n} (X_k - M)(X_k - M)^T,$

minimizing $J_1$ amounts to maximizing $e^T S e$, so $e$ must be the eigenvector of the scatter matrix with the largest eigenvalue $\lambda$.

• That is, we project the data onto a line through the sample mean, in the direction of the eigenvector of the scatter matrix with the largest eigenvalue.

• Now consider a $d$-dimensional projection:

$X \approx M + \sum_{i=1}^{d} a_i e_i$

• Here $e_1, \ldots, e_d$ are the $d$ eigenvectors of the scatter matrix having the largest eigenvalues.

Page 21: Part  3:  Estimation  of  Parameters

• The coefficients $a_i$ are called principal components.

• So each $m$-dimensional feature vector is transformed into a $d$-dimensional space, since the components are given as

$a_i = e_i^T (X_k - M)$

• So the new feature vector of sample $X_k$ is

$X'_k = [a_1, a_2, \ldots, a_d]^T = [\, e_1^T(X_k - M), \; e_2^T(X_k - M), \; \ldots, \; e_d^T(X_k - M) \,]^T$
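A minimal PCA sketch via the scatter matrix (the simulated data and the choice $d = 2$ are assumptions, not from the notes): subtract the sample mean, take the eigenvectors of $S$ with the largest eigenvalues, and project.

```python
# PCA via the scatter matrix (data and d = 2 are assumptions for illustration).
import numpy as np

rng = np.random.default_rng(4)
X = rng.multivariate_normal([0.0, 0.0, 0.0],
                            [[3.0, 1.0, 0.0], [1.0, 2.0, 0.0], [0.0, 0.0, 0.2]],
                            size=200)

M = X.mean(axis=0)
centered = X - M
S = centered.T @ centered                    # scatter matrix: sum (X_k - M)(X_k - M)^T

eigvals, eigvecs = np.linalg.eigh(S)         # S is symmetric, so eigh applies
order = np.argsort(eigvals)[::-1]            # largest eigenvalues first
d = 2
E = eigvecs[:, order[:d]]                    # e_1, ..., e_d as columns

A = centered @ E                             # principal components a_i = e_i^T (X_k - M)
print(A.shape)                               # (200, 2): each sample is now d-dimensional
```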

Page 22: Part  3:  Estimation  of  Parameters

• FISHER’S LINEAR DISCRIMINANT

• Curse of dimensionality. More features, more samples needed.

• We would like to choose features with more discriminating ability.

• In its simplest form, it reduces the dimension of the problem to one.

• Separates samples from different categories.

• Consider samples from 2 different categories now.

Page 23: Part  3:  Estimation  of  Parameters

• Find a line so that the projection onto it separates the samples best.

(Figure: Line 1 is a bad solution, Line 2 a good one. Blue: original data; red: projection onto Line 1; yellow: projection onto Line 2.)

• Same as: apply a transformation to the samples $X$ to obtain a scalar

$y = W^T X$

such that Fisher's criterion function

$J(W) = \frac{(\tilde{\mu}_1 - \tilde{\mu}_2)^2}{\tilde{\sigma}_1^2 + \tilde{\sigma}_2^2}$

is maximized, where $\tilde{\mu}_i$ and $\tilde{\sigma}_i^2$ are the mean and scatter of the projected samples of class $i$ (defined on the next slide).

Page 24: Part  3:  Estimation  of  Parameters

• This reduces the problem to 1-d, while keeping the classes as distant from each other as possible.

• But now write $\tilde{\mu}_i$ and $\tilde{\sigma}_i^2$ in terms of $W$ and the original samples:

$\tilde{\mu}_i = \frac{1}{n_i}\sum_{y \in C_i} y \qquad \tilde{\sigma}_i^2 = \frac{1}{n_i}\sum_{y \in C_i} (y - \tilde{\mu}_i)^2$

With the class means $M_i = \frac{1}{n_i}\sum_{X \in C_i} X$,

$\tilde{\mu}_i = \frac{1}{n_i}\sum_{X \in C_i} W^T X = W^T M_i$

$\tilde{\sigma}_i^2 = \frac{1}{n_i}\sum_{X \in C_i} (W^T X - W^T M_i)^2 = W^T \left( \frac{1}{n_i}\sum_{X \in C_i} (X - M_i)(X - M_i)^T \right) W$

Page 25: Part  3:  Estimation  of  Parameters

• Then, writing $\tilde{\sigma}_i^2 = W^T S_i W$:

$(\tilde{\mu}_1 - \tilde{\mu}_2)^2 = (W^T M_1 - W^T M_2)^2 = W^T (M_1 - M_2)(M_1 - M_2)^T W = W^T S_B W$

$\tilde{\sigma}_1^2 + \tilde{\sigma}_2^2 = W^T (S_1 + S_2) W = W^T S_W W$

• $S_W = S_1 + S_2$ : the within-class scatter matrix

• $S_B = (M_1 - M_2)(M_1 - M_2)^T$ : the between-class scatter matrix

• Then, maximize

$J(W) = \frac{W^T S_B W}{W^T S_W W}$ (the Rayleigh quotient)

Page 26: Part  3:  Estimation  of  Parameters

• It can be shown that the $W$ that maximizes $J$ can be found by solving an eigenvalue problem again:

$S_W^{-1} S_B W = \lambda W$

and the solution is given by

$W = S_W^{-1} (M_1 - M_2)$

• This is optimal if the densities are Gaussians with equal covariance matrices; in that case, reducing the dimension does not cause any loss.

• Multiple Discriminant Analysis: the $c$-category problem, a generalization of the 2-category problem.
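A minimal sketch of the two-class Fisher direction (the simulated data are an assumption, not from the notes): compute the class means and within-class scatter, solve for $W = S_W^{-1}(M_1 - M_2)$, and project each sample to the scalar $y = W^T X$.

```python
# Fisher's linear discriminant, two classes (data below is simulated for illustration).
import numpy as np

rng = np.random.default_rng(5)
X1 = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.4], [0.4, 1.0]], size=100)
X2 = rng.multivariate_normal([2.0, 1.0], [[1.0, 0.4], [0.4, 1.0]], size=100)

M1, M2 = X1.mean(axis=0), X2.mean(axis=0)
S1 = (X1 - M1).T @ (X1 - M1)                 # class scatter matrices
S2 = (X2 - M2).T @ (X2 - M2)
S_W = S1 + S2                                # within-class scatter

W = np.linalg.solve(S_W, M1 - M2)            # W = S_W^{-1} (M1 - M2)
y1, y2 = X1 @ W, X2 @ W                      # projected scalar samples y = W^T X
print("projected class means:", y1.mean(), y2.mean())
```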

Page 27: Part  3:  Estimation  of  Parameters

• Non-Parametric Techniques

• Density Estimation

• Use the samples directly for classification:

– Nearest Neighbor Rule

– 1-NN

– k-NN

• Linear Discriminant Functions: $g_i(X)$ is linear.