Pattern Classification, Chapter 3

Page 1:

Pattern Classification

All materials in these slides were taken from Pattern Classification (2nd ed.) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley & Sons, 2000, with the permission of the authors and the publisher.

Page 2:

Chapter 3: Maximum-Likelihood and Bayesian Parameter Estimation (part 2)

Bayesian Estimation (BE)
Bayesian Parameter Estimation: Gaussian Case
Bayesian Parameter Estimation: General Estimation
Problems of Dimensionality
Computational Complexity
Component Analysis and Discriminants
Hidden Markov Models

Page 3:

Bayesian Estimation (Bayesian learning applied to pattern classification problems)

In MLE, θ was assumed to be fixed; in BE, θ is a random variable.

The computation of the posterior probabilities P(ωi | x) lies at the heart of Bayesian classification.

Goal: compute P(ωi | x, D).

Given the sample D, Bayes formula can be written

$$P(\omega_i \mid \mathbf{x}, D) = \frac{p(\mathbf{x} \mid \omega_i, D)\, P(\omega_i \mid D)}{\sum_{j=1}^{c} p(\mathbf{x} \mid \omega_j, D)\, P(\omega_j \mid D)}$$

Page 4:

To demonstrate the preceding equation, use:

$$p(\mathbf{x}, \omega_i \mid D) = p(\mathbf{x} \mid \omega_i, D)\, P(\omega_i \mid D) \quad \text{(from def. of cond. prob.)}$$

$$p(\mathbf{x} \mid D) = \sum_{j} p(\mathbf{x}, \omega_j \mid D) \quad \text{(from the law of total prob.)}$$

$$P(\omega_i \mid D) = P(\omega_i) \quad \text{(the training sample provides this!)}$$

We assume that samples in different classes are independent. Thus:

$$P(\omega_i \mid \mathbf{x}, D) = \frac{p(\mathbf{x} \mid \omega_i, D_i)\, P(\omega_i)}{\sum_{j=1}^{c} p(\mathbf{x} \mid \omega_j, D_j)\, P(\omega_j)}$$

Page 5:

Bayesian Parameter Estimation: Gaussian Case

Goal: estimate μ using the a posteriori density p(μ | D).

The univariate case: p(μ | D); μ is the only unknown parameter.

$$p(x \mid \mu) \sim N(\mu, \sigma^2), \qquad p(\mu) \sim N(\mu_0, \sigma_0^2)$$

(μ0 and σ0 are known!)

Page 6:

But we know

$$p(\mu \mid D) = \frac{p(D \mid \mu)\, p(\mu)}{\int p(D \mid \mu)\, p(\mu)\, d\mu} \quad \text{(Bayes formula)}$$

$$p(D \mid \mu) = \prod_{k=1}^{n} p(x_k \mid \mu)$$

$$p(x_k \mid \mu) \sim N(\mu, \sigma^2) \quad \text{and} \quad p(\mu) \sim N(\mu_0, \sigma_0^2)$$

Plugging in their Gaussian expressions and extracting out the factors not depending on μ yields (from eq. 29, page 93):

$$p(\mu \mid D) = \alpha \exp\!\left[-\frac{1}{2}\left(\left(\frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}\right)\mu^2 - 2\left(\frac{1}{\sigma^2}\sum_{k=1}^{n} x_k + \frac{\mu_0}{\sigma_0^2}\right)\mu\right)\right]$$

Page 7:

Observation: p(μ | D) is an exponential of a quadratic function of μ, so it is again normal! It is called a reproducing density:

$$p(\mu \mid D) \sim N(\mu_n, \sigma_n^2)$$

$$p(\mu \mid D) = \alpha \exp\!\left[-\frac{1}{2}\left(\left(\frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}\right)\mu^2 - 2\left(\frac{1}{\sigma^2}\sum_{k=1}^{n} x_k + \frac{\mu_0}{\sigma_0^2}\right)\mu\right)\right]$$

Page 8:

Identifying the coefficients in the top equation with those of the generic Gaussian

$$p(\mu \mid D) = \alpha \exp\!\left[-\frac{1}{2}\left(\left(\frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}\right)\mu^2 - 2\left(\frac{1}{\sigma^2}\sum_{k=1}^{n} x_k + \frac{\mu_0}{\sigma_0^2}\right)\mu\right)\right]$$

$$p(\mu \mid D) = \frac{1}{\sqrt{2\pi}\,\sigma_n} \exp\!\left[-\frac{1}{2}\left(\frac{\mu - \mu_n}{\sigma_n}\right)^2\right]$$

yields expressions for μn and σn²:

$$\frac{1}{\sigma_n^2} = \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2} \qquad \text{and} \qquad \frac{\mu_n}{\sigma_n^2} = \frac{n}{\sigma^2}\,\hat{\mu}_n + \frac{\mu_0}{\sigma_0^2}$$

where $\hat{\mu}_n = \frac{1}{n}\sum_{k=1}^{n} x_k$ is the sample mean.

Page 9:

Solving for μn and σn² yields:

$$\mu_n = \left(\frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2}\right)\hat{\mu}_n + \frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\,\mu_0 \qquad \text{and} \qquad \sigma_n^2 = \frac{\sigma_0^2\,\sigma^2}{n\sigma_0^2 + \sigma^2}$$

From these equations we see that as n increases, the variance σn² decreases monotonically and the estimate p(μ | D) becomes more and more peaked.
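
As a quick illustration (not from the slides), here is a minimal Python sketch of the update above; the function name and the synthetic data are my own, and σ, μ0 and σ0 are assumed known:

```python
import numpy as np

def gaussian_posterior(x, sigma, mu0, sigma0):
    """Posterior N(mu_n, sigma_n^2) over the mean mu of a univariate
    Gaussian with known variance sigma^2, given the prior N(mu0, sigma0^2)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    mu_hat = x.mean()                        # sample mean, hat{mu}_n
    denom = n * sigma0**2 + sigma**2
    mu_n = (n * sigma0**2 / denom) * mu_hat + (sigma**2 / denom) * mu0
    sigma_n2 = sigma0**2 * sigma**2 / denom  # shrinks like O(1/n)
    return mu_n, sigma_n2

# Example: the posterior sharpens around the sample mean as n grows.
rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=50)
print(gaussian_posterior(data, sigma=1.0, mu0=0.0, sigma0=3.0))
```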

Page 10: (no transcribed content on this slide)

Page 11:

The univariate case: p(x | D)

p(μ | D) has been computed (in the preceding discussion); p(x | D) remains to be computed! It provides the desired class-conditional density p(x | Dj, ωj):

$$p(x \mid D) = \int p(x \mid \mu)\, p(\mu \mid D)\, d\mu \quad \text{is Gaussian:} \quad p(x \mid D) \sim N(\mu_n, \sigma^2 + \sigma_n^2)$$

We know μn and σn², and therefore how to compute p(x | D).

Therefore, using p(x | Dj, ωj) together with P(ωj) and Bayes formula, we obtain the Bayesian classification rule:

$$\max_{\omega_j} P(\omega_j \mid \mathbf{x}, D) \;\equiv\; \max_{\omega_j} p(\mathbf{x} \mid \omega_j, D_j)\, P(\omega_j)$$
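
A hedged sketch of how the rule above could be applied, assuming the per-class μn and σn² have already been computed as on the preceding slides; the helper names and the two-class numbers are illustrative only:

```python
import numpy as np
from scipy.stats import norm

def predictive_density(x, mu_n, sigma, sigma_n2):
    """p(x | D) ~ N(mu_n, sigma^2 + sigma_n^2) for one class."""
    return norm.pdf(x, loc=mu_n, scale=np.sqrt(sigma**2 + sigma_n2))

def classify(x, class_params, priors):
    """Pick the class j maximizing p(x | D_j, w_j) * P(w_j).
    class_params: list of (mu_n, sigma, sigma_n2) tuples, one per class."""
    scores = [predictive_density(x, *p) * pr
              for p, pr in zip(class_params, priors)]
    return int(np.argmax(scores))

# Hypothetical two-class example with equal priors.
params = [(0.0, 1.0, 0.1), (3.0, 1.0, 0.2)]
print(classify(1.0, params, priors=[0.5, 0.5]))  # -> 0
```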

Page 12:

3.5 Bayesian Parameter Estimation: General Theory

The computation of p(x | D) can be applied to any situation in which the unknown density can be parameterized. The basic assumptions are:

The form of p(x | θ) is assumed known, but the value of θ is not known exactly.
Our knowledge about θ is assumed to be contained in a known prior density p(θ).
The rest of our knowledge about θ is contained in a set D of n samples x1, x2, …, xn drawn independently according to p(x).

Page 13:

The basic problem is: "Compute the posterior density p(θ | D)", then "Derive p(x | D)", where

$$p(\mathbf{x} \mid D) = \int p(\mathbf{x} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid D)\, d\boldsymbol{\theta}$$

Using Bayes formula, we have:

$$p(\boldsymbol{\theta} \mid D) = \frac{p(D \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta})}{\int p(D \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta})\, d\boldsymbol{\theta}}$$

And by the independence assumption:

$$p(D \mid \boldsymbol{\theta}) = \prod_{k=1}^{n} p(\mathbf{x}_k \mid \boldsymbol{\theta})$$
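
For intuition, a minimal grid-based sketch of this general recipe (not from the slides; the grid approximation and all names are my own), assuming a one-dimensional parameter θ:

```python
import numpy as np

def bayes_predictive(xs, data, theta_grid, prior, likelihood):
    """Grid-based sketch of the general recipe:
    p(theta | D) is proportional to [prod_k p(x_k | theta)] * p(theta), then
    p(x | D) = integral of p(x | theta) p(theta | D) dtheta (here a sum)."""
    d_theta = theta_grid[1] - theta_grid[0]
    log_post = np.log(prior(theta_grid))
    for xk in data:                                  # independence assumption
        log_post += np.log(likelihood(xk, theta_grid))
    post = np.exp(log_post - log_post.max())
    post /= post.sum() * d_theta                     # normalize p(theta | D)
    return np.array([np.sum(likelihood(x, theta_grid) * post) * d_theta
                     for x in xs])

# Example: unknown mean of a unit-variance Gaussian with an N(0, 3^2) prior.
lik = lambda x, mu: np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)
pri = lambda mu: np.exp(-0.5 * (mu / 3.0) ** 2) / (3.0 * np.sqrt(2 * np.pi))
grid = np.linspace(-10.0, 10.0, 2001)
print(bayes_predictive([0.0, 2.0], data=[1.8, 2.2, 2.0],
                       theta_grid=grid, prior=pri, likelihood=lik))
```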

Page 14:

Convergence (from notes)

Page 15:

Problems of Dimensionality

Problems involving 50 or 100 features are common (usually binary valued).
Note: microarray data might entail ~20,000 real-valued features.
Classification accuracy depends upon: dimensionality, amount of training data, discrete vs. continuous features.

Page 16:

Case of two-class multivariate normal with the same covariance: P(x | ωj) ~ N(μj, Σ), j = 1, 2.
Statistically independent features.
If the priors are equal then:

$$P(\text{error}) = \frac{1}{\sqrt{2\pi}} \int_{r/2}^{\infty} e^{-u^2/2}\, du \quad \text{(Bayes error)}$$

where

$$r^2 = (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^t\, \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$$

is the squared Mahalanobis distance, and

$$\lim_{r \to \infty} P(\text{error}) = 0$$
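
A small Python sketch of this error formula, assuming equal priors and a shared covariance as stated above; P(error) is the upper tail of the standard normal at r/2 (the example matrices are made up):

```python
import numpy as np
from scipy.stats import norm
from scipy.linalg import solve

def bayes_error_equal_cov(mu1, mu2, cov):
    """Bayes error for two equal-prior Gaussian classes sharing covariance `cov`:
    the standard-normal tail from r/2 to infinity, where r is the
    Mahalanobis distance between the class means."""
    diff = np.asarray(mu1, float) - np.asarray(mu2, float)
    r2 = diff @ solve(np.asarray(cov, float), diff)   # (mu1-mu2)^t Sigma^-1 (mu1-mu2)
    return norm.sf(np.sqrt(r2) / 2.0)                 # upper tail of N(0, 1)

# Hypothetical 2-D example: the error shrinks as the means move apart.
cov = np.array([[1.0, 0.3], [0.3, 2.0]])
print(bayes_error_equal_cov([0, 0], [1, 1], cov))
print(bayes_error_equal_cov([0, 0], [4, 4], cov))
```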

Page 17:

If the features are conditionally independent then:

$$\boldsymbol{\Sigma} = \mathrm{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_d^2) \qquad \text{and} \qquad r^2 = \sum_{i=1}^{d} \left(\frac{\mu_{i1} - \mu_{i2}}{\sigma_i}\right)^2$$

Do we remember what conditional independence is?

Example for binary features: let pi = Pr[xi = 1 | ω1]; then P(x | ω1) is the product of the pi.

Page 18:

The most useful features are the ones for which the difference between the means is large relative to the standard deviation. This doesn't require independence.

Adding independent features helps increase r and thus reduce the error.

Caution: adding features increases the cost and complexity of the feature extractor and the classifier.

It has frequently been observed in practice that, beyond a certain point, the inclusion of additional features leads to worse rather than better performance: we have the wrong model, or we don't have enough training data to support the additional dimensions.

Page 19: (no transcribed content on this slide)

Page 20:

Computational Complexity

Our design methodology is affected by the computational difficulty.

"big oh" notation: f(x) = O(h(x)), read "big oh of h(x)", if:

$$\exists\, (c_0, x_0) \ \text{such that} \ |f(x)| \le c_0\, |h(x)| \ \text{for all} \ x > x_0$$

(An upper bound: f(x) grows no worse than h(x) for sufficiently large x!)

Example: f(x) = 2 + 3x + 4x², g(x) = x², so f(x) = O(x²).
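
A tiny numerical check of the example, with witnesses c0 = 5 and x0 = 4 chosen by me (any valid pair would do):

```python
# Check that f(x) = 2 + 3x + 4x^2 is O(x^2): with c0 = 5 and x0 = 4,
# |f(x)| <= c0 * |x^2| holds for every x > x0 (verified here on a range).
f = lambda x: 2 + 3 * x + 4 * x ** 2
h = lambda x: x ** 2
c0, x0 = 5, 4
assert all(abs(f(x)) <= c0 * abs(h(x)) for x in range(x0 + 1, 10_000))
print("f(x) = O(x^2), witnessed by c0 = 5, x0 = 4")
```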

Page 21:

"big oh" is not unique! f(x) = O(x²); f(x) = O(x³); f(x) = O(x⁴)

"big theta" notation: f(x) = Θ(h(x)) if:

$$\exists\, (x_0, c_1, c_2) \ \text{such that} \ 0 \le c_1\, h(x) \le f(x) \le c_2\, h(x) \ \text{for all} \ x > x_0$$

f(x) = Θ(x²) but f(x) ≠ Θ(x³)

Page 22:

Complexity of the ML Estimation

Gaussian priors in d dimensions, with n training samples for each of c classes.

For each category, we have to compute the discriminant function

$$g(\mathbf{x}) = -\frac{1}{2}(\mathbf{x} - \hat{\boldsymbol{\mu}})^t\, \hat{\boldsymbol{\Sigma}}^{-1} (\mathbf{x} - \hat{\boldsymbol{\mu}}) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln |\hat{\boldsymbol{\Sigma}}| + \ln P(\omega)$$

Estimating the sample mean costs O(dn), the sample covariance O(d²n), ln P(ω) is O(n), and the constant term is O(1).

Total = O(d²n); total for c classes = O(cd²n) ≈ O(d²n).

The cost increases when d and n are large!
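
An illustrative sketch of the plug-in discriminant above (not the book's code); as noted, estimating μ̂ and Σ̂ dominates at O(d²n):

```python
import numpy as np

def ml_discriminant(X, prior):
    """Build the ML plug-in discriminant g(x) for one class from its
    n x d sample matrix X."""
    n, d = X.shape
    mu = X.mean(axis=0)                       # O(dn)
    Sigma = (X - mu).T @ (X - mu) / n         # O(d^2 n), ML covariance estimate
    Sigma_inv = np.linalg.inv(Sigma)
    _, logdet = np.linalg.slogdet(Sigma)
    def g(x):
        diff = x - mu
        return (-0.5 * diff @ Sigma_inv @ diff
                - 0.5 * d * np.log(2 * np.pi)
                - 0.5 * logdet
                + np.log(prior))
    return g

# Hypothetical use: one discriminant per class, classify by the largest g(x).
rng = np.random.default_rng(1)
g1 = ml_discriminant(rng.normal(0, 1, size=(200, 3)), prior=0.5)
g2 = ml_discriminant(rng.normal(2, 1, size=(200, 3)), prior=0.5)
x = np.array([0.1, -0.2, 0.3])
print("class 1" if g1(x) > g2(x) else "class 2")
```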

Page 23:

Overfitting

Dimensionality of the model vs. size of the training data.
Issue: not enough data to support the model.
Possible solutions:
Reduce the model dimensionality.
Make (possibly incorrect) assumptions to better estimate Σ.

Page 24:

Overfitting

Estimate Σ better:
Use data pooled from all classes (normalization issues).
Use a pseudo-Bayesian form: λΣ0 + (1 − λ)Σ̂n.
"Doctor" Σ̂ by thresholding entries (reduces chance correlations).
Assume statistical independence: zero all off-diagonal elements.

Page 25:

Shrinkage

Shrinkage: a weighted combination of the common and the individual covariances

$$\boldsymbol{\Sigma}_i(\alpha) = \frac{(1-\alpha)\, n_i \boldsymbol{\Sigma}_i + \alpha\, n \boldsymbol{\Sigma}}{(1-\alpha)\, n_i + \alpha\, n} \qquad \text{for } 0 < \alpha < 1$$

We can also shrink the estimated common covariance toward the identity matrix

$$\boldsymbol{\Sigma}(\beta) = (1-\beta)\, \boldsymbol{\Sigma} + \beta\, \mathbf{I} \qquad \text{for } 0 < \beta < 1$$
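
A direct transcription of the two shrinkage formulas into Python (the example matrices and sample counts are made up):

```python
import numpy as np

def shrink_class_cov(Sigma_i, n_i, Sigma, n, alpha):
    """Shrink an individual class covariance toward the common one:
    Sigma_i(alpha) = [(1-a) n_i Sigma_i + a n Sigma] / [(1-a) n_i + a n]."""
    return (((1 - alpha) * n_i * Sigma_i + alpha * n * Sigma)
            / ((1 - alpha) * n_i + alpha * n))

def shrink_toward_identity(Sigma, beta):
    """Shrink the common covariance toward the identity matrix."""
    return (1 - beta) * Sigma + beta * np.eye(Sigma.shape[0])

# Hypothetical example with two features.
Sigma_i = np.array([[2.0, 0.8], [0.8, 1.0]])   # class-specific estimate (n_i samples)
Sigma   = np.array([[1.5, 0.2], [0.2, 1.2]])   # pooled/common estimate (n samples)
print(shrink_class_cov(Sigma_i, n_i=20, Sigma=Sigma, n=200, alpha=0.3))
print(shrink_toward_identity(Sigma, beta=0.1))
```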

Page 26:

Component Analysis and Discriminants

Combine features in order to reduce the dimension of the feature space.
Linear combinations are simple to compute and tractable.
Project high-dimensional data onto a lower-dimensional space.
Two classical approaches for finding an "optimal" linear transformation:
PCA (Principal Component Analysis): "the projection that best represents the data in a least-squares sense"
MDA (Multiple Discriminant Analysis): "the projection that best separates the data in a least-squares sense"
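
A minimal PCA sketch matching the "best least-squares representation" description above (eigendecomposition of the sample covariance; all names are my own):

```python
import numpy as np

def pca_project(X, k):
    """Project the n x d data matrix X onto its k leading principal
    components (the least-squares-best k-dimensional linear representation)."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    Xc = X - mean                              # center the data
    cov = Xc.T @ Xc / (X.shape[0] - 1)         # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # ascending eigenvalues
    W = eigvecs[:, ::-1][:, :k]                # top-k eigenvectors as columns
    return Xc @ W, W, mean

# Hypothetical use: reduce 5-D samples to 2-D.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))
Y, W, mean = pca_project(X, k=2)
print(Y.shape)   # (100, 2)
```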

Page 27:

PCA (from notes)

Page 28:

Hidden Markov Models: Markov Chains

Goal: make a sequence of decisions.
Processes that unfold in time; the state at time t is influenced by the state at time t − 1.
Applications: speech recognition, gesture recognition, parts-of-speech tagging, DNA sequencing, …
Any temporal process without memory: ωT = {ω(1), ω(2), ω(3), …, ω(T)} is a sequence of states.
We might have ω6 = {ω1, ω4, ω2, ω2, ω1, ω4}.
The system can revisit a state at different steps, and not every state need be visited.

Page 29:

First-order Markov models

The production of any sequence is described by the transition probabilities

$$P(\omega_j(t + 1) \mid \omega_i(t)) = a_{ij}$$

Page 30: (no transcribed content on this slide)

Page 31:

θ = (aij, T)

$$P(\omega^T \mid \boldsymbol{\theta}) = a_{14}\, a_{42}\, a_{22}\, a_{21}\, a_{14}\, P(\omega(1) = \omega_i)$$

Example: speech recognition ("production of spoken words")

Production of the word "pattern", represented by phonemes: /p/ /a/ /tt/ /er/ /n/ // (// = silent state)

Transitions from /p/ to /a/, /a/ to /tt/, /tt/ to /er/, /er/ to /n/, and /n/ to a silent state.
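
A small sketch of this sequence probability, assuming a 4-state chain with a made-up transition matrix A and uniform initial probabilities; states are 0-indexed, so ω1 corresponds to 0:

```python
import numpy as np

def sequence_probability(states, A, pi):
    """P(omega^T | theta) for a first-order Markov chain: the initial-state
    probability times the product of transition probabilities a_ij along
    the observed state sequence."""
    p = pi[states[0]]
    for s, s_next in zip(states[:-1], states[1:]):
        p *= A[s, s_next]                 # a_ij = P(omega_j at t+1 | omega_i at t)
    return p

# Hypothetical 4-state chain.
A = np.array([[0.2, 0.3, 0.1, 0.4],
              [0.3, 0.2, 0.4, 0.1],
              [0.1, 0.6, 0.2, 0.1],
              [0.2, 0.5, 0.2, 0.1]])
pi = np.array([0.25, 0.25, 0.25, 0.25])     # P(omega(1) = omega_i)
omega6 = [0, 3, 1, 1, 0, 3]                 # the slide's sequence {w1, w4, w2, w2, w1, w4}
print(sequence_probability(omega6, A, pi))  # = pi[0] * a14 * a42 * a22 * a21 * a14
```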