
Page 1:

Pattern Classification

All materials in these slides* were taken from Pattern Classification (2nd ed) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley & Sons, 2000, with the permission of the authors and the publisher.

* Some slides are modified

Page 2:

Maximum-Likelihood and Bayesian Parameter Estimation (Part 2)

Page 3:

Reading Assignment

R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed. John Wiley & Sons, 2001. Chapter 3, Sections 4, 5, and 7.

Page 4:

Chapter 3: Maximum-Likelihood & Bayesian Parameter Estimation (Part 2)

Bayesian Parameter Estimation: Gaussian Case
Bayesian Parameter Estimation: General Estimation
Problems of Dimensionality
Computational Complexity

Page 5:

The Parameter Distribution

Goal: estimate p(x) by computing p(x | D):

p(x | D) = ∫ p(x | θ) p(θ | D) dθ

This formula links the desired class-conditional density p(x | D) to the posterior density p(θ | D).

Page 6:

Bayesian Parameter Estimation: Gaussian Case

The univariate case (μ is the only unknown parameter):

p(x | μ) ~ N(μ, σ²)
p(μ) ~ N(μ_0, σ_0²)

(μ_0 and σ_0 are known!)

Therefore, we need to find p(μ | D) in order to compute

p(x | D) = ∫ p(x | μ) p(μ | D) dμ

Page 7:

Let D = {x_1, x_2, …, x_n}. Using Bayes formula:

p(μ | D) = p(D | μ) p(μ) / ∫ p(D | μ) p(μ) dμ = α Π_{k=1}^{n} p(x_k | μ) p(μ)      (1)

where α is a normalizing factor that does not depend on μ.

Page 8:

Reproducing density

p(μ | D) ~ N(μ_n, σ_n²): the posterior stays in the normal family, which is why it is called a reproducing density. Its parameters are

μ_n = (n σ_0² / (n σ_0² + σ²)) μ̂_n + (σ² / (n σ_0² + σ²)) μ_0

σ_n² = σ_0² σ² / (n σ_0² + σ²)

where μ̂_n = (1/n) Σ_{k=1}^{n} x_k is the sample mean.
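A minimal numerical sketch of these update formulas (the data, the prior values μ_0 and σ_0, and the known σ below are made-up illustrations, not taken from the slides):

```python
import numpy as np

sigma = 1.0                 # known standard deviation of p(x | mu)
mu0, sigma0 = 0.0, 2.0      # prior p(mu) ~ N(mu0, sigma0^2)

D = np.array([1.8, 2.3, 1.5, 2.9, 2.1])   # observed samples
n = len(D)
mu_hat = D.mean()           # sample mean

# Posterior p(mu | D) ~ N(mu_n, sigma_n^2), per the reproducing-density formulas
mu_n = (n * sigma0**2 / (n * sigma0**2 + sigma**2)) * mu_hat \
     + (sigma**2 / (n * sigma0**2 + sigma**2)) * mu0
sigma_n2 = (sigma0**2 * sigma**2) / (n * sigma0**2 + sigma**2)

print(mu_n, sigma_n2)       # mu_n is pulled toward mu0 when n is small
```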


Page 10:

The univariate case: p(x | D)

p(μ | D) has been computed; p(x | D) remains to be computed!

p(x | D) = ∫ p(x | μ) p(μ | D) dμ is Gaussian

It is equal to N(μ_n, σ² + σ_n²), where μ_n and σ_n² are the posterior parameters found above.

Page 11:

It provides:

p(x | D) ~ N(μ_n, σ² + σ_n²)     (the desired class-conditional density p(x | D_j, ω_j))

Therefore: p(x | D_j, ω_j) together with the priors P(ω_j), and using Bayes formula, we obtain the Bayesian classification rule:

max over ω_j of P(ω_j | x, D)  =  max over ω_j of p(x | ω_j, D_j) P(ω_j)
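A small sketch of this classification rule for two classes under the univariate Gaussian model above (the class names, training sets, priors, and test point are made-up for illustration):

```python
import numpy as np
from scipy.stats import norm

def posterior_params(D, mu0, sigma0, sigma):
    """Posterior N(mu_n, sigma_n^2) for the unknown mean, with sigma known."""
    n, mu_hat = len(D), np.mean(D)
    mu_n = (n * sigma0**2 * mu_hat + sigma**2 * mu0) / (n * sigma0**2 + sigma**2)
    sigma_n2 = (sigma0**2 * sigma**2) / (n * sigma0**2 + sigma**2)
    return mu_n, sigma_n2

sigma = 1.0                                   # known within-class std (illustrative)
classes = {                                   # made-up training sets and priors
    "w1": (np.array([0.2, -0.5, 0.7, 0.1]), 0.6),
    "w2": (np.array([2.1, 2.8, 1.9, 2.4]), 0.4),
}

x = 1.2                                       # test point
scores = {}
for name, (D, prior) in classes.items():
    mu_n, sigma_n2 = posterior_params(D, mu0=0.0, sigma0=2.0, sigma=sigma)
    # Predictive density p(x | D_j) ~ N(mu_n, sigma^2 + sigma_n^2)
    scores[name] = norm.pdf(x, loc=mu_n, scale=np.sqrt(sigma**2 + sigma_n2)) * prior

print(max(scores, key=scores.get))            # class maximizing p(x | D_j) P(w_j)
```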

Page 12:

The Multivariate case:

The treatment parallels the univariate case: with p(x | μ) ~ N(μ, Σ) and prior p(μ) ~ N(μ_0, Σ_0) (Σ, μ_0 and Σ_0 known), the posterior p(μ | D) is again normal, N(μ_n, Σ_n).

Page 13:

Bayesian Parameter Estimation: General Theory

The computation of p(x | D) can be applied to any situation in which the unknown density can be parameterized. The basic assumptions are:

The form of p(x | θ) is assumed known, but the value of θ is not known exactly

Our knowledge about θ is assumed to be contained in a known prior density p(θ)

The rest of our knowledge about θ is contained in a set D of n samples x_1, x_2, …, x_n drawn independently according to p(x)

Page 14:

The basic problem is: "Compute the posterior density p(θ | D)", then "Derive p(x | D)".

Using Bayes formula, we have:

p(θ | D) = p(D | θ) p(θ) / ∫ p(D | θ) p(θ) dθ

And by the independence assumption:

p(D | θ) = Π_{k=1}^{n} p(x_k | θ)
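As an illustration of this general recipe, here is a brute-force sketch that evaluates p(θ | D) on a grid of θ values; the exponential likelihood p(x | θ) = θ e^(−θx), the flat prior, and the data are arbitrary choices for the example, not from the slides:

```python
import numpy as np

# Assumed parametric form: p(x | theta) = theta * exp(-theta * x), x >= 0
def likelihood(x, theta):
    return theta * np.exp(-theta * x)

D = np.array([0.8, 1.4, 0.3, 2.2, 0.9])      # made-up samples
thetas = np.linspace(0.01, 5.0, 1000)        # grid over the parameter
dtheta = thetas[1] - thetas[0]

# p(D | theta) = prod_k p(x_k | theta)   (independence assumption)
p_D_given_theta = np.prod(likelihood(D[:, None], thetas[None, :]), axis=0)

prior = np.ones_like(thetas)                 # flat prior p(theta) over the grid
unnorm = p_D_given_theta * prior
posterior = unnorm / (unnorm.sum() * dtheta) # p(theta | D) via Bayes formula

# Predictive density p(x | D) = integral of p(x | theta) p(theta | D) dtheta
x_new = 1.0
p_x_given_D = (likelihood(x_new, thetas) * posterior).sum() * dtheta
print(p_x_given_D)
```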

Page 15:

Bayesian Parameter Estimation: Incremental Learning

To indicate explicitly the number of samples in a set for a single category, we shall write D^n = {x_1, ..., x_n}. Then, if n > 1:

Using p(D^n | θ) = p(x_n | θ) p(D^{n-1} | θ)

we get p(θ | D^n) = p(x_n | θ) p(θ | D^{n-1}) / ∫ p(x_n | θ) p(θ | D^{n-1}) dθ

i.e., each new sample updates the previous posterior, which acts as the prior for the next step.
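A sketch of this recursion, reusing the grid approach from the previous example (same assumed exponential likelihood and made-up data; the recursion starts from the prior):

```python
import numpy as np

def likelihood(x, theta):
    return theta * np.exp(-theta * x)      # assumed p(x | theta)

thetas = np.linspace(0.01, 5.0, 1000)
dtheta = thetas[1] - thetas[0]

posterior = np.ones_like(thetas)           # p(theta | D^0) = p(theta), a flat prior
posterior /= posterior.sum() * dtheta

for x_k in [0.8, 1.4, 0.3, 2.2, 0.9]:      # process the samples one at a time
    # p(theta | D^n) is proportional to p(x_n | theta) * p(theta | D^(n-1))
    posterior = likelihood(x_k, thetas) * posterior
    posterior /= posterior.sum() * dtheta  # normalize (the denominator integral)

# After all samples, this matches the batch posterior from the previous sketch.
print(thetas[np.argmax(posterior)])        # MAP estimate of theta
```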

Page 16:

Problems of Dimensionality

Problems involving 50 or 100 features (binary valued)

Classification accuracy depends upon the dimensionality and the amount of training data

Case of two classes, multivariate normal with the same covariance:

P(error) = (1/√(2π)) ∫_{r/2}^{∞} e^{−u²/2} du

where r² = (μ_1 − μ_2)^t Σ^{-1} (μ_1 − μ_2)

and lim_{r→∞} P(error) = 0
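A quick sketch that evaluates this error rate for two illustrative Gaussians (the means and the shared covariance below are arbitrary examples):

```python
import numpy as np
from scipy.stats import norm

mu1 = np.array([0.0, 0.0])                 # made-up class means
mu2 = np.array([2.0, 1.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])             # shared covariance (illustrative)

diff = mu1 - mu2
r = np.sqrt(diff @ np.linalg.solve(Sigma, diff))   # r^2 = (mu1-mu2)^t Sigma^-1 (mu1-mu2)

# P(error) = (1/sqrt(2*pi)) * integral from r/2 to infinity of exp(-u^2/2) du
p_error = norm.sf(r / 2)                   # the standard normal tail Q(r/2)
print(r, p_error)                          # the error shrinks as r grows
```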

Page 17:

If the features are independent, then Σ = diag(σ_1², σ_2², …, σ_d²) and

r² = Σ_{i=1}^{d} ((μ_{i1} − μ_{i2}) / σ_i)²

The most useful features are the ones for which the difference between the means is large relative to the standard deviation

It has frequently been observed in practice that, beyond a certain point, the inclusion of additional features leads to worse rather than better performance: we have the wrong model!


Page 19:

Computational Complexity

Our design methodology is affected by computational difficulty

"big oh" notation: f(x) = O(h(x)), read "big oh of h(x)"

If there exist constants c_0 and x_0 such that f(x) ≤ c_0 h(x) for all x > x_0

(An upper bound on f(x) grows no worse than h(x) for sufficiently large x!)

Example: f(x) = 2 + 3x + 4x², g(x) = x², and f(x) = O(x²)
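As a small worked check of the definition: for all x ≥ 1 we have 2 + 3x + 4x² ≤ 2x² + 3x² + 4x² = 9x², so the constants c_0 = 9 and x_0 = 1 witness f(x) = O(x²).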

Page 20:

"big oh" is not unique!

f(x) = O(x²); f(x) = O(x³); f(x) = O(x⁴)

"big theta" notation: f(x) = Θ(h(x))

If there exist constants x_0, c_1 and c_2 such that for all x > x_0: 0 ≤ c_1 h(x) ≤ f(x) ≤ c_2 h(x)

f(x) = Θ(x²) but f(x) ≠ Θ(x³)

Page 21:

Complexity of the ML Estimation

Gaussian priors in d dimensions, classifier with n training samples for each of c classes

For each category, we have to compute the discriminant function:

g(x) = −(1/2)(x − μ̂)^t Σ̂^{-1} (x − μ̂) − (d/2) ln 2π − (1/2) ln |Σ̂| + ln P(ω)

Estimating the sample mean μ̂ costs O(d·n); estimating the sample covariance Σ̂ costs O(d²·n) and dominates.

Total = O(d²·n); total for c classes = O(c·d²·n) ≈ O(d²·n)

Cost increases when d and n are large!
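A compact sketch of this discriminant with ML plug-in estimates (the training data, priors, and test point are made-up; the comments mark where the dominant costs arise):

```python
import numpy as np

def ml_discriminant(X, prior, x):
    """Gaussian discriminant g(x) using ML estimates from training matrix X (n x d)."""
    n, d = X.shape
    mu_hat = X.mean(axis=0)                          # O(d*n)
    Sigma_hat = np.cov(X, rowvar=False, bias=True)   # O(d^2 * n), the dominant estimation cost
    diff = x - mu_hat
    maha = diff @ np.linalg.solve(Sigma_hat, diff)   # (x - mu)^t Sigma^-1 (x - mu)
    _, logdet = np.linalg.slogdet(Sigma_hat)
    return -0.5 * maha - 0.5 * d * np.log(2 * np.pi) - 0.5 * logdet + np.log(prior)

rng = np.random.default_rng(0)
X1 = rng.normal(loc=0.0, size=(100, 3))              # made-up training data, class 1
X2 = rng.normal(loc=1.5, size=(100, 3))              # made-up training data, class 2
x = np.array([1.0, 0.8, 1.2])                        # test point

g1 = ml_discriminant(X1, prior=0.5, x=x)
g2 = ml_discriminant(X2, prior=0.5, x=x)
print("class 1" if g1 > g2 else "class 2")           # pick the larger discriminant
```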

Page 22:

Overfitting

Frequently, the number of available samples is inadequate; how do we proceed?

Solutions:

Reduce dimensionality: redesign the feature extractor, or use a subset of the features

Assume all c classes share the same covariance matrix: pool the available data

Assume statistical independence: all off-diagonal elements of the covariance matrix are zero (the last two simplifications are sketched below)
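A minimal numpy sketch of those two covariance simplifications (the class count, sample sizes, and feature dimension are made-up):

```python
import numpy as np

rng = np.random.default_rng(1)
# Made-up data: 3 classes, 12 samples each, d = 5 features
data = [rng.normal(size=(12, 5)) + shift for shift in (0.0, 1.0, 2.0)]

# Shared covariance: pool the centered samples of all classes into one estimate
centered = np.vstack([X - X.mean(axis=0) for X in data])
pooled = centered.T @ centered / len(centered)

# Statistical independence: keep only the diagonal of the pooled estimate
diagonal = np.diag(np.diag(pooled))

print(pooled.shape)    # one 5 x 5 covariance matrix shared by all classes
print(diagonal)        # off-diagonal elements forced to zero
```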

Page 23:

Overfitting

A simpler classifier can perform better than one that overfits the training data. Why?

The "training data" (black dots) were selected from a quadratic function plus Gaussian noise. The 10th-degree polynomial shown fits the data perfectly, but we desire instead the second-order function f(x), since it would lead to better predictions for new samples.
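A quick reproduction of this effect (the quadratic coefficients, noise level, and sample count are arbitrary choices, not the figure's actual data):

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(-1, 1, 11)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(scale=0.2, size=x.shape)  # quadratic + noise

fit10 = np.polyfit(x, y, deg=10)   # passes (nearly) exactly through the 11 training points
fit2 = np.polyfit(x, y, deg=2)     # the model family the data actually came from

x_new = np.linspace(-1, 1, 200)    # held-out evaluation grid
truth = 1.0 + 2.0 * x_new - 3.0 * x_new**2
err10 = np.mean((np.polyval(fit10, x_new) - truth) ** 2)
err2 = np.mean((np.polyval(fit2, x_new) - truth) ** 2)
print(err10, err2)                 # expect the degree-2 fit to generalize better
```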