
Page 1

E. Fatemizadeh
Statistical Pattern Recognition

Page 2

PATTERN RECOGNITION

Typical application areas:
- Machine vision
- Character recognition (OCR)
- Computer-aided diagnosis
- Speech recognition
- Face recognition
- Biometrics
- Image database retrieval
- Data mining
- Bioinformatics

The task: assign unknown objects (patterns) to the correct class. This is known as classification.

Page 3

Features: these are measurable quantities obtained from the patterns, and the classification task is based on their respective values.

Feature vectors: a number of features $x_1, \ldots, x_l$ constitute the feature vector

$$\mathbf{x} = [x_1, \ldots, x_l]^T \in \mathbb{R}^l$$

Feature vectors are treated as random vectors.

Page 4

An example: (figure)

Page 5

Classification system overview

The classifier consists of a set of functions whose values, computed at $\mathbf{x}$, determine the class to which the corresponding pattern belongs.

Processing pipeline: patterns -> sensor -> feature generation -> feature selection -> classifier design -> system evaluation.

Page 6

Supervised vs. unsupervised pattern recognition: the two major directions.

- Supervised: patterns whose class is known a priori are used for training.
- Unsupervised: the number of classes is (in general) unknown and no training patterns are available.

Page 7

CLASSIFIERS BASED ON BAYES DECISION THEORY

Statistical nature of feature vectors: $\mathbf{x} = [x_1, x_2, \ldots, x_l]^T$.

Assign the pattern represented by feature vector $\mathbf{x}$ to the most probable of the available classes $\omega_1, \omega_2, \ldots, \omega_M$.

That is, assign $\mathbf{x} \to \omega_i$ such that $P(\omega_i \mid \mathbf{x})$ is maximum.

Page 8

Computation of a-posteriori probabilities. Assume known:

- the a-priori probabilities $P(\omega_1), P(\omega_2), \ldots, P(\omega_M)$
- the class-conditional densities $p(\mathbf{x} \mid \omega_i),\; i = 1, 2, \ldots, M$; each $p(\mathbf{x} \mid \omega_i)$ is also known as the likelihood of $\omega_i$ with respect to $\mathbf{x}$.

Page 9

The Bayes rule ($M = 2$):

$$P(\omega_i \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \omega_i)\, P(\omega_i)}{p(\mathbf{x})}, \qquad \text{Posterior} = \frac{\text{Likelihood} \times \text{Prior}}{\text{Evidence}}$$

where

$$p(\mathbf{x}) = \sum_{i=1}^{2} p(\mathbf{x} \mid \omega_i)\, P(\omega_i)$$

Page 10

The Bayes classification rule (for two classes, $M = 2$): given $\mathbf{x}$, classify it according to the rule

If $P(\omega_1 \mid \mathbf{x}) > P(\omega_2 \mid \mathbf{x})$, assign $\mathbf{x} \to \omega_1$.
If $P(\omega_1 \mid \mathbf{x}) < P(\omega_2 \mid \mathbf{x})$, assign $\mathbf{x} \to \omega_2$.

Equivalently: classify $\mathbf{x}$ according to the rule

$$p(\mathbf{x} \mid \omega_1)\, P(\omega_1) \gtrless p(\mathbf{x} \mid \omega_2)\, P(\omega_2)$$

For equiprobable classes the test becomes

$$p(\mathbf{x} \mid \omega_1) \gtrless p(\mathbf{x} \mid \omega_2)$$
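A minimal sketch of this two-class rule in Python, assuming two hypothetical univariate Gaussian class-conditional densities and known priors (none of these numbers come from the slides):

```python
import numpy as np
from scipy.stats import norm

# Hypothetical class-conditional densities and priors (assumptions, for illustration)
priors = {"w1": 0.5, "w2": 0.5}
pdfs = {"w1": norm(loc=0.0, scale=1.0), "w2": norm(loc=1.0, scale=1.0)}

def bayes_classify(x):
    """Assign x to the class with the larger posterior score p(x|w_i) P(w_i)."""
    scores = {w: pdfs[w].pdf(x) * priors[w] for w in priors}
    return max(scores, key=scores.get)

print(bayes_classify(0.2))  # closer to the w1 mean -> "w1"
print(bayes_classify(0.9))  # closer to the w2 mean -> "w2"
```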

Page 11

(Figure: the two decision regions, $R_1 \to \omega_1$ and $R_2 \to \omega_2$.)

Page 12

Equivalently, in words: divide the space into two regions,

If $\mathbf{x} \in R_1$, then $\mathbf{x}$ in $\omega_1$.
If $\mathbf{x} \in R_2$, then $\mathbf{x}$ in $\omega_2$.

Probability of error (the total shaded area in the figure); in the one-dimensional equiprobable case with threshold $x_0$:

$$P_e = \frac{1}{2}\int_{-\infty}^{x_0} p(x \mid \omega_2)\,dx + \frac{1}{2}\int_{x_0}^{+\infty} p(x \mid \omega_1)\,dx$$

The Bayesian classifier is OPTIMAL with respect to minimising the classification error probability!

Page 13

Indeed: moving the threshold away from $x_0$, the total shaded area INCREASES by the extra "grey" area.

Page 14

Minimizing Classification Error Probability

Our claim: the Bayesian classifier is optimal with respect to minimizing the classification error probability.

$$P_e = P(\mathbf{x} \in R_2, \omega_1) + P(\mathbf{x} \in R_1, \omega_2)$$
$$= P(\mathbf{x} \in R_2 \mid \omega_1) P(\omega_1) + P(\mathbf{x} \in R_1 \mid \omega_2) P(\omega_2)$$
$$= P(\omega_1) \int_{R_2} p(\mathbf{x} \mid \omega_1)\,d\mathbf{x} + P(\omega_2) \int_{R_1} p(\mathbf{x} \mid \omega_2)\,d\mathbf{x}$$

Use the Bayes rule:

$$P_e = \int_{R_2} P(\omega_1 \mid \mathbf{x})\, p(\mathbf{x})\,d\mathbf{x} + \int_{R_1} P(\omega_2 \mid \mathbf{x})\, p(\mathbf{x})\,d\mathbf{x}$$

Page 15

Minimizing Classification Error Probability

$R_1$ and $R_2$ together cover the whole space, so

$$\int_{R_1} P(\omega_1 \mid \mathbf{x})\, p(\mathbf{x})\,d\mathbf{x} + \int_{R_2} P(\omega_1 \mid \mathbf{x})\, p(\mathbf{x})\,d\mathbf{x} = P(\omega_1)$$

Thus:

$$P_e = P(\omega_1) - \int_{R_1} \big( P(\omega_1 \mid \mathbf{x}) - P(\omega_2 \mid \mathbf{x}) \big)\, p(\mathbf{x})\,d\mathbf{x}$$

which is minimized by choosing $R_1 : P(\omega_1 \mid \mathbf{x}) > P(\omega_2 \mid \mathbf{x})$, that is, exactly the Bayes rule.

Page 16

The Bayes classification rule for many ($M > 2$) classes: given $\mathbf{x}$, classify it to $\omega_i$ if:

$$P(\omega_i \mid \mathbf{x}) > P(\omega_j \mid \mathbf{x}) \quad \forall j \neq i$$

Such a choice also minimizes the classification error probability.

Minimizing the average risk: for each wrong decision, a penalty term is assigned, since some decisions are more sensitive than others.

Page 17

For $M = 2$:

- Define the loss matrix

$$L = \begin{pmatrix} \lambda_{11} & \lambda_{12} \\ \lambda_{21} & \lambda_{22} \end{pmatrix}$$

- $\lambda_{12}$ is the penalty term for deciding class $\omega_2$ although the pattern belongs to $\omega_1$, etc.

Risk with respect to $\omega_1$:

$$r_1 = \lambda_{11} \int_{R_1} p(\mathbf{x} \mid \omega_1)\,d\mathbf{x} + \lambda_{12} \int_{R_2} p(\mathbf{x} \mid \omega_1)\,d\mathbf{x}$$

Page 18

Risk with respect to $\omega_2$:

$$r_2 = \lambda_{21} \int_{R_1} p(\mathbf{x} \mid \omega_2)\,d\mathbf{x} + \lambda_{22} \int_{R_2} p(\mathbf{x} \mid \omega_2)\,d\mathbf{x}$$

The highlighted terms are the probabilities of wrong decisions, weighted by the penalty terms.

Average risk:

$$r = r_1 P(\omega_1) + r_2 P(\omega_2)$$

Page 19

Choose $R_1$ and $R_2$ so that $r$ is minimized. Then assign $\mathbf{x}$ to $\omega_1$ if

$$\lambda_{11}\, p(\mathbf{x} \mid \omega_1) P(\omega_1) + \lambda_{21}\, p(\mathbf{x} \mid \omega_2) P(\omega_2) < \lambda_{12}\, p(\mathbf{x} \mid \omega_1) P(\omega_1) + \lambda_{22}\, p(\mathbf{x} \mid \omega_2) P(\omega_2)$$

Equivalently: assign $\mathbf{x}$ to $\omega_1$ if

$$l_{12} \equiv \frac{p(\mathbf{x} \mid \omega_1)}{p(\mathbf{x} \mid \omega_2)} > \frac{P(\omega_2)}{P(\omega_1)} \cdot \frac{\lambda_{21} - \lambda_{22}}{\lambda_{12} - \lambda_{11}}$$

$l_{12}$ is the likelihood ratio.
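A short sketch of this minimum-risk rule, again with hypothetical Gaussian likelihoods and an illustrative loss matrix (all values are assumptions of the demo, not from the slides):

```python
import numpy as np
from scipy.stats import norm

# Illustrative values (assumptions for the sketch)
P1, P2 = 0.5, 0.5                        # priors P(w1), P(w2)
lam = np.array([[0.0, 0.5],              # loss matrix [[l11, l12],
                [1.0, 0.0]])             #              [l21, l22]]
p1, p2 = norm(0.0, 1.0), norm(1.0, 1.0)  # p(x|w1), p(x|w2)

def min_risk_classify(x):
    """Assign x to w1 iff the likelihood ratio exceeds the risk threshold."""
    threshold = (P2 / P1) * (lam[1, 0] - lam[1, 1]) / (lam[0, 1] - lam[0, 0])
    l12 = p1.pdf(x) / p2.pdf(x)
    return "w1" if l12 > threshold else "w2"

print(min_risk_classify(0.3))
```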

Page 20

If $P(\omega_1) = P(\omega_2) = \frac{1}{2}$ and $\lambda_{11} = \lambda_{22} = 0$:

$$\mathbf{x} \to \omega_1 \;\text{ if }\; p(\mathbf{x} \mid \omega_1) > \frac{\lambda_{21}}{\lambda_{12}}\, p(\mathbf{x} \mid \omega_2)$$
$$\mathbf{x} \to \omega_2 \;\text{ if }\; p(\mathbf{x} \mid \omega_2) > \frac{\lambda_{12}}{\lambda_{21}}\, p(\mathbf{x} \mid \omega_1)$$

If $\lambda_{21} = \lambda_{12}$, this reduces to the minimum classification error probability rule.

Page 21

An example:

$$p(x \mid \omega_1) = \frac{1}{\sqrt{\pi}} \exp(-x^2), \qquad p(x \mid \omega_2) = \frac{1}{\sqrt{\pi}} \exp\!\big(-(x-1)^2\big)$$
$$P(\omega_1) = P(\omega_2) = \frac{1}{2}, \qquad L = \begin{pmatrix} 0 & 0.5 \\ 1.0 & 0 \end{pmatrix}$$

Page 22

Then the threshold value for minimum $P_e$ is:

$$x_0: \; \exp(-x^2) = \exp\!\big(-(x-1)^2\big) \;\Rightarrow\; x_0 = \frac{1}{2}$$

Threshold $\hat{x}_0$ for minimum $r$:

$$\hat{x}_0: \; \exp(-x^2) = 2\exp\!\big(-(x-1)^2\big) \;\Rightarrow\; \hat{x}_0 = \frac{1 - \ln 2}{2}$$

Page 23

Thus $\hat{x}_0$ moves to the left of $x_0 = \frac{1}{2}$ (WHY?)
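As a quick check, the two thresholds can be verified numerically; this sketch just solves the two threshold equations from the example above:

```python
import numpy as np

# Thresholds from the example: p(x|w1) ~ exp(-x^2), p(x|w2) ~ exp(-(x-1)^2)
x0 = 0.5                      # minimum-P_e threshold: exp(-x^2) = exp(-(x-1)^2)
x0_hat = (1 - np.log(2)) / 2  # minimum-risk threshold: exp(-x^2) = 2 exp(-(x-1)^2)

# Verify both equalities hold at the computed thresholds
assert np.isclose(np.exp(-x0**2), np.exp(-(x0 - 1)**2))
assert np.isclose(np.exp(-x0_hat**2), 2 * np.exp(-(x0_hat - 1)**2))
print(x0, x0_hat)  # 0.5  0.1534...
```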

Page 24

MinMax Criteria

Goal: minimize the maximum possible overall risk (no a-priori knowledge).

$$R = \int_{R_1} \big( \lambda_{11}\, p(\mathbf{x}\mid\omega_1) P(\omega_1) + \lambda_{21}\, p(\mathbf{x}\mid\omega_2) P(\omega_2) \big)\, d\mathbf{x} + \int_{R_2} \big( \lambda_{12}\, p(\mathbf{x}\mid\omega_1) P(\omega_1) + \lambda_{22}\, p(\mathbf{x}\mid\omega_2) P(\omega_2) \big)\, d\mathbf{x}$$

Use these facts: $P(\omega_2) = 1 - P(\omega_1)$ and $\int_{R_1} p(\mathbf{x}\mid\omega_i)\, d\mathbf{x} = 1 - \int_{R_2} p(\mathbf{x}\mid\omega_i)\, d\mathbf{x}$.

Page 25

MinMax Criteria

With some simplification:

$$R = \lambda_{22} + (\lambda_{21} - \lambda_{22}) \int_{R_1} p(\mathbf{x}\mid\omega_2)\, d\mathbf{x} + P(\omega_1) \left[ (\lambda_{11} - \lambda_{22}) + (\lambda_{12} - \lambda_{11}) \int_{R_2} p(\mathbf{x}\mid\omega_1)\, d\mathbf{x} - (\lambda_{21} - \lambda_{22}) \int_{R_1} p(\mathbf{x}\mid\omega_2)\, d\mathbf{x} \right]$$

Our goal: minimize the effect of $P(\omega_1)$, i.e. choose the regions so that the bracketed term vanishes:

$$(\lambda_{11} - \lambda_{22}) + (\lambda_{12} - \lambda_{11}) \int_{R_2} p(\mathbf{x}\mid\omega_1)\, d\mathbf{x} - (\lambda_{21} - \lambda_{22}) \int_{R_1} p(\mathbf{x}\mid\omega_2)\, d\mathbf{x} = 0$$

Page 26

MinMax Criteria

A simple case ($\lambda_{11} = \lambda_{22} = 0$, $\lambda_{12} = \lambda_{21} = 1$): the condition becomes

$$\int_{R_1} p(\mathbf{x}\mid\omega_2)\, d\mathbf{x} = \int_{R_2} p(\mathbf{x}\mid\omega_1)\, d\mathbf{x}$$

i.e. the two error integrals are equal.

Page 27

DISCRIMINANT FUNCTIONS AND DECISION SURFACES

If the regions $R_i, R_j$ are contiguous:

$$R_i: P(\omega_i \mid \mathbf{x}) > P(\omega_j \mid \mathbf{x}), \qquad R_j: P(\omega_j \mid \mathbf{x}) > P(\omega_i \mid \mathbf{x})$$

then $g(\mathbf{x}) \equiv P(\omega_i \mid \mathbf{x}) - P(\omega_j \mid \mathbf{x}) = 0$ is the surface separating the regions. On one side it is positive (+), on the other negative (-). It is known as the decision surface.

Page 28

For monotonically increasing $f(\cdot)$, the rule remains the same if we use:

$$\mathbf{x} \to \omega_i \;\text{ if }\; f\big(P(\omega_i \mid \mathbf{x})\big) > f\big(P(\omega_j \mid \mathbf{x})\big) \quad \forall j \neq i$$

$g_i(\mathbf{x}) \equiv f\big(P(\omega_i \mid \mathbf{x})\big)$ is a discriminant function.

In general, discriminant functions can be defined independently of the Bayesian rule. They lead to suboptimal solutions, yet, if chosen appropriately, they can be computationally more tractable.

Page 29

BAYESIAN CLASSIFIER FOR NORMAL DISTRIBUTIONS

Multivariate Gaussian pdf:

$$p(\mathbf{x} \mid \omega_i) = \frac{1}{(2\pi)^{l/2} |\Sigma_i|^{1/2}} \exp\left( -\frac{1}{2} (\mathbf{x} - \mu_i)^T \Sigma_i^{-1} (\mathbf{x} - \mu_i) \right)$$

where $\mu_i = E[\mathbf{x}]$ is an $l \times 1$ vector and $\Sigma_i = E\big[(\mathbf{x} - \mu_i)(\mathbf{x} - \mu_i)^T\big]$ is an $l \times l$ matrix. The latter is called the covariance matrix.
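A minimal sketch evaluating this density with SciPy, using made-up parameters for one class:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Made-up parameters (assumptions, for illustration)
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.1, 0.3],
                  [0.3, 1.9]])

x = np.array([1.0, 2.2])
# p(x | w_i) for the multivariate Gaussian above
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))
```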

Page 30

$\ln(\cdot)$ is monotonically increasing. Define:

$$g_i(\mathbf{x}) = \ln\big(p(\mathbf{x}\mid\omega_i)\, P(\omega_i)\big) = \ln p(\mathbf{x}\mid\omega_i) + \ln P(\omega_i)$$
$$g_i(\mathbf{x}) = -\frac{1}{2} (\mathbf{x} - \mu_i)^T \Sigma_i^{-1} (\mathbf{x} - \mu_i) + \ln P(\omega_i) + C_i, \qquad C_i = -\frac{l}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_i|$$

Example:

$$\Sigma_i = \begin{pmatrix} \sigma_i^2 & 0 \\ 0 & \sigma_i^2 \end{pmatrix}$$

Page 31

For this example:

$$g_i(\mathbf{x}) = -\frac{1}{2\sigma_i^2}\big(x_1^2 + x_2^2\big) + \frac{1}{\sigma_i^2}\big(\mu_{i1} x_1 + \mu_{i2} x_2\big) - \frac{1}{2\sigma_i^2}\big(\mu_{i1}^2 + \mu_{i2}^2\big) + \ln P(\omega_i) + C_i$$

That is, $g_i(\mathbf{x})$ is quadratic, and the surfaces $g_i(\mathbf{x}) - g_j(\mathbf{x}) = 0$ are quadrics: ellipsoids, parabolas, hyperbolas, pairs of lines.

Page 32

Decision Hyperplanes

Quadratic terms come from $\mathbf{x}^T \Sigma_i^{-1} \mathbf{x}$.

If ALL $\Sigma_i = \Sigma$ (the same), the quadratic terms are not of interest: they are not involved in the comparisons. Then, equivalently, we can write:

$$g_i(\mathbf{x}) = \mathbf{w}_i^T \mathbf{x} + w_{i0}, \qquad \mathbf{w}_i = \Sigma^{-1}\mu_i, \qquad w_{i0} = \ln P(\omega_i) - \frac{1}{2}\mu_i^T \Sigma^{-1} \mu_i$$

The discriminant functions are LINEAR.
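A small sketch of these linear discriminants for a shared covariance, with invented means, covariance, and priors:

```python
import numpy as np

# Invented parameters (assumptions): two classes, shared covariance
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigma = np.array([[1.1, 0.3], [0.3, 1.9]])
priors = [0.5, 0.5]
Sigma_inv = np.linalg.inv(Sigma)

def g(i, x):
    """Linear discriminant g_i(x) = w_i^T x + w_i0 for equal covariances."""
    w = Sigma_inv @ mus[i]
    w0 = np.log(priors[i]) - 0.5 * mus[i] @ Sigma_inv @ mus[i]
    return w @ x + w0

x = np.array([1.0, 2.2])
print(int(np.argmax([g(0, x), g(1, x)])))  # index of the winning class
```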

Page 33

Let, in addition, $\Sigma = \sigma^2 I$. Then:

$$g_i(\mathbf{x}) = \frac{1}{\sigma^2} \mu_i^T \mathbf{x} + w_{i0}$$

The decision surface $g_{ij}(\mathbf{x}) \equiv g_i(\mathbf{x}) - g_j(\mathbf{x}) = 0$ becomes

$$g_{ij}(\mathbf{x}) = \mathbf{w}^T (\mathbf{x} - \mathbf{x}_0) = 0, \qquad \mathbf{w} = \mu_i - \mu_j$$
$$\mathbf{x}_0 = \frac{1}{2}(\mu_i + \mu_j) - \sigma^2 \ln\!\left( \frac{P(\omega_i)}{P(\omega_j)} \right) \frac{\mu_i - \mu_j}{\|\mu_i - \mu_j\|^2}$$

Page 34

Nondiagonal $\Sigma$:

$$g_{ij}(\mathbf{x}) = \mathbf{w}^T (\mathbf{x} - \mathbf{x}_0) = 0, \qquad \mathbf{w} = \Sigma^{-1}(\mu_i - \mu_j)$$
$$\mathbf{x}_0 = \frac{1}{2}(\mu_i + \mu_j) - \ln\!\left( \frac{P(\omega_i)}{P(\omega_j)} \right) \frac{\mu_i - \mu_j}{\|\mu_i - \mu_j\|_{\Sigma^{-1}}^2}, \qquad \|\mathbf{x}\|_{\Sigma^{-1}} \equiv \big(\mathbf{x}^T \Sigma^{-1} \mathbf{x}\big)^{1/2}$$

The decision hyperplane is normal to $\Sigma^{-1}(\mu_i - \mu_j)$, not normal to $\mu_i - \mu_j$.

Page 35

Minimum Distance Classifiers

For equiprobable classes, $P(\omega_i) = \frac{1}{M}$, and

$$g_i(\mathbf{x}) = -\frac{1}{2} (\mathbf{x} - \mu_i)^T \Sigma^{-1} (\mathbf{x} - \mu_i)$$

If $\Sigma = \sigma^2 I$: assign $\mathbf{x} \to \omega_i$ with the smaller Euclidean distance

$$d_E = \|\mathbf{x} - \mu_i\|$$

For nondiagonal $\Sigma$: assign $\mathbf{x} \to \omega_i$ with the smaller Mahalanobis distance

$$d_m = \big( (\mathbf{x} - \mu_i)^T \Sigma^{-1} (\mathbf{x} - \mu_i) \big)^{1/2}$$
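A compact sketch of both rules, assuming illustrative class means and a shared covariance:

```python
import numpy as np

mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]   # illustrative class means
Sigma = np.array([[1.1, 0.3], [0.3, 1.9]])           # shared covariance (assumed)
Sigma_inv = np.linalg.inv(Sigma)

def classify_euclidean(x):
    return int(np.argmin([np.linalg.norm(x - mu) for mu in mus]))

def classify_mahalanobis(x):
    return int(np.argmin([np.sqrt((x - mu) @ Sigma_inv @ (x - mu)) for mu in mus]))

x = np.array([1.0, 2.2])
print(classify_euclidean(x), classify_mahalanobis(x))  # the two rules can disagree
```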

Page 36

More Geometry Analysis

The curves of constant Mahalanobis distance,

$$d_m = \big( (\mathbf{x} - \mu_i)^T \Sigma^{-1} (\mathbf{x} - \mu_i) \big)^{1/2} = c,$$

are ellipsoids. Writing the eigendecomposition $\Sigma = \Phi \Lambda \Phi^T$, with eigenvectors $v_1, \ldots, v_l$ and $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_l)$:

- Center of mass: $\mu_i$
- Principal axes: along the eigenvectors $v_k$, with half-lengths $c\sqrt{\lambda_k}$ (full axes $2c\sqrt{\lambda_k}$).

Page 37

(Figure.)

Page 38

Example: Given $P(\omega_1) = P(\omega_2)$ and

$$p(\mathbf{x}\mid\omega_1) = \mathcal{N}(\mu_1, \Sigma), \quad p(\mathbf{x}\mid\omega_2) = \mathcal{N}(\mu_2, \Sigma), \quad \mu_1 = \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \;\; \mu_2 = \begin{pmatrix} 3 \\ 3 \end{pmatrix}, \;\; \Sigma = \begin{pmatrix} 1.1 & 0.3 \\ 0.3 & 1.9 \end{pmatrix}$$

classify the vector $\mathbf{x} = [1.0,\; 2.2]^T$ using Bayesian classification:

$$\Sigma^{-1} = \begin{pmatrix} 0.95 & -0.15 \\ -0.15 & 0.55 \end{pmatrix}$$

Compute the Mahalanobis distances from $\mu_1, \mu_2$:

$$d_{m,1}^2 = [1.0,\; 2.2]\, \Sigma^{-1} \begin{pmatrix} 1.0 \\ 2.2 \end{pmatrix} = 2.952, \qquad d_{m,2}^2 = [-2.0,\; -0.8]\, \Sigma^{-1} \begin{pmatrix} -2.0 \\ -0.8 \end{pmatrix} = 3.672$$

Classify $\mathbf{x} \to \omega_1$. Observe that $d_{E,2} < d_{E,1}$: the plain Euclidean distance would have assigned $\mathbf{x}$ to $\omega_2$.
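The numbers in this example can be reproduced directly; a short check in NumPy:

```python
import numpy as np

mu1, mu2 = np.array([0.0, 0.0]), np.array([3.0, 3.0])
Sigma = np.array([[1.1, 0.3], [0.3, 1.9]])
Sigma_inv = np.linalg.inv(Sigma)
x = np.array([1.0, 2.2])

d2_m1 = (x - mu1) @ Sigma_inv @ (x - mu1)   # 2.952
d2_m2 = (x - mu2) @ Sigma_inv @ (x - mu2)   # 3.672
print(d2_m1, d2_m2)                                       # Mahalanobis favors w1
print(np.linalg.norm(x - mu1), np.linalg.norm(x - mu2))   # Euclidean favors w2
```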

Page 39

Statistical Error Analysis: Minimum Attainable Classification Error (Bayes)

$$P_e = \int \min\big[ P(\omega_i)\, p(\mathbf{x}\mid\omega_i),\; P(\omega_j)\, p(\mathbf{x}\mid\omega_j) \big]\, d\mathbf{x}$$

Lemma: $\min[a, b] \le a^s b^{1-s}$ for $a, b \ge 0$, $0 \le s \le 1$. Hence the Chernoff bound:

$$P_e \le P(\omega_i)^s P(\omega_j)^{1-s} \int p(\mathbf{x}\mid\omega_i)^s\, p(\mathbf{x}\mid\omega_j)^{1-s}\, d\mathbf{x} \equiv \varepsilon_{CB}$$

For $s = 0.5$ and Gaussian densities $\mathcal{N}(\mu_i, \Sigma_i)$, $\mathcal{N}(\mu_j, \Sigma_j)$:

$$\varepsilon_{CB} = \sqrt{P(\omega_i) P(\omega_j)}\, \exp(-B)$$

where

$$B = \frac{1}{8} (\mu_i - \mu_j)^T \left( \frac{\Sigma_i + \Sigma_j}{2} \right)^{-1} (\mu_i - \mu_j) + \frac{1}{2} \ln \frac{\left| \frac{\Sigma_i + \Sigma_j}{2} \right|}{\sqrt{|\Sigma_i|\, |\Sigma_j|}}$$

is the Bhattacharyya distance.

Page 40

ESTIMATION OF UNKNOWN PROBABILITY DENSITY FUNCTIONS

Maximum Likelihood: the setting

Let $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N$ be known and independent. Let $p(\mathbf{x})$ be known within an unknown vector parameter $\theta$:

$$p(\mathbf{x}) \equiv p(\mathbf{x}; \theta)$$

With $X = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N\}$,

$$p(X; \theta) = p(\mathbf{x}_1, \ldots, \mathbf{x}_N; \theta) = \prod_{k=1}^{N} p(\mathbf{x}_k; \theta)$$

which is known as the likelihood of $\theta$ with respect to $X$.

Page 41

The method:

$$\hat{\theta}_{ML} = \arg\max_{\theta} \prod_{k=1}^{N} p(\mathbf{x}_k; \theta)$$

Define the log-likelihood $L(\theta) \equiv \ln p(X; \theta) = \sum_{k=1}^{N} \ln p(\mathbf{x}_k; \theta)$. Then:

$$\hat{\theta}_{ML}: \quad \frac{\partial L(\theta)}{\partial \theta} = \sum_{k=1}^{N} \frac{1}{p(\mathbf{x}_k; \theta)} \frac{\partial p(\mathbf{x}_k; \theta)}{\partial \theta} = 0$$

Page 42

(Figure.)

Page 43

Asymptotic properties. If, indeed, there is a $\theta_0$ such that $p(\mathbf{x}) = p(\mathbf{x}; \theta_0)$, then:

- Asymptotically unbiased: $\lim_{N\to\infty} E[\hat{\theta}_{ML}] = \theta_0$
- Consistent: $\lim_{N\to\infty} E\big[ \|\hat{\theta}_{ML} - \theta_0\|^2 \big] = 0$
- Asymptotically efficient: it achieves the Cramer-Rao bound, the lowest variance any estimator can attain.

Page 44

Example: $p(\mathbf{x}; \mu)$: $\mathcal{N}(\mu, \Sigma)$ with $\mu$ unknown and $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N$ independent, where

$$p(\mathbf{x}_k; \mu) = \frac{1}{(2\pi)^{l/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (\mathbf{x}_k - \mu)^T \Sigma^{-1} (\mathbf{x}_k - \mu) \right)$$

The log-likelihood is

$$L(\mu) = \ln \prod_{k=1}^{N} p(\mathbf{x}_k; \mu) = C - \frac{1}{2} \sum_{k=1}^{N} (\mathbf{x}_k - \mu)^T \Sigma^{-1} (\mathbf{x}_k - \mu)$$

Setting the gradient to zero,

$$\frac{\partial L(\mu)}{\partial \mu} = \sum_{k=1}^{N} \Sigma^{-1} (\mathbf{x}_k - \mu) = 0 \;\Rightarrow\; \hat{\mu}_{ML} = \frac{1}{N} \sum_{k=1}^{N} \mathbf{x}_k$$

Remember: if $A = A^T$, then $\frac{\partial (\mathbf{x}^T A \mathbf{x})}{\partial \mathbf{x}} = 2 A \mathbf{x}$.
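A tiny sketch confirming the closed-form result on synthetic data (the "unknown" true mean below is an assumption of the demo):

```python
import numpy as np

rng = np.random.default_rng(0)
true_mu = np.array([3.0, 3.0])                 # assumed "unknown" mean
Sigma = np.array([[1.1, 0.3], [0.3, 1.9]])
X = rng.multivariate_normal(true_mu, Sigma, size=5000)

mu_ml = X.mean(axis=0)   # ML estimate: the sample mean
print(mu_ml)             # close to [3.0, 3.0]
```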

Page 45

Maximum A-Posteriori Probability Estimation

In the ML method, $\theta$ was considered a parameter. Here we shall look at $\theta$ as a random vector described by a pdf $p(\theta)$, assumed to be known.

Given $X = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N\}$, compute the maximum of $p(\theta \mid X)$.

From the Bayes theorem:

$$p(\theta)\, p(X \mid \theta) = p(X)\, p(\theta \mid X), \qquad \text{or} \qquad p(\theta \mid X) = \frac{p(\theta)\, p(X \mid \theta)}{p(X)}$$

Page 46

The method:

$$\hat{\theta}_{MAP} = \arg\max_{\theta} p(\theta \mid X), \qquad \text{or} \qquad \hat{\theta}_{MAP}: \;\; \frac{\partial}{\partial \theta} \big( p(\theta)\, p(X \mid \theta) \big) = 0$$

If $p(\theta)$ is uniform or broad enough, $\hat{\theta}_{MAP} \approx \hat{\theta}_{ML}$.

Page 47

(Figure.)

Page 48

Example: $p(\mathbf{x} \mid \mu)$: $\mathcal{N}(\mu, \sigma^2 I)$ with $\mu$ unknown, $X = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$, and a Gaussian prior on the mean:

$$p(\mu) = \frac{1}{(2\pi)^{l/2} \sigma_\mu^l} \exp\left( -\frac{\|\mu - \mu_0\|^2}{2\sigma_\mu^2} \right)$$

MAP: set $\frac{\partial}{\partial \mu} \ln\left( p(\mu) \prod_{k=1}^{N} p(\mathbf{x}_k \mid \mu) \right) = 0$, i.e.

$$\sum_{k=1}^{N} \frac{1}{\sigma^2} (\mathbf{x}_k - \hat{\mu}) - \frac{1}{\sigma_\mu^2} (\hat{\mu} - \mu_0) = 0 \;\Rightarrow\; \hat{\mu}_{MAP} = \frac{\mu_0 + \frac{\sigma_\mu^2}{\sigma^2} \sum_{k=1}^{N} \mathbf{x}_k}{1 + \frac{\sigma_\mu^2}{\sigma^2} N}$$

For $\frac{\sigma_\mu^2}{\sigma^2} \gg 1$, or for $N \to \infty$: $\hat{\mu}_{MAP} \approx \hat{\mu}_{ML} = \frac{1}{N}\sum_{k=1}^{N} \mathbf{x}_k$.
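A quick numeric sketch of this formula, with invented prior and noise variances:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2, sigma_mu2 = 1.0, 10.0          # assumed noise and prior variances
mu0 = np.zeros(2)                      # assumed prior mean
X = rng.multivariate_normal([3.0, 3.0], sigma2 * np.eye(2), size=50)

ratio = sigma_mu2 / sigma2
mu_map = (mu0 + ratio * X.sum(axis=0)) / (1 + ratio * len(X))
print(mu_map, X.mean(axis=0))          # MAP is close to ML for a broad prior
```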

Page 49

Bayesian Inference

ML and MAP compute a single estimate for $\theta$. Here a different route is followed.

Given: the training set $X = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$, $p(\mathbf{x} \mid \theta)$ and $p(\theta)$.

The goal: estimate $p(\mathbf{x} \mid X)$.

How??

Page 50

$$p(\mathbf{x} \mid X) = \int p(\mathbf{x} \mid \theta)\, p(\theta \mid X)\, d\theta$$
$$p(\theta \mid X) = \frac{p(X \mid \theta)\, p(\theta)}{p(X)} = \frac{p(X \mid \theta)\, p(\theta)}{\int p(X \mid \theta)\, p(\theta)\, d\theta}, \qquad p(X \mid \theta) = \prod_{k=1}^{N} p(\mathbf{x}_k \mid \theta)$$

A bit more insight via an example. Let

$$p(x \mid \mu) = \mathcal{N}(\mu, \sigma^2), \;\; \mu \text{ the unknown mean}, \qquad p(\mu) = \mathcal{N}(\mu_0, \sigma_0^2)$$

It turns out (Problem 2.22) that $p(\mu \mid X) = \mathcal{N}(\mu_N, \sigma_N^2)$, where

$$\mu_N = \frac{N \sigma_0^2\, \bar{x} + \sigma^2 \mu_0}{N \sigma_0^2 + \sigma^2}, \qquad \sigma_N^2 = \frac{\sigma^2 \sigma_0^2}{N \sigma_0^2 + \sigma^2}, \qquad \bar{x} = \frac{1}{N} \sum_{k=1}^{N} x_k$$
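A short sketch of this conjugate update, with invented hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(2)
sigma2 = 1.0                 # assumed known observation variance
mu0, sigma0_2 = 0.0, 4.0     # assumed prior mean and variance
x = rng.normal(2.0, np.sqrt(sigma2), size=100)

N, xbar = len(x), x.mean()
mu_N = (N * sigma0_2 * xbar + sigma2 * mu0) / (N * sigma0_2 + sigma2)
sigma_N2 = sigma2 * sigma0_2 / (N * sigma0_2 + sigma2)
print(mu_N, sigma_N2)        # posterior sharpens around the sample mean as N grows
```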

Page 51

The above is a sequence of Gaussians as $N \to \infty$: $\sigma_N^2 \to 0$, so $p(\mu \mid X)$ concentrates around the sample mean.

Maximum Entropy. The entropy of a pdf is

$$H = -\int p(\mathbf{x}) \ln p(\mathbf{x})\, d\mathbf{x}$$

$\hat{p}(\mathbf{x})$: maximize $H$ subject to the available constraints.

Page 52

Example: $x$ is nonzero in the interval $x_1 \le x \le x_2$ and zero otherwise. Compute the maximum entropy pdf.

- The constraint: $\int_{x_1}^{x_2} p(x)\, dx = 1$
- Lagrange multipliers:

$$H_L = -\int_{x_1}^{x_2} p(x) \big( \ln p(x) - \lambda \big)\, dx - \lambda$$

Setting $\frac{\partial H_L}{\partial p} = 0$ gives $\hat{p}(x) = \exp(\lambda - 1)$, a constant; the constraint then yields

$$\hat{p}(x) = \begin{cases} \dfrac{1}{x_2 - x_1} & x_1 \le x \le x_2 \\ 0 & \text{otherwise} \end{cases}$$

Page 53

Mixture Models

$$p(\mathbf{x}) = \sum_{j=1}^{J} p(\mathbf{x} \mid j)\, P_j, \qquad \sum_{j=1}^{J} P_j = 1, \qquad \int_{\mathbf{x}} p(\mathbf{x} \mid j)\, d\mathbf{x} = 1$$

Assume parametric modeling, i.e., $p(\mathbf{x} \mid j; \theta)$.

The goal is to estimate $\theta$ and $P_1, P_2, \ldots, P_J$, given a set $X = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N\}$.

We ignore the details!

Page 54

Nonparametric Estimation

Histogram-type approximation around a point $\hat{x}$, using a bin of width $h$:

$$\hat{p}(x) \approx \hat{p}(\hat{x}) = \frac{1}{h} \frac{k_N}{N}, \qquad \hat{x} - \frac{h}{2} \le x \le \hat{x} + \frac{h}{2}$$

where $k_N$ is the number of points, out of $N$ in total, that fall inside the bin.

If $p(x)$ is continuous, $\hat{p}(x) \to p(x)$ as $N \to \infty$, provided $h_N \to 0$, $k_N \to \infty$, and $\frac{k_N}{N} \to 0$.

Page 55

Parzen Windows

Divide the multidimensional space into hypercubes.

Page 56

Define

$$\varphi(\mathbf{x}_i) = \begin{cases} 1 & |x_{ij}| \le \frac{1}{2} \\ 0 & \text{otherwise} \end{cases}$$

That is, it is 1 inside a unit-side hypercube centered at 0, and the estimate becomes

$$\hat{p}(\mathbf{x}) = \frac{1}{h^l} \frac{1}{N} \sum_{i=1}^{N} \varphi\left( \frac{\mathbf{x}_i - \mathbf{x}}{h} \right)$$

i.e. $\frac{1}{\text{volume}} \cdot \frac{1}{N} \cdot$ (number of points inside an $h$-side hypercube centered at $\mathbf{x}$).

- The problem: $p(\mathbf{x})$ is continuous, while $\varphi(\cdot)$ is discontinuous.
- Parzen windows, kernels, potential functions: take $\varphi(\mathbf{x})$ smooth, with

$$\varphi(\mathbf{x}) \ge 0, \qquad \int_{\mathbf{x}} \varphi(\mathbf{x})\, d\mathbf{x} = 1$$
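A minimal sketch of a smooth (Gaussian-kernel) Parzen estimate in one dimension, on synthetic data; the data and window width are assumptions of the demo:

```python
import numpy as np

rng = np.random.default_rng(3)
samples = rng.normal(0.0, 1.0, size=1000)   # synthetic data (demo assumption)
h = 0.3                                     # window width, chosen by hand

def parzen_pdf(x, data, h):
    """Gaussian-kernel Parzen estimate: (1/(N h)) sum_i phi((x - x_i)/h)."""
    u = (x - data) / h
    return np.mean(np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)) / h

print(parzen_pdf(0.0, samples, h))  # should be near the true N(0,1) value 0.3989
```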

Page 57

Mean value:

$$E[\hat{p}(\mathbf{x})] = \frac{1}{h^l} \frac{1}{N} \sum_{i=1}^{N} E\left[ \varphi\left( \frac{\mathbf{x}_i - \mathbf{x}}{h} \right) \right] = \frac{1}{h^l} \int_{\mathbf{x}'} \varphi\left( \frac{\mathbf{x}' - \mathbf{x}}{h} \right) p(\mathbf{x}')\, d\mathbf{x}'$$

As $h \to 0$, the width of $\frac{1}{h^l}\varphi\left( \frac{\mathbf{x}' - \mathbf{x}}{h} \right)$ goes to $0$ while $\frac{1}{h^l} \int \varphi\left( \frac{\mathbf{x}' - \mathbf{x}}{h} \right) d\mathbf{x}' = 1$, so

$$\frac{1}{h^l} \varphi\left( \frac{\mathbf{x}' - \mathbf{x}}{h} \right) \xrightarrow{h \to 0} \delta(\mathbf{x}' - \mathbf{x}), \qquad E[\hat{p}(\mathbf{x})] \to p(\mathbf{x})$$

Hence the estimate is unbiased in the limit.

Page 58

Variance: the smaller the $h$, the higher the variance.

(Figures: h=0.1, N=1000 and h=0.8, N=1000.)

Page 59

(Figure: h=0.1, N=10000.)

The higher the $N$, the better the accuracy.

Page 60

If

- $h \to 0$
- $N \to \infty$
- $hN \to \infty$

then the estimate is asymptotically unbiased and consistent.

The method, for two classes. Remember the likelihood-ratio test:

$$l_{12} \equiv \frac{p(\mathbf{x}\mid\omega_1)}{p(\mathbf{x}\mid\omega_2)} \gtrless \frac{P(\omega_2)}{P(\omega_1)} \cdot \frac{\lambda_{21} - \lambda_{22}}{\lambda_{12} - \lambda_{11}}$$

and plug in the Parzen estimates of the two class-conditional densities:

$$\frac{\dfrac{1}{N_1 h^l} \sum_{i=1}^{N_1} \varphi\left( \dfrac{\mathbf{x} - \mathbf{x}_i}{h} \right)}{\dfrac{1}{N_2 h^l} \sum_{i=1}^{N_2} \varphi\left( \dfrac{\mathbf{x} - \mathbf{x}_i}{h} \right)} \gtrless \frac{P(\omega_2)}{P(\omega_1)} \cdot \frac{\lambda_{21} - \lambda_{22}}{\lambda_{12} - \lambda_{11}}$$

Page 61

CURSE OF DIMENSIONALITY

In all the methods so far, we saw that the higher the number of points $N$, the better the resulting estimate.

If in the one-dimensional space an interval is adequately filled with $N$ points (for good estimation), then in the two-dimensional space the corresponding square will require $N^2$ points, and in the ℓ-dimensional space the ℓ-dimensional cube will require $N^\ell$ points.

The exponential increase in the number of necessary points is known as the curse of dimensionality. This is a major problem one is confronted with in high-dimensional spaces.

Page 62

NAIVE BAYES CLASSIFIER

Let $\mathbf{x} \in \mathbb{R}^\ell$; the goal is to estimate $p(\mathbf{x} \mid \omega_i)$, $i = 1, 2, \ldots, M$. For a "good" estimate of the pdf one would need, say, $N^\ell$ points.

Assume $x_1, x_2, \ldots, x_\ell$ mutually independent. Then:

$$p(\mathbf{x} \mid \omega_i) = \prod_{j=1}^{\ell} p(x_j \mid \omega_i)$$

In this case, one would require, roughly, $N$ points for each one-dimensional pdf. Thus, a number of points of the order $N \cdot \ell$ would suffice.

It turns out that the Naïve Bayes classifier works reasonably well even in cases that violate the independence assumption.
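A compact sketch of a Gaussian naive Bayes rule on made-up data, estimating each one-dimensional $p(x_j \mid \omega_i)$ separately (the data and the per-feature Gaussian model are assumptions of the demo):

```python
import numpy as np

rng = np.random.default_rng(4)
# Made-up two-class training data (demo assumption)
X1 = rng.normal([0, 0], 1.0, size=(200, 2))
X2 = rng.normal([3, 3], 1.0, size=(200, 2))

def fit(X):
    return X.mean(axis=0), X.var(axis=0)    # per-feature mean and variance

def log_lik(x, mu, var):
    # sum_j log p(x_j | w_i) under a per-feature Gaussian model
    return np.sum(-0.5 * np.log(2 * np.pi * var) - (x - mu) ** 2 / (2 * var))

params = [fit(X1), fit(X2)]
x = np.array([1.0, 2.2])
print(int(np.argmax([log_lik(x, mu, var) for mu, var in params])))
```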

Page 63

K Nearest Neighbor Density Estimation

In Parzen:
- The volume is constant.
- The number of points in the volume is varying.

Now:
- Keep the number of points $k$ constant.
- Let the volume vary:

$$\hat{p}(\mathbf{x}) = \frac{k}{N\, V(\mathbf{x})}$$
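A one-dimensional sketch of the rule $\hat{p}(x) = k / (N\, V(x))$, where $V(x)$ is taken as the length of the smallest interval around $x$ containing $k$ samples (synthetic data, demo assumption):

```python
import numpy as np

rng = np.random.default_rng(5)
data = rng.normal(0.0, 1.0, size=2000)   # synthetic sample (demo assumption)

def knn_pdf(x, data, k=50):
    """k-NN density estimate: k / (N * V), V = twice the k-th nearest distance."""
    dists = np.sort(np.abs(data - x))
    V = 2 * dists[k - 1]                 # interval just enclosing the k neighbors
    return k / (len(data) * V)

print(knn_pdf(0.0, data))  # near the true N(0,1) density 0.3989 at x = 0
```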

Page 64

For two classes, estimating each class-conditional density with the same $k$ gives $\hat{p}(\mathbf{x}\mid\omega_i) = \frac{k}{N_i V_i(\mathbf{x})}$, so the likelihood ratio becomes

$$l_{12} = \frac{\hat{p}(\mathbf{x}\mid\omega_1)}{\hat{p}(\mathbf{x}\mid\omega_2)} = \frac{k / (N_1 V_1)}{k / (N_2 V_2)} = \frac{N_2 V_2}{N_1 V_1}$$

Page 65

The Nearest Neighbor Rule

- Choose $k$ out of the $N$ training vectors; identify the $k$ nearest ones to $\mathbf{x}$.
- Out of these $k$, identify the number $k_i$ that belong to class $\omega_i$.
- Assign $\mathbf{x} \to \omega_i : k_i > k_j \;\; \forall j \neq i$.

The simplest version: $k = 1$!

For large $N$ this is not bad. It can be shown that, if $P_B$ is the optimal Bayesian error probability, then:

$$P_B \le P_{NN} \le P_B \left( 2 - \frac{M}{M-1} P_B \right) \le 2 P_B$$
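A small sketch of the k-NN rule itself (majority vote among the k nearest training vectors), on invented data:

```python
import numpy as np

rng = np.random.default_rng(6)
# Invented labeled training set: class 0 around (0,0), class 1 around (3,3)
X = np.vstack([rng.normal([0, 0], 1.0, (100, 2)),
               rng.normal([3, 3], 1.0, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

def knn_classify(x, X, y, k=3):
    """Majority vote among the k nearest training vectors (Euclidean)."""
    nearest = np.argsort(np.linalg.norm(X - x, axis=1))[:k]
    return np.bincount(y[nearest]).argmax()

print(knn_classify(np.array([1.0, 2.2]), X, y, k=5))
```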

Page 66

More generally, for the k-NN rule:

$$P_B \le P_{kNN} \le P_{NN} \le 2 P_B, \qquad P_{kNN} \xrightarrow{k \to \infty} P_B$$

For small $P_B$:

$$P_{NN} \approx 2 P_B, \qquad P_{3NN} \approx P_B + 3 (P_B)^2$$

Page 67

Voronoi tessellation

$$R_i = \{ \mathbf{x} : d(\mathbf{x}, \mathbf{x}_i) < d(\mathbf{x}, \mathbf{x}_j) \;\; \forall j \neq i \}$$