Classification Algorithms, Lecture 17

Page 1: Classification Algorithms, Lecture 17

Page 2: Probability Theory: Apples and Oranges

Pick the red box with probability 40%, or the blue box with probability 60%.
Red box: 2 apples, 6 oranges.
Blue box: 3 apples, 1 orange.
Any piece of fruit in the chosen box is equally likely to be picked.

From Bishop, PRML.

Page 3: Probability Theory: Apples and Oranges

Same setup: red box (40%) with 2 apples and 6 oranges; blue box (60%) with 3 apples and 1 orange.

1. What is the overall probability that the selection will pick an apple?

From Bishop, PRML.

Page 4: Probability Theory: Apples and Oranges

Same setup: red box (40%) with 2 apples and 6 oranges; blue box (60%) with 3 apples and 1 orange.

2. Given that we have chosen an orange, what is the probability that the box we chose was the blue one?

From Bishop, PRML.

Page 5: Probability Theory

Consider N trials in which two variables X and Y are observed:
Total number of trials: N
Number of instances where X = x_i (column count): c_i
Number of instances where Y = y_j (row count): r_j

From Bishop, PRML.

Page 6: Probability Theory

Marginal probability: $p(X = x_i) = c_i / N$

From Bishop, PRML.

Page 7: Probability Theory

Marginal probability: $p(X = x_i) = c_i / N$
Conditional probability: $p(Y = y_j \mid X = x_i) = n_{ij} / c_i$, where $n_{ij}$ is the number of instances with both X = x_i and Y = y_j

From Bishop, PRML.

Page 8: Probability Theory

Marginal probability: $p(X = x_i) = c_i / N$
Conditional probability: $p(Y = y_j \mid X = x_i) = n_{ij} / c_i$
Joint probability: $p(X = x_i, Y = y_j) = n_{ij} / N$

From Bishop, PRML.

Page 9: Probability Theory

Sum rule: $p(X = x_i) = \sum_j p(X = x_i, Y = y_j)$

From Bishop, PRML.

Page 10: Probability Theory

Sum rule: $p(X = x_i) = \sum_j p(X = x_i, Y = y_j)$
Product rule: $p(X = x_i, Y = y_j) = p(X = x_i \mid Y = y_j)\, p(Y = y_j)$

From Bishop, PRML.

Page 11: The Rules of Probability

Sum rule: $p(X) = \sum_Y p(X, Y)$
Product rule: $p(X, Y) = p(X \mid Y)\, p(Y)$

From Bishop, PRML.

Page 12: Bayes' Theorem

$p(Y \mid X)\, p(X) = p(X \mid Y)\, p(Y)$, i.e. $p(Y \mid X) = \dfrac{p(X \mid Y)\, p(Y)}{p(X)}$

posterior ∝ likelihood × prior

From Bishop, PRML.

Page 13: Probability Theory: Apples and Oranges

Pick the red box with probability 40%, or the blue box with probability 60%.
Red box: 2 apples, 6 oranges.
Blue box: 3 apples, 1 orange.

1. What is the overall probability that the selection will pick an apple?

From Bishop, PRML.

Page 14: Sum Rule and Product Rule at Work

P(B = r) = 4/10
P(B = b) = 6/10
P(F = a | B = r) = 1/4
P(F = o | B = r) = 3/4
P(F = a | B = b) = 3/4
P(F = o | B = b) = 1/4

Using the product rule $p(X, Y) = p(X \mid Y)\, p(Y)$ and the sum rule:

$P(F = a) = P(F = a \mid B = r)\, P(B = r) + P(F = a \mid B = b)\, P(B = b) = \tfrac{1}{4}\cdot\tfrac{4}{10} + \tfrac{3}{4}\cdot\tfrac{6}{10} = \tfrac{11}{20}$

$P(F = o) = 1 - P(F = a) = \tfrac{9}{20}$
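As a quick numerical check of the calculation above, here is a minimal sketch (not from the slides) that applies the sum and product rules to the same box and fruit probabilities:

```python
# Sum and product rules for the apples-and-oranges example (exact arithmetic).
from fractions import Fraction as F

p_box = {"r": F(4, 10), "b": F(6, 10)}          # prior P(B)
p_fruit_given_box = {                            # conditional P(F | B)
    "r": {"a": F(1, 4), "o": F(3, 4)},
    "b": {"a": F(3, 4), "o": F(1, 4)},
}

# Sum rule over boxes; each term uses the product rule P(F, B) = P(F | B) P(B).
p_apple = sum(p_fruit_given_box[b]["a"] * p_box[b] for b in p_box)
p_orange = sum(p_fruit_given_box[b]["o"] * p_box[b] for b in p_box)
print(p_apple, p_orange)   # 11/20 9/20
```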

Page 15: Probability Theory: Apples and Oranges

Pick the red box with probability 40%, or the blue box with probability 60%.
Red box: 2 apples, 6 oranges.
Blue box: 3 apples, 1 orange.

2. Given that we have chosen an orange, what is the probability that the box we chose was the blue one?

Page 16: Sum Rule and Product Rule at Work

P(B = r) = 4/10
P(B = b) = 6/10
P(F = a | B = r) = 1/4
P(F = o | B = r) = 3/4
P(F = a | B = b) = 3/4
P(F = o | B = b) = 1/4

By Bayes' theorem:

$P(B = b \mid F = o) = \dfrac{P(F = o \mid B = b)\, P(B = b)}{P(F = o)} = \tfrac{1}{4}\cdot\tfrac{6}{10}\cdot\tfrac{20}{9} = \tfrac{1}{3}$
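The posterior can be checked numerically as well; the following sketch (again not part of the slides) applies Bayes' theorem directly:

```python
# Bayes' theorem for the posterior over boxes, given that an orange was drawn.
from fractions import Fraction as F

p_box = {"r": F(4, 10), "b": F(6, 10)}
p_orange_given_box = {"r": F(3, 4), "b": F(1, 4)}

p_orange = sum(p_orange_given_box[b] * p_box[b] for b in p_box)        # evidence, 9/20
p_blue_given_orange = p_orange_given_box["b"] * p_box["b"] / p_orange  # posterior
print(p_blue_given_orange)   # 1/3
```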

Page 17: Classification

Likelihood ratio test:
- Assume we are to classify an object based on the evidence provided by a measurement (or feature vector) x.
- A reasonable decision rule is the following: "Choose the class that is most probable given the observed feature vector x."
- More formally: evaluate the posterior probability of each class, P(Ci | x), and choose the class with the largest P(Ci | x).

Page 18: Classification

Likelihood ratio test:
- Let us examine the implications of this decision rule for a 2-class problem.
- In this case the decision rule becomes:

$\text{if } P(C_1 \mid x) > P(C_2 \mid x) \Rightarrow x \in C_1, \quad \text{otherwise } x \in C_2$

Page 19: Classification

Likelihood ratio test:
- Let us examine the implications of this decision rule for a 2-class problem.
- In this case the decision rule becomes: if $P(C_1 \mid x) > P(C_2 \mid x)$ then $x \in C_1$, otherwise $x \in C_2$.
- More compactly:

$P(C_1 \mid x) \;\underset{C_2}{\overset{C_1}{\gtrless}}\; P(C_2 \mid x)$

Page 20: Classification

Likelihood ratio test:
- Let us examine the implications of this decision rule for a 2-class problem.
- More compactly: $P(C_1 \mid x) \;\underset{C_2}{\overset{C_1}{\gtrless}}\; P(C_2 \mid x)$
- From Bayes' rule:

$\dfrac{P(x \mid C_1)\, P(C_1)}{P(x)} \;\underset{C_2}{\overset{C_1}{\gtrless}}\; \dfrac{P(x \mid C_2)\, P(C_2)}{P(x)}$

Page 21: Classification

Likelihood ratio test:
- Let us examine the implications of this decision rule for a 2-class problem.
- More compactly: $P(C_1 \mid x) \;\underset{C_2}{\overset{C_1}{\gtrless}}\; P(C_2 \mid x)$
- From Bayes' rule:

$\dfrac{P(x \mid C_1)\, P(C_1)}{P(x)} \;\underset{C_2}{\overset{C_1}{\gtrless}}\; \dfrac{P(x \mid C_2)\, P(C_2)}{P(x)}$

Cancelling P(x) on both sides:

$P(x \mid C_1)\, P(C_1) \;\underset{C_2}{\overset{C_1}{\gtrless}}\; P(x \mid C_2)\, P(C_2)
\quad\Longrightarrow\quad
\Lambda(x) = \dfrac{P(x \mid C_1)}{P(x \mid C_2)} \;\underset{C_2}{\overset{C_1}{\gtrless}}\; \dfrac{P(C_2)}{P(C_1)}$

Page 22: Classification

Likelihood ratio test:
- Let us examine the implications of this decision rule for a 2-class problem.
- More compactly: $P(C_1 \mid x) \;\underset{C_2}{\overset{C_1}{\gtrless}}\; P(C_2 \mid x)$
- From Bayes' rule:

$\dfrac{P(x \mid C_1)\, P(C_1)}{P(x)} \;\underset{C_2}{\overset{C_1}{\gtrless}}\; \dfrac{P(x \mid C_2)\, P(C_2)}{P(x)}
\quad\Longrightarrow\quad
\Lambda(x) = \dfrac{P(x \mid C_1)}{P(x \mid C_2)} \;\underset{C_2}{\overset{C_1}{\gtrless}}\; \dfrac{P(C_2)}{P(C_1)}$

- The term $\Lambda(x)$ is called the likelihood ratio.

Page 23: An Example

Likelihood ratio test:

$P(x \mid C_1) = \dfrac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(x-4)^2} \qquad
P(x \mid C_2) = \dfrac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(x-10)^2}$

(Let us assume equal priors: P(C1) = P(C2).)

From Gutierrez-Osuna.

Page 24: An Example

Likelihood ratio test with equal priors:

$\Lambda(x) = \dfrac{P(x \mid C_1)}{P(x \mid C_2)} \;\underset{C_2}{\overset{C_1}{\gtrless}}\; 1$

$\Lambda(x) = \dfrac{\frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(x-4)^2}}{\frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(x-10)^2}}
= \dfrac{e^{-\frac{1}{2}(x-4)^2}}{e^{-\frac{1}{2}(x-10)^2}} \;\underset{C_2}{\overset{C_1}{\gtrless}}\; 1$

Taking logs:

$\log \Lambda(x) = -\tfrac{1}{2}(x-4)^2 + \tfrac{1}{2}(x-10)^2 \;\underset{C_2}{\overset{C_1}{\gtrless}}\; 0$

Page 25: An Example

$\Lambda(x) = \dfrac{P(x \mid C_1)}{P(x \mid C_2)} \;\underset{C_2}{\overset{C_1}{\gtrless}}\; 1
\quad\Longrightarrow\quad
\log \Lambda(x) = -\tfrac{1}{2}(x-4)^2 + \tfrac{1}{2}(x-10)^2 \;\underset{C_2}{\overset{C_1}{\gtrless}}\; 0$

Solving for x gives the decision threshold:

$x \;\underset{C_2}{\overset{C_1}{\lessgtr}}\; 7$

i.e. choose C1 if x < 7 and C2 if x > 7.
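A minimal sketch of this likelihood-ratio test in code; the means (4 and 10), unit variances, and equal priors are those of the example above, and the printed decisions confirm the threshold at x = 7:

```python
import math

def gaussian_pdf(x, mu, sigma=1.0):
    """Univariate Gaussian density."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (math.sqrt(2 * math.pi) * sigma)

def lrt_classify(x, mu1=4.0, mu2=10.0, prior1=0.5, prior2=0.5):
    """Choose C1 if the likelihood ratio exceeds P(C2)/P(C1), else C2."""
    ratio = gaussian_pdf(x, mu1) / gaussian_pdf(x, mu2)
    return "C1" if ratio > prior2 / prior1 else "C2"

print(lrt_classify(6.9), lrt_classify(7.1))   # C1 C2: the decision boundary is at x = 7
```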

Page 26: Variants

Maximum A Posteriori (MAP) criterion:

$\Lambda(x) = \dfrac{P(x \mid C_1)\, P(C_1)/P(x)}{P(x \mid C_2)\, P(C_2)/P(x)}
= \dfrac{P(C_1 \mid x)}{P(C_2 \mid x)} \;\underset{C_2}{\overset{C_1}{\gtrless}}\; 1$

Maximum Likelihood (ML) criterion (assumes equal priors, $P(C_1) = P(C_2)$):

$\Lambda(x) = \dfrac{P(x \mid C_1)}{P(x \mid C_2)} \;\underset{C_2}{\overset{C_1}{\gtrless}}\; 1$

Page 27: Discriminant Functions

- All the decision rules we have presented in this lecture have the same structure: at each point x in feature space, choose the class Ci that maximizes (or minimizes) some measure gi(x).
- This structure can be formalized with a set of discriminant functions gi(x), i = 1..C, and the decision rule "assign x to class Ci if gi(x) > gj(x) for all j ≠ i".
- Therefore, we can visualize the decision rule as a network or machine that computes C discriminant functions and selects the category corresponding to the largest discriminant; a code sketch of this structure follows below.

Criterion: Discriminant function
MAP: gi(x) = P(Ci | x)
ML:  gi(x) = P(x | Ci)

[Figure: features x1, x2, ..., xd feed discriminant functions g1(x), g2(x), ..., gC(x); the class assignment selects the maximum gi(x).]

From Gutierrez-Osuna.
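The following sketch (not from the slides) illustrates the argmax-over-discriminants structure; the two example discriminants are hypothetical stand-ins for whatever gi(x) the chosen criterion defines:

```python
import numpy as np

def classify(x, discriminants):
    """Assign x to the class whose discriminant g_i(x) is largest."""
    scores = [g(x) for g in discriminants]
    return int(np.argmax(scores))

# Hypothetical discriminants: unnormalized Gaussian log-likelihoods around two means.
g1 = lambda x: -0.5 * float(np.sum((x - np.array([0.0, 0.0])) ** 2))
g2 = lambda x: -0.5 * float(np.sum((x - np.array([3.0, 3.0])) ** 2))

print(classify(np.array([0.5, 0.2]), [g1, g2]))   # 0, i.e. the first class
```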

Page 28: Quadratic Classifiers

Bayes classifiers for normally distributed classes:
- Case 1: Σi = σ²I
- Case 2: Σi = Σ (Σ diagonal)
- Case 3: Σi = Σ (Σ non-diagonal)
- Case 4: Σi = σi²I
- Case 5: Σi ≠ Σj (general case)

From Gutierrez-Osuna. From Duda, Hart and Stork.

Page 29: Quadratic Classifiers

- Bayes classifiers for normally distributed classes.
- As we will show, for classes that are normally distributed, this family of discriminant functions can be reduced to very simple expressions.

$\text{choose } C_i \text{ if } g_i(x) > g_j(x)\ \forall j \neq i, \quad \text{where } g_i(x) = P(C_i \mid x) \text{ (MAP)}$

[Figure: the discriminant-function network from Page 27, repeated.]

From Gutierrez-Osuna.

Page 30: Quadratic Classifiers

Bayes classifiers for normally distributed classes:

$\text{choose } C_i \text{ if } g_i(x) > g_j(x)\ \forall j \neq i, \quad \text{where } g_i(x) = P(C_i \mid x) \text{ (MAP)}$

Gaussian distribution:

$P(x) = \dfrac{1}{(2\pi)^{D/2}\, |\Sigma|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)$

where $\mu$ is the D-dimensional mean vector, $\Sigma$ is the D×D covariance matrix, and $|\Sigma|$ is the determinant of the covariance matrix.

Bayes' rule:

$g_i(x) = P(C_i \mid x) = \dfrac{P(x \mid C_i)\, P(C_i)}{P(x)}
= \dfrac{1}{(2\pi)^{D/2}\, |\Sigma_i|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i)\right) P(C_i)\, \dfrac{1}{P(x)}$

Page 31: Quadratic Classifiers

Bayes' rule for the Gaussian distribution (after eliminating constants that are the same for every class):

$g_i(x) = |\Sigma_i|^{-1/2} \exp\!\left(-\tfrac{1}{2}(x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i)\right) P(C_i)$

Taking natural logs (the logarithm is a monotonically increasing function, so the ordering of the discriminants is unchanged):

$g_i(x) = -\tfrac{1}{2}(x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i) - \tfrac{1}{2}\log|\Sigma_i| + \log P(C_i)$

This is called the quadratic discriminant function.
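As a sketch, the quadratic discriminant above can be evaluated directly from a class mean, covariance, and prior; the values below are made up for illustration:

```python
import numpy as np

def quadratic_discriminant(x, mu, cov, prior):
    """g_i(x) = -1/2 (x-mu)^T Sigma_i^-1 (x-mu) - 1/2 log|Sigma_i| + log P(C_i)."""
    diff = x - mu
    maha = float(diff @ np.linalg.inv(cov) @ diff)
    return -0.5 * maha - 0.5 * np.log(np.linalg.det(cov)) + np.log(prior)

# Hypothetical two-class problem: pick the class with the larger discriminant.
x = np.array([1.0, 2.0])
g1 = quadratic_discriminant(x, np.array([0.0, 0.0]), np.eye(2), 0.5)
g2 = quadratic_discriminant(x, np.array([4.0, 4.0]), 2.0 * np.eye(2), 0.5)
print("C1" if g1 > g2 else "C2")
```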

Page 32: Case 1: Σi = σ²I

This situation occurs when the features are statistically independent with the same variance for all classes. In this case the quadratic discriminant function becomes:

$g_i(x) = -\tfrac{1}{2}(x-\mu_i)^T (\sigma^2 I)^{-1} (x-\mu_i) - \tfrac{1}{2}\log|\sigma^2 I| + \log P(C_i)$

$g_i(x) = -\dfrac{1}{2\sigma^2}(x-\mu_i)^T (x-\mu_i) - \tfrac{1}{2} D \log\sigma^2 + \log P(C_i)$

Dropping the second term, which is the same for all classes:

$g_i(x) = -\dfrac{1}{2\sigma^2}(x-\mu_i)^T (x-\mu_i) + \log P(C_i)$

Page 33: Case 1: Σi = σ²I

Expanding the quadratic term:

$g_i(x) = -\dfrac{1}{2\sigma^2}\left(x^T x - x^T\mu_i - \mu_i^T x + \mu_i^T\mu_i\right) + \log P(C_i)
= -\dfrac{1}{2\sigma^2}\left(x^T x - 2\mu_i^T x + \mu_i^T\mu_i\right) + \log P(C_i)$

Ignoring $x^T x$, as it is constant for all classes:

$g_i(x) = -\dfrac{1}{2\sigma^2}\left(-2\mu_i^T x + \mu_i^T\mu_i\right) + \log P(C_i)$

Page 34: Case 1: Σi = σ²I

From the expansion on the previous page, the discriminant can be written in linear form:

$g_i(x) = w_i^T x + w_{i0}, \quad \text{where} \quad
w_i = \dfrac{\mu_i}{\sigma^2}, \qquad
w_{i0} = -\dfrac{1}{2\sigma^2}\mu_i^T\mu_i + \log P(C_i)$

Page 35: Case 1: Σi = σ²I

Discriminant function form:

$g_i(x) = w_i^T x + w_{i0}, \quad \text{where} \quad
w_i = \dfrac{\mu_i}{\sigma^2}, \qquad
w_{i0} = -\dfrac{1}{2\sigma^2}\mu_i^T\mu_i + \log P(C_i)$

- Since the discriminant is linear, the decision boundaries gi(x) = gj(x) will be hyperplanes.
- If we assume equal priors, the log P(Ci) term can be dropped:

$g_i(x) = -\dfrac{1}{2\sigma^2}(x-\mu_i)^T (x-\mu_i)$

- This is the nearest mean classifier; a small code sketch follows below.
- If the variance is one (σ² = 1), the distance becomes the Euclidean distance.
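A small sketch of the resulting nearest-mean classifier; equal priors and σ² = 1 are assumed, and the class means are made up:

```python
import numpy as np

def nearest_mean(x, means):
    """Case 1 with equal priors: assign x to the class with the closest mean."""
    dists = [float(np.sum((x - mu) ** 2)) for mu in means]
    return int(np.argmin(dists))

means = [np.array([0.0, 0.0]), np.array([5.0, 5.0])]   # hypothetical class means
print(nearest_mean(np.array([1.0, 1.5]), means))        # 0, i.e. the first class
```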

Page 36: Case 1: Σi = σ²I (illustration).

From Gutierrez-Osuna.

Page 37: Case 2: Σi = Σ (Σ diagonal)

The classes still have the same covariance matrix, but the features are allowed to have different variances. In this case the quadratic discriminant function becomes

$g_i(x) = -\tfrac{1}{2}\sum_{k=1}^{D} \dfrac{(x_k - \mu_{i,k})^2}{\sigma_k^2} + \log P(C_i)$

Expanding each square and eliminating the $x_k^2$ terms, which are constant for all classes, leaves a linear discriminant.

- This discriminant is linear, so the decision boundaries gi(x) = gj(x) will also be hyperplanes.
- The loci of constant probability are hyper-ellipses aligned with the feature axes.
- Note that the only difference from the previous classifier is that the distance along each axis is normalized by the variance of that axis.

Page 38: Case 2: Σi = Σ (Σ diagonal) (illustration).

From Gutierrez-Osuna.

Page 39: Case 3: Σi = Σ (Σ non-diagonal)

In this case all the classes have the same covariance matrix, but it is no longer diagonal. The quadratic discriminant becomes

$g_i(x) = -\tfrac{1}{2}(x-\mu_i)^T \Sigma^{-1} (x-\mu_i) - \tfrac{1}{2}\log|\Sigma| + \log P(C_i)$

Eliminating the constant log|Σ| term:

$g_i(x) = -\tfrac{1}{2}(x-\mu_i)^T \Sigma^{-1} (x-\mu_i) + \log P(C_i)$

- The quadratic term is called the Mahalanobis distance.
- The Mahalanobis distance is a vector distance that uses a Σ⁻¹ norm:

$\|x - y\|^2_{\Sigma^{-1}} = (x-y)^T \Sigma^{-1} (x-y)$

- Σ⁻¹ can be thought of as a stretching factor on the space.
- Note that for an identity covariance matrix (Σ = I), the Mahalanobis distance becomes the familiar Euclidean distance.
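A brief sketch of the Mahalanobis distance as a Σ⁻¹-weighted norm; the covariance matrix below is illustrative, and the identity-covariance call shows the reduction to the Euclidean distance:

```python
import numpy as np

def mahalanobis_sq(x, y, cov):
    """Squared Mahalanobis distance (x - y)^T Sigma^-1 (x - y)."""
    diff = x - y
    return float(diff @ np.linalg.inv(cov) @ diff)

cov = np.array([[2.0, 0.5], [0.5, 1.0]])            # hypothetical shared covariance
x, mu = np.array([1.0, 0.0]), np.array([0.0, 0.0])
print(mahalanobis_sq(x, mu, cov))                   # Sigma^-1-weighted squared distance
print(mahalanobis_sq(x, mu, np.eye(2)))             # 1.0: Euclidean distance for Sigma = I
```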

Page 40: Case 3: Σi = Σ (Σ non-diagonal)

Expansion of the quadratic term in the discriminant yields

$g_i(x) = -\tfrac{1}{2}\left(x^T\Sigma^{-1}x - 2\mu_i^T\Sigma^{-1}x + \mu_i^T\Sigma^{-1}\mu_i\right) + \log P(C_i)$

Removing the term $x^T\Sigma^{-1}x$, which is constant for all classes, and reorganizing terms, we obtain

$g_i(x) = \mu_i^T\Sigma^{-1}x - \tfrac{1}{2}\mu_i^T\Sigma^{-1}\mu_i + \log P(C_i)$

$g_i(x) = w_i^T x + w_{i0}, \quad \text{where} \quad
w_i = \Sigma^{-1}\mu_i, \qquad
w_{i0} = -\tfrac{1}{2}\mu_i^T\Sigma^{-1}\mu_i + \log P(C_i)$

- This discriminant is linear, so the decision boundaries will also be hyperplanes.
- The constant probability loci are hyper-ellipses aligned with the eigenvectors of Σ.
- If we can assume equal priors, the classifier becomes a minimum (Mahalanobis) distance classifier:

$g_i(x) = -\tfrac{1}{2}(x-\mu_i)^T\Sigma^{-1}(x-\mu_i)$

Page 41: Case 3: Σi = Σ (Σ non-diagonal)

Summary of the previous page:

$g_i(x) = \mu_i^T\Sigma^{-1}x - \tfrac{1}{2}\mu_i^T\Sigma^{-1}\mu_i + \log P(C_i)
= w_i^T x + w_{i0}, \quad
w_i = \Sigma^{-1}\mu_i, \quad
w_{i0} = -\tfrac{1}{2}\mu_i^T\Sigma^{-1}\mu_i + \log P(C_i)$

- This discriminant is linear, so the decision boundaries will also be hyperplanes.
- The constant probability loci are hyper-ellipses aligned with the eigenvectors of Σ.
- With equal priors, the classifier becomes a minimum (Mahalanobis) distance classifier: $g_i(x) = -\tfrac{1}{2}(x-\mu_i)^T\Sigma^{-1}(x-\mu_i)$.

Page 42: Case 3: Σi = Σ (Σ non-diagonal) (illustration).

From Gutierrez-Osuna.

Page 43: Case 4: Σi = σi²I

In this case each class has a different covariance matrix, which is proportional to the identity matrix. The quadratic discriminant becomes

$g_i(x) = -\tfrac{1}{2}(x-\mu_i)^T\Sigma_i^{-1}(x-\mu_i) - \tfrac{1}{2}\log|\Sigma_i| + \log P(C_i)$

$g_i(x) = -\dfrac{1}{2\sigma_i^2}(x-\mu_i)^T(x-\mu_i) - \tfrac{1}{2} D \log\sigma_i^2 + \log P(C_i)$

This expression cannot be reduced further, so:
- The decision boundaries are quadratic (hyper-ellipses).
- The loci of constant probability are hyper-spheres centered on the class means.

Page 44: Case 4: Σi = σi²I (illustration).

From Gutierrez-Osuna.

Page 45: Case 5: Σi ≠ Σj (general case)

We already derived the expression for the general case at the beginning of this discussion:

$g_i(x) = -\tfrac{1}{2}(x-\mu_i)^T\Sigma_i^{-1}(x-\mu_i) - \tfrac{1}{2}\log|\Sigma_i| + \log P(C_i)$

Reorganizing terms in a quadratic form yields

$g_i(x) = x^T W_i x + w_i^T x + w_{i0}, \quad \text{where} \quad
W_i = -\tfrac{1}{2}\Sigma_i^{-1}, \qquad
w_i = \Sigma_i^{-1}\mu_i, \qquad
w_{i0} = -\tfrac{1}{2}\mu_i^T\Sigma_i^{-1}\mu_i - \tfrac{1}{2}\log|\Sigma_i| + \log P(C_i)$

- The loci of constant probability for each class are hyper-ellipses, oriented with the eigenvectors of Σi for that class.
- The decision boundaries are again quadratic: hyper-ellipses or hyper-paraboloids.
- Notice that the quadratic expression in the discriminant is proportional to the Mahalanobis distance using the class-conditional covariance Σi.
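As a sketch, the general-case discriminant can be precomputed in its (Wi, wi, wi0) form and then evaluated per point; all class parameters below are hypothetical:

```python
import numpy as np

def quadratic_params(mu, cov, prior):
    """Case 5: parameters of g_i(x) = x^T W_i x + w_i^T x + w_i0."""
    cov_inv = np.linalg.inv(cov)
    W = -0.5 * cov_inv
    w = cov_inv @ mu
    w0 = -0.5 * (mu @ cov_inv @ mu) - 0.5 * np.log(np.linalg.det(cov)) + np.log(prior)
    return W, w, float(w0)

def g(x, params):
    W, w, w0 = params
    return float(x @ W @ x + w @ x + w0)

# Hypothetical two-class problem with different class covariances.
p1 = quadratic_params(np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 2.0]]), 0.5)
p2 = quadratic_params(np.array([3.0, 3.0]), np.array([[2.0, 0.3], [0.3, 1.0]]), 0.5)
x = np.array([1.0, 1.0])
print("C1" if g(x, p1) > g(x, p2) else "C2")
```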

Page 46: Case 5: Σi ≠ Σj (illustration).

From Gutierrez-Osuna.

Page 47: Naïve Bayes Classifier: An Example

Day  Outlook   Temperature  Humidity  Wind    Play
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No

From Machine Learning, Mitchell.

Page 48: Naïve Bayes Classifier

New instance: Outlook = Sunny, Temperature = Cool, Humidity = High, Wind = Strong.

Predict the target value: Play = yes or Play = no.

Page 49: Naïve Bayes Classifier

New instance: Outlook = Sunny, Temperature = Cool, Humidity = High, Wind = Strong. Predict the target value: Play = yes or Play = no.

Using Bayes' rule we can write:

$P(\text{Play}=\text{yes} \mid \text{sunny},\text{cool},\text{high},\text{strong}) =
\dfrac{P(\text{sunny},\text{cool},\text{high},\text{strong} \mid \text{Play}=\text{yes})\, P(\text{Play}=\text{yes})}
{\sum_{\text{play}_i \in \{\text{yes},\text{no}\}} P(\text{sunny},\text{cool},\text{high},\text{strong} \mid \text{Play}=\text{play}_i)\, P(\text{Play}=\text{play}_i)}$

$P(\text{Play}=\text{no} \mid \text{sunny},\text{cool},\text{high},\text{strong}) =
\dfrac{P(\text{sunny},\text{cool},\text{high},\text{strong} \mid \text{Play}=\text{no})\, P(\text{Play}=\text{no})}
{\sum_{\text{play}_i \in \{\text{yes},\text{no}\}} P(\text{sunny},\text{cool},\text{high},\text{strong} \mid \text{Play}=\text{play}_i)\, P(\text{Play}=\text{play}_i)}$

Page 50: Naïve Bayes Classifier

New instance: Outlook = Sunny, Temperature = Cool, Humidity = High, Wind = Strong. Predict the target value: Play = yes or Play = no.

More generally, we can write the most probable target value as:

$v_{MAP} = \underset{v_j \in V}{\operatorname{argmax}}\; \dfrac{P(a_1, a_2, \ldots, a_n \mid v_j)\, P(v_j)}{P(a_1, a_2, \ldots, a_n)}
= \underset{v_j \in V}{\operatorname{argmax}}\; P(a_1, a_2, \ldots, a_n \mid v_j)\, P(v_j)$

Page 51: Naïve Bayes Classifier

New instance: Outlook = Sunny, Temperature = Cool, Humidity = High, Wind = Strong. Predict the target value: Play = yes or Play = no.

More generally, we can write the most probable target value as:

$v_{MAP} = \underset{v_j \in V}{\operatorname{argmax}}\; P(a_1, a_2, \ldots, a_n \mid v_j)\, P(v_j)$

The naïve Bayes classifier is based on the simplifying assumption that the attributes/features are conditionally independent given the target value:

$P(a_1, a_2, \ldots, a_n \mid v_j) = P(a_1 \mid v_j)\, P(a_2 \mid v_j) \cdots P(a_n \mid v_j) = \prod_i P(a_i \mid v_j)$

Page 52: Naïve Bayes Classifier

New instance: Outlook = Sunny, Temperature = Cool, Humidity = High, Wind = Strong. Predict the target value: Play = yes or Play = no.

Combining the MAP rule with the conditional independence assumption gives the naïve Bayes decision rule:

$v_{MAP} = \underset{v_j \in V}{\operatorname{argmax}}\; P(v_j) \prod_i P(a_i \mid v_j)$

Page 53: Naïve Bayes Classifier: An Example

(Training data: the PlayTennis table from Page 47. From Machine Learning, Mitchell.)

New instance: Outlook = Sunny, Temperature = Cool, Humidity = High, Wind = Strong.

Page 54: Naïve Bayes Classifier

New instance: Outlook = Sunny, Temperature = Cool, Humidity = High, Wind = Strong.

From the training data:
P(Play = yes) = 9/14
P(Play = no) = 5/14
P(Wind = strong | Play = yes) = 3/9
P(Wind = strong | Play = no) = 3/5
(and similarly for the other attributes)

P(yes) P(sunny|yes) P(cool|yes) P(high|yes) P(strong|yes) = ?
P(no) P(sunny|no) P(cool|no) P(high|no) P(strong|no) = ?

Page 55: Naïve Bayes Classifier

New instance: Outlook = Sunny, Temperature = Cool, Humidity = High, Wind = Strong.

P(Play = yes) = 9/14
P(Play = no) = 5/14
P(Wind = strong | Play = yes) = 3/9
P(Wind = strong | Play = no) = 3/5

P(yes) P(sunny|yes) P(cool|yes) P(high|yes) P(strong|yes) = 0.0053
P(no) P(sunny|no) P(cool|no) P(high|no) P(strong|no) = 0.0206

Naïve Bayes classifier prediction: PlayTennis = no.
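The two scores above can be recomputed directly from the table on Page 47; the following sketch estimates every needed probability by counting and reproduces the prediction:

```python
from collections import Counter

# PlayTennis training data (Outlook, Temperature, Humidity, Wind, Play) from Page 47.
data = [
    ("Sunny", "Hot", "High", "Weak", "No"),     ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"), ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),  ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"), ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"), ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"), ("Rain", "Mild", "High", "Strong", "No"),
]
query = ("Sunny", "Cool", "High", "Strong")

class_counts = Counter(row[-1] for row in data)
scores = {}
for label, n_label in class_counts.items():
    score = n_label / len(data)                      # prior P(Play = label)
    for i, value in enumerate(query):                # times the product of P(a_i | label)
        n_match = sum(1 for row in data if row[-1] == label and row[i] == value)
        score *= n_match / n_label
    scores[label] = score

print(scores)                        # {'No': 0.0206, 'Yes': 0.0053} (approximately)
print(max(scores, key=scores.get))   # 'No'
```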

Page 56: Non-parametric Density Estimation

From Gutierrez-Osuna.

Page 57: Nearest Neighbor Classifier

From Gutierrez-Osuna.

Page 58: Nearest Neighbor Classifier

From Gutierrez-Osuna.

Page 59: Nearest Neighbor Classifier

From Gutierrez-Osuna.