
CSE 291 - Pattern Recognition

Henrik I Christensen, UC San Diego

CSE 291

• Today
  – Lecture 2: Bayesian Decision Theory
  – Paper list

• Bidding on papers
  – Waitlisted students
  – Anyone that did not get a Piazza invitation?
  – AOB?

Bayesian Decision Theory

Chapter 2 (Duda, Hart & Stork)

CSE 291 - Pattern Recognition

Henrik I Christensen, UC San Diego

Bayesian Decision Theory

• Design classifiers to recommend decisions that minimize some total expected "risk".
  – The simplest risk is the classification error (i.e., all costs are equal).
  – Typically, the risk includes the cost associated with different decisions.

Terminology

• State of nature ω (random variable):
  – e.g., ω1 for sea bass, ω2 for salmon

• Probabilities P(ω1) and P(ω2) (priors):
  – e.g., prior knowledge of how likely we are to get a sea bass or a salmon

• Probability density function p(x) (evidence):
  – e.g., how frequently we will measure a pattern with feature value x (e.g., x corresponds to lightness)

Terminology (cont'd)

• Conditional probability density p(x|ωj) (likelihood):
  – e.g., how frequently we will measure a pattern with feature value x given that the pattern belongs to class ωj

e.g., lightness distributions of the salmon/sea-bass populations

Terminology (cont'd)

• Conditional probability P(ωj|x) (posterior):
  – e.g., the probability that the fish belongs to class ωj given measurement x.

Decision Rule Using Prior Probabilities

Decide ω1 if P(ω1) > P(ω2); otherwise decide ω2

P(error) = P(ω1) if we decide ω2; P(ω2) if we decide ω1

or P(error) = min[P(ω1), P(ω2)]

• Favours the most likely class.
• This rule makes the same decision every time.
  – i.e., it is optimum if no other information is available

Decision Rule Using Conditional Probabilities

• Using Bayes' rule, the posterior probability of category ωj given measurement x is given by:

P(ωj|x) = p(x|ωj) P(ωj) / p(x)   (likelihood × prior / evidence)

where p(x) = Σ_{j=1..2} p(x|ωj) P(ωj)   (i.e., a scale factor so that the posteriors sum to 1)

Decide ω1 if P(ω1|x) > P(ω2|x); otherwise decide ω2

or

Decide ω1 if p(x|ω1) P(ω1) > p(x|ω2) P(ω2); otherwise decide ω2
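A minimal numerical sketch of this rule in Python; the Gaussian class-conditional densities, priors, and test values below are illustrative assumptions, not data from the textbook:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical class-conditional densities and priors for a "lightness" feature;
# every number here is an illustrative assumption.
priors = np.array([2/3, 1/3])                     # P(w1), P(w2)
likelihoods = [norm(3.0, 1.0),                    # p(x|w1), e.g. sea bass
               norm(6.0, 1.5)]                    # p(x|w2), e.g. salmon

def posterior(x):
    """Bayes rule: P(wj|x) = p(x|wj) P(wj) / p(x)."""
    joint = np.array([lik.pdf(x) * pr for lik, pr in zip(likelihoods, priors)])
    return joint / joint.sum()                    # divide by the evidence p(x)

def decide(x):
    """Decide w1 if P(w1|x) > P(w2|x); otherwise decide w2."""
    post = posterior(x)
    return 1 if post[0] > post[1] else 2

for x in (2.0, 4.5, 7.0):
    print("x = %.1f  posteriors = %s  -> decide w%d" % (x, posterior(x).round(3), decide(x)))
```

Dividing by the evidence p(x) does not change which class wins, but it turns the scores into posteriors that sum to 1.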

Decision Rule Using Conditional pdfs (cont'd)

(figures) Class-conditional densities p(x|ωj) and posteriors P(ωj|x) for priors P(ω1) = 2/3, P(ω2) = 1/3.

Probability of Error

• The probability of error is defined as:

P(error|x) = P(ω1|x) if we decide ω2; P(ω2|x) if we decide ω1

or

P(error|x) = min[P(ω1|x), P(ω2|x)]

• What is the average probability of error?

P(error) = ∫ P(error, x) dx = ∫ P(error|x) p(x) dx

• The Bayes rule is optimum, that is, it minimizes the average probability of error!
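To make the optimality claim concrete, the sketch below numerically integrates P(error|x) p(x) for the Bayes rule and for an arbitrary fixed-threshold rule, reusing the illustrative Gaussian densities assumed above (all numbers are assumptions):

```python
import numpy as np
from scipy.stats import norm

priors = np.array([2/3, 1/3])                     # illustrative P(w1), P(w2)
liks = [norm(3.0, 1.0), norm(6.0, 1.5)]           # illustrative p(x|w1), p(x|w2)

x = np.linspace(-5.0, 15.0, 20001)
dx = x[1] - x[0]
joint = np.vstack([lik.pdf(x) * pr for lik, pr in zip(liks, priors)])  # p(x|wj) P(wj)

# Bayes rule: P(error|x) = min_j P(wj|x), so P(error) = integral of min_j p(x|wj)P(wj) dx
bayes_error = joint.min(axis=0).sum() * dx

def threshold_rule_error(t):
    """Error of the rule: decide w1 if x < t, else w2 (not necessarily optimal)."""
    picked_wrong = np.where(x < t, joint[1], joint[0])   # mass of the class we did NOT pick
    return picked_wrong.sum() * dx

print("Bayes error:            %.4f" % bayes_error)
print("Fixed threshold t=5.0:  %.4f" % threshold_rule_error(5.0))  # never below the Bayes error
```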

Where do Probabilities Come From?

• There are two competing answers to this question:

(1) Relative frequency (objective) approach.
  – Probabilities can only come from experiments.

(2) Bayesian (subjective) approach.
  – Probabilities may reflect degrees of belief and can be based on opinion.

Example (objective approach)

• Classify cars as costing more or less than $50K:
  – Classes: C1 if price > $50K, C2 if price <= $50K
  – Feature: x, the height of a car

• Use Bayes' rule to compute the posterior probabilities:

P(Ci|x) = p(x|Ci) P(Ci) / p(x)

• We need to estimate p(x|C1), p(x|C2), P(C1), P(C2)

Example (cont'd)

• Collect data
  – Ask drivers how much their car cost and measure its height.

• Determine the prior probabilities P(C1), P(C2)
  – e.g., 1209 samples: #C1 = 221, #C2 = 988

P(C1) = 221/1209 = 0.183
P(C2) = 988/1209 = 0.817

Example (cont'd)

• Determine the class-conditional probabilities (likelihood) p(x|Ci)
  – Discretize the car height into bins and use a normalized histogram

Example (cont'd)

• Calculate the posterior probability P(Ci|x) for each bin, e.g. for the x = 1.0 bin:

P(C1|x=1.0) = p(x=1.0|C1) P(C1) / [p(x=1.0|C1) P(C1) + p(x=1.0|C2) P(C2)]
            = (0.2081 × 0.183) / (0.2081 × 0.183 + 0.0597 × 0.817) = 0.438
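A short sketch reproducing this bin's posterior from the counts and normalized-histogram values quoted on the slides:

```python
# Posterior for the x = 1.0 m height bin, using the slide's counts and
# normalized-histogram values.
p_C = {1: 221/1209, 2: 988/1209}          # priors P(C1) ~ 0.183, P(C2) ~ 0.817
p_x_given_C = {1: 0.2081, 2: 0.0597}      # histogram bin values p(x=1.0|Ci)

evidence = sum(p_x_given_C[i] * p_C[i] for i in (1, 2))
posterior_C1 = p_x_given_C[1] * p_C[1] / evidence
print("P(C1|x=1.0) = %.3f" % posterior_C1)   # ~ 0.438, matching the slide
```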

A More General Theory

• Use more than one feature.
• Allow more than two categories.
• Allow actions other than classifying the input into one of the possible categories (e.g., rejection).

• Employ a more general error function (i.e., a "risk" function) by associating a "cost" ("loss" function) with each error (i.e., wrong action).

Terminology

• Features form a vector x ∈ R^d
• A finite set of c categories ω1, ω2, …, ωc

• Bayes rule (i.e., using vector notation):

P(ωj|x) = p(x|ωj) P(ωj) / p(x),  where p(x) = Σ_{j=1..c} p(x|ωj) P(ωj)

• A finite set of l actions α1, α2, …, αl

• A loss function λ(αi|ωj)
  – the cost associated with taking action αi when the correct classification category is ωj

Conditional Risk (or Expected Loss)

• Suppose we observe x and take action αi

• Suppose that the cost associated with taking action αi when ωj is the correct category is λ(αi|ωj)

• The conditional risk (or expected loss) of taking action αi is:

R(αi|x) = Σ_{j=1..c} λ(αi|ωj) P(ωj|x)

Overall Risk

• Suppose α(x) is a general decision rule that determines which of the actions α1, α2, …, αl to take for every x; then the overall risk is defined as:

R = ∫ R(α(x)|x) p(x) dx

• The optimum decision rule is the Bayes rule

Overall Risk (cont'd)

• The Bayes decision rule minimizes R by:
  (i) Computing R(αi|x) for every αi given an x
  (ii) Choosing the action αi with the minimum R(αi|x)

• The resulting minimum overall risk is called the Bayes risk and is the best (i.e., optimum) performance that can be achieved:

R* = min R
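A compact sketch of steps (i) and (ii), assuming a made-up loss matrix that also includes a hypothetical "reject" action (the losses and posteriors are illustrative only):

```python
import numpy as np

# Bayes (minimum-risk) rule: compute R(a_i|x) for every action and pick the
# action with the smallest conditional risk.  All numbers are illustrative.
# lam[i, j] = loss for taking action a_i when the true class is w_j.
lam = np.array([[0.0, 2.0],     # a1: decide w1
                [1.0, 0.0],     # a2: decide w2
                [0.3, 0.3]])    # a3: reject (a hypothetical third action)

def bayes_action(posteriors):
    """posteriors = [P(w1|x), P(w2|x)]; returns (best action index, conditional risks)."""
    cond_risk = lam @ np.asarray(posteriors)   # R(a_i|x) = sum_j lam_ij P(w_j|x)
    return int(np.argmin(cond_risk)), cond_risk

for post in ([0.9, 0.1], [0.55, 0.45], [0.2, 0.8]):
    a, r = bayes_action(post)
    print(post, "risks:", r.round(3), "-> action a%d" % (a + 1))
```

With these illustrative losses, the cheap "reject" action wins whenever the posteriors are close, which is exactly the kind of non-classification action the theory allows.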

Example: Two-category classification

• Define
  – α1: decide ω1
  – α2: decide ω2
  – λij = λ(αi|ωj)

• The conditional risks are (c = 2):

R(α1|x) = λ11 P(ω1|x) + λ12 P(ω2|x)
R(α2|x) = λ21 P(ω1|x) + λ22 P(ω2|x)

Example: Two-category classification (cont'd)

• Minimum risk decision rule:

Decide ω1 if R(α1|x) < R(α2|x), i.e., if (λ21 − λ11) P(ω1|x) > (λ12 − λ22) P(ω2|x)

or (i.e., using the likelihood ratio)

Decide ω1 if p(x|ω1)/p(x|ω2) > [(λ12 − λ22) P(ω2)] / [(λ21 − λ11) P(ω1)]   (likelihood ratio > threshold)

Special Case: Zero-One Loss Function

• Assign the same loss to all errors:

λ(αi|ωj) = 0 if i = j, and 1 if i ≠ j

• The conditional risk corresponding to this loss function:

R(αi|x) = Σ_{j≠i} P(ωj|x) = 1 − P(ωi|x)

Special Case: Zero-One Loss Function (cont'd)

• The decision rule becomes:

Decide ω1 if P(ω1|x) > P(ω2|x); otherwise decide ω2

or Decide ω1 if p(x|ω1) P(ω1) > p(x|ω2) P(ω2)

or Decide ω1 if p(x|ω1)/p(x|ω2) > P(ω2)/P(ω1)

• In this case, the overall risk is the average probability of error!

Example

θa = P(ω2)/P(ω1)
θb = [P(ω2)(λ12 − λ22)] / [P(ω1)(λ21 − λ11)]   (decision-region thresholds)

Assuming zero-one loss:
Decide ω1 if p(x|ω1)/p(x|ω2) > θa = P(ω2)/P(ω1); otherwise decide ω2

Assuming a general loss (assume: λ12 > λ21):
Decide ω1 if p(x|ω1)/p(x|ω2) > θb; otherwise decide ω2
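The following sketch applies both thresholds to the same likelihood ratio; the densities, priors, and losses are illustrative assumptions chosen so that λ12 > λ21, which makes θb larger than θa:

```python
import numpy as np
from scipy.stats import norm

# Likelihood-ratio form of the decision rule with the zero-one-loss threshold
# theta_a and the general-loss threshold theta_b.  All numbers are assumptions.
P1, P2 = 2/3, 1/3
p1, p2 = norm(3.0, 1.0), norm(6.0, 1.5)            # hypothetical p(x|w1), p(x|w2)
lam11, lam12, lam21, lam22 = 0.0, 3.0, 1.0, 0.0    # assume lam12 > lam21

theta_a = P2 / P1                                            # zero-one loss
theta_b = (P2 * (lam12 - lam22)) / (P1 * (lam21 - lam11))    # general loss (> theta_a here)

def decide(x, theta):
    ratio = p1.pdf(x) / p2.pdf(x)                  # likelihood ratio p(x|w1)/p(x|w2)
    return 1 if ratio > theta else 2

for x in (4.0, 4.6, 5.2):
    print("x = %.1f  zero-one -> w%d   general loss -> w%d"
          % (x, decide(x, theta_a), decide(x, theta_b)))
```

With these numbers the two rules agree at the extremes but disagree in between, where the higher threshold θb defers to ω2; that is the effect of penalizing ω2-as-ω1 mistakes more heavily.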

Discriminant Functions

• A useful way to represent classifiers is through discriminant functions gi(x), i = 1, ..., c, where a feature vector x is assigned to class ωi if:

gi(x) > gj(x) for all j ≠ i

Discriminants for Bayes Classifier

• Assuming a general loss function:

gi(x) = −R(αi|x)

• Assuming the zero-one loss function:

gi(x) = P(ωi|x)

Discriminants for Bayes Classifier (cont'd)

• Is the choice of gi unique?
  – Replacing gi(x) with f(gi(x)), where f(·) is monotonically increasing, does not change the classification results.

Equivalent discriminants:

gi(x) = P(ωi|x) = p(x|ωi) P(ωi) / p(x)

gi(x) = p(x|ωi) P(ωi)

gi(x) = ln p(x|ωi) + ln P(ωi)   (we'll use this form extensively!)

Case of two categories

• It is more common to use a single discriminant function (dichotomizer) instead of two:

g(x) ≡ g1(x) − g2(x); decide ω1 if g(x) > 0, otherwise decide ω2

• Examples:

g(x) = P(ω1|x) − P(ω2|x)

g(x) = ln [p(x|ω1)/p(x|ω2)] + ln [P(ω1)/P(ω2)]

Decision Regions and Boundaries

• Decision rules divide the feature space into decision regions R1, R2, …, Rc, separated by decision boundaries.

The decision boundary between two adjacent regions is defined by:

g1(x) = g2(x)

Discriminant Function for the Multivariate Gaussian Density

• Consider the following discriminant function:

gi(x) = ln p(x|ωi) + ln P(ωi),  with p(x|ωi) ~ N(μi, Σi)

Substituting the Gaussian density gives:

gi(x) = −(1/2)(x − μi)ᵀ Σi⁻¹ (x − μi) − (d/2) ln 2π − (1/2) ln|Σi| + ln P(ωi)
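A minimal sketch of this discriminant using scipy's multivariate normal log-density; the means, covariances, and priors are made-up illustration values:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Bayes discriminant g_i(x) = ln p(x|w_i) + ln P(w_i) for Gaussian class-conditionals.
# All parameters below are illustrative assumptions.
classes = [
    {"prior": 0.5, "rv": multivariate_normal(mean=[0, 0], cov=[[1.0, 0.2], [0.2, 1.0]])},
    {"prior": 0.5, "rv": multivariate_normal(mean=[3, 3], cov=[[2.0, 0.0], [0.0, 0.5]])},
]

def g(x):
    """Return the vector of discriminant values g_i(x)."""
    return np.array([c["rv"].logpdf(x) + np.log(c["prior"]) for c in classes])

x = np.array([1.5, 1.0])
scores = g(x)
print("g(x) =", scores.round(3), "-> class w%d" % (np.argmax(scores) + 1))
```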

Multivariate Gaussian Density: Case I

• Σi = σ²I (diagonal)
  – Features are statistically independent
  – Each feature has the same variance

gi(x) = −||x − μi||² / (2σ²) + ln P(ωi)
(the ln P(ωi) term favours the a-priori more likely category)

Multivariate Gaussian Density: Case I (cont'd)

Expanding gives a linear discriminant:

gi(x) = wiᵀx + wi0,  where wi = μi/σ² and wi0 = −μiᵀμi/(2σ²) + ln P(ωi)

The decision boundary gi(x) = gj(x) is the hyperplane wᵀ(x − x0) = 0, with

w = μi − μj,  x0 = (1/2)(μi + μj) − [σ²/||μi − μj||²] ln[P(ωi)/P(ωj)] (μi − μj)

Multivariate Gaussian Density: Case I (cont'd)

• Properties of the decision boundary:
  – It passes through x0.
  – It is orthogonal to the line linking the means.
  – What happens when P(ωi) = P(ωj)? (x0 is the midpoint of the means)
  – If P(ωi) ≠ P(ωj), then x0 shifts away from the most likely category.
  – If σ is very small, the position of the boundary is insensitive to P(ωi) and P(ωj).

Multivariate Gaussian Density: Case I (cont'd)

(figures) If P(ωi) ≠ P(ωj), then x0 shifts away from the most likely category.

Multivariate Gaussian Density: Case I (cont'd)

• Minimum distance classifier
  – When the P(ωi) are equal, then:

gi(x) = −||x − μi||²   (maximize gi, i.e., assign x to the nearest mean)
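A tiny nearest-mean sketch of this rule (the class means and the test point are illustrative assumptions):

```python
import numpy as np

# Minimum-distance (nearest-mean) classifier for Case I with equal priors:
# g_i(x) = -||x - mu_i||^2, so we simply pick the closest class mean.
means = np.array([[0.0, 0.0],
                  [3.0, 3.0],
                  [0.0, 4.0]])        # illustrative class means

def classify(x):
    g = -np.sum((means - x) ** 2, axis=1)   # g_i(x) = -||x - mu_i||^2
    return int(np.argmax(g)) + 1            # maximizing g_i <=> minimizing the distance

print(classify(np.array([2.5, 2.0])))        # -> class 2 (nearest mean)
```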

Multivariate Gaussian Density: Case II

• Σi = Σ

Multivariate Gaussian Density: Case II (cont'd)

The discriminant functions are again linear:

gi(x) = wiᵀx + wi0,  where wi = Σ⁻¹μi and wi0 = −(1/2)μiᵀΣ⁻¹μi + ln P(ωi)

The decision boundary is the hyperplane wᵀ(x − x0) = 0, with

w = Σ⁻¹(μi − μj),  x0 = (1/2)(μi + μj) − [ln(P(ωi)/P(ωj)) / ((μi − μj)ᵀΣ⁻¹(μi − μj))] (μi − μj)

Multivariate Gaussian Density: Case II (cont'd)

• Properties of the hyperplane (decision boundary):
  – It passes through x0.
  – It is not orthogonal to the line linking the means.
  – What happens when P(ωi) = P(ωj)?
  – If P(ωi) ≠ P(ωj), then x0 shifts away from the most likely category.

Multivariate Gaussian Density: Case II (cont'd)

(figures) If P(ωi) ≠ P(ωj), then x0 shifts away from the most likely category.

Multivariate Gaussian Density: Case II (cont'd)

• Mahalanobis distance classifier
  – When the P(ωi) are equal, then:

gi(x) = −(x − μi)ᵀ Σ⁻¹ (x − μi)   (maximize gi, i.e., assign x to the class with the smallest Mahalanobis distance)
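A corresponding sketch with a shared covariance matrix (all numbers are illustrative assumptions):

```python
import numpy as np

# Mahalanobis-distance classifier for Case II (shared covariance, equal priors):
# g_i(x) = -(x - mu_i)^T Sigma^{-1} (x - mu_i).
means = np.array([[0.0, 0.0], [3.0, 3.0]])   # illustrative class means
Sigma = np.array([[2.0, 0.9],
                  [0.9, 1.0]])               # illustrative shared covariance
Sigma_inv = np.linalg.inv(Sigma)

def classify(x):
    diffs = x - means                                        # one row per class
    d2 = np.einsum('ij,jk,ik->i', diffs, Sigma_inv, diffs)   # squared Mahalanobis distances
    return int(np.argmin(d2)) + 1                            # min distance <=> max g_i

print(classify(np.array([1.0, 1.8])))
```

For this test point the plain Euclidean nearest mean would be class 1, while the correlated Σ makes class 2 the Mahalanobis choice, which is exactly why Case II differs from Case I.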

Multivariate Gaussian Density: Case III

• Σi = arbitrary

The decision boundaries are hyperquadrics, e.g., hyperplanes, pairs of hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids, etc.

Example - Case III

P(ω1) = P(ω2)

(figure) decision boundary: the boundary does not pass through the midpoint of μ1, μ2

Multivariate Gaussian Density: Case III (cont'd)

non-linear decision boundaries

Multivariate Gaussian Density: Case III (cont'd)

• More examples

Error Bounds

• Exact error calculations can be difficult – it is easier to estimate error bounds!

P(error) = ∫ P(error|x) p(x) dx,  with P(error|x) = min[P(ω1|x), P(ω2|x)]

Error Bounds (cont'd)

• If the class-conditional distributions are Gaussian, then

P(error) ≤ P(ω1)^β P(ω2)^(1−β) e^(−k(β)),  0 ≤ β ≤ 1

where:

k(β) = [β(1−β)/2] (μ2 − μ1)ᵀ [βΣ1 + (1−β)Σ2]⁻¹ (μ2 − μ1) + (1/2) ln( |βΣ1 + (1−β)Σ2| / (|Σ1|^β |Σ2|^(1−β)) )

Error Bounds (cont'd)

• The Chernoff bound corresponds to the β that minimizes e^(−k(β))
  – This is a 1-D optimization problem, regardless of the dimensionality of the class-conditional densities.

(figure: the bound as a function of β – loose near β = 0 and β = 1, tight at the minimizing β)

Error Bounds (cont'd)

• Bhattacharyya bound
  – Approximate the error bound using β = 0.5
  – Easier to compute than the Chernoff bound, but looser.

• The Chernoff and Bhattacharyya bounds will not be good bounds if the distributions are not Gaussian.

Example

k(1/2) = 4.06

Bhattacharyya error bound: P(error) ≤ 0.0087
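A sketch of the Bhattacharyya computation (k(β) evaluated at β = 1/2); the means, covariances, and priors below are illustrative assumptions, not the parameters behind the slide's k(1/2) = 4.06:

```python
import numpy as np

# Bhattacharyya bound for two Gaussian classes:
# P(error) <= sqrt(P(w1) P(w2)) * exp(-k(1/2))
def k_half(mu1, Sigma1, mu2, Sigma2):
    Sigma = 0.5 * (Sigma1 + Sigma2)
    diff = mu2 - mu1
    term1 = 0.125 * diff @ np.linalg.inv(Sigma) @ diff
    term2 = 0.5 * np.log(np.linalg.det(Sigma) /
                         np.sqrt(np.linalg.det(Sigma1) * np.linalg.det(Sigma2)))
    return term1 + term2

# Illustrative class parameters and priors (assumptions).
mu1, Sigma1 = np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 1.0]])
mu2, Sigma2 = np.array([3.0, 3.0]), np.array([[2.0, 0.5], [0.5, 1.0]])
P1 = P2 = 0.5

k = k_half(mu1, Sigma1, mu2, Sigma2)
print("k(1/2) = %.3f,  P(error) <= %.4f" % (k, np.sqrt(P1 * P2) * np.exp(-k)))
```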

Receiver Operating Characteristic (ROC) Curve

• Every classifier employs some kind of threshold.

• Changing the threshold affects the performance of the system.

• ROC curves can help us evaluate system performance for different thresholds.

θa = P(ω2)/P(ω1)

θb = [P(ω2)(λ12 − λ22)] / [P(ω1)(λ21 − λ11)]

Example: Person Authentication

• Authenticate a person using biometrics (e.g., fingerprints).

• There are two possible distributions (i.e., classes):
  – Authentic (A) and Impostor (I)

(figure: the two score distributions, I and A)

Example: Person Authentication (cont'd)

• Possible decisions:
  – (1) correct acceptance (true positive): X belongs to A, and we decide A
  – (2) incorrect acceptance (false positive): X belongs to I, and we decide A
  – (3) correct rejection (true negative): X belongs to I, and we decide I
  – (4) incorrect rejection (false negative): X belongs to A, and we decide I


Error vs Threshold

ROC

False Negatives vs Positives
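A minimal sketch of how such an ROC curve is traced, assuming hypothetical Gaussian matching-score distributions for the authentic (A) and impostor (I) classes:

```python
import numpy as np

# ROC sketch: sweep the decision threshold on a scalar matching score and
# record false-positive vs. true-positive rates.  Score distributions are
# illustrative assumptions.
rng = np.random.default_rng(0)
scores_I = rng.normal(0.0, 1.0, 5000)       # impostor scores
scores_A = rng.normal(2.0, 1.0, 5000)       # authentic scores

thresholds = np.linspace(-4, 6, 201)
fpr = [(scores_I > t).mean() for t in thresholds]   # incorrect acceptances (false positives)
tpr = [(scores_A > t).mean() for t in thresholds]   # correct acceptances (true positives)

# Each threshold gives one (FPR, TPR) operating point; plotting tpr against fpr
# traces the ROC curve.  Print a few operating points instead of plotting:
for t in (0.0, 1.0, 2.0):
    i = int(np.argmin(np.abs(thresholds - t)))
    print("t = %.1f  FPR = %.3f  TPR = %.3f" % (t, fpr[i], tpr[i]))
```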

Summary

• The Bayesian case is the most well-modeled setting.
• Characterization of successes and errors.
• Risk/cost can be important, but is also hard to capture.