CSE 291
• Today
– Lecture 2: Bayesian Decision Theory
– Paper List
• Bidding on papers
– Waitlisted students
– Anyone that did not get a Piazza invitation?
– AOB?
Bayesian Decision Theory
Chapter 2 (Duda, Hart & Stork)
CSE 291 - Pattern Recognition
Henrik I Christensen, UC San Diego

Bayesian Decision Theory
• Design classifiers to recommend decisions that minimize some total expected "risk".
– The simplest risk is the classification error (i.e., costs are equal).
– Typically, the risk includes the cost associated with different decisions.
Terminology
• State of nature ω (random variable):
– e.g., ω1 for sea bass, ω2 for salmon
• Probabilities P(ω1) and P(ω2) (priors):
– e.g., prior knowledge of how likely it is to get a sea bass or a salmon
• Probability density function p(x) (evidence):
– e.g., how frequently we will measure a pattern with feature value x (e.g., x corresponds to lightness)

Terminology (cont'd)
• Conditional probability density p(x/ωj) (likelihood):
– e.g., how frequently we will measure a pattern with feature value x given that the pattern belongs to class ωj
– e.g., lightness distributions between salmon/sea-bass populations

Terminology (cont'd)
• Conditional probability P(ωj/x) (posterior):
– e.g., the probability that the fish belongs to class ωj given measurement x.
Decision Rule Using Prior Probabilities
Decide ω1 if P(ω1) > P(ω2); otherwise decide ω2
or P(error) = min[P(ω1), P(ω2)]
• Favours the most likely class.
• This rule makes the same decision at all times.
– i.e., optimum if no other information is available
P(error) = { P(ω1) if we decide ω2
           { P(ω2) if we decide ω1
Decision Rule Using Conditional Probabilities
• Using Bayes' rule, the posterior probability of category ωj given measurement x is given by:

P(ωj/x) = p(x/ωj) P(ωj) / p(x) = (likelihood × prior) / evidence

where p(x) = Σj p(x/ωj) P(ωj), j = 1, 2 (i.e., a scale factor so that the posteriors sum to 1)

Decide ω1 if P(ω1/x) > P(ω2/x); otherwise decide ω2
or
Decide ω1 if p(x/ω1)P(ω1) > p(x/ω2)P(ω2); otherwise decide ω2
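A minimal numeric sketch of the rule above; the Gaussian likelihoods, the priors, and the helper gauss_pdf are illustrative assumptions, not values from the lecture. It checks that the posterior rule and the likelihood × prior rule agree (they must, since p(x) is a shared positive scale factor):

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    """Univariate Gaussian density, used as a stand-in likelihood p(x/w_j)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Hypothetical priors and class-conditional densities (illustrative values only).
priors = np.array([0.6, 0.4])                      # P(w1), P(w2)
def likelihoods(x):
    return np.array([gauss_pdf(x, 2.0, 1.0),       # p(x/w1)
                     gauss_pdf(x, 5.0, 1.5)])      # p(x/w2)

x = 3.2
lx = likelihoods(x)
evidence = np.sum(lx * priors)                     # p(x) = sum_j p(x/w_j) P(w_j)
posteriors = lx * priors / evidence                # P(w_j/x), sums to 1

# The two equivalent decision rules pick the same class:
assert np.argmax(posteriors) == np.argmax(lx * priors)
print(posteriors)                                  # ~[0.69 0.31] -> decide w1
```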
Probability of Error
• The probability of error is defined as:

P(error/x) = P(ω1/x) if we decide ω2; P(ω2/x) if we decide ω1

or P(error/x) = min[P(ω1/x), P(ω2/x)]

• What is the average probability of error?

P(error) = ∫ P(error, x) dx = ∫ P(error/x) p(x) dx   (integrating over -∞ to ∞)

• The Bayes rule is optimum, that is, it minimizes the average probability of error!
Where do Probabilities Come From?
• There are two competing answers to this question:
(1) Relative frequency (objective) approach.
– Probabilities can only come from experiments.
(2) Bayesian (subjective) approach.
– Probabilities may reflect degrees of belief and can be based on opinion.
Example (objective approach)
• Classify cars as costing more or less than $50K:
– Classes: C1 if price > $50K, C2 if price <= $50K
– Features: x, the height of a car
• Use Bayes' rule to compute the posterior probabilities:

P(Ci/x) = p(x/Ci) P(Ci) / p(x)

• We need to estimate p(x/C1), p(x/C2), P(C1), P(C2)
Example (cont'd)
• Collect data
– Ask drivers how much their car was and measure height.
• Determine prior probabilities P(C1), P(C2)
– e.g., 1209 samples: #C1 = 221, #C2 = 988

P(C1) = 221/1209 = 0.183
P(C2) = 988/1209 = 0.817
Example (cont'd)
• Determine class-conditional probabilities (likelihood) p(x/Ci)
– Discretize car height into bins and use a normalized histogram
Example (cont'd)
• Calculate the posterior probability P(Ci/x) for each bin, e.g., for x = 1.0:

P(C1/x=1.0) = p(x=1.0/C1) P(C1) / [p(x=1.0/C1) P(C1) + p(x=1.0/C2) P(C2)]
            = 0.2081 × 0.183 / (0.2081 × 0.183 + 0.0597 × 0.817)
            = 0.438
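A sketch of the full pipeline on synthetic data. The 1209-sample survey itself is not available, so the heights below are simulated and the means, variances, and bin edges are made-up assumptions; only the sample counts and priors match the slide:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated survey: car heights (meters) per price class; distributions are invented.
h_c1 = rng.normal(1.15, 0.10, 221)   # C1: price > $50K
h_c2 = rng.normal(1.45, 0.15, 988)   # C2: price <= $50K

# Priors from relative frequencies, as on the slide.
n1, n2 = len(h_c1), len(h_c2)
p_c1, p_c2 = n1 / (n1 + n2), n2 / (n1 + n2)        # 0.183, 0.817

# Class-conditional likelihoods: normalized histograms over common bins.
bins = np.linspace(0.8, 2.0, 25)
lik1, _ = np.histogram(h_c1, bins=bins, density=True)
lik2, _ = np.histogram(h_c2, bins=bins, density=True)

# Posterior for each bin via Bayes' rule (NaN where the evidence is zero).
evidence = lik1 * p_c1 + lik2 * p_c2
post_c1 = lik1 * p_c1 / np.where(evidence > 0, evidence, np.nan)

b = np.digitize(1.0, bins) - 1                     # bin containing x = 1.0
print(f"P(C1 / x=1.0) ~= {post_c1[b]:.3f}")        # analogous to the 0.438 on the slide
```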
A More General Theory
• Use more than one feature.
• Allow more than two categories.
• Allow actions other than classifying the input to one of the possible categories (e.g., rejection).
• Employ a more general error function (i.e., "risk" function) by associating a "cost" ("loss" function) with each error (i.e., wrong action).
Terminology
• Features form a vector x ∈ R^d
• A finite set of c categories ω1, ω2, …, ωc
• A finite set of l actions α1, α2, …, αl
• A loss function λ(αi/ωj)
– the cost associated with taking action αi when the correct classification category is ωj
• Bayes rule (i.e., using vector notation):

P(ωj/x) = p(x/ωj) P(ωj) / p(x)

where p(x) = Σj p(x/ωj) P(ωj), j = 1, …, c
Conditional Risk (or Expected Loss)
• Suppose we observe x and take action αi
• Suppose that the cost associated with taking action αi with ωj being the correct category is λ(αi/ωj)
• The conditional risk (or expected loss) of taking action αi is:

R(αi/x) = Σj λ(αi/ωj) P(ωj/x), j = 1, …, c
Overall Risk
• Suppose α(x) is a general decision rule that determines which action α1, α2, …, αl to take for every x; then the overall risk is defined as:

R = ∫ R(α(x)/x) p(x) dx

• The optimum decision rule is the Bayes rule
Overall Risk (cont'd)
• The Bayes decision rule minimizes R by:
(i) Computing R(αi/x) for every αi given an x
(ii) Choosing the action αi with the minimum R(αi/x)
• The resulting minimum overall risk is called the Bayes risk and is the best (i.e., optimum) performance that can be achieved:

R* = min R
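A minimal sketch of the two steps above (compute R(αi/x) for each action, pick the minimum). The loss matrix, the rejection action, and the helper bayes_action are hypothetical illustrations, not from the lecture:

```python
import numpy as np

# Hypothetical loss matrix lambda(a_i/w_j): rows = actions, cols = true classes.
# Actions: a1 = decide w1, a2 = decide w2, a3 = reject.
loss = np.array([[0.0, 10.0],   # deciding w1: free if truly w1, costly if w2
                 [5.0,  0.0],   # deciding w2
                 [1.0,  1.0]])  # rejecting always costs 1

def bayes_action(posteriors):
    """posteriors: array [P(w1/x), P(w2/x)].
    Returns the action index minimizing the conditional risk
    R(a_i/x) = sum_j lambda(a_i/w_j) P(w_j/x), plus all the risks."""
    cond_risk = loss @ posteriors
    return int(np.argmin(cond_risk)), cond_risk

# Confident posterior -> decide; ambiguous posterior -> rejection is cheapest.
print(bayes_action(np.array([0.95, 0.05])))   # action 0: decide w1
print(bayes_action(np.array([0.55, 0.45])))   # action 2: reject
```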
Example: Two-category classification
• Define
– α1: decide ω1
– α2: decide ω2
– λij = λ(αi/ωj)
• The conditional risks are (c = 2):

R(α1/x) = λ11 P(ω1/x) + λ12 P(ω2/x)
R(α2/x) = λ21 P(ω1/x) + λ22 P(ω2/x)
Example: Two-category classification (cont'd)
• Minimum risk decision rule:

Decide ω1 if (λ21 - λ11) P(ω1/x) > (λ12 - λ22) P(ω2/x)

or (i.e., using Bayes' rule)

Decide ω1 if (λ21 - λ11) p(x/ω1) P(ω1) > (λ12 - λ22) p(x/ω2) P(ω2)

or (i.e., using the likelihood ratio)

Decide ω1 if p(x/ω1)/p(x/ω2) > [(λ12 - λ22)/(λ21 - λ11)] × [P(ω2)/P(ω1)]
(likelihood ratio > threshold)
Special Case: Zero-One Loss Function
• Assign the same loss to all errors:

λ(αi/ωj) = 0 if i = j; 1 if i ≠ j

• The conditional risk corresponding to this loss function:

R(αi/x) = Σj≠i P(ωj/x) = 1 - P(ωi/x)

Special Case: Zero-One Loss Function (cont'd)
• The decision rule becomes:

Decide ω1 if P(ω1/x) > P(ω2/x)
or Decide ω1 if p(x/ω1)P(ω1) > p(x/ω2)P(ω2)
or Decide ω1 if p(x/ω1)/p(x/ω2) > P(ω2)/P(ω1)

• In this case, the overall risk is the average probability of error!
Example
• Assuming zero-one loss:

Decide ω1 if p(x/ω1)/p(x/ω2) > θa; otherwise decide ω2, where θa = P(ω2)/P(ω1)

• Assuming general loss:

Decide ω1 if p(x/ω1)/p(x/ω2) > θb; otherwise decide ω2, where θb = P(ω2)(λ12 - λ22) / [P(ω1)(λ21 - λ11)]

• Assume λ12 > λ21 (with λ11 = λ22 = 0 this gives θb > θa).
(figure: likelihood ratio p(x/ω1)/p(x/ω2) and the decision regions induced by θa and θb)
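A sketch of the likelihood-ratio rule with both thresholds. The Gaussian class conditionals, the priors, and the loss values are illustrative assumptions chosen so the two rules disagree at the test point:

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

p_w1, p_w2 = 2 / 3, 1 / 3
lam = {"11": 0.0, "12": 4.0, "21": 1.0, "22": 0.0}  # assume lambda12 > lambda21

theta_a = p_w2 / p_w1                                            # zero-one loss
theta_b = (p_w2 * (lam["12"] - lam["22"])) / (p_w1 * (lam["21"] - lam["11"]))

x = 1.7
ratio = gauss_pdf(x, 1.0, 1.0) / gauss_pdf(x, 3.0, 1.0)          # p(x/w1)/p(x/w2)

# With lambda12 > lambda21, theta_b > theta_a: misclassifying w2 as w1 is
# expensive, so the region where we decide w1 shrinks.
print(f"ratio={ratio:.2f}, theta_a={theta_a:.2f}, theta_b={theta_b:.2f}")
print("zero-one loss decides:", "w1" if ratio > theta_a else "w2")
print("general loss decides: ", "w1" if ratio > theta_b else "w2")
```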
Discriminant Functions
• A useful way to represent classifiers is through discriminant functions gi(x), i = 1, ..., c, where a feature vector x is assigned to class ωi if:

gi(x) > gj(x) for all j ≠ i
Discriminants for Bayes Classifier
• Assuming a general loss function:
gi(x) = -R(αi/x)
• Assuming the zero-one loss function:
gi(x) = P(ωi/x)
Discriminants for Bayes Classifier (cont'd)
• Is the choice of gi unique?
– Replacing gi(x) with f(gi(x)), where f() is monotonically increasing, does not change the classification results.
• Equivalent discriminants for the zero-one loss:

gi(x) = P(ωi/x) = p(x/ωi) P(ωi) / p(x)
gi(x) = p(x/ωi) P(ωi)
gi(x) = ln p(x/ωi) + ln P(ωi)   (we'll use this form extensively!)
Case of two categories
• It is more common to use a single discriminant function (dichotomizer) instead of two:

g(x) = g1(x) - g2(x), deciding ω1 if g(x) > 0

• Examples:

g(x) = P(ω1/x) - P(ω2/x)
g(x) = ln [p(x/ω1)/p(x/ω2)] + ln [P(ω1)/P(ω2)]
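A minimal sketch of the second (log-form) dichotomizer; the Gaussian class conditionals and all parameters are made up:

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def g(x, p_w1=0.5, p_w2=0.5):
    """Dichotomizer g(x) = ln[p(x/w1)/p(x/w2)] + ln[P(w1)/P(w2)]."""
    return (np.log(gauss_pdf(x, 1.0, 1.0)) - np.log(gauss_pdf(x, 3.0, 1.0))
            + np.log(p_w1 / p_w2))

x = 1.8
print("decide", "w1" if g(x) > 0 else "w2")   # one sign test replaces g1 vs g2
```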
Decision Regions and Boundaries
• Decision rules divide the feature space into decision regions R1, R2, …, Rc, separated by decision boundaries.
• A decision boundary is defined by:

g1(x) = g2(x)
Discriminant Function for Multivariate Gaussian Density
• Consider the following discriminant function:

gi(x) = ln p(x/ωi) + ln P(ωi), with p(x/ωi) ~ N(μi, Σi)
Multivariate Gaussian Density: Case I
• Σi = σ²I (diagonal)
– Features are statistically independent
– Each feature has the same variance
• Dropping class-independent terms, gi(x) = -||x - μi||²/(2σ²) + ln P(ωi); the ln P(ωi) term favours the a-priori more likely category.
Multivariate Gaussian Density: Case I (cont'd)
• Properties of decision boundary:
– It passes through x0
– It is orthogonal to the line linking the means.
– What happens when P(ωi) = P(ωj)?
– If P(ωi) ≠ P(ωj), then x0 shifts away from the most likely category.
– If σ is very small, the position of the boundary is insensitive to P(ωi) and P(ωj).
Multivariate Gaussian Density: Case I (cont'd)
(figures: if P(ωi) ≠ P(ωj), the decision boundary x0 shifts away from the most likely category)
Multivariate Gaussian Density: Case I (cont'd)
• Minimum distance classifier
– When the P(ωi) are equal, then:

gi(x) = -||x - μi||²   (assign x to the category with maximum gi, i.e., the nearest mean)
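A minimal nearest-mean sketch for this equal-prior, Σi = σ²I case; the means are illustrative:

```python
import numpy as np

means = np.array([[0.0, 0.0],    # mu_1
                  [3.0, 3.0]])   # mu_2  (made-up values)

def classify(x):
    """g_i(x) = -||x - mu_i||^2; picking the maximum selects the nearest mean."""
    g = -np.sum((means - x) ** 2, axis=1)
    return int(np.argmax(g))

print(classify(np.array([1.0, 1.2])))   # 0 -> omega_1
```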
Multivariate Gaussian Density: Case II (cont'd)
• (Case II: Σi = Σ, a common covariance matrix)
• Properties of hyperplane (decision boundary):
– It passes through x0
– It is not orthogonal to the line linking the means.
– What happens when P(ωi) = P(ωj)?
– If P(ωi) ≠ P(ωj), then x0 shifts away from the most likely category.
Multivariate Gaussian Density: Case II (cont'd)
(figures: if P(ωi) ≠ P(ωj), the decision boundary x0 shifts away from the most likely category)
Multivariate Gaussian Density: Case II (cont'd)
• Mahalanobis distance classifier
– When the P(ωi) are equal, then:

gi(x) = -(x - μi)ᵀ Σ⁻¹ (x - μi)   (assign x to the category with maximum gi)
Multivariate Gaussian Density: Case III
• Σi = arbitrary
• The decision boundaries are hyperquadrics: e.g., hyperplanes, pairs of hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids, etc.
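A sketch of the general Gaussian discriminant gi(x) = ln p(x/ωi) + ln P(ωi), which covers all three cases: with equal priors and Σi = Σ it reduces to the Mahalanobis-distance classifier, and with Σi = σ²I to the minimum-distance classifier. The helper gaussian_discriminant and all parameters are illustrative:

```python
import numpy as np

def gaussian_discriminant(x, mu, sigma, prior):
    """g_i(x) = -1/2 (x-mu)^T Sigma^-1 (x-mu) - 1/2 ln|Sigma| + ln P(w_i);
    the shared constant -d/2 ln(2*pi) is dropped for all classes."""
    d = x - mu
    maha_sq = d @ np.linalg.solve(sigma, d)        # squared Mahalanobis distance
    return -0.5 * maha_sq - 0.5 * np.linalg.slogdet(sigma)[1] + np.log(prior)

# Illustrative Case III setup: arbitrary covariances -> hyperquadric boundary.
params = [
    (np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 1.0]]), 0.5),
    (np.array([2.0, 2.0]), np.array([[2.0, 0.5], [0.5, 1.0]]), 0.5),
]

x = np.array([1.0, 1.5])
g = [gaussian_discriminant(x, mu, S, P) for mu, S, P in params]
print("decide omega_%d" % (int(np.argmax(g)) + 1))
```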
Error Bounds
• Exact error calculations could be difficult – it is easier to estimate error bounds!
• Since P(error/x) = min[P(ω1/x), P(ω2/x)]:

P(error) = ∫ min[P(ω1/x), P(ω2/x)] p(x) dx
Error Bounds (cont'd)
• The Chernoff bound corresponds to the β that minimizes e^(-κ(β))
– This is a 1-D optimization problem, regardless of the dimensionality of the class-conditional densities.
(figure: e^(-κ(β)) vs. β – the bound is tight at the minimizing β and loose elsewhere)
Error Bounds (cont'd)
• Bhattacharyya bound
– Approximate the error bound using β = 0.5
– Easier to compute than the Chernoff bound, but looser.
• The Chernoff and Bhattacharyya bounds will not be good bounds if the distributions are not Gaussian.
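A sketch of the Bhattacharyya bound (β = 0.5) for two Gaussians, P(error) ≤ sqrt(P(ω1)P(ω2)) e^(-k(1/2)), using the standard closed form of k(1/2) for Gaussian class conditionals (as in Duda, Hart & Stork); the means, covariances, and priors are illustrative:

```python
import numpy as np

def bhattacharyya_bound(mu1, S1, mu2, S2, p1, p2):
    """P(error) <= sqrt(P(w1)P(w2)) exp(-k(1/2)), where for Gaussians
    k(1/2) = 1/8 (mu2-mu1)^T [(S1+S2)/2]^-1 (mu2-mu1)
           + 1/2 ln( |(S1+S2)/2| / sqrt(|S1| |S2|) )."""
    S = 0.5 * (S1 + S2)
    dm = mu2 - mu1
    k = ((dm @ np.linalg.solve(S, dm)) / 8.0
         + 0.5 * (np.linalg.slogdet(S)[1]
                  - 0.5 * (np.linalg.slogdet(S1)[1] + np.linalg.slogdet(S2)[1])))
    return np.sqrt(p1 * p2) * np.exp(-k)

mu1, S1 = np.array([0.0, 0.0]), np.eye(2)
mu2, S2 = np.array([2.0, 2.0]), 2.0 * np.eye(2)     # illustrative parameters
print(f"P(error) <= {bhattacharyya_bound(mu1, S1, mu2, S2, 0.5, 0.5):.3f}")
```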
Receiver Operating Characteristic (ROC) Curve
• Every classifier employs some kind of a threshold, e.g.:

θa = P(ω2)/P(ω1)   or   θb = P(ω2)(λ12 - λ22) / [P(ω1)(λ21 - λ11)]

• Changing the threshold affects the performance of the system.
• ROC curves can help us evaluate system performance for different thresholds.
Example: Person Authentication
• Authenticate a person using biometrics (e.g., fingerprints).
• There are two possible distributions (i.e., classes):
– Authentic (A) and Impostor (I)

Example: Person Authentication (cont'd)
• Possible decisions:
– (1) correct acceptance (true positive): X belongs to A, and we decide A
– (2) incorrect acceptance (false positive): X belongs to I, and we decide A
– (3) correct rejection (true negative): X belongs to I, and we decide I
– (4) incorrect rejection (false negative): X belongs to A, and we decide I
(figure: overlapping I and A score distributions with the four outcome regions: false positive, correct acceptance, correct rejection, false negative)
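A sketch tying the ROC slide to this example: sweep a decision threshold on a 1-D match score with simulated Authentic/Impostor distributions (all parameters made up) and record the false-positive and true-positive rates; each threshold gives one operating point on the ROC curve:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical match scores: Impostor (I) low, Authentic (A) high.
scores_I = rng.normal(0.0, 1.0, 5000)
scores_A = rng.normal(2.0, 1.0, 5000)

# Sweep the decision threshold: decide A when score > t.
thresholds = np.linspace(-4, 6, 101)
tpr = [(scores_A > t).mean() for t in thresholds]   # correct acceptance rate
fpr = [(scores_I > t).mean() for t in thresholds]   # incorrect acceptance rate

# Plotting fpr vs. tpr would trace the ROC curve; print a few operating points.
for t in (0.0, 1.0, 2.0):
    i = int(np.argmin(np.abs(thresholds - t)))
    print(f"t={thresholds[i]:+.1f}  FPR={fpr[i]:.2f}  TPR={tpr[i]:.2f}")
```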