Machine Learning
The Naïve Bayes Classifier
Today's lecture
• The naïve Bayes classifier
• Learning the naïve Bayes classifier
• Practical concerns
Where are we?

We have seen Bayesian learning
– Using a probabilistic criterion to select a hypothesis
– Maximum a posteriori and maximum likelihood learning
• Question: What is the difference between them?

We could also learn functions that predict probabilities of outcomes
– Different from using a probabilistic criterion to learn

Maximum a posteriori (MAP) prediction as opposed to MAP learning
MAP prediction

Let's use the Bayes rule for predicting y given an input x:

P(y | x) = P(x | y) P(y) / P(x)

This is the posterior probability of the label being y for this input x. Here P(x | y) is the likelihood of observing this input x when the label is y, and P(y) is the prior probability of the label being y. All we need are these two sets of probabilities.

Predict y for the input x using

argmax_y P(y | x) = argmax_y P(x | y) P(y)

The denominator P(x) is the same for every label, so it can be dropped from the argmax.

Don't confuse this with MAP learning, which finds a hypothesis by argmax_h P(h | D).
Example: Tennis again

Likelihood:

Temperature  Wind    P(T, W | Tennis = Yes)
Hot          Strong  0.15
Hot          Weak    0.4
Cold         Strong  0.1
Cold         Weak    0.35

Temperature  Wind    P(T, W | Tennis = No)
Hot          Strong  0.4
Hot          Weak    0.1
Cold         Strong  0.3
Cold         Weak    0.2

Prior:

Play tennis  P(Play tennis)
Yes          0.3
No           0.7

Without any other information, what is the prior probability that I should play tennis?
On days that I do play tennis, what is the probability that the temperature is T and the wind is W?
On days that I don't play tennis, what is the probability that the temperature is T and the wind is W?

Input: Temperature = Hot (H), Wind = Weak (W). Should I play tennis?

argmax_y P(H, W | play?) P(play?)

P(H, W | Yes) P(Yes) = 0.4 × 0.3 = 0.12
P(H, W | No) P(No) = 0.1 × 0.7 = 0.07

MAP prediction = Yes
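The prediction above can be reproduced in a few lines of Python, with the table values taken directly from the slide:

```python
# Likelihoods P(T, W | Tennis) and prior P(Tennis) from the tables above.
likelihood = {
    "Yes": {("Hot", "Strong"): 0.15, ("Hot", "Weak"): 0.4,
            ("Cold", "Strong"): 0.1, ("Cold", "Weak"): 0.35},
    "No":  {("Hot", "Strong"): 0.4, ("Hot", "Weak"): 0.1,
            ("Cold", "Strong"): 0.3, ("Cold", "Weak"): 0.2},
}
prior = {"Yes": 0.3, "No": 0.7}

def map_predict(temp, wind):
    # argmax_y P(T, W | y) P(y); P(x) is the same for both labels, so skip it.
    scores = {y: likelihood[y][(temp, wind)] * prior[y] for y in prior}
    return max(scores, key=scores.get), scores

label, scores = map_predict("Hot", "Weak")
print(label)  # Yes  (0.12 beats 0.07)
```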
How hard is it to learn probabilistic models?

 #  O  T  H  W  Play?
 1  S  H  H  W   -
 2  S  H  H  S   -
 3  O  H  H  W   +
 4  R  M  H  W   +
 5  R  C  N  W   +
 6  R  C  N  S   -
 7  O  C  N  S   +
 8  S  M  H  W   -
 9  S  C  N  W   +
10  R  M  N  W   +
11  S  M  N  S   +
12  O  M  H  S   +
13  O  H  N  W   +
14  R  M  H  S   -

Outlook: S(unny), O(vercast), R(ainy)
Temperature: H(ot), M(edium), C(ool)
Humidity: H(igh), N(ormal), L(ow)
Wind: S(trong), W(eak)

We need to learn
1. The prior P(Play?)
2. The likelihoods P(X | Play?)

Prior P(play?)
• A single number (Why only one?)

Likelihood P(X | Play?)
• There are 4 features, with 3, 3, 3, and 2 possible values respectively
• For each value of Play? (+/-), we need a value for each possible assignment: P(x1, x2, x3, x4 | Play?)
• (3 ⋅ 3 ⋅ 3 ⋅ 2 − 1) parameters in each case, one for each assignment
How hard is it to learn probabilistic models? In general:

Prior P(Y)
• If there are k labels, then k − 1 parameters (why not k?)

Likelihood P(X | Y)
• If there are d Boolean features:
• We need a value for each possible P(x1, x2, …, xd | y) for each y
• k(2^d − 1) parameters

Need a lot of data to estimate these many numbers!

High model complexity: if there is very limited data, there will be high variance in the parameters.

How can we deal with this?
Answer: Make independence assumptions
Recall: Conditional independence

Suppose X, Y and Z are random variables.

X is conditionally independent of Y given Z if the probability distribution of X is independent of the value of Y when Z is observed:

P(X | Y, Z) = P(X | Z)

Or equivalently,

P(X, Y | Z) = P(X | Z) P(Y | Z)
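The definition can be checked numerically on a toy distribution. This is an illustrative sketch; the joint below is made up so that the identity holds by construction:

```python
from itertools import product

# A made-up joint over binary X, Y, Z built so that X is independent of Y
# given Z: P(x, y, z) = P(z) P(x | z) P(y | z).
p_z = {0: 0.6, 1: 0.4}
p_x_given_z = {0: {1: 0.2, 0: 0.8}, 1: {1: 0.7, 0: 0.3}}  # p_x_given_z[z][x]
p_y_given_z = {0: {1: 0.5, 0: 0.5}, 1: {1: 0.9, 0: 0.1}}

joint = {(x, y, z): p_z[z] * p_x_given_z[z][x] * p_y_given_z[z][y]
         for x, y, z in product([0, 1], repeat=3)}

def cond(x=None, y=None, z=None):
    """P of the given assignment conditioned on Z = z (unset vars marginalized)."""
    num = sum(p for (xv, yv, zv), p in joint.items()
              if zv == z and (x is None or xv == x) and (y is None or yv == y))
    den = sum(p for (_, _, zv), p in joint.items() if zv == z)
    return num / den

# Verify conditional independence: P(X, Y | Z) = P(X | Z) P(Y | Z).
for x, y, z in product([0, 1], repeat=3):
    assert abs(cond(x=x, y=y, z=z) - cond(x=x, z=z) * cond(y=y, z=z)) < 1e-12
print("X is conditionally independent of Y given Z")
```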
Modeling the features

P(x1, x2, …, xd | y) required k(2^d − 1) parameters.

What if all the features were conditionally independent given the label? That is,

P(x1, x2, …, xd | y) = P(x1 | y) P(x2 | y) ⋯ P(xd | y)

This is the naïve Bayes assumption. It requires only d numbers for each label: kd parameters overall. Not bad!
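The savings can be made concrete with a quick count (the values of k and d below are just example numbers):

```python
def full_joint_params(k, d):
    # One probability per assignment of d Boolean features, minus one for
    # normalization, for each of the k labels: k(2^d - 1).
    return k * (2 ** d - 1)

def naive_bayes_params(k, d):
    # One Bernoulli parameter P(x_j = 1 | y) per feature per label: kd.
    return k * d

# With 2 labels and 20 Boolean features:
print(full_joint_params(2, 20))   # 2097150
print(naive_bayes_params(2, 20))  # 40
```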
The Naïve Bayes Classifier

Assumption: Features are conditionally independent given the label Y.

To predict, we need two sets of probabilities
– Prior P(y)
– For each xj, we have the likelihood P(xj | y)

Decision rule

h_NB(x) = argmax_y P(y) P(x1, x2, …, xd | y)
        = argmax_y P(y) ∏_j P(xj | y)
Decision boundaries of naïve Bayes

What is the decision boundary of the naïve Bayes classifier?

Consider the two-class case. We predict the label to be + if

P(y = +) ∏_j P(xj | y = +) > P(y = −) ∏_j P(xj | y = −)

or equivalently, if

P(y = +) ∏_j P(xj | y = +) / [ P(y = −) ∏_j P(xj | y = −) ] > 1
Decision boundaries of naïve Bayes

What is the decision boundary of the naïve Bayes classifier?

Taking log and simplifying, we get

log [ P(y = −| x) / P(y = +| x) ] = wᵀx + b

This is a linear function of the feature space! Easy to prove; see note on course website.
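For binary features, the linearity is straightforward to verify with the Bernoulli parameterization used later in this lecture. A sketch of the derivation, writing $a_j = P(x_j{=}1 \mid +)$ and $b_j = P(x_j{=}1 \mid -)$ as shorthand:

```latex
\[
\log\frac{P(y=+\mid \boldsymbol{x})}{P(y=-\mid \boldsymbol{x})}
= \log\frac{P(y=+)}{P(y=-)}
+ \sum_j x_j \log\frac{a_j}{b_j}
+ \sum_j (1-x_j)\log\frac{1-a_j}{1-b_j}
= \boldsymbol{w}^{\mathsf T}\boldsymbol{x} + b,
\]
\[
\text{with}\quad
w_j = \log\frac{a_j(1-b_j)}{b_j(1-a_j)},
\qquad
b = \log\frac{P(y=+)}{P(y=-)} + \sum_j \log\frac{1-a_j}{1-b_j}.
\]
```

The log turns the product of per-feature likelihoods into a sum, and each term is either constant or linear in $x_j$, which is exactly what makes the boundary linear.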
Today's lecture
• The naïve Bayes classifier
• Learning the naïve Bayes classifier
• Practical concerns
Learning the naïve Bayes Classifier

• What is the hypothesis function h defined by?
– A collection of probabilities
  • Prior for each label: P(y)
  • Likelihoods for feature xj given a label: P(xj | y)

Suppose we have a dataset D = {(x_i, y_i)} with m examples, and we want to learn the classifier in a probabilistic way.
– What is a probabilistic criterion to select the hypothesis?

A note on convention for this section:
• Examples in the dataset are indexed by the subscript i (e.g. x_i)
• Features within an example are indexed by the subscript j
• The j-th feature of the i-th example will be x_ij
Learning the naïve Bayes Classifier: Maximum likelihood estimation

Here h is defined by all the probabilities used to construct the naïve Bayes decision.

Given a dataset D = {(x_i, y_i)} with m examples:

h_ML = argmax_h P(D | h)
     = argmax_h ∏_i P(x_i, y_i | h)

Each example in the dataset is independent and identically distributed, so we can represent P(D | h) as this product. Each factor asks "What probability would this particular h assign to the pair (x_i, y_i)?"

     = argmax_h ∏_i P(x_i | y_i, h) P(y_i | h)
     = argmax_h ∏_i P(y_i | h) ∏_j P(x_ij | y_i, h)

The last step uses the naïve Bayes assumption; x_ij is the j-th feature of x_i.

How do we proceed?
Learning the naïve Bayes Classifier: Maximum likelihood estimation

What next? We need to make a modeling assumption about the functional form of these probability distributions.

For simplicity, suppose there are two labels 1 and 0 and all features are binary.

• Prior: P(y = 1) = p and P(y = 0) = 1 − p
• Likelihood for each feature given a label
  • P(xj = 1 | y = 1) = a_j and P(xj = 0 | y = 1) = 1 − a_j
  • P(xj = 1 | y = 0) = b_j and P(xj = 0 | y = 0) = 1 − b_j

h consists of p, all the a's and the b's.
Learning the naïve Bayes Classifier: Maximum likelihood estimation

The prior can be written compactly with indicator notation:

• Prior: P(y_i) = p^[y_i = 1] (1 − p)^[y_i = 0]

[z] is called the indicator function or the Iverson bracket. Its value is 1 if the argument z is true and zero otherwise.

The likelihood for each feature given a label can be written the same way:
• P(x_ij | y_i = 1) = a_j^[x_ij = 1] (1 − a_j)^[x_ij = 0]
• P(x_ij | y_i = 0) = b_j^[x_ij = 1] (1 − b_j)^[x_ij = 0]
Learning the naïve Bayes Classifier

Substituting and deriving the argmax, we get count-based estimates:

P(y = 1) = p = (number of examples with y = 1) / m

P(xj = 1 | y = 1) = a_j = (number of examples with xj = 1 and y = 1) / (number of examples with y = 1)

P(xj = 1 | y = 0) = b_j = (number of examples with xj = 1 and y = 0) / (number of examples with y = 0)
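These count-based estimates can be sketched directly in Python for the binary setting. The feature matrix X and labels y below are made-up toy data:

```python
def mle_estimates(X, y):
    """Maximum likelihood estimates for binary-feature, binary-label naive Bayes.

    X: list of binary feature vectors, y: list of 0/1 labels.
    Returns (p, a, b): p = P(y=1), a[j] = P(x_j=1 | y=1), b[j] = P(x_j=1 | y=0).
    """
    m, d = len(X), len(X[0])
    pos = [x for x, yi in zip(X, y) if yi == 1]
    neg = [x for x, yi in zip(X, y) if yi == 0]
    p = len(pos) / m
    a = [sum(x[j] for x in pos) / len(pos) for j in range(d)]
    b = [sum(x[j] for x in neg) / len(neg) for j in range(d)]
    return p, a, b

# Toy data: 4 examples, 2 binary features.
X = [[1, 0], [1, 1], [0, 1], [0, 0]]
y = [1, 1, 0, 0]
p, a, b = mle_estimates(X, y)
print(p, a, b)  # 0.5 [1.0, 0.5] [0.0, 0.5]
```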
Let's learn a naïve Bayes classifier

 #  O  T  H  W  Play?
 1  S  H  H  W   -
 2  S  H  H  S   -
 3  O  H  H  W   +
 4  R  M  H  W   +
 5  R  C  N  W   +
 6  R  C  N  S   -
 7  O  C  N  S   +
 8  S  M  H  W   -
 9  S  C  N  W   +
10  R  M  N  W   +
11  S  M  N  S   +
12  O  M  H  S   +
13  O  H  N  W   +
14  R  M  H  S   -

P(Play = +) = 9/14, P(Play = −) = 5/14

P(O = S | Play = +) = 2/9
P(O = R | Play = +) = 3/9
P(O = O | Play = +) = 4/9

And so on, for other attributes and also for Play = −.
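These counts can be reproduced programmatically. A sketch, with the 14 examples transcribed from the table:

```python
from collections import Counter

# (Outlook, Temperature, Humidity, Wind, Play?) for the 14 examples.
data = [
    ("S","H","H","W","-"), ("S","H","H","S","-"), ("O","H","H","W","+"),
    ("R","M","H","W","+"), ("R","C","N","W","+"), ("R","C","N","S","-"),
    ("O","C","N","S","+"), ("S","M","H","W","-"), ("S","C","N","W","+"),
    ("R","M","N","W","+"), ("S","M","N","S","+"), ("O","M","H","S","+"),
    ("O","H","N","W","+"), ("R","M","H","S","-"),
]

label_counts = Counter(row[-1] for row in data)
print(label_counts["+"], "/", len(data))  # P(Play = +) = 9/14

def likelihood(feature_index, value, label):
    # P(feature = value | Play = label), estimated from counts.
    with_label = [row for row in data if row[-1] == label]
    return sum(row[feature_index] == value for row in with_label) / len(with_label)

print(likelihood(0, "S", "+"))  # P(O = S | Play = +) = 2/9
```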
Naïve Bayes: Learning and Prediction

• Learning
– Count how often features occur with each label. Normalize to get likelihoods.
– Priors from fraction of examples with each label
– Generalizes to multiclass

• Prediction
– Use learned probabilities to find the highest scoring label
Today's lecture
• The naïve Bayes classifier
• Learning the naïve Bayes classifier
• Practical concerns + an example
Important caveats with Naïve Bayes

1. Features need not be conditionally independent given the label
– Just because we assume that they are doesn't mean that that's how they behave in nature
– We made a modeling assumption because it makes computation and learning easier
– All bets are off if the naïve Bayes assumption is not satisfied
– And yet, it is very often used in practice because of its simplicity, and it works reasonably well even when the assumption is violated

2. Not enough training data to get good estimates of the probabilities from counts
Important caveats with Naïve Bayes

2. Not enough training data to get good estimates of the probabilities from counts

The basic operation for learning likelihoods is counting how often a feature occurs with a label. What if we never see a particular feature with a particular label? E.g., suppose we never observe Temperature = cold with PlayTennis = Yes. Should we treat those counts as zero? But that will make the probabilities zero.

Answer: Smoothing
• Add fake counts (very small numbers, so that the counts are not zero)
• The Bayesian interpretation of smoothing: priors on the hypothesis (MAP learning)
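Add-one (Laplace) smoothing is a common instance of this idea. A minimal sketch, where alpha is the fake count added to every (value, label) pair:

```python
def smoothed_likelihood(count, total, num_values, alpha=1.0):
    """P(feature = value | label) with add-alpha smoothing.

    count: times the value co-occurred with the label,
    total: number of examples with the label,
    num_values: number of distinct values this feature can take.
    """
    return (count + alpha) / (total + alpha * num_values)

# Never saw Temperature = cold with Play = Yes (0 of 9 examples, 3 values):
print(smoothed_likelihood(0, 9, 3))  # 1/12 instead of 0
```

Dividing by `total + alpha * num_values` rather than `total` keeps the smoothed probabilities for one feature summing to 1.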
Example: Classifying text

• Instance space: Text documents
• Labels: Spam or NotSpam
• Goal: To learn a function that can predict whether a new document is Spam or NotSpam

How would you build a Naïve Bayes classifier? Let us brainstorm:
How to represent documents? How to estimate probabilities? How to classify?
Example: Classifying text

1. Represent documents by a vector of words
   A sparse vector consisting of one feature per word

2. Learning from N labeled documents
   1. Priors: the fraction of the N documents with each label, e.g. P(Spam)
   2. For each word w in the vocabulary: how often does the word occur with each label? Estimate P(w | Spam) and P(w | NotSpam) from counts, with smoothing.
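The whole pipeline for this example can be sketched end to end. Everything below (the tiny corpus, whitespace tokenization) is made up for illustration; counts use add-one smoothing, and prediction sums log probabilities to avoid floating-point underflow on long documents:

```python
import math
from collections import Counter

docs = [("buy cheap pills now", "Spam"),
        ("cheap pills cheap pills", "Spam"),
        ("meeting agenda for monday", "NotSpam"),
        ("lunch meeting on monday", "NotSpam")]

# Priors: fraction of documents with each label.
prior = {y: c / len(docs) for y, c in Counter(y for _, y in docs).items()}

# Per-label word counts and the shared vocabulary.
word_counts = {y: Counter() for y in prior}
for text, y in docs:
    word_counts[y].update(text.split())
vocab = {w for text, _ in docs for w in text.split()}

def log_likelihood(word, y, alpha=1.0):
    # log P(w | y) with add-one smoothing over the vocabulary.
    return math.log((word_counts[y][word] + alpha)
                    / (sum(word_counts[y].values()) + alpha * len(vocab)))

def classify(text):
    scores = {y: math.log(prior[y])
                 + sum(log_likelihood(w, y) for w in text.split())
              for y in prior}
    return max(scores, key=scores.get)

print(classify("cheap pills"))     # Spam
print(classify("monday meeting"))  # NotSpam
```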
Continuous features

• So far, we have been looking at discrete features
– P(xj | y) is a Bernoulli trial (i.e. a coin toss)

• We could model P(xj | y) with other distributions too
– This is a separate assumption from the independence assumption that naïve Bayes makes
– E.g.: For real valued features, (Xj | Y) could be drawn from a normal distribution

• Exercise: Derive the maximum likelihood estimate when the features are assumed to be drawn from the normal distribution
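For the exercise above, the maximum likelihood estimates under the normal assumption turn out to be the per-class sample mean and (biased) sample variance. A sketch with made-up toy numbers, showing one common modeling choice rather than the only one:

```python
import math

def gaussian_mle(values):
    # ML estimates of mean and variance for one feature within one class.
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)  # biased ML estimate
    return mu, var

def gaussian_pdf(x, mu, var):
    # Density of N(mu, var) at x, used as the likelihood P(x_j | y).
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Toy real-valued temperatures on days with y = 1:
mu, var = gaussian_mle([20.0, 22.0, 24.0])
print(mu, var)  # 22.0 and 8/3
```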
Summary: Naïve Bayes

• Independence assumption
– All features are independent of each other given the label

• Maximum likelihood learning: learning is simple
– Generalizes to real valued features

• Prediction via MAP estimation
– Generalizes beyond binary classification

• Important caveats to remember
– Smoothing
– The independence assumption may not be valid

• Decision boundary is linear for binary classification