
CS188: Artificial Intelligence
Naïve Bayes

Instructors: Brijen Thananjeyan and Aditya Baradwaj --- University of California, Berkeley
[These slides were created by Dan Klein, Pieter Abbeel, Sergey Levine, with some materials from A. Farhadi. All CS188 materials are at http://ai.berkeley.edu.]

Machine Learning

§ Up until now: how to use a model to make optimal decisions

§ Machine learning: how to acquire a model from data/experience
§ Learning parameters (e.g. probabilities)
§ Learning structure (e.g. BN graphs)
§ Learning hidden concepts (e.g. clustering)

§ Today: model-based classification with Naive Bayes

Classification

Example: Spam Filter

§ Input: an email
§ Output: spam/ham

§ Setup:
§ Get a large collection of example emails, each labeled "spam" or "ham"
§ Note: someone has to hand-label all this data!
§ Want to learn to predict labels of new, future emails

§ Features: The attributes used to make the ham/spam decision
§ Words: FREE!
§ Text Patterns: $dd, CAPS
§ Non-text: SenderInContacts
§ …

Dear Sir.

First, I must solicit your confidence in this transaction, this is by virture of its nature as being utterly confidencial and top secret. …

TO BE REMOVED FROM FUTURE MAILINGS, SIMPLY REPLY TO THIS MESSAGE AND PUT "REMOVE" IN THE SUBJECT.

99 MILLION EMAIL ADDRESSES FOR ONLY $99

Ok, I know this is blatantly OT but I'm beginning to go insane. Had an old Dell Dimension XPS sitting in the corner and decided to put it to use, I know it was working pre being stuck in the corner, but when I plugged it in, hit the power nothing happened.

Example: Digit Recognition

§ Input: images / pixel grids
§ Output: a digit 0-9

§ Setup:
§ Get a large collection of example images, each labeled with a digit
§ Note: someone has to hand-label all this data!
§ Want to learn to predict labels of new, future digit images

§ Features: The attributes used to make the digit decision
§ Pixels: (6,8) = ON
§ Shape Patterns: NumComponents, AspectRatio, NumLoops
§ …

[Example images: handwritten digits labeled 0, 1, 2, 1, and an unlabeled query image (??)]

Other Classification Tasks

§ Classification: given inputs x, predict labels (classes) y

§ Examples:
§ Spam detection (input: document, classes: spam/ham)
§ OCR (input: images, classes: characters)
§ Medical diagnosis (input: symptoms, classes: diseases)
§ Automatic essay grading (input: document, classes: grades)
§ Fraud detection (input: account activity, classes: fraud / no fraud)
§ Customer service email routing
§ … many more

§ Classification is an important commercial technology!

Model-Based Classification

Model-Based Classification

§ Model-based approach:
§ Build a model (e.g. Bayes' net) where both the label and features are random variables
§ Instantiate any observed features
§ Query for the distribution of the label conditioned on the features

§ Challenges:
§ What structure should the BN have?
§ How should we learn its parameters?

Naïve Bayes for Digits

§ Naïve Bayes: Assume all features are independent effects of the label

§ Simple digit recognition version:
§ One feature (variable) F_ij for each grid position <i,j>
§ Feature values are on/off, based on whether intensity is more or less than 0.5 in the underlying image
§ Each input maps to a feature vector of on/off values (see the sketch below)

§ Here: lots of features, each is binary valued

§ Naïve Bayes model:  P(Y, F_1 … F_n) = P(Y) Π_i P(F_i | Y)

§ What do we need to learn?

[Bayes net: a label node Y with one child per feature, F_1, F_2, …, F_n]
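The binarization step above is simple enough to sketch directly. Below is a minimal illustration (not the course's project code), assuming grayscale intensities in [0, 1] and the 0.5 threshold from the slide; the tiny 3x3 "image" is made up.

```python
def binarize(image, threshold=0.5):
    """Map each pixel intensity in [0, 1] to an on/off feature F_ij."""
    return [[1 if pixel > threshold else 0 for pixel in row] for row in image]

tiny = [[0.0, 0.9, 0.1],
        [0.2, 0.8, 0.0],
        [0.7, 0.6, 0.4]]
print(binarize(tiny))   # [[0, 1, 0], [0, 1, 0], [1, 1, 0]]
```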

General Naïve Bayes

§ A general Naive Bayes model:  P(Y, F_1 … F_n) = P(Y) Π_i P(F_i | Y)

§ We only have to specify how each feature depends on the class
§ Total number of parameters is linear in n
§ Model is very simplistic, but often works anyway

[Bayes net: a label node Y with children F_1, F_2, …, F_n]

§ P(Y): |Y| parameters
§ P(F_i | Y) tables: n x |F| x |Y| parameters
§ Compare to the full joint over Y, F_1 … F_n: |Y| x |F|^n values (counted below)
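To make those counts concrete, here is a quick back-of-the-envelope calculation; the 28x28 grid size is an assumption (the slides only say one binary feature per grid position), as are the variable names.

```python
n, F, Y = 28 * 28, 2, 10           # binary pixel features, 10 digit classes
nb_params = Y + n * F * Y          # P(Y) plus one P(F_i | Y) table per feature
joint_values = Y * F ** n          # the table the Naive Bayes assumption avoids
print(nb_params)                   # 15690
print(joint_values > 10 ** 200)    # True: astronomically larger
```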

Inference for Naïve Bayes

§ Goal: compute the posterior distribution over the label variable Y

§ Step 1: get the joint probability of the label and the evidence, for each label:
  P(y, f_1 … f_n) = P(y) Π_i P(f_i | y)

§ Step 2: sum over labels to get the probability of the evidence:
  P(f_1 … f_n) = Σ_y P(y, f_1 … f_n)

§ Step 3: normalize by dividing Step 1 by Step 2 (see the sketch below)
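A minimal sketch of these three steps, assuming the parameters live in plain dictionaries (prior[y] = P(y), cond[i][y][f] = P(F_i = f | y)); the names and the toy numbers are illustrative, not from the course code.

```python
def posterior(prior, cond, evidence):
    """Return P(Y | f_1 ... f_n) for the observed feature values in `evidence`."""
    # Step 1: joint probability of label and evidence, for each label
    joint = dict(prior)
    for y in joint:
        for i, f in enumerate(evidence):
            joint[y] *= cond[i][y][f]
    # Step 2: probability of the evidence (sum over labels)
    z = sum(joint.values())
    # Step 3: normalize
    return {y: p / z for y, p in joint.items()}

# Toy example: two labels, two binary features
prior = {"spam": 0.33, "ham": 0.67}
cond = [
    {"spam": {0: 0.2, 1: 0.8}, "ham": {0: 0.9, 1: 0.1}},   # P(F_1 | Y)
    {"spam": {0: 0.5, 1: 0.5}, "ham": {0: 0.7, 1: 0.3}},   # P(F_2 | Y)
]
print(posterior(prior, cond, [1, 0]))   # spam ~ 0.74, ham ~ 0.26
```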

General Naïve Bayes

§ What do we need in order to use Naïve Bayes?

§ Inference method (we just saw this part)
§ Start with a bunch of probabilities: P(Y) and the P(F_i | Y) tables
§ Use standard inference to compute P(Y | F_1 … F_n)
§ Nothing new here

§ Estimates of local conditional probability tables
§ P(Y), the prior over labels
§ P(F_i | Y) for each feature (evidence variable)
§ These probabilities are collectively called the parameters of the model and denoted by θ

§ Up until now, we assumed these appeared by magic, but…
§ …they typically come from training data counts: we'll look at this soon

Example: Conditional Probabilities

P(Y): uniform prior over the digit classes

  y:     1    2    3    4    5    6    7    8    9    0
  P(y):  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1

P(F = on | Y) for one pixel:

  y:     1     2     3     4     5     6     7     8     9     0
  P:     0.01  0.05  0.05  0.30  0.80  0.90  0.05  0.60  0.50  0.80

P(F = on | Y) for a second pixel:

  y:     1     2     3     4     5     6     7     8     9     0
  P:     0.05  0.01  0.90  0.80  0.90  0.90  0.25  0.85  0.60  0.80

A Spam Filter

§ Naïve Bayes spam filter

§ Data:
§ Collection of emails, labeled spam or ham
§ Note: someone has to hand-label all this data!
§ Split into training, held-out, and test sets

§ Classifiers:
§ Learn on the training set
§ (Tune it on a held-out set)
§ Test it on new emails

Dear Sir.

First, I must solicit your confidence in this transaction, this is by virture of its nature as being utterly confidencial and top secret. …

TO BE REMOVED FROM FUTURE MAILINGS, SIMPLY REPLY TO THIS MESSAGE AND PUT "REMOVE" IN THE SUBJECT.

99 MILLION EMAIL ADDRESSES FOR ONLY $99

Ok, I know this is blatantly OT but I'm beginning to go insane. Had an old Dell Dimension XPS sitting in the corner and decided to put it to use, I know it was working pre being stuck in the corner, but when I plugged it in, hit the power nothing happened.

Naïve Bayes for Text

§ Bag-of-words Naïve Bayes:
§ Features: W_i is the word at position i
§ As before: predict label conditioned on feature variables (spam vs. ham)
§ As before: assume features are conditionally independent given the label
§ New: each W_i is identically distributed

§ Generative model:  P(Y, W_1 … W_n) = P(Y) Π_i P(W_i | Y)

§ "Tied" distributions and bag-of-words:
§ Usually, each variable gets its own conditional probability distribution P(F | Y)
§ In a bag-of-words model:
§ Each position is identically distributed
§ All positions share the same conditional probs P(W | Y)
§ Why make this assumption?

§ Called "bag-of-words" because the model is insensitive to word order or reordering

Note: W_i is the word at position i, not the i-th word in the dictionary!

When the lecture is over, remember to wake up the person sitting next to you in the lecture room.

in is lecture lecture next over person remember room sitting the the the to to up wake when you

How many variables are there? How many values? (See the sketch below.)
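A minimal sketch of the bag-of-words view of that sentence, assuming simple lowercasing and punctuation stripping (the slides don't specify a tokenizer).

```python
from collections import Counter

sentence = ("When the lecture is over, remember to wake up the "
            "person sitting next to you in the lecture room.")
tokens = [w.strip(".,").lower() for w in sentence.split()]

print(len(tokens))       # 19: one position variable W_i per word of this sentence
print(sorted(tokens))    # the reordered word list shown above; order is discarded
print(Counter(tokens))   # only the word counts matter to the model
# Each W_i ranges over the entire vocabulary, and all positions share
# the same tied table P(W | Y).
```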

Example: Spam Filtering

§ Model:  P(Y, W_1 … W_n) = P(Y) Π_i P(W_i | Y)

§ What are the parameters?  The prior P(Y) and one tied word table P(W | Y) per class:

P(W | Y) for one class:
the : 0.0156
to : 0.0153
and : 0.0115
of : 0.0095
you : 0.0093
a : 0.0086
with: 0.0080
from: 0.0075
...

P(W | Y) for the other class:
the : 0.0210
to : 0.0133
of : 0.0119
2002: 0.0110
with: 0.0108
from: 0.0107
and : 0.0105
a : 0.0100
...

P(Y):
ham : 0.66
spam: 0.33

§ Where do these tables come from?

Spam Example

Word      P(w|spam)  P(w|ham)   Tot Spam   Tot Ham
(prior)   0.33333    0.66666    -1.1       -0.4
Gary      0.00002    0.00021    -11.8      -8.9
would     0.00069    0.00084    -19.1      -16.0
you       0.00881    0.00304    -23.8      -21.8
like      0.00086    0.00083    -30.9      -28.9
to        0.01517    0.01339    -35.1      -33.2
lose      0.00008    0.00002    -44.5      -44.0
weight    0.00016    0.00002    -53.3      -55.0
while     0.00027    0.00027    -61.5      -63.2
you       0.00881    0.00304    -66.2      -69.0
sleep     0.00006    0.00001    -76.0      -80.5

P(spam | w) = 98.9%
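The "Tot" columns are running sums of log probabilities. A minimal sketch that reproduces them (up to rounding), with the word probabilities copied from the table:

```python
import math

p_spam = [0.00002, 0.00069, 0.00881, 0.00086, 0.01517,
          0.00008, 0.00016, 0.00027, 0.00881, 0.00006]   # Gary ... sleep
p_ham  = [0.00021, 0.00084, 0.00304, 0.00083, 0.01339,
          0.00002, 0.00002, 0.00027, 0.00304, 0.00001]

log_spam = math.log(0.33333)   # prior, about -1.1
log_ham  = math.log(0.66666)   # prior, about -0.4
for ps, ph in zip(p_spam, p_ham):
    log_spam += math.log(ps)
    log_ham  += math.log(ph)

posterior_spam = 1 / (1 + math.exp(log_ham - log_spam))
print(round(log_spam, 1), round(log_ham, 1))   # about -76.0 and -80.3
print(round(posterior_spam, 3))                # about 0.987, i.e. the ~98.9% above up to rounding
```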

Training and Testing

Important Concepts

§ Data: labeled instances, e.g. emails marked spam/ham
§ Training set
§ Held-out set
§ Test set

§ Features: attribute-value pairs which characterize each x

§ Experimentation cycle:
§ Learn parameters (e.g. model probabilities) on the training set
§ (Tune hyperparameters on the held-out set)
§ Compute accuracy on the test set
§ Very important: never "peek" at the test set!

§ Evaluation:
§ Accuracy: fraction of instances predicted correctly

§ Overfitting and generalization:
§ Want a classifier which does well on test data
§ Overfitting: fitting the training data very closely, but not generalizing well
§ Underfitting: fits the training set poorly

[Diagram: the labeled data divided into Training Data, Held-Out Data, and Test Data (split sketched below)]
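A minimal sketch of that three-way split; the 80/10/10 proportions are an illustrative choice, not something the slides prescribe.

```python
import random

def split(data, train_frac=0.8, held_out_frac=0.1, seed=0):
    """Shuffle and cut the labeled data into training / held-out / test sets."""
    data = list(data)
    random.Random(seed).shuffle(data)
    n_train = int(train_frac * len(data))
    n_held = int(held_out_frac * len(data))
    return (data[:n_train],                     # learn parameters here
            data[n_train:n_train + n_held],     # tune hyperparameters here
            data[n_train + n_held:])            # report accuracy here, once

train, held_out, test = split(range(100))
print(len(train), len(held_out), len(test))     # 80 10 10
```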

Underfitting and Overfitting

[Plot: a degree-15 polynomial fit to the training points, illustrating overfitting]

Overfitting

Example: Overfitting

2 wins!!

Example: Overfitting

§ Posteriors determined by relative probabilities (odds ratios):

Highest P(W | ham) / P(W | spam):
south-west : inf
nation : inf
morally : inf
nicely : inf
extent : inf
seriously : inf
...

Highest P(W | spam) / P(W | ham):
screens : inf
minute : inf
guaranteed : inf
$205.00 : inf
delivery : inf
signature : inf
...

What went wrong here?

Generalization and Overfitting

§ Relative frequency parameters will overfit the training data!
§ Just because we never saw a 3 with pixel (15,15) on during training doesn't mean we won't see it at test time
§ Unlikely that every occurrence of "minute" is 100% spam
§ Unlikely that every occurrence of "seriously" is 100% ham
§ What about all the words that don't occur in the training set at all?
§ In general, we can't go around giving unseen events zero probability

§ As an extreme case, imagine using the entire email as the only feature
§ Would get the training data perfect (if deterministic labeling)
§ Wouldn't generalize at all
§ Just making the bag-of-words assumption gives us some generalization, but isn't enough

§ To generalize better: we need to smooth or regularize the estimates

Parameter Estimation

Parameter Estimation

§ Estimating the distribution of a random variable

§ Elicitation: ask a human (why is this hard?)

§ Empirically: use training data (learning!)
§ E.g.: for each outcome x, look at the empirical rate of that value:
  P_ML(x) = count(x) / total samples

§ This is the estimate that maximizes the likelihood of the data (see the sketch below)

[Illustration: sequences of red (r) and blue (b) draws, e.g. r r b, from which the empirical rates are estimated]
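A minimal sketch of the empirical (relative-frequency) estimate, using the red/blue draws from the illustration above:

```python
from collections import Counter

samples = ["r", "r", "b"]                          # two red draws, one blue
p_ml = {x: c / len(samples) for x, c in Counter(samples).items()}
print(p_ml)   # {'r': 0.666..., 'b': 0.333...}, i.e. count(x) / total samples
```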

Your First Consulting Job

§ A billionaire tech entrepreneur asks you a question:
§ He says: I have a thumbtack; if I flip it, what's the probability it will fall with the nail up?

§ You say: Please flip it a few times:

§ You say: The probability is:
§ P(H) = 3/5

§ He says: Why???
§ You say: Because…

Your First Consulting Job

§ P(Heads) = θ, P(Tails) = 1 - θ

§ Flips are i.i.d.:
§ Independent events
§ Identically distributed according to unknown distribution

§ Sequence D of α_H Heads and α_T Tails

  D = {x_i | i = 1 … n},   P(D | θ) = Π_i P(x_i | θ)

Maximum Likelihood Estimation

§ Data: Observed set D of α_H Heads and α_T Tails
§ Hypothesis space: Bernoulli distributions
§ Learning: finding θ is an optimization problem
§ What's the objective function?
  P(D | θ) = θ^α_H (1 - θ)^α_T

§ MLE: Choose θ to maximize the probability of D:
  θ̂ = argmax_θ P(D | θ)

Maximum Likelihood Estimation

§ Set derivative to zero, and solve!

  θ̂ = argmax_θ ln P(D | θ)

  d/dθ ln P(D | θ) = d/dθ [ ln θ^α_H (1 - θ)^α_T ]
                   = d/dθ [ α_H ln θ + α_T ln(1 - θ) ]
                   = α_H d/dθ ln θ + α_T d/dθ ln(1 - θ)
                   = α_H / θ - α_T / (1 - θ) = 0

  ⟹  θ̂ = α_H / (α_H + α_T)
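A quick numeric sanity check (not from the slides) that the closed form matches a brute-force maximization of the log-likelihood:

```python
import math

alpha_H, alpha_T = 3, 2    # e.g. the thumbtack data: 3 heads, 2 tails

def log_likelihood(theta):
    return alpha_H * math.log(theta) + alpha_T * math.log(1 - theta)

grid = [i / 1000 for i in range(1, 1000)]      # theta in (0, 1)
best = max(grid, key=log_likelihood)
print(best, alpha_H / (alpha_H + alpha_T))     # 0.6 0.6
```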

Smoothing

Maximum Likelihood?

§ Relative frequencies are the maximum likelihood estimates:
  θ_ML = argmax_θ P(D | θ)

§ Another option is to consider the most likely parameter value given the data:
  θ_MAP = argmax_θ P(θ | D)

Unseen Events

Laplace Smoothing

§ Laplace’sestimate:§ Pretendyousaweveryoutcome

oncemorethanyouactuallydid

§ CanderivethisestimatewithDirichlet priors (seecs281a)

r r b

Laplace Smoothing

§ Laplace’sestimate(extended):§ Pretendyousaweveryoutcomekextratimes

§ What’sLaplacewithk=0?§ kisthestrength oftheprior

§ Laplaceforconditionals:§ Smootheachconditionindependently:

r r b
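A minimal sketch of the extended Laplace estimate; the helper name and arguments are illustrative.

```python
from collections import Counter

def laplace(samples, outcomes, k=1):
    """P_LAP,k(x) = (count(x) + k) / (N + k*|X|): pretend k extra draws of each outcome."""
    counts, n = Counter(samples), len(samples)
    return {x: (counts[x] + k) / (n + k * len(outcomes)) for x in outcomes}

draws = ["r", "r", "b"]
print(laplace(draws, ["r", "b"], k=0))     # {'r': 0.666..., 'b': 0.333...}  (k = 0 is just ML)
print(laplace(draws, ["r", "b"], k=1))     # {'r': 0.6, 'b': 0.4}
print(laplace(draws, ["r", "b"], k=100))   # nearly uniform: large k is a strong prior
```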

Real NB: Smoothing

§ For real classification problems, smoothing is critical
§ New odds ratios:

Highest P(W | ham) / P(W | spam):
helvetica : 11.4
seems : 10.8
group : 10.2
ago : 8.4
areas : 8.3
...

Highest P(W | spam) / P(W | ham):
verdana : 28.8
Credit : 28.4
ORDER : 27.2
<FONT> : 26.9
money : 26.5
...

Do these make more sense?

Tuning

Tuning on Held-Out Data

§ Nowwe’vegottwokindsofunknowns§ Parameters: theprobabilitiesP(X|Y),P(Y)§ Hyperparameters:e.g.theamount/typeofsmoothingtodo,k,a

§ Whatshouldwelearnwhere?§ Learnparametersfromtrainingdata§ Tunehyperparameters ondifferentdata

§ Why?§ Foreachvalueofthehyperparameters,trainandtestontheheld-outdata

§ Choosethebestvalueanddoafinaltestonthetestdata
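A minimal sketch of that tuning loop. `train_fn` and `eval_fn` are hypothetical stand-ins for the classifier's training and accuracy code, which the slides don't specify, so they are passed in as arguments.

```python
def tune_k(train_fn, eval_fn, train_data, held_out_data,
           candidate_ks=(0.001, 0.01, 0.1, 1.0, 10.0)):
    """Return the smoothing strength k with the best held-out accuracy."""
    scored = []
    for k in candidate_ks:
        model = train_fn(train_data, k)                     # fit P(Y), P(F_i | Y) with smoothing k
        scored.append((eval_fn(model, held_out_data), k))   # never score on the test set here
    return max(scored)[1]                                   # best k; run the test set once, at the end

# Toy usage with dummy stand-ins, just to show the shape of the loop:
best = tune_k(lambda data, k: k, lambda model, data: -abs(model - 1.0), None, None)
print(best)   # 1.0
```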

Summary

§ Bayes' rule lets us do diagnostic queries with causal probabilities

§ The naïve Bayes assumption takes all features to be independent given the class label

§ We can build classifiers out of a naïve Bayes model using training data

§ Smoothing estimates is important in real systems
