FA18 CS188 Lecture 20: Naive Bayes — Wuwei Lan · 2020-06-09
TRANSCRIPT
Machine Learning
§ Machine learning: how to acquire a model from data/experience
§ Learning parameters (e.g. probabilities)
§ Learning structure (e.g. BN graphs)
§ Learning hidden concepts (e.g. clustering, neural nets)
§ Today: model-based classification with Naive Bayes
Classification
Example: Spam Filter
§ Input: an email
§ Output: spam/ham
§ Setup:
§ Get a large collection of example emails, each labeled "spam" or "ham"
§ Note: someone has to hand-label all this data!
§ Want to learn to predict labels of new, future emails
§ Features: the attributes used to make the ham/spam decision
§ Words: FREE!
§ Text patterns: $dd, CAPS
§ Non-text: SenderInContacts, WidelyBroadcast
§ …
Dear Sir.
First, I must solicit your confidence in this transaction, this is by virture of its nature as being utterly confidencial and top secret. …
TO BE REMOVED FROM FUTURE MAILINGS, SIMPLY REPLY TO THIS MESSAGE AND PUT "REMOVE" IN THE SUBJECT.
99 MILLION EMAIL ADDRESSES FOR ONLY $99
Ok, I know this is blatantly OT but I'm beginning to go insane. Had an old Dell Dimension XPS sitting in the corner and decided to put it to use, I know it was working pre being stuck in the corner, but when I plugged it in, hit the power nothing happened.
Example: Digit Recognition
§ Input: images / pixel grids
§ Output: a digit 0-9
§ Setup:
§ Get a large collection of example images, each labeled with a digit
§ Note: someone has to hand-label all this data!
§ Want to learn to predict labels of new, future digit images
§ Features: the attributes used to make the digit decision
§ Pixels: (6,8) = ON
§ Shape patterns: NumComponents, AspectRatio, NumLoops
§ …
§ Features are increasingly induced rather than crafted
[Example images labeled 0, 1, 2, 1, and an unlabeled query image: ??]
Other Classification Tasks
§ Classification: given inputs x, predict labels (classes) y
§ Examples:
§ Medical diagnosis (input: symptoms, classes: diseases)
§ Fraud detection (input: account activity, classes: fraud / no fraud)
§ Automatic essay grading (input: document, classes: grades)
§ Customer service email routing
§ Review sentiment
§ Language ID
§ … many more
§ Classification is an important commercial technology!
Model-Based Classification
§ Model-based approach
§ Build a model (e.g. Bayes' net) where both the output label and input features are random variables
§ Instantiate any observed features
§ Query for the distribution of the label conditioned on the features
§ Challenges
§ What structure should the BN have?
§ How should we learn its parameters?
Naïve Bayes for Digits
§ Naïve Bayes: assume all features are independent effects of the label
§ Simple digit recognition version:
§ One feature (variable) F_ij for each grid position <i,j>
§ Feature values are on/off, based on whether intensity is more or less than 0.5 in the underlying image
§ Each input maps to a feature vector, e.g. … (see the sketch below)
§ Here: lots of features, each is binary valued
§ Naïve Bayes model: P(Y, F1, …, Fn) = P(Y) ∏_i P(F_i | Y); classification then uses Bayes' theorem to get P(Y | F1 … Fn)
§ What do we need to learn?
[Bayes net structure: Y → F1, F2, …, Fn]
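A minimal sketch of that pixel-to-feature mapping in Python (the image format and function name are illustrative assumptions, not from the lecture):

```python
def pixel_features(image, threshold=0.5):
    """Map an image (a 2-D list of intensities in [0, 1]) to binary on/off
    features F_ij, one per grid position <i, j>."""
    features = {}
    for i, row in enumerate(image):
        for j, intensity in enumerate(row):
            features[(i, j)] = 1 if intensity > threshold else 0
    return features

# A tiny 2x2 "image": two bright pixels, two dark ones
print(pixel_features([[0.9, 0.1], [0.3, 0.7]]))
# {(0, 0): 1, (0, 1): 0, (1, 0): 0, (1, 1): 1}
```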
General Naïve Bayes
§ A general Naive Bayes model: P(Y, F1, …, Fn) = P(Y) ∏_{i=1..n} P(F_i | Y)
§ We only have to specify how each feature depends on the class
§ Total number of parameters is linear in n
§ Model is very simplistic, but often works anyway
[Bayes net structure: Y → F1, F2, …, Fn]
P(Y): |Y| parameters
P(F_i | Y) tables: n × |F| × |Y| parameters
(the full joint would need |Y| × |F|^n values)
Inference for Naïve Bayes
§ Goal: compute posterior distribution over label variable Y
§ Step 1: get joint probability of label and evidence for each label: P(y, f1, …, fn) = P(y) ∏_i P(f_i | y)
§ Step 2: sum to get probability of evidence: P(f1, …, fn) = Σ_y P(y, f1, …, fn)
§ Step 3: normalize by dividing Step 1 by Step 2: P(y | f1, …, fn) = P(y, f1, …, fn) / P(f1, …, fn) (see the sketch below)
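A minimal sketch of these three inference steps in Python (the labels and probability tables are toy numbers for illustration, not values from the lecture):

```python
def naive_bayes_posterior(prior, likelihoods):
    """Posterior P(Y | f1..fn) for a naive Bayes model.
    prior:       {label: P(Y = label)}
    likelihoods: {label: [P(f_i | Y = label) for each observed feature value]}"""
    # Step 1: joint probability of label and evidence, P(y, f1..fn) = P(y) * prod_i P(f_i | y)
    joint = {}
    for y, p_y in prior.items():
        p = p_y
        for p_f in likelihoods[y]:
            p *= p_f
        joint[y] = p
    # Step 2: probability of evidence, P(f1..fn) = sum_y P(y, f1..fn)
    evidence = sum(joint.values())
    # Step 3: normalize
    return {y: p / evidence for y, p in joint.items()}

# Toy example: two labels, two observed binary features
prior = {"spam": 1 / 3, "ham": 2 / 3}
likelihoods = {"spam": [0.8, 0.1], "ham": [0.2, 0.5]}
print(naive_bayes_posterior(prior, likelihoods))
```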
General Naïve Bayes
§ What do we need in order to use Naïve Bayes?
§ Inference method (we just saw this part)
§ Start with a bunch of probabilities: P(Y) and the P(F_i | Y) tables
§ Use standard inference to compute P(Y | F1 … Fn)
§ Nothing new here
§ Estimates of local conditional probability tables
§ P(Y), the prior over labels
§ P(F_i | Y) for each feature (evidence variable)
§ These probabilities are collectively called the parameters of the model and denoted by θ
§ Up until now, we assumed these appeared by magic, but…
§ …they typically come from training data counts: we'll look at this soon
Example: Conditional Probabilities
§ A uniform prior P(Y) and the conditionals P(F = on | Y) for two example pixel features (call them F_a and F_b):

Y   P(Y)   P(F_a = on | Y)   P(F_b = on | Y)
1   0.1    0.01              0.05
2   0.1    0.05              0.01
3   0.1    0.05              0.90
4   0.1    0.30              0.80
5   0.1    0.80              0.90
6   0.1    0.90              0.90
7   0.1    0.05              0.25
8   0.1    0.60              0.85
9   0.1    0.50              0.60
0   0.1    0.80              0.80
Naïve Bayes for Text
§ Bag-of-words Naïve Bayes:
§ Features: W_i is the word at position i
§ As before: predict label conditioned on feature variables (spam vs. ham)
§ As before: assume features are conditionally independent given label
§ New: each W_i is identically distributed
§ Generative model: P(Y, W1, …, Wn) = P(Y) ∏_i P(W_i | Y)
§ "Tied" distributions and bag-of-words
§ Usually, each variable gets its own conditional probability distribution P(F | Y)
§ In a bag-of-words model:
§ Each position is identically distributed
§ All positions share the same conditional probs P(W | Y)
§ Why make this assumption?
§ Called "bag-of-words" because the model is insensitive to word order or reordering
W_i is the word at position i, not the i-th word in the dictionary!
Example: Spam Filtering
§ Model: P(Y, W1, …, Wn) = P(Y) ∏_i P(W_i | Y)
§ What are the parameters?
§ Where do these tables come from?

P(W | spam):
the : 0.0156
to  : 0.0153
and : 0.0115
of  : 0.0095
you : 0.0093
a   : 0.0086
with: 0.0080
from: 0.0075
...

P(W | ham):
the : 0.0210
to  : 0.0133
of  : 0.0119
2002: 0.0110
with: 0.0108
from: 0.0107
and : 0.0105
a   : 0.0100
...

P(Y):
ham : 0.66
spam: 0.33
Spam Example

Word      P(w|spam)   P(w|ham)   Tot Spam   Tot Ham
(prior)   0.33333     0.66666    -1.1       -0.4
Gary      0.00002     0.00021    -11.8      -8.9
would     0.00069     0.00084    -19.1      -16.0
you       0.00881     0.00304    -23.8      -21.8
like      0.00086     0.00083    -30.9      -28.9
to        0.01517     0.01339    -35.1      -33.2
lose      0.00008     0.00002    -44.5      -44.0
weight    0.00016     0.00002    -53.3      -55.0
while     0.00027     0.00027    -61.5      -63.2
you       0.00881     0.00304    -66.2      -69.0
sleep     0.00006     0.00001    -76.0      -80.5

(The "Tot" columns are running totals of log probabilities, one word at a time.)

P(spam | w) = 98.9%
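The "Tot" columns above are what you get by adding log probabilities one word at a time, which avoids multiplying many tiny numbers. A small sketch, reusing a few of the probabilities from the table (variable names are illustrative):

```python
import math

def log_score(words, prior, word_probs):
    """Running log-probability total for one class: log P(y) + sum_i log P(w_i | y)."""
    total = math.log(prior)
    for w in words:
        total += math.log(word_probs[w])
    return total

# A few entries from the example table (tied bag-of-words conditionals)
p_w_spam = {"Gary": 0.00002, "would": 0.00069, "you": 0.00881}
p_w_ham  = {"Gary": 0.00021, "would": 0.00084, "you": 0.00304}

words = ["Gary", "would", "you"]
log_spam = log_score(words, 0.33333, p_w_spam)
log_ham  = log_score(words, 0.66666, p_w_ham)

# Normalize to recover the posterior (fine for short messages; for long ones,
# subtract the max log-score first to avoid underflow)
p_spam = math.exp(log_spam) / (math.exp(log_spam) + math.exp(log_ham))
print(log_spam, log_ham, p_spam)
```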
Training and Testing
Empirical Risk Minimization
§ Empirical risk minimization
§ Basic principle of machine learning
§ We want the model (classifier, etc.) that does best on the true test distribution
§ Don't know the true distribution, so pick the best model on our actual training set
§ Finding "the best" model on the training set is phrased as an optimization problem
§ Main worry: overfitting to the training set
§ Better with more training data (less sampling variance, training more like test)
§ Better if we limit the complexity of our hypotheses (regularization and/or small hypothesis spaces)
Important Concepts
§ Data: labeled instances (e.g. emails marked spam/ham)
§ Training set
§ Held-out set
§ Test set
§ Features: attribute-value pairs which characterize each x
§ Experimentation cycle
§ Learn parameters (e.g. model probabilities) on training set
§ (Tune hyperparameters on held-out set)
§ Compute accuracy on the test set
§ Very important: never "peek" at the test set!
§ Evaluation (many metrics possible, e.g. accuracy)
§ Accuracy: fraction of instances predicted correctly
§ Overfitting and generalization
§ Want a classifier which does well on test data
§ Overfitting: fitting the training data very closely, but not generalizing well
§ We'll investigate overfitting and generalization formally in a few lectures
[Figure: data split into Training Data, Held-Out Data, and Test Data]
Generalization and Overfitting
[Plot: a degree-15 polynomial fit to training data (overfitting)]
Example: Overfitting
2 wins!!
Example: Overfitting
§ Posteriors determined by relative probabilities (odds ratios):

P(W|ham) / P(W|spam):
south-west : inf
nation     : inf
morally    : inf
nicely     : inf
extent     : inf
seriously  : inf
...

P(W|spam) / P(W|ham):
screens    : inf
minute     : inf
guaranteed : inf
$205.00    : inf
delivery   : inf
signature  : inf
...

What went wrong here?
Generalization and Overfitting
§ Relative frequency parameters will overfit the training data!
§ Just because we never saw a 3 with pixel (15,15) on during training doesn't mean we won't see it at test time
§ Unlikely that every occurrence of "minute" is 100% spam
§ Unlikely that every occurrence of "seriously" is 100% ham
§ What about all the words that don't occur in the training set at all?
§ In general, we can't go around giving unseen events zero probability
§ As an extreme case, imagine using the entire email as the only feature (e.g. document ID)
§ Would get the training data perfect (if deterministic labeling)
§ Wouldn't generalize at all
§ Just making the bag-of-words assumption gives us some generalization, but isn't enough
§ To generalize better: we need to smooth or regularize the estimates
Parameter Estimation
§ Estimating the distribution of a random variable
§ Elicitation: ask a human (why is this hard?)
§ Empirically: use training data (learning!)
§ E.g.: for each outcome x, look at the empirical rate of that value: P_ML(x) = count(x) / total number of samples
§ This is the estimate that maximizes the likelihood of the data (see the sketch below)
[Figure: sequences of red (r) and blue (b) samples; e.g. from the samples r, r, b the empirical rate is P(r) = 2/3, P(b) = 1/3]
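A minimal sketch of this relative-frequency (maximum likelihood) estimate, using the red/blue samples above (the function name is illustrative):

```python
from collections import Counter

def max_likelihood_estimate(samples):
    """Empirical rate of each outcome: P_ML(x) = count(x) / total samples."""
    counts = Counter(samples)
    total = len(samples)
    return {x: c / total for x, c in counts.items()}

print(max_likelihood_estimate(["r", "r", "b"]))
# {'r': 0.666..., 'b': 0.333...}
```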
Smoothing
Maximum Likelihood?
§ Relative frequencies are the maximum likelihood estimates: θ_ML = argmax_θ P(data | θ)
§ Another option is to consider the most likely parameter value given the data: θ_MAP = argmax_θ P(θ | data)
Unseen Events
Laplace Smoothing
§ Laplace's estimate:
§ Pretend you saw every outcome once more than you actually did: P_LAP(x) = (count(x) + 1) / (N + |X|)
§ Can derive this estimate with Dirichlet priors (see CS281A)
§ E.g. from the samples r, r, b: P_ML(r) = 2/3, but P_LAP(r) = 3/5 and P_LAP(b) = 2/5
Laplace Smoothing
§ Laplace's estimate (extended):
§ Pretend you saw every outcome k extra times: P_LAP,k(x) = (count(x) + k) / (N + k|X|)
§ What's Laplace with k = 0?
§ k is the strength of the prior
§ Laplace for conditionals:
§ Smooth each condition independently: P_LAP,k(x | y) = (count(x, y) + k) / (count(y) + k|X|)
§ E.g. from the samples r, r, b: P_LAP,0(r) = 2/3, P_LAP,1(r) = 3/5, P_LAP,100(r) = 102/203 (a code sketch follows below)
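A minimal sketch of Laplace smoothing with strength k, for an unconditional distribution and for per-condition smoothing of P(X | Y) (function names are illustrative):

```python
from collections import Counter

def laplace_estimate(samples, domain, k=1):
    """P_LAP,k(x) = (count(x) + k) / (N + k*|X|): pretend every outcome in
    `domain` was seen k extra times."""
    counts = Counter(samples)
    n = len(samples)
    return {x: (counts[x] + k) / (n + k * len(domain)) for x in domain}

def laplace_conditional(pairs, x_domain, k=1):
    """Laplace-smoothed P(X | Y) from (x, y) pairs, smoothing each condition
    y independently."""
    by_y = {}
    for x, y in pairs:
        by_y.setdefault(y, []).append(x)
    return {y: laplace_estimate(xs, x_domain, k) for y, xs in by_y.items()}

print(laplace_estimate(["r", "r", "b"], domain=["r", "b"], k=1))
# {'r': 0.6, 'b': 0.4}
```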
Estimation: Linear Interpolation*
§ In practice, Laplace often performs poorly for P(X|Y):
§ When |X| is very large
§ When |Y| is very large
§ Another option: linear interpolation
§ Also get the empirical P(X) from the data
§ Make sure the estimate of P(X|Y) isn't too different from the empirical P(X): P_LIN(x|y) = α P̂(x|y) + (1 − α) P̂(x)
§ What if α is 0? 1?
§ For even better ways to estimate parameters, as well as details of the math, see CS281A, CS288
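A quick sketch of the interpolated estimate; the inputs are assumed to be the empirical conditional and marginal computed elsewhere (names are illustrative):

```python
def linear_interpolation(p_x_given_y, p_x, alpha):
    """P_LIN(x|y) = alpha * P_hat(x|y) + (1 - alpha) * P_hat(x).
    p_x_given_y: empirical {x: P(x|y)} for one fixed condition y
    p_x:         empirical marginal {x: P(x)}"""
    return {x: alpha * p_x_given_y.get(x, 0.0) + (1 - alpha) * p
            for x, p in p_x.items()}

# alpha = 1 recovers the raw conditional estimate; alpha = 0 ignores the class
# entirely and falls back to the marginal P(x).
```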
Real NB: Smoothing
§ For real classification problems, smoothing is critical
§ New odds ratios:

P(W|ham) / P(W|spam):
helvetica : 11.4
seems     : 10.8
group     : 10.2
ago       : 8.4
areas     : 8.3
...

P(W|spam) / P(W|ham):
verdana   : 28.8
Credit    : 28.4
ORDER     : 27.2
<FONT>    : 26.9
money     : 26.5
...

Do these make more sense?
Tuning
Tuning on Held-Out Data
§ Now we've got two kinds of unknowns
§ Parameters: the probabilities P(X|Y), P(Y)
§ Hyperparameters: e.g. the amount/type of smoothing to do, k, α
§ What should we learn where?
§ Learn parameters from training data
§ Tune hyperparameters on different data
§ Why?
§ For each value of the hyperparameters, train and test on the held-out data
§ Choose the best value and do a final test on the test data (a sketch of this loop follows below)
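A minimal sketch of that tuning loop, assuming placeholder `fit(train, k)` and `accuracy(classifier, data)` functions (neither is from the lecture):

```python
def tune_smoothing(train, held_out, candidate_ks, fit, accuracy):
    """Pick the smoothing strength k with the best held-out accuracy."""
    best_k, best_acc = None, -1.0
    for k in candidate_ks:
        clf = fit(train, k)            # learn parameters on training data only
        acc = accuracy(clf, held_out)  # evaluate on held-out data
        if acc > best_acc:
            best_k, best_acc = k, acc
    # Only after choosing k: report a single final number on the untouched test set.
    return best_k
```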
Features
Errors, and What to Do
§ Examples of errors
Dear GlobalSCAPE Customer,
GlobalSCAPE has partnered with ScanSoft to offer you the latest version of OmniPage Pro, for just $99.99* - the regular list price is $499! The most common question we've received about this offer is - Is this genuine? We would like to assure you that this offer is authorized by ScanSoft, is genuine and valid. You can get the . . .
. . . To receive your $30 Amazon.com promotional certificate, click through to
http://www.amazon.com/apparel
and see the prominent link for the $30 offer. All details are there. We hope you enjoyed receiving this message. However, if you'd rather not receive future e-mails announcing new store launches, please click . . .
What to Do About Errors?
§ Need more features: words aren't enough!
§ Have you emailed the sender before?
§ Have 1K other people just gotten the same email?
§ Is the sending information consistent?
§ Is the email in ALL CAPS?
§ Do inline URLs point where they say they point?
§ Does the email address you by (your) name?
§ Can add these information sources as new variables in the NB model
§ Next class we'll talk about classifiers which let you add arbitrary features more easily, and, later, how to induce new features
Baselines
§ First step: get a baseline
§ Baselines are very simple "straw man" procedures
§ Help determine how hard the task is
§ Help know what a "good" accuracy is
§ Weak baseline: most frequent label classifier
§ Gives all test instances whatever label was most common in the training set
§ E.g. for spam filtering, might label everything as ham
§ Accuracy might be very high if the problem is skewed
§ E.g. calling everything "ham" gets 66%, so a classifier that gets 70% isn't very good…
§ For real research, usually use previous work as a (strong) baseline
Confidences from a Classifier
§ The confidence of a probabilistic classifier:
§ Posterior probability of the top label
§ Represents how sure the classifier is of the classification
§ Any probabilistic model will have confidences
§ No guarantee confidence is correct
§ Calibration
§ Weak calibration: higher confidences mean higher accuracy
§ Strong calibration: confidence predicts accuracy rate
§ What's the value of calibration?
Summary
§ Bayes rule lets us do diagnostic queries with causal probabilities
§ The naïve Bayes assumption takes all features to be independent given the class label
§ We can build classifiers out of a naïve Bayes model using training data
§ Smoothing estimates is important in real systems
§ Classifier confidences are useful, when you can get them