dan jurafsky text classification - wuwei lan › courses › sp19 › 3521... · dan jurafsky text...

Post on 09-Jun-2020

Page 1: Dan Jurafsky Text Classification - Wuwei Lan › courses › SP19 › 3521...

Dan Jurafsky

Text Classification

• Assigning subject categories, topics, or genres
• Spam detection
• Authorship identification
• Age/gender identification
• Language identification
• Sentiment analysis
• …

Page 2: Text Classification: definition

• Input:
  • a document d
  • a fixed set of classes C = {c1, c2, …, cJ}
• Output: a predicted class c ∈ C

Page 3: Classification Methods: Supervised Machine Learning

• Input:
  • a document d
  • a fixed set of classes C = {c1, c2, …, cJ}
  • a training set of m hand-labeled documents (d1, c1), ..., (dm, cm)
• Output:
  • a learned classifier γ: d → c

Page 4: Classification Methods: Supervised Machine Learning

• Any kind of classifier
  • Naïve Bayes
  • Logistic regression
  • Support-vector machines
  • k-Nearest Neighbors
  • …

Page 5: Naïve Bayes Intuition

• Simple ("naïve") classification method based on Bayes rule
• Relies on very simple representation of document
  • Bag of words

Page 6: The bag of words representation

γ(
I love this movie! It's sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun… It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet.
) = c

Page 7: The bag of words representation

γ(
  great       2
  love        2
  recommend   1
  laugh       1
  happy       1
  ...         ...
) = c
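A minimal sketch of building such a word-count table, assuming a simple lowercase, letters-only tokenizer. The function name `bag_of_words`, the regular expression, and the sample sentence are illustrative choices, not part of the slides.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Map a document to word counts, discarding word order (assumed tokenizer)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

# Hypothetical sample text, not the review from the slide.
review = "I love this movie! I would recommend it. I love the dialogue."
bow = bag_of_words(review)
```

Because only counts are kept, two documents that contain the same words in a different order get the same representation.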

Page 8: Multinomial Naïve Bayes Independence Assumptions

P(x1, x2, …, xn | c)

• Bag of Words assumption: Assume position doesn't matter
• Conditional Independence: Assume the feature probabilities P(xi | cj) are independent given the class c.

P(x1, …, xn | c) = P(x1 | c) · P(x2 | c) · P(x3 | c) · ... · P(xn | c)

Page 9: Learning the Multinomial Naïve Bayes Model

• First attempt: maximum likelihood estimates
  • simply use the frequencies in the data

$$\hat{P}(c_j) = \frac{\mathrm{doccount}(C = c_j)}{N_{doc}}$$

$$\hat{P}(w_i \mid c_j) = \frac{\mathrm{count}(w_i, c_j)}{\sum_{w \in V} \mathrm{count}(w, c_j)}$$

Sec. 13.3

Page 10: Multinomial Naïve Bayes: Learning

• From training corpus, extract Vocabulary
• Calculate P(cj) terms
  • For each cj in C do
      docsj ← all docs with class = cj
      P(cj) ← |docsj| / |total # documents|
• Calculate P(wk | cj) terms
  • Textj ← single doc containing all docsj
  • For each word wk in Vocabulary
      nk ← # of occurrences of wk in Textj
      P(wk | cj) ← (nk + α) / (n + α |Vocabulary|)
    where n is the total number of word tokens in Textj
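The learning procedure can be sketched in a few lines of Python. This is an illustrative sketch, not the course's reference implementation: the function name `train_multinomial_nb`, the input format (pre-tokenized documents paired with labels), and the toy data are all assumptions.

```python
from collections import Counter, defaultdict

def train_multinomial_nb(labeled_docs, alpha=1.0):
    """labeled_docs: list of (tokens, class_label) pairs (assumed input format)."""
    # Extract the vocabulary from the training corpus.
    vocabulary = {w for tokens, _ in labeled_docs for w in tokens}

    # P(c_j): fraction of training documents with class c_j.
    doc_counts = Counter(label for _, label in labeled_docs)
    priors = {c: doc_counts[c] / len(labeled_docs) for c in doc_counts}

    # Text_j: all documents of class c_j pooled into one bag of counts.
    word_counts = defaultdict(Counter)
    for tokens, label in labeled_docs:
        word_counts[label].update(tokens)

    # P(w_k | c_j) = (n_k + alpha) / (n + alpha * |Vocabulary|)
    likelihoods = {}
    for c in doc_counts:
        n = sum(word_counts[c].values())  # total tokens in Text_j
        denom = n + alpha * len(vocabulary)
        likelihoods[c] = {w: (word_counts[c][w] + alpha) / denom for w in vocabulary}
    return vocabulary, priors, likelihoods

# Hypothetical toy corpus for illustration only.
toy_docs = [
    (["fun", "fun", "good"], "pos"),
    (["bad", "boring"], "neg"),
    (["good", "plot"], "pos"),
]
vocabulary, priors, likelihoods = train_multinomial_nb(toy_docs)
```

With smoothing, every vocabulary word gets a nonzero probability in every class, and each class's word probabilities still sum to 1.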

Page 11: Choosing a class: P(c|d5)

$$\hat{P}(c) = \frac{N_c}{N}$$

$$\hat{P}(w \mid c) = \frac{\mathrm{count}(w, c) + 1}{\mathrm{count}(c) + |V|}$$

           Doc   Words                                  Class
Training   1     Chinese Beijing Chinese                c
           2     Chinese Chinese Shanghai               c
           3     Chinese Macao                          c
           4     Tokyo Japan Chinese                    j
Test       5     Chinese Chinese Chinese Tokyo Japan    ?

Priors:
P(c) = 3/4
P(j) = 1/4

Conditional Probabilities:
P(Chinese | c) = (5+1)/(8+6) = 6/14 = 3/7
P(Tokyo | c) = (0+1)/(8+6) = 1/14
P(Japan | c) = (0+1)/(8+6) = 1/14
P(Chinese | j) = (1+1)/(3+6) = 2/9
P(Tokyo | j) = (1+1)/(3+6) = 2/9
P(Japan | j) = (1+1)/(3+6) = 2/9

Choosing a class:
P(c | d5) ∝ 3/4 * (3/7)^3 * 1/14 * 1/14 ≈ 0.0003
P(j | d5) ∝ 1/4 * (2/9)^3 * 2/9 * 2/9 ≈ 0.0001
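The arithmetic of this example can be checked with exact fractions. The sketch below recomputes the priors, the add-1 smoothed conditional probabilities, and the two unnormalized scores from the five documents in the table; the helper names (`prior`, `likelihood`, `score`) are illustrative.

```python
from collections import Counter
from fractions import Fraction

training = [
    ("Chinese Beijing Chinese", "c"),
    ("Chinese Chinese Shanghai", "c"),
    ("Chinese Macao", "c"),
    ("Tokyo Japan Chinese", "j"),
]
test_doc = "Chinese Chinese Chinese Tokyo Japan"

vocab = {w for text, _ in training for w in text.split()}  # |V| = 6
n_docs = Counter(label for _, label in training)
word_counts = {label: Counter() for label in n_docs}
for text, label in training:
    word_counts[label].update(text.split())

def prior(c):
    return Fraction(n_docs[c], len(training))

def likelihood(w, c):
    # Add-1 smoothing: (count(w, c) + 1) / (count(c) + |V|)
    return Fraction(word_counts[c][w] + 1, sum(word_counts[c].values()) + len(vocab))

def score(doc, c):
    # Unnormalized posterior: prior times the product of word likelihoods.
    result = prior(c)
    for w in doc.split():
        result *= likelihood(w, c)
    return result

scores = {c: score(test_doc, c) for c in n_docs}
best = max(scores, key=scores.get)  # class "c" has the larger score
```

The scores are only proportional to the posteriors, which is enough to pick the most probable class.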

Page 12: Underflow Prevention: log space

• Multiplying lots of probabilities can result in floating-point underflow.
• Since log(xy) = log(x) + log(y)
  • Better to sum logs of probabilities instead of multiplying probabilities.
• Class with highest un-normalized log probability score is still most probable.

$$c_{NB} = \operatorname*{argmax}_{c_j \in C} \left[ \log P(c_j) + \sum_{i \in \mathrm{positions}} \log P(x_i \mid c_j) \right]$$

• Model is now just max of sum of weights
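A short sketch of why log space helps and how the argmax looks in code. The first part multiplies a hypothetical small probability many times until the product underflows; the second part scores the two classes of the Chinese/Japan example by summing logs. The function name `classify` and the dictionary layout are assumptions made for illustration.

```python
import math

# Underflow demo: 100 factors of 1e-5 have true product 1e-500,
# which is below the smallest positive double, so the product becomes 0.0.
p, n = 1e-5, 100
product = 1.0
for _ in range(n):
    product *= p
log_sum = n * math.log(p)  # stays finite (about -1151.3)

def classify(tokens, log_priors, log_likelihoods):
    """Return the class maximizing log P(c) + sum_i log P(x_i | c)."""
    def log_score(c):
        return log_priors[c] + sum(log_likelihoods[c][w] for w in tokens)
    return max(log_priors, key=log_score)

# Probabilities taken from the worked Chinese/Japan example.
log_priors = {"c": math.log(3 / 4), "j": math.log(1 / 4)}
log_likelihoods = {
    "c": {"Chinese": math.log(3 / 7), "Tokyo": math.log(1 / 14), "Japan": math.log(1 / 14)},
    "j": {"Chinese": math.log(2 / 9), "Tokyo": math.log(2 / 9), "Japan": math.log(2 / 9)},
}
predicted = classify("Chinese Chinese Chinese Tokyo Japan".split(), log_priors, log_likelihoods)
```

Since log is monotonic, the class ranking is the same as with the products of probabilities.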

Page 13: Summary: Naive Bayes is Not So Naive

• Very fast, low storage requirements
• Robust to irrelevant features
  Irrelevant features cancel each other without affecting results
• Very good in domains with many equally important features
  Decision trees suffer from fragmentation in such cases, especially if little data
• Optimal if the independence assumptions hold: If assumed independence is correct, then it is the Bayes Optimal Classifier for problem
• A good dependable baseline for text classification
  • But we will see other classifiers that give better accuracy

Page 14: Text Classification: Evaluation

Page 15: The 2-by-2 contingency table

               correct   not correct
selected       tp        fp
not selected   fn        tn

Page 16: Precision and recall

• Precision: % of selected items that are correct
• Recall: % of correct items that are selected

               correct   not correct
selected       tp        fp
not selected   fn        tn
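In terms of the table cells, precision is tp / (tp + fp) and recall is tp / (tp + fn). A minimal sketch follows; returning 0.0 when a denominator is zero is an assumed convention, and the counts in the usage lines are hypothetical.

```python
def precision(tp, fp):
    """Fraction of selected items that are correct."""
    selected = tp + fp
    return tp / selected if selected else 0.0  # assumed convention for empty selection

def recall(tp, fn):
    """Fraction of correct items that are selected."""
    correct = tp + fn
    return tp / correct if correct else 0.0  # assumed convention for no correct items

# Hypothetical counts: 90 true positives, 10 false positives, 30 false negatives.
p = precision(tp=90, fp=10)
r = recall(tp=90, fn=30)
```

Note that tn appears in neither formula, so both measures ignore how many irrelevant items were correctly left unselected.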

Page 17: A combined measure: F

• A combined measure that assesses the P/R tradeoff is F measure (weighted harmonic mean):

$$F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}} = \frac{(\beta^2 + 1) P R}{\beta^2 P + R}$$

• The harmonic mean is a very conservative average; see IIR § 8.3
• People usually use balanced F1 measure
  • i.e., with β = 1 (that is, α = ½): F = 2PR/(P+R)
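A small sketch of both forms of the F measure, assuming the standard relation α = 1/(β² + 1) that links them (consistent with β = 1 giving α = ½). The function names and the sample precision/recall values are illustrative.

```python
def f_beta(p, r, beta=1.0):
    """F = (beta^2 + 1) * P * R / (beta^2 * P + R)."""
    if p == 0 and r == 0:
        return 0.0  # assumed convention when both are zero
    b2 = beta ** 2
    return (b2 + 1) * p * r / (b2 * p + r)

def f_alpha(p, r, alpha=0.5):
    """F = 1 / (alpha / P + (1 - alpha) / R); requires P > 0 and R > 0."""
    return 1.0 / (alpha / p + (1 - alpha) / r)

# Hypothetical values: high precision, low recall.
f1 = f_beta(0.9, 0.1)            # balanced F1
arithmetic_mean = (0.9 + 0.1) / 2
```

Here F1 comes out at 0.18 while the arithmetic mean is 0.5, which illustrates how the harmonic mean is pulled toward the lower of the two values.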

Page 18: More Than Two Classes: Sets of binary classifiers

• Dealing with any-of or multivalue classification
  • A document can belong to 0, 1, or >1 classes.
• For each class c ∈ C
  • Build a classifier γc to distinguish c from all other classes c' ∈ C
• Given test doc d,
  • Evaluate it for membership in each class using each γc
  • d belongs to any class for which γc returns true

Sec. 14.5

Page 19: More Than Two Classes: Sets of binary classifiers

• One-of or multinomial classification
  • Classes are mutually exclusive: each document in exactly one class
• For each class c ∈ C
  • Build a classifier γc to distinguish c from all other classes c' ∈ C
• Given test doc d,
  • Evaluate it for membership in each class using each γc
  • d belongs to the one class with maximum score

Sec. 14.5
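The any-of and one-of decision rules differ only in how the per-class outputs are combined. The sketch below assumes each binary classifier γc has already produced a real-valued score for the document, and that "returns true" means the score exceeds a threshold; the scores, threshold, and function names are hypothetical.

```python
def any_of(scores, threshold=0.0):
    """Any-of: keep every class whose binary classifier fires (score above threshold)."""
    return {c for c, s in scores.items() if s > threshold}

def one_of(scores):
    """One-of: pick the single class with the maximum score."""
    return max(scores, key=scores.get)

# Hypothetical per-class scores for one test document.
scores = {"UK": 1.2, "wheat": 0.4, "coffee": -0.7}
labels_any = any_of(scores)
label_one = one_of(scores)
```

Under any-of, a document whose scores all fall below the threshold receives no class at all, whereas one-of always returns exactly one.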

Page 20: Confusion matrix c

• For each pair of classes <c1, c2> how many documents from c1 were incorrectly assigned to c2?
  • c3,2: 90 wheat documents incorrectly assigned to poultry

Docs in test set   Assigned UK   Assigned poultry   Assigned wheat   Assigned coffee   Assigned interest   Assigned trade
True UK            95            1                  13               0                 1                   0
True poultry       0             1                  0                0                 0                   0
True wheat         10            90                 0                1                 0                   0
True coffee        0             0                  0                34                3                   7
True interest      -             1                  2                13                26                  5
True trade         0             0                  2                14                5                   10

Page 21: Per class evaluation measures

Recall: Fraction of docs in class i classified correctly:

$$\frac{c_{ii}}{\sum_j c_{ij}}$$

Precision: Fraction of docs assigned class i that are actually about class i:

$$\frac{c_{ii}}{\sum_j c_{ji}}$$

Accuracy: (1 - error rate) Fraction of docs classified correctly:

$$\frac{\sum_i c_{ii}}{\sum_j \sum_i c_{ij}}$$

Sec. 15.2.4
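These three measures can be computed directly from a confusion matrix whose rows are true classes and whose columns are assigned classes. The sketch uses the six-class matrix from the confusion matrix slide; the one cell shown there as "-" is treated as 0, which is an assumption, and the function names are illustrative.

```python
labels = ["UK", "poultry", "wheat", "coffee", "interest", "trade"]

# c[i][j] = number of docs whose true class is i and assigned class is j.
# The (interest, UK) cell appears as "-" on the slide; 0 is assumed here.
c = [
    [95, 1, 13, 0, 1, 0],
    [0, 1, 0, 0, 0, 0],
    [10, 90, 0, 1, 0, 0],
    [0, 0, 0, 34, 3, 7],
    [0, 1, 2, 13, 26, 5],
    [0, 0, 2, 14, 5, 10],
]

def recall(c, i):
    """c_ii divided by the row sum (all docs truly in class i)."""
    row_total = sum(c[i])
    return c[i][i] / row_total if row_total else 0.0

def precision(c, i):
    """c_ii divided by the column sum (all docs assigned to class i)."""
    col_total = sum(row[i] for row in c)
    return c[i][i] / col_total if col_total else 0.0

def accuracy(c):
    """Sum of the diagonal divided by the total number of docs."""
    return sum(c[i][i] for i in range(len(c))) / sum(sum(row) for row in c)

per_class = {name: (precision(c, i), recall(c, i)) for i, name in enumerate(labels)}
```

For example, wheat has recall 0 in this matrix because none of its documents land on the diagonal.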

Page 22: Micro- vs. Macro-Averaging

• If we have more than one class, how do we combine multiple performance measures into one quantity?
• Macroaveraging: Compute performance for each class, then average.
• Microaveraging: Collect decisions for all classes, compute contingency table, evaluate.

Sec. 15.2.4

Page 23: Micro- vs. Macro-Averaging: Example

Class 1
                  Truth: yes   Truth: no
Classifier: yes   10           10
Classifier: no    10           970

Class 2
                  Truth: yes   Truth: no
Classifier: yes   90           10
Classifier: no    10           890

Micro Ave. Table
                  Truth: yes   Truth: no
Classifier: yes   100          20
Classifier: no    20           1860

Sec. 15.2.4

• Macroaveraged precision: (0.5 + 0.9)/2 = 0.7
• Microaveraged precision: 100/120 = .83
• Microaveraged score is dominated by score on common classes
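The two averages in this example can be reproduced from the per-class tables. A minimal sketch, with the dictionary layout and names chosen for illustration:

```python
# Per-class contingency tables from the example.
tables = {
    "Class 1": {"tp": 10, "fp": 10, "fn": 10, "tn": 970},
    "Class 2": {"tp": 90, "fp": 10, "fn": 10, "tn": 890},
}

def precision(table):
    return table["tp"] / (table["tp"] + table["fp"])

# Macroaveraging: compute precision per class, then average.
macro_precision = sum(precision(t) for t in tables.values()) / len(tables)

# Microaveraging: pool the counts into one table, then compute precision.
pooled = {cell: sum(t[cell] for t in tables.values()) for cell in ("tp", "fp", "fn", "tn")}
micro_precision = precision(pooled)
```

Class 2 contributes 100 of the 120 pooled "yes" decisions, which is why the microaverage sits close to that class's precision of 0.9.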

Page 24: Development Test Sets and Cross-validation

• Metric: P/R/F1 or Accuracy
• Unseen test set
  • avoid overfitting ('tuning to the test set')
  • more conservative estimate of performance
• Cross-validation over multiple splits
  • Handle sampling errors from different datasets
  • Pool results over each split
  • Compute pooled dev set performance

[Figure: one split into Training Set, Development Test Set, and Test Set; and cross-validation splits in which the Dev Test portion moves within the Training Set while the Test Set stays held out.]
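A rough sketch of cross-validation with pooled dev-set results, covering only the rotation of the dev portion within the training data (the held-out test set is not touched). The contiguous fold layout, the majority-class stand-in classifier, and the toy labels are all assumptions made for illustration.

```python
from collections import Counter

def k_fold_splits(n_items, k):
    """Yield (train_indices, dev_indices) for k contiguous folds."""
    start = 0
    for fold in range(k):
        size = n_items // k + (1 if fold < n_items % k else 0)
        dev = list(range(start, start + size))
        train = list(range(start)) + list(range(start + size, n_items))
        yield train, dev
        start += size

def majority_baseline(train_labels, n_dev):
    """Hypothetical stand-in classifier: always predict the most common training label."""
    most_common_label = Counter(train_labels).most_common(1)[0][0]
    return [most_common_label] * n_dev

def pooled_dev_accuracy(labels, k):
    """Pool correct/incorrect decisions over all splits, then compute accuracy once."""
    correct = 0
    for train, dev in k_fold_splits(len(labels), k):
        predictions = majority_baseline([labels[i] for i in train], len(dev))
        correct += sum(pred == labels[i] for pred, i in zip(predictions, dev))
    return correct / len(labels)

# Hypothetical labels for ten training documents.
labels = ["pos", "pos", "neg", "pos", "pos", "neg", "pos", "pos", "neg", "pos"]
splits = list(k_fold_splits(len(labels), 5))
acc = pooled_dev_accuracy(labels, 5)
```

Every document serves as dev data exactly once, so the pooled figure is computed over the whole training set rather than over one small dev slice.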