text classification 1 - sameer singhsameersingh.org/courses/statnlp/wi17/slides/lecture-0112... ·...

Post on 29-Jul-2018

235 Views

Category:

Documents

1 Downloads

Preview:

Click to see full reader

TRANSCRIPT

TextClassification1

Prof.SameerSinghCS295:STATISTICALNLP

WINTER2017

January12,2017

BasedonslidesfromNathanSchneider,NoahSmith,DanKleinandeveryoneelsetheycopiedfrom.

TextClassification1

CS295:STATISTICALNLP(WINTER2017) 2

IntroductiontoTextClassification

NaiveBayesClassification

CourseProjects

TextClassification

CS295:STATISTICALNLP(WINTER2017) 3

IntroductiontoTextClassification

NaiveBayesClassification

CourseProjects

SentimentAnalysis

CS295:STATISTICALNLP(WINTER2017) 4

Filledwithhorrificdialogue,laughablecharacters,alaughableplot,adreallynointerestingstakesduringthisfilm,"StarWarsEpisodeI:ThePhantomMenace"isnotatallwhatIwantedfromafilmthatissupposedtobethehugeopeningtothesegueintothefantasticOriginalTrilogy.Thepositivesincludethescore,thesound…

OtherExamples

CS295:STATISTICALNLP(WINTER2017) 5

• Reviewsoffilms,restaurants,products:positivevs.negative• Amazonreviewsdata,IMDBreviewsdata

• Library-likesubjects(e.g.,theDeweydecimalsystem)• Newsstories:politicsvs.sportsvs.businessvs.technology...

• 20newsgroupdata• Authorattributes:identity,politicalstance,gender,age,...• Email:spamvs.not

• Gmail:important,promotion,updates,socialmedia,…• Whatisthereadinglevelofapieceoftext?

• Automaticgraders?• Howinfluentialwillascientificpaperbe?• Advertisementrecommendations…• Willapieceofproposedlegislationpass?

• Identifythepresidentialcandidatefromspeeches• Postrecommendations/Fakenewsdetection

• Canmajorlyinfluencetheworld!

FormalSetup

CS295:STATISTICALNLP(WINTER2017) 6

Classification

SupervisedLearning

TrainingAlgorithm

Evaluation:ContingencyTable

CS295:STATISTICALNLP(WINTER2017) 7

Accuracy

CS295:STATISTICALNLP(WINTER2017) 8

Problem

• Classimbalancehurts..• Gettingoneclassrightmattersmorethantheother(retrieval)

PrecisionandRecall

CS295:STATISTICALNLP(WINTER2017) 9

>2Classes?

CS295:STATISTICALNLP(WINTER2017) 10

Macro-averagedMeasures

Micro-averagedMeasures

StatisticalSignificance

CS295:STATISTICALNLP(WINTER2017) 11

McNemar’s Test,Psychometrika, (1947)MoretestsinSmithbook,appendixB

TextClassification

CS295:STATISTICALNLP(WINTER2017) 12

IntroductiontoTextClassification

NaiveBayesClassification

CourseProjects

ClassificationusingJointProb

CS295:STATISTICALNLP(WINTER2017) 13

NaïveBayesClassifier

CS295:STATISTICALNLP(WINTER2017) 14

Twoassumptions

• Wordorderingdoesnotmatter(BagofWords)

NaïveBayesClassifier

CS295:STATISTICALNLP(WINTER2017) 15

Twoassumptions

• Wordorderingdoesnotmatter (BagofWords)• Wordsareindependentgivencategory

EstimationofParameters

CS295:STATISTICALNLP(WINTER2017) 16

ProblemwithNaïveBayes

CS295:STATISTICALNLP(WINTER2017) 17

LinearModels

CS295:STATISTICALNLP(WINTER2017) 18

NaïveBayesasaLinearModel

CS295:STATISTICALNLP(WINTER2017) 19

TextClassification

CS295:STATISTICALNLP(WINTER2017) 20

IntroductiontoTextClassification

NaiveBayesClassification

CourseProjects

GroupProjects

CS295:STATISTICALNLP(WINTER2017) 21

• Idealteamsizeis3• Absolutemaximumof4• <3ifIapprove(ongoingwork)

GroupsfortheProject

• Firsttworeportsareveryshort(1page)• Finalreportmattersthemost

SubmitFourReports

• Outputisanyphraseorsentence,definitely!• Inputisanyphraseorsentence

• Outputisasequenceorstructure(yes!)• Classification:onlyifoverwordsorphrases

• Outputislinguisticclasses/structures(yes!)

HowdoIknowit’sNLP?

ScopeofWork

CS295:STATISTICALNLP(WINTER2017) 22

• NewTask/Data• NewMethod/Models• NewApplicationofExistingMethodtoExistingTask

Novelty

• Youdonothavemuchtime!• Aimtohavethewholepipelinedonesoon• Keepthe“scale”ofthedatasmall,sub-sampleifneeded• Bettertohaveacompletefinishedreport

• thangrandideasthatdidnotwork

Butnottoomuch!

• Youdonothavetocodeeverything• Exploitexistingcode,datasets,libraries,webservices• Donotreinventallthewheels!

Reuse

Example1:What’stheword..

CS295:STATISTICALNLP(WINTER2017) 23

What’sthewordforsomeoneusingpretentiouswords? lexiphanic

MachineLearning(LSTM)

definitionofawordfromthedictionary theworditself

ThiscanbeacoolTwitterbot!

• Accuracyofguessingtheword,usingdefinitionsfromdifferentdictionary?

• Baselines:Google,reversedictionary.org,…Evaluation

Example2:SQuAD

CS295:STATISTICALNLP(WINTER2017) 24

https://rajpurkar.github.io/SQuAD-explorer/

Tesla was the fourth of five children. He had an older brother named Dane and three sisters, Milka, Angelina and Marica. Dane was killed in a horse-riding accident when Nikola was five. In 1861, Tesla attended the "Lower" or "Primary" School in Smiljan where he studied German, arithmetic, and religion. In 1862, the Tesla family moved to Gospić, Austrian Empire, where Tesla's father worked as a pastor. Nikola completed "Lower" or "Primary" School, followed by the "Lower Real Gymnasium" or "Normal School."

How many siblings did Tesla have?fourWhat was Tesla’s brother’s name?DaneWhat happened to Dane?killed in a horse-riding accident

DatasetsandPapers

CS295:STATISTICALNLP(WINTER2017) 25

• SearchKaggle,Quora,etc forlargetextdatasets• SeerecentpapersinNLPforreleaseddatasets• Lookfor“sharedtasks”,“challenges”,workshops• Linkstosomeexistingdatasetscomingtowebsitesoon

Data

• NLPConferences:ACL,EMNLP,NAACL• MLConferences:NIPS,ICML,ICLR,AAAI• Datafocusedvenues:TREC/TAC,SemEval,CONLL• Workshopsattheseconferences:interestingdirections• Morepaperscomingsoontothewebsite

Papers

WritingthePitch

CS295:STATISTICALNLP(WINTER2017) 26

• Teamnameandmembers• Singlesentencedescriptionforeachmember

• (approximately)whattheywilldo• Singlesentenceonwhatmakesyourteamdiverse

Team

• MotivationandProblemDescription• Plannedapproach:tentative• Evaluation:usually,mostimportant

Project

• If1or2,meetmebefore/on January17(o.w.noneed)• Everygrouphastomeetafterwardstodiscusstheproject

Appointment

Upcoming…

CS295:STATISTICALNLP(WINTER2017) 27

• Homework1isup!• Nextlectureswillcontinuewithmoredetails• SignupfortheKaggle account(@uci.edu email)• Due:January26,2017

Homework

• ProjectpitchisdueJanuary23,2017!• Startassemblingteamsnow!(usePiazza)• Startlookingatpapers,data,etc.forideas

Project

top related