intro to information extraction -sentiment lexicon...

31
Intro to Information Extraction -Sentiment Lexicon Induction as an example Many slides adapted from Ellen Riloff and Dan Jurafsky

Upload: others

Post on 22-Mar-2020

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Intro to Information Extraction -Sentiment Lexicon ...faculty.cse.tamu.edu/huangrh/Spring18/l16_ie-overview_sentiment_lexicon.pdf · tasks that involve extracting pieces of information

IntrotoInformationExtraction-SentimentLexiconInductionasan

example

Many slides adapted from Ellen Riloff and Dan Jurafsky

Page 2: Intro to Information Extraction -Sentiment Lexicon ...faculty.cse.tamu.edu/huangrh/Spring18/l16_ie-overview_sentiment_lexicon.pdf · tasks that involve extracting pieces of information

WhatisInformationExtraction?• Informationextraction(IE)isanumbrellatermforNLPtasksthatinvolveextractingpiecesofinformationfromtextandassigningsomemeaningtotheinformation.

• ManyIEapplicationsaimtoturnunstructuredtextintoa“structured” representation.

• IEproblemstypicallyinvolve:– identifyingtextsnippetstoextract

– assigningsemanticmeaningtoentitiesorconcepts

– findingrelationsbetweenentitiesorconcepts

Page 3: Intro to Information Extraction -Sentiment Lexicon ...faculty.cse.tamu.edu/huangrh/Spring18/l16_ie-overview_sentiment_lexicon.pdf · tasks that involve extracting pieces of information

IEApplications• BiologicalProcesses(Genomics)

• ClinicalMedicine

• QuestionAnswering/WebSearch

• QueryExpansion/SemanticSets

• ExtractingEntityProfiles

• TrackingEvents(Violent,Diseases,Business,etc)

• TrackingOpinions(Political,ProductReputation,FinancialPrediction,On-lineReviews,etc.)

Page 4: Intro to Information Extraction -Sentiment Lexicon ...faculty.cse.tamu.edu/huangrh/Spring18/l16_ie-overview_sentiment_lexicon.pdf · tasks that involve extracting pieces of information

GeneralTechniques• SyntacticAnalysis

– PhraseIdentification

– FeatureExtraction

• SemanticAnalysis

• StatisticalMeasures

• MachineLearning

– Supervised&WeaklySupervised

• GraphAlgorithms

Page 5: Intro to Information Extraction -Sentiment Lexicon ...faculty.cse.tamu.edu/huangrh/Spring18/l16_ie-overview_sentiment_lexicon.pdf · tasks that involve extracting pieces of information

NamedEntityRecognition(NER)

MarsOneannouncedMonday thatithaspicked1,058aspiringspaceflyerstomoveontothenextroundinitssearchforthefirsthumanstoliveanddieontheRedPlanet.

TheWallStreetJournalreportsthatGoogle planstopartnerwithToyotatodevelopAndroidsoftwarefortheirhybridcars.

NERtypicallyinvolvesextractingandlabelingcertaintypesofentities,suchaspropernamesanddates.

Page 6: Intro to Information Extraction -Sentiment Lexicon ...faculty.cse.tamu.edu/huangrh/Spring18/l16_ie-overview_sentiment_lexicon.pdf · tasks that involve extracting pieces of information

Domain-specificNER

IL−2 gene expressionand NFkappaB activationthrough CD28 requiresreactiveoxygenproductionby 5lipoxygenase.

Biomedicalsystemsmustrecognizegenesandproteins:

Adrenal-Sparingsurgeryissafeandeffective,andmaybecomethetreatmentofchoiceinpatientswithhereditaryphaeochromocytoma.

Clinicalmedicalsystemsmustrecognizeproblemsandtreatments:

Page 7: Intro to Information Extraction -Sentiment Lexicon ...faculty.cse.tamu.edu/huangrh/Spring18/l16_ie-overview_sentiment_lexicon.pdf · tasks that involve extracting pieces of information

SemanticClassIdentification

MarsOneannouncedMondaythatithaspicked1,058aspiringspaceflyerstomoveontothenextroundinitssearchforthefirsthumanstoliveanddieontheRedPlanet.

TheWallStreetJournalreportsthatGoogleplanstopartnerwithToyotatodevelopAndroidsoftwarefortheirhybridcars.

Page 8: Intro to Information Extraction -Sentiment Lexicon ...faculty.cse.tamu.edu/huangrh/Spring18/l16_ie-overview_sentiment_lexicon.pdf · tasks that involve extracting pieces of information

SemanticLexiconInduction• Althoughsomegeneralsemanticdictionariesexist(e.g.,WordNet),domain-specificapplicationsoftenhavespecializedvocabulary.

• SemanticLexiconInductiontechniqueslearnlistsofwordsthatbelongtoasemanticclass.

Vehicles:car,jeep,helicopter,bike,tricycle,scooter,…

Animal:tiger,zebra,wolverine,platypus,echidna,…

Symptoms:cough,sneeze,pain,pu/pd,elevatedbp,…

Products:camera,laptop,iPad,tablet,GPSdevice,…

Page 9: Intro to Information Extraction -Sentiment Lexicon ...faculty.cse.tamu.edu/huangrh/Spring18/l16_ie-overview_sentiment_lexicon.pdf · tasks that involve extracting pieces of information

Domain-specificVocabulary

A 14yo m/n doxy owned by a reputable breeder is being treated for IBD with pred.

doxy

predIBDbreeder

ANIMALHUMANDISEASEDRUG

Domain-specific meanings: lab, mix, m/n = ANIMAL

Page 10: Intro to Information Extraction -Sentiment Lexicon ...faculty.cse.tamu.edu/huangrh/Spring18/l16_ie-overview_sentiment_lexicon.pdf · tasks that involve extracting pieces of information

SemanticTaxonomyInduction• Ideally,wewantsemanticconceptstobeorganizedinataxonomy,tosupportgeneralizationbuttodistinguishdifferentsubtypes.

AnimalMammal

FelineLion,PantheraLeoTiger,PantheraTigris,FelisTigrisCougar,MountainLion,Puma,Panther,Catamount

CanineWolf,CanisLupusCoyote,PrairieWolf,BrushWolf,AmericanJackalDog,Puppy,CanisLupusFamiliaris,Mongrel

Page 11: Intro to Information Extraction -Sentiment Lexicon ...faculty.cse.tamu.edu/huangrh/Spring18/l16_ie-overview_sentiment_lexicon.pdf · tasks that involve extracting pieces of information

ChallengesinTaxonomyInduction• Butthereareoftenmanywaystoorganizeaconceptualspace!

• Stricthierarchiesarerareinrealdata– graphs/networksaremorerealisticthantreestructures.

• Forexample,animalscouldbesubcategorizedbasedon:– carnivorevs.herbivore– water-dwellingvs.land-dwelling– wildvs.petsvs.agricultural– physicalcharacteristics(e.g.,baleenvs.toothedwhales)– habitat(e.g.,arcticvs.desert)

Page 12: Intro to Information Extraction -Sentiment Lexicon ...faculty.cse.tamu.edu/huangrh/Spring18/l16_ie-overview_sentiment_lexicon.pdf · tasks that involve extracting pieces of information

RelationExtraction

InSalzburg,littleMozart grewupinalovingmiddle-classenvironment.

Birthplace(Mozart,Salzburg)

SteveBallmer isanAmericanbusinessmanwhohasbeenservingastheCEOofMicrosoft sinceJanuary2000

Employed-By(SteveBallmer,Microsoft)CEO(SteveBallmer,Microsoft)

Page 13: Intro to Information Extraction -Sentiment Lexicon ...faculty.cse.tamu.edu/huangrh/Spring18/l16_ie-overview_sentiment_lexicon.pdf · tasks that involve extracting pieces of information

RelationsforWebSearch

Page 14: Intro to Information Extraction -Sentiment Lexicon ...faculty.cse.tamu.edu/huangrh/Spring18/l16_ie-overview_sentiment_lexicon.pdf · tasks that involve extracting pieces of information

Paraphrasing• Relationscanoftenbeexpressedwithamultitudeofdifferenceexpressions.

• Paraphrasingsystemstrytoexplicitlylearnphrasesthatrepresentthesametypeofrelation.

• Examples:– XwasborninY

– YisthebirthplaceofX

– X’sbirthplaceisY

– X’shometownisY

– XgrewupinY

Page 15: Intro to Information Extraction -Sentiment Lexicon ...faculty.cse.tamu.edu/huangrh/Spring18/l16_ie-overview_sentiment_lexicon.pdf · tasks that involve extracting pieces of information

15

Event ExtractionGoal: extract facts about events from unstructured documents

December 29, Pakistan - The U.S. embassy in Islamabad was damaged this morning by a car bomb. Three diplomats were injured in the explosion. Al Qaeda has claimed responsibility for the attack.

EVENTType: bombingTarget: U.S. embassyLocation: Islamabad,

PakistanDate: December 29Weapon: car bombVictim: three diplomatsPerpetrator: Al Qaeda

Example: extracting information about terrorism events in news articles:

Page 16: Intro to Information Extraction -Sentiment Lexicon ...faculty.cse.tamu.edu/huangrh/Spring18/l16_ie-overview_sentiment_lexicon.pdf · tasks that involve extracting pieces of information

EventExtraction

Document TextNew Jersey, February, 26. An outbreak of swine flu has been confirmed in Mercer County, NJ. Five teenage boysappear to have contracted the deadly virus from an unknown source. The CDC is investigating the cases and is taking measures to prevent the spread. . .

EventDisease: swine fluLocation: Mercer County, NJVictim: Five teenage boysDate: February 26Status: confirmed

Anotherexample:extractinginformationaboutdiseaseoutbreakevents.

Page 17: Intro to Information Extraction -Sentiment Lexicon ...faculty.cse.tamu.edu/huangrh/Spring18/l16_ie-overview_sentiment_lexicon.pdf · tasks that involve extracting pieces of information

Large-ScaleIEfromtheWeb

• SomeresearchershavebeendevelopingIEsystemsforlarge-scaleextractionoffactsandrelationsfromtheWeb.

• ThesesystemsexploitthemassiveamountoftextandredundancyavailableontheWebanduseweaklysupervised,iterativelearningtoharvestinformationforautomatedknowledgebaseconstruction.

• TheKnowItAllprojectatUWandNELLprojectatCMUarewell-knownresearchgroupspursuingthiswork.

Page 18: Intro to Information Extraction -Sentiment Lexicon ...faculty.cse.tamu.edu/huangrh/Spring18/l16_ie-overview_sentiment_lexicon.pdf · tasks that involve extracting pieces of information
Page 19: Intro to Information Extraction -Sentiment Lexicon ...faculty.cse.tamu.edu/huangrh/Spring18/l16_ie-overview_sentiment_lexicon.pdf · tasks that involve extracting pieces of information

OpinionExtraction

Source:<writer>Target:PowershotAspect:pictures,colorsEvaluation:beautiful,easytogrip

IjustboughtaPowershotafewdaysago.Itooksomepicturesusingthecamera.Colorsaresobeautifulevenwhenflashisused.Alsoeasytogripsincethebodyhasagriphandle.[Kobayashietal.,2007]

Page 20: Intro to Information Extraction -Sentiment Lexicon ...faculty.cse.tamu.edu/huangrh/Spring18/l16_ie-overview_sentiment_lexicon.pdf · tasks that involve extracting pieces of information

OpinionExtractionfromNews[Wilson&Wiebe,2009]

Source:ItaliansenatorRenzoGubertTarget:theChineseGovernmentEvaluation:praisedPOSITIVE

ItaliansenatorRenzoGubertpraisedtheChineseGovernment’sefforts.

AfricanobserversgenerallyapprovedofhisvictorywhileWesterngovernmentsdenouncedit.

Source:AfricanobserversTarget:hisvictoryEvaluation:approvedPOSITIVE

Source:WesterngovernmentsTarget:it(hisvictory)Evaluation:denouncedNEGATIVE

Page 21: Intro to Information Extraction -Sentiment Lexicon ...faculty.cse.tamu.edu/huangrh/Spring18/l16_ie-overview_sentiment_lexicon.pdf · tasks that involve extracting pieces of information

Summary• Informationextractionsystemsfrequentlyrelyonlow-levelNLPtoolsforbasiclanguageanalysis,ofteninapipelinearchitecture.

• ThereareawidevarietyofapplicationsforIE,includingbothbroad-coverageanddomain-specificapplications.

• SomeIEtasksarerelativelywell-understood(e.g.,namedentityrecognition),whileothersarestillquitechallenging!

• We’veonlyscratchedthesurfaceofpossibleIEtasks…nearlyendlesspossibilities.

Page 22: Intro to Information Extraction -Sentiment Lexicon ...faculty.cse.tamu.edu/huangrh/Spring18/l16_ie-overview_sentiment_lexicon.pdf · tasks that involve extracting pieces of information

TurneyAlgorithmtolearnaSentimentLexicon

1. Extractaphrasallexiconfromreviews

2. Learnpolarityofeachphrase

3. Rateareviewbytheaveragepolarityofitsphrases

22

Turney (2002): Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews

Page 23: Intro to Information Extraction -Sentiment Lexicon ...faculty.cse.tamu.edu/huangrh/Spring18/l16_ie-overview_sentiment_lexicon.pdf · tasks that involve extracting pieces of information

Extracttwo-wordphraseswithadjectives

FirstWord SecondWord ThirdWord (notextracted)

JJ NNorNNS anythingRB, RBR,RBS JJ NotNNnorNNS

JJ JJ NotNNnorNNSNNorNNS JJ NotNNnor NNS

RB,RBR,orRBS

VB,VBD,VBN,VBG

anything

23

Page 24: Intro to Information Extraction -Sentiment Lexicon ...faculty.cse.tamu.edu/huangrh/Spring18/l16_ie-overview_sentiment_lexicon.pdf · tasks that involve extracting pieces of information

Howtomeasurepolarityofaphrase?• Positivephrasesco-occurmorewith“excellent”->”great”

• Negativephrasesco-occurmorewith“poor”

• Buthowtomeasureco-occurrence?

24

Page 25: Intro to Information Extraction -Sentiment Lexicon ...faculty.cse.tamu.edu/huangrh/Spring18/l16_ie-overview_sentiment_lexicon.pdf · tasks that involve extracting pieces of information

PointwiseMutualInformation

• Pointwise mutualinformation:

– Howmuchmoredoeventsxandyco-occurthaniftheywereindependent?

PMI(X,Y ) = log2P(x,y)P(x)P(y)

Page 26: Intro to Information Extraction -Sentiment Lexicon ...faculty.cse.tamu.edu/huangrh/Spring18/l16_ie-overview_sentiment_lexicon.pdf · tasks that involve extracting pieces of information

PointwiseMutualInformation

• Pointwise mutualinformation:

– Howmuchmoredoeventsxandyco-occurthaniftheywereindependent?

• PMIbetweentwowords:

– Howmuchmoredotwowordsco-occurthaniftheywereindependent?PMI(word1,word2 ) = log2

P(word1,word2)P(word1)P(word2)

PMI(X,Y ) = log2P(x,y)P(x)P(y)

Page 27: Intro to Information Extraction -Sentiment Lexicon ...faculty.cse.tamu.edu/huangrh/Spring18/l16_ie-overview_sentiment_lexicon.pdf · tasks that involve extracting pieces of information

HowtoEstimatePointwiseMutualInformation

– Querysearchengine(Altavista)

• P(word)estimatedbyhits(word)/N

• P(word1,word2)byhits(word1 NEAR word2)/N– (MorecorrectlythebigramdenominatorshouldbekN,becausethereareatotalofNconsecutivebigrams(word1,word2),butkN bigramsthatarekwordsapart,butwejustuseNontherestofthisslideandthenext.)

PMI(word1,word2 ) = log2

1Nhits(word1 NEAR word2)

1Nhits(word1) 1

Nhits(word2)

Page 28: Intro to Information Extraction -Sentiment Lexicon ...faculty.cse.tamu.edu/huangrh/Spring18/l16_ie-overview_sentiment_lexicon.pdf · tasks that involve extracting pieces of information

Doesphraseappearmorewith“poor” or“excellent”?

28

Polarity(phrase) = PMI(phrase,"excellent")−PMI(phrase,"poor")

= log2hits(phrase NEAR "excellent")hits("poor")hits(phrase NEAR "poor")hits("excellent")!

"#

$

%&

= log2hits(phrase NEAR "excellent")

hits(phrase)hits("excellent")hits(phrase)hits("poor")

hits(phrase NEAR "poor")

= log2

1N hits(phrase NEAR "excellent")1N hits(phrase) 1

N hits("excellent")− log2

1N hits(phrase NEAR "poor")1N hits(phrase) 1

N hits("poor")

Page 29: Intro to Information Extraction -Sentiment Lexicon ...faculty.cse.tamu.edu/huangrh/Spring18/l16_ie-overview_sentiment_lexicon.pdf · tasks that involve extracting pieces of information

Phrasesfromathumbs-upreview

29

Phrase POStags

Polarity

onlineservice JJNN 2.8

onlineexperience JJNN 2.3

directdeposit JJNN 1.3

localbranch JJNN 0.42…

lowfees JJNNS 0.33

trueservice JJNN -0.73

otherbank JJNN -0.85

inconvenientlylocated

JJNN -1.5

Average 0.32

Page 30: Intro to Information Extraction -Sentiment Lexicon ...faculty.cse.tamu.edu/huangrh/Spring18/l16_ie-overview_sentiment_lexicon.pdf · tasks that involve extracting pieces of information

Phrasesfromathumbs-downreview

30

Phrase POStags

Polarity

directdeposits JJNNS 5.8

onlineweb JJNN 1.9

veryhandy RBJJ 1.4…

virtualmonopoly JJNN -2.0

lesserevil RBRJJ -2.3

otherproblems JJNNS -2.8

lowfunds JJNNS -6.8

unethicalpractices

JJNNS -8.5

Average -1.2

Page 31: Intro to Information Extraction -Sentiment Lexicon ...faculty.cse.tamu.edu/huangrh/Spring18/l16_ie-overview_sentiment_lexicon.pdf · tasks that involve extracting pieces of information

ResultsofTurneyalgorithm• 410reviewsfromEpinions

– 170(41%)negative

– 240(59%)positive

• Majorityclassbaseline:59%

• Turney algorithm:74%

• Phrasesratherthanwords

• Learnsdomain-specificinformation31