Natural Language Inference, Reading Comprehension and Deep Learning
Christopher Manning  @chrmanning • @stanfordnlp
Stanford University, SIGIR 2016
Machine Comprehension
Tested by question answering (Burges):
"A machine comprehends a passage of text if, for any question regarding that text that can be answered correctly by a majority of native speakers, that machine can provide a string which those speakers would agree both answers that question, and does not contain information irrelevant to that question."
IR needs language understanding
• There were some things that kept IR and NLP apart
  • IR was heavily focused on efficiency and scale
  • NLP was way too focused on form rather than meaning
• Now there are compelling reasons for them to come together
  • Taking IR precision and recall to the next level
    • [car parts for sale]
    • Should match: Selling automobile and pickup engines, transmissions
    • Example from Jeff Dean's WSDM 2016 talk
  • Information retrieval / question answering in mobile contexts
    • Web snippets no longer cut it on a watch!
Menu
1. Natural logic: A weak logic over human languages for inference
2. Distributed word representations
3. Deep, recursive neural network language understanding
How can information retrieval be viewed more as theorem proving (than matching)?
AI2 4th Grade Science Question Answering [Angeli, Nayak, & Manning, ACL 2016]
Our "knowledge":
Ovaries are the female part of the flower, which produces eggs that are needed for making seeds.
The question:
Which part of a plant produces the seeds?
The answer choices:
the flower / the leaves / the stem / the roots
How can we represent and reason with broad-coverage knowledge?
1. Rigid-schema knowledge bases with well-defined logical inference
2. Open-domain knowledge bases (OpenIE) – no clear ontology or inference [Etzioni et al. 2007ff]
3. Human language text KB – no rigid schema, but with "natural logic" we can do formal inference over human language text [MacCartney and Manning 2008]
Natural Language Inference [Dagan 2005, MacCartney & Manning 2009]
Does a piece of text follow from or contradict another?
Two senators received contributions engineered by lobbyist Jack Abramoff in return for political favors.
Jack Abramoff attempted to bribe two legislators.
Here try to prove or refute according to a large text collection:
1. The flower of a plant produces the seeds
2. The leaves of a plant produces the seeds
3. The stem of a plant produces the seeds
4. The roots of a plant produces the seeds
→ Follows (for choice 1: the flower)
Text as Knowledge Base
Storing knowledge as text is easy!
Doing inferences over text might be hard
Don't want to run inference over every fact!
Don't want to store all the inferences!
Inferences … on demand from a query … [Angeli and Manning 2014]
… using text as the meaning representation
Natural Logic: logical inference over text
We are doing logical inference
The cat ate a mouse ⊨ ¬ No carnivores eat animals
We do it with natural logic
If I mutate a sentence in this way, do I preserve its truth?
Post-Deal Iran Asks if U.S. Is Still 'Great Satan,' or Something Less ⊨ A Country Asks if U.S. Is Still 'Great Satan,' or Something Less
• A sound and complete weak logic [Icard and Moss 2014]
• Expressive for common human inferences*
• "Semantic" parsing is just syntactic parsing
• Tractable: polynomial-time entailment checking
• Plays nicely with lexical matching back-off methods
#1. Common sense reasoning
Polarity in Natural Logic
We order phrases in partial orders
Simplest one: is-a-kind-of
Also: geographical containment, etc.
Polarity: In a certain context, is it valid to move up or down in this order?
Example inferences
Quantifiers determine the polarity of phrases
Valid mutations consider polarity
Successful toy inference:
• All cats eat mice ⊨ All house cats consume rodents
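As a concrete illustration of how quantifier polarity licenses mutations, here is a minimal, hypothetical Python sketch (not the NaturalLI implementation): it encodes a tiny is-a-kind-of order and the monotonicity of "all" (downward in its restrictor, upward in its body), and checks the toy inference above.

```python
# Toy polarity check for "All X eat Y"-style sentences (illustrative only).
# is-a-kind-of facts: child -> parent in the specialization order
ISA = {
    "house cat": "cat",
    "cat": "carnivore",
    "mouse": "rodent",
    "rodent": "animal",
    "eat": "consume",          # treat verbs in the same order for simplicity
}

def is_a(x, y):
    """True if x is (transitively) a kind of y, or x == y."""
    while x is not None:
        if x == y:
            return True
        x = ISA.get(x)
    return False

def all_entails(prem, hyp):
    """'All A V B' entails 'All A' V' B'' if A' is more specific (downward
    monotone position) and V', B' are more general (upward monotone)."""
    (a, v, b), (a2, v2, b2) = prem, hyp
    return is_a(a2, a) and is_a(v, v2) and is_a(b, b2)

premise    = ("cat", "eat", "mouse")             # All cats eat mice
hypothesis = ("house cat", "consume", "rodent")  # All house cats consume rodents
print(all_entails(premise, hypothesis))          # True
counterexample = ("animal", "eat", "mouse")      # All animals eat mice: NOT entailed
print(all_entails(premise, counterexample))      # False
```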
“Soft” Natural Logic
• We also want to make likely (but not certain) inferences
  • Same motivation as Markov logic, probabilistic soft logic, etc.
• Each mutation edge template feature has a cost θᵢ ≥ 0
• Cost of an edge is θᵢ · fᵢ; cost of a path is θ · f
• Can learn parameters θ
• Inference is then graph search (see the sketch below)
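To make the "inference is graph search" point concrete, here is a minimal, hypothetical sketch (uniform-cost search over mutation edges, not the actual NaturalLI search), assuming each edge template fires a feature with a learned non-negative cost weight.

```python
import heapq

def natural_logic_search(query, kb_facts, mutations, theta, max_cost=10.0):
    """Uniform-cost search from a query sentence toward any known fact.

    mutations(sentence) yields (next_sentence, feature_id, feature_value)
    for each licensed mutation; theta maps feature_id -> cost weight >= 0.
    Returns the cheapest path cost to a fact in kb_facts, or None.
    """
    frontier = [(0.0, query)]
    best = {query: 0.0}
    while frontier:
        cost, sent = heapq.heappop(frontier)
        if sent in kb_facts:
            return cost                               # found a supporting premise
        if cost > best.get(sent, float("inf")) or cost > max_cost:
            continue
        for nxt, feat, val in mutations(sent):
            new_cost = cost + theta[feat] * val       # cost of one mutation edge
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                heapq.heappush(frontier, (new_cost, nxt))
    return None
```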
#2. Dealing with real sentences
Natural logic works with facts like these in the knowledge base:
Obama was born in Hawaii
But real-world sentences are complex and long:
Born in Honolulu, Hawaii, Obama is a graduate of Columbia University and Harvard Law School, where he served as president of the Harvard Law Review.
Approach:
1. A classifier divides long sentences into entailed clauses
2. Natural logic inference can shorten these clauses
Universal Dependencies (UD)
http://universaldependencies.github.io/docs/
A single level of typed dependency syntax that
(i) works for all human languages
(ii) gives a simple, human-friendly representation of sentences
Dependency syntax is better than a phrase-structure tree for machine interpretation – it's almost a semantic network
UD aims to be linguistically better across languages than earlier representations, such as CoNLL dependencies
Generation of minimal clauses
1. Classification problem: given a dependency edge, does it introduce a clause?
2. Is it missing a controlled subject from the subj/object?
3. Shorten clauses while preserving validity, using natural logic!
   • All young rabbits drink milk ⊭ All rabbits drink milk
   • OK: SJC, the Bay Area's third largest airport, often experiences delays due to weather.
   • Often better: SJC often experiences delays.
#3. Add a lexical alignment classifier
• Sometimes we can't quite make the inferences that we would like to make
• We use a simple lexical match back-off classifier with features:
  • Matching words, mismatched words, unmatched words
• These always work pretty well
  • This was the lesson of RTE evaluations, and perhaps of IR in general
The full system
• We run our usual search over split-up, shortened clauses
• If we find a premise, great!
• If not, we use the lexical classifier as an evaluation function
• We work to do this quickly at scale
  • Visit 1M nodes/second; don't re-featurize, just compute deltas
  • 32-byte search states (thanks Gabor!)
Solving NY State 4th grade science (Allen AI Institute datasets)
Multiple choice questions from real 4th grade science exams
Which activity is an example of a good health habit?
(A) Watching television  (B) Smoking cigarettes  (C) Eating candy  (D) Exercising every day
In our corpus knowledge base:
• Plasma TV's can display up to 16 million colors ... great for watching TV ... also make a good screen.
• Not smoking or drinking alcohol is good for health, regardless of whether clothing is worn or not.
• Eating candy for diner is an example of a poor health habit.
• Healthy is exercising
Solving 4th grade science (Allen AI NDMC)
System                                                  Dev   Test
KnowBot [Hixon et al. NAACL 2015]                       45    –
KnowBot (augmented with human in loop)                  57    –
IR baseline (Lucene)                                    49    42
NaturalLI                                               52    51
More data + IR baseline                                 62    58
More data + NaturalLI                                   65    61
NaturalLI + lexical classifier                          74    67
Aristo [Clark et al. 2016], 6 systems, even more data   –     71

Test set: New York Regents 4th Grade Science exam multiple-choice questions from AI2.
Training: Basic is Barron's study guide; more data is the SciText corpus from AI2.
Score: % correct
Natural Logic
• Can we just use text as a knowledge base?
• Natural logic provides a useful, formal (weak) logic for textual inference
• Natural logic is easily combinable with lexical matching methods, including neural net methods
• The resulting system is useful for:
  • Common-sense reasoning
  • Question answering
  • Open information extraction, i.e., getting out relation triples from text
Can information retrieval benefit from distributed representations of words?
From symbolic to distributed representations
The vast majority of rule-based or statistical NLP and IR work regarded words as atomic symbols: hotel, conference, walk
In vector space terms, this is a vector with one 1 and a lot of zeroes:
[0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
We now call this a "one-hot" representation.
From symbolic to distributed representations
Its problem:
• If a user searches for [Dell notebook battery size], we would like to match documents with "Dell laptop battery capacity"
• If a user searches for [Seattle motel], we would like to match documents containing "Seattle hotel"
But
motel [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]ᵀ
hotel [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0] = 0
Our query and document vectors are orthogonal
There is no natural notion of similarity in a set of one-hot vectors
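A tiny sketch of this orthogonality, assuming toy 15-dimensional one-hot vectors for "motel" and "hotel":

```python
import numpy as np

vocab_size = 15
motel = np.zeros(vocab_size); motel[10] = 1.0   # one-hot for "motel"
hotel = np.zeros(vocab_size); hotel[7]  = 1.0   # one-hot for "hotel"

# Dot product (and hence cosine similarity) of distinct one-hot vectors is 0:
print(motel @ hotel)   # 0.0 -- the query and document vectors are orthogonal
```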
Capturing similarity
There are many things you can do about similarity, many well known in IR:
• Query expansion with synonym dictionaries
• Learning word similarities from large corpora
But a word representation that encodes similarity wins:
• Fewer parameters to learn (per word, not per pair)
• More sharing of statistics
• More opportunities for multi-task learning
Distributional similarity-based representations
You can get a lot of value by representing a word by means of its neighbors
"You shall know a word by the company it keeps" (J. R. Firth 1957: 11)
One of the most successful ideas of modern NLP
   government debt problems turning into banking crises as has happened in
   saying that Europe needs unified banking regulation to replace the hodgepodge
↳ These context words will represent banking
Basic idea of learning neural network word embeddings
We define some model that aims to predict a word based on other words in its context
Choose  argmaxw  w · ((wj−1 + wj+1)/2)
which has a loss function, e.g.,
J = 1 − wj · ((wj−1 + wj+1)/2)
We look at many samples from a big language corpus
We keep adjusting the vector representations of words to minimize this loss
(Vectors are kept at unit norm)
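A minimal numpy sketch of this idea (a toy version of the objective above, not word2vec itself): predict a word from the average of its two neighbours, with loss J = 1 − wj · ((wj−1 + wj+1)/2), taking gradient steps and renormalizing to unit norm.

```python
import numpy as np

rng = np.random.default_rng(0)
corpus = "the cat sat on the mat while the dog sat on the rug".split()
vocab = {w: i for i, w in enumerate(dict.fromkeys(corpus))}
dim, lr = 25, 0.1
W = rng.normal(size=(len(vocab), dim))
W /= np.linalg.norm(W, axis=1, keepdims=True)    # unit-norm word vectors

for epoch in range(200):
    for j in range(1, len(corpus) - 1):
        c, l, r = vocab[corpus[j]], vocab[corpus[j-1]], vocab[corpus[j+1]]
        wc = W[c].copy()
        ctx = (W[l] + W[r]) / 2
        # J = 1 - w_j . ctx, so the gradients are -ctx and -w_j / 2
        W[c] += lr * ctx
        W[l] += lr * wc / 2
        W[r] += lr * wc / 2
        for i in (c, l, r):                      # project back to unit norm
            W[i] /= np.linalg.norm(W[i])
```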
With distributed, distributional representations, syntactic and semantic similarity is captured
currency = [0.286, 0.792, −0.177, −0.107, 0.109, −0.542, 0.349, 0.271]
Distributional representations can solve the fragility of NLP tools
Standard NLP systems – here, the Stanford Parser – are incredibly fragile because of symbolic representations
Crazy sentential complement, such as for "likes [(being) crazy]"
Distributional representations can capture the long tail of IR similarity
Google's RankBrain
Not necessarily as good for the head of the query distribution, but great for seeing similarity in the tail
3rd most important ranking signal (we're told…)
LSA: Count! models
• Factorize a (maybe weighted, often log-scaled) term-document (Deerwester et al. 1990) or word-context matrix (Schütze 1992) into UΣVᵀ
• Retain only k singular values, in order to generalize
[Cf. Baroni: Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. ACL 2014]
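A small sketch of the count-and-factorize recipe (illustrative; assumes a toy word–context count matrix rather than a real corpus): log-scale the counts, take the SVD, and keep only the top k singular values.

```python
import numpy as np

# Toy word-by-context count matrix (rows: words, columns: context words)
counts = np.array([
    [12,  0,  3,  1],
    [10,  1,  4,  0],
    [ 0,  8,  1,  9],
    [ 1,  7,  0, 11],
], dtype=float)

X = np.log1p(counts)                    # log-scaling of counts
U, S, Vt = np.linalg.svd(X, full_matrices=False)

k = 2                                   # retain only k singular values
word_vectors = U[:, :k] * S[:k]         # k-dimensional word representations

# Words 0 and 1 (similar count patterns) now end up close in the reduced space:
sim = word_vectors @ word_vectors.T
print(np.round(sim, 2))
```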
LSA (Latent Semantic Analysis) vs. word2vec
word2vec CBOW / Skip-gram: Predict!
[Mikolov et al. 2013]: Simple predict models for learning word vectors
• Train word vectors to try to either:
  • Predict a word given its bag-of-words context (CBOW); or
  • Predict a context word (position-independent) from the center word (Skip-gram)
• Update word vectors until they can do this prediction well
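For illustration, training such vectors is a few lines with an off-the-shelf toolkit; a hypothetical gensim (≥ 4.0) snippet, where `corpus` stands in for an iterable of tokenized sentences:

```python
from gensim.models import Word2Vec

# corpus: an iterable of tokenized sentences (a toy stand-in here)
corpus = [["seattle", "hotel", "rooms"], ["seattle", "motel", "rooms"],
          ["dell", "laptop", "battery"], ["dell", "notebook", "battery"]]

model = Word2Vec(sentences=corpus, vector_size=50, window=2,
                 min_count=1, sg=0, epochs=200)   # sg=0: CBOW, sg=1: Skip-gram

# Words that occur in similar contexts end up with similar vectors:
print(model.wv.most_similar("motel", topn=3))
```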
LSA vs. word2vec
word2vec encodes semantic components as linear relations
COALS model (count-modified LSA) [Rohde, Gonnerman & Plaut, ms., 2005]
Count based vs. direct prediction
Count-based: LSA, HAL (Lund & Burgess), COALS (Rohde et al.), Hellinger-PCA (Lebret & Collobert)
• Fast training
• Efficient usage of statistics
• Primarily used to capture word similarity
• May not use the best methods for scaling counts

Direct prediction: NNLM, HLBL, RNN, word2vec Skip-gram/CBOW (Bengio et al.; Collobert & Weston; Huang et al.; Mnih & Hinton; Mikolov et al.; Mnih & Kavukcuoglu)
• Scales with corpus size
• Inefficient usage of statistics
• Can capture complex patterns beyond word similarity
• Generates improved performance on other tasks
Encoding meaning in vector differences [Pennington, Socher, and Manning, EMNLP 2014]
Ratios of co-occurrence probabilities can encode meaning components
Crucial insight:
                           x = solid   x = water   x = gas    x = random
P(x | ice)                 large       large       small      small
P(x | steam)               small       large       large      small
P(x | ice) / P(x | steam)  large       ~1          small      ~1
Encoding meaning in vector differences [Pennington, Socher, and Manning, EMNLP 2014]
Ratios of co-occurrence probabilities can encode meaning components
Crucial insight:
                           x = solid     x = water     x = gas       x = fashion
P(x | ice)                 1.9 × 10⁻⁴    3.0 × 10⁻³    6.6 × 10⁻⁵    1.7 × 10⁻⁵
P(x | steam)               2.2 × 10⁻⁵    2.2 × 10⁻³    7.8 × 10⁻⁴    1.8 × 10⁻⁵
P(x | ice) / P(x | steam)  8.9           1.36          8.5 × 10⁻²    0.96
Encoding meaning in vector differences
Q: How can we capture ratios of co-occurrence probabilities as meaning components in a word vector space?
A: Log-bilinear model:  wᵢ · w̃ⱼ = log P(i | j)
   with vector differences:  wₓ · (wₐ − w_b) = log ( P(x | a) / P(x | b) )
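The resulting GloVe training objective is a weighted least-squares fit of dot products to log co-occurrence counts; a compact numpy sketch of one full-batch gradient step (illustrative, not the released GloVe code):

```python
import numpy as np

def glove_step(W, W_tilde, b, b_tilde, X, lr=0.05, x_max=100.0, alpha=0.75):
    """One gradient step on the GloVe objective
    J = sum_ij f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2."""
    i_idx, j_idx = np.nonzero(X)                 # only observed co-occurrences
    x = X[i_idx, j_idx]
    f = np.minimum((x / x_max) ** alpha, 1.0)    # weighting function f(X_ij)
    Wi, Wj = W[i_idx].copy(), W_tilde[j_idx].copy()
    diff = np.sum(Wi * Wj, axis=1) + b[i_idx] + b_tilde[j_idx] - np.log(x)
    g = 2 * f * diff                             # common gradient factor
    np.add.at(W, i_idx, -lr * g[:, None] * Wj)
    np.add.at(W_tilde, j_idx, -lr * g[:, None] * Wi)
    np.add.at(b, i_idx, -lr * g)
    np.add.at(b_tilde, j_idx, -lr * g)
    return float(np.sum(f * diff ** 2))          # current value of J
```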
GloVe word similarities [Pennington et al., EMNLP 2014]
Nearest words to frog:
1. frogs
2. toad
3. litoria
4. leptodactylidae
5. rana
6. lizard
7. eleutherodactylus
[Images: litoria, leptodactylidae, rana, eleutherodactylus]
http://nlp.stanford.edu/projects/glove/
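Such neighbor lists come straight from cosine similarity over the trained vectors; a small sketch, assuming a local copy of a released vector file such as `glove.6B.100d.txt` from the URL above:

```python
import numpy as np

def load_glove(path):
    """Load GloVe vectors from the whitespace-separated text format."""
    words, vecs = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            words.append(parts[0])
            vecs.append(np.array(parts[1:], dtype=float))
    M = np.vstack(vecs)
    M /= np.linalg.norm(M, axis=1, keepdims=True)     # unit-normalize rows
    return {w: i for i, w in enumerate(words)}, M

def nearest(query, vocab, M, k=7):
    """k nearest neighbors of `query` by cosine similarity."""
    sims = M @ M[vocab[query]]
    best = np.argsort(-sims)
    inv = {i: w for w, i in vocab.items()}
    return [inv[i] for i in best[1:k+1]]              # skip the query itself

vocab, M = load_glove("glove.6B.100d.txt")            # assumed local file
print(nearest("frog", vocab, M))
```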
GloVe Visualizations
http://nlp.stanford.edu/projects/glove/
GloVe Visualizations: Company - CEO
Named Entity Recognition Performance
Model            CoNLL '03 dev   CoNLL '03 test   ACE2   MUC7
Categorical CRF  91.0            85.4             77.4   73.4
SVD (log tf)     90.5            84.8             73.6   71.5
HPCA             92.6            88.7             81.7   80.7
C&W              92.2            87.4             81.7   80.2
CBOW             93.1            88.2             82.2   81.1
GloVe            93.2            88.3             82.9   82.2

F1 score of a CRF trained on CoNLL 2003 English with 50-dimensional word vectors
Word embeddings: Conclusion
GloVe translates meaningful relationships between word-word co-occurrence counts into linear relations in the word vector space
GloVe shows the connection between Count! work and Predict! work – appropriate scaling of counts gives the properties and performance of Predict! models
A lot of other important work in this line of research:
[Levy & Goldberg 2014] [Arora, Li, Liang, Ma & Risteski 2015] [Hashimoto, Alvarez-Melis & Jaakkola 2016]
Can we use neural networks to understand, not just word similarities,
but language meaning in general?
Compositionality
Artificial intelligence requires being able to understand bigger things from knowing about smaller parts
We need more than word embeddings!
How can we know when larger linguistic units are similar in meaning?
The snowboarder is leaping over the mogul
A person on a snowboard jumps into the air
People interpret the meaning of larger text units – entities, descriptive terms, facts, arguments, stories – by semantic composition of smaller elements
Beyond the bag of words: Sentiment detection
Is the tone of a piece of text positive, negative, or neutral?
• A common intuition is that sentiment is "easy"
• Detection accuracy for longer documents is ~90%, BUT
   … loved … great … impressed … marvelous …
Stanford Sentiment Treebank
• 215,154 phrases labeled in 11,855 sentences
• Can train and test compositions
http://nlp.stanford.edu:8080/sentiment/
Tree-Structured Long Short-Term Memory Networks [Tai et al., ACL 2015]
Tree-structured LSTM
Generalizes sequential LSTM to trees with any branching factor
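As a rough illustration of the idea (a minimal numpy sketch of a child-sum Tree-LSTM cell in the style of Tai et al., not their released code): each node sums its children's hidden states, but gets a separate forget gate per child.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def child_sum_treelstm_node(x, child_h, child_c, P):
    """One child-sum Tree-LSTM node.
    x: input word vector at this node; child_h / child_c: lists of the
    children's hidden / cell states; P: dict of parameters W*, U*, b* per gate."""
    h_sum = np.sum(child_h, axis=0) if child_h else np.zeros_like(P["bi"])
    i = sigmoid(P["Wi"] @ x + P["Ui"] @ h_sum + P["bi"])      # input gate
    o = sigmoid(P["Wo"] @ x + P["Uo"] @ h_sum + P["bo"])      # output gate
    u = np.tanh(P["Wu"] @ x + P["Uu"] @ h_sum + P["bu"])      # candidate update
    c = i * u
    for hk, ck in zip(child_h, child_c):
        fk = sigmoid(P["Wf"] @ x + P["Uf"] @ hk + P["bf"])    # per-child forget gate
        c += fk * ck
    h = o * np.tanh(c)
    return h, c
```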
Positive/Negative Results on Treebank
[Bar chart: test accuracy (roughly 75–95%) for BiNB, RNN, MV-RNN, RNTN, and TreeLSTM, comparing training with sentence labels only vs. training with the full treebank]
Experimental Results on Treebank
• TreeRNNs can capture constructions like X but Y
• Biword Naïve Bayes is only 58% on these
Stanford Natural Language Inference Corpus
http://nlp.stanford.edu/projects/snli/
570K Turker-judged pairs, based on an assumed picture
A man rides a bike on a snow covered road.  A man is outside.  ENTAILMENT
2 female babies eating chips.  Two female babies are enjoying chips.  NEUTRAL
A man in an apron shopping at a market.  A man in an apron is preparing dinner.  CONTRADICTION
NLI with Tree-RNNs [Bowman, Angeli, Potts & Manning, EMNLP 2015]
Approach: We would like to work out the meaning of each sentence separately – a pure compositional model
Then we compare them with a NN & classify for inference
[Architecture diagram: learned word vectors feed composition NN layers that build phrase vectors such as (man, outside) → "man outside" and (man, (in, snow)) → "man in snow"; the two sentence vectors feed comparison NN layer(s) and a softmax classifier, giving e.g. P(Entail) = 0.8]
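In rough code terms, a hypothetical numpy sketch of the comparison-and-classify step, assuming each sentence has already been composed into a vector (e.g. by the Tree-LSTM sketched earlier):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def nli_classify(premise_vec, hypothesis_vec, Wc, bc, Ws, bs):
    """Compare two composed sentence vectors and classify the pair.
    Wc, bc: comparison NN layer; Ws, bs: softmax layer over
    (entailment, neutral, contradiction)."""
    pair = np.concatenate([premise_vec, hypothesis_vec])    # simple concatenation
    hidden = relu(Wc @ pair + bc)                           # comparison NN layer
    return softmax(Ws @ hidden + bs)                        # class probabilities
```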
Tree recursive NNs (TreeRNNs)
Theoretically appealing
Very empirically competitive
But:
• Prohibitively slow
• Usually require an external parser
• Don't exploit the complementary linear structure of language
A recurrent NN allows efficient batched computation on GPUs
TreeRNN: Input-specific structure undermines batched computation
The Shift-reduce Parser-Interpreter NN (SPINN) [Bowman, Gauthier et al. 2016]
Base model equivalent to a TreeRNN, but … supports batched computation: 25× speedups
Plus: Effective new hybrid that combines linear and tree-structured context
Can stand alone without a parser
Beginning observation: binary trees = transition sequences
Each distinct binary tree corresponds to a distinct SHIFT/REDUCE transition sequence, e.g.:
SHIFT SHIFT REDUCE SHIFT SHIFT REDUCE REDUCE
SHIFT SHIFT SHIFT SHIFT REDUCE REDUCE REDUCE
SHIFT SHIFT SHIFT REDUCE SHIFT REDUCE REDUCE
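A small sketch of this correspondence (a hypothetical helper, assuming trees are nested Python tuples of tokens):

```python
def tree_to_transitions(tree):
    """Post-order traversal: a leaf token emits SHIFT, an internal
    (left, right) node emits its children's transitions then REDUCE."""
    if isinstance(tree, str):
        return ["SHIFT"]
    left, right = tree
    return tree_to_transitions(left) + tree_to_transitions(right) + ["REDUCE"]

# ((the cat) (sat down))  ->  SHIFT SHIFT REDUCE SHIFT SHIFT REDUCE REDUCE
print(tree_to_transitions((("the", "cat"), ("sat", "down"))))
```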
The Shift-reduce Parser-Interpreter NN (SPINN)
The model includes a sequence LSTM RNN
• This acts as a simple parser by predicting SHIFT or REDUCE
• It also gives left sequence context as input to composition
Implementing the stack
• Naïve implementation: simulate stacks in a batch with a fixed-size multidimensional array at each time step
  • Backpropagation requires that each intermediate stack be maintained in memory
  • ⇒ Large amount of data copying and movement required
• Efficient implementation (a sketch follows below)
  • Have only one stack array for each example
  • At each time step, augment with the current head of the stack
  • Keep a list of backpointers for REDUCE operations
  • Similar to zipper data structures employed elsewhere
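A minimal, hypothetical Python sketch of the thin-stack idea (append-only buffer plus backpointers, not the actual SPINN code):

```python
def run_transitions(tokens, transitions, compose):
    """Execute SHIFT/REDUCE transitions with a 'thin' stack:
    `buffer` is append-only (one entry per step), and `stack` holds
    backpointers into it instead of copies of intermediate stacks."""
    buffer, stack = [], []          # buffer: all values ever pushed; stack: indices
    queue = list(tokens)
    for t in transitions:
        if t == "SHIFT":
            buffer.append(queue.pop(0))          # push the next token
        else:                                    # REDUCE
            right, left = buffer[stack.pop()], buffer[stack.pop()]
            buffer.append(compose(left, right))  # push the composed node
        stack.append(len(buffer) - 1)            # backpointer to the new head
    return buffer[stack[-1]]

pair = lambda l, r: (l, r)
tree = run_transitions(["Spot", "sat", "down"],
                       ["SHIFT", "SHIFT", "SHIFT", "REDUCE", "REDUCE"], pair)
print(tree)   # ('Spot', ('sat', 'down'))
```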
Array (append-only)        Backpointers (stack after each step)
1: Spot                    1
2: sat                     1 2
3: down                    1 2 3
4: (sat down)              1 4
5: (Spot (sat down))       5
A thinner stack
Using SPINN for natural language inference
[Diagram: SPINN encodes the premise and hypothesis separately, e.g. composing "the cat" with "sat down" and "the cat" with "is angry" into (the cat)(sat down) and (the cat)(is angry), producing sentence vectors o₁ and o₂ for the pair classifier]
SNLI Results
Model                                                      % Accuracy (test set)
Feature-based classifier                                   78.2
Previous SOTA sentence encoder [Mou et al. 2016]           82.1
LSTM RNN sequence model                                    80.6
Tree LSTM                                                  80.9
SPINN                                                      83.2
SOTA (sentence pair alignment model) [Parikh et al. 2016]  86.8
Successes for SPINN over LSTM
Examples with negation
• P: The rhythmic gymnast completes her floor exercise at the competition.
• H: The gymnast cannot finish her exercise.
Long examples (>20 words)
• P: A man wearing glasses and a ragged costume is playing a Jaguar electric guitar and singing with the accompaniment of a drummer.
• H: A man with glasses and a disheveled outfit is playing a guitar and singing along with a drummer.
Envoi
• There are very good reasons for wanting to represent meaning with distributed representations
  • So far, distributional learning has been most effective for this
  • But cf. [Young, Lai, Hodosh & Hockenmaier 2014] on denotational representations, using visual scenes
• However, we want not just word meanings, but also:
  • Meanings of larger units, calculated compositionally
  • The ability to do natural language inference
• The SPINN model is fast – close to recurrent networks!
• Its hybrid sequence/tree structure is psychologically plausible and out-performs other sentence composition methods
[Timeline: deep learning takes over speech (2011), vision (2013), NLP (2015), IR (2017?)]
Final Thoughts
[Diagram: "You are here" – positioned between IR and NLP]
Final Thoughts
I'm certain that deep learning will come to dominate SIGIR over the next couple of years … just like speech, vision, and NLP before it. This is a good thing. Deep learning provides some powerful new techniques that are just being amazingly successful on many hard applied problems.

However, we should realize that there is also currently a huge amount of hype about deep learning and artificial intelligence. We should not let a genuine enthusiasm for important and successful new techniques lead to irrational exuberance or a diminished appreciation of other approaches.

Finally, despite the efforts of a number of people, in practice there has been a considerable division between the human language technology fields of IR, NLP, and speech. Partly this is due to organizational factors and partly that at one time the subfields each had a very different focus. However, recent changes in emphasis – with IR people wanting to understand the user better and NLP people much more interested in meaning and context – mean that there are a lot of common interests, and I would encourage much more collaboration between NLP and IR in the next decade.