
Natural Language Processing with Deep Learning

CS224N/Ling284

Richard Socher

Lecture 14: Tree Recursive Neural Networks and Constituency Parsing

Lecture Plan

1. Motivation: Compositionality and Recursion
2. Structure prediction with a simple Tree RNN: Parsing
3. Backpropagation through Structure
4. More complex units

Reminders/comments:
• Learn up on GPUs, Azure, Docker
• Assignment 4: Get something working, using a GPU for the milestone
• Final project discussions: come meet with us! OH today after class. You have to come to every OH. No additional feedback beyond OH. Nothing on Gradescope.

1. The spectrum of language in CS

Semantic interpretation of language: not just word vectors

How can we know when larger units are similar in meaning?

• The snowboarder is leaping over a mogul
• A person on a snowboard jumps into the air

People interpret the meaning of larger text units (entities, descriptive terms, facts, arguments, stories) by semantic composition of smaller elements.

Compositionality

Language understanding, and Artificial Intelligence, requires being able to understand bigger things from knowing about smaller parts.

Are languages recursive?

• Cognitively somewhat debatable
• But: recursion is natural for describing language
• [The man from [the company that you spoke with about [the project] yesterday]]
• a noun phrase containing a noun phrase containing a noun phrase
• Arguments for now: 1) Helpful in disambiguation

Is recursion useful?

2) Helpful for some tasks to refer to specific phrases:
• John and Jane went to a big festival. They enjoyed the trip and the music there.
• "they": John and Jane
• "the trip": went to a big festival
• "there": big festival

3) Works better for some tasks to use grammatical tree structure
• It's a powerful prior for language structure

Building on Word Vector Space Models

How can we represent the meaning of longer phrases, such as "the country of my birth" or "the place where I was born"? By mapping them into the same vector space!

[Figure: a 2-D word vector space with points such as Monday (9, 2), Tuesday (9.5, 1.5), France (2, 2.5), and Germany (1, 3).]

How should we map phrases into a vector space?

[Figure: a parse tree over "the country of my birth", with a 2-D vector at each word (e.g. [0.4, 0.3], [2.3, 3.6], [4, 4.5], [7, 7], [2.1, 3.3]) and a composed vector at each internal node (e.g. [2.5, 3.8], [5.5, 6.1], [1, 3.5], [1, 5]).]

Use the principle of compositionality: the meaning (vector) of a sentence is determined by (1) the meanings of its words and (2) the rules that combine them.

Models in this section can jointly learn parse trees and compositional vector representations.

[Figure: the phrase vectors for "the country of my birth" and "the place where I was born" end up close together in the same 2-D space as Monday, Tuesday, France, and Germany.]

Constituency Sentence Parsing: What we want

[Figure: the desired output for "The cat sat on the mat.": a constituency parse with S, VP, PP, and NP nodes built above the word vectors.]

Learn Structure and Representation

[Figure: the same parse of "The cat sat on the mat.", now with a learned vector at every internal node (NP, PP, VP, S) as well as at every word.]

Recursive vs. recurrent neural networks

• Recursive neural nets require a tree structure.
• Recurrent neural nets cannot capture phrases without prefix context, and often capture too much of the last words in the final vector.

[Figure: a recursive (tree-structured) composition of "the country of my birth" contrasted with a recurrent (left-to-right) composition of the same phrase; both build vectors for larger units out of the word vectors.]

Recursive Neural Networks for Structure Prediction

[Figure: a neural network takes the vectors of two candidate children, e.g. [8, 5] for "on" and [3, 3] for "the mat", and produces a parent vector (e.g. [8, 3]) plus a plausibility score (e.g. 1.3).]

Inputs: two candidate children's representations.
Outputs:
1. The semantic representation if the two nodes are merged.
2. A score of how plausible the new node would be.

Recursive Neural Network Definition

$$\text{score} = U^{\top} p, \qquad p = \tanh\!\left(W \begin{bmatrix} c_1 \\ c_2 \end{bmatrix} + b\right)$$

The same W parameters are used at all nodes of the tree.

[Figure: the network maps two children c1 = [8, 5] and c2 = [3, 3] to a parent p = [8, 3] and a score of 1.3.]
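A minimal numpy sketch of this composition and scoring function (a sketch only; the dimensionality, initialization, and names such as compose are assumptions, not the course's reference code):

import numpy as np

d = 2                                        # dimensionality of each node vector (assumed)
rng = np.random.default_rng(0)
W = 0.1 * rng.standard_normal((d, 2 * d))    # shared composition matrix, used at every node
b = np.zeros(d)
U = 0.1 * rng.standard_normal(d)             # scoring vector

def compose(c1, c2):
    """Merge two child vectors: p = tanh(W [c1; c2] + b), score = U^T p."""
    p = np.tanh(W @ np.concatenate([c1, c2]) + b)
    return p, float(U @ p)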

Parsing a sentence with an RNN

[Figure: the network scores every pair of adjacent units in "The cat sat on the mat."; each candidate merge gets a parent vector and a plausibility score (e.g. 0.1, 0.4, 2.3, 3.1, 0.3 for the five adjacent pairs).]

Parsing a sentence

[Figure: greedy parsing of "The cat sat on the mat." step by step: at each step the best-scoring adjacent pair is merged into a new node, the new node's vector joins the sequence, and the remaining candidate pairs (including ones involving the new node) are rescored, until a single node spans the whole sentence. See the sketch below.]
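A sketch of that greedy procedure, reusing the hypothetical compose function from the earlier sketch (beam search with a chart, mentioned below, is the variant actually used later):

def greedy_parse(word_vectors):
    """Repeatedly merge the best-scoring adjacent pair until one node remains.
    Returns the root vector and the total tree score (sum of decision scores)."""
    nodes = list(word_vectors)
    total_score = 0.0
    while len(nodes) > 1:
        candidates = [compose(nodes[i], nodes[i + 1]) for i in range(len(nodes) - 1)]
        best = max(range(len(candidates)), key=lambda i: candidates[i][1])
        parent, score = candidates[best]
        total_score += score
        nodes[best:best + 2] = [parent]   # replace the two children with their parent
    return nodes[0], total_score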

Max-Margin Framework: Details

• The score of a tree is computed as the sum of the parsing decision scores at each node, i.e. $s(x, y) = \sum_{n \in \text{nodes}(y)} s_n$, where $x$ is the sentence and $y$ is a parse tree.

[Figure: each merge made by the RNN contributes its score (e.g. 1.3) to the total score of the tree.]

Max-Margin Framework: Details

• Similar to max-margin parsing (Taskar et al. 2004), a supervised max-margin objective is used.
• The loss penalizes all incorrect decisions.
• Structure search for A(x) was greedy (join the best nodes each time). Instead: beam search with a chart.
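For reference, the structured max-margin objective this refers to has the standard form below (written here following Taskar et al. 2004; the exact margin term used on the slide is not shown in the transcript, so take this as a sketch):

$$J(\theta) = \sum_i \Big[ \max_{y \in A(x_i)} \big( s(x_i, y) + \Delta(y, y_i) \big) - s(x_i, y_i) \Big],$$

where $y_i$ is the gold tree for sentence $x_i$, $A(x_i)$ is the set of candidate trees, and the margin $\Delta(y, y_i)$ grows with the number of incorrect decisions in $y$, so that every wrong decision is penalized.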

Backpropagation Through Structure

Introduced by Goller & Küchler (1996).

Principally the same as general backpropagation. Three differences result from the recursion and tree structure:
1. Sum derivatives of W from all nodes (like an RNN)
2. Split derivatives at each node (for the tree)
3. Add error messages from parent + node itself

The second derivative in eq. 28 for output units is simply

$$\frac{\partial a^{(n_l)}_i}{\partial W^{(n_l-1)}_{ij}} = \frac{\partial}{\partial W^{(n_l-1)}_{ij}} z^{(n_l)}_i = \frac{\partial}{\partial W^{(n_l-1)}_{ij}} \left( W^{(n_l-1)}_{i\cdot}\, a^{(n_l-1)} \right) = a^{(n_l-1)}_j. \qquad (46)$$

We adopt standard notation and introduce the error $\delta$ related to an output unit:

$$\frac{\partial E_n}{\partial W^{(n_l-1)}_{ij}} = (y_i - t_i)\, a^{(n_l-1)}_j = \delta^{(n_l)}_i a^{(n_l-1)}_j. \qquad (47)$$

So far, we only computed errors for output units, now we will derive $\delta$'s for normal hidden units and show how these errors are backpropagated to compute weight derivatives of lower levels. We will start with second-to-top layer weights from which a generalization to arbitrarily deep layers will become obvious. Similar to eq. 28, we start with the error derivative:

$$\frac{\partial E}{\partial W^{(n_l-2)}_{ij}} = \sum_n \underbrace{\frac{\partial E_n}{\partial a^{(n_l)}}}_{\delta^{(n_l)}} \frac{\partial a^{(n_l)}}{\partial W^{(n_l-2)}_{ij}} + \lambda W^{(n_l-2)}_{ji}. \qquad (48)$$

Now,

$$\begin{aligned}
(\delta^{(n_l)})^{\top} \frac{\partial a^{(n_l)}}{\partial W^{(n_l-2)}_{ij}}
&= (\delta^{(n_l)})^{\top} \frac{\partial z^{(n_l)}}{\partial W^{(n_l-2)}_{ij}} && (49)\\
&= (\delta^{(n_l)})^{\top} \frac{\partial}{\partial W^{(n_l-2)}_{ij}} W^{(n_l-1)} a^{(n_l-1)} && (50)\\
&= (\delta^{(n_l)})^{\top} \frac{\partial}{\partial W^{(n_l-2)}_{ij}} W^{(n_l-1)}_{\cdot i}\, a^{(n_l-1)}_i && (51)\\
&= (\delta^{(n_l)})^{\top} W^{(n_l-1)}_{\cdot i} \frac{\partial}{\partial W^{(n_l-2)}_{ij}} a^{(n_l-1)}_i && (52)\\
&= (\delta^{(n_l)})^{\top} W^{(n_l-1)}_{\cdot i} \frac{\partial}{\partial W^{(n_l-2)}_{ij}} f(z^{(n_l-1)}_i) && (53)\\
&= (\delta^{(n_l)})^{\top} W^{(n_l-1)}_{\cdot i} \frac{\partial}{\partial W^{(n_l-2)}_{ij}} f(W^{(n_l-2)}_{i\cdot}\, a^{(n_l-2)}) && (54)\\
&= (\delta^{(n_l)})^{\top} W^{(n_l-1)}_{\cdot i}\, f'(z^{(n_l-1)}_i)\, a^{(n_l-2)}_j && (55)\\
&= \left( (\delta^{(n_l)})^{\top} W^{(n_l-1)}_{\cdot i} \right) f'(z^{(n_l-1)}_i)\, a^{(n_l-2)}_j && (56)\\
&= \underbrace{\left( \sum_{j'=1}^{s_{l+1}} W^{(n_l-1)}_{j'i}\, \delta^{(n_l)}_{j'} \right) f'(z^{(n_l-1)}_i)}_{\delta^{(n_l-1)}_i}\; a^{(n_l-2)}_j && (57)\\
&= \delta^{(n_l-1)}_i a^{(n_l-2)}_j, && (58)
\end{aligned}$$

where we used in the first line that the top layer is linear. This is a very detailed account of essentially just the chain rule.

So, we can write the $\delta$ errors of all layers $l$ (except the top layer) in vector format, using the Hadamard product $\circ$:

$$\delta^{(l)} = \left( (W^{(l)})^{\top} \delta^{(l+1)} \right) \circ f'(z^{(l)}), \qquad (59)$$

where the sigmoid derivative from eq. 14 gives $f'(z^{(l)}) = (1 - a^{(l)})\, a^{(l)}$. Using that definition, we get the hidden layer backprop derivatives:

$$\frac{\partial}{\partial W^{(l)}_{ij}} E_R = a^{(l)}_j \delta^{(l+1)}_i + \lambda W^{(l)}_{ij}, \qquad (60)$$

which in one simplified vector notation becomes:

$$\frac{\partial}{\partial W^{(l)}} E_R = \delta^{(l+1)} (a^{(l)})^{\top} + \lambda W^{(l)}. \qquad (62)$$

In summary, the backprop procedure consists of four steps:

1. Apply an input $x_n$ and forward propagate it through the network to get the hidden and output activations using eq. 18.
2. Evaluate $\delta^{(n_l)}$ for output units using eq. 42.
3. Backpropagate the $\delta$'s to obtain a $\delta^{(l)}$ for each hidden layer in the network using eq. 59.
4. Evaluate the required derivatives with eq. 62 and update all the weights using an optimization procedure such as conjugate gradient or L-BFGS. CG seems to be faster and work better when using mini-batches of training data to estimate the derivatives.

If you have any further questions or found errors, please send an email to richard@socher.org

5 Recursive Neural Networks

Same as backprop in the previous section but splitting error derivatives and noting that the derivatives of the same W at each node can all be added up. Lastly, the deltas from the parent node and possible deltas from a softmax classifier at each node are just added.
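The four-step procedure above, as a minimal numpy sketch for a single sigmoid hidden layer with a linear output and weight decay (a sketch only; biases and the exact loss from eqs. 18/42 are simplified assumptions):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_two_layer(x, t, W1, W2, lam=1e-4):
    """Steps 1-4 for one hidden layer: forward pass, output delta, eq. (59), eq. (62)."""
    a1 = sigmoid(W1 @ x)                          # 1. forward propagate
    y = W2 @ a1                                   #    linear output units
    delta2 = y - t                                # 2. output error delta^(n_l)
    delta1 = (W2.T @ delta2) * (1.0 - a1) * a1    # 3. eq. (59): (W^T delta) o f'(z)
    dW2 = np.outer(delta2, a1) + lam * W2         # 4. eq. (62): delta a^T + lambda W
    dW1 = np.outer(delta1, x) + lam * W1
    return dW1, dW2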

BTS: 1) Sum derivatives of all nodes

You can actually assume it's a different W at each node. Intuition via example: if we take separate derivatives of each occurrence, we get the same result as differentiating the tied W directly (see the example below).
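The slide's worked example is not reproduced in the transcript; here is a small example of the same intuition (my own illustration, not the slide's). Consider two nested applications of the same $W$:

$$p_1 = f\!\left(W \begin{bmatrix} c_1 \\ c_2 \end{bmatrix} + b\right), \qquad p_2 = f\!\left(W \begin{bmatrix} p_1 \\ c_3 \end{bmatrix} + b\right).$$

Rename the two occurrences $W_1$ (lower node) and $W_2$ (upper node). For any error $E(p_2)$, the chain rule gives

$$\frac{\partial E}{\partial W} = \left.\frac{\partial E}{\partial W_2}\right|_{W_2 = W} + \left.\frac{\partial E}{\partial W_1}\right|_{W_1 = W},$$

so treating each node as having its own weight matrix, computing the usual per-node derivatives, and summing them gives exactly the gradient of the tied $W$.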

BTS: 2) Split derivatives at each node

During forward prop, the parent is computed using its 2 children:

$$p = \tanh\!\left(W \begin{bmatrix} c_1 \\ c_2 \end{bmatrix} + b\right)$$

Hence, the errors need to be computed with respect to each of them, where each child's error is n-dimensional.

[Figure: the error arriving at the parent is split into one message for c1 and one for c2.]
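Concretely (standard backprop-through-structure bookkeeping, not spelled out in the transcript): if $\delta_p \in \mathbb{R}^n$ is the parent's error, the message sent downward is

$$W^{\top}\delta_p = \begin{bmatrix} \delta_{\downarrow c_1} \\ \delta_{\downarrow c_2} \end{bmatrix} \in \mathbb{R}^{2n},$$

i.e. the top $n$ entries go to $c_1$ and the bottom $n$ entries to $c_2$; each child that is itself an internal node then multiplies its incoming message by its own nonlinearity derivative.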

BTS: 3) Add error messages

• At each node: what came up (fprop) must come down (bprop).
• Total error messages = error messages from the parent + error message from the node's own score.

[Figure: a node with children c1, c2 receives one error message from its parent and one from its own score; the two are added.]

BTS Python Code: forwardProp
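The slide's code image is not included in the transcript. A minimal sketch of what such a forwardProp might look like, assuming a simple binary Node class and the shared W, b, U from the definition above (all names are hypothetical, not the course's reference implementation):

import numpy as np

class Node:
    """Binary tree node: leaves carry a word vector; internal nodes have two children."""
    def __init__(self, vec=None, left=None, right=None):
        self.vec, self.left, self.right = vec, left, right
        self.score = 0.0

def forward_prop(node, W, b, U):
    """Bottom-up pass: p = tanh(W [c1; c2] + b) and score = U^T p at every internal node."""
    if node.left is None:                    # leaf: the word vector is given
        return node.vec
    c1 = forward_prop(node.left, W, b, U)
    c2 = forward_prop(node.right, W, b, U)
    node.vec = np.tanh(W @ np.concatenate([c1, c2]) + b)
    node.score = float(U @ node.vec)
    return node.vec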

BTS Python Code: backProp
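Likewise, a sketch of the corresponding backProp for the hypothetical Node class above, assuming we differentiate the total tree score (the sum of node scores); it shows the three BTS ingredients: W gradients summed over nodes, parent message added to the node's own score error, and the downward message split between the two children:

def back_prop(node, delta_down, W, b, U, grads):
    """delta_down: error message arriving from the parent (zeros at the root).
    grads: dict with 'W', 'b', 'U' arrays accumulating gradients over all nodes."""
    if node.left is None:                    # leaf: no composition happened here
        return
    # 3) add error messages: message from the parent + error from the node's own score
    #    (d score / d p = U), pushed through tanh'(z) = 1 - tanh(z)^2
    delta = (delta_down + U) * (1.0 - node.vec ** 2)
    children = np.concatenate([node.left.vec, node.right.vec])
    grads["W"] += np.outer(delta, children)  # 1) sum derivatives of W from all nodes
    grads["b"] += delta
    grads["U"] += node.vec
    msg = W.T @ delta                        # 2) split derivatives at each node:
    n = node.vec.shape[0]                    #    top half to the left child,
    back_prop(node.left, msg[:n], W, b, U, grads)    # bottom half to the right child
    back_prop(node.right, msg[n:], W, b, U, grads)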


Discussion: Simple RNN

• Decent results with a single-matrix TreeRNN.
• A single-weight-matrix TreeRNN can capture some phenomena, but it is not adequate for more complex, higher-order composition or for parsing long sentences.
• There is no real interaction between the input words.
• The composition function is the same for all syntactic categories, punctuation, etc.

[Figure: the same W is applied to every pair of children c1, c2 to produce the parent p, and a single scoring vector produces the score s.]

Version 2: Syntactically-Untied RNN

• A symbolic Context-Free Grammar (CFG) backbone is adequate for basic syntactic structure.
• We use the discrete syntactic categories of the children to choose the composition matrix.
• A TreeRNN can do better with a different composition matrix for different syntactic environments.
• The result gives us a better semantics.
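A minimal sketch of the syntactically-untied composition, assuming a hypothetical dictionary of composition matrices keyed by the children's category pair (the category names and the fallback to a default matrix are my assumptions):

import numpy as np

def su_compose(c1, cat1, c2, cat2, W_by_pair, W_default, b):
    """Pick the composition matrix from the children's syntactic categories,
    e.g. W_by_pair[("DT", "NP")] for a determiner combining with a noun phrase."""
    W = W_by_pair.get((cat1, cat2), W_default)
    return np.tanh(W @ np.concatenate([c1, c2]) + b)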

Compositional Vector Grammars

• Problem: speed. Every candidate score in beam search needs a matrix-vector product.
• Solution: compute the score only for a subset of trees coming from a simpler, faster model (a PCFG).
• This prunes very unlikely candidates for speed.
• It also provides the coarse syntactic categories of the children for each beam candidate.
• Compositional Vector Grammar = PCFG + TreeRNN

Related Work for parsing

• The resulting CVG parser is related to previous work that extends PCFG parsers:
• Klein and Manning (2003a): manual feature engineering
• Petrov et al. (2006): learning algorithm that splits and merges syntactic categories
• Lexicalized parsers (Collins, 2003; Charniak, 2000): describe each category with a lexical item
• Hall and Klein (2012) combine several such annotation schemes in a factored parser.
• CVGs extend these ideas from discrete representations to richer continuous ones.

Experiments

• Standard WSJ split, labeled F1
• Based on a simple PCFG with fewer states
• Fast pruning of the search space, few matrix-vector products
• 3.8% higher F1, 20% faster than the Stanford factored parser

Parser: labeled F1 (test, all sentences)
• Stanford PCFG (Klein and Manning, 2003a): 85.5
• Stanford Factored (Klein and Manning, 2003b): 86.6
• Factored PCFGs (Hall and Klein, 2012): 89.4
• Collins (Collins, 1997): 87.7
• SSN (Henderson, 2004): 89.4
• Berkeley Parser (Petrov and Klein, 2007): 90.1
• CVG (RNN) (Socher et al., ACL 2013): 85.0
• CVG (SU-RNN) (Socher et al., ACL 2013): 90.4
• Charniak, Self-Trained (McClosky et al. 2006): 91.0
• Charniak, Self-Trained, Reranked (McClosky et al. 2006): 92.1

SU-RNN / CVG [Socher, Bauer, Manning, Ng 2013]

Learns a soft notion of head words. Initialization:

[Figure: visualizations of the composition matrices for the category pairs NP-CC, NP-PP, PP-NP, and PRP$-NP.]

SU-RNN / CVG [Socher, Bauer, Manning, Ng 2013]

[Figure: learned composition matrices for the category pairs ADJP-NP, ADVP-ADJP, JJ-NP, and DT-NP.]

Analysis of resulting vector representations

Each sentence is shown with its two nearest neighbors in the learned vector space:

All the figures are adjusted for seasonal variations
1. All the numbers are adjusted for seasonal fluctuations
2. All the figures are adjusted to remove usual seasonal patterns

Knight-Ridder wouldn't comment on the offer
1. Harsco declined to say what country placed the order
2. Coastal wouldn't disclose the terms

Sales grew almost 7% to $UNK m. from $UNK m.
1. Sales rose more than 7% to $94.9 m. from $88.3 m.
2. Sales surged 40% to UNK b. yen from UNK b.

Version 3: Compositionality Through Recursive Matrix-Vector Spaces

One way to make the composition function more powerful was by untying the weights W.

But what if words act mostly as an operator, e.g. "very" in "very good"?

Proposal: a new composition function.

Before: $p = \tanh\!\left(W \begin{bmatrix} c_1 \\ c_2 \end{bmatrix} + b\right)$

Compositionality Through Recursive Matrix-Vector Recursive Neural Networks

Before: $p = \tanh\!\left(W \begin{bmatrix} c_1 \\ c_2 \end{bmatrix} + b\right)$

After: $p = \tanh\!\left(W \begin{bmatrix} C_2\, c_1 \\ C_1\, c_2 \end{bmatrix} + b\right)$

Matrix-vector RNNs [Socher, Huval, Bhat, Manning, & Ng, 2012]

[Figure: each word and phrase has both a vector and a matrix; the children's matrices A and B are combined to produce the parent's matrix P, and the children's vectors to produce the parent's vector p.]
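A sketch of the matrix-vector composition, with the parent-matrix map written as W_M following the MV-RNN paper (the shapes and names here are assumptions for illustration):

import numpy as np

def mv_compose(c1, C1, c2, C2, W, W_M, b):
    """Each child has a vector (c, length n) and an operator matrix (C, n x n).
    Parent vector: p = tanh(W [C2 c1; C1 c2] + b);  parent matrix: P = W_M [C1; C2]."""
    p = np.tanh(W @ np.concatenate([C2 @ c1, C1 @ c2]) + b)
    P = W_M @ np.vstack([C1, C2])            # W_M has shape (n, 2n)
    return p, P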

Predicting Sentiment Distributions

A good example for non-linearity in language.

Classification of Semantic Relationships

• Can an MV-RNN learn how a large syntactic context conveys a semantic relationship?
• "My [apartment]e1 has a pretty large [kitchen]e2" → component-whole relationship (e2, e1)
• Build a single compositional semantics for the minimal constituent including both terms.

Classification of Semantic Relationships

Classifier (features): F1
• SVM (POS, stemming, syntactic patterns): 60.1
• MaxEnt (POS, WordNet, morphological features, noun compound system, thesauri, Google n-grams): 77.6
• SVM (POS, WordNet, prefixes, morphological features, dependency parse features, Levin classes, PropBank, FrameNet, NomLex-Plus, Google n-grams, paraphrases, TextRunner): 82.2
• RNN (no extra features): 74.8
• MV-RNN (no extra features): 79.1
• MV-RNN (POS, WordNet, NER): 82.4

Scene Parsing

• The meaning of a scene image is also a function of smaller regions,
• how they combine as parts to form larger objects,
• and how the objects interact.

Similar principle of compositionality.

Algorithm for Parsing Images

Same Recursive Neural Network as for natural language parsing! (Socher et al. ICML 2011)

[Figure: parsing natural scene images: image segments (grass, tree, people, building) are mapped to features and semantic representations, then merged recursively into larger regions.]

Multi-class segmentation (Stanford Background Dataset, Gould et al. 2009)

Method: Accuracy
• Pixel CRF (Gould et al., ICCV 2009): 74.3
• Classifier on superpixel features: 75.9
• Region-based energy (Gould et al., ICCV 2009): 76.4
• Local labelling (Tighe & Lazebnik, ECCV 2010): 76.9
• Superpixel MRF (Tighe & Lazebnik, ECCV 2010): 77.5
• Simultaneous MRF (Tighe & Lazebnik, ECCV 2010): 77.5
• Recursive Neural Network: 78.1

Next lecture

• Model overview, comparison, extensions, combinations, etc.
