Natural Language Processing with Deep Learning
CS224N/Ling284
Richard Socher
Lecture 14: Tree Recursive Neural Networks and Constituency Parsing
Lecture Plan
1. Motivation: Compositionality and Recursion
2. Structure prediction with simple Tree RNN: Parsing
3. Backpropagation through Structure
4. More complex units
Reminders/comments:
Learn up on GPUs, Azure, Docker
Ass4: Get something working using a GPU for the milestone
Final project discussions – come meet with us! OH today after class. You have to come to every OH. No additional feedback beyond OH. Nothing on Gradescope.
1. The spectrum of language in CS
Semantic interpretation of language – Not just word vectors
How can we know when larger units are similar in meaning?
• The snowboarder is leaping over a mogul
• A person on a snowboard jumps into the air
People interpret the meaning of larger text units – entities, descriptive terms, facts, arguments, stories – by semantic composition of smaller elements.
Compositionality
Language understanding – and Artificial Intelligence – requires being able to understand bigger things from knowing about smaller parts.
Are languages recursive?
• Cognitively somewhat debatable
• But: recursion is natural for describing language
• [The man from [the company that you spoke with about [the project] yesterday]]
• a noun phrase containing a noun phrase containing a noun phrase
• Arguments for now: 1) Helpful in disambiguation
Is recursion useful?
2) Helpful for some tasks, to refer to specific phrases:
• John and Jane went to a big festival. They enjoyed the trip and the music there.
• "they": John and Jane
• "the trip": went to a big festival
• "there": big festival
3) Works better for some tasks to use grammatical tree structure
• It's a powerful prior for language structure
Building on Word Vector Space Models
How can we represent the meaning of longer phrases? By mapping them into the same vector space!

[Figure: a 2-D vector space (axes x1, x2) with Monday at (9, 2), Tuesday at (9.5, 1.5), France at (2, 2.5), Germany at (1, 3), and the phrases "the country of my birth" and "the place where I was born" plotted near each other.]
How should we map phrases into a vector space?
[Figure: the word vectors for "the country of my birth" are composed pairwise up a tree into phrase vectors, ending in a single vector for the whole phrase.]
Use the principle of compositionality: the meaning (vector) of a sentence is determined by (1) the meanings of its words and (2) the rules that combine them.
Models in this section can jointly learn parse trees and compositional vector representations.
[Figure: the same 2-D vector space with Monday, Tuesday, France, Germany, and the two phrases "the country of my birth" and "the place where I was born" mapped close together.]
Constituency Sentence Parsing: What we want
[Figure: the desired constituency parse of "The cat sat on the mat." – an S node spanning NP and VP, with a PP and an inner NP, drawn above the word vectors (9/1, 5/3, 8/5, 9/1, 4/3, 7/1).]
Learn Structure and Representation
[Figure: the same parse tree, now with a learned vector at every nonterminal node (e.g. 5/2, 3/3, 8/3, 5/4, 7/3) in addition to the word vectors.]
Recursive vs. recurrent neural networks
[Figure: a recursive network builds a vector for "the country of my birth" bottom-up over a tree; a recurrent network builds it left to right, one word at a time.]
Recursive vs. recurrent neural networks
• Recursive neural nets require a tree structure
• Recurrent neural nets cannot capture phrases without prefix context, and often capture too much of the last words in the final vector
Recursive Neural Networks for Structure Prediction
[Figure: given "on the mat", the network merges two candidate children (e.g. 8/5 and 3/3) into a parent vector 8/3 with plausibility score 1.3.]
Inputs: two candidate children's representations.
Outputs:
1. The semantic representation if the two nodes are merged.
2. Score of how plausible the new node would be.
Recursive Neural Network Definition
$$\text{score} = U^T p$$

$$p = \tanh\left(W \begin{bmatrix} c_1 \\ c_2 \end{bmatrix} + b\right)$$

Same W parameters at all nodes of the tree.

[Figure: the network merges children c1 = 8/5 and c2 = 3/3 into parent p = 8/3 with score 1.3.]
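To make the definition concrete, here is a minimal NumPy sketch of this composition and scoring (dimensions and initialization are illustrative assumptions, not from the slides):

```python
import numpy as np

d = 2                                       # node vector dimensionality (assumed)
rng = np.random.default_rng(0)
W = rng.standard_normal((d, 2 * d)) * 0.1   # shared composition matrix, same at all nodes
b = np.zeros(d)
U = rng.standard_normal(d) * 0.1            # scoring vector

def compose(c1, c2):
    """Merge two children: return the parent vector and its plausibility score."""
    p = np.tanh(W @ np.concatenate([c1, c2]) + b)
    return p, U @ p

p, s = compose(np.array([8.0, 5.0]), np.array([3.0, 3.0]))  # e.g. the 8/5 and 3/3 nodes
```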
Parsing a sentence with an RNN
[Figure: step 1 of greedy parsing – the network scores every adjacent pair of nodes in "The cat sat on the mat." (here 0.1, 0.4, 2.3, 3.1, 0.3).]
Parsing a sentence
[Figure: step 2 – the best-scoring pair has been merged into a new node (5/2), and the network rescores the remaining adjacent pairs (e.g. 1.1, 0.1, 0.4, 2.3).]
Parsing a sentence
[Figure: step 3 – another pair has been merged into node 3/3, and a newly possible pair now scores highest (3.6, producing node 8/3).]
Parsing a sentence
[Figure: the completed greedy parse of "The cat sat on the mat." with a vector at every node (5/2, 3/3, 8/3, 5/4, 7/3).]
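The greedy procedure the slides just walked through can be sketched in a few lines (using compose from the earlier sketch; the tuple-based tree representation is an assumption):

```python
def greedy_parse(leaves):
    """Repeatedly merge the highest-scoring adjacent pair until one
    node spans the sentence; returns the tree and the total score."""
    nodes = list(leaves)                    # each entry: (vector, subtree)
    total = 0.0
    while len(nodes) > 1:
        candidates = [(compose(nodes[i][0], nodes[i + 1][0]), i)
                      for i in range(len(nodes) - 1)]
        (p, s), i = max(candidates, key=lambda c: c[0][1])   # best-scoring merge
        total += s
        nodes[i:i + 2] = [(p, (nodes[i][1], nodes[i + 1][1]))]
    return nodes[0][1], total

# e.g. tree, score = greedy_parse([(vec, word) for word, vec in word_vectors])
# where word_vectors is a hypothetical list of (word, vector) pairs
```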
Max-Margin Framework - Details
• The score of a tree is computed by the sum of the parsing decision scores at each node:

$$s(x, y) = \sum_{n \in \text{nodes}(y)} s_n$$

• x is the sentence; y is the parse tree
Max-Margin Framework - Details
• Similar to max-margin parsing (Taskar et al. 2004), a supervised max-margin objective (see the form below)
• The loss penalizes all incorrect decisions
• Structure search for A(x) was greedy (join best nodes each time)
• Instead: beam search with chart
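The objective on the slide is an image; its standard structured max-margin form, following Socher et al. (2011) (a reconstruction, not a slide transcription), is

$$\max_\theta \; \sum_i \Big[ s(x_i, y_i) - \max_{y \in A(x_i)} \big( s(x_i, y) + \Delta(y, y_i) \big) \Big],$$

where $\Delta(y, y_i)$ is a structured margin loss that grows with the number of incorrect spans in the candidate tree $y$.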
Backpropagation Through Structure
Introduced by Goller & Küchler (1996)

Principally the same as general backpropagation

Three differences resulting from the recursion and tree structure:
1. Sum derivatives of W from all nodes (like RNN)
2. Split derivatives at each node (for tree)
3. Add error messages from parent + node itself
The second derivative in eq. 28 for output units is simply

$$\frac{\partial a^{(n_l)}_i}{\partial W^{(n_l-1)}_{ij}} = \frac{\partial}{\partial W^{(n_l-1)}_{ij}} z^{(n_l)}_i = \frac{\partial}{\partial W^{(n_l-1)}_{ij}} \Big( W^{(n_l-1)}_{i\cdot}\, a^{(n_l-1)} \Big) = a^{(n_l-1)}_j. \tag{46}$$

We adopt standard notation and introduce the error $\delta$ related to an output unit:

$$\frac{\partial E_n}{\partial W^{(n_l-1)}_{ij}} = (y_i - t_i)\, a^{(n_l-1)}_j = \delta^{(n_l)}_i a^{(n_l-1)}_j. \tag{47}$$

So far we only computed errors for output units; now we will derive $\delta$'s for normal hidden units and show how these errors are backpropagated to compute weight derivatives of lower levels. We will start with second-to-top layer weights, from which a generalization to arbitrarily deep layers will become obvious. Similar to eq. 28, we start with the error derivative:

$$\frac{\partial E}{\partial W^{(n_l-2)}_{ij}} = \sum_n \underbrace{\frac{\partial E_n}{\partial a^{(n_l)}}}_{\delta^{(n_l)}} \frac{\partial a^{(n_l)}}{\partial W^{(n_l-2)}_{ij}} + \lambda W^{(n_l-2)}_{ji}. \tag{48}$$

Now,

$$\begin{aligned}
(\delta^{(n_l)})^T \frac{\partial a^{(n_l)}}{\partial W^{(n_l-2)}_{ij}}
&= (\delta^{(n_l)})^T \frac{\partial z^{(n_l)}}{\partial W^{(n_l-2)}_{ij}} \\
&= (\delta^{(n_l)})^T \frac{\partial}{\partial W^{(n_l-2)}_{ij}}\, W^{(n_l-1)} a^{(n_l-1)} \\
&= (\delta^{(n_l)})^T \frac{\partial}{\partial W^{(n_l-2)}_{ij}}\, W^{(n_l-1)}_{\cdot i}\, a^{(n_l-1)}_i \\
&= (\delta^{(n_l)})^T W^{(n_l-1)}_{\cdot i} \frac{\partial}{\partial W^{(n_l-2)}_{ij}}\, a^{(n_l-1)}_i \\
&= (\delta^{(n_l)})^T W^{(n_l-1)}_{\cdot i} \frac{\partial}{\partial W^{(n_l-2)}_{ij}}\, f\big(z^{(n_l-1)}_i\big) \\
&= (\delta^{(n_l)})^T W^{(n_l-1)}_{\cdot i} \frac{\partial}{\partial W^{(n_l-2)}_{ij}}\, f\big(W^{(n_l-2)}_{i\cdot}\, a^{(n_l-2)}\big) \\
&= (\delta^{(n_l)})^T W^{(n_l-1)}_{\cdot i}\, f'\big(z^{(n_l-1)}_i\big)\, a^{(n_l-2)}_j \\
&= \Big( \sum_{j=1}^{s_{l+1}} W^{(n_l-1)}_{ji} \delta^{(n_l)}_j \Big) f'\big(z^{(n_l-1)}_i\big)\, a^{(n_l-2)}_j \\
&= \delta^{(n_l-1)}_i a^{(n_l-2)}_j,
\end{aligned} \tag{49–58}$$

where we used in the first line that the top layer is linear. This is a very detailed account of essentially just the chain rule.

So we can write the $\delta$ errors of all layers $l$ (except the top layer) in vector format, using the Hadamard product $\circ$:

$$\delta^{(l)} = \Big( (W^{(l)})^T \delta^{(l+1)} \Big) \circ f'\big(z^{(l)}\big), \tag{59}$$

where the sigmoid derivative from eq. 14 gives $f'(z^{(l)}) = (1 - a^{(l)})\, a^{(l)}$. Using that definition, we get the hidden-layer backprop derivatives:

$$\frac{\partial}{\partial W^{(l)}_{ij}} E_R = a^{(l)}_j \delta^{(l+1)}_i + \lambda W^{(l)}_{ij}, \tag{60}$$

which in one simplified vector notation becomes:

$$\frac{\partial}{\partial W^{(l)}} E_R = \delta^{(l+1)} \big(a^{(l)}\big)^T + \lambda W^{(l)}. \tag{62}$$

In summary, the backprop procedure consists of four steps:

1. Apply an input $x_n$ and forward propagate it through the network to get the hidden and output activations using eq. 18.
2. Evaluate $\delta^{(n_l)}$ for output units using eq. 42.
3. Backpropagate the $\delta$'s to obtain a $\delta^{(l)}$ for each hidden layer in the network using eq. 59.
4. Evaluate the required derivatives with eq. 62 and update all the weights using an optimization procedure such as conjugate gradient or L-BFGS. CG seems to be faster and work better when using mini-batches of training data to estimate the derivatives.

5 Recursive Neural Networks

Same as backprop in the previous section, but splitting error derivatives and noting that the derivatives of the same $W$ at each node can all be added up. Lastly, the deltas from the parent node and possible deltas from a softmax classifier at each node are just added.
BTS: 1) Sum derivatives of all nodes
You can actually assume it's a different W at each node. Intuition via example: if we take separate derivatives of each occurrence, we get the same result as differentiating the shared W (a scalar sketch follows below).
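The worked equation on the slide is an image; a scalar sketch of the same point (my reconstruction): for $y = f\big(W f(Wx)\big)$, treat the two occurrences of $W$ as separate variables $W_2$ (outer) and $W_1$ (inner) and sum their partial derivatives:

$$\frac{\partial y}{\partial W_2} + \frac{\partial y}{\partial W_1} = f'\big(Wf(Wx)\big)\, f(Wx) + f'\big(Wf(Wx)\big)\, W\, f'(Wx)\, x = \frac{dy}{dW},$$

so summing the per-node derivatives of a shared $W$ gives the full derivative.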
BTS: 2) Split derivatives at each node
During forward prop, the parent is computed using the 2 children:

$$p = \tanh\left(W \begin{bmatrix} c_1 \\ c_2 \end{bmatrix} + b\right)$$

Hence, the errors need to be computed wrt each of them, where each child's error is n-dimensional (see the split formula below).

[Figure: parent node 8/3 computed from children c1 = 8/5 and c2 = 3/3.]
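Concretely, mirroring eq. 59 above (a sketch of the split; the slide's own equation is an image):

$$\begin{bmatrix} \delta_{c_1} \\ \delta_{c_2} \end{bmatrix} = \big( W^T \delta_p \big) \circ f'\!\left( \begin{bmatrix} c_1 \\ c_2 \end{bmatrix} \right),$$

where the top $n$ entries of the down-going message belong to $c_1$ and the bottom $n$ to $c_2$.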
BTS: 3) Add error messages
• At each node:
• What came up (fprop) must come down (bprop)
• Total error messages = error messages from parent + error message from own score
[Figure: at node 8/3 (children c1 = 8/5, c2 = 3/3), the incoming error combines the message from the parent with the error from the node's own score.]
BTS Python Code: forwardProp
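The slide's code screenshot is not recoverable; below is a minimal sketch of a recursive forwardProp consistent with the definition above (the node layout with is_leaf/left/right/vector, and the parameters W, b, U from the earlier sketch, are assumptions):

```python
def forward_prop(node):
    """Bottom-up pass: compute a vector and score at every merge node;
    returns (vector, total score of the subtree)."""
    if node.is_leaf:
        return node.vector, 0.0
    left, s_left = forward_prop(node.left)
    right, s_right = forward_prop(node.right)
    node.h = np.tanh(W @ np.concatenate([left, right]) + b)
    node.score = U @ node.h
    return node.h, node.score + s_left + s_right
```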
BTS Python Code: backProp
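Likewise, a sketch of backProp through structure implementing the three BTS rules above (same assumed node layout; grads is a dict of zero-initialized arrays shaped like W, b, U):

```python
def back_prop(node, delta_parent, grads):
    """Top-down pass: delta_parent is the error message arriving from
    the node's parent (zeros at the root)."""
    if node.is_leaf:
        return                                   # word-vector gradients omitted for brevity
    delta_top = delta_parent + U                 # rule 3: parent message + own score error
    delta_z = delta_top * (1.0 - node.h ** 2)    # back through tanh
    kids = np.concatenate([child_vec(node.left), child_vec(node.right)])
    grads["W"] += np.outer(delta_z, kids)        # rule 1: sum shared-W derivatives over nodes
    grads["b"] += delta_z
    grads["U"] += node.h                         # each node's score contributes a U-gradient
    down = W.T @ delta_z                         # rule 2: split between the two children
    n = node.h.shape[0]
    back_prop(node.left, down[:n], grads)
    back_prop(node.right, down[n:], grads)

def child_vec(n):
    # a leaf holds a word vector; an inner node holds its computed h
    return n.vector if n.is_leaf else n.h
```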
Discussion: Simple RNN

• Decent results with a single-matrix TreeRNN
• A single-weight-matrix TreeRNN could capture some phenomena, but it is not adequate for more complex, higher-order composition and parsing long sentences
• There is no real interaction between the input words
• The composition function is the same for all syntactic categories, punctuation, etc.

[Figure: the single shared matrix W composes c1 and c2 into p, which is mapped to the score s.]
Version 2: Syntactically-Untied RNN
• A symbolic Context-Free Grammar (CFG) backbone is adequate for basic syntactic structure
• We use the discrete syntactic categories of the children to choose the composition matrix
• A TreeRNN can do better with a different composition matrix for different syntactic environments
• The result gives us better semantics (in equation form below)
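In equation form (notation borrowed from the CVG paper; an assumption rather than a slide transcription): for a node whose children carry syntactic categories $B$ and $C$,

$$p = \tanh\left( W^{(B,C)} \begin{bmatrix} c_1 \\ c_2 \end{bmatrix} + b \right),$$

i.e. each pair of child categories selects its own composition matrix.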
Compositional Vector Grammars
• Problem: Speed. Every candidate score in beam search needs a matrix-vector product.
• Solution: Compute scores only for a subset of trees coming from a simpler, faster model (PCFG)
  • Prunes very unlikely candidates for speed
  • Provides coarse syntactic categories of the children for each beam candidate
• Compositional Vector Grammar = PCFG + TreeRNN
Related Work for parsing
• The resulting CVG parser is related to previous work that extends PCFG parsers
• Klein and Manning (2003a): manual feature engineering
• Petrov et al. (2006): learning algorithm that splits and merges syntactic categories
• Lexicalized parsers (Collins, 2003; Charniak, 2000): describe each category with a lexical item
• Hall and Klein (2012): combine several such annotation schemes in a factored parser
• CVGs extend these ideas from discrete representations to richer continuous ones
Experiments

• Standard WSJ split, labeled F1
• Based on a simple PCFG with fewer states
• Fast pruning of search space, few matrix-vector products
• 3.8% higher F1, 20% faster than the Stanford factored parser

| Parser | Test, All Sentences |
| --- | --- |
| Stanford PCFG (Klein and Manning, 2003a) | 85.5 |
| Stanford Factored (Klein and Manning, 2003b) | 86.6 |
| Factored PCFGs (Hall and Klein, 2012) | 89.4 |
| Collins (Collins, 1997) | 87.7 |
| SSN (Henderson, 2004) | 89.4 |
| Berkeley Parser (Petrov and Klein, 2007) | 90.1 |
| CVG (RNN) (Socher et al., ACL 2013) | 85.0 |
| CVG (SU-RNN) (Socher et al., ACL 2013) | 90.4 |
| Charniak – Self-Trained (McClosky et al. 2006) | 91.0 |
| Charniak – Self-Trained, Reranked (McClosky et al. 2006) | 92.1 |
SU-RNN / CVG [Socher, Bauer, Manning, Ng 2013]
Learns a soft notion of head words.
Initialization:

[Figure: learned composition matrices for the category pairs NP-CC, NP-PP, PP-NP, PRP$-NP.]
SU-RNN / CVG [Socher, Bauer, Manning, Ng 2013]
[Figure: learned composition matrices for the category pairs ADJP-NP, ADVP-ADJP, JJ-NP, DT-NP.]
Analysis of resulting vector representations
All the figures are adjusted for seasonal variations
1. All the numbers are adjusted for seasonal fluctuations
2. All the figures are adjusted to remove usual seasonal patterns

Knight-Ridder wouldn't comment on the offer
1. Harsco declined to say what country placed the order
2. Coastal wouldn't disclose the terms

Sales grew almost 7% to $UNK m. from $UNK m.
1. Sales rose more than 7% to $94.9 m. from $88.3 m.
2. Sales surged 40% to UNK b. yen from UNK b.
Version 3: Compositionality Through Recursive Matrix-Vector Spaces
One way to make the composition function more powerful was by untying the weights W.

But what if words act mostly as an operator, e.g. "very" in "very good"?

Proposal: A new composition function

Before:

$$p = \tanh\left(W \begin{bmatrix} c_1 \\ c_2 \end{bmatrix} + b\right)$$
Compositionality Through Recursive Matrix-Vector Recursive Neural Networks
Before:

$$p = \tanh\left(W \begin{bmatrix} c_1 \\ c_2 \end{bmatrix} + b\right)$$

Now:

$$p = \tanh\left(W \begin{bmatrix} C_2\, c_1 \\ C_1\, c_2 \end{bmatrix} + b\right)$$
Matrix-vector RNNs [Socher, Huval, Bhat, Manning, & Ng, 2012]
Every node carries both a vector and a matrix. The parent vector p is computed as above, and the parent matrix is composed from the children's matrices A and B (reconstructed from the MV-RNN paper's formulation):

$$P = W_M \begin{bmatrix} A \\ B \end{bmatrix}$$
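A minimal NumPy sketch of the matrix-vector composition, reusing d, rng, W, b from the earlier sketch (the parameter W_M and the d-by-d operator matrices are assumptions):

```python
W_M = rng.standard_normal((d, 2 * d)) * 0.1   # matrix-composition parameter (assumed)

def compose_mv(c1, C1, c2, C2):
    """MV-RNN merge: each child's vector is transformed by the other
    child's operator matrix; the matrices compose linearly."""
    p = np.tanh(W @ np.concatenate([C2 @ c1, C1 @ c2]) + b)
    P = W_M @ np.concatenate([C1, C2], axis=0)  # (d, 2d) @ (2d, d) -> d-by-d parent matrix
    return p, P
```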
Predicting Sentiment Distributions

Good example for non-linearity in language
Classification of Semantic Relationships
• Can an MV-RNN learn how a large syntactic context conveys a semantic relationship?
• My [apartment]e1 has a pretty large [kitchen]e2 → component-whole relationship (e2, e1)
• Build a single compositional semantics for the minimal constituent including both terms
Classification of Semantic Relationships
| Classifier | Features | F1 |
| --- | --- | --- |
| SVM | POS, stemming, syntactic patterns | 60.1 |
| MaxEnt | POS, WordNet, morphological features, noun compound system, thesauri, Google n-grams | 77.6 |
| SVM | POS, WordNet, prefixes, morphological features, dependency parse features, Levin classes, PropBank, FrameNet, NomLex-Plus, Google n-grams, paraphrases, TextRunner | 82.2 |
| RNN | – | 74.8 |
| MV-RNN | – | 79.1 |
| MV-RNN | POS, WordNet, NER | 82.4 |
Scene Parsing
• The meaning of a scene image is also a function of smaller regions,
• how they combine as parts to form larger objects,
• and how the objects interact.

Similar principle of compositionality.
Algorithm for Parsing Images
Same Recursive Neural Network as for natural language parsing! (Socher et al. ICML 2011)

[Figure: "Parsing Natural Scene Images" – image segments (grass, tree, people, building) whose features are composed into semantic representations.]
Multi-class segmentation
| Method | Accuracy |
| --- | --- |
| Pixel CRF (Gould et al., ICCV 2009) | 74.3 |
| Classifier on superpixel features | 75.9 |
| Region-based energy (Gould et al., ICCV 2009) | 76.4 |
| Local labelling (Tighe & Lazebnik, ECCV 2010) | 76.9 |
| Superpixel MRF (Tighe & Lazebnik, ECCV 2010) | 77.5 |
| Simultaneous MRF (Tighe & Lazebnik, ECCV 2010) | 77.5 |
| Recursive Neural Network | 78.1 |

Stanford Background Dataset (Gould et al. 2009)
Next lecture
• Model overview, comparison, extensions, combinations, etc.