Natural Language Processing with Deep Learning
CS224N/Ling284
Lecture 4: Word Window Classification and Neural Networks
Richard Socher
Organization
• Main midterm: Feb 13
• Alternative midterm: Friday Feb 9 – tough choice
• Final project and PSet 4 can both have groups of 3.
• Highly discouraged (especially for PSet 4) and need more results, so higher pressure for you to coordinate and deliver.
• Python review session tomorrow (Friday) 3-4:20pm at the nVidia auditorium
• Coding session: next Monday, 6-9pm
• Project website is up
Overview Today:
• Classification background
• Updating word vectors for classification
• Window classification & cross-entropy error derivation tips
• A single layer neural network!
• Max-margin loss and backprop
• This will be a tough lecture for some! → Come to office hours
Classification setup and notation
• Generally we have a training dataset consisting of samples
  {x_i, y_i}, i = 1…N
• x_i are inputs, e.g. words (indices or vectors!), context windows, sentences, documents, etc.
  • Dimension d
• y_i are labels (one of C classes) we try to predict, for example:
  • classes: sentiment, named entities, buy/sell decision
  • other words
  • later: multi-word sequences
Classification intuition
• Training data: {x_i, y_i}, i = 1…N
• Simple illustration case:
  • Fixed 2D word vectors to classify
  • Using logistic regression
  • Linear decision boundary
• General ML approach: assume x_i are fixed, train logistic regression weights (only modifies the decision boundary)
• Goal: For each x, predict:
  p(y|x) = exp(W_y· x) / Σ_{c=1..C} exp(W_c· x)
• Visualizations with ConvNetJS by Karpathy: http://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html
Details of the softmax classifier
We can tease apart the prediction function p(y|x) = exp(f_y) / Σ_c exp(f_c) into two steps:
1. Take the y'th row of W and multiply that row with x:
   f_y = W_y· x = Σ_{j=1..d} W_yj x_j
   Compute all f_c for c = 1, …, C
2. Apply the softmax function to get the normalized probability:
   p(y|x) = softmax(f)_y = exp(f_y) / Σ_{c=1..C} exp(f_c)
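A minimal numpy sketch of these two steps (toy dimensions and random weights below are illustrative, not from the lecture):

```python
import numpy as np

def softmax(f):
    # Shift by the max for numerical stability; the probabilities are unchanged.
    exp_f = np.exp(f - np.max(f))
    return exp_f / exp_f.sum()

C, d = 4, 10                          # toy number of classes and input dimension
rng = np.random.default_rng(0)
W = rng.standard_normal((C, d))       # one row of weights per class
x = rng.standard_normal(d)            # input word vector

f = W @ x                             # step 1: unnormalized scores f_c = W_c . x
p = softmax(f)                        # step 2: normalized probabilities, sums to 1
```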
Training with softmax and cross-entropy error
• For each training example {x, y}, our objective is to maximize the probability of the correct class y
• Hence, we minimize the negative log probability of that class:
  J = −log p(y|x) = −log ( exp(f_y) / Σ_{c=1..C} exp(f_c) )
Background: Why "cross entropy" error
• Assume a ground-truth (or gold or target) probability distribution that is 1 at the right class and 0 everywhere else: p = [0, …, 0, 1, 0, …, 0], and that our computed probability is q. Then the cross entropy is:
  H(p, q) = −Σ_{c=1..C} p(c) log q(c)
• Because of the one-hot p, the only term left is the negative log probability of the true class
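A tiny self-contained check of this reduction (the probabilities and class index below are made up for illustration):

```python
import numpy as np

q = np.array([0.1, 0.2, 0.6, 0.1])     # model probabilities (made-up values)
y = 2                                   # index of the true class
p = np.zeros_like(q)
p[y] = 1.0                              # one-hot ground-truth distribution

cross_entropy = -np.sum(p * np.log(q))  # H(p, q) = -sum_c p(c) log q(c)
assert np.isclose(cross_entropy, -np.log(q[y]))   # reduces to -log q(true class)
```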
Side note: The KL divergence
• Cross entropy can be re-written in terms of the entropy and the Kullback-Leibler divergence between the two distributions:
  H(p, q) = H(p) + D_KL(p‖q)
• Because H(p) is zero in our case (and even if it weren't, it would be fixed and have no contribution to the gradient), minimizing this is equal to minimizing the KL divergence between p and q
• The KL divergence is not a distance but a non-symmetric measure of the difference between two probability distributions p and q:
  D_KL(p‖q) = Σ_{c=1..C} p(c) log ( p(c) / q(c) )
Classification over a full dataset
• Cross-entropy loss function over the full dataset {x_i, y_i}, i = 1…N:
  J(θ) = (1/N) Σ_{i=1..N} −log ( exp(f_{y_i}) / Σ_{c=1..C} exp(f_c) )
• Instead of writing f_y = f_y(x) = W_y· x = Σ_{j=1..d} W_yj x_j
• We will write f in matrix notation: f = Wx
  • We can still index elements of it based on class
Classification: Regularization!
• The really full loss function in practice includes regularization over all parameters θ:
  J(θ) = (1/N) Σ_{i=1..N} −log ( exp(f_{y_i}) / Σ_{c=1..C} exp(f_c) ) + λ Σ_k θ_k²
• Regularization prevents overfitting when we have a lot of features (or, later, a very powerful/deep model, etc.)
(Figure: training vs. test error as a function of model power, illustrating overfitting)
Details: General ML optimization
• For general machine learning, θ usually consists only of the columns of W:
  θ = [W_·1, …, W_·d] = W(:) ∈ R^{Cd}
• So we only update the decision boundary (visualizations with ConvNetJS by Karpathy)
Classification difference with word vectors
• Common in deep learning:
  • Learn both W and the word vectors x, so θ also contains a d-dimensional vector for every word in the vocabulary
• Very large! Danger of overfitting!
A pitfall when retraining word vectors
• Setting: We are training a logistic regression classification model for movie review sentiment using single words.
• In the training data we have "TV" and "telly"
• In the testing data we have "television"
• The pre-trained word vectors have all three similar:
(Figure: "TV", "telly", and "television" close together in vector space)
• Question: What happens when we retrain the word vectors?
A pitfall when retraining word vectors
• Question: What happens when we train the word vectors?
• Answer:
  • Those that are in the training data move around
    • "TV" and "telly"
  • Words not in the training data stay where they were
    • "television"
(Figure: "TV" and "telly" have moved across the decision boundary while "television" stays behind)
• This is bad!
So what should I do?
• Question: Should I train my own word vectors?
• Answer:
  • If you only have a small training dataset, don't train the word vectors.
  • If you have a very large dataset, it may work better to train word vectors to the task.
(Figure: same "TV" / "telly" / "television" illustration as before)
Side note on word vectors notation
• The word vector matrix L is also called the lookup table
• Word vectors = word embeddings = word representations (mostly)
• Usually obtained from methods like word2vec or GloVe
  L is a d × V matrix with one column per vocabulary word: aardvark, a, …, meta, …, zebra
• These are the word features x_word from now on
• New development (later in the class): character models
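A tiny illustration of the lookup-table idea (toy vocabulary and random vectors, purely for notation):

```python
import numpy as np

vocab = ["aardvark", "a", "meta", "zebra", "museums"]   # toy vocabulary
d = 4                                                   # embedding dimension
rng = np.random.default_rng(0)
L = rng.standard_normal((d, len(vocab)))                # lookup table: one column per word

word_to_index = {w: i for i, w in enumerate(vocab)}
x_museums = L[:, word_to_index["museums"]]              # the word feature x_word
```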
Window classification
• Classifying single words is rarely done.
• Interesting problems like ambiguity arise in context!
• Example: auto-antonyms:
  • "To sanction" can mean "to permit" or "to punish."
  • "To seed" can mean "to place seeds" or "to remove seeds."
• Example: ambiguous named entities:
  • Paris → Paris, France vs. Paris Hilton
  • Hathaway → Berkshire Hathaway vs. Anne Hathaway
Window classification
• Idea: classify a word in its context window of neighboring words.
• For example, Named Entity Recognition is a 4-way classification task:
  • Person, Location, Organization, None
• There are many ways to classify a single word in context
  • For example: average all the words in a window
  • Problem: that would lose position information
Window classification
• Train a softmax classifier to classify a center word by taking the concatenation of all word vectors surrounding it
• Example: Classify "Paris" in the context of this sentence with window length 2:
  … museums in Paris are amazing …
  x_window = [x_museums  x_in  x_Paris  x_are  x_amazing]^T
• Resulting vector x_window = x ∈ R^{5d}, a column vector!
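A sketch of building x_window by concatenation (names and random toy vectors are illustrative):

```python
import numpy as np

d = 4
rng = np.random.default_rng(0)
vocab = ["museums", "in", "Paris", "are", "amazing"]
L = {w: rng.standard_normal(d) for w in vocab}        # toy word vectors

window = ["museums", "in", "Paris", "are", "amazing"] # center word "Paris", window length 2
x_window = np.concatenate([L[w] for w in window])     # concatenated vector in R^{5d}
assert x_window.shape == (5 * d,)
```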
Simplest window classifier: Softmax
• With x = x_window we can use the same softmax classifier as before; its output ŷ is the predicted model output probability:
  ŷ = p(y|x) = exp(W_y· x) / Σ_c exp(W_c· x)
• With cross-entropy error as before:
  J = −log p(y|x)
• But how do you update the word vectors?
Deriving gradients for window model
• Short answer: Just take derivatives as before
• Long answer: Let's go over the steps together (helpful for PSet 1)
• Define:
  • ŷ: softmax probability output vector (see previous slide)
  • t: target probability distribution (all 0's except at the ground-truth index of class y, where it's 1)
  • f = f(x) = Wx, and f_c = c'th element of the f vector
• Hard the first time, hence some tips now :)
Deriving gradients for window model
• Tip 1: Carefully define your variables and keep track of their dimensionality!
• Tip 2: Chain rule! If y = f(u) and u = g(x), i.e. y = f(g(x)), then:
  dy/dx = (dy/du)(du/dx)
• Example repeated:
Deriving gradients for window model
• Tip 2 continued: Know thy chain rule
  • Don't forget which variables depend on what, and that x appears inside all elements of f
• Tip 3: For the softmax part of the derivative: first take the derivative wrt f_c when c = y (the correct class), then take the derivative wrt f_c when c ≠ y (all the incorrect classes)
Deriving gradients for window model
• Tip 4: When you take derivatives wrt one element of f, try to see if you can create a gradient in the end that includes all partial derivatives (here, ∂J/∂f = ŷ − t)
• Tip 5: To later not go insane (and for the implementation!) → write the results in terms of vector operations and define single index-able vectors:
Deriving gradients for window model
• Tip 6: When you start with the chain rule, first use explicit sums and look at partial derivatives of e.g. x_i or W_ij
• Tip 7: To clean it up for even more complex functions later: know the dimensionality of variables & simplify into matrix notation
• Tip 8: Write this out in full sums if it's not clear!
Deriving gradients for window model
• What is the dimensionality of the window vector gradient?
• x is the entire window, 5 d-dimensional word vectors, so the derivative wrt x has to have the same dimensionality:
  ∂J/∂x ∈ R^{5d}
Deriving gradients for window model
• The gradient that arrives at and updates the word vectors can simply be split up for each word vector:
• Let δ_window = ∂J/∂x_window ∈ R^{5d}
• With x_window = [x_museums  x_in  x_Paris  x_are  x_amazing]
• We have δ_window = [∇x_museums, ∇x_in, ∇x_Paris, ∇x_are, ∇x_amazing], i.e. one d-dimensional update per window word
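A sketch of this for the softmax window classifier (the cross-entropy gradient wrt f is ŷ − t, so the gradient reaching the window vector is Wᵀ(ŷ − t), then split per word; all sizes and names below are illustrative):

```python
import numpy as np

C, d = 4, 4                                 # toy classes and word-vector dimension
rng = np.random.default_rng(0)
W = rng.standard_normal((C, 5 * d))         # softmax weights over the whole window
x_window = rng.standard_normal(5 * d)
y = 1                                       # true class index

f = W @ x_window
y_hat = np.exp(f - f.max()); y_hat /= y_hat.sum()   # softmax output
t = np.zeros(C); t[y] = 1.0                         # one-hot target

grad_x_window = W.T @ (y_hat - t)           # gradient of J wrt the whole window, in R^{5d}
grad_per_word = np.split(grad_x_window, 5)  # one d-dimensional update per window word
```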
Deriving gradients for window model
• This will push word vectors into areas such that they will be helpful in determining named entities.
• For example, the model can learn that seeing x_in as the word just before the center word is indicative of the center word being a location
What's missing for training the window model?
• The gradient of J wrt the softmax weights W!
• Similar steps: write down the partial wrt W_ij first!
• Then we have the full ∂J/∂W
A note on matrix implementations
• There are two expensive operations in the softmax classifier:
  • The matrix multiplication and the exp
• A large matrix multiplication is always more efficient than a for loop!
• Example code on the next slide →
A note on matrix implementations
• Looping over word vectors versus concatenating them all into one large matrix and then multiplying the softmax weights with that matrix:
  • 1000 loops, best of 3: 639 µs per loop (word-by-word loop)
  • 10000 loops, best of 3: 53.8 µs per loop (single matrix multiplication)
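A sketch in the spirit of that comparison (toy sizes; exact timings will vary by machine and are not the slide's numbers):

```python
import numpy as np
import timeit

C, d, N = 5, 300, 500
rng = np.random.default_rng(0)
W = rng.standard_normal((C, d))                         # softmax weights
wordvectors_list = [rng.standard_normal(d) for _ in range(N)]
wordvectors_matrix = np.column_stack(wordvectors_list)  # d x N matrix of word vectors

def loop_version():
    return [W @ x for x in wordvectors_list]            # one small multiply per word

def matrix_version():
    return W @ wordvectors_matrix                       # one big C x N multiply

print("loop:  ", timeit.timeit(loop_version, number=100))
print("matrix:", timeit.timeit(matrix_version, number=100))   # typically much faster
```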
A note on matrix implementations
• The result of the faster method is a C × N matrix:
  • Each column is an f(x) in our notation (unnormalized class scores)
• You should speed-test your code a lot too
• Tl;dr: Matrices are awesome! Matrix multiplication is better than a for loop
Softmax (= logistic regression) alone not very powerful
• Softmax only gives linear decision boundaries in the original space.
• With little data that can be a good regularizer
• With more data it is very limiting!
Softmax (= logistic regression) is not very powerful
• Softmax gives only linear decision boundaries
• → Unhelpful when the problem is complex
• Wouldn't it be cool to get these correct?
Neural Nets for the Win!
• Neural networks can learn much more complex functions and nonlinear decision boundaries!
From logistic regression to neural nets
Demystifying neural networks
Neural networks come with their own terminological baggage.
But if you understand how softmax models work, then you already understand the operation of a basic neuron!
A single neuron: a computational unit with n (here 3) inputs and 1 output, and parameters W, b
(Figure: inputs feed through an activation function to the output; the bias unit corresponds to the intercept term)
A neuron is essentially a binary logistic regression unit
  h_{w,b}(x) = f(w^T x + b)
  f(z) = 1 / (1 + e^{−z})
• w, b are the parameters of this neuron, i.e., this logistic regression model
• b: We can have an "always on" feature, which gives a class prior, or separate it out, as a bias term
• f = nonlinear activation fct. (e.g. sigmoid), w = weights, b = bias, h = hidden, x = inputs
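A minimal sketch of one such neuron (random parameters, purely illustrative):

```python
import numpy as np

def f(z):
    # logistic (sigmoid) activation
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.standard_normal(3)     # n = 3 inputs
w = rng.standard_normal(3)     # weights
b = 0.1                        # bias (intercept) term

h = f(w @ x + b)               # h_{w,b}(x) = f(w^T x + b), a value in (0, 1)
```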
A neural network = running several logistic regressions at the same time
If we feed a vector of inputs through a bunch of logistic regression functions, then we get a vector of outputs…
But we don't have to decide ahead of time what variables these logistic regressions are trying to predict!
A neural network = running several logistic regressions at the same time
…which we can feed into another logistic regression function.
It is the loss function that will direct what the intermediate hidden variables should be, so as to do a good job at predicting the targets for the next layer, etc.
A neural network = running several logistic regressions at the same time
Before we know it, we have a multilayer neural network….
Matrix notation for a layer
We have
  a_1 = f(W_11 x_1 + W_12 x_2 + W_13 x_3 + b_1)
  a_2 = f(W_21 x_1 + W_22 x_2 + W_23 x_3 + b_2)
  etc.
In matrix notation:
  z = Wx + b
  a = f(z)
where f is applied element-wise:
  f([z_1, z_2, z_3]) = [f(z_1), f(z_2), f(z_3)]
(Figure: a layer with activations a_1, a_2, a_3, weights such as W_12, and biases such as b_3)
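The same forward pass in a few lines of numpy (toy sizes; f applied element-wise):

```python
import numpy as np

def f(z):
    return 1.0 / (1.0 + np.exp(-z))    # element-wise sigmoid

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 3))        # 3 neurons, 3 inputs
b = rng.standard_normal(3)
x = rng.standard_normal(3)

z = W @ x + b                          # z = Wx + b
a = f(z)                               # a = f(z), applied element-wise
```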
Non-linearities (aka "f"): Why they're needed
• Example: function approximation, e.g., regression or classification
  • Without non-linearities, deep neural networks can't do anything more than a linear transform
  • Extra layers could just be compiled down into a single linear transform: W_1 W_2 x = Wx
  • With more layers, they can approximate more complex functions!
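A quick numerical check of the "compiled down" claim (random matrices, illustrative only): two stacked linear layers with no non-linearity equal one linear layer.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 6))
W2 = rng.standard_normal((6, 5))
x = rng.standard_normal(5)

two_layers = W1 @ (W2 @ x)      # two linear "layers" without a non-linearity
one_layer = (W1 @ W2) @ x       # a single equivalent linear transform W = W1 W2
assert np.allclose(two_layers, one_layer)
```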
Binary classification with unnormalized scores
• Revisiting our previous example:
  x_window = [x_museums  x_in  x_Paris  x_are  x_amazing]
• Assume we want to classify whether the center word is a Location (Named Entity Recognition)
• Similar to word2vec, we will go over all positions in a corpus. But this time, it will be supervised and only some positions should get a high score.
• The positions that have an actual NER location in their center are called "true" positions.
Binary classification for NER
• Example: Not all museums in Paris are amazing.
• Here: one true window, the one with Paris in its center, and all other windows are "corrupt" in terms of not having a named entity location in their center.
• "Corrupt" windows are easy to find and there are many: any window that isn't specifically labeled as an NER location in our corpus
A Single Layer Neural Network
• A single layer is a combination of a linear layer and a nonlinearity:
  z = Wx + b
  a = f(z)
• The neural activations a can then be used to compute some output.
• For instance, a probability via softmax: p(y|x) = softmax(Wa)
• Or an unnormalized score (even simpler): s = U^T a
Summary: Feed-forward Computation
We compute a window's score with a 3-layer neural net:
• s = score("museums in Paris are amazing")
  s = U^T f(Wx + b)
  x_window = [x_museums  x_in  x_Paris  x_are  x_amazing]
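A sketch of this feed-forward computation (toy dimensions and random parameters, not the lecture's values):

```python
import numpy as np

def f(z):
    return 1.0 / (1.0 + np.exp(-z))     # element-wise sigmoid

d, hidden = 4, 8                         # toy word-vector and hidden dimensions
rng = np.random.default_rng(0)
x_window = rng.standard_normal(5 * d)    # [x_museums x_in x_Paris x_are x_amazing]
W = rng.standard_normal((hidden, 5 * d))
b = rng.standard_normal(hidden)
U = rng.standard_normal(hidden)

a = f(W @ x_window + b)                  # hidden layer activations
s = U @ a                                # unnormalized window score s = U^T f(Wx + b)
```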
Main intuition for extra layer
The layer learns non-linear interactions between the input word vectors.
Example: only if "museums" is the first vector should it matter that "in" is in the second position
  x_window = [x_museums  x_in  x_Paris  x_are  x_amazing]
The max-margin loss
• Idea for training objective: Make the true window's score larger and the corrupt window's score lower (until they're good enough); minimize
  J = max(0, 1 − s + s_c)
• s = score(museums in Paris are amazing)
• s_c = score(Not all museums in Paris)
• This is not differentiable everywhere but it is continuous (each option of the max is continuous) → we can use SGD.
Max-margin loss
• Objective for a single window:
  J = max(0, 1 − s + s_c)
• Each window with an NER location at its center should have a score +1 higher than any window without a location at its center
  xxx |← 1 →| ooo
• For the full objective function: sample several corrupt windows per true one. Sum over all training windows.
• Similar to negative sampling in word2vec
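A sketch of the single-window objective (the scores passed in below are just example numbers):

```python
def max_margin_loss(s_true, s_corrupt):
    # J = max(0, 1 - s + s_c): zero once the true window outscores the corrupt one by 1
    return max(0.0, 1.0 - s_true + s_corrupt)

print(max_margin_loss(s_true=3.0, s_corrupt=0.5))   # 0.0, margin already satisfied
print(max_margin_loss(s_true=0.2, s_corrupt=0.6))   # 1.4, this window still contributes
```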
Deriving gradients for backprop
Assuming cost J is > 0, compute the derivatives of s and s_c wrt all the involved variables: U, W, b, x
Deriving gradients for backprop
• For this function: s = U^T f(Wx + b)
• Let's consider the derivative of a single weight W_ij
• W_ij only appears inside a_i
• For example: W_23 is only used to compute a_2, not a_1
(Figure: a two-neuron network with inputs x_1, x_2, x_3 and bias +1, activations f(z_1) = a_1 and a_2 = f(z_2), score s, highlighting U_2, W_23 and b_2)
Deriving gradients for backprop
Derivative of a single weight W_ij:
  ∂s/∂W_ij = ∂(U^T a)/∂W_ij
           = U_i ∂a_i/∂W_ij                              (ignore constants in which W_ij doesn't appear; pull out U_i since it's constant, apply chain rule)
           = U_i ∂f(z_i)/∂W_ij = U_i f'(z_i) ∂z_i/∂W_ij  (apply the definition of a; just define the derivative of f as f')
           = U_i f'(z_i) x_j                              (plug in the definition of z_i = W_i· x + b_i)
where for logistic f: f'(z) = f(z)(1 − f(z))
(Figure: the same two-neuron network as before)
Deriving gradients for backprop
Derivative of a single weight W_ij continued:
  ∂s/∂W_ij = U_i f'(z_i) x_j = δ_i x_j
  where δ_i = U_i f'(z_i) is the local error signal and x_j is the local input signal
(Figure: the same two-neuron network as before)
• We want all combinations of i = 1, 2 and j = 1, 2, 3 → ?
• Solution: Outer product:
  ∂s/∂W = δ x^T
  where δ is the "responsibility" or error signal coming from each activation a
Deriving gradients for backprop
• So far, the derivative of a single W_ij only, but we want the gradient for the full W:
  ∂s/∂W = δ x^T (the outer product collects all the W_ij partials at once)
Deriving gradients for backprop
• How to derive gradients for the biases b?
  ∂s/∂b_i = U_i f'(z_i) = δ_i, so ∂s/∂b = δ
(Figure: the same two-neuron network as before)
Training with Backpropagation
That's almost backpropagation: it's taking derivatives and using the chain rule.
Remaining trick: we can re-use derivatives computed for higher layers when computing derivatives for lower layers!
Example: the last derivatives of the model, the word vectors in x
Training with Backpropagation
• Take the derivative of the score with respect to a single element of a word vector:
  ∂s/∂x_j = Σ_i U_i f'(z_i) W_ij = Σ_i δ_i W_ij
• Now, we cannot just take into consideration one a_i, because each x_j is connected to all the neurons above, and hence x_j influences the overall score through all of them; hence the sum.
• Re-used part of the previous derivative: δ_i = U_i f'(z_i)
Training with Backpropagation
• With ∂s/∂x_j = Σ_i δ_i W_ij, what is the full gradient? → ∂s/∂x = W^T δ
• Observation: The error message δ that arrives at a hidden layer has the same dimensionality as that hidden layer
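Putting the pieces of this derivation into a short sketch (sigmoid f and random toy parameters are assumptions; δ = U ∘ f′(z), ∂s/∂W = δ xᵀ, ∂s/∂b = δ, ∂s/∂x = Wᵀδ), with a numerical spot-check on one entry of W:

```python
import numpy as np

def f(z):      return 1.0 / (1.0 + np.exp(-z))
def fprime(z): return f(z) * (1.0 - f(z))          # derivative of the logistic

rng = np.random.default_rng(0)
hidden, n_in = 2, 3
W = rng.standard_normal((hidden, n_in))
b = rng.standard_normal(hidden)
U = rng.standard_normal(hidden)
x = rng.standard_normal(n_in)

z = W @ x + b
a = f(z)
s = U @ a                                          # forward pass: s = U^T f(Wx + b)

delta = U * fprime(z)                              # error signal: delta_i = U_i f'(z_i)
grad_W = np.outer(delta, x)                        # ds/dW = delta x^T
grad_b = delta                                     # ds/db = delta
grad_x = W.T @ delta                               # ds/dx = W^T delta (re-uses delta)

# Numerical spot-check of one entry of grad_W via a finite difference.
eps = 1e-6
W_pert = W.copy(); W_pert[1, 2] += eps
s_pert = U @ f(W_pert @ x + b)
assert np.isclose((s_pert - s) / eps, grad_W[1, 2], atol=1e-4)
```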
Putting all gradients together:
• Remember: The full objective function for each window was:
  J = max(0, 1 − s + s_c)
• For example, the gradient for just U (non-zero only when 1 − s + s_c > 0):
  ∂J/∂U = −∂s/∂U + ∂s_c/∂U = −a + a_c
Summary
Congrats! Super useful basic components and a real model:
• Word vector training
• Windows
• Softmax and cross-entropy error → PSet 1
• Scores and max-margin loss
• Neural network → PSet 1
Next lecture:
Taking more and deeper derivatives → Full backprop
High-level tips for easier derivations
Then we have all the basic tools in place to learn about and have fun with more complex and deeper models :)