
Page 1:

Natural Language Processing with Deep Learning

CS224N/Ling284

Lecture 4: Word Window Classification and Neural Networks

Richard Socher


Page 2:

Organization

• Main midterm: Feb 13
• Alternative midterm: Friday Feb 9 – tough choice

• Final project and PSet 4 can both have groups of 3.
• Highly discouraged (especially for PSet 4); groups need more results, so higher pressure for you to coordinate and deliver.

• Python review session tomorrow (Friday) 3–4:20pm at the nVidia auditorium

• Coding session: next Monday, 6–9pm

• Project website is up

Page 3:

Overview Today:

• Classification background

• Updating word vectors for classification

• Window classification & cross entropy error derivation tips

• A single layer neural network!

• Max-margin loss and backprop

• This will be a tough lecture for some! → OH

Page 4:

Classification setup and notation

• Generally we have a training dataset consisting of samples $\{x_i, y_i\}_{i=1}^{N}$

• $x_i$ are inputs, e.g. words (indices or vectors!), context windows, sentences, documents, etc.
  • Dimension d

• $y_i$ are labels (one of C classes) we try to predict, for example:
  • classes: sentiment, named entities, buy/sell decision
  • other words
  • later: multi-word sequences

Page 5:

Classification intuition

• Training data: $\{x_i, y_i\}_{i=1}^{N}$

• Simple illustration case:
  • Fixed 2D word vectors to classify
  • Using logistic regression
  • Linear decision boundary

• General ML approach: assume $x_i$ are fixed, train logistic regression weights (only modifies the decision boundary)

• Goal: For each x, predict:
  $p(y \mid x) = \dfrac{\exp(W_{y\cdot}\, x)}{\sum_{c=1}^{C} \exp(W_{c\cdot}\, x)}$

Visualizations with ConvNetJS by Karpathy:
http://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html

Page 6:

Details of the softmax classifier

We can tease apart the prediction function into two steps:

1. Take the y'th row of W and multiply that row with x:
   $W_{y\cdot}\, x = \sum_{i=1}^{d} W_{yi}\, x_i = f_y$
   Compute all $f_c$ for c = 1, …, C

2. Apply the softmax function to get the normalized probability:
   $p(y \mid x) = \dfrac{\exp(f_y)}{\sum_{c=1}^{C} \exp(f_c)} = \mathrm{softmax}(f)_y$
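A minimal NumPy sketch of these two steps (the variable names and toy sizes are my assumptions, not taken from the slides):

```python
import numpy as np

def softmax(f):
    """Normalized probabilities from unnormalized scores f."""
    f = f - np.max(f)            # shift for numerical stability; doesn't change the result
    exp_f = np.exp(f)
    return exp_f / exp_f.sum()

C, d = 4, 5                      # number of classes, input dimension (toy values)
W = np.random.randn(C, d)        # one row of weights per class
x = np.random.randn(d)           # a single d-dimensional input

f = W @ x                        # step 1: f_c = W_c . x for every class c
p = softmax(f)                   # step 2: p(y|x) = exp(f_y) / sum_c exp(f_c)
```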

Page 7:

Training with softmax and cross-entropy error

• For each training example {x, y}, our objective is to maximize the probability of the correct class y

• Hence, we minimize the negative log probability of that class:
  $-\log p(y \mid x) = -\log\!\left(\dfrac{\exp(f_y)}{\sum_{c=1}^{C}\exp(f_c)}\right)$

Page 8:

Background: Why "cross entropy" error

• Assume a ground truth (or gold or target) probability distribution that is 1 at the right class and 0 everywhere else, p = [0, …, 0, 1, 0, …, 0], and let our computed probability be q. Then the cross entropy is:
  $H(p, q) = -\sum_{c=1}^{C} p(c)\,\log q(c)$

• Because p is one-hot, the only term left is the negative log probability of the true class
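A tiny worked example (my numbers): with C = 4 classes, one-hot target p = [0, 0, 1, 0] and computed distribution q = [0.1, 0.2, 0.6, 0.1],

$$H(p, q) = -\sum_{c=1}^{4} p(c)\,\log q(c) = -\log 0.6 \approx 0.51,$$

which is exactly the negative log probability of the true class.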

Page 9:

Side note: The KL divergence

• Cross entropy can be re-written in terms of the entropy and the Kullback–Leibler divergence between the two distributions:
  $H(p, q) = H(p) + D_{KL}(p \,\|\, q)$

• Because H(p) is zero in our case (and even if it weren't, it would be fixed and have no contribution to the gradient), minimizing this is equal to minimizing the KL divergence between p and q

• The KL divergence is not a distance but a non-symmetric measure of the difference between two probability distributions p and q
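For completeness, the definition being referred to is

$$D_{KL}(p \,\|\, q) = \sum_{c=1}^{C} p(c)\,\log\frac{p(c)}{q(c)},$$

which is zero only when p = q and is not symmetric in p and q.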

Page 10:

Classification over a full dataset

• Cross entropy loss function over the full dataset $\{x_i, y_i\}_{i=1}^{N}$:
  $J(\theta) = \dfrac{1}{N}\sum_{i=1}^{N} -\log\!\left(\dfrac{\exp(f_{y_i})}{\sum_{c=1}^{C}\exp(f_c)}\right)$

• Instead of $f_y = W_{y\cdot}\, x = \sum_{j=1}^{d} W_{yj}\, x_j$

• We will write f in matrix notation: $f = Wx$
  • We can still index elements of it based on class

Page 11:

Classification: Regularization!

• The really full loss function in practice includes regularization over all parameters θ:

• Regularization prevents overfitting when we have a lot of features (or later a very powerful/deep model, ++)

[Figure: training vs. test error as a function of model power, illustrating overfitting]
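One common concrete form (my assumption here; the slide only says "regularization over all parameters θ") is L2 regularization added to the cross entropy loss:

$$J(\theta) = \frac{1}{N}\sum_{i=1}^{N} -\log\frac{\exp(f_{y_i})}{\sum_{c=1}^{C}\exp(f_c)} \;+\; \lambda \sum_{k} \theta_k^{2}.$$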

Page 12:

Details: General ML optimization

• For general machine learning, θ usually only consists of the columns of W:

• So we only update the decision boundary

Visualizations with ConvNetJS by Karpathy

Page 13:

Classification difference with word vectors

• Common in deep learning:
  • Learn both W and the word vectors x

Very large! Danger of overfitting!

Page 14:

A pitfall when retraining word vectors

• Setting: We are training a logistic regression classification model for movie review sentiment using single words.
• In the training data we have "TV" and "telly"
• In the testing data we have "television"
• The pre-trained word vectors have all three similar:

[Figure: "TV", "telly", and "television" close together in the vector space]

• Question: What happens when we retrain the word vectors?

Page 15:

A pitfall when retraining word vectors

• Question: What happens when we train the word vectors?
• Answer:
  • Those that are in the training data move around
    • "TV" and "telly"
  • Words not in the training data stay
    • "television"

[Figure: "TV" and "telly" have moved, while "television" stays where it was]

This is bad!

Page 16:

So what should I do?

• Question: Should I train my own word vectors?
• Answer:
  • If you only have a small training dataset, don't train the word vectors.
  • If you have a very large dataset, it may work better to train word vectors to the task.

[Figure: the same "TV" / "telly" / "television" picture as before]

Page 17:

Side note on word vectors notation

• The word vector matrix L is also called the lookup table
• Word vectors = word embeddings = word representations (mostly)
• Usually obtained from methods like word2vec or GloVe

[Figure: L is a d × V matrix with one column per vocabulary word: aardvark, a, …, meta, …, zebra]

• These are the word features x_word from now on

• New development (later in the class): character models

Page 18:

Window classification

• Classifying single words is rarely done.

• Interesting problems like ambiguity arise in context!

• Example: auto-antonyms:
  • "To sanction" can mean "to permit" or "to punish."
  • "To seed" can mean "to place seeds" or "to remove seeds."

• Example: ambiguous named entities:
  • Paris → Paris, France vs. Paris Hilton
  • Hathaway → Berkshire Hathaway vs. Anne Hathaway

Page 19:

Window classification

• Idea: classify a word in its context window of neighboring words.

• For example, Named Entity Recognition is a 4-way classification task:
  • Person, Location, Organization, None

• There are many ways to classify a single word in context
  • For example: average all the words in a window
  • Problem: that would lose position information

Page 20:

Window classification

• Train a softmax classifier to classify a center word by taking the concatenation of all word vectors surrounding it

• Example: Classify "Paris" in the context of this sentence with window length 2:

  … museums in Paris are amazing …

  $x_{window} = [\, x_{museums}\;\; x_{in}\;\; x_{Paris}\;\; x_{are}\;\; x_{amazing}\,]^{T}$

• Resulting vector $x_{window} = x \in \mathbb{R}^{5d}$, a column vector!
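A sketch of building x_window by concatenation (the lookup table L, the word2idx mapping, and the toy sizes are my assumptions):

```python
import numpy as np

d, V = 4, 10                        # embedding dimension, vocabulary size (toy values)
L = np.random.randn(d, V)           # lookup table: one d-dimensional column per word
word2idx = {w: i for i, w in enumerate(
    ["not", "all", "museums", "in", "Paris", "are", "amazing", ".", "<s>", "</s>"])}

def window_vector(words, center, m=2):
    """Concatenate the vectors of the center word and its m neighbors on each side."""
    idxs = [word2idx[w] for w in words[center - m: center + m + 1]]
    return np.concatenate([L[:, i] for i in idxs])   # shape (5d,) for m = 2

x_window = window_vector(["museums", "in", "Paris", "are", "amazing"], center=2)
assert x_window.shape == (5 * d,)
```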

Page 21:

Simplest window classifier: Softmax

• With x = x_window we can use the same softmax classifier as before (same predicted model output probability):
  $p(y \mid x) = \dfrac{\exp(f_y)}{\sum_{c=1}^{C}\exp(f_c)}$

• With cross entropy error as before:
  $J = -\log p(y \mid x)$

• But how do you update the word vectors?

Page 22:

Deriving gradients for window model

• Short answer: Just take derivatives as before

• Long answer: Let's go over the steps together (helpful for PSet 1)

• Define:
  • $\hat{y}$: softmax probability output vector (see previous slide)
  • $t$: target probability distribution (all 0's except at the ground-truth index of class y, where it's 1)
  • $f = f(x) = Wx \in \mathbb{R}^{C}$, and $f_c$ = c'th element of the f vector

• Hard the first time, hence some tips now :)

Page 23:

Deriving gradients for window model

• Tip 1: Carefully define your variables and keep track of their dimensionality!

• Tip 2: Chain rule! If y = f(u) and u = g(x), i.e. y = f(g(x)), then:
  $\dfrac{\partial y}{\partial x} = \dfrac{\partial y}{\partial u}\,\dfrac{\partial u}{\partial x}$

• Example repeated:

Page 24:

Deriving gradients for window model

• Tip 2 continued: Know thy chain rule
  • Don't forget which variables depend on what, and that x appears inside all elements of f

• Tip 3: For the softmax part of the derivative: First take the derivative wrt $f_c$ when c = y (the correct class), then take the derivative wrt $f_c$ when c ≠ y (all the incorrect classes)

Page 25:

Deriving gradients for window model

• Tip 4: When you take the derivative wrt one element of f, try to see if you can create a gradient in the end that includes all partial derivatives:

• Tip 5: To later not go insane (and for the implementation!) → write the results in terms of vector operations and define single index-able vectors:

Page 26:

Deriving gradients for window model

• Tip 6: When you start with the chain rule, first use explicit sums and look at partial derivatives of e.g. $x_i$ or $W_{ij}$

• Tip 7: To clean it up for even more complex functions later: Know the dimensionality of variables & simplify into matrix notation

• Tip 8: Write this out in full sums if it's not clear!

Page 27:

Deriving gradients for window model

• What is the dimensionality of the window vector gradient?

• x is the entire window, 5 d-dimensional word vectors, so the derivative wrt x has to have the same dimensionality:
  $\dfrac{\partial J}{\partial x} \in \mathbb{R}^{5d}$
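Carrying the tips through (a standard result, written here with $\hat{y}$ for the softmax output and $t$ for the one-hot target as defined above; this completion is mine, not copied from the slides):

$$\delta \;=\; \frac{\partial J}{\partial f} \;=\; \hat{y} - t \;\in\; \mathbb{R}^{C}, \qquad \frac{\partial J}{\partial x} \;=\; W^{T}\delta \;\in\; \mathbb{R}^{5d}.$$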

Page 28:

Deriving gradients for window model

• The gradient that arrives at and updates the word vectors can simply be split up for each word vector:

• Let $\delta_{window} = \dfrac{\partial J}{\partial x_{window}} \in \mathbb{R}^{5d}$
• With $x_{window} = [\, x_{museums}\;\; x_{in}\;\; x_{Paris}\;\; x_{are}\;\; x_{amazing}\,]$

• We have one d-dimensional gradient chunk per word in the window; a sketch follows below.
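A sketch of splitting the 5d window gradient into five per-word updates (the embedding matrix L, the index mapping, and the step size alpha are assumptions):

```python
import numpy as np

d = 4
L = np.random.randn(d, 10)                        # toy lookup table, one column per word
word2idx = {"museums": 0, "in": 1, "Paris": 2, "are": 3, "amazing": 4}
alpha = 0.01                                      # SGD step size
grad_window = np.random.randn(5 * d)              # stands in for dJ/dx_window

# Each word in the window receives the d-dimensional slice of the gradient
# that corresponds to its position in the concatenation.
for k, w in enumerate(["museums", "in", "Paris", "are", "amazing"]):
    L[:, word2idx[w]] -= alpha * grad_window[k * d:(k + 1) * d]
```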

Page 29:

Deriving gradients for window model

• This will push word vectors into areas such that they will be helpful in determining named entities.

• For example, the model can learn that seeing $x_{in}$ as the word just before the center word is indicative of the center word being a location

Page 30:

What's missing for training the window model?

• The gradient of J wrt the softmax weights W!

• Similar steps: write down the partial wrt $W_{ij}$ first!
• Then we have the full $\dfrac{\partial J}{\partial W}$
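For reference, the same steps give the softmax-weight gradient in outer-product form (my completion, using the $\hat{y}$, $t$ notation from above):

$$\frac{\partial J}{\partial W} \;=\; (\hat{y} - t)\, x^{T} \;\in\; \mathbb{R}^{C \times 5d}.$$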

Page 31:

A note on matrix implementations

• There are two expensive operations in the softmax classifier:
  • The matrix multiplication and the exp

• A large matrix multiplication is always more efficient than a for loop!

• Example code on the next slide →

Page 32:

A note on matrix implementations

• Looping over word vectors, instead of concatenating them all into one large matrix and then multiplying the softmax weights with that matrix:

• 1000 loops, best of 3: 639 µs per loop
  10000 loops, best of 3: 53.8 µs per loop
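A sketch of the two approaches being compared (names, shapes, and sizes are my assumptions, not the lecture's notebook):

```python
import numpy as np

C, d, N = 5, 300, 500                      # classes, window-vector dimension, number of windows
W = np.random.randn(C, d)                  # softmax weights
word_vectors_list = [np.random.randn(d, 1) for _ in range(N)]

def scores_with_loop():
    # Slow: one small matrix-vector product per window.
    return [W @ x for x in word_vectors_list]

def scores_with_one_matmul():
    # Fast: concatenate all windows into one d x N matrix, then a single product.
    X = np.hstack(word_vectors_list)       # shape (d, N)
    return W @ X                           # shape (C, N): one column of scores per window

# In IPython, the two can be compared with %timeit scores_with_loop()
# and %timeit scores_with_one_matmul().
```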

Page 33:

A note on matrix implementations

• The result of the faster method is a C × N matrix:

• Each column is an f(x) in our notation (unnormalized class scores)

• You should speed-test your code a lot too

• tl;dr: Matrices are awesome! Matrix multiplication is better than a for loop

Page 34:

Softmax (= logistic regression) alone is not very powerful

• Softmax only gives linear decision boundaries in the original space.

• With little data that can be a good regularizer

• With more data it is very limiting!

Page 35:

Softmax (= logistic regression) is not very powerful

• Softmax gives only linear decision boundaries

• → Unhelpful when the problem is complex

• Wouldn't it be cool to get these correct?

Page 36:

Neural Nets for the Win!

• Neural networks can learn much more complex functions and nonlinear decision boundaries!

Page 37:

From logistic regression to neural nets

Page 38:

Demystifying neural networks

Neural networks come with their own terminological baggage

But if you understand how softmax models work

Then you already understand the operation of a basic neuron!

A single neuron: a computational unit with n (= 3) inputs and 1 output, and parameters W, b

[Figure: a neuron with its inputs, a bias unit (corresponding to the intercept term), an activation function, and an output]

Page 39:

A neuron is essentially a binary logistic regression unit

$h_{w,b}(x) = f(w^{T}x + b)$

$f(z) = \dfrac{1}{1 + e^{-z}}$

w, b are the parameters of this neuron, i.e., this logistic regression model

b: We can have an "always on" feature, which gives a class prior, or separate it out, as a bias term

f = nonlinear activation function (e.g. sigmoid), w = weights, b = bias, h = hidden, x = inputs
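A minimal sketch of such a neuron (toy numbers and naming are my assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    """h_{w,b}(x) = f(w^T x + b) with f = sigmoid."""
    return sigmoid(w @ x + b)

x = np.array([1.0, -2.0, 0.5])   # 3 inputs, as in the figure
w = np.array([0.3, 0.1, -0.4])   # weights (illustrative values)
b = 0.2                          # bias / intercept term
print(neuron(x, w, b))
```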

Page 40:

A neural network = running several logistic regressions at the same time

If we feed a vector of inputs through a bunch of logistic regression functions, then we get a vector of outputs …

But we don't have to decide ahead of time what variables these logistic regressions are trying to predict!

Page 41:

A neural network = running several logistic regressions at the same time

… which we can feed into another logistic regression function

It is the loss function that will direct what the intermediate hidden variables should be, so as to do a good job at predicting the targets for the next layer, etc.

Page 42:

A neural network = running several logistic regressions at the same time

Before we know it, we have a multilayer neural network …

Page 43:

Matrix notation for a layer

We have:
  $a_1 = f(W_{11}x_1 + W_{12}x_2 + W_{13}x_3 + b_1)$
  $a_2 = f(W_{21}x_1 + W_{22}x_2 + W_{23}x_3 + b_2)$
  etc.

In matrix notation:
  $z = Wx + b$
  $a = f(z)$

where f is applied element-wise:
  $f([z_1, z_2, z_3]) = [f(z_1), f(z_2), f(z_3)]$

[Figure: a layer of units a1, a2, a3 connected to inputs x1, x2, x3, with weights such as W12 and biases such as b3]
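A small sketch of the same computation (sigmoid chosen as f; sizes match the 3-input, 3-unit figure; everything else is an assumption):

```python
import numpy as np

def f(z):
    return 1.0 / (1.0 + np.exp(-z))   # sigmoid as the example non-linearity

W = np.random.randn(3, 3)             # 3 hidden units, 3 inputs
b = np.random.randn(3)
x = np.random.randn(3)

z = W @ x + b                         # z_i = sum_j W_ij x_j + b_i
a = f(z)                              # a_i = f(z_i), applied element-wise
```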

Page 44:

Non-linearities (aka "f"): Why they're needed

• Example: function approximation, e.g., regression or classification
  • Without non-linearities, deep neural networks can't do anything more than a linear transform

• Extra layers could just be compiled down into a single linear transform: $W_1 W_2 x = Wx$

• With more layers, they can approximate more complex functions!
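To spell out the "compiled down" claim with bias terms included (a one-line check, not from the slides):

$$W_2(W_1 x + b_1) + b_2 \;=\; (W_2 W_1)\,x + (W_2 b_1 + b_2) \;=\; W' x + b',$$

i.e. stacking affine layers without a non-linearity in between is still just one affine layer.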

Page 45:

Binary classification with unnormalized scores

• Revisiting our previous example:
  $x_{window} = [\, x_{museums}\;\; x_{in}\;\; x_{Paris}\;\; x_{are}\;\; x_{amazing}\,]$

• Assume we want to classify whether the center word is a Location (Named Entity Recognition)

• Similar to word2vec, we will go over all positions in a corpus. But this time, it will be supervised and only some positions should get a high score.

• The positions that have an actual NER location in their center are called "true" positions.

Page 46:

Binary classification for NER

• Example: Not all museums in Paris are amazing.

• Here: one true window, the one with Paris in its center, and all other windows are "corrupt" in the sense of not having a named entity location in their center.

• "Corrupt" windows are easy to find and there are many: any window that isn't specifically labeled as an NER location in our corpus

Page 47:

A Single Layer Neural Network

• A single layer is a combination of a linear layer and a nonlinearity:
  $z = Wx + b$
  $a = f(z)$

• The neural activations a can then be used to compute some output.

• For instance, a probability via softmax: $p(y \mid x) = \mathrm{softmax}(Wa)$

• Or an unnormalized score (even simpler): $s = U^{T}a \in \mathbb{R}$

Page 48:

Summary: Feed-forward Computation

We compute a window's score with a 3-layer neural net:

• s = score("museums in Paris are amazing")

  $s = U^{T} f(Wx + b)$
  $x_{window} = [\, x_{museums}\;\; x_{in}\;\; x_{Paris}\;\; x_{are}\;\; x_{amazing}\,]$

Page 49:

Main intuition for extra layer

The layer learns non-linear interactions between the input word vectors.

Example: only if "museums" is the first vector should it matter that "in" is in the second position

$x_{window} = [\, x_{museums}\;\; x_{in}\;\; x_{Paris}\;\; x_{are}\;\; x_{amazing}\,]$

Page 50:

The max-margin loss

• Idea for the training objective: Make the true window's score larger and the corrupt window's score lower (until they're good enough): minimize
  $J = \max(0,\; 1 - s + s_c)$

• s = score(museums in Paris are amazing)
• s_c = score(Not all museums in Paris)

• This is not differentiable but it is continuous → we can use SGD. (Each option is continuous.)

Page 51:

Max-margin loss

• Objective for a single window:
  $J = \max(0,\; 1 - s + s_c)$

• Each window with an NER location at its center should have a score +1 higher than any window without a location at its center

• xxx |← 1 →| ooo

• For the full objective function: Sample several corrupt windows per true one. Sum over all training windows.

• Similar to negative sampling in word2vec
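A tiny sketch of the objective for one true/corrupt pair (my function name; the corrupt-window sampling loop is left out):

```python
def max_margin_loss(s_true, s_corrupt):
    """J = max(0, 1 - s + s_c): zero once the true window outscores the corrupt one by >= 1."""
    return max(0.0, 1.0 - s_true + s_corrupt)

print(max_margin_loss(2.0, 1.5))   # 0.5 -> margin violated, contributes loss and gradient
print(max_margin_loss(3.0, 1.5))   # 0.0 -> margin satisfied, contributes nothing
```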

Page 52:

Deriving gradients for backprop

Assuming the cost J is > 0, compute the derivatives of s and s_c wrt all the involved variables: U, W, b, x

Page 53:

Deriving gradients for backprop

• For this function: $s = U^{T} f(Wx + b)$

• Let's consider the derivative of a single weight $W_{ij}$

• $W_{ij}$ only appears inside $a_i$

• For example: $W_{23}$ is only used to compute $a_2$, not $a_1$

[Figure: the network with inputs x1, x2, x3 and bias +1, hidden units a1 = f(z1) and a2 = f(z2), and score s via U2, highlighting W23 and b2]

Page 54:

Deriving gradients for backprop

Derivative of a single weight $W_{ij}$ (for $s = U^{T}f(Wx+b)$ with $z = Wx + b$, $a = f(z)$):

  $\dfrac{\partial s}{\partial W_{ij}} = \dfrac{\partial\, U^{T}a}{\partial W_{ij}} = \dfrac{\partial\, U_i a_i}{\partial W_{ij}}$   — ignore constants in which $W_{ij}$ doesn't appear
  $= U_i\, \dfrac{\partial a_i}{\partial W_{ij}}$   — pull out $U_i$ since it's constant, apply the chain rule
  $= U_i\, \dfrac{\partial f(z_i)}{\partial W_{ij}}$   — apply the definition of a
  $= U_i\, f'(z_i)\, \dfrac{\partial z_i}{\partial W_{ij}}$   — just define the derivative of f as f'
  $= U_i\, f'(z_i)\, \dfrac{\partial}{\partial W_{ij}}\big(W_{i\cdot}\,x + b_i\big) = U_i\, f'(z_i)\, x_j$   — plug in the definition of z

[Figure: the same network as before, highlighting W23 and b2]

Page 55:

Deriving gradients for backprop

Derivative of a single weight $W_{ij}$, continued:

  $\dfrac{\partial s}{\partial W_{ij}} = \underbrace{U_i\, f'(z_i)}_{\text{local error signal}}\;\underbrace{x_j}_{\text{local input signal}}$

where for logistic f: $f'(z) = f(z)\,(1 - f(z))$

[Figure: the same network as before, highlighting W23 and b2]

Page 56:

Deriving gradients for backprop

• So far, the derivative of a single $W_{ij}$ only, but we want the gradient for the full W

• We want all combinations of i = 1, 2 and j = 1, 2, 3 → ?

• Solution: Outer product:
  $\dfrac{\partial s}{\partial W} = \delta\, x^{T}$
  where $\delta$ is the "responsibility" or error signal coming from each activation a
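A sketch tying the pieces together for the scorer $s = U^{T}f(Wx+b)$, using the outer-product form above (the names, sizes, and the sigmoid choice are my assumptions):

```python
import numpy as np

def f(z):
    return 1.0 / (1.0 + np.exp(-z))

def fprime(z):
    return f(z) * (1.0 - f(z))       # for logistic f: f'(z) = f(z)(1 - f(z))

d_in, d_hidden = 20, 8
W = np.random.randn(d_hidden, d_in)
b = np.random.randn(d_hidden)
U = np.random.randn(d_hidden)
x = np.random.randn(d_in)            # e.g. a concatenated window vector

# Forward pass
z = W @ x + b
a = f(z)
s = U @ a

# Backward pass (gradients of the score s)
delta = U * fprime(z)                # "responsibility" / error signal at the hidden layer
grad_W = np.outer(delta, x)          # ds/dW_ij = delta_i * x_j  (outer product)
grad_b = delta                       # ds/db_i  = delta_i
grad_U = a                           # ds/dU_i  = a_i
grad_x = W.T @ delta                 # ds/dx_j  = sum_i delta_i W_ij, re-using delta
```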

Page 57:

Deriving gradients for backprop

• How to derive the gradients for the biases b?

[Figure: the same network as before, highlighting b2]

Page 58:

Training with Backpropagation

That's almost backpropagation: it's taking derivatives and using the chain rule

Remaining trick: we can re-use derivatives computed for higher layers when computing derivatives for lower layers!

Example: the last derivatives of the model, the word vectors in x

Page 59:

Training with Backpropagation

• Take the derivative of the score with respect to a single element of a word vector, $x_j$

• Now, we cannot just take into consideration one $a_i$, because each $x_j$ is connected to all the neurons above, and hence $x_j$ influences the overall score through all of them. Hence:
  $\dfrac{\partial s}{\partial x_j} = \sum_{i} \dfrac{\partial s}{\partial a_i}\,\dfrac{\partial a_i}{\partial x_j}$

Re-used part of the previous derivative
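Carrying this through with the $\delta$ notation from the previous slides (my completion, consistent with the local error signal $\delta_i = U_i f'(z_i)$):

$$\frac{\partial s}{\partial x_j} = \sum_i \delta_i\, W_{ij}, \qquad \text{so in vector form}\quad \frac{\partial s}{\partial x} = W^{T}\delta,$$

re-using the $\delta$ already computed for the W and b gradients.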

Page 60:

Training with Backpropagation

• With this, what is the full gradient? →

• Observation: The error message $\delta$ that arrives at a hidden layer has the same dimensionality as that hidden layer

Page 61:

Putting all gradients together:

• Remember: The full objective function for each window was:
  $J = \max(0,\; 1 - s + s_c)$

• For example, the gradient for just U:
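Filling in the result (my completion, writing $a$ and $a_c$ for the hidden activations of the true and corrupt windows): when the margin is violated, i.e. $1 - s + s_c > 0$,

$$\frac{\partial J}{\partial U} = -\frac{\partial s}{\partial U} + \frac{\partial s_c}{\partial U} = -a + a_c,$$

and the gradient is zero otherwise.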

Page 62:

Summary

Congrats! Super useful basic components and a real model:

• Word vector training

• Windows

• Softmax and cross entropy error → PSet 1

• Scores and max-margin loss

• Neural network → PSet 1

Page 63:

Next lecture:

Taking more and deeper derivatives → Full Backprop

High-level tips for easier derivations

Then we have all the basic tools in place to learn about and have fun with more complex and deeper models :)