TRANSCRIPT
Online Videos
• FERPA
• Sign waiver or sit on the sides or in the back
• Off-camera question time before and after lecture
• Questions?
CS224d: Deep NLP
Lecture 4: Word Window Classification and Neural Networks
Richard Socher
Feedback so far
• ~70% good with current speed; 15% too fast → please visit office hours; 15% too slow
• Math, when glossed over → not required, food for thought for advanced students
• Lectures dry → understanding the basics is important; starting next week we will become more conceptual, introduce complex models, and gain practical intuitions
Feedback so far
• Given the feedback: clearly define word vector updates today; move the deadline of PSet 1 by 2 days
• Project ideas: 2 types; more info next week; my office hour
• Detail: intuition for the word vector context window. The smaller the relative context difference, the more similar the vectors
Overview today:
• General classification background
• Updating word vectors for classification
• Window classification & cross-entropy error derivation tips
• A single-layer neural network!
• (Max-margin loss and backprop)
Refresher: Classification setup and notation
• Generally we have a training dataset consisting of samples $\{x_i, y_i\}_{i=1}^N$
• $x_i$: inputs, e.g. words (indices or vectors!), context windows, sentences, documents, etc.
• $y_i$: labels we try to predict, e.g. other words; a class: sentiment, named entities, buy/sell decision; later: multi-word sequences
Classification intuition
• Training data: $\{x_i, y_i\}_{i=1}^N$
• Simple illustration case: fixed 2-d word vectors to classify, using logistic regression → linear decision boundary
• General ML: assume x is fixed and only train the logistic regression weights W, i.e. only modify the decision boundary
Visualizations with ConvNetJS by Karpathy! http://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html
Classification notation
• Cross-entropy loss function over the dataset $\{x_i, y_i\}_{i=1}^N$
• Where for each data pair $(x_i, y_i)$:
• We can write f in matrix notation and index elements of it based on class:
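The loss itself appears on the slide as an image; written out, the standard softmax cross-entropy objective used throughout the lecture is:

$$J(\theta) = \frac{1}{N} \sum_{i=1}^{N} -\log\left(\frac{e^{f_{y_i}}}{\sum_{c=1}^{C} e^{f_c}}\right)$$

where $f = Wx_i$ is the vector of class scores and $f_{y_i}$ is the score of the correct class.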
Classification: Regularization!
• The really full loss function over any dataset also includes regularization over all parameters θ:
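The regularized objective is shown as an image on the slide; a standard way to write it (the exact placement of λ is a convention):

$$J(\theta) = \frac{1}{N} \sum_{i=1}^{N} -\log\left(\frac{e^{f_{y_i}}}{\sum_{c=1}^{C} e^{f_c}}\right) + \lambda \sum_k \theta_k^2$$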
• Regularization will prevent overfitting when we have a lot of features (or, later, a very powerful/deep model)
• [Figure: error vs. model power or number of training iterations; blue: training error, red: test error]
Details: General ML optimization
• For general machine learning, θ usually consists only of the columns of W:
• So we only update the decision boundary
Visualizations with ConvNetJS by Karpathy
Classification difference with word vectors
• Common in deep learning: learn both W and the word vectors x
• The word vectors make the parameter set very large! Overfitting danger!
Losing generalization by re-training word vectors
• Setting: training logistic regression for movie review sentiment, and in the training data we have the words "TV" and "telly"
• In the testing data we have "television"
• Originally they were all similar (from pre-trained word vectors)
• What happens when we train the word vectors?
[Figure: "TV", "telly", and "television" start out close together in vector space]
Losing generalization by re-training word vectors
• What happens when we train the word vectors?
• Those that are in the training data move around
• Words from pre-training that do NOT appear in training stay put
• Example: in the training data: "TV" and "telly"; in the testing data only: "television"
[Figure: "TV" and "telly" have moved together; "television" is left behind :( ]
Losing generalization by re-training word vectors
• Take-home message:
If you only have a small training dataset, don't train the word vectors.
If you have a very large dataset, it may work better to train the word vectors to the task.
Side note on word vector notation
• The word vector matrix L is also called the lookup table
• Word vectors = word embeddings = word representations (mostly)
• Mostly from methods like word2vec or GloVe
• L is a $d \times |V|$ matrix: one d-dimensional column per vocabulary word (aardvark, a, …, meta, …, zebra)
• These are the word features $x_{word}$ from now on
• Conceptually you get a word's vector by left-multiplying a one-hot vector e by L: $x = Le \in \mathbb{R}^{d \times 1}$, with $L \in \mathbb{R}^{d \times |V|}$ and $e \in \mathbb{R}^{|V| \times 1}$
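A minimal numpy sketch of this lookup (d and the toy vocabulary are made up for illustration):

```python
import numpy as np

d, vocab = 4, ["aardvark", "a", "meta", "zebra"]
L = np.random.randn(d, len(vocab))   # d x |V| word-vector matrix

# One-hot selection: x = L e picks out one column of L ...
e = np.zeros((len(vocab), 1))
e[vocab.index("zebra")] = 1.0
x = L @ e                            # shape (d, 1)

# ... but in practice you just index the column directly.
assert np.allclose(x[:, 0], L[:, vocab.index("zebra")])
```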
Window classification
• Classifying single words is rarely done.
• Interesting problems like ambiguity arise in context!
• Example, auto-antonyms:
• "To sanction" can mean "to permit" or "to punish."
• "To seed" can mean "to place seeds" or "to remove seeds."
• Example, ambiguous named entities:
• Paris → Paris, France vs. Paris Hilton
• Hathaway → Berkshire Hathaway vs. Anne Hathaway
Window classification
• Idea: classify a word in its context window of neighboring words.
• For example, named entity recognition into 4 classes: person, location, organization, none
• Many possibilities exist for classifying one word in context, e.g. averaging all the words in a window, but that loses position information
Window classification
• Train a softmax classifier by assigning a label to a center word and concatenating all word vectors surrounding it
• Example: classify "Paris" in the context of this sentence with window length 2:
… museums in Paris are amazing …
$x_{window} = [\,x_{museums}\;\; x_{in}\;\; x_{Paris}\;\; x_{are}\;\; x_{amazing}\,]^T$
• Resulting vector $x_{window} = x \in \mathbb{R}^{5d}$, a column vector!
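A quick sketch of building that window vector, assuming a lookup table L and a toy vocabulary (both made up):

```python
import numpy as np

d = 50
words = ["museums", "in", "Paris", "are", "amazing"]
word_to_idx = {w: i for i, w in enumerate(words)}
L = np.random.randn(d, len(word_to_idx))   # d x |V| lookup table

# Concatenate the five word vectors into one long column vector.
x_window = np.concatenate([L[:, word_to_idx[w]] for w in words])
assert x_window.shape == (5 * d,)
```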
Simplest window classifier: Softmax
• With $x = x_{window}$ we can use the same softmax classifier as before
• With cross-entropy error as before (the softmax output is the predicted model probability $\hat{y}$):
• But how do you update the word vectors?
Updating concatenated word vectors
• Short answer: just take derivatives as before
• Long answer: let's go over the steps together (you'll have to fill in the details in PSet 1!)
• Define:
• $\hat{y}$: the softmax probability output vector (see previous slide)
• $t$: the target probability distribution (all 0s except at the ground-truth index of class y, where it's 1)
• and $f_c$ = the c'th element of the f vector
• Hard the first time, hence some tips now :)
Updating concatenated word vectors
• Tip 1: Carefully define your variables and keep track of their dimensionality!
• Tip 2: Know thy chain rule and don't forget which variables depend on what:
• Tip 3: For the softmax part of the derivative: first take the derivative wrt $f_c$ when $c = y$ (the correct class), then take the derivative wrt $f_c$ when $c \neq y$ (all the incorrect classes)
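Carrying out Tips 1–3 yields the standard softmax cross-entropy gradient (the slides leave the details to PSet 1; stated here for reference in the $\hat{y}$, $t$ notation defined above):

$$\frac{\partial J}{\partial f_c} = \hat{y}_c - t_c \qquad\text{i.e.}\qquad \frac{\partial J}{\partial f} = \hat{y} - t$$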
Updating concatenated word vectors
• Tip 4: When you take the derivative wrt one element of f, try to see if you can create a gradient in the end that includes all partial derivatives:
• Tip 5: To later not go insane (and for the implementation!) → write the results in terms of vector operations and define single index-able vectors:
Updating concatenated word vectors
• Tip 6: When you start with the chain rule, first use explicit sums and look at partial derivatives of e.g. $x_i$ or $W_{ij}$
• Tip 7: To clean it up for even more complex functions later: know the dimensionality of variables & simplify into matrix notation
• Tip 8: Write this out in full sums if it's not clear!
Updating concatenated word vectors
• What is the dimensionality of the window vector gradient?
• x is the entire window, 5 d-dimensional word vectors, so the derivative wrt x has to have the same dimensionality: $\nabla_x J \in \mathbb{R}^{5d}$
Updating concatenated word vectors
• The gradient that arrives at and updates the word vectors can simply be split up for each word vector:
• Let $\delta_{window}$ denote that gradient
• With $x_{window} = [\,x_{museums}\;\; x_{in}\;\; x_{Paris}\;\; x_{are}\;\; x_{amazing}\,]$
• We have:
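The "Let" and "We have" equations appear as images on the slides; restated in the notation above (the symbol $\delta_{window}$ is chosen here for readability):

$$\delta_{window} = \nabla_{x_{window}} J = \begin{bmatrix} \nabla_{x_{museums}} J \\ \nabla_{x_{in}} J \\ \nabla_{x_{Paris}} J \\ \nabla_{x_{are}} J \\ \nabla_{x_{amazing}} J \end{bmatrix} \in \mathbb{R}^{5d}$$

Each d-dimensional block is the update that flows to the corresponding word's vector.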
Updating concatenated word vectors
• This will push word vectors into areas such that they will be helpful in determining named entities.
• For example, the model can learn that seeing $x_{in}$ as the word just before the center word is indicative of the center word being a location
What's missing for training the window model?
• The gradient of J wrt the softmax weights W!
• Similar steps: write down the partial wrt $W_{ij}$ first! Then we have the full $\nabla_W J$
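For reference, working this out for the softmax classifier with $f = Wx$ gives an outer product (the slide shows it as an image; this is the standard result in the $\hat{y}$, $t$ notation from before):

$$\nabla_W J = (\hat{y} - t)\, x^T$$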
A note on matrix implementations
• There are two expensive operations in the softmax: the matrix multiplication and the exp
• When you implement it, a for loop is never as efficient as a single larger matrix multiplication!
• Example code →
A note on matrix implementations
• Looping over word vectors, instead of concatenating them all into one large matrix and then multiplying the softmax weights with that matrix:
• 1000 loops, best of 3: 639 µs per loop (loop version)
• 10000 loops, best of 3: 53.8 µs per loop (matrix version)
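The original example code is an image on the slide; a sketch of the same comparison in numpy (all shapes made up; the reported timings presumably came from IPython's %timeit):

```python
import numpy as np

C, d, N = 5, 300, 1000
W = np.random.randn(C, d)                        # softmax weights
wordvectors = [np.random.randn(d) for _ in range(N)]

# Slow: one matrix-vector product per word vector.
scores_loop = [W @ x for x in wordvectors]

# Fast: concatenate into a d x N matrix, then one matrix-matrix product.
X = np.column_stack(wordvectors)                 # shape (d, N)
scores_mat = W @ X                               # shape (C, N)

assert np.allclose(np.column_stack(scores_loop), scores_mat)
```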
A note on matrix implementations
• The result of the faster method is a C × N matrix:
• Each column is an f(x) in our notation (unnormalized class scores)
• Matrices are awesome!
• You should speed-test your code a lot too
Softmax (= logistic regression) is not very powerful
• Softmax only gives linear decision boundaries in the original space.
• With little data that can be a good regularizer
• With more data it is very limiting!
Softmax (= logistic regression) is not very powerful
• Softmax gives only linear decision boundaries
• → Lame when the problem is complex
• Wouldn't it be cool to get these correct?
Neural Nets for the Win!
• Neural networks can learn much more complex functions and nonlinear decision boundaries!
From logistic regression to neural nets
Demystifying neural networks
Neural networks come with their own terminological baggage
… just like SVMs
But if you understand how softmax models work
Then you already understand the operation of a basic neural network neuron!
A single neuron: a computational unit with n (here 3) inputs, 1 output, and parameters W, b
[Figure: a single neuron; the inputs feed into an activation function that produces the output; the bias unit corresponds to the intercept term]
A neuron is essentially a binary logistic regression unit

$$h_{w,b}(x) = f(w^T x + b), \qquad f(z) = \frac{1}{1 + e^{-z}}$$

w, b are the parameters of this neuron, i.e., this logistic regression model
b: we can have an "always on" feature, which gives a class prior, or separate it out as a bias term
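A minimal numpy sketch of such a neuron (toy values, not from the slides):

```python
import numpy as np

def sigmoid(z):
    # f(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    # h_{w,b}(x) = f(w^T x + b): a binary logistic regression unit
    return sigmoid(w @ x + b)

x = np.array([1.0, 0.5, -0.5])   # a toy 3-dimensional input
w = np.random.randn(3)           # weights
b = 0.1                          # bias (intercept term)
print(neuron(x, w, b))           # output: a probability in (0, 1)
```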
A neural network = running several logistic regressions at the same time
If we feed a vector of inputs through a bunch of logistic regression functions, then we get a vector of outputs…
But we don't have to decide ahead of time what variables these logistic regressions are trying to predict!
A neural network = running several logistic regressions at the same time
… which we can feed into another logistic regression function.
It is the loss function that will direct what the intermediate hidden variables should be, so as to do a good job at predicting the targets for the next layer, etc.
A neural network = running several logistic regressions at the same time
Before we know it, we have a multilayer neural network….
Matrix notation for a layer
We have

$$a_1 = f(W_{11}x_1 + W_{12}x_2 + W_{13}x_3 + b_1)$$
$$a_2 = f(W_{21}x_1 + W_{22}x_2 + W_{23}x_3 + b_2)$$

etc.
In matrix notation:

$$z = Wx + b, \qquad a = f(z)$$

where f is applied element-wise:

$$f([z_1, z_2, z_3]) = [f(z_1), f(z_2), f(z_3)]$$
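The same layer as a numpy sketch (sigmoid chosen for f; sizes are illustrative):

```python
import numpy as np

def f(z):
    # element-wise sigmoid nonlinearity
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.2, -0.4, 0.7])   # 3 inputs
W = np.random.randn(3, 3)        # W[i, j] connects input x_j to unit a_i
b = np.random.randn(3)           # one bias per hidden unit

z = W @ x + b                    # z = Wx + b
a = f(z)                         # a = f(z)

# Element-wise application means a_i = f(z_i) for each unit:
assert np.allclose(a, np.array([f(zi) for zi in z]))
```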
Non-linearities (f): why they're needed
• Example: function approximation, e.g., regression or classification
• Without non-linearities, deep neural networks can't do anything more than a linear transform
• Extra layers could just be compiled down into a single linear transform: $W_1 W_2 x = Wx$
• With more layers, they can approximate more complex functions!
A more powerful window classifier
• Revisiting
• $x_{window} = [\,x_{museums}\;\; x_{in}\;\; x_{Paris}\;\; x_{are}\;\; x_{amazing}\,]$
A Single Layer Neural Network
• A single layer is a combination of a linear layer and a nonlinearity: $z = Wx + b$, $a = f(z)$
• The neural activations a can then be used to compute some function
• For instance, a softmax probability or an unnormalized score:
Summary: Feed-forward Computation
Computing a window's score with a 3-layer neural net: s = score(museums in Paris are amazing)
$x_{window} = [\,x_{museums}\;\; x_{in}\;\; x_{Paris}\;\; x_{are}\;\; x_{amazing}\,]$
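The score equations are images on the slides; assuming the single-hidden-layer form $s = U^T f(Wx + b)$ from the previous slides (counting input, hidden, and score layers gives the "3 layers"), a sketch of the computation, with a made-up hidden size of 8:

```python
import numpy as np

d = 50
x_window = np.random.randn(5 * d)   # concatenated window vector

W = np.random.randn(8, 5 * d)       # hidden-layer weights
b = np.random.randn(8)              # hidden-layer biases
U = np.random.randn(8)              # score weights

z = W @ x_window + b
a = 1.0 / (1.0 + np.exp(-z))        # a = f(z), element-wise sigmoid
s = U @ a                           # scalar window score
print(s)
```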
Next lecture:
Training a window-based neural network.
Taking deeper derivatives → backprop
Then we have all the basic tools in place to learn about more complex models :)
Probably for next lecture…
Another output layer and loss function combo!
• So far: softmax and cross-entropy error (the exp is slow)
• We don't always need probabilities; often unnormalized scores are enough to classify correctly.
• Also: max-margin!
• More on that in future lectures!
Neural net model to classify grammatical phrases
• Idea: train a neural network to produce high scores for grammatical phrases of a specific length and low scores for ungrammatical phrases
• s = score(cat chills on a mat)
• $s_c$ = score(cat chills Menlo a mat)
Another output layer and loss function combo!
• Idea for the training objective: make the score of the true window larger and the corrupt window's score lower (until they're good enough): minimize
• This is continuous, so we can perform SGD
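The minimized objective is an image on the slide; it is the max-margin (hinge) loss, which in the standard form used in the course reads:

$$J = \max(0,\; 1 - s + s_c)$$

where s is the true window's score and $s_c$ the corrupt window's score. J is continuous in the parameters (though not differentiable at the hinge), which is enough to run SGD; the gradient is zero once s beats $s_c$ by the margin of 1.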
Training with Backpropagation
Assuming cost J is > 0, it is simple to see that we can compute the derivatives of s and $s_c$ wrt all the involved variables: U, W, b, x
Training with Backpropagation
• Let's consider the derivative of a single weight $W_{ij}$
• This only appears inside $a_i$
• For example: $W_{23}$ is only used to compute $a_2$
[Figure: network with inputs $x_1, x_2, x_3$ plus a bias unit +1, hidden units $a_1, a_2$, and score s; $U_2$ connects $a_2$ to s, and $W_{23}$ connects $x_3$ to $a_2$]
Training with Backpropagation
Derivative of weight $W_{ij}$:
Training with Backpropagation
Derivative of a single weight $W_{ij}$, where for logistic f: $f'(z) = f(z)\,(1 - f(z))$
The derivative factors into a local error signal and a local input signal.
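The derivative itself is an image on the slide; written out for the score $s = U^T f(Wx + b)$ (the standard result, stated here for reference):

$$\frac{\partial s}{\partial W_{ij}} = \underbrace{U_i\, f'(z_i)}_{\text{local error signal } \delta_i} \cdot \underbrace{x_j}_{\text{local input signal}}$$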
Training with Backpropagation
• From a single weight $W_{ij}$ to the full W:
• We want all combinations of i = 1, 2 and j = 1, 2, 3
• Solution: the outer product, where $\delta$ is the "responsibility" coming from each activation a
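In the notation above, the outer-product form of the gradient is:

$$\frac{\partial s}{\partial W} = \delta\, x^T, \qquad \delta_i = U_i\, f'(z_i)$$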
Training with Backpropagation
• For the biases b, we get:
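The bias gradient (an image on the slide) follows by the same argument, with the local input signal replaced by 1:

$$\frac{\partial s}{\partial b_i} = U_i\, f'(z_i) = \delta_i, \qquad \text{i.e.}\quad \frac{\partial s}{\partial b} = \delta$$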
Training with Backpropagation
That's almost backpropagation: it's simply taking derivatives and using the chain rule!
Remaining trick: we can re-use derivatives computed for higher layers when computing derivatives for lower layers
Example: the last derivatives of the model, the word vectors in x
Training with Backpropagation
• Take the derivative of the score with respect to a single word vector (for simplicity a 1-d vector, but the same holds if it is longer)
• Now, we cannot just take into consideration one $a_i$, because each $x_j$ is connected to all the neurons above, and hence $x_j$ influences the overall score through all of them:
• (Re-used part of the previous derivative)
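Spelled out (the slide shows this as an image), the chain rule sums over all hidden units, and the $\delta_i$ from the W gradient are exactly the re-used part:

$$\frac{\partial s}{\partial x_j} = \sum_i \frac{\partial s}{\partial a_i}\,\frac{\partial a_i}{\partial x_j} = \sum_i \delta_i\, W_{ij}, \qquad \text{i.e.}\quad \frac{\partial s}{\partial x} = W^T \delta$$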
Summary