
Page 1:

Natural Language Processing with Deep Learning

CS224N/Ling284

Lecture 4: Word Window Classification and Neural Networks

Christopher Manning and Richard Socher


Page 2:

Overview

Today:

•  Classification background

•  Updating word vectors for classification

•  Window classification & cross entropy error derivation tips

•  A single layer neural network!

•  Max-margin loss and backprop

This lecture will help a lot with PSet 1 :)

Page 3:

Classification setup and notation

•  Generally we have a training dataset consisting of samples

{x_i, y_i} for i = 1, …, N

•  x_i – inputs, e.g. words (indices or vectors!), context windows, sentences, documents, etc.

•  y_i – labels we try to predict, for example:
•  class: sentiment, named entities, buy/sell decision
•  other words
•  later: multi-word sequences

Page 4:

Classification intuition

•  Training data: {x_i, y_i} for i = 1, …, N

•  Simple illustration case:
•  fixed 2-d word vectors to classify
•  using logistic regression
•  → linear decision boundary

•  General ML: assume x is fixed, train the logistic regression weights W → only modify the decision boundary

•  Goal: for each x, predict a probability for each class y:

p(y | x) = exp(W_y · x) / Σ_c exp(W_c · x)

Visualizations with ConvNetJS by Karpathy! http://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html

Page 5:

Details of the softmax

•  We can tease apart p(y | x) into two steps:

1.  Take the y'th row of W and multiply that row with x:

f_y = W_y · x = Σ_i W_{yi} x_i

Compute all f_c for c = 1, …, C

2.  Normalize to obtain the probability with the softmax function:

p(y | x) = exp(f_y) / Σ_c exp(f_c) = softmax(f)_y
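To make the two steps concrete, here is a minimal numpy sketch (my own illustration, not the course's starter code; variable names and sizes are assumptions):

```python
import numpy as np

def softmax_prob(W, x, y):
    """Probability of class y under a softmax classifier:
    W is C x d (one row of weights per class), x is a d-dimensional input."""
    f = W @ x                         # step 1: unnormalized scores f_c = W_c . x
    f = f - f.max()                   # shift for numerical stability (result unchanged)
    p = np.exp(f) / np.exp(f).sum()   # step 2: normalize with the softmax
    return p[y]

W = np.random.randn(3, 5)             # 3 classes, 5-dimensional inputs
x = np.random.randn(5)
print(softmax_prob(W, x, y=1))
```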

Page 6:

The softmax and cross-entropy error

•  For each training example {x, y}, our objective is to maximize the probability of the correct class y

•  Hence, we minimize the negative log probability of that class:

J = −log p(y | x) = −log ( exp(f_y) / Σ_c exp(f_c) )

Page 7:

Background: Why "cross entropy" error

•  Assume a ground truth (or gold or target) probability distribution that is 1 at the right class and 0 everywhere else: p = [0, …, 0, 1, 0, …, 0], and let our computed probability be q. Then the cross entropy is:

H(p, q) = − Σ_c p(c) log q(c)

•  Because of the one-hot p, the only term left is the negative log probability of the true class
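A quick worked example (the numbers are my own illustration, not from the slide): with C = 3 classes, true class y = 2, so p = [0, 1, 0], and a predicted distribution q = [0.1, 0.7, 0.2]:

H(p, q) = −(0 · log 0.1 + 1 · log 0.7 + 0 · log 0.2) = −log 0.7 ≈ 0.36

which is exactly the negative log probability of the true class.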

Page 8:

Side note: The KL divergence

•  Cross entropy can be re-written in terms of the entropy and the Kullback-Leibler divergence between the two distributions:

H(p, q) = H(p) + D_KL(p ‖ q)

•  Because H(p) is zero in our case (and even if it weren't, it would be fixed and have no contribution to the gradient), minimizing this is equivalent to minimizing the KL divergence between p and q

•  The KL divergence is not a distance but a non-symmetric measure of the difference between the two probability distributions p and q:

D_KL(p ‖ q) = Σ_c p(c) log ( p(c) / q(c) )

Page 9:

Classification over a full dataset

•  Cross entropy loss function over the full dataset {x_i, y_i}, i = 1, …, N:

J(θ) = (1/N) Σ_i −log ( exp(f_{y_i}) / Σ_c exp(f_c) )

•  Instead of writing f_y = W_y · x for each class separately

•  We will write f in matrix notation: f = Wx
•  We can still index elements of it based on class

Page 10:

Classification: Regularization!

•  The really full loss function over any dataset also includes regularization over all parameters θ:

J(θ) = (1/N) Σ_i −log ( exp(f_{y_i}) / Σ_c exp(f_c) ) + λ Σ_k θ_k²

•  Regularization will prevent overfitting when we have a lot of features (or later a very powerful/deep model)

•  (Figure: x-axis is a more powerful model or more training iterations; blue curve: training error, red curve: test error)

Page 11:

Details: General ML optimization

•  For general machine learning, θ usually consists only of the columns of W:

θ = [W_{·1}, …, W_{·d}] ∈ R^{C·d}

•  So we only update the decision boundary. Visualizations with ConvNetJS by Karpathy

Page 12:

Classification difference with word vectors

•  Common in deep learning:
•  learn both W and the word vectors x

•  θ then also contains all the word vectors of the vocabulary, so it is very large — overfitting danger!

Page 13:

Losing generalization by re-training word vectors

•  Setting: Training a logistic regression for movie review sentiment on single words, and in the training data we have
•  "TV" and "telly"

•  In the testing data we have
•  "television"

•  Originally they were all similar (from pre-trained word vectors)

•  What happens when we train the word vectors?

(Figure: "TV", "telly", and "television" start out close together in vector space.)

Page 14:

Losing generalization by re-training word vectors

•  What happens when we train the word vectors?
•  Those that are in the training data move around
•  Words from pre-training that do NOT appear in training stay

•  Example:
•  in the training data: "TV" and "telly"
•  only in the testing data: "television"

(Figure: "TV" and "telly" have moved, leaving "television" behind :( )

Page 15:

Losing generalization by re-training word vectors

•  Take home message: If you only have a small training dataset, don't train the word vectors. If you have a very large dataset, it may work better to train the word vectors to the task.

Page 16:

Side note on word vector notation

•  The word vector matrix L is also called the lookup table
•  Word vectors = word embeddings = word representations (mostly)
•  Mostly from methods like word2vec or GloVe

•  L is a d × V matrix with one column per vocabulary word (aardvark, a, …, meta, …, zebra)
•  These are the word features x_word from now on

•  New development (later in the class): character models

Page 17:

Window classification

•  Classifying single words is rarely done.

•  Interesting problems like ambiguity arise in context!

•  Example: auto-antonyms:
•  "To sanction" can mean "to permit" or "to punish."
•  "To seed" can mean "to place seeds" or "to remove seeds."

•  Example: ambiguous named entities:
•  Paris → Paris, France vs. Paris Hilton
•  Hathaway → Berkshire Hathaway vs. Anne Hathaway

Page 18:

Window classification

•  Idea: classify a word in its context window of neighboring words.

•  For example, named entity recognition into 4 classes:
•  person, location, organization, none

•  Many possibilities exist for classifying one word in context, e.g. averaging all the words in a window, but that loses position information

Page 19:

Window classification

•  Train a softmax classifier by assigning a label to a center word and concatenating all word vectors surrounding it

•  Example: Classify "Paris" in the context of this sentence with window length 2:

… museums in Paris are amazing …
x_window = [x_museums, x_in, x_Paris, x_are, x_amazing]^T

•  Resulting vector x_window = x ∈ R^{5d}, a column vector!
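A minimal numpy sketch of building this window vector (my own illustration; the toy vocabulary and dimensionality are assumptions, not starter code):

```python
import numpy as np

# Build the window vector for center word "Paris" with window length 2
# by concatenating the five word vectors looked up in the matrix L.
d = 50                                            # word vector dimensionality (assumed)
vocab = {"museums": 0, "in": 1, "Paris": 2, "are": 3, "amazing": 4}
L = np.random.randn(len(vocab), d)                # lookup table, one row per word here

window = ["museums", "in", "Paris", "are", "amazing"]
x_window = np.concatenate([L[vocab[w]] for w in window])
print(x_window.shape)                             # (250,), i.e. 5d
```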

Page 20:

Simplest window classifier: Softmax

•  With x = x_window we can use the same softmax classifier as before

•  With cross entropy error as before:

J = −log p(y | x_window)

where p(y | x_window) = softmax(W x_window)_y is the predicted model output probability — the classifier is the same as before, only the input has changed.

•  But how do you update the word vectors?

Page 21:

Updating concatenated word vectors

•  Short answer: Just take derivatives as before

•  Long answer: Let's go over the steps together (helpful for PSet 1)

•  Define:
•  the softmax probability output vector (see previous slide)
•  the target probability distribution (all 0's except a 1 at the ground truth index of class y)

•  and f = Wx, with f_c = the c'th element of the f vector

•  Hard, the first time, hence some tips now :)

Page 22:

Updating concatenated word vectors

•  Tip 1: Carefully define your variables and keep track of their dimensionality!

•  Tip 2: Chain rule! If y = f(u) and u = g(x), i.e. y = f(g(x)), then:

dy/dx = (dy/du) (du/dx)

•  Simple example:
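The slide works a small example at this point; since it is not legible in this transcript, here is a generic one of my own with the same structure:

If y = u² and u = 3x + 1, then dy/du = 2u and du/dx = 3, so

dy/dx = (dy/du)(du/dx) = 2u · 3 = 6(3x + 1)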

Page 23:

Updating concatenated word vectors

•  Tip 2 continued: Know thy chain rule
•  Don't forget which variables depend on what, and that x appears inside all elements of f

•  Tip 3: For the softmax part of the derivative, first take the derivative w.r.t. f_c when c = y (the correct class), then take the derivative w.r.t. f_c when c ≠ y (all the incorrect classes)
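For reference, carrying out Tip 3 gives the standard result (this is my restatement; the slide derives the same thing, possibly with different symbol names). Writing ŷ = softmax(f) for the predicted distribution and t for the one-hot target:

∂J/∂f_c = ŷ_c − 1  for c = y,  and  ∂J/∂f_c = ŷ_c  for c ≠ y

i.e. in vector form, ∂J/∂f = ŷ − t.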

Page 24:

Updating concatenated word vectors

•  Tip 4: When you take the derivative w.r.t. one element of f, try to see if you can create a gradient in the end that includes all partial derivatives:

•  Tip 5: To later not go insane (and for the implementation!) → express the results in terms of vector operations and define single index-able vectors:

Page 25:

Updating concatenated word vectors

•  Tip 6: When you start with the chain rule, first use explicit sums and look at partial derivatives of e.g. x_i or W_ij

•  Tip 7: To clean it up for even more complex functions later: know the dimensionality of variables & simplify into matrix notation

•  Tip 8: Write this out in full sums if it's not clear!

Page 26:

Updating concatenated word vectors

•  What is the dimensionality of the window vector gradient?

•  x is the entire window, 5 d-dimensional word vectors, so the derivative w.r.t. x has to have the same dimensionality:

∂J/∂x ∈ R^{5d}

Page 27:

Updating concatenated word vectors

•  The gradient that arrives at and updates the word vectors can simply be split up for each word vector:

•  Let δ_window = ∂J/∂x_window ∈ R^{5d}
•  With x_window = [x_museums, x_in, x_Paris, x_are, x_amazing]

•  We have:

δ_window = [∂J/∂x_museums, ∂J/∂x_in, ∂J/∂x_Paris, ∂J/∂x_are, ∂J/∂x_amazing]
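In code, this split is just slicing the 5d gradient into five d-dimensional pieces (a minimal sketch with made-up names, not the course's starter code):

```python
import numpy as np

d = 50
delta_window = np.random.randn(5 * d)   # stands in for dJ/dx_window
grads = np.split(delta_window, 5)       # one (d,) gradient per word in the window
# grads[0] updates x_museums, grads[1] x_in, grads[2] x_Paris,
# grads[3] x_are, grads[4] x_amazing
```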

Page 28:

Updating concatenated word vectors

•  This will push the word vectors into areas such that they will be helpful in determining named entities.

•  For example, the model can learn that seeing x_in as the word just before the center word is indicative of the center word being a location

Page 29:

What's missing for training the window model?

•  The gradient of J w.r.t. the softmax weights W!

•  Similar steps: write down the partial w.r.t. W_ij first!
•  Then we have the full ∂J/∂W

Page 30:

A note on matrix implementations

•  There are two expensive operations in the softmax:

•  the matrix multiplication and the exp

•  A for loop is never as efficient as a single large matrix multiplication!

•  Example code →

Page 31:

A note on matrix implementations

•  Looping over word vectors, instead of concatenating them all into one large matrix and then multiplying the softmax weights with that matrix:

•  1000 loops, best of 3: 639 µs per loop (the for-loop version)
   10000 loops, best of 3: 53.8 µs per loop (the single matrix multiplication)
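The slide's actual example code is not reproduced in this transcript; here is a small numpy sketch of the same comparison (sizes and names are my assumptions), which can be timed with %timeit in IPython:

```python
import numpy as np

C, d, N = 5, 300, 500                       # classes, vector size, number of windows (assumed)
W = np.random.randn(C, d)                   # softmax weights
wordvectors = [np.random.randn(d) for _ in range(N)]

def scores_loop():
    # one small matrix-vector product per word vector
    return [W @ x for x in wordvectors]

def scores_matmul():
    # concatenate into one d x N matrix, then a single C x N matrix multiplication
    X = np.column_stack(wordvectors)
    return W @ X

# In IPython:  %timeit scores_loop()   vs   %timeit scores_matmul()
# The single large multiplication is typically around an order of magnitude faster.
```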

Page 32:

A note on matrix implementations

•  The result of the faster method is a C × N matrix:

•  Each column is an f(x) in our notation (unnormalized class scores)

•  Matrices are awesome!

•  You should speed-test your code a lot too

Page 33:

Softmax (= logistic regression) alone is not very powerful

•  Softmax only gives linear decision boundaries in the original space.

•  With little data that can be a good regularizer

•  With more data it is very limiting!

Page 34:

Softmax (= logistic regression) is not very powerful

•  Softmax gives only linear decision boundaries

•  → Lame when the problem is complex

•  Wouldn't it be cool to get these correct?

Page 35:

Neural Nets for the Win!

•  Neural networks can learn much more complex functions and nonlinear decision boundaries!

Page 36:

From logistic regression to neural nets

Page 37:

Demystifying neural networks

Neural networks come with their own terminological baggage

But if you understand how softmax models work

Then you already understand the operation of a basic neuron!

A single neuron: a computational unit with n (= 3) inputs and 1 output, and parameters W, b

(Diagram: inputs → activation function → output; the bias unit corresponds to the intercept term.)

Page 38:

A neuron is essentially a binary logistic regression unit

h_{w,b}(x) = f(w^T x + b)

f(z) = 1 / (1 + e^{−z})

w, b are the parameters of this neuron, i.e., this logistic regression model

b: we can have an "always on" feature, which gives a class prior, or separate it out as a bias term

Page 39:

A neural network = running several logistic regressions at the same time

If we feed a vector of inputs through a bunch of logistic regression functions, then we get a vector of outputs …

But we don't have to decide ahead of time what variables these logistic regressions are trying to predict!

Page 40:

A neural network = running several logistic regressions at the same time

… which we can feed into another logistic regression function.

It is the loss function that will direct what the intermediate hidden variables should be, so as to do a good job at predicting the targets for the next layer, etc.

Page 41:

A neural network = running several logistic regressions at the same time

Before we know it, we have a multilayer neural network….

Page 42:

Matrix notation for a layer

We have:

a_1 = f(W_11 x_1 + W_12 x_2 + W_13 x_3 + b_1)
a_2 = f(W_21 x_1 + W_22 x_2 + W_23 x_3 + b_2)
etc.

In matrix notation:

z = Wx + b
a = f(z)

where f is applied element-wise:

f([z_1, z_2, z_3]) = [f(z_1), f(z_2), f(z_3)]
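A minimal numpy sketch of one such layer (my own illustration; the sigmoid is assumed as the element-wise f here):

```python
import numpy as np

def layer(W, b, x):
    """One layer: affine transform z = Wx + b followed by an element-wise
    logistic nonlinearity a = f(z)."""
    z = W @ x + b
    return 1.0 / (1.0 + np.exp(-z))

W = np.random.randn(3, 3)     # 3 hidden units, 3 inputs
b = np.random.randn(3)
x = np.random.randn(3)
a = layer(W, b, x)            # vector of activations [a_1, a_2, a_3]
```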

Page 43:

Non-linearities (f): why they're needed

•  Example: function approximation, e.g., regression or classification
•  Without non-linearities, deep neural networks can't do anything more than a linear transform

•  Extra layers could just be compiled down into a single linear transform: W_1 W_2 x = W x

•  With more layers, they can approximate more complex functions!

Page 44:

A more powerful, neural net window classifier

•  Revisiting

•  x_window = [x_museums, x_in, x_Paris, x_are, x_amazing]

•  Assume we want to classify whether the center word is a location or not


Page 45:

A Single Layer Neural Network

•  A single layer is a combination of a linear layer and a nonlinearity:

z = Wx + b
a = f(z)

•  The neural activations a can then be used to compute some output

•  For instance, a probability via softmax: p(y | x) = softmax(Wa)

•  Or an unnormalized score (even simpler): s = U^T a

Page 46:

Summary: Feed-forward Computation

Computing a window's score with a 3-layer neural net: s = score(museums in Paris are amazing)

s = U^T f(Wx + b), with x = x_window = [x_museums, x_in, x_Paris, x_are, x_amazing]
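A minimal numpy sketch of this feed-forward score (dimensions are my assumptions; an illustration, not starter code):

```python
import numpy as np

d, hidden = 50, 8
x = np.random.randn(5 * d)              # concatenated window vector x_window
W = np.random.randn(hidden, 5 * d)      # first-layer weights
b = np.random.randn(hidden)             # first-layer bias
U = np.random.randn(hidden)             # scoring vector

f = lambda z: 1.0 / (1.0 + np.exp(-z))  # element-wise logistic nonlinearity
a = f(W @ x + b)                        # hidden activations
s = U @ a                               # unnormalized window score, a single number
```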

Page 47:

Main intuition for the extra layer

The layer learns non-linear interactions between the input word vectors.

Example: only if "museums" is the first vector should it matter that "in" is in the second position

x_window = [x_museums, x_in, x_Paris, x_are, x_amazing]

Page 48:

The max-margin loss

•  s = score(museums in Paris are amazing)
•  s_c = score(Not all museums in Paris)

•  Idea for the training objective: make the score of the true window larger and the corrupt window's score lower (until they're good enough); minimize

J = max(0, 1 − s + s_c)

•  This is continuous --> we can use SGD
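A one-line implementation of this window-level objective, as a minimal sketch (my own illustration):

```python
def max_margin_loss(s, s_c):
    """Hinge loss: push the true window's score s at least 1 above
    the corrupt window's score s_c; zero loss once the margin is met."""
    return max(0.0, 1.0 - s + s_c)
```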

Page 49:

Max-margin objective function

•  Objective for a single window:

J = max(0, 1 − s + s_c)

•  Each window with a location at its center should have a score +1 higher than any window without a location at its center

•  xxx |← 1 →| ooo

•  For the full objective function: sample several corrupt windows per true one. Sum over all training windows.

Page 50:

Training with Backpropagation

Assuming cost J is > 0, compute the derivatives of s and s_c w.r.t. all the involved variables: U, W, b, x
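A short clarification of why we can focus on s and s_c separately (my restatement, not verbatim from the slide): since J = max(0, 1 − s + s_c), whenever J > 0 we have ∂J/∂s = −1 and ∂J/∂s_c = +1 (and both are 0 when J = 0). So for any parameter θ,

∂J/∂θ = ∂s_c/∂θ − ∂s/∂θ  when J > 0

which is why the following slides only differentiate the score function itself.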

Page 51:

Training with Backpropagation

•  Let's consider the derivative of a single weight W_ij

•  This only appears inside a_i

•  For example: W_23 is only used to compute a_2

(Diagram: inputs x_1, x_2, x_3 and a bias unit +1 feed hidden units a_1, a_2 through weights such as W_23 and bias b_2; the score s is computed from the hidden units via U.)

Page 52:

Training with Backpropagation

Derivative of weight W_ij:

∂s/∂W_ij = U_i ∂a_i/∂W_ij   (since W_ij only appears inside a_i)

Page 53:

Training with Backpropagation

Derivative of a single weight W_ij:

∂s/∂W_ij = U_i f'(z_i) x_j

where for logistic f:  f'(z) = f(z)(1 − f(z))

U_i f'(z_i) is the local error signal and x_j is the local input signal.
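Spelled out in full for reference (my restatement of the chain of steps these slides display; symbols follow the definitions above, with z = Wx + b and a = f(z)):

∂s/∂W_ij = ∂(U^T a)/∂W_ij = U_i ∂a_i/∂W_ij = U_i ∂f(z_i)/∂W_ij = U_i f'(z_i) ∂z_i/∂W_ij = U_i f'(z_i) ∂(W_i· x + b_i)/∂W_ij = U_i f'(z_i) x_j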

Page 54:

Training with Backpropagation

•  From single weight W_ij to full W:

•  We want all combinations of i = 1, 2 and j = 1, 2, 3 → ?

•  Solution: Outer product:

∂s/∂W = δ x^T

where δ_i = U_i f'(z_i) is the "responsibility" or error signal coming from each activation a_i

Page 55:

Training with Backpropagation

•  For the biases b, we get:

∂s/∂b_i = U_i f'(z_i) = δ_i,  i.e.  ∂s/∂b = δ

Page 56:

Training with Backpropagation

That's almost backpropagation: it's taking derivatives and using the chain rule

Remaining trick: we can re-use derivatives computed for higher layers in computing derivatives for lower layers!

Example: the last derivatives of the model, the word vectors in x

Page 57:

Training with Backpropagation

•  Take the derivative of the score with respect to a single element of a word vector

•  Now, we cannot just take into consideration one a_i, because each x_j is connected to all the neurons above and hence x_j influences the overall score through all of them; hence:

∂s/∂x_j = Σ_i U_i f'(z_i) W_ij = Σ_i δ_i W_ij

(re-using part of the previous derivative)

Page 58:

Training with Backpropagation

•  With δ_i = U_i f'(z_i), what is the full gradient? →

∂s/∂x = W^T δ

•  Observation: the error message δ that arrives at a hidden layer has the same dimensionality as that hidden layer
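Not from the slides: a quick finite-difference sanity check of ∂s/∂x = W^T δ for the single-layer scorer s = U^T f(Wx + b) with logistic f (sizes and names are my assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, hidden = 12, 5
x, b = rng.standard_normal(d_in), rng.standard_normal(hidden)
W, U = rng.standard_normal((hidden, d_in)), rng.standard_normal(hidden)

f = lambda z: 1.0 / (1.0 + np.exp(-z))
score = lambda x: U @ f(W @ x + b)

z = W @ x + b
delta = U * f(z) * (1 - f(z))        # delta_i = U_i f'(z_i), with f'(z) = f(z)(1 - f(z))
analytic = W.T @ delta               # ds/dx = W^T delta

eps = 1e-6
numeric = np.array([(score(x + eps * e) - score(x - eps * e)) / (2 * eps)
                    for e in np.eye(d_in)])
print(np.allclose(analytic, numeric, atol=1e-5))   # True
```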

Page 59:

Putting all gradients together:

•  Remember: the full objective function for each window was:

J = max(0, 1 − s + s_c)

•  For example, the gradient for U:
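Carrying that example out (my restatement; the slide shows the result as a formula): with s = U^T a and s_c = U^T a_c, where a and a_c are the hidden activations of the true and the corrupt window,

∂J/∂U = a_c − a  when 1 − s + s_c > 0,  and  ∂J/∂U = 0  otherwise.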

Page 60:

Summary

Congrats! Super useful basic components and a real model

•  Word vector training

•  Windows

•  Softmax and cross entropy error → PSet 1

•  Scores and max-margin loss

•  Neural network → PSet 1

One more half of a math-heavy lecture

Then the rest will be easier and more applied :)

Page 61:

Next lecture:

Project advice

Taking more and deeper derivatives → Full Backprop

Then we have all the basic tools in place to learn about more complex models and have some fun :)