
Page 1:

Natural Language Processing with Deep Learning

CS224N/Ling284

Lecture 4: Word Window Classification and Neural Networks

Christopher Manning and Richard Socher


Page 2:

Overview

Today:

•  Classification background

•  Updating word vectors for classification

•  Window classification & cross entropy error derivation tips

•  A single layer neural network!

•  Max-margin loss and backprop

This lecture will help a lot with PSet 1 :)

Page 3:

Classification setup and notation

•  Generally we have a training dataset consisting of samples

{x_i, y_i} for i = 1, …, N

•  x_i – inputs, e.g. words (indices or vectors!), context windows, sentences, documents, etc.

•  y_i – labels we try to predict, for example:
•  class: sentiment, named entities, buy/sell decision
•  other words
•  later: multi-word sequences

Page 4:

Classification intuition

•  Training data: {x_i, y_i} for i = 1, …, N

•  Simple illustration case:
•  fixed 2-d word vectors to classify
•  using logistic regression
•  → linear decision boundary

•  General ML: assume x is fixed, train the logistic regression weights W → only modify the decision boundary

•  Goal: for each x, predict a probability for each class y:

p(y | x) = exp(W_y · x) / Σ_c exp(W_c · x)

Visualizations with ConvNetJS by Karpathy! http://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html

Page 5:

Details of the softmax

•  We can tease apart p(y | x) into two steps:

1.  Take the y'th row of W and multiply that row with x:

f_y = W_y · x = Σ_i W_{yi} x_i

Compute all f_c for c = 1, …, C

2.  Normalize to obtain the probability with the softmax function:

p(y | x) = exp(f_y) / Σ_c exp(f_c) = softmax(f)_y
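To make the two steps concrete, here is a minimal numpy sketch (my own illustration, not the course's starter code; variable names and sizes are assumptions):

```python
import numpy as np

def softmax_prob(W, x, y):
    """Probability of class y under a softmax classifier:
    W is C x d (one row of weights per class), x is a d-dimensional input."""
    f = W @ x                         # step 1: unnormalized scores f_c = W_c . x
    f = f - f.max()                   # shift for numerical stability (result unchanged)
    p = np.exp(f) / np.exp(f).sum()   # step 2: normalize with the softmax
    return p[y]

W = np.random.randn(3, 5)             # 3 classes, 5-dimensional inputs
x = np.random.randn(5)
print(softmax_prob(W, x, y=1))
```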

Page 6:

The softmax and cross-entropy error

•  For each training example {x, y}, our objective is to maximize the probability of the correct class y

•  Hence, we minimize the negative log probability of that class:

J = −log p(y | x) = −log ( exp(f_y) / Σ_c exp(f_c) )

Page 7:

Background: Why "cross entropy" error

•  Assume a ground truth (or gold or target) probability distribution that is 1 at the right class and 0 everywhere else: p = [0, …, 0, 1, 0, …, 0], and let our computed probability be q. Then the cross entropy is:

H(p, q) = − Σ_c p(c) log q(c)

•  Because of the one-hot p, the only term left is the negative log probability of the true class
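A quick worked example (the numbers are my own illustration, not from the slide): with C = 3 classes, true class y = 2, so p = [0, 1, 0], and a predicted distribution q = [0.1, 0.7, 0.2]:

H(p, q) = −(0 · log 0.1 + 1 · log 0.7 + 0 · log 0.2) = −log 0.7 ≈ 0.36

which is exactly the negative log probability of the true class.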

Page 8:

Side note: The KL divergence

•  Cross entropy can be re-written in terms of the entropy and the Kullback-Leibler divergence between the two distributions:

H(p, q) = H(p) + D_KL(p ‖ q)

•  Because H(p) is zero in our case (and even if it weren't, it would be fixed and have no contribution to the gradient), minimizing this is equivalent to minimizing the KL divergence between p and q

•  The KL divergence is not a distance but a non-symmetric measure of the difference between the two probability distributions p and q:

D_KL(p ‖ q) = Σ_c p(c) log ( p(c) / q(c) )

Page 9:

Classification over a full dataset

•  Cross entropy loss function over the full dataset {x_i, y_i}, i = 1, …, N:

J(θ) = (1/N) Σ_i −log ( exp(f_{y_i}) / Σ_c exp(f_c) )

•  Instead of writing f_y = W_y · x for each class separately

•  We will write f in matrix notation: f = Wx
•  We can still index elements of it based on class

Page 10:

Classification: Regularization!

•  The really full loss function over any dataset also includes regularization over all parameters θ:

J(θ) = (1/N) Σ_i −log ( exp(f_{y_i}) / Σ_c exp(f_c) ) + λ Σ_k θ_k²

•  Regularization will prevent overfitting when we have a lot of features (or later a very powerful/deep model)

•  (Figure: x-axis is a more powerful model or more training iterations; blue curve: training error, red curve: test error)

Page 11:

Details: General ML optimization

•  For general machine learning, θ usually consists only of the columns of W:

θ = [W_{·1}, …, W_{·d}] ∈ R^{C·d}

•  So we only update the decision boundary. Visualizations with ConvNetJS by Karpathy

Page 12:

Classification difference with word vectors

•  Common in deep learning:
•  learn both W and the word vectors x

•  θ then also contains all the word vectors of the vocabulary, so it is very large — overfitting danger!

Page 13:

Losing generalization by re-training word vectors

•  Setting: Training a logistic regression for movie review sentiment on single words, and in the training data we have
•  "TV" and "telly"

•  In the testing data we have
•  "television"

•  Originally they were all similar (from pre-trained word vectors)

•  What happens when we train the word vectors?

(Figure: "TV", "telly", and "television" start out close together in vector space.)

Page 14:

Losing generalization by re-training word vectors

•  What happens when we train the word vectors?
•  Those that are in the training data move around
•  Words from pre-training that do NOT appear in training stay

•  Example:
•  in the training data: "TV" and "telly"
•  only in the testing data: "television"

(Figure: "TV" and "telly" have moved, leaving "television" behind :( )

Page 15:

Losing generalization by re-training word vectors

•  Take home message: If you only have a small training dataset, don't train the word vectors. If you have a very large dataset, it may work better to train the word vectors to the task.

Page 16:

Side note on word vector notation

•  The word vector matrix L is also called the lookup table
•  Word vectors = word embeddings = word representations (mostly)
•  Mostly from methods like word2vec or GloVe

•  L is a d × V matrix with one column per vocabulary word (aardvark, a, …, meta, …, zebra)
•  These are the word features x_word from now on

•  New development (later in the class): character models

Page 17:

Window classification

•  Classifying single words is rarely done.

•  Interesting problems like ambiguity arise in context!

•  Example: auto-antonyms:
•  "To sanction" can mean "to permit" or "to punish."
•  "To seed" can mean "to place seeds" or "to remove seeds."

•  Example: ambiguous named entities:
•  Paris → Paris, France vs. Paris Hilton
•  Hathaway → Berkshire Hathaway vs. Anne Hathaway

Page 18:

Window classification

•  Idea: classify a word in its context window of neighboring words.

•  For example, named entity recognition into 4 classes:
•  person, location, organization, none

•  Many possibilities exist for classifying one word in context, e.g. averaging all the words in a window, but that loses position information

Page 19:

Window classification

•  Train a softmax classifier by assigning a label to a center word and concatenating all word vectors surrounding it

•  Example: Classify "Paris" in the context of this sentence with window length 2:

… museums in Paris are amazing …
x_window = [x_museums, x_in, x_Paris, x_are, x_amazing]^T

•  Resulting vector x_window = x ∈ R^{5d}, a column vector!
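A minimal numpy sketch of building this window vector (my own illustration; the toy vocabulary and dimensionality are assumptions, not starter code):

```python
import numpy as np

# Build the window vector for center word "Paris" with window length 2
# by concatenating the five word vectors looked up in the matrix L.
d = 50                                            # word vector dimensionality (assumed)
vocab = {"museums": 0, "in": 1, "Paris": 2, "are": 3, "amazing": 4}
L = np.random.randn(len(vocab), d)                # lookup table, one row per word here

window = ["museums", "in", "Paris", "are", "amazing"]
x_window = np.concatenate([L[vocab[w]] for w in window])
print(x_window.shape)                             # (250,), i.e. 5d
```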

Page 20:

Simplest window classifier: Softmax

•  With x = x_window we can use the same softmax classifier as before

•  With cross entropy error as before:

J = −log p(y | x_window)

where p(y | x_window) = softmax(W x_window)_y is the predicted model output probability — the classifier is the same as before, only the input has changed.

•  But how do you update the word vectors?

Page 21:

Updating concatenated word vectors

•  Short answer: Just take derivatives as before

•  Long answer: Let's go over the steps together (helpful for PSet 1)

•  Define:
•  the softmax probability output vector (see previous slide)
•  the target probability distribution (all 0's except a 1 at the ground truth index of class y)

•  and f = Wx, with f_c = the c'th element of the f vector

•  Hard, the first time, hence some tips now :)

Page 22:

Updating concatenated word vectors

•  Tip 1: Carefully define your variables and keep track of their dimensionality!

•  Tip 2: Chain rule! If y = f(u) and u = g(x), i.e. y = f(g(x)), then:

dy/dx = (dy/du) (du/dx)

•  Simple example:
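The slide works a small example at this point; since it is not legible in this transcript, here is a generic one of my own with the same structure:

If y = u² and u = 3x + 1, then dy/du = 2u and du/dx = 3, so

dy/dx = (dy/du)(du/dx) = 2u · 3 = 6(3x + 1)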

Page 23:

Updating concatenated word vectors

•  Tip 2 continued: Know thy chain rule
•  Don't forget which variables depend on what, and that x appears inside all elements of f

•  Tip 3: For the softmax part of the derivative, first take the derivative w.r.t. f_c when c = y (the correct class), then take the derivative w.r.t. f_c when c ≠ y (all the incorrect classes)
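For reference, carrying out Tip 3 gives the standard result (this is my restatement; the slide derives the same thing, possibly with different symbol names). Writing ŷ = softmax(f) for the predicted distribution and t for the one-hot target:

∂J/∂f_c = ŷ_c − 1  for c = y,  and  ∂J/∂f_c = ŷ_c  for c ≠ y

i.e. in vector form, ∂J/∂f = ŷ − t.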

Page 24:

Updating concatenated word vectors

•  Tip 4: When you take the derivative w.r.t. one element of f, try to see if you can create a gradient in the end that includes all partial derivatives:

•  Tip 5: To later not go insane (and for the implementation!) → express the results in terms of vector operations and define single index-able vectors:

Page 25:

Updating concatenated word vectors

•  Tip 6: When you start with the chain rule, first use explicit sums and look at partial derivatives of e.g. x_i or W_ij

•  Tip 7: To clean it up for even more complex functions later: know the dimensionality of variables & simplify into matrix notation

•  Tip 8: Write this out in full sums if it's not clear!

Page 26:

Updating concatenated word vectors

•  What is the dimensionality of the window vector gradient?

•  x is the entire window, 5 d-dimensional word vectors, so the derivative w.r.t. x has to have the same dimensionality:

∂J/∂x ∈ R^{5d}

Page 27:

Updating concatenated word vectors

•  The gradient that arrives at and updates the word vectors can simply be split up for each word vector:

•  Let δ_window = ∂J/∂x_window ∈ R^{5d}
•  With x_window = [x_museums, x_in, x_Paris, x_are, x_amazing]

•  We have:

δ_window = [∂J/∂x_museums, ∂J/∂x_in, ∂J/∂x_Paris, ∂J/∂x_are, ∂J/∂x_amazing]
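In code, this split is just slicing the 5d gradient into five d-dimensional pieces (a minimal sketch with made-up names, not the course's starter code):

```python
import numpy as np

d = 50
delta_window = np.random.randn(5 * d)   # stands in for dJ/dx_window
grads = np.split(delta_window, 5)       # one (d,) gradient per word in the window
# grads[0] updates x_museums, grads[1] x_in, grads[2] x_Paris,
# grads[3] x_are, grads[4] x_amazing
```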

Page 28:

Updating concatenated word vectors

•  This will push the word vectors into areas such that they will be helpful in determining named entities.

•  For example, the model can learn that seeing x_in as the word just before the center word is indicative of the center word being a location

Page 29:

What's missing for training the window model?

•  The gradient of J w.r.t. the softmax weights W!

•  Similar steps: write down the partial w.r.t. W_ij first!
•  Then we have the full ∂J/∂W

Page 30:

A note on matrix implementations

•  There are two expensive operations in the softmax:

•  the matrix multiplication and the exp

•  A for loop is never as efficient as a single large matrix multiplication!

•  Example code →

Page 31:

A note on matrix implementations

•  Looping over word vectors, instead of concatenating them all into one large matrix and then multiplying the softmax weights with that matrix:

•  1000 loops, best of 3: 639 µs per loop (the for-loop version)
   10000 loops, best of 3: 53.8 µs per loop (the single matrix multiplication)
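The slide's actual example code is not reproduced in this transcript; here is a small numpy sketch of the same comparison (sizes and names are my assumptions), which can be timed with %timeit in IPython:

```python
import numpy as np

C, d, N = 5, 300, 500                       # classes, vector size, number of windows (assumed)
W = np.random.randn(C, d)                   # softmax weights
wordvectors = [np.random.randn(d) for _ in range(N)]

def scores_loop():
    # one small matrix-vector product per word vector
    return [W @ x for x in wordvectors]

def scores_matmul():
    # concatenate into one d x N matrix, then a single C x N matrix multiplication
    X = np.column_stack(wordvectors)
    return W @ X

# In IPython:  %timeit scores_loop()   vs   %timeit scores_matmul()
# The single large multiplication is typically around an order of magnitude faster.
```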

Page 32:

A note on matrix implementations

•  The result of the faster method is a C × N matrix:

•  Each column is an f(x) in our notation (unnormalized class scores)

•  Matrices are awesome!

•  You should speed-test your code a lot too

Page 33:

Softmax (= logistic regression) alone is not very powerful

•  Softmax only gives linear decision boundaries in the original space.

•  With little data that can be a good regularizer

•  With more data it is very limiting!

Page 34:

Softmax (= logistic regression) is not very powerful

•  Softmax gives only linear decision boundaries

•  → Lame when the problem is complex

•  Wouldn't it be cool to get these correct?

Page 35:

Neural Nets for the Win!

•  Neural networks can learn much more complex functions and nonlinear decision boundaries!

Page 36:

From logistic regression to neural nets

Page 37:

Demystifying neural networks

Neural networks come with their own terminological baggage

But if you understand how softmax models work

Then you already understand the operation of a basic neuron!

A single neuron: a computational unit with n (= 3) inputs and 1 output, and parameters W, b

(Diagram: inputs → activation function → output; the bias unit corresponds to the intercept term.)

Page 38:

A neuron is essentially a binary logistic regression unit

h_{w,b}(x) = f(w^T x + b)

f(z) = 1 / (1 + e^{−z})

w, b are the parameters of this neuron, i.e., this logistic regression model

b: we can have an "always on" feature, which gives a class prior, or separate it out as a bias term

Page 39:

A neural network = running several logistic regressions at the same time

If we feed a vector of inputs through a bunch of logistic regression functions, then we get a vector of outputs …

But we don't have to decide ahead of time what variables these logistic regressions are trying to predict!

Page 40:

A neural network = running several logistic regressions at the same time

… which we can feed into another logistic regression function.

It is the loss function that will direct what the intermediate hidden variables should be, so as to do a good job at predicting the targets for the next layer, etc.

Page 41:

A neural network = running several logistic regressions at the same time

Before we know it, we have a multilayer neural network….

Page 42:

Matrix notation for a layer

We have:

a_1 = f(W_11 x_1 + W_12 x_2 + W_13 x_3 + b_1)
a_2 = f(W_21 x_1 + W_22 x_2 + W_23 x_3 + b_2)
etc.

In matrix notation:

z = Wx + b
a = f(z)

where f is applied element-wise:

f([z_1, z_2, z_3]) = [f(z_1), f(z_2), f(z_3)]
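A minimal numpy sketch of one such layer (my own illustration; the sigmoid is assumed as the element-wise f here):

```python
import numpy as np

def layer(W, b, x):
    """One layer: affine transform z = Wx + b followed by an element-wise
    logistic nonlinearity a = f(z)."""
    z = W @ x + b
    return 1.0 / (1.0 + np.exp(-z))

W = np.random.randn(3, 3)     # 3 hidden units, 3 inputs
b = np.random.randn(3)
x = np.random.randn(3)
a = layer(W, b, x)            # vector of activations [a_1, a_2, a_3]
```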

Page 43:

Non-linearities (f): why they're needed

•  Example: function approximation, e.g., regression or classification
•  Without non-linearities, deep neural networks can't do anything more than a linear transform

•  Extra layers could just be compiled down into a single linear transform: W_1 W_2 x = W x

•  With more layers, they can approximate more complex functions!

Page 44:

A more powerful, neural net window classifier

•  Revisiting

•  x_window = [x_museums, x_in, x_Paris, x_are, x_amazing]

•  Assume we want to classify whether the center word is a location or not


Page 45:

A Single Layer Neural Network

•  A single layer is a combination of a linear layer and a nonlinearity:

z = Wx + b
a = f(z)

•  The neural activations a can then be used to compute some output

•  For instance, a probability via softmax: p(y | x) = softmax(Wa)

•  Or an unnormalized score (even simpler): s = U^T a

Page 46:

Summary: Feed-forward Computation

Computing a window's score with a 3-layer neural net: s = score(museums in Paris are amazing)

s = U^T f(Wx + b), with x = x_window = [x_museums, x_in, x_Paris, x_are, x_amazing]
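A minimal numpy sketch of this feed-forward score (dimensions are my assumptions; an illustration, not starter code):

```python
import numpy as np

d, hidden = 50, 8
x = np.random.randn(5 * d)              # concatenated window vector x_window
W = np.random.randn(hidden, 5 * d)      # first-layer weights
b = np.random.randn(hidden)             # first-layer bias
U = np.random.randn(hidden)             # scoring vector

f = lambda z: 1.0 / (1.0 + np.exp(-z))  # element-wise logistic nonlinearity
a = f(W @ x + b)                        # hidden activations
s = U @ a                               # unnormalized window score, a single number
```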

Page 47:

Main intuition for the extra layer

The layer learns non-linear interactions between the input word vectors.

Example: only if "museums" is the first vector should it matter that "in" is in the second position

x_window = [x_museums, x_in, x_Paris, x_are, x_amazing]

Page 48:

The max-margin loss

•  s = score(museums in Paris are amazing)
•  s_c = score(Not all museums in Paris)

•  Idea for the training objective: make the score of the true window larger and the corrupt window's score lower (until they're good enough); minimize

J = max(0, 1 − s + s_c)

•  This is continuous --> we can use SGD
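A one-line implementation of this window-level objective, as a minimal sketch (my own illustration):

```python
def max_margin_loss(s, s_c):
    """Hinge loss: push the true window's score s at least 1 above
    the corrupt window's score s_c; zero loss once the margin is met."""
    return max(0.0, 1.0 - s + s_c)
```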

Page 49:

Max-margin objective function

•  Objective for a single window:

J = max(0, 1 − s + s_c)

•  Each window with a location at its center should have a score +1 higher than any window without a location at its center

•  xxx |← 1 →| ooo

•  For the full objective function: sample several corrupt windows per true one. Sum over all training windows.

Page 50:

Training with Backpropagation

Assuming cost J is > 0, compute the derivatives of s and s_c w.r.t. all the involved variables: U, W, b, x
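A short clarification of why we can focus on s and s_c separately (my restatement, not verbatim from the slide): since J = max(0, 1 − s + s_c), whenever J > 0 we have ∂J/∂s = −1 and ∂J/∂s_c = +1 (and both are 0 when J = 0). So for any parameter θ,

∂J/∂θ = ∂s_c/∂θ − ∂s/∂θ  when J > 0

which is why the following slides only differentiate the score function itself.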

Page 51:

Training with Backpropagation

•  Let's consider the derivative of a single weight W_ij

•  This only appears inside a_i

•  For example: W_23 is only used to compute a_2

(Diagram: inputs x_1, x_2, x_3 and a bias unit +1 feed hidden units a_1, a_2 through weights such as W_23 and bias b_2; the score s is computed from the hidden units via U.)

Page 52:

Training with Backpropagation

Derivative of weight W_ij:

∂s/∂W_ij = U_i ∂a_i/∂W_ij   (since W_ij only appears inside a_i)

Page 53:

Training with Backpropagation

Derivative of a single weight W_ij:

∂s/∂W_ij = U_i f'(z_i) x_j

where for logistic f:  f'(z) = f(z)(1 − f(z))

U_i f'(z_i) is the local error signal and x_j is the local input signal.
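Spelled out in full for reference (my restatement of the chain of steps these slides display; symbols follow the definitions above, with z = Wx + b and a = f(z)):

∂s/∂W_ij = ∂(U^T a)/∂W_ij = U_i ∂a_i/∂W_ij = U_i ∂f(z_i)/∂W_ij = U_i f'(z_i) ∂z_i/∂W_ij = U_i f'(z_i) ∂(W_i· x + b_i)/∂W_ij = U_i f'(z_i) x_j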

Page 54:

Training with Backpropagation

•  From single weight W_ij to full W:

•  We want all combinations of i = 1, 2 and j = 1, 2, 3 → ?

•  Solution: Outer product:

∂s/∂W = δ x^T

where δ_i = U_i f'(z_i) is the "responsibility" or error signal coming from each activation a_i

Page 55:

Training with Backpropagation

•  For the biases b, we get:

∂s/∂b_i = U_i f'(z_i) = δ_i,  i.e.  ∂s/∂b = δ

Page 56:

Training with Backpropagation

That's almost backpropagation: it's taking derivatives and using the chain rule

Remaining trick: we can re-use derivatives computed for higher layers in computing derivatives for lower layers!

Example: the last derivatives of the model, the word vectors in x

Page 57:

Training with Backpropagation

•  Take the derivative of the score with respect to a single element of a word vector

•  Now, we cannot just take into consideration one a_i, because each x_j is connected to all the neurons above and hence x_j influences the overall score through all of them; hence:

∂s/∂x_j = Σ_i U_i f'(z_i) W_ij = Σ_i δ_i W_ij

(re-using part of the previous derivative)

Page 58:

Training with Backpropagation

•  With δ_i = U_i f'(z_i), what is the full gradient? →

∂s/∂x = W^T δ

•  Observation: the error message δ that arrives at a hidden layer has the same dimensionality as that hidden layer
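Not from the slides: a quick finite-difference sanity check of ∂s/∂x = W^T δ for the single-layer scorer s = U^T f(Wx + b) with logistic f (sizes and names are my assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, hidden = 12, 5
x, b = rng.standard_normal(d_in), rng.standard_normal(hidden)
W, U = rng.standard_normal((hidden, d_in)), rng.standard_normal(hidden)

f = lambda z: 1.0 / (1.0 + np.exp(-z))
score = lambda x: U @ f(W @ x + b)

z = W @ x + b
delta = U * f(z) * (1 - f(z))        # delta_i = U_i f'(z_i), with f'(z) = f(z)(1 - f(z))
analytic = W.T @ delta               # ds/dx = W^T delta

eps = 1e-6
numeric = np.array([(score(x + eps * e) - score(x - eps * e)) / (2 * eps)
                    for e in np.eye(d_in)])
print(np.allclose(analytic, numeric, atol=1e-5))   # True
```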

Page 59:

Putting all gradients together:

•  Remember: the full objective function for each window was:

J = max(0, 1 − s + s_c)

•  For example, the gradient for U:
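Carrying that example out (my restatement; the slide shows the result as a formula): with s = U^T a and s_c = U^T a_c, where a and a_c are the hidden activations of the true and the corrupt window,

∂J/∂U = a_c − a  when 1 − s + s_c > 0,  and  ∂J/∂U = 0  otherwise.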

Page 60:

Summary

Congrats! Super useful basic components and a real model

•  Word vector training

•  Windows

•  Softmax and cross entropy error → PSet 1

•  Scores and max-margin loss

•  Neural network → PSet 1

One more half of a math-heavy lecture

Then the rest will be easier and more applied :)

Page 61:

Next lecture:

Project advice

Taking more and deeper derivatives → Full Backprop

Then we have all the basic tools in place to learn about more complex models and have some fun :)