Natural Language Processing with Deep Learning
CS224N/Ling284
Lecture 4: Word Window Classification and Neural Networks
Richard Socher
Organization
• Main midterm: Feb 13
• Alternative midterm: Friday Feb 9 – tough choice
• Final project and PSet 4 can both have groups of 3.
• Highly discouraged (especially for PSet 4) and need more results, so higher pressure for you to coordinate and deliver.
• Python review session tomorrow (Friday) 3-4:20pm at the nVidia auditorium
• Coding session: next Monday, 6-9pm
• Project website is up
Overview Today:
• Classification background
• Updating word vectors for classification
• Window classification & cross-entropy error derivation tips
• A single layer neural network!
• Max-margin loss and backprop
• This will be a tough lecture for some! → Come to office hours
Classification setup and notation
• Generally we have a training dataset consisting of samples
  {x_i, y_i}, i = 1…N
• x_i are inputs, e.g. words (indices or vectors!), context windows, sentences, documents, etc.
  • Dimension d
• y_i are labels (one of C classes) we try to predict, for example:
  • classes: sentiment, named entities, buy/sell decision
  • other words
  • later: multi-word sequences
Classification intuition
• Training data: {x_i, y_i}, i = 1…N
• Simple illustration case:
  • Fixed 2D word vectors to classify
  • Using logistic regression
  • Linear decision boundary
• General ML approach: assume x_i are fixed, train logistic regression weights (only modifies the decision boundary)
• Goal: For each x, predict:
  p(y|x) = exp(W_y· x) / Σ_{c=1..C} exp(W_c· x)
• Visualizations with ConvNetJS by Karpathy: http://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html
Details of the softmax classifier
We can tease apart the prediction function p(y|x) = exp(f_y) / Σ_c exp(f_c) into two steps:
1. Take the y'th row of W and multiply that row with x:
   f_y = W_y· x = Σ_{j=1..d} W_yj x_j
   Compute all f_c for c = 1, …, C
2. Apply the softmax function to get the normalized probability:
   p(y|x) = softmax(f)_y = exp(f_y) / Σ_{c=1..C} exp(f_c)
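A minimal numpy sketch of these two steps (toy dimensions and random weights below are illustrative, not from the lecture):

```python
import numpy as np

def softmax(f):
    # Shift by the max for numerical stability; the probabilities are unchanged.
    exp_f = np.exp(f - np.max(f))
    return exp_f / exp_f.sum()

C, d = 4, 10                          # toy number of classes and input dimension
rng = np.random.default_rng(0)
W = rng.standard_normal((C, d))       # one row of weights per class
x = rng.standard_normal(d)            # input word vector

f = W @ x                             # step 1: unnormalized scores f_c = W_c . x
p = softmax(f)                        # step 2: normalized probabilities, sums to 1
```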
Training with softmax and cross-entropy error
• For each training example {x, y}, our objective is to maximize the probability of the correct class y
• Hence, we minimize the negative log probability of that class:
  J = −log p(y|x) = −log ( exp(f_y) / Σ_{c=1..C} exp(f_c) )
Background: Why "cross entropy" error
• Assume a ground-truth (or gold or target) probability distribution that is 1 at the right class and 0 everywhere else: p = [0, …, 0, 1, 0, …, 0], and that our computed probability is q. Then the cross entropy is:
  H(p, q) = −Σ_{c=1..C} p(c) log q(c)
• Because of the one-hot p, the only term left is the negative log probability of the true class
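A tiny self-contained check of this reduction (the probabilities and class index below are made up for illustration):

```python
import numpy as np

q = np.array([0.1, 0.2, 0.6, 0.1])     # model probabilities (made-up values)
y = 2                                   # index of the true class
p = np.zeros_like(q)
p[y] = 1.0                              # one-hot ground-truth distribution

cross_entropy = -np.sum(p * np.log(q))  # H(p, q) = -sum_c p(c) log q(c)
assert np.isclose(cross_entropy, -np.log(q[y]))   # reduces to -log q(true class)
```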
Side note: The KL divergence
• Cross entropy can be re-written in terms of the entropy and the Kullback-Leibler divergence between the two distributions:
  H(p, q) = H(p) + D_KL(p‖q)
• Because H(p) is zero in our case (and even if it weren't, it would be fixed and have no contribution to the gradient), minimizing this is equal to minimizing the KL divergence between p and q
• The KL divergence is not a distance but a non-symmetric measure of the difference between two probability distributions p and q:
  D_KL(p‖q) = Σ_{c=1..C} p(c) log ( p(c) / q(c) )
Classification over a full dataset
• Cross-entropy loss function over the full dataset {x_i, y_i}, i = 1…N:
  J(θ) = (1/N) Σ_{i=1..N} −log ( exp(f_{y_i}) / Σ_{c=1..C} exp(f_c) )
• Instead of writing f_y = f_y(x) = W_y· x = Σ_{j=1..d} W_yj x_j
• We will write f in matrix notation: f = Wx
  • We can still index elements of it based on class
Classification: Regularization!
• The really full loss function in practice includes regularization over all parameters θ:
  J(θ) = (1/N) Σ_{i=1..N} −log ( exp(f_{y_i}) / Σ_{c=1..C} exp(f_c) ) + λ Σ_k θ_k²
• Regularization prevents overfitting when we have a lot of features (or, later, a very powerful/deep model, etc.)
(Figure: training vs. test error as a function of model power, illustrating overfitting)
Details: General ML optimization
• For general machine learning, θ usually consists only of the columns of W:
  θ = [W_·1, …, W_·d] = W(:) ∈ R^{Cd}
• So we only update the decision boundary (visualizations with ConvNetJS by Karpathy)
Classification difference with word vectors
• Common in deep learning:
  • Learn both W and the word vectors x, so θ also contains a d-dimensional vector for every word in the vocabulary
• Very large! Danger of overfitting!
A pitfall when retraining word vectors
• Setting: We are training a logistic regression classification model for movie review sentiment using single words.
• In the training data we have "TV" and "telly"
• In the testing data we have "television"
• The pre-trained word vectors have all three similar:
(Figure: "TV", "telly", and "television" close together in vector space)
• Question: What happens when we retrain the word vectors?
A pitfall when retraining word vectors
• Question: What happens when we train the word vectors?
• Answer:
  • Those that are in the training data move around
    • "TV" and "telly"
  • Words not in the training data stay where they were
    • "television"
(Figure: "TV" and "telly" have moved across the decision boundary while "television" stays behind)
• This is bad!
So what should I do?
• Question: Should I train my own word vectors?
• Answer:
  • If you only have a small training dataset, don't train the word vectors.
  • If you have a very large dataset, it may work better to train word vectors to the task.
(Figure: same "TV" / "telly" / "television" illustration as before)
Side note on word vectors notation
• The word vector matrix L is also called the lookup table
• Word vectors = word embeddings = word representations (mostly)
• Usually obtained from methods like word2vec or GloVe
  L is a d × V matrix with one column per vocabulary word: aardvark, a, …, meta, …, zebra
• These are the word features x_word from now on
• New development (later in the class): character models
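A tiny illustration of the lookup-table idea (toy vocabulary and random vectors, purely for notation):

```python
import numpy as np

vocab = ["aardvark", "a", "meta", "zebra", "museums"]   # toy vocabulary
d = 4                                                   # embedding dimension
rng = np.random.default_rng(0)
L = rng.standard_normal((d, len(vocab)))                # lookup table: one column per word

word_to_index = {w: i for i, w in enumerate(vocab)}
x_museums = L[:, word_to_index["museums"]]              # the word feature x_word
```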
Window classification
• Classifying single words is rarely done.
• Interesting problems like ambiguity arise in context!
• Example: auto-antonyms:
  • "To sanction" can mean "to permit" or "to punish."
  • "To seed" can mean "to place seeds" or "to remove seeds."
• Example: ambiguous named entities:
  • Paris → Paris, France vs. Paris Hilton
  • Hathaway → Berkshire Hathaway vs. Anne Hathaway
Window classification
• Idea: classify a word in its context window of neighboring words.
• For example, Named Entity Recognition is a 4-way classification task:
  • Person, Location, Organization, None
• There are many ways to classify a single word in context
  • For example: average all the words in a window
  • Problem: that would lose position information
Window classification
• Train a softmax classifier to classify a center word by taking the concatenation of all word vectors surrounding it
• Example: Classify "Paris" in the context of this sentence with window length 2:
  … museums in Paris are amazing …
  x_window = [x_museums  x_in  x_Paris  x_are  x_amazing]^T
• Resulting vector x_window = x ∈ R^{5d}, a column vector!
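A sketch of building x_window by concatenation (names and random toy vectors are illustrative):

```python
import numpy as np

d = 4
rng = np.random.default_rng(0)
vocab = ["museums", "in", "Paris", "are", "amazing"]
L = {w: rng.standard_normal(d) for w in vocab}        # toy word vectors

window = ["museums", "in", "Paris", "are", "amazing"] # center word "Paris", window length 2
x_window = np.concatenate([L[w] for w in window])     # concatenated vector in R^{5d}
assert x_window.shape == (5 * d,)
```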
Simplest window classifier: Softmax
• With x = x_window we can use the same softmax classifier as before; its output ŷ is the predicted model output probability:
  ŷ = p(y|x) = exp(W_y· x) / Σ_c exp(W_c· x)
• With cross-entropy error as before:
  J = −log p(y|x)
• But how do you update the word vectors?
Deriving gradients for window model
• Short answer: Just take derivatives as before
• Long answer: Let's go over the steps together (helpful for PSet 1)
• Define:
  • ŷ: softmax probability output vector (see previous slide)
  • t: target probability distribution (all 0's except at the ground-truth index of class y, where it's 1)
  • f = f(x) = Wx, and f_c = c'th element of the f vector
• Hard the first time, hence some tips now :)
Deriving gradients for window model
• Tip 1: Carefully define your variables and keep track of their dimensionality!
• Tip 2: Chain rule! If y = f(u) and u = g(x), i.e. y = f(g(x)), then:
  dy/dx = (dy/du)(du/dx)
• Example repeated:
Deriving gradients for window model
• Tip 2 continued: Know thy chain rule
  • Don't forget which variables depend on what, and that x appears inside all elements of f
• Tip 3: For the softmax part of the derivative: first take the derivative wrt f_c when c = y (the correct class), then take the derivative wrt f_c when c ≠ y (all the incorrect classes)
Deriving gradients for window model
• Tip 4: When you take derivatives wrt one element of f, try to see if you can create a gradient in the end that includes all partial derivatives (here, ∂J/∂f = ŷ − t)
• Tip 5: To later not go insane (and for the implementation!) → write the results in terms of vector operations and define single index-able vectors:
Deriving gradients for window model
• Tip 6: When you start with the chain rule, first use explicit sums and look at partial derivatives of e.g. x_i or W_ij
• Tip 7: To clean it up for even more complex functions later: know the dimensionality of variables & simplify into matrix notation
• Tip 8: Write this out in full sums if it's not clear!
Deriving gradients for window model
• What is the dimensionality of the window vector gradient?
• x is the entire window, 5 d-dimensional word vectors, so the derivative wrt x has to have the same dimensionality:
  ∂J/∂x ∈ R^{5d}
Deriving gradients for window model
• The gradient that arrives at and updates the word vectors can simply be split up for each word vector:
• Let δ_window = ∂J/∂x_window ∈ R^{5d}
• With x_window = [x_museums  x_in  x_Paris  x_are  x_amazing]
• We have δ_window = [∇x_museums, ∇x_in, ∇x_Paris, ∇x_are, ∇x_amazing], i.e. one d-dimensional update per window word
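A sketch of this for the softmax window classifier (the cross-entropy gradient wrt f is ŷ − t, so the gradient reaching the window vector is Wᵀ(ŷ − t), then split per word; all sizes and names below are illustrative):

```python
import numpy as np

C, d = 4, 4                                 # toy classes and word-vector dimension
rng = np.random.default_rng(0)
W = rng.standard_normal((C, 5 * d))         # softmax weights over the whole window
x_window = rng.standard_normal(5 * d)
y = 1                                       # true class index

f = W @ x_window
y_hat = np.exp(f - f.max()); y_hat /= y_hat.sum()   # softmax output
t = np.zeros(C); t[y] = 1.0                         # one-hot target

grad_x_window = W.T @ (y_hat - t)           # gradient of J wrt the whole window, in R^{5d}
grad_per_word = np.split(grad_x_window, 5)  # one d-dimensional update per window word
```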
Deriving gradients for window model
• This will push word vectors into areas such that they will be helpful in determining named entities.
• For example, the model can learn that seeing x_in as the word just before the center word is indicative of the center word being a location
What's missing for training the window model?
• The gradient of J wrt the softmax weights W!
• Similar steps: write down the partial wrt W_ij first!
• Then we have the full ∂J/∂W
A note on matrix implementations
• There are two expensive operations in the softmax classifier:
  • The matrix multiplication and the exp
• A large matrix multiplication is always more efficient than a for loop!
• Example code on the next slide →
A note on matrix implementations
• Looping over word vectors versus concatenating them all into one large matrix and then multiplying the softmax weights with that matrix:
  • 1000 loops, best of 3: 639 µs per loop (word-by-word loop)
  • 10000 loops, best of 3: 53.8 µs per loop (single matrix multiplication)
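A sketch in the spirit of that comparison (toy sizes; exact timings will vary by machine and are not the slide's numbers):

```python
import numpy as np
import timeit

C, d, N = 5, 300, 500
rng = np.random.default_rng(0)
W = rng.standard_normal((C, d))                         # softmax weights
wordvectors_list = [rng.standard_normal(d) for _ in range(N)]
wordvectors_matrix = np.column_stack(wordvectors_list)  # d x N matrix of word vectors

def loop_version():
    return [W @ x for x in wordvectors_list]            # one small multiply per word

def matrix_version():
    return W @ wordvectors_matrix                       # one big C x N multiply

print("loop:  ", timeit.timeit(loop_version, number=100))
print("matrix:", timeit.timeit(matrix_version, number=100))   # typically much faster
```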
A note on matrix implementations
• The result of the faster method is a C × N matrix:
  • Each column is an f(x) in our notation (unnormalized class scores)
• You should speed-test your code a lot too
• Tl;dr: Matrices are awesome! Matrix multiplication is better than a for loop
Softmax (= logistic regression) alone not very powerful
• Softmax only gives linear decision boundaries in the original space.
• With little data that can be a good regularizer
• With more data it is very limiting!
Softmax (= logistic regression) is not very powerful
• Softmax gives only linear decision boundaries
• → Unhelpful when the problem is complex
• Wouldn't it be cool to get these correct?
Neural Nets for the Win!
• Neural networks can learn much more complex functions and nonlinear decision boundaries!
From logistic regression to neural nets
Demystifying neural networks
Neural networks come with their own terminological baggage.
But if you understand how softmax models work, then you already understand the operation of a basic neuron!
A single neuron: a computational unit with n (here 3) inputs and 1 output, and parameters W, b
(Figure: inputs feed through an activation function to the output; the bias unit corresponds to the intercept term)
A neuron is essentially a binary logistic regression unit
  h_{w,b}(x) = f(w^T x + b)
  f(z) = 1 / (1 + e^{−z})
• w, b are the parameters of this neuron, i.e., this logistic regression model
• b: We can have an "always on" feature, which gives a class prior, or separate it out, as a bias term
• f = nonlinear activation fct. (e.g. sigmoid), w = weights, b = bias, h = hidden, x = inputs
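A minimal sketch of one such neuron (random parameters, purely illustrative):

```python
import numpy as np

def f(z):
    # logistic (sigmoid) activation
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.standard_normal(3)     # n = 3 inputs
w = rng.standard_normal(3)     # weights
b = 0.1                        # bias (intercept) term

h = f(w @ x + b)               # h_{w,b}(x) = f(w^T x + b), a value in (0, 1)
```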
A neural network = running several logistic regressions at the same time
If we feed a vector of inputs through a bunch of logistic regression functions, then we get a vector of outputs…
But we don't have to decide ahead of time what variables these logistic regressions are trying to predict!
A neural network = running several logistic regressions at the same time
…which we can feed into another logistic regression function.
It is the loss function that will direct what the intermediate hidden variables should be, so as to do a good job at predicting the targets for the next layer, etc.
A neural network = running several logistic regressions at the same time
Before we know it, we have a multilayer neural network….
Matrix notation for a layer
We have
  a_1 = f(W_11 x_1 + W_12 x_2 + W_13 x_3 + b_1)
  a_2 = f(W_21 x_1 + W_22 x_2 + W_23 x_3 + b_2)
  etc.
In matrix notation:
  z = Wx + b
  a = f(z)
where f is applied element-wise:
  f([z_1, z_2, z_3]) = [f(z_1), f(z_2), f(z_3)]
(Figure: a layer with activations a_1, a_2, a_3, weights such as W_12, and biases such as b_3)
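The same forward pass in a few lines of numpy (toy sizes; f applied element-wise):

```python
import numpy as np

def f(z):
    return 1.0 / (1.0 + np.exp(-z))    # element-wise sigmoid

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 3))        # 3 neurons, 3 inputs
b = rng.standard_normal(3)
x = rng.standard_normal(3)

z = W @ x + b                          # z = Wx + b
a = f(z)                               # a = f(z), applied element-wise
```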
Non-linearities (aka "f"): Why they're needed
• Example: function approximation, e.g., regression or classification
  • Without non-linearities, deep neural networks can't do anything more than a linear transform
  • Extra layers could just be compiled down into a single linear transform: W_1 W_2 x = Wx
  • With more layers, they can approximate more complex functions!
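A quick numerical check of the "compiled down" claim (random matrices, illustrative only): two stacked linear layers with no non-linearity equal one linear layer.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 6))
W2 = rng.standard_normal((6, 5))
x = rng.standard_normal(5)

two_layers = W1 @ (W2 @ x)      # two linear "layers" without a non-linearity
one_layer = (W1 @ W2) @ x       # a single equivalent linear transform W = W1 W2
assert np.allclose(two_layers, one_layer)
```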
Binary classification with unnormalized scores
• Revisiting our previous example:
  x_window = [x_museums  x_in  x_Paris  x_are  x_amazing]
• Assume we want to classify whether the center word is a Location (Named Entity Recognition)
• Similar to word2vec, we will go over all positions in a corpus. But this time, it will be supervised and only some positions should get a high score.
• The positions that have an actual NER location in their center are called "true" positions.
Binary classification for NER
• Example: Not all museums in Paris are amazing.
• Here: one true window, the one with Paris in its center, and all other windows are "corrupt" in terms of not having a named entity location in their center.
• "Corrupt" windows are easy to find and there are many: any window that isn't specifically labeled as an NER location in our corpus
A Single Layer Neural Network
• A single layer is a combination of a linear layer and a nonlinearity:
  z = Wx + b
  a = f(z)
• The neural activations a can then be used to compute some output.
• For instance, a probability via softmax: p(y|x) = softmax(Wa)
• Or an unnormalized score (even simpler): s = U^T a
Summary: Feed-forward Computation
We compute a window's score with a 3-layer neural net:
• s = score("museums in Paris are amazing")
  s = U^T f(Wx + b)
  x_window = [x_museums  x_in  x_Paris  x_are  x_amazing]
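A sketch of this feed-forward computation (toy dimensions and random parameters, not the lecture's values):

```python
import numpy as np

def f(z):
    return 1.0 / (1.0 + np.exp(-z))     # element-wise sigmoid

d, hidden = 4, 8                         # toy word-vector and hidden dimensions
rng = np.random.default_rng(0)
x_window = rng.standard_normal(5 * d)    # [x_museums x_in x_Paris x_are x_amazing]
W = rng.standard_normal((hidden, 5 * d))
b = rng.standard_normal(hidden)
U = rng.standard_normal(hidden)

a = f(W @ x_window + b)                  # hidden layer activations
s = U @ a                                # unnormalized window score s = U^T f(Wx + b)
```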
Main intuition for extra layer
The layer learns non-linear interactions between the input word vectors.
Example: only if "museums" is the first vector should it matter that "in" is in the second position
  x_window = [x_museums  x_in  x_Paris  x_are  x_amazing]
The max-margin loss
• Idea for training objective: Make the true window's score larger and the corrupt window's score lower (until they're good enough); minimize
  J = max(0, 1 − s + s_c)
• s = score(museums in Paris are amazing)
• s_c = score(Not all museums in Paris)
• This is not differentiable everywhere but it is continuous (each option of the max is continuous) → we can use SGD.
Max-margin loss
• Objective for a single window:
  J = max(0, 1 − s + s_c)
• Each window with an NER location at its center should have a score +1 higher than any window without a location at its center
  xxx |← 1 →| ooo
• For the full objective function: sample several corrupt windows per true one. Sum over all training windows.
• Similar to negative sampling in word2vec
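A sketch of the single-window objective (the scores passed in below are just example numbers):

```python
def max_margin_loss(s_true, s_corrupt):
    # J = max(0, 1 - s + s_c): zero once the true window outscores the corrupt one by 1
    return max(0.0, 1.0 - s_true + s_corrupt)

print(max_margin_loss(s_true=3.0, s_corrupt=0.5))   # 0.0, margin already satisfied
print(max_margin_loss(s_true=0.2, s_corrupt=0.6))   # 1.4, this window still contributes
```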
Deriving gradients for backprop
Assuming cost J is > 0, compute the derivatives of s and s_c wrt all the involved variables: U, W, b, x
Deriving gradients for backprop
• For this function: s = U^T f(Wx + b)
• Let's consider the derivative of a single weight W_ij
• W_ij only appears inside a_i
• For example: W_23 is only used to compute a_2, not a_1
(Figure: a two-neuron network with inputs x_1, x_2, x_3 and bias +1, activations f(z_1) = a_1 and a_2 = f(z_2), score s, highlighting U_2, W_23 and b_2)
Deriving gradients for backprop
Derivative of a single weight W_ij:
  ∂s/∂W_ij = ∂(U^T a)/∂W_ij
           = U_i ∂a_i/∂W_ij                              (ignore constants in which W_ij doesn't appear; pull out U_i since it's constant, apply chain rule)
           = U_i ∂f(z_i)/∂W_ij = U_i f'(z_i) ∂z_i/∂W_ij  (apply the definition of a; just define the derivative of f as f')
           = U_i f'(z_i) x_j                              (plug in the definition of z_i = W_i· x + b_i)
where for logistic f: f'(z) = f(z)(1 − f(z))
(Figure: the same two-neuron network as before)
Deriving gradients for backprop
Derivative of a single weight W_ij continued:
  ∂s/∂W_ij = U_i f'(z_i) x_j = δ_i x_j
  where δ_i = U_i f'(z_i) is the local error signal and x_j is the local input signal
(Figure: the same two-neuron network as before)
• We want all combinations of i = 1, 2 and j = 1, 2, 3 → ?
• Solution: Outer product:
  ∂s/∂W = δ x^T
  where δ is the "responsibility" or error signal coming from each activation a
Deriving gradients for backprop
• So far, the derivative of a single W_ij only, but we want the gradient for the full W:
  ∂s/∂W = δ x^T (the outer product collects all the W_ij partials at once)
Deriving gradients for backprop
• How to derive gradients for the biases b?
  ∂s/∂b_i = U_i f'(z_i) = δ_i, so ∂s/∂b = δ
(Figure: the same two-neuron network as before)
Training with Backpropagation
That's almost backpropagation: it's taking derivatives and using the chain rule.
Remaining trick: we can re-use derivatives computed for higher layers when computing derivatives for lower layers!
Example: the last derivatives of the model, the word vectors in x
Training with Backpropagation
• Take the derivative of the score with respect to a single element of a word vector:
  ∂s/∂x_j = Σ_i U_i f'(z_i) W_ij = Σ_i δ_i W_ij
• Now, we cannot just take into consideration one a_i, because each x_j is connected to all the neurons above, and hence x_j influences the overall score through all of them; hence the sum.
• Re-used part of the previous derivative: δ_i = U_i f'(z_i)
Training with Backpropagation
• With ∂s/∂x_j = Σ_i δ_i W_ij, what is the full gradient? → ∂s/∂x = W^T δ
• Observation: The error message δ that arrives at a hidden layer has the same dimensionality as that hidden layer
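Putting the pieces of this derivation into a short sketch (sigmoid f and random toy parameters are assumptions; δ = U ∘ f′(z), ∂s/∂W = δ xᵀ, ∂s/∂b = δ, ∂s/∂x = Wᵀδ), with a numerical spot-check on one entry of W:

```python
import numpy as np

def f(z):      return 1.0 / (1.0 + np.exp(-z))
def fprime(z): return f(z) * (1.0 - f(z))          # derivative of the logistic

rng = np.random.default_rng(0)
hidden, n_in = 2, 3
W = rng.standard_normal((hidden, n_in))
b = rng.standard_normal(hidden)
U = rng.standard_normal(hidden)
x = rng.standard_normal(n_in)

z = W @ x + b
a = f(z)
s = U @ a                                          # forward pass: s = U^T f(Wx + b)

delta = U * fprime(z)                              # error signal: delta_i = U_i f'(z_i)
grad_W = np.outer(delta, x)                        # ds/dW = delta x^T
grad_b = delta                                     # ds/db = delta
grad_x = W.T @ delta                               # ds/dx = W^T delta (re-uses delta)

# Numerical spot-check of one entry of grad_W via a finite difference.
eps = 1e-6
W_pert = W.copy(); W_pert[1, 2] += eps
s_pert = U @ f(W_pert @ x + b)
assert np.isclose((s_pert - s) / eps, grad_W[1, 2], atol=1e-4)
```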
Putting all gradients together:
• Remember: The full objective function for each window was:
  J = max(0, 1 − s + s_c)
• For example, the gradient for just U (non-zero only when 1 − s + s_c > 0):
  ∂J/∂U = −∂s/∂U + ∂s_c/∂U = −a + a_c
Summary
Congrats! Super useful basic components and a real model:
• Word vector training
• Windows
• Softmax and cross-entropy error → PSet 1
• Scores and max-margin loss
• Neural network → PSet 1
Next lecture:
Taking more and deeper derivatives → Full backprop
High-level tips for easier derivations
Then we have all the basic tools in place to learn about and have fun with more complex and deeper models :)