Lecture 7: Tricks + Word Embeddings (aritter.github.io/courses/5525_slides_v2/lec7-nn2.pdf)


Lecture 7: Tricks + Word Embeddings

Alan Ritter (many slides from Greg Durrett)

Recall: Feedforward NNs

[Diagram: f(x) (n features) → V (d x n matrix) → z (d hidden units) → g (nonlinearity: tanh, relu, …) → W (num_classes x d matrix) → softmax → P(y|x) (num_classes probs)]

P(y|x) = softmax(W g(V f(x)))
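Below is a minimal PyTorch sketch of this architecture; the sizes n, d, and num_classes are placeholder values, not anything from the slides.

    import torch
    import torch.nn as nn

    n, d, num_classes = 1000, 100, 5                         # placeholder feature / hidden / label sizes

    class FeedforwardNN(nn.Module):
        def __init__(self):
            super().__init__()
            self.V = nn.Linear(n, d, bias=False)             # the d x n matrix V
            self.g = nn.Tanh()                               # nonlinearity (tanh, relu, ...)
            self.W = nn.Linear(d, num_classes, bias=False)   # the num_classes x d matrix W

        def forward(self, fx):
            z = self.V(fx)                                   # d hidden units
            return torch.softmax(self.W(self.g(z)), dim=-1)  # P(y|x), num_classes probs

    model = FeedforwardNN()
    probs = model(torch.randn(1, n))                         # 1 x num_classes probabilities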

Recall: Backpropagation

[Same network: f(x) → V → z → g → W → softmax → P(y|x), with P(y|x) = softmax(W g(V f(x)))]

‣ ∂L/∂W is computed from err(root) and z
‣ err(z) is then passed back through ∂z/∂V, together with f(x), to get the gradient for V
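A self-contained sketch of the same computation with explicit tensors, showing that calling backward() fills in ∂L/∂W and ∂L/∂V; the sizes and the gold label are placeholders.

    import torch

    V = torch.randn(100, 1000, requires_grad=True)    # d x n
    W = torch.randn(5, 100, requires_grad=True)       # num_classes x d
    fx = torch.randn(1000)                             # f(x)

    z = V @ fx                                         # hidden layer z
    log_probs = torch.log_softmax(W @ torch.tanh(z), dim=-1)
    loss = -log_probs[2]                               # negative log-likelihood of (placeholder) gold class 2
    loss.backward()                                    # err(root) flows back through W, g, and V
    print(W.grad.shape, V.grad.shape)                  # dL/dW is 5 x 100, dL/dV is 100 x 1000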

This Lecture

‣ Training
‣ Word representations
‣ word2vec/GloVe
‣ Evaluating word embeddings

Training Tips

Training Basics

‣ Basic formula: compute gradients on batch, use first-order optimization method
‣ How to initialize? How to regularize? What optimizer to use?
‣ This lecture: some practical tricks. Take deep learning or optimization courses to understand this further

How does initialization affect learning?

[Same diagram: f(x) (n features) → V (d x n matrix) → z (d hidden units) → g (nonlinearity: tanh, relu, …) → W (m x d matrix) → softmax → P(y|x)]

P(y|x) = softmax(W g(V f(x)))

‣ How do we initialize V and W? What consequences does this have?
‣ Nonconvex problem, so initialization matters!
‣ Nonlinear model… how does this affect things?
‣ If cell activations are too large in absolute value, gradients are small
‣ ReLU: larger dynamic range (all positive numbers), but can produce big values, can break down if everything is too negative
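A quick numerical illustration of the saturation point (the pre-activation values are arbitrary): as a tanh unit's input grows in absolute value, the gradient through it collapses toward zero.

    import torch

    for pre_activation in [0.5, 3.0, 10.0]:
        x = torch.tensor(pre_activation, requires_grad=True)
        torch.tanh(x).backward()
        print(pre_activation, x.grad.item())   # 1 - tanh(x)^2: roughly 0.79, 0.0099, 8e-9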

Initialization

1) Can't use zeroes for parameters to produce hidden layers: all values in that hidden layer are always 0 and have gradients of 0, never change
2) Initialize too large and cells are saturated

‣ Can do random uniform/normal initialization with appropriate scale
‣ Xavier initializer: U[−√(6 / (fan-in + fan-out)), +√(6 / (fan-in + fan-out))] (sketched in code below)
‣ Want variance of inputs and gradients for each layer to be the same
‣ Batch normalization (Ioffe and Szegedy, 2015): periodically shift + rescale each layer to have mean 0 and variance 1 over a batch (useful if net is deep)
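A sketch of the Xavier rule, written both by hand and with PyTorch's built-in helper; the fan-in and fan-out values are placeholders.

    import math
    import torch
    import torch.nn as nn

    fan_in, fan_out = 1000, 100                       # e.g. n features in, d hidden units out
    bound = math.sqrt(6.0 / (fan_in + fan_out))       # the Xavier bound
    V_manual = torch.empty(fan_out, fan_in).uniform_(-bound, bound)

    layer = nn.Linear(fan_in, fan_out)
    nn.init.xavier_uniform_(layer.weight)             # same rule via the built-in initializer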

Dropout

‣ Probabilistically zero out parts of the network during training to prevent overfitting, use whole network at test time
‣ Form of stochastic regularization
‣ Similar to benefits of ensembling: network needs to be robust to missing signals, so it has redundancy
‣ One line in Pytorch/Tensorflow (see the sketch below)

Srivastava et al. (2014)
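The "one line" in PyTorch is nn.Dropout; a sketch (the layer sizes and drop probability are placeholder choices):

    import torch.nn as nn

    net = nn.Sequential(nn.Linear(1000, 100), nn.ReLU(),
                        nn.Dropout(p=0.5),            # zero out ~half the hidden units during training
                        nn.Linear(100, 5))

    net.train()   # dropout active while training
    net.eval()    # dropout disabled at test time: the whole network is used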

Optimizer

‣ Adam (Kingma and Ba, ICLR 2015) is very widely used
‣ Adaptive step size like Adagrad, incorporates momentum
‣ Wilson et al. NIPS 2017: adaptive methods can actually perform badly at test time (in their plots, Adam is in pink, SGD in black)
‣ Check dev set periodically, decrease learning rate if not making progress (sketched below)
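A sketch of that recipe in PyTorch: Adam as the optimizer, with the learning rate cut when a dev-set metric stops improving. The model, learning rate, and dev loss here are placeholders.

    import torch
    import torch.nn as nn

    model = nn.Linear(100, 5)                          # stand-in for a real model
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", factor=0.5, patience=2)

    for epoch in range(10):
        # ... run the training batches, calling optimizer.step() ...
        dev_loss = 1.0 / (epoch + 1)                   # placeholder for the real dev-set loss
        scheduler.step(dev_loss)                       # decrease lr if dev loss stops improving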

Structured Prediction

‣ Four elements of a machine learning method:
  ‣ Model: feedforward, RNNs, CNNs can be defined in a uniform framework
  ‣ Objective: many loss functions look similar; it just changes the last layer of the neural network
  ‣ Inference: define the network, your library of choice takes care of it (mostly…)
  ‣ Training: lots of choices for optimization/hyperparameters

Word Representations

‣ Neural networks work very well at continuous data, but words are discrete
‣ Continuous model <-> expects continuous semantics from input
‣ "You shall know a word by the company it keeps" (Firth, 1957)

slide credit: Dan Klein

Discrete Word Representations

‣ Brown clusters: hierarchical agglomerative hard clustering (each word has one cluster, not some posterior distribution like in mixture models)

[Diagram: a binary tree over the vocabulary; one subtree holds good, enjoyable, great, another holds is, go, another holds fish, cat, dog, …; each word's cluster is the 0/1 path from the root to its leaf]

‣ Maximize P(wi | wi−1) = P(ci | ci−1) P(wi | ci)
‣ Useful features for tasks like NER, not suitable for NNs

Brown et al. (1992)

Word Embeddings

‣ Part-of-speech tagging with FFNNs
‣ Word embeddings for each word form the input (sketched in code below):

  Fed raises interest rates in order to …
  f(x) = [emb(raises), emb(interest), emb(rates), other words, feats, etc.]
          (previous word, curr word, next word)

‣ What properties should these vectors have?

Botha et al. (2017)
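A sketch of building f(x) for the word "interest" in the example sentence; the vocabulary, embedding size, and indices are made up for illustration.

    import torch
    import torch.nn as nn

    vocab = {"fed": 0, "raises": 1, "interest": 2, "rates": 3, "in": 4, "order": 5, "to": 6}
    emb = nn.Embedding(len(vocab), 50)                  # a 50-dim embedding per vocabulary word

    prev_w, curr_w, next_w = "raises", "interest", "rates"
    ids = torch.tensor([vocab[prev_w], vocab[curr_w], vocab[next_w]])
    f_x = emb(ids).reshape(-1)                          # emb(prev), emb(curr), emb(next) concatenated: 150 dims
    # f_x (plus any other features) would then feed a feedforward tagger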

Word Embeddings

[Diagram: a 2D embedding space in which good, great, and enjoyable sit close together, far from bad; dog and is lie elsewhere]

‣ Want a vector space where similar words have similar embeddings

  the movie was great  ≈  the movie was good

‣ Goal: come up with a way to produce these embeddings

word2vec/GloVe

Continuous Bag-of-Words

‣ Predict word from context

  the dog bit the man

[Diagram: the d-dimensional word embeddings of the context words "the" and "dog" are summed (size d), multiplied by W (size |V| x d), and passed through a softmax; gold label = bit, no manual labeling required!]

  P(w | w−1, w+1) = softmax(W (c(w−1) + c(w+1)))

‣ Parameters: d x |V| word embeddings (one d-length vector per voc word), |V| x d output parameters (W)

Mikolov et al. (2013)
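A minimal CBOW sketch for the single training position shown above (predict "bit" from "dog" and "the"); the toy vocabulary and embedding size are illustrative only.

    import torch
    import torch.nn as nn

    vocab = {"the": 0, "dog": 1, "bit": 2, "man": 3}
    d = 8
    c = nn.Embedding(len(vocab), d)                     # d x |V| input embeddings c(w)
    W = nn.Linear(d, len(vocab), bias=False)            # |V| x d output parameters

    w_prev, w_next, gold = vocab["dog"], vocab["the"], vocab["bit"]
    hidden = c(torch.tensor(w_prev)) + c(torch.tensor(w_next))   # c(w-1) + c(w+1)
    log_probs = torch.log_softmax(W(hidden), dim=-1)              # log P(w | w-1, w+1)
    loss = -log_probs[gold]                             # gold label = bit, no manual labeling required
    loss.backward()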

Skip-Gram

‣ Predict one word of context from word

  the dog bit the man

[Diagram: the embedding of "bit" is multiplied by W and passed through a softmax; gold = dog]

  P(w′ | w) = softmax(W e(w))

‣ Another training example: bit -> the
‣ Parameters: d x |V| vectors, |V| x d output parameters (W) (also usable as vectors!)

Mikolov et al. (2013)
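The corresponding skip-gram sketch (one training pair, predicting "dog" from "bit"); again the vocabulary and sizes are toy values.

    import torch
    import torch.nn as nn

    vocab = {"the": 0, "dog": 1, "bit": 2, "man": 3}
    d = 8
    e = nn.Embedding(len(vocab), d)                     # word vectors e(w)
    W = nn.Linear(d, len(vocab), bias=False)            # |V| x d output parameters

    w, gold_context = vocab["bit"], vocab["dog"]
    log_probs = torch.log_softmax(W(e(torch.tensor(w))), dim=-1)   # log P(w' | w)
    loss = -log_probs[gold_context]
    loss.backward()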

Hierarchical Softmax

‣ Matmul + softmax over |V| is very slow to compute for CBOW and SG

  P(w | w−1, w+1) = softmax(W (c(w−1) + c(w+1)))
  P(w′ | w) = softmax(W e(w))

‣ Standard softmax: [|V| x d] x d
‣ Hierarchical softmax: log(|V|) dot products of size d, |V| x d parameters
  ‣ Huffman encode vocabulary, use binary classifiers to decide which branch to take
  ‣ log(|V|) binary decisions

[Diagram: a binary tree over the vocabulary with words such as "the" and "a" at the leaves]

Mikolov et al. (2013)
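A rough cost comparison; the vocabulary size and dimensionality below are assumed values, not numbers from the slides.

    import math

    V, d = 100_000, 300
    full_softmax_ops = V * d                             # one [|V| x d] x d multiply per prediction
    hier_softmax_ops = math.ceil(math.log2(V)) * d       # ~log(|V|) dot products of size d
    print(full_softmax_ops, hier_softmax_ops)            # 30,000,000 vs. 5,100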

Skip-Gram with Negative Sampling

‣ Take (word, context) pairs and classify them as "real" or not. Create random negative examples by sampling from unigram distribution

  (bit, the)  => +1
  (bit, cat)  => −1
  (bit, a)    => −1
  (bit, fish) => −1

  P(y = 1 | w, c) = e^(w·c) / (e^(w·c) + 1)

  words in similar contexts select for similar c vectors

‣ d x |V| word vectors, d x |V| context vectors (same # of params as before)
‣ Objective = log P(y = 1 | w, c) + (1/k) Σ_{i=1..n} log P(y = 0 | wi, c), where the wi are sampled negatives

Mikolov et al. (2013)
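A sketch of the negative-sampling objective for one real pair and k sampled negatives, using P(y = 1 | w, c) = sigmoid(w·c); the vectors are random placeholders.

    import torch
    import torch.nn.functional as F

    d, k = 8, 4
    w = torch.randn(d, requires_grad=True)               # vector for the word, e.g. "bit"
    c_pos = torch.randn(d, requires_grad=True)           # context vector for the real context, e.g. "the"
    c_neg = torch.randn(k, d, requires_grad=True)        # context vectors for the k sampled negatives

    # log P(y=1 | w, c_pos) + (1/k) * sum_i log P(y=0 | negative_i), with P(y=0) = sigmoid(-w·c)
    obj = F.logsigmoid(w @ c_pos) + F.logsigmoid(-(c_neg @ w)).sum() / k
    (-obj).backward()                                    # ascend the objective by descending its negation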

Connections with Matrix Factorization

‣ Skip-gram model looks at word-word co-occurrences and produces two types of vectors

[Diagram: a |V| x |V| matrix of word pair counts, approximately factored into a |V| x d matrix of word vecs times a d x |V| matrix of context vecs]

‣ Looks almost like a matrix factorization… can we interpret it this way?

Levy et al. (2014)

Skip-Gram as Matrix Factorization

‣ Skip-gram objective exactly corresponds to factoring this |V| x |V| matrix:

  Mij = PMI(wi, cj) − log k        (k = num negative samples)

  PMI(wi, cj) = P(wi, cj) / (P(wi) P(cj)) = (count(wi, cj)/D) / ((count(wi)/D) (count(cj)/D))

‣ … if we sample negative examples from the uniform distribution over words
‣ … and it's a weighted factorization problem (weighted by word freq)

Levy et al. (2014)
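A sketch of building the shifted PMI matrix from a toy count matrix and factoring it with SVD; the counts, vocabulary size, and number of dimensions are placeholders, and SVD here simply stands in for some factorization of M.

    import numpy as np

    rng = np.random.default_rng(0)
    counts = rng.integers(1, 50, size=(6, 6)).astype(float)   # toy |V| x |V| word pair counts
    k = 5                                                      # number of negative samples
    D = counts.sum()

    p_wc = counts / D
    p_w = counts.sum(axis=1, keepdims=True) / D
    p_c = counts.sum(axis=0, keepdims=True) / D
    M = np.log(p_wc / (p_w * p_c)) - np.log(k)                 # M_ij = PMI(w_i, c_j) - log k

    U, S, Vt = np.linalg.svd(M)
    word_vecs = U[:, :3] * np.sqrt(S[:3])                      # 3-dim word vectors, for illustration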

GloVe (Global Vectors)

‣ Also operates on the |V| x |V| matrix of word pair counts: weighted regression on the log co-occurrence matrix

  Loss = Σ_{i,j} f(count(wi, cj)) (wiᵀ cj + ai + bj − log count(wi, cj))²

‣ Constant in the dataset size (just need counts), quadratic in voc size
‣ By far the most common word vectors used today (5000+ citations)

Pennington et al. (2014)
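A sketch of the GloVe loss on the same kind of toy count matrix; the weighting function f, the sizes, and the cap value follow the general shape of the method, but the specific numbers are placeholders.

    import torch

    V, d = 6, 3
    counts = torch.randint(1, 50, (V, V)).float()        # toy word pair counts
    w = torch.randn(V, d, requires_grad=True)            # word vectors w_i
    c = torch.randn(V, d, requires_grad=True)            # context vectors c_j
    a = torch.zeros(V, requires_grad=True)               # word biases a_i
    b = torch.zeros(V, requires_grad=True)               # context biases b_j

    f = torch.clamp(counts / 100.0, max=1.0) ** 0.75     # down-weight rare pairs, cap frequent ones
    pred = w @ c.T + a[:, None] + b[None, :]             # w_i . c_j + a_i + b_j
    loss = (f * (pred - torch.log(counts)) ** 2).sum()   # weighted regression against log counts
    loss.backward()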

Preview: Context-dependent Embeddings

  they hit the balls        they dance at balls

‣ How to handle different word senses? One vector for balls
‣ Train a neural language model to predict the next word given previous words in the sentence, use its internal representations as word vectors
‣ Context-sensitive word embeddings: depend on rest of the sentence
‣ Huge improvements across nearly all NLP tasks over GloVe

Peters et al. (2018)

Evaluation

Evaluating Word Embeddings

‣ What properties of language should word embeddings capture?

[Diagram: an embedding space where good, great, enjoyable cluster together; cat, dog, wolf, tiger cluster together; is and was cluster together; bad sits apart from good]

‣ Similarity: similar words are close to each other
‣ Analogy:
  good is to best as smart is to ???
  Paris is to France as Tokyo is to ???

Similarity

‣ SVD = singular value decomposition on PMI matrix
‣ GloVe does not appear to be the best when experiments are carefully controlled, but it depends on hyperparameters + these distinctions don't matter in practice

Levy et al. (2015)

Hypernymy Detection

‣ Hypernyms: a detective is a person, a dog is an animal
‣ Do word vectors encode these relationships?
‣ word2vec (SGNS) works barely better than random guessing here

Chang et al. (2017)

Analogies

[Diagram: king and queen, man and woman, with roughly parallel offsets between the two pairs]

  (king − man) + woman = queen
  king + (woman − man) = queen

‣ Why would this be?
‣ woman − man captures the difference in the contexts that these occur in
‣ Dominant change: more "he" with man and "she" with woman, similar to the difference between king and queen
‣ These methods can perform well on analogies on two different datasets using two different methods
‣ Maximizing for b2:   Add = cos(b2, a2 − a1 + b1)     Mul = cos(b2, a2) cos(b2, b1) / (cos(b2, a1) + ε)

Levy et al. (2015)
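A sketch of the Add rule on made-up vectors; with real embeddings the argmax runs over the whole vocabulary, excluding the three query words.

    import numpy as np

    rng = np.random.default_rng(0)
    emb = {w: rng.standard_normal(50) for w in ["king", "man", "woman", "queen", "dog"]}

    def cos(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    target = emb["king"] - emb["man"] + emb["woman"]             # (king - man) + woman
    best = max((w for w in emb if w not in {"king", "man", "woman"}),
               key=lambda w: cos(emb[w], target))                # Add = cos(b2, a2 - a1 + b1)
    print(best)   # with trained vectors this comes out as "queen"; here the vectors are random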

Using Semantic Knowledge

‣ Structure derived from a resource like WordNet

[Diagram: the original vector for "false" and the adapted vector for "false" after incorporating the resource]

‣ Doesn't help most problems

Faruqui et al. (2015)

Using Word Embeddings

‣ Approach 1: learn embeddings as parameters from your data
  ‣ Often works pretty well
‣ Approach 2: initialize using GloVe/ELMo, keep fixed
  ‣ Faster because no need to update these parameters
‣ Approach 3: initialize using GloVe, fine-tune
  ‣ Works best for some tasks, but not used for ELMo (all three approaches are sketched in code below)
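A sketch of the three approaches in PyTorch; the pretrained matrix below is random, standing in for loaded GloVe vectors.

    import torch
    import torch.nn as nn

    pretrained = torch.randn(10_000, 300)                                 # stand-in for a GloVe matrix

    emb_scratch = nn.Embedding(10_000, 300)                               # Approach 1: learn from your data
    emb_fixed = nn.Embedding.from_pretrained(pretrained, freeze=True)     # Approach 2: keep fixed
    emb_tuned = nn.Embedding.from_pretrained(pretrained, freeze=False)    # Approach 3: fine-tune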

Compositional Semantics

‣ What if we want embedding representations for whole sentences?
‣ Skip-thought vectors (Kiros et al., 2015), similar to skip-gram generalized to a sentence level (more later)
‣ Is there a way we can compose vectors to make sentence representations? Summing? (see the sketch below)
‣ Will return to this in a few weeks as we move on to syntax and semantics
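The simplest composition the slide alludes to, as a sketch: sum or average the word vectors of the sentence (vocabulary and dimension are placeholders).

    import torch
    import torch.nn as nn

    vocab = {"the": 0, "movie": 1, "was": 2, "great": 3}
    emb = nn.Embedding(len(vocab), 50)
    ids = torch.tensor([vocab[w] for w in "the movie was great".split()])
    sentence_vec = emb(ids).mean(dim=0)                  # one crude 50-dim sentence representation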

Takeaways

‣ Lots to tune with neural networks
  ‣ Training: optimizer, initializer, regularization (dropout), …
  ‣ Hyperparameters: dimensionality of word embeddings, layers, …
‣ Word vectors: learning word -> context mappings has given way to matrix factorization approaches (constant in dataset size)
‣ Lots of pretrained embeddings work well in practice, they capture some desirable properties
‣ Even better: context-sensitive word embeddings (ELMo)
‣ Next time: RNNs and CNNs
