Lecture 7: Tricks + Word Embeddings (aritter.github.io/courses/5525_slides_v2/lec7-nn2.pdf)


Lecture 7: Tricks + Word Embeddings

Alan Ritter (many slides from Greg Durrett)

Recall: Feedforward NNs

[Diagram: f(x) (n features) → V (d x n matrix) → z (d hidden units) → g (nonlinearity: tanh, relu, …) → W (num_classes x d matrix) → softmax → P(y|x) (num_classes probs)]

P(y|x) = softmax(W g(V f(x)))
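Below is a minimal PyTorch sketch of this architecture; the sizes n, d, and num_classes are placeholder values, not anything from the slides.

    import torch
    import torch.nn as nn

    n, d, num_classes = 1000, 100, 5                         # placeholder feature / hidden / label sizes

    class FeedforwardNN(nn.Module):
        def __init__(self):
            super().__init__()
            self.V = nn.Linear(n, d, bias=False)             # the d x n matrix V
            self.g = nn.Tanh()                               # nonlinearity (tanh, relu, ...)
            self.W = nn.Linear(d, num_classes, bias=False)   # the num_classes x d matrix W

        def forward(self, fx):
            z = self.V(fx)                                   # d hidden units
            return torch.softmax(self.W(self.g(z)), dim=-1)  # P(y|x), num_classes probs

    model = FeedforwardNN()
    probs = model(torch.randn(1, n))                         # 1 x num_classes probabilities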

Recall: Backpropagation

[Same network: f(x) → V → z → g → W → softmax → P(y|x), with P(y|x) = softmax(W g(V f(x)))]

‣ ∂L/∂W is computed from err(root) and z
‣ err(z) is then passed back through ∂z/∂V, together with f(x), to get the gradient for V
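A self-contained sketch of the same computation with explicit tensors, showing that calling backward() fills in ∂L/∂W and ∂L/∂V; the sizes and the gold label are placeholders.

    import torch

    V = torch.randn(100, 1000, requires_grad=True)    # d x n
    W = torch.randn(5, 100, requires_grad=True)       # num_classes x d
    fx = torch.randn(1000)                             # f(x)

    z = V @ fx                                         # hidden layer z
    log_probs = torch.log_softmax(W @ torch.tanh(z), dim=-1)
    loss = -log_probs[2]                               # negative log-likelihood of (placeholder) gold class 2
    loss.backward()                                    # err(root) flows back through W, g, and V
    print(W.grad.shape, V.grad.shape)                  # dL/dW is 5 x 100, dL/dV is 100 x 1000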

This Lecture

‣ Training
‣ Word representations
‣ word2vec/GloVe
‣ Evaluating word embeddings

Training Tips

Training Basics

‣ Basic formula: compute gradients on batch, use first-order optimization method
‣ How to initialize? How to regularize? What optimizer to use?
‣ This lecture: some practical tricks. Take deep learning or optimization courses to understand this further

How does initialization affect learning?

[Same diagram: f(x) (n features) → V (d x n matrix) → z (d hidden units) → g (nonlinearity: tanh, relu, …) → W (m x d matrix) → softmax → P(y|x)]

P(y|x) = softmax(W g(V f(x)))

‣ How do we initialize V and W? What consequences does this have?
‣ Nonconvex problem, so initialization matters!
‣ Nonlinear model… how does this affect things?
‣ If cell activations are too large in absolute value, gradients are small
‣ ReLU: larger dynamic range (all positive numbers), but can produce big values, can break down if everything is too negative
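A quick numerical illustration of the saturation point (the pre-activation values are arbitrary): as a tanh unit's input grows in absolute value, the gradient through it collapses toward zero.

    import torch

    for pre_activation in [0.5, 3.0, 10.0]:
        x = torch.tensor(pre_activation, requires_grad=True)
        torch.tanh(x).backward()
        print(pre_activation, x.grad.item())   # 1 - tanh(x)^2: roughly 0.79, 0.0099, 8e-9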

Initialization

1) Can't use zeroes for parameters to produce hidden layers: all values in that hidden layer are always 0 and have gradients of 0, never change
2) Initialize too large and cells are saturated

‣ Can do random uniform/normal initialization with appropriate scale
‣ Xavier initializer: U[−√(6 / (fan-in + fan-out)), +√(6 / (fan-in + fan-out))] (sketched in code below)
‣ Want variance of inputs and gradients for each layer to be the same
‣ Batch normalization (Ioffe and Szegedy, 2015): periodically shift + rescale each layer to have mean 0 and variance 1 over a batch (useful if net is deep)
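A sketch of the Xavier rule, written both by hand and with PyTorch's built-in helper; the fan-in and fan-out values are placeholders.

    import math
    import torch
    import torch.nn as nn

    fan_in, fan_out = 1000, 100                       # e.g. n features in, d hidden units out
    bound = math.sqrt(6.0 / (fan_in + fan_out))       # the Xavier bound
    V_manual = torch.empty(fan_out, fan_in).uniform_(-bound, bound)

    layer = nn.Linear(fan_in, fan_out)
    nn.init.xavier_uniform_(layer.weight)             # same rule via the built-in initializer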

Dropout

‣ Probabilistically zero out parts of the network during training to prevent overfitting, use whole network at test time
‣ Form of stochastic regularization
‣ Similar to benefits of ensembling: network needs to be robust to missing signals, so it has redundancy
‣ One line in Pytorch/Tensorflow (see the sketch below)

Srivastava et al. (2014)
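The "one line" in PyTorch is nn.Dropout; a sketch (the layer sizes and drop probability are placeholder choices):

    import torch.nn as nn

    net = nn.Sequential(nn.Linear(1000, 100), nn.ReLU(),
                        nn.Dropout(p=0.5),            # zero out ~half the hidden units during training
                        nn.Linear(100, 5))

    net.train()   # dropout active while training
    net.eval()    # dropout disabled at test time: the whole network is used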

Optimizer

‣ Adam (Kingma and Ba, ICLR 2015) is very widely used
‣ Adaptive step size like Adagrad, incorporates momentum
‣ Wilson et al. NIPS 2017: adaptive methods can actually perform badly at test time (in their plots, Adam is in pink, SGD in black)
‣ Check dev set periodically, decrease learning rate if not making progress (sketched below)
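A sketch of that recipe in PyTorch: Adam as the optimizer, with the learning rate cut when a dev-set metric stops improving. The model, learning rate, and dev loss here are placeholders.

    import torch
    import torch.nn as nn

    model = nn.Linear(100, 5)                          # stand-in for a real model
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", factor=0.5, patience=2)

    for epoch in range(10):
        # ... run the training batches, calling optimizer.step() ...
        dev_loss = 1.0 / (epoch + 1)                   # placeholder for the real dev-set loss
        scheduler.step(dev_loss)                       # decrease lr if dev loss stops improving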

Structured Prediction

‣ Four elements of a machine learning method:
  ‣ Model: feedforward, RNNs, CNNs can be defined in a uniform framework
  ‣ Objective: many loss functions look similar; it just changes the last layer of the neural network
  ‣ Inference: define the network, your library of choice takes care of it (mostly…)
  ‣ Training: lots of choices for optimization/hyperparameters

Word Representations

‣ Neural networks work very well at continuous data, but words are discrete
‣ Continuous model <-> expects continuous semantics from input
‣ "You shall know a word by the company it keeps" (Firth, 1957)

slide credit: Dan Klein

Discrete Word Representations

‣ Brown clusters: hierarchical agglomerative hard clustering (each word has one cluster, not some posterior distribution like in mixture models)

[Diagram: a binary tree over the vocabulary; one subtree holds good, enjoyable, great, another holds is, go, another holds fish, cat, dog, …; each word's cluster is the 0/1 path from the root to its leaf]

‣ Maximize P(wi | wi−1) = P(ci | ci−1) P(wi | ci)
‣ Useful features for tasks like NER, not suitable for NNs

Brown et al. (1992)

Word Embeddings

‣ Part-of-speech tagging with FFNNs
‣ Word embeddings for each word form the input (sketched in code below):

  Fed raises interest rates in order to …
  f(x) = [emb(raises), emb(interest), emb(rates), other words, feats, etc.]
          (previous word, curr word, next word)

‣ What properties should these vectors have?

Botha et al. (2017)
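A sketch of building f(x) for the word "interest" in the example sentence; the vocabulary, embedding size, and indices are made up for illustration.

    import torch
    import torch.nn as nn

    vocab = {"fed": 0, "raises": 1, "interest": 2, "rates": 3, "in": 4, "order": 5, "to": 6}
    emb = nn.Embedding(len(vocab), 50)                  # a 50-dim embedding per vocabulary word

    prev_w, curr_w, next_w = "raises", "interest", "rates"
    ids = torch.tensor([vocab[prev_w], vocab[curr_w], vocab[next_w]])
    f_x = emb(ids).reshape(-1)                          # emb(prev), emb(curr), emb(next) concatenated: 150 dims
    # f_x (plus any other features) would then feed a feedforward tagger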

Word Embeddings

[Diagram: a 2D embedding space in which good, great, and enjoyable sit close together, far from bad; dog and is lie elsewhere]

‣ Want a vector space where similar words have similar embeddings

  the movie was great  ≈  the movie was good

‣ Goal: come up with a way to produce these embeddings

word2vec/GloVe

Continuous Bag-of-Words

‣ Predict word from context

  the dog bit the man

[Diagram: the d-dimensional word embeddings of the context words "the" and "dog" are summed (size d), multiplied by W (size |V| x d), and passed through a softmax; gold label = bit, no manual labeling required!]

  P(w | w−1, w+1) = softmax(W (c(w−1) + c(w+1)))

‣ Parameters: d x |V| word embeddings (one d-length vector per voc word), |V| x d output parameters (W)

Mikolov et al. (2013)
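A minimal CBOW sketch for the single training position shown above (predict "bit" from "dog" and "the"); the toy vocabulary and embedding size are illustrative only.

    import torch
    import torch.nn as nn

    vocab = {"the": 0, "dog": 1, "bit": 2, "man": 3}
    d = 8
    c = nn.Embedding(len(vocab), d)                     # d x |V| input embeddings c(w)
    W = nn.Linear(d, len(vocab), bias=False)            # |V| x d output parameters

    w_prev, w_next, gold = vocab["dog"], vocab["the"], vocab["bit"]
    hidden = c(torch.tensor(w_prev)) + c(torch.tensor(w_next))   # c(w-1) + c(w+1)
    log_probs = torch.log_softmax(W(hidden), dim=-1)              # log P(w | w-1, w+1)
    loss = -log_probs[gold]                             # gold label = bit, no manual labeling required
    loss.backward()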

Skip-Gram

‣ Predict one word of context from word

  the dog bit the man

[Diagram: the embedding of "bit" is multiplied by W and passed through a softmax; gold = dog]

  P(w′ | w) = softmax(W e(w))

‣ Another training example: bit -> the
‣ Parameters: d x |V| vectors, |V| x d output parameters (W) (also usable as vectors!)

Mikolov et al. (2013)
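The corresponding skip-gram sketch (one training pair, predicting "dog" from "bit"); again the vocabulary and sizes are toy values.

    import torch
    import torch.nn as nn

    vocab = {"the": 0, "dog": 1, "bit": 2, "man": 3}
    d = 8
    e = nn.Embedding(len(vocab), d)                     # word vectors e(w)
    W = nn.Linear(d, len(vocab), bias=False)            # |V| x d output parameters

    w, gold_context = vocab["bit"], vocab["dog"]
    log_probs = torch.log_softmax(W(e(torch.tensor(w))), dim=-1)   # log P(w' | w)
    loss = -log_probs[gold_context]
    loss.backward()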

Hierarchical Softmax

‣ Matmul + softmax over |V| is very slow to compute for CBOW and SG

  P(w | w−1, w+1) = softmax(W (c(w−1) + c(w+1)))
  P(w′ | w) = softmax(W e(w))

‣ Standard softmax: [|V| x d] x d
‣ Hierarchical softmax: log(|V|) dot products of size d, |V| x d parameters
  ‣ Huffman encode vocabulary, use binary classifiers to decide which branch to take
  ‣ log(|V|) binary decisions

[Diagram: a binary tree over the vocabulary with words such as "the" and "a" at the leaves]

Mikolov et al. (2013)
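A rough cost comparison; the vocabulary size and dimensionality below are assumed values, not numbers from the slides.

    import math

    V, d = 100_000, 300
    full_softmax_ops = V * d                             # one [|V| x d] x d multiply per prediction
    hier_softmax_ops = math.ceil(math.log2(V)) * d       # ~log(|V|) dot products of size d
    print(full_softmax_ops, hier_softmax_ops)            # 30,000,000 vs. 5,100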

Skip-Gram with Negative Sampling

‣ Take (word, context) pairs and classify them as "real" or not. Create random negative examples by sampling from unigram distribution

  (bit, the)  => +1
  (bit, cat)  => −1
  (bit, a)    => −1
  (bit, fish) => −1

  P(y = 1 | w, c) = e^(w·c) / (e^(w·c) + 1)

  words in similar contexts select for similar c vectors

‣ d x |V| word vectors, d x |V| context vectors (same # of params as before)
‣ Objective = log P(y = 1 | w, c) + (1/k) Σ_{i=1..n} log P(y = 0 | wi, c), where the wi are sampled negatives

Mikolov et al. (2013)
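A sketch of the negative-sampling objective for one real pair and k sampled negatives, using P(y = 1 | w, c) = sigmoid(w·c); the vectors are random placeholders.

    import torch
    import torch.nn.functional as F

    d, k = 8, 4
    w = torch.randn(d, requires_grad=True)               # vector for the word, e.g. "bit"
    c_pos = torch.randn(d, requires_grad=True)           # context vector for the real context, e.g. "the"
    c_neg = torch.randn(k, d, requires_grad=True)        # context vectors for the k sampled negatives

    # log P(y=1 | w, c_pos) + (1/k) * sum_i log P(y=0 | negative_i), with P(y=0) = sigmoid(-w·c)
    obj = F.logsigmoid(w @ c_pos) + F.logsigmoid(-(c_neg @ w)).sum() / k
    (-obj).backward()                                    # ascend the objective by descending its negation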

Connections with Matrix Factorization

‣ Skip-gram model looks at word-word co-occurrences and produces two types of vectors

[Diagram: a |V| x |V| matrix of word pair counts, approximately factored into a |V| x d matrix of word vecs times a d x |V| matrix of context vecs]

‣ Looks almost like a matrix factorization… can we interpret it this way?

Levy et al. (2014)

Skip-Gram as Matrix Factorization

‣ Skip-gram objective exactly corresponds to factoring this |V| x |V| matrix:

  Mij = PMI(wi, cj) − log k        (k = num negative samples)

  PMI(wi, cj) = P(wi, cj) / (P(wi) P(cj)) = (count(wi, cj)/D) / ((count(wi)/D) (count(cj)/D))

‣ … if we sample negative examples from the uniform distribution over words
‣ … and it's a weighted factorization problem (weighted by word freq)

Levy et al. (2014)
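A sketch of building the shifted PMI matrix from a toy count matrix and factoring it with SVD; the counts, vocabulary size, and number of dimensions are placeholders, and SVD here simply stands in for some factorization of M.

    import numpy as np

    rng = np.random.default_rng(0)
    counts = rng.integers(1, 50, size=(6, 6)).astype(float)   # toy |V| x |V| word pair counts
    k = 5                                                      # number of negative samples
    D = counts.sum()

    p_wc = counts / D
    p_w = counts.sum(axis=1, keepdims=True) / D
    p_c = counts.sum(axis=0, keepdims=True) / D
    M = np.log(p_wc / (p_w * p_c)) - np.log(k)                 # M_ij = PMI(w_i, c_j) - log k

    U, S, Vt = np.linalg.svd(M)
    word_vecs = U[:, :3] * np.sqrt(S[:3])                      # 3-dim word vectors, for illustration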

GloVe (Global Vectors)

‣ Also operates on the |V| x |V| matrix of word pair counts: weighted regression on the log co-occurrence matrix

  Loss = Σ_{i,j} f(count(wi, cj)) (wiᵀ cj + ai + bj − log count(wi, cj))²

‣ Constant in the dataset size (just need counts), quadratic in voc size
‣ By far the most common word vectors used today (5000+ citations)

Pennington et al. (2014)
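A sketch of the GloVe loss on the same kind of toy count matrix; the weighting function f, the sizes, and the cap value follow the general shape of the method, but the specific numbers are placeholders.

    import torch

    V, d = 6, 3
    counts = torch.randint(1, 50, (V, V)).float()        # toy word pair counts
    w = torch.randn(V, d, requires_grad=True)            # word vectors w_i
    c = torch.randn(V, d, requires_grad=True)            # context vectors c_j
    a = torch.zeros(V, requires_grad=True)               # word biases a_i
    b = torch.zeros(V, requires_grad=True)               # context biases b_j

    f = torch.clamp(counts / 100.0, max=1.0) ** 0.75     # down-weight rare pairs, cap frequent ones
    pred = w @ c.T + a[:, None] + b[None, :]             # w_i . c_j + a_i + b_j
    loss = (f * (pred - torch.log(counts)) ** 2).sum()   # weighted regression against log counts
    loss.backward()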

Preview: Context-dependent Embeddings

  they hit the balls        they dance at balls

‣ How to handle different word senses? One vector for balls
‣ Train a neural language model to predict the next word given previous words in the sentence, use its internal representations as word vectors
‣ Context-sensitive word embeddings: depend on rest of the sentence
‣ Huge improvements across nearly all NLP tasks over GloVe

Peters et al. (2018)

Evaluation

Evaluating Word Embeddings

‣ What properties of language should word embeddings capture?

[Diagram: an embedding space where good, great, enjoyable cluster together; cat, dog, wolf, tiger cluster together; is and was cluster together; bad sits apart from good]

‣ Similarity: similar words are close to each other
‣ Analogy:
  good is to best as smart is to ???
  Paris is to France as Tokyo is to ???

Similarity

‣ SVD = singular value decomposition on PMI matrix
‣ GloVe does not appear to be the best when experiments are carefully controlled, but it depends on hyperparameters + these distinctions don't matter in practice

Levy et al. (2015)

Hypernymy Detection

‣ Hypernyms: a detective is a person, a dog is an animal
‣ Do word vectors encode these relationships?
‣ word2vec (SGNS) works barely better than random guessing here

Chang et al. (2017)

Analogies

[Diagram: king and queen, man and woman, with roughly parallel offsets between the two pairs]

  (king − man) + woman = queen
  king + (woman − man) = queen

‣ Why would this be?
‣ woman − man captures the difference in the contexts that these occur in
‣ Dominant change: more "he" with man and "she" with woman, similar to the difference between king and queen
‣ These methods can perform well on analogies on two different datasets using two different methods
‣ Maximizing for b2:   Add = cos(b2, a2 − a1 + b1)     Mul = cos(b2, a2) cos(b2, b1) / (cos(b2, a1) + ε)

Levy et al. (2015)
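A sketch of the Add rule on made-up vectors; with real embeddings the argmax runs over the whole vocabulary, excluding the three query words.

    import numpy as np

    rng = np.random.default_rng(0)
    emb = {w: rng.standard_normal(50) for w in ["king", "man", "woman", "queen", "dog"]}

    def cos(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    target = emb["king"] - emb["man"] + emb["woman"]             # (king - man) + woman
    best = max((w for w in emb if w not in {"king", "man", "woman"}),
               key=lambda w: cos(emb[w], target))                # Add = cos(b2, a2 - a1 + b1)
    print(best)   # with trained vectors this comes out as "queen"; here the vectors are random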

Using Semantic Knowledge

‣ Structure derived from a resource like WordNet

[Diagram: the original vector for "false" and the adapted vector for "false" after incorporating the resource]

‣ Doesn't help most problems

Faruqui et al. (2015)

Using Word Embeddings

‣ Approach 1: learn embeddings as parameters from your data
  ‣ Often works pretty well
‣ Approach 2: initialize using GloVe/ELMo, keep fixed
  ‣ Faster because no need to update these parameters
‣ Approach 3: initialize using GloVe, fine-tune
  ‣ Works best for some tasks, but not used for ELMo (all three approaches are sketched in code below)
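A sketch of the three approaches in PyTorch; the pretrained matrix below is random, standing in for loaded GloVe vectors.

    import torch
    import torch.nn as nn

    pretrained = torch.randn(10_000, 300)                                 # stand-in for a GloVe matrix

    emb_scratch = nn.Embedding(10_000, 300)                               # Approach 1: learn from your data
    emb_fixed = nn.Embedding.from_pretrained(pretrained, freeze=True)     # Approach 2: keep fixed
    emb_tuned = nn.Embedding.from_pretrained(pretrained, freeze=False)    # Approach 3: fine-tune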

Compositional Semantics

‣ What if we want embedding representations for whole sentences?
‣ Skip-thought vectors (Kiros et al., 2015), similar to skip-gram generalized to a sentence level (more later)
‣ Is there a way we can compose vectors to make sentence representations? Summing? (see the sketch below)
‣ Will return to this in a few weeks as we move on to syntax and semantics
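The simplest composition the slide alludes to, as a sketch: sum or average the word vectors of the sentence (vocabulary and dimension are placeholders).

    import torch
    import torch.nn as nn

    vocab = {"the": 0, "movie": 1, "was": 2, "great": 3}
    emb = nn.Embedding(len(vocab), 50)
    ids = torch.tensor([vocab[w] for w in "the movie was great".split()])
    sentence_vec = emb(ids).mean(dim=0)                  # one crude 50-dim sentence representation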

Takeaways

‣ Lots to tune with neural networks
  ‣ Training: optimizer, initializer, regularization (dropout), …
  ‣ Hyperparameters: dimensionality of word embeddings, layers, …
‣ Word vectors: learning word -> context mappings has given way to matrix factorization approaches (constant in dataset size)
‣ Lots of pretrained embeddings work well in practice, they capture some desirable properties
‣ Even better: context-sensitive word embeddings (ELMo)
‣ Next time: RNNs and CNNs
