
Page 1: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

Tutorial on neural probabilistic language models

Piotr Mirowski, Microsoft Bing London
South England NLP Meetup @ UCL

April 30, 2014

Page 2: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

2

Acknowledgements
• AT&T Labs Research
o Srinivas Bangalore
o Suhrid Balakrishnan
o Sumit Chopra (now at Facebook)
• New York University
o Yann LeCun (now at Facebook)
• Microsoft
o Abhishek Arun

Page 3: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

3

About the presenter
• NYU (2005-2010)
o Deep learning for time series
• Epileptic seizure prediction
• Gene regulation networks
• Text categorization of online news
• Statistical language models
• Bell Labs (2011-2013)
o WiFi-based indoor geolocation
o SLAM and robotics
o Load forecasting in smart grids
• Microsoft Bing (2013-)
o AutoSuggest (Query Formulation)

Page 4: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

Objective of this tutorial
Understand deep learning approaches to distributional semantics:
word embeddings and continuous space language models

Page 5: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

5

Outline
• Probabilistic Language Models (LMs)
o Likelihood of a sentence and LM perplexity
o Limitations of n-grams
• Neural Probabilistic LMs
o Vector-space representation of words
o Neural probabilistic language model
o Log-Bilinear (LBL) LMs (loss function maximization)
• Long-range dependencies
o Enhancing LBL with linguistic features
o Recurrent Neural Networks (RNN)
• Applications
o Speech recognition and machine translation
o Sentence completion and linguistic regularities
• Bag-of-word-vector approaches
o Auto-encoders for text
o Continuous bag-of-words and skip-gram models
• Scalability with large vocabularies
o Tree-structured LMs
o Noise-contrastive estimation

Page 6: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

6

Outline
• Probabilistic Language Models (LMs)
o Likelihood of a sentence and LM perplexity
o Limitations of n-grams
• Neural Probabilistic LMs
o Vector-space representation of words
o Neural probabilistic language model
o Log-Bilinear (LBL) LMs
• Long-range dependencies
o Enhancing LBL with linguistic features
o Recurrent Neural Networks (RNN)
• Applications
o Speech recognition and machine translation
o Sentence completion and linguistic regularities
• Bag-of-word-vector approaches
o Auto-encoders for text
o Continuous bag-of-words and skip-gram models
• Scalability with large vocabularies
o Tree-structured LMs
o Noise-contrastive estimation

Page 7: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

7

Probabilistic Language Models

• Probability of a sequence of words: $P(W) = P(w_1, w_2, \ldots, w_t, \ldots, w_T)$

• Conditional probability of an upcoming word: $P(w_T \mid w_1, w_2, \ldots, w_{T-1})$

• Chain rule of probability: $P(w_1, w_2, \ldots, w_T) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1, w_2) \cdots P(w_T \mid w_1, \ldots, w_{T-1}) = \prod_{t=1}^{T} P(w_t \mid w_1, \ldots, w_{t-1})$

• (n-1)th order Markov assumption: $P(w_1, w_2, \ldots, w_T) \approx \prod_{t=1}^{T} P(w_t \mid w_{t-n+1}, \ldots, w_{t-1})$
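A minimal MATLAB sketch of this factorization (not from the original slides): it scores a sentence under the (n-1)th order Markov assumption, assuming a hypothetical function handle cond_prob that returns P(w_t | history).

function logp = sentence_loglik(words, n, cond_prob)
% Log-likelihood of a word sequence (cell array of strings) under an
% (n-1)th order Markov model; cond_prob(word, history) is assumed to
% return the conditional probability P(w_t | history).
logp = 0;
for t = 1:numel(words)
  history = words(max(1, t-n+1):t-1);   % at most the n-1 previous words
  logp = logp + log(cond_prob(words{t}, history));
end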

Page 8: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

8

Learning probabilistic language models

• Learn the joint likelihood of training sentences under the (n-1)th order Markov assumption using n-grams

• Maximize the log-likelihood:
o Assuming a parametric model θ

• Could we take advantage of higher-order history?

$P(w_1, w_2, \ldots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \ldots, w_{t-1}) \approx \prod_{t=1}^{T} P(w_t \mid \mathbf{w}_{t-n+1}^{t-1})$

word history: $\mathbf{w}_{t-n+1}^{t-1} = w_{t-n+1}, w_{t-n+2}, \ldots, w_{t-1}$; target word: $w_t$

Log-likelihood: $\sum_{t=1}^{T} \log P(w_t \mid \mathbf{w}_{t-n+1}^{t-1}, \theta)$

Page 9: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

9

Evaluating language models: perplexity

• How well can we predict next word?

o A random predictor would give each word probability 1/V, where V is the size of the vocabulary

o A better model of a text should assign a higher probability to the word that actually occurs

• Perplexity:

Slide courtesy of Abhishek Arun

I always order pizza with cheese and ____

The 33rd President of the US was ____

I saw a ____

mushrooms 0.1

pepperoni 0.1

anchovies 0.01

….

fried rice 0.0001

….

and 1e-100

$ppx = \exp\Big(-\frac{1}{T} \sum_{t=1}^{T} \log P(w_t \mid \mathbf{w}_1^{t-1})\Big)$
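A minimal MATLAB sketch of this formula (not from the original slides), where p is the vector of conditional probabilities P(w_t | w_1^{t-1}) assigned by the model to the T words of a test text:

function ppx = perplexity(p)
% Perplexity = exp( -1/T * sum_t log P(w_t | history) )
ppx = exp(-mean(log(p)));

Example: a model that assigned probability 1/V to every word would have perplexity V; perplexity(0.1 * ones(1000, 1)) returns 10.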

Page 10: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

10

Limitations of n-grams
• Conditional likelihood of seeing a sub-sequence of length n in the available training data

• Limitation: discrete model (each word is a token)
o Incomplete coverage of the training dataset: a vocabulary of size V words means V^n possible n-grams (exponential in n)

o Semantic similarity between word tokens is not exploited

the cat sat on the mat: $P(w_t = \text{mat} \mid \mathbf{w}_{t-5}^{t-1}) = 0.15$
the cat sat on the sat, the cat sat on the hat: $P(w_t \mid \mathbf{w}_{t-5}^{t-1}) = 0.05$ and $0$
my cat sat on the mat: $P(w_t = \text{mat} \mid \mathbf{w}_{t-5}^{t-1}) = ?$
the cat sat on the rug: $P(w_t = \text{rug} \mid \mathbf{w}_{t-5}^{t-1}) = ?$

Page 11: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

11

Workarounds for n-grams

• Smoothing
o Adding a non-zero offset to the probabilities of unseen words
o Example: Kneser-Ney smoothing

• Back-off
o No such trigram? Try bigrams…
o No such bigram? Try unigrams…

• Interpolation
o Mix unigram, bigram, trigram, etc.

[Katz, 1987; Chen & Goodman, 1996; Stolcke, 2002]
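As an illustrative MATLAB sketch of the interpolation idea only (Kneser-Ney and Katz back-off use more refined discounting; the names and weights here are assumptions):

function p = interpolated_prob(p_uni, p_bi, p_tri, lambda)
% Linear interpolation of unigram/bigram/trigram maximum-likelihood
% estimates for the same target word; lambda sums to 1.
p = lambda(1) * p_uni + lambda(2) * p_bi + lambda(3) * p_tri;

Example: interpolated_prob(0.001, 0.02, 0, [0.2 0.5 0.3]) still gives a non-zero probability to an unseen trigram.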

Page 12: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

12

Outline• Probabilistic Language Models (LMs)

o Likelihood of a sentence and LM perplexityo Limitations of n-grams

• Neural Probabilistic LMso Vector-space representation of wordso Neural probabilistic language modelo Log-Bilinear (LBL) LMs

• Long-range dependencieso Enhancing LBL with linguistic featureso Recurrent Neural Networks (RNN)

• Applicationso Speech recognition and machine translationo Sentence completion and linguistic regularities

• Bag-of-word-vector approacheso Auto-encoders for texto Continuous bag-of-words and skip-gram models

• Scalability with large vocabularieso Tree-structured LMso Noise-contrastive estimation

Page 13: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

13

Continuous Space Language Models

• Word tokens mapped to vectors in a low-dimensional space

• Conditional word probabilities replaced by normalized dynamical models on vectors of word embeddings

• Vector-space representation enables semantic/syntactic similarity between words/sentences (see the sketch below)
o Use cosine similarity as semantic word similarity
o Find nearest neighbours: synonyms, antonyms
o Algebra on words: {king} – {man} + {woman} = {queen}?
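A minimal MATLAB sketch of these operations (not from the original slides), assuming R is a D x V matrix of word embeddings and vocab a 1 x V cell array of word strings:

function [words, sims] = nearest_words(R, vocab, query_vec, k)
% k nearest neighbours of a query vector under cosine similarity
Rn = bsxfun(@rdivide, R, sqrt(sum(R.^2, 1)));  % L2-normalize each column
q  = query_vec / norm(query_vec);
s  = q' * Rn;                                  % 1 x V cosine similarities
[sims, idx] = sort(s, 'descend');
words = vocab(idx(1:k));
sims  = sims(1:k);

Word algebra is then a nearest-neighbour query on q = R(:, i_king) - R(:, i_man) + R(:, i_woman).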

Page 14: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

14

Vector-space representation of words

• w_t: "one-hot" or "one-of-V" representation of a word token at position t in the text corpus, with a vocabulary of size V

• z_v: vector-space representation of any word v in the vocabulary, using a vector of dimension D (also called a distributed representation)

• z_{t-n+1}^{t-1}: vector-space representation of the t-th word history, e.g., the concatenation of the n-1 vectors of size D for words w_{t-n+1}, ..., w_{t-1}

• ẑ_t: vector-space representation of the prediction of target word w_t (we predict a vector of size D)

Page 15: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

15

Learning continuous space language models

• Input:
o word history (one-hot or distributed representation)

• Output:
o target word (one-hot or distributed representation)

• Function that approximates word likelihood:
o Linear transform
o Feed-forward neural network
o Recurrent neural network
o Continuous bag-of-words
o Skip-gram
o …

Page 16: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

16

Learning continuous space language models

• How do we learn the word representations z for each word in the vocabulary?

• How do we learn the model that predicts the next word or its representation ẑt, given a word history?

• Simultaneous learning of model and representation

Page 17: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

17

Vector-space representation of words

• Compare two words using vector representations:
o Dot product
o Cosine similarity
o Euclidean distance

• Bi-Linear scoring function at position t:
$s_\theta(\mathbf{w}_{t-n+1}^{t-1}, v) = s(\hat{\mathbf{z}}_t, \mathbf{z}_v; \theta) = \hat{\mathbf{z}}_t^\top \mathbf{z}_v + b_v$
o Parametric model θ predicts the next word
o Bias b_v for word v, related to the unigram probability of word v
o Given a predicted vector ẑt, the actual predicted word is the 1-nearest neighbour of ẑt
o Exhaustive search in large vocabularies (V in millions) can be computationally expensive…

[Mnih & Hinton, 2007]

Page 18: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

18

Word probabilities from vector-space representation

• Bi-Linear scoring function at position t:
$s_\theta(\mathbf{w}_{t-n+1}^{t-1}, v) = s(\hat{\mathbf{z}}_t, \mathbf{z}_v; \theta) = \hat{\mathbf{z}}_t^\top \mathbf{z}_v + b_v$
o Parametric model θ predicts the next word
o Bias b_v for word v, related to the unigram probability of word v
o Given a predicted vector ẑt, the actual predicted word is the 1-nearest neighbour of ẑt
o Exhaustive search in large vocabularies (V in millions) can be computationally expensive…

• Normalized probability:
o Using the softmax function
$P(w_t = v \mid \mathbf{w}_{t-n+1}^{t-1}) = \dfrac{e^{s(\hat{\mathbf{z}}_t, \mathbf{z}_v)}}{\sum_{v'=1}^{V} e^{s(\hat{\mathbf{z}}_t, \mathbf{z}_{v'})}}$

[Mnih & Hinton, 2007]

Page 19: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

19

Loss function

• Log-likelihood model:
o Numerically more stable
$\log P(w_1, w_2, \ldots, w_T) = \sum_{t=1}^{T} \log P(w_t \mid \mathbf{w}_1^{t-1}) \approx \sum_{t=1}^{T} \log P(w_t \mid \mathbf{w}_{t-n+1}^{t-1})$

• Loss function to maximize:
o Log-likelihood
$L_t = \log P(w_t \mid \mathbf{w}_{t-n+1}^{t-1}, \theta) = s_\theta(w_t) - \log \sum_{v=1}^{V} e^{s_\theta(v)}, \qquad P(w_t \mid \mathbf{w}_{t-n+1}^{t-1}, \theta) = \dfrac{e^{s_\theta(w_t)}}{\sum_{v=1}^{V} e^{s_\theta(v)}}$
o In general, the loss is defined as: the score of the right answer minus a normalization term (the log of the partition function)
o The normalization term is expensive to compute
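A minimal MATLAB sketch of this loss for one position t (not from the original slides), using the usual max-subtraction trick so the exponentials stay numerically stable; s is the V x 1 vector of scores and w the index of the observed word:

function L = loglik_loss(s, w)
% L_t = s(w_t) - log sum_v exp(s(v))
m = max(s);                          % stabilizes the exponentials
logZ = m + log(sum(exp(s - m)));     % log of the normalization term
L = s(w) - logZ;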

Page 20: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

20

Neural Probabilistic Language Model

[Figure: feed-forward NPLM. The words w_{t-5} … w_{t-1} of "the cat sat on the" live in the discrete word space {1, ..., V} (V = 18k words); each is mapped by the embedding matrix R to a word embedding z_{t-5} … z_{t-1} in ℝ^D (D = 30); the concatenated history feeds a neural network (weights A, 100 hidden units h, weights B) with V output units followed by a softmax that predicts w_t ("mat").]

function z_hist = Embedding_FProp(model, w)
% Get the embeddings for all words in w
z_hist = model.R(:, w);
z_hist = reshape(z_hist, length(w)*model.dim_z, 1);

[Bengio et al, 2001, 2003; Schwenk et al, “Connectionist language modelling for large vocabulary continuous speech recognition”, ICASSP 2002]


Page 21: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

21

Neural Probabilistic Language Model

[Figure: same feed-forward NPLM architecture as on the previous slide.]

$\mathbf{h} = \tanh(\mathbf{A}\,\mathbf{z}_{t-n+1}^{t-1} + \mathbf{b}_a)$
$\mathbf{s} = \mathbf{B}\,\mathbf{h} + \mathbf{b}_b$

function s = NeuralNet_FProp(model, z_hist)
% One hidden layer neural network
o = model.A * z_hist + model.bias_a;
h = tanh(o);
s = model.B * h + model.bias_b;

[Bengio et al, 2001, 2003; Schwenk et al, “Connectionist language modelling for large vocabulary continuous speech recognition”, ICASSP 2002]


Page 22: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

22

Neural Probabilistic Language Model

[Figure: same feed-forward NPLM architecture as on the previous slide.]

[Bengio et al, 2001, 2003; Schwenk et al, “Connectionist language modelling for large vocabulary continuous speech recognition”, ICASSP 2002]

$P(w_t \mid \mathbf{w}_{t-n+1}^{t-1}, \theta) = \dfrac{e^{s_{w_t}}}{\sum_{v} e^{s_v}}$

function p = Softmax_FProp(s)
% Probability estimation
p_num = exp(s);
p = p_num / sum(p_num);


Page 23: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

23

Neural Probabilistic Language Model

[Figure: same feed-forward NPLM architecture as on the previous slides.]

$\mathbf{h} = \tanh(\mathbf{A}\,\mathbf{z}_{t-n+1}^{t-1} + \mathbf{b}_a)$
$\mathbf{s} = \mathbf{B}\,\mathbf{h} + \mathbf{b}_b$

Outperforms the best n-grams (class-based Kneser-Ney back-off 5-grams) by 7%

Took months to train (in 2001-2002) on the AP News corpus (14M words)

[Bengio et al, 2001, 2003; Schwenk et al, “Connectionist language modelling for large vocabulary continuous speech recognition”, ICASSP 2002]

$P(w_t \mid \mathbf{w}_{t-n+1}^{t-1}, \theta) = \dfrac{e^{s_{w_t}}}{\sum_{v} e^{s_v}}$

Complexity: (n-1)×D + (n-1)×D×H + H×V

Page 24: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

24

Log-Bilinear Language Model

[Figure: Log-Bilinear LM. The words w_{t-5} … w_{t-1} ("the cat sat on the") are mapped by R to embeddings z_{t-5} … z_{t-1} in ℝ^D (D = 100, V = 18k words); a linear transform C predicts ẑ_t, which is scored against the embedding z_t of the target word w_t ("mat").]

[Mnih & Hinton, 2007]

$\hat{\mathbf{z}}_t = \mathbf{C}\,\mathbf{z}_{t-n+1}^{t-1} + \mathbf{b}_c$

function z_hat = LBL_FProp(model, z_hist)
% Simple linear transform
z_hat = model.C * z_hist + model.bias_c;

Simple matrix multiplication

Page 25: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

25

Log-Bilinear Language Model

[Figure: same Log-Bilinear LM architecture as on the previous slide.]

[Mnih & Hinton, 2007]

$P(w_t \mid \mathbf{w}_{t-n+1}^{t-1}, \theta) = \dfrac{e^{s_{w_t}}}{\sum_{v} e^{s_v}}, \qquad s_\theta(v) = \hat{\mathbf{z}}_t^\top \mathbf{z}_v + b_v$

function s = Score_FProp(z_hat, model)
s = model.R' * z_hat + model.bias_v;

Simple matrix multiplication

Page 26: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

26

Log-Bilinear Language Model

[Figure: same Log-Bilinear LM architecture as on the previous slides (simple matrix multiplications).]

Slightly better than the best n-grams (class-based Kneser-Ney back-off 5-grams). Takes days to train (in 2007) on the AP News corpus (14 million words).

[Mnih & Hinton, 2007]

$\hat{\mathbf{z}}_t = \mathbf{C}\,\mathbf{z}_{t-n+1}^{t-1} + \mathbf{b}_c, \qquad s_\theta(v) = \hat{\mathbf{z}}_t^\top \mathbf{z}_v + b_v, \qquad P(w_t \mid \mathbf{w}_{t-n+1}^{t-1}, \theta) = \dfrac{e^{s_{w_t}}}{\sum_{v} e^{s_v}}$

Complexity: (n-1)×D + (n-1)×D×D + D×V

Page 27: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

27

Nonlinear Log-Bilinear Language Model

[Figure: Nonlinear Log-Bilinear LM: same as the LBL, but ẑ_t is predicted by a neural network (weights A, 200 hidden units h, weights B); V output units followed by a softmax (V = 18k words, D = 100).]

[Mnih & Hinton, Neural Computation, 2009]

$\mathbf{h} = \tanh(\mathbf{A}\,\mathbf{z}_{t-n+1}^{t-1} + \mathbf{b}_a), \qquad \hat{\mathbf{z}}_t = \mathbf{B}\,\mathbf{h} + \mathbf{b}_b$
$s_\theta(v) = \hat{\mathbf{z}}_t^\top \mathbf{z}_v + b_v, \qquad P(w_t \mid \mathbf{w}_{t-n+1}^{t-1}, \theta) = \dfrac{e^{s_{w_t}}}{\sum_{v} e^{s_v}}$

Outperforms the best n-grams (class-based Kneser-Ney back-off 5-grams) by 24%

Took weeks to train (in 2009-2010) on the AP News corpus (14M words)


Complexity: (n-1)×D + (n-1)×D×H + H×D + D×V

Page 28: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

Learning neural language models

• Maximize the log-likelihood of observed data, w.r.t. the parameters θ of the neural language model

• Parameters θ (in a neural language model):
o Word embedding matrix R and biases b_v
o Neural weights: A, b_A, B, b_B

• Gradient descent with learning rate η:

$L_t = \log P(w_t \mid \mathbf{w}_{t-n+1}^{t-1}, \theta) = s_\theta(w_t) - \log \sum_{v=1}^{V} e^{s_\theta(v)}$

$\theta^* = \arg\max_\theta \log P(w_t \mid \mathbf{w}_{t-n+1}^{t-1}, \theta)$

$\theta \leftarrow \theta + \eta \, \dfrac{\partial L_t}{\partial \theta}$
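A minimal MATLAB sketch of the gradient step (not from the original slides; the slides maximize L_t, hence the plus sign), assuming model is a struct of parameters and grads a struct of matching gradients dL_t/dθ:

function model = sgd_step(model, grads, eta)
% theta <- theta + eta * dL_t/dtheta, field by field
fields = fieldnames(grads);
for i = 1:numel(fields)
  f = fields{i};
  model.(f) = model.(f) + eta * grads.(f);
end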

Page 29: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

29

• Maximum Likelihood learning:

o Gradient of log-likelihood w.r.t. parameters θ:

o Use the chain rule of gradients

Maximizing the loss function

$L_t = \log P(w_t \mid \mathbf{w}_{t-n+1}^{t-1}, \theta) = s_\theta(w_t) - \log \sum_{v=1}^{V} e^{s_\theta(v)}, \qquad P(w_t \mid \mathbf{w}_{t-n+1}^{t-1}) = \dfrac{e^{s_\theta(w_t)}}{\sum_{v=1}^{V} e^{s_\theta(v)}}$

$\dfrac{\partial L_t}{\partial \theta} = \dfrac{\partial \log P(w_t \mid \mathbf{w}_{t-n+1}^{t-1})}{\partial \theta} = \dfrac{\partial s_\theta(w_t)}{\partial \theta} - \sum_{v=1}^{V} P(v \mid \mathbf{w}_{t-n+1}^{t-1}) \dfrac{\partial s_\theta(v)}{\partial \theta}$

Page 30: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

30

• Maximum Likelihood learning:

o Gradient of log-likelihood w.r.t. parameters θ:

o Neural net: back-propagate gradient

Maximizing the loss function:

example of LBL

$P(w_t \mid \mathbf{w}_{t-n+1}^{t-1}) = \dfrac{e^{s_\theta(w_t)}}{\sum_{v=1}^{V} e^{s_\theta(v)}}, \qquad s_\theta(v) = \hat{\mathbf{z}}_t^\top \mathbf{z}_v + b_v$

$\dfrac{\partial L_t}{\partial \theta} = \dfrac{\partial s_\theta(w_t)}{\partial \theta} - \sum_{v=1}^{V} P(v \mid \mathbf{w}_{t-n+1}^{t-1}) \dfrac{\partial s_\theta(v)}{\partial \theta}$

function [dL_dz_hat, dL_dR, dL_dbias_v, w] = Loss_BackProp(z_hat, model, p, w)
% Gradient of loss w.r.t. word bias parameter
dL_dbias_v = -p;
dL_dbias_v(w) = 1 - p(w);
% Gradient of loss w.r.t. prediction of (N)LBL model
dL_dz_hat = model.R(:, w) - model.R * p;
% Gradient of loss w.r.t. vocabulary matrix R
dL_dR = -z_hat * p';
dL_dR(:, w) = z_hat * (1 - p(w));

R = (z_v) is the D×V matrix of word embeddings; by the chain rule, $\dfrac{\partial L_t}{\partial \theta} = \dfrac{\partial L_t}{\partial \hat{\mathbf{z}}_t} \cdot \dfrac{\partial \hat{\mathbf{z}}_t}{\partial \theta}$.

Page 31: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

Learning neural language models

Randomly choose a mini-batch (e.g., 1000 consecutive words), then (as sketched below):
1. Forward-propagate through word embeddings and through the model
2. Estimate the word likelihood (loss)
3. Back-propagate the loss
4. Gradient step to update the model
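A MATLAB sketch of this loop (not from the original slides): Embedding_FProp, NeuralNet_FProp and Softmax_FProp follow the snippets shown in these slides, while get_minibatch, backprop_all and the earlier sgd_step sketch are assumed helpers.

function model = train_lm(model, corpus, n, eta, num_epochs, num_batches)
for epoch = 1:num_epochs
  for b = 1:num_batches
    batch = get_minibatch(corpus, b);          % e.g., 1000 consecutive words
    for t = n:numel(batch)
      % 1. forward-propagate through word embeddings and through the model
      z_hist = Embedding_FProp(model, batch(t-n+1:t-1));
      s      = NeuralNet_FProp(model, z_hist);
      % 2. estimate word likelihood (loss) via the softmax
      p      = Softmax_FProp(s);
      % 3. back-propagate the loss; 4. gradient step on the parameters
      grads  = backprop_all(model, z_hist, p, batch(t));
      model  = sgd_step(model, grads, eta);
    end
  end
end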

Page 32: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

32

Nonlinear Log-Bilinear Language Model

[Figure: same nonlinear LBL architecture as before (V = 18k words, D = 100, 200 hidden units).]

[Mnih & Hinton, Neural Computation, 2009]

FProp:
1. Look up the embeddings of the words in the n-gram using R
2. Forward-propagate through the neural net
3. Look up ALL vocabulary words using R and compute energies and probabilities (computationally expensive)


Page 33: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

33

Nonlinear Log-Bilinear Language Model

[Figure: same nonlinear LBL architecture as before (V = 18k words, D = 100, 200 hidden units).]

[Mnih & Hinton, Neural Computation, 2009]

BackProp:
1. Compute the gradients of the loss w.r.t. the output of the neural net, back-propagate through neural net layers B and A (computationally expensive)
2. Back-propagate further down to the word embeddings R
3. Compute the gradients of the loss w.r.t. the words of the whole vocabulary, back-propagate to R


Page 34: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

Stochastic Gradient Descent (SGD)

• Choice of the learning hyperparameters
o Learning rate?
o Learning rate decay?
o Regularization (L2-norm) of the parameters?
o Momentum term on the parameters?

• Use cross-validation on a validation set
o E.g., on AP News (16M words):
• Training set: 14M words
• Validation set: 1M words
• Test set: 1M words

Page 35: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

35

Limitations of these neural language models

• Computationally expensive to train
o Bottleneck: need to evaluate the probability of each word over the entire vocabulary
o Very slow training time (days, weeks)

• Ignores long-range dependencies
o Fixed time windows
o Continuous version of n-grams

Page 36: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

36

Outline
• Probabilistic Language Models (LMs)
o Likelihood of a sentence and LM perplexity
o Limitations of n-grams
• Neural Probabilistic LMs
o Vector-space representation of words
o Neural probabilistic language model
o Log-Bilinear (LBL) LMs
• Long-range dependencies
o Enhancing LBL with linguistic features
o Recurrent Neural Networks (RNN)
• Applications
o Speech recognition and machine translation
o Sentence completion and linguistic regularities
• Bag-of-word-vector approaches
o Auto-encoders for text
o Continuous bag-of-words and skip-gram models
• Scalability with large vocabularies
o Tree-structured LMs
o Noise-contrastive estimation

Page 37: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

37

Adding language features to neural LMs

[Figure: LBL architecture in which each word w_{t-5} … w_{t-1} ("the cat sat on the") is embedded by R (word embedding space ℝ^D) and its POS tag (DT NN VBD IN DT, discrete features in {0,1}^P) is embedded by a feature matrix F (feature embedding space ℝ^F); both embeddings feed the prediction ẑ_t of the target word w_t.]

Additional features can be added as inputs to the neural net / linear prediction function. We tried POS (part-of-speech) tags and super-tags derived from incomplete parsing.

Example: "the cat sat on the" tagged as DT NN VBD IN DT.

[Mirowski, Chopra, Balakrishnan and Bangalore (2010) “Feature-rich continuous language models for speech recognition”, SLT;

Bangalore & Joshi (1999) “Supertagging: an approach to almost parsing”, Computational Linguistics]


Page 38: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

38

Constraining word representations

[Mirowski, Chopra, Balakrishnan and Bangalore (2010) “Feature-rich continuous language models for speech recognition”, SLT;

http://wordnet.princeton.edu]

[Figure: same feature-rich LBL architecture, with the word embedding matrix R additionally tied to a WordNet graph of words.]

Using the WordNet hierarchical similarity between words, we tried to force some words to remain similar to a small set of WordNet neighbours. No significant change in training time or language model performance.


Page 39: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

39

Topic mixtures of language models

[Figure: topic-mixture LBL. K = 5 topic-specific prediction functions (parameters θ_1 … θ_k, i.e., matrices C^(1) … C^(k) and neural layers A^(k), B^(k), h^(k)) are mixed according to the topic proportions f of the current sentence or document; words and POS features are embedded by R and F as before.]

We pre-computed the unsupervised topic model representation of each sentence in training using LDA (Latent Dirichlet Allocation) [Blei et al, 2003] with 5 topics. On test data, the topic is estimated using the trained LDA model.

[Mirowski, Chopra, Balakrishnan and Bangalore (2010) “Feature-rich continuous language models for speech recognition”, SLT;

David Blei (2003) "Latent Dirichlet Allocation", JMLR]

Enables modelling of long-range dependencies at the sentence level.

Page 40: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

40

Word embeddings obtained on Reuters

• Example of word embeddings obtained using our language model on the Reuters corpus(1.5 million words, vocabulary V=12k words), vector space of dimension D=100

• For each word, the 10 nearest neighbours in the vector space retrieved using cosine similarity:

[Mirowski, Chopra, Balakrishnan and Bangalore (2010) “Feature-rich continuous language models for speech recognition”, SLT]

Page 41: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

41

Word embeddings obtained on AP News

Example of word embeddings obtained using our LM on AP News(14M words, V=17k), D=100

The word embedding matrix R was projected in 2D by Stochastic t-SNE[Van der Maaten, JMLR 2008]

[Mirowski (2010) “Time series modelling with hidden variables and gradient-based algorithms”, NYU PhD thesis]

Page 42: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

42

Word embeddings obtained on AP News

[Mirowski (2010) “Time series modelling with hidden variables and gradient-based algorithms”, NYU PhD thesis]

Example of word embeddings obtained using our LM on AP News(14M words, V=17k), D=100

The word embedding matrix R was projected in 2D by Stochastic t-SNE[Van der Maaten, JMLR 2008]

Page 43: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

43

Word embeddings obtained on AP News

[Mirowski (2010) “Time series modelling with hidden variables and gradient-based algorithms”, NYU PhD thesis]

Example of word embeddings obtained using our LM on AP News(14M words, V=17k), D=100

The word embedding matrix R was projected in 2D by Stochastic t-SNE[Van der Maaten, JMLR 2008]

Page 44: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

44

Word embeddings obtained on AP News

[Mirowski (2010) “Time series modelling with hidden variables and gradient-based algorithms”, NYU PhD thesis]

Example of word embeddings obtained using our LM on AP News(14M words, V=17k), D=100

The word embedding matrix R was projected in 2D by Stochastic t-SNE[Van der Maaten, JMLR 2008]

Page 45: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

45

Recurrent Neural Net (RNN) language model

[Figure: recurrent neural network LM. The current word (one-hot, discrete word space {1, ..., M}, M > 100k words) is embedded through U into the word embedding space ℝ^D (D = 30 to 250); the hidden state z_t is computed from the time-delayed previous state z_{t-1} (through W) by a 1-layer neural network with D output units; the output layer V followed by a softmax predicts the next word.]

[Mikolov et al, 2010, 2011]

$\mathbf{z}_t = \sigma(\mathbf{W}\,\mathbf{z}_{t-1} + \mathbf{U}\,\mathbf{w}_t), \qquad \sigma(x) = \dfrac{1}{1 + e^{-x}}$

$\mathbf{o} = \mathbf{V}\,\mathbf{z}_t, \qquad P(\hat{w}_t = v \mid \mathbf{w}_{t-n+1}^{t-1}) = \dfrac{e^{o_v}}{\sum_{w} e^{o_w}}$

Handles a longer word history (~10 words), as well as a 10-gram feed-forward NNLM

Training algorithm: BPTT (Back-Propagation Through Time)

Word embedding matrix

Complexity: D×D + D×D + D×V
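A minimal MATLAB sketch of one RNN time step under the notation of this slide (not from the original slides): the hidden state mixes the previous state (through W) and the current word embedding (the column of U selected by the one-hot word), and the output layer V feeds a softmax.

function [p, z_t] = rnn_step(W, U, V, z_prev, w_t)
% w_t is the index of the current word
z_t = 1 ./ (1 + exp(-(W * z_prev + U(:, w_t))));  % sigmoid hidden state
o   = V * z_t;                                    % one score per word
o   = o - max(o);                                 % numerical stability
p   = exp(o) / sum(exp(o));                       % softmax over the vocabulary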

Page 46: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

46

Context-dependent RNN language model

[Figure: same RNN LM architecture (D = 200, M > 100k words), with an additional context vector f (the topic of the sentence or document, K = 40 topics) feeding both the hidden layer (through F) and the output layer (through G).]

[Mikolov & Zweig, 2012]

$\mathbf{z}_t = \sigma(\mathbf{W}\,\mathbf{z}_{t-1} + \mathbf{U}\,\mathbf{w}_t + \mathbf{F}\,\mathbf{f}_t), \qquad \sigma(x) = \dfrac{1}{1 + e^{-x}}$

$\mathbf{o} = \mathbf{V}\,\mathbf{z}_t + \mathbf{G}\,\mathbf{f}_t, \qquad P(\hat{w}_t = v \mid \mathbf{w}_{t-n+1}^{t-1}) = \dfrac{e^{o_v}}{\sum_{w} e^{o_w}}$

Compute the topic model representation word-by-word on the last 50 words, using approximate LDA [Blei et al, 2003] with K topics. Enables modelling of long-range dependencies at the sentence level.

Page 47: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

47

Perplexity of RNN language models

[Mirowski, 2010; Mikolov & Zweig, 2012;RNN toolbox: http://research.microsoft.com/en-us/projects/rnn/default.aspx]

AP News: V=17k vocabulary; train on 14M words, validate on 1M words, test on 1M words
Penn TreeBank: V=10k vocabulary; train on 900k words, validate on 80k words, test on 80k words

Model | Test ppx
Kneser-Ney back-off 5-grams | 123.3
Nonlinear LBL (100d) [Mnih & Hinton, 2009, using our implementation] | 104.4
NLBL (100d) + 5 topics LDA [Mirowski, 2010, using our implementation] | 98.5
RNN (200d) + 40 topics LDA [Mikolov & Zweig, 2012, using RNN toolbox] | 86.9

Page 48: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

48

Outline
• Probabilistic Language Models (LMs)
o Likelihood of a sentence and LM perplexity
o Limitations of n-grams
• Neural Probabilistic LMs
o Vector-space representation of words
o Neural probabilistic language model
o Log-Bilinear (LBL) LMs
• Long-range dependencies
o Enhancing LBL with linguistic features
o Recurrent Neural Networks (RNN)
• Applications
o Speech recognition and machine translation
o Sentence completion and linguistic regularities
• Bag-of-word-vector approaches
o Auto-encoders for text
o Continuous bag-of-words and skip-gram models
• Scalability with large vocabularies
o Tree-structured LMs
o Noise-contrastive estimation

Page 49: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

49

Performance of LBL on speech recognition

#topics | POS  | Word accuracy | Method
-       | -    | 63.7%         | AT&T Watson [Goffin et al, 2005]
-       | -    | 63.5%         | KN 5-grams on 100-best list
-       | -    | 66.6%         | Oracle: best of 100-best list
-       | -    | 57.8%         | Oracle: worst of 100-best list
0       | -    | 64.1%         | Log-Bilinear models with nonlinearity and optional POS tag inputs and LDA topic model mixtures (this row and below)
0       | F=34 | 64.1%         |
0       | F=3  | 64.1%         |
5       | -    | 64.2%         |
5       | F=34 | 64.6%         |
5       | F=3  | 64.6%         |

HUB-4 TV broadcast transcripts; vocabulary V=25k (with proper nouns & numbers); train on 1M words, validate on 50k words, test on 800 sentences.

Re-rank the top 100 candidate sentences provided for each spoken sentence by a speech recognition system (acoustic model + simple trigram).

[Mirowski et al, 2010]

Page 50: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

50

Performance of RNN on machine translation

[Auli et al, 2013]

[Image credits: Auli et al (2013) “Joint Language and Translation Modeling with

Recurrent Neural Networks”, EMNLP]

RNN with 100 hidden nodes
Trained using 20-step BPTT
Uses lattice rescoring
An RNN trained on 2M words already improves over an n-gram trained on 1.15B words

Page 51: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

51

Syntactic and Semantic tests with RNN

[Mikolov, Yih and Zweig, 2013]

Vector offset method: ẑ = z_1 − z_2 + z_3, then pick the vocabulary word z_v with the highest cosine similarity to ẑ.

Observed that word embeddings obtained by RNN-LDA have linguistic regularities: “a” is to “b” as “c” is to __
Syntactic: king is to kings as queen is to queens
Semantic: clothing is to shirt as dish is to bowl

[Image credits: Mikolov et al (2013) “Efficient Estimation of Word Representation

in Vector Space”, arXiv]

Page 52: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

52

Microsoft Research Sentence Completion Task

[Zweig & Burges, 2011; Mikolov et al, 2013a; http://research.microsoft.com/apps/pubs/default.aspx?id=157031 ]

[Image credits: Mikolov et al (2013) “Efficient Estimation of Word Representation

in Vector Space”, arXiv]

1040 sentences with a missing word; 5 choices for each missing word.

A language model trained on 500 novels (Project Gutenberg) provided 30 alternative words for each missing word; judges selected the top 4 impostor words.

Human performance: 90% accuracy

All red-headed men who are above the age of [ 800 | seven | twenty-one | 1,200 | 60,000] years, are eligible.

That is his [ generous | mother’s | successful | favorite | main ] fault, but on the whole he’s a good worker.

Page 53: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

53

Semantic-syntactic word evaluation task

[Mikolov et al, 2013a, 2013b; http://code.google.com/p/word2vec]

[Image credits: Mikolov et al (2013) “Efficient Estimation of Word Representation

in Vector Space”, arXiv]

Page 54: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

54

Semantic-syntactic word evaluation task

[Mikolov et al, 2013a, 2013b; http://code.google.com/p/word2vec]

[Image credits: Mikolov et al (2013) “Efficient Estimation of Word Representation

in Vector Space”, arXiv]

Page 55: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

55

Outline
• Probabilistic Language Models (LMs)
o Likelihood of a sentence and LM perplexity
o Limitations of n-grams
• Neural Probabilistic LMs
o Vector-space representation of words
o Neural probabilistic language model
o Log-Bilinear (LBL) LMs
• Long-range dependencies
o Enhancing LBL with linguistic features
o Recurrent Neural Networks (RNN)
• Applications
o Speech recognition and machine translation
o Sentence completion and linguistic regularities
• Bag-of-word-vector approaches
o Auto-encoders for text
o Continuous bag-of-words and skip-gram models
• Scalability with large vocabularies
o Tree-structured LMs
o Noise-contrastive estimation

Page 56: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

Semantic Hashing

[Hinton & Salakhutdinov, “Reducing the dimensionality of data with neural networks, Science, 2006;Salakhutdinov & Hinton, “Semantic Hashing”, Int J Approx Reason, 2007]

[Figure: deep auto-encoder for semantic hashing, with layer sizes 2000-500-250-125-2-125-250-500-2000.]

Page 57: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

Semi-supervised learning of auto-encoders

• Add a classifier module to the codes
• When an input X(t) has a label Y(t), back-propagate the prediction error on Y(t) to the code Z(t)
• Stack the encoders
• Train layer-wise

[Ranzato & Szummer, “Semi-supervised learning of compact document representations with deep networks”, ICML, 2008;Mirowski, Ranzato & LeCun, “Dynamic auto-encoders for semantic indexing”, NIPS Deep Learning Workshop, 2010]

[Figure: a stack of three auto-encoders (g1/h1, g2/h2, g3/h3) on word histograms x(t); each code z^(1)(t), z^(2)(t), z^(3)(t) feeds a document classifier f1, f2, f3 predicting the label y(t); a random-walk term links the codes of consecutive documents x(t), x(t+1).]

Page 58: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

Semi-supervised learning of auto-encoders

[Ranzato & Szummer, “Semi-supervised learning of compact document representations with deep networks”, ICML, 2008;Mirowski, Ranzato & LeCun, “Dynamic auto-encoders for semantic indexing”, NIPS Deep Learning Workshop, 2010]

Performance on a document retrieval task: Reuters-21k dataset (9.6k training, 4k test), vocabulary of 2k words, 10-class classification.

Comparison with:
• unsupervised techniques (DBN: Semantic Hashing, LSA) + SVM
• traditional technique: word TF-IDF + SVM

Page 59: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

Deep Structured Semantic Models for web search

[Huang, He, Gao, Deng et al, “Learning Deep Structured Semantic Models for Web Search using Clickthrough Data”, CIKM, 2013]

[Figure: DSSM architecture. The query s ("racing car") and candidate documents t1 ("formula one") and t2 ("ford model t") each start as a bag-of-words vector (dim = 5M), are hashed by a fixed letter-tri-gram coefficient matrix into a letter-tri-gram representation (dim = 50K), and pass through learned layers W1-W4 (d = 500, 500, 300) to a semantic vector; relevance is the cosine similarity cos(s, t1), cos(s, t2) between semantic vectors.]

Page 60: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

Deep Structured Semantic Models for web search

Semantic hashing [Salakhutdinov & Hinton, 2007] vs. Deep Structured Semantic Model [Huang, He, Gao et al, 2013]

[Huang, He, Gao, Deng et al, “Learning Deep Structured Semantic Models for Web Search using Clickthrough Data”, CIKM, 2013]

Results on a web ranking task (16k queries): normalized discounted cumulative gains

Page 61: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

61

Continuous Bag-of-Words

[Figure: Continuous Bag-of-Words. The context words w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2} ("the cat on the"), from a vocabulary of V > 100k words, are embedded (word embedding space ℝ^D, D = 100 to 300) by U and summed into h, which predicts the middle word w_t ("sat") through W and a softmax.]

[Mikolov et al, 2013a; Mnih & Kavukcuoglu, 2013;http://code.google.com/p/word2vec ]

$\mathbf{h} = \sum_{c} \mathbf{z}_{t+c}$ (simple sum of the context word embeddings, columns of the word embedding matrix U)

$\mathbf{o} = \mathbf{W}\,\mathbf{h}, \qquad P(w_t \mid w_{t-c}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+c}) = \dfrac{e^{o_{w_t}}}{\sum_{v} e^{o_v}}$

Extremely efficient estimation of the word embeddings in matrix U without a Language Model. Can be used as input to a neural LM. Enables much larger datasets, e.g., Google News (6B words, V=1M).

Complexity: 2C×D + D×V

Complexity: 2C×D + D×log(V) (hierarchical softmax using tree factorization)
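A minimal MATLAB sketch of the CBOW forward pass (not from the original slides), with U the D x V input embedding matrix and W the V x D output matrix (names follow the slide loosely):

function p = cbow_forward(U, W, context_idx)
h = sum(U(:, context_idx), 2);   % simple sum of the context embeddings
o = W * h;                       % one score per vocabulary word
o = o - max(o);                  % numerical stability
p = exp(o) / sum(exp(o));        % P(w_t | surrounding context words)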

Page 62: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

62

Skip-gram

[Figure: skip-gram. The input word w_t ("sat"), from a vocabulary of V > 100k words, is embedded by U (word embedding space ℝ^D, D = 100 to 1000) into z_{t,input}, which is used to predict each surrounding context word w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2} ("the cat on the") through output embedding matrices W and a softmax.]

[Mikolov et al, 2013a, 2013b; Mnih & Kavukcuoglu, 2013;http://code.google.com/p/word2vec ]

$P(w_{t+c} \mid w_t) = \dfrac{e^{s(w_{t+c},\, w_t)}}{\sum_{v} e^{s(v,\, w_t)}}, \qquad s_\theta(v, c) = \mathbf{z}_{v,\text{output}}^\top \mathbf{z}_{t,\text{input}}$

(U and W are the word embedding matrices)

Complexity: 2C×D + 2C×D×V

Complexity: 2C×D + 2C×D×log(V) (hierarchical softmax using tree factorization)


Extremely efficient estimation of the word embeddings in matrix U without a Language Model. Can be used as input to a neural LM. Enables much larger datasets, e.g., Google News (33B words, V=1M).

Complexity: 2C×D + 2C×D×(k+1) (negative sampling with k negative examples)
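A minimal MATLAB sketch of the skip-gram probability of one context word (not from the original slides), with U_in and U_out the input and output embedding matrices (illustrative names):

function p = skipgram_context_prob(U_in, U_out, w_t, w_ctx)
s = U_out' * U_in(:, w_t);   % dot-product scores of every word as context of w_t
s = s - max(s);              % numerical stability
p = exp(s) / sum(exp(s));    % softmax: P(context word | w_t)
p = p(w_ctx);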

Page 63: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

63

Vector-space word representation without LM

[Mikolov et al, 2013a, 2013b; http://code.google.com/p/word2vec]

Word and phrase representations learned by skip-gram exhibit a linear structure that enables analogies with vector arithmetic.

This is due to the training objective: the input and the output (before the softmax) are in a linear relationship.

The sum of vectors in the loss function is the sum of log-probabilities (i.e., the log of a product of probabilities), i.e., comparable to the AND function.

[Image credits: Mikolov et al (2013) “Distributed Representations of Words and Phrases and their Compositionality”, NIPS]

Page 64: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

64

Examples of Word2Vec embeddings

Example of word embeddings obtained using Word2Vec on the 3.2B word Wikipedia:
• Vocabulary V=2M
• Continuous vector space D=200
• Trained using CBOW

debt        | aa          | decrease   | met        | slow     | france    | jesus            | xbox
debts       | aaarm       | increase   | meeting    | slower   | marseille | christ           | playstation
repayments  | samavat     | increases  | meet       | fast     | french    | resurrection     | wii
repayment   | obukhovskii | decreased  | meets      | slowing  | nantes    | savior           | xbla
monetary    | emerlec     | greatly    | had        | slows    | vichy     | miscl            | wiiware
payments    | gunss       | decreasing | welcomed   | slowed   | paris     | crucified        | gamecube
repay       | dekhen      | increased  | insisted   | faster   | bordeaux  | god              | nintendo
mortgage    | minizini    | decreases  | acquainted | sluggish | aubagne   | apostles         | kinect
repaid      | bf          | reduces    | satisfied  | quicker  | vend      | apostle          | dsiware
refinancing | mortardepth | reduce     | first      | pace     | vienne    | bickertonite     | eshop
bailouts    | ee          | increasing | persuaded  | slowly   | toulouse  | pretribulational | dreamcast

[Mikolov et al, 2013a, 2013b; http://code.google.com/p/word2vec]

Page 65: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

65

Performance on the semantic-syntactic task

[Mikolov et al, 2013a, 2013b; http://code.google.com/p/word2vec]

Word and phrase representations learned by skip-gram exhibit a linear structure that enables analogies with vector arithmetic. Due to the training objective, the input and the output (before the softmax) are in a linear relationship. The sum of vectors is like a sum of log-probabilities, i.e., the log of a product of probabilities, i.e., an AND function.

[Image credits: Mikolov et al (2013) “Efficient Estimation of Word Representation

in Vector Space”, arXiv]

[Image credits: Mikolov et al (2013) “Distributed Representations of Words and Phrases and their Compositionality”, NIPS]

Page 66: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

66

Outline
• Probabilistic Language Models (LMs)
o Likelihood of a sentence and LM perplexity
o Limitations of n-grams
• Neural Probabilistic LMs
o Vector-space representation of words
o Neural probabilistic language model
o Log-Bilinear (LBL) LMs
• Long-range dependencies
o Enhancing LBL with linguistic features
o Recurrent Neural Networks (RNN)
• Applications
o Speech recognition and machine translation
o Sentence completion and linguistic regularities
• Bag-of-word-vector approaches
o Auto-encoders for text
o Continuous bag-of-words and skip-gram models
• Scalability with large vocabularies
o Tree-structured LMs
o Noise-contrastive estimation

Page 67: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

67

Computational bottleneck of large vocabularies

• Bulk of computation at the word prediction and at the input word embedding layers

• Large vocabularies:
o AP News (14M words; V=17k)
o HUB-4 (1M words; V=25k)
o Google News (6B words, V=1M)
o Wikipedia (3.2B words, V=2M)

• Strategies to compress the output softmax

scoring function: $s_v(t) = s(\mathbf{w}_1^{t-1}, v)$; target word: $w_t$; word history: $\mathbf{w}_1^{t-1}$

$P(w_t = v \mid \mathbf{w}_1^{t-1}) = g\big(s_v(t)\big), \qquad \text{softmax: } g(s_v) = \dfrac{e^{s_v}}{\sum_{v'=1}^{V} e^{s_{v'}}}$

Page 68: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

68

Reducing the bottleneck of large vocabularies

• Replace rare words and numbers by an <unk> token

• Subsample frequent words during training
o Speed-up of 2x to 10x
o Better accuracy for rare words

• Hierarchical Softmax (HS)
• Noise-Contrastive Estimation (NCE) and Negative Sampling (NS)

[Morin & Bengio, 2005, Mikolov et al, 2011, 2013b; Mnih & Teh 2012, Mnih & Kavukcuoglu, 2013]

Page 69: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

69

Hierarchical softmax by grouping words

[Mikolov et al, 2011, Auli et al, 2013]

• Group words into disjoint classes:
o E.g., 20 classes with frequency binning
o Use the unigram frequency
o Top 5% of words (“the”) go to class 1
o The following 5% of words go to class 2, and so on

• Factorize the word probability into:
o Class probability
o Class-conditional word probability

$P(w_t = v \mid \mathbf{w}_1^{t-1}) = P(c_t \mid \mathbf{w}_1^{t-1}) \cdot P(w_t = v \mid c_t, \mathbf{w}_1^{t-1}) = g\big(s_\theta(c)\big)\, g\big(s_\theta(c, v)\big)$

with scoring function $s_\theta(v) = s(\mathbf{w}_1^{t-1}, v; \theta)$ and softmax $g(s_v) = \dfrac{e^{s_v}}{\sum_{v'} e^{s_{v'}}}$, instead of the flat softmax $P(w_t = v \mid \mathbf{w}_1^{t-1}) = g(s_\theta(v))$.

• Speed-up factor:
o O(|V|) to O(|C| + max|V_C|)
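A minimal MATLAB sketch of the two-level factorization (not from the original slides): s_class holds the |C| class scores and s_word the scores of the words inside class c only, both assumed to be produced by the model for the current history.

function p = class_factorized_prob(s_class, s_word, c, j)
% P(w | history) = P(class c | history) * P(word j | class c, history)
pc = exp(s_class - max(s_class)); pc = pc / sum(pc);
pw = exp(s_word  - max(s_word));  pw = pw / sum(pw);
p  = pc(c) * pw(j);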

Page 70: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

70

Hierarchical softmax by grouping words

[Mikolov et al, 2011, Auli et al, 2013]

$P(w_t = v \mid \mathbf{w}_1^{t-1}) = P(c_t \mid \mathbf{w}_1^{t-1}) \cdot P(w_t = v \mid c_t, \mathbf{w}_1^{t-1}) = g\big(s_\theta(c)\big)\, g\big(s_\theta(c, v)\big)$

with scoring function $s_\theta(v) = s(\mathbf{w}_1^{t-1}, v; \theta)$ and softmax $g(s_v) = \dfrac{e^{s_v}}{\sum_{v'} e^{s_{v'}}}$, instead of the flat softmax $P(w_t = v \mid \mathbf{w}_1^{t-1}) = g(s_\theta(v))$.

[Image credits: Mikolov et al (2011) “Extensions of Recurrent Neural Network Language Model”, ICASSP]

Page 71: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

71

Hierarchical softmax using WordNet

[Image credits: Morin & Bengio (2005) “Hierarchical Probabilistic Neural Network Language Model”, AISTATS]

[Morin & Bengio, 2005; http://wordnet.princeton.edu]

• Use WordNet to extract IS-A relationships
o Manually select one parent per child
o In the case of multiple children, cluster them to obtain a binary tree

• Hard to design
• Hard to adapt to other languages

Page 72: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

72

Hierarchical softmax using Huffman trees

[Image credits: Wikipedia, Wikimedia Commons http://en.wikipedia.org/wiki/File:Huffman_tree_2.svg]

“this is an example of a huffman tree”

[Mikolov et al, 2013a, 2013b]

• Frequency-based binning

Page 73: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

73

Hierarchical softmax using Huffman trees

[Mikolov et al, 2013a, 2013b]

• Replace the comparison with V vectors of target words by a comparison with log(V) vectors

$P(w_t = v \mid \mathbf{w}_1^{t-1}) = \prod_{j=1}^{L(w)-1} \sigma\Big( [\![ n(w, j+1) = \mathrm{ch}(n(w, j)) ]\!] \; \mathbf{z}_{n(w,j)}^\top \hat{\mathbf{z}}_t \Big), \qquad \sigma(x) = \dfrac{1}{1 + e^{-x}}$

where ẑ_t is the predicted word vector, n(w, j) is the j-th node on the path from the root n(w, 1) to the leaf n(w, L(w)) of the target word, z_{n(w,j)} is the vector at node j, ch(n) denotes a fixed child of node n, and [[x]] is +1 if x is true and -1 otherwise.

Page 74: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

74

Noise-Contrastive Estimation

• Conditional probability of word w in the data:

$P_d(w \mid \mathbf{w}_1^{t-1}, \theta) = \dfrac{e^{s_\theta(w)}}{\sum_{v=1}^{V} e^{s_\theta(v)}}$

• Conditional probability that word w comes from the data D and not from the noise distribution:

$P(D = 1 \mid w, \mathbf{w}_1^{t-1}) = \dfrac{P_d(w \mid \mathbf{w}_1^{t-1})}{P_d(w \mid \mathbf{w}_1^{t-1}) + k\,P_{\text{noise}}(w)}$

o Auxiliary binary classification problem:
• Positive examples (data) vs. negative examples (noise)
o Scaling factor k: noisy samples are k times more likely than data samples
• Noise distribution: based on unigram word probabilities
o Empirically, the model can cope with un-normalized probabilities:

$P_d(w \mid \mathbf{w}_1^{t-1}, \theta) \approx e^{s_\theta(w)}, \qquad P(D = 1 \mid w, \mathbf{w}_1^{t-1}) = \dfrac{e^{s_\theta(w)}}{e^{s_\theta(w)} + k\,P_{\text{noise}}(w)}$

[Mnih & Teh, 2012; Mnih & Kavukcuoglu, 2013; Mikolov et al, 2013a, 2013b]

Page 75: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

75

Noise-Contrastive Estimation

• Conditional probability that word w comes from the data D and not from the noise distribution:

$P(D = 1 \mid w, \mathbf{w}_1^{t-1}) = \dfrac{e^{s_\theta(w)}}{e^{s_\theta(w)} + k\,P_{\text{noise}}(w)} = \sigma\big(\Delta s_\theta(w)\big), \qquad \Delta s_\theta(w) = s_\theta(w) - \log\big(k\,P_{\text{noise}}(w)\big), \qquad \sigma(x) = \dfrac{1}{1 + e^{-x}}$

o Auxiliary binary classification problem:
• Positive examples (data) vs. negative examples (noise)
o Scaling factor k: noisy samples are k times more likely than data samples
• Noise distribution: based on unigram word probabilities
o Introduce the log of the difference between:
• the score of word w under the data distribution
• and the unigram (noise) distribution score of word w

[Mnih & Teh, 2012; Mnih & Kavukcuoglu, 2013; Mikolov et al, 2013a, 2013b]

Page 76: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

76

Noise-Contrastive Estimation

• New loss function to maximize:

$L'_t = \mathbb{E}_{P_d}\big[\log P(D = 1 \mid w_t, \mathbf{w}_1^{t-1})\big] + k\,\mathbb{E}_{P_{\text{noise}}}\big[\log P(D = 0 \mid w, \mathbf{w}_1^{t-1})\big]$

with $P(D = 1 \mid w, \mathbf{w}_1^{t-1}) = \dfrac{P_d(w \mid \mathbf{w}_1^{t-1})}{P_d(w \mid \mathbf{w}_1^{t-1}) + k\,P_{\text{noise}}(w)} = \dfrac{e^{s_\theta(w)}}{e^{s_\theta(w)} + k\,P_{\text{noise}}(w)}$

whose gradient only involves the target word and the k noise samples:

$\dfrac{\partial L'_t}{\partial \theta} = \big(1 - \sigma(\Delta s_\theta(w_t))\big)\dfrac{\partial s_\theta(w_t)}{\partial \theta} - \sum_{i=1}^{k} \sigma(\Delta s_\theta(v_i))\,\dfrac{\partial s_\theta(v_i)}{\partial \theta}$

• Compare to Maximum Likelihood learning:

$\dfrac{\partial L_t}{\partial \theta} = \dfrac{\partial s_\theta(w_t)}{\partial \theta} - \sum_{v=1}^{V} P(v \mid \mathbf{w}_1^{t-1})\,\dfrac{\partial s_\theta(v)}{\partial \theta}$

[Mnih & Teh, 2012; Mnih & Kavukcuoglu, 2013; Mikolov et al, 2013a, 2013b]

Page 77: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

77

Negative sampling

• Noise-contrastive estimation:

$P(D = 1 \mid w, \mathbf{w}_1^{t-1}) = \dfrac{e^{s_\theta(w)}}{e^{s_\theta(w)} + k\,P_{\text{noise}}(w)}, \qquad L'_t = \mathbb{E}_{P_d}\big[\log P(D = 1 \mid w_t, \mathbf{w}_1^{t-1})\big] + k\,\mathbb{E}_{P_{\text{noise}}}\big[\log P(D = 0 \mid w, \mathbf{w}_1^{t-1})\big]$

• Negative sampling:
o Remove the normalization term in the probabilities

$P(D = 1 \mid w, \mathbf{w}_1^{t-1}) = \sigma\big(s_\theta(w)\big), \qquad L'_t = \log \sigma\big(s_\theta(w_t)\big) + \sum_{i=1}^{k} \mathbb{E}_{v_i \sim P_{\text{noise}}}\big[\log \sigma\big(-s_\theta(v_i)\big)\big]$

• Compare to Maximum Likelihood learning:

$L_t = s_\theta(w_t) - \log \sum_{v=1}^{V} e^{s_\theta(v)}$

[Mnih & Teh, 2012; Mnih & Kavukcuoglu, 2013; Mikolov et al, 2013a, 2013b]
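A minimal MATLAB sketch of the negative-sampling objective for one target word (not from the original slides): s_target is the model score of the observed word and s_noise the scores of k words drawn from the unigram-based noise distribution.

function L = neg_sampling_loss(s_target, s_noise)
% L' = log sigma(s(w_t)) + sum_i log sigma(-s(v_i))
sigm = @(x) 1 ./ (1 + exp(-x));
L = log(sigm(s_target)) + sum(log(sigm(-s_noise)));

Example: neg_sampling_loss(2.0, [-1.5; 0.3; -2.2]) for k = 3 noise samples.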

Page 78: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

78

Speed-up over full softmax

[Mnih & Teh, 2012; Mikolov et al, 2010-2012, 2013b]

LBL with full softmax, trained on AP News data (14M words, V=17k): 7 days

Skip-gram (context 5) with phrases, trained using negative sampling, on Google data (33G words, V=692k + phrases): 1 day

[Image credits: Mikolov et al (2013) “Distributed Representations of Words and Phrases and their Compositionality”, NIPS]

Penn TreeBank data (900k words, V=10k):
LBL (2-gram, 100d) with full softmax: 1 day
LBL (2-gram, 100d) with noise contrastive estimation: 1.5 hours
RNN (100d) with 50-class hierarchical softmax: 0.5 hours (own experience)
RNN (HS), 50 classes | 145.4 | 0.5

[Image credits: Mnih & Teh (2012) “A fast and simple algorithm for training neural probabilistic language models”, ICML]

Page 79: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

Thank you!
• Further references: following this slide
• Basic (N)LBL Matlab code available on demand
• Contact: [email protected]
• Acknowledgements:
Sumit Chopra (AT&T Labs Research / Facebook)
Srinivas Bangalore (AT&T Labs Research)
Suhrid Balakrishnan (AT&T Labs Research)
Yann LeCun (NYU / Facebook)
Abhishek Arun (Microsoft Bing)

Page 80: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

80

References• Basic n-grams with smoothing and

backtracking (no word vector representation):o S. Katz, (1987)

"Estimation of probabilities from sparse data for the language model component of a speech recognizer",IEEE Transactions on Acoustics, Speech and Signal Processing, vol. ASSP-35, no. 3, pp. 400–401https://www.mscs.mu.edu/~cstruble/moodle/file.php/3/papers/01165125.pdf

o S. F. Chen and J. Goodman (1996)"An empirical study of smoothing techniques for language modelling",ACLhttp://acl.ldc.upenn.edu/P/P96/P96-1041.pdf?origin=publication_detail

o A. Stolcke (2002)"SRILM - an extensible language modeling toolkit”ICSLP, pp. 901–904http://my.fit.edu/~vkepuska/ece5527/Projects/Fall2011/Sundaresan,%20Venkata%20Subramanyan/srilm/doc/paper.pdf

Page 81: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

81

References• Neural network language models:

o Y. Bengio, R. Ducharme, P. Vincent and J.-L. Jauvin (2001, 2003)"A Neural Probabilistic Language Model",NIPS (2000) 13:933-938J. Machine Learning Research (2003) 3:1137-115http://www.iro.umontreal.ca/~lisa/pointeurs/BengioDucharmeVincentJauvin_jmlr.pdf

o F. Morin and Y. Bengio (2005)“Hierarchical probabilistic neural network language model",AISTATShttp://core.kmi.open.ac.uk/download/pdf/22017.pdf#page=255

o Y. Bengio, H. Schwenk, J.-S. Senécal, F. Morin, J.-L. Gauvain (2006)"Neural Probabilistic Language Models",Innovations in Machine Learning, vol. 194, pp 137-186http://rd.springer.com/chapter/10.1007/3-540-33486-6_6

Page 82: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

82

References• Linear and/or nonlinear (neural network-based)

language models:o A. Mnih and G. Hinton (2007)

"Three new graphical models for statistical language modelling",ICML, pp. 641–648, http://www.cs.utoronto.ca/~hinton/absps/threenew.pdf

o A. Mnih, Y. Zhang, and G. Hinton (2009)"Improving a statistical language model through non-linear prediction",Neurocomputing, vol. 72, no. 7-9, pp. 1414 – 1418http://www.sciencedirect.com/science/article/pii/S0925231209000083

o A. Mnih and Y.-W. Teh (2012)"A fast and simple algorithm for training neural probabilistic language models“ICML, http://arxiv.org/pdf/1206.6426

o A. Mnih and K. Kavukcuoglu (2013)“Learning word embeddings efficiently with noise-contrastive estimation“NIPShttp://papers.nips.cc/paper/5165-learning-word-embeddings-efficiently-with-noise-contrastive-estimation.pdf

Page 83: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

83

References• Recurrent neural networks

(long-term memory of word context):o Tomas Mikolov, M Karafiat, J Cernocky, S Khudanpur (2010)

"Recurrent neural network-based language model“Interspeech

o T. Mikolov, S. Kombrink, L. Burger, J. Cernocky and S. Khudanpur (2011)“Extensions of Recurrent Neural Network Language Model“ICASSP 

o Tomas Mikolov and Geoff Zweig (2012)"Context-dependent Recurrent Neural Network Language Model“IEEE Speech Language Technologies 

o Tomas Mikolov, Wen-Tau Yih and Geoffrey Zweig (2013)"Linguistic Regularities in Continuous SpaceWord Representations"NAACL-HLThttps://www.aclweb.org/anthology/N/N13/N13-1090.pdf

o http://research.microsoft.com/en-us/projects/rnn/default.aspx

Page 84: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

84

References• Applications:

o P. Mirowski, S. Chopra, S. Balakrishnan and S. Bangalore (2010) “Feature-rich continuous language models for speech recognition”, SLT

o G. Zweig and C. Burges (2011)“The Microsoft Research Sentence Completion Challenge”MSR Technical Report MSR-TR-2011-129

o http://research.microsoft.com/apps/pubs/default.aspx?id=157031 o M. Auli, M. Galley, C. Quirk and G. Zweig (2013)

“Joint Language and Translation Modeling with Recurrent Neural Networks”EMNLP

o K. Yao, G. Zweig, M.-Y. Hwang, Y. Shi and D. Yu (2013)“Recurrent Neural Networks for Language Understanding”Interspeech

Page 85: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

85

References• Continuous Bags of Words, Skip-Grams,

Word2Vec:o Tomas Mikolov et al (2013)

“Efficient Estimation of Word Representation in Vector Space“arXiv.1301.3781v3

o Tomas Mikolov et al (2013)“Distributed Representation of Words and Phrases and their Compositionality”arXiv.1310.4546v1, NIPS

o http://code.google.com/p/word2vec

Page 86: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014
Page 87: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

87

Probabilistic Language Models

• Goal: score sentences according to their likelihood
o Machine Translation:
• P(high winds tonight) > P(large winds tonight)
o Spell Correction:
• The office is about fifteen minuets from my house
• P(about fifteen minutes from) > P(about fifteen minuets from)
o Speech Recognition:
• P(I saw a van) >> P(eyes awe of an)
• Re-ranking n-best lists of sentences produced by an acoustic model, taking the best

• Secondary goal: sentence completion or generation

Slide courtesy of Abhishek Arun

Page 88: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

88

Example of a bigram language model

There is a big house

I buy a house

They buy the new house

p(big|a) = 0.5
p(is|there) = 1
p(buy|they) = 1
p(house|a) = 0.5
p(buy|i) = 1
p(a|buy) = 0.5
p(new|the) = 1
p(house|big) = 1
p(the|buy) = 0.5
p(a|is) = 1
p(house|new) = 1
p(they|<s>) = 0.333

$P(w_1, w_2, \ldots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_{t-1})$

S1: they buy a big house
P(S1) = 0.333 * 1 * 0.5 * 0.5 * 1
P(S1) = 0.0833

S2: they buy a new house
P(S2) = ?

Training data Model Test data

Slide courtesy of Abhishek Arun
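A minimal MATLAB sketch of how such bigram probabilities are estimated by counting (not from the original slides); sentences is assumed to be a cell array of cell arrays of tokens, each starting with the '<s>' marker:

function p = bigram_mle(sentences, w_prev, w)
% p(w | w_prev) = count(w_prev, w) / count(w_prev)
num = 0; den = 0;
for i = 1:numel(sentences)
  s = sentences{i};
  for t = 2:numel(s)
    den = den + strcmp(s{t-1}, w_prev);
    num = num + (strcmp(s{t-1}, w_prev) && strcmp(s{t}, w));
  end
end
p = num / den;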

Page 89: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

89

Intuitive view of perplexity

• How well can we predict the next word?

o A random predictor would give each word probability 1/V, where V is the size of the vocabulary

o A better model of a text should assign a higher probability to the word that actually occurs

• Perplexity:
o “how many words are likely to happen, given the context”
o Perplexity of 1 means that the model recites the text by heart
o Perplexity of V means that the model produces uniform random guesses
o The lower the perplexity, the better the language model

I always order pizza with cheese and ____

The 33rd President of the US was ____

I saw a ____

mushrooms 0.1

pepperoni 0.1

anchovies 0.01

….

fried rice 0.0001

….

and 1e-100

Slide courtesy of Abhishek Arun

Page 90: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

Stochastic gradient descent

[LeCun et al, "Efficient BackProp", Neural Networks: Tricks of the Trade, 1998;Bottou, "Stochastic Learning", Slides from a talk in Tubingen, 2003]

Page 91: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

Stochastic gradient descent

[LeCun et al, "Efficient BackProp", Neural Networks: Tricks of the Trade, 1998;Bottou, "Stochastic Learning", Slides from a talk in Tubingen, 2003]

Page 92: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

Dimensionality reduction and invariant mapping

[Hadsell, Chopra & LeCun, “Dimensionality Reduction by Learning an Invariant Mapping”, CVPR, 2006]

Similarly labelled samples

Dissimilar codes

Page 93: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

Auto-encoder

[Figure: two auto-encoders mapping an input to a code and back to a target equal to the input. Left: a “bottleneck” code, i.e., a low-dimensional, typically dense, distributed representation. Right: an “overcomplete” code, i.e., a high-dimensional, always sparse, distributed representation.]

Page 94: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

Auto-encoder

[Figure: the encoder produces a code prediction from the input (encoding “energy” against the code); the decoder produces an input decoding from the code (decoding “energy” against the input).]

Page 95: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

Auto-encoder

[Figure: same auto-encoder diagram, with encoding energy and decoding energy.]

Page 96: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

Auto-encoder loss function

[Figure: the loss is the sum of an encoding energy and a decoding energy, written for one sample t and summed over all T samples; a coefficient weights the encoder error. We note W = {C, b_C, D, b_D}. How do we get the codes Z?]

Page 97: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

Auto-encoder: backprop w.r.t. codes

[Figure: the encoding energy between the code prediction and the code is back-propagated with respect to the codes themselves.]

[Ranzato, Boureau & LeCun, “Sparse Feature Learning for Deep Belief Networks ”, NIPS, 2007]

Page 98: Tutorial on neural probabilistic language models Piotr Mirowski, Microsoft Bing London South England NLP Meetup @ UCL April 30, 2014

Auto-encoder: backprop w.r.t. codes

[Figure: the encoding energy between the code prediction and the code is back-propagated with respect to the codes themselves.]

[Ranzato, Boureau & LeCun, “Sparse Feature Learning for Deep Belief Networks ”, NIPS, 2007]