
Page 1

Large Scale Learning for Natural Language Processing

Édouard Grave

Facebook AI Research
[email protected]

Page 2

What is NLP, why is it important?

• Process, analyze and/or produce natural language

• Interact with computers using natural language

• Natural Language Understanding (NLU): language as input

• Natural Language Generation (NLG): language as output

• Many applications (lots of information is in text):
  • text classification, spam detection, topic identification
  • machine translation
  • information retrieval, web search
  • medical records, scientific articles

• Large scale? Wikipedia: 3B words, Common Crawl: 24TB

Page 3

Overview

• Text classification

• Word representation

• Language modeling

Page 4

Text Classification

Page 5

Is this spam?

Page 6

Is this review positive?

Page 7

What is this article about?

Page 8

What is text classification?

• Given a piece of text w = (w_1, ..., w_n), assign it a class y ∈ Y.

• Spam detection: Y = {spam, not spam}

• Sentiment analysis: Y = {positive, negative} or Y = {∗, ∗∗, ∗∗∗, ∗∗∗∗}

• Language identification: Y = {English, French, German, ...}

• Topic classification: Y = {1960s comedy films, Australian cricketers, ...}

• Classes are sometimes also called labels or categories.

Page 9

What is text classification?

• From now on, we map classes to integers, so Y = {1, ..., C}.

• Potential solution: a rule-based classifier:

IF "the" IN w AND "is" IN w

THEN lang = "English"

• This works well, but misses many English documents

• Also: many rules to write, hard and expensive to maintain

• Instead: use machine learning

• Assume training/learning data: (w_i, y_i)

Page 10

How to represent documents?

• Represent documents as bags of words:

{ what a great movie } = { a movie great what }

• Order does not matter

• A word can appear multiple times (bags ≠ sets)

{ what a great movie } ≠ { what a great great movie }

• A document is represented as a vector of word counts x ∈ R^V:

x_j is the number of times word j appears in the document
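A minimal sketch of this representation, assuming whitespace-tokenized input and a precomputed word-to-index mapping (both hypothetical here):

import numpy as np
from collections import Counter

def bag_of_words(tokens, vocab):
    # Map a token list to a count vector x in R^V; out-of-vocabulary
    # tokens are simply dropped.
    x = np.zeros(len(vocab))
    for token, count in Counter(tokens).items():
        if token in vocab:
            x[vocab[token]] = count
    return x

vocab = {"what": 0, "a": 1, "great": 2, "movie": 3}
print(bag_of_words("what a great great movie".split(), vocab))
# [1. 1. 2. 1.]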

Page 11

What is a word?

The Bengal and Siberian tigers are amongst the tallest cats in shoulder height.

• How many words? 12? 13? 14?

• Types vs. tokens:
  • a type is an element of the vocabulary
  • a token is an instance of a type in running text

• How to split this input into tokens? tokenization

• Are tigers and tiger the same type? lemmatization

• What about The and the? normalization


Page 13

Tokenization

The tiger’s closest living relatives were previously thought to be the lion, leopard and jaguar.

• Split text into tokens: cannot just split on spaces

[The] [tiger] [’s] [closest] [living] [relatives]

[were] [previously] [thought] [to] [be] [the] [lion]

[,] [leopard] [and] [jaguar] [.]

• Issues for tokenization:
  • isn’t → [isn’t] or [is] [not] or [is] [n’t]?
  • low-frequency → [low-frequency] or [low] [frequency]?

• Some choices are arbitrary: always use the same tokenizer!

• e.g.: Moses (perl) or spaCy, NLTK, Stanford CoreNLP (python)
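As a toy illustration (not one of the tokenizers listed above), a single regular expression already reproduces the bracketed segmentation:

import re

def tokenize(text):
    # Words, clitics such as 's, and isolated punctuation marks.
    return re.findall(r"\w+|'\w+|[^\w\s]", text)

print(tokenize("The tiger's closest living relatives were previously "
               "thought to be the lion, leopard and jaguar."))
# ['The', 'tiger', "'s", 'closest', ..., 'jaguar', '.']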

Page 14

Tokenization is language dependent

• How to tokenize French word l’avion?

[l’] [avion] or [le] [avion]

• How to tokenize Finnish word rautatieasema (railway station)?

[rauta] [tie] [asema] ([iron] [road] [station])

• What about Japanese or Chinese?

• How to tokenize the Vietnamese word cà phê (coffee)?

[cà phê]

Page 15

Normalization

• Case folding: The vs. the

• Periods in words: U.S.A. vs. USA

• Keep or remove punctuation? Stopwords?

• Inflected variants: walk vs. walking, is vs. are

• Dates? Numbers?
  • 2018-03-21 → YYYY-MM-DD
  • 21 March 2018 → 2018-03-21
  • 3.1415 → D.DDDD
  • 300,000 → DDD,DDD

• Normalization is application dependent:
  • U.S. vs. us might be important (e.g. named entity detection)
  • keeping punctuation might be important (e.g. sentiment analysis)

Page 16

Logistic regression

• The log probability ratio (i.e. the decision function) is linear:

$$\log\left(\frac{P(Y=1 \mid x)}{P(Y=-1 \mid x)}\right) = w^\top x$$

• Then, we get

$$\log\left(\frac{P(Y=1 \mid x)}{1 - P(Y=1 \mid x)}\right) = w^\top x$$

$$\frac{P(Y=1 \mid x)}{1 - P(Y=1 \mid x)} = \exp(w^\top x)$$

$$P(Y=1 \mid x) = \exp(w^\top x) - \exp(w^\top x)\, P(Y=1 \mid x)$$

$$P(Y=1 \mid x) = \frac{\exp(w^\top x)}{1 + \exp(w^\top x)}$$

Page 17

Sigmoid function a.k.a. logistic function

• The logistic or sigmoid function is defined by

$$\sigma(x) = \frac{\exp(x)}{1 + \exp(x)} = \frac{1}{1 + \exp(-x)}$$

• We also have

$$P(Y=-1 \mid X=x) = \frac{1}{1 + \exp(w^\top x)}$$

• Hence

$$P(Y=1 \mid X=x) = \sigma(w^\top x)$$

$$P(Y=-1 \mid X=x) = \sigma(-w^\top x)$$
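A quick numpy sketch of the resulting probability (hypothetical weights and features; the branching form of the sigmoid avoids overflow for large |w^T x|):

import numpy as np

def sigmoid(z):
    # Numerically stable: only ever exponentiate non-positive values.
    return np.where(z >= 0,
                    1.0 / (1.0 + np.exp(-np.abs(z))),
                    np.exp(-np.abs(z)) / (1.0 + np.exp(-np.abs(z))))

w = np.array([0.5, -1.0])    # hypothetical weight vector
x = np.array([2.0, 1.0])     # hypothetical document features
print(sigmoid(w @ x))        # P(Y = 1 | x) = sigma(0) = 0.5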

Page 18

Sigmoid function a.k.a. logistic function

Figure: The sigmoid function on [−4, 4], increasing from 0 to 1.

Page 19

Logistic regression: objective function

• Learning objective: the negative log-likelihood:

$$\min_w -\frac{1}{n} \log\left(\prod_i P(Y=y_i \mid X=x_i)\right)$$

• We get

$$\min_w -\frac{1}{n} \sum_i \log P(Y=y_i \mid X=x_i)$$

• Using the logistic function

$$\min_w -\frac{1}{n} \sum_i \log \sigma(y_i\, w^\top x_i)$$

• And we get

$$\min_w \frac{1}{n} \sum_i \log(1 + \exp(-y_i\, w^\top x_i))$$

Page 20

Logistic regression: objective function

• Learning objective for logistic regression:

$$\min_w \frac{1}{n} \sum_i \log(1 + \exp(-y_i\, w^\top x_i))$$

• The function $x \mapsto \log(1 + \exp(-x))$ is called the logistic loss.

Figure: The logistic loss function on [−4, 4].

Page 21

Logistic regression: optimization

• How to find the minimum?

As opposed to naive Bayes or least squares, there is no closed form solution!

• Use a tool from convex optimization: gradient descent.

• We have

$$\sigma'(x) = \frac{\exp(-x)}{(1 + \exp(-x))^2} = \frac{1}{1 + \exp(-x)} \times \frac{\exp(-x)}{1 + \exp(-x)}$$

• Thus

$$\sigma'(x) = \sigma(x)(1 - \sigma(x))$$

• And

$$\frac{\partial \log \sigma(x)}{\partial x} = 1 - \sigma(x)$$

• The derivative is non-increasing, hence the log-likelihood is concave (and the loss is convex)

Page 22

Logistic regression: optimization

• Thus, the gradient of our objective function is

$$\nabla_w L(w) = -\frac{1}{n} \sum_{i=1}^{n} (1 - \sigma(y_i\, w^\top x_i))\, y_i\, x_i$$

• We can optimize our objective function with gradient descent:

$$w_{t+1} = w_t - \eta_t \nabla_w L(w_t)$$

• This algorithm requires a full pass over the data at each step.

• It is also slow to converge: error O(1/t)
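A compact numpy sketch of this update rule on toy data (the step size and iteration count are arbitrary choices, not values from the slides):

import numpy as np

def train_logreg(X, y, eta=0.1, steps=1000):
    # X: n x d data matrix, y: labels in {-1, +1}.
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        s = 1.0 / (1.0 + np.exp(-y * (X @ w)))   # sigma(y_i w.x_i)
        grad = -((1.0 - s) * y) @ X / len(y)     # gradient of the loss
        w -= eta * grad
    return w

X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
print(np.sign(X @ train_logreg(X, y)))           # recovers [1, 1, -1, -1]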

Page 23

Logistic regression: optimization

• Instead of first order optimization (i.e. using only the gradient), use second order information (i.e. the Hessian)

• We have

$$\nabla^2_w L(w) = \frac{1}{n} \sum_i (1 - \sigma(y_i\, w^\top x_i))\, \sigma(y_i\, w^\top x_i)\, x_i x_i^\top = X^\top D X$$

where $D = \mathrm{diag}\big[(1 - \sigma(y_i\, w^\top x_i))\, \sigma(y_i\, w^\top x_i)\big]$

• Then, we apply a Newton step:

$$w_{t+1} = w_t - (X^\top D X)^{-1} \nabla_w L(w_t)$$

• Newton's method has a quadratic convergence rate: $O(\exp(-\rho\, 2^t))$

• But! Inverting a d × d matrix costs O(d³): does not scale to large models

Page 24

Logistic regression: optimization

• Method of choice in machine learning: stochastic gradient descent.

• Instead of computing gradient on full dataset, use only one example:

$$w_{t+1} = w_t + \eta_t (1 - \sigma(y_i\, w^\top x_i))\, y_i\, x_i$$

• Slow convergence, but each update is very cheap to evaluate.

• In machine learning, there is no need to optimize below the estimation error (see Bottou and Bousquet (2008), The Tradeoffs of Large Scale Learning)

• Some references on optimization:
  • M. Hardt's course at Berkeley, ee227c.github.io
  • F. Bach's tutorial, www.di.ens.fr/~fbach/fbach_mlss_2018.pdf
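The corresponding stochastic version, as a sketch (one uniformly sampled example per step and a fixed step size, both simplifications):

import numpy as np

def sgd_logreg(X, y, eta=0.1, steps=5000, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        i = rng.integers(len(y))                      # pick one example
        s = 1.0 / (1.0 + np.exp(-y[i] * (X[i] @ w)))
        w += eta * (1.0 - s) * y[i] * X[i]            # single-example update
    return w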

Page 25

Logistic regression: optimization

Credit: https://leon.bottou.org/projects/sgd

Page 26

Logistic regression: multiclass case

• Logistic regression has a straightforward extension to multiple classes:

$$P(Y=k \mid X=x) = \frac{\exp(w_k^\top x)}{\sum_{k'} \exp(w_{k'}^\top x)}$$

• Need to learn one parameter vector per class.

• Note: since $\sum_k P(Y=k \mid X=x) = 1$, we could set $w_1 = 0$:

$$P(Y=k \mid X=x) = \frac{\exp(-w_1^\top x)\, \exp(w_k^\top x)}{\exp(-w_1^\top x) \sum_{k'} \exp(w_{k'}^\top x)} = \frac{\exp((w_k - w_1)^\top x)}{\sum_{k'} \exp((w_{k'} - w_1)^\top x)}$$

• But this does not change anything in practice.

Page 27

Softmax function

• The multivariate function f defined by

$$f_i(s) = \frac{\exp(s_i)}{\sum_{i'} \exp(s_{i'})}$$

is called the softmax.

• Computational stability: subtract the max $m = \max_i s_i$:

$$f_i(s) = \frac{\exp(s_i - m)}{\sum_{i'} \exp(s_{i'} - m)}$$

• Appears in many applications: linear discriminant analysis, neuralnetworks, etc.
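A short numpy sketch of the max-subtraction trick (toy scores; without the shift, the exponentials would overflow here):

import numpy as np

def softmax(s):
    z = np.exp(s - np.max(s))   # shifting by the max leaves the result unchanged
    return z / z.sum()

print(softmax(np.array([1000.0, 1001.0, 1002.0])))
# [0.09003057 0.24472847 0.66524096]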

Page 28

Softmax function

• The gradient is equal to:

$$\frac{\partial f_i}{\partial s_j}(s) = f_i(s)\,(\delta_{ij} - f_j(s))$$

• Then

$$\frac{\partial \log f_i}{\partial s_j}(s) = \delta_{ij} - f_j(s)$$

• Finally

$$\nabla_s \log f_i(s) = e_i - f(s)$$

Page 29

Class-based softmax

• Limitation: expensive to compute when the number of classes is large.

• Can we do better? Yes, by approximating the softmax function.

• Idea: partition the classes into L subsets; c_k denotes the subset of label k:

$$P(Y=k \mid X=x) = P(Y=k \mid X=x, C=c_k)\, P(C=c_k \mid X=x)$$

• If each subset contains $\sqrt{K}$ classes, then the training complexity is $O(\sqrt{K})$

• How to partition the classes? Random, frequency, similarity

Page 30

Hierarchical softmax

• Idea: organize the classes in a binary tree, with classes corresponding to leaves

• Then: learn a binary classifier at each node of the tree (go left or right?)

• A class k is encoded by the path P_k from the root to the corresponding leaf, and the corresponding decisions y_{ik} (go left or right):

$$P(Y=k \mid X=x) = \prod_{i \in P_k} \sigma(y_{ik}\, w_i^\top x)$$

Figure: an internal node n with children c_i and c_j; its two branches are scored with w_n and −w_n.

• If the tree is balanced, the depth is O(log(K))

Page 31

Hierarchical softmax

• The complexity at train time is O(log(K)).

• What about prediction? How to find the most probable label?

• We can do depth first search.

• Each time a path has a smaller probability than best so far, stop.

• Start exploring the most probable branch.


Page 33

Hierarchical softmax

• Python-inspired pseudo-code; best is a (score, label) pair:

def dfs(x, score, node, best):
    # Prune: this path can no longer beat the best leaf found so far.
    if score < best[0]:
        return best
    if node.is_leaf:
        return (score, node.label)
    s = sigmoid(dot(node.w, x))
    # Explore the more probable child first, to tighten the bound early.
    if s >= 0.5:
        best = dfs(x, score * s, node.left, best)
        best = dfs(x, score * (1 - s), node.right, best)
    else:
        best = dfs(x, score * (1 - s), node.right, best)
        best = dfs(x, score * s, node.left, best)
    return best

Page 34

Hierarchical softmax

• Use log-probabilities to prevent underflow:

def dfs(x, score, node, best):
    # score and best[0] are now log-probabilities.
    if score < best[0]:
        return best
    if node.is_leaf:
        return (score, node.label)
    s = sigmoid(dot(node.w, x))
    best = dfs(x, score + log(s), node.left, best)
    best = dfs(x, score + log(1 - s), node.right, best)
    return best

Page 35

Hierarchical softmax

• Predict categories of Stack Exchange questions

• 735 labels, 10,000 word types

Model                  P@1   R@1   Time (sec)
Full softmax           56.8  24.6  29.1
Hierarchical softmax   57.1  24.7   5.1

• On a dataset with 300,000 labels: from hours to minutes

Page 36

Matrix factorization

• If the number of classes and features is large: large number of parameters.

• The size of the weight matrix W is K × V!

• It might not fit in memory for some problems.

• Instead, replace W by a low rank matrix UV^⊤, with U ∈ R^{K×d} and V ∈ R^{V×d}

• Then, Wx = U(V^⊤x).

• If x is the vector of word counts, then $V^\top x = \sum_i v_{w_i}$

• Average the vectors of the words appearing in the document.

• Next part, we will see how to learn such vectors from raw text.
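A small numpy sketch with hypothetical shapes; the point is that V^⊤x is just a sum of rows of V, so the K × V matrix W is never formed:

import numpy as np

K, V, d = 5, 10000, 64          # classes, vocabulary size, rank (toy values)
U = np.random.randn(K, d)       # per-class classifiers
Vmat = np.random.randn(V, d)    # word vectors

doc = [3, 17, 17, 42]           # word ids of a document, with repetition
hidden = Vmat[doc].sum(axis=0)  # equals V^T x for the count vector x
scores = U @ hidden             # class scores
print(scores.shape)             # (5,)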

Page 37

Bigram features

• the cat ate the mouse versus the mouse ate the cat

• Same bag of words representations, very different meanings!

• Important for sentiment analysis:
  • I did not wait and I like[d] the restaurant
  • I wait[ed] and I did not like the restaurant

• What can we do? Use n-gram features!

[the] [cat] [ate] [the] [mouse]

[the cat] [cat ate] [ate the] [the mouse]

Model             AG    DBP   Yelp F.  Yah. A.  Amz. F.
unigram features  91.5  98.1  60.4     72.0     55.8
bigram features   92.5  98.6  63.9     72.3     60.2

Table: Test accuracy [%] on classification datasets.


Page 40

fastText

• Open source implementation of text classification

• Multiclass logistic regression, binary hierarchical softmax

• n-gram word features and subword features (see next part)

• Stochastic gradient descent

• Matrix factorization

• Model compression (e.g. langid in less than 1MB)

> fasttext supervised -input data.train.txt -output model

> fasttext predict model.bin data.test.txt

• More information on www.fasttext.cc

Page 41

fastText – Experiments

Model                                AG    DBP   Yelp F.  Yah. A.  Amz. F.
BoW (Zhang et al., 2015a)            88.8  96.6  58.0     68.9     54.6
ngrams (Zhang et al., 2015a)         92.0  98.6  56.3     68.5     54.3
ngrams TFIDF (Zhang et al., 2015a)   92.4  98.7  54.8     68.5     52.4
char-CNN (Zhang et al., 2015a)       87.2  98.3  62.0     71.2     59.5
char-CRNN (Xiao and Cho, 2016)       91.4  98.6  61.8     71.7     59.2
VDCNN (Conneau et al., 2016)         91.3  98.7  64.7     73.4     63.0
fastText, h = 10                     91.5  98.1  60.4     72.0     55.8
fastText, h = 10, bigram             92.5  98.6  63.9     72.3     60.2

Table: Test accuracy [%] on classification datasets.

Page 42

Compressing text models

• Parameters of model: embeddings U and classifiers V

• With large vocabulary and/or large output space:

U and V can (still) be large

• Use compression technique to reduce memory footprint

• Product quantization approximates a vector x ∈ R^d by

$$x \approx [q_1(x^{(1)}), \ldots, q_p(x^{(p)})],$$

where the $x^{(i)}$ are subvectors in $R^{d/p}$ and the $q_i$ are k-means quantizers.

Page 43

k-means quantization

• Given a set of vectors x_1, ..., x_n ∈ R^{d/p}, run k-means

• Approximate each vector x_i by its nearest centroid

• If the number of centroids is k = 256:
  • need 4 × 256 × d bytes to store the centroids
  • need p bytes per vector, instead of 4 × d

• Total memory: 1024 × d + n × p vs. 4 × n × d

• For NLP: use d/p = 2 or d/p = 4 in product quantization

• Leads to memory savings of 8× to 16×
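A bare-bones product quantization sketch (toy data and a naive k-means with a few Lloyd iterations; a real implementation would use an optimized library):

import numpy as np

def kmeans(X, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        assign = np.argmin(((X[:, None, :] - C) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(assign == j):
                C[j] = X[assign == j].mean(axis=0)
    return C

def pq_encode(X, k=256, p=4):
    # Split each vector into p subvectors; quantize each with its own codebook.
    subs = np.split(X, p, axis=1)
    books = [kmeans(s, k) for s in subs]
    codes = np.stack([np.argmin(((s[:, None, :] - C) ** 2).sum(-1), axis=1)
                      for s, C in zip(subs, books)], axis=1)
    return codes.astype(np.uint8), books    # p bytes per vector

X = np.random.randn(1000, 16).astype(np.float32)
codes, books = pq_encode(X, k=16, p=4)
print(codes.shape)                          # (1000, 4)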

Page 44

Parameter quantization

Figure: Accuracy as a function of the memory per word vector (k = 2, 4, 8) on Sogou (left) and Yahoo (right), comparing Full, PQ, OPQ, LSH + norm, PQ + norm and OPQ + norm. An extra byte is required when we encode the norm separately.

Page 45

Dictionary pruning (advanced)

Second strategy to reduce memory footprint: feature selection.

In our case, select K words / ngrams from a trained model.

Find the closest sparse model:

Find the closest sparse model:

$$\min_{\tilde{U}} \|U - \tilde{U}\|^2 \quad \text{s.t.} \quad \tilde{U} \in \mathcal{S}_K,$$

where $\mathcal{S}_K$ is the set of matrices with at most K nonzero columns.

This corresponds to keeping the K columns of largest norm.
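This rule is a one-liner in numpy (hypothetical matrix, one column per word or n-gram):

import numpy as np

U = np.random.randn(64, 100000)                    # d x vocabulary (toy sizes)
K = 10000
keep = np.argsort(-np.linalg.norm(U, axis=0))[:K]  # K largest-norm columns
U_pruned = U[:, keep]
print(U_pruned.shape)                              # (64, 10000)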

Page 46

Dictionary pruning

Figure: Loss of accuracy vs. model size (100kB to 100MB, log scale) on AG, Amazon full, Amazon polarity, DBPedia, Sogou, Yahoo, Yelp full and Yelp polarity. We compare models with different levels of pruning (Full, PQ, Pruned) and the full fastText model, as well as (Zhang et al., 2015b) and (Xiao and Cho, 2016).

Page 47

Model compression

Dataset        full size  full  64KiB  32KiB  16KiB
AG             65M        92.1  91.4   90.6   89.1
Amazon full    108M       60.0  58.8   56.0   52.9
Amazon pol.    113M       94.5  93.3   92.1   89.3
DBPedia        87M        98.4  98.2   98.1   97.4
Sogou          73M        96.4  96.4   96.3   95.5
Yahoo          122M       72.1  70.0   69.0   69.2
Yelp full      78M        63.8  63.2   62.4   58.7
Yelp pol.      77M        95.7  95.3   94.9   93.2

Average diff. [%]         0     -0.8   -1.7   -3.5

Table: Performance of very small models. We use quantization with k = 1, hashing and extreme pruning. The last row shows the average drop in performance for the different sizes.

Page 48

Large scale text classification: recap

• Based on linear logistic regression

• Scale to large datasets: stochastic gradient descent

• Scale to a large number of features/classes:
  • computation: hierarchical softmax
  • memory: low rank linear model

• Compress (text) classification models:
  • parameter quantization (less memory per parameter)
  • vocabulary pruning (remove parameters entirely)

• All techniques: applicable to other applications!

Page 49

Word Representations

Page 50

Word vectors: motivations

• The traditional way is to represent words as atomic symbols, with a unique integer associated with each word:

{1=movie, 2=hotel, 3=apple, 4=movies, 5=art}

• This is equivalent to representing words as 1-hot vectors:

movie = [1, 0, 0, 0, 0]

hotel = [0, 1, 0, 0, 0]

. . .

art = [0, 0, 0, 0, 1]

Page 51

Word vectors: motivations

• Implicit assumption: word vectors form an orthonormal basis
  • orthogonal (x^⊤y = 0)
  • normalized (x^⊤x = 1)

• Problem: not very informative:
  • weird to consider “movie” and “movies” as independent entities
  • or to consider all words equidistant:

‖dog − cat‖ = ‖dog − moon‖

Page 52

Distributional hypothesis

“You shall know a word by the company it keeps” Firth (1957)

• Meaning of a word: set of contexts in which it occurs in texts

Page 53

Example: What is the meaning of “bardiwac”?

• He handed her her glass of bardiwac.

• Beef dishes are made to complement the bardiwacs.

• Nigel staggered to his feet, face flushed from too much bardiwac.

• Malbec, one of the lesser-known bardiwac grapes, responds well to Australia’s sunshine.

• I dined off bread and cheese and this excellent bardiwac.

• The drinks were delicious: blood-red bardiwac as well as light, sweet Rhenish.

→ bardiwac is a heavy red alcoholic beverage made from grapes

Source: Distributional Semantic Models, Stefan Evert, 2015.


Page 60

Distributional word representation in a nutshell

• Define what the context of a word is

• Count how many times the target word occurs with each context

• Build a vector out of (a function of) these context occurrence counts

Caveat:

• similar vectors represent words with similar distributions over contexts

• Distributional hypothesis: the bridging assumption from distributional representation to semantic representation

Source: Foundations of Distributional Semantic Models, Stefan Evert andAlessandro Lenci, 2009.


Page 63

Learning distributed word representation

• Directly learning low dimensional vectors

• Key idea 1 (Collobert and Weston, 2008): learn distributed word vectors as a discriminative problem

• Key idea 2 (Mikolov et al., 2013a): efficient online training to scale to large datasets

• State-of-the-art model: word2vec by Mikolov et al. (2013a)

Page 64

Word2vec: the skipgram and cbow models

• word2vec: context is a fixed size window around the word

• Skipgram: predict the context words from the target word

sun still glitters although evening has arrived in Kuhmo.

• Continuous Bag of Words (cbow): predict the target word from the sum of its context word vectors

sun still glitters although evening has arrived in Kuhmo.

Page 65

Word2vec: word vectors as a discriminative problem

• Given a dataset of N tokens and a vocabulary of V words

• Each word i in the vocabulary is associated with a word vector x_i ∈ R^d and a context vector y_i ∈ R^d, with d ≪ V

• Denote by X the matrix with i-th row equal to x_i (same for Y)

• We denote by x_{w_n} and y_{w_n} the vectors associated with the n-th token of the dataset

Page 66

Skipgram as a discriminative problem

• Skipgram predicts each word c in the context C_n of the n-th token

• Discriminate the correct context words against the rest of the vocabulary

• Frame it as a minimization problem:

$$\min_{X \in R^{V \times d},\, Y \in R^{V \times d}} \; \frac{1}{N} \sum_{n=1}^{N} \frac{1}{|C_n|} \sum_{c \in C_n} \ell(x_{w_n}, y_c)$$

where

$$\ell(x, y) = -x^\top y + \log\left(\sum_{k=1}^{V} \exp(y_k^\top x)\right)$$

is the negative log-softmax function.

Page 67

Cbow as a discriminative problem

• Cbow predicts the word associated with token n based on its context C_n

• The context is represented as a bag of words (BoW)

• Discriminate the correct word against the rest of the vocabulary

• Frame it as a minimization problem:

$$\min_{X \in R^{V \times d},\, Y \in R^{V \times d}} \; \frac{1}{N} \sum_{n=1}^{N} \ell\left(\underbrace{\frac{1}{|C_n|} \sum_{c \in C_n} x_c}_{\text{context BoW}},\; y_{w_n}\right)$$

where ℓ is the negative log-softmax.

Page 68

Optimization of word2vec

Gradient descent

$$X_{t+1} \leftarrow X_t - \alpha_t \frac{1}{N} \sum_{n=1}^{N} \nabla_X \ell(x_n, y_n)$$

→ Requires a pass over the dataset for one gradient: O(N)

Stochastic gradient descent with a predefined sequential schedule

- Loop over the N tokens in the dataset, taking a gradient step at each token
- Repeat the process for E epochs; total number of iterations T = NE
- t-th update, with n = t mod N:

$$X_{t+1} \leftarrow X_t - \alpha_t \nabla_X \ell(x_n, y_n)$$

Page 69

Optimization of word2vec

Figure: The learning rate schedule (α_t)_t, decaying linearly from α_0 at t = 0 to α_T at t = T.

• Set α_0 and the number of iterations T:

$$\alpha_t = \left(1 - \frac{t}{T}\right) \alpha_0$$

Page 70

Optimization of word2vec: hogwild

Hogwild parallelizes this algorithm over P processes:

• Split the dataset into P subsets.

• Read the P subsets in parallel.

• Share the parameters between processes.

• Each process:
  • computes the gradient for one token
  • updates the shared parameters without synchronization

→ Updates the parameters sequentially and in parallel, without sync

→ Scales well with the number of processes!

Page 71

Word2vec: efficient distributed training

• Computing the softmax over the whole vocabulary is slow: O(V)

→ Replace it by negative sampling

• Negative sampling (skipgram): sample K ≪ V words N_n that do not appear in the context of x_n, and replace the softmax by a sum of one-versus-all losses:

$$\ell(x_n, y_c) \leftarrow \sigma(x_n, y_c) + \frac{1}{K} \sum_{k \in N_n} \sigma(-x_n, y_k)$$

where $\sigma(x, y) = \log(1 + \exp(-x^\top y))$ is the negative log-sigmoid

• Sample negatives based on word frequency to match the data distribution:

$$p_{\text{negative}}(w) \propto \text{freq}(w)^{0.75}$$

• Same for cbow
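A numpy sketch of this loss for one (target, context) pair, with random vectors standing in for learned parameters (a toy illustration, not the training loop):

import numpy as np

def neg_log_sigmoid(z):
    return np.log1p(np.exp(-z))       # sigma(x, y) = log(1 + exp(-x.y))

def ns_loss(x, y_pos, Y_neg):
    # x: target vector, y_pos: true context vector, Y_neg: K x d negatives.
    return neg_log_sigmoid(x @ y_pos) + np.mean(neg_log_sigmoid(-(Y_neg @ x)))

rng = np.random.default_rng(0)
d, K = 100, 5
print(ns_loss(rng.standard_normal(d), rng.standard_normal(d),
              rng.standard_normal((K, d))))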

Page 72

Word2vec: efficient distributed training

sun still glitters although evening has arrived in Kuhmo.

• Instead of fixing the window size |C_n|, sample it:

• Sample w uniformly in {1, ..., w_max} and set |C_n| = 2w


Page 75

Word2vec: efficient distributed training

• Word frequency in corpora follows a Zipf distribution

• Zipf distribution: ranked by frequency, each word is x times less frequent than the previous one.

Example: p(the) = 0.1, p(a) = 0.05, p(is) = 0.025, ...

→ a small subset of the vocabulary (≈ 2k words) covers > 80% of the dataset

→ 80% of training is spent on learning 2k word vectors out of 2M

• Discard words during training based on frequency (t ∈ [10⁻⁵, 10⁻³]):

$$p(\text{discard} \mid w) = \max\left(0,\; 1 - \sqrt{\frac{t}{\text{freq}(w)}}\right)$$
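As a quick sketch (toy frequencies; t = 10⁻⁴ as a typical value):

import numpy as np

def p_discard(freq, t=1e-4):
    return np.maximum(0.0, 1.0 - np.sqrt(t / freq))

print(p_discard(np.array([1e-1, 1e-3, 1e-5])))
# ~[0.97 0.68 0.  ]  -- frequent words are mostly discarded, rare ones kept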

Page 76

Example of nearest neighbors

• Trained on 1B tokens from Wikipedia, dimension 300

moon    score        talking     score        blue     score
mars    0.615        discussing  0.663        red      0.704
moons   0.611        telling     0.657        yellow   0.677
lunar   0.602        joking      0.632        purple   0.676
sun     0.602        thinking    0.627        green    0.655
venus   0.583        talked      0.624        pink     0.612

Page 77

Word vector analogies

Credit: Mikolov et al. (2013)

Page 78

Analogies as intrinsic evaluation of word representation

• Word vector analogies:

king − man + woman = ?

• Frame it as a retrieval problem:

→ Normalize the word embeddings: x_i ← x_i / ‖x_i‖
→ Find the closest vector w.r.t. the ℓ2 distance (for normalized vectors, the largest inner product):

$$x_d = \mathrm{argmax}_i\; (x_c + x_a - x_b)^\top x_i$$

Figure: man, woman and king in a 2D projection (x1, x2); the offset woman − man applied to king points to the answer.
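A compact numpy sketch of this retrieval (hypothetical embedding matrix and vocabulary; the three query words are excluded from the answer, as is standard):

import numpy as np

def analogy(E, vocab, a, b, c):
    # Solve "b - a + c": e.g. a="man", b="king", c="woman" -> ideally "queen".
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    scores = E @ (E[vocab[b]] - E[vocab[a]] + E[vocab[c]])
    for w in (a, b, c):
        scores[vocab[w]] = -np.inf
    words = {i: w for w, i in vocab.items()}
    return words[int(np.argmax(scores))]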

Page 79

Analogies as intrinsic evaluation of word representation

• Semantic analogies:
  • capital-common-countries: Athens : Greece :: Helsinki : Finland
  • currency: Japan : yen :: Sweden : krona
  • family: father : mother :: uncle : aunt

• Syntactic analogies:
  • gram2-opposite: logical : illogical :: clear : unclear
  • gram3-comparative: strong : stronger :: good : better
  • gram5-present-participle: think : thinking :: listen : listening

Page 80

Impact of dimension

Dimension   100   200   300   400
Semantic    73.7  80.8  82.2  82.6
Syntactic   69.6  74.4  75.0  74.8
Total       71.2  76.9  77.8  77.9

Table: Accuracy on the analogy dataset of Mikolov et al. (2013b)

• Take home message:

dimension 300 is good enough for most applications

Page 81

Extensions: GloVe (Pennington et al., 2014)

• GloVe (Global Vectors) is a word2vec-style model trained with a different loss:

$$\min_{X, Y, b} \sum_{i, j \in V} f(C_{ij}) \left(x_i^\top y_j + b_i + b_j - \log C_{ij}\right)^2$$

- the (b_i)_{i∈V} are scalars to learn
- C_{ij}: co-occurrence counts of words i and j in the same context
- f: reweighting function, similar to the discount factor of word2vec:

$$f(C) = \min\left(1,\; (C / C_{\text{cutoff}})^{3/4}\right)$$

English vectors available at nlp.stanford.edu/projects/glove/

Page 82

Extensions: fastText (Bojanowski et al., 2017)

• Represent a word as a bag of character n-grams:

skiing = { ^skiing$, ^ski, skii, kiin, iing, ing$ }

• G_w is the set of n-grams appearing in word w; scoring sums over the n-gram vectors:

$$s(w, c) = \sum_{g \in G_w} g^\top c$$

(The word w itself is included in the set of n-grams.)
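A short sketch of the character n-gram extraction, with boundary markers as in the example above (3- to 6-character n-grams, matching the next slide):

def char_ngrams(word, nmin=3, nmax=6):
    w = "^" + word + "$"
    grams = {w[i:i + n] for n in range(nmin, nmax + 1)
             for i in range(len(w) - n + 1)}
    grams.add(w)          # the full word is kept as its own feature
    return grams

print(sorted(char_ngrams("skiing")))
# includes '^ski', 'skii', 'kiin', 'iing', 'ing$' and '^skiing$'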

• Advantage 1: get word vectors for out-of-vocabulary words using subwords!

• Advantage 2: generalizes well to text with typos and to agglutinative languages

Pre-trained vectors in 90 languages available at www.fasttext.cc

Page 83

Technical details of fastText

• n-grams between 3 and 6 characters

• Hashing to map n-grams to integers in 1 to K

• Same training / sampling procedure as in word2vec

→ Less than 2× slower than word2vec skipgram!

Pre-trained vectors in 90 languages available at www.fasttext.cc

Page 84

Experiments – word analogy (A is to B as C is to ?)

• All models trained on Wikipedia:

                 sg    cbow  ours
Cs  Semantic     25.7  27.6  27.5
    Syntactic    52.8  55.0  77.8
De  Semantic     66.5  66.8  62.3
    Syntactic    44.5  45.0  56.4
En  Semantic     78.5  78.2  77.8
    Syntactic    70.1  69.9  74.9
It  Semantic     52.3  54.7  52.3
    Syntactic    51.5  51.8  62.7

Table: Accuracy of our model and baselines on word analogy tasks for Czech,German, English and Italian. We report results for semantic and syntacticanalogies separately.

Page 85

Experiments – Size of n-grams (German)

Semantic analogies:

       2   3   4   5   6
  2   59  55  56  59  60
  3       60  58  60  62
  4           62  62  63
  5               64  64
  6                   65

Syntactic analogies:

       2   3   4   5   6
  2   45  50  53  54  55
  3       51  55  55  56
  4           54  56  56
  5               56  56
  6                   54

Table: Effect of the range of n-gram sizes on performance (rows: smallest n, columns: largest n).

• Short n-grams (4-5): better for syntax

• Long n-grams (6+): better for semantics

Page 86

Further Extensions

• Position vectors: multiply the input word vectors of cbow by position vectors (Mnih and Kavukcuoglu, 2013):

$$h_C = \frac{1}{|P|} \sum_{p \in P} d_p \odot x_{n+p}$$

where the d_p are learnable position vectors and ⊙ is pointwise multiplication.

• Reminder, regular cbow:

$$h_C = \frac{1}{|P|} \sum_{p \in P} x_{n+p}$$

Page 87

Further Extensions

• Phrase vectors: pre-process the dataset to convert, with some probability, bigrams with high mutual information into a single token (Mikolov et al., 2013b):

New York → New_York

Repeat the process:

New_York University → New_York_University

• Score used to merge two tokens:

$$\text{score}(w_i, w_j) = \frac{\text{count}(w_i w_j) - \delta}{\text{count}(w_i) \times \text{count}(w_j)}$$

where δ is a discount factor that prevents phrases made of infrequent words.
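A toy sketch of this scoring over a tiny corpus (δ and the merge threshold are arbitrary values here, not the ones from the paper):

from collections import Counter

def phrase_scores(tokens, delta=1.0):
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    return {f"{a}_{b}": (c - delta) / (uni[a] * uni[b])
            for (a, b), c in bi.items() if c > delta}

corpus = "i live in new york . i love new york . new york is big .".split()
scores = phrase_scores(corpus)
print(max(scores, key=scores.get))   # new_york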

Models in 150+ languages from Grave et al. (2018) available at www.fasttext.cc

Page 88

Evaluation of these extensions

Model                       Semantic  Syntactic  Total
cbow                        79        73         76
cbow + phrases              82        78         80
cbow + phrases + position   87        82         85

Models trained on Common Crawl (Mikolov et al., 2017)

Page 89

Impact of training data

• Wikipedia: high quality but small
  • 28 languages with more than 100M tokens
  • Hindi: only 39M tokens

• Crawl: noisy but larger and more domains

• Preprocessing: language id / deduplication / tokenization

language     wiki   crawl
German       1.3B   65B
French       1.1B   68B
Japanese     1.0B   92B
Russian      0.8B   102B
Spanish      0.8B   72B
Italian      0.7B   36B
Polish       0.4B   21B
Portuguese   0.4B   35B
Chinese      0.4B   30B
Czech        0.2B   13B

Table: Dataset sizes (number of tokens) for Wikipedia and Crawl.

Page 90

Impact of training data

Model      Dataset       Analogy  Similarity (RW)  QA
fastText   Wiki + news   87       0.52             78.9
fastText   Crawl         85       0.58             79.8

Results from Mikolov et al. (2017). Analogy: accuracy on the Google analogy dataset. Similarity (RW): Spearman rank correlation on the Stanford Rare Word dataset. QA: F1 score on the SQuAD question answering dataset. Pre-trained word vectors were used to initialize the lookup table of the RNN DrQA model from Chen et al. (2017).

Page 91

Large scale word representations: recap

• Learn word representations on large amounts of text (terabytes)

• Same model as text classification:

→ predict context word instead of class

• Scale to large vocabulary: negative sampling

• Exploit Zipf distribution: discard frequent words

• Parallelize stochastic gradient descent: hogwild

Page 92

Language Modeling

Page 93

Statistical language modeling

• Learn a probability distribution over sequences of words:

$$p(w_1, \ldots, w_T) = \prod_{t=1}^{T} p(w_t \mid w_{t-1}, \ldots, w_1)$$

• Different models for the conditional distribution:
  • n-gram models (Katz, 1987; Kneser and Ney, 1995; Goodman, 2001b)
  • maximum entropy (Rosenfeld, 1996)
  • feed-forward neural nets (Bengio et al., 2003)
  • recurrent neural nets (Elman, 1990; Mikolov et al., 2010)

• A model for natural language generation:
  • machine translation (Brown et al., 1993)
  • speech recognition (Bahl et al., 1983)

Page 94

Recurrent neural network language models (Mikolov et al., 2010; Graves, 2013; Chung et al., 2014)

• Assuming h_t ∈ R^d encodes the history w_t, ..., w_1:

$$p(w_{t+1} \mid w_t, \ldots, w_1) = \mathrm{softmax}(O h_t)$$

• Using an Elman recurrent network (Elman, 1990):

$$h_t = \sigma(L x_t + R h_{t-1})$$

• Computationally intensive:
  • computing the softmax: O h_t
  • updating the hidden state: R h_t

Figure: the recurrent network, with input x_t, hidden state h_t, output y_t, and parameter matrices L, R and O.
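A minimal numpy sketch of one step of this model, with toy dimensions (x_t stands in for the embedding of word w_t; all matrices are hypothetical random initializations):

import numpy as np

d, V = 128, 10000                  # hidden size, vocabulary size (toy values)
L = 0.01 * np.random.randn(d, d)   # input projection
R = 0.01 * np.random.randn(d, d)   # recurrent matrix
O = 0.01 * np.random.randn(V, d)   # output projection

def step(x_t, h_prev):
    h_t = 1.0 / (1.0 + np.exp(-(L @ x_t + R @ h_prev)))  # h_t = sigma(Lx_t + Rh_{t-1})
    z = O @ h_t                                          # logits: the O(V) bottleneck
    p = np.exp(z - z.max()); p /= p.sum()                # softmax(O h_t)
    return h_t, p

h, p = step(np.random.randn(d), np.zeros(d))
print(p.shape, round(float(p.sum()), 6))                 # (10000,) 1.0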

Page 95

Class-based hierarchical softmax (Goodman, 2001a; Morin and Bengio, 2005)

Figure: Class-based hierarchical softmax

• Assign each word w to a unique class C(w).

• If each class contains $\sqrt{k}$ words, the cost is reduced from O(k) to O($\sqrt{k}$)

• Other hierarchies, based on word similarity or word frequency.

• How to batch?

Page 96

Negative sampling (Jozefowicz et al., 2016)

• Variant of the word2vec negative sampling

• We want to approximate

$$\frac{\exp(w_k^\top h)}{\sum_{i=1}^{V} \exp(w_i^\top h)}$$

• We sample n elements N without replacement from {1, ..., V}:

$$\frac{\exp(w_k^\top h)}{\sum_{i \in N \cup \{k\}} \exp(w_i^\top h)}$$

• Efficient to batch: N ∪ {k_1, ..., k_B}

• What about test time?

Page 97

GPU computational model (Grave et al., 2016)

Figure: GPU timings for multiplying two matrices, as a function of the number of columns of the second matrix.

Computational model for a softmax over k elements, with batch size n:

$$g(k, n) = c_1 + \lambda \max(0,\; kn - c_0)$$

Page 98

GPU computational model (Grave et al., 2016)

Figure: GPU timings for multiplying two matrices, as a function of the number of columns of the second matrix.

Take-home message: small clusters (k < k_0) are a bad idea.

Page 99

Word distribution follows a Zipf law (Grave et al., 2016)

Figure: Left: frequency of words versus rank order (log-log scale). Right: cumulative distribution function (German Europarl).

A small part of the vocabulary accounts for most word occurrences:

• German Europarl: 1,400 words cover 80% of occurrences

Computation should be (very) efficient for frequent words.

Page 100

Two-cluster example (Grave et al., 2016)

Figure: Adaptive softmax

Partition the vocabulary into a head (size k_h) and a tail (size k_t).

Overall cost:

$$g(k_h + 1, n) + g(k_t, m_t n),$$

where m_t is the probability mass of the tail.
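A sketch of minimizing this cost under the model g above, with made-up constants c_0, c_1, λ and a Zipf-like frequency profile (all hypothetical; the paper fits these to actual GPU timings):

import numpy as np

def g(k, n, c0=2500.0, c1=0.5, lam=1e-5):
    # Hypothetical cost of a softmax over k elements with batch size n.
    return c1 + lam * max(0.0, k * n - c0)

V, n = 250000, 64
freq = 1.0 / np.arange(1, V + 1)       # Zipf-like word frequencies
cdf = np.cumsum(freq) / freq.sum()     # mass covered by the k_h most frequent words

def two_cluster_cost(k_head):
    m_tail = 1.0 - cdf[k_head - 1]     # only tail tokens pay for the second softmax
    return g(k_head + 1, n) + g(V - k_head, m_tail * n)

ks = np.arange(100, 100001, 100)
best = ks[np.argmin([two_cluster_cost(int(k)) for k in ks])]
print(best, two_cluster_cost(int(best)))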

Page 101

Two-cluster example (Grave et al., 2016)

Figure: Cost of the two-cluster hierarchical softmax on europarl-de and europarl-es, as a function of k_h, assuming our simple computational model. We observe a 5× speedup over the full softmax.

Page 102

Generalization to multiple clusters (Grave et al., 2016)

Figure: Adaptive softmax

We can generalize to multiple “tail” clusters.

We find the optimal assignment by minimizing the computational cost,using dynamic programming.

Page 103

Generalization to multiple clusters (Grave et al., 2016)

Figure: Optimal computation time on europarl-de and europarl-es, as a function of the number of clusters. Not much gain after 5 clusters.

A smaller number of clusters gives a better approximation of the full softmax, and better perplexity.

Page 104

Experiments – Finnish Europarl (Grave et al., 2016)

Figure: Comparison to baselines on Finnish Europarl (vocabulary size 250,000). We use an RNN with 512 LSTM units.

Page 105

Experiments – Billion Word benchmark (Grave et al., 2016)

Model                                                   Test PPL
Interpolated Kneser-Ney 5-gram (Chelba et al., 2013)    67.6
RNN-2048 + BlackOut sampling (Ji et al., 2015)          68.3
Sparse NMF (Shazeer et al., 2015)                       52.9
RNN-1024 + MaxEnt 9-gram (Chelba et al., 2013)          51.3
LSTM-2048-512 (Jozefowicz et al., 2016)                 43.7
2-layer LSTM-8192-1024 (Jozefowicz et al., 2016)        30.0
Ours (LSTM-2048)                                        43.9
Ours (2-layer LSTM-2048)                                39.8

Table: Comparison to the state of the art on the Billion Word benchmark. Our results are obtained after 5 epochs.

Adaptive computation for recurrent neural networks

Goal: model sequences with different time scales (text, music, video).

Adapt the amount of computation at each time step.

Character-level language modeling:

• the prime...

• the prime mini... (here the continuation “ster” is nearly certain, so little computation is needed)

Related work:

• Fixed schedule (Mozer, 1993; Koutnik et al., 2014; Mikolov et al., 2014; Bojanowski et al., 2015)

• Dynamic schedule (Schmidhuber, 1992; Chung et al., 2016; Graves, 2016)

Adaptive computation for recurrent neural networks

Figure: Two time steps of the VCRNN. At each step $t$, the scheduler takes the current hidden vector $h_{t-1}$ and the input vector $x_t$ and decides on a number of dimensions $d$ to use. The unit then uses the first $d$ dimensions of $h_{t-1}$ and $x_t$ to compute the first $d$ elements of the new hidden state $h_t$, and carries the remaining $D - d$ dimensions over from $h_{t-1}$.

Technical details

• Regular update:

  $h_t = g(x_t, h_{t-1}) = \sigma(L x_t + R h_{t-1})$

• Update:

  $h_t = m_t \odot g(m_t \odot x_t, m_t \odot h_{t-1}) + (1 - m_t) \odot h_{t-1}$

• Scheduler:

  • From input $x_t$ and hidden $h_{t-1}$, predict the amount of computation in $[0, 1]$

  • Gating applied to the input $x_t$ and hidden $h_{t-1}$

• For learning, replace the hard mask $m_t$ by a soft mask (differentiable)

• Annealing of the soft mask

• $\ell_1$-regularization of the scheduler
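
A minimal numpy sketch of one VCRNN step with the soft mask. For simplicity it assumes input and hidden vectors share the dimension D; the exact mask shape, the sharpness schedule, and all names here are illustrative rather than the paper's parametrization.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def soft_mask(d, D, sharpness=20.0):
        # Differentiable relaxation of "use the first d * D of D dimensions":
        # close to 1 below the cutoff, close to 0 above it. Annealing
        # corresponds to increasing the sharpness during training.
        return sigmoid(sharpness * (d * D - np.arange(D)))

    def vcrnn_step(x_t, h_prev, L, R, w_s, b_s):
        D = h_prev.shape[0]
        # Scheduler: from x_t and h_{t-1}, predict the computation
        # fraction d in [0, 1].
        d = sigmoid(w_s @ np.concatenate([x_t, h_prev]) + b_s)
        m = soft_mask(d, D)
        # Partial Elman update on the masked coordinates...
        g_t = sigmoid(L @ (m * x_t) + R @ (m * h_prev))
        # ...carrying the remaining dimensions over from h_{t-1}.
        return m * g_t + (1.0 - m) * h_prev

    # Example: one step with random parameters.
    D = 8
    rng = np.random.default_rng(0)
    h1 = vcrnn_step(rng.normal(size=D), np.zeros(D),
                    rng.normal(size=(D, D)), rng.normal(size=(D, D)),
                    rng.normal(size=2 * D), 0.0)

At test time the soft mask can be replaced by a hard one, so that only the first $d$ dimensions are actually computed.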

Experiments – Qualitative results

Figure: Per-character computation used by the VCRNN. Top: English (“days everyone is looking for a way to get viewers more exci…”). Middle: Czech (“…na konci zdlouhavého a namáhavého procesu. Návrat do této…”). Bottom: German (“die deutlich über dem Anteil des Luftverkehrs liegt, der eb…”).

Experiments – Quantitative results

Figure: Bits per character for different computational loads (hidden dimension from 100 to 500) on the Europarl Czech (left) and German (right) datasets, comparing the Elman baseline with the Guide VCRNN and Learn VCRNN variants.

Large scale language modeling: recap

• Similar techniques to scale to a large vocabulary

→ but they need to be adapted to the GPU computation model

• Adaptive computation: fewer resources for easier examples

Large scale learning: recap

Scaling to

• large data: stochastic gradient descent

• multiple cores: Hogwild!

• many classes: hierarchical softmax

• many classes: sampling-based techniques

• large models: low-rank parametrization

• large models: parameter quantization

• large models: vocabulary pruning

References I

Bahl, L. R., Jelinek, F., and Mercer, R. L. (1983). A maximum likelihood approach to continuous speech recognition. PAMI.

Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C. (2003). A neural probabilistic language model. JMLR.

Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Bojanowski, P., Joulin, A., and Mikolov, T. (2015). Alternative structures for character-level RNNs. arXiv preprint arXiv:1511.06303.

Brown, P. F., Pietra, V. J. D., Pietra, S. A. D., and Mercer, R. L. (1993). The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics.

Chelba, C., Mikolov, T., Schuster, M., Ge, Q., Brants, T., Koehn, P., and Robinson, T. (2013). One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005.

References II

Chung, J., Ahn, S., and Bengio, Y. (2016). Hierarchical multiscale recurrent neural networks. arXiv preprint arXiv:1609.01704.

Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.

Collobert, R. and Weston, J. (2008). A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proc. ICML.

Conneau, A., Schwenk, H., Barrault, L., and Lecun, Y. (2016). Very deep convolutional networks for natural language processing. arXiv preprint arXiv:1606.01781.

Elman, J. L. (1990). Finding structure in time. Cognitive Science.

Firth, J. R. (1957). Papers in linguistics, 1934-1951. Oxford University Press.

Goodman, J. (2001a). Classes for fast maximum entropy training. In ICASSP.

References III

Goodman, J. T. (2001b). A bit of progress in language modeling. Computer Speech & Language.

Grave, E., Bojanowski, P., Gupta, P., Joulin, A., and Mikolov, T. (2018). Learning word vectors for 157 languages. arXiv preprint arXiv:1802.06893.

Grave, E., Joulin, A., Cisse, M., Grangier, D., and Jegou, H. (2016). Efficient softmax approximation for GPUs. arXiv preprint arXiv:1609.04309.

Graves, A. (2013). Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850.

Graves, A. (2016). Adaptive computation time for recurrent neural networks. arXiv preprint arXiv:1603.08983.

Ji, S., Vishwanathan, S., Satish, N., Anderson, M. J., and Dubey, P. (2015). BlackOut: Speeding up recurrent neural network language models with very large vocabularies. arXiv preprint arXiv:1511.06909.

References IV

Jozefowicz, R., Vinyals, O., Schuster, M., Shazeer, N., and Wu, Y. (2016). Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410.

Katz, S. M. (1987). Estimation of probabilities from sparse data for the language model component of a speech recognizer. ICASSP.

Kneser, R. and Ney, H. (1995). Improved backing-off for m-gram language modeling. In ICASSP.

Koutnik, J., Greff, K., Gomez, F., and Schmidhuber, J. (2014). A clockwork RNN. arXiv preprint arXiv:1402.3511.

Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013a). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Mikolov, T., Grave, E., Bojanowski, P., Puhrsch, C., and Joulin, A. (2017). Advances in pre-training distributed word representations. arXiv preprint arXiv:1712.09405.

References V

Mikolov, T., Joulin, A., Chopra, S., Mathieu, M., and Ranzato, M. (2014). Learning longer memory in recurrent neural networks. arXiv preprint arXiv:1412.7753.

Mikolov, T., Karafiat, M., Burget, L., Cernocky, J., and Khudanpur, S. (2010). Recurrent neural network based language model. In INTERSPEECH.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013b). Distributed representations of words and phrases and their compositionality. In Adv. NIPS.

Mnih, A. and Kavukcuoglu, K. (2013). Learning word embeddings efficiently with noise-contrastive estimation. In Advances in Neural Information Processing Systems, pages 2265–2273.

Morin, F. and Bengio, Y. (2005). Hierarchical probabilistic neural network language model. In AISTATS.

Mozer, M. C. (1993). Induction of multiscale temporal structure. In Advances in Neural Information Processing Systems.

References VI

Pennington, J., Socher, R., and Manning, C. (2014). GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Rosenfeld, R. (1996). A maximum entropy approach to adaptive statistical language modeling. Computer Speech and Language.

Schmidhuber, J. (1992). Learning complex, extended sequences using the principle of history compression. Neural Computation.

Shazeer, N., Pelemans, J., and Chelba, C. (2015). Sparse non-negative matrix language modeling for skip-grams. In Proceedings of Interspeech, pages 1428–1432.

Xiao, Y. and Cho, K. (2016). Efficient character-level document classification by combining convolution and recurrent layers. arXiv preprint arXiv:1602.00367.

Zhang, X., Zhao, J., and LeCun, Y. (2015a). Character-level convolutional networks for text classification. In Adv. NIPS.

References VII

Zhang, X., Zhao, J., and LeCun, Y. (2015b). Character-level convolutional networks for text classification. In Adv. NIPS.