
A Neural Probabilistic Language Model 2014-12-16 Keren Ye







Page 1: A Neural Probabilistic Language Model 2014-12-16 Keren Ye

A Neural Probabilistic Language Model

2014-12-16 Keren Ye



• N-gram Models

• Fighting the Curse of Dimensionality

• A Neural Probabilistic Language Model

• Continuous Bag of Words (Word2vec)


n-gram models

• Construct tables of conditional probabilities for the next word

• One table entry per context, i.e. per combination of the last n-1 words

$$\hat{P}(w_t \mid w_1^{t-1}) \approx \hat{P}(w_t \mid w_{t-n+1}^{t-1})$$


n-gram models

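• e.g. “I like playing basketball”

– Unigram (1-gram)

$$\hat{P}(\text{basketball} \mid \text{I}, \text{like}, \text{playing}) \approx \hat{P}(\text{basketball})$$

– Bigram (2-gram)

$$\hat{P}(\text{basketball} \mid \text{I}, \text{like}, \text{playing}) \approx \hat{P}(\text{basketball} \mid \text{playing})$$

– Trigram (3-gram)

$$\hat{P}(\text{basketball} \mid \text{I}, \text{like}, \text{playing}) \approx \hat{P}(\text{basketball} \mid \text{like}, \text{playing})$$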
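The unigram, bigram and trigram approximations on this slide can be illustrated with a small count-based estimator. This is only a sketch: the function name `ngram_prob` and the toy corpus are made up for illustration, and it uses plain maximum-likelihood counts with no smoothing or back-off.

```python
from collections import Counter

def ngram_prob(tokens, context, word):
    """Maximum-likelihood estimate of P(word | context) from raw counts."""
    n = len(context) + 1
    last_start = len(tokens) - n + 1
    # Count every n-gram, and every (n-1)-word context that is followed by a word.
    ngrams = Counter(tuple(tokens[i:i + n]) for i in range(last_start))
    contexts = Counter(tuple(tokens[i:i + n - 1]) for i in range(last_start))
    denom = contexts[tuple(context)]
    if denom == 0:
        return 0.0  # unseen context: nothing to estimate without smoothing
    return ngrams[tuple(context) + (word,)] / denom

# Hypothetical toy corpus, chosen only to make the counts easy to follow.
corpus = "I like playing basketball and I like playing chess".split()
unigram = ngram_prob(corpus, [], "basketball")
bigram = ngram_prob(corpus, ["playing"], "basketball")
trigram = ngram_prob(corpus, ["like", "playing"], "basketball")
print(unigram, bigram, trigram)
```

A context that never appears in the corpus gets probability 0 here, which is exactly the weakness the following slides address.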


n-gram models

• Disadvantages

– They do not take into account contexts farther than 1 or 2 words

– They do not take into account the similarity between words

• e.g. “The cat is walking in the bedroom” (training corpus)

• “A dog was running in a room” (unseen at training time: how likely is it?)


n-gram models

• Disadvantages

– Curse of Dimensionality
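To make the curse of dimensionality concrete, the sketch below counts the free parameters of a full joint table over n consecutive words. The function name is hypothetical, and the sizes (a 100,000-word vocabulary, 10 consecutive words) are illustrative values rather than numbers taken from these slides.

```python
def ngram_free_parameters(vocab_size, n):
    """Free parameters of a full joint probability table over n consecutive words.

    There are vocab_size ** n possible sequences; the probabilities must sum
    to 1, which removes one degree of freedom.
    """
    return vocab_size ** n - 1

# Illustrative sizes only.
print(ngram_free_parameters(100_000, 10))
```

The count grows exponentially in n, so most possible word sequences are never observed in any realistic training corpus.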


Fighting the Curse of Dimensionality

• Associate with each word in the vocabulary a distributed word feature vector (a real-valued vector in $\mathbb{R}^m$)

• Express the joint probability function of word sequences in terms of the feature vectors of these words in the sequence

• Learn simultaneously the word feature vectors and the parameters of that probability function



Fighting the Curse of Dimensionality

• Word feature vectors

– Each word is associated with a point in a vector space

– The number of features (e.g. m = 30, 60 or 100 in the experiments) is much smaller than the size of the vocabulary (e.g. 200,000 words)


Fighting the Curse of Dimensionality

• Probability function

– In the experiments, a multi-layer neural network predicts the next word given the previous ones

– This function has parameters that can be iteratively tuned in order to maximize the log-likelihood of the training data


Fighting the Curse of Dimensionality

• Why does it work?

– If we knew that “dog” and “cat” played similar roles (semantically and syntactically), and similarly for (the, a), (bedroom, room), (is, was), (running, walking), we could naturally generalize from

• The cat is walking in the bedroom

– to

• A dog was running in a room

– and likewise to

• The cat is running in a room

• A dog is walking in a bedroom

• …


Fighting the Curse of Dimensionality


– Neural Network Language Model


A Neural Probabilistic Language Model

• Notation

– The training set is a sequence $w_1 \dots w_T$ of words $w_t \in V$, where the vocabulary $V$ is a large but finite set

– The objective is to learn a good model $f$ as below, in the sense that it gives high out-of-sample likelihood

$$f(w_t, \dots, w_{t-n+1}) = \hat{P}(w_t \mid w_1^{t-1})$$

– The only constraint on the model is that for any choice of $w_1^{t-1}$, the sum over the vocabulary equals 1

$$\sum_{i=1}^{|V|} f(i, w_{t-1}, \dots, w_{t-n+1}) = 1$$


A Neural Probabilistic Language Model

• Objective function

– Training is achieved by looking for $\theta$ that maximizes the training corpus penalized log-likelihood, where $R(\theta)$ is a regularization term

$$L = \frac{1}{T} \sum_{t} \log f(w_t, w_{t-1}, \dots, w_{t-n+1}; \theta) + R(\theta)$$
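The penalized log-likelihood on this slide can be sketched numerically as follows. The slide does not give the form of the regularization term, so the L2 weight-decay penalty used here is an assumption, as are the function name and the `weight_decay` parameter.

```python
import math

def penalized_log_likelihood(target_probs, weights, weight_decay=1e-4):
    """Average log-likelihood of the observed next words plus a regularizer.

    target_probs: model probability assigned to each observed word w_t.
    weights: flat list of the weights being regularized.
    The regularizer is ASSUMED to be an L2 weight-decay penalty; it enters
    with a negative sign because the objective is maximized.
    """
    avg_log_lik = sum(math.log(p) for p in target_probs) / len(target_probs)
    regularizer = -weight_decay * sum(w * w for w in weights)
    return avg_log_lik + regularizer

# Hypothetical probabilities and weights, only to show the call.
print(penalized_log_likelihood([0.5, 0.25], [1.0, -2.0], weight_decay=0.1))
```

A perfect model (probability 1 for every observed word) with all-zero weights reaches the maximum value of 0 under this assumed penalty.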




A Neural Probabilistic Language Model

• Model

– We decompose the function $f(w_t, \dots, w_{t-n+1}) = \hat{P}(w_t \mid w_1^{t-1})$ in two parts

• A mapping $C$ from any element $i$ of $V$ to a real vector $C(i) \in \mathbb{R}^m$. It represents the distributed feature vectors associated with each word in the vocabulary

• The probability function over words, expressed with $C$: a function $g$ maps an input sequence of feature vectors for words in context, $(C(w_{t-n+1}), \dots, C(w_{t-1}))$, to a conditional probability distribution over words in $V$ for the next word. The output of $g$ is a vector whose i-th element estimates the probability $\hat{P}(w_t = i \mid w_1^{t-1})$

$$f(i, w_{t-1}, \dots, w_{t-n+1}) = g(i, C(w_{t-1}), \dots, C(w_{t-n+1}))$$


A Neural Probabilistic Language Model

• Model details (two hidden layers)

– The shared word features layer C, which has no non-linearity (it would not add anything useful)

– The ordinary hyperbolic tangent hidden layer


A Neural Probabilistic Language Model

• Model details (formal description)

– The neural network computes the following function, with a softmax output layer, which guarantees positive probabilities summing to 1

$$\hat{P}(w_t \mid w_{t-1}, \dots, w_{t-n+1}) = \frac{e^{y_{w_t}}}{\sum_i e^{y_i}}$$
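The softmax output layer can be sketched in a few lines. Subtracting the largest score before exponentiating is a standard numerical safeguard added here; it is not mentioned on the slide and does not change the result. The example scores are made up.

```python
import math

def softmax(scores):
    """Turn unnormalized log-probabilities y_i into probabilities."""
    shift = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(y - shift) for y in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for a 3-word vocabulary.
probs = softmax([2.0, 1.0, 0.1])
print(probs)
```

Every output is strictly positive and the outputs sum to 1, which is the guarantee the slide refers to.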


A Neural Probabilistic Language Model

• Model details (formal description)

– The $y_i$ are the unnormalized log-probabilities for each output word $i$, computed as follows, with parameters $b$, $W$, $U$, $d$ and $H$

$$y = b + Wx + U \tanh(d + Hx)$$

• Where the hyperbolic tangent tanh is applied element by element, and $W$ is optionally zero (no direct connections)

• And $x$ is the word features layer activation vector, which is the concatenation of the input word features from the matrix $C$

$$x = (C(w_{t-1}), C(w_{t-2}), \dots, C(w_{t-n+1}))$$
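The forward computation on this slide (look up and concatenate the word features, apply the tanh hidden layer, then the softmax output) can be sketched in plain Python. The helper names, the toy sizes and the random initialization range are all assumptions made for illustration; this is not the original implementation.

```python
import math
import random

def forward(context, C, H, d, U, b, W=None):
    """Return P(w_t = i | context) for every word index i.

    context lists the n-1 previous word indices, most recent first.
    """
    # x: concatenation of the context words' feature vectors, length (n-1)*m
    x = [value for word in context for value in C[word]]
    # hidden layer: a = tanh(d + Hx), length h
    a = [math.tanh(d_j + sum(h_jk * x_k for h_jk, x_k in zip(row, x)))
         for row, d_j in zip(H, d)]
    # output scores: y = b + Ua, length |V|
    y = [b_i + sum(u_ij * a_j for u_ij, a_j in zip(row, a))
         for row, b_i in zip(U, b)]
    if W is not None:  # optional direct connections add Wx
        y = [y_i + sum(w_ik * x_k for w_ik, x_k in zip(row, x))
             for y_i, row in zip(y, W)]
    # softmax over the vocabulary
    top = max(y)
    exps = [math.exp(y_i - top) for y_i in y]
    total = sum(exps)
    return [e / total for e in exps]

def rand_matrix(rows, cols, rng):
    """Hypothetical helper: small random weights in [-0.1, 0.1]."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
V, m, h, n = 6, 3, 4, 3  # toy sizes, chosen only for illustration
C = rand_matrix(V, m, rng)
H = rand_matrix(h, (n - 1) * m, rng)
U = rand_matrix(V, h, rng)
d = [0.0] * h
b = [0.0] * V
probs = forward([2, 5], C, H, d, U, b)  # W omitted: no direct connections
print(probs)
```

Passing a `W` matrix of shape |V| × (n-1)m switches the direct connections on; leaving it as `None` corresponds to the W = 0 setting.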


A Neural Probabilistic Language Model

Parameter   Role                                                 Dimensions

b           Output biases                                        |V|

d           Hidden layer biases                                  h

W           Direct word-features-to-output weights (none here)   0

U           Hidden-to-output weights                             |V| × h matrix

H           Word-features-to-hidden weights                      h × (n-1)m matrix

C           Word features                                        |V| × m matrix

$$y = b + Wx + U \tanh(d + Hx)$$
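The shapes in the parameter table add up to a total parameter count, sketched below. The function name and the example sizes are hypothetical. The slide sets W to 0; the |V| × (n-1)m shape used when `direct=True` is the shape of the direct connections in the original paper by Bengio et al., not something stated on this slide.

```python
def nplm_parameter_count(V, m, h, n, direct=False):
    """Count free parameters from the shapes of b, d, U, H, C (and W)."""
    count = V                      # b: output biases
    count += h                     # d: hidden layer biases
    count += V * h                 # U: hidden-to-output weights
    count += h * (n - 1) * m       # H: word-features-to-hidden weights
    count += V * m                 # C: word features
    if direct:
        count += V * (n - 1) * m   # W: direct connections, when used
    return count

# Illustrative sizes only.
print(nplm_parameter_count(V=17_000, m=30, h=50, n=5))
print(nplm_parameter_count(V=17_000, m=30, h=50, n=5, direct=True))
```

With direct connections the total simplifies to |V|(1 + nm + h) + h(1 + (n-1)m), so the count grows linearly in both the vocabulary size and the context length rather than exponentially.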


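A Neural Probabilistic Language Model

• Stochastic gradient ascent

– After presenting the t-th word of the training corpus, the parameters $\theta = (b, d, W, U, H, C)$ are updated with learning rate $\varepsilon$

$$\theta \leftarrow \theta + \varepsilon \frac{\partial \log \hat{P}(w_t \mid w_{t-1}, \dots, w_{t-n+1})}{\partial \theta}$$

– Note that a large fraction of the parameters need not be updated or visited after each example: the word features C(j) of all words j that do not occur in the input window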


A Neural Probabilistic Language Model

• Parallel Implementation

– Data-parallel processing

• With synchronization commands: slow

• Without locks: the noise from overwritten updates seems to be very small and did not apparently slow down training

– Parameter-parallel processing

• Parallelize across the parameters


Continuous Bag of Words (Word2vec)

• Bag of words

– A traditional solution to the curse of dimensionality






Continuous Bag of Words (Word2vec)

• Continuous Bag of Words


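Continuous Bag of Words (Word2vec)

• Differences from the neural probabilistic language model

– Projection layer

• Context word vectors are summed rather than concatenated

• The order of the context words is therefore ignored

– Hidden layer

• The tanh hidden layer is removed

– Hierarchical softmax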
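The difference between the two projection layers can be shown with a small sketch. The function names and the 2-dimensional word vectors are made up for illustration.

```python
def concat_projection(vectors):
    """Concatenate the context vectors: each position keeps its own slots."""
    return [value for vec in vectors for value in vec]

def sum_projection(vectors):
    """Sum the context vectors element by element: position is lost."""
    return [sum(values) for values in zip(*vectors)]

# Hypothetical 2-dimensional word vectors.
the, cat = [1.0, 0.0], [0.0, 2.0]
print(concat_projection([the, cat]), concat_projection([cat, the]))
print(sum_projection([the, cat]), sum_projection([cat, the]))
```

Swapping the two context words changes the concatenated input but leaves the summed input unchanged, which is why the summed projection ignores word order. The summed vector also stays at length m however many context words are used, whereas the concatenation has length (n-1)m.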

