Language Model (D3L1 Deep Learning for Speech and Language UPC 2017)
TRANSCRIPT
[course site]
Day 3 Lecture 1
Language Model
Marta R. Costa-jussà
Previous concepts from this course
● Word embeddings
● Feed-forward network and softmax
● Recurrent neural network (handle variable-length sequences)
What is the most probable sentence?
Two birds are flying
Two beards are flying
Probability of a sentence
A language model finds the probability of a sentence
● Given a sentence (w1, w2, …, wT)
● What is p(w1, w2, …, wT)?
An n-gram language model
Chain rule probability and Markov simplifying assumption
p(w1, w2, …, wT) = p(wT | w(T-1), w(T-2), …, w1) · p(w(T-1) | w(T-2), …, w1) · … · p(w1)

Markov simplifying assumption: the current word depends only on the n-1 previous words. For a first-order (bigram) model:

p(wt | w(t-1), w(t-2), …, w1) ≈ p(wt | w(t-1))
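The factorization with a bigram Markov assumption can be sketched in Python; the toy two-sentence corpus below is illustrative, not part of the lecture:

```python
from collections import Counter

# Toy corpus (illustrative). Probabilities are maximum-likelihood
# estimates: p(b | a) = count(a, b) / count(a).
corpus = [["<s>", "two", "birds", "are", "flying", "</s>"],
          ["<s>", "two", "planes", "are", "flying", "</s>"]]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((a, b) for sent in corpus for a, b in zip(sent, sent[1:]))

def bigram_prob(sentence):
    """p(w1..wT) under the first-order Markov assumption."""
    p = 1.0
    for a, b in zip(sentence, sentence[1:]):
        p *= bigrams[(a, b)] / unigrams[a]
    return p

print(bigram_prob(["<s>", "two", "birds", "are", "flying", "</s>"]))  # 0.5
```

Only the bigram "two birds" vs. "two planes" is ambiguous in this corpus, so each sentence gets probability 0.5.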
Objective
An n-gram-based language model
Any issues with the above model?
Some examples: "<s> que la fuerza te acompañe </s>" (= "may the force be with you")

Bigrams and trigrams like:
fuerza te
la fuerza te
fuerza te acompañe
te acompañe </s>
do not appear in the large corpus of El Periodico (40 M words), BUT THE PROBABILITY OF THE SENTENCE SHOULD NOT BE ZERO!
Sparse counts are a big problem
● Backing off avoids zero probabilities
Sparse counts are a big problem
● Smoothing avoids zero probabilities
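As a concrete instance, add-one (Laplace) smoothing is one of the simplest smoothing schemes; the tiny corpus and vocabulary here are made up for illustration, not the El Periodico data:

```python
from collections import Counter

# Illustrative one-sentence corpus and a small closed vocabulary.
corpus = [["<s>", "la", "fuerza", "</s>"]]
vocab = {"<s>", "la", "fuerza", "te", "acompañe", "</s>"}

unigrams = Counter(w for s in corpus for w in s)
bigrams = Counter((a, b) for s in corpus for a, b in zip(s, s[1:]))

def smoothed_bigram_prob(prev, word, V=len(vocab)):
    # Add 1 to every count: unseen bigrams get a small but
    # nonzero probability instead of zero.
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

# The unseen bigram "fuerza te" no longer has zero probability:
print(smoothed_bigram_prob("fuerza", "te"))  # 1/7
```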
Any other issue?
Lack of generalization
A neural language model
To generalize to unseen n-grams
A neural language model
Find a function that takes n-1 words as input and returns the conditional probability of the next one
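A minimal numpy sketch of such a function, following the architecture described on the next slides (embedding, concatenation, nonlinear projection, softmax). The sizes K, d, H and the random untrained parameters are illustrative assumptions, not the lecture's actual model:

```python
import numpy as np

rng = np.random.default_rng(0)
K, d, n_minus_1, H = 10, 4, 2, 8   # vocab size, embed dim, context, hidden

E = rng.normal(size=(K, d))              # word embedding matrix
W = rng.normal(size=(n_minus_1 * d, H))  # transformation layer weights
b = np.zeros(H)
U = rng.normal(size=(H, K))              # output projection
c = np.zeros(K)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def next_word_distribution(context_ids):
    """p(w_t | previous n-1 words): one forward pass."""
    x = np.concatenate([E[i] for i in context_ids])  # context vector
    h = np.tanh(x @ W + b)                           # nonlinear projection
    return softmax(h @ U + c)                        # distribution over K words

p = next_word_distribution([3, 7])  # two made-up word ids
print(p.shape)  # (10,)
```

The output is a proper probability distribution: K nonnegative entries summing to one.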
Architecture: neural language model
w_(t-1), w_(t-2), …, w_(t-n)
Figure: K. Cho, DL4MT course, 2015
Figure: K. Cho, DL4MT course, 2015
Architecture: representation of input words
Our goal is to build in as little prior knowledge as possible.
Step 1: One-hot encoding
Natural language words can be one-hot encoded as vectors of dimensionality equal to the size of the dictionary (K = |V|).

Word      One-hot encoding
economic  000010...
growth    001000...
has       100000...
slowed    000001...
From previous lectures
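Step 1 can be sketched in a few lines; the four-word dictionary is the slide's example, and the index-order positions of the ones are an illustrative choice:

```python
import numpy as np

# Toy dictionary from the slide; positions of the 1s are arbitrary
# (here: the word's index in the list).
vocab = ["economic", "growth", "has", "slowed"]
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word, K=len(vocab)):
    v = np.zeros(K)
    v[index[word]] = 1.0
    return v

print(one_hot("growth"))  # [0. 1. 0. 0.]
```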
Architecture: continuous word representation
input vectors are multiplied by a weight matrix (E) to obtain continuous vectors
this weight matrix (E) is also called the word embedding and should reflect the meaning of a word
Figure: K. Cho, DL4MT course, 2015
Architecture: context vector
we obtain the context vector by concatenating the continuous representations of the input words
context vector
Figure: K. Cho, DL4MT course, 2015
Architecture: nonlinear projection
the context vector is fed through one or several transformation layers (which extract features);
here we use a single simple transformation layer, with W and b as parameters
Figure: K. Cho, DL4MT course, 2015
Architecture: output probability distribution
The LM output is a categorical distribution: the function returns a K-dimensional vector whose i-th component is the probability that the next word is the i-th word in the vocabulary.
Figure: K. Cho, DL4MT course, 2015
Why does this model generalize to unseen events?
Further generalization comes from embeddings
two
three
Recurrent Language Model
To further generalize to unseen n-grams
A neural language model
Still assumes the n-th order Markov property: it looks only at the n-1 past words.
In France , there are around 66 million people and they speak French.
How can we model variable-length input?
How can we model variable-length input?
We directly model the original conditional probabilities
Via Recursion:
summarizes the history from w1 to w(t-1)
Initial condition
Recursion
The RNN is capable of summarizing a variable-length input sequence (w) into a memory state (h)
Example
p(Mary, buys, the, apple)

(1) Initialization: h0 = 0 → p(Mary) = g(h0)
(2) Recursion:
    (a) h1 = f(h0, Mary) → p(buys | Mary) = g(h1)
    (b) h2 = f(h1, buys) → p(the | Mary, buys) = g(h2)
    (c) h3 = f(h2, the) → p(apple | Mary, buys, the) = g(h3)
(3) Output: p(Mary, buys, the, apple) = g(h0) g(h1) g(h2) g(h3)

It works for any number of context words: READ, UPDATE, PREDICT
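The READ, UPDATE, PREDICT loop can be sketched as below. The parameters are random and untrained, the four-word vocabulary is the slide's example, and the naive transition h_t = tanh(E[w] + h·U) is an assumption matching the simple transition function described later in the deck:

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = ["Mary", "buys", "the", "apple"]
K, H = len(vocab), 8
E = rng.normal(size=(K, H))   # input weights (READ the current word)
U = rng.normal(size=(H, H))   # transition weights (UPDATE the state)
V = rng.normal(size=(H, K))   # output weights (PREDICT the next word)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sentence_prob(words):
    h = np.zeros(H)                               # initial condition h0 = 0
    p = softmax(h @ V)[vocab.index(words[0])]     # p(w1) = g(h0)
    for prev, nxt in zip(words, words[1:]):
        h = np.tanh(E[vocab.index(prev)] + h @ U) # h_t = f(h_(t-1), w_t)
        p *= softmax(h @ V)[vocab.index(nxt)]     # p(w_(t+1) | ...) = g(h_t)
    return p

prob = sentence_prob(["Mary", "buys", "the", "apple"])  # a value in (0, 1)
```

Each g(h) is a distribution over the vocabulary; the sentence probability is the product of the entries picked out at each step, for any sentence length.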
A recurrent neural language model
Can model the probability of a variable-length input sequence
conditional probability that we want to compute
A recurrent neural language model
what we need
(1) Transition function
(2) Output function
(Naive) Transition function
Inputs: one-hot vector + hidden state
Parameters: input weight matrix + transition weight matrix + bias vector (b)
(Naive) Transition function
Continuous-space representation of the word
Linear transformation of the previous hidden state
Nonlinear transformation
Output function
Input: hidden state (ht)
Parameters: output matrix + bias vector (c)
Output function
This summary vector is affine-transformed and then passed through a softmax nonlinearity to compute the conditional probability of each word.
Measure to evaluate
Perplexity
Perplexity: a measure to evaluate language modeling
Perplexity measures how much probability the language model assigns to the correct next words in the test corpus, on average. A better language model is one with lower perplexity.
Perplexity also measures how complex a task is: for a uniform model over the vocabulary it equals the vocabulary size, PP = |V|.
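Perplexity is the exponentiated average negative log probability per word; a short sketch, with made-up per-word probabilities:

```python
import math

def perplexity(word_probs):
    """PP = exp(-(1/N) * sum_i log p_i) over the test words."""
    n = len(word_probs)
    return math.exp(-sum(math.log(p) for p in word_probs) / n)

# Sanity check: a uniform model over a vocabulary of size V
# has perplexity V, matching PP = |V| above.
V = 1000
print(round(perplexity([1.0 / V] * 5)))  # 1000
```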
Comparing language models
LM             Hidden layer size   PPL
n-gram-based   —                   131.2
+feed-forward  600                 112.5
+RNN           600                 108.1
+LSTM          600                 92.0

Results from Sundermeyer et al., 2015
Applications of language modeling
Speech recognition
Machine translation
Handwriting recognition
Information retrieval
...
A bad language model
Summary
● Language modeling consists in assigning a probability to a sequence of words.
● We can model a sequence of words with n-grams, feed-forward networks and recurrent networks.
● Feed-forward networks are able to generalize to unseen contexts.
● RNNs are able to use variable-length contexts.
Learn more
Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A neural probabilistic language model. J. Mach. Learn. Res. 3 (March 2003), 1137-1155.
Martin Sundermeyer, Hermann Ney, and Ralf Schlüter. 2015. From feedforward to recurrent LSTM neural networks for language modeling. IEEE/ACM Trans. Audio, Speech and Lang. Proc. 23, 3 (March 2015), 517-529. DOI=http://dx.doi.org/10.1109/TASLP.2015.2400218
Thanks ! Q&A ?
https://www.costa-jussa.com