TRANSCRIPT
Analysis of a Neural Language Model
Eric Doi
CS 152: Neural Networks
Harvey Mudd College
Project Goals
Implement a neural network language model
Perform classification between English and Spanish (scrapped)
Produce results supporting work by Bengio et al.
Interpret learned parameters
Review
Problem: Modeling the joint probability function of sequences of words in a language to make predictions: which word maximizes $P(w_t \mid w_1, \ldots, w_{t-1})$?
Exercise 1:
US president has "no hard feelings" about the Iraqi journalist who flung _______
Exercise 1:
US president has "no hard feelings" about the Iraqi journalist who flung shoes
Exercise 2:
in an algorithm that seems to 'backpropagate errors', ______
Exercise 2:
in an algorithm that seems to 'backpropagate errors', hence
Review
Conditional probability (chain rule): $P(w_1, \ldots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \ldots, w_{t-1})$
N-gram assumption: $P(w_t \mid w_1, \ldots, w_{t-1}) \approx P(w_t \mid w_{t-n+1}, \ldots, w_{t-1})$
Review
N-grams do handle sparse data well. However, there are problems:
Narrow consideration of context (~1–2 words)
Does not consider semantic/grammatical similarity: "A cat is walking in the bedroom" vs. "A dog was running in a room"
Neural Network Approach
The general idea (formalized below):
1. Associate with each word in the vocabulary (e.g. size 17,000) a feature vector (30–100 features)
2. Express the joint probability function of word sequences in terms of these feature vectors
3. Learn simultaneously the word feature vectors and the parameters of the probability function
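Formally, following Bengio et al., the model learns a mapping $C$ from words to feature vectors and a function $g$ over those vectors, so that

$\hat{P}(w_t \mid w_{t-1}, \ldots, w_{t-n+1}) = g\big(w_t, C(w_{t-1}), \ldots, C(w_{t-n+1})\big)$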
Data Preparation
Input text needs, or benefits from, preprocessing (a rough sketch follows this list):
Treat punctuation as words
Ignore case
Strip any irrelevant data
Assemble vocabulary
Combine infrequent words (e.g. frequency ≤ 3)
Encode numerically
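A minimal Python sketch of this pipeline, assuming the *rare_word* merge token and the frequency cutoff from the list above; the function name and regular expression are illustrative, not the code used in the project:

import re
from collections import Counter

RARE = "*rare_word*"

def preprocess(text, max_rare_freq=3):
    # Ignore case; treat punctuation marks as words in their own right.
    tokens = re.findall(r"[a-z]+|[.,;:!?'\"-]", text.lower())
    # Combine infrequent words (frequency <= max_rare_freq) into one token.
    counts = Counter(tokens)
    tokens = [t if counts[t] > max_rare_freq else RARE for t in tokens]
    # Assemble the vocabulary and encode the sequence numerically.
    vocab = {w: i for i, w in enumerate(sorted(set(tokens)))}
    return [vocab[t] for t in tokens], vocab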
Data Preparation
Parliament proceedings
Neural Architecture
1) C, a word -> feature vector lookup table
2) A neural network learning the probability function over those feature vectors
Feature Vector Lookup Table
Like a shared one-hot encoding layer
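A small numpy illustration of that equivalence (shapes and values are arbitrary):

import numpy as np

V, m = 5, 2                        # vocabulary size, feature dimension
C = np.random.randn(V, m)          # feature vector table, one row per word

w = 3                              # index of some word
one_hot = np.zeros(V)
one_hot[w] = 1.0

# Looking up row w of C is the same as multiplying a one-hot vector by C,
# i.e. the table behaves like one shared linear layer over one-hot inputs.
assert np.allclose(C[w], one_hot @ C)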
Neural network
Optional direct connections from the feature vectors to the output
Note: the feature vectors are the network's only connection to the words
The hidden layer models interactions among the context words
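In the notation of Bengio et al., with $x$ the concatenation of the context words' feature vectors, the unnormalized output scores are

$y = b + Wx + U \tanh(d + Hx)$

where $Wx$ provides the optional direct connections from the features to the output (set $W = 0$ to omit them) and the $\tanh$ hidden layer models the interactions among the context words.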
Final Layer
High amount of computation: the final layer's outputs, one per vocabulary word, pass through a softmax normalization.
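A numerically stable softmax sketch (illustrative code, not the project's implementation); the expense comes from producing and normalizing one score per vocabulary word:

import numpy as np

def softmax(y):
    # Subtract the max for numerical stability; the probabilities are unchanged.
    z = np.exp(y - np.max(y))
    return z / z.sum()

# One score per vocabulary word, so this is O(|V|) work per prediction.
probs = softmax(np.random.randn(413))   # e.g. V = 413, as in Set 2 below
assert abs(probs.sum() - 1.0) < 1e-9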
Parameters
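In Bengio et al.'s formulation the parameters are $\theta = (b, d, W, U, H, C)$: output biases $b$, hidden biases $d$, optional direct connections $W$, hidden-to-output weights $U$, input-to-hidden weights $H$, and the feature vector table $C$. With $n-1$ context words, $m$ features per word, and $h$ hidden units, this comes to $|V|(1 + nm + h) + h(1 + (n-1)m)$ free parameters.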
Training
We want to find parameters that maximize the training corpus log-likelihood, plus a regularization term, running through the full sequence by moving the viewing window.
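Written out in the style of Bengio et al., with $f$ the model's predicted probability and $R(\theta)$ the regularization term:

$L(\theta) = \frac{1}{T} \sum_{t} \log f(w_t, w_{t-1}, \ldots, w_{t-n+1}; \theta) + R(\theta)$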
Training
Perform stochastic (on-line) gradient ascent using backpropagation
The learning rate decreases over time (in Bengio et al., as $\varepsilon_t = \varepsilon_0 / (1 + rt)$).
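A runnable miniature of this procedure. To keep it self-contained it trains only output biases (a "Blindnet"-style unigram model), but the on-line updates and the decaying learning rate $\varepsilon_t = \varepsilon_0/(1 + rt)$ follow the schedule above; the function name and constants are illustrative:

import numpy as np

def train_biases(sequence, V, epochs=5, eps0=0.1, r=1e-4):
    # Stochastic (on-line) gradient ascent on the log-likelihood of a
    # bias-only softmax model, with a decaying learning rate.
    b = np.zeros(V)
    step = 0
    for _ in range(epochs):
        for w in sequence:                  # move through the corpus one position at a time
            p = np.exp(b - b.max())
            p /= p.sum()                    # current model probabilities
            grad = -p
            grad[w] += 1.0                  # gradient of log softmax(b)[w] w.r.t. b
            step += 1
            eps = eps0 / (1.0 + r * step)   # learning rate decays over time
            b += eps * grad                 # ascent step (maximizing log-likelihood)
    return b

# Toy usage: word 0 is the most frequent, so it ends up with the largest bias.
biases = train_biases([0, 0, 0, 1, 2, 0, 1, 0], V=3)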
Results
Perplexity as a measure of success: the geometric average of $1/\hat{P}(w_t \mid \text{context})$ over the test set. It measures surprise; a perplexity of 10 means the model is, on average, as surprised as when presented with 1 of 10 equally probable outcomes.
Perplexity = 1 => perfect prediction; Perplexity ≥ V => failure (no better than guessing uniformly over the vocabulary)
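As a formula over a test sequence of $T$ words:

$\text{Perplexity} = \exp\Big(-\frac{1}{T}\sum_{t=1}^{T} \log \hat{P}(w_t \mid w_{t-1}, \ldots, w_{t-n+1})\Big)$

i.e. the geometric average of $1/\hat{P}(w_t \mid \text{context})$.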
Set 1: Train 1000, Test 1000, V = 82

Model     n  h   m   Perplexity
Blindnet  0  0   0   25.3
NNet1     3  3   2   34.6
NNet2     3  3   2   25.9
NNet5     4  20  20  46.9
Set 2: Train 10000, Test 1000, V = 413

Model      n  h   m   Perplexity
Blindnet2  0  0   0   73.7
NNet4      3  3   2   73.9
NNet6      2  50  30
Unigram Modeling
Bias values of the output layer reflect the overall frequencies of the words
Looking at output words with the highest bias values:
             freq   nnet.b   nnet2.b   blindnet.b
SUM          1856   1837     1935      1848
Δ vs. freq      0    -19       79        -8
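A rough argument for why, assuming the non-bias contribution to an output unit averages out to something roughly word-independent: a bias-only softmax model fit by maximum likelihood reproduces the unigram distribution, which gives

$b_i \approx \log(\text{freq}_i) + \text{const}$

so the words with the largest output biases should simply be the most frequent words, which is roughly what the comparison above checks.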
Analyzing features: m = 2
Looked at the 10 highest- and 10 lowest-valued words for each feature
Considered the role of overall frequency: the merged *rare_word* token is 5 times as frequent as 'the', but frequency is not correlated with high feature values
Analyzing features: m = 2
[Scatter plot of the two learned feature values (F1 vs. F2) for each vocabulary word.]
F2 High: mr the s been of have can a like once
F2 Low: part with , that know not which during as one
F1 High: session like we all one at of during you thursday
F1 Low: part madam i . this once ' agendas ria
Analyzing features: m = 2
[Scatter plot of the two learned feature values (F1 vs. F2) for each vocabulary word.]
F2 High: you president and parliament like , that case year if
F2 Low: a the be - mr mrs i have there for
F1 High: would a be on all should which madam to the
F1 Low: president . that i year you session it who one
Analyzing features: m = 2
[Scatter plot of the two learned feature values (F1 vs. F2) for each vocabulary word.]
F2 High: the a this you now s - president be i
F2 Low: , which in order should been parliament shall request because
F1 High: to have on the and madam not been that in
F1 Low: we however i before members president do which principle would
Difficulties
Computation-intensive; hard to run thorough tests
Future Work
Simpler sentences
Clustering to find meaningful groups of words in higher feature dimensions
Search across multiple neural networks
References
Bengio, "A Neural Probabilistic Language Model," 2003.
Bengio, "Taking on the Curse of Dimensionality in Joint Distributions Using Neural Networks," 2000.
Questions?