TRANSCRIPT
Analysis of a Neural Language Model
Eric Doi
CS 152: Neural Networks
Harvey Mudd College
Project Goals
Implement a neural network language model
Perform classification between English and Spanish (scrapped)
Produce results supporting work by Bengio et al.
Interpret learned parameters
Review
Problem: Modeling the joint probability function of sequences of words in a language to make predictions: which word maximizes $P(w_t \mid w_1, \ldots, w_{t-1})$?
Exercise 1:
US president has "no hard feelings" about the Iraqi journalist who flung _______
Exercise 1:
US president has "no hard feelings" about the Iraqi journalist who flung shoes
Exercise 2:
in an algorithm that seems to 'backpropagate errors', ______
Exercise 2:
in an algorithm that seems to 'backpropagate errors', hence
Review
Conditional probability (chain rule): $P(w_1, \ldots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \ldots, w_{t-1})$
N-gram assumption: $P(w_t \mid w_1, \ldots, w_{t-1}) \approx P(w_t \mid w_{t-n+1}, \ldots, w_{t-1})$
Review
N-grams do handle sparse data well. However, there are problems:
Narrow consideration of context (~1–2 words)
Does not consider semantic/grammatical similarity: "A cat is walking in the bedroom" vs. "A dog was running in a room"
Neural Network Approach
The general idea (formalized below):
1. Associate with each word in the vocabulary (e.g. size 17,000) a feature vector (30–100 features)
2. Express the joint probability function of word sequences in terms of these feature vectors
3. Learn simultaneously the word feature vectors and the parameters of the probability function
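Formally, following Bengio et al., the model learns a mapping $C$ from words to feature vectors and a function $g$ over those vectors, so that

$\hat{P}(w_t \mid w_{t-1}, \ldots, w_{t-n+1}) = g\big(w_t, C(w_{t-1}), \ldots, C(w_{t-n+1})\big)$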
Data Preparation
Input text needs, or benefits from, preprocessing (a rough sketch follows this list):
Treat punctuation as words
Ignore case
Strip any irrelevant data
Assemble vocabulary
Combine infrequent words (e.g. frequency ≤ 3)
Encode numerically
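A minimal Python sketch of this pipeline, assuming the *rare_word* merge token and the frequency cutoff from the list above; the function name and regular expression are illustrative, not the code used in the project:

import re
from collections import Counter

RARE = "*rare_word*"

def preprocess(text, max_rare_freq=3):
    # Ignore case; treat punctuation marks as words in their own right.
    tokens = re.findall(r"[a-z]+|[.,;:!?'\"-]", text.lower())
    # Combine infrequent words (frequency <= max_rare_freq) into one token.
    counts = Counter(tokens)
    tokens = [t if counts[t] > max_rare_freq else RARE for t in tokens]
    # Assemble the vocabulary and encode the sequence numerically.
    vocab = {w: i for i, w in enumerate(sorted(set(tokens)))}
    return [vocab[t] for t in tokens], vocab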
Data Preparation
Parliament proceedings
Neural Architecture
1) C, a word -> feature vector lookup table
2) A neural network learning the probability function over those feature vectors
Feature Vector Lookup Table
Like a shared one-hot encoding layer
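A small numpy illustration of that equivalence (shapes and values are arbitrary):

import numpy as np

V, m = 5, 2                        # vocabulary size, feature dimension
C = np.random.randn(V, m)          # feature vector table, one row per word

w = 3                              # index of some word
one_hot = np.zeros(V)
one_hot[w] = 1.0

# Looking up row w of C is the same as multiplying a one-hot vector by C,
# i.e. the table behaves like one shared linear layer over one-hot inputs.
assert np.allclose(C[w], one_hot @ C)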
Neural network
Optional direct connections from the feature vectors to the output
Note: the feature vectors are the network's only connection to the words
The hidden layer models interactions among the context words
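In the notation of Bengio et al., with $x$ the concatenation of the context words' feature vectors, the unnormalized output scores are

$y = b + Wx + U \tanh(d + Hx)$

where $Wx$ provides the optional direct connections from the features to the output (set $W = 0$ to omit them) and the $\tanh$ hidden layer models the interactions among the context words.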
Final Layer
High amount of computation: the final layer's outputs, one per vocabulary word, pass through a softmax normalization.
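A numerically stable softmax sketch (illustrative code, not the project's implementation); the expense comes from producing and normalizing one score per vocabulary word:

import numpy as np

def softmax(y):
    # Subtract the max for numerical stability; the probabilities are unchanged.
    z = np.exp(y - np.max(y))
    return z / z.sum()

# One score per vocabulary word, so this is O(|V|) work per prediction.
probs = softmax(np.random.randn(413))   # e.g. V = 413, as in Set 2 below
assert abs(probs.sum() - 1.0) < 1e-9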
Parameters
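In Bengio et al.'s formulation the parameters are $\theta = (b, d, W, U, H, C)$: output biases $b$, hidden biases $d$, optional direct connections $W$, hidden-to-output weights $U$, input-to-hidden weights $H$, and the feature vector table $C$. With $n-1$ context words, $m$ features per word, and $h$ hidden units, this comes to $|V|(1 + nm + h) + h(1 + (n-1)m)$ free parameters.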
Training
We want to find parameters that maximize the training corpus log-likelihood, plus a regularization term, running through the full sequence by moving the viewing window.
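Written out in the style of Bengio et al., with $f$ the model's predicted probability and $R(\theta)$ the regularization term:

$L(\theta) = \frac{1}{T} \sum_{t} \log f(w_t, w_{t-1}, \ldots, w_{t-n+1}; \theta) + R(\theta)$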
Training
Perform stochastic (on-line) gradient ascent using backpropagation
The learning rate decreases over time (in Bengio et al., as $\varepsilon_t = \varepsilon_0 / (1 + rt)$).
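A runnable miniature of this procedure. To keep it self-contained it trains only output biases (a "Blindnet"-style unigram model), but the on-line updates and the decaying learning rate $\varepsilon_t = \varepsilon_0/(1 + rt)$ follow the schedule above; the function name and constants are illustrative:

import numpy as np

def train_biases(sequence, V, epochs=5, eps0=0.1, r=1e-4):
    # Stochastic (on-line) gradient ascent on the log-likelihood of a
    # bias-only softmax model, with a decaying learning rate.
    b = np.zeros(V)
    step = 0
    for _ in range(epochs):
        for w in sequence:                  # move through the corpus one position at a time
            p = np.exp(b - b.max())
            p /= p.sum()                    # current model probabilities
            grad = -p
            grad[w] += 1.0                  # gradient of log softmax(b)[w] w.r.t. b
            step += 1
            eps = eps0 / (1.0 + r * step)   # learning rate decays over time
            b += eps * grad                 # ascent step (maximizing log-likelihood)
    return b

# Toy usage: word 0 is the most frequent, so it ends up with the largest bias.
biases = train_biases([0, 0, 0, 1, 2, 0, 1, 0], V=3)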
Results
Perplexity as a measure of success: the geometric average of $1/\hat{P}(w_t \mid \text{context})$ over the test set. It measures surprise; a perplexity of 10 means the model is, on average, as surprised as when presented with 1 of 10 equally probable outcomes.
Perplexity = 1 => perfect prediction; Perplexity ≥ V => failure (no better than guessing uniformly over the vocabulary)
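As a formula over a test sequence of $T$ words:

$\text{Perplexity} = \exp\Big(-\frac{1}{T}\sum_{t=1}^{T} \log \hat{P}(w_t \mid w_{t-1}, \ldots, w_{t-n+1})\Big)$

i.e. the geometric average of $1/\hat{P}(w_t \mid \text{context})$.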
Set 1: Train 1000, Test 1000, V = 82

Model     n  h   m   Perplexity
Blindnet  0  0   0   25.3
NNet1     3  3   2   34.6
NNet2     3  3   2   25.9
NNet5     4  20  20  46.9
Set 2: Train 10000, Test 1000, V = 413

Model      n  h   m   Perplexity
Blindnet2  0  0   0   73.7
NNet4      3  3   2   73.9
NNet6      2  50  30
Unigram Modeling
Bias values of the output layer reflect the overall frequencies of the words
Looking at output words with the highest bias values:
             freq   nnet.b   nnet2.b   blindnet.b
SUM          1856   1837     1935      1848
Δ vs. freq      0    -19       79        -8
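A rough argument for why, assuming the non-bias contribution to an output unit averages out to something roughly word-independent: a bias-only softmax model fit by maximum likelihood reproduces the unigram distribution, which gives

$b_i \approx \log(\text{freq}_i) + \text{const}$

so the words with the largest output biases should simply be the most frequent words, which is roughly what the comparison above checks.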
Analyzing features: m = 2
Looked at the 10 highest- and 10 lowest-valued words for each feature
Considered the role of overall frequency: the merged *rare_word* token is 5 times as frequent as 'the', but frequency is not correlated with high feature values
Analyzing features: m = 2
[Scatter plot of the two learned feature values (F1 vs. F2) for each vocabulary word.]
F2 High: mr the s been of have can a like once
F2 Low: part with , that know not which during as one
F1 High: session like we all one at of during you thursday
F1 Low: part madam i . this once ' agendas ria
Analyzing features: m = 2
[Scatter plot of the two learned feature values (F1 vs. F2) for each vocabulary word.]
F2 High: you president and parliament like , that case year if
F2 Low: a the be - mr mrs i have there for
F1 High: would a be on all should which madam to the
F1 Low: president . that i year you session it who one
Analyzing features: m = 2
[Scatter plot of the two learned feature values (F1 vs. F2) for each vocabulary word.]
F2 High: the a this you now s - president be i
F2 Low: , which in order should been parliament shall request because
F1 High: to have on the and madam not been that in
F1 Low: we however i before members president do which principle would
Difficulties
Computation-intensive; hard to run thorough tests
Future Work
Simpler sentences
Clustering to find meaningful groups of words in higher feature dimensions
Search across multiple neural networks
References
Bengio, "A Neural Probabilistic Language Model," 2003.
Bengio, "Taking on the Curse of Dimensionality in Joint Distributions Using Neural Networks," 2000.
Questions?