
Page 1: Analysis of a Neural Language Model

Eric Doi

CS 152: Neural Networks

Harvey Mudd College

Page 2: Project Goals

Implement a neural network language model

Perform classification between English and Spanish (scrapped)

Produce results supporting work by Bengio et al.

Interpret learned parameters

Page 3: Review

Problem: Modeling the joint probability function of sequences of words in a language to make predictions

Which word maximizes P(w_t | w_1, …, w_{t-1})?

Page 4: Review

Problem: Modeling the joint probability function of sequences of words in a language to make predictions

Which word maximizes P(w_t | w_1, …, w_{t-1})?

Exercise 1:

US president has "no hard feelings" about the Iraqi journalist who flung _______

Page 5: Review

Problem: Modeling the joint probability function of sequences of words in a language to make predictions

Which word maximizes P(w_t | w_1, …, w_{t-1})?

Exercise 1:

US president has "no hard feelings" about the Iraqi journalist who flung shoes

Page 6: Review

Problem: Modeling the joint probability function of sequences of words in a language to make predictions

Which word maximizes P(w_t | w_1, …, w_{t-1})?

Exercise 2:

in an algorithm that seems to 'backpropagate errors', ______

Page 7: Review

Problem: Modeling the joint probability function of sequences of words in a language to make predictions

Which word maximizes P(w_t | w_1, …, w_{t-1})?

Exercise 2:

in an algorithm that seems to 'backpropagate errors', hence

Page 8: Review

Conditional probability (chain rule):

P(w_1, …, w_T) = ∏_{t=1}^{T} P(w_t | w_1, …, w_{t-1})

N-gram assumption:

P(w_t | w_1, …, w_{t-1}) ≈ P(w_t | w_{t-n+1}, …, w_{t-1})
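
To make the n-gram assumption concrete, here is a minimal count-based trigram model (n = 3) in Python. This is an illustration, not code from the project:

```python
from collections import Counter, defaultdict

def train_trigram(tokens):
    """Estimate P(w_t | w_{t-2}, w_{t-1}) by counting:
    count(w_{t-2}, w_{t-1}, w_t) / count(w_{t-2}, w_{t-1})."""
    tri = defaultdict(Counter)
    for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
        tri[(a, b)][c] += 1
    return tri

def prob(tri, a, b, c):
    ctx = tri[(a, b)]
    total = sum(ctx.values())
    # Unseen contexts get probability 0 -- the sparsity problem
    # discussed on the next slide.
    return ctx[c] / total if total else 0.0

tokens = "a cat is walking in the bedroom".split()
model = train_trigram(tokens)
print(prob(model, "a", "cat", "is"))  # 1.0
```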

Page 9: Review

N-gram models do handle sparse data well. However, there are problems:

Narrow consideration of context (~1–2 words)

Does not consider semantic/grammatical similarity: "A cat is walking in the bedroom" vs. "A dog was running in a room"

Page 10: Neural Network Approach

The general idea:

1. Associate with each word in the vocabulary (e.g. size 17,000) a feature vector (30–100 features)

2. Express the joint probability function of word sequences in terms of the feature vectors

3. Learn simultaneously the word feature vectors and the parameters of the probability function
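
A minimal numpy sketch of these three steps, using the sizes quoted above (all variable names here are illustrative, not the project's code):

```python
import numpy as np

V, m, n = 17000, 30, 4   # vocabulary size, features per word, n-gram order

# Step 1: one feature vector per word, stored as the rows of a matrix C.
C = 0.01 * np.random.randn(V, m)

# Step 2: the probability function sees only feature vectors, so its input
# is the (n-1)*m values of the concatenated context vectors.
context = [12, 407, 9311]                    # indices of w_{t-3}, w_{t-2}, w_{t-1}
x = np.concatenate([C[w] for w in context])  # shape: ((n-1)*m,) = (90,)

# Step 3: because C is an ordinary parameter matrix, the same gradient
# steps that train the network also update the feature vectors.
print(x.shape)
```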

Page 11: Data Preparation

Input text needs, or at least benefits from, preprocessing:

Treat punctuation marks as words

Ignore case

Strip any irrelevant data

Assemble a vocabulary

Combine infrequent words (e.g. frequency ≤ 3) into a single *rare_word* token

Encode the text numerically
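
A sketch of such a pipeline (the punctuation rule, case folding, and the frequency cutoff come from the slide; the regex and the *rare_word* token, which appears later on page 24, are assumptions about the details):

```python
import re
from collections import Counter

def preprocess(text, min_freq=3):
    # Treat punctuation marks as words; ignore case.
    tokens = re.findall(r"[a-z]+|[.,;:!?'\-]", text.lower())
    # Combine infrequent words (frequency <= min_freq) into one token.
    counts = Counter(tokens)
    tokens = [t if counts[t] > min_freq else "*rare_word*" for t in tokens]
    # Assemble the vocabulary and encode the text numerically.
    vocab = {w: i for i, w in enumerate(sorted(set(tokens)))}
    return [vocab[t] for t in tokens], vocab

ids, vocab = preprocess("The session resumed. The session, as you know, resumed.",
                        min_freq=1)
```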

Page 12: Data Preparation

Parliament proceedings

Page 13: Neural Architecture

[Architecture diagram]

1) C: a word -> feature vector lookup table

2) A neural network learning the function f(w_{t-1}, …, w_{t-n+1}) ≈ P(w_t | w_1, …, w_{t-1})

Page 14: Feature Vector Lookup Table

[Diagram: the lookup table C, mapping each word index to its feature vector]

Like a shared one-hot encoding layer.
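
The one-hot view can be checked directly: selecting row w of C is the same as multiplying a one-hot vector by C, so the table behaves like a linear layer whose weights are shared across all context positions (illustrative sketch):

```python
import numpy as np

V, m = 10, 4
C = np.random.randn(V, m)

w = 7
onehot = np.zeros(V)
onehot[w] = 1.0

# Table lookup and one-hot matrix product yield the same feature vector.
assert np.allclose(C[w], onehot @ C)
```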

Page 15: Neural Network

[Network diagram: pre-softmax output y = b + Wx + U tanh(d + Hx), where x is the concatenation of the context feature vectors]

Optional direct connections (the Wx term)

Note: the feature vectors are the network's only connection to the words

The hidden layer models interactions between context words
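
A sketch of that forward pass in numpy (the tanh hidden layer and the Wx direct connections follow Bengio et al.; the sizes and names here are illustrative):

```python
import numpy as np

def forward(context, C, b, d, W, U, H):
    """Pre-softmax score for every vocabulary word:
    y = b + W x + U tanh(d + H x)."""
    x = np.concatenate([C[w] for w in context])  # only link to the words
    hidden = np.tanh(d + H @ x)                  # models interactions
    return b + W @ x + U @ hidden                # W x: direct connections

V, m, h, n = 100, 5, 8, 3
rng = np.random.default_rng(0)
C = rng.normal(size=(V, m))
b, d = np.zeros(V), np.zeros(h)
W = rng.normal(size=(V, (n - 1) * m))   # direct input-to-output weights
U = rng.normal(size=(V, h))             # hidden-to-output weights
H = rng.normal(size=(h, (n - 1) * m))   # input-to-hidden weights
y = forward([3, 42], C, b, d, W, U, H)  # scores for predicting w_t
```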

Page 16: Final Layer

Most of the computation happens here: the final layer passes through a softmax normalization

P(w_t = i | context) = e^{y_i} / Σ_j e^{y_j}
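
A sketch of the softmax step; the max subtraction is a standard numerical-stability detail, not something from the slide:

```python
import numpy as np

def softmax(y):
    # Subtracting the max keeps exp() from overflowing on large scores.
    e = np.exp(y - np.max(y))
    return e / e.sum()

# Every prediction normalizes over all |V| outputs, which is why this
# layer dominates the computation.
p = softmax(np.array([2.0, 1.0, 0.1]))
print(p.sum())  # 1.0
```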

Page 17: Parameters

Following Bengio et al., the trainable parameters are θ = (b, d, W, U, H, C):

b: output biases (|V|)

d: hidden-layer biases (h)

W: direct input-to-output weights (|V| × (n-1)m)

U: hidden-to-output weights (|V| × h)

H: input-to-hidden weights (h × (n-1)m)

C: word feature vectors (|V| × m)
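
Given those shapes, the number of free parameters can be tallied directly; a sketch (direct connections included, sizes from the paper's example):

```python
def num_params(V, m, n, h):
    ctx = (n - 1) * m        # length of the concatenated context vector
    return (V                # b: output biases
            + h              # d: hidden biases
            + V * ctx        # W: direct connections
            + V * h          # U: hidden-to-output
            + h * ctx        # H: input-to-hidden
            + V * m)         # C: feature vectors

# The vocabulary-sized terms dominate.
print(num_params(V=17000, m=30, n=4, h=50))  # 2,911,550
```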

Page 18: Training

We want to find the parameters θ that maximize the regularized log-likelihood of the training corpus:

L = (1/T) Σ_t log P̂(w_t | w_{t-1}, …, w_{t-n+1}; θ) + R(θ)

R(θ) is a regularization term. Training runs through the full sequence, moving the viewing window one word at a time.
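
A sketch of that objective; using a squared-weight decay for R(θ), and the choice of which parameters it penalizes, are assumptions here, not details from the slide:

```python
import numpy as np

def objective(log_probs, weight_mats, decay=1e-4):
    """Average log-likelihood over the corpus plus a weight-decay penalty.

    log_probs:   log P(w_t | context) at each position of the sliding window.
    weight_mats: whichever weight matrices the penalty covers (assumed)."""
    avg_ll = np.mean(log_probs)
    penalty = decay * sum(np.sum(Wt ** 2) for Wt in weight_mats)
    return avg_ll - penalty  # maximized during training

print(objective(np.log([0.1, 0.2]), [np.ones((2, 2))]))
```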

Page 19: Training

Perform stochastic (on-line) gradient ascent using backpropagation:

θ ← θ + ε_t · ∂ log P̂(w_t | w_{t-1}, …, w_{t-n+1}) / ∂θ

The learning rate decreases over time as ε_t = ε_0 / (1 + r·t).
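
A sketch of the update loop with that decreasing schedule; grad_fn (backpropagation through the network) is assumed here, not shown:

```python
def sgd_ascent(params, grad_fn, windows, eps0=1e-2, r=1e-8):
    """On-line gradient ascent: one update per window of the corpus."""
    for t, window in enumerate(windows):
        eps = eps0 / (1.0 + r * t)       # learning rate decays over updates
        grads = grad_fn(params, window)  # d(log-likelihood)/d(theta), via backprop
        for name in params:              # ascent: step *up* the gradient
            params[name] += eps * grads[name]
    return params
```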

Page 20: Results

Perplexity as a measure of success:

Perplexity = exp(-(1/T) Σ_t log P̂(w_t | context)), the geometric average of 1 / P̂(w_t | context)

It measures surprise: a perplexity of 10 means the model is, on average, as surprised as when presented with 1 of 10 equally probable outcomes.

Perplexity = 1 => perfect prediction; Perplexity ≥ |V| => failure (no better than guessing uniformly)
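
Computing it from per-word log probabilities is a one-liner (a sketch): perplexity is the exponential of the average negative log-likelihood.

```python
import numpy as np

def perplexity(log_probs):
    """exp(-mean log P) == geometric average of 1 / P(w_t | context)."""
    return float(np.exp(-np.mean(log_probs)))

# Guessing uniformly among 10 equally probable outcomes gives perplexity 10.
print(perplexity(np.log(np.full(100, 0.1))))  # 10.0
```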

Page 21: Set 1 (Train 1000, Test 1000, |V| = 82)

Model      n   h   m   Perplexity
Blindnet   0   0   0   25.3
NNet1      3   3   2   34.6
NNet2      3   3   2   25.9
NNet5      4  20  20   46.9

Page 22: Set 2 (Train 10000, Test 1000, |V| = 413)

Model       n   h   m   Perplexity
Blindnet2   0   0   0   73.7
NNet4       3   3   2   73.9
NNet6       2  50  30

Page 23: Unigram Modeling

Bias values of the output layer reflect the overall frequencies of the words

Looking at output words with the highest bias values:

            freq   nnet.b   nnet2.b   blindnet.b
SUM         1856   1837     1935      1848
Δ vs. freq     0    -19       79        -8

Page 24: Analyzing features: m = 2

Looked at the highest/lowest 10 words for both features

Considered the role of overall frequency: *rare_word* is 5 times as frequent as 'the', but frequency is not correlated with high feature values

Page 25: Analyzing features: m = 2

[Scatter plot of the vocabulary in the m = 2 feature space; axis values omitted]

F2 high: mr, the, s, been, of, have, can, a, like, once

F2 low: part, with, ',', that, know, not, which, during, as, one

F1 high: session, like, we, all, one, at, of, during, you, thursday

F1 low: part, madam, i, '.', this, once, "'", agendas, ria

Page 26: Analyzing features: m = 2

[Scatter plot of the vocabulary in the m = 2 feature space; axis values omitted]

F2 high: you, president, and, parliament, like, ',', that, case, year, if

F2 low: a, the, be, '-', mr, mrs, i, have, there, for

F1 high: would, a, be, on, all, should, which, madam, to, the

F1 low: president, '.', that, i, year, you, session, it, who, one

Page 27: Analyzing features: m = 2

[Scatter plot of the vocabulary in the m = 2 feature space; axis values omitted]

F2 high: the, a, this, you, now, s, '-', president, be, i

F2 low: ',', which, in, order, should, been, parliament, shall, request, because

F1 high: to, have, on, the, and, madam, not, been, that, in

F1 low: we, however, i, before, members, president, do, which, principle, would

Page 28: Difficulties

Computation-intensive; hard to run thorough tests

Page 29: Future Work

Simpler sentences

Clustering to find meaningful groups of words in higher feature dimensions

Search across multiple neural networks

Page 30: References

Bengio et al. "A Neural Probabilistic Language Model." 2003.

Bengio and Bengio. "Taking on the Curse of Dimensionality in Joint Distributions Using Neural Networks." 2000.

Page 31: Questions?