The mental representation of sentences: tree structures or state vectors? Stefan Frank
TRANSCRIPT
The mental representation of sentences
Tree structures or state vectors?

Stefan Frank
[email protected]

With help from:
Rens Bod
Victor Kuperman
Brian Roark
Vera Demberg
Understanding a sentence: the very general picture

[Figure: the word sequence "the cat is on the mat" is mapped, via "comprehension", onto its "meaning".]
Sentence meaning: theories of mental representation

[Figure: candidate representations of the sentence's meaning and "structure":
− logical form, e.g. ∃x,y: cat(x) ∧ mat(y) ∧ on(x,y)
− tree structure
− conceptual network (cat, on, mat)
− perceptual simulation
− state vector, or activation pattern
− …?]
Grammar-based vs. connectionist models

Grammars account for the productivity and systematicity of language (Fodor & Pylyshyn, 1988; Marcus, 1998)
Connectionism can explain why there is no (unlimited) productivity and (pure) systematicity (Christiansen & Chater, 1999)
The debate (or battle) between the two camps focuses on particular (psycho-)linguistic phenomena, e.g. …
From theories to models

Probabilistic Context-Free Grammar (PCFG) versus Simple Recurrent Network (SRN)

Implemented computational models can be evaluated and compared more thoroughly than 'mere' theories:
− Take a common grammar-based model and a common connectionist model
− Compare their ability to predict empirical data (measurements of word-reading time)
Probabilistic Context-Free Grammar

A context-free grammar with a probability for each production rule (conditional on the rule's left-hand side)
− The probability of a tree structure is the product of the probabilities of the rules involved in its construction.
− The probability of a sentence is the sum of the probabilities of all its grammatical tree structures.
Rules and their probabilities can be induced from a large corpus of syntactically annotated sentences (a treebank).
− Wall Street Journal treebank: approx. 50,000 sentences from WSJ newspaper articles (1988−1989)
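The two probability definitions above can be sketched in a few lines of Python. The toy grammar, its rules, and its probabilities below are illustrative assumptions, not rules induced from the WSJ treebank:

```python
# Toy PCFG: each rule maps (left-hand side, right-hand side) to a probability;
# the probabilities of all rules sharing a left-hand side sum to 1.
# (Hypothetical grammar for illustration -- not induced from a real treebank.)
pcfg = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("DT", "NN")): 0.6,
    ("NP", ("PRP",)): 0.4,
    ("VP", ("VBZ", "NP")): 1.0,
}

def tree_probability(rules):
    """Probability of a tree: the product of the probabilities of its rules."""
    p = 1.0
    for rule in rules:
        p *= pcfg[rule]
    return p

def sentence_probability(parses):
    """Probability of a sentence: the sum over all its grammatical parses."""
    return sum(tree_probability(rules) for rules in parses)

# One parse: S -> NP VP, NP -> DT NN, VP -> VBZ NP, NP -> PRP
parse = [("S", ("NP", "VP")), ("NP", ("DT", "NN")),
         ("VP", ("VBZ", "NP")), ("NP", ("PRP",))]
print(tree_probability(parse))        # 1.0 * 0.6 * 1.0 * 0.4
print(sentence_probability([parse]))  # only one parse here, so the same value
```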
Inducing a PCFG

[Figure: parse tree of "It has no bearing on our work force today", with production rules read off it, e.g.:
S → NP VP .
NP → DT NN
NP → PRP$ NN NN
NN → bearing
NN → today]
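Induction by relative-frequency estimation can be sketched as follows; the tiny rule list is a hypothetical stand-in for the roughly 50,000 annotated WSJ sentences:

```python
from collections import Counter

# Rule occurrences read off a (tiny, hypothetical) treebank, one tuple per
# occurrence: (left-hand side, right-hand side).
observed_rules = [
    ("S", ("NP", "VP", ".")),
    ("NP", ("DT", "NN")),
    ("NP", ("PRP$", "NN", "NN")),
    ("NP", ("DT", "NN")),
    ("NN", ("bearing",)),
    ("NN", ("today",)),
]

def induce_pcfg(rules):
    """Relative-frequency estimation: Pr(rule) = count(rule) / count(its LHS)."""
    rule_counts = Counter(rules)
    lhs_counts = Counter(lhs for lhs, _ in rules)
    return {rule: n / lhs_counts[rule[0]] for rule, n in rule_counts.items()}

pcfg = induce_pcfg(observed_rules)
print(pcfg[("NP", ("DT", "NN"))])   # 2 of the 3 NP expansions
print(pcfg[("NN", ("bearing",))])   # 1 of the 2 NN expansions
```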
Simple Recurrent Network (SRN)
Elman (1990)

Feedforward neural network with recurrent connections
Processes sentences word by word
Usually trained to predict the upcoming word (i.e., the input at t+1)

[Figure: network architecture. The word input at t and a copy of the hidden activation at t−1 both feed into the hidden layer, whose state vector represents the sentence up to word t; the output layer gives estimated probabilities for the words at t+1.]
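A single SRN time step can be sketched as below; the layer sizes and the random (untrained) weights are assumptions for illustration only:

```python
import math
import random

# Minimal SRN forward pass (after Elman, 1990). Toy sizes, untrained weights.
random.seed(0)
VOCAB, HIDDEN = 5, 3  # vocabulary of 5 words, 3 hidden units

def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

W_in = rand_matrix(HIDDEN, VOCAB)    # word input -> hidden layer
W_rec = rand_matrix(HIDDEN, HIDDEN)  # copied hidden state at t-1 -> hidden layer
W_out = rand_matrix(VOCAB, HIDDEN)   # hidden layer -> output layer

def srn_step(word_index, context):
    """One time step: new hidden state and a distribution over the next word."""
    x = [1.0 if i == word_index else 0.0 for i in range(VOCAB)]  # one-hot input
    hidden = [math.tanh(sum(W_in[j][i] * x[i] for i in range(VOCAB)) +
                        sum(W_rec[j][k] * context[k] for k in range(HIDDEN)))
              for j in range(HIDDEN)]
    scores = [sum(W_out[i][j] * hidden[j] for j in range(HIDDEN))
              for i in range(VOCAB)]
    exp_s = [math.exp(s) for s in scores]          # softmax: estimated
    probs = [e / sum(exp_s) for e in exp_s]        # probabilities for t+1
    return hidden, probs

# Process a sentence word by word, copying the hidden state forward each step.
context = [0.0] * HIDDEN
for w in [0, 3, 1]:                  # a toy word-index sequence
    context, next_word_probs = srn_step(w, context)
print(sum(next_word_probs))          # the output probabilities sum to 1
```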
Word probability and reading times
Hale (2001), Levy (2008)

Surprisal theory: the more unexpected the occurrence of a word, the more time needed to process it. Formally:
− A sentence is a sequence of words: w1, w2, w3, …
− The time needed to read word wt is logarithmically related to its probability in the 'context':
RT(wt) ~ −log Pr(wt | context)   (the surprisal of wt)
− If nothing else changes, the context is just the sequence of previous words:
RT(wt) ~ −log Pr(wt | w1, …, wt−1)

Both PCFGs and SRNs can estimate Pr(wt | w1, …, wt−1). So can they predict word-reading times?
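A minimal sketch of the surprisal computation; the per-word probabilities below are made-up numbers standing in for a PCFG's or SRN's estimates of Pr(wt | w1, …, wt−1):

```python
import math

def surprisal(prob):
    """Surprisal of a word: -log2 Pr(w_t | context), in bits."""
    return -math.log2(prob)

# Hypothetical model probabilities for each word of "the cat is on the mat":
word_probs = [("the", 0.20), ("cat", 0.01), ("is", 0.30),
              ("on", 0.10), ("the", 0.40), ("mat", 0.02)]
for word, p in word_probs:
    print(f"{word:>4}: {surprisal(p):.2f} bits")
# Surprisal theory predicts reading time to grow with these values, so the
# unexpected words ("cat", "mat") should be read more slowly.
```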
Testing surprisal theory
Demberg & Keller (2008)

Reading-time data:
− Dundee corpus: approx. 2,400 sentences from The Independent newspaper editorials
− Read by 10 subjects
− Eye-movement registration
− First-pass RTs: fixation time on a word before any fixation on later words
Computation of surprisal:
− PCFG induced from WSJ treebank
− Applied to Dundee corpus sentences
− Using Brian Roark's incremental PCFG parser

But accurate word prediction is difficult because of the required world knowledge and the differences between WSJ and The Independent:
− 1988−'89 versus 2002
− general WSJ articles versus Independent editorials
− American English versus British English
− only major similarity: both are in English

Result: no significant effect of word surprisal on RT, apart from the effects of Pr(wt) and Pr(wt | wt−1)
Testing surprisal theory
Demberg & Keller (2008)

Test for a purely 'structural' (i.e., non-semantic) effect by ignoring the actual words

[Figure: the parse tree of "it has no bearing on our work force today" with the words replaced by their part-of-speech (pos) tags]

'Unlexicalized' (or 'structural') surprisal:
− PCFG induced from WSJ trees with words removed
− surprisal estimation by parsing sequences of pos-tags (instead of words) of Dundee corpus texts
− independent of semantics, so more accurate estimation possible
− but probably a weaker relation with reading times

Is a word's RT related to the predictability of its part-of-speech?

Result: yes, a statistically significant (but very small) effect of pos-surprisal on word-RT
Caveats

Statistical analysis:
− The analysis assumes independent measurements
− Surprisal theory is based on dependencies between words
− So the analysis is inconsistent with the theory

Implicit assumptions:
− The PCFG forms an accurate language model (i.e., it gives high probability to the parts-of-speech that actually occur)
− An accurate language model is also an accurate psycholinguistic model (i.e., it predicts reading times)
Solutions

Sentence-level (instead of word-level) analysis:
− Both the PCFG and the statistical analysis assume independence between sentences
− Surprisal averaged over the pos-tags in the sentence
− Total sentence RT divided by sentence length (# letters)

Measure accuracy:
a) of the language model: lower average surprisal → more accurate language model
b) of the psycholinguistic model: RT and surprisal correlate more strongly → more accurate psycholinguistic model

If a) and b) increase together, accurate language models are also accurate psycholinguistic models
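The sentence-level analysis can be sketched as follows; all numbers are hypothetical stand-ins for the Dundee measurements and the models' surprisal estimates:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical data: per-sentence lists of pos-tag surprisals,
# total reading times (ms), and sentence lengths in letters.
surprisals = [[2.1, 3.5, 1.2], [4.0, 4.4], [1.0, 1.1, 0.9, 1.3]]
total_rts = [820.0, 900.0, 760.0]
n_letters = [14, 11, 18]

# Average surprisal per sentence; RT normalized by sentence length.
mean_surprisal = [sum(s) / len(s) for s in surprisals]
rt_per_letter = [rt / n for rt, n in zip(total_rts, n_letters)]
r = pearson(mean_surprisal, rt_per_letter)
print(round(r, 3))  # stronger correlation = more accurate psycholinguistic model
```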
Comparing PCFG and SRN

PCFG:
− Train on WSJ treebank (unlexicalized)
− Parse pos-tag sequences from Dundee corpus
− Obtain a range of surprisal estimates by varying the 'beam-width' parameter, which controls parser accuracy

SRN:
− Train on sequences of pos-tags (not the trees) from WSJ
− During training (at regular intervals), process pos-tags from the Dundee corpus, obtaining a range of surprisal estimates

Evaluation (just like Demberg & Keller):
− Language model: average surprisal measures inaccuracy (and estimates language entropy)
− Psycholinguistic model: correlation between surprisals and RTs
Results

Preliminary conclusions

Both models account for a statistically significant fraction of variance in the reading-time data.
The human sentence-processing system seems to be using an accurate language model.
The SRN is the more accurate psycholinguistic model.
But PCFG and SRN together might form an even better psycholinguistic model.
Improved analysis

Linear mixed-effects regression model (to take into account random effects of subject and item)
Compare regression models that include:
− surprisal estimates by the PCFG with the largest beam width
− surprisal estimates by the fully trained SRN
− both
Also include: sentence length, word frequency, forward and backward transitional probabilities, and all significant two-way interactions between these
Results

Estimated β-coefficients (and associated p-values):

Effect of surprisal      Regression model includes…
according to…            PCFG            SRN              both
PCFG                     0.45 (p<.02)                     −0.46 (p>.2)
SRN                                      0.64 (p<.001)     1.02 (p<.01)
Conclusions

Both PCFG and SRN do account for the reading-time data to some extent
But the PCFG does not improve on the SRN's predictions
No evidence for tree structures in the mental representation of sentences
Qualitative comparison

Why does the SRN fit the data better than the PCFG? Is it more accurate on a particular group of data points, or does it perform better overall?
Take the regression analyses' residuals (i.e., the differences between predicted and measured reading times):
δi = |residi(PCFG)| − |residi(SRN)|
δi is the extent to which data point i is predicted better by the SRN than by the PCFG.
Is there a group of data points for which δ is larger than might be expected? Look at the distribution of the δs.
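The δ computation is just a pointwise comparison of absolute residuals; the residuals below are hypothetical stand-ins for the two regressions' output:

```python
import statistics

# delta_i measures how much better the SRN predicts data point i than the
# PCFG does. Hypothetical residuals (predicted minus measured reading times):
resid_pcfg = [12.0, -8.5, 30.1, -4.2, 15.7, -22.3]
resid_srn = [10.5, -7.0, 12.4, -4.8, 9.9, -11.0]

delta = [abs(p) - abs(s) for p, s in zip(resid_pcfg, resid_srn)]

# A mean above zero means the SRN predicts better on average; a strongly
# right-skewed distribution would instead point to a particular subset of
# data points being predicted better.
print(statistics.mean(delta))
print(statistics.median(delta))
```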
Possible distributions of δ

[Figure: three sketched distributions of δ:
− symmetrical, mean is 0: no difference between SRN and PCFG (only random noise)
− right-shifted: overall better predictions by SRN
− right-skewed: a particular subset of the data is predicted better by SRN]

Test for symmetry

If the distribution of δ is asymmetric, particular data points are predicted better by the SRN than by the PCFG.
The distribution is not significantly asymmetric (two-sample Kolmogorov-Smirnov test, p>.17)
The SRN seems to be a more accurate psycholinguistic model overall.
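One way to run such a symmetry check is to compare the δs with their mirror image using a two-sample Kolmogorov-Smirnov statistic. This is a sketch under assumptions: the data are made up, and a real test would also compute a p-value from the statistic.

```python
def ks_statistic(xs, ys):
    """Two-sample KS statistic: max vertical distance between empirical CDFs."""
    xs, ys = sorted(xs), sorted(ys)
    points = sorted(set(xs + ys))
    def ecdf(sample, v):
        return sum(1 for s in sample if s <= v) / len(sample)
    return max(abs(ecdf(xs, v) - ecdf(ys, v)) for v in points)

# Hypothetical, roughly symmetric deltas; a symmetric distribution should
# give a small statistic when compared against its own mirror image.
delta = [1.5, -1.4, 0.2, -0.3, 2.1, -2.0, 0.8, -0.7]
mirrored = [-d for d in delta]
print(ks_statistic(delta, mirrored))  # small value: no evidence of asymmetry
```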
Questions I cannot (yet) answer (but would like to)

Why does the SRN fit the data better than the PCFG?
Perhaps people…
− are bad at dealing with long-distance dependencies in a sentence?
− store information about the frequency of multi-word sequences?

Questions I cannot (yet) answer (but would like to)

In general:
− Is this SRN a more accurate psychological model than this PCFG?
− Are SRNs more accurate psychological models than PCFGs?
− Does connectionism make for more accurate psychological models than grammar-based theories?
− What kind of representations are used in human sentence processing?

Surprisal-based model evaluation may provide some answers