The mental representation of sentences: tree structures or state vectors? Stefan Frank
TRANSCRIPT
The mental representation of sentences
Tree structures or state vectors?

Stefan Frank
[email protected]

With help from:
Rens Bod
Victor Kuperman
Brian Roark
Vera Demberg
Understanding a sentence: the very general picture

[Figure: the word sequence "the cat is on the mat" is mapped, via "comprehension", onto its "meaning".]
Sentence meaning: theories of mental representation

[Figure: candidate representations of the sentence's meaning and "structure":
− logical form, e.g. ∃x,y: cat(x) ∧ mat(y) ∧ on(x,y)
− tree structure
− conceptual network (cat, on, mat)
− perceptual simulation
− state vector, or activation pattern
− …?]
Grammar-based vs. connectionist models

Grammars account for the productivity and systematicity of language (Fodor & Pylyshyn, 1988; Marcus, 1998)
Connectionism can explain why there is no (unlimited) productivity and (pure) systematicity (Christiansen & Chater, 1999)
The debate (or battle) between the two camps focuses on particular (psycho-)linguistic phenomena, e.g. …
From theories to models

Probabilistic Context-Free Grammar (PCFG) versus Simple Recurrent Network (SRN)

Implemented computational models can be evaluated and compared more thoroughly than 'mere' theories:
− Take a common grammar-based model and a common connectionist model
− Compare their ability to predict empirical data (measurements of word-reading time)
Probabilistic Context-Free Grammar

A context-free grammar with a probability for each production rule (conditional on the rule's left-hand side)
− The probability of a tree structure is the product of the probabilities of the rules involved in its construction.
− The probability of a sentence is the sum of the probabilities of all its grammatical tree structures.
Rules and their probabilities can be induced from a large corpus of syntactically annotated sentences (a treebank).
− Wall Street Journal treebank: approx. 50,000 sentences from WSJ newspaper articles (1988−1989)
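The two probability definitions above can be sketched in a few lines of Python. The toy grammar, its rules, and its probabilities below are illustrative assumptions, not rules induced from the WSJ treebank:

```python
# Toy PCFG: each rule maps (left-hand side, right-hand side) to a probability;
# the probabilities of all rules sharing a left-hand side sum to 1.
# (Hypothetical grammar for illustration -- not induced from a real treebank.)
pcfg = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("DT", "NN")): 0.6,
    ("NP", ("PRP",)): 0.4,
    ("VP", ("VBZ", "NP")): 1.0,
}

def tree_probability(rules):
    """Probability of a tree: the product of the probabilities of its rules."""
    p = 1.0
    for rule in rules:
        p *= pcfg[rule]
    return p

def sentence_probability(parses):
    """Probability of a sentence: the sum over all its grammatical parses."""
    return sum(tree_probability(rules) for rules in parses)

# One parse: S -> NP VP, NP -> DT NN, VP -> VBZ NP, NP -> PRP
parse = [("S", ("NP", "VP")), ("NP", ("DT", "NN")),
         ("VP", ("VBZ", "NP")), ("NP", ("PRP",))]
print(tree_probability(parse))        # 1.0 * 0.6 * 1.0 * 0.4
print(sentence_probability([parse]))  # only one parse here, so the same value
```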
Inducing a PCFG

[Figure: parse tree of "It has no bearing on our work force today", with production rules read off it, e.g.:
S → NP VP .
NP → DT NN
NP → PRP$ NN NN
NN → bearing
NN → today]
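Induction by relative-frequency estimation can be sketched as follows; the tiny rule list is a hypothetical stand-in for the roughly 50,000 annotated WSJ sentences:

```python
from collections import Counter

# Rule occurrences read off a (tiny, hypothetical) treebank, one tuple per
# occurrence: (left-hand side, right-hand side).
observed_rules = [
    ("S", ("NP", "VP", ".")),
    ("NP", ("DT", "NN")),
    ("NP", ("PRP$", "NN", "NN")),
    ("NP", ("DT", "NN")),
    ("NN", ("bearing",)),
    ("NN", ("today",)),
]

def induce_pcfg(rules):
    """Relative-frequency estimation: Pr(rule) = count(rule) / count(its LHS)."""
    rule_counts = Counter(rules)
    lhs_counts = Counter(lhs for lhs, _ in rules)
    return {rule: n / lhs_counts[rule[0]] for rule, n in rule_counts.items()}

pcfg = induce_pcfg(observed_rules)
print(pcfg[("NP", ("DT", "NN"))])   # 2 of the 3 NP expansions
print(pcfg[("NN", ("bearing",))])   # 1 of the 2 NN expansions
```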
Simple Recurrent Network (SRN)
Elman (1990)

Feedforward neural network with recurrent connections
Processes sentences word by word
Usually trained to predict the upcoming word (i.e., the input at t+1)

[Figure: network architecture. The word input at t and a copy of the hidden activation at t−1 both feed into the hidden layer, whose state vector represents the sentence up to word t; the output layer gives estimated probabilities for the words at t+1.]
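A single SRN time step can be sketched as below; the layer sizes and the random (untrained) weights are assumptions for illustration only:

```python
import math
import random

# Minimal SRN forward pass (after Elman, 1990). Toy sizes, untrained weights.
random.seed(0)
VOCAB, HIDDEN = 5, 3  # vocabulary of 5 words, 3 hidden units

def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

W_in = rand_matrix(HIDDEN, VOCAB)    # word input -> hidden layer
W_rec = rand_matrix(HIDDEN, HIDDEN)  # copied hidden state at t-1 -> hidden layer
W_out = rand_matrix(VOCAB, HIDDEN)   # hidden layer -> output layer

def srn_step(word_index, context):
    """One time step: new hidden state and a distribution over the next word."""
    x = [1.0 if i == word_index else 0.0 for i in range(VOCAB)]  # one-hot input
    hidden = [math.tanh(sum(W_in[j][i] * x[i] for i in range(VOCAB)) +
                        sum(W_rec[j][k] * context[k] for k in range(HIDDEN)))
              for j in range(HIDDEN)]
    scores = [sum(W_out[i][j] * hidden[j] for j in range(HIDDEN))
              for i in range(VOCAB)]
    exp_s = [math.exp(s) for s in scores]          # softmax: estimated
    probs = [e / sum(exp_s) for e in exp_s]        # probabilities for t+1
    return hidden, probs

# Process a sentence word by word, copying the hidden state forward each step.
context = [0.0] * HIDDEN
for w in [0, 3, 1]:                  # a toy word-index sequence
    context, next_word_probs = srn_step(w, context)
print(sum(next_word_probs))          # the output probabilities sum to 1
```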
Word probability and reading times
Hale (2001), Levy (2008)

Surprisal theory: the more unexpected the occurrence of a word, the more time needed to process it. Formally:
− A sentence is a sequence of words: w1, w2, w3, …
− The time needed to read word wt is logarithmically related to its probability in the 'context':
RT(wt) ~ −log Pr(wt | context)   (the surprisal of wt)
− If nothing else changes, the context is just the sequence of previous words:
RT(wt) ~ −log Pr(wt | w1, …, wt−1)

Both PCFGs and SRNs can estimate Pr(wt | w1, …, wt−1). So can they predict word-reading times?
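A minimal sketch of the surprisal computation; the per-word probabilities below are made-up numbers standing in for a PCFG's or SRN's estimates of Pr(wt | w1, …, wt−1):

```python
import math

def surprisal(prob):
    """Surprisal of a word: -log2 Pr(w_t | context), in bits."""
    return -math.log2(prob)

# Hypothetical model probabilities for each word of "the cat is on the mat":
word_probs = [("the", 0.20), ("cat", 0.01), ("is", 0.30),
              ("on", 0.10), ("the", 0.40), ("mat", 0.02)]
for word, p in word_probs:
    print(f"{word:>4}: {surprisal(p):.2f} bits")
# Surprisal theory predicts reading time to grow with these values, so the
# unexpected words ("cat", "mat") should be read more slowly.
```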
Testing surprisal theory
Demberg & Keller (2008)

Reading-time data:
− Dundee corpus: approx. 2,400 sentences from The Independent newspaper editorials
− Read by 10 subjects
− Eye-movement registration
− First-pass RTs: fixation time on a word before any fixation on later words
Computation of surprisal:
− PCFG induced from WSJ treebank
− Applied to Dundee corpus sentences
− Using Brian Roark's incremental PCFG parser

But accurate word prediction is difficult because of the required world knowledge and the differences between WSJ and The Independent:
− 1988−'89 versus 2002
− general WSJ articles versus Independent editorials
− American English versus British English
− only major similarity: both are in English

Result: no significant effect of word surprisal on RT, apart from the effects of Pr(wt) and Pr(wt | wt−1)
Testing surprisal theory
Demberg & Keller (2008)

Test for a purely 'structural' (i.e., non-semantic) effect by ignoring the actual words

[Figure: the parse tree of "it has no bearing on our work force today" with the words replaced by their part-of-speech (pos) tags]

'Unlexicalized' (or 'structural') surprisal:
− PCFG induced from WSJ trees with words removed
− surprisal estimation by parsing sequences of pos-tags (instead of words) of Dundee corpus texts
− independent of semantics, so more accurate estimation possible
− but probably a weaker relation with reading times

Is a word's RT related to the predictability of its part-of-speech?

Result: yes, a statistically significant (but very small) effect of pos-surprisal on word-RT
Caveats

Statistical analysis:
− The analysis assumes independent measurements
− Surprisal theory is based on dependencies between words
− So the analysis is inconsistent with the theory

Implicit assumptions:
− The PCFG forms an accurate language model (i.e., it gives high probability to the parts-of-speech that actually occur)
− An accurate language model is also an accurate psycholinguistic model (i.e., it predicts reading times)
Solutions

Sentence-level (instead of word-level) analysis:
− Both the PCFG and the statistical analysis assume independence between sentences
− Surprisal averaged over the pos-tags in the sentence
− Total sentence RT divided by sentence length (# letters)

Measure accuracy:
a) of the language model: lower average surprisal → more accurate language model
b) of the psycholinguistic model: RT and surprisal correlate more strongly → more accurate psycholinguistic model

If a) and b) increase together, accurate language models are also accurate psycholinguistic models
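The sentence-level analysis can be sketched as follows; all numbers are hypothetical stand-ins for the Dundee measurements and the models' surprisal estimates:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical data: per-sentence lists of pos-tag surprisals,
# total reading times (ms), and sentence lengths in letters.
surprisals = [[2.1, 3.5, 1.2], [4.0, 4.4], [1.0, 1.1, 0.9, 1.3]]
total_rts = [820.0, 900.0, 760.0]
n_letters = [14, 11, 18]

# Average surprisal per sentence; RT normalized by sentence length.
mean_surprisal = [sum(s) / len(s) for s in surprisals]
rt_per_letter = [rt / n for rt, n in zip(total_rts, n_letters)]
r = pearson(mean_surprisal, rt_per_letter)
print(round(r, 3))  # stronger correlation = more accurate psycholinguistic model
```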
Comparing PCFG and SRN

PCFG:
− Train on WSJ treebank (unlexicalized)
− Parse pos-tag sequences from Dundee corpus
− Obtain a range of surprisal estimates by varying the 'beam-width' parameter, which controls parser accuracy

SRN:
− Train on sequences of pos-tags (not the trees) from WSJ
− During training (at regular intervals), process pos-tags from the Dundee corpus, obtaining a range of surprisal estimates

Evaluation (just like Demberg & Keller):
− Language model: average surprisal measures inaccuracy (and estimates language entropy)
− Psycholinguistic model: correlation between surprisals and RTs
Results

Preliminary conclusions

Both models account for a statistically significant fraction of variance in the reading-time data.
The human sentence-processing system seems to be using an accurate language model.
The SRN is the more accurate psycholinguistic model.
But PCFG and SRN together might form an even better psycholinguistic model.
Improved analysis

Linear mixed-effects regression model (to take into account random effects of subject and item)
Compare regression models that include:
− surprisal estimates by the PCFG with the largest beam width
− surprisal estimates by the fully trained SRN
− both
Also include: sentence length, word frequency, forward and backward transitional probabilities, and all significant two-way interactions between these
Results

Estimated β-coefficients (and associated p-values):

Effect of surprisal      Regression model includes…
according to…            PCFG            SRN              both
PCFG                     0.45 (p<.02)                     −0.46 (p>.2)
SRN                                      0.64 (p<.001)     1.02 (p<.01)
Conclusions

Both PCFG and SRN do account for the reading-time data to some extent
But the PCFG does not improve on the SRN's predictions
No evidence for tree structures in the mental representation of sentences
Qualitative comparison

Why does the SRN fit the data better than the PCFG? Is it more accurate on a particular group of data points, or does it perform better overall?
Take the regression analyses' residuals (i.e., the differences between predicted and measured reading times):
δi = |residi(PCFG)| − |residi(SRN)|
δi is the extent to which data point i is predicted better by the SRN than by the PCFG.
Is there a group of data points for which δ is larger than might be expected? Look at the distribution of the δs.
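The δ computation is just a pointwise comparison of absolute residuals; the residuals below are hypothetical stand-ins for the two regressions' output:

```python
import statistics

# delta_i measures how much better the SRN predicts data point i than the
# PCFG does. Hypothetical residuals (predicted minus measured reading times):
resid_pcfg = [12.0, -8.5, 30.1, -4.2, 15.7, -22.3]
resid_srn = [10.5, -7.0, 12.4, -4.8, 9.9, -11.0]

delta = [abs(p) - abs(s) for p, s in zip(resid_pcfg, resid_srn)]

# A mean above zero means the SRN predicts better on average; a strongly
# right-skewed distribution would instead point to a particular subset of
# data points being predicted better.
print(statistics.mean(delta))
print(statistics.median(delta))
```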
Possible distributions of δ

[Figure: three sketched distributions of δ:
− symmetrical, mean is 0: no difference between SRN and PCFG (only random noise)
− right-shifted: overall better predictions by SRN
− right-skewed: a particular subset of the data is predicted better by SRN]

Test for symmetry

If the distribution of δ is asymmetric, particular data points are predicted better by the SRN than by the PCFG.
The distribution is not significantly asymmetric (two-sample Kolmogorov-Smirnov test, p>.17)
The SRN seems to be a more accurate psycholinguistic model overall.
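One way to run such a symmetry check is to compare the δs with their mirror image using a two-sample Kolmogorov-Smirnov statistic. This is a sketch under assumptions: the data are made up, and a real test would also compute a p-value from the statistic.

```python
def ks_statistic(xs, ys):
    """Two-sample KS statistic: max vertical distance between empirical CDFs."""
    xs, ys = sorted(xs), sorted(ys)
    points = sorted(set(xs + ys))
    def ecdf(sample, v):
        return sum(1 for s in sample if s <= v) / len(sample)
    return max(abs(ecdf(xs, v) - ecdf(ys, v)) for v in points)

# Hypothetical, roughly symmetric deltas; a symmetric distribution should
# give a small statistic when compared against its own mirror image.
delta = [1.5, -1.4, 0.2, -0.3, 2.1, -2.0, 0.8, -0.7]
mirrored = [-d for d in delta]
print(ks_statistic(delta, mirrored))  # small value: no evidence of asymmetry
```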
Questions I cannot (yet) answer (but would like to)

Why does the SRN fit the data better than the PCFG?
Perhaps people…
− are bad at dealing with long-distance dependencies in a sentence?
− store information about the frequency of multi-word sequences?

Questions I cannot (yet) answer (but would like to)

In general:
− Is this SRN a more accurate psychological model than this PCFG?
− Are SRNs more accurate psychological models than PCFGs?
− Does connectionism make for more accurate psychological models than grammar-based theories?
− What kind of representations are used in human sentence processing?

Surprisal-based model evaluation may provide some answers