Title: The mental representation of sentences
1. The mental representation of sentences: Tree structures or state vectors?
Stefan Frank (S.L.Frank_at_uva.nl)
2. With help from Vera Demberg
3. Understanding a sentence: the very general picture
[Diagram: comprehension maps a word sequence (e.g., "the cat is on the mat") onto a meaning]
4. Theories of mental representation
- sentence structure
- logical form: ∃x,y cat(x) ∧ mat(y) ∧ on(x,y)
- tree structure
- conceptual network
- state vector or activation pattern
- perceptual simulation
5. Grammar-based vs. connectionist models
- The debate (or battle) between the two camps focuses on particular (psycho-)linguistic phenomena, e.g.:
- Grammars account for the productivity and systematicity of language (Fodor & Pylyshyn, 1988; Marcus, 1998)
- Connectionism can explain why there is no (unlimited) productivity and (pure) systematicity (Christiansen & Chater, 1999)
6. From theories to models
- Implemented computational models can be evaluated and compared more thoroughly than mere theories
- Take a common grammar-based model and a common connectionist model
- Compare their ability to predict empirical data (measurements of word-reading time)
Probabilistic Context-Free Grammar (PCFG) versus Simple Recurrent Network (SRN)
7. Probabilistic Context-Free Grammar
- A context-free grammar with a probability for each production rule (conditional on the rule's left-hand side)
- The probability of a tree structure is the product of the probabilities of the rules involved in its construction (a toy example follows below).
- The probability of a sentence is the sum of the probabilities of all its grammatical tree structures.
- Rules and their probabilities can be induced from a large corpus of syntactically annotated sentences (a treebank).
- Wall Street Journal treebank: approx. 50,000 sentences from WSJ newspaper articles (1988-1989)
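To make the rule-probability arithmetic concrete, here is a minimal Python sketch; the grammar, its probabilities, and the parse are invented toy values, not taken from the WSJ treebank:

```python
from functools import reduce

# Toy PCFG: each rule's probability is conditional on its left-hand side,
# so the probabilities of all rules sharing an LHS sum to 1.
# (Rules and numbers are invented for illustration only.)
pcfg = {
    ("S",  ("NP", "VP")):  1.0,
    ("NP", ("DT", "NN")):  0.6,
    ("NP", ("PRP",)):      0.4,
    ("VP", ("VBZ", "NP")): 1.0,
}

def tree_probability(rules_used):
    """Probability of a tree = product of the probabilities of its rules."""
    return reduce(lambda p, rule: p * pcfg[rule], rules_used, 1.0)

# Rules used in one hypothetical parse:
parse = [("S", ("NP", "VP")), ("NP", ("PRP",)),
         ("VP", ("VBZ", "NP")), ("NP", ("DT", "NN"))]
p_tree = tree_probability(parse)   # 1.0 * 0.4 * 1.0 * 0.6 = 0.24

# Probability of the sentence = sum over all of its grammatical parses;
# here we pretend this is the only parse.
p_sentence = sum([p_tree])
print(p_tree, p_sentence)
```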
8. Inducing a PCFG
[Example parse tree for "It has no bearing on our work force today", with phrase labels (S, NP, VP, PP) and part-of-speech tags (PRP, VBZ, DT, NN, IN)]
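A minimal sketch of the induction step, under the assumption that trees are represented as nested lists (a stand-in for the real treebank format): count how often each rule occurs and convert the counts to relative frequencies, conditioned on the left-hand side.

```python
from collections import Counter, defaultdict

# A tree as nested lists: [label, child, child, ...]; leaves are words.
# This single toy tree stands in for a full treebank.
tree = ["S",
        ["NP", ["PRP", "It"]],
        ["VP", ["VBZ", "has"],
               ["NP", ["DT", "no"], ["NN", "bearing"]]]]

def productions(node):
    """Yield (lhs, rhs) rules for every internal node of a tree."""
    if isinstance(node, str):        # a leaf (a word): no rule
        return
    lhs, children = node[0], node[1:]
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield (lhs, rhs)
    for c in children:
        yield from productions(c)

rule_counts = Counter(productions(tree))

# Relative-frequency estimation: P(lhs -> rhs) = count(lhs -> rhs) / count(lhs)
lhs_totals = defaultdict(int)
for (lhs, _), n in rule_counts.items():
    lhs_totals[lhs] += n
pcfg = {rule: n / lhs_totals[rule[0]] for rule, n in rule_counts.items()}

for (lhs, rhs), p in sorted(pcfg.items()):
    print(f"{lhs} -> {' '.join(rhs)}  [{p:.2f}]")
```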
9. Simple Recurrent Network (Elman, 1990)
- Feedforward neural network with recurrent connections
- Processes sentences word by word
- Usually trained to predict the upcoming word (i.e., the input at t+1)
[Diagram: the word input at t and a copy of the hidden activation at t-1 feed the hidden layer, which feeds the output layer]
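A minimal sketch of the architecture in NumPy (random untrained weights, toy sizes, no training loop), just to show how the copy of the previous hidden state enters the computation:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_size = 10, 8   # toy sizes, chosen arbitrarily

# Weights: input-to-hidden, context (previous hidden)-to-hidden, hidden-to-output
W_ih = rng.normal(scale=0.1, size=(hidden_size, vocab_size))
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
W_ho = rng.normal(scale=0.1, size=(vocab_size, hidden_size))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def srn_step(word_index, prev_hidden):
    """One time step: combine the current input with the copy of the previous
    hidden state, and output a probability distribution over the next word."""
    x = np.zeros(vocab_size)
    x[word_index] = 1.0                                  # one-hot input at t
    hidden = np.tanh(W_ih @ x + W_hh @ prev_hidden)      # uses hidden at t-1
    return softmax(W_ho @ hidden), hidden

# Process a toy "sentence" (a sequence of word indices), word by word
hidden = np.zeros(hidden_size)
for w in [3, 1, 4, 1, 5]:
    next_word_probs, hidden = srn_step(w, hidden)
print(next_word_probs.round(3))
```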
10. Word probability and reading times (Hale, 2001; Levy, 2008)
- Surprisal theory: the more unexpected the occurrence of a word, the more time needed to process it. Formally:
- A sentence is a sequence of words w_1, w_2, w_3, ...
- The time needed to read word w_t is logarithmically related to its probability in the context: RT(w_t) ∝ -log Pr(w_t | context)
- If nothing else changes, the context is just the sequence of previous words: RT(w_t) ∝ -log Pr(w_t | w_1, ..., w_{t-1})
- Both PCFGs and SRNs can estimate Pr(w_t | w_1, ..., w_{t-1})
- So can they predict word-reading times?
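A toy illustration of the surprisal computation; the conditional probabilities below are invented, not estimated from any corpus:

```python
import math

# Invented conditional probabilities Pr(w_t | w_1 ... w_{t-1}) for one sentence
context_probs = {"the": 0.20, "cat": 0.01, "is": 0.30,
                 "on": 0.15, "the(2)": 0.60, "mat": 0.05}

for word, p in context_probs.items():
    surprisal = -math.log2(p)   # in bits; the natural log works equally well
    print(f"{word:7s} Pr = {p:.2f}  surprisal = {surprisal:.2f} bits")

# Surprisal theory: reading time grows with surprisal, so "cat"
# (Pr = .01, 6.6 bits) should take longer to read than "is" (Pr = .30, 1.7 bits).
```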
11. Testing surprisal theory (Demberg & Keller, 2008)
- Reading-time data
  - Dundee corpus: approx. 2,400 sentences from The Independent newspaper editorials
  - Read by 10 subjects
  - Eye-movement registration
  - First-pass RTs: fixation time on a word before any fixation on later words
- Computation of surprisal
  - PCFG induced from the WSJ treebank
  - Applied to Dundee corpus sentences
  - Using Brian Roark's incremental PCFG parser
12. Testing surprisal theory (Demberg & Keller, 2008)
Result: no significant effect of word surprisal on RT, apart from the effects of Pr(w_t) and Pr(w_t | w_{t-1})
- But accurate word prediction is difficult because of:
  - required world knowledge
  - differences between the WSJ and The Independent:
    - 1988-89 versus 2002
    - general WSJ articles versus Independent editorials
    - American English versus British English
    - only major similarity: both are in English
Test for a purely structural (i.e., non-semantic) effect by ignoring the actual words
13. [The parse tree from slide 8 with the words removed, leaving only phrase labels and part-of-speech tags]
14. Testing surprisal theory (Demberg & Keller, 2008)
- Unlexicalized (or structural) surprisal:
  - PCFG induced from WSJ trees with the words removed
  - surprisal estimated by parsing sequences of POS tags (instead of words) of the Dundee corpus texts
  - independent of semantics, so more accurate estimation is possible
  - but probably a weaker relation with reading times
- Is a word's RT related to the predictability of its part of speech?
- Result: yes, a statistically significant (but very small) effect of POS surprisal on word RT
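A minimal sketch of the delexicalization step, again assuming the nested-list tree format used above: drop the words so that only phrase labels and POS tags remain, and read off the POS-tag sequence that the unlexicalized PCFG would parse.

```python
# Tree as nested lists: [label, child, ...]; leaves are words (strings).
tree = ["S",
        ["NP", ["PRP", "It"]],
        ["VP", ["VBZ", "has"],
               ["NP", ["DT", "no"], ["NN", "bearing"]]]]

def delexicalize(node):
    """Remove the words, keeping only phrase labels and POS tags."""
    if isinstance(node, str):                 # a word: drop it
        return None
    children = [delexicalize(c) for c in node[1:]]
    return [node[0]] + [c for c in children if c is not None]

def pos_tags(node):
    """The POS-tag sequence of a tree: its pre-terminal labels, left to right."""
    if isinstance(node[1], str):              # pre-terminal node: [tag, word]
        return [node[0]]
    return [t for c in node[1:] for t in pos_tags(c)]

print(delexicalize(tree))   # ['S', ['NP', ['PRP']], ['VP', ['VBZ'], ...]]
print(pos_tags(tree))       # ['PRP', 'VBZ', 'DT', 'NN']
```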
15. Caveats
- Statistical analysis
  - The analysis assumes independent measurements
  - Surprisal theory is based on dependencies between words
  - So the analysis is inconsistent with the theory
- Implicit assumptions
  - The PCFG forms an accurate language model (i.e., it gives high probability to the parts of speech that actually occur)
  - An accurate language model is also an accurate psycholinguistic model (i.e., it predicts reading times)
16. Solutions
- Sentence-level (instead of word-level) analysis:
  - Both the PCFG and the statistical analysis assume independence between sentences
  - Surprisal averaged over the POS tags in the sentence
  - Total sentence RT divided by sentence length (in letters)
- Measure accuracy (see the sketch below):
  a) of the language model: lower average surprisal → more accurate language model
  b) of the psycholinguistic model: RT and surprisal correlate more strongly → more accurate psycholinguistic model
- If a) and b) increase together, accurate language models are also accurate psycholinguistic models
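A minimal sketch of these sentence-level measures, with made-up arrays standing in for the real Dundee data (per-POS-tag surprisal values, total sentence reading times, and sentence lengths in letters):

```python
import numpy as np

# Made-up data for three sentences: per-POS-tag surprisal values,
# total sentence reading times (ms), and sentence lengths in letters.
word_surprisals = [[2.1, 4.0, 1.3], [3.5, 2.2, 5.1, 0.9], [1.8, 2.7]]
total_rts       = np.array([820.0, 1140.0, 560.0])
n_letters       = np.array([14, 21, 9])

# Sentence-level predictor and dependent variable
mean_surprisal = np.array([np.mean(s) for s in word_surprisals])
rt_per_letter  = total_rts / n_letters

# (a) language-model accuracy: lower average surprisal = more accurate
avg_surprisal = np.mean(np.concatenate([np.array(s) for s in word_surprisals]))

# (b) psycholinguistic accuracy: correlation between surprisal and reading time
r = np.corrcoef(mean_surprisal, rt_per_letter)[0, 1]
print(f"average surprisal = {avg_surprisal:.2f}, surprisal-RT correlation r = {r:.2f}")
```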
17. Comparing PCFG and SRN
- PCFG
  - Train on the WSJ treebank (unlexicalized)
  - Parse POS-tag sequences from the Dundee corpus
  - Obtain a range of surprisal estimates by varying the beam-width parameter, which controls parser accuracy
- SRN
  - Train on sequences of POS tags (not the trees) from the WSJ
  - During training (at regular intervals), process the POS tags from the Dundee corpus, obtaining a range of surprisal estimates
- Evaluation (just like Demberg & Keller)
  - Language model: average surprisal measures inaccuracy (and estimates language entropy)
  - Psycholinguistic model: correlation between surprisals and RTs
18. Results
[Plots of the results for the PCFG and the SRN]
19. Preliminary conclusions
- Both models account for a statistically significant fraction of the variance in the reading-time data.
- The human sentence-processing system seems to be using an accurate language model.
- The SRN is the more accurate psycholinguistic model.
But the PCFG and the SRN together might form an even better psycholinguistic model
20. Improved analysis
- Linear mixed-effects regression model, to take into account random effects of subject and item (a sketch follows below)
- Compare regression models that include:
  - surprisal estimates by the PCFG with the largest beam width
  - surprisal estimates by the fully trained SRN
  - both
- Also include sentence length, word frequency, forward and backward transitional probabilities, and all significant two-way interactions between these
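A hedged sketch of such an analysis using statsmodels; the data file and column names are invented, and for simplicity only a by-subject random intercept is shown (the analysis described here also had random effects of item and further control predictors and interactions):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data frame: one row per subject x sentence, with sentence-level
# reading time and the two models' surprisal estimates (column names invented).
data = pd.read_csv("dundee_sentence_level.csv")   # assumed to exist

# Three regression models: PCFG surprisal only, SRN surprisal only, and both.
formulas = {
    "PCFG": "rt ~ surprisal_pcfg + sentence_length + word_frequency",
    "SRN":  "rt ~ surprisal_srn + sentence_length + word_frequency",
    "both": "rt ~ surprisal_pcfg + surprisal_srn + sentence_length + word_frequency",
}

for name, formula in formulas.items():
    # Random intercept per subject; the full analysis also needs item effects.
    model = smf.mixedlm(formula, data, groups=data["subject"])
    fit = model.fit()
    print(name, fit.params.filter(like="surprisal"), sep="\n")
```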
21. Results
Estimated β-coefficients (and associated p-values)
                                          Regression model includes
                                          PCFG              SRN               both
Effect of surprisal according to PCFG     0.45 (p < .02)
Effect of surprisal according to SRN
22. Results
Estimated β-coefficients (and associated p-values)
                                          Regression model includes
                                          PCFG              SRN               both
Effect of surprisal according to PCFG     0.45 (p < .02)
Effect of surprisal according to SRN                        0.64 (p < .001)
23. Results
Estimated β-coefficients (and associated p-values)
                                          Regression model includes
                                          PCFG              SRN               both
Effect of surprisal according to PCFG     0.45 (p < .02)                      -0.46 (p > .2)
Effect of surprisal according to SRN                        0.64 (p < .001)   1.02 (p < .01)
24. Conclusions
- Both the PCFG and the SRN account for the reading-time data to some extent
- But the PCFG does not improve on the SRN's predictions
No evidence for tree structures in the mental representation of sentences
25. Qualitative comparison
- Why does the SRN fit the data better than the PCFG?
- Is it more accurate on a particular group of data points, or does it perform better overall?
- Take the residuals of the regression analyses (i.e., the differences between predicted and measured reading times): d_i = resid_i(PCFG) - resid_i(SRN)
- d_i is the extent to which data point i is predicted better by the SRN than by the PCFG.
- Is there a group of data points for which d is larger than might be expected?
- Look at the distribution of the d values.
26. Possible distributions of d
- Symmetrical, mean 0: no difference between SRN and PCFG (only random noise)
- Right-shifted: overall better predictions by the SRN
- Right-skewed: a particular subset of the data is predicted better by the SRN
27. Test for symmetry
- If the distribution of d is asymmetric, particular data points are predicted better by the SRN than by the PCFG.
- The distribution is not significantly asymmetric (two-sample Kolmogorov-Smirnov test, p > .17)
- The SRN seems to be a more accurate psycholinguistic model overall.
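The slides do not spell out how the KS test was set up; one way to test symmetry (a sketch under that assumption, with random stand-in data) is to center the d values and compare them with their mirror image, so that an overall shift does not count as asymmetry:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
# Stand-in residual differences d_i = resid_i(PCFG) - resid_i(SRN); here random.
d = rng.normal(loc=0.3, scale=1.0, size=500)

# Compare the centered d values with their mirror image: a pure shift
# (overall better SRN predictions) does not register as asymmetry; skew does.
centered = d - np.median(d)
stat, p = ks_2samp(centered, -centered)
print(f"KS statistic = {stat:.3f}, p = {p:.3f}")
```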
28. Questions I cannot (yet) answer (but would like to)
- Why does the SRN fit the data better than the PCFG?
- Perhaps people:
  - are bad at dealing with long-distance dependencies in a sentence?
  - store information about the frequency of multi-word sequences?
29. Questions I cannot (yet) answer (but would like to)
- In general:
  - Is this SRN a more accurate psychological model than this PCFG?
  - Are SRNs more accurate psychological models than PCFGs?
  - Does connectionism make for more accurate psychological models than grammar-based theories?
  - What kind of representations are used in human sentence processing?
Surprisal-based model evaluation may provide some answers