Title: The mental representation of sentences
1. The mental representation of sentences: Tree structures or state vectors?
Stefan Frank (S.L.Frank_at_uva.nl)
2. With help from Vera Demberg
3. Understanding a sentence: the very general picture
[Diagram: comprehension maps a word sequence (e.g., "the cat is on the mat") onto a meaning]
4. Theories of mental representation
- sentence structure
- logical form: ∃x,y cat(x) ∧ mat(y) ∧ on(x,y)
- tree structure
- conceptual network
- state vector or activation pattern
- perceptual simulation
5. Grammar-based vs. connectionist models
- The debate (or battle) between the two camps focuses on particular (psycho-)linguistic phenomena, e.g.:
- Grammars account for the productivity and systematicity of language (Fodor & Pylyshyn, 1988; Marcus, 1998)
- Connectionism can explain why there is no (unlimited) productivity and (pure) systematicity (Christiansen & Chater, 1999)
6. From theories to models
- Implemented computational models can be evaluated and compared more thoroughly than mere theories
- Take a common grammar-based model and a common connectionist model
- Compare their ability to predict empirical data (measurements of word-reading time)
Probabilistic Context-Free Grammar (PCFG) versus Simple Recurrent Network (SRN)
7. Probabilistic Context-Free Grammar
- A context-free grammar with a probability for each production rule (conditional on the rule's left-hand side)
- The probability of a tree structure is the product of the probabilities of the rules involved in its construction (a toy example follows below).
- The probability of a sentence is the sum of the probabilities of all its grammatical tree structures.
- Rules and their probabilities can be induced from a large corpus of syntactically annotated sentences (a treebank).
- Wall Street Journal treebank: approx. 50,000 sentences from WSJ newspaper articles (1988-1989)
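To make the rule-probability arithmetic concrete, here is a minimal Python sketch; the grammar, its probabilities, and the parse are invented toy values, not taken from the WSJ treebank:

```python
from functools import reduce

# Toy PCFG: each rule's probability is conditional on its left-hand side,
# so the probabilities of all rules sharing an LHS sum to 1.
# (Rules and numbers are invented for illustration only.)
pcfg = {
    ("S",  ("NP", "VP")):  1.0,
    ("NP", ("DT", "NN")):  0.6,
    ("NP", ("PRP",)):      0.4,
    ("VP", ("VBZ", "NP")): 1.0,
}

def tree_probability(rules_used):
    """Probability of a tree = product of the probabilities of its rules."""
    return reduce(lambda p, rule: p * pcfg[rule], rules_used, 1.0)

# Rules used in one hypothetical parse:
parse = [("S", ("NP", "VP")), ("NP", ("PRP",)),
         ("VP", ("VBZ", "NP")), ("NP", ("DT", "NN"))]
p_tree = tree_probability(parse)   # 1.0 * 0.4 * 1.0 * 0.6 = 0.24

# Probability of the sentence = sum over all of its grammatical parses;
# here we pretend this is the only parse.
p_sentence = sum([p_tree])
print(p_tree, p_sentence)
```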
8. Inducing a PCFG
[Example parse tree for "It has no bearing on our work force today", with phrase labels (S, NP, VP, PP) and part-of-speech tags (PRP, VBZ, DT, NN, IN)]
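A minimal sketch of the induction step, under the assumption that trees are represented as nested lists (a stand-in for the real treebank format): count how often each rule occurs and convert the counts to relative frequencies, conditioned on the left-hand side.

```python
from collections import Counter, defaultdict

# A tree as nested lists: [label, child, child, ...]; leaves are words.
# This single toy tree stands in for a full treebank.
tree = ["S",
        ["NP", ["PRP", "It"]],
        ["VP", ["VBZ", "has"],
               ["NP", ["DT", "no"], ["NN", "bearing"]]]]

def productions(node):
    """Yield (lhs, rhs) rules for every internal node of a tree."""
    if isinstance(node, str):        # a leaf (a word): no rule
        return
    lhs, children = node[0], node[1:]
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield (lhs, rhs)
    for c in children:
        yield from productions(c)

rule_counts = Counter(productions(tree))

# Relative-frequency estimation: P(lhs -> rhs) = count(lhs -> rhs) / count(lhs)
lhs_totals = defaultdict(int)
for (lhs, _), n in rule_counts.items():
    lhs_totals[lhs] += n
pcfg = {rule: n / lhs_totals[rule[0]] for rule, n in rule_counts.items()}

for (lhs, rhs), p in sorted(pcfg.items()):
    print(f"{lhs} -> {' '.join(rhs)}  [{p:.2f}]")
```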
9. Simple Recurrent Network (Elman, 1990)
- Feedforward neural network with recurrent connections
- Processes sentences word by word
- Usually trained to predict the upcoming word (i.e., the input at t+1)
[Diagram: the word input at t and a copy of the hidden activation at t-1 feed the hidden layer, which feeds the output layer]
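A minimal sketch of the architecture in NumPy (random untrained weights, toy sizes, no training loop), just to show how the copy of the previous hidden state enters the computation:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_size = 10, 8   # toy sizes, chosen arbitrarily

# Weights: input-to-hidden, context (previous hidden)-to-hidden, hidden-to-output
W_ih = rng.normal(scale=0.1, size=(hidden_size, vocab_size))
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
W_ho = rng.normal(scale=0.1, size=(vocab_size, hidden_size))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def srn_step(word_index, prev_hidden):
    """One time step: combine the current input with the copy of the previous
    hidden state, and output a probability distribution over the next word."""
    x = np.zeros(vocab_size)
    x[word_index] = 1.0                                  # one-hot input at t
    hidden = np.tanh(W_ih @ x + W_hh @ prev_hidden)      # uses hidden at t-1
    return softmax(W_ho @ hidden), hidden

# Process a toy "sentence" (a sequence of word indices), word by word
hidden = np.zeros(hidden_size)
for w in [3, 1, 4, 1, 5]:
    next_word_probs, hidden = srn_step(w, hidden)
print(next_word_probs.round(3))
```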
10. Word probability and reading times (Hale, 2001; Levy, 2008)
- Surprisal theory: the more unexpected the occurrence of a word, the more time needed to process it. Formally:
- A sentence is a sequence of words w_1, w_2, w_3, ...
- The time needed to read word w_t is logarithmically related to its probability in the context: RT(w_t) ∝ -log Pr(w_t | context)
- If nothing else changes, the context is just the sequence of previous words: RT(w_t) ∝ -log Pr(w_t | w_1, ..., w_{t-1})
- Both PCFGs and SRNs can estimate Pr(w_t | w_1, ..., w_{t-1})
- So can they predict word-reading times?
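A toy illustration of the surprisal computation; the conditional probabilities below are invented, not estimated from any corpus:

```python
import math

# Invented conditional probabilities Pr(w_t | w_1 ... w_{t-1}) for one sentence
context_probs = {"the": 0.20, "cat": 0.01, "is": 0.30,
                 "on": 0.15, "the(2)": 0.60, "mat": 0.05}

for word, p in context_probs.items():
    surprisal = -math.log2(p)   # in bits; the natural log works equally well
    print(f"{word:7s} Pr = {p:.2f}  surprisal = {surprisal:.2f} bits")

# Surprisal theory: reading time grows with surprisal, so "cat"
# (Pr = .01, 6.6 bits) should take longer to read than "is" (Pr = .30, 1.7 bits).
```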
11. Testing surprisal theory (Demberg & Keller, 2008)
- Reading-time data
  - Dundee corpus: approx. 2,400 sentences from The Independent newspaper editorials
  - Read by 10 subjects
  - Eye-movement registration
  - First-pass RTs: fixation time on a word before any fixation on later words
- Computation of surprisal
  - PCFG induced from the WSJ treebank
  - Applied to Dundee corpus sentences
  - Using Brian Roark's incremental PCFG parser
12. Testing surprisal theory (Demberg & Keller, 2008)
Result: no significant effect of word surprisal on RT, apart from the effects of Pr(w_t) and Pr(w_t | w_{t-1})
- But accurate word prediction is difficult because of:
  - required world knowledge
  - differences between the WSJ and The Independent:
    - 1988-89 versus 2002
    - general WSJ articles versus Independent editorials
    - American English versus British English
    - only major similarity: both are in English
Test for a purely structural (i.e., non-semantic) effect by ignoring the actual words
13. [The parse tree from slide 8 with the words removed, leaving only phrase labels and part-of-speech tags]
14. Testing surprisal theory (Demberg & Keller, 2008)
- Unlexicalized (or structural) surprisal:
  - PCFG induced from WSJ trees with the words removed
  - surprisal estimated by parsing sequences of POS tags (instead of words) of the Dundee corpus texts
  - independent of semantics, so more accurate estimation is possible
  - but probably a weaker relation with reading times
- Is a word's RT related to the predictability of its part of speech?
- Result: yes, a statistically significant (but very small) effect of POS surprisal on word RT
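A minimal sketch of the delexicalization step, again assuming the nested-list tree format used above: drop the words so that only phrase labels and POS tags remain, and read off the POS-tag sequence that the unlexicalized PCFG would parse.

```python
# Tree as nested lists: [label, child, ...]; leaves are words (strings).
tree = ["S",
        ["NP", ["PRP", "It"]],
        ["VP", ["VBZ", "has"],
               ["NP", ["DT", "no"], ["NN", "bearing"]]]]

def delexicalize(node):
    """Remove the words, keeping only phrase labels and POS tags."""
    if isinstance(node, str):                 # a word: drop it
        return None
    children = [delexicalize(c) for c in node[1:]]
    return [node[0]] + [c for c in children if c is not None]

def pos_tags(node):
    """The POS-tag sequence of a tree: its pre-terminal labels, left to right."""
    if isinstance(node[1], str):              # pre-terminal node: [tag, word]
        return [node[0]]
    return [t for c in node[1:] for t in pos_tags(c)]

print(delexicalize(tree))   # ['S', ['NP', ['PRP']], ['VP', ['VBZ'], ...]]
print(pos_tags(tree))       # ['PRP', 'VBZ', 'DT', 'NN']
```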
15. Caveats
- Statistical analysis
  - The analysis assumes independent measurements
  - Surprisal theory is based on dependencies between words
  - So the analysis is inconsistent with the theory
- Implicit assumptions
  - The PCFG forms an accurate language model (i.e., it gives high probability to the parts of speech that actually occur)
  - An accurate language model is also an accurate psycholinguistic model (i.e., it predicts reading times)
16. Solutions
- Sentence-level (instead of word-level) analysis:
  - Both the PCFG and the statistical analysis assume independence between sentences
  - Surprisal averaged over the POS tags in the sentence
  - Total sentence RT divided by sentence length (in letters)
- Measure accuracy (see the sketch below):
  a) of the language model: lower average surprisal → more accurate language model
  b) of the psycholinguistic model: RT and surprisal correlate more strongly → more accurate psycholinguistic model
- If a) and b) increase together, accurate language models are also accurate psycholinguistic models
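A minimal sketch of these sentence-level measures, with made-up arrays standing in for the real Dundee data (per-POS-tag surprisal values, total sentence reading times, and sentence lengths in letters):

```python
import numpy as np

# Made-up data for three sentences: per-POS-tag surprisal values,
# total sentence reading times (ms), and sentence lengths in letters.
word_surprisals = [[2.1, 4.0, 1.3], [3.5, 2.2, 5.1, 0.9], [1.8, 2.7]]
total_rts       = np.array([820.0, 1140.0, 560.0])
n_letters       = np.array([14, 21, 9])

# Sentence-level predictor and dependent variable
mean_surprisal = np.array([np.mean(s) for s in word_surprisals])
rt_per_letter  = total_rts / n_letters

# (a) language-model accuracy: lower average surprisal = more accurate
avg_surprisal = np.mean(np.concatenate([np.array(s) for s in word_surprisals]))

# (b) psycholinguistic accuracy: correlation between surprisal and reading time
r = np.corrcoef(mean_surprisal, rt_per_letter)[0, 1]
print(f"average surprisal = {avg_surprisal:.2f}, surprisal-RT correlation r = {r:.2f}")
```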
17. Comparing PCFG and SRN
- PCFG
  - Train on the WSJ treebank (unlexicalized)
  - Parse POS-tag sequences from the Dundee corpus
  - Obtain a range of surprisal estimates by varying the beam-width parameter, which controls parser accuracy
- SRN
  - Train on sequences of POS tags (not the trees) from the WSJ
  - During training (at regular intervals), process the POS tags from the Dundee corpus, obtaining a range of surprisal estimates
- Evaluation (just like Demberg & Keller)
  - Language model: average surprisal measures inaccuracy (and estimates language entropy)
  - Psycholinguistic model: correlation between surprisals and RTs
18. Results
[Plots of the results for the PCFG and the SRN]
19. Preliminary conclusions
- Both models account for a statistically significant fraction of the variance in the reading-time data.
- The human sentence-processing system seems to be using an accurate language model.
- The SRN is the more accurate psycholinguistic model.
But the PCFG and the SRN together might form an even better psycholinguistic model
20. Improved analysis
- Linear mixed-effects regression model, to take into account random effects of subject and item (a sketch follows below)
- Compare regression models that include:
  - surprisal estimates by the PCFG with the largest beam width
  - surprisal estimates by the fully trained SRN
  - both
- Also include sentence length, word frequency, forward and backward transitional probabilities, and all significant two-way interactions between these
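A hedged sketch of such an analysis using statsmodels; the data file and column names are invented, and for simplicity only a by-subject random intercept is shown (the analysis described here also had random effects of item and further control predictors and interactions):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data frame: one row per subject x sentence, with sentence-level
# reading time and the two models' surprisal estimates (column names invented).
data = pd.read_csv("dundee_sentence_level.csv")   # assumed to exist

# Three regression models: PCFG surprisal only, SRN surprisal only, and both.
formulas = {
    "PCFG": "rt ~ surprisal_pcfg + sentence_length + word_frequency",
    "SRN":  "rt ~ surprisal_srn + sentence_length + word_frequency",
    "both": "rt ~ surprisal_pcfg + surprisal_srn + sentence_length + word_frequency",
}

for name, formula in formulas.items():
    # Random intercept per subject; the full analysis also needs item effects.
    model = smf.mixedlm(formula, data, groups=data["subject"])
    fit = model.fit()
    print(name, fit.params.filter(like="surprisal"), sep="\n")
```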
21. Results
Estimated β-coefficients (and associated p-values)
                                          Regression model includes
                                          PCFG              SRN               both
Effect of surprisal according to PCFG     0.45 (p < .02)
Effect of surprisal according to SRN
22. Results
Estimated β-coefficients (and associated p-values)
                                          Regression model includes
                                          PCFG              SRN               both
Effect of surprisal according to PCFG     0.45 (p < .02)
Effect of surprisal according to SRN                        0.64 (p < .001)
23. Results
Estimated β-coefficients (and associated p-values)
                                          Regression model includes
                                          PCFG              SRN               both
Effect of surprisal according to PCFG     0.45 (p < .02)                      -0.46 (p > .2)
Effect of surprisal according to SRN                        0.64 (p < .001)   1.02 (p < .01)
24. Conclusions
- Both the PCFG and the SRN account for the reading-time data to some extent
- But the PCFG does not improve on the SRN's predictions
No evidence for tree structures in the mental representation of sentences
25. Qualitative comparison
- Why does the SRN fit the data better than the PCFG?
- Is it more accurate on a particular group of data points, or does it perform better overall?
- Take the residuals of the regression analyses (i.e., the differences between predicted and measured reading times): d_i = resid_i(PCFG) - resid_i(SRN)
- d_i is the extent to which data point i is predicted better by the SRN than by the PCFG.
- Is there a group of data points for which d is larger than might be expected?
- Look at the distribution of the d values.
26. Possible distributions of d
- Symmetrical, mean 0: no difference between SRN and PCFG (only random noise)
- Right-shifted: overall better predictions by the SRN
- Right-skewed: a particular subset of the data is predicted better by the SRN
27. Test for symmetry
- If the distribution of d is asymmetric, particular data points are predicted better by the SRN than by the PCFG.
- The distribution is not significantly asymmetric (two-sample Kolmogorov-Smirnov test, p > .17)
- The SRN seems to be a more accurate psycholinguistic model overall.
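The slides do not spell out how the KS test was set up; one way to test symmetry (a sketch under that assumption, with random stand-in data) is to center the d values and compare them with their mirror image, so that an overall shift does not count as asymmetry:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
# Stand-in residual differences d_i = resid_i(PCFG) - resid_i(SRN); here random.
d = rng.normal(loc=0.3, scale=1.0, size=500)

# Compare the centered d values with their mirror image: a pure shift
# (overall better SRN predictions) does not register as asymmetry; skew does.
centered = d - np.median(d)
stat, p = ks_2samp(centered, -centered)
print(f"KS statistic = {stat:.3f}, p = {p:.3f}")
```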
28. Questions I cannot (yet) answer (but would like to)
- Why does the SRN fit the data better than the PCFG?
- Perhaps people:
  - are bad at dealing with long-distance dependencies in a sentence?
  - store information about the frequency of multi-word sequences?
29. Questions I cannot (yet) answer (but would like to)
- In general:
  - Is this SRN a more accurate psychological model than this PCFG?
  - Are SRNs more accurate psychological models than PCFGs?
  - Does connectionism make for more accurate psychological models than grammar-based theories?
  - What kind of representations are used in human sentence processing?
Surprisal-based model evaluation may provide some answers