1
The mental representation of sentences: Tree
structures or state vectors?
Stefan Frank (S.L.Frank@uva.nl)
2
With help from
Vera Demberg
3
Understanding a sentence: the very general picture
[Diagram: word sequence ("the cat is on the mat") → comprehension → meaning]
4
Theories of mental representation
  • Sentence meaning:
    • logical form: ∃x,y: cat(x) ∧ mat(y) ∧ on(x,y)
    • conceptual network
    • perceptual simulation
  • Sentence structure:
    • tree structure
    • state vector or activation pattern
5
Grammar-based vs. connectionist models
  • The debate (or battle) between the two camps
    focuses on particular (psycho-)linguistic
    phenomena, e.g.:
  • Grammars account for the productivity and
    systematicity of language (Fodor & Pylyshyn,
    1988; Marcus, 1998)
  • Connectionism can explain why there is no
    (unlimited) productivity and (pure)
    systematicity (Christiansen & Chater, 1999)

6
From theories to models
  • Implemented computational models can be evaluated
    and compared more thoroughly than mere theories
  • Take a common grammar-based model and a common
    connectionist model
  • Compare their ability to predict empirical data
    (measurements of word-reading time)

Probabilistic Context-Free Grammar (PCFG)
versus
Simple Recurrent Network (SRN)
7
Probabilistic Context-Free Grammar
  • A context-free grammar with a probability for
    each production rule (conditional on the rule's
    left-hand side)
  • The probability of a tree structure is the
    product of probabilities of the rules involved in
    its construction.
  • The probability of a sentence is the sum of
    probabilities of all its grammatical tree
    structures.
  • Rules and their probabilities can be induced from
    a large corpus of syntactically annotated
    sentences (a treebank); a toy induction sketch
    follows below
  • Wall Street Journal (WSJ) treebank: approx. 50,000
    sentences from WSJ newspaper articles (1988-1989)
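As a rough illustration of this counting-and-normalizing step (not the actual induction pipeline used for the WSJ treebank), the Python sketch below estimates rule probabilities by relative frequency and multiplies them into a tree probability; the data format and function name are invented for the example.

```python
import math
from collections import Counter

def induce_pcfg(rule_occurrences):
    """Estimate Pr(rule | left-hand side) by relative frequency.

    rule_occurrences: a list of (lhs, rhs) pairs, one per rule occurrence
    read off the treebank trees, e.g. ("S", ("NP", "VP", ".")).
    """
    rule_counts = Counter(rule_occurrences)
    lhs_counts = Counter(lhs for lhs, _ in rule_occurrences)
    return {(lhs, rhs): n / lhs_counts[lhs]
            for (lhs, rhs), n in rule_counts.items()}

# Toy "treebank": the rule occurrences read off two tiny trees.
rules = [
    ("S", ("NP", "VP")), ("NP", ("PRP",)), ("VP", ("VBZ", "NP")),
    ("S", ("NP", "VP")), ("NP", ("DT", "NN")), ("VP", ("VBZ", "NP")),
]
pcfg = induce_pcfg(rules)
print(pcfg[("NP", ("PRP",))])            # 0.5: one of the two NP expansions

# Probability of a tree = product of the probabilities of its rules.
tree = [("S", ("NP", "VP")), ("NP", ("PRP",)), ("VP", ("VBZ", "NP")),
        ("NP", ("DT", "NN"))]
print(math.prod(pcfg[r] for r in tree))  # 1.0 * 0.5 * 1.0 * 0.5 = 0.25
```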

8
Inducing a PCFG
[Parse tree of the example sentence "It has no bearing on our work force today.", with nodes such as S, NP, VP, PP, DT, NN, IN; production rules and their counts are read off trees like this.]
9
Simple Recurrent Network (Elman, 1990)
  • Feedforward neural network with recurrent
    connections
  • Processes sentences, word by word
  • Usually trained to predict the upcoming
    word (i.e., the input at t+1)

[Diagram of the SRN: the word input at time t and a copy of the previous hidden activation feed into the hidden layer, which feeds the output layer (the prediction of the word at t+1). A minimal forward-pass sketch follows.]
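A minimal numpy sketch of one SRN processing step, assuming one-hot word inputs, arbitrary toy dimensions, and a softmax output over the vocabulary; training (backpropagation of the next-word prediction error) is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes; the real model's dimensions are not specified here.
n_words, n_hidden = 50, 20

# Weights: word input -> hidden, copied hidden (context) -> hidden, hidden -> output.
W_in = rng.normal(scale=0.1, size=(n_hidden, n_words))
W_ctx = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
W_out = rng.normal(scale=0.1, size=(n_words, n_hidden))

def srn_step(word_id, hidden_prev):
    """One SRN step: combine the current word with the copied hidden state
    from the previous step, and output a next-word probability distribution."""
    x = np.zeros(n_words)
    x[word_id] = 1.0                                  # one-hot word input at t
    hidden = np.tanh(W_in @ x + W_ctx @ hidden_prev)  # hidden activation at t
    logits = W_out @ hidden
    probs = np.exp(logits - logits.max())             # softmax over the vocabulary
    return probs / probs.sum(), hidden

hidden = np.zeros(n_hidden)
for w in [3, 17, 5]:                                  # a toy word-id sequence
    next_word_probs, hidden = srn_step(w, hidden)     # Pr(word at t+1), new context
```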
10
Word probability and reading times (Hale, 2001; Levy, 2008)
  • Surprisal theory: the more unexpected the
    occurrence of a word, the more time needed to
    process it. Formally:
  • A sentence is a sequence of words w1, w2, w3, …
  • The time needed to read word wt is
    logarithmically related to its probability in the
    context:
  • RT(wt) ∝ −log Pr(wt | context)
  • If nothing else changes, the context is just the
    sequence of previous words:
  • RT(wt) ∝ −log Pr(wt | w1, …, wt−1)
  • Both PCFGs and SRNs can estimate Pr(wt | w1, …,
    wt−1) (see the sketch below)
  • So can they predict word-reading times?
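A small sketch of the surprisal computation itself: given a model's conditional word probabilities (the values below are made up), surprisal is just the negative log probability, which the theory relates to reading time.

```python
import numpy as np

def surprisal(cond_probs):
    """Surprisal of each word: -log2 Pr(w_t | w_1, ..., w_{t-1}), in bits."""
    return -np.log2(np.asarray(cond_probs))

# Hypothetical conditional probabilities for "the cat is on the mat".
p = [0.20, 0.01, 0.30, 0.05, 0.40, 0.15]
print(surprisal(p))   # higher surprisal -> longer predicted reading time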

11
Testing surprisal theory (Demberg & Keller, 2008)
  • Reading-time data:
  • Dundee corpus: approx. 2,400 sentences from
    editorials in The Independent newspaper
  • Read by 10 subjects
  • Eye-movement registration
  • First-pass RT: fixation time on a word before
    any fixation on later words
  • Computation of surprisal:
  • PCFG induced from WSJ treebank
  • Applied to Dundee corpus sentences
  • Using Brian Roark's incremental PCFG parser

12
Testing surprisal theory (Demberg & Keller, 2008)
Result: no significant effect of word surprisal
on RT, apart from the effects of Pr(wt) and
Pr(wt | wt−1)
  • But accurate word prediction is difficult because
    of:
  • required world knowledge
  • differences between WSJ and The Independent:
  • 1988-89 versus 2002
  • general WSJ articles versus Independent
    editorials
  • American English versus British English
  • only major similarity: both are in English

Test for a purely structural (i.e.,
non-semantic) effect by ignoring the actual words
13
[The same parse tree with the words removed: only the syntactic categories (S, NP, VP, PP, PRP, DT, NN, IN, ...) remain.]
14
Testing surprisal theory (Demberg & Keller, 2008)
  • Unlexicalized (or structural) surprisal:
  • PCFG induced from WSJ trees with the words removed
  • surprisal estimation by parsing sequences of
    pos-tags (instead of words) of Dundee corpus
    texts
  • independent of semantics, so more accurate
    estimation is possible
  • but probably a weaker relation with reading times
  • Is a word's RT related to the predictability of
    its part-of-speech?
  • Result: yes, a statistically significant (but very
    small) effect of pos-surprisal on word-RT

15
Caveats
  • Statistical analysis
  • The analysis assumes independent measurements
  • Surprisal theory is based on dependencies between
    words
  • So the analysis is inconsistent with the theory
  • Implicit assumptions
  • The PCFG forms an accurate language model (i.e.,
    it gives high probability to the parts-of-speech
    that actually occur)
  • An accurate language model is also an accurate
    psycholinguistic model (i.e., it predicts reading
    times)

16
Solutions
  • Sentence-level (instead of word-level) analysis
  • Both PCFG and statistical analysis assume
    independence between sentences
  • Surprisal averaged over pos-tags in the sentence
  • Total sentence RT divided by sentence length
    (number of letters)
  • Measure accuracy:
  • a) of the language model: lower average surprisal →
    more accurate language model
  • b) of the psycholinguistic model: RT and surprisal
    correlate more strongly → more accurate
    psycholinguistic model
  • If a) and b) increase together, accurate language
    models are also accurate psycholinguistic models
    (a sketch of the sentence-level measures follows below)
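A sketch of this sentence-level aggregation under assumed data structures (one array of surprisals, first-pass RTs, and word lengths per sentence; all names and numbers are invented): average surprisal per sentence, total RT per letter, and their correlation as the psycholinguistic-accuracy measure.

```python
import numpy as np

def sentence_scores(surprisals, rts, letter_counts):
    """Collapse word-level measures to one value per sentence.

    surprisals, rts, letter_counts: lists with one numpy array per sentence
    (pos-tag surprisals, first-pass RTs, word lengths in letters).
    """
    mean_surprisal = np.array([s.mean() for s in surprisals])
    rt_per_letter = np.array([r.sum() / n.sum()      # total RT / sentence length
                              for r, n in zip(rts, letter_counts)])
    return mean_surprisal, rt_per_letter

# Toy data for two sentences.
surp = [np.array([2.1, 5.3, 1.0]), np.array([3.2, 0.8])]
rt = [np.array([210., 380., 190.]), np.array([250., 160.])]
letters = [np.array([3, 6, 4]), np.array([5, 2])]

s, r = sentence_scores(surp, rt, letters)
print(np.corrcoef(s, r)[0, 1])   # psycholinguistic accuracy: correlation with RT
```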

17
Comparing PCFG and SRN
  • PCFG
  • Train on WSJ treebank (unlexicalized)
  • Parse pos-tag sequences from Dundee corpus
  • Obtain a range of surprisal estimates by varying
    the beam-width parameter, which controls parser
    accuracy
  • SRN
  • Train on sequences of pos-tags (not the trees)
    from WSJ
  • During training (at regular intervals), process
    pos-tags from Dundee corpus, obtaining a range of
    surprisal estimates
  • Evaluation:
  • Language model: average surprisal measures
    inaccuracy (and estimates language entropy)
  • Psycholinguistic model: correlation between
    surprisals and RTs

just like Demberg & Keller (2008)
18
Results
[Results plots: one panel for the PCFG, one for the SRN.]
19
Preliminary conclusions
  • Both models account for a statistically
    significant fraction of variance in reading-time
    data.
  • The human sentence-processing system seems to be
    using an accurate language model.
  • The SRN is the more accurate psycholinguistic
    model.

But PCFG and SRN together might form an even
better psycholinguistic model
20
Improved analysis
  • Linear mixed-effects regression model (to take
    into account random effects of subject and item)
  • Compare regression models that include:
  • surprisal estimates by the PCFG with the largest
    beam width
  • surprisal estimates by the fully trained SRN
  • both
  • Also include sentence length, word frequency,
    forward and backward transitional probabilities,
    and all significant two-way interactions between
    these (a regression sketch follows below)
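A hedged sketch of such a comparison using statsmodels' mixed-effects regression; the data file, column names, and the single random intercept for subject are all assumptions (fully crossed subject and item random effects need a more elaborate specification). Including the PCFG surprisal, the SRN surprisal, or both just means changing the formula.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Assumed input: one row per word with columns rt, surp_pcfg, surp_srn,
# length, freq, fwd_prob, bwd_prob, subject, item (names invented here).
df = pd.read_csv("dundee_word_measures.csv")

# Regression model including both surprisal estimates plus the control
# predictors; random intercept for subject only (a simplification).
model = smf.mixedlm(
    "rt ~ surp_pcfg + surp_srn + length + freq + fwd_prob + bwd_prob",
    data=df,
    groups=df["subject"],
)
result = model.fit()
print(result.summary())   # inspect the surprisal coefficients and p-values
```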

21-23
Results
Estimated β-coefficients (and associated p-values)

Regression model includes:                 PCFG            SRN              both
Effect of surprisal according to PCFG      0.45 (p<.02)    -                -0.46 (p>.2)
Effect of surprisal according to SRN       -               0.64 (p<.001)    1.02 (p<.01)
24
Conclusions
  • Both PCFG and SRN do account for the reading-time
    data to some extent
  • But the PCFG does not improve on the SRN's
    predictions

No evidence for tree structures in the mental
representation of sentences
25
Qualitative comparison
  • Why does the SRN fit the data better than the
    PCFG?
  • Is it more accurate on a particular group of data
    points, or does it perform better overall?
  • Take the regression analyses' residuals (i.e.,
    differences between predicted and measured
    reading times)
  • di = residi(PCFG) - residi(SRN)
  • di is the extent to which data point i is
    predicted better by the SRN than by the PCFG.
  • Is there a group of data points for which d is
    larger than might be expected?
  • Look at the distribution of the d values (a sketch
    follows below).
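A sketch of the residual comparison with stand-in numbers (in the real analysis, resid_pcfg and resid_srn come from the two regression fits over the same data points).

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in residuals; in the real analysis these are the residuals of the
# PCFG-based and SRN-based regression models for the same data points.
resid_pcfg = rng.normal(size=1000)
resid_srn = rng.normal(size=1000)

d = resid_pcfg - resid_srn       # d_i: how much better the SRN predicts point i

# Inspect the distribution of d: its mean (shift) and skewness.
print(d.mean())
print(((d - d.mean()) ** 3).mean() / d.std() ** 3)   # sample skewness
```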

26
Possible distributions of d
  • symmetrical, mean is 0: no difference between SRN
    and PCFG (only random noise)
  • right-shifted: overall better predictions by SRN
  • right-skewed: a particular subset of the data is
    predicted better by SRN
27
Test for symmetry
  • If the distribution of d is asymmetric,
    particular data points are predicted better by
    the SRN than the PCFG.
  • The distribution is not significantly
    asymmetric (two-sample Kolmogorov-Smirnov test,
    p > .17; a sketch of such a test follows below)
  • The SRN seems to be a more accurate
    psycholinguistic model overall.
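One way to run such a symmetry check (whether this is exactly the construction used here is an assumption): compare the distribution of the centered d values with their mirror image using scipy's two-sample Kolmogorov-Smirnov test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
d = rng.normal(size=1000)        # stand-in for the residual differences d_i

# If d is symmetric (here: around its median), the centered values and
# their mirror image should have the same distribution.
c = d - np.median(d)
statistic, p_value = stats.ks_2samp(c, -c)
print(p_value)                   # a large p-value: no evidence of asymmetry
```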

28
Questions I cannot (yet) answer (but would like
to)
  • Why does the SRN fit the data better than the
    PCFG?
  • Perhaps people
  • are bad at dealing with long-distance
    dependencies in a sentence?
  • store information about the frequency of
    multi-word sequences?

29
Questions I cannot (yet) answer (but would like
to)
  • In general
  • Is this SRN a more accurate psychological model
    than this PCFG?
  • Are SRNs more accurate psychological models than
    PCFGs?
  • Does connectionism make for more accurate
    psychological models than grammar-based theories?
  • What kind of representations are used in human
    sentence processing?

Surprisal-based model evaluation may provide some
answers