Title: Syntactic and Semantic Systematicity in Connectionist Models of Sentence Processing
1. Syntactic and Semantic Systematicity in Connectionist Models of Sentence Processing
- Stefan Frank
- Nijmegen Institute for Cognition and Information
2. Systematicity in language
- Understood: Would you like some milk in your coffee? Do you want sugar in your tea?
- Not understood: Would you like some sugar in your coffee? Do you want milk in your tea?
3. Systematicity and connectionism
Fodor & Pylyshyn (Cognition, 1988)
- Systematicity requires combinatorial representations.
- Neural networks do not have a combinatorial syntax/semantics.
- So, they cannot display (let alone explain) systematicity.
- Connectionism will never result in viable models of human cognition.
4. Systematicity and connectionism
Hadley (Mind & Language, 1994)
- Systematicity: knowing sentence X → knowing new sentence Y
- Generalisation: training on input X → ability to process new input Y
- The extent to which a network is systematic is apparent in its ability to generalise.
5. Systematicity and connectionism
Hadley (Mind & Language, 1994)
- Alleged demonstrations of connectionist systematicity only show minimal generalisation (many training examples, few tests; small difference between training and test inputs).
- Syntactic systematicity: decide whether a novel word string is a sentence.
- Semantic systematicity: assign a semantic representation to a novel sentence.
6. Connectionist sentence processing
- Simple Recurrent Network (SRN): a feedforward network with connections from a hidden layer to itself.
- SRNs have not only long-term memory (LTM) but also short-term memory (STM).
[Diagram, bottom to top: input layer → recurrent hidden layer → hidden layer → output layer]
7. Connectionist sentence processing
- Simple Recurrent Network (SRN): a feedforward network with connections from a hidden layer to itself.
- Training examples are grammatical sentences → the network cannot learn to make grammaticality judgements.
- Next-word prediction: after processing words w1, …, wt of the input sentence, the network is trained to predict wt+1 (e.g., Elman, 1990, 1991, 1993; Servan-Schreiber et al., 1991; Christiansen & Chater, 1994, 1999; Rohde & Plaut, 1999; Tabor & Tanenhaus, 1999; MacDonald & Christiansen, 2002; Van der Velde et al., 2004).
- If the network makes correct predictions for new inputs, it must have acquired knowledge of the grammar.
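To make the prediction task concrete, here is a minimal sketch of an SRN forward pass in Python/NumPy. The weights are untrained and random; the layer sizes and the softmax output are illustrative assumptions, not details taken from any of the cited models.

```python
import numpy as np

# Minimal Elman-style SRN forward pass (illustrative sizes, random weights).
rng = np.random.default_rng(0)
n_words, n_hidden = 18, 10

W_in = rng.normal(0.0, 0.1, (n_hidden, n_words))    # input -> hidden
W_rec = rng.normal(0.0, 0.1, (n_hidden, n_hidden))  # hidden -> hidden (recurrent)
W_out = rng.normal(0.0, 0.1, (n_words, n_hidden))   # hidden -> output

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def predict_next(word_indices):
    """Process a word sequence; return P(next word) after the last word."""
    h = np.zeros(n_hidden)                    # context starts empty
    for w in word_indices:
        x = np.zeros(n_words)
        x[w] = 1.0                            # one-hot input for this word
        h = np.tanh(W_in @ x + W_rec @ h)     # new state depends on old state
    return softmax(W_out @ h)

p = predict_next([0, 5, 1])  # e.g. the indices of "boy sees girl"
```

After training by backpropagation to predict wt+1, the output distribution would concentrate on grammatical continuations.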
8. Testing for syntactic systematicity
- How would we test for syntactic systematicity in human subjects using next-word prediction?
- Give (the first words of) a new sentence, for instance "The averages smell …", and ask what could come next.
- Correct answers: bricks, daily, some, the, ., … Incorrect answers: eat, see, …
- If the subjects give only correct answers, they seem to have no problem with the new sentence → systematicity. But if they give many incorrect answers, the new sentence is just a string of words to them → no systematicity.
9. Testing for syntactic systematicity
- When testing networks, the wrong questions are often asked:
  - Are correct next-word probabilities predicted?
  - Are all possible words (all correct word classes) predicted?
- We don't expect human subjects to answer such questions.
- Van der Velde et al. (2004): Does the network avoid incorrect next-word predictions on novel inputs?
10. Network training and testing
Van der Velde et al. (Connection Science, 2004)
- An SRN processed a mini-language with:
  - 18 words (boy, girl, …, loves, sees, …, who, .)
  - 3 sentence types:
    - simple: N V N . (boy sees girl.)
    - right-branching: N V N who V N . (boy sees girl who loves boy.)
    - center-embedded: N who N V V N . (boy who girl sees loves boy.)
- Nouns and verbs were divided into four groups; each had two nouns and two verbs.
- In training sentences, nouns and verbs were from the same group: < 0.44 of sentences were used for training.
- In test sentences, nouns and verbs came from different groups.
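The three sentence types can be written down as string templates. The sketch below only illustrates the structures; the word lists are trimmed placeholders rather than the full 18-word lexicon.

```python
# The three sentence templates of the mini-language (illustrative words only).
nouns = ["boy", "girl"]
verbs = ["sees", "loves"]

def simple(n1, v, n2):
    """N V N ."""
    return f"{n1} {v} {n2} ."

def right_branching(n1, v1, n2, v2, n3):
    """N V N who V N ."""
    return f"{n1} {v1} {n2} who {v2} {n3} ."

def center_embedded(n1, n2, v1, v2, n3):
    """N who N V V N ."""
    return f"{n1} who {n2} {v1} {v2} {n3} ."

s = center_embedded("boy", "girl", "sees", "loves", "boy")
# -> "boy who girl sees loves boy ."
```

The training/test split then amounts to restricting which noun and verb groups may be combined when filling the slots.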
11. Results and conclusions
Van der Velde et al. (Connection Science, 2004)
- The network failed on test sentences, so SRNs "do not generalise to structurally similar sentences".
- But:
  - What does it mean to "fail"? Maybe the network displayed some systematicity.
  - Was the language complex enough? With more word types there is more reason to abstract to syntactic classes.
  - Was the size of the network appropriate?
    - larger recurrent layer → more STM → better processing?
    - smaller recurrent layer → less LTM → better generalisation?
12. Syntactic systematicity revisited
Frank (Connection Science, 2006)
- Replicating Van der Velde et al., but:
  - using a measure of generalisation performance rather than just prediction performance
  - varying language size (number of word types)
  - varying recurrent-layer size (number of units)
  - varying the LTM/STM ratio
13. Rating generalisation
- Baseline: the output of a (hypothetical) perfectly trained but non-generalising network.
- The best such a network can do is use the last word of the test input that also appeared in the training input.
- Generalisation scores:
  - score +1: the network never makes ungrammatical predictions
  - score 0: the network does not generalise, but gives the best possible output based on the last word
  - score −1: the network only makes ungrammatical predictions
- A positive generalisation score (at each word) indicates systematicity.
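One way to turn the three anchor points above into a number is to interpolate linearly between them, based on how much probability mass the network and the baseline put on ungrammatical next words. This is an assumed reconstruction of the scoring scheme, not necessarily Frank's (2006) exact formula.

```python
def generalisation_score(u_network, u_baseline):
    """Generalisation score from ungrammatical prediction mass.

    u_network: probability mass the network puts on ungrammatical next words.
    u_baseline: mass the non-generalising baseline puts on them (0 < u_baseline < 1).
    Anchors: no ungrammatical mass -> +1; baseline mass -> 0; all mass -> -1.
    The linear interpolation between anchors is an assumption.
    """
    if u_network <= u_baseline:
        return (u_baseline - u_network) / u_baseline
    return (u_baseline - u_network) / (1.0 - u_baseline)

generalisation_score(0.0, 0.3)  # +1.0: network never predicts ungrammatically
```

Averaging such scores over word positions and test sentences would then give the per-sentence-type curves reported in the results.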
14. Network architecture
[Diagram, bottom to top: input layer, w units (one for each word, w = 18, …, 42) → recurrent hidden layer, n units (n = 10, …, 100) → hidden layer, 10 units → output layer, w units (one for each word, w = 18, …, 42)]
15. Results
[Figure: generalisation scores per word position]
Positive generalisation at each word of each test sentence type, so there is some systematicity.
16. Results: effect of lexicon size
[Figure: generalisation scores for N V N, N V N who V N, and N who N V V N sentences]
A larger lexicon leads to improved generalisation (even though a smaller percentage of possible sentences is used for training).
17. Results: effect of recurrent-layer size
[Figure: generalisation scores for N V N, N V N who V N, and N who N V V N sentences]
Larger networks (n = 40) do better, but very large ones (n = 100) overfit.
18. SRN generalisation and memory
- SRNs do show systematicity to some extent.
- But generalisation remains limited:
  - small n → limited processing capacity (STM)
  - large n → large LTM → overfitting
- How can a large STM be combined with a small LTM?
19. Echo State Networks
Jaeger (Advances in NIPS, 2003)
- Keep the connections to and within the recurrent layer fixed at random values.
- The recurrent layer becomes a "dynamical reservoir": a task-unspecific STM for the input sequence.
- Some constraints on the dynamical reservoir:
  - large enough
  - sparsely connected (here: 15%)
  - suitable spectral radius of the weight matrix (here: 0.7)
- LTM capacity:
  - in SRNs: O(n²)
  - in ESNs: O(n)
- Can ESNs successfully combine a large STM with a small LTM?
20. Network architecture
[Diagram, bottom to top: input layer, w units (one for each word, w = 18, …, 42) → recurrent hidden layer, n units (n = 10, …, 100) → hidden layer, 10 units → output layer, w units (one for each word, w = 18, …, 42)]
21. Results
[Figure: generalisation scores per word position]
Positive performance at each word of each test sentence type, so there is some systematicity (but less than in an SRN of the same size).
22. Results: effect of recurrent-layer size
[Figure: generalisation scores for N V N, N V N who V N, and N who N V V N sentences]
Bigger is better: no overfitting (even when n = 1530).
23. Results: effect of lexicon size
[Figure: generalisation scores for N V N, N V N who V N, and N who N V V N sentences]
A larger lexicon leads to improved generalisation (even though a smaller percentage of possible sentences is used for training).
24. Conclusions: syntactic systematicity
- Generalisation scores at all points of test sentences are significantly larger than zero.
- Both SRNs and ESNs can be syntactically systematic, even with few training sentences and many test sentences.
- By doing less training, the network can learn more:
  - training fewer connections gives better results
  - training on a smaller fraction of possible sentences gives better results
25. Semantic systematicity
- People can assign a semantic representation to most sentences that are new to them.
- Usually, this is the representation intended by the speaker/writer.
- How to account for this semantic systematicity?
26. Connectionist semantic systematicity
Frank & Haselager (Proceedings of the CogSci Conference, 2006)
- An ESN transforms sentences into their semantic representations.
- It is trained on only part of the possible sentences; particular sentence structures and semantics are withheld.
- It can correctly process these untrained sentences.
27. Levels of representation
- The result of text comprehension is a mental representation of the described situation (Zwaan & Radvansky, Psych. Bull., 1998):
  - not linguistic
  - no predicate–argument structure
  - involves the reader's/listener's knowledge of and experience with the world
- Sentences: The sun is shining. The sky is blue.
- Propositions: SHINE(SUN), BLUE(SKY)
- Situation: [image]
28. The microworld
- Characters: Bob, Jilly
- Basic events, among others:
  - Bob/Jilly is outside
  - Bob and Jilly play soccer/hide-and-seek
  - Bob/Jilly plays a computer game
  - Bob/Jilly plays with the dog
  - Bob/Jilly wins
- Complex events, e.g.:
  - ¬(Bob outside) = Bob is inside
  - (Bob and Jilly play soccer) ∧ (Jilly wins) = Jilly wins at soccer
  - (Bob wins) ∨ (Jilly wins) = someone wins
- Constraints, e.g.:
  - no winning when playing with the dog
  - computer games are only played inside
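The two constraints can be read as simple logical exclusions over the set of basic events holding in a situation. The sketch below makes that reading explicit; the string labels are ad-hoc stand-ins for the microworld's events, not names from the model.

```python
# The two microworld constraints, read as checks on a set of basic events.
# Event labels are illustrative stand-ins, not identifiers from the model.
def violates_constraints(events):
    """True if a situation breaks either microworld constraint."""
    if "wins" in events and "plays_with_dog" in events:
        return True   # no winning when playing with the dog
    if "plays_computer_game" in events and "outside" in events:
        return True   # computer games are only played inside
    return False

violates_constraints({"plays_with_dog", "wins"})  # True
```

Situations that violate a constraint simply never occur in the microworld, so a model trained on observed situations implicitly absorbs these regularities.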
29. Situational representations
- The Distributed Situation Space model of inference during story comprehension (Frank et al., Cognitive Science, 2003) represents story situations as vectors in "situation space".
- Situation space is self-organised by training on a large number of situations occurring in the microworld.
- Similarities among vectors reflect similarities among the represented situations:
  - playing soccer and being outside often co-occur → similar vectors
  - playing soccer and playing hide-and-seek → dissimilar vectors
- Belief value of some event p in situation X:
  - the estimated probability that event p occurs in situation X
  - can be computed from the vector representations of p and X
30. Microworld and microlanguage
- Microworld situations can be described by 3558 possible sentences from a 20-word microlanguage, e.g.:
  - jilly plays soccer outside
  - bob and jilly play with dog
  - someone loses to jilly
  - jilly or bob wins inside
  - jilly beats bob at hide-and-seek
  - bob loses at game outside
31. Network training
- Input: microlanguage sentences, e.g., "jilly wins at soccer outside".
- Target output: the vector for Jilly wins ∧ soccer ∧ Jilly is outside.
- The network is not trained on the 2288 (> 64%) sentences in which:
  - (1) anyone beats bob, or anyone loses to jilly
  - (2) both "dog" and "inside" occur, or both "hide-and-seek" and "outside" occur
- So it is not trained to understand sentences:
  - (1) that use a transitive verb to describe that bob loses (= jilly wins)
  - (2) in which the dog is inside, or hide-and-seek is played outside
- Understanding test sentences:
  - (1) requires generalisation to new sentences
  - (2) requires generalisation to new sentences and new situations
32. Measuring comprehension
- After processing "jilly wins at soccer outside":
  - compute the belief values of Jilly wins, soccer, and Jilly is outside in the network's output vector
  - if these are larger than the prior probabilities of these events, the sentence was understood to some extent
- Comprehension score (between −1 and +1): the increase in belief value of the stated event relative to the event's prior probability.
- Negative comprehension score → belief value < prior probability → error: the sentence was misunderstood.
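Read literally, such a score can be obtained by normalising the belief increase by its maximum possible size in each direction, so that the result lies in [−1, +1]. This normalisation is an assumed realisation of the description above; the slide does not give the model's exact formula.

```python
def comprehension_score(belief, prior):
    """Increase in belief of the stated event relative to its prior.

    Normalised to [-1, +1]: +1 means belief rose to certainty, 0 means no
    change, -1 means belief dropped to zero. The exact normalisation is an
    assumption, not necessarily the model's formula (0 < prior < 1).
    """
    if belief >= prior:
        return (belief - prior) / (1.0 - prior)  # fraction of possible gain
    return (belief - prior) / prior              # negative: misunderstood

comprehension_score(0.9, 0.2)  # 0.875: the stated event is now strongly believed
```

A negative score then corresponds exactly to the error case on the slide: the output belief value fell below the event's prior probability.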
33. Results
[Figure: comprehension scores for new sentences, old situations]
34. Results
[Figure: comprehension scores for new sentences, new situations]
35. Conclusions: semantic systematicity
- Comprehension scores are significantly larger than zero and error rates are very low.
- The network learned how new combinations of known words refer to (new) combinations of known microworld events.
- It displays semantic systematicity.
36. Open questions
- Upscaling: will it work with languages of more realistic size? → work in progress
- How does it work? → look at the hidden representations
- Can connectionism explain systematicity?
  - No, because neural networks do not need to be systematic.
  - Yes, because they need to adapt to systematicity in the training input.
- Where does systematic cognition come from? Not from the cognitive system, but from systematicity in the world and in language.