Title: Syntactic and Semantic Systematicity in Connectionist Models of Sentence Processing
1. Syntactic and Semantic Systematicity in Connectionist Models of Sentence Processing
- Stefan Frank
- Nijmegen Institute for Cognition and Information
2. Systematicity in language
- Understood: Would you like some milk in your coffee? Do you want sugar in your tea?
- Not understood: Would you like some sugar in your coffee? Do you want milk in your tea?
3. Systematicity and connectionism
Fodor & Pylyshyn (Cognition, 1988)
- Systematicity requires combinatorial representations.
- Neural networks do not have a combinatorial syntax/semantics.
- So, they cannot display (let alone explain) systematicity.
- Connectionism will never result in viable models of human cognition.
4. Systematicity and connectionism
Hadley (Mind & Language, 1994)
- Systematicity: knowing sentence X → knowing new sentence Y
- Generalisation: training on input X → ability to process new input Y
- The extent to which a network is systematic is apparent in its ability to generalise.
5. Systematicity and connectionism
Hadley (Mind & Language, 1994)
- Alleged demonstrations of connectionist systematicity only show minimal generalisation (many training examples, few tests; small difference between training and test inputs).
- Syntactic systematicity: decide whether a novel word string is a sentence.
- Semantic systematicity: assign a semantic representation to a novel sentence.
6. Connectionist sentence processing
- Simple Recurrent Network (SRN): a feedforward network with connections from a hidden layer to itself.
- SRNs have not only long-term memory (LTM) but also short-term memory (STM).
[Diagram, bottom to top: input layer → recurrent hidden layer → hidden layer → output layer]
7. Connectionist sentence processing
- Simple Recurrent Network (SRN): a feedforward network with connections from a hidden layer to itself.
- Training examples are grammatical sentences → the network cannot learn to make grammaticality judgements.
- Next-word prediction: after processing words w1, …, wt of the input sentence, the network is trained to predict wt+1 (e.g., Elman, 1990, 1991, 1993; Servan-Schreiber et al., 1991; Christiansen & Chater, 1994, 1999; Rohde & Plaut, 1999; Tabor & Tanenhaus, 1999; MacDonald & Christiansen, 2002; Van der Velde et al., 2004).
- If the network makes correct predictions for new inputs, it must have acquired knowledge of the grammar.
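To make the prediction task concrete, here is a minimal sketch of an SRN forward pass in Python/NumPy. The weights are untrained and random; the layer sizes and the softmax output are illustrative assumptions, not details taken from any of the cited models.

```python
import numpy as np

# Minimal Elman-style SRN forward pass (illustrative sizes, random weights).
rng = np.random.default_rng(0)
n_words, n_hidden = 18, 10

W_in = rng.normal(0.0, 0.1, (n_hidden, n_words))    # input -> hidden
W_rec = rng.normal(0.0, 0.1, (n_hidden, n_hidden))  # hidden -> hidden (recurrent)
W_out = rng.normal(0.0, 0.1, (n_words, n_hidden))   # hidden -> output

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def predict_next(word_indices):
    """Process a word sequence; return P(next word) after the last word."""
    h = np.zeros(n_hidden)                    # context starts empty
    for w in word_indices:
        x = np.zeros(n_words)
        x[w] = 1.0                            # one-hot input for this word
        h = np.tanh(W_in @ x + W_rec @ h)     # new state depends on old state
    return softmax(W_out @ h)

p = predict_next([0, 5, 1])  # e.g. the indices of "boy sees girl"
```

After training by backpropagation to predict wt+1, the output distribution would concentrate on grammatical continuations.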
8. Testing for syntactic systematicity
- How would we test for syntactic systematicity in human subjects using next-word prediction?
- Give (the first words of) a new sentence, for instance "The averages smell …", and ask what could come next.
- Correct answers: bricks, daily, some, the, ., … Incorrect answers: eat, see, …
- If the subjects give only correct answers, they seem to have no problem with the new sentence → systematicity. But if they give many incorrect answers, the new sentence is just a string of words to them → no systematicity.
9. Testing for syntactic systematicity
- When testing networks, the wrong questions are often asked:
  - Are correct next-word probabilities predicted?
  - Are all possible words (all correct word classes) predicted?
- We don't expect human subjects to answer such questions.
- Van der Velde et al. (2004): Does the network avoid incorrect next-word predictions on novel inputs?
10. Network training and testing
Van der Velde et al. (Connection Science, 2004)
- An SRN processed a mini-language with:
  - 18 words (boy, girl, …, loves, sees, …, who, .)
  - 3 sentence types:
    - simple: N V N . (boy sees girl.)
    - right-branching: N V N who V N . (boy sees girl who loves boy.)
    - center-embedded: N who N V V N . (boy who girl sees loves boy.)
- Nouns and verbs were divided into four groups; each had two nouns and two verbs.
- In training sentences, nouns and verbs were from the same group: < 0.44 of sentences were used for training.
- In test sentences, nouns and verbs came from different groups.
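The three sentence types can be written down as string templates. The sketch below only illustrates the structures; the word lists are trimmed placeholders rather than the full 18-word lexicon.

```python
# The three sentence templates of the mini-language (illustrative words only).
nouns = ["boy", "girl"]
verbs = ["sees", "loves"]

def simple(n1, v, n2):
    """N V N ."""
    return f"{n1} {v} {n2} ."

def right_branching(n1, v1, n2, v2, n3):
    """N V N who V N ."""
    return f"{n1} {v1} {n2} who {v2} {n3} ."

def center_embedded(n1, n2, v1, v2, n3):
    """N who N V V N ."""
    return f"{n1} who {n2} {v1} {v2} {n3} ."

s = center_embedded("boy", "girl", "sees", "loves", "boy")
# -> "boy who girl sees loves boy ."
```

The training/test split then amounts to restricting which noun and verb groups may be combined when filling the slots.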
11. Results and conclusions
Van der Velde et al. (Connection Science, 2004)
- The network failed on test sentences, so SRNs "do not generalise to structurally similar sentences".
- But:
  - What does it mean to "fail"? Maybe the network displayed some systematicity.
  - Was the language complex enough? With more word types there is more reason to abstract to syntactic classes.
  - Was the size of the network appropriate?
    - larger recurrent layer → more STM → better processing?
    - smaller recurrent layer → less LTM → better generalisation?
12. Syntactic systematicity revisited
Frank (Connection Science, 2006)
- Replicating Van der Velde et al., but:
  - using a measure of generalisation performance rather than just prediction performance
  - varying language size (number of word types)
  - varying recurrent-layer size (number of units)
  - varying the LTM/STM ratio
13. Rating generalisation
- Baseline: the output of a (hypothetical) perfectly trained but non-generalising network.
- The best such a network can do is use the last word of the test input that also appeared in the training input.
- Generalisation scores:
  - score +1: the network never makes ungrammatical predictions
  - score 0: the network does not generalise, but gives the best possible output based on the last word
  - score −1: the network only makes ungrammatical predictions
- A positive generalisation score (at each word) indicates systematicity.
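One way to turn the three anchor points above into a number is to interpolate linearly between them, based on how much probability mass the network and the baseline put on ungrammatical next words. This is an assumed reconstruction of the scoring scheme, not necessarily Frank's (2006) exact formula.

```python
def generalisation_score(u_network, u_baseline):
    """Generalisation score from ungrammatical prediction mass.

    u_network: probability mass the network puts on ungrammatical next words.
    u_baseline: mass the non-generalising baseline puts on them (0 < u_baseline < 1).
    Anchors: no ungrammatical mass -> +1; baseline mass -> 0; all mass -> -1.
    The linear interpolation between anchors is an assumption.
    """
    if u_network <= u_baseline:
        return (u_baseline - u_network) / u_baseline
    return (u_baseline - u_network) / (1.0 - u_baseline)

generalisation_score(0.0, 0.3)  # +1.0: network never predicts ungrammatically
```

Averaging such scores over word positions and test sentences would then give the per-sentence-type curves reported in the results.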
14. Network architecture
[Diagram, bottom to top: input layer, w units (one for each word, w = 18, …, 42) → recurrent hidden layer, n units (n = 10, …, 100) → hidden layer, 10 units → output layer, w units (one for each word, w = 18, …, 42)]
15. Results
[Figure: generalisation scores per word position]
Positive generalisation at each word of each test sentence type, so there is some systematicity.
16. Results: effect of lexicon size
[Figure: generalisation scores for N V N, N V N who V N, and N who N V V N sentences]
A larger lexicon leads to improved generalisation (even though a smaller percentage of possible sentences is used for training).
17. Results: effect of recurrent-layer size
[Figure: generalisation scores for N V N, N V N who V N, and N who N V V N sentences]
Larger networks (n = 40) do better, but very large ones (n = 100) overfit.
18. SRN generalisation and memory
- SRNs do show systematicity to some extent.
- But generalisation remains limited:
  - small n → limited processing capacity (STM)
  - large n → large LTM → overfitting
- How can a large STM be combined with a small LTM?
19. Echo State Networks
Jaeger (Advances in NIPS, 2003)
- Keep the connections to and within the recurrent layer fixed at random values.
- The recurrent layer becomes a "dynamical reservoir": a task-unspecific STM for the input sequence.
- Some constraints on the dynamical reservoir:
  - large enough
  - sparsely connected (here: 15%)
  - suitable spectral radius of the weight matrix (here: 0.7)
- LTM capacity:
  - in SRNs: O(n²)
  - in ESNs: O(n)
- Can ESNs successfully combine a large STM with a small LTM?
20. Network architecture
[Diagram, bottom to top: input layer, w units (one for each word, w = 18, …, 42) → recurrent hidden layer, n units (n = 10, …, 100) → hidden layer, 10 units → output layer, w units (one for each word, w = 18, …, 42)]
21. Results
[Figure: generalisation scores per word position]
Positive performance at each word of each test sentence type, so there is some systematicity (but less than in an SRN of the same size).
22. Results: effect of recurrent-layer size
[Figure: generalisation scores for N V N, N V N who V N, and N who N V V N sentences]
Bigger is better: no overfitting (even when n = 1530).
23. Results: effect of lexicon size
[Figure: generalisation scores for N V N, N V N who V N, and N who N V V N sentences]
A larger lexicon leads to improved generalisation (even though a smaller percentage of possible sentences is used for training).
24. Conclusions: syntactic systematicity
- Generalisation scores at all points of test sentences are significantly larger than zero.
- Both SRNs and ESNs can be syntactically systematic, even with few training sentences and many test sentences.
- By doing less training, the network can learn more:
  - training fewer connections gives better results
  - training on a smaller fraction of possible sentences gives better results
25. Semantic systematicity
- People can assign a semantic representation to most sentences that are new to them.
- Usually, this is the representation intended by the speaker/writer.
- How to account for this semantic systematicity?
26. Connectionist semantic systematicity
Frank & Haselager (Proceedings of the CogSci Conference, 2006)
- An ESN transforms sentences into their semantic representations.
- It is trained on only part of the possible sentences; particular sentence structures and semantics are withheld.
- It can correctly process these untrained sentences.
27. Levels of representation
- The result of text comprehension is a mental representation of the described situation (Zwaan & Radvansky, Psych. Bull., 1998):
  - not linguistic
  - no predicate–argument structure
  - involves the reader's/listener's knowledge of and experience with the world
- Sentences: The sun is shining. The sky is blue.
- Propositions: SHINE(SUN), BLUE(SKY)
- Situation: [image]
28. The microworld
- Characters: Bob, Jilly
- Basic events, among others:
  - Bob/Jilly is outside
  - Bob and Jilly play soccer/hide-and-seek
  - Bob/Jilly plays a computer game
  - Bob/Jilly plays with the dog
  - Bob/Jilly wins
- Complex events, e.g.:
  - ¬(Bob outside) = Bob is inside
  - (Bob and Jilly play soccer) ∧ (Jilly wins) = Jilly wins at soccer
  - (Bob wins) ∨ (Jilly wins) = someone wins
- Constraints, e.g.:
  - no winning when playing with the dog
  - computer games are only played inside
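The two constraints can be read as simple logical exclusions over the set of basic events holding in a situation. The sketch below makes that reading explicit; the string labels are ad-hoc stand-ins for the microworld's events, not names from the model.

```python
# The two microworld constraints, read as checks on a set of basic events.
# Event labels are illustrative stand-ins, not identifiers from the model.
def violates_constraints(events):
    """True if a situation breaks either microworld constraint."""
    if "wins" in events and "plays_with_dog" in events:
        return True   # no winning when playing with the dog
    if "plays_computer_game" in events and "outside" in events:
        return True   # computer games are only played inside
    return False

violates_constraints({"plays_with_dog", "wins"})  # True
```

Situations that violate a constraint simply never occur in the microworld, so a model trained on observed situations implicitly absorbs these regularities.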
29. Situational representations
- The Distributed Situation Space model of inference during story comprehension (Frank et al., Cognitive Science, 2003) represents story situations as vectors in "situation space".
- Situation space is self-organised by training on a large number of situations occurring in the microworld.
- Similarities among vectors reflect similarities among the represented situations:
  - playing soccer and being outside often co-occur → similar vectors
  - playing soccer and playing hide-and-seek → dissimilar vectors
- Belief value of some event p in situation X:
  - the estimated probability that event p occurs in situation X
  - can be computed from the vector representations of p and X
30. Microworld and microlanguage
- Microworld situations can be described by 3558 possible sentences from a 20-word microlanguage, e.g.:
  - jilly plays soccer outside
  - bob and jilly play with dog
  - someone loses to jilly
  - jilly or bob wins inside
  - jilly beats bob at hide-and-seek
  - bob loses at game outside
31. Network training
- Input: microlanguage sentences, e.g., "jilly wins at soccer outside".
- Target output: the vector for Jilly wins ∧ soccer ∧ Jilly is outside.
- The network is not trained on the 2288 (> 64%) sentences in which:
  - (1) anyone beats bob, or anyone loses to jilly
  - (2) both "dog" and "inside" occur, or both "hide-and-seek" and "outside" occur
- So it is not trained to understand sentences:
  - (1) that use a transitive verb to describe that bob loses (= jilly wins)
  - (2) in which the dog is inside, or hide-and-seek is played outside
- Understanding test sentences:
  - (1) requires generalisation to new sentences
  - (2) requires generalisation to new sentences and new situations
32. Measuring comprehension
- After processing "jilly wins at soccer outside":
  - compute the belief values of Jilly wins, soccer, and Jilly is outside in the network's output vector
  - if these are larger than the prior probabilities of these events, the sentence was understood to some extent
- Comprehension score (between −1 and +1): the increase in belief value of the stated event relative to the event's prior probability.
- Negative comprehension score → belief value < prior probability → error: the sentence was misunderstood.
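Read literally, such a score can be obtained by normalising the belief increase by its maximum possible size in each direction, so that the result lies in [−1, +1]. This normalisation is an assumed realisation of the description above; the slide does not give the model's exact formula.

```python
def comprehension_score(belief, prior):
    """Increase in belief of the stated event relative to its prior.

    Normalised to [-1, +1]: +1 means belief rose to certainty, 0 means no
    change, -1 means belief dropped to zero. The exact normalisation is an
    assumption, not necessarily the model's formula (0 < prior < 1).
    """
    if belief >= prior:
        return (belief - prior) / (1.0 - prior)  # fraction of possible gain
    return (belief - prior) / prior              # negative: misunderstood

comprehension_score(0.9, 0.2)  # 0.875: the stated event is now strongly believed
```

A negative score then corresponds exactly to the error case on the slide: the output belief value fell below the event's prior probability.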
33. Results
[Figure: comprehension scores for new sentences, old situations]
34. Results
[Figure: comprehension scores for new sentences, new situations]
35. Conclusions: semantic systematicity
- Comprehension scores are significantly larger than zero and error rates are very low.
- The network learned how new combinations of known words refer to (new) combinations of known microworld events.
- It displays semantic systematicity.
36. Open questions
- Upscaling: will it work with languages of more realistic size? → work in progress
- How does it work? → look at the hidden representations
- Can connectionism explain systematicity?
  - No, because neural networks do not need to be systematic.
  - Yes, because they need to adapt to systematicity in the training input.
- Where does systematic cognition come from? Not from the cognitive system, but from systematicity in the world and in language.