Title: Connectionism
1. Connectionism
2. Neurally inspired computation
- Neurons integrate information.
- Neurons indicate their level of input.
- Brain structure is layered.
- A neuron's influence on another depends on the strength of their connection.
- Learning involves changing the strength of connections between neurons.
3. Features & Principles
- Massively parallel processing
- Active representation
  - representations are directly involved in processing
- Implicit knowledge in connections; learning through adjusting them
- Initial architecture constrains learning
- Distributed representations permit graceful degradation
- Memory access by content
4. Nature/nurture
- Connectionism = neo-behaviorism?
- Assumptions about what is innate
  - decisions about network architecture, learning rules, and activation rules (and potentially input/output representations)
- Assumptions about the structure of the environment
  - decisions about the contents, order, and frequency of items in the training set
5. Localist representations
- A node dedicated to each meaningful representation
  - e.g., word nodes, sound nodes, feature nodes
- Language examples with only localist representations
  - McClelland & Rumelhart (1981) for visual word recognition
  - McClelland & Elman (1986) for spoken word recognition
  - Dell (1986) for word production/speech errors
6. Distributed representations
- Less information built in
- More opportunity to see how the task and input shape representations, but analysis may be difficult
- Learning required
- Various types of architectures
- Language examples
  - Elman (1990, 1991) and St. John & McClelland (1990) for sentence comprehension
  - Plaut et al. (1996) for word recognition
  - Dell, Juliano, & Govindjee (1993) for word production
7. Linear separability
- "X AND Y" is linearly separable; "exclusively X or Y" (XOR) is not.
- For XOR, there is no way to draw a straight line through the 2-dimensional input space to group the outputs.
- XOR cannot be computed with perceptrons.
[Figure: the four inputs from (0,0) to (1,1) plotted in the plane; a single straight line separates the AND outputs, but no line separates the XOR outputs]
8. Linear threshold XOR network
- Activation rule: if Σ_j (w_ij · o_j) > 0, then a_i = 1, else a_i = 0 (no spreading inhibition)
  - w_ij = connection weight between units i and j
  - o_j = output of unit j
  - a_i = activation of unit i
- Adapted from Rumelhart et al. (1986); a hand-wired sketch follows.
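To make the activation rule concrete, here is a minimal hand-wired sketch in Python. The weight and threshold values are illustrative choices, not values from Rumelhart et al.; each threshold is treated as a weight from an always-on bias unit so the "fire if net input > 0" rule applies unchanged.

```python
# Hand-wired linear threshold network for XOR: a hidden unit detects the
# (1,1) case and inhibits the output unit. A unit fires (activation 1)
# when its summed input exceeds 0, as in the activation rule above.
def step(net):
    return 1 if net > 0 else 0

def xor_net(x1, x2):
    h = step(1.0 * x1 + 1.0 * x2 - 1.5)            # on only for input (1,1)
    y = step(1.0 * x1 + 1.0 * x2 - 2.0 * h - 0.5)  # OR, minus the AND case
    return y

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", xor_net(x1, x2))       # outputs 0, 1, 1, 0
```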
9. Input-to-hidden layer connections
- Make items that yield similar outputs more similar to each other, and items with different outputs less similar.
10. Multilayer networks
- An extra layer is needed to solve complex mappings such as XOR, in which similar inputs don't always correspond to similar outputs
- A hidden layer sits between input and output
- The hidden layer is smaller than the input and output layers, so it must find an efficient way to compress information
- Use back-propagation of error to train the weights (a minimal sketch follows)
- Multilayer networks are as powerful as Turing machines.
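As a concrete illustration of back-propagation on the XOR mapping, here is a minimal NumPy sketch. The layer sizes, learning rate, epoch count, and random seed are arbitrary choices; with very few hidden units, training can occasionally stall in a poor local minimum.

```python
import numpy as np

# Minimal back-propagation on XOR: a 2-4-1 sigmoid network trained by
# batch gradient descent. All hyperparameters are illustrative.
rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(size=(2, 4)); b1 = np.zeros(4)   # input  -> hidden weights
W2 = rng.normal(size=(4, 1)); b2 = np.zeros(1)   # hidden -> output weights
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
lr = 0.5

for _ in range(10000):
    H = sigmoid(X @ W1 + b1)          # forward: hidden activations
    Y = sigmoid(H @ W2 + b2)          # forward: output activations
    dY = (Y - T) * Y * (1 - Y)        # error derivative at the output
    dH = (dY @ W2.T) * H * (1 - H)    # error propagated back to the hidden layer
    W2 -= lr * H.T @ dY; b2 -= lr * dY.sum(axis=0)
    W1 -= lr * X.T @ dH; b1 -= lr * dH.sum(axis=0)

print(np.round(Y.ravel(), 2))         # typically close to [0, 1, 1, 0]
```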
11. Supervised learning isn't always easy
[Figure: Object 1 and Object 2 are shown; the task is to predict the output for each]
12. Supervised learning isn't always easy
13. Supervised learning isn't always easy
- If Object 1 is edible to a vegetarian, you get an ant; otherwise you get a lamp.
14. Example from reading aloud
15. GPC rules
- Grapheme-to-Phoneme Correspondence rules: spelling-to-sound rules
  - E → /E/, A → /æ/, but EA → /I/
  - BED, BAD, BEAD
- Allow pronunciation of novel words or nonwords: FLORP
- The learner already knows the sound-meaning relationship
  - for sound-based orthographies, a good strategy would be to go from spelling to sound to meaning (a toy rule applier follows)
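A toy GPC rule applier might look like the sketch below. The longest-match strategy (so that EA overrides E) and the tiny rule table are assumptions for illustration, not an actual proposed rule set.

```python
# Toy GPC rule application: longer graphemes are tried before shorter
# ones, so EA -> /I/ wins over E -> /E/. The rules are an illustrative set.
RULES = {"EA": "I", "E": "E", "A": "æ", "B": "b", "D": "d", "F": "f",
         "L": "l", "O": "o", "R": "r", "P": "p"}

def gpc(word):
    phonemes, i = [], 0
    while i < len(word):
        for size in (2, 1):            # longest grapheme first
            grapheme = word[i:i + size]
            if grapheme in RULES:
                phonemes.append(RULES[grapheme])
                i += size
                break
        else:
            i += 1                     # skip letters with no rule
    return "/" + "".join(phonemes) + "/"

print(gpc("BED"), gpc("BAD"), gpc("BEAD"), gpc("FLORP"))  # /bEd/ /bæd/ /bId/ /florp/
```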
16. Dual route model
- Via sound (mediated, assembled)
  - Convert letters to phonemes and use the phonological form to find meaning
  - Phonics: sounding out words
  - Shows effects associated with spelling-sound correspondences
    - regularity effects
- Direct (lexical)
  - From the whole visual word form to meaning and phonological form
  - Need experience with a word to know its pronunciation (AISLE, PREFACE)
  - Whole-word reading method
  - Shows effects associated with whole words
    - word frequency
    - semantic priming
17. Rough sketch of a dual route model (Coltheart, Curtis, Atkins, & Haller, 1993)
[Diagram: letter detectors feed two routes. The route for irregular words runs from visual word detectors through the semantic system to the phonological output lexicon; the route for regular words runs through the GPC rule system. Both routes converge on phonemes.]
18. Frequency by regularity interaction
[Graph: response time (ms, roughly 500-600) as a function of word frequency (high vs. low) for regular and irregular words; irregular words show a much larger frequency effect. Sources: Coltheart (1978, 1985); Marshall & Newcombe (1973); Morton & Patterson (1980); Paap & Noel (1991)]
19. Patterns of dyslexia
- Phonological
  - Can read high-frequency words, but have trouble with uncommon words
  - Can't read aloud pronounceable nonwords
    - SLORF
- Surface
  - Can read and sound out regular words and nonwords, but not irregular words
  - Regularize irregular words
    - e.g., say PINT so it rhymes with MINT
20. Seidenberg & McClelland (1989)
- Learning: weights between units start out random; weight adjustment is scaled by the frequency of the word.
- Results: the model learns the correct pronunciation of ~3000 single-syllable words.
- Unit activations may be close to 0 or 1, but not exact; the distance from the target is the error score.
- The model's analog of human naming latency is the phonological error score.
[Diagram: the SM89 framework, in which orthography (input MAKE) maps to phonology (output /mAk/) via hidden units, with meaning and context units also shown]
21. Implemented SM89 model
- Learns to activate phonological units given orthographic ones (no meaning units)
  - the distance between the model's activation levels and the target activations is the error measure, the analog of RT (see the sketch below)
- No word units or explicit GPC rules
- Showed the frequency by regularity interaction
- Showed pronunciation priming (TINT after MINT/PINT) and repetition priming
- Could pronounce some nonwords
- Showed performance during learning like kids learning to read
- Suggested degrees of spelling-sound regularity
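A sketch of that error measure: the distance between the obtained and target phonological patterns, standing in for naming latency. The activation values below are hypothetical, and summed squared distance is just one reasonable choice of metric.

```python
import numpy as np

def phonological_error_score(output, target):
    # Summed squared distance between obtained and target activations;
    # larger scores stand in for slower naming latencies.
    output, target = np.asarray(output), np.asarray(target)
    return float(np.sum((output - target) ** 2))

# Hypothetical trained outputs: close to 0/1 but never exact.
print(phonological_error_score([0.9, 0.1, 0.8], [1, 0, 1]))  # 0.06
```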
22. Regularity
- Regular: CODE, BIRD
- Regular inconsistent: CONE, SHONE, BONE; GAVE, PAVE, SAVE
- Ambiguous: WIND, LEAD
- Regular nonword: GLIP
- Inconsistent nonword: MAVE, NUST
- Pseudohomophone: BURD
- Irregular: NONE, GONE, DONE, HAVE, PINT
- Unique/strange: SOAP, AISLE, FUGUE
23. Regularity versus consistency
- Consistency effect
  - Naming time for regular inconsistent words (e.g., SAND, whose neighbor WAND is irregular) is longer than for regular consistent words (e.g., WEEK) (Glushko, 1979)
- Consistency is a statistical property of words and is highly variable across words in the English language.
24. Model performance
- Frequency by regularity interaction
[Graph: phonological error score (roughly 2-5) as a function of word frequency (high vs. low) for regular, regular inconsistent, and irregular words; irregular words show the largest frequency effect]
25. Neighborhoods
- Lexical neighbors
  - words spelled similarly to the target
  - DOG has neighbors LOG, BOG, DOE, DIG, etc.
- Jared, McRae, & Seidenberg (1990)
  - Friends are neighbors that share the target's spelling-to-sound correspondence; enemies do not.
  - The size of the consistency effect depends on the summed frequency of friends versus enemies (see the sketch below).
  - Higher-frequency friends, smaller consistency effect.
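A toy version of that computation; the words, the frequency counts, and the ratio used to summarize friends versus enemies are hypothetical illustrations.

```python
# Friends share the target's spelling-sound mapping; enemies do not.
# Hypothetical -INT neighborhood for the target MINT, with made-up counts.
friends = {"tint": 30, "hint": 90, "lint": 10}   # -int rhymes with MINT
enemies = {"pint": 50}                           # -int rhymes with PINT

f = sum(friends.values())
e = sum(enemies.values())
consistency = f / (f + e)     # nearer 1.0 -> smaller consistency effect
print(round(consistency, 2))  # 0.72
```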
26. Parallel distributed processing approach to word naming
- PDP models excel at extracting statistical relations between input and output patterns.
- The learning process (e.g., back-propagation) is sensitive to the idiosyncratic characteristics of words
  - SAND is more consistent than PINT, which is more consistent than AISLE
27. Sublexical units
- In a symbolic model
  - each new grouping requires its own level of representation
    - syllable
    - morpheme
    - bigram
- In a connectionist model
  - sublexical representations may emerge without being built in
    - syllables in phonological representations, from frequently used groups of phonemes
    - morphemes, from similarity in form and meaning
28. Deep dyslexia
- Mostly semantic errors in reading
  - Semantic: NIGHT → "sleep"
  - Visual: SCANDAL → "sandals"
  - Visual + semantic: SHIRT → "skirt"
  - Visual then semantic: SYMPATHY → (symphony) → "orchestra"
- Symbolic models need to assume that multiple modules are damaged
  - visual and semantic, but there is no principled reason for that combination to co-occur frequently
29. Attractors
- In models with recurrent connections, activation levels eventually stop changing
  - they reach a stable state, an attractor
  - a state in which many constraints are satisfied
- Connection weights are trained to drive activation levels to a stable state
  - they create a landscape in which the input determines the starting point, and the activation pattern is a rolling ball that settles at the lowest point
- When closer to a stable state, activation levels change faster
- Time steps to settle correspond to reaction times (see the sketch below)
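A toy attractor network in the Hopfield style, a simplification of the recurrent models discussed here: units update until the activation pattern stops changing, and the number of steps to settle stands in for reaction time. The stored pattern and start state are arbitrary.

```python
import numpy as np

# One stored pattern defines an attractor; a noisy start state rolls into it.
p = np.array([1, -1, 1, -1, 1])         # learned stable state (+1/-1 units)
W = np.outer(p, p).astype(float)
np.fill_diagonal(W, 0)                  # no self-connections

state = np.array([1, -1, -1, -1, 1])    # input = starting point in the landscape
for step in range(1, 20):
    new = np.where(W @ state >= 0, 1, -1)
    if np.array_equal(new, state):
        break                           # activations stopped changing: settled
    state = new

print(step, state)                      # settling steps (the RT analog) and the attractor reached
```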
30. Hinton & Shallice (1991)
- Connectionist model of deep dyslexia
- Trained the model to map from orthography to semantics
  - 40-word set, 5 categories
- The recurrent network developed an attractor structure
- Lesioned different parts of the model by removing units, removing connections, or randomly changing weights
[Diagram: graphemes → hidden units → sememes, with clean-up units feeding back onto the sememes]
31. Modeling dyslexia
- All locations and ways of lesioning led to a similar mix of errors, like those of deep dyslexics.
- Lesions changed the attractor shapes; shape is determined by all the weights in the model.
[Diagram: mapping from orthographic space (cat, cap, dog) to semantic space (cat, dog, cap); words that are close in spelling need not be close in meaning]
32. Modeling dyslexia
[Diagram: the same orthography-to-semantics mapping as on the previous slide, with the correspondence between the two spaces altered, illustrating how lesions change which attractor an input settles into]
33. Dependent on architecture?
- Variations on the network architecture produced the same results (Plaut & Shallice, 1993).
- But abstract words (few sememes) depend less on the clean-up units than concrete words (many sememes).
- Lesion the orthography→hidden or hidden→semantics connections: concrete words are spared relative to abstract words.
- Lesion the semantics→clean-up connections: abstract words are spared relative to concrete words.
- A double dissociation within one unified processing system!
34. Simple recurrent network (SRN)
- A feedforward network with a memory
- The memory is a bank of context units that stores the hidden values from the previous time step (a minimal sketch follows)
- Used for modeling sequential behavior
  - word-by-word sentence prediction/production
  - phoneme-by-phoneme prediction
  - learns statistical regularities in language
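A skeleton of the SRN forward pass; the layer sizes and random weights are illustrative, and training (e.g., back-propagating the prediction error) is omitted. The key move is copying the hidden activations into the context bank after every time step.

```python
import numpy as np

# Simple recurrent network (Elman-style) forward pass with illustrative sizes.
rng = np.random.default_rng(0)
n_in, n_hid, n_out = 5, 8, 5
W_ih = rng.normal(scale=0.5, size=(n_in, n_hid))   # input   -> hidden
W_ch = rng.normal(scale=0.5, size=(n_hid, n_hid))  # context -> hidden
W_ho = rng.normal(scale=0.5, size=(n_hid, n_out))  # hidden  -> output
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

context = np.zeros(n_hid)               # memory of the previous time step
for x in np.eye(n_in):                  # a toy sequence of one-hot "words"
    hidden = sigmoid(x @ W_ih + context @ W_ch)
    output = sigmoid(hidden @ W_ho)     # prediction for the next item
    context = hidden.copy()             # store hidden state for the next step
    print(np.round(output, 2))
```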
35. Implemented models
- Elman's sentence prediction model
  - SRN trained to predict the next word in a sentence
  - Elman (1990, 1991)
  - hidden units came to reflect similarity in word use
  - important because syntax and word classes were the specialty of symbolic models
- Christiansen & Chater's recursive model
  - Same design and task as Elman's SRN models
  - Demonstrated learning of recursion, with performance degrading as embeddings increase
36. Christiansen & Chater (1999)
- Modeled the types of recursion that Chomsky (1957) said finite state models and context-free grammars couldn't handle
- Right branching: easy
  - "John loves Mary who likes Jim who dislikes Martha."
- Counting recursion
  - "if S1 then S2"; "if (if S1 then S2) then S3"
- Mirror recursion (center embedding)
  - NP1 NP2 V2 V1: "The cat the dog chased died."
- Identity recursion (cross-dependency)
  - NP1 NP2 V1 V2: Dutch has these structures
37. Performance after training
- For humans
  - difficulty ordering: if-then < cross-dependency < center embedding
  - performance declines with multiple embeddings
- For SRN models
  - same ordering: if-then < cross-dependency < center embedding
  - generalized to deeper embeddings than seen in training
  - performance declines for all structures (even right branching) with multiple embeddings
- For trigram/bigram models (transitional probabilities for word pairs or triplets; a sketch follows)
  - if-then < center embedding < cross-dependency
  - a worse match to human data than the SRN models
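For comparison, a minimal bigram model of the kind mentioned above: transitional probabilities estimated from word-pair counts. The "corpus" here is a hypothetical toy.

```python
from collections import Counter

# Bigram transitional probabilities P(next word | current word).
corpus = "the cat the dog chased died".split()   # hypothetical toy corpus
pair_counts = Counter(zip(corpus, corpus[1:]))
first_counts = Counter(corpus[:-1])

def p_next(w1, w2):
    # Estimated probability that w2 follows w1 in the corpus.
    return pair_counts[(w1, w2)] / first_counts[w1] if first_counts[w1] else 0.0

print(p_next("the", "cat"))  # 0.5: "the" precedes "cat" once and "dog" once
```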
38. "The apartment that the maid who the service had sent over was well decorated."
39. "The apartment that the maid who the service had sent over was cleaning every week was well decorated."
40. Unexpected results
- The ungrammatical NNNVV pattern is rated more grammatical than the grammatical NNNVVV pattern
  - Gibson & Thomas (1997); Christiansen & MacDonald (1999)
- The Christiansen & Chater recursive model shows the same effect
  - after NNNVV, the model activates the end-of-sentence marker more highly than the set of possible third verbs
- People rate sentences as less grammatical as right-branching structures (PPs) are added; recursive models show the same trend
  - "The blooming flowers in the vase on the table by the window resemble roses."
41. PDP Features & Principles
- Massively parallel processing
  - Neurons are much slower than computers, so processing must be parallel to accomplish tasks in under a second
- Knowledge/representation and processing are merged
- Implicit knowledge in connections; learning through adjusting them
  - Information is not available outside of processing
- Initial architecture constrains learning
- Distributed representations permit graceful degradation
42. PDP approach best at
- Classification
  - automatic similarity-based generalization
- Pattern recognition
  - finding the best match quickly, even with noisy data
- Memory/recall
  - content addressable: a retrieval cue leads to reconstruction of the memory
- Optimization
  - finding the best organization given constraints
- Prediction/inference
  - e.g., causes from effects, diseases from symptoms
43. Problems for connectionist models
- Trade-off between the ability to generalize and the ability to recall individual episodes or examples
  - Models usually opt for generalization
  - No one-trial learning for dissimilar examples
- Learning new information may interfere with old information (catastrophic interference)
- Difficult to model the entire appropriate training set, but training on only part of it might give unrealistic results
- Networks may fail due to non-critical assumptions made in implementing the model
- Often require sophisticated analysis to understand how a problem is solved
44. PDP vs. Symbolic
- Learning/development
- Damage (graceful degradation)
- Time course
- Generalization within training space
- Generalization outside training space
- Scaling up
45. Trends in modelling
- Training on real input
  - e.g., parental speech from transcripts
- Scaling up
  - e.g., modeling a whole domain
- Training on multiple tasks
  - e.g., sentence comprehension and production
- Adding more neurobiological constraints
  - e.g., modeling specific patients and predicting their recovery; using neuroanatomy to constrain the architecture