Title: Connections
1 Chapter 7
- Connections
- Associations
- Neural Networks
- Parallel Distributed Processing (PDP)
2 Origin of Parallel Distributed Processing
- Work in PDP began as an effort to computationally model how the networks of neurons in the brain might contribute to thought.
3 Neurons
- Neurons are cells of the nervous system.
- Neurons are specialized to carry "messages" through an electrochemical process.
- Neurons send these messages through a single axon and receive messages through multiple dendrites.
- The human brain has about 100 billion neurons.
4 The Neuron
5 Differences between axons and dendrites
- Axons
- Take information away from the cell body
- Smooth Surface
- Generally only 1 axon per cell
- Branch further from the cell body
- Dendrites
- Bring information to the cell body
- Rough Surface (dendritic spines)
- Usually many dendrites per cell
- Branch near the cell body
6 Axonal Transmission
- A nerve cell at rest is electrically negatively charged relative to its surroundings.
- A stimulus causes gates in the axonal membrane to open, allowing positively charged sodium ions to rush into the cell.
- When the cell reaches its maximal positive state, these gates close.
- Another set of gates allows positively charged potassium ions to leave the cell, restoring the resting potential (repolarization).
7 Synaptic Transmission
8 Synaptic Junctions
- Two types
- Excitatory
- depolarization occurs at postsynaptic membrane sites, raising the postsynaptic membrane potential
- Inhibitory
- hyperpolarization occurs at postsynaptic membrane sites, lowering the postsynaptic membrane potential
- polarization is the production of a reverse electromotive force
9 From Neuron to Perceptron
- In the 1950s and 60s, researchers tested computational models of networks resembling networks of neurons.
- The model was called a perceptron.
10 From Neuron to Neural Network
[Figure (Rich & Knight)]
11 A Perceptron
[Figure: inputs x1 and x2 feed a threshold unit Σ through connections weighted w1 and w2]
x1 and x2 are inputs; w1 and w2 are connection weights; Σ is the weighted sum of the inputs. If Σ exceeds the threshold, the perceptron will fire. (Rosenblatt)
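The firing rule just described can be sketched in a few lines of Python (an illustrative sketch; the function name and the list representation of inputs are my own, not Rosenblatt's notation):

```python
def perceptron(inputs, weights, threshold):
    """Fire (return 1) if the weighted sum of the inputs exceeds the threshold."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return 1 if total > threshold else 0

# Two inputs, both weights 1, threshold 1.5: fires only when both inputs are on.
print(perceptron([1, 1], [1, 1], 1.5))  # 1 (fires)
print(perceptron([1, 0], [1, 1], 1.5))  # 0 (does not fire)
```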
12 The AND and OR relations are natural input to a perceptron
- AND (with w1 = w2 = 1, a threshold between 1 and 2 separates the cases):
    x1  x2  Sum (Σ)  Fires?
    0   0   0        no
    0   1   1        no
    1   0   1        no
    1   1   2        yes
- OR (with w1 = w2 = 1, a threshold between 0 and 1 separates the cases):
    x1  x2  Sum (Σ)  Fires?
    0   0   0        no
    0   1   1        yes
    1   0   1        yes
    1   1   2        yes
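The two tables can be checked with a short sketch. The weights of 1 come from the tables; the particular thresholds (1.5 for AND, 0.5 for OR) are illustrative choices within the separating ranges:

```python
def fires(x1, x2, threshold, w1=1, w2=1):
    """A two-input perceptron: fire when the weighted sum exceeds the threshold."""
    return int(x1 * w1 + x2 * w2 > threshold)

# AND: a threshold between 1 and 2 (here 1.5) is exceeded only by input (1, 1).
print([fires(x1, x2, 1.5) for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [0, 0, 0, 1]

# OR: a threshold between 0 and 1 (here 0.5) is exceeded whenever either input is on.
print([fires(x1, x2, 0.5) for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [0, 1, 1, 1]
```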
13 Sample AND Relation
- AND is the English 'and,' meaning both. For example, to take PSYC 301 you must have taken PSYC 220 and PSYC 203.
    220   203   Take 301?
    no    no    no
    no    yes   no
    yes   no    no
    yes   yes   yes
14 Sample OR Relation
- OR is 'inclusive or,' meaning either, possibly both. For example, an insurance policy states that insurance premiums will be waived in the event of sickness or unemployment.
    Sick   Unemployed   Premium Waiver?
    no     no           no
    no     yes          yes
    yes    no           yes
    yes    yes          yes
15 Perceptrons and Learnability
- A perceptron, faced with new data, can reset its weights.
- This means it can learn.
16 Learnability
- The symbol/rule systems (logic, rules, concepts, analogies, images) have proposed only one explanation of human learning:
- innate knowledge and parameter setting (as in Universal Grammar)
- The perceptron provided another explanation of learning, one that involves learning entirely from experience.
17 The exclusive OR relation cannot be computed by a perceptron
- Exclusive or (XOR), with w1 = w2 = 1:
    x1  x2  Sum (Σ)  Should fire?
    0   0   0        no
    0   1   1        yes
    1   0   1        yes
    1   1   2        no
- No single threshold works: it would have to be below 1 so that (0, 1) and (1, 0) fire, yet at least 2 so that (1, 1) does not.
- An example of an XOR relation: at a restaurant, the lunch special is a cheeseburger with either salad or french fries, but not both.
- Minsky and Papert.
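Minsky and Papert's point can be illustrated by brute force (a sketch, not their analytical argument): no pair of weights and threshold, here searched over a coarse grid, lets a single threshold unit compute XOR.

```python
import itertools

def fires(x1, x2, w1, w2, threshold):
    # A single threshold unit: fire when the weighted sum exceeds the threshold.
    return int(x1 * w1 + x2 * w2 > threshold)

xor_table = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

# Search a grid of weights and thresholds from -3.0 to 3.0 in steps of 0.1.
values = [i / 10 for i in range(-30, 31)]
found = any(
    all(fires(x1, x2, w1, w2, t) == out for (x1, x2), out in xor_table.items())
    for w1, w2, t in itertools.product(values, repeat=3)
)
print(found)  # False: no single unit in the grid computes XOR
```

The grid search is only suggestive; the underlying reason is that XOR is not linearly separable, so no weights at all would work.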
18 The Fate of the Perceptron
- Due to Minsky and Papert's XOR argument, interest in the perceptron waned.
- In the 1980s, interest in neural networks revived with the notions of
- hidden units in the network
- backpropagation
19 Stereopsis: assigning structure to data
- Interest in artificial neural nets revived with studies of stereoscopic vision.
- The perception of depth involves assigning structure (our perception of the object: its distance, its depth, its structure) to data (the object in the physical world).
- Whenever structure is assigned to data, the question arises as to whether the assignment is done top-down (from structure to data) or bottom-up (from data to structure).
20 Top-down vs. Bottom-up Processing
- Working top-down, a system uses knowledge of structure to predict the details to be found in the data.
- Working bottom-up, a system uses the data to predict high-level structure.
- With stereopsis, the question is whether we work bottom-up from simple disparities between the right and left images, or whether we anticipate the image in depth by knowing something about its structure in advance.
21 The Perception of Unstructured Data
- Bela Julesz (1971) showed that pictures composed of random dots could produce depth effects. (worksheet)
- This implies that stereopsis can work bottom-up from simple disparities in the data alone.
- Even telling subjects what they should see does not speed up the perception of depth (Frisby and Clatworthy).
22 Bottom-up vs. Top-down
- Julesz's discovery indicated that depth perception is driven by low-level brain activity making sense of the data, rather than by high-level principles or rules imposing structure on the data.
- Marr and Poggio (1976) built a computer program for stereopsis that relies on two low-level constraints they believed could be wired into the brain to guide the matching of the two images.
23 Stereo Matching
- 3 lines of sight from each eye give nine possible points of fusion.
- There are 3 adjacent points on the surface of the object.
- How do we resolve to these 3 out of the possible 9?
[Figure: candidate fusion points arrayed in depth]
24 Uniqueness Constraint on Stereo Matching
- A point in one image can normally be matched with one and only one point in the other image.
- (Each link in the figure is inhibitory: if one fusion point is active, the points to which it is linked are not active.)
25 Continuity Constraint on Stereo Matching
- Two adjacent points in an image will tend to represent points at about the same depth.
- (Each link in the figure is excitatory, and so if one fusion point is active it excites all those to which it is connected.)
26 Marr and Poggio's Program for Stereopsis
- The brain has to apply these constraints in parallel (simultaneously).
- Marr and Poggio designed an array of processors to handle the fusion of images under these constraints.
- Each processor unit does the same computation on the particular values that are local to it (its own value and those of its neighbors).
- The output from each processor is then used on a fresh cycle of activity.
- The system continues to compute until it settles down to stable values at each processor.
27
- Connectionism was back in business.
- Marr and Poggio's program was followed by
- Feldman's (1981) model of visual representation in memory
- McClelland and Rumelhart's (1981) model of letter perception
28 Key Elements in Current Connectionist Models
- hidden units/layers
- distributed representation
- parallel processing
- parallel constraint satisfaction
- excitatory and inhibitory links
- Learning
- Hebbian learning
- delta rule
- backpropagation
- feedforward network
- spreading activation
- relaxation/settling
- graceful degradation
29 Hidden Units: A Simple XOR Network with One Hidden Unit
[Figure: input units x1 and x2, a hidden unit with threshold 1.5, and an output unit with threshold .5. Each input connects to the hidden unit and to the output unit with weight 1; the hidden unit connects to the output unit with weight -2.]
The numbers on the arrows show the strengths of connections among units. The numbers in the circles show the thresholds of the units. If both input units are on, the hidden unit's threshold will be tripped, sending an inhibitory signal (here -2) to the output unit. Since the output unit's net input (1 + 1 - 2 = 0) is below its .5 threshold, the output unit will not fire. (Rumelhart, Hinton, & Williams)
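The network just described can be simulated directly (an illustrative sketch; the unit and function names are my own):

```python
def unit(total, threshold):
    """A threshold unit: fire (1) when its net input exceeds its threshold."""
    return 1 if total > threshold else 0

def xor_net(x1, x2):
    # Hidden unit: weight 1 from each input, threshold 1.5 (on only for (1, 1)).
    hidden = unit(x1 * 1 + x2 * 1, 1.5)
    # Output unit: weight 1 from each input, weight -2 from the hidden unit,
    # threshold .5. The hidden unit vetoes the (1, 1) case.
    return unit(x1 * 1 + x2 * 1 + hidden * -2, 0.5)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor_net(x1, x2))  # fires exactly when x1 != x2
```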
30 A Local Connectionist Network
- Involves a one-to-one correspondence between concepts and hardware units.
- Each unit is given an identifiable interpretation in terms of specifiable concepts or propositions.
[Figure: units labeled "likes programming," "likes parties," "computer geek," "outgoing," "shy"]
31 Distributed Connectionist Networks
- Each entity is represented by a pattern of activity distributed over many computing elements.
- Each computing element is involved in representing many different entities.
- As an example, look at the Necker cube in B on the worksheet. How many ways can you interpret the cube?
32 The Necker Cube
- There are 2 global interpretations of the Necker cube:
- in one, point a is the front upper left point
- in the other, point a is the back upper left point
- In each interpretation, the cube has 8 points.
- But the orientation of each point differs depending on the interpretation.
33
- When point a is the front upper left (FUL) point:
- point b is the front lower left (FLL) point
- point c is the front lower right (FLR) point
- point d is the back upper right (BUR) point
- point . . .
- (wksht B)
34
- The interpretation of the Necker cube as a cube with its front facing downward (i.e., with point a as FUL) depends on all of the points in the cube having one and only one orientation.
- In other words, this interpretation is distributed over 8 orientations that must be on together in order for us to interpret the cube as having its front facing downward.
36 Parallel Processing
- To interpret the cube in one orientation, we do not
- first establish a as FUL
- then establish b as FLL
- then establish c as FLR
- in a serial fashion.
- We interpret all the points simultaneously, in parallel.
37 Parallel Processing (2)
- A single computer processor
- Time: nanoseconds (1 billionth of a sec.) per operation
- Mode: serial (consecutive)
- The human brain
- Time: milliseconds (1 thousandth of a sec.) per operation
- Mode: parallel (simultaneous)
38 Parallel Constraint Satisfaction
- There are 3 constraints on the relations between the points in the Necker cube:
- Each point can have only one label at a time (point a cannot be both FUL and BUL simultaneously).
- Each point depends on the interpretation of its near neighbors.
- Each label can be used only once in an interpretation (e.g., no 2 points can be FUL).
- These constraints all hold simultaneously.
39 Parallel Constraint Satisfaction in Word Recognition
40 Constraints on Word Recognition
[Figure: three levels of units: words (RED, KEY, BEE), first letters (R, K, T), and the features of those letters]
41 Word Recognition
- Constraints operate at the feature, letter, and word levels.
- Each word is represented by a processing unit.
- Each letter at each position in a word is represented by a processing unit.
- Some pairs of units excite each other:
- "the word is RED" and "the 1st letter is R"
- Inconsistent pairs inhibit each other:
- "the word is KEY" and "the 1st letter is R"
- (McClelland & Rumelhart)
42 Excitatory and Inhibitory Links
- Each point can have only one label at a time
- → negative link between a point's label in one interpretation and the other
- Each point depends on the interpretation of its near neighbors
- → positive link between a point and its 3 closest neighbors
- Each label can be used only once in an interpretation
- → negative link between 2 identical labels representing different points
43 Representation of Links as a Connectivity Matrix
         u1  u2  u3  u4  u5  u6  u7  u8
    u1    0   0   0   0   0  -5  -5   0
    u2    0   0   0   0   6   5   0   6
    u3    0   0   0   0   6   5   0   6
    u4    0   0   0   0   6   5   0   6
    u5    0   0   0   0   6   5   0   6
- u5 strongly excites u2
- u6 strongly inhibits u1
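A connectivity matrix like this drives settling: on each cycle, every unit sums its external input plus matrix-weighted input from the currently active units, and turns on when the total exceeds its threshold. Here is a toy 3-unit sketch; the matrix values are hypothetical, merely echoing the excitatory (6) and inhibitory (-5) magnitudes above.

```python
# W[i][j] is the strength of the link from unit j to unit i:
# u1 and u2 inhibit each other (like rival interpretations); u3 excites u2.
W = [
    [0, -5, 0],
    [-5, 0, 6],
    [0, 0, 0],
]
external = [1, 1, 1]  # constant external input to each unit

def settle(W, external, steps=10, threshold=0.5):
    state = [0, 0, 0]
    for _ in range(steps):
        totals = [external[i] + sum(W[i][j] * state[j] for j in range(3))
                  for i in range(3)]
        state = [1 if t > threshold else 0 for t in totals]
    return state

print(settle(W, external))  # [0, 1, 1]: u2 wins because u3 backs it up
```

After two cycles the state stops changing, which is what "settling" means in the later slides.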
44 Hebb: association between neurons
- "When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased."
- From The Organization of Behavior, 1949.
45 Hebb: a cell assembly
- Many repetitions of a sensory event will lead to the gradual building up of a set of perhaps 25 to 100 neurons in a cell assembly.
- "One assembly will form connections with others, and it may therefore be made active by one of them in the total absence of the adequate stimulus. In short, the assembly activity is the simplest case of an image or an idea . . . ."
46 Hebbian Learning
- A type of learning in which the weight on a link between two units is increased if both units are active at the same time.
- Expressed as the delta rule.
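The rule "strengthen the link when both units are active together" is a one-liner (a sketch; the learning rate of 0.25 is an arbitrary illustrative choice):

```python
def hebbian_update(w, a_i, a_j, rate=0.25):
    """Hebbian learning: the weight grows only when both units are active."""
    return w + rate * a_i * a_j

w = 0.0
for _ in range(4):               # four co-activations of the two units
    w = hebbian_update(w, 1, 1)
print(w)                         # 1.0: repeated co-activation strengthened the link
w = hebbian_update(w, 1, 0)      # one unit inactive: no change
print(w)                         # 1.0
```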
47 The Delta Rule
- Δwij = g(ai(t), ti(t)) · h(oj(t), wij)
- A change (Δ) in the weight of a link between unit i and unit j is the product of
- the function g of the activation of unit i (ai at time t) and its teaching input (ti(t)), and
- the function h of the output value of unit j and the connection strength between unit i and unit j (wij).
48 The Delta Rule (2)
- A change in the weight from unit i to unit j is the product of
- the value of the teaching input
- the value of the current activation state of unit i
- the value of the output of unit j
- the current weight between unit i and unit j
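The slide leaves g and h unspecified. One common concrete instance (an assumption of this sketch, not the slide's definition) takes g as the learning rate times the error at unit i, and h as simply the output of unit j:

```python
def delta_update(w_ij, a_i, t_i, o_j, rate=0.5):
    # g(a_i, t_i) taken as rate * (t_i - a_i): the error at unit i.
    # h(o_j, w_ij) taken as o_j: credit is assigned to active sending units.
    return w_ij + rate * (t_i - a_i) * o_j

# Unit i is under-activated (a_i = 0) relative to its teaching input (t_i = 1)
# and unit j is on (o_j = 1), so the weight from j to i increases.
print(delta_update(0.25, a_i=0, t_i=1, o_j=1))  # 0.75
# With no error (t_i == a_i), the weight is unchanged.
print(delta_update(0.25, a_i=1, t_i=1, o_j=1))  # 0.25
```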
49 The Delta Rule (3)
[Figure: unit j sends its output (outputj) across the weight wij to unit i, which has activation ai and teaching input ti]
- If there is no teaching input, the weights change in proportion to the change in ai.
50 Advantages of Hebbian Learning
- founded on known neurological activity (neurologically real)
- can occur without explicit correction (without a teacher)
51 Feedforward Networks
- A network in which information flows only from the input layer to the output layer, but not back.
- input layer: the set of units in a network that is directly activated by the input
- output layer: the last layer of units
- hidden layer: an intermediate layer that serves to readjust threshold weights
52 Learning by Backpropagation
- 1. Weights between units are randomly assigned.
- 2. There is an initial test phase in which an input activation is introduced and propagates through the network to yield an output.
- 3. This output is compared with the required output given by an external source (teacher). If there is a difference, an error signal is calculated.
- 4. The error signal causes the weights to change backward through the network.
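The steps above can be sketched as a tiny 2-2-1 network trained on XOR. Sigmoid units, squared error, the learning rate, and the epoch count are all assumptions of this sketch; the slide fixes none of them.

```python
import math
import random

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

random.seed(0)
# Step 1: weights between units are randomly assigned.
W1 = [[random.uniform(-1, 1), random.uniform(-1, 1)] for _ in range(2)]  # input -> hidden
b1 = [0.0, 0.0]
W2 = [random.uniform(-1, 1), random.uniform(-1, 1)]                      # hidden -> output
b2 = 0.0
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
rate = 0.5

def forward(x):
    # Step 2: input activation propagates through the network to yield an output.
    h = [sigmoid(W1[i][0] * x[0] + W1[i][1] * x[1] + b1[i]) for i in range(2)]
    o = sigmoid(W2[0] * h[0] + W2[1] * h[1] + b2)
    return h, o

def total_error():
    return sum((forward(x)[1] - t) ** 2 for x, t in data)

initial_error = total_error()
for _ in range(5000):
    for x, t in data:
        h, o = forward(x)
        # Step 3: compare output with the teacher's target; compute error signals.
        delta_o = (o - t) * o * (1 - o)
        delta_h = [delta_o * W2[i] * h[i] * (1 - h[i]) for i in range(2)]
        # Step 4: the error signal changes the weights backward through the network.
        for i in range(2):
            W2[i] -= rate * delta_o * h[i]
            W1[i][0] -= rate * delta_h[i] * x[0]
            W1[i][1] -= rate * delta_h[i] * x[1]
            b1[i] -= rate * delta_h[i]
        b2 -= rate * delta_o
print(initial_error, total_error())  # the error shrinks with training
```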
53 Disadvantages of Backpropagation
- not neurologically realistic
- requires an external teaching source
54 Relaxation and Settling
- Relaxation: the procedure whereby a system settles into a locally optimal state in which as many as possible of the constraints are satisfied.
- Suppose all units are off, and you focus on the FLL unit on the left of the network. This unit receives positive input from the diagram. The activation of this FLL unit spreads to FUL, BLL, and FLR on the left of the network, turning them on, and on to the units for which they have excitatory links. Why?
55 Relaxation and Settling (2)
- The activation of this FLL unit also spreads to FLL and BLL toward the right of the network, turning them off. Why?
- At this point, the network 'settles' into the interpretation that corresponds to the network of 8 units on the left.
- Gazing at the lower left corner of the diagram may turn on the BLL unit near the center of the network, turning on the BLL, FLL, and BLR units with which it has excitatory links, and turning off its competitors (FLL and BLL in the network on the left).
- At this point, the network settles into the other interpretation.
56 Graceful Degradation
- Because knowledge is distributed, i.e.,
- every unit is involved in the storage of patterns of connections
- each pattern of connections involves many units
- if some units or connections are lost, the stored knowledge will be degraded but not entirely lost.
- In other CRUM models, the failure of one component means a loss of knowledge.
57 Rumelhart: metaphors for mind
- New: the brain
- slow (milliseconds)
- parallel processing
- parallel constraint satisfaction
- knowledge stored in links between units
- accommodates approximate matching (by prototype)
- Old: the computer
- fast (nanoseconds)
- serial processing
- serial constraint satisfaction
- knowledge stored as facts and rules
- demands exact matching
58 Approximate Matching
- PDPs have content-addressable memory:
- many patterns (units and links) are stored
- the network can perform a match when only a portion of the pattern is given
- the network finds the closest match
- Consider the Necker cube. How many units and links do you need to be given to match one orientation of the cube?
59 Issues in Rumelhart
60 Architecture: more than just input, hidden, and output units
- Human activity that takes place in time involves sequential as well as parallel behavior (e.g., movement, speech).
- How do PDPs blend the sequential and the parallel?
- Plan units tell the network which sequence it is producing (Jordan, 1986).
- Context units keep track of where the system is in the sequence (Jordan, 1986; Elman, 1988).
61 The Scaling Problem
- Moderately difficult problems require a few hundred thousand input examples.
- "One view grows from viewing learning and evolution as continuous with one another. On this view the fact that networks take a long time to learn is to be expected because we normally compare their behavior to organisms that have long evolutionary histories" (Rumelhart, p. 235).
- Compare innateness and universal grammar.
62 The Generalization Problem (wksht C)
- How do neural networks perform in inducing the best generalization from input data?
- Rumelhart chose the simplest, most robust network consistent with the observations made.
- But . . . .
63 An Example of Generalization (Rich and Knight)
               hair  scales  feathers  flies  water  eggs
    dog         1     0       0         0      0      0
    cat         1     0       0         0      0      0
    bat         1     0       0         1      0      0
    whale       1     0       0         0      1      0
    canary      0     0       1         1      0      1
    robin       0     0       1         1      0      1
    ostrich     0     0       1         1      0      1
    snake       0     1       0         0      0      1
    lizard      0     1       0         0      0      1
    alligator   0     1       0         0      1      1
64 A Common Generalization Effect in Neural Net Learning
- After a certain plateau, performance on the test set gets worse.
- Given large amounts of input, the network begins to memorize individual input-output pairs.
- It stores the entire training set, rather than generalizing over it. (Rich & Knight)
[Figure: performance vs. training time, plotted separately for the training set and the test set]
65 Applications of Connectionism
- Transforming text to speech (wksht D)
- Teaching sound discrimination to non-native speakers
- Language processing (past tense)
- Decision systems
- Teaching reading
66 Distinguishing /r/ from /l/
- The following four slides are from a talk entitled "Intervention Strategies that Promote Learning: Their Basis and Use in Enhancing Literacy" from the Center for the Neural Basis of Cognition.
67 Learning to identify speech sounds
- Key hypothesis: The brain reinforces whatever pattern of neural representation is elicited.
- Training that elicits undesired neural representations may be counterproductive, so ensure correct perception during training!
- Findings:
- Learning without feedback requires correct perception.
- Learning with feedback occurs effectively, whether or not correct perception is ensured.
68 Learning to distinguish /l/ and /r/: Behavioral experiment
- Train Japanese natives on normal vs. adaptively exaggerated /l/ and /r/.
- After only 3 training sessions, adaptive training yields much better performance!
[Figure: proportion of "lock" and "rock" tokens classified as /r/, for normal vs. adaptive training]
69 Learning the /l/-/r/ distinction: Neural network model
[Figure: network mapping acoustic input units to percept units]
- Nearby units model /l/ and /r/ acoustic inputs. English training on /l/ and /r/ learns 2 percepts. Japanese training maps both to a single percept. Later training on /l/ and /r/ reinforces this output, preventing learning of the English contrast.
- But training on exaggerated inputs learns the /l/-/r/ contrast successfully, and retains it even under later training on normal /l/ and /r/ input.
70 Distinguishing /l/ from /r/: Functional magnetic resonance imaging (fMRI) study
- Auditory brain areas in English speakers habituate to a stream of similar speech input ("load"), but dishabituate to oddballs that vary only by the sound /r/ ("road").
[Figure: schematic of the acoustic input: a long stream of "load" tokens with occasional "road" oddballs, and a 14-sec post-oddball time window. Graphs: % signal change from average in left and right auditory cortex over post-oddball time (1.6 to 14.4 sec), p < .0001.]
Areas in white show parts of auditory cortex that
respond transiently to oddballs. Graphs show the
time course of this response (arrows show peak)
relative to baseline (dashed line).
Future work will determine if the same auditory
areas in Japanese speakers respond to /l/ vs. /r/
oddballs after training, so as to test whether
training modifies perceptual representations
learned in childhood, versus downstream,
non-acoustic processes.
71 Language Processing: the past tense (Rumelhart & McClelland)
- The net is trained on both regular and irregular past tense forms.
- Training input: go, stand, look
- Training output: went, stood, looked
- no knowledge of verb stems (e.g., that looked decomposes into look and -ed)
- explicitly coded word boundary information
72 Language Processing: the past tense (Rumelhart & McClelland)
- Testing sample: 86 unseen low-frequency verbs (14 irregular, 72 regular)
- Performance:
- Irregulars: 78.6% error rate
- Regulars: 33.3% error rate
- for 6 regular verbs it produced no response (it cannot generalize to V+ed)
- strange errors: squat → squakt, mail → membled, tour → toureder, shape → shipt, brown → brawned
73 Issues in Evaluating Connectionism
- Implementational connectionism: PDP models have to be able to implement symbolic structures in order to enable them to manipulate mental representations with constituent structure.
- Eliminative connectionism: once PDP models are fully developed, they will replace symbol-processing models as explanations of cognitive processes.
74 Issues in Evaluating Connectionism (2)
- Neural networks are best suited to handle classification problems; they have not been tried extensively on planning, language modeling, etc.
- Still serious problems with their ability to handle phenomena that involve time.
- Inability of the network to capture generalization.
75 Issues in Evaluating Connectionism (3)
- Inability to deal with infinite sets that have no finite sample for inductive modeling.
- In such dealings, humans are guided by a knowledge of which similarities are important and which are spurious.
- With pairs like (5 × 6) + 2 = 32, (8 × 4) + 7 = 39, (105 × 72) + 3 = 7563, the net has a potentially infinite number of pairs and no knowledge of the structure of the arithmetic expression.
- Scaling: PDPs require a large number of examples for tasks a human does with one example (e.g., face recognition).
76 References
- Frisby, J.P. and J.L. Clatworthy. 1975. Learning to see complex random-dot stereograms. Perception, 4, 173-8.
- Interactive Tutorial: Building Blocks of the Nervous System. http://www.wwnorton.com/gleitman/ch2/tutorials/2tut2.htm
- Johnson-Laird, P. 1988. The Computer and the Mind. Harvard Univ. Press.
- Julesz, B. 1971. Foundations of Cyclopean Perception. Univ. of Chicago Press.
- Marr, D. and T. Poggio. 1976. Co-operative computation of stereo disparity. Science, 194, 283-7.
- Rich, E. and K. Knight. 1991. Artificial Intelligence. 2nd Edition. McGraw Hill.
- Rumelhart, D., G. Hinton, and R. Williams. 1986. Learning internal representations by error propagation. In Rumelhart, McClelland et al.
- Rumelhart, D., J. McClelland and the PDP Research Group. 1986. Parallel Distributed Processing: Explorations in the Microstructure of Cognition. MIT Press.
- http://neuromod.uva.nl/courses/connectionism1999/intro/sld029.htm
77 References (2)
- Rumelhart, D. and J. McClelland. 1986. Learning the past tense of English verbs. In Rumelhart, D., J. McClelland and the PDP Research Group, vol. 2.