Title: Backprop, 25 years later
Slide 1: Backprop, 25 years later
- Garrison W. Cottrell
- Gary's Unbelievable Research Unit (GURU)
- Computer Science and Engineering Department
- Institute for Neural Computation
- UCSD
Slide 2: But first
- Hal White passed away March 31st, 2012.
- Hal was our theoretician of neural nets, and one of the nicest guys I knew.
- His paper "A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity" has been cited 15,805 times, and led to him being shortlisted for the Nobel Prize.
- But his paper with Max Stinchcombe, "Multilayer feedforward networks are universal approximators," is his second most-cited paper, at 8,114 cites.
Slide 3: But first
- In yet another paper (in Neural Computation, 1989), he wrote:
- "The premise of this article is that learning procedures used to train artificial neural networks are inherently statistical techniques. It follows that statistical theory can provide considerable insight into the properties, advantages, and disadvantages of different network learning methods."
- This was one of the first papers to make the connection between neural networks and statistical models, thereby putting them on a sound statistical foundation.
Slide 4: We should also remember
- Dave E. Rumelhart passed away on March 13, 2011.
- Many had invented back propagation; few could appreciate as deeply as Dave did what they had when they discovered it.
Slide 5: What is backpropagation, and why is/was it important?
- We have billions and billions of neurons that somehow work together to create the mind.
- These neurons are connected by 10^14 - 10^15 synapses, which we think encode the knowledge in the network: too many for us to explicitly program them in our models.
- Rather, we need some way to set them indirectly, via a procedure that achieves some goal by changing the synaptic strengths (which we call weights).
- This is called learning in these systems.
Slide 6: Learning: a bit of history
- Frank Rosenblatt studied a simple version of a neural net called a perceptron:
- A single layer of processing
- Binary output
- Can compute simple things like (some) boolean functions (OR, AND, etc.)
Slides 7-8: Learning: a bit of history (figure slides)
Slide 9: Learning: a bit of history
- Rosenblatt (1962) discovered a learning rule for perceptrons called the perceptron convergence procedure.
- Guaranteed to learn anything computable (by a two-layer perceptron).
- Unfortunately, not everything was computable (Minsky & Papert, 1969).
Slide 10: Perceptron Learning Demonstration
- Output activation rule:
- First, compute the net input to the output unit: net = Σᵢ wᵢxᵢ
- Then, compute the output: if net ≥ θ, then output 1; else output 0
Slide 11: Perceptron Learning Demonstration
- Output activation rule:
- First, compute the net input to the output unit: net = Σᵢ wᵢxᵢ
- If net ≥ θ, then output 1; else output 0
- Learning rule:
- If output is 1 and should be 0, then lower the weights to active inputs and raise the threshold (θ)
- If output is 0 and should be 1, then raise the weights to active inputs and lower the threshold (θ)
- (An active input means xᵢ = 1, not 0.)
Slide 12: Characteristics of perceptron learning
- Supervised learning: give it a set of input-output examples for it to model the function (a teaching signal)
- Error-correction learning: only correct it when it is wrong
- Random presentation of patterns
- Slow! Learning on some patterns ruins learning on others.
Slide 13: Perceptron Learning Made Simple
- Output activation rule:
- First, compute the net input to the output unit: net = Σᵢ wᵢxᵢ
- If net ≥ θ, then output 1; else output 0
- Learning rule:
- If output is 1 and should be 0, then lower the weights to active inputs and raise the threshold (θ)
- If output is 0 and should be 1, then raise the weights to active inputs and lower the threshold (θ)
Slide 14: Perceptron Learning Made Simple
- Learning rule:
- If output is 1 and should be 0, then lower the weights to active inputs and raise the threshold (θ)
- If output is 0 and should be 1, then raise the weights to active inputs and lower the threshold (θ)
- As a formula: wᵢ(t+1) = wᵢ(t) + η(teacher - output)xᵢ
- (η is the learning rate)
- This is known as the delta rule, because learning is based on the delta (difference) between what you did and what you should have done: δ = (teacher - output). (A minimal code sketch follows.)
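To make the delta rule concrete, here is a minimal sketch in Python/numpy. This is my illustration, not code from the talk: the OR targets, learning rate, and cyclic pattern order are arbitrary choices, and the threshold θ is folded in as a bias weight on a constant -1 input.

```python
import numpy as np

def train_perceptron(X, targets, eta=0.1, epochs=50):
    """Delta rule: w_i(t+1) = w_i(t) + eta * (teacher - output) * x_i."""
    rng = np.random.default_rng(0)
    X = np.hstack([X, -np.ones((len(X), 1))])  # constant -1 input: its weight plays the role of theta
    w = rng.uniform(-0.5, 0.5, X.shape[1])
    for _ in range(epochs):
        for x, teacher in zip(X, targets):
            output = 1 if x @ w >= 0 else 0    # output 1 iff net >= theta
            w += eta * (teacher - output) * x  # only changes w when the output is wrong
    return w

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
print(train_perceptron(X, [0, 1, 1, 1]))       # learns OR (linearly separable)
```

Note that the weights change only on errors, exactly as the error-correction characterization on slide 12 says.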
Slide 15: Problems with perceptrons
- The learning rule comes with a great guarantee: anything a perceptron can compute, it can learn to compute.
- Problem: lots of things were not computable,
- e.g., XOR (Minsky & Papert, 1969)
- Minsky & Papert said:
- If you had hidden units, you could compute any boolean function.
- But no learning rule exists for such multilayer networks, and we don't think one will ever be discovered.
Slide 16: Problems with perceptrons (figure slide)
Slide 17: Aside about perceptrons
- They didn't have hidden units - but Rosenblatt assumed nonlinear preprocessing!
- Hidden units compute features of the input
- The nonlinear preprocessing is a way to choose features by hand
- Support Vector Machines essentially do this in a principled way, followed by a (highly sophisticated) perceptron learning algorithm.
Slide 18: Enter Rumelhart, Hinton, & Williams (1985)
- Discovered a learning rule for networks with hidden units.
- Works a lot like the perceptron algorithm:
- Randomly choose an input-output pattern
- Present the input, let activation propagate through the network
- Give the teaching signal
- Propagate the error back through the network (hence the name back propagation)
- Change the connection strengths according to the error
Slide 19: Enter Rumelhart, Hinton, & Williams (1985)
(figure: network diagram - activation flows upward from INPUTS through Hidden Units to OUTPUTS, while error propagates back down)
- The actual algorithm uses the chain rule of calculus to go downhill in an error measure with respect to the weights (a minimal sketch follows below)
- The hidden units must learn features that solve the problem
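A minimal numpy sketch of that procedure on XOR. This is my illustration, not the original code: two sigmoid hidden units, squared error, and a fixed learning rate are assumed, and some random seeds will land in a local minimum rather than the XOR solution, which echoes the point two slides ahead.

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)       # XOR targets
W1, b1 = rng.normal(0, 1, (2, 2)), np.zeros(2)        # input -> hidden
W2, b2 = rng.normal(0, 1, (2, 1)), np.zeros(1)        # hidden -> output
sigmoid = lambda z: 1 / (1 + np.exp(-z))

for _ in range(10000):
    h = sigmoid(X @ W1 + b1)               # forward pass: hidden features
    y = sigmoid(h @ W2 + b2)               # forward pass: output
    d_out = (y - T) * y * (1 - y)          # chain rule through the output sigmoid
    d_hid = (d_out @ W2.T) * h * (1 - h)   # error propagated back to the hidden layer
    W2 -= h.T @ d_out; b2 -= d_out.sum(axis=0)   # downhill step on squared error
    W1 -= X.T @ d_hid; b1 -= d_hid.sum(axis=0)

print(y.round(2).ravel())                  # should approach [0, 1, 1, 0]
```

Inspecting `h` after training shows what features the two hidden units settled on, which is exactly the question the next slide asks.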
Slide 20: XOR
(figure: Back Propagation Learning - a random initial network becomes an XOR network; the hidden units are labeled AND and OR)
- Here, the hidden units learned AND and OR - two features that, when combined appropriately, can solve the problem (XOR = OR AND NOT AND)
Slide 21: XOR
- But, depending on initial conditions, there are an infinite number of ways to do XOR - backprop can surprise you with innovative solutions.
Slide 22: Why is/was this wonderful?
- Efficiency
- Learns internal representations
- Learns internal representations
- Learns internal representations
- Generalizes to recurrent networks
Slide 23: Hinton's Family Trees example
- Idea: learn to represent the relationships between people that are encoded in a family tree
Slide 24: Hinton's Family Trees example
- Idea 2: learn distributed representations of concepts, given localist outputs
Slide 25: People hidden units (Hinton diagram)
- The corresponding people in the two trees are above/below one another: English above, Italian below
Slide 26: People hidden units (Hinton diagram)
- What is unit 1 encoding?
Slide 27: People hidden units (Hinton diagram)
- What is unit 2 encoding?
Slide 28: People hidden units (Hinton diagram)
- What is unit 6 encoding?
Slide 29: Relation units
- What does the upper right one code?
Slide 30: Lessons
- The network learns features in the service of the task - i.e., it learns features on its own.
- This is useful if we don't know what the features ought to be.
- Can explain some human phenomena
Slide 31: Another example
- In the next example(s), I make two points:
- The perceptron algorithm is still useful!
- Representations learned in the service of the task can explain the Visual Expertise Mystery
Slide 32: A Face Processing System
Slides 33-35: The Face Processing System
(figure: the processing pipeline - Pixel (Retina) Level -> Gabor Filtering -> Perceptual (V1) Level -> PCA -> Feature Level -> Neural Net -> Object (IT) Level -> Category Level, with outputs such as Bob, Carol, Ted, Cup, Can, Book)
Slide 36: The Gabor Filter Layer
- Basic feature: the 2-D Gabor wavelet filter (Daugman, 1985); a sketch follows below
- These model the processing in early visual areas
- Subsample in a 29x36 grid
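A rough sketch of such a filter, as my own illustration: the size, wavelength, and sigma below are arbitrary example values, not the model's actual filter bank. A Gabor filter is a sinusoidal grating windowed by a Gaussian, replicated over orientations (and, in practice, spatial scales).

```python
import numpy as np

def gabor(size=31, wavelength=8.0, theta=0.0, sigma=4.0):
    """A 2-D Gabor filter: a cosine grating at orientation theta under a Gaussian window."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)           # rotate coordinates to the filter's orientation
    envelope = np.exp(-(x**2 + y**2) / (2 * sigma**2))   # Gaussian window
    return envelope * np.cos(2 * np.pi * xr / wavelength)

# A small bank over 8 orientations; convolving an image with each filter and
# sampling the response magnitudes on a grid gives the V1-like feature vector.
bank = [gabor(theta=t) for t in np.linspace(0, np.pi, 8, endpoint=False)]
```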
Slide 37: Principal Components Analysis
- The Gabor filters give us 40,600 numbers
- We use PCA to reduce this to 50 numbers (sketched below)
- PCA is like Factor Analysis: it finds the underlying directions of maximum variance
- PCA can be computed in a neural network through a competitive Hebbian learning mechanism
- Hence this is also a biologically plausible processing step
- We suggest this leads to representations similar to those in Inferior Temporal cortex
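A sketch of this reduction via the singular value decomposition; the shapes are illustrative stand-ins (in the model, each row would be one image's 40,600 Gabor magnitudes).

```python
import numpy as np

def pca_reduce(X, k=50):
    """Project rows of X onto their top-k directions of maximum variance."""
    Xc = X - X.mean(axis=0)                       # center the data first
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                          # k-dimensional codes

gabor_vectors = np.random.rand(100, 40600)        # stand-in for 100 images' Gabor magnitudes
codes = pca_reduce(gabor_vectors)                 # shape (100, 50)
```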
Slides 38-44: How to do PCA with a neural network (Cottrell, Munro, & Zipser, 1987; Cottrell & Fleming, 1990; Cottrell & Metcalfe, 1990; O'Toole et al., 1991)
- A self-organizing network that learns whole-object representations
- (features, Principal Components, Holons, eigenfaces; one such Hebbian mechanism is sketched below)
(figure: Holons (Gestalt layer) receiving input from the Perceptual Layer)
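One concrete Hebbian mechanism that extracts principal components is Sanger's generalized Hebbian algorithm; the cited papers used related autoencoder-style networks, so treat this as a sketch of the idea rather than their exact procedure.

```python
import numpy as np

def sanger_pca(X, n_units=5, eta=0.01, epochs=100):
    """Each row of W converges toward a successive principal component of X."""
    rng = np.random.default_rng(0)
    W = rng.normal(0, 0.1, (n_units, X.shape[1]))   # one "holon" per row
    Xc = X - X.mean(axis=0)
    for _ in range(epochs):
        for x in Xc:
            y = W @ x                               # unit activations
            # Hebbian growth, minus a term that decorrelates each unit from
            # its predecessors; this competition orders the components.
            W += eta * (np.outer(y, x) - np.tril(np.outer(y, y)) @ W)
    return W
```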
Slide 45: Holons
- They act like face cells (Desimone, 1991):
- Response of single units is strong despite occluding eyes, e.g.
- Response drops off with rotation
- Some fire to my dog's face
- A novel representation: distributed templates --
- each unit's optimal stimulus is a ghostly looking face (template-like),
- but many units participate in the representation of a single face (distributed).
- For this audience: neither exemplars nor prototypes!
- Explains holistic processing:
- Why? If stimulated with a partial match, the firing represents votes for this template:
- units downstream don't know what caused this unit to fire.
- (more on this later)
Slide 46: The Final Layer: Classification (Cottrell & Fleming, 1990; Cottrell & Metcalfe, 1990; Padgett & Cottrell, 1996; Dailey & Cottrell, 1999; Dailey et al., 2002)
- The holistic representation is then used as input to a categorization network trained by supervised learning.
(figure: a Categories layer - outputs such as Cup, Can, Book, Greeble, Face, Bob, Carol, Ted, Happy, Sad, Afraid, etc. - fed by the Holons, which receive input from the Perceptual Layer)
- Excellent generalization performance demonstrates the sufficiency of the holistic representation for recognition
Slide 47: The Final Layer: Classification
- Categories can be at different levels: basic, subordinate.
- Simple learning rule (the delta rule). It says (mild lie here):
- add inputs to your weights (synaptic strengths) when you are supposed to be on,
- subtract them when you are supposed to be off.
- This makes your weights look like your favorite patterns - the ones that turn you on.
- With no hidden units, there is no back propagation of error.
- With hidden units, we get task-specific features (most interesting when we use the basic/subordinate distinction).
Slide 48: Facial Expression Database
- Ekman and Friesen quantified muscle movements ("Facial Actions") involved in prototypical portrayals of happiness, sadness, fear, anger, surprise, and disgust.
- Result: the Pictures of Facial Affect Database (1976).
- 70% agreement on emotional content by naive human subjects.
- 110 images, 14 subjects, 7 expressions.
(figure: Anger, Disgust, Neutral, Surprise, Happiness (twice), Fear, and Sadness - this is actor "JJ", the easiest for humans (and our model) to classify)
Slide 49: Results (Generalization)
- Kendall's tau (rank-order correlation) = .667, p = .0441
- Note: this is an emergent property of the model!
Slide 50: Correlation of Net/Human Errors
- Like all good Cognitive Scientists, we like our models to make the same mistakes people do!
- Networks and humans have a 6x6 confusion matrix for the stimulus set.
- This suggests looking at the off-diagonal terms: the errors.
- Correlation of off-diagonal terms: r = 0.567, F(1,28) = 13.3, p = 0.0011
- Again, this correlation is an emergent property of the model: it was not told which expressions were confusing.
Slide 51: Examining the Net's Representations
- We want to visualize receptive fields in the network.
- But the Gabor magnitude representation is noninvertible.
- We can learn an approximate inverse mapping, however.
- We used linear regression to find the best linear combination of Gabor magnitude principal components for each image pixel.
- Then projecting each unit's weight vector into image space with the same mapping visualizes its receptive field (sketched below).
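A sketch of that inverse mapping, with stand-in shapes and random data purely for illustration: regress each pixel on the principal-component code across training images, then push a unit's weight vector through the same linear map.

```python
import numpy as np

codes = np.random.rand(100, 50)          # stand-in: 50-D PCA codes of 100 images
pixels = np.random.rand(100, 64 * 64)    # stand-in: the same 100 images as pixel vectors

A = np.hstack([codes, np.ones((100, 1))])        # add an intercept column
M, *_ = np.linalg.lstsq(A, pixels, rcond=None)   # least-squares map: code -> pixels

w = np.random.rand(50)                   # stand-in: a unit's incoming weight vector
receptive_field = np.append(w, 1.0) @ M  # the unit's weights rendered as an image
```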
Slide 52: Examining the Net's Representations
- The y-intercept coefficient for each pixel is simply the average pixel value at that location over all faces, so subtracting the resulting "average face" shows more precisely what the units attend to.
- Apparently local features appear in the global templates.
Slide 53: Morph Transition Perception
- Morphs help psychologists study categorization behavior in humans
- Example: JJ fear-to-sadness morph
(figure: morph images at 0%, 10%, 30%, 50%, 70%, 90%, and 100%)
- Young et al. (1997) "Megamix": presented images from morphs of all 6 emotions (15 sequences) to subjects in random order; the task is a 6-way forced-choice button push.
Slide 54: Results: classical Categorical Perception - sharp boundaries
(figure: 6-way alternative forced choice; percent correct discrimination)
- ...and higher discrimination of pairs of images when they cross a perceived category boundary
Slide 55: Results: non-categorical RTs
(figure: button-push reaction times)
Slide 56: Results: more non-categorical effects
- Young et al. also had subjects rate the 1st, 2nd, and 3rd most apparent emotion.
- At the 70/30 morph level, subjects were above chance at detecting the mixed-in emotion. These data seem more consistent with continuous theories of emotion.
Slide 57: Modeling Megamix
- 1 trained neural network = 1 human subject.
- 50 networks; 7 random examples of each expression for training, the remainder for holdout.
- Identification = average of network outputs
- Response time = uncertainty of the maximal output: 1.0 - y_max
- Mixed-in expression detection: record the 1st, 2nd, and 3rd largest outputs.
- Discrimination = 1 - correlation of layer representations
- We can then find the layer that best accounts for the data. (These measures are sketched in code below.)
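The response measures are simple functions of the 6-way output vector; a sketch with invented numbers standing in for one network's outputs on one morph image.

```python
import numpy as np

y = np.array([0.05, 0.70, 0.12, 0.06, 0.04, 0.03])   # one network's 6 outputs (invented)

identification = y.argmax()       # in the paper, outputs are first averaged over networks
reaction_time = 1.0 - y.max()     # uncertainty of the maximal output
top3 = y.argsort()[::-1][:3]      # 1st, 2nd, 3rd most apparent expression

# Discrimination of two stimuli: 1 - correlation of a layer's representations
# of them (r_a, r_b are that layer's activation vectors; stand-ins here).
r_a, r_b = np.random.rand(50), np.random.rand(50)
discrimination = 1.0 - np.corrcoef(r_a, r_b)[0, 1]
```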
Slide 58: Modeling Six-Way Forced Choice
- Overall correlation r = .9416, with NO FIT PARAMETERS!
Slide 59: Model Discrimination Scores
(figure: percent correct discrimination - human data vs. model output layer (r = 0.36) and model object layer (r = 0.61))
- The model fits the data best at a precategorical layer: the layer we call the object layer, NOT at the category level
Slide 60: Discrimination
- Classically, one requirement for "categorical perception" is higher discrimination of two stimuli at a fixed distance apart when those two stimuli cross a category boundary
- Indeed, Young et al. found, in two kinds of tests, that discrimination was highest at category boundaries.
- The result that we fit the data best at a layer before any categorization occurs is significant: in some sense, the category boundaries are "in the data," or at least, in our representation of the data.
Slide 61: Outline
- An overview of our facial expression recognition system.
- The internal representation shows the model's "prototypical" representations of Fear, Sadness, etc.
- How our model accounts for the categorical data
- How our model accounts for the non-categorical data
- Discussion
- Conclusions for part 1
Slide 62: Reaction Time: Human/Model
(figure: model reaction time (1 - max output) vs. human subjects' reaction time)
- Correlation between model and human data = .6771, p < .001
Slide 63: Mix Detection in the Model
- Can the network account for the continuous data as well as the categorical data? YES.
Slide 64: Human/Model Circumplexes
- These are derived from similarities between images using non-metric multi-dimensional scaling (sketched below).
- For humans: similarity is correlation between 6-way forced-choice button pushes.
- For networks: similarity is correlation between 6-category output vectors.
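A sketch of how such a circumplex can be derived, using sklearn's non-metric MDS on a precomputed dissimilarity matrix; the response profiles below are invented stand-ins for the per-expression response vectors.

```python
import numpy as np
from sklearn.manifold import MDS

profiles = np.random.rand(6, 21)       # stand-in: one response profile per expression
dissim = 1.0 - np.corrcoef(profiles)   # similarity = correlation, so dissimilarity = 1 - r
mds = MDS(n_components=2, metric=False, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(dissim)     # 2-D layout; plotting it reveals the circumplex
```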
Slide 65: Outline
- An overview of our facial expression recognition system.
- How our model accounts for the categorical data
- How our model accounts for the two-dimensional data
- The internal representation shows the model's "prototypical" representations of Fear, Sadness, etc.
- Discussion
- Conclusions for part 1
Slide 66: Discussion
- Our model of facial expression recognition:
- Performs the same task people do
- On the same stimuli
- At about the same accuracy
- Without actually feeling anything, and without any access to the surrounding culture, it nevertheless:
- Organizes the faces in the same order around the circumplex
- Correlates very highly with human responses
- Has about the same rank order of difficulty in classifying the emotions
Slide 67: Discussion
- The discrimination correlates with human results most accurately at a precategorization layer: the discrimination improvement at category boundaries is in the representation of the data, not based on the categories.
- These results suggest that for expression recognition, the notion of "categorical perception" simply is not necessary to explain the data.
- Indeed, most of the data can be explained by the interaction between the similarity of the representations and the categories imposed on the data: fear faces are similar to surprise faces in our representation, so they are near each other in the circumplex.
Slide 68: Conclusions from this part of the talk
- The best models perform the same task people do
- Concepts such as "similarity" and "categorization" need to be understood in terms of models that do these tasks
- Our model simultaneously fits data supporting both categorical and continuous theories of emotion
- The fits, we believe, are due to the interaction of the way the categories slice up the space of facial expressions, and the way facial expressions inherently resemble one another.
- It also suggests that the continuous theories are correct: discrete categories are not required to explain the data.
- We believe our results will easily generalize to other visual tasks, and other modalities.
Slide 69: The Visual Expertise Mystery
- Why would a face area process BMWs?
- Behavioral & brain data
- Model of expertise
- Results
Slide 70: Are you a perceptual expert?
Take the expertise test!!!
Identify this object with the first name that comes to mind.
(Courtesy of Jim Tanaka, University of Victoria)
Slide 71: "Car" - not an expert. "2002 BMW 7 Series" - expert!
Slide 72: "Bird" or "Blue Bird" - not an expert. "Indigo Bunting" - expert!
Slide 73: "Face" or "Man" - not an expert. "George Dubya" - expert! "Jerk" or "Megalomaniac" - Democrat!
Slide 74: How is an object to be named?
Slide 75: Entry Point Recognition
(figure: visual analysis yields the entry point, "Bird"; semantic analysis yields "Animal"; fine-grained visual analysis yields "Indigo Bunting")
Slide 76: Dog and Bird Expert Study
- Each expert had at least 10 years of experience in their respective domain of expertise. None of the participants were experts in both dogs and birds. Participants provided their own controls.
- (Tanaka & Taylor, 1991)
Slide 77: Object Verification Task
(figure: verification at the superordinate, basic, and subordinate levels, with YES/NO responses)
Slide 78: Dog and bird experts recognize objects in their domain of expertise at subordinate levels.
(figure: mean reaction time (msec), roughly 600-900, for superordinate (Animal), basic (Bird/Dog), and subordinate (Robin/Beagle) verification)
Slide 79: Is face recognition a general form of perceptual expertise?
Slide 80: Face experts recognize faces at the individual level of unique identity
(figure: mean reaction time (msec), roughly 600-1200, for superordinate, basic, and subordinate verification, showing the "downward shift"; Tanaka, 2001)
Slide 81: Event-related Potentials and Expertise
(figure: ERPs for face experts and object experts, novice domain vs. expert domain)
(Tanaka & Curran, 2001; see also Gauthier, Curran, Curby, & Collins, 2003, Nature Neuroscience; Bentin, Allison, Puce, Perez, & McCarthy, 1996)
Slide 82: Neuroimaging of face, bird, and car experts
(figure: fusiform gyrus activation for Cars-Objects and Birds-Objects contrasts in car experts, bird experts, and face experts; Gauthier et al., 2000)
Slide 83: How to identify an expert?
- Behavioral benchmarks of expertise:
- Downward shift in entry point recognition
- Improved discrimination of novel exemplars from learned and related categories
- Neurological benchmarks of expertise:
- Enhancement of the N170 ERP brain component
- Increased activation of the fusiform gyrus
Slide 84: End of Tanaka Slides
Slide 85:
- Kanwisher showed the FFA is specialized for faces
- But she forgot to control for what???
Slide 86: Greeble Experts (Gauthier et al., 1999)
- Subjects trained over many hours to recognize individual Greebles.
- Activation of the FFA increased for Greebles as the training proceeded.
Slide 87: The visual expertise mystery
- If the so-called Fusiform Face Area (FFA) is specialized for face processing, then why would it also be used for cars, birds, dogs, or Greebles?
- Our view: the FFA is an area associated with a process: fine-level discrimination of homogeneous categories.
- But the question remains: why would an area that presumably starts as a face area get recruited for these other visual tasks? Surely, they don't share features, do they?
- Sugimoto & Cottrell (2001), Proceedings of the Cognitive Science Society
Slide 88: Solving the mystery with models
- Main idea:
- There are multiple visual areas that could compete to be the Greeble expert - basic level areas and the expert (FFA) area.
- The expert area must use features that distinguish similar-looking inputs -- that's what makes it an expert.
- Perhaps these features will be useful for other fine-level discrimination tasks.
- We will create:
- Basic level models - trained to identify an object's class
- Expert level models - trained to identify individual objects
- Then we will put them in a race to become Greeble experts.
- Then we can deconstruct the winner to see why it won.
- Sugimoto & Cottrell (2001), Proceedings of the Cognitive Science Society
Slide 89: Model Database
- A network that can differentiate faces, books, cups, and cans is a basic level network.
- A network that can also differentiate individuals within ONE class (faces, cups, cans, OR books) is an expert.
Slide 90: Model
- Pretrain two groups of neural networks (experts and non-experts) on different tasks.
- Compare their abilities to learn a new individual Greeble classification task. (A simplified sketch of this race follows.)
(figure: expert and non-expert networks sharing the same hidden-layer architecture)
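A much-simplified sketch of the race, as my own illustration: it freezes each pretrained net's hidden layer and trains only a fresh readout on Greeble identity, whereas the actual model continued training the whole network; all data here are random stand-ins.

```python
import numpy as np

def epochs_to_criterion(H, T, eta=0.5, criterion=0.95, max_epochs=1000):
    """Train a sigmoid readout on fixed hidden codes H (n x h) against
    one-hot targets T (n x c); return epochs until accuracy >= criterion."""
    rng = np.random.default_rng(0)
    W = rng.normal(0, 0.1, (H.shape[1], T.shape[1]))
    for epoch in range(1, max_epochs + 1):
        Y = 1 / (1 + np.exp(-(H @ W)))                       # readout outputs
        W -= eta * H.T @ ((Y - T) * Y * (1 - Y)) / len(H)    # gradient step
        if (Y.argmax(axis=1) == T.argmax(axis=1)).mean() >= criterion:
            return epoch
    return max_epochs

H_expert = np.random.rand(80, 40)   # stand-in: hidden codes from an expert-pretrained net
H_basic = np.random.rand(80, 40)    # stand-in: hidden codes from a basic-level net
T = np.eye(5)[np.random.randint(0, 5, 80)]   # 80 Greeble images, 5 individuals
print(epochs_to_criterion(H_expert, T), epochs_to_criterion(H_basic, T))
```

In the model, the expert-pretrained hidden codes support faster Greeble learning, which is the result on the next slide.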
Slide 91: Expertise begets expertise
(figure: amount of training required to be a Greeble expert vs. training time on the first task)
- Learning to individuate cups, cans, books, or faces first leads to faster learning of Greebles (can't try this with kids!!!).
- The more expertise, the faster the learning of the new task!
- Hence, in a competition with the object area, the FFA would win.
- If our parents were cans, the FCA (Fusiform Can Area) would win.
Slide 92: Entry Level Shift: Subordinate RT decreases with training (RT = uncertainty of response = 1.0 - max(output))
(figure: human and network RT over training sessions, subordinate vs. basic)
Slide 93: How do experts learn the task?
- Expert level networks must be sensitive to within-class variation:
- Representations must amplify small differences
- Basic level networks must ignore within-class variation:
- Representations should reduce differences
Slide 94: Observing hidden layer representations
- Principal Components Analysis on hidden unit activations:
- PCA of hidden unit activations allows us to reduce the dimensionality (to 2) and plot representations (sketched below).
- We can then observe how tightly clustered stimuli are in a low-dimensional subspace
- We expect basic level networks to separate classes, but not individuals.
- We expect expert networks to separate classes and individuals.
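A sketch of that visualization step, with random stand-ins for the hidden activations; in practice one would scatter-plot the two columns colored by class or individual.

```python
import numpy as np

acts = np.random.rand(200, 40)            # stand-in: hidden activations for 200 stimuli
centered = acts - acts.mean(axis=0)
_, _, Vt = np.linalg.svd(centered, full_matrices=False)
xy = centered @ Vt[:2].T                  # each stimulus as a point on the top 2 PCs
# Tight class clusters in xy suggest the network ignores within-class variation;
# spread within a class suggests it is amplifying individual differences.
```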
Slide 95: Subordinate level training magnifies small differences within object representations
(figure: face-expert vs. basic network representations at 1, 80, and 1280 epochs)
Slide 96: Greeble representations are spread out prior to Greeble training
(figure: face-expert vs. basic network representations of Greebles)
Slide 97: Variability Decreases Learning Time (r = -0.834)
(figure: Greeble learning time vs. Greeble variance prior to learning Greebles)
Slide 98: Examining the Net's Representations
- We want to visualize receptive fields in the network.
- But the Gabor magnitude representation is noninvertible.
- We can learn an approximate inverse mapping, however.
- We used linear regression to find the best linear combination of Gabor magnitude principal components for each image pixel.
- Then projecting each hidden unit's weight vector into image space with the same mapping visualizes its receptive field.
Slide 99: Two hidden unit receptive fields
(figure: hidden units 16 and 36, after training as a face expert and after further training on Greebles)
- NOTE: these are not face-specific!
Slide 100: Controlling for the number of classes
- We obtained 13 classes from hemera.com:
- 10 of these are learned at the basic level.
- 10 faces, each with 8 expressions, make up the expert task.
- 3 (lamps, ships, swords) are used for the novel expertise task.
Slide 101: Results: Pre-training
- The new initial tasks are of similar difficulty; in previous work, the basic level task was much easier.
- These are the learning curves for the 10 object classes and the 10 faces.
Slide 102: Results
- As before, experts still learned new expert level tasks faster.
(figure: number of epochs to learn swords after learning faces or objects, vs. number of training epochs on faces or objects)
Slide 103: Outline
- Why would a face area process BMWs?
- Why this model is wrong
Slide 104: Background
- I have a model of face and object processing that accounts for a lot of data
Slide 105: The Face and Object Processing System
Slide 106: Effects accounted for by this model
- Categorical effects in facial expression recognition (Dailey et al., 2002)
- Similarity effects in facial expression recognition (ibid.)
- Why fear is hard (ibid.)
- Rank of difficulty in configural, featural, and inverted face discrimination (Zhang & Cottrell, 2006)
- Holistic processing of identity and expression, and the lack of interaction between the two (Cottrell et al., 2000)
- How a specialized face processor may develop under realistic developmental constraints (Dailey & Cottrell, 1999)
- How the FFA could be recruited for other areas of expertise (Sugimoto & Cottrell, 2001; Joyce & Cottrell, 2004; Tran et al., 2004; Tong et al., 2005)
- The other-race advantage in visual search (Haque & Cottrell, 2005)
- Generalization gradients in expertise (Nguyen & Cottrell, 2005)
- Priming effects (face or character) on discrimination of ambiguous Chinese characters (McCleery et al., under review)
- Age of Acquisition effects in face recognition (Lake & Cottrell, 2005; Lake & Cottrell, under review)
- Memory for faces (Dailey et al., 1998, 1999)
- Cultural effects in facial expression recognition (Dailey et al., in preparation)
Slide 107: Backprop, 25 years later
- Backprop is important because it was the first relatively efficient method for learning internal representations
- Recent advances have made deeper networks possible
- This is important because we don't know how the brain uses transformations to recognize objects across a wide array of variations (e.g., the "Halle Berry" neuron)
Slide 108: E.g., the "Halle Berry" neuron (figure)
Slide 109: END