Title: A neglected problem in the computational theory of mind: Object Tracking and the Mind-World gap
1. A neglected problem in the computational theory of mind: Object Tracking and the Mind-World gap
- Zenon Pylyshyn
- Rutgers Center for Cognitive Science
2. Before I begin I would like you to see a video game that will figure in the last part of my talk
- The demonstration shows a task called Multiple Object Tracking
- Track the initially distinct (flashing) items through the trial (here 10 secs) and indicate at the end which items are the targets
- After each example I'd like you to ask yourself, "How do I do it?"
- If you are like most of our subjects you will have no idea, or a false idea
3. Keep track of the objects that flash
4. How do we do it? What properties of individual objects do we use?
5. Going behind occluding surfaces does not disrupt tracking
Scholl, B. J., & Pylyshyn, Z. W. (1999). Tracking multiple items through occlusion: Clues to visual objecthood. Cognitive Psychology, 38(2), 259-290.
6. Not all well-defined features can be tracked
Track endpoints of these lines
Endpoints move exactly as the squares did!
7. (No Transcript)
8. The basic problem of cognitive science
- What determines our behavior is not how the world is, but how we represent it as being
- As Chomsky pointed out in his review of Skinner, if we describe behavior in relation to the objective properties of the world, we would have to conclude that behavior is essentially stimulus-independent
- Every naturally occurring behavioral regularity is cognitively penetrable
- Any information that changes beliefs can systematically and rationally change behavior
9. Representation and Mind: Why representations are essential
- Do representations only come into play in higher-level mental activities, such as reasoning?
- Even at early stages of perception many of the states that must be postulated are representations (i.e., what they are about plays a role in explanations).
10. Examples from vision (1): Intrapercept constraints
Epstein, W. (1982). Percept-percept couplings. Perception, 11, 75-83.
11. Examples from vision (2): The Poggendorff illusion depends on perceived contours; they need not be physical edges
12. The rules of color mixing apply to perceived color
- Red light and yellow light mix to produce orange light
- This law holds regardless of how the red light and yellow light are produced
- The yellow may be light of 580 nanometer wavelength, or it may be a mixture of light of 530 nm and 650 nm wavelengths.
- So long as one light looks yellow and the other looks red, the law will hold: the mixture will look orange.
13. Another example of a classical representation
14. Other forms of representation
- Lines FG, BC are parallel and equal.
- Lines EH, AD are parallel and equal.
- Lines FB, GC are parallel and equal.
- Lines EA, HD are parallel and equal.
- Vertices EF, HG, DC and AB are joined....
- Part-Of: Cube, Top-Face(EFGH), Bottom-Face(ABCD), Front-Face(FGCB), Back-Face(EHDA)
- Part-Of: Top-Face(Front-Edge(FG), Back-Edge(EH), Left-Edge(EF), Right-Edge(HG),
15. What's wrong with this picture?
- What's wrong is that the CTM is incomplete: it does not address a number of fundamental questions
- It fails to specify how representations connect with what they represent. It's not enough to use English words in the representation (that's been a common confusion in AI) or to draw pictures (a common confusion in theories of mental imagery)
- English labels and pictures may help the theorist recall which objects are being referred to
- But what makes it the case that a particular mental symbol refers to one thing rather than another?
- How are concepts grounded? (Symbol Grounding Problem)
16. Another way to look at what the Computational Theory of Mind lacks
- The missing function in the CTM is a mechanism that allows perception to refer to individual things in the visual field directly and nonconceptually
- Not as "whatever has properties P1, P2, P3, ...", but as a singular term that refers directly to an individual and does not appeal to a representation of the individual's properties.
- Such a reference is like a proper name or a pointer in a computer data structure, or like a demonstrative term (like "this" or "that") in natural language.
- Note that in a computer a pointer does not refer via a location, despite what the term "pointer" suggests
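The contrast between picking something out by description and picking it out by direct reference can be sketched in a few lines of Python. This is only an analogy under stated assumptions: the `Dot` class, the two-dot scene and the helper `pick_by_description` are hypothetical names invented for illustration, not part of the talk.

```python
from dataclasses import dataclass


@dataclass
class Dot:
    """Hypothetical visual element with two encodable properties."""
    color: str
    shape: str


# Two distinct individuals that share every encoded property.
a = Dot("black", "circle")
b = Dot("black", "circle")
scene = [a, b]


def pick_by_description(items, color, shape):
    """Descriptive selection: returns whatever satisfies the description."""
    return [d for d in items if d.color == color and d.shape == shape]


# The description cannot single out one individual: it fits both.
matches = pick_by_description(scene, "black", "circle")

# Direct reference: `this` is bound to one individual, using no description.
this = a

# The individual's properties change, yet the reference is unaffected.
a.color = "red"
still_same = this is a
refers_to_other = this is b
```

Here the description matches both dots, while the bare reference continues to pick out the same individual even after its color changes.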
17. An example from personal history: Why we need to pick out individual things without referring to their properties
- We wanted to develop a computer system that would reason about geometry by actually drawing a diagram and noticing adventitious properties of the diagram from which it would conjecture lemmas to prove
- We wanted the system to be as psychologically realistic as possible, so we assumed that it had a narrow field of view and noticed only limited, spatially restricted information as it examined the drawing
- This immediately raised the problem of coordinating noticings and led us to the idea of visual indexes to keep track of previously encoded parts of the diagram.
18. Begin by drawing a line.
L1
19. Now draw a second line.
L2
20. And draw a third line.
L3
21. Notice what you have so far (noticings are local: you encode what you attend to)
L1
V6
L2
There is an intersection of two lines. But which of the two lines you drew are they? There is no way to indicate which individual things are seen again without a way to refer to individual (token) things
22. Look around some more to see what is there.
L5
L2
V12
Here is another intersection of two lines. Is it the same intersection as the one seen earlier? Without a special way to keep track of individuals, the only way to tell would be to encode unique properties of each of the lines. Which properties should you encode?
23. In examining a geometrical figure one only gets to see a sequence of local glimpses
24. The incremental construction of visual representations requires solving a correspondence problem over time
- We have to determine whether a particular individual element seen at time t is identical to another individual element seen at a previous time t − Δt. This is one manifestation of the correspondence problem.
- Solving the correspondence problem is equivalent to picking out and tracking the identity of token individuals as they change their appearance, their location or the way they are encoded or conceptualized
- To do that we need the capacity to refer to token individuals (I will call them objects) without doing so by appealing to their properties. This requires a special form of demonstrative reference I call a Visual Index.
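A toy illustration of why property matching cannot settle the correspondence problem, and how an index settles it by identity. The `Element` class, the `encode_glimpse` helper and the appearance labels are hypothetical, chosen only to make the two failure cases concrete.

```python
class Element:
    """A token individual in the diagram (hypothetical stand-in)."""

    def __init__(self, kind):
        self.kind = kind


def encode_glimpse(element, appearance):
    """What a local glimpse records: a type and a momentary appearance."""
    return (element.kind, appearance)


line_1 = Element("line")
line_2 = Element("line")

# Case 1: two different lines happen to look alike in two glimpses,
# so comparing encodings wrongly suggests "same individual".
looks_same = encode_glimpse(line_1, "oblique") == encode_glimpse(line_2, "oblique")

# Case 2: one line looks different in a later glimpse,
# so comparing encodings wrongly suggests "different individual".
looks_different = (
    encode_glimpse(line_1, "oblique") != encode_glimpse(line_1, "foreshortened")
)

# An index bound at the first glimpse answers both questions by identity,
# without consulting any encoded property.
index = line_1
case_1_same_individual = index is line_2
case_2_same_individual = index is line_1
```

In both cases the property encodings give the wrong answer about identity, while the index gives the right one.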
25. A note about the use of labels in this example
- There are two purposes for figure labels. One is to specify what type of individual it is (line, vertex, ...). The other is to specify which individual it is, so that it is individuated and thus can be selected or bound to the argument of a predicate.
- The second of these is what I am concerned with, because indicating which individual it is is essential in vision.
- Many people (e.g., Marr, Yantis) have suggested that individuals may be marked by tags, but that won't do, since one cannot literally place a tag on an object, and even if we could it would not obviate the need to individuate and index, just as labels don't help.
- Labeling things in the world is not enough, because to refer to the line labeled L1 you would have to be able to think "this is line L1", and you could not think that unless you had a way of first picking out the referent of "this".
26.
- The difference between a direct (demonstrative) and a descriptive way of picking something out has produced many "You are here" cartoons.
- It is also illustrated in this recent New Yorker cartoon
27. The difference between descriptive and demonstrative ways of picking something out (illustrated in this New Yorker cartoon by Sipress)
28. Picking out
- Picking out entails individuating, in the sense of separating something from a background (what Gestalt psychologists called a figure-ground distinction)
- This sort of picking out has been studied in psychology under the heading of focal or selective attention.
- Focal attention appears to pick out and adhere to objects rather than places
- In addition to a unitary focal attention there is also evidence for a mechanism of multiple references (about 4 or 5), that I have called a visual index or a FINST
- Indexes are different from focal attention in many ways that we have studied in our laboratory (I will mention a few later)
- A visual index is like a pointer in a computer data structure: it allows access but does not itself tell you anything about what is being pointed to
29. The requirements for picking out and keeping track of several individual things reminded me of an early comic book character called Plastic Man
30. Imagine being able to place several of your fingers on things in the world without recognizing their properties while doing so. You could then refer to those things (e.g., "what finger 2 is touching") and could move your attention to them. You would then be said to possess FINgers of INSTantiation (FINSTs)
31. FINST Theory postulates a limited number of pointers in early vision that are elicited by certain events in the visual field and that enable vision to refer to those things without doing so under a concept or a description
32. FINSTs and Object Files form the link between the world and its conceptualization
The only nonconceptual contents in this picture are FINST indexes!
Object File contents are conceptual!
33. Summarizing FINSTs
- A FINST is a primitive reference mechanism that normally references individual visible objects in the world. There are a small number (4-5) of FINSTs available at any one time.
- Objects are picked out and referred to without using any encoding of their properties, including their location.
- Picking out objects is prior to encoding any properties!
- Indexing is nonconceptual because it does not represent an individual as a member of some conceptual category.
- An important function of FINST indexes is to bind arguments of visual predicates to things in the world to which they refer. Only predicates with bound arguments can be evaluated. Since predicates are quintessential concepts, an index serves as a bridge from nonconceptual to conceptual representations.
- Similarly they can bind arguments of motor commands, including the command to move focal attention or gaze to the indexed object, e.g., MoveGaze(x)
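The summary above can be read as a specification for a small data structure: a fixed pool of indexes, each of which holds a bare reference to an object and nothing else, plus the rule that a predicate can be evaluated only once all of its arguments are bound. The following is a minimal sketch under those assumptions. `FinstPool`, `Thing`, `above` and the capacity of 4 are illustrative names and values, not an implementation from the talk.

```python
class FinstPool:
    """Toy pool of indexes; each index holds a reference, not a description."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self._bound = {}  # index id -> referenced object

    def grab(self, obj):
        """Bind a free index to obj; return its id, or None if none is free."""
        if len(self._bound) >= self.capacity:
            return None
        idx = next(i for i in range(self.capacity) if i not in self._bound)
        self._bound[idx] = obj
        return idx

    def release(self, idx):
        """Free an index so that it can be assigned to another object."""
        self._bound.pop(idx, None)

    def evaluate(self, predicate, *indexes):
        """Evaluate a predicate only if every argument is bound to an object."""
        if any(i not in self._bound for i in indexes):
            raise LookupError("predicate has an unbound argument")
        return predicate(*(self._bound[i] for i in indexes))


class Thing:
    """Hypothetical visible object; its properties are read only on demand."""

    def __init__(self, x, y):
        self.x, self.y = x, y


def above(a, b):
    """Two-place visual predicate, evaluated over already-selected objects."""
    return a.y > b.y


pool = FinstPool(capacity=4)
things = [Thing(i, 10 - i) for i in range(6)]
handles = [pool.grab(t) for t in things]  # only the first four get an index
result = pool.evaluate(above, handles[0], handles[1])
```

Note that the pool never stores a location or any other property: `above` consults the objects' properties only after the indexes have delivered the objects themselves. Calling `pool.evaluate(above, handles[0], handles[4])` raises `LookupError`, since the fifth object never received an index.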
34. A note on terminology
- A FINST provides a reference to an individual visible thing
- I sometimes call this referent a FING by analogy with FINST, and sometimes an object to conform with usage in psychology, but FINGs are nonconceptual so they do not pick out something as an object, because OBJECT is a concept. Maybe proto-object?
- I have also called it a pointer, but that erroneously suggests that it points to the location of an object, as opposed to the object itself. In a computer, a pointer is the name of a stored datum.
- I have said that a FINST is a visual demonstrative like "this" or "that", but that too is misleading because the reference of a demonstrative depends on the intentions of the speaker
- I have also noted that a FINST is like a proper name, but that won't do since a name can pick out something not in sensory contact, whereas a FINST can only refer to a visible item (or one that is briefly out of sight).
35. A quick tour of some evidence for FINSTs
- The correspondence problem
- The binding problem
- Evaluating multi-place visual predicates (recognizing multi-element patterns)
- Operating over several visual elements at once without having to search for them first
- Subitizing
- Subset search
- Multiple-Object Tracking
- Cognizing space without requiring a spatial display in the head
36. A quick tour of some evidence for FINSTs
- The correspondence problem (mentioned earlier)
- The binding problem
- Evaluating multi-place visual predicates (recognizing multi-element patterns)
- Operating over several visual elements at once without having to search for them first
- Subitizing
- Subset selection
- Multiple-Object Tracking
- Cognizing space without requiring a spatial display in the head
37. Individual objects and the binding problem
- We can distinguish scenes that differ by conjunctions of properties, so early vision must somehow keep track of how properties co-occur: conjunctions must not be obscured. This is called the binding problem
- The most common proposal is that vision keeps track of properties according to their location and binds together co-located properties.
38. The proposal of binding conjunctions by the location of conjuncts does not work when feature location is not punctate, and becomes even more problematic if they are co-located, e.g., if their relation is "inside"
39. Pandemonium
An early architecture, Pandemonium, was proposed by Oliver Selfridge in 1959. This idea continues to be at the heart of many psychological models, including ones implemented in contemporary connectionist or neural net models.
40. Binding as object-based
- The proposal that properties are conjoined by virtue of their common location has many problems
- In order to assign a location to a property you need to know its boundaries, which requires distinguishing the object that has those properties from its background (figure-ground individuation)
- Properties are properties of objects, not of locations, which is why properties move when objects move. Empty locations have no causal properties.
- The alternative to conjoining-by-location is conjoining by object. According to this view, solving the binding problem requires first selecting individual objects and then keeping track of each object's properties (in its object file)
- If only properties of selected objects are encoded and if those properties are recorded in object files specific to each object, then all conjoined properties will be recorded in the same object file, thus solving the binding problem
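A small sketch of why recording properties per object preserves conjunctions that a pooled, unbound record loses. The two scenes and the helper names `feature_maps` and `object_files` are hypothetical illustrations, not the talk's own notation.

```python
# Two scenes that contain the same features in different conjunctions.
scene_a = [
    {"color": "red", "shape": "square"},
    {"color": "green", "shape": "circle"},
]
scene_b = [
    {"color": "red", "shape": "circle"},
    {"color": "green", "shape": "square"},
]


def feature_maps(scene):
    """Unbound encoding: which colors and which shapes occur, pooled per feature."""
    return {
        "color": {obj["color"] for obj in scene},
        "shape": {obj["shape"] for obj in scene},
    }


def object_files(scene):
    """Object-based encoding: one record per selected object, properties together."""
    return {frozenset(obj.items()) for obj in scene}


# Pooled features cannot tell the scenes apart; object files can.
same_features = feature_maps(scene_a) == feature_maps(scene_b)
same_object_files = object_files(scene_a) == object_files(scene_b)
```

The pooled encoding treats the two scenes as identical, whereas the per-object records differ, which is the sense in which keeping properties in the same object file solves the binding problem.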
41. Attention spreads over perceived objects
Spreads to B and not C
Spreads to C and not B
Spreads to B and not C
Spreads to C and not B
Using a priming method, Egly, Driver & Rafal (1994) showed that the effect of a prime spreads more to other parts of the same visual object than to equally distant parts of different objects.
42. A quick tour of some evidence for FINSTs
- The correspondence problem (mentioned earlier)
- The binding problem
- Evaluating multi-place visual predicates (recognizing multi-element patterns)
- Operating over several visual elements at once without having to search for them first
- Subitizing
- Subset selection
- Multiple-Object Tracking
- Cognizing space without requiring a spatial display in the head
43. Being able to pick out and refer to individual distal elements is essential for encoding patterns
- Encoding relational predicates, e.g., Collinear(x,y,z,..), Inside(x, C), Above(x,y), Square(w,x,y,z), requires simultaneously binding the arguments of n-place predicates to n elements in the visual scene
- Evaluating such visual predicates requires individuating and referring to the objects over which the predicate is evaluated, i.e., the arguments in the predicate must be bound to individual elements in the scene.
44. Several objects must be picked out at once in making relational judgments
When we judge that certain objects are collinear, we must first pick out the relevant objects while ignoring their properties
45. Several objects must be picked out at once in making relational judgments
- The same is true for other relational judgments like "inside" or "on-the-same-contour", etc. We must pick out the relevant individual objects first.
Are dots Inside-same contour? On-same contour?
46. A quick tour of some evidence for FINSTs
- The correspondence problem
- The binding problem
- Evaluating multi-place visual predicates (recognizing multi-element patterns)
- Operating over several visual elements at once without first having to search for them
- Subitizing
- Subset selection
- Multiple-Object Tracking
- Cognizing space without requiring a spatial display in the head
47. More functions of FINSTs: Further experimental explorations using different paradigms
- Recognizing the cardinality of small sets of things: Subitizing vs counting (Trick, 1994)
- Searching through subsets: selecting items to search through (Burkell, 1997)
- Selecting subsets and maintaining the selection during a saccade (Currie, 2002)
- Application of FINST index theory to infant cardinality studies (Carey, Spelke, Leslie, Uller, etc.)
- Indexes explain how children are able to acquire words for objects by ostension without suffering Quine's Gavagai problem.
48. Signature subitizing phenomena only appear when objects are automatically individuated and indexed
Counting slope
Subitizing slope
Trick, L. M., & Pylyshyn, Z. W. (1994). Why are small and large numbers enumerated differently? A limited capacity preattentive stage in vision. Psychological Review, 101(1), 80-102.
49. Subitizing results
- There is evidence that a different mechanism is involved in enumerating small (n < 4) and large (n > 4) numbers of items (even different brain mechanisms: Dehaene & Cohen, 1994)
- Rapid small-number enumeration (subitizing) only occurs when items are first (automatically) individuated
- Subitizing is not affected by precuing location, while counting is
- Subitizing is insensitive to distance among items
- Our explanation for what is special about subitizing is that once FINST indexes are assigned to n < 4 individual objects, the objects can be enumerated without first searching for them. In fact they might be enumerated simply by counting active indexes, which is fast and accurate because it does not require visual scanning
Trick, L. M., & Pylyshyn, Z. W. (1994). Why are small and large numbers enumerated differently? A limited capacity preattentive stage in vision. Psychological Review, 101(1), 80-102.
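The two-slope pattern (a shallow subitizing slope followed by a steep counting slope) follows directly from the explanation above if already-indexed items are cheap to enumerate and the remaining items require serial scanning. The sketch below is a toy model only: the capacity of 4, the baseline and the per-item costs are illustrative assumptions, not values reported in the talk or fitted to data.

```python
CAPACITY = 4        # assumed number of available indexes
BASE_MS = 400       # assumed fixed cost per trial (illustrative)
SUBITIZE_MS = 50    # assumed cost per item when counting active indexes
COUNT_MS = 300      # assumed cost per item that must be found by scanning


def enumeration_time(n):
    """Toy response time for enumerating n items under the index account."""
    indexed = min(n, CAPACITY)       # items already held by an index
    scanned = max(0, n - CAPACITY)   # items that must be searched for serially
    return BASE_MS + SUBITIZE_MS * indexed + COUNT_MS * scanned


times = {n: enumeration_time(n) for n in range(1, 9)}

# Increase in response time for each additional item, from 1->2 up to 7->8.
slopes = [times[n + 1] - times[n] for n in range(1, 8)]
```

With these assumed parameters the per-item increment is small up to the capacity limit and jumps once the number of items exceeds it, which is the qualitative shape of the subitizing and counting slopes.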
50. Subset selection for search
Burkell, J., & Pylyshyn, Z. W. (1997). Searching through subsets: A test of the visual indexing hypothesis. Spatial Vision, 11(2), 225-258.
51. Subset search results
- Only properties of the subset matter, but note that properties of the entire subset are taken into account simultaneously (since that is what distinguishes a feature search from a conjunction search)
- If the subset is a single-feature search set, it is fast and the slope (RT vs number of items) is shallow
- If the subset is a conjunction search set, it takes longer and is more sensitive to the set size
- As with subitizing, the distance between targets does not matter, so observers don't seem to be scanning the display looking for the target
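Whether a search counts as a feature search or a conjunction search depends on the set of items being searched, which is why selecting a subset can change the kind of search. A hedged sketch of that classification follows; the item dictionaries and the function `search_type` are hypothetical and are not the stimuli or analysis of Burkell & Pylyshyn (1997).

```python
def search_type(items, target):
    """Classify the search for `target` among `items`.

    Returns "feature" if some single property value of the target is shared
    by no other item in the set, otherwise "conjunction".
    """
    others = [item for item in items if item is not target]
    for key, value in target.items():
        if all(other[key] != value for other in others):
            return "feature"
    return "conjunction"


target = {"color": "red", "orientation": "vertical"}
red_horizontal = {"color": "red", "orientation": "horizontal"}
green_vertical = {"color": "green", "orientation": "vertical"}
green_horizontal = {"color": "green", "orientation": "horizontal"}

display = [target, red_horizontal, green_vertical, green_horizontal]

# Over the whole display the target shares each feature with some distractor.
whole_display = search_type(display, target)

# If only this subset is indexed, color alone distinguishes the target.
subset = [target, green_vertical, green_horizontal]
within_subset = search_type(subset, target)
```

The same target in the same display is a conjunction target over all items but a single-feature target within the selected subset.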
52. The stability of the visual world entails the capacity to reidentify individuals after a saccade
- There is no problem about how tactile selection can provide a stable world when you move around while keeping your fingers on the same objects, because in that case retaining individual identity is automatic
- But with FINSTs the same can be true with vision for a small number of visual objects
- This is compatible with the fact that it appears one retains the relative location of only about 4 elements during saccadic eye movements (Irwin, 1996)
Irwin, D. E. (1996). Integrating information across saccadic eye movements. Current Directions in Psychological Science, 5(3), 94-100.
53. The selective search experiment with a saccade induced between the late onset cues and start of search
Even with a saccade between selection and access, items can be accessed efficiently
54. A quick tour of some evidence for FINSTs
- The correspondence problem (mentioned earlier)
- The binding problem
- Evaluating multi-place visual predicates (recognizing multi-element patterns)
- Operating over several visual elements at once without having to search for them first
- Subitizing
- Subset selection
- Multiple-Object Tracking
- Cognizing space without requiring a spatial display in the head
55. Demonstrating the function of FINSTs with Multiple Object Tracking (MOT)
- In a typical experiment, 8 simple identical objects are presented on a screen and 4 of them are briefly distinguished in some visual manner, usually by flashing them on and off.
- After these 4 targets are briefly identified, all objects resume their identical appearance and move randomly. The observer's task is to keep track of the ones that had been designated as targets at the start
- After a period of 5-10 seconds the motion stops and observers must indicate, using a mouse, which objects are the targets
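The structure of a trial (cue, random motion, response) can be sketched as a short simulation. This is an idealized illustration, not a model of human performance: the tracker below never loses a target, and the function name `run_mot_trial`, the step size, the number of steps and the seed are all assumptions. Its only point is that the "indexes" hold references to the moving objects themselves rather than stored coordinates.

```python
import random


def run_mot_trial(n_objects=8, n_targets=4, steps=200, seed=0):
    """Idealized MOT trial; returns (target ids, ids selected at the end)."""
    rng = random.Random(seed)

    # Each object is a mutable [x, y] list, so it has an identity that is
    # separate from its current coordinates.
    objects = [[rng.random(), rng.random()] for _ in range(n_objects)]
    targets = set(rng.sample(range(n_objects), n_targets))

    # Cue phase: each index grabs a target object, not a location.
    indexes = [objects[i] for i in sorted(targets)]

    # Motion phase: every object takes small random steps inside the unit square.
    for _ in range(steps):
        for obj in objects:
            obj[0] = min(1.0, max(0.0, obj[0] + rng.uniform(-0.02, 0.02)))
            obj[1] = min(1.0, max(0.0, obj[1] + rng.uniform(-0.02, 0.02)))

    # Response phase: select whatever the indexes still refer to.
    selected = {
        i for i, obj in enumerate(objects)
        if any(obj is held for held in indexes)
    }
    return targets, selected


targets, selected = run_mot_trial()
```

Because selection at the end goes through the held references, no record of positions is ever consulted. A more realistic model would need some account of how indexes get lost, which this sketch deliberately leaves out.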
56. Another example of MOT: with self-occlusion
57. Self-occlusion does not seriously impair tracking
58. Some findings of Multiple Object Tracking
- Basic finding: Most people can track at least 4 targets that move randomly among identical non-target objects (even 5-year-old children can track 3 objects)
- Object properties do not appear to be recorded during tracking, and tracking is not improved if all objects are visually distinct (no two objects have the same color, shape or size)
- How is it done?
- We showed that it is unlikely that the tracking is done by keeping a record of the targets' locations and updating them by serially visiting the objects (Pylyshyn & Storm, 1988)
- Other strategies may be employed (e.g., tracking a single deforming pattern), but they do not explain tracking
- Hypothesis: FINST indexes get assigned to targets. At the end of the trial these pointers can be used to move attention to the targets and hence to select them
59. What role do visual properties play in MOT?
- Certain properties may have to be present in order for an object to be indexed, and certain properties (probably different properties) may be required in order for the index to keep track of the object, but this does not mean that such properties are encoded, stored, or used in tracking.
- Compare this with Kripke's distinction between properties that fix the referent of a proper name and the property that the name refers to. The former only plays a role at the name's initial baptism.
- Is there something special about location? Do we record and track properties-at-locations?
- Location in time and space may be essential for individuating objects, but locations need not be encoded or made cognitively available
- The fact that an object is actually at some location or other does not mean that it is represented as such. Representing property P (where P happens to be at location L) ≠ Representing property P-is-at-L.
60. A way of viewing what goes on in MOT
- According to Kahneman & Treisman's Object File theory, the appearance of a new visual object causes a new Object File to be created. Each object file is associated with its respective object, presumably through a FINST Index.
- The object file may contain information about the object to which it is attached. But according to FINST Theory, keeping track of the object's identity does not require the use of this information. The evidence suggests that in MOT, little or nothing is stored in the object file except maybe in special cases (e.g., when the object suddenly changes or disappears).
- What makes something the same object over time is that it remains connected to the same object file (by the same FINST). Thus, for vision to treat something as the same enduring individual does not require appeal to properties or concepts.
61. Why is this relevant to foundational questions in the philosophy of mind?
- According to Quine, Strawson, and most philosophers, you cannot pick out or track individuals without concepts (sortals)
- But you also cannot pick out individuals with only concepts
- Sooner or later you have to pick out individuals using non-conceptual causal connections between thoughts and things
- The present proposal is that FINSTs provide the needed non-conceptual mechanism for individuating objects and for tracking their identity, which works most of the time in our kind of world. It relies on a natural constraint (Marr)
- FINST indexes provide the right sort of connection for predicating properties of the world by allowing the arguments of predicates to be bound to objects prior to the predicates being evaluated. They may thus be the basis for early vocabulary learning.
62. But there must be some properties that cause indexes to be grabbed!
- Of course there are properties that are causally responsible for indexes being grabbed, and also properties (probably different ones) that make it possible for objects to be tracked
- But these properties need not be represented (encoded) and used in tracking
- The distinction between object properties that cause indexes to be assigned and those that are represented (in Object Files) is similar to Kripke's distinction between properties that are needed to pick out and name an object and those that constitute its meaning
63. Effect of target properties on MOT
- Changes of target properties are not reported nor even noticed during MOT
- Keeping all targets at different color, size, or shape does not improve tracking
- Observers do not use target speed or direction in tracking (e.g., by anticipating where the targets will be when they reappear after occlusion)
64. Some open questions
- We have arrived at the view that only properties of selected (indexed) objects enter into subsequent conceptualization and perception-based thought (i.e., only information in object files is made available to cognition)
- So what happens to the rest of the visual information?
- Visual information seems rich and fine-grained, while this theory only allows for the properties of 4 or 5 objects to be encoded!
- The present view leaves no room for nonconceptual representations whose content corresponds to the content of conscious experience
- According to the present view, the only content that nonconceptual representations have is the demonstrative content of indexes that refer to perceptual objects
- Question: Why do we need any more than that?
65. An intriguing possibility
- Maybe the theoretically relevant information we take in is less than (or at least different from) what we experience
- This possibility has received attention recently with the discovery of various "blindnesses" (e.g., change blindness, inattentional blindness, blindsight) as well as the discovery of independent vision systems (e.g., recognition and motor control)
- The qualitative content of conscious experience may not play a role in explanations of cognitive processes
- Even if unconceptualized information enters into causal processes (e.g., motor control), it may not be represented or made available to the cognitive mind, not even as a nonconceptual representation
- For something to be a representation its content must figure in explanations: it must capture generalizations. It must have truth conditions and therefore allow for misrepresentation. It is an empirical question whether current proposals do (e.g., primal sketch, scenarios). (cf. Devitt: "Pylyshyn's Razor")
66. Vision science has always been deeply ambivalent about the role of conscious experience
- Isn't "how things appear" one of the things that our theories must explain? Answer: There is no a priori "must explain"!
- The content of subjective experience is a major type of evidence. But it may turn out not to be the most reliable source for inferring the relevant functional states. It competes with other types of evidence.
- How things appear cannot be taken at face value: it carries substantive theoretical assumptions. It also draws on many levels of processing.
- It was a serious obstacle to early theories of vision (Kepler)
- It has been a poor guide in the case of theories of mental imagery (e.g., color mixing, image size, image distances). Reading X off an image is an illusion.
- It seems likely that vision science will use evidence of conscious experience the way linguistics uses evidence of grammatical intuitions: only as it is filtered through developing theories.
- The questions a science is expected to answer cannot be set in advance: they change as the science develops.
67. What next?
- This picture leaves many unanswered questions, but it does provide a mechanism for solving the binding problem and also explaining how mental representations could have a nonconceptual connection with objects in the world (something required if mental representations are to connect with actions)
68. Schema for how FINSTs function in hockey
69.
- For a copy of these slides see http://ruccs.rutgers.edu/faculty/pylyshyn/SelectionReference.ppt
- Or MIT Press Paperback
70. Index capacity and training
- Daphne Bavelier's lab (Rochester) has shown that videogame players can track a larger number of objects in MOT
- Jose Rivest (York) has shown that some athletes can track more targets than non-athletes
- Within individuals the main determiner of the number of targets that can be tracked is the spacing between them
71. You are now here
X
But you are also here
72. (No Transcript)
73. Additional examples of MOT
- MOT with occlusion
- MOT with virtual occluders
- MOT with matched nonoccluding disappearance
- Track endpoints of lines
- Track rubber-band linked boxes
- Track and remember ID by location
- Track and remember ID by name (number)
- Track while everything briefly disappears (½ sec) and goes on moving while invisible
- Track while everything briefly disappears and reappears where they were when they disappeared