Title: Categorization and density estimation Tom Griffiths UC Berkeley
1. Categorization and density estimation (Tom Griffiths, UC Berkeley)
2. Categorization
[Images of cats and dogs, each labeled "cat" or "dog"]
3. Categorization
- cat = small, furry, domestic carnivore
4. Borderline cases
- Is a tomato a vegetable? Around 50% say yes (Hampton, 1979)
- Is an olive a fruit? Around 22% change their mind (McCloskey & Glucksberg, 1978)
6. Typicality
7. Typicality
9. Typical vs. atypical
[Image slides illustrating typical and atypical category members]
10. Typicality and generalization
- "Penguins can catch disease X, therefore all birds can catch disease X"
- "Robins can catch disease X, therefore all birds can catch disease X"
- Arguments from typical category members (robins) are judged stronger than arguments from atypical members (penguins) (Rips, 1975)
11. How can we explain typicality?
- One answer: reject definitions and adopt a new representation for categories, prototype theory
- categories are represented by a prototype
- other members share a family resemblance relation to the prototype
- typicality is a function of similarity to the prototype
12. Prototypes
[Dot-pattern stimuli generated by distorting a prototype] (Posner & Keele, 1968)
13. Posner and Keele (1968)
- Prototype effect in categorization accuracy
- Constructed categories by perturbing prototypical dot arrays
- Ordering of categorization accuracy at test: old exemplars > prototypes > new exemplars
14. Formalizing prototype theories
- Representation: each category (e.g., A, B) has a corresponding prototype (μ_A, μ_B)
- Categorization (for a new stimulus x): choose the category whose prototype minimizes the distance (or maximizes the similarity) to x
(e.g., Reed, 1972)
15. Formalizing prototype theories
- Prototype is the most frequent or typical member
- (Binary) features: prototype is, e.g., the binary vector with the most frequent feature values; distance is, e.g., Hamming distance
- Spaces: prototype is, e.g., the average of the members of the category; distance is, e.g., Euclidean distance (a minimal sketch follows)
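To make the spatial version concrete, here is a minimal sketch (not from the original slides) of a prototype classifier: the prototype of each category is the mean of its members, and a new stimulus is assigned to the category with the nearest prototype under Euclidean distance. The function names and example stimuli are illustrative.

```python
import numpy as np

def fit_prototypes(X, labels):
    """Prototype of each category = the average of its members (spatial representation)."""
    return {c: X[labels == c].mean(axis=0) for c in np.unique(labels)}

def categorize(x, prototypes):
    """Choose the category whose prototype is closest to x (Euclidean distance)."""
    return min(prototypes, key=lambda c: np.linalg.norm(x - prototypes[c]))

# Example: two categories of two-dimensional stimuli
X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.1, 0.9]])
labels = np.array(["A", "A", "B", "B"])
prototypes = fit_prototypes(X, labels)
print(categorize(np.array([0.1, 0.2]), prototypes))  # -> "A"
```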
16. Formalizing prototype theories
- Decision boundary lies at equal distance from the two prototypes (always a straight line for two categories)
[Figure: Category A and Category B stimuli, their prototypes (category means), and the resulting linear boundary]
17. Predicting prototype effects
- Prototype effects are built into the model
- assume categorization becomes easier as proximity to the prototype increases, or as distance from the boundary increases
- But what about the old exemplar advantage? (Posner & Keele, 1968)
- Prototype models are not the only way to get prototype effects
18. Exemplar theories
Store every member (exemplar) of the category
19. Formalizing exemplar theories
- Representation: a set of stored exemplars y_1, y_2, ..., y_n, each with its own category label
- Categorization (for a new stimulus x): choose category A with probability
  P(A|x) = β_A Σ_{y in A} η_xy / Σ_B β_B Σ_{y in B} η_xy
  where η_xy is the similarity of x to y and β_A is the bias towards A (the Luce-Shepard choice rule; a minimal sketch follows)
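Here is a minimal sketch of this choice rule, assuming an exponential similarity gradient η_xy = exp(-c·d(x, y)) and uniform category biases; the function names and example stimuli are illustrative rather than taken from the slides.

```python
import numpy as np

def similarity(x, y, c=1.0):
    """Exponential similarity gradient: eta_xy = exp(-c * distance(x, y))."""
    return np.exp(-c * np.linalg.norm(x - y))

def p_category(x, exemplars, labels, biases=None):
    """Luce-Shepard choice rule: P(A|x) is proportional to
    beta_A times the summed similarity of x to the stored exemplars of A."""
    categories = sorted(set(labels))
    biases = biases or {c: 1.0 for c in categories}
    summed = {c: biases[c] * sum(similarity(x, y) for y, l in zip(exemplars, labels) if l == c)
              for c in categories}
    total = sum(summed.values())
    return {c: s / total for c, s in summed.items()}

# Every training item is stored; a new stimulus is compared to all of them
exemplars = [np.array(v) for v in ([0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.1, 0.9])]
labels = ["A", "A", "B", "B"]
print(p_category(np.array([0.1, 0.2]), exemplars, labels))
```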
20. The context model (Medin & Schaffer, 1978)
Defined for stimuli with binary features (color, form, size, number):
  1111 (red, triangle, big, one)
  0000 (green, circle, small, two)
Similarity is defined as the product of the similarities on each dimension.
21. The generalized context model (Nosofsky, 1986)
Defined for stimuli in a psychological space.
22. The generalized context model
- Decision boundary is determined by the exemplars
[Figure: Category A and Category B exemplars with a nonlinear boundary; regions labeled with the probability of responding A, e.g., 90% A, 50% A, 10% A]
23. Prototypes vs. exemplars
- Exemplar models produce prototype effects
- if the prototype minimizes distance to all exemplars in a category, then it has high probability
- Also predicts the old exemplar advantage: being close (or identical) to an old exemplar of the category gives high probability
- Predicts new effects prototype models cannot produce: stimuli close to an old exemplar should have high probability, even far from the prototype
24-25. Prototypes vs. exemplars
- Exemplar models can capture complex boundaries (figure examples)
26. Some questions
- Both prototype and exemplar models seem reasonable, but are they rational? Are they solutions to the computational problem?
- Should we use prototypes, or exemplars?
- How can we define other models that handle more complex categorization problems?
27. A computational problem
- Categorization is a classic inductive problem
- data: stimulus x
- hypotheses: category c
- We can apply Bayes' rule: P(c|x) = p(x|c) P(c) / Σ_c' p(x|c') P(c')
- and choose c such that P(c|x) is maximized
28. Density estimation
- We need to estimate some probability distributions
- what is P(c)?
- what is p(x|c)?
- Two approaches: parametric and nonparametric
[Graphical model: c → x, with P(c) on c and p(x|c) on x]
29. Parametric density estimation
- Assume that p(x|c) has a simple form, characterized by parameters θ
- Given stimuli X = x_1, x_2, ..., x_n from category c, find θ by maximum-likelihood estimation, or by some form of Bayesian estimation
30. Binary features
- x = (x_1, x_2, ..., x_m) ∈ {0,1}^m
- Rather than estimating a distribution over 2^m possibilities, assume feature independence
[Graphical model: c → x, with P(c) on c and P(x|c) on x]
31. Binary features
- x = (x_1, x_2, ..., x_m) ∈ {0,1}^m
- Rather than estimating a distribution over 2^m possibilities, assume feature independence
- Called naive Bayes, because independence is a naive assumption! (A minimal sketch follows.)
[Graphical model: c → x_1, x_2, ..., x_m]
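A minimal naive Bayes sketch for binary features; the Laplace smoothing term alpha is an added assumption for numerical stability rather than something stated on the slide.

```python
import numpy as np

def fit_naive_bayes(X, y, alpha=1.0):
    """Estimate P(c) and P(x_d = 1 | c) for binary features, assuming the
    features are independent within each category (naive Bayes).
    alpha applies Laplace smoothing (an extra assumption, not from the slides)."""
    classes = np.unique(y)
    priors = {c: np.mean(y == c) for c in classes}
    theta = {c: (X[y == c].sum(axis=0) + alpha) / ((y == c).sum() + 2 * alpha)
             for c in classes}
    return priors, theta

def predict(x, priors, theta):
    """Choose c maximizing log P(c) + sum_d log P(x_d | c)."""
    def log_posterior(c):
        p = theta[c]
        return np.log(priors[c]) + np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))
    return max(priors, key=log_posterior)

X = np.array([[1, 1, 0, 1], [1, 0, 0, 1], [0, 0, 1, 0], [0, 1, 1, 0]])
y = np.array(["cat", "cat", "dog", "dog"])
priors, theta = fit_naive_bayes(X, y)
print(predict(np.array([1, 1, 0, 0]), priors, theta))  # -> "cat"
```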
32. Spatial representations
- Assume a simple parametric form for p(x|c): a Gaussian
- For each category, estimate parameters: mean and variance
[Graphical model: c → x, with P(c) on c and p(x|c) on x, parameterized by the category mean and variance]
33. The Gaussian distribution
Probability density: p(x) = (1 / (σ √(2π))) exp(−(x − μ)^2 / (2σ^2)), with mean μ and variance σ^2
[Figure: Gaussian density plotted against (x − μ)/σ]
34. Multivariate Gaussians
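As a sketch of the parametric approach for spatial stimuli, the following fits a multivariate Gaussian to each category and applies Bayes' rule; the small ridge added to each covariance matrix is an assumption made here for numerical stability, not part of the model described in the slides.

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gaussians(X, y):
    """Parametric density estimation: a mean and covariance for each category."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        cov = np.cov(Xc, rowvar=False) + 1e-6 * np.eye(X.shape[1])  # ridge for stability
        params[c] = (Xc.mean(axis=0), cov)
    return params

def posterior(x, params, priors):
    """P(c | x) is proportional to P(c) * N(x; mu_c, Sigma_c)."""
    scores = {c: priors[c] * multivariate_normal.pdf(x, mean=m, cov=S)
              for c, (m, S) in params.items()}
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, -0.1], [1.0, 1.0], [1.1, 0.9], [0.9, 1.2]])
y = np.array(["A", "A", "A", "B", "B", "B"])
params = fit_gaussians(X, y)
print(posterior(np.array([0.2, 0.3]), params, priors={"A": 0.5, "B": 0.5}))
```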
35. Bayesian inference
[Figure: probability as a function of x]
36. Nonparametric density estimation
- Rather than estimating a probability density from a parametric family, use a scheme that can give you any possible density
- Nonparametric can mean:
- not making distributional assumptions
- covering a broad class of functions
- a model with infinitely many parameters
37. Kernel density estimation
- Approximate a probability distribution as the sum of many kernels (one per data point):
  p̂(x) = (1/n) Σ_i k(x, x_i)
  for X = x_1, x_2, ..., x_n independently sampled from the true distribution, and a kernel function k(x, y) such that ∫ k(x, y) dx = 1
(A minimal sketch follows.)
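A minimal kernel density estimation sketch, assuming a Gaussian kernel of width h; the estimate averages one kernel per observed data point. Names and sample values are illustrative.

```python
import numpy as np

def gaussian_kernel(x, y, h):
    """A kernel k(x, y) that integrates to 1 over x: a Gaussian centered on y with width h."""
    return np.exp(-0.5 * ((x - y) / h) ** 2) / (h * np.sqrt(2 * np.pi))

def kde(x, data, h=0.25):
    """Kernel density estimate: the average of one kernel per data point."""
    return np.mean([gaussian_kernel(x, xi, h) for xi in data])

# Density estimate at a few points, from n = 5 samples
data = np.array([-1.2, -0.8, 0.1, 0.9, 1.3])
for x in (-1.0, 0.0, 2.0):
    print(x, kde(x, data, h=0.25))
```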
38-42. Kernel density estimation
[Figures: kernel density estimates with h = 0.25 as the number of samples grows (n = 1, 2, 5, 10, 100), each showing the estimated function, the individual kernels, and the true function]
43-45. Bayesian inference
[Figures: probability as a function of x for bandwidths h = 0.5, 0.1, and 2.0]
46. Advantages and disadvantages
- Which method should we choose?
- The methods are complementary
- parametric estimation requires less data, but is severely constrained in the distributions it can express
- nonparametric estimation requires a lot of data, but can model any function
- An instance of the bias-variance tradeoff
47. Categorization as density estimation
- Prototype and exemplar models can be interpreted as rational Bayesian inference
- They correspond to different forms of density estimation
- prototype: parametric density estimation
- exemplar: nonparametric density estimation
- This suggests other categorization models
48. Prototype models
- Prototype model: choose the category with the closest prototype
- Bayesian categorization: choose the category with the highest P(c|x)
- These are equivalent if P(x|c) decreases as a function of the distance of x from the prototype μ_c and the prior probability is equal for all categories
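Spelling this out (a sketch, assuming an isotropic Gaussian likelihood with shared variance σ² and equal priors):

$$ P(c \mid x) \propto p(x \mid c)\,P(c) \propto \exp\!\left(-\frac{\lVert x - \mu_c \rVert^2}{2\sigma^2}\right) \quad\Rightarrow\quad \arg\max_c P(c \mid x) = \arg\min_c \lVert x - \mu_c \rVert. $$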
49. Exemplar models
- Exemplar model: choose category A with probability P(A|x) = β_A Σ_{y in A} η_xy / Σ_B β_B Σ_{y in B} η_xy
- P(A|x) is the posterior probability of category A when P(x|A) is approximated using a kernel density estimator
50. Bayesian exemplars
- β_A is the prior probability of category A (divided by the number of exemplars in A)
- The summed similarity is proportional to p(x|A) under a kernel density estimator, for an appropriate kernel and similarity function
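The correspondence can be written out as follows (a sketch, assuming the kernel is proportional to the similarity function, k(x, y) ∝ η_xy):

$$ P(A \mid x) = \frac{P(A)\,\hat p(x \mid A)}{\sum_B P(B)\,\hat p(x \mid B)}, \qquad \hat p(x \mid A) = \frac{1}{n_A} \sum_{y \in A} k(x, y), $$

so taking β_A ∝ P(A) / n_A recovers the Luce-Shepard choice rule used by the exemplar model.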
51. The generalized context model
[Figure: probability as a function of x, showing the similarity gradient around each stored exemplar]
52. Implications
- Prototype and exemplar models can be interpreted as rational Bayesian inference
- They have different strengths and limitations
- exemplar models are rational categorization models for any category density (given large n)
- This suggests other categorization models: alternatives between prototypes and exemplars
53. Prototypes vs. exemplars
- Prototype model: parametric density estimation; P(x|A) is specified by one Gaussian
- Exemplar model: nonparametric density estimation; P(x|A) is a sum of n_A Gaussians
- Compromise: semiparametric density estimation; a sum of more than one and fewer than n_A Gaussians, i.e., a mixture of Gaussians (Rosseel, 2003)
54. The Rational Model of Categorization (RMC; Anderson, 1990, 1991)
- Computational problem: predicting a feature based on observed data, assuming that category labels are just features
- Predictions are made on the assumption that objects form clusters with similar properties
- each object belongs to a single cluster
- feature values are likely to be the same within clusters
- the number of clusters is unbounded
55. Representation in the RMC
- Flexible representation can interpolate between prototype and exemplar models
[Figures: probability as a function of feature value]
56. The optimal solution
- The probability of the missing feature (i.e., the category label) taking a certain value is
  P(j | F_n) = Σ_{x_n} P(j | x_n, F_n) P(x_n | F_n)
  where j is a feature value, F_n are the observed features of a set of n objects, and x_n is a partition of the objects into clusters
57. The posterior over partitions
P(x_n | F_n) ∝ P(F_n | x_n) P(x_n): the likelihood of the observed features under a partition, combined with a prior over partitions
58. The prior over partitions
- An object is assumed to have a constant probability of joining the same cluster as another object, known as the coupling probability c
- This allows some probability that a stimulus forms a new cluster, so the probability that the ith object is assigned to the kth cluster is
  c m_k / ((1 − c) + c (i − 1)) for an existing cluster with m_k members, and
  (1 − c) / ((1 − c) + c (i − 1)) for a new cluster
59. Nonparametric Bayes
- Bayesian: treat density estimation as a problem of Bayesian inference, defining a prior over the number of components
- Nonparametric: use a prior that places no limit on the number of components
- Dirichlet process mixture models (DPMMs) use a prior of this kind
60. The prior over components
- Each sample is assumed to come from a single (possibly previously unseen) mixture component
- The ith sample is drawn from the kth component with probability
  m_k / (i − 1 + α) for an existing component with m_k samples, and α / (i − 1 + α) for a new component,
  where α is a parameter of the model (a minimal sketch follows)
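A minimal sketch of sampling from this sequential prior (the Chinese-restaurant-process form used for DPMMs); the function names are illustrative.

```python
import numpy as np

def assignment_probs(counts, alpha):
    """Probability that the next object joins each existing component or a new one:
    m_k / (i - 1 + alpha) for an existing component with m_k members,
    alpha / (i - 1 + alpha) for a new component (i - 1 = objects seen so far)."""
    seen = sum(counts)
    return np.array([m / (seen + alpha) for m in counts] + [alpha / (seen + alpha)])

def sample_partition(n, alpha, rng=None):
    """Sample a partition of n objects sequentially from the prior."""
    rng = rng or np.random.default_rng(0)
    assignments, counts = [], []
    for _ in range(n):
        probs = assignment_probs(counts, alpha)
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):          # a previously unseen component
            counts.append(0)
        counts[k] += 1
        assignments.append(k)
    return assignments

print(sample_partition(10, alpha=1.0))
```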
61. Equivalence
- Neal (1998) showed that the prior for the RMC and the prior for the DPMM are the same, with α = (1 − c) / c
- RMC prior: c m_k / ((1 − c) + c (i − 1)) for an existing cluster, (1 − c) / ((1 − c) + c (i − 1)) for a new cluster
- DPMM prior: m_k / (i − 1 + α) for an existing component, α / (i − 1 + α) for a new component
62. The computational challenge
- The probability of the missing feature (i.e., the category label) taking a certain value is
  P(j | F_n) = Σ_{x_n} P(j | x_n, F_n) P(x_n | F_n)
  where j is a feature value, F_n are the observed features of a set of n objects, and x_n is a partition of the objects into groups
- The number of partitions grows very rapidly with n:

  n                      1   2   3    4    5    6     7     8      9       10
  number of partitions   1   2   5   15   52   203   877   4140   21147   115975
63. Anderson's approximation
- Data are observed sequentially
- Each object is deterministically assigned to the cluster with the highest posterior probability
- Call this the Local MAP: choosing the cluster with the maximum a posteriori probability (a minimal sketch follows)
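A minimal sketch of the Local MAP procedure, assuming binary features with a symmetric Beta(beta, beta) prior on the within-cluster feature probabilities (an illustrative likelihood; the slide does not fix one). Names and data are illustrative.

```python
import numpy as np

def local_map_partition(X, alpha=1.0, beta=1.0):
    """Anderson-style Local MAP: process objects in order and deterministically
    assign each one to the cluster with the highest posterior probability."""
    clusters, assignments = [], []           # each cluster: {"n": size, "ones": feature counts}
    for x in X:
        seen = len(assignments)
        scores = []
        for cl in clusters:                  # existing clusters: prior * predictive likelihood
            prior = cl["n"] / (seen + alpha)
            p_one = (cl["ones"] + beta) / (cl["n"] + 2 * beta)
            likelihood = np.prod(np.where(x == 1, p_one, 1 - p_one))
            scores.append(prior * likelihood)
        # a new cluster: prior alpha / (seen + alpha); predictive is 0.5 per binary feature
        scores.append((alpha / (seen + alpha)) * 0.5 ** len(x))
        k = int(np.argmax(scores))           # deterministic MAP assignment
        if k == len(clusters):
            clusters.append({"n": 0, "ones": np.zeros(len(x))})
        clusters[k]["n"] += 1
        clusters[k]["ones"] += x
        assignments.append(k)
    return assignments

X = np.array([[1, 1, 0, 0], [1, 1, 0, 1], [0, 0, 1, 1], [0, 0, 1, 0]])
print(local_map_partition(X))  # e.g. [0, 0, 1, 1]
```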
64. Two uses of Monte Carlo methods
- For solving problems of probabilistic inference involved in developing computational models
- As a source of hypotheses about how the mind might solve problems of probabilistic inference
65. Alternative approximation schemes
- There are several methods for approximating the posterior in DPMMs: Gibbs sampling and particle filtering
- These methods provide asymptotic performance guarantees (in contrast to Anderson's procedure)
(Sanborn, Griffiths, & Navarro, 2006)
66. Gibbs sampling for the DPMM
- All the data are required at once (a batch procedure)
- Starting from an initial partition, each stimulus is sequentially assigned to a cluster based on the assignments of all of the remaining stimuli
- Assignments are made probabilistically, using the full conditional distribution (a minimal sketch follows)
[Figure: an example starting partition and the conditional probabilities of each reassignment]
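A minimal sketch of collapsed Gibbs sampling for the DPMM, again assuming binary features with a symmetric Beta(beta, beta) prior; this is an illustrative implementation, not the code behind the original analyses.

```python
import numpy as np

def gibbs_dpmm(X, alpha=1.0, beta=1.0, sweeps=50, rng=None):
    """Each sweep removes every object in turn and reassigns it probabilistically,
    using its full conditional distribution given all other assignments."""
    rng = rng or np.random.default_rng(0)
    n, d = X.shape
    z = np.zeros(n, dtype=int)                # starting partition: everything together
    for _ in range(sweeps):
        for i in range(n):
            z[i] = -1                         # remove object i from its cluster
            labels = [k for k in np.unique(z) if k >= 0]
            probs = []
            for k in labels:
                members = X[z == k]
                m = len(members)
                p_one = (members.sum(axis=0) + beta) / (m + 2 * beta)
                likelihood = np.prod(np.where(X[i] == 1, p_one, 1 - p_one))
                probs.append(m / (n - 1 + alpha) * likelihood)
            probs.append(alpha / (n - 1 + alpha) * 0.5 ** d)   # a brand-new cluster
            probs = np.array(probs) / np.sum(probs)
            choice = rng.choice(len(probs), p=probs)
            z[i] = labels[choice] if choice < len(labels) else max(labels, default=-1) + 1
    return z

X = np.array([[1, 1, 0, 0], [1, 1, 0, 1], [0, 0, 1, 1], [0, 0, 1, 0]])
print(gibbs_dpmm(X))
```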
67. Particle filter for the DPMM
- Data are observed sequentially
- The posterior distribution at each point is approximated by a set of particles
- Particles are updated, and a fixed number of particles is carried over from trial to trial (a minimal sketch follows)
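A minimal particle-filter sketch under the same illustrative Beta-Bernoulli likelihood; each particle is a partition of the data seen so far, and multinomial resampling at every trial is a simplification chosen for brevity.

```python
import numpy as np

def particle_filter_dpmm(X, n_particles=100, alpha=1.0, beta=1.0, rng=None):
    """Process the data sequentially, keeping a fixed number of particles
    (candidate partitions) whose weights reflect how well they predict each new item."""
    rng = rng or np.random.default_rng(0)
    n, d = X.shape
    particles = [[] for _ in range(n_particles)]
    for i in range(n):
        weights = np.zeros(n_particles)
        for p_idx, z in enumerate(particles):
            labels = sorted(set(z))
            probs = []
            for k in labels:                  # prior * predictive likelihood per cluster
                members = X[:i][np.array(z) == k]
                m = len(members)
                p_one = (members.sum(axis=0) + beta) / (m + 2 * beta)
                likelihood = np.prod(np.where(X[i] == 1, p_one, 1 - p_one))
                probs.append(m / (i + alpha) * likelihood)
            probs.append(alpha / (i + alpha) * 0.5 ** d)      # a new cluster
            probs = np.array(probs)
            weights[p_idx] = probs.sum()      # particle weight: P(x_i | this partition)
            choice = rng.choice(len(probs), p=probs / probs.sum())
            z.append(labels[choice] if choice < len(labels) else len(labels))
        # carry over a fixed number of particles, resampled in proportion to their weights
        keep = rng.choice(n_particles, size=n_particles, p=weights / weights.sum())
        particles = [list(particles[j]) for j in keep]
    return particles

X = np.array([[1, 1, 0, 0], [1, 1, 0, 1], [0, 0, 1, 1], [0, 0, 1, 0]])
print(particle_filter_dpmm(X, n_particles=50)[0])   # one sampled partition
```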
68. Approximating the posterior
- For a single order, the Local MAP will produce a single partition
- The Gibbs sampler and the particle filter will approximate the exact DPMM distribution
69. Order effects in human data
- The probabilistic model underlying the DPMM does not produce any order effects; this follows from exchangeability
- But human data show order effects (e.g., Medin & Bettger, 1994)
- Anderson and Matessa tested Local MAP predictions about order effects in an unsupervised clustering experiment (Anderson, 1990)
70. Anderson and Matessa's experiment
- Subjects were shown all sixteen stimuli that had four binary features
- Front-anchored orders emphasized the first two features in the first eight trials; end-anchored orders emphasized the last two
- Front-anchored order: scadsporm, scadstirm, sneksporb, snekstirb, sneksporm, snekstirm, scadsporb, scadstirb
- End-anchored order: snadstirb, snekstirb, scadsporm, sceksporm, sneksporm, snadsporm, scedstirb, scadstirb
71. Anderson and Matessa's experiment
[Figure: proportion of partitions divided along a front-anchored feature, for the front-anchored and end-anchored presentation orders]
72. A rational process model
- A rational model clarifies a problem and serves as a benchmark for performance
- Using a psychologically plausible approximation can change a rational model into a rational process model
- Research in machine learning and statistics has produced useful approximations to statistical models which can be tested as general-purpose psychological heuristics
73. Summary
- Traditional models of the cognitive processes involved in categorization can be reinterpreted as rational models (via density estimation)
- Prototypes vs. exemplars is a question about schemes for density estimation (and representations)
- Nonparametric Bayes lets us explore options between these extremes (as well as some new models); all of these models are instances of the hierarchical Dirichlet process (Griffiths, Canini, Sanborn, & Navarro, 2007)
- Monte Carlo methods provide hypotheses about how people address the computational challenges