Title: Categorization and density estimation Tom Griffiths UC Berkeley
1. Categorization and density estimation (Tom Griffiths, UC Berkeley)
2. Categorization
[Images of cats and dogs, each labeled "cat" or "dog"]
3. Categorization
- cat = small, furry, domestic carnivore
4. Borderline cases
- Is a tomato a vegetable? Around 50% say yes (Hampton, 1979)
- Is an olive a fruit? Around 22% change their mind (McCloskey & Glucksberg, 1978)
6. Typicality
7. Typicality
9. Typical vs. atypical
[Image slides illustrating typical and atypical category members]
10. Typicality and generalization
- "Penguins can catch disease X, therefore all birds can catch disease X"
- "Robins can catch disease X, therefore all birds can catch disease X"
- Arguments from typical category members (robins) are judged stronger than arguments from atypical members (penguins) (Rips, 1975)
11. How can we explain typicality?
- One answer: reject definitions and adopt a new representation for categories, prototype theory
- categories are represented by a prototype
- other members share a family resemblance relation to the prototype
- typicality is a function of similarity to the prototype
12. Prototypes
[Dot-pattern stimuli generated by distorting a prototype] (Posner & Keele, 1968)
13. Posner and Keele (1968)
- Prototype effect in categorization accuracy
- Constructed categories by perturbing prototypical dot arrays
- Ordering of categorization accuracy at test: old exemplars > prototypes > new exemplars
14. Formalizing prototype theories
- Representation: each category (e.g., A, B) has a corresponding prototype (μ_A, μ_B)
- Categorization (for a new stimulus x): choose the category whose prototype minimizes the distance (or maximizes the similarity) to x
(e.g., Reed, 1972)
15. Formalizing prototype theories
- Prototype is the most frequent or typical member
- (Binary) features: prototype is, e.g., the binary vector with the most frequent feature values; distance is, e.g., Hamming distance
- Spaces: prototype is, e.g., the average of the members of the category; distance is, e.g., Euclidean distance (a minimal sketch follows)
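To make the spatial version concrete, here is a minimal sketch (not from the original slides) of a prototype classifier: the prototype of each category is the mean of its members, and a new stimulus is assigned to the category with the nearest prototype under Euclidean distance. The function names and example stimuli are illustrative.

```python
import numpy as np

def fit_prototypes(X, labels):
    """Prototype of each category = the average of its members (spatial representation)."""
    return {c: X[labels == c].mean(axis=0) for c in np.unique(labels)}

def categorize(x, prototypes):
    """Choose the category whose prototype is closest to x (Euclidean distance)."""
    return min(prototypes, key=lambda c: np.linalg.norm(x - prototypes[c]))

# Example: two categories of two-dimensional stimuli
X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.1, 0.9]])
labels = np.array(["A", "A", "B", "B"])
prototypes = fit_prototypes(X, labels)
print(categorize(np.array([0.1, 0.2]), prototypes))  # -> "A"
```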
16. Formalizing prototype theories
- Decision boundary lies at equal distance from the two prototypes (always a straight line for two categories)
[Figure: Category A and Category B stimuli, their prototypes (category means), and the resulting linear boundary]
17. Predicting prototype effects
- Prototype effects are built into the model
- assume categorization becomes easier as proximity to the prototype increases, or as distance from the boundary increases
- But what about the old exemplar advantage? (Posner & Keele, 1968)
- Prototype models are not the only way to get prototype effects
18. Exemplar theories
Store every member (exemplar) of the category
19. Formalizing exemplar theories
- Representation: a set of stored exemplars y_1, y_2, ..., y_n, each with its own category label
- Categorization (for a new stimulus x): choose category A with probability
  P(A|x) = β_A Σ_{y in A} η_xy / Σ_B β_B Σ_{y in B} η_xy
  where η_xy is the similarity of x to y and β_A is the bias towards A (the Luce-Shepard choice rule; a minimal sketch follows)
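Here is a minimal sketch of this choice rule, assuming an exponential similarity gradient η_xy = exp(-c·d(x, y)) and uniform category biases; the function names and example stimuli are illustrative rather than taken from the slides.

```python
import numpy as np

def similarity(x, y, c=1.0):
    """Exponential similarity gradient: eta_xy = exp(-c * distance(x, y))."""
    return np.exp(-c * np.linalg.norm(x - y))

def p_category(x, exemplars, labels, biases=None):
    """Luce-Shepard choice rule: P(A|x) is proportional to
    beta_A times the summed similarity of x to the stored exemplars of A."""
    categories = sorted(set(labels))
    biases = biases or {c: 1.0 for c in categories}
    summed = {c: biases[c] * sum(similarity(x, y) for y, l in zip(exemplars, labels) if l == c)
              for c in categories}
    total = sum(summed.values())
    return {c: s / total for c, s in summed.items()}

# Every training item is stored; a new stimulus is compared to all of them
exemplars = [np.array(v) for v in ([0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.1, 0.9])]
labels = ["A", "A", "B", "B"]
print(p_category(np.array([0.1, 0.2]), exemplars, labels))
```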
20. The context model (Medin & Schaffer, 1978)
Defined for stimuli with binary features (color, form, size, number):
  1111 (red, triangle, big, one)
  0000 (green, circle, small, two)
Similarity is defined as the product of the similarities on each dimension.
21. The generalized context model (Nosofsky, 1986)
Defined for stimuli in a psychological space.
22. The generalized context model
- Decision boundary is determined by the exemplars
[Figure: Category A and Category B exemplars with a nonlinear boundary; regions labeled with the probability of responding A, e.g., 90% A, 50% A, 10% A]
23. Prototypes vs. exemplars
- Exemplar models produce prototype effects
- if the prototype minimizes distance to all exemplars in a category, then it has high probability
- Also predicts the old exemplar advantage: being close (or identical) to an old exemplar of the category gives high probability
- Predicts new effects prototype models cannot produce: stimuli close to an old exemplar should have high probability, even far from the prototype
24-25. Prototypes vs. exemplars
- Exemplar models can capture complex boundaries (figure examples)
26. Some questions
- Both prototype and exemplar models seem reasonable, but are they rational? Are they solutions to the computational problem?
- Should we use prototypes, or exemplars?
- How can we define other models that handle more complex categorization problems?
27. A computational problem
- Categorization is a classic inductive problem
- data: stimulus x
- hypotheses: category c
- We can apply Bayes' rule: P(c|x) = p(x|c) P(c) / Σ_c' p(x|c') P(c')
- and choose c such that P(c|x) is maximized
28. Density estimation
- We need to estimate some probability distributions
- what is P(c)?
- what is p(x|c)?
- Two approaches: parametric and nonparametric
[Graphical model: c → x, with P(c) on c and p(x|c) on x]
29. Parametric density estimation
- Assume that p(x|c) has a simple form, characterized by parameters θ
- Given stimuli X = x_1, x_2, ..., x_n from category c, find θ by maximum-likelihood estimation, or by some form of Bayesian estimation
30. Binary features
- x = (x_1, x_2, ..., x_m) ∈ {0,1}^m
- Rather than estimating a distribution over 2^m possibilities, assume feature independence
[Graphical model: c → x, with P(c) on c and P(x|c) on x]
31. Binary features
- x = (x_1, x_2, ..., x_m) ∈ {0,1}^m
- Rather than estimating a distribution over 2^m possibilities, assume feature independence
- Called naive Bayes, because independence is a naive assumption! (A minimal sketch follows.)
[Graphical model: c → x_1, x_2, ..., x_m]
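A minimal naive Bayes sketch for binary features; the Laplace smoothing term alpha is an added assumption for numerical stability rather than something stated on the slide.

```python
import numpy as np

def fit_naive_bayes(X, y, alpha=1.0):
    """Estimate P(c) and P(x_d = 1 | c) for binary features, assuming the
    features are independent within each category (naive Bayes).
    alpha applies Laplace smoothing (an extra assumption, not from the slides)."""
    classes = np.unique(y)
    priors = {c: np.mean(y == c) for c in classes}
    theta = {c: (X[y == c].sum(axis=0) + alpha) / ((y == c).sum() + 2 * alpha)
             for c in classes}
    return priors, theta

def predict(x, priors, theta):
    """Choose c maximizing log P(c) + sum_d log P(x_d | c)."""
    def log_posterior(c):
        p = theta[c]
        return np.log(priors[c]) + np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))
    return max(priors, key=log_posterior)

X = np.array([[1, 1, 0, 1], [1, 0, 0, 1], [0, 0, 1, 0], [0, 1, 1, 0]])
y = np.array(["cat", "cat", "dog", "dog"])
priors, theta = fit_naive_bayes(X, y)
print(predict(np.array([1, 1, 0, 0]), priors, theta))  # -> "cat"
```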
32. Spatial representations
- Assume a simple parametric form for p(x|c): a Gaussian
- For each category, estimate parameters: mean and variance
[Graphical model: c → x, with P(c) on c and p(x|c) on x, parameterized by the category mean and variance]
33. The Gaussian distribution
Probability density: p(x) = (1 / (σ √(2π))) exp(−(x − μ)^2 / (2σ^2)), with mean μ and variance σ^2
[Figure: Gaussian density plotted against (x − μ)/σ]
34. Multivariate Gaussians
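As a sketch of the parametric approach for spatial stimuli, the following fits a multivariate Gaussian to each category and applies Bayes' rule; the small ridge added to each covariance matrix is an assumption made here for numerical stability, not part of the model described in the slides.

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gaussians(X, y):
    """Parametric density estimation: a mean and covariance for each category."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        cov = np.cov(Xc, rowvar=False) + 1e-6 * np.eye(X.shape[1])  # ridge for stability
        params[c] = (Xc.mean(axis=0), cov)
    return params

def posterior(x, params, priors):
    """P(c | x) is proportional to P(c) * N(x; mu_c, Sigma_c)."""
    scores = {c: priors[c] * multivariate_normal.pdf(x, mean=m, cov=S)
              for c, (m, S) in params.items()}
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, -0.1], [1.0, 1.0], [1.1, 0.9], [0.9, 1.2]])
y = np.array(["A", "A", "A", "B", "B", "B"])
params = fit_gaussians(X, y)
print(posterior(np.array([0.2, 0.3]), params, priors={"A": 0.5, "B": 0.5}))
```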
35. Bayesian inference
[Figure: probability as a function of x]
36. Nonparametric density estimation
- Rather than estimating a probability density from a parametric family, use a scheme that can give you any possible density
- Nonparametric can mean:
- not making distributional assumptions
- covering a broad class of functions
- a model with infinitely many parameters
37. Kernel density estimation
- Approximate a probability distribution as the sum of many kernels (one per data point):
  p̂(x) = (1/n) Σ_i k(x, x_i)
  for X = x_1, x_2, ..., x_n independently sampled from the true distribution, and a kernel function k(x, y) such that ∫ k(x, y) dx = 1
(A minimal sketch follows.)
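A minimal kernel density estimation sketch, assuming a Gaussian kernel of width h; the estimate averages one kernel per observed data point. Names and sample values are illustrative.

```python
import numpy as np

def gaussian_kernel(x, y, h):
    """A kernel k(x, y) that integrates to 1 over x: a Gaussian centered on y with width h."""
    return np.exp(-0.5 * ((x - y) / h) ** 2) / (h * np.sqrt(2 * np.pi))

def kde(x, data, h=0.25):
    """Kernel density estimate: the average of one kernel per data point."""
    return np.mean([gaussian_kernel(x, xi, h) for xi in data])

# Density estimate at a few points, from n = 5 samples
data = np.array([-1.2, -0.8, 0.1, 0.9, 1.3])
for x in (-1.0, 0.0, 2.0):
    print(x, kde(x, data, h=0.25))
```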
38-42. Kernel density estimation
[Figures: kernel density estimates with h = 0.25 as the number of samples grows (n = 1, 2, 5, 10, 100), each showing the estimated function, the individual kernels, and the true function]
43-45. Bayesian inference
[Figures: probability as a function of x for bandwidths h = 0.5, 0.1, and 2.0]
46. Advantages and disadvantages
- Which method should we choose?
- The methods are complementary
- parametric estimation requires less data, but is severely constrained in the distributions it can express
- nonparametric estimation requires a lot of data, but can model any function
- An instance of the bias-variance tradeoff
47. Categorization as density estimation
- Prototype and exemplar models can be interpreted as rational Bayesian inference
- They correspond to different forms of density estimation
- prototype: parametric density estimation
- exemplar: nonparametric density estimation
- This suggests other categorization models
48. Prototype models
- Prototype model: choose the category with the closest prototype
- Bayesian categorization: choose the category with the highest P(c|x)
- These are equivalent if P(x|c) decreases as a function of the distance of x from the prototype μ_c and the prior probability is equal for all categories
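Spelling this out (a sketch, assuming an isotropic Gaussian likelihood with shared variance σ² and equal priors):

$$ P(c \mid x) \propto p(x \mid c)\,P(c) \propto \exp\!\left(-\frac{\lVert x - \mu_c \rVert^2}{2\sigma^2}\right) \quad\Rightarrow\quad \arg\max_c P(c \mid x) = \arg\min_c \lVert x - \mu_c \rVert. $$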
49. Exemplar models
- Exemplar model: choose category A with probability P(A|x) = β_A Σ_{y in A} η_xy / Σ_B β_B Σ_{y in B} η_xy
- P(A|x) is the posterior probability of category A when P(x|A) is approximated using a kernel density estimator
50. Bayesian exemplars
- β_A is the prior probability of category A (divided by the number of exemplars in A)
- The summed similarity is proportional to p(x|A) under a kernel density estimator, for an appropriate kernel and similarity function
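The correspondence can be written out as follows (a sketch, assuming the kernel is proportional to the similarity function, k(x, y) ∝ η_xy):

$$ P(A \mid x) = \frac{P(A)\,\hat p(x \mid A)}{\sum_B P(B)\,\hat p(x \mid B)}, \qquad \hat p(x \mid A) = \frac{1}{n_A} \sum_{y \in A} k(x, y), $$

so taking β_A ∝ P(A) / n_A recovers the Luce-Shepard choice rule used by the exemplar model.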
51. The generalized context model
[Figure: probability as a function of x, showing the similarity gradient around each stored exemplar]
52. Implications
- Prototype and exemplar models can be interpreted as rational Bayesian inference
- They have different strengths and limitations
- exemplar models are rational categorization models for any category density (given large n)
- This suggests other categorization models: alternatives between prototypes and exemplars
53. Prototypes vs. exemplars
- Prototype model: parametric density estimation; P(x|A) is specified by one Gaussian
- Exemplar model: nonparametric density estimation; P(x|A) is a sum of n_A Gaussians
- Compromise: semiparametric density estimation; a sum of more than one and fewer than n_A Gaussians, i.e., a mixture of Gaussians (Rosseel, 2003)
54. The Rational Model of Categorization (RMC; Anderson, 1990, 1991)
- Computational problem: predicting a feature based on observed data, assuming that category labels are just features
- Predictions are made on the assumption that objects form clusters with similar properties
- each object belongs to a single cluster
- feature values are likely to be the same within clusters
- the number of clusters is unbounded
55. Representation in the RMC
- Flexible representation can interpolate between prototype and exemplar models
[Figures: probability as a function of feature value]
56. The optimal solution
- The probability of the missing feature (i.e., the category label) taking a certain value is
  P(j | F_n) = Σ_{x_n} P(j | x_n, F_n) P(x_n | F_n)
  where j is a feature value, F_n are the observed features of a set of n objects, and x_n is a partition of the objects into clusters
57. The posterior over partitions
P(x_n | F_n) ∝ P(F_n | x_n) P(x_n): the likelihood of the observed features under a partition, combined with a prior over partitions
58. The prior over partitions
- An object is assumed to have a constant probability of joining the same cluster as another object, known as the coupling probability c
- This allows some probability that a stimulus forms a new cluster, so the probability that the ith object is assigned to the kth cluster is
  c m_k / ((1 − c) + c (i − 1)) for an existing cluster with m_k members, and
  (1 − c) / ((1 − c) + c (i − 1)) for a new cluster
59. Nonparametric Bayes
- Bayesian: treat density estimation as a problem of Bayesian inference, defining a prior over the number of components
- Nonparametric: use a prior that places no limit on the number of components
- Dirichlet process mixture models (DPMMs) use a prior of this kind
60. The prior over components
- Each sample is assumed to come from a single (possibly previously unseen) mixture component
- The ith sample is drawn from the kth component with probability
  m_k / (i − 1 + α) for an existing component with m_k samples, and α / (i − 1 + α) for a new component,
  where α is a parameter of the model (a minimal sketch follows)
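A minimal sketch of sampling from this sequential prior (the Chinese-restaurant-process form used for DPMMs); the function names are illustrative.

```python
import numpy as np

def assignment_probs(counts, alpha):
    """Probability that the next object joins each existing component or a new one:
    m_k / (i - 1 + alpha) for an existing component with m_k members,
    alpha / (i - 1 + alpha) for a new component (i - 1 = objects seen so far)."""
    seen = sum(counts)
    return np.array([m / (seen + alpha) for m in counts] + [alpha / (seen + alpha)])

def sample_partition(n, alpha, rng=None):
    """Sample a partition of n objects sequentially from the prior."""
    rng = rng or np.random.default_rng(0)
    assignments, counts = [], []
    for _ in range(n):
        probs = assignment_probs(counts, alpha)
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):          # a previously unseen component
            counts.append(0)
        counts[k] += 1
        assignments.append(k)
    return assignments

print(sample_partition(10, alpha=1.0))
```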
61. Equivalence
- Neal (1998) showed that the prior for the RMC and the prior for the DPMM are the same, with α = (1 − c) / c
- RMC prior: c m_k / ((1 − c) + c (i − 1)) for an existing cluster, (1 − c) / ((1 − c) + c (i − 1)) for a new cluster
- DPMM prior: m_k / (i − 1 + α) for an existing component, α / (i − 1 + α) for a new component
62. The computational challenge
- The probability of the missing feature (i.e., the category label) taking a certain value is
  P(j | F_n) = Σ_{x_n} P(j | x_n, F_n) P(x_n | F_n)
  where j is a feature value, F_n are the observed features of a set of n objects, and x_n is a partition of the objects into groups
- The number of partitions grows very rapidly with n:

  n                      1   2   3    4    5    6     7     8      9       10
  number of partitions   1   2   5   15   52   203   877   4140   21147   115975
63. Anderson's approximation
- Data are observed sequentially
- Each object is deterministically assigned to the cluster with the highest posterior probability
- Call this the Local MAP: choosing the cluster with the maximum a posteriori probability (a minimal sketch follows)
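A minimal sketch of the Local MAP procedure, assuming binary features with a symmetric Beta(beta, beta) prior on the within-cluster feature probabilities (an illustrative likelihood; the slide does not fix one). Names and data are illustrative.

```python
import numpy as np

def local_map_partition(X, alpha=1.0, beta=1.0):
    """Anderson-style Local MAP: process objects in order and deterministically
    assign each one to the cluster with the highest posterior probability."""
    clusters, assignments = [], []           # each cluster: {"n": size, "ones": feature counts}
    for x in X:
        seen = len(assignments)
        scores = []
        for cl in clusters:                  # existing clusters: prior * predictive likelihood
            prior = cl["n"] / (seen + alpha)
            p_one = (cl["ones"] + beta) / (cl["n"] + 2 * beta)
            likelihood = np.prod(np.where(x == 1, p_one, 1 - p_one))
            scores.append(prior * likelihood)
        # a new cluster: prior alpha / (seen + alpha); predictive is 0.5 per binary feature
        scores.append((alpha / (seen + alpha)) * 0.5 ** len(x))
        k = int(np.argmax(scores))           # deterministic MAP assignment
        if k == len(clusters):
            clusters.append({"n": 0, "ones": np.zeros(len(x))})
        clusters[k]["n"] += 1
        clusters[k]["ones"] += x
        assignments.append(k)
    return assignments

X = np.array([[1, 1, 0, 0], [1, 1, 0, 1], [0, 0, 1, 1], [0, 0, 1, 0]])
print(local_map_partition(X))  # e.g. [0, 0, 1, 1]
```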
64. Two uses of Monte Carlo methods
- For solving problems of probabilistic inference involved in developing computational models
- As a source of hypotheses about how the mind might solve problems of probabilistic inference
65. Alternative approximation schemes
- There are several methods for approximating the posterior in DPMMs: Gibbs sampling and particle filtering
- These methods provide asymptotic performance guarantees (in contrast to Anderson's procedure)
(Sanborn, Griffiths, & Navarro, 2006)
66. Gibbs sampling for the DPMM
- All the data are required at once (a batch procedure)
- Starting from an initial partition, each stimulus is sequentially assigned to a cluster based on the assignments of all of the remaining stimuli
- Assignments are made probabilistically, using the full conditional distribution (a minimal sketch follows)
[Figure: an example starting partition and the conditional probabilities of each reassignment]
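A minimal sketch of collapsed Gibbs sampling for the DPMM, again assuming binary features with a symmetric Beta(beta, beta) prior; this is an illustrative implementation, not the code behind the original analyses.

```python
import numpy as np

def gibbs_dpmm(X, alpha=1.0, beta=1.0, sweeps=50, rng=None):
    """Each sweep removes every object in turn and reassigns it probabilistically,
    using its full conditional distribution given all other assignments."""
    rng = rng or np.random.default_rng(0)
    n, d = X.shape
    z = np.zeros(n, dtype=int)                # starting partition: everything together
    for _ in range(sweeps):
        for i in range(n):
            z[i] = -1                         # remove object i from its cluster
            labels = [k for k in np.unique(z) if k >= 0]
            probs = []
            for k in labels:
                members = X[z == k]
                m = len(members)
                p_one = (members.sum(axis=0) + beta) / (m + 2 * beta)
                likelihood = np.prod(np.where(X[i] == 1, p_one, 1 - p_one))
                probs.append(m / (n - 1 + alpha) * likelihood)
            probs.append(alpha / (n - 1 + alpha) * 0.5 ** d)   # a brand-new cluster
            probs = np.array(probs) / np.sum(probs)
            choice = rng.choice(len(probs), p=probs)
            z[i] = labels[choice] if choice < len(labels) else max(labels, default=-1) + 1
    return z

X = np.array([[1, 1, 0, 0], [1, 1, 0, 1], [0, 0, 1, 1], [0, 0, 1, 0]])
print(gibbs_dpmm(X))
```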
67. Particle filter for the DPMM
- Data are observed sequentially
- The posterior distribution at each point is approximated by a set of particles
- Particles are updated, and a fixed number of particles is carried over from trial to trial (a minimal sketch follows)
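A minimal particle-filter sketch under the same illustrative Beta-Bernoulli likelihood; each particle is a partition of the data seen so far, and multinomial resampling at every trial is a simplification chosen for brevity.

```python
import numpy as np

def particle_filter_dpmm(X, n_particles=100, alpha=1.0, beta=1.0, rng=None):
    """Process the data sequentially, keeping a fixed number of particles
    (candidate partitions) whose weights reflect how well they predict each new item."""
    rng = rng or np.random.default_rng(0)
    n, d = X.shape
    particles = [[] for _ in range(n_particles)]
    for i in range(n):
        weights = np.zeros(n_particles)
        for p_idx, z in enumerate(particles):
            labels = sorted(set(z))
            probs = []
            for k in labels:                  # prior * predictive likelihood per cluster
                members = X[:i][np.array(z) == k]
                m = len(members)
                p_one = (members.sum(axis=0) + beta) / (m + 2 * beta)
                likelihood = np.prod(np.where(X[i] == 1, p_one, 1 - p_one))
                probs.append(m / (i + alpha) * likelihood)
            probs.append(alpha / (i + alpha) * 0.5 ** d)      # a new cluster
            probs = np.array(probs)
            weights[p_idx] = probs.sum()      # particle weight: P(x_i | this partition)
            choice = rng.choice(len(probs), p=probs / probs.sum())
            z.append(labels[choice] if choice < len(labels) else len(labels))
        # carry over a fixed number of particles, resampled in proportion to their weights
        keep = rng.choice(n_particles, size=n_particles, p=weights / weights.sum())
        particles = [list(particles[j]) for j in keep]
    return particles

X = np.array([[1, 1, 0, 0], [1, 1, 0, 1], [0, 0, 1, 1], [0, 0, 1, 0]])
print(particle_filter_dpmm(X, n_particles=50)[0])   # one sampled partition
```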
68. Approximating the posterior
- For a single order, the Local MAP will produce a single partition
- The Gibbs sampler and the particle filter will approximate the exact DPMM distribution
69. Order effects in human data
- The probabilistic model underlying the DPMM does not produce any order effects; this follows from exchangeability
- But human data show order effects (e.g., Medin & Bettger, 1994)
- Anderson and Matessa tested Local MAP predictions about order effects in an unsupervised clustering experiment (Anderson, 1990)
70. Anderson and Matessa's experiment
- Subjects were shown all sixteen stimuli that had four binary features
- Front-anchored orders emphasized the first two features in the first eight trials; end-anchored orders emphasized the last two
- Front-anchored order: scadsporm, scadstirm, sneksporb, snekstirb, sneksporm, snekstirm, scadsporb, scadstirb
- End-anchored order: snadstirb, snekstirb, scadsporm, sceksporm, sneksporm, snadsporm, scedstirb, scadstirb
71. Anderson and Matessa's experiment
[Figure: proportion of partitions divided along a front-anchored feature, for the front-anchored and end-anchored presentation orders]
72. A rational process model
- A rational model clarifies a problem and serves as a benchmark for performance
- Using a psychologically plausible approximation can change a rational model into a rational process model
- Research in machine learning and statistics has produced useful approximations to statistical models which can be tested as general-purpose psychological heuristics
73. Summary
- Traditional models of the cognitive processes involved in categorization can be reinterpreted as rational models (via density estimation)
- Prototypes vs. exemplars is a question about schemes for density estimation (and representations)
- Nonparametric Bayes lets us explore options between these extremes (as well as some new models); all of these models are instances of the hierarchical Dirichlet process (Griffiths, Canini, Sanborn, & Navarro, 2007)
- Monte Carlo methods provide hypotheses about how people address the computational challenges