Categorization and density estimation (Tom Griffiths, UC Berkeley)

Transcript and Presenter's Notes
1
Categorization and density estimation
Tom Griffiths, UC Berkeley
2
Categorization
(Images of cats and dogs, each labeled "cat" or "dog")
3
Categorization
  • cat = small, furry, domestic, carnivore

4
Borderline cases
  • Is a tomato a vegetable?
  • around 50% say yes
  • (Hampton, 1979)
  • Is an olive a fruit?
  • around 22% change their mind
  • (McCloskey & Glucksberg, 1978)

6
Typicality
7
Typicality
8
(No Transcript)
9
(Images: typical and atypical category members)
10
Typicality and generalization
Penguins can catch disease X → All birds can catch disease X
Robins can catch disease X → All birds can catch disease X
The argument from a typical member (robins) is judged stronger than the argument from an atypical member (penguins) (Rips, 1975)
11
How can we explain typicality?
  • One answer: reject definitions, and adopt a new
    representation for categories
  • Prototype theory
  • categories are represented by a prototype
  • other members share a family resemblance relation
    to the prototype
  • typicality is a function of similarity to the
    prototype

12
Prototypes
Prototype
(Posner & Keele, 1968)
13
Posner and Keele (1968)
  • Prototype effect in categorization accuracy
  • Constructed categories by perturbing prototypical
    dot arrays
  • Ordering of categorization accuracy at test:
  • old exemplars > prototypes > new exemplars

14
Formalizing prototype theories
Representation: each category (e.g., A, B) has a corresponding prototype (μA, μB)
Categorization (for a new stimulus x): choose the category whose prototype minimizes the distance to x (or maximizes the similarity)
(e.g., Reed, 1972)
15
Formalizing prototype theories
Prototype is the most frequent or typical member

(Binary) features:
  Prototype: e.g., the binary vector with the most frequent feature values
  Distance: e.g., Hamming distance

Spaces:
  Prototype: e.g., the average of the members of the category
  Distance: e.g., Euclidean distance
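A minimal sketch of a prototype classifier for the spatial case, assuming numpy is available and the stimuli are points in a psychological space (the variable names and toy data are illustrative, not from the slides):

```python
import numpy as np

def fit_prototypes(X, labels):
    """Prototype for each category: the mean of its members (spatial case)."""
    return {c: X[labels == c].mean(axis=0) for c in np.unique(labels)}

def classify(x, prototypes):
    """Assign x to the category whose prototype is closest in Euclidean distance."""
    return min(prototypes, key=lambda c: np.linalg.norm(x - prototypes[c]))

# Toy example: two categories in a two-dimensional space
X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
labels = np.array(["A", "A", "B", "B"])
prototypes = fit_prototypes(X, labels)
print(classify(np.array([0.1, 0.2]), prototypes))  # closest prototype is A's
```

For binary features the same scheme applies with a most-frequent-value prototype and Hamming distance.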
16
Formalizing prototype theories
Decision boundary at equal distance (always a
straight line for two categories)
(Figure: categories A and B with prototypes at the category means; the boundary is the set of points equidistant from the two prototypes)
17
Predicting prototype effects
  • Prototype effects are built into the model
  • assume categorization becomes easier as proximity
    to the prototype increases
  • or distance from the boundary increases
  • But what about the old exemplar advantage?
  • (Posner & Keele, 1968)
  • Prototype models are not the only way to get
    prototype effects

18
Exemplar theories
Store every member (exemplar) of the category
19
Formalizing exemplar theories
Representation: a set of stored exemplars y1, y2, ..., yn, each with its own category label
Categorization (for a new stimulus x): choose category A with probability

  P(A | x) = βA Σ{y ∈ A} ηxy / ΣB βB Σ{y ∈ B} ηxy

where ηxy is the similarity of x to y and βA is the bias towards A (the Luce-Shepard choice rule)
20
The context model (Medin & Schaffer, 1978)
Defined for stimuli with binary features (color,
form, size, number)
1111 (red, triangle, big, one)
0000 (green, circle, small, two)
Define similarity as the product of similarity on
each dimension
21
The generalized context model (Nosofsky, 1986)
Defined for stimuli in psychological space
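A sketch of the exemplar (GCM-style) computation. Only the summed-similarity choice rule comes from the slides; the exponentially decaying similarity function and the sensitivity parameter are illustrative assumptions in the spirit of the GCM:

```python
import numpy as np

def gcm_probabilities(x, exemplars, labels, sensitivity=1.0, bias=None):
    """P(category | x) from summed similarity to stored exemplars (Luce-Shepard rule)."""
    categories = np.unique(labels)
    bias = bias or {c: 1.0 for c in categories}
    # Assumed similarity: exponential decay with distance in psychological space
    sims = np.exp(-sensitivity * np.linalg.norm(exemplars - x, axis=1))
    summed = {c: bias[c] * sims[labels == c].sum() for c in categories}
    total = sum(summed.values())
    return {c: s / total for c, s in summed.items()}

exemplars = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [1.2, 0.9]])
labels = np.array(["A", "A", "B", "B"])
print(gcm_probabilities(np.array([0.3, 0.3]), exemplars, labels))  # mostly "A"
```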
22
The generalized context model
Decision boundary determined by exemplars
(Figure: exemplar-defined regions for categories A and B, labeled with the probability of responding A, e.g., 90% A, 50% A, 10% A)
23
Prototypes vs. exemplars
  • Exemplar models produce prototype effects
  • if prototype minimizes distance to all exemplars
    in a category, then it has high probability
  • Also predicts old exemplar advantage
  • being close (or identical) to an old exemplar of
    the category gives high probability
  • Predicts new effects that prototype models cannot
    produce
  • stimuli close to an old exemplar should have high
    probability, even when far from the prototype

24
Prototypes vs. exemplars
Exemplar models can capture complex boundaries
25
Prototypes vs. exemplars
Exemplar models can capture complex boundaries
26
Some questions
  • Both prototype and exemplar models seem
    reasonable, but are they rational?
  • are they solutions to the computational problem?
  • Should we use prototypes, or exemplars?
  • How can we define other models that handle more
    complex categorization problems?

27
A computational problem
  • Categorization is a classic inductive problem
  • data: stimulus x
  • hypotheses: category c
  • We can apply Bayes' rule

      P(c | x) = p(x | c) P(c) / Σc' p(x | c') P(c')

  • and choose the c for which P(c | x) is maximized

28
Density estimation
  • We need to estimate some probability
    distributions
  • what is P(c)?
  • what is p(x | c)?
  • Two approaches
  • parametric
  • nonparametric

(Graphical model: category c generates stimulus x, with prior P(c) and likelihood p(x | c))
29
Parametric density estimation
  • Assume that p(x | c) has a simple form,
    characterized by parameters θ
  • Given stimuli X = x1, x2, ..., xn from category c,
    find θ by maximum-likelihood estimation
  • or some form of Bayesian estimation

30
Binary features
  • x = (x1, x2, ..., xm) ∈ {0,1}^m
  • Rather than estimating a distribution over the 2^m
    possibilities, assume feature independence

(Graphical model: c → x, with prior P(c) and likelihood P(x | c))
31
Binary features
  • x = (x1, x2, ..., xm) ∈ {0,1}^m
  • Rather than estimating a distribution over the 2^m
    possibilities, assume feature independence
  • Called Naïve Bayes, because independence is a
    naïve assumption!

(Graphical model: c → x1, x2, ..., xm; the features are independent given the category)
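A sketch of naive Bayes for binary features; the add-one (Laplace) smoothing is an added assumption to keep the estimated probabilities away from zero:

```python
import numpy as np

def fit_naive_bayes(X, labels):
    """Estimate P(c) and P(x_j = 1 | c) for each category c and binary feature j."""
    categories = np.unique(labels)
    prior = {c: np.mean(labels == c) for c in categories}
    theta = {c: (X[labels == c].sum(axis=0) + 1) / (np.sum(labels == c) + 2)
             for c in categories}  # add-one smoothing (assumption)
    return prior, theta

def naive_bayes_posterior(x, prior, theta):
    """P(c | x), treating the features as independent given the category."""
    joint = {c: prior[c] * np.prod(np.where(x == 1, theta[c], 1 - theta[c])) for c in prior}
    total = sum(joint.values())
    return {c: p / total for c, p in joint.items()}

X = np.array([[1, 1, 1, 1], [1, 1, 0, 1], [0, 0, 0, 0], [0, 1, 0, 0]])
labels = np.array(["A", "A", "B", "B"])
prior, theta = fit_naive_bayes(X, labels)
print(naive_bayes_posterior(np.array([1, 1, 1, 0]), prior, theta))  # favors "A"
```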
32
Spatial representations
  • Assume a simple parametric form for p(x | c): a
    Gaussian
  • For each category, estimate parameters
  • mean
  • variance

(Graphical model: c → x, with prior P(c) and Gaussian likelihood p(x | c) parameterized by a mean μ and standard deviation σ)
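A sketch of the parametric route for spatial representations: fit a Gaussian per category by maximum likelihood and combine with P(c) via Bayes' rule. Treating the dimensions as independent Gaussians, and the use of scipy, are simplifying assumptions:

```python
import numpy as np
from scipy.stats import norm

def fit_gaussians(X, labels):
    """Maximum-likelihood mean and standard deviation per category, per dimension."""
    return {c: (X[labels == c].mean(axis=0), X[labels == c].std(axis=0))
            for c in np.unique(labels)}

def gaussian_posterior(x, params, prior):
    """P(c | x) proportional to p(x | c) P(c), with independent Gaussian dimensions."""
    joint = {c: prior[c] * np.prod(norm.pdf(x, loc=mu, scale=sd))
             for c, (mu, sd) in params.items()}
    total = sum(joint.values())
    return {c: p / total for c, p in joint.items()}

X = np.array([[0.0, 0.1], [0.2, 0.0], [1.0, 1.1], [1.2, 0.9]])
labels = np.array(["A", "A", "B", "B"])
params = fit_gaussians(X, labels)
print(gaussian_posterior(np.array([0.3, 0.2]), params, {"A": 0.5, "B": 0.5}))  # mostly "A"
```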
33
The Gaussian distribution
(Figure: the Gaussian probability density p(x) = (1/√(2πσ²)) exp(−(x − μ)² / (2σ²)), plotted against (x − μ)/σ; the variance is σ²)
34
Multivariate Gaussians
35
Bayesian inference
(Figure: probability plotted against x)
36
Nonparametric density estimation
  • Rather than estimating a probability density from
    a parametric family, use a scheme that can give
    you any possible density
  • Nonparametric can mean
  • not making distributional assumptions
  • covering a broad class of functions
  • a model with infinitely many parameters

37
Kernel density estimation
Approximate a probability distribution as a sum of many kernels, one per data point:

  p̂(x) = (1/n) Σi k(x, xi)

where X = x1, x2, ..., xn are sampled independently from the distribution being estimated, and the kernel function k(x, y) satisfies ∫ k(x, y) dx = 1
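A sketch of a kernel density estimator with a Gaussian kernel; the kernel choice and the bandwidth value are assumptions, not taken from the slides:

```python
import numpy as np

def kde(x, samples, h=0.25):
    """Average one Gaussian kernel (bandwidth h) per data point; each kernel integrates to 1."""
    return np.mean(np.exp(-(x - samples) ** 2 / (2 * h ** 2)) / (h * np.sqrt(2 * np.pi)))

samples = np.random.normal(loc=0.0, scale=1.0, size=100)
print(kde(0.0, samples))  # roughly the standard normal density at 0 (about 0.4)
```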
38
Kernel density estimation
(Figure: kernel density estimate with h = 0.25 and n = 1, showing the estimated function, the individual kernels, and the true function)
39
Kernel density estimation
(Figure: as above, with h = 0.25 and n = 2)
40
Kernel density estimation
(Figure: as above, with h = 0.25 and n = 5)
41
Kernel density estimation
(Figure: as above, with h = 0.25 and n = 10)
42
Kernel density estimation
(Figure: as above, with h = 0.25 and n = 100)
43
Bayesian inference
(Figure: probability plotted against x, with bandwidth h = 0.5)
44
Bayesian inference
(Figure: probability plotted against x, with bandwidth h = 0.1)
45
Bayesian inference
(Figure: probability plotted against x, with bandwidth h = 2.0)
46
Advantages and disadvantages
  • Which method should we choose?
  • Methods are complementary
  • parametric estimation requires less data, but is
    severely constrained in possible distributions
  • nonparametric estimation requires a lot of data,
    but can model any function

An instance of the bias-variance tradeoff
47
Categorization as density estimation
  • Prototype and exemplar models can be interpreted
    as rational Bayesian inference
  • Different forms of density estimation
  • prototype model = parametric density estimation
  • exemplar model = nonparametric density estimation
  • Suggests other categorization models

48
Prototype models
  • Prototype model
  • Choose category with the closest prototype
  • Bayesian categorization
  • Choose the category with the highest P(c | x)
  • Equivalent if p(x | c) decreases as a function
    of the distance of x from the prototype μc and
    the prior probability is equal for all categories

49
Exemplar models
Exemplar model: choose category A with probability given by the summed-similarity (Luce-Shepard) rule above. This P(A | x) is the posterior probability of category A when p(x | A) is approximated using a kernel density estimator
50
Bayesian exemplars
  • βA is the prior probability of category A (divided
    by the number of exemplars in A)
  • The summed similarity is proportional to p(x | A)
    using a kernel density estimator

for appropriate kernel and similarity function
51
The generalized context model
(Figure: p(x | A) as a sum of similarity gradients, one centered on each exemplar, plotted against x)
52
Implications
  • Prototype and exemplar models can be interpreted
    as rational Bayesian inference
  • Different strengths and limitations
  • exemplar models are rational categorization
    models for any category density (and large n)
  • Suggests other categorization models
  • alternatives between prototypes and exemplars

53
Prototypes vs. exemplars
  • Prototype model
  • parametric density estimation
  • p(x | A) is specified by a single Gaussian
  • Exemplar model
  • nonparametric density estimation
  • p(x | A) is a sum of nA Gaussians
  • Compromise
  • semiparametric density estimation
  • a sum of more than one but fewer than nA Gaussians
  • mixture of Gaussians (Rosseel, 2003); see the sketch below
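A sketch of the semiparametric compromise: model p(x | A) with a small mixture of Gaussians per category. scikit-learn's GaussianMixture is used purely for illustration here; it is not part of the models discussed in the slides:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_mixtures(X, labels, k=2):
    """One k-component mixture per category: between 1 (prototype) and n_A (exemplar) components."""
    return {c: GaussianMixture(n_components=k).fit(X[labels == c]) for c in np.unique(labels)}

def mixture_posterior(x, mixtures, prior):
    """P(c | x) proportional to p(x | c) P(c), with p(x | c) given by the fitted mixture."""
    joint = {c: prior[c] * np.exp(m.score_samples(x.reshape(1, -1))[0])
             for c, m in mixtures.items()}
    total = sum(joint.values())
    return {c: p / total for c, p in joint.items()}

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (20, 2)), rng.normal(3.0, 1.0, (20, 2))])
labels = np.array(["A"] * 20 + ["B"] * 20)
mixtures = fit_mixtures(X, labels, k=2)
print(mixture_posterior(np.array([1.5, 1.5]), mixtures, {"A": 0.5, "B": 0.5}))
```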

54
The Rational Model of Categorization (RMC; Anderson, 1990, 1991)
  • Computational problem: predicting a feature based
    on observed data
  • assume that category labels are just features
  • Predictions are made on the assumption that
    objects form clusters with similar properties
  • each object belongs to a single cluster
  • feature values likely to be the same within
    clusters
  • the number of clusters is unbounded

55
Representation in the RMC
  • Flexible representation can interpolate between
    prototype and exemplar models

(Figures: probability plotted against feature value, one panel per representation)
56
The optimal solution
  • The probability of the missing feature (i.e., the
    category label) taking value j is

      P(j | Fn) = Σxn P(j | xn, Fn) P(xn | Fn)

where Fn are the observed features of a set of n objects, and the sum ranges over xn, the possible partitions of the objects into clusters
57
The posterior over partitions
58
The prior over partitions
  • An object is assumed to have a constant probability
    of joining the same cluster as another object, known
    as the coupling probability (here written c)
  • This allows some probability that a stimulus forms a
    new cluster, so the probability that the ith object
    is assigned to the kth cluster is

      P(cluster k) = c Mk / ((1 − c) + c (i − 1))   for an existing cluster with Mk members
      P(new cluster) = (1 − c) / ((1 − c) + c (i − 1))

59
Nonparametric Bayes
  • Bayes: treat density estimation as a problem of
    Bayesian inference, defining a prior over the
    number of components
  • Nonparametric: use a prior that places no limit
    on the number of components
  • Dirichlet process mixture models (DPMMs) use a
    prior of this kind

60
The prior over components
  • Each sample is assumed to come from a single
    (possibly previously unseen) mixture component
  • The ith sample is drawn from the kth component
    with probability

      P(component k) = Mk / (i − 1 + α)   for an existing component with Mk members
      P(new component) = α / (i − 1 + α)

  • where α is a parameter of the model
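A sketch of these sequential assignment probabilities (the Chinese-restaurant-process form of the DPMM prior); by the equivalence on the next slide, the same numbers arise from the RMC prior with α = (1 − c)/c:

```python
def crp_assignment_probs(cluster_sizes, alpha):
    """Prior probability that the next object joins each existing cluster or starts a new one.
    cluster_sizes: list of M_k for the clusters formed by the first i - 1 objects."""
    seen = sum(cluster_sizes)  # i - 1 objects observed so far
    existing = [m / (seen + alpha) for m in cluster_sizes]
    new = alpha / (seen + alpha)
    return existing, new

# Three clusters of sizes 3, 1, 1 seen so far, alpha = 1
print(crp_assignment_probs([3, 1, 1], alpha=1.0))  # ([0.5, 0.1667, 0.1667], 0.1667)
```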

61
Equivalence
  • Neal (1998) showed that the priors for the RMC and
    the DPMM are the same, with α = (1 − c) / c

RMC prior: P(cluster k) = c Mk / ((1 − c) + c (i − 1))
DPMM prior: P(component k) = Mk / (i − 1 + α)
62
The computational challenge
  • The probability of the missing feature (i.e., the
    category label) taking value j is

      P(j | Fn) = Σxn P(j | xn, Fn) P(xn | Fn)

where Fn are the observed features of a set of n objects, and the sum ranges over xn, the possible partitions of the objects into groups. The number of partitions grows extremely quickly with n:

  n:                    1   2   3    4    5     6     7      8       9       10
  number of partitions: 1   2   5   15   52   203   877   4140   21147   115975
63
Anderson's approximation
  • Data observed sequentially
  • Each object is deterministically assigned to the
    cluster with the highest posterior probability
  • Call this the Local MAP
  • choosing the cluster with the maximum a
    posteriori probability
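A sketch of the Local MAP procedure: process objects in order and commit each one to its single most probable cluster. The cluster score combines the sequential prior above with a Bernoulli likelihood over binary features; that likelihood (with add-one smoothing) is an illustrative assumption, not Anderson's exact likelihood:

```python
import numpy as np

def local_map(X, alpha=1.0):
    """Greedy clustering: assign each object to the a posteriori most probable cluster."""
    clusters = []  # each cluster is a list of row indices into X
    for i, x in enumerate(X):
        scores = []
        for members in clusters:
            prior = len(members) / (i + alpha)  # i objects seen before this one
            p_one = (X[members].sum(axis=0) + 1) / (len(members) + 2)  # assumed likelihood
            scores.append(prior * np.prod(np.where(x == 1, p_one, 1 - p_one)))
        scores.append(alpha / (i + alpha) * 0.5 ** len(x))  # start a new cluster
        best = int(np.argmax(scores))
        if best == len(clusters):
            clusters.append([i])
        else:
            clusters[best].append(i)
    return clusters

X = np.array([[1, 1, 1, 1], [1, 1, 1, 0], [0, 0, 0, 0], [0, 0, 1, 0]])
print(local_map(X))  # e.g. [[0, 1], [2, 3]]
```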

64
Two uses of Monte Carlo methods
  • For solving problems of probabilistic inference
    involved in developing computational models
  • As a source of hypotheses about how the mind
    might solve problems of probabilistic inference

65
Alternative approximation schemes
  • There are several methods for making
    approximations to the posterior in DPMMs
  • Gibbs sampling
  • Particle filtering
  • These methods provide asymptotic performance
    guarantees (in contrast to Anderson's procedure)

(Sanborn, Griffiths, & Navarro, 2006)
66
Gibbs sampling for the DPMM
  • All the data are required at once (a batch
    procedure)
  • Each stimulus is sequentially assigned to a
    cluster based on the assignments of all of the
    remaining stimuli
  • Assignments are made probabilistically, using the
    full conditional distribution

(Figure: a starting partition and example full conditional probabilities for reassigning one stimulus, e.g., 0.33, 0.67, 0.12, 0.40)
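A sketch of one Gibbs sweep, reusing the same assumed Bernoulli likelihood as in the Local MAP sketch; unlike the Local MAP, each assignment is sampled from its full conditional rather than maximized:

```python
import numpy as np

def gibbs_sweep(X, assignments, alpha=1.0, rng=None):
    """Reassign each object given the current cluster assignments of all the others."""
    rng = rng or np.random.default_rng()
    n = len(X)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        labels = sorted({assignments[j] for j in others})
        scores = []
        for k in labels:
            members = [j for j in others if assignments[j] == k]
            prior = len(members) / (n - 1 + alpha)
            p_one = (X[members].sum(axis=0) + 1) / (len(members) + 2)  # assumed likelihood
            scores.append(prior * np.prod(np.where(X[i] == 1, p_one, 1 - p_one)))
        scores.append(alpha / (n - 1 + alpha) * 0.5 ** X.shape[1])  # new cluster
        probs = np.array(scores) / np.sum(scores)
        choice = rng.choice(len(scores), p=probs)
        assignments[i] = labels[choice] if choice < len(labels) else max(assignments) + 1
    return assignments

X = np.array([[1, 1, 1, 1], [1, 1, 1, 0], [0, 0, 0, 0], [0, 0, 1, 0]])
print(gibbs_sweep(X, [0, 0, 0, 0]))  # one sampled reassignment of every stimulus
```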
67
Particle filter for the DPMM
  • Data are observed sequentially
  • The posterior distribution at each point is
    approximated by a set of particles
  • Particles are updated, and a fixed number are
    carried over from trial to trial

68
Approximating the posterior
  • For a single order, the Local MAP will produce a
    single partition
  • The Gibbs sampler and particle filter will
    approximate the exact DPMM distribution

69
Order effects in human data
  • The probabilistic model underlying the DPMM does
    not produce any order effects
  • follows from exchangeability
  • But human data shows order effects
  • (e.g., Medin & Bettger, 1994)
  • Anderson and Matessa tested local MAP predictions
    about order effects in an unsupervised clustering
    experiment
  • (Anderson, 1990)

70
Anderson and Matessa's Experiment
  • Subjects were shown all sixteen stimuli that had
    four binary features
  • Front-anchored orders emphasized the first two
    features in the first eight trials; end-anchored
    orders emphasized the last two

Front-Anchored Order: scadsporm, scadstirm, sneksporb, snekstirb, sneksporm, snekstirm, scadsporb, scadstirb
End-Anchored Order: snadstirb, snekstirb, scadsporm, sceksporm, sneksporm, snadsporm, scedstirb, scadstirb
71
Anderson and Matessa's Experiment
(Figure: proportion of partitions divided along a front-anchored feature, for the front-anchored vs. end-anchored presentation orders)
72
A rational process model
  • A rational model clarifies a problem and serves
    as a benchmark for performance
  • Using a psychologically plausible approximation
    can change a rational model into a rational
    process model
  • Research in machine learning and statistics has
    produced useful approximations to statistical
    models which can be tested as general-purpose
    psychological heuristics

73
Summary
  • Traditional models of the cognitive processes
    involved in categorization can be reinterpreted
    as rational models (via density estimation)
  • Prototypes vs. exemplars is about schemes for
    density estimation (and representations)
  • Nonparametric Bayes lets us explore options
    between these extremes (as well as some new
    models)
  • all of these models are instances of the
    hierarchical Dirichlet process
  • (Griffiths, Canini, Sanborn, & Navarro, 2007)
  • Monte Carlo methods provide hypotheses about how
    people address the computational challenges

74
(No Transcript)