Bayesian models of human learning and inference

1
  • Bayesian models of human learning and
    inference
  • Josh Tenenbaum
  • MIT
  • Department of Brain and Cognitive Sciences
  • Computer Science and AI Lab (CSAIL)

(http://web.mit.edu/cocosci/Talks/nips06-tutorial.ppt)
Thanks to Tom Griffiths, Charles Kemp, Vikash
Mansinghka
2
The probabilistic revolution in AI
  • Principled and effective solutions for inductive
    inference from ambiguous data
  • Vision
  • Robotics
  • Machine learning
  • Expert systems / reasoning
  • Natural language processing
  • Standard view: no necessary connection to how the
    human brain solves these problems.

3
Probabilistic inference in human cognition?
  • "People aren't Bayesian"
  • Kahneman and Tversky (1970s-present): the
    "heuristics and biases" research program. 2002
    Nobel Prize in Economics.
  • Slovic, Fischhoff, and Lichtenstein (1976): "It
    appears that people lack the correct programs for
    many important judgmental tasks.... it may be
    argued that we have not had the opportunity to
    evolve an intellect capable of dealing
    conceptually with uncertainty."
  • Stephen Jay Gould (1992): "Our minds are not
    built (for whatever reason) to work by the rules
    of probability."

4
The probability of breast cancer is 1% for a
woman at 40 who participates in a routine
screening. If a woman has breast cancer, the
probability is 80% that she will have a positive
mammography. If a woman does not have breast
cancer, the probability is 9.6% that she will
also have a positive mammography.
A woman in this age group had a positive
mammography in a routine screening. What is the
probability that she actually has breast cancer?
A. greater than 90%    B. between 70% and 90%
C. between 50% and 70%  D. between 30% and 50%
E. between 10% and 30%  F. less than 10%
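Working the stated numbers through Bayes' rule (a quick check; only the probabilities given above are used):

```python
# Bayes' rule with the numbers stated above:
# P(cancer) = 0.01, P(+ | cancer) = 0.80, P(+ | no cancer) = 0.096.
p_cancer = 0.01
p_pos_given_cancer = 0.80
p_pos_given_healthy = 0.096

p_pos = p_pos_given_cancer * p_cancer + p_pos_given_healthy * (1 - p_cancer)
print(p_pos_given_cancer * p_cancer / p_pos)  # ~0.078, i.e. answer F (< 10%)
```

The correct answer, about 7.8%, is far lower than most people's intuitive estimate.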
5
Availability biases in probability judgment
  • How likely is it that a randomly chosen word
  • ends in "g"?
  • ends in "ing"?
  • When buying a car, how much do you weigh your
    friends' experience relative to consumer
    satisfaction surveys?

6
(No Transcript)
7
(No Transcript)
8
Probabilistic inference in human cognition?
  • "People aren't Bayesian"
  • Kahneman and Tversky (1970s-present): the
    "heuristics and biases" research program. 2002
    Nobel Prize in Economics.
  • Psychology is often drawn towards the mind's
    errors and apparent irrationalities.
  • But the computationally interesting question
    remains: how does the mind work so well?

9
Bayesian models of cognition
  • Visual perception Weiss, Simoncelli, Adelson,
    Richards, Freeman, Feldman, Kersten, Knill,
    Maloney, Olshausen, Jacobs, Pouget, ...
  • Language acquisition and processing Brent, de
    Marcken, Niyogi, Klein, Manning, Jurafsky, Keller,
    Levy, Hale, Johnson, Griffiths, Perfors,
    Tenenbaum,
  • Motor learning and motor control Ghahramani,
    Jordan, Wolpert, Kording, Kawato, Doya, Todorov,
    Shadmehr,
  • Associative learning Dayan, Daw, Kakade,
    Courville, Touretzky, Kruschke,
  • Memory Anderson, Schooler, Shiffrin, Steyvers,
    Griffiths, McClelland,
  • Attention Mozer, Huber, Torralba, Oliva,
    Geisler, Yu, Itti, Baldi,
  • Categorization and concept learning Anderson,
    Nosofsky, Rehder, Navarro, Griffiths, Feldman,
    Tenenbaum, Rosseel, Goodman, Kemp, Mansinghka,
  • Reasoning Chater, Oaksford, Sloman, McKenzie,
    Heit, Tenenbaum, Kemp,
  • Causal inference Waldmann, Sloman, Steyvers,
    Griffiths, Tenenbaum, Yuille,
  • Decision making and theory of mind Lee,
    Stankiewicz, Rao, Baker, Goodman, Tenenbaum,

10
Learning concepts from examples
  • Word learning

(Images: three different horses, each labeled "horse".)
11
Learning concepts from examples
12
Everyday inductive leaps
  • How can people learn so much about the world . .
    .
  • Kinds of objects and their properties
  • The meanings of words, phrases, and sentences
  • Cause-effect relations
  • The beliefs, goals and plans of other people
  • Social structures, conventions, and rules
  • . . . from such limited evidence?

13
Contributions of Bayesian models
  • Principled quantitative models of human behavior,
    with broad coverage and a minimum of free
    parameters and ad hoc assumptions.
  • Explain how and why human learning and reasoning
    works, in terms of (approximations to) optimal
    statistical inference in natural environments.
  • A framework for studying people's implicit
    knowledge about the structure of the world: how
    it is structured, used, and acquired.
  • A two-way bridge to state-of-the-art AI and
    machine learning.

14
Marr's Three Levels of Analysis
  • Computation
  • What is the goal of the computation, why is it
    appropriate, and what is the logic of the
    strategy by which it can be carried out?
  • Algorithm
  • Cognitive psychology
  • Implementation
  • Neurobiology

15
What about those errors?
  • The human mind is not a universal Bayesian
    engine.
  • But, the mind does appear adapted to solve
    important real-world inference problems in
    approximately Bayesian ways, e.g.
  • Predicting everyday events
  • Causal learning and reasoning
  • Learning concepts from examples
  • As in perceptual tasks, adults and even young
    children solve these problems mostly
    unconsciously, effortlessly, and successfully.

16
Technical themes
  • Inference in probabilistic models
  • Role of priors, explaining away.
  • Learning in graphical models
  • Parameter learning, structure learning.
  • Bayesian model averaging
  • Being Bayesian over network structures.
  • Bayesian Occam's razor
  • Trade off model complexity against data fit.

17
Technical themes
  • Structured probabilistic models
  • Grammars, first-order logic, relational schemas.
  • Hierarchical Bayesian models
  • Acquire abstract knowledge, supports transfer.
  • Nonparametric Bayes
  • Flexible models that grow in complexity as new
    data warrant.
  • Tractable approximate inference
  • Markov chain Monte Carlo (MCMC), Sequential Monte
    Carlo (particle filtering).

18
Outline
  • Predicting everyday events
  • Causal learning and reasoning
  • Learning concepts from examples

19
Outline
  • Predicting everyday events
  • Causal learning and reasoning
  • Learning concepts from examples

20
Basics of Bayesian inference
  • Bayes' rule:
    P(h|d) = P(d|h) P(h) / Σh' P(d|h') P(h')
  • An example
  • Data: John is coughing
  • Some hypotheses:
  • 1. John has a cold
  • 2. John has lung cancer
  • 3. John has a stomach flu
  • Likelihood P(d|h) favors 1 and 2 over 3
  • Prior probability P(h) favors 1 and 3 over 2
  • Posterior probability P(h|d) favors 1 over 2 and 3
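A minimal sketch of this computation (the numerical priors and likelihoods below are hypothetical, chosen only to reflect the qualitative orderings on the slide):

```python
# Hypothetical numbers illustrating the slide's qualitative orderings.
prior = {"cold": 0.50, "lung cancer": 0.01, "stomach flu": 0.49}    # P(h)
likelihood = {"cold": 0.8, "lung cancer": 0.7, "stomach flu": 0.1}  # P(d|h)

unnorm = {h: likelihood[h] * prior[h] for h in prior}               # P(d|h) P(h)
z = sum(unnorm.values())
posterior = {h: p / z for h, p in unnorm.items()}                   # P(h|d)
print(posterior)  # cold ~0.88: favored by both prior and likelihood
```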

21
Bayesian inference in perception and sensorimotor
integration
(Weiss, Simoncelli & Adelson, 2002)
(Körding & Wolpert, 2004)
22
Memory retrieval as Bayesian inference
(Anderson & Schooler, 1991)
(Figures: power law of forgetting; spacing effects in
forgetting; additive effects of practice and delay.
Axes: log memory strength / mean % recalled vs. log
delay and retention interval.)
23
Memory retrieval as Bayesian inference
(Anderson & Schooler, 1991)
  • For each item in memory, estimate the probability
    that it will be useful in the present context.
  • Use priors based on the statistics of natural
    information sources.

24
Memory retrieval as Bayesian inference
(Anderson & Schooler, 1991)
(Figures: the same three effects in the environment:
log need odds vs. log days since last occurrence, in
New York Times data; c.f. email sources,
child-directed speech.)
25
Everyday prediction problems
(Griffiths & Tenenbaum, 2006)
  • You read about a movie that has made $60 million
    to date. How much money will it make in total?
  • You see that something has been baking in the
    oven for 34 minutes. How long until it's ready?
  • You meet someone who is 78 years old. How long
    will they live?
  • Your friend quotes to you from line 17 of his
    favorite poem. How long is the poem?
  • You see taxicab #107 pull up to the curb in front
    of the train station. How many cabs in this city?

26
Making predictions
  • You encounter a phenomenon that has existed for
    t_past units of time. How long will it continue
    into the future? (i.e., what's t_total?)
  • We could replace time with any other quantity
    that ranges from 0 to some unknown upper limit.

27
Bayesian inference
  • P(t_total | t_past) ∝ P(t_past | t_total) P(t_total)
      posterior probability ∝ likelihood × prior
28
Bayesian inference
  • P(t_total | t_past) ∝ P(t_past | t_total) P(t_total)
                        ∝ (1/t_total) × (1/t_total)
      posterior probability ∝ likelihood × prior
  • Likelihood: assume a random sample,
    0 < t_past < t_total
  • Prior: uninformative, P(t_total) ∝ 1/t_total
    (e.g., Jeffreys, Jaynes)
29
Bayesian inference
  • P(t_total | t_past) ∝ (1/t_total) × (1/t_total)
      posterior, under random sampling and the
      uninformative prior
(Figure: the posterior P(t_total | t_past) as a
function of t_total, for a fixed t_past.)
30
Bayesian inference
  • P(t_total | t_past) ∝ (1/t_total) × (1/t_total)
      posterior, under random sampling and the
      uninformative prior
Best guess for t_total: the t* such that
P(t_total > t* | t_past) = 0.5
31
Bayesian inference
  • P(t_total | t_past) ∝ (1/t_total) × (1/t_total)
      posterior, under random sampling and the
      uninformative prior
Yields Gott's Rule: P(t_total > t* | t_past) = 0.5
when t* = 2 t_past, i.e., the best guess for
t_total is 2 t_past.
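A compact derivation of this rule, using the random-sampling likelihood and uninformative prior above:

```latex
% Normalizing the posterior \propto 1/t_total^2 over t_total > t_past:
P(t_{\mathrm{total}} \mid t_{\mathrm{past}})
  = \frac{t_{\mathrm{past}}}{t_{\mathrm{total}}^{2}},
  \qquad t_{\mathrm{total}} > t_{\mathrm{past}} .
% The posterior median t^* solves P(t_total > t^* | t_past) = 1/2:
P(t_{\mathrm{total}} > t^{*} \mid t_{\mathrm{past}})
  = \int_{t^{*}}^{\infty} \frac{t_{\mathrm{past}}}{t^{2}}\, dt
  = \frac{t_{\mathrm{past}}}{t^{*}} = \frac{1}{2}
  \quad\Longrightarrow\quad t^{*} = 2\, t_{\mathrm{past}} .
```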
32
Evaluating Gott's Rule
  • You read about a movie that has made $78 million
    to date. How much money will it make in total?
  • $156 million seems reasonable.
  • You meet someone who is 35 years old. How long
    will they live?
  • 70 years seems reasonable.
  • Not so simple:
  • You meet someone who is 78 years old. How long
    will they live?
  • You meet someone who is 6 years old. How long
    will they live?

33
The effects of priors
  • Different kinds of priors P(t_total) are
    appropriate in different domains:
  • power-law priors, e.g., wealth, contacts
  • Gaussian priors, e.g., height, lifespan
  • Gott: P(t_total) ∝ t_total^(-1)
34
The effects of priors
35
Evaluating human predictions
  • Different domains with different priors:
  • A movie has made $60 million
  • Your friend quotes from line 17 of a poem
  • You meet a 78-year-old man
  • A movie has been running for 55 minutes
  • A U.S. congressman has served for 11 years
  • A cake has been in the oven for 34 minutes
  • Use 5 values of t_past for each.
  • People predict t_total (a numerical sketch
    follows below).

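The numerical sketch referenced above (illustrative code, not the authors'; the grid and the Gaussian parameters are assumptions):

```python
import numpy as np

# Posterior: P(t_total | t_past) ∝ P(t_past | t_total) P(t_total)
#                                ∝ (1 / t_total) * prior(t_total), t_total > t_past

def predict(t_past, prior, grid):
    """Return the posterior-median guess for t_total."""
    post = np.where(grid > t_past, prior(grid) / grid, 0.0)
    post /= post.sum()
    cdf = np.cumsum(post)
    return grid[np.searchsorted(cdf, 0.5)]

grid = np.linspace(1.0, 3000.0, 300000)

uninformative = lambda t: 1.0 / t                              # Gott: P(t) ∝ 1/t
lifespan_like = lambda t: np.exp(-0.5 * ((t - 75) / 16) ** 2)  # Gaussian prior

print(predict(18.0, uninformative, grid))  # ~36: Gott's rule, 2 * t_past
print(predict(18.0, lifespan_like, grid))  # near the prior mean of 75, not 36
```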
36
(No Transcript)
37
You learn that in ancient Egypt, there was a
great flood in the 11th year of a pharaoh's
reign. How long did he reign?
38
You learn that in ancient Egypt, there was a
great flood in the 11th year of a pharaoh's
reign. How long did he reign?
How long did the typical pharaoh reign in
ancient Egypt?
39
(Figure: is the prior exponential or power-law?)
If a friend is calling a telephone box office to
book tickets and tells you he has been on hold
for 3 minutes, how long do you think you will be
on hold in total?
40
Summary: prediction
  • Predictions about the extent or magnitude of
    everyday events follow Bayesian principles.
  • Contrast with Bayesian inference in perception,
    motor control, memory: no universal priors here.
  • Predictions depend rationally on priors that are
    appropriately calibrated for different domains:
  • form of the prior (e.g., power-law or
    exponential)
  • specific distribution given that form
    (parameters)
  • non-parametric distribution where necessary.
  • In the absence of concrete experience, priors may
    be generated by qualitative background knowledge.

41
Outline
  • Predicting everyday events
  • Causal learning and reasoning
  • Learning concepts from examples

42
Bayesian networks
Nodes: variables. Links: direct dependencies.
Each node has a conditional probability
distribution. Data: observations of X1, ..., X4.
  • Four random variables:
  • X1 = coughing
  • X2 = high body temperature
  • X3 = flu
  • X4 = lung cancer

43
Causal Bayesian networks
Nodes: variables. Links: causal mechanisms.
Each node has a conditional probability
distribution. Data: observations of, and
interventions on, X1, ..., X4.
  • Four random variables:
  • X1 = coughing
  • X2 = high body temperature
  • X3 = flu
  • X4 = lung cancer

(Pearl; Glymour & Cooper)
44
Inference in causal graphical models
  • "Explaining away" or "discounting" in
    social reasoning (Kelley; Morris & Larrick)
  • "Screening off" in intuitive causal reasoning
    (Waldmann; Rehder & Burnett; Blok & Sloman;
    Gopnik & Sobel)
  • Better in chains than common-cause structures;
    common-cause better if mechanisms are clearly
    independent
  • Understanding and predicting the effects of
    interventions (Sloman & Lagnado; Gopnik & Schulz)

(Diagrams: chain A → B → C; common cause A ← B → C.
Compare P(c|b) vs. P(c|b, a) and P(c|b, not a).)
45
Learning graphical models
  • Structure learning: what causes what?
  • Parameter learning: how do causes work?

46
Bayesian learning of causal structure
  • Data d: observations of X1, ..., X4.
  • Causal hypotheses h: candidate networks over
    X1, ..., X4.
1. What is the most likely network h given
observed data d?
2. How likely is there to be a link X4 → X2?
(Bayesian model averaging)
47
Bayesian Occam's Razor
(MacKay, 2003; Ghahramani tutorials)
For any model M, the probabilities of all possible
data sets sum to one: Σ_D P(D|M) = 1.
Law of conservation of belief: a model that can
predict many possible data sets must assign each
of them low probability.
48
Learning causation from contingencies
e.g., Does injecting this chemical cause mice to
express a certain gene?

                 C present (c+)   C absent (c-)
E present (e+)         a                c
E absent (e-)          d                b

Subjects judge the extent to which C causes E
(rating on a scale from 0 to 100).
49
Two models of causal judgment
  • ΔP (Jenkins & Ward, 1965)
  • Power PC, or "causal power" (Cheng, 1997)
    (definitions below)
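In standard notation (these are the published definitions from Jenkins & Ward and Cheng):

```latex
\Delta P = P(e^{+} \mid c^{+}) - P(e^{+} \mid c^{-}),
\qquad
\mathrm{power} = \frac{\Delta P}{1 - P(e^{+} \mid c^{-})} .
```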
50
Judging the probability that C → E
(Buehner & Cheng, 1997; 2003)
  • Independent effects of both ΔP and causal power.
  • At ΔP = 0, judgments decrease with base rate
    (the "frequency illusion").
51
Learning causal strength (parameter learning)
  • Assume this causal structure: B → E ← C, with B
    an always-present background cause.
  • ΔP and causal power are maximum likelihood
    estimates of the strength parameter w1, under
    different parameterizations for P(E|B,C):
  • linear → ΔP; noisy-OR → causal power
52
Learning causal structure
(Griffiths & Tenenbaum, 2005)
  • Hypotheses: h1 (a C → E link exists) vs. h0 (no
    C → E link)
  • Bayesian "causal support": the likelihood ratio
    (Bayes factor) gives the evidence in favor of h1,
    under the noisy-OR parameterization
    (assume uniform parameter priors, but see Yuille
    et al., Danks et al.); spelled out below.
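The quantities the model uses, following Griffiths & Tenenbaum (2005):

```latex
% Noisy-OR likelihood (background cause B always present; strengths w_0, w_1):
P(e^{+} \mid c ; w_{0}, w_{1}) = 1 - (1 - w_{0})(1 - w_{1})^{c},
\qquad c \in \{0, 1\}.
% Causal support: the log Bayes factor for h_1 (C -> E link) over h_0 (no link),
% integrating out the strengths under uniform priors:
\mathrm{support} = \log \frac{P(d \mid h_{1})}{P(d \mid h_{0})},
\qquad
P(d \mid h_{1}) = \int_{0}^{1}\!\!\int_{0}^{1}
  P(d \mid w_{0}, w_{1}, h_{1})\; dw_{0}\, dw_{1} .
```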
53
Buehner and Cheng (1997)
(Figure: people's judgments vs. model predictions.
ΔP: r = 0.89; power: r = 0.88; support: r = 0.97.)
54
Implicit background theory
  • Injections may or may not cause gene expression,
    but gene expression does not cause injections.
  • No hypotheses with E → C.
  • Other naturally occurring processes may also
    cause gene expression.
  • All hypotheses include an always-present
    background cause B → E.
  • Causes are generative: probabilistically
    sufficient and independent, i.e., each cause
    independently produces the effect in some
    proportion of cases.
  • Noisy-OR parameterization

55
Sensitivity analysis
(Figure: people's judgments vs. support under the
noisy-OR parameterization, χ², and support under a
generic parameterization.)
56
Generativity is essential
(Figure: conditions with ΔP = 0 at varying base
rates, P(e+|c+) = P(e+|c-) = 8/8, 6/8, 4/8, 2/8,
0/8, with support predictions rated 100 / 50 / 0.)
  • Predictions result from ceiling effect:
  • ceiling effects only matter if you believe a
    cause increases the probability of an effect
57
Different parameterizations for different kinds
of mechanisms
Does C cause E?
Is there a difference in E with C vs. not-C?
Does C prevent E?
58
Blicket detector (Sobel, Gopnik, and colleagues)
59
Backwards blocking (Sobel, Tenenbaum & Gopnik,
2004)
(Figure: AB trial, then A trial.)
  • Initially: nothing on detector; detector silent
    (A=0, B=0, E=0)
  • Trial 1: A and B on detector; detector active
    (A=1, B=1, E=1)
  • Trial 2: A on detector; detector active (A=1,
    B=0, E=1)
  • 4-year-olds judge if each object is a blicket:
  • A: a blicket (100% say yes)
  • B: probably not a blicket (34% say yes)
(Graph: A →? E ←? B)
(cf. explaining away in weight space, Dayan &
Kakade)
60
Possible hypotheses?
(Figure: all possible causal graphs over A, B,
and E.)
61
Bayesian causal learning
With a uniform prior on hypotheses and a generic
parameterization:
(Figure: probability of being a blicket for A and
B; values shown: 0.32, 0.32, 0.34, 0.34.)
62
A stronger hypothesis space
  • Links can only exist from blocks to detectors.
  • Blocks are blickets with prior probability q.
  • Blickets always activate detectors, detectors
    never activate on their own (i.e., deterministic
    OR parameterization, no hidden causes).

P(h00) = (1 - q)^2      P(h10) = q(1 - q)
P(h01) = (1 - q) q      P(h11) = q^2

(The four hypotheses: h00 = no links; h10 = A → E;
h01 = B → E; h11 = both links.)

                    h00   h01   h10   h11
P(E=1 | A=0, B=0)    0     0     0     0
P(E=1 | A=1, B=0)    0     0     1     1
P(E=1 | A=0, B=1)    0     1     0     1
P(E=1 | A=1, B=1)    0     1     1     1
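A small sketch of the resulting computation (illustrative code, not the authors'; q = 1/3 is an arbitrary choice):

```python
from itertools import product

# Bayesian backwards blocking over the four hypotheses in the table above,
# with deterministic-OR likelihoods.
q = 1 / 3
prior = {(a, b): (q if a else 1 - q) * (q if b else 1 - q)
         for a, b in product([0, 1], repeat=2)}   # (A->E link?, B->E link?)

def likelihood(h, A, B, E):
    a_link, b_link = h
    predicted = int((a_link and A) or (b_link and B))  # deterministic OR
    return 1.0 if predicted == E else 0.0

trials = [(1, 1, 1),   # AB trial: both objects on detector, it activates
          (1, 0, 1)]   # A trial: A alone activates it

post = dict(prior)
for A, B, E in trials:
    post = {h: p * likelihood(h, A, B, E) for h, p in post.items()}
z = sum(post.values())
post = {h: p / z for h, p in post.items()}

print(sum(p for (a, _), p in post.items() if a))  # P(A is a blicket) = 1.0
print(sum(p for (_, b), p in post.items() if b))  # P(B is a blicket) = q
```

Note the explaining-away pattern: after the A trial, B's probability of being a blicket falls back to its prior q, which is exactly what manipulating the prior (next slide) exploits.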
63
Manipulating prior probability (Tenenbaum, Sobel,
Griffiths & Gopnik)
(Figure: judgments after the initial trials, the
AB trial, and the A trial.)
64
Learning more complex structures
  • Tenenbaum et al.; Griffiths & Sobel: detectors
    with more than two objects and noisy mechanisms
  • Steyvers et al.; Sobel & Kushnir: active learning
    with interventions (c.f. Tong & Koller; Murphy)
  • Lagnado & Sloman: learning from interventions on
    continuous dynamical systems

65
Inferring hidden causes
(Figure: the stick-ball machine; participants'
inferences of a common unobserved cause,
independent unobserved causes, or one observed
cause.)
(Kushnir, Schulz, Gopnik & Danks, 2003)
66
Bayesian learning with unknown number of hidden
variables
(Griffiths et al., 2006)
67
(No Transcript)
68
Inferring latent causes in classical conditioning
(Courville, Daw, Gordon & Touretzky, 2003)
e.g., A = noise, X = tone, B = click, US = shock
Training: A + US; A X B + US. Test: X; X B.
69
Inferring latent causes in perceptual learning
(Orban, Fiser, Aslin & Lengyel, 2006)
Learning to recognize objects and segment scenes
70
Inferring latent causes in sensory integration
(Körding et al., NIPS 2006)
71
Coincidences (Griffiths & Tenenbaum, in press)
  • The birthday problem:
  • How many people do you need to have in the room
    before the probability exceeds 50% that two of
    them have the same birthday? (Answer: 23; see
    the check below.)
  • The bombing of London
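A quick check of that answer (straightforward arithmetic; the only assumption is 365 equally likely birthdays):

```python
# Smallest group size n for which P(some shared birthday) > 0.5.
p_all_distinct = 1.0
n = 0
while 1.0 - p_all_distinct <= 0.5:
    n += 1
    p_all_distinct *= (365 - (n - 1)) / 365
print(n, round(1.0 - p_all_distinct, 3))  # 23 0.507
```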
72
How much of a coincidence?
73
Bayesian coincidence factor
(Figure: birthdays marked on a calendar; "chance"
hypothesis vs. a latent common cause C clustering
them, e.g., in August.)
  • Alternative hypotheses:
  • proximity in date, matching days of the
    month, matching month, ...

74
How much of a coincidence?
75
Bayesian coincidence factor
(Figure: latent common cause, uniform + regularity,
vs. chance, uniform.)
76
Summary: causal inference and learning
  • Human causal induction can be explained using
    core principles of graphical models:
  • Bayesian inference (explaining away, screening
    off)
  • Bayesian structure learning (Occam's razor,
    model averaging)
  • Active learning with interventions
  • Identifying latent causes

77
Summary: causal inference and learning
  • Crucial constraints on hypothesis spaces come
    from abstract prior knowledge, or "intuitive
    theories":
  • What are the variables?
  • How can they be connected?
  • How are their effects parameterized?
  • Big open questions:
  • How can these theories be described formally?
  • How can these theories be learned?

78
Hierarchical Bayesian framework
Abstract Principles
Structure
Data
(Griffiths, Tenenbaum, Kemp et al.)
79
A theory for blickets (c.f. PRMs, BLOG, FOPL)
80
Learning with a uniform prior on network
structures
(Figure: true network over attributes 1-12; sample
75 observations, patients × observed data.)
81
Learning a block-structured prior on network
structures (Mansinghka et al., 2006)
(Figure: attributes 1-12 assigned to classes z,
with class-to-class link probabilities h:
  0.8   0.0   0.01
  0.0   0.0   0.75
  0.0   0.0   0.0
True network; sample 75 observations, patients ×
observed data.)
82
The blessing of abstraction
(Figure: recovery of the true graphical model G at
20, 80, and 1000 samples; learning edge(G) directly
from data D vs. learning it together with an
abstract theory Z of node classes, class(z).)
83
The nonparametric safety-net
(Figure: the true graphical model G is a ring over
nodes 1-12; at 40, 100, and 1000 samples, learning
edge(G) directly from data D vs. learning it
together with an abstract theory Z of node
classes, class(z).)
84
Outline
  • Predicting everyday events
  • Causal learning and reasoning
  • Learning concepts from examples

85
Learning from just one or a few examples, and
mostly unlabeled examples (semi-supervised
learning).
86
Simple model of concept learning
Can you show me the other blickets?
87
Simple model of concept learning
Other blickets.
88
Simple model of concept learning
Other blickets.
  • Learning from just one positive example is
    possible if:
  • Assume concepts refer to clusters in the world.
  • Observe enough unlabeled data to identify clear
    clusters.
  • (c.f. learning with mixture models and EM:
    Ghahramani & Jordan, 1994; Nigam et al., 2000)

89
Concept learning with mixture models in cognitive
science
  • Fried & Holyoak (1984)
  • Modeled unsupervised and semi-supervised
    categorization as EM in a Gaussian mixture.
  • Anderson (1990)
  • Modeled unsupervised and semi-supervised
    categorization as greedy sequential search in an
    infinite (Chinese restaurant process) mixture.

90
Infinite (CRP) mixture models
  • Construct from k-component mixtures by
    integrating out mixing weights, collapsing
    equivalent partitions, and taking the limit as
    k → ∞.
  • Does not require that we commit to a fixed or
    even finite number of classes.
  • Effective number of classes can grow with number
    of data points, balancing complexity with data
    fit.
  • Computationally much simpler than applying
    Bayesian Occam's razor or cross-validation.
  • Easy to learn with standard Monte Carlo
    approximations (MCMC, particle filtering),
    hopefully avoiding local minima. (A sketch of
    the CRP seating process follows below.)

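A minimal sketch of the CRP seating process these models build on (illustrative code; n and alpha are arbitrary choices):

```python
import random

def sample_crp_partition(n, alpha, seed=0):
    """Seat n customers: join table k with probability size_k / (i + alpha),
    or start a new table with probability alpha / (i + alpha)."""
    rng = random.Random(seed)
    table_sizes = []
    assignment = []
    for i in range(n):                      # i = number already seated
        r = rng.uniform(0.0, i + alpha)
        acc = 0.0
        k = len(table_sizes)                # default: open a new table
        for j, size in enumerate(table_sizes):
            acc += size
            if r < acc:
                k = j
                break
        if k == len(table_sizes):
            table_sizes.append(1)           # a new table = a new class
        else:
            table_sizes[k] += 1
        assignment.append(k)
    return assignment, table_sizes

print(sample_crp_partition(20, alpha=1.0))
# The number of occupied tables grows roughly as alpha * log(n),
# so model complexity grows with the data.
```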
91
High school lunch room analogy
92
Sampling from the CRP
(Figure: tables as clusters, e.g., punks, preppies,
jocks, nerds; each new customer joins a table with
probability proportional to its size, or starts a
new one.)
93
(No Transcript)
94
(Figure: Gibbs sampling for the CRP mixture (Neal):
each item is reseated with probability proportional
to group size and to its similarity to the group's
objects, e.g., punks, preppies, jocks, nerds.)
95
A typical cognitive experiment

                   F1  F2  F3  F4  Label
Training stimuli:   1   1   1   1    1
                    1   0   1   0    1
                    0   1   0   1    1
                    0   0   0   0    0
                    0   1   0   0    0
                    1   0   1   1    0

Test stimuli:       0   1   1   1    ?
                    1   1   0   1    ?
                    1   1   1   0    ?
                    1   0   0   0    ?
                    0   0   1   0    ?
                    0   0   0   1    ?
96
Anderson (1990), "Rational model of
categorization": greedy sequential search in an
infinite mixture model.
Sanborn, Griffiths & Navarro (2006), "More
rational model of categorization": particle
filter with a small number of particles.
97
Towards more natural concepts
98
CrossCat: discovering multiple structures that
capture different subsets of features (Shafto,
Kemp, Mansinghka, Gordon & Tenenbaum, 2006)
99
Infinite relational models (Kemp, Tenenbaum,
Griffiths, Yamada & Ueda, AAAI 06)
(c.f. Xu, Tresp, et al., SRL 06)
(Figure: concept × predicate × concept array.)
  • Biomedical predicate data from UMLS (McCrae et
    al.):
  • 134 concepts: enzyme, hormone, organ, disease,
    cell function, ...
  • 49 predicates: affects(hormone, organ),
    complicates(enzyme, cell function), treats(drug,
    disease), diagnoses(procedure, disease), ...

100
Infinite relational models (Kemp, Tenenbaum,
Griffiths, Yamada & Ueda, AAAI 06)
e.g., Diseases affect Organisms;
Chemicals interact with Chemicals;
Chemicals cause Diseases
101
Learning from very few examples
tufa
tufa
  • Word learning

tufa
  • Property induction

Cows have T9 hormones. Seals have T9
hormones. Squirrels have T9 hormones. All
mammals have T9 hormones.
Cows have T9 hormones. Sheep have T9
hormones. Goats have T9 hormones. All mammals
have T9 hormones.
102
The computational problem (c.f. semi-supervised
learning)
(Figure: feature matrix over Horse, Cow, Chimp,
Gorilla, Mouse, Squirrel, Dolphin, Seal, Rhino,
Elephant, with a new property marked ? for each
species. Features: 85 features from Osherson et
al., e.g., for Elephant: gray, hairless, toughskin,
big, bulbous, longleg, tail, chewteeth, tusks,
smelly, walks, slow, strong, muscle,
quadrapedal, ...)
103

(Figure: hypotheses h, candidate extensions of the
new property to species X, Y, ..., each with prior
P(h), over Horse, Cow, Chimp, Gorilla, Mouse,
Squirrel, Dolphin, Seal, Rhino, Elephant.)
104

(Figure: the same hypotheses h with prior P(h);
the prediction P(Y | X) sums P(h | X) over the
hypotheses h that include Y.)
105
Many sources of priors
106
Hierarchical Bayesian Framework (Kemp & Tenenbaum)
F: form (tree)
S: structure (a tree over mouse, squirrel, chimp,
gorilla)
D: data (features F1-F4, plus "has T9 hormones"
with ? entries to predict)

107
P(D|S): how the structure constrains the data of
experience
  • Define a stochastic process over structure S that
    generates hypotheses h.
  • For generic properties, the prior should favor
    hypotheses that vary smoothly over structure.
  • Many properties of biological species were
    actually generated by such a process (i.e.,
    mutation + selection).
Smooth: P(h) high. Not smooth: P(h) low.
108
P(D|S): how the structure constrains the data of
experience
(Figure: y drawn from a Gaussian process over S
(≈ random walk, diffusion; Zhu, Ghahramani &
Lafferty, 2003), then thresholded to give h.)
109
A graph-based prior
  • Let d_ij be the length of the edge between i and
    j (d_ij = ∞ if i and j are not connected).
  • A Gaussian prior y ~ N(0, Σ), with covariance Σ
    defined from the graph (Zhu, Lafferty &
    Ghahramani, 2003); a sketch of the standard
    construction follows below.
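A sketch of that construction, assuming (as in Zhu et al.) edge weights w_ij = 1/d_ij and the weighted graph Laplacian:

```latex
% \Delta = \mathrm{diag}\big(\textstyle\sum_{j} w_{ij}\big) - W,
% with w_{ij} = 1/d_{ij}:
y \sim \mathcal{N}(0, \Sigma), \qquad
\Sigma^{-1} = \Delta + \frac{1}{\sigma^{2}} I, \qquad
y^{\top} \Delta\, y = \sum_{(i,j) \in E} \frac{(y_{i} - y_{j})^{2}}{d_{ij}} ,
% so the prior penalizes properties that change across short (strong) edges.
```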
110
(Figure: structure S over Species 1-10 and data D,
the feature matrix. Features: 85 features from
Osherson et al., e.g., for Elephant: gray,
hairless, toughskin, big, bulbous, longleg, tail,
chewteeth, tusks, smelly, walks, slow, strong,
muscle, quadrapedal, ...)
111
(No Transcript)
112
(No Transcript)
113
(Figure: structure S over Species 1-10 and data D,
the 85 Osherson et al. features plus a new
property, marked ? for each species.)
114
Cows have property P. Elephants have property P.
Horses have property P.
Gorillas have property P. Mice have property P.
Seals have property P. All mammals have property P.
(Figure: model predictions under tree vs. 2D
structures.)
115
Reasoning about spatially varying properties
  • Native American artifacts task

116
Property type: "has T9 hormones" | "can bite
through wire" | "carry E. Spirus bacteria"
Theory / structure: taxonomic tree | directed chain
| directed network
Process: diffusion | drift | noisy transmission
(Figure: hypotheses, candidate extensions over
Classes A-G, under each theory.)
117
(Figure: herring, tuna, mako shark, sand shark,
dolphin, human, and kelp, organized in two
different structures, e.g., a taxonomy and a food
web.)
118
Hierarchical Bayesian Framework
F: form (tree, space, chain)
S: structure (e.g., a tree over mouse, squirrel,
chimp, gorilla)
D: data (features F1-F4)
119
Discovering structural forms
(Figure: ostrich, robin, crocodile, snake, bat,
orangutan, and turtle, arranged under different
candidate structural forms.)
120
Discovering structural forms
(Figure: the same animals arranged as the "great
chain of being", from rock and plant up through
angel and God, and as Linnaeus's taxonomy.)
121
People can discover structural forms
  • Scientists
  • Tree structure for living kinds (Linnaeus)
  • Periodic structure for chemical elements
    (Mendeleev)
  • Children
  • Hierarchical structure of category labels
  • Clique structure of social groups
  • Cyclical structure of seasons or days of the week
  • Transitive structure for value

122
The value of structural form knowledge: inductive
bias
123
Typical structure learning algorithms assume a
fixed structural form
  • Flat clusters: K-means, mixture models,
    competitive learning
  • Line: Guttman scaling, ideal point models
  • Circle: circumplex models
  • Grid: self-organizing map, generative
    topographic mapping
  • Tree: hierarchical clustering, Bayesian
    phylogenetics
  • Euclidean space: MDS, PCA, factor analysis
124
Goal: a universal framework for unsupervised
learning
(Figure: a universal learner mapping data to a
representation, subsuming K-means, hierarchical
clustering, factor analysis, Guttman scaling,
circumplex models, self-organizing maps, ...)
125
Hierarchical Bayesian Framework
F: form
S: structure
D: data (features F1-F4 over mouse, squirrel,
chimp, gorilla)
126
Structural forms as graph grammars
(Figure: each structural form paired with the
generative process that grows it.)
127
Node-replacement graph grammars
Production (Line)
Derivation
128
Node-replacement graph grammars
Production (Line)
Derivation
129
Node-replacement graph grammars
Production (Line)
Derivation
130
Model fitting
  • Evaluate each form in parallel
  • For each form, heuristic search over structures
    based on greedy growth from a one-node seed

131
(No Transcript)
132
Development of structural forms as more data are
observed
133
Beyond Nativism versus Empiricism
  • Nativism: explicit knowledge of structural
    forms for core domains is innate.
  • Atran (1998): "The tendency to group living kinds
    into hierarchies reflects an innately determined
    cognitive structure."
  • Chomsky (1980): "The belief that various systems
    of mind are organized along quite different
    principles leads to the natural conclusion that
    these systems are intrinsically determined, not
    simply the result of common mechanisms of
    learning or growth."
  • Empiricism: general-purpose learning systems
    without explicit knowledge of structural form.
  • Connectionist networks (e.g., Rogers and
    McClelland, 2004).
  • Traditional structure learning in probabilistic
    graphical models.
134
Summary: concept learning
  • Models based on Bayesian inference over
    hierarchies of structured representations.
  • How does abstract domain knowledge guide learning
    of new concepts?
  • How can this knowledge be represented, and how
    might it be learned?
(Figure: F: form; S: structure; D: data, features
F1-F4 over mouse, squirrel, chimp, gorilla.)
  • How can probabilistic inference work together
    with flexibly structured representations to model
    complex, real-world learning and reasoning?

135
Contributions of Bayesian models
  • Principled quantitative models of human behavior,
    with broad coverage and a minimum of free
    parameters and ad hoc assumptions.
  • Explain how and why human learning and reasoning
    works, in terms of (approximations to) optimal
    statistical inference in natural environments.
  • A framework for studying people's implicit
    knowledge about the structure of the world: how
    it is structured, used, and acquired.
  • A two-way bridge to state-of-the-art AI and
    machine learning.

136
Looking forward
  • What we need to understand: the mind's ability to
    build rich models of the world from sparse data.
  • Learning about objects, categories, and their
    properties.
  • Causal inference
  • Language comprehension and production
  • Scene understanding
  • Understanding other people's actions, plans,
    thoughts, goals
  • What do we need to understand these abilities?
  • Bayesian inference in probabilistic generative
    models
  • Hierarchical models, with inference at all levels
    of abstraction
  • Structured representations: graphs, grammars,
    logic
  • Flexible representations, growing in response to
    observed data

137
Learning word meanings
Abstract principles: whole-object principle, shape
bias, taxonomic principle, contrast principle,
basic-level bias
Structure
Data
(Tenenbaum & Xu)
138
Causal learning and reasoning
Abstract Principles
Structure
Data
(Griffiths, Tenenbaum, Kemp et al.)
139
Universal Grammar
Hierarchical phrase structure grammars (e.g.,
CFG, HPSG, TAG)
Grammar
Phrase structure
Utterance
Speech signal
140
Vision as probabilistic parsing
(Han & Zhu, 2006; c.f. Zhu, Yuanhao & Yuille,
NIPS 06)
141
(No Transcript)
142
Goal-directed action (production and
comprehension)
(Wolpert et al., 2003)
143
Bayesian models of action understanding
(Baker, Tenenbaum & Saxe; Verma & Rao)
144
Open directions and challenges
  • Effective methods for learning structured
    knowledge
  • How to balance expressiveness/learnability
    tradeoff?
  • More precise relation to psychological processes
  • To what extent do mental processes implement
    boundedly rational methods of approximate
    inference?
  • Relation to neural computation
  • How to implement structured representations in
    brains?
  • Modeling individual subjects and single trials
  • Is there a rational basis for probability
    matching?
  • Understanding failure cases
  • Are these simply not Bayesian, or are people
    using a different model? How do we avoid
    circularity?

145
Want to learn more?
  • Special issue of Trends in Cognitive Sciences
    (TiCS), July 2006 (Vol. 10, no. 7), on
    "Probabilistic models of cognition".
  • Tom Griffiths' reading list, a/k/a
    http://bayesiancognition.com
  • Summer school on probabilistic models of
    cognition, July 2007, Institute for Pure and
    Applied Mathematics (IPAM) at UCLA.

146
(No Transcript)
147
Extra slides
148
Bayesian prediction
  • P(t_total | t_past) ∝ (1/t_total) × P(t_total)
      posterior ∝ random-sampling likelihood ×
      domain-dependent prior
What is the best guess for t_total? Compute t*
such that P(t_total > t* | t_past) = 0.5.
We compared the median of the Bayesian posterior
with the median of subjects' judgments... but what
about the distribution of subjects' judgments?
149
Sources of individual differences
  • Individuals' judgments could be noisy.
  • Individuals' judgments could be optimal, but with
    different priors.
  • e.g., each individual has seen only a sparse
    sample of the relevant population of events.
  • Individuals' inferences about the posterior could
    be optimal, but their judgments could be based on
    probability (or utility) matching rather than
    maximizing.

150
Individual differences in prediction
(Figure: proportion of judgments below predicted
value vs. quantile of the Bayesian posterior
distribution.)
151
Individual differences in prediction
(Figure: the same quantile-quantile plot, averaged
over all prediction tasks: movie run times, movie
grosses, poem lengths, life spans, terms in
congress, cake baking times.)
152
Individual differences in concept learning
153
Why probability matching?
  • Optimal behavior under some (evolutionarily
    natural) circumstances:
  • Optimal betting theory, portfolio theory
  • Optimal foraging theory
  • Competitive games
  • Dynamic tasks (changing probabilities or
    utilities)
  • Side-effect of algorithms for approximating
    complex Bayesian computations:
  • Markov chain Monte Carlo (MCMC): instead of
    integrating over complex hypothesis spaces,
    construct a sample of high-probability
    hypotheses.
  • Judgments from individual (independent) samples
    can on average be almost as good as using the
    full posterior distribution.

154
Markov chain Monte Carlo
(Metropolis-Hastings algorithm)
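For concreteness, a generic random-walk Metropolis-Hastings sampler (an illustrative sketch, not the talk's code; the Gaussian proposal and step size are assumptions):

```python
import math, random

def metropolis_hastings(log_p, x0, n_steps, step=0.5, seed=0):
    """Sample from a density known up to normalization via log_p."""
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_steps):
        x_prop = x + rng.gauss(0.0, step)       # symmetric proposal
        log_ratio = log_p(x_prop) - log_p(x)    # proposal terms cancel
        if math.log(rng.random()) < log_ratio:  # accept w.p. min(1, ratio)
            x = x_prop
        samples.append(x)
    return samples

# Example: sample from an (unnormalized) standard normal.
samples = metropolis_hastings(lambda x: -0.5 * x * x, x0=0.0, n_steps=10000)
print(sum(samples) / len(samples))  # close to 0, the target mean
```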
155
The puzzle of coincidences
  • Discoveries of hidden causal structure are often
    driven by noticing coincidences. . .
  • Science:
  • Halley's comet (1705)

156
(Halley, 1705)
157
(Halley, 1705)
158
The puzzle of coincidences
  • Discoveries of hidden causal structure are often
    driven by noticing coincidences. . .
  • Science:
  • Halley's comet (1705)
  • John Snow and the cause of cholera (1854)

159
(No Transcript)
160
Rational analysis of cognition
  • Often one can show that apparently irrational
    behavior is actually rational.
(Wason selection task.) Which cards do you have to
turn over to test this rule: "If there is an A on
one side, then there is a 2 on the other side"?
161
Rational analysis of cognition
  • Often one can show that apparently irrational
    behavior is actually rational.
  • Oaksford & Chater's rational analysis:
  • Optimal data selection based on maximizing
    expected information gain.
  • Test the rule "If p, then q" against the null
    hypothesis that p and q are independent.
  • Assuming p and q are rare predicts people's
    choices.
162
Integrating multiple forms of reasoning (Kemp,
Shafto, Berke & Tenenbaum, NIPS 06)
  • 1) Taxonomic relations between categories
  • 2) Causal relations between features; parameters
    of causal relations vary smoothly over the
    category hierarchy.
T9 hormones cause elevated heart rates. Elevated
heart rates cause faster metabolisms. Mice have
T9 hormones. → ?
163
Integrating multiple forms of reasoning
164
Infinite relational models (Kemp, Tenenbaum,
Griffiths, Yamada & Ueda, AAAI 06)
(c.f. Xu, Tresp, et al., SRL 06)
(Figure: concept × predicate × concept array.)
  • Biomedical predicate data from UMLS (McCrae et
    al.):
  • 134 concepts: enzyme, hormone, organ, disease,
    cell function, ...
  • 49 predicates: affects(hormone, organ),
    complicates(enzyme, cell function), treats(drug,
    disease), diagnoses(procedure, disease), ...

165
Learning relational theories
e.g., Diseases affect Organisms;
Chemicals interact with Chemicals;
Chemicals cause Diseases
166
Learning annotated hierarchies from relational
data (Roy, Kemp, Mansinghka & Tenenbaum, NIPS 06)
167
Learning abstract relational structures
  • Dominance hierarchy: primate troop ("x beats y")
  • Tree: Bush administration ("x told y")
  • Cliques: prison inmates ("x likes y")
  • Ring: Kula islands ("x trades with y")
168
Bayesian inference in neural networks
(Rao, in press)
169
The big problem of intelligence
  • The development of intuitive theories in
    childhood.
  • Psychology: how do we learn to understand others'
    actions in terms of beliefs, desires, plans,
    intentions, values, morals?
  • Biology: how do we learn that people, dogs, bees,
    worms, trees, flowers, grass, coral, moss are
    alive, but chairs, cars, tricycles, computers,
    the sun, Roomba, robots, clocks, rocks are not?

170
The big problem of intelligence
  • Common sense reasoning.
  • Consider a man named Boris.
  • Is the mother of Boris's father his grandmother?
  • Is the mother of Boris's sister his mother?
  • Is the son of Boris's sister his son?
(Note: Boris and his family were stranded on a
desert island when he was a young boy.)