Title: Part III: Hierarchical Bayesian Models

Slide 1: Part III: Hierarchical Bayesian Models (title slide)
Slide 2: Universal Grammar
[Diagram: levels of linguistic knowledge, from most to least abstract:
Universal Grammar: hierarchical phrase structure grammars (e.g., CFG, HPSG, TAG)
→ Grammar
→ Phrase structure
→ Utterance
→ Speech signal]
Slide 3: Vision
[Figure: an analogous hierarchy for vision (Han and Zhu, 2006)]
Slide 4: Word learning
[Diagram: an analogous hierarchy for word learning:
Principles: whole-object principle, shape bias, taxonomic principle, contrast principle, basic-level bias
→ Structure
→ Data]
Slides 5–6: Hierarchical Bayesian models
- Can represent and reason about knowledge at multiple levels of abstraction.
- Have been used by statisticians for many years.
- Have been applied to many cognitive problems:
  - causal reasoning (Mansinghka et al., 2006)
  - language (Chater and Manning, 2006)
  - vision (Fei-Fei, Fergus, and Perona, 2003)
  - word learning (Kemp, Perfors, and Tenenbaum, 2006)
  - decision making (Lee, 2006)
Slide 7: Outline
- A high-level view of HBMs
- A case study: semantic knowledge
Slide 8: Universal Grammar
[Diagram: the same hierarchy, annotated with the conditional distribution linking each pair of levels:
Universal Grammar: hierarchical phrase structure grammars (e.g., CFG, HPSG, TAG)
→ P(grammar | UG) → Grammar
→ P(phrase structure | grammar) → Phrase structure
→ P(utterance | phrase structure) → Utterance
→ P(speech | utterance) → Speech signal]
Slides 9–10: Hierarchical Bayesian model
[Diagram: U (Universal Grammar) → G (Grammar) via P(G | U); G → phrase structures s1, …, s6 via P(s | G); each si → utterance ui via P(u | s)]

A hierarchical Bayesian model specifies a joint distribution over all variables in the hierarchy:

P({ui}, {si}, G | U) = Π_i [ P(ui | si) P(si | G) ] · P(G | U)
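Read as a sampling program, this factorization is straightforward to make concrete. Below is a minimal sketch; every name and number (the grammars G1/G2, the parses sA/sB, and so on) is an invented toy stand-in, not an actual grammar model:

```python
import random

rng = random.Random(0)

# Toy stand-ins for the three conditional distributions in the hierarchy.
P_G_given_U = {"G1": 0.7, "G2": 0.3}              # P(G | U)
P_s_given_G = {"G1": {"sA": 0.9, "sB": 0.1},      # P(s | G)
               "G2": {"sA": 0.2, "sB": 0.8}}
P_u_given_s = {"sA": {"ua": 0.8, "ub": 0.2},      # P(u | s)
               "sB": {"ua": 0.3, "ub": 0.7}}

def draw(dist):
    """Sample a key of dist with probability proportional to its value."""
    return rng.choices(list(dist), weights=list(dist.values()))[0]

G = draw(P_G_given_U)                 # one grammar for the whole corpus
corpus = []
for _ in range(6):                    # six (si, ui) pairs, as in the figure
    s = draw(P_s_given_G[G])          # phrase structure from P(s | G)
    u = draw(P_u_given_s[s])          # utterance from P(u | s)
    corpus.append((s, u))
print(G, corpus)
```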
Slide 11: Knowledge at multiple levels
- Top-down inferences: how does abstract knowledge guide inferences at lower levels?
- Bottom-up inferences: how can abstract knowledge be acquired?
- Simultaneous learning at multiple levels of abstraction
Slides 12–13: Top-down inferences
[Diagram: the U → G → si → ui hierarchy]

Given grammar G and a collection of utterances, construct a phrase structure for each utterance:

Infer si given ui and G: P(si | ui, G) ∝ P(ui | si) P(si | G)
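Because the hierarchy factorizes, this posterior can be computed by plain enumeration when the space of parses is small. A minimal sketch with invented toy numbers (not a real parser):

```python
# P(s | u, G) ∝ P(u | s) P(s | G): score every candidate parse, normalize.
P_s_given_G = {"sA": 0.9, "sB": 0.1}           # P(s | G) for a fixed G
P_u_given_s = {"sA": {"ua": 0.8, "ub": 0.2},   # P(u | s)
               "sB": {"ua": 0.3, "ub": 0.7}}

u = "ub"                                       # the observed utterance
scores = {s: P_u_given_s[s][u] * P_s_given_G[s] for s in P_s_given_G}
Z = sum(scores.values())
print({s: w / Z for s, w in scores.items()})   # sA ≈ 0.72, sB ≈ 0.28
```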
Slides 14–15: Bottom-up inferences
[Diagram: the U → G → si → ui hierarchy]

Given a collection of phrase structures, learn a grammar G:

Infer G given {si} and U: P(G | {si}, U) ∝ Π_i P(si | G) · P(G | U)
Slides 16–18: Simultaneous learning at multiple levels
[Diagram: the U → G → si → ui hierarchy]

Given a set of utterances {ui} and innate knowledge U, construct a grammar G and a phrase structure for each utterance.

- A chicken-or-egg problem:
  - Given a grammar, phrase structures can be constructed.
  - Given a set of phrase structures, a grammar can be learned.

Infer G and {si} given {ui} and U: P(G, {si} | {ui}, U) ∝ Π_i [ P(ui | si) P(si | G) ] · P(G | U)
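When the spaces of grammars and parses are small enough to enumerate, the chicken-or-egg problem dissolves: score every (G, s1, …, sn) combination jointly. A minimal sketch in the same toy setup as above (all names and numbers invented):

```python
from itertools import product

P_G_given_U = {"G1": 0.7, "G2": 0.3}
P_s_given_G = {"G1": {"sA": 0.9, "sB": 0.1},
               "G2": {"sA": 0.2, "sB": 0.8}}
P_u_given_s = {"sA": {"ua": 0.8, "ub": 0.2},
               "sB": {"ua": 0.3, "ub": 0.7}}

utterances = ["ub", "ub", "ua"]          # the observed data

best, best_score = None, 0.0
for G in P_G_given_U:                    # joint MAP over grammar and parses
    for ss in product(P_s_given_G[G], repeat=len(utterances)):
        score = P_G_given_U[G]           # P(G | U)
        for s, u in zip(ss, utterances):
            score *= P_s_given_G[G][s] * P_u_given_s[s][u]
        if score > best_score:
            best, best_score = (G, ss), score
print(best)
```

Because the si are conditionally independent given G, the inner enumeration could also pick each parse separately for each grammar; brute force is shown only to make the joint posterior explicit.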
Slide 19: Hierarchical Bayesian model (recap)
[Diagram repeated from Slides 9–10: U → G via P(G | U) → si via P(s | G) → ui via P(u | s)]
Slide 20: Knowledge at multiple levels (recap)
- Top-down inferences: how does abstract knowledge guide inferences at lower levels?
- Bottom-up inferences: how can abstract knowledge be acquired?
- Simultaneous learning at multiple levels of abstraction
Slide 21: Outline
- A high-level view of HBMs
- A case study: semantic knowledge
Slide 22: Folk Biology
The relationships between living kinds are well described by tree-structured representations.
[Diagram: R (principles) → S (structure: a tree over mouse, squirrel, chimp, gorilla) → D (data, e.g., "Gorillas have hands")]

Slide 23: Folk Biology
[Diagram: as above, with the principles level filled in: structural form = tree]
Slide 24: Outline
- A high-level view of HBMs
- A case study: semantic knowledge
  - Property induction
  - Learning structured representations
  - Learning the abstract organizing principles of a domain
Slides 25–26: Property induction
[Diagram: R (principles: structural form = tree, stochastic process = diffusion) → S (structure: a tree over mouse, squirrel, chimp, gorilla) → D (data)]

Approach: work with the distribution P(D | S, R).
Slide 27: Property Induction
Previous approaches: Rips (1975); Osherson et al. (1990); Sloman (1993); Heit (1998).
Slides 28–30: Bayesian Property Induction
[Figures: a hypothesis space of candidate extensions of the property; the premise categories D and the conclusion category C]
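In this framework the judgment itself is a ratio of sums over hypotheses: the probability that conclusion category C has the property, given that the categories in D do, adds up the prior mass of every hypothesis consistent with both, relative to every hypothesis consistent with D. A minimal sketch with an invented four-species hypothesis space and prior:

```python
# Each hypothesis h is the set of species that have the property. This prior
# is a hand-written stand-in for the structured priors discussed on the
# following slides.
prior = {
    frozenset({"chimp"}): 0.2,
    frozenset({"chimp", "gorilla"}): 0.3,
    frozenset({"mouse", "squirrel"}): 0.3,
    frozenset({"mouse", "squirrel", "chimp", "gorilla"}): 0.2,
}

def p_conclusion(D, C):
    """P(C has the property | the species in D have it)."""
    consistent = {h: p for h, p in prior.items() if D <= h}
    return sum(p for h, p in consistent.items() if C in h) / sum(consistent.values())

print(p_conclusion({"chimp"}, "gorilla"))   # ~0.71: nearby species
print(p_conclusion({"chimp"}, "mouse"))     # ~0.29: distant species
```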
Slide 31: Choosing a prior
Slide 32: Bayesian Property Induction
- A challenge: we have to specify the prior, which typically includes many numbers.
- An opportunity: the prior can capture knowledge about the problem.
Slide 33: Property Induction (recap)
[Diagram repeated from Slides 25–26]
Slide 34: Biological properties
- Structure: living kinds are organized into a tree.
- Stochastic process: nearby species in the tree tend to share properties.
Slides 35–36: Structure (figure-only slides)
Slide 37: Stochastic Process
- Nearby species in the tree tend to share properties.
- In other words, properties tend to be smooth over the tree.
[Figure: a smooth property vs. a non-smooth property over the tree]
Slide 38: Stochastic process
[Figure: hypotheses generated by the stochastic process]
Slides 39–41: Generating a property (the diffusion process)
Sample a continuous value yi for each node, where y tends to be smooth over the tree, then threshold to obtain a binary property h:

y ~ N(0, K), where the covariance K encourages y to be smooth over the graph S
hi = θ(yi), where θ(yi) is 1 if yi ≥ 0 and 0 otherwise
Slide 42: p(y | S, R): Generating a property
Let yi be the feature value at node i. Schematically,

p(y | S, R) ∝ exp( −(1/(2σ²)) Σ_{(i,j) ∈ S} (yi − yj)² )

so feature values at adjacent nodes i and j are encouraged to be similar (Zhu, Lafferty, and Ghahramani, 2003).
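One concrete way to realize this, following Zhu, Lafferty, and Ghahramani (2003), is a zero-mean Gaussian whose inverse covariance is built from the graph Laplacian of S. The sketch below uses that construction; the parameter sigma and the exact regularization term are assumptions, not values from the slides:

```python
import numpy as np

def smooth_covariance(adj, sigma=5.0):
    """K = (L + I / sigma^2)^(-1), with L = D - W the graph Laplacian of S.
    Under y ~ N(0, K), values at adjacent nodes tend to agree, so y is
    smooth over the graph."""
    lap = np.diag(adj.sum(axis=1)) - adj
    return np.linalg.inv(lap + np.eye(len(adj)) / sigma**2)

rng = np.random.default_rng(0)
# A toy chain: mouse - squirrel - chimp - gorilla.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
y = rng.multivariate_normal(np.zeros(4), smooth_covariance(adj))
h = (y >= 0).astype(int)     # threshold y to get a binary property
print(np.round(y, 2), h)
```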
Slide 43: Biological properties (recap)
[Diagram repeated from Slides 25–26: structural form = tree, stochastic process = diffusion]
Approach: work with the distribution P(D | S, R).
Slide 44: [Figure: premise categories D and conclusion category C]
Slide 45: Results
[Figure: model predictions vs. human judgments (Osherson et al.)]
Slide 46: Results
[Figure: model predictions vs. human judgments for arguments such as:]
Cows have property P. Elephants have property P. Horses have property P.
Therefore, all mammals have property P.
Slide 47: Spatial model
[Diagram: R (principles: structural form = 2D space, stochastic process = diffusion) → S (structure: squirrel, mouse, gorilla, chimp as points in 2D) → D (data)]
Slides 48–49: Structure (figure-only slides)
Slide 50: Tree vs. 2D
[Figure: tree + diffusion vs. 2D + diffusion predictions, for conclusion categories such as "horse" and "all mammals"]
Slide 51: Biological Properties (recap)
[Diagram repeated from Slides 25–26]
Slide 52: Three inductive contexts
Different kinds of property call for different principles R (a structural form plus a stochastic process):
- "has T4 cells": tree + diffusion process
- "can bite through wire": chain + drift process
- "carries E. Spirus bacteria": network + causal transmission
[Figures: an example structure over classes A–G for each context]
Slide 53: Threshold properties
- can bite through wire
- has skin that is more resistant to penetration than most synthetic fibers
[Figure: animals arranged along a single dimension: Doberman, Poodle, Collie, Hippo, Elephant, Cat, Lion, Camel]
(Osherson et al.; Blok et al.)
Slide 54: Threshold properties
- Structure: the categories can be organized along a single dimension.
- Stochastic process: categories towards one end of the dimension are more likely to have the novel property.
Slide 55: Results
"has skin that is more resistant to penetration than most synthetic fibers"
[Figure: human judgments vs. the 1D drift and 1D diffusion models]
(Blok et al.; Smith et al.)
Slide 56: Three inductive contexts (recap of Slide 52)
Slide 57: Causally transmitted properties
[Figure: a food web including grizzly bear and salmon]
(Medin et al.; Shafto and Coley)
Slide 58: Causally transmitted properties
- Structure: the categories can be organized into a directed network.
- Stochastic process: properties are generated by a noisy transmission process.
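A noisy-transmission process like this is easy to simulate. Below is a minimal percolation-style sketch; the background rate, the per-link transmission probability, and the prey-to-predator direction are illustrative assumptions, not parameters from Shafto et al.:

```python
import random

def sample_disease(prey, base=0.1, t=0.5, seed=0):
    """prey[s] lists the species s eats. A disease arises spontaneously at a
    background rate, then spreads from prey to predator along each food-web
    link independently with probability t."""
    rng = random.Random(seed)
    infected = {s: rng.random() < base for s in prey}
    carries = {(s, p): rng.random() < t for s in prey for p in prey[s]}
    changed = True
    while changed:               # propagate until no new infections appear
        changed = False
        for s in prey:
            if not infected[s] and any(
                    infected[p] and carries[(s, p)] for p in prey[s]):
                infected[s] = changed = True
    return infected

web = {"salmon": [], "grizzly bear": ["salmon"]}   # toy two-species web
print(sample_disease(web, seed=3))
```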
Slide 59: Experiment: disease properties
[Figures: two food webs, Island and Mammals (Shafto et al.)]
Slide 60: Results: disease properties
[Figure: web + transmission model vs. human judgments for the Island and Mammals food webs]
Slide 61: Three inductive contexts (recap of Slide 52)
Slide 62: Property Induction (recap)
[Diagram repeated from Slides 25–26]
Approach: work with the distribution P(D | S, R).
Slide 63: Conclusions: property induction
- Hierarchical Bayesian models help to explain how abstract knowledge can be used for induction.
Slide 64: Outline
- A high-level view of HBMs
- A case study: semantic knowledge
  - Property induction
  - Learning structured representations
  - Learning the abstract organizing principles of a domain
Slides 65–68: Structure learning
[Diagram: R (principles: structural form = tree, stochastic process = diffusion) → S (structure: unknown, marked "?") → D (data)]

Goal: find S that maximizes P(S | D, R) ∝ P(D | S, R) P(S | R)

Here P(D | S, R) is the distribution previously used for property induction.
Slides 69–70: Generating features over the tree
[Figures: features sampled over the tree for mouse, squirrel, chimp, gorilla]
Slide 71: Structure learning (recap)
[Diagram and goal repeated from Slides 65–68]
Slide 72: P(S | R): Generating structures
[Figure: structures inconsistent vs. consistent with R]
Slide 73: P(S | R): Generating structures
[Figure: a simple structure vs. a complex structure]
Slide 74: P(S | R): Generating structures
- Each structure is weighted by the number of nodes it contains:

P(S | R) ∝ 0 if S is inconsistent with R, and θ^|S| otherwise,

where |S| is the number of nodes in S (θ < 1, so structures with fewer nodes are preferred).
Slides 75–76: Structure Learning
- P(S | D, R) will be high when:
  - The features in D vary smoothly over S.
  - S is a simple graph (a graph with few nodes).

Aim: find S that maximizes P(S | D, R) ∝ P(D | S) P(S | R)
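Both pressures can be scored directly: a Gaussian smoothness term for P(D | S) and the node-count prior for P(S | R). A minimal sketch, assuming continuous-valued features and reusing the Laplacian covariance from the diffusion sketch above (theta and sigma are invented values):

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_structure_score(adj, D, theta=0.9, sigma=5.0):
    """log P(S | D, R) up to a constant = log P(D | S) + log P(S | R).
    Each feature (a column of D) is scored under N(0, K) with the smooth
    graph covariance K; the prior charges log(theta) per node."""
    lap = np.diag(adj.sum(axis=1)) - adj
    K = np.linalg.inv(lap + np.eye(len(adj)) / sigma**2)
    fit = sum(multivariate_normal.logpdf(f, cov=K) for f in D.T)
    return fit + len(adj) * np.log(theta)

# Compare two candidate structures over four species on toy features
# (higher score is better).
D = np.array([[1.0, 0.9], [0.8, 1.1], [-0.9, -1.0], [-1.1, -0.8]])
chain = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], float)
star = np.array([[0, 1, 1, 1], [1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0]], float)
print(log_structure_score(chain, D), log_structure_score(star, D))
```

In the model described above, binary features arise by thresholding latent y values, which makes P(D | S) an integral rather than a single Gaussian density; the continuous version keeps this sketch short.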
Slide 77: Structure learning example
- Participants rated the goodness of 85 features for 48 animals (Osherson et al.).
- E.g., elephant: gray, hairless, toughskin, big, bulbous, longleg, tail, chewteeth, tusks, smelly, walks, slow, strong, muscle, quadrapedal, inactive, vegetation, grazer, oldworld, bush, jungle, ground, timid, smart, group
Slide 78: Biological Data
[Figure: the animals × features data matrix]
Slide 79: Tree
[Figure: the tree learned from the biological data]
Slide 80: Spatial model (recap)
[Diagram repeated from Slide 47: structural form = 2D space, stochastic process = diffusion]
Slide 81: 2D space
[Figure: the 2D representation learned from the biological data]
Slide 82: Conclusions: structure learning
- Hierarchical Bayesian models provide a unified framework for the acquisition and use of structured representations.
Slide 83: Outline
- A high-level view of HBMs
- A case study: semantic knowledge
  - Property induction
  - Learning structured representations
  - Learning the abstract organizing principles of a domain
Slide 84: Learning structural form
[Diagram: R (principles: structural form = tree, stochastic process = diffusion) → S (a tree over mouse, squirrel, chimp, gorilla) → D (data)]
Slide 85: Which form is best?
[Figure: two candidate structures over the same animals: ostrich, robin, crocodile, snake, bat, orangutan, turtle]
Slide 86: Structural forms
- Order
- Chain
- Ring
- Partition
- Hierarchy
- Tree
- Grid
- Cylinder
Slides 87–90: Learning structural form
[Diagram: F (form: could be tree, 2D space, ring, …; P(F) is a uniform distribution on the set of forms) → S (structure: unknown) → D (data)]

Aim: find S, F that maximize P(S, F | D) ∝ P(D | S) P(S | F) P(F)

Here P(D | S) is the distribution used for property induction, and P(S | F) is the distribution used for structure learning.
Slide 91: P(S | F): Generating structures from forms
- Each structure is weighted by the number of nodes it contains:

P(S | F) ∝ 0 if S is inconsistent with F, and θ^|S| otherwise,

where |S| is the number of nodes in S.
Slide 92: P(S | F): Generating structures from forms
- Simpler forms are preferred: since P(S | F) is normalized over all structures a form can generate, a form such as the chain, which generates fewer structures than the grid, places higher probability on each one.
[Figure: P(S | F) plotted over all possible graph structures S, for the chain and the grid]
Slide 93: Learning structural form
[Diagram: F (form, unknown) → S (structure, unknown) → D (data)]
Goal: find S, F that maximize P(S, F | D)
Slides 94–95: Learning structural form
- P(S, F | D) will be high when:
  - The features in D vary smoothly over S.
  - S is a simple graph (a graph with few nodes).
  - F is a simple form (a form that can generate only a few structures).

Aim: find S, F that maximize P(S, F | D) ∝ P(D | S) P(S | F) P(F)
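Given candidate structures grouped by form, the search over (S, F) is a nested maximization. A minimal sketch; the candidate lists are assumed inputs, whereas a real implementation would search the space of structures for each form:

```python
import numpy as np

def best_form(candidates, log_data_term, theta=0.9):
    """candidates: dict mapping a form name to the adjacency matrices of
    structures consistent with that form; log_data_term(adj) returns
    log P(D | S). P(F) is uniform, so it drops out of the argmax.
    NOTE: a faithful P(S | F) normalizes theta**n_nodes over all structures
    the form can generate; that per-form normalizer, which is what penalizes
    flexible forms, is omitted here for brevity."""
    best, best_lp = None, -np.inf
    for form, structures in candidates.items():
        for adj in structures:
            lp = log_data_term(adj) + len(adj) * np.log(theta)
            if lp > best_lp:
                best, best_lp = (form, adj), lp
    return best, best_lp
```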
Slides 96–97: Form learning: Biological Data
[Figures: the animals × features matrix and the structure learned from it]
Slide 98: Supreme Court (Spaeth)
- Votes on 1,600 cases (1987–2005)
Slide 99: Color (Ekman)
Slide 100: Outline
- A high-level view of HBMs
- A case study: semantic knowledge
  - Property induction
  - Learning structured representations
  - Learning the abstract organizing principles of a domain
Slide 101: Where do priors come from?
Slides 102–104: Stochastic process = diffusion; structural form = tree
[Diagrams: the principles level above the tree of mouse, squirrel, chimp, gorilla is built up step by step: first the stochastic process (diffusion), then the structural form (tree)]
Slide 105: Where do structural forms come from?
[Figure: the eight structural forms of Slide 86: order, chain, ring, partition, hierarchy, tree, grid, cylinder]
Slide 106: Where do structural forms come from?
[Diagram: each form paired with the process that generates it]
Slides 107–109: Node-replacement graph grammars
[Figures: the production for the chain form, and successive steps of a derivation that applies it]
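Each form corresponds to a tiny grammar: a production that rewrites one node into a small subgraph, applied repeatedly from a single seed node. A minimal sketch of the chain production; the list-based representation and the random choice of which node to split are my own simplifications:

```python
import random

def grow_chain(n_nodes, seed=0):
    """Derive a chain by node replacement: starting from one seed node,
    repeatedly replace a node with two adjacent nodes. Each rewrite happens
    inside the chain, so every intermediate graph is itself a chain."""
    rng = random.Random(seed)
    chain, next_id = [0], 1
    while len(chain) < n_nodes:
        i = rng.randrange(len(chain))          # pick a node to replace
        chain[i:i + 1] = [chain[i], next_id]   # production: node -> node-node
        next_id += 1
    return list(zip(chain, chain[1:]))         # edges of the derived chain

print(grow_chain(5))   # four edges forming a five-node chain
```

Analogous productions generate the other forms.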
Slide 110: Where do structural forms come from? (recap of Slide 106)
Slide 111: The complete space of grammars
[Figure: the grammars enumerated 1 through 4096]
Slide 112: When can we stop adding levels?
- When the knowledge at the top level is simple or general enough that it can be plausibly assumed to be innate.
Slide 113: Conclusions
- Hierarchical Bayesian models provide a unified framework which can:
  - Explain how abstract knowledge is used for induction.
  - Explain how abstract knowledge can be acquired.
Slide 114: Learning abstract knowledge
Applications of hierarchical Bayesian models at this conference:
- Semantic knowledge (Schmidt et al.): learning the M-constraint
- Syntax (Perfors et al.): learning that language is hierarchically organized
- Word learning (Kemp et al.): learning the shape bias