Title: Part III: Learning structured representations. Hierarchical Bayesian models
1. Part III: Learning structured representations. Hierarchical Bayesian models
2. Universal Grammar
Hierarchical phrase structure grammars (e.g., CFG, HPSG, TAG)
[Diagram: generative hierarchy: Universal Grammar → Grammar → Phrase structure → Utterance → Speech signal]
3. Outline
- Learning structured representations
  - grammars
  - logical theories
- Learning at multiple levels of abstraction
4. A historical divide
Structured representations with innate knowledge vs. unstructured representations with learning.
5. Structured representations
[Diagram: two axes. Chomsky and Keil: structured representations, innate knowledge. McClelland and Rumelhart: unstructured representations, learning. Structure learning: structured representations acquired by learning.]
6. Representations
- Causal networks (e.g., asbestos → lung cancer → coughing, chest pain)
- Grammars
- Logical theories
7. Representations
- Phonological rules
- Semantic networks [diagram: chemicals cause diseases; diseases affect biological functions; bio-active substances disrupt biological functions and interact with chemicals]
8. How to learn a representation R
- Search for the R that maximizes P(R | D) ∝ P(D | R) P(R)
- Prerequisites:
  - Put a prior over a hypothesis space of Rs.
  - Decide how observable data are generated from an underlying R.
9. How to learn a representation R (where R can be anything)
- Search for the R that maximizes P(R | D) ∝ P(D | R) P(R)
- Prerequisites:
  - Put a prior over a hypothesis space of Rs.
  - Decide how observable data are generated from an underlying R.
10. Context-free grammar
S → N VP
VP → V
VP → V N
N → Alice
N → Bob
V → scratched
V → cheered
[Two example parse trees: "Alice cheered" and "Alice scratched Bob"]
11. Probabilistic context-free grammar
1.0  S → N VP
0.6  VP → V
0.4  VP → V N
0.5  N → Alice
0.5  N → Bob
0.5  V → scratched
0.5  V → cheered
[Two example parse trees]
"Alice cheered": probability = 1.0 × 0.5 × 0.6 × 0.5 = 0.15
"Alice scratched Bob": probability = 1.0 × 0.5 × 0.4 × 0.5 × 0.5 = 0.05
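A minimal Python sketch (mine, not from the original slides) of the computation above; the rule table is copied from this slide:

```python
# The probability of a PCFG derivation is the product of the
# probabilities of the rules it applies.
PCFG = {
    ("S", ("N", "VP")): 1.0,
    ("VP", ("V",)): 0.6,
    ("VP", ("V", "N")): 0.4,
    ("N", ("Alice",)): 0.5,
    ("N", ("Bob",)): 0.5,
    ("V", ("scratched",)): 0.5,
    ("V", ("cheered",)): 0.5,
}

def derivation_probability(rules):
    p = 1.0
    for rule in rules:
        p *= PCFG[rule]
    return p

# "Alice cheered": S -> N VP, N -> Alice, VP -> V, V -> cheered
print(derivation_probability([("S", ("N", "VP")), ("N", ("Alice",)),
                              ("VP", ("V",)), ("V", ("cheered",))]))  # 0.15
```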
12. The learning problem
Grammar G:
1.0  S → N VP
0.6  VP → V
0.4  VP → V N
0.5  N → Alice
0.5  N → Bob
0.5  V → scratched
0.5  V → cheered
Data D:
Alice scratched. Bob scratched. Alice scratched Alice. Alice scratched Bob. Bob scratched Alice. Bob scratched Bob. Alice cheered. Bob cheered. Alice cheered Alice. Alice cheered Bob. Bob cheered Alice. Bob cheered Bob.
13. Grammar learning
- Search for the G that maximizes P(G | D) ∝ P(D | G) P(G)
- Prior: P(G)
- Likelihood: P(D | G); assume that sentences in the data are independently generated from the grammar (a scoring sketch follows below).
(Horning, 1969; Stolcke, 1994)
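A hedged sketch of the scoring step in this search, assuming two hypothetical helpers not defined in the slides: log_prior(G) and sentence_probability(G, s), the latter summing the probabilities of all derivations of s under G:

```python
import math

def log_posterior_score(G, data, log_prior, sentence_probability):
    # log P(G | D) = log P(G) + sum_i log P(s_i | G) + constant,
    # using the slide's assumption that sentences are generated
    # independently from the grammar.
    return log_prior(G) + sum(math.log(sentence_probability(G, s))
                              for s in data)

# The search then keeps whichever candidate grammar scores highest.
```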
14. Experiment
[Results figure] (Stolcke, 1994)
15. Generating grammar vs. model solution
[Figure: the grammar that generated the data alongside the grammar recovered by the model]
16. Predicate logic
- For all x and y, if y is the sibling of x then x is the sibling of y.
- For all x, y, and z, if x is the ancestor of y and y is the ancestor of z, then x is the ancestor of z.
17. Learning a kinship theory
Theory T: [logical rules over kinship predicates]
Data D:
- Sibling(victoria, arthur), Sibling(arthur, victoria),
- Ancestor(chris, victoria), Ancestor(chris, colin),
- Parent(chris, victoria), Parent(victoria, colin),
- Uncle(arthur, colin), Brother(arthur, victoria)
(Hinton; Quinlan)
18. Learning logical theories
- Search for the T that maximizes P(T | D) ∝ P(D | T) P(T)
- Prior: P(T)
- Likelihood: P(D | T); assume that the data include all facts that are true according to T (sketched below).
(Conklin and Witten; Kemp et al., 2008; Katz et al., 2008)
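One simple reading of the likelihood assumption, as a hypothetical sketch (the cited models use more refined scores): a theory is compatible with D only if the data contain every fact it entails.

```python
def log_likelihood(entailed_facts, data):
    # Slide's assumption: the data include all facts true according to T.
    # Incompatible theories get log-likelihood -inf; among compatible
    # theories, the prior over T does the rest of the work.
    return 0.0 if set(entailed_facts) <= set(data) else float("-inf")
```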
19. Theory-learning in the lab
Observed facts: R(c,b), R(k,c), R(f,c), R(c,l), R(f,k), R(k,l), R(l,b), R(f,l), R(l,h), R(f,b), R(k,b), R(f,h), R(b,h), R(c,h), R(k,h).
(cf. Krueger, 1979)
20. Theory-learning in the lab
Transitive: the 15 observed facts compress to five core facts plus one rule (a sketch of this compression appears below):
R(f,k). R(k,c). R(c,l). R(l,b). R(b,h).
R(X,Z) ← R(X,Y), R(Y,Z).
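A small sketch (mine, not the cited papers' code) showing that forward-chaining the transitivity rule over the five core facts regenerates all 15 facts from the previous slide:

```python
def transitive_closure(facts):
    # Repeatedly apply R(X,Z) <- R(X,Y), R(Y,Z) until nothing new is derived.
    closure = set(facts)
    while True:
        new = {(x, z) for (x, y) in closure for (y2, z) in closure
               if y == y2 and (x, z) not in closure}
        if not new:
            return closure
        closure |= new

core = {("f", "k"), ("k", "c"), ("c", "l"), ("l", "b"), ("b", "h")}
print(len(transitive_closure(core)))  # 15, matching the facts on slide 19
```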
21. Learning time
[Figure: human learning time against theory complexity (theory length); conditions include transitive theories and transitive theories with exceptions (cf. Goodman; Kemp et al., 2008)]
22. Conclusion, Part 1
- Bayesian models can combine structured representations with statistical inference.
23. Outline
- Learning structured representations
  - grammars
  - logical theories
- Learning at multiple levels of abstraction
24. Vision
[Figure] (Han and Zhu, 2006)
25. Motor control
[Figure] (Wolpert et al., 2003)
26. Causal learning
Schema: chemicals cause diseases; diseases cause symptoms.
Causal models: asbestos → lung cancer → {coughing, chest pain}; mercury → minamata disease → muscle wasting.
Contingency data: Patient 1: asbestos exposure, coughing, chest pain. Patient 2: mercury exposure, muscle wasting.
(Kelley; Cheng; Waldmann)
27. Universal Grammar
Hierarchical phrase structure grammars (e.g., CFG, HPSG, TAG)
Universal Grammar
  P(grammar | UG)
Grammar
  P(phrase structure | grammar)
Phrase structure
  P(utterance | phrase structure)
Utterance
  P(speech signal | utterance)
Speech signal
28. Hierarchical Bayesian model
U: Universal Grammar
  P(G | U)
G: Grammar
  P(si | G)
s1, s2, s3, s4, s5, s6: Phrase structures
  P(ui | si)
u1, u2, u3, u4, u5, u6: Utterances
A hierarchical Bayesian model specifies a joint distribution over all variables in the hierarchy:
P({ui}, {si}, G | U) = P(G | U) ∏i P(si | G) P(ui | si)
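As a hedged, generic sketch (the component distributions are passed in as hypothetical functions), the joint factorizes level by level:

```python
def log_joint(G, ss, us, U, log_p_G, log_p_s, log_p_u):
    # log P({u_i}, {s_i}, G | U)
    #   = log P(G | U) + sum_i [ log P(s_i | G) + log P(u_i | s_i) ]
    return log_p_G(G, U) + sum(log_p_s(s, G) + log_p_u(u, s)
                               for s, u in zip(ss, us))
```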
29. Top-down inferences
[Hierarchy as above: U → G → si → ui]
Infer si given ui and G:
P(si | ui, G) ∝ P(ui | si) P(si | G)
30. Bottom-up inferences
[Hierarchy as above]
Infer G given {si} and U:
P(G | {si}, U) ∝ [ ∏i P(si | G) ] P(G | U)
31. Simultaneous learning at multiple levels
[Hierarchy as above]
Infer G and {si} given {ui} and U:
P(G, {si} | {ui}, U) ∝ [ ∏i P(ui | si) P(si | G) ] P(G | U)
32. Word learning
Words in general: whole-object bias, shape bias
Individual words: gavagai, duck, monkey, car
Data: [labeled examples]
33. A hierarchical Bayesian model
Physical knowledge
  ↓
Coins in general: FH, FT
  ↓
Coin 1, Coin 2, ..., Coin 200: θ1, θ2, ..., θ200, with θ ~ Beta(FH, FT)
  ↓
Data per coin: d1 d2 d3 d4
- Qualitative physical knowledge (symmetry) can influence estimates of the continuous parameters (FH, FT).
- Explains why 10 flips each of 200 coins are better than 2000 flips of a single coin: they are more informative about FH and FT.
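A numerical sketch of that second point (my own illustration; the coins-in-general prior, the grid, and the seed are all assumptions). With θ integrated out, each coin contributes a beta-binomial evidence term for the population-level parameters; 200 coins constrain them sharply, while one coin, however many flips, barely does:

```python
import numpy as np
from scipy.special import betaln

rng = np.random.default_rng(0)

def log_evidence(a, b, heads, n):
    # log P(heads out of n | F_H=a, F_T=b), theta integrated out
    # (beta-binomial marginal, dropping the constant binomial coefficient).
    return betaln(a + heads, b + n - heads) - betaln(a, b)

heads_200 = rng.binomial(10, rng.beta(5, 5, size=200))  # 200 coins x 10 flips
heads_1 = rng.binomial(2000, rng.beta(5, 5))            # 1 coin x 2000 flips

grid = np.linspace(0.5, 20, 60)                         # symmetric case F_H = F_T = a
score_200 = [sum(log_evidence(a, a, h, 10) for h in heads_200) for a in grid]
score_1 = [log_evidence(a, a, heads_1, 2000) for a in grid]
# score_200 peaks near the true value (a = 5); score_1 varies little,
# because a single coin amounts to one draw from Beta(F_H, F_T).
```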
34. Word learning
"This is a dax." "Show me the dax."
- 24-month-olds show a shape bias.
- 20-month-olds do not.
(Landau, Smith, Gleitman)
35. Is the shape bias learned?
- Smith et al. (2002) trained 17-month-olds on labels for 4 artificial categories.
- After 8 weeks of training, these (by then 19-month-old) children showed the shape bias.
"This is a dax." "Show me the dax."
36. Learning about feature variability
[Figure: bags of marbles, with an unknown draw to predict] (cf. Goodman)
37. Learning about feature variability
[Figure: bags of marbles, continued] (cf. Goodman)
38. A hierarchical model
Meta-constraints M
  ↓
Bags in general: color varies across bags but not much within bags.
  ↓
Bag proportions: mostly red; mostly brown; mostly green; mostly yellow; mostly blue?
  ↓
Data: [draws from each bag]
39. A hierarchical Bayesian model
Meta-constraints M
Bags in general: within-bag variability 0.1 (low); color mix (0.4, 0.4, 0.2)
Bag proportions: (1,0,0), (0,1,0), (1,0,0), (0,1,0), (.1,.1,.8)
Data (counts per bag): (6,0,0), (0,6,0), (6,0,0), (0,6,0), (0,0,1)
40. A hierarchical Bayesian model
Meta-constraints M
Bags in general: within-bag variability 5 (high); color mix (0.4, 0.4, 0.2)
Bag proportions: (.5,.5,0), (.5,.5,0), (.5,.5,0), (.5,.5,0), (.4,.4,.2)
Data (counts per bag): (3,3,0), (3,3,0), (3,3,0), (3,3,0), (0,0,1)
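A hedged sketch of the contrast between these two slides, under one common formalization (notation mine, not the slides'): bag proportions θ ~ Dirichlet(α·β), where α is the within-bag-variability knob and β the overall color mix. By conjugacy, the posterior mean for a new bag that has yielded one marble of the third color is proportional to α·β + counts:

```python
import numpy as np

def new_bag_posterior_mean(alpha, beta, counts):
    # theta ~ Dirichlet(alpha * beta); Dirichlet-multinomial conjugacy
    # gives E[theta | counts] proportional to alpha * beta + counts.
    post = alpha * np.asarray(beta, float) + np.asarray(counts, float)
    return post / post.sum()

beta = [0.4, 0.4, 0.2]
print(new_bag_posterior_mean(0.1, beta, [0, 0, 1]))  # ~(.04, .04, .93): "mostly blue"
print(new_bag_posterior_mean(5.0, beta, [0, 0, 1]))  # ~(.33, .33, .33): one draw
                                                     # barely overrides the mix
```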
41. Shape of the Beta prior
[Figure: Beta distributions for different parameter settings]
42. A hierarchical Bayesian model
[Hierarchy: meta-constraints M → bags in general → bag proportions → data]
43. A hierarchical Bayesian model
[Hierarchy: meta-constraints M → bags in general → bag proportions → data]
44. Learning about feature variability
[Hierarchy: meta-constraints M → categories in general → individual categories → data]
45. [Training stimuli: categories labeled wib, lug, zup, div]
46. [Test: a novel category labeled dax, alongside wib, lug, zup, div]
47. Model predictions
[Figure: choice probabilities for "Show me the dax"]
48. Where do priors come from?
[Hierarchy: meta-constraints M → categories in general → individual categories → data]
49. Knowledge representation
50. The discovery of structural form
[Figure: domains and candidate forms: BIOLOGY, POLITICS (judges Scalia, Ginsburg, Stevens, Thomas), COLOR, FRIENDSHIP, CHEMISTRY]
51. Children discover structural form
- Children may discover that:
  - social networks are often organized into cliques
  - the months form a cycle
  - "heavier than" is transitive
  - category labels can be organized into hierarchies
52. A hierarchical Bayesian model
[Hierarchy: meta-constraints M → form (e.g., tree) → structure → data]
53. A hierarchical Bayesian model
[Hierarchy: meta-constraints M → form F (e.g., tree) → structure S → data D]
54. Structural forms
Order, chain, ring, partition, hierarchy, tree, grid, cylinder
55. P(S | F, n): generating structures
- Each structure is weighted by the number of nodes it contains:
  P(S | F, n) = 0 if S is inconsistent with F; otherwise P(S | F, n) ∝ θ^|S|,
  where |S| is the number of nodes in S.
56. P(S | F, n): generating structures from forms
- Simpler forms are preferred: a form that can generate only a few structures (e.g., a chain) places more probability on each one than a form that can generate many (e.g., a grid).
[Figure: P(S | F) across all possible graph structures S over nodes A, B, C, D, for chain and grid forms]
57. A hierarchical Bayesian model
[Hierarchy: meta-constraints M → form F (e.g., tree) → structure S → data D]
58. p(D | S): generating feature data
- Intuition: features should be smooth over the graph S.
[Figure: a relatively smooth feature assignment vs. a non-smooth one]
59. p(D | S): generating feature data
Let fi be the feature value at node i. Smooth features are more probable:
p(f | S) ∝ exp( -(1/2) Σ(i,j)∈E wij (fi - fj)² ),
where wij is the weight of the edge between nodes i and j.
(Zhu, Lafferty, Ghahramani)
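A small sketch of this score (mine; the exponent equals -(1/2) fᵀLf for the graph Laplacian L built from the edge weights), showing that smooth features score higher:

```python
import numpy as np

def log_smoothness(W, f):
    # Unnormalized log p(f | S): -(1/2) * sum over edges of w_ij (f_i - f_j)^2,
    # computed via the graph Laplacian L = D - W.
    W = np.asarray(W, float)
    L = np.diag(W.sum(axis=1)) - W
    return -0.5 * f @ L @ f

chain = np.diag(np.ones(3), 1) + np.diag(np.ones(3), -1)  # 4-node chain graph
print(log_smoothness(chain, np.array([0., 1., 2., 3.])))  # smooth: -1.5
print(log_smoothness(chain, np.array([3., 0., 2., 1.])))  # jagged: -7.0
```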
60. A hierarchical Bayesian model
[Hierarchy: meta-constraints M → form F (e.g., tree) → structure S → data D]
61. Feature data results
[Figures: structures discovered from animals × features data and from judges × cases data]
62. Developmental shifts
[Figure: structures discovered from 5, 20, and 110 features]
63. Similarity data results
[Figure: structure discovered from color similarity data]
64. Relational data
[Hierarchy: meta-constraints M → form (cliques) → structure (nodes 1-8 assigned to cliques) → data]
65. Relational data results
[Figures: primates ("x dominates y"), prisoners ("x is friends with y"), Bush cabinet ("x tells y")]
66. Universal structure grammar
[Hierarchy: U → form → structure → data; example feature: "warm"]
67. Node-replacement graph grammars
[Figure: a production for the chain form and an example derivation]
68. A hypothesis space of forms
[Figure: candidate forms and the generative processes that produce them]
69. The complete space of grammars
[Figure: all 4096 grammars, from grammar 1 to grammar 4096]
70. Universal structure grammar
[Hierarchy: U → form → structure → feature data]
71. Conclusions, Part 2
- Hierarchical Bayesian models provide a unified framework that helps to explain:
  - how abstract knowledge is acquired
  - how abstract knowledge is used for induction
72. Outline
- Learning structured representations
  - grammars
  - logical theories
- Learning at multiple levels of abstraction
73. Handbook of Mathematical Psychology, 1963