Title: A Bayesian view of language evolution by iterated learning
1. A Bayesian view of language evolution by iterated learning
- Tom Griffiths, Brown University
- Mike Kalish, University of Louisiana
2. Linguistic universals
- Human languages are a subset of all logically possible communication schemes
- Universal properties are common to all languages
- (Comrie, 1981; Greenberg, 1963; Hawkins, 1988)
- Two questions
- why do linguistic universals exist?
- why are particular properties universal?
3. Possible explanations
- Traditional answer
- linguistic universals reflect innate constraints specific to a system for acquiring language (e.g., Chomsky, 1965)
- Alternative answer
- linguistic universals emerge as the result of the fact that language is learned anew by each generation (e.g., Briscoe, 1998; Kirby, 2001)
4. Iterated learning (Kirby, 2001)
- Each learner sees data, forms a hypothesis, and produces the data given to the next learner
- cf. the playground game "telephone"
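A minimal sketch of this loop in Python; the learn() and produce() functions here are hypothetical stand-ins for a learner's inference and production steps, not the model used in the talk:

```python
import random

def learn(data):
    # Hypothetical inference step: map observed data to a hypothesis.
    # Here, a trivial learner that memorizes the most common item.
    return max(set(data), key=data.count)

def produce(hypothesis, n_utterances=3):
    # Hypothetical production step: generate data from the hypothesis,
    # with a small chance of error introducing variation.
    return [hypothesis if random.random() > 0.1 else "noise"
            for _ in range(n_utterances)]

def iterated_learning(initial_data, n_generations=10):
    data = initial_data
    for generation in range(n_generations):
        hypothesis = learn(data)    # learner forms a hypothesis from the data
        data = produce(hypothesis)  # that hypothesis generates the next learner's data
        print(generation, hypothesis, data)

iterated_learning(["hello", "hello", "noise"])
```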
5. The information bottleneck (Kirby, 2001)
(Figure: size indicates compressibility)
6. Analyzing iterated learning
What are the consequences of iterated learning?
(Komarova, Niyogi, & Nowak, 2002; Brighton, 2002; Kirby, 2001; Smith, Kirby, & Brighton, 2003)
7. Outline
- Iterated Bayesian learning
- Markov chains
- Convergence results
- Example: Emergence of compositionality
- Conclusion
8. Outline
- Iterated Bayesian learning
- Markov chains
- Convergence results
- Example: Emergence of compositionality
- Conclusion
9. Bayesian inference
- Rational procedure for updating beliefs
- Foundation of many learning algorithms
- (e.g., MacKay, 2003)
- Widely used for language learning
- (e.g., Charniak, 1993)
(Image: Reverend Thomas Bayes)
10. Bayes' theorem
p(h|d) = p(d|h) p(h) / Σh′ p(d|h′) p(h′)
- h: hypothesis
- d: data
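As a toy worked example (with made-up numbers, not from the talk), the posterior over two hypothetical hypotheses given one observation:

```python
# Toy Bayes' theorem computation with two hypothetical hypotheses.
prior = {"h1": 0.5, "h2": 0.5}       # p(h)
likelihood = {"h1": 0.8, "h2": 0.2}  # p(d | h) for some observed d

evidence = sum(prior[h] * likelihood[h] for h in prior)              # p(d)
posterior = {h: prior[h] * likelihood[h] / evidence for h in prior}  # p(h | d)
print(posterior)  # {'h1': 0.8, 'h2': 0.2}
```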
11. Iterated Bayesian learning
(Diagram: a chain of learners; each learner infers a hypothesis from data via p(h|d), then produces data for the next learner via p(d|h))
12. Outline
- Iterated Bayesian learning
- Markov chains
- Convergence results
- Example: Emergence of compositionality
- Conclusion
13. Markov chains
(Diagram: a chain of variables x(0) → x(1) → x(2) → ...)
Transition matrix P(x(t+1) | x(t))
- Variables: x(t+1) is independent of history given x(t)
- Converges to a stationary distribution under easily checked conditions for ergodicity
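As a concrete illustration (my own example, not from the slides), a two-state chain simulated long enough that the empirical state frequencies approach the stationary distribution obtained from the eigenvectors of the transition matrix:

```python
import numpy as np

# Transition matrix P[i, j] = P(x(t+1) = j | x(t) = i) for a two-state chain.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

rng = np.random.default_rng(0)
x, counts = 0, np.zeros(2)
for _ in range(100_000):
    x = rng.choice(2, p=P[x])  # the next state depends only on the current state
    counts[x] += 1

print("empirical:", counts / counts.sum())

# Stationary distribution: left eigenvector of P with eigenvalue 1.
eigvals, eigvecs = np.linalg.eig(P.T)
pi = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
print("stationary:", pi / pi.sum())  # about [0.833, 0.167]
```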
14. Markov chain Monte Carlo
- A strategy for sampling from complex probability distributions
- Key idea: construct a Markov chain which converges to a particular distribution
- e.g., the Metropolis algorithm
- e.g., Gibbs sampling
15. Gibbs sampling
- For variables x = (x1, x2, ..., xn)
- Draw xi(t+1) from P(xi | x-i)
- where x-i = (x1(t+1), x2(t+1), ..., xi-1(t+1), xi+1(t), ..., xn(t))
- Converges to P(x1, x2, ..., xn)
(Geman & Geman, 1984)
(a.k.a. the heat bath algorithm in statistical physics)
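A standard textbook-style illustration (my choice of target distribution, not from the slides): a Gibbs sampler for a bivariate Gaussian with correlation rho, where each conditional P(xi | x-i) is itself Gaussian:

```python
import numpy as np

rho, n_samples = 0.8, 10_000
rng = np.random.default_rng(0)

x1, x2 = 0.0, 0.0
samples = np.empty((n_samples, 2))
for t in range(n_samples):
    # Draw each variable from its conditional given the current value of the other.
    x1 = rng.normal(rho * x2, np.sqrt(1 - rho**2))
    x2 = rng.normal(rho * x1, np.sqrt(1 - rho**2))
    samples[t] = (x1, x2)

print("sample correlation:", np.corrcoef(samples.T)[0, 1])  # close to rho
```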
16. Gibbs sampling
(Figure: illustration of Gibbs sampling; MacKay, 2003)
17. Outline
- Iterated Bayesian learning
- Markov chains
- Convergence results
- Example: Emergence of compositionality
- Conclusion
18. Analyzing iterated learning
- Iterated learning is a Markov chain on (h,d)
19. Analyzing iterated learning
(Diagram: learners alternately sampling h from p(h|d) and d from p(d|h))
- Iterated learning is a Markov chain on (h, d)
- Iterated Bayesian learning is a Gibbs sampler for the joint distribution p(d, h)
20. Analytic results
- Iterated Bayesian learning converges to the joint distribution p(d, h)
- (geometrically; Liu, Wong, & Kong, 1995)
21. Analytic results
- Iterated Bayesian learning converges to the joint distribution p(d, h)
- Corollaries
- the distribution over hypotheses converges to p(h)
- the distribution over data converges to p(d)
- the proportion of a population of iterated learners with hypothesis h converges to p(h)
(geometrically; Liu, Wong, & Kong, 1995)
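A minimal simulation (my own sketch, with made-up numbers) of iterated Bayesian learning over two hypothetical hypotheses, checking that the long-run frequency of each hypothesis matches the prior p(h):

```python
import numpy as np

# Two hypothetical hypotheses, each a distribution over two possible data items.
prior = np.array([0.7, 0.3])             # p(h)
likelihood = np.array([[0.9, 0.1],       # p(d | h = 0)
                       [0.2, 0.8]])      # p(d | h = 1)

rng = np.random.default_rng(0)
h = 0
visits = np.zeros(2)
for _ in range(50_000):
    d = rng.choice(2, p=likelihood[h])   # learner n produces data from p(d | h)
    posterior = prior * likelihood[:, d] # learner n+1 computes p(h | d)
    posterior /= posterior.sum()
    h = rng.choice(2, p=posterior)       # learner n+1 samples a hypothesis
    visits[h] += 1

print("hypothesis frequencies:", visits / visits.sum())  # approaches the prior [0.7, 0.3]
```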
22. Outline
- Iterated Bayesian learning
- Markov chains
- Convergence results
- Example: Emergence of compositionality
- Conclusion
23. A simple language model
24. A simple language model
- Data: m event-utterance pairs
- Hypotheses: languages, with error ε
(Figure: example languages mapping events to utterances, coded with 0s and 1s; one example labeled "holistic")
25. Analysis technique
- Compute the transition matrix on languages
- Sample Markov chains
- Compare language frequencies with the prior
- (can also compute eigenvalues, etc.)
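To make this concrete, a sketch (with hypothetical numbers, not the language model from the slides) that builds the transition matrix on hypotheses induced by one generation of iterated learning, then reads the stationary distribution and eigenvalues off it:

```python
import numpy as np

prior = np.array([0.7, 0.3])             # p(h) over two hypothetical languages
likelihood = np.array([[0.9, 0.1],       # p(d | h = 0)
                       [0.2, 0.8]])      # p(d | h = 1)

# posterior[d, h'] = p(h' | d)
posterior = likelihood.T * prior
posterior /= posterior.sum(axis=1, keepdims=True)

# T[h, h'] = sum_d p(d | h) p(h' | d): one generation of iterated learning.
T = likelihood @ posterior

eigvals, eigvecs = np.linalg.eig(T.T)
stationary = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
stationary /= stationary.sum()
print("stationary:", stationary)         # matches the prior [0.7, 0.3]
print("second eigenvalue:", np.sort(np.real(eigvals))[-2])  # geometric convergence rate
```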
26. Convergence to priors
(Figure: language frequencies in the chain compared with the prior, plotted over iterations, for parameter settings (0.50, 0.05, m = 3) and (0.01, 0.05, m = 3))
27. The information bottleneck
(Figure: chain frequencies compared with the prior over iterations, for parameter settings (0.50, 0.05, m = 1), (0.01, 0.05, m = 3), and (0.50, 0.05, m = 10))
28. The information bottleneck
The bottleneck affects the relative stability of the languages favored by the prior.
29. Outline
- Iterated Bayesian learning
- Markov chains
- Convergence results
- Example: Emergence of compositionality
- Conclusion
30. Implications for linguistic universals
- Two questions
- why do linguistic universals exist?
- why are particular properties universal?
- Different answers
- existence explained through iterated learning
- universal properties depend on the prior
- Focuses inquiry on the priors of the learners
- languages reflect the biases of human learners
31. Extensions and future directions
- Results extend to
- unbounded populations
- continuous time population dynamics
- Iterated learning applies to other knowledge
- religious concepts, social norms, legends
- Provides a method for evaluating priors
- experiments in iterated learning with humans
33. Iterated function learning
- Each learner sees a set of (x, y) pairs
- Makes predictions of y for new x values
- Predictions are data for the next learner
34. Function learning in the lab
Examine iterated learning with different initial data
35. Initial data
(Figure: predictions produced at iterations 1-9 for different initial data; Kalish, 2004)
37. An example: Gaussians
- If we assume
- data, d, is a single real number, x
- hypotheses, h, are means of a Gaussian, μ
- the prior, p(μ), is Gaussian(μ0, σ0²)
- then p(xn+1 | xn) is Gaussian(μn, σx² + σn²)
38. μ0 = 0, σ0² = 1, x0 = 20
Iterated learning results in rapid convergence to the prior
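A quick sketch of the Gaussian case using the values on this slide; the data variance σx² = 1 is an assumption of mine, since the slides do not state it. Each learner posterior-samples a mean from its single observation and generates the next observation:

```python
import numpy as np

# mu0 = 0, sigma0^2 = 1, x0 = 20 from the slide; var_x = 1 is assumed.
mu0, var0, var_x = 0.0, 1.0, 1.0
rng = np.random.default_rng(0)

x = 20.0
xs = []
for n in range(1000):
    # Conjugate Gaussian posterior over mu given the single observation x.
    var_n = 1.0 / (1.0 / var0 + 1.0 / var_x)
    mu_n = var_n * (mu0 / var0 + x / var_x)
    mu = rng.normal(mu_n, np.sqrt(var_n))  # learner samples a hypothesis
    x = rng.normal(mu, np.sqrt(var_x))     # and produces data for the next learner
    xs.append(x)

# After a few iterations the chain forgets x0 = 20; the long-run mean of x
# approaches the prior mean mu0 = 0.
print("mean of later samples:", np.mean(xs[10:]))
```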
39. An example: Linear regression
- Assume
- data, d, are pairs of real numbers (x, y)
- hypotheses, h, are functions
- An example: linear regression
- hypotheses have slope θ and pass through the origin
- p(θ) is Gaussian(θ0, σ0²)
(Figure: a line with slope θ through the origin, marked at x = 1)
40. (Figure: the line with slope θ at x = 1, for θ0 = 1, σ0² = 0.1, y0 = -1)
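A sketch of iterated learning for the origin-constrained regression, using the slide-40 values θ0 = 1, σ0² = 0.1, y0 = -1; the observation noise variance and the single design point x = 1 are assumptions of mine:

```python
import numpy as np

# theta0 = 1, sigma0^2 = 0.1, y0 = -1 from slide 40;
# noise variance var_y = 0.1 and design point x_pt = 1 are assumed.
theta0, var0, var_y, x_pt = 1.0, 0.1, 0.1, 1.0
rng = np.random.default_rng(0)

y = -1.0
for n in range(20):
    # Conjugate posterior over the slope from one observation (x_pt, y).
    var_n = 1.0 / (1.0 / var0 + x_pt**2 / var_y)
    mean_n = var_n * (theta0 / var0 + x_pt * y / var_y)
    theta = rng.normal(mean_n, np.sqrt(var_n))    # learner samples a slope
    y = rng.normal(theta * x_pt, np.sqrt(var_y))  # its prediction is the next learner's data
    print(n, round(theta, 3))
# The sampled slopes quickly drift away from the region implied by y0 = -1
# and settle around the prior mean theta0 = 1.
```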