Title: Logistics
1 Logistics
- Class size? Who is new? Who is listening?
- Everyone on the Athena mailing list concepts-and-theories? If not, write to me.
- Everyone on Stellar yet? If not, write to Melissa Yeh (mjyeh_at_mit.edu).
- Interest in having a printed course pack, even if a few readings get changed?
2 Plan for tonight
- Why be Bayesian?
- Informal introduction to learning as probabilistic inference
- Formal introduction to probabilistic inference
- A little bit of mathematical psychology
- An introduction to Bayes nets
4 Virtues of Bayesian framework
- Generates principled models with strong explanatory and descriptive power.
- Unifies models of cognition across tasks and domains:
  - Categorization
  - Concept learning
  - Word learning
  - Inductive reasoning
  - Causal inference
  - Conceptual change
  - Domains: biology, physics, psychology, language, . . .
- Explains which processing models work, and why:
  - Associative learning
  - Connectionist networks
  - Similarity to examples
  - Toolkit of simple heuristics
- Allows us to move beyond classic dichotomies:
  - Symbols (rules, logic, hierarchies, relations) versus statistics
  - Domain-general versus domain-specific
  - Nature versus nurture
- A framework for understanding theory-based cognition:
  - How are theories used to learn about the structure of the world?
  - How are theories acquired?
9 Rational statistical inference (Bayes, Laplace)
- Fundamental question: How do we update beliefs in light of data?
- Fundamental (and only) assumption: Represent degrees of belief as probabilities.
- The answer: The mathematics of probability theory.
10 What does probability mean?
- Frequentists: Probability as expected frequency
  - P(A) = 1: A will always occur.
  - P(A) = 0: A will never occur.
  - 0.5 < P(A) < 1: A will occur more often than not.
- Subjectivists: Probability as degree of belief
  - P(A) = 1: believe A is true.
  - P(A) = 0: believe A is false.
  - 0.5 < P(A) < 1: believe A is more likely to be true than false.
11 What does probability mean?
- Frequentists: Probability as expected frequency
  - P(heads) = 0.5: If we flip 100 times, we expect to see about 50 heads.
- Subjectivists: Probability as degree of belief
  - P(heads) = 0.5: On the next flip, it's an even bet whether it comes up heads or tails.
  - P(rain tomorrow) = 0.8
  - P(Saddam Hussein is dead) = 0.1
  - . . .
12 Is subjective probability cognitively viable?
- Evolutionary psychologists (Gigerenzer, Cosmides, Tooby, Pinker) argue it is not.
13 (continued)
- "To understand the design of statistical inference mechanisms, then, one needs to examine what form inductive-reasoning problems -- and the information relevant to solving them -- regularly took in ancestral environments. Asking for the probability of a single event seems unexceptionable in the modern world, where we are bombarded with numerically expressed statistical information, such as weather forecasts telling us there is a 60% chance of rain today. In ancestral environments, the only external database available from which to reason inductively was one's own observations and, possibly, those communicated by the handful of other individuals with whom one lived."
- "The probability of a single event cannot be observed by an individual, however. Single events either happen or they don't -- either it will rain today or it will not. Natural selection cannot build cognitive mechanisms designed to reason about, or receive as input, information in a format that did not regularly exist."
(Brase, Cosmides and Tooby, 1998)
14 Is subjective probability cognitively viable?
- Evolutionary psychologists (Gigerenzer, Cosmides, Tooby, Pinker) argue it is not.
- Reasons to think it is:
  - Intuitions are old and potentially universal (Aristotle, the Talmud).
  - Represented in the semantics (and syntax?) of natural language.
  - Extremely useful . . .
15 Why be subjectivist?
- Often need to make inferences about singular events, e.g., how likely is it to rain tomorrow?
- Cox Axioms: a formal model of common sense.
- Dutch Book / survival of the fittest: if your beliefs do not accord with the laws of probability, then you can always be out-gambled by someone whose beliefs do so accord.
- Provides a theory of learning: a common currency for combining prior knowledge and the lessons of experience.
16 Cox Axioms (via Jaynes)
- Degrees of belief are represented by real numbers.
- Qualitative correspondence with common sense.
- Consistency:
  - If a conclusion can be reasoned in more than one way, then every possible way must lead to the same result.
  - All available evidence should be taken into account when inferring a degree of belief.
  - Equivalent states of knowledge should be represented with equivalent degrees of belief.
- Accepting these axioms implies that Bel can be represented as a probability measure.
17 Plan for tonight
- Why be Bayesian?
- Informal introduction to learning as probabilistic inference
- Formal introduction to probabilistic inference
- A little bit of mathematical psychology
- An introduction to Bayes nets
18 Example: flipping coins
- Flip a coin 10 times and see 5 heads, 5 tails.
- P(heads) on next flip? 50%.
- Why? 50% = 5 / (5+5) = 5/10.
- Future will be like the past.
- Suppose we had seen 4 heads and 6 tails.
- P(heads) on next flip? Closer to 50% than to 40%.
- Why? Prior knowledge.
19 Example: flipping coins
- Represent prior knowledge as fictional observations F.
- E.g., F = 1000 heads, 1000 tails: strong expectation that any new coin will be fair.
  - After seeing 4 heads, 6 tails, P(heads) on next flip = 1004 / (1004+1006) ≈ 49.95%.
- E.g., F = 3 heads, 3 tails: weak expectation that any new coin will be fair.
  - After seeing 4 heads, 6 tails, P(heads) on next flip = 7 / (7+9) = 43.75%. Prior knowledge too weak.
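The fictional-observation arithmetic above can be sketched in a few lines. This is just the count ratio from the slide; the helper name `predictive` is my own choice, not from the slides.

```python
def predictive(h, t, fh, ft):
    """P(heads) on the next flip, given h heads and t tails actually
    observed, plus fh fictional heads and ft fictional tails."""
    return (h + fh) / (h + t + fh + ft)

# Strong expectation of fairness: F = 1000 heads, 1000 tails.
print(predictive(4, 6, 1000, 1000))  # 1004/2010, about 0.4995
# Weak expectation of fairness: F = 3 heads, 3 tails.
print(predictive(4, 6, 3, 3))        # 7/16 = 0.4375
```

The same function reproduces the later slides too: with F = 1000/1000, even 25 straight heads only moves the prediction to 1025/2025 ≈ 0.506, which is the limitation slide 24 points out.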
20 Example: flipping thumbtacks
- Represent prior knowledge as fictional observations F.
- E.g., F = 4 heads, 3 tails: weak expectation that tacks are slightly biased towards heads.
  - After seeing 2 heads, 0 tails, P(heads) on next flip = 6 / (6+3) ≈ 67%.
- Some prior knowledge is always necessary to avoid jumping to hasty conclusions.
- Suppose F is empty: after seeing 2 heads, 0 tails, P(heads) on next flip = 2 / (2+0) = 100%.
21 Origin of prior knowledge
- Tempting answer: prior experience.
- Suppose you have previously seen 2000 coin flips: 1000 heads, 1000 tails.
- By assuming all coins (and flips) are alike, these observations of other coins are as good as actual observations of the present coin.
22 Problems with simple empiricism
- Haven't really seen 2000 coin flips, or any thumbtack flips.
  - Prior knowledge is stronger than raw experience justifies.
- Haven't seen an exactly equal number of heads and tails.
  - Prior knowledge is smoother than raw experience justifies.
- Should be a difference between observing 2000 flips of a single coin versus observing 10 flips each for 200 coins, or 1 flip each for 2000 coins.
  - Prior knowledge is more structured than raw experience.
23 A simple theory
- Coins are manufactured by a standardized procedure that is effective but not perfect.
- Justifies generalizing from previous coins to the present coin.
- Justifies a smoother and stronger prior than raw experience alone.
- Explains why seeing 10 flips each for 200 coins is more valuable than seeing 2000 flips of one coin.
- Tacks are asymmetric, and manufactured to less exacting standards.
24 Limitations
- Can all domain knowledge be represented so simply, in terms of an equivalent number of fictional observations?
- Suppose you flip a coin 25 times and get all heads. Something funny is going on . . .
- But with F = 1000 heads, 1000 tails: P(heads) on next flip = 1025 / (1025+1000) ≈ 50.6%. Looks like nothing unusual.
25 Plan for tonight
- Why be Bayesian?
- Informal introduction to learning as probabilistic inference
- Formal introduction to probabilistic inference
- A little bit of mathematical psychology
- An introduction to Bayes nets
26 Basics
- Propositions: A, B, C, . . .
- Negation: ¬A
- Logical operators: ∧ (and), ∨ (or)
- Obey classical logic, e.g., ¬(A ∧ B) = ¬A ∨ ¬B.
27 Basics
- Conservation of belief: P(A) + P(¬A) = 1
- Joint probability: P(A ∧ B)
- For independent propositions: P(A ∧ B) = P(A) P(B)
- More generally: P(A ∧ B) = P(A) P(B|A)
28 Basics
- Example:
  - A = heads on flip 2
  - B = tails on flip 2
  - P(A ∧ B) = 0, but P(A) P(B) = 1/4, so A and B are not independent.
29 Basics
- All probabilities should be conditioned on background knowledge K, e.g., P(A|K).
- All the same rules hold conditioned on any K, e.g., P(A|K) + P(¬A|K) = 1.
- Often background knowledge will be implicit, brought in as needed.
30 Bayesian inference
- Definition of conditional probability: P(A|B) = P(A ∧ B) / P(B)
- Bayes' theorem: P(H|D) = P(D|H) P(H) / P(D)
31 Bayesian inference
- Definition of conditional probability: P(A|B) = P(A ∧ B) / P(B)
- Bayes' rule: P(H|D) = P(D|H) P(H) / P(D)
  - Posterior probability: P(H|D)
  - Prior probability: P(H)
  - Likelihood: P(D|H)
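A minimal numeric check of Bayes' rule as just stated. The three input numbers are illustrative values I picked, not figures from the slides.

```python
def posterior(prior, likelihood, evidence):
    """P(H|D) = P(D|H) P(H) / P(D)."""
    return likelihood * prior / evidence

p_h = 0.01          # P(H): the hypothesis is fairly implausible a priori
p_d_given_h = 0.9   # P(D|H): the hypothesis strongly predicts the data
p_d = 0.05          # P(D): the data are fairly surprising overall
print(posterior(p_h, p_d_given_h, p_d))  # 0.18
```

Note how the surprising data (small P(D) in the denominator) boost the posterior well above the 1% prior, which is exactly the "good scientific argument" point on the next slide.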
32 Bayesian inference
- Bayes' rule: P(H|D) = P(D|H) P(H) / P(D)
- What makes a good scientific argument? P(H|D) is high if:
  - The hypothesis is plausible: P(H) is high.
  - The hypothesis strongly predicts the observed data: P(D|H) is high.
  - The data are surprising: P(D) is low.
33 Bayesian inference
- Deriving a more useful version:
  - Start from Bayes' rule: P(H|D) = P(D|H) P(H) / P(D).
  - Conditionalization (conservation of belief, given the data): P(H|D) + P(¬H|D) = 1.
  - So P(D) = P(D|H) P(H) + P(D|¬H) P(¬H).
  - Giving: P(H|D) = P(D|H) P(H) / [P(D|H) P(H) + P(D|¬H) P(¬H)].
40 Random variables
- Random variable X denotes a set of mutually exclusive, exhaustive propositions (states of the world): x1, . . . , xn.
- Bayes' theorem for random variables: P(X = x | D) = P(D | X = x) P(X = x) / Σx' P(D | X = x') P(X = x')
41 Random variables
- Random variable X denotes a set of mutually exclusive, exhaustive propositions (states of the world).
- Bayes' rule for more than two hypotheses: P(hi|d) = P(d|hi) P(hi) / Σj P(d|hj) P(hj)
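The multi-hypothesis form above is just "multiply prior by likelihood, then normalize." A sketch with illustrative priors and likelihoods (the numbers are my own, not from the slides):

```python
def posterior_over(priors, likelihoods):
    """P(h_i|d) = P(d|h_i) P(h_i) / sum_j P(d|h_j) P(h_j)."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    z = sum(joint)                 # P(d): the normalizing constant
    return [j / z for j in joint]

# Three mutually exclusive, exhaustive hypotheses.
post = posterior_over([0.5, 0.3, 0.2], [0.1, 0.7, 0.0])
print(post)  # sums to 1; the zero-likelihood hypothesis gets posterior 0
```

A hypothesis with zero likelihood is eliminated outright, however large its prior, which is the Sherlock Holmes maxim on the next slide.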
42 Sherlock Holmes
- "How often have I said to you that when you have eliminated the impossible, whatever remains, however improbable, must be the truth?" (The Sign of the Four)
- In Bayesian terms: if P(d|hj) = 0 for every hypothesis but one, then P(hi|d) = 1 for the remaining hi, no matter how low its prior P(hi).
46 Plan for tonight
- Why be Bayesian?
- Informal introduction to learning as probabilistic inference
- Formal introduction to probabilistic inference
- A little bit of mathematical psychology
- An introduction to Bayes nets
47 Representativeness in reasoning
- Which sequence is more likely to be produced by flipping a fair coin?
  - HHTHT
  - HHHHH
48 A reasoning fallacy
- Kahneman and Tversky: people judge the probability of an outcome based on the extent to which it is representative of the generating process.
49 Predictive versus inductive reasoning
- Hypothesis H, data D.
- Prediction: P(D|H), the probability of the data given the hypothesis.
- Induction: P(H|D), the probability of the hypothesis given the data.
52 Bayes' rule in odds form
- P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] x [P(H1) / P(H2)]
- D: data
- H1, H2: models
- P(H1|D): posterior probability that model 1 generated the data
- P(D|H1): likelihood of the data given model 1
- P(H1): prior probability that model 1 generated the data
53 Bayesian analysis of coin flipping
- D = HHTHT
- H1: fair coin; H2: trick all-heads coin.
- P(D|H1) = 1/32; P(H1) = 999/1000
- P(D|H2) = 0; P(H2) = 1/1000
- P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] x [P(H1) / P(H2)] = infinity
54 Bayesian analysis of coin flipping
- D = HHHHH
- H1: fair coin; H2: trick all-heads coin.
- P(D|H1) = 1/32; P(H1) = 999/1000
- P(D|H2) = 1; P(H2) = 1/1000
- P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] x [P(H1) / P(H2)] = 999/32 ≈ 30
55 Bayesian analysis of coin flipping
- D = HHHHHHHHHH
- H1: fair coin; H2: trick all-heads coin.
- P(D|H1) = 1/1024; P(H1) = 999/1000
- P(D|H2) = 1; P(H2) = 1/1000
- P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] x [P(H1) / P(H2)] = 999/1024 ≈ 1
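The three coin analyses above can be reproduced with one small function. The likelihood model (fair coin versus a deterministic all-heads coin) and the 999:1 prior are taken from the slides; the function name is my own.

```python
def posterior_odds(data, p_h1=999/1000, p_h2=1/1000):
    """Posterior odds P(H1|D)/P(H2|D): fair coin (H1) vs all-heads coin (H2).

    data is a string of 'H' and 'T' outcomes, e.g. 'HHTHT'."""
    like_h1 = 0.5 ** len(data)                    # fair coin: each flip 1/2
    like_h2 = 1.0 if set(data) <= {'H'} else 0.0  # trick coin: heads only
    if like_h2 == 0.0:
        return float('inf')                       # trick coin eliminated
    return (like_h1 / like_h2) * (p_h1 / p_h2)

print(posterior_odds('HHTHT'))        # inf: a tail rules out the trick coin
print(posterior_odds('HHHHH'))        # 999/32, roughly 30:1 for fair
print(posterior_odds('HHHHHHHHHH'))   # 999/1024, roughly even odds
```

Each extra head halves the odds in favor of the fair coin, so after ten heads the 999:1 prior has been almost exactly cancelled.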
56 The role of theories
- The fact that HHTHT looks representative of a fair coin and HHHHH does not reflects our implicit theories of how the world works.
- Easy to imagine how a trick all-heads coin could work: high prior probability.
- Hard to imagine how a trick HHTHT coin could work: low prior probability.
57 Plan for tonight
- Why be Bayesian?
- Informal introduction to learning as probabilistic inference
- Formal introduction to probabilistic inference
- A little bit of mathematical psychology
- An introduction to Bayes nets
58 Scaling up
- Three binary variables: Cavity, Toothache, Catch (whether the dentist's probe catches in your tooth).
59 Scaling up
- Three binary variables: Cavity, Toothache, Catch (whether the dentist's probe catches in your tooth).
- With n pieces of evidence, we need 2^(n+1) conditional probabilities.
- Here n = 2. Realistically, many more: X-ray, diet, oral hygiene, personality, . . .
60 Conditional independence
- All three variables are dependent, but Toothache and Catch are independent given the presence or absence of Cavity.
- Both Toothache and Catch are caused by Cavity, but via independent causal mechanisms.
- In probabilistic terms: P(Toothache, Catch | Cavity) = P(Toothache | Cavity) P(Catch | Cavity).
- With n pieces of evidence x1, . . . , xn, we need only 2n conditional probabilities: P(xi | Cavity) and P(xi | ¬Cavity) for each xi.
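The parameter-count contrast between slides 59 and 60 is easy to tabulate. The exact counts depend on bookkeeping conventions (whether the prior P(Cavity) is included, and that the rows of each table must sum to one); the convention below is one reasonable choice.

```python
def full_joint_params(n):
    """Free parameters in a full joint over Cavity plus n binary
    evidence variables: 2^(n+1) entries, minus 1 for normalization."""
    return 2 ** (n + 1) - 1

def cond_indep_params(n):
    """With evidence conditionally independent given Cavity:
    P(Cavity), plus P(x_i|Cavity) and P(x_i|no Cavity) per variable."""
    return 2 * n + 1

for n in (2, 10, 30):
    print(n, full_joint_params(n), cond_indep_params(n))
# exponential growth vs linear growth in the number of evidence variables
```

This linear-versus-exponential gap is the practical argument for Bayes nets in the slides that follow.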
61 A simple Bayes net
- Graphical representation of relations between a set of random variables.
- Causal interpretation: independent local mechanisms.
- Probabilistic interpretation: factorizing complex terms, e.g., P(Cavity, Toothache, Catch) = P(Cavity) P(Toothache | Cavity) P(Catch | Cavity).
62 A more complex system
- (Network over: Battery, Radio, Ignition, Gas, Starts, On time to work)
- Joint distribution sufficient for any inference.
- General inference algorithm: local message passing.
65 Explaining away
- Assume grass will be wet if and only if it rained last night, or if the sprinklers were left on.
66 Explaining away
- Compute the probability it rained last night, given that the grass is wet: P(Rain | Wet) > P(Rain). Observing wet grass raises the probability of rain.
71 Explaining away
- Compute the probability it rained last night, given that the grass is wet and the sprinklers were left on: the sprinkler already explains the wet grass, so the evidence for rain goes away.
73 Explaining away
- Discounting to the prior probability: once the sprinkler is known to have been on, the wet grass provides no additional evidence for rain, and P(Rain | Wet, Sprinkler) falls back to P(Rain).
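Explaining away can be verified by brute-force enumeration under the slide's assumption that Wet = Rain OR Sprinkler (deterministic). The two prior values are illustrative choices of mine, not numbers from the slides.

```python
P_RAIN, P_SPRINKLER = 0.3, 0.5  # assumed independent priors (illustrative)

def p_rain_given(wet=True, sprinkler=None):
    """P(Rain | Wet = wet [, Sprinkler = sprinkler]) by enumerating the
    four (rain, sprinkler) worlds consistent with the evidence."""
    num = den = 0.0
    for rain in (True, False):
        for spr in (True, False):
            if sprinkler is not None and spr != sprinkler:
                continue            # inconsistent with observed sprinkler
            if (rain or spr) != wet:
                continue            # inconsistent with Wet = Rain OR Sprinkler
            p = (P_RAIN if rain else 1 - P_RAIN) * \
                (P_SPRINKLER if spr else 1 - P_SPRINKLER)
            den += p
            if rain:
                num += p
    return num / den

print(p_rain_given(wet=True))                  # raised above the 0.3 prior
print(p_rain_given(wet=True, sprinkler=True))  # discounted back to 0.3
```

With the deterministic OR, learning the sprinkler was on discounts rain exactly back to its prior, which is slide 73's point.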
74 Contrast with spreading activation
- (Nodes: Rain, Sprinkler, Grass Wet)
- Excitatory links: Rain -> Wet, Sprinkler -> Wet.
- Observing rain, Wet becomes more active.
- Observing grass wet, Rain and Sprinkler become more active.
- Observing grass wet and sprinkler, Rain cannot become less active. No explaining away!
75 Contrast with spreading activation
- (Nodes: Rain, Sprinkler, Grass Wet)
- Excitatory links: Rain -> Wet, Sprinkler -> Wet.
- Inhibitory link: Rain -> Sprinkler.
- Observing grass wet, Rain and Sprinkler become more active.
- Observing grass wet and sprinkler, Rain becomes less active: explaining away.
76 Contrast with spreading activation
- (Nodes: Rain, Burst pipe, Sprinkler, Grass Wet)
- Each new variable requires more inhibitory connections.
- Interactions between variables are not causal.
- Not modular: whether a connection exists depends on what other connections exist, in non-transparent ways.
- Big holism problem.
- Combinatorial explosion.
77 Causality and the Markov property
- Markov property: any variable is conditionally independent of its non-descendants, given its parents.
- Example: in the dentist network, Ache and Catch are conditionally independent given their common parent, Cavity.
83 Causality and the Markov property
- (Network over: Cavity, Ache, Catch)
- Suppose we get the direction of causality wrong, thinking that symptoms cause diseases: Ache -> Cavity <- Catch.
- Does not capture the correlation between symptoms: falsely implies P(Ache, Catch) = P(Ache) P(Catch).
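The failure just described can be checked numerically: in the true model Cavity -> Ache and Cavity -> Catch, the symptoms are correlated marginally, so any model that implies P(Ache, Catch) = P(Ache) P(Catch) is wrong. All conditional probabilities below are illustrative assumptions, not numbers from the slides.

```python
P_CAVITY = 0.1
P_ACHE  = {True: 0.9, False: 0.05}  # P(Ache  | Cavity = cav)
P_CATCH = {True: 0.8, False: 0.1}   # P(Catch | Cavity = cav)

def joint(ache, catch):
    """P(Ache = ache, Catch = catch), marginalizing over Cavity in the
    true causal model Cavity -> Ache, Cavity -> Catch."""
    total = 0.0
    for cav, p_cav in ((True, P_CAVITY), (False, 1 - P_CAVITY)):
        pa = P_ACHE[cav] if ache else 1 - P_ACHE[cav]
        pc = P_CATCH[cav] if catch else 1 - P_CATCH[cav]
        total += p_cav * pa * pc
    return total

p_ache  = joint(True, True) + joint(True, False)
p_catch = joint(True, True) + joint(False, True)
# Marginally, the symptoms are NOT independent:
print(joint(True, True), p_ache * p_catch)  # 0.0765 vs about 0.023
```

The common cause Cavity induces the marginal correlation, which the reversed Ache -> Cavity <- Catch model (with no Ache-Catch arrow) cannot represent.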
84 Causality and the Markov property
- (Network over: Cavity, Ache, Catch)
- With the direction of causality wrong, inserting a new arrow (Ache -> Catch) allows us to capture this correlation.
- But this model is too complex: we do not believe that Ache directly influences Catch once Cavity is known.
85 Causality and the Markov property
- (Network over: Cavity, Ache, Catch, X-ray)
- With the direction of causality wrong, new symptoms require a combinatorial proliferation of new arrows. Too general, not modular, holism, yuck . . .
86 Still to come
- Applications to models of categorization.
- More on the relation between causality and probability.
- Learning causal graph structures.
- Learning causal abstractions ("diseases cause symptoms").
- What's missing.
87 The end
88 Mathcamp data (raw)
89 Mathcamp data (collapsed over parity)
90 Zenith radio data (collapsed over parity)