Title: CHAPTER 3: Bayesian Decision Theory
Slide 1: Chapter 3: Bayesian Decision Theory
Slides based on lecture slides by E. Alpaydin (Bogazici University) and on a tutorial by Josh Tenenbaum and Tom Griffiths (MIT).
Slide 2: Everyday inductive leaps
- Learning concepts and words from examples
[Figure: three example images, each labeled "horse"]
Slide 3: Inductive reasoning
Input: premises and a conclusion.
Task: judge how likely the conclusion is to be true, given that the premises are true.
Slide 4: Inferring causal relations
Input:

        Took vitamin B23   Headache
Day 1   yes                no
Day 2   yes                yes
Day 3   no                 yes
Day 4   yes                no
...     ...                ...

Does vitamin B23 cause headaches?
Task: judge the probability of a causal link given several joint observations.
Slide 5: Everyday inductive leaps
- How can we learn so much about . . .
  - properties of natural kinds
  - meanings of words
  - future outcomes of a dynamic process
  - causal laws governing a domain
  - . . .
- . . . from such limited data?
Slide 6: The Challenge
- How do we generalize successfully from very limited data?
  - Just one or a few examples
  - Often only positive examples
- Machine learning and statistics focus on generalization from many examples, both positive and negative.
Slide 7: Why Bayes?
- A framework for explaining cognition:
  - how people can learn so much from such limited data
  - strong quantitative models with minimal ad hoc assumptions
  - ...
- A framework for understanding how structured knowledge and statistical inference interact:
  - how simplicity trades off with fit to the data in evaluating structural hypotheses (Occam's razor)
  - ...
Slide 8: Bayesian models of inductive learning: some recent history
- Shepard (1987): analysis of one-shot stimulus generalization, to explain the universal exponential law.
- Anderson (1990): models of categorization and causal induction.
- Oaksford and Chater (1994): model of conditional reasoning (the Wason selection task).
- Heit (1998): framework for category-based inductive reasoning.
Slide 9: Bayes' Rule

P(h|D) = P(D|h) P(h) / P(D)

(posterior = likelihood x prior / evidence)
Slide 10: The origin of Bayes' rule
- A simple consequence of using probability to represent degrees of belief.
- For any two random variables: P(h, D) = P(h|D) P(D) = P(D|h) P(h), hence P(h|D) = P(D|h) P(h) / P(D).
Slide 11: Rational statistical inference (Bayes)

P(h|D) = P(D|h) P(h) / Σ_{h' in H} P(D|h') P(h')

(the denominator sums over the space of hypotheses H)
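The inference on this slide can be sketched in a few lines of code: compute P(D|h) P(h) for every hypothesis and normalize by the sum over the hypothesis space. The three coin-bias hypotheses and the uniform prior below are illustrative values, not from the slides.

```python
# A minimal sketch of rational statistical inference (Bayes' rule)
# over a toy hypothesis space of coin biases.

def posterior(hypotheses, prior, likelihood, data):
    """Compute P(h|D) for every h by normalizing P(D|h) P(h) over H."""
    unnorm = {h: likelihood(h, data) * prior[h] for h in hypotheses}
    evidence = sum(unnorm.values())   # the sum over the hypothesis space
    return {h: v / evidence for h, v in unnorm.items()}

# Hypotheses: the coin's probability of coming up heads (illustrative).
hypotheses = [0.3, 0.5, 0.9]
prior = {h: 1.0 / len(hypotheses) for h in hypotheses}   # uniform prior

def likelihood(h, data):
    """P(D|h) for an i.i.d. sequence of 'H'/'T' outcomes."""
    p = 1.0
    for flip in data:
        p *= h if flip == 'H' else (1.0 - h)
    return p

post = posterior(hypotheses, prior, likelihood, "HHTHT")
```

For the sequence HHTHT, the fair-coin hypothesis ends up with the highest posterior, as one would expect.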
Slide 12: Theory-Based Bayesian Models
- Rational statistical inference (Bayes).
- Learners' domain theories generate their hypothesis space H and prior p(h):
  - well matched to the structure of the natural world
  - learnable from limited data
Slide 13: Coin Flipping Example
Slide 14: Coin flipping
HHTHT
HHHHH
What process produced these sequences?
Slide 15: Hypotheses in Bayesian inference
- Hypotheses H refer to processes that could have generated the data D.
- Bayesian inference provides a distribution over these hypotheses, given D.
- P(D|H) is the probability of D being generated by the process identified by H.
- Hypotheses H are mutually exclusive: only one process could have generated D.
Slide 16: Why represent degrees of belief with probabilities?
- Consistency and worst-case error bounds.
- Necessary to cohere with common sense: if your beliefs do not accord with the laws of probability, then you can always be out-gambled by someone whose beliefs do so accord.
- A common currency for combining prior knowledge and the lessons of experience.
Slide 17: Hypotheses in coin flipping
Describe processes by which D could be generated.
D = HHTHT
- Fair coin, P(H) = 0.5
- Coin with P(H) = p
- Markov model
- Hidden Markov model
- ...
These are statistical generative models.
Slide 18: Representing generative models
- Graphical model notation: Pearl (1988), Jordan (1998).
- Variables are nodes; edges indicate dependency.
- Directed edges show the causal process of data generation.
Slide 19: Models with latent structure
- Not all nodes in a graphical model need to be observed.
- Some variables reflect latent structure: used in generating D, but unobserved.
Slide 20: Coin flipping
- Comparing two simple hypotheses: P(H) = 0.5 vs. P(H) = 1.0
- Comparing simple and complex hypotheses: P(H) = 0.5 vs. P(H) = p
- Comparing infinitely many hypotheses: P(H) = p
- Psychology: representativeness
Slide 21: Comparing two simple hypotheses
- Contrast simple hypotheses:
  - H1: fair coin, P(H) = 0.5
  - H2: always heads, P(H) = 1.0
- Bayes' rule: with two hypotheses, use the odds form.
Slide 22: Bayes' rule in odds form

P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] x [P(H1) / P(H2)]

- D: data
- H1, H2: models
- P(H1|D): posterior probability that H1 generated the data
- P(D|H1): likelihood of the data under model H1
- P(H1): prior probability that H1 generated the data
Slide 23: Coin flipping
HHTHT
HHHHH
What process produced these sequences?
Slide 24: Comparing two simple hypotheses

P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] x [P(H1) / P(H2)]

- D = HHTHT
- H1: fair coin; H2: always heads
- P(D|H1) = 1/2^5, P(H1) = 999/1000
- P(D|H2) = 0, P(H2) = 1/1000
- P(H1|D) / P(H2|D) = infinity
Slide 25: Comparing two simple hypotheses

P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] x [P(H1) / P(H2)]

- D = HHHHH
- H1: fair coin; H2: always heads
- P(D|H1) = 1/2^5, P(H1) = 999/1000
- P(D|H2) = 1, P(H2) = 1/1000
- P(H1|D) / P(H2|D) ≈ 30
Slide 26: Comparing two simple hypotheses

P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] x [P(H1) / P(H2)]

- D = HHHHHHHHHH
- H1: fair coin; H2: always heads
- P(D|H1) = 1/2^10, P(H1) = 999/1000
- P(D|H2) = 1, P(H2) = 1/1000
- P(H1|D) / P(H2|D) ≈ 1
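The three posterior-odds calculations above can be reproduced directly from the odds form of Bayes' rule, using the same priors as the slides (999/1000 for the fair coin, 1/1000 for always-heads):

```python
# Posterior odds P(H1|D) / P(H2|D) for 'fair coin' (H1) vs.
# 'always heads' (H2), with the priors from the slides.

def posterior_odds(n_heads, n_tails):
    """Odds of H1 over H2 after seeing n_heads heads and n_tails tails."""
    like_fair = 0.5 ** (n_heads + n_tails)       # P(D|H1)
    like_always = 1.0 if n_tails == 0 else 0.0   # P(D|H2)
    prior_odds = (999 / 1000) / (1 / 1000)
    if like_always == 0.0:
        return float('inf')   # a single tail rules H2 out entirely
    return (like_fair / like_always) * prior_odds

print(posterior_odds(4, 1))    # D = HHTHT: infinity
print(posterior_odds(5, 0))    # D = HHHHH: about 31
print(posterior_odds(10, 0))   # D = ten heads: just under 1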
Slide 27: Comparing two simple hypotheses
- Bayes' rule tells us how to combine prior beliefs with new data:
  - top-down and bottom-up influences
- As a model of human inference, it:
  - predicts conclusions drawn from data
  - identifies the point at which prior beliefs are overwhelmed by new experiences
- More complex cases can be handled too, but we won't cover them in CS512.
Slide 28: Hypothesis Testing and Human Predictions
Slide 29: Bayesian Decision for Classification
Slide 30: Probability and Inference (Reminder)
- The result of tossing a coin is in {Heads, Tails}.
- Sample: X = {x^t}, t = 1, ..., N
- Prediction of the next toss: Heads if p_o > 1/2, Tails otherwise.
- Estimation: p_o ≈ #{Heads} / #{Tosses} = Σ_t x^t / N
- Random variable X in {Heads, Tails}, coded as x in {1, 0}.
- Bernoulli: P(X = x) = p_o^x (1 - p_o)^(1 - x)
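The estimator on this slide is just the relative frequency of heads. A minimal sketch, with a made-up sample of six tosses:

```python
# Maximum-likelihood estimation of the Bernoulli parameter p_o,
# following the slide: p_o = #heads / N. The sample is illustrative.

xs = [1, 0, 1, 1, 0, 1]      # x^t = 1 for Heads, 0 for Tails
p_hat = sum(xs) / len(xs)    # relative frequency of heads
prediction = 'Heads' if p_hat > 0.5 else 'Tails'
```

With four heads in six tosses, p_hat = 2/3 and the predicted next toss is Heads.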
Slide 31: Bayes' Rule

P(C|x) = p(x|C) P(C) / p(x)

(posterior = likelihood x prior / evidence)
Slide 32: Classification
- Credit scoring: inputs are income and savings; output is low-risk vs. high-risk.
- Input: x = [x1, x2]^T; output: C in {0, 1}.
- Prediction: choose C = 1 if P(C = 1 | x) > 0.5, else C = 0.
Slide 33: Bayes' Rule with K > 2 Classes
P(C_i|x) = p(x|C_i) P(C_i) / Σ_k p(x|C_k) P(C_k); choose C_i with the highest posterior.
Slide 34: Losses and Risks
- Actions: a_i (e.g., choose class C_i).
- Loss of a_i when the true state is C_k: λ_ik (the loss for choosing the i-th class when the true class is the k-th).
- Expected risk (Duda and Hart, 1973): R(a_i|x) = Σ_k λ_ik P(C_k|x); choose the action with minimum risk.
Slide 35: Losses and Risks: 0/1 Loss
With λ_ik = 0 if i = k and 1 otherwise, R(a_i|x) = 1 - P(C_i|x).
For minimum risk, choose the most probable class.
Slide 36: Losses and Risks: Reject
Add a reject action with fixed loss λ, 0 < λ < 1: choose C_i if P(C_i|x) > P(C_k|x) for all k ≠ i and P(C_i|x) > 1 - λ; otherwise reject.
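The minimum-risk decision with a reject option can be sketched as follows. The posterior values and the reject loss (0.3) are made-up illustrations; the rule itself follows the expected-risk definition, which under 0/1 loss reduces to comparing 1 - P(C_i|x) against the reject loss:

```python
# A minimal sketch of minimum-expected-risk classification with a
# reject action. Under 0/1 loss, R(choose i | x) = 1 - P(C_i|x),
# while rejecting costs a fixed loss 0 < reject_loss < 1.

def decide(posteriors, reject_loss):
    """Return the index of the minimum-risk class, or 'reject'."""
    best = max(range(len(posteriors)), key=lambda i: posteriors[i])
    if 1.0 - posteriors[best] < reject_loss:   # confident enough to commit
        return best
    return 'reject'                            # otherwise rejecting is cheaper

confident = decide([0.9, 0.1], 0.3)     # clear winner: choose class 0
uncertain = decide([0.55, 0.45], 0.3)   # too close: reject
```

Raising the reject loss makes rejection more expensive, so the classifier commits to a class more often.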
Slide 37: Discriminant Functions
Define g_i(x), i = 1, ..., K (e.g., g_i(x) = P(C_i|x)); choose C_i if g_i(x) = max_k g_k(x).
This partitions the input space into K decision regions R_1, ..., R_K.
Slide 38: K = 2 Classes
- Dichotomizer (K = 2) vs. polychotomizer (K > 2)
- g(x) = g1(x) - g2(x); choose C1 if g(x) > 0, else C2
- Log odds: log [P(C1|x) / P(C2|x)]
Slide 39: Bayesian Belief Nets
Slide 40: Conditional Independence
- Independence of two variables X and Y:
  - P(X, Y) = P(X) P(Y)
  - P(X|Y) = P(X)
- Conditional independence of two variables X and Y, given Z:
  - P(X, Y|Z) = P(X|Z) P(Y|Z)
  - P(X|Y, Z) = P(X|Z)
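The two definitions above can be checked numerically on a small joint distribution. The table below is built as P(Z) P(X|Z) P(Y|Z), so X and Y are conditionally independent given Z by construction; all the probability values are made up for illustration:

```python
# Numerically verifying P(X,Y|Z) = P(X|Z) P(Y|Z) on a toy joint table.
import itertools

pz = {0: 0.4, 1: 0.6}      # P(Z=z)
px_z = {0: 0.2, 1: 0.7}    # P(X=1 | Z=z)
py_z = {0: 0.5, 1: 0.9}    # P(Y=1 | Z=z)

def joint(x, y, z):
    """P(X=x, Y=y, Z=z), factored so that X ⊥ Y | Z holds."""
    px = px_z[z] if x else 1 - px_z[z]
    py = py_z[z] if y else 1 - py_z[z]
    return pz[z] * px * py

ok = True
for x, y, z in itertools.product([0, 1], repeat=3):
    p_xy_z = joint(x, y, z) / pz[z]                          # P(X,Y|Z)
    p_x_z = sum(joint(x, yy, z) for yy in (0, 1)) / pz[z]    # P(X|Z)
    p_y_z = sum(joint(xx, y, z) for xx in (0, 1)) / pz[z]    # P(Y|Z)
    ok = ok and abs(p_xy_z - p_x_z * p_y_z) < 1e-12
```

Marginally, X and Y here are *not* independent (both depend on Z); the factorization only holds once Z is conditioned on.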
Slide 41: Conditional Independence
Slide 42: The book's slides for Chp. 3
Slide 43: Causes and Bayes' Rule
Diagnostic inference: knowing that the grass is wet, what is the probability that rain is the cause?
P(R|W) = P(W|R) P(R) / P(W)
(The diagnostic direction inverts the causal direction.)
Slide 44: Causal vs. Diagnostic Inference
Causal inference: if the sprinkler is on, what is the probability that the grass is wet?
P(W|S) = P(W|R, S) P(R|S) + P(W|~R, S) P(~R|S)
       = P(W|R, S) P(R) + P(W|~R, S) P(~R)
       = 0.95 x 0.4 + 0.90 x 0.6 = 0.92
Diagnostic inference: if the grass is wet, what is the probability that the sprinkler is on?
P(S|W) = 0.35 > 0.2 = P(S), but P(S|R, W) = 0.21.
Explaining away: knowing that it has rained decreases the probability that the sprinkler is on.
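These numbers can be verified by enumeration over the rain/sprinkler/wet-grass network. P(R) = 0.4, P(S) = 0.2, and P(W|R,S) = 0.95 appear on the slides; the remaining table entries P(W|R,~S) = 0.90, P(W|~R,S) = 0.90, and P(W|~R,~S) = 0.10 are assumed here (they reproduce the slides' figures of 0.92, 0.35, and 0.21):

```python
# Inference by enumeration in the rain (R) / sprinkler (S) / wet grass (W)
# network. R and S are independent root causes; W depends on both.

P_R = 0.4
P_S = 0.2
P_W = {(1, 1): 0.95, (1, 0): 0.90, (0, 1): 0.90, (0, 0): 0.10}  # P(W=1|R,S)

def joint(r, s, w):
    """P(R=r, S=s, W=w) from the network factorization."""
    pr = P_R if r else 1 - P_R
    ps = P_S if s else 1 - P_S
    pw = P_W[(r, s)] if w else 1 - P_W[(r, s)]
    return pr * ps * pw

# Causal: P(W|S) -- sum out R (R and S are independent a priori).
p_w_given_s = sum(joint(r, 1, 1) for r in (0, 1)) / P_S
# Diagnostic: P(S|W).
p_w = sum(joint(r, s, 1) for r in (0, 1) for s in (0, 1))
p_s_given_w = sum(joint(r, 1, 1) for r in (0, 1)) / p_w
# Explaining away: P(S|R,W) drops below P(S|W).
p_s_given_rw = joint(1, 1, 1) / sum(joint(1, s, 1) for s in (0, 1))
```

Running this gives P(W|S) = 0.92, P(S|W) ≈ 0.35, and P(S|R,W) ≈ 0.21, matching the slide.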
Slide 45: Bayesian Networks: Causes
Causal inference:
P(W|C) = P(W|R, S) P(R, S|C) + P(W|R, ~S) P(R, ~S|C) + P(W|~R, S) P(~R, S|C) + P(W|~R, ~S) P(~R, ~S|C),
using the fact that P(R, S|C) = P(R|C) P(S|C).
Diagnostic: P(C|W) = ?
Slide 46: Bayesian Nets: Local structure
P(F|C) = ?
Slide 47: (no transcript)
Slide 48: (no transcript)
Slide 49: Bayesian Networks: Classification
Bayes' rule inverts the arc:
diagnostic: P(C|x) = p(x|C) P(C) / p(x)
Slide 50: Naive Bayes Classifier
- P(Cause|Effect_1, ..., Effect_N) ∝ P(Cause) Π_i P(Effect_i|Cause)
- P(Class|X_1, ..., X_N) ∝ P(Class) Π_i P(X_i|Class)
- Often used as a simplifying assumption (even when the Effect variables are not in fact conditionally independent given Cause): naive Bayes.
- Surprisingly good in practice.
Slide 51: Naive Bayes Classifier
Given C, the x_j are independent: p(x|C) = p(x1|C) p(x2|C) ... p(xd|C)
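The factored form above leads directly to a classifier: estimate P(C) and each P(x_j|C) by counting, then pick the class maximizing their product. A minimal sketch on a made-up spam/ham dataset (labels and features are illustrative; no smoothing is applied):

```python
# A minimal naive Bayes classifier: p(x|C) is factored as the product
# of per-feature conditionals, all estimated by relative frequency.
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (feature_tuple, class_label) pairs."""
    class_counts = Counter(c for _, c in examples)
    feat_counts = defaultdict(Counter)   # (class, feature_idx) -> value counts
    for x, c in examples:
        for j, v in enumerate(x):
            feat_counts[(c, j)][v] += 1
    n = len(examples)

    def predict(x):
        best, best_score = None, -1.0
        for c, cc in class_counts.items():
            score = cc / n                 # prior P(C)
            for j, v in enumerate(x):      # times each P(x_j|C)
                score *= feat_counts[(c, j)][v] / cc
            if score > best_score:
                best, best_score = c, score
        return best

    return predict

data = [((1, 0), 'spam'), ((1, 1), 'spam'), ((0, 0), 'ham'),
        ((0, 1), 'ham'), ((1, 1), 'spam')]
predict = train(data)
```

In practice one would add Laplace smoothing so that an unseen feature value does not zero out a whole class.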
Slide 52: Belief Propagation in BBNs
Slide 53: (no transcript)
Slide 54: Entering hard evidence: this is the simple case. What about belief update in the other direction?
Slide 55: Entering hard evidence: note that our belief in MartinLate is also increased. How does evidence propagate in belief networks?
Slide 56: Diverging connection: entering some evidence (hard or soft) about NormanLate is propagated to TrainStrike and MartinLate. If we had hard evidence about TrainStrike, any new evidence about NormanLate would not be propagated to MartinLate.
Slide 57: Diverging connection: if we had hard evidence about TrainStrike, any new evidence about NormanLate would not be propagated to MartinLate (the children are then conditionally independent given the parent).
Slide 58: (no transcript)
Slide 59: Converging connection: entering some evidence (hard or soft) about MartinLate is propagated to TrainStrike and Oversleep.
Slide 60: Converging connection: entering some evidence (hard or soft) about MartinLate is propagated to TrainStrike, Oversleep, and NormanLate.
Slide 61: Converging connection: if we have no information about MartinLate, Oversleep and TrainStrike are independent; no evidence is transmitted between them.
Slide 62: What about the other direction (when we have some evidence about C)? More lecture notes in Bayesian Belief Nets.doc.
Slide 63: Bayesian Networks: Inference
- Inference in BBNs is NP-hard in the general case; efficiency depends on the sparseness of the graph structure.
- Inference in singly connected networks (at most one undirected path between any two nodes), also called polytrees, is linear in the size of the network.
- For multiply connected networks, inference may take exponential time, even when the number of parents per node is small.
- Stochastic sampling can approximate inference.
- Belief propagation (Pearl, 1988) for trees.
- Junction trees (Lauritzen and Spiegelhalter, 1988): converting directed acyclic graphs to trees.
Slide 64:
- How to learn a network structure from experience?
- Network structure is given: a simple calculation of the conditionals.
- Network structure is not given: a trade-off between network complexity and accuracy over the training data.
- Some variable values are not observed: the EM algorithm.
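When the structure is given and all variables are observed, each conditional table is indeed a simple relative-frequency calculation, sketched here for one node with one parent. The variable names reuse the earlier TrainStrike/NormanLate example, and the observations are made up for illustration:

```python
# Estimating a conditional probability table by counting, given the
# network structure and fully observed data.
from collections import Counter

# Observations of (TrainStrike, NormanLate) pairs (toy data).
data = [(1, 1), (1, 1), (1, 0), (0, 0), (0, 0), (0, 1), (0, 0), (1, 1)]

parent_counts = Counter(p for p, _ in data)   # #(TrainStrike = p)
pair_counts = Counter(data)                   # #(TrainStrike = p, NormanLate = c)

def p_late_given_strike(strike):
    """Estimated P(NormanLate = 1 | TrainStrike = strike)."""
    return pair_counts[(strike, 1)] / parent_counts[strike]
```

Here the estimate is 3/4 when there is a strike and 1/4 otherwise; with hidden variables these counts become expected counts, which is what the EM algorithm computes.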
Slide 65: Decision Theory
Slide 66: (no transcript)
Slide 67: (no transcript)
Slide 68: (no transcript)
Slide 69: Influence Diagrams
[Figure: an influence diagram with a decision node, a utility node, and a chance node]
Slide 70: (no transcript)
Slide 71: (no transcript)
Slide 72: [Figure: a decision tree. Possible actions (test_j, no test, test_z, ...) lead to chance nodes A and B; each chance node branches over possible outcomes e_jk, and each action is evaluated by an expectation over its possible outcomes.]
Slide 73: (no transcript)
Slide 74: Association Rules
- Association rule X → Y (roughly meaning that X implies Y).
- Confidence of the rule X → Y: P(Y|X) = #{X and Y} / #{X}
- Support of the rule X → Y: P(X, Y) = #{X and Y} / N
- Apriori algorithm (Agrawal et al., 1996).
Slide 75: Skip Rest
Slide 76: (no transcript)
Slide 77: (no transcript)
Slide 78: (no transcript)