Title: CHAPTER 3: Bayesian Decision Theory
Slide 1: Chapter 3: Bayesian Decision Theory
Slides based on lecture slides by E. Alpaydin (Bogazici University) and on a tutorial by Josh Tenenbaum and Tom Griffiths (MIT).
Slide 2: Everyday inductive leaps
- Learning concepts and words from examples
[Figure: three example images, each labeled "horse"]
Slide 3: Inductive reasoning
Input: premises and a conclusion.
Task: judge how likely the conclusion is to be true, given that the premises are true.
Slide 4: Inferring causal relations
Input:

        Took vitamin B23   Headache
Day 1   yes                no
Day 2   yes                yes
Day 3   no                 yes
Day 4   yes                no
...     ...                ...

Does vitamin B23 cause headaches?
Task: judge the probability of a causal link given several joint observations.
Slide 5: Everyday inductive leaps
- How can we learn so much about . . .
  - properties of natural kinds
  - meanings of words
  - future outcomes of a dynamic process
  - causal laws governing a domain
  - . . .
- . . . from such limited data?
Slide 6: The Challenge
- How do we generalize successfully from very limited data?
  - Just one or a few examples
  - Often only positive examples
- Machine learning and statistics focus on generalization from many examples, both positive and negative.
Slide 7: Why Bayes?
- A framework for explaining cognition:
  - how people can learn so much from such limited data
  - strong quantitative models with minimal ad hoc assumptions
  - ...
- A framework for understanding how structured knowledge and statistical inference interact:
  - how simplicity trades off with fit to the data in evaluating structural hypotheses (Occam's razor)
  - ...
Slide 8: Bayesian models of inductive learning: some recent history
- Shepard (1987): analysis of one-shot stimulus generalization, to explain the universal exponential law.
- Anderson (1990): models of categorization and causal induction.
- Oaksford and Chater (1994): model of conditional reasoning (the Wason selection task).
- Heit (1998): framework for category-based inductive reasoning.
Slide 9: Bayes' Rule

P(h|D) = P(D|h) P(h) / P(D)

(posterior = likelihood x prior / evidence)
Slide 10: The origin of Bayes' rule
- A simple consequence of using probability to represent degrees of belief.
- For any two random variables: P(h, D) = P(h|D) P(D) = P(D|h) P(h), hence P(h|D) = P(D|h) P(h) / P(D).
Slide 11: Rational statistical inference (Bayes)

P(h|D) = P(D|h) P(h) / Σ_{h' in H} P(D|h') P(h')

(the denominator sums over the space of hypotheses H)
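The inference on this slide can be sketched in a few lines of code: compute P(D|h) P(h) for every hypothesis and normalize by the sum over the hypothesis space. The three coin-bias hypotheses and the uniform prior below are illustrative values, not from the slides.

```python
# A minimal sketch of rational statistical inference (Bayes' rule)
# over a toy hypothesis space of coin biases.

def posterior(hypotheses, prior, likelihood, data):
    """Compute P(h|D) for every h by normalizing P(D|h) P(h) over H."""
    unnorm = {h: likelihood(h, data) * prior[h] for h in hypotheses}
    evidence = sum(unnorm.values())   # the sum over the hypothesis space
    return {h: v / evidence for h, v in unnorm.items()}

# Hypotheses: the coin's probability of coming up heads (illustrative).
hypotheses = [0.3, 0.5, 0.9]
prior = {h: 1.0 / len(hypotheses) for h in hypotheses}   # uniform prior

def likelihood(h, data):
    """P(D|h) for an i.i.d. sequence of 'H'/'T' outcomes."""
    p = 1.0
    for flip in data:
        p *= h if flip == 'H' else (1.0 - h)
    return p

post = posterior(hypotheses, prior, likelihood, "HHTHT")
```

For the sequence HHTHT, the fair-coin hypothesis ends up with the highest posterior, as one would expect.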
Slide 12: Theory-Based Bayesian Models
- Rational statistical inference (Bayes).
- Learners' domain theories generate their hypothesis space H and prior p(h):
  - well matched to the structure of the natural world
  - learnable from limited data
Slide 13: Coin Flipping Example
Slide 14: Coin flipping
HHTHT
HHHHH
What process produced these sequences?
Slide 15: Hypotheses in Bayesian inference
- Hypotheses H refer to processes that could have generated the data D.
- Bayesian inference provides a distribution over these hypotheses, given D.
- P(D|H) is the probability of D being generated by the process identified by H.
- Hypotheses H are mutually exclusive: only one process could have generated D.
Slide 16: Why represent degrees of belief with probabilities?
- Consistency and worst-case error bounds.
- Necessary to cohere with common sense: if your beliefs do not accord with the laws of probability, then you can always be out-gambled by someone whose beliefs do so accord.
- A common currency for combining prior knowledge and the lessons of experience.
Slide 17: Hypotheses in coin flipping
Describe processes by which D could be generated.
D = HHTHT
- Fair coin, P(H) = 0.5
- Coin with P(H) = p
- Markov model
- Hidden Markov model
- ...
These are statistical generative models.
Slide 18: Representing generative models
- Graphical model notation: Pearl (1988), Jordan (1998).
- Variables are nodes; edges indicate dependency.
- Directed edges show the causal process of data generation.
Slide 19: Models with latent structure
- Not all nodes in a graphical model need to be observed.
- Some variables reflect latent structure: used in generating D, but unobserved.
Slide 20: Coin flipping
- Comparing two simple hypotheses: P(H) = 0.5 vs. P(H) = 1.0
- Comparing simple and complex hypotheses: P(H) = 0.5 vs. P(H) = p
- Comparing infinitely many hypotheses: P(H) = p
- Psychology: representativeness
Slide 21: Comparing two simple hypotheses
- Contrast simple hypotheses:
  - H1: fair coin, P(H) = 0.5
  - H2: always heads, P(H) = 1.0
- Bayes' rule: with two hypotheses, use the odds form.
Slide 22: Bayes' rule in odds form

P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] x [P(H1) / P(H2)]

- D: data
- H1, H2: models
- P(H1|D): posterior probability that H1 generated the data
- P(D|H1): likelihood of the data under model H1
- P(H1): prior probability that H1 generated the data
Slide 23: Coin flipping
HHTHT
HHHHH
What process produced these sequences?
Slide 24: Comparing two simple hypotheses

P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] x [P(H1) / P(H2)]

- D = HHTHT
- H1: fair coin; H2: always heads
- P(D|H1) = 1/2^5, P(H1) = 999/1000
- P(D|H2) = 0, P(H2) = 1/1000
- P(H1|D) / P(H2|D) = infinity
Slide 25: Comparing two simple hypotheses

P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] x [P(H1) / P(H2)]

- D = HHHHH
- H1: fair coin; H2: always heads
- P(D|H1) = 1/2^5, P(H1) = 999/1000
- P(D|H2) = 1, P(H2) = 1/1000
- P(H1|D) / P(H2|D) ≈ 30
Slide 26: Comparing two simple hypotheses

P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] x [P(H1) / P(H2)]

- D = HHHHHHHHHH
- H1: fair coin; H2: always heads
- P(D|H1) = 1/2^10, P(H1) = 999/1000
- P(D|H2) = 1, P(H2) = 1/1000
- P(H1|D) / P(H2|D) ≈ 1
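The three posterior-odds calculations above can be reproduced directly from the odds form of Bayes' rule, using the same priors as the slides (999/1000 for the fair coin, 1/1000 for always-heads):

```python
# Posterior odds P(H1|D) / P(H2|D) for 'fair coin' (H1) vs.
# 'always heads' (H2), with the priors from the slides.

def posterior_odds(n_heads, n_tails):
    """Odds of H1 over H2 after seeing n_heads heads and n_tails tails."""
    like_fair = 0.5 ** (n_heads + n_tails)       # P(D|H1)
    like_always = 1.0 if n_tails == 0 else 0.0   # P(D|H2)
    prior_odds = (999 / 1000) / (1 / 1000)
    if like_always == 0.0:
        return float('inf')   # a single tail rules H2 out entirely
    return (like_fair / like_always) * prior_odds

print(posterior_odds(4, 1))    # D = HHTHT: infinity
print(posterior_odds(5, 0))    # D = HHHHH: about 31
print(posterior_odds(10, 0))   # D = ten heads: just under 1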
Slide 27: Comparing two simple hypotheses
- Bayes' rule tells us how to combine prior beliefs with new data:
  - top-down and bottom-up influences
- As a model of human inference, it:
  - predicts conclusions drawn from data
  - identifies the point at which prior beliefs are overwhelmed by new experiences
- More complex cases can be handled too, but we won't cover them in CS512.
Slide 28: Hypothesis Testing and Human Predictions
Slide 29: Bayesian Decision for Classification
Slide 30: Probability and Inference (Reminder)
- The result of tossing a coin is in {Heads, Tails}.
- Sample: X = {x^t}, t = 1, ..., N
- Prediction of the next toss: Heads if p_o > 1/2, Tails otherwise.
- Estimation: p_o ≈ #{Heads} / #{Tosses} = Σ_t x^t / N
- Random variable X in {Heads, Tails}, coded as x in {1, 0}.
- Bernoulli: P(X = x) = p_o^x (1 - p_o)^(1 - x)
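The estimator on this slide is just the relative frequency of heads. A minimal sketch, with a made-up sample of six tosses:

```python
# Maximum-likelihood estimation of the Bernoulli parameter p_o,
# following the slide: p_o = #heads / N. The sample is illustrative.

xs = [1, 0, 1, 1, 0, 1]      # x^t = 1 for Heads, 0 for Tails
p_hat = sum(xs) / len(xs)    # relative frequency of heads
prediction = 'Heads' if p_hat > 0.5 else 'Tails'
```

With four heads in six tosses, p_hat = 2/3 and the predicted next toss is Heads.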
Slide 31: Bayes' Rule

P(C|x) = p(x|C) P(C) / p(x)

(posterior = likelihood x prior / evidence)
Slide 32: Classification
- Credit scoring: inputs are income and savings; output is low-risk vs. high-risk.
- Input: x = [x1, x2]^T; output: C in {0, 1}.
- Prediction: choose C = 1 if P(C = 1 | x) > 0.5, else C = 0.
Slide 33: Bayes' Rule with K > 2 Classes
P(C_i|x) = p(x|C_i) P(C_i) / Σ_k p(x|C_k) P(C_k); choose C_i with the highest posterior.
Slide 34: Losses and Risks
- Actions: a_i (e.g., choose class C_i).
- Loss of a_i when the true state is C_k: λ_ik (the loss for choosing the i-th class when the true class is the k-th).
- Expected risk (Duda and Hart, 1973): R(a_i|x) = Σ_k λ_ik P(C_k|x); choose the action with minimum risk.
Slide 35: Losses and Risks: 0/1 Loss
With λ_ik = 0 if i = k and 1 otherwise, R(a_i|x) = 1 - P(C_i|x).
For minimum risk, choose the most probable class.
Slide 36: Losses and Risks: Reject
Add a reject action with fixed loss λ, 0 < λ < 1: choose C_i if P(C_i|x) > P(C_k|x) for all k ≠ i and P(C_i|x) > 1 - λ; otherwise reject.
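The minimum-risk decision with a reject option can be sketched as follows. The posterior values and the reject loss (0.3) are made-up illustrations; the rule itself follows the expected-risk definition, which under 0/1 loss reduces to comparing 1 - P(C_i|x) against the reject loss:

```python
# A minimal sketch of minimum-expected-risk classification with a
# reject action. Under 0/1 loss, R(choose i | x) = 1 - P(C_i|x),
# while rejecting costs a fixed loss 0 < reject_loss < 1.

def decide(posteriors, reject_loss):
    """Return the index of the minimum-risk class, or 'reject'."""
    best = max(range(len(posteriors)), key=lambda i: posteriors[i])
    if 1.0 - posteriors[best] < reject_loss:   # confident enough to commit
        return best
    return 'reject'                            # otherwise rejecting is cheaper

confident = decide([0.9, 0.1], 0.3)     # clear winner: choose class 0
uncertain = decide([0.55, 0.45], 0.3)   # too close: reject
```

Raising the reject loss makes rejection more expensive, so the classifier commits to a class more often.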
Slide 37: Discriminant Functions
Define g_i(x), i = 1, ..., K (e.g., g_i(x) = P(C_i|x)); choose C_i if g_i(x) = max_k g_k(x).
This partitions the input space into K decision regions R_1, ..., R_K.
Slide 38: K = 2 Classes
- Dichotomizer (K = 2) vs. polychotomizer (K > 2)
- g(x) = g1(x) - g2(x); choose C1 if g(x) > 0, else C2
- Log odds: log [P(C1|x) / P(C2|x)]
Slide 39: Bayesian Belief Nets
Slide 40: Conditional Independence
- Independence of two variables X and Y:
  - P(X, Y) = P(X) P(Y)
  - P(X|Y) = P(X)
- Conditional independence of two variables X and Y, given Z:
  - P(X, Y|Z) = P(X|Z) P(Y|Z)
  - P(X|Y, Z) = P(X|Z)
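The two definitions above can be checked numerically on a small joint distribution. The table below is built as P(Z) P(X|Z) P(Y|Z), so X and Y are conditionally independent given Z by construction; all the probability values are made up for illustration:

```python
# Numerically verifying P(X,Y|Z) = P(X|Z) P(Y|Z) on a toy joint table.
import itertools

pz = {0: 0.4, 1: 0.6}      # P(Z=z)
px_z = {0: 0.2, 1: 0.7}    # P(X=1 | Z=z)
py_z = {0: 0.5, 1: 0.9}    # P(Y=1 | Z=z)

def joint(x, y, z):
    """P(X=x, Y=y, Z=z), factored so that X ⊥ Y | Z holds."""
    px = px_z[z] if x else 1 - px_z[z]
    py = py_z[z] if y else 1 - py_z[z]
    return pz[z] * px * py

ok = True
for x, y, z in itertools.product([0, 1], repeat=3):
    p_xy_z = joint(x, y, z) / pz[z]                          # P(X,Y|Z)
    p_x_z = sum(joint(x, yy, z) for yy in (0, 1)) / pz[z]    # P(X|Z)
    p_y_z = sum(joint(xx, y, z) for xx in (0, 1)) / pz[z]    # P(Y|Z)
    ok = ok and abs(p_xy_z - p_x_z * p_y_z) < 1e-12
```

Marginally, X and Y here are *not* independent (both depend on Z); the factorization only holds once Z is conditioned on.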
Slide 41: Conditional Independence
Slide 42: The book's slides for Chp. 3
Slide 43: Causes and Bayes' Rule
Diagnostic inference: knowing that the grass is wet, what is the probability that rain is the cause?
P(R|W) = P(W|R) P(R) / P(W)
(The diagnostic direction inverts the causal direction.)
Slide 44: Causal vs. Diagnostic Inference
Causal inference: if the sprinkler is on, what is the probability that the grass is wet?
P(W|S) = P(W|R, S) P(R|S) + P(W|~R, S) P(~R|S)
       = P(W|R, S) P(R) + P(W|~R, S) P(~R)
       = 0.95 x 0.4 + 0.90 x 0.6 = 0.92
Diagnostic inference: if the grass is wet, what is the probability that the sprinkler is on?
P(S|W) = 0.35 > 0.2 = P(S), but P(S|R, W) = 0.21.
Explaining away: knowing that it has rained decreases the probability that the sprinkler is on.
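These numbers can be verified by enumeration over the rain/sprinkler/wet-grass network. P(R) = 0.4, P(S) = 0.2, and P(W|R,S) = 0.95 appear on the slides; the remaining table entries P(W|R,~S) = 0.90, P(W|~R,S) = 0.90, and P(W|~R,~S) = 0.10 are assumed here (they reproduce the slides' figures of 0.92, 0.35, and 0.21):

```python
# Inference by enumeration in the rain (R) / sprinkler (S) / wet grass (W)
# network. R and S are independent root causes; W depends on both.

P_R = 0.4
P_S = 0.2
P_W = {(1, 1): 0.95, (1, 0): 0.90, (0, 1): 0.90, (0, 0): 0.10}  # P(W=1|R,S)

def joint(r, s, w):
    """P(R=r, S=s, W=w) from the network factorization."""
    pr = P_R if r else 1 - P_R
    ps = P_S if s else 1 - P_S
    pw = P_W[(r, s)] if w else 1 - P_W[(r, s)]
    return pr * ps * pw

# Causal: P(W|S) -- sum out R (R and S are independent a priori).
p_w_given_s = sum(joint(r, 1, 1) for r in (0, 1)) / P_S
# Diagnostic: P(S|W).
p_w = sum(joint(r, s, 1) for r in (0, 1) for s in (0, 1))
p_s_given_w = sum(joint(r, 1, 1) for r in (0, 1)) / p_w
# Explaining away: P(S|R,W) drops below P(S|W).
p_s_given_rw = joint(1, 1, 1) / sum(joint(1, s, 1) for s in (0, 1))
```

Running this gives P(W|S) = 0.92, P(S|W) ≈ 0.35, and P(S|R,W) ≈ 0.21, matching the slide.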
Slide 45: Bayesian Networks: Causes
Causal inference:
P(W|C) = P(W|R, S) P(R, S|C) + P(W|R, ~S) P(R, ~S|C) + P(W|~R, S) P(~R, S|C) + P(W|~R, ~S) P(~R, ~S|C),
using the fact that P(R, S|C) = P(R|C) P(S|C).
Diagnostic: P(C|W) = ?
Slide 46: Bayesian Nets: Local structure
P(F|C) = ?
Slide 47: (no transcript)
Slide 48: (no transcript)
Slide 49: Bayesian Networks: Classification
Bayes' rule inverts the arc:
diagnostic: P(C|x) = p(x|C) P(C) / p(x)
Slide 50: Naive Bayes Classifier
- P(Cause|Effect_1, ..., Effect_N) ∝ P(Cause) Π_i P(Effect_i|Cause)
- P(Class|X_1, ..., X_N) ∝ P(Class) Π_i P(X_i|Class)
- Often used as a simplifying assumption (even when the Effect variables are not in fact conditionally independent given Cause): naive Bayes.
- Surprisingly good in practice.
Slide 51: Naive Bayes Classifier
Given C, the x_j are independent: p(x|C) = p(x1|C) p(x2|C) ... p(xd|C)
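The factored form above leads directly to a classifier: estimate P(C) and each P(x_j|C) by counting, then pick the class maximizing their product. A minimal sketch on a made-up spam/ham dataset (labels and features are illustrative; no smoothing is applied):

```python
# A minimal naive Bayes classifier: p(x|C) is factored as the product
# of per-feature conditionals, all estimated by relative frequency.
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (feature_tuple, class_label) pairs."""
    class_counts = Counter(c for _, c in examples)
    feat_counts = defaultdict(Counter)   # (class, feature_idx) -> value counts
    for x, c in examples:
        for j, v in enumerate(x):
            feat_counts[(c, j)][v] += 1
    n = len(examples)

    def predict(x):
        best, best_score = None, -1.0
        for c, cc in class_counts.items():
            score = cc / n                 # prior P(C)
            for j, v in enumerate(x):      # times each P(x_j|C)
                score *= feat_counts[(c, j)][v] / cc
            if score > best_score:
                best, best_score = c, score
        return best

    return predict

data = [((1, 0), 'spam'), ((1, 1), 'spam'), ((0, 0), 'ham'),
        ((0, 1), 'ham'), ((1, 1), 'spam')]
predict = train(data)
```

In practice one would add Laplace smoothing so that an unseen feature value does not zero out a whole class.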
Slide 52: Belief Propagation in BBNs
Slide 53: (no transcript)
Slide 54: Entering hard evidence: this is the simple case. What about belief update in the other direction?
Slide 55: Entering hard evidence: note that our belief in MartinLate is also increased. How does evidence propagate in belief networks?
Slide 56: Diverging connection: entering some evidence (hard or soft) about NormanLate is propagated to TrainStrike and MartinLate. If we had hard evidence about TrainStrike, any new evidence about NormanLate would not be propagated to MartinLate.
Slide 57: Diverging connection: if we had hard evidence about TrainStrike, any new evidence about NormanLate would not be propagated to MartinLate (the children are then conditionally independent given the parent).
Slide 58: (no transcript)
Slide 59: Converging connection: entering some evidence (hard or soft) about MartinLate is propagated to TrainStrike and Oversleep.
Slide 60: Converging connection: entering some evidence (hard or soft) about MartinLate is propagated to TrainStrike, Oversleep, and NormanLate.
Slide 61: Converging connection: if we have no information about MartinLate, Oversleep and TrainStrike are independent; no evidence is transmitted between them.
Slide 62: What about the other direction (when we have some evidence about C)? More lecture notes in Bayesian Belief Nets.doc.
Slide 63: Bayesian Networks: Inference
- Inference in BBNs is NP-hard in the general case; efficiency depends on the sparseness of the graph structure.
- Inference in singly connected networks (at most one undirected path between any two nodes), also called polytrees, is linear in the size of the network.
- For multiply connected networks, inference may take exponential time, even when the number of parents per node is small.
- Stochastic sampling can approximate inference.
- Belief propagation (Pearl, 1988) for trees.
- Junction trees (Lauritzen and Spiegelhalter, 1988): converting directed acyclic graphs to trees.
Slide 64:
- How to learn a network structure from experience?
- Network structure is given: a simple calculation of the conditionals.
- Network structure is not given: a trade-off between network complexity and accuracy over the training data.
- Some variable values are not observed: the EM algorithm.
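When the structure is given and all variables are observed, each conditional table is indeed a simple relative-frequency calculation, sketched here for one node with one parent. The variable names reuse the earlier TrainStrike/NormanLate example, and the observations are made up for illustration:

```python
# Estimating a conditional probability table by counting, given the
# network structure and fully observed data.
from collections import Counter

# Observations of (TrainStrike, NormanLate) pairs (toy data).
data = [(1, 1), (1, 1), (1, 0), (0, 0), (0, 0), (0, 1), (0, 0), (1, 1)]

parent_counts = Counter(p for p, _ in data)   # #(TrainStrike = p)
pair_counts = Counter(data)                   # #(TrainStrike = p, NormanLate = c)

def p_late_given_strike(strike):
    """Estimated P(NormanLate = 1 | TrainStrike = strike)."""
    return pair_counts[(strike, 1)] / parent_counts[strike]
```

Here the estimate is 3/4 when there is a strike and 1/4 otherwise; with hidden variables these counts become expected counts, which is what the EM algorithm computes.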
Slide 65: Decision Theory
Slide 66: (no transcript)
Slide 67: (no transcript)
Slide 68: (no transcript)
Slide 69: Influence Diagrams
[Figure: an influence diagram with a decision node, a utility node, and a chance node]
Slide 70: (no transcript)
Slide 71: (no transcript)
Slide 72: [Figure: a decision tree. Possible actions (test_j, no test, test_z, ...) lead to chance nodes A and B; each chance node branches over possible outcomes e_jk, and each action is evaluated by an expectation over its possible outcomes.]
Slide 73: (no transcript)
Slide 74: Association Rules
- Association rule X → Y (roughly meaning that X implies Y).
- Confidence of the rule X → Y: P(Y|X) = #{X and Y} / #{X}
- Support of the rule X → Y: P(X, Y) = #{X and Y} / N
- Apriori algorithm (Agrawal et al., 1996).
Slide 75: Skip Rest
Slide 76: (no transcript)
Slide 77: (no transcript)
Slide 78: (no transcript)