CHAPTER 3: Bayesian Decision Theory
1
CHAPTER 3: Bayesian Decision Theory
Slides based on lecture slides from E. Alpaydin
(Bogazici Univ.) and a tutorial by Josh
Tenenbaum & Tom Griffiths (MIT)
2
Everyday inductive leaps
  • Learning concepts and words from examples

(Figure: three example images, each labeled "horse")
3
Inductive reasoning
Input:
  premises
  conclusion

Task: judge how likely the conclusion is to be
true, given that the premises are true.
4
Inferring causal relations
Input:

  Day    Took vitamin B23   Headache
  Day 1  yes                no
  Day 2  yes                yes
  Day 3  no                 yes
  Day 4  yes                no
  ...    ...                ...

Does vitamin B23 cause headaches?

Task: judge the probability of a causal link
given several joint observations.
5
Everyday inductive leaps
  • How can we learn so much about . . .
  • Properties of natural kinds
  • Meanings of words
  • Future outcomes of a dynamic process
  • Causal laws governing a domain
  • . . .
  • from such limited data?

6
The Challenge
  • How do we generalize successfully from very
    limited data?
  • Just one or a few examples
  • Often only positive examples
  • Machine learning and statistics focus on
    generalization from many examples, both positive
    and negative.

7
Why Bayes?
  • A framework for explaining cognition.
  • How people can learn so much from such limited
    data.
  • Strong quantitative models with minimal ad hoc
    assumptions.
  • ...
  • A framework for understanding how structured
    knowledge and statistical inference interact.
  • How simplicity trades off with fit to the data in
    evaluating structural hypotheses (Occam's razor).
  • ...

8
Bayesian models of inductive learning some
recent history
  • Shepard (1987)
  • Analysis of one-shot stimulus generalization, to
    explain the universal exponential law.
  • Anderson (1990)
  • Models of categorization and causal induction.
  • Oaksford & Chater (1994)
  • Model of conditional reasoning (Wason selection
    task).
  • Heit (1998)
  • Framework for category-based inductive reasoning.

9
Bayes' Rule

  P(h | d) = P(d | h) P(h) / P(d)

  posterior = likelihood × prior / evidence
10
The origin of Bayes' rule
  • A simple consequence of using probability to
    represent degrees of belief
  • For any two random variables:
    P(h, d) = P(h | d) P(d) = P(d | h) P(h)
    ⇒ P(h | d) = P(d | h) P(h) / P(d)
11
Rational statistical inference (Bayes)

  P(h | d) = P(d | h) P(h) / Σ_{h′∈H} P(d | h′) P(h′)

  (the evidence is a sum over the space of hypotheses H)
12
Theory-Based Bayesian Models
  • Rational statistical inference (Bayes)
  • Learners' domain theories generate their
    hypothesis space H and prior p(h).
  • Well-matched to structure of the natural world.
  • Learnable from limited data.

13
Coin Flipping Example
14
Coin flipping
HHTHT
HHHHH
What process produced these sequences?
15
Hypotheses in Bayesian inference
  • Hypotheses H refer to processes that could have
    generated the data D
  • Bayesian inference provides a distribution over
    these hypotheses, given D
  • P(D | H) is the probability of D being generated by
    the process identified by H
  • Hypotheses H are mutually exclusive: only one
    process could have generated D

16
Why represent degrees of belief with
probabilities?
  • consistency and worst-case error bounds.
  • necessary to cohere with common sense
  • if your beliefs do not accord with the laws of
    probability, then you can always be out-gambled
    by someone whose beliefs do so accord.
  • a common currency for combining prior knowledge
    and the lessons of experience.

17
Hypotheses in coin flipping
Describe processes by which D could be generated
D
HHTHT
  • Fair coin, P(H) = 0.5
  • Coin with P(H) = p
  • Markov model
  • Hidden Markov model
  • ...

Statistical generative models
18
Representing generative models
  • Graphical model notation
  • Pearl (1988), Jordan (1998)
  • Variables are nodes, edges indicate dependency
  • Directed edges show causal process of data
    generation

19
Models with latent structure
  • Not all nodes in a graphical model need to be
    observed
  • Some variables reflect latent structure, used in
    generating D but unobserved

20
Coin flipping
  • Comparing two simple hypotheses
  • P(H) = 0.5 vs. P(H) = 1.0
  • Comparing simple and complex hypotheses
  • P(H) = 0.5 vs. P(H) = p
  • Comparing infinitely many hypotheses
  • P(H) = p
  • Psychology: representativeness

21
Comparing two simple hypotheses
  • Contrast simple hypotheses
  • H1: fair coin, P(H) = 0.5
  • H2: always heads, P(H) = 1.0
  • Bayes' rule
  • With two hypotheses, use odds form

22
Bayes' rule in odds form

  P(H1 | D)   P(D | H1)   P(H1)
  --------- = --------- × -----
  P(H2 | D)   P(D | H2)   P(H2)

  • D: data
  • H1, H2: models
  • P(H1 | D): posterior probability that H1 generated
    the data
  • P(D | H1): likelihood of the data under model H1
  • P(H1): prior probability that H1 generated the data
23
Coin flipping
HHTHT
HHHHH
What process produced these sequences?
24
Comparing two simple hypotheses
  P(H1 | D)   P(D | H1)   P(H1)
  --------- = --------- × -----
  P(H2 | D)   P(D | H2)   P(H2)

  • D = HHTHT
  • H1, H2: fair coin, always heads
  • P(D | H1) = 1/2^5    P(H1) = 999/1000
  • P(D | H2) = 0        P(H2) = 1/1000
  • P(H1 | D) / P(H2 | D) = infinity
25
Comparing two simple hypotheses
  P(H1 | D)   P(D | H1)   P(H1)
  --------- = --------- × -----
  P(H2 | D)   P(D | H2)   P(H2)

  • D = HHHHH
  • H1, H2: fair coin, always heads
  • P(D | H1) = 1/2^5    P(H1) = 999/1000
  • P(D | H2) = 1        P(H2) = 1/1000
  • P(H1 | D) / P(H2 | D) ≈ 30
26
Comparing two simple hypotheses
  P(H1 | D)   P(D | H1)   P(H1)
  --------- = --------- × -----
  P(H2 | D)   P(D | H2)   P(H2)

  • D = HHHHHHHHHH
  • H1, H2: fair coin, always heads
  • P(D | H1) = 1/2^10    P(H1) = 999/1000
  • P(D | H2) = 1         P(H2) = 1/1000
  • P(H1 | D) / P(H2 | D) ≈ 1
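The three comparisons above can be reproduced numerically. A minimal sketch, using exact arithmetic via `fractions` and the slide's priors of 999/1000 (fair) and 1/1000 (always heads):

```python
from fractions import Fraction

def posterior_odds(d, prior1=Fraction(999, 1000), prior2=Fraction(1, 1000)):
    """Posterior odds P(H1|D)/P(H2|D): H1 = fair coin, H2 = always heads."""
    # Likelihood under H1 (fair coin): (1/2)^N for any sequence of length N.
    like1 = Fraction(1, 2) ** len(d)
    # Likelihood under H2 (always heads): 1 if all heads, else 0.
    like2 = Fraction(1) if set(d) == {"H"} else Fraction(0)
    if like2 == 0:
        return float("inf")  # a single tail rules out H2 entirely
    return (like1 * prior1) / (like2 * prior2)

print(posterior_odds("HHTHT"))             # inf: one tail eliminates H2
print(float(posterior_odds("HHHHH")))      # 999/32 = 31.21875, i.e. about 30
print(float(posterior_odds("H" * 10)))     # 999/1024, i.e. about 1
```

The odds collapse from "about 30" to "about 1" between five and ten heads: this is the point at which the data overwhelm the strong prior in favor of the fair coin.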
27
Comparing two simple hypotheses
  • Bayes rule tells us how to combine prior beliefs
    with new data
  • top-down and bottom-up influences
  • As a model of human inference
  • predicts conclusions drawn from data
  • identifies point at which prior beliefs are
    overwhelmed by new experiences
  • But more complex cases?
  • Handled, but we won't cover them in CS512

28
Hypothesis Testing and Human Predictions
29
Bayesian Decision for Classification
30
Probability and Inference Reminder
  • Result of tossing a coin ∈ {Heads, Tails}
  • Sample: X = {x^t}, t = 1, ..., N
  • Prediction of next toss:
  • Heads if p̂o > ½, Tails otherwise
  • Estimation: p̂o = #{Heads} / #{Tosses} = Σ_t x^t / N
  • Random variable X ∈ {Heads, Tails}, coded x ∈ {1, 0}
  • Bernoulli: P{X = x} = po^x (1 − po)^(1 − x)
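A quick sketch of the estimator above, on a hypothetical sample with heads coded as 1:

```python
# Estimate po = P(Heads) from a sample of tosses (x^t = 1 for heads).
sample = [1, 0, 1, 1, 0, 1, 0, 1]      # hypothetical tosses
p_hat = sum(sample) / len(sample)      # #Heads / #Tosses = sum_t x^t / N
print(p_hat)                           # 0.625

# Predict the next toss: Heads if p_hat > 1/2, Tails otherwise.
prediction = "Heads" if p_hat > 0.5 else "Tails"
print(prediction)                      # Heads
```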

31
Bayes' Rule

  P(C | x) = p(x | C) P(C) / p(x)

  posterior = likelihood × prior / evidence
32
Classification
  • Credit scoring: inputs are income and savings;
    output is low-risk vs. high-risk
  • Input: x = [x1, x2]^T, output: C ∈ {0, 1}
  • Prediction: choose C = 1 if P(C = 1 | x) > 0.5,
    C = 0 otherwise

33
Bayes' Rule: K > 2 Classes

  P(Ci | x) = p(x | Ci) P(Ci) / Σ_k p(x | Ck) P(Ck)

  Choose Ci with the highest posterior P(Ci | x).
34
Losses and Risks
  • Actions ai (e.g., choose class Ci)
  • Loss of ai when the true state is Ck: λik (loss for
    choosing the ith class when the true class is the
    kth class)
  • Expected risk (Duda and Hart, 1973):
    R(ai | x) = Σ_k λik P(Ck | x)
    Choose ai if R(ai | x) = min_k R(ak | x)

35
Losses and Risks: 0/1 Loss

  λik = 0 if i = k, 1 if i ≠ k

  R(ai | x) = Σ_k λik P(Ck | x) = 1 − P(Ci | x)

For minimum risk, choose the most probable class
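A small sketch of the expected-risk computation, on hypothetical posteriors. Under 0/1 loss the minimum-risk action coincides with the most probable class:

```python
def expected_risk(i, posteriors, loss):
    """R(a_i | x) = sum_k loss[i][k] * P(C_k | x)."""
    return sum(loss[i][k] * posteriors[k] for k in range(len(posteriors)))

posteriors = [0.2, 0.5, 0.3]   # hypothetical P(C_k | x)
K = len(posteriors)
# 0/1 loss: lambda_ik = 0 if i == k else 1
zero_one = [[0 if i == k else 1 for k in range(K)] for i in range(K)]

risks = [expected_risk(i, posteriors, zero_one) for i in range(K)]
best = risks.index(min(risks))
print(best)   # 1 -- the most probable class, since R(a_i|x) = 1 - P(C_i|x)
```

Swapping in an asymmetric loss matrix (e.g., a large λ for misclassifying a high-risk applicant as low-risk) can shift the chosen action away from the most probable class.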
36
Losses and Risks: Reject

  With a reject action of fixed loss λ, 0 < λ < 1:
  choose Ci if P(Ci | x) > P(Ck | x) for all k ≠ i
  and P(Ci | x) > 1 − λ; reject otherwise.
37
Discriminant Functions

  gi(x), i = 1, ..., K: choose Ci if gi(x) = max_k gk(x)

K decision regions R1, ..., RK
38
K = 2 Classes
  • Dichotomizer (K = 2) vs. polychotomizer (K > 2)
  • g(x) = g1(x) − g2(x): choose C1 if g(x) > 0, C2 otherwise
  • Log odds: log [P(C1 | x) / P(C2 | x)]

39
Bayesian Belief Nets
40
Conditional Independence
  • Independence of two variables X and Y:
  • P(X, Y) = P(X) P(Y)
  • P(X | Y) = P(X)
  • Conditional independence of two variables X
    and Y, given Z:
  • P(X, Y | Z) = P(X | Z) P(Y | Z)
  • P(X | Y, Z) = P(X | Z)
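A numeric sanity check of the factorization above, on a small hypothetical joint distribution constructed as P(x, y, z) = P(z) P(x|z) P(y|z), which makes X and Y conditionally independent given Z by design:

```python
# Hypothetical conditional tables (all values made up for illustration).
P_z = {0: 0.3, 1: 0.7}
P_x_z = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}   # P(x | z)
P_y_z = {0: {0: 0.6, 1: 0.4}, 1: {0: 0.5, 1: 0.5}}   # P(y | z)

def joint(x, y, z):
    """P(x, y, z) = P(z) P(x|z) P(y|z)."""
    return P_z[z] * P_x_z[z][x] * P_y_z[z][y]

# Check P(X, Y | Z=1) = P(X | Z=1) P(Y | Z=1) for one cell:
p_xy_given_z = joint(1, 0, 1) / P_z[1]
print(abs(p_xy_given_z - P_x_z[1][1] * P_y_z[1][0]) < 1e-12)   # True
```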

41
Conditional Independence
42
The book's slides for Chp. 3
43
Causes and Bayes' Rule

Diagnostic inference: knowing that the grass is
wet, what is the probability that rain is the
cause?

                    P(W | R) P(R)
  P(R | W) = -------------------------------
             P(W | R) P(R) + P(W | ¬R) P(¬R)

  (diagnostic inference via the causal model P(W | R))
44
Causal vs. Diagnostic Inference

Causal inference: if the sprinkler is on, what
is the probability that the grass is wet?

  P(W | S) = P(W | R, S) P(R | S) + P(W | ¬R, S) P(¬R | S)
           = P(W | R, S) P(R) + P(W | ¬R, S) P(¬R)
           = 0.95 × 0.4 + 0.9 × 0.6 = 0.92

Diagnostic inference: if the grass is wet, what
is the probability that the sprinkler is on?

  P(S | W) = 0.35 > 0.2 = P(S)
  P(S | R, W) = 0.21

Explaining away: knowing that it has
rained decreases the probability that the
sprinkler is on.
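The causal computation above can be checked directly with the slide's numbers:

```python
# Values from the slide: P(W|R,S) = 0.95, P(W|~R,S) = 0.90, P(R) = 0.4.
p_w_rs, p_w_nrs, p_r = 0.95, 0.90, 0.4

# Causal inference: P(W|S) = P(W|R,S) P(R|S) + P(W|~R,S) P(~R|S),
# and R is independent of S, so P(R|S) = P(R).
p_w_s = p_w_rs * p_r + p_w_nrs * (1 - p_r)
print(round(p_w_s, 2))   # 0.92
```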
45
Bayesian Networks: Causes

Causal inference:
  P(W | C) = P(W | R, S) P(R, S | C)
           + P(W | ¬R, S) P(¬R, S | C)
           + P(W | R, ¬S) P(R, ¬S | C)
           + P(W | ¬R, ¬S) P(¬R, ¬S | C)

  using the fact that P(R, S | C) = P(R | C) P(S | C)

Diagnostic: P(C | W) = ?
46
Bayesian Nets: Local structure

  P(F | C) = ?
47
(No Transcript)
48
(No Transcript)
49
Bayesian Networks: Classification

Bayes' rule inverts the arc:

  diagnostic: P(C | x) = p(x | C) P(C) / p(x)
50
Naive Bayes Classifier
  • P(Cause | Effect1, ..., EffectN) ∝ P(Cause) Π_i
    P(Effect_i | Cause)
  • P(Class | X1, ..., XN) ∝ P(Class) Π_i P(X_i | Class)
  • Often used as a simplifying assumption (even when
    the Effect variables are not in fact conditionally
    independent given Cause)
  • Naive Bayes
  • Surprisingly good

51
Naive Bayes' Classifier

Given C, the xj are independent:
  p(x | C) = p(x1 | C) p(x2 | C) ... p(xd | C)
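A minimal categorical naive Bayes sketch along these lines. The credit-scoring data, feature names, and class labels are hypothetical, and probabilities are estimated by raw counts (no smoothing):

```python
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (feature_tuple, class_label) pairs."""
    class_counts = Counter(c for _, c in examples)
    feat_counts = defaultdict(Counter)   # (class, position) -> value counts
    for x, c in examples:
        for j, v in enumerate(x):
            feat_counts[(c, j)][v] += 1
    n = len(examples)
    prior = {c: k / n for c, k in class_counts.items()}   # P(C)
    return prior, feat_counts, class_counts

def predict(x, prior, feat_counts, class_counts):
    """argmax_C P(C) * prod_j P(x_j | C), with count-based estimates."""
    def score(c):
        p = prior[c]
        for j, v in enumerate(x):
            p *= feat_counts[(c, j)][v] / class_counts[c]
        return p
    return max(prior, key=score)

# Hypothetical data: (income, savings) -> risk class.
data = [(("high", "high"), "low-risk"), (("high", "low"), "low-risk"),
        (("low", "low"), "high-risk"), (("low", "high"), "low-risk"),
        (("low", "low"), "high-risk")]
model = train(data)
print(predict(("high", "high"), *model))   # low-risk
```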
52
Belief Propagation in BBNs
53
(No Transcript)
54
Entering hard evidence: this is the simple
case. What about belief update in the other
direction?
55
Entering hard evidence: note that our belief in
Martin being late is also increased. How does
evidence propagate in belief networks?
56
Diverging connection: entering some evidence
(hard or soft) about NormanLate is propagated to
TrainStrike and MartinLate. If we had hard
evidence about TrainStrike, any new evidence
about NormanLate would not be propagated to
MartinLate.
57
Diverging connection: if we had hard evidence
about TrainStrike, any new evidence about
NormanLate would not be propagated to MartinLate
(the children are then conditionally independent
given the parent)
58
(No Transcript)
59
Converging connection: entering some evidence
(hard or soft) about MartinLate is propagated to
TrainStrike and Oversleep.
60
Converging connection: entering some evidence
(hard or soft) about MartinLate is propagated to
TrainStrike, Oversleep and NormanLate.
61
Converging connection: if we have no info about
MartinLate, Oversleep and TrainStrike are
independent; no evidence is transmitted between
them.
62
  • Serial connection

What about the other direction? (We have some
evidence about C.) More lecture notes in
"Bayesian Belief Nets.doc".
63
Bayesian Networks Inference
  • Inference in BBNs is NP-hard in the general
    case; efficiency depends on the sparseness of the
    graph structure.
  • Inference in singly connected networks (at most
    one undirected path between any two nodes in the
    network), also called polytrees, is linear in the
    size of the network.
  • For multiply connected networks, inference may
    take exponential time, even when the number of
    parents is small.
  • Stochastic sampling to approximate inference
  • Belief propagation (Pearl, 1988) for trees
  • Junction trees (Lauritzen and Spiegelhalter,
    1988)
  • convert directed acyclic graphs to trees

64
  • How to learn a network structure from experience?
  • Network structure given: simple calculation of
    the conditionals
  • Network structure not given:
  • trade-off between network complexity and
    accuracy over the training data
  • Some variable values not observed:
  • EM algorithm

65
Decision Theory
66
(No Transcript)
67
(No Transcript)
68
(No Transcript)
69
Influence Diagrams
decision node
utility node
chance node
70
(No Transcript)
71
(No Transcript)
72
(Figure: decision tree for choosing among tests — possible
actions "test_j", "no test", "test_z", ...; branches A and B;
possible outcomes e_jk; the expectation is taken over the
possible outcomes)
73
(No Transcript)
74
Association Rules
  • Association rule X → Y (roughly, meaning that X
    implies Y)
  • Confidence of the association rule X → Y:
    P(Y | X)
  • Support of the association rule X → Y:
    P(X, Y)

Apriori algorithm (Agrawal et al., 1996)
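Support and confidence can be computed directly from a transaction database. A minimal sketch on hypothetical shopping-basket data:

```python
# Hypothetical transaction database (each transaction is a set of items).
transactions = [{"milk", "bread"}, {"milk", "bread", "butter"},
                {"bread"}, {"milk", "butter"}, {"milk", "bread"}]

def support(itemset, transactions):
    """P(X, Y): fraction of transactions containing every item in the set."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y, transactions):
    """P(Y | X) = support(X ∪ Y) / support(X)."""
    return support(x | y, transactions) / support(x, transactions)

print(support({"milk", "bread"}, transactions))            # 3/5 = 0.6
print(round(confidence({"milk"}, {"bread"}, transactions), 2))   # 0.75
```

The Apriori algorithm speeds up this counting by pruning: any superset of an itemset whose support falls below the threshold can be skipped.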
75
Skip Rest
76
(No Transcript)
77
(No Transcript)
78
(No Transcript)