Title: Bayesian Learning
1. Bayesian Learning
- Thanks to Nir Friedman, HU
2. Example
- Suppose we are required to build a controller that removes bad oranges from a packaging line
- Decisions are made based on a sensor that reports the overall color of the orange
3. Classifying oranges
- Suppose we know all aspects of the problem
- Prior probabilities
- Probability of good (1) and bad (-1) oranges
- P(C = 1): probability of a good orange
- P(C = -1): probability of a bad orange
- Note: P(C = 1) + P(C = -1) = 1
- Assumption: oranges are independent; the occurrence of a bad orange does not depend on previous oranges
4. Classifying oranges (cont.)
- Sensor performance
- Let X denote the sensor measurement for each type of orange
5. Bayes Rule
- Given this knowledge, we can compute the posterior probabilities
- Bayes rule:
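Written out for the orange classes (a standard statement of the rule):

\[
P(C = c \mid X = x) = \frac{P(X = x \mid C = c)\, P(C = c)}{P(X = x)},
\qquad
P(X = x) = \sum_{c' \in \{1,-1\}} P(X = x \mid C = c')\, P(C = c').
\]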
6. Posterior of Oranges
7. Decision making
- Intuition:
- Predict Good if P(C = 1 | X) > P(C = -1 | X)
- Predict Bad otherwise
8. Loss function
- Assume we have classes 1, -1
- Suppose we can make predictions a1, ..., ak
- A loss function L(ai, cj) describes the loss associated with making prediction ai when the class is cj

                       Real Label
                       C = -1   C = 1
  Prediction  Bad         1       5
              Good       10       0
9. Expected Risk
- Given the estimates of P(C | X) we can compute the expected conditional risk of each decision
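The conditional risk of an action a given the observation X is its expected loss under the posterior:

\[
R(a \mid X) = \sum_{c} L(a, c)\, P(C = c \mid X).
\]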
10. The Risk in Oranges

                       Real Label
                       C = -1   C = 1
  Prediction  Bad         1       5
              Good       10       0

(Plot: the conditional risks R(Good | X) and R(Bad | X) as functions of the sensor reading X.)
11. Optimal Decisions
- Goal: minimize risk
- Optimal decision rule:
- Given X = x, predict ai if R(ai | X = x) = min_a R(a | X = x)
- (break ties arbitrarily)
- Note: randomized decisions do not help
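A minimal sketch of this rule on the orange example. The loss table is the one from the slides; the class prior and the Gaussian sensor model below are made-up numbers purely for illustration.

# Risk-minimizing decisions for the orange example (illustrative sketch).
import math

LOSS = {                       # LOSS[action][class], from the loss table
    "Bad":  {-1: 1,  1: 5},
    "Good": {-1: 10, 1: 0},
}
PRIOR = {1: 0.9, -1: 0.1}                  # assumed P(C)
SENSOR = {1: (0.7, 0.1), -1: (0.4, 0.1)}   # assumed (mean, std) of the color reading per class

def gaussian_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def posterior(x):
    """P(C = c | X = x) by Bayes rule."""
    joint = {c: PRIOR[c] * gaussian_pdf(x, *SENSOR[c]) for c in PRIOR}
    z = sum(joint.values())
    return {c: p / z for c, p in joint.items()}

def decide(x):
    """Pick the action with minimal conditional risk R(a | X = x)."""
    post = posterior(x)
    risk = {a: sum(LOSS[a][c] * post[c] for c in post) for a in LOSS}
    return min(risk, key=risk.get), risk

for x in (0.35, 0.50, 0.65):
    action, risk = decide(x)
    risks = ", ".join(f"R({a}|x)={r:.2f}" for a, r in risk.items())
    print(f"x={x:.2f}  {risks}  ->  predict {action}")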
12. 0-1 Loss
- If we don't have prior knowledge, it is common to use the 0-1 loss:
- L(a, c) = 0 if a = c
- L(a, c) = 1 otherwise
- Consequence:
- R(a | X) = P(a ≠ C | X)
- Decision rule: choose ai if P(C = ai | X) = max_a P(C = a | X)
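The consequence follows directly from the definition of the conditional risk:

\[
R(a \mid X) = \sum_{c} L(a, c)\, P(C = c \mid X) = \sum_{c \neq a} P(C = c \mid X) = 1 - P(C = a \mid X),
\]

so minimizing risk under 0-1 loss is the same as predicting the most probable class (the MAP decision).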
13. Bayesian Decisions: Summary
- Decisions are based on two components:
- Conditional distribution P(C | X)
- Loss function L(A, C)
- Pros:
- Specifies optimal actions in the presence of noisy signals
- Can deal with skewed loss functions
- Cons:
- Requires P(C | X)
14. Simple Statistics: Binomial Experiment
- When tossed, a thumbtack can land in one of two positions: Head or Tail
- We denote by θ the (unknown) probability P(H)
- Estimation task:
- Given a sequence of toss samples x1, x2, ..., xM we want to estimate the probabilities P(H) = θ and P(T) = 1 - θ
15. Why Learning is Possible?
- Suppose we perform M independent flips of the thumbtack
- The number of heads we see has a binomial distribution
- This suggests that we can estimate θ by the observed fraction of heads
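In symbols, with N_H the number of heads among the M flips:

\[
P(N_H = k) = \binom{M}{k}\, \theta^{k} (1-\theta)^{M-k},
\qquad
E[N_H] = M\theta,
\qquad
\text{so } \hat\theta = \frac{N_H}{M}.
\]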
16. Maximum Likelihood Estimation
- MLE principle: learn parameters that maximize the likelihood function
- This is one of the most commonly used estimators in statistics
- Intuitively appealing
- Well-studied properties
17. Computing the Likelihood Function
- To compute the likelihood in the coin-tossing example we only require N_H and N_T (the number of heads and the number of tails)
- Applying the MLE principle gives the estimate below
- N_H and N_T are sufficient statistics for the binomial distribution
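Concretely, for a sequence of tosses with N_H heads and N_T tails:

\[
L(\theta : D) = \theta^{N_H} (1-\theta)^{N_T},
\qquad
\hat\theta_{MLE} = \frac{N_H}{N_H + N_T}.
\]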
18. Sufficient Statistics
- A sufficient statistic is a function of the data that summarizes the relevant information for the likelihood
- Formally, s(D) is a sufficient statistic if for any two datasets D and D':
- s(D) = s(D')  ⇒  L(θ : D) = L(θ : D')
19. Maximum A Posteriori (MAP)
- Suppose we observe the sequence H, H
- The MLE estimate is P(H) = 1, P(T) = 0
- Should we really believe that tails are impossible at this stage?
- Such an estimate can have a disastrous effect
- If we assume that P(T) = 0, then we are willing to act as though this outcome is impossible
20. Laplace Correction
- Suppose we observe n coin flips with k heads
- Compare the MLE with the Laplace-corrected estimate (see the formulas below)
- The correction acts as though we observed one additional H and one additional T
- Can we justify this estimate? Uniform prior!
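In formulas (the standard form of the correction):

\[
\hat\theta_{MLE} = \frac{k}{n},
\qquad
\hat\theta_{Laplace} = \frac{k+1}{n+2}.
\]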
21. Bayesian Reasoning
- In Bayesian reasoning we represent our uncertainty about the unknown parameter θ by a probability distribution
- This probability distribution can be viewed as a subjective probability
- This is a personal judgment of uncertainty
22. Bayesian Inference
- We start with:
- P(θ): the prior distribution over the values of θ
- P(x1, ..., xn | θ): the likelihood of the examples given a known value θ
- Given examples x1, ..., xn, we can compute the posterior distribution over θ, where the normalizer is the marginal likelihood (see below)
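This is Bayes rule applied to the parameter:

\[
P(\theta \mid x_1, \ldots, x_n) = \frac{P(x_1, \ldots, x_n \mid \theta)\, P(\theta)}{P(x_1, \ldots, x_n)},
\qquad
P(x_1, \ldots, x_n) = \int P(x_1, \ldots, x_n \mid \theta)\, P(\theta)\, d\theta.
\]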
23. Binomial Distribution: Laplace Est.
- In this case the unknown parameter is θ = P(H)
- Simplest prior: P(θ) = 1 for 0 < θ < 1
- Likelihood, where k is the number of heads in the sequence of n tosses
- Marginal likelihood (both are written out below)
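With the uniform prior, for a particular sequence with k heads out of n tosses:

\[
P(x_1, \ldots, x_n \mid \theta) = \theta^{k} (1-\theta)^{n-k},
\qquad
P(x_1, \ldots, x_n) = \int_0^1 \theta^{k} (1-\theta)^{n-k}\, d\theta.
\]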
24. Marginal Likelihood
- Using integration by parts we have
  \[
  \int_0^1 \theta^{k} (1-\theta)^{n-k}\, d\theta = \frac{n-k}{k+1} \int_0^1 \theta^{k+1} (1-\theta)^{n-k-1}\, d\theta
  \]
- Multiplying both sides by \binom{n}{k}, and noting that \binom{n}{k}\frac{n-k}{k+1} = \binom{n}{k+1}, we have
  \[
  \binom{n}{k} \int_0^1 \theta^{k} (1-\theta)^{n-k}\, d\theta = \binom{n}{k+1} \int_0^1 \theta^{k+1} (1-\theta)^{n-k-1}\, d\theta
  \]
25. Marginal Likelihood (cont.)
- The recursion terminates when k = n:
  \[
  \binom{n}{n} \int_0^1 \theta^{n}\, d\theta = \frac{1}{n+1}
  \]
- Thus
  \[
  P(x_1, \ldots, x_n) = \int_0^1 \theta^{k} (1-\theta)^{n-k}\, d\theta = \frac{1}{(n+1)\binom{n}{k}} = \frac{k!\,(n-k)!}{(n+1)!}
  \]
- We conclude that the posterior is as given below
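Dividing the likelihood by the marginal likelihood gives a Beta(k+1, n-k+1) density:

\[
P(\theta \mid x_1, \ldots, x_n) = \frac{\theta^{k} (1-\theta)^{n-k}}{\int_0^1 \theta^{k} (1-\theta)^{n-k}\, d\theta} = \frac{(n+1)!}{k!\,(n-k)!}\, \theta^{k} (1-\theta)^{n-k}.
\]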
26. Bayesian Prediction
- How do we predict using the posterior?
- We can think of this as computing the probability of the next element in the sequence
- Assumption: if we know θ, the probability of X_{n+1} is independent of X_1, ..., X_n
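Under this assumption the predictive probability averages over the posterior, and with the uniform prior above it recovers the Laplace correction:

\[
P(X_{n+1} = H \mid x_1, \ldots, x_n)
= \int_0^1 \theta\, P(\theta \mid x_1, \ldots, x_n)\, d\theta
= \int_0^1 \theta\, \frac{(n+1)!}{k!\,(n-k)!}\, \theta^{k} (1-\theta)^{n-k}\, d\theta
= \frac{k+1}{n+2}.
\]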
27. Bayesian Prediction
28. Naïve Bayes
29. Bayesian Classification: Binary Domain
- Consider the following situation:
- Two classes: -1, 1
- Each example is described by N attributes
- Each Xn is a binary variable with values 0, 1
- Example dataset:

  X1  X2  ...  XN    C
   0   1  ...   1    1
   1   0  ...   1   -1
   1   1  ...   0    1
   0   0  ...   0    1
30. Binary Domain - Priors
- How do we estimate P(C)?
- Simple binomial estimation
- Count the instances with C = -1 and with C = 1

  X1  X2  ...  XN    C
   0   1  ...   1    1
   1   0  ...   1   -1
   1   1  ...   0    1
   0   0  ...   0    1
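Using the MLE from the binomial estimation slides (or its Laplace-corrected version):

\[
\hat P(C = c) = \frac{N(C = c)}{M},
\]

where N(C = c) counts the training instances with class c and M is the total number of instances; in the four-row example above, \hat P(C = 1) = 3/4.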
31. Binary Domain - Attribute Probability
- How do we estimate P(X1, ..., XN | C)?
- Two sub-problems: one estimate for each value of C

  X1  X2  ...  XN    C
   0   1  ...   1    1
   1   0  ...   1   -1
   1   1  ...   0    1
   0   0  ...   0    1
32. Naïve Bayes
- The Naïve Bayes assumption (written out below)
- This is an independence assumption: each attribute Xi is independent of the other attributes once we know the value of C
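The assumption in formula form:

\[
P(X_1, \ldots, X_N \mid C) = \prod_{i=1}^{N} P(X_i \mid C).
\]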
33. Naïve Bayes: Boolean Domain
- Parameters: for each i, the conditional probabilities P(Xi = 1 | C = 1) and P(Xi = 1 | C = -1)
- How do we estimate P(X1 = 1 | C = 1)?
- Simple binomial estimation: count the 1 and 0 values of X1 in the instances where C = 1

  X1  X2  ...  XN    C
   0   1  ...   1    1
   1   0  ...   1   -1
   1   1  ...   0    1
   0   0  ...   0    1
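A minimal sketch of these counting estimates and the resulting classifier, using the Laplace-corrected counts from the earlier slides; the tiny dataset is the table above with the elided columns dropped, and all helper names are illustrative.

# Naive Bayes for binary attributes with Laplace-corrected counts (illustrative sketch).
data = [  # (attribute vector, class)
    ([0, 1, 1],  1),
    ([1, 0, 1], -1),
    ([1, 1, 0],  1),
    ([0, 0, 0],  1),
]
classes = [-1, 1]
n_attrs = len(data[0][0])

# Priors: (count + 1) / (M + number of classes)
class_count = {c: sum(1 for _, y in data if y == c) for c in classes}
prior = {c: (class_count[c] + 1) / (len(data) + len(classes)) for c in classes}

# P(Xi = 1 | C = c): (count of Xi = 1 within class c + 1) / (class count + 2)
p_one = {c: [(sum(1 for x, y in data if y == c and x[i] == 1) + 1) / (class_count[c] + 2)
             for i in range(n_attrs)]
         for c in classes}

def predict(x):
    """Return the class maximizing P(C = c) * prod_i P(Xi = xi | C = c)."""
    def score(c):
        s = prior[c]
        for i, xi in enumerate(x):
            s *= p_one[c][i] if xi == 1 else 1 - p_one[c][i]
        return s
    return max(classes, key=score)

print(predict([1, 0, 1]))  # classify a new attribute vector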
34. Interpretation of Naïve Bayes
35. Interpretation of Naïve Bayes
- Each Xi votes about the prediction
- If P(Xi | C = -1) = P(Xi | C = 1), then Xi has no say in the classification
- If P(Xi | C = -1) = 0, then Xi overrides all other votes (veto)
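The voting view comes from writing the posterior odds in log form, using Bayes rule and the Naïve Bayes factorization:

\[
\log \frac{P(C = 1 \mid X_1, \ldots, X_N)}{P(C = -1 \mid X_1, \ldots, X_N)}
= \log \frac{P(C = 1)}{P(C = -1)} + \sum_{i=1}^{N} \log \frac{P(X_i \mid C = 1)}{P(X_i \mid C = -1)},
\]

so each attribute contributes an additive term (its vote): the term is 0 when the two conditionals are equal, and it dominates everything else when one of the conditionals is 0.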
36. Interpretation of Naïve Bayes
37. Normal Distribution
- The Gaussian distribution

(Plot: the Gaussian density curve.)
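The density, with mean μ and standard deviation σ:

\[
p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right).
\]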
38. Maximum Likelihood Estimate
- Suppose we observe x1, ..., xm
- Simple calculations show that the MLE is as given below
- The sufficient statistics are the sum of the observations and the sum of their squares
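For a Gaussian with unknown mean and variance:

\[
\hat\mu = \frac{1}{m} \sum_{i=1}^{m} x_i,
\qquad
\hat\sigma^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \hat\mu)^2,
\]

which depend on the data only through \sum_i x_i and \sum_i x_i^2.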
39. Naïve Bayes with Gaussian Distributions
- Recall the Naïve Bayes factorization P(X1, ..., XN | C) = Π_i P(Xi | C)
- Assume each P(Xi | C) is Gaussian, where:
- The mean of Xi depends on the class
- The variance of Xi does not
40. Naïve Bayes with Gaussian Distributions
- With this assumption, the vote of each Xi is governed by two quantities (see the formula below):
- The distance between the class means
- The distance of Xi to the midway point between them
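One standard way to write this vote, assuming class-conditional means μ_{i,1}, μ_{i,-1} and a shared variance σ_i² for attribute Xi:

\[
\log \frac{P(X_i \mid C = 1)}{P(X_i \mid C = -1)}
= \frac{\mu_{i,1} - \mu_{i,-1}}{\sigma_i^{2}} \left( X_i - \frac{\mu_{i,1} + \mu_{i,-1}}{2} \right),
\]

so each vote scales how far Xi lies from the midpoint of the two means by the (variance-normalized) distance between the means.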
41. Different Variances?
- If we allow different variances, the classification rule is more complex
- The corresponding term is quadratic in Xi