Title: Bayesian Learning
1. Bayesian Learning
- Thanks to Nir Friedman, HU
2. Example
- Suppose we are required to build a controller that removes bad oranges from a packaging line
- Decisions are made based on a sensor that reports the overall color of the orange
3. Classifying oranges
- Suppose we know all aspects of the problem
- Prior probabilities
- Probability of good (1) and bad (-1) oranges
- P(C = 1): probability of a good orange
- P(C = -1): probability of a bad orange
- Note: P(C = 1) + P(C = -1) = 1
- Assumption: oranges are independent; the occurrence of a bad orange does not depend on previous oranges
4. Classifying oranges (cont.)
- Sensor performance
- Let X denote the sensor measurement for each type of orange
5. Bayes Rule
- Given this knowledge, we can compute the posterior probabilities
- Bayes rule:
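Written out for the orange classes (a standard statement of the rule):

\[
P(C = c \mid X = x) = \frac{P(X = x \mid C = c)\, P(C = c)}{P(X = x)},
\qquad
P(X = x) = \sum_{c' \in \{1,-1\}} P(X = x \mid C = c')\, P(C = c').
\]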
6. Posterior of Oranges
7. Decision making
- Intuition:
- Predict Good if P(C = 1 | X) > P(C = -1 | X)
- Predict Bad otherwise
8. Loss function
- Assume we have classes 1, -1
- Suppose we can make predictions a1, ..., ak
- A loss function L(ai, cj) describes the loss associated with making prediction ai when the class is cj

                       Real Label
                       C = -1   C = 1
  Prediction  Bad         1       5
              Good       10       0
9. Expected Risk
- Given the estimates of P(C | X) we can compute the expected conditional risk of each decision
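The conditional risk of an action a given the observation X is its expected loss under the posterior:

\[
R(a \mid X) = \sum_{c} L(a, c)\, P(C = c \mid X).
\]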
10. The Risk in Oranges

                       Real Label
                       C = -1   C = 1
  Prediction  Bad         1       5
              Good       10       0

(Plot: the conditional risks R(Good | X) and R(Bad | X) as functions of the sensor reading X.)
11. Optimal Decisions
- Goal: minimize risk
- Optimal decision rule:
- Given X = x, predict ai if R(ai | X = x) = min_a R(a | X = x)
- (break ties arbitrarily)
- Note: randomized decisions do not help
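A minimal sketch of this rule on the orange example. The loss table is the one from the slides; the class prior and the Gaussian sensor model below are made-up numbers purely for illustration.

# Risk-minimizing decisions for the orange example (illustrative sketch).
import math

LOSS = {                       # LOSS[action][class], from the loss table
    "Bad":  {-1: 1,  1: 5},
    "Good": {-1: 10, 1: 0},
}
PRIOR = {1: 0.9, -1: 0.1}                  # assumed P(C)
SENSOR = {1: (0.7, 0.1), -1: (0.4, 0.1)}   # assumed (mean, std) of the color reading per class

def gaussian_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def posterior(x):
    """P(C = c | X = x) by Bayes rule."""
    joint = {c: PRIOR[c] * gaussian_pdf(x, *SENSOR[c]) for c in PRIOR}
    z = sum(joint.values())
    return {c: p / z for c, p in joint.items()}

def decide(x):
    """Pick the action with minimal conditional risk R(a | X = x)."""
    post = posterior(x)
    risk = {a: sum(LOSS[a][c] * post[c] for c in post) for a in LOSS}
    return min(risk, key=risk.get), risk

for x in (0.35, 0.50, 0.65):
    action, risk = decide(x)
    risks = ", ".join(f"R({a}|x)={r:.2f}" for a, r in risk.items())
    print(f"x={x:.2f}  {risks}  ->  predict {action}")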
12. 0-1 Loss
- If we don't have prior knowledge, it is common to use the 0-1 loss:
- L(a, c) = 0 if a = c
- L(a, c) = 1 otherwise
- Consequence:
- R(a | X) = P(a ≠ C | X)
- Decision rule: choose ai if P(C = ai | X) = max_a P(C = a | X)
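The consequence follows directly from the definition of the conditional risk:

\[
R(a \mid X) = \sum_{c} L(a, c)\, P(C = c \mid X) = \sum_{c \neq a} P(C = c \mid X) = 1 - P(C = a \mid X),
\]

so minimizing risk under 0-1 loss is the same as predicting the most probable class (the MAP decision).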
13. Bayesian Decisions: Summary
- Decisions are based on two components:
- Conditional distribution P(C | X)
- Loss function L(A, C)
- Pros:
- Specifies optimal actions in the presence of noisy signals
- Can deal with skewed loss functions
- Cons:
- Requires P(C | X)
14. Simple Statistics: Binomial Experiment
- When tossed, a thumbtack can land in one of two positions: Head or Tail
- We denote by θ the (unknown) probability P(H)
- Estimation task:
- Given a sequence of toss samples x1, x2, ..., xM we want to estimate the probabilities P(H) = θ and P(T) = 1 - θ
15. Why Learning is Possible?
- Suppose we perform M independent flips of the thumbtack
- The number of heads we see has a binomial distribution
- This suggests that we can estimate θ by the observed fraction of heads
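In symbols, with N_H the number of heads among the M flips:

\[
P(N_H = k) = \binom{M}{k}\, \theta^{k} (1-\theta)^{M-k},
\qquad
E[N_H] = M\theta,
\qquad
\text{so } \hat\theta = \frac{N_H}{M}.
\]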
16. Maximum Likelihood Estimation
- MLE principle: learn parameters that maximize the likelihood function
- This is one of the most commonly used estimators in statistics
- Intuitively appealing
- Well-studied properties
17. Computing the Likelihood Function
- To compute the likelihood in the coin-tossing example we only require N_H and N_T (the number of heads and the number of tails)
- Applying the MLE principle gives the estimate below
- N_H and N_T are sufficient statistics for the binomial distribution
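Concretely, for a sequence of tosses with N_H heads and N_T tails:

\[
L(\theta : D) = \theta^{N_H} (1-\theta)^{N_T},
\qquad
\hat\theta_{MLE} = \frac{N_H}{N_H + N_T}.
\]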
18. Sufficient Statistics
- A sufficient statistic is a function of the data that summarizes the relevant information for the likelihood
- Formally, s(D) is a sufficient statistic if for any two datasets D and D':
- s(D) = s(D')  ⇒  L(θ : D) = L(θ : D')
19. Maximum A Posteriori (MAP)
- Suppose we observe the sequence H, H
- The MLE estimate is P(H) = 1, P(T) = 0
- Should we really believe that tails are impossible at this stage?
- Such an estimate can have a disastrous effect
- If we assume that P(T) = 0, then we are willing to act as though this outcome is impossible
20. Laplace Correction
- Suppose we observe n coin flips with k heads
- Compare the MLE with the Laplace-corrected estimate (see the formulas below)
- The correction acts as though we observed one additional H and one additional T
- Can we justify this estimate? Uniform prior!
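In formulas (the standard form of the correction):

\[
\hat\theta_{MLE} = \frac{k}{n},
\qquad
\hat\theta_{Laplace} = \frac{k+1}{n+2}.
\]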
21. Bayesian Reasoning
- In Bayesian reasoning we represent our uncertainty about the unknown parameter θ by a probability distribution
- This probability distribution can be viewed as a subjective probability
- This is a personal judgment of uncertainty
22. Bayesian Inference
- We start with:
- P(θ): the prior distribution over the values of θ
- P(x1, ..., xn | θ): the likelihood of the examples given a known value θ
- Given examples x1, ..., xn, we can compute the posterior distribution over θ, where the normalizer is the marginal likelihood (see below)
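This is Bayes rule applied to the parameter:

\[
P(\theta \mid x_1, \ldots, x_n) = \frac{P(x_1, \ldots, x_n \mid \theta)\, P(\theta)}{P(x_1, \ldots, x_n)},
\qquad
P(x_1, \ldots, x_n) = \int P(x_1, \ldots, x_n \mid \theta)\, P(\theta)\, d\theta.
\]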
23. Binomial Distribution: Laplace Est.
- In this case the unknown parameter is θ = P(H)
- Simplest prior: P(θ) = 1 for 0 < θ < 1
- Likelihood, where k is the number of heads in the sequence of n tosses
- Marginal likelihood (both are written out below)
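With the uniform prior, for a particular sequence with k heads out of n tosses:

\[
P(x_1, \ldots, x_n \mid \theta) = \theta^{k} (1-\theta)^{n-k},
\qquad
P(x_1, \ldots, x_n) = \int_0^1 \theta^{k} (1-\theta)^{n-k}\, d\theta.
\]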
24. Marginal Likelihood
- Using integration by parts we have
  \[
  \int_0^1 \theta^{k} (1-\theta)^{n-k}\, d\theta = \frac{n-k}{k+1} \int_0^1 \theta^{k+1} (1-\theta)^{n-k-1}\, d\theta
  \]
- Multiplying both sides by \binom{n}{k}, and noting that \binom{n}{k}\frac{n-k}{k+1} = \binom{n}{k+1}, we have
  \[
  \binom{n}{k} \int_0^1 \theta^{k} (1-\theta)^{n-k}\, d\theta = \binom{n}{k+1} \int_0^1 \theta^{k+1} (1-\theta)^{n-k-1}\, d\theta
  \]
25. Marginal Likelihood (cont.)
- The recursion terminates when k = n:
  \[
  \binom{n}{n} \int_0^1 \theta^{n}\, d\theta = \frac{1}{n+1}
  \]
- Thus
  \[
  P(x_1, \ldots, x_n) = \int_0^1 \theta^{k} (1-\theta)^{n-k}\, d\theta = \frac{1}{(n+1)\binom{n}{k}} = \frac{k!\,(n-k)!}{(n+1)!}
  \]
- We conclude that the posterior is as given below
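Dividing the likelihood by the marginal likelihood gives a Beta(k+1, n-k+1) density:

\[
P(\theta \mid x_1, \ldots, x_n) = \frac{\theta^{k} (1-\theta)^{n-k}}{\int_0^1 \theta^{k} (1-\theta)^{n-k}\, d\theta} = \frac{(n+1)!}{k!\,(n-k)!}\, \theta^{k} (1-\theta)^{n-k}.
\]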
26. Bayesian Prediction
- How do we predict using the posterior?
- We can think of this as computing the probability of the next element in the sequence
- Assumption: if we know θ, the probability of X_{n+1} is independent of X_1, ..., X_n
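Under this assumption the predictive probability averages over the posterior, and with the uniform prior above it recovers the Laplace correction:

\[
P(X_{n+1} = H \mid x_1, \ldots, x_n)
= \int_0^1 \theta\, P(\theta \mid x_1, \ldots, x_n)\, d\theta
= \int_0^1 \theta\, \frac{(n+1)!}{k!\,(n-k)!}\, \theta^{k} (1-\theta)^{n-k}\, d\theta
= \frac{k+1}{n+2}.
\]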
27. Bayesian Prediction
28. Naïve Bayes
29. Bayesian Classification: Binary Domain
- Consider the following situation:
- Two classes: -1, 1
- Each example is described by N attributes
- Each Xn is a binary variable with values 0, 1
- Example dataset:

  X1  X2  ...  XN    C
   0   1  ...   1    1
   1   0  ...   1   -1
   1   1  ...   0    1
   0   0  ...   0    1
30. Binary Domain - Priors
- How do we estimate P(C)?
- Simple binomial estimation
- Count the instances with C = -1 and with C = 1

  X1  X2  ...  XN    C
   0   1  ...   1    1
   1   0  ...   1   -1
   1   1  ...   0    1
   0   0  ...   0    1
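Using the MLE from the binomial estimation slides (or its Laplace-corrected version):

\[
\hat P(C = c) = \frac{N(C = c)}{M},
\]

where N(C = c) counts the training instances with class c and M is the total number of instances; in the four-row example above, \hat P(C = 1) = 3/4.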
31. Binary Domain - Attribute Probability
- How do we estimate P(X1, ..., XN | C)?
- Two sub-problems: one estimate for each value of C

  X1  X2  ...  XN    C
   0   1  ...   1    1
   1   0  ...   1   -1
   1   1  ...   0    1
   0   0  ...   0    1
32. Naïve Bayes
- The Naïve Bayes assumption (written out below)
- This is an independence assumption: each attribute Xi is independent of the other attributes once we know the value of C
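The assumption in formula form:

\[
P(X_1, \ldots, X_N \mid C) = \prod_{i=1}^{N} P(X_i \mid C).
\]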
33. Naïve Bayes: Boolean Domain
- Parameters: for each i, the conditional probabilities P(Xi = 1 | C = 1) and P(Xi = 1 | C = -1)
- How do we estimate P(X1 = 1 | C = 1)?
- Simple binomial estimation: count the 1 and 0 values of X1 in the instances where C = 1

  X1  X2  ...  XN    C
   0   1  ...   1    1
   1   0  ...   1   -1
   1   1  ...   0    1
   0   0  ...   0    1
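A minimal sketch of these counting estimates and the resulting classifier, using the Laplace-corrected counts from the earlier slides; the tiny dataset is the table above with the elided columns dropped, and all helper names are illustrative.

# Naive Bayes for binary attributes with Laplace-corrected counts (illustrative sketch).
data = [  # (attribute vector, class)
    ([0, 1, 1],  1),
    ([1, 0, 1], -1),
    ([1, 1, 0],  1),
    ([0, 0, 0],  1),
]
classes = [-1, 1]
n_attrs = len(data[0][0])

# Priors: (count + 1) / (M + number of classes)
class_count = {c: sum(1 for _, y in data if y == c) for c in classes}
prior = {c: (class_count[c] + 1) / (len(data) + len(classes)) for c in classes}

# P(Xi = 1 | C = c): (count of Xi = 1 within class c + 1) / (class count + 2)
p_one = {c: [(sum(1 for x, y in data if y == c and x[i] == 1) + 1) / (class_count[c] + 2)
             for i in range(n_attrs)]
         for c in classes}

def predict(x):
    """Return the class maximizing P(C = c) * prod_i P(Xi = xi | C = c)."""
    def score(c):
        s = prior[c]
        for i, xi in enumerate(x):
            s *= p_one[c][i] if xi == 1 else 1 - p_one[c][i]
        return s
    return max(classes, key=score)

print(predict([1, 0, 1]))  # classify a new attribute vector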
34. Interpretation of Naïve Bayes
35. Interpretation of Naïve Bayes
- Each Xi votes about the prediction
- If P(Xi | C = -1) = P(Xi | C = 1), then Xi has no say in the classification
- If P(Xi | C = -1) = 0, then Xi overrides all other votes (veto)
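The voting view comes from writing the posterior odds in log form, using Bayes rule and the Naïve Bayes factorization:

\[
\log \frac{P(C = 1 \mid X_1, \ldots, X_N)}{P(C = -1 \mid X_1, \ldots, X_N)}
= \log \frac{P(C = 1)}{P(C = -1)} + \sum_{i=1}^{N} \log \frac{P(X_i \mid C = 1)}{P(X_i \mid C = -1)},
\]

so each attribute contributes an additive term (its vote): the term is 0 when the two conditionals are equal, and it dominates everything else when one of the conditionals is 0.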
36. Interpretation of Naïve Bayes
37. Normal Distribution
- The Gaussian distribution

(Plot: the Gaussian density curve.)
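The density, with mean μ and standard deviation σ:

\[
p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right).
\]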
38. Maximum Likelihood Estimate
- Suppose we observe x1, ..., xm
- Simple calculations show that the MLE is as given below
- The sufficient statistics are the sum of the observations and the sum of their squares
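For a Gaussian with unknown mean and variance:

\[
\hat\mu = \frac{1}{m} \sum_{i=1}^{m} x_i,
\qquad
\hat\sigma^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \hat\mu)^2,
\]

which depend on the data only through \sum_i x_i and \sum_i x_i^2.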
39. Naïve Bayes with Gaussian Distributions
- Recall the Naïve Bayes factorization P(X1, ..., XN | C) = Π_i P(Xi | C)
- Assume each P(Xi | C) is Gaussian, where:
- The mean of Xi depends on the class
- The variance of Xi does not
40. Naïve Bayes with Gaussian Distributions
- With this assumption, the vote of each Xi is governed by two quantities (see the formula below):
- The distance between the class means
- The distance of Xi to the midway point between them
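One standard way to write this vote, assuming class-conditional means μ_{i,1}, μ_{i,-1} and a shared variance σ_i² for attribute Xi:

\[
\log \frac{P(X_i \mid C = 1)}{P(X_i \mid C = -1)}
= \frac{\mu_{i,1} - \mu_{i,-1}}{\sigma_i^{2}} \left( X_i - \frac{\mu_{i,1} + \mu_{i,-1}}{2} \right),
\]

so each vote scales how far Xi lies from the midpoint of the two means by the (variance-normalized) distance between the means.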
41. Different Variances?
- If we allow different variances, the classification rule is more complex
- The corresponding term is quadratic in Xi