Title: Bayesian Learning
1- Bayesian Learning
- Machine Learning by Mitchell, Chp. 6
- Neural Networks for Pattern Recognition by Bishop, Chp. 1
- Berrin Yanikoglu
- Oct 2009
2Basic Probability
3Probability Theory
- Marginal Probability
- Conditional Probability
- Joint Probability
4Probability Theory
- Marginal Probability
- Conditional Probability
- Joint Probability
6Probability Theory
Product Rule: P(X,Y) = P(Y|X) P(X)
7The Rules of Probability
8Probability - Basics
Note that when events are mutually exclusive, P(A or B) reduces to P(A) + P(B): the RHS is simply the addition.
9Independence
- If P(X,Y) = P(X)P(Y), the random variables X and Y are said to be independent.
- Equivalently, P(X|Y) = P(X)
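As a quick illustration (a minimal sketch with a made-up joint distribution, not from the slides), we can check independence numerically by comparing P(X,Y) with P(X)P(Y):

```python
import numpy as np

# Hypothetical joint distribution P(X, Y): rows index X, columns index Y; entries sum to 1.
joint = np.array([[0.10, 0.20, 0.10],
                  [0.15, 0.30, 0.15]])

p_x = joint.sum(axis=1)          # marginal P(X)
p_y = joint.sum(axis=0)          # marginal P(Y)
product = np.outer(p_x, p_y)     # P(X) P(Y) for every (x, y) pair

# X and Y are independent iff P(X, Y) = P(X) P(Y) everywhere.
print(np.allclose(joint, product))   # True for this table
print(joint[0] / p_y)                # P(X=0 | Y) equals P(X=0) = 0.4 for every y
```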
10Bayes Theorem
posterior ∝ likelihood × prior
11Bayesian Decision Theory
12- Imagine that your task is to classify a's (C1) from b's (C2).
- How would you decide if you had to decide without seeing a new instance?
- Choose C1 if P(C1) > P(C2) (based on the prior probabilities alone)
- Choose C2 otherwise
132) How about if you have one measured feature X
about your instance?
14Definition of probabilities based on frequencies
P(C1, X=x) = (num. samples in corresponding box) / (num. all samples)  // joint probability of C1 and X=x
P(X=x|C1) = (num. samples in corresponding box) / (num. samples in C1-row)  // class-conditional probability of X
P(C1) = (num. samples in C1-row) / (num. all samples)  // prior probability of C1
P(C1, X=x) = P(X=x|C1) P(C1)  // Bayes Thm.
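A small sketch of these frequency-based definitions (the count table below is hypothetical, just to make the arithmetic concrete):

```python
import numpy as np

# Hypothetical counts: rows are classes C1, C2; columns are bins of the feature X.
counts = np.array([[30, 10,  5],    # class C1
                   [ 5, 15, 35]])   # class C2
n_total = counts.sum()

joint = counts / n_total                                          # P(Ck, X=x)
prior = counts.sum(axis=1) / n_total                              # P(Ck)
class_conditional = counts / counts.sum(axis=1, keepdims=True)    # P(X=x | Ck)

# Bayes theorem as stated on the slide: P(C1, X=x) = P(X=x | C1) P(C1)
print(np.allclose(joint[0], class_conditional[0] * prior[0]))     # True

# Posterior P(C1 | X=x) for each bin, used in the decision rule on the next slides
posterior_c1 = joint[0] / joint.sum(axis=0)
print(posterior_c1)
```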
15Histogram representation better highlights the
decision problem.
16- You would minimize misclassification errors if you choose the class that has the maximum posterior probability
- Choose C1 if p(C1|X=x) > p(C2|X=x)
- Choose C2 otherwise
- Equivalently, since p(C1|X=x) = p(X=x|C1)P(C1)/P(X=x)
- Choose C1 if p(X=x|C1)P(C1) > p(X=x|C2)P(C2)
- Choose C2 otherwise
- Notice that both p(X=x|C1) and P(C1) are easy to compute.
17Posterior Probability Distribution
18Probability Densities
Cumulative Probability
19Continuous valued attributes
- P(x ∈ [a, b]) = 1 if the interval [a, b] corresponds to the whole of X-space.
- Note that to be proper, we use upper-case letters for probabilities and lower-case letters for probability densities, but this is not always followed.
- For continuous variables, the class-conditional probabilities introduced above become class-conditional probability density functions, which we write in the form p(x|Ck).
20Multiple attributes
- If there are d variables/attributes x1,...,xd, we may group them into a vector x = (x1,...,xd)^T corresponding to a point in a d-dimensional space.
- The distribution of values of x can be described by a probability density function p(x), such that the probability of x lying in a region R of the d-dimensional space is given by P(x ∈ R) = ∫_R p(x) dx.
21Bayes Thm. in General
- The prior probabilities can be combined with the class-conditional densities to give the posterior probabilities P(Ck|x) using Bayes' theorem: P(Ck|x) = p(x|Ck)P(Ck) / p(x)
- Note that you can show (and generalize to k classes) that p(x) = Σj p(x|Cj)P(Cj), so that the posteriors sum to 1.
22Decision Regions
- In general, assign a feature x to Ck if Ck = argmax_j P(Cj|x)
- Equivalently, assign a feature x to Ck if p(x|Ck)P(Ck) > p(x|Cj)P(Cj) for all j ≠ k
- This generates c decision regions R1,...,Rc such that a point falling in region Rk is assigned to class Ck.
- Note that each of these regions need not be contiguous, but may itself be divided into several disjoint regions, all of which are associated with the same class.
- The boundaries between these regions are known as decision surfaces or decision boundaries.
23Probability of Error
- For two regions R1 and R2 (you can generalize): P(error) = P(x ∈ R1, C2) + P(x ∈ R2, C1) = ∫_R1 p(x, C2) dx + ∫_R2 p(x, C1) dx
Not ideal decision boundary!
24Justification for the Decision Criteria based on max. Posterior Probability
25Minimum Misclassification Rate
26Justification for the Decision Criteria based on max. Posterior Probability
- For the more general case of K classes, it is slightly easier to maximize the probability of being correct: P(correct) = Σk P(x ∈ Rk, Ck) = Σk ∫_Rk p(x|Ck)P(Ck) dx, which is maximized by assigning each x to the class with the largest posterior.
27Expected Value
- The expected value of a function f(x), where x has the probability density p(x), is
- Discrete: E[f] = Σx p(x) f(x)
- Continuous: E[f] = ∫ p(x) f(x) dx
- For a finite set of data points x1, ..., xn drawn from the distribution p(x), the expectation can be approximated by the average over the data points: E[f] ≈ (1/n) Σi f(xi)
28Minimum Expected Loss/Risk
- We define a loss matrix with elements Lkj specifying the penalty associated with assigning a pattern to class Cj when in fact it belongs to class Ck.
- Example: classify medical images as cancer or normal (the loss matrix has rows indexed by the Truth and columns by the Decision).
29Minimum Expected Loss/Risk
30Minimum Expected Loss/Risk
- We define a loss matrix with elements Lkj specifying the penalty associated with assigning a pattern to class Cj when in fact it belongs to class Ck (k-to-j).
- Consider all the patterns x which belong to class Ck. The expected loss for only those patterns (the risk associated with instances from class k) is given by Rk = Σj Lkj ∫_Rj p(x|Ck) dx
- Overall expected loss/risk: R = Σk P(Ck) Rk
31Minimizing Expected Risk
- This risk is minimized if the integrand is minimized at each point x, that is, if the regions Rj are chosen such that x ∈ Rj when Σk Lkj p(x|Ck)P(Ck) < Σk Lki p(x|Ck)P(Ck) for all i ≠ j
- This is the full generalization of the simple rule minimizing the number of misclassifications (a code sketch of this rule follows below).
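A minimal sketch of the minimum-risk rule (the loss matrix and posterior below are hypothetical, echoing the cancer/normal example): for each candidate decision Cj, compute the conditional risk Σk Lkj P(Ck|x) and pick the smallest.

```python
import numpy as np

# Hypothetical loss matrix L[k, j]: penalty for deciding Cj when the truth is Ck.
# Classes: 0 = normal, 1 = cancer; missing a cancer is penalized heavily.
L = np.array([[0.0,   1.0],    # truth = normal
              [100.0, 0.0]])   # truth = cancer

def min_risk_decision(posterior):
    """posterior[k] = P(Ck | x); returns the class index minimizing expected loss."""
    risk = posterior @ L          # risk[j] = sum_k L[k, j] * P(Ck | x)
    return int(np.argmin(risk)), risk

# Example: the posterior favors 'normal', yet the high miss penalty
# makes 'cancer' the minimum-risk decision.
decision, risk = min_risk_decision(np.array([0.9, 0.1]))
print(decision, risk)   # 1, risk = [10.0, 0.9]
```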
32End of class on Oct 22
33Reject Option
34Discriminant Functions
- Although we have focused on probability distribution functions, the decision on class membership in our classifiers has been based solely on the relative sizes of the probabilities.
- This observation allows us to reformulate the classification process in terms of a set of discriminant functions y1(x), ..., yc(x) such that an input vector x is assigned to class Ck if yk(x) > yj(x) for all j ≠ k.
- We can recast the decision rule for minimizing the probability of misclassification in terms of discriminant functions by choosing yk(x) = P(Ck|x).
35Discriminant Functions
We can use any monotonic function of yk(x) that would simplify calculations, since a monotonic transformation does not change the ordering of the yk's.
36- In fact, we can categorize three fundamental approaches to classification:
- Generative models: model p(x|Ck) and P(Ck) separately and use Bayes' theorem to find the posterior probabilities P(Ck|x)
- E.g. Naive Bayes, Gaussian Mixture Models, Hidden Markov Models, ...
- Discriminative models: determine P(Ck|x) directly and use it in the decision
- E.g. Linear discriminant analysis, SVMs, NNs, ...
- Discriminant functions: find a discriminant function f that maps x onto a class label directly, without calculating probabilities
- Advantages? Disadvantages?
37Generative vs Discriminative Model Complexities
38Why Separate Inference and Decision?
- Having posterior probabilities is useful for:
- Minimizing risk (the loss matrix may change over time)
- If we only have a discriminant function, any change in the loss function would require re-training
- Reject option
- Posterior probabilities allow us to determine a rejection criterion that will minimize the misclassification rate (or, more generally, the expected loss) for a given fraction of rejected data points
- Compensating for unbalanced class priors
- If training used artificially balanced data, after training we can divide the obtained posteriors by the class fractions in the data set and multiply by the class fractions of the true population
- Combining models
- We may wish to break a complex problem into smaller subproblems, e.g. blood tests, X-rays, ...
- As long as each model gives posteriors for each class, we can combine the outputs using the rules of probability. How? (see the sketch below)
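One common answer, sketched below under the assumption that the two inputs (say, the blood test xA and the X-ray xB) are conditionally independent given the class, so that P(Ck|xA,xB) ∝ P(Ck|xA) P(Ck|xB) / P(Ck); the numbers are hypothetical.

```python
import numpy as np

def combine_posteriors(post_a, post_b, prior):
    """Combine two models' posteriors over the same classes, assuming the
    underlying inputs are conditionally independent given the class:
    P(Ck | xA, xB) proportional to P(Ck | xA) * P(Ck | xB) / P(Ck)."""
    unnormalized = post_a * post_b / prior
    return unnormalized / unnormalized.sum()

# Hypothetical outputs of a blood-test model and an X-ray model for classes (normal, cancer)
prior = np.array([0.99, 0.01])
blood = np.array([0.90, 0.10])
xray  = np.array([0.70, 0.30])
print(combine_posteriors(blood, xray, prior))   # the weak individual evidence combines into a strong posterior for 'cancer'
```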
39Mitchell Chp.6
- Maximum Likelihood (ML)
- Maximum A Posteriori (MAP)
- Hypotheses
40Advantages of Bayesian Learning
- Bayesian approaches, including the Naive Bayes classifier, are among the most common and practical ones in machine learning
- Bayesian methods provide a useful perspective for understanding many learning algorithms that do not manipulate probabilities
41Features of Bayesian Learning
- Each observed training example can incrementally decrease or increase the estimated probability of a hypothesis, rather than completely eliminating a hypothesis if it is found to be inconsistent with a single example
- Prior knowledge can be combined with observed data to determine the final probability of a hypothesis
- New instances can be classified by combining the predictions of multiple hypotheses, weighted by their probabilities
- Even in computationally intractable cases, the Bayes optimal classifier provides a standard of optimal decision against which other practical methods can be compared
42Bayes Theorem
P(h|D) = P(D|h)P(h) / P(D), where P(D|h) is also called the likelihood.
We are interested in finding the best (most probable) hypothesis from some space H, given the observed data D plus any initial knowledge about the prior probabilities of various hypotheses in H.
43Choosing Hypotheses
The maximum a posteriori (MAP) hypothesis: hMAP = argmax_h P(h|D) = argmax_h P(D|h)P(h)
44Choosing Hypotheses
If every hypothesis is equally probable a priori, this reduces to the maximum likelihood (ML) hypothesis: hML = argmax_h P(D|h)
45Example to Work on
What are our hypotheses in this case?
47Bayes Optimal Classifier / Naive Bayes Classifier
48Bayes Optimal Classifier
- Skip 6.5.
- So far we have considered the question:
- "What is the most probable hypothesis given the training data?"
- In fact, the question that is often of most significance is:
- "What is the most probable classification of the new instance given the training data?"
- Although it may seem that this second question can be answered by simply applying the MAP hypothesis to the new instance, in fact it is possible to do better.
49Bayes Optimal Classifier
The Bayes optimal classification of a new instance is argmax_{vj in V} Σ_{hi in H} P(vj|hi) P(hi|D).
50Bayes Optimal Classifier
No other classifier using the same hypothesis space and the same prior knowledge can outperform this method on average.
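A small code sketch of this weighted vote, using a hypothetical three-hypothesis posterior (the same kind of numbers Mitchell uses to show that the Bayes optimal classification can disagree with the MAP hypothesis):

```python
# Hypothetical posteriors over three hypotheses and their (deterministic) votes:
# the MAP hypothesis h1 says '+', yet the Bayes optimal classification is '-'.
posterior = {'h1': 0.4, 'h2': 0.3, 'h3': 0.3}
predicts  = {'h1': '+', 'h2': '-', 'h3': '-'}   # P(v | h) is 1 for the predicted label

def bayes_optimal(posterior, predicts, labels=('+', '-')):
    # score(v) = sum over hypotheses h of P(v | h) * P(h | D)
    scores = {v: sum(p for h, p in posterior.items() if predicts[h] == v) for v in labels}
    return max(scores, key=scores.get), scores

print(bayes_optimal(posterior, predicts))   # ('-', {'+': 0.4, '-': 0.6})
```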
51Gibbs Classifier (Opper and Haussler, 1991, 1994)
52Naive Bayes Classifier
53Naive Bayes Classifier
- But it is difficult (requires a lot of data) to estimate P(a1, a2, ..., an | vj)
- Naive Bayes assumption: the attributes are conditionally independent given the class, so P(a1, ..., an | vj) = Πi P(ai | vj), giving vNB = argmax_vj P(vj) Πi P(ai | vj)
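A minimal sketch of the resulting classifier, vNB = argmax_vj P(vj) Πi P(ai|vj), with probabilities estimated by simple counting (no m-estimate smoothing; the tiny dataset is hypothetical):

```python
from collections import Counter, defaultdict
import math

def train_naive_bayes(examples):
    """examples: list of (attribute_tuple, label) pairs."""
    label_counts = Counter(label for _, label in examples)
    cond_counts = defaultdict(Counter)     # (attr_index, label) -> Counter over attribute values
    for attrs, label in examples:
        for i, a in enumerate(attrs):
            cond_counts[(i, label)][a] += 1
    return label_counts, cond_counts

def classify(attrs, label_counts, cond_counts):
    """vNB = argmax_v P(v) * prod_i P(a_i | v), computed in log space."""
    n = sum(label_counts.values())
    best, best_logp = None, -math.inf
    for label, count in label_counts.items():
        logp = math.log(count / n)                       # log P(v)
        for i, a in enumerate(attrs):
            p = cond_counts[(i, label)][a] / count       # P(a_i | v) by counting
            logp += math.log(p) if p > 0 else -1e9       # unseen value: effectively zero
        if logp > best_logp:
            best, best_logp = label, logp
    return best

# Usage with a tiny hypothetical dataset; attributes are (outlook, wind).
data = [(('sunny', 'weak'), 'no'), (('rain', 'weak'), 'yes'),
        (('overcast', 'strong'), 'yes'), (('sunny', 'strong'), 'no')]
lc, cc = train_naive_bayes(data)
print(classify(('rain', 'strong'), lc, cc))   # 'yes'
```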
55Illustrative Example
56Illustrative Example
57Naive Bayes Subtleties
58Naive Bayes Subtleties
59End of class on Oct 28
60- In the following few sections, we will give some justification for some machine learning approaches using a Bayesian perspective:
- Concept learning vs the MAP hypothesis
- Justification for minimizing the sum-squared error
- Minimum Description Length principle
- Then we will go back to the more practical side by estimating the mean, variance and covariance from given data, so as to model a Gaussian distribution
- Finally, we will use this distribution (whose parameters are now known) to find the most likely class label.
61Concept Learning
62- What is the relation between Bayes' theorem, which allows us to compute posterior probabilities, and concept learning?
- Compare a Brute Force MAP algorithm to concept learning algorithms such as Candidate-Elimination and Find-S
- Brute Force MAP algorithm:
- Calculate the posterior probability of each hypothesis and output the one which is most likely.
- Computationally complex, but theoretically interesting
63- For the Brute Force MAP algorithm, we must specify P(h) and P(D|h)
- P(D) will be found from these two
- Let's choose them to be consistent with the following assumptions, which are also used in Find-S and Candidate Elimination:
- Training data D is noise free
- The target concept c is in the hypothesis space H
- Each hypothesis is equally probable (a priori)
64- Let's choose them to be consistent with our assumptions:
- Each hypothesis is equally probable (a priori) and the correct hypothesis is in H
- P(h) = 1/|H| for all h in H (they are equally likely AND sum to 1)
- Training data D is noise free
- P(D|h) is the probability of observing D, given h
- P(D|h) = 1 if di = h(xi) for all di in D (since we assume noise-free training data)
- 0 otherwise
- That is, P(D|h) = 1 if h is consistent with D
- 0 otherwise
65- P(h|D) = P(D|h)P(h) / P(D)
- For inconsistent hypotheses: P(h|D) = (0 · 1/|H|) / P(D) = 0
- For consistent hypotheses: P(h|D) = (1 · 1/|H|) / P(D) = 1/|VS_H,D|
- with P(D) = Σh P(D|h)P(h) = |VS_H,D| · 1/|H|, since only the consistent hypotheses contribute.
67- In short, P(h|D) under our assumed P(h) and P(D|h) is:
- P(h|D) = 1/|VS_H,D| if h is consistent with D
- 0 otherwise
- Every consistent hypothesis is a MAP hypothesis (since they all have equal posterior probability).
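A brute-force sketch on a toy hypothesis space (the threshold hypotheses and data below are hypothetical): every hypothesis consistent with D receives posterior 1/|VS_H,D| and the rest receive 0.

```python
# Toy hypothesis space: h_t(x) = 1 if x >= t, for thresholds t in {0, 1, ..., 5}.
hypotheses = {f"h_t={t}": (lambda x, t=t: int(x >= t)) for t in range(6)}

# Noise-free training data D as (x, label) pairs.
D = [(1, 0), (3, 1), (4, 1)]

consistent = [name for name, h in hypotheses.items()
              if all(h(x) == y for x, y in D)]

# P(h|D) = 1/|VS_H,D| for consistent h, 0 otherwise (uniform prior, noise-free data).
posterior = {name: (1 / len(consistent) if name in consistent else 0.0)
             for name in hypotheses}
print(posterior)   # h_t=2 and h_t=3 each get posterior 0.5; all others get 0
```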
68Generalizing to All Consistent Learners
- Consistent learner: one that outputs a hypothesis that commits zero error over the training data.
- Every consistent learner outputs a MAP hypothesis, if we assume equal priors and noise-free data.
- E.g. Find-S, Candidate Elimination
69Characterizing Learning Algorithms by Equivalent MAP Learners
Using the Bayesian framework, we can characterize the implicit assumptions (i.e. the probability distributions) under which an algorithm outputs the optimal (i.e. MAP) hypothesis. This is similar to determining the inductive bias of a learner.
70Evolution of Posterior Probabilities
The evolution of the probabilities associated with the hypotheses: as we gather more data (nothing, then sample D1, then sample D2), inconsistent hypotheses get 0 posterior probability and the consistent ones share the remaining probability (summing to 1). Here Di is used to indicate one training instance.
71Deriving hML in a regression example
72- Bayesian analysis will show that, under certain assumptions, any learning algorithm that minimizes the squared error between the hypothesis' predictions and the training data will output a Maximum Likelihood hypothesis
- This provides a Bayesian justification for methods such as NNs and curve fitting that use the squared error.
73Learning a Real Valued Function
Training examples are of the form (xi, di) with di = f(xi) + ei, where the noise ei is drawn from a Normal distribution with zero mean and standard deviation σ.
75Proof
Considering the probability of di given that h is the correct description of the target function, we use f(xi) = h(xi), so p(di|h) = (1/√(2πσ²)) exp(-(di - h(xi))²/(2σ²)). Maximizing the log of the product of these terms over all di then reduces to minimizing Σi (di - h(xi))², the sum of squared errors.
76From PRML Chp. 1
- We can justify the minimization of the squared error strictly from the curve-fitting perspective. This is very similar to the previous few slides, but makes some notations and concepts more explicit and gives a slightly different emphasis.
- The goal is to make a prediction for the target variable t given some new value of the input variable x, on the basis of a set of training data comprising N input values x = (x1, x2, ..., xN) and their corresponding target values t = (t1, t2, ..., tN).
- We first proceed by finding the coefficients w of the polynomial that maximize the likelihood of the data, and then we output y(x,w)
- (yw(x) may also be used, to indicate the dependence on w)
77- We can express our uncertainty over the value of the target variable using a probability distribution.
- For this, let's assume that given the value of x, the corresponding value of t has a Gaussian distribution with a mean equal to the value y(x,w).
In PRML, β = 1/σ² is used for consistency with future chapters.
78Maximum Likelihood
Use the training data {x, t} to determine the values of the unknown parameters w and β by the maximum likelihood criterion.
wML is determined by minimizing the sum-of-squares error, E(w) = ½ Σn (y(xn, w) - tn)².
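A sketch of this fit on synthetic data (the sine curve, noise level, and polynomial degree below are arbitrary choices, not from the slides): maximizing the Gaussian likelihood over w is exactly a least-squares fit, and 1/βML is the mean squared residual of that fit.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 30
x = np.linspace(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=N)   # targets with Gaussian noise

# Maximizing the Gaussian likelihood over w == minimizing the sum-of-squares error.
degree = 3
w_ml = np.polyfit(x, t, degree)           # least-squares polynomial coefficients
y = np.polyval(w_ml, x)                   # y(x, w_ML)

# ML estimate of the noise precision: 1/beta_ML = (1/N) * sum_n (y(x_n, w_ML) - t_n)^2
beta_ml = 1.0 / np.mean((y - t) ** 2)
print(w_ml, beta_ml)
```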
79Predictive Distribution
80MAP A Step towards Bayes SKIP
Determine wMAP by minimizing the regularized sum-of-squares error.
81Bayesian Curve Fitting SKIP
82Bayesian Predictive Distribution SKIP
83Minimum Description Length Principle
84Minimum Description Length Principle
- Remember Occam's razor, a popular inductive bias that prefers the shortest explanation for the observed data.
- Now we can give a Bayesian intuition to support it.
85Minimum Description Length Principle
This can be seen as a justification for preferring shorter hypotheses, assuming a particular representation scheme for encoding hypotheses and data.
Information Theory: In designing a code to transmit messages drawn at random, one should assign shorter codes to messages that are more probable, in order to minimize the expected transmission time/length; in fact the optimal code length for a message of probability pi is shown to be -log2 pi bits (Shannon 1949). The expected length for transmitting one message is -Σi pi log2 pi, which is the entropy of the set of possible messages.
- hMAP = argmax_h P(h|D)
- = argmax_h P(D|h)P(h) / P(D)
- = argmax_h [log2 P(D|h) + log2 P(h)]
- = argmin_h [-log2 P(D|h) - log2 P(h)]
86Reminder Entropy
- What if we have the following distribution for a random variable x?
- In order to save on transmission costs, we would design codes that reflect this distribution.
87Reminder Entropy
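A short sketch (the distribution below is hypothetical, chosen so the numbers come out round): the optimal code assigns about -log2 p(i) bits to symbol i, and the expected message length is the entropy H = -Σi p(i) log2 p(i).

```python
import numpy as np

# Hypothetical skewed distribution over 8 symbols; a uniform one would need 3 bits/symbol.
p = np.array([1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64])

code_lengths = -np.log2(p)             # optimal code length per symbol, in bits
entropy = np.sum(p * code_lengths)     # expected bits per transmitted symbol
print(code_lengths)                    # [1, 2, 3, 4, 6, 6, 6, 6]
print(entropy)                         # 2.0 bits, versus 3 bits for a uniform code
```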
88Minimum Description Length Principle
- Description length of message i with respect to code C: LC(i) = the number of bits required to encode message i using code C
- hMAP = argmin_h [-log2 P(D|h) - log2 P(h)]
- = argmin_h [L_CD|h(D|h) + L_CH(h)]
where CH is the optimal encoding for H and CD|h is the optimal encoding for D given h.
89- Prefer the hypothesis that minimizes
- length(h) + length(additional information to encode D given h)
- = length(h) + length(misclassifications)
- since we only need to send a message when a data sample is not in agreement with h, hence only for the misclassifications.
- E.g. encoding using decision trees and MDL
- Conclusions:
- If we choose the optimal encodings, hMDL = hMAP
- MDL-based methods perform similarly to standard pruning techniques.
90Expectations
91Expectations
The average value of a function f(x) under a probability distribution p(x) is called the expectation of f(x), i.e. the average is weighted by the relative probabilities of different values of x.
Approximate Expectation (discrete and continuous)
92Variance and Covariance
The variance of f(x) provides a measure of how much f(x) varies around its mean E[f(x)]: var[f] = E[(f(x) - E[f(x)])²]
The covariance of two random variables x and y measures the extent to which they vary together: cov[x, y] = E_{x,y}[(x - E[x])(y - E[y])]
For two vectors of random variables x and y, the covariance is a matrix.
93Multi-Variable and Conditional Expectations
E_x[f(x,y)] = Σ_x p(x) f(x,y) is the expectation of a function of two variables: the average of f(x,y) w.r.t. the distribution of x. The subscript indicates which variable is being averaged over.
Conditional Expectation (discrete): E_x[f|y] = Σ_x p(x|y) f(x)
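A brief sketch of these quantities estimated from samples (synthetic data; the distributions and the function f are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=1.0, size=100_000)
y = 0.5 * x + rng.normal(scale=0.3, size=x.size)       # correlated with x

f = lambda v: v ** 2
E_f    = np.mean(f(x))                                  # E[f] ~ (1/N) sum_n f(x_n)
var_f  = np.mean((f(x) - E_f) ** 2)                     # var[f] = E[(f - E[f])^2]
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))       # cov[x, y]

print(E_f, var_f, cov_xy)          # roughly 5.0, 18, and 0.5 for this setup
print(np.cov(np.vstack([x, y])))   # covariance matrix for the vector (x, y)
```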
94Normal Distribution
95The Gaussian Distribution
96Expectations
The average value of a function f(x) under a probability distribution p(x) is called the expectation of f(x), i.e. the average is weighted by the relative probabilities of different values of x.
Approximate Expectation (discrete and continuous)
For normally distributed x, E[x] = μ.
97Gaussian Mean and Variance
The variance of f(x) provides a measure of how much f(x) varies around its mean E[f(x)].
For normally distributed x, E[x] = μ and var[x] = σ².
The covariance of two random variables x and y measures the extent to which they vary together; for two vectors of random variables x and y, the covariance is a matrix.
98Normal Distribution / Multivariate Normal Distribution
- For a single variable, the normal density function is p(x) = (1/√(2πσ²)) exp(-(x - μ)²/(2σ²))
- For variables in higher dimensions, this generalizes to p(x) = (1/((2π)^{d/2} |Σ|^{1/2})) exp(-½ (x - μ)^T Σ^{-1} (x - μ))
- where the mean μ is now a d-dimensional vector, Σ is a d x d covariance matrix, and |Σ| is the determinant of Σ
99Decision Rules for the Normal Distribution
The general multivariate normal density is specified by a d-dimensional mean vector and a d-by-d covariance matrix.
100Multivariate Normal Distribution
101From the equation for the normal density, it is apparent that points which have the same density must have the same constant term (x - μ)^T Σ^{-1} (x - μ), the squared Mahalanobis distance, which measures the distance from x to μ in terms of Σ.
103Why Mahalanobis Distance
It takes into account the covariance of the data. E.g. point P is actually closer (in Euclidean distance) to the mean of the 'orange' class, but using the Mahalanobis distance it is found to be closer to the 'apple' class.
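A small numeric sketch of this effect (the two class Gaussians below are made up): the point is closer to the 'orange' mean in Euclidean distance, but closer to the 'apple' class once each class's covariance is taken into account.

```python
import numpy as np

def mahalanobis(x, mean, cov):
    d = x - mean
    return float(np.sqrt(d @ np.linalg.inv(cov) @ d))

# Hypothetical 'apple' class: widely spread; hypothetical 'orange' class: very tight.
mu_apple,  cov_apple  = np.array([0.0, 0.0]), np.array([[9.0, 0.0], [0.0, 9.0]])
mu_orange, cov_orange = np.array([5.0, 0.0]), np.array([[0.25, 0.0], [0.0, 0.25]])

p = np.array([3.0, 0.0])
print(np.linalg.norm(p - mu_apple), np.linalg.norm(p - mu_orange))   # 3.0 vs 2.0: Euclidean favors orange
print(mahalanobis(p, mu_apple, cov_apple),                           # 1.0
      mahalanobis(p, mu_orange, cov_orange))                         # 4.0: Mahalanobis favors apple
```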
104The Multivariate Gaussian
Mahalanobis Distance
Contours of equal density
105Contours of constant probability density for a 2D Gaussian distribution with a) a general covariance matrix, b) a diagonal covariance matrix, c) Σ proportional to the identity matrix
106Covariance Matrices