Title: Bayesian Learning
1- Bayesian Learning
- Machine Learning by Mitchell, Chp. 6
- Neural Networks for Pattern Recognition by Bishop, Chp. 1
- Berrin Yanikoglu
- Oct 2009
2Basic Probability
3Probability Theory
- Marginal Probability
- Conditional Probability
- Joint Probability
4Probability Theory
- Marginal Probability
- Conditional Probability
- Joint Probability
6Probability Theory
Product Rule: P(X,Y) = P(Y|X) P(X)
7The Rules of Probability
8Probability - Basics
Note that when events are mutually exclusive, P(A or B) reduces to P(A) + P(B): the RHS is simply the addition.
9Independence
- If P(X,Y) = P(X)P(Y), the random variables X and Y are said to be independent.
- Equivalently, P(X|Y) = P(X)
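As a quick illustration (a minimal sketch with a made-up joint distribution, not from the slides), we can check independence numerically by comparing P(X,Y) with P(X)P(Y):

```python
import numpy as np

# Hypothetical joint distribution P(X, Y): rows index X, columns index Y; entries sum to 1.
joint = np.array([[0.10, 0.20, 0.10],
                  [0.15, 0.30, 0.15]])

p_x = joint.sum(axis=1)          # marginal P(X)
p_y = joint.sum(axis=0)          # marginal P(Y)
product = np.outer(p_x, p_y)     # P(X) P(Y) for every (x, y) pair

# X and Y are independent iff P(X, Y) = P(X) P(Y) everywhere.
print(np.allclose(joint, product))   # True for this table
print(joint[0] / p_y)                # P(X=0 | Y) equals P(X=0) = 0.4 for every y
```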
10Bayes Theorem
posterior ∝ likelihood × prior
11Bayesian Decision Theory
12- Imagine that your task is to classify a's (C1) from b's (C2).
- How would you decide if you had to decide without seeing a new instance?
- Choose C1 if P(C1) > P(C2) (based on the prior probabilities alone)
- Choose C2 otherwise
132) How about if you have one measured feature X
about your instance?
14Definition of probabilities based on frequencies
P(C1, X=x) = (num. samples in corresponding box) / (num. all samples)  // joint probability of C1 and X=x
P(X=x|C1) = (num. samples in corresponding box) / (num. samples in C1-row)  // class-conditional probability of X
P(C1) = (num. samples in C1-row) / (num. all samples)  // prior probability of C1
P(C1, X=x) = P(X=x|C1) P(C1)  // Bayes Thm.
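A small sketch of these frequency-based definitions (the count table below is hypothetical, just to make the arithmetic concrete):

```python
import numpy as np

# Hypothetical counts: rows are classes C1, C2; columns are bins of the feature X.
counts = np.array([[30, 10,  5],    # class C1
                   [ 5, 15, 35]])   # class C2
n_total = counts.sum()

joint = counts / n_total                                          # P(Ck, X=x)
prior = counts.sum(axis=1) / n_total                              # P(Ck)
class_conditional = counts / counts.sum(axis=1, keepdims=True)    # P(X=x | Ck)

# Bayes theorem as stated on the slide: P(C1, X=x) = P(X=x | C1) P(C1)
print(np.allclose(joint[0], class_conditional[0] * prior[0]))     # True

# Posterior P(C1 | X=x) for each bin, used in the decision rule on the next slides
posterior_c1 = joint[0] / joint.sum(axis=0)
print(posterior_c1)
```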
15Histogram representation better highlights the
decision problem.
16- You would minimize misclassification errors if you choose the class that has the maximum posterior probability
- Choose C1 if p(C1|X=x) > p(C2|X=x)
- Choose C2 otherwise
- Equivalently, since p(C1|X=x) = p(X=x|C1)P(C1)/P(X=x)
- Choose C1 if p(X=x|C1)P(C1) > p(X=x|C2)P(C2)
- Choose C2 otherwise
- Notice that both p(X=x|C1) and P(C1) are easy to compute.
17Posterior Probability Distribution
18Probability Densities
Cumulative Probability
19Continuous valued attributes
- P(x ∈ [a, b]) = 1 if the interval [a, b] corresponds to the whole of X-space.
- Note that to be proper, we use upper-case letters for probabilities and lower-case letters for probability densities, but this is not always followed.
- For continuous variables, the class-conditional probabilities introduced above become class-conditional probability density functions, which we write in the form p(x|Ck).
20Multiple attributes
- If there are d variables/attributes x1,...,xd, we may group them into a vector x = (x1,...,xd)^T corresponding to a point in a d-dimensional space.
- The distribution of values of x can be described by a probability density function p(x), such that the probability of x lying in a region R of the d-dimensional space is given by P(x ∈ R) = ∫_R p(x) dx.
21Bayes Thm. in General
- The prior probabilities can be combined with the class-conditional densities to give the posterior probabilities P(Ck|x) using Bayes' theorem: P(Ck|x) = p(x|Ck)P(Ck) / p(x)
- Note that you can show (and generalize to k classes) that p(x) = Σj p(x|Cj)P(Cj), so that the posteriors sum to 1.
22Decision Regions
- In general, assign a feature x to Ck if Ck = argmax_j P(Cj|x)
- Equivalently, assign a feature x to Ck if p(x|Ck)P(Ck) > p(x|Cj)P(Cj) for all j ≠ k
- This generates c decision regions R1,...,Rc such that a point falling in region Rk is assigned to class Ck.
- Note that each of these regions need not be contiguous, but may itself be divided into several disjoint regions, all of which are associated with the same class.
- The boundaries between these regions are known as decision surfaces or decision boundaries.
23Probability of Error
- For two regions R1 and R2 (you can generalize): P(error) = P(x ∈ R1, C2) + P(x ∈ R2, C1) = ∫_R1 p(x, C2) dx + ∫_R2 p(x, C1) dx
Not ideal decision boundary!
24Justification for the Decision Criteria based on max. Posterior Probability
25Minimum Misclassification Rate
26Justification for the Decision Criteria based on max. Posterior Probability
- For the more general case of K classes, it is slightly easier to maximize the probability of being correct: P(correct) = Σk P(x ∈ Rk, Ck) = Σk ∫_Rk p(x|Ck)P(Ck) dx, which is maximized by assigning each x to the class with the largest posterior.
27Expected Value
- The expected value of a function f(x), where x has the probability density p(x), is
- Discrete: E[f] = Σx p(x) f(x)
- Continuous: E[f] = ∫ p(x) f(x) dx
- For a finite set of data points x1, ..., xn drawn from the distribution p(x), the expectation can be approximated by the average over the data points: E[f] ≈ (1/n) Σi f(xi)
28Minimum Expected Loss/Risk
- We define a loss matrix with elements Lkj specifying the penalty associated with assigning a pattern to class Cj when in fact it belongs to class Ck.
- Example: classify medical images as cancer or normal (the loss matrix has rows indexed by the Truth and columns by the Decision).
29Minimum Expected Loss/Risk
30Minimum Expected Loss/Risk
- We define a loss matrix with elements Lkj specifying the penalty associated with assigning a pattern to class Cj when in fact it belongs to class Ck (k-to-j).
- Consider all the patterns x which belong to class Ck. The expected loss for only those patterns (the risk associated with instances from class k) is given by Rk = Σj Lkj ∫_Rj p(x|Ck) dx
- Overall expected loss/risk: R = Σk P(Ck) Rk
31Minimizing Expected Risk
- This risk is minimized if the integrand is minimized at each point x, that is, if the regions Rj are chosen such that x ∈ Rj when Σk Lkj p(x|Ck)P(Ck) < Σk Lki p(x|Ck)P(Ck) for all i ≠ j
- This is the full generalization of the simple rule minimizing the number of misclassifications (a code sketch of this rule follows below).
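A minimal sketch of the minimum-risk rule (the loss matrix and posterior below are hypothetical, echoing the cancer/normal example): for each candidate decision Cj, compute the conditional risk Σk Lkj P(Ck|x) and pick the smallest.

```python
import numpy as np

# Hypothetical loss matrix L[k, j]: penalty for deciding Cj when the truth is Ck.
# Classes: 0 = normal, 1 = cancer; missing a cancer is penalized heavily.
L = np.array([[0.0,   1.0],    # truth = normal
              [100.0, 0.0]])   # truth = cancer

def min_risk_decision(posterior):
    """posterior[k] = P(Ck | x); returns the class index minimizing expected loss."""
    risk = posterior @ L          # risk[j] = sum_k L[k, j] * P(Ck | x)
    return int(np.argmin(risk)), risk

# Example: the posterior favors 'normal', yet the high miss penalty
# makes 'cancer' the minimum-risk decision.
decision, risk = min_risk_decision(np.array([0.9, 0.1]))
print(decision, risk)   # 1, risk = [10.0, 0.9]
```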
32End of class on Oct 22
33Reject Option
34Discriminant Functions
- Although we have focused on probability distribution functions, the decision on class membership in our classifiers has been based solely on the relative sizes of the probabilities.
- This observation allows us to reformulate the classification process in terms of a set of discriminant functions y1(x), ..., yc(x) such that an input vector x is assigned to class Ck if yk(x) > yj(x) for all j ≠ k.
- We can recast the decision rule for minimizing the probability of misclassification in terms of discriminant functions by choosing yk(x) = P(Ck|x).
35Discriminant Functions
We can use any monotonic function of yk(x) that would simplify calculations, since a monotonic transformation does not change the ordering of the yk's.
36- In fact, we can categorize three fundamental approaches to classification:
- Generative models: model p(x|Ck) and P(Ck) separately and use Bayes' theorem to find the posterior probabilities P(Ck|x)
- E.g. Naive Bayes, Gaussian Mixture Models, Hidden Markov Models, ...
- Discriminative models: determine P(Ck|x) directly and use it in the decision
- E.g. Linear discriminant analysis, SVMs, NNs, ...
- Discriminant functions: find a discriminant function f that maps x onto a class label directly, without calculating probabilities
- Advantages? Disadvantages?
37Generative vs Discriminative Model Complexities
38Why Separate Inference and Decision?
- Having posterior probabilities is useful for:
- Minimizing risk (the loss matrix may change over time)
- If we only have a discriminant function, any change in the loss function would require re-training
- Reject option
- Posterior probabilities allow us to determine a rejection criterion that will minimize the misclassification rate (or, more generally, the expected loss) for a given fraction of rejected data points
- Compensating for unbalanced class priors
- If training used artificially balanced data, after training we can divide the obtained posteriors by the class fractions in the data set and multiply by the class fractions of the true population
- Combining models
- We may wish to break a complex problem into smaller subproblems, e.g. blood tests, X-rays, ...
- As long as each model gives posteriors for each class, we can combine the outputs using the rules of probability. How? (see the sketch below)
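One common answer, sketched below under the assumption that the two inputs (say, the blood test xA and the X-ray xB) are conditionally independent given the class, so that P(Ck|xA,xB) ∝ P(Ck|xA) P(Ck|xB) / P(Ck); the numbers are hypothetical.

```python
import numpy as np

def combine_posteriors(post_a, post_b, prior):
    """Combine two models' posteriors over the same classes, assuming the
    underlying inputs are conditionally independent given the class:
    P(Ck | xA, xB) proportional to P(Ck | xA) * P(Ck | xB) / P(Ck)."""
    unnormalized = post_a * post_b / prior
    return unnormalized / unnormalized.sum()

# Hypothetical outputs of a blood-test model and an X-ray model for classes (normal, cancer)
prior = np.array([0.99, 0.01])
blood = np.array([0.90, 0.10])
xray  = np.array([0.70, 0.30])
print(combine_posteriors(blood, xray, prior))   # the weak individual evidence combines into a strong posterior for 'cancer'
```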
39Mitchell Chp.6
- Maximum Likelihood (ML)
- Maximum A Posteriori (MAP)
- Hypotheses
40Advantages of Bayesian Learning
- Bayesian approaches, including the Naive Bayes classifier, are among the most common and practical ones in machine learning
- Bayesian methods provide a useful perspective for understanding many learning algorithms that do not manipulate probabilities
41Features of Bayesian Learning
- Each observed training example can incrementally decrease or increase the estimated probability of a hypothesis, rather than completely eliminating a hypothesis if it is found to be inconsistent with a single example
- Prior knowledge can be combined with observed data to determine the final probability of a hypothesis
- New instances can be classified by combining the predictions of multiple hypotheses, weighted by their probabilities
- Even in computationally intractable cases, the Bayes optimal classifier provides a standard of optimal decision against which other practical methods can be compared
42Bayes Theorem
P(h|D) = P(D|h)P(h) / P(D), where P(D|h) is also called the likelihood.
We are interested in finding the best (most probable) hypothesis from some space H, given the observed data D plus any initial knowledge about the prior probabilities of various hypotheses in H.
43Choosing Hypotheses
The maximum a posteriori (MAP) hypothesis: hMAP = argmax_h P(h|D) = argmax_h P(D|h)P(h)
44Choosing Hypotheses
If every hypothesis is equally probable a priori, this reduces to the maximum likelihood (ML) hypothesis: hML = argmax_h P(D|h)
45Example to Work on
What are our hypotheses in this case?
47Bayes Optimal Classifier / Naive Bayes Classifier
48Bayes Optimal Classifier
- Skip 6.5.
- So far we have considered the question:
- "What is the most probable hypothesis given the training data?"
- In fact, the question that is often of most significance is:
- "What is the most probable classification of the new instance given the training data?"
- Although it may seem that this second question can be answered by simply applying the MAP hypothesis to the new instance, in fact it is possible to do better.
49Bayes Optimal Classifier
The Bayes optimal classification of a new instance is argmax_{vj in V} Σ_{hi in H} P(vj|hi) P(hi|D).
50Bayes Optimal Classifier
No other classifier using the same hypothesis space and the same prior knowledge can outperform this method on average.
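A small code sketch of this weighted vote, using a hypothetical three-hypothesis posterior (the same kind of numbers Mitchell uses to show that the Bayes optimal classification can disagree with the MAP hypothesis):

```python
# Hypothetical posteriors over three hypotheses and their (deterministic) votes:
# the MAP hypothesis h1 says '+', yet the Bayes optimal classification is '-'.
posterior = {'h1': 0.4, 'h2': 0.3, 'h3': 0.3}
predicts  = {'h1': '+', 'h2': '-', 'h3': '-'}   # P(v | h) is 1 for the predicted label

def bayes_optimal(posterior, predicts, labels=('+', '-')):
    # score(v) = sum over hypotheses h of P(v | h) * P(h | D)
    scores = {v: sum(p for h, p in posterior.items() if predicts[h] == v) for v in labels}
    return max(scores, key=scores.get), scores

print(bayes_optimal(posterior, predicts))   # ('-', {'+': 0.4, '-': 0.6})
```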
51Gibbs Classifier (Opper and Haussler, 1991, 1994)
52Naive Bayes Classifier
53Naive Bayes Classifier
- But it is difficult (requires a lot of data) to estimate P(a1, a2, ..., an | vj)
- Naive Bayes assumption: the attributes are conditionally independent given the class, so P(a1, ..., an | vj) = Πi P(ai | vj), giving vNB = argmax_vj P(vj) Πi P(ai | vj)
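A minimal sketch of the resulting classifier, vNB = argmax_vj P(vj) Πi P(ai|vj), with probabilities estimated by simple counting (no m-estimate smoothing; the tiny dataset is hypothetical):

```python
from collections import Counter, defaultdict
import math

def train_naive_bayes(examples):
    """examples: list of (attribute_tuple, label) pairs."""
    label_counts = Counter(label for _, label in examples)
    cond_counts = defaultdict(Counter)     # (attr_index, label) -> Counter over attribute values
    for attrs, label in examples:
        for i, a in enumerate(attrs):
            cond_counts[(i, label)][a] += 1
    return label_counts, cond_counts

def classify(attrs, label_counts, cond_counts):
    """vNB = argmax_v P(v) * prod_i P(a_i | v), computed in log space."""
    n = sum(label_counts.values())
    best, best_logp = None, -math.inf
    for label, count in label_counts.items():
        logp = math.log(count / n)                       # log P(v)
        for i, a in enumerate(attrs):
            p = cond_counts[(i, label)][a] / count       # P(a_i | v) by counting
            logp += math.log(p) if p > 0 else -1e9       # unseen value: effectively zero
        if logp > best_logp:
            best, best_logp = label, logp
    return best

# Usage with a tiny hypothetical dataset; attributes are (outlook, wind).
data = [(('sunny', 'weak'), 'no'), (('rain', 'weak'), 'yes'),
        (('overcast', 'strong'), 'yes'), (('sunny', 'strong'), 'no')]
lc, cc = train_naive_bayes(data)
print(classify(('rain', 'strong'), lc, cc))   # 'yes'
```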
55Illustrative Example
56Illustrative Example
57Naive Bayes Subtleties
58Naive Bayes Subtleties
59End of class on Oct 28
60- In the following few sections, we will give some justification for some machine learning approaches using a Bayesian perspective:
- Concept learning vs the MAP hypothesis
- Justification for minimizing the sum-squared error
- Minimum Description Length principle
- Then we will go back to the more practical side by estimating the mean, variance and covariance from given data, so as to model a Gaussian distribution
- Finally, we will use this distribution (whose parameters are now known) to find the most likely class label.
61Concept Learning
62- What is the relation between Bayes' theorem, which allows us to compute posterior probabilities, and concept learning?
- Compare a Brute Force MAP algorithm to concept learning algorithms such as Candidate-Elimination and Find-S
- Brute Force MAP algorithm:
- Calculate the posterior probability of each hypothesis and output the one which is most likely.
- Computationally complex, but theoretically interesting
63- For the Brute Force MAP algorithm, we must specify P(h) and P(D|h)
- P(D) will be found from these two
- Let's choose them to be consistent with the following assumptions, which are also used in Find-S and Candidate Elimination:
- Training data D is noise free
- The target concept c is in the hypothesis space H
- Each hypothesis is equally probable (a priori)
64- Let's choose them to be consistent with our assumptions:
- Each hypothesis is equally probable (a priori) and the correct hypothesis is in H
- P(h) = 1/|H| for all h in H (they are equally likely AND sum to 1)
- Training data D is noise free
- P(D|h) is the probability of observing D, given h
- P(D|h) = 1 if di = h(xi) for all di in D (since we assume noise-free training data)
- 0 otherwise
- That is, P(D|h) = 1 if h is consistent with D
- 0 otherwise
65- P(h|D) = P(D|h)P(h) / P(D)
- For inconsistent hypotheses: P(h|D) = (0 · 1/|H|) / P(D) = 0
- For consistent hypotheses: P(h|D) = (1 · 1/|H|) / P(D) = 1/|VS_H,D|
- with P(D) = Σh P(D|h)P(h) = |VS_H,D| · 1/|H|, since only the consistent hypotheses contribute.
67- In short, P(h|D) under our assumed P(h) and P(D|h) is:
- P(h|D) = 1/|VS_H,D| if h is consistent with D
- 0 otherwise
- Every consistent hypothesis is a MAP hypothesis (since they all have equal posterior probability).
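A brute-force sketch on a toy hypothesis space (the threshold hypotheses and data below are hypothetical): every hypothesis consistent with D receives posterior 1/|VS_H,D| and the rest receive 0.

```python
# Toy hypothesis space: h_t(x) = 1 if x >= t, for thresholds t in {0, 1, ..., 5}.
hypotheses = {f"h_t={t}": (lambda x, t=t: int(x >= t)) for t in range(6)}

# Noise-free training data D as (x, label) pairs.
D = [(1, 0), (3, 1), (4, 1)]

consistent = [name for name, h in hypotheses.items()
              if all(h(x) == y for x, y in D)]

# P(h|D) = 1/|VS_H,D| for consistent h, 0 otherwise (uniform prior, noise-free data).
posterior = {name: (1 / len(consistent) if name in consistent else 0.0)
             for name in hypotheses}
print(posterior)   # h_t=2 and h_t=3 each get posterior 0.5; all others get 0
```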
68Generalizing to All Consistent Learners
- Consistent learner: one that outputs a hypothesis that commits zero error over the training data.
- Every consistent learner outputs a MAP hypothesis, if we assume equal priors and noise-free data.
- E.g. Find-S, Candidate Elimination
69Characterizing Learning Algorithms by Equivalent MAP Learners
Using the Bayesian framework, we can characterize the implicit assumptions (i.e. the probability distributions) under which an algorithm outputs the optimal (i.e. MAP) hypothesis. This is similar to determining the inductive bias of a learner.
70Evolution of Posterior Probabilities
The evolution of the probabilities associated with the hypotheses: as we gather more data (nothing, then sample D1, then sample D2), inconsistent hypotheses get 0 posterior probability and the consistent ones share the remaining probability (summing to 1). Here Di is used to indicate one training instance.
71Deriving hML in a regression example
72- Bayesian analysis will show that, under certain assumptions, any learning algorithm that minimizes the squared error between the hypothesis' predictions and the training data will output a Maximum Likelihood hypothesis
- This provides a Bayesian justification for methods such as NNs and curve fitting that use the squared error.
73Learning a Real Valued Function
Training examples are of the form (xi, di) with di = f(xi) + ei, where the noise ei is drawn from a Normal distribution with zero mean and standard deviation σ.
75Proof
Considering the probability of di given that h is the correct description of the target function, we use f(xi) = h(xi), so p(di|h) = (1/√(2πσ²)) exp(-(di - h(xi))²/(2σ²)). Maximizing the log of the product of these terms over all di then reduces to minimizing Σi (di - h(xi))², the sum of squared errors.
76From PRML Chp. 1
- We can justify the minimization of the squared error strictly from the curve-fitting perspective. This is very similar to the previous few slides, but makes some notations and concepts more explicit and gives a slightly different emphasis.
- The goal is to make a prediction for the target variable t given some new value of the input variable x, on the basis of a set of training data comprising N input values x = (x1, x2, ..., xN) and their corresponding target values t = (t1, t2, ..., tN).
- We first proceed by finding the coefficients w of the polynomial that maximize the likelihood of the data, and then we output y(x,w)
- (yw(x) may also be used, to indicate the dependence on w)
77- We can express our uncertainty over the value of the target variable using a probability distribution.
- For this, let's assume that given the value of x, the corresponding value of t has a Gaussian distribution with a mean equal to the value y(x,w).
In PRML, β = 1/σ² is used for consistency with future chapters.
78Maximum Likelihood
Use the training data {x, t} to determine the values of the unknown parameters w and β by the maximum likelihood criterion.
wML is determined by minimizing the sum-of-squares error, E(w) = ½ Σn (y(xn, w) - tn)².
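A sketch of this fit on synthetic data (the sine curve, noise level, and polynomial degree below are arbitrary choices, not from the slides): maximizing the Gaussian likelihood over w is exactly a least-squares fit, and 1/βML is the mean squared residual of that fit.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 30
x = np.linspace(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=N)   # targets with Gaussian noise

# Maximizing the Gaussian likelihood over w == minimizing the sum-of-squares error.
degree = 3
w_ml = np.polyfit(x, t, degree)           # least-squares polynomial coefficients
y = np.polyval(w_ml, x)                   # y(x, w_ML)

# ML estimate of the noise precision: 1/beta_ML = (1/N) * sum_n (y(x_n, w_ML) - t_n)^2
beta_ml = 1.0 / np.mean((y - t) ** 2)
print(w_ml, beta_ml)
```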
79Predictive Distribution
80MAP A Step towards Bayes SKIP
Determine wMAP by minimizing the regularized sum-of-squares error.
81Bayesian Curve Fitting SKIP
82Bayesian Predictive Distribution SKIP
83Minimum Description Length Principle
84Minimum Description Length Principle
- Remember Occam's razor, a popular inductive bias that prefers the shortest explanation for the observed data.
- Now we can give a Bayesian intuition to support it.
85Minimum Description Length Principle
This can be seen as a justification for preferring shorter hypotheses, assuming a particular representation scheme for encoding hypotheses and data.
Information Theory: In designing a code to transmit messages drawn at random, one should assign shorter codes to messages that are more probable, in order to minimize the expected transmission time/length; in fact the optimal code length for a message of probability pi is shown to be -log2 pi bits (Shannon 1949). The expected length for transmitting one message is -Σi pi log2 pi, which is the entropy of the set of possible messages.
- hMAP = argmax_h P(h|D)
- = argmax_h P(D|h)P(h) / P(D)
- = argmax_h [log2 P(D|h) + log2 P(h)]
- = argmin_h [-log2 P(D|h) - log2 P(h)]
86Reminder Entropy
- What if we have the following distribution for a random variable x?
- In order to save on transmission costs, we would design codes that reflect this distribution.
87Reminder Entropy
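A short sketch (the distribution below is hypothetical, chosen so the numbers come out round): the optimal code assigns about -log2 p(i) bits to symbol i, and the expected message length is the entropy H = -Σi p(i) log2 p(i).

```python
import numpy as np

# Hypothetical skewed distribution over 8 symbols; a uniform one would need 3 bits/symbol.
p = np.array([1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64])

code_lengths = -np.log2(p)             # optimal code length per symbol, in bits
entropy = np.sum(p * code_lengths)     # expected bits per transmitted symbol
print(code_lengths)                    # [1, 2, 3, 4, 6, 6, 6, 6]
print(entropy)                         # 2.0 bits, versus 3 bits for a uniform code
```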
88Minimum Description Length Principle
- Description length of message i with respect to code C: LC(i) = the number of bits required to encode message i using code C
- hMAP = argmin_h [-log2 P(D|h) - log2 P(h)]
- = argmin_h [L_CD|h(D|h) + L_CH(h)]
where CH is the optimal encoding for H and CD|h is the optimal encoding for D given h.
89- Prefer the hypothesis that minimizes
- length(h) + length(additional information to encode D given h)
- = length(h) + length(misclassifications)
- since we only need to send a message when a data sample is not in agreement with h, hence only for the misclassifications.
- E.g. encoding using decision trees and MDL
- Conclusions:
- If we choose the optimal encodings, hMDL = hMAP
- MDL-based methods perform similarly to standard pruning techniques.
90Expectations
91Expectations
The average value of a function f(x) under a probability distribution p(x) is called the expectation of f(x), i.e. the average is weighted by the relative probabilities of different values of x.
Approximate Expectation (discrete and continuous)
92Variance and Covariance
The variance of f(x) provides a measure of how much f(x) varies around its mean E[f(x)]: var[f] = E[(f(x) - E[f(x)])²]
The covariance of two random variables x and y measures the extent to which they vary together: cov[x, y] = E_{x,y}[(x - E[x])(y - E[y])]
For two vectors of random variables x and y, the covariance is a matrix.
93Multi-Variable and Conditional Expectations
E_x[f(x,y)] = Σ_x p(x) f(x,y) is the expectation of a function of two variables: the average of f(x,y) w.r.t. the distribution of x. The subscript indicates which variable is being averaged over.
Conditional Expectation (discrete): E_x[f|y] = Σ_x p(x|y) f(x)
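A brief sketch of these quantities estimated from samples (synthetic data; the distributions and the function f are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=1.0, size=100_000)
y = 0.5 * x + rng.normal(scale=0.3, size=x.size)       # correlated with x

f = lambda v: v ** 2
E_f    = np.mean(f(x))                                  # E[f] ~ (1/N) sum_n f(x_n)
var_f  = np.mean((f(x) - E_f) ** 2)                     # var[f] = E[(f - E[f])^2]
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))       # cov[x, y]

print(E_f, var_f, cov_xy)          # roughly 5.0, 18, and 0.5 for this setup
print(np.cov(np.vstack([x, y])))   # covariance matrix for the vector (x, y)
```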
94Normal Distribution
95The Gaussian Distribution
96Expectations
The average value of a function f(x) under a probability distribution p(x) is called the expectation of f(x), i.e. the average is weighted by the relative probabilities of different values of x.
Approximate Expectation (discrete and continuous)
For normally distributed x, E[x] = μ.
97Gaussian Mean and Variance
The variance of f(x) provides a measure of how much f(x) varies around its mean E[f(x)].
For normally distributed x, E[x] = μ and var[x] = σ².
The covariance of two random variables x and y measures the extent to which they vary together; for two vectors of random variables x and y, the covariance is a matrix.
98Normal Distribution / Multivariate Normal Distribution
- For a single variable, the normal density function is p(x) = (1/√(2πσ²)) exp(-(x - μ)²/(2σ²))
- For variables in higher dimensions, this generalizes to p(x) = (1/((2π)^{d/2} |Σ|^{1/2})) exp(-½ (x - μ)^T Σ^{-1} (x - μ))
- where the mean μ is now a d-dimensional vector, Σ is a d x d covariance matrix, and |Σ| is the determinant of Σ
99Decision Rules for the Normal Distribution
The general multivariate normal density is specified by a d-dimensional mean vector and a d-by-d covariance matrix.
100Multivariate Normal Distribution
101From the equation for the normal density, it is apparent that points which have the same density must have the same constant term (x - μ)^T Σ^{-1} (x - μ), the squared Mahalanobis distance, which measures the distance from x to μ in terms of Σ.
103Why Mahalanobis Distance
It takes into account the covariance of the data. E.g. point P is actually closer (in Euclidean distance) to the mean of the 'orange' class, but using the Mahalanobis distance it is found to be closer to the 'apple' class.
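A small numeric sketch of this effect (the two class Gaussians below are made up): the point is closer to the 'orange' mean in Euclidean distance, but closer to the 'apple' class once each class's covariance is taken into account.

```python
import numpy as np

def mahalanobis(x, mean, cov):
    d = x - mean
    return float(np.sqrt(d @ np.linalg.inv(cov) @ d))

# Hypothetical 'apple' class: widely spread; hypothetical 'orange' class: very tight.
mu_apple,  cov_apple  = np.array([0.0, 0.0]), np.array([[9.0, 0.0], [0.0, 9.0]])
mu_orange, cov_orange = np.array([5.0, 0.0]), np.array([[0.25, 0.0], [0.0, 0.25]])

p = np.array([3.0, 0.0])
print(np.linalg.norm(p - mu_apple), np.linalg.norm(p - mu_orange))   # 3.0 vs 2.0: Euclidean favors orange
print(mahalanobis(p, mu_apple, cov_apple),                           # 1.0
      mahalanobis(p, mu_orange, cov_orange))                         # 4.0: Mahalanobis favors apple
```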
104The Multivariate Gaussian
Mahalanobis Distance
Contours of equal density
105Contours of constant probability density for a 2D Gaussian distribution with a) a general covariance matrix, b) a diagonal covariance matrix, c) Σ proportional to the identity matrix
106Covariance Matrices