1
  • Bayesian Learning
  • Machine Learning by Mitchell-Chp. 6
  • Neural Networks for Pattern Recognition by Bishop
    Chp. 1
  • Berrin Yanikoglu
  • Oct 2009

2
Basic Probability
3
Probability Theory
  • Marginal Probability
  • Conditional Probability

Joint Probability
4
Probability Theory
  • Marginal Probability
  • Conditional Probability

Joint Probability
5
(No Transcript)
6
Probability Theory
  • Sum Rule

Product Rule
7
The Rules of Probability
  • Sum Rule
  • Product Rule

8
Probability - Basics
Note that when the events are mutually exclusive, the
last term vanishes and the RHS is simply the sum.
9
Independence
  • If P(X,Y) = P(X)P(Y), the random variables X and Y
    are said to be independent.
  • Equivalently, P(X|Y) = P(X)

10
Bayes Theorem
posterior ∝ likelihood × prior
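As a small illustration of the theorem, the sketch below computes the posterior for two classes from an assumed prior and assumed likelihood values; all numbers are made up for illustration.

```python
# A minimal numerical sketch of Bayes' theorem (illustrative numbers only).
# posterior P(C|x) = likelihood P(x|C) * prior P(C) / evidence P(x)

priors = {"C1": 0.6, "C2": 0.4}        # P(Ck), assumed
likelihoods = {"C1": 0.2, "C2": 0.5}   # P(x|Ck) for one observed x, assumed

evidence = sum(likelihoods[c] * priors[c] for c in priors)            # P(x)
posteriors = {c: likelihoods[c] * priors[c] / evidence for c in priors}

print(posteriors)   # {'C1': 0.375, 'C2': 0.625}
```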
11
Bayesian Decision Theory
12
  • Imagine that your task is to classify a's (C1)
    from b's (C2)
  • How would you decide if you had to decide without
    seeing a new instance?
  • Choose C1 if P(C1) > P(C2)  // prior probabilities
  • Choose C2 otherwise

13
2) How about if you have one measured feature X
about your instance?
14
Definition of probabilities based on frequencies:
P(C1, X=x) = (num. samples in corresponding box) /
(num. all samples)  // joint probability of C1 and X
P(X=x|C1) = (num. samples in corresponding box) /
(num. of samples in C1-row)  // class-conditional probability of X
P(C1) = (num. of samples in C1-row) /
(num. all samples)  // prior probability of C1
P(C1, X=x) = P(X=x|C1) P(C1)  // Bayes Thm.
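A minimal sketch of these frequency-based estimates, assuming a hypothetical 2x2 table of class/attribute counts; it also checks the product-rule identity in the last line above.

```python
import numpy as np

# Hypothetical count table: rows = classes (C1, C2), columns = values of X.
# counts[i, j] = number of training samples with class Ci and X = x_j.
counts = np.array([[8, 2],    # C1
                   [3, 7]])   # C2
total = counts.sum()

joint = counts / total                                            # P(Ci, X = x_j)
prior = counts.sum(axis=1) / total                                # P(Ci)
class_conditional = counts / counts.sum(axis=1, keepdims=True)    # P(X = x_j | Ci)

# Check: P(C1, X = x_0) == P(X = x_0 | C1) * P(C1)
assert np.isclose(joint[0, 0], class_conditional[0, 0] * prior[0])
```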
15
Histogram representation better highlights the
decision problem.
16
  • You would minimize misclassification errors if
    you choose the class that has the maximum
    posterior probability:
  • Choose C1 if p(C1|X=x) > p(C2|X=x)
  • Choose C2 otherwise
  • Equivalently, since p(C1|X=x) = p(X=x|C1)P(C1)/P(X=x):
  • Choose C1 if p(X=x|C1)P(C1) > p(X=x|C2)P(C2)
  • Choose C2 otherwise
  • Notice that both p(X=x|C1) and P(C1) are easy to
    compute.

17
Posterior Probability Distribution
18
Probability Densities
Cumulative Probability
19
Continuous valued attributes
  • P(x ∈ [a, b]) = 1 if the interval [a, b]
    corresponds to the whole of X-space.
  • Note that to be proper, we use upper-case letters
    for probabilities and lower-case letters for
    probability densities, but this is not always
    followed.
  • For continuous variables, the class-conditional
    probabilities introduced above become
    class-conditional probability density functions,
    which we write in the form p(x|Ck).

20
Multiple attributes
  • If there are d variables/attributes x1,...,xd, we
    may group them into a vector x = [x1,...,xd]^T
    corresponding to a point in a d-dimensional
    space.
  • The distribution of values of x can be described
    by a probability density function p(x), such that
    the probability of x lying in a region R of the
    d-dimensional space is given by

21
Bayes Thm. in General
  • The prior probabilities can be combined with the
    class-conditional densities to give the posterior
    probabilities P(Ck|x) using Bayes' theorem:
  • Note that you can show (and generalize to k
    classes)

22
Decision Regions
  • In general, assign a feature x to Ck if
    Ck = argmax_j P(Cj|x)
  • Equivalently, assign a feature x to Ck if
  • This generates c decision regions R1,...,Rc such that
    a point falling in region Rk is assigned to class
    Ck.
  • Note that each of these regions need not be
    contiguous, but may itself be divided into
    several disjoint regions all of which are
    associated with the same class.
  • The boundaries between these regions are known as
    decision surfaces or decision boundaries.

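The sketch below illustrates the argmax decision rule for K = 3 classes, using an assumed prior and assumed 1-D Gaussian class-conditional densities; all parameter values are hypothetical.

```python
import numpy as np

# Sketch of the decision rule: assign x to C_k where k = argmax_j P(C_j | x).
priors = np.array([0.5, 0.3, 0.2])      # P(C_j), assumed
means = np.array([-1.0, 0.0, 2.0])      # class-conditional Gaussian means, assumed
stds = np.array([1.0, 0.5, 1.5])        # class-conditional Gaussian std devs, assumed

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def decide(x):
    # p(x | C_j) * P(C_j) is proportional to the posterior, so argmax is the same.
    return int(np.argmax(gaussian_pdf(x, means, stds) * priors))

print(decide(0.1))   # index of the class with the largest posterior at x = 0.1
```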

23
Probability of Error
  • For two regions R1 and R2 (you can generalize):

Not ideal decision boundary!
24
Justification for the Decision Criteria based on
max. posterior probability
25
Minimum Misclassification Rate
26
Justification for the Decision Criteria based on
max. posterior probability
  • For the more general case of K classes, it is
    slightly easier to maximize the probability of
    being correct

27
Expected Value
  • The expected value of a function f(x), where x
    has the probability density p(x), is
  • Discrete: E[f] = Σ_x p(x) f(x)
    Continuous: E[f] = ∫ p(x) f(x) dx
  • For a finite set of data points x1, ..., xn drawn
    from the distribution p(x), the expectation can be
    approximated by the average over the data points:
    E[f] ≈ (1/N) Σ_n f(xn)

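A small sketch of the approximation above: the expectation of f(x) = x^2 under a standard normal p(x) is estimated by averaging f over samples (the exact value is 1).

```python
import numpy as np

# Approximate E[f(x)] by the average of f over samples drawn from p(x).
rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=100_000)   # samples from p(x) = N(0, 1)

estimate = np.mean(samples ** 2)     # (1/N) * sum_n f(x_n) with f(x) = x^2
print(estimate)                      # close to the exact value E[x^2] = 1
```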
28
Minimum Expected Loss/Risk
  • We define a loss matrix with elements Lkj
    specifying the penalty associated with assigning
    a pattern to class Cj when in fact it belongs to
    class Ck.
  • Example: classify medical images as 'cancer' or
    'normal'

Decision
Truth
29
Minimum Expected Loss/Risk
30
Minimum Expected Loss/Risk
  • We define a loss matrix with elements Lkj
    specifying the penalty associated with assigning
    a pattern to class Cj when in fact it belongs to
    class Ck. (k-to-j)
  • Consider all the patterns x which belong to class
    Ck. The expected loss for only those patterns is
    given by
  • Overall expected loss/risk

Risk associated with instances from class k
31
Minimizing Expected Risk
  • This risk is minimized if the integrand is
    minimized at each point x, that is, if the regions
    Rj are chosen such that x ∈ Rj when
  • This is the full generalization of the simple rule
    for minimizing the number of misclassifications.

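A minimal sketch of the minimum-risk rule with a cancer/normal loss matrix as in the example above; the loss values and posteriors are illustrative only.

```python
import numpy as np

# Loss matrix L[k, j] = loss of deciding C_j when the truth is C_k (made-up values).
loss = np.array([[0.0, 1000.0],   # truth = cancer:  correct decision / missed cancer
                 [1.0,    0.0]])  # truth = normal:  false alarm      / correct decision

def min_risk_decision(posteriors):
    """posteriors[k] = P(C_k | x); choose the decision j with the smallest expected loss."""
    expected_loss = posteriors @ loss      # R(C_j | x) = sum_k L[k, j] P(C_k | x)
    return int(np.argmin(expected_loss))

# Even with a small posterior probability of cancer, the large penalty for a
# missed cancer tips the decision towards the 'cancer' class (index 0).
print(min_risk_decision(np.array([0.05, 0.95])))   # -> 0
```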
32
End of class on Oct 22
33
Reject Option
34
Discriminant Functions
  • Although we have focused on probability
    distribution functions, the decision on class
    membership in our classifiers has been based
    solely on the relative sizes of the
    probabilities.
  • This observation allows us to reformulate the
    classification process in terms of a set of
    discriminant functions y1(x),...., yc(x) such
    that an input vector x is assigned to class Ck
    if
  • We can recast the decision rule for minimizing
    the probability of misclassification in terms of
    discriminant functions, by choosing

35
Discriminant Functions
We can use any monotonic function of yk(x) that
would simplify calculations, since a monotonic
transformation does not change the order of the yk's.
36
  • In fact, we can categorize three fundamental
    approaches to classification:
  • Generative models: Model p(x|Ck) and P(Ck)
    separately and use Bayes' theorem to find the
    posterior probabilities P(Ck|x)
  • E.g. Naive Bayes, Gaussian Mixture Models, Hidden
    Markov Models, ...
  • Discriminative models:
  • Determine P(Ck|x) directly and use it in the decision
  • E.g. Linear discriminant analysis, SVMs, NNs, ...
  • Discriminant functions:
  • Find a discriminant function f that maps x onto a
    class label directly without calculating
    probabilities
  • Advantages? Disadvantages?

37
Generative vs Discriminative Model Complexities
38
Why Separate Inference and Decision?
  • Having probabilities is useful:
  • Minimizing risk (the loss matrix may change over
    time)
  • If we only have a discriminant function, any
    change in the loss function would require
    re-training
  • Reject option
  • Posterior probabilities allow us to determine a
    rejection criterion that will minimize the
    misclassification rate (or more generally the
    expected loss) for a given fraction of rejected
    data points
  • Unbalanced class priors
  • Artificially balanced data:
  • After training, we can divide the obtained
    posteriors by the class fractions in the data set
    and multiply with the class fractions for the true
    population
  • Combining models
  • We may wish to break a complex problem into
    smaller subproblems
  • E.g. blood tests, X-rays, ...
  • As long as each model gives posteriors for each
    class, we can combine the outputs using the rules
    of probability. How?

39
Mitchell Chp.6
  • Maximum Likelihood (ML)
  • Maximum A Posteriori (MAP)
  • Hypotheses

40
Advantages of Bayesian Learning
  • Bayesian approaches, including the Naive Bayes
    classifier, are among the most common and
    practical ones in machine learning
  • Bayesian methods provide a useful perspective for
    understanding many learning algorithms that do
    not manipulate probabilities

41
Features of Bayesian Learning
  • Each observed training example can incrementally
    decrease or increase the estimated probability of
    a hypothesis, rather than completely eliminating
    a hypothesis if it is found to be inconsistent
    with a single example
  • Prior knowledge can be combined with observed
    data to determine the final probability of a
    hypothesis
  • New instances can be classified by combining the
    predictions of multiple hypotheses
  • Even in computationally intractable cases, the
    Bayes optimal classifier provides a standard
    of optimal decision against which other
    practical methods can be compared

42
Bayes Theorem
P(D|h) is also called the likelihood.
We are interested in finding the best (most probable)
hypothesis from some space H, given the observed
data D, together with any initial knowledge about
the prior probabilities of various hypotheses in H.
43
Choosing Hypotheses
44
Choosing Hypotheses
45
Example to Work on
What are our hypotheses in this case?
46
(No Transcript)
47
Bayes Optimal Classifier
Naive Bayes Classifier
  • Mitchell 6.7-6.9

48
Bayes Optimal Classifier
  • Skip 6.5.
  • So far we have considered the question:
  • "What is the most probable hypothesis given the
    training data?"
  • In fact, the question that is often of most
    significance is:
  • "What is the most probable classification of the
    new instance given the training data?"
  • Although it may seem that this second question
    can be answered by simply applying the MAP
    hypothesis to the new instance, in fact it is
    possible to do better.

49
Bayes Optimal Classifier
50
Bayes Optimal Classifier
No other classifier using the same hypothesis
space and same prior knowledge can outperform
this method on average
51
Gibbs Classifier (Opper and Haussler, 1991, 1994)
52
Naive Bayes Classifier
53
Naive Bayes Classifier
  • But it is difficult (requires a lot of data) to
    estimate
  • P(a1, a2, ..., an | vj)
  • Naive Bayes assumption:
    P(a1, a2, ..., an | vj) = Π_i P(ai | vj)

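Below is a minimal sketch of a Naive Bayes classifier built on this independence assumption, with Laplace smoothing and a tiny made-up data set; it is meant only to illustrate the counting and the product (here, sum of logs) of per-attribute probabilities.

```python
from collections import defaultdict
import math

def train(examples):
    """examples: list of (attribute_tuple, label)."""
    class_counts = defaultdict(int)
    attr_counts = defaultdict(lambda: defaultdict(int))   # attr_counts[label][(i, value)]
    for attrs, label in examples:
        class_counts[label] += 1
        for i, value in enumerate(attrs):
            attr_counts[label][(i, value)] += 1
    return class_counts, attr_counts

def predict(attrs, class_counts, attr_counts):
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)                    # log P(vj)
        for i, value in enumerate(attrs):
            num = attr_counts[label][(i, value)] + 1       # Laplace smoothing
            score += math.log(num / (count + 2))           # assumes binary attributes
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training data: (outlook, temperature) -> play?
data = [(("sunny", "hot"), "no"), (("rain", "cool"), "yes"), (("rain", "hot"), "yes")]
model = train(data)
print(predict(("rain", "hot"), *model))    # -> 'yes'
```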
54
(No Transcript)
55
Illustrative Example
56
Illustrative Example
57
Naive Bayes Subtleties
58
Naive Bayes Subtleties
59
End of class on Oct 28
60
  • In the following few sections, we will give some
    justification for several machine learning approaches
    from a Bayesian perspective:
  • Concept learning vs MAP hypothesis
  • Justification for minimizing the sum squared
    error
  • Minimum Description Length principle
  • Then we will go back to the more practical side,
    estimating the mean, variance and covariance
    from given data, so as to model a
    Gaussian distribution
  • Finally, we will use this distribution
    (whose parameters are now known) to find the
    most likely class label.

61
Concept Learning
  • Mitchell 6.3

62
  • What is the relation between Bayes' thm., which allows
    us to compute posterior probabilities, and concept
    learning?
  • Compare a Brute-Force MAP algorithm to concept
    learning algorithms such as Candidate-Elimination
    and Find-S
  • Brute-Force MAP algorithm:
  • Calculate the posterior probability of each
    hypothesis and output the one which is most
    likely.
  • Computationally complex, but theoretically
    interesting

63
  • For the Brute-Force MAP algorithm, we must
    specify P(h) and P(D|h)
  • P(D) will be found from these two
  • Let's choose them to be consistent with the
    following assumptions that are also used in
    Find-S and Candidate Elimination:
  • Training data D is noise free
  • The target concept c is in the hypothesis space H
  • Each hypothesis is equally probable (a priori)

64
  • Let's choose them to be consistent with our
    assumptions:
  • Each hypothesis is equally probable (a priori)
    and the correct hypothesis is in H:
  • P(h) = 1/|H| for all h in H (they are equally
    likely AND sum to 1)
  • Training data D is noise free:
  • P(D|h) is the probability of observing D, given h
  • P(D|h) = 1 if di = h(xi) for all di in D (since we
    assume noise-free training data)
  • 0 otherwise
  • i.e., P(D|h) = 1 if h is consistent with D,
    0 otherwise

65
  • P(h|D) = P(D|h) P(h) / P(D)

  • For inconsistent hypotheses:
    P(h|D) = 0 · (1/|H|) / P(D) = 0

  • For consistent hypotheses:
    P(h|D) = 1 · (1/|H|) / P(D) = 1 / |VS_H,D|

  • with P(D) = |VS_H,D| / |H|, as shown in the
    following slide.

66
(No Transcript)
67
  • In short, P(h|D) under our assumed P(h) and
    P(D|h) is:
  • P(h|D) = 1/|VS_H,D| if h is consistent with D,
    0 otherwise
  • Every consistent hypothesis is a MAP hypothesis
    (since they all have equal posterior
    probability)

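A toy sketch of the Brute-Force MAP learner under these assumptions, using a hypothetical 4-hypothesis space over a single boolean attribute and noise-free data: every hypothesis consistent with D receives posterior 1/|VS_H,D|, and the rest receive 0.

```python
# Hypotheses: all boolean functions of a single attribute x in {0, 1} (toy space).
hypotheses = [lambda x: 0, lambda x: 1, lambda x: x, lambda x: 1 - x]

D = [(0, 0), (1, 1)]                      # noise-free examples (x, c(x))

prior = 1.0 / len(hypotheses)             # P(h) = 1/|H|
likelihood = [1.0 if all(h(x) == y for x, y in D) else 0.0 for h in hypotheses]
evidence = sum(l * prior for l in likelihood)        # P(D) = |VS_H,D| / |H|

posterior = [l * prior / evidence for l in likelihood]
print(posterior)    # consistent hypotheses share the mass equally: here [0, 0, 1.0, 0]
```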
68
Generalizing to All Consistent Learners
  • Consistent learner: one that outputs a hypothesis
    that commits zero error over the training data.
  • Every consistent learner outputs a MAP
    hypothesis, if we assume equal priors and
    noise-free data.
  • E.g. Find-S, Candidate Elimination

69
Characterizing Learning Algorithms by Equivalent
MAP Learners
Using the Bayesian framework, we can characterize
the implicit assumptions (i.e. the probability
distributions) under which the algorithm outputs an
optimal (i.e. MAP) hypothesis. This is similar
to determining the inductive bias of a learner.
70
Evolution of Posterior Probabilities
The evolution of the probabilities associated
with the hypotheses: as we gather more data
(nothing, then sample D1, then sample D2),
inconsistent hypotheses get 0 posterior
probability and the consistent ones share the
remaining probability (summing to 1). Here
Di is used to indicate one training instance.
71
Deriving hML in a regression example
  • Mitchell 6.4

72
  • Bayesian analysis will show that, under certain
    assumptions, any learning algorithm that
    minimizes the squared error between the output
    hypothesis predictions and the training data will
    output a maximum likelihood hypothesis.
  • This provides a Bayesian justification for methods
    such as NNs and curve fitting that use the squared
    error.

73
Learning a Real-Valued Function
The training values di are assumed to be the target
values f(xi) corrupted by Gaussian noise with zero mean
and standard deviation σ.
74
(No Transcript)
75
Proof
Considering the probability of di given that h is
the correct description of the target function,
we use f(xi) = h(xi).
76
From PRML Chp. 1
  • We can justify the minimization of the squared
    error strictly from the curve fitting
    perspective. This is very similar to the previous
    few slides, but makes some notations and concepts
    more explicit and gives a slightly different
    emphasis.
  • The goal is to make a prediction for the target
    variable t given some new value of the input
    variable x, on the basis of a set of training data
    comprising N input values x = (x1, x2, ..., xN) and
    their corresponding target values t = (t1, t2, ..., tN).
  • We first proceed by finding the coefficients w of
    the polynomial that maximize the likelihood of
    the data, and then we output y(x,w)
  • (yw(x) may also be used, to indicate the
    dependence on w)

77
  • We can express our uncertainty over the value of
    the target variable using a probability
    distribution.
  • For this, let's assume that given the value of x,
    the corresponding value of t has a Gaussian
    distribution with a mean equal to the value
    y(x,w).

In PRML, the precision β = 1/σ² is used, for
consistency with future chapters.
78
Maximum Likelihood
Use the training data {x, t} to determine the
values of the unknown parameters w and β by
the maximum likelihood criterion.
Determine w_ML by minimizing the sum-of-squares
error.
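As a sketch of this equivalence, the code below fits a polynomial by ordinary least squares (the maximum likelihood solution for w under Gaussian noise of fixed variance) and then estimates the precision β from the residuals; the data are synthetic and the polynomial degree is an arbitrary choice.

```python
import numpy as np

# Synthetic curve-fitting data: a noisy sine, as in the PRML example setting.
rng = np.random.default_rng(1)
x = np.linspace(0, 1, 20)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

w_ml = np.polyfit(x, t, deg=3)            # minimizes sum_n (y(x_n, w) - t_n)^2
beta_ml = 1.0 / np.mean((np.polyval(w_ml, x) - t) ** 2)   # 1/beta_ML = mean squared residual

print(w_ml, beta_ml)
```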
79
Predictive Distribution
80
MAP: A Step towards Bayes (SKIP)
Determine w_MAP by minimizing the regularized
sum-of-squares error.
81
Bayesian Curve Fitting (SKIP)
82
Bayesian Predictive Distribution (SKIP)
83
Minimum Description Length Principle
  • Mitchell 6.6
  • Skip 6.5

84
Minimum Description Length Principle
  • Remember Occam's razor, a popular inductive bias
    that prefers the shortest explanation for the
    observed data.
  • Now we can give a Bayesian intuition to support
    it.

85
Minimum Description Length Principle
This can be seen as a justification for
preferring shorter hypotheses, assuming a
particular representation scheme for encoding
hypotheses and data.

Information Theory: in designing a code to transmit
messages drawn at random, one should assign shorter
codes to messages that are more probable, in order to
minimize the expected transmission time/length; in fact
the optimal code length is shown to be -log2 pi bits
(Shannon 1949). The expected length for
transmitting one message is -Σi pi log2 pi, which
is the entropy of the set of possible messages.
  • hMAP = argmax_h P(h|D)
  •      = argmax_h P(D|h) P(h) / P(D)
  •      = argmax_h [ log2 P(D|h) + log2 P(h) ]
  •      = argmin_h [ -log2 P(D|h) - log2 P(h) ]

86
Reminder Entropy
  • What if we have the following distribution for a
    random variable x?
  • In order to save on transmission costs, we would
    design codes that
  • reflect this distribution

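A small sketch of this idea: for an assumed discrete distribution, the optimal per-message code lengths are -log2(p_i) bits and their expected value is the entropy of the distribution.

```python
import numpy as np

# Made-up message probabilities (dyadic, so the optimal lengths are whole bits).
p = np.array([0.5, 0.25, 0.125, 0.125])

code_lengths = -np.log2(p)               # optimal lengths: 1, 2, 3, 3 bits
entropy = np.sum(p * code_lengths)       # H = -sum_i p_i log2 p_i

print(code_lengths, entropy)             # expected code length = entropy = 1.75 bits
```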
87
Reminder Entropy
88
Minimum Description Length Principle
  • Description length of message i with respect to
    code C:
  • LC(i) = the number of bits required to encode
    message i using code C
  • hMAP = argmin_h [ -log2 P(D|h) - log2 P(h) ]
  •      = argmin_h [ L_CH(h) + L_CD|h(D|h) ]

where CH is the optimal encoding for H and CD|h is the
optimal encoding for D given h.
89
  • Prefer the hypothesis that minimizes
  • length(h) + length(additional information to
    encode D given h)
  • = length(h) + length(misclassifications)
  • since we only need to send a message when a
    data sample is not in agreement with h; hence,
    only for misclassifications.
  • E.g. encoding using decision trees and MDL
  • Conclusions:
  • If we choose the optimal encodings, hMDL = hMAP
  • MDL-based methods perform similarly to standard
    pruning techniques.

90
Expectations
91
Expectations
The average value of a function f(x) under a
probability distribution p(x) is called the
expectation of f(x), i.e. the average is weighted by
the relative probabilities of different values of
x.
Approximate Expectation (discrete and continuous)
92
Variance and Covariance
The variance of f(x) provides a measure of how
much f(x) varies around its mean E[f(x)].
The covariance of two random variables x and y
measures the extent to which they vary together.
For two random variables x, y: the extent to which
x and y vary together; for two vectors of random
variables x, y: the covariance is a matrix.
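A minimal numerical sketch of these definitions on synthetic data (the relationship between x and y below is made up so that they co-vary).

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=1000)
y = 0.8 * x + rng.normal(scale=0.5, size=1000)    # y co-varies with x

var_x = np.mean((x - x.mean()) ** 2)              # var[x] = E[(x - E[x])^2]
cov_xy = np.mean((x - x.mean()) * (y - y.mean())) # cov[x, y]
cov_matrix = np.cov(np.vstack([x, y]))            # 2 x 2 covariance matrix for the vector (x, y)

print(var_x, cov_xy)
print(cov_matrix)
```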
93
MultiVariable and Conditional Expectations
E_x[f(x,y)] = Σ_x p(x) f(x,y):
the expectation of a function of two variables,
i.e. the average of f(x,y) w.r.t. the distribution of x.
The subscript indicates which variable is being
averaged over.
Conditional Expectation (discrete)
94
Normal Distribution
95
The Gaussian Distribution
96
Expectations
The average value of a function f(x) under a
probability distribution p(x) is called the
expectation of f(x), i.e. the average is weighted by
the relative probabilities of different values of
x.
Approximate Expectation (discrete and continuous)
For normally distributed x
97
Gaussian Mean and Variance
The variance of f(x) provides a measure of how
much f(x) varies around its mean E[f(x)].
For normally distributed x:
The covariance of two random variables x and y
measures the extent to which they vary together.
For two random variables x, y: the extent to which
x and y vary together; for two vectors of random
variables x, y: the covariance is a matrix.
98
Normal Distribution
Multivariate Normal Distribution
  • For a single variable, the normal density
    function is
  • For variables in higher dimensions, this
    generalizes to
  • where the mean µ is now a d-dimensional vector,
    Σ is a d x d covariance matrix, and
    |Σ| is the determinant of Σ

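The sketch below evaluates the d-dimensional Gaussian density directly from its formula, using an assumed mean vector and covariance matrix.

```python
import numpy as np

# N(x | mu, Sigma) = (2*pi)^(-d/2) |Sigma|^(-1/2) exp(-0.5 (x-mu)^T Sigma^{-1} (x-mu))
def multivariate_normal_pdf(x, mu, sigma):
    d = len(mu)
    diff = x - mu
    norm_const = (2 * np.pi) ** (-d / 2) * np.linalg.det(sigma) ** (-0.5)
    return norm_const * np.exp(-0.5 * diff @ np.linalg.solve(sigma, diff))

mu = np.array([0.0, 0.0])                 # assumed mean vector
sigma = np.array([[2.0, 0.3],             # assumed covariance matrix
                  [0.3, 1.0]])
print(multivariate_normal_pdf(np.array([1.0, -0.5]), mu, sigma))
```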
99
Decision Rules for the Normal Distribution
The general multivariate normal density is given
by a d-dimensional mean vector and a d-by-d
covariance matrix
100
Multivariate Normal Distribution
101
From the equation for the normal density, it is
apparent that points which have the same density
must have the same value of the constant term
(x - µ)^T Σ^{-1} (x - µ), which measures
the distance from x to µ in terms of Σ.
102
(No Transcript)
103
Why Mahalanobis Distance
It takes into account the covariance of the data.
E.g. point P is actually closer (in Euclidean distance)
to the mean of the 'orange' class, but using the
Mahalanobis distance it is found to be closer to
the 'apple' class.
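A small numerical sketch of this point-P effect, with made-up class means and covariances: the point is closer to the 'orange' mean in Euclidean distance, but closer to the 'apple' class once the class covariances are taken into account.

```python
import numpy as np

def mahalanobis(x, mu, sigma):
    diff = x - mu
    return np.sqrt(diff @ np.linalg.solve(sigma, diff))   # sqrt((x-mu)^T Sigma^{-1} (x-mu))

# Hypothetical classes: 'apple' has a wide spread, 'orange' a tight one.
mu_apple, sigma_apple = np.array([0.0, 0.0]), np.array([[4.0, 0.0], [0.0, 4.0]])
mu_orange, sigma_orange = np.array([3.0, 0.0]), np.array([[0.2, 0.0], [0.0, 0.2]])
p = np.array([2.0, 0.0])

print(np.linalg.norm(p - mu_apple), np.linalg.norm(p - mu_orange))   # Euclidean: 2.0 vs 1.0
print(mahalanobis(p, mu_apple, sigma_apple),                         # 1.0
      mahalanobis(p, mu_orange, sigma_orange))                       # ~2.24
```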
104
The Multivariate Gaussian
Mahalanobis Distance
Contours of equal density
105
Contours of constant probability density for a 2D
Gaussian distribution with a) a general covariance
matrix, b) a diagonal covariance matrix, c) Σ
proportional to the identity matrix.
106
Covariance Matrices