1
CS 570 Artificial Intelligence
Chapter 20. Bayesian Learning
  • Jahwan Kim
  • Dept. of CS, KAIST


2
Contents
  • Bayesian Learning
  • Bayesian inference
  • MAP and ML
  • Naïve Bayes method
  • Bayesian network
  • Parameter Learning
  • Examples
  • Regression and LMS
  • EM Algorithm
  • Algorithm
  • Mixture of Gaussians

3
Bayesian Learning
  • Let h1, …, hn be the possible hypotheses.
  • Let d = (d1, …, dn) be the observed data.
  • Often (in fact, almost always) an i.i.d. assumption is made.
  • Let X denote the prediction.
  • In Bayesian Learning,
  • Compute the probability of each hypothesis given the data, and make predictions on that basis.
  • Predictions are made using all hypotheses.
  • Learning in the Bayesian setting is thus reduced to probabilistic inference.

4
Bayesian Learning
  • The probability that the prediction is X, given the observed data d, is
  • P(X|d) = Σi P(X|d, hi) P(hi|d) = Σi P(X|hi) P(hi|d)
  • The prediction is a weighted average over the predictions of the individual hypotheses.
  • Hypotheses are intermediaries between the data and the predictions.
  • This requires computing P(hi|d) for all i, which is usually intractable (a small numeric sketch follows).
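The following minimal Python sketch illustrates the weighted-average prediction above; the hypothesis priors, likelihoods, and per-hypothesis predictions are made-up placeholder numbers, not values from the slides.

```python
# Full Bayesian prediction: P(X|d) = sum_i P(X|h_i) P(h_i|d).
# All numbers below are illustrative placeholders.

def posterior_over_hypotheses(priors, likelihoods):
    """P(h_i|d) proportional to P(d|h_i) P(h_i), normalized to sum to 1."""
    unnorm = [p * l for p, l in zip(priors, likelihoods)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def bayes_predict(pred_given_h, priors, likelihoods):
    """Weighted average of the per-hypothesis predictions P(X|h_i)."""
    post = posterior_over_hypotheses(priors, likelihoods)
    return sum(px * ph for px, ph in zip(pred_given_h, post))

priors       = [1/3, 1/3, 1/3]   # P(h_i)
likelihoods  = [0.1, 0.2, 0.7]   # P(d|h_i)
pred_given_h = [0.9, 0.5, 0.1]   # P(X|h_i)
print(bayes_predict(pred_given_h, priors, likelihoods))
```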

5
Bayesian Learning Basics: Terms
  • P(hi|d) is called the posterior (or a posteriori) probability.
  • Using Bayes' rule, P(hi|d) ∝ P(d|hi) P(hi).
  • P(hi) is called the (hypothesis) prior.
  • We can embed knowledge by means of the prior.
  • It also controls the complexity of the model.
  • P(d|hi) is called the likelihood of the data.
  • Under the i.i.d. assumption, P(d|hi) = Πj P(dj|hi).
  • Let hMAP be the hypothesis for which the posterior probability P(hi|d) is maximal. It is called the maximum a posteriori (or MAP) hypothesis.

6
Bayesian Learning Basics: MAP Approximation
  • Since calculating the exact probability is often impractical, we approximate using the MAP hypothesis. That is,
  • P(X|d) ≈ P(X|hMAP).
  • MAP is often easier than the full Bayesian method, because instead of a large summation (or integration), only an optimization problem has to be solved.

7
Bayesian Learning Basics: MDL Principle
  • Since P(hi|d) ∝ P(d|hi) P(hi), instead of maximizing P(hi|d) we may maximize P(d|hi) P(hi).
  • Equivalently, we may minimize
  • −log [P(d|hi) P(hi)] = −log P(d|hi) − log P(hi).
  • We can interpret this as choosing the hi that minimizes the number of bits required to encode the hypothesis hi and the data d under that hypothesis.
  • The principle of minimizing code length (under some pre-determined coding scheme) is called the minimum description length (or MDL) principle.
  • MDL is used in a wide range of practical machine learning applications (a toy description-length comparison follows).
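A small illustration of the MDL trade-off, assuming made-up likelihood and prior values for two hypothetical hypotheses; it just evaluates −log2 P(d|h) − log2 P(h) in bits.

```python
import math

# Description length in bits: L(h, d) = -log2 P(d|h) - log2 P(h).
def description_length(likelihood, prior):
    return -math.log2(likelihood) - math.log2(prior)

# A simple hypothesis fits the data worse but is itself cheap to encode;
# a complex hypothesis fits better but costs more bits. Numbers are made up.
dl_simple  = description_length(likelihood=0.01, prior=0.5)    # about 7.6 bits
dl_complex = description_length(likelihood=0.20, prior=0.01)   # about 9.0 bits
print(dl_simple, dl_complex)   # MDL picks the smaller total: the simple one here
```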

8
Bayesian Learning Basics: Maximum Likelihood
  • Assume furthermore that the priors P(hi) are all equal, i.e., assume a uniform prior.
  • This is a reasonable approach when there is no reason to prefer one hypothesis over another a priori.
  • In that case, to obtain the MAP hypothesis it suffices to maximize the likelihood P(d|hi). Such a hypothesis is called the maximum likelihood hypothesis hML.
  • In other words,
  • MAP + uniform prior ⇔ ML

9
Bayesian Learning Basics: Candy Example
  • Two flavors of candy, cherry and lime.
  • Each piece of candy is wrapped in the same opaque wrapper.
  • Candy is sold in very large bags, of which there are known to be five kinds: h1 = 100% cherry, h2 = 75% cherry + 25% lime, h3 = 50% cherry + 50% lime, h4 = 25% cherry + 75% lime, h5 = 100% lime.
  • The priors P(h1), …, P(h5) are known to be 0.1, 0.2, 0.4, 0.2, 0.1.
  • Suppose we take N pieces of candy from a bag and all of them are lime (data dN). What are the posterior probabilities P(hi|dN)?

10
Bayesian Learning Basics: Candy Example
  • P(h1|dN) ∝ P(dN|h1) P(h1) = 0,
    P(h2|dN) ∝ P(dN|h2) P(h2) = 0.2 (0.25)^N,
    P(h3|dN) ∝ P(dN|h3) P(h3) = 0.4 (0.5)^N,
    P(h4|dN) ∝ P(dN|h4) P(h4) = 0.2 (0.75)^N,
    P(h5|dN) ∝ P(dN|h5) P(h5) = P(h5) = 0.1.
  • Normalize them by requiring them to sum up to 1 (computed numerically below).
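The normalization can be carried out directly; the sketch below uses only the priors and per-bag lime probabilities given on the slides.

```python
# Posteriors over the five bag types after observing N lime candies in a row.
# Priors and P(lime|h_i) are exactly the numbers from the slides.
priors = [0.1, 0.2, 0.4, 0.2, 0.1]
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]

def posteriors_after_n_limes(n):
    unnorm = [prior * (pl ** n) for prior, pl in zip(priors, p_lime)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

for n in (0, 1, 2, 5, 10):
    print(n, [round(p, 3) for p in posteriors_after_n_limes(n)])
# As N grows, P(h5|dN) approaches 1: the all-lime bag dominates.
```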

11
Bayesian Learning Basics: Parameter Learning
  • Introduce a parametric probability model with parameter θ.
  • Then the hypotheses are hθ, i.e., the hypotheses are parametrized.
  • In the simplest case θ is a single scalar; in more complex cases θ consists of many components.
  • Using the data d, estimate the parameter θ.

12
Parameter Learning Example: Discrete Case
  • A bag of candy whose lime-cherry proportions are completely unknown.
  • In this case we have hypotheses hθ parametrized by the probability θ of cherry.
  • P(d|hθ) = Πj P(dj|hθ) = θ^c (1−θ)^ℓ, where c and ℓ are the numbers of cherry and lime candies observed.
  • Now suppose each candy also gets one of two wrappers, green or red, selected according to some unknown conditional distribution that depends on the flavor.
  • This model has three parameters: θ = P(F=cherry), θ1 = P(W=red|F=cherry), θ2 = P(W=red|F=lime).
  • P(d|hθ,θ1,θ2) = θ^c (1−θ)^ℓ · θ1^rc (1−θ1)^gc · θ2^rl (1−θ2)^gl, where rc, gc, rl, gl count the red and green wrappers seen on cherry and lime candies (ML estimation by counting is sketched below).
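For this model, maximizing the likelihood above reduces to ratios of counts (for example θ = c/(c+ℓ)). A short sketch of that computation, using a made-up list of (flavor, wrapper) observations:

```python
from collections import Counter

# ML estimation by counting for the candy-with-wrapper model.
# The data sample below is made up: a list of (flavor, wrapper) pairs.
data = [("cherry", "red"), ("cherry", "red"), ("cherry", "green"),
        ("lime", "green"), ("lime", "red"), ("lime", "green")]

flavor_counts = Counter(f for f, _ in data)
pair_counts = Counter(data)

c, l = flavor_counts["cherry"], flavor_counts["lime"]
theta  = c / (c + l)                            # P(F=cherry)
theta1 = pair_counts[("cherry", "red")] / c     # P(W=red | F=cherry)
theta2 = pair_counts[("lime", "red")] / l       # P(W=red | F=lime)
print(theta, theta1, theta2)
```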

13
Parameter Learning Example: Single-Variable Gaussian
  • Gaussian pdf on a single variable: P(x) = (1/(σ√(2π))) exp(−(x−μ)²/(2σ²)).
  • Suppose x1, …, xN are observed. Then the log likelihood is
  • L = Σj log P(xj) = −N log σ − (N/2) log(2π) − Σj (xj−μ)²/(2σ²).
  • We want to find the μ and σ that maximize this: find where the gradient is zero.

14
Parameter Learning Example: Single-Variable Gaussian
  • Solving this, we find the sample mean and the sample standard deviation: μML = (1/N) Σj xj and σ²ML = (1/N) Σj (xj − μML)².
  • This verifies that ML agrees with our common sense (see the short computation below).
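A direct computation of these ML estimates on a made-up sample:

```python
import math

# ML estimates for a single-variable Gaussian: the sample mean and the
# (1/N, biased) sample standard deviation. Observations are made up.
xs = [2.1, 1.9, 2.4, 2.0, 1.6]

mu_ml = sum(xs) / len(xs)
sigma_ml = math.sqrt(sum((x - mu_ml) ** 2 for x in xs) / len(xs))
print(mu_ml, sigma_ml)
```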

15
Parameter Learning Example: Linear Regression
  • Consider a linear Gaussian model with one continuous parent X and a continuous child Y.
  • Y has a Gaussian distribution whose mean depends linearly on the value of X, and Y has a fixed standard deviation σ.
  • The data are (xj, yj).
  • Let the mean of Y be θ1 x + θ2.
  • Then P(y|x) ∝ exp(−(y − (θ1 x + θ2))² / (2σ²)) / σ.
  • Maximizing the log likelihood is equivalent to minimizing E = Σj (yj − (θ1 xj + θ2))².
  • This quantity is the well-known sum of squared errors. Thus, in the linear regression case,
  • ML ⇔ least mean squares (LMS); a closed-form fit is sketched below.
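A sketch of the closed-form least-squares solution for θ1 and θ2 on made-up data; this uses the standard single-input normal-equation result, which is an assumption of the sketch rather than code from the slides.

```python
# Least-squares fit of y ≈ theta1*x + theta2, the ML solution of the linear
# Gaussian model. The data points are made up.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]

mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)
theta1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
          / sum((x - mean_x) ** 2 for x in xs))
theta2 = mean_y - theta1 * mean_x
print(theta1, theta2)   # minimizes E = sum_j (y_j - (theta1*x_j + theta2))^2
```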

16
Parameter Learning Example: Beta Distribution
  • Candy example revisited.
  • In the Bayesian view, θ is the value of a random variable Θ.
  • P(Θ) is a continuous distribution.
  • A uniform density is one candidate.
  • Another possibility is to use beta distributions.
  • The beta distribution has two hyperparameters a and b, and is given by (with α a normalizing constant)
  • beta[a,b](θ) = α θ^(a−1) (1−θ)^(b−1).
  • It has mean a/(a+b).
  • It is more peaked when a+b is large, suggesting greater certainty about the value of Θ.

17
Parameter Learning Example: Beta Distribution
  • The beta distribution has the nice property that if Θ has prior beta[a,b], then after an observation the posterior distribution for Θ is also a beta distribution:
  • P(θ|d = cherry) ∝ P(d = cherry|θ) P(θ)
    ∝ θ · beta[a,b](θ)
    ∝ θ · θ^(a−1) (1−θ)^(b−1)
    = θ^a (1−θ)^(b−1) ∝ beta[a+1,b](θ).
  • The beta distribution is called the conjugate prior for the family of distributions for a Boolean variable (the update is illustrated below).
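A minimal sketch of the conjugate update, assuming (by symmetry with the slide's derivation for cherry) that observing a lime candy maps beta[a,b] to beta[a,b+1]:

```python
# Conjugate-prior update for a Boolean variable with a beta prior:
# a cherry observation maps beta[a,b] to beta[a+1,b]; a lime observation
# (by the symmetric argument) maps it to beta[a,b+1].
def update_beta(a, b, observation):
    return (a + 1, b) if observation == "cherry" else (a, b + 1)

a, b = 2, 2                      # a mildly peaked prior around theta = 0.5
for obs in ["cherry", "cherry", "lime", "cherry"]:
    a, b = update_beta(a, b, obs)

print(a, b, "posterior mean:", a / (a + b))   # mean a/(a+b) as on the slide
```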

18
Naïve Bayes Method
  • Attributes (components of the observed data) are assumed to be independent in the naïve Bayes method.
  • Despite the naivety of this assumption, it works well for about 2/3 of real-world problems.
  • Goal: predict the class C, given the observed data Xi = xi.
  • By the independence assumption,
  • P(C|x1, …, xn) ∝ P(C) Πi P(xi|C).
  • We choose the most likely class.
  • Merits of NB:
  • Scales well: no search is required.
  • Robust against noisy data.
  • Gives probabilistic predictions (a toy classifier is sketched below).
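A toy naïve Bayes classifier over discrete attributes, sketched without smoothing; the training set, attribute names, and the tiny floor probability for unseen values are all illustrative choices, not from the slides.

```python
import math
from collections import Counter, defaultdict

# A toy naive Bayes classifier over discrete attributes (no smoothing).
train = [({"color": "red", "size": "big"}, "apple"),
         ({"color": "red", "size": "small"}, "cherry"),
         ({"color": "green", "size": "big"}, "apple"),
         ({"color": "red", "size": "small"}, "cherry")]

class_counts = Counter(c for _, c in train)
value_counts = defaultdict(Counter)          # (class, attribute) -> value counts
for x, c in train:
    for attr, val in x.items():
        value_counts[(c, attr)][val] += 1

def predict(x):
    best_class, best_logp = None, -math.inf
    for c, n_c in class_counts.items():
        # log P(C) + sum_i log P(x_i|C); unseen values get a tiny floor probability
        logp = math.log(n_c / len(train))
        for attr, val in x.items():
            logp += math.log(value_counts[(c, attr)][val] / n_c or 1e-9)
        if logp > best_logp:
            best_class, best_logp = c, logp
    return best_class

print(predict({"color": "red", "size": "small"}))   # -> "cherry"
```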

19
Bayesian Network
  • Combine all observations according to their dependency relations.
  • More formally, a Bayesian network consists of the following:
  • A set of variables (nodes)
  • A set of directed edges between variables
  • The graph is assumed to be acyclic (i.e., there is no directed cycle).
  • To each variable A with parents B1, …, Bn there is attached the potential table P(A|B1, …, Bn) (a tiny example follows).
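The attached potential tables determine the full joint distribution via P(X1, …, Xn) = Πi P(Xi | Parents(Xi)). A tiny sketch with a made-up two-node network (Rain → WetGrass) and made-up numbers:

```python
# Joint probability of a Bayesian network as a product of potential tables:
# P(X1,...,Xn) = prod_i P(Xi | Parents(Xi)).
# The two-node network Rain -> WetGrass and its numbers are made up.
p_rain = {True: 0.2, False: 0.8}                     # P(Rain)
p_wet_given_rain = {True: {True: 0.9, False: 0.1},   # P(WetGrass | Rain)
                    False: {True: 0.2, False: 0.8}}

def joint(rain, wet):
    return p_rain[rain] * p_wet_given_rain[rain][wet]

# The full joint table is recovered from the two small potential tables.
for rain in (True, False):
    for wet in (True, False):
        print(rain, wet, joint(rain, wet))
```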

20
Bayesian Network
Examples of Bayesian networks
  • A Bayesian network is a compact representation of the joint probability table (distribution).
  • Without exploiting the dependency relations, the full joint probability table is intractable.

21
Issues in Bayesian Networks
  • Learning the structure: no fully systematic method exists.
  • Updating the network after an observation is also hard: NP-hard in general.
  • There are algorithms that work around this computational complexity.
  • Hidden (latent) variables can simplify the structure substantially.

22
EM Algorithm: Learning with Hidden Variables
  • Latent (hidden) variables are not directly observable.
  • Latent variables are everywhere: in HMMs, mixtures of Gaussians, Bayesian networks, …
  • The EM (Expectation-Maximization) algorithm solves the problem of learning parameters in the presence of latent variables
  • in a very general way,
  • and also in a very simple way.
  • The EM algorithm is an iterative algorithm:
  • it iterates over E- and M-steps repeatedly, updating the parameters at each step.

23
EM Algorithm
  • An iterative algorithm.
  • Let θ be the parameters of the model, θ(i) be their estimated values at the i-th step, and Z be the hidden variable.
  • Expectation (E-step):
  • compute the expectation, with respect to the hidden variable, of the completed-data log-likelihood,
  • Σz P(Z=z|x, θ(i)) log P(x, Z=z|θ).
  • Maximization (M-step):
  • update θ by maximizing this expectation,
  • θ(i+1) = argmaxθ Σz P(Z=z|x, θ(i)) log P(x, Z=z|θ).
  • Iterate the E- and M-steps until convergence!

24
EM Algorithm
  • Resembles a gradient-descent algorithm, but with no step-size parameter.
  • EM increases the log likelihood at every step.
  • It may nevertheless have convergence problems.
  • Several variants of the EM algorithm have been suggested to overcome such difficulties.
  • Adding priors, different initializations, and reasonable initial values all help.

25
EM Algorithm Prototypical Example: Mixture of Gaussians
  • A mixture distribution:
  • P(X) = Σi=1..k P(Ci) P(X|Ci)
  • P(X|Ci) is the distribution of the i-th component.
  • When each P(X|Ci) is a (multivariate) Gaussian, this distribution is called a mixture of Gaussians.
  • It has the following parameters:
  • Weights wi = P(Ci)
  • Means μi
  • Covariances Σi
  • The problem in learning the parameters: we don't know which component generated each data point.

26
EM Algorithm Prototypical Example: Mixture of Gaussians
  • Introduce indicator hidden variables Z = (Zj): from which component was xj generated?
  • The update equations can be derived analytically, but the derivation is involved (see, for example, http://www.lans.ece.utexas.edu/course/ee380l/2002sp/blimes98gentle.pdf).
  • Skipping the details, the answers are as follows:
  • Let pij = P(Ci|xj) ∝ P(xj|Ci) P(Ci), pi = Σj pij, wi = P(Ci).
  • Update μi ← Σj pij xj / pi
  • Σi ← Σj pij xj xjᵀ / pi
  • wi ← pi (a one-dimensional implementation is sketched below)
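A one-dimensional sketch of these updates on synthetic data; the centered variance update and the normalization of wi by the number of data points are standard choices assumed here, slightly refining the slide's shorthand.

```python
import math
import random

# One-dimensional EM for a mixture of two Gaussians, following the slide's
# updates: responsibilities p_ij, then means, variances, and weights.
# The synthetic data and the initial guesses are made up.
random.seed(0)
data = ([random.gauss(0.0, 1.0) for _ in range(200)] +
        [random.gauss(5.0, 1.0) for _ in range(200)])

def gauss_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

w, mu, var = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]

for _ in range(50):
    # E-step: p_ij = P(C_i | x_j) proportional to P(x_j | C_i) P(C_i)
    resp = []
    for x in data:
        unnorm = [w[i] * gauss_pdf(x, mu[i], var[i]) for i in range(2)]
        z = sum(unnorm)
        resp.append([u / z for u in unnorm])
    # M-step: p_i = sum_j p_ij, then update mu_i, var_i, w_i
    for i in range(2):
        p_i = sum(r[i] for r in resp)
        mu[i] = sum(r[i] * x for r, x in zip(resp, data)) / p_i
        var[i] = sum(r[i] * (x - mu[i]) ** 2 for r, x in zip(resp, data)) / p_i
        w[i] = p_i / len(data)

print(w, mu, var)   # should recover weights near 0.5 and means near 0 and 5
```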

27
EM Algorithm Prototypical Example: Mixture of Gaussians
  • For a nice look-and-feel demo of EM algorithms on mixtures of Gaussians, see http://www.neurosci.aist.go.jp/akaho/MixtureEM.html

28
EM Algorithm Example: Bayesian Network, HMM
  • Omitted.
  • Covered later in class/student presentation (?)