Title: CS 570 Artificial Intelligence Chapter 20. Bayesian Learning
CS 570 Artificial Intelligence
Chapter 20. Bayesian Learning
- Jahwan Kim
- Dept. of CS, KAIST
Contents
- Bayesian Learning
- Bayesian inference
- MAP and ML
- Naïve Bayes method
- Bayesian network
- Parameter Learning
- Examples
- Regression and LMS
- EM Algorithm
- Algorithm
- Mixture of Gaussian
Bayesian Learning
- Let h1, …, hn be the possible hypotheses.
- Let d = (d1, …, dn) be the observed data vectors.
- Often (in practice, almost always) an i.i.d. assumption is made.
- Let X denote the prediction.
- In Bayesian Learning,
- Compute the probability of each hypothesis given the data, and predict on that basis.
- Predictions are made by using all of the hypotheses.
- Learning in the Bayesian setting is reduced to probabilistic inference.
Bayesian Learning
- The probability that the prediction is X, given that the data d have been observed, is
- P(X|d) = Σi P(X|d, hi) P(hi|d)
- = Σi P(X|hi) P(hi|d)
- The prediction is a weighted average over the predictions of the individual hypotheses.
- Hypotheses are intermediaries between the data and the predictions.
- This requires computing P(hi|d) for all i, which is usually intractable.
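- As a concrete illustration, the following is a minimal Python sketch of full Bayesian prediction over a small discrete hypothesis space; the hypothesis names, priors, and likelihood values are hypothetical and only serve to show the weighted average P(X|d) = Σi P(X|hi) P(hi|d).

```python
# Minimal sketch of full Bayesian prediction (hypothetical hypotheses and numbers).
# Each hypothesis h is described by its prior P(h) and by P(X=1 | h).
hypotheses = {
    "h1": {"prior": 0.3, "p_x": 0.9},
    "h2": {"prior": 0.5, "p_x": 0.5},
    "h3": {"prior": 0.2, "p_x": 0.1},
}

def posterior(likelihood):
    """P(h|d) proportional to P(d|h) P(h); `likelihood` maps h -> P(d|h)."""
    unnorm = {h: likelihood[h] * v["prior"] for h, v in hypotheses.items()}
    z = sum(unnorm.values())
    return {h: p / z for h, p in unnorm.items()}

def predict(post):
    """P(X=1|d) = sum_i P(X=1|h_i) P(h_i|d): a weighted average over hypotheses."""
    return sum(hypotheses[h]["p_x"] * post[h] for h in hypotheses)

# Suppose the observed data d has these likelihoods under each hypothesis.
post = posterior({"h1": 0.2, "h2": 0.5, "h3": 0.8})
print(post, predict(post))
```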
Bayesian Learning Basics: Terms
- P(hi|d) is called the posterior (or a posteriori) probability.
- Using Bayes' rule,
- P(hi|d) ∝ P(d|hi) P(hi)
- P(hi) is called the (hypothesis) prior.
- We can embed knowledge by means of prior.
- It also controls the complexity of the model.
- P(d|hi) is called the likelihood of the data.
- Under the i.i.d. assumption,
- P(d|hi) = Πj P(dj|hi).
- Let hMAP be the hypothesis for which the posterior probability P(hi|d) is maximal. It is called the maximum a posteriori (or MAP) hypothesis.
Bayesian Learning Basics: MAP Approximation
- Since calculating the exact probability is often impractical, we approximate by using the MAP hypothesis. That is,
- P(X|d) ≈ P(X|hMAP).
- MAP is often easier than the full Bayesian method, because instead of a large summation (or integration), only an optimization problem has to be solved.
Bayesian Learning Basics: MDL Principle
- Since P(hi|d) ∝ P(d|hi) P(hi), instead of maximizing P(hi|d) we may maximize P(d|hi) P(hi).
- Equivalently, we may minimize
- -log P(d|hi)P(hi) = -log P(d|hi) - log P(hi).
- We can interpret this as choosing hi to minimize the number of bits required to encode the hypothesis hi and the data d under that hypothesis.
- The principle of minimizing code length (under some pre-determined coding scheme) is called the minimum description length (or MDL) principle.
- MDL is used in a wide range of practical machine learning applications.
Bayesian Learning Basics: Maximum Likelihood
- Assume furthermore that the P(hi) are all equal, i.e., assume a uniform prior.
- This is a reasonable approach when there is no reason to prefer one hypothesis over another a priori.
- In that case, to obtain the MAP hypothesis it suffices to maximize P(d|hi), the likelihood. Such a hypothesis is called the maximum likelihood hypothesis hML.
- In other words,
- MAP with a uniform prior ⇔ ML
Bayesian Learning Basics: Candy Example
- Two flavors of candy, cherry and lime.
- Each piece of candy is wrapped in the same opaque wrapper.
- Candy is sold in very large bags, of which there are known to be five kinds: h1 = 100% cherry, h2 = 75% cherry / 25% lime, h3 = 50% / 50%, h4 = 25% cherry / 75% lime, h5 = 100% lime.
- The priors P(h1), …, P(h5) are known to be 0.1, 0.2, 0.4, 0.2, 0.1.
- Suppose we took N pieces of candy from a bag and all of them were lime (call this data dN). What are the posterior probabilities P(hi|dN)?
Bayesian Learning Basics: Candy Example
- P(h1|dN) ∝ P(dN|h1)P(h1) = 0,
  P(h2|dN) ∝ P(dN|h2)P(h2) = 0.2 (0.25)^N,
  P(h3|dN) ∝ P(dN|h3)P(h3) = 0.4 (0.5)^N,
  P(h4|dN) ∝ P(dN|h4)P(h4) = 0.2 (0.75)^N,
  P(h5|dN) ∝ P(dN|h5)P(h5) = P(h5) = 0.1.
- Normalize them by requiring them to sum up to 1.
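- This computation is easy to reproduce; a small Python sketch using the priors and per-hypothesis lime probabilities from the example above:

```python
# Posteriors P(h_i | d_N) for the candy example: N candies drawn, all lime.
priors = [0.1, 0.2, 0.4, 0.2, 0.1]    # P(h1) .. P(h5)
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]  # P(lime | h_i)

def posteriors(n):
    unnorm = [prior * (p ** n) for prior, p in zip(priors, p_lime)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

for n in (1, 2, 5, 10):
    print(n, [round(p, 4) for p in posteriors(n)])
```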
Bayesian Learning Basics: Parameter Learning
- Introduce a parametric probability model with parameter θ.
- Then the hypotheses are hθ, i.e., the hypotheses are parametrized.
- In the simplest case θ is a single scalar; in more complex cases θ consists of many components.
- Using the data d, estimate the parameter θ.
Parameter Learning Example: Discrete Case
- Consider a bag of candy whose lime-cherry proportions are completely unknown.
- In this case we have hypotheses parametrized by the probability θ of cherry.
- P(d|hθ) = Πj P(dj|hθ) = θ^(#cherry) (1-θ)^(#lime)
- Now suppose in addition that the wrapper, green or red, is selected according to some unknown conditional distribution depending on the flavor.
- This model has three parameters: θ = P(F=cherry), θ1 = P(W=red|F=cherry), θ2 = P(W=red|F=lime).
- P(d|hθ,θ1,θ2) = θ^(#cherry) (1-θ)^(#lime) · θ1^(#red,cherry) (1-θ1)^(#green,cherry) · θ2^(#red,lime) (1-θ2)^(#green,lime)
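- For this model the maximum likelihood estimates turn out to be the observed frequencies; a minimal counting sketch is given below, using a short hypothetical list of (flavor, wrapper) observations.

```python
# ML estimation by counting for the flavor/wrapper model (hypothetical observations).
data = [("cherry", "red"), ("cherry", "green"), ("lime", "red"),
        ("cherry", "red"), ("lime", "green"), ("lime", "red")]

n_cherry = sum(1 for f, _ in data if f == "cherry")
n_lime = len(data) - n_cherry

theta = n_cherry / len(data)                                                 # P(F=cherry)
theta1 = sum(1 for f, w in data if f == "cherry" and w == "red") / n_cherry  # P(W=red|F=cherry)
theta2 = sum(1 for f, w in data if f == "lime" and w == "red") / n_lime      # P(W=red|F=lime)
print(theta, theta1, theta2)
```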
Parameter Learning Example: Single-Variable Gaussian
- The Gaussian pdf on a single variable is
  P(x) = (1/(σ√(2π))) exp(-(x-μ)^2 / (2σ^2)).
- Suppose x1, …, xN are observed. Then the log likelihood is
  L = Σj log P(xj) = -N log(σ√(2π)) - Σj (xj-μ)^2 / (2σ^2).
- We want to find the μ and σ that maximize this: find where the gradient is zero.
Parameter Learning Example: Single-Variable Gaussian
- Solving this, we find
  μ_ML = (1/N) Σj xj,   σ^2_ML = (1/N) Σj (xj - μ_ML)^2.
- This verifies that ML agrees with our common sense: the ML estimates are just the sample mean and sample variance.
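- A minimal sketch of these two estimators in Python (the data values are arbitrary):

```python
# ML estimates for a single-variable Gaussian: the sample mean and the
# (1/N, i.e. biased) sample standard deviation, matching the formulas above.
import math

def gaussian_ml(xs):
    n = len(xs)
    mu = sum(xs) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / n)
    return mu, sigma

print(gaussian_ml([2.1, 1.9, 2.4, 2.0, 1.6]))
```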
Parameter Learning Example: Linear Regression
- Consider a linear Gaussian model with one continuous parent X and a continuous child Y.
- Y has a Gaussian distribution whose mean depends linearly on the value of X, and Y has a fixed standard deviation σ.
- The data are (xj, yj).
- Let the mean of Y be θ1 X + θ2.
- Then P(y|x) ∝ exp(-(y - (θ1 x + θ2))^2 / (2σ^2)) / σ.
- Maximizing the log likelihood is equivalent to minimizing E = Σj (yj - (θ1 xj + θ2))^2.
- This quantity is the well-known sum of squared errors. Thus in the linear regression case,
- ML ⇔ Least Mean-Square (LMS)
Parameter Learning Example: Beta Distribution
- Candy example revisited.
- In the Bayesian view, θ is the value of a random variable Θ.
- P(Θ) is a continuous distribution.
- A uniform density is one candidate.
- Another possibility is to use beta distributions.
- The beta distribution has two hyperparameters a and b, and is given by (with α a normalizing constant)
- beta[a,b](θ) = α θ^(a-1) (1-θ)^(b-1).
- It has mean a/(a+b).
- It is more peaked when a+b is large, suggesting greater certainty about the value of Θ.
Parameter Learning Example: Beta Distribution
- The beta distribution has the nice property that if Θ has a beta[a,b] prior, then the posterior distribution for Θ is also a beta distribution.
- P(θ|d=cherry) ∝ P(d=cherry|θ) P(θ)
- ∝ θ · beta[a,b](θ)
- ∝ θ · θ^(a-1) (1-θ)^(b-1)
- ∝ θ^a (1-θ)^(b-1) ∝ beta[a+1,b](θ)
- The beta distribution is called the conjugate prior for the family of distributions over a Boolean variable.
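- A minimal sketch of this conjugate update in Python; the initial hyperparameters and the observation sequence are arbitrary:

```python
# Conjugate updating of a beta prior: each cherry observation increments a,
# each lime observation increments b, exactly as in the derivation above.
def update_beta(a, b, observations):
    for obs in observations:
        if obs == "cherry":
            a += 1
        else:  # "lime"
            b += 1
    return a, b

a, b = 1, 1  # beta[1,1] is the uniform prior
a, b = update_beta(a, b, ["cherry", "cherry", "lime"])
print(a, b, "posterior mean:", a / (a + b))
```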
Naïve Bayes Method
- In the Naïve Bayes method, the attributes (components of the observed data) are assumed to be independent given the class.
- It works well for about 2/3 of real-world problems, despite the naivete of this assumption.
- Goal: predict the class C, given the observed data Xi = xi.
- By the independence assumption,
- P(C|x1, …, xn) ∝ P(C) Πi P(xi|C)
- We choose the most likely class.
- Merits of NB:
- Scales well: no search is required.
- Robust against noisy data.
- Gives probabilistic predictions.
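- A minimal Naïve Bayes sketch over discrete attributes; the toy data, attribute names, and the add-one smoothing are illustrative choices, not part of the slide's method:

```python
# Minimal Naive Bayes classifier over discrete attributes.
# Estimates P(C) and P(x_i|C) by counting, then picks argmax_C P(C) * prod_i P(x_i|C).
from collections import Counter, defaultdict

def train_nb(examples):  # examples: list of (attribute_tuple, class_label)
    class_counts = Counter(c for _, c in examples)
    attr_counts = defaultdict(Counter)        # (class, attribute index) -> value counts
    values = [set() for _ in examples[0][0]]  # distinct values seen per attribute
    for attrs, c in examples:
        for i, v in enumerate(attrs):
            attr_counts[(c, i)][v] += 1
            values[i].add(v)
    return class_counts, attr_counts, values

def classify_nb(model, attrs):
    class_counts, attr_counts, values = model
    total = sum(class_counts.values())
    best, best_score = None, -1.0
    for c, cc in class_counts.items():
        score = cc / total                    # P(C)
        for i, v in enumerate(attrs):         # P(x_i|C), with add-one smoothing
            score *= (attr_counts[(c, i)][v] + 1) / (cc + len(values[i]))
        if score > best_score:
            best, best_score = c, score
    return best

# Hypothetical toy data: (outlook, windy) -> play?
data = [(("sunny", "no"), "yes"), (("rainy", "yes"), "no"),
        (("sunny", "yes"), "yes"), (("rainy", "no"), "no")]
print(classify_nb(train_nb(data), ("sunny", "yes")))
```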
Bayesian Network
- A Bayesian network combines all observations according to their dependency relations.
- More formally, a Bayesian network consists of the following:
- A set of variables (nodes)
- A set of directed edges between variables
- The graph is assumed to be acyclic (i.e., there is no directed cycle).
- To each variable A with parents B1, …, Bn, there is attached the potential table P(A|B1, …, Bn).
Bayesian Network
- [Figure: examples of Bayesian networks]
- A Bayesian network is a compact representation of the joint probability table (distribution).
- Without exploiting the dependency relations, the full joint probability table is intractable.
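- To make the compactness point concrete, here is a tiny hypothetical network (Rain → WetGrass ← Sprinkler) in Python; the structure and numbers are invented, and the joint probability of a full assignment is just the product of the local tables:

```python
# A tiny hypothetical Bayesian network: Rain -> WetGrass <- Sprinkler.
# Each node stores P(node=True | parents); the joint factorizes over the nodes.
p_rain = 0.2
p_sprinkler = 0.3
p_wet = {  # P(WetGrass=True | Rain, Sprinkler)
    (True, True): 0.95, (True, False): 0.90,
    (False, True): 0.80, (False, False): 0.05,
}

def joint(rain, sprinkler, wet):
    """P(R, S, W) = P(R) * P(S) * P(W | R, S)."""
    p = (p_rain if rain else 1 - p_rain) * \
        (p_sprinkler if sprinkler else 1 - p_sprinkler)
    pw = p_wet[(rain, sprinkler)]
    return p * (pw if wet else 1 - pw)

# Three binary variables need only 1 + 1 + 4 numbers here, not a full 2^3 joint table.
print(joint(True, False, True))
```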
Issues in Bayesian Networks
- Learning the structure: no systematic method exists.
- Updating the network after an observation is also hard: NP-hard in general.
- There are algorithms to work around this computational complexity.
- Hidden (latent) variables can simplify the structure substantially.
EM Algorithm: Learning with Hidden Variables
- Latent (hidden) variables are not directly observable.
- Latent variables are everywhere: in HMMs, mixtures of Gaussians, Bayesian networks, …
- The EM (Expectation-Maximization) algorithm solves the problem of learning parameters in the presence of latent variables
- in a very general way,
- and also in a very simple way.
- The EM algorithm is iterative:
- It repeats the E- and M-steps, updating the parameters at each step.
EM Algorithm
- An iterative algorithm.
- Let θ be the parameters of the model, θ(i) its estimated value at the i-th step, and Z the hidden variable.
- Expectation (E-step):
- Compute the expectation, with respect to the hidden variable, of the completed-data log-likelihood function
- Σz P(Z=z|x, θ(i)) log P(x, Z=z|θ)
- Maximization (M-step):
- Update θ by maximizing this expectation
- θ(i+1) = arg maxθ Σz P(Z=z|x, θ(i)) log P(x, Z=z|θ)
- Iterate the E- and M-steps until convergence!
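- The iteration itself is simple; a generic driver might look like the following sketch, where `e_step`, `m_step`, and `log_likelihood` are model-specific placeholders to be supplied:

```python
# Generic EM driver sketch. The model-specific pieces are placeholders:
#   e_step(theta)         -> responsibilities P(Z=z | x, theta) for the hidden variable
#   m_step(resp)          -> new theta maximizing the expected complete-data log likelihood
#   log_likelihood(theta) -> observed-data log likelihood, used only to test convergence
def em(theta, e_step, m_step, log_likelihood, tol=1e-6, max_iter=100):
    prev = log_likelihood(theta)
    for _ in range(max_iter):
        resp = e_step(theta)       # E-step
        theta = m_step(resp)       # M-step
        cur = log_likelihood(theta)
        if abs(cur - prev) < tol:  # stop once the likelihood stops improving
            break
        prev = cur
    return theta
```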
EM Algorithm
- Resembles a gradient-ascent algorithm, but has no step-size parameter.
- EM increases the log likelihood at every step.
- It may still have convergence problems (e.g., local maxima).
- Several variants of the EM algorithm have been suggested to overcome such difficulties.
- Adding priors, trying different initializations, and choosing reasonable initial values all help.
EM Algorithm Prototypical Example: Mixture of Gaussians
- A mixture distribution:
- P(X) = Σ(i=1..k) P(Ci) P(X|Ci)
- P(X|Ci) is the distribution of the i-th component.
- When each P(X|Ci) is a (multivariate) Gaussian, this distribution is called a mixture of Gaussians.
- It has the following parameters:
- Weights wi = P(Ci)
- Means μi
- Covariances Σi
- The problem in learning the parameters: we do not know which component generated each data point.
EM Algorithm Prototypical Example: Mixture of Gaussians
- Introduce indicator hidden variables Z = (Zj): from which component was xj generated?
- One can derive the answer analytically, but it is complicated. (See for example http://www.lans.ece.utexas.edu/course/ee380l/2002sp/blimes98gentle.pdf)
- Skipping the details, the answers are as follows.
- Let pij = P(Ci|xj) ∝ P(xj|Ci)P(Ci), pi = Σj pij, wi = P(Ci).
- Update μi ← Σj pij xj / pi
- Σi ← Σj pij xj xjᵀ / pi
- wi ← pi (normalized so that Σi wi = 1)
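- A compact Python sketch of these updates for a one-dimensional mixture (so covariances reduce to scalar variances); it uses the standard centered form of the variance update, and the data, number of components, and initialization are arbitrary:

```python
# EM for a 1-D mixture of Gaussians (minimal sketch, hypothetical data).
import math
import random

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_mog(xs, k, iters=50):
    random.seed(0)
    w = [1.0 / k] * k          # mixture weights w_i
    mu = random.sample(xs, k)  # initialize means at random data points
    var = [1.0] * k            # initial variances
    for _ in range(iters):
        # E-step: p[i][j] = P(C_i | x_j), the responsibility of component i for x_j.
        p = [[w[i] * normal_pdf(x, mu[i], var[i]) for x in xs] for i in range(k)]
        for j in range(len(xs)):
            z = sum(p[i][j] for i in range(k))
            for i in range(k):
                p[i][j] /= z
        # M-step: re-estimate means, variances, and weights from the responsibilities.
        for i in range(k):
            pi = sum(p[i])
            mu[i] = sum(pij * x for pij, x in zip(p[i], xs)) / pi
            var[i] = max(sum(pij * (x - mu[i]) ** 2
                             for pij, x in zip(p[i], xs)) / pi, 1e-6)
            w[i] = pi / len(xs)
    return w, mu, var

xs = [1.0, 1.2, 0.8, 5.1, 4.9, 5.3]  # two obvious clusters, around 1 and around 5
print(em_mog(xs, k=2))
```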
EM Algorithm Prototypical Example: Mixture of Gaussians
- For a nice look-and-feel demo of the EM algorithm on mixtures of Gaussians, see http://www.neurosci.aist.go.jp/akaho/MixtureEM.html
EM Algorithm Example: Bayesian Network, HMM
- Omitted.
- Covered later in class/student presentation (?)