1
CS 570 Artificial Intelligence
Chapter 20. Bayesian Learning
  • Jahwan Kim
  • Dept. of CS, KAIST


2
Contents
  • Bayesian Learning
  • Bayesian inference
  • MAP and ML
  • Naïve Bayes method
  • Bayesian network
  • Parameter Learning
  • Examples
  • Regression and LMS
  • EM Algorithm
  • Algorithm
  • Mixture of Gaussians

3
Bayesian Learning
  • Let h1, …, hn be the possible hypotheses.
  • Let d = (d1, …, dn) be the observed data.
  • Often (in fact, almost always) an i.i.d. assumption is made.
  • Let X denote the prediction.
  • In Bayesian Learning,
  • Compute the probability of each hypothesis given the data, and make predictions on that basis.
  • Predictions are made using all hypotheses.
  • Learning in the Bayesian setting is thus reduced to probabilistic inference.

4
Bayesian Learning
  • The probability that the prediction is X, given the observed data d, is
  • P(X|d) = Σi P(X|d, hi) P(hi|d) = Σi P(X|hi) P(hi|d)
  • The prediction is a weighted average over the predictions of the individual hypotheses.
  • Hypotheses are intermediaries between the data and the predictions.
  • This requires computing P(hi|d) for all i, which is usually intractable (a small numeric sketch follows).
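The following minimal Python sketch illustrates the weighted-average prediction above; the hypothesis priors, likelihoods, and per-hypothesis predictions are made-up placeholder numbers, not values from the slides.

```python
# Full Bayesian prediction: P(X|d) = sum_i P(X|h_i) P(h_i|d).
# All numbers below are illustrative placeholders.

def posterior_over_hypotheses(priors, likelihoods):
    """P(h_i|d) proportional to P(d|h_i) P(h_i), normalized to sum to 1."""
    unnorm = [p * l for p, l in zip(priors, likelihoods)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def bayes_predict(pred_given_h, priors, likelihoods):
    """Weighted average of the per-hypothesis predictions P(X|h_i)."""
    post = posterior_over_hypotheses(priors, likelihoods)
    return sum(px * ph for px, ph in zip(pred_given_h, post))

priors       = [1/3, 1/3, 1/3]   # P(h_i)
likelihoods  = [0.1, 0.2, 0.7]   # P(d|h_i)
pred_given_h = [0.9, 0.5, 0.1]   # P(X|h_i)
print(bayes_predict(pred_given_h, priors, likelihoods))
```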

5
Bayesian Learning Basics: Terms
  • P(hi|d) is called the posterior (or a posteriori) probability.
  • Using Bayes' rule, P(hi|d) ∝ P(d|hi) P(hi).
  • P(hi) is called the (hypothesis) prior.
  • We can embed knowledge by means of the prior.
  • It also controls the complexity of the model.
  • P(d|hi) is called the likelihood of the data.
  • Under the i.i.d. assumption, P(d|hi) = Πj P(dj|hi).
  • Let hMAP be the hypothesis for which the posterior probability P(hi|d) is maximal. It is called the maximum a posteriori (or MAP) hypothesis.

6
Bayesian Learning Basics: MAP Approximation
  • Since calculating the exact probability is often impractical, we approximate using the MAP hypothesis. That is,
  • P(X|d) ≈ P(X|hMAP).
  • MAP is often easier than the full Bayesian method, because instead of a large summation (or integration), only an optimization problem has to be solved.

7
Bayesian Learning Basics: MDL Principle
  • Since P(hi|d) ∝ P(d|hi) P(hi), instead of maximizing P(hi|d) we may maximize P(d|hi) P(hi).
  • Equivalently, we may minimize
  • −log [P(d|hi) P(hi)] = −log P(d|hi) − log P(hi).
  • We can interpret this as choosing the hi that minimizes the number of bits required to encode the hypothesis hi and the data d under that hypothesis.
  • The principle of minimizing code length (under some pre-determined coding scheme) is called the minimum description length (or MDL) principle.
  • MDL is used in a wide range of practical machine learning applications (a toy description-length comparison follows).
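A small illustration of the MDL trade-off, assuming made-up likelihood and prior values for two hypothetical hypotheses; it just evaluates −log2 P(d|h) − log2 P(h) in bits.

```python
import math

# Description length in bits: L(h, d) = -log2 P(d|h) - log2 P(h).
def description_length(likelihood, prior):
    return -math.log2(likelihood) - math.log2(prior)

# A simple hypothesis fits the data worse but is itself cheap to encode;
# a complex hypothesis fits better but costs more bits. Numbers are made up.
dl_simple  = description_length(likelihood=0.01, prior=0.5)    # about 7.6 bits
dl_complex = description_length(likelihood=0.20, prior=0.01)   # about 9.0 bits
print(dl_simple, dl_complex)   # MDL picks the smaller total: the simple one here
```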

8
Bayesian Learning Basics: Maximum Likelihood
  • Assume furthermore that the priors P(hi) are all equal, i.e., assume a uniform prior.
  • This is a reasonable approach when there is no reason to prefer one hypothesis over another a priori.
  • In that case, to obtain the MAP hypothesis it suffices to maximize the likelihood P(d|hi). Such a hypothesis is called the maximum likelihood hypothesis hML.
  • In other words,
  • MAP + uniform prior ⇔ ML

9
Bayesian Learning Basics: Candy Example
  • Two flavors of candy, cherry and lime.
  • Each piece of candy is wrapped in the same opaque wrapper.
  • Candy is sold in very large bags, of which there are known to be five kinds: h1 = 100% cherry, h2 = 75% cherry + 25% lime, h3 = 50% cherry + 50% lime, h4 = 25% cherry + 75% lime, h5 = 100% lime.
  • The priors P(h1), …, P(h5) are known to be 0.1, 0.2, 0.4, 0.2, 0.1.
  • Suppose we take N pieces of candy from a bag and all of them are lime (data dN). What are the posterior probabilities P(hi|dN)?

10
Bayesian Learning Basics: Candy Example
  • P(h1|dN) ∝ P(dN|h1) P(h1) = 0,
    P(h2|dN) ∝ P(dN|h2) P(h2) = 0.2 (0.25)^N,
    P(h3|dN) ∝ P(dN|h3) P(h3) = 0.4 (0.5)^N,
    P(h4|dN) ∝ P(dN|h4) P(h4) = 0.2 (0.75)^N,
    P(h5|dN) ∝ P(dN|h5) P(h5) = P(h5) = 0.1.
  • Normalize them by requiring them to sum up to 1 (computed numerically below).
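The normalization can be carried out directly; the sketch below uses only the priors and per-bag lime probabilities given on the slides.

```python
# Posteriors over the five bag types after observing N lime candies in a row.
# Priors and P(lime|h_i) are exactly the numbers from the slides.
priors = [0.1, 0.2, 0.4, 0.2, 0.1]
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]

def posteriors_after_n_limes(n):
    unnorm = [prior * (pl ** n) for prior, pl in zip(priors, p_lime)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

for n in (0, 1, 2, 5, 10):
    print(n, [round(p, 3) for p in posteriors_after_n_limes(n)])
# As N grows, P(h5|dN) approaches 1: the all-lime bag dominates.
```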

11
Bayesian Learning Basics: Parameter Learning
  • Introduce a parametric probability model with parameter θ.
  • Then the hypotheses are hθ, i.e., the hypotheses are parametrized.
  • In the simplest case θ is a single scalar; in more complex cases θ consists of many components.
  • Using the data d, estimate the parameter θ.

12
Parameter Learning Example: Discrete Case
  • A bag of candy whose lime-cherry proportions are completely unknown.
  • In this case we have hypotheses hθ parametrized by the probability θ of cherry.
  • P(d|hθ) = Πj P(dj|hθ) = θ^c (1−θ)^ℓ, where c and ℓ are the numbers of cherry and lime candies observed.
  • Now suppose each candy also gets one of two wrappers, green or red, selected according to some unknown conditional distribution that depends on the flavor.
  • This model has three parameters: θ = P(F=cherry), θ1 = P(W=red|F=cherry), θ2 = P(W=red|F=lime).
  • P(d|hθ,θ1,θ2) = θ^c (1−θ)^ℓ · θ1^rc (1−θ1)^gc · θ2^rl (1−θ2)^gl, where rc, gc, rl, gl count the red and green wrappers seen on cherry and lime candies (ML estimation by counting is sketched below).
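For this model, maximizing the likelihood above reduces to ratios of counts (for example θ = c/(c+ℓ)). A short sketch of that computation, using a made-up list of (flavor, wrapper) observations:

```python
from collections import Counter

# ML estimation by counting for the candy-with-wrapper model.
# The data sample below is made up: a list of (flavor, wrapper) pairs.
data = [("cherry", "red"), ("cherry", "red"), ("cherry", "green"),
        ("lime", "green"), ("lime", "red"), ("lime", "green")]

flavor_counts = Counter(f for f, _ in data)
pair_counts = Counter(data)

c, l = flavor_counts["cherry"], flavor_counts["lime"]
theta  = c / (c + l)                            # P(F=cherry)
theta1 = pair_counts[("cherry", "red")] / c     # P(W=red | F=cherry)
theta2 = pair_counts[("lime", "red")] / l       # P(W=red | F=lime)
print(theta, theta1, theta2)
```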

13
Parameter Learning Example: Single-Variable Gaussian
  • Gaussian pdf on a single variable: P(x) = (1/(σ√(2π))) exp(−(x−μ)²/(2σ²)).
  • Suppose x1, …, xN are observed. Then the log likelihood is
  • L = Σj log P(xj) = −N log σ − (N/2) log(2π) − Σj (xj−μ)²/(2σ²).
  • We want to find the μ and σ that maximize this: find where the gradient is zero.

14
Parameter Learning Example: Single-Variable Gaussian
  • Solving this, we find the sample mean and the sample standard deviation: μML = (1/N) Σj xj and σ²ML = (1/N) Σj (xj − μML)².
  • This verifies that ML agrees with our common sense (see the short computation below).
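A direct computation of these ML estimates on a made-up sample:

```python
import math

# ML estimates for a single-variable Gaussian: the sample mean and the
# (1/N, biased) sample standard deviation. Observations are made up.
xs = [2.1, 1.9, 2.4, 2.0, 1.6]

mu_ml = sum(xs) / len(xs)
sigma_ml = math.sqrt(sum((x - mu_ml) ** 2 for x in xs) / len(xs))
print(mu_ml, sigma_ml)
```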

15
Parameter Learning Example: Linear Regression
  • Consider a linear Gaussian model with one continuous parent X and a continuous child Y.
  • Y has a Gaussian distribution whose mean depends linearly on the value of X, and Y has a fixed standard deviation σ.
  • The data are (xj, yj).
  • Let the mean of Y be θ1 x + θ2.
  • Then P(y|x) ∝ exp(−(y − (θ1 x + θ2))² / (2σ²)) / σ.
  • Maximizing the log likelihood is equivalent to minimizing E = Σj (yj − (θ1 xj + θ2))².
  • This quantity is the well-known sum of squared errors. Thus, in the linear regression case,
  • ML ⇔ least mean squares (LMS); a closed-form fit is sketched below.
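A sketch of the closed-form least-squares solution for θ1 and θ2 on made-up data; this uses the standard single-input normal-equation result, which is an assumption of the sketch rather than code from the slides.

```python
# Least-squares fit of y ≈ theta1*x + theta2, the ML solution of the linear
# Gaussian model. The data points are made up.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]

mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)
theta1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
          / sum((x - mean_x) ** 2 for x in xs))
theta2 = mean_y - theta1 * mean_x
print(theta1, theta2)   # minimizes E = sum_j (y_j - (theta1*x_j + theta2))^2
```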

16
Parameter Learning Example: Beta Distribution
  • Candy example revisited.
  • In the Bayesian view, θ is the value of a random variable Θ.
  • P(Θ) is a continuous distribution.
  • A uniform density is one candidate.
  • Another possibility is to use beta distributions.
  • The beta distribution has two hyperparameters a and b, and is given by (with α a normalizing constant)
  • beta[a,b](θ) = α θ^(a−1) (1−θ)^(b−1).
  • It has mean a/(a+b).
  • It is more peaked when a+b is large, suggesting greater certainty about the value of Θ.

17
Parameter Learning Example: Beta Distribution
  • The beta distribution has the nice property that if Θ has prior beta[a,b], then after an observation the posterior distribution for Θ is also a beta distribution:
  • P(θ|d = cherry) ∝ P(d = cherry|θ) P(θ)
    ∝ θ · beta[a,b](θ)
    ∝ θ · θ^(a−1) (1−θ)^(b−1)
    = θ^a (1−θ)^(b−1) ∝ beta[a+1,b](θ).
  • The beta distribution is called the conjugate prior for the family of distributions for a Boolean variable (the update is illustrated below).
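A minimal sketch of the conjugate update, assuming (by symmetry with the slide's derivation for cherry) that observing a lime candy maps beta[a,b] to beta[a,b+1]:

```python
# Conjugate-prior update for a Boolean variable with a beta prior:
# a cherry observation maps beta[a,b] to beta[a+1,b]; a lime observation
# (by the symmetric argument) maps it to beta[a,b+1].
def update_beta(a, b, observation):
    return (a + 1, b) if observation == "cherry" else (a, b + 1)

a, b = 2, 2                      # a mildly peaked prior around theta = 0.5
for obs in ["cherry", "cherry", "lime", "cherry"]:
    a, b = update_beta(a, b, obs)

print(a, b, "posterior mean:", a / (a + b))   # mean a/(a+b) as on the slide
```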

18
Naïve Bayes Method
  • Attributes (components of the observed data) are assumed to be independent in the naïve Bayes method.
  • Despite the naivety of this assumption, it works well for about 2/3 of real-world problems.
  • Goal: predict the class C, given the observed data Xi = xi.
  • By the independence assumption,
  • P(C|x1, …, xn) ∝ P(C) Πi P(xi|C).
  • We choose the most likely class.
  • Merits of NB:
  • Scales well: no search is required.
  • Robust against noisy data.
  • Gives probabilistic predictions (a toy classifier is sketched below).
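A toy naïve Bayes classifier over discrete attributes, sketched without smoothing; the training set, attribute names, and the tiny floor probability for unseen values are all illustrative choices, not from the slides.

```python
import math
from collections import Counter, defaultdict

# A toy naive Bayes classifier over discrete attributes (no smoothing).
train = [({"color": "red", "size": "big"}, "apple"),
         ({"color": "red", "size": "small"}, "cherry"),
         ({"color": "green", "size": "big"}, "apple"),
         ({"color": "red", "size": "small"}, "cherry")]

class_counts = Counter(c for _, c in train)
value_counts = defaultdict(Counter)          # (class, attribute) -> value counts
for x, c in train:
    for attr, val in x.items():
        value_counts[(c, attr)][val] += 1

def predict(x):
    best_class, best_logp = None, -math.inf
    for c, n_c in class_counts.items():
        # log P(C) + sum_i log P(x_i|C); unseen values get a tiny floor probability
        logp = math.log(n_c / len(train))
        for attr, val in x.items():
            logp += math.log(value_counts[(c, attr)][val] / n_c or 1e-9)
        if logp > best_logp:
            best_class, best_logp = c, logp
    return best_class

print(predict({"color": "red", "size": "small"}))   # -> "cherry"
```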

19
Bayesian Network
  • Combine all observations according to their dependency relations.
  • More formally, a Bayesian network consists of the following:
  • A set of variables (nodes)
  • A set of directed edges between variables
  • The graph is assumed to be acyclic (i.e., there is no directed cycle).
  • To each variable A with parents B1, …, Bn there is attached the potential table P(A|B1, …, Bn) (a tiny example follows).
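The attached potential tables determine the full joint distribution via P(X1, …, Xn) = Πi P(Xi | Parents(Xi)). A tiny sketch with a made-up two-node network (Rain → WetGrass) and made-up numbers:

```python
# Joint probability of a Bayesian network as a product of potential tables:
# P(X1,...,Xn) = prod_i P(Xi | Parents(Xi)).
# The two-node network Rain -> WetGrass and its numbers are made up.
p_rain = {True: 0.2, False: 0.8}                     # P(Rain)
p_wet_given_rain = {True: {True: 0.9, False: 0.1},   # P(WetGrass | Rain)
                    False: {True: 0.2, False: 0.8}}

def joint(rain, wet):
    return p_rain[rain] * p_wet_given_rain[rain][wet]

# The full joint table is recovered from the two small potential tables.
for rain in (True, False):
    for wet in (True, False):
        print(rain, wet, joint(rain, wet))
```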

20
Bayesian Network
Examples of Bayesian networks
  • A Bayesian network is a compact representation of the joint probability table (distribution).
  • Without exploiting the dependency relations, the full joint probability table is intractable.

21
Issues in Bayesian Networks
  • Learning the structure: no fully systematic method exists.
  • Updating the network after an observation is also hard: NP-hard in general.
  • There are algorithms that work around this computational complexity.
  • Hidden (latent) variables can simplify the structure substantially.

22
EM Algorithm: Learning with Hidden Variables
  • Latent (hidden) variables are not directly observable.
  • Latent variables are everywhere: in HMMs, mixtures of Gaussians, Bayesian networks, …
  • The EM (Expectation-Maximization) algorithm solves the problem of learning parameters in the presence of latent variables
  • in a very general way,
  • and also in a very simple way.
  • The EM algorithm is an iterative algorithm:
  • it iterates over E- and M-steps repeatedly, updating the parameters at each step.

23
EM Algorithm
  • An iterative algorithm.
  • Let θ be the parameters of the model, θ(i) be their estimated values at the i-th step, and Z be the hidden variable.
  • Expectation (E-step):
  • compute the expectation, with respect to the hidden variable, of the completed-data log-likelihood,
  • Σz P(Z=z|x, θ(i)) log P(x, Z=z|θ).
  • Maximization (M-step):
  • update θ by maximizing this expectation,
  • θ(i+1) = argmaxθ Σz P(Z=z|x, θ(i)) log P(x, Z=z|θ).
  • Iterate the E- and M-steps until convergence!

24
EM Algorithm
  • Resembles a gradient-descent algorithm, but with no step-size parameter.
  • EM increases the log likelihood at every step.
  • It may nevertheless have convergence problems.
  • Several variants of the EM algorithm have been suggested to overcome such difficulties.
  • Adding priors, different initializations, and reasonable initial values all help.

25
EM Algorithm Prototypical Example: Mixture of Gaussians
  • A mixture distribution:
  • P(X) = Σi=1..k P(Ci) P(X|Ci)
  • P(X|Ci) is the distribution of the i-th component.
  • When each P(X|Ci) is a (multivariate) Gaussian, this distribution is called a mixture of Gaussians.
  • It has the following parameters:
  • Weights wi = P(Ci)
  • Means μi
  • Covariances Σi
  • The problem in learning the parameters: we don't know which component generated each data point.

26
EM Algorithm Prototypical Example: Mixture of Gaussians
  • Introduce indicator hidden variables Z = (Zj): from which component was xj generated?
  • The update equations can be derived analytically, but the derivation is involved (see, for example, http://www.lans.ece.utexas.edu/course/ee380l/2002sp/blimes98gentle.pdf).
  • Skipping the details, the answers are as follows:
  • Let pij = P(Ci|xj) ∝ P(xj|Ci) P(Ci), pi = Σj pij, wi = P(Ci).
  • Update μi ← Σj pij xj / pi
  • Σi ← Σj pij xj xjᵀ / pi
  • wi ← pi (a one-dimensional implementation is sketched below)
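A one-dimensional sketch of these updates on synthetic data; the centered variance update and the normalization of wi by the number of data points are standard choices assumed here, slightly refining the slide's shorthand.

```python
import math
import random

# One-dimensional EM for a mixture of two Gaussians, following the slide's
# updates: responsibilities p_ij, then means, variances, and weights.
# The synthetic data and the initial guesses are made up.
random.seed(0)
data = ([random.gauss(0.0, 1.0) for _ in range(200)] +
        [random.gauss(5.0, 1.0) for _ in range(200)])

def gauss_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

w, mu, var = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]

for _ in range(50):
    # E-step: p_ij = P(C_i | x_j) proportional to P(x_j | C_i) P(C_i)
    resp = []
    for x in data:
        unnorm = [w[i] * gauss_pdf(x, mu[i], var[i]) for i in range(2)]
        z = sum(unnorm)
        resp.append([u / z for u in unnorm])
    # M-step: p_i = sum_j p_ij, then update mu_i, var_i, w_i
    for i in range(2):
        p_i = sum(r[i] for r in resp)
        mu[i] = sum(r[i] * x for r, x in zip(resp, data)) / p_i
        var[i] = sum(r[i] * (x - mu[i]) ** 2 for r, x in zip(resp, data)) / p_i
        w[i] = p_i / len(data)

print(w, mu, var)   # should recover weights near 0.5 and means near 0 and 5
```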

27
EM Algorithm Prototypical Example: Mixture of Gaussians
  • For a nice look-and-feel demo of EM algorithms on mixtures of Gaussians, see http://www.neurosci.aist.go.jp/akaho/MixtureEM.html

28
EM Algorithm Example: Bayesian Network, HMM
  • Omitted.
  • Covered later in class/student presentation (?)