Title: Statistical Learning
1 Statistical Learning
- Chapter 20 of AIMA
- KAIST CS570
- Lecture note
Based on AIMA slides, Jahwan Kim's slides, and Duda, Hart and Stork's slides
2 Statistical Learning
- We view LEARNING as a form of uncertain reasoning
from observation
3 Outline
- Bayesian Learning
- Bayesian inference
- MAP and ML
- Naïve Bayesian method
- Parameter Learning
- Examples
- Regression and LMS
- Learning Probability Distribution
- Parametric method
- Non-parametric method
4 Bayesian Learning (1)
- View learning as Bayesian updating of a probability distribution over the hypothesis space.
- H is the hypothesis variable; its values h1, …, hn are the possible hypotheses.
- Let d = (d1, …, dN) be the observed data vectors.
- Often (in practice, almost always) the i.i.d. assumption is made.
- Let X denote the prediction.
- In Bayesian learning,
- Compute the probability of each hypothesis given the data, and predict on that basis.
- Predictions are made by using all hypotheses, not just a single best one.
- Learning in the Bayesian setting is thus reduced to probabilistic inference.
5 Bayesian Learning (2)
- The probability that the prediction is X, given the observed data d, is
- P(X|d) = Σi P(X|d, hi) P(hi|d) = Σi P(X|hi) P(hi|d)
- The prediction is a weighted average over the predictions of the individual hypotheses.
- Hypotheses are intermediaries between the data and the predictions.
- Requires computing P(hi|d) for all i, which is usually intractable.
6 Bayesian Learning: Basic Terms
- P(hi) is called the (hypothesis) prior.
- We can embed prior knowledge by means of the prior.
- It also controls the complexity of the model.
- P(hi|d) is called the posterior (or a posteriori) probability.
- Using Bayes' rule,
- P(hi|d) ∝ P(d|hi) P(hi)
- P(d|hi) is called the likelihood of the data.
- Under the i.i.d. assumption,
- P(d|hi) = Πj P(dj|hi)
- Let hMAP be the hypothesis for which the posterior probability P(hi|d) is maximal. It is called the maximum a posteriori (or MAP) hypothesis.
7 Candy Example
- Two flavors of candy, cherry and lime, wrapped in the same opaque wrapper (we cannot see inside).
- Sold in very large bags, of which there are known to be five kinds:
- h1: 100% cherry
- h2: 75% cherry, 25% lime
- h3: 50% cherry, 50% lime
- h4: 25% cherry, 75% lime
- h5: 100% lime
- Priors are known: P(h1), …, P(h5) are 0.1, 0.2, 0.4, 0.2, 0.1.
- Suppose from a bag of candy we took N pieces and all of them were lime (data dN).
- What kind of bag is it?
- What flavor will the next candy be?
8 Candy Example: Posterior Probability of Hypotheses
- P(h1|dN) ∝ P(dN|h1) P(h1) = 0
- P(h2|dN) ∝ P(dN|h2) P(h2) = 0.2 (0.25)^N
- P(h3|dN) ∝ P(dN|h3) P(h3) = 0.4 (0.5)^N
- P(h4|dN) ∝ P(dN|h4) P(h4) = 0.2 (0.75)^N
- P(h5|dN) ∝ P(dN|h5) P(h5) = 0.1
- Normalize them by requiring them to sum to 1 (see the sketch below).
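As a concrete illustration (not part of the original slides), here is a minimal Python/NumPy sketch that reproduces these posteriors and the weighted-average prediction P(next = lime | dN); the priors and lime proportions are the ones given on slide 7, and the values of N are chosen arbitrarily.

import numpy as np

priors = np.array([0.1, 0.2, 0.4, 0.2, 0.1])    # P(h1), ..., P(h5)
p_lime = np.array([0.0, 0.25, 0.5, 0.75, 1.0])  # P(lime | hi)

def posteriors(N):
    # P(hi | dN) after observing N lime candies (i.i.d.)
    unnormalized = priors * p_lime**N            # P(dN | hi) P(hi)
    return unnormalized / unnormalized.sum()     # normalize to sum to 1

def p_next_lime(N):
    # P(next candy is lime | dN) = sum_i P(lime | hi) P(hi | dN)
    return float(np.dot(p_lime, posteriors(N)))

for N in (0, 2, 5, 10):
    print(N, posteriors(N).round(3), round(p_next_lime(N), 3))

As N grows, the posterior mass concentrates on h5 and the predicted lime probability approaches 1.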
9 Candy Example: Prediction Probability
10 Maximum a Posteriori (MAP) Learning
- Since calculating the exact probabilities is often impractical, we approximate using the MAP hypothesis. That is,
- P(X|d) ≈ P(X|hMAP)
- Make predictions with the most probable hypothesis.
- Summing over the hypothesis space is often intractable;
- instead of a large summation (integration), an optimization problem can be solved.
- For deterministic hypotheses, P(d|hi) is 1 if consistent, 0 otherwise ⇒ MAP = simplest consistent hypothesis (cf. science).
- The true hypothesis eventually dominates the Bayesian prediction.
11 MAP Approximation: the MDL Principle
- Since P(hi|d) ∝ P(d|hi) P(hi), instead of maximizing P(hi|d) we may maximize P(d|hi) P(hi).
- Equivalently, we may minimize
- −log [P(d|hi) P(hi)] = −log P(d|hi) − log P(hi)
- We can interpret this as choosing hi to minimize the number of bits required to encode the hypothesis hi and the data d under that hypothesis.
- The principle of minimizing code length (under some pre-determined coding scheme) is called the minimum description length (or MDL) principle.
- MDL is used in a wide range of practical machine learning applications (a code sketch follows below).
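As a rough sketch (again not from the slides), the MDL/MAP choice for the candy example can be computed by measuring code lengths in bits; the arrays repeat the numbers from slide 7, and N is an arbitrary example value.

import numpy as np

priors = np.array([0.1, 0.2, 0.4, 0.2, 0.1])    # P(hi)
p_lime = np.array([0.0, 0.25, 0.5, 0.75, 1.0])  # P(lime | hi)
N = 5                                            # observed 5 lime candies

with np.errstate(divide="ignore"):               # log2(0) = -inf is acceptable here
    bits = -N * np.log2(p_lime) - np.log2(priors)  # -log P(d|hi) - log P(hi), in bits
h_mdl = int(np.argmin(bits)) + 1                 # 1-based hypothesis index
print("MDL/MAP choice: h%d" % h_mdl)             # h5 once N is large enough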
12 Maximum Likelihood Approximation
- Assume furthermore that the P(hi) are all equal, i.e., assume a uniform prior.
- Reasonable when there is no reason to prefer one hypothesis over another a priori.
- For large data sets, the prior becomes irrelevant.
- To obtain the MAP hypothesis, it then suffices to maximize the likelihood P(d|hi).
- Its maximizer is the maximum likelihood hypothesis hML.
- MAP with a uniform prior ⇒ ML.
- ML is the standard statistical learning method.
- Simply find the best fit to the data.
13 Naïve Bayes Method
- Attributes (the components of the observed data) are assumed to be independent given the class in the naïve Bayes method.
- Works well for about 2/3 of real-world problems, despite the naivety of this assumption.
- Goal: predict the class C, given the observed data Xi = xi.
- By the independence assumption,
- P(C|x1, …, xn) ∝ P(C) Πi P(xi|C)
- We choose the most likely class (see the sketch after this slide).
- Merits of NB:
- Scales well: no search is required.
- Robust against noisy data.
- Gives probabilistic predictions.
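A minimal sketch of the naïve Bayes decision rule on made-up categorical data (the data set, attribute values, and the added Laplace smoothing are all illustrative assumptions, not from the slides):

from collections import Counter

# toy training examples: (attribute tuple, class)
train = [(("sunny", "hot"), "no"), (("sunny", "mild"), "no"),
         (("rain", "mild"), "yes"), (("overcast", "hot"), "yes"),
         (("rain", "cool"), "yes")]

class_counts = Counter(c for _, c in train)
attr_counts = Counter((i, v, c) for x, c in train for i, v in enumerate(x))

def predict(x):
    # choose the class maximizing P(C) * prod_i P(x_i | C)
    best, best_score = None, -1.0
    for c, nc in class_counts.items():
        score = nc / len(train)                              # P(C)
        for i, v in enumerate(x):
            # Laplace-smoothed P(x_i | C); 3 possible values per attribute assumed
            score *= (attr_counts[(i, v, c)] + 1) / (nc + 3)
        if score > best_score:
            best, best_score = c, score
    return best

print(predict(("rain", "hot")))   # -> "yes" on this toy data

No search is involved: training is just counting, which is why the method scales well.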
14 Learning Curve on the Restaurant Problem
15 Learning with Data: Parameter Learning
- Introduce a parametric probability model with parameter θ.
- The hypotheses are then hθ, i.e., hypotheses are parameterized.
- In the simplest case, θ is a single scalar; in more complex cases, θ consists of many components.
- Using the data d, estimate the parameter θ.
16 ML Parameter Learning Examples: Discrete Case
- A bag of candy whose lime-cherry proportions are completely unknown.
- In this case we have hypotheses hθ parameterized by the probability θ of cherry.
- P(d|hθ) = Πj P(dj|hθ) = θ^c (1−θ)^ℓ, where c and ℓ are the observed numbers of cherry and lime candies.
- Find hθ: maximize P(d|hθ).
- Now suppose two wrappers, green and red, are selected according to some unknown conditional distribution depending on the flavor.
- This model has three parameters: θ = P(F=cherry), θ1 = P(W=red|F=cherry), θ2 = P(W=red|F=lime).
- P(d|hθ,θ1,θ2) = θ^c (1−θ)^ℓ · θ1^rc (1−θ1)^gc · θ2^rl (1−θ2)^gl, where rc, gc (resp. rl, gl) count the red and green wrappers among the cherry (resp. lime) candies.
- Find hθ,θ1,θ2: maximize P(d|hθ,θ1,θ2) (a sketch of the resulting count-based estimates follows below).
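A brief sketch of the resulting ML estimates, assuming made-up counts (the closed-form maximizers are the observed relative frequencies):

c, l = 7, 3              # observed cherry / lime candies (invented counts)
rc, gc = 6, 1            # red / green wrappers among the cherry candies
rl, gl = 1, 2            # red / green wrappers among the lime candies

theta  = c / (c + l)     # ML estimate of P(F = cherry)
theta1 = rc / (rc + gc)  # ML estimate of P(W = red | F = cherry)
theta2 = rl / (rl + gl)  # ML estimate of P(W = red | F = lime)
print(theta, theta1, theta2)   # 0.7, ~0.857, ~0.333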
17 ML Parameter Learning Example, Continuous Case: Single-Variable Gaussian
- The Gaussian pdf on a single variable is P(x) = (1/(σ√(2π))) exp(−(x−μ)²/(2σ²)).
- Suppose x1, …, xN are observed. Then the log likelihood is
- L = Σj log P(xj) = −N log(σ√(2π)) − Σj (xj − μ)²/(2σ²)
- We want to find the μ and σ that maximize this: find where the gradient is zero.
18 ML Parameter Learning Example, Continuous Case: Single-Variable Gaussian (continued)
- Solving ∂L/∂μ = 0 and ∂L/∂σ = 0, we find
- μML = (1/N) Σj xj and σML² = (1/N) Σj (xj − μML)².
- This verifies that ML agrees with our common sense: the sample mean and sample variance (a quick numerical check follows below).
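A quick numerical check of these formulas on synthetic data (NumPy, not part of the slides):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)   # synthetic samples, true mu=2.0, sigma=1.5

mu_ml = x.mean()                                # (1/N) sum_j x_j
sigma_ml = np.sqrt(((x - mu_ml) ** 2).mean())   # sqrt((1/N) sum_j (x_j - mu_ML)^2)
print(mu_ml, sigma_ml)                          # close to 2.0 and 1.5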
19 ML Parameter Learning Example, Continuous Case: Linear Regression
- Y has a Gaussian distribution whose mean depends linearly on X and whose standard deviation is fixed.
- Maximizing the likelihood Πj P(yj|xj) is equivalent to
- minimizing Σj (yj − (θ1 xj + θ2))².
- This quantity is the sum of squared errors. Thus in this case,
- ML ⇔ least mean-squares (LMS); see the fitting sketch below.
20 Bayesian Parameter Learning
- The ML approximation is deficient with small data sets.
- e.g., after a single cherry observation, ML asserts a 100% cherry bag.
- Bayesian parameter learning:
- Place a hypothesis prior over the possible values of the parameters.
- Update this distribution as data arrive.
21 Bayesian Learning of Parameter θ
- The posterior density becomes more peaked as the number of samples increases.
- Despite different prior distributions, the posterior densities are virtually identical given a large data set.
22 Bayesian Parameter Learning Example: Beta Distribution (candy example revisited)
- θ is the value of a random variable Θ in the Bayesian view.
- P(Θ) is a continuous distribution.
- The uniform density is one candidate.
- Another possibility is to use beta distributions.
- The beta distribution has two hyperparameters a and b and is given by (α is a normalizing constant)
- beta[a,b](θ) = α θ^(a−1) (1−θ)^(b−1)
- Its mean is a/(a+b).
- Larger a suggests θ is closer to 1 than to 0.
- More peaked when a+b is large, suggesting greater certainty about the value of Θ.
23 Beta Distribution
beta[a,b](θ) = α θ^(a−1) (1−θ)^(b−1)
24 Bayesian Parameter Learning Example: Property of the Beta Distribution
- If Θ has a prior beta[a,b], then the posterior distribution of Θ is also a beta distribution:
- P(θ|d=cherry) = α P(d=cherry|θ) P(θ)
- = α θ · beta[a,b](θ)
- = α′ θ · θ^(a−1) (1−θ)^(b−1)
- = α′ θ^a (1−θ)^(b−1)
- = beta[a+1,b](θ)
- The beta distribution is therefore called the conjugate prior for the family of distributions over a Boolean variable.
- a and b act as virtual counts:
- a beta[a,b] prior behaves as if, starting from the uniform prior beta[1,1], we had already seen a−1 cherry and b−1 lime candies (see the update sketch below).
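A minimal sketch of the Bayesian update with a beta prior: each cherry observation increments a, each lime observation increments b (the observation sequence is made up):

a, b = 1, 1                                             # uniform prior beta[1,1]
observations = ["cherry", "cherry", "lime", "cherry"]   # invented data

for obs in observations:
    if obs == "cherry":
        a += 1
    else:
        b += 1

posterior_mean = a / (a + b)        # E[Theta | d] = a / (a + b)
print(a, b, posterior_mean)         # beta[4,2], mean = 2/3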
25 Density Estimation
- The standard parametric densities are unimodal (have a single local maximum), whereas many practical problems involve multi-modal densities.
- Nonparametric procedures can be used with arbitrary distributions and without the assumption that the forms of the underlying densities are known.
- There are two types of nonparametric methods:
- Estimating the class-conditional densities P(x|ωj)
- Bypassing density estimation and going directly to a posteriori probability estimation
26 Density Estimation: Basic Idea
- The probability that a vector x will fall in a region R is
- P = ∫R p(x′) dx′   (1)
- P is a smoothed (or averaged) version of the density function p(x).
- If we have a sample of size n, the probability that exactly k of the n points fall in R follows the binomial law
- Pk = (n choose k) P^k (1−P)^(n−k)   (2)
- and the expected value of k is
- E(k) = nP   (3)
27 Density Estimation: Basic Idea (continued)
- The ML estimate of P (the value maximizing Pk)
- is reached for P̂ = k/n.
- Therefore the ratio k/n is a good estimate for the probability P, and hence for the density function p.
- If p(x) is continuous and the region R is so small that p does not vary significantly within it, we can write
- P = ∫R p(x′) dx′ ≅ p(x) V   (4)
- where x is a point within R and V is the volume enclosed by R.
- Combining equations (1), (3) and (4) yields the estimate p(x) ≅ (k/n) / V (a numerical sketch follows below).
28 Parzen Windows
- The Parzen-window approach to density estimation assumes that the region Rn is a d-dimensional hypercube with edge length hn, so its volume is Vn = hn^d.
- The window function is φ(u) = 1 if |uj| ≤ 1/2 for j = 1, …, d, and 0 otherwise.
- Hence φ((x − xi)/hn) is equal to unity if xi falls within the hypercube of volume Vn centered at x, and equal to zero otherwise.
29 Parzen Windows (continued)
- The number of samples in this hypercube is
- kn = Σi=1..n φ((x − xi)/hn)
- By substituting kn into the basic estimate pn(x) = (kn/n)/Vn (equation (7)), we obtain the estimate
- pn(x) = (1/n) Σi=1..n (1/Vn) φ((x − xi)/hn)
- pn(x) estimates p(x) as an average of functions of x and the samples xi (i = 1, …, n). These window functions φ can be quite general (see the sketch below).
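A sketch of the Parzen estimate with the hypercube window above (the sample data and bandwidth are invented; the function name is only illustrative):

import numpy as np

def parzen_hypercube(x, samples, h):
    # p_n(x) = (1/n) sum_i (1/V_n) phi((x - x_i)/h), with V_n = h^d
    samples = np.atleast_2d(samples)                            # shape (n, d)
    d = samples.shape[1]
    inside = np.all(np.abs((x - samples) / h) <= 0.5, axis=1)   # phi evaluated per sample
    return inside.sum() / (len(samples) * h**d)

rng = np.random.default_rng(3)
data = rng.normal(size=(5000, 1))                       # 1-D samples from N(0,1)
print(parzen_hypercube(np.array([0.0]), data, h=0.5))   # roughly 0.4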
30 Illustration of the Parzen Window Method
- The behavior of the Parzen-window method:
- Case where p(x) ~ N(0,1).
- Let φ(u) = (1/√(2π)) exp(−u²/2) and hn = h1/√n (n > 1), where h1 is a known parameter.
- Thus
- pn(x) = (1/n) Σi=1..n (1/hn) φ((x − xi)/hn)
- is an average of normal densities centered at the samples xi (a Gaussian-kernel sketch follows below).
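The same estimator with the Gaussian window and shrinking width hn = h1/√n, as a hedged sketch (function name and sample data are illustrative):

import numpy as np

def parzen_gaussian(x, samples, h1):
    n = len(samples)
    hn = h1 / np.sqrt(n)                             # h_n = h1 / sqrt(n)
    u = (x - samples) / hn
    phi = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)   # Gaussian window
    return phi.sum() / (n * hn)                      # average of normal densities centered at x_i

rng = np.random.default_rng(4)
data = rng.normal(size=1000)                 # samples from N(0,1)
print(parzen_gaussian(0.0, data, h1=1.0))    # close to p(0) ≈ 0.399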
31 Numerical Results
- For n = 1 and h1 = 1, the estimate is a single Gaussian centered at the sample.
- For n = 10 and h = 0.1, the contributions of the individual samples are clearly observable.
34 Analogous results are also obtained in two dimensions, as illustrated.
35 Case where p(x) is a mixture of a uniform density U(a,b) and a triangle density T(c,d) (unknown density)
37 Summary
- Full Bayesian learning gives the best possible predictions but is intractable.
- MAP learning balances complexity with accuracy on the training data.
- The ML approximation assumes a uniform prior; OK for large data sets.
- Parameter estimation (ML or Bayesian) is often used in practice.