Title: Maximum Likelihood
1. Outline
- Maximum Likelihood
- Maximum A-Posteriori (MAP) Estimation
- Bayesian Parameter Estimation
- Example: The Gaussian Case
- Recursive Bayesian Incremental Learning
- Problems of Dimensionality
- Nonparametric Techniques
- Density Estimation
- Histogram Approach
- Parzen-window method
2. Bayes' Decision Rule (Minimizes the Probability of Error)
- Choose $\omega_1$ if $P(\omega_1\mid\mathbf{x}) > P(\omega_2\mid\mathbf{x})$; choose $\omega_2$ otherwise,
- or, equivalently: $\omega_1$ if $p(\mathbf{x}\mid\omega_1)\,P(\omega_1) > p(\mathbf{x}\mid\omega_2)\,P(\omega_2)$, and $\omega_2$ otherwise,
- and
  $$ P(\text{error}\mid\mathbf{x}) = \min\left[\,P(\omega_1\mid\mathbf{x}),\; P(\omega_2\mid\mathbf{x})\,\right] $$
  (a small numerical sketch of this rule follows below).
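The following minimal Python sketch (not part of the original slides) illustrates the two-class rule above. The class-conditional densities are taken to be univariate Gaussians with made-up means, variances, and priors purely for illustration.

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    """Univariate normal density N(mu, sigma^2)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

def bayes_decide(x, priors=(0.5, 0.5), params=((-1.0, 1.0), (2.0, 1.0))):
    """Choose w1 if p(x|w1)P(w1) > p(x|w2)P(w2), else w2.
    Gaussian class-conditionals are an illustrative assumption."""
    g = [gauss_pdf(x, mu, s) * P for (mu, s), P in zip(params, priors)]
    return 1 if g[0] > g[1] else 2

print(bayes_decide(0.0))  # closer to class 1's mean -> 1
print(bayes_decide(3.0))  # closer to class 2's mean -> 2
```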
3. Normal Density: Multivariate Case
- The general multivariate normal density (MND) in d dimensions is written as
  $$ p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}\,|\boldsymbol{\Sigma}|^{1/2}} \exp\!\left[-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{T}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right] $$
- It can be shown that $\boldsymbol{\mu} = E[\mathbf{x}]$ and $\boldsymbol{\Sigma} = E[(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^{T}]$,
- which means, for components, $\mu_i = E[x_i]$ and $\sigma_{ij} = E[(x_i-\mu_i)(x_j-\mu_j)]$ (a short numerical sketch follows below).
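A minimal sketch, assuming NumPy, of evaluating the multivariate normal density above at a point; the mean vector and covariance matrix are arbitrary example values.

```python
import numpy as np

def mnd_pdf(x, mu, Sigma):
    """Evaluate the d-dimensional multivariate normal density at x."""
    d = len(mu)
    diff = x - mu
    norm_const = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return norm_const * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)

mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
print(mnd_pdf(np.array([1.0, -0.5]), mu, Sigma))
```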
4. Maximum Likelihood and Bayesian Parameter Estimation
- To design an optimal classifier we need the priors $P(\omega_i)$ and the class-conditional densities $p(\mathbf{x}\mid\omega_i)$, but usually we do not know them.
- Solution: use training data to estimate the unknown probabilities. Estimation of the class-conditional densities is the difficult part of this task.
5. Maximum Likelihood and Bayesian Parameter Estimation
- Supervised learning: we get to see samples from each of the classes separately (called tagged or labeled samples).
- Tagged samples are expensive, so we need to learn the distributions as efficiently as possible.
- Two families of methods: parametric (easier) and nonparametric (harder).
6. Learning From Observed Data
- (Figure: taxonomy of learning from observed data, contrasting hidden vs. observed variables and unsupervised vs. supervised settings.)
7. Maximum Likelihood and Bayesian Parameter Estimation
- Program for parametric methods:
  - Assume specific parametric distributions with parameters $\boldsymbol{\theta}$.
  - Estimate the parameters from training data.
  - Replace the true class-conditional density with this approximation and apply the Bayesian framework for decision making.
8. Maximum Likelihood and Bayesian Parameter Estimation
- Suppose we can assume that the relevant (class-conditional) densities are of some parametric form. That is, $p(\mathbf{x}\mid\omega) = p(\mathbf{x}\mid\boldsymbol{\theta})$, where $\boldsymbol{\theta}$ is a parameter (vector).
- Examples of parameterized densities:
  - Binomial: $\mathbf{x}(n)$ has m 1s and (n−m) 0s, so $p(\mathbf{x}(n)\mid\theta) = \theta^{m}(1-\theta)^{n-m}$.
  - Exponential: each data point x is distributed according to $p(x\mid\theta) = \theta e^{-\theta x}$, $x \ge 0$.
9. Maximum Likelihood and Bayesian Parameter Estimation (cont.)
- Two procedures for parameter estimation will be considered:
  - Maximum likelihood estimation: choose the parameter value that makes the data most probable (i.e., maximizes the probability of obtaining the sample that has actually been observed).
  - Bayesian learning: define a prior probability on the model space and compute the posterior. Additional samples sharpen the posterior density, which peaks near the true values of the parameters.
10. Sampling Model
- It is assumed that a sample set $\mathcal{D} = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$ with n independently generated samples is available.
- The sample set is partitioned into separate sample sets for each class, $\mathcal{D}_1, \ldots, \mathcal{D}_c$.
- A generic sample set will simply be denoted by $\mathcal{D}$.
- Each class-conditional density $p(\mathbf{x}\mid\omega_j)$ is assumed to have a known parametric form and is uniquely specified by a parameter (vector) $\boldsymbol{\theta}_j$.
- Samples in each set are assumed to be independent and identically distributed (i.i.d.) according to the true probability law $p(\mathbf{x}\mid\omega_j, \boldsymbol{\theta}_j)$.
11. Log-Likelihood Function and Score Function
- The sample sets are assumed to be functionally independent, i.e., the training set $\mathcal{D}_i$ contains no information about $\boldsymbol{\theta}_j$ for $i \neq j$.
- The i.i.d. assumption implies that
  $$ p(\mathcal{D}\mid\boldsymbol{\theta}) = \prod_{k=1}^{n} p(\mathbf{x}_k\mid\boldsymbol{\theta}). $$
- Let $\mathcal{D} = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$ be a generic sample of size n.
- Log-likelihood function:
  $$ l(\boldsymbol{\theta}) = \ln p(\mathcal{D}\mid\boldsymbol{\theta}) = \sum_{k=1}^{n} \ln p(\mathbf{x}_k\mid\boldsymbol{\theta}). $$
- The log-likelihood function is identical to the logarithm of the probability density function, but it is interpreted as a function over the parameter space for a given sample.
12. Log-Likelihood Illustration
- Assume that all the points in $\mathcal{D}$ are drawn from some (one-dimensional) normal distribution with known variance and unknown mean.
13. Log-Likelihood Function and Score Function (cont.)
- Maximum likelihood estimator (MLE):
  $$ \hat{\boldsymbol{\theta}} = \arg\max_{\boldsymbol{\theta}} \; l(\boldsymbol{\theta}) $$
  (tacitly assuming that such a maximum exists!)
- Score function: the gradient of the log-likelihood,
  $$ \nabla_{\boldsymbol{\theta}}\, l(\boldsymbol{\theta}) = \sum_{k=1}^{n} \nabla_{\boldsymbol{\theta}} \ln p(\mathbf{x}_k\mid\boldsymbol{\theta}). $$
- Necessary condition for the MLE (if it is not on the border of the domain):
  $$ \nabla_{\boldsymbol{\theta}}\, l(\hat{\boldsymbol{\theta}}) = \mathbf{0} $$
  (a numerical sketch follows below).
14. Maximum A Posteriori
- Maximum a posteriori (MAP) estimation:
  - Find the value of $\boldsymbol{\theta}$ that maximizes $l(\boldsymbol{\theta}) + \ln p(\boldsymbol{\theta})$, where $p(\boldsymbol{\theta})$ is a prior probability over parameter values. A MAP estimator finds the peak, or mode, of the posterior.
- Drawback of MAP: after an arbitrary nonlinear transformation of the parameter space, the density will change, and the MAP solution will no longer be correct.
15. Maximum A-Posteriori (MAP) Estimation
- The most likely value of $\boldsymbol{\theta}$ is given by
  $$ \hat{\boldsymbol{\theta}}_{MAP} = \arg\max_{\boldsymbol{\theta}} \; p(\boldsymbol{\theta}\mid\mathcal{D}). $$
16. Maximum A-Posteriori (MAP) Estimation
- By Bayes' formula,
  $$ p(\boldsymbol{\theta}\mid\mathcal{D}) = \frac{p(\mathcal{D}\mid\boldsymbol{\theta})\,p(\boldsymbol{\theta})}{p(\mathcal{D})} = \frac{p(\boldsymbol{\theta})\prod_{k=1}^{n} p(\mathbf{x}_k\mid\boldsymbol{\theta})}{p(\mathcal{D})}, $$
  since the data are i.i.d.
- We can disregard the normalizing factor $p(\mathcal{D})$ when looking for the maximum.
17. MAP (continued)
- So, the $\hat{\boldsymbol{\theta}}_{MAP}$ we are looking for is
  $$ \hat{\boldsymbol{\theta}}_{MAP} = \arg\max_{\boldsymbol{\theta}} \; p(\boldsymbol{\theta})\prod_{k=1}^{n} p(\mathbf{x}_k\mid\boldsymbol{\theta}) = \arg\max_{\boldsymbol{\theta}} \left[\ln p(\boldsymbol{\theta}) + \sum_{k=1}^{n} \ln p(\mathbf{x}_k\mid\boldsymbol{\theta})\right]. $$
18. The Gaussian Case: Unknown Mean
- Suppose that the samples are drawn from a multivariate normal population with mean $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$.
- Consider first the case where only the mean is unknown, i.e., $\boldsymbol{\theta} = \boldsymbol{\mu}$.
- For a sample point $\mathbf{x}_k$, we have
  $$ \ln p(\mathbf{x}_k\mid\boldsymbol{\mu}) = -\tfrac{1}{2}\ln\!\left[(2\pi)^d|\boldsymbol{\Sigma}|\right] - \tfrac{1}{2}(\mathbf{x}_k-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}_k-\boldsymbol{\mu}) $$
  and
  $$ \nabla_{\boldsymbol{\mu}} \ln p(\mathbf{x}_k\mid\boldsymbol{\mu}) = \boldsymbol{\Sigma}^{-1}(\mathbf{x}_k-\boldsymbol{\mu}). $$
- The maximum likelihood estimate for $\boldsymbol{\mu}$ must satisfy
  $$ \sum_{k=1}^{n} \boldsymbol{\Sigma}^{-1}(\mathbf{x}_k-\hat{\boldsymbol{\mu}}) = \mathbf{0}. $$
19. The Gaussian Case: Unknown Mean
- Multiplying by $\boldsymbol{\Sigma}$ and rearranging, we obtain
  $$ \hat{\boldsymbol{\mu}} = \frac{1}{n}\sum_{k=1}^{n} \mathbf{x}_k. $$
- The MLE for the unknown population mean is just the arithmetic average of the training samples (the sample mean).
- Geometrically, if we think of the n samples as a cloud of points, the sample mean is the centroid of the cloud.
20. The Gaussian Case: Unknown Mean and Covariance
- In the general multivariate normal case, neither the mean nor the covariance matrix is known.
- Consider first the univariate case with $\theta_1 = \mu$ and $\theta_2 = \sigma^2$. The log-likelihood of a single point is
  $$ \ln p(x_k\mid\boldsymbol{\theta}) = -\tfrac{1}{2}\ln(2\pi\theta_2) - \frac{(x_k-\theta_1)^2}{2\theta_2} $$
  and its derivative is
  $$ \nabla_{\boldsymbol{\theta}} \ln p(x_k\mid\boldsymbol{\theta}) = \begin{pmatrix} \dfrac{x_k-\theta_1}{\theta_2} \\[2mm] -\dfrac{1}{2\theta_2} + \dfrac{(x_k-\theta_1)^2}{2\theta_2^2} \end{pmatrix}. $$
21. The Gaussian Case: Unknown Mean and Covariance
- Setting the gradient to zero and using all the sample points, we get the following necessary conditions:
  $$ \sum_{k=1}^{n}\frac{x_k-\hat{\mu}}{\hat{\sigma}^2} = 0 \qquad \text{and} \qquad -\sum_{k=1}^{n}\frac{1}{\hat{\sigma}^2} + \sum_{k=1}^{n}\frac{(x_k-\hat{\mu})^2}{\hat{\sigma}^4} = 0, $$
  where $\hat{\mu}$ and $\hat{\sigma}^2$ are the MLE estimates for $\mu$ and $\sigma^2$, respectively.
- Solving for $\hat{\mu}$ and $\hat{\sigma}^2$, we obtain
  $$ \hat{\mu} = \frac{1}{n}\sum_{k=1}^{n} x_k, \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{k=1}^{n}(x_k-\hat{\mu})^2. $$
22. The Gaussian Multivariate Case
- For the multivariate case, it is easy to show that the MLE estimates for $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ are given by
  $$ \hat{\boldsymbol{\mu}} = \frac{1}{n}\sum_{k=1}^{n}\mathbf{x}_k, \qquad \hat{\boldsymbol{\Sigma}} = \frac{1}{n}\sum_{k=1}^{n}(\mathbf{x}_k-\hat{\boldsymbol{\mu}})(\mathbf{x}_k-\hat{\boldsymbol{\mu}})^T. $$
- The MLE for the mean vector is the sample mean, and the MLE for the covariance matrix is the arithmetic average of the n matrices $(\mathbf{x}_k-\hat{\boldsymbol{\mu}})(\mathbf{x}_k-\hat{\boldsymbol{\mu}})^T$.
- The MLE for $\sigma^2$ is biased: the expected value, over all data sets of size n, of the sample variance is not equal to the true variance,
  $$ E\!\left[\frac{1}{n}\sum_{k=1}^{n}(x_k-\bar{x})^2\right] = \frac{n-1}{n}\,\sigma^2 \neq \sigma^2. $$
23. The Gaussian Multivariate Case
- Unbiased estimators for $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ are given by
  $$ \bar{\mathbf{x}} = \frac{1}{n}\sum_{k=1}^{n}\mathbf{x}_k \qquad \text{and} \qquad \mathbf{C} = \frac{1}{n-1}\sum_{k=1}^{n}(\mathbf{x}_k-\bar{\mathbf{x}})(\mathbf{x}_k-\bar{\mathbf{x}})^T. $$
- C is called the sample covariance matrix. C is absolutely unbiased; $\hat{\boldsymbol{\Sigma}}$ is only asymptotically unbiased (see the numerical sketch below).
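A minimal sketch, assuming NumPy, contrasting the biased MLE covariance (divide by n) with the unbiased sample covariance C (divide by n−1); the true mean and covariance used to generate the data are arbitrary example values.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal(mean=[0.0, 1.0],
                            cov=[[2.0, 0.3], [0.3, 1.0]], size=20)  # n=20, d=2

mu_hat = X.mean(axis=0)                    # MLE of the mean (sample mean)
diff = X - mu_hat
Sigma_mle = diff.T @ diff / len(X)         # biased MLE, divides by n
C = diff.T @ diff / (len(X) - 1)           # unbiased sample covariance, divides by n-1

print(Sigma_mle)
print(C)
print(np.allclose(C, np.cov(X, rowvar=False)))  # np.cov uses n-1 by default
```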
24. Bayesian Estimation: Class-Conditional Densities
- The aim is to find the posteriors $P(\omega_i\mid\mathbf{x})$ knowing $p(\mathbf{x}\mid\omega_i)$ and $P(\omega_i)$, but these are unknown. How do we find them?
- Given the sample $\mathcal{D}$, we say that the aim is to find $P(\omega_i\mid\mathbf{x},\mathcal{D})$. Bayes' formula gives
  $$ P(\omega_i\mid\mathbf{x},\mathcal{D}) = \frac{p(\mathbf{x}\mid\omega_i,\mathcal{D})\,P(\omega_i\mid\mathcal{D})}{\sum_{j=1}^{c} p(\mathbf{x}\mid\omega_j,\mathcal{D})\,P(\omega_j\mid\mathcal{D})}. $$
- We use the information provided by the training samples to determine the class-conditional densities and the prior probabilities.
- Generally used assumptions:
  - Priors are known or obtainable from a trivial calculation; thus $P(\omega_i) = P(\omega_i\mid\mathcal{D})$.
  - The training set can be separated into c subsets $\mathcal{D}_1, \ldots, \mathcal{D}_c$.
25. Bayesian Estimation: Class-Conditional Densities
- The samples in $\mathcal{D}_j$ have no influence on $p(\mathbf{x}\mid\omega_i,\mathcal{D}_i)$ if $j \neq i$. Thus we can write
  $$ P(\omega_i\mid\mathbf{x},\mathcal{D}) = \frac{p(\mathbf{x}\mid\omega_i,\mathcal{D}_i)\,P(\omega_i)}{\sum_{j=1}^{c} p(\mathbf{x}\mid\omega_j,\mathcal{D}_j)\,P(\omega_j)}. $$
- We have c separate problems of the form:
  - Use a set $\mathcal{D}$ of samples drawn independently according to a fixed but unknown probability distribution $p(\mathbf{x})$ to determine $p(\mathbf{x}\mid\mathcal{D})$.
26. Bayesian Estimation: General Theory
- Bayesian learning considers $\boldsymbol{\theta}$ (the parameter vector to be estimated) to be a random variable.
- Before we observe the data, the parameters are described by a prior $p(\boldsymbol{\theta})$, which is typically very broad. Once we have observed the data, we can use Bayes' formula to find the posterior $p(\boldsymbol{\theta}\mid\mathcal{D})$. Since some values of the parameters are more consistent with the data than others, the posterior is narrower than the prior. This is Bayesian learning (see figure).
27. General Theory (cont.)
- Density function for x, given the training data set $\mathcal{D}$: from the definition of conditional probability densities,
  $$ p(\mathbf{x},\boldsymbol{\theta}\mid\mathcal{D}) = p(\mathbf{x}\mid\boldsymbol{\theta},\mathcal{D})\,p(\boldsymbol{\theta}\mid\mathcal{D}). $$
- The first factor is independent of $\mathcal{D}$, since it is just our assumed parametric form for the density: $p(\mathbf{x}\mid\boldsymbol{\theta},\mathcal{D}) = p(\mathbf{x}\mid\boldsymbol{\theta})$.
- Therefore
  $$ p(\mathbf{x}\mid\mathcal{D}) = \int p(\mathbf{x}\mid\boldsymbol{\theta})\,p(\boldsymbol{\theta}\mid\mathcal{D})\,d\boldsymbol{\theta}. $$
- Instead of choosing a specific value for $\boldsymbol{\theta}$, the Bayesian approach performs a weighted average over all values of $\boldsymbol{\theta}$.
- The weighting factor $p(\boldsymbol{\theta}\mid\mathcal{D})$, which is the posterior of $\boldsymbol{\theta}$, is determined by starting from some assumed prior $p(\boldsymbol{\theta})$.
28. General Theory (cont.)
- We then update the prior using Bayes' formula to take account of the data set $\mathcal{D}$. Since the samples $\mathbf{x}_1, \ldots, \mathbf{x}_n$ are drawn independently,
  $$ p(\mathcal{D}\mid\boldsymbol{\theta}) = \prod_{k=1}^{n} p(\mathbf{x}_k\mid\boldsymbol{\theta}), $$
  which is the likelihood function.
- The posterior for $\boldsymbol{\theta}$ is
  $$ p(\boldsymbol{\theta}\mid\mathcal{D}) = \frac{p(\mathcal{D}\mid\boldsymbol{\theta})\,p(\boldsymbol{\theta})}{\int p(\mathcal{D}\mid\boldsymbol{\theta}')\,p(\boldsymbol{\theta}')\,d\boldsymbol{\theta}'}, $$
  where the denominator is the normalization factor.
29. Bayesian Learning: Univariate Normal Distribution
- Let us use the Bayesian estimation technique to calculate the a posteriori density $p(\mu\mid\mathcal{D})$ and the desired probability density $p(x\mid\mathcal{D})$ for the case $p(x\mid\mu) \sim N(\mu, \sigma^2)$.
- Univariate case: let $\mu$ be the only unknown parameter (the variance $\sigma^2$ is assumed known).
30. Bayesian Learning: Univariate Normal Distribution
- Prior probability: a normal distribution over $\mu$,
  $$ p(\mu) \sim N(\mu_0, \sigma_0^2). $$
- $\mu_0$ encodes some prior knowledge about the true mean $\mu$, while $\sigma_0^2$ measures our prior uncertainty.
- If $\mu$ is drawn from $p(\mu)$, then the density for x is completely determined. Letting $\mathcal{D} = \{x_1, \ldots, x_n\}$, we use Bayes' formula:
  $$ p(\mu\mid\mathcal{D}) = \frac{p(\mathcal{D}\mid\mu)\,p(\mu)}{\int p(\mathcal{D}\mid\mu')\,p(\mu')\,d\mu'}. $$
31. Bayesian Learning: Univariate Normal Distribution
- Computing the posterior distribution:
  $$ p(\mu\mid\mathcal{D}) = \alpha \prod_{k=1}^{n} p(x_k\mid\mu)\,p(\mu) = \alpha' \exp\!\left[-\frac{1}{2}\left(\sum_{k=1}^{n}\frac{(x_k-\mu)^2}{\sigma^2} + \frac{(\mu-\mu_0)^2}{\sigma_0^2}\right)\right]. $$
32. Bayesian Learning: Univariate Normal Distribution
- Factors that do not depend on $\mu$ have been absorbed into the constants $\alpha$, $\alpha'$, and $\alpha''$.
- $p(\mu\mid\mathcal{D})$ is an exponential function of a quadratic function of $\mu$, i.e., it is a normal density, and it remains normal for any number of training samples.
- If we write
  $$ p(\mu\mid\mathcal{D}) = \frac{1}{\sqrt{2\pi}\,\sigma_n}\exp\!\left[-\frac{1}{2}\left(\frac{\mu-\mu_n}{\sigma_n}\right)^2\right], $$
  then, identifying the coefficients, we get
  $$ \frac{1}{\sigma_n^2} = \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2} \qquad \text{and} \qquad \frac{\mu_n}{\sigma_n^2} = \frac{n}{\sigma^2}\,\hat{\mu}_n + \frac{\mu_0}{\sigma_0^2}. $$
33. Bayesian Learning: Univariate Normal Distribution
- Here $\hat{\mu}_n = \frac{1}{n}\sum_{k=1}^{n} x_k$ is the sample mean.
- Solving explicitly for $\mu_n$ and $\sigma_n^2$, we obtain
  $$ \mu_n = \left(\frac{n\sigma_0^2}{n\sigma_0^2+\sigma^2}\right)\hat{\mu}_n + \left(\frac{\sigma^2}{n\sigma_0^2+\sigma^2}\right)\mu_0 \qquad \text{and} \qquad \sigma_n^2 = \frac{\sigma_0^2\,\sigma^2}{n\sigma_0^2+\sigma^2}. $$
- $\mu_n$ represents our best guess for $\mu$ after observing n samples.
- $\sigma_n^2$ measures our uncertainty about this guess.
- $\sigma_n^2$ decreases monotonically with n (approaching $\sigma^2/n$ as n approaches infinity). A small numerical sketch of this update follows below.
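A minimal sketch, assuming NumPy, of the closed-form posterior parameters $\mu_n$ and $\sigma_n^2$ above; the true mean, known variance, prior mean, and prior variance are illustrative values.

```python
import numpy as np

def posterior_params(data, sigma2, mu0, sigma0_2):
    """Posterior N(mu_n, sigma_n^2) for the mean of a Gaussian with known
    variance sigma2 and prior N(mu0, sigma0_2)."""
    n = len(data)
    mu_hat = np.mean(data)
    mu_n = (n * sigma0_2 * mu_hat + sigma2 * mu0) / (n * sigma0_2 + sigma2)
    sigma_n2 = sigma0_2 * sigma2 / (n * sigma0_2 + sigma2)
    return mu_n, sigma_n2

rng = np.random.default_rng(2)
data = rng.normal(loc=1.5, scale=1.0, size=10)   # true mean 1.5, sigma^2 = 1
mu_n, sigma_n2 = posterior_params(data, sigma2=1.0, mu0=0.0, sigma0_2=4.0)
print(mu_n, sigma_n2)
# the predictive density p(x|D) is then N(mu_n, sigma^2 + sigma_n^2), cf. slide 37
```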
34. Bayesian Learning: Univariate Normal Distribution
- Each additional observation decreases our uncertainty about the true value of $\mu$.
- As n increases, $p(\mu\mid\mathcal{D})$ becomes more and more sharply peaked, approaching a Dirac delta function as n approaches infinity. This behavior is known as Bayesian learning.
35. Bayesian Learning: Univariate Normal Distribution
- In general, $\mu_n$ is a linear combination of $\hat{\mu}_n$ and $\mu_0$, with coefficients that are non-negative and sum to 1.
- Thus $\mu_n$ lies somewhere between $\hat{\mu}_n$ and $\mu_0$.
- If $\sigma_0 \neq 0$, then $\mu_n \to \hat{\mu}_n$ as $n \to \infty$.
- If $\sigma_0 = 0$, our a priori certainty that $\mu = \mu_0$ is so strong that no number of observations can change our opinion.
- If $\sigma_0 \gg \sigma$, the a priori guess is very uncertain, and we take $\mu_n \approx \hat{\mu}_n$.
- The ratio $\sigma^2/\sigma_0^2$ is called the dogmatism.
36. Bayesian Learning: Univariate Normal Distribution
- The univariate case:
  $$ p(x\mid\mathcal{D}) = \int p(x\mid\mu)\,p(\mu\mid\mathcal{D})\,d\mu \sim N(\mu_n, \sigma^2+\sigma_n^2), $$
  where $p(\mu\mid\mathcal{D}) \sim N(\mu_n, \sigma_n^2)$.
37. Bayesian Learning: Univariate Normal Distribution
- Since $p(x\mid\mathcal{D}) \sim N(\mu_n, \sigma^2+\sigma_n^2)$, we can write the desired density directly.
- To obtain the class-conditional probability $p(x\mid\mathcal{D})$, whose parametric form is known to be $p(x\mid\mu) \sim N(\mu, \sigma^2)$, we replace $\mu$ by $\mu_n$ and $\sigma^2$ by $\sigma^2+\sigma_n^2$.
- The conditional mean $\mu_n$ is treated as if it were the true mean, and the known variance is increased to account for the additional uncertainty in x resulting from our lack of exact knowledge of the mean $\mu$.
38. Example (demo-MAP)
- We have N points generated by a one-dimensional Gaussian, $p(x\mid\mu) \sim N(\mu, \sigma^2)$.
- Since we think that the mean should not be very big, we use as a prior $p(\mu) \sim N(0, \sigma^2/\lambda)$, where $\lambda$ is a hyperparameter. The total objective function is
  $$ J(\mu) = \ln p(\mu) + \sum_{k=1}^{N} \ln p(x_k\mid\mu), $$
  which is maximized to give
  $$ \hat{\mu}_{MAP} = \frac{1}{N+\lambda}\sum_{k=1}^{N} x_k. $$
- For $\lambda \to 0$ the influence of the prior is negligible and the result is the ML estimate, but for a very strong belief in the prior ($\lambda \to \infty$) the estimate tends to zero. Thus,
- if few data are available, the prior will bias the estimate towards the prior expected value (see the sketch below).
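A minimal sketch, assuming NumPy, of the MAP estimate above for a few values of the hyperparameter λ; the true mean and the sample size are made-up values.

```python
import numpy as np

def map_mean(data, lam):
    """MAP estimate of the mean with a zero-mean Gaussian prior of variance
    sigma^2/lam: mu_MAP = sum(x_k) / (N + lam)."""
    return np.sum(data) / (len(data) + lam)

rng = np.random.default_rng(3)
data = rng.normal(loc=3.0, scale=1.0, size=5)    # only a few points
for lam in (0.0, 1.0, 10.0, 100.0):
    # lam = 0 gives the ML estimate; large lam shrinks the estimate toward 0
    print(lam, map_mean(data, lam))
```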
39. Recursive Bayesian Incremental Learning
- We have seen that $p(\mathcal{D}\mid\boldsymbol{\theta}) = \prod_{k=1}^{n} p(\mathbf{x}_k\mid\boldsymbol{\theta})$. Let us define $\mathcal{D}^n = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$; then
  $$ p(\mathcal{D}^n\mid\boldsymbol{\theta}) = p(\mathbf{x}_n\mid\boldsymbol{\theta})\,p(\mathcal{D}^{n-1}\mid\boldsymbol{\theta}). $$
- Substituting into the posterior and using Bayes' formula, we have
  $$ p(\boldsymbol{\theta}\mid\mathcal{D}^n) = \frac{p(\mathbf{x}_n\mid\boldsymbol{\theta})\,p(\boldsymbol{\theta}\mid\mathcal{D}^{n-1})}{\int p(\mathbf{x}_n\mid\boldsymbol{\theta}')\,p(\boldsymbol{\theta}'\mid\mathcal{D}^{n-1})\,d\boldsymbol{\theta}'}. $$
- Finally, starting the recursion with $p(\boldsymbol{\theta}\mid\mathcal{D}^0) = p(\boldsymbol{\theta})$.
40. Recursive Bayesian Incremental Learning
- While $p(\boldsymbol{\theta}\mid\mathcal{D}^0) = p(\boldsymbol{\theta})$, repeated use of this equation produces the sequence $p(\boldsymbol{\theta})$, $p(\boldsymbol{\theta}\mid\mathbf{x}_1)$, $p(\boldsymbol{\theta}\mid\mathbf{x}_1,\mathbf{x}_2)$, ...
- This is called the recursive Bayes approach to parameter estimation (also incremental or on-line learning).
- When this sequence of densities converges to a Dirac delta function centered about the true parameter value, we have Bayesian learning (a small grid-based sketch follows below).
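A minimal sketch, assuming NumPy, of the recursive update on a discretized parameter grid: the unknown parameter is the mean of a univariate Gaussian with known variance, the prior is flat, and each incoming sample multiplies the current posterior by its likelihood and renormalizes. The true mean and sample count are illustrative values.

```python
import numpy as np

# Grid over the unknown mean; the data model's sigma is assumed known (= 1).
theta = np.linspace(-5, 5, 1001)
posterior = np.ones_like(theta)                  # flat prior p(theta | D^0)
posterior /= np.trapz(posterior, theta)

def likelihood(x, theta, sigma=1.0):
    return np.exp(-0.5 * ((x - theta) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

rng = np.random.default_rng(4)
for x_n in rng.normal(loc=1.0, scale=1.0, size=30):   # true mean = 1
    posterior *= likelihood(x_n, theta)               # p(x_n|theta) p(theta|D^{n-1})
    posterior /= np.trapz(posterior, theta)           # renormalize

print(theta[np.argmax(posterior)])   # the posterior peaks near the true mean
```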
41. Maximum Likelihood vs. Bayesian
- ML and Bayesian estimation are asymptotically equivalent and consistent: they yield the same class-conditional densities when the size of the training data grows to infinity.
- ML is typically computationally easier: in ML we need (multidimensional) differentiation, in Bayesian estimation (multidimensional) integration.
- ML is often easier to interpret: it returns the single best model (parameter), whereas Bayesian estimation gives a weighted average of models.
- But for finite training data (and given a reliable prior), Bayesian estimation is more accurate (it uses more of the information).
- Bayesian estimation with a flat prior is essentially ML; with asymmetric and broad priors the two methods lead to different solutions.
42. Problems of Dimensionality: Accuracy, Dimension, and Training Sample Size
- Consider two-class multivariate normal distributions with the same covariance. If the priors are equal, then the Bayes error rate is given by
  $$ P(e) = \frac{1}{\sqrt{2\pi}}\int_{r/2}^{\infty} e^{-u^2/2}\,du, $$
  where $r^2 = (\boldsymbol{\mu}_1-\boldsymbol{\mu}_2)^T\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1-\boldsymbol{\mu}_2)$ is the squared Mahalanobis distance.
- Thus the probability of error decreases as r increases. In the conditionally independent case, $\boldsymbol{\Sigma} = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_d^2)$ and
  $$ r^2 = \sum_{i=1}^{d}\left(\frac{\mu_{i1}-\mu_{i2}}{\sigma_i}\right)^2 $$
  (a small numerical sketch follows below).
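A minimal sketch, assuming NumPy and the standard-library math module, that evaluates the error integral above via the complementary error function (the normal tail at r/2 equals erfc(r/(2√2))/2); the means and covariance are example values.

```python
import math
import numpy as np

def bayes_error(mu1, mu2, Sigma):
    """P(e) = integral from r/2 to infinity of the standard normal density,
    with r the Mahalanobis distance between the class means."""
    diff = np.asarray(mu1, dtype=float) - np.asarray(mu2, dtype=float)
    r = math.sqrt(diff @ np.linalg.inv(Sigma) @ diff)
    return 0.5 * math.erfc(r / (2 * math.sqrt(2)))

Sigma = np.array([[1.0, 0.0], [0.0, 1.0]])
print(bayes_error([0, 0], [1, 1], Sigma))   # r = sqrt(2), error about 0.24
print(bayes_error([0, 0], [3, 3], Sigma))   # larger r -> smaller error
```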
43. Problems of Dimensionality
- While classification accuracy can improve with growing dimensionality (and an increasing amount of training data),
  - beyond a certain point, the inclusion of additional features leads to worse rather than better performance;
  - computational complexity grows;
  - the problem of overfitting arises.
44. Occam's Razor
- "Pluralitas non est ponenda sine neccesitate," or "plurality should not be posited without necessity." The words are those of the medieval English philosopher and Franciscan monk William of Occam (ca. 1285-1349).
- Decisions based on overly complex models often lead to lower accuracy of the classifier.
45. Outline
- Nonparametric Techniques
- Density Estimation
- Histogram Approach
- Parzen-window method
- Kn-Nearest-Neighbor Estimation
- Component Analysis and Discriminants
- Principal Components Analysis
- Fisher Linear Discriminant
- MDA
46. NONPARAMETRIC TECHNIQUES
- So far, we treated supervised learning under the assumption that the forms of the underlying density functions were known.
- The common parametric forms rarely fit the densities actually encountered in practice.
- Classical parametric densities are unimodal, whereas many practical problems involve multimodal densities.
- We now examine nonparametric procedures that can be used with arbitrary distributions and without the assumption that the forms of the underlying densities are known.
47. NONPARAMETRIC TECHNIQUES
- There are several types of nonparametric methods:
  - Procedures for estimating the density functions $p(\mathbf{x}\mid\omega_j)$ from sample patterns. If these estimates are satisfactory, they can be substituted for the true densities when designing the classifier.
  - Procedures for directly estimating the a posteriori probabilities $P(\omega_j\mid\mathbf{x})$.
  - The nearest-neighbor rule, which bypasses probability estimation and goes directly to decision functions.
48. Histogram Approach
- The conceptually simplest method of estimating a p.d.f. is the histogram. The range of each component $x_s$ of vector $\mathbf{x}$ is divided into a fixed number m of equal intervals. The resulting boxes (bins) of identical volume V are then inspected, and the number of points falling into each bin is counted.
- Suppose that we have $n_i$ samples $\mathbf{x}_j$, $j = 1, \ldots, n_i$, from class $\omega_i$.
- Let the number of sample points in the j-th bin, $b_j$, be $k_j$. The histogram estimate $\hat{p}_i(\mathbf{x})$ of the density function $p(\mathbf{x}\mid\omega_i)$
49. Histogram Approach
- is defined as
  $$ \hat{p}_i(\mathbf{x}) = \frac{k_j}{n_i\,V} \quad \text{for } \mathbf{x} \in b_j, $$
  so $\hat{p}_i(\mathbf{x})$ is constant over every bin $b_j$.
- Let us verify that $\hat{p}_i(\mathbf{x})$ is a density function:
  $$ \int \hat{p}_i(\mathbf{x})\,d\mathbf{x} = \sum_{j}\frac{k_j}{n_i\,V}\,V = \frac{1}{n_i}\sum_{j} k_j = 1. $$
- We can choose the number m of bins and their starting points. The choice of starting points is not critical, but m is important: it plays the role of a smoothing parameter. If m is too big, the histogram becomes spiky; if m is too small, we lose the true form of the density function (see the sketch below).
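A minimal sketch, assuming NumPy, of the histogram estimate $k_j/(n V)$ for one-dimensional data drawn from a mixture of two Gaussians (the mixture parameters and the number of bins are illustrative choices).

```python
import numpy as np

rng = np.random.default_rng(5)
# One-dimensional data from a mixture of two Gaussians
data = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(1, 1.0, 300)])

m = 10                                   # number of bins (smoothing parameter)
edges = np.linspace(data.min(), data.max(), m + 1)
counts, _ = np.histogram(data, bins=edges)
V = edges[1] - edges[0]                  # bin width (volume in 1-D)
p_hat = counts / (len(data) * V)         # k_j / (n V), constant over each bin

print(p_hat)
print(np.sum(p_hat * V))                 # integrates to 1
```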
50. The Histogram Method: Example
- Assume one-dimensional data: some points were sampled from a combination of two Gaussians.
- (Figure: histogram estimate with 3 bins.)
51. The Histogram Method: Example
52. Histogram Approach
- The histogram p.d.f. estimator is very efficient: it can be computed online, since all we need to do is update the counters $k_j$ at run time, so we do not need to keep all the data, which could be huge.
- But its usefulness is limited to low-dimensional vectors x, because the number of bins, $N_b = m^d$, grows exponentially with the dimensionality d.
- This is the so-called curse of dimensionality.
53. DENSITY ESTIMATION
- To estimate the density at x, we form a sequence of regions $R_1, R_2, \ldots$
- The probability for x to fall into a region R is
  $$ P = \int_{R} p(\mathbf{x}')\,d\mathbf{x}'. $$
- Suppose we have n i.i.d. samples $\mathbf{x}_1, \ldots, \mathbf{x}_n$ drawn according to $p(\mathbf{x})$. The probability that k of them fall in R is
  $$ P_k = \binom{n}{k} P^k (1-P)^{n-k}, $$
  the expected value of k is $E[k] = nP$, and the variance is $nP(1-P)$.
- The fraction of samples falling into R, k/n, is also a random variable, for which $E[k/n] = P$ and $\mathrm{Var}[k/n] = P(1-P)/n$.
- As n grows, the variance gets smaller and k/n becomes a better estimator for P.
54. DENSITY ESTIMATION
- $P_k$ peaks sharply about the mean, so k/n is a good estimate of P.
- For a small enough region R,
  $$ P = \int_{R} p(\mathbf{x}')\,d\mathbf{x}' \approx p(\mathbf{x})\,V, $$
  where x is within R and V is the volume enclosed by R.
- Thus
  $$ p(\mathbf{x}) \approx \frac{k/n}{V}. \tag{*} $$
55. Three Conditions for DENSITY ESTIMATION
- Let us take a growing sequence of samples, n = 1, 2, 3, ...
- We take regions $R_n$ with shrinking volumes $V_1 > V_2 > V_3 > \ldots$
- Let $k_n$ be the number of samples falling in $R_n$.
- Let $p_n(\mathbf{x})$ be the n-th estimate for $p(\mathbf{x})$.
- If $p_n(\mathbf{x})$ is to converge to $p(\mathbf{x})$, three conditions must be required:
  - $\lim_{n\to\infty} V_n = 0$: resolution as fine as possible (to reduce smoothing);
  - $\lim_{n\to\infty} k_n = \infty$: otherwise the region $R_n$ will not contain an infinite number of points, k/n will not converge to P, and we will get $p(\mathbf{x}) = 0$;
  - $\lim_{n\to\infty} k_n/n = 0$: to guarantee convergence of (*).
56. Parzen Window and KNN
- How do we obtain the sequence $R_1, R_2, \ldots$?
- There are two common approaches to obtaining sequences of regions that satisfy the above conditions:
  - Shrink an initial region by specifying the volume $V_n$ as some function of n, such as $V_n = 1/\sqrt{n}$, and show that $k_n$ and $k_n/n$ behave properly, i.e., that $p_n(\mathbf{x})$ converges to $p(\mathbf{x})$. This is the Parzen-window (or kernel) method.
  - Specify $k_n$ as some function of n, such as $k_n = \sqrt{n}$. Here the volume $V_n$ is grown until it encloses $k_n$ neighbors of x. This is the $k_n$-nearest-neighbor method.
- Both of these methods do converge, although it is difficult to make meaningful statements about their finite-sample behavior.
57. PARZEN WINDOWS
- Assume that the region $R_n$ is a d-dimensional hypercube.
- If $h_n$ is the length of an edge of that hypercube, then its volume is given by $V_n = h_n^d$.
- Define the following window function:
  $$ \varphi(\mathbf{u}) = \begin{cases} 1 & |u_j| \le 1/2, \;\; j = 1, \ldots, d \\ 0 & \text{otherwise.} \end{cases} $$
- $\varphi(\mathbf{u})$ defines a unit hypercube centered at the origin.
- $\varphi\!\left(\frac{\mathbf{x}-\mathbf{x}_i}{h_n}\right) = 1$ if $\mathbf{x}_i$ falls within the hypercube of volume $V_n$ centered at x, and is zero otherwise.
- The number of samples in this hypercube is given by
  $$ k_n = \sum_{i=1}^{n} \varphi\!\left(\frac{\mathbf{x}-\mathbf{x}_i}{h_n}\right). $$
58. PARZEN WINDOWS (cont.)
- Since $p_n(\mathbf{x}) = \dfrac{k_n}{n\,V_n}$, substituting $k_n$ gives
  $$ p_n(\mathbf{x}) = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{V_n}\,\varphi\!\left(\frac{\mathbf{x}-\mathbf{x}_i}{h_n}\right). $$
- Rather than limiting ourselves to the hypercube window, we can use a more general class of window functions. Thus $p_n(\mathbf{x})$ is an average of functions of x and the samples $\mathbf{x}_i$.
- The window function is used for interpolation: each sample contributes to the estimate in accordance with its distance from x.
- $p_n(\mathbf{x})$ must
  - be nonnegative, and
  - integrate to 1 (a small sketch of the hypercube case follows below).
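A minimal sketch, assuming NumPy, of the hypercube Parzen estimate $p_n(\mathbf{x}) = \frac{1}{n}\sum_i \frac{1}{V_n}\varphi((\mathbf{x}-\mathbf{x}_i)/h)$; the data distribution and the window widths h are illustrative choices.

```python
import numpy as np

def parzen_hypercube(x, samples, h):
    """Parzen estimate with a d-dimensional unit-hypercube window and V = h^d."""
    x = np.atleast_1d(x)
    samples = np.atleast_2d(samples)
    d = samples.shape[1]
    inside = np.all(np.abs((x - samples) / h) <= 0.5, axis=1)  # phi = 1 inside the cube
    return inside.sum() / (len(samples) * h ** d)

rng = np.random.default_rng(6)
X = rng.normal(0.0, 1.0, size=(500, 1))        # 1-D samples from N(0, 1)
for h in (0.1, 0.5, 2.0):
    # compare with the true density value at 0, about 0.399
    print(h, parzen_hypercube([0.0], X, h))
```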
59. PARZEN WINDOWS (cont.)
- This can be assured by requiring the window function itself to be a density function, i.e.,
  $$ \varphi(\mathbf{u}) \ge 0 \qquad \text{and} \qquad \int \varphi(\mathbf{u})\,d\mathbf{u} = 1. $$
- Effect of the window size $h_n$ on $p(\mathbf{x})$:
- Define the function
  $$ \delta_n(\mathbf{x}) = \frac{1}{V_n}\,\varphi\!\left(\frac{\mathbf{x}}{h_n}\right); $$
  then we can write $p_n(\mathbf{x})$ as the average
  $$ p_n(\mathbf{x}) = \frac{1}{n}\sum_{i=1}^{n}\delta_n(\mathbf{x}-\mathbf{x}_i). $$
- Since $V_n = h_n^d$, $h_n$ affects both the amplitude and the width of $\delta_n(\mathbf{x})$.
- Examples of two-dimensional circularly symmetric
normal Parzen windows - for 3
different values of h. - If hn is very large, the amplitude of
is small, and x must be far from xi before
changes much from
61. PARZEN WINDOWS (cont.)
- In this case $p_n(\mathbf{x})$ is the superposition of n broad, slowly varying functions and is a very smooth, "out-of-focus" estimate of $p(\mathbf{x})$.
- If $h_n$ is very small, the peak value of $\delta_n(\mathbf{x}-\mathbf{x}_i)$ is large and occurs near $\mathbf{x} = \mathbf{x}_i$.
- In this case, $p_n(\mathbf{x})$ is the superposition of n sharp pulses centered at the samples: an erratic, "noisy" estimate.
- As $h_n$ approaches zero, $\delta_n(\mathbf{x}-\mathbf{x}_i)$ approaches a Dirac delta function centered at $\mathbf{x}_i$, and $p_n(\mathbf{x})$ approaches a superposition of delta functions centered at the samples.
62. PARZEN WINDOWS (cont.)
- (Figure: 3 Parzen-window density estimates based on the same set of 5 samples, using the windows from the previous figure.)
- The choice of $h_n$ (or $V_n$) has an important effect on $p_n(\mathbf{x})$:
  - If $V_n$ is too large, the estimate will suffer from too little resolution.
  - If $V_n$ is too small, the estimate will suffer from too much statistical variability.
- If there is a limited number of samples, we must seek some acceptable compromise.
63. PARZEN WINDOWS (cont.)
- If we have an unlimited number of samples, then we can let $V_n$ slowly approach zero as n increases and have $p_n(\mathbf{x})$ converge to the unknown density $p(\mathbf{x})$.
- Examples:
- Example 1: $p(x)$ is a zero-mean, unit-variance, univariate normal density. Let the window function be of the same form:
  $$ \varphi(u) = \frac{1}{\sqrt{2\pi}}\,e^{-u^2/2}. $$
- Let $h_n = h_1/\sqrt{n}$, where $h_1$ is a parameter.
- $p_n(x)$ is an average of normal densities centered at the samples:
  $$ p_n(x) = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{h_n}\,\varphi\!\left(\frac{x-x_i}{h_n}\right) $$
  (a small sketch of this example follows below).
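A minimal sketch, assuming NumPy, of Example 1: a one-dimensional Parzen estimate with a Gaussian window and $h_n = h_1/\sqrt{n}$, evaluated on data drawn from N(0, 1); the sample size, evaluation points, and $h_1$ are illustrative choices.

```python
import numpy as np

def parzen_gaussian(x, samples, h1):
    """1-D Parzen estimate with a Gaussian window and h_n = h1 / sqrt(n):
    p_n(x) = (1/n) sum_i (1/h_n) phi((x - x_i)/h_n)."""
    n = len(samples)
    h_n = h1 / np.sqrt(n)
    u = (x - samples[:, None]) / h_n                 # shape (n, len(x))
    phi = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    return phi.sum(axis=0) / (n * h_n)

rng = np.random.default_rng(7)
samples = rng.normal(0.0, 1.0, size=100)             # as in Example 1: N(0, 1)
xs = np.linspace(-3, 3, 7)
print(parzen_gaussian(xs, samples, h1=1.0))          # should roughly track N(0, 1)
```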