Title: Ch. 3. Maximum-Likelihood and Bayesian Parameter Estimation
Ch. 3. Maximum-Likelihood and Bayesian Parameter Estimation
Introduction
- If we knew the prior probabilities P(ωi) and the class-conditional densities p(x|ωi), we could design an optimal classifier.
- Unfortunately, we rarely have this kind of complete knowledge about the probabilistic structure of the problem.
- In a typical case, we merely have some vague, general knowledge about the situation, together with a number of design samples (training data) that are particular representatives of the patterns we want to classify.
Introduction
- Problem: find some way to use this information to design or train the classifier.
- An approach to this problem is to use the samples to estimate the unknown probabilities and probability densities,
- and to use the resulting estimates as if they were the true values.
Introduction: Maximum-Likelihood Parameter Estimation
- The ML approach assumes the parameters are fixed but unknown.
- The ML approach estimates the parameter values that maximize the probability of obtaining the (given) training set.
- In other words, the ML approach seeks the parameter estimates that maximize the likelihood function.
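Written compactly (a standard statement of the ML principle; D denotes the training set and θ the parameter vector):

\[
\hat{\theta} = \arg\max_{\theta} \, p(D \mid \theta)
\]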
Introduction: Bayesian Estimation
- The Bayesian approach models the parameters to be estimated as random variables with some (assumed) known a priori distribution.
- The Bayesian approach uses the training set to update the training-set-conditioned density function of the unknown parameters.
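A compact statement of this update (a standard form of Bayes' rule for the parameter posterior, with D the training set, θ the parameters, and p(θ) the assumed prior):

\[
p(\theta \mid D) \;\propto\; p(D \mid \theta)\, p(\theta)
\]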
Maximum-Likelihood Estimation
Formulation of ML Estimation
- ML estimation assumes the parameters to be estimated are unknown but constant.
- The ML formulation assumes:
- We have a training set D in the form of c subsets of samples (feature vectors) D1, D2, ..., Dc.
- Samples in Di are assumed to be generated by the underlying density function for class i, p(x|ωi); i.e., the parametric form of p(x|ωi) is assumed known.
- The parameter vector θi is the set of parameters to be estimated for class i.
- In the Gaussian case, where x ~ N(mi, Ci), the components of θi are the elements of mi and Ci.
Use of the Training Set (ML)
- We consider the training of each class separately.
- Samples in Di give no information about θj, j ≠ i; i.e., it is assumed that the parameters for the different classes are functionally independent.
- We use a set Di of training samples, drawn independently according to the probability density p(x|ωi), to estimate the unknown parameter vector θi.
The Likelihood Function
- Suppose Di = {x1, x2, ..., xn}.
- If the samples xk within Di are assumed independent, the joint parameter-conditional pdf of Di is the product of the individual densities, as shown below.
- p(Di|θi), viewed as a function of θi, is the likelihood function of θi.
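Under the independence assumption, the likelihood takes the familiar product form (a standard statement; n is the number of samples in Di):

\[
p(D_i \mid \theta_i) \;=\; \prod_{k=1}^{n} p(x_k \mid \theta_i)
\]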
Maximum-Likelihood Estimation
- Given Di, the objective of ML estimation is to find the θi that maximizes p(Di|θi), i.e., to find the θi that maximizes the likelihood of θi.
- The goal is to maximize p(Di|θi) with respect to the parameter vector θi.
ML Estimation Example: 1D Gaussian
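A minimal numerical sketch of this kind of example, assuming the samples are drawn from a univariate Gaussian; the sample size and the "true" mean and standard deviation below are illustrative assumptions:

```python
import numpy as np

# Illustrative data: n samples from a 1D Gaussian with assumed true parameters.
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)

# ML estimates for a univariate Gaussian:
# the sample mean, and the biased (1/n) sample variance.
m_hat = x.mean()
var_hat = np.mean((x - m_hat) ** 2)

print(f"m_hat = {m_hat:.3f}, var_hat = {var_hat:.3f}")
```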
ML Estimation
ML Estimation: Log-Likelihood
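It is usually easier to work with the log-likelihood; a standard form of this step (maximize l(θi) and, where it is differentiable, set its gradient to zero):

\[
l(\theta_i) \;=\; \ln p(D_i \mid \theta_i) \;=\; \sum_{k=1}^{n} \ln p(x_k \mid \theta_i),
\qquad
\hat{\theta}_i = \arg\max_{\theta_i} l(\theta_i),
\qquad
\nabla_{\theta_i}\, l(\theta_i) = 0
\]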
ML Estimation Example: Gaussian with Unknown m, Known C
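For this case the ML estimate reduces to the sample mean (a standard result, stated here for reference):

\[
\hat{m}_i \;=\; \frac{1}{n} \sum_{k=1}^{n} x_k
\]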
ML Estimation Example: Gaussian with Unknown m, Unknown C
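For this case the ML estimates are the sample mean and the biased (1/n) sample covariance (again a standard result):

\[
\hat{m}_i = \frac{1}{n} \sum_{k=1}^{n} x_k,
\qquad
\hat{C}_i = \frac{1}{n} \sum_{k=1}^{n} (x_k - \hat{m}_i)(x_k - \hat{m}_i)^{T}
\]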
ML Estimation Example, 2D
ML Estimation Example, 3D
Maximum A Posteriori (MAP) Estimation
- Posterior density p(θ|D): p(θ|D) ∝ p(D|θ)p(θ) = l(θ)p(θ)
- MAP estimation: find the value of θ that maximizes l(θ)p(θ) = p(D|θ)p(θ).
- The maximum-likelihood estimator is a MAP estimator for a uniform (flat) prior.
- The MAP estimator finds the peak (mode) of the posterior density.
- Generally speaking, information on p(θ) is derived from the designer's knowledge of the problem domain (beyond our study of classifier design).
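Written out (a standard form, using logarithms for convenience):

\[
\hat{\theta}_{\text{MAP}}
\;=\; \arg\max_{\theta}\, p(D \mid \theta)\, p(\theta)
\;=\; \arg\max_{\theta}\, \bigl[\ln p(D \mid \theta) + \ln p(\theta)\bigr]
\]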
Bayesian Parameter Estimation
- Although the desired probability density p(x) is unknown, we assume that it has a known parametric form:
- the function p(x|θ) is completely known;
- the only thing assumed unknown is the value of the parameter vector θ.
- Any prior information about θ is assumed to be contained in a density p(θ).
- Observation of the samples D converts this density p(θ) into a posterior density p(θ|D), which is sharply peaked about the true value of θ.
Class-Conditional Densities
Basic Assumptions of Bayesian Parameter Estimation
- The basic assumptions are summarized as follows:
- The form of the density p(x|θ) is assumed to be known, but the value of the parameter vector θ is not known exactly.
- The initial knowledge about θ is assumed to be contained in a known a priori density p(θ).
- The rest of our knowledge about θ is contained in a set D of n samples x1, ..., xn drawn independently according to the unknown probability density p(x).
The Parameter Distribution
- The central problem is to compute the posterior density p(θ|D), because from it we can calculate p(x|D), as shown below.
- If p(θ|D) is sharply peaked about some value θ0, then we obtain p(x|D) ≈ p(x|θ0).
- If we are less certain about the exact value of θ (the general case), the equation below averages p(x|θ) over the possible values of θ.
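A standard form of this calculation (integrating the known parametric density against the parameter posterior):

\[
p(x \mid D) \;=\; \int p(x \mid \theta)\, p(\theta \mid D)\, d\theta
\]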
Example: Gaussian Density with Unknown Mean Vector
- Problem: calculate the posterior density p(m|D) and the desired pdf p(x|D) for p(x|m) ~ N(m, C), with C assumed known.
Estimation of p(m|D)
- As the training sample size increases, p(m|D) becomes more sharply peaked (see the expressions below).
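For the univariate case with known variance σ² and a Gaussian prior p(m) ~ N(m0, σ0²), the posterior is itself Gaussian (a standard result; x̄n denotes the sample mean of the n training samples):

\[
p(m \mid D) \sim N(m_n, \sigma_n^2),
\qquad
m_n = \frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2}\,\bar{x}_n + \frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\,m_0,
\qquad
\sigma_n^2 = \frac{\sigma_0^2\,\sigma^2}{n\sigma_0^2 + \sigma^2}
\]

Since σn² shrinks toward zero as n grows, the posterior indeed becomes more sharply peaked.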
The Univariate Gaussian Case: p(x|D)
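Carrying out the integral p(x|D) = ∫ p(x|m) p(m|D) dm for this case gives another Gaussian (a standard result):

\[
p(x \mid D) \sim N\bigl(m_n,\; \sigma^2 + \sigma_n^2\bigr)
\]

so the remaining uncertainty about the mean simply adds σn² to the variance of the predictive density.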
Bayesian Parameter Estimation: General Theory
- The basic problem is to compute the posterior density p(θ|D), because from this we can calculate p(x|D).
- By the independence assumption, the likelihood p(D|θ) factors over the individual samples, as shown below.
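The standard set of relations for this general case, stated here for reference:

\[
p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{\int p(D \mid \theta)\, p(\theta)\, d\theta},
\qquad
p(D \mid \theta) = \prod_{k=1}^{n} p(x_k \mid \theta),
\qquad
p(x \mid D) = \int p(x \mid \theta)\, p(\theta \mid D)\, d\theta
\]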
Questions That Remain
- How difficult is it to carry out these computations?
- Does p(x|D) converge to p(x)?
When ML and Bayesian Methods Differ
- ML is computationally easier, since it requires only differential calculus techniques or a gradient search for θ (cf. the complex multidimensional integration needed for Bayesian estimation).
- The Bayesian method is more accurate, since it uses all the available information, including the prior p(θ).
Bayesian Parameter Estimation
- The probabilities P(ωi|x), i = 1, 2, ..., c, are needed for classification.
- The objective is to form the posterior probabilities P(ωi|x, Di) for the given training sets Di.
- An application of Bayes' rule yields the expression below.
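A standard form of this Bayes-rule step for the class posteriors (each class ωi with its own training set Di):

\[
P(\omega_i \mid x, D_i)
\;=\;
\frac{p(x \mid \omega_i, D_i)\, P(\omega_i \mid D_i)}
     {\sum_{j=1}^{c} p(x \mid \omega_j, D_j)\, P(\omega_j \mid D_j)}
\]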
Estimation of P(ωi|x, Di)
- Estimating the posterior probabilities P(ωi|x, Di) requires computing the a priori probability P(ωi|Di) and the density function p(x|ωi, Di).
- Assumptions for simplification:
- 1. The probability P(ωi|Di) is independent of the training set Di, i.e., P(ωi|Di) = P(ωi).
- 2. The a priori probabilities P(ωi), i = 1, 2, ..., c, are known.
- 3. A training set Di carries information only about the parameters of class ωi.
- 4. The functional form of p(x|Di) is known.
Estimation of P(ωi|x, Di)
- Then P(ωi|x, Di) can be computed from p(x|Di), as shown below.
- Therefore, the problem is to estimate a random vector of parameters θi for the density p(x|Di).
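With the assumptions above, the Bayes-rule expression simplifies so that only the known class priors P(ωi) and the per-class density estimates appear (a standard simplification):

\[
P(\omega_i \mid x, D_i)
\;=\;
\frac{p(x \mid \omega_i, D_i)\, P(\omega_i)}
     {\sum_{j=1}^{c} p(x \mid \omega_j, D_j)\, P(\omega_j)}
\]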
Estimation Equation
- The class-conditional density p(x|ωi, Di) is obtained from the known parametric form p(x|θi) and the training set Di, as sketched below.
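A standard form of this estimation equation under the assumptions above (treating each class separately, so the class label can be suppressed inside the integral):

\[
p(x \mid \omega_i, D_i) \;=\; \int p(x \mid \theta_i)\, p(\theta_i \mid D_i)\, d\theta_i
\]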