1
Pattern Classification
All materials in these slides were taken from Pattern Classification
(2nd ed.) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley & Sons,
2000, with the permission of the authors and the publisher.

2
Chapter 3: Maximum-Likelihood and Bayesian
Parameter Estimation (Part 1)
  • Introduction
  • Maximum-Likelihood Estimation
  • Example of a Specific Case
  • The Gaussian case: unknown μ and σ
  • Bias
  • Appendix: ML Problem Statement

3
  • Introduction
  • Data availability in a Bayesian framework
  • We could design an optimal classifier if we knew:
  • P(ωi) (priors)
  • P(x | ωi) (class-conditional densities)
  • Unfortunately, we rarely have this complete
    information!
  • Design a classifier from a training sample:
  • No problem with prior estimation
  • Samples are often too small for class-conditional
    estimation (large dimension of the feature space!)

4
  • A priori information about the problem
  • Do we know something about the distribution?
  • → find parameters to characterize the
    distribution
  • Example: normality of P(x | ωi)
  • P(x | ωi) ~ N(μi, Σi)
  • Characterized by 2 parameters
  • Estimation techniques:
  • Maximum-Likelihood (ML) and Bayesian
    estimation
  • Results are nearly identical, but the approaches
    are different

5
  • Parameters in ML estimation are fixed but
    unknown!
  • The best parameters are obtained by maximizing the
    probability of obtaining the samples observed
  • Bayesian methods view the parameters as random
    variables having some known distribution
  • In either approach, we use P(ωi | x) for our
    classification rule! (see the sketch below)

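The sketch below is not from the slides; it is a minimal NumPy illustration, with made-up priors, means, and variances for a two-class, one-dimensional problem, of how estimated P(ωi) and P(x | ωi) combine through Bayes' rule into the posteriors P(ωi | x) used for classification.

```python
import numpy as np

# Hypothetical two-class, 1-D setup: estimated priors and Gaussian
# class-conditional densities p(x | w_i) ~ N(mu_i, sigma_i^2).
priors = np.array([0.6, 0.4])   # P(w_1), P(w_2)
means  = np.array([0.0, 2.0])   # mu_1, mu_2
stds   = np.array([1.0, 1.5])   # sigma_1, sigma_2

def gauss_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def posteriors(x):
    """P(w_i | x) = p(x | w_i) P(w_i) / p(x), via Bayes' rule."""
    joint = gauss_pdf(x, means, stds) * priors
    return joint / joint.sum()

post = posteriors(1.2)
print("posteriors:", post, "-> decide w%d" % (post.argmax() + 1))
```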
6
  • Maximum-Likelihood Estimation
  • Has good convergence properties as the sample
    size increases
  • Simpler than alternative techniques
  • General principle:
  • Assume we have c classes and
  • P(x | ωj) ~ N(μj, Σj)
  • P(x | ωj) ≡ P(x | ωj, θj), where θj = (μj, Σj)
    consists of the components of μj and Σj

7
  • Use the information provided by the training
    samples to estimate
  • θ = (θ1, θ2, ..., θc), where each θi (i = 1, 2, ..., c) is
    associated with one category
  • Suppose that D contains n samples, x1, x2, ..., xn
  • The ML estimate of θ is, by definition, the value θ̂
    that maximizes P(D | θ)
  • It is the value of θ that best agrees with the
    actually observed training samples

8
9
  • Optimal estimation
  • Let θ = (θ1, θ2, ..., θp)^t and let ∇θ be the
    gradient operator: ∇θ = (∂/∂θ1, ∂/∂θ2, ..., ∂/∂θp)^t
  • We define l(θ) as the log-likelihood function:
  • l(θ) = ln P(D | θ)
  • (recall D is the training data)
  • New problem statement:
  • determine the θ̂ that maximizes the log-likelihood

10
  • The definition of l(θ) is
    l(θ) = ln P(D | θ) = Σk=1..n ln P(xk | θ)
  • and the ML estimate is θ̂ = arg maxθ l(θ)
  • The set of necessary conditions for an optimum is
  • ∇θ l = Σk=1..n ∇θ ln P(xk | θ) = 0  (eq. 7)
    (a numerical sketch follows below)

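As a rough numerical check of this principle (not part of the original slides), the sketch below generates a hypothetical 1-D Gaussian sample with known σ, evaluates l(θ) = Σk ln P(xk | θ) over a grid of candidate θ values, and confirms that the maximizer agrees with the closed-form solution of ∇θ l = 0, the sample mean.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=1.0, size=200)   # hypothetical samples from N(3, 1)
sigma = 1.0                                    # variance assumed known

def log_likelihood(theta):
    """l(theta) = sum_k ln p(x_k | theta) for a 1-D Gaussian with known sigma."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (x - theta) ** 2 / (2 * sigma**2))

# Grid search is purely illustrative; the condition grad_theta l = 0
# yields the maximizer in closed form (the sample mean, see the next slides).
grid = np.linspace(0.0, 6.0, 1001)
theta_hat = grid[np.argmax([log_likelihood(t) for t in grid])]
print(theta_hat, x.mean())   # the two agree up to the grid resolution
```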
11
  • Example: the Gaussian case, unknown μ
  • We assume we know the covariance Σ
  • p(xk | μ) ~ N(μ, Σ)
  • (samples are drawn from a multivariate normal
    population)
  • θ = μ, therefore ∇μ ln p(xk | μ) = Σ⁻¹(xk − μ)
  • The ML estimate for μ must
    satisfy: Σk=1..n Σ⁻¹(xk − μ̂) = 0

12
  • Multiplying by Σ and rearranging, we obtain
    μ̂ = (1/n) Σk=1..n xk
  • Just the arithmetic average of the training
    samples! (a numerical sketch follows below)
  • Conclusion:
  • If P(xk | ωj) (j = 1, 2, ..., c) is assumed to be
    Gaussian in a d-dimensional feature space, then
    we can estimate the vector
  • θ = (θ1, θ2, ..., θc)^t and perform optimal
    classification!

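A minimal NumPy sketch of this result (the true mean, covariance, and data below are made up for illustration): the ML estimate of μ with a known covariance is just the sample average, and it satisfies the necessary condition Σk Σ⁻¹(xk − μ̂) = 0.

```python
import numpy as np

rng = np.random.default_rng(1)
mu_true = np.array([1.0, -2.0])
cov = np.array([[2.0, 0.3],
                [0.3, 1.0]])                          # covariance assumed known
X = rng.multivariate_normal(mu_true, cov, size=500)   # rows are samples x_k

mu_hat = X.mean(axis=0)                               # ML estimate: arithmetic average

# Necessary condition: sum_k Sigma^{-1} (x_k - mu_hat) should be (numerically) zero.
condition = np.linalg.inv(cov) @ (X - mu_hat).sum(axis=0)
print("mu_hat =", mu_hat)
print("gradient condition:", condition)
```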
13
  • Example: Gaussian case, unknown μ and σ
  • First consider the univariate case, unknown μ and
    σ²: θ = (θ1, θ2) = (μ, σ²)
  • The per-sample log-likelihood and its gradient are
    ln p(xk | θ) = −½ ln(2πθ2) − (xk − θ1)²/(2θ2)
    ∇θ ln p(xk | θ) = ( (xk − θ1)/θ2 , −1/(2θ2) + (xk − θ1)²/(2θ2²) )^t

14
  • Summation over the training set gives the conditions
    (1) Σk=1..n (xk − θ̂1)/θ̂2 = 0
    (2) −Σk=1..n 1/θ̂2 + Σk=1..n (xk − θ̂1)²/θ̂2² = 0
  • Combining (1) and (2), one obtains
    μ̂ = (1/n) Σk=1..n xk   and   σ̂² = (1/n) Σk=1..n (xk − μ̂)²
    (a numerical sketch follows below)

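The sketch below (hypothetical data, not from the slides) computes the two univariate ML estimates directly and notes that NumPy's np.var uses the same 1/n normalization by default.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=5.0, scale=2.0, size=1000)   # hypothetical training samples

mu_hat = x.mean()                          # (1/n) sum_k x_k
sigma2_hat = np.mean((x - mu_hat) ** 2)    # (1/n) sum_k (x_k - mu_hat)^2

print(mu_hat, sigma2_hat)                  # close to the true values 5.0 and 4.0
print(np.isclose(sigma2_hat, np.var(x)))   # np.var defaults to the 1/n (ML) form
```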
15
  • The ML estimates for the multivariate case are
    similar
  • The scalars xk and μ̂ are replaced by vectors, and
    the variance σ̂² is replaced by the covariance
    matrix Σ̂:
    μ̂ = (1/n) Σk=1..n xk
    Σ̂ = (1/n) Σk=1..n (xk − μ̂)(xk − μ̂)^t
    (see the sketch below)

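A short sketch of the multivariate estimates on assumed synthetic data: the mean becomes a vector average and the covariance estimate is the 1/n-normalized sum of outer products.

```python
import numpy as np

rng = np.random.default_rng(3)
mu_true = np.array([0.0, 1.0, -1.0])
cov_true = np.diag([1.0, 2.0, 0.5])
X = rng.multivariate_normal(mu_true, cov_true, size=2000)   # n x d sample matrix

mu_hat = X.mean(axis=0)                # vector sample mean
D = X - mu_hat
cov_hat = (D.T @ D) / X.shape[0]       # (1/n) sum_k (x_k - mu_hat)(x_k - mu_hat)^t

print(mu_hat)
print(cov_hat)                         # close to cov_true, with the biased 1/n scaling
```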
16
  • Bias
  • The ML estimate for σ² is biased:
    E[σ̂²] = ((n − 1)/n) σ² ≠ σ²
  • Extreme case: n = 1 gives E[σ̂²] = 0 ≠ σ²
  • As n increases, the bias is reduced
  • → this type of estimator is called asymptotically
    unbiased (a small simulation follows below)

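The small simulation below (hypothetical, with a made-up true variance of 4) approximates E[σ̂²] by averaging the ML variance estimate over many repeated samples of each size n; the result tracks ((n − 1)/n) σ², including the degenerate n = 1 case, and the bias shrinks as n grows.

```python
import numpy as np

rng = np.random.default_rng(4)
sigma2_true = 4.0

for n in (1, 2, 5, 20, 100):
    # Average the ML estimate sigma2_hat = (1/n) sum (x_k - mean)^2 over many
    # repeated samples of size n to approximate its expectation.
    estimates = [np.var(rng.normal(0.0, np.sqrt(sigma2_true), size=n))
                 for _ in range(20000)]
    print(n, np.mean(estimates), (n - 1) / n * sigma2_true)
```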
17
  • An elementary unbiased estimator for Σ is the
    sample covariance matrix
    C = (1/(n − 1)) Σk=1..n (xk − μ̂)(xk − μ̂)^t
  • This estimator is unbiased for all distributions
  • → such estimators are called absolutely unbiased

18
  • Our earlier estimator for Σ,
    Σ̂ = (1/n) Σk=1..n (xk − μ̂)(xk − μ̂)^t, is biased
  • In fact it is asymptotically unbiased
  • Observe that Σ̂ = ((n − 1)/n) C, so the two estimators
    are essentially identical when n is large
    (a comparison sketch follows below)

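The sketch below (synthetic data, small n) computes both estimators and checks that they differ exactly by the factor (n − 1)/n; NumPy's np.cov returns the unbiased version C by default.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.multivariate_normal(np.zeros(2), np.eye(2), size=30)   # small sample, n = 30
n = X.shape[0]

D = X - X.mean(axis=0)
cov_ml  = (D.T @ D) / n          # biased ML estimate (1/n)
cov_unb = (D.T @ D) / (n - 1)    # unbiased sample covariance C (1/(n-1))

print(np.allclose(cov_ml, (n - 1) / n * cov_unb))      # differ by the factor (n-1)/n
print(np.allclose(cov_unb, np.cov(X, rowvar=False)))   # np.cov uses the unbiased form
```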
19
  • Appendix: ML Problem Statement
  • Let D = {x1, x2, ..., xn} with |D| = n
  • P(x1, ..., xn | θ) = Πk=1..n P(xk | θ)
  • Our goal is to determine θ̂, the value of θ that
    maximizes the likelihood of this sample set!
    (a small sketch follows below)

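As a small illustration of the problem statement (hypothetical 1-D Gaussian data), the sketch below evaluates P(D | θ) as the product of per-sample densities, together with its logarithm; in practice the product underflows for large n, which is why the log-likelihood is maximized instead.

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(loc=2.0, scale=1.0, size=50)     # the sample set D, |D| = n = 50

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

theta = 2.0
likelihood = np.prod(gauss_pdf(x, theta, 1.0))             # P(D | theta) = prod_k P(x_k | theta)
log_likelihood = np.sum(np.log(gauss_pdf(x, theta, 1.0)))  # ln P(D | theta)

print(likelihood, log_likelihood)
```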
20
(Figure: the training set D of n samples x1, ..., xn is partitioned into
class subsets D1, ..., Dk, ..., Dc; each subset Dj contains the samples
drawn from the class-conditional density P(x | ωj) ~ N(μj, Σj).)
21
  • θ = (θ1, θ2, ..., θc)
  • Problem: find θ̂ such that
    P(D | θ̂) = maxθ Πk=1..n P(xk | θ)