Title: Statistical Pattern Recognition and Small Sample Size Problems
1. Lecture 16
- Statistical Pattern Recognition and Small Sample Size Problems
- (Covariance Estimation)
Many thanks to Carlos Thomaz who authored the
original version of these slides
2. The Multi-dimensional Gaussian distribution
- The Gaussian density is written
  p(x) = \frac{1}{(2\pi)^{n/2} |S|^{1/2}} \exp\left( -\tfrac{1}{2} (x - m)^T S^{-1} (x - m) \right)
- We can use it to determine the probability of membership of a class given S and m. For a given class i we may also have a prior probability \pi_i, giving
  p(x, \mathrm{class}_i) = \frac{\pi_i}{(2\pi)^{n/2} |S_i|^{1/2}} \exp\left( -\tfrac{1}{2} (x - m_i)^T S_i^{-1} (x - m_i) \right)
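As a quick illustration (not part of the original slides), a minimal NumPy sketch of evaluating this density, assuming the mean vector m and covariance matrix S have already been estimated:

    import numpy as np

    def gaussian_density(x, m, S):
        """Multivariate Gaussian density p(x) for mean m and covariance S."""
        n = m.shape[0]
        diff = x - m
        # Compute (x - m)^T S^{-1} (x - m) without forming the explicit inverse
        mahalanobis = diff @ np.linalg.solve(S, diff)
        norm_const = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(S))
        return np.exp(-0.5 * mahalanobis) / norm_const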
3. The Bayes Plug-in Classifier (parametric)
- Taking logs (so the scores remain numerically manageable), the rule becomes:
- Assign pattern x to class i if the quadratic score d_i(x), shown below, is the largest over all classes.
- → Focus on covariance estimators for Si
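The quadratic score itself appeared as a figure on the original slide; the standard Bayes plug-in (quadratic discriminant) form, with the sample mean m_i and sample covariance S_i plugged in for the true parameters, is

  d_i(x) = \ln \pi_i - \tfrac{1}{2} \ln |S_i| - \tfrac{1}{2} (x - m_i)^T S_i^{-1} (x - m_i)

and x is assigned to the class with the largest d_i(x).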
4. The Parzen Window Classifier (non-parametric)
- It is based on pdfs estimated locally using (Gaussian) kernels and a number of neighbours, as in the standard Parzen classifier shown below.
- → Analogously, focus on estimators for Si
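The estimate on the original slide was shown as a figure; one common form of the Gaussian-kernel Parzen estimate for class i, using the sample group covariance S_i to shape the kernel and a smoothing parameter h (an assumed notation, not taken from the slide), is

  \hat{p}(x \mid \mathrm{class}_i) = \frac{1}{N_i} \sum_{j=1}^{N_i} \frac{1}{(2\pi)^{n/2}\, h^n\, |S_i|^{1/2}} \exp\left( -\frac{(x - x_{ij})^T S_i^{-1} (x - x_{ij})}{2 h^2} \right)

where x_{ij} are the N_i training patterns of class i.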
5. Statistical Pattern Recognition
- Information about class membership is contained in the set of class-conditional probability density functions (pdfs), which can either be specified (parametric) or learned (non-parametric).
- In practice, these pdfs are based on Gaussian kernels that involve the inverse of the sample group covariance matrix.
- However, in high-dimensional problems the covariance matrix may be singular: the Small Sample Size Problem.
6. Small samples
- In many pattern recognition applications nowadays there are often a large number of features (n), but the number of training patterns (Ni) per class may be significantly less than the dimension of the feature space.
- Ni << n !
7. For instance
- In image recognition problems each group is commonly defined by a small number of pictures, but the number of features used for recognition may be thousands of pixels, or even hundreds of pre-processed image attributes.
- High-dimensional problems!
8. This implies
- that the performance of classical statistical pattern recognition techniques, which have been used successfully to design several recognition systems, deteriorates in such small sample size settings.
9. Small Sample Size Problem
- Sample group covariance matrix Si
- Si is singular when Ni < n
- Si is poorly estimated when Ni is not >> n
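A minimal NumPy check (illustrative only, not from the slides) of why Si is singular in this setting: the sample covariance of Ni patterns has rank at most Ni - 1, so it cannot be inverted when Ni < n.

    import numpy as np

    rng = np.random.default_rng(0)
    n, Ni = 50, 10                       # feature dimension much larger than samples per class
    X = rng.normal(size=(Ni, n))         # Ni training patterns of dimension n
    Si = np.cov(X, rowvar=False)         # n x n sample group covariance
    print(np.linalg.matrix_rank(Si))     # at most Ni - 1 = 9
    print(np.linalg.det(Si))             # numerically zero: Si is singular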
10. Covariance Matrix Estimation
- Geometric idea of parametric classifiers
11. Covariance Matrix Estimation
- Geometric idea of non-parametric classifiers
12. Covariance Estimation
- 1. Pooled covariance matrix Sp (LDF), shown below
- → Assumes equal covariance for all groups
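The pooled estimate itself appeared as a figure on the slide; its usual form, for g classes and N = N_1 + ... + N_g training patterns in total, is the weighted average

  S_p = \frac{\sum_{i=1}^{g} (N_i - 1)\, S_i}{N - g}

which is the covariance estimate used by the linear discriminant function (LDF).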
13. Covariance Estimation (continued)
- 2. Friedman's RDA covariance estimator (1989)
- Maximises the classification accuracy
- Computationally intensive method
- Same mixing parameters (λ, γ) for all classes (see the form below)
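The estimator was shown as a figure on the slide; the form in which Friedman's RDA mixture is usually quoted (stated here as background, not taken from the slide) is

  S_i(\lambda) = \frac{(1-\lambda)(N_i - 1)\, S_i + \lambda (N - g)\, S_p}{(1-\lambda)(N_i - 1) + \lambda (N - g)}
  S_i(\lambda, \gamma) = (1-\gamma)\, S_i(\lambda) + \frac{\gamma}{n}\, \mathrm{tr}\!\left[ S_i(\lambda) \right] I

where λ shrinks the group covariance towards the pooled matrix and γ shrinks it towards a multiple of the identity; both parameters are chosen by cross-validated classification accuracy, which is what makes the method computationally intensive.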
14. Covariance Estimation (continued)
- 3. Hoffbeck's LOOC estimator (1996)
- Maximises the average likelihood of each class
- Requires less computation than RDA (see the form below)
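The LOOC mixture was shown as a figure; the form usually attributed to Hoffbeck and Landgrebe (recalled here as background, not taken from the slide) searches along a single parameter 0 ≤ α_i ≤ 3 per class:

  S_i(\alpha_i) =
  \begin{cases}
    (1-\alpha_i)\,\mathrm{diag}(S_i) + \alpha_i\, S_i, & 0 \le \alpha_i \le 1 \\
    (2-\alpha_i)\, S_i + (\alpha_i - 1)\, S_p, & 1 \le \alpha_i \le 2 \\
    (3-\alpha_i)\, S_p + (\alpha_i - 2)\,\mathrm{diag}(S_p), & 2 \le \alpha_i \le 3
  \end{cases}

with α_i chosen to maximise the average leave-one-out likelihood of the training patterns of class i; a one-dimensional search over α_i is cheaper than RDA's two-dimensional grid search.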
15. Covariance Estimation (non-parametric classifiers)
- 1. Van Ness covariance matrices Sness (1980)
- Maximises the classification accuracy
- Same smoothing parameter (a) for all classes
- No covariance information
16. Covariance Estimation (non-parametric classifiers)
- 2. Toeplitz covariance matrices Stoep (1996)
- Based on a stationarity assumption (restrictive)
17. A Maximum Entropy Covariance Estimate
- Work by Carlos Thomaz
- It is based on the idea that:
- "When we make inferences based on incomplete information, we should draw them from that probability distribution that has the maximum entropy permitted by the information we do have."
- E. T. Jaynes, 1982
18. Loss of Covariance Information
- In many pattern recognition problems the sources of variation are often the same from group to group, so a similar covariance shape may be assumed.
- In such situations, and when the Si are singular or poorly estimated, a linear combination of Si and, for instance, Sp may lead to a loss of covariance information.
19. Loss of Covariance Information (cont.)
20. Loss of Covariance Information (cont.)
- However, when Ni < n, the (n - Ni + 1) lower variances z_i are approximately 0!
- Therefore, using the same parameters a and b for the whole feature space fritters away some of the pooled covariance information!
21. Loss of Covariance Information (cont.)
22. A Maximum Entropy Covariance (cont.)
- Let an n-dimensional sample Xi be normally distributed with true covariance matrix Si. Its entropy h can be written as shown below; it is simply a function of the determinant of Si and is invariant under any orthonormal transformation.
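The expression itself appeared as a figure on the slide; for a multivariate Gaussian it is the standard result

  h = \tfrac{1}{2} \ln\!\left[ (2\pi e)^n\, |S_i| \right]

so that, up to constants that do not depend on S_i, maximising the entropy amounts to maximising \ln |S_i|.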
23. A Maximum Entropy Covariance (cont.)
- Thus, in order to maximise the entropy h, we must select the covariance estimate of Si that gives the largest eigenvalues (the determinant being the product of the eigenvalues).
24. A Maximum Entropy Covariance (cont.)
- Consider linear combinations of Si and Sp, of the form shown below.
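The combinations on the slide were shown graphically; in the notation of the surrounding slides they take the form

  S_{\mathrm{mix}}(a, b) = a\, S_i + b\, S_p, \qquad a, b \ge 0

with a and b the mixing parameters discussed on the next slides.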
25. A Maximum Entropy Covariance (cont.)
- Moreover, as the natural log is a monotonically increasing function, we can maximise the log of this determinant instead.
- However, as the intermediate expression on the original slide shows, that maximum is governed by the variances of the combined matrices rather than by the mixing weights themselves.
- Therefore, we do not need to choose the best parameters a and b, but can simply select the maximum variances of the corresponding matrices.
26. A Maximum Entropy Covariance (cont.)
- The Maximum Entropy Covariance Selection (MECS) method is given by the following procedure:
- Find the eigenvectors of Si + Sp
- Calculate the variance contribution of both matrices
- Form a new variance matrix based on the largest values
- Form the MECS estimator (see the sketch below)
27. Visual Analysis
- The top row shows the 5 image training examples of a subject, and the subsequent rows show the image eigenvectors (with the corresponding eigenvalues below) of the following covariance matrices:
- (1) sample group
- (2) pooled
- (3) maximum likelihood mixture
- (4) maximum classification mixture
- (5) maximum entropy mixture
28. Visual Analysis (cont.)
- Same layout as the previous slide: 5 image training examples of a subject, followed by the image eigenvectors and eigenvalues of covariance matrices (1)-(5) listed above.
29. Visual Analysis (cont.)
- The top row shows the 3 image training examples of a subject, and the subsequent rows show the image eigenvectors (with the corresponding eigenvalues below) of the same five covariance matrices (1)-(5).
30. Visual Analysis (cont.)
- Same layout as the previous slide: 3 image training examples of a subject, followed by the image eigenvectors and eigenvalues of covariance matrices (1)-(5).
31. How accurate is the MECS idea?
- The details will be covered in the next lecture, on the covariance-based classifier called Linear Discriminant Analysis (LDA).
- Exemplar: Neonatal Brain Classification and Analysis