Title: Statistical Pattern Recognition: A Review
1 Statistical Pattern Recognition: A Review
2 Contents
- Introduction
- Statistical Pattern Recognition
- The Curse of Dimensionality and Peaking Phenomena
- Dimensionality Reduction
- Feature Extraction
- Feature Selection
- Classifiers
- Classifier Combination
- Error Estimation
- Unsupervised Classification
3 Statistical Pattern Recognition: A Review
4 What is Pattern Recognition?
- The study of how machines can observe the environment,
- learn to distinguish patterns of interest from their background, and
- make sound and reasonable decisions about the categories of the patterns.
- What is a pattern?
- What kinds of categories do we have?
5 What is a Pattern?
- As opposed to chaos, it is an entity, vaguely defined, that could be given a name.
- For example, a pattern could be
- A fingerprint image
- A handwritten cursive word
- A human face
- A speech signal
6 Categories (Classes)
- Supervised Classification
- Discriminant Analysis
- Unsupervised Classification
- Clustering
7 Applications of Pattern Recognition
8 The Design
- The design of a pattern recognition system essentially involves the following three aspects:
- data acquisition
- data representation
- decision making
9 Pattern Recognition Models
- The four best-known approaches:
- template matching
- statistical classification
- syntactic or structural matching
- neural networks
10 Pattern Recognition Models
11 Statistical Pattern Recognition: A Review
- Statistical Pattern Recognition
12 Pattern Representation
- A pattern is represented by a set of d features,
or attributes, viewed as a d-dimensional feature
vector.
13 Two Modes of a Pattern Recognition System
14 Decision Making Rules
- Given a pattern $\mathbf{x} = (x_1, x_2, \ldots, x_d)^T$, assign it to one of $c$ categories in $\Omega$.
- $\Omega = \{\omega_1, \omega_2, \ldots, \omega_c\}$
15 Decision Making Rules
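16 The States of Nature
$P(\omega_i)$: prior probabilities.
$P(\mathbf{x} \mid \omega_i)$: class-conditional probabilities.
[Figure: three states of nature $\omega_1, \omega_2, \omega_3$, each labeled with its prior $P(\omega_i)$ and its class-conditional probability $P(\mathbf{x} \mid \omega_i)$]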
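17 Bayesian Decision Theory
$P(\omega_i)$: prior probabilities.
$P(\mathbf{x} \mid \omega_i)$: class-conditional probabilities.
A posteriori probability: $P(\omega_i \mid \mathbf{x}) = \dfrac{P(\mathbf{x} \mid \omega_i)\, P(\omega_i)}{\sum_{j=1}^{c} P(\mathbf{x} \mid \omega_j)\, P(\omega_j)}$
[Figure: three states of nature $\omega_1, \omega_2, \omega_3$, each labeled with its prior $P(\omega_i)$ and its class-conditional probability $P(\mathbf{x} \mid \omega_i)$]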
18 Decision Making Rules
- Bayes Decision Rule
- Maximum Likelihood Rule
- Minimax
- Neyman-Pearson
- . . .
19 Loss Functions and Conditional Risk
Loss Function
The loss incurred in deciding $\omega_i$ when the true class is $\omega_j$.
Conditional Risk
Posterior Probability
20 Bayes Decision Rule
The optimal decision rule for minimizing the risk.
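21 Maximum Likelihood Decision Rule
With the 0/1 loss function, the Bayes decision rule assigns $\mathbf{x}$ to the class with the maximum posterior probability.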
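The loss function, conditional risk, and Bayes decision rule named on the preceding slides can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's formulation: it assumes the conditional risk is the posterior-weighted sum of losses, that `loss[i][j]` holds the loss of deciding class i when the true class is j, and that the priors and likelihood values are made-up numbers for a single observed pattern.

```python
def posteriors(priors, likelihoods):
    """Bayes' theorem: P(w_i | x) from P(w_i) and P(x | w_i)."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    evidence = sum(joint)
    return [j / evidence for j in joint]


def conditional_risks(loss, post):
    """R(w_i | x) = sum_j loss[i][j] * P(w_j | x) (assumed definition)."""
    return [sum(l_ij * p_j for l_ij, p_j in zip(row, post)) for row in loss]


def bayes_decision(loss, priors, likelihoods):
    """Return the index of the class with the smallest conditional risk."""
    risks = conditional_risks(loss, posteriors(priors, likelihoods))
    return min(range(len(risks)), key=risks.__getitem__)


# Hypothetical two-class numbers, chosen only for illustration.
priors = [0.8, 0.2]
likelihoods = [0.3, 0.4]        # P(x | w_1), P(x | w_2) at one observed x
zero_one = [[0, 1], [1, 0]]     # 0/1 loss: every error costs 1
asymmetric = [[0, 5], [1, 0]]   # deciding w_1 when the truth is w_2 costs 5

map_choice = bayes_decision(zero_one, priors, likelihoods)
risk_choice = bayes_decision(asymmetric, priors, likelihoods)
print(map_choice, risk_choice)
```

With the 0/1 loss the rule picks the class with the largest posterior; with the asymmetric loss the same posteriors lead to the other class, because one kind of error is penalized more heavily.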
22 Minimax
$P(\omega_i)$: prior probabilities.
$P(\mathbf{x} \mid \omega_i)$: class-conditional probabilities.
Minimax deals with the case in which the priors $P(\omega_i)$ are unknown.
Game theory
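23 Neyman-Pearson
$P(\omega_i)$: prior probabilities.
$P(\mathbf{x} \mid \omega_i)$: class-conditional probabilities.
The Neyman-Pearson criterion seeks to minimize the overall risk subject to a constraint, such as $\int R(\omega_i \mid \mathbf{x})\, d\mathbf{x} < \text{constant}$ for a particular $i$.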
24 Various Approaches
25 Performance Evaluation
Optimizing a classifier to maximize its performance on the training set may not always result in the desired performance on a test set.
[Figure labels: Training Set, Test Set]
26 Problems in Learning (Generalization)
- The curse of dimensionality
- The number of features is too large relative to the number of training samples.
- Classifier complexity
- The number of unknown parameters associated with the classifier is large (e.g., polynomial classifiers or a large neural network).
- Overtraining
- The classifier is too intensively optimized on the training set.
27 Statistical Pattern Recognition: A Review
- The Curse of Dimensionality and Peaking Phenomena
28 The Curse
- The performance of a classifier depends on the interrelationship between
- sample sizes
- number of features
- classifier complexity
- If a table-lookup technique is adopted for classification, how many training samples are required with respect to the number of features?
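One way to see the answer to the table-lookup question is to count cells. The sketch below assumes each feature axis is quantized into the same number of bins and that the table needs at least a fixed number of training samples per cell; both the function names and the bin count of 10 are illustrative choices, not values from the slides.

```python
def lookup_table_cells(bins_per_feature: int, num_features: int) -> int:
    """Number of cells when every feature axis is split into the same number of bins."""
    return bins_per_feature ** num_features


def samples_needed(bins_per_feature: int, num_features: int, samples_per_cell: int = 1) -> int:
    """Training samples required to place samples_per_cell patterns in every cell."""
    return samples_per_cell * lookup_table_cells(bins_per_feature, num_features)


# The requirement grows exponentially with the number of features d.
for d in (1, 2, 5, 10):
    print(d, samples_needed(10, d))
```

Under these assumptions the required sample size is exponential in the number of features, which is the usual statement of the curse of dimensionality for table-lookup classifiers.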
29 Peaking Phenomena
- The probability of misclassification of a decision rule does not increase as the number of features increases.
- This is true as long as the class-conditional densities are completely known.
- Peaking Phenomena
- Adding features may actually degrade the performance of a classifier when the number of training samples is small relative to the number of features.
30 Trunk's Example
- Two classes with mean vectors and covariance matrices as follows
G. V. Trunk, "A Problem of Dimensionality: A Simple Example," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 1, no. 3, pp. 306-307, July 1979.
31 Guideline
- The relationship among the number of training samples, the number of features, and the true parameters of the class-conditional densities is very difficult to establish.
- It is generally accepted that using at least ten times as many training samples per class as the number of features ($n/d > 10$) is a good practice to follow in classifier design.
32 Statistical Pattern Recognition: A Review
33 Dimensionality Reduction
- A limited yet salient feature set simplifies both pattern representation and classifier design.
- Pattern representation is easy for 2D and 3D features.
- How can patterns with high-dimensional features be made viewable?
34 Example: Pattern Representation
35 Dimensionality Reduction: How?
- Feature Extraction
- Create new features based on the original feature set
- Transforms are usually involved
- Feature Selection
- Select the best subset from a given feature set.
36 Main Issues in Dimensionality Reduction
- The choice of a criterion function
- Commonly used criterion: classification error
- The determination of the appropriate dimensionality
- Correlated with the intrinsic dimensionality of the data
37 Dimensionality Reduction
38 Feature Extractor
[Figure: feature extractor with inputs $x_i$ and outputs $y_i$]
$m \le d$, usually
39 Some Important Methods
Linear Approaches
- Principal Component Analysis (PCA)
- or Karhunen-Loeve Expansion
- Projection Pursuit
- Independent Component Analysis (ICA)
- Factor Analysis
- Discriminant Analysis
Nonlinear Approaches
- Kernel PCA
- Multidimensional Scaling (MDS)
Neural Networks
- Feed-Forward Neural Networks
- Self-Organizing Map
40 Feed-Forward Neural Networks
[Figure panels: Linear PCA, Nonlinear PCA]
41 Demonstration
[Figure: two-dimensional mappings of the Iris data (Iris Setosa, Iris Versicolor, Iris Virginica). Panels: Fisher Mapping, PCA, Sammon Mapping, Kernel PCA with second-order polynomial kernel]
42 Summary
43 Dimensionality Reduction
44 Feature Selector
$\binom{d}{m}$ possible selections
$m \le d$, usually
45 The Problem
- Given a set of d features, select a subset of size m that leads to the smallest classification error.
- No nonexhaustive sequential feature selection procedure can be guaranteed to produce the optimal subset.
46 Optimal Methods
- Exhaustive Search
- Evaluate all possible subsets
- Branch-and-Bound
- The monotonicity property of the criterion function must hold.
47 Suboptimal Methods
- Best Individual Features
- Sequential Forward Selection (SFS)
- Sequential Backward Selection (SBS)
- Plus l-take away r Selection
- Sequential Forward Floating Search and Sequential
Backward Floating Search
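Sequential Forward Selection can be sketched as a greedy loop that adds, at each step, the feature that most improves a criterion. The code below is an illustrative sketch: the `criterion(subset)` interface (larger is better) and the toy score table are hypothetical, and the exhaustive search is included only to show that the greedy result need not be optimal.

```python
from itertools import combinations


def sequential_forward_selection(num_features, m, criterion):
    """Greedily grow a feature subset until it contains m features."""
    selected = []
    remaining = list(range(num_features))
    while len(selected) < m:
        # Add the single feature that gives the best criterion value right now.
        best = max(remaining, key=lambda f: criterion(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return sorted(selected)


def exhaustive_search(num_features, m, criterion):
    """Evaluate every subset of size m and return the best one."""
    best = max(combinations(range(num_features), m), key=lambda s: criterion(list(s)))
    return list(best)


# Hypothetical criterion values: feature 0 is best alone, but {1, 2} is the best pair.
scores = {
    frozenset([0]): 0.6, frozenset([1]): 0.5, frozenset([2]): 0.5,
    frozenset([0, 1]): 0.7, frozenset([0, 2]): 0.7, frozenset([1, 2]): 0.9,
}


def toy_criterion(subset):
    return scores[frozenset(subset)]


sfs_subset = sequential_forward_selection(3, 2, toy_criterion)
best_subset = exhaustive_search(3, 2, toy_criterion)
print(sfs_subset, best_subset)
```

In this toy case the greedy search commits to feature 0 first and can never reach the best pair, which illustrates why sequential procedures are listed as suboptimal.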
48 Summary
Optimal
Suboptimal
49 Statistical Pattern Recognition: A Review
50 Classification
Once a feature selection or a classification procedure finds a proper representation, a classifier can be designed using a number of possible approaches.
In practice, the choice of a classifier is a difficult problem, and it is often based on which classifier(s) happen to be available, or best known, to the user.
51 Approaches to Classifier Design
- Based on Similarity
- Probabilistic Approach
- Decision-Boundary Approach
- Geometric Approach
52 Classifiers Based on Similarity
- Once a good metric has been established to define similarity, patterns can be classified by template matching or a minimum distance classifier using a few prototypes per class.
- The choice of the metric and the prototypes is crucial to the success of this approach.
53 Classification Methods Based on Similarity
- Template Matching
- Assign pattern to the most similar template
- Nearest Mean Classifier
- Assign pattern to the nearest class mean
- Subspace Method
- Assign pattern to the nearest subspace (invariance)
- 1-Nearest Neighbor Rule
- Assign pattern to the class of the nearest training pattern
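Two of the similarity-based rules above, the nearest mean classifier and the 1-nearest neighbor rule, can be sketched as follows. The Euclidean metric and the toy training patterns are assumptions made for illustration; as the slides stress, the choice of metric is itself a design decision.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def class_means(patterns, labels):
    """Compute one prototype (the mean vector) per class."""
    sums, counts = {}, {}
    for x, y in zip(patterns, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for k, v in enumerate(x):
            acc[k] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def nearest_mean(x, means):
    """Assign x to the class whose mean is closest."""
    return min(means, key=lambda y: euclidean(x, means[y]))


def one_nearest_neighbor(x, patterns, labels):
    """Assign x to the class of the closest training pattern."""
    idx = min(range(len(patterns)), key=lambda i: euclidean(x, patterns[i]))
    return labels[idx]


# Hypothetical two-class training set in two dimensions.
train = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0), (5.0, 6.0)]
train_labels = ["a", "a", "a", "b", "b", "b"]
means = class_means(train, train_labels)
print(nearest_mean((1.0, 1.0), means), one_nearest_neighbor((4.0, 4.0), train, train_labels))
```

The nearest mean classifier stores one prototype per class, while the 1-nearest neighbor rule keeps every training pattern as a prototype.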
54 Probabilistic Approach
- Bayes decision rule
- It takes into account costs associated with different types of misclassification.
- Given prior probabilities, a loss function, and class-conditional densities, it is optimal in minimizing the risk.
- With the 0/1 loss function, it assigns a pattern to the class with the maximum posterior probability (the maximum a posteriori, or MAP, rule; with equal priors it coincides with the maximum likelihood decision rule).
55 Probability Models
- Parametric Models
- Parameter estimation
- Commonly used models
- Multivariate Gaussian distributions for continuous features
- Binomial distributions for binary features
- Multinomial distributions for integer-valued features
- Bayes plug-in rule
- Nonparametric Models
56 Classifiers: Probabilistic Approach
- Parametric Models
- Bayes Plug-in
- Logistic Classifier
- Nonparametric Models
- k-Nearest Neighbor Rule
- Parzen Classifier
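As a sketch of the nonparametric route, a Parzen classifier estimates each class-conditional density from the training samples with a kernel and then applies the maximum posterior rule. The example below is one-dimensional and makes several assumptions not stated on the slides: a Gaussian kernel, a fixed window width `h`, priors estimated from class frequencies, and made-up sample values.

```python
import math


def parzen_density(x, samples, h):
    """Parzen-window estimate of p(x) from 1-D samples with a Gaussian kernel of width h."""
    norm = 1.0 / (len(samples) * h * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in samples)


def parzen_classify(x, samples_by_class, h=0.5):
    """Pick the class that maximizes prior * estimated class-conditional density."""
    total = sum(len(s) for s in samples_by_class.values())

    def score(label):
        samples = samples_by_class[label]
        prior = len(samples) / total  # prior estimated from class frequencies
        return prior * parzen_density(x, samples, h)

    return max(samples_by_class, key=score)


# Hypothetical 1-D training samples for two classes.
data = {"low": [0.9, 1.0, 1.2, 1.1], "high": [3.8, 4.0, 4.1, 4.3]}
print(parzen_classify(1.05, data), parzen_classify(4.0, data))
```

The window width plays the same smoothing role that k plays in the k-nearest neighbor rule: a small `h` follows the training samples closely, a large `h` smooths the estimate.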
57 Geometric Approach
- Construct decision boundaries directly by optimizing a certain error criterion.
- Commonly used criteria
- Classification error
- MSE
- A training procedure is required.
58 Classifiers: Geometric Approach
- Fisher Linear Discriminant
- Binary Decision Tree
- Neural Networks
- Perceptron
- Multi-Layer Perceptron
- Radial Basis Network
- Support Vector Classifier
59 Statistical Pattern Recognition: A Review
60 Why Combine Classifiers?
- Independent classifiers for the same goal.
- Person identification by voice, face and handwriting.
- Sometimes more than a single training set is available, each collected at a different time or in a different environment. These training sets may even use different features.
- Different classifiers trained on the same data may not only differ in their global performance, but they also may show strong local differences. Each classifier may have its own region in the feature space where it performs the best.
- Some classifiers such as neural networks show different results with different initializations due to the randomness inherent in the training procedure. Instead of selecting the best network and discarding the others, one can combine various networks, thereby taking advantage of all the attempts to learn from data.
61 Combining Schemes
- Parallel
- Cascading
- Hierarchical
62 Selection and Training of Individual Classifiers
- Combining classifiers that are largely independent.
- Create training sets using various resampling techniques, such as rotation and bootstrapping.
- Examples
- Stacking
- Bagging (bootstrap aggregation)
- Boosting or ARCing (Adaptive Reweighting and Combining)
63 Selection and Training of Individual Classifiers
- Cluster analysis may be used to separate the individual classes in the training set into subclasses.
- Consequently, simpler classifiers (e.g., linear) may be used and combined later to generate, for instance, a piecewise linear result.
- Instead of building different classifiers on different sets of training patterns, different feature sets may be used, e.g., the random subspace method.
64 Combiners
- Static Combiners
- Voting, Averaging, Borda Count.
- Trainable Combiners
- May lead to a better improvement than static combiners.
- Additional training data needed.
- Adaptive Combiners
- Combiner evaluates (or weighs) the decision of
individual classifiers depending on the input
patterns.
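The three static combiners named above can be sketched for the three kinds of classifier output. The functions below are illustrative: the Borda scoring convention (a class ranked first among n classes scores n - 1 points) is one common choice, and the digit labels and confidence values are invented.

```python
from collections import Counter


def majority_vote(labels):
    """Combine single-label outputs: the most frequent label wins."""
    return Counter(labels).most_common(1)[0][0]


def average_confidences(confidences):
    """Combine confidence outputs: average per class, then pick the largest."""
    classes = confidences[0].keys()
    avg = {c: sum(conf[c] for conf in confidences) / len(confidences) for c in classes}
    return max(avg, key=avg.get)


def borda_count(rankings):
    """Combine rank outputs: position r (0-based) among n classes scores n - 1 - r points."""
    scores = Counter()
    for ranking in rankings:
        n = len(ranking)
        for position, label in enumerate(ranking):
            scores[label] += n - 1 - position
    return max(scores, key=scores.get)


# Hypothetical outputs of three classifiers for one handwritten digit.
votes = ["3", "8", "3"]
confs = [{"3": 0.6, "8": 0.4}, {"3": 0.2, "8": 0.8}, {"3": 0.5, "8": 0.5}]
ranks = [["3", "8", "5"], ["8", "3", "5"], ["8", "5", "3"]]
print(majority_vote(votes), average_confidences(confs), borda_count(ranks))
```

None of these combiners has parameters to fit, which is what distinguishes them from trainable combiners that need additional training data.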
65 Output Types of Individual Classifiers
- Measurement
- Confidence or Probability
- Rank
- Assign a rank to each class
- Abstract
- A set of several class labels
66 An Example
- Handwritten numerals (0-9)
- Extracted from a collection of Dutch utility maps
- 30 × 48 binary images
- 200 patterns per class (2000 in total)
- Features
- 76 Fourier coefficients of the character shapes
- 216 profile correlations
- 64 Karhunen-Loeve coefficients
- 240 pixel averages in 2 × 3 windows
- 47 Zernike moments
- 6 morphological features
67 An Example
68 Statistical Pattern Recognition: A Review
69 Ultimate Measurement of a Classifier
- Classification error, or simply the error rate $P_e$.
- The percentage of misclassified test samples is taken as an estimate of the error rate.
- How should the available samples be split to form training and test sets?
- Especially important in the small-sample case
70 Error Estimation Methods
Cross-Validation Approaches
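A cross-validation estimate of the error rate can be sketched as below. This is a generic k-fold scheme under stated assumptions: folds are contiguous blocks of the sample order (in practice the data would usually be shuffled first), and `train_fn` and `predict_fn` are hypothetical callables standing in for any classifier.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k-fold cross-validation."""
    indices = list(range(n_samples))
    # Spread the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def cross_validated_error(patterns, labels, k, train_fn, predict_fn):
    """Fraction of patterns misclassified when each fold is held out once."""
    errors = 0
    for train_idx, test_idx in k_fold_indices(len(patterns), k):
        model = train_fn([patterns[i] for i in train_idx], [labels[i] for i in train_idx])
        errors += sum(predict_fn(model, patterns[i]) != labels[i] for i in test_idx)
    return errors / len(patterns)


for train_idx, test_idx in k_fold_indices(7, 3):
    print(train_idx, test_idx)
```

Setting `k` equal to the number of samples gives the leave-one-out estimate, which is the usual choice in the small-sample case.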
71 Other Performance Measurements
- Example (Fingerprint Matching System)
- False Acceptance Rate (FAR)
- The percentage of incorrect matches
- False Reject Rate (FRR)
- The percentage of incorrect non-matches
- Reject rate
- The percentage of doubtful patterns that are rejected
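The two error rates can be computed from matcher scores once a decision threshold is fixed. The sketch below assumes a similarity score where a pair is accepted when its score reaches the threshold; the score lists are invented for illustration and the reject option is not modeled.

```python
def far_frr(genuine_scores, impostor_scores, threshold):
    """Return (FAR, FRR) when a pair is accepted if its score >= threshold."""
    false_accepts = sum(s >= threshold for s in impostor_scores)
    false_rejects = sum(s < threshold for s in genuine_scores)
    return false_accepts / len(impostor_scores), false_rejects / len(genuine_scores)


# Hypothetical similarity scores for genuine and impostor fingerprint pairs.
genuine = [0.9, 0.8, 0.75, 0.4]
impostor = [0.1, 0.3, 0.5, 0.2, 0.7]

far, frr = far_frr(genuine, impostor, threshold=0.6)
print(far, frr)
```

Raising the threshold lowers the FAR and raises the FRR, so the two rates have to be traded off against each other for a given application.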
72 Statistical Pattern Recognition: A Review
- Unsupervised Classification
73 Unlabelled Training Data
Unsupervised classification is also known as data
clustering.
74 Difficulties
- Data can reveal clusters with different sizes and shapes.
- Number of clusters depends on the resolution.
- Similarity measurement.
75 Importance
- Data Mining
- Information Retrieval
- Image Segmentation
- Signal Compression and Coding
- Machine Learning
76 Main Techniques
- Iterative Square-Error Partition Clustering
- Main concern in the following discussion
- Agglomerative Hierarchical Clustering
77 Formulation
- Given n patterns in a d-dimensional metric space, determine a partition of the patterns into K clusters such that
- the patterns in a cluster are more similar to each other than to patterns in different clusters.
78 Two Popular Approaches for Partition Clustering
- Square-Error Clustering
- K-Means
- Fuzzy K-Means
- Mixture Decomposition
- EM Algorithm
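K-Means, the first of the square-error methods listed above, alternates between assigning each pattern to its nearest center and recomputing each center as the mean of its members. The sketch below is a minimal version under stated assumptions: squared Euclidean distance, initial centers supplied by the caller, and a made-up data set.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def k_means(patterns, initial_centers, max_iter=100):
    """Return (centers, assignment, square_error) after iterating to convergence."""
    centers = [list(c) for c in initial_centers]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each pattern goes to its nearest center.
        new_assignment = [
            min(range(len(centers)), key=lambda k: squared_distance(x, centers[k]))
            for x in patterns
        ]
        if new_assignment == assignment:
            break  # no pattern changed cluster, so the partition is stable
        assignment = new_assignment
        # Update step: each center becomes the mean of its members.
        for k in range(len(centers)):
            members = [x for x, a in zip(patterns, assignment) if a == k]
            if members:  # keep the old center if a cluster becomes empty
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    square_error = sum(squared_distance(x, centers[a]) for x, a in zip(patterns, assignment))
    return centers, assignment, square_error


# Hypothetical 2-D patterns forming two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, assignment, sse = k_means(points, [points[0], points[3]])
print(assignment, sse)
```

The result depends on the initial centers, so in practice the procedure is usually repeated from several starting points and the partition with the smallest square error is kept.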
79 Square-Error Clustering
Fact: Total Scattering = Between-Cluster Scattering + Within-Cluster Scattering
80 Square-Error Clustering
Goal: Find a partition that maximizes the between-cluster scattering or, equivalently, minimizes the within-cluster scattering.
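The scattering quantities can be written out with scatter matrices. The notation below is an assumption, since the slide equations did not survive: $\mathbf{m}$ is the mean of all $n$ patterns, $C_k$ is the $k$-th cluster with $n_k$ members and mean $\mathbf{m}_k$, and $S_T$, $S_B$, $S_W$ denote the total, between-cluster, and within-cluster scatter matrices.

```latex
\begin{align}
S_T &= \sum_{i=1}^{n} (\mathbf{x}_i - \mathbf{m})(\mathbf{x}_i - \mathbf{m})^T, \\
S_W &= \sum_{k=1}^{K} \sum_{\mathbf{x}_i \in C_k} (\mathbf{x}_i - \mathbf{m}_k)(\mathbf{x}_i - \mathbf{m}_k)^T, \\
S_B &= \sum_{k=1}^{K} n_k (\mathbf{m}_k - \mathbf{m})(\mathbf{m}_k - \mathbf{m})^T, \\
S_T &= S_B + S_W .
\end{align}
```

Because $S_T$ does not depend on the partition, making the between-cluster term larger is the same as making the within-cluster term smaller; under this notation the square error is $\operatorname{tr}(S_W)$.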
81 Mixture Model
- Formal model for unsupervised classification.
- Each pattern was produced by one of a set of alternative (probabilistically modeled) sources.
- Mixtures can also be seen as a class of models that are able to represent arbitrarily complex probability density functions.
- Mixtures are also well suited for representing complex class-conditional densities in supervised learning scenarios.
82 Mixture Generation
[Figure: K random sources with their parameters; source i is labeled with $P(\mathbf{x} \mid C_i)$ and $\alpha_i$]
83 Mixture Generation
If $p_m(\mathbf{x} \mid \theta_m)$ is normal, the model is called the Gaussian Mixture Model (GMM).
84 Goal
Given the form of $p_m(\mathbf{x} \mid \theta_m)$ and the data $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n$, find the parameters that fit the model.
85 Issues
- How to estimate the parameters?
- EM (expectation-maximization) algorithm
- MCMC (Markov Chain Monte Carlo) method
- How to estimate the number of components (sources)?
- More difficult
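For the first issue, the EM algorithm can be sketched for a one-dimensional Gaussian mixture. This is a minimal illustration under stated assumptions: the number of components is fixed by the length of the initial parameter lists, the iteration count is fixed rather than tested for convergence, a small floor keeps variances positive, and the data and starting values are invented.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D normal distribution with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def em_gmm_1d(data, means, variances, weights, n_iter=50):
    """Fit a 1-D Gaussian mixture by EM, starting from the given parameters."""
    means, variances, weights = list(means), list(variances), list(weights)
    K, n = len(means), len(data)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each pattern.
        resp = []
        for x in data:
            joint = [weights[k] * gaussian_pdf(x, means[k], variances[k]) for k in range(K)]
            total = sum(joint)
            resp.append([j / total for j in joint])
        # M-step: re-estimate weights, means, and variances from the responsibilities.
        for k in range(K):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var, 1e-6)  # floor added here to avoid a collapsed component
    return means, variances, weights


# Hypothetical data drawn around two centers, with deliberately rough starting values.
data = [0.8, 1.0, 1.2, 4.8, 5.0, 5.2]
means, variances, weights = em_gmm_1d(data, [0.0, 6.0], [1.0, 1.0], [0.5, 0.5])
print(means, weights)
```

The sketch takes the number of components as given; choosing it from the data is the harder problem noted on the slide.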
86 Example