Title: Statistical Pattern Recognition: A Review
Statistical Pattern Recognition: A Review
Contents
- Introduction
- Statistical Pattern Recognition
- The Curse of Dimensionality: Peaking Phenomena
- Dimensionality Reduction
- Feature Extraction
- Feature Selection
- Classifiers
- Classifier Combination
- Error Estimation
- Unsupervised Classification
What is Pattern Recognition?
- The study of how machines can observe the environment,
- learn to distinguish patterns of interest from their background, and
- make sound and reasonable decisions about the categories of the patterns.
- What is a pattern?
- What kinds of categories do we have?
What is a Pattern?
- As opposed to chaos, a pattern is an entity, vaguely defined, that could be given a name.
- For example, a pattern could be
- a fingerprint image
- a handwritten cursive word
- a human face
- a speech signal
Categories (Classes)
- Supervised Classification
- Discriminant Analysis
- Unsupervised Classification
- Clustering
Applications of Pattern Recognition
The Design
- The design of a pattern recognition system essentially involves the following three aspects:
- data acquisition
- data representation
- decision making
Pattern Recognition Models
- The four best known approaches
- template matching
- statistical classification
- syntactic or structural matching
- neural networks
Statistical Pattern Recognition
Pattern Representation
- A pattern is represented by a set of d features, or attributes, viewed as a d-dimensional feature vector.
Two Modes of a Pattern Recognition System
- Classification (testing) mode: test pattern -> preprocessing -> feature measurement -> classification.
- Training (learning) mode: training pattern -> preprocessing -> feature extraction/selection -> learning.
Decision-Making Rules
- Given a pattern x = (x1, x2, ..., xd)^T, assign it to one of c categories in Ω = {ω1, ω2, ..., ωc}.
The States of Nature
- P(ωi): prior probabilities.
- P(x|ωi): class-conditional probability densities.
Bayesian Decision Theory
- P(ωi): prior probabilities.
- P(x|ωi): class-conditional probability densities.
- P(ωi|x): posterior probability, P(ωi|x) = P(x|ωi) P(ωi) / p(x), where p(x) = Σj P(x|ωj) P(ωj).
Decision-Making Rules
- Bayes Decision Rule
- Maximum Likelihood Rule
- Minimax
- Neyman-Pearson
- . . .
Loss Functions and Conditional Risk
- Loss function L(ωi, ωj): the loss incurred in deciding ωi when the true class is ωj.
- Conditional risk: R(ωi | x) = Σj L(ωi, ωj) P(ωj | x), where P(ωj | x) is the posterior probability.
Bayes Decision Rule
- The optimal decision rule for minimizing the risk: assign x to the class with the minimum conditional risk.
Maximum Likelihood Decision Rule
- With the 0/1 loss function, L(ωi, ωj) = 0 if i = j and 1 otherwise, the conditional risk becomes R(ωi | x) = 1 - P(ωi | x).
- Minimizing the risk then amounts to assigning x to the class with the maximum posterior P(ωi | x); with equal priors this reduces to maximizing the likelihood P(x | ωi). A minimal sketch of this rule follows.
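The rule above fits in a few lines. Below is a minimal NumPy/SciPy sketch, assuming two Gaussian classes with illustrative (made-up) priors, means, and covariances; it computes the unnormalized posteriors and picks the maximum (the MAP rule, which reduces to maximum likelihood for equal priors).

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative two-class setup (assumed values, not from the slides):
# known class-conditional Gaussians and priors.
priors = np.array([0.6, 0.4])                         # P(w_i)
means = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
covs = [np.eye(2), np.eye(2)]

def classify(x):
    # class-conditional densities p(x | w_i)
    likelihoods = np.array([
        multivariate_normal(mean=m, cov=c).pdf(x) for m, c in zip(means, covs)
    ])
    # posteriors P(w_i | x) up to the common normalizer p(x)
    posteriors = likelihoods * priors
    # 0/1 loss: pick the class with the maximum posterior (MAP rule);
    # with equal priors this is the maximum-likelihood rule.
    return int(np.argmax(posteriors))

print(classify(np.array([0.5, 0.2])))   # -> 0
print(classify(np.array([1.8, 2.1])))   # -> 1
```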
Minimax
- P(ωi): prior probabilities; P(x|ωi): class-conditional probability densities.
- Minimax deals with the case in which the priors P(ωi) are unknown: choose the decision rule that minimizes the worst-case (maximum) risk over all possible priors.
- Related to game theory.
Neyman-Pearson
- P(ωi): prior probabilities; P(x|ωi): class-conditional probability densities.
- The Neyman-Pearson criterion minimizes the overall risk subject to a constraint, such as bounding the risk (or error rate) for a particular class ωi.
Various Approaches
Performance Evaluation
- Optimizing a classifier to maximize its performance on the training set may not always result in the desired performance on a test set.
- (Same two-mode diagram as before, with the test set feeding the classification mode and the training set feeding the training mode.)
Problems in Learning (Generalization)
- The curse of dimensionality: the number of features is too large relative to the number of training samples.
- Classifier complexity: the number of unknown parameters associated with the classifier is large (e.g., polynomial classifiers or a large neural network).
- Overtraining: the classifier is too intensively optimized on the training set.
The Curse of Dimensionality: Peaking Phenomena
The Curse
- The performance of a classifier depends on the interrelationship between
- sample size,
- number of features, and
- classifier complexity.
- If a table-lookup technique is adopted for classification, how many training samples are required with respect to the number of features? (The number grows exponentially with the dimensionality.)
Peaking Phenomena
- The probability of misclassification of a decision rule does not increase as the number of features increases, as long as the class-conditional densities are completely known.
- Peaking phenomenon: in practice, with finite training data, adding features may actually degrade the performance of a classifier.
Trunk's Example
- Two classes with mean vectors +m and -m, where m = (1, 1/√2, 1/√3, ..., 1/√d)^T, identity covariance matrices, and equal priors.
G. V. Trunk, "A Problem of Dimensionality: A Simple Example," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 1, no. 3, pp. 306-307, July 1979.
Trunk's Example
- Case 1: the mean vector m is known. The error decreases monotonically toward zero as the dimensionality d grows.
- Case 2: the mean vector m is unknown and is estimated from n labeled training samples. For fixed n, the error of the plug-in rule eventually increases and approaches 1/2 as d grows. See the simulation sketch below.
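As a rough illustration of the two cases, here is a Monte Carlo sketch assuming the setup above from Trunk (1979); the training-set size of 25 and the number of test patterns are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)

def trunk_error(d, n_train=None, n_test=5000):
    """Monte Carlo estimate of the error rate in Trunk's example.

    Two classes with means +m and -m, m_i = 1/sqrt(i), identity covariance,
    equal priors.  If n_train is None the true mean is used (Case 1);
    otherwise the mean is estimated from n_train labeled samples and
    plugged into the linear discriminant (Case 2).
    """
    m = 1.0 / np.sqrt(np.arange(1, d + 1))
    if n_train is None:
        m_hat = m
    else:
        # estimate the mean from labeled samples of the class with mean +m
        samples = rng.normal(loc=m, scale=1.0, size=(n_train, d))
        m_hat = samples.mean(axis=0)
    # test patterns from the +m class; decide that class if m_hat . x > 0
    x = rng.normal(loc=m, scale=1.0, size=(n_test, d))
    return float((x @ m_hat <= 0).mean())

for d in (1, 10, 100, 1000):
    print(d, trunk_error(d), trunk_error(d, n_train=25))
```

With the true mean, the estimated error shrinks as d grows; with the estimated mean and fixed n, it first improves and then drifts back toward 0.5, which is the peaking behavior described above.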
Guideline
- The relationship among the number of training samples, the number of features, and the true parameters of the class-conditional densities is very difficult to establish.
- It is generally accepted that using at least ten times as many training samples per class as the number of features (n/d > 10) is good practice to follow in classifier design.
Dimensionality Reduction
- A limited yet salient feature set simplifies both pattern representation and classifier design.
- Pattern representation is easy for 2-D and 3-D features.
- How can patterns with high-dimensional features be made viewable?
Example: Pattern Representation
Dimensionality Reduction: How?
- Feature extraction
- Create new features based on the original feature set; transforms are usually involved.
- Feature selection
- Select the best subset from a given feature set.
Main Issues in Dimensionality Reduction
- The choice of a criterion function
- Commonly used criterion: classification error.
- The determination of the appropriate dimensionality
- Correlated with the intrinsic dimensionality of the data.
Feature Extractor
- A feature extractor maps each d-dimensional pattern x_i to an m-dimensional vector y_i, with m < d, usually.
Some Important Methods
- Principal Component Analysis (PCA), or Karhunen-Loeve expansion
- Projection Pursuit
- Independent Component Analysis (ICA)
- Factor Analysis
- Discriminant Analysis
- Kernel PCA
- Multidimensional Scaling (MDS)
- Feed-Forward Neural Networks
- Self-Organizing Map
These methods span linear approaches, nonlinear approaches, and neural-network-based approaches. A PCA sketch follows.
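As an illustration of linear feature extraction, here is a minimal NumPy sketch of PCA via the SVD of the centered data matrix; the toy data are assumed for demonstration only.

```python
import numpy as np

def pca(X, m):
    """Project d-dimensional patterns onto the m leading principal components.

    X: (n, d) data matrix, one pattern per row.
    Returns the (n, m) projected data and the (d, m) projection matrix.
    """
    X_centered = X - X.mean(axis=0)
    # eigenvectors of the sample covariance matrix via SVD of the centered data
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    W = Vt[:m].T                      # (d, m) leading eigenvectors
    return X_centered @ W, W

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))        # toy data, assumed for illustration
Y, W = pca(X, m=2)
print(Y.shape)                        # (200, 2)
```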
Feed-Forward Neural Networks
- An auto-associative feed-forward network with a linear bottleneck realizes linear PCA; with nonlinear hidden layers it realizes a nonlinear PCA.
Demonstration
PCA
Fisher Mapping
Sammon Mapping
Kernel PCA
Summary
Feature Selector
- A feature selector chooses m of the d features of each pattern x_i, giving C(d, m) possible selections, with m < d, usually.
The Problem
- Given a set of d features, select a subset of size m that leads to the smallest classification error.
- No nonexhaustive sequential feature selection procedure can be guaranteed to produce the optimal subset.
Optimal Methods
- Exhaustive search: evaluate all possible subsets.
- Branch-and-bound: the criterion function must satisfy the monotonicity property.
Suboptimal Methods
- Best individual features
- Sequential Forward Selection (SFS)
- Sequential Backward Selection (SBS)
- Plus l - take away r selection
- Sequential Forward Floating Search and Sequential Backward Floating Search
A sketch of SFS follows.
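Below is a minimal sketch of Sequential Forward Selection. The wrapper criterion shown (cross-validated 1-NN accuracy via scikit-learn) is only an assumed example; any criterion function of the same form can be plugged in.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def sequential_forward_selection(X, y, m, score_fn):
    """Greedy SFS: start from the empty set and repeatedly add the single
    feature that most improves the criterion.

    score_fn(X_subset, y) -> float, higher is better.
    """
    n_features = X.shape[1]
    selected = []
    while len(selected) < m:
        best_score, best_j = -np.inf, None
        for j in range(n_features):
            if j in selected:
                continue
            score = score_fn(X[:, selected + [j]], y)
            if score > best_score:
                best_score, best_j = score, j
        selected.append(best_j)
    return selected

# Example criterion (assumed): 5-fold cross-validated accuracy of a 1-NN rule.
def knn_cv_score(X_sub, y):
    return cross_val_score(KNeighborsClassifier(n_neighbors=1), X_sub, y, cv=5).mean()
```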
Summary (1/2): Optimal Methods
Summary (2/2): Suboptimal Methods
Classification
Once a feature selection or classification procedure finds a proper representation, a classifier can be designed using a number of possible approaches.
In practice, the choice of a classifier is a difficult problem, and it is often based on which classifier(s) happen to be available, or best known, to the user.
Approaches to Classifier Design
- Based on Similarity
- Probabilistic Approach
- Decision-Boundary Approach
- Geometric Approach
Classifiers Based on Similarity
- Once a good metric is defined to measure similarity, patterns can be classified by template matching or by a minimum-distance classifier using a few prototypes per class.
- The choice of the metric and the prototypes is crucial to the success of this approach.
Classification Methods Based on Similarity
- Template matching: assign the pattern to the most similar template.
- Nearest mean classifier: assign the pattern to the class with the nearest class mean.
- Subspace method: assign the pattern to the nearest class subspace (supports invariance).
- 1-Nearest Neighbor rule: assign the pattern to the class of the nearest training pattern.
A sketch of the nearest mean and 1-NN rules follows.
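A minimal NumPy sketch of the nearest mean classifier and the 1-NN rule, both with the Euclidean metric (assumed here for simplicity):

```python
import numpy as np

class NearestMeanClassifier:
    """Minimal sketch: one prototype (the mean) per class, Euclidean metric."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.means_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        # distance from every pattern to every class mean
        d = np.linalg.norm(X[:, None, :] - self.means_[None, :, :], axis=2)
        return self.classes_[d.argmin(axis=1)]

def one_nn_predict(X_train, y_train, X_test):
    """1-NN rule: return the label of the single nearest training pattern."""
    d = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    return y_train[d.argmin(axis=1)]
```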
Probabilistic Approach
- Bayes decision rule
- It takes into account the costs associated with different types of misclassification.
- Given the prior probabilities, loss function, and class-conditional densities, it is optimal in minimizing the risk.
- With the 0/1 loss function, it assigns a pattern to the class with the maximum posterior probability (the maximum likelihood decision rule).
Probability Models
- Parametric models
- Parameter estimation
- Commonly used models:
- multivariate Gaussian distributions for continuous features
- binomial distributions for binary features
- multinomial distributions for integer-valued features
- Bayes plug-in rule
- Nonparametric models
Classifiers - Probabilistic Approach
- Parametric models
- Bayes plug-in classifier
- Logistic classifier
- Nonparametric models
- k-Nearest Neighbor rule
- Parzen classifier
A sketch of a Gaussian plug-in classifier follows.
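As an example of the parametric (Bayes plug-in) route, here is a sketch that fits one Gaussian per class by maximum likelihood and then applies the Bayes rule with the estimated densities; the Gaussian assumption and the use of SciPy are choices made for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

class GaussianPlugInClassifier:
    """Bayes plug-in sketch: estimate a Gaussian density and a prior per class,
    then assign each pattern to the class with the largest likelihood * prior."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.priors_, self.dists_ = [], []
        for c in self.classes_:
            Xc = X[y == c]
            self.priors_.append(len(Xc) / len(X))
            self.dists_.append(
                multivariate_normal(mean=Xc.mean(axis=0), cov=np.cov(Xc.T)))
        return self

    def predict(self, X):
        # posterior (up to normalization) = class-conditional density * prior
        scores = np.column_stack(
            [p * d.pdf(X) for p, d in zip(self.priors_, self.dists_)])
        return self.classes_[scores.argmax(axis=1)]
```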
Geometric Approach
- Construct decision boundaries directly by optimizing a certain error criterion.
- Commonly used criteria:
- classification error
- mean squared error (MSE)
- A training procedure is required.
Classifiers - Special Approach
- Fisher Linear Discriminant
- Binary Decision Tree
- Neural Networks
- Perceptron
- Multi-Layer Perceptron
- Radial Basis Network
- Support Vector Classifier
SVM Classification Approach
Linear SVM: Linearly Separable Case (1)
- Training set: vectors x_i from the d-dimensional feature space, each with a label y_i ∈ {-1, +1}.
- Target: the two classes are linearly separable.
- Discriminant function: g(x) = w · x + b, with the decision rule sign(g(x)).
Linear SVM: Linearly Separable Case (2)
- The optimal separating hyperplane maximizes the margin 2/||w||: minimize (1/2)||w||^2 subject to y_i (w · x_i + b) ≥ 1 for all i.
Linear SVM: Linearly Separable Case (3)
Linear SVM: Linearly Nonseparable Case (1)
- Linear separability is difficult to satisfy in the classification of real data.
- To handle nonseparable data, the optimal separating hyperplane is generalized as the solution that minimizes a cost function combining
- margin maximization, as in the case of linearly separable data, and
- error minimization, to penalize the wrongly classified samples.
Linear SVM: Linearly Nonseparable Case (2)
- Cost function: minimize (1/2)||w||^2 + C Σ_i ξ_i subject to y_i (w · x_i + b) ≥ 1 - ξ_i, ξ_i ≥ 0.
- The slack variables ξ_i account for the nonseparability of the data.
- The constant C is a regularization parameter that controls the penalty assigned to errors.
- Two kinds of support vectors coexist:
- margin support vectors, which lie on the hyperplane margin, and
- nonmargin support vectors, which fall on the wrong side of this margin.
A minimal soft-margin linear SVM sketch follows.
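A minimal soft-margin linear SVM example using scikit-learn's SVC; the toy data and the value C = 1.0 are assumed purely for illustration.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Toy two-class data with some overlap (assumed, for illustration only).
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(2.0, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# Soft-margin linear SVM; C controls the penalty assigned to margin violations.
clf = SVC(kernel="linear", C=1.0).fit(X, y)

print("support vectors per class:", clf.n_support_)
print("training accuracy:", clf.score(X, y))
```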
Linear SVM: Linearly Nonseparable Case (3)
Nonlinear SVM: Kernel Method (1)
- A nonlinear discriminant function is obtained by mapping the data through a proper nonlinear transformation Φ into a higher-dimensional feature space, where the separation between the two classes is sought.
- In the dual problem, this amounts to replacing the inner products in the original space with inner products in the transformed space.
- Explicit computation of Φ is expensive and at times unfeasible.
Nonlinear SVM: Kernel Method (2)
- The kernel method provides an elegant and effective way of dealing with this problem.
- A kernel function K(x_i, x_j) corresponds to some type of inner product in the transformed (higher-dimensional) feature space.
Nonlinear SVM: Kernel Method (3)
- Discriminant function: g(x) = Σ_i α_i y_i K(x_i, x) + b, summed over the support vectors.
- The shape of the discriminant function depends on the kind of kernel function adopted, e.g.,
- the Gaussian radial basis function K(x_i, x_j) = exp(-γ ||x_i - x_j||^2), or
- the polynomial function of order p, K(x_i, x_j) = (x_i · x_j + 1)^p.
A kernel SVM sketch follows.
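A short kernel-SVM example, again with scikit-learn; the two-moons data set and the kernel parameters are assumed for illustration only.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Nonlinearly separable toy data (assumed, for illustration only).
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for name, clf in [
    ("RBF kernel", SVC(kernel="rbf", gamma=1.0, C=1.0)),
    ("polynomial kernel (order 3)", SVC(kernel="poly", degree=3, C=1.0)),
]:
    clf.fit(X_tr, y_tr)
    print(name, "test accuracy:", round(clf.score(X_te, y_te), 3))
```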
SVMs in Feature Spaces
- The geometrical nature of SVMs results in a methodology that is not aimed at estimating the statistical distributions of the classes over the entire hyperdimensional space.
- SVMs do not involve a density estimation problem.
- The maximum-margin solution makes it possible to fully exploit the discrimination capability of the relatively few training samples available.
SVM Multiclass Strategies
Parallel Approach: OAA (1)
- The OAA (one-against-all) strategy trains one SVM per class, each solving a two-class problem defined by one information class against all the others.
- Winner-takes-all decision: the winning class is the one corresponding to the SVM with the highest output (discriminant function value).
Parallel Approach: OAA (2)
Parallel Approach: OAO (1)
- The main problem of the OAA strategy is that it requires the estimation of complex discriminant functions; an alternative is to break the multiclass problem into simple classification tasks handled by a parallel architecture made up of a large number of SVMs.
- The OAO (one-against-one) strategy involves one SVM per pair of classes, i.e., c(c-1)/2 SVMs for c classes, which model all possible pairwise classifications.
Parallel Approach: OAO (2)
- Binary classification: each SVM is trained only on the patterns of its pair of classes.
- Score function: each class accumulates the outcomes of the pairwise classifiers that involve it.
- Final decision: assign the pattern to the class with the highest score. A pairwise-voting sketch follows.
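A sketch of the OAO scheme built on top of binary SVMs, using simple vote counting as the score function (one common choice; the original slides may use a different scoring rule):

```python
import numpy as np
from itertools import combinations
from sklearn.svm import SVC

def train_oao(X, y):
    """One-against-one sketch: one binary SVM per pair of classes."""
    classes = np.unique(y)
    models = {}
    for a, b in combinations(classes, 2):
        mask = np.isin(y, [a, b])
        models[(a, b)] = SVC(kernel="rbf").fit(X[mask], y[mask])
    return classes, models

def predict_oao(classes, models, X):
    """Each pairwise SVM casts one vote per pattern; the most-voted class wins."""
    votes = np.zeros((len(X), len(classes)), dtype=int)
    index = {c: i for i, c in enumerate(classes)}
    for (a, b), clf in models.items():
        pred = clf.predict(X)
        for c in (a, b):
            votes[pred == c, index[c]] += 1
    return classes[votes.argmax(axis=1)]
```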
Hierarchical Tree-Based Approach
- BHT: Balanced Branches strategy (BHT-BB)
- BHT: One-Against-All strategy (BHT-OAA)
BHT-BB Algorithm
BHT-OAA Algorithm
Why Combine Classifiers?
- Independent classifiers may exist for the same goal, e.g., person identification by voice, face, and handwriting.
- Sometimes more than a single training set is available, each collected at a different time or in a different environment. These training sets may even use different features.
- Different classifiers trained on the same data may not only differ in their global performance, but they may also show strong local differences. Each classifier may have its own region in the feature space where it performs best.
- Some classifiers, such as neural networks, show different results with different initializations due to the randomness inherent in the training procedure. Instead of selecting the best network and discarding the others, one can combine the various networks, thereby taking advantage of all the attempts to learn from the data.
Combining Schemes
- Parallel
- Cascading
- Hierarchical
Selection and Training of Individual Classifiers
- Combine classifiers that are largely independent.
- Create training sets using various resampling techniques, such as rotation and bootstrapping.
- Examples:
- Stacking
- Bagging (bootstrap aggregation)
- Boosting or ARCing (Adaptive Reweighting and Combining)
Selection and Training of Individual Classifiers
- Cluster analysis may be used to separate the individual classes in the training set into subclasses. Consequently, simpler classifiers (e.g., linear) may be used and combined later to generate, for instance, a piecewise-linear result.
- When building different classifiers on different sets of training patterns, different feature sets may also be used, e.g., the random subspace method.
Combiners
- Static combiners
- Voting, averaging, Borda count.
- Trainable combiners
- Can lead to larger improvements than static ones.
- Additional training data are needed.
- Adaptive combiners
- The combiner evaluates (or weighs) the decisions of the individual classifiers depending on the input pattern.
A sketch of two static combiners follows.
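Two static combiners in a few lines of NumPy: majority voting on abstract-level outputs and averaging of measurement-level (posterior) outputs; the toy label matrix is assumed for illustration.

```python
import numpy as np

def majority_vote(labels):
    """Abstract-level static combiner: each classifier outputs a class label
    and the most frequent label wins.  labels: (n_classifiers, n_patterns)."""
    labels = np.asarray(labels)
    n_classes = labels.max() + 1
    counts = np.stack(
        [np.bincount(col, minlength=n_classes) for col in labels.T], axis=1)
    return counts.argmax(axis=0)

def average_posteriors(posteriors):
    """Measurement-level static combiner: average the per-class posteriors of
    the individual classifiers.  posteriors: (n_classifiers, n_patterns, n_classes)."""
    return np.asarray(posteriors).mean(axis=0).argmax(axis=1)

# Three classifiers, four test patterns (toy abstract-level outputs).
labels = [[0, 1, 2, 1],
          [0, 1, 1, 1],
          [2, 1, 2, 0]]
print(majority_vote(labels))   # -> [0 1 2 1]
```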
Output Types of Individual Classifiers
- Measurement: confidence or probability values.
- Rank: a rank is assigned to each class.
- Abstract: a class label or a set of several class labels.
An Example
- Handwritten numerals (0-9) extracted from a collection of Dutch utility maps.
- 30 x 48 binary images; 200 patterns per class (2,000 in total).
- Features:
- 76 Fourier coefficients of the character shapes
- 216 profile correlations
- 64 Karhunen-Loeve coefficients
- 240 pixel averages in 2 x 3 windows
- 47 Zernike moments
- 6 morphological features
Ultimate Measurement of a Classifier
- Classification error, or simply the error rate Pe.
- The percentage of misclassified test samples is taken as an estimate of the error rate.
- How should the available samples be split to form training and test sets? This is especially important in the small-sample case.
Error Estimation Methods
- Cross-validation approaches, such as the holdout, leave-one-out, and rotation (n-fold) methods. A cross-validation sketch follows.
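A minimal sketch of the rotation (n-fold cross-validation) estimate of the error rate; train_and_predict is a placeholder for whatever classifier is being evaluated.

```python
import numpy as np

def cross_validation_error(X, y, train_and_predict, n_folds=5, seed=0):
    """Rotation (n-fold) estimate of the error rate: each fold is held out once
    as a test set while the classifier is trained on the remaining folds.

    train_and_predict(X_tr, y_tr, X_te) -> predicted labels for X_te.
    """
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    folds = np.array_split(indices, n_folds)
    errors = []
    for k in range(n_folds):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        y_pred = train_and_predict(X[train_idx], y[train_idx], X[test_idx])
        errors.append(np.mean(y_pred != y[test_idx]))
    return float(np.mean(errors))
```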
Other Performance Measurements
- Example (fingerprint matching system):
- False Acceptance Rate (FAR): the percentage of impostor patterns that are incorrectly matched.
- False Reject Rate (FRR): the percentage of genuine patterns that are incorrectly not matched.
- Reject rate: the percentage of doubtful patterns that are rejected.
Error Rate Example
Receiver Operating Characteristic (ROC)
- The ROC curve shows the trade-off between FMR (false match rate) and FNMR (false non-match rate) as the decision threshold varies.
- Applications that must not reject genuine users favor a low FNMR.
- High-security applications favor a low FMR.
- Equal error rate (EER): the operating point where FMR = FNMR; a lower EER indicates a more accurate system.
- The ROC plot compares System A and System B on this basis. A short EER sketch follows.
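A small sketch of how the EER can be computed from genuine and impostor score distributions; the Gaussian toy scores are assumed for illustration.

```python
import numpy as np

def compute_eer(genuine_scores, impostor_scores):
    """Sweep the decision threshold and return the (approximate) equal error
    rate, the point where FMR (impostors accepted) equals FNMR (genuine
    users rejected).  Higher scores mean a better match."""
    thresholds = np.sort(np.concatenate([genuine_scores, impostor_scores]))
    fmr = np.array([(impostor_scores >= t).mean() for t in thresholds])
    fnmr = np.array([(genuine_scores < t).mean() for t in thresholds])
    i = np.argmin(np.abs(fmr - fnmr))
    return (fmr[i] + fnmr[i]) / 2.0

# Toy score distributions (assumed for illustration only).
rng = np.random.default_rng(0)
genuine = rng.normal(2.0, 1.0, 1000)
impostor = rng.normal(0.0, 1.0, 1000)
print("EER ~", round(compute_eer(genuine, impostor), 3))
```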
Unsupervised Classification
Unlabelled Training Data
- Unsupervised classification is also known as data clustering.
Difficulties
- Data can reveal clusters with different sizes and shapes.
- The number of clusters depends on the resolution at which the data are viewed.
- The choice of a similarity measure.
Importance
- Data Mining
- Information Retrieval
- Image Segmentation
- Signal Compression and Coding
- Machine Learning
Main Techniques
- Iterative square-error partitional clustering (the main concern in the following discussion)
- Agglomerative hierarchical clustering
Formulation
- Given n patterns in a d-dimensional metric space, determine a partition of the patterns into K clusters such that the patterns in a cluster are more similar to each other than to patterns in different clusters.
Two Popular Approaches for Partition Clustering
- Square-error clustering
- K-Means
- Fuzzy K-Means
- Mixture decomposition
- EM algorithm
Square-Error Clustering
- Fact: the total scatter of the data decomposes into the between-cluster scatter plus the within-cluster scatter, S_T = S_B + S_W.
Square-Error Clustering
- Since the total scatter S_T is fixed for a given data set, the goal is to find a partition that maximizes the between-cluster scatter S_B or, equivalently, minimizes the within-cluster scatter S_W (the square error). A K-means sketch follows.
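A minimal K-means sketch (the basic square-error partitional algorithm); initialization by random sampling of patterns is an arbitrary choice.

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Minimal square-error (K-means) clustering sketch: alternately assign
    each pattern to its nearest cluster center and recompute the centers,
    which monotonically reduces the within-cluster square error."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), K, replace=False)]
    for _ in range(n_iter):
        # assignment step: nearest center for every pattern
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # update step: recompute each center as the mean of its cluster
        new_centers = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(K)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```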
Mixture Model
- A formal model for unsupervised classification.
- Each pattern is assumed to have been produced by one of a set of alternative (probabilistically modeled) sources.
- Mixtures can also be seen as a class of models that are able to represent arbitrarily complex probability density functions.
- Mixtures are also well suited for representing complex class-conditional densities in supervised learning scenarios.
Mixture Generation
- Parameters: K random sources C_1, ..., C_K; the probability that a pattern is generated by source C_i is α_i.
Mixture Generation
- Each source C_m generates patterns according to a component density p_m(x | θ_m), giving the mixture density p(x) = Σ_m α_m p_m(x | θ_m).
- If p_m(x | θ_m) is normal, the model is called a Gaussian Mixture Model (GMM).
Goal
- Given the form of p_m(x | θ_m) and the data x_1, x_2, ..., x_n, find the parameters (α_1, ..., α_K, θ_1, ..., θ_K) that best fit the mixture model.
Issues
- How to estimate the parameters?
- EM (expectation-maximization) algorithm
- MCMC (Markov Chain Monte Carlo) methods
- How to estimate the number of components (sources)? This is the more difficult problem. A GMM fitting sketch follows.
Discussion
- Increasing interaction and collaboration among different disciplines.
- The prevalence of fast processors, the Internet, and large, inexpensive memory and storage.
- Emerging applications, such as data mining and document taxonomy creation and maintenance.
- The need for a principled, rather than ad hoc, approach for successfully solving pattern recognition problems in a predictable way.
Frontiers of Pattern Recognition