Statistical Pattern Recognition: A Review
1
Statistical Pattern Recognition A Review
  • February 2007

2
Contents
  • Introduction
  • Statistical Pattern Recognition
  • The Curse of Dimensionality Peaking Phenomena
  • Dimensionality Reduction
  • Feature Extraction
  • Feature Selection
  • Classifiers
  • Classifier Combination
  • Error Estimation
  • Unsupervised Classification

3
Statistical Pattern Recognition A Review
  • Introduction

4
What is Pattern Recognition?
  • The study of how machines can observe the
    environment,
  • learn to distinguish patterns of interest from
    their background, and
  • make sound and reasonable decisions about the
    categories of the patterns.
  • What is a pattern?
  • What kinds of categories do we have?

5
What is a Pattern?
  • As opposed to chaos, it is an entity, vaguely
    defined, that could be given a name.
  • For example, a pattern could be
  • A fingerprint image
  • A handwritten cursive word
  • A human face
  • A speech signal

6
Categories (Classes)
  • Supervised Classification
  • Discriminant Analysis
  • Unsupervised Classification
  • Clustering

7
Applications of Pattern Recognition
8
The Design
  • The design of a pattern recognition system
    essentially involves the following three
    aspects
  • data acquisition
  • data representation
  • decision making

9
Pattern Recognition Models
  • The four best known approaches
  • template matching
  • statistical classification
  • syntactic or structural matching
  • neural networks

10
Pattern Recognition Models
11
Statistical Pattern Recognition A Review
  • Statistical Pattern Recognition

12
Pattern Representation
  • A pattern is represented by a set of d features,
    or attributes, viewed as a d-dimensional feature
    vector.

13
Two Modes of a Pattern Recognition System
Classification mode: test pattern → preprocessing → feature measurement → classification.
Training mode: training pattern → preprocessing → feature extraction/selection → learning.
14
Decision Making Rules
  • Given a pattern x = (x1, x2, ..., xd)T, assign it
    to one of c categories in Ω = {ω1, ω2, ..., ωc}.

15
Decision Making Rules
16
The States of Nature
P(ωi): prior probabilities.
P(x|ωi): class-conditional probabilities.
(Figure: priors P(ω1), P(ω2), P(ω3) and class-conditional densities
P(x|ω1), P(x|ω2), P(x|ω3).)
17
Bayesian Decision Theory
P(ωi): prior probabilities.
P(x|ωi): class-conditional probabilities.
A posteriori probability: P(ωi|x) = P(x|ωi) P(ωi) / P(x).
(Figure: priors and class-conditional densities for three classes.)
18
Decision Making Rules
  • Bayes Decision Rule
  • Maximum Likelihood Rule
  • Minimax
  • Neyman-Pearson
  • . . .

19
Loss Functions and Conditional Risk
Loss function L(ωi, ωj): the loss incurred in deciding ωi when the
true class is ωj.
Conditional risk: R(ωi|x) = Σj L(ωi, ωj) P(ωj|x),
where P(ωj|x) is the posterior probability.
20
Bayes Decision Rule
The optimal decision rule for minimizing the risk: decide ωi if
R(ωi|x) ≤ R(ωj|x) for all j.
21
Maximum Likelihood Decision Rule
With the 0/1 loss function (zero loss for a correct decision, unit
loss otherwise), the conditional risk becomes R(ωi|x) = 1 − P(ωi|x),
so the Bayes rule reduces to: decide ωi if P(ωi|x) ≥ P(ωj|x) for all j.
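To make the rules above concrete, here is a minimal sketch in Python (NumPy and SciPy assumed available; the priors, Gaussian densities, and loss matrix are made-up illustrative values, not taken from the slides). It computes the posteriors at a point x, picks the maximum posterior as the 0/1-loss (maximum likelihood) decision, and evaluates the conditional risk for a general loss matrix as in the Bayes rule.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical three-class, one-dimensional problem (illustrative values only).
priors = np.array([0.5, 0.3, 0.2])                      # P(w_i)
means, sigmas = np.array([-2.0, 0.0, 3.0]), np.array([1.0, 1.0, 1.5])

def posteriors(x):
    """P(w_i | x) via Bayes' theorem with Gaussian class-conditional densities."""
    likelihoods = norm.pdf(x, loc=means, scale=sigmas)  # p(x | w_i)
    joint = likelihoods * priors
    return joint / joint.sum()

# 0/1 loss: decide the class with the maximum posterior (maximum likelihood rule).
x = 0.7
post = posteriors(x)
print("posteriors:", post, "-> decide class", post.argmax())

# General loss matrix L[i, j] = loss of deciding w_i when the true class is w_j.
L = np.array([[0.0, 2.0, 1.0],
              [1.0, 0.0, 5.0],
              [3.0, 1.0, 0.0]])
risk = L @ post                                         # R(w_i | x) = sum_j L[i,j] P(w_j | x)
print("conditional risks:", risk, "-> decide class", risk.argmin())
```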
22
Minimax
P(ωi): prior probabilities.
P(x|ωi): class-conditional probabilities.
Minimax deals with the case where the priors P(ωi) are unknown: choose
the decision rule that minimizes the maximum possible overall risk
(a game-theoretic formulation).
23
Neyman-Pearson
P(ωi): prior probabilities.
P(x|ωi): class-conditional probabilities.
The Neyman-Pearson criterion minimizes the overall risk subject to a
constraint (e.g., a bound on the error rate for a particular class i).
24
Various Approaches
25
Performance Evaluation
Optimizing a classifier to maximize its
performance on the training set may not always
result in the desired performance on a test set.
Classification mode: test pattern (test set) → preprocessing → feature measurement → classification.
Training mode: training pattern (training set) → preprocessing → feature extraction/selection → learning.
26
Problems on Learning (Generalization)
  • The curse of dimensionality
  • The number of features is too large relative to
    the number of training samples
  • Classifier complexity
  • The number of unknown parameters associated with
    the classifier is large (e.g., polynomial
    classifiers or a large neural network)
  • Overtraining
  • Too intensively optimized on the training set.

27
Statistical Pattern Recognition A Review
  • The Curse of Dimensionality Peaking Phenomena

28
The Curse
  • The performance of a classifier depends on the
    interrelationship between
  • sample sizes
  • number of features
  • classifier complexity
  • If a table-lookup technique is adopted for
    classification, how many training samples are
    required w.r.t. the number of features?

29
Peaking Phenomena
  • The probability of misclassification of a
    decision rule does not increase as the number of
    features increases.
  • This is true as long as the class-conditional
    densities are completely known.
  • Peaking Phenomena
  • Adding features may actually degrade the
    performance of a classifier when the number of
    training samples is small relative to the number
    of features.

30
Trunk's Example
  • Two classes with mean vectors and covariance
    matrices as follows

G. V. Trunk, "A Problem of Dimensionality: A Simple Example," IEEE
Trans. Pattern Analysis and Machine Intelligence, vol. 1, no. 3,
pp. 306-307, July 1979.
31
Trunk's Example
  • The mean vector m is known: the probability of
    error decreases monotonically to zero as the
    number of features d grows.
  • The mean vector m is unknown and n labeled
    training samples are available: the probability of
    error first decreases, then peaks and approaches
    1/2 as d grows.

32
Guideline
  • The exact relationship between the probability of
    error, the number of training samples, the number
    of features, and the true parameters of the
    class-conditional densities is very difficult to
    establish.
  • It is generally accepted that using at least ten
    times as many training samples per class as the
    number of features (n/d > 10) is a good practice
    to follow in classifier design.

33
Statistical Pattern Recognition A Review
  • Dimensionality Reduction

34
Dimensionality Reduction
  • A limited yet salient feature set simplifies both
    pattern representation and classifier design.
  • Pattern representation is easy for 2D and 3D
    features.
  • How to make pattern with high dimensional
    features viewable?

35
Example: Pattern Representation
36
Dimensionality Reduction How?
  • Feature Extraction
  • Create new features based on the original feature
    set
  • Transforms are usually involved
  • Feature Selection
  • Select the best subset from a given feature set

37
Main Issues in Dimensionality Reduction
  • The choice of a criterion function
  • Commonly used criterion classification error
  • The determination of the appropriate
    dimensionality
  • Correlated with the intrinsic dimensionality of
    data

38
Dimensionality Reduction
  • Feature Extraction

39
Feature Extractor
xi (d features) → feature extractor → yi (m features), m < d usually.
40
Some Important Methods
  • Principal Component Analysis (PCA)
  • or Karhunen-Loeve Expansion
  • Projection Pursuit
  • Independent Component Analysis (ICA)
  • Factor Analysis
  • Discriminant Analysis
  • Kernel PCA
  • Multidimensional Scaling (MDS)
  • Feed-Forward Neural Networks
  • Self-Organizing Map

Linear Approaches
Nonlinear Approaches
Neural Networks
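A minimal sketch of the simplest linear approach above, PCA, assuming scikit-learn is available; the data is random and only illustrates projecting d = 10 original features onto m = 2 principal components.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))          # 200 patterns, d = 10 original features

pca = PCA(n_components=2)               # extract m = 2 new features
Y = pca.fit_transform(X)                # each row of Y is a 2-D feature vector

print(Y.shape)                          # (200, 2)
print(pca.explained_variance_ratio_)    # variance captured by each component
```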
41
Feed-Forward Neural Networks
Linear PCA
Nonlinear PCA
42
Demonstration
PCA
Fisher Mapping
Sammon Mapping
Kernel PCA
43
Summary
44
Dimensionality Reduction
  • Feature Selection

45
Feature Selector
xi (d features) → feature selector → m selected features, m < d usually.
There are C(d, m) = d! / (m! (d − m)!) possible selections of m
features out of the d available.
46
The Problem
  • Given a set of d features, select a subset of
    size m that leads to the smallest classification
    error.
  • No nonexhaustive sequential feature selection
    procedure can be guaranteed to produce the
    optimal subset.

47
Optimal Methods
  • Exhaustive Search
  • Evaluate all possible subsets
  • Branch-and-Bound
  • The monotonicity property of the criterion
    function must hold.

48
Suboptimal Methods
  • Best Individual Features
  • Sequential Forward Selection (SFS)
  • Sequential Backward Selection (SBS)
  • Plus-l Take-away-r Selection
  • Sequential Forward Floating Search and Sequential
    Backward Floating Search
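A minimal sketch of Sequential Forward Selection under the usual greedy formulation: start from the empty set and repeatedly add the feature that most improves a criterion function J. The criterion J is deliberately left as a caller-supplied function (e.g., cross-validated accuracy of some classifier); the commented usage is a hypothetical wrapper setup, not part of the slides.

```python
def sequential_forward_selection(J, d, m):
    """Greedy SFS: J(subset) -> score to maximize, d = total features, m = target size."""
    selected = []
    remaining = list(range(d))
    while len(selected) < m:
        # Try adding each remaining feature; keep the one with the best criterion value.
        best_feature = max(remaining, key=lambda f: J(selected + [f]))
        selected.append(best_feature)
        remaining.remove(best_feature)
    return selected

# Hypothetical wrapper criterion: cross-validated accuracy of a 1-NN classifier.
# from sklearn.model_selection import cross_val_score
# from sklearn.neighbors import KNeighborsClassifier
# J = lambda cols: cross_val_score(KNeighborsClassifier(1), X[:, cols], y, cv=5).mean()
# print(sequential_forward_selection(J, d=X.shape[1], m=5))
```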

49
Summary 1/2
Optimal
50
Summary 2/2
Suboptimal
51
Statistical Pattern Recognition A Review
  • Classifiers

52
Classification
Once a feature selection or extraction procedure finds a proper
representation, a classifier can be designed using a number of
possible approaches.
In practice, the choice of a classifier is a difficult problem, and it
is often based on which classifier(s) happen to be available or best
known to the user.
53
Approaches to Classifier Design
  • Based on Similarity
  • Probabilistic Approach
  • Decision-Boundary Approach
  • Geometric Approach

54
Classifiers Based on Similarity
  • Once a good metric is defined to measure
    similarity, patterns can be classified by template
    matching or by a minimum distance classifier using
    a few prototypes per class.
  • The choice of the metric and prototypes is
    crucial to the success of this approach.

55
Classification Methods Based on Similarity
  • Template Matching
  • Assign Pattern to the most similar template
  • Nearest Mean Classifier
  • Assign Pattern to the nearest class mean
  • Subspace Method
  • Assign Pattern to the nearest subspace
    (invariance)
  • 1-Nearest Neighbor Rule
  • Assign Pattern to the class of the nearest
    training pattern
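A minimal NumPy sketch of two of the similarity-based rules above, the nearest mean classifier and the 1-nearest neighbor rule, both using Euclidean distance (the function names are this example's own).

```python
import numpy as np

def nearest_mean_classify(X_train, y_train, x):
    """Assign x to the class whose mean (prototype) is closest in Euclidean distance."""
    classes = np.unique(y_train)
    means = np.array([X_train[y_train == c].mean(axis=0) for c in classes])
    return classes[np.argmin(np.linalg.norm(means - x, axis=1))]

def one_nn_classify(X_train, y_train, x):
    """1-NN rule: assign x to the class of the single nearest training pattern."""
    return y_train[np.argmin(np.linalg.norm(X_train - x, axis=1))]
```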

56
Probabilistic Approach
  • Bayes decision rule
  • It takes into account costs associated with
    different types of misclassification.
  • Given prior probabilities, loss function,
    class-conditional densities, it is optimal in
    minimizing the risk.
  • With the 0/1 loss function, it assigns a pattern
    to the class with the maximum posterior
    probability (maximum likelihood decision rule).

57
Probability Models
  • Parametric Models
  • Parameter estimation
  • Commonly used models
  • Multivariate Gaussian distributions for
    continuous features
  • Binomial distributions for binary features
  • Multinomial distributions for integer-valued
    features
  • Bayes plug-in rule
  • Nonparametric Models

58
Classifiers - Probabilistic Approach
  • Parametric Models
  • Bayes Plug-in
  • Logistic Classifier
  • Nonparametric Models
  • k-Nearest Neighbor Rule
  • Parzen Classifier
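A minimal sketch of the Bayes plug-in rule in its common Gaussian form (an assumption of this example): estimate a prior, mean vector, and covariance matrix per class, plug them into Bayes' rule, and classify by the largest prior-weighted density. NumPy and SciPy are assumed available.

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gaussian_plugin(X, y):
    """Estimate priors, means, and covariances per class (the 'plug-in' step)."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (len(Xc) / len(X),             # prior P(w_c)
                     Xc.mean(axis=0),              # mean vector
                     np.cov(Xc, rowvar=False))     # covariance matrix
    return params

def plugin_classify(params, x):
    """Bayes rule with the estimated densities: maximize prior * p(x | w_c)."""
    scores = {c: prior * multivariate_normal.pdf(x, mean=mu, cov=cov)
              for c, (prior, mu, cov) in params.items()}
    return max(scores, key=scores.get)
```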

59
Geometric Approach
  • Construct decision boundaries directly by
    optimizing a certain error criterion.
  • Commonly used criterion
  • Classification error
  • MSE
  • A training procedure is required.

60
Classifiers - Special Approach
  • Fisher Linear Discriminant
  • Binary Decision Tree
  • Neural Networks
  • Perceptron
  • Multi-Layer Perceptron
  • Radial Basis Network
  • Support Vector Classifier

61
  • SVM classification approach

62
Linear SVM Linearly separable case (1)
  • Training set
  • vectors xi from the d-dimensional feature space,
    with class labels yi ∈ {−1, +1}
  • Target
  • The two classes are linearly separable.
  • Discriminant function
  • f(x) = w · x + b
63
Linear SVM Linearly separable case (2)
64
Linear SVM Linearly separable case (3)
65
Linear SVM Linearly Nonseparable Case (1)
  • Linear separability
  • difficult to satisfy in the classification of
    real data
  • To handle nonseparable data
  • the optimal separating hyperplane has been
    generalized as the solution that minimizes a cost
    function combining two criteria
  • margin maximization
  • as in the case of linearly separable data
  • error minimization
  • to penalize wrongly classified samples

66
Linear SVM Linearly Nonseparable Case (2)
  • Cost function: minimize (1/2)||w||² + C Σi ξi
  • The slack variables ξi account for the
    nonseparability of the data
  • The constant C is a regularization parameter that
    controls the penalty assigned to errors
  • Two kinds of support vectors coexist
  • margin support vectors lie on the hyperplane
    margin
  • nonmargin support vectors fall on the wrong
    side of this margin
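A minimal sketch of the soft-margin formulation above using scikit-learn's SVC with a linear kernel (an assumed implementation choice); the overlapping synthetic data and the particular C values are only illustrative of how C controls the penalty assigned to errors.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two overlapping Gaussian clouds: a linearly nonseparable two-class problem.
X = np.vstack([rng.normal(-1.0, 1.0, size=(100, 2)),
               rng.normal(+1.0, 1.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Small C tolerates more margin violations (more support vectors, wider margin);
    # large C penalizes errors heavily (fewer violations, narrower margin).
    print(f"C={C:>6}: support vectors = {clf.n_support_.sum()}, "
          f"training accuracy = {clf.score(X, y):.2f}")
```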

67
Linear SVM Linearly Nonseparable Case (3)
68
Nonlinear SVM Kernel Method (1)
  • Nonlinear discriminant function
  • obtained by mapping the data through a proper
    nonlinear transformation Φ(·) into a
    higher-dimensional feature space, where the
    separation between the two classes is sought
  • Dual problem
  • obtained by replacing the inner products in the
    original space
  • with inner products in the transformed space
  • Computation of Φ(·)
  • expensive
  • at times unfeasible

69
Nonlinear SVM Kernel Method (2)
  • The kernel method provides an elegant and
    effective way of dealing with this problem.
  • Kernel functions K(xi, xj) = Φ(xi) · Φ(xj)
    correspond to some type of inner product in the
    transformed (higher-dimensional) feature space.

70
Nonlinear SVM Kernel Method (3)
  • Discriminant function f(x) = Σi αi yi K(xi, x) + b
  • The shape of the discriminant function depends on
    the kind of kernel function adopted.
  • Gaussian radial basis function
    K(xi, xj) = exp(−γ ||xi − xj||²)
  • Polynomial function of order p
    K(xi, xj) = (xi · xj + 1)^p
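Continuing the previous sketch, swapping kernels in scikit-learn's SVC corresponds to the kernel functions above; the gamma, degree, and coef0 values are arbitrary illustrations.

```python
from sklearn.svm import SVC

# Gaussian RBF kernel: K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)
rbf_svm = SVC(kernel="rbf", gamma=0.5, C=1.0)

# Polynomial kernel of order p: K(x_i, x_j) = (gamma * x_i . x_j + coef0)^degree
poly_svm = SVC(kernel="poly", degree=3, gamma=1.0, coef0=1.0, C=1.0)

# Both are trained exactly like the linear case, e.g. rbf_svm.fit(X, y);
# only the shape of the resulting discriminant function changes.
```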

71
SVMs in Feature Spaces
  • The geometrical nature of SVMs results in a
    methodology that is not aimed at estimating the
    statistical distributions of classes over the
    entire hyperdimensional space.
  • SVMs do not involve a density estimation problem.
  • The maximum margin solution makes it possible to
    fully exploit the discrimination capability of the
    relatively few training samples available.

72
  • SVMs Multiclass Strategies

73
Parallel Approach OAA (1)
  • OAA(one-against-all) strategy
  • Each binary SVM solves a two-class problem defined
    by one information class against all the others.
  • winner-takes-all
  • The winning class is the one corresponding to the
    SVM with the highest output
    (discriminant function value).

74
Parallel Approach OAA (2)
75
Parallel Approach OAO (1)
  • OAO(one-against-one) strategy
  • The main problem of the OAA strategy is the
    estimation of complex discriminant functions.
  • OAO decomposes the problem into simpler
    classification tasks, at the cost of a parallel
    architecture made up of a larger number of SVMs.
  • The OAO strategy involves one SVM for each pair of
    classes, modeling all possible pairwise
    classifications.

76
Parallel Approach OAO (2)
  • Binary classification
  • Score function
  • Final decision

77
Hierarchical Tree-Based Approach
  • BHT -Balanced Branches Strategy
  • BHT One Against All Strategy

78
BHT-BB Algorithm
79
BHT-BB
80
BHT-OAA Algorithm
81
BHT-OAA
82
Statistical Pattern Recognition A Review
  • Classifier Combination

83
Why Combining Classifiers?
  • Independent classifiers for the same goal.
  • Person identification by voice, face and
    handwriting.
  • Sometimes more than a single training set is
    available, each collected at different time or in
    a different environment. These training sets may
    even use different features.
  • Different classifiers trained on the same data
    may not only differ in their global performance,
    but they also may show strong local differences.
    Each classifier may have its own region in the
    feature space where it performs the best.
  • Some classifiers such as neural networks show
    different results with different initializations
    due to the randomness inherent in the training
    procedure. Instead of selecting the best network
    and discarding the others, one can combine
    various networks, thereby taking advantage of all
    the attempts to learn from data.

84
Combining Schemes
  • Parallel
  • Cascading
  • Hierarchical

85
Selection and Training of Individual Classifier
  • Combining classifiers that are largely
    independent.
  • Create training sets using various resampling
    techniques, such as rotation and bootstrapping.
  • Examples
  • Stacking
  • Bagging (bootstrap aggregation)
  • Boosting or ARCing (Adaptive Reweighting and
    Combining)
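A minimal sketch of two of the schemes named above, bagging and a static majority-vote combiner, assuming scikit-learn; the base classifiers and data set are illustrative choices only.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import BaggingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)

# Bagging: train each tree on a bootstrap sample and aggregate their votes.
bagged_trees = BaggingClassifier(DecisionTreeClassifier(), n_estimators=25, random_state=0)

# Static combiner: simple majority voting over three dissimilar classifiers.
voter = VotingClassifier([("knn", KNeighborsClassifier()),
                          ("nb", GaussianNB()),
                          ("lr", LogisticRegression(max_iter=2000))],
                         voting="hard")

for name, clf in [("bagged trees", bagged_trees), ("majority vote", voter)]:
    print(name, cross_val_score(clf, X, y, cv=5).mean().round(3))
```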

86
Selection and Training of Individual Classifier
  • Cluster analysis may be used to separate the
    individual classes in the training set into
    subclasses.
  • Consequently, simpler classifiers (e.g., linear)
    may be used and combined later to generate, for
    instance, a piecewise linear result.
  • When building different classifiers on different
    sets of training patterns, different feature sets
    may also be used (e.g., the random subspace method).

87
Combiners
  • Static Combiners
  • Voting, Averaging, Borda Count.
  • Trainable Combiners
  • Lead to larger improvements than static ones.
  • Additional training data needed.
  • Adaptive Combiners
  • Combiner evaluates (or weighs) the decision of
    individual classifiers depending on the input
    patterns.

88
Output Types of Individual Classifiers
  • Measurement
  • Confidence or Probability
  • Rank
  • Assign a rank to each class
  • Abstract
  • A set of several class labels

89
An Example
  • Handwritten numerals (0-9)
  • Extracted from a collection of Dutch utility maps
  • 30 × 48 binary images
  • 200 patterns per class (2,000 in total)
  • Features
  • 76 Fourier coefficients of the character shapes
  • 216 profile correlations
  • 64 Karhunen-Loeve coefficients
  • 240 pixel averages in 2 × 3 windows
  • 47 Zernike moments
  • 6 morphological features

90
An Example
91
Statistical Pattern Recognition A Review
  • Error Estimation

92
Ultimate Measurement of a Classifier
  • Classification error, or simply the error rate Pe.
  • The percentage of misclassified test samples is
    taken as an estimate of the error rate.
  • How should the available samples be split to form
    training and test sets?
  • Especially important in small sample case

93
Error Estimation Methods
Cross-Validation Approaches
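A minimal sketch of estimating the error rate with a holdout split and with k-fold cross-validation, assuming scikit-learn; the classifier and data set are arbitrary illustrations.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
clf = KNeighborsClassifier(n_neighbors=3)

# Holdout: split once into a training set and an independent test set.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
holdout_error = 1.0 - clf.fit(X_tr, y_tr).score(X_te, y_te)

# k-fold cross-validation: rotate the held-out fold so every sample is tested once.
cv_error = 1.0 - cross_val_score(clf, X, y, cv=10).mean()

print(f"holdout error rate: {holdout_error:.3f}, 10-fold CV error rate: {cv_error:.3f}")
```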
94
Other Performance Measurements
  • Example (Fingerprint Matching System)
  • False Acceptance Rate (FAR)
  • The percentage of incorrectly matched impostor
    patterns
  • False Reject Rate (FRR)
  • The percentage of incorrectly rejected genuine
    patterns
  • Reject rate
  • The percentage of doubtful patterns that are
    rejected

95
Error Rate Example
96
Receiver operating characteristics (ROC)
  • Trade-off between FMR and FNMR
  • The operating point is set by the decision
    threshold
  • Forensic applications
  • low FNMR (do not miss a true match)
  • High-security applications
  • low FMR (do not accept an impostor)
  • Civilian applications
  • a balance between FMR and FNMR
  • Equal error rate (EER)
  • the operating point where FMR = FNMR
  • a lower EER means a more accurate system
  • here, System B is more accurate than System A
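A minimal sketch of how FMR/FNMR curves and the EER can be computed by sweeping a decision threshold over matcher scores; the genuine and impostor score distributions here are synthetic stand-ins, not data from a real fingerprint matcher.

```python
import numpy as np

rng = np.random.default_rng(0)
genuine = rng.normal(0.7, 0.1, 1000)     # scores for matching pairs (same finger)
impostor = rng.normal(0.4, 0.1, 1000)    # scores for non-matching pairs

thresholds = np.linspace(0.0, 1.0, 501)
# FMR: fraction of impostor pairs accepted; FNMR: fraction of genuine pairs rejected.
fmr = np.array([(impostor >= t).mean() for t in thresholds])
fnmr = np.array([(genuine < t).mean() for t in thresholds])

eer_index = np.argmin(np.abs(fmr - fnmr))     # operating point where FMR ~= FNMR
print(f"EER ~= {fmr[eer_index]:.3f} at threshold {thresholds[eer_index]:.2f}")
```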

97
Statistical Pattern Recognition A Review
  • Unsupervised Classification

98
Unlabelled Training Data
Unsupervised classification is also known as data
clustering.
99
Difficulties
  • Data can reveal clusters with different sizes and
    shapes.
  • Number of clusters depends on the resolution.
  • Similarity measurement.

100
Importance
  • Data Mining
  • Information Retrieval
  • Image Segmentation
  • Signal Compression and Coding
  • Machine Learning

101
Main Techniques
  • Iterative Square-Error Partition Clustering
  • Main concern in the following discussion
  • Agglomerative Hierarchical Clustering

102
Formulation
  • Given n patterns in a d-dimensional metric space,
    determine a partition of the patterns into K
    clusters such that
  • the patterns in a cluster are more similar to
    each other than to patterns in different clusters.

103
Two Popular Approaches for Partition Clustering
  • Square-Error Clustering
  • K-Means
  • Fuzzy K-Means
  • Mixture Decomposition
  • EM Algorithm

104
Square-Error Clustering
Fact: the total scattering of the data decomposes into between-cluster
scattering plus within-cluster scattering, S_T = S_B + S_W.
105
Square-Error Clustering
Fact: S_T = S_B + S_W is fixed for a given data set.
Goal: find a partition that maximizes the between-cluster scattering
S_B or, equivalently, minimizes the within-cluster scattering S_W
(the square error of the partition).
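A minimal sketch of square-error partition clustering with k-means, assuming scikit-learn; the reported inertia_ is the within-cluster scattering (sum of squared distances to the cluster means) that the goal above minimizes.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three synthetic clusters in 2-D.
X = np.vstack([rng.normal(c, 0.5, size=(100, 2)) for c in ((0, 0), (4, 0), (2, 3))])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("within-cluster square error:", round(kmeans.inertia_, 2))
print("cluster sizes:", np.bincount(kmeans.labels_))
```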
106
Mixture Model
  • Formal model for unsupervised classification.
  • Each pattern was produced by one of a set of
    alternative (probabilistically modeled) sources.
  • Mixtures can also be seen as a class of models
    that are able to represent arbitrarily complex
    probability density functions.
  • Mixtures are also well suited to representing
    complex class-conditional densities in supervised
    learning scenarios.

107
Mixture Generation
Parameters
K random sources C1, ..., CK
P(x ∈ Ci) = αi (mixing probabilities)
108
Mixture Generation
Parameters
K random sources C1, ..., CK
P(x ∈ Ci) = αi (mixing probabilities)
If pm(x|θm) is normal, the model is called a Gaussian
Mixture Model (GMM).
109
Goal
Parameters
Given the form of pm(x|θm) and the data x1, x2, ..., xn,
find the parameters (α1, ..., αK, θ1, ..., θK)
that fit the mixture model p(x) = Σm αm pm(x|θm).
110
Issues
Parameters
  • How to estimate the parameters?
  • EM (expectation-maximization) algorithm
  • MCMC (Markov Chain Monte-Carlo) Method
  • How to estimate the number of components
    (sources)?
  • More difficult
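A minimal sketch of fitting a Gaussian mixture with the EM algorithm via scikit-learn's GaussianMixture (an assumed implementation choice; the data and the candidate numbers of components are illustrative). The BIC comparison at the end is one common heuristic for the harder problem of choosing the number of components.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic data drawn from two Gaussian sources.
X = np.vstack([rng.normal(-2.0, 1.0, size=(300, 1)),
               rng.normal(+3.0, 0.5, size=(200, 1))])

# EM estimates the mixing proportions alpha_m, the means, and the covariances.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print("mixing proportions:", gmm.weights_.round(2))
print("means:", gmm.means_.ravel().round(2))

# One heuristic for the number of components: pick K with the lowest BIC.
print({k: round(GaussianMixture(n_components=k, random_state=0).fit(X).bic(X), 1)
       for k in (1, 2, 3, 4)})
```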

111
Example
112
Discussion
  • Increasing interaction and collaboration among
    different disciplines
  • The prevalence of fast processors, the Internet,
    large and inexpensive memory and storage
  • Emerging applications, such as data mining and
    document taxonomy creation and maintenance.
  • The need for a principled, rather than ad hoc,
    approach to solving pattern recognition problems
    in a predictable way

113
Frontiers of Pattern Recognition 1/2
114
Frontiers of Pattern Recognition 2/2
115
Reference