Title: Feature selection for supervised classification
1. Feature selection for supervised classification
- Edgar Acuña
- (joint work with Frida Coaquira)
- Department of Mathematics
- University of Puerto Rico
- Mayagüez Campus
- (www.math.uprm.edu/edgar)
2. OUTLINE
- Some concepts on classification
- The feature selection problem
- Kernel density estimators classifiers
- Wrapper methods
- Filter methods
- Results and concluding remarks
3. The supervised classification problem
4. The misclassification error
- Let C(X, L) be the classifier constructed using the training sample L, and let T be another large sample drawn from the same population as L. The misclassification error (ME) of the classifier C is the proportion of cases of T that are misclassified by C.
- Methods to estimate ME: resubstitution, leave-one-out, cross-validation, bootstrapping.
5. CROSS-VALIDATION
- The training sample is split into 10 parts; for k = 1, ..., 10 a classifier Ck is built on the other nine parts and Ek is the number of cases in the k-th part with y ≠ Ck(x).
- CVE = (E1 + ... + E10)/n
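As an illustration of the cross-validated error estimate above, here is a minimal Python sketch; the one_nn rule and the toy data are placeholders for any classifier, not part of the original talk.

```python
import numpy as np

def cv_error(X, y, train_and_predict, n_folds=10, seed=0):
    """Estimate the misclassification error by n-fold cross-validation.
    train_and_predict(X_train, y_train, X_test) must return predicted labels."""
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, n_folds)
    errors = 0
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        pred = train_and_predict(X[train], y[train], X[test])
        errors += np.sum(pred != y[test])        # E_k
    return errors / n                            # CVE = (E_1 + ... + E_10)/n

# Toy example: a 1-nearest-neighbour rule as the classifier
def one_nn(X_tr, y_tr, X_te):
    d = ((X_te[:, None, :] - X_tr[None, :, :]) ** 2).sum(-1)
    return y_tr[d.argmin(axis=1)]

X = np.random.default_rng(1).normal(size=(100, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
print(cv_error(X, y, one_nn))
```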
6. Bayesian approach to classification
- An object with measurement vector x is assigned to class j if P(Y = j | x) > P(Y = j' | x) for all j' ≠ j.
- By Bayes's theorem, P(Y = j | x) = πj f(x | j) / f(x).
- πj = P(Y = j): prior of the j-th class; f(x | j): class-conditional density; f(x): density function of x.
- Thus, ĵ = argmax_j πj f(x | j).
7. Kernel density estimator classifiers
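The formulas of this slide are not reproduced here; the following is a minimal sketch, assuming a Gaussian product kernel with a single fixed bandwidth h, of a kernel density (Parzen) classifier that applies the Bayes rule of the previous slide. The function name, bandwidth value and toy data are illustrative assumptions.

```python
import numpy as np

def kde_classify(x, X_train, y_train, h=0.5):
    """Minimal Parzen classifier: assign x to argmax_j pi_j * fhat(x | j),
    with a Gaussian product kernel of fixed bandwidth h (the 'standard kernel')."""
    classes = np.unique(y_train)
    n, p = X_train.shape
    scores = []
    for c in classes:
        Xc = X_train[y_train == c]
        pi_c = len(Xc) / n                                  # prior pi_j
        u = (x - Xc) / h
        kern = np.exp(-0.5 * (u ** 2).sum(axis=1)) / ((2 * np.pi) ** (p / 2) * h ** p)
        scores.append(pi_c * kern.mean())                   # pi_j * fhat(x | j)
    return classes[int(np.argmax(scores))]

# Toy usage
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(kde_classify(np.array([2.5, 2.5]), X, y))
```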
8. The feature selection problem
- Goal: choose a small subset of features such that
- a) the accuracy of the classifier on the dataset does not decrease significantly;
- b) the conditional distribution of the class C given the selected feature vector G is as close as possible to the original conditional distribution given all the features F. That is, P(C | F = f) ≈ P(C | G = fG), where fG is the projection of f on G.
9. Steps of feature selection
- 1. A generation procedure to generate the next candidate subset.
- 2. An evaluation function to evaluate the subset under examination.
- 3. A stopping criterion to decide when to stop.
- 4. (Optional) A validation procedure to check whether the subset is valid.
10. The generation procedure
- If the original feature set has size p, the total number of competing candidate subsets is 2^p.
- Complete: the order of the search space is O(2^p), but procedures such as Branch and Bound can be used to reduce the search.
- Heuristic: the generation of subsets is basically incremental (either forward or backward). The order of the search space is O(p^2).
- Random: the subsets are generated using probabilistic arguments. The search space is O(2^p), but it is reduced by setting a maximum number of iterations. It requires some parameter values.
11. Evaluation functions
- They measure the discriminating ability of a feature, or of a subset of features, to distinguish the different class labels.
- 1. Distance measures (e.g., Euclidean distance)
- 2. Information measures (e.g., entropy)
- 3. Dependence measures (correlation)
- 4. Consistency measures
- 5. Classifier error rate measures
12. A comparison of evaluation functions
13. Stopping criterion of the feature selection procedures
- A Threshold
- A prefixed number of iterations
- A prefixed size of the best subset of features
14. Categorization of feature selection methods
15. Advantages of feature selection
- The computational cost of classification is reduced, since fewer features are used.
- The complexity of the classifier is reduced, since redundant and irrelevant features are eliminated.
16. Guidelines for choosing a feature selection method
- Ability to handle different types of features (continuous, binary, nominal, ordinal)
- Ability to handle multiple classes
- Ability to handle large datasets
- Ability to handle noisy data
- Low time complexity
17. Wrapper methods
- Wrappers use the misclassification error rate as the evaluation function for the subsets of features.
- Sequential forward selection (SFS)
- Sequential backward selection (SBS)
- Sequential floating forward selection (SFFS)
- Others: SFBS, take l-remove r, GSFS, GA, SA.
18. Sequential forward selection (SFS)
- Initially the best subset of features T is set to the empty set.
- The first feature entering T is the one, among all the features, that yields the highest recognition rate.
- Then we perform classification using two features: the feature already selected and one feature not yet selected. The second feature entering T is the one that produces the highest recognition rate.
- We continue the process, entering one feature at each step, until the recognition rate no longer increases when the classification is performed using the already selected features together with each of the remaining features. (A Python sketch follows.)
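A minimal sketch of SFS as described above, assuming a cross-validated recognition rate as the evaluation function and a 1-nearest-neighbour rule as a stand-in for the kernel classifiers used in the talk; recognition_rate, one_nn and the toy data are illustrative names, not from the original.

```python
import numpy as np

def recognition_rate(X, y, features, train_and_predict, n_folds=5):
    """Cross-validated recognition rate (1 - ME) using only `features`."""
    n = len(y)
    folds = np.array_split(np.random.default_rng(0).permutation(n), n_folds)
    correct = 0
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        pred = train_and_predict(X[np.ix_(train, features)], y[train],
                                 X[np.ix_(test, features)])
        correct += np.sum(pred == y[test])
    return correct / n

def sfs(X, y, train_and_predict):
    """Sequential forward selection: add one feature at a time while the
    recognition rate keeps improving."""
    remaining = list(range(X.shape[1]))
    selected, best_rate = [], 0.0
    while remaining:
        rates = [recognition_rate(X, y, selected + [f], train_and_predict)
                 for f in remaining]
        f_best = remaining[int(np.argmax(rates))]
        if max(rates) <= best_rate:       # no improvement: stop
            break
        best_rate = max(rates)
        selected.append(f_best)
        remaining.remove(f_best)
    return selected, best_rate

# Toy usage with a 1-NN rule
def one_nn(X_tr, y_tr, X_te):
    d = ((X_te[:, None, :] - X_tr[None, :, :]) ** 2).sum(-1)
    return y_tr[d.argmin(axis=1)]

rng = np.random.default_rng(1)
X = rng.normal(size=(120, 6)); X[:, 0] += 2 * (np.arange(120) % 2)
y = np.arange(120) % 2
print(sfs(X, y, one_nn))
```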
19. Sequential backward selection (SBS)
- Initially the best subset of features T includes all the features of the dataset.
- In the first step we perform the classification leaving out each feature in turn, and we remove the feature whose exclusion gives the highest recognition rate.
- The procedure continues, removing one feature at each step, until the recognition rate starts to decrease.
- It is not efficient for nonparametric classifiers because it has a high running time.
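A compact sketch of SBS under the same assumptions, using the leave-one-out recognition rate of a 1-NN rule as a stand-in evaluation function.

```python
import numpy as np

def loo_rate(X, y, feats):
    """Leave-one-out recognition rate of a 1-NN rule on the given features
    (a stand-in for the kernel classifiers used in the talk)."""
    Z = X[:, feats]
    d = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d, np.inf)
    return np.mean(y[d.argmin(axis=1)] == y)

def sbs(X, y):
    """Sequential backward selection: drop the feature whose removal gives
    the best rate, as long as the rate does not decrease."""
    selected = list(range(X.shape[1]))
    best = loo_rate(X, y, selected)
    while len(selected) > 1:
        rates = [(loo_rate(X, y, [g for g in selected if g != f]), f)
                 for f in selected]
        r, f = max(rates)
        if r < best:                      # rate starts to decrease: stop
            break
        best = r
        selected.remove(f)
    return selected, best

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5)); y = (X[:, 0] > 0).astype(int)
print(sbs(X, y))
```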
20. Sequential floating forward selection (SFFS)
- Pudil et al. (1994). It tries to solve the nesting problem that appears in SFS and SBS.
- Initially the best subset of features T is set to the empty set.
- At each step a new feature is included in T using SFS, and this is followed by the exclusion of features already in T. Features are excluded using SBS until the recognition rate starts to decrease.
- The process continues until the recognition rate no longer increases when the classification is performed using the already selected features together with each of the remaining features.
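A sketch of SFFS under the same stand-in evaluation function; loo_rate (leave-one-out 1-NN) replaces the kernel classifiers used in the experiments and is only an assumption.

```python
import numpy as np

def loo_rate(X, y, feats):
    """Leave-one-out recognition rate of a 1-NN rule on the given features."""
    Z = X[:, feats]
    d = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d, np.inf)
    return np.mean(y[d.argmin(axis=1)] == y)

def sffs(X, y):
    """Sequential floating forward selection: each forward (SFS) step is
    followed by backward (SBS) steps as long as removing a feature improves
    the rate, which mitigates the nesting problem."""
    all_feats = list(range(X.shape[1]))
    T, best = [], 0.0
    while True:
        # forward step: add the best remaining feature
        remaining = [f for f in all_feats if f not in T]
        if not remaining:
            break
        r_add, f_add = max((loo_rate(X, y, T + [f]), f) for f in remaining)
        if r_add <= best:                 # no improvement: stop
            break
        T, best = T + [f_add], r_add
        # backward steps: drop features while that strictly improves the rate
        while len(T) > 2:
            r_del, f_del = max((loo_rate(X, y, [g for g in T if g != f]), f)
                               for f in T)
            if r_del <= best:
                break
            T = [g for g in T if g != f_del]
            best = r_del
    return T, best

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 6)); y = ((X[:, 0] + X[:, 2]) > 0).astype(int)
print(sffs(X, y))
```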
21. Filter methods
- They do not require a classifier; instead they use measures that select the features that best distinguish the classes.
- RELIEF
- Las Vegas Filter (LVF)
- FINCO
- Others: Branch and Bound, Focus.
22. The RELIEF method
- Introduced by Kira and Rendell (1992) and modified by Kononenko (1994) and Kononenko et al. (1997).
- Generates subsets of features heuristically.
- Based on the k-nn classifier.
- Features are selected according to how well they distinguish the classes through distances between instances that are near each other.
23. RELIEF (procedure)
- A given number N of instances is selected randomly from the training set.
- For each selected instance x, one must identify two particular instances:
- Nearhit: the instance closest to x that belongs to its same class.
- Nearmiss: the instance closest to x that belongs to a different class.
- The relevance weight Wj of each feature is initialized to zero.
- The relevance weights Wj are updated using the relation
- Wj = Wj - diff(xj, Nearhitj)^2 + diff(xj, Nearmissj)^2
24. RELIEF (distances)
- If the feature Xk is either nominal or binary, then
- diff(xik, xjk) = 1 if xik ≠ xjk, and 0 otherwise.
- If the feature Xk is either continuous or ordinal, then
- diff(xik, xjk) = (xik - xjk)/ck,
- where ck is a normalization constant (the norm of x - y).
25. The RELIEF algorithm
- Input: D data set, F number of features in D, Nosample number of instances randomly drawn, threshold τ.
- 1. T = ∅ (T is the subset containing the selected features)
- 2. Initialize all weights Wj (j = 1, ..., F) to zero
- 3. For i = 1 to Nosample: randomly choose an instance x in D, find its NearHit and NearMiss, and for j = 1 to F set Wj = Wj - diff(xj, NearHitj)^2 + diff(xj, NearMissj)^2
- 4. For j = 1 to F: if Wj ≥ τ then append fj to T
- 5. Return T. (A Python sketch follows.)
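A sketch of the two-class RELIEF algorithm above for continuous features, where diff is normalized by each feature's range; the normalization choice, default parameters and toy data are assumptions for illustration.

```python
import numpy as np

def relief(X, y, nosample=100, tau=0.0, seed=0):
    """Two-class RELIEF sketch: for randomly drawn instances, update the
    weights with the squared diff to the NearHit and NearMiss, then keep
    the features with weight >= tau.  Continuous features are assumed,
    with diff normalized by each feature's range."""
    n, p = X.shape
    rng = np.random.default_rng(seed)
    c = X.max(axis=0) - X.min(axis=0)          # normalization constants
    c[c == 0] = 1.0
    W = np.zeros(p)
    for _ in range(nosample):
        i = rng.integers(n)
        d = ((X - X[i]) / c) ** 2              # squared diffs per feature
        dist = d.sum(axis=1)
        dist[i] = np.inf                       # exclude x itself
        same, other = (y == y[i]), (y != y[i])
        nearhit = np.argmin(np.where(same, dist, np.inf))
        nearmiss = np.argmin(np.where(other, dist, np.inf))
        W += d[nearmiss] - d[nearhit]          # Wj += diff_miss^2 - diff_hit^2
    return np.flatnonzero(W >= tau), W

# Toy usage: feature 1 carries the class information
rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5)); X[:, 1] += 3 * (rng.random(200) < 0.5)
y = (X[:, 1] > 1.5).astype(int)
print(relief(X, y, nosample=150))
```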
26. The LVF algorithm
- Input: D dataset, p number of features, S set of features of D, MaxTries maximum number of trials, threshold γ.
- Cbest = p, Sbest = S
- For i = 1 to MaxTries:
- Si = subset of S chosen randomly
- C = card(Si)
- If C < Cbest:
- if Inconsistency(Si, D) < γ then Sbest = Si, Cbest = C
- If C = Cbest and Inconsistency(Si, D) ≤ γ then Sbest = Si
- Output: Sbest
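A sketch of LVF, assuming the inconsistency rate described on slide 30 (in its usual form due to Liu and Setiono, 1997); the helper names, defaults and toy data are illustrative.

```python
import numpy as np

def inconsistency(X_disc, y, feats):
    """Inconsistency rate on discretized data restricted to `feats`: for each
    distinct pattern, count the matching instances minus those in its
    majority class, and divide the total by n."""
    patterns = {}
    for row, label in zip(map(tuple, X_disc[:, feats]), y):
        patterns.setdefault(row, {}).setdefault(label, 0)
        patterns[row][label] += 1
    return sum(sum(c.values()) - max(c.values())
               for c in patterns.values()) / len(y)

def lvf(X_disc, y, max_tries=1000, gamma=0.0, seed=0):
    """Las Vegas Filter sketch: draw random feature subsets and keep the
    smallest one whose inconsistency rate does not exceed gamma."""
    p = X_disc.shape[1]
    rng = np.random.default_rng(seed)
    s_best, c_best = list(range(p)), p
    for _ in range(max_tries):
        size = rng.integers(1, p + 1)
        s_i = sorted(rng.choice(p, size=size, replace=False))
        if len(s_i) <= c_best and inconsistency(X_disc, y, s_i) <= gamma:
            s_best, c_best = s_i, len(s_i)
    return s_best

# Toy usage on already discretized (integer-coded) features
rng = np.random.default_rng(5)
X = rng.integers(0, 3, size=(200, 6))
y = (X[:, 0] + X[:, 3] > 2).astype(int)
print(lvf(X, y))
```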
27. RELIEF (cont.)
- Advantages:
- It works for noisy and correlated features.
- Time complexity is linear in the number of features and in Nosample.
- It works for any type of features.
- Limitations:
- Only binary classes; extended to multiple classes by Kononenko (1994).
- It removes irrelevant features but does not remove redundant features.
- Choice of the threshold.
- Choice of Nosample.
28. RELIEF for multiclass problems
- Kononenko (1994, 1997).
- To update the relevance, a Nearmiss is first found for each class different from that of x, and then their contributions are averaged using weights based on the class priors.
29. The Las Vegas Filter (LVF) method
- Liu and Setiono (1997).
- The subsets of features are chosen randomly.
- The evaluation function used is an inconsistency measure.
- The continuous features of the dataset must be discretized beforehand.
30. The inconsistency measure
- Given a dataset which has only non-continuous features, its inconsistency is defined in terms of:
- C: number of classes
- Ni: number of instances of the i-th class that also appear in some other class
- N: total number of instances
31. Discretization
- Supervised versus unsupervised
- Global versus local
- Static versus dynamic
32. Supervised discretization using equal-width intervals
- Freedman and Diaconis's formula for the width: h = 2 · IQR · n^(-1/3),
- where IQR denotes the interquartile range and n the number of instances.
- Then the number of intervals is given by k = R / h,
- where R is the range of the feature to be discretized.
- This method is robust to outliers.
33. The FINCO method
- FINCO (Acuña and Coaquira, 2002) combines sequential forward selection with an inconsistency measure as the evaluation function.
- PROCEDURE
- The best subset of features T is initialized as the empty set.
- In the first step we select the feature that produces the smallest level of inconsistency.
- In the second step we select the feature that, together with the first selected feature, produces the smallest level of inconsistency.
- The process continues until every feature not yet selected, taken together with the features already selected, produces a level of inconsistency less than a prefixed threshold γ (0 ≤ γ ≤ 0.10).
34. The FINCO algorithm
- Input: D dataset, p number of features in D, S set of features of D, threshold γ.
- Initialization: set k = 0 and Tk = ∅
- Inclusion: for k = 1 to p
- select the feature x = argmin over x in S - Tk of Incons(Tk ∪ {x}),
- where S - Tk is the subset of features not yet selected.
- If Incons(Tk ∪ {x}) > Incons(Tk) and Incons(Tk ∪ {x}) < γ, then
- Tk+1 = Tk ∪ {x} and k = k + 1;
- else stop.
- Output: Tk, the subset of selected features. (A Python sketch follows.)
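A sketch of forward selection driven by the inconsistency measure. Because the acceptance rule above is ambiguous after conversion, the rule used here (add the feature that minimizes inconsistency, and stop once the inconsistency of the selected subset reaches γ or no candidate improves it) is an interpretation, not necessarily the exact FINCO rule.

```python
import numpy as np

def inconsistency(X_disc, y, feats):
    """Inconsistency rate (see slide 30): for each distinct pattern of the
    chosen features, count matching instances minus the majority class."""
    patterns = {}
    for row, label in zip(map(tuple, X_disc[:, feats]), y):
        patterns.setdefault(row, {}).setdefault(label, 0)
        patterns[row][label] += 1
    return sum(sum(c.values()) - max(c.values())
               for c in patterns.values()) / len(y)

def finco(X_disc, y, gamma=0.05):
    """Greedy forward selection guided by inconsistency: add the feature
    whose inclusion gives the smallest inconsistency; stop when the
    inconsistency reaches gamma or no candidate improves it."""
    p = X_disc.shape[1]
    T, current = [], 1.0
    while len(T) < p and current > gamma:
        cand = [(inconsistency(X_disc, y, T + [f]), f)
                for f in range(p) if f not in T]
        best_inc, best_f = min(cand)
        if best_inc >= current:           # no improvement: stop
            break
        T.append(best_f)
        current = best_inc
    return T, current

# Toy usage: y is a deterministic function of features 1 and 4
rng = np.random.default_rng(7)
X = rng.integers(0, 3, size=(200, 6))
y = ((X[:, 1] + X[:, 4]) % 2).astype(int)
print(finco(X, y, gamma=0.0))
```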
35. Experimental methodology
- All the feature selection methods were applied to twelve datasets available in the Machine Learning Databases Repository at the Computer Science Department of the University of California, Irvine.
- All the algorithms were programmed in S-Plus.
- The feature selection procedures were compared in three aspects:
- 1. The number of features selected.
- 2. The misclassification error rate using classifiers based on kernel density estimators with fixed bandwidth (standard kernel) and with variable bandwidth (adaptive kernel).
- 3. The computing running time.
36. Methodology for wrapper methods
- The experiment was repeated 10 times for datasets with a small number of features; in the other cases the experiment was repeated 20 times.
- The size of the subset was determined by the average number of features selected over all the repetitions.
- The features selected were those with the highest frequency.
- To break ties for the last feature to be selected, we assigned weights to the features according to their selection order.
37. Methodology for filter methods
- In RELIEF and LVF the experiment was repeated 10 times for datasets with a small number of features; in the other cases the experiment was repeated 20 times.
- In RELIEF, Nosample was taken equal to the number of instances of the dataset, and the threshold was τ = 0.
- In LVF, the number of randomly selected subsets was chosen between 50 and 1000, and the inconsistency level was selected between 0 and 0.10.
- In FINCO the experiment was performed only once, since there is no randomness, and the inconsistency level was selected between 0 and 0.05.
38. Datasets
39. Recognition rate versus the number of variables selected by SFS with the standard kernel
40. Features selected using SFS
41. Misclassification error rate before and after feature selection using SFS
42. Computing running time for feature selection with SFS
43. Misclassification error rate before and after feature selection with SFFS
44. Computing running time for feature selection using SFFS
45. Comparison of the number of features selected with the three wrapper methods using the standard kernel
46. Comparison of the number of features selected with the three wrapper methods using the adaptive kernel
47. Comparison of the misclassification error rates after feature selection using SFS and SFFS (standard kernel)
48. Comparison of the misclassification error rates after feature selection using SFS and SFFS (adaptive kernel)
49. Misclassification error rate before and after feature selection using the RELIEF method
50. Misclassification error rate before and after feature selection using the LVF method
51. Computing running times for the LVF method
52. Features selected with the FINCO method
53. Misclassification error rate before and after feature selection using the FINCO method
54. Computing running times for feature selection using the FINCO method
55. Comparison of the number of features selected with the filter methods
56. Comparison of the misclassification error rates of the standard kernel classifier after feature selection using the filter methods
57. Comparison of the misclassification error rates of the adaptive kernel classifier after feature selection using the filter methods
58. Comparison of the percentages of features selected with the wrapper methods (standard kernel) and the filter methods
59. Comparison of the percentages of features selected with the wrapper methods (adaptive kernel) and the filter methods
60. Comparison of ME ratios after/before feature selection with the wrapper methods (standard kernel) and the filter methods
61. Comparison of ME ratios after/before feature selection with the wrapper methods (adaptive kernel) and the filter methods
62. CONCLUDING REMARKS
- Among the wrapper methods, SFFS performs best.
- Among the filter methods, FINCO has the smallest percentage of features selected, but the highest computing time.
- The performance of LVF and RELIEF is quite similar, but LVF takes more time to compute.
- Wrappers are more effective than filters in reducing the misclassification error rate.
- RELIEF is the fastest feature selection procedure; this is more evident for large datasets.
- FINCO and SFFS have the smallest percentages of features selected.
63. CONCLUDING REMARKS (Cont.)
- SBS selects a higher number of features and takes a long time to compute.
- The number of iterations suggested by the authors of LVF is 77·p^5, which seems to be too high; 1000 iterations is good enough.
- Choosing an inconsistency level very close to zero increases the number of features selected.
64. FUTURE WORK
- Compare these feature selection procedures using other nonparametric classifiers.
- Improve the computation of the inconsistency measure to reduce the running time of LVF and FINCO.
- Use parallel computation for the feature selection algorithms.
- Search for more robust discretization techniques.
- Search for new feature selection methods for unsupervised and supervised classification.