Title: Feature selection for supervised classification
1. Feature selection for supervised classification
- Edgar Acuña
- (joint work with Frida Coaquira)
- Department of Mathematics
- University of Puerto Rico
- Mayagüez Campus
- (www.math.uprm.edu/edgar)
2. OUTLINE
- Some concepts on classification
- The feature selection problem
- Kernel density estimators classifiers
- Wrapper methods
- Filter methods
- Results and concluding remarks
3. The supervised classification problem
4. The misclassification error
- Let C(X, L) be the classifier constructed using the training sample L, and let T be another large sample drawn from the same population as L. The misclassification error (ME) of the classifier C is the proportion of cases of T that are misclassified by C.
- Methods to estimate ME: resubstitution, leave-one-out, cross-validation, bootstrapping.
5. CROSS-VALIDATION
- The training sample is split into 10 parts; for k = 1, ..., 10 a classifier Ck is built on the other nine parts and Ek is the number of cases in the k-th part with y ≠ Ck(x).
- CVE = (E1 + ... + E10)/n
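As an illustration of the cross-validated error estimate above, here is a minimal Python sketch; the one_nn rule and the toy data are placeholders for any classifier, not part of the original talk.

```python
import numpy as np

def cv_error(X, y, train_and_predict, n_folds=10, seed=0):
    """Estimate the misclassification error by n-fold cross-validation.
    train_and_predict(X_train, y_train, X_test) must return predicted labels."""
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, n_folds)
    errors = 0
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        pred = train_and_predict(X[train], y[train], X[test])
        errors += np.sum(pred != y[test])        # E_k
    return errors / n                            # CVE = (E_1 + ... + E_10)/n

# Toy example: a 1-nearest-neighbour rule as the classifier
def one_nn(X_tr, y_tr, X_te):
    d = ((X_te[:, None, :] - X_tr[None, :, :]) ** 2).sum(-1)
    return y_tr[d.argmin(axis=1)]

X = np.random.default_rng(1).normal(size=(100, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
print(cv_error(X, y, one_nn))
```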
6. Bayesian approach to classification
- An object with measurement vector x is assigned to class j if P(Y = j | x) > P(Y = j' | x) for all j' ≠ j.
- By Bayes's theorem, P(Y = j | x) = πj f(x | j) / f(x).
- πj = P(Y = j): prior of the j-th class; f(x | j): class-conditional density; f(x): density function of x.
- Thus, ĵ = argmax_j πj f(x | j).
7. Kernel density estimator classifiers
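The formulas of this slide are not reproduced here; the following is a minimal sketch, assuming a Gaussian product kernel with a single fixed bandwidth h, of a kernel density (Parzen) classifier that applies the Bayes rule of the previous slide. The function name, bandwidth value and toy data are illustrative assumptions.

```python
import numpy as np

def kde_classify(x, X_train, y_train, h=0.5):
    """Minimal Parzen classifier: assign x to argmax_j pi_j * fhat(x | j),
    with a Gaussian product kernel of fixed bandwidth h (the 'standard kernel')."""
    classes = np.unique(y_train)
    n, p = X_train.shape
    scores = []
    for c in classes:
        Xc = X_train[y_train == c]
        pi_c = len(Xc) / n                                  # prior pi_j
        u = (x - Xc) / h
        kern = np.exp(-0.5 * (u ** 2).sum(axis=1)) / ((2 * np.pi) ** (p / 2) * h ** p)
        scores.append(pi_c * kern.mean())                   # pi_j * fhat(x | j)
    return classes[int(np.argmax(scores))]

# Toy usage
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(kde_classify(np.array([2.5, 2.5]), X, y))
```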
8. The feature selection problem
- Goal: choose a small subset of features such that
- a) the accuracy of the classifier on the dataset does not decrease significantly;
- b) the conditional distribution of the class C given the selected feature vector G is as close as possible to the original conditional distribution given all the features F. That is, P(C | F = f) ≈ P(C | G = fG), where fG is the projection of f on G.
9. Steps of feature selection
- 1. A generation procedure to generate the next candidate subset.
- 2. An evaluation function to evaluate the subset under examination.
- 3. A stopping criterion to decide when to stop.
- 4. (Optional) A validation procedure to check whether the subset is valid.
10. The generation procedure
- If the original feature set has size p, the total number of competing candidate subsets is 2^p.
- Complete: the order of the search space is O(2^p), but procedures such as Branch and Bound can be used to reduce the search.
- Heuristic: the generation of subsets is basically incremental (either forward or backward). The order of the search space is O(p^2).
- Random: the subsets are generated using probabilistic arguments. The search space is O(2^p), but it is reduced by setting a maximum number of iterations. It requires some parameter values.
11. Evaluation functions
- They measure the discriminating ability of a feature, or of a subset of features, to distinguish the different class labels.
- 1. Distance measures (e.g., Euclidean distance)
- 2. Information measures (e.g., entropy)
- 3. Dependence measures (correlation)
- 4. Consistency measures
- 5. Classifier error rate measures
12. A comparison of evaluation functions
13. Stopping criterion of the feature selection procedures
- A Threshold
- A prefixed number of iterations
- A prefixed size of the best subset of features
14. Categorization of feature selection methods
15. Advantages of feature selection
- The computational cost of classification is reduced, since fewer features are used.
- The complexity of the classifier is reduced, since redundant and irrelevant features are eliminated.
16. Guidelines for choosing a feature selection method
- Ability to handle different types of features (continuous, binary, nominal, ordinal)
- Ability to handle multiple classes
- Ability to handle large datasets
- Ability to handle noisy data
- Low time complexity
17. Wrapper methods
- Wrappers use the misclassification error rate as the evaluation function for the subsets of features.
- Sequential forward selection (SFS)
- Sequential backward selection (SBS)
- Sequential floating forward selection (SFFS)
- Others: SFBS, take l-remove r, GSFS, GA, SA.
18. Sequential forward selection (SFS)
- Initially the best subset of features T is set to the empty set.
- The first feature entering T is the one, among all the features, that yields the highest recognition rate.
- Then we perform classification using two features: the feature already selected and one feature not yet selected. The second feature entering T is the one that produces the highest recognition rate.
- We continue the process, entering one feature at each step, until the recognition rate no longer increases when the classification is performed using the already selected features together with each of the remaining features. (A Python sketch follows.)
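A minimal sketch of SFS as described above, assuming a cross-validated recognition rate as the evaluation function and a 1-nearest-neighbour rule as a stand-in for the kernel classifiers used in the talk; recognition_rate, one_nn and the toy data are illustrative names, not from the original.

```python
import numpy as np

def recognition_rate(X, y, features, train_and_predict, n_folds=5):
    """Cross-validated recognition rate (1 - ME) using only `features`."""
    n = len(y)
    folds = np.array_split(np.random.default_rng(0).permutation(n), n_folds)
    correct = 0
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        pred = train_and_predict(X[np.ix_(train, features)], y[train],
                                 X[np.ix_(test, features)])
        correct += np.sum(pred == y[test])
    return correct / n

def sfs(X, y, train_and_predict):
    """Sequential forward selection: add one feature at a time while the
    recognition rate keeps improving."""
    remaining = list(range(X.shape[1]))
    selected, best_rate = [], 0.0
    while remaining:
        rates = [recognition_rate(X, y, selected + [f], train_and_predict)
                 for f in remaining]
        f_best = remaining[int(np.argmax(rates))]
        if max(rates) <= best_rate:       # no improvement: stop
            break
        best_rate = max(rates)
        selected.append(f_best)
        remaining.remove(f_best)
    return selected, best_rate

# Toy usage with a 1-NN rule
def one_nn(X_tr, y_tr, X_te):
    d = ((X_te[:, None, :] - X_tr[None, :, :]) ** 2).sum(-1)
    return y_tr[d.argmin(axis=1)]

rng = np.random.default_rng(1)
X = rng.normal(size=(120, 6)); X[:, 0] += 2 * (np.arange(120) % 2)
y = np.arange(120) % 2
print(sfs(X, y, one_nn))
```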
19. Sequential backward selection (SBS)
- Initially the best subset of features T includes all the features of the dataset.
- In the first step we perform the classification leaving out each feature in turn, and we remove the feature whose exclusion gives the highest recognition rate.
- The procedure continues, removing one feature at each step, until the recognition rate starts to decrease.
- It is not efficient for nonparametric classifiers because it has a high running time.
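A compact sketch of SBS under the same assumptions, using the leave-one-out recognition rate of a 1-NN rule as a stand-in evaluation function.

```python
import numpy as np

def loo_rate(X, y, feats):
    """Leave-one-out recognition rate of a 1-NN rule on the given features
    (a stand-in for the kernel classifiers used in the talk)."""
    Z = X[:, feats]
    d = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d, np.inf)
    return np.mean(y[d.argmin(axis=1)] == y)

def sbs(X, y):
    """Sequential backward selection: drop the feature whose removal gives
    the best rate, as long as the rate does not decrease."""
    selected = list(range(X.shape[1]))
    best = loo_rate(X, y, selected)
    while len(selected) > 1:
        rates = [(loo_rate(X, y, [g for g in selected if g != f]), f)
                 for f in selected]
        r, f = max(rates)
        if r < best:                      # rate starts to decrease: stop
            break
        best = r
        selected.remove(f)
    return selected, best

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5)); y = (X[:, 0] > 0).astype(int)
print(sbs(X, y))
```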
20. Sequential floating forward selection (SFFS)
- Pudil et al. (1994). It tries to solve the nesting problem that appears in SFS and SBS.
- Initially the best subset of features T is set to the empty set.
- At each step a new feature is included in T using SFS, and this is followed by the exclusion of features already in T. Features are excluded using SBS until the recognition rate starts to decrease.
- The process continues until the recognition rate no longer increases when the classification is performed using the already selected features together with each of the remaining features.
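A sketch of SFFS under the same stand-in evaluation function; loo_rate (leave-one-out 1-NN) replaces the kernel classifiers used in the experiments and is only an assumption.

```python
import numpy as np

def loo_rate(X, y, feats):
    """Leave-one-out recognition rate of a 1-NN rule on the given features."""
    Z = X[:, feats]
    d = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d, np.inf)
    return np.mean(y[d.argmin(axis=1)] == y)

def sffs(X, y):
    """Sequential floating forward selection: each forward (SFS) step is
    followed by backward (SBS) steps as long as removing a feature improves
    the rate, which mitigates the nesting problem."""
    all_feats = list(range(X.shape[1]))
    T, best = [], 0.0
    while True:
        # forward step: add the best remaining feature
        remaining = [f for f in all_feats if f not in T]
        if not remaining:
            break
        r_add, f_add = max((loo_rate(X, y, T + [f]), f) for f in remaining)
        if r_add <= best:                 # no improvement: stop
            break
        T, best = T + [f_add], r_add
        # backward steps: drop features while that strictly improves the rate
        while len(T) > 2:
            r_del, f_del = max((loo_rate(X, y, [g for g in T if g != f]), f)
                               for f in T)
            if r_del <= best:
                break
            T = [g for g in T if g != f_del]
            best = r_del
    return T, best

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 6)); y = ((X[:, 0] + X[:, 2]) > 0).astype(int)
print(sffs(X, y))
```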
21. Filter methods
- They do not require a classifier; instead they use measures that select the features that best distinguish the classes.
- RELIEF
- Las Vegas Filter (LVF)
- FINCO
- Others: Branch and Bound, Focus.
22. The RELIEF method
- Introduced by Kira and Rendell (1992) and modified by Kononenko (1994) and Kononenko et al. (1997).
- Generates subsets of features heuristically.
- Based on the k-nn classifier.
- Features are selected according to how well they distinguish the classes through distances between instances that are near each other.
23. RELIEF (procedure)
- A given number N of instances is selected randomly from the training set.
- For each selected instance x, one must identify two particular instances:
- Nearhit: the instance closest to x that belongs to its same class.
- Nearmiss: the instance closest to x that belongs to a different class.
- The relevance weight Wj of each feature is initialized to zero.
- The relevance weights Wj are updated using the relation
- Wj = Wj - diff(xj, Nearhitj)^2 + diff(xj, Nearmissj)^2
24. RELIEF (distances)
- If the feature Xk is either nominal or binary, then
- diff(xik, xjk) = 1 if xik ≠ xjk, and 0 otherwise.
- If the feature Xk is either continuous or ordinal, then
- diff(xik, xjk) = (xik - xjk)/ck,
- where ck is a normalization constant (the norm of x - y).
25. The RELIEF algorithm
- Input: D data set, F number of features in D, Nosample number of instances randomly drawn, threshold τ.
- 1. T = ∅ (T is the subset containing the selected features)
- 2. Initialize all weights Wj (j = 1, ..., F) to zero
- 3. For i = 1 to Nosample: randomly choose an instance x in D, find its NearHit and NearMiss, and for j = 1 to F set Wj = Wj - diff(xj, NearHitj)^2 + diff(xj, NearMissj)^2
- 4. For j = 1 to F: if Wj ≥ τ then append fj to T
- 5. Return T. (A Python sketch follows.)
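A sketch of the two-class RELIEF algorithm above for continuous features, where diff is normalized by each feature's range; the normalization choice, default parameters and toy data are assumptions for illustration.

```python
import numpy as np

def relief(X, y, nosample=100, tau=0.0, seed=0):
    """Two-class RELIEF sketch: for randomly drawn instances, update the
    weights with the squared diff to the NearHit and NearMiss, then keep
    the features with weight >= tau.  Continuous features are assumed,
    with diff normalized by each feature's range."""
    n, p = X.shape
    rng = np.random.default_rng(seed)
    c = X.max(axis=0) - X.min(axis=0)          # normalization constants
    c[c == 0] = 1.0
    W = np.zeros(p)
    for _ in range(nosample):
        i = rng.integers(n)
        d = ((X - X[i]) / c) ** 2              # squared diffs per feature
        dist = d.sum(axis=1)
        dist[i] = np.inf                       # exclude x itself
        same, other = (y == y[i]), (y != y[i])
        nearhit = np.argmin(np.where(same, dist, np.inf))
        nearmiss = np.argmin(np.where(other, dist, np.inf))
        W += d[nearmiss] - d[nearhit]          # Wj += diff_miss^2 - diff_hit^2
    return np.flatnonzero(W >= tau), W

# Toy usage: feature 1 carries the class information
rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5)); X[:, 1] += 3 * (rng.random(200) < 0.5)
y = (X[:, 1] > 1.5).astype(int)
print(relief(X, y, nosample=150))
```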
26. The LVF algorithm
- Input: D dataset, p number of features, S set of features of D, MaxTries maximum number of trials, threshold γ.
- Cbest = p, Sbest = S
- For i = 1 to MaxTries:
- Si = subset of S chosen randomly
- C = card(Si)
- If C < Cbest:
- if Inconsistency(Si, D) < γ then Sbest = Si, Cbest = C
- If C = Cbest and Inconsistency(Si, D) ≤ γ then Sbest = Si
- Output: Sbest
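A sketch of LVF, assuming the inconsistency rate described on slide 30 (in its usual form due to Liu and Setiono, 1997); the helper names, defaults and toy data are illustrative.

```python
import numpy as np

def inconsistency(X_disc, y, feats):
    """Inconsistency rate on discretized data restricted to `feats`: for each
    distinct pattern, count the matching instances minus those in its
    majority class, and divide the total by n."""
    patterns = {}
    for row, label in zip(map(tuple, X_disc[:, feats]), y):
        patterns.setdefault(row, {}).setdefault(label, 0)
        patterns[row][label] += 1
    return sum(sum(c.values()) - max(c.values())
               for c in patterns.values()) / len(y)

def lvf(X_disc, y, max_tries=1000, gamma=0.0, seed=0):
    """Las Vegas Filter sketch: draw random feature subsets and keep the
    smallest one whose inconsistency rate does not exceed gamma."""
    p = X_disc.shape[1]
    rng = np.random.default_rng(seed)
    s_best, c_best = list(range(p)), p
    for _ in range(max_tries):
        size = rng.integers(1, p + 1)
        s_i = sorted(rng.choice(p, size=size, replace=False))
        if len(s_i) <= c_best and inconsistency(X_disc, y, s_i) <= gamma:
            s_best, c_best = s_i, len(s_i)
    return s_best

# Toy usage on already discretized (integer-coded) features
rng = np.random.default_rng(5)
X = rng.integers(0, 3, size=(200, 6))
y = (X[:, 0] + X[:, 3] > 2).astype(int)
print(lvf(X, y))
```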
27. RELIEF (cont.)
- Advantages:
- It works for noisy and correlated features.
- Time complexity is linear in the number of features and in Nosample.
- It works for any type of features.
- Limitations:
- Only binary classes; extended to multiple classes by Kononenko (1994).
- It removes irrelevant features but does not remove redundant features.
- Choice of the threshold.
- Choice of Nosample.
28. RELIEF for multiclass problems
- Kononenko (1994, 1997).
- To update the relevance, a Nearmiss is first found for each class different from that of x, and then their contributions are averaged using weights based on the class priors.
29. The Las Vegas Filter (LVF) method
- Liu and Setiono (1997).
- The subsets of features are chosen randomly.
- The evaluation function used is an inconsistency measure.
- The continuous features of the dataset must be discretized beforehand.
30. The inconsistency measure
- Given a dataset which has only non-continuous features, its inconsistency is defined in terms of:
- C: number of classes
- Ni: number of instances of the i-th class that also appear in some other class
- N: total number of instances
31. Discretization
- Supervised versus unsupervised
- Global versus local
- Static versus dynamic
32. Supervised discretization using equal-width intervals
- Freedman and Diaconis's formula for the width: h = 2 · IQR · n^(-1/3),
- where IQR denotes the interquartile range and n the number of instances.
- Then the number of intervals is given by k = R / h,
- where R is the range of the feature to be discretized.
- This method is robust to outliers.
33. The FINCO method
- FINCO (Acuña and Coaquira, 2002) combines sequential forward selection with an inconsistency measure as the evaluation function.
- PROCEDURE
- The best subset of features T is initialized as the empty set.
- In the first step we select the feature that produces the smallest level of inconsistency.
- In the second step we select the feature that, together with the first selected feature, produces the smallest level of inconsistency.
- The process continues until every feature not yet selected, taken together with the features already selected, produces a level of inconsistency less than a prefixed threshold γ (0 ≤ γ ≤ 0.10).
34. The FINCO algorithm
- Input: D dataset, p number of features in D, S set of features of D, threshold γ.
- Initialization: set k = 0 and Tk = ∅
- Inclusion: for k = 1 to p
- select the feature x = argmin over x in S - Tk of Incons(Tk ∪ {x}),
- where S - Tk is the subset of features not yet selected.
- If Incons(Tk ∪ {x}) > Incons(Tk) and Incons(Tk ∪ {x}) < γ, then
- Tk+1 = Tk ∪ {x} and k = k + 1;
- else stop.
- Output: Tk, the subset of selected features. (A Python sketch follows.)
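A sketch of forward selection driven by the inconsistency measure. Because the acceptance rule above is ambiguous after conversion, the rule used here (add the feature that minimizes inconsistency, and stop once the inconsistency of the selected subset reaches γ or no candidate improves it) is an interpretation, not necessarily the exact FINCO rule.

```python
import numpy as np

def inconsistency(X_disc, y, feats):
    """Inconsistency rate (see slide 30): for each distinct pattern of the
    chosen features, count matching instances minus the majority class."""
    patterns = {}
    for row, label in zip(map(tuple, X_disc[:, feats]), y):
        patterns.setdefault(row, {}).setdefault(label, 0)
        patterns[row][label] += 1
    return sum(sum(c.values()) - max(c.values())
               for c in patterns.values()) / len(y)

def finco(X_disc, y, gamma=0.05):
    """Greedy forward selection guided by inconsistency: add the feature
    whose inclusion gives the smallest inconsistency; stop when the
    inconsistency reaches gamma or no candidate improves it."""
    p = X_disc.shape[1]
    T, current = [], 1.0
    while len(T) < p and current > gamma:
        cand = [(inconsistency(X_disc, y, T + [f]), f)
                for f in range(p) if f not in T]
        best_inc, best_f = min(cand)
        if best_inc >= current:           # no improvement: stop
            break
        T.append(best_f)
        current = best_inc
    return T, current

# Toy usage: y is a deterministic function of features 1 and 4
rng = np.random.default_rng(7)
X = rng.integers(0, 3, size=(200, 6))
y = ((X[:, 1] + X[:, 4]) % 2).astype(int)
print(finco(X, y, gamma=0.0))
```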
35. Experimental methodology
- All the feature selection methods were applied to twelve datasets available in the Machine Learning Databases Repository at the Computer Science Department of the University of California, Irvine.
- All the algorithms were programmed in S-Plus.
- The feature selection procedures were compared in three aspects:
- 1. The number of features selected.
- 2. The misclassification error rate using classifiers based on kernel density estimators with fixed bandwidth (standard kernel) and with variable bandwidth (adaptive kernel).
- 3. The computing running time.
36. Methodology for wrapper methods
- The experiment was repeated 10 times for datasets with a small number of features; in the other cases the experiment was repeated 20 times.
- The size of the subset was determined by the average number of features selected over all the repetitions.
- The features selected were those with the highest frequency.
- To break ties for the last feature to be selected, we assigned weights to the features according to their selection order.
37. Methodology for filter methods
- In RELIEF and LVF the experiment was repeated 10 times for datasets with a small number of features; in the other cases the experiment was repeated 20 times.
- In RELIEF, Nosample was taken equal to the number of instances of the dataset, and the threshold was τ = 0.
- In LVF, the number of randomly selected subsets was chosen between 50 and 1000, and the inconsistency level was selected between 0 and 0.10.
- In FINCO the experiment was performed only once, since there is no randomness, and the inconsistency level was selected between 0 and 0.05.
38. Datasets
39. Recognition rate versus the number of variables selected by SFS with the standard kernel
40. Features selected using SFS
41. Misclassification error rate before and after feature selection using SFS
42. Computing running time for feature selection with SFS
43. Misclassification error rate before and after feature selection with SFFS
44. Computing running time for feature selection using SFFS
45. Comparison of the number of features selected with the three wrapper methods using the standard kernel
46. Comparison of the number of features selected with the three wrapper methods using the adaptive kernel
47. Comparison of the misclassification error rates after feature selection using SFS and SFFS (standard kernel)
48. Comparison of the misclassification error rates after feature selection using SFS and SFFS (adaptive kernel)
49. Misclassification error rate before and after feature selection using the RELIEF method
50. Misclassification error rate before and after feature selection using the LVF method
51. Computing running times for the LVF method
52. Features selected with the FINCO method
53. Misclassification error rate before and after feature selection using the FINCO method
54. Computing running times for feature selection using the FINCO method
55. Comparison of the number of features selected with the filter methods
56. Comparison of the misclassification error rates of the standard kernel classifier after feature selection using the filter methods
57. Comparison of the misclassification error rates of the adaptive kernel classifier after feature selection using the filter methods
58. Comparison of the percentages of features selected with the wrapper methods (standard kernel) and the filter methods
59. Comparison of the percentages of features selected with the wrapper methods (adaptive kernel) and the filter methods
60. Comparison of ME ratios after/before feature selection with the wrapper methods (standard kernel) and the filter methods
61. Comparison of ME ratios after/before feature selection with the wrapper methods (adaptive kernel) and the filter methods
62. CONCLUDING REMARKS
- Among the wrapper methods, SFFS performs best.
- Among the filter methods, FINCO has the smallest percentage of features selected, but the highest computing time.
- The performance of LVF and RELIEF is quite similar, but LVF takes more time to compute.
- Wrappers are more effective than filters in reducing the misclassification error rate.
- RELIEF is the fastest feature selection procedure; this is more evident for large datasets.
- FINCO and SFFS have the smallest percentages of features selected.
63. CONCLUDING REMARKS (Cont.)
- SBS selects a higher number of features and takes a long time to compute.
- The number of iterations suggested by the authors of LVF is 77·p^5, which seems to be too high; 1000 iterations is good enough.
- Choosing an inconsistency level very close to zero increases the number of features selected.
64. FUTURE WORK
- Compare these feature selection procedures using other nonparametric classifiers.
- Improve the computation of the inconsistency measure to reduce the running time of LVF and FINCO.
- Use parallel computation for the feature selection algorithms.
- Search for more robust discretization techniques.
- Search for new feature selection methods for unsupervised and supervised classification.