1
Feature selection for supervised classification
  • Edgar Acuña
  • (joint work with Frida Coaquira)
  • Department of Mathematics
  • University of Puerto Rico
  • Mayagüez Campus
  • (www.math.uprm.edu/edgar)

2
OUTLINE
  • Some concepts on classification
  • The feature selection problem
  • Kernel density estimator classifiers
  • Wrapper methods
  • Filter methods
  • Results and concluding remarks

3
The Supervised classification problem
4
The Misclassification Error
Let C(X, L) be the classifier constructed using the training sample L, and let T be another large sample drawn from the same population as L. The misclassification error (ME) of the classifier C is the proportion of cases of T misclassified by C.
Methods to estimate ME: resubstitution, leave-one-out, cross-validation, bootstrapping.
5
CROSS-VALIDATION
[Cross-validation diagram: the training sample is split into 10 folds; the classifier C_k is built leaving out the k-th fold, E_k = #{y ≠ C_k(x)} counts the errors on that fold, and CVE = (E_1 + ... + E_10)/n.]
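A minimal Python sketch of this cross-validation estimate (the presentation's own code was written in S-Plus). The fit and predict arguments are placeholder callbacks for whatever classifier is being evaluated.

```python
import numpy as np

def cv_error(X, y, fit, predict, n_folds=10, seed=0):
    """Cross-validation estimate of the misclassification error:
    CVE = (E_1 + ... + E_10) / n, where E_k is the number of errors made
    on the k-th held-out fold by the classifier built without that fold."""
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, n_folds)
    errors = 0
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate(folds[:k] + folds[k + 1:])
        model = fit(X[train], y[train])                        # C_k: built leaving out fold k
        errors += np.sum(predict(model, X[test]) != y[test])   # E_k
    return errors / n
```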
6
Bayesian approach to classification
An object with measurement vector x is assigned to class j if P(Y = j | x) > P(Y = j' | x) for all j' ≠ j. By Bayes's theorem, P(Y = j | x) = π_j f(x | j) / f(x), where
π_j = P(Y = j): prior of the j-th class,
f(x | j): class-conditional density,
f(x): density function of x.
Thus, ĵ = argmax_j π_j f(x | j).
7
Kernel density estimator classifiers
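A minimal Python sketch illustrating the two slides above: class-conditional densities are estimated by kernel density estimation and plugged into the Bayes rule ĵ = argmax_j π_j f(x | j). Here scipy's gaussian_kde (a fixed-bandwidth Gaussian kernel using Scott's rule) stands in for the presentation's standard and adaptive kernel estimators, which were implemented in S-Plus.

```python
import numpy as np
from scipy.stats import gaussian_kde

class KDEClassifier:
    """Bayes rule with kernel density estimates: assign x to
    argmax_j pi_j * f_hat(x | j), where pi_j is the class prior and
    f_hat(. | j) is a kernel density estimate built from class j."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.priors_, self.kdes_ = {}, {}
        for c in self.classes_:
            Xc = X[y == c]
            self.priors_[c] = len(Xc) / len(X)   # pi_j = P(Y = j)
            self.kdes_[c] = gaussian_kde(Xc.T)   # f_hat(x | j), expects shape (d, n)
        return self

    def predict(self, X):
        # posterior up to the common factor f(x): pi_j * f_hat(x | j)
        scores = np.array([self.priors_[c] * self.kdes_[c](X.T)
                           for c in self.classes_])
        return self.classes_[np.argmax(scores, axis=0)]
```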
8
The feature selection problem
Goal: choose a small subset of features such that
a) the accuracy of the classifier on the dataset does not decrease significantly, and
b) the resulting conditional distribution of the class C given the selected feature vector G is as close as possible to the original conditional distribution given all the features F. That is, P(C | F = f) ≈ P(C | G = f_G), where f_G is the projection of f on G.
9
Steps of feature selection
1. A generation procedure to generate the next candidate subset.
2. An evaluation function to evaluate the subset under examination.
3. A stopping criterion to decide when to stop.
4. (Optional) A validation procedure to check whether the subset is valid.
10
The generation procedure
If the original feature set is of size p, then the total number of competing candidate subsets is 2^p.
Complete: the order of the search space is O(2^p), but procedures such as branch and bound can be used to reduce the search.
Heuristic: the generation of subsets is basically incremental (either forward or backward). The order of the search space is O(p^2).
Random: the subsets are generated using probabilistic arguments. The search space is O(2^p), but it is reduced by setting a maximum number of iterations. It requires some parameter values.
11
Evaluation functions
They try to measure the discriminating ability of a feature, or of a subset of features, to distinguish the different class labels.
1. Distance measures (e.g., Euclidean distance)
2. Information measures (e.g., entropy)
3. Dependence measures (e.g., correlation)
4. Consistency measures
5. Classifier error rate measures
12
A Comparison of Evaluation functions
13
Stopping criterion of the feature selection
procedures
  • A Threshold
  • A prefixed number of iterations
  • A prefixed size of the best subset of features

14
Categorization of feature selection methods
15
Advantages of feature selection
  • The computational cost of classification is reduced, since fewer features are used.
  • The complexity of the classifier is reduced, since redundant and irrelevant features are eliminated.

16
Guidelines for choosing a feature selection method
  • Ability to handle different types of features (continuous, binary, nominal, ordinal)
  • Ability to handle multiple classes
  • Ability to handle large datasets
  • Ability to handle noisy data
  • Low time complexity

17
Wrapper methods
  • Wrappers use the misclassification error rate as
    the evaluation function for the subsets of
    features.
  • Sequential Forward selection (SFS)
  • Sequential Backward selection (SBS)
  • Sequential Floating Forward selection (SFFS)
  • Others: SFBS, take l-remove r, GSFS, GA, SA.

18
Sequential Forward selection (SFS)
  • Initially the best subset of features T is set to the empty set.
  • The first feature entering T is the one that, by itself, yields the highest recognition rate.
  • Then we perform classification using two features: the one already selected together with each feature not yet selected. The second feature entering T is the one that produces the highest recognition rate.
  • We continue the process, entering only one feature at each step, until the recognition rate no longer increases when the classification is performed using the features already selected together with each of the remaining features. A sketch of the procedure is given below.
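A minimal Python sketch of SFS, assuming a user-supplied recognition_rate callback (for example, one minus the cross-validated misclassification error of the kernel classifier restricted to the candidate subset).

```python
def sfs(features, recognition_rate):
    """Sequential Forward Selection: start from the empty set and, at each
    step, add the single remaining feature that gives the highest
    recognition rate; stop when no addition improves it."""
    T, best = [], 0.0
    while True:
        candidates = [f for f in features if f not in T]
        if not candidates:
            break
        rate, f = max((recognition_rate(T + [c]), c) for c in candidates)
        if rate <= best:          # recognition rate no longer increases
            break
        T.append(f)
        best = rate
    return T
```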

19
Sequential Backward selection(SBS)
  • Initially the best subset of features T includes all the features of the dataset.
  • In the first step we perform the classification leaving out each feature in turn, and we remove the feature whose exclusion gives the highest recognition rate.
  • The procedure continues, removing one feature at each step, until the recognition rate starts to decrease.
  • Not efficient for nonparametric classifiers, because it has a high computing running time.
20
Sequential floating forward selection (SFFS)
  • Pudil et al. (1994). It tries to solve the nesting problem that appears in SFS and SBS.
  • Initially the best subset of features T is set to the empty set.
  • In each step a new feature is included in T using SFS, but this is followed by the exclusion of features already in T. Features are excluded using SBS until the recognition rate starts to decrease.
  • The process continues until the recognition rate no longer increases when the classification is performed using the features already selected together with each of the remaining features. A simplified sketch is shown below.
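A simplified Python sketch of the floating search, using the same assumed recognition_rate callback as the SFS sketch; the conditional backward exclusions after each forward step are what distinguish it from plain SFS. The stopping details here are a simplification, not the exact SFFS bookkeeping of Pudil et al.

```python
def sffs(features, recognition_rate):
    """Sequential Floating Forward Selection (simplified): a forward SFS
    step, followed by backward exclusions of previously selected features
    for as long as an exclusion improves the recognition rate."""
    T, best = [], 0.0
    while True:
        remaining = [f for f in features if f not in T]
        if not remaining:
            break
        rate, f = max((recognition_rate(T + [g]), g) for g in remaining)
        if rate <= best:
            break                              # forward step no longer helps
        T.append(f)
        best = rate
        improved = True
        while improved and len(T) > 2:         # floating backward exclusions
            improved = False
            for g in [h for h in T if h != f]: # never drop the feature just added
                candidate = [h for h in T if h != g]
                r = recognition_rate(candidate)
                if r > best:
                    T, best, improved = candidate, r, True
                    break
    return T
```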

21
Filter methods
  • They do not require a classifier; instead they use measures that select the features that best distinguish the classes.
  • RELIEF
  • Las Vegas Filter (LVF)
  • FINCO
  • Others: Branch and Bound, Focus

22
The RELIEF method
  • Kira and Rendell (1992); modified by Kononenko (1994) and Kononenko et al. (1997).
  • Generates subsets of features heuristically.
  • Based on the k-nn classifier.
  • Features are selected according to how well they distinguish the classes through distances between instances that are near to each other.

23
RELIEF (procedure)
  • A given number N of instances is selected randomly from the training set.
  • For each selected instance x, two particular instances are identified:
  • Nearhit: the instance closest to x that belongs to its same class.
  • Nearmiss: the instance closest to x that belongs to a different class.
  • The relevance weight W_j of each feature is initialized to zero.
  • The relevances W_j are updated using the relation
  • W_j = W_j - diff(x_j, Nearhit_j)^2 + diff(x_j, Nearmiss_j)^2

24
RELIEF (distances)
  • If the feature X_k is either nominal or binary, then
  •     diff(x_ik, x_jk) = 1 if x_ik ≠ x_jk
  •                      = 0 otherwise
  • If the feature X_k is either continuous or ordinal, then
  •     diff(x_ik, x_jk) = (x_ik - x_jk) / c_k,
  • where c_k is a normalizing constant (the norm of x - y).

25
The RELIEF Algorithm
Input: D = data set, F = number of features in D, Nosample = number of instances randomly drawn, Threshold = γ.
1. T = ∅ (T is the subset containing the selected features)
2. Initialize all weights W_j (j = 1, ..., F) to zero
3. For i = 1 to Nosample:
       Randomly choose an instance x in D
       Find its NearHit and NearMiss
       For j = 1 to F:
           W_j = W_j - diff(x_j, NearHit_j)^2 + diff(x_j, NearMiss_j)^2
4. For j = 1 to F: if W_j ≥ γ, then append f_j to T
5. Return T.
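A minimal Python sketch of this algorithm for continuous features. Normalizing the differences by the feature ranges and using a Manhattan distance to find the neighbours are assumptions, since the slides leave the constant c_k and the distance unspecified.

```python
import numpy as np

def relief(X, y, n_sample=None, threshold=0.0, seed=0):
    """RELIEF weights for a numeric feature matrix X and labels y.
    Differences are range-normalized; features with weight >= threshold
    are returned as the selected subset (column indices)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    n_sample = n if n_sample is None else n_sample
    ranges = X.max(axis=0) - X.min(axis=0)
    ranges[ranges == 0] = 1.0                       # avoid division by zero
    W = np.zeros(p)
    for _ in range(n_sample):
        i = rng.integers(n)
        diffs = np.abs(X - X[i]) / ranges           # per-feature normalized diffs
        dist = diffs.sum(axis=1)
        dist[i] = np.inf                            # exclude x itself
        near_hit = np.where(y == y[i], dist, np.inf).argmin()
        near_miss = np.where(y != y[i], dist, np.inf).argmin()
        W += diffs[near_miss] ** 2 - diffs[near_hit] ** 2
    return np.where(W >= threshold)[0]
```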
26
The LVF Algorithm
  • Input: D = dataset, p = number of features, MaxTries = maximum number of trials, Threshold = γ.
  • Cbest = p, Sbest = S (the full set of features)
  • For i = 1 to MaxTries:
  •     Si = subset of S chosen randomly
  •     C = card(Si)
  •     If C < Cbest:
  •         If Inconsistency(Si, D) < γ, then Sbest = Si, Cbest = C
  •     If C = Cbest and Inconsistency(Si, D) ≤ γ, then Sbest = Si
  • Output: Sbest
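A minimal Python sketch of the LVF loop. The inconsistency(subset, data) helper is assumed to be the measure described later in the presentation (a sketch of it follows the inconsistency slide).

```python
import random

def lvf(data, features, max_tries=1000, gamma=0.05, seed=0):
    """Las Vegas Filter: repeatedly draw a random subset of features and
    keep the smallest one whose inconsistency on `data` is below gamma."""
    rng = random.Random(seed)
    s_best, c_best = list(features), len(features)
    for _ in range(max_tries):
        s_i = rng.sample(list(features), rng.randint(1, len(features)))
        c = len(s_i)
        if c < c_best and inconsistency(s_i, data) < gamma:
            s_best, c_best = s_i, c                 # smaller consistent subset found
        elif c == c_best and inconsistency(s_i, data) <= gamma:
            s_best = s_i                            # same size, still acceptable
    return s_best
```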

27
RELIEF (Cont.)
Advantages: It works for noisy and correlated features. Its time complexity is linear in the number of features and in Nosample. It works for any type of feature.
Limitations: Only binary classes (extended to multiple classes by Kononenko, 1994). It removes irrelevant features but does not remove redundant features. Choice of the threshold. Choice of Nosample.
28
RELIEF for multiclass problems
  • Kononenko (1994) and (1997).
  • To update the relevance weights, a Nearmiss is first found for each class different from that of x, and then their contributions are averaged using weights based on the class priors.

29
The Las Vegas Filter (LVF) method
  • Liu and Setiono (1997).
  • The subsets of features are chosen randomly.
  • The evaluation function used is an inconsistency measure.
  • The continuous features of the dataset have to be discretized beforehand.

30
The Inconsistency measure
  • Given a dataset which has only non-continuous features, its inconsistency is defined in terms of:
  • C = number of classes
  • Ni = number of instances of the i-th class that also appear in any other class
  • N = total number of instances
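The formula itself did not survive in this transcript, so the sketch below implements the usual Liu-Setiono inconsistency rate, which matches the description above: instances that share the same pattern of selected feature values but carry different class labels are counted as inconsistent. The (feature_dict, label) data layout is an assumption made for illustration.

```python
from collections import Counter, defaultdict

def inconsistency(subset, data):
    """Inconsistency rate of a feature subset.  For each distinct pattern
    of values on `subset`, the count is (#matching instances) minus the
    size of the majority class among them; the rate is the sum of these
    counts divided by N.  Features in `subset` are assumed discrete
    (continuous features discretized beforehand)."""
    groups = defaultdict(Counter)
    for features, label in data:
        pattern = tuple(features[f] for f in subset)
        groups[pattern][label] += 1
    n = len(data)
    inconsistent = sum(sum(c.values()) - max(c.values()) for c in groups.values())
    return inconsistent / n
```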

31
Discretization
  • Supervised versus unsupervised
  • Global versus Local
  • Static versus Dynamic

32
Supervised Discretization using equal-width intervals
  • Freedman and Diaconis's formula for the width:
  •     h = 2 · IQR · n^(-1/3),
  • where IQR denotes the interquartile range.
  • The number of intervals is then given by
  •     k = R / h,
  • where R is the range of the feature to be discretized.
  • This method is robust to outliers.
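A minimal Python sketch of equal-width discretization with the Freedman-Diaconis width; the function name and the returned (codes, k) format are illustrative, not taken from the presentation.

```python
import numpy as np

def fd_discretize(x):
    """Equal-width discretization with the Freedman-Diaconis width
    h = 2 * IQR(x) * n**(-1/3); the number of intervals is k = ceil(R / h),
    where R is the range of the feature."""
    x = np.asarray(x, dtype=float)
    q75, q25 = np.percentile(x, [75, 25])
    h = 2 * (q75 - q25) * len(x) ** (-1 / 3)
    r = x.max() - x.min()
    k = max(1, int(np.ceil(r / h))) if h > 0 else 1
    edges = np.linspace(x.min(), x.max(), k + 1)
    return np.digitize(x, edges[1:-1]), k          # interval index per value, and k
```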

33
The FINCO method
  • FINCO (Acuña and Coaquira, 2002) combines sequential forward selection with an inconsistency measure as the evaluation function.
  • PROCEDURE
  • The best subset of features T is initialized as the empty set.
  • In the first step we select the feature that produces the smallest level of inconsistency.
  • In the second step we select the feature that, together with the first feature selected, produces the smallest level of inconsistency.
  • The process continues until every feature not yet selected, together with the features already selected, produces a level of inconsistency less than a prefixed threshold γ (0 ≤ γ ≤ 0.10).

34
The FINCO algorithm
  • Input: D = dataset, p = number of features in D, S = set of features of D, Threshold = γ.
  • Initialization: set k = 0 and T_k = ∅
  • Inclusion: For k = 1 to p
  •     Select the feature x in S - T_k that minimizes Incons(T_k ∪ {x}), where S - T_k is the subset of features not yet selected.
  •     If Incons(T_k ∪ {x}) > Incons(T_k) and Incons(T_k ∪ {x}) < γ, then
  •         T_{k+1} = T_k ∪ {x} and k = k + 1
  •     else stop
  • Output: T_k, the subset of selected features
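A simplified Python sketch of inconsistency-driven forward selection in the spirit of FINCO, reusing the inconsistency sketch given earlier. The stopping rule used here (stop when the inconsistency no longer decreases or is already below γ) is an assumption made for illustration; the exact inclusion test of FINCO is the one stated in the algorithm above.

```python
def finco(features, data, gamma=0.05):
    """Forward selection driven by the inconsistency measure: at each step
    add the not-yet-selected feature giving the smallest inconsistency,
    and stop once the inconsistency stops improving or falls below gamma."""
    T = []
    current = float("inf")                          # inconsistency of the current subset
    while len(T) < len(features):
        remaining = [f for f in features if f not in T]
        best_inc, best_f = min((inconsistency(T + [f], data), f) for f in remaining)
        if best_inc >= current or current <= gamma:
            break
        T.append(best_f)
        current = best_inc
    return T
```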

35
Experimental Methodology
  • All the feature selection methods were applied to twelve datasets available in the Machine Learning Databases Repository at the Computer Science Department of the University of California, Irvine.
  • All the algorithms were programmed in S-Plus.
  • The feature selection procedures were compared on three aspects:
  • 1. The number of features selected.
  • 2. The misclassification error rate using classifiers based on kernel density estimators with fixed bandwidth (standard kernel) and with variable bandwidth (adaptive kernel).
  • 3. The computing running time.

36
Methodology for WRAPPERS methods
  • The experiment was repeated 10 times for datasets with a small number of features; for the other cases it was repeated 20 times.
  • The size of the subset was determined by the average number of features selected over all the repetitions.
  • The features selected were those with the highest frequency.
  • To break ties for the last feature to be selected, we assigned weights to the features according to their selection order.

37
Methodology for filter methods
  • In RELIEF and LVF the experiment was repeated 10 times for datasets with a small number of features; for the other cases it was repeated 20 times.
  • In RELIEF, Nosample was taken equal to the number of instances of the dataset, and the threshold was γ = 0.
  • In LVF, the number of subsets selected randomly was chosen between 50 and 1000, and the inconsistency level was selected between 0 and 0.10.
  • In FINCO the experiment was performed only once, since there is no randomness, and the inconsistency level was selected between 0 and 0.05.

38
Datasets
39
Recognition rate versus the number of variables
being selected by SFS with Standard Kernel
40
Features selected using SFS
41
Misclassification error rate before and after
feature selection using SFS
42
Computing running time for feature selection with
SFS
43
Misclassification error rate before and after
feature selection with SFFS
44
Computing running time for feature selection
using SFFS
45
Comparison of the number of features selected
with the three wrapper methods using the
standard kernel
46
Comparison of the number of features selected
with the three wrapper methods using the
adaptive kernel
47
Comparison of the misclassification error rates
after feature selection using SFS and SFFS
(standard kernel)
48
Comparison of the misclassification error rates
after feature selection using SFS and SFFS
(adaptive kernel)
49
Misclassification error rate before and after
feature selection using the RELIEF method
50
Misclassification error rate before and after
feature selection using the LVF method
51
Computing running times for the LVF method
52
Features selected with the FINCO method
53
Misclassification error rate before and after
feature selection using the FINCO method
54
Computing running times for feature selection
using the FINCO method
55
Comparison of the number of features selected
with the filter methods
56
Comparison of the misclassification error rates
of the standard kernel classifier after feature
selection using the filter methods
57
Comparison of the misclassification error rates
of the adaptive kernel classifier after feature
selection using the filter methods
58
Comparison of percentages of features selected
with the wrapper methods (standard kernel) and
the filter methods
59
Comparison of percentage of features selected
with the wrapper methods (adaptive kernel) and
the filter methods
60
Comparison of ME ratios after/before feature selection with the wrapper methods (standard kernel) and the filter methods
61
Comparison of ME ratios after/before feature selection with the wrapper methods (adaptive kernel) and the filter methods
62
CONCLUDING REMARKS
  • Among the wrapper methods, SFFS performs best.
  • Among the filter methods, FINCO has the smallest percentage of features selected, but the highest computing time.
  • The performance of LVF and RELIEF is quite similar, but LVF takes more time to compute.
  • Wrappers are more effective than filters in reducing the misclassification error rate.
  • RELIEF is the fastest feature selection procedure. This is more evident for large datasets.
  • FINCO and SFFS have the smallest percentages of features selected.

63
CONCLUDING REMARKS (Cont.)
  • SBS selects a higher number of features and takes a long time to compute.
  • The number of iterations suggested by the authors of LVF is 77·p^5, which seems to be too high; 1000 iterations is good enough.
  • Choosing an inconsistency level very close to zero increases the number of features selected.

64
FUTURE WORK
  • Compare these feature selection procedures using other nonparametric classifiers.
  • Improve the computation of the inconsistency measure to reduce the computing running time of LVF and FINCO.
  • Use parallel computation for the feature selection algorithms.
  • Search for more robust discretization techniques.
  • Search for new feature selection methods for unsupervised and supervised classification.