Title: Dimensionality Reduction by Feature Selection in Machine Learning

1. Dimensionality Reduction by Feature Selection in Machine Learning
- Dunja Mladenic
- J. Stefan Institute, Slovenia
2. Reasons for dimensionality reduction
- Dimensionality reduction in machine learning is usually performed to
  - improve the prediction performance
  - improve learning efficiency
  - provide faster predictors, possibly requesting less information on the original data
  - reduce the complexity of the learned results and enable better understanding of the underlying process
3. Approaches to dimensionality reduction
- Map the original features onto a reduced-dimensionality space by
  - selecting a subset of the original features (addressed here)
    - no feature transformation, just selection of a feature subset
  - constructing features to replace the original features
    - using methods from statistics, such as PCA
  - using background knowledge to construct new features, to be used in addition to or instead of the original features (can be followed by feature subset selection)
    - general background knowledge (sum or product of features, ...)
    - domain-specific background knowledge (a parser for text data to extract noun phrases, clustering of words, a user-specified function, ...)
4. Example of the problem
- Data set
  - five Boolean features
  - C = F1 ∨ F2
  - F3 = ¬F2, F5 = ¬F4
- Optimal subset: {F1, F2} or {F1, F3}
- optimization in the space of all feature subsets (2^N possibilities, here 2^5 = 32); see the sketch below
- (tutorial on genomics, Yu 2004)
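A minimal sketch of this toy problem (the code and names are illustrative, not from the slides): enumerating the feature subsets by increasing size and keeping those that still determine the class recovers exactly {F1, F2} and {F1, F3}.

```python
from itertools import product, combinations

# Enumerate every example consistent with the dependencies above.
examples = []
for f1, f2, f4 in product([0, 1], repeat=3):
    f3, f5 = 1 - f2, 1 - f4                  # F3 and F5 are redundant negations
    examples.append(((f1, f2, f3, f4, f5), f1 | f2))  # class C = F1 or F2

def consistent(subset):
    """True if the projected feature values still determine the class."""
    seen = {}
    for feats, c in examples:
        key = tuple(feats[i] for i in subset)
        if seen.setdefault(key, c) != c:
            return False                     # same projection, different class
    return True

# Scan subsets by increasing size; the first hits are the optimal subsets.
for size in range(6):
    hits = [s for s in combinations(range(5), size) if consistent(s)]
    if hits:
        print([[f"F{i + 1}" for i in s] for s in hits])  # [['F1','F2'], ['F1','F3']]
        break
```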
5. Search for feature subsets
- An example of the search space (John & Kohavi 1997)
- (Figure: lattice of feature subsets, traversed from the empty set by forward selection and from the full set by backward elimination)
6. Feature subset selection
- Commonly used search strategies (see the sketch after this list)
  - forward selection
    - FSubset = {}; greedily add features one at a time
  - forward stepwise selection
    - FSubset = {}; greedily add or remove features one at a time
  - backward elimination
    - FSubset = AllFeatures; greedily remove features one at a time
  - backward stepwise elimination
    - FSubset = AllFeatures; greedily add or remove features one at a time
  - random mutation
    - FSubset = RandomFeatures; greedily add or remove one randomly selected feature at a time
    - stop after a given number of iterations
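The two basic strategies can be sketched as follows (a sketch, not from the slides), assuming a caller-supplied evaluate() function that scores a candidate subset with any filter measure or wrapper estimate from the later slides; the stepwise and random-mutation variants differ only in the set of moves tried at each step.

```python
def forward_selection(n_features, evaluate):
    """FSubset = {}; greedily add the single best feature until no gain."""
    subset, best = set(), evaluate(set())
    while len(subset) < n_features:
        score, f = max((evaluate(subset | {f}), f)
                       for f in range(n_features) if f not in subset)
        if score <= best:
            break                          # no single addition improves the score
        subset.add(f)
        best = score
    return subset

def backward_elimination(n_features, evaluate):
    """FSubset = AllFeatures; greedily remove features while it does not hurt."""
    subset = set(range(n_features))
    best = evaluate(subset)
    improved = True
    while improved and len(subset) > 1:
        improved = False
        score, f = max((evaluate(subset - {f}), f) for f in subset)
        if score >= best:                  # dropping f keeps or improves the score
            subset.discard(f)
            best, improved = score, True
    return subset
```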
7. Approaches to feature subset selection
- Filters: evaluation function independent of the learning algorithm
- Wrappers: evaluation using model selection based on the machine learning algorithm
- Embedded approaches: feature selection during learning
- Simple filters: assume feature independence (used for problems with a large number of features, e.g. text classification)
8. Filtering
- Evaluation independent of the ML algorithm
9. Filters: Distribution-based (Koller & Sahami 1996)
- Idea: select a minimal subset of features that keeps the class probability distribution close to the original one, i.e., P(C|FeatureSet) close to P(C|AllFeatures)
  - start with all the features
  - use backward elimination to eliminate a predefined number of features
  - evaluation: the next feature to be deleted is selected using a cross-entropy measure
10. Filters: Relief (Kira & Rendell 1992)
- Evaluation of a feature subset (see the sketch below)
  - represent examples using the feature subset
  - on a random subset of examples, calculate the average difference between the distance to the nearest example of the same class and the distance to the nearest example of a different class
  - per-feature difference: for discrete F, diff(F, e1, e2) = 0 if the two values agree and 1 otherwise; for continuous F, diff(F, e1, e2) = |value(F, e1) - value(F, e2)| / (max(F) - min(F))
- some extensions and an empirical and theoretical analysis in Robnik-Sikonja & Kononenko 2003
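A sketch of the Relief weighting loop, with illustrative names (continuous-feature version; for discrete features the per-feature difference would be the 0/1 diff above):

```python
import numpy as np

def relief_weights(X, y, n_samples=100, seed=0):
    """Reward features that differ on the nearest miss and agree on the nearest hit."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    span = X.max(axis=0) - X.min(axis=0) + 1e-12   # normalizer for continuous diff
    w = np.zeros(d)
    for i in rng.integers(0, n, n_samples):        # random subset of examples
        dist = np.abs(X - X[i]).sum(axis=1)        # Manhattan distance to example i
        dist[i] = np.inf                           # exclude the example itself
        hit = np.argmin(np.where(y == y[i], dist, np.inf))   # nearest same-class
        miss = np.argmin(np.where(y != y[i], dist, np.inf))  # nearest other-class
        w += (np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])) / span
    return w / n_samples
```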
11. Filters: FOCUS (Almuallim & Dietterich 1991)
- Evaluation of a feature subset
  - represent examples using the feature subset
  - count conflicts in the class value (two examples with the same feature values but different class values)
- Search: all the (promising) subsets of the same (increasing) size are evaluated until a sufficient (conflict-free) subset is found; see the sketch below
  - assumes the existence of a small sufficient subset -> not appropriate for tasks with many features
  - some extensions of the algorithm use heuristic search to avoid evaluating all the subsets of the same size
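A sketch of FOCUS with illustrative names; the exhaustive scan over subsets of increasing size is exactly why it only suits tasks with a small sufficient subset:

```python
from itertools import combinations

def conflict_free(X, y, subset):
    """No two examples agree on `subset` yet differ in class value."""
    seen = {}
    for row, c in zip(X, y):
        key = tuple(row[i] for i in subset)
        if seen.setdefault(key, c) != c:
            return False
    return True

def focus(X, y):
    d = len(X[0])
    for size in range(d + 1):              # subsets of increasing size
        for s in combinations(range(d), size):
            if conflict_free(X, y, s):
                return set(s)              # first sufficient (conflict-free) subset
```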
12. Illustration of FOCUS
- (Figure: example table in which pairs of examples agree on the selected features but differ in class value, each pair marked "Conflict!")
13. Filters: Random (Liu & Setiono 1996)
- Evaluation of a feature subset (see the sketch below)
  - represent examples using the feature subset
  - calculate the inconsistency rate: for each group of examples with equal feature values, take the group size minus the number of examples in the group with the locally most frequent class value, then average over all examples
  - select the smallest subset with an inconsistency rate below the given threshold
- Search: random sampling of the space of feature subsets
  - evaluate a predetermined number of subsets
  - noise handling by setting the threshold > 0
  - if the threshold = 0, the evaluation is the same as in FOCUS
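A sketch of the inconsistency rate and the random-sampling search, with illustrative names; with threshold 0 a subset is accepted exactly when it is conflict-free, as in FOCUS:

```python
import random
from collections import Counter, defaultdict

def inconsistency_rate(X, y, subset):
    """Per group of equal projected feature values: group size minus the
    count of the locally most frequent class, summed and divided by |X|."""
    groups = defaultdict(Counter)
    for row, c in zip(X, y):
        groups[tuple(row[i] for i in subset)][c] += 1
    return sum(sum(g.values()) - max(g.values()) for g in groups.values()) / len(y)

def random_search(X, y, threshold, n_trials=1000, seed=0):
    rng = random.Random(seed)
    d = len(X[0])
    best = list(range(d))                  # start from all features
    for _ in range(n_trials):              # evaluate a predetermined number of subsets
        s = rng.sample(range(d), rng.randint(1, len(best)))
        if len(s) <= len(best) and inconsistency_rate(X, y, s) <= threshold:
            best = s                       # keep the smallest acceptable subset
    return sorted(best)
```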
14. Filters: MDL-based (Pfahringer 1995)
- Evaluation using Minimum Description Length
  - represent examples using the feature subset
  - calculate the MDL of a simple decision table representing the examples
- Search: start with a random feature subset and add or delete one feature at a time
- performs at least as well as the wrapper approach applied to simple decision tables, and scales better to large numbers of training examples
15. Wrapper
- Evaluation uses the same ML algorithm that is used after the feature selection
16. Wrappers: Instance-based learning
- Evaluation using instance-based learning (see the sketch below)
  - represent examples using the feature subset
  - estimate model quality using cross-validation
- Search (Aha & Bankert 1994)
  - start with a random feature subset
  - use beam search with backward elimination
- Search (Skalak 1994)
  - start with a random feature subset
  - use random mutation
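A sketch of the wrapper evaluation step, assuming scikit-learn: the candidate subset is scored by cross-validating the same instance-based learner (here k-NN) that will be used after selection. It plugs directly into the search sketches from slide 6.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def wrapper_evaluate(X, y, subset):
    """Estimate subset quality by 5-fold CV accuracy of the target learner."""
    if not subset:
        return 0.0
    cols = sorted(subset)
    return cross_val_score(KNeighborsClassifier(), X[:, cols], y, cv=5).mean()

# Usage: forward_selection(X.shape[1], lambda s: wrapper_evaluate(X, y, s))
```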
17. Wrappers: Decision tree induction
- Evaluation using decision tree induction
  - represent examples using the feature subset
  - estimate model quality using cross-validation
- Search (Bala et al. 1995; Cherkauer & Shavlik 1996)
  - using a genetic algorithm
- Search (Caruana & Freitag 1994)
  - adding and removing features (backward stepwise elimination)
  - additionally, at each step removes all the features that were not used in the decision tree induced for the evaluation of the current feature subset
18. Metric-based model selection
- Idea: poor models behave differently on training data than on other data
- Evaluation using a machine learning algorithm
  - represent examples using the feature subset
  - generate a model using some ML algorithm
  - estimate model quality by comparing the performance of two models on training and on unlabeled data; choose the largest subset that satisfies the triangle inequality with all the smaller subsets
- Combine metric-based selection and cross-validation (Bengio & Chapados 2003), based on their disagreement on testing examples (higher disagreement means lower trust in cross-validation)
  - Intuition: cross-validation provides good results but has high variance, so it should benefit from combination with a model-selection method of lower variance
19. Embedded
- Feature selection as an integral part of model generation
20. Embedded
- At each iteration of the incremental optimization of the model, use a fast gradient-based heuristic to find the most promising feature (Perkins et al. 2003)
- Idea: features that are relevant to the concept should affect the generalization error bound of a non-linear SVM more than irrelevant features
  - use backward elimination based on criteria derived from the generalization error bounds of SVM theory: the weight-vector norm, or upper bounds on the leave-one-out error (Rakotomamonjy 2003); see the sketch below
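A sketch in the spirit of the weight-vector-norm criterion above; this is essentially recursive feature elimination, while the bounds used by Rakotomamonjy 2003 are more refined. Assumes scikit-learn and a binary class.

```python
import numpy as np
from sklearn.svm import LinearSVC

def svm_backward_elimination(X, y, n_keep):
    """Repeatedly retrain a linear SVM and drop the least useful feature."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        w = LinearSVC().fit(X[:, remaining], y).coef_.ravel()
        # eliminate the feature whose weight contributes least to ||w||^2
        remaining.pop(int(np.argmin(w ** 2)))
    return remaining
```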
21. Embedded in filters (Cardie 1993)
- Use embedded feature selection as pre-processing
  - evaluation and search using the process embedded in decision tree induction
  - the final feature subset contains only the features that appear in the induced decision tree
- used for learning with the Nearest Neighbor algorithm
22. Simple Filtering
- Evaluation independent of the ML algorithm
23. Feature subset selection on text data: commonly used methods
- Simple filtering using a scoring measure to evaluate each individual feature
  - supervised measures
    - information gain, cross entropy for text (information gain on only one feature value), mutual information for text
  - supervised measures for a binary class
    - odds ratio (target class vs. the rest), bi-normal separation
  - unsupervised measures
    - term frequency, document frequency
- Simple filtering using an embedded approach to score the features
  - scoring measure equal to the weights in the normal to the hyperplane of a linear SVM trained on all the features (Brank et al. 2002); see the sketch below
- Learning using linear SVM, Perceptron, Naïve Bayes
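A sketch of the SVM-normal scoring above, assuming scikit-learn and a binary class: train one linear SVM on all features and keep the k features with the largest absolute weight in the normal to the separating hyperplane.

```python
import numpy as np
from sklearn.svm import LinearSVC

def svm_normal_top_k(X, y, k):
    """Score features by |w_j| of a linear SVM trained on all features."""
    weights = np.abs(LinearSVC().fit(X, y).coef_.ravel())
    return np.argsort(weights)[::-1][:k]   # indices of the k best-scored features
```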
24. Scoring an individual feature
- InformationGain: IG(F) = -∑_C P(C) log P(C) + P(F) ∑_C P(C|F) log P(C|F) + P(¬F) ∑_C P(C|¬F) log P(C|¬F)
- CrossEntropyTxt: CE(F) = P(F) ∑_C P(C|F) log (P(C|F) / P(C))
- MutualInfoTxt: MI(F) = ∑_C P(C) log (P(F|C) / P(F))
- OddsRatio: OR(F) = log (P(F|pos) (1 - P(F|neg))) - log ((1 - P(F|pos)) P(F|neg))
- Frequency: Freq(F) = TF(F) or DF(F)
- Bi-NormalSeparation: BNS(F) = |F⁻¹(P(F|pos)) - F⁻¹(P(F|neg))|
  - F: normal distribution cumulative probability function (applied via its inverse to the per-class feature rates)
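A sketch computing the two binary-class scores above from four counts (tp, fp: positive and negative training documents containing the term; pos, neg: class sizes). The smoothing constant is illustrative; the original papers may smooth differently.

```python
from math import log
from statistics import NormalDist

def odds_ratio_and_bns(tp, fp, pos, neg, eps=0.5):
    p_t_pos = (tp + eps) / (pos + 2 * eps)      # P(F|pos), smoothed
    p_t_neg = (fp + eps) / (neg + 2 * eps)      # P(F|neg), smoothed
    odds_ratio = log(p_t_pos * (1 - p_t_neg)) - log((1 - p_t_pos) * p_t_neg)
    inv = NormalDist().inv_cdf                  # F^-1, the standard normal inverse CDF
    bns = abs(inv(p_t_pos) - inv(p_t_neg))      # bi-normal separation
    return odds_ratio, bns
```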
25. Influence of feature selection on the classification performance
- Some ML algorithms are more sensitive to the feature subset than others
  - Naïve Bayes on document categorization is sensitive to the feature subset
  - a linear SVM has an embedded weighting of features that partially compensates for feature selection
26. Illustration of feature selection
- Naïve Bayes on Yahoo! hierarchy data
  - comparison of different feature scoring measures in simple filtering
- Linear SVM on the standard Reuters-2000 news data
  - comparison of scoring measures, including embedded SVM-normal and perceptron used as pre-processing
27. Illustration on 5 datasets from the Yahoo! hierarchy using Naïve Bayes (Mladenic & Grobelnik 2003)
28. (Figure: classification performance vs. feature subset size for the CrossEntropy, OddsRatio, MutualInf, InfGain, and Random scoring measures)
- Feature subset size importantly influences the performance
- Some measures are more sensitive to it than others
29.
- Performance measures used:
  - rank of the correct category in the list of all categories
  - F2-measure, combining precision and recall with emphasis on recall
  - Ctgs: the number of categories that look promising (a testing example needs to be classified by their models)
- best results: Odds ratio
  - using only a small number of features (50-100, roughly 0.2-5% of all features)
  - improves the performance of Naïve Bayes
- surprisingly good results: the unsupervised term frequency
- poor results: Information gain
  - probably because it is not compatible with Naïve Bayes (it selects mostly features representative of the negative class and features informative when not occurring in the document)
30. Illustration on Reuters-2000 data (Brank et al. 2002)
- (Figure: timeline of the Reuters-2000 corpus: 810,000 news articles in 103 categories; training period 20 Aug 1996 - 14 Apr 1997 with 504,468 articles; test period up to 19 Aug 1997 with 302,323 articles)
- Reuters-2000 data used in the experiments
  - 16 categories covering the range of break-even point (estimated on a sample) and class distribution
  - training: a sample of 118,294 articles from the training period
  - testing: 302,323 articles from the test period
31. Experiments with the Naïve Bayes classifier
- Benefits from feature selection
- SVM-normal gives the best performance
- (Plot: performance for feature selection by SVM-Normal, InfoGain, OddsRatio, and PercNormal)
32. Average number of nonzero components per vector instead of the overall number of features
- The same results, showing F1 vs. the sparsity of the document vectors represented with the selected features
- (Plot: F1 vs. average number of nonzero components for SVM-Normal, InfoGain, OddsRatio, and PercNormal)
33. Experiments with the Perceptron classifier
- Does not benefit from feature selection
- Perceptron-normal and SVM-normal feature selection give comparable performance
- (Plot: performance for SVM-Normal, InfoGain, PercNormal, and OddsRatio)
34. Experiments with the linear SVM classifier
- Does not benefit from feature selection
- SVM-normal gives the best performance
- (Plot: performance for SVM-Normal, OddsRatio, InfoGain, and PercNormal)
35. Discussion: using discarded features can help
- Features that harm performance if used as input were found to improve performance if used as additional output
  - obtain additional information by introducing a mapping from the selected features to the discarded features (the multitask learning setting, Caruana & de Sa 2003)
  - experiments on synthetic regression and classification problems and on real-world medical data have shown improvements in performance
- Intuition: transfer of information occurs inside the model when, in addition to the class value, it also models an additional output consisting of the discarded features
36. Discussion
- Feature subset selection as pre-processing, ignoring interaction with the target learning algorithm
  - Simple filters: work for a large number of features
    - assume feature independence; limited results
    - the size of the feature subset has to be determined
  - Filters: search a space of size 2^N, cannot handle many features
    - rely on general data characteristics (consistency, distance, class distribution)
- Feature subset selection as pre-processing, using the target learning algorithm for evaluation
  - Wrappers: high accuracy, computationally expensive
    - use model selection with cross-validation of the target algorithm, similar to metric-based model selection (e.g., comparing output on training and on unlabeled data)
- Feature subset selection during learning
  - uses the target learning algorithm during feature selection
  - Embedded approaches can also be used by filters to find the feature subset