Title: Diversity in Ensemble Feature Selection
1. Diversity in Ensemble Feature Selection
- Alexey Tsymbal, Department of Computer Science, Trinity College Dublin (a paper submitted to Information Fusion with Pádraig Cunningham and Nick Pechenizkiy)
2. Contents
- Introduction: ensemble learning
- Accuracy and diversity in regression and classification ensembles
- Ensembles of classifiers and feature selection: ensemble feature selection
- Search strategies in EFS: HC, EFSS, EBSS, and GEFS
- Diversity measures
- Experimental results
- Conclusions and future work
3. What is ensemble learning?
Ensemble learning refers to a collection of methods that learn a target function by training a number of individual learners and combining their predictions.
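A minimal sketch of this idea (not taken from the slides): several individual learners are trained and their class predictions are combined by majority vote. The dataset and base learners (iris, scikit-learn decision trees of different depths) are illustrative assumptions.

```python
from collections import Counter
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train a number of individual learners (trees of different depths).
learners = [DecisionTreeClassifier(max_depth=d, random_state=0).fit(X_train, y_train)
            for d in (1, 2, 3, 5)]

def majority_vote(models, X):
    """Combine the members' predictions by plain majority vote."""
    votes = [m.predict(X) for m in models]           # one prediction vector per model
    return [Counter(col).most_common(1)[0][0]        # most frequent class per instance
            for col in zip(*votes)]

y_pred = majority_vote(learners, X_test)
print(sum(p == t for p, t in zip(y_pred, y_test)) / len(y_test))   # ensemble accuracy
```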
4. Ensemble learning
5. Ensembles: scientific communities
- machine learning (Machine learning research: four current directions, Dietterich 1997)
- knowledge discovery and data mining (Han & Kamber 2000, Data Mining: Concepts and Techniques, current trends)
- artificial intelligence (multi-agent systems; Russell & Norvig 2002, AI: A Modern Approach, 2nd ed.)
- neural networks
- statistics
- computational learning theory
- pattern recognition
- what else?
6. Ensembles: different names
- multiple models
- multiple classifier systems
- combining classifiers (regressors, etc.)
- integration of classifiers
- mixture of experts
- decision committee
- committee of experts
- classifier fusion
- multimodel learning
- consensus theory
- what else?
Names for the ensemble members:
- base classifiers
- component classifiers
- individual classifiers
- members (of a decision committee)
- level-0 experts
- what else?
7. Why ensemble learning?
- Accuracy: a more reliable mapping can be obtained by combining the output of multiple experts
- Efficiency: a complex problem can be decomposed into multiple sub-problems that are easier to understand and solve (divide-and-conquer approach); mixture of experts, ensemble feature selection
- There is no single model that works for all pattern recognition problems! (no free lunch theorem)
"To solve really hard problems, we'll have to use several different representations. It is time to stop arguing over which type of pattern-classification technique is best. Instead we should work at a higher level of organization and discover how to build managerial systems to exploit the different virtues and evade the different limitations of each of these ways of comparing things." (Minsky, 1991)
8. When ensemble learning?
- When you can build base classifiers that are more accurate than chance, and, more importantly,
- that are as independent from each other as possible
- (Accuracy and Diversity!)
9. How to make an effective ensemble?
- Two basic decisions when designing ensembles:
- How to generate the base classifiers?
- How to integrate them?
10. Methods for generating the base classifiers
- Subsampling the training examples: multiple hypotheses are generated by training individual classifiers on different datasets obtained by resampling a common training set (Bagging, Boosting)
- Manipulating the input features: multiple hypotheses are generated by training individual classifiers on different representations, or different subsets, of a common feature vector (see the sketch after this list)
- Manipulating the output targets: the output targets for C classes are encoded with an l-bit codeword, and an individual classifier is built to predict each one of the bits in the codeword; additional auxiliary targets may be used to differentiate classifiers
- Modifying the learning parameters of the classifier: a number of classifiers are built with different learning parameters, such as the number of neighbors in a kNN rule, initial weights in an MLP, etc.
- Using heterogeneous models (not often used)
11. Ensembles: the need for disagreement
- Overall error depends on the average error of the ensemble members
- Increasing ambiguity decreases the overall error
- provided it does not result in an increase in the average error
- (Krogh and Vedelsby, 1995; the decomposition is restated below)
12. Measuring ensemble diversity
A is the ensemble ambiguity, measured as the weighted average of the squared differences between the predictions of the base networks and the ensemble (regression case).
Yule's Q statistic (1900) is a pairwise diversity measure for classifiers (Kuncheva, 2003). Both formulas are restated below.
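Both formulas referred to on this slide are standard and restated here since the slide's equation images are not reproduced: the ambiguity at an input x (regression case), and Yule's Q statistic for a pair of classifiers i and k, computed from the counts N^{ab} of instances on which classifier i is correct (a = 1) or wrong (a = 0) and classifier k is correct (b = 1) or wrong (b = 0).

```latex
% Ambiguity at input x: weighted average squared difference between the
% member predictions f_i(x) and the ensemble prediction \bar{f}(x).
\[
  a(x) \;=\; \sum_i w_i \big( f_i(x) - \bar{f}(x) \big)^2
\]

% Yule's Q statistic (1900) for classifiers i and k; Q is close to 0 for
% independent classifiers and grows as they tend to succeed and fail together.
\[
  Q_{i,k} \;=\; \frac{N^{11} N^{00} - N^{01} N^{10}}{N^{11} N^{00} + N^{01} N^{10}}
\]
```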
13. Diversity metrics
- Pairwise (two of these are sketched in code after this list):
- plain disagreement
- fail/non-fail disagreement
- Q statistic
- kappa statistic
- correlation coefficient
- Non-pairwise:
- entropy
- variance
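A minimal sketch (an illustration, not the paper's code) of two of the pairwise measures, computed from 0/1 vectors that mark whether each of two base classifiers classified each instance correctly:

```python
import numpy as np

def pairwise_diversity(correct_i, correct_j):
    """correct_i, correct_j: 0/1 arrays, 1 where the classifier was right."""
    ci = np.asarray(correct_i, dtype=bool)
    cj = np.asarray(correct_j, dtype=bool)

    # Fail/non-fail disagreement: fraction of instances on which exactly one
    # of the two classifiers is correct.
    disagreement = float(np.mean(ci != cj))

    # Yule's Q statistic, built from the four joint correctness counts.
    n11 = int(np.sum(ci & cj))
    n00 = int(np.sum(~ci & ~cj))
    n10 = int(np.sum(ci & ~cj))
    n01 = int(np.sum(~ci & cj))
    den = n11 * n00 + n01 * n10
    q = (n11 * n00 - n01 * n10) / den if den else 0.0   # guard against a zero denominator

    return disagreement, q
```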
14. Integration of classifiers
(diagram) Integration divides into Selection and Combination, each of which can be static or dynamic:
- Selection: Static Selection (CVM), Dynamic Selection (DS)
- Combination: Weighted Voting (WV, static), Dynamic Voting (DV), Dynamic Voting with Selection (DVS)
Motivation for Dynamic Integration: the main assumption is that each classifier is the best in some sub-areas of the whole data set, where its local error is lower than the corresponding errors of the other classifiers (a sketch of dynamic selection follows).
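A minimal sketch of the dynamic-selection (DS) idea under this assumption (an illustration, not the paper's implementation): for each test instance, estimate every member's local accuracy over the instance's k nearest training neighbours and let the locally best member make the prediction.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def dynamic_selection_predict(members, X_train, y_train, X_test, k=7):
    """members: already-fitted classifiers; per test instance, use the member
    with the highest accuracy over that instance's k nearest training neighbours."""
    nn = NearestNeighbors(n_neighbors=k).fit(X_train)
    _, neigh = nn.kneighbors(X_test)                        # (n_test, k) neighbour indices

    correct = np.array([m.predict(X_train) == y_train for m in members])  # (S, n_train)
    preds = np.array([m.predict(X_test) for m in members])                # (S, n_test)

    out = np.empty(len(X_test), dtype=preds.dtype)
    for t, idx in enumerate(neigh):
        local_acc = correct[:, idx].mean(axis=1)            # local accuracy of each member
        out[t] = preds[np.argmax(local_acc), t]             # locally best member predicts
    return out
```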
15. Problems of global feature selection
- Most feature selection methods ignore the fact that some features may be relevant only in context (i.e. in some regions of the instance space), and cover the whole instance space with a single set of features
- They may discard features that are highly relevant in a restricted region of the instance space because this relevance is swamped by their irrelevance everywhere else
- They may retain features that are relevant in most of the space, but still unnecessarily confuse the classifier in some regions
- Global feature selection can lead to poor performance in minority-class prediction, whereas this is often the case (e.g. many negative/no-disease instances in medical diagnostics) (Cardie and Howe, 1997)
16. Feature-space heterogeneity
- There exist many complicated data mining problems where the relevant features are different in different regions of the feature space.
- Types of feature heterogeneity:
- Class heterogeneity
- Feature-value heterogeneity
- Instance-space heterogeneity
17. Ensemble Feature Selection
- Goal of traditional feature selection:
- find and remove features that are unhelpful or destructive to learning, producing one feature subset for a single classifier
- Goals of ensemble feature selection:
- find and remove features that are unhelpful or destructive to learning, producing different feature subsets for a number of classifiers
- find feature subsets that will promote disagreement between the classifiers
18. Search in EFS
- Search space: 2^NumOfFeatures · NumOfClassifiers = 2^18 · 25 = 6,553,600
- 4 search strategies to heuristically explore the search space:
- Hill-Climbing (HC)
- Ensemble Forward Sequential Selection (EFSS)
- Ensemble Backward Sequential Selection (EBSS)
- Genetic Ensemble Feature Selection (GEFS)
19. Hill-Climbing (HC) strategy (CBMS 2002)
- Generation of the initial feature subsets using the random subspace method (RSM)
- A number of refining passes over each feature subset while there is improvement in fitness (see the sketch below)
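A minimal sketch of the HC strategy (an illustration under assumptions, not the paper's code): each base classifier's subset starts from the random subspace method, and refining passes flip one feature at a time while fitness improves. The `fitness(s, mask)` callable is an assumed interface; in the paper it combines the classifier's accuracy with its diversity from the rest of the ensemble, weighted by the alfa coefficient, but its exact form is not shown on these slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def hill_climb_subsets(n_features, n_members, fitness):
    """fitness(member_index, mask) -> float; higher is better (assumed interface)."""
    # Random subspace method: start each member from a random half of the features.
    masks = rng.random((n_members, n_features)) < 0.5

    for s in range(n_members):
        improved = True
        while improved:                          # refining passes while fitness improves
            improved = False
            best = fitness(s, masks[s])
            for f in range(n_features):          # try flipping each feature in or out
                masks[s, f] = not masks[s, f]
                new = fitness(s, masks[s])
                if new > best:
                    best, improved = new, True
                else:
                    masks[s, f] = not masks[s, f]   # revert a flip that did not help
    return masks
```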
20. Ensemble Forward Sequential Selection (EFSS)
(diagram: forward selection of each base classifier's feature subset)
21. Ensemble Backward Sequential Selection (EBSS)
(diagram: backward elimination, starting from the full feature set, e.g. features 1, 2, 3, 4)
22. Genetic Ensemble Feature Selection (GEFS)
23. Computational complexity
EFSS and EBSS: S · N · N' feature-subset evaluations, where S is the number of base classifiers, N is the total number of features, and N' is the number of features included or deleted on average in an FSS or BSS search. Example (EFSS): 25 · 18 · 3 = 1350 (and not 6,553,600!).
HC: S · N · Npasses, where Npasses is the average number of passes through the feature subsets in HC while there is some improvement.
GEFS: S · Ngen, where S is the number of individuals (feature subsets) in one generation, and Ngen is the number of generations.
24. An Example: EFSS on AAP III, alfa 4
Feature subsets selected for the ten base classifiers (shown as a table on the slide):
- C1, C2, C3: f2 age; f6 severity of pain; f6 severity of pain; f7 location of pain at present; f13 severity of tenderness; f13 severity of tenderness
- C4, C5, C6: f9 previous similar complaints; f3 progress of pain; f2 age; f14 movement of abdominal wall; f15 rigidity; f16 rectal tenderness
- C7, C8, C9: f1 sex; f4 duration of pain; f4 duration of pain; f12 tenderness; f18 leukocytes
- C10: f11 distended abdomen
25. Experimental results on AAP data sets
26. Search strategies on UCI data sets
27. Overfitting in EFS
28. The measures of total diversity
Table 3. Spearman's rank correlation coefficient (RCC) between the total ensemble diversity and the difference between the ensemble accuracy and the average base classifier accuracy (average, maximal and minimal values)
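A minimal sketch of the quantity summarized in Table 3 (with made-up numbers, not the paper's data): the rank correlation between each ensemble's total diversity and its accuracy gain over the average base classifier.

```python
import numpy as np
from scipy.stats import spearmanr

# One entry per generated ensemble (illustrative values only).
total_diversity = np.array([0.12, 0.25, 0.31, 0.18, 0.40])
ensemble_acc    = np.array([0.81, 0.84, 0.86, 0.82, 0.88])
avg_member_acc  = np.array([0.78, 0.79, 0.80, 0.79, 0.80])

# Spearman's RCC between total diversity and the improvement over the average member.
rcc, p_value = spearmanr(total_diversity, ensemble_acc - avg_member_acc)
print(rcc, p_value)
```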
29. Diversity as a component of the fitness function
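The chart on this slide is not reproduced. One plausible way diversity can enter the fitness used by the search strategies, stated here as an assumption rather than the slide's formula (the alfa coefficient mentioned on slides 24 and 30 would be its weight):

```latex
% One plausible form of the fitness of base classifier i during the search
% (an assumption for illustration, not a formula taken from the slides):
\[
  \mathrm{fitness}_i \;=\; \mathrm{acc}_i \;+\; \alpha \cdot \mathrm{div}_i
\]
```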
30. For more results and charts
(optimal alfa, best integration methods, the neighborhood for dynamic integration, overfitting in GA generations) see the journal paper:
http://www.cs.tcd.ie/publications/tech-reports/reports.03/TCD-CS-2003-44.pdf
31. Conclusions
- 4 new search strategies proposed and analyzed; GA is very promising
- 7 diversity measures
- the best search strategy and diversity measure depend on the context of their use (domain, data set characteristics, etc.)
- the best diversity measures on average: disagreement (both variants), kappa, entropy and variance
32. Future work
- regression? not many practical studies so far
- other diversity measures (DF, double fault) and other search strategies (SA, simulated annealing)
- theoretical dependencies for EFS in crisp classification
- automated prediction of the best search strategy and best diversity measure for a data set
- huge data sets: speech recognition (UC Berkeley), text classification
- closer investigation of GA as the best strategy on average
- data streams, and tracking concept drift
33. Thank you
- Alexey.Tsymbal_at_cs.tcd.ie
- http://www.cs.tcd.ie/publications/tech-reports/reports.03/TCD-CS-2003-44.pdf