Title: Diversity in Ensemble Feature Selection
1. Diversity in Ensemble Feature Selection
- Alexey Tsymbal, Department of Computer Science, Trinity College Dublin (a paper submitted to Information Fusion with Pádraig Cunningham and Nick Pechenizkiy)
2. Contents
- Introduction: ensemble learning
- Accuracy and diversity in regression and classification ensembles
- Ensembles of classifiers and feature selection: ensemble feature selection
- Search strategies in EFS: HC, EFSS, EBSS, and GEFS
- Diversity measures
- Experimental results
- Conclusions and future work
3. What is ensemble learning?
Ensemble learning refers to a collection of methods that learn a target function by training a number of individual learners and combining their predictions.
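A minimal sketch of this idea (not taken from the slides): several individual learners are trained and their class predictions are combined by majority vote. The dataset and base learners (iris, scikit-learn decision trees of different depths) are illustrative assumptions.

```python
from collections import Counter
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train a number of individual learners (trees of different depths).
learners = [DecisionTreeClassifier(max_depth=d, random_state=0).fit(X_train, y_train)
            for d in (1, 2, 3, 5)]

def majority_vote(models, X):
    """Combine the members' predictions by plain majority vote."""
    votes = [m.predict(X) for m in models]           # one prediction vector per model
    return [Counter(col).most_common(1)[0][0]        # most frequent class per instance
            for col in zip(*votes)]

y_pred = majority_vote(learners, X_test)
print(sum(p == t for p, t in zip(y_pred, y_test)) / len(y_test))   # ensemble accuracy
```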
4. Ensemble learning
5. Ensembles: scientific communities
- machine learning (Machine learning research: four current directions, Dietterich 1997)
- knowledge discovery and data mining (Han & Kamber 2000, Data Mining: Concepts and Techniques, current trends)
- artificial intelligence (multi-agent systems; Russell & Norvig 2002, AI: A Modern Approach, 2nd ed.)
- neural networks
- statistics
- computational learning theory
- pattern recognition
- what else?
6. Ensembles: different names
- multiple models
- multiple classifier systems
- combining classifiers (regressors, etc.)
- integration of classifiers
- mixture of experts
- decision committee
- committee of experts
- classifier fusion
- multimodel learning
- consensus theory
- what else?
Names for the ensemble members:
- base classifiers
- component classifiers
- individual classifiers
- members (of a decision committee)
- level-0 experts
- what else?
7. Why ensemble learning?
- Accuracy: a more reliable mapping can be obtained by combining the output of multiple experts
- Efficiency: a complex problem can be decomposed into multiple sub-problems that are easier to understand and solve (divide-and-conquer approach); mixture of experts, ensemble feature selection
- There is no single model that works for all pattern recognition problems! (no free lunch theorem)
"To solve really hard problems, we'll have to use several different representations. It is time to stop arguing over which type of pattern-classification technique is best. Instead we should work at a higher level of organization and discover how to build managerial systems to exploit the different virtues and evade the different limitations of each of these ways of comparing things." (Minsky, 1991)
8. When ensemble learning?
- When you can build base classifiers that are more accurate than chance, and, more importantly,
- that are as independent from each other as possible
- (Accuracy and Diversity!)
9. How to make an effective ensemble?
- Two basic decisions when designing ensembles:
- How to generate the base classifiers?
- How to integrate them?
10. Methods for generating the base classifiers
- Subsampling the training examples: multiple hypotheses are generated by training individual classifiers on different datasets obtained by resampling a common training set (Bagging, Boosting)
- Manipulating the input features: multiple hypotheses are generated by training individual classifiers on different representations, or different subsets, of a common feature vector (see the sketch after this list)
- Manipulating the output targets: the output targets for C classes are encoded with an l-bit codeword, and an individual classifier is built to predict each one of the bits in the codeword; additional auxiliary targets may be used to differentiate classifiers
- Modifying the learning parameters of the classifier: a number of classifiers are built with different learning parameters, such as the number of neighbors in a kNN rule, initial weights in an MLP, etc.
- Using heterogeneous models (not often used)
11. Ensembles: the need for disagreement
- Overall error depends on the average error of the ensemble members
- Increasing ambiguity decreases the overall error
- provided it does not result in an increase in the average error
- (Krogh and Vedelsby, 1995; the decomposition is restated below)
12. Measuring ensemble diversity
A is the ensemble ambiguity, measured as the weighted average of the squared differences between the predictions of the base networks and the ensemble (regression case).
Yule's Q statistic (1900) is a pairwise diversity measure for classifiers (Kuncheva, 2003). Both formulas are restated below.
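Both formulas referred to on this slide are standard and restated here since the slide's equation images are not reproduced: the ambiguity at an input x (regression case), and Yule's Q statistic for a pair of classifiers i and k, computed from the counts N^{ab} of instances on which classifier i is correct (a = 1) or wrong (a = 0) and classifier k is correct (b = 1) or wrong (b = 0).

```latex
% Ambiguity at input x: weighted average squared difference between the
% member predictions f_i(x) and the ensemble prediction \bar{f}(x).
\[
  a(x) \;=\; \sum_i w_i \big( f_i(x) - \bar{f}(x) \big)^2
\]

% Yule's Q statistic (1900) for classifiers i and k; Q is close to 0 for
% independent classifiers and grows as they tend to succeed and fail together.
\[
  Q_{i,k} \;=\; \frac{N^{11} N^{00} - N^{01} N^{10}}{N^{11} N^{00} + N^{01} N^{10}}
\]
```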
13. Diversity metrics
- Pairwise (two of these are sketched in code after this list):
- plain disagreement
- fail/non-fail disagreement
- Q statistic
- kappa statistic
- correlation coefficient
- Non-pairwise:
- entropy
- variance
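A minimal sketch (an illustration, not the paper's code) of two of the pairwise measures, computed from 0/1 vectors that mark whether each of two base classifiers classified each instance correctly:

```python
import numpy as np

def pairwise_diversity(correct_i, correct_j):
    """correct_i, correct_j: 0/1 arrays, 1 where the classifier was right."""
    ci = np.asarray(correct_i, dtype=bool)
    cj = np.asarray(correct_j, dtype=bool)

    # Fail/non-fail disagreement: fraction of instances on which exactly one
    # of the two classifiers is correct.
    disagreement = float(np.mean(ci != cj))

    # Yule's Q statistic, built from the four joint correctness counts.
    n11 = int(np.sum(ci & cj))
    n00 = int(np.sum(~ci & ~cj))
    n10 = int(np.sum(ci & ~cj))
    n01 = int(np.sum(~ci & cj))
    den = n11 * n00 + n01 * n10
    q = (n11 * n00 - n01 * n10) / den if den else 0.0   # guard against a zero denominator

    return disagreement, q
```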
14. Integration of classifiers
(diagram) Integration divides into Selection and Combination, each of which can be static or dynamic:
- Selection: Static Selection (CVM), Dynamic Selection (DS)
- Combination: Weighted Voting (WV, static), Dynamic Voting (DV), Dynamic Voting with Selection (DVS)
Motivation for Dynamic Integration: the main assumption is that each classifier is the best in some sub-areas of the whole data set, where its local error is lower than the corresponding errors of the other classifiers (a sketch of dynamic selection follows).
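A minimal sketch of the dynamic-selection (DS) idea under this assumption (an illustration, not the paper's implementation): for each test instance, estimate every member's local accuracy over the instance's k nearest training neighbours and let the locally best member make the prediction.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def dynamic_selection_predict(members, X_train, y_train, X_test, k=7):
    """members: already-fitted classifiers; per test instance, use the member
    with the highest accuracy over that instance's k nearest training neighbours."""
    nn = NearestNeighbors(n_neighbors=k).fit(X_train)
    _, neigh = nn.kneighbors(X_test)                        # (n_test, k) neighbour indices

    correct = np.array([m.predict(X_train) == y_train for m in members])  # (S, n_train)
    preds = np.array([m.predict(X_test) for m in members])                # (S, n_test)

    out = np.empty(len(X_test), dtype=preds.dtype)
    for t, idx in enumerate(neigh):
        local_acc = correct[:, idx].mean(axis=1)            # local accuracy of each member
        out[t] = preds[np.argmax(local_acc), t]             # locally best member predicts
    return out
```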
15. Problems of global feature selection
- Most feature selection methods ignore the fact that some features may be relevant only in context (i.e. in some regions of the instance space), and cover the whole instance space with a single set of features
- They may discard features that are highly relevant in a restricted region of the instance space because this relevance is swamped by their irrelevance everywhere else
- They may retain features that are relevant in most of the space, but still unnecessarily confuse the classifier in some regions
- Global feature selection can lead to poor performance in minority-class prediction, whereas this is often the case (e.g. many negative/no-disease instances in medical diagnostics) (Cardie and Howe, 1997)
16. Feature-space heterogeneity
- There exist many complicated data mining problems where the relevant features are different in different regions of the feature space.
- Types of feature heterogeneity:
- Class heterogeneity
- Feature-value heterogeneity
- Instance-space heterogeneity
17. Ensemble Feature Selection
- Goal of traditional feature selection:
- find and remove features that are unhelpful or destructive to learning, producing one feature subset for a single classifier
- Goals of ensemble feature selection:
- find and remove features that are unhelpful or destructive to learning, producing different feature subsets for a number of classifiers
- find feature subsets that will promote disagreement between the classifiers
18. Search in EFS
- Search space: 2^NumOfFeatures · NumOfClassifiers = 2^18 · 25 = 6,553,600
- 4 search strategies to heuristically explore the search space:
- Hill-Climbing (HC)
- Ensemble Forward Sequential Selection (EFSS)
- Ensemble Backward Sequential Selection (EBSS)
- Genetic Ensemble Feature Selection (GEFS)
19. Hill-Climbing (HC) strategy (CBMS 2002)
- Generation of the initial feature subsets using the random subspace method (RSM)
- A number of refining passes over each feature subset while there is improvement in fitness (see the sketch below)
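A minimal sketch of the HC strategy (an illustration under assumptions, not the paper's code): each base classifier's subset starts from the random subspace method, and refining passes flip one feature at a time while fitness improves. The `fitness(s, mask)` callable is an assumed interface; in the paper it combines the classifier's accuracy with its diversity from the rest of the ensemble, weighted by the alfa coefficient, but its exact form is not shown on these slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def hill_climb_subsets(n_features, n_members, fitness):
    """fitness(member_index, mask) -> float; higher is better (assumed interface)."""
    # Random subspace method: start each member from a random half of the features.
    masks = rng.random((n_members, n_features)) < 0.5

    for s in range(n_members):
        improved = True
        while improved:                          # refining passes while fitness improves
            improved = False
            best = fitness(s, masks[s])
            for f in range(n_features):          # try flipping each feature in or out
                masks[s, f] = not masks[s, f]
                new = fitness(s, masks[s])
                if new > best:
                    best, improved = new, True
                else:
                    masks[s, f] = not masks[s, f]   # revert a flip that did not help
    return masks
```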
20. Ensemble Forward Sequential Selection (EFSS)
(diagram: forward selection of each base classifier's feature subset)
21. Ensemble Backward Sequential Selection (EBSS)
(diagram: backward elimination, starting from the full feature set, e.g. features 1, 2, 3, 4)
22. Genetic Ensemble Feature Selection (GEFS)
23. Computational complexity
EFSS and EBSS: S · N · N' feature-subset evaluations, where S is the number of base classifiers, N is the total number of features, and N' is the number of features included or deleted on average in an FSS or BSS search. Example (EFSS): 25 · 18 · 3 = 1350 (and not 6,553,600!).
HC: S · N · Npasses, where Npasses is the average number of passes through the feature subsets in HC while there is some improvement.
GEFS: S · Ngen, where S is the number of individuals (feature subsets) in one generation, and Ngen is the number of generations.
24. An Example: EFSS on AAP III, alfa 4
Feature subsets selected for the ten base classifiers (shown as a table on the slide):
- C1, C2, C3: f2 age; f6 severity of pain; f6 severity of pain; f7 location of pain at present; f13 severity of tenderness; f13 severity of tenderness
- C4, C5, C6: f9 previous similar complaints; f3 progress of pain; f2 age; f14 movement of abdominal wall; f15 rigidity; f16 rectal tenderness
- C7, C8, C9: f1 sex; f4 duration of pain; f4 duration of pain; f12 tenderness; f18 leukocytes
- C10: f11 distended abdomen
25. Experimental results on AAP data sets
26. Search strategies on UCI data sets
27. Overfitting in EFS
28. The measures of total diversity
Table 3. Spearman's rank correlation coefficient (RCC) between the total ensemble diversity and the difference between the ensemble accuracy and the average base classifier accuracy (average, maximal and minimal values)
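A minimal sketch of the quantity summarized in Table 3 (with made-up numbers, not the paper's data): the rank correlation between each ensemble's total diversity and its accuracy gain over the average base classifier.

```python
import numpy as np
from scipy.stats import spearmanr

# One entry per generated ensemble (illustrative values only).
total_diversity = np.array([0.12, 0.25, 0.31, 0.18, 0.40])
ensemble_acc    = np.array([0.81, 0.84, 0.86, 0.82, 0.88])
avg_member_acc  = np.array([0.78, 0.79, 0.80, 0.79, 0.80])

# Spearman's RCC between total diversity and the improvement over the average member.
rcc, p_value = spearmanr(total_diversity, ensemble_acc - avg_member_acc)
print(rcc, p_value)
```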
29. Diversity as a component of the fitness function
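The chart on this slide is not reproduced. One plausible way diversity can enter the fitness used by the search strategies, stated here as an assumption rather than the slide's formula (the alfa coefficient mentioned on slides 24 and 30 would be its weight):

```latex
% One plausible form of the fitness of base classifier i during the search
% (an assumption for illustration, not a formula taken from the slides):
\[
  \mathrm{fitness}_i \;=\; \mathrm{acc}_i \;+\; \alpha \cdot \mathrm{div}_i
\]
```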
30. For more results and charts
(optimal alfa, best integration methods, the neighborhood for dynamic integration, overfitting in GA generations) see the journal paper:
http://www.cs.tcd.ie/publications/tech-reports/reports.03/TCD-CS-2003-44.pdf
31. Conclusions
- 4 new search strategies proposed and analyzed; GA is very promising
- 7 diversity measures
- the best search strategy and diversity measure depend on the context of their use (domain, data set characteristics, etc.)
- the best diversity measures on average: disagreement (both variants), kappa, entropy and variance
32. Future work
- regression? not many practical studies so far
- other diversity measures (DF, double fault) and other search strategies (SA, simulated annealing)
- theoretical dependencies for EFS in crisp classification
- automated prediction of the best search strategy and best diversity measure for a data set
- huge data sets: speech recognition (UC Berkeley), text classification
- closer investigation of GA as the best strategy on average
- data streams, and tracking concept drift
33. Thank you
- Alexey.Tsymbal_at_cs.tcd.ie
- http://www.cs.tcd.ie/publications/tech-reports/reports.03/TCD-CS-2003-44.pdf