Title: Ensembles: Issues and Applications
1 Ensembles: Issues and Applications
- Alexey Tsymbal
- Department of Computer Science, Trinity College Dublin, Ireland
2 Contents
- Introduction: knowledge discovery and data mining, the task of classification, ensemble classification
- What makes a good ensemble?
- Bagging, Boosting
- Evaluation of ensembles
- Comprehensibility of ensembles
- Overfitting in ensembles
- Ensemble feature selection for acute abdominal pain classification
- Ensembles for streaming data processing in the presence of concept drift
3 Knowledge discovery and data mining
- Knowledge Discovery in Databases (KDD) is an emerging area that considers the process of finding previously unknown and potentially interesting patterns and relations in large databases.
- Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R., Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1997.
4 The task of classification
J classes, n training observations, p instance attributes.
(Diagram: a training set and a new instance to be classified feed into the CLASSIFICATION step, which outputs the class membership of the new instance.)
5 What is ensemble learning?
Ensemble learning refers to a collection of
methods that learn a target function by training
a number of individual learners and combining
their predictions
6 Ensemble learning
7 Ensembles: scientific communities
- machine learning (Machine Learning Research: Four Current Directions, Dietterich 1997)
- knowledge discovery and data mining (Han & Kamber 2000, Data Mining: Concepts and Techniques; current trends)
- artificial intelligence (multi-agent systems; Russell & Norvig 2002, Artificial Intelligence: A Modern Approach, 2nd ed.)
- neural networks
- statistics
- computational learning theory
- pattern recognition
- what else?
8 Ensembles: different names
Names for the ensemble:
- multiple models
- multiple classifier systems
- combining classifiers (regressors etc.)
- integration of classifiers
- mixture of experts
- decision committee
- committee of experts
- classifier fusion
- multimodel learning
- consensus theory
- what else?
Names for the ensemble members:
- base classifiers
- component classifiers
- individual classifiers
- members (of a decision committee)
- level-0 experts
- what else?
9 Why ensemble learning?
- Accuracy: a more reliable mapping can be obtained by combining the output of multiple experts.
- Efficiency: a complex problem can be decomposed into multiple sub-problems that are easier to understand and solve (divide-and-conquer approach); mixture of experts, ensemble feature selection.
- There is no single model that works for all pattern recognition problems (no free lunch theorem). "To solve really hard problems, we'll have to use several different representations. It is time to stop arguing over which type of pattern-classification technique is best. Instead we should work at a higher level of organization and discover how to build managerial systems to exploit the different virtues and evade the different limitations of each of these ways of comparing things." (Minsky, 1991)
10 When ensemble learning?
- When you can build base classifiers that are more accurate than chance and, more importantly,
- that are as independent from each other as possible
11 Why do ensembles work? 1/3
Because uncorrelated errors of individual classifiers can be eliminated by averaging. Assume 40 base classifiers combined by majority voting, each with an error rate of 0.3. The probability of observing r misclassifications among the 40 classifiers follows a binomial distribution (Dietterich, 1997); the majority vote errs only when r ≥ 21, and P(r ≥ 21) ≈ 0.002.
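A minimal sketch, not from the original slides, that reproduces this figure under the independence assumption using only the Python standard library:

from math import comb

def majority_vote_error(n_classifiers: int, p_err: float) -> float:
    # Probability that more than half of n independent classifiers,
    # each with error rate p_err, are wrong on a given instance.
    threshold = n_classifiers // 2 + 1  # minimum number of wrong votes
    return sum(
        comb(n_classifiers, r) * p_err ** r * (1 - p_err) ** (n_classifiers - r)
        for r in range(threshold, n_classifiers + 1)
    )

print(majority_vote_error(40, 0.3))  # ~0.002, matching the slide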
12 Why do ensembles work? 2/3
The desired target function may not be implementable with individual classifiers, but may be approximated by ensemble averaging. Assume you want to build a diagonal decision boundary with decision trees. The decision boundaries of decision trees are hyperplanes parallel to the coordinate axes, as in the figures below. By averaging a large number of such staircases, the diagonal decision boundary can be approximated with arbitrarily small error.
(Figures a and b: two axis-parallel, staircase-shaped decision-tree boundaries separating Class 1 from Class 2.)
13 Why do ensembles work? 3/3
- Theoretical results by Hansen & Salamon (1990)
- If we can assume that classifiers err independently at random and their accuracy is > 50%, we can push ensemble accuracy arbitrarily high by combining more classifiers
- Key assumption: classifiers are independent in their predictions
  - not a very reasonable assumption
- More realistically: for the data points on which classifiers predict with > 50% accuracy, accuracy can be pushed arbitrarily high (some data points are just too difficult)
14 How to make an effective ensemble?
- Two basic decisions when designing ensembles
- How to generate the base classifiers?
- How to integrate them?
15 Methods for generating the base classifiers
- Subsampling the training examples: multiple hypotheses are generated by training individual classifiers on different datasets obtained by resampling a common training set (Bagging, Boosting)
- Manipulating the input features: multiple hypotheses are generated by training individual classifiers on different representations, or different subsets, of a common feature vector (see the sketch after this list)
- Manipulating the output targets: the output targets for C classes are encoded with an l-bit codeword, and an individual classifier is built to predict each one of the bits in the codeword; additional auxiliary targets may be used to differentiate classifiers
- Modifying the learning parameters of the classifier: a number of classifiers are built with different learning parameters, such as the number of neighbors in a kNN rule, initial weights in an MLP, etc.
- Using heterogeneous models (not often used)
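As an illustration of the second option (manipulating the input features), here is a minimal sketch of the random subspace method; the base learner (a scikit-learn decision tree), the subspace size, and the majority-voting combiner are illustrative assumptions, not taken from the slides:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def random_subspace_ensemble(X, y, n_classifiers=25, subspace_size=5):
    # Train each base classifier on a randomly drawn subset of the features.
    ensemble = []
    for _ in range(n_classifiers):
        features = rng.choice(X.shape[1], size=subspace_size, replace=False)
        ensemble.append((features, DecisionTreeClassifier().fit(X[:, features], y)))
    return ensemble

def predict_majority(ensemble, X):
    # Combine the base predictions by simple majority voting (integer class labels).
    votes = np.array([clf.predict(X[:, f]) for f, clf in ensemble])
    return np.apply_along_axis(lambda c: np.bincount(c).argmax(), 0, votes)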
16 Learning algorithms in ensembles
- Decision tree learning: ID3, C4.5 (Quinlan)
- Instance-based learning: k-nearest neighbor classification, PEBLS (Cost & Salzberg)
- Bayesian classification: Naïve Bayes (John)
- Neural networks (MLPs etc.)
- Discriminant analysis
- Regression analysis
17 Ensembles: the need for disagreement
- Overall error depends on the average error of the ensemble members
- Increasing ambiguity decreases the overall error, provided it does not result in an increase in the average error
- (Krogh and Vedelsby, 1995; see the decomposition below)
18 Measuring ensemble diversity
A is the ensemble ambiguity, measured as the weighted average of the squared differences between the predictions of the base networks and of the ensemble (regression case).
For pairs of classifiers, diversity is often measured with Yule's Q statistic (1900); see Kuncheva, 2003.
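For reference, Yule's Q statistic for a pair of classifiers C_i and C_k is commonly written as follows (the standard definition used in Kuncheva's work, reconstructed rather than copied from the slide), where N^{ab} counts the instances on which C_i is correct (a = 1) or wrong (a = 0) and C_k is correct (b = 1) or wrong (b = 0):

Q_{i,k} = (N^{11} N^{00} - N^{01} N^{10}) / (N^{11} N^{00} + N^{01} N^{10})

Q ranges from -1 to 1; statistically independent classifiers give Q = 0, and classifiers that tend to err on the same instances give positive Q.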
19 Integration of classifiers
Integration approaches:
- Selection
  - Static: Static Selection (CVM)
  - Dynamic: Dynamic Selection (DS)
- Combination
  - Static: Weighted Voting (WV)
  - Dynamic: Dynamic Voting (DV); Dynamic Voting with Selection (DVS) combines both selection and voting
Motivation for dynamic integration: the main assumption is that each classifier is the best in some sub-areas of the whole data set, where its local error is comparatively lower than the corresponding errors of the other classifiers.
20 Problems of voting: an example
21 Stacked generalization framework (Stacking; Wolpert, 1992)
22 Arbiter meta-learning tree (Chan & Stolfo, 1997)
Meta-learning is the use of learning algorithms to learn how to integrate results from multiple learning systems. An arbiter is a classifier that is trained to resolve disagreements between the base classifiers. An arbiter tree is a hierarchical (multi-level) structure composed of arbiters that are computed in a bottom-up, binary-tree fashion. When a new instance is classified by the arbiter tree in the application phase, predictions flow from the leaves to the root of the tree.
23 The space model: motivation for dynamic integration
Information about the methods' errors on training instances can be used for learning, just as the original instances are used for learning.
The main assumption is that each data mining method is best in some subareas of the whole application domain, where its local error is comparatively lower than the corresponding errors of the other available methods.
24 Dynamic integration of classifiers
The goal is to use each base classifier just in
the subdomain where it is most reliable (or to
use a weight proportional to its local
reliability) and thus to achieve overall results
that can be considerably better than those of the
best individual classifier alone.
25 Dynamic integration of classifiers: an example
26 EFS_SBC experiments: results (Acute Abdominal Pain dataset)
27 Bagging
BAGGing = Bootstrap AGGregation (Breiman, 1996). Bootstrap? (see the sketch below)
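A minimal sketch of the bagging procedure (bootstrap resampling plus majority voting); the base learner and the number of classifiers are illustrative assumptions, not taken from the slides:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def bagging_fit(X, y, n_classifiers=40):
    # Train each base classifier on a bootstrap sample of the training set,
    # i.e. n instances drawn with replacement from the n original instances.
    models = []
    for _ in range(n_classifiers):
        idx = rng.integers(0, len(X), size=len(X))
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    # Aggregate the bootstrap models by majority vote (integer class labels).
    votes = np.array([m.predict(X) for m in models])
    return np.apply_along_axis(lambda c: np.bincount(c).argmax(), 0, votes)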
28 Bootstrapping: a bit of history
- Rudolf Raspe, Baron Munchausen's Narrative of his Marvellous Travels and Campaigns in Russia, 1785
- He hauls himself and his horse out of the mud by lifting himself by his own hair.
- By the 1860s the story had been modified in the USA to refer to "bootstraps"
- Since the 1860s the term has also been used to refer to doing something on your own, without external help
- Since the 1950s it has referred to the procedure of getting a computer to start (to boot, to reboot)
29-36 (image-only slides, no transcript available)
37 Weak learning: motivation for Boosting
- Schapire showed that a set of weak learners (learners with accuracy > 50%, but not much greater) can be combined into a strong learner
- Idea: weight the data set based on how well we have predicted the data points so far:
  - data points predicted correctly -> low weight
  - data points mispredicted -> high weight
- Result: this focuses the base classifiers on portions of the data space not previously well predicted (see the sketch below)
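A minimal sketch of this reweighting idea in the style of AdaBoost.M1; the slides do not name a particular boosting algorithm, so the weight-update formulas and the decision-stump base learner below are standard choices, not taken from the slides, and labels are assumed to be in {-1, +1}:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=50):
    # Boosting with instance reweighting; y must contain labels -1 and +1.
    n = len(X)
    w = np.full(n, 1.0 / n)                 # start with uniform instance weights
    models, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = w[pred != y].sum() / w.sum()
        if err >= 0.5:                      # no better than chance: stop
            break
        alpha = 0.5 * np.log((1 - err) / (err + 1e-12))
        w *= np.exp(-alpha * y * pred)      # mispredicted points get higher weight
        w /= w.sum()
        models.append(stump)
        alphas.append(alpha)
    return models, alphas

def adaboost_predict(models, alphas, X):
    # Weighted vote of the base classifiers.
    return np.sign(sum(a * m.predict(X) for a, m in zip(alphas, models)))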
38-47 (image-only slides, no transcript available)
48 Some theories on bagging/boosting
- Error = Bayes optimal error + bias + variance
- Bayes optimal error = noise error
- Theories:
  - bagging can reduce the variance part of the error
  - boosting can reduce variance AND bias
  - bagging will hardly ever increase error
  - boosting may increase error
  - boosting is susceptible to noise
49 Comprehensibility of ensembles
Comprehensibility is commonly an important shortcoming of ensembles, because ensembles are often too complex and difficult for an expert to understand, even when the base classifiers are simple (one DT is easy to interpret, but how about 200 DTs?). Two ways are currently known to cope with this problem and to achieve understanding and explanation in ensembles:
1) Black-box approach: the behaviour of an ensemble is approximated with a single model (Domingos, 1998).
2) Decomposition approach: the ensemble is decomposed into a set of rules (Cunningham, ECML/PKDD 2002).
50 Overfitting
Formal definition:
- Consider the error of hypothesis h over
  - the training data: error_train(h)
  - the entire distribution D of data: error_D(h)
- Hypothesis h ∈ H overfits the training data if there is an alternative hypothesis h′ ∈ H such that
  error_train(h) < error_train(h′) and error_D(h) > error_D(h′)
51 Overfitting
(Figure: accuracy plotted against the size of the search space (nodes, epochs, etc.); training-set accuracy AccTrS keeps increasing, while test-set accuracy AccTestS peaks at the marked "knee point".)
52 Overfitting in ensembles
- Not much research has been done on this to date
- A surprising recent finding: ensembles of overfitted base classifiers (DTs, ANNs) are in many cases better than ensembles of non-overfitted base classifiers
- This is most probably related to the fact that ensemble diversity is much higher in that case
53 Measuring overfitting in ensembles
54 Evaluating learned models
- Cross-validation (CV) the examples are randomly
split into v mutually exclusive partitions (the
folds) of approximately equal size. A sample is
formed by setting aside one of the v folds as the
test set, and the remaining folds make up the
training set. This creates v possible samples. As
each learned model is formed using one of the v
training sets, its generalization performance is
estimated on the corresponding test partition. - Random sampling or Monte Carlo cross-validation
is a special case of v-fold cross-validation
where a percentage of training examples
(typically 2/3) is randomly placed in the
training set, and the remaining examples are
placed in the test set. After learning takes
place on the training set, generalization
performance is estimated on the test set. This
whole process is repeated for many training/test
splits (usually 30) and the algorithm with the
best average generalization performance is
selected.
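A minimal sketch of the two evaluation schemes described above, using scikit-learn utilities; the classifier, fold counts, and split proportions are placeholders matching the text, not prescribed by the slides:

from sklearn.model_selection import KFold, ShuffleSplit, cross_val_score
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()

# v-fold cross-validation: each fold serves once as the test set
vfold = KFold(n_splits=10, shuffle=True, random_state=0)

# Monte Carlo cross-validation: 30 random splits, roughly 2/3 train / 1/3 test
monte_carlo = ShuffleSplit(n_splits=30, train_size=2 / 3, random_state=0)

def estimate_generalization(X, y):
    # Average test-set accuracy over the folds / random splits.
    return (cross_val_score(clf, X, y, cv=vfold).mean(),
            cross_val_score(clf, X, y, cv=monte_carlo).mean())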
55 Ensemble evaluation: problems
(Salzberg, 1999) Statistically invalid conclusions are to be avoided. When one repeatedly searches a large database with powerful algorithms, it is all too easy to find a phenomenon or pattern that looks impressive, even when there is nothing to discover. This is the so-called multiplicity effect. The effect is even more important when the same search is done within one scientific community. Literature skewness effect: there is substantial danger that published results, even when using the appropriate significance tests, will be mere accidents of chance. One must be very careful in the design of an experimental study using publicly available databases, such as UCI. Proposed solution: use important real datasets and make the UCI ML repository larger.
56 Feature selection: motivation
- Build better predictors: better-quality predictors (classifiers/regressors) can be built by removing irrelevant and redundant features
- Economy of representation and comprehensibility: allow the problem/phenomenon to be represented as succinctly as possible
- Knowledge discovery: discover which features are and are not influential in weak-theory domains
57 Problems of global feature selection
- Most feature selection methods ignore the fact that some features may be relevant only in context (i.e. in some regions of the instance space) and cover the whole instance space with a single set of features
- They may discard features that are highly relevant in a restricted region of the instance space, because this relevance is swamped by their irrelevance everywhere else
- They may retain features that are relevant in most of the space but still unnecessarily confuse the classifier in some regions
- Global feature selection can lead to poor performance in minority-class prediction, which is often the case in practice (e.g. many negative/"no disease" instances in medical diagnostics) (Cardie and Howe 1997)
58 Feature-space heterogeneity
- There exist many complicated data mining problems where the relevant features are different in different regions of the feature space.
- Types of feature heterogeneity:
  - Class heterogeneity
  - Feature-value heterogeneity
  - Instance-space heterogeneity
59 Ensemble Feature Selection
- How to prepare inputs for the generation of the base classifiers?
  - Sample the training set
  - Manipulate the input features
  - Manipulate the output targets (class values)
- Goal of traditional feature selection: find and remove features that are unhelpful or destructive to learning, producing one feature subset for a single classifier
- Goals of ensemble feature selection:
  - find and remove features that are unhelpful or destructive to learning, producing a different feature subset for each of a number of classifiers
  - find feature subsets that will promote disagreement between the classifiers
60 Advanced local feature selection
(Diagram: the training set TS passes through Local Feature Filtering, producing feature subsets FS1, FS2, ..., FSk; a learning algorithm builds classifiers C1, C2, ..., Ck from them; for a new instance x0, dynamic integration uses the local accuracies LAcc1, LAcc2, ..., LAcck to perform Dynamic Selection or Dynamic Voting.)
61 Classification of acute abdominal pain
- 3 large datasets with cases of acute abdominal pain (AAP): 1254, 2286, and 4020 instances, with 18 parameters (features) from history-taking and clinical examination
- the task of separating acute appendicitis, the second most important cause of abdominal surgeries
- AAP I is from 6 surgical departments in Germany, AAP II from 14 centers in Germany, and AAP III from 16 centers in Central and Eastern Europe
- the 18 features are standardized by the World Organization of Gastroenterology (OMGE)
Features: 1 Sex, 2 Age, 3 Progress of pain, 4 Duration of pain, 5 Type of pain, 6 Severity of pain, 7 Location of pain at present, 8 Location of pain at onset, 9 Previous similar complaints, 10 Previous abdominal operation, 11 Distended abdomen, 12 Tenderness, 13 Severity of tenderness, 14 Movement of abdominal wall, 15 Rigidity, 16 Rectal tenderness, 17 Rebound tenderness, 18 Leukocytes
The data sets for research were kindly provided
by the Laboratory for System Design, Faculty
of Electrical Engineering and Computer Science,
University of Maribor, Slovenia, and the
Theoretical Surgery Unit, Department of General
and Trauma Surgery, Heinrich-Heine University,
Düsseldorf, Germany
62 Search in EFS
- Search space: 2^NumOfFeatures subsets per classifier, i.e. 2^18 · 25 = 6,553,600 candidate subsets in total
- 4 search strategies to heuristically explore the
search space - Hill-Climbing (HC) (CBMS2002)
- Ensemble Forward Sequential Selection (EFSS)
- Ensemble Backward Sequential Selection (EBSS)
- Genetic Ensemble Feature Selection (GEFS)
63 Hill-Climbing (HC) strategy (CBMS 2002)
- Generation of initial feature subsets using the random subspace method (RSM)
- A number of refining passes over each feature subset, as long as there is an improvement in fitness
64 Ensemble Forward Sequential Selection (EFSS)
(Figure: forward selection illustration)
65 Ensemble Backward Sequential Selection (EBSS)
(Figure: backward elimination illustration)
66 Genetic Ensemble Feature Selection (GEFS)
67 An example: EFSS on AAP III, α = 4
- C1: f2 (age), f7 (location of pain at present)
- C2: f6 (severity of pain), f13 (severity of tenderness)
- C3: f6 (severity of pain), f13 (severity of tenderness)
- C4: f9 (previous similar complaints), f14 (movement of abdominal wall)
- C5: f3 (progress of pain), f15 (rigidity)
- C6: f2 (age), f16 (rectal tenderness)
- C7: f1 (sex), f12 (tenderness)
- C8: f4 (duration of pain), f18 (leukocytes)
- C9: f4 (duration of pain)
- C10: f11 (distended abdomen)
68 Experiments: results
69 Feature importance table (EFSS, α = 0)
70 Ensembles for tracking concept drift in streaming data
- The presence of a changing target concept is called concept drift (CD).
- Many real-world examples: people's preferences for products, SPAM, credit card fraud, intrusion detection, etc.
- Algorithms that track CD must be able to identify a change in the target concept without direct knowledge of the underlying shift in the distribution.
- Two types of CD are distinguished: sudden and gradual.
- Two basic kinds of algorithms to track CD: (1) window-based and (2) ensemble-based.
71 Concept drift: a definition
- Consider 2 target concepts, A and B.
- A sequence of instances i1 ... in.
- Before i_d the concept A is stable
- After i_(d+dx) the concept B is stable
- Between i_d and i_(d+dx) the concept is drifting between A and B
- When dx = 1, the concept drift is sudden; otherwise it is gradual
- The CD is usually modelled with a linear function α (see the formula below)
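One common way to write such a linear model (a reconstruction; the slide does not spell out the formula) is to let α(t) be the probability that instance i_t is generated by concept B rather than A:

α(t) = 0 for t < d, α(t) = (t - d) / dx for d ≤ t ≤ d + dx, and α(t) = 1 for t > d + dx

so sudden drift (dx = 1) is a step function, while gradual drift is a slow linear ramp.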
72 Data streams
- Much in common with the concept of CD
- Many current information systems are real-time production systems that generate tremendous amounts of data: network event logs, telephone call records, credit card transaction flows, sensor and surveillance video streams, etc.
- Knowledge discovery on streaming data is a research topic of growing interest
- Incremental or online learning algorithms are known, but they do not take CD into account!
73 Data stream + CD + Ensemble
- An ensemble can be trained from sequential data chunks in the stream (see the sketch below)
- It may solve the problem of many conflicting concepts (not only 2)
- Optimal ensemble search solves the CD problem at one level
- Dynamic integration solves the CD problem at the instance (or local) level
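A minimal sketch of a chunk-based streaming ensemble (one base classifier per data chunk, the oldest member dropped when the ensemble is full, members weighted by their accuracy on the newest chunk); the base learner, chunking, and weighting scheme are illustrative assumptions, not the specific algorithm of the slides:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

class ChunkEnsemble:
    # Keeps one base classifier per recent chunk of the data stream.
    def __init__(self, max_members=10):
        self.max_members = max_members
        self.members = []                   # list of (classifier, weight) pairs

    def update(self, X_chunk, y_chunk):
        # Re-weight existing members by their accuracy on the newest chunk,
        # which down-weights members trained on outdated concepts.
        self.members = [(clf, (clf.predict(X_chunk) == y_chunk).mean())
                        for clf, _ in self.members]
        self.members.append((DecisionTreeClassifier().fit(X_chunk, y_chunk), 1.0))
        if len(self.members) > self.max_members:
            self.members.pop(0)             # forget the oldest member

    def predict(self, X):
        # Weighted voting over the members (assumes integer class labels).
        labels = np.array([clf.predict(X) for clf, _ in self.members])
        weights = np.array([w for _, w in self.members])
        scores = np.zeros((labels.max() + 1, X.shape[0]))
        for row, w in zip(labels, weights):
            scores[row, np.arange(X.shape[0])] += w
        return scores.argmax(axis=0)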
74 A Real-World Example: Database Filling
75 A Real-World Example: Diagnostics
76 A Real-World Example: Classifier Evaluation and Selection
77 Thank you!
- Alexey Tsymbal
- Department of Computer Science and Information Systems, University of Jyväskylä, FINLAND
- www.cs.jyu.fi/alexey/
- alexey_at_it.jyu.fi
78 The DI algorithm: keys
- T: training set for the base classifiers
- Ti: i-th fold of the training set
- T*: meta-level training set for the combining algorithm
- x: attributes of an instance
- c(x): classification of the instance with attributes x
- C: set of base classifiers
- Cj: j-th base classifier
- Cj(x): prediction produced by Cj on instance x
- Ej(x): estimation of the error of Cj on instance x
- Ej(x): prediction of the error of Cj on instance x
- m: number of base classifiers
- W: vector of weights for the base classifiers
- nn: number of nearby instances used for error prediction
- WNNi: weight of the i-th nearby instance
79 The DI algorithm: learning phase

procedure learning_phase(T, C)
begin   {fills in the meta-level training set T*}
   partition T into v folds
   loop for each fold Ti of T
      loop for j from 1 to m
         train(Cj, T - Ti)
      loop for x ∈ Ti
         loop for j from 1 to m
            compare Cj(x) with c(x) and derive Ej(x)
         collect (x, E1(x), ..., Em(x)) into T*
   loop for j from 1 to m
      train(Cj, T)
end
80 The DI algorithm: application phase

function DS_application_phase(T*, C, x) returns class of x
begin
   loop for j from 1 to m
      Ej ← (1/nn) · Σi WNNi · Ej(xNNi)   {WNN estimation}
   l ← arg minj Ej   {index of the classifier with the minimum local error;
                      the one with the least global error in the case of ties}
   return Cl(x)
end

function DV_application_phase(T*, C, x) returns class of x
begin
   loop for j from 1 to m
      Wj ← 1 - (1/nn) · Σi WNNi · Ej(xNNi)   {WNN estimation}
   return Weighted_Voting(W, C1(x), ..., Cm(x))
end
81 An integrated knowledge discovery management system
82 Experimental design
- three variations (DS, DV, and DVS), weighted voting (WV), and cross-validation majority (CVM)
- 30 random runs (30% testing, 70% training) for Monte Carlo CV
- 10-fold stratified CV for the cross-validation history
- Heterogeneous Euclidean-Overlap distance metric; 5 other metrics investigated
- UCI ML Repository data sets and the Appendicitis datasets
- interval discretization (equal-width intervals)
- PEBLS, C4.5, and BAYES learning algorithms for testing DI
- C4.5 for multi-level meta-learning, local feature selection, and bagging and boosting
- t-test for statistical significance
- the test environment is implemented within the MLC++ framework (the Machine Learning Library in C++)