Title: Ensembles: Issues and Applications
1 Ensembles: Issues and Applications
- Alexey Tsymbal
- Department of Computer Science, Trinity College Dublin, Ireland
2 Contents
- Introduction: knowledge discovery and data mining, the task of classification, ensemble classification
- What makes a good ensemble?
- Bagging, Boosting
- Evaluation of ensembles
- Comprehensibility of ensembles
- Overfitting in ensembles
- Ensemble feature selection for acute abdominal pain classification
- Ensembles for streaming data processing in the presence of concept drift
3 Knowledge discovery and data mining
- Knowledge Discovery in Databases (KDD) is an emerging area that considers the process of finding previously unknown and potentially interesting patterns and relations in large databases.
- Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R., Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1997.
4 The task of classification
J classes, n training observations, p instance attributes.
(Diagram: a training set and a new instance to be classified feed into the CLASSIFICATION step, which outputs the class membership of the new instance.)
5 What is ensemble learning?
Ensemble learning refers to a collection of
methods that learn a target function by training
a number of individual learners and combining
their predictions
6 Ensemble learning
7 Ensembles: scientific communities
- machine learning (Machine Learning Research: Four Current Directions, Dietterich 1997)
- knowledge discovery and data mining (Han & Kamber 2000, Data Mining: Concepts and Techniques; current trends)
- artificial intelligence (multi-agent systems; Russell & Norvig 2002, Artificial Intelligence: A Modern Approach, 2nd ed.)
- neural networks
- statistics
- computational learning theory
- pattern recognition
- what else?
8 Ensembles: different names
Names for the ensemble:
- multiple models
- multiple classifier systems
- combining classifiers (regressors etc.)
- integration of classifiers
- mixture of experts
- decision committee
- committee of experts
- classifier fusion
- multimodel learning
- consensus theory
- what else?
Names for the ensemble members:
- base classifiers
- component classifiers
- individual classifiers
- members (of a decision committee)
- level-0 experts
- what else?
9 Why ensemble learning?
- Accuracy: a more reliable mapping can be obtained by combining the output of multiple experts.
- Efficiency: a complex problem can be decomposed into multiple sub-problems that are easier to understand and solve (divide-and-conquer approach); mixture of experts, ensemble feature selection.
- There is no single model that works for all pattern recognition problems (no free lunch theorem). "To solve really hard problems, we'll have to use several different representations. It is time to stop arguing over which type of pattern-classification technique is best. Instead we should work at a higher level of organization and discover how to build managerial systems to exploit the different virtues and evade the different limitations of each of these ways of comparing things." (Minsky, 1991)
10 When ensemble learning?
- When you can build base classifiers that are more accurate than chance and, more importantly,
- that are as independent from each other as possible
11 Why do ensembles work? 1/3
Because uncorrelated errors of individual classifiers can be eliminated by averaging. Assume 40 base classifiers combined by majority voting, each with an error rate of 0.3. The probability of observing r misclassifications among the 40 classifiers follows a binomial distribution (Dietterich, 1997); the majority vote errs only when r ≥ 21, and P(r ≥ 21) ≈ 0.002.
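A minimal sketch, not from the original slides, that reproduces this figure under the independence assumption using only the Python standard library:

from math import comb

def majority_vote_error(n_classifiers: int, p_err: float) -> float:
    # Probability that more than half of n independent classifiers,
    # each with error rate p_err, are wrong on a given instance.
    threshold = n_classifiers // 2 + 1  # minimum number of wrong votes
    return sum(
        comb(n_classifiers, r) * p_err ** r * (1 - p_err) ** (n_classifiers - r)
        for r in range(threshold, n_classifiers + 1)
    )

print(majority_vote_error(40, 0.3))  # ~0.002, matching the slide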
12 Why do ensembles work? 2/3
The desired target function may not be implementable with individual classifiers, but may be approximated by ensemble averaging. Assume you want to build a diagonal decision boundary with decision trees. The decision boundaries of decision trees are hyperplanes parallel to the coordinate axes, as in the figures below. By averaging a large number of such staircases, the diagonal decision boundary can be approximated with arbitrarily small error.
(Figures a and b: two axis-parallel, staircase-shaped decision-tree boundaries separating Class 1 from Class 2.)
13 Why do ensembles work? 3/3
- Theoretical results by Hansen & Salamon (1990)
- If we can assume that classifiers err independently at random and their accuracy is > 50%, we can push ensemble accuracy arbitrarily high by combining more classifiers
- Key assumption: classifiers are independent in their predictions
  - not a very reasonable assumption
- More realistically: for the data points on which classifiers predict with > 50% accuracy, accuracy can be pushed arbitrarily high (some data points are just too difficult)
14 How to make an effective ensemble?
- Two basic decisions when designing ensembles
- How to generate the base classifiers?
- How to integrate them?
15 Methods for generating the base classifiers
- Subsampling the training examples: multiple hypotheses are generated by training individual classifiers on different datasets obtained by resampling a common training set (Bagging, Boosting)
- Manipulating the input features: multiple hypotheses are generated by training individual classifiers on different representations, or different subsets, of a common feature vector (see the sketch after this list)
- Manipulating the output targets: the output targets for C classes are encoded with an l-bit codeword, and an individual classifier is built to predict each one of the bits in the codeword; additional auxiliary targets may be used to differentiate classifiers
- Modifying the learning parameters of the classifier: a number of classifiers are built with different learning parameters, such as the number of neighbors in a kNN rule, initial weights in an MLP, etc.
- Using heterogeneous models (not often used)
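As an illustration of the second option (manipulating the input features), here is a minimal sketch of the random subspace method; the base learner (a scikit-learn decision tree), the subspace size, and the majority-voting combiner are illustrative assumptions, not taken from the slides:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def random_subspace_ensemble(X, y, n_classifiers=25, subspace_size=5):
    # Train each base classifier on a randomly drawn subset of the features.
    ensemble = []
    for _ in range(n_classifiers):
        features = rng.choice(X.shape[1], size=subspace_size, replace=False)
        ensemble.append((features, DecisionTreeClassifier().fit(X[:, features], y)))
    return ensemble

def predict_majority(ensemble, X):
    # Combine the base predictions by simple majority voting (integer class labels).
    votes = np.array([clf.predict(X[:, f]) for f, clf in ensemble])
    return np.apply_along_axis(lambda c: np.bincount(c).argmax(), 0, votes)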
16 Learning algorithms in ensembles
- Decision tree learning: ID3, C4.5 (Quinlan)
- Instance-based learning: k-nearest neighbor classification, PEBLS (Cost & Salzberg)
- Bayesian classification: Naïve Bayes (John)
- Neural networks (MLPs etc.)
- Discriminant analysis
- Regression analysis
17 Ensembles: the need for disagreement
- Overall error depends on the average error of the ensemble members
- Increasing ambiguity decreases the overall error, provided it does not result in an increase in the average error
- (Krogh and Vedelsby, 1995; see the decomposition below)
18 Measuring ensemble diversity
A is the ensemble ambiguity, measured as the weighted average of the squared differences between the predictions of the base networks and of the ensemble (regression case).
For pairs of classifiers, diversity is often measured with Yule's Q statistic (1900); see Kuncheva, 2003.
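For reference, Yule's Q statistic for a pair of classifiers C_i and C_k is commonly written as follows (the standard definition used in Kuncheva's work, reconstructed rather than copied from the slide), where N^{ab} counts the instances on which C_i is correct (a = 1) or wrong (a = 0) and C_k is correct (b = 1) or wrong (b = 0):

Q_{i,k} = (N^{11} N^{00} - N^{01} N^{10}) / (N^{11} N^{00} + N^{01} N^{10})

Q ranges from -1 to 1; statistically independent classifiers give Q = 0, and classifiers that tend to err on the same instances give positive Q.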
19 Integration of classifiers
Integration approaches:
- Selection
  - Static: Static Selection (CVM)
  - Dynamic: Dynamic Selection (DS)
- Combination
  - Static: Weighted Voting (WV)
  - Dynamic: Dynamic Voting (DV); Dynamic Voting with Selection (DVS) combines both selection and voting
Motivation for dynamic integration: the main assumption is that each classifier is the best in some sub-areas of the whole data set, where its local error is comparatively lower than the corresponding errors of the other classifiers.
20 Problems of voting: an example
21 Stacked generalization framework (Stacking; Wolpert, 1992)
22 Arbiter meta-learning tree (Chan & Stolfo, 1997)
Meta-learning is the use of learning algorithms to learn how to integrate results from multiple learning systems. An arbiter is a classifier that is trained to resolve disagreements between the base classifiers. An arbiter tree is a hierarchical (multi-level) structure composed of arbiters that are computed in a bottom-up, binary-tree fashion. When a new instance is classified by the arbiter tree in the application phase, predictions flow from the leaves to the root of the tree.
23 The space model: motivation for dynamic integration
Information about the methods' errors on training instances can be used for learning, just as the original instances are used for learning.
The main assumption is that each data mining method is best in some subareas of the whole application domain, where its local error is comparatively lower than the corresponding errors of the other available methods.
24 Dynamic integration of classifiers
The goal is to use each base classifier just in
the subdomain where it is most reliable (or to
use a weight proportional to its local
reliability) and thus to achieve overall results
that can be considerably better than those of the
best individual classifier alone.
25 Dynamic integration of classifiers: an example
26 EFS_SBC experiments: results (Acute Abdominal Pain dataset)
27 Bagging
BAGGing = Bootstrap AGGregation (Breiman, 1996). Bootstrap? (see the sketch below)
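A minimal sketch of the bagging procedure (bootstrap resampling plus majority voting); the base learner and the number of classifiers are illustrative assumptions, not taken from the slides:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def bagging_fit(X, y, n_classifiers=40):
    # Train each base classifier on a bootstrap sample of the training set,
    # i.e. n instances drawn with replacement from the n original instances.
    models = []
    for _ in range(n_classifiers):
        idx = rng.integers(0, len(X), size=len(X))
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    # Aggregate the bootstrap models by majority vote (integer class labels).
    votes = np.array([m.predict(X) for m in models])
    return np.apply_along_axis(lambda c: np.bincount(c).argmax(), 0, votes)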
28 Bootstrapping: a bit of history
- Rudolf Raspe, Baron Munchausen's Narrative of his Marvellous Travels and Campaigns in Russia, 1785
- He hauls himself and his horse out of the mud by lifting himself by his own hair.
- By the 1860s the story had been modified in the USA to refer to "bootstraps"
- Since the 1860s the term has also been used to refer to doing something on your own, without external help
- Since the 1950s it has referred to the procedure of getting a computer to start (to boot, to reboot)
29-36 (image-only slides, no transcript available)
37 Weak learning: motivation for Boosting
- Schapire showed that a set of weak learners (learners with accuracy > 50%, but not much greater) can be combined into a strong learner
- Idea: weight the data set based on how well we have predicted the data points so far:
  - data points predicted correctly -> low weight
  - data points mispredicted -> high weight
- Result: this focuses the base classifiers on portions of the data space not previously well predicted (see the sketch below)
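A minimal sketch of this reweighting idea in the style of AdaBoost.M1; the slides do not name a particular boosting algorithm, so the weight-update formulas and the decision-stump base learner below are standard choices, not taken from the slides, and labels are assumed to be in {-1, +1}:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=50):
    # Boosting with instance reweighting; y must contain labels -1 and +1.
    n = len(X)
    w = np.full(n, 1.0 / n)                 # start with uniform instance weights
    models, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = w[pred != y].sum() / w.sum()
        if err >= 0.5:                      # no better than chance: stop
            break
        alpha = 0.5 * np.log((1 - err) / (err + 1e-12))
        w *= np.exp(-alpha * y * pred)      # mispredicted points get higher weight
        w /= w.sum()
        models.append(stump)
        alphas.append(alpha)
    return models, alphas

def adaboost_predict(models, alphas, X):
    # Weighted vote of the base classifiers.
    return np.sign(sum(a * m.predict(X) for a, m in zip(alphas, models)))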
38-47 (image-only slides, no transcript available)
48 Some theories on bagging/boosting
- Error = Bayes optimal error + bias + variance
- Bayes optimal error = noise error
- Theories:
  - bagging can reduce the variance part of the error
  - boosting can reduce variance AND bias
  - bagging will hardly ever increase error
  - boosting may increase error
  - boosting is susceptible to noise
49 Comprehensibility of ensembles
Comprehensibility is commonly an important shortcoming of ensembles, because ensembles are often too complex and difficult for an expert to understand, even when the base classifiers are simple (one DT is easy to interpret, but how about 200 DTs?). Two ways are currently known to cope with this problem and to achieve understanding and explanation in ensembles:
1) Black-box approach: the behaviour of an ensemble is approximated with a single model (Domingos, 1998).
2) Decomposition approach: the ensemble is decomposed into a set of rules (Cunningham, ECML/PKDD 2002).
50 Overfitting
Formal definition:
- Consider the error of hypothesis h over
  - the training data: error_train(h)
  - the entire distribution D of data: error_D(h)
- Hypothesis h ∈ H overfits the training data if there is an alternative hypothesis h′ ∈ H such that
  error_train(h) < error_train(h′) and error_D(h) > error_D(h′)
51 Overfitting
(Figure: accuracy plotted against the size of the search space (nodes, epochs, etc.); training-set accuracy AccTrS keeps increasing, while test-set accuracy AccTestS peaks at the marked "knee point".)
52 Overfitting in ensembles
- Not much research has been done on this to date
- A surprising recent finding: ensembles of overfitted base classifiers (DTs, ANNs) are in many cases better than ensembles of non-overfitted base classifiers
- This is most probably related to the fact that ensemble diversity is much higher in that case
53 Measuring overfitting in ensembles
54 Evaluating learned models
- Cross-validation (CV) the examples are randomly
split into v mutually exclusive partitions (the
folds) of approximately equal size. A sample is
formed by setting aside one of the v folds as the
test set, and the remaining folds make up the
training set. This creates v possible samples. As
each learned model is formed using one of the v
training sets, its generalization performance is
estimated on the corresponding test partition. - Random sampling or Monte Carlo cross-validation
is a special case of v-fold cross-validation
where a percentage of training examples
(typically 2/3) is randomly placed in the
training set, and the remaining examples are
placed in the test set. After learning takes
place on the training set, generalization
performance is estimated on the test set. This
whole process is repeated for many training/test
splits (usually 30) and the algorithm with the
best average generalization performance is
selected.
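A minimal sketch of the two evaluation schemes described above, using scikit-learn utilities; the classifier, fold counts, and split proportions are placeholders matching the text, not prescribed by the slides:

from sklearn.model_selection import KFold, ShuffleSplit, cross_val_score
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()

# v-fold cross-validation: each fold serves once as the test set
vfold = KFold(n_splits=10, shuffle=True, random_state=0)

# Monte Carlo cross-validation: 30 random splits, roughly 2/3 train / 1/3 test
monte_carlo = ShuffleSplit(n_splits=30, train_size=2 / 3, random_state=0)

def estimate_generalization(X, y):
    # Average test-set accuracy over the folds / random splits.
    return (cross_val_score(clf, X, y, cv=vfold).mean(),
            cross_val_score(clf, X, y, cv=monte_carlo).mean())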
55 Ensemble evaluation: problems
(Salzberg, 1999) Statistically invalid conclusions are to be avoided. When one repeatedly searches a large database with powerful algorithms, it is all too easy to find a phenomenon or pattern that looks impressive, even when there is nothing to discover. This is the so-called multiplicity effect. The effect is even more important when the same search is done within one scientific community. Literature skewness effect: there is substantial danger that published results, even when using the appropriate significance tests, will be mere accidents of chance. One must be very careful in the design of an experimental study using publicly available databases, such as UCI. Proposed solution: use important real datasets and make the UCI ML repository larger.
56 Feature selection: motivation
- Build better predictors: better-quality predictors (classifiers/regressors) can be built by removing irrelevant and redundant features
- Economy of representation and comprehensibility: allow the problem/phenomenon to be represented as succinctly as possible
- Knowledge discovery: discover which features are and are not influential in weak-theory domains
57 Problems of global feature selection
- Most feature selection methods ignore the fact that some features may be relevant only in context (i.e. in some regions of the instance space) and cover the whole instance space with a single set of features
- They may discard features that are highly relevant in a restricted region of the instance space, because this relevance is swamped by their irrelevance everywhere else
- They may retain features that are relevant in most of the space but still unnecessarily confuse the classifier in some regions
- Global feature selection can lead to poor performance in minority-class prediction, which is often the case in practice (e.g. many negative/"no disease" instances in medical diagnostics) (Cardie and Howe 1997)
58 Feature-space heterogeneity
- There exist many complicated data mining problems where the relevant features are different in different regions of the feature space.
- Types of feature heterogeneity:
  - Class heterogeneity
  - Feature-value heterogeneity
  - Instance-space heterogeneity
59 Ensemble Feature Selection
- How to prepare inputs for the generation of the base classifiers?
  - Sample the training set
  - Manipulate the input features
  - Manipulate the output targets (class values)
- Goal of traditional feature selection: find and remove features that are unhelpful or destructive to learning, producing one feature subset for a single classifier
- Goals of ensemble feature selection:
  - find and remove features that are unhelpful or destructive to learning, producing a different feature subset for each of a number of classifiers
  - find feature subsets that will promote disagreement between the classifiers
60 Advanced local feature selection
(Diagram: the training set TS passes through Local Feature Filtering, producing feature subsets FS1, FS2, ..., FSk; a learning algorithm builds classifiers C1, C2, ..., Ck from them; for a new instance x0, dynamic integration uses the local accuracies LAcc1, LAcc2, ..., LAcck to perform Dynamic Selection or Dynamic Voting.)
61 Classification of acute abdominal pain
- 3 large datasets with cases of acute abdominal pain (AAP): 1254, 2286, and 4020 instances, with 18 parameters (features) from history-taking and clinical examination
- the task of separating acute appendicitis, the second most important cause of abdominal surgeries
- AAP I is from 6 surgical departments in Germany, AAP II from 14 centers in Germany, and AAP III from 16 centers in Central and Eastern Europe
- the 18 features are standardized by the World Organization of Gastroenterology (OMGE)
Features: 1 Sex, 2 Age, 3 Progress of pain, 4 Duration of pain, 5 Type of pain, 6 Severity of pain, 7 Location of pain at present, 8 Location of pain at onset, 9 Previous similar complaints, 10 Previous abdominal operation, 11 Distended abdomen, 12 Tenderness, 13 Severity of tenderness, 14 Movement of abdominal wall, 15 Rigidity, 16 Rectal tenderness, 17 Rebound tenderness, 18 Leukocytes
The data sets for research were kindly provided
by the Laboratory for System Design, Faculty
of Electrical Engineering and Computer Science,
University of Maribor, Slovenia, and the
Theoretical Surgery Unit, Department of General
and Trauma Surgery, Heinrich-Heine University,
Düsseldorf, Germany
62 Search in EFS
- Search space: 2^NumOfFeatures subsets per classifier, i.e. 2^18 · 25 = 6,553,600 candidate subsets in total
- 4 search strategies to heuristically explore the
search space - Hill-Climbing (HC) (CBMS2002)
- Ensemble Forward Sequential Selection (EFSS)
- Ensemble Backward Sequential Selection (EBSS)
- Genetic Ensemble Feature Selection (GEFS)
63 Hill-Climbing (HC) strategy (CBMS 2002)
- Generation of initial feature subsets using the random subspace method (RSM)
- A number of refining passes over each feature subset, as long as there is an improvement in fitness
64 Ensemble Forward Sequential Selection (EFSS)
(Figure: forward selection illustration)
65 Ensemble Backward Sequential Selection (EBSS)
(Figure: backward elimination illustration)
66 Genetic Ensemble Feature Selection (GEFS)
67 An example: EFSS on AAP III, α = 4
- C1: f2 (age), f7 (location of pain at present)
- C2: f6 (severity of pain), f13 (severity of tenderness)
- C3: f6 (severity of pain), f13 (severity of tenderness)
- C4: f9 (previous similar complaints), f14 (movement of abdominal wall)
- C5: f3 (progress of pain), f15 (rigidity)
- C6: f2 (age), f16 (rectal tenderness)
- C7: f1 (sex), f12 (tenderness)
- C8: f4 (duration of pain), f18 (leukocytes)
- C9: f4 (duration of pain)
- C10: f11 (distended abdomen)
68 Experiments: results
69 Feature importance table (EFSS, α = 0)
70 Ensembles for tracking concept drift in streaming data
- The presence of a changing target concept is called concept drift (CD).
- Many real-world examples: people's preferences for products, SPAM, credit card fraud, intrusion detection, etc.
- Algorithms that track CD must be able to identify a change in the target concept without direct knowledge of the underlying shift in the distribution.
- Two types of CD are distinguished: sudden and gradual.
- Two basic kinds of algorithms to track CD: (1) window-based and (2) ensemble-based.
71 Concept drift: a definition
- Consider 2 target concepts, A and B.
- A sequence of instances i1 ... in.
- Before i_d the concept A is stable
- After i_(d+dx) the concept B is stable
- Between i_d and i_(d+dx) the concept is drifting between A and B
- When dx = 1, the concept drift is sudden; otherwise it is gradual
- The CD is usually modelled with a linear function α (see the formula below)
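One common way to write such a linear model (a reconstruction; the slide does not spell out the formula) is to let α(t) be the probability that instance i_t is generated by concept B rather than A:

α(t) = 0 for t < d, α(t) = (t - d) / dx for d ≤ t ≤ d + dx, and α(t) = 1 for t > d + dx

so sudden drift (dx = 1) is a step function, while gradual drift is a slow linear ramp.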
72 Data streams
- Much in common with the concept of CD
- Many current information systems are real-time production systems that generate tremendous amounts of data: network event logs, telephone call records, credit card transaction flows, sensor and surveillance video streams, etc.
- Knowledge discovery on streaming data is a research topic of growing interest
- Incremental or online learning algorithms are known, but they do not take CD into account!
73 Data stream + CD + Ensemble
- An ensemble can be trained from sequential data chunks in the stream (see the sketch below)
- It may solve the problem of many conflicting concepts (not only 2)
- Optimal ensemble search solves the CD problem at one level
- Dynamic integration solves the CD problem at the instance (or local) level
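A minimal sketch of a chunk-based streaming ensemble (one base classifier per data chunk, the oldest member dropped when the ensemble is full, members weighted by their accuracy on the newest chunk); the base learner, chunking, and weighting scheme are illustrative assumptions, not the specific algorithm of the slides:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

class ChunkEnsemble:
    # Keeps one base classifier per recent chunk of the data stream.
    def __init__(self, max_members=10):
        self.max_members = max_members
        self.members = []                   # list of (classifier, weight) pairs

    def update(self, X_chunk, y_chunk):
        # Re-weight existing members by their accuracy on the newest chunk,
        # which down-weights members trained on outdated concepts.
        self.members = [(clf, (clf.predict(X_chunk) == y_chunk).mean())
                        for clf, _ in self.members]
        self.members.append((DecisionTreeClassifier().fit(X_chunk, y_chunk), 1.0))
        if len(self.members) > self.max_members:
            self.members.pop(0)             # forget the oldest member

    def predict(self, X):
        # Weighted voting over the members (assumes integer class labels).
        labels = np.array([clf.predict(X) for clf, _ in self.members])
        weights = np.array([w for _, w in self.members])
        scores = np.zeros((labels.max() + 1, X.shape[0]))
        for row, w in zip(labels, weights):
            scores[row, np.arange(X.shape[0])] += w
        return scores.argmax(axis=0)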
74 A Real-World Example: Database Filling
75 A Real-World Example: Diagnostics
76 A Real-World Example: Classifier Evaluation and Selection
77 Thank you!
- Alexey Tsymbal
- Department of Computer Science and Information Systems, University of Jyväskylä, FINLAND
- www.cs.jyu.fi/alexey/
- alexey_at_it.jyu.fi
78 The DI algorithm: keys
- T: training set for the base classifiers
- Ti: i-th fold of the training set
- T*: meta-level training set for the combining algorithm
- x: attributes of an instance
- c(x): classification of the instance with attributes x
- C: set of base classifiers
- Cj: j-th base classifier
- Cj(x): prediction produced by Cj on instance x
- Ej(x): estimation of the error of Cj on instance x
- Ej(x): prediction of the error of Cj on instance x
- m: number of base classifiers
- W: vector of weights for the base classifiers
- nn: number of nearby instances used for error prediction
- WNNi: weight of the i-th nearby instance
79 The DI algorithm: learning phase

procedure learning_phase(T, C)
begin   {fills in the meta-level training set T*}
   partition T into v folds
   loop for each fold Ti of T
      loop for j from 1 to m
         train(Cj, T - Ti)
      loop for x ∈ Ti
         loop for j from 1 to m
            compare Cj(x) with c(x) and derive Ej(x)
         collect (x, E1(x), ..., Em(x)) into T*
   loop for j from 1 to m
      train(Cj, T)
end
80 The DI algorithm: application phase

function DS_application_phase(T*, C, x) returns class of x
begin
   loop for j from 1 to m
      Ej ← (1/nn) · Σi WNNi · Ej(xNNi)   {WNN estimation}
   l ← arg minj Ej   {index of the classifier with the minimum local error;
                      the one with the least global error in the case of ties}
   return Cl(x)
end

function DV_application_phase(T*, C, x) returns class of x
begin
   loop for j from 1 to m
      Wj ← 1 - (1/nn) · Σi WNNi · Ej(xNNi)   {WNN estimation}
   return Weighted_Voting(W, C1(x), ..., Cm(x))
end
81 An integrated knowledge discovery management system
82 Experimental design
- three variations (DS, DV, and DVS), weighted voting (WV), and cross-validation majority (CVM)
- 30 random runs (30% testing, 70% training) for Monte Carlo CV
- 10-fold stratified CV for the cross-validation history
- Heterogeneous Euclidean-Overlap distance metric; 5 other metrics investigated
- UCI ML Repository data sets and the Appendicitis datasets
- interval discretization (equal-width intervals)
- PEBLS, C4.5, and BAYES learning algorithms for testing DI
- C4.5 for multi-level meta-learning, local feature selection, and bagging and boosting
- t-test for statistical significance
- the test environment is implemented within the MLC++ framework (the Machine Learning Library in C++)