Ensembles: Issues and Applications


1
Ensembles: Issues and Applications
  • Alexey Tsymbal
  • Department of Computer Science, Trinity College Dublin
  • Ireland

2
Contents
  • Introduction: knowledge discovery and data mining, the task of classification, ensemble classification
  • What makes a good ensemble?
  • Bagging, Boosting
  • Evaluation of ensembles
  • Comprehensibility of ensembles
  • Overfitting in ensembles
  • Ensemble feature selection for acute abdominal pain classification
  • Ensembles for streaming data processing in the presence of concept drift

3
Knowledge discovery and data mining
  • Knowledge Discovery in Databases (KDD) is an
    emerging area that considers the process of
    finding previously unknown and potentially
    interesting patterns and relations in large
    databases.
  • Fayyad, U., Piatetsky-Shapiro, G., Smyth, P.,
    Uthurusamy, R., Advances in Knowledge Discovery
    and Data Mining, AAAI/MIT Press, 1997.

4
The task of classification
[Diagram: a training set with J classes, n training observations, and p instance attributes is fed to CLASSIFICATION, which assigns a class membership to a new instance to be classified]
5
What is ensemble learning?
Ensemble learning refers to a collection of
methods that learn a target function by training
a number of individual learners and combining
their predictions
6
Ensemble learning
7
Ensembles: scientific communities
  • machine learning (ML Research: Four current directions, Dietterich 1997)
  • knowledge discovery and data mining (Han & Kamber 2000, DM: Concepts and Techniques, current trends)
  • artificial intelligence (multi-agent systems, Russell & Norvig 2002, AI: A Modern Approach, 2nd ed.)
  • neural networks
  • statistics
  • computational learning theory
  • pattern recognition
  • what else?

8
Ensembles: different names
  • Names for the ensemble approach:
  • multiple models
  • multiple classifier systems
  • combining classifiers (regressors etc.)
  • integration of classifiers
  • mixture of experts
  • decision committee
  • committee of experts
  • classifier fusion
  • multimodel learning
  • consensus theory
  • what else?
  • Names for the ensemble members:
  • base classifiers
  • component classifiers
  • individual classifiers
  • members (of a decision committee)
  • level-0 experts
  • what else?

9
Why ensemble learning?
  • Accuracy: a more reliable mapping can be obtained by combining the output of multiple experts
  • Efficiency: a complex problem can be decomposed into multiple sub-problems that are easier to understand and solve (divide-and-conquer approach); mixture of experts, ensemble feature selection
  • There is not a single model that works for all pattern recognition problems! (no free lunch theorem) "To solve really hard problems, we'll have to use several different representations. It is time to stop arguing over which type of pattern-classification technique is best. Instead we should work at a higher level of organization and discover how to build managerial systems to exploit the different virtues and evade the different limitations of each of these ways of comparing things." (Minsky, 1991)

10
When ensemble learning?
  • When you can build base classifiers that are more accurate than chance, and, more importantly,
  • that are as independent from each other as possible

11
Why do ensembles work? 1/3
Because uncorrelated errors of individual classifiers can be eliminated by averaging.
Assume 40 base classifiers combined by majority voting, each with error rate 0.3. The majority vote is wrong only when r >= 21 of the 40 classifiers misclassify an example, and the probability of that event is about 0.002 (Dietterich, 1997).
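A quick numeric check of this figure (a minimal sketch, assuming independent errors, using only the Python standard library): the ensemble error is a binomial tail probability.

```python
from math import comb

n, p = 40, 0.3  # ensemble size and per-classifier error rate from the slide

# Majority voting over 40 members fails only if 21 or more of them are wrong.
p_ensemble_error = sum(comb(n, r) * p**r * (1 - p)**(n - r) for r in range(21, n + 1))

print(f"P(ensemble error) = {p_ensemble_error:.4f}")  # approximately 0.002
```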
12
Why do ensembles work? 2/3
The desired target function may not be implementable with individual classifiers, but may be approximated by ensemble averaging.
Assume you want to build a diagonal decision boundary with decision trees. The decision boundaries of decision trees are hyperplanes parallel to the coordinate axes, as in the figures. By averaging a large number of such staircases, the diagonal decision boundary can be approximated with arbitrarily small error.
[Figures a, b: axis-parallel staircase decision boundaries separating Class 1 from Class 2]
13
Why do ensembles work? 3/3
  • Theoretical results by Hansen & Salamon (1990)
  • If we can assume that classifiers make independent (random) errors and their accuracy is > 50%, we can push ensemble accuracy arbitrarily high by combining more classifiers
  • Key assumption:
  • classifiers are independent in their predictions
  • not a very reasonable assumption
  • more realistically: for data points where classifiers predict with > 50% accuracy, accuracy can be pushed arbitrarily high (some data points are just too difficult)

14
How to make an effective ensemble?
  • Two basic decisions when designing ensembles
  • How to generate the base classifiers?
  • How to integrate them?

15
Methods for generating the base classifiers
  • Subsampling the training examples: multiple hypotheses are generated by training individual classifiers on different datasets obtained by resampling a common training set (Bagging, Boosting)
  • Manipulating the input features: multiple hypotheses are generated by training individual classifiers on different representations, or different subsets, of a common feature vector (a minimal sketch of these first two methods follows this list)
  • Manipulating the output targets: the output targets for C classes are encoded with an l-bit codeword, and an individual classifier is built to predict each one of the bits in the codeword; additional auxiliary targets may be used to differentiate classifiers
  • Modifying the learning parameters of the classifier: a number of classifiers are built with different learning parameters, such as the number of neighbors in a kNN rule, initial weights in an MLP, etc.
  • Using heterogeneous models (not often used)
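A minimal illustration of the first two generation methods (a sketch, not the exact procedures used later in the talk; the data arrays and sizes are made up, and only numpy is assumed): a bootstrap replicate of the training set and a random feature subspace.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 18))     # hypothetical training data: 100 instances, 18 features
y = rng.integers(0, 2, size=100)   # hypothetical binary class labels

n, p = X.shape

# 1. Subsampling the training examples: one bootstrap replicate (n draws with replacement).
idx = rng.integers(0, n, size=n)
X_boot, y_boot = X[idx], y[idx]

# 2. Manipulating the input features: a random subspace of half of the features.
feat = rng.choice(p, size=p // 2, replace=False)
X_sub = X[:, feat]

print(X_boot.shape, X_sub.shape)   # (100, 18) and (100, 9)
```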

16
Learning algorithms in ensembles
  • Decision tree learning
  • ID3, C4.5 decision tree learning algorithms
    (Quinlan)
  • Instance-based learning
  • k-nearest neighbor classification, PEBLS
    (Cost, Salzberg)
  • Bayesian classification
  • Naïve Bayes (John)
  • Neural networks (MLPs etc)
  • Discriminant analysis
  • Regression analysis

17
Ensembles: the need for disagreement
  • Overall error depends on average error of
    ensemble members
  • Increasing ambiguity decreases overall error
  • Provided it does not result in an increase in
    average error
  • (Krogh and Vedelsby, 1995)

18
Measuring ensemble diversity
A is the ensemble ambiguity, measured as the weighted average of the squared differences between the predictions of the base networks and the ensemble prediction (regression case).
Pairwise diversity can be measured with Yule's Q statistic (1900); see Kuncheva, 2003.
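The formulas behind this slide (reconstructed from the cited sources rather than copied from the deck, so treat them as the standard textbook forms): the Krogh & Vedelsby ambiguity for the regression case, and Yule's Q statistic for a pair of classifiers C_i, C_k, where N^{ab} counts instances on which C_i is correct/wrong (a = 1/0) and C_k is correct/wrong (b = 1/0).

```latex
% Ambiguity decomposition (Krogh & Vedelsby, 1995): ensemble error = average member error - ambiguity.
E = \bar{E} - \bar{A}, \qquad
\bar{A}(x) = \sum_i w_i \bigl(V_i(x) - \bar{V}(x)\bigr)^2, \qquad
\bar{V}(x) = \sum_i w_i V_i(x)

% Yule's Q statistic (1900) as used for classifier diversity (Kuncheva, 2003):
Q_{i,k} = \frac{N^{11}N^{00} - N^{01}N^{10}}{N^{11}N^{00} + N^{01}N^{10}}
```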
19
Integration of classifiers
Integration approaches:
  • Selection
  • static: Static Selection (CVM)
  • dynamic: Dynamic Selection (DS)
  • Combination
  • static: Weighted Voting (WV)
  • dynamic: Dynamic Voting (DV), Dynamic Voting with Selection (DVS)
Motivation for Dynamic Integration: the main assumption is that each classifier is the best in some sub-areas of the whole data set, where its local error is comparatively lower than the corresponding errors of the other classifiers.
20
Problems of voting: an example
21
Stacked generalization framework
(Stacking, Wolpert, 1992)
22
Arbiter meta-learning tree
(Chan & Stolfo, 1997)
Meta-learning is the use of learning algorithms to learn how to integrate results from multiple learning systems. An arbiter is a classifier that is trained to resolve disagreements between the base classifiers. An arbiter tree is a hierarchical (multi-level) structure composed of arbiters that are computed in a bottom-up, binary-tree fashion. When a new instance is classified by the arbiter tree in the application phase, predictions flow from the leaves to the root of the tree.
23
The space model: motivation for dynamic integration
Information about the methods' errors on training instances can be used for learning, just as the original instances themselves are used for learning.
The main assumption is that each data mining
method is best in some subareas of the whole
application domain, where its local error is
comparatively less than the corresponding errors
of the other available methods.
24
Dynamic integration of classifiers
The goal is to use each base classifier just in
the subdomain where it is most reliable (or to
use a weight proportional to its local
reliability) and thus to achieve overall results
that can be considerably better than those of the
best individual classifier alone.
25
Dynamic integration of classifiers: an example
26
EFS_SBC experiments: results (Acute Abdominal Pain dataset)
27
Bagging
BAGGing = Bootstrap AGGregating (Breiman, 1996). Bootstrap?
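A minimal sketch of the bagging procedure (not the original experimental setup from this talk): every member is trained on a bootstrap replicate of the training set and the members are combined by unweighted majority voting. It assumes scikit-learn is available for the base learner and that class labels are non-negative integers.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_estimators=25, seed=0):
    """Train n_estimators trees, each on a bootstrap replicate of (X, y)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(n_estimators):
        idx = rng.integers(0, n, size=n)                   # sample n instances with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Combine the members' predictions by unweighted majority voting."""
    votes = np.stack([m.predict(X) for m in models]).astype(int)  # (n_estimators, n_instances)
    return np.array([np.bincount(col).argmax() for col in votes.T])
```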
28
Bootstrapping A bit of history
  • Rudolf Raspe, Baron Munchausen's Narrative of his Marvellous Travels and Campaigns in Russia, 1785
  • He hauls himself and his horse out of the mud by lifting himself by his own hair.
  • By the 1860s the story had been modified in the USA to refer to bootstraps
  • Since the 1860s the term has also been used for doing something on your own, without external help
  • Since the 1950s it has referred to the procedure of getting a computer to start (to boot, to reboot)

29-36
(No Transcript)
37
Weak learning: motivation for Boosting
  • Schapire showed that a set of weak learners (learners with > 50% accuracy, but not much greater) can be combined into a strong learner
  • Idea: weight the data set based on how well we have predicted the data points so far: correctly predicted data points get a low weight, mispredicted data points get a high weight
  • Result: this focuses the base classifiers on portions of the data space not previously well predicted (an AdaBoost-style sketch of this reweighting scheme follows this list)
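The reweighting idea, sketched as AdaBoost-style Python (an illustration of the scheme rather than the exact algorithm from the missing slides; it assumes binary labels in {-1, +1} and scikit-learn decision stumps as the weak learners):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, rounds=20):
    """AdaBoost-style boosting; y must contain labels -1 and +1."""
    n = len(X)
    w = np.full(n, 1.0 / n)                          # start with uniform instance weights
    stumps, alphas = [], []
    for _ in range(rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = w[pred != y].sum() / w.sum()           # weighted error of this weak learner
        if err >= 0.5:                               # no better than chance: stop
            break
        alpha = 0.5 * np.log((1 - err) / (err + 1e-12))
        w *= np.exp(-alpha * y * pred)               # mispredicted points get a higher weight
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    """Weighted vote of the weak learners."""
    scores = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
    return np.sign(scores)
```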

38-47
(No Transcript)
48
Some theories on bagging/boosting
  • Error = Bayes Optimal Error + Bias + Variance
  • Bayes Optimal Error = noise error
  • Theories
  • bagging can reduce variance part of error
  • boosting can reduce variance AND bias
  • bagging will hardly ever increase error
  • boosting may increase error
  • boosting is susceptible to noise

49
Comprehensibility of ensembles
Comprehensibility is commonly an important shortcoming of ensembles, because ensembles are often too complex and difficult for an expert to understand, even when the base classifiers are simple (one DT is easy to interpret, but how about 200 DTs?). Two ways are currently known to cope with this problem and to achieve understanding and explanation in ensembles: 1) Black box approach: the behaviour of an ensemble is approximated with a single model (Domingos, 1998). 2) Decomposition approach: the ensemble is decomposed into a set of rules (Cunningham, ECML/PKDD 2002).
50
Overfitting
Formal definition
  • Consider the error of hypothesis h over
  • the training data: error_train(h)
  • the entire distribution D of data: error_D(h)
  • Hypothesis h ∈ H overfits the training data if there is an alternative hypothesis h′ ∈ H such that
  • error_train(h) < error_train(h′)
  • and
  • error_D(h) > error_D(h′)

51
Overfitting
[Figure: accuracy of the learned model on the training set (AccTrS) and on the test set (AccTestS) plotted against the size of the search space (nodes, epochs, etc.), with the knee point marked]
52
Overfitting in ensembles
  • Not much research has been done on this to date
  • A surprising recent finding: ensembles of overfitted base classifiers (DTs, ANNs) are in many cases better than ensembles of non-overfitted base classifiers
  • This is most probably related to the fact that in that case the ensemble diversity is much higher

53
Measuring overfitting in ensembles
54
Evaluating learned models
  • Cross-validation (CV) the examples are randomly
    split into v mutually exclusive partitions (the
    folds) of approximately equal size. A sample is
    formed by setting aside one of the v folds as the
    test set, and the remaining folds make up the
    training set. This creates v possible samples. As
    each learned model is formed using one of the v
    training sets, its generalization performance is
    estimated on the corresponding test partition.
  • Random sampling or Monte Carlo cross-validation
    is a special case of v-fold cross-validation
    where a percentage of training examples
    (typically 2/3) is randomly placed in the
    training set, and the remaining examples are
    placed in the test set. After learning takes
    place on the training set, generalization
    performance is estimated on the test set. This whole process is repeated for many training/test splits (usually 30), and the algorithm with the best average generalization performance is selected (a minimal sketch of both schemes follows this list).
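Both evaluation schemes map directly onto standard scikit-learn utilities; a minimal sketch with stand-in data (the dataset, model, and parameter values are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, ShuffleSplit, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=18, random_state=0)  # stand-in data
clf = DecisionTreeClassifier()

# v-fold cross-validation: v disjoint folds, each used once as the test set.
cv_scores = cross_val_score(clf, X, y, cv=KFold(n_splits=10, shuffle=True, random_state=0))

# Monte Carlo cross-validation: 30 random 2/3-train / 1/3-test splits.
mc_scores = cross_val_score(clf, X, y,
                            cv=ShuffleSplit(n_splits=30, train_size=2/3, random_state=0))

print(cv_scores.mean(), mc_scores.mean())
```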

55
Ensemble evaluation problems
(Salzberg, 1999) Statistically invalid conclusions are to be avoided. When one repeatedly searches a large database with powerful algorithms, it is all too easy to find a phenomenon or pattern that looks impressive, even when there is nothing to discover. This is the so-called multiplicity effect. The effect is even more important when the same search is done within one scientific community (the literature skewness effect): there is substantial danger that published results, even when using the appropriate significance tests, will be mere accidents of chance. One must be very careful in the design of an experimental study using publicly available databases, such as UCI. Proposed solution: use important real datasets and make the UCI ML repository larger.
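A worked illustration of the multiplicity effect (standard significance-testing arithmetic, not a figure from the slide): if k independent comparisons are each tested at significance level alpha = 0.05, the chance of at least one spurious "significant" result is already about 64% for k = 20.

```latex
P(\text{at least one false positive}) = 1 - (1 - \alpha)^k,
\qquad 1 - 0.95^{20} \approx 0.64 .
```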
56
Feature selection: motivation
  • Build better predictors: better quality predictors (classifiers/regressors) can be built by removing irrelevant and redundant features
  • Economy of representation and comprehensibility: allow the problem/phenomenon to be represented as succinctly as possible
  • Knowledge discovery: discover which features are and are not influential in weak theory domains

57
Problems of global feature selection
  • Most feature selection methods ignore the fact
    that some features may be relevant only in
    context (i.e. in some regions of the instance
    space) and cover the whole instance space by a
    single set of features
  • They may discard features that are highly
    relevant in a restricted region of the instance
    space because this relevance is swamped by their
    irrelevance everywhere else
  • They may retain features that are relevant in
    most of the space, but still unnecessarily
    confuse the classifier in some regions
  • Global feature selection can lead to poor performance in minority class prediction, which is often the relevant case (e.g. many negative/no-disease instances in medical diagnostics) (Cardie and Howe 1997).

58
Feature-space heterogeneity
  • There exist many complicated data mining
    problems, where relevant features are different
    in different regions of the feature space.
  • Types of feature heterogeneity:
  • class heterogeneity
  • feature-value heterogeneity
  • instance-space heterogeneity

59
Ensemble Feature Selection
  • How to prepare inputs for the generation of the
    base classifiers ?
  • Sample the training set
  • Manipulate input features
  • Manipulate output target (class values)
  • Goal of traditional feature selection:
  • find and remove features that are unhelpful or destructive to learning, making one feature subset for a single classifier
  • Goal of ensemble feature selection:
  • find and remove features that are unhelpful or destructive to learning, making different feature subsets for a number of classifiers
  • find feature subsets that will promote disagreement between the classifiers

60
Advanced local feature selection
[Diagram: the training set TS passes through Local Feature Filtering to produce feature subsets FS1, FS2, ..., FSk; a learning algorithm builds classifiers C1, C2, ..., Ck from them; for a new instance x0, dynamic integration uses the local accuracies LAcc1, LAcc2, ..., LAcck to perform Dynamic Selection or Dynamic Voting]
61
Classification of acute abdominal pain
  • 3 large datasets with cases of acute abdominal
    pain (AAP) 1254, 2286, and 4020 instances, and
    18 parameters (features) from history-taking and
    clinical examination
  • the task of separating acute appendicitis
  • the second most important cause of abdominal
    surgeries
  • AAP I from 6 surgical departments in Germany,
    AAP II from 14 centers in Germany, and AAP III
    from 16 centers in Central and Eastern Europe
  • the 18 features are standardized by the World
    Organization of Gastroenterology (OMGE)

Features: 1 Sex, 2 Age, 3 Progress of pain, 4 Duration of pain, 5 Type of pain, 6 Severity of pain, 7 Location of pain at present, 8 Location of pain at onset, 9 Previous similar complaints, 10 Previous abdominal operation, 11 Distended abdomen, 12 Tenderness, 13 Severity of tenderness, 14 Movement of abdominal wall, 15 Rigidity, 16 Rectal tenderness, 17 Rebound tenderness, 18 Leukocytes
The data sets for research were kindly provided
by the Laboratory for System Design, Faculty
of Electrical Engineering and Computer Science,
University of Maribor, Slovenia, and the
Theoretical Surgery Unit, Department of General
and Trauma Surgery, Heinrich-Heine University,
Düsseldorf, Germany
62
Search in EFS
  • Search space
    2^NumOfFeatures × NumOfClassifiers = 2^18 × 25 = 6 553 600
  • 4 search strategies to heuristically explore the
    search space
  • Hill-Climbing (HC) (CBMS2002)
  • Ensemble Forward Sequential Selection (EFSS)
  • Ensemble Backward Sequential Selection (EBSS)
  • Genetic Ensemble Feature Selection (GEFS)

63
Hill-Climbing (HC) strategy (CBMS2002)
  • Generation of initial feature subsets using the random subspace method (RSM)
  • A number of refining passes on each feature set while there is improvement in fitness (a minimal sketch of this strategy follows this list)
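A minimal sketch of this strategy (my reading of the two bullets above, with scikit-learn assumed; the fitness used here is plain cross-validated accuracy on the selected features, which is a simplification of the accuracy/diversity fitness used in the actual work):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def hc_feature_subsets(X, y, n_classifiers=10, seed=0):
    """Random-subspace initialisation followed by bit-flip hill-climbing refinement."""
    rng = np.random.default_rng(seed)
    n_feats = X.shape[1]

    def fitness(mask):
        if not mask.any():
            return 0.0
        # Simplified fitness: 3-fold CV accuracy of a tree trained on the selected features.
        return cross_val_score(DecisionTreeClassifier(), X[:, mask], y, cv=3).mean()

    subsets = []
    for _ in range(n_classifiers):
        mask = rng.random(n_feats) < 0.5          # RSM: include each feature with probability 0.5
        best = fitness(mask)
        improved = True
        while improved:                           # refining passes while fitness improves
            improved = False
            for f in range(n_feats):
                mask[f] = ~mask[f]                # try flipping one feature in or out
                score = fitness(mask)
                if score > best:
                    best, improved = score, True
                else:
                    mask[f] = ~mask[f]            # revert the flip
        subsets.append(mask.copy())
    return subsets
```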

64
Ensemble Forward Sequential Selection (EFSS)
forward selection
65
Ensemble Backward Sequential Selection (EBSS)
backward elimination
66
Genetic Ensemble Feature Selection (GEFS)
67
An Example: EFSS on AAP III, alpha = 4
Feature subsets selected for the base classifiers:
  • C1: f2 age, f7 location of pain at present
  • C2: f6 severity of pain, f13 severity of tenderness
  • C3: f6 severity of pain, f13 severity of tenderness
  • C4: f9 previous similar complaints, f14 movement of abdominal wall
  • C5: f3 progress of pain, f15 rigidity
  • C6: f2 age, f16 rectal tenderness
  • C7: f1 sex, f12 tenderness
  • C8: f4 duration of pain, f18 leukocytes
  • C9: f4 duration of pain
  • C10: f11 distended abdomen
68
Experiments: results
69
Feature importance table (EFSS, alpha = 0)
70
Ensembles for tracking concept drifts in
streaming data
  • The presence of a changing target concept is called concept drift (CD).
  • Many real-world examples: people's preferences for products, SPAM, credit card fraud, intrusion detection, etc.
  • Algorithms that track CD must be able to identify a change in the target concept without direct knowledge of the underlying shift in distribution.
  • Two types of CD are distinguished: sudden and gradual.
  • Two basic kinds of algorithms to track CD:
  • (1) window-based, and (2) ensemble-based.

71
Concept drift: a definition
  • Consider 2 target concepts, A and B.
  • A sequence of instances i1, ..., in.
  • Before i_d the concept A is stable
  • After i_(d+dx) the concept B is stable
  • Between i_d and i_(d+dx) the concept is drifting between A and B
  • When dx = 1, the concept drift is sudden, otherwise it is gradual
  • The CD is usually modelled with a linear function alpha

72
Data streams
  • Much in common with the concept of CD
  • Many current information systems are real-time production systems that generate tremendous amounts of data
  • Network event logs, telephone call records, credit card transaction flows, sensor and surveillance video streams, etc.
  • Knowledge discovery on streaming data is a research topic of growing interest
  • Incremental or online learning algorithms are known, but they do not take CD into account!

73
Data stream CD Ensemble
  • An ensemble can be trained from sequential data
    chunks in the stream
  • It may solve the problem of many conflicting
    concepts (not only 2)
  • Optimal ensemble search solves the CD problem at
    one level
  • Dynamic Integration solves the CD problem at the instance (or local) level (a minimal chunk-based sketch follows this list)
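A minimal chunk-based sketch (assumptions: the stream arrives as fixed-size chunks with integer class labels, scikit-learn trees serve as members, and members are re-weighted by their accuracy on the newest chunk so that classifiers fitted to outdated concepts fade out; this is in the spirit of accuracy-weighted streaming ensembles, not the exact method of this talk):

```python
import numpy as np
from collections import deque
from sklearn.tree import DecisionTreeClassifier

class ChunkEnsemble:
    """Keep the most recent chunk-trained classifiers, weighted by accuracy on the latest chunk."""

    def __init__(self, max_members=10):
        self.members = deque(maxlen=max_members)   # (classifier, weight) pairs

    def update(self, X_chunk, y_chunk):
        # Re-weight existing members by their accuracy on the newest chunk.
        reweighted = [(clf, (clf.predict(X_chunk) == y_chunk).mean()) for clf, _ in self.members]
        self.members = deque(reweighted, maxlen=self.members.maxlen)
        # Train a new member on the newest chunk; the oldest member is dropped when full.
        self.members.append((DecisionTreeClassifier().fit(X_chunk, y_chunk), 1.0))

    def predict(self, X):
        preds = np.stack([clf.predict(X) for clf, _ in self.members]).astype(int)
        weights = np.array([w for _, w in self.members])
        scores = np.zeros((X.shape[0], preds.max() + 1))
        for p, w in zip(preds, weights):           # weighted voting over integer class labels
            scores[np.arange(X.shape[0]), p] += w
        return scores.argmax(axis=1)
```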

74
A Real-World Example: Database Filling
75
A Real-World Example: Diagnostics
76
A Real-World Example: Classifier Evaluation and Selection
77
Thank you!
  • Alexey Tsymbal
  • Department of Computer Science and Information
    Systems,
  • University of Jyväskylä, FINLAND
www.cs.jyu.fi/alexey/, alexey@it.jyu.fi

78
The DI algorithm: notation
T - training set for the base classifiers
Ti - i-th fold of the training set
T* - meta-level training set for the combining algorithm
x - attributes of an instance
c(x) - classification of the instance with attributes x
C - set of base classifiers
Cj - j-th base classifier
Cj(x) - prediction produced by Cj on instance x
Ej(x) - estimation of the error of Cj on instance x
E*j(x) - prediction of the error of Cj on instance x
m - number of base classifiers
W - vector of weights for the base classifiers
nn - number of nearby instances used for error prediction
WNNi - weight of the i-th nearby instance
79
The DI algorithm: learning phase
procedure learning_phase(T, C)
begin  {fills in the meta-level training set T*}
  partition T into v folds
  loop for Ti ⊂ T, i = 1..v
    loop for j from 1 to m
      train(Cj, T - Ti)
    loop for x ∈ Ti
      loop for j from 1 to m
        compare Cj(x) with c(x) and derive Ej(x)
      collect (x, E1(x), ..., Em(x)) into T*
  loop for j from 1 to m
    train(Cj, T)
end
80
The DI algorithm: application phase
function DS_application_phase(T*, C, x) returns class of x
begin
  loop for j from 1 to m
    E*j ← (1/nn) · Σi WNNi · Ej(x_NNi)   {WNN estimation of local error}
  l ← argmin_j E*j   {index of the classifier with minimal local error; ties are broken in favour of the least global error}
  return Cl(x)
end

function DV_application_phase(T*, C, x) returns class of x
begin
  loop for j from 1 to m
    Wj ← 1 - (1/nn) · Σi WNNi · Ej(x_NNi)   {WNN estimation of local weight}
  return Weighted_Voting(W, C1(x), ..., Cm(x))
end
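A compact Python transliteration of the application phase above (a sketch under simplifying assumptions: the learning phase has already produced E_meta, a 0/1 error matrix of every base classifier on every training instance, scikit-learn's NearestNeighbors supplies the nn nearest neighbours, and uniform WNN weights are used):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def di_predict(x, base_classifiers, X_meta, E_meta, nn=7, mode="DS"):
    """Dynamic integration: E_meta[i, j] is the 0/1 error of classifier j on training instance i."""
    # Predict each classifier's local error from its errors on the nn nearest training instances.
    knn = NearestNeighbors(n_neighbors=nn).fit(X_meta)
    _, idx = knn.kneighbors(x.reshape(1, -1))
    local_err = E_meta[idx[0]].mean(axis=0)

    preds = np.array([clf.predict(x.reshape(1, -1))[0] for clf in base_classifiers])
    if mode == "DS":                               # Dynamic Selection: pick the locally best classifier
        return preds[int(np.argmin(local_err))]
    weights = 1.0 - local_err                      # Dynamic Voting: weight by local reliability
    classes = np.unique(preds)
    scores = [weights[preds == c].sum() for c in classes]
    return classes[int(np.argmax(scores))]
```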
81
An integrated knowledge discovery management
system
82
Experimental design
  • three variations (DS, DV, and DVS), weighted voting (WV), and cross-validation majority (CVM)
  • 30 random runs (30% testing, 70% training) for Monte-Carlo CV
  • 10-fold stratified CV for the cross-validation history
  • Heterogeneous Euclidean-Overlap distance metric; 5 other metrics investigated
  • UCI ML Repository data sets plus the Appendicitis datasets
  • interval discretization (equal lines)
  • PEBLS, C4.5, and BAYES learning algorithms for testing DI
  • C4.5 for multi-level meta-learning, local feature selection, and bagging and boosting
  • t-test for statistical significance
  • the test environment is implemented within the MLC++ framework (the Machine Learning Library in C++)