Title: Ensemble Methods and Applications
1. Ensemble Methods and Applications
2. Overview of Ensemble
- Learning Set
- Predictor
- Ensemble Predictor
- Prediction
3. Overview of Ensemble
- Two-class problem
- A set of classifiers
- Individual Predictions
- Ensemble Prediction
4. Overview of Ensemble Methods
- Bagging (Breiman, 1996)
- Bootstrapping on data (a minimal sketch follows below)
- Boosting (Freund and Schapire, 1995)
- Recursively reweighting data
- Random Forest (Breiman, 1999)
- Randomly pick features to generate different classifiers (decision trees)
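A minimal sketch of the bagging idea referenced above, assuming scikit-learn decision trees as the base learner (the slides use different base learners in the later case studies); the essential step is drawing a bootstrap sample of the learning set before fitting each ensemble member, then combining by majority vote.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_estimators=25, seed=0):
    """Fit an ensemble by bootstrapping the training set (Breiman, 1996)."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_estimators):
        idx = rng.integers(0, len(X), size=len(X))  # sample with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Majority vote over the ensemble members (labels assumed to be 0/1)."""
    votes = np.mean([m.predict(X) for m in models], axis=0)
    return (votes >= 0.5).astype(int)
```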
5. Why Ensembles Work
- Error decomposition
- Error = σ² + bias² + variance
- σ² is the expected error of the Bayes-optimal classifier
- bias² measures how closely the average guess matches the target
- variance measures the range of fluctuation of the output over different training sets (the decomposition is written out in full below)
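For squared loss, the decomposition on this slide can be written out explicitly; the following is the standard textbook form (with f* the target function and f̂_D the predictor trained on learning set D), not copied from the slides:

```latex
\mathbb{E}_{D}\!\left[(\hat{f}_D(x) - y)^2\right]
  = \underbrace{\sigma^2}_{\text{Bayes-optimal error}}
  + \underbrace{\bigl(\mathbb{E}_D[\hat{f}_D(x)] - f^*(x)\bigr)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}_D\!\left[\bigl(\hat{f}_D(x) - \mathbb{E}_D[\hat{f}_D(x)]\bigr)^2\right]}_{\text{variance}}
```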
6. Why Ensembles Work
- Bagging and Random Forest: the introduction of randomness makes the variance of each learner independent to some extent, so on average the bias converges and the variance decreases.
- Boosting: by focusing on the difficult points, it is able to increase the minimum margin on the learning set, reducing both bias and variance.
7. Why Ensembles Work
8. Why Ensembles Work
9. Other Applications of Ensembles
- Large-scale data
- Distributed data
- Streaming data
- Anonymous knowledge sharing
10. Bacing Algorithm
- Combines the advantages of bagging and boosting
- New algorithm: Bagging with adaptive cost (bacing)
- Recursively reweight the data points according to their vote margin (an illustrative sketch follows below)
- Margin = correct votes − wrong votes
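A minimal sketch of the bacing loop as described above, using a linear SVM base learner (slide 12) and an ensemble size of 100 (slide 16). The exact weighting scheme is on slide 14, whose content did not survive extraction, so the margin-to-weight mapping below is an illustrative assumption, not the authors' scheme.

```python
import numpy as np
from sklearn.svm import LinearSVC  # slide 12: base learner is a linear SVM

def bacing_fit(X, y, n_rounds=100, seed=0):
    """Bagging with adaptive cost: resample according to the current vote margin.
    The exponential down-weighting of high-margin points is a hypothetical choice."""
    rng = np.random.default_rng(seed)
    n = len(X)
    margin = np.zeros(n)            # correct votes minus wrong votes so far
    models, boot_indices = [], []
    for _ in range(n_rounds):
        w = np.exp(-margin)         # assumed: low-margin (hard) points get more weight
        w /= w.sum()
        idx = rng.choice(n, size=n, replace=True, p=w)
        clf = LinearSVC().fit(X[idx], y[idx])
        correct = (clf.predict(X) == y)
        margin += np.where(correct, 1, -1)
        models.append(clf)
        boot_indices.append(idx)    # kept for the out-of-bag computation below
    return models, boot_indices
```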
11. Out-of-bag Margin
- In each round of training, a data point has a probability of 63.2% of being sampled.
- For each data point, only count the votes of those classifiers that were trained without this point.
- The out-of-bag margin is a fairer estimate of the performance on test data (see the sketch below).
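Since a bootstrap sample of size n leaves each point out with probability roughly 1/e ≈ 36.8% (so it is included with probability about 63.2%), every point has some classifiers that never saw it. A small sketch of the out-of-bag margin computation, assuming the bootstrap indices from each round were recorded as in the bacing sketch above:

```python
import numpy as np

def out_of_bag_margin(models, boot_indices, X, y):
    """Out-of-bag margin: for each point, count only votes from classifiers
    whose bootstrap sample did not contain that point."""
    n = len(X)
    margin = np.zeros(n)
    for clf, idx in zip(models, boot_indices):
        oob = np.ones(n, dtype=bool)
        oob[idx] = False                      # points this member was NOT trained on
        pred = clf.predict(X[oob])
        margin[oob] += np.where(pred == y[oob], 1, -1)
    return margin
```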
12. Base Learner: Linear SVM
13. Out-of-bag Margin
14. Weighting Scheme
15. Math Model
16. Computational Experiments
- Computational experiments were performed on 14 UCI data sets.
- The ensemble size is 100.
- For each data set, the result is averaged over five 10-fold cross validations.
- The results are compared with a single linear SVM and with bagging.
17. Results
18. Results
19. Results
20. Results
21. Conclusions
- Bacing does provide higher accuracy than bagging, based on our experiments so far.
- No overfitting is observed.
22. Future Work
- Compare with AdaBoost, arcing-x4
- Other weighting schemes.
- Extend to nonlinear classifiers (nonlinear SVM).
23. Case Study: Churning
- A user who stops using the service from the telecom company is called a churner.
- Data mining problem: identify current users who are about to churn soon.
- Data: demographic and historical service data for current users and churners.
24. Case Study: Churning
- Target a small subset of customers (say 5%) that are most likely to churn.
- Optimize a certain point on the lift curve, i.e., with a certain percentage of positive predictions (a small lift computation sketch follows below).
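A sketch of evaluating a fixed point on the lift curve, assuming the ensemble produces a churn score per customer; the function name and the 5% default are illustrative. Lift at the top p% is the positive rate inside the targeted slice divided by the overall positive rate.

```python
import numpy as np

def lift_at(scores, y, top_frac=0.05):
    """Lift at one point on the lift curve: precision among the top
    `top_frac` highest-scored customers divided by the base positive rate."""
    k = max(1, int(round(top_frac * len(scores))))
    top = np.argsort(scores)[::-1][:k]        # indices of the highest-scored customers
    return y[top].mean() / y.mean()
```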
25. Case Study: Churning
- For a certain point on a lift curve, data are classified as positive or negative according to a threshold.
- The bacing algorithm is applicable.
- Margin = correct votes − threshold
26. Classifier Sharing
- Sharing classifiers within the same problem domain improves the ensemble performance (Prodromidis, Stolfo & Chan, 1998).
27. Introduction
- How about sharing classifiers across slightly
different problem domains?
28. Data
- Source: Direct Marketing Association
- Business situation: a catalog company
- Data:
- 12 years of purchase and promotion records
- Demographic info
- Targets: for each of 19 product categories, learn to predict whether a customer will buy from this category.
29. Data
- The finalized dataset has 103,715 records with 310 features.
- Data split: 13,715 records for training, 9,000 for tuning, and 81,000 for testing.
30. Data
31. Bagging Approach
- For each category, bootstrap 200 positive and 200 negative points from the training set.
- Build a C4.5 decision tree on this half-and-half data.
- Repeat the previous steps 25 times to get a bagging ensemble of 25 trees for each category (see the sketch below).
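A sketch of the per-category bagging step above, using scikit-learn's CART implementation as a stand-in for C4.5 (C4.5 itself is not available in scikit-learn); the balanced 200/200 bootstrap is taken directly from the slide.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def balanced_bagging(X, y, n_trees=25, n_per_class=200, seed=0):
    """One category: 25 trees, each trained on a 200 positive / 200 negative bootstrap."""
    rng = np.random.default_rng(seed)
    pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
    trees = []
    for _ in range(n_trees):
        idx = np.concatenate([rng.choice(pos, n_per_class, replace=True),
                              rng.choice(neg, n_per_class, replace=True)])
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees
```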
32. Problem
- Hard to get a good model for categories with a
small number of buyers in the training set!
33. Ideas
- People buying from the same catalog share some common properties.
- The purchase behavior on different products could be similar; a classifier for category i may be useful for category j.
- A mechanism that allows sharing of classifiers among different categories may help improve overall performance.
34. Algorithm
[Diagram: the pooled classifiers are evaluated on each category's tuning set, Cat1 tune through Cat19 tune]
35. Ensemble Optimization
- Pick a good subset for each category from a pool of 19 × 25 decision trees.
- Classifiers of a good ensemble should be
- Accurate
- Independent
- Searching for the best subset is a combinatorial problem.
36. Literature
- Pruning AdaBoost ensembles, Margineantu & Dietterich, 1997
- Pruning of meta-classifiers, A. L. Prodromidis, S. J. Stolfo & P. K. Chan, 1998
- Meta-evolutionary ensembles, Y. S. Kim, W. N. Street & F. Menczer, 2001
37. Pair-wise Independence Measure
- Construct a prediction matrix P
- P_ij = 1 if classifier j made the wrong prediction on data point i
- P_ij = 0 otherwise
- Let G = P^T P
- G_ii = number of errors made by classifier i
- G_ij = number of common errors of classifier pair (i, j)
- Example (see the sketch below)
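A small sketch of the construction above on tuning data, assuming 0/1 labels; by definition G_ii counts classifier i's errors and G_ij counts the tuning points on which classifiers i and j both err.

```python
import numpy as np

def error_overlap(models, X_tune, y_tune):
    """P[i, j] = 1 iff classifier j errs on tuning point i; G = P^T P."""
    P = np.column_stack([(m.predict(X_tune) != y_tune).astype(int) for m in models])
    G = P.T @ P
    return P, G
```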
38. Pair-wise Independence Measure
- Normalize G:
- G_ii ← G_ii / n, where n is the total number of tuning data points
- G_ij ← G_ij / min(G_ii, G_jj)
- After normalization, all elements of G are between 0 and 1.
39. Modify the G Matrix
- The current G matrix only takes into account the properties of the classifiers on the overall tuning data, which is inappropriate for highly skewed data.
- We balance the weight between positive points and negative points.
- Define the new G = λ·G_pos + (1 − λ)·G_neg (a sketch of the normalization and the blend follows below).
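A sketch of the normalization from slide 38 and the class-balanced blend from slide 39. The symbol λ is my reading of the garbled character on the slide, and the 0.5 default below is an assumption; only the formulas themselves come from the slides.

```python
import numpy as np

def normalized_G(P, n_tune):
    """Slide 38: diagonal divided by n; off-diagonal divided by min of the raw diagonals."""
    G = (P.T @ P).astype(float)
    d = np.diag(G).copy()
    pair_min = np.minimum.outer(d, d)
    pair_min[pair_min == 0] = 1.0           # avoid division by zero for error-free classifiers
    Gn = G / pair_min
    np.fill_diagonal(Gn, d / n_tune)
    return Gn

def balanced_G(P, y_tune, lam=0.5):
    """Slide 39: G = lam * G_pos + (1 - lam) * G_neg, computed on each class separately."""
    pos, neg = (y_tune == 1), (y_tune == 0)
    return lam * normalized_G(P[pos], pos.sum()) + (1 - lam) * normalized_G(P[neg], neg.sum())
```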
40. Integer Programming Formulation
41. MAX CUT Problem
42. MAX CUT Problem
[Diagram: mapping between Ensemble Optimization and Max Cut]
43. Transformation
44. Transformation
45. Transformation
[Diagram: mapping between Ensemble Optimization and Max Cut]
46. MAX CUT Problem
- Semi-definite relaxation of Max Cut
47. MAX CUT Problem
- The max cut problem can be relaxed into a semi-definite programming problem.
- The SDP relaxation can be solved to any pre-specified precision in polynomial time.
- The randomized rounding of the SDP relaxation is guaranteed to achieve at least 87.8% of the optimal objective (Goemans & Williamson, 1995). A relax-and-round sketch follows below.
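A sketch of the Goemans-Williamson relax-and-round scheme for a generic weight matrix W, assuming cvxpy with its bundled SDP solver is available; this illustrates the relaxation mentioned on the slide, not the specific G-derived matrix used in the ensemble-optimization formulation.

```python
import numpy as np
import cvxpy as cp

def gw_max_cut(W, seed=0):
    """Goemans-Williamson: solve the SDP relaxation of max cut, then round with a
    random hyperplane (at least 87.8% of the optimum in expectation)."""
    n = W.shape[0]
    X = cp.Variable((n, n), PSD=True)
    prob = cp.Problem(cp.Maximize(0.25 * cp.sum(cp.multiply(W, 1 - X))),
                      [cp.diag(X) == 1])
    prob.solve()
    # Factor X = V^T V, then assign each vertex by the sign of a random projection.
    vals, vecs = np.linalg.eigh(X.value)
    V = np.diag(np.sqrt(np.clip(vals, 0, None))) @ vecs.T
    r = np.random.default_rng(seed).normal(size=n)
    return np.sign(r @ V)       # +1 / -1 side of the cut for each vertex
```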
48. Ensemble Optimization Problem
49-53. [Per-category result plots; each panel is labeled with the category's fraction of positive points, ranging from pos = 0.02 to pos = 2.65]
54. Computational Results
- The optimized ensemble is seldom significantly worse than the original ensemble.
- Significant improvement can be found for those categories with very few positive points.
55. Computational Results
56. Computational Results
57. Conclusion
- Ensemble optimization is helpful for sharing knowledge between related domains.
- Potential applications: marketing, email filtering.
58. Pruning Bagging Ensembles
59. Pruning Boosting Ensembles
60. Pruning Random Forests
61. Future Work
- Fine-tune the parameters (normalization scheme, relative weights between strength and diversity).
- Compare to other pruning algorithms.
62. Case Study: Marketing Model for a New Product
- For a new product, there is limited information about its customers.
- What is the best use of the limited information?
- Directly build models on it?
- Use it to pick a subset of models of other related products?
63. Case Study: Marketing Model for a New Product
- Satellite radio data set
- Survey data
- It is known whether participants possess other related electronic devices such as a car radio, CD player, or cassette player.
64. Case Study: Marketing Model for a New Product
- Treat related electronics as old products and build marketing models on them.
- Compare which approach is better:
- Directly build a marketing model for satellite radio
- Select models from those for related products
65. Hierarchical Document Classification
- Given a hierarchical structure and sample documents,
- Build a system that can assign a test document to one (or more?) node(s).
66. Hierarchical Document Classification
- Build a classifier at each node.
- Use the sample documents of its child nodes and their respective offspring nodes as training samples (see the sketch below).
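A sketch of the per-node training step described above, assuming a simple dict-based hierarchy (each node maps to its children) and a dict of sample documents per node; the TF-IDF plus logistic regression pipeline is an illustrative choice, not the classifier from the slides. Each internal node learns to route a document to one of its children, using all documents under each child as that child's training set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def collect_docs(tree, docs, node):
    """All sample documents under `node`: its own plus those of all offspring nodes."""
    out = list(docs.get(node, []))
    for child in tree.get(node, []):
        out += collect_docs(tree, docs, child)
    return out

def train_node_classifier(tree, docs, node):
    """One classifier per internal node: predict which child a document belongs under."""
    texts, labels = [], []
    for child in tree.get(node, []):
        child_docs = collect_docs(tree, docs, child)
        texts += child_docs
        labels += [child] * len(child_docs)
    return make_pipeline(TfidfVectorizer(),
                         LogisticRegression(max_iter=1000)).fit(texts, labels)
```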
67. Hierarchical Document Classification
- Can we group the classifiers at each node to form
an ensemble and select classifiers for each node?