1
Ensemble Methods and Applications
  • Yi Zhang

2
Overview of Ensemble
  • Learning Set
  • Predictor
  • Ensemble Predictor
  • Prediction

3
Overview of Ensemble
  • Two-class problem
  • A set of classifiers
  • Individual Predictions
  • Ensemble Prediction

4
Overview of Ensemble Methods
  • Bagging (Breiman, 1996)
  • Bootstrapping on data
  • Boosting (Freund and Schapire, 1995)
  • Recursively reweighting data
  • Random Forest (Breiman, 1999)
  • Randomly pick features to generate different
    classifiers (decision trees)

5
Why Ensemble Worked
  • Error decomposition:
  • Error = σ² + bias² + variance
  • σ² is the expected error of the Bayes-optimal
    classifier
  • bias² measures how closely the average guess
    matches the target
  • variance measures the range of fluctuation of the
    output for different training sets

6
Why Ensemble Worked
  • Bagging and Random Forest: the introduction of
    randomness makes the variance of each learner
    independent to some extent; averaging leaves the
    bias roughly unchanged while the variance
    decreases.
  • Boosting: by focusing on the difficult points, it
    is able to increase the minimum margin on the
    learning set, reducing both bias and variance.

7
Why Ensemble Worked
8
Why Ensemble Worked
  • If we set the rule (formula on the slide is not
    transcribed)

9
Other applications of Ensemble
  • Large scale data
  • Distributed data
  • Streaming data
  • Anonymous knowledge sharing

10
Bacing Algorithm
  • Combines the advantages of bagging and boosting
  • New algorithm: bagging with adaptive cost
    (bacing)
  • Recursively reweights the data points according
    to their vote margin
  • Margin = correct votes − wrong votes
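
A minimal sketch of the bacing loop, assuming linear SVM base
learners (the deck's base learner) and an illustrative exponential
reweighting rule; the exact cost update is not transcribed, so the
update line below is an assumption, not the talk's formula.

  import numpy as np
  from sklearn.svm import LinearSVC

  def bacing(X, y, rounds=100, rng=np.random.default_rng(0)):
      n = len(y)
      weights = np.ones(n) / n          # sampling weights over data points
      votes = np.zeros(n)               # running tally: correct - wrong votes
      models = []
      for _ in range(rounds):
          # Weighted bootstrap: low-margin points are sampled more often.
          idx = rng.choice(n, size=n, replace=True, p=weights)
          clf = LinearSVC().fit(X[idx], y[idx])
          models.append(clf)
          votes += np.where(clf.predict(X) == y, 1, -1)
          # Assumed adaptive cost: weight grows as the vote margin drops.
          weights = np.exp(-votes / len(models))
          weights /= weights.sum()
      return models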

11
Out-of-bag Margin
  • In each round of training, a data point has a
    probability of about 63.2% (1 − (1 − 1/n)ⁿ for
    large n) of being sampled.
  • For each data point, count only the votes of those
    classifiers that were trained without this point.
  • The out-of-bag margin is a fairer estimate of the
    performance on test data.
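
A short sketch of the out-of-bag margin computation; it assumes the
bootstrap index sets from training were kept, and the variable names
are ours.

  import numpy as np

  def oob_margin(models, bootstrap_indices, X, y):
      # For each point, count only votes of classifiers whose
      # bootstrap sample did not contain that point.
      n = len(y)
      margin = np.zeros(n)
      for clf, idx in zip(models, bootstrap_indices):
          oob = np.ones(n, dtype=bool)
          oob[idx] = False                   # in-bag points get no vote
          pred = clf.predict(X[oob])
          margin[oob] += np.where(pred == y[oob], 1, -1)
      return margin                          # correct votes - wrong votes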

12
Base learner: Linear SVM
13
Out-of-bag Margin
14
Weighting Scheme
  • Weight on each point i

15
Math Model
16
Computational Experiments
  • Computational experiments were performed on 14 UCI
    data sets.
  • The ensemble size is 100.
  • For each data set, the result is averaged over
    five 10-fold cross validations.
  • The results are compared with a single linear SVM
    and with bagging.
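
The protocol can be reproduced roughly as follows (a sketch using
scikit-learn; train_fn and predict_fn stand for any of the compared
methods, e.g. the bacing sketch above):

  import numpy as np
  from sklearn.model_selection import StratifiedKFold

  def evaluate(X, y, train_fn, predict_fn, repeats=5, folds=10):
      # Five 10-fold cross validations, accuracy averaged over all folds.
      scores = []
      for seed in range(repeats):
          cv = StratifiedKFold(n_splits=folds, shuffle=True, random_state=seed)
          for tr, te in cv.split(X, y):
              model = train_fn(X[tr], y[tr])
              scores.append(np.mean(predict_fn(model, X[te]) == y[te]))
      return np.mean(scores)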

17
Results
18
Results
19
Results
20
Results
21
Conclusions
  • Bacing does provide higher accuracy than bagging,
    based on our experiments so far.
  • No overfitting was observed.

22
Future Work
  • Compare with AdaBoost and arcing-x4.
  • Other weighting schemes.
  • Extend to nonlinear classifiers (nonlinear SVM).

23
Case Study: Churning
  • A user who stops using a telecom company's
    service is called a churner.
  • Data mining problem: identify current users who
    are about to churn soon.
  • Data: demographic and historical service data for
    current users and churners.

24
Case Study: Churning
  • Target a small subset of customers (say 5%) that
    are most likely to churn.
  • Optimize a certain point on the lift curve (i.e.,
    a certain percentage of positive predictions).

25
Case Study: Churning
  • For a certain point on a lift curve, data are
    classified as positive or negative according to a
    threshold.
  • The bacing algorithm is applicable.
  • Margin = correct votes − threshold
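
A sketch of the lift-curve targeting step: score each customer by
the ensemble's churn votes and flag the top fraction (5% is the
slide's example; the 0/1-label convention is an assumption).

  import numpy as np

  def top_fraction(models, X, fraction=0.05):
      # Each model votes 1 (churn) or 0 (stay); sum the votes.
      votes = sum(clf.predict(X) for clf in models)
      cutoff = np.quantile(votes, 1 - fraction)  # vote threshold for the top 5%
      return votes >= cutoff                     # positive iff votes - cutoff >= 0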

26
Classifier Sharing
  • Sharing classifiers within the same problem
    domain improves the ensemble performance
    (Prodromidis, Stolfo & Chan, 1998).

27
Introduction
  • How about sharing classifiers across slightly
    different problem domains?

28
Data
  • Source: Direct Marketing Association
  • Business situation: a catalog company
  • Data:
  • 12 years of purchase and promotion records
  • Demographic info
  • Targets: for each of 19 product categories, learn
    to predict whether a customer will buy from that
    category.

29
Data
  • The finalized dataset has 103,715 records with 310
    features.
  • Data split: 13,715 records for training, 9,000 for
    tuning, and 81,000 for testing.

30
Data
31
Bagging Approach
  • For each category, bootstrap 200 positive and 200
    negative points from the training set.
  • Build a C4.5 decision tree on this balanced
    (half-and-half) data.
  • Repeat the previous steps 25 times and get a
    bagging ensemble of 25 trees for each category.
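
A sketch of this balanced-bootstrap step, with scikit-learn's CART
trees standing in for the C4.5 of the slides:

  import numpy as np
  from sklearn.tree import DecisionTreeClassifier

  def bagged_trees(X, y, n_trees=25, per_class=200,
                   rng=np.random.default_rng(0)):
      # 200 positives + 200 negatives per tree, 25 trees per category.
      pos, neg = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
      trees = []
      for _ in range(n_trees):
          idx = np.concatenate([rng.choice(pos, per_class, replace=True),
                                rng.choice(neg, per_class, replace=True)])
          trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
      return trees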

32
Problem
  • Hard to get a good model for categories with a
    small number of buyers in the training set!

33
Ideas
  • People buying from the same catalog share some
    common properties.
  • The purchase behavior on different products could
    be similar. A classifier for category i may be
    useful for category j.
  • A mechanism that allows sharing of classifiers
    among different categories may help improve
    overall performance.

34
Algorithm
(Diagram: each category's tuning set, Cat1 tune through Cat19 tune,
selects classifiers from the shared pool.)
35
Ensemble optimization
  • Pick a good subset for each category from a pool
    of 19 × 25 = 475 decision trees.
  • Classifiers of a good ensemble should be
  • Accurate
  • Independent
  • Searching for the best subset is a combinatorial
    problem.

36
Literature
  • Pruning AdaBoost ensembles: Margineantu &
    Dietterich, 1997
  • Pruning of meta-classifiers: A. L. Prodromidis,
    S. J. Stolfo & P. K. Chan, 1998
  • Meta-evolutionary ensembles: Y. S. Kim, W. N.
    Street & F. Menczer, 2001

37
Pair-wise Independence Measure
  • Construct a prediction matrix P:
  • P_ij = 1 if classifier j made the wrong
    prediction on data point i
  • P_ij = 0 otherwise
  • Let G = PᵀP
  • G_ii = number of errors made by classifier i
  • G_ij = number of common errors of classifier pair
    (i, j)
  • Example

38
Pair-wise Independence Measure
  • Normalize G:
  • G_ii ← G_ii / n, where n is the total number of
    tuning data points
  • G_ij ← G_ij / min(G_ii, G_jj)
  • After normalization, all elements of G are
    between 0 and 1.

39
Modify G matrix
  • The current G matrix only takes into account the
    properties of the classifiers on the overall
    tuning data, which is inappropriate for highly
    skewed data.
  • We balance the weight between positive points and
    negative points.
  • Define the new G = λ·G_pos + (1 − λ)·G_neg
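
Putting slides 37-39 together, the matrix can be computed as below
(a sketch; the variable names and the default λ are ours):

  import numpy as np

  def g_matrix(preds, y):
      # preds: (n_points, n_classifiers) array of predicted labels.
      P = (preds != y[:, None]).astype(float)  # P_ij = 1 iff classifier j errs on i
      G = P.T @ P                               # diagonal: errors; off-diagonal: common errors
      d = np.diag(G).copy()
      Gn = G / np.clip(np.minimum.outer(d, d), 1, None)  # G_ij / min(G_ii, G_jj)
      np.fill_diagonal(Gn, d / len(y))                   # G_ii / n
      return Gn

  def balanced_g(preds, y, lam=0.5):
      # Weight positive and negative tuning points separately (skewed data).
      return (lam * g_matrix(preds[y == 1], y[y == 1])
              + (1 - lam) * g_matrix(preds[y == 0], y[y == 0]))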

40
Integer Programming Formulation
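The formulation itself is a figure in the original deck; a plausible
reconstruction from the surrounding slides (all notation here is
ours) selects k of the m classifiers by minimizing combined error
and error overlap:

  \min_{x \in \{0,1\}^m} \; x^\top G x
  \qquad \text{s.t.} \quad \sum_{j=1}^{m} x_j = k

Here x_j = 1 keeps classifier j; the diagonal of G penalizes weak
classifiers and the off-diagonal penalizes pairs that err together.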
41
MAX CUT Problem
42
MAX CUT Problem
(Diagram: mapping Ensemble Optimization to Max Cut)
43
Transformation
44
Transformation
45
Transformation
(Diagram: mapping Ensemble Optimization to Max Cut)
46
MAX CUT Problem
Semi-definite relaxation of Max Cut
47
MAX CUT Problem
  • The max cut problem can be relaxed into a
    semi-definite programming (SDP) problem.
  • The SDP relaxation can be solved to any
    pre-specified precision in polynomial time.
  • The randomized rounding of the SDP solution is
    guaranteed to achieve at least 87.8% of the
    optimal objective (Goemans & Williamson, 1995).
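
A sketch of the Goemans-Williamson randomized rounding step,
assuming the SDP solution is available as a matrix V with one unit
vector per node (obtaining V requires an SDP solver, not shown):

  import numpy as np

  def round_sdp(V, rng=np.random.default_rng(0)):
      # Cut the graph by the side of a random hyperplane each
      # node's vector falls on.
      r = rng.standard_normal(V.shape[1])   # random hyperplane normal
      return np.sign(V @ r)                 # +1 / -1 side assignment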

48
Ensemble Optimization Problem
49
(Per-category result plots; the annotations give each category's
pos value, ranging from 0.02 to 2.65.)
54
Computational Result
  • The optimized ensemble is seldom significantly
    worse than the original ensemble.
  • Significant improvement can be found for those
    categories with very few positive points.

55
Computational Result
56
Computational Result
57
Conclusion
  • Ensemble optimization is helpful for sharing
    knowledge between related domains.
  • Potential applications: marketing, email
    filtering.

58
Pruning Bagging Ensembles
59
Pruning Boosting Ensembles
60
Pruning Random Forest
61
Future Work
  • Fine-tune the parameters (normalization scheme,
    relative weights between strength and diversity).
  • Compare to other pruning algorithms.

62
Case Study: Marketing Model for a New Product
  • For a new product, there is limited information
    about its customers.
  • What is the best use of this limited information?
  • Directly build models on it?
  • Use it to pick a subset of models of other
    related products?

63
Case Study: Marketing Model for a New Product
  • Satellite radio data set
  • Survey data
  • It is known whether participants own other
    related electronic devices such as a car radio,
    CD player, or cassette player.

64
Case Study: Marketing Model for a New Product
  • Treat related electronics as old products and
    build marketing models on them.
  • Compare which approach is better:
  • Directly building a marketing model for satellite
    radio
  • Selecting models from those for related products

65
Hierarchical Document Classification
  • Given the hierarchical structure and sample
    documents,
  • build a system that can assign a test document to
    one (or more?) node(s).

66
Hierarchical Document Classification
  • Build a classifier at each node
  • Use sample documents of its child nodes and their
    respective offspring nodes as training samples.
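
Routing a test document down the hierarchy then looks like this
(a sketch; Node is a hypothetical structure holding a local
classifier and a dict of children keyed by its labels):

  def classify(node, doc):
      # At each node, the local classifier picks the child subtree;
      # stop when a leaf category is reached.
      while node.children:
          child_label = node.classifier.predict([doc])[0]
          node = node.children[child_label]
      return node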

67
Hierarchical Document Classification
  • Can we group the classifiers at each node to form
    an ensemble and select classifiers for each node?