1
Ensemble Methods and Applications
  • Yi Zhang

2
Overview of Ensemble
  • Learning Set
  • Predictor
  • Ensemble Predictor
  • Prediction

3
Overview of Ensemble
  • Two-class problem
  • A set of classifiers
  • Individual Predictions
  • Ensemble Prediction

4
Overview of Ensemble Methods
  • Bagging (Breiman, 1996)
  • Bootstrapping on data
  • Boosting (Freund and Schapire, 1995)
  • Recursively reweighting data
  • Random Forest (Breiman, 1999)
  • Randomly pick features to generate different
    classifiers (decision trees)

5
Why Ensemble Worked
  • Error decomposition:
  • Error = σ² + bias² + variance
  • σ² is the expected error of the Bayes-optimal
    classifier
  • bias² measures how closely the average guess
    matches the target
  • variance measures the range of fluctuation of the
    output for different training sets

6
Why Ensemble Worked
  • Bagging and Random Forest: the introduction of
    randomness makes the variance of each learner
    independent to some extent; averaging leaves the
    bias roughly unchanged while the variance
    decreases.
  • Boosting: by focusing on the difficult points, it
    is able to increase the minimum margin on the
    learning set, reducing both bias and variance.

7
Why Ensemble Worked
8
Why Ensemble Worked
  • If we set the rule (formula on the slide is not
    transcribed)

9
Other applications of Ensemble
  • Large scale data
  • Distributed data
  • Streaming data
  • Anonymous knowledge sharing

10
Bacing Algorithm
  • Combines the advantages of bagging and boosting
  • New algorithm: bagging with adaptive cost
    (bacing)
  • Recursively reweights the data points according
    to their vote margin
  • Margin = correct votes − wrong votes
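
A minimal sketch of the bacing loop, assuming linear SVM base
learners (the deck's base learner) and an illustrative exponential
reweighting rule; the exact cost update is not transcribed, so the
update line below is an assumption, not the talk's formula.

  import numpy as np
  from sklearn.svm import LinearSVC

  def bacing(X, y, rounds=100, rng=np.random.default_rng(0)):
      n = len(y)
      weights = np.ones(n) / n          # sampling weights over data points
      votes = np.zeros(n)               # running tally: correct - wrong votes
      models = []
      for _ in range(rounds):
          # Weighted bootstrap: low-margin points are sampled more often.
          idx = rng.choice(n, size=n, replace=True, p=weights)
          clf = LinearSVC().fit(X[idx], y[idx])
          models.append(clf)
          votes += np.where(clf.predict(X) == y, 1, -1)
          # Assumed adaptive cost: weight grows as the vote margin drops.
          weights = np.exp(-votes / len(models))
          weights /= weights.sum()
      return models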

11
Out-of-bag Margin
  • In each round of training, a data point has a
    probability of about 63.2% (1 − (1 − 1/n)ⁿ for
    large n) of being sampled.
  • For each data point, count only the votes of those
    classifiers that were trained without this point.
  • The out-of-bag margin is a fairer estimate of the
    performance on test data.
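
A short sketch of the out-of-bag margin computation; it assumes the
bootstrap index sets from training were kept, and the variable names
are ours.

  import numpy as np

  def oob_margin(models, bootstrap_indices, X, y):
      # For each point, count only votes of classifiers whose
      # bootstrap sample did not contain that point.
      n = len(y)
      margin = np.zeros(n)
      for clf, idx in zip(models, bootstrap_indices):
          oob = np.ones(n, dtype=bool)
          oob[idx] = False                   # in-bag points get no vote
          pred = clf.predict(X[oob])
          margin[oob] += np.where(pred == y[oob], 1, -1)
      return margin                          # correct votes - wrong votes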

12
Base learner: Linear SVM
13
Out-of-bag Margin
14
Weighting Scheme
  • Weight on each point i

15
Math Model
16
Computational Experiments
  • Computational experiments were performed on 14 UCI
    data sets.
  • The ensemble size is 100.
  • For each data set, the result is averaged over
    five 10-fold cross validations.
  • The results are compared with a single linear SVM
    and with bagging.
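
The protocol can be reproduced roughly as follows (a sketch using
scikit-learn; train_fn and predict_fn stand for any of the compared
methods, e.g. the bacing sketch above):

  import numpy as np
  from sklearn.model_selection import StratifiedKFold

  def evaluate(X, y, train_fn, predict_fn, repeats=5, folds=10):
      # Five 10-fold cross validations, accuracy averaged over all folds.
      scores = []
      for seed in range(repeats):
          cv = StratifiedKFold(n_splits=folds, shuffle=True, random_state=seed)
          for tr, te in cv.split(X, y):
              model = train_fn(X[tr], y[tr])
              scores.append(np.mean(predict_fn(model, X[te]) == y[te]))
      return np.mean(scores)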

17
Results
18
Results
19
Results
20
Results
21
Conclusions
  • Bacing does provide higher accuracy than bagging,
    based on our experiments so far.
  • No overfitting was observed.

22
Future Work
  • Compare with AdaBoost and arcing-x4.
  • Other weighting schemes.
  • Extend to nonlinear classifiers (nonlinear SVM).

23
Case Study: Churning
  • A user who stops using a telecom company's
    service is called a churner.
  • Data mining problem: identify current users who
    are about to churn soon.
  • Data: demographic and historical service data for
    current users and churners.

24
Case Study: Churning
  • Target a small subset of customers (say 5%) that
    are most likely to churn.
  • Optimize a certain point on the lift curve (i.e.,
    a certain percentage of positive predictions).

25
Case Study: Churning
  • For a certain point on a lift curve, data are
    classified as positive or negative according to a
    threshold.
  • The bacing algorithm is applicable.
  • Margin = correct votes − threshold
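
A sketch of the lift-curve targeting step: score each customer by
the ensemble's churn votes and flag the top fraction (5% is the
slide's example; the 0/1-label convention is an assumption).

  import numpy as np

  def top_fraction(models, X, fraction=0.05):
      # Each model votes 1 (churn) or 0 (stay); sum the votes.
      votes = sum(clf.predict(X) for clf in models)
      cutoff = np.quantile(votes, 1 - fraction)  # vote threshold for the top 5%
      return votes >= cutoff                     # positive iff votes - cutoff >= 0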

26
Classifier Sharing
  • Sharing classifiers within the same problem
    domain improves the ensemble performance
    (Prodromidis, Stolfo & Chan, 1998).

27
Introduction
  • How about sharing classifiers across slightly
    different problem domains?

28
Data
  • Source: Direct Marketing Association
  • Business situation: a catalog company
  • Data:
  • 12 years of purchase and promotion records
  • Demographic info
  • Targets: for each of 19 product categories, learn
    to predict whether a customer will buy from that
    category.

29
Data
  • The finalized dataset has 103,715 records with 310
    features.
  • Data split: 13,715 records for training, 9,000 for
    tuning, and 81,000 for testing.

30
Data
31
Bagging Approach
  • For each category, bootstrap 200 positive and 200
    negative points from the training set.
  • Build a C4.5 decision tree on this balanced
    (half-and-half) data.
  • Repeat the previous steps 25 times and get a
    bagging ensemble of 25 trees for each category.
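
A sketch of this balanced-bootstrap step, with scikit-learn's CART
trees standing in for the C4.5 of the slides:

  import numpy as np
  from sklearn.tree import DecisionTreeClassifier

  def bagged_trees(X, y, n_trees=25, per_class=200,
                   rng=np.random.default_rng(0)):
      # 200 positives + 200 negatives per tree, 25 trees per category.
      pos, neg = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
      trees = []
      for _ in range(n_trees):
          idx = np.concatenate([rng.choice(pos, per_class, replace=True),
                                rng.choice(neg, per_class, replace=True)])
          trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
      return trees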

32
Problem
  • Hard to get a good model for categories with a
    small number of buyers in the training set!

33
Ideas
  • People buying from the same catalog share some
    common properties.
  • The purchase behavior on different products could
    be similar. A classifier for category i may be
    useful for category j.
  • A mechanism that allows sharing of classifiers
    among different categories may help improve
    overall performance.

34
Algorithm
(Diagram: each category's tuning set, Cat1 tune through Cat19 tune,
selects classifiers from the shared pool.)
35
Ensemble optimization
  • Pick a good subset for each category from a pool
    of 19 × 25 = 475 decision trees.
  • Classifiers of a good ensemble should be
  • Accurate
  • Independent
  • Searching for the best subset is a combinatorial
    problem.

36
Literature
  • Pruning AdaBoost ensembles: Margineantu &
    Dietterich, 1997
  • Pruning of meta-classifiers: A. L. Prodromidis,
    S. J. Stolfo & P. K. Chan, 1998
  • Meta-evolutionary ensembles: Y. S. Kim, W. N.
    Street & F. Menczer, 2001

37
Pair-wise Independence Measure
  • Construct a prediction matrix P:
  • P_ij = 1 if classifier j made the wrong
    prediction on data point i
  • P_ij = 0 otherwise
  • Let G = PᵀP
  • G_ii = number of errors made by classifier i
  • G_ij = number of common errors of classifier pair
    (i, j)
  • Example

38
Pair-wise Independence Measure
  • Normalize G:
  • G_ii ← G_ii / n, where n is the total number of
    tuning data points
  • G_ij ← G_ij / min(G_ii, G_jj)
  • After normalization, all elements of G are
    between 0 and 1.

39
Modify G matrix
  • The current G matrix only takes into account the
    properties of the classifiers on the overall
    tuning data, which is inappropriate for highly
    skewed data.
  • We balance the weight between positive points and
    negative points.
  • Define the new G = λ·G_pos + (1 − λ)·G_neg
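
Putting slides 37-39 together, the matrix can be computed as below
(a sketch; the variable names and the default λ are ours):

  import numpy as np

  def g_matrix(preds, y):
      # preds: (n_points, n_classifiers) array of predicted labels.
      P = (preds != y[:, None]).astype(float)  # P_ij = 1 iff classifier j errs on i
      G = P.T @ P                               # diagonal: errors; off-diagonal: common errors
      d = np.diag(G).copy()
      Gn = G / np.clip(np.minimum.outer(d, d), 1, None)  # G_ij / min(G_ii, G_jj)
      np.fill_diagonal(Gn, d / len(y))                   # G_ii / n
      return Gn

  def balanced_g(preds, y, lam=0.5):
      # Weight positive and negative tuning points separately (skewed data).
      return (lam * g_matrix(preds[y == 1], y[y == 1])
              + (1 - lam) * g_matrix(preds[y == 0], y[y == 0]))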

40
Integer Programming Formulation
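The formulation itself is a figure in the original deck; a plausible
reconstruction from the surrounding slides (all notation here is
ours) selects k of the m classifiers by minimizing combined error
and error overlap:

  \min_{x \in \{0,1\}^m} \; x^\top G x
  \qquad \text{s.t.} \quad \sum_{j=1}^{m} x_j = k

Here x_j = 1 keeps classifier j; the diagonal of G penalizes weak
classifiers and the off-diagonal penalizes pairs that err together.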
41
MAX CUT Problem
42
MAX CUT Problem
(Diagram: mapping Ensemble Optimization to Max Cut)
43
Transformation
44
Transformation
45
Transformation
(Diagram: mapping Ensemble Optimization to Max Cut)
46
MAX CUT Problem
Semi-definite relaxation of Max Cut
47
MAX CUT Problem
  • The max cut problem can be relaxed into a
    semi-definite programming (SDP) problem.
  • The SDP relaxation can be solved to any
    pre-specified precision in polynomial time.
  • The randomized rounding of the SDP solution is
    guaranteed to achieve at least 87.8% of the
    optimal objective (Goemans & Williamson, 1995).
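
A sketch of the Goemans-Williamson randomized rounding step,
assuming the SDP solution is available as a matrix V with one unit
vector per node (obtaining V requires an SDP solver, not shown):

  import numpy as np

  def round_sdp(V, rng=np.random.default_rng(0)):
      # Cut the graph by the side of a random hyperplane each
      # node's vector falls on.
      r = rng.standard_normal(V.shape[1])   # random hyperplane normal
      return np.sign(V @ r)                 # +1 / -1 side assignment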

48
Ensemble Optimization Problem
49
(Per-category result plots; the annotations give each category's
pos value, ranging from 0.02 to 2.65.)
54
Computational Result
  • The optimized ensemble is seldom significantly
    worse than the original ensemble.
  • Significant improvement can be found for those
    categories with very few positive points.

55
Computational Result
56
Computational Result
57
Conclusion
  • Ensemble optimization is helpful for sharing
    knowledge between related domains.
  • Potential applications: marketing, email
    filtering.

58
Pruning Bagging Ensembles
59
Pruning Boosting Ensembles
60
Pruning Random Forest
61
Future Work
  • Fine-tune the parameters (normalization scheme,
    relative weights between strength and diversity).
  • Compare to other pruning algorithms.

62
Case Study: Marketing Model for a New Product
  • For a new product, there is limited information
    about its customers.
  • What is the best use of this limited information?
  • Directly build models on it?
  • Use it to pick a subset of models of other
    related products?

63
Case Study: Marketing Model for a New Product
  • Satellite radio data set
  • Survey data
  • It is known whether participants own other
    related electronic devices such as a car radio,
    CD player, or cassette player.

64
Case Study: Marketing Model for a New Product
  • Treat related electronics as old products and
    build marketing models on them.
  • Compare which approach is better:
  • Directly building a marketing model for satellite
    radio
  • Selecting models from those for related products

65
Hierarchical Document Classification
  • Given the hierarchical structure and sample
    documents,
  • build a system that can assign a test document to
    one (or more?) node(s).

66
Hierarchical Document Classification
  • Build a classifier at each node
  • Use sample documents of its child nodes and their
    respective offspring nodes as training samples.
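
Routing a test document down the hierarchy then looks like this
(a sketch; Node is a hypothetical structure holding a local
classifier and a dict of children keyed by its labels):

  def classify(node, doc):
      # At each node, the local classifier picks the child subtree;
      # stop when a leaf category is reached.
      while node.children:
          child_label = node.classifier.predict([doc])[0]
          node = node.children[child_label]
      return node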

67
Hierarchical Document Classification
  • Can we group the classifiers at each node to form
    an ensemble and select classifiers for each node?