1
Decision Trees
  • Steve Herrin
  • University of Washington

2
What is a Classifier?
  • Given a set of training cases, each with a vector of
    attributes and a label: (X, t) = (x1, x2, …, xk, t)
  • Usually, t is a discrete variable representing the
    class into which a case falls
  • Want a way to predict t based on X
  • Machine learning algorithms, called classifiers,
    provide a means to do this
  • Examples: neural nets, decision trees, Bayesian
    filters, and many more

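As a concrete illustration (a sketch, not from the original slides), a set of training cases can be stored as a feature matrix X and a label vector t; the attribute values and labels below are made up.

```python
import numpy as np

# Hypothetical training cases: each row of X is one attribute vector
# (x1, x2, x3), and t holds the class each case falls into
# (0 = background, 1 = signal).
X = np.array([[0.5, 1.2, 3.0],
              [0.7, 0.9, 2.5],
              [2.1, 0.1, 0.4],
              [1.9, 0.2, 0.6]])
t = np.array([1, 1, 0, 0])

# A classifier is any algorithm that learns a mapping X -> t from
# these cases and can then predict t for a new, unseen X.
```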
3
Decision Trees
  • Decision trees are a type of classifier
  • Node, Leaf and Branch structure
  • Generally binary
  • Leaf value may reflect a full classification
    (t = 0 or 1)
  • Or may give an idea of how close a case is to one
    class (depends on implementation)

[Diagram: a small binary tree. The root node splits on xi > a
vs. xi < a; one branch ends in a leaf with t = 1, the other
leads to a node that splits on xj > b vs. xj < b, whose leaves
carry t = 0 and t = 0.83. The node, branch, and leaf elements
are labeled.]
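A minimal sketch (not from the slides) of this node/branch/leaf structure and of classifying a case by walking the tree; the attribute indices and cut values below are invented to mirror the diagram.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TreeNode:
    # Internal node: split on attribute `attr` at threshold `cut`.
    # Leaf: `value` holds the class label or a signal-likeness score.
    attr: Optional[int] = None
    cut: Optional[float] = None
    left: Optional["TreeNode"] = None   # branch taken when x[attr] < cut
    right: Optional["TreeNode"] = None  # branch taken when x[attr] >= cut
    value: Optional[float] = None

def classify(node, x):
    """Walk from the root node down to a leaf and return the leaf value."""
    while node.value is None:
        node = node.left if x[node.attr] < node.cut else node.right
    return node.value

# The tree from the diagram: split on x_i at a, then on x_j at b
# (indices and cut values are hypothetical).
i, j, a, b = 0, 1, 1.0, 2.0
tree = TreeNode(attr=i, cut=a,
                right=TreeNode(value=1.0),
                left=TreeNode(attr=j, cut=b,
                              left=TreeNode(value=0.0),
                              right=TreeNode(value=0.83)))

print(classify(tree, [0.4, 2.5]))  # -> 0.83
```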
4
Building a Decision Tree
  • At a node, find the attribute xm in X that
    provides the most discrimination
  • Find what value of xm to branch on
  • Gini improvement: how much purer the
    subsequent sets are
  • Information gain: how much the entropy of the
    subsequent sets decreases
  • Absolute error: how much the signal/background
    separation improves
  • The first two work best and give similar results

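A sketch of the split search at a node, assuming the unweighted Gini form n·p·(1−p); the function names are illustrative, not from the slides.

```python
import numpy as np

def gini(t):
    """Gini of a set of labels t in {0, 1}: n * p * (1 - p)."""
    if len(t) == 0:
        return 0.0
    p = np.mean(t)
    return len(t) * p * (1.0 - p)

def best_split(X, t):
    """Scan every attribute and candidate cut value; keep the split
    with the largest Gini improvement (parent minus children)."""
    best_attr, best_cut, best_gain = None, None, 0.0
    parent = gini(t)
    for m in range(X.shape[1]):
        for cut in np.unique(X[:, m]):
            left, right = t[X[:, m] < cut], t[X[:, m] >= cut]
            gain = parent - gini(left) - gini(right)
            if gain > best_gain:
                best_attr, best_cut, best_gain = m, cut, gain
    return best_attr, best_cut
```

Information gain works the same way with entropy in place of Gini (see the appendix slides).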
5
Overtraining
  • For any classifier, there is a danger of
    overtraining
  • Ideally, separating surface between classes
    should be simple
  • However, in an overtrained classifier, the
    separating surface is complicated
  • This occurs because the classifier optimizes too
    much on the training data
  • Leads to poor performance on test data

[Diagram: two classes scattered in the (xi, xj) plane, with a
smooth "ideal" separating surface and a convoluted
"overtrained" one.]
6
Pre-pruning
  • The tree growing process continues recursively
  • To prevent overtraining, the process stops at a node
    when certain conditions are met:
  • The node contains only one class of case (same t)
  • The node contains cases that all have the same X
  • The node contains fewer than N events (e.g. N ≈ 100)

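A sketch of how these stopping conditions might look as the base case of the recursive growing routine; N_MIN stands in for the N ≈ 100 above.

```python
import numpy as np

N_MIN = 100  # stop splitting nodes with fewer than N events

def should_stop(X, t):
    """Pre-pruning: return True if this node must become a leaf."""
    if len(np.unique(t)) <= 1:            # only one class of case (same t)
        return True
    if len(np.unique(X, axis=0)) <= 1:    # all cases share the same X
        return True
    if len(t) < N_MIN:                    # too few events in the node
        return True
    return False
```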
7
Pruning
  • For a more complicated tree, pre-pruning may
    still allow overtraining
  • Many different pruning algorithms exist
  • Simplest approach:
  • Withhold a small set of the training data
  • Grow the tree using the remaining data
  • After the tree is finished, prune a node to a leaf
    if doing so leads to a lower error rate on the
    withheld data

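A sketch of this simple reduced-error pruning, assuming trees are stored as nested dicts with a cached 'majority' label at each internal node (that representation is an assumption, not from the slides).

```python
import numpy as np

def predict(node, x):
    while 'leaf' not in node:
        node = node['left'] if x[node['attr']] < node['cut'] else node['right']
    return node['leaf']

def error_rate(node, X_held, t_held):
    return np.mean([predict(node, x) != y for x, y in zip(X_held, t_held)])

def prune(node, X_held, t_held):
    """Prune a node to a leaf if that does not raise the error rate
    on the withheld data; children are pruned first (bottom-up)."""
    if 'leaf' in node:
        return node
    node['left'] = prune(node['left'], X_held, t_held)
    node['right'] = prune(node['right'], X_held, t_held)
    as_leaf = {'leaf': node['majority']}
    if error_rate(as_leaf, X_held, t_held) <= error_rate(node, X_held, t_held):
        return as_leaf
    return node
```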
8
Advantages
  • A decision tree can be easily parsed by a human
    or computer program, unlike the black box of a
    neural net
  • Can be grown quickly
  • Handles discrete data (e.g. number of jets)

9
Disadvantages
  • Unstable: a small change in the training data can
    lead to large changes in the trees grown
  • For the simplest algorithms, cannot make use of
    correlations (esp. nonlinear) that only occur in
    one of signal or background
  • Does not separate classes along smooth boundaries,
    since each split is a straight cut on one attribute

10
Ensemble Methods
  • Many of these disadvantages can be removed by
    combining multiple trees
  • Boosting
  • Train a series of trees, then take a linear
    combination of their outputs
  • In each subsequent tree, more weight is given to
    hard cases (i.e. the ones misclassified by
    previous trees)
  • Sensitivity to noisy cases may lead to poor
    performance

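The slides do not name a specific boosting algorithm; below is a sketch of one common choice (AdaBoost-style reweighting), with scikit-learn's DecisionTreeClassifier standing in for the individual trees and labels assumed to be in {-1, +1}.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boost(X, t, n_trees=50):
    """Train a series of trees, upweighting the hard (misclassified)
    cases for each subsequent tree; return (weight, tree) pairs."""
    w = np.full(len(t), 1.0 / len(t))           # case weights, start uniform
    ensemble = []
    for _ in range(n_trees):
        tree = DecisionTreeClassifier(max_depth=3)
        tree.fit(X, t, sample_weight=w)
        pred = tree.predict(X)
        err = np.sum(w[pred != t]) / np.sum(w)
        if err >= 0.5:                           # no better than guessing
            break
        alpha = 0.5 * np.log((1.0 - err) / max(err, 1e-12))
        ensemble.append((alpha, tree))
        w *= np.exp(-alpha * t * pred)           # boost weight of hard cases
        w /= np.sum(w)
    return ensemble

def boosted_score(ensemble, X):
    """Linear combination of the individual tree outputs."""
    return sum(alpha * tree.predict(X) for alpha, tree in ensemble)
```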
11
Ensemble Methods (cont.)
  • Bagging
  • Pick many random subsets of the training cases
    (may or may not allow replacement)
  • Train trees using these subsets, then take an
    average of their results
  • It is tempting to use a weighted average based on
    how accurately a tree classifies the training
    data, but this can lead to overtraining
  • Effective for noisy data and for unstable
    classifiers like trees (small changes in the
    training set can lead to large changes in
    predictions)

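A sketch of bagging under the same assumptions (scikit-learn trees); here the random subsets are bootstrap samples, i.e. drawn with replacement, and the combination is a plain unweighted average as the slide recommends.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bag(X, t, n_trees=100, seed=0):
    """Train each tree on a random subset of the training cases
    (drawn with replacement) and collect the trees."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(t), size=len(t))   # bootstrap sample
        tree = DecisionTreeClassifier()
        tree.fit(X[idx], t[idx])
        trees.append(tree)
    return trees

def bagged_predict(trees, X):
    """Plain (unweighted) average of the individual tree outputs."""
    return np.mean([tree.predict(X) for tree in trees], axis=0)
```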
12
Ensemble Methods (cont.)
  • Random Forest
  • Almost always used in conjunction with bagging
  • In each tree, at each node, pick at random only a
    small subset of the attributes to split on OR
    take a random linear combination of the
    attributes
  • Again, a weighted average can lead to overtraining

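As an off-the-shelf example (not something the slides reference), scikit-learn's RandomForestClassifier combines bagging with a random attribute subset at each split; max_features controls the size of that subset.

```python
from sklearn.ensemble import RandomForestClassifier

# Random forest = bagging + a random subset of attributes at each node.
forest = RandomForestClassifier(
    n_estimators=200,      # number of bagged trees
    max_features="sqrt",   # consider only sqrt(k) attributes per split
    bootstrap=True,        # resample the training cases with replacement
)
# forest.fit(X_train, t_train)                  # X_train, t_train: your data
# scores = forest.predict_proba(X_test)[:, 1]   # unweighted ensemble average
```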
13
Testing Effectiveness
  • In the past: train on 60% of the data, test on the
    other 40%
  • Statisticians use 10-fold cross-validation
  • Divide the data into 10 sets S1, S2, …, S10
  • Train on 9 of these sets, test on the remaining one
  • Repeat so that each Si serves once as the test set,
    starting over from scratch each time
  • Combine results to get a measure of performance
  • Provides a better measure of expected accuracy
    (though difference is small for large data sets)

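A sketch of the 10-fold procedure, using a scikit-learn decision tree as the classifier (an assumption; any classifier fits the same loop).

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

def ten_fold_accuracy(X, t):
    """Divide the data into 10 sets, train on 9 and test on the
    remaining one, repeat for every set, and combine the results."""
    scores = []
    for train_idx, test_idx in KFold(n_splits=10, shuffle=True).split(X):
        tree = DecisionTreeClassifier()        # start from scratch each time
        tree.fit(X[train_idx], t[train_idx])
        scores.append(tree.score(X[test_idx], t[test_idx]))
    return np.mean(scores)
```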
14
Appendix: Gini
  • Assign each event a weight; let WS and WB be the
    total signal and background weight in a node
  • Purity: P = WS / (WS + WB)
  • Gini: G = P (1 − P)
  • Smaller Gini is better (0 represents total
    separation)
  • We look at the improvement in Gini from a split:
    the parent's Gini minus the weight-averaged Gini of
    the two child nodes

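A sketch of these quantities in code, under the assumption spelled out above that WS and WB are sums of per-event weights in a node.

```python
import numpy as np

def gini(weights, is_signal):
    """Gini of a node: P(1 - P), with the purity P computed from
    summed signal and background weights."""
    w_s = np.sum(weights[is_signal])
    w_b = np.sum(weights[~is_signal])
    if w_s + w_b == 0:
        return 0.0
    p = w_s / (w_s + w_b)        # purity
    return p * (1.0 - p)         # 0 means total separation

def gini_improvement(weights, is_signal, goes_left):
    """Parent Gini minus the weight-averaged Gini of the two children."""
    frac_left = np.sum(weights[goes_left]) / np.sum(weights)
    children = (frac_left * gini(weights[goes_left], is_signal[goes_left])
                + (1 - frac_left) * gini(weights[~goes_left],
                                         is_signal[~goes_left]))
    return gini(weights, is_signal) - children
```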
15
Appendix: Information Gain
  • Denote the probability of class i in the data set
    by pi
  • Entropy of a set: H = − Σi pi log2(pi)
  • Divide the data into subsets S
  • InfoGain = H(parent) − ΣS (|S| / |parent|) · H(S)
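A matching sketch for entropy and information gain; the worked example splits a mixed set into two pure subsets, which gives the maximum possible gain of 1 bit.

```python
import numpy as np

def entropy(t):
    """Entropy of a set of class labels: -sum_i p_i * log2(p_i)."""
    _, counts = np.unique(t, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def info_gain(t, subsets):
    """Parent entropy minus the size-weighted entropy of the subsets."""
    weighted = sum(len(s) / len(t) * entropy(s) for s in subsets)
    return entropy(t) - weighted

t = np.array([0, 0, 1, 1])
print(info_gain(t, [t[:2], t[2:]]))   # -> 1.0
```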