Machine Learning in Real World: CART

1
Machine Learning in Real World: CART
2
Outline
  • CART Overview and Gymtutor Tutorial Example
  • Splitting Criteria
  • Handling Missing Values
  • Pruning
  • Finding Optimal Tree

3
CART: Classification And Regression Trees
  • Developed 1974-1984 by 4 statistics professors
  • Leo Breiman (Berkeley), Jerry Friedman
    (Stanford), Charles Stone (Berkeley), Richard
    Olshen (Stanford)
  • Focused on accurate assessment when data is noisy
  • Currently distributed by Salford Systems

4
CART Tutorial Data: Gymtutor
  • See CART HELP, Sec. 3 in CARTManual.pdf
  • ANYRAQT: Racquetball usage (binary indicator coded 0, 1)
  • ONAER: Number of on-peak aerobics classes attended
  • NSUPPS: Number of supplements purchased
  • OFFAER: Number of off-peak aerobics classes attended
  • NFAMMEM: Number of family members
  • TANNING: Number of visits to tanning salon
  • ANYPOOL: Pool usage (binary indicator coded 0, 1)
  • SMALLBUS: Small business discount (binary indicator coded 0, 1)
  • FIT: Fitness score
  • HOME: Home ownership (binary indicator coded 0, 1)
  • PERSTRN: Personal trainer (binary indicator coded 0, 1)
  • CLASSES: Number of classes taken
  • SEGMENT: Member's market segment (1, 2, 3) -- the target variable

5
View data
  • CART menu: View - Data Info

6
CART Example Gymtutor
7
CART Model Setup
  • Target -- required
  • Predictors (default: all)
  • Categorical predictors
  • ANYRAQT, ANYPOOL, SMALLBUS, HOME
  • A field is treated as categorical if its name ends in $, or it can be declared categorical based on its values
  • Testing
  • default: 10-fold cross-validation (a rough open-source analogue is sketched below)

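The setup above refers to the commercial Salford Systems CART product. As a rough, non-authoritative analogue, the sketch below fits a CART-style tree in scikit-learn and tests it with 10-fold cross-validation; the file name gymtutor.csv and the assumption that the data have been exported to CSV with the columns listed on slide 4 are hypothetical.

    # Rough scikit-learn analogue of the CART model setup (not the Salford product).
    # Assumes a hypothetical CSV export of the Gymtutor data with the columns from slide 4.
    import pandas as pd
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    df = pd.read_csv("gymtutor.csv")            # hypothetical export of the tutorial data
    X = df.drop(columns=["SEGMENT"])            # all other fields are predictors (default)
    y = df["SEGMENT"]                           # target: member's market segment (1, 2, 3)

    # The binary indicators (ANYRAQT, ANYPOOL, SMALLBUS, HOME, PERSTRN) are already
    # coded 0/1 and the remaining predictors are numeric, so no preprocessing is needed.
    clf = DecisionTreeClassifier(criterion="gini", random_state=0)

    # Testing: CART's default is 10-fold cross-validation.
    scores = cross_val_score(clf, X, y, cv=10)
    print("10-fold CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))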
8
Sample Tree
9
Color-coding using class
10
Decision Tree Splitters
11
Tree Details
12
Tree Summary Reports
13
Pruning the tree
14
Keeping only important variables
15
Revised Tree
16
Automating CART Command Log
17
Key CART features
  • Automated field selection
  • handles any number of fields
  • automatically selects relevant fields
  • No data preprocessing needed
  • Does not require any kind of variable transforms
  • Impervious to outliers
  • Missing value tolerant
  • Moderate loss of accuracy due to missing values

18
CART: Key Parts of Tree-Structured Data Analysis
  • Tree growing
  • Splitting rules to generate the tree
  • Stopping criteria: how far to grow?
  • Missing values: handled using surrogates
  • Tree pruning
  • Trimming off parts of the tree that don't work
  • Ordering the nodes of a large tree by contribution to tree accuracy: which nodes come off first?
  • Optimal tree selection
  • Deciding on the best tree after growing and
    pruning
  • Balancing simplicity against accuracy

19
CART is a form of Binary Recursive Partitioning
  • Data is split into two partitions
  • Q: Does C4.5 always produce binary partitions?
  • Partitions can also be split into sub-partitions
  • hence procedure is recursive
  • CART tree is generated by repeated partitioning
    of data set
  • parent gets two children
  • each child produces two grandchildren
  • four grandchildren produce eight great-grandchildren (see the sketch below)

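The recursion can be made concrete with a short sketch. This is a minimal illustration of binary recursive partitioning, not Salford's implementation: the split here is scored by a simple misclassification count, while CART's actual Gini-based criterion is sketched after the splitting-criteria slide.

    # Minimal sketch of binary recursive partitioning (illustration only).
    def majority(labels):
        return max(set(labels), key=labels.count)

    def best_split(rows, labels):
        """Pick the (feature, threshold) whose two partitions misclassify fewest cases."""
        best = None
        for f in range(len(rows[0])):
            for t in sorted({r[f] for r in rows}):
                left = [y for r, y in zip(rows, labels) if r[f] <= t]
                right = [y for r, y in zip(rows, labels) if r[f] > t]
                if not left or not right:
                    continue
                errors = (len(left) - left.count(majority(left))) + \
                         (len(right) - right.count(majority(right)))
                if best is None or errors < best[0]:
                    best = (errors, f, t)
        return None if best is None else best[1:]

    def grow(rows, labels, depth=0, max_depth=3):
        """Each parent gets two children; each child may produce two grandchildren, etc."""
        split = best_split(rows, labels)
        if depth == max_depth or len(set(labels)) == 1 or split is None:
            return {"leaf": majority(labels)}              # terminal node: majority class
        f, t = split                                       # question: "Is feature f <= t?"
        go_left = [r[f] <= t for r in rows]                # YES goes left, NO goes right
        return {"split": (f, t),
                "left":  grow([r for r, g in zip(rows, go_left) if g],
                              [y for y, g in zip(labels, go_left) if g], depth + 1, max_depth),
                "right": grow([r for r, g in zip(rows, go_left) if not g],
                              [y for y, g in zip(labels, go_left) if not g], depth + 1, max_depth)}

    # Tiny example: one AGE column with two classes.
    print(grow([[40], [43], [50], [62], [65]], ["A", "A", "A", "B", "B"]))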
20
Splits always determined by questions with YES/NO
answers
  • Is continuous variable X <= c?
  • Does categorical variable D take on levels i, j, or k?
  • Is GENDER M or F?
  • Standard split
  • if the answer to the question is YES, a case goes left; otherwise it goes right
  • this is the form of all primary splits
  • example: Is AGE <= 62.5?
  • More complex conditions possible
  • Boolean combinations (e.g., of conditions on AGE and other variables)
  • Linear combinations, e.g., 0.66 AGE - 0.75 BP

21
Searching all Possible Splits
  • For any node CART will examine ALL possible
    splits.
  • CART allows search over a random sample if
    desired
  • Look at the first variable in our data set, AGE, with minimum value 40
  • Test split: Is AGE <= 40?
  • Will separate out the youngest persons to the left
  • Could be many cases if many people have the same AGE
  • Next, increase the AGE threshold to the next youngest person
  • Is AGE <= 43?
  • This will direct additional cases to the left
  • Continue increasing the splitting threshold, value by value
  • each value is tested for how good the split is . . . how effective it is in separating the classes from each other (see the sketch below)
  • Q: Do we need to consider splits between values of the same class?

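The sketch below makes this threshold enumeration concrete for a single variable; the AGE values and class labels are made up, and CART repeats the same search for every variable at every node.

    # Enumerate every candidate threshold for one variable (toy AGE values).
    ages    = [40, 40, 43, 47, 55, 55, 62, 70]
    classes = ["A", "A", "A", "B", "A", "B", "B", "B"]

    # Candidate thresholds are the distinct observed values; the largest value
    # cannot send anything to the right, so it is skipped.
    for t in sorted(set(ages))[:-1]:
        left  = [c for a, c in zip(ages, classes) if a <= t]   # "Is AGE <= t?" -> YES goes left
        right = [c for a, c in zip(ages, classes) if a > t]
        print("Is AGE <= %d?  %d cases left %s, %d cases right %s"
              % (t, len(left), sorted(set(left)), len(right), sorted(set(right))))

    # Each candidate split is then scored by how well it separates the classes,
    # e.g. with the Gini index introduced on the next slide.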
22
Split Tables
Q: Where do splits need to be evaluated?
[Tables: the same data sorted by Blood Pressure and sorted by Age]
23
CART Splitting Criteria Gini Index
  • If a data set T contains examples from n classes, the Gini index gini(T) is defined as
  •     gini(T) = 1 - sum_j (pj)^2
  • where pj is the relative frequency of class j in T
  • gini(T) is minimized (equals 0) when T contains only one class, i.e., when the class distribution is maximally skewed (a worked computation is sketched below)
  • Advanced CART also has other splitting criteria
  • Twoing is recommended for multi-class targets

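A worked computation of the Gini index and of the impurity decrease achieved by a split, using made-up class labels rather than the Gymtutor data:

    from collections import Counter

    def gini(labels):
        """gini(T) = 1 - sum_j pj^2, where pj is the relative frequency of class j in T."""
        n = len(labels)
        return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

    parent = ["A"] * 4 + ["B"] * 4
    left, right = ["A", "A", "A", "B"], ["A", "B", "B", "B"]

    print(gini(parent))        # 0.5  -> evenly mixed two-class node, maximally impure
    print(gini(["A"] * 8))     # 0.0  -> single class (maximally skewed), minimum impurity

    # CART scores a candidate split by the decrease in case-weighted impurity.
    n = len(parent)
    decrease = gini(parent) - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)
    print("impurity decrease:", decrease)   # 0.125 for this split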
24
Handling of Missing Splitter Values in Tree
Growing
  • If the splitter variable is missing, we don't know which way to send the case (left or right in a binary tree)
  • Could delete cases that have missing values
  • method used in classical statistical modeling
  • unacceptable in a data mining context with many missings
  • Could freeze the case in the node where the missing splitter is encountered
  • make do with what the tree has learned so far for this case
  • Could allow cases with a missing split variable to follow the majority
  • assumes all missings are somehow typical
  • Could allow missing to be a separate value of the variable
  • used by the CHAID algorithm; an option in Salford software
  • allows special handling for missing, but all missings are treated as indistinguishable from each other

25
Missing as a distinct splitter value
  • CHAID treats missing as a distinct categorical value
  • e.g., AGE is 25-44, 45-64, 65-95, or missing
  • method also adopted by C4.5
  • If missing is a distinct value, then all cases with missing go the same way in the tree
  • Assumption: whatever the unknown value is, it is the same for all cases with a missing value
  • Problem: there can be more than one reason for a database field to be missing
  • E.g., INCOME as a splitter wants to separate high from low
  • Which levels are most likely to be missing? High income AND low income!
  • Don't want to send both groups to the same side of the tree

26
CART Treatment of Missing Primary Splitters: Surrogates
  • CART uses a more refined method: a surrogate is used as a stand-in for a missing primary field
  • the surrogate should be a valid replacement for the primary
  • Consider our example of INCOME
  • Other variables like education or occupation might work as good surrogates
  • Higher-education people usually have higher incomes
  • People in high-income occupations will usually (though not always) have higher incomes
  • Using surrogates means that cases missing the primary are not all treated the same way
  • whether a case goes left or right depends on its surrogate value
  • thus the handling is record-specific . . . some cases go left, others go right

27
Surrogates: Mimicking Alternatives to Primary Splitters
  • A primary splitter is the best splitter of a node
  • A surrogate is a splitter that splits in a fashion similar to the primary
  • a surrogate is a variable with near-equivalent information
  • Why useful?
  • If the primary is expensive or difficult to gather and the surrogate is not
  • then consider using the surrogate instead
  • the loss in predictive accuracy might be slight
  • If the primary splitter is MISSING, then CART will use a surrogate
  • if the top surrogate is also missing, CART uses the 2nd-best surrogate, etc.
  • If all surrogates are missing as well, CART uses the majority rule (see the routing sketch below)

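A minimal sketch of how one case is routed when the primary splitter is missing. The variables and thresholds (INCOME, EDUCATION_YEARS, OCCUP_INCOME_RANK) are hypothetical, and real CART learns each surrogate's split and direction from the data rather than taking them as given.

    def go_left(case, majority_goes_left=True):
        """Route one case at a node whose primary splitter is 'INCOME <= 40000'."""
        rules = [
            ("INCOME", 40000),            # primary splitter
            ("EDUCATION_YEARS", 12),      # best surrogate: mimics the primary split
            ("OCCUP_INCOME_RANK", 5),     # 2nd-best surrogate, used if the 1st is also missing
        ]
        for variable, threshold in rules:
            value = case.get(variable)
            if value is not None:                 # the first non-missing rule decides
                return value <= threshold
        return majority_goes_left                 # all missing: follow the node majority

    print(go_left({"INCOME": 30000}))                         # primary available  -> True (left)
    print(go_left({"INCOME": None, "EDUCATION_YEARS": 16}))   # surrogate decides  -> False (right)
    print(go_left({}))                                        # everything missing -> majority rule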
28
Competitors vs. Surrogates
Node: Class A = 100, Class B = 100, Class C = 100

                        Left    Right
  Primary Split
    Class A               90      10
    Class B               80      20
    Class C               15      85
  Competitor Split
    Class A               80      20
    Class B               25      75
    Class C               14      86
  Surrogate Split
    Class A               78      22
    Class B               74      26
    Class C               21      79

The surrogate sends each class in the same direction as the primary split; the competitor, although it separates the classes well, sends Class B the opposite way.
29
CART Pruning Method Grow Full Tree, Then Prune
  • You will never know when to stop . . . so don't!
  • Instead . . . grow trees that are obviously too big
  • The largest tree grown is called the maximal tree
  • The maximal tree could have hundreds or thousands of nodes
  • usually we instruct CART to grow only moderately too big
  • rule of thumb: grow trees about twice the size of the truly best tree
  • This becomes the first stage in finding the best tree
  • Next we will have to get rid of the parts of the overgrown tree that don't work (are not supported by test data)

30
Maximal Tree Example
31
Tree Pruning
  • Take a very large tree (maximal tree)
  • Tree may be radically over-fit
  • Tracks all the idiosyncrasies of THIS data set
  • Tracks patterns that may not be found in other
    data sets
  • At the bottom of the tree, splits are based on very few cases
  • Analogous to a regression with very large number
    of variables
  • PRUNE away branches from this large tree
  • But which branch to cut first?
  • CART determines a pruning sequence
  • the exact order in which each node should be
    removed
  • pruning sequence determined for EVERY node
  • sequence determined all the way back to root node

32
Pruning Which nodes come off next?
33
Order of Pruning Weakest Link Goes First
  • Prune away the "weakest link": the nodes that add least to the overall accuracy of the tree (a sketch using cost-complexity pruning follows below)
  • a node's contribution to the overall tree is a function of both the increase in accuracy and the size of the node
  • the accuracy gain is weighted by the node's share of the sample
  • small nodes tend to get removed before large ones
  • If several nodes have the same contribution, they are all pruned away simultaneously
  • hence more than two terminal nodes could be cut off in one pruning step
  • The sequence is determined all the way back to the root node
  • need to allow for the possibility that the entire tree is bad
  • if the target variable is unpredictable, we will want to prune back to the root . . . the no-model solution

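CART's weakest-link ordering corresponds to minimal cost-complexity pruning. As a rough sketch (scikit-learn's CART-style trees rather than the Salford product, on synthetic data), the pruning path below lists the cost-complexity values at which nodes come off and the resulting tree sizes.

    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, n_features=8, random_state=0)
    full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)   # deliberately overgrown

    # Each ccp_alpha is the cost-complexity value at which the current weakest
    # link(s) come off; nodes contributing least to accuracy are pruned first.
    path = full_tree.cost_complexity_pruning_path(X, y)
    for alpha in path.ccp_alphas:
        pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X, y)
        print("alpha=%.5f  terminal nodes=%d" % (alpha, pruned.get_n_leaves()))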
34
Pruning Sequence Example
[Figure: trees from the pruning sequence with 24, 21, 18, and 20 terminal nodes]
35
Now we test every tree in the pruning sequence
  • Take a test data set and drop it down the largest
    tree in the sequence and measure its predictive
    accuracy
  • how many cases right and how many wrong
  • measure accuracy overall and by class
  • Do the same for the 2nd-largest tree, 3rd-largest tree, etc.
  • The performance of every tree in the sequence is measured
  • Results are reported in table and graph formats (an evaluation-loop sketch follows below)
  • Note that this critical stage is impossible to complete without test data
  • The CART procedure requires test data to guide tree evaluation

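A sketch of that evaluation loop: every tree in the pruning sequence is refit and scored on a held-out test set (synthetic data; scikit-learn stands in for CART).

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
    X_learn, X_test, y_learn, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    # The pruning sequence is defined on the learn sample only.
    full_tree = DecisionTreeClassifier(random_state=0).fit(X_learn, y_learn)
    alphas = full_tree.cost_complexity_pruning_path(X_learn, y_learn).ccp_alphas

    # Drop the test set down every tree in the sequence and record its error rate.
    for alpha in alphas:
        t = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_learn, y_learn)
        print("leaves=%3d  learn error=%.3f  test error=%.3f"
              % (t.get_n_leaves(), 1 - t.score(X_learn, y_learn), 1 - t.score(X_test, y_test)))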
36
Training Data Vs. Test Data Error Rates
  • Compare error rates measured on
  • the learn data: R(T)
  • a large test set: Rts(T)
  • Learn-sample R(T) always decreases as the tree grows (Q: Why?)
  • Test-sample Rts(T) first declines, then increases (Q: Why?)
  • Overfitting is the result of too much reliance on the learn-sample R(T)
  • Can lead to disasters when applied to new data

  No. Terminal Nodes    R(T)    Rts(T)
          71             .00     .42
          63             .00     .40
          58             .03     .39
          40             .10     .32
          34             .12     .32
          19             .20     .31
          10             .29     .30
           9             .32     .34
           7             .41     .47
           6             .46     .54
           5             .53     .61
           2             .75     .82
           1             .86     .91
37
Why look at training data error rates (or cost)
at all?
  • First, it provides a rough guide to how you are doing
  • the truth will typically be WORSE than the training-data measure
  • if the tree performs poorly even on the training data, we may not want to pursue it further
  • The training-data error rate is more accurate for smaller trees
  • so it is a reasonable guide for smaller trees
  • but a poor guide for larger trees
  • At the optimal tree, training and test error rates should be similar
  • if not, something is wrong
  • it is useful to compare not just the overall error rate but also within-node performance between training and test data

38
CART Optimal Tree
  • Within a single CART run, which tree is best?
  • The process of pruning the maximal tree can yield many sub-trees
  • A test data set or cross-validation measures the error rate of each tree
  • Current wisdom: select the tree with the smallest error rate
  • Only drawback: the minimum may not be precisely estimated
  • The typical error rate as a function of tree size has a flat region
  • the minimum could be anywhere in this region

39
One SE Rule -- One Standard Error Rule
  • The original monograph recommends NOT choosing the minimum-error tree, because of possible instability of results from run to run
  • Instead, it suggests the SMALLEST TREE within 1 SE of the minimum-error tree
  • This tends to provide very stable results from run to run
  • It is possibly as accurate as the minimum-cost tree, yet simpler
  • Current learning: the one-SE rule is good for small data sets
  • For large data sets one should pick the most accurate tree
  • known as the zero-SE rule (a selection sketch follows below)

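A sketch of both selection rules applied to a cross-validated pruning sequence (synthetic data; scikit-learn stands in for CART): the zero-SE rule picks the minimum-error tree, while the one-SE rule picks the smallest tree (largest pruning alpha) whose error is within one standard error of that minimum.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=800, n_features=10, random_state=0)
    alphas = (DecisionTreeClassifier(random_state=0).fit(X, y)
              .cost_complexity_pruning_path(X, y).ccp_alphas)

    # Cross-validated error and its standard error for every tree in the sequence.
    stats = []
    for alpha in alphas:
        scores = cross_val_score(DecisionTreeClassifier(random_state=0, ccp_alpha=alpha),
                                 X, y, cv=10)
        errors = 1 - scores
        stats.append((alpha, errors.mean(), errors.std() / np.sqrt(len(errors))))

    best_alpha, best_err, best_se = min(stats, key=lambda s: s[1])               # zero-SE rule
    one_se_alpha = max(a for a, err, _ in stats if err <= best_err + best_se)    # one-SE rule
    print("zero-SE choice: alpha=%.5f   one-SE choice: alpha=%.5f" % (best_alpha, one_se_alpha))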
40
In what sense is the optimal tree best?
  • The optimal tree has the lowest, or near-lowest, cost as determined by a test procedure
  • It should exhibit very similar accuracy when applied to new data
  • BUT the best tree is NOT necessarily the one that happens to be most accurate on a single test database
  • trees somewhat larger or smaller than the optimal one may be preferred
  • There is room for user judgment
  • the judgment is not about split variables or values
  • it is a judgment as to how much of the tree to keep
  • determined by the story the tree is telling
  • and by the willingness to sacrifice a small amount of accuracy for simplicity

41
CART Summary
  • CART Key Features
  • binary splits
  • Gini index as the splitting criterion
  • grow, then prune
  • surrogates for missing values
  • optimal tree via the 1-SE rule
  • lots of nice graphics

42
Decision Tree Summary
  • Decision Trees
  • splits: binary, multi-way
  • split criteria: entropy, Gini, ...
  • missing value treatment
  • pruning
  • rule extraction from trees
  • Both C4.5 and CART are robust tools
  • No method is always superior; experiment!
