Title: Machine Learning in Real World: CART
1. Machine Learning in Real World: CART
2. Outline
- CART Overview and Gymtutor Tutorial Example
- Splitting Criteria
- Handling Missing Values
- Pruning
- Finding Optimal Tree
3. CART: Classification And Regression Trees
- Developed 1974-1984 by four statistics professors:
  Leo Breiman (Berkeley), Jerry Friedman (Stanford),
  Charles Stone (Berkeley), Richard Olshen (Stanford)
- Focused on accurate assessment when data is noisy
- Currently distributed by Salford Systems
4. CART Tutorial Data: Gymtutor
- CART HELP, Sec. 3 in CARTManual.pdf
- ANYRAQT: Racquetball usage (binary indicator coded 0, 1)
- ONAER: Number of on-peak aerobics classes attended
- NSUPPS: Number of supplements purchased
- OFFAER: Number of off-peak aerobics classes attended
- NFAMMEM: Number of family members
- TANNING: Number of visits to tanning salon
- ANYPOOL: Pool usage (binary indicator coded 0, 1)
- SMALLBUS: Small business discount (binary indicator coded 0, 1)
- FIT: Fitness score
- HOME: Home ownership (binary indicator coded 0, 1)
- PERSTRN: Personal trainer (binary indicator coded 0, 1)
- CLASSES: Number of classes taken
- SEGMENT: Member's market segment (1, 2, 3) -- the target
5. View Data
- CART menu: View - Data Info
6. CART Example: Gymtutor
7. CART Model Setup
- Target -- required
- Predictors (default: all)
- Categorical
  - ANYRAQT, ANYPOOL, SMALLBUS, HOME
  - a field is treated as categorical if its name ends in $, or based on its values
- Testing
  - default: 10-fold cross-validation (see the sketch below)
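For readers working outside the Salford GUI, here is a minimal scikit-learn sketch of the same setup. It assumes the Gymtutor table has been exported to a CSV file (the file name gymtutor.csv is hypothetical; column names are from slide 4, and scikit-learn's DecisionTreeClassifier is a CART-style learner, not Salford CART itself):

```python
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical export of the Gymtutor table; column names from slide 4.
df = pd.read_csv("gymtutor.csv")
X = df.drop(columns=["SEGMENT"])   # predictors: default is all non-target fields
y = df["SEGMENT"]                  # target: member's market segment (1, 2, 3)

# Testing: 10-fold cross-validation, matching the CART default above.
clf = DecisionTreeClassifier(criterion="gini", random_state=0)
scores = cross_val_score(clf, X, y, cv=10)
print(f"10-fold CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Note that the 0/1 indicators are already numeric; unlike Salford CART, scikit-learn has no separate categorical declaration and treats all inputs as numeric.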
8. Sample Tree
9. Color-coding using class
10. Decision Tree Splitters
11. Tree Details
12. Tree Summary Reports
13. Pruning the tree
14. Keeping only important variables
15. Revised Tree
16. Automating CART: Command Log
17. Key CART features
- Automated field selection
  - handles any number of fields
  - automatically selects relevant fields
- No data preprocessing needed
  - does not require any kind of variable transforms
  - impervious to outliers
- Missing value tolerant
  - moderate loss of accuracy due to missing values
18. CART: Key Parts of Tree-Structured Data Analysis
- Tree growing
  - splitting rules to generate the tree
  - stopping criteria: how far to grow?
  - missing values: using surrogates
- Tree pruning
  - trimming off parts of the tree that don't work
  - ordering the nodes of a large tree by contribution to tree accuracy: which nodes come off first?
- Optimal tree selection
  - deciding on the best tree after growing and pruning
  - balancing simplicity against accuracy
19. CART is a form of Binary Recursive Partitioning
- Data is split into two partitions
  - Q: Does C4.5 always have binary partitions?
- Partitions can also be split into sub-partitions
  - hence the procedure is recursive
- A CART tree is generated by repeated partitioning of the data set (see the sketch below)
  - a parent gets two children
  - each child produces two grandchildren
  - four grandchildren produce eight great-grandchildren
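The recursion is easy to see in code. A minimal sketch of binary recursive partitioning (not Salford CART itself; best_split and majority_class are assumed helpers, and the stopping rule is simplified):

```python
def grow(rows, depth=0, max_depth=3, min_size=5):
    """Binary recursive partitioning: split into two, recurse on each child."""
    split = best_split(rows)  # assumed helper: returns (feature, threshold) or None
    if split is None or depth >= max_depth or len(rows) < min_size:
        return {"leaf": majority_class(rows)}  # assumed helper: most common class
    feature, threshold = split
    left  = [r for r in rows if r[feature] <= threshold]  # YES goes left
    right = [r for r in rows if r[feature] >  threshold]  # NO goes right
    return {"question": (feature, threshold),
            "left":  grow(left,  depth + 1, max_depth, min_size),   # first child
            "right": grow(right, depth + 1, max_depth, min_size)}   # second child
```

Each call produces two children, each child call produces two grandchildren, and so on, exactly as in the list above.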
20. Splits always determined by questions with YES/NO answers
- Is continuous variable X ≤ c?
- Does categorical variable D take on levels i, j, or k?
- Is GENDER M or F?
- Standard split
  - if the answer to the question is YES, a case goes left; otherwise it goes right
  - this is the form of all primary splits
  - example: Is AGE ≤ 62.5?
- More complex conditions are possible
  - Boolean combinations of splits
  - linear combinations, e.g., 0.66*AGE - 0.75*BP
21. Searching all Possible Splits
- For any node, CART will examine ALL possible splits
  - CART allows searching over a random sample if desired
- Look at the first variable in our data set, AGE, with minimum value 40
  - Test split: Is AGE ≤ 40?
  - This will separate out the youngest persons to the left
  - Could be many cases if many people have the same AGE
- Next, increase the AGE threshold to the next youngest person
  - Is AGE ≤ 43?
  - This will direct additional cases to the left
- Continue increasing the splitting threshold value by value
  - each value is tested for how good the split is . . . how effective it is in separating the classes from each other (see the sketch below)
- Q: Should we consider splits between values of the same class?
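The threshold-raising search above is straightforward to sketch in code (a simplified illustration, not Salford's implementation; impurity is any node-impurity function, such as the Gini index defined on the next slide):

```python
import numpy as np

def best_split_for_feature(x, y, impurity):
    """Try 'Is x <= t?' for every distinct observed value t of one feature,
    raising the threshold value by value, and keep the best-scoring split."""
    best_t, best_score = None, np.inf
    n = len(y)
    # The largest value would send every case left, so it is skipped.
    for t in np.sort(np.unique(x))[:-1]:
        left, right = y[x <= t], y[x > t]
        # Weighted average impurity of the two children; lower is better.
        score = (len(left) * impurity(left) + len(right) * impurity(right)) / n
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score
```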
22. Split Tables
- Q: Where do splits need to be evaluated?
- (Figure: split tables for the same records, sorted by Blood Pressure and sorted by Age)
23. CART Splitting Criteria: Gini Index
- If a data set T contains examples from n classes, the Gini index gini(T) is defined as
    gini(T) = 1 - sum_j (p_j)^2
  where p_j is the relative frequency of class j in T (see the sketch below)
- gini(T) is minimized if the classes in T are skewed; a pure node has gini(T) = 0
- Advanced CART also has other splitting criteria
  - Twoing is recommended for multi-class problems
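A direct translation of the definition above into a minimal sketch:

```python
import numpy as np

def gini(labels):
    """gini(T) = 1 - sum_j p_j**2, where p_j is the relative
    frequency of class j among the labels in node T."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini(np.array(["A"] * 10)))             # 0.0: fully skewed (pure) node
print(gini(np.array(["A", "A", "B", "B"])))   # 0.5: evenly mixed, two classes
```

This gini function can be passed as the impurity argument to the split-search sketch on the previous slide.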
24. Handling of Missing Splitter Values in Tree Growing
- If the splitter variable is missing, we don't know which way to send the case (Left or Right in a binary tree)
- Could delete cases that have missing values
  - method used in classical statistical modeling
  - unacceptable in a data mining context with many missings
- Freeze the case in the node where the missing splitter is encountered
  - make do with what the tree has learned so far for this case
- Allow cases with a missing split variable to follow the majority
  - assumes all missings are somehow typical
- Allow missing to be a separate value of the variable
  - used by the CHAID algorithm; an option in Salford software
  - allows special handling for missing, but all missings are treated as indistinguishable from each other
25. Missing as a distinct splitter value
- CHAID treats missing as a distinct categorical value
  - e.g., AGE is 25-44, 45-64, 65-95, or missing
  - method also adopted by C4.5
- If missing is a distinct value, then all cases with missing go the same way in the tree
  - assumption: whatever the unknown value is, it is the same for all cases with a missing value
- Problem: there can be more than one reason for a database field to be missing
  - e.g., Income as a splitter wants to separate high from low
  - Which levels are most likely to be missing? High income AND low income!
  - We don't want to send both groups to the same side of the tree (see the sketch below)
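A short pandas illustration of the CHAID-style recoding (the column values and the "missing" sentinel label are hypothetical):

```python
import pandas as pd

# Recode missing AGE bands as one more category, so every case with a
# missing value follows the same branch in the tree.
age_band = pd.Series(["25-44", None, "65-95", "45-64", None])
age_band = age_band.fillna("missing").astype("category")
print(age_band.value_counts())
# Caveat from this slide: when a field is missing for more than one reason
# (e.g., both high and low incomes), this recoding still forces all
# missings down a single branch.
```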
26. CART Treatment of Missing Primary Splitters: Surrogates
- CART uses a more refined method: a surrogate is used as a stand-in for a missing primary field
  - the surrogate should be a valid replacement for the primary
- Consider our example of INCOME
  - other variables like Education or Occupation might work as good surrogates
  - higher-education people usually have higher incomes
  - people in high-income occupations will usually (though not always) have higher incomes
- Using a surrogate means that cases missing on the primary are not all treated the same way
  - whether a case goes left or right depends on its surrogate value
  - thus record-specific . . . some cases go left, others go right
27. Surrogates: Mimicking Alternatives to Primary Splitters
- A primary splitter is the best splitter of a node
- A surrogate is a splitter that splits in a fashion similar to the primary
  - a surrogate variable carries near-equivalent information
- Why useful?
  - if the primary is expensive or difficult to gather and the surrogate is not, then consider using the surrogate instead
  - the loss in predictive accuracy might be slight
- If the primary splitter is MISSING, then CART will use a surrogate (see the sketch below)
  - if the top surrogate is missing, CART uses the 2nd-best surrogate, etc.
  - if all surrogates are missing as well, CART uses the majority rule
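A simplified sketch of how surrogates can be ranked (an illustration, not Salford's exact algorithm): each candidate split is scored by how often it sends cases to the same side as the primary split.

```python
import numpy as np

def rank_surrogates(X, primary_col, primary_thr, candidates):
    """Rank candidate (column, threshold) splits by agreement with the
    primary split, i.e., the fraction of cases sent to the same side.
    candidates: iterable of (column_index, threshold) pairs."""
    primary_left = X[:, primary_col] <= primary_thr
    scored = []
    for col, thr in candidates:
        agreement = np.mean((X[:, col] <= thr) == primary_left)
        scored.append((agreement, col, thr))
    # Best mimic first: used when the primary value is missing; if it is
    # missing too, fall through to the next one, then to the majority rule.
    return sorted(scored, reverse=True)
```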
28. Competitors vs. Surrogates
- Parent node: Class A = 100, Class B = 100, Class C = 100

  Split         A (Left/Right)   B (Left/Right)   C (Left/Right)
  Primary           90 / 10          80 / 20          15 / 85
  Competitor        80 / 20          25 / 75          14 / 86
  Surrogate         78 / 22          74 / 26          21 / 79

- The surrogate mimics the primary's case-by-case assignments (A and B mostly left, C mostly right), while the competitor is merely another strong splitter in its own right -- here it sends most of Class B the other way.
29. CART Pruning Method: Grow Full Tree, Then Prune
- You will never know when to stop . . . so don't!
- Instead . . . grow trees that are obviously too big
  - the largest tree grown is called the maximal tree
  - the maximal tree could have hundreds or thousands of nodes
  - usually we instruct CART to grow only moderately too big
  - rule of thumb: grow trees about twice the size of the truly best tree
- This becomes the first stage in finding the best tree
- Next we will have to get rid of the parts of the overgrown tree that don't work (are not supported by test data)
30. Maximal Tree Example
31. Tree Pruning
- Take a very large tree (the maximal tree)
  - the tree may be radically over-fit
  - it tracks all the idiosyncrasies of THIS data set
  - it tracks patterns that may not be found in other data sets
  - at the bottom of the tree, splits are based on very few cases
  - analogous to a regression with a very large number of variables
- PRUNE away branches from this large tree
  - but which branch to cut first?
- CART determines a pruning sequence
  - the exact order in which each node should be removed
  - the pruning sequence is determined for EVERY node
  - the sequence is determined all the way back to the root node
32. Pruning: Which nodes come off next?
33. Order of Pruning: Weakest Link Goes First
- Prune away the "weakest link" -- the nodes that add least to the overall accuracy of the tree
  - contribution to the overall tree is a function of both the increase in accuracy and the size of the node
  - accuracy gain is weighted by share of sample
  - small nodes tend to get removed before large ones
- If several nodes have the same contribution, they all prune away simultaneously
  - hence more than two terminal nodes could be cut off in one pruning step
- The sequence is determined all the way back to the root node (see the sketch below)
  - need to allow for the possibility that the entire tree is bad
  - if the target variable is unpredictable, we will want to prune back to the root . . . the no-model solution
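scikit-learn exposes the same weakest-link idea as minimal cost-complexity pruning. This sketch (using a bundled demo data set rather than Gymtutor) computes the full pruning sequence:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Each ccp_alpha in the path marks the point at which the next weakest
# link(s) are pruned away; the last alpha prunes all the way to the root.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
print(path.ccp_alphas)   # increasing alphas: the pruning sequence
print(path.impurities)   # total leaf impurity of each tree in the sequence
```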
34. Pruning Sequence Example
- (Figure: successive trees from the pruning sequence, with 24, 21, 20, and 18 terminal nodes)
35. Now we test every tree in the pruning sequence
- Take a test data set, drop it down the largest tree in the sequence, and measure its predictive accuracy
  - how many cases right and how many wrong
  - measure accuracy overall and by class
- Do the same for the 2nd-largest tree, the 3rd-largest tree, etc.
- Performance of every tree in the sequence is measured (see the sketch below)
- Results are reported in table and graph formats
- Note that this critical stage is impossible to complete without test data
  - the CART procedure requires test data to guide tree evaluation
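Continuing the scikit-learn sketch from slide 33, every tree in the pruning sequence can be scored on held-out test data:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# One fitted tree per pruning step, from largest down to the root-only tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)
for alpha in path.ccp_alphas:
    t = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_tr, y_tr)
    print(f"alpha={alpha:.4f}  leaves={t.get_n_leaves():3d}  "
          f"train acc={t.score(X_tr, y_tr):.3f}  test acc={t.score(X_te, y_te):.3f}")
```

Training accuracy rises monotonically with tree size, while test accuracy peaks at an intermediate size -- the pattern the next slide tabulates.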
36. Training Data vs. Test Data Error Rates
- Compare error rates measured on
  - the learn data: R(T)
  - a large test set: Rts(T)
- Learn R(T) always decreases as the tree grows (Q: Why?)
- Test Rts(T) first declines, then increases (Q: Why?)
- Overfitting is the result of too much reliance on the learn R(T)
  - can lead to disasters when applied to new data

  No. Terminal Nodes   R(T)   Rts(T)
          71           .00     .42
          63           .00     .40
          58           .03     .39
          40           .10     .32
          34           .12     .32
          19           .20     .31
          10           .29     .30
           9           .32     .34
           7           .41     .47
           6           .46     .54
           5           .53     .61
           2           .75     .82
           1           .86     .91
37. Why look at training data error rates (or cost) at all?
- First, they provide a rough guide of how you are doing
  - the truth will typically be WORSE than the training data measure
  - if the tree is performing poorly on training data, you may not want to pursue it further
- The training data error rate is more accurate for smaller trees
  - so it is a reasonable guide for smaller trees
  - but a poor guide for larger trees
- At the optimal tree, training and test error rates should be similar
  - if not, something is wrong
  - useful to compare not just the overall error rate but also within-node performance between training and test data
38. CART Optimal Tree
- Within a single CART run, which tree is best?
- The process of pruning the maximal tree can yield many sub-trees
- A test data set (or cross-validation) measures the error rate of each tree
- Current wisdom: select the tree with the smallest error rate
- Only drawback: the minimum may not be precisely estimated
  - the typical error rate as a function of tree size has a flat region
  - the minimum could be anywhere in this region
39. One SE Rule -- One Standard Error Rule
- The original monograph recommends NOT choosing the minimum-error tree, because of possible instability of results from run to run
- Instead it suggests the SMALLEST TREE within 1 SE of the minimum-error tree
  - tends to provide very stable results from run to run
  - is possibly as accurate as the minimum-cost tree, yet simpler
- Current learning: the one SE rule is good for small data sets
- For large data sets one should pick the most accurate tree
  - known as the zero SE rule (see the sketch below)
40. In what sense is the optimal tree best?
- The optimal tree has the lowest or near-lowest cost as determined by a test procedure
- The tree should exhibit very similar accuracy when applied to new data
- BUT the best tree is NOT necessarily the one that happens to be most accurate on a single test database
  - trees somewhat larger or smaller than optimal may be preferred
- Room for user judgment
  - judgment not about split variables or values
  - judgment as to how much of the tree to keep
  - determined by the story the tree is telling
  - willingness to sacrifice a small amount of accuracy for simplicity
41. CART Summary
- CART key features
  - binary splits
  - Gini index as the splitting criterion
  - grow, then prune
  - surrogates for missing values
  - optimal tree: 1 SE rule
  - lots of nice graphics
42. Decision Tree Summary
- Decision Trees
  - splits: binary, multi-way
  - split criteria: entropy, Gini, . . .
  - missing value treatment
  - pruning
  - rule extraction from trees
- Both C4.5 and CART are robust tools
- No method is always superior -- experiment!
(Based on material by Witten & Eibe)