Title: Machine Learning in Real World: CART
1. Machine Learning in Real World: CART
2. Outline
- CART Overview and Gymtutor Tutorial Example
- Splitting Criteria
- Handling Missing Values
- Pruning
- Finding Optimal Tree
3. CART: Classification And Regression Trees
- Developed 1974-1984 by four statistics professors:
  Leo Breiman (Berkeley), Jerry Friedman (Stanford),
  Charles Stone (Berkeley), Richard Olshen (Stanford)
- Focused on accurate assessment when data is noisy
- Currently distributed by Salford Systems
4. CART Tutorial Data: Gymtutor
- CART HELP, Sec. 3 in CARTManual.pdf
- ANYRAQT: Racquetball usage (binary indicator coded 0, 1)
- ONAER: Number of on-peak aerobics classes attended
- NSUPPS: Number of supplements purchased
- OFFAER: Number of off-peak aerobics classes attended
- NFAMMEM: Number of family members
- TANNING: Number of visits to tanning salon
- ANYPOOL: Pool usage (binary indicator coded 0, 1)
- SMALLBUS: Small business discount (binary indicator coded 0, 1)
- FIT: Fitness score
- HOME: Home ownership (binary indicator coded 0, 1)
- PERSTRN: Personal trainer (binary indicator coded 0, 1)
- CLASSES: Number of classes taken
- SEGMENT: Member's market segment (1, 2, 3) -- the target
5. View Data
- CART menu: View - Data Info
6. CART Example: Gymtutor
7. CART Model Setup
- Target -- required
- Predictors (default: all)
- Categorical
  - ANYRAQT, ANYPOOL, SMALLBUS, HOME
  - a field is treated as categorical if its name ends in $, or based on its values
- Testing
  - default: 10-fold cross-validation (see the sketch below)
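For readers working outside the Salford GUI, here is a minimal scikit-learn sketch of the same setup. It assumes the Gymtutor table has been exported to a CSV file (the file name gymtutor.csv is hypothetical; column names are from slide 4, and scikit-learn's DecisionTreeClassifier is a CART-style learner, not Salford CART itself):

```python
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical export of the Gymtutor table; column names from slide 4.
df = pd.read_csv("gymtutor.csv")
X = df.drop(columns=["SEGMENT"])   # predictors: default is all non-target fields
y = df["SEGMENT"]                  # target: member's market segment (1, 2, 3)

# Testing: 10-fold cross-validation, matching the CART default above.
clf = DecisionTreeClassifier(criterion="gini", random_state=0)
scores = cross_val_score(clf, X, y, cv=10)
print(f"10-fold CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Note that the 0/1 indicators are already numeric; unlike Salford CART, scikit-learn has no separate categorical declaration and treats all inputs as numeric.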
8. Sample Tree
9. Color-coding using class
10. Decision Tree Splitters
11. Tree Details
12. Tree Summary Reports
13. Pruning the tree
14. Keeping only important variables
15. Revised Tree
16. Automating CART: Command Log
17. Key CART features
- Automated field selection
  - handles any number of fields
  - automatically selects relevant fields
- No data preprocessing needed
  - does not require any kind of variable transforms
  - impervious to outliers
- Missing value tolerant
  - moderate loss of accuracy due to missing values
18. CART: Key Parts of Tree-Structured Data Analysis
- Tree growing
  - splitting rules to generate the tree
  - stopping criteria: how far to grow?
  - missing values: using surrogates
- Tree pruning
  - trimming off parts of the tree that don't work
  - ordering the nodes of a large tree by contribution to tree accuracy: which nodes come off first?
- Optimal tree selection
  - deciding on the best tree after growing and pruning
  - balancing simplicity against accuracy
19. CART is a form of Binary Recursive Partitioning
- Data is split into two partitions
  - Q: Does C4.5 always have binary partitions?
- Partitions can also be split into sub-partitions
  - hence the procedure is recursive
- A CART tree is generated by repeated partitioning of the data set (see the sketch below)
  - a parent gets two children
  - each child produces two grandchildren
  - four grandchildren produce eight great-grandchildren
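The recursion is easy to see in code. A minimal sketch of binary recursive partitioning (not Salford CART itself; best_split and majority_class are assumed helpers, and the stopping rule is simplified):

```python
def grow(rows, depth=0, max_depth=3, min_size=5):
    """Binary recursive partitioning: split into two, recurse on each child."""
    split = best_split(rows)  # assumed helper: returns (feature, threshold) or None
    if split is None or depth >= max_depth or len(rows) < min_size:
        return {"leaf": majority_class(rows)}  # assumed helper: most common class
    feature, threshold = split
    left  = [r for r in rows if r[feature] <= threshold]  # YES goes left
    right = [r for r in rows if r[feature] >  threshold]  # NO goes right
    return {"question": (feature, threshold),
            "left":  grow(left,  depth + 1, max_depth, min_size),   # first child
            "right": grow(right, depth + 1, max_depth, min_size)}   # second child
```

Each call produces two children, each child call produces two grandchildren, and so on, exactly as in the list above.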
20. Splits always determined by questions with YES/NO answers
- Is continuous variable X ≤ c?
- Does categorical variable D take on levels i, j, or k?
- Is GENDER M or F?
- Standard split
  - if the answer to the question is YES, a case goes left; otherwise it goes right
  - this is the form of all primary splits
  - example: Is AGE ≤ 62.5?
- More complex conditions are possible
  - Boolean combinations of splits
  - linear combinations, e.g., 0.66*AGE - 0.75*BP
21. Searching all Possible Splits
- For any node, CART will examine ALL possible splits
  - CART allows searching over a random sample if desired
- Look at the first variable in our data set, AGE, with minimum value 40
  - Test split: Is AGE ≤ 40?
  - This will separate out the youngest persons to the left
  - Could be many cases if many people have the same AGE
- Next, increase the AGE threshold to the next youngest person
  - Is AGE ≤ 43?
  - This will direct additional cases to the left
- Continue increasing the splitting threshold value by value
  - each value is tested for how good the split is . . . how effective it is in separating the classes from each other (see the sketch below)
- Q: Should we consider splits between values of the same class?
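The threshold-raising search above is straightforward to sketch in code (a simplified illustration, not Salford's implementation; impurity is any node-impurity function, such as the Gini index defined on the next slide):

```python
import numpy as np

def best_split_for_feature(x, y, impurity):
    """Try 'Is x <= t?' for every distinct observed value t of one feature,
    raising the threshold value by value, and keep the best-scoring split."""
    best_t, best_score = None, np.inf
    n = len(y)
    # The largest value would send every case left, so it is skipped.
    for t in np.sort(np.unique(x))[:-1]:
        left, right = y[x <= t], y[x > t]
        # Weighted average impurity of the two children; lower is better.
        score = (len(left) * impurity(left) + len(right) * impurity(right)) / n
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score
```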
22. Split Tables
- Q: Where do splits need to be evaluated?
- (Figure: split tables for the same records, sorted by Blood Pressure and sorted by Age)
23. CART Splitting Criteria: Gini Index
- If a data set T contains examples from n classes, the Gini index gini(T) is defined as
    gini(T) = 1 - sum_j (p_j)^2
  where p_j is the relative frequency of class j in T (see the sketch below)
- gini(T) is minimized if the classes in T are skewed; a pure node has gini(T) = 0
- Advanced CART also has other splitting criteria
  - Twoing is recommended for multi-class problems
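A direct translation of the definition above into a minimal sketch:

```python
import numpy as np

def gini(labels):
    """gini(T) = 1 - sum_j p_j**2, where p_j is the relative
    frequency of class j among the labels in node T."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini(np.array(["A"] * 10)))             # 0.0: fully skewed (pure) node
print(gini(np.array(["A", "A", "B", "B"])))   # 0.5: evenly mixed, two classes
```

This gini function can be passed as the impurity argument to the split-search sketch on the previous slide.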
24. Handling of Missing Splitter Values in Tree Growing
- If the splitter variable is missing, we don't know which way to send the case (Left or Right in a binary tree)
- Could delete cases that have missing values
  - method used in classical statistical modeling
  - unacceptable in a data mining context with many missings
- Freeze the case in the node where the missing splitter is encountered
  - make do with what the tree has learned so far for this case
- Allow cases with a missing split variable to follow the majority
  - assumes all missings are somehow typical
- Allow missing to be a separate value of the variable
  - used by the CHAID algorithm; an option in Salford software
  - allows special handling for missing, but all missings are treated as indistinguishable from each other
25. Missing as a distinct splitter value
- CHAID treats missing as a distinct categorical value
  - e.g., AGE is 25-44, 45-64, 65-95, or missing
  - method also adopted by C4.5
- If missing is a distinct value, then all cases with missing go the same way in the tree
  - assumption: whatever the unknown value is, it is the same for all cases with a missing value
- Problem: there can be more than one reason for a database field to be missing
  - e.g., Income as a splitter wants to separate high from low
  - Which levels are most likely to be missing? High income AND low income!
  - We don't want to send both groups to the same side of the tree (see the sketch below)
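A short pandas illustration of the CHAID-style recoding (the column values and the "missing" sentinel label are hypothetical):

```python
import pandas as pd

# Recode missing AGE bands as one more category, so every case with a
# missing value follows the same branch in the tree.
age_band = pd.Series(["25-44", None, "65-95", "45-64", None])
age_band = age_band.fillna("missing").astype("category")
print(age_band.value_counts())
# Caveat from this slide: when a field is missing for more than one reason
# (e.g., both high and low incomes), this recoding still forces all
# missings down a single branch.
```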
26. CART Treatment of Missing Primary Splitters: Surrogates
- CART uses a more refined method: a surrogate is used as a stand-in for a missing primary field
  - the surrogate should be a valid replacement for the primary
- Consider our example of INCOME
  - other variables like Education or Occupation might work as good surrogates
  - higher-education people usually have higher incomes
  - people in high-income occupations will usually (though not always) have higher incomes
- Using a surrogate means that cases missing on the primary are not all treated the same way
  - whether a case goes left or right depends on its surrogate value
  - thus record-specific . . . some cases go left, others go right
27. Surrogates: Mimicking Alternatives to Primary Splitters
- A primary splitter is the best splitter of a node
- A surrogate is a splitter that splits in a fashion similar to the primary
  - a surrogate variable carries near-equivalent information
- Why useful?
  - if the primary is expensive or difficult to gather and the surrogate is not, then consider using the surrogate instead
  - the loss in predictive accuracy might be slight
- If the primary splitter is MISSING, then CART will use a surrogate (see the sketch below)
  - if the top surrogate is missing, CART uses the 2nd-best surrogate, etc.
  - if all surrogates are missing as well, CART uses the majority rule
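A simplified sketch of how surrogates can be ranked (an illustration, not Salford's exact algorithm): each candidate split is scored by how often it sends cases to the same side as the primary split.

```python
import numpy as np

def rank_surrogates(X, primary_col, primary_thr, candidates):
    """Rank candidate (column, threshold) splits by agreement with the
    primary split, i.e., the fraction of cases sent to the same side.
    candidates: iterable of (column_index, threshold) pairs."""
    primary_left = X[:, primary_col] <= primary_thr
    scored = []
    for col, thr in candidates:
        agreement = np.mean((X[:, col] <= thr) == primary_left)
        scored.append((agreement, col, thr))
    # Best mimic first: used when the primary value is missing; if it is
    # missing too, fall through to the next one, then to the majority rule.
    return sorted(scored, reverse=True)
```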
28. Competitors vs. Surrogates
- Parent node: Class A = 100, Class B = 100, Class C = 100

  Split         A (Left/Right)   B (Left/Right)   C (Left/Right)
  Primary           90 / 10          80 / 20          15 / 85
  Competitor        80 / 20          25 / 75          14 / 86
  Surrogate         78 / 22          74 / 26          21 / 79

- The surrogate mimics the primary's case-by-case assignments (A and B mostly left, C mostly right), while the competitor is merely another strong splitter in its own right -- here it sends most of Class B the other way.
29. CART Pruning Method: Grow Full Tree, Then Prune
- You will never know when to stop . . . so don't!
- Instead . . . grow trees that are obviously too big
  - the largest tree grown is called the maximal tree
  - the maximal tree could have hundreds or thousands of nodes
  - usually we instruct CART to grow only moderately too big
  - rule of thumb: grow trees about twice the size of the truly best tree
- This becomes the first stage in finding the best tree
- Next we will have to get rid of the parts of the overgrown tree that don't work (are not supported by test data)
30. Maximal Tree Example
31. Tree Pruning
- Take a very large tree (the maximal tree)
  - the tree may be radically over-fit
  - it tracks all the idiosyncrasies of THIS data set
  - it tracks patterns that may not be found in other data sets
  - at the bottom of the tree, splits are based on very few cases
  - analogous to a regression with a very large number of variables
- PRUNE away branches from this large tree
  - but which branch to cut first?
- CART determines a pruning sequence
  - the exact order in which each node should be removed
  - the pruning sequence is determined for EVERY node
  - the sequence is determined all the way back to the root node
32. Pruning: Which nodes come off next?
33. Order of Pruning: Weakest Link Goes First
- Prune away the "weakest link" -- the nodes that add least to the overall accuracy of the tree
  - contribution to the overall tree is a function of both the increase in accuracy and the size of the node
  - accuracy gain is weighted by share of sample
  - small nodes tend to get removed before large ones
- If several nodes have the same contribution, they all prune away simultaneously
  - hence more than two terminal nodes could be cut off in one pruning step
- The sequence is determined all the way back to the root node (see the sketch below)
  - need to allow for the possibility that the entire tree is bad
  - if the target variable is unpredictable, we will want to prune back to the root . . . the no-model solution
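scikit-learn exposes the same weakest-link idea as minimal cost-complexity pruning. This sketch (using a bundled demo data set rather than Gymtutor) computes the full pruning sequence:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Each ccp_alpha in the path marks the point at which the next weakest
# link(s) are pruned away; the last alpha prunes all the way to the root.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
print(path.ccp_alphas)   # increasing alphas: the pruning sequence
print(path.impurities)   # total leaf impurity of each tree in the sequence
```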
34. Pruning Sequence Example
- (Figure: successive trees from the pruning sequence, with 24, 21, 20, and 18 terminal nodes)
35. Now we test every tree in the pruning sequence
- Take a test data set, drop it down the largest tree in the sequence, and measure its predictive accuracy
  - how many cases right and how many wrong
  - measure accuracy overall and by class
- Do the same for the 2nd-largest tree, the 3rd-largest tree, etc.
- Performance of every tree in the sequence is measured (see the sketch below)
- Results are reported in table and graph formats
- Note that this critical stage is impossible to complete without test data
  - the CART procedure requires test data to guide tree evaluation
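Continuing the scikit-learn sketch from slide 33, every tree in the pruning sequence can be scored on held-out test data:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# One fitted tree per pruning step, from largest down to the root-only tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)
for alpha in path.ccp_alphas:
    t = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_tr, y_tr)
    print(f"alpha={alpha:.4f}  leaves={t.get_n_leaves():3d}  "
          f"train acc={t.score(X_tr, y_tr):.3f}  test acc={t.score(X_te, y_te):.3f}")
```

Training accuracy rises monotonically with tree size, while test accuracy peaks at an intermediate size -- the pattern the next slide tabulates.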
36. Training Data vs. Test Data Error Rates
- Compare error rates measured on
  - the learn data: R(T)
  - a large test set: Rts(T)
- Learn R(T) always decreases as the tree grows (Q: Why?)
- Test Rts(T) first declines, then increases (Q: Why?)
- Overfitting is the result of too much reliance on the learn R(T)
  - can lead to disasters when applied to new data

  No. Terminal Nodes   R(T)   Rts(T)
          71           .00     .42
          63           .00     .40
          58           .03     .39
          40           .10     .32
          34           .12     .32
          19           .20     .31
          10           .29     .30
           9           .32     .34
           7           .41     .47
           6           .46     .54
           5           .53     .61
           2           .75     .82
           1           .86     .91
37. Why look at training data error rates (or cost) at all?
- First, they provide a rough guide of how you are doing
  - the truth will typically be WORSE than the training data measure
  - if the tree is performing poorly on training data, you may not want to pursue it further
- The training data error rate is more accurate for smaller trees
  - so it is a reasonable guide for smaller trees
  - but a poor guide for larger trees
- At the optimal tree, training and test error rates should be similar
  - if not, something is wrong
  - useful to compare not just the overall error rate but also within-node performance between training and test data
38. CART Optimal Tree
- Within a single CART run, which tree is best?
- The process of pruning the maximal tree can yield many sub-trees
- A test data set (or cross-validation) measures the error rate of each tree
- Current wisdom: select the tree with the smallest error rate
- Only drawback: the minimum may not be precisely estimated
  - the typical error rate as a function of tree size has a flat region
  - the minimum could be anywhere in this region
39. One SE Rule -- One Standard Error Rule
- The original monograph recommends NOT choosing the minimum-error tree, because of possible instability of results from run to run
- Instead it suggests the SMALLEST TREE within 1 SE of the minimum-error tree
  - tends to provide very stable results from run to run
  - is possibly as accurate as the minimum-cost tree, yet simpler
- Current learning: the one SE rule is good for small data sets
- For large data sets one should pick the most accurate tree
  - known as the zero SE rule (see the sketch below)
40. In what sense is the optimal tree best?
- The optimal tree has the lowest or near-lowest cost as determined by a test procedure
- The tree should exhibit very similar accuracy when applied to new data
- BUT the best tree is NOT necessarily the one that happens to be most accurate on a single test database
  - trees somewhat larger or smaller than optimal may be preferred
- Room for user judgment
  - judgment not about split variables or values
  - judgment as to how much of the tree to keep
  - determined by the story the tree is telling
  - willingness to sacrifice a small amount of accuracy for simplicity
41. CART Summary
- CART key features
  - binary splits
  - Gini index as the splitting criterion
  - grow, then prune
  - surrogates for missing values
  - optimal tree: 1 SE rule
  - lots of nice graphics
42. Decision Tree Summary
- Decision Trees
  - splits: binary, multi-way
  - split criteria: entropy, Gini, . . .
  - missing value treatment
  - pruning
  - rule extraction from trees
- Both C4.5 and CART are robust tools
- No method is always superior -- experiment!
(Based on material by Witten & Eibe)