Title: Lecture Notes 3 - Decision Tree Construction
- Zhangxi Lin
- ISQS 7342-001
- Texas Tech University
- Note: Most slides are from Decision Tree Modeling, by SAS
Outline
- The Mechanics of Decision Tree Construction
- Recursive Partitioning
Growing a Decision Tree: Six Steps
1. Pre-process the data (source data)
2. Set input-target characteristics
3. Select tree growth parameters (trigger the input search, manually or automatically)
4. Process/cluster inputs
5. Select the branch/split
6a. Stop, grow, or prune?
6b. Select the final tree
1. Data Pre-processing
- Distinguish categorical and continuous data
- Re-code multi-category targets into a 1-of-N scheme (make each category binary)
- Convert date and time data into a computable form; avoid information loss
- Make the scale directions consistent
- Handle multiple-response items carefully
- Understand the meaning of missing values correctly
- Pivot records if necessary
2. Set the Input and Target Modeling Characteristics
- Target
  - Check the missing-value indicator for the target field
  - Look for values such as -1, -99, or 0 to ensure the target field is in the right state
  - Create a 1-of-N derivation of the categorical codes for variables that look interval-scaled but actually are not
- Inputs
  - For decision tree analysis, inputs are transformed into discrete categories
  - Determine whether an input is ordered or unordered
3. Select the Decision Tree Growth Parameters
- Considerations
  - How input categories are combined to form branches
  - How branches are sorted and combined
  - Number of nodes on a branch
  - Number of alternative branches
  - How differences among branches are determined
  - Branch evaluation, selection, and display
  - Segmentation of the input data in terms of branches
  - Branch growth strategy: empirical tests or theoretical tests
  - Pre- or post-branching pruning
  - Stopping rules for potential branches and nodes
4. Cluster and Process Each Branch-Forming Input Field (within an input)
- Goals of clustering in decision tree construction
  - Cluster observations, values of input fields, and levels of splits on the tree
  - Maximize the predictive relationship between the input and the target
  - The most understandable branch may not always be the best predictor
- Clustering algorithms (how do they differ from k-means?)
  - Variance reduction
  - Entropy
  - Gini
- Significance tests
  - Tuning the levels of significance
  - The Kass merge-and-split heuristic for multiway splits
Kass Merge-and-Split Heuristic
- Merge-and-split converges on a single, optimal clustering of like codes.
  - Merges codes within clusters and reassigns consolidated groups of observations to different branches
  - Breaks up consolidated groups by splitting out the members with the weakest relationships
  - Re-merges the broken-up groups with consolidated groups that are similar
- SAS Enterprise Miner uses a variation of this heuristic, merge-and-shuffle.
  - Assigns each consolidated group of observations to a different node; the two nodes that degrade the worth of the split the least are merged
  - Reassigns consolidated groups of observations to different nodes
  - Stops when no consolidated group can be reassigned
Dealing with Missing Values
- Treat a missing value as a legitimate value (explicitly include it in the analysis)
- Use surrogates to populate descendant nodes where the input value for the preferred input is missing
- Estimate the missing value based on non-missing inputs
- Distribute the missing values in the input to the descendant nodes based on a distribution rule
- Distribute missing values over all branches in proportion to the values by branch
- How SAS EM handles missing values:
  - Distribute missing values across all available branches
  - Assign missing values to the most correlated branch
  - Assign missing values to the largest branch
5. Select the Candidate Decision Tree Branches (among inputs)
- The CHAID approach
  - F-test: for numeric targets with interval-level measurements; a measure of between-group vs. within-group similarity
  - Chi-squared test: for categorical targets
  - Statistical adjustments: Bonferroni adjustments
- The CRT approach
  - Choice of branches by parameters: number of leaves, best assessment value, the most leaves, Gini, variance reduction
  - Inputs are either nominal or interval; ordinal inputs are treated as interval
  - Different pruning methods:
    - Retrospective pruning: tries to identify the best subtree
    - Cost-complexity pruning: uses training data to create a subtree sequence
    - Reduced-error pruning: relies on validation data
Statistical Adjustments
- Bonferroni correction
  - The Bonferroni correction states that if an experimenter is testing n dependent or independent hypotheses on a set of data, then the statistical significance level that should be used for each hypothesis separately is 1/n times what it would be if only one hypothesis were tested. ("Statistically significant" simply means that a given result is unlikely to have occurred by chance.)
  - It was developed by the Italian mathematician Carlo Emilio Bonferroni.
- Kass adjustment
  - A p-value adjustment that multiplies the p-value by a Bonferroni factor that depends on the number of branches and chi-squared target values, and sometimes on the number of distinct input values. The Kass adjustment is used in the Tree node.
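As a worked example (the numbers are chosen only for illustration): to keep the overall chance of at least one false positive near 0.05 across n = 20 tests, each individual test is run at 0.05 / 20 = 0.0025. Equivalently, one can multiply each test's p-value by 20 before comparing it to 0.05, which is the form the Kass adjustment takes.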
6. Complete the Form and Content of the Final Decision Tree
- Stop, grow, prune, or iterate?
  - CHAID stops forming the decision tree when no node can produce any significant split below it. A stopping rule is used:
    - The user decides when to stop
    - The node contains too few observations
    - The tree reaches its maximum depth
    - No candidate split passes the F-test or chi-squared test
  - CRT relies on validation tests to prune branches, to stop tree growth, and to form an optimal decision tree.
Issues
- Assessment measures
  - The proportion correctly classified, and
  - The sum of squared errors (quantitative targets) or average squared error (continuous targets)
  - Others: the proportion of events in the top 50% (or a user-defined percentage) for target = 1
- Main difference between CHAID and CRT
  - Whether a test of significance or a train-and-test measurement comparison is used
- Guiding tree growth with costs and benefits in the target
  - Implied costs and benefits lie behind a wide range of human decision-making
- Prior probabilities
  - Affect the misclassification measure
  - Do not change the decision tree shape
Effect of Decision Threshold
[Figure: two class distributions separated by a decision threshold; the regions on either side are labeled hits and misses.]

Effect of Decision Threshold
[Figure: the same distributions, now labeling all four outcomes: hits, misses, false alarms, and correct rejections.]
Questions
- How does the six-step process of decision tree construction differ from your previous understanding of decision tree modeling?
- Why is clustering utilized in decision tree algorithms? How?
- How is the missing-values problem resolved in decision tree modeling?
- What is a surrogate split? How does it work?
- What are the differences between CHAID and CRT?
2. Recursive Partitioning
- 2.1 Recursive Partitioning
- 2.2 Split Selection Bias
- 2.3 Regression Diagnostics
Recursive Partitioning
- Recursive partitioning is the standard method used to fit decision trees. It is a top-down, greedy algorithm (see the sketch below).
- Example
  - Handwriting recognition is a classic application of supervised prediction. The example data set is a subset of the pen-based recognition of handwritten digits data, available from the UCI repository (Blake et al. 1998). The cases are digits written on a pressure-sensitive tablet. The input variables measure the position of the pen and are scaled to be between 0 and 100. Two of the original 16 inputs are shown (X1 and X10). The target is the true written digit (0-9).
  - This subset contains the 1064 cases corresponding to the three digits 1, 7, and 9. Each case represents a point in the input space. (The data have been jittered for display because many of the points overlap.)
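As a rough illustration of the greedy, top-down idea, here is a minimal sketch of one node's exhaustive split search on a single interval input. This is not SAS EM's implementation; the toy data and values are hypothetical.

```python
# A minimal sketch of one step of recursive partitioning: an exhaustive
# search for the best binary split of one interval input at one node.
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_binary_split(x, y):
    """Try every midpoint between consecutive distinct values of x and
    return the cut point with the largest impurity reduction."""
    n, parent = len(y), gini(y)
    xs = sorted(set(x))
    best_cut, best_gain = None, -1.0
    for lo, hi in zip(xs, xs[1:]):
        cut = (lo + hi) / 2
        left = [t for v, t in zip(x, y) if v < cut]
        right = [t for v, t in zip(x, y) if v >= cut]
        gain = parent - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)
        if gain > best_gain:
            best_cut, best_gain = cut, gain
    return best_cut, best_gain

# Toy data: one input and a three-class target, as in the digits example.
x1 = [10, 20, 30, 35, 40, 50, 60, 70]
y  = ['7', '7', '7', '1', '1', '9', '9', '9']
print(best_binary_split(x1, y))  # -> (45.0, 0.35625)
```

The full algorithm repeats this search over every input at every node, keeps the best split, and recurses on the child nodes until a stopping rule fires.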
Supervised Prediction: Nominal Target

Classification Tree

Multiway Splits
[Figure: an example tree with a multiway split, e.g., at < 22.]
Root-Node Split
Root: D1 = 364, D7 = 364, D9 = 336 (n = 1064)
Split on X1 < 38.5:
- yes (X1 < 38.5):  D1 = 293, D7 = 363, D9 = 42  (n = 698)
- no  (X1 >= 38.5): D1 = 71,  D7 = 1,   D9 = 294 (n = 366)
1-Deep Space
[Scatter plot: the 1064 cases plotted as their digit labels (1, 7, 9) in the X10-by-X1 input space, both axes 0 to 100. Digits 1 and 7 concentrate at low X1 and digit 9 at high X1, so the boundary X1 = 38.5 separates most 9s from the rest.]
Depth 2
Root split: X1 < 38.5?
- yes (D1 = 293, D7 = 363, D9 = 42; n = 698), split again on X10 < 0.5:
  - yes: D1 = 8,   D7 = 220, D9 = 1  (n = 229)
  - no:  D1 = 285, D7 = 143, D9 = 41 (n = 469)
- no (D1 = 71, D7 = 1, D9 = 294; n = 366), split again on X10 < 51.5:
  - yes: D1 = 67, D7 = 1, D9 = 18  (n = 86)
  - no:  D1 = 4,  D7 = 0, D9 = 276 (n = 280)
2-Deep Space
[Scatter plot: the same X10-by-X1 space partitioned by the depth-2 tree; the additional cuts at X10 = 0.5 and X10 = 51.5 further separate the 7s (low X10) from the 9s (high X10).]
Split Characteristics
[Scatter plot: the partitioned input space again (X10 and X1 axes, 0 to 100), annotated with the split boundaries; each rectangular region is dominated by one digit.]
Improvement of the Greedy Algorithm
- The greedy algorithm could be improved by incorporating some type of look-ahead or backup. Aside from the computational burden, trees built using limited look-ahead have not been shown to be an improvement; in many cases they give inferior trees (Murthy and Salzberg 1995).
- Another variation is oblique splits. Standard decision trees partition the input space using boundaries that are parallel to the input coordinate axes. These coordinate-axis splits make a fitted decision tree easy to interpret and provide resistance to the curse of dimensionality.
- Splits on linear combinations of inputs give oblique boundaries. Several algorithms have been developed for inducing oblique decision trees (BFOS 1984; Murthy et al. 1994; Loh and Shih 1997).
Outline
- Split Search
  - Ordinal input
  - Nominal input
  - Multiway splits
- Splitting Criterion
  - Impurity reduction
  - Chi-squared test
- Regression Trees
- Missing Values

Candidate splits examined during the search:

Variable:  X10   X10   X10      X1    X1        ...
Values:    0.5   1.8   11, 46   2.4   1, 4, 61  ...
Partitioning on an Ordinal Input
[Figure: the candidate splits of an ordinal input with levels 1-4; only contiguous groupings of the ordered levels are allowed.]
At Least Ordinal

X:         .20   3.3   1.7    14   3.5   2515
ln(X):    -1.6   1.2   .53   2.6   1.3    7.8
rank(X):     1     3     2     5     4      6

(Potential split locations fall between consecutive ordered values.)

For interval or ordinal inputs, splits in a decision tree depend only on the ordering of the levels, making tree models robust to outliers in the input space. Applying a rank or any monotonic transformation to an interval variable will not change the fitted tree.
Partitioning on a Nominal Input
[Figure: the candidate partitions of a nominal input with L = 4 levels into B branches. Unlike the ordinal case, any grouping of levels is allowed: the 7 two-branch partitions (1|234, 2|134, 3|124, 4|123, 12|34, 13|24, 14|23), the 6 three-branch partitions, and the single four-branch partition 1|2|3|4.]
Split Search Shortcuts
- Trees treat splits on inputs with nominal and ordinal measurement scales differently. Splits on a nominal input are not restricted: for a nominal input with L distinct levels, there are S(L, B) partitions into B branches, where S(L, B) is a Stirling number of the second kind.
- Counting binary splits exclusively:
  - ordinal: L - 1
  - nominal: 2^(L-1) - 1
- Agglomerative clustering of levels
  - Kass (1980)
- Minimum child size
Stirling Numbers
- In mathematics, Stirling numbers arise in a variety of combinatorics problems. They are named after James Stirling, who introduced them in the 18th century. Two different sets of numbers bear this name: the Stirling numbers of the first kind and the Stirling numbers of the second kind.
- See http://en.wikipedia.org/wiki/Stirling_number
Combinatorial Explosion Problem
- An exhaustive tree algorithm considers all possible partitions of all inputs at every node in the tree. The combinatorial explosion usually makes an exhaustive search prohibitively expensive.
- Tree algorithms usually take shortcuts to reduce the split search:
  - Restricting searches to binary splits,
  - Using level-clustering routines, and
  - Imposing minimum child size restrictions.
- Other options designed to improve tree efficiency, performance, and interpretability also impact the split search. They include the following:
  - Minimum Categorical Size
  - Use Input Once
  - Within-node Sampling
Clustering Levels
Nominal Variable Split - Clustering Branches
- Algorithm (see the sketch below)
  - Start with an L-way split.
  - Collapse the two levels that are closest (based on a splitting criterion).
  - Repeat the process on the set of L - 1 consolidated levels.
  - This gives a split of each size. Choose the best of these.
  - Repeat this process for every input and choose the best.
- The CHAID algorithm adds a backward elimination step (Kass 1980). The number of splits to consider is greatly reduced: to L(L-1)/2 for ordinal inputs and to (L-1)L(L+1)/6 for nominal inputs. For example, only 165 of the 115,914 splits of a 10-level nominal input would be considered.
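A minimal sketch of this consolidation loop, using hypothetical per-level class counts and Gini as the criterion (not the exact CHAID/EM implementation):

```python
# Agglomerative clustering of nominal levels: start with an L-way split
# and repeatedly merge the pair of branches whose merger increases the
# size-weighted Gini impurity the least, remembering the best split of
# each size so "the best of these" can be chosen afterwards.

def gini_counts(c):
    n = sum(c)
    return 1.0 - sum((k / n) ** 2 for k in c)

def weighted_impurity(groups):
    n = sum(sum(g) for g in groups.values())
    return sum(sum(g) / n * gini_counts(g) for g in groups.values())

def consolidate(counts):
    # counts: {level: [class_0 count, class_1 count, ...]}
    groups = {(lvl,): list(c) for lvl, c in counts.items()}
    best = {len(groups): (weighted_impurity(groups), sorted(groups))}
    while len(groups) > 2:
        keys, candidates = list(groups), []
        for i in range(len(keys)):
            for j in range(i + 1, len(keys)):
                merged = dict(groups)
                a, b = keys[i], keys[j]
                merged[a + b] = [x + y for x, y in zip(merged.pop(a), merged.pop(b))]
                candidates.append((weighted_impurity(merged), merged))
        imp, groups = min(candidates, key=lambda t: t[0])
        best[len(groups)] = (imp, sorted(groups))
    return best  # the best grouping found for each branch count

# Levels A-D with (class 0, class 1) counts per level:
counts = {'A': [30, 5], 'B': [28, 7], 'C': [5, 40], 'D': [6, 35]}
for b, (imp, grouping) in sorted(consolidate(counts).items()):
    print(b, round(imp, 3), grouping)
```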
Multiway versus Binary
[Figure: a 5-level input split once as a multiway split and, alternatively, as a cascade of binary splits.]
In theory, multiway splits are no more flexible than binary splits. Multiway splits often give more interpretable trees because split variables tend to be used fewer times. Many prefer binary splits because an exhaustive search is more feasible.
SAS's Split Search Strategy
- The SAS Decision Tree node uses a blend of different shortcuts:
  - By default, if the node size is greater than 5000, then a sample of 5000 cases is used. For classification trees, the sample is constructed to be as balanced as possible among the target classes. To change the sample size, use Node Sample.
  - Binary splits are used by default. To change the number of branches, use Maximum Branch.
  - If multiway splits are specified, then an initial consolidation phase is conducted to group the levels of the inputs.
  - All possible splits among the consolidated levels are examined, unless that number exceeds 5000, in which case an agglomerative algorithm is used. To change this threshold, use Exhaustive.
  - For categorical variables, a category must contain at least the number of observations specified in Minimum Categorical Size (default 5) to be considered in a split search. Otherwise, these observations are treated as missing values.
  - The use of an input can be limited with the Use Input Once option. It is turned off by default.
Splitting Criteria

Split on X1 (n = 1064), as in the root-node slide:

            X1 < 38.5   X1 >= 38.5
Digit 1:       293          71
Digit 7:       363           1
Digit 9:        42         294

ΔGini = .197, Δentropy = .504, logworth = 140

Split on X10 (n = 1064), four branches:

            X10 < 0.5   1-41   42-51   X10 >= 51.5
Digit 1:        9        143     65        147
Digit 7:      221         88      1         54
Digit 9:        1          4     16        315

ΔGini = .255, Δentropy = .600, logworth = 172
Impurity Reduction
[Diagram: a parent node with impurity i(0) and size n0, split into child nodes with impurities i(1), ..., i(4) and sizes n1, ..., n4.]
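In the diagram's notation, the quantity being maximized is the parent impurity minus the size-weighted average impurity of the B children:

ΔI = i(0) - Σ_{b=1..B} (n_b / n_0) · i(b)

A split is worth more the larger ΔI is.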
Gini Impurity
- High diversity, low purity: Pr(interspecific encounter) = 1 - 2(3/8)^2 - 2(1/8)^2 = .69
- Low diversity, high purity: Pr(interspecific encounter) = 1 - (6/7)^2 - (1/7)^2 = .24
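A two-line check of these numbers (a sketch; the class counts are read off the figures above):

```python
# Gini impurity from class counts: the probability that two randomly
# chosen members of the node belong to different classes.
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(round(gini([3, 3, 1, 1]), 2))  # 0.69: high diversity, low purity
print(round(gini([6, 1]), 2))        # 0.24: low diversity, high purity
```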
Entropy
[Figure: entropy of a two-class node plotted against the class proportion (both axes 0 to 1); it is zero for a pure node and maximal at a 50/50 mixture.]
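For a node with class proportions p_1, ..., p_c, entropy = -Σ_i p_i log2(p_i). In the two-class case this is H(p) = -p log2(p) - (1 - p) log2(1 - p), which is 0 at p = 0 or 1 and reaches its maximum of 1 bit at p = 0.5, matching the curve.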
Gini vs. Entropy
- Gini
  - The Gini index is a measure of variability for categorical data (developed by the Italian statistician Corrado Gini in 1912).
  - The Gini index can be interpreted as the probability that any two elements of a multi-set chosen at random are different.
  - In mathematical ecology, the Gini index is known as Simpson's diversity index. In cryptanalysis, it is 1 minus the repeat rate.
- Entropy
  - Entropy is a measure of variability for categorical data.
  - The Δentropy splitting criterion is used by Quinlan (1993). It is equivalent to using the likelihood-ratio chi-squared test statistic for association between the branches and the target categories.
  - For classification trees with binary splits, Breiman (1996) showed that the ΔGini criterion tends to favor isolating the largest target class in one branch, while the Δentropy criterion tends to favor split balance.
  - The ΔGini and Δentropy splitting criteria also tend to increase as the number of branches increases. They are not appropriate for fairly evaluating multiway splits because they favor large B.
Chi-Squared Test

Observed (expected) counts for the root-node split on X1 (n = 1064; branch proportions .656 / .344; class proportions .342, .342, .316):

              Digit 1      Digit 7      Digit 9
X1 < 38.5    293 (239)    363 (239)     42 (225)
X1 >= 38.5    71 (125)      1 (125)    294 (116)

Expected counts come from the marginals, E = n x (row proportion) x (column proportion); for example, E = 1064 x 0.656 x 0.342 ≈ 239. The cell chi-squared contributions (O - E)^2 / E are 12, 64, 149 in the first row and 23, 123, 273 in the second.
Chi-Squared Test
- The Pearson chi-squared test can be used to judge the worth of a split. It tests whether the column distributions (class proportions) are the same in each row (child node). The test statistic measures the difference between the observed cell counts and what would be expected if the branches and target classes (rows and columns) were independent.
- The statistical significance of the test is not monotonically related to the size of the chi-squared test statistic. The degrees of freedom of the test is (r-1)(B-1), where r and B are the dimensions of the table. The expected value of a chi-squared test statistic with v degrees of freedom equals v. Consequently, larger tables (more branches) will naturally have larger chi-squared statistics.
- The chi-squared splitting criterion uses the p-value of the chi-squared test (Kass 1980). When the p-values are very small, it is more convenient to use logworth = -log10(p-value), which increases as p decreases (see the sketch below).
- The ΔGini and Δentropy splitting criteria also tend to increase as the number of branches increases. However, they do not have an analogous degrees-of-freedom adjustment. Consequently, they are not appropriate for fairly evaluating multiway splits because they favor large B.
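A sketch of the computation for the root-node split on X1, using the observed counts from the earlier slide (scipy is assumed available; small rounding differences from the slide's reported χ² = 644 are expected):

```python
# Pearson chi-squared statistic and logworth for a candidate split.
# Rows are branches, columns are target classes (digits 1, 7, 9).
import math
from scipy.stats import chi2

observed = [[293, 363, 42],    # X1 <  38.5 (n = 698)
            [71,   1, 294]]    # X1 >= 38.5 (n = 366)

n = sum(map(sum, observed))
row_tot = [sum(r) for r in observed]
col_tot = [sum(c) for c in zip(*observed)]

# Expected counts under independence: E = n * p_row * p_col.
stat = sum((o - r * c / n) ** 2 / (r * c / n)
           for r, row in zip(row_tot, observed)
           for c, o in zip(col_tot, row))
df = (len(observed) - 1) * (len(observed[0]) - 1)

logworth = -math.log10(chi2.sf(stat, df))   # sf = 1 - CDF
print(round(stat), df, round(logworth))     # about 643, 2, 140
```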
p-Value Adjustments
[Figure: three candidate splits with their test summaries (node counts omitted):
- X1, 2-way split at 38.5: chi-squared = 644, df = 2, logworth = 140; m = 96 candidate splits, adjusted logworth = 140 - log10(96) ≈ 138
- X1, 3-way split at 17.5 and 36.5: chi-squared = 660, df = 4, logworth = 141; m = 4560, adjusted logworth ≈ 137
- X10, 4-way split at 0.5, 41.5, and 51.5: chi-squared = 814, df = 6, logworth = 172; m = 156849, adjusted logworth ≈ 167]

logworth = -log10(p-value)
Splitting Criteria
- The Decision Tree node uses the logworth (ProbChisq) chi-squared splitting criterion by default. Alternatively, the ΔGini and Δentropy splitting criteria can be specified.
- By default, the Decision Tree node uses Bonferroni adjustments to the p-values.
- Kass (1980) adjusted the p-values after the split was chosen on each input; thus, p-values for splits on the same input are compared without adjustment. The Decision Tree node also allows these adjustments to be applied before the split variable is chosen; thus, splits on the same input are compared using adjusted p-values.
- So "after" implies that adjusted p-values are compared between different inputs, not within the same input?
p-Value Adjustments
- Step one: comparing splits on the same input variable
  - The chi-squared test statistic (as well as ΔGini and Δentropy) favors splits into greater numbers of branches. The p-value (or logworth) adjusts for this bias through the degrees of freedom. For binary splits, no adjustment is necessary.
- Step two: comparing splits on different input variables
  - The maximum logworth tends to become larger as the number of candidate splits, m, increases. Consequently, input variables with a larger m are favored: nominal inputs are favored over ordinal inputs with the same number of levels, and among inputs with the same measurement scale, those with more levels are favored. Kass (1980) proposed Bonferroni adjustments of the p-values to account for this bias. Let α be the probability of a type I error on each test (that is, of discovering an erroneous association). For a set of m tests, a conservative upper bound on the probability of at least one type I error is mα (the Bonferroni inequality). Consequently, the Kass adjustment multiplies the p-values by m (equivalently, subtracts log10(m) from the logworth), as sketched below.
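A one-function sketch of the adjustment (the example numbers mirror the binary X1 split two slides back):

```python
import math

def kass_adjusted_logworth(p_value, m):
    """Bonferroni-adjust the best of m tests: multiply p by m, i.e.,
    subtract log10(m) from the logworth (capping p at 1)."""
    return -math.log10(min(1.0, m * p_value))

# An input with m = 96 candidate binary cut points whose best split has
# unadjusted logworth 140 (p = 1e-140):
print(kass_adjusted_logworth(1e-140, 96))   # 140 - log10(96) ~= 138.0
```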
Splitting with Missing Values
[Figure: candidate splits of an input with levels 1, 2, 3 and a missing level (?). The missing level can be grouped with any branch, e.g., {1 | 2 | 3, ?}, {1, ? | 2, 3}, {1, 2 | 3 | ?}, {1, 2, ? | 3}, and so on.]
Handling Missing Values
- One of the chief benefits of recursive partitioning is its treatment of missing input data. Parametric regression models require complete cases: one missing value on one input variable eliminates that case from the analysis. Imputation methods are often used prior to model fitting to fill in the missing values.
- Decision trees can treat missing input values as a separate level of the input variable. A nominal input with L levels and a missing value can be treated as an (L + 1)-level input. If a new case has a missing value on a splitting variable, then the case is sent to whatever branch contains the missing values.
- In the case of an ordinal input with missing values, the missing value usually cannot be placed in order among the input levels and instead acts as a nominal level. Consequently, the split search should not place any restrictions on the branch that contains the missing level.
Surrogate Splits
- Surrogate splits can be used to handle missing values (BFOS 1984). A surrogate split is a partition using a different input that mimics the selected split.
- A perfect surrogate maps all the cases that are in the same node of the primary split to the same node of the surrogate split.
- The agreement between two splits can be measured as the proportion of cases that are sent to the same branch (see the sketch below). The split with the greatest agreement is taken as the best surrogate.
- The surrogates in SAS EM are used for scoring new cases, not for fitting the training data. Missing values in the training data are treated as a new input level.
  - If a new case has a missing value on the splitting variable, then the best surrogate is used to classify the case.
  - If the surrogate variable is missing as well, then the second-best surrogate is used.
  - If the new case has missing values on all the surrogates, it is sent to the branch that contains the missing values of the training data.
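A sketch of the agreement computation on toy values (the 76% figure on the next slide comes from the real digits data):

```python
# Agreement between a primary split and a candidate surrogate: the
# proportion of cases both splits send to the same branch.
def agreement(primary, surrogate):
    return sum(p == s for p, s in zip(primary, surrogate)) / len(primary)

x1  = [10, 25, 42, 55, 30, 70, 15, 48]    # toy input values
x10 = [30, 35, 60, 55, 44, 80, 20, 38]

primary   = [v < 38.5 for v in x1]        # the selected split
surrogate = [v < 41.5 for v in x10]       # a candidate surrogate
print(agreement(primary, surrogate))      # 0.75 on these toy cases
```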
Surrogate Splits
[Figure: the surrogate split X10 < 41.5 for the primary split X1 < 38.5, shown on the scatter of the 1064 cases. Agreement = 76%: cross-classifying the two splits, 454 + 354 cases are sent to the same branch and 244 + 12 are sent to different branches.]
Variable Importance
- Developed by BFOS in 1984; useful for tree interpretation.
- Importance is a weighted average of the reduction in impurity for the surrogate splits using the jth input across all the internal nodes in the tree.
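A common form of the measure (a sketch following BFOS 1984; the notation is assumed, not taken from the slides): Imp(x_j) = Σ_t ΔI(s̃_j, t), summed over all internal nodes t, where s̃_j is the best surrogate split on x_j at node t and ΔI is its impurity reduction; the node-size weighting enters through ΔI, and the values are usually rescaled so that the largest importance is 1 (or 100).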
2. Recursive Partitioning
- 2.1 Recursive Partitioning
- 2.2 Split Selection Bias
- 2.3 Regression Diagnostics
Split Selection Bias

Worth (ΔGini) and logworth of the best split found on each input, with the number of branches in parentheses:

                    ΔGini        No Kass      Kass Before   Kass After
2-way split
  Inv               .0030 (2)    8.10 (2)     7.62 (2)      7.62 (2)
  Branch            .0043 (2)    11.32 (2)    5.90 (2)      5.90 (2)
Up to 4-way split
  Inv               .0042 (3)    10.12 (3)    10.12 (3)     10.12 (3)
  Branch            .0059 (4)    13.51 (3)    6.50 (4)      6.05 (3)
Up to 19-way split
  Inv               .0042 (3)    10.12 (3)    10.12 (3)     10.12 (3)
  Branch            .0062 (19)   13.51 (3)    7.14 (19)     6.05 (3)
Interval Targets
[Figure: an interval-target example - the density of the target MEDV (roughly 0 to 50) and the inputs RM and NOX.]
Impurity Reduction
[Diagram: a parent node with impurity i(0) and size n0, split into children with impurities i(1), ..., i(4) and sizes n1, ..., n4, as before.]
Variance Reduction
[Figure: a split on RM < 6.94 (yes/no) for the interval target.]
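A sketch with hypothetical target values (the variable names RM and MEDV follow the figure; the numbers are made up for illustration):

```python
# Variance reduction for an interval target: node impurity is the
# variance of the target within the node; a split is scored by the
# size-weighted drop in variance.
def variance(y):
    m = sum(y) / len(y)
    return sum((v - m) ** 2 for v in y) / len(y)

def variance_reduction(y_left, y_right):
    y = y_left + y_right
    n = len(y)
    return (variance(y)
            - len(y_left) / n * variance(y_left)
            - len(y_right) / n * variance(y_right))

# Hypothetical MEDV values on each side of the split RM < 6.94:
left  = [21.6, 24.7, 19.2, 22.0]   # RM <  6.94
right = [34.9, 36.2, 41.3, 33.0]   # RM >= 6.94
print(round(variance_reduction(left, right), 2))   # 52.38
```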
One-Way ANOVA
...
Heteroscedasticity
2. Recursive Partitioning
- 2.1 Recursive Partitioning
- 2.2 Split Selection Bias
- 2.3 Regression Diagnostics
The SAS EM Model

Configure the SAS EM Model

Diagnose Model Residuals

Diagnose Model Residuals
- Option B: Use the SAS Code node to enhance and register diagnostic output.
  - Select the upper SAS Code node. In the Training section of the Properties panel, select SAS Code. Right-click in the Editor window and select Open. Select the EX2.2a.sas program.
  - Select OK to exit the SAS Code node. Run the node and view the results.
  - Copy the plot location from the Output window into a Windows Explorer Address field. (Or select Start -> Run from the Windows toolbar and copy this location into the Open field.)
  - Close the HTML output and SAS Code results windows.
Results