1
Lecture Notes 3: Decision Tree Construction
  • Zhangxi Lin
  • ISQS 7342-001
  • Texas Tech University
  • Note: Most slides are from Decision Tree Modeling, by SAS

2
Outline
  • The Mechanics of Decision Tree Construction
  • Recursive Partitioning

3
Growing a Decision Tree: Six Steps
  1. Pre-process data
  2. Set input-target characteristics
  3. Select tree growth parameters
  4. Process/cluster inputs
  5. Select branch/split
  6a. Stop/grow/prune?
  6b. Select final tree
  (Flow-diagram annotations: source data; trigger input search; manual/automatic)
4
1. Data Pre-process
  • Distinguish categorical and continuous data
  • Re-code multi-category targets into a 1-of-N scheme, making each derived target binary (see the sketch below)
  • Convert date and time data into a computable form
  • Avoid information loss
  • Make the scale directions consistent
  • Handle multiple-response items carefully
  • Interpret missing values correctly
  • Pivot records if necessary
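
A minimal Python (pandas) sketch of two of these steps; the data frame and its column names (digit, signup_date) are illustrative only, not from the course data:

    import pandas as pd

    # Hypothetical example frame with a multi-category target and a date field.
    df = pd.DataFrame({
        "digit": ["1", "7", "9", "7"],
        "signup_date": ["1999-03-21", "2001-07-04", "2000-01-15", "1999-12-31"],
    })

    # 1-of-N (one-hot) recoding: one binary column per target category.
    target_1_of_n = pd.get_dummies(df["digit"], prefix="digit")

    # Convert dates into a computable form, e.g., days elapsed since a reference date.
    df["signup_date"] = pd.to_datetime(df["signup_date"])
    df["days_since_1999"] = (df["signup_date"] - pd.Timestamp("1999-01-01")).dt.days

    print(target_1_of_n)
    print(df[["signup_date", "days_since_1999"]])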

5
2. Set the Input and Target Modeling Characteristics
  • Target
  • Check the missing-value indicator for the target field.
  • Look for values such as -1, -99, or 0 to ensure the target field is in the right state.
  • Create a 1-of-N derivation of the categorical codes for variables that appear to be interval-scaled but are actually categorical.
  • Inputs
  • For decision tree analysis, inputs are transformed into discrete categories.
  • Determine whether an input is ordered or unordered.

6
3. Select the Decision Tree Growth Parameters
  • Considerations
  • Combination of input categories for branching
  • Sorting and combining of branches
  • Number of nodes on a branch
  • Number of alternative branches
  • Determining differences among branches
  • Branch evaluation, selection, and display
  • Segmentation of the input data in terms of branches
  • Branch growth strategy: empirical tests or theoretical tests
  • Pre- or post-branching pruning
  • Stopping rule: potential branches and nodes

7
4. Cluster and Process Each Branch-Forming Input
Field (within an input)
  • Goal of clustering in decision tree construction
  • Cluster observations, values of input fields, and splits at the same level of the tree
  • Maximize the predictive relationship between the input and the target
  • The most understandable branch may not always be the best predictor
  • Clustering algorithms (how do they differ from k-means?)
  • Variance reduction
  • Entropy
  • Gini
  • Significance tests
  • Tuning the levels of significance
  • The Kass merge-and-split heuristic for multiway splits

8
Kass Merge-and-Split Heuristic
  • Merge-and-Split: converges on a single, optimal clustering of like codes (a simplified sketch of the level-merging idea follows below).
  • Merges codes within clusters and reassigns consolidated groups of observations to different branches
  • Breaks up consolidated groups by splitting out the members with the weakest relationships
  • Re-merges the broken-up groups with consolidated groups that are similar
  • SAS Enterprise Miner uses a variation of this heuristic, Merge-and-Shuffle:
  • Assigns each consolidated group of observations to a different node; the two nodes that degrade the worth of the split the least are merged.
  • Reassigns consolidated groups of observations to different nodes
  • Stops when no consolidated group can be reassigned
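
A rough illustration of the agglomerative level-merging idea these heuristics build on (not SAS's exact Merge-and-Shuffle implementation): repeatedly pool the two groups of input levels whose target distributions look most alike, as judged by a chi-squared test, until the desired number of branches remains. The counts below are made up.

    import numpy as np
    from scipy.stats import chi2_contingency

    # counts[level] = target-class counts for that input level (illustrative numbers).
    counts = {
        "A": np.array([40, 10]),
        "B": np.array([38, 12]),
        "C": np.array([5, 45]),
        "D": np.array([8, 42]),
    }

    def merge_step(groups):
        # Merge the pair of groups whose two-row table has the largest chi-squared
        # p-value, i.e., the pair that looks most alike with respect to the target.
        best_pair, best_p = None, -1.0
        keys = list(groups)
        for i in range(len(keys)):
            for j in range(i + 1, len(keys)):
                table = np.vstack([groups[keys[i]], groups[keys[j]]])
                p = chi2_contingency(table)[1]
                if p > best_p:
                    best_pair, best_p = (keys[i], keys[j]), p
        a, b = best_pair
        groups[a + "+" + b] = groups.pop(a) + groups.pop(b)
        return groups

    while len(counts) > 2:          # collapse until two consolidated branches remain
        counts = merge_step(counts)
    print(counts)                   # levels A and B end up pooled, as do C and D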

9
Dealing with missing values
  • Treat a missing value as a legitimate value
    (explicitly include it in the analysis)
  • Use surrogates to populate descendent nodes where
    the input value for the preferred input is
    missing
  • Estimate missing value based on non-missing
    inputs
  • Distribute the missing value in the input to the
    descendent node based on a distribution rule
  • Distribute missing values over all branches in
    proportion to the missing values by branch
  • How SAS EM handles missing values
  • Distribute missing values across all available
    branches
  • Assign missing values to the most correlated
    branch
  • Assign missing values to the largest branch

10
5. Select the Candidate Decision Tree Branches
(among inputs)
  • The CHAID approach
  • F-test: for numeric targets with interval-level measurements; compares between-group variability to within-group variability.
  • Chi-squared test: for categorical targets.
  • Statistical adjustments
  • Bonferroni adjustments
  • The CRT approach
  • Choice of branches by parameters: number of leaves, best assessment value, the most leaves, Gini, variance reduction
  • Inputs are either nominal or interval; ordinal inputs are treated as interval
  • Different pruning approaches
  • Retrospective pruning: tries to identify the best subtree
  • Cost-complexity pruning: uses training data to create a subtree sequence
  • Reduced-error pruning: relies on validation data

11
Statistical Adjustments
  • Bonferroni correction
  • The Bonferroni correction states that if an experimenter is testing n dependent or independent hypotheses on a set of data, then the statistical significance level used for each hypothesis separately should be 1/n times what it would be if only one hypothesis were tested. "Statistically significant" simply means that a given result is unlikely to have occurred by chance. (See the small example below.)
  • It was developed by the Italian mathematician Carlo Emilio Bonferroni.
  • Kass adjustment
  • A p-value adjustment that multiplies the p-value by a Bonferroni factor that depends on the number of branches and chi-square target values, and sometimes on the number of distinct input values. The Kass adjustment is used in the Tree node.
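
A tiny illustration of the correction, with made-up numbers:

    # Bonferroni correction: with n hypotheses, test each at 1/n of the usual level.
    alpha, n_tests = 0.05, 20
    per_test_level = alpha / n_tests          # 0.0025

    # Equivalently, multiply each raw p-value by n before comparing it to alpha.
    raw_p = 0.0031                            # hypothetical p-value for one test
    adjusted_p = min(1.0, raw_p * n_tests)    # 0.062 -> no longer significant at 0.05
    print(per_test_level, adjusted_p)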

12
6. Complete the Form and Content of the Final Decision Tree
  • Stop, grow, prune, or iterate?
  • CHAID stops growing the decision tree when no node can produce any significant split below it; a stopping rule is used.
  • The user decides when to stop:
  • The node contains too few observations
  • The maximum depth is reached
  • No further split passes the F-test or chi-squared test
  • CRT relies on validation tests to prune branches, to stop tree growth, and to form an optimal decision tree.

13
Issues
  • Assessment measures
  • Proportion correctly classified
  • The sum of squared errors (quantitative targets), or average squared error (continuous targets)
  • Others: the proportion of events in the top 50% (or a user-defined percentage) for target = 1
  • Main difference between CHAID and CRT
  • Whether a test of significance or a train-and-test measurement comparison is used
  • Guiding tree growth with costs and benefits in the target
  • Implied costs and benefits lie behind a wide range of human decision-making
  • Prior probabilities
  • Affect the misclassification measure
  • Do not change the decision tree shape

14
Effect of Decision Threshold
[Figure: a decision threshold applied to the target distribution separates hits from misses.]
15
Effect of Decision Threshold
[Figure: the same decision threshold also defines false alarms and correct rejections alongside hits and misses.]
16
Questions
  1. How does the 6-step decision tree construction process differ from your previous understanding of decision tree modeling?
  2. Why is clustering utilized in decision tree algorithms? How?
  3. How is the missing-values problem resolved in decision tree modeling?
  4. What is a surrogate split? How does it work?
  5. What are the differences between CHAID and CRT?

17
3
2. Recursive Partitioning
2.1 Recursive Partitioning
2.2 Split Selection Bias
2.3 Regression Diagnostics
18
Recursive Partitioning
  • Recursive partitioning is the standard method used to fit decision trees. It is a top-down, greedy algorithm (a minimal sketch follows below).
  • Example
  • Handwriting recognition is a classic application
    of supervised prediction. The example data set is
    a subset of the pen-based recognition of
    handwritten digits data, available from the UCI
    repository (Blake et al 1998). The cases are
    digits written on a pressure-sensitive tablet.
    The input variables measure the position of the
    pen. They are scaled to be between 0 and 100. Two
    of the original 16 inputs are shown (X1 and X10).
    The target is the true written digit (0-9).
  • This subset contains the 1064 cases corresponding
    to the three digits 1, 7, and 9. Each case
    represents a point in the input space. (The data
    have been jittered for display because many of
    the points overlap.)
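
A compact sketch of top-down, greedy recursive partitioning for a classification target, assuming numeric inputs and Gini impurity; it only illustrates the idea, not the CHAID/CRT or SAS implementations discussed later. The toy data loosely mimic the X1/X10 digit example.

    import numpy as np

    def gini(y):
        _, counts = np.unique(y, return_counts=True)
        p = counts / counts.sum()
        return 1.0 - np.sum(p ** 2)

    def best_split(X, y):
        # Exhaustive search over inputs and cut points for the largest Gini reduction.
        best = (None, None, 0.0)                     # (column, threshold, delta_gini)
        parent = gini(y)
        for j in range(X.shape[1]):
            for t in np.unique(X[:, j]):
                left = X[:, j] < t
                if left.sum() == 0 or left.sum() == len(y):
                    continue
                w = left.mean()
                delta = parent - (w * gini(y[left]) + (1 - w) * gini(y[~left]))
                if delta > best[2]:
                    best = (j, t, delta)
        return best

    def grow(X, y, depth=0, max_depth=2, min_size=5):
        j, t, delta = best_split(X, y)
        if depth >= max_depth or delta <= 0 or len(y) < min_size:
            values, counts = np.unique(y, return_counts=True)
            return {"leaf": values[np.argmax(counts)]}      # majority-class leaf
        left = X[:, j] < t
        return {"var": j, "cut": t,
                "left": grow(X[left], y[left], depth + 1, max_depth, min_size),
                "right": grow(X[~left], y[~left], depth + 1, max_depth, min_size)}

    rng = np.random.default_rng(0)
    X = rng.uniform(0, 100, size=(200, 2))          # two inputs standing in for X1, X10
    y = np.where(X[:, 0] < 38.5, "9", np.where(X[:, 1] < 41.5, "1", "7"))
    print(grow(X, y))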

19
Supervised Prediction: Nominal Target
20
Classification Tree
21
Multiway Splits
22
Decision Regions
23
Root-Node Split
[Tree diagram]
Root node: n = 1064 (D1 364, D7 364, D9 336)
Split X1 < 38.5:
  yes branch: n = 366 (D1 71, D7 1, D9 294)
  no branch:  n = 698 (D1 293, D7 363, D9 42)
24
1-Deep Space
[Scatter plot: the 1064 cases (digits 1, 7, 9) plotted in the X1-by-X10 input space (both axes 0-100), showing the regions created by the root-node split.]
25
Depth 2
[Tree diagram, depth 2]
Root split X1 < 38.5:
  no branch (n = 698: D1 293, D7 363, D9 42), split on X10 < 51.5:
    yes: n = 469 (D1 285, D7 143, D9 41)
    no:  n = 229 (D1 8, D7 220, D9 1)
  yes branch (n = 366: D1 71, D7 1, D9 294), split on X10 < 0.5:
    yes: n = 280 (D1 4, D7 0, D9 276)
    no:  n = 86 (D1 67, D7 1, D9 18)
26
2-Deep Space
[Scatter plot: the same X1-by-X10 space with the depth-2 partition boundaries overlaid on the digit 1, 7, 9 cases.]
27
Split Characteristics
[Scatter plot: digits 1, 7, 9 in the X1-by-X10 space (axes 0-100), illustrating the characteristics of the selected splits.]
28
Improvement of the greedy algorithm
  • This greedy algorithm could be improved by incorporating some type of lookahead or backup. Aside from the computational burden, trees built using limited lookahead have not been shown to be an improvement; in many cases they give inferior trees (Murthy and Salzberg 1995).
  • Another variation is oblique splits. Standard
    decision trees partition the input space using
    boundaries that are parallel to the input
    coordinates. These coordinate-axis splits make a
    fitted decision tree easy to interpret and
    provide resistance to the curse of
    dimensionality.
  • Splits on linear combinations of inputs give
    oblique boundaries. Several algorithms have been
    developed for inducing oblique decision trees
    (BFOS 1984, Murthy et al 1994, Loh and Shih
    1997).

29
Outline
  • Split Search
  • Ordinal input
  • Nominal input
  • Multiway splits
  • Splitting Criterion
  • Impurity reduction
  • Chi-squared test
  • Regression Trees
  • Missing Values

Variable:  X10   X10   X10      X1    X1        . . .
Values:    0.5   1.8   11, 46   2.4   1, 4, 61  . . .
30
Partitioning on an Ordinal Input
[Figure: the 7 possible ways to partition an ordinal input with levels 1-4 into branches; only groupings of adjacent levels, such as {1}{2 3 4} or {1 2}{3}{4}, are allowed.]
31
At Least Ordinal
X:        0.20   3.3   1.7   14    3.5   2515
ln(X):   -1.6    1.2   0.53  2.6   1.3   7.8
rank(X):  1      3     2     5     4     6
Potential split locations lie between adjacent values in rank order.
For interval or ordinal inputs, splits in a
decision tree depend only on the ordering of the
levels, making tree models robust to outliers in
input space. The application of a rank or any
monotonic transformation to an interval variable
will not change the fitted tree.
32
Partitioning on a Nominal Input
[Figure: the possible partitions of a nominal input with L = 4 levels into B branches; any grouping of levels is allowed, e.g., {1}{2 3 4}, {1 3}{2 4}, {1}{2}{3 4}, ...]
33
Split Search Shortcuts
  • Trees treat splits on inputs with nominal and ordinal measurement scales differently. Splits on a nominal input are not restricted. For a nominal input with L distinct levels, there are S(L, B) partitions into B branches, where S(L, B) is a Stirling number of the second kind.
  • Binary splits exclusively: ordinal inputs have L - 1 candidate splits; nominal inputs have 2^(L-1) - 1.
  • Agglomerative clustering of levels
  • Kass (1980)
  • Minimum child size

34
Stirling Number
  • In mathematics, Stirling numbers arise in a variety of combinatorics problems. They are named after James Stirling, who introduced them in the 18th century. Two different sets of numbers bear this name: the Stirling numbers of the first kind and the Stirling numbers of the second kind.
  • See http://en.wikipedia.org/wiki/Stirling_number

35
Combinatorial Explosion Problem
  • An exhaustive tree algorithm considers all
    possible partitions of all inputs at every node
    in the tree. The combinatorial explosion usually
    makes an exhaustive search prohibitively
    expensive.
  • Tree algorithms usually take shortcuts to reduce
    the split search.
  • Restricting searches to binary splits,
  • Using level clustering routines, and
  • Imposing minimum child size restrictions.
  • Other options designed to improve tree
    efficiency, performance, and interpretability
    also impact the split search. They include the
    following
  • Minimum Categorical Size
  • Use Input Once
  • Within-node Sampling

36
Clustering Levels
37
Nominal Variable Split - Clustering Branches
  • Algorithm
  • Start with an L-way split.
  • Collapse the two levels that are closest (based on a splitting criterion).
  • Repeat the process on the set of L - 1 consolidated levels.
  • This gives a split of each size. Choose the best of these.
  • Repeat this process for every input and choose the best.
  • The CHAID algorithm adds a backward elimination step (Kass 1980). The number of splits to consider is greatly reduced: to L(L-1)/2 for ordinal inputs and to (L-1)L(L+1)/6 for nominal inputs. For example, only 165 of the 115,974 possible splits of a 10-level nominal input would be considered (the counts are checked in the sketch below).
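
These counts are easy to check numerically; a small sketch assuming the formulas above:

    from math import comb, factorial

    def stirling2(L, B):
        # Stirling number of the second kind S(L, B): partitions of L levels into B branches.
        return sum((-1) ** j * comb(B, j) * (B - j) ** L for j in range(B + 1)) // factorial(B)

    L = 10
    all_nominal_splits = sum(stirling2(L, B) for B in range(2, L + 1))  # every multiway partition
    binary_nominal = 2 ** (L - 1) - 1                                   # binary splits, nominal input
    binary_ordinal = L - 1                                              # binary splits, ordinal input
    chaid_ordinal = L * (L - 1) // 2                                    # CHAID-style search, ordinal
    chaid_nominal = (L - 1) * L * (L + 1) // 6                          # CHAID-style search, nominal

    print(all_nominal_splits, binary_nominal, binary_ordinal, chaid_ordinal, chaid_nominal)
    # 115974 511 9 45 165 -> only 165 of the roughly 116,000 possible splits are considered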

38
Multiway versus Binary
[Figure: the same partition of levels 1-5 expressed as one multiway split and as a cascade of binary splits.]
In theory, multiway splits are no more flexible
than binary splits. Multiway splits often give
more interpretable trees because split variables
tend to be used fewer times. Many prefer binary
splits because an exhaustive search is more
feasible.
39
SAS's Split Search Strategy
  • The SAS Decision Tree node uses a blend of different shortcuts:
  • By default, if the node size is greater than
    5000, then a sample of 5000 cases is used. For
    classification trees, the sample is constructed
    to be as balanced as possible among the target
    classes. To make changes to sample size, use Node
    Sample.
  • Binary splits are used by default. To change
    split number, use Maximum Branch.
  • If multiway splits are specified, then an initial
    consolidation phase is conducted to group the
    levels of the inputs.
  • All possible splits among the consolidated levels
    are examined, unless that number exceeds 5000, in
    which case, an agglomerative algorithm is used.
    To change this threshold, use Exhaustive.
  • For categorical variables, a category must
    contain at least the number of observations
    specified in Minimum Categorical Size (default is
    5) to be considered in a split search. Otherwise,
    these observations are treated as missing values.
  • The use of an input can be limited with the Use
    Input Once option. It is turned off by default.

40
Splitting Criteria
[Figure: two candidate splits of the root node compared.
Binary split on X1 at 38.5 (branches X1 >= 38.5 | X1 < 38.5): D1 293 | 71, D7 363 | 1, D9 42 | 294; ΔGini = .197, Δentropy = .504, logworth = 140.
Four-way split on X10 (branches < 0.5, 1-41, 42-51, >= 51.5): ΔGini = .255, Δentropy = .600, logworth = 172.]
41
Impurity Reduction
[Diagram: a parent node with impurity i(0) and size n0 split into children with impurities i(1), ..., i(4) and sizes n1, ..., n4. The impurity reduction of the split is i(0) minus the size-weighted average of the child impurities.]
42
Gini Impurity
High diversity, low purity:
  Pr(interspecific encounter) = 1 - 2(3/8)^2 - 2(1/8)^2 = .69
Low diversity, high purity:
  Pr(interspecific encounter) = 1 - (6/7)^2 - (1/7)^2 = .24
43
Entropy
[Figure: entropy of a two-class node as a function of the class proportion; it is 0 at proportions 0 and 1 and reaches its maximum of 1.0 at 0.5.]
44
Gini vs. Entropy
  • Gini
  • The Gini index is a measure of variability for categorical data (developed by the Italian statistician Corrado Gini in 1912).
  • The Gini index can be interpreted as the probability that any two elements of a multi-set chosen at random are different.
  • In mathematical ecology, the Gini index is known as Simpson's diversity index. In cryptanalysis, it is 1 minus the repeat rate.
  • Entropy
  • Entropy is a measure of variability for categorical data.
  • The Δentropy splitting criterion is used by Quinlan (1993). It is equivalent to using the likelihood-ratio chi-squared test statistic for association between the branches and the target categories.
  • For classification trees with binary splits, Breiman (1996) showed that the ΔGini criterion tends to favor isolating the largest target class in one branch, while the Δentropy criterion tends to favor split balance. (The root-split values ΔGini = .197 and Δentropy = .504 from the digit example are reproduced in the sketch below.)
  • The ΔGini and Δentropy splitting criteria also tend to increase as the number of branches increases. They are not appropriate for fairly evaluating multiway splits because they favor large B.
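
The ΔGini = .197 and Δentropy = .504 values quoted for the root-node split X1 < 38.5 of the digit example (slide 40) can be reproduced directly from the node counts:

    import numpy as np

    def gini(counts):
        p = np.asarray(counts) / np.sum(counts)
        return 1.0 - np.sum(p ** 2)

    def entropy(counts):
        p = np.asarray(counts) / np.sum(counts)
        return -np.sum(p * np.log2(p))

    root  = [364, 364, 336]    # digits 1, 7, 9 at the root (n = 1064)
    left  = [293, 363, 42]     # X1 >= 38.5 branch (n = 698)
    right = [71, 1, 294]       # X1 <  38.5 branch (n = 366)

    n, n_l, n_r = sum(root), sum(left), sum(right)
    delta_gini    = gini(root)    - (n_l / n) * gini(left)    - (n_r / n) * gini(right)
    delta_entropy = entropy(root) - (n_l / n) * entropy(left) - (n_r / n) * entropy(right)
    print(round(delta_gini, 3), round(delta_entropy, 3))   # 0.197 0.504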

45
Chi-Squared Test
Contingency table for the root split on X1 (columns are the branches X1 >= 38.5 | X1 < 38.5; rows are the digit classes):

         Observed      Expected      (O - E)^2 / E    Row proportion
D1       293 |  71     239 | 125      12 |  23        .342
D7       363 |   1     239 | 125      64 | 123        .342
D9        42 | 294     225 | 116     149 | 273        .316
Column proportions: .656 | .344;  n = 1064.
An expected cell count equals n x column proportion x row proportion (the figure illustrates this with 1064 x 0.656 x 0.316).
46
Chi-Squared Test
  • The Pearson chi-squared test can be used to judge
    the worth of the split. It tests whether the
    column distributions (class proportions) are the
    same in each row (child node). The test statistic
    measures the difference between the observed cell
    counts and what would be expected if the branches
    and target classes (rows and columns) were
    independent.
  • The statistical significance of the test is not monotonically related to the size of the chi-squared test statistic. The degrees of freedom of the test are (r-1)(B-1), where r and B are the dimensions of the table. The expected value of a chi-squared test statistic with ν degrees of freedom equals ν. Consequently, larger tables (more branches) will naturally have larger chi-squared statistics.
  • The chi-squared splitting criterion uses the p-value of the chi-squared test (Kass 1980). When the p-values are very small, it is more convenient to use the logworth, defined as -log10(p-value), which increases as p decreases. (A short computation for the example split follows below.)
  • The ΔGini and Δentropy splitting criteria also tend to increase as the number of branches increases. However, they do not have an analogous degrees-of-freedom adjustment. Consequently, they are not appropriate for fairly evaluating multiway splits because they favor large B.
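
For the example split (the table on slide 45), the chi-squared statistic and its logworth can be computed as follows; the result of roughly 140 matches the logworth shown for this split on the neighboring slides.

    import numpy as np
    from scipy.stats import chi2, chi2_contingency

    # Rows = target classes (digits 1, 7, 9); columns = branches (X1 >= 38.5, X1 < 38.5).
    observed = np.array([[293,  71],
                         [363,   1],
                         [ 42, 294]])

    stat, p_value, dof, expected = chi2_contingency(observed)
    # p_value is around 1e-140; working on the log scale avoids any underflow worries.
    logworth = -chi2.logsf(stat, dof) / np.log(10)      # -log10(p-value)
    print(round(stat, 1), dof, round(logworth, 1))      # about 643.5, 2, 139.7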

47
p-Value Adjustments
[Figure: three candidate splits of the root node and their logworths, where logworth = -log10(p-value): a binary split on X1 at 38.5 (logworth 140), a three-way split on X1 at 17.5 and 36.5, and a four-way split on X10 at 0.5, 41.5, and 51.5 (logworth 172).]
48
Splitting Criteria
  • The Decision Tree node uses the logworth (ProbChisq) chi-squared splitting criterion by default. Alternatively, the ΔGini and Δentropy splitting criteria can be specified.
  • By default, the Decision Tree node applies Bonferroni adjustments to the p-value.
  • Kass (1980) adjusted the p-values after the split was chosen on each input; thus, p-values for splits on the same input are compared without adjustment. The Decision Tree node allows these adjustments to be applied before the split variable is chosen; thus, splits on the same input are compared using adjusted p-values.
  • So "after" implies that the adjusted p-values are compared between different inputs, not within the same input?

49
P-Value Adjustments
  • Step one: comparing splits on the same input variable
  • The chi-squared test statistic (as well as ΔGini and Δentropy) favors splits into greater numbers of branches. The p-value (or logworth) adjusts for this bias through the degrees of freedom. For binary splits, no adjustment is necessary.
  • Step two: comparing splits on different input variables
  • The maximum logworth tends to become larger as the number of candidate splits, m, increases. Consequently, input variables with a larger m are favored. Nominal inputs are favored over ordinal inputs with the same number of levels. Among inputs with the same measurement scale, those with more levels are favored. Kass (1980) proposed Bonferroni adjustments of the p-values to account for this bias. Let α be the probability of a type I error on each test (that is, of discovering an erroneous association). For a set of m tests, a conservative upper bound on the probability of at least one type I error is mα (the Bonferroni inequality). Consequently, the Kass adjustment multiplies the p-values by m (equivalently, subtracts log10(m) from the logworth; see the sketch below).
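
A small illustration of the step-two adjustment with hypothetical numbers:

    import math

    def kass_adjusted_logworth(raw_p, m):
        # Bonferroni/Kass adjustment: multiply the p-value by the number of candidate
        # splits m, which is the same as subtracting log10(m) from the logworth.
        return -math.log10(raw_p) - math.log10(m)

    # Hypothetical case: best raw p-value of 1e-8 found among m = 19 candidate splits.
    print(round(kass_adjusted_logworth(1e-8, m=19), 2))   # 8 - log10(19), about 6.72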

50
Splitting with Missing Values
[Figure: example trees splitting an input with levels 1, 2, 3 and missing values (?), showing the missing level routed to different branches in different candidate splits, e.g., {1}{2, 3, ?}, {1, ?}{2, 3}, {1, 2}{3, ?}, {1, 2, ?}{3}.]
51
Handling Missing Values
  • One of the chief benefits of recursive
    partitioning is the treatment of missing input
    data. Parametric regression models require
    complete cases. One missing value on one input
    variable eliminates that case from analysis.
    Imputation methods are often used prior to model
    fitting to fill in the missing values.
  • Decision trees can treat missing input values as a separate level of the input variable. A nominal input with L levels and a missing value can be treated as an (L + 1)-level input (see the sketch below). If a new case has a missing value on a splitting variable, then the case is sent to whatever branch contains the missing values.
  • In the case of an ordinal input with missing values, the missing value cannot usually be placed in order among the input levels but acts as a nominal level. Consequently, the split search should not place any restrictions on the branch that contains the missing level.
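
A minimal pandas sketch of the "missing as its own level" treatment, with made-up values:

    import pandas as pd

    # Hypothetical nominal input with L = 3 observed levels plus missing values.
    x = pd.Series(["a", "b", None, "c", None, "a"])

    # Treat missing as a separate level: the input now effectively has L + 1 levels,
    # and the split search can route the "MISSING" level to whichever branch fits best.
    x_with_missing_level = x.fillna("MISSING")
    print(x_with_missing_level.value_counts())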

52
Surrogate Splits
  • Surrogate splits can be used to handle missing
    values (BFOS 1984). A surrogate split is a
    partition using a different input that mimics the
    selected split.
  • A perfect surrogate maps all the cases that are
    in the same node of the primary split to the same
    node of the surrogate split.
  • The agreement between two splits can be measured as the proportion of cases that are sent to the same branch. The split with the greatest agreement is taken as the best surrogate (see the small example below).
  • The surrogates in SAS EM are used for scoring new
    cases, not for fitting the training data. Missing
    values on the training data are treated as a new
    input level.
  • If a new case has a missing value on the
    splitting variable, then the best surrogate is
    used to classify the case.
  • If the surrogate variable is missing as well,
    then the second best surrogate is used.
  • If the new case has a missing value on all the
    surrogates, it is sent to the branch that
    contains the missing values of the training data.
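
Agreement between a primary split and a candidate surrogate is simply the fraction of cases that the two splits route to the same branch; the cross-tabulated counts shown on the next slide give the quoted 76%.

    # Counts from the surrogate-split figure: primary X1 < 38.5 vs. surrogate X10 < 41.5.
    same_branch = 354 + 454        # cases both splits send to the same branch
    different_branch = 12 + 244    # cases the two splits route differently
    n = same_branch + different_branch
    print(round(same_branch / n, 2))   # 0.76 -> agreement of 76%

    # With raw data the same quantity would be, e.g.,
    # np.mean((X[:, 0] < 38.5) == (X[:, 9] < 41.5)) for a NumPy input matrix X
    # whose first and tenth columns hold X1 and X10 (a hypothetical layout).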

53
Surrogate Splits
[Figure: the digit cases in the X1-by-X10 space, comparing the primary split X1 < 38.5 (yes/no) with the candidate surrogate X10 < 41.5 (yes/no). The two splits route 354 + 454 of the 1064 cases to the same branch and 12 + 244 to different branches, for an agreement of 76%.]
54
Variable Importance
  • Developed by BFOS in 1984; useful for tree interpretation.
  • The importance of the jth input is a weighted average of the reduction in impurity for the surrogate splits using that input across all the internal nodes in the tree.
55
3
Recursive Partitioning
2.1 Recursive Partitioning
2.2 Split Selection Bias
2.3 Regression Diagnostics
56
Split Selection Bias
Worth of the best split found for each input (Inv, Branch), by criterion and p-value adjustment; the number of branches chosen is in parentheses:

                  ΔGini        Logworth (No Kass)   Logworth (Kass Before)   Logworth (Kass After)
<= 2-Way Split
  Inv             .0030 (2)    8.10 (2)             7.62 (2)                 7.62 (2)
  Branch          .0043 (2)    11.32 (2)            5.90 (2)                 5.90 (2)
<= 4-Way Split
  Inv             .0042 (3)    10.12 (3)            10.12 (3)                10.12 (3)
  Branch          .0059 (4)    13.51 (3)            6.50 (4)                 6.05 (3)
<= 19-Way Split
  Inv             .0042 (3)    10.12 (3)            10.12 (3)                10.12 (3)
  Branch          .0062 (19)   13.51 (3)            7.14 (19)                6.05 (3)
57
Interval Targets
[Figure: interval-target example using the Boston housing variables NOX, RM, and the target MEDV.]
58
Impurity Reduction
[Diagram: a parent node with impurity i(0) and size n0 split into children with impurities i(1), ..., i(4) and sizes n1, ..., n4; the impurity reduction is i(0) minus the size-weighted average of the child impurities.]
59
Variance Reduction
[Figure: the binary split RM < 6.94 (yes/no branches), used to illustrate variance reduction for an interval target.]
60
One-Way ANOVA
61
Heteroscedasticity
62
3
2. Recursive Partitioning
2.1 Recursive Partitioning
2.2 Split Selection Bias
2.3 Regression Diagnostics
63
The SAS EM Model
64
Configure the SAS EM Model
65
Diagnose model residuals
66
Diagnose model residuals
  • Option B Use the SAS Code node to enhance and
    register diagnostic output.
  • Select the upper SAS Code node. In the Training
    section of the Properties panel, select SAS Code.
    Right-click in the Editor window and select
    Open. Select the EX2.2a.sas program.
  • Select OK to exit the SAS Code node. Run the node
    and view results.
  • Copy the plot location from the Output window to a Windows Explorer Address field. (Or select Start -> Run from the Windows toolbar and copy this location into the Open field.)
  • Close the HTML output and SAS Code results
    windows.

67
Results