Title: Decision Trees
1. Decision Trees
- Definition
- Mechanism
- Splitting Function
- Issues in Decision-Tree Learning
- Avoiding overfitting through pruning
- Numeric and missing attributes
2. Example of a Decision Tree
Example: learning to classify stars.
[Figure: an example decision tree. The root tests Luminosity against a threshold T1; one branch leads to a test on Mass against a threshold T2; the three leaves assign the star types Type A, Type B, and Type C.]
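Read as code, the tree is just a pair of nested threshold tests. The sketch below is only an illustration: the threshold values and the assignment of star types to branches are assumptions, since the slide does not fix them.

```python
# Hypothetical thresholds; in practice T1 and T2 are chosen by the learning algorithm.
T1, T2 = 1.5, 0.8

def classify_star(luminosity: float, mass: float) -> str:
    """Walk the example tree: test luminosity first, then (on one branch) mass."""
    if luminosity > T1:
        # Assumed branch: high-luminosity stars are further split on mass.
        return "Type B" if mass > T2 else "Type A"
    # Assumed branch: low-luminosity stars go straight to a leaf.
    return "Type C"

print(classify_star(luminosity=2.0, mass=1.1))  # -> "Type B" under these assumptions
```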
3Short vs Long Hypotheses
We mentioned a top-down, greedy approach to
constructing decision trees denotes a preference
of short hypotheses over long hypotheses. Why is
this the right thing to do?
Occams Razor Prefer the simplest hypothesis
that fits the data.
Back since William of Occam (1320). Great debate
in the philosophy of science.
4. Issues in Decision Tree Learning
- Practical issues while building a decision tree can be enumerated as follows:
- How deep should the tree be?
- How do we handle continuous attributes?
- What is a good splitting function?
- What happens when attribute values are missing?
- How do we improve the computational efficiency?
5. How deep should the tree be? Overfitting the Data
A tree overfits the data if we let it grow deep enough that it begins to capture aberrations in the data that harm its predictive power on unseen examples.
[Figure: a scatter plot over the attributes humidity and size; thresholds t2 and t3 carve out a few examples that are possibly just noise, but the tree is grown larger to capture them.]
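As an illustration of this effect (not part of the slides), the sketch below, assuming scikit-learn is available, grows one unrestricted tree and one depth-limited tree on synthetic data with label noise; the dataset and noise level are invented for the demonstration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10% label noise standing in for "aberrations" in the data.
X, y = make_classification(n_samples=500, n_features=10, flip_y=0.10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for depth in (None, 3):  # None = grow the tree until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    print(f"max_depth={depth}: train={tree.score(X_tr, y_tr):.2f}  test={tree.score(X_te, y_te):.2f}")

# Typically the unrestricted tree reaches ~1.0 training accuracy but lower test accuracy
# than the shallow tree -- the signature of overfitting.
```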
6. Overfitting the Data: Definition
Assume a hypothesis space H. We say a hypothesis h in H overfits a dataset D if there is another hypothesis h' in H where h has better classification accuracy than h' on D, but worse classification accuracy than h' on unseen (testing) examples.
[Figure: classification accuracy (0.5 to 1.0) as a function of the size of the tree; accuracy on the training data keeps increasing while accuracy on the testing data eventually drops as the tree overfits.]
7. Causes for Overfitting the Data
- What causes a hypothesis to overfit the data?
- Random errors or noise: examples have incorrect class labels or incorrect attribute values.
- Coincidental patterns: by chance, examples seem to deviate from a pattern due to the small size of the sample.
- Overfitting is a serious problem that can cause strong performance degradation.
8. Solutions for Overfitting the Data
- There are two main classes of solutions:
- 1) Stop growing the tree early, before it begins to overfit the data. In practice this solution is hard to implement because it is not clear what a good stopping point is.
- 2) Grow the tree until the algorithm stops, even if the overfitting problem shows up, and then prune the tree as a post-processing step.
- The second method has found great popularity in the machine-learning community.
9. Decision Tree Pruning
1) Grow the tree to learn the training data.
2) Prune the tree to avoid overfitting the data.
10. Methods to Validate the New Tree
- Training and Validation Set Approach
- Divide dataset D into a training set TR and a validation set TE.
- Build a decision tree on TR.
- Test pruned trees on TE to decide the best final tree.
[Diagram: Dataset D is split into Training TR and Validation TE.]
11. Training and Validation
[Diagram: Dataset D is split into Training TR (normally 2/3 of D) and Validation TE (normally 1/3 of D).]
- There are two approaches
- Reduced Error Pruning
- Rule Post-Pruning
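A minimal sketch of this split, assuming scikit-learn; the 2/3 vs. 1/3 proportions come from the slide, while the placeholder dataset is only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)   # placeholder dataset standing in for D
# Hold out roughly 1/3 of D as the validation set TE; the rest is the training set TR.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=0)
print(len(X_tr), len(X_te))         # about 2/3 of D for training, 1/3 for validation
```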
12. Reduced Error Pruning
- Main Idea
- 1) Consider all internal nodes in the tree.
- 2) For each node, check whether removing it (along with the subtree below it) and assigning the most common class to it does not harm accuracy on the validation set.
- 3) Pick the node n that yields the best performance and prune its subtree.
- 4) Go back to (2) until no more improvements are possible (a sketch of this loop follows below).
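The sketch below is one way to realize this greedy loop; the `Node` representation, the helper names, and the tie-breaking choices are assumptions made for the illustration, not the slides' implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class Node:
    majority_class: int                      # most common class among training examples at this node
    test: Optional[Callable] = None          # maps an example to a child index; None at a leaf
    children: List["Node"] = field(default_factory=list)
    pruned: bool = False                     # True once the node has been turned into a leaf

    def predict(self, x):
        if self.pruned or not self.children:
            return self.majority_class
        return self.children[self.test(x)].predict(x)

def accuracy(root: Node, data) -> float:
    return sum(root.predict(x) == y for x, y in data) / len(data)

def internal_nodes(node: Node):
    if node.children and not node.pruned:
        yield node
        for child in node.children:
            yield from internal_nodes(child)

def reduced_error_prune(root: Node, validation) -> Node:
    """Repeatedly prune the internal node whose removal best preserves validation accuracy."""
    while True:
        base = accuracy(root, validation)
        best, best_acc = None, base
        for node in list(internal_nodes(root)):
            node.pruned = True               # tentatively replace the subtree by its majority class
            acc = accuracy(root, validation)
            if acc >= best_acc:              # pruning must not harm accuracy on the validation set
                best, best_acc = node, acc
            node.pruned = False
        if best is None:                     # no prune preserves accuracy: stop
            return root
        best.pruned = True                   # commit the best prune and repeat
```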
13. Example
[Figure: the original tree and the possible trees obtained after the first pruning step.]
14. Example
[Figure: the pruned tree and the possible trees obtained after a second pruning step.]
15. Example
The process continues until no improvement is observed on the validation set; at that point we stop pruning the tree.
[Figure: accuracy on the validation data (0.5 to 1.0) as a function of the size of the tree, marking the point where pruning stops.]
16. Reduced Error Pruning
- Disadvantages
- If the original dataset is small, setting examples aside for validation may leave you with few examples for training.
[Diagram: a small dataset D split into Training TR and Validation TE; both resulting sets are too small.]
17. Rule Post-Pruning
- Main Idea
- 1) Convert the tree into a rule-based system (one rule per root-to-leaf path).
- 2) Prune every single rule by removing redundant conditions.
- 3) Sort the rules by accuracy.
18. Example
[Figure: original tree with root test x1; one branch leads to a test on x2 with leaves Class A and Class B, the other branch leads to a test on x3 with leaves Class A and Class C.]
Rules extracted from the tree (one per leaf):
x1 ∧ x2 → Class A
x1 ∧ ¬x2 → Class B
¬x1 ∧ x3 → Class A
¬x1 ∧ ¬x3 → Class C
Possible rules after pruning (based on the validation set):
x1 → Class A
x1 ∧ ¬x2 → Class B
x3 → Class A
¬x1 ∧ ¬x3 → Class C
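The condition-dropping step (step 2 above) can be sketched as follows; the rule representation (a list of attribute-value conditions plus a class label) and the helper names are assumptions made for this illustration.

```python
def matches(conditions, example) -> bool:
    """conditions is a list of (attribute, value) pairs, e.g. [("x1", 1), ("x2", 0)]."""
    return all(example.get(attr) == val for attr, val in conditions)

def rule_accuracy(conditions, label, validation) -> float:
    covered = [(x, y) for x, y in validation if matches(conditions, x)]
    if not covered:
        return 0.0
    return sum(y == label for _, y in covered) / len(covered)

def prune_rule(conditions, label, validation):
    """Greedily drop conditions as long as validation accuracy does not decrease."""
    conditions = list(conditions)
    improved = True
    while improved and conditions:
        improved = False
        current = rule_accuracy(conditions, label, validation)
        for cond in list(conditions):
            shorter = [c for c in conditions if c != cond]
            if rule_accuracy(shorter, label, validation) >= current:
                conditions, improved = shorter, True
                break
    return conditions, label

# Example: the first rule, x1 AND x2 -> Class A, might be pruned to x1 -> Class A
# if dropping the x2 condition does not hurt accuracy on the validation set.
# After pruning every rule this way, sort the rules by accuracy (step 3).
```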
19. Advantages of Rule Post-Pruning
- The rule language is more expressive.
- It improves interpretability.
- Pruning is more flexible, since conditions can be dropped individually.
- In practice this method yields high predictive accuracy.
20. Decision Trees
- Definition
- Mechanism
- Splitting Functions
- Issues in Decision-Tree Learning
- Avoiding overfitting through pruning
- Numeric and missing attributes
21. Discretizing Continuous Attributes
Example: attribute temperature.
1) Order all values in the training set.
2) Consider only those cut points where there is a change of class.
3) Choose the cut point that maximizes information gain.
Sorted temperature values: 97, 97.5, 97.6, 97.8, 98.5, 99.0, 99.2, 100, 102.2, 102.6, 103.2
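A sketch of this procedure on the temperature values above; the class labels paired with the temperatures are invented here purely to make the example runnable.

```python
import math
from collections import Counter

def entropy(labels) -> float:
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def best_cut_point(values, labels):
    """Order the values, consider only boundaries where the class changes,
    and return the cut point with the highest information gain."""
    pairs = sorted(zip(values, labels))
    base = entropy([y for _, y in pairs])
    best_gain, best_cut = -1.0, None
    for i in range(1, len(pairs)):
        if pairs[i - 1][1] == pairs[i][1]:
            continue                                   # no class change: not a candidate cut
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2      # midpoint between the two values
        left = [y for v, y in pairs if v <= cut]
        right = [y for v, y in pairs if v > cut]
        gain = base - (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if gain > best_gain:
            best_gain, best_cut = gain, cut
    return best_cut, best_gain

temps = [97, 97.5, 97.6, 97.8, 98.5, 99.0, 99.2, 100, 102.2, 102.6, 103.2]
labels = ["no", "no", "no", "no", "no", "yes", "yes", "yes", "yes", "yes", "yes"]  # invented labels
print(best_cut_point(temps, labels))
```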
22Claude Shannon
1916 2001 Funded information theory on 1948
with his paper A Mathematical Theory of
Communication Awarded the Alfred Noble American
Institute of American Engineers Award for his
masters thesis. Worked at MIT, Bell Labs. Met
with Alan Turing, Marvin Minsky, John von
Neumann, and Albert Einstein. Creator of the
Ultimate Machine.
23. Missing Attribute Values
Example: X = (luminosity > T1, mass = ?)
- We are at a node n in the decision tree.
- Different approaches:
- Assign the most common value for that attribute in node n (see the sketch after this list).
- Assign the most common value in n among examples with the same classification as X.
- Assign a probability to each value of the attribute based on the frequency of those values in node n; each fraction is propagated down the tree.
24. Summary
- Decision-tree induction is a popular approach to classification that enables us to interpret the output hypothesis.
- The hypothesis space is very powerful: all possible DNF formulas.
- We prefer shorter trees over larger trees.
- Overfitting is an important issue in decision-tree induction.
- Different methods exist to avoid overfitting, such as reduced-error pruning and rule post-pruning.
- Techniques exist to deal with continuous attributes and missing attribute values.