Title: Decision tree
1. Decision tree
2. Outline
- Basic concepts
- Main issues
- Advanced topics
3. Basic concepts
4. A classification problem

District | House type    | Income | Previous customer | Outcome
---------|---------------|--------|-------------------|--------
Suburban | Detached      | High   | No                | Nothing
Suburban | Semi-detached | High   | Yes               | Respond
Rural    | Semi-detached | Low    | No                | Respond
Urban    | Detached      | Low    | Yes               | Nothing
5. Classification and estimation problems
- Given
  - a finite set of (input) attributes (features)
    - Ex: District, House type, Income, Previous customer
  - a target attribute (the goal)
    - Ex: Outcome ∈ {Nothing, Respond}
  - training data: a set of classified examples in attribute-value representation
    - Ex: the previous table
- Predict the value of the goal given the values of the input attributes
  - The goal is a discrete variable → a classification problem
  - The goal is a continuous variable → an estimation problem
6. Decision tree
7. Decision tree representation
- Each internal node is a test
  - Theoretically, a node can test multiple attributes
  - In most systems, a node tests exactly one attribute
- Each branch corresponds to a test result
  - A branch corresponds to an attribute value or a range of attribute values
- Each leaf node assigns
  - a class (decision tree)
  - a real value (regression tree)
8. What's a "best" decision tree?
- "Best" requires a bias (e.g., prefer the smallest tree): least depth? Fewest nodes? Which trees are the best predictors of unseen data?
- Occam's Razor: prefer the simplest hypothesis that fits the data.
- → Find a decision tree that is as small as possible and fits the data.
9. Finding a smallest decision tree
- A decision tree can represent any discrete function of the inputs: y = f(x1, x2, ..., xn).
- The space of decision trees is too big for a systematic search for a smallest decision tree.
- Solution: a greedy algorithm.
10. Basic algorithm: top-down induction
- Find the best decision attribute, A, and assign A as the decision attribute for the node.
- For each value of A, create a new branch, and divide up the training examples.
- Repeat the process until the gain is small enough (see the sketch below).
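A minimal Python sketch of this greedy, top-down procedure, assuming each example is a dict mapping attribute names to values; it uses information gain (defined on the following slides) as the quality measure. This is an illustration of the idea, not any system's exact implementation:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(examples, attr, target):
    """Expected reduction in entropy from splitting on attr."""
    before = entropy([ex[target] for ex in examples])
    n = len(examples)
    after = 0.0
    for value in {ex[attr] for ex in examples}:
        subset = [ex[target] for ex in examples if ex[attr] == value]
        after += len(subset) / n * entropy(subset)
    return before - after

def id3(examples, attrs, target, min_gain=1e-6):
    """Greedy top-down induction: pick the best attribute, branch on its
    values, and recurse; stop when the node is pure, no attributes remain,
    or the best gain is too small."""
    labels = [ex[target] for ex in examples]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attrs:
        return majority
    best = max(attrs, key=lambda a: info_gain(examples, a, target))
    if info_gain(examples, best, target) < min_gain:
        return majority
    rest = [a for a in attrs if a != best]
    return {best: {v: id3([ex for ex in examples if ex[best] == v], rest, target)
                   for v in {ex[best] for ex in examples}}}
```

With the table from slide 4 as a list of dicts and target="Outcome", this returns a nested dict of the form {attribute: {value: subtree-or-label}}.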
11. Major issues
12. Major issues
- Q1: Choosing the best attribute: what quality measure to use?
- Q2: Determining when to stop splitting: avoiding overfitting
- Q3: Handling continuous attributes
- Q4: Handling training data with missing attribute values
- Q5: Handling attributes with different costs
- Q6: Dealing with a continuous goal attribute
13. Q1: What quality measure?
- Information gain
- Gain ratio
14. Entropy of a training set
- S is a sample of training examples.
- Entropy is one way of measuring the impurity of S:
  Entropy(S) = -Σ_c p_c log2(p_c)
- p_c is the proportion of examples in S whose target attribute has value c.
15. Information gain
- Gain(S, A) = the expected reduction in entropy due to sorting on A:
  Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)
- Choose the A with the maximum information gain
  (equivalently, the A with the minimum expected entropy after the split).
16. An example
- Entropy of the whole training set: Entropy(S) = 0.940.
- Splitting on Humidity gives two branches of 7 examples each, with entropies 0.985 and 0.592:
  InfoGain(S, Humidity) = 0.940 - (7/14)·0.985 - (7/14)·0.592 = 0.151
- Splitting on Wind gives branches of 8 and 6 examples, with entropies 0.811 and 1.00:
  InfoGain(S, Wind) = 0.940 - (8/14)·0.811 - (6/14)·1.00 = 0.048
- Humidity has the higher gain, so it is the better attribute here.
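As a check, the arithmetic above can be reproduced directly from class counts. The counts below are assumed from the classic PlayTennis data (9 positive, 5 negative examples overall), which is consistent with every entropy shown on this slide:

```python
import math

def H(p, n):
    """Entropy of a node containing p positive and n negative examples."""
    tot = p + n
    return -sum(x / tot * math.log2(x / tot) for x in (p, n) if x)

print(round(H(9, 5), 3))                                # 0.94
print(round(H(9, 5) - 7/14*H(3, 4) - 7/14*H(6, 1), 3))  # Humidity: 0.152
print(round(H(9, 5) - 8/14*H(6, 2) - 6/14*H(3, 3), 3))  # Wind: 0.048
```

The exact Humidity gain is 0.1518; the slide's 0.151 comes from using the already-rounded branch entropies.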
17. Other quality measures
- Problem with information gain: it prefers attributes with many values.
- An alternative: gain ratio
  GainRatio(S, A) = Gain(S, A) / SplitInfo(S, A)
  SplitInfo(S, A) = -Σ_i (|S_i| / |S|) log2(|S_i| / |S|)
- where S_i is the subset of S for which A has value v_i (see the sketch below).
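A small illustration of why the denominator helps, assuming a set of 14 training examples: SplitInfo is the entropy of the partition itself, so it grows with the number of branches and penalizes many-valued attributes:

```python
import math

def split_info(subset_sizes):
    """SplitInfo(S, A): entropy of the partition of S induced by attribute A."""
    n = sum(subset_sizes)
    return -sum(s / n * math.log2(s / n) for s in subset_sizes if s)

# A binary attribute splitting 14 examples 7/7 gives SplitInfo = 1.0, while an
# attribute with a distinct value per example (e.g., an ID) gives log2(14),
# so its GainRatio = Gain / SplitInfo is heavily penalized.
print(split_info([7, 7]))     # 1.0
print(split_info([1] * 14))   # 3.807...
```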
18. Q2: Avoiding overfitting
- Overfitting occurs when the decision tree characterizes too much detail, or noise, in the training data.
- Consider the error of hypothesis h over
  - the training data: ErrorTrain(h)
  - the entire distribution D of the data: ErrorD(h)
- A hypothesis h overfits the training data if there is an alternative hypothesis h' such that
  - ErrorTrain(h) < ErrorTrain(h'), and
  - ErrorD(h) > ErrorD(h')
19. How to avoid overfitting
- Stop growing the tree earlier
  - Ex: InfoGain < threshold
  - Ex: the number of examples in a node < threshold
- Grow the full tree, then post-prune
- → In practice, the latter works better than the former.
20. Post-pruning
- Split the data into a training set and a validation set.
- Do until further pruning is harmful:
  - Evaluate the impact on the validation set of pruning each possible node (plus those below it).
  - Greedily remove the ones that don't improve performance on the validation set.
- Produces a smaller tree with the best performance measure (see the sketch below).
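A minimal sketch of this procedure (reduced-error pruning), assuming the dict-based examples from earlier with "Outcome" as the target key and a simple Node class; each candidate prune replaces an internal node by its training-set majority class and is kept only if validation accuracy does not drop:

```python
class Node:
    """Internal node (attr, branches) or leaf (label); majority is the most
    common training class at this node, used as the prediction when pruning."""
    def __init__(self, attr=None, branches=None, label=None, majority=None):
        self.attr, self.branches = attr, branches or {}
        self.label, self.majority = label, majority

def classify(node, example):
    while node.label is None:
        child = node.branches.get(example[node.attr])
        if child is None:                       # unseen attribute value
            return node.majority
        node = child
    return node.label

def accuracy(root, validation):                 # "Outcome" is the assumed target key
    return sum(classify(root, ex) == ex["Outcome"] for ex in validation) / len(validation)

def internal_nodes(node):
    if node.label is None:
        yield node
        for child in node.branches.values():
            yield from internal_nodes(child)

def post_prune(root, validation):
    """Greedily turn internal nodes into leaves while validation accuracy
    does not decrease; returns the (smaller) pruned tree."""
    best = accuracy(root, validation)
    while True:
        target = None
        for node in list(internal_nodes(root)):
            saved = (node.attr, node.branches, node.label)
            node.attr, node.branches, node.label = None, {}, node.majority  # trial prune
            acc = accuracy(root, validation)
            node.attr, node.branches, node.label = saved                    # undo
            if acc >= best:
                target, best = node, acc
        if target is None:
            return root
        target.attr, target.branches, target.label = None, {}, target.majority
```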
21. Performance measures
- Accuracy
  - on validation data
  - K-fold cross-validation
- Misclassification cost: sometimes more accuracy is desired for some classes than for others.
- MDL: minimize size(tree) + errors(tree).
22. Rule post-pruning
- Convert the tree to an equivalent set of rules.
- Prune each rule independently of the others.
- Sort the final rules into the desired sequence for use.
- Perhaps the most frequently used method (e.g., in C4.5).
23. Q3: Handling numeric attributes
- Continuous attribute → discrete attribute
- Example
  - Original attribute: Temperature = 82.5
  - New attribute: (Temperature > 72.3) ∈ {t, f}
- → Question: how to choose the thresholds?
24. Choosing thresholds for a continuous attribute
- Sort the examples according to the continuous attribute.
- Identify adjacent examples that differ in their target classification → a set of candidate thresholds.
- Choose the candidate with the highest information gain (see the sketch below).
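A minimal sketch, assuming a list of (value, class) pairs for one continuous attribute; candidate thresholds are midpoints between adjacent sorted values whose classes differ, and each candidate would then be scored by information gain as on the earlier slides:

```python
def candidate_thresholds(pairs):
    """Midpoints between adjacent sorted values that differ in class."""
    pairs = sorted(pairs)
    return [(v1 + v2) / 2
            for (v1, c1), (v2, c2) in zip(pairs, pairs[1:])
            if c1 != c2 and v1 != v2]

# Hypothetical Temperature readings with their target classes:
data = [(64, "yes"), (65, "no"), (68, "yes"), (69, "yes"),
        (70, "yes"), (71, "no"), (72, "no"), (75, "yes")]
print(candidate_thresholds(data))   # [64.5, 66.5, 70.5, 73.5]
```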
25. Q4: Unknown attribute values
Possible approaches:
- Assume the attribute can take the value "blank".
- Assign the most common value of A among the training data at node n.
- Assign the most common value of A among the training data at node n that have the same target class.
- Assign a probability p_i to each possible value v_i of A:
  - assign a fraction (p_i) of the example to each descendant in the tree (see the sketch below).
  - This method is used in C4.5.
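A minimal sketch of this fractional-count idea, assuming examples are dicts in which None marks a missing value and each example carries a weight (initially 1.0); this simplifies C4.5's actual bookkeeping:

```python
from collections import Counter

def split_with_fractions(weighted_examples, attr):
    """Split (example, weight) pairs on attr; an example whose value is
    missing goes down every branch with weight * p_i."""
    known = Counter(ex[attr] for ex, _ in weighted_examples if ex[attr] is not None)
    total = sum(known.values())
    probs = {v: c / total for v, c in known.items()}   # p_i for each value v_i
    branches = {v: [] for v in probs}
    for ex, w in weighted_examples:
        if ex[attr] is None:
            for v, p in probs.items():
                branches[v].append((ex, w * p))        # fractional example
        else:
            branches[ex[attr]].append((ex, w))
    return branches
```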
26. Q5: Attributes with costs
- Consider medical diagnosis: each test (e.g., a blood test) has a cost.
- Question: how to learn a consistent tree with low expected cost?
- One approach: replace Gain(S, A) with Gain²(S, A) / Cost(A) (Tan and Schlimmer, 1990).
27. Q6: Dealing with a continuous goal attribute → regression trees
- A variant of decision trees.
- Estimation problem: approximate real-valued functions, e.g., the crime rate.
- A leaf node is marked with a real value or a linear function, e.g., the mean of the target values of the examples at the node.
- Measure of impurity: e.g., variance or standard deviation (see the sketch below).
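A minimal sketch of the regression-tree impurity computation, assuming constant (mean) predictions in the leaves; a split is good if it reduces the variance of the target values:

```python
def variance(ys):
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys) / len(ys)

def variance_reduction(left, right):
    """Impurity drop when the target values are split into left and right."""
    ys = left + right
    n = len(ys)
    return variance(ys) - len(left) / n * variance(left) - len(right) / n * variance(right)

# Two well-separated clusters of target values (e.g., crime rates):
left, right = [1.0, 1.2], [4.8, 5.0]
print(variance_reduction(left, right))   # large drop; each leaf predicts its cluster's mean
```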
28. Summary of major issues
- Q1: Choosing the best attribute: different quality measures.
- Q2: Determining when to stop splitting: stop earlier, or post-prune.
- Q3: Handling continuous attributes: find the breakpoints.
29. Summary of major issues (cont.)
- Q4: Handling training data with missing attribute values: a "blank" value, the most common value, or fractional counts.
- Q5: Handling attributes with different costs: use a quality measure that includes the cost factor.
- Q6: Dealing with a continuous goal attribute: various ways of building regression trees.
30. Common algorithms
31. ID3
- Proposed by Quinlan (as is C4.5).
- Handles the basic case: discrete attributes, no missing information, etc.
- Information gain as the quality measure.
32. C4.5
- An extension of ID3
- Several quality measures
- Incomplete information (missing attribute values)
- Numerical (continuous) attributes
- Pruning of decision trees
- Rule derivation
- Random mode and batch mode
33. CART
- CART (Classification And Regression Trees)
- Proposed by Breiman et al. (1984)
- Constant numerical values in leaves
- Variance as the measure of impurity
34. Strengths of decision tree methods
- Ability to generate understandable rules
- Ease of calculation at classification time
- Ability to handle both continuous and categorical variables
- Ability to clearly indicate the best attributes
35. Weaknesses of decision tree methods
- Greedy algorithm: no global optimization.
- Error-prone with too many classes: the number of training examples per node shrinks quickly in a tree with many levels/branches.
- Expensive to train: sorting, combinations of attributes, calculating quality measures, etc.
- Trouble with non-rectangular regions: the rectangular classification boxes may not correspond well with the actual distribution of records in the decision space.
36. Advanced topics
37. Combining multiple models
- Top-down decision tree induction is inherently unstable: different training datasets from a given problem domain will produce quite different trees.
- Techniques
  - Bagging
  - Boosting
38. Bagging
- Introduced by Breiman.
- It first creates multiple decision trees, each trained on a different bootstrap sample of the training set.
- It then combines the trees by voting: each tree predicts, and the majority prediction wins (see the sketch below).
- This addresses some of the instability problems inherent in plain ID3.
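A minimal sketch of bagging, assuming scikit-learn's DecisionTreeClassifier as the base learner (any tree inducer would do); each tree sees a bootstrap sample and the ensemble predicts by majority vote:

```python
import random
from collections import Counter
from sklearn.tree import DecisionTreeClassifier   # assumed base learner

def bagged_trees(X, y, n_trees=25, seed=0):
    """Train n_trees trees, each on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        trees.append(DecisionTreeClassifier().fit([X[i] for i in idx],
                                                  [y[i] for i in idx]))
    return trees

def bagged_predict(trees, x):
    """Combine the individual trees by majority vote."""
    votes = Counter(tree.predict([x])[0] for tree in trees)
    return votes.most_common(1)[0][0]
```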
39. Boosting
- Introduced by Freund and Schapire.
- It examines the instances that the current hypothesis classifies incorrectly and assigns them higher weights.
- These weights refocus the algorithm on the hard instances; the final prediction weights each hypothesis by how well it performs.
40. Summary
- Basic case
  - Discrete input attributes
  - Discrete goal attribute
  - No missing attribute values
  - The same cost for all tests and all kinds of misclassification
- Extended cases
  - Continuous attributes
  - A real-valued goal attribute
  - Some examples miss some attribute values
  - Some tests are more expensive than others
  - Some misclassifications are more serious than others
41. Summary (cont.)
- Basic algorithm
  - Greedy algorithm
  - Top-down induction
  - Bias for small trees
- Major issues
42. Uncovered issues
- Incremental decision tree induction?
- How can a decision relate to other decisions? What is the order in which to make the decisions (e.g., in POS tagging)?
- What is the difference between a decision tree and a decision list?