Title: Decision Tree Learning
Slide 1: Decision Tree Learning
- Machine Learning, T. Mitchell
- Chapter 3
Slide 2: Decision Trees
- One of the most widely used and practical methods for inductive inference
- Approximates discrete-valued functions (including disjunctions)
- Can be used for classification (most common) or regression problems
Slide 3: Decision Tree for PlayTennis
Slide 4: Decision Tree
- If (Outlook = Sunny AND Humidity = Normal) OR (Outlook = Overcast) OR (Outlook = Rain AND Wind = Weak) then YES
- A disjunction of conjunctions of constraints on attribute values
- Larger hypothesis space than Candidate-Elimination
Slide 5: Decision tree representation
- Each internal node corresponds to a test
- Each branch corresponds to a result of the test
- Each leaf node assigns a classification
- Once the tree is trained, a new instance is
classified by starting at the root and following
the path as dictated by the test results for this
instance.
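
As an illustration (not from the slides), the PlayTennis tree above can be stored as nested dictionaries and a new instance classified by following the test results from the root; the dictionary layout is an assumption of this sketch.

    # Minimal sketch: classify an instance by walking a stored decision tree.
    tree = {"Outlook": {"Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
                        "Overcast": "Yes",
                        "Rain": {"Wind": {"Weak": "Yes", "Strong": "No"}}}}

    def classify(node, instance):
        # A leaf is stored as a plain class label; an internal node as {attribute: {value: subtree}}.
        if not isinstance(node, dict):
            return node
        attribute = next(iter(node))              # the test at this node
        return classify(node[attribute][instance[attribute]], instance)

    print(classify(tree, {"Outlook": "Rain", "Humidity": "High", "Wind": "Weak"}))   # Yes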
Slide 6: Decision Regions
Slide 7: Divide and Conquer
- Internal decision nodes
- Univariate: uses a single attribute, xi
- Discrete xi: n-way split for n possible values
- Continuous xi: binary split on xi > wm
- Multivariate: uses more than one attribute
- Leaves
- Classification: class labels, or proportions
- Regression: numeric value r (average, or local fit)
- Learning is greedy: find the best split recursively (Breiman et al., 1984; Quinlan, 1986, 1993)
Slide 8:
- If the decisions are binary, then in the best case each decision eliminates half of the regions (leaves).
- If there are b regions, the correct region can be found in log2(b) decisions, in the best case.
Slide 9: Multivariate Trees
Slide 10: Expressiveness
- A decision tree can represent a disjunction of conjunctions of constraints on the attribute values of instances.
- Each path corresponds to a conjunction
- The tree itself corresponds to a disjunction
- How expressive is this representation?
- How would we represent
- (A AND B) OR C
- M of N
- A XOR B
Slide 11: Decision tree learning algorithm
- For a given training set, there are many trees that code it without any error
- Finding the smallest tree is NP-complete (Quinlan, 1986), hence we are forced to use some (local) search algorithm to find reasonable solutions
Slide 12: The basic decision tree learning algorithm
- A decision tree can be constructed by considering attributes of instances one by one.
- Which attribute should be considered first?
- The height of a decision tree depends on the order in which attributes are considered.
Slide 13: Top-Down Induction of Decision Trees
Slide 14: (no transcript)
Slide 15: Entropy
- Measure of uncertainty
- Expected number of bits to resolve uncertainty
- Entropy measures the amount of information in a message
- High school form example
Slide 16: Entropy
- Important quantity in
- coding theory
- statistical physics
- machine learning
Slide 17: Entropy
- Coding theory: x is discrete with 8 possible states
- How many bits are needed to transmit the state of x?
- All states equally likely
Slide 18: (no transcript)
Slide 19:
- Entropy measures the impurity of S
- Entropy(S) = -p log2(p) - (1-p) log2(1-p)
- (Here p = p_positive and 1-p = p_negative from the previous slide)
Slide 20: Entropy
- Suppose Pr[X = 0] = 1/8
- If the other events are all equally likely, the number of events is 8.
- To indicate one out of so many events, one needs log2(8) = 3 bits.
- Consider a binary random variable X s.t. Pr[X = 0] = 0.1.
- The expected number of bits: -0.1 log2(0.1) - 0.9 log2(0.9) ≈ 0.47
- In general, if a random variable X has c values with probabilities p_1, ..., p_c
- The expected number of bits: H(X) = -Σ_i p_i log2(p_i)
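
As a quick numerical check of the two cases above, here is a small Python sketch (not from the slides); the entropy function simply implements the expected-number-of-bits formula.

    import math

    def entropy(probs):
        # Expected number of bits: H = -sum_i p_i * log2(p_i); zero-probability values contribute nothing.
        return -sum(p * math.log2(p) for p in probs if p > 0)

    # Eight equally likely states: log2(8) = 3 bits are needed.
    print(entropy([1 / 8] * 8))     # 3.0

    # Binary variable with Pr[X = 0] = 0.1: less than half a bit on average.
    print(entropy([0.1, 0.9]))      # about 0.47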
Slide 21: Entropy
- What if we have the following distribution for x?
- In order to save on transmission costs, we would design codes that reflect this distribution
Slide 22: Entropy
Slide 23: Use of Entropy in Choosing the Next Attribute
Slide 24: (no transcript)
Slide 25: Other measures of impurity
- Entropy is not the only measure of impurity. If a function satisfies certain criteria, it can be used as a measure of impurity.
- Gini index: 2p(1-p)
- Misclassification error: 1 - max(p, 1-p)
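
A small Python sketch (not from the slides) comparing the three impurity measures on a node with a fraction p of positive examples; all three are zero for pure nodes and largest at p = 0.5.

    import math

    def entropy_impurity(p):
        if p in (0.0, 1.0):
            return 0.0
        return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

    def gini_index(p):
        return 2 * p * (1 - p)

    def misclassification_error(p):
        return 1 - max(p, 1 - p)

    for p in (0.5, 0.8, 1.0):
        print(p, entropy_impurity(p), gini_index(p), misclassification_error(p))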
Slide 26: Training Examples
Slide 27: Selecting the Next Attribute
Slide 28: Selecting the Next Attribute
- Computing the information gain for each attribute, we selected the Outlook attribute as the first test, resulting in the following partially learned tree
Slide 29: Partially learned tree
Slide 30:
- Until stopped (the recursion is sketched in the code below):
- Select one of the unused attributes to partition the remaining examples at each non-terminal node
- using only the training samples associated with that node
- Stopping criteria:
- each leaf node contains examples of one type
- algorithm ran out of attributes
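
A minimal Python sketch of this recursive procedure (an illustration, not the slides' exact pseudocode), assuming examples are (attribute-dictionary, label) pairs and that information gain is used to pick the split; the two stopping criteria above appear as the base cases.

    import math
    from collections import Counter

    def entropy(labels):
        total = len(labels)
        return -sum(c / total * math.log2(c / total) for c in Counter(labels).values())

    def gain(examples, attribute):
        labels = [y for _, y in examples]
        remainder = 0.0
        for value in {x[attribute] for x, _ in examples}:
            subset = [y for x, y in examples if x[attribute] == value]
            remainder += len(subset) / len(examples) * entropy(subset)
        return entropy(labels) - remainder

    def grow_tree(examples, attributes):
        labels = [y for _, y in examples]
        if len(set(labels)) == 1:                     # leaf: all examples of one type
            return labels[0]
        if not attributes:                            # ran out of attributes: majority label
            return Counter(labels).most_common(1)[0][0]
        best = max(attributes, key=lambda a: gain(examples, a))
        children = {}
        for value in {x[best] for x, _ in examples}:  # partition on the chosen attribute
            subset = [(x, y) for x, y in examples if x[best] == value]
            children[value] = grow_tree(subset, [a for a in attributes if a != best])
        return {best: children}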
Slide 31: (no transcript)
Slide 32: Inductive Bias of ID3
Slide 33: Hypothesis Space Search by ID3
- Hypothesis space is complete
- every finite discrete function can be represented by a decision tree
- Outputs a single hypothesis (which one?)
- Can't play 20 questions...
- No backtracking
- Local minima...
- Statistically-based search choices
- Uses all available training samples
Slide 34:
- Note: H is the power set of instances X
- Unbiased?
- Preference for short trees, and for those with high-information-gain attributes near the root
- Bias is a preference for some hypotheses, rather than a restriction of the hypothesis space H
- Occam's razor: prefer the shortest hypothesis that fits the data
Slide 35: Occam's razor
- Prefer the shortest hypothesis that fits the data
- Occam, 1320
- While this idea is intuitive, it is difficult to prove formally.
- Support 1:
- Shorter hypotheses have better generalization ability
- Support 2:
- The number of short hypotheses is small, and therefore it is less likely to be a coincidence if the data fits a short hypothesis
- There may be counterarguments: there are other small sets of hypotheses, so why prefer short trees rather than those?
- Different internal representations may lead to hypotheses of different lengths
- We will consider an optimal encoding
Slide 36: Overfitting
Slide 37: Overfitting in Decision Trees
- Why overfitting?
- A model can become more complex than the true target function (concept) when it tries to satisfy noisy data as well.
- Definition of overfitting:
- A hypothesis is said to overfit the training data if there exists some other hypothesis that has larger error over the training data but smaller error over the entire distribution of instances.
Slide 38:
- Consider adding the following training example, which is incorrectly labeled as negative:
- Sky = Sunny, Temp = Hot, Humidity = Normal, Wind = Strong, PlayTennis = No
- Or consider the Oranges and Tangerines example with Size and Texture attributes, and the orange that is misclassified as a tangerine (I will add a figure later)
Slide 39: (no transcript)
Slide 40:
- ID3 will make a new split and will classify future examples following the new path as negative.
- The problem is due to overfitting the training data.
- Overfitting may result from:
- noise
- coincidental regularities in the training data
- What is the formal description of overfitting?
Slide 41: (no transcript)
Slide 42: Curse of Dimensionality - A Related Concept
- Imagine a learning task, such as recognizing printed characters.
- Intuitively, adding more attributes would help the learner, as more information never hurts, right?
- In fact, sometimes it does hurt, due to what is called the curse of dimensionality.
Slide 43: Curse of Dimensionality
Slide 44: Curse of Dimensionality
Polynomial curve fitting, M = 3
The number of independent coefficients grows proportionally to D^3, where D is the number of input variables. More generally, for an order-M polynomial it grows like D^M. The polynomial becomes unwieldy very quickly.
Slide 45: Polynomial Curve Fitting
Slide 46: Sum-of-Squares Error Function
Slide 47: 0th Order Polynomial
Slide 48: 1st Order Polynomial
Slide 49: 3rd Order Polynomial
Slide 50: 9th Order Polynomial
Slide 51: Over-fitting (Root-Mean-Square (RMS) Error)
Slide 52: Polynomial Coefficients
Slide 53: Data Set Size (9th Order Polynomial)
Slide 54: Data Set Size (9th Order Polynomial)
Slide 55: Regularization
- Penalize large coefficient values
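
A small numpy sketch (not from the slides) of the idea: fit a 9th-order polynomial to noisy samples of sin(2πx) by regularized least squares, where penalizing large coefficient values keeps the fit from oscillating wildly. The data and the λ value are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0, 1, 10)
    t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)   # noisy targets

    M = 9                                          # polynomial order
    Phi = np.vander(x, M + 1, increasing=True)     # design matrix of powers of x

    def fit(Phi, t, lam):
        # Regularized least squares: w = (Phi^T Phi + lam * I)^(-1) Phi^T t
        return np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ t)

    w_exact = np.linalg.solve(Phi, t)              # 9th-order polynomial that fits every point exactly
    w_regularized = fit(Phi, t, lam=1e-3)          # penalty on large coefficients tames them
    print(np.abs(w_exact).max(), np.abs(w_regularized).max())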
Slide 56: Regularization
Slide 57: Regularization
Slide 58: Regularization vs.
Slide 59: Polynomial Coefficients
Slide 60:
- Although the curse of dimensionality is an important issue, we can still find effective techniques applicable to high-dimensional spaces
- Real data will often be confined to a region of the space having lower effective dimensionality
- example of planar objects on a conveyor belt
- a 3-dimensional manifold within the high-dimensional picture pixel space
- Real data will typically exhibit smoothness properties
Slide 61: Back to Decision Trees
Slide 62: Overfitting in Decision Trees
Slide 63: Avoiding over-fitting the data
- How can we avoid overfitting? There are 2 approaches:
- stop growing the tree before it perfectly classifies the training data
- grow the full tree, then post-prune
- Reduced-error pruning
- Rule post-pruning
- The 2nd approach is found more useful in practice.
Slide 64:
- Whether we are pre- or post-pruning, the important question is how to select the best tree:
- Measure performance over a separate validation data set
- Measure performance over the training data
- apply a statistical test to see if expanding or pruning would produce an improvement beyond the training set (Quinlan, 1986)
- MDL: minimize size(tree) + size(misclassifications(tree))
Slide 65:
- MDL:
- length(h) + length(additional information to encode D given h)
- = length(h) + length(misclassifications)
- since we only need to send a message when a data sample is not in agreement with h; hence, only for the misclassifications.
Slide 66: Reduced-Error Pruning (Quinlan, 1987)
- Split data into training and validation sets
- Do until further pruning is harmful:
- 1. Evaluate the impact of pruning each possible node (plus those below it) on the validation set
- 2. Greedily remove the one that most improves validation set accuracy
- Produces the smallest version of the (most accurate) tree
- What if data is limited?
- We would not want to set aside a separate validation set.
Slide 67: Reduced-error pruning
- Examine each decision node to see if pruning it decreases the tree's performance over the evaluation data.
- Pruning here means replacing a subtree with a leaf labeled with the most common classification in the subtree (sketched in the code below).
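
A minimal Python sketch of this greedy loop (an illustration, not Quinlan's implementation), assuming a node format slightly richer than in the earlier sketches: each internal node also stores the most common training label among the examples that reached it, so a pruned subtree can be collapsed to that label.

    import copy

    # Assumed node format: {"attr": name, "children": {value: subtree}, "majority": label};
    # leaves are plain class labels.

    def classify(node, x):
        while isinstance(node, dict):
            child = node["children"].get(x[node["attr"]])
            if child is None:                       # unseen branch value: fall back to majority label
                return node["majority"]
            node = child
        return node

    def accuracy(tree, data):
        return sum(classify(tree, x) == y for x, y in data) / len(data)

    def internal_nodes(node, path=()):
        # Yield the path (sequence of branch values) leading to every internal decision node.
        if isinstance(node, dict):
            yield path
            for value, child in node["children"].items():
                yield from internal_nodes(child, path + (value,))

    def pruned_copy(tree, path):
        # Replace the subtree at `path` with a leaf carrying its majority training label.
        tree = copy.deepcopy(tree)
        if not path:
            return tree["majority"]
        node = tree
        for value in path[:-1]:
            node = node["children"][value]
        node["children"][path[-1]] = node["children"][path[-1]]["majority"]
        return tree

    def reduced_error_prune(tree, validation):
        # Greedily prune whichever node most improves validation accuracy; stop when pruning hurts.
        while isinstance(tree, dict):
            base = accuracy(tree, validation)
            best_path = max(internal_nodes(tree),
                            key=lambda p: accuracy(pruned_copy(tree, p), validation))
            if accuracy(pruned_copy(tree, best_path), validation) < base:
                break                               # further pruning is harmful
            tree = pruned_copy(tree, best_path)
        return tree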
Slide 68: Rule post-pruning
- Algorithm:
- Build a complete decision tree.
- Convert the tree to a set of rules, one per root-to-leaf path (see the sketch below).
- Prune each rule:
- Remove any precondition whose removal improves estimated accuracy.
- Sort the pruned rules by accuracy and use them in that order.
- Perhaps the most frequently used method (e.g., in C4.5)
- More details can be found at http://www2.cs.uregina.ca/hamilton/courses/831/notes/ml/dtrees/4_dtrees3.html
- (read only if interested; a presentation of advanced decision tree algorithms such as this may be added as part of a class project)
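
A small Python sketch of the conversion step only (pruning and sorting omitted), using the nested-dictionary tree format from the earlier sketches; the tree below is the PlayTennis tree and is included just to make the example runnable.

    def extract_rules(node, preconditions=()):
        # Each root-to-leaf path becomes one rule: a list of (attribute, value) tests plus a class label.
        if not isinstance(node, dict):
            return [(list(preconditions), node)]
        attribute = next(iter(node))
        rules = []
        for value, child in node[attribute].items():
            rules.extend(extract_rules(child, preconditions + ((attribute, value),)))
        return rules

    tree = {"Outlook": {"Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
                        "Overcast": "Yes",
                        "Rain": {"Wind": {"Weak": "Yes", "Strong": "No"}}}}

    for tests, label in extract_rules(tree):
        condition = " AND ".join(f"({a} = {v})" for a, v in tests)
        print(f"IF {condition} THEN PlayTennis = {label}")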
Slide 69: (no transcript)
Slide 70:
- IF (Outlook = Sunny) AND (Humidity = High)
- THEN PlayTennis = No
- IF (Outlook = Sunny) AND (Humidity = Normal)
- THEN PlayTennis = Yes
- ...
Slide 71: Rule Extraction from Trees
C4.5Rules (Quinlan, 1993)
Slide 72:
- Converting a decision tree to rules before pruning has three main advantages:
- Converting to rules allows distinguishing among the different contexts in which a decision node is used.
- Since each distinct path through the decision tree node produces a distinct rule, the pruning decision regarding that attribute test can be made differently for each path.
- In contrast, if the tree itself were pruned, the only two choices would be:
- Remove the decision node completely, or
- Retain it in its original form.
- Converting to rules removes the distinction between attribute tests that occur near the root of the tree and those that occur near the leaves.
- We thus avoid messy bookkeeping issues such as how to reorganize the tree if the root node is pruned while retaining part of the subtree below this test.
- Converting to rules improves readability.
- Rules are often easier for people to understand.
Slide 73: Rule Simplification Overview
- Eliminate unnecessary rule antecedents to simplify the rules.
- Construct contingency tables for each rule consisting of more than one antecedent.
- Rules with only one antecedent cannot be further simplified, so we only consider those with two or more.
- To simplify a rule, eliminate antecedents that have no effect on the conclusion reached by the rule.
- A conclusion's independence from an antecedent is verified using a test of independence (sketched in the code below), which is:
- a chi-square test if the expected cell frequencies are greater than 10.
- Yates' Correction for Continuity when the expected frequencies are between 5 and 10.
- Fisher's Exact Test for expected frequencies less than 5.
- Once individual rules have been simplified by eliminating redundant antecedents, simplify the entire set by eliminating unnecessary rules.
- Attempt to replace those rules that share the most common consequent by a default rule that is triggered when no other rule is triggered.
- In the event of a tie, use some heuristic tie breaker to choose a default rule.
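
A hedged Python sketch of that decision (assuming scipy is available; the contingency table is hypothetical): rows are antecedent satisfied / not satisfied, columns are the rule's class / other classes.

    from scipy.stats import chi2_contingency, fisher_exact

    table = [[20, 5],     # antecedent satisfied:     rule's class vs. other
             [6, 19]]     # antecedent not satisfied

    chi2, p_value, dof, expected = chi2_contingency(table, correction=False)
    if expected.min() > 10:
        print("plain chi-square test, p =", p_value)
    elif expected.min() >= 5:
        _, p_value, _, _ = chi2_contingency(table, correction=True)   # Yates' correction
        print("Yates-corrected chi-square, p =", p_value)
    else:
        _, p_value = fisher_exact(table)
        print("Fisher's exact test, p =", p_value)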
Slide 74: Continuous-Valued Attributes
- Create a discrete attribute to test a continuous one
- Temperature = 82.5
- (Temperature > 72.3) = t, f
- How to find the threshold?
- Temperature: 40 48 60 72 80 90
- PlayTennis:  No No Yes Yes Yes No
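
A small Python sketch of one common answer (candidate thresholds at midpoints where the class label changes, then pick the candidate with the highest information gain), using the six values above:

    import math

    temperature = [40, 48, 60, 72, 80, 90]
    play_tennis = ["No", "No", "Yes", "Yes", "Yes", "No"]

    def entropy(labels):
        total = len(labels)
        return -sum(labels.count(c) / total * math.log2(labels.count(c) / total)
                    for c in set(labels))

    def gain_for_threshold(values, labels, threshold):
        below = [y for v, y in zip(values, labels) if v <= threshold]
        above = [y for v, y in zip(values, labels) if v > threshold]
        remainder = (len(below) * entropy(below) + len(above) * entropy(above)) / len(labels)
        return entropy(labels) - remainder

    # Candidate thresholds: midpoints between adjacent values where the class changes.
    pairs = sorted(zip(temperature, play_tennis))
    candidates = [(pairs[i][0] + pairs[i + 1][0]) / 2
                  for i in range(len(pairs) - 1) if pairs[i][1] != pairs[i + 1][1]]
    print(candidates)                                                        # [54.0, 85.0]
    print(max(candidates, key=lambda t: gain_for_threshold(temperature, play_tennis, t)))  # 54.0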
Slide 75: Incorporating continuous-valued attributes
Continuous-valued attribute
Slide 76: Split Information?
- In each tree, the leaves contain samples of only one kind (e.g. 50+, 10+, 10-, etc.).
- Hence, the remaining entropy is 0 in each one.
- Which is better?
- In terms of information gain?
- In terms of gain ratio?
[Figure: two trees over the same 100 examples. Attribute A1 splits them into leaves of 50 positive and 50 negative examples; attribute A2 makes a many-way split into small pure leaves of 10 examples each (10 positive, 10 negative, ...).]
Slide 77: Attributes with Many Values
- One way to penalize such attributes is to use the following alternative measure:
- GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)
- SplitInformation(S, A) = -Σ_i (|S_i| / |S|) log2(|S_i| / |S|), i.e. the entropy of S with respect to the values of attribute A, determined experimentally from the training samples
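
A numerical sketch of how this measure separates the two trees on the previous slide (assuming, from the figure, that A1 produces two 50-example leaves and A2 ten 10-example leaves, and that both achieve a gain of 1 bit on a 50/50 class mix):

    import math

    def split_information(subset_sizes):
        total = sum(subset_sizes)
        return -sum(s / total * math.log2(s / total) for s in subset_sizes)

    def gain_ratio(gain, subset_sizes):
        return gain / split_information(subset_sizes)

    # Both splits leave pure leaves, so both have gain = 1 bit,
    # but A2's many-way split is penalized by its larger split information.
    print(gain_ratio(1.0, [50, 50]))      # A1: 1.0
    print(gain_ratio(1.0, [10] * 10))     # A2: about 0.30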
Slide 78: Handling training examples with missing attribute values
- What if an example x is missing the value of an attribute A?
- Simple solution:
- Use the most common value among examples at node n.
- Or use the most common value among examples at node n that have classification c(x).
- More complex, probabilistic approach:
- Assign a probability to each of the possible values of A based on the observed frequencies of the various values of A.
- Then propagate examples down the tree with these probabilities (see the sketch below).
- The same probabilities can be used in the classification of new instances (used in C4.5).
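
A tiny Python sketch of the probabilistic approach (the example counts are hypothetical): an example with a missing Humidity value is sent down each branch with a weight equal to the observed frequency of that value at the node.

    from collections import Counter

    def value_probabilities(examples_at_node, attribute):
        # Observed frequencies of each value of `attribute` among examples that reached this node.
        counts = Counter(x[attribute] for x, _ in examples_at_node if x.get(attribute) is not None)
        total = sum(counts.values())
        return {value: count / total for value, count in counts.items()}

    # Hypothetical node: 13 examples with Humidity known.
    examples = [({"Humidity": "High"}, "No")] * 7 + [({"Humidity": "Normal"}, "Yes")] * 6
    print(value_probabilities(examples, "Humidity"))
    # {'High': 0.538..., 'Normal': 0.461...}  -> an example missing Humidity counts as
    # about 0.54 of an example down the High branch and 0.46 down the Normal branch.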
Slide 79: Handling attributes with differing costs
- Sometimes some attribute values are more expensive or difficult to prepare.
- medical diagnosis: BloodTest has cost $150
- In practice, it may be desirable to postpone acquisition of such attribute values until they become necessary.
- To this purpose, one may modify the attribute selection measure to penalize expensive attributes.
- Tan and Schlimmer (1990)
- Nunez (1988)
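
The two cited cost-sensitive selection measures (as given in Mitchell's Chapter 3; the example numbers below are hypothetical), sketched in Python:

    def tan_schlimmer(gain, cost):
        # Gain^2(S, A) / Cost(A)                                    (Tan and Schlimmer, 1990)
        return gain ** 2 / cost

    def nunez(gain, cost, w=0.5):
        # (2^Gain(S, A) - 1) / (Cost(A) + 1)^w, w in [0, 1] weighting the cost  (Nunez, 1988)
        return (2 ** gain - 1) / (cost + 1) ** w

    # Hypothetical comparison: a cheap, mildly informative attribute vs. an expensive, informative one.
    print(tan_schlimmer(0.2, cost=1), tan_schlimmer(0.5, cost=150))
    print(nunez(0.2, cost=1), nunez(0.5, cost=150))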
Slide 80: C4.5
- By Ross Quinlan
- Latest code available at http://www.cse.unsw.edu.au/quinlan/
- How to use it?
- Download it
- Unpack it
- Make it (make all)
- Read the accompanying manual files
- groff -T ps c4.5.1 > c4.5.ps
- Use it:
- c4.5: tree generator
- c4.5rules: rule generator
- consult: use a generated tree to classify an instance
- consultr: use a generated set of rules to classify an instance
Slide 81: Model Selection in Trees
Slide 82: Strengths and Advantages of Decision Trees
- Rule extraction from trees
- A decision tree can be used for feature extraction (e.g. seeing which features are useful)
- Interpretability: human experts may verify and/or discover patterns
- It is a compact and fast classification method