Title: Decision Trees
1 COMP5318 Data Mining and Machine Learning, Lecture 3
- Decision Trees
- Reference: Witten and Frank pp. 89-97, 159-164
- Dunham pp. 97-103
2 Outline of the Lecture
- DT example
- Constructing DTs
- DTs Decision Boundary
- Avoiding Overfitting the Data
- Dealing with Numeric Attributes
- Alternative Measures for Selecting Attributes
- Dealing with Missing Attributes
- Handling Attributes with Different Costs
3 Decision Trees (DTs)
- DTs are supervised learners for classification
- most popular and well researched ML/DM technique
- divide-and-conquer approach
- developed in parallel in ML by Ross Quinlan (USyd) and in statistics by Breiman, Friedman, Olshen and Stone (Classification and Regression Trees, 1984)
- Quinlan has refined the DT algorithm over the years
- ID3, 1986
- C4.5, 1993
- C5.0 (See5 on Windows) - commercial version of C4.5 used in many DM packages
- In WEKA - id3 and j48
- Many other software implementations available on the web, free and commercial (USD 50 to 300,000 - A. Moore)
4 DT Example
- DT representation (model)
- each internal node tests an attribute
- each branch corresponds to an attribute value
- each leaf node assigns a class
DT for the tennis data
- A DT is a tree-structured plan for testing the values of a set of attributes in order to predict the output (A. Moore)
- Another interpretation
- Use all your data to build a tree of questions with answers at the leaves. To answer a new query, start from the tree root, answer the questions until you reach a leaf node and return its answer (a small sketch of this traversal follows below).
- What would be the prediction for
- outlook=sunny, temperature=cool, humidity=high, windy=true?
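A minimal Python sketch of this traversal, assuming trees are stored as nested dictionaries; the representation and the sample tennis tree below are illustrative, not the lecture's code:

```python
# Internal nodes are dicts {"attribute": ..., "branches": {value: subtree}};
# leaves are plain class labels. (Illustrative representation only.)

tennis_tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": {"attribute": "humidity",
                  "branches": {"high": "no", "normal": "yes"}},
        "overcast": "yes",
        "rainy": {"attribute": "windy",
                  "branches": {True: "no", False: "yes"}},
    },
}

def classify(tree, instance):
    """Walk from the root, following the branch matching the instance's
    value for the tested attribute, until a leaf is reached."""
    while isinstance(tree, dict):          # internal node: test an attribute
        value = instance[tree["attribute"]]
        tree = tree["branches"][value]
    return tree                            # leaf = class label

# The query from the slide: outlook=sunny, temperature=cool,
# humidity=high, windy=true  ->  "no"
print(classify(tennis_tree, {"outlook": "sunny", "temperature": "cool",
                             "humidity": "high", "windy": True}))
```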
5 Building a DT Example: First Node
Example created by Eric McCreath
6 Second Node
Example created by Eric McCreath
7 Final Tree
Wage > 22K
dependents < 6
Example created by Eric McCreath
8 Stopping Criteria Again
- In fact the real stopping criterion is more complex
- 4. Stop if, in the current sub-set,
- a) all instances have the same class => make a leaf node corresponding to this class, or else
- b) there are no remaining attributes that can create non-empty children (no attribute can distinguish) => make a leaf node and label it with the majority class
Think about when case b) will occur!
9 Constructing DTs (ID3 algorithm)
- Top-down, in recursive divide-and-conquer fashion (a code sketch follows after this list)
- 1. An attribute is selected for the root node and a branch is created for each possible attribute value
- 2. The instances are split into subsets (one for each branch extending from the node)
- 3. The procedure is repeated recursively for each branch, using only the instances that reach the branch
- 4. Stop if all instances have the same class; make a leaf node corresponding to this class
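A Python sketch of this recursive construction, under illustrative assumptions: examples are dicts of attribute -> value with labels kept in a parallel list, and `choose_attribute` is any attribute-selection function (ID3 uses information gain, defined later in the lecture):

```python
from collections import Counter

def build_tree(examples, labels, attributes, choose_attribute):
    # Step 4: stop if all instances have the same class -> leaf
    if len(set(labels)) == 1:
        return labels[0]
    # Extra stopping case (slide 8, case b): no attributes left -> majority leaf
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Step 1: select an attribute and create a branch per value
    best = choose_attribute(examples, labels, attributes)
    tree = {"attribute": best, "branches": {}}
    remaining = [a for a in attributes if a != best]
    # Step 2: split the instances into subsets, one per branch
    for value in set(ex[best] for ex in examples):
        subset = [(ex, y) for ex, y in zip(examples, labels) if ex[best] == value]
        sub_ex = [ex for ex, _ in subset]
        sub_y = [y for _, y in subset]
        # Step 3: recurse using only the instances that reach this branch
        tree["branches"][value] = build_tree(sub_ex, sub_y, remaining,
                                             choose_attribute)
    return tree
```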
10 Expressiveness of DTs
- DTs can be expressed as a disjunction of conjunctions
- Extract the rules for the tennis tree
- Assume that all attributes are Boolean and the class is Boolean. What is the class of Boolean functions that we can represent using DTs?
- Answer
- Proof
11 How to find the best attribute?
- Four DTs for the tennis data: which is the best choice?
- We need a measure of purity of each node, as the leaves with only one class (yes or no) will not have to be split further
- => at each step we can choose the attribute which produces the purest children nodes; such a measure of purity is called ...
12 Entropy
- Entropy (also called information content) measures the homogeneity of a set of examples; it characterizes the impurity (disorder, randomness) of a collection of examples wrt their classification: high entropy = high disorder
- It is a standard measure used in signal compression, information theory and physics
- Entropy of data set S: Entropy(S) = -Σ_i p_i log2(p_i)
- p_i - proportion of examples that belong to class i
- The smaller the entropy, the greater the purity of the set
- Tennis data - 9 yes & 5 no examples => the entropy of the tennis data set S relative to the classification is I(9/14, 5/14) = 0.940 bits (a small code sketch follows below)
- log to the base 2: information is measured in bits
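A short Python sketch of this definition; the 9-yes / 5-no tennis data gives about 0.940 bits:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy(S) = -sum_i p_i * log2(p_i) over the class proportions."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

print(entropy(["yes"] * 9 + ["no"] * 5))   # ~0.940
```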
13 Range of the Entropy for Binary Classification
- p - proportion of positive examples in the data set; Entropy(S) = -p log2(p) - (1-p) log2(1-p)
- 0 => all members of S belong to the same class (no disorder)
- 1 => equal number of yes & no (entropy is maximized; S is as disordered as it can be)
- in (0,1) if S contains an unequal number of yes & no
14 Another Interpretation of Entropy - Example (from http://www.cs.cmu.edu/~awm/tutorials)
- Suppose that X is a random variable with 4 possible values A, B, C and D, and P(X=A)=P(X=B)=P(X=C)=P(X=D)=1/4.
- You must transmit a sequence of X values over a serial binary channel. You can encode each symbol with 2 bits, e.g. A=00, B=01, C=10 and D=11
- ABBBCADBADCB
- 000101011000110100111001
- Now you are told that the probabilities are not equal, e.g. P(X=A)=1/2, P(X=B)=1/4, P(X=C)=P(X=D)=1/8.
- Can you invent a coding that uses less than 2 bits on average per symbol, e.g. 1.75 bits?
- A=0, B=10, C=110, D=111
- What is the smallest possible number of bits per symbol?
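With the variable-length code above (the standard prefix code for these probabilities), the expected code length matches the entropy of the distribution, which answers the question: 1.75 bits per symbol.

```latex
E[\text{bits per symbol}] = \tfrac{1}{2}(1) + \tfrac{1}{4}(2) + \tfrac{1}{8}(3) + \tfrac{1}{8}(3) = 1.75
\qquad
H(X) = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{4}\log_2\tfrac{1}{4} - 2\cdot\tfrac{1}{8}\log_2\tfrac{1}{8}
     = 0.5 + 0.5 + 0.75 = 1.75 \text{ bits}
```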
15 Another Interpretation of Entropy - General Case (from http://www.cs.cmu.edu/~awm/tutorials)
- Suppose that X is a random variable with m possible values V1..Vm and P(X=V1)=p1, P(X=V2)=p2, ..., P(X=Vm)=pm.
- What is the smallest possible number of bits per symbol on average needed to transmit a stream of symbols drawn from X's distribution? H(X) = -Σ_j p_j log2(p_j)
- High entropy: the values of X are all over the place
- The histogram of the frequency distribution of values of X will be flat
- Low entropy: the values of X are more predictable
- The histogram of the frequency distribution of values of X has many lows and one or two highs
16 Information Theory
- Information theory, Shannon and Weaver 1949
- Given a set of possible answers (messages) M = {m1, m2, ..., mn} and a probability P(mi) for the occurrence of each answer, the expected information content (entropy) of the actual answer is
- I(M) = -Σ_i P(mi) log2 P(mi)
- Shows the amount of surprise of the receiver at the answer, based on the probability of the answers (i.e. based on the prior knowledge of the receiver about the possible answers)
- The less the receiver knows, the more information is provided (the more informative the answer is)
17 Information Content - Examples
- Example 1: Flipping a coin
- Case 1: flipping an honest coin
- Case 2: flipping a rigged coin so that it will come up heads 75% of the time
- In which case will an answer telling the outcome of a toss contain more information?
- Example 2: Basketball game outcome
- Two teams A and B are playing basketball
- Case 1: equal probability to win
- Case 2: Michael Jordan is playing for A and the probability of A winning is 90%
- In which case will an answer telling the outcome of the game contain more information?
18 Information Content - Solutions
- Example 1: Flipping a coin
- Case 1: Entropy(coin_toss) = I(1/2, 1/2) = -1/2 log2(1/2) - 1/2 log2(1/2) = 1 bit
- Case 2: Entropy(coin_toss) = I(1/4, 3/4) = -1/4 log2(1/4) - 3/4 log2(3/4) = 0.811 bits
- Example 2: Basketball game outcome
- Case 1: Entropy(game_outcome) = I(1/2, 1/2) = 1 bit
- Case 2: Entropy(game_outcome) = I(90/100, 10/100) = -0.9 log2(0.9) - 0.1 log2(0.1) < 1 bit
19 Information Gain
- Back to DTs
- Entropy measures
- the disorder of a collection of training examples with respect to their classification
- the smallest possible number of bits per symbol on average needed to transmit a stream of symbols drawn from X's distribution
- the amount of surprise of the receiver at the answer, based on the probability of the answers
- We can use it to define a measure of the effectiveness of an attribute in classifying the training data: Information gain
- Information gain is the expected reduction in entropy caused by partitioning the set of examples using that attribute
20 DT Learning as a Search
- The DT algorithm searches the hypothesis space for a hypothesis that fits the training data
- What does the hypothesis space consist of?
- What is the search strategy?
- simple-to-complex search (starting with an empty tree and progressively considering more elaborate hypotheses)
- hill climbing with information gain as the evaluation function
- Information gain is an evaluation function of how good the current state is (how close we are to the goal state, i.e. the tree that classifies correctly all training examples)
21 Hill Climbing - Revision
- Hill climbing using the h-cost as an evaluation function
- Expanded nodes: A, B, G, L
- Solution path: A to L
22 Information Gain - Definition
- Information gain is the expected reduction in entropy caused by partitioning the set of examples using that attribute
- Gain(S,A) is the information gain of an attribute A relative to S:
- Gain(S,A) = Entropy(S) - Σ_{v ∈ Values(A)} (|Sv|/|S|) Entropy(Sv)
- Values(A) is the set of all possible values for A
- Sv is the subset of S for which A has value v
- the first term is the entropy of the original data set S; the second term, also called the Remainder, is the expected value of the entropy after S is partitioned by A (it is the sum of the entropies of each subset Sv, weighted by the fraction of examples that belong to Sv); a code sketch follows below
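A small Python sketch of this definition, reusing the entropy() function from the Entropy slide sketch:

```python
def information_gain(examples, labels, attribute):
    """Gain(S,A) = Entropy(S) - sum_v |Sv|/|S| * Entropy(Sv)."""
    n = len(examples)
    remainder = 0.0
    for value in set(ex[attribute] for ex in examples):
        sv = [y for ex, y in zip(examples, labels) if ex[attribute] == value]
        remainder += (len(sv) / n) * entropy(sv)   # weighted subset entropy
    return entropy(labels) - remainder
```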
23 More on Information Gain: First Term
- Information gain is an evaluation function of how good the current state is (how close we are to the goal state, i.e. the tree that classifies correctly all training examples)
- Before any attribute is tested, an estimate of this is given by the entropy of the original data set S
- S contains p positive and n negative examples: Entropy(S) = I(p/(p+n), n/(p+n))
- e.g. tennis data: 9 yes and 5 no at the beginning => Entropy(S) = I(9/14, 5/14) = 0.940 bits (worked out below)
- an answer telling the class of a randomly selected example will contain 0.940 bits
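Worked out from the entropy definition given earlier:

```latex
Entropy(S) = I\!\left(\tfrac{9}{14},\tfrac{5}{14}\right)
           = -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14}
           \approx 0.940 \text{ bits}
```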
24 More on Information Gain: Second Term
- After a test on a single attribute A we can estimate how much information is still needed to classify an example
- A divides the training set S into subsets Sv; Sv is the subset of S for which A has value v
- each subset Sv has pv positive and nv negative examples
- if we go along that branch we will need in addition Entropy(Sv) = I(pv/(pv+nv), nv/(pv+nv)) bits to answer the question
- a random example has the v-th value for A with probability (pv+nv)/(p+n)
- => on average, after testing A, the number of bits we will need to classify the example is Remainder(A) = Σ_v (pv+nv)/(p+n) · I(pv/(pv+nv), nv/(pv+nv))
25 Computing the Information Gain
26 Computing the Information Gain (cont.)
27 Continuing to Split
Gain(S, temperature) = 0.571 bits
Gain(S, humidity) = 0.971 bits
Gain(S, windy) = 0.020 bits
Final DT
28 DT Decision Boundary
Example taken from 6.034 AI, MIT
- DTs define a decision boundary in the feature space
- For a binary DT with 2 attributes: R = ratio of earnings to expenses, L = number of late payments on credit cards over the past year
29 1-NN Decision Boundary
Example taken from 6.034 AI, MIT
- What is the decision boundary of the 1-NN algorithm?
- The space can be divided into regions that are closer to each given data point than to the others: Voronoi partitioning of the space
- In 1-NN a hypothesis is represented by the edges in the Voronoi space that separate the points of the two classes
30 Overfitting
- ID3 typically grows each branch of the tree deeply enough to perfectly classify the training examples
- but difficulties occur when there is
- noise in the data
- too small a training set - it cannot produce a representative sample of the target function
- => ID3 can produce DTs that overfit the training examples
- More formal definition of overfitting
- given H - a hypothesis space, a hypothesis h ∈ H,
- D - the entire distribution of instances, train - the training instances
- h is said to overfit the training data if there exists some alternative hypothesis h' ∈ H such that
- error_train(h) < error_train(h') but error_D(h) > error_D(h')
31 Overfitting
32 Overfitting - Example
- How can it be possible for tree h to fit the training examples better than h' but to perform worse over subsequent examples?
- Example: noise in the labeling of a training instance
- adding to the original tennis data the following positive example that is incorrectly labeled as negative: outlook=sunny, temperature=hot, humidity=normal, windy=yes, playTennis=no
33 Overfitting - cont.
- Overfitting is a problem not only for DTs
- Tree pruning is used to avoid overfitting in DTs
- pre-pruning - stop growing the tree earlier, before it reaches the point where it perfectly classifies the training data
- post-pruning - fully grow the tree (allowing it to overfit the data) and then post-prune it (more successful in practice)
- Tree post-pruning
- sub-tree replacement
- sub-tree raising
- Rule post-pruning (convert the tree into a set of rules and prune them)
34 When to Stop Pruning?
- How to determine when to stop pruning?
- Solution: estimate the error rate using
- a validation set
- the training data - a pessimistic error estimate based on the training data (a heuristic based on some statistical reasoning, but the statistical underpinning is rather weak)
35 Error Rate Estimation Using a Validation Set
- Available data is separated into 3 sets of examples
- training set - used to form the learned model
- validation set - used to evaluate the impact of pruning and decide when to stop
- test set - used to evaluate how good the final tree is
- Motivation
- even though the learner may be misled by random errors and coincidental regularities within the training set, the validation set is unlikely to exhibit the same random fluctuations => the validation set can provide a safety check against overfitting of the training set
- the validation set should be large enough; typically 1/2 of the available examples are used as the training set, 1/4 as the validation set and 1/4 as the test set
- Disadvantage: the tree is based on less data
- when the data is limited, withholding part of it for validation reduces even further the examples available for training
36 Tree Post-pruning by Sub-Tree Replacement
- Each node is considered as a candidate for pruning
- Start from the leaves and work toward the root
- Typical error estimate - validation set
- Pruning a node
- remove the sub-tree rooted at that node
- make it a leaf and assign the most common label of the training examples affiliated with that node
- Nodes are removed only if the resulting pruned tree performs no worse than the original tree over the validation set
- => any leaf added due to false regularities in the training set is likely to be pruned, as these coincidences are unlikely to occur in the validation set
- Nodes are pruned iteratively, always choosing the node whose removal most increases the tree accuracy on the validation set
- Continue until further pruning is harmful, i.e. decreases the accuracy of the tree over the validation set (a code sketch follows below)
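A simplified Python sketch of sub-tree replacement on the nested-dict trees from the earlier sketches (it reuses classify()). As an illustrative simplification it prunes bottom-up and evaluates each sub-tree only on the validation examples that reach it, rather than re-scoring the whole tree and picking the single best node at each step as described above:

```python
from collections import Counter

def prune(tree, train, val):
    """train/val: lists of (instance_dict, label) pairs reaching this node."""
    if not isinstance(tree, dict):          # leaf: nothing to prune
        return tree
    attr = tree["attribute"]
    # work bottom-up: prune each child first, on the examples that reach it
    for value in list(tree["branches"]):
        tree["branches"][value] = prune(
            tree["branches"][value],
            [(x, y) for x, y in train if x[attr] == value],
            [(x, y) for x, y in val if x[attr] == value])
    # candidate leaf: most common label of the training examples at this node
    majority = Counter(y for _, y in train).most_common(1)[0][0]
    subtree_errors = sum(classify(tree, x) != y for x, y in val)
    leaf_errors = sum(majority != y for _, y in val)
    # replace the sub-tree only if the pruned version is no worse on validation
    return majority if leaf_errors <= subtree_errors else tree
```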
37 Sub-Tree Replacement - Example
38 Effect of Tree Pruning by Sub-Tree Replacement
- The accuracy on test data increases as nodes are pruned
- (accuracy over the validation set used for pruning is not shown)
39 Post-pruning: Sub-Tree Raising
- a more complex operation than sub-tree replacement
- sub-tree raising is a potentially time consuming operation => it is restricted to raising the sub-tree of the most popular branch
- e.g. raise C only if the branch from B to C has more training examples than the branches from B to 4 or from B to 5; otherwise, if (for example) 4 were the majority daughter of B, consider raising 4 to replace B and re-classifying all examples under C, as well as the examples from 5, into the new node
40 Rule Post-Pruning - Example
- Grow the tree until the training data is fit
- Convert the tree into an equivalent set of rules by creating 1 rule for each path from the root to a leaf (a code sketch follows below)
if (outlook=sunny) AND (humidity=high) then PlayTennis=No ...
- Prune each rule by removing any preconditions whose removal improves its estimated accuracy
- consider removing (outlook=sunny) and then (humidity=high)
- select the pruning which produces the greatest improvement
- no pruning if it reduces the estimated rule accuracy
- Sort the pruned rules by their estimated accuracy, and consider them in this sequence when classifying subsequent instances
- To estimate accuracy: 1) a validation set of examples, or 2) a pessimistic error estimate based on the training data set (C4.5)
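A short Python sketch of the tree-to-rules conversion on the nested-dict representation from the earlier sketches; rule pruning and accuracy estimation are not shown:

```python
def tree_to_rules(tree, preconditions=()):
    """Return a list of (preconditions, class) pairs, one per root-to-leaf path."""
    if not isinstance(tree, dict):                      # leaf -> one rule
        return [(list(preconditions), tree)]
    rules = []
    for value, child in tree["branches"].items():
        rules += tree_to_rules(child,
                               preconditions + ((tree["attribute"], value),))
    return rules

# e.g. printing the rules of the illustrative tennis tree from earlier:
for conds, label in tree_to_rules(tennis_tree):
    body = " AND ".join(f"({a}={v})" for a, v in conds)
    print(f"if {body} then PlayTennis={label}")
```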
41 Rule Post-Pruning - cont.
- Why convert the DT to rules before pruning?
- Bigger flexibility
- when trees are pruned there are only 2 choices - remove the node completely or retain it
- when rules are pruned there are fewer restrictions
- preconditions (not nodes) are removed
- each branch in the tree (i.e. each rule) is treated separately
- removes the distinction between attribute tests that occur near the root of the tree and those near the leaves
- Advantage of rules over trees - it is easier to read rules than a tree
42 Numeric Attributes
- ID3 works only when all the attributes are nominal, but most real data sets contain numeric attributes => need for discretization
- for a numeric attribute we restrict the possibilities to a binary split (e.g. temp < 60)
- difference to nominal attributes: every numeric attribute offers many possible split points
- The solution is a straightforward extension (a code sketch follows below)
- sort the examples according to the values of the attribute
- identify adjacent examples that differ in their target classification and generate a set of candidate splits (split points are placed halfway)
- evaluate Gain (or another measure) for every possible split point and choose the best split point
- the Gain for the best split point is the Gain for the attribute
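A Python sketch of this extension, reusing entropy() from the earlier sketch; the example data are the temperature values from the next slide:

```python
def best_numeric_split(values, labels):
    """Sort by value, place candidate thresholds halfway between adjacent
    examples with different classes, and keep the split with the best gain."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    n = len(pairs)
    best_gain, best_threshold = 0.0, None
    for i in range(n - 1):
        if pairs[i][1] == pairs[i + 1][1]:      # same class: no candidate here
            continue
        threshold = (pairs[i][0] + pairs[i + 1][0]) / 2   # halfway split point
        left = [y for v, y in pairs if v < threshold]
        right = [y for v, y in pairs if v >= threshold]
        gain = (base - (len(left) / n) * entropy(left)
                     - (len(right) / n) * entropy(right))
        if gain > best_gain:
            best_gain, best_threshold = gain, threshold
    return best_threshold, best_gain

temps = [64, 65, 68, 69, 70, 71, 72, 73, 74, 75, 80, 81, 83, 85]
play  = ["yes", "no", "yes", "yes", "yes", "no", "no", "no", "yes",
         "yes", "no", "yes", "yes", "no"]
print(best_numeric_split(temps, play))
```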
43 Numeric Attributes - Example
- values of temperature (sorted) and their classes:
  64  65  68  69  70  71  72  73  74  75  80  81  83  85
  yes no  yes yes yes no  no  no  yes yes no  yes yes no
- 7 possible splits; consider the split between 70 and 71 (worked out below)
- Information gain for
- 1) temperature < 70.5: 4 yes, 1 no
- 2) temperature >= 70.5: 4 yes, 5 no
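Using the subset counts above with the entropy and Gain definitions from the earlier slides, the weighted remainder term for this split works out to about 0.895 bits:

```latex
I\!\left(\tfrac{4}{5},\tfrac{1}{5}\right) \approx 0.722 \text{ bits},\qquad
I\!\left(\tfrac{4}{9},\tfrac{5}{9}\right) \approx 0.991 \text{ bits}
```

```latex
\text{Remainder} = \tfrac{5}{14}(0.722) + \tfrac{9}{14}(0.991) \approx 0.895 \text{ bits},
\qquad \text{Gain}(S, temperature_{70.5}) = Entropy(S) - 0.895
```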
44 Alternative Measures for Selecting Attributes
- Problem: if an attribute is highly-branching (with a large number of values), Information gain will select it!
- imagine using an ID code (extreme case): the training examples will be separated into many very small subsets
- => highly-branching attributes are more likely to create pure subsets
- Information gain is biased towards choosing attributes with a large number of values
- this will result in overfitting
45 Highly-Branching Attributes - Example
46 Highly-Branching Attributes - Example cont.
- the weighted sum of entropies
- entropy at the root
- Gain
47 Gain Ratio
- Gain ratio: a modification of the Gain that reduces its bias towards highly branching attributes
- it takes the number and size of branches into account when choosing an attribute
- it penalizes highly-branching attributes by incorporating SplitInformation
- SplitInformation is the entropy of S wrt the values of A: SplitInformation(S,A) = -Σ_{v ∈ Values(A)} (|Sv|/|S|) log2(|Sv|/|S|)
- Gain ratio: GainRatio(S,A) = Gain(S,A) / SplitInformation(S,A) (a code sketch follows below)
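A small Python sketch of these definitions, reusing entropy() and information_gain() from the earlier sketches:

```python
from math import log2

def split_information(examples, attribute):
    """Entropy of S with respect to the values of the attribute."""
    n = len(examples)
    counts = {}
    for ex in examples:
        counts[ex[attribute]] = counts.get(ex[attribute], 0) + 1
    return -sum((c / n) * log2(c / n) for c in counts.values())

def gain_ratio(examples, labels, attribute):
    si = split_information(examples, attribute)
    # guard against a zero denominator when the attribute has a single value
    return information_gain(examples, labels, attribute) / si if si > 0 else 0.0
```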
48 Gain Ratio
- Gain ratios for the tennis data
- outlook: Gain=0.247, SplitInformation=1.577, GainRatio=0.156
- temperature: Gain=0.029, SplitInformation=1.362, GainRatio=0.021
- humidity: Gain=0.152, SplitInformation=1.000, GainRatio=0.152
- windy: Gain=0.048, SplitInformation=0.985, GainRatio=0.049
- => outlook still comes out on top, but humidity is now much closer as it splits the data into 2 subsets instead of 3
- however, ID code will still be preferred (although its advantage is greatly reduced)
49 Gain Ratio - Problem
- Problem with GainRatio: it may overcompensate
- it may choose an attribute just because its SplitInformation is much lower than for the other attributes
- standard fix: only consider attributes with greater Gain than the average Gain (over all the attributes examined)
50 Handling Examples with Missing Values
- Missing attribute values in the training data
- (x, class(x)) is a training example in S
- the value of attribute A for example x, A(x), is unknown
- When building the DT, what to do with the missing attribute value A(x)?
- Gain(S,A) has to be calculated at node n to evaluate whether to split on A
- 1) treat missing values as simply another possible value of the attribute; this assumes that the absence of a value is significant
- 2) ignore all instances with a missing attribute value - a tempting solution! But
- instances with missing values often provide a good deal of information
- sometimes the attributes whose values are missing play no part in the decision, in which case these instances are as good as any other
51 Handling Examples with Missing Values - 2
- 3) A(x) = the most common value for A among the training examples at n
- 4) A(x) = the most common value for A among the training examples at n with the same class as x (class(x))
52 Handling Examples with Missing Values - 3
- a more sophisticated solution (used in C4.5)
- assign a probability to each of the possible values of A; calculate these probabilities using the frequencies of the values of A among the examples at n
- example: A is the Boolean attribute wind
- instance x has a missing value for wind
- node n contains 6 examples with wind=true and 4 with wind=false
- => P(A(x)=true) = 0.6, P(A(x)=false) = 0.4
- 0.6 of instance x is distributed down the branch for wind=true and
- 0.4 of instance x down the branch for wind=false
- these fractional examples are used to compute Gain and can be further subdivided at subsequent branches of the tree if another missing attribute value must be tested
- the same fractioning strategy can be used for classification of new instances with missing values (a code sketch follows below)
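An illustrative Python sketch of the fractional strategy at classification time, on the nested-dict trees from the earlier sketches. It assumes each internal node additionally stores "branch_weights" (the value frequencies among the training examples at that node); that field is an assumption of this sketch, not part of the earlier representation:

```python
from collections import defaultdict

def classify_with_missing(tree, instance, weight=1.0, totals=None):
    """Accumulate class weights at the leaves; missing attribute values send
    fractional weight down every branch in proportion to branch_weights."""
    totals = totals if totals is not None else defaultdict(float)
    if not isinstance(tree, dict):                 # leaf: accumulate weight
        totals[tree] += weight
        return totals
    value = instance.get(tree["attribute"])        # None means missing value
    if value is not None:
        classify_with_missing(tree["branches"][value], instance, weight, totals)
    else:
        for v, subtree in tree["branches"].items():
            classify_with_missing(subtree, instance,
                                  weight * tree["branch_weights"][v], totals)
    return totals

# predicted class = the label with the largest accumulated weight:
# prediction = max(totals, key=totals.get)
```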
53 Handling Attributes with Different (External) Costs
- Consider medical diagnosis and the following attributes: temperature, biopsyResult, pulse, bloodTestResult
- attributes vary significantly in their costs (monetary cost and patient comfort)
- prefer DTs that use low-cost attributes where possible, relying on high-cost attributes only when needed to produce reliable classification
- How to learn a consistent tree with low expected cost?
- One approach: favour low-cost attributes by replacing Gain with a cost-sensitive measure
- Tan & Schlimmer (1990)
- Nunez (1988), where w ∈ [0,1] determines the importance of cost
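The slide leaves the two measures to the figures; the forms commonly attributed to these proposals (e.g. in Mitchell's Machine Learning, Ch. 3) are reproduced below as an assumption about what the slide shows:

```latex
\text{Tan and Schlimmer: } \frac{\mathit{Gain}^2(S,A)}{\mathit{Cost}(A)}
\qquad\qquad
\text{Nunez: } \frac{2^{\mathit{Gain}(S,A)} - 1}{(\mathit{Cost}(A) + 1)^w}
```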
54 Components of DT
- Model (structure)
- Tree (not pre-specified but derived from data)
- Preference (score function) - criteria used to measure the quality of the tree structures
- number of misclassifications over all examples (loss function)
- Search method (how the space of trees is searched by the algorithm)
- hill climbing search over tree structures (2 phases: grow and prune)
55 DTs - Summary
- Very popular ML technique
- Easy to implement
- Efficient
- Cost of building the tree: O(mn log n), with n instances and m attributes
- Cost of pruning the tree by sub-tree replacement: O(n)
- Cost of pruning by sub-tree raising: O(n (log n)^2)
- => the total cost of tree induction: O(mn log n) + O(n (log n)^2)
- Reference: Witten and Frank pp. 167-168
- The resulting hypothesis is easy for humans to interpret