Title: Decision Trees
1 COMP5318 Data Mining and Machine Learning, Lecture 3
- Decision Trees
- Reference: Witten and Frank pp. 89-97, 159-164
- Dunham pp. 97-103
2 Outline of the Lecture
- DT example
- Constructing DTs
- DTs Decision Boundary
- Avoiding Overfitting the Data
- Dealing with Numeric Attributes
- Alternative Measures for Selecting Attributes
- Dealing with Missing Attributes
- Handling Attributes with Different Costs
3 Decision Trees (DTs)
- DTs are supervised learners for classification
- most popular and well researched ML/DM technique
- divide-and-conquer approach
- developed in parallel in ML by Ross Quinlan (USyd) and in statistics by Breiman, Friedman, Olshen and Stone (Classification and Regression Trees, 1984)
- Quinlan has refined the DT algorithm over the years
- ID3, 1986
- C4.5, 1993
- C5.0 (See5 on Windows) - commercial version of C4.5 used in many DM packages
- In WEKA - id3 and j48
- Many other software implementations available on the web, free and commercial (USD 50 to 300,000 - A. Moore)
4 DT Example
- DT representation (model)
- each internal node tests an attribute
- each branch corresponds to an attribute value
- each leaf node assigns a class
DT for the tennis data
- A DT is a tree-structured plan for testing the values of a set of attributes in order to predict the output (A. Moore)
- Another interpretation
- Use all your data to build a tree of questions with answers at the leaves. To answer a new query, start from the tree root, answer the questions until you reach a leaf node and return its answer (a small sketch of this traversal follows below).
- What would be the prediction for
- outlook=sunny, temperature=cool, humidity=high, windy=true?
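A minimal Python sketch of this traversal, assuming trees are stored as nested dictionaries; the representation and the sample tennis tree below are illustrative, not the lecture's code:

```python
# Internal nodes are dicts {"attribute": ..., "branches": {value: subtree}};
# leaves are plain class labels. (Illustrative representation only.)

tennis_tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": {"attribute": "humidity",
                  "branches": {"high": "no", "normal": "yes"}},
        "overcast": "yes",
        "rainy": {"attribute": "windy",
                  "branches": {True: "no", False: "yes"}},
    },
}

def classify(tree, instance):
    """Walk from the root, following the branch matching the instance's
    value for the tested attribute, until a leaf is reached."""
    while isinstance(tree, dict):          # internal node: test an attribute
        value = instance[tree["attribute"]]
        tree = tree["branches"][value]
    return tree                            # leaf = class label

# The query from the slide: outlook=sunny, temperature=cool,
# humidity=high, windy=true  ->  "no"
print(classify(tennis_tree, {"outlook": "sunny", "temperature": "cool",
                             "humidity": "high", "windy": True}))
```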
5 Building a DT Example: First Node
Example created by Eric McCreath
6 Second Node
Example created by Eric McCreath
7 Final Tree
Wage > 22K
dependents < 6
Example created by Eric McCreath
8 Stopping Criteria Again
- In fact the real stopping criterion is more complex
- 4. Stop if, in the current sub-set,
- a) all instances have the same class => make a leaf node corresponding to this class, or else
- b) there are no remaining attributes that can create non-empty children (no attribute can distinguish) => make a leaf node and label it with the majority class
Think about when case b) will occur!
9 Constructing DTs (ID3 algorithm)
- Top-down, in recursive divide-and-conquer fashion (a code sketch follows after this list)
- 1. An attribute is selected for the root node and a branch is created for each possible attribute value
- 2. The instances are split into subsets (one for each branch extending from the node)
- 3. The procedure is repeated recursively for each branch, using only the instances that reach the branch
- 4. Stop if all instances have the same class; make a leaf node corresponding to this class
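A Python sketch of this recursive construction, under illustrative assumptions: examples are dicts of attribute -> value with labels kept in a parallel list, and `choose_attribute` is any attribute-selection function (ID3 uses information gain, defined later in the lecture):

```python
from collections import Counter

def build_tree(examples, labels, attributes, choose_attribute):
    # Step 4: stop if all instances have the same class -> leaf
    if len(set(labels)) == 1:
        return labels[0]
    # Extra stopping case (slide 8, case b): no attributes left -> majority leaf
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Step 1: select an attribute and create a branch per value
    best = choose_attribute(examples, labels, attributes)
    tree = {"attribute": best, "branches": {}}
    remaining = [a for a in attributes if a != best]
    # Step 2: split the instances into subsets, one per branch
    for value in set(ex[best] for ex in examples):
        subset = [(ex, y) for ex, y in zip(examples, labels) if ex[best] == value]
        sub_ex = [ex for ex, _ in subset]
        sub_y = [y for _, y in subset]
        # Step 3: recurse using only the instances that reach this branch
        tree["branches"][value] = build_tree(sub_ex, sub_y, remaining,
                                             choose_attribute)
    return tree
```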
10 Expressiveness of DTs
- DTs can be expressed as a disjunction of conjunctions
- Extract the rules for the tennis tree
- Assume that all attributes are Boolean and the class is Boolean. What is the class of Boolean functions that we can represent using DTs?
- Answer
- Proof
11 How to find the best attribute?
- Four DTs for the tennis data: which is the best choice?
- We need a measure of purity of each node, as the leaves with only one class (yes or no) will not have to be split further
- => at each step we can choose the attribute which produces the purest children nodes; such a measure of purity is called ...
12 Entropy
- Entropy (also called information content) measures the homogeneity of a set of examples; it characterizes the impurity (disorder, randomness) of a collection of examples wrt their classification: high entropy = high disorder
- It is a standard measure used in signal compression, information theory and physics
- Entropy of data set S: Entropy(S) = -Σ_i p_i log2(p_i)
- p_i - proportion of examples that belong to class i
- The smaller the entropy, the greater the purity of the set
- Tennis data - 9 yes & 5 no examples => the entropy of the tennis data set S relative to the classification is I(9/14, 5/14) = 0.940 bits (a small code sketch follows below)
- log to the base 2: information is measured in bits
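A short Python sketch of this definition; the 9-yes / 5-no tennis data gives about 0.940 bits:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy(S) = -sum_i p_i * log2(p_i) over the class proportions."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

print(entropy(["yes"] * 9 + ["no"] * 5))   # ~0.940
```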
13 Range of the Entropy for Binary Classification
- p - proportion of positive examples in the data set; Entropy(S) = -p log2(p) - (1-p) log2(1-p)
- 0 => all members of S belong to the same class (no disorder)
- 1 => equal number of yes & no (entropy is maximized; S is as disordered as it can be)
- in (0,1) if S contains an unequal number of yes & no
14 Another Interpretation of Entropy - Example (from http://www.cs.cmu.edu/~awm/tutorials)
- Suppose that X is a random variable with 4 possible values A, B, C and D, and P(X=A)=P(X=B)=P(X=C)=P(X=D)=1/4.
- You must transmit a sequence of X values over a serial binary channel. You can encode each symbol with 2 bits, e.g. A=00, B=01, C=10 and D=11
- ABBBCADBADCB
- 000101011000110100111001
- Now you are told that the probabilities are not equal, e.g. P(X=A)=1/2, P(X=B)=1/4, P(X=C)=P(X=D)=1/8.
- Can you invent a coding that uses less than 2 bits on average per symbol, e.g. 1.75 bits?
- A=0, B=10, C=110, D=111
- What is the smallest possible number of bits per symbol?
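With the variable-length code above (the standard prefix code for these probabilities), the expected code length matches the entropy of the distribution, which answers the question: 1.75 bits per symbol.

```latex
E[\text{bits per symbol}] = \tfrac{1}{2}(1) + \tfrac{1}{4}(2) + \tfrac{1}{8}(3) + \tfrac{1}{8}(3) = 1.75
\qquad
H(X) = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{4}\log_2\tfrac{1}{4} - 2\cdot\tfrac{1}{8}\log_2\tfrac{1}{8}
     = 0.5 + 0.5 + 0.75 = 1.75 \text{ bits}
```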
15 Another Interpretation of Entropy - General Case (from http://www.cs.cmu.edu/~awm/tutorials)
- Suppose that X is a random variable with m possible values V1..Vm and P(X=V1)=p1, P(X=V2)=p2, ..., P(X=Vm)=pm.
- What is the smallest possible number of bits per symbol on average needed to transmit a stream of symbols drawn from X's distribution? H(X) = -Σ_j p_j log2(p_j)
- High entropy: the values of X are all over the place
- The histogram of the frequency distribution of values of X will be flat
- Low entropy: the values of X are more predictable
- The histogram of the frequency distribution of values of X has many lows and one or two highs
16 Information Theory
- Information theory, Shannon and Weaver 1949
- Given a set of possible answers (messages) M = {m1, m2, ..., mn} and a probability P(mi) for the occurrence of each answer, the expected information content (entropy) of the actual answer is
- I(M) = -Σ_i P(mi) log2 P(mi)
- Shows the amount of surprise of the receiver at the answer, based on the probability of the answers (i.e. based on the prior knowledge of the receiver about the possible answers)
- The less the receiver knows, the more information is provided (the more informative the answer is)
17 Information Content - Examples
- Example 1: Flipping a coin
- Case 1: flipping an honest coin
- Case 2: flipping a rigged coin so that it will come up heads 75% of the time
- In which case will an answer telling the outcome of a toss contain more information?
- Example 2: Basketball game outcome
- Two teams A and B are playing basketball
- Case 1: equal probability to win
- Case 2: Michael Jordan is playing for A and the probability of A winning is 90%
- In which case will an answer telling the outcome of the game contain more information?
18 Information Content - Solutions
- Example 1: Flipping a coin
- Case 1: Entropy(coin_toss) = I(1/2, 1/2) = -1/2 log2(1/2) - 1/2 log2(1/2) = 1 bit
- Case 2: Entropy(coin_toss) = I(1/4, 3/4) = -1/4 log2(1/4) - 3/4 log2(3/4) = 0.811 bits
- Example 2: Basketball game outcome
- Case 1: Entropy(game_outcome) = I(1/2, 1/2) = 1 bit
- Case 2: Entropy(game_outcome) = I(90/100, 10/100) = -0.9 log2(0.9) - 0.1 log2(0.1) < 1 bit
19 Information Gain
- Back to DTs
- Entropy measures
- the disorder of a collection of training examples with respect to their classification
- the smallest possible number of bits per symbol on average needed to transmit a stream of symbols drawn from X's distribution
- the amount of surprise of the receiver at the answer, based on the probability of the answers
- We can use it to define a measure of the effectiveness of an attribute in classifying the training data: Information gain
- Information gain is the expected reduction in entropy caused by partitioning the set of examples using that attribute
20 DT Learning as a Search
- The DT algorithm searches the hypothesis space for a hypothesis that fits the training data
- What does the hypothesis space consist of?
- What is the search strategy?
- simple-to-complex search (starting with an empty tree and progressively considering more elaborate hypotheses)
- hill climbing with information gain as the evaluation function
- Information gain is an evaluation function of how good the current state is (how close we are to the goal state, i.e. the tree that classifies correctly all training examples)
21 Hill Climbing - Revision
- Hill climbing using the h-cost as an evaluation function
- Expanded nodes: A, B, G, L
- Solution path: A to L
22 Information Gain - Definition
- Information gain is the expected reduction in entropy caused by partitioning the set of examples using that attribute
- Gain(S,A) is the information gain of an attribute A relative to S:
- Gain(S,A) = Entropy(S) - Σ_{v ∈ Values(A)} (|Sv|/|S|) Entropy(Sv)
- Values(A) is the set of all possible values for A
- Sv is the subset of S for which A has value v
- the first term is the entropy of the original data set S; the second term, also called the Remainder, is the expected value of the entropy after S is partitioned by A (it is the sum of the entropies of each subset Sv, weighted by the fraction of examples that belong to Sv); a code sketch follows below
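A small Python sketch of this definition, reusing the entropy() function from the Entropy slide sketch:

```python
def information_gain(examples, labels, attribute):
    """Gain(S,A) = Entropy(S) - sum_v |Sv|/|S| * Entropy(Sv)."""
    n = len(examples)
    remainder = 0.0
    for value in set(ex[attribute] for ex in examples):
        sv = [y for ex, y in zip(examples, labels) if ex[attribute] == value]
        remainder += (len(sv) / n) * entropy(sv)   # weighted subset entropy
    return entropy(labels) - remainder
```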
23 More on Information Gain: First Term
- Information gain is an evaluation function of how good the current state is (how close we are to the goal state, i.e. the tree that classifies correctly all training examples)
- Before any attribute is tested, an estimate of this is given by the entropy of the original data set S
- S contains p positive and n negative examples: Entropy(S) = I(p/(p+n), n/(p+n))
- e.g. tennis data: 9 yes and 5 no at the beginning => Entropy(S) = I(9/14, 5/14) = 0.940 bits (worked out below)
- an answer telling the class of a randomly selected example will contain 0.940 bits
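Worked out from the entropy definition given earlier:

```latex
Entropy(S) = I\!\left(\tfrac{9}{14},\tfrac{5}{14}\right)
           = -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14}
           \approx 0.940 \text{ bits}
```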
24 More on Information Gain: Second Term
- After a test on a single attribute A we can estimate how much information is still needed to classify an example
- A divides the training set S into subsets Sv; Sv is the subset of S for which A has value v
- each subset Sv has pv positive and nv negative examples
- if we go along that branch we will need in addition Entropy(Sv) = I(pv/(pv+nv), nv/(pv+nv)) bits to answer the question
- a random example has the v-th value for A with probability (pv+nv)/(p+n)
- => on average, after testing A, the number of bits we will need to classify the example is Remainder(A) = Σ_v (pv+nv)/(p+n) · I(pv/(pv+nv), nv/(pv+nv))
25 Computing the Information Gain
26 Computing the Information Gain (cont.)
27 Continuing to Split
Gain(S, temperature) = 0.571 bits
Gain(S, humidity) = 0.971 bits
Gain(S, windy) = 0.020 bits
Final DT
28 DT Decision Boundary
Example taken from 6.034 AI, MIT
- DTs define a decision boundary in the feature space
- For a binary DT with 2 attributes: R = ratio of earnings to expenses, L = number of late payments on credit cards over the past year
29 1-NN Decision Boundary
Example taken from 6.034 AI, MIT
- What is the decision boundary of the 1-NN algorithm?
- The space can be divided into regions that are closer to each given data point than to the others: Voronoi partitioning of the space
- In 1-NN a hypothesis is represented by the edges in the Voronoi space that separate the points of the two classes
30 Overfitting
- ID3 typically grows each branch of the tree deeply enough to perfectly classify the training examples
- but difficulties occur when there is
- noise in the data
- too small a training set - it cannot produce a representative sample of the target function
- => ID3 can produce DTs that overfit the training examples
- More formal definition of overfitting
- given H - a hypothesis space, a hypothesis h ∈ H,
- D - the entire distribution of instances, train - the training instances
- h is said to overfit the training data if there exists some alternative hypothesis h' ∈ H such that
- error_train(h) < error_train(h') but error_D(h) > error_D(h')
31 Overfitting
32 Overfitting - Example
- How can it be possible for tree h to fit the training examples better than h' but to perform worse over subsequent examples?
- Example: noise in the labeling of a training instance
- adding to the original tennis data the following positive example that is incorrectly labeled as negative: outlook=sunny, temperature=hot, humidity=normal, windy=yes, playTennis=no
33 Overfitting - cont.
- Overfitting is a problem not only for DTs
- Tree pruning is used to avoid overfitting in DTs
- pre-pruning - stop growing the tree earlier, before it reaches the point where it perfectly classifies the training data
- post-pruning - fully grow the tree (allowing it to overfit the data) and then post-prune it (more successful in practice)
- Tree post-pruning
- sub-tree replacement
- sub-tree raising
- Rule post-pruning (convert the tree into a set of rules and prune them)
34 When to Stop Pruning?
- How to determine when to stop pruning?
- Solution: estimate the error rate using
- a validation set
- the training data - a pessimistic error estimate based on the training data (a heuristic based on some statistical reasoning, but the statistical underpinning is rather weak)
35 Error Rate Estimation Using a Validation Set
- Available data is separated into 3 sets of examples
- training set - used to form the learned model
- validation set - used to evaluate the impact of pruning and decide when to stop
- test set - used to evaluate how good the final tree is
- Motivation
- even though the learner may be misled by random errors and coincidental regularities within the training set, the validation set is unlikely to exhibit the same random fluctuations => the validation set can provide a safety check against overfitting of the training set
- the validation set should be large enough; typically 1/2 of the available examples are used as the training set, 1/4 as the validation set and 1/4 as the test set
- Disadvantage: the tree is based on less data
- when the data is limited, withholding part of it for validation reduces even further the examples available for training
36 Tree Post-pruning by Sub-Tree Replacement
- Each node is considered as a candidate for pruning
- Start from the leaves and work toward the root
- Typical error estimate - validation set
- Pruning a node
- remove the sub-tree rooted at that node
- make it a leaf and assign the most common label of the training examples affiliated with that node
- Nodes are removed only if the resulting pruned tree performs no worse than the original tree over the validation set
- => any leaf added due to false regularities in the training set is likely to be pruned, as these coincidences are unlikely to occur in the validation set
- Nodes are pruned iteratively, always choosing the node whose removal most increases the tree accuracy on the validation set
- Continue until further pruning is harmful, i.e. decreases the accuracy of the tree over the validation set (a code sketch follows below)
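A simplified Python sketch of sub-tree replacement on the nested-dict trees from the earlier sketches (it reuses classify()). As an illustrative simplification it prunes bottom-up and evaluates each sub-tree only on the validation examples that reach it, rather than re-scoring the whole tree and picking the single best node at each step as described above:

```python
from collections import Counter

def prune(tree, train, val):
    """train/val: lists of (instance_dict, label) pairs reaching this node."""
    if not isinstance(tree, dict):          # leaf: nothing to prune
        return tree
    attr = tree["attribute"]
    # work bottom-up: prune each child first, on the examples that reach it
    for value in list(tree["branches"]):
        tree["branches"][value] = prune(
            tree["branches"][value],
            [(x, y) for x, y in train if x[attr] == value],
            [(x, y) for x, y in val if x[attr] == value])
    # candidate leaf: most common label of the training examples at this node
    majority = Counter(y for _, y in train).most_common(1)[0][0]
    subtree_errors = sum(classify(tree, x) != y for x, y in val)
    leaf_errors = sum(majority != y for _, y in val)
    # replace the sub-tree only if the pruned version is no worse on validation
    return majority if leaf_errors <= subtree_errors else tree
```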
37 Sub-Tree Replacement - Example
38 Effect of Tree Pruning by Sub-Tree Replacement
- The accuracy on test data increases as nodes are pruned
- (accuracy over the validation set used for pruning is not shown)
39 Post-pruning: Sub-Tree Raising
- a more complex operation than sub-tree replacement
- sub-tree raising is a potentially time consuming operation => it is restricted to raising the sub-tree of the most popular branch
- e.g. raise C only if the branch from B to C has more training examples than the branches from B to 4 or from B to 5; otherwise, if (for example) 4 were the majority daughter of B, consider raising 4 to replace B and re-classifying all examples under C, as well as the examples from 5, into the new node
40 Rule Post-Pruning - Example
- Grow the tree until the training data is fit
- Convert the tree into an equivalent set of rules by creating 1 rule for each path from the root to a leaf (a code sketch follows below)
if (outlook=sunny) AND (humidity=high) then PlayTennis=No ...
- Prune each rule by removing any preconditions whose removal improves its estimated accuracy
- consider removing (outlook=sunny) and then (humidity=high)
- select the pruning which produces the greatest improvement
- no pruning if it reduces the estimated rule accuracy
- Sort the pruned rules by their estimated accuracy, and consider them in this sequence when classifying subsequent instances
- To estimate accuracy: 1) a validation set of examples, or 2) a pessimistic error estimate based on the training data set (C4.5)
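A short Python sketch of the tree-to-rules conversion on the nested-dict representation from the earlier sketches; rule pruning and accuracy estimation are not shown:

```python
def tree_to_rules(tree, preconditions=()):
    """Return a list of (preconditions, class) pairs, one per root-to-leaf path."""
    if not isinstance(tree, dict):                      # leaf -> one rule
        return [(list(preconditions), tree)]
    rules = []
    for value, child in tree["branches"].items():
        rules += tree_to_rules(child,
                               preconditions + ((tree["attribute"], value),))
    return rules

# e.g. printing the rules of the illustrative tennis tree from earlier:
for conds, label in tree_to_rules(tennis_tree):
    body = " AND ".join(f"({a}={v})" for a, v in conds)
    print(f"if {body} then PlayTennis={label}")
```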
41 Rule Post-Pruning - cont.
- Why convert the DT to rules before pruning?
- Bigger flexibility
- when trees are pruned there are only 2 choices - remove the node completely or retain it
- when rules are pruned there are fewer restrictions
- preconditions (not nodes) are removed
- each branch in the tree (i.e. each rule) is treated separately
- removes the distinction between attribute tests that occur near the root of the tree and those near the leaves
- Advantage of rules over trees - it is easier to read rules than a tree
42 Numeric Attributes
- ID3 works only when all the attributes are nominal, but most real data sets contain numeric attributes => need for discretization
- for a numeric attribute we restrict the possibilities to a binary split (e.g. temp < 60)
- difference to nominal attributes: every numeric attribute offers many possible split points
- The solution is a straightforward extension (a code sketch follows below)
- sort the examples according to the values of the attribute
- identify adjacent examples that differ in their target classification and generate a set of candidate splits (split points are placed halfway)
- evaluate Gain (or another measure) for every possible split point and choose the best split point
- the Gain for the best split point is the Gain for the attribute
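A Python sketch of this extension, reusing entropy() from the earlier sketch; the example data are the temperature values from the next slide:

```python
def best_numeric_split(values, labels):
    """Sort by value, place candidate thresholds halfway between adjacent
    examples with different classes, and keep the split with the best gain."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    n = len(pairs)
    best_gain, best_threshold = 0.0, None
    for i in range(n - 1):
        if pairs[i][1] == pairs[i + 1][1]:      # same class: no candidate here
            continue
        threshold = (pairs[i][0] + pairs[i + 1][0]) / 2   # halfway split point
        left = [y for v, y in pairs if v < threshold]
        right = [y for v, y in pairs if v >= threshold]
        gain = (base - (len(left) / n) * entropy(left)
                     - (len(right) / n) * entropy(right))
        if gain > best_gain:
            best_gain, best_threshold = gain, threshold
    return best_threshold, best_gain

temps = [64, 65, 68, 69, 70, 71, 72, 73, 74, 75, 80, 81, 83, 85]
play  = ["yes", "no", "yes", "yes", "yes", "no", "no", "no", "yes",
         "yes", "no", "yes", "yes", "no"]
print(best_numeric_split(temps, play))
```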
43 Numeric Attributes - Example
- values of temperature (sorted) and their classes:
  64  65  68  69  70  71  72  73  74  75  80  81  83  85
  yes no  yes yes yes no  no  no  yes yes no  yes yes no
- 7 possible splits; consider the split between 70 and 71 (worked out below)
- Information gain for
- 1) temperature < 70.5: 4 yes, 1 no
- 2) temperature >= 70.5: 4 yes, 5 no
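Using the subset counts above with the entropy and Gain definitions from the earlier slides, the weighted remainder term for this split works out to about 0.895 bits:

```latex
I\!\left(\tfrac{4}{5},\tfrac{1}{5}\right) \approx 0.722 \text{ bits},\qquad
I\!\left(\tfrac{4}{9},\tfrac{5}{9}\right) \approx 0.991 \text{ bits}
```

```latex
\text{Remainder} = \tfrac{5}{14}(0.722) + \tfrac{9}{14}(0.991) \approx 0.895 \text{ bits},
\qquad \text{Gain}(S, temperature_{70.5}) = Entropy(S) - 0.895
```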
44 Alternative Measures for Selecting Attributes
- Problem: if an attribute is highly-branching (with a large number of values), Information gain will select it!
- imagine using an ID code (extreme case): the training examples will be separated into many very small subsets
- => highly-branching attributes are more likely to create pure subsets
- Information gain is biased towards choosing attributes with a large number of values
- this will result in overfitting
45 Highly-Branching Attributes - Example
46 Highly-Branching Attributes - Example cont.
- the weighted sum of entropies
- entropy at the root
- Gain
47 Gain Ratio
- Gain ratio: a modification of the Gain that reduces its bias towards highly branching attributes
- it takes the number and size of branches into account when choosing an attribute
- it penalizes highly-branching attributes by incorporating SplitInformation
- SplitInformation is the entropy of S wrt the values of A: SplitInformation(S,A) = -Σ_{v ∈ Values(A)} (|Sv|/|S|) log2(|Sv|/|S|)
- Gain ratio: GainRatio(S,A) = Gain(S,A) / SplitInformation(S,A) (a code sketch follows below)
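A small Python sketch of these definitions, reusing entropy() and information_gain() from the earlier sketches:

```python
from math import log2

def split_information(examples, attribute):
    """Entropy of S with respect to the values of the attribute."""
    n = len(examples)
    counts = {}
    for ex in examples:
        counts[ex[attribute]] = counts.get(ex[attribute], 0) + 1
    return -sum((c / n) * log2(c / n) for c in counts.values())

def gain_ratio(examples, labels, attribute):
    si = split_information(examples, attribute)
    # guard against a zero denominator when the attribute has a single value
    return information_gain(examples, labels, attribute) / si if si > 0 else 0.0
```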
48 Gain Ratio
- Gain ratios for the tennis data
- outlook: Gain=0.247, SplitInformation=1.577, GainRatio=0.156
- temperature: Gain=0.029, SplitInformation=1.362, GainRatio=0.021
- humidity: Gain=0.152, SplitInformation=1.000, GainRatio=0.152
- windy: Gain=0.048, SplitInformation=0.985, GainRatio=0.049
- => outlook still comes out on top, but humidity is now much closer as it splits the data into 2 subsets instead of 3
- however, ID code will still be preferred (although its advantage is greatly reduced)
49 Gain Ratio - Problem
- Problem with GainRatio: it may overcompensate
- it may choose an attribute just because its SplitInformation is much lower than for the other attributes
- standard fix: only consider attributes with greater Gain than the average Gain (over all the attributes examined)
50 Handling Examples with Missing Values
- Missing attribute values in the training data
- (x, class(x)) is a training example in S
- the value of attribute A for example x, A(x), is unknown
- When building the DT, what to do with the missing attribute value A(x)?
- Gain(S,A) has to be calculated at node n to evaluate whether to split on A
- 1) treat missing values as simply another possible value of the attribute; this assumes that the absence of a value is significant
- 2) ignore all instances with a missing attribute value - a tempting solution! But
- instances with missing values often provide a good deal of information
- sometimes the attributes whose values are missing play no part in the decision, in which case these instances are as good as any other
51 Handling Examples with Missing Values - 2
- 3) A(x) = the most common value for A among the training examples at n
- 4) A(x) = the most common value for A among the training examples at n with the same class as x (class(x))
52 Handling Examples with Missing Values - 3
- a more sophisticated solution (used in C4.5)
- assign a probability to each of the possible values of A; calculate these probabilities using the frequencies of the values of A among the examples at n
- example: A is the Boolean attribute wind
- instance x has a missing value for wind
- node n contains 6 examples with wind=true and 4 with wind=false
- => P(A(x)=true) = 0.6, P(A(x)=false) = 0.4
- 0.6 of instance x is distributed down the branch for wind=true and
- 0.4 of instance x down the branch for wind=false
- these fractional examples are used to compute Gain and can be further subdivided at subsequent branches of the tree if another missing attribute value must be tested
- the same fractioning strategy can be used for classification of new instances with missing values (a code sketch follows below)
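An illustrative Python sketch of the fractional strategy at classification time, on the nested-dict trees from the earlier sketches. It assumes each internal node additionally stores "branch_weights" (the value frequencies among the training examples at that node); that field is an assumption of this sketch, not part of the earlier representation:

```python
from collections import defaultdict

def classify_with_missing(tree, instance, weight=1.0, totals=None):
    """Accumulate class weights at the leaves; missing attribute values send
    fractional weight down every branch in proportion to branch_weights."""
    totals = totals if totals is not None else defaultdict(float)
    if not isinstance(tree, dict):                 # leaf: accumulate weight
        totals[tree] += weight
        return totals
    value = instance.get(tree["attribute"])        # None means missing value
    if value is not None:
        classify_with_missing(tree["branches"][value], instance, weight, totals)
    else:
        for v, subtree in tree["branches"].items():
            classify_with_missing(subtree, instance,
                                  weight * tree["branch_weights"][v], totals)
    return totals

# predicted class = the label with the largest accumulated weight:
# prediction = max(totals, key=totals.get)
```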
53 Handling Attributes with Different (External) Costs
- Consider medical diagnosis and the following attributes: temperature, biopsyResult, pulse, bloodTestResult
- attributes vary significantly in their costs (monetary cost and patient comfort)
- prefer DTs that use low-cost attributes where possible, relying on high-cost attributes only when needed to produce reliable classification
- How to learn a consistent tree with low expected cost?
- One approach: favour low-cost attributes by replacing Gain with a cost-sensitive measure
- Tan & Schlimmer (1990)
- Nunez (1988), where w ∈ [0,1] determines the importance of cost
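The slide leaves the two measures to the figures; the forms commonly attributed to these proposals (e.g. in Mitchell's Machine Learning, Ch. 3) are reproduced below as an assumption about what the slide shows:

```latex
\text{Tan and Schlimmer: } \frac{\mathit{Gain}^2(S,A)}{\mathit{Cost}(A)}
\qquad\qquad
\text{Nunez: } \frac{2^{\mathit{Gain}(S,A)} - 1}{(\mathit{Cost}(A) + 1)^w}
```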
54 Components of DT
- Model (structure)
- Tree (not pre-specified but derived from data)
- Preference (score function) - criteria used to measure the quality of the tree structures
- number of misclassifications over all examples (loss function)
- Search method (how the space of trees is searched by the algorithm)
- hill climbing search over tree structures (2 phases: grow and prune)
55 DTs - Summary
- Very popular ML technique
- Easy to implement
- Efficient
- Cost of building the tree: O(mn log n), with n instances and m attributes
- Cost of pruning the tree by sub-tree replacement: O(n)
- Cost of pruning by sub-tree raising: O(n (log n)^2)
- => the total cost of tree induction: O(mn log n) + O(n (log n)^2)
- Reference: Witten and Frank pp. 167-168
- The resulting hypothesis is easy for humans to interpret