Title: Project EMD-MLR Decision Tree Classifiers (Part 1)
1. Project EMD-MLR Decision Tree Classifiers (Part 1)
2. Presentation Outline
- Introduction to Pattern Recognition
- Introduction to the Decision Tree Classifier
- Important Tree Functions
- Growing Phase
- Pruning Phase
- Classify Phase
- Growing Phase
- Split Criteria
- Stopping Criteria
- Leaf Node Assignment
3. Presentation Outline
- Pruning Phase
- How the Pruning Phase Works
- Classify Phase
- Data with Categorical Attributes
- Data with non-uniform misclassification costs
- Computational Complexity of the Decision Tree
Classifier
4. Decision Tree Classifier Pruning Phase: How do we Prune a Tree?
- The tree that we normally grow is of larger size than needed
- The purpose of the growing phase is to build the largest tree possible, with the smallest misclassification error on the training set
- The purpose of the pruning phase is to produce a number of pruned versions of the tree (quite often more than one)
- These pruned versions of the largest tree
- have a smaller misclassification error (at least some of them do) than the largest tree on unseen data
- are easier to interpret than the largest tree is
5. Decision Tree Classifier Pruning Phase: How do we Prune a Tree?
- To prune the tree that was built in the growing phase we define a measure of tree goodness
- This measure depends on the misclassification error of the tree and on the size of the tree
- A smaller tree misclassification error makes this measure smaller
- A smaller tree size makes this measure smaller
- The measure, called the misclassification cost MC of a tree T, is defined as MC(T) = ME(T) + α · L(T), where α is a non-negative constant (α ≥ 0), ME(T) is the tree's misclassification error, and L(T) is the number of leaves in tree T
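Below is a minimal Python sketch of this measure; the function name and the idea of summarizing a candidate tree by its error and leaf count are illustrative choices, not something the slides prescribe.

```python
def misclassification_cost(error, num_leaves, alpha):
    """Cost-complexity measure MC(T) = ME(T) + alpha * L(T)."""
    assert alpha >= 0, "alpha must be a non-negative constant"
    return error + alpha * num_leaves

# Example with the numbers used on the next slides:
# the big tree has ME = 0 and 6 leaves, so at alpha = 0.05 its cost is 0.30
print(misclassification_cost(0.0, 6, 0.05))
```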
6. Decision Tree Classifier Pruning Phase: The Big Tree
7. Decision Tree Classifier Pruning Phase: The Big Tree (BT)
- The misclassification error of the big tree is equal to 0
- The number of leaves in the big tree is equal to 6
8. Decision Tree Classifier Pruning Phase: The BT's Misclassification Cost
9. Decision Tree Classifier Pruning Phase: Big Tree minus branches of node 9 (BT-9)
- The misclassification error of the big tree minus the branches of node 9 is equal to 1/10
- The number of leaves in the big tree minus the branches of node 9 is equal to 5
10. Decision Tree Classifier Pruning Phase: The (BT-9)'s Misclassification Cost
11. Decision Tree Classifier Pruning Phase: Comparisons of MC(BT) and MC(BT-9)
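Written out with the error and leaf counts from the preceding slides (α is the constant from the measure MC defined earlier), the comparison shown on this slide is:

MC(BT) = 0 + α · 6 = 6α
MC(BT-9) = 1/10 + α · 5

MC(BT-9) ≤ MC(BT)  ⟺  1/10 + 5α ≤ 6α  ⟺  α ≥ 1/10

So the pruned tree BT-9 becomes the cheaper tree once α reaches 1/10.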
12. Decision Tree Classifier Pruning Phase: Big Tree minus branches of node 7 (BT-7)
- The misclassification error of the big tree minus the branches of node 7 is equal to 1/10
- The number of leaves in the big tree minus the branches of node 7 is equal to 4
13. Decision Tree Classifier Pruning Phase: The (BT-7)'s Misclassification Cost
14. Decision Tree Classifier Pruning Phase: Comparisons of MC(BT) and MC(BT-7)
15. Decision Tree Classifier Pruning Phase: Big Tree minus branches of node 4 (BT-4)
- The misclassification error of the big tree minus the branches of node 4 is equal to 2/10
- The number of leaves in the big tree minus the branches of node 4 is equal to 3
16. Decision Tree Classifier Pruning Phase: The (BT-4)'s Misclassification Cost
17. Decision Tree Classifier Pruning Phase: Comparisons of MC(BT) and MC(BT-4)
18. Decision Tree Classifier Pruning Phase: Big Tree minus branches of node 3 (BT-3)
- The misclassification error of the big tree minus the branches of node 3 is equal to 2/10
- The number of leaves in the big tree minus the branches of node 3 is equal to 2
19. Decision Tree Classifier Pruning Phase: The (BT-3)'s Misclassification Cost
20. Decision Tree Classifier Pruning Phase: Comparisons of MC(BT) and MC(BT-3)
21. Decision Tree Classifier Pruning Phase: Big Tree minus branches of node 1 (BT-1)
- The misclassification error of the big tree minus the branches of node 1 is equal to 5/10
- The number of leaves in the big tree minus the branches of node 1 is equal to 1
22. Decision Tree Classifier Pruning Phase: The (BT-1)'s Misclassification Cost
23. Decision Tree Classifier Pruning Phase: Comparisons of MC(BT) and MC(BT-1)
24. Decision Tree Classifier Pruning Phase: Which node's branches do we prune?
- We find the pruned tree (BT-9, or BT-7, or BT-4, or BT-3, or BT-1) whose misclassification cost becomes smaller first, as compared to the misclassification cost of the big tree, as α increases (see the computation below)
- The pruned tree for which this event happens first is BT-3 (see next slide)
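The same comparison can be carried out for all candidate prunings at once. The short Python sketch below is illustrative (the names are not from the slides); the error and leaf counts are the ones listed on the preceding slides, and for each candidate it computes the value of α at which that candidate's cost first matches the cost of the big tree.

```python
# (training misclassification error, number of leaves), from the preceding slides
big_tree = (0.0, 6)
candidates = {
    "BT-9": (1/10, 5),
    "BT-7": (1/10, 4),
    "BT-4": (2/10, 3),
    "BT-3": (2/10, 2),
    "BT-1": (5/10, 1),
}

def critical_alpha(big, pruned):
    """alpha at which MC(pruned) = ME + alpha*L first drops to MC(big)."""
    error_increase = pruned[0] - big[0]   # error added by pruning
    leaves_removed = big[1] - pruned[1]   # leaves removed by pruning
    return error_increase / leaves_removed

for name, tree in candidates.items():
    print(name, round(critical_alpha(big_tree, tree), 3))
# BT-9 0.1, BT-7 0.05, BT-4 0.067, BT-3 0.05, BT-1 0.1
# BT-7 and BT-3 reach the big tree's cost at the same alpha (0.05);
# of the two, the slides keep the smaller tree, BT-3
```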
25. Decision Tree Classifier Pruning Phase: Comparisons of MC(BT), MC(BT-9), MC(BT-7), MC(BT-4), MC(BT-3), MC(BT-1)
26. Decision Tree Classifier Pruning Phase: The Chosen Pruned Tree BT-3
27. Decision Tree Classifier Pruning Phase: What Next?
- Pruning continues along the same lines
- But now we will apply additional pruning to the already pruned big tree
- That is, we will apply additional pruning to the tree BT-3
28. Decision Tree Classifier Pruning Phase: The Tree BT-3
29. Decision Tree Classifier Pruning Phase: BT-3 minus branches of node 1 (BT-3-1)
- The misclassification error of BT-3 minus the branches of node 1 is equal to 5/10
- The number of leaves in the BT-3 tree minus the branches of node 1 is equal to 1
30. Decision Tree Classifier Pruning Phase: The (BT-3-1)'s Misclassification Cost
31. Decision Tree Classifier Pruning Phase: Comparisons of MC(BT-3) and MC(BT-3-1)
32. Decision Tree Classifier Pruning Phase: Which node's branches do we prune?
- We find the pruned tree (BT-3-1) whose misclassification cost becomes smaller than the misclassification cost of BT-3 first, as α increases
- Here we have no competition amongst pruned trees
- So, at some appropriate value of α, this will happen (worked out below)
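Written out with the numbers from the preceding slides, the value of α at which this happens is:

MC(BT-3) = 2/10 + α · 2
MC(BT-3-1) = 5/10 + α · 1

MC(BT-3-1) ≤ MC(BT-3)  ⟺  5/10 + α ≤ 2/10 + 2α  ⟺  α ≥ 3/10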
33. Decision Tree Classifier Pruning Phase: The Tree BT-3-1
34. Decision Tree Classifier Pruning Phase: What Next?
- There is no more pruning that we can apply, since, through pruning, we have ended up with a tree consisting only of the root of the tree
- So we declare the pruning process complete
- The pruning process discovered two pruned trees
- These trees are the trees BT-3 and BT-3-1
35. Decision Tree Classifier Pruning Phase: The Pruned Trees BT-3 and BT-3-1
[Figure: the pruned trees BT-3 and BT-3-1; node 1 holds data points 1-5 (class A) and 6-10 (class B)]
36. Decision Tree Classifier Classify Phase
- We have already gone through the growing and the pruning phases of the decision tree
- We have discovered that three trees were worth storing for further consideration
- These are
- The Big Tree (BT)
- The Big Tree minus the branches of node 3 (BT-3)
- The Big Tree minus the branches of node 1, i.e., the root node only (BT-3-1)
- We are now ready to examine the performance of two trees (BT and BT-3) on unseen data
- The performance of the BT-3-1 tree on unseen data will not be examined
37. Decision Tree Classifier Classify Phase: Performance of BT on New Data (147/160)
38. Decision Tree Classifier Classify Phase: Performance of BT on New Data (12/160)
39. Decision Tree Classifier Classify Phase: Performance of BT-3 on New Data (138/160)
40. Decision Tree Classifier Classify Phase: Performance of BT-3 on New Data (22/160)
41. Decision Tree Classifier Classify Phase: Which one is the Best Tree?
- Out of the available possibilities (the unpruned big tree and its pruned versions) we choose
- the tree that has the smallest, or close to the smallest, classification error on new data, and
- the smallest size (number of leaves)
- For instance, in the previous example we could choose the pruned tree BT-3 as our preferred tree, because it has reasonable performance on the test set and fewer leaves
- Sometimes the choice is not that obvious (a simple selection rule is sketched below)
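One simple way to make this choice automatic is sketched below in Python. The rule itself (accept any tree whose test error is within a small tolerance of the best, then take the one with the fewest leaves) and the numbers in the example are illustrative assumptions, not something the slides prescribe.

```python
def pick_tree(candidates, tolerance=0.02):
    """candidates: tree name -> (test error, number of leaves).
    Return the smallest tree whose test error is within `tolerance` of the best."""
    best_error = min(error for error, _ in candidates.values())
    acceptable = {name: leaves for name, (error, leaves) in candidates.items()
                  if error <= best_error + tolerance}
    return min(acceptable, key=acceptable.get)

# Illustrative numbers only:
print(pick_tree({"BT": (0.08, 6), "BT-3": (0.12, 2)}, tolerance=0.05))  # BT-3
```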
42. Decision Tree Classifier Special Features: Data with non-uniform misclassification costs
43. Decision Tree Classifier Special Features: Non-uniform misclassification costs
- In this example, the misclassification cost of mistaking A for B is equal to 1
- The misclassification cost of mistaking B for A is equal to 2
- Hence, it is twice as expensive to predict A when B is true than the other way around
- There are no costs associated with making the correct prediction
Cost matrix (T = true class, P = predicted class):
P\T   A   B
A     0   2
B     1   0
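To see how such a cost matrix is used, here is a minimal Python sketch of labeling a leaf by the smallest total misclassification cost; the function and the example counts are illustrative, not taken from the slides.

```python
# cost[predicted][true]: cost of predicting `predicted` when `true` is the actual class
cost = {"A": {"A": 0, "B": 2},
        "B": {"A": 1, "B": 0}}

def best_label(counts, cost):
    """counts: class -> number of training points of that class in the leaf."""
    total = lambda predicted: sum(cost[predicted][true] * n for true, n in counts.items())
    return min(cost, key=total)

# A leaf holding 3 class-A points and 2 class-B points:
# predicting A costs 2*2 = 4, predicting B costs 1*3 = 3, so the leaf is labeled B
print(best_label({"A": 3, "B": 2}, cost))
```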
44. Decision Tree Classifier Special Features: Data with non-uniform misclassification costs
Misclassification cost for class B is twice as big as the misclassification cost for class A
45. Decision Tree Classifier Special Features: Tree grown from data with non-uniform costs
- The growing phase of the tree works in a similar fashion as if all misclassification costs were equal
- But now the tree operates as if the class B data have twice as much weight as the class A data (see the small example below)
- The figure on the next slide shows the fully grown tree
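A small worked example of this weighting (the counts are illustrative; the factor of two is the one from the cost matrix above): consider a node that holds 3 class-A points and 2 class-B points.

with uniform costs:        A: 3  vs.  B: 2          -> majority label A
with class B weighted ×2:  A: 3  vs.  B: 2 × 2 = 4  -> label B

This is the same decision the minimum-expected-cost rule sketched earlier makes for that node.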
46. Decision Tree Classifier Special Features: The big tree grown from data with non-uniform costs
Misclassification cost for class B is twice as big as the misclassification cost for class A
47. Decision Tree Classifier Special Features: The big tree grown from data with uniform costs
Misclassification cost for class B is the same as the misclassification cost for class A
48. Decision Tree Classifier Special Features: Pruned tree produced from data with non-uniform costs
Misclassification cost for class B is twice as big as the misclassification cost for class A
49. Decision Tree Classifier Special Features: Pruned tree produced from data with uniform costs
Misclassification cost for class B is the same as the misclassification cost for class A
50. Decision Tree Classifier Special Features: Differences (uniform vs. non-uniform costs)
- It turns out that, for this particular case, the grown and pruned trees show no major differences between uniform and non-uniform costs (when it is twice as costly to misclassify class B data as it is to misclassify class A data)
- But there are differences in the estimates of the misclassification errors for the trees grown with uniform costs versus the trees grown with non-uniform costs
- For instance, observe in the following figure the misclassification error of the right child of the root node of the tree for non-uniform and for uniform costs
51. Decision Tree Classifier Special Features: Differences (uniform vs. non-uniform costs)
Pruned tree for non-uniform cost
Pruned tree for uniform cost
Misclassification error of the right child is equal to 2/10
Misclassification error of the right child is equal to 2/15
52. Decision Tree Classifier Special Features: Non-uniform misclassification costs
- In this example, the misclassification cost of mistaking B for A is equal to 1
- The misclassification cost of mistaking A for B is equal to 2
- Hence, it is twice as expensive to predict B when A is true than the other way around
- There are no costs associated with making the correct prediction
Cost matrix (T = true class, P = predicted class):
P\T   A   B
A     0   1
B     2   0
53. Decision Tree Classifier Special Features: Data with non-uniform misclassification costs
54. Decision Tree Classifier Special Features: Tree grown from data with non-uniform costs
- The growing phase of the tree works in a similar fashion as if all misclassification costs were equal
- But now the tree operates as if the class A data have twice as much weight as the class B data
- The figure on the next slide shows the fully grown tree
55. Decision Tree Classifier Special Features: The big tree grown from data with non-uniform costs
Misclassification cost for class A is twice as big as the misclassification cost for class B
56. Decision Tree Classifier Special Features: The big tree grown from data with uniform costs
Misclassification cost for class A is the same as the misclassification cost for class B
57. Decision Tree Classifier Special Features: 1st Pruned tree produced from data with non-uniform costs
Misclassification cost for class A is twice as big as the misclassification cost for class B
58. Decision Tree Classifier Special Features: 2nd Pruned tree produced from data with non-uniform costs
Misclassification cost for class A is twice as big as the misclassification cost for class B
59. Decision Tree Classifier Special Features: Pruned tree produced from data with uniform costs
Misclassification cost for class A is the same as the misclassification cost for class B
60. Decision Tree Classifier Special Features: Differences (uniform vs. non-uniform costs)
- It turns out that, for this particular case of non-uniform costs, the grown and pruned trees are different from the ones grown and pruned for the case of uniform costs
- Furthermore, in this case of non-uniform costs, there are differences in the estimates of the misclassification errors for the trees grown and pruned, compared to the trees grown and pruned for the case of uniform costs
- For instance, observe in the following figure the misclassification error of the right child of the root node for non-uniform and for uniform costs
61. Decision Tree Classifier Special Features: Differences (uniform vs. non-uniform costs)
Pruned tree for non-uniform cost
Pruned tree for uniform cost
Misclassification error of the right child is equal to 2/10
Misclassification error of the left child is equal to 2/15
62. Decision Tree Classifier: Computational Complexity
- We assume that a training file is given to us for the training of the classifier (e.g., iris_train)
- N = number of data-points (rows) of our training set
- d = number of input attributes (columns) of our training set
- J = number of distinct classes that the data could belong to; this corresponds to the number of distinct elements of the last column of the training set (e.g., 1, 2, or 3 for iris_train)
63. Decision Tree Classifier: Computational Complexity
- Let us first find the complexity involved in checking the splits associated with the first attribute of our training set
- Let us also assume that the first attribute is a numerical attribute
- First, we sort the N data-points in our training set with respect to the first attribute. This step requires on the order of N · log2(N) operations
- Secondly, we find the possible splits with respect to this attribute (at most N - 1 of them, one between each pair of consecutive sorted values). For each one of these possible splits we find the corresponding gain in information, if that split were applied to the data. The complexity of this step is proportional to N (see the sketch below)
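A minimal Python sketch of these two steps for one numerical attribute; the split score used here is only a stand-in (the slides do not fix a particular split criterion at this point), and a real implementation would update the class counts incrementally so that step 2 stays proportional to N.

```python
def candidate_splits(values, labels):
    """Step 1: sort by the attribute (order N*log N).
    Step 2: score every threshold between consecutive distinct sorted values (order N thresholds)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    xs = [values[i] for i in order]
    ys = [labels[i] for i in order]
    splits = []
    for k in range(1, len(xs)):
        if xs[k] == xs[k - 1]:
            continue                              # no threshold between equal values
        threshold = (xs[k - 1] + xs[k]) / 2
        splits.append((threshold, split_score(ys[:k], ys[k:])))
    return splits

def split_score(left, right):
    """Illustrative score: total number of majority-class points on the two sides."""
    majority = lambda side: max(side.count(c) for c in set(side))
    return majority(left) + majority(right)

print(candidate_splits([2.0, 1.0, 3.0, 5.0], ["A", "A", "B", "B"]))
# [(1.5, 3), (2.5, 4), (4.0, 3)] -- the threshold 2.5 separates the classes best
```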
64. Decision Tree Classifier: Computational Complexity
- In review, the complexity of steps 1 and 2 is proportional to N · log2(N) + N, i.e., essentially N · log2(N)
- We have to apply steps 1 and 2 d times, to account for every attribute in our training set. Thus, the complexity of performing the first split with the decision tree classifier is proportional to d · N · log2(N)
65. Decision Tree Classifier: Computational Complexity
- We need to continue reapplying Steps 1 and 2 until the tree reaches the point where no node can be split any more (this happens when every remaining node either contains a single data-point or is pure)
- The complexity of reapplying these steps (if all the attributes are numerical), until the tree cannot grow any more, is proportional, in the best case, to d · N · (log2 N)^2
66. Decision Tree Classifier: Computational Complexity
- The complexity of reapplying these steps (if all the attributes are numerical), until the tree cannot grow any more, is proportional, in the worst case, to d · N^2 · log2(N)
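One way to arrive at bounds of this shape (a sketch, under the assumption that every node re-sorts its own portion of the data when Steps 1 and 2 are reapplied): the nodes at any one depth level of the tree partition the N data-points, so the total work at that level is at most about d · N · log2(N). The two bounds then differ only in the number of levels:

best case (balanced tree, about log2(N) levels):   d · N · log2(N) · log2(N) = d · N · (log2 N)^2
worst case (one point split off at a time, about N levels):   d · N · log2(N) · N = d · N^2 · log2(N)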
67. Decision Tree Classifier: Computational Complexity
- Example of the Decision Tree Classifier's Complexity
- N = 1000
- d = 10
- All attributes are numerical
- Best-case complexity is proportional to d · N · (log2 N)^2
- Worst-case complexity is proportional to d · N^2 · log2(N)
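Plugging these numbers into the bounds above (with log2(1000) ≈ 10) gives rough orders of magnitude:

best case:   d · N · (log2 N)^2 ≈ 10 · 1000 · 10^2 ≈ 10^6 operations
worst case:  d · N^2 · log2(N) ≈ 10 · 1000^2 · 10 ≈ 10^8 operations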
68. Decision Tree Classifier: Computational Complexity
- Example of the Decision Tree Classifier's Complexity
- N = 10000
- d = 50
- All attributes are numerical
- Best-case complexity is proportional to d · N · (log2 N)^2
- Worst-case complexity is proportional to d · N^2 · log2(N)
69. Decision Tree Classifier: Computational Complexity
- What happens if one or more of the input attributes are categorical attributes?
- For an attribute that is categorical with L possible distinct values we have to check either
- a number of splits equal to 2^(L-1) - 1 (if we do complete enumeration; see the sketch below), or
- a number of splits approximately equal to ... (if we use the IND criterion)
70. Decision Tree Classifier: Computational Complexity
- If the number of splits (2^(L-1) - 1 with complete enumeration, or the count given by the IND criterion) that we have to check for each categorical attribute does not exceed the number of splits (N) that we have to check for a numerical attribute, then the previous complexity formulas, d · N · (log2 N)^2 and d · N^2 · log2(N), are still valid
- The first formula above is the best-case scenario, while the second formula above is the worst-case scenario
71. Decision Tree Classifier: Computational Complexity
- Consider an example where we have one categorical attribute with L = 11 distinct values. Then (for a complete enumeration of splits) we have to check 2^10 - 1 = 1023 splits for that attribute
- As a result, the example with N = 1000, d = 10 (of which 9 are numerical attributes and the 10th attribute is a categorical attribute) has computational complexity proportional to a number in the interval between the best-case and worst-case estimates given earlier, since 1023 only barely exceeds N = 1000
72. Decision Tree Classifier: Computational Complexity
- Consider an example where we have one categorical attribute with L = 14 distinct values. Then (for a complete enumeration of splits) we have to check 2^13 - 1 = 8191 splits for that attribute
- As a result, the example with N = 10000, d = 50 (of which 49 are numerical attributes and the 50th attribute is a categorical attribute) has computational complexity proportional to a number in the interval between the best-case and worst-case estimates given earlier, since 8191 does not exceed N = 10000
Notice how a small increase in the number of distinct values of the categorical attribute (from 11 to 14) causes a significant increase in the number of split values that need to be examined (from 1023 to 8191)