Title: Project EMD-MLR Decision Tree Classifiers (Part 1)
1. Project EMD-MLR Decision Tree Classifiers (Part 1)
2. Presentation Outline
- Introduction to Pattern Recognition
- Introduction to the Decision Tree Classifier
- Important Tree Functions
  - Growing Phase
  - Pruning Phase
  - Classify Phase
- Growing Phase
  - Split Criteria
  - Stopping Criteria
  - Leaf Node Assignment
3. Presentation Outline (continued)
- Pruning Phase
  - How the Pruning Phase Works
- Classify Phase
- Data with Categorical Attributes
- Data with Non-Uniform Misclassification Costs
- Computational Complexity of the Decision Tree Classifier
4. Pattern Recognition
- The ease with which we recognize a face, understand spoken words, read handwritten characters, identify our keys in our pocket by feel, and decide whether an apple is ripe by its smell belies the astoundingly complex processes that underlie these acts of pattern recognition (Duda and Hart, 2001).
5. Pattern Recognition: Definition
- Pattern recognition -- the act of taking in raw data and taking an action based on the category of the pattern -- has been crucial for our survival, and over the years we have tried to develop algorithms that duplicate this amazing ability of humans to recognize patterns (Duda and Hart, 2001).
6. Pattern Recognition: Algorithms
- These algorithms are referred to as Pattern Recognition Algorithms or Pattern Classification Algorithms.
- An example of a pattern recognition (pattern classification) algorithm is the decision tree algorithm (decision tree classifier).
7. Pattern Recognition: Example
- To understand the complexity of a pattern recognition system, let us consider a simple example: recognizing the type of an Iris plant.
8. Pattern Recognition: A Case Study -- Iris Data
- The Iris data consist of 150 data points of three different types of flowers:
  - Iris Virginica
  - Iris Setosa
  - Iris Versicolor
  - Analogy: Failed/Non-Failed Blades
- Each datum has four attributes (see the loading sketch after this slide):
  - Sepal Length (cm) (Feature 1)
  - Sepal Width (cm) (Feature 2)
  - Petal Length (cm) (Feature 3)
  - Petal Width (cm) (Feature 4)
  - Analogy: Operating Hours, Starts, etc.
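The slides describe the Iris data abstractly; purely as an illustration, here is a minimal sketch that inspects the same 150-sample, 4-feature dataset using scikit-learn's bundled copy (the use of scikit-learn is my assumption, not something the slides prescribe).

```python
# Illustrative sketch: inspecting the Iris data with scikit-learn's bundled copy.
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target          # X: (150, 4) feature matrix, y: class labels 0..2

print(iris.feature_names)              # sepal length/width, petal length/width (cm)
print(iris.target_names)               # ['setosa' 'versicolor' 'virginica']
print(X.shape, y.shape)                # (150, 4) (150,)
```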
9. Pattern Recognition: Components of a Pattern Recognition System
Problem Data → Feature Extraction → Feature Selection → Pattern Classification → Pattern Classes
10. Pattern Recognition: Feature Extraction, Selection, and Classification
- The feature extraction module has the purpose of extracting (or collecting) some important information for the task at hand.
- The feature selection module has the purpose of selecting the features that are important to achieve the objective of interest.
- The classifier module has the purpose of classifying the data, relying on the information conveyed by the selected features.
11. Pattern Recognition: Feature Extraction -- Iris Data
- In our case the features have already been extracted from the data; they are:
  - Sepal Length (Feature 1)
  - Sepal Width (Feature 2)
  - Petal Length (Feature 3)
  - Petal Width (Feature 4)
- Analogy: features already extracted for the blade data include Operating Hours (OH), various types of Trips (TR), etc.
12. Pattern Recognition: Feature Selection -- Iris Data
- In this case we are trying to determine which features are the most important for recognizing (classifying) the type of the Iris plant.
- Colored scatter plots (2-D plots) of two features at a time might be useful (see next slide).
- Analogy: scatter plots of blade feature data, such as Operating Hours (OH), Trips (TR), and Fired Aborts (FA).
13. Pattern Recognition: Iris Data Feature Selection
14. Pattern Recognition: Histogram of the Petal Length Feature
15. Pattern Recognition: Histogram of the Petal Width Feature
16. Pattern Recognition: Simple Classifier Model (Model 1)
17. Pattern Recognition: Simple Classifier Model (Model 2)
18. Pattern Recognition: More Complex Classifier Model (Model 3)
19. Pattern Recognition: Performance of Model 1 -- Separating Planes for Testing Data
20. Pattern Recognition: Performance of Model 2
21. Pattern Recognition: Performance of Model 3
22. Pattern Recognition: Selection of a Classifier Model
- A classifier model is normally selected based on the following measures of goodness:
  - Performance of the classifier model on previously unseen data
  - Simplicity of the classifier
- Other measures of goodness might be of interest to the designer, such as:
  - Computational complexity of the classifier
  - Robustness of the classifier in the presence of noise
23. Pattern Recognition System: Selection of a Classifier Model
- An example of such a classifier model is the Decision Tree Classifier.
24. Decision Tree Classifier: General Overview
- The method for constructing a decision tree classifier from a collection of data is easy to understand.
- The data consist of data attributes (e.g., operating hours, number of starts) and the class label (scrapped versus non-scrapped blade).
- Initially all the data belong to the same set, located at the root of the tree.
- Then a data attribute is chosen, and a test on this attribute is employed to split the data into smaller subsets with higher percentages of one-class labels.
- Decision trees help you understand what types of data attributes and attribute values lead to certain class labels (see the sketch below).
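To make the overview concrete, here is a hedged sketch of growing a decision tree on the Iris data; the use of scikit-learn's DecisionTreeClassifier and the particular settings are my assumptions, not the project's own implementation.

```python
# Illustrative sketch: growing a decision tree classifier on the Iris data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# All training data start at the root; the tree repeatedly picks an attribute
# and a split value that make the resulting subsets purer.
clf = DecisionTreeClassifier(criterion="gini", random_state=0)
clf.fit(X_train, y_train)

print("accuracy on unseen data:", clf.score(X_test, y_test))
```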
25. Decision Tree Classifier: Graphic Representation of a Tree Classifier
- Node 0: root node. Nodes 1 and 2: children of node 0; node 0 is the parent of nodes 1 and 2. Nodes 1, 3, and 4: leaves of the tree.
[Figure: example tree with nodes 0-4, with one branch and one node labeled]
26. Decision Tree Classifier: Operational Phases of the Tree Classifier
- The decision tree has three distinct but interrelated phases (sketched below). These are:
  - Growing Phase
  - Pruning Phase
  - Test (Classify/Performance) Phase
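A purely illustrative skeleton of these three phases; the class and method names are hypothetical and only mirror the phase names listed above.

```python
# Hypothetical skeleton (names are my own) mirroring the three phases.
class DecisionTree:
    def grow(self, X, y):
        """Recursively split the training data until a stopping criterion fires."""
        ...

    def prune(self, X_val, y_val):
        """Cut back branches that do not improve performance on validation data."""
        ...

    def classify(self, X_new):
        """Drop each new pattern down the tree and return the leaf's class label."""
        ...
```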
27. Decision Tree Classifier: Growing Phase of the Decision Tree Classifier
[Figure: the example tree (nodes 0-4) being grown]
28. Decision Tree Classifier: Pruning Phase of the Decision Tree Classifier
[Figure: the example tree (nodes 0-4) being pruned]
29. Decision Tree Classifier: Advantages of a Decision Tree Classifier
- It requires easy-to-understand elements for its design, such as:
  - A set of questions Q
  - A rule for selecting the best split at any node
  - A criterion for choosing the right-size tree
- It can be applied to any data structure through the appropriate formulation of the set of questions Q.
30. Decision Tree Classifier: Advantages of a Decision Tree Classifier
- It handles both ordered and categorical variables.
  - Ordered variable: a variable assuming values from a set such as {1, 2, 3, 4, 5, 6, ...}
  - Categorical variable: a variable assuming values from a set such as {green, red, blue, orange, ...}
- The final classification has a simple form which can be compactly stored to efficiently classify new data.
31. Decision Tree Classifier: Advantages of a Decision Tree Classifier
- It does automatic stepwise variable selection and complexity reduction.
- It gives, with no additional effort, not only a classification but also an estimate of the misclassification probability for the object.
32. Decision Tree Classifier: Advantages of a Decision Tree Classifier
- It is invariant under all monotone transformations of the individual ordered variables.
  - E.g., after multiplying a specific feature by a constant (say, changing its measurement units), the resulting decision tree remains unchanged.
- It is extremely robust to outliers and misclassified points in the training set used for the tree's design.
33. Decision Tree Classifier: Advantages of a Decision Tree Classifier
- The tree procedure gives easily understood and interpreted information regarding the predictive structure of the data.
  - Given a decision tree, you can extract simple IF-THEN rules that show the thought process of the tree when it classifies (see the sketch below).
- It has been used successfully in a variety of applications (see the following slides).
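As an illustration of the IF-THEN structure mentioned above, the sketch below fits a small tree on the Iris data and prints its rules with scikit-learn's export_text; the tool choice and the depth limit are my assumptions.

```python
# Sketch: extracting the simple IF-THEN structure from a fitted tree.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(iris.data, iris.target)

# Each path from the root to a leaf reads as an IF-THEN rule on the features.
print(export_text(clf, feature_names=list(iris.feature_names)))
```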
34. Decision Tree Classifier: Applications of a Decision Tree Classifier
- Medical Applications
  - Wisconsin Breast Cancer (predict whether a tissue sample taken from a patient is malignant or benign; two classes, nine numerical attributes)
  - BUPA Liver Disorders (predict whether or not a male patient has a liver disorder based on blood tests and alcohol consumption; two classes, six numerical attributes)
35. Decision Tree Classifier: Applications of a Decision Tree Classifier
- Medical Applications
  - Pima Indian Diabetes (the patients are females at least 21 years old, of Pima Indian heritage, living near Phoenix, Arizona; the problem is to predict whether a patient would test positive for diabetes; there are two classes and seven numerical attributes)
  - Heart Disease (the problem here is to predict the presence or absence of heart disease based on various medical tests; there are two classes, seven numerical attributes, and six categorical attributes)
36. Decision Tree Classifier: Applications of a Decision Tree Classifier
- Image Recognition Applications
  - Satellite Image (this dataset gives the multi-spectral values of pixels within 3x3 neighborhoods in a satellite image, together with the classification associated with the central pixel; the aim is to predict the classification given the multi-spectral values; there are six classes and thirty-six numerical attributes)
  - Image Segmentation (this is a database of seven outdoor images; every pixel should be classified as brickface, sky, foliage, cement, window, path, or grass; there are seven classes and nineteen numerical attributes)
37. Decision Tree Classifier: Applications of a Decision Tree Classifier
- Other Applications
  - Boston Housing (this dataset gives housing values in Boston suburbs; there are three classes, twelve numerical attributes, and one binary attribute)
  - Congressional Voting Records (this database gives the votes of each member of the U.S. House of Representatives of the 98th Congress on sixteen key issues; the problem is to classify a congressman as a Democrat or a Republican based on the sixteen votes; there are two classes and sixteen categorical attributes (yea, nay, neither))
38. Decision Tree Classifier: Growing Phase
- The growing phase of the tree revolves around three elements:
  - The selection of the splits
  - The decision of when to designate a node as terminal or to continue splitting it
  - How to determine the class assignments of the terminal nodes
39. Decision Tree Classifier: Selection of Splits
- What is a split?
  - Each node of the tree represents a box (a rectangle in 2 dimensions) in the feature space.
  - Growing of the tree is accomplished by splitting the box into 2 new boxes.
  - The node t representing the original box becomes the parent node of the two nodes (children tL and tR) representing the 2 new boxes.
  - A rectangle can be split in two ways: across the x1 or the x2 dimension.
40. Decision Tree Classifier: Selection of Splits
- What is a split? (continued)
  - A box in n dimensions can be split in many different ways.
  - The dimension along which we perform the split is called the split attribute or split feature.
  - The specific value at which the split occurs is called the split value.
- What do we accomplish by splitting?
  - The growing of a tree whose terminal nodes represent very specific rules.
  - Smaller rectangles that contain patterns, most of which are of the same class label, will provide us with very specific, accurate classification rules.
- How do we select a good split?
  - We select a split attribute and corresponding split value so that the resulting children nodes are purer.
41. Decision Tree Classifier: Selection of Splits
- Define p(j|t) as the proportion of class j cases at node t of the tree.
- Define also i(t) as a measure of impurity for node t of the decision tree. Note that:
  - i(t) is a nonnegative function
  - it depends on the probabilities p(1|t), ..., p(J|t), where J is the number of different classes
  - it achieves its maximum value when the p(j|t) are all equal to 1/J
  - it achieves its minimum value (equal to 0) when one of the p(j|t) is equal to 1 and the rest of the p(j|t) are equal to 0
42. Decision Tree Classifier: Examples of Impurity Measures
- The entropy impurity measure: i(t) = -Σ_j p(j|t) log2 p(j|t)
- The Gini function impurity measure: i(t) = 1 - Σ_j p(j|t)^2 (both measures are implemented in the sketch below)
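A minimal sketch of the two impurity measures, written directly in terms of the class proportions p(j|t); the function names are my own.

```python
# Minimal sketch of the entropy and Gini impurity measures.
import numpy as np

def entropy_impurity(p):
    """i(t) = -sum_j p(j|t) * log2 p(j|t), with 0*log(0) taken as 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def gini_impurity(p):
    """i(t) = 1 - sum_j p(j|t)^2."""
    p = np.asarray(p, dtype=float)
    return float(1.0 - np.sum(p ** 2))

print(entropy_impurity([0.5, 0.5]))   # 1.0  (maximum for two equally likely classes)
print(gini_impurity([0.5, 0.5]))      # 0.5
print(gini_impurity([1.0, 0.0]))      # 0.0  (pure node)
```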
43. Decision Tree Classifier: Entropy and Gini Impurity Functions
44. Decision Tree Classifier: Selection of Splits
- Selection procedure:
  - Given a node t, select the split attribute n and the split value s so that the difference between the impurity of t and the average impurity of the children tL and tR is maximized, viz. Δi(n, s, t) = i(t) - pL·i(tL) - pR·i(tR), where pL and pR are the proportions of the data at t sent to tL and tR. (A sketch of this search follows.)
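A sketch (assuming the Gini impurity and a single ordered feature) of the selection rule above: scan the candidate split values s and keep the one that maximizes Δi(s, t) = i(t) - pL·i(tL) - pR·i(tR). The helper names and the toy data are my own.

```python
# Sketch of the split-selection rule for one candidate attribute.
import numpy as np

def gini(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(x, y):
    """x: one ordered feature, y: class labels. Returns (best split value, best delta_i)."""
    parent = gini(y)
    best_s, best_gain = None, -np.inf
    for s in np.unique(x)[:-1]:              # candidate thresholds
        left, right = y[x <= s], y[x > s]
        p_left = len(left) / len(y)
        gain = parent - p_left * gini(left) - (1 - p_left) * gini(right)
        if gain > best_gain:
            best_s, best_gain = s, gain
    return float(best_s), float(best_gain)

# Toy usage: the class changes between x = 4 and x = 6.
x = np.array([1, 2, 3, 4, 6, 7, 8, 9], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(best_split(x, y))                      # splits at 4.0 with delta_i = 0.5
```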
45. Decision Tree Classifier: Selection of Splits -- An Example
- Example:
  - Let a node t be represented by the rectangle 0 ≤ x1 ≤ 10, 0 ≤ x2 ≤ 10, containing 50 patterns of class 1 (blue) and 50 patterns of class 2 (red).
  - Let's consider splitting t along the x1 attribute.
46. Decision Tree Classifier: Differences between Impurity Functions
- Difference in impurities versus the x1 split value
- The best split value is 5.146 for both impurity functions.
47. Decision Tree Classifier: Differences between Impurity Functions
- The entropy and Gini impurities are qualitatively similar and, therefore, most often give similar, if not identical, splits.
48. Decision Tree Classifier: Resubstitution Error as Impurity Function
- Another candidate function that seems natural to use as a node impurity measure is the Resubstitution Error (the misclassification error on the training set).
49. Decision Tree Classifier: Resubstitution Error as Impurity Function
- Most of the time, using the resubstitution error (RE) will provide the same best split as using the Gini or entropy impurity (Case A).
- However, there are a number of occasions where the difference in impurity Δi(n, s, t), as measured by the RE, is locally flat (regions of equal values), implying a variety of equally good splits (Case B). Among those equally good splits there are usually one or two that intuitively seem more reasonable. These latter splits are typically easier to identify via the Gini or entropy impurities.
- The phenomenon of non-uniqueness of best splits, which rarely occurs when using the Gini or entropy impurities, makes the RE less suitable/convenient for determining splits. (A small comparison sketch follows.)
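A small illustration of the flat-gain phenomenon described above, using my own toy data and assuming the resubstitution-error impurity i(t) = 1 - max_j p(j|t): several thresholds tie under RE, while the Gini impurity singles out one of them.

```python
# Sketch contrasting resubstitution-error (RE) gains with Gini gains on the same splits.
import numpy as np

def gini(y):
    _, c = np.unique(y, return_counts=True)
    p = c / c.sum()
    return 1.0 - np.sum(p ** 2)

def resub_error(y):
    _, c = np.unique(y, return_counts=True)
    return 1.0 - c.max() / c.sum()

def gains(x, y, impurity):
    parent = impurity(y)
    out = {}
    for s in np.unique(x)[:-1]:
        left, right = y[x <= s], y[x > s]
        pl = len(left) / len(y)
        out[float(s)] = round(float(parent - pl * impurity(left) - (1 - pl) * impurity(right)), 3)
    return out

x = np.arange(1, 11, dtype=float)
y = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1])   # class 0 at x=1..6,8; class 1 at x=7,9,10
print(gains(x, y, resub_error))   # thresholds 6.0 and 8.0 tie at the maximum RE gain (0.2)
print(gains(x, y, gini))          # Gini gives a unique best split, at threshold 6.0
```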
50. Decision Tree Classifier: Resubstitution Error as Impurity Function
- Case A
  - RE, Gini, and entropy suggest a unique best split.
  - It is quite common that all three impurity measures suggest similar, if not identical, splits.
51. Decision Tree Classifier: Resubstitution Error as Impurity Function
- Case B
  - RE claims all splits are equivalent!
  - Gini and entropy suggest only two equivalent splits, which intuitively are quite reasonable.
52. Decision Tree Classifier: An Example
53. Decision Tree Classifier: An Example
54. Decision Tree Classifier: The First Split -- Level 0
- We are at the root of the tree, with data {1, 2, 3, 4, 5} of class A and data {6, 7, 8, 9, 10} of class B.
- The possible x-splits that we need to consider are ...
- The possible y-splits that we need to consider are ...
55. Decision Tree Classifier: Change in Impurity for x-splits -- Level 0
56. Decision Tree Classifier: Change in Impurity for y-splits
57. Decision Tree Classifier: Calculation of the Impurity Difference
- Best split:
  - Left node data: {1, 2, 4}; right node data: {3, 5, 6, 7, 8, 9, 10}
- Impurity of the parent
- Impurity of the left child
58. Decision Tree Classifier: Calculation of the Impurity Difference
- Impurity of the right child
- Average impurity of the left and right children
- Difference in impurity
(A worked version of this calculation is sketched below.)
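A worked version of the calculation on slides 57-58, assuming the Gini impurity is the measure being used (the slides do not state which impurity their numbers came from): the parent holds 5 records of class A and 5 of class B, the left child is {1, 2, 4}, and the right child is {3, 5, 6, 7, 8, 9, 10}.

```python
# Worked impurity-difference calculation for the example split (Gini assumed).
def gini(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

i_parent = gini([5, 5])          # 5 of class A, 5 of class B -> 0.5
i_left   = gini([3, 0])          # {1, 2, 4}: all class A     -> 0.0
i_right  = gini([2, 5])          # {3, 5} A and {6..10} B     -> 20/49 ~ 0.408

p_left, p_right = 3 / 10, 7 / 10
avg_children = p_left * i_left + p_right * i_right      # ~ 0.286
delta_i = i_parent - avg_children                       # ~ 0.214
print(i_parent, i_left, i_right, avg_children, delta_i)
```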
59. Decision Tree Classifier: Picture of How the Best 1st Split Looks
[Tree diagram]
Root: {1, 2, 3, 4, 5: A; 6, 7, 8, 9, 10: B}, split on x at 0.35 (branches x < 0.35 and x > 0.35)
  Left child: {1, 2, 4: A}
  Right child: {3, 5: A; 6, 7, 8, 9, 10: B}
(Numerals = data points, letters = class labels)
60. Decision Tree Classifier: Picture of How Another 1st Split Looks
[Figure: an alternative first split of the root node; numerals = data points, letters = class labels]
61. Decision Tree Classifier: Picture of How Another 1st Split Looks
[Figure: another alternative first split of the root node; numerals = data points, letters = class labels]
62. Decision Tree Classifier: Picture of How Another 1st Split Looks
[Figure: another alternative first split of the root node; numerals = data points, letters = class labels]
63. Decision Tree Classifier: The Second Split -- Level 1
- The left node (child) of the tree has data {1, 2, 4}, all of the same classification (class A). So, no further splitting of the data residing in the left node is needed.
- The right node (child) of the tree has data {3, 5, 6, 7, 8, 9, 10}, of which data {3, 5} are of class A and data {6, 7, 8, 9, 10} are of class B. So further splitting of the data residing in the right node is needed.
- The possible x-splits that we need to consider are ...
- The possible y-splits that we need to consider are ...
64. Decision Tree Classifier: Change in Impurity for x-splits -- Level 1, Right Node
65. Decision Tree Classifier: Change in Impurity for y-splits -- Level 1, Right Node
66. Decision Tree Classifier: Picture of How the 2nd Split Looks
[Tree diagram]
Root: {1, 2, 3, 4, 5: A; 6, 7, 8, 9, 10: B}
  Left child: {1, 2, 4: A}
  Right child: {3, 5: A; 6, 7, 8, 9, 10: B}, split on x at 0.5 (branches x < 0.5 and x > 0.5)
    {3, 5: A; 7, 9: B}
    {6, 8, 10: B}
(Numerals = data points, letters = class labels)
67. Decision Tree Classifier: Picture of How the 3rd Split Looks
[Tree diagram]
Root: {1, 2, 3, 4, 5: A; 6, 7, 8, 9, 10: B}
  {1, 2, 4: A}
  {3, 5: A; 6, 7, 8, 9, 10: B}
    {6, 8, 10: B}
    {3, 5: A; 7, 9: B}, split on y at 0.5 (branches y < 0.5 and y > 0.5)
      {5: A; 7, 9: B}
      {3: A}
(Numerals = data points, letters = class labels)
68. Decision Tree Classifier: Picture of How the 4th Split Looks
[Tree diagram]
Root: {1, 2, 3, 4, 5: A; 6, 7, 8, 9, 10: B}
  {1, 2, 4: A}
  {3, 5: A; 6, 7, 8, 9, 10: B}
    {6, 8, 10: B}
    {3, 5: A; 7, 9: B}
      {3: A}
      {5: A; 7, 9: B}, split on x at 0.4 (branches x < 0.4 and x > 0.4)
        {5: A; 7: B}
        {9: B}
(Numerals = data points, letters = class labels)
69. Decision Tree Classifier: Picture of How the 5th Split Looks
[Tree diagram]
Root: {1, 2, 3, 4, 5: A; 6, 7, 8, 9, 10: B}
  {1, 2, 4: A}
  {3, 5: A; 6, 7, 8, 9, 10: B}
    {6, 8, 10: B}
    {3, 5: A; 7, 9: B}
      {3: A}
      {5: A; 7, 9: B}
        {9: B}
        {5: A; 7: B}, split on y at 0.6 (branches y < 0.6 and y > 0.6)
          {7: B}
          {5: A}
(Numerals = data points, letters = class labels)
70. Decision Tree Classifier: Understanding Split Choices
[Figure: a parent node with Pr(Class 1) = 0.5, Pr(Class 2) = 0.5 split into a left and a right child]
- The four quantities of interest are: the portion of class 1 data going to the left child, the portion of class 2 data going to the left child, the portion of class 1 data going to the right child, and the portion of class 2 data going to the right child.
71. Decision Trees: Understanding Split Choices -- Pr(Class 1) = 0.5, Pr(Class 2) = 0.5: An Example
72. Decision Trees: Understanding Split Choices -- Pr(Class 1) = 0.5, Pr(Class 2) = 0.5: An Example
[Figure: parent node with Pr(Class 1) = 0.5, Pr(Class 2) = 0.5; one child receives Pr(Class 1) = 0.05 and Pr(Class 2) = 0.4, the other receives Pr(Class 1) = 0.45 and Pr(Class 2) = 0.1]
73. Decision Trees: Understanding Split Choices -- Pr(Class 1) = 0.5, Pr(Class 2) = 0.5: An Example
74. Decision Tree Classifier: Understanding Split Choices
[Figure: a parent node with Pr(Class 1) = 0.6, Pr(Class 2) = 0.4 split into a left and a right child]
- As before, the quantities of interest are the portions of class 1 and class 2 data going to the left child and to the right child.
75. Decision Trees: Understanding Split Choices -- Pr(Class 1) = 0.6, Pr(Class 2) = 0.4
76. Decision Tree Classifier: Understanding Split Choices
[Figure: a parent node with Pr(Class 1) = 0.7, Pr(Class 2) = 0.3 split into a left and a right child]
- As before, the quantities of interest are the portions of class 1 and class 2 data going to the left child and to the right child.
77. Decision Trees: Understanding Split Choices -- Pr(Class 1) = 0.7, Pr(Class 2) = 0.3
78. Decision Tree Classifier: Understanding Split Choices
[Figure: a parent node with Pr(Class 1) = 0.8, Pr(Class 2) = 0.2 split into a left and a right child]
- As before, the quantities of interest are the portions of class 1 and class 2 data going to the left child and to the right child.
79. Decision Trees: Understanding Split Choices -- Pr(Class 1) = 0.8, Pr(Class 2) = 0.2
80. Decision Tree Classifier: Understanding Split Choices
[Figure: a parent node with Pr(Class 1) = 0.9, Pr(Class 2) = 0.1 split into a left and a right child]
- As before, the quantities of interest are the portions of class 1 and class 2 data going to the left child and to the right child.
81. Decision Trees: Understanding Split Choices -- Pr(Class 1) = 0.9, Pr(Class 2) = 0.1
82. Decision Tree Classifier: Terminal Node Issue -- When Does the Tree Stop Growing?
- Criterion 1 (Stop Min Records)
  - The number of records in a node is below a minimum-number-of-records threshold.
  - The minimum number of records criterion is checked first.
[Tree diagram]
Root: {1, 2, 3, 4, 5: A; 6, 7, 8, 9, 10: B}
  {1, 2, 4: A}
  {3, 5: A; 6, 7, 8, 9, 10: B}
    {6, 8, 10: B}
    {3, 5: A; 7, 9: B}, split on y at 0.5
      {5: A; 7, 9: B}
      {3: A}
Settings: Stop Beta = 0.0, Stop Purity = 100%, Stop Min Records = 2
Stop reason shown in the figure: Reached Min Records
83. Decision Tree Classifier: Terminal Node Issue -- When Does the Tree Stop Growing?
- Criterion 2 (Stop Purity)
  - We have reached an acceptable purity level.
  - The purity level stop criterion is checked second.
[Tree diagram]
Root: {1, 2, 3, 4, 5: A; 6, 7, 8, 9, 10: B}, split on x at 0.35
  {1, 2, 4: A}
  {3, 5: A; 6, 7, 8, 9, 10: B}
Settings: Stop Beta = 0.0, Stop Purity = 100%, Stop Min Records = 2
Stop reason shown in the figure: Reached Purity Level
84. Decision Tree Classifier: Terminal Node Issue -- When Does the Tree Stop Growing?
- Criterion 3 (Stop Beta)
  - The maximum difference in impurity between parent and children is smaller than an allowable difference threshold.
  - The maximum difference in impurity stop criterion is checked third.
[Tree diagram]
Root: {1, 2, 3, 4, 5: A; 6, 7, 8, 9, 10: B}, split on x at 0.35
  {1, 2, 4: A}
  {3, 5: A; 6, 7, 8, 9, 10: B}
Settings: Stop Beta = 0.3, Stop Purity = 100%, Stop Min Records = 2
Stop reason shown in the figure: Reached threshold for Beta
(A sketch of these three checks, in the order they are applied, follows.)
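A hypothetical helper (the function name and the purity-as-a-fraction convention are my own) expressing the three stopping checks in the order the slides give: minimum records first, purity second, beta third.

```python
# Hypothetical stopping-criteria check, in the order described on slides 82-84.
def should_stop(n_records, purity, best_delta_i,
                stop_min_records=2, stop_purity=1.0, stop_beta=0.0):
    if n_records <= stop_min_records:
        return "Reached Min Records"
    if purity >= stop_purity:                 # purity = fraction of the majority class
        return "Reached Purity Level"
    if best_delta_i <= stop_beta:             # best achievable impurity decrease
        return "Reached threshold for Beta"
    return None                               # keep splitting

print(should_stop(n_records=2, purity=0.6, best_delta_i=0.2))    # Reached Min Records
print(should_stop(n_records=3, purity=1.0, best_delta_i=0.0))    # Reached Purity Level
print(should_stop(n_records=7, purity=0.71, best_delta_i=0.1, stop_beta=0.3))  # Beta
```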
85. Decision Tree Classifier: Class Node Assignments
- In the figure below, the class assignment for the right node of the tree is Class B, because the majority class there is Class B.
- The right node has 5 records from Class B and 2 records from Class A (see the sketch after the figure).
[Tree diagram]
Root: {1, 2, 3, 4, 5: A; 6, 7, 8, 9, 10: B}, split on x at 0.35
  Left child: {1, 2, 4: A} -- class assignment (majority class): Class A
  Right child: {3, 5: A; 6, 7, 8, 9, 10: B} -- class assignment (majority class): Class B
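A one-line sketch of the leaf-labeling rule described above: assign the majority class of the records that reach the node (helper name is my own).

```python
# Sketch of the majority-class leaf assignment rule on slide 85.
from collections import Counter

def leaf_class(labels):
    return Counter(labels).most_common(1)[0][0]

# Right node of the example tree: 2 records of class A, 5 of class B.
print(leaf_class(["A", "A", "B", "B", "B", "B", "B"]))   # -> 'B'
```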