Title: Classification: Decision Trees
1. Classification: Decision Trees
2. Classification: Definition
- Given a collection of records (the training set)
- Each record contains a set of attributes; one of the attributes is the class.
- Find a model for the class attribute as a function of the values of the other attributes.
- Goal: previously unseen records should be assigned a class as accurately as possible.
- A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
3. Classification: A Two-Step Process
- Model construction: describing a set of predetermined classes
  - Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  - The set of tuples used for model construction is the training set
  - The model is represented as classification rules, decision trees, or mathematical formulae
- Model usage: for classifying future or unknown objects
  - Estimate the accuracy of the model
    - The known label of a test sample is compared with the classified result from the model
    - The accuracy rate is the percentage of test set samples that are correctly classified by the model
    - The test set is independent of the training set, otherwise over-fitting will occur
4. Classification Process (1): Model Construction
Classification Algorithms
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
5. Classification Process (2): Use the Model in Prediction
(Jeff, Professor, 4)
Tenured?
6. Illustrating the Classification Task
7. Examples of Classification Tasks
- Predicting tumor cells as benign or malignant
- Classifying credit card transactions as legitimate or fraudulent
- Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
- Categorizing news stories as finance, weather, entertainment, sports, etc.
8. Classification Techniques
- Decision Tree based Methods
- Rule-based Methods
- Memory-based reasoning
- Neural Networks
- Naïve Bayes and Bayesian Belief Networks
- Support Vector Machines
9. Example of a Decision Tree
[Figure: a decision tree induced from the training data (table not shown). Splitting attributes:]
    Refund?
      Yes -> NO
      No  -> MarSt?
               Married -> NO
               Single, Divorced -> TaxInc?
                                     < 80K -> NO
                                     > 80K -> YES
Model: Decision Tree
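Read as nested rules, the tree above is easy to trace. A minimal Python sketch (the dict-based record format is an assumption; the attribute names and the 80K threshold come from the figure):

    # The slide's decision tree as nested if/else rules.
    def classify_cheat(record):
        if record["Refund"] == "Yes":
            return "No"                      # Refund = Yes -> leaf NO
        if record["MarSt"] == "Married":
            return "No"                      # Married -> leaf NO
        # Single or Divorced: test taxable income against the 80K threshold
        return "Yes" if record["TaxInc"] > 80_000 else "No"

    print(classify_cheat({"Refund": "No", "MarSt": "Single", "TaxInc": 95_000}))  # Yes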
10. Another Example of a Decision Tree
[Figure: a second tree that fits the same training data. Refund and MarSt are categorical attributes, TaxInc is continuous, Cheat is the class:]
    MarSt?
      Married -> NO
      Single, Divorced -> Refund?
                            Yes -> NO
                            No  -> TaxInc?
                                     < 80K -> NO
                                     > 80K -> YES
There could be more than one tree that fits the same data!
11. Decision Tree Classification Task
[Figure: the general classification task of slide 6, instantiated with a decision tree as the model]
12. Apply Model to Test Data
Test Data
Start from the root of the tree.
13.-16. Apply Model to Test Data
[Figure, repeated across four slides: the Refund / MarSt / TaxInc tree from slide 9, with the traversal for the test record highlighted one edge at a time, from the root toward a leaf.]
17. Apply Model to Test Data
[Figure: the same tree; following Refund = No and MarSt = Married, the traversal reaches the leaf under the Married branch.]
Assign Cheat = "No".
18. Decision Tree Classification Task
[Figure: same framework as slide 11; the learned decision tree is applied to classify the test data]
19. Outline
- Top-Down Decision Tree Construction
- Choosing the Splitting Attribute
- Information Gain and Gain Ratio
20. Decision Tree
- An internal node is a test on an attribute.
- A branch represents an outcome of the test, e.g., Color = red.
- A leaf node represents a class label or a class label distribution.
- At each node, one attribute is chosen to split the training examples into classes that are as distinct as possible.
- A new case is classified by following a matching path to a leaf node.
21. Weather Data: Play or not Play?
[Table: the 14-instance weather data with attributes Outlook, Temperature, Humidity, Windy and class Play]
Note: Outlook is the weather forecast, no relation to the Microsoft email program.
22. Example Tree for Play?
    Outlook?
      sunny    -> Humidity?
                    high   -> No
                    normal -> Yes
      overcast -> Yes
      rain     -> Windy?
                    true  -> No
                    false -> Yes
23. Building a Decision Tree [Quinlan 93]
- Top-down tree construction (see the sketch after this list)
  - At start, all training examples are at the root.
  - Partition the examples recursively by choosing one attribute each time.
- Bottom-up tree pruning
  - Remove subtrees or branches, in a bottom-up manner, to improve the estimated accuracy on new cases.
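The top-down recursion can be sketched in a few lines of Python. This is a minimal ID3-style sketch, not Quinlan's actual implementation: it assumes categorical attributes stored in dicts, uses information gain (defined on the following slides) as the goodness function, and does no pruning:

    import math
    from collections import Counter

    def entropy(labels):
        """Entropy of a list of class labels."""
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def build_tree(rows, labels, attributes):
        # Pure node, or no attributes left to test: return the majority class.
        if len(set(labels)) == 1 or not attributes:
            return Counter(labels).most_common(1)[0][0]
        def gain(a):
            rem = 0.0
            for v in set(r[a] for r in rows):
                sub = [l for r, l in zip(rows, labels) if r[a] == v]
                rem += len(sub) / len(rows) * entropy(sub)
            return entropy(labels) - rem
        best = max(attributes, key=gain)        # greedy choice of splitting attribute
        children = {}
        for v in set(r[best] for r in rows):
            srows = [r for r in rows if r[best] == v]
            slabels = [l for r, l in zip(rows, labels) if r[best] == v]
            children[v] = build_tree(srows, slabels,
                                     [a for a in attributes if a != best])
        return (best, children)                 # internal node: (attribute, branches)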
24. Choosing the Splitting Attribute
- At each node, the available attributes are evaluated on the basis of how well they separate the classes of the training examples. A goodness function is used for this purpose.
- Typical goodness functions:
  - information gain (ID3/C4.5)
  - information gain ratio
  - gini index
25. How to Determine the Best Split
- Greedy approach: nodes with a homogeneous class distribution are preferred
- Need a measure of node impurity
[Figure: a non-homogeneous node has a high degree of impurity; a homogeneous node has a low degree of impurity]
26. Splitting Criteria Based on INFO
- Entropy at a given node t:
      Entropy(t) = - sum_j p(j|t) log p(j|t)
  (NOTE: p(j|t) is the relative frequency of class j at node t.)
- Measures the homogeneity of a node:
  - Maximum (log n_c) when records are equally distributed among all classes, implying least information
  - Minimum (0.0) when all records belong to one class, implying most information
- Entropy-based computations are similar to the GINI index computations
27. Examples for Computing Entropy
P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Entropy = -0 log 0 - 1 log 1 = -0 - 0 = 0

P(C1) = 1/6, P(C2) = 5/6
Entropy = -(1/6) log2(1/6) - (5/6) log2(5/6) = 0.65

P(C1) = 2/6, P(C2) = 4/6
Entropy = -(2/6) log2(2/6) - (4/6) log2(4/6) = 0.92
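These values can be checked with a small helper (a sketch, not from the slides; it takes raw class counts):

    import math

    def entropy(counts):
        """Entropy from class counts, treating 0 * log(0) as 0."""
        n = sum(counts)
        return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

    print(entropy([0, 6]))   # 0.0
    print(entropy([1, 5]))   # 0.650...
    print(entropy([2, 4]))   # 0.918..., which the slide rounds to 0.92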
28. Splitting Based on INFO...
- Information gain: when a parent node p with n records is split into k partitions, and n_i is the number of records in partition i,
      GAIN_split = Entropy(p) - sum_{i=1..k} (n_i / n) Entropy(i)
- Measures the reduction in entropy achieved because of the split. Choose the split that achieves the most reduction (maximizes GAIN).
- Used in ID3 and C4.5
- Disadvantage: tends to prefer splits that result in a large number of partitions, each being small but pure.
29. Which attribute to select?
30. A Criterion for Attribute Selection
- Which is the best attribute?
  - The one which will result in the smallest tree
  - Heuristic: choose the attribute that produces the purest nodes
- Popular impurity criterion: information gain
  - Information gain increases with the average purity of the subsets that an attribute produces
- Strategy: choose the attribute that results in the greatest information gain
31. Example: Attribute Outlook
- Outlook = Sunny: info([2,3]) = 0.971 bits
- Outlook = Overcast: info([4,0]) = 0 bits
- Outlook = Rainy: info([3,2]) = 0.971 bits
- Expected information for the attribute: info([2,3],[4,0],[3,2]) = (5/14)(0.971) + (4/14)(0) + (5/14)(0.971) = 0.693 bits
Note: log(0) is not defined, but we evaluate 0 log(0) as zero.
32. Computing the Information Gain
- Information gain = (information before split) - (information after split)
- gain("Outlook") = info([9,5]) - info([2,3],[4,0],[3,2]) = 0.940 - 0.693 = 0.247 bits
- Information gain for the attributes from the weather data:
  - gain("Outlook") = 0.247 bits
  - gain("Temperature") = 0.029 bits
  - gain("Humidity") = 0.152 bits
  - gain("Windy") = 0.048 bits
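A short end-to-end check of gain("Outlook"), assuming the canonical 14-instance weather data (2 yes / 3 no under sunny, 4 yes under overcast, 3 yes / 2 no under rainy); only this one attribute is shown:

    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    outlook = ["sunny"] * 5 + ["overcast"] * 4 + ["rainy"] * 5
    play = ["no", "no", "no", "yes", "yes",      # sunny: 2 yes / 3 no
            "yes", "yes", "yes", "yes",          # overcast: 4 yes / 0 no
            "yes", "yes", "yes", "no", "no"]     # rainy: 3 yes / 2 no

    before = entropy(play)                       # info([9,5]) = 0.940
    after = sum((outlook.count(v) / len(outlook))
                * entropy([p for o, p in zip(outlook, play) if o == v])
                for v in set(outlook))           # info([2,3],[4,0],[3,2]) = 0.693
    print(round(before - after, 3))              # 0.247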
33. Continuing to Split
[Figure: within the Outlook = Sunny subset, gains are recomputed for the remaining attributes]
34. The Final Decision Tree
- Note: not all leaves need to be pure; sometimes identical instances have different classes
- Splitting stops when the data can't be split any further
35. Gini Index
- If a data set T contains examples from n classes, the gini index gini(T) is defined as
      gini(T) = 1 - sum_{j=1..n} (p_j)^2
  where p_j is the relative frequency of class j in T.
- If a data set T is split into two subsets T1 and T2 with sizes N1 and N2 respectively, the gini index of the split data is defined as
      gini_split(T) = (N1/N) gini(T1) + (N2/N) gini(T2)
- The attribute that provides the smallest gini_split(T) is chosen to split the node (need to enumerate all possible splitting points for each attribute).
36. Measure of Impurity: GINI
- Gini index for a given node t:
      GINI(t) = 1 - sum_j [p(j|t)]^2
  (NOTE: p(j|t) is the relative frequency of class j at node t.)
- Maximum (1 - 1/n_c) when records are equally distributed among all classes, implying least interesting information
- Minimum (0.0) when all records belong to one class, implying most interesting information
37. Examples for Computing GINI
P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Gini = 1 - P(C1)^2 - P(C2)^2 = 1 - 0 - 1 = 0

P(C1) = 1/6, P(C2) = 5/6
Gini = 1 - (1/6)^2 - (5/6)^2 = 0.278

P(C1) = 2/6, P(C2) = 4/6
Gini = 1 - (2/6)^2 - (4/6)^2 = 0.444
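The same check for the Gini index (a sketch mirroring the entropy helper above):

    def gini(counts):
        """Gini index from class counts."""
        n = sum(counts)
        return 1 - sum((c / n) ** 2 for c in counts)

    print(gini([0, 6]))   # 0.0
    print(gini([1, 5]))   # 0.2777...
    print(gini([2, 4]))   # 0.4444...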
38. Stopping Criteria for Tree Induction
- Stop expanding a node when all the records belong to the same class
- Stop expanding a node when all the records have similar attribute values
- Early termination (to be discussed later)
39. Decision Tree Based Classification
- Advantages:
  - Inexpensive to construct
  - Extremely fast at classifying unknown records
  - Easy to interpret for small-sized trees
  - Accuracy is comparable to other classification techniques for many simple data sets
40. Example: C4.5
- Simple depth-first construction
- Uses information gain
- Sorts continuous attributes at each node
- Needs the entire data set to fit in memory
- Unsuitable for large datasets: needs out-of-core sorting
- You can download the software from http://www.cse.unsw.edu.au/~quinlan/c4.5r8.tar.gz
41. Practical Issues of Classification
- Underfitting and Overfitting
- Missing Values
- Costs of Classification
42. Underfitting and Overfitting
[Figure: training and test error as a function of model complexity; the overly complex regime is overfitting]
Underfitting: when the model is too simple, both training and test errors are large
43. Overfitting due to Noise
[Figure: the decision boundary is distorted by a noise point]
44. Overfitting due to Insufficient Examples
- Lack of data points in the lower half of the diagram makes it difficult to correctly predict the class labels of that region
- An insufficient number of training records in the region causes the decision tree to predict the test examples using other training records that are irrelevant to the classification task
45. Notes on Overfitting
- Overfitting results in decision trees that are more complex than necessary
- Training error no longer provides a good estimate of how well the tree will perform on previously unseen records
- Need new ways of estimating errors
46. Model Evaluation
- Metrics for Performance Evaluation: how to evaluate the performance of a model?
- Methods for Performance Evaluation: how to obtain reliable estimates?
- Methods for Model Comparison: how to compare the relative performance among competing models?
48. Metrics for Performance Evaluation
- Focus on the predictive capability of a model, rather than on how fast it classifies or builds models, scalability, etc.
- Confusion matrix (two-class case):

                          PREDICTED CLASS
                          Class = Yes    Class = No
    ACTUAL   Class = Yes  a (TP)         b (FN)
    CLASS    Class = No   c (FP)         d (TN)
49. Metrics for Performance Evaluation
- The most widely-used metric: Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
50. Limitation of Accuracy
- Consider a 2-class problem:
  - Number of Class 0 examples: 9990
  - Number of Class 1 examples: 10
- If the model predicts everything to be class 0, accuracy is 9990/10000 = 99.9%
- Accuracy is misleading because the model does not detect any class 1 example
51. Cost Matrix
C(i|j): cost of misclassifying a class j example as class i
52. Computing the Cost of Classification
Model M1: Accuracy = 80%, Cost = 3910
Model M2: Accuracy = 90%, Cost = 4255
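The total cost is each confusion-matrix cell weighted by the matching C(i|j) entry. A sketch in Python; the cost matrix and counts below are assumptions chosen to reproduce the slide's Accuracy = 80%, Cost = 3910, not figures read off the slide's (missing) tables:

    # cost[(i, j)] = C(i|j): cost of predicting class i when the true class is j
    cost = {("Yes", "Yes"): -1,  ("Yes", "No"): 1,
            ("No",  "Yes"): 100, ("No",  "No"): 0}
    # counts[(predicted, actual)] -- illustrative confusion matrix (an assumption)
    counts = {("Yes", "Yes"): 150, ("Yes", "No"): 60,
              ("No",  "Yes"): 40,  ("No",  "No"): 250}

    accuracy = (counts[("Yes", "Yes")] + counts[("No", "No")]) / sum(counts.values())
    total_cost = sum(counts[k] * cost[k] for k in counts)
    print(accuracy, total_cost)   # 0.8 3910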
53. Cost vs. Accuracy
54. Cost-Sensitive Measures
- Precision (p) = a / (a + c): biased towards C(Yes|Yes) & C(Yes|No)
- Recall (r) = a / (a + b): biased towards C(Yes|Yes) & C(No|Yes)
- F-measure (F) = 2rp / (r + p) = 2a / (2a + b + c): biased towards all except C(No|No)
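With the a, b, c, d cells from the confusion matrix of slide 48, all four measures are one-liners (a sketch; the example counts reuse the illustrative matrix from the previous sketch):

    def metrics(a, b, c, d):
        """a=TP, b=FN, c=FP, d=TN, as in the confusion matrix of slide 48."""
        accuracy = (a + d) / (a + b + c + d)
        precision = a / (a + c)
        recall = a / (a + b)
        f_measure = 2 * recall * precision / (recall + precision)
        return accuracy, precision, recall, f_measure

    print(metrics(150, 40, 60, 250))   # (0.8, 0.714..., 0.789..., 0.75)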
55. Model Evaluation
- Metrics for Performance Evaluation: how to evaluate the performance of a model?
- Methods for Performance Evaluation: how to obtain reliable estimates?
56. Methods for Performance Evaluation
- How to obtain a reliable estimate of performance?
- Performance of a model may depend on factors other than the learning algorithm:
  - Class distribution
  - Cost of misclassification
  - Size of training and test sets
57. Methods of Estimation
- Holdout: reserve 2/3 for training and 1/3 for testing
- Random subsampling: repeated holdout
- Cross validation (see the sketch after this list)
  - Partition the data into k disjoint subsets
  - k-fold: train on k-1 partitions, test on the remaining one
  - Leave-one-out: k = n
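A minimal sketch of the k-fold partitioning (model fitting is elided; the fold assignment is the only moving part):

    import random

    def k_fold_indices(n, k, seed=0):
        """Split indices 0..n-1 into k disjoint folds after one shuffle."""
        idx = list(range(n))
        random.Random(seed).shuffle(idx)
        return [idx[i::k] for i in range(k)]

    folds = k_fold_indices(n=14, k=7)
    for test in folds:
        train = [j for f in folds if f is not test for j in f]
        # fit the model on `train`, evaluate it on `test`, then average
        # the k accuracy estimates; with k = n this is leave-one-out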
58. Bayes Classifier
- A probabilistic framework for solving classification problems
- Conditional probability: P(C|A) = P(A, C) / P(A), and P(A|C) = P(A, C) / P(C)
- Bayes theorem:
      P(C|A) = P(A|C) P(C) / P(A)
59. Example of Bayes Theorem
- Given:
  - A doctor knows that meningitis causes stiff neck 50% of the time
  - The prior probability of any patient having meningitis is 1/50,000
  - The prior probability of any patient having a stiff neck is 1/20
- If a patient has a stiff neck, what's the probability he/she has meningitis?
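Plugging the three given numbers into Bayes' theorem answers the question directly (M = meningitis, S = stiff neck):

    P(M|S) = P(S|M) P(M) / P(S) = (0.5 x 1/50,000) / (1/20) = 0.0002

So even with a stiff neck, the probability of meningitis is only 0.02% (1 in 5,000).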
60. Bayesian Classifiers
- Consider each attribute and the class label as random variables
- Given a record with attributes (A1, A2, ..., An):
  - The goal is to predict class C
  - Specifically, we want to find the value of C that maximizes P(C | A1, A2, ..., An)
- Can we estimate P(C | A1, A2, ..., An) directly from data?
61. Bayesian Classifiers
- Approach:
  - Compute the posterior probability P(C | A1, A2, ..., An) for all values of C using Bayes theorem:
        P(C | A1, A2, ..., An) = P(A1, A2, ..., An | C) P(C) / P(A1, A2, ..., An)
  - Choose the value of C that maximizes P(C | A1, A2, ..., An)
  - Equivalent to choosing the value of C that maximizes P(A1, A2, ..., An | C) P(C)
- How to estimate P(A1, A2, ..., An | C)?
62. Naïve Bayes Classifier
- Assume independence among the attributes Ai when the class is given:
      P(A1, A2, ..., An | Cj) = P(A1 | Cj) P(A2 | Cj) ... P(An | Cj)
- Can estimate P(Ai | Cj) for all Ai and Cj
- A new point is classified as Cj if P(Cj) prod_i P(Ai | Cj) is maximal (see the sketch below)
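A minimal sketch of training and prediction by counting (categorical attributes in dicts are an assumption; no smoothing, so unseen attribute values get probability zero):

    from collections import Counter, defaultdict

    def train_nb(rows, labels):
        """Estimate P(Cj) and P(Ai = v | Cj) by relative frequencies."""
        prior = {c: n / len(labels) for c, n in Counter(labels).items()}
        cond = defaultdict(Counter)          # cond[(attr, class)][value] -> count
        for row, c in zip(rows, labels):
            for a, v in row.items():
                cond[(a, c)][v] += 1
        return prior, cond

    def predict_nb(prior, cond, row):
        """Pick the class maximizing P(Cj) * prod_i P(Ai | Cj)."""
        def score(c):
            p = prior[c]
            for a, v in row.items():
                p *= cond[(a, c)][v] / sum(cond[(a, c)].values())
            return p
        return max(prior, key=score)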
63. Example of Naïve Bayes Classifier
A: attributes, M: mammals, N: non-mammals
If P(A|M) P(M) > P(A|N) P(N), classify as a mammal.
64. Nearest Neighbor Classifiers
- Basic idea: if it walks like a duck and quacks like a duck, then it's probably a duck
65. Nearest-Neighbor Classifiers
- Requires three things:
  - The set of stored records
  - A distance metric to compute the distance between records
  - The value of k, the number of nearest neighbors to retrieve
- To classify an unknown record:
  - Compute its distance to the other training records
  - Identify the k nearest neighbors
  - Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote)
66. Definition of Nearest Neighbor
The k-nearest neighbors of a record x are the data points that have the k smallest distances to x.
67. 1-Nearest Neighbor
[Figure: Voronoi diagram; each training point's cell is the region it labels]
68. Nearest Neighbor Classification
- Compute the distance between two points, e.g., Euclidean distance:
      d(p, q) = sqrt( sum_i (p_i - q_i)^2 )
- Determine the class from the nearest neighbor list:
  - Take the majority vote of the class labels among the k nearest neighbors
  - Or weigh each vote according to distance, with weight factor w = 1/d^2 (see the sketch below)
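A minimal sketch of the distance-weighted vote (Euclidean distance, w = 1/d^2; the tiny epsilon is an assumption to guard against d = 0):

    import math
    from collections import defaultdict

    def knn_predict(train, labels, x, k=3):
        """Distance-weighted k-NN vote."""
        dist = sorted((math.dist(p, x), c) for p, c in zip(train, labels))
        votes = defaultdict(float)
        for d, c in dist[:k]:                # the k nearest neighbors
            votes[c] += 1.0 / (d * d + 1e-12)
        return max(votes, key=votes.get)

    print(knn_predict([(0, 0), (1, 0), (5, 5)], ["A", "A", "B"], (0.5, 0.2)))  # A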
69. Nearest Neighbor Classification...
- Choosing the value of k:
  - If k is too small, the classifier is sensitive to noise points
  - If k is too large, the neighborhood may include points from other classes
70. Nearest Neighbor Classification...
- Scaling issues: attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes (a min-max sketch follows this list)
- Example:
  - the height of a person may vary from 1.5m to 1.8m
  - the weight of a person may vary from 90lb to 300lb
  - the income of a person may vary from $10K to $1M
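A common fix is min-max scaling, which puts every attribute on [0, 1] before distances are computed (a sketch; assumes max > min):

    def min_max_scale(column):
        """Rescale one numeric attribute to [0, 1]."""
        lo, hi = min(column), max(column)
        return [(v - lo) / (hi - lo) for v in column]

    print(min_max_scale([1.5, 1.62, 1.8]))                 # heights in metres
    print(min_max_scale([10_000, 250_000, 1_000_000]))     # incomes in dollars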
71. Nearest Neighbor Classification...
- k-NN classifiers are lazy learners:
  - They do not build models explicitly
  - Unlike eager learners such as decision tree induction and rule-based systems
  - Classifying unknown records is relatively expensive