Title: Classification: Decision Trees
1. Classification: Decision Trees
2. Classification: Definition
- Given a collection of records (the training set)
- Each record contains a set of attributes; one of the attributes is the class.
- Find a model for the class attribute as a function of the values of the other attributes.
- Goal: previously unseen records should be assigned a class as accurately as possible.
- A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
3. Classification: A Two-Step Process
- Model construction: describing a set of predetermined classes
  - Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  - The set of tuples used for model construction is the training set
  - The model is represented as classification rules, decision trees, or mathematical formulae
- Model usage: for classifying future or unknown objects
  - Estimate the accuracy of the model
    - The known label of a test sample is compared with the classified result from the model
    - The accuracy rate is the percentage of test set samples that are correctly classified by the model
    - The test set is independent of the training set, otherwise over-fitting will occur
4. Classification Process (1): Model Construction
Classification Algorithms
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
5. Classification Process (2): Use the Model in Prediction
(Jeff, Professor, 4)
Tenured?
6. Illustrating the Classification Task
7. Examples of Classification Tasks
- Predicting tumor cells as benign or malignant
- Classifying credit card transactions as legitimate or fraudulent
- Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
- Categorizing news stories as finance, weather, entertainment, sports, etc.
8. Classification Techniques
- Decision Tree based Methods
- Rule-based Methods
- Memory-based reasoning
- Neural Networks
- Naïve Bayes and Bayesian Belief Networks
- Support Vector Machines
9. Example of a Decision Tree
[Figure: a decision tree induced from the training data (table not shown). Splitting attributes:]
    Refund?
      Yes -> NO
      No  -> MarSt?
               Married -> NO
               Single, Divorced -> TaxInc?
                                     < 80K -> NO
                                     > 80K -> YES
Model: Decision Tree
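Read as nested rules, the tree above is easy to trace. A minimal Python sketch (the dict-based record format is an assumption; the attribute names and the 80K threshold come from the figure):

    # The slide's decision tree as nested if/else rules.
    def classify_cheat(record):
        if record["Refund"] == "Yes":
            return "No"                      # Refund = Yes -> leaf NO
        if record["MarSt"] == "Married":
            return "No"                      # Married -> leaf NO
        # Single or Divorced: test taxable income against the 80K threshold
        return "Yes" if record["TaxInc"] > 80_000 else "No"

    print(classify_cheat({"Refund": "No", "MarSt": "Single", "TaxInc": 95_000}))  # Yes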
10. Another Example of a Decision Tree
[Figure: a second tree that fits the same training data. Refund and MarSt are categorical attributes, TaxInc is continuous, Cheat is the class:]
    MarSt?
      Married -> NO
      Single, Divorced -> Refund?
                            Yes -> NO
                            No  -> TaxInc?
                                     < 80K -> NO
                                     > 80K -> YES
There could be more than one tree that fits the same data!
11. Decision Tree Classification Task
[Figure: the general classification task of slide 6, instantiated with a decision tree as the model]
12. Apply Model to Test Data
Test Data
Start from the root of the tree.
13.-16. Apply Model to Test Data
[Figure, repeated across four slides: the Refund / MarSt / TaxInc tree from slide 9, with the traversal for the test record highlighted one edge at a time, from the root toward a leaf.]
17. Apply Model to Test Data
[Figure: the same tree; following Refund = No and MarSt = Married, the traversal reaches the leaf under the Married branch.]
Assign Cheat = "No".
18. Decision Tree Classification Task
[Figure: same framework as slide 11; the learned decision tree is applied to classify the test data]
19. Outline
- Top-Down Decision Tree Construction
- Choosing the Splitting Attribute
- Information Gain and Gain Ratio
20. Decision Tree
- An internal node is a test on an attribute.
- A branch represents an outcome of the test, e.g., Color = red.
- A leaf node represents a class label or a class label distribution.
- At each node, one attribute is chosen to split the training examples into classes that are as distinct as possible.
- A new case is classified by following a matching path to a leaf node.
21. Weather Data: Play or not Play?
[Table: the 14-instance weather data with attributes Outlook, Temperature, Humidity, Windy and class Play]
Note: Outlook is the weather forecast, no relation to the Microsoft email program.
22. Example Tree for Play?
    Outlook?
      sunny    -> Humidity?
                    high   -> No
                    normal -> Yes
      overcast -> Yes
      rain     -> Windy?
                    true  -> No
                    false -> Yes
23. Building a Decision Tree [Quinlan 93]
- Top-down tree construction (see the sketch after this list)
  - At start, all training examples are at the root.
  - Partition the examples recursively by choosing one attribute each time.
- Bottom-up tree pruning
  - Remove subtrees or branches, in a bottom-up manner, to improve the estimated accuracy on new cases.
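The top-down recursion can be sketched in a few lines of Python. This is a minimal ID3-style sketch, not Quinlan's actual implementation: it assumes categorical attributes stored in dicts, uses information gain (defined on the following slides) as the goodness function, and does no pruning:

    import math
    from collections import Counter

    def entropy(labels):
        """Entropy of a list of class labels."""
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def build_tree(rows, labels, attributes):
        # Pure node, or no attributes left to test: return the majority class.
        if len(set(labels)) == 1 or not attributes:
            return Counter(labels).most_common(1)[0][0]
        def gain(a):
            rem = 0.0
            for v in set(r[a] for r in rows):
                sub = [l for r, l in zip(rows, labels) if r[a] == v]
                rem += len(sub) / len(rows) * entropy(sub)
            return entropy(labels) - rem
        best = max(attributes, key=gain)        # greedy choice of splitting attribute
        children = {}
        for v in set(r[best] for r in rows):
            srows = [r for r in rows if r[best] == v]
            slabels = [l for r, l in zip(rows, labels) if r[best] == v]
            children[v] = build_tree(srows, slabels,
                                     [a for a in attributes if a != best])
        return (best, children)                 # internal node: (attribute, branches)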
24. Choosing the Splitting Attribute
- At each node, the available attributes are evaluated on the basis of how well they separate the classes of the training examples. A goodness function is used for this purpose.
- Typical goodness functions:
  - information gain (ID3/C4.5)
  - information gain ratio
  - gini index
25. How to Determine the Best Split
- Greedy approach: nodes with a homogeneous class distribution are preferred
- Need a measure of node impurity
[Figure: a non-homogeneous node has a high degree of impurity; a homogeneous node has a low degree of impurity]
26. Splitting Criteria Based on INFO
- Entropy at a given node t:
      Entropy(t) = - sum_j p(j|t) log p(j|t)
  (NOTE: p(j|t) is the relative frequency of class j at node t.)
- Measures the homogeneity of a node:
  - Maximum (log n_c) when records are equally distributed among all classes, implying least information
  - Minimum (0.0) when all records belong to one class, implying most information
- Entropy-based computations are similar to the GINI index computations
27. Examples for Computing Entropy
P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Entropy = -0 log 0 - 1 log 1 = -0 - 0 = 0

P(C1) = 1/6, P(C2) = 5/6
Entropy = -(1/6) log2(1/6) - (5/6) log2(5/6) = 0.65

P(C1) = 2/6, P(C2) = 4/6
Entropy = -(2/6) log2(2/6) - (4/6) log2(4/6) = 0.92
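These values can be checked with a small helper (a sketch, not from the slides; it takes raw class counts):

    import math

    def entropy(counts):
        """Entropy from class counts, treating 0 * log(0) as 0."""
        n = sum(counts)
        return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

    print(entropy([0, 6]))   # 0.0
    print(entropy([1, 5]))   # 0.650...
    print(entropy([2, 4]))   # 0.918..., which the slide rounds to 0.92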
28. Splitting Based on INFO...
- Information gain: when a parent node p with n records is split into k partitions, and n_i is the number of records in partition i,
      GAIN_split = Entropy(p) - sum_{i=1..k} (n_i / n) Entropy(i)
- Measures the reduction in entropy achieved because of the split. Choose the split that achieves the most reduction (maximizes GAIN).
- Used in ID3 and C4.5
- Disadvantage: tends to prefer splits that result in a large number of partitions, each being small but pure.
29. Which attribute to select?
30. A Criterion for Attribute Selection
- Which is the best attribute?
  - The one which will result in the smallest tree
  - Heuristic: choose the attribute that produces the purest nodes
- Popular impurity criterion: information gain
  - Information gain increases with the average purity of the subsets that an attribute produces
- Strategy: choose the attribute that results in the greatest information gain
31. Example: Attribute Outlook
- Outlook = Sunny: info([2,3]) = 0.971 bits
- Outlook = Overcast: info([4,0]) = 0 bits
- Outlook = Rainy: info([3,2]) = 0.971 bits
- Expected information for the attribute: info([2,3],[4,0],[3,2]) = (5/14)(0.971) + (4/14)(0) + (5/14)(0.971) = 0.693 bits
Note: log(0) is not defined, but we evaluate 0 log(0) as zero.
32. Computing the Information Gain
- Information gain = (information before split) - (information after split)
- gain("Outlook") = info([9,5]) - info([2,3],[4,0],[3,2]) = 0.940 - 0.693 = 0.247 bits
- Information gain for the attributes from the weather data:
  - gain("Outlook") = 0.247 bits
  - gain("Temperature") = 0.029 bits
  - gain("Humidity") = 0.152 bits
  - gain("Windy") = 0.048 bits
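A short end-to-end check of gain("Outlook"), assuming the canonical 14-instance weather data (2 yes / 3 no under sunny, 4 yes under overcast, 3 yes / 2 no under rainy); only this one attribute is shown:

    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    outlook = ["sunny"] * 5 + ["overcast"] * 4 + ["rainy"] * 5
    play = ["no", "no", "no", "yes", "yes",      # sunny: 2 yes / 3 no
            "yes", "yes", "yes", "yes",          # overcast: 4 yes / 0 no
            "yes", "yes", "yes", "no", "no"]     # rainy: 3 yes / 2 no

    before = entropy(play)                       # info([9,5]) = 0.940
    after = sum((outlook.count(v) / len(outlook))
                * entropy([p for o, p in zip(outlook, play) if o == v])
                for v in set(outlook))           # info([2,3],[4,0],[3,2]) = 0.693
    print(round(before - after, 3))              # 0.247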
33. Continuing to Split
[Figure: within the Outlook = Sunny subset, gains are recomputed for the remaining attributes]
34. The Final Decision Tree
- Note: not all leaves need to be pure; sometimes identical instances have different classes
- Splitting stops when the data can't be split any further
35. Gini Index
- If a data set T contains examples from n classes, the gini index gini(T) is defined as
      gini(T) = 1 - sum_{j=1..n} (p_j)^2
  where p_j is the relative frequency of class j in T.
- If a data set T is split into two subsets T1 and T2 with sizes N1 and N2 respectively, the gini index of the split data is defined as
      gini_split(T) = (N1/N) gini(T1) + (N2/N) gini(T2)
- The attribute that provides the smallest gini_split(T) is chosen to split the node (need to enumerate all possible splitting points for each attribute).
36. Measure of Impurity: GINI
- Gini index for a given node t:
      GINI(t) = 1 - sum_j [p(j|t)]^2
  (NOTE: p(j|t) is the relative frequency of class j at node t.)
- Maximum (1 - 1/n_c) when records are equally distributed among all classes, implying least interesting information
- Minimum (0.0) when all records belong to one class, implying most interesting information
37. Examples for Computing GINI
P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Gini = 1 - P(C1)^2 - P(C2)^2 = 1 - 0 - 1 = 0

P(C1) = 1/6, P(C2) = 5/6
Gini = 1 - (1/6)^2 - (5/6)^2 = 0.278

P(C1) = 2/6, P(C2) = 4/6
Gini = 1 - (2/6)^2 - (4/6)^2 = 0.444
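The same check for the Gini index (a sketch mirroring the entropy helper above):

    def gini(counts):
        """Gini index from class counts."""
        n = sum(counts)
        return 1 - sum((c / n) ** 2 for c in counts)

    print(gini([0, 6]))   # 0.0
    print(gini([1, 5]))   # 0.2777...
    print(gini([2, 4]))   # 0.4444...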
38. Stopping Criteria for Tree Induction
- Stop expanding a node when all the records belong to the same class
- Stop expanding a node when all the records have similar attribute values
- Early termination (to be discussed later)
39. Decision Tree Based Classification
- Advantages:
  - Inexpensive to construct
  - Extremely fast at classifying unknown records
  - Easy to interpret for small-sized trees
  - Accuracy is comparable to other classification techniques for many simple data sets
40. Example: C4.5
- Simple depth-first construction
- Uses information gain
- Sorts continuous attributes at each node
- Needs the entire data set to fit in memory
- Unsuitable for large datasets: needs out-of-core sorting
- You can download the software from http://www.cse.unsw.edu.au/~quinlan/c4.5r8.tar.gz
41. Practical Issues of Classification
- Underfitting and Overfitting
- Missing Values
- Costs of Classification
42. Underfitting and Overfitting
[Figure: training and test error as a function of model complexity; the overly complex regime is overfitting]
Underfitting: when the model is too simple, both training and test errors are large
43. Overfitting due to Noise
[Figure: the decision boundary is distorted by a noise point]
44. Overfitting due to Insufficient Examples
- Lack of data points in the lower half of the diagram makes it difficult to correctly predict the class labels of that region
- An insufficient number of training records in the region causes the decision tree to predict the test examples using other training records that are irrelevant to the classification task
45. Notes on Overfitting
- Overfitting results in decision trees that are more complex than necessary
- Training error no longer provides a good estimate of how well the tree will perform on previously unseen records
- Need new ways of estimating errors
46. Model Evaluation
- Metrics for Performance Evaluation: how to evaluate the performance of a model?
- Methods for Performance Evaluation: how to obtain reliable estimates?
- Methods for Model Comparison: how to compare the relative performance among competing models?
48. Metrics for Performance Evaluation
- Focus on the predictive capability of a model, rather than on how fast it classifies or builds models, scalability, etc.
- Confusion matrix (two-class case):

                          PREDICTED CLASS
                          Class = Yes    Class = No
    ACTUAL   Class = Yes  a (TP)         b (FN)
    CLASS    Class = No   c (FP)         d (TN)
49. Metrics for Performance Evaluation
- The most widely-used metric: Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
50. Limitation of Accuracy
- Consider a 2-class problem:
  - Number of Class 0 examples: 9990
  - Number of Class 1 examples: 10
- If the model predicts everything to be class 0, accuracy is 9990/10000 = 99.9%
- Accuracy is misleading because the model does not detect any class 1 example
51. Cost Matrix
C(i|j): cost of misclassifying a class j example as class i
52. Computing the Cost of Classification
Model M1: Accuracy = 80%, Cost = 3910
Model M2: Accuracy = 90%, Cost = 4255
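The total cost is each confusion-matrix cell weighted by the matching C(i|j) entry. A sketch in Python; the cost matrix and counts below are assumptions chosen to reproduce the slide's Accuracy = 80%, Cost = 3910, not figures read off the slide's (missing) tables:

    # cost[(i, j)] = C(i|j): cost of predicting class i when the true class is j
    cost = {("Yes", "Yes"): -1,  ("Yes", "No"): 1,
            ("No",  "Yes"): 100, ("No",  "No"): 0}
    # counts[(predicted, actual)] -- illustrative confusion matrix (an assumption)
    counts = {("Yes", "Yes"): 150, ("Yes", "No"): 60,
              ("No",  "Yes"): 40,  ("No",  "No"): 250}

    accuracy = (counts[("Yes", "Yes")] + counts[("No", "No")]) / sum(counts.values())
    total_cost = sum(counts[k] * cost[k] for k in counts)
    print(accuracy, total_cost)   # 0.8 3910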
53. Cost vs. Accuracy
54. Cost-Sensitive Measures
- Precision (p) = a / (a + c): biased towards C(Yes|Yes) & C(Yes|No)
- Recall (r) = a / (a + b): biased towards C(Yes|Yes) & C(No|Yes)
- F-measure (F) = 2rp / (r + p) = 2a / (2a + b + c): biased towards all except C(No|No)
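With the a, b, c, d cells from the confusion matrix of slide 48, all four measures are one-liners (a sketch; the example counts reuse the illustrative matrix from the previous sketch):

    def metrics(a, b, c, d):
        """a=TP, b=FN, c=FP, d=TN, as in the confusion matrix of slide 48."""
        accuracy = (a + d) / (a + b + c + d)
        precision = a / (a + c)
        recall = a / (a + b)
        f_measure = 2 * recall * precision / (recall + precision)
        return accuracy, precision, recall, f_measure

    print(metrics(150, 40, 60, 250))   # (0.8, 0.714..., 0.789..., 0.75)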
55. Model Evaluation
- Metrics for Performance Evaluation: how to evaluate the performance of a model?
- Methods for Performance Evaluation: how to obtain reliable estimates?
56. Methods for Performance Evaluation
- How to obtain a reliable estimate of performance?
- Performance of a model may depend on factors other than the learning algorithm:
  - Class distribution
  - Cost of misclassification
  - Size of training and test sets
57. Methods of Estimation
- Holdout: reserve 2/3 for training and 1/3 for testing
- Random subsampling: repeated holdout
- Cross validation (see the sketch after this list)
  - Partition the data into k disjoint subsets
  - k-fold: train on k-1 partitions, test on the remaining one
  - Leave-one-out: k = n
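A minimal sketch of the k-fold partitioning (model fitting is elided; the fold assignment is the only moving part):

    import random

    def k_fold_indices(n, k, seed=0):
        """Split indices 0..n-1 into k disjoint folds after one shuffle."""
        idx = list(range(n))
        random.Random(seed).shuffle(idx)
        return [idx[i::k] for i in range(k)]

    folds = k_fold_indices(n=14, k=7)
    for test in folds:
        train = [j for f in folds if f is not test for j in f]
        # fit the model on `train`, evaluate it on `test`, then average
        # the k accuracy estimates; with k = n this is leave-one-out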
58. Bayes Classifier
- A probabilistic framework for solving classification problems
- Conditional probability: P(C|A) = P(A, C) / P(A), and P(A|C) = P(A, C) / P(C)
- Bayes theorem:
      P(C|A) = P(A|C) P(C) / P(A)
59. Example of Bayes Theorem
- Given:
  - A doctor knows that meningitis causes stiff neck 50% of the time
  - The prior probability of any patient having meningitis is 1/50,000
  - The prior probability of any patient having a stiff neck is 1/20
- If a patient has a stiff neck, what's the probability he/she has meningitis?
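Plugging the three given numbers into Bayes' theorem answers the question directly (M = meningitis, S = stiff neck):

    P(M|S) = P(S|M) P(M) / P(S) = (0.5 x 1/50,000) / (1/20) = 0.0002

So even with a stiff neck, the probability of meningitis is only 0.02% (1 in 5,000).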
60. Bayesian Classifiers
- Consider each attribute and the class label as random variables
- Given a record with attributes (A1, A2, ..., An):
  - The goal is to predict class C
  - Specifically, we want to find the value of C that maximizes P(C | A1, A2, ..., An)
- Can we estimate P(C | A1, A2, ..., An) directly from data?
61. Bayesian Classifiers
- Approach:
  - Compute the posterior probability P(C | A1, A2, ..., An) for all values of C using Bayes theorem:
        P(C | A1, A2, ..., An) = P(A1, A2, ..., An | C) P(C) / P(A1, A2, ..., An)
  - Choose the value of C that maximizes P(C | A1, A2, ..., An)
  - Equivalent to choosing the value of C that maximizes P(A1, A2, ..., An | C) P(C)
- How to estimate P(A1, A2, ..., An | C)?
62. Naïve Bayes Classifier
- Assume independence among the attributes Ai when the class is given:
      P(A1, A2, ..., An | Cj) = P(A1 | Cj) P(A2 | Cj) ... P(An | Cj)
- Can estimate P(Ai | Cj) for all Ai and Cj
- A new point is classified as Cj if P(Cj) prod_i P(Ai | Cj) is maximal (see the sketch below)
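A minimal sketch of training and prediction by counting (categorical attributes in dicts are an assumption; no smoothing, so unseen attribute values get probability zero):

    from collections import Counter, defaultdict

    def train_nb(rows, labels):
        """Estimate P(Cj) and P(Ai = v | Cj) by relative frequencies."""
        prior = {c: n / len(labels) for c, n in Counter(labels).items()}
        cond = defaultdict(Counter)          # cond[(attr, class)][value] -> count
        for row, c in zip(rows, labels):
            for a, v in row.items():
                cond[(a, c)][v] += 1
        return prior, cond

    def predict_nb(prior, cond, row):
        """Pick the class maximizing P(Cj) * prod_i P(Ai | Cj)."""
        def score(c):
            p = prior[c]
            for a, v in row.items():
                p *= cond[(a, c)][v] / sum(cond[(a, c)].values())
            return p
        return max(prior, key=score)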
63. Example of Naïve Bayes Classifier
A: attributes, M: mammals, N: non-mammals
If P(A|M) P(M) > P(A|N) P(N), classify as a mammal.
64. Nearest Neighbor Classifiers
- Basic idea: if it walks like a duck and quacks like a duck, then it's probably a duck
65. Nearest-Neighbor Classifiers
- Requires three things:
  - The set of stored records
  - A distance metric to compute the distance between records
  - The value of k, the number of nearest neighbors to retrieve
- To classify an unknown record:
  - Compute its distance to the other training records
  - Identify the k nearest neighbors
  - Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote)
66. Definition of Nearest Neighbor
The k-nearest neighbors of a record x are the data points that have the k smallest distances to x.
67. 1-Nearest Neighbor
[Figure: Voronoi diagram; each training point's cell is the region it labels]
68. Nearest Neighbor Classification
- Compute the distance between two points, e.g., Euclidean distance:
      d(p, q) = sqrt( sum_i (p_i - q_i)^2 )
- Determine the class from the nearest neighbor list:
  - Take the majority vote of the class labels among the k nearest neighbors
  - Or weigh each vote according to distance, with weight factor w = 1/d^2 (see the sketch below)
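A minimal sketch of the distance-weighted vote (Euclidean distance, w = 1/d^2; the tiny epsilon is an assumption to guard against d = 0):

    import math
    from collections import defaultdict

    def knn_predict(train, labels, x, k=3):
        """Distance-weighted k-NN vote."""
        dist = sorted((math.dist(p, x), c) for p, c in zip(train, labels))
        votes = defaultdict(float)
        for d, c in dist[:k]:                # the k nearest neighbors
            votes[c] += 1.0 / (d * d + 1e-12)
        return max(votes, key=votes.get)

    print(knn_predict([(0, 0), (1, 0), (5, 5)], ["A", "A", "B"], (0.5, 0.2)))  # A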
69. Nearest Neighbor Classification...
- Choosing the value of k:
  - If k is too small, the classifier is sensitive to noise points
  - If k is too large, the neighborhood may include points from other classes
70. Nearest Neighbor Classification...
- Scaling issues: attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes (a min-max sketch follows this list)
- Example:
  - the height of a person may vary from 1.5m to 1.8m
  - the weight of a person may vary from 90lb to 300lb
  - the income of a person may vary from $10K to $1M
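A common fix is min-max scaling, which puts every attribute on [0, 1] before distances are computed (a sketch; assumes max > min):

    def min_max_scale(column):
        """Rescale one numeric attribute to [0, 1]."""
        lo, hi = min(column), max(column)
        return [(v - lo) / (hi - lo) for v in column]

    print(min_max_scale([1.5, 1.62, 1.8]))                 # heights in metres
    print(min_max_scale([10_000, 250_000, 1_000_000]))     # incomes in dollars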
71. Nearest Neighbor Classification...
- k-NN classifiers are lazy learners:
  - They do not build models explicitly
  - Unlike eager learners such as decision tree induction and rule-based systems
  - Classifying unknown records is relatively expensive