Classification: Decision Trees - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Classification: Decision Trees


1
Classification: Decision Trees
2
Classification: Definition
  • Given a collection of records (the training set).
  • Each record contains a set of attributes; one of
    the attributes is the class.
  • Find a model for the class attribute as a function
    of the values of the other attributes.
  • Goal: previously unseen records should be assigned
    a class as accurately as possible.
  • A test set is used to determine the accuracy of
    the model. Usually, the given data set is divided
    into training and test sets, with the training set
    used to build the model and the test set used to
    validate it (see the sketch below).
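A minimal sketch of this train-and-validate workflow, assuming
scikit-learn is available (the deck itself names no library) and using a
bundled toy dataset purely for illustration:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    X, y = load_iris(return_X_y=True)

    # Divide the given data set into training and test sets.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.33, random_state=42)

    # Build the model on the training set only.
    model = DecisionTreeClassifier(random_state=42)
    model.fit(X_train, y_train)

    # Validate it on the held-out test set.
    print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))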

3
Classification: A Two-Step Process
  • Model construction: describing a set of
    predetermined classes.
  • Each tuple/sample is assumed to belong to a
    predefined class, as determined by the class
    label attribute.
  • The set of tuples used for model construction is
    the training set.
  • The model is represented as classification rules,
    decision trees, or mathematical formulae.
  • Model usage: classifying future or unknown
    objects.
  • Estimate the accuracy of the model:
  • The known label of each test sample is compared
    with the classified result from the model.
  • Accuracy rate is the percentage of test set
    samples that are correctly classified by the
    model.
  • The test set is independent of the training set;
    otherwise over-fitting will occur.

4
Classification Process (1): Model Construction
[Figure: training data is fed to a classification algorithm, which
produces a learned classifier, e.g. the rule:
IF rank = 'professor' OR years > 6 THEN tenured = 'yes']
5
Classification Process (2): Use the Model in Prediction
[Figure: the classifier is applied to unseen data, e.g. the record
(Jeff, Professor, 4): Tenured?]
6
Illustrating Classification Task
7
Examples of Classification Task
  • Predicting tumor cells as benign or malignant
  • Classifying credit card transactions as
    legitimate or fraudulent
  • Classifying secondary structures of protein as
    alpha-helix, beta-sheet, or random coil
  • Categorizing news stories as finance, weather,
    entertainment, sports, etc.

8
Classification Techniques
  • Decision Tree based Methods
  • Rule-based Methods
  • Memory based reasoning
  • Neural Networks
  • Naïve Bayes and Bayesian Belief Networks
  • Support Vector Machines

9
Example of a Decision Tree
Splitting Attributes
[Figure: a training data table and the decision tree model built from it:
  Refund?
    Yes -> NO
    No  -> MarSt?
             Married          -> NO
             Single, Divorced -> TaxInc?
                                   < 80K -> NO
                                   > 80K -> YES]
10
Another Example of Decision Tree
[Figure: the same training data (two categorical attributes, one
continuous attribute, and the class) with a different tree that also
fits it:
  MarSt?
    Married          -> NO
    Single, Divorced -> Refund?
                          Yes -> NO
                          No  -> TaxInc?
                                   < 80K -> NO
                                   > 80K -> YES]
There could be more than one tree that fits the
same data!
11
Decision Tree Classification Task
Decision Tree
12
Apply Model to Test Data
Test Data
Start from the root of the tree.
13-17
Apply Model to Test Data
[Tree figure repeated across slides 13-17: the test record is routed
down the model one step per slide. Refund = No leads to the MarSt node,
MarSt = Married leads to the leaf NO, and the final slide assigns Cheat
to No.]
18
Decision Tree Classification Task
Decision Tree
19
Outline
  • Top-Down Decision Tree Construction
  • Choosing the Splitting Attribute
  • Information Gain and Gain Ratio

20
DECISION TREE
  • An internal node is a test on an attribute.
  • A branch represents an outcome of the test, e.g.,
    Color = red.
  • A leaf node represents a class label or class
    label distribution.
  • At each node, one attribute is chosen to split the
    training examples into classes that are as
    distinct as possible.
  • A new case is classified by following a matching
    path to a leaf node.

21
Weather Data: Play or not Play?
Note: Outlook is the weather forecast, with no relation to the
Microsoft email program.
22
Example Tree for Play?
[Figure: the tree
  Outlook?
    sunny    -> Humidity?
                  high   -> No
                  normal -> Yes
    overcast -> Yes
    rain     -> Windy?
                  false -> Yes
                  true  -> No]
23
Building a Decision Tree [Quinlan '93]
  • Top-down tree construction:
  • At the start, all training examples are at the root.
  • Partition the examples recursively by choosing one
    attribute at a time (see the sketch below).
  • Bottom-up tree pruning:
  • Remove subtrees or branches, in a bottom-up
    manner, to improve the estimated accuracy on new
    cases.
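A compact sketch of the top-down construction just described, assuming
categorical attributes and using entropy/information gain (introduced on
the following slides) as the splitting criterion; rows are assumed to be
dictionaries mapping attribute names to values, and pruning is omitted:

    import math
    from collections import Counter

    def entropy(labels):
        """Entropy of a list of class labels."""
        n = len(labels)
        return -sum((c / n) * math.log2(c / n)
                    for c in Counter(labels).values())

    def build_tree(rows, labels, attrs):
        """Top-down construction: greedily pick the attribute with the
        highest information gain, partition the examples, and recurse."""
        if len(set(labels)) == 1 or not attrs:
            return Counter(labels).most_common(1)[0][0]  # leaf: class label

        def gain(a):
            groups = Counter(row[a] for row in rows)
            remainder = sum(
                (cnt / len(rows)) *
                entropy([l for row, l in zip(rows, labels) if row[a] == v])
                for v, cnt in groups.items())
            return entropy(labels) - remainder

        best = max(attrs, key=gain)                      # splitting attribute
        children = {}
        for v in set(row[best] for row in rows):
            idx = [i for i, row in enumerate(rows) if row[best] == v]
            children[v] = build_tree([rows[i] for i in idx],
                                     [labels[i] for i in idx],
                                     [a for a in attrs if a != best])
        return (best, children)                          # internal node

    tree = build_tree(
        [{'Outlook': 'sunny'}, {'Outlook': 'overcast'}],
        ['No', 'Yes'], ['Outlook'])
    print(tree)   # e.g. ('Outlook', {'sunny': 'No', 'overcast': 'Yes'})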

24
Choosing the Splitting Attribute
  • At each node, the available attributes are
    evaluated on the basis of how well they separate
    the classes of the training examples. A goodness
    function is used for this purpose.
  • Typical goodness functions:
  • information gain (ID3/C4.5)
  • information gain ratio
  • gini index

25
How to Determine the Best Split
  • Greedy approach: nodes with a homogeneous class
    distribution are preferred.
  • Need a measure of node impurity.

[Figure: non-homogeneous class distribution = high degree of impurity;
homogeneous class distribution = low degree of impurity]
26
Splitting Criteria Based on INFO
  • Entropy at a given node t:
    Entropy(t) = - sum_j p(j|t) log2 p(j|t)
  • (NOTE: p(j|t) is the relative frequency of class j
    at node t.)
  • Measures the homogeneity of a node:
  • Maximum (log2 nc) when records are equally
    distributed among all nc classes, implying least
    information.
  • Minimum (0.0) when all records belong to one
    class, implying most information.
  • Entropy-based computations are similar to the
    GINI index computations.

27
Examples for Computing Entropy
P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Entropy = -0 log2 0 - 1 log2 1 = -0 - 0 = 0
P(C1) = 1/6, P(C2) = 5/6
Entropy = -(1/6) log2 (1/6) - (5/6) log2 (5/6) = 0.65
P(C1) = 2/6, P(C2) = 4/6
Entropy = -(2/6) log2 (2/6) - (4/6) log2 (4/6) = 0.92
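These values are easy to verify in a few lines of Python (a sketch,
taking 0 log 0 = 0 as the slides do):

    import math

    def entropy(probs):
        """Entropy = -sum_j p_j log2 p_j, with 0*log(0) evaluated as 0."""
        s = sum(p * math.log2(p) for p in probs if p > 0)
        return -s if s else 0.0

    print(entropy([0/6, 6/6]))            # 0.0
    print(round(entropy([1/6, 5/6]), 2))  # 0.65
    print(round(entropy([2/6, 4/6]), 2))  # 0.92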
28
Splitting Based on INFO...
  • Information Gain:
    GAIN_split = Entropy(p) - sum_{i=1..k} (n_i / n) Entropy(i)
  • Parent node p is split into k partitions;
    n_i is the number of records in partition i.
  • Measures the reduction in entropy achieved because
    of the split. Choose the split that achieves the
    most reduction (maximizes GAIN).
  • Used in ID3 and C4.5.
  • Disadvantage: tends to prefer splits that result
    in a large number of partitions, each being small
    but pure.

29
Which attribute to select?
30
A Criterion for Attribute Selection
  • Which is the best attribute?
  • The one which will result in the smallest tree.
  • Heuristic: choose the attribute that produces the
    purest nodes.
  • Popular impurity criterion: information gain.
  • Information gain increases with the average purity
    of the subsets that an attribute produces.
  • Strategy: choose the attribute that results in the
    greatest information gain.

31
Example: attribute Outlook
  • Outlook = Sunny: info([2,3]) = 0.971 bits
  • Outlook = Overcast: info([4,0]) = 0.0 bits
  • Outlook = Rainy: info([3,2]) = 0.971 bits
  • Expected information for the attribute:
    info([2,3], [4,0], [3,2])
    = (5/14)(0.971) + (4/14)(0) + (5/14)(0.971) = 0.693 bits

Note: log(0) is not defined, but we evaluate
0 log(0) as zero.
32
Computing the Information Gain
  • Information gain =
    (information before split) - (information after split)
  • gain("Outlook") = info([9,5]) - info([2,3], [4,0], [3,2])
    = 0.940 - 0.693 = 0.247 bits
  • Information gain for the attributes from the
    weather data: gain("Outlook") = 0.247 bits,
    gain("Temperature") = 0.029 bits,
    gain("Humidity") = 0.152 bits,
    gain("Windy") = 0.048 bits (verified in the
    sketch below).
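A sketch that reproduces the Outlook number, assuming the standard
14-instance weather data (9 play = yes, 5 play = no) that the Witten &
Eibe example uses; only the class counts per Outlook value are needed:

    import math

    def info(counts):
        """Entropy of a class distribution given as counts, e.g. [9, 5]."""
        n = sum(counts)
        s = sum((c / n) * math.log2(c / n) for c in counts if c)
        return -s if s else 0.0

    # Outlook partitions the 14 records into sunny [2 yes, 3 no],
    # overcast [4 yes, 0 no], and rainy [3 yes, 2 no] (assumed counts).
    before = info([9, 5])                                 # 0.940 bits
    after = sum(sum(part) / 14 * info(part)
                for part in ([2, 3], [4, 0], [3, 2]))     # 0.693 bits
    print("gain(Outlook) =", round(before - after, 3))    # 0.247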

33
Continuing to split
34
The Final Decision Tree
  • Note: not all leaves need to be pure; sometimes
    identical instances have different classes.
  • Splitting stops when the data can't be split any
    further.

35
Gini Index
  • If a data set T contains examples from n classes,
    the gini index gini(T) is defined as
    gini(T) = 1 - sum_j (p_j)^2
    where p_j is the relative frequency of class j
    in T.
  • If a data set T is split into two subsets T1 and
    T2 with sizes N1 and N2 respectively, the gini
    index of the split data is defined as
    gini_split(T) = (N1/N) gini(T1) + (N2/N) gini(T2)
  • The attribute that provides the smallest
    gini_split(T) is chosen to split the node (need to
    enumerate all possible splitting points for each
    attribute).

36
Measure of Impurity: GINI
  • Gini index for a given node t:
    GINI(t) = 1 - sum_j [p(j|t)]^2
  • (NOTE: p(j|t) is the relative frequency of class j
    at node t.)
  • Maximum (1 - 1/nc) when records are equally
    distributed among all nc classes, implying least
    interesting information.
  • Minimum (0.0) when all records belong to one
    class, implying most interesting information.

37
Examples for Computing GINI
P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Gini = 1 - P(C1)^2 - P(C2)^2 = 1 - 0 - 1 = 0
P(C1) = 1/6, P(C2) = 5/6
Gini = 1 - (1/6)^2 - (5/6)^2 = 0.278
P(C1) = 2/6, P(C2) = 4/6
Gini = 1 - (2/6)^2 - (4/6)^2 = 0.444
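The same kind of quick check works here; the split version weights each
child partition by its share of the records (a sketch, with the
gini_split counts chosen purely for illustration):

    def gini(probs):
        """Gini index = 1 - sum_j p_j^2."""
        return 1 - sum(p * p for p in probs)

    print(gini([0/6, 6/6]))            # 0.0
    print(round(gini([1/6, 5/6]), 3))  # 0.278
    print(round(gini([2/6, 4/6]), 3))  # 0.444

    def gini_split(parts):
        """gini_split = sum_i (N_i / N) * gini of partition i;
        each partition is given as a list of class counts."""
        n = sum(sum(p) for p in parts)
        return sum(sum(p) / n * gini([c / sum(p) for c in p])
                   for p in parts)

    print(round(gini_split([[1, 4], [5, 2]]), 3))  # 0.371 (hypothetical)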
38
Stopping Criteria for Tree Induction
  • Stop expanding a node when all the records belong
    to the same class
  • Stop expanding a node when all the records have
    similar attribute values
  • Early termination (to be discussed later)

39
Decision Tree Based Classification
  • Advantages
  • Inexpensive to construct
  • Extremely fast at classifying unknown records
  • Easy to interpret for small-sized trees
  • Accuracy is comparable to other classification
    techniques for many simple data sets

40
Example: C4.5
  • Simple depth-first construction.
  • Uses information gain.
  • Sorts continuous attributes at each node.
  • Needs the entire data set to fit in memory.
  • Unsuitable for large datasets: needs out-of-core
    sorting.
  • You can download the software from
    http://www.cse.unsw.edu.au/~quinlan/c4.5r8.tar.gz

41
Practical Issues of Classification
  • Underfitting and Overfitting
  • Missing Values
  • Costs of Classification

42
Underfitting and Overfitting
[Figure: training and test error as a function of model complexity,
with the overfitting region marked.]
Underfitting: when the model is too simple, both
training and test errors are large.
43
Overfitting due to Noise
The decision boundary is distorted by a noise point.
44
Overfitting due to Insufficient Examples
A lack of data points in the lower half of the
diagram makes it difficult to correctly predict the
class labels in that region: an insufficient number
of training records in the region causes the decision
tree to predict the test examples using other
training records that are irrelevant to the
classification task.
45
Notes on Overfitting
  • Overfitting results in decision trees that are
    more complex than necessary
  • Training error no longer provides a good estimate
    of how well the tree will perform on previously
    unseen records
  • Need new ways for estimating errors

46
Model Evaluation
  • Metrics for Performance Evaluation
  • How to evaluate the performance of a model?
  • Methods for Performance Evaluation
  • How to obtain reliable estimates?
  • Methods for Model Comparison
  • How to compare the relative performance among
    competing models?

48
Metrics for Performance Evaluation
  • Focus on the predictive capability of a model,
    rather than on how fast it classifies or builds
    models, scalability, etc.
  • Confusion matrix:

                  Predicted = Yes   Predicted = No
  Actual = Yes    a (TP)            b (FN)
  Actual = No     c (FP)            d (TN)
49
Metrics for Performance Evaluation
  • Most widely-used metric:
    Accuracy = (a + d) / (a + b + c + d)
             = (TP + TN) / (TP + TN + FP + FN)

50
Limitation of Accuracy
  • Consider a 2-class problem:
  • Number of Class 0 examples: 9990
  • Number of Class 1 examples: 10
  • If the model predicts everything to be class 0,
    accuracy is 9990/10000 = 99.9%.
  • Accuracy is misleading because the model does not
    detect any class 1 example.

51
Cost Matrix
C(i | j): the cost of misclassifying a class j
example as class i.
52
Computing Cost of Classification
Accuracy = 80%, Cost = 3910
Accuracy = 90%, Cost = 4255
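The slide's underlying tables are not preserved in this transcript; the
sketch below shows the computation itself, with a cost matrix and
confusion counts that are hypothetical but chosen to be consistent with
the accuracy and cost figures quoted above:

    def total_cost(conf, cost):
        """Sum C(i|j) * count over all (predicted i, actual j) cells."""
        return sum(conf[i][j] * cost[i][j] for i in conf for j in conf[i])

    # C(i|j): cost of predicting class i when the true class is j.
    cost = {'+': {'+': -1, '-': 1}, '-': {'+': 100, '-': 0}}

    # Hypothetical confusion matrices, conf[predicted][actual].
    m1 = {'+': {'+': 150, '-': 60}, '-': {'+': 40, '-': 250}}  # 80% accurate
    m2 = {'+': {'+': 250, '-': 5},  '-': {'+': 45, '-': 200}}  # 90% accurate

    print(total_cost(m1, cost))   # 3910
    print(total_cost(m2, cost))   # 4255

Note that the more accurate model incurs the higher cost, which is the
point of the Cost vs Accuracy slide that follows.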
53
Cost vs Accuracy
54
Cost-Sensitive Measures
  • Precision is biased towards C(YesYes)
    C(YesNo)
  • Recall is biased towards C(YesYes) C(NoYes)
  • F-measure is biased towards all except C(NoNo)
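In terms of the confusion-matrix cells a, b, c, d defined earlier, these
measures are straightforward to compute; a minimal sketch with
illustrative counts:

    def metrics(a, b, c, d):
        """a = TP, b = FN, c = FP, d = TN."""
        accuracy  = (a + d) / (a + b + c + d)
        precision = a / (a + c)                       # p
        recall    = a / (a + b)                       # r
        f_measure = 2 * recall * precision / (recall + precision)
        return accuracy, precision, recall, f_measure

    print(metrics(a=40, b=10, c=20, d=30))            # counts illustrative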

55
Model Evaluation
  • Metrics for Performance Evaluation
  • How to evaluate the performance of a model?
  • Methods for Performance Evaluation
  • How to obtain reliable estimates?

56
Methods for Performance Evaluation
  • How to obtain a reliable estimate of performance?
  • Performance of a model may depend on other
    factors besides the learning algorithm
  • Class distribution
  • Cost of misclassification
  • Size of training and test sets

57
Methods of Estimation
  • Holdout
  • Reserve 2/3 for training and 1/3 for testing.
  • Random subsampling
  • Repeated holdout.
  • Cross validation
  • Partition the data into k disjoint subsets.
  • k-fold: train on k-1 partitions, test on the
    remaining one (see the sketch below).
  • Leave-one-out: k = n.
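A minimal sketch of these estimation methods, assuming scikit-learn
(the deck names no library) and a toy dataset:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import (train_test_split, KFold,
                                         LeaveOneOut, cross_val_score)
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    model = DecisionTreeClassifier(random_state=0)

    # Holdout: reserve 2/3 for training and 1/3 for testing.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3,
                                              random_state=0)
    print(model.fit(X_tr, y_tr).score(X_te, y_te))

    # k-fold cross validation: train on k-1 partitions, test on the rest.
    print(cross_val_score(model, X, y,
                          cv=KFold(n_splits=10, shuffle=True,
                                   random_state=0)).mean())

    # Leave-one-out is the special case k = n.
    print(cross_val_score(model, X, y, cv=LeaveOneOut()).mean())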

58
Bayes Classifier
  • A probabilistic framework for solving
    classification problems.
  • Conditional probability:
    P(C | A) = P(A, C) / P(A)
  • Bayes' theorem:
    P(C | A) = P(A | C) P(C) / P(A)

59
Example of Bayes' Theorem
  • Given:
  • A doctor knows that meningitis causes a stiff neck
    50% of the time.
  • The prior probability of any patient having
    meningitis is 1/50,000.
  • The prior probability of any patient having a
    stiff neck is 1/20.
  • If a patient has a stiff neck, what's the
    probability he/she has meningitis?
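Applying Bayes' theorem to the numbers given (the slide leaves the
arithmetic implicit):

P(M | S) = P(S | M) P(M) / P(S)
         = (0.5 x 1/50,000) / (1/20) = 0.0002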

60
Bayesian Classifiers
  • Consider each attribute and class label as random
    variables.
  • Given a record with attributes (A1, A2, ..., An),
    the goal is to predict class C.
  • Specifically, we want to find the value of C that
    maximizes P(C | A1, A2, ..., An).
  • Can we estimate P(C | A1, A2, ..., An) directly
    from data?

61
Bayesian Classifiers
  • Approach:
  • Compute the posterior probability P(C | A1, A2,
    ..., An) for all values of C using Bayes' theorem.
  • Choose the value of C that maximizes
    P(C | A1, A2, ..., An).
  • This is equivalent to choosing the value of C that
    maximizes P(A1, A2, ..., An | C) P(C).
  • How to estimate P(A1, A2, ..., An | C)?

62
Naïve Bayes Classifier
  • Assume independence among the attributes Ai when
    the class is given:
  • P(A1, A2, ..., An | Cj) = P(A1 | Cj) P(A2 | Cj)
    ... P(An | Cj)
  • Can estimate P(Ai | Cj) for all Ai and Cj.
  • A new point is classified to Cj if
    P(Cj) prod_i P(Ai | Cj) is maximal (see the
    sketch below).
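A minimal sketch of this scheme for categorical attributes, estimating
the probabilities by simple counting (no smoothing; all names and toy
values below are illustrative, not from the slides):

    from collections import Counter, defaultdict

    def train_nb(rows, labels):
        """Estimate P(Cj) and P(Ai|Cj) by counting."""
        prior = Counter(labels)                  # class counts
        cond = defaultdict(Counter)              # (attr, class) -> values
        for row, c in zip(rows, labels):
            for i, v in enumerate(row):
                cond[(i, c)][v] += 1
        return prior, cond

    def predict_nb(prior, cond, row):
        """Classify to the Cj maximizing P(Cj) * prod_i P(Ai|Cj)."""
        n = sum(prior.values())
        def score(c):
            p = prior[c] / n
            for i, v in enumerate(row):
                p *= cond[(i, c)][v] / prior[c]
            return p
        return max(prior, key=score)

    rows = [('sunny', 'hot'), ('rainy', 'cool'),
            ('sunny', 'cool'), ('rainy', 'hot')]
    labels = ['no', 'yes', 'yes', 'no']
    print(predict_nb(*train_nb(rows, labels), ('sunny', 'hot')))  # 'no'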

63
Example of Naïve Bayes Classifier
A: attributes, M: mammals, N: non-mammals
P(A | M) P(M) > P(A | N) P(N) => mammals
64
Nearest Neighbor Classifiers
  • Basic idea:
  • If it walks like a duck and quacks like a duck,
    then it's probably a duck.

65
Nearest-Neighbor Classifiers
  • Requires three things:
  • The set of stored records.
  • A distance metric to compute the distance between
    records.
  • The value of k, the number of nearest neighbors
    to retrieve.
  • To classify an unknown record:
  • Compute the distance to the other training records.
  • Identify the k nearest neighbors.
  • Use the class labels of the nearest neighbors to
    determine the class label of the unknown record
    (e.g., by taking a majority vote).

66
Definition of Nearest Neighbor
The k-nearest neighbors of a record x are the data
points that have the k smallest distances to x.
67
1 nearest-neighbor
Voronoi Diagram
68
Nearest Neighbor Classification
  • Compute the distance between two points:
  • Euclidean distance:
    d(p, q) = sqrt(sum_i (p_i - q_i)^2)
  • Determine the class from the nearest neighbor list:
  • Take the majority vote of the class labels among
    the k nearest neighbors.
  • Optionally weigh each vote according to distance,
    with weight factor w = 1/d^2 (see the sketch
    below).
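A minimal sketch of the procedure, with the 1/d^2 distance weighting as
an option (all data below is illustrative):

    import math
    from collections import defaultdict

    def knn_predict(train, labels, x, k=3, weighted=False):
        """Classify x from its k nearest training records."""
        # Euclidean distance to every training record, sorted ascending.
        dists = sorted((math.dist(p, x), c) for p, c in zip(train, labels))
        votes = defaultdict(float)
        for d, c in dists[:k]:
            # w = 1/d^2 option (assumes d > 0 when weighted).
            votes[c] += 1 / d**2 if weighted else 1
        return max(votes, key=votes.get)

    train = [(1, 1), (2, 1), (8, 9), (9, 8)]
    labels = ['a', 'a', 'b', 'b']
    print(knn_predict(train, labels, (2, 2)))   # 'a'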

69
Nearest Neighbor Classification
  • Choosing the value of k
  • If k is too small, sensitive to noise points
  • If k is too large, neighborhood may include
    points from other classes

70
Nearest Neighbor Classification
  • Scaling issues:
  • Attributes may have to be scaled to prevent
    distance measures from being dominated by one of
    the attributes (see the sketch below).
  • Example:
  • the height of a person may vary from 1.5m to 1.8m
  • the weight of a person may vary from 90lb to 300lb
  • the income of a person may vary from $10K to $1M
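A small sketch of one common fix, min-max scaling each attribute to
[0, 1] so that income does not swamp height in the Euclidean distance
(values illustrative):

    def min_max_scale(column):
        """Rescale a list of values linearly to the range [0, 1]."""
        lo, hi = min(column), max(column)
        return [(v - lo) / (hi - lo) for v in column]

    heights = [1.5, 1.65, 1.8]              # metres
    incomes = [10_000, 500_000, 1_000_000]  # dollars
    print(min_max_scale(heights))           # [0.0, 0.5, 1.0]
    print(min_max_scale(incomes))           # [0.0, 0.4949..., 1.0]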

71
Nearest Neighbor Classification
  • k-NN classifiers are lazy learners:
  • They do not build models explicitly.
  • Unlike eager learners such as decision tree
    induction and rule-based systems.
  • Classifying unknown records is relatively
    expensive.