Title: Classification with Decision Trees
1 Classification with Decision Trees
- Instructor: Qiang Yang
- Hong Kong University of Science and Technology
- Qyang_at_cs.ust.hk
- Thanks to Eibe Frank and Jiawei Han
2 Continuous Classes
- Sometimes, classes are continuous in that they come from a continuous domain, e.g., temperature or stock price.
- Regression is well suited to this case:
- Linear and multiple regression
- Non-linear regression
- We shall focus on categorical classes, e.g., colors or Yes/No binary decisions.
- We will deal with continuous class values later, in CART.
3 Decision Tree (Quinlan '93)
- An internal node represents a test on an attribute.
- A branch represents an outcome of the test, e.g., Color = red.
- A leaf node represents a class label or class label distribution.
- At each node, one attribute is chosen to split the training examples into classes that are as distinct as possible.
- A new case is classified by following a matching path to a leaf node.
4 Training Set
5 Example
[Decision tree figure: Outlook is the root with branches sunny, overcast, and rain; the sunny branch tests Humidity (high → N, normal → P), the overcast branch leads to leaf P, and the rain branch tests Windy (true → N, false → P).]
6 Building a Decision Tree (Quinlan '93)
- Top-down tree construction:
- At the start, all training examples are at the root.
- Partition the examples recursively by choosing one attribute each time.
- Bottom-up tree pruning:
- Remove subtrees or branches, in a bottom-up manner, to improve the estimated accuracy on new cases.
7 Choosing the Splitting Attribute
- At each node, the available attributes are evaluated on the basis of how well they separate the classes of the training examples. A goodness function is used for this purpose.
- Typical goodness functions:
- Information gain (ID3/C4.5)
- Information gain ratio
- Gini index
8 Which attribute to select?
9 A criterion for attribute selection
- Which is the best attribute?
- The one that will result in the smallest tree
- Heuristic: choose the attribute that produces the purest nodes
- Popular impurity criterion: information gain
- Information gain increases with the average purity of the subsets that an attribute produces
- Strategy: choose the attribute that results in the greatest information gain
10 Computing information
- Information is measured in bits
- Given a probability distribution, the info required to predict an event is the distribution's entropy
- Entropy gives the information required in bits (this can involve fractions of bits!)
- Formula for computing the entropy (a small code sketch follows this slide):
  entropy(p1, p2, ..., pn) = -p1 log2(p1) - p2 log2(p2) - ... - pn log2(pn)
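A minimal Python sketch of this entropy computation (the function name and the weather-data class counts in the example calls are my own illustration, not from the slides):

    import math

    def entropy(counts):
        """Entropy (in bits) of a class-count distribution, e.g. [9, 5]."""
        total = sum(counts)
        ent = 0.0
        for c in counts:
            if c > 0:              # 0 * log(0) is treated as 0
                p = c / total
                ent -= p * math.log2(p)
        return ent

    print(entropy([9, 5]))   # full weather data (9 yes, 5 no): ~0.940 bits
    print(entropy([2, 3]))   # e.g. Outlook = Sunny: ~0.971 bits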
11 Example: attribute Outlook
- Outlook = Sunny: info([2,3]) = entropy(2/5, 3/5) = 0.971 bits
- Outlook = Overcast: info([4,0]) = entropy(1, 0) = 0 bits
- Outlook = Rainy: info([3,2]) = entropy(3/5, 2/5) = 0.971 bits
- Expected information for the attribute: info([2,3], [4,0], [3,2]) = (5/14)*0.971 + (4/14)*0 + (5/14)*0.971 = 0.693 bits
- Note: log(0) is normally not defined, but 0 * log(0) is taken to be 0.
12 Computing the information gain
- Information gain = information before splitting - information after splitting
- gain(Outlook) = info([9,5]) - info([2,3],[4,0],[3,2]) = 0.940 - 0.693 = 0.247 bits
- Information gain for the attributes from the weather data (see the sketch below):
- gain(Outlook) = 0.247 bits
- gain(Temperature) = 0.029 bits
- gain(Humidity) = 0.152 bits
- gain(Windy) = 0.048 bits
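The gain for Outlook can be checked with a small sketch along the same lines (again, the function names and the hard-coded class counts are illustrative assumptions):

    import math

    def entropy(counts):
        total = sum(counts)
        return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

    def info_gain(parent_counts, child_counts_list):
        # gain = entropy before splitting - weighted entropy after splitting
        n = sum(parent_counts)
        after = sum(sum(child) / n * entropy(child) for child in child_counts_list)
        return entropy(parent_counts) - after

    # Weather data: 9 yes / 5 no overall; Outlook splits them into [2,3], [4,0], [3,2]
    print(round(info_gain([9, 5], [[2, 3], [4, 0], [3, 2]]), 3))   # ~0.247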
13 Continuing to split
14 The final decision tree
- Note: not all leaves need to be pure; sometimes identical instances have different classes
- Splitting stops when the data can't be split any further
15 Highly-branching attributes
- Problematic: attributes with a large number of values (extreme case: ID code)
- Subsets are more likely to be pure if there is a large number of values
- Information gain is biased towards choosing attributes with a large number of values
- This may result in overfitting (selection of an attribute that is non-optimal for prediction)
- Another problem: fragmentation
16 The gain ratio
- Gain ratio: a modification of the information gain that reduces its bias towards high-branch attributes
- The gain ratio takes the number and size of branches into account when choosing an attribute
- It corrects the information gain by taking the intrinsic information of a split into account
- Also called the split ratio
- Intrinsic information: the entropy of the distribution of instances into branches
- (i.e., how much info do we need to tell which branch an instance belongs to)
17 Gain Ratio
- The intrinsic (split) information is:
- Large when the data is evenly spread across the branches
- Small when all the data belong to one branch
- The gain ratio (Quinlan '86) normalizes the info gain by this quantity
18 Computing the gain ratio
- Example: intrinsic information for ID code: info([1,1,...,1]) = 14 * (-1/14 * log2(1/14)) = 3.807 bits
- The importance of an attribute decreases as its intrinsic information gets larger
- Definition of the gain ratio: gain_ratio(attribute) = gain(attribute) / intrinsic_info(attribute)
- Example: gain_ratio(ID code) = 0.940 / 3.807 ≈ 0.247 (see the sketch below)
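A sketch of the gain ratio for an ID-code-like attribute, assuming the 14-instance weather data and treating each instance as its own branch (the function names are mine):

    import math

    def entropy(counts):
        total = sum(counts)
        return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

    def gain_ratio(parent_counts, child_counts_list):
        n = sum(parent_counts)
        gain = entropy(parent_counts) - sum(
            sum(child) / n * entropy(child) for child in child_counts_list)
        # intrinsic (split) information: entropy of the branch-size distribution
        split_info = entropy([sum(child) for child in child_counts_list])
        return gain / split_info

    # ID code: 14 singleton branches, each pure (9 of class yes, 5 of class no)
    id_branches = [[1, 0]] * 9 + [[0, 1]] * 5
    print(round(gain_ratio([9, 5], id_branches), 3))   # gain 0.940 / split info 3.807 ≈ 0.247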
19 Gain ratios for weather data

Outlook:      Info = 0.693   Gain = 0.940 - 0.693 = 0.247   Split info = info([5,4,5]) = 1.577   Gain ratio = 0.247/1.577 = 0.156
Temperature:  Info = 0.911   Gain = 0.940 - 0.911 = 0.029   Split info = info([4,6,4]) = 1.362   Gain ratio = 0.029/1.362 = 0.021
Humidity:     Info = 0.788   Gain = 0.940 - 0.788 = 0.152   Split info = info([7,7]) = 1.000     Gain ratio = 0.152/1.000 = 0.152
Windy:        Info = 0.892   Gain = 0.940 - 0.892 = 0.048   Split info = info([8,6]) = 0.985     Gain ratio = 0.048/0.985 = 0.049
20 More on the gain ratio
- Outlook still comes out on top
- However, ID code has a greater gain ratio
- Standard fix: an ad hoc test to prevent splitting on that type of attribute
- Problem with the gain ratio: it may overcompensate
- It may choose an attribute just because its intrinsic information is very low
- Standard fix:
- First, only consider attributes with greater-than-average information gain
- Then, compare them on gain ratio
21 Gini Index
- If a data set T contains examples from n classes, the gini index gini(T) is defined as
  gini(T) = 1 - sum_j (p_j)^2
  where p_j is the relative frequency of class j in T. gini(T) is minimized if the classes in T are skewed.
- After splitting T into two subsets T1 and T2 with sizes N1 and N2, the gini index of the split data is defined as
  gini_split(T) = (N1/N) gini(T1) + (N2/N) gini(T2)
- The attribute providing the smallest gini_split(T) is chosen to split the node (see the sketch below).
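A small sketch of the gini computations for two classes; the Humidity split counts ([3,4] and [6,1]) are the usual weather-data values and the function names are mine:

    def gini(counts):
        # gini(T) = 1 - sum of squared relative class frequencies
        total = sum(counts)
        return 1.0 - sum((c / total) ** 2 for c in counts)

    def gini_split(left_counts, right_counts):
        # weighted gini index after splitting T into subsets T1 and T2
        n1, n2 = sum(left_counts), sum(right_counts)
        n = n1 + n2
        return n1 / n * gini(left_counts) + n2 / n * gini(right_counts)

    print(round(gini([9, 5]), 3))                # impurity before the split
    print(round(gini_split([3, 4], [6, 1]), 3))  # smaller gini_split = better split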
22 Discussion
- Consider the following variations of decision trees
23 1. Apply KNN to each leaf node
- Instead of choosing the majority class label as the leaf's label, use KNN to choose a class label
24 2. Apply Naïve Bayesian at each leaf node
- For each leaf node, use all the available information we know about the test case to make decisions
- Instead of using the majority rule, use probability/likelihood to make decisions
25 3. Use error rates instead of entropy
- Suppose a node has N1 examples of the positive class P and N2 examples of the negative class N
- If N1 > N2, then choose P
- The error rate at this node is then N2/(N1+N2)
- The expected error at a parent node can be calculated as the weighted sum of the error rates at its child nodes
- The weights are the proportions of training data in each child (see the sketch below)
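A tiny sketch of the weighted error computation; the child counts used in the example are hypothetical:

    def node_error_rate(n_pos, n_neg):
        # error rate when the node is labeled with its majority class
        return min(n_pos, n_neg) / (n_pos + n_neg)

    def expected_error(children):
        # children is a list of (n_pos, n_neg) pairs, one per child node
        total = sum(p + n for p, n in children)
        return sum((p + n) / total * node_error_rate(p, n) for p, n in children)

    print(round(expected_error([(6, 1), (3, 4)]), 3))   # hypothetical split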
26 4. When there are missing values, allow tests to be done
- Attribute selection criterion: minimal total cost (C_total = C_mc + C_test) instead of minimal entropy as in C4.5
- If growing the tree has a smaller total cost, then choose an attribute with minimal total cost. Otherwise, stop and form a leaf.
- Label the leaf also according to minimal total cost (see the sketch below):
- Suppose the leaf has P positive examples and N negative examples
- FP denotes the cost of a false positive example and FN the cost of a false negative
- IF (P * FN >= N * FP) THEN label positive ELSE label negative
- More in the next lecture's slides
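A minimal sketch of the cost-based leaf labeling rule; the counts and costs in the example are made up for illustration:

    def label_leaf_by_cost(p, n, cost_fp, cost_fn):
        # Labeling the leaf positive makes the n negatives false positives (cost n*FP);
        # labeling it negative makes the p positives false negatives (cost p*FN).
        return "positive" if p * cost_fn >= n * cost_fp else "negative"

    # 3 positives, 10 negatives, but a false negative costs 5x more than a false positive
    print(label_leaf_by_cost(3, 10, cost_fp=1.0, cost_fn=5.0))   # -> "positive"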
27 Missing Values
- Missing values in test data:
- <Outlook=Sunny, Temp=Hot, Humidity=?, Windy=False>
- Humidity = High or Normal, but which one?
- Allow splitting of the instance down each branch of the decision tree
- Methods:
- 1. Equal proportion: 1/2 to each side
- 2. Unequal proportion: use the proportions in the training data
- Weighted result (see the sketch below)
28 Dealing with Continuous Class Values
- Use the mean of a set as the predicted value, or
- Use a linear regression formula to compute the predicted value
- In linear algebra, the least-squares weights are w = (X^T X)^(-1) X^T y (a small sketch follows this slide)
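A small numpy sketch of both options for a numeric class at a leaf; the data is invented for illustration:

    import numpy as np

    X = np.array([[1.0, 2.0], [1.0, 3.0], [1.0, 5.0], [1.0, 7.0]])  # first column = intercept
    y = np.array([4.1, 5.9, 10.2, 13.8])

    mean_prediction = y.mean()                  # option 1: predict the mean of the set
    w = np.linalg.lstsq(X, y, rcond=None)[0]    # option 2: least-squares linear regression

    print(mean_prediction)
    print(X @ w)                                # linear predictions at the leaf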
29 Using Entropy Reduction to Discretize Continuous Variables
- Given the following data, sorted by increasing Temperature values, with the associated Play attribute values:
- Task: partition the continuous-valued Temperature into the discrete values Cold and Warm
- Hint: decide the boundary by entropy reduction!

Temperature: 10 14 15 20 22 25 26 27 29 30 32 36 39 40
Play:         F  F  F  F  T  T  T  T  T  T  T  T  T  F
30 Entropy-Based Discretization
- Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is
  E(S, T) = (|S1|/|S|) Ent(S1) + (|S2|/|S|) Ent(S2)
- The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization (see the sketch below).
- The process is applied recursively to the partitions obtained, until some stopping criterion is met (e.g., the entropy reduction falls below a threshold).
- Experiments show that it may reduce data size and improve classification accuracy.
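A sketch of the boundary search on the Temperature data from the previous slide (function names are mine; only a single, non-recursive pass is shown):

    import math

    def entropy(labels):
        n = len(labels)
        return -sum((labels.count(v) / n) * math.log2(labels.count(v) / n)
                    for v in set(labels))

    def best_boundary(values, labels):
        # try every midpoint between consecutive sorted values and keep the one
        # with the lowest weighted entropy E(S, T)
        n = len(values)
        best_t, best_e = None, float("inf")
        for i in range(1, n):
            t = (values[i - 1] + values[i]) / 2
            left, right = labels[:i], labels[i:]
            e = len(left) / n * entropy(left) + len(right) / n * entropy(right)
            if e < best_e:
                best_t, best_e = t, e
        return best_t, best_e

    temps = [10, 14, 15, 20, 22, 25, 26, 27, 29, 30, 32, 36, 39, 40]
    play = ["F", "F", "F", "F", "T", "T", "T", "T", "T", "T", "T", "T", "T", "F"]
    print(best_boundary(temps, play))   # boundary 21.0 (between 20 and 22)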
31 How to Calculate ent(S)?
- Given two classes, Yes and No, in a set S:
- Let p1 be the proportion of Yes
- Let p2 be the proportion of No
- p1 + p2 = 100%
- The entropy is
- ent(S) = -p1 * log(p1) - p2 * log(p2)
- When p1 = 1 and p2 = 0, ent(S) = 0
- When p1 = 50% and p2 = 50%, ent(S) is maximal!
- See the TA's tutorial notes for an example.
32 Numeric attributes
- Standard method: binary splits (e.g., temp < 45)
- Difference to nominal attributes: every attribute offers many possible split points
- The solution is a straightforward extension:
- Evaluate the info gain (or another measure) for every possible split point of the attribute
- Choose the best split point
- The info gain for the best split point is the info gain for the attribute
- Computationally more demanding
33 An example
- Split on the Temperature attribute from the weather data (see the sketch below):
- E.g., 4 yeses and 2 nos for temperature < 71.5, and 5 yeses and 3 nos for temperature >= 71.5
- info([4,2],[5,3]) = (6/14) info([4,2]) + (8/14) info([5,3]) = 0.939 bits
- Split points are placed halfway between values
- All split points can be evaluated in one pass!

Temperature: 64  65  68  69  70  71  72  72  75  75  80  81  83  85
Play:        Yes No  Yes Yes Yes No  No  Yes Yes Yes No  Yes Yes No
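A quick check of the 0.939 figure, using the class counts on each side of the 71.5 split (function names are mine):

    import math

    def ent(counts):
        t = sum(counts)
        return -sum(c / t * math.log2(c / t) for c in counts if c > 0)

    def info(groups):
        # weighted average entropy over the groups, e.g. [[4, 2], [5, 3]]
        n = sum(sum(g) for g in groups)
        return sum(sum(g) / n * ent(g) for g in groups)

    print(round(info([[4, 2], [5, 3]]), 3))   # ~0.939 bits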
34 Missing values
- C4.5 splits instances with missing values into pieces (with weights summing to 1)
- A piece going down a particular branch receives a weight proportional to the popularity of the branch
- Info gain etc. can be used with fractional instances, using sums of weights instead of counts
- During classification, the same procedure is used to split instances into pieces
- The resulting probability distributions are merged using the weights
35 Stopping Criteria
- When all cases have the same class: the leaf node is labeled with this class.
- When there is no available attribute: the leaf node is labeled with the majority class.
- When the number of cases is less than a specified threshold: the leaf node is labeled with the majority class.
36 Pruning
- Pruning simplifies a decision tree to prevent overfitting to noise in the data
- Two main pruning strategies:
- Postpruning: take a fully-grown decision tree and discard unreliable parts
- Prepruning: stop growing a branch when information becomes unreliable
- Postpruning is preferred in practice because prepruning tends to stop too early
37 Prepruning
- Usually based on a statistical significance test
- Stops growing the tree when there is no statistically significant association between any attribute and the class at a particular node
- Most popular test: the chi-squared test
- ID3 used the chi-squared test in addition to information gain
- Only statistically significant attributes were allowed to be selected by the information gain procedure
38 The Weather example: Observed Counts

Outlook \ Play    Yes   No    Outlook subtotal
Sunny             2     0     2
Cloudy            0     1     1
Play subtotal     2     1     Total count in table: 3
39 The Weather example: Expected Counts
If the attributes were independent, the cell counts would be like this:

Outlook \ Play    Yes                  No                   Subtotal
Sunny             2*2/3 = 4/3 ≈ 1.3    2*1/3 = 2/3 ≈ 0.6    2
Cloudy            2*1/3 ≈ 0.6          1*1/3 ≈ 0.3          1
Subtotal          2                    1                    Total count in table: 3
40 Question: how different are the observed and expected counts?
- If the chi-squared value is very large, then A1 and A2 are not independent, that is, they are dependent! (A small computation sketch follows this slide.)
- Degrees of freedom: if the table has n x m cells, then degrees of freedom = (n-1) * (m-1)
- If all attributes at a node are independent of the class attribute, then stop splitting further.
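A small sketch of the chi-squared statistic for the 2x2 Outlook-vs-Play table from the previous slides (the variable names are mine):

    observed = [[2, 0],   # Sunny:  Yes, No
                [0, 1]]   # Cloudy: Yes, No

    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total   # count under independence
            chi2 += (obs - expected) ** 2 / expected

    dof = (len(observed) - 1) * (len(observed[0]) - 1)   # (n-1)*(m-1)
    print(chi2, dof)   # compare chi2 to the chi-squared distribution with dof degrees of freedom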
41 Postpruning
- Builds the full tree first and prunes it afterwards
- Attribute interactions are visible in the fully-grown tree
- Problem: identifying the subtrees and nodes that are due to chance effects
- Two main pruning operations:
- Subtree replacement
- Subtree raising
- Possible strategies: error estimation, significance testing, MDL principle
42 Subtree replacement
- Bottom-up: a tree is considered for replacement once all its subtrees have been considered
43 Subtree raising
- Deletes a node and redistributes its instances
- Slower than subtree replacement (worthwhile?)
44 Estimating error rates
- A pruning operation is performed only if it does not increase the estimated error
- Of course, the error on the training data is not a useful estimator (it would result in almost no pruning)
- One possibility: use a hold-out set for pruning (reduced-error pruning)
- C4.5's method: use the upper limit of the 25% confidence interval derived from the training data
- Standard Bernoulli-process-based method
45 Training Set
46 Post-pruning in C4.5
- Bottom-up pruning: at each non-leaf node v, if merging the subtree at v into a leaf node improves accuracy, perform the merging.
- Method 1: compute accuracy using examples not seen by the algorithm.
- Method 2: estimate accuracy using the training examples:
- Consider classifying E examples incorrectly out of N examples as observing E events in N trials of a binomial distribution.
- For a given confidence level CF, the upper limit U_CF(N, E) on the error rate over the whole population holds with CF% confidence.
47 Pessimistic Estimate
- Usage in statistics: sampling error estimation
- Example:
- Population: 1,000,000 people, which could be regarded as infinite
- Population mean: the percentage of left-handed people
- Sample: 100 people
- Sample mean: 6% left-handed
- How do we estimate the REAL population mean?
[Figure: confidence interval on the population proportion, with lower limit L_0.25(100, 6) and upper limit U_0.25(100, 6)]
48 Pessimistic Estimate
- Usage in decision tree (DT) learning: error estimation for some node in the DT
- Example:
- The unknown testing data could be regarded as an infinite universe
- Population mean: the percentage of errors made by this node
- Sample: 100 examples from the training data set
- Sample mean: 6 errors on the training data set
- How do we estimate the REAL average error rate? Use the upper limit of the confidence interval. A heuristic, but it works well.
[Figure: confidence interval on the node's error rate, with lower limit L_0.25(100, 6) and upper limit U_0.25(100, 6)]
49 C4.5's method
- The error estimate for a subtree is the weighted sum of the error estimates for all its leaves
- Error estimate for a node (a small code sketch follows this slide):
  e = (f + z^2/(2N) + z * sqrt(f/N - f^2/N + z^2/(4N^2))) / (1 + z^2/N)
- If c = 25% then z = 0.69 (from the normal distribution)
- f is the error on the training data
- N is the number of instances covered by the leaf
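A sketch of this error estimate; it reproduces the per-leaf estimates used on the next slide (the function name is mine):

    import math

    def c45_error_estimate(f, n, z=0.69):
        # upper confidence limit on the error rate; z = 0.69 corresponds to c = 25%
        return (f + z * z / (2 * n)
                + z * math.sqrt(f / n - f * f / n + z * z / (4 * n * n))) / (1 + z * z / n)

    # Leaves of the Outlook subtree on the next slide: 0 observed errors each
    for n in (6, 9, 1):
        print(n, round(c45_error_estimate(0.0, n), 3))   # ~0.074, 0.050, 0.323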
50 Example for Estimating Error
- Consider a subtree rooted at Outlook with 3 leaf nodes:
- Sunny: Play = yes (0 errors, 6 instances)
- Overcast: Play = yes (0 errors, 9 instances)
- Cloudy: Play = no (0 errors, 1 instance)
- The estimated error for this subtree is 6*0.074 + 9*0.050 + 1*0.323 = 1.217
- If the subtree is replaced with the single leaf "yes", the estimated error is larger (16 instances with 1 training error give roughly 16*0.118 ≈ 1.89)
- So no pruning is performed
51 Example continued
[Figure: the Outlook subtree being evaluated, with branches sunny (leaf: yes), overcast (leaf: yes), and cloudy (leaf: no), compared against replacing the whole subtree with a single leaf.]
52 Another Example
- Leaf error estimates: f = 0.33, e = 0.47; f = 0.5, e = 0.72; f = 0.33, e = 0.47
- Combined using the ratios 6:2:6, this gives 0.51
- Parent node: f = 5/14, e = 0.46
- Since 0.46 < 0.51, replacing the subtree with a single leaf does not increase the estimated error, so the subtree is pruned
53 Continuous Case: The CART Algorithm
54 Numeric prediction
- Counterparts exist for all the schemes we previously discussed: decision trees, rule learners, SVMs, etc.
- All classification schemes can be applied to regression problems using discretization
- Prediction: weighted average of the intervals' midpoints (weighted according to class probabilities)
- Regression is more difficult than classification (e.g., percent correct vs. mean squared error)
55 Regression trees
- Differences to decision trees:
- Splitting criterion: minimizing intra-subset variation
- Pruning criterion: based on a numeric error measure
- A leaf node predicts the average class value of the training instances reaching that node
- Can approximate piecewise constant functions
- Easy to interpret
- More sophisticated version: model trees
56 Model trees
- Regression trees with linear regression functions at each node
- Linear regression is applied to the instances that reach a node after the full regression tree has been built
- Only a subset of the attributes is used for the LR:
- Attributes occurring in the subtree (and perhaps attributes occurring on the path to the root)
- Fast: the overhead for LR is not large, because usually only a small subset of attributes is used in the tree
57 Smoothing
- Naïve method for prediction: output the value of the LR model at the corresponding leaf node
- Performance can be improved by smoothing predictions using the internal LR models
- The predicted value is a weighted average of the LR models along the path from the root to the leaf
- Smoothing formula: p' = (n*p + k*q) / (n + k), where p is the prediction passed up from below, q is the value predicted by the model at this node, n is the number of training instances reaching the node below, and k is a smoothing constant
- The same effect can be achieved by incorporating the internal models into the leaf nodes
58 Building the tree
- Splitting criterion: standard deviation reduction (a small sketch follows this slide)
- Termination criteria (important when building trees for numeric prediction):
- The standard deviation becomes smaller than a certain fraction of the SD for the full training set (e.g., 5%)
- Too few instances remain (e.g., fewer than four)
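A minimal sketch of standard deviation reduction as the splitting criterion (the function names and the numeric class values are illustrative):

    import math

    def std_dev(values):
        m = sum(values) / len(values)
        return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))

    def sd_reduction(parent, subsets):
        # SD of the parent minus the weighted SD of the subsets; larger = better split
        n = len(parent)
        return std_dev(parent) - sum(len(s) / n * std_dev(s) for s in subsets)

    y = [7.0, 7.5, 8.0, 20.0, 21.0, 22.5]            # class values reaching a node
    print(round(sd_reduction(y, [y[:3], y[3:]]), 3))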
59 Model tree for servo data
60 Variations of CART
- Applying logistic regression:
- Predict the probability of "True" or "False" instead of making a numeric-valued prediction
- Predict a probability value (p) rather than the outcome itself
- The probability is expressed through the odds ratio p/(1-p)
61 Other Trees
- Classification Trees
- Current node
- Children nodes (L, R)
- Decision Trees
- Current node
- Children nodes (L, R)
- GINI index used in CART (STD )
- Current node
- Children nodes (L, R)
62 Scalability: previous work
- Incremental tree construction (Quinlan 1993):
- Uses partial data to build a tree.
- Tests the remaining examples; misclassified ones are used to rebuild the tree iteratively.
- Still a main-memory algorithm.
- Best-known algorithms:
- ID3
- C4.5
- C5
63 Efforts on Scalability
- Most algorithms assume the data can fit in memory.
- Recent efforts focus on disk-resident implementations of decision trees:
- Random sampling
- Partitioning
- Examples:
- SLIQ (EDBT'96 -- MAR96)
- SPRINT (VLDB'96 -- SAM96)
- PUBLIC (VLDB'98 -- RS98)
- RainForest (VLDB'98 -- GRG98)