Title: Classification with Decision Trees
1 Classification with Decision Trees
- Instructor: Qiang Yang
- Hong Kong University of Science and Technology
- Qyang_at_cs.ust.hk
- Thanks to Eibe Frank and Jiawei Han
2 Continuous Classes
- Sometimes, classes are continuous in that they come from a continuous domain, e.g., temperature or stock price.
- Regression is well suited to this case:
- Linear and multiple regression
- Non-linear regression
- We shall focus on categorical classes, e.g., colors or Yes/No binary decisions.
- We will deal with continuous class values later, in CART.
3 Decision Tree (Quinlan '93)
- An internal node represents a test on an attribute.
- A branch represents an outcome of the test, e.g., Color = red.
- A leaf node represents a class label or class label distribution.
- At each node, one attribute is chosen to split the training examples into classes that are as distinct as possible.
- A new case is classified by following a matching path to a leaf node.
4 Training Set
5 Example
[Decision tree figure: Outlook is the root with branches sunny, overcast, and rain; the sunny branch tests Humidity (high → N, normal → P), the overcast branch leads to leaf P, and the rain branch tests Windy (true → N, false → P).]
6 Building a Decision Tree (Quinlan '93)
- Top-down tree construction:
- At the start, all training examples are at the root.
- Partition the examples recursively by choosing one attribute each time.
- Bottom-up tree pruning:
- Remove subtrees or branches, in a bottom-up manner, to improve the estimated accuracy on new cases.
7 Choosing the Splitting Attribute
- At each node, the available attributes are evaluated on the basis of how well they separate the classes of the training examples. A goodness function is used for this purpose.
- Typical goodness functions:
- Information gain (ID3/C4.5)
- Information gain ratio
- Gini index
8 Which attribute to select?
9 A criterion for attribute selection
- Which is the best attribute?
- The one that will result in the smallest tree
- Heuristic: choose the attribute that produces the purest nodes
- Popular impurity criterion: information gain
- Information gain increases with the average purity of the subsets that an attribute produces
- Strategy: choose the attribute that results in the greatest information gain
10 Computing information
- Information is measured in bits
- Given a probability distribution, the info required to predict an event is the distribution's entropy
- Entropy gives the information required in bits (this can involve fractions of bits!)
- Formula for computing the entropy (a small code sketch follows this slide):
  entropy(p1, p2, ..., pn) = -p1 log2(p1) - p2 log2(p2) - ... - pn log2(pn)
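A minimal Python sketch of this entropy computation (the function name and the weather-data class counts in the example calls are my own illustration, not from the slides):

    import math

    def entropy(counts):
        """Entropy (in bits) of a class-count distribution, e.g. [9, 5]."""
        total = sum(counts)
        ent = 0.0
        for c in counts:
            if c > 0:              # 0 * log(0) is treated as 0
                p = c / total
                ent -= p * math.log2(p)
        return ent

    print(entropy([9, 5]))   # full weather data (9 yes, 5 no): ~0.940 bits
    print(entropy([2, 3]))   # e.g. Outlook = Sunny: ~0.971 bits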
11 Example: attribute Outlook
- Outlook = Sunny: info([2,3]) = entropy(2/5, 3/5) = 0.971 bits
- Outlook = Overcast: info([4,0]) = entropy(1, 0) = 0 bits
- Outlook = Rainy: info([3,2]) = entropy(3/5, 2/5) = 0.971 bits
- Expected information for the attribute: info([2,3], [4,0], [3,2]) = (5/14)*0.971 + (4/14)*0 + (5/14)*0.971 = 0.693 bits
- Note: log(0) is normally not defined, but 0 * log(0) is taken to be 0.
12 Computing the information gain
- Information gain = information before splitting - information after splitting
- gain(Outlook) = info([9,5]) - info([2,3],[4,0],[3,2]) = 0.940 - 0.693 = 0.247 bits
- Information gain for the attributes from the weather data (see the sketch below):
- gain(Outlook) = 0.247 bits
- gain(Temperature) = 0.029 bits
- gain(Humidity) = 0.152 bits
- gain(Windy) = 0.048 bits
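The gain for Outlook can be checked with a small sketch along the same lines (again, the function names and the hard-coded class counts are illustrative assumptions):

    import math

    def entropy(counts):
        total = sum(counts)
        return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

    def info_gain(parent_counts, child_counts_list):
        # gain = entropy before splitting - weighted entropy after splitting
        n = sum(parent_counts)
        after = sum(sum(child) / n * entropy(child) for child in child_counts_list)
        return entropy(parent_counts) - after

    # Weather data: 9 yes / 5 no overall; Outlook splits them into [2,3], [4,0], [3,2]
    print(round(info_gain([9, 5], [[2, 3], [4, 0], [3, 2]]), 3))   # ~0.247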
13 Continuing to split
14 The final decision tree
- Note: not all leaves need to be pure; sometimes identical instances have different classes
- Splitting stops when the data can't be split any further
15 Highly-branching attributes
- Problematic: attributes with a large number of values (extreme case: ID code)
- Subsets are more likely to be pure if there is a large number of values
- Information gain is biased towards choosing attributes with a large number of values
- This may result in overfitting (selection of an attribute that is non-optimal for prediction)
- Another problem: fragmentation
16 The gain ratio
- Gain ratio: a modification of the information gain that reduces its bias towards high-branch attributes
- The gain ratio takes the number and size of branches into account when choosing an attribute
- It corrects the information gain by taking the intrinsic information of a split into account
- Also called the split ratio
- Intrinsic information: the entropy of the distribution of instances into branches
- (i.e., how much info do we need to tell which branch an instance belongs to)
17 Gain Ratio
- The intrinsic (split) information is:
- Large when the data is evenly spread across the branches
- Small when all the data belong to one branch
- The gain ratio (Quinlan '86) normalizes the info gain by this quantity
18 Computing the gain ratio
- Example: intrinsic information for ID code: info([1,1,...,1]) = 14 * (-1/14 * log2(1/14)) = 3.807 bits
- The importance of an attribute decreases as its intrinsic information gets larger
- Definition of the gain ratio: gain_ratio(attribute) = gain(attribute) / intrinsic_info(attribute)
- Example: gain_ratio(ID code) = 0.940 / 3.807 ≈ 0.247 (see the sketch below)
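A sketch of the gain ratio for an ID-code-like attribute, assuming the 14-instance weather data and treating each instance as its own branch (the function names are mine):

    import math

    def entropy(counts):
        total = sum(counts)
        return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

    def gain_ratio(parent_counts, child_counts_list):
        n = sum(parent_counts)
        gain = entropy(parent_counts) - sum(
            sum(child) / n * entropy(child) for child in child_counts_list)
        # intrinsic (split) information: entropy of the branch-size distribution
        split_info = entropy([sum(child) for child in child_counts_list])
        return gain / split_info

    # ID code: 14 singleton branches, each pure (9 of class yes, 5 of class no)
    id_branches = [[1, 0]] * 9 + [[0, 1]] * 5
    print(round(gain_ratio([9, 5], id_branches), 3))   # gain 0.940 / split info 3.807 ≈ 0.247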
19 Gain ratios for weather data

Outlook:      Info = 0.693   Gain = 0.940 - 0.693 = 0.247   Split info = info([5,4,5]) = 1.577   Gain ratio = 0.247/1.577 = 0.156
Temperature:  Info = 0.911   Gain = 0.940 - 0.911 = 0.029   Split info = info([4,6,4]) = 1.362   Gain ratio = 0.029/1.362 = 0.021
Humidity:     Info = 0.788   Gain = 0.940 - 0.788 = 0.152   Split info = info([7,7]) = 1.000     Gain ratio = 0.152/1.000 = 0.152
Windy:        Info = 0.892   Gain = 0.940 - 0.892 = 0.048   Split info = info([8,6]) = 0.985     Gain ratio = 0.048/0.985 = 0.049
20 More on the gain ratio
- Outlook still comes out on top
- However, ID code has a greater gain ratio
- Standard fix: an ad hoc test to prevent splitting on that type of attribute
- Problem with the gain ratio: it may overcompensate
- It may choose an attribute just because its intrinsic information is very low
- Standard fix:
- First, only consider attributes with greater-than-average information gain
- Then, compare them on gain ratio
21 Gini Index
- If a data set T contains examples from n classes, the gini index gini(T) is defined as
  gini(T) = 1 - sum_j (p_j)^2
  where p_j is the relative frequency of class j in T. gini(T) is minimized if the classes in T are skewed.
- After splitting T into two subsets T1 and T2 with sizes N1 and N2, the gini index of the split data is defined as
  gini_split(T) = (N1/N) gini(T1) + (N2/N) gini(T2)
- The attribute providing the smallest gini_split(T) is chosen to split the node (see the sketch below).
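A small sketch of the gini computations for two classes; the Humidity split counts ([3,4] and [6,1]) are the usual weather-data values and the function names are mine:

    def gini(counts):
        # gini(T) = 1 - sum of squared relative class frequencies
        total = sum(counts)
        return 1.0 - sum((c / total) ** 2 for c in counts)

    def gini_split(left_counts, right_counts):
        # weighted gini index after splitting T into subsets T1 and T2
        n1, n2 = sum(left_counts), sum(right_counts)
        n = n1 + n2
        return n1 / n * gini(left_counts) + n2 / n * gini(right_counts)

    print(round(gini([9, 5]), 3))                # impurity before the split
    print(round(gini_split([3, 4], [6, 1]), 3))  # smaller gini_split = better split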
22 Discussion
- Consider the following variations of decision trees
23 1. Apply KNN to each leaf node
- Instead of choosing the majority class label as the leaf's label, use KNN to choose a class label
24 2. Apply Naïve Bayesian at each leaf node
- For each leaf node, use all the available information we know about the test case to make decisions
- Instead of using the majority rule, use probability/likelihood to make decisions
25 3. Use error rates instead of entropy
- Suppose a node has N1 examples of the positive class P and N2 examples of the negative class N
- If N1 > N2, then choose P
- The error rate at this node is then N2/(N1+N2)
- The expected error at a parent node can be calculated as the weighted sum of the error rates at its child nodes
- The weights are the proportions of training data in each child (see the sketch below)
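A tiny sketch of the weighted error computation; the child counts used in the example are hypothetical:

    def node_error_rate(n_pos, n_neg):
        # error rate when the node is labeled with its majority class
        return min(n_pos, n_neg) / (n_pos + n_neg)

    def expected_error(children):
        # children is a list of (n_pos, n_neg) pairs, one per child node
        total = sum(p + n for p, n in children)
        return sum((p + n) / total * node_error_rate(p, n) for p, n in children)

    print(round(expected_error([(6, 1), (3, 4)]), 3))   # hypothetical split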
26 4. When there are missing values, allow tests to be done
- Attribute selection criterion: minimal total cost (C_total = C_mc + C_test) instead of minimal entropy as in C4.5
- If growing the tree has a smaller total cost, then choose an attribute with minimal total cost. Otherwise, stop and form a leaf.
- Label the leaf also according to minimal total cost (see the sketch below):
- Suppose the leaf has P positive examples and N negative examples
- FP denotes the cost of a false positive example and FN the cost of a false negative
- IF (P * FN >= N * FP) THEN label positive ELSE label negative
- More in the next lecture's slides
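A minimal sketch of the cost-based leaf labeling rule; the counts and costs in the example are made up for illustration:

    def label_leaf_by_cost(p, n, cost_fp, cost_fn):
        # Labeling the leaf positive makes the n negatives false positives (cost n*FP);
        # labeling it negative makes the p positives false negatives (cost p*FN).
        return "positive" if p * cost_fn >= n * cost_fp else "negative"

    # 3 positives, 10 negatives, but a false negative costs 5x more than a false positive
    print(label_leaf_by_cost(3, 10, cost_fp=1.0, cost_fn=5.0))   # -> "positive"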
27 Missing Values
- Missing values in test data:
- <Outlook=Sunny, Temp=Hot, Humidity=?, Windy=False>
- Humidity = High or Normal, but which one?
- Allow splitting of the instance down each branch of the decision tree
- Methods:
- 1. Equal proportion: 1/2 to each side
- 2. Unequal proportion: use the proportions in the training data
- Weighted result (see the sketch below)
28 Dealing with Continuous Class Values
- Use the mean of a set as the predicted value, or
- Use a linear regression formula to compute the predicted value
- In linear algebra, the least-squares weights are w = (X^T X)^(-1) X^T y (a small sketch follows this slide)
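A small numpy sketch of both options for a numeric class at a leaf; the data is invented for illustration:

    import numpy as np

    X = np.array([[1.0, 2.0], [1.0, 3.0], [1.0, 5.0], [1.0, 7.0]])  # first column = intercept
    y = np.array([4.1, 5.9, 10.2, 13.8])

    mean_prediction = y.mean()                  # option 1: predict the mean of the set
    w = np.linalg.lstsq(X, y, rcond=None)[0]    # option 2: least-squares linear regression

    print(mean_prediction)
    print(X @ w)                                # linear predictions at the leaf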
29 Using Entropy Reduction to Discretize Continuous Variables
- Given the following data, sorted by increasing Temperature values, with the associated Play attribute values:
- Task: partition the continuous-valued Temperature into the discrete values Cold and Warm
- Hint: decide the boundary by entropy reduction!

Temperature: 10 14 15 20 22 25 26 27 29 30 32 36 39 40
Play:         F  F  F  F  T  T  T  T  T  T  T  T  T  F
30 Entropy-Based Discretization
- Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is
  E(S, T) = (|S1|/|S|) Ent(S1) + (|S2|/|S|) Ent(S2)
- The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization (see the sketch below).
- The process is applied recursively to the partitions obtained, until some stopping criterion is met (e.g., the entropy reduction falls below a threshold).
- Experiments show that it may reduce data size and improve classification accuracy.
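A sketch of the boundary search on the Temperature data from the previous slide (function names are mine; only a single, non-recursive pass is shown):

    import math

    def entropy(labels):
        n = len(labels)
        return -sum((labels.count(v) / n) * math.log2(labels.count(v) / n)
                    for v in set(labels))

    def best_boundary(values, labels):
        # try every midpoint between consecutive sorted values and keep the one
        # with the lowest weighted entropy E(S, T)
        n = len(values)
        best_t, best_e = None, float("inf")
        for i in range(1, n):
            t = (values[i - 1] + values[i]) / 2
            left, right = labels[:i], labels[i:]
            e = len(left) / n * entropy(left) + len(right) / n * entropy(right)
            if e < best_e:
                best_t, best_e = t, e
        return best_t, best_e

    temps = [10, 14, 15, 20, 22, 25, 26, 27, 29, 30, 32, 36, 39, 40]
    play = ["F", "F", "F", "F", "T", "T", "T", "T", "T", "T", "T", "T", "T", "F"]
    print(best_boundary(temps, play))   # boundary 21.0 (between 20 and 22)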
31 How to Calculate ent(S)?
- Given two classes, Yes and No, in a set S:
- Let p1 be the proportion of Yes
- Let p2 be the proportion of No
- p1 + p2 = 100%
- The entropy is
- ent(S) = -p1 * log(p1) - p2 * log(p2)
- When p1 = 1 and p2 = 0, ent(S) = 0
- When p1 = 50% and p2 = 50%, ent(S) is maximal!
- See the TA's tutorial notes for an example.
32 Numeric attributes
- Standard method: binary splits (e.g., temp < 45)
- Difference to nominal attributes: every attribute offers many possible split points
- The solution is a straightforward extension:
- Evaluate the info gain (or another measure) for every possible split point of the attribute
- Choose the best split point
- The info gain for the best split point is the info gain for the attribute
- Computationally more demanding
33 An example
- Split on the Temperature attribute from the weather data (see the sketch below):
- E.g., 4 yeses and 2 nos for temperature < 71.5, and 5 yeses and 3 nos for temperature >= 71.5
- info([4,2],[5,3]) = (6/14) info([4,2]) + (8/14) info([5,3]) = 0.939 bits
- Split points are placed halfway between values
- All split points can be evaluated in one pass!

Temperature: 64  65  68  69  70  71  72  72  75  75  80  81  83  85
Play:        Yes No  Yes Yes Yes No  No  Yes Yes Yes No  Yes Yes No
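A quick check of the 0.939 figure, using the class counts on each side of the 71.5 split (function names are mine):

    import math

    def ent(counts):
        t = sum(counts)
        return -sum(c / t * math.log2(c / t) for c in counts if c > 0)

    def info(groups):
        # weighted average entropy over the groups, e.g. [[4, 2], [5, 3]]
        n = sum(sum(g) for g in groups)
        return sum(sum(g) / n * ent(g) for g in groups)

    print(round(info([[4, 2], [5, 3]]), 3))   # ~0.939 bits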
34 Missing values
- C4.5 splits instances with missing values into pieces (with weights summing to 1)
- A piece going down a particular branch receives a weight proportional to the popularity of the branch
- Info gain etc. can be used with fractional instances, using sums of weights instead of counts
- During classification, the same procedure is used to split instances into pieces
- The resulting probability distributions are merged using the weights
35 Stopping Criteria
- When all cases have the same class: the leaf node is labeled with this class.
- When there is no available attribute: the leaf node is labeled with the majority class.
- When the number of cases is less than a specified threshold: the leaf node is labeled with the majority class.
36 Pruning
- Pruning simplifies a decision tree to prevent overfitting to noise in the data
- Two main pruning strategies:
- Postpruning: take a fully-grown decision tree and discard unreliable parts
- Prepruning: stop growing a branch when information becomes unreliable
- Postpruning is preferred in practice because prepruning tends to stop too early
37 Prepruning
- Usually based on a statistical significance test
- Stops growing the tree when there is no statistically significant association between any attribute and the class at a particular node
- Most popular test: the chi-squared test
- ID3 used the chi-squared test in addition to information gain
- Only statistically significant attributes were allowed to be selected by the information gain procedure
38 The Weather example: Observed Counts

Outlook \ Play    Yes   No    Outlook subtotal
Sunny             2     0     2
Cloudy            0     1     1
Play subtotal     2     1     Total count in table: 3
39 The Weather example: Expected Counts
If the attributes were independent, the cell counts would be like this:

Outlook \ Play    Yes                  No                   Subtotal
Sunny             2*2/3 = 4/3 ≈ 1.3    2*1/3 = 2/3 ≈ 0.6    2
Cloudy            2*1/3 ≈ 0.6          1*1/3 ≈ 0.3          1
Subtotal          2                    1                    Total count in table: 3
40 Question: how different are the observed and expected counts?
- If the chi-squared value is very large, then A1 and A2 are not independent, that is, they are dependent! (A small computation sketch follows this slide.)
- Degrees of freedom: if the table has n x m cells, then degrees of freedom = (n-1) * (m-1)
- If all attributes at a node are independent of the class attribute, then stop splitting further.
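A small sketch of the chi-squared statistic for the 2x2 Outlook-vs-Play table from the previous slides (the variable names are mine):

    observed = [[2, 0],   # Sunny:  Yes, No
                [0, 1]]   # Cloudy: Yes, No

    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total   # count under independence
            chi2 += (obs - expected) ** 2 / expected

    dof = (len(observed) - 1) * (len(observed[0]) - 1)   # (n-1)*(m-1)
    print(chi2, dof)   # compare chi2 to the chi-squared distribution with dof degrees of freedom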
41 Postpruning
- Builds the full tree first and prunes it afterwards
- Attribute interactions are visible in the fully-grown tree
- Problem: identifying the subtrees and nodes that are due to chance effects
- Two main pruning operations:
- Subtree replacement
- Subtree raising
- Possible strategies: error estimation, significance testing, MDL principle
42 Subtree replacement
- Bottom-up: a tree is considered for replacement once all its subtrees have been considered
43 Subtree raising
- Deletes a node and redistributes its instances
- Slower than subtree replacement (worthwhile?)
44 Estimating error rates
- A pruning operation is performed only if it does not increase the estimated error
- Of course, the error on the training data is not a useful estimator (it would result in almost no pruning)
- One possibility: use a hold-out set for pruning (reduced-error pruning)
- C4.5's method: use the upper limit of the 25% confidence interval derived from the training data
- Standard Bernoulli-process-based method
45 Training Set
46 Post-pruning in C4.5
- Bottom-up pruning: at each non-leaf node v, if merging the subtree at v into a leaf node improves accuracy, perform the merging.
- Method 1: compute accuracy using examples not seen by the algorithm.
- Method 2: estimate accuracy using the training examples:
- Consider classifying E examples incorrectly out of N examples as observing E events in N trials of a binomial distribution.
- For a given confidence level CF, the upper limit U_CF(N, E) on the error rate over the whole population holds with CF% confidence.
47 Pessimistic Estimate
- Usage in statistics: sampling error estimation
- Example:
- Population: 1,000,000 people, which could be regarded as infinite
- Population mean: the percentage of left-handed people
- Sample: 100 people
- Sample mean: 6% left-handed
- How do we estimate the REAL population mean?
[Figure: confidence interval on the population proportion, with lower limit L_0.25(100, 6) and upper limit U_0.25(100, 6)]
48 Pessimistic Estimate
- Usage in decision tree (DT) learning: error estimation for some node in the DT
- Example:
- The unknown testing data could be regarded as an infinite universe
- Population mean: the percentage of errors made by this node
- Sample: 100 examples from the training data set
- Sample mean: 6 errors on the training data set
- How do we estimate the REAL average error rate? Use the upper limit of the confidence interval. A heuristic, but it works well.
[Figure: confidence interval on the node's error rate, with lower limit L_0.25(100, 6) and upper limit U_0.25(100, 6)]
49 C4.5's method
- The error estimate for a subtree is the weighted sum of the error estimates for all its leaves
- Error estimate for a node (a small code sketch follows this slide):
  e = (f + z^2/(2N) + z * sqrt(f/N - f^2/N + z^2/(4N^2))) / (1 + z^2/N)
- If c = 25% then z = 0.69 (from the normal distribution)
- f is the error on the training data
- N is the number of instances covered by the leaf
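A sketch of this error estimate; it reproduces the per-leaf estimates used on the next slide (the function name is mine):

    import math

    def c45_error_estimate(f, n, z=0.69):
        # upper confidence limit on the error rate; z = 0.69 corresponds to c = 25%
        return (f + z * z / (2 * n)
                + z * math.sqrt(f / n - f * f / n + z * z / (4 * n * n))) / (1 + z * z / n)

    # Leaves of the Outlook subtree on the next slide: 0 observed errors each
    for n in (6, 9, 1):
        print(n, round(c45_error_estimate(0.0, n), 3))   # ~0.074, 0.050, 0.323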
50 Example for Estimating Error
- Consider a subtree rooted at Outlook with 3 leaf nodes:
- Sunny: Play = yes (0 errors, 6 instances)
- Overcast: Play = yes (0 errors, 9 instances)
- Cloudy: Play = no (0 errors, 1 instance)
- The estimated error for this subtree is 6*0.074 + 9*0.050 + 1*0.323 = 1.217
- If the subtree is replaced with the single leaf "yes", the estimated error is larger (16 instances with 1 training error give roughly 16*0.118 ≈ 1.89)
- So no pruning is performed
51 Example continued
[Figure: the Outlook subtree being evaluated, with branches sunny (leaf: yes), overcast (leaf: yes), and cloudy (leaf: no), compared against replacing the whole subtree with a single leaf.]
52 Another Example
- Leaf error estimates: f = 0.33, e = 0.47; f = 0.5, e = 0.72; f = 0.33, e = 0.47
- Combined using the ratios 6:2:6, this gives 0.51
- Parent node: f = 5/14, e = 0.46
- Since 0.46 < 0.51, replacing the subtree with a single leaf does not increase the estimated error, so the subtree is pruned
53 Continuous Case: The CART Algorithm
54 Numeric prediction
- Counterparts exist for all the schemes we previously discussed: decision trees, rule learners, SVMs, etc.
- All classification schemes can be applied to regression problems using discretization
- Prediction: weighted average of the intervals' midpoints (weighted according to class probabilities)
- Regression is more difficult than classification (e.g., percent correct vs. mean squared error)
55 Regression trees
- Differences to decision trees:
- Splitting criterion: minimizing intra-subset variation
- Pruning criterion: based on a numeric error measure
- A leaf node predicts the average class value of the training instances reaching that node
- Can approximate piecewise constant functions
- Easy to interpret
- More sophisticated version: model trees
56 Model trees
- Regression trees with linear regression functions at each node
- Linear regression is applied to the instances that reach a node after the full regression tree has been built
- Only a subset of the attributes is used for the LR:
- Attributes occurring in the subtree (and perhaps attributes occurring on the path to the root)
- Fast: the overhead for LR is not large, because usually only a small subset of attributes is used in the tree
57 Smoothing
- Naïve method for prediction: output the value of the LR model at the corresponding leaf node
- Performance can be improved by smoothing predictions using the internal LR models
- The predicted value is a weighted average of the LR models along the path from the root to the leaf
- Smoothing formula: p' = (n*p + k*q) / (n + k), where p is the prediction passed up from below, q is the value predicted by the model at this node, n is the number of training instances reaching the node below, and k is a smoothing constant
- The same effect can be achieved by incorporating the internal models into the leaf nodes
58 Building the tree
- Splitting criterion: standard deviation reduction (a small sketch follows this slide)
- Termination criteria (important when building trees for numeric prediction):
- The standard deviation becomes smaller than a certain fraction of the SD for the full training set (e.g., 5%)
- Too few instances remain (e.g., fewer than four)
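A minimal sketch of standard deviation reduction as the splitting criterion (the function names and the numeric class values are illustrative):

    import math

    def std_dev(values):
        m = sum(values) / len(values)
        return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))

    def sd_reduction(parent, subsets):
        # SD of the parent minus the weighted SD of the subsets; larger = better split
        n = len(parent)
        return std_dev(parent) - sum(len(s) / n * std_dev(s) for s in subsets)

    y = [7.0, 7.5, 8.0, 20.0, 21.0, 22.5]            # class values reaching a node
    print(round(sd_reduction(y, [y[:3], y[3:]]), 3))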
59 Model tree for servo data
60 Variations of CART
- Applying logistic regression:
- Predict the probability of "True" or "False" instead of making a numeric-valued prediction
- Predict a probability value (p) rather than the outcome itself
- The probability is expressed through the odds ratio p/(1-p)
61 Other Trees
- Classification Trees
- Current node
- Children nodes (L, R)
- Decision Trees
- Current node
- Children nodes (L, R)
- GINI index used in CART (STD )
- Current node
- Children nodes (L, R)
62 Scalability: previous work
- Incremental tree construction (Quinlan 1993):
- Uses partial data to build a tree.
- Tests the remaining examples; misclassified ones are used to rebuild the tree iteratively.
- Still a main-memory algorithm.
- Best-known algorithms:
- ID3
- C4.5
- C5
63 Efforts on Scalability
- Most algorithms assume the data can fit in memory.
- Recent efforts focus on disk-resident implementations of decision trees:
- Random sampling
- Partitioning
- Examples:
- SLIQ (EDBT'96 -- MAR96)
- SPRINT (VLDB'96 -- SAM96)
- PUBLIC (VLDB'98 -- RS98)
- RainForest (VLDB'98 -- GRG98)