Classification with Decision Trees

Transcript and Presenter's Notes
1
Classification with Decision Trees
  • Instructor: Qiang Yang
  • Hong Kong University of Science and Technology
  • qyang@cs.ust.hk
  • Thanks: Eibe Frank and Jiawei Han

2
Continuous Classes
  • Sometimes, classes are continuous in that they
    come from a continuous domain,
  • e.g., temperature or stock price.
  • Regression is well suited to this case:
  • Linear and multiple regression
  • Non-linear regression
  • We shall focus on categorical classes, e.g.,
    colors or Yes/No binary decisions.
  • We will deal with continuous class values later
    in CART

3
DECISION TREE [Quinlan 93]
  • An internal node represents a test on an
    attribute.
  • A branch represents an outcome of the test, e.g.,
    Color = red.
  • A leaf node represents a class label or class
    label distribution.
  • At each node, one attribute is chosen to split
    training examples into distinct classes as much
    as possible
  • A new case is classified by following a matching
    path to a leaf node.

4
Training Set
5
Example
(Decision tree diagram: Outlook is the root; sunny leads to
a humidity test (high → N, normal → P), overcast leads to P,
and rain leads to a windy test (true → N, false → P).)
6
Building Decision Tree [Quinlan 93]
  • Top-down tree construction
  • At start, all training examples are at the root.
  • Partition the examples recursively by choosing
    one attribute each time.
  • Bottom-up tree pruning
  • Remove subtrees or branches, in a bottom-up
    manner, to improve the estimated accuracy on new
    cases.

7
Choosing the Splitting Attribute
  • At each node, the available attributes are evaluated
    on the basis of how well they separate the classes of
    the training examples. A goodness function is used
    for this purpose.
  • Typical goodness functions:
  • information gain (ID3/C4.5)
  • information gain ratio
  • gini index

8
Which attribute to select?
9
A criterion for attribute selection
  • Which is the best attribute?
  • The one that will result in the smallest tree
  • Heuristic: choose the attribute that produces the
    purest nodes
  • Popular impurity criterion: information gain
  • Information gain increases with the average
    purity of the subsets that an attribute produces
  • Strategy: choose the attribute that results in the
    greatest information gain

10
Computing information
  • Information is measured in bits
  • Given a probability distribution, the info
    required to predict an event is the
    distribution's entropy
  • Entropy gives the information required in bits
    (this can involve fractions of bits!)
  • Formula for computing the entropy:
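
The formula itself did not survive the transcript; the standard definition it refers to, for a distribution (p_1, ..., p_n), is

    entropy(p_1, ..., p_n) = -p_1 log2(p_1) - p_2 log2(p_2) - ... - p_n log2(p_n)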

11
Example attribute: Outlook
  • Outlook = Sunny
  • Outlook = Overcast
  • Outlook = Rainy
  • Expected information for the attribute
    (worked out below)

Note: this involves 0 x log(0), which is normally not
defined; it is treated as 0.
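
The per-value computations (images in the original deck) work out as follows for the 14-instance weather data, consistent with the 0.693 figure on slide 19:

    info([2,3]) = 0.971 bits   (Sunny: 2 yes, 3 no)
    info([4,0]) = 0 bits       (Overcast: 4 yes, 0 no; this is where 0 x log(0) appears)
    info([3,2]) = 0.971 bits   (Rainy: 3 yes, 2 no)
    expected info = (5/14) x 0.971 + (4/14) x 0 + (5/14) x 0.971 = 0.693 bits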
12
Computing the information gain
  • Information gain = information before splitting
    - information after splitting
  • Information gain for the attributes of the weather
    data (values below):
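
The gain values themselves were images in the original deck; they are consistent with the table on slide 19:

    gain(Outlook)     = 0.940 - 0.693 = 0.247 bits
    gain(Temperature) = 0.940 - 0.911 = 0.029 bits
    gain(Humidity)    = 0.940 - 0.788 = 0.152 bits
    gain(Windy)       = 0.940 - 0.892 = 0.048 bits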

13
Continuing to split
14
The final decision tree
  • Note: not all leaves need to be pure; sometimes
    identical instances have different classes
  • Splitting stops when the data can't be split any
    further

15
Highly-branching attributes
  • Problematic: attributes with a large number of
    values (extreme case: ID code)
  • Subsets are more likely to be pure if there is a
    large number of values
  • Information gain is biased towards choosing
    attributes with a large number of values
  • This may result in overfitting (selection of an
    attribute that is non-optimal for prediction)
  • Another problem: fragmentation

16
The gain ratio
  • Gain ratio: a modification of the information
    gain that reduces its bias towards high-branching
    attributes
  • Gain ratio takes the number and size of branches
    into account when choosing an attribute
  • It corrects the information gain by taking the
    intrinsic information of a split into account
  • Also called the split ratio
  • Intrinsic information: entropy of the distribution
    of instances into branches
  • (i.e., how much info do we need to tell which
    branch an instance belongs to)

17
Gain Ratio
  • The intrinsic information (split info) should be
  • Large when the data is evenly spread across branches
  • Small when all the data belong to one branch
  • The gain ratio [Quinlan 86] normalizes the info gain
    by dividing by this split information

18
Computing the gain ratio
  • Example: intrinsic information for the ID code
    attribute
  • The importance of an attribute decreases as its
    intrinsic information gets larger
  • Example: gain ratio for the ID code attribute
    (worked out below)
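
The formulas were images in the original deck; for the 14-instance weather data, the ID code attribute puts each instance into its own branch, so (consistent with the figures used elsewhere in the deck):

    intrinsic info(ID code) = info([1,1,...,1]) = 14 x (-(1/14) x log2(1/14)) = log2(14) = 3.807 bits
    gain ratio(ID code)     = 0.940 / 3.807 = 0.247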

19
Gain ratios for weather data
Attribute     Info   Gain                   Split info              Gain ratio
Outlook       0.693  0.940 - 0.693 = 0.247  info([5,4,5]) = 1.577   0.247 / 1.577 = 0.156
Temperature   0.911  0.940 - 0.911 = 0.029  info([4,6,4]) = 1.362   0.029 / 1.362 = 0.021
Humidity      0.788  0.940 - 0.788 = 0.152  info([7,7]) = 1.000     0.152 / 1.000 = 0.152
Windy         0.892  0.940 - 0.892 = 0.048  info([8,6]) = 0.985     0.048 / 0.985 = 0.049
20
More on the gain ratio
  • Outlook still comes out top
  • However, ID code has a greater gain ratio
  • Standard fix: an ad hoc test to prevent splitting on
    that type of attribute
  • Problem with gain ratio: it may overcompensate
  • It may choose an attribute just because its
    intrinsic information is very low
  • Standard fix:
  • First, only consider attributes with greater than
    average information gain
  • Then, compare them on gain ratio

21
Gini Index
  • If a data set T contains examples from n classes,
    the gini index gini(T) is defined as shown below,
  • where p_j is the relative frequency of class j
    in T. gini(T) is minimized when the class
    distribution in T is skewed (pure).
  • After splitting T into two subsets T1 and T2 with
    sizes N1 and N2, the gini index of the split data,
    gini_split(T), is defined as shown below.
  • The attribute providing the smallest gini_split(T) is
    chosen to split the node.

22
Discussion
  • Consider the following variations of decision
    trees

23
1. Apply KNN to each leaf node
  • Instead of predicting the majority class label at the
    leaf, use KNN to choose a class label

24
2. Apply Naïve Bayesian at each leaf node
  • For each leaf node, use all the available
    information we know about the test case to make
    a decision
  • Instead of using the majority rule, use
    probability/likelihood to make decisions

25
3. Use error rates instead of entropy
  • If a node has N1 examples of the positive class P
    and N2 examples of the negative class N,
  • If N1 > N2, then choose P
  • The error rate at this node is N2 / (N1 + N2)
  • The expected error at a parent node can be
    calculated as the weighted sum of the error rates at
    each child node (see the example below)
  • The weights are the proportion of training data
    in each child
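
As a small illustrative example (the numbers are made up, not from the slides): if a parent sends 60% of its training data to a child with error rate 0.10 and 40% to a child with error rate 0.25, its expected error is

    0.6 x 0.10 + 0.4 x 0.25 = 0.16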

26
4. When there are missing values, allow tests to be
performed
  • Attribute selection criterion: minimal total cost
    (C_total = C_mc + C_test) instead of minimal
    entropy as in C4.5
  • If growing the tree has a smaller total cost, then
    choose an attribute with minimal total cost.
    Otherwise, stop and form a leaf.
  • Label the leaf also according to minimal total cost:
  • Suppose the leaf has P positive examples and N
    negative examples
  • FP denotes the cost of a false positive and FN
    the cost of a false negative
  • If (P x FN >= N x FP) THEN label
    positive ELSE label negative
  • More in the next lecture slides

27
Missing Values
  • Missing values in test data:
  • <Outlook=Sunny, Temp=Hot, Humidity=?,
    Windy=False>
  • Humidity = High or Normal, but which one?
  • Allow splitting of the values down each branch
    of the decision tree
  • Methods:
  • 1. equal proportion: 1/2 to each side
  • 2. unequal proportion: use the proportions in the
    training data
  • Weighted result

28
Dealing with Continuous Class Values
  • Use the mean of a set as the predicted value
  • Use a linear regression formula to compute
    the predicted value

In linear algebra notation (a sketch is given below):
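
The formula did not survive the transcript; as a sketch, the standard ordinary-least-squares expression it presumably refers to is

    w = (X^T X)^(-1) X^T y,    predicted values  y_hat = X w

where X is the matrix of attribute values (one row per instance) and y is the vector of class values.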
29
Using Entropy Reduction to Discretize Continuous
Variables
  • Given the following data, sorted by increasing
    Temperature value, with the associated Play attribute
    values
  • Task: partition the continuous-valued temperature
    into the discrete values Cold and Warm
  • Hint: decide the boundary by entropy reduction!

Temperature: 10 14 15 20 22 25 26 27 29 30 32 36 39 40
Play:         F  F  F  F  T  T  T  T  T  T  T  T  T  F
30
Entropy-Based Discretization
  • Given a set of samples S, if S is partitioned
    into two intervals S1 and S2 using boundary T,
    the entropy after partitioning is E(S, T)
    (see the formula below).
  • The boundary that minimizes the entropy function
    over all possible boundaries is selected as a
    binary discretization.
  • The process is applied recursively to the partitions
    obtained, until some stopping criterion is met
    (e.g., the entropy reduction falls below a threshold).
  • Experiments show that it may reduce data size and
    improve classification accuracy.
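
The partition-entropy formula (an image in the original deck) is, in its standard form:

    E(S, T) = (|S1| / |S|) x Ent(S1) + (|S2| / |S|) x Ent(S2)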

31
How to Calculate ent(S)?
  • Given two classes, Yes and No, in a set S:
  • Let p1 be the proportion of Yes
  • Let p2 be the proportion of No, so p1 + p2 = 100%
  • The entropy is
  • ent(S) = -p1 x log2(p1) - p2 x log2(p2)
  • When p1 = 1 and p2 = 0, ent(S) = 0
  • When p1 = 50% and p2 = 50%, ent(S) is at its maximum!
  • See the TA's tutorial notes for an example; a small
    code sketch follows.
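
A minimal Python sketch of this two-class entropy (not part of the original slides, purely illustrative):

    import math

    def ent(p1):
        """Two-class entropy: -p1*log2(p1) - p2*log2(p2), treating 0*log(0) as 0."""
        p2 = 1.0 - p1
        return -sum(p * math.log2(p) for p in (p1, p2) if p > 0)

    # ent(1.0) -> 0.0, and ent(0.5) -> 1.0, the maximum for two classes.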

32
Numeric attributes
  • Standard method: binary splits (e.g., temp < 45)
  • Difference to nominal attributes: every attribute
    offers many possible split points
  • Solution is a straightforward extension:
  • Evaluate the info gain (or another measure) for every
    possible split point of the attribute
  • Choose the best split point
  • The info gain for the best split point is the info
    gain for the attribute
  • Computationally more demanding

33
An example
  • Split on the temperature attribute from the weather
    data
  • E.g., 4 yeses and 2 nos for temperature < 71.5, and
    5 yeses and 3 nos for temperature >= 71.5
  • info([4,2],[5,3]) = (6/14) x info([4,2]) +
    (8/14) x info([5,3]) = 0.939 bits
  • Split points are placed halfway between values
  • All split points can be evaluated in one pass!
    (see the sketch below)

Temperature: 64  65  68  69  70  71  72  72  75  75  80  81  83  85
Play:        Yes  No Yes Yes Yes  No  No Yes Yes Yes  No Yes Yes  No
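
A minimal Python sketch of this split-point search (not from the original slides; it re-counts the classes for each candidate for clarity, whereas a true one-pass version would maintain running class counts while sweeping the sorted values):

    import math
    from collections import Counter

    def info(counts):
        """Entropy (in bits) of a collection of class counts."""
        n = sum(counts)
        return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

    def best_split(values, labels):
        """Return (split_point, weighted_info) with the lowest info after a binary split."""
        pairs = sorted(zip(values, labels))
        n = len(pairs)
        best = None
        for i in range(1, n):
            if pairs[i][0] == pairs[i - 1][0]:
                continue  # no candidate boundary between equal values
            split = (pairs[i][0] + pairs[i - 1][0]) / 2  # halfway between adjacent values
            left = Counter(lab for _, lab in pairs[:i])
            right = Counter(lab for _, lab in pairs[i:])
            w_info = i / n * info(left.values()) + (n - i) / n * info(right.values())
            if best is None or w_info < best[1]:
                best = (split, w_info)
        return best

    temp = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
    play = ["Yes", "No", "Yes", "Yes", "Yes", "No", "No", "Yes", "Yes", "Yes", "No", "Yes", "Yes", "No"]
    # The candidate split at 71.5 evaluates to the 0.939 bits computed above;
    # best_split returns whichever candidate has the lowest weighted info.
    print(best_split(temp, play))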
34
Missing values
  • C4.5 splits instances with missing values into
    pieces (with weights summing to 1)
  • A piece going down a particular branch receives a
    weight proportional to the popularity of the
    branch
  • Info gain etc. can be used with fractional
    instances using sums of weights instead of counts
  • During classification, the same procedure is used
    to split instances into pieces
  • Probability distributions are merged using weights

35
Stopping Criteria
  • When all cases have the same class. The leaf node
    is labeled by this class.
  • When there is no available attribute. The leaf
    node is labeled by the majority class.
  • When the number of cases is less than a specified
    threshold. The leaf node is labeled by the
    majority class.

36
Pruning
  • Pruning simplifies a decision tree to prevent
    overfitting to noise in the data
  • Two main pruning strategies:
  • Postpruning: takes a fully-grown decision tree
    and discards unreliable parts
  • Prepruning: stops growing a branch when the
    information becomes unreliable
  • Postpruning is preferred in practice because
    prepruning can stop too early

37
Prepruning
  • Usually based on a statistical significance test
  • Stop growing the tree when there is no
    statistically significant association between any
    attribute and the class at a particular node
  • Most popular test: the chi-squared test
  • ID3 used the chi-squared test in addition to
    information gain
  • Only statistically significant attributes were
    allowed to be selected by the information gain
    procedure

38
The Weather example: Observed Counts

Outlook \ Play   Yes   No   Outlook subtotal
Sunny              2    0                  2
Cloudy             0    1                  1
Play subtotal      2    1   Total count in table: 3
39
The Weather example: Expected Counts

If the attributes were independent, then the cell counts
would be proportional to the subtotals, like this:

Outlook \ Play   Yes                    No                     Subtotal
Sunny            2 x 2/3 = 4/3 ≈ 1.33   2 x 1/3 = 2/3 ≈ 0.67   2
Cloudy           1 x 2/3 = 2/3 ≈ 0.67   1 x 1/3 = 1/3 ≈ 0.33   1
Subtotal         2                      1                      Total count in table: 3
40
Question: how different are the observed counts from the
expected counts?
  • If the chi-squared value (see the formula below) is
    very large, then A1 and A2 are not independent,
    that is, they are dependent!
  • Degrees of freedom: if the table has n x m cells,
    then the degrees of freedom are (n-1) x (m-1)
  • If all attributes at a node are independent of the
    class attribute, then stop splitting further.
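
The chi-squared statistic (an image in the original deck) is the standard one:

    chi^2 = sum over cells of (Observed - Expected)^2 / Expected

For the tables above: (2 - 4/3)^2/(4/3) + (0 - 2/3)^2/(2/3) + (0 - 2/3)^2/(2/3) + (1 - 1/3)^2/(1/3) = 1/3 + 2/3 + 2/3 + 4/3 = 3.0, with (2-1) x (2-1) = 1 degree of freedom.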

41
Postpruning
  • Builds full tree first and prunes it afterwards
  • Attribute interactions are visible in the fully-grown
    tree
  • Problem: identification of the subtrees and nodes
    that are due to chance effects
  • Two main pruning operations:
  • Subtree replacement
  • Subtree raising
  • Possible strategies: error estimation,
    significance testing, the MDL principle

42
Subtree replacement
  • Bottom-up: a tree is considered for replacement
    once all its subtrees have been considered

43
Subtree raising
  • Deletes node and redistributes instances
  • Slower than subtree replacement (Worthwhile?)

44
Estimating error rates
  • A pruning operation is performed only if it does not
    increase the estimated error
  • Of course, the error on the training data is not a
    useful estimator (it would result in almost no
    pruning)
  • One possibility: use a hold-out set for pruning
    (reduced-error pruning)
  • C4.5's method: use the upper limit of a 25%
    confidence interval derived from the training
    data
  • Standard Bernoulli-process-based method

45
Training Set
46
Post-pruning in C4.5
  • Bottom-up pruning: at each non-leaf node v, if
    merging the subtree at v into a leaf node
    improves accuracy, perform the merging.
  • Method 1: compute accuracy using examples not
    seen by the algorithm.
  • Method 2: estimate accuracy using the training
    examples:
  • Consider classifying E examples incorrectly out
    of N examples as observing E events in N trials
    of a binomial distribution.
  • For a given confidence level CF, the upper limit
    on the error rate over the whole population is
    U_CF(E, N), with CF% confidence.

47
Pessimistic Estimate
  • Usage in statistics: sampling error estimation
  • Example:
  • population: 1,000,000 people, which could be regarded
    as infinite
  • population mean: percentage of left-handed people
  • sample: 100 people
  • sample mean: 6 left-handed (6%)
  • How do we estimate the REAL population mean?

U_0.25(100, 6) and L_0.25(100, 6): the upper and lower
limits of the 25% confidence interval
48
Pessimistic Estimate
  • Usage in a decision tree (DT): error estimation for
    some node in the DT
  • Example:
  • the unknown testing data could be regarded as an
    infinite universe
  • population mean: percentage of errors made by this
    node
  • sample: 100 examples from the training data set
  • sample mean: 6 errors on the training data set
  • How do we estimate the REAL average error rate?

Heuristic! But works well...
U_0.25(100, 6), L_0.25(100, 6)
49
C4.5's method
  • The error estimate for a subtree is the weighted sum
    of the error estimates for all its leaves
  • The error estimate for a node is given by the formula
    below, where
  • if c = 25% then z = 0.69 (from the normal
    distribution),
  • f is the error rate on the training data, and
  • N is the number of instances covered by the leaf
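
The formula itself was an image in the original deck; the standard C4.5-style pessimistic estimate (as described in Witten & Frank's data mining textbook) is

    e = ( f + z^2/(2N) + z x sqrt( f/N - f^2/N + z^2/(4N^2) ) ) / ( 1 + z^2/N )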

50
Example for Estimating Error
  • Consider a subtree rooted at Outlook with 3 leaf
    nodes:
  • Sunny: Play = yes (0 errors, 6 instances)
  • Overcast: Play = yes (0 errors, 9 instances)
  • Cloudy: Play = no (0 errors, 1 instance)
  • The estimated error for this subtree is
  • 6 x 0.074 + 9 x 0.050 + 1 x 0.323 = 1.217
  • If the subtree is replaced with the single leaf yes,
    the estimated error is larger (worked out below)
  • So no pruning is performed
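
As a rough check (my own arithmetic, using the pessimistic-estimate formula above with z = 0.69): replacing the subtree with a single yes leaf gives 1 error among the 16 instances (f = 1/16), so e ≈ 0.118 and the estimated error is about 16 x 0.118 ≈ 1.9, which exceeds 1.217, hence no pruning.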

51
Example continued
(Diagram: the Outlook subtree, with branches sunny → yes,
overcast → yes, and cloudy → no, shown being considered for
replacement by a single leaf labelled yes.)
52
Another Example
Parent node: f = 5/14, e = 0.46
Leaf nodes: f = 0.33, e = 0.47;  f = 0.5, e = 0.72;  f = 0.33, e = 0.47
Combined using the ratios 6:2:6, this gives 0.51
53
Continuous Case: The CART Algorithm
54
Numeric prediction
  • Counterparts exist for all the schemes that we
    previously discussed
  • Decision trees, rule learners, SVMs, etc.
  • All classification schemes can be applied to
    regression problems using discretization
  • Prediction: the weighted average of the intervals'
    midpoints (weighted according to the class
    probabilities)
  • Regression is more difficult than classification
    (i.e., percent correct vs. mean squared error)

55
Regression trees
  • Differences to decision trees:
  • Splitting criterion: minimizing intra-subset
    variation
  • Pruning criterion: based on a numeric error measure
  • A leaf node predicts the average class value of the
    training instances reaching that node
  • Can approximate piecewise constant functions
  • Easy to interpret
  • More sophisticated version: model trees

56
Model trees
  • Regression trees with linear regression functions
    at each node
  • Linear regression is applied to the instances that
    reach a node after the full regression tree has been
    built
  • Only a subset of the attributes is used for LR
  • Attributes occurring in the subtree (and maybe the
    attributes occurring on the path to the root)
  • Fast: the overhead for LR is not large because usually
    only a small subset of attributes is used in the tree

57
Smoothing
  • Naïve prediction method: output the value of the LR
    model for the corresponding leaf node
  • Performance can be improved by smoothing the
    predictions using the internal LR models
  • The predicted value is a weighted average of the LR
    models along the path from the root to the leaf
  • Smoothing formula (see below)
  • The same effect can be achieved by incorporating the
    internal models into the leaf nodes
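
The formula is missing from the transcript; the smoothing rule used in M5-style model trees (e.g., as described by Witten & Frank) is

    p' = (n x p + k x q) / (n + k)

where p is the prediction passed up from the node below, q is the value predicted by the model at the current node, n is the number of training instances reaching the node below, and k is a smoothing constant.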

58
Building the tree
  • Splitting criterion: standard deviation reduction
    (see the formula below)
  • Termination criteria (important when building
    trees for numeric prediction):
  • The standard deviation becomes smaller than a certain
    fraction of the sd for the full training set (e.g., 5%)
  • Too few instances remain (e.g., fewer than four)
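
The standard deviation reduction (SDR) formula, not reproduced in the transcript, is in its usual form

    SDR = sd(T) - sum_i ( |T_i| / |T| ) x sd(T_i)

where T is the set of instances at the node and T_1, T_2, ... are the subsets produced by the split.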

59
Model tree for servo data
60
Variations of CART
  • Applying logistic regression:
  • predict the probability of True or False instead
    of making a numeric-valued prediction
  • predict a probability value (p) rather than the
    outcome itself
  • Probability is expressed through the odds ratio
    p / (1 - p), whose logarithm is modeled as a linear
    function

61
Other Trees
  • Classification Trees
  • Current node
  • Children nodes (L, R)
  • Decision Trees
  • Current node
  • Children nodes (L, R)
  • GINI index used in CART (STD )
  • Current node
  • Children nodes (L, R)

62
Scalability: Previous works
  • Incremental tree construction [Quinlan 1993]:
  • use partial data to build a tree;
  • test the remaining examples, and use the
    misclassified ones to rebuild the tree iteratively;
  • still a main-memory algorithm.
  • Best known algorithms
  • ID3
  • C4.5
  • C5

63
Efforts on Scalability
  • Most algorithms assume the data can fit in memory.
  • Recent efforts focus on disk-resident
    implementations of decision trees.
  • Random sampling
  • Partitioning
  • Examples
  • SLIQ (EDBT96 -- MAR96)
  • SPRINT (VLDB96 -- SAM96)
  • PUBLIC (VLDB98 -- RS98)
  • RainForest (VLDB98 -- GRG98)