Decision Trees
1
COMP5318, Lecture 3: Data Mining and Machine
Learning
  • Decision Trees
  • Reference: Witten and Frank, pp. 89-97 and 159-164
  • Dunham, pp. 97-103

2
Outline of the Lecture
  • DT example
  • Constructing DTs
  • DT Decision Boundary
  • Avoiding Overfitting the Data
  • Dealing with Numeric Attributes
  • Alternative Measures for Selecting Attributes
  • Dealing with Missing Attributes
  • Handling Attributes with Different Costs

3
Decision Trees (DTs)
  • DTs are supervised learners for classification
  • most popular and well researched ML/DM technique
  • divide-and-conquer approach
  • developed in parallel in ML by Ross Quinlan
    (USyd) and in statistics by Breiman, Friedman,
    Olshen and Stone (Classification and Regression
    Trees, 1984)
  • Quinlan has refined the DT algorithm over the years
  • ID3, 1986
  • C4.5, 1993
  • C5.0 (See5 on Windows), a commercial version of
    C4.5 used in many DM packages
  • In WEKA: Id3 and J48
  • Many other software implementations are available
    on the web, both free and commercial (USD 50 to
    300,000; A. Moore)

4
DT Example
  • DT representation (model)
  • each internal node tests an attribute
  • each branch corresponds to an attribute value
  • each leaf node assigns a class

DT for the tennis data
  • A DT is a tree-structured plan for testing the
    values of a set of attributes in order to predict
    the output A. Moore
  • Another interpretation
  • Use all your data to build a tree of questions
    with answers at the leaves. To answer a new
    query, start from the tree root, answer the
    questions until you reach a leaf node and return
    its answer.
  • What would be the prediction for
  • outlook=sunny, temperature=cool, humidity=high,
    windy=true? (see the sketch below)
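As an illustration of this "tree of questions" reading (a sketch, not from
the slides), the tennis tree can be hard-coded as nested dictionaries and
walked from the root to a leaf; the representation and the name classify
are assumptions:

```python
# The tennis DT as nested dicts: internal nodes test an attribute,
# branches are attribute values, leaves are class labels.
tennis_tree = {
    "attribute": "outlook",
    "branches": {
        "sunny":    {"attribute": "humidity",
                     "branches": {"high": "no", "normal": "yes"}},
        "overcast": "yes",
        "rainy":    {"attribute": "windy",
                     "branches": {"true": "no", "false": "yes"}},
    },
}

def classify(tree, instance):
    """Walk from the root, answering each attribute test, until a leaf."""
    while isinstance(tree, dict):
        tree = tree["branches"][instance[tree["attribute"]]]
    return tree

query = {"outlook": "sunny", "temperature": "cool",
         "humidity": "high", "windy": "true"}
print(classify(tennis_tree, query))  # -> "no"
```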

5
Building a DT Example First Node
Example created by Eric McCreath
6
Second Node
Example created by Eric McCreath
7
Final Tree
Wage > 22K
dependents < 6
Example created by Eric McCreath
8
Stopping Criteria Again
  • In fact the real stopping criterion is more
    complex
  • 4. Stop if in the current sub-set
  • a) all instances have the same class => make a
    leaf node corresponding to this class, or else
  • b) there are no remaining attributes that can
    create non-empty children (no attribute can
    distinguish) => make a leaf node and label it
    with the majority class

Think about when case b) will occur!
9
Constructing DTs (ID3 algorithm)
  • Top-down in recursive divide-and-conquer fashion
  • 1. An attribute is selected for the root node and
    a branch is created for each possible attribute
    value
  • 2. The instances are split into subsets (one for
    each branch extending from the node)
  • 3. The procedure is repeated recursively for each
    branch, using only instances that reach the
    branch
  • 4. Stop if
  • all instances have the same class => make a leaf
    node corresponding to this class (a code sketch of
    the whole procedure follows below)
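A minimal Python sketch of this top-down, divide-and-conquer procedure,
combined with the stopping criteria above; the data representation (a list
of dicts with a "class" entry) and the names id3 and select are
assumptions, and select stands for the attribute-selection heuristic
introduced on the following slides:

```python
from collections import Counter

def id3(examples, attributes, select):
    """Top-down, recursive divide-and-conquer DT construction (a sketch).

    examples   -- list of dicts mapping attribute names to values, each
                  with a "class" entry
    attributes -- attribute names still available for testing
    select     -- heuristic that picks the best attribute to test
                  (e.g. information gain, introduced on the next slides)
    """
    classes = [e["class"] for e in examples]
    # Stop (a): all instances have the same class -> leaf with that class.
    if len(set(classes)) == 1:
        return classes[0]
    # Stop (b): no attributes left to test -> leaf with the majority class.
    if not attributes:
        return Counter(classes).most_common(1)[0][0]

    best = select(examples, attributes)           # 1. choose an attribute
    node = {"attribute": best, "branches": {}}
    remaining = [a for a in attributes if a != best]
    for value in set(e[best] for e in examples):  # 2. one branch per value
        subset = [e for e in examples if e[best] == value]
        # 3. recurse, using only the instances that reach this branch
        node["branches"][value] = id3(subset, remaining, select)
    return node
```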

10
Expressiveness of DTs
  • DTs can be expressed as a disjunction of
    conjunctions
  • Extract the rules for the tennis tree
  • Assume that all attributes are Boolean and the
    class is Boolean. What is the class of Boolean
    functions that we can represent using DTs?
  • Answer
  • Proof

11
How to find the best attribute?
  • A heuristic is needed!
  • Four DTs for the tennis data: which is the best
    choice?
  • We need a measure of purity of each node, as the
    leaves with only one class (yes or no) will not
    have to be split further
  • => at each step we can choose the attribute which
    produces the purest child nodes; such a measure of
    purity is called ...

12
Entropy
  • Entropy (also called information content)
    measures the homogeneity of a set of examples; it
    characterizes the impurity (disorder, randomness)
    of a collection of examples wrt their
    classification: high entropy = high disorder
  • It is a standard measure used in signal
    compression, information theory and physics
  • Entropy of data set S:
    Entropy(S) = -SUM_i p_i log2(p_i)
  • p_i - proportion of examples that belong to
    class i
  • The smaller the entropy, the greater the purity
    of the set
  • Tennis data - 9 yes / 5 no examples => the
    entropy of the tennis data set S relative to the
    classification is
    Entropy(S) = -(9/14)log2(9/14) - (5/14)log2(5/14)
    = 0.940 bits
  • log to the base 2 => information is measured in
    bits (see the code check below)
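A small self-contained check of the formula and the 0.940-bit figure
(assuming Python; the function name entropy is an assumption):

```python
from math import log2

def entropy(class_counts):
    """Entropy(S) = -SUM_i p_i log2(p_i), p_i = proportion of class i."""
    total = sum(class_counts)
    return -sum((c / total) * log2(c / total)
                for c in class_counts if c > 0)

print(entropy([9, 5]))   # tennis data, 9 yes / 5 no -> ~0.940 bits
```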

13
Range of the Entropy for Binary Classification
  • p - proportion of positive examples in the data
    set; for binary classification
    Entropy(S) = -p log2(p) - (1-p) log2(1-p)
  • note that
  • Entropy = 0 => all members of S belong to the same
    class (no disorder)
  • Entropy = 1 => equal numbers of yes and no (entropy
    is maximized; S is as disordered as it can be)
  • Entropy in (0,1) if S contains unequal numbers of
    yes and no examples

14
Another Interpretation of Entropy - Example (from
http://www.cs.cmu.edu/awm/tutorials)
  • Suppose that X is a random variable with 4
    possible values A, B, C and D, and
    P(X=A)=P(X=B)=P(X=C)=P(X=D)=1/4
  • You must transmit a sequence of X values over a
    serial binary channel. You can encode each symbol
    with 2 bits, e.g. A=00, B=01, C=10 and D=11
  • ABBBCADBADCB
  • 000101011000110100111001
  • Now you are told that the probabilities are not
    equal, e.g. P(X=A)=1/2, P(X=B)=1/4,
    P(X=C)=P(X=D)=1/8
  • Can you invent a coding that uses less than 2
    bits on average per symbol, e.g. 1.75 bits?
  • A=? C=?
  • B=? D=?

A=0, B=10, C=110, D=111
  • What is the smallest possible number of bits per
    symbol? (see the check below)
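A quick check (not from the slides) that the variable-length code above
averages 1.75 bits per symbol, which matches the entropy of the
distribution, i.e. the smallest possible average:

```python
from math import log2

probs = {"A": 1/2, "B": 1/4, "C": 1/8, "D": 1/8}
code  = {"A": "0", "B": "10", "C": "110", "D": "111"}

avg_len = sum(p * len(code[s]) for s, p in probs.items())
H = -sum(p * log2(p) for p in probs.values())
print(avg_len, H)   # both 1.75 -> this code reaches the entropy bound
```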

15
Another Interpretation of Entropy - General Case
(from http://www.cs.cmu.edu/awm/tutorials)
  • Suppose that X is a random variable with m
    possible values V1, ..., Vm, and P(X=V1)=p1,
    P(X=V2)=p2, ..., P(X=Vm)=pm
  • What is the smallest possible number of bits per
    symbol on average needed to transmit a stream of
    symbols drawn from X's distribution?
    Answer: the entropy H(X) = -SUM_i p_i log2(p_i)
  • High entropy: the values of X are all over the
    place
  • The histogram of the frequency distribution of
    values of X will be flat
  • Low entropy: the values of X are more
    predictable
  • The histogram of the frequency distribution of
    values of X will have many lows and one or two
    highs

16
Information Theory
  • Information theory, Shannon and Weaver 1949
  • Given a set of possible answers (messages)
    M = {m1, m2, ..., mn} and a probability P(mi) for
    the occurrence of each answer, the expected
    information content (entropy) of the actual
    answer is
    I(M) = -SUM_i P(mi) log2(P(mi))
  • It shows the amount of surprise of the receiver at
    the answer, based on the probability of the
    answers (i.e. based on the prior knowledge of the
    receiver about the possible answers)
  • The less the receiver knows, the more information
    is provided (the more informative the answer is)

17
Information Content - Examples
  • Example 1: Flipping a coin
  • Case 1: Flipping an honest coin
  • Case 2: Flipping a rigged coin so that it will
    come up heads 75% of the time
  • In which case will an answer telling the outcome
    of a toss contain more information?
  • Example 2: Basketball game outcome
  • Two teams A and B are playing basketball
  • Case 1: Equal probability to win
  • Case 2: Michael Jordan is playing for A and the
    probability of A winning is 90%
  • In which case will an answer telling the outcome
    of the game contain more information?

18
Information Content - Solutions
  • Example 1: Flipping a coin
  • Case 1: Entropy(coin_toss) = I(1/2, 1/2)
    = -1/2 log2(1/2) - 1/2 log2(1/2) = 1 bit
  • Case 2: Entropy(coin_toss) = I(1/4, 3/4)
    = -1/4 log2(1/4) - 3/4 log2(3/4) = 0.811 bits
  • Example 2: Basketball game outcome
  • Case 1: Entropy(game_outcome) = I(1/2, 1/2) = 1
    bit
  • Case 2: Entropy(game_outcome) = I(90/100, 10/100)
    = -90/100 log2(90/100) - 10/100 log2(10/100)
    = -0.9 log2(0.9) - 0.1 log2(0.1)
    = 0.469 bits < 1 bit

19
Information Gain
  • Back to DTs
  • Entropy measures
  • the disorder of a collection of training examples
    with respect to their classification
  • the smallest possible number of bits per symbol
    on average needed to transmit a stream of symbols
    drawn from X's distribution
  • the amount of surprise of the receiver at the
    answer, based on the probability of the answers
  • We can use it to define a measure of the
    effectiveness of an attribute in classifying the
    training data: Information gain
  • Information gain is the expected reduction in
    entropy caused by the partitioning of the set of
    examples using that attribute

20
DT Learning as a Search
  • DT algorithm searches the hypothesis space for a
    hypothesis that fits the training data
  • What does the hypothesis space consist of?
  • What is the search strategy?
  • simple to complex search (starting with an empty
    tree and progressively considering more elaborate
    hypotheses)
  • hill climbing with information gain as evaluation
    function
  • Information gain is an evaluation function of how
    good the current state is (how close we are to
    the goal state, i.e. the tree that classifies
    correctly all training examples)

21
Hill Climbing - Revision
  • Hill climbing using the h-cost as an evaluation
    function
  • Expanded nodes: A, B, G, L
  • Solution path: A ... L

22
Information Gain - Definition
  • Information gain is the expected reduction in
    entropy caused by partitioning the set of
    examples using that attribute
  • Gain(S,A) is the information gain of an attribute
    A relative to S:
    Gain(S,A) = Entropy(S)
      - SUM_{v in Values(A)} (|Sv|/|S|) Entropy(Sv)
  • Values(A) is the set of all possible values for A
  • Sv is the subset of S for which A has value v

The first term is the entropy of the original data set
S; the second term (also called the Remainder) is the
expected value of the entropy after S is partitioned by
A (it is the sum of the entropies of each subset Sv,
weighted by the fraction of examples that belong to
Sv). A code sketch follows below.
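A sketch of Gain(S,A) as defined above (Python; examples are lists of
dicts with a "class" entry as in the earlier ID3 sketch; the names
entropy_of and info_gain are assumptions); it could be plugged in as the
select heuristic of that sketch:

```python
from collections import Counter
from math import log2

def entropy_of(examples):
    """Entropy of a set of examples wrt the "class" attribute."""
    counts = Counter(e["class"] for e in examples)
    total = len(examples)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def info_gain(examples, attribute):
    """Gain(S,A) = Entropy(S) - SUM_v (|Sv|/|S|) Entropy(Sv)."""
    remainder = 0.0
    for value in set(e[attribute] for e in examples):
        subset = [e for e in examples if e[attribute] == value]
        remainder += len(subset) / len(examples) * entropy_of(subset)
    return entropy_of(examples) - remainder
```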
23
More on Information Gain First Term
  • Information gain is an evaluation function of how
    good the current state is (how close we are to
    the goal state, i.e. the tree that classifies
    correctly all training examples)
  • Before any attribute is tested, an estimate of
    this is given by the entropy of the original data
    set S
  • S contains p positive and n negative examples;
    entropy(S) = I(p/(p+n), n/(p+n))
  • e.g. tennis data: 9 yes and 5 no at the
    beginning => entropy(S) = I(9/14, 5/14) = 0.940 bits
  • an answer telling the class of a randomly
    selected example will contain 0.940 bits

24
More on Information Gain Second Term
  • After a test on a single attribute A we can
    estimate how much information is still needed to
    classify an example
  • - A divides the training set S into subsets Sv
  • Sv is the subset of S for which A has value v
  • - each subset Sv has pv positive and nv negative
    examples
  • - if we go along that branch we will need in
    addition entropy(Sv) = I(pv, nv) bits to answer the
    question
  • - a random example has the v-th value for A with
    probability (pv + nv) / (p + n)
  • => on average, after testing A, the number of
    bits we will need to classify the example is
    Remainder(A) = SUM_v [(pv + nv) / (p + n)] I(pv, nv)

25
Computing the Information Gain
  • split based on outlook (worked out below)
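The computation shown on this slide was a figure; as a worked substitute,
assuming the standard tennis class counts per outlook value (sunny 2 yes /
3 no, overcast 4 / 0, rainy 3 / 2), a self-contained check should
reproduce the 0.247 bits quoted later for outlook:

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# class counts (yes, no) per outlook value in the tennis data
S = entropy([9, 5])                      # whole set: 0.940 bits
remainder = (5/14) * entropy([2, 3]) \
          + (4/14) * entropy([4, 0]) \
          + (5/14) * entropy([3, 2])     # expected entropy after the split
print(S - remainder)                     # Gain(S, outlook) ~ 0.247 bits
```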

26
Computing the Information Gain cont.
27
Continuing to Split
Gain(S, temperature) = 0.571 bits
Gain(S, humidity) = 0.971 bits
Gain(S, windy) = 0.020 bits
Final DT
28
DT Decision Boundary
Example taken from 6.034 AI, MIT
  • DTs define a decision boundary in the feature
    space
  • For a binary DT

2 attributes: R = ratio of earnings to
expenses; L = number of late payments on credit
cards over the past year
29
1NN Decision Boundary
Example taken from 6.034 AI, MIT
  • What is the decision boundary of 1-NN algorithm?
  • The space can be divided into regions that are
    closer to each given data point than to the
    others: a Voronoi partitioning of the space
  • In 1-NN a hypothesis is represented by the edges
    in the Voronoi space that separate the points of
    the two classes

30
Overfitting
  • ID3 typically grows each branch of the tree
    deeply enough to perfectly classify the training
    examples
  • but difficulties occur when there is
  • noise in data
  • too small a training set - cannot produce a
    representative sample of the target function
  • gt ID3 can produce DTs that overfit the training
    examples
  • More formal definition of overfitting:
  • given H - a hypothesis space, a hypothesis h in H,
  • D - the entire distribution of instances, train -
    the training instances
  • h is said to overfit the training data if there
    exists some alternative hypothesis h' in H such
    that
  • error_train(h) < error_train(h') and
    error_D(h) > error_D(h')

31
Overfitting
32
Overfitting - Example
  • How can it be possible for tree h to fit the
    training examples better than h' but to perform
    worse over subsequent examples?
  • Example: noise in the labeling of a training
    instance
  • - adding to the original tennis data the
    following positive example that is incorrectly
    labeled as negative: outlook=sunny,
    temperature=hot, humidity=normal, windy=yes,
    playTennis=no

33
Overfitting - cont.
  • Overfitting is a problem not only for DTs
  • Tree pruning is used to avoid overfitting in DTs
  • pre-pruning - stop growing the tree earlier,
    before it reaches the point where it perfectly
    classifies the training data
  • post-pruning - fully grow the tree (allowing it
    to overfit the data) and then post-prune it (more
    successful in practice)
  • Tree post-pruning
  • sub-tree replacement
  • sub-tree raising
  • Rule post-pruning (convert the tree into a set of
    rules and prune them)

34
When to Stop Pruning?
  • How to determine when to stop pruning?
  • Solution estimate the error rate using
  • validation set
  • training data - pessimistic error estimate based
    on training data (heuristic based on some
    statistical reasoning but the statistical
    underpinning is rather weak)

35
Error Rate Estimation Using Validation Set
  • Available data is separated into 3 sets of
    examples
  • training set - used to form the learned model
  • validation set - used to evaluate the impact of
    pruning and decide when to stop
  • test data to evaluate how good the final tree is
  • Motivation
  • even though the learner may be misled by random
    errors and coincidental regularities within the
    training set, the validation set is unlikely to
    exhibit the same random fluctuations gt the
    validation set can provide a safety check against
    overfitting of the training set
  • the validation set should be large enough;
    typically 1/2 of the available examples are used
    as the training set, 1/4 as the validation set and
    1/4 as the test set
  • Disadvantage the tree is based on less data
  • when the data is limited, withholding part of it
    for validation reduces even further the examples
    available for training

36
Tree Post-pruning by Sub-Tree Replacement
  • Each node is considered as a candidate for
    pruning
  • Start from the leaves and work toward the
    root
  • Typical error estimate - validation set
  • Pruning a node
  • remove the sub-tree rooted at that node
  • make it a leaf and assign the most common label
    of the training examples affiliated with that
    node
  • Nodes are removed only if the resulting pruned
    tree performs no worse than the original tree
    over the validation set
  • => any leaf added due to false regularities in
    the training set is likely to be pruned, as these
    coincidences are unlikely to occur in the
    validation set
  • Nodes are pruned iteratively, always choosing the
    node whose removal most increases the tree's
    accuracy on the validation set
  • Continue until further pruning is harmful, i.e. it
    decreases the accuracy of the tree over the
    validation set (a sketch of this loop follows
    below)
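A sketch of this sub-tree replacement (reduced-error pruning) loop,
assuming the nested-dict tree and classify representation from the earlier
sketches; the helper names accuracy, reaches, candidates and prune are
assumptions, and the bookkeeping is illustrative rather than C4.5's exact
procedure:

```python
from collections import Counter

def classify(tree, instance):
    """Walk the nested-dict tree (as in the earlier sketches) to a leaf."""
    while isinstance(tree, dict):
        tree = tree["branches"][instance[tree["attribute"]]]
    return tree

def accuracy(tree, examples):
    return sum(classify(tree, e) == e["class"] for e in examples) / len(examples)

def reaches(root, node, instance):
    """Does the classification path of `instance` pass through `node`?"""
    current = root
    while isinstance(current, dict):
        if current is node:
            return True
        current = current["branches"][instance[current["attribute"]]]
    return False

def candidates(tree):
    """Yield (parent, branch_value, child) for every internal child node."""
    for value, child in tree["branches"].items():
        if isinstance(child, dict):
            yield tree, value, child
            yield from candidates(child)

def prune(root, training, validation):
    """Sub-tree replacement: iteratively apply the prune that most increases
    validation accuracy; keep a prune only if the pruned tree performs no
    worse than the current tree on the validation set."""
    while True:
        base = accuracy(root, validation)
        best = None
        for parent, value, child in list(candidates(root)):
            affiliated = [e for e in training if reaches(root, child, e)]
            if not affiliated:
                continue
            # leaf label = majority class of training examples at that node
            label = Counter(e["class"] for e in affiliated).most_common(1)[0][0]
            parent["branches"][value] = label        # tentatively prune
            acc = accuracy(root, validation)
            parent["branches"][value] = child        # undo
            if acc >= base and (best is None or acc > best[0]):
                best = (acc, parent, value, label)
        if best is None:                             # no prune helps -> stop
            break
        _, parent, value, label = best
        parent["branches"][value] = label            # apply the best prune
    return root
```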

37
Sub-Tree Replacement - Example
38
Effect of tree pruning by sub-tree replacement
  • The accuracy on test data increases as nodes are
    pruned
  • (accuracy over validation set used for pruning
    is not shown)

39
Post-pruning Sub-Tree Raising
  • a more complex operation than sub-tree replacement
  • sub-tree raising is a potentially time-consuming
    operation => it is restricted to raising the
    sub-tree of the most popular branch
  • e.g. raise C only if the branch from B to C has
    more training examples than the branches from B
    to node 4 or from B to node 5; otherwise, if (for
    example) node 4 were the majority daughter of B,
    consider raising 4 to replace B and re-classifying
    all examples under C, as well as the examples from
    node 5, into the new node

40
Rule Post-Pruning - Example
  • Grow the tree until the training data is fit
  • Convert the tree into an equivalent set of rules
    by creating 1 rule for each path from the root
    to a leaf

if (outlook=sunny) AND (humidity=high) then
PlayTennis=No ...
  • Prune each rule by removing any preconditions
    whose removal improves its estimated accuracy
  • consider removing (outlook=sunny) and then
    (humidity=high)
  • select the pruning which produces the greatest
    improvement
  • do not prune if it reduces the estimated rule
    accuracy
  • Sort the pruned rules by their estimated
    accuracy, and consider them in this sequence when
    classifying subsequent instances
  • To estimate accuracy: 1) a validation set of
    examples, or 2) a pessimistic error estimate based
    on the training data set (C4.5); see the sketch
    below
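A sketch of the rule-pruning step, assuming a rule is a list of
(attribute, value) preconditions plus a class label and that accuracy is
estimated on a validation set (option 1 above); the names rule_accuracy
and prune_rule are assumptions:

```python
def rule_accuracy(preconditions, label, examples):
    """Estimated accuracy of a rule: among the examples it covers, the
    fraction whose class equals the rule's label."""
    covered = [e for e in examples
               if all(e[a] == v for a, v in preconditions)]
    if not covered:
        return 0.0
    return sum(e["class"] == label for e in covered) / len(covered)

def prune_rule(preconditions, label, validation):
    """Greedily drop the precondition whose removal gives the greatest
    improvement in estimated accuracy; stop when no removal improves it."""
    best_acc = rule_accuracy(preconditions, label, validation)
    while preconditions:
        trials = [(rule_accuracy([p for p in preconditions if p is not q],
                                 label, validation), q)
                  for q in preconditions]
        acc, worst = max(trials, key=lambda t: t[0])
        if acc <= best_acc:      # no pruning unless it improves accuracy
            break
        best_acc = acc
        preconditions = [p for p in preconditions if p is not worst]
    return preconditions, best_acc

# e.g. the first rule from the tennis tree:
rule = [("outlook", "sunny"), ("humidity", "high")]
# pruned, acc = prune_rule(rule, "no", validation_examples)  # hypothetical set
```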

41
Rule Post-Pruning - cont.
  • Why convert DT to rules before pruning?
  • Greater flexibility
  • when trees are pruned there are only 2 choices -
    remove the node completely or retain it
  • when rules are pruned there are fewer restrictions
  • preconditions (not nodes) are removed
  • each branch in the tree (i.e. each rule) is
    treated separately
  • removes the distinction between attribute tests
    that occur near the root of the tree and those
    near the leaves
  • Advantage of rules over trees - easier to read
    rules than a tree

42
Numeric Attributes
  • ID3 works only when all the attributes are
    nominal, but most real data sets contain numeric
    attributes => need for discretization
  • for a numerical attribute we restrict the
    possibilities to a binary split (e.g. temp < 60)
  • difference to nominal attributes: every numerical
    attribute offers many possible split points
  • The solution is a straightforward extension
  • sort the examples according to the values of the
    attribute
  • identify adjacent examples that differ in their
    target classification and generate a set of
    candidate splits (split points are placed
    halfway between the two values)
  • evaluate Gain (or another measure) for every
    candidate split point and choose the best split
    point
  • the Gain for the best split point is the Gain for
    the attribute (see the sketch below)
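A sketch of this procedure (Python; the name numeric_split_gain and the
parallel value/label lists are assumptions): sort the examples, place
candidate thresholds halfway between adjacent examples whose classes
differ, and score each candidate with information gain:

```python
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum(c / total * log2(c / total)
                for c in Counter(labels).values())

def numeric_split_gain(values, labels):
    """Return (best_gain, best_threshold) for a binary split value <= t."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best_gain, best_t = 0.0, None
    for (v1, c1), (v2, c2) in zip(pairs, pairs[1:]):
        if c1 == c2 or v1 == v2:
            continue                    # candidates only where the class
        t = (v1 + v2) / 2               # changes; threshold placed halfway
        left = [c for v, c in pairs if v <= t]
        right = [c for v, c in pairs if v > t]
        gain = base - (len(left) / len(pairs)) * entropy(left) \
                    - (len(right) / len(pairs)) * entropy(right)
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_gain, best_t

# e.g. the temperature values from the next slide:
temps  = [64, 65, 68, 69, 70, 71, 72, 73, 74, 75, 80, 81, 83, 85]
labels = ["yes", "no", "yes", "yes", "yes", "no", "no", "no", "yes",
          "yes", "no", "yes", "yes", "no"]
print(numeric_split_gain(temps, labels))
```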

43
Numeric Attributes - example
  • values of temperature:
    64  65  68  69  70  71  72  73  74  75  80  81  83  85
    yes no  yes yes yes no  no  no  yes yes no  yes yes no
  • 7 possible splits; consider the split between 70
    and 71
  • Information gain for: 1) temperature < 70.5:
    4 yes / 1 no
  • 2) temperature > 70.5: 4 yes / 5 no

44
Alternative Measures for Selecting Attributes
  • Problem: if an attribute is highly-branching
    (with a large number of values), Information gain
    will select it!
  • imagine using an ID code (extreme case): the
    training examples will be separated into many
    very small subsets
  • => highly-branching attributes are more likely to
    create pure subsets
  • Information gain is biased towards choosing
    attributes with a large number of values
  • this will result in overfitting

45
Highly-Branching Attributes - Example
46
Highly-Branching Attributes - Example cont.
  • split based on IDcode: each subset contains a
    single example => the entropy of every subset is 0
  • the weighted sum of entropies = 0
  • entropy at the root = 0.940 bits
  • Gain(S, IDcode) = 0.940 bits, the maximum possible
    gain

47
Gain Ratio
  • Gain ratio: a modification of the Gain that
    reduces its bias towards highly-branching
    attributes
  • it takes the number and size of branches into
    account when choosing an attribute
  • it penalizes highly-branching attributes by
    incorporating SplitInformation
  • SplitInformation is the entropy of S wrt the
    values of A:
    SplitInformation(S,A) =
      -SUM_{v in Values(A)} (|Sv|/|S|) log2(|Sv|/|S|)
  • Gain ratio:
    GainRatio(S,A) = Gain(S,A) / SplitInformation(S,A)
    (a code sketch follows below)
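A sketch of these two quantities, reusing the info_gain function from the
Information Gain sketch earlier; the names split_information and
gain_ratio are assumptions:

```python
from collections import Counter
from math import log2

def split_information(examples, attribute):
    """Entropy of S with respect to the values of A (not the class)."""
    total = len(examples)
    counts = Counter(e[attribute] for e in examples)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def gain_ratio(examples, attribute):
    """GainRatio(S,A) = Gain(S,A) / SplitInformation(S,A)."""
    si = split_information(examples, attribute)
    # info_gain is the sketch from the Information Gain slide above
    return info_gain(examples, attribute) / si if si > 0 else 0.0
```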

48
Gain Ratio
  • Gain ratios for the tennis data:
  • outlook - Gain=0.247, SplitInformation=1.577,
    GainRatio=0.156
  • temperature - Gain=0.029, SplitInformation=1.362,
    GainRatio=0.021
  • humidity - Gain=0.152, SplitInformation=1.000,
    GainRatio=0.152
  • windy - Gain=0.048, SplitInformation=0.985,
    GainRatio=0.049
  • => outlook still comes out on top, but humidity is
    now much closer, as it splits the data into 2
    subsets instead of 3
  • however, IDcode would still be preferred (although
    its advantage is greatly reduced)

49
Gain Ratio - Problem
  • Problem with GainRatio: it may overcompensate
  • it may choose an attribute just because its
    SplitInformation is much lower than that of the
    other attributes
  • standard fix: only consider attributes with
    greater than average Gain (over all the
    attributes examined)

50
Handling Examples with Missing Values
  • Missing attribute values in the training data
  • (x, class(x)) is a training example in S
  • the value of attribute A for example x, A(x), is
    unknown
  • When building the DT, what to do with the missing
    attribute value A(x)?
  • Gain(S,A) has to be calculated at node n to
    evaluate whether to split on A
  • 1) treat missing values as simply another possible
    value of the attribute; this assumes that the
    absence of a value is significant
  • 2) ignore all instances with missing attribute
    values - a tempting solution! But:
  • instances with missing values often provide a
    good deal of information
  • sometimes the attributes whose values are missing
    play no part in the decision, in which case these
    instances are as good as any other

51
Handling Examples with Missing Values - 2
  • 3) A(x) = the most common value of A among the
    training examples at n
  • 4) A(x) = the most common value of A among the
    training examples at n with the same class as x
    (class(x))

52
Handling Examples with Missing Values - 3
  • a more sophisticated solution (used in C4.5):
  • assign a probability to each of the possible
    values of A; calculate these probabilities using
    the frequencies of the values of A among the
    examples at n
  • example: A is the Boolean attribute wind
  • instance x has a missing value for wind
  • node n contains 6 examples with wind=true and
    4 with wind=false
  • => P(A(x)=true)=0.6, P(A(x)=false)=0.4
  • 0.6 of instance x is distributed down the
    branch for wind=true and
  • 0.4 of instance x down the branch for
    wind=false
  • these fractional examples are used to compute
    Gain and can be further subdivided at subsequent
    branches of the tree if another missing attribute
    value must be tested
  • the same fractioning strategy can be used for
    classification of new instances with missing
    values (see the sketch below)
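A minimal sketch of the fractional-instance idea for classifying a new
instance with missing values, assuming the nested-dict tree from the
earlier sketches plus a per-node "counts" entry holding the training
counts of each branch (an assumption; C4.5's bookkeeping differs in
detail):

```python
from collections import Counter

def classify_fractional(tree, instance, weight=1.0):
    """Return a Counter of class -> probability mass for `instance`.
    When the tested attribute value is missing, the instance is split
    across all branches in proportion to the training counts at the node."""
    if not isinstance(tree, dict):               # leaf: all mass to its label
        return Counter({tree: weight})
    attr = tree["attribute"]
    value = instance.get(attr)
    result = Counter()
    if value is not None:                        # value known: follow one branch
        result += classify_fractional(tree["branches"][value], instance, weight)
    else:                                        # value missing: fraction the
        total = sum(tree["counts"].values())     # instance across the branches
        for v, child in tree["branches"].items():
            frac = tree["counts"][v] / total
            result += classify_fractional(child, instance, weight * frac)
    return result

# e.g. a node testing "wind" with 6 training examples for true, 4 for false:
node = {"attribute": "wind",
        "counts": {"true": 6, "false": 4},
        "branches": {"true": "no", "false": "yes"}}
print(classify_fractional(node, {}))  # -> Counter({'no': 0.6, 'yes': 0.4})
```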

53
Handling Attributes with Different (External)
Costs
  • Consider medical diagnosis and the following
    attributes:
  • temperature, biopsyResult, pulse, bloodTestResult
  • attributes vary significantly in their costs
    (monetary cost and patient comfort)
  • prefer DTs that use low-cost attributes where
    possible, relying on high-cost attributes only
    when needed to produce a reliable classification
  • How to learn a consistent tree with low expected
    cost?
  • One approach: favour low-cost attributes by
    replacing Gain with a cost-sensitive measure
  • Tan and Schlimmer (1990): Gain^2(S,A) / Cost(A)
  • Nunez (1988): (2^Gain(S,A) - 1) / (Cost(A) + 1)^w,
    where w in [0, 1] determines the importance of
    cost (a code sketch follows below)
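A small sketch of the two cost-sensitive measures above (the function
names tan_schlimmer and nunez are assumptions; info_gain is reused from
the Information Gain sketch, and cost is supplied per attribute):

```python
def tan_schlimmer(examples, attribute, cost):
    """Tan and Schlimmer (1990): Gain^2(S,A) / Cost(A)."""
    # info_gain is the sketch from the Information Gain slide above
    return info_gain(examples, attribute) ** 2 / cost

def nunez(examples, attribute, cost, w=0.5):
    """Nunez (1988): (2^Gain(S,A) - 1) / (Cost(A) + 1)^w, w in [0, 1]."""
    return (2 ** info_gain(examples, attribute) - 1) / (cost + 1) ** w
```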



54
Components of DT
  • Model (structure)
  • Tree (not pre-specified but derived from data)
  • Preference (score function): criteria used to
    measure the quality of the tree structures
  • number of misclassifications over all examples
    (loss function)
  • Search method (how the data is searched by the
    algorithm)
  • hill climbing search over tree structures (2
    phases: grow and prune)

55
DTs - Summary
  • Very popular ML technique
  • Easy to implement
  • Efficient
  • Cost of building the tree: O(mn log n), where
    n = number of instances and m = number of
    attributes
  • Cost of pruning the tree with sub-tree
    replacement: O(n)
  • Cost of pruning by sub-tree lifting: O(n (log n)^2)
  • => total cost of tree induction:
    O(mn log n) + O(n (log n)^2)
  • Reference: Witten and Frank, pp. 167-168
  • The resulting hypothesis is easy to interpret by
    humans