Transcript and Presenter's Notes

Title: Decision Tree Learning


1
Decision Tree Learning
  • Machine Learning, T. Mitchell
  • Chapter 3

2
Decision Trees
  • One of the most widely used and practical methods for inductive inference
  • Approximates discrete-valued functions (including disjunctions)
  • Can be used for classification (most common) or regression problems

3
Decision Tree for PlayTennis
4
Decision Tree
  • If (O = Sunny AND H = Normal) OR (O = Overcast) OR (O = Rain AND W = Weak)
  • then YES
  • A disjunction of conjunctions of constraints on attribute values
  • Larger hypothesis space than Candidate-Elimination (see the sketch below)
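As an illustration (my addition, not from the slides), the rule above can be written directly as a Boolean expression; the full attribute names are assumed from the PlayTennis example:

    # A minimal sketch of the PlayTennis rule as a disjunction of conjunctions.
    # Attribute names and the test instance are assumptions for illustration.
    def play_tennis(outlook, humidity, wind):
        """Return 'Yes' if the decision tree classifies the day as positive."""
        return "Yes" if (
            (outlook == "Sunny" and humidity == "Normal")
            or outlook == "Overcast"
            or (outlook == "Rain" and wind == "Weak")
        ) else "No"

    print(play_tennis("Sunny", "Normal", "Strong"))  # -> Yes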

5
Decision tree representation
  • Each internal node corresponds to a test
  • Each branch corresponds to a result of the test
  • Each leaf node assigns a classification
  • Once the tree is trained, a new instance is
    classified by starting at the root and following
    the path as dictated by the test results for this
    instance.
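A minimal sketch (my addition, not from the slides) of this root-to-leaf traversal, assuming a simple nested-dict tree representation:

    # Internal nodes are dicts {"attribute": ..., "branches": {...}}; leaves are class labels.
    def classify(node, instance):
        """Follow the test results from the root until a leaf label is reached."""
        while isinstance(node, dict):
            value = instance[node["attribute"]]
            node = node["branches"][value]     # follow the branch for this test result
        return node                            # leaf: a class label

    tree = {"attribute": "Outlook",
            "branches": {"Sunny": {"attribute": "Humidity",
                                   "branches": {"High": "No", "Normal": "Yes"}},
                         "Overcast": "Yes",
                         "Rain": {"attribute": "Wind",
                                  "branches": {"Strong": "No", "Weak": "Yes"}}}}
    print(classify(tree, {"Outlook": "Rain", "Wind": "Weak"}))  # -> Yes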

6
Decision Regions
7
Divide and Conquer
  • Internal decision nodes
  • Univariate: uses a single attribute, x_i
  • Discrete x_i: n-way split for n possible values
  • Continuous x_i: binary split, x_i > w_m
  • Multivariate: uses more than one attribute
  • Leaves
  • Classification: class labels, or proportions
  • Regression: numeric value r (average, or local fit)
  • Learning is greedy: find the best split recursively (Breiman et al., 1984; Quinlan, 1986, 1993)

8
  • If the decisions are binary, then in the best case, each decision eliminates half of the regions (leaves).
  • If there are b regions, the correct region can be found in log2(b) decisions, in the best case.
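For example (my arithmetic, not from the slides):

    import math

    # With b regions and ideal binary splits, about log2(b) decisions isolate the correct region.
    b = 1024
    print(math.ceil(math.log2(b)))  # -> 10 decisions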

9
Multivariate Trees
10
Expressiveness
  • A decision tree can represent a disjunction of conjunctions of constraints on the attribute values of instances.
  • Each path corresponds to a conjunction
  • The tree itself corresponds to a disjunction
  • How expressive is this representation?
  • How would we represent
  • (A AND B) OR C
  • M of N
  • A XOR B

11
Decision tree learning algorithm
  • For a given training set, there are many trees
    that code it without any error
  • Finding the smallest tree is NP-complete (Quinlan
    1986), hence we are forced to use some (local)
    search algorithm to find reasonable solutions

12
The basic decision tree learning algorithm
  • A decision tree can be constructed by considering
    attributes of instances one by one.
  • Which attribute should be considered first?
  • The height of a decision tree depends on the
    order attributes are considered.

13
Top-Down Induction of Decision Trees
14
(No Transcript)
15
Entropy
  • Measure of uncertainty
  • Expected number of bits to resolve uncertainty
  • Entropy measures the information amount in a
    message
  • High school form example

16
Entropy
  • Important quantity in
  • coding theory
  • statistical physics
  • machine learning

17
Entropy
  • Coding theory: x discrete with 8 possible states
  • How many bits are needed to transmit the state of x?
  • All states equally likely

18
(No Transcript)
19
  • Entropy measures the impurity of S
  • Entropy(S) = -p log2(p) - (1-p) log2(1-p)
  • (Here p = p_positive and 1-p = p_negative from the previous slide)

20
Entropy
  • Suppose Pr(X = 0) = 1/8
  • If other events are all equally likely, the number of events is 8.
  • To indicate one out of so many events, one needs log2(8) = 3 bits.
  • Consider a binary random variable X s.t. Pr(X = 0) = 0.1.
  • The expected number of bits: -0.1 log2(0.1) - 0.9 log2(0.9) ≈ 0.47
  • In general, if a random variable X has c values with probabilities p_1, ..., p_c
  • The expected number of bits: H(X) = -Σ_i p_i log2(p_i)
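A small Python check (my addition, not in the slides) of the two bit counts above:

    import math

    def entropy(probs):
        """Expected number of bits: -sum_i p_i * log2(p_i)."""
        return -sum(p * math.log2(p) for p in probs if p > 0)

    print(entropy([1/8] * 8))    # 8 equally likely states -> 3.0 bits
    print(entropy([0.1, 0.9]))   # the Pr(X = 0) = 0.1 example -> ~0.47 bits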

21
Entropy
  • What if we have the following distribution for x?
  • In order to save on transmission costs, we would design codes that reflect this distribution

22
Entropy
23
Use of Entropy in Choosing the Next Attribute
24
(No Transcript)
25
Other measures of impurity
  • Entropy is not the only measure of impurity. If a
    function satisfies certain criteria, it can be
    used as a measure of impurity.
  • Gini index: 2p(1-p)
  • Misclassification error: 1 - max(p, 1-p)
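A quick sketch (my addition) comparing the three impurity measures for a binary node with positive fraction p; all three are zero for pure nodes and maximal at p = 0.5:

    import math

    def entropy(p):
        return 0.0 if p in (0, 1) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

    def gini(p):
        return 2 * p * (1 - p)

    def misclassification(p):
        return 1 - max(p, 1 - p)

    for p in (0.0, 0.1, 0.5, 0.9):
        print(f"p={p}: entropy={entropy(p):.3f}, gini={gini(p):.3f}, error={misclassification(p):.3f}")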

26
Training Examples
27
Selecting the Next Attribute
28
Selecting the Next Attribute
  • Computing the information gain for each attribute, we selected the Outlook attribute as the first test, resulting in the following partially learned tree (a sketch of the gain computation follows)
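A rough sketch of the information-gain computation, Gain(S, A) = Entropy(S) - Σ_v (|S_v|/|S|) Entropy(S_v); the dict-based example format is an assumption, not the slides':

    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def information_gain(examples, labels, attribute):
        """Gain(S, A): entropy reduction from splitting S on attribute A."""
        n = len(labels)
        remainder = 0.0
        for value in set(e[attribute] for e in examples):
            subset = [l for e, l in zip(examples, labels) if e[attribute] == value]
            remainder += len(subset) / n * entropy(subset)
        return entropy(labels) - remainder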

29
Partially learned tree
30
  • Until stopped
  • Select one of the unused attributes to partition the remaining examples at each non-terminal node
  • using only the training samples associated with that node
  • Stopping criteria
  • each leaf node contains examples of one type
  • algorithm ran out of attributes
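A compact sketch of the recursive growing loop described above (my rendering, not ID3's exact pseudocode); it reuses the information_gain sketch from the previous slide:

    from collections import Counter

    def grow_tree(examples, labels, attributes):
        # Stopping criterion 1: all examples at this node have one label
        if len(set(labels)) == 1:
            return labels[0]
        # Stopping criterion 2: no attributes left -> majority label
        if not attributes:
            return Counter(labels).most_common(1)[0][0]
        best = max(attributes, key=lambda a: information_gain(examples, labels, a))
        node = {"attribute": best, "branches": {}}
        for value in set(e[best] for e in examples):
            sub = [(e, l) for e, l in zip(examples, labels) if e[best] == value]
            sub_examples, sub_labels = zip(*sub)
            remaining = [a for a in attributes if a != best]
            node["branches"][value] = grow_tree(list(sub_examples), list(sub_labels), remaining)
        return node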

31
(No Transcript)
32
Inductive Bias of ID3
33
Hypothesis Space Search by ID3
  • Hypothesis space is complete
  • every finite discrete function can be represented
    by a decision tree
  • Outputs a single hypothesis (which one?)
  • Can't play 20 questions...
  • No backtracking
  • Local minima...
  • Statistically-based search choices
  • Uses all available training samples

34
  • Note: H is the power set of instances X
  • Unbiased?
  • Preference for short trees, and for those with high information gain attributes near the root
  • Bias is a preference for some hypotheses, rather than a restriction of the hypothesis space H
  • Occam's razor: prefer the shortest hypothesis that fits the data

35
Occam's razor
  • Prefer the shortest hypothesis that fits the data
  • Occam 1320
  • While this idea is intuitive, it is difficult to prove formally.
  • Support 1
  • Shorter hypotheses have better generalization ability
  • Support 2
  • The number of short hypotheses is small, so it is less likely to be a coincidence if the data fits a short hypothesis
  • There are counterarguments: there are other hypothesis sets that are equally small, so why prefer the short ones over those?
  • Different internal representations may lead to hypotheses of different lengths
  • We will consider an optimal encoding

36
Overfitting
37
Overfitting in Decision Trees
  • Why over-fitting?
  • A model can become more complex than the true
    target function (concept) when it tries to
    satisfy noisy data as well.
  • Definition of overfitting
  • A hypothesis is said to overfit the training data if there exists some other hypothesis that has larger error over the training data but smaller error over the entire distribution of instances.

38
  • Consider adding the following training example, which is incorrectly labeled as negative:
  • Sky    Temp  Humidity  Wind    PlayTennis
  • Sunny  Hot   Normal    Strong  No
  • Or consider the Oranges and Tangerines with Size
    and Texture attributes and the orange that is
    misclassified as tangerine (I will add a figure
    later)

39
(No Transcript)
40
  • ID3 will make a new split and will classify future examples following the new path as negative.
  • The problem is due to overfitting the training data.
  • Overfitting may result due to
  • noise
  • coincidental regularities in the training data
  • What is the formal description of overfitting?

41
(No Transcript)
42
Curse of Dimensionality - A related concept
  • Imagine a learning task, such as recognizing printed characters.
  • Intuitively, adding more attributes would help the learner, as more information never hurts, right?
  • In fact, sometimes it does hurt, due to what is called the curse of dimensionality.

43
Curse of Dimensionality
44
Curse of Dimensionality
Polynomial curve fitting, M = 3
The number of independent coefficients grows proportionally to D^3, where D is the number of input variables. More generally, for an order-M polynomial it grows like D^M. The polynomial becomes unwieldy very quickly.
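A quick illustration (my addition): the number of monomials of degree at most M in D variables is C(D+M, M), which for fixed M grows on the order of D^M:

    from math import comb

    # Count of coefficients for an order-3 polynomial (M = 3) as the input dimension D grows.
    for D in (3, 10, 100):
        print(D, comb(D + 3, 3))   # 3 -> 20, 10 -> 286, 100 -> 176851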
45
Polynomial Curve Fitting
46
Sum-of-Squares Error Function
47
0th Order Polynomial
48
1st Order Polynomial
49
3rd Order Polynomial
50
9th Order Polynomial
51
Over-fitting
Root-Mean-Square (RMS) Error
52
Polynomial Coefficients
53
Data Set Size
9th Order Polynomial
54
Data Set Size
9th Order Polynomial
55
Regularization
  • Penalize large coefficient values
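A minimal sketch (my addition, with made-up data and λ) of penalizing large coefficient values via ridge-regularized least squares for the polynomial fit:

    import numpy as np

    # Minimize sum-of-squares error plus lam * ||w||^2 for a 9th-order polynomial.
    rng = np.random.default_rng(0)
    x = np.linspace(0, 1, 10)
    t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

    M, lam = 9, 1e-3                            # polynomial order and regularization strength
    Phi = np.vander(x, M + 1, increasing=True)  # design matrix of powers of x
    w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M + 1), Phi.T @ t)
    print(np.round(w, 2))                       # coefficients stay small compared to lam = 0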

56
Regularization
57
Regularization
58
Regularization vs.
59
Polynomial Coefficients
60
  • Although the curse of dimensionality is an important issue, we can still find effective techniques applicable to high-dimensional spaces
  • Real data will often be confined to a region of the space having lower effective dimensionality
  • example of planar objects on a conveyor belt
  • a 3-dimensional manifold within the high-dimensional picture-pixel space
  • Real data will typically exhibit smoothness properties

61
Back to Decision Trees
62
Overfitting in Decision Trees
63
Avoiding over-fitting the data
  • How can we avoid overfitting? There are 2
    approaches
  • stop growing the tree before it perfectly
    classifies the training data
  • grow full tree, then post-prune
  • Reduced error pruning
  • Rule post-pruning
  • the 2nd approach is found more useful in
    practice.

64
  • Whether we are pre- or post-pruning, the important question is how to select the best tree
  • Measure performance over a separate validation data set
  • Measure performance over the training data
  • apply a statistical test to see if expanding or pruning would produce an improvement beyond the training set (Quinlan 1986)
  • MDL: minimize size(tree) + size(misclassifications(tree))

65
  • MDL
  • length(h) + length(additional information to encode D given h)
  • length(h) + length(misclassifications)
  • since we only need to send a message when a data sample is not in agreement with h; hence, only for misclassifications.

66
Reduced-Error Pruning (Quinlan 1987)
  • Split data into training and validation set
  • Do until further pruning is harmful
  • 1. Evaluate the impact of pruning each possible node (plus those below it) on the validation set
  • 2. Greedily remove the one that most improves
    validation set accuracy
  • Produces smallest version of the (most accurate)
    tree
  • What if data is limited?
  • We would not want to separate a validation set.

67
Reduced error pruning
  • Examine each decision node to see if pruning it decreases the tree's performance over the evaluation data.
  • Pruning here means replacing a subtree with a leaf labeled with the most common classification in the subtree.
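A rough sketch of a single pruning step on the dict-based tree from the earlier classify sketch; taking the majority over the subtree's leaves is a simplification (my assumption) of "most common classification in the subtree":

    from collections import Counter

    def leaf_labels(node):
        if not isinstance(node, dict):
            return [node]
        return [l for child in node["branches"].values() for l in leaf_labels(child)]

    def accuracy(tree, examples, labels):
        return sum(classify(tree, e) == l for e, l in zip(examples, labels)) / len(labels)

    def try_prune(tree, parent, key, validation, labels):
        """Replace the subtree at parent['branches'][key] with a majority leaf unless accuracy drops."""
        subtree = parent["branches"][key]
        majority = Counter(leaf_labels(subtree)).most_common(1)[0][0]
        before = accuracy(tree, validation, labels)
        parent["branches"][key] = majority            # tentatively prune
        if accuracy(tree, validation, labels) < before:
            parent["branches"][key] = subtree         # revert if pruning hurts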

68
Rule post-pruning
  • Algorithm
  • Build a complete decision tree.
  • Convert the tree to set of rules.
  • Prune each rule
  • Remove any preconditions whose removal improves accuracy
  • Sort the pruned rules by accuracy and use them in
    that order.
  • Perhaps most frequently used method (e.g., in
    C4.5)
  • More details can be found in http://www2.cs.uregina.ca/~hamilton/courses/831/notes/ml/dtrees/4_dtrees3.html
  • (read only if interested, presentation of
    advanced decision tree algorithms such as this
    may be added as part of a class project)
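A sketch of the tree-to-rules step, reusing the dict-based tree from the earlier classify example; each root-to-leaf path becomes one rule:

    def tree_to_rules(node, conditions=()):
        """Return a list of (preconditions, class label) pairs, one per root-to-leaf path."""
        if not isinstance(node, dict):
            return [(conditions, node)]
        rules = []
        for value, child in node["branches"].items():
            rules += tree_to_rules(child, conditions + ((node["attribute"], value),))
        return rules

    for pre, label in tree_to_rules(tree):
        print(" AND ".join(f"{a} = {v}" for a, v in pre) or "TRUE", "->", label)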

69
(No Transcript)
70
  • IF (Outlook = Sunny) AND (Humidity = High)
  • THEN PlayTennis = No
  • IF (Outlook = Sunny) AND (Humidity = Normal)
  • THEN PlayTennis = Yes
  • . . .

71
Rule Extraction from Trees
C4.5Rules (Quinlan, 1993)
72
  • Converting a decision tree to rules before pruning has three main advantages
  • Converting to rules allows distinguishing among the different contexts in which a decision node is used.
  • Since each distinct path through the decision tree node produces a distinct rule, the pruning decision regarding that attribute test can be made differently for each path.
  • In contrast, if the tree itself were pruned, the only two choices would be
  • Remove the decision node completely, or
  • Retain it in its original form.
  • Converting to rules removes the distinction between attribute tests that occur near the root of the tree and those that occur near the leaves.
  • We thus avoid messy bookkeeping issues such as how to reorganize the tree if the root node is pruned while retaining part of the subtree below this test.
  • Converting to rules improves readability.
  • Rules are often easier for people to understand.

73
Rule Simplification Overview
  • Eliminate unnecessary rule antecedents to simplify the rules.
  • Construct contingency tables for each rule consisting of more than one antecedent.
  • Rules with only one antecedent cannot be further simplified, so we only consider those with two or more.
  • To simplify a rule, eliminate antecedents that have no effect on the conclusion reached by the rule.
  • A conclusion's independence from an antecedent is verified using a test for independence, which is
  • a chi-square test if the expected cell frequencies are greater than 10,
  • Yates' correction for continuity when the expected frequencies are between 5 and 10,
  • Fisher's exact test for expected frequencies less than 5.
  • Once individual rules have been simplified by eliminating redundant antecedents, simplify the entire set by eliminating unnecessary rules.
  • Attempt to replace those rules that share the most common consequent by a default rule that is triggered when no other rule is triggered.
  • In the event of a tie, use some heuristic tie breaker to choose a default rule.
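A sketch of the independence test described above, using scipy (my rendering; the counts are made up, and scipy's chi2_contingency applies Yates' correction to 2x2 tables by default, so the chi-square and Yates cases collapse into one call):

    from scipy.stats import chi2_contingency, fisher_exact

    # Rows: antecedent satisfied / not; columns: rule conclusion reached / not,
    # among training examples covered by the rest of the rule (illustrative counts).
    table = [[20, 5],
             [4, 1]]

    chi2, p, dof, expected = chi2_contingency(table)   # chi-square with Yates' correction
    if expected.min() < 5:
        _, p = fisher_exact(table)                     # Fisher's exact test for small expected counts
    print(p)   # a large p-value suggests the antecedent can be dropped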

74
Continuous Valued Attributes
  • Create a discrete attribute to test a continuous one
  • Temperature = 82.5
  • (Temperature > 72.3) = t, f
  • How do we find the threshold?
  • Temperature: 40  48  60  72  80  90
  • PlayTennis:  No  No  Yes Yes Yes No
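A sketch of the usual threshold search for this example (my addition): sort by the attribute, take midpoints between adjacent examples whose labels differ, then score each candidate threshold by information gain:

    temps  = [40, 48, 60, 72, 80, 90]
    labels = ["No", "No", "Yes", "Yes", "Yes", "No"]

    # Candidate thresholds: midpoints where the class label changes.
    candidates = [(a + b) / 2
                  for (a, la), (b, lb) in zip(zip(temps, labels), zip(temps[1:], labels[1:]))
                  if la != lb]
    print(candidates)  # -> [54.0, 85.0]; each would then be scored by information gain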

75
Incorporating continuous-valued attributes
  • Where to cut?

Continuous valued attribute
76
Split Information?
  • In each tree, the leaves contain samples of only one kind (e.g. 50+, 10+, 10-, etc.).
  • Hence, the remaining entropy is 0 in each one.
  • Which is better?
  • In terms of information gain
  • In terms of gain ratio

(Figure: two candidate splits of 100 examples, one using attribute A1 and one using A2; in each tree the leaves are pure, e.g. 50 positive, 50 negative, 10 positive, 10 negative.)
77
Attributes with Many Values
  • One way to penalize such attributes is to use the
    following alternative measure

(Formula: split information, the entropy of the attribute A, determined experimentally from the training samples.)
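The alternative measure referred to here is the gain ratio; the formula itself did not survive the transcript, but the standard definitions (Mitchell, Ch. 3) are:

    SplitInformation(S, A) = - Σ_{i=1..c} (|S_i| / |S|) log2(|S_i| / |S|)
    GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)

where S_1, ..., S_c are the subsets of S produced by the c values of attribute A.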
78
Handling training examples with missing attribute
values
  • What if an example x is missing the value of an attribute A?
  • Simple solution
  • Use the most common value among examples at node
    n.
  • Or use the most common value among examples at
    node n that have classification c(x)
  • More complex, probabilistic approach
  • Assign a probability to each of the possible
    values of A based on the observed frequencies of
    the various values of A
  • Then, propagate examples down the tree with these
    probabilities.
  • The same probabilities can be used in
    classification of new instances (used in C4.5)
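A minimal sketch (my addition) of the probabilistic approach: split an example with a missing value into fractional copies weighted by the observed value frequencies at the node:

    from collections import Counter

    def fractional_copies(example, attribute, examples_at_node):
        """Return (completed example, weight) pairs, one per observed value of the attribute."""
        counts = Counter(e[attribute] for e in examples_at_node if attribute in e)
        total = sum(counts.values())
        return [({**example, attribute: value}, count / total)
                for value, count in counts.items()]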

79
Handling attributes with differing costs
  • Sometimes, some attribute values are more expensive or difficult to obtain.
  • e.g., in medical diagnosis, BloodTest has cost $150
  • In practice, it may be desired to postpone
    acquisition of such attribute values until they
    become necessary.
  • To this purpose, one may modify the attribute
    selection measure to penalize expensive
    attributes.
  • Tan and Schlimmer (1990)
  • Nunez (1988)

80
C4.5
  • By Ross Quinlan
  • Latest code available at http://www.cse.unsw.edu.au/~quinlan/
  • How to use it?
  • Download it
  • Unpack it
  • Make it (make all)
  • Read accompanying manual files
  • groff -T ps c4.5.1 > c4.5.ps
  • Use it
  • c4.5: tree generator
  • c4.5rules: rule generator
  • consult: use a generated tree to classify an instance
  • consultr: use a generated set of rules to classify an instance

81
Model Selection in Trees
82
Strengths and Advantages of Decision Trees
  • Rule extraction from trees
  • A decision tree can be used for feature extraction (e.g. seeing which features are useful)
  • Interpretability: human experts may verify and/or discover patterns
  • It is a compact and fast classification method