Decision tree - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Decision tree


1
Decision tree
  • LING 572
  • Fei Xia
  • 1/10/06

2
Outline
  • Basic concepts
  • Main issues
  • Advanced topics

3
Basic concepts
4
A classification problem
District | House type    | Income | Previous customer | Outcome
---------|---------------|--------|-------------------|--------
Suburban | Detached      | High   | No                | Nothing
Suburban | Semi-detached | High   | Yes               | Respond
Rural    | Semi-detached | Low    | No                | Respond
Urban    | Detached      | Low    | Yes               | Nothing

5
Classification and estimation problems
  • Given
  • a finite set of (input) attributes (features)
  • Ex: District, House type, Income, Previous customer
  • a target attribute (the goal)
  • Ex: Outcome ∈ {Nothing, Respond}
  • training data: a set of classified examples in attribute-value representation
  • Ex: the previous table
  • Predict the value of the goal given the values of the input attributes
  • The goal is a discrete variable → classification problem
  • The goal is a continuous variable → estimation problem
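
As a concrete sketch, the table above can be represented in Python as a list of attribute-value dictionaries (the attribute and class names are taken from the slide's example; the variable names are my own):

```python
# Training data from the table above, in attribute-value representation.
TRAIN = [
    {"District": "Suburban", "House type": "Detached",      "Income": "High", "Previous customer": "No",  "Outcome": "Nothing"},
    {"District": "Suburban", "House type": "Semi-detached", "Income": "High", "Previous customer": "Yes", "Outcome": "Respond"},
    {"District": "Rural",    "House type": "Semi-detached", "Income": "Low",  "Previous customer": "No",  "Outcome": "Respond"},
    {"District": "Urban",    "House type": "Detached",      "Income": "Low",  "Previous customer": "Yes", "Outcome": "Nothing"},
]

# Input attributes vs. the (discrete) goal attribute.
FEATURES = ["District", "House type", "Income", "Previous customer"]
GOAL = "Outcome"
```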

6
Decision tree
7
Decision tree representation
  • Each internal node is a test:
  • Theoretically, a node can test multiple attributes
  • In most systems, a node tests exactly one attribute
  • Each branch corresponds to a test result:
  • A branch corresponds to an attribute value or a range of attribute values
  • Each leaf node assigns:
  • a class (decision tree)
  • a real value (regression tree)

8
What's the best decision tree?
  • "Best": you need a bias (e.g., prefer the smallest tree). Least depth? Fewest nodes? Which trees are the best predictors of unseen data?
  • Occam's Razor: we prefer the simplest hypothesis that fits the data.
  • → Find a decision tree that is as small as possible and fits the data

9
Finding a smallest decision tree
  • A decision tree can represent any discrete function of the inputs: y = f(x1, x2, …, xn)
  • The space of decision trees is too big for a systematic search for a smallest decision tree.
  • Solution: a greedy algorithm

10
Basic algorithm: top-down induction
  • Find the best decision attribute, A, and assign A as the decision attribute for the node
  • For each value of A, create a new branch and divide up the training examples
  • Repeat the process until the gain is small enough
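
The steps above can be sketched as a short Python function (an ID3-style sketch using information gain as the quality measure; the helper names and tree representation are my own, not from the slides):

```python
from collections import Counter
import math

def entropy(labels):
    """Entropy of a list of class labels (0 for a pure set)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(examples, attr, goal):
    """Expected reduction in entropy from splitting on attr."""
    before = entropy([ex[goal] for ex in examples])
    after = 0.0
    for v in set(ex[attr] for ex in examples):
        subset = [ex[goal] for ex in examples if ex[attr] == v]
        after += len(subset) / len(examples) * entropy(subset)
    return before - after

def grow_tree(examples, attrs, goal, min_gain=1e-9):
    """Greedy top-down induction: a tree is a class label (leaf)
    or a tuple (attribute, {value: subtree})."""
    labels = [ex[goal] for ex in examples]
    majority = Counter(labels).most_common(1)[0][0]
    if not attrs or len(set(labels)) == 1:
        return majority                        # leaf: assign a class
    best = max(attrs, key=lambda a: info_gain(examples, a, goal))
    if info_gain(examples, best, goal) <= min_gain:
        return majority                        # stop when the gain is small
    branches = {}
    for v in set(ex[best] for ex in examples):
        subset = [ex for ex in examples if ex[best] == v]
        branches[v] = grow_tree(subset, [a for a in attrs if a != best],
                                goal, min_gain)
    return (best, branches)
```

On the four-row table from the earlier slide, this picks "House type" at the root (it alone already separates Nothing from Respond, so its gain is maximal).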

11
Major issues
12
Major issues
  • Q1: Choosing the best attribute: what quality measure to use?
  • Q2: Determining when to stop splitting: how to avoid overfitting?
  • Q3: Handling continuous attributes
  • Q4: Handling training data with missing attribute values
  • Q5: Handling attributes with different costs
  • Q6: Dealing with a continuous goal attribute

13
Q1: What quality measure?
  • Information gain
  • Gain Ratio

14
Entropy of a training set
  • S is a sample of training examples
  • Entropy is one way of measuring the impurity of S
  • p_c is the proportion of examples in S whose target attribute has value c.
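
The formula itself (an image in the original slide) did not survive the transcript; given the definition of p_c above, it is presumably the standard entropy:

```latex
\mathrm{Entropy}(S) = -\sum_{c} p_c \log_2 p_c
```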

15
Information Gain
  • Gain(S, A) = expected reduction in entropy due to sorting on A.
  • Choose the A with the max information gain
  • (equivalently, choose the A with the min average entropy of the subsets)
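
The formula (lost from the transcript) is presumably the standard definition, consistent with the entropy slide above:

```latex
\mathrm{Gain}(S, A) = \mathrm{Entropy}(S)
  - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)
```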

16
An example
(Figure: S contains 14 examples with entropy 0.940; splitting on Humidity yields subsets with E = 0.985 and E = 0.592; splitting on Wind yields subsets with E = 0.811 and E = 1.00.)

InfoGain(S, Humidity) = 0.940 − (7/14)·0.985 − (7/14)·0.592 = 0.151
InfoGain(S, Wind) = 0.940 − (8/14)·0.811 − (6/14)·1.0 = 0.048
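
The numbers above match the classic PlayTennis data (9 positive / 5 negative examples); assuming those class counts, the gains can be reproduced directly:

```python
import math

def entropy(pos, neg):
    """Entropy of a node with pos/neg examples (0 if pure)."""
    total = pos + neg
    e = 0.0
    for c in (pos, neg):
        if c:
            p = c / total
            e -= p * math.log2(p)
    return e

# S: 9 positive / 5 negative examples (entropy ≈ 0.940)
e_s = entropy(9, 5)
# Humidity splits S into (3+, 4−) and (6+, 1−)
gain_humidity = e_s - (7/14) * entropy(3, 4) - (7/14) * entropy(6, 1)
# Wind splits S into (6+, 2−) and (3+, 3−)
gain_wind = e_s - (8/14) * entropy(6, 2) - (6/14) * entropy(3, 3)
```

(The slide's 0.151 comes from rounding the intermediate entropies; the unrounded gain is ≈ 0.152.)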
17
Other quality measures
  • Problem with information gain:
  • Information gain prefers attributes with many values.
  • An alternative: Gain Ratio
  • where Si is the subset of S for which A has value vi.
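
The formula itself is missing from the transcript; given the definition of Si above, it is presumably the standard gain ratio:

```latex
\mathrm{GainRatio}(S, A) = \frac{\mathrm{Gain}(S, A)}{\mathrm{SplitInfo}(S, A)},
\qquad
\mathrm{SplitInfo}(S, A) = -\sum_{i} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}
```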

18
Q2: Avoiding overfitting
  • Overfitting occurs when our decision tree characterizes too much detail, or noise, in our training data.
  • Consider the error of hypothesis h over
  • the training data: ErrorTrain(h)
  • the entire distribution D of the data: ErrorD(h)
  • A hypothesis h overfits the training data if there is an alternative hypothesis h′ such that
  • ErrorTrain(h) < ErrorTrain(h′), and
  • ErrorD(h) > ErrorD(h′)

19
How to avoid overfitting
  • Stop growing the tree earlier
  • Ex: InfoGain < threshold
  • Ex: number of examples in a node < threshold
  • Grow the full tree, then post-prune
  • → In practice, the latter works better than the former.

20
Post-pruning
  • Split the data into a training set and a validation set
  • Do until further pruning is harmful:
  • Evaluate the impact on the validation set of pruning each possible node (plus those below it)
  • Greedily remove the ones that don't improve the performance on the validation set
  • This produces a smaller tree with the best performance measure
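
A toy sketch of this reduced-error pruning loop, assuming trees are nested (attribute, {value: subtree}) tuples with class-label leaves (a representation I am choosing for illustration, not from the slides):

```python
from collections import Counter

def classify(tree, ex, default):
    """Walk the tree; fall back to default for unseen branch values."""
    while isinstance(tree, tuple):
        attr, branches = tree
        if ex.get(attr) not in branches:
            return default
        tree = branches[ex[attr]]
    return tree

def accuracy(tree, data, goal, default):
    return sum(classify(tree, ex, default) == ex[goal] for ex in data) / len(data)

def majority(data, goal):
    return Counter(ex[goal] for ex in data).most_common(1)[0][0]

def prune(tree, val, goal):
    """Bottom-up: replace a subtree by a leaf whenever the leaf does at
    least as well on the validation examples that reach that node."""
    if not isinstance(tree, tuple) or not val:
        return tree
    attr, branches = tree
    tree = (attr, {v: prune(sub, [ex for ex in val if ex.get(attr) == v], goal)
                   for v, sub in branches.items()})
    leaf = majority(val, goal)
    if accuracy(leaf, val, goal, leaf) >= accuracy(tree, val, goal, leaf):
        return leaf
    return tree
```

For example, a subtree that splits on a noisy attribute collapses to its majority-class leaf, while splits that actually help on the validation set are kept.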

21
Performance measure
  • Accuracy:
  • on validation data
  • K-fold cross validation
  • Misclassification cost: sometimes more accuracy is desired for some classes than for others.
  • MDL: size(tree) + errors(tree)

22
Rule post-pruning
  • Convert the tree to an equivalent set of rules
  • Prune each rule independently of the others
  • Sort the final rules into the desired sequence for use
  • Perhaps the most frequently used method (e.g., C4.5)

23
Q3: Handling numeric attributes
  • Continuous attribute → discrete attribute
  • Example:
  • Original attribute: Temperature = 82.5
  • New attribute: (Temperature > 72.3) ∈ {t, f}
  • → Question: how to choose the thresholds?

24
Choosing thresholds for a continuous attribute
  • Sort the examples according to the continuous
    attribute.
  • Identify adjacent examples that differ in their target classification → a set of candidate thresholds
  • Choose the candidate threshold with the highest information gain.
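
The first two steps can be sketched as follows (taking each candidate threshold as the midpoint between the adjacent pair, one common convention; the function name is my own):

```python
def candidate_thresholds(values_and_labels):
    """Given (value, label) pairs for one continuous attribute, return the
    midpoints between adjacent sorted examples whose labels differ."""
    pts = sorted(values_and_labels)
    return [(a + b) / 2
            for (a, la), (b, lb) in zip(pts, pts[1:])
            if la != lb]
```

For Temperature values 40, 48, 60, 72, 80, 90 with labels no, no, yes, yes, yes, no, the candidates are 54 and 85; information gain is then computed for each candidate as if it were a binary attribute.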

25
Q4: Unknown attribute values
  • Assume an attribute can take the value "blank".
  • Assign the most common value of A among the training data at node n.
  • Assign the most common value of A among the training data at node n that have the same target class.
  • Assign a probability pi to each possible value vi of A:
  • Assign a fraction (pi) of the example to each descendant in the tree.
  • This method is used in C4.5.
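
A sketch of that last, fractional-count strategy: an example with a missing value is sent down every branch, weighted by the branch's share of the known-valued examples at the node (the function name and return shape are illustrative, not C4.5's actual API):

```python
from collections import Counter

def split_with_missing(examples, attr, missing=None):
    """Split on attr, returning {value: [(example, weight), ...]}.
    Known-valued examples get weight 1.0; examples missing attr are sent
    down every branch with weight pi = (branch frequency among known)."""
    known = [ex for ex in examples if ex[attr] is not missing]
    freq = Counter(ex[attr] for ex in known)
    total = sum(freq.values())
    branches = {v: [(ex, 1.0) for ex in known if ex[attr] == v] for v in freq}
    for ex in examples:
        if ex[attr] is missing:
            for v in freq:
                branches[v].append((ex, freq[v] / total))
    return branches
```

With values x, x, y and one missing, the missing example goes to the x branch with weight 2/3 and to the y branch with weight 1/3.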

26
Q5: Attributes with costs
  • Consider medical diagnosis: each test (e.g., a blood test) has a cost
  • Question: how to learn a consistent tree with low expected cost?
  • One approach: replace the gain with a cost-sensitive measure
  • Tan and Schlimmer (1990)
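
The measure itself is missing from the transcript; assuming the slide follows Mitchell's (1997) presentation of Tan and Schlimmer's approach, the gain is replaced by:

```latex
\frac{\mathrm{Gain}^2(S, A)}{\mathrm{Cost}(A)}
```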

27
Q6: Dealing with a continuous goal attribute → Regression tree
  • A variant of decision trees
  • Estimation problem: approximate real-valued functions, e.g., the crime rate
  • A leaf node is marked with a real value or a linear function, e.g., the mean of the target values of the examples at the node.
  • Measure of impurity: e.g., variance, standard deviation, …
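
A minimal sketch of variance as the impurity measure, with split quality as variance reduction (the function names are my own):

```python
def mean(ys):
    """Mean of the target values: the prediction at a regression-tree leaf."""
    return sum(ys) / len(ys)

def variance(ys):
    """Variance of the target values: the impurity of a node."""
    m = mean(ys)
    return sum((y - m) ** 2 for y in ys) / len(ys)

def variance_reduction(ys, left, right):
    """Quality of a binary split: parent variance minus the size-weighted
    variances of the two children (analogous to information gain)."""
    n = len(ys)
    return (variance(ys)
            - (len(left) / n) * variance(left)
            - (len(right) / n) * variance(right))
```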

28
Summary of major issues
  • Q1: Choosing the best attribute: different quality measures.
  • Q2: Determining when to stop splitting: stop earlier, or post-prune.
  • Q3: Handling continuous attributes: find the breakpoints.

29
Summary of major issues (cont)
  • Q4: Handling training data with missing attribute values: "blank" value, most common value, or fractional counts.
  • Q5: Handling attributes with different costs: use a quality measure that includes the cost factors.
  • Q6: Dealing with a continuous goal attribute: various ways of building regression trees.

30
Common algorithms
  • ID3
  • C4.5
  • CART

31
ID3
  • Proposed by Quinlan (as is C4.5)
  • Can handle basic cases: discrete attributes, no missing information, etc.
  • Uses information gain as the quality measure

32
C4.5
  • An extension of ID3
  • Several quality measures
  • Incomplete information (missing attribute values)
  • Numerical (continuous) attributes
  • Pruning of decision trees
  • Rule derivation
  • Random mode and batch mode

33
CART
  • CART (classification and regression tree)
  • Proposed by Breiman et al. (1984)
  • Constant numerical values in leaves
  • Variance as measure of impurity

34
Strengths of decision tree methods
  • Ability to generate understandable rules
  • Ease of calculation at classification time
  • Ability to handle both continuous and categorical
    variables
  • Ability to clearly indicate best attributes

35
The weaknesses of decision tree methods
  • Greedy algorithm: no global optimization
  • Error-prone with too many classes: the number of training examples per node shrinks quickly in a tree with many levels/branches.
  • Expensive to train: sorting, combinations of attributes, calculating quality measures, etc.
  • Trouble with non-rectangular regions: trees partition the space into rectangular boxes, which may not correspond well with the actual distribution of records in the decision space.

36
Advanced topics
37
Combining multiple models
  • Motivation: the inherent instability of top-down decision tree induction: different training datasets from a given problem domain will produce quite different trees.
  • Techniques
  • Bagging
  • Boosting

38
Bagging
  • Introduced by Breiman
  • It creates multiple decision trees, each trained on a different resample of the training set.
  • The trees' predictions are then combined, typically by voting.
  • This addresses some of the instability problems inherent in regular ID3.

39
Boosting
  • Introduced by Freund and Schapire
  • It increases the weights of training instances that the current trees classify incorrectly.
  • These weights refocus the algorithm on the hard examples, and are also used to weight the component hypotheses when their predictions are combined.

40
Summary
  • Basic case
  • Discrete input attributes
  • Discrete goal attribute
  • No missing attribute values
  • Same cost for all tests and all kinds of
    misclassification.
  • Extended cases
  • Continuous attributes
  • Real-valued goal attribute
  • Some examples miss some attribute values
  • Some tests are more expensive than others.
  • Some misclassifications are more serious than
    others.

41
Summary (cont)
  • Basic algorithm
  • greedy algorithm
  • top-down induction
  • Bias for small trees
  • Major issues

42
Uncovered issues
  • Incremental decision tree induction?
  • How can a decision relate to other decisions? What's the order of making the decisions? (e.g., POS tagging)
  • What's the difference between a decision tree and a decision list?