Decision Trees - PowerPoint PPT Presentation
Provided by: dwil4

Transcript and Presenter's Notes
1
Decision Trees
  • Russell, Norvig
  • Chapter 18

2
Decision Trees (Supervised Learning)
  • We are discrimination machines
  • What plays will work in sports.
  • Who should a bank give loans to.
  • What graduate student characteristics will lead
    to successful PhD process.
  • Not remotely obvious or easy.

3
The Problem
  • Find discrete, measurable attributes.
  • Do classification with them using some function,
    hopefully to predict the real value
  • i.e. f(attributes) returns a decision that
    hopefully predicts the real situation.
  • This is a recurring theme in AI.

4
A Decision Tree
  • Look at figure 18.2.
  • Should you wait for a table in a restaurant?
  • It represents Stuart Russell's tree, which he
    generated manually.
  • Let's go over how it works.
  • Notice it doesn't use Price or Type.

5
Should we wait at a restaurant? Inductive Learning
  • 10 attributes in fig 18.3
  • 6 are 2-valued (alternative, bar, fri/sat,
    hungry, rain, reservation)
  • 2 are 3-valued (patrons, price)
  • 2 are 4-valued (type, estimated waiting time)
  • A full table would have 9,216 rows
    (2^6 x 3^2 x 4^2).
  • Each row has to be analyzed to give the Yes or No
    answer.
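As a quick sanity check (my addition, not part of the slides), the full-table size follows directly from multiplying the value counts of the ten attributes:

```python
# Number of rows in a full truth table over the restaurant attributes:
# six 2-valued, two 3-valued, and two 4-valued attributes.
value_counts = [2] * 6 + [3] * 2 + [4] * 2

rows = 1
for v in value_counts:
    rows *= v

print(rows)  # 9216 distinct attribute combinations
```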

6
Inductive Learning How to build a decision
tree with limited data
  • We start with the training examples, e.g. table 18.3
  • (x1, y1) ... (xn, yn)
  • where xi is the vector of attribute values
  • yi is the output.
  • Generate them all and pick the best? Intractable.
  • With n boolean attributes, there are
    2^(2^n) possible trees.
  • For 6 boolean attributes, 18.4 x 10^18 different
    trees.
  • We need a heuristic algorithm to find a small tree.
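The tree count can be checked the same way (again my own sketch, not from the slides): with n boolean attributes there are 2^n truth-table rows, and each row can be labeled Yes or No independently, giving 2^(2^n) distinct boolean functions:

```python
def num_boolean_functions(n):
    # 2**n truth-table rows, each independently labeled Yes or No.
    return 2 ** (2 ** n)

print(num_boolean_functions(6))  # 18446744073709551616, about 18.4 x 10^18
```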

7
A Decision Tree Learning Algorithm
  • Choose the BEST attribute that splits the data.
    (We'll come back to the first 3 lines later.)
  • Type doesn't split it, but Patrons does.
  • If that choice splits the data, return the
    choice.
  • Much more to say about this later.
  • If you need to recurse, notice that you have a
    smaller problem with one less attribute and fewer
    examples.
  • Let's look at Figure 18.4 carefully.
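The recursive procedure, including the three special cases covered on the next slides, can be written compactly. This is a minimal sketch in the spirit of Russell & Norvig's DECISION-TREE-LEARNING, with examples assumed to be (attribute-dict, label) pairs and `importance` standing in for whatever attribute-scoring function you choose:

```python
from collections import Counter

def plurality_value(examples):
    # Majority vote over the output labels (ties broken arbitrarily).
    return Counter(y for _, y in examples).most_common(1)[0][0]

def dtl(examples, attributes, parent_examples, importance):
    # Special case 1: no examples -- fall back on the parent's majority.
    if not examples:
        return plurality_value(parent_examples)
    # Special case 2: all examples share one class -- just return it.
    if len({y for _, y in examples}) == 1:
        return examples[0][1]
    # Special case 3: no attributes left -- return the majority vote.
    if not attributes:
        return plurality_value(examples)
    # Otherwise split on the most important attribute and recurse on
    # a smaller problem: fewer examples, one less attribute.
    a = max(attributes, key=lambda attr: importance(attr, examples))
    tree = {"attr": a, "branches": {}}
    for v in {x[a] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[a] == v]
        rest = [b for b in attributes if b != a]
        tree["branches"][v] = dtl(subset, rest, examples, importance)
    return tree
```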

8
3 Special Cases: No Examples
  • If there are no examples, return a default value.
    (You must decide it.)
  • This means you've gone down a branch that hasn't
    been observed yet.
  • Suppose FULL was not yet observed in 18.4b.
  • That means that NONE and SOME appeared in all of
    the test cases, but not FULL.
  • We might see Full later in some real data.
  • We must return something, hence the default.

9
3 Special Cases: Same Class
  • If all the examples have the same class just
    return it.
  • All the outcomes are the same.
  • No more splitting on new attributes is needed.
  • This is what we are hoping for.

10
3 Special Cases: No Attributes
  • If no attributes are left, return majority vote.
  • This happens because of either
  • Noise (bad data)
  • Need more attributes (problem under-modeled)
  • Non-deterministic choices (sometimes you said
    yes and sometimes no to the same situation)

11
Remember
  • This is one of many trees that fit the data
    somehow.
  • The generated tree may not match your intuition.
  • Non-determinism may kill you, no matter what.
    Sometimes you wait at the restaurant, sometimes
    you don't. Based on what? Luck? An unknown attribute?
  • Note that Type did not split data at top level,
    but did at 3rd level.
  • Fig. 18.6 is simpler than 18.2, but is bound to
    be wrong because there are so many relevant cases
    it hasn't seen yet (e.g. Full 0-10)
  • Getting the right attributes is a key problem of
    knowledge engineering. Most computer bugs arise
    because the code is not sophisticated enough.
    Think about politics.

12
Best Attributes?
  • The algorithm asked for the best attribute, i.e.
    the one that best splits the data.
  • Good early choices make for smaller trees.
  • What makes an attribute good is whether it
    provides information or not.
  • 1 if it splits perfectly
  • 0 if it splits nothing
  • between 0 and 1 if it splits partially.
  • Book provides formulas for computing this.
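The book's measure is information gain based on entropy. A minimal sketch of the standard formulas (my own phrasing, with examples again as (attribute-dict, label) pairs):

```python
import math

def entropy(labels):
    # H = -sum over classes of p * log2(p); 1 bit for a 50/50 boolean mix.
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def information_gain(examples, attr):
    # Entropy before the split minus the weighted entropy after it:
    # 1 bit for a perfect boolean split, 0 for a useless one.
    labels = [y for _, y in examples]
    remainder = 0.0
    for v in {x[attr] for x, _ in examples}:
        subset = [y for x, y in examples if x[attr] == v]
        remainder += len(subset) / len(examples) * entropy(subset)
    return entropy(labels) - remainder
```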

13
Another Problem - Gathering Data
  • Consider problem of choosing a mate.
  • Characteristics are looks, sense of humor,
    money, car, intelligence,
  • The idea is you meet a person and want to know
    whether you should go out on a date.
  • Young (overly discriminating) people with no
    experience have the following tree

14
(No Transcript)
15
How Good is your Tree?
  • Try it on new examples to see if it predicts
    correctly.
  • Remember the restaurant had 9,216 cases.
  • Problem is you want to train on n << the number
    of possible cases.
  • Need a methodology.

16
Training
  • Collect n examples (the hard part)
  • Divide them into disjoint training and testing
    sets.
  • Use training set to generate the hypothesis.
  • Try the tree on the testing data.
  • Change test and training set until you get great
    results.
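The steps above can be sketched as follows (helper names like `train_test_split` are my own, not from the slides):

```python
import random

def train_test_split(examples, test_fraction=0.25, seed=0):
    # Shuffle, then carve off a disjoint test set.
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(predict, test_set):
    # Fraction of held-out examples classified correctly.
    return sum(1 for x, y in test_set if predict(x) == y) / len(test_set)
```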

17
What you want
  • A domain whose predictive quality increases as
    training set grows.

18
Overfitting - A common problem
  • Should you make a bet on the Dodgers?
  • You study the quality of the opponent
  • Who is the starting pitcher
  • Etc.
  • You also notice
  • You have won all your Saturday bets.
  • You have lost all your Sunday bets.
  • Should you add DayOfTheWeek to your set of
    attributes, since it seems to split the data?

19
Pruning Irrelevant Attributes
  • Superstitions are irrelevant attributes that
    accidentally split the data.
  • It is important to prune away bad attributes.
  • Do a statistical analysis, like Chi-Square
    pruning, to see how relevant an attribute is,
    i.e. does it really correlate with the outcome?
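A sketch of the chi-square statistic used for this kind of pruning: each branch's observed Yes/No counts are compared with what a completely irrelevant attribute would produce, and the result is checked against a chi-square table with (number of branches - 1) degrees of freedom:

```python
def chi_square_delta(split_counts):
    # split_counts: one (pos, neg) pair per attribute value.
    # Compares observed counts in each branch with the counts expected
    # if the attribute were irrelevant to the outcome.
    p = sum(pk for pk, nk in split_counts)
    n = sum(nk for pk, nk in split_counts)
    delta = 0.0
    for pk, nk in split_counts:
        total_k = pk + nk
        p_hat = p * total_k / (p + n)   # expected positives in this branch
        n_hat = n * total_k / (p + n)   # expected negatives in this branch
        if p_hat:
            delta += (pk - p_hat) ** 2 / p_hat
        if n_hat:
            delta += (nk - n_hat) ** 2 / n_hat
    return delta
```

A superstitious attribute gives a delta near 0; a genuinely informative one gives a large delta.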

20
Cross Validation for Overfitting
  • Estimate how each hypothesis predicts unseen
    data.
  • Run K experiments, each time setting aside 1/K of
    the data for testing.
  • Average the results (K is usually 5 or 10)
  • Idea is to randomize the test data, eliminating
    the superstitious correlations.
  • More on this when we do Neural Nets.
  • Look at fig 18.6. It didn't use Price, Rain, etc.
  • With cross validation, we may find these to be
    irrelevant.
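A minimal sketch of K-fold cross-validation as described above; `learn` and `score` are placeholders for your tree learner and accuracy measure, and the data should be shuffled beforehand:

```python
def k_fold_cross_validation(examples, k, learn, score):
    # Run k experiments, each holding out a different 1/k of the data,
    # then average the k held-out scores.
    fold = len(examples) // k
    results = []
    for i in range(k):
        test = examples[i * fold:(i + 1) * fold]
        train = examples[:i * fold] + examples[(i + 1) * fold:]
        hypothesis = learn(train)
        results.append(score(hypothesis, test))
    return sum(results) / k
```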

21
We've just scratched the surface
  • Missing data: How do you handle it?
  • Continuous or integer valued attributes.
  • Input/Output may be an amount, not just Yes/No.
    (Need to do regression trees.)
  • How do you combine hypotheses? Suppose you have
    3. Can you use them all? Ensemble learning.
  • Details about how many examples are needed.
  • Etc, etc,