Title: Decision Trees


1
Decision Trees
  • Data Intensive Linguistics
  • Spring 2003
  • Ling 684.02

2
Decision Trees
  • What does a decision tree do?
  • How do you prepare training data?
  • How do you use a decision tree?
  • The traditional example is a tiny data set about
    weather.
  • Here I use Wagon; many other similar packages
    exist.

3
Decision processes
  • Challenge: Who am I?
  • Q: Are you alive? A: Yes
  • Q: Are you famous? A: Yes
  • Q: Are you a tennis player? A: No
  • Q: Are you a golfer? A: Yes
  • Q: Are you Tiger Woods? A: Yes

4
Decision trees
  • Played rationally, this game has the property
    that each binary question partitions the space of
    possible entities.
  • Thus, the structure of the search can be seen as
    a tree.
  • Decision trees are encodings of a similar search
    process. Usually, a wider range of questions is
    allowed.

5
Decision trees
  • In a problem-solving setup, we're not dealing
    with people, but with a finite number of classes
    for the predicted variable.
  • But the task is essentially the same: given a set
    of available questions, narrow down the
    possibilities until you are confident about the
    class.

6
How to choose a question?
  • Look for the question that most increases your
    knowledge of the class.
  • We can't tell ahead of time which answer will
    arise.
  • So take the average over all possible answers,
    weighted by how probable each answer seems to be.
  • The maths behind this is either information
    theory or an approximation to it (a sketch of the
    calculation follows below).
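A minimal sketch of that calculation in Python (not part of the original slides; entropy-based information gain is one standard way of scoring questions, and the attribute names in the example are illustrative):

  from collections import Counter
  from math import log2

  def entropy(labels):
      # Shannon entropy (in bits) of a list of class labels.
      total = len(labels)
      return -sum((n / total) * log2(n / total)
                  for n in Counter(labels).values())

  def information_gain(instances, labels, attribute):
      # Expected reduction in entropy from asking about `attribute`;
      # each possible answer is weighted by how often it occurs.
      before = entropy(labels)
      total = len(labels)
      remainder = 0.0
      for value in set(inst[attribute] for inst in instances):
          subset = [lab for inst, lab in zip(instances, labels)
                    if inst[attribute] == value]
          remainder += (len(subset) / total) * entropy(subset)
      return before - remainder

  # Tiny illustrative example with weather-style attributes:
  data = [{"outlook": "sunny", "windy": "FALSE"},
          {"outlook": "rainy", "windy": "TRUE"}]
  labels = ["yes", "no"]
  print(information_gain(data, labels, "outlook"))   # 1.0 bit here

The question chosen at each node would be the attribute with the highest gain.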

7
How to be confident
  • Be confident if a simple majority classifier
    would achieve acceptable performance on the data
    in the current partition.
  • Obvious generalization (Kohavi): be confident if
    some other baseline classifier would perform well
    enough.

8
Input data
9
Data format
  • Each row of the table is an instance
  • Each column of the table is an attribute (or
    feature)
  • You also have to say which attribute is the
    predictee or class variable. In this case we
    choose Playable.
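For concreteness, here is what a few rows of such a data file might look like, assuming whitespace-separated fields in the attribute order given in the description file (outlook, temperature, humidity, windy, play); the values are made up for illustration, not taken from the slides:

  sunny     85  85  FALSE  no
  overcast  83  78  FALSE  yes
  rainy     70  96  TRUE   yes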

10
Attribute types
  • We also need to understand the types of the
    attributes.
  • For the weather data
  • Windy and Playable look boolean
  • Temperature and Humidity look as if they can take
    any numerical value
  • Cloudy looks as if it can take any of sunny,
    overcast, rainy

11
Wagon description files
  • Because guessing the range of an attribute is
    tricky, Wagon instead requires you to have a
    description file.
  • Fortunately (especially if you have big data
    files), Wagon also provides make_wagon_desc, which
    makes a reasonable guess at the desc file.

12
For the weather data
  • (
  • (outlook overcast rainy sunny)
  • (temperature float)
  • (humidity float)
  • (windy FALSE TRUE)
  • (play no yes)
  • )
  • (needed a little help replacing lists of numbers
    with float)

13
Commands for Wagon
  • wagon -data weather.dat -desc weather.desc -o
    weather.tree
  • This produces unsatisfying results, because we
    need to tell it that the data set is small by
    setting -stop 2 (or else it notices that there
    are too few examples and doesn't build a tree)

14
Using the stopping criterion
  • wagon -data weather.dat \
  • -desc weather.desc \
  • -o weather.tree \
  • -stop 1
  • This allows the system to learn the exact
    circumstances under which Play takes on
    particular values.

15
Using Wagon to classify

wagon_test -data weather.dat \
           -tree weather.tree \
           -desc weather.desc \
           -predict play
16
Output data
17
Over-training
  • -stop 1 is over-confident, because it might build
    a leaf for every quirky example.
  • There will be other quirky examples once we move
    to new data. Unless we are very lucky, what is
    learnt from the training set will be too
    detailed.

18
Over-training 2
  • The bigger -stop is, the more errors the system
    will commit on the training data.
  • Conversely, the smaller -stop is, the more likely
    it is that the tree will learn irrelevant detail.
  • The risk of overtraining grows as -stop shrinks,
    and as the set of available attributes increases
    (see the sketch below).
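The trade-off is easy to see with any decision-tree learner. The sketch below uses scikit-learn rather than Wagon (a substitution made purely for illustration); its min_samples_leaf parameter plays roughly the role of -stop:

  from sklearn.datasets import make_classification
  from sklearn.model_selection import train_test_split
  from sklearn.tree import DecisionTreeClassifier

  # Toy data standing in for a real training set.
  X, y = make_classification(n_samples=300, n_features=10, random_state=0)
  X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

  # Small leaves fit the training data almost perfectly but risk
  # learning irrelevant detail; large leaves commit more training errors.
  for leaf in (1, 2, 5, 20, 50):
      tree = DecisionTreeClassifier(min_samples_leaf=leaf, random_state=0)
      tree.fit(X_train, y_train)
      print(leaf,
            round(tree.score(X_train, y_train), 2),   # training accuracy
            round(tree.score(X_test, y_test), 2))     # held-out accuracy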

19
Why over-training hurts
  • If you have a complex attribute space, your
    training data will not cover everything.
  • Unless you learn general rules, new instances
    will not be correctly classified.
  • Also, the system's estimates of how well it is
    doing will be very optimistic.
  • This is like doing Linguistics but only on
    written, academic English...

20
Setting -stop automatically
  • Split the training data in two. Use the first
    half to train, trying several different values
    for -stop.
  • Use the second half for cross-validation: measure
    the performance of the various trees learnt.

21
Cross-validation
  • If the performance gain generalizes to the
    cross-validation half, then it probably also
    generalizes to unseen data.
  • Any problems?

22
Data efficiency
  • Train/tune split is wasteful.
  • Reduce the tuning part to 10% of the data. Train
    on 90%.
  • Rotate the 10% through the training data.

23
Cross-validation
24
Cross-validation
  • Because the tuning set was 10%, this is 10-fold
    cross-validation. 20-fold would use 5%.
  • In the limit (very expensive, or very small
    training data) we have leave-one-out
    cross-validation (the rotation is sketched below).
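A sketch of this rotation, again using scikit-learn's tree as a stand-in for Wagon and min_samples_leaf as the analogue of -stop (both substitutions are assumptions made for illustration):

  from sklearn.datasets import make_classification
  from sklearn.model_selection import KFold
  from sklearn.tree import DecisionTreeClassifier

  X, y = make_classification(n_samples=300, n_features=10, random_state=0)

  # For each candidate stopping value, rotate a 10% tuning slice through
  # the data (10-fold cross-validation) and average the accuracy; the
  # value that generalizes best to the held-out slices wins.
  folds = KFold(n_splits=10, shuffle=True, random_state=0)
  for leaf in (1, 2, 5, 10):
      accs = []
      for train_idx, tune_idx in folds.split(X):
          tree = DecisionTreeClassifier(min_samples_leaf=leaf, random_state=0)
          tree.fit(X[train_idx], y[train_idx])
          accs.append(tree.score(X[tune_idx], y[tune_idx]))
      print(leaf, round(sum(accs) / len(accs), 2))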

25
Clustering with decision trees
  • The standard stopping criterion is purity of the
    classes at the leaves of the trees.
  • Another criterion uses a distance matrix
    measuring the dissimilarity of instances. Stop
    when the groups of instances at the leaves form
    tight clusters.

26
What are decision trees?
  • A decision tree is a classifier. Given an input
    instance, it inspects the features and delivers a
    selected class.
  • But it knows slightly more than this. The set of
    instances grouped at the leaves may not be a pure
    class. This set defines a probability
    distribution over the classes. So a decision tree
    is a distribution classifier (see the sketch
    after this slide).
  • There are many other varieties of classifier.
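A small sketch of the "distribution classifier" point: the training labels collected at a leaf define the probabilities the tree can return (pure Python; the labels are illustrative):

  from collections import Counter

  def leaf_distribution(leaf_labels):
      # Class probabilities from the training labels stored at one leaf.
      counts = Counter(leaf_labels)
      total = len(leaf_labels)
      return {cls: n / total for cls, n in counts.items()}

  # A leaf holding three 'yes' instances and one 'no' does not just
  # answer 'yes'; it answers P(yes) = 0.75, P(no) = 0.25.
  print(leaf_distribution(["yes", "yes", "yes", "no"]))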

27
Nearest neighbour(s)
  • If you have a distance measure, and you have a
    labelled training set, you can assign a class by
    finding the class of the nearest labelled
    instance.
  • Relying on just one labelled data point could be
    risky, so an alternative is to consider the
    classes of k neighbours.
  • You need to find a suitable distance measure.
  • You might use cross-validation to set an
    appropriate value of k (a minimal k-NN sketch
    follows below).
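A minimal k-nearest-neighbour sketch in Python; Euclidean distance is assumed here purely for illustration, since the slides leave the choice of distance measure open:

  from collections import Counter
  from math import dist          # Euclidean distance (Python 3.8+)

  def knn_classify(query, labelled_data, k=3):
      # labelled_data: list of (feature_vector, class) pairs.
      neighbours = sorted(labelled_data,
                          key=lambda pair: dist(query, pair[0]))[:k]
      votes = Counter(cls for _, cls in neighbours)
      return votes.most_common(1)[0][0]

  # Two classes in a 2-D feature space (values are illustrative).
  data = [((0.0, 0.0), "a"), ((0.1, 0.2), "a"), ((1.0, 1.0), "b"),
          ((0.9, 1.1), "b"), ((1.2, 0.8), "b")]
  print(knn_classify((0.2, 0.1), data, k=3))   # -> "a"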

28
Bellman's curse
  • Nearest neighbour classifiers make sense if
    classes are well-localized in the space defined
    by the distance measure.
  • As you move from lines to planes, volumes and
    high-dimensional hyperplanes, the chance that you
    will find enough labelled data points close
    enough decreases.
  • This is a general problem, not specific to
    nearest-neighbour, and is known as Bellman's
    curse of dimensionality.

29
Dimensionality
  • If we had uniformly spread data, and we wanted to
    catch 10% of the data, we would need 10% of the
    range of x in a 1-D space, but 31% of the range
    of each of x and y in a 2-D space and 46% of the
    range of x, y, z in a cube. In 10 dimensions you
    need to cover 80% of the ranges (see the check
    below).
  • In high-dimensional spaces, most data points are
    closer to the boundaries of the space than they
    are to any other data point.
  • Text problems are very often high-dimensional.
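The figures above follow from a simple calculation: to capture a fraction f of uniformly spread data in d dimensions you need f to the power 1/d of each axis range. A quick check in Python:

  # Fraction of each axis range needed to capture 10% of uniformly
  # spread data in d dimensions: 0.1 ** (1/d).
  for d in (1, 2, 3, 10):
      print(d, round(0.1 ** (1 / d), 2))
  # prints 0.1, 0.32, 0.46, 0.79 -- roughly the 10%, 31%, 46% and 80%
  # quoted on the slide.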

30
Decision trees in high-D space
  • Decision trees work by picking out an important
    dimension and using it to split the instance
    space into slices of lower dimensionality.
  • They typically don't use all the dimensions of
    the input space.
  • Different branches of the tree may select
    different dimensions as relevant.
  • Once the subspace is pure enough, or well enough
    clustered, the DT is finished.

31
Cues to class variables
  • If we have many features, any one could be a
    useful cue to the class variable. (If the token
    is a single upper-case letter followed by a ".",
    it might be part of a name like "A. Name".)
  • If cues conflict, we need to decide which ones to
    trust ("... S. p. A. In other news ...").
  • In general, we may need to take account of
    combinations of features (the [A-Z]\. feature is
    relevant only if we haven't found abbreviations).

32
Compound cues
  • Unfortunately there are very many potential
    compound cues. Training them all separately will
    throw us into a very high-D space.
  • The naive Bayes classifier deals with this by
    adopting very strong assumptions about the
    relation between each feature and the underlying
    class.
  • Assumption: each feature is independently
    affected by the class; nothing else matters.

33
The naïve Bayes classifier
(Diagram: the naive Bayes model, a class node C with arrows to feature
nodes F1, F2, ..., Fn.)
  • P(F1, F2, ..., Fn | C) = P(F1 | C) P(F2 | C) ... P(Fn | C)
  • Classify by finding class with highest score
    given features and this (crass) assumption.
  • Nice property: easy to train; just count the
    number of times that Fi and the class co-occur
    (see the counting sketch below).
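A sketch of that counting in Python (smoothing and log probabilities are left out so the "just count co-occurrences" idea stays visible; the data layout is an assumption made for illustration):

  from collections import Counter, defaultdict

  def train_naive_bayes(instances, labels):
      # instances: list of dicts mapping feature name -> value;
      # labels: the class of each instance.
      class_counts = Counter(labels)
      cooc = defaultdict(Counter)        # (feature, class) -> value counts
      for inst, c in zip(instances, labels):
          for f, v in inst.items():
              cooc[(f, c)][v] += 1
      return class_counts, cooc

  def classify(inst, class_counts, cooc):
      total = sum(class_counts.values())
      def score(c):
          s = class_counts[c] / total            # P(C)
          for f, v in inst.items():              # times each P(Fi | C)
              s *= cooc[(f, c)][v] / class_counts[c]
          return s
      return max(class_counts, key=score)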

34
Comments on naïve Bayes
  • Clearly, the independence assumption is false.
  • All features, relevant or not, get the same
    chance to contribute. If there are many
    irrelevant features, they may swamp the real
    effects we are after.
  • But it is very simple and efficient, so can be
    used in schemes such as boosting that rely on
    combinations of many slightly different
    classifiers.
  • In that context, even simpler classifiers
    (majority classifier, single rule) can be useful.

35
Decision trees and classifiers
  • Attributes and instances
  • Learning from instances
  • Over-training
  • Cross-validation
  • Dimensionality
  • Independence assumptions