Title: Decision Trees


1
Decision Trees
  • Data Intensive Linguistics
  • Spring 2003
  • Ling 684.02

2
Decision Trees
  • What does a decision tree do?
  • How do you prepare training data?
  • How do you use a decision tree?
  • The traditional example is a tiny data set about
    weather.
  • Here I use Wagon; many other similar packages
    exist.

3
Decision processes
  • Challenge: Who am I?
  • Q: Are you alive? A: Yes
  • Q: Are you famous? A: Yes
  • Q: Are you a tennis player? A: No
  • Q: Are you a golfer? A: Yes
  • Q: Are you Tiger Woods? A: Yes

4
Decision trees
  • Played rationally, this game has the property
    that each binary question partitions the space of
    possible entities.
  • Thus, the structure of the search can be seen as
    a tree.
  • Decision trees are encodings of a similar search
    process. Usually, a wider range of questions is
    allowed.

5
Decision trees
  • In a problem-solving setup, we're not dealing
    with people, but with a finite number of classes
    for the predicted variable.
  • But the task is essentially the same: given a set
    of available questions, narrow down the
    possibilities until you are confident about the
    class.

6
How to choose a question?
  • Look for the question that most increases your
    knowledge of the class.
  • We can't tell ahead of time which answer will
    arise.
  • So take the average over all possible answers,
    weighted by how probable each answer seems to be.
  • The maths behind this is either information
    theory or an approximation to it (a sketch of the
    calculation follows below).
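A minimal sketch of that calculation in Python (not part of the original slides; entropy-based information gain is one standard way of scoring questions, and the attribute names in the example are illustrative):

  from collections import Counter
  from math import log2

  def entropy(labels):
      # Shannon entropy (in bits) of a list of class labels.
      total = len(labels)
      return -sum((n / total) * log2(n / total)
                  for n in Counter(labels).values())

  def information_gain(instances, labels, attribute):
      # Expected reduction in entropy from asking about `attribute`;
      # each possible answer is weighted by how often it occurs.
      before = entropy(labels)
      total = len(labels)
      remainder = 0.0
      for value in set(inst[attribute] for inst in instances):
          subset = [lab for inst, lab in zip(instances, labels)
                    if inst[attribute] == value]
          remainder += (len(subset) / total) * entropy(subset)
      return before - remainder

  # Tiny illustrative example with weather-style attributes:
  data = [{"outlook": "sunny", "windy": "FALSE"},
          {"outlook": "rainy", "windy": "TRUE"}]
  labels = ["yes", "no"]
  print(information_gain(data, labels, "outlook"))   # 1.0 bit here

The question chosen at each node would be the attribute with the highest gain.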

7
How to be confident
  • Be confident if a simple majority classifier
    would achieve acceptable performance on the data
    in the current partition.
  • Obvious generalization (Kohavi): be confident if
    some other baseline classifier would perform well
    enough.

8
Input data
9
Data format
  • Each row of the table is an instance
  • Each column of the table is an attribute (or
    feature)
  • You also have to say which attribute is the
    predictee or class variable. In this case we
    choose Playable.
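For concreteness, here is what a few rows of such a data file might look like, assuming whitespace-separated fields in the attribute order given in the description file (outlook, temperature, humidity, windy, play); the values are made up for illustration, not taken from the slides:

  sunny     85  85  FALSE  no
  overcast  83  78  FALSE  yes
  rainy     70  96  TRUE   yes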

10
Attribute types
  • We also need to understand the types of the
    attributes.
  • For the weather data
  • Windy and Playable look boolean
  • Temperature and Humidity look as if they can take
    any numerical value
  • Cloudy looks as if it can take any of sunny,
    overcast, rainy

11
Wagon description files
  • Because guessing the range of an attribute is
    tricky, Wagon instead requires you to have a
    description file.
  • Fortunately (especially if you have big data
    files), Wagon also provides make_wagon_desc, which
    makes a reasonable guess at the desc file.

12
For the weather data
  • (
  • (outlook overcast rainy sunny)
  • (temperature float)
  • (humidity float)
  • (windy FALSE TRUE)
  • (play no yes)
  • )
  • (needed a little help replacing lists of numbers
    with float)

13
Commands for Wagon
  • wagon -data weather.dat -desc weather.desc -o
    weather.tree
  • This produces unsatisfying results, because we
    need to tell it that the data set is small by
    setting -stop 2 (or else it notices that there
    are too few examples and doesn't build a tree)

14
Using the stopping criterion
  • wagon -data weather.dat \
  • -desc weather.desc \
  • -o weather.tree \
  • -stop 1
  • This allows the system to learn the exact
    circumstances under which Play takes on
    particular values.

15
Using Wagon to classify

wagon_test -data weather.dat \
           -tree weather.tree \
           -desc weather.desc \
           -predict play
16
Output data
17
Over-training
  • -stop 1 is over-confident, because it might build
    a leaf for every quirky example.
  • There will be other quirky examples once we move
    to new data. Unless we are very lucky, what is
    learnt from the training set will be too
    detailed.

18
Over-training 2
  • The bigger -stop is, the more errors the system
    will commit on the training data.
  • Conversely, the smaller -stop is, the more likely
    it is that the tree will learn irrelevant detail.
  • The risk of overtraining grows as -stop shrinks,
    and as the set of available attributes increases
    (see the sketch below).
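The trade-off is easy to see with any decision-tree learner. The sketch below uses scikit-learn rather than Wagon (a substitution made purely for illustration); its min_samples_leaf parameter plays roughly the role of -stop:

  from sklearn.datasets import make_classification
  from sklearn.model_selection import train_test_split
  from sklearn.tree import DecisionTreeClassifier

  # Toy data standing in for a real training set.
  X, y = make_classification(n_samples=300, n_features=10, random_state=0)
  X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

  # Small leaves fit the training data almost perfectly but risk
  # learning irrelevant detail; large leaves commit more training errors.
  for leaf in (1, 2, 5, 20, 50):
      tree = DecisionTreeClassifier(min_samples_leaf=leaf, random_state=0)
      tree.fit(X_train, y_train)
      print(leaf,
            round(tree.score(X_train, y_train), 2),   # training accuracy
            round(tree.score(X_test, y_test), 2))     # held-out accuracy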

19
Why over-training hurts
  • If you have a complex attribute space, your
    training data will not cover everything.
  • Unless you learn general rules, new instances
    will not be correctly classified.
  • Also, the system's estimates of how well it is
    doing will be very optimistic.
  • This is like doing Linguistics but only on
    written, academic English...

20
Setting -stop automatically
  • Split the training data in two. Use the first
    half to train, trying several different values
    for -stop.
  • Use the second half for cross-validation: measure
    the performance of the various trees learnt.

21
Cross-validation
  • If the performance gain generalizes to the
    cross-validation half, then it probably also
    generalizes to unseen data.
  • Any problems?

22
Data efficiency
  • Train/tune split is wasteful.
  • Reduce the tuning part to 10% of the data. Train
    on 90%.
  • Rotate the 10% through the training data.

23
Cross-validation
24
Cross-validation
  • Because the tuning set was 10%, this is 10-fold
    cross-validation. 20-fold would use 5%.
  • In the limit (very expensive, or very small
    training data) we have leave-one-out
    cross-validation (the rotation is sketched below).
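A sketch of this rotation, again using scikit-learn's tree as a stand-in for Wagon and min_samples_leaf as the analogue of -stop (both substitutions are assumptions made for illustration):

  from sklearn.datasets import make_classification
  from sklearn.model_selection import KFold
  from sklearn.tree import DecisionTreeClassifier

  X, y = make_classification(n_samples=300, n_features=10, random_state=0)

  # For each candidate stopping value, rotate a 10% tuning slice through
  # the data (10-fold cross-validation) and average the accuracy; the
  # value that generalizes best to the held-out slices wins.
  folds = KFold(n_splits=10, shuffle=True, random_state=0)
  for leaf in (1, 2, 5, 10):
      accs = []
      for train_idx, tune_idx in folds.split(X):
          tree = DecisionTreeClassifier(min_samples_leaf=leaf, random_state=0)
          tree.fit(X[train_idx], y[train_idx])
          accs.append(tree.score(X[tune_idx], y[tune_idx]))
      print(leaf, round(sum(accs) / len(accs), 2))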

25
Clustering with decision trees
  • The standard stopping criterion is purity of the
    classes at the leaves of the trees.
  • Another criterion uses a distance matrix
    measuring the dissimilarity of instances. Stop
    when the groups of instances at the leaves form
    tight clusters.

26
What are decision trees?
  • A decision tree is a classifier. Given an input
    instance, it inspects the features and delivers a
    selected class.
  • But it knows slightly more than this. The set of
    instances grouped at the leaves may not be a pure
    class. This set defines a probability
    distribution over the classes. So a decision tree
    is a distribution classifier (see the sketch
    after this slide).
  • There are many other varieties of classifier.
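A small sketch of the "distribution classifier" point: the training labels collected at a leaf define the probabilities the tree can return (pure Python; the labels are illustrative):

  from collections import Counter

  def leaf_distribution(leaf_labels):
      # Class probabilities from the training labels stored at one leaf.
      counts = Counter(leaf_labels)
      total = len(leaf_labels)
      return {cls: n / total for cls, n in counts.items()}

  # A leaf holding three 'yes' instances and one 'no' does not just
  # answer 'yes'; it answers P(yes) = 0.75, P(no) = 0.25.
  print(leaf_distribution(["yes", "yes", "yes", "no"]))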

27
Nearest neighbour(s)
  • If you have a distance measure, and you have a
    labelled training set, you can assign a class by
    finding the class of the nearest labelled
    instance.
  • Relying on just one labelled data point could be
    risky, so an alternative is to consider the
    classes of k neighbours.
  • You need to find a suitable distance measure.
  • You might use cross-validation to set an
    appropriate value of k (a minimal k-NN sketch
    follows below).
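A minimal k-nearest-neighbour sketch in Python; Euclidean distance is assumed here purely for illustration, since the slides leave the choice of distance measure open:

  from collections import Counter
  from math import dist          # Euclidean distance (Python 3.8+)

  def knn_classify(query, labelled_data, k=3):
      # labelled_data: list of (feature_vector, class) pairs.
      neighbours = sorted(labelled_data,
                          key=lambda pair: dist(query, pair[0]))[:k]
      votes = Counter(cls for _, cls in neighbours)
      return votes.most_common(1)[0][0]

  # Two classes in a 2-D feature space (values are illustrative).
  data = [((0.0, 0.0), "a"), ((0.1, 0.2), "a"), ((1.0, 1.0), "b"),
          ((0.9, 1.1), "b"), ((1.2, 0.8), "b")]
  print(knn_classify((0.2, 0.1), data, k=3))   # -> "a"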

28
Bellman's curse
  • Nearest neighbour classifiers make sense if
    classes are well-localized in the space defined
    by the distance measure.
  • As you move from lines to planes, volumes and
    high-dimensional hyperplanes, the chance that you
    will find enough labelled data points close
    enough decreases.
  • This is a general problem, not specific to
    nearest-neighbour, and is known as Bellman's
    curse of dimensionality.

29
Dimensionality
  • If we had uniformly spread data, and we wanted to
    catch 10% of the data, we would need 10% of the
    range of x in a 1-D space, but 31% of the range
    of each of x and y in a 2-D space and 46% of the
    range of x, y, z in a cube. In 10 dimensions you
    need to cover 80% of the ranges (see the check
    below).
  • In high-dimensional spaces, most data points are
    closer to the boundaries of the space than they
    are to any other data point.
  • Text problems are very often high-dimensional.
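The figures above follow from a simple calculation: to capture a fraction f of uniformly spread data in d dimensions you need f to the power 1/d of each axis range. A quick check in Python:

  # Fraction of each axis range needed to capture 10% of uniformly
  # spread data in d dimensions: 0.1 ** (1/d).
  for d in (1, 2, 3, 10):
      print(d, round(0.1 ** (1 / d), 2))
  # prints 0.1, 0.32, 0.46, 0.79 -- roughly the 10%, 31%, 46% and 80%
  # quoted on the slide.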

30
Decision trees in high-D space
  • Decision trees work by picking out an important
    dimension and using it to split the instance
    space into slices of lower dimensionality.
  • They typically don't use all the dimensions of
    the input space.
  • Different branches of the tree may select
    different dimensions as relevant.
  • Once the subspace is pure enough, or well enough
    clustered, the DT is finished.

31
Cues to class variables
  • If we have many features, any one could be a
    useful cue to the class variable. (If the token
    is a single upper-case letter followed by a ".",
    it might be part of a name like "A. Name".)
  • If cues conflict, we need to decide which ones to
    trust ("... S. p. A. In other news ...").
  • In general, we may need to take account of
    combinations of features (the [A-Z]\. feature is
    relevant only if we haven't found abbreviations).

32
Compound cues
  • Unfortunately there are very many potential
    compound cues. Training them all separately will
    throw us into a very high-D space.
  • The naive Bayes classifier deals with this by
    adopting very strong assumptions about the
    relation between each feature and the underlying
    class.
  • Assumption: each feature is independently
    affected by the class; nothing else matters.

33
The naïve Bayes classifier
(Diagram: the naive Bayes model, a class node C with arrows to feature
nodes F1, F2, ..., Fn.)
  • P(F1, F2, ..., Fn | C) = P(F1 | C) P(F2 | C) ... P(Fn | C)
  • Classify by finding class with highest score
    given features and this (crass) assumption.
  • Nice property: easy to train; just count the
    number of times that Fi and the class co-occur
    (see the counting sketch below).
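A sketch of that counting in Python (smoothing and log probabilities are left out so the "just count co-occurrences" idea stays visible; the data layout is an assumption made for illustration):

  from collections import Counter, defaultdict

  def train_naive_bayes(instances, labels):
      # instances: list of dicts mapping feature name -> value;
      # labels: the class of each instance.
      class_counts = Counter(labels)
      cooc = defaultdict(Counter)        # (feature, class) -> value counts
      for inst, c in zip(instances, labels):
          for f, v in inst.items():
              cooc[(f, c)][v] += 1
      return class_counts, cooc

  def classify(inst, class_counts, cooc):
      total = sum(class_counts.values())
      def score(c):
          s = class_counts[c] / total            # P(C)
          for f, v in inst.items():              # times each P(Fi | C)
              s *= cooc[(f, c)][v] / class_counts[c]
          return s
      return max(class_counts, key=score)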

34
Comments on naïve Bayes
  • Clearly, the independence assumption is false.
  • All features, relevant or not, get the same
    chance to contribute. If there are many
    irrelevant features, they may swamp the real
    effects we are after.
  • But it is very simple and efficient, so can be
    used in schemes such as boosting that rely on
    combinations of many slightly different
    classifiers.
  • In that context, even simpler classifiers
    (majority classifier, single rule) can be useful.

35
Decision trees and classifiers
  • Attributes and instances
  • Learning from instances
  • Over-training
  • Cross-validation
  • Dimensionality
  • Independence assumptions