Transcript and Presenter's Notes

Title: Decision Trees


1
Decision Trees
2
Bits
  • We are watching a set of independent random
    samples of X
  • We see that X has four possible values
  • So we might see BAACBADCDADDDA
  • We transmit data over a binary serial link. We
    can encode each reading with two bits
    (e.g. A=00, B=01, C=10, D=11)
  • 0100001001001110110011111100

3
Fewer Bits
  • Someone tells us that the probabilities are not
    equal
  • It's possible to invent a coding for your
    transmission that uses only 1.75 bits per symbol
    on average. Here is one (sketched below).
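  The coding itself is not in the transcript. A
  minimal sketch, assuming the classic example
  probabilities P(A)=1/2, P(B)=1/4, P(C)=P(D)=1/8
  (an assumption, not stated above):

    A -> 0    B -> 10    C -> 110    D -> 111

  Expected length = (1/2)(1) + (1/4)(2) + (1/8)(3)
  + (1/8)(3) = 1.75 bits per symbol. Because no
  codeword is a prefix of another, the receiver can
  decode the stream unambiguously.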

4
General Case
  • Suppose X can have one of m values, with
    probabilities p1, ..., pm
  • What's the smallest possible number of bits, on
    average, per symbol, needed to transmit a stream
    of symbols drawn from X's distribution? It's
    H(X) = - Σj pj log2(pj)
  • H(X) is the entropy of X
  • Shannon arrived at this formula by setting down
    several desirable properties for a measure of
    uncertainty, and then finding the formula that
    satisfies them.
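  A minimal sketch of this computation in Python
  (the function name and the probabilities are mine,
  not from the slides):

    import math

    def entropy(probs):
        """Entropy in bits of a discrete distribution
        given as a list of probabilities."""
        return -sum(p * math.log2(p) for p in probs if p > 0)

    # Assumed four-symbol example: matches the
    # 1.75 bits/symbol coding sketched earlier.
    print(entropy([0.5, 0.25, 0.125, 0.125]))  # -> 1.75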

5
Constructing decision trees
  • Normal procedure: top down, in recursive
    divide-and-conquer fashion
  • First, an attribute is selected for the root node
    and a branch is created for each possible
    attribute value
  • Then the instances are split into subsets (one
    for each branch extending from the node)
  • Finally, the same procedure is repeated
    recursively for each branch, using only the
    instances that reach that branch
  • The process stops if all instances have the same
    class (a rough sketch of the recursion follows)
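  A rough sketch of that recursion in Python,
  assuming nominal attributes and instances stored
  as dicts with a 'class' key; attribute selection
  is left abstract, since the following slides are
  about how to choose it (all names here are mine,
  not the presentation's):

    from collections import Counter

    def majority_class(instances):
        """Most common class among the instances (used for leaves)."""
        return Counter(inst['class'] for inst in instances).most_common(1)[0][0]

    def build_tree(instances, attributes, select_attribute):
        classes = {inst['class'] for inst in instances}
        # Stop if the node is pure or no attributes are left to split on.
        if len(classes) == 1 or not attributes:
            return majority_class(instances)
        best = select_attribute(instances, attributes)
        branches = {}
        for value in {inst[best] for inst in instances}:
            subset = [inst for inst in instances if inst[best] == value]
            remaining = [a for a in attributes if a != best]
            branches[value] = build_tree(subset, remaining, select_attribute)
        return (best, branches)  # internal node: attribute plus one subtree per value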

6
Which attribute to select?
  (Figure: four candidate splits, one tree stump per
  attribute, labelled (a)-(d).)
7
A criterion for attribute selection
  • Which is the best attribute?
  • The one that will result in the smallest tree
  • Heuristic: choose the attribute that produces the
    purest nodes
  • Popular impurity criterion: entropy of the nodes
  • The lower the entropy, the purer the node
  • Strategy: choose the attribute that results in
    the lowest entropy of the child nodes (see the
    sketch below)
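  A sketch of that strategy, reusing the entropy
  helper and the Counter import from the earlier
  sketches (again, the names are mine):

    def child_entropy(instances, attribute):
        """Weighted average entropy of the child nodes created by
        splitting on the given attribute - the quantity to minimize."""
        n = len(instances)
        total = 0.0
        for value in {inst[attribute] for inst in instances}:
            subset = [inst for inst in instances if inst[attribute] == value]
            counts = Counter(inst['class'] for inst in subset)
            total += len(subset) / n * entropy([c / len(subset) for c in counts.values()])
        return total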

8
Example attribute Outlook
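  The slide image with the numbers is not in the
  transcript. Assuming the standard 14-instance
  weather data (9 yes / 5 no) that the later slides
  refer to, the child nodes for Outlook would be:

    sunny:    2 yes, 3 no  ->  entropy 0.971 bits
    overcast: 4 yes, 0 no  ->  entropy 0.000 bits
    rainy:    3 yes, 2 no  ->  entropy 0.971 bits

  Weighted average over the children:
  (5/14)(0.971) + (4/14)(0) + (5/14)(0.971) ≈ 0.693 bits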
9
Information gain
  • Usually people don't use the entropy of a node
    directly; rather, the information gain is used
  • Clearly, the greater the information gain, the
    purer the resulting nodes. So we choose Outlook
    for the root (see the worked numbers below).
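  The slide image is not in the transcript; the
  usual definition is gain(A) = entropy(parent)
  minus the weighted average entropy of the children
  after splitting on A. With the weather-data
  numbers assumed above:

    gain(Outlook) = info([9,5]) - 0.693
                  = 0.940 - 0.693 ≈ 0.247 bits

  which, on that data set, is larger than the gain
  of any of the other attributes.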

10
Continuing to split
11
The final decision tree
  • Note: not all leaves need to be pure; sometimes
    identical instances have different classes
  • ⇒ Splitting stops when the data can't be split
    any further

12
Highly-branching attributes
  • The weather data with ID code

13
Tree stump for ID code attribute
14
Highly-branching attributes
  • So, subsets are more likely to be pure if there
    is a large number of attribute values
  • Information gain is therefore biased towards
    choosing attributes with a large number of values
    (the ID code is the extreme case; see the worked
    numbers below)
  • This may result in overfitting (selection of an
    attribute that is non-optimal for prediction)
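  A worked illustration, again assuming the
  14-instance weather data: the ID code splits the
  data into 14 branches of one instance each, so
  every child node is pure and

    gain(ID code) = info([9,5]) - 0 ≈ 0.940 bits

  the maximum possible, even though the attribute is
  useless for predicting new instances.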

15
The gain ratio
  • Gain ratio: a modification of the information
    gain that reduces its bias
  • Gain ratio takes the number and size of branches
    into account when choosing an attribute
  • It corrects the information gain by taking the
    intrinsic information of a split into account
  • Intrinsic information: the entropy of the node to
    be split with respect to the attribute in focus,
    i.e. the information needed to tell which branch
    an instance goes down (see the formulas below)
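  In symbols (standard definitions, not spelled out
  in the transcript), for an attribute A that splits
  n instances into branches of sizes n1, ..., nk:

    intrinsic_info(A) = - Σi (ni/n) log2(ni/n)
    gain_ratio(A)     = gain(A) / intrinsic_info(A)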

16
Computing the gain ratio
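  The slide image is not in the transcript; assuming
  the standard weather data, the computation would
  be:

    intrinsic_info(Outlook) = info([5,4,5]) ≈ 1.577 bits
    gain_ratio(Outlook)     ≈ 0.247 / 1.577 ≈ 0.157

    intrinsic_info(ID code) = log2(14) ≈ 3.807 bits
    gain_ratio(ID code)     ≈ 0.940 / 3.807 ≈ 0.247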
17
Gain ratios for weather data
18
More on the gain ratio
  • Outlook still comes out top, but Humidity is now
    a much closer contender because it splits the
    data into two subsets instead of three
  • However, ID code still has a greater gain ratio,
    although its advantage is greatly reduced
  • Problem with gain ratio: it may overcompensate
  • It may choose an attribute just because its
    intrinsic information is very low
  • Standard fix: choose the attribute that maximizes
    the gain ratio, provided the information gain for
    that attribute is at least as great as the
    average information gain over all the attributes
    examined (a sketch follows)
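  A sketch of that fix in Python, building on the
  entropy and child_entropy helpers (and the Counter
  import) sketched earlier; the helper names are
  mine, not the presentation's:

    def info_gain(instances, attribute):
        """Entropy of the node minus the weighted entropy of its children."""
        counts = Counter(inst['class'] for inst in instances)
        parent = entropy([c / len(instances) for c in counts.values()])
        return parent - child_entropy(instances, attribute)

    def gain_ratio(instances, attribute):
        """Information gain divided by the intrinsic information of the split."""
        sizes = Counter(inst[attribute] for inst in instances)
        intrinsic = entropy([c / len(instances) for c in sizes.values()])
        return info_gain(instances, attribute) / intrinsic if intrinsic > 0 else 0.0

    def select_attribute(instances, attributes):
        """Standard fix: maximize gain ratio, but only among attributes whose
        information gain is at least the average gain of all candidates."""
        gains = {a: info_gain(instances, a) for a in attributes}
        avg_gain = sum(gains.values()) / len(gains)
        candidates = [a for a in attributes if gains[a] >= avg_gain]
        return max(candidates, key=lambda a: gain_ratio(instances, a))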

19
Discussion
  • The algorithm for top-down induction of decision
    trees (ID3) was developed by Ross Quinlan
    (University of Sydney, Australia)
  • Gain ratio is just one modification of this basic
    algorithm
  • This line of work led to the development of C4.5,
    which can deal with numeric attributes, missing
    values, and noisy data
  • There are many other attribute selection
    criteria! (But they make almost no difference to
    the accuracy of the result.)