Title: Decision Tree Learning
Slide 1: Decision Tree Learning
- Machine Learning, T. Mitchell
- Chapter 3
Slide 2: Decision Trees
- One of the most widely used and practical methods for inductive inference
- Approximates discrete-valued functions (including disjunctions)
- Can be used for classification (most common) or regression problems
Slide 3: Decision Tree for PlayTennis
Slide 4: Decision Tree
- If (Outlook = Sunny AND Humidity = Normal) OR (Outlook = Overcast) OR (Outlook = Rain AND Wind = Weak) then YES
- A disjunction of conjunctions of constraints on attribute values
- Larger hypothesis space than Candidate-Elimination
Slide 5: Decision tree representation
- Each internal node corresponds to a test
- Each branch corresponds to a result of the test
- Each leaf node assigns a classification
- Once the tree is trained, a new instance is
classified by starting at the root and following
the path as dictated by the test results for this
instance.
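
As an illustration (not from the slides), the PlayTennis tree above can be stored as nested dictionaries and a new instance classified by following the test results from the root; the dictionary layout is an assumption of this sketch.

    # Minimal sketch: classify an instance by walking a stored decision tree.
    tree = {"Outlook": {"Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
                        "Overcast": "Yes",
                        "Rain": {"Wind": {"Weak": "Yes", "Strong": "No"}}}}

    def classify(node, instance):
        # A leaf is stored as a plain class label; an internal node as {attribute: {value: subtree}}.
        if not isinstance(node, dict):
            return node
        attribute = next(iter(node))              # the test at this node
        return classify(node[attribute][instance[attribute]], instance)

    print(classify(tree, {"Outlook": "Rain", "Humidity": "High", "Wind": "Weak"}))   # Yes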
Slide 6: Decision Regions
Slide 7: Divide and Conquer
- Internal decision nodes
- Univariate: uses a single attribute, xi
- Discrete xi: n-way split for n possible values
- Continuous xi: binary split on xi > wm
- Multivariate: uses more than one attribute
- Leaves
- Classification: class labels, or proportions
- Regression: numeric value r (average, or local fit)
- Learning is greedy: find the best split recursively (Breiman et al., 1984; Quinlan, 1986, 1993)
Slide 8:
- If the decisions are binary, then in the best case each decision eliminates half of the regions (leaves).
- If there are b regions, the correct region can be found in log2(b) decisions, in the best case.
Slide 9: Multivariate Trees
Slide 10: Expressiveness
- A decision tree can represent a disjunction of conjunctions of constraints on the attribute values of instances.
- Each path corresponds to a conjunction
- The tree itself corresponds to a disjunction
- How expressive is this representation?
- How would we represent
- (A AND B) OR C
- M of N
- A XOR B
Slide 11: Decision tree learning algorithm
- For a given training set, there are many trees that code it without any error
- Finding the smallest tree is NP-complete (Quinlan, 1986), hence we are forced to use some (local) search algorithm to find reasonable solutions
Slide 12: The basic decision tree learning algorithm
- A decision tree can be constructed by considering attributes of instances one by one.
- Which attribute should be considered first?
- The height of a decision tree depends on the order in which attributes are considered.
Slide 13: Top-Down Induction of Decision Trees
Slide 14: (no transcript)
Slide 15: Entropy
- Measure of uncertainty
- Expected number of bits to resolve uncertainty
- Entropy measures the amount of information in a message
- High school form example
Slide 16: Entropy
- Important quantity in
- coding theory
- statistical physics
- machine learning
Slide 17: Entropy
- Coding theory: x is discrete with 8 possible states
- How many bits are needed to transmit the state of x?
- All states equally likely
Slide 18: (no transcript)
Slide 19:
- Entropy measures the impurity of S
- Entropy(S) = -p log2(p) - (1-p) log2(1-p)
- (Here p = p_positive and 1-p = p_negative from the previous slide)
Slide 20: Entropy
- Suppose Pr[X = 0] = 1/8
- If the other events are all equally likely, the number of events is 8.
- To indicate one out of so many events, one needs log2(8) = 3 bits.
- Consider a binary random variable X s.t. Pr[X = 0] = 0.1.
- The expected number of bits: -0.1 log2(0.1) - 0.9 log2(0.9) ≈ 0.47
- In general, if a random variable X has c values with probabilities p_1, ..., p_c
- The expected number of bits: H(X) = -Σ_i p_i log2(p_i)
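
As a quick numerical check of the two cases above, here is a small Python sketch (not from the slides); the entropy function simply implements the expected-number-of-bits formula.

    import math

    def entropy(probs):
        # Expected number of bits: H = -sum_i p_i * log2(p_i); zero-probability values contribute nothing.
        return -sum(p * math.log2(p) for p in probs if p > 0)

    # Eight equally likely states: log2(8) = 3 bits are needed.
    print(entropy([1 / 8] * 8))     # 3.0

    # Binary variable with Pr[X = 0] = 0.1: less than half a bit on average.
    print(entropy([0.1, 0.9]))      # about 0.47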
Slide 21: Entropy
- What if we have the following distribution for x?
- In order to save on transmission costs, we would design codes that reflect this distribution
Slide 22: Entropy
Slide 23: Use of Entropy in Choosing the Next Attribute
Slide 24: (no transcript)
Slide 25: Other measures of impurity
- Entropy is not the only measure of impurity. If a function satisfies certain criteria, it can be used as a measure of impurity.
- Gini index: 2p(1-p)
- Misclassification error: 1 - max(p, 1-p)
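
A small Python sketch (not from the slides) comparing the three impurity measures on a node with a fraction p of positive examples; all three are zero for pure nodes and largest at p = 0.5.

    import math

    def entropy_impurity(p):
        if p in (0.0, 1.0):
            return 0.0
        return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

    def gini_index(p):
        return 2 * p * (1 - p)

    def misclassification_error(p):
        return 1 - max(p, 1 - p)

    for p in (0.5, 0.8, 1.0):
        print(p, entropy_impurity(p), gini_index(p), misclassification_error(p))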
Slide 26: Training Examples
Slide 27: Selecting the Next Attribute
Slide 28: Selecting the Next Attribute
- Computing the information gain for each attribute, we selected the Outlook attribute as the first test, resulting in the following partially learned tree
Slide 29: Partially learned tree
Slide 30:
- Until stopped (the recursion is sketched in the code below):
- Select one of the unused attributes to partition the remaining examples at each non-terminal node
- using only the training samples associated with that node
- Stopping criteria:
- each leaf node contains examples of one type
- algorithm ran out of attributes
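
A minimal Python sketch of this recursive procedure (an illustration, not the slides' exact pseudocode), assuming examples are (attribute-dictionary, label) pairs and that information gain is used to pick the split; the two stopping criteria above appear as the base cases.

    import math
    from collections import Counter

    def entropy(labels):
        total = len(labels)
        return -sum(c / total * math.log2(c / total) for c in Counter(labels).values())

    def gain(examples, attribute):
        labels = [y for _, y in examples]
        remainder = 0.0
        for value in {x[attribute] for x, _ in examples}:
            subset = [y for x, y in examples if x[attribute] == value]
            remainder += len(subset) / len(examples) * entropy(subset)
        return entropy(labels) - remainder

    def grow_tree(examples, attributes):
        labels = [y for _, y in examples]
        if len(set(labels)) == 1:                     # leaf: all examples of one type
            return labels[0]
        if not attributes:                            # ran out of attributes: majority label
            return Counter(labels).most_common(1)[0][0]
        best = max(attributes, key=lambda a: gain(examples, a))
        children = {}
        for value in {x[best] for x, _ in examples}:  # partition on the chosen attribute
            subset = [(x, y) for x, y in examples if x[best] == value]
            children[value] = grow_tree(subset, [a for a in attributes if a != best])
        return {best: children}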
Slide 31: (no transcript)
Slide 32: Inductive Bias of ID3
Slide 33: Hypothesis Space Search by ID3
- Hypothesis space is complete
- every finite discrete function can be represented by a decision tree
- Outputs a single hypothesis (which one?)
- Can't play 20 questions...
- No backtracking
- Local minima...
- Statistically-based search choices
- Uses all available training samples
Slide 34:
- Note: H is the power set of instances X
- Unbiased?
- Preference for short trees, and for those with high-information-gain attributes near the root
- Bias is a preference for some hypotheses, rather than a restriction of the hypothesis space H
- Occam's razor: prefer the shortest hypothesis that fits the data
Slide 35: Occam's razor
- Prefer the shortest hypothesis that fits the data
- Occam, 1320
- While this idea is intuitive, it is difficult to prove formally.
- Support 1:
- Shorter hypotheses have better generalization ability
- Support 2:
- The number of short hypotheses is small, and therefore it is less likely to be a coincidence if the data fits a short hypothesis
- There may be counterarguments: there are other small sets of hypotheses, so why prefer short trees rather than those?
- Different internal representations may lead to hypotheses of different lengths
- We will consider an optimal encoding
Slide 36: Overfitting
Slide 37: Overfitting in Decision Trees
- Why overfitting?
- A model can become more complex than the true target function (concept) when it tries to satisfy noisy data as well.
- Definition of overfitting:
- A hypothesis is said to overfit the training data if there exists some other hypothesis that has larger error over the training data but smaller error over the entire distribution of instances.
Slide 38:
- Consider adding the following training example, which is incorrectly labeled as negative:
- Sky = Sunny, Temp = Hot, Humidity = Normal, Wind = Strong, PlayTennis = No
- Or consider the Oranges and Tangerines example with Size and Texture attributes, and the orange that is misclassified as a tangerine (I will add a figure later)
Slide 39: (no transcript)
Slide 40:
- ID3 will make a new split and will classify future examples following the new path as negative.
- The problem is due to overfitting the training data.
- Overfitting may result from:
- noise
- coincidental regularities in the training data
- What is the formal description of overfitting?
Slide 41: (no transcript)
Slide 42: Curse of Dimensionality - A Related Concept
- Imagine a learning task, such as recognizing printed characters.
- Intuitively, adding more attributes would help the learner, as more information never hurts, right?
- In fact, sometimes it does hurt, due to what is called the curse of dimensionality.
Slide 43: Curse of Dimensionality
Slide 44: Curse of Dimensionality
Polynomial curve fitting, M = 3
The number of independent coefficients grows proportionally to D^3, where D is the number of input variables. More generally, for an order-M polynomial it grows like D^M. The polynomial becomes unwieldy very quickly.
Slide 45: Polynomial Curve Fitting
Slide 46: Sum-of-Squares Error Function
Slide 47: 0th Order Polynomial
Slide 48: 1st Order Polynomial
Slide 49: 3rd Order Polynomial
Slide 50: 9th Order Polynomial
Slide 51: Over-fitting (Root-Mean-Square (RMS) Error)
Slide 52: Polynomial Coefficients
Slide 53: Data Set Size (9th Order Polynomial)
Slide 54: Data Set Size (9th Order Polynomial)
Slide 55: Regularization
- Penalize large coefficient values
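
A small numpy sketch (not from the slides) of the idea: fit a 9th-order polynomial to noisy samples of sin(2πx) by regularized least squares, where penalizing large coefficient values keeps the fit from oscillating wildly. The data and the λ value are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0, 1, 10)
    t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)   # noisy targets

    M = 9                                          # polynomial order
    Phi = np.vander(x, M + 1, increasing=True)     # design matrix of powers of x

    def fit(Phi, t, lam):
        # Regularized least squares: w = (Phi^T Phi + lam * I)^(-1) Phi^T t
        return np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ t)

    w_exact = np.linalg.solve(Phi, t)              # 9th-order polynomial that fits every point exactly
    w_regularized = fit(Phi, t, lam=1e-3)          # penalty on large coefficients tames them
    print(np.abs(w_exact).max(), np.abs(w_regularized).max())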
Slide 56: Regularization
Slide 57: Regularization
Slide 58: Regularization vs.
Slide 59: Polynomial Coefficients
Slide 60:
- Although the curse of dimensionality is an important issue, we can still find effective techniques applicable to high-dimensional spaces
- Real data will often be confined to a region of the space having lower effective dimensionality
- example of planar objects on a conveyor belt
- a 3-dimensional manifold within the high-dimensional picture pixel space
- Real data will typically exhibit smoothness properties
Slide 61: Back to Decision Trees
Slide 62: Overfitting in Decision Trees
Slide 63: Avoiding over-fitting the data
- How can we avoid overfitting? There are 2 approaches:
- stop growing the tree before it perfectly classifies the training data
- grow the full tree, then post-prune
- Reduced-error pruning
- Rule post-pruning
- The 2nd approach is found more useful in practice.
Slide 64:
- Whether we are pre- or post-pruning, the important question is how to select the best tree:
- Measure performance over a separate validation data set
- Measure performance over the training data
- apply a statistical test to see if expanding or pruning would produce an improvement beyond the training set (Quinlan, 1986)
- MDL: minimize size(tree) + size(misclassifications(tree))
Slide 65:
- MDL:
- length(h) + length(additional information to encode D given h)
- = length(h) + length(misclassifications)
- since we only need to send a message when a data sample is not in agreement with h; hence, only for the misclassifications.
Slide 66: Reduced-Error Pruning (Quinlan, 1987)
- Split data into training and validation sets
- Do until further pruning is harmful:
- 1. Evaluate the impact of pruning each possible node (plus those below it) on the validation set
- 2. Greedily remove the one that most improves validation set accuracy
- Produces the smallest version of the (most accurate) tree
- What if data is limited?
- We would not want to set aside a separate validation set.
Slide 67: Reduced-error pruning
- Examine each decision node to see if pruning it decreases the tree's performance over the evaluation data.
- Pruning here means replacing a subtree with a leaf labeled with the most common classification in the subtree (sketched in the code below).
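
A minimal Python sketch of this greedy loop (an illustration, not Quinlan's implementation), assuming a node format slightly richer than in the earlier sketches: each internal node also stores the most common training label among the examples that reached it, so a pruned subtree can be collapsed to that label.

    import copy

    # Assumed node format: {"attr": name, "children": {value: subtree}, "majority": label};
    # leaves are plain class labels.

    def classify(node, x):
        while isinstance(node, dict):
            child = node["children"].get(x[node["attr"]])
            if child is None:                       # unseen branch value: fall back to majority label
                return node["majority"]
            node = child
        return node

    def accuracy(tree, data):
        return sum(classify(tree, x) == y for x, y in data) / len(data)

    def internal_nodes(node, path=()):
        # Yield the path (sequence of branch values) leading to every internal decision node.
        if isinstance(node, dict):
            yield path
            for value, child in node["children"].items():
                yield from internal_nodes(child, path + (value,))

    def pruned_copy(tree, path):
        # Replace the subtree at `path` with a leaf carrying its majority training label.
        tree = copy.deepcopy(tree)
        if not path:
            return tree["majority"]
        node = tree
        for value in path[:-1]:
            node = node["children"][value]
        node["children"][path[-1]] = node["children"][path[-1]]["majority"]
        return tree

    def reduced_error_prune(tree, validation):
        # Greedily prune whichever node most improves validation accuracy; stop when pruning hurts.
        while isinstance(tree, dict):
            base = accuracy(tree, validation)
            best_path = max(internal_nodes(tree),
                            key=lambda p: accuracy(pruned_copy(tree, p), validation))
            if accuracy(pruned_copy(tree, best_path), validation) < base:
                break                               # further pruning is harmful
            tree = pruned_copy(tree, best_path)
        return tree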
Slide 68: Rule post-pruning
- Algorithm:
- Build a complete decision tree.
- Convert the tree to a set of rules, one per root-to-leaf path (see the sketch below).
- Prune each rule:
- Remove any precondition whose removal improves estimated accuracy.
- Sort the pruned rules by accuracy and use them in that order.
- Perhaps the most frequently used method (e.g., in C4.5)
- More details can be found at http://www2.cs.uregina.ca/hamilton/courses/831/notes/ml/dtrees/4_dtrees3.html
- (read only if interested; a presentation of advanced decision tree algorithms such as this may be added as part of a class project)
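
A small Python sketch of the conversion step only (pruning and sorting omitted), using the nested-dictionary tree format from the earlier sketches; the tree below is the PlayTennis tree and is included just to make the example runnable.

    def extract_rules(node, preconditions=()):
        # Each root-to-leaf path becomes one rule: a list of (attribute, value) tests plus a class label.
        if not isinstance(node, dict):
            return [(list(preconditions), node)]
        attribute = next(iter(node))
        rules = []
        for value, child in node[attribute].items():
            rules.extend(extract_rules(child, preconditions + ((attribute, value),)))
        return rules

    tree = {"Outlook": {"Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
                        "Overcast": "Yes",
                        "Rain": {"Wind": {"Weak": "Yes", "Strong": "No"}}}}

    for tests, label in extract_rules(tree):
        condition = " AND ".join(f"({a} = {v})" for a, v in tests)
        print(f"IF {condition} THEN PlayTennis = {label}")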
Slide 69: (no transcript)
Slide 70:
- IF (Outlook = Sunny) AND (Humidity = High)
- THEN PlayTennis = No
- IF (Outlook = Sunny) AND (Humidity = Normal)
- THEN PlayTennis = Yes
- ...
Slide 71: Rule Extraction from Trees
C4.5Rules (Quinlan, 1993)
Slide 72:
- Converting a decision tree to rules before pruning has three main advantages:
- Converting to rules allows distinguishing among the different contexts in which a decision node is used.
- Since each distinct path through the decision tree node produces a distinct rule, the pruning decision regarding that attribute test can be made differently for each path.
- In contrast, if the tree itself were pruned, the only two choices would be:
- Remove the decision node completely, or
- Retain it in its original form.
- Converting to rules removes the distinction between attribute tests that occur near the root of the tree and those that occur near the leaves.
- We thus avoid messy bookkeeping issues such as how to reorganize the tree if the root node is pruned while retaining part of the subtree below this test.
- Converting to rules improves readability.
- Rules are often easier for people to understand.
Slide 73: Rule Simplification Overview
- Eliminate unnecessary rule antecedents to simplify the rules.
- Construct contingency tables for each rule consisting of more than one antecedent.
- Rules with only one antecedent cannot be further simplified, so we only consider those with two or more.
- To simplify a rule, eliminate antecedents that have no effect on the conclusion reached by the rule.
- A conclusion's independence from an antecedent is verified using a test of independence (sketched in the code below), which is:
- a chi-square test if the expected cell frequencies are greater than 10.
- Yates' Correction for Continuity when the expected frequencies are between 5 and 10.
- Fisher's Exact Test for expected frequencies less than 5.
- Once individual rules have been simplified by eliminating redundant antecedents, simplify the entire set by eliminating unnecessary rules.
- Attempt to replace those rules that share the most common consequent by a default rule that is triggered when no other rule is triggered.
- In the event of a tie, use some heuristic tie breaker to choose a default rule.
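
A hedged Python sketch of that decision (assuming scipy is available; the contingency table is hypothetical): rows are antecedent satisfied / not satisfied, columns are the rule's class / other classes.

    from scipy.stats import chi2_contingency, fisher_exact

    table = [[20, 5],     # antecedent satisfied:     rule's class vs. other
             [6, 19]]     # antecedent not satisfied

    chi2, p_value, dof, expected = chi2_contingency(table, correction=False)
    if expected.min() > 10:
        print("plain chi-square test, p =", p_value)
    elif expected.min() >= 5:
        _, p_value, _, _ = chi2_contingency(table, correction=True)   # Yates' correction
        print("Yates-corrected chi-square, p =", p_value)
    else:
        _, p_value = fisher_exact(table)
        print("Fisher's exact test, p =", p_value)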
Slide 74: Continuous-Valued Attributes
- Create a discrete attribute to test a continuous one
- Temperature = 82.5
- (Temperature > 72.3) = t, f
- How to find the threshold?
- Temperature: 40 48 60 72 80 90
- PlayTennis:  No No Yes Yes Yes No
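
A small Python sketch of one common answer (candidate thresholds at midpoints where the class label changes, then pick the candidate with the highest information gain), using the six values above:

    import math

    temperature = [40, 48, 60, 72, 80, 90]
    play_tennis = ["No", "No", "Yes", "Yes", "Yes", "No"]

    def entropy(labels):
        total = len(labels)
        return -sum(labels.count(c) / total * math.log2(labels.count(c) / total)
                    for c in set(labels))

    def gain_for_threshold(values, labels, threshold):
        below = [y for v, y in zip(values, labels) if v <= threshold]
        above = [y for v, y in zip(values, labels) if v > threshold]
        remainder = (len(below) * entropy(below) + len(above) * entropy(above)) / len(labels)
        return entropy(labels) - remainder

    # Candidate thresholds: midpoints between adjacent values where the class changes.
    pairs = sorted(zip(temperature, play_tennis))
    candidates = [(pairs[i][0] + pairs[i + 1][0]) / 2
                  for i in range(len(pairs) - 1) if pairs[i][1] != pairs[i + 1][1]]
    print(candidates)                                                        # [54.0, 85.0]
    print(max(candidates, key=lambda t: gain_for_threshold(temperature, play_tennis, t)))  # 54.0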
Slide 75: Incorporating continuous-valued attributes
Continuous-valued attribute
Slide 76: Split Information?
- In each tree, the leaves contain samples of only one kind (e.g. 50+, 10+, 10-, etc.).
- Hence, the remaining entropy is 0 in each one.
- Which is better?
- In terms of information gain?
- In terms of gain ratio?
[Figure: two trees over the same 100 examples. Attribute A1 splits them into leaves of 50 positive and 50 negative examples; attribute A2 makes a many-way split into small pure leaves of 10 examples each (10 positive, 10 negative, ...).]
Slide 77: Attributes with Many Values
- One way to penalize such attributes is to use the following alternative measure:
- GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)
- SplitInformation(S, A) = -Σ_i (|S_i| / |S|) log2(|S_i| / |S|), i.e. the entropy of S with respect to the values of attribute A, determined experimentally from the training samples
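
A numerical sketch of how this measure separates the two trees on the previous slide (assuming, from the figure, that A1 produces two 50-example leaves and A2 ten 10-example leaves, and that both achieve a gain of 1 bit on a 50/50 class mix):

    import math

    def split_information(subset_sizes):
        total = sum(subset_sizes)
        return -sum(s / total * math.log2(s / total) for s in subset_sizes)

    def gain_ratio(gain, subset_sizes):
        return gain / split_information(subset_sizes)

    # Both splits leave pure leaves, so both have gain = 1 bit,
    # but A2's many-way split is penalized by its larger split information.
    print(gain_ratio(1.0, [50, 50]))      # A1: 1.0
    print(gain_ratio(1.0, [10] * 10))     # A2: about 0.30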
Slide 78: Handling training examples with missing attribute values
- What if an example x is missing the value of an attribute A?
- Simple solution:
- Use the most common value among examples at node n.
- Or use the most common value among examples at node n that have classification c(x).
- More complex, probabilistic approach:
- Assign a probability to each of the possible values of A based on the observed frequencies of the various values of A.
- Then propagate examples down the tree with these probabilities (see the sketch below).
- The same probabilities can be used in the classification of new instances (used in C4.5).
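
A tiny Python sketch of the probabilistic approach (the example counts are hypothetical): an example with a missing Humidity value is sent down each branch with a weight equal to the observed frequency of that value at the node.

    from collections import Counter

    def value_probabilities(examples_at_node, attribute):
        # Observed frequencies of each value of `attribute` among examples that reached this node.
        counts = Counter(x[attribute] for x, _ in examples_at_node if x.get(attribute) is not None)
        total = sum(counts.values())
        return {value: count / total for value, count in counts.items()}

    # Hypothetical node: 13 examples with Humidity known.
    examples = [({"Humidity": "High"}, "No")] * 7 + [({"Humidity": "Normal"}, "Yes")] * 6
    print(value_probabilities(examples, "Humidity"))
    # {'High': 0.538..., 'Normal': 0.461...}  -> an example missing Humidity counts as
    # about 0.54 of an example down the High branch and 0.46 down the Normal branch.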
Slide 79: Handling attributes with differing costs
- Sometimes some attribute values are more expensive or difficult to prepare.
- medical diagnosis: BloodTest has cost $150
- In practice, it may be desirable to postpone acquisition of such attribute values until they become necessary.
- To this purpose, one may modify the attribute selection measure to penalize expensive attributes.
- Tan and Schlimmer (1990)
- Nunez (1988)
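
The two cited cost-sensitive selection measures (as given in Mitchell's Chapter 3; the example numbers below are hypothetical), sketched in Python:

    def tan_schlimmer(gain, cost):
        # Gain^2(S, A) / Cost(A)                                    (Tan and Schlimmer, 1990)
        return gain ** 2 / cost

    def nunez(gain, cost, w=0.5):
        # (2^Gain(S, A) - 1) / (Cost(A) + 1)^w, w in [0, 1] weighting the cost  (Nunez, 1988)
        return (2 ** gain - 1) / (cost + 1) ** w

    # Hypothetical comparison: a cheap, mildly informative attribute vs. an expensive, informative one.
    print(tan_schlimmer(0.2, cost=1), tan_schlimmer(0.5, cost=150))
    print(nunez(0.2, cost=1), nunez(0.5, cost=150))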
Slide 80: C4.5
- By Ross Quinlan
- Latest code available at http://www.cse.unsw.edu.au/quinlan/
- How to use it?
- Download it
- Unpack it
- Make it (make all)
- Read the accompanying manual files
- groff -T ps c4.5.1 > c4.5.ps
- Use it:
- c4.5: tree generator
- c4.5rules: rule generator
- consult: use a generated tree to classify an instance
- consultr: use a generated set of rules to classify an instance
Slide 81: Model Selection in Trees
Slide 82: Strengths and Advantages of Decision Trees
- Rule extraction from trees
- A decision tree can be used for feature extraction (e.g. seeing which features are useful)
- Interpretability: human experts may verify and/or discover patterns
- It is a compact and fast classification method