Title: Decision Tree Learning
1. Decision Tree Learning
2. Outline
- Decision Tree Representation
- Decision Tree Learning
- Entropy, Information Gain
- Overfitting
3. Definition of Decision Trees
- A decision tree is a tree in which each internal node is associated with an attribute and each branch leaving a node is associated with one value of that attribute. Each path from the root to a leaf corresponds to a conjunction of attribute tests, and each leaf is labeled with a target value. A decision tree therefore represents a disjunction of conjunctions of constraints on the attribute values.
4. Computation in Decision Trees
- An instance is classified by starting at the root node of the decision tree, testing the attribute specified by that node, and moving down the branch corresponding to the instance's value for that attribute. The process repeats at the subtree rooted at the new node until a leaf is reached; the leaf's label is the predicted target value (see the sketch below).
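To make the traversal concrete, here is a minimal Python sketch; the nested-dict representation and the PlayTennis-style attribute names are illustrative assumptions, not taken from the slides.

```python
# A decision tree as nested dicts: an internal node maps an attribute name to
# {value: subtree}, and a leaf is just a class label. The attribute names are
# illustrative only.
tree = {
    "Outlook": {
        "Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain": {"Wind": {"Strong": "No", "Weak": "Yes"}},
    }
}

def classify(node, example):
    """Walk from the root to a leaf, following the branch that matches the
    example's value for the attribute tested at each node."""
    while isinstance(node, dict):
        attribute = next(iter(node))                 # attribute tested at this node
        node = node[attribute][example[attribute]]   # follow the matching branch
    return node                                      # leaf: the predicted class

print(classify(tree, {"Outlook": "Sunny", "Humidity": "Normal", "Wind": "Weak"}))  # -> "Yes"
```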
5. Overview of Decision Tree Learning
- How do we find (search for) a decision tree (hypothesis) that best fits a given set of training examples?
- Construct the decision tree from the root node by a greedy search process.
- At each node, select the attribute that best classifies the local training examples.
6. Decision Tree Learning Algorithm
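This slide presumably lists the ID3 pseudocode. As a hedged sketch of the greedy, top-down construction described above (all function and variable names are my own, not from the slides):

```python
from collections import Counter
from math import log2

def entropy(examples, target):
    """Entropy of the class labels in a set of examples."""
    counts = Counter(ex[target] for ex in examples)
    total = sum(counts.values())
    return -sum((n / total) * log2(n / total) for n in counts.values())

def information_gain(examples, target, attribute):
    """Reduction in entropy from partitioning the examples on `attribute`."""
    total = len(examples)
    remainder = 0.0
    for value in {ex[attribute] for ex in examples}:
        subset = [ex for ex in examples if ex[attribute] == value]
        remainder += len(subset) / total * entropy(subset, target)
    return entropy(examples, target) - remainder

def id3(examples, target, attributes):
    """Greedy top-down construction: choose the attribute with the highest
    information gain, split on it, and recurse on each partition."""
    labels = [ex[target] for ex in examples]
    if len(set(labels)) == 1:                       # all examples share one class: leaf
        return labels[0]
    if not attributes:                              # no attributes left: majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(examples, target, a))
    subtree = {best: {}}
    rest = [a for a in attributes if a != best]
    for value in {ex[best] for ex in examples}:
        subset = [ex for ex in examples if ex[best] == value]
        subtree[best][value] = id3(subset, target, rest)
    return subtree
```

Called on a list of attribute dicts, e.g. `id3(training_examples, "PlayTennis", ["Outlook", "Temperature", "Humidity", "Wind"])`, this would return a nested-dict tree like the one in the classification sketch above.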
7. How to Select the Best Attribute?
8. Training Examples
9. Entropy - 1
- A measure of the (im)purity of an arbitrary collection of examples.
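For a collection S containing examples from c classes, the standard definition used by ID3 is:

```latex
Entropy(S) = -\sum_{i=1}^{c} p_i \log_2 p_i
```

where p_i is the proportion of S belonging to class i (with 0 log2 0 taken to be 0).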
10. Entropy - 2
- Entropy specifies the expected minimum number of bits needed to encode an arbitrary message (here, the class of a randomly drawn example).
- Entropy can therefore be used to measure the information in an arbitrary message.
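For illustration (the 9-positive / 5-negative split below is an assumed example, not from the slides):

```python
from math import log2

# Entropy of a collection with 9 positive and 5 negative examples:
p_pos, p_neg = 9 / 14, 5 / 14
print(-p_pos * log2(p_pos) - p_neg * log2(p_neg))  # ~0.940 bits per example
# A pure collection (all one class) has entropy 0: no bits are needed to encode its labels.
```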
11. Change in Information
12. Information Gain
- The average reduction in entropy caused by partitioning the examples according to an attribute.
- Equivalently, the information provided about the target function value by knowing the value of attribute A (formalized below).
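In symbols, for an attribute A with values Values(A) and S_v the subset of S for which A = v:

```latex
Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|}\, Entropy(S_v)
```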
13. Information Gain Examples
14. Training Examples
15. ID3 Trace 1
16. ID3 Trace 2
17. Review of ID3
- The hypothesis space is the set of all finite discrete-valued functions.
- ID3 performs a simple-to-complex hill-climbing search through this hypothesis space.
- ID3 is susceptible to converging to a locally optimal solution.
18. Inductive Bias of ID3
- BFS-ID3
  - Shorter trees are preferred over longer trees.
- ID3
  - Shorter trees are likely to be preferred over longer trees.
  - Trees that place high information gain attributes close to the root are preferred over those that do not.
19. ID3 vs. Candidate-Elimination
- ID3
  - Complete hypothesis space
  - Incomplete search (suboptimal)
  - Inductive bias comes from the search order of hypotheses (preference bias, search bias)
- Candidate-Elimination
  - Incomplete hypothesis space
  - Complete search (version space)
  - Inductive bias comes from the restriction of the search space (restriction bias, language bias)
- Preference bias / restriction bias / hybrid
20. Why Shorter Trees?
- Occam's razor: prefer the simplest hypothesis that fits the data.
- There are fewer short hypotheses, so a short hypothesis that fits the data is less likely to do so by coincidence.
- A long hypothesis that fits the data might be a coincidence.
- Arguments opposed:
  - There are many ways to define a small set of hypotheses.
  - What's so special about small sets based on the size of the hypothesis?
21. Overfitting
22. Errors in the Training Examples
23. An Overfit Decision Tree
24. Insufficient Training Examples
25. An Overfit Decision Tree
26. Avoiding Overfitting
- Cross-validation
  - Split the data into a training set and a validation set.
  - Stop growing the tree when the error rate on the validation set increases, or
  - Overfit the data first, and then post-prune the tree.
27. Reduced-Error Pruning
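The slide presumably gives the procedure; as a minimal sketch (assuming the nested-dict tree representation from the earlier sketches, and that every attribute value seen in the validation set also appears in the tree), each subtree is replaced by the majority class of the validation examples reaching it whenever that does not lower validation accuracy:

```python
from collections import Counter

def classify(node, example):
    while isinstance(node, dict):
        attribute = next(iter(node))
        node = node[attribute][example[attribute]]
    return node

def accuracy(tree, examples, target):
    return sum(classify(tree, ex) == ex[target] for ex in examples) / len(examples)

def reduced_error_prune(tree, validation, target):
    """Bottom-up pass: replace a subtree with the majority class of the
    validation examples reaching it if that does not lower validation accuracy."""
    if not isinstance(tree, dict) or not validation:
        return tree
    attribute = next(iter(tree))
    for value, child in tree[attribute].items():
        subset = [ex for ex in validation if ex[attribute] == value]
        tree[attribute][value] = reduced_error_prune(child, subset, target)
    leaf = Counter(ex[target] for ex in validation).most_common(1)[0][0]
    if accuracy(leaf, validation, target) >= accuracy(tree, validation, target):
        return leaf           # pruning here does not hurt: keep the simpler tree
    return tree
```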
28. Rule Post-Pruning
29. Rule Post-Pruning Examples
30. Reduced-Error Pruning vs. Rule Post-Pruning
- Since each distinct path through the decision tree produces a distinct rule, the pruning decision regarding an attribute test can be made differently for each path in rule post-pruning (see the sketch below).
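To make the contrast concrete, a small sketch (again assuming the nested-dict representation) that converts each root-to-leaf path into its own rule, whose preconditions can then be pruned independently:

```python
def tree_to_rules(node, preconditions=()):
    """Each root-to-leaf path becomes one rule: (list of (attribute, value) tests, class label)."""
    if not isinstance(node, dict):
        return [(list(preconditions), node)]
    attribute = next(iter(node))
    rules = []
    for value, child in node[attribute].items():
        rules.extend(tree_to_rules(child, preconditions + ((attribute, value),)))
    return rules

# With the illustrative tree from the classification sketch, one entry would be
# ([("Outlook", "Sunny"), ("Humidity", "High")], "No").
```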
31. Continuous-Valued Attributes
- Dynamically define new discrete-valued attributes that partition the continuous attribute's value range into a discrete set of intervals, e.g. a boolean attribute that tests whether the value exceeds a chosen threshold (see the sketch below).
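One common scheme, as a sketch (the helper names are mine): sort the examples by the continuous value, place candidate thresholds midway between adjacent values whose labels differ, and keep the threshold with the highest information gain.

```python
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum(n / total * log2(n / total) for n in Counter(labels).values())

def best_threshold(values, labels):
    """Return (information gain, threshold) for the best boolean test `value > threshold`."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    candidates = []
    for (v1, l1), (v2, l2) in zip(pairs, pairs[1:]):
        if l1 == l2:
            continue                      # only class boundaries are worth testing
        threshold = (v1 + v2) / 2
        left = [l for v, l in pairs if v <= threshold]
        right = [l for v, l in pairs if v > threshold]
        gain = base - (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        candidates.append((gain, threshold))
    return max(candidates) if candidates else None

# e.g. best_threshold([40, 48, 60, 72, 80, 90], ["No", "No", "Yes", "Yes", "Yes", "No"])
```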
32. Gain Ratio
- Information gain is biased in favor of attributes with many values; the gain ratio penalizes such attributes (see below).
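The gain ratio divides the information gain by the entropy of the partition itself, which grows with the number of values:

```latex
SplitInformation(S, A) = -\sum_{i=1}^{c} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}
\qquad
GainRatio(S, A) = \frac{Gain(S, A)}{SplitInformation(S, A)}
```

where S_1, ..., S_c are the subsets produced by partitioning S on the c values of A.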
33. Missing Attribute Values
- Some attribute value of a training example <x, c(x)> at a node may be missing. Possible strategies:
  - Use the most common value of the attribute among the training examples at the node.
  - Use the most common value among the training examples at the node that have the same classification c(x).
  - Distribute the example fractionally over the branches according to the estimated distribution of the attribute's values (see the sketch below).
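A minimal sketch of the fractional-example strategy (representation and names are my own): an example whose value for the tested attribute is missing is sent down every branch with a weight proportional to how often each value occurs among the examples at that node.

```python
from collections import Counter

def split_with_missing(examples, attribute):
    """Distribute (features_dict, weight) examples over the attribute's values;
    an example with a missing value goes down every branch with a fractional weight."""
    counts = Counter(ex[attribute] for ex, _ in examples if attribute in ex)
    total = sum(counts.values())
    branches = {value: [] for value in counts}
    for ex, weight in examples:
        if attribute in ex:
            branches[ex[attribute]].append((ex, weight))
        else:
            for value, n in counts.items():
                branches[value].append((ex, weight * n / total))
    return branches
```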
34. Attributes with Differing Costs
- Low-cost attributes can be preferred by dividing the information gain by the cost of the attribute.
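That is, the selection criterion changes from Gain(S, A) to a cost-normalized score; the second form below (squaring the gain) is one variant that has been used to weight the gain more heavily than the cost:

```latex
\frac{Gain(S, A)}{Cost(A)}
\qquad\text{or}\qquad
\frac{Gain^{2}(S, A)}{Cost(A)}
```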
35. Summary of Decision Tree Learning
- Capable of learning disjunctive expressions: an expressive hypothesis space.
- Instances are nominal-valued vectors; can be extended to real-valued vectors.
- Target function is a boolean-valued output (binary classes); can be extended to n-ary classes.
- ID3 uses all training examples at each step to compute statistical properties such as information gain, so it is robust to noisy training data:
  - Less sensitive to errors in individual training examples
  - Can handle errors in classifications (target values)
  - Can handle errors in attribute values (input vectors)
  - Can handle missing attribute values in training examples