Title: Decision Tree Learning
1. Decision Tree Learning
2. Outline
- Decision Tree Representation
- Decision Tree Learning
- Entropy, Information Gain
- Overfitting
3. Definition of Decision Trees
- A decision tree is a tree in which each internal node is associated with an attribute and each branch leaving a node is associated with one value of that attribute. Each path from the root to a leaf corresponds to a conjunction of attribute tests, and each leaf is labeled with a target value. A decision tree therefore represents a disjunction of conjunctions of constraints on the attribute values.
4. Computation in Decision Trees
- An instance is classified by starting at the root node of the decision tree, testing the attribute specified by that node, and moving down the branch corresponding to the instance's value for that attribute. The process repeats at the subtree rooted at the new node until a leaf is reached; the leaf's label is the predicted target value (see the sketch below).
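To make the traversal concrete, here is a minimal Python sketch; the nested-dict representation and the PlayTennis-style attribute names are illustrative assumptions, not taken from the slides.

```python
# A decision tree as nested dicts: an internal node maps an attribute name to
# {value: subtree}, and a leaf is just a class label. The attribute names are
# illustrative only.
tree = {
    "Outlook": {
        "Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain": {"Wind": {"Strong": "No", "Weak": "Yes"}},
    }
}

def classify(node, example):
    """Walk from the root to a leaf, following the branch that matches the
    example's value for the attribute tested at each node."""
    while isinstance(node, dict):
        attribute = next(iter(node))                 # attribute tested at this node
        node = node[attribute][example[attribute]]   # follow the matching branch
    return node                                      # leaf: the predicted class

print(classify(tree, {"Outlook": "Sunny", "Humidity": "Normal", "Wind": "Weak"}))  # -> "Yes"
```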
5. Overview of Decision Tree Learning
- How do we find (search for) a decision tree (hypothesis) that best fits a given set of training examples?
- Construct the decision tree from the root node by a greedy search process.
- At each node, select the attribute that best classifies the local training examples.
6. Decision Tree Learning Algorithm
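This slide presumably lists the ID3 pseudocode. As a hedged sketch of the greedy, top-down construction described above (all function and variable names are my own, not from the slides):

```python
from collections import Counter
from math import log2

def entropy(examples, target):
    """Entropy of the class labels in a set of examples."""
    counts = Counter(ex[target] for ex in examples)
    total = sum(counts.values())
    return -sum((n / total) * log2(n / total) for n in counts.values())

def information_gain(examples, target, attribute):
    """Reduction in entropy from partitioning the examples on `attribute`."""
    total = len(examples)
    remainder = 0.0
    for value in {ex[attribute] for ex in examples}:
        subset = [ex for ex in examples if ex[attribute] == value]
        remainder += len(subset) / total * entropy(subset, target)
    return entropy(examples, target) - remainder

def id3(examples, target, attributes):
    """Greedy top-down construction: choose the attribute with the highest
    information gain, split on it, and recurse on each partition."""
    labels = [ex[target] for ex in examples]
    if len(set(labels)) == 1:                       # all examples share one class: leaf
        return labels[0]
    if not attributes:                              # no attributes left: majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(examples, target, a))
    subtree = {best: {}}
    rest = [a for a in attributes if a != best]
    for value in {ex[best] for ex in examples}:
        subset = [ex for ex in examples if ex[best] == value]
        subtree[best][value] = id3(subset, target, rest)
    return subtree
```

Called on a list of attribute dicts, e.g. `id3(training_examples, "PlayTennis", ["Outlook", "Temperature", "Humidity", "Wind"])`, this would return a nested-dict tree like the one in the classification sketch above.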
7. How to Select the Best Attribute?
8. Training Examples
9. Entropy - 1
- A measure of the (im)purity of an arbitrary collection of examples.
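For a collection S containing examples from c classes, the standard definition used by ID3 is:

```latex
Entropy(S) = -\sum_{i=1}^{c} p_i \log_2 p_i
```

where p_i is the proportion of S belonging to class i (with 0 log2 0 taken to be 0).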
10. Entropy - 2
- Entropy specifies the expected minimum number of bits needed to encode an arbitrary message (here, the class of a randomly drawn example).
- Entropy can therefore be used to measure the information in an arbitrary message.
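For illustration (the 9-positive / 5-negative split below is an assumed example, not from the slides):

```python
from math import log2

# Entropy of a collection with 9 positive and 5 negative examples:
p_pos, p_neg = 9 / 14, 5 / 14
print(-p_pos * log2(p_pos) - p_neg * log2(p_neg))  # ~0.940 bits per example
# A pure collection (all one class) has entropy 0: no bits are needed to encode its labels.
```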
11. Change in Information
12. Information Gain
- The average reduction in entropy caused by partitioning the examples according to an attribute.
- Equivalently, the information provided about the target function value by knowing the value of attribute A (formalized below).
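In symbols, for an attribute A with values Values(A) and S_v the subset of S for which A = v:

```latex
Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|}\, Entropy(S_v)
```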
13. Information Gain Examples
14. Training Examples
15. ID3 Trace 1
16. ID3 Trace 2
17. Review of ID3
- The hypothesis space is the set of all finite discrete-valued functions.
- ID3 performs a simple-to-complex hill-climbing search through this hypothesis space.
- ID3 is susceptible to converging to a locally optimal solution.
18. Inductive Bias of ID3
- BFS-ID3
  - Shorter trees are preferred over longer trees.
- ID3
  - Shorter trees are likely to be preferred over longer trees.
  - Trees that place high information gain attributes close to the root are preferred over those that do not.
19. ID3 vs. Candidate-Elimination
- ID3
  - Complete hypothesis space
  - Incomplete search (suboptimal)
  - Inductive bias comes from the search order of hypotheses (preference bias, search bias)
- Candidate-Elimination
  - Incomplete hypothesis space
  - Complete search (version space)
  - Inductive bias comes from the restriction of the search space (restriction bias, language bias)
- Preference bias / restriction bias / hybrid
20. Why Shorter Trees?
- Occam's razor: prefer the simplest hypothesis that fits the data.
- There are fewer short hypotheses, so a short hypothesis that fits the data is less likely to do so by coincidence.
- A long hypothesis that fits the data might be a coincidence.
- Arguments opposed:
  - There are many ways to define a small set of hypotheses.
  - What's so special about small sets based on the size of the hypothesis?
21. Overfitting
22. Errors in the Training Examples
23. An Overfit Decision Tree
24. Insufficient Training Examples
25. An Overfit Decision Tree
26. Avoiding Overfitting
- Cross-validation
  - Split the data into a training set and a validation set.
  - Stop growing the tree when the error rate on the validation set increases, or
  - Overfit the data first, and then post-prune the tree.
27. Reduced-Error Pruning
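The slide presumably gives the procedure; as a minimal sketch (assuming the nested-dict tree representation from the earlier sketches, and that every attribute value seen in the validation set also appears in the tree), each subtree is replaced by the majority class of the validation examples reaching it whenever that does not lower validation accuracy:

```python
from collections import Counter

def classify(node, example):
    while isinstance(node, dict):
        attribute = next(iter(node))
        node = node[attribute][example[attribute]]
    return node

def accuracy(tree, examples, target):
    return sum(classify(tree, ex) == ex[target] for ex in examples) / len(examples)

def reduced_error_prune(tree, validation, target):
    """Bottom-up pass: replace a subtree with the majority class of the
    validation examples reaching it if that does not lower validation accuracy."""
    if not isinstance(tree, dict) or not validation:
        return tree
    attribute = next(iter(tree))
    for value, child in tree[attribute].items():
        subset = [ex for ex in validation if ex[attribute] == value]
        tree[attribute][value] = reduced_error_prune(child, subset, target)
    leaf = Counter(ex[target] for ex in validation).most_common(1)[0][0]
    if accuracy(leaf, validation, target) >= accuracy(tree, validation, target):
        return leaf           # pruning here does not hurt: keep the simpler tree
    return tree
```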
28. Rule Post-Pruning
29. Rule Post-Pruning Examples
30. Reduced-Error Pruning vs. Rule Post-Pruning
- Since each distinct path through the decision tree produces a distinct rule, the pruning decision regarding an attribute test can be made differently for each path in rule post-pruning (see the sketch below).
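To make the contrast concrete, a small sketch (again assuming the nested-dict representation) that converts each root-to-leaf path into its own rule, whose preconditions can then be pruned independently:

```python
def tree_to_rules(node, preconditions=()):
    """Each root-to-leaf path becomes one rule: (list of (attribute, value) tests, class label)."""
    if not isinstance(node, dict):
        return [(list(preconditions), node)]
    attribute = next(iter(node))
    rules = []
    for value, child in node[attribute].items():
        rules.extend(tree_to_rules(child, preconditions + ((attribute, value),)))
    return rules

# With the illustrative tree from the classification sketch, one entry would be
# ([("Outlook", "Sunny"), ("Humidity", "High")], "No").
```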
31. Continuous-Valued Attributes
- Dynamically define new discrete-valued attributes that partition the continuous attribute's value range into a discrete set of intervals, e.g. a boolean attribute that tests whether the value exceeds a chosen threshold (see the sketch below).
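One common scheme, as a sketch (the helper names are mine): sort the examples by the continuous value, place candidate thresholds midway between adjacent values whose labels differ, and keep the threshold with the highest information gain.

```python
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum(n / total * log2(n / total) for n in Counter(labels).values())

def best_threshold(values, labels):
    """Return (information gain, threshold) for the best boolean test `value > threshold`."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    candidates = []
    for (v1, l1), (v2, l2) in zip(pairs, pairs[1:]):
        if l1 == l2:
            continue                      # only class boundaries are worth testing
        threshold = (v1 + v2) / 2
        left = [l for v, l in pairs if v <= threshold]
        right = [l for v, l in pairs if v > threshold]
        gain = base - (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        candidates.append((gain, threshold))
    return max(candidates) if candidates else None

# e.g. best_threshold([40, 48, 60, 72, 80, 90], ["No", "No", "Yes", "Yes", "Yes", "No"])
```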
32. Gain Ratio
- Information gain is biased in favor of attributes with many values; the gain ratio penalizes such attributes (see below).
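The gain ratio divides the information gain by the entropy of the partition itself, which grows with the number of values:

```latex
SplitInformation(S, A) = -\sum_{i=1}^{c} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}
\qquad
GainRatio(S, A) = \frac{Gain(S, A)}{SplitInformation(S, A)}
```

where S_1, ..., S_c are the subsets produced by partitioning S on the c values of A.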
33. Missing Attribute Values
- Some attribute value of a training example <x, c(x)> at a node may be missing. Possible strategies:
  - Use the most common value of the attribute among the training examples at the node.
  - Use the most common value among the training examples at the node that have the same classification c(x).
  - Distribute the example fractionally over the branches according to the estimated distribution of the attribute's values (see the sketch below).
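A minimal sketch of the fractional-example strategy (representation and names are my own): an example whose value for the tested attribute is missing is sent down every branch with a weight proportional to how often each value occurs among the examples at that node.

```python
from collections import Counter

def split_with_missing(examples, attribute):
    """Distribute (features_dict, weight) examples over the attribute's values;
    an example with a missing value goes down every branch with a fractional weight."""
    counts = Counter(ex[attribute] for ex, _ in examples if attribute in ex)
    total = sum(counts.values())
    branches = {value: [] for value in counts}
    for ex, weight in examples:
        if attribute in ex:
            branches[ex[attribute]].append((ex, weight))
        else:
            for value, n in counts.items():
                branches[value].append((ex, weight * n / total))
    return branches
```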
34. Attributes with Differing Costs
- Low-cost attributes can be preferred by dividing the information gain by the cost of the attribute.
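That is, the selection criterion changes from Gain(S, A) to a cost-normalized score; the second form below (squaring the gain) is one variant that has been used to weight the gain more heavily than the cost:

```latex
\frac{Gain(S, A)}{Cost(A)}
\qquad\text{or}\qquad
\frac{Gain^{2}(S, A)}{Cost(A)}
```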
35. Summary of Decision Tree Learning
- Capable of learning disjunctive expressions: an expressive hypothesis space.
- Instances are nominal-valued vectors; can be extended to real-valued vectors.
- Target function is a boolean-valued output (binary classes); can be extended to n-ary classes.
- ID3 uses all training examples at each step to compute statistical properties such as information gain, so it is robust to noisy training data:
  - Less sensitive to errors in individual training examples
  - Can handle errors in classifications (target values)
  - Can handle errors in attribute values (input vectors)
  - Can handle missing attribute values in training examples