Title: Decision Trees
1 Decision Trees
3 General Learning Task
- DEFINE
  - Set X of instances (of n-tuples x = <x1, ..., xn>)
    - E.g., days described by attributes (or features):
      Sky, Temp, Humidity, Wind, Water, Forecast
  - Target function y, e.g.
    - EnjoySport: X → Y = {0, 1} (an example of concept learning)
    - WhichSport: X → Y = {Tennis, Soccer, Volleyball}
    - InchesOfRain: X → Y = [0, 10]
- GIVEN
  - Training examples D
    - positive and negative examples of the target function, <x, y(x)>
- FIND
  - A hypothesis h such that h(x) approximates y(x).
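As a concrete instance of this setup, here is a minimal Python sketch of EnjoySport training data in the <x, y(x)> form above; the specific attribute values are illustrative, Mitchell-style examples, not taken from these slides.

```python
# Minimal sketch of the EnjoySport setup (attribute values are illustrative;
# they are not from these slides).

ATTRIBUTES = ["Sky", "Temp", "Humidity", "Wind", "Water", "Forecast"]

# Training examples D: pairs <x, y(x)> with y(x) in {0, 1}
D = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), 1),
    (("Sunny", "Warm", "High",   "Strong", "Warm", "Same"), 1),
    (("Rainy", "Cold", "High",   "Strong", "Warm", "Change"), 0),
    (("Sunny", "Warm", "High",   "Strong", "Cool", "Change"), 1),
]

def training_accuracy(h, examples):
    """Fraction of examples on which hypothesis h(x) agrees with y(x)."""
    return sum(h(x) == y for x, y in examples) / len(examples)
```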
4 Hypothesis Spaces
- Hypothesis space H is a subset of all functions y: X → Y, e.g.
  - MC2: conjunctions of literals, e.g. <Sunny, ?, ?, Strong, ?, Same> (sketched below)
  - Decision trees: any function
  - 2-level decision trees: any function of two attributes, some of three
- Candidate-Elimination Algorithm
  - Searches H for a hypothesis that matches the training data
  - Exploits the general-to-specific ordering of hypotheses
- Decision Trees
  - Incrementally grow the tree by splitting training examples on attribute values
  - Can be thought of as looping for i = 1, ..., n:
    - Search Hi = {i-level trees} for a hypothesis h that matches the data
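A rough sketch of how an MC2-style conjunction of literals can be represented and evaluated, assuming the tuple-with-wildcards form suggested by <Sunny, ?, ?, Strong, ?, Same>; the function names are my own.

```python
# Sketch of an MC2-style hypothesis: a conjunction of literals written as a
# tuple with one entry per attribute, where "?" accepts any value
# (representation assumed from the <Sunny, ?, ?, Strong, ?, Same> example).

def conjunction(literals):
    """Turn a literal tuple into a boolean hypothesis over instances."""
    def h(x):
        return int(all(l == "?" or l == xi for l, xi in zip(literals, x)))
    return h

h = conjunction(("Sunny", "?", "?", "Strong", "?", "Same"))
print(h(("Sunny", "Warm", "Normal", "Strong", "Warm", "Same")))   # 1
print(h(("Rainy", "Cold", "High", "Strong", "Warm", "Change")))   # 0
```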
17 Decision Trees represent disjunctions of conjunctions
(Sunny ∧ Normal) ∨ Overcast ∨ (Rain ∧ Weak)
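Read as a rule, that disjunction can be written down directly; a small sketch, assuming the usual PlayTennis attribute names (Outlook, Humidity, Wind), which the slide itself does not show.

```python
# The disjunction above written as a rule, assuming PlayTennis-style
# attribute names; the slide shows only the attribute values.

def play_tennis(outlook, humidity, wind):
    return ((outlook == "Sunny" and humidity == "Normal")
            or outlook == "Overcast"
            or (outlook == "Rain" and wind == "Weak"))

print(play_tennis("Sunny", "Normal", "Strong"))  # True (first conjunct)
print(play_tennis("Rain", "High", "Strong"))     # False (no conjunct matches)
```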
18 Decision Trees vs. MC2
MC2 can't represent (Sunny ∨ Cloudy): MC2 hypotheses must constrain each attribute to a single value, if at all. Decision trees can:
[Decision tree figure; leaf labels: Yes, Yes, No]
20 Learning Parity with D-Trees
- How to solve 2-bit parity:
  - Two-step look-ahead
  - Split on pairs of attributes at once
- For k attributes, why not just do k-step look-ahead? Or split on k attribute values?
- ⇒ Parity functions are the victims of the decision tree's inductive bias.
22 I(Y; xi)
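This fragment is presumably the information-gain quantity for splitting on attribute xi, I(Y; xi) = H(Y) − H(Y | xi). A minimal sketch of that computation, applied to the 2-bit parity case from slide 20 to show why every single-attribute split looks worthless there; the helper names are my own.

```python
# Sketch of the information gain I(Y; xi) = H(Y) - H(Y | xi) for splitting
# on attribute index i, demonstrated on 2-bit parity.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(examples, i):
    """examples: list of (x, y); i: attribute index to split on."""
    labels = [y for _, y in examples]
    gain = entropy(labels)
    for v in set(x[i] for x, _ in examples):
        subset = [y for x, y in examples if x[i] == v]
        gain -= len(subset) / len(examples) * entropy(subset)
    return gain

# 2-bit parity (XOR): each attribute alone has zero gain, even though the
# two attributes together determine the label exactly.
xor = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
print(info_gain(xor, 0), info_gain(xor, 1))  # 0.0 0.0
```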
26 Overfitting is due to noise
- Sources of noise
  - Erroneous training data
    - Concept variable incorrect (annotator error)
    - Attributes mis-measured
- Much more significant
  - Irrelevant attributes
  - Target function not deterministic in the attributes
27 Irrelevant attributes
- If many attributes are noisy, information gains can be spurious, e.g.
  - 20 noisy attributes
  - 10 training examples
  - ⇒ Expected number of depth-3 trees that split the training data perfectly using only noisy attributes ≈ 13.4
- Potential solution: statistical significance tests (e.g., chi-square), sketched below
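A sketch of the chi-square idea for guarding against spurious splits, assuming binary class labels and using scipy's chi2_contingency for the test itself; the significance threshold and table layout are illustrative choices, not from the slides.

```python
# Sketch of a chi-square pre-pruning test: accept a split only if the
# association between the attribute and the class label is statistically
# significant (alpha is an illustrative choice).
from scipy.stats import chi2_contingency

def split_is_significant(examples, i, alpha=0.05):
    """examples: list of (x, y) with y in {0, 1}; i: candidate split attribute."""
    values = sorted(set(x[i] for x, _ in examples))
    # Contingency table: rows = attribute values, columns = class labels.
    # (Degenerate tables with an all-zero row or column would need special-casing.)
    table = [[sum(1 for x, y in examples if x[i] == v and y == label)
              for label in (0, 1)]
             for v in values]
    chi2, p_value, dof, expected = chi2_contingency(table)
    return p_value < alpha  # reject "this split is pure chance"
```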
28 Non-determinism
- In general
  - We can't measure all the variables we need to do perfect prediction.
  - ⇒ The target function is not uniquely determined by the attribute values.
29 Non-determinism Example
Decent hypothesis:
  Humidity > 0.70 → No
  Otherwise → Yes
Overfit hypothesis:
  Humidity > 0.89 → No
  0.80 < Humidity < 0.89 → Yes
  0.70 < Humidity < 0.80 → No
  Humidity < 0.70 → Yes
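The same two hypotheses written as code, with 1 = Yes and 0 = No; the slide uses strict inequalities, so how the exact boundary values (0.70, 0.80, 0.89) are classified is an assumption here.

```python
# The decent and overfit hypotheses from this slide as threshold rules
# (1 = Yes, 0 = No; boundary handling is an assumption).

def decent_hypothesis(humidity):
    return 0 if humidity > 0.70 else 1

def overfit_hypothesis(humidity):
    if humidity > 0.89:
        return 0   # No
    elif humidity > 0.80:
        return 1   # Yes
    elif humidity > 0.70:
        return 0   # No
    else:
        return 1   # Yes

print(decent_hypothesis(0.75), overfit_hypothesis(0.75))  # 0 0
print(decent_hypothesis(0.85), overfit_hypothesis(0.85))  # 0 1
```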
30 Rule 2 of Machine Learning
- The best hypothesis almost never achieves 100% accuracy on the training data.
- (Rule 1 was: you can't learn anything without inductive bias.)
39 Hypothesis Space comparisons
Task: concept learning with k binary attributes (standard counts sketched below)
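The standard hypothesis-space counts for this task can be computed directly; the specific rows compared below are my assumption, since the slide's own table is not reproduced above.

```python
# Standard counts for concept learning with k binary attributes
# (the exact rows the slide compared are an assumption).

def hypothesis_space_sizes(k):
    return {
        "distinct instances": 2 ** k,
        "all concepts y: X -> {0,1} (e.g., decision trees)": 2 ** (2 ** k),
        "conjunctions of literals (each attribute: 0, 1, or '?')": 3 ** k + 1,
    }

for name, size in hypothesis_space_sizes(6).items():
    print(f"{name}: {size}")
```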
40 Decision Trees: Strengths
- Very Popular Technique
- Fast
- Useful when
- Instances are attribute-value pairs
- Target Function is discrete
- Concepts are likely to be disjunctions
- Attributes may be noisy
41 Decision Trees: Weaknesses
- Less useful for continuous outputs
- Can have difficulty with continuous input features as well
  - E.g., what if your target concept is a circle in the x1, x2 plane?
    - Hard to represent with decision trees
    - Very simple with the instance-based methods we'll discuss later (sketched below)
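A small sketch of the circle example: the target concept below depends on x1² + x2² jointly, which axis-parallel splits can only approximate with a staircase of thresholds, while a simple instance-based rule (1-nearest-neighbor here) fits it naturally. The radius and the grid of training points are illustrative.

```python
# Sketch of the circle concept vs. an instance-based learner.

def circle_concept(x1, x2, r=1.0):
    return x1 ** 2 + x2 ** 2 <= r ** 2

def one_nearest_neighbor(query, training):
    """training: list of ((x1, x2), label); return the label of the closest point."""
    def dist2(example):
        (p1, p2), _ = example
        return (p1 - query[0]) ** 2 + (p2 - query[1]) ** 2
    return min(training, key=dist2)[1]

training = [((i / 4, j / 4), circle_concept(i / 4, j / 4))
            for i in range(-8, 9) for j in range(-8, 9)]
print(one_nearest_neighbor((0.3, 0.2), training))  # True: inside the circle
print(one_nearest_neighbor((1.5, 1.5), training))  # False: outside
```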
43 A decision tree learning algorithm along the lines of ID3
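A minimal sketch of an ID3-style learner under standard assumptions: greedy splits on maximum information gain, categorical attributes, no pruning. It reuses the same entropy and gain helpers as the earlier sketch and is not necessarily the exact variant the slide presented.

```python
# Minimal ID3-style learner: split on the attribute with the highest
# information gain until nodes are pure (categorical attributes, no pruning).
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(examples, i):
    labels = [y for _, y in examples]
    gain = entropy(labels)
    for v in set(x[i] for x, _ in examples):
        subset = [y for x, y in examples if x[i] == v]
        gain -= len(subset) / len(examples) * entropy(subset)
    return gain

def id3(examples, attrs):
    labels = [y for _, y in examples]
    if len(set(labels)) == 1:                      # pure node -> leaf
        return labels[0]
    if not attrs:                                  # out of attributes -> majority leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(examples, a))
    return {(best, v): id3([(x, y) for x, y in examples if x[best] == v],
                           [a for a in attrs if a != best])
            for v in set(x[best] for x, _ in examples)}

# Demo on the EnjoySport-style examples from the first sketch:
D = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), 1),
    (("Sunny", "Warm", "High",   "Strong", "Warm", "Same"), 1),
    (("Rainy", "Cold", "High",   "Strong", "Warm", "Change"), 0),
    (("Sunny", "Warm", "High",   "Strong", "Cool", "Change"), 1),
]
print(id3(D, list(range(6))))  # e.g., {(0, 'Sunny'): 1, (0, 'Rainy'): 0}
```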