Title: CC282 Decision trees
1. CC282 Decision trees
2. Lecture 2 - Outline
- More ML principles
- Concept learning
- Hypothesis space
- Generalisation and overfitting
- Model (hypothesis) evaluation
- Inductive learning
- Inductive bias
- Decision trees
- ID3 algorithm (entropy, information gain)
3. Concept learning
- A concept, c, is the problem to be learned
- Example
  - A classification problem faced by an optician
  - Concept - whether or not to fit contact lenses, based on the user's budget, the user's eye condition, the user's environment, etc.
  - Inputs, x - user's budget, user's eye condition, user's environment
  - Output, y - to fit or not to fit
- A learning model is needed to learn a concept
- The learning model should ideally
  - Capture the training data, <x, y> -> descriptive ability
  - Generalise to unseen test data, <x_new, ?> -> predictive ability
  - Provide a plausible explanation of the learned concept, c -> explanatory ability
  - But descriptive and predictive abilities are generally considered sufficient
4. Learning a concept
- Concept learning
  - Given many examples - <input, output> pairs of what c does - find a function h that approximates c
  - The number of examples is usually a small subset of all possible <input, output> pairs
  - h is known as a hypothesis (i.e. the learning model)
- There may be a number of candidate hypotheses h - we select h from a hypothesis space H
- If the hypothesis matches the behaviour of the target concept for all training data, it is a consistent hypothesis
- Occam's razor
  - The simpler hypothesis that fits c is preferred
  - A simpler h means a shorter, smaller h
  - A simpler h is unlikely to fit the data by coincidence
- Learning = search in H for an appropriate h
  - Realisable task - H contains an h that fits the concept
  - Unrealisable task - H does not contain an h that fits the concept
5. More terms - generalisation, overfitting, induction, deduction
- Generalisation
  - The ability of the trained model to perform well on test data
- Overfitting
  - The model learns the training data well but performs poorly on the test data
- Inductive learning (induction)
  - Learning a hypothesis by example: the system tries to induce a general rule/model from a set of observed instances/samples
- Inductive bias
  - Since many choices of h exist in H, any preference for one hypothesis over another without prior knowledge is called a bias
  - Any hypothesis consistent with the training examples is likely to generalise to unseen examples - the trick is to find the right bias
- An unbiased learner
  - Can never generalise, so it is not practically useful
- Deduction
  - ML gives an output (prediction, classification, etc.) based on the previously acquired learning
6. Generalisation and overfitting example
- Assume we have inputs, x, and corresponding outputs, y, and we wish to find a concept, c, that maps x to y
- Examples of hypotheses
  - h1 will give good generalisation
  - h2 is overfitted
7. Model (hypothesis) evaluation
- We need some performance measure to estimate how well the model h approximates c, i.e. how good is h?
- Possible evaluation methods
  - Explanatory - gives a qualitative evaluation
  - Descriptive - gives a quantitative (numerical) evaluation
- Explanatory evaluation
  - Does the model provide a plausible description of the learned concept?
  - Classification - does it base its classification on plausible rules?
  - Association - does it discover plausible relationships in the data?
  - Clustering - does it come up with plausible clusters?
  - The meaning of "plausible" has to be defined by the human expert
  - Hence, not popular in ML
8. Descriptive evaluation
- Example: bowel cancer classification problem
  - True positives (TP) - diseased patients identified as having cancer
  - True negatives (TN) - healthy subjects identified as healthy
  - False negatives (FN) - the test identifies a cancer patient as healthy
  - False positives (FP) - the test identifies a healthy subject as having cancer
- Precision = TP / (TP + FP)
- Sensitivity (Recall) = TP / (TP + FN)
- F measure (balanced F-score) = 2 x Precision x Recall / (Precision + Recall)
- Simple classification accuracy = (TP + TN) / (TP + TN + FP + FN)
Source: Wikipedia
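These four formulas can be checked with a minimal Python sketch (the function name and the example counts below are illustrative, not from the slides):

```python
def classification_metrics(tp, tn, fp, fn):
    """Precision, recall (sensitivity), F measure and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                      # sensitivity
    f_measure = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f_measure, accuracy

# Hypothetical counts for a screening test on 100 subjects
print(classification_metrics(tp=40, tn=45, fp=5, fn=10))
# (0.888..., 0.8, 0.842..., 0.85)
```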
9. Descriptive evaluation (contd)
- For prediction problems, the mean square error (MSE) is used:
  MSE = (1/n) * sum over i of (d_i - a_i)^2
- where
  - d_i is the desired output in the data set
  - a_i is the actual output from the model
  - n is the number of instances in the data set
- If n = 2, d1 = 1.0, a1 = 0.5, d2 = 0, a2 = 1.0
  - MSE = ((1.0 - 0.5)^2 + (0 - 1.0)^2) / 2 = 1.25 / 2 = 0.625
- Sometimes, the root mean square error is used instead: RMSE = sqrt(MSE)
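A short Python check of this small example (the variable names are my own):

```python
d = [1.0, 0.0]    # desired outputs
a = [0.5, 1.0]    # actual outputs from the model
mse = sum((di - ai) ** 2 for di, ai in zip(d, a)) / len(d)
rmse = mse ** 0.5
print(mse, rmse)  # 0.625  0.7905...
```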
10. Decision trees (DT)
- A simple form of inductive learning
- Yet a successful form of learning algorithm
- Consider the example of deciding whether to play tennis
- Attributes (features)
  - Outlook, Temp, Humidity, Wind
- Values
  - Descriptions of the features
  - E.g. Outlook values - Sunny, Cloudy, Rainy
- Target
  - Play
  - Represents the output of the model
- Instances
  - Examples D1 to D14 of the data set
- Concept
  - Learn to decide whether to play tennis, i.e. find h from the given data set
Adapted from Mitchell, 1997
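For reference, here is the standard 14-example play-tennis data set from Mitchell (1997) as a Python list, with the value names Cloudy/Rainy used in this lecture instead of Mitchell's Overcast/Rain; the later code sketches assume this representation.

```python
play_tennis = [
    {'Day': 'D1',  'Outlook': 'Sunny',  'Temp': 'Hot',  'Humidity': 'High',   'Wind': 'Weak',   'Play': 'No'},
    {'Day': 'D2',  'Outlook': 'Sunny',  'Temp': 'Hot',  'Humidity': 'High',   'Wind': 'Strong', 'Play': 'No'},
    {'Day': 'D3',  'Outlook': 'Cloudy', 'Temp': 'Hot',  'Humidity': 'High',   'Wind': 'Weak',   'Play': 'Yes'},
    {'Day': 'D4',  'Outlook': 'Rainy',  'Temp': 'Mild', 'Humidity': 'High',   'Wind': 'Weak',   'Play': 'Yes'},
    {'Day': 'D5',  'Outlook': 'Rainy',  'Temp': 'Cool', 'Humidity': 'Normal', 'Wind': 'Weak',   'Play': 'Yes'},
    {'Day': 'D6',  'Outlook': 'Rainy',  'Temp': 'Cool', 'Humidity': 'Normal', 'Wind': 'Strong', 'Play': 'No'},
    {'Day': 'D7',  'Outlook': 'Cloudy', 'Temp': 'Cool', 'Humidity': 'Normal', 'Wind': 'Strong', 'Play': 'Yes'},
    {'Day': 'D8',  'Outlook': 'Sunny',  'Temp': 'Mild', 'Humidity': 'High',   'Wind': 'Weak',   'Play': 'No'},
    {'Day': 'D9',  'Outlook': 'Sunny',  'Temp': 'Cool', 'Humidity': 'Normal', 'Wind': 'Weak',   'Play': 'Yes'},
    {'Day': 'D10', 'Outlook': 'Rainy',  'Temp': 'Mild', 'Humidity': 'Normal', 'Wind': 'Weak',   'Play': 'Yes'},
    {'Day': 'D11', 'Outlook': 'Sunny',  'Temp': 'Mild', 'Humidity': 'Normal', 'Wind': 'Strong', 'Play': 'Yes'},
    {'Day': 'D12', 'Outlook': 'Cloudy', 'Temp': 'Mild', 'Humidity': 'High',   'Wind': 'Strong', 'Play': 'Yes'},
    {'Day': 'D13', 'Outlook': 'Cloudy', 'Temp': 'Hot',  'Humidity': 'Normal', 'Wind': 'Weak',   'Play': 'Yes'},
    {'Day': 'D14', 'Outlook': 'Rainy',  'Temp': 'Mild', 'Humidity': 'High',   'Wind': 'Strong', 'Play': 'No'},
]
```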
11. Decision trees (DT)
- A decision tree takes a set of properties as input and provides a decision as output
  - Each row of the table corresponds to a path in the tree
  - A decision tree may form a more compact representation, especially if many attributes are irrelevant
- DT could be considered as the learning method when
  - Instances are describable by attribute-value pairs
  - The target function is discrete valued (e.g. YES, NO)
  - The training data is possibly noisy
- It is not suitable (needs further adaptation)
  - When attribute values and/or the target are numerical values
    - E.g. attribute values Temp = 22 C, Windy = 25 mph; target function = 70, 30
- Some functions require an exponentially large decision tree, e.g. the parity function
12. Forming rules from a DT
- Example of the concept: should I play tennis today?
  - Takes inputs (a set of attributes)
  - Outputs a decision (say YES/NO)
- Each non-leaf node is an attribute
  - The first non-leaf node is the root node
- Each leaf node is either Yes or No
- Each link (branch) is labelled with the possible values of the associated attribute
- Rule formation
  - A decision tree can be expressed as a disjunction of conjunctions
  - PLAY tennis IF (Outlook = Sunny ∧ Humidity = Normal) ∨ (Outlook = Cloudy) ∨ (Outlook = Rainy ∧ Wind = Weak)
  - ∨ is the disjunction operator (OR)
  - ∧ is the conjunction operator (AND)
The corresponding decision tree:

Outlook
  Sunny -> Humidity
    High -> No
    Normal -> Yes
  Cloudy -> Yes
  Rainy -> Wind
    Strong -> No
    Weak -> Yes
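To make the disjunction-of-conjunctions reading concrete, the same tree can be written as a small Python function (a sketch; the function name is my own):

```python
def play_decision(outlook, humidity, wind):
    """The play-tennis tree above, written out as nested if-statements."""
    if outlook == 'Sunny':
        return 'Yes' if humidity == 'Normal' else 'No'
    if outlook == 'Cloudy':
        return 'Yes'
    return 'Yes' if wind == 'Weak' else 'No'    # Rainy branch

print(play_decision('Sunny', 'High', 'Weak'))   # No
```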
13. Another DT example
- Another example (from Lecture 1)
- Reading the tree on the right:
  - If parents visiting = yes, then go to the cinema; or
  - If parents visiting = no and weather = sunny, then play tennis; or
  - If parents visiting = no and weather = windy and money = rich, then go shopping; or
  - If parents visiting = no and weather = windy and money = poor, then go to the cinema; or
  - If parents visiting = no and weather = rainy, then stay in.
Source: http://wwwhomes.doc.ic.ac.uk/sgc/teaching/v231/lecture10.html
14. Obtaining a DT through top-down induction
- How can we obtain a DT?
  - Perform a top-down search through the space of possible decision trees
  - Determine the attribute that best classifies the training data
  - Use this attribute as the root of the tree
  - Repeat this process for each branch from left to right
  - Proceed to the next level and determine the next best feature
  - Repeat until a leaf is reached
- How do we choose the best attribute?
  - Choose the attribute that yields the most information (i.e. the attribute with the highest information gain)
15. Information gain
- Information gain -> a reduction of entropy, E
- But what is entropy?
  - In thermodynamics, it is the amount of energy that cannot be used to do work
  - Here it is measured in bits
  - A measure of disorder in a system (high entropy = high disorder)
- E(S) = - sum over i = 1..c of p_i log2(p_i)
- where
  - S is the training data set
  - c is the number of target classes
  - p_i is the proportion of examples in S belonging to target class i
- Note: if your calculator doesn't do log2, use log2(x) = 1.443 ln(x) or 3.322 log10(x). For better accuracy, use log2(x) = ln(x)/ln(2) or log2(x) = log10(x)/log10(2)
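A minimal Python sketch of this definition, applied to a list of class labels (the representation is my own choice, not from the slides):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """E(S) = -sum_i p_i * log2(p_i), in bits, for a list of class labels."""
    n = len(labels)
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())

print(entropy(['H'] * 50 + ['T'] * 50))   # 1.0 bit for a fair coin
print(entropy(['H'] * 99 + ['T'] * 1))    # about 0.08 bit for a heavily rigged coin
```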
16. Entropy example
- A coin is flipped
  - If the coin is fair -> 50% chance of heads
  - Now, let us rig the coin -> so that 99% of the time heads comes up
- Let's look at this in terms of entropy
  - Two outcomes: heads, tails
  - Probabilities: p_heads, p_tails
  - E(0.5, 0.5) = -0.5 log2(0.5) - 0.5 log2(0.5) = 1 bit
  - E(0.01, 0.99) = -0.01 log2(0.01) - 0.99 log2(0.99) = 0.08 bit
  - If the probability of heads = 1, then entropy = 0
  - E(0, 1.0) = -0 log2(0) - 1.0 log2(1.0) = 0 bit (taking 0 log2(0) = 0)
17. Information Gain
- Information gain, G, is defined as
  Gain(S, A) = E(S) - sum over v in Values(A) of (|Sv| / |S|) * E(Sv)
- where
  - Values(A) is the set of all possible values of attribute A
  - Sv is the subset of S for which A has value v
  - |S| is the size of S and |Sv| is the size of Sv
- The information gain is the expected reduction in entropy caused by knowing the value of attribute A
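The same definition as a Python sketch, building on the entropy() helper and the play_tennis list given earlier (the record-of-dicts representation is my own):

```python
def information_gain(rows, attribute, target='Play'):
    """Gain(S, A) = E(S) - sum over values v of (|Sv|/|S|) * E(Sv)."""
    gain = entropy([r[target] for r in rows])     # E(S)
    n = len(rows)
    for v in {r[attribute] for r in rows}:        # values of A observed in S
        subset = [r[target] for r in rows if r[attribute] == v]
        gain -= (len(subset) / n) * entropy(subset)
    return gain
```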
18. Example entropy calculation
- Compute the entropy of the play-tennis example
- We have two classes, YES and NO
- We have 14 instances, with 9 classified as YES and 5 as NO
- i.e. the number of classes, c = 2
- E_YES = -(9/14) log2(9/14) = 0.41
- E_NO = -(5/14) log2(5/14) = 0.53
- E(S) = E_YES + E_NO = 0.94
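This value can be checked with the helpers sketched earlier:

```python
labels = [r['Play'] for r in play_tennis]   # 9 x 'Yes', 5 x 'No'
print(round(entropy(labels), 2))            # 0.94
```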
19. Example information gain calculation
- Compute the information gain for the attribute Wind in the play-tennis data set
- |S| = 14
- Attribute Wind
  - Two values: Weak and Strong
  - |S_weak| = 8
  - |S_strong| = 6
20. Example information gain calculation
- Now, let us determine E(S_weak)
  - Instances = 8, YES = 6, NO = 2, i.e. [6+, 2-]
  - E(S_weak) = -(6/8) log2(6/8) - (2/8) log2(2/8) = 0.81
21. Example information gain calculation
- Now, let us determine E(S_strong)
  - Instances = 6, YES = 3, NO = 3, i.e. [3+, 3-]
  - E(S_strong) = -(3/6) log2(3/6) - (3/6) log2(3/6) = 1.0
- Note: do not waste time computing when p_YES = p_NO - the entropy is always 1.0
22. Example information gain calculation
- Going back to the information gain computation for the attribute Wind
  - Gain(S, Wind) = 0.94 - (8/14) x 0.81 - (6/14) x 1.00 = 0.048
23. Example information gain calculation
- Now, compute the information gain for the attribute Humidity in the play-tennis data set
- |S| = 14
- Attribute Humidity
  - Two values: High and Normal
  - |S_high| = 7
  - |S_normal| = 7
  - For value High -> [3+, 4-]
  - For value Normal -> [6+, 1-]
24. Example information gain calculation (contd)
- Continuing with the attribute Humidity
  - E(S_high) = -(3/7) log2(3/7) - (4/7) log2(4/7) = 0.98
  - E(S_normal) = -(6/7) log2(6/7) - (1/7) log2(1/7) = 0.59
  - Gain(S, Humidity) = 0.94 - (7/14) x 0.98 - (7/14) x 0.59 = 0.15
- So, Humidity provides a GREATER information gain than Wind
25. Example information gain calculation
- Now, compute the information gain for the attributes Outlook and Temperature in the play-tennis data set
  - Gain(S, Outlook) = 0.25
  - Gain(S, Temp) = 0.03
  - Gain(S, Humidity) = 0.15
  - Gain(S, Wind) = 0.048
- So, the attribute with the highest information gain is OUTLOOK; therefore use Outlook as the root node
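These four figures can be reproduced with the earlier sketches:

```python
for attr in ('Outlook', 'Temp', 'Humidity', 'Wind'):
    print(attr, round(information_gain(play_tennis, attr), 3))
# Outlook 0.247, Temp 0.029, Humidity 0.152, Wind 0.048 -> Outlook wins
```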
26. DT: next level
- After determining Outlook as the root node, we need to expand the tree
- The Sunny branch has 5 instances: [2+, 3-]
  - E(S_sunny) = -(2/5) log2(2/5) - (3/5) log2(3/5) = 0.97
27. DT: next level
- Gain(S_sunny, Humidity) = 0.97 - (3/5) x 0.0 - (2/5) x 0.0 = 0.97
- Gain(S_sunny, Wind) = 0.97 - (3/5) x 0.918 - (2/5) x 1.0 = 0.019
- Gain(S_sunny, Temperature) = 0.97 - (2/5) x 0.0 - (2/5) x 1.0 - (1/5) x 0.0 = 0.57
- The highest information gain is for Humidity, so use this attribute
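The same comparison on the Sunny subset, using the earlier sketches:

```python
sunny = [r for r in play_tennis if r['Outlook'] == 'Sunny']
for attr in ('Humidity', 'Wind', 'Temp'):
    print(attr, round(information_gain(sunny, attr), 3))
# Humidity 0.971, Wind 0.020, Temp 0.571 -> Humidity is chosen under the Sunny branch
```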
28. Continue ... and the final DT
- Continue until all the examples are classified
  - Compute Gain(S_rainy, Wind), Gain(S_rainy, Humidity) and Gain(S_rainy, Temp)
  - Gain(S_rainy, Wind) is the highest
- All leaf nodes are then associated with training examples from the same class (entropy = 0)
- The attribute Temperature is not used
29. ID3 algorithm pseudocode
30. ID3 algorithm pseudocode (Mitchell)
- From Mitchell (1997) - not important for the exam
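As a stand-in for the pseudocode, the following is a minimal Python sketch of the same idea, using the play_tennis list and the information_gain() helper from earlier (the names are my own; this is not Mitchell's exact pseudocode):

```python
from collections import Counter

def id3(rows, attributes, target='Play'):
    """Recursive ID3 sketch: returns a class label (leaf) or a nested dict
    of the form {attribute: {value: subtree, ...}}."""
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:                 # all examples in one class -> leaf
        return labels[0]
    if not attributes:                        # no attributes left -> majority class
        return Counter(labels).most_common(1)[0][0]
    # choose the attribute with the highest information gain
    best = max(attributes, key=lambda a: information_gain(rows, a, target))
    remaining = [a for a in attributes if a != best]
    tree = {best: {}}
    for value in {r[best] for r in rows}:     # one branch per observed value
        subset = [r for r in rows if r[best] == value]
        tree[best][value] = id3(subset, remaining, target)
    return tree

print(id3(play_tennis, ['Outlook', 'Temp', 'Humidity', 'Wind']))
# Reproduces the tree above: Outlook at the root, Humidity under Sunny,
# Wind under Rainy, and Temperature unused.
```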
31. Search strategy in ID3
- Complete hypothesis space: any finite discrete-valued function can be expressed
- Incomplete search: searches incompletely through the hypothesis space, stopping once the tree is consistent with the data
- Single hypothesis: only one current hypothesis (the simplest one) is maintained
- No backtracking: once an attribute is selected, this cannot be changed. Problem: the result might not be the (globally) optimal solution
- Full training set: attributes are selected by computing information gain on the full training set. Advantage: robustness to errors. Problem: non-incremental
32. Lecture 2 summary
- From this lecture, you should be able to
  - Define concept, learning model, hypothesis, hypothesis space, consistent hypothesis, inductive learning, inductive bias, realisable and unrealisable tasks, and Occam's razor in the context of ML
  - Differentiate between generalisation and overfitting
  - Define entropy and information gain and know how to calculate them for a given data set
  - Explain the ID3 algorithm, how it works, and describe it in pseudocode
  - Apply the ID3 algorithm to a given data set