Title: CC282 Decision trees
1. CC282 Decision trees
2. Lecture 2 - Outline
- More ML principles
- Concept learning
- Hypothesis space
- Generalisation and overfitting
- Model (hypothesis) evaluation
- Inductive learning
- Inductive bias
- Decision trees
- ID3 algorithm (entropy, information gain)
3. Concept learning
- A concept, c, is the problem to be learned
- Example
  - A classification problem faced by an optician
  - Concept - whether or not to fit contact lenses, based on the user's budget, the user's eye condition, the user's environment, etc.
  - Inputs, x - user's budget, user's eye condition, user's environment
  - Output, y - to fit or not to fit
- A learning model is needed to learn a concept
- The learning model should ideally
  - Capture the training data, <x, y> -> descriptive ability
  - Generalise to unseen test data, <x_new, ?> -> predictive ability
  - Provide a plausible explanation of the learned concept, c -> explanatory ability
  - But descriptive and predictive abilities are generally considered sufficient
4. Learning a concept
- Concept learning
  - Given many examples - <input, output> pairs of what c does - find a function h that approximates c
  - The number of examples is usually a small subset of all possible <input, output> pairs
  - h is known as a hypothesis (i.e. the learning model)
- There may be a number of candidate hypotheses h - we select h from a hypothesis space H
- If the hypothesis matches the behaviour of the target concept for all training data, it is a consistent hypothesis
- Occam's razor
  - The simpler hypothesis that fits c is preferred
  - A simpler h means a shorter, smaller h
  - A simpler h is unlikely to fit the data by coincidence
- Learning = search in H for an appropriate h
  - Realisable task - H contains an h that fits the concept
  - Unrealisable task - H does not contain an h that fits the concept
5. More terms - generalisation, overfitting, induction, deduction
- Generalisation
  - The ability of the trained model to perform well on test data
- Overfitting
  - The model learns the training data well but performs poorly on the test data
- Inductive learning (induction)
  - Learning a hypothesis by example: the system tries to induce a general rule/model from a set of observed instances/samples
- Inductive bias
  - Since many choices of h exist in H, any preference for one hypothesis over another without prior knowledge is called a bias
  - Any hypothesis consistent with the training examples is likely to generalise to unseen examples - the trick is to find the right bias
- An unbiased learner
  - Can never generalise, so it is not practically useful
- Deduction
  - ML gives an output (prediction, classification, etc.) based on the previously acquired learning
6. Generalisation and overfitting example
- Assume we have inputs, x, and corresponding outputs, y, and we wish to find a concept, c, that maps x to y
- Examples of hypotheses
  - h1 will give good generalisation
  - h2 is overfitted
7. Model (hypothesis) evaluation
- We need some performance measure to estimate how well the model h approximates c, i.e. how good is h?
- Possible evaluation methods
  - Explanatory - gives a qualitative evaluation
  - Descriptive - gives a quantitative (numerical) evaluation
- Explanatory evaluation
  - Does the model provide a plausible description of the learned concept?
  - Classification - does it base its classification on plausible rules?
  - Association - does it discover plausible relationships in the data?
  - Clustering - does it come up with plausible clusters?
  - The meaning of "plausible" has to be defined by the human expert
  - Hence, not popular in ML
8. Descriptive evaluation
- Example: bowel cancer classification problem
  - True positives (TP) - diseased patients identified as having cancer
  - True negatives (TN) - healthy subjects identified as healthy
  - False negatives (FN) - the test identifies a cancer patient as healthy
  - False positives (FP) - the test identifies a healthy subject as having cancer
- Precision = TP / (TP + FP)
- Sensitivity (Recall) = TP / (TP + FN)
- F measure (balanced F-score) = 2 x Precision x Recall / (Precision + Recall)
- Simple classification accuracy = (TP + TN) / (TP + TN + FP + FN)
Source: Wikipedia
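These four formulas can be checked with a minimal Python sketch (the function name and the example counts below are illustrative, not from the slides):

```python
def classification_metrics(tp, tn, fp, fn):
    """Precision, recall (sensitivity), F measure and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                      # sensitivity
    f_measure = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f_measure, accuracy

# Hypothetical counts for a screening test on 100 subjects
print(classification_metrics(tp=40, tn=45, fp=5, fn=10))
# (0.888..., 0.8, 0.842..., 0.85)
```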
9. Descriptive evaluation (contd)
- For prediction problems, the mean square error (MSE) is used:
  MSE = (1/n) * sum over i of (d_i - a_i)^2
- where
  - d_i is the desired output in the data set
  - a_i is the actual output from the model
  - n is the number of instances in the data set
- If n = 2, d1 = 1.0, a1 = 0.5, d2 = 0, a2 = 1.0
  - MSE = ((1.0 - 0.5)^2 + (0 - 1.0)^2) / 2 = 1.25 / 2 = 0.625
- Sometimes, the root mean square error is used instead: RMSE = sqrt(MSE)
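A short Python check of this small example (the variable names are my own):

```python
d = [1.0, 0.0]    # desired outputs
a = [0.5, 1.0]    # actual outputs from the model
mse = sum((di - ai) ** 2 for di, ai in zip(d, a)) / len(d)
rmse = mse ** 0.5
print(mse, rmse)  # 0.625  0.7905...
```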
10. Decision trees (DT)
- A simple form of inductive learning
- Yet a successful form of learning algorithm
- Consider the example of deciding whether to play tennis
- Attributes (features)
  - Outlook, Temp, Humidity, Wind
- Values
  - Descriptions of the features
  - E.g. Outlook values - Sunny, Cloudy, Rainy
- Target
  - Play
  - Represents the output of the model
- Instances
  - Examples D1 to D14 of the data set
- Concept
  - Learn to decide whether to play tennis, i.e. find h from the given data set
Adapted from Mitchell, 1997
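For reference, here is the standard 14-example play-tennis data set from Mitchell (1997) as a Python list, with the value names Cloudy/Rainy used in this lecture instead of Mitchell's Overcast/Rain; the later code sketches assume this representation.

```python
play_tennis = [
    {'Day': 'D1',  'Outlook': 'Sunny',  'Temp': 'Hot',  'Humidity': 'High',   'Wind': 'Weak',   'Play': 'No'},
    {'Day': 'D2',  'Outlook': 'Sunny',  'Temp': 'Hot',  'Humidity': 'High',   'Wind': 'Strong', 'Play': 'No'},
    {'Day': 'D3',  'Outlook': 'Cloudy', 'Temp': 'Hot',  'Humidity': 'High',   'Wind': 'Weak',   'Play': 'Yes'},
    {'Day': 'D4',  'Outlook': 'Rainy',  'Temp': 'Mild', 'Humidity': 'High',   'Wind': 'Weak',   'Play': 'Yes'},
    {'Day': 'D5',  'Outlook': 'Rainy',  'Temp': 'Cool', 'Humidity': 'Normal', 'Wind': 'Weak',   'Play': 'Yes'},
    {'Day': 'D6',  'Outlook': 'Rainy',  'Temp': 'Cool', 'Humidity': 'Normal', 'Wind': 'Strong', 'Play': 'No'},
    {'Day': 'D7',  'Outlook': 'Cloudy', 'Temp': 'Cool', 'Humidity': 'Normal', 'Wind': 'Strong', 'Play': 'Yes'},
    {'Day': 'D8',  'Outlook': 'Sunny',  'Temp': 'Mild', 'Humidity': 'High',   'Wind': 'Weak',   'Play': 'No'},
    {'Day': 'D9',  'Outlook': 'Sunny',  'Temp': 'Cool', 'Humidity': 'Normal', 'Wind': 'Weak',   'Play': 'Yes'},
    {'Day': 'D10', 'Outlook': 'Rainy',  'Temp': 'Mild', 'Humidity': 'Normal', 'Wind': 'Weak',   'Play': 'Yes'},
    {'Day': 'D11', 'Outlook': 'Sunny',  'Temp': 'Mild', 'Humidity': 'Normal', 'Wind': 'Strong', 'Play': 'Yes'},
    {'Day': 'D12', 'Outlook': 'Cloudy', 'Temp': 'Mild', 'Humidity': 'High',   'Wind': 'Strong', 'Play': 'Yes'},
    {'Day': 'D13', 'Outlook': 'Cloudy', 'Temp': 'Hot',  'Humidity': 'Normal', 'Wind': 'Weak',   'Play': 'Yes'},
    {'Day': 'D14', 'Outlook': 'Rainy',  'Temp': 'Mild', 'Humidity': 'High',   'Wind': 'Strong', 'Play': 'No'},
]
```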
11. Decision trees (DT)
- A decision tree takes a set of properties as input and provides a decision as output
  - Each row of the table corresponds to a path in the tree
  - A decision tree may form a more compact representation, especially if many attributes are irrelevant
- DT could be considered as the learning method when
  - Instances are describable by attribute-value pairs
  - The target function is discrete valued (e.g. YES, NO)
  - The training data is possibly noisy
- It is not suitable (needs further adaptation)
  - When attribute values and/or the target are numerical values
    - E.g. attribute values Temp = 22 C, Windy = 25 mph; target function = 70, 30
- Some functions require an exponentially large decision tree, e.g. the parity function
12. Forming rules from a DT
- Example of the concept: should I play tennis today?
  - Takes inputs (a set of attributes)
  - Outputs a decision (say YES/NO)
- Each non-leaf node is an attribute
  - The first non-leaf node is the root node
- Each leaf node is either Yes or No
- Each link (branch) is labelled with the possible values of the associated attribute
- Rule formation
  - A decision tree can be expressed as a disjunction of conjunctions
  - PLAY tennis IF (Outlook = Sunny ∧ Humidity = Normal) ∨ (Outlook = Cloudy) ∨ (Outlook = Rainy ∧ Wind = Weak)
  - ∨ is the disjunction operator (OR)
  - ∧ is the conjunction operator (AND)
The corresponding decision tree:

Outlook
  Sunny -> Humidity
    High -> No
    Normal -> Yes
  Cloudy -> Yes
  Rainy -> Wind
    Strong -> No
    Weak -> Yes
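To make the disjunction-of-conjunctions reading concrete, the same tree can be written as a small Python function (a sketch; the function name is my own):

```python
def play_decision(outlook, humidity, wind):
    """The play-tennis tree above, written out as nested if-statements."""
    if outlook == 'Sunny':
        return 'Yes' if humidity == 'Normal' else 'No'
    if outlook == 'Cloudy':
        return 'Yes'
    return 'Yes' if wind == 'Weak' else 'No'    # Rainy branch

print(play_decision('Sunny', 'High', 'Weak'))   # No
```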
13. Another DT example
- Another example (from Lecture 1)
- Reading the tree on the right:
  - If parents visiting = yes, then go to the cinema; or
  - If parents visiting = no and weather = sunny, then play tennis; or
  - If parents visiting = no and weather = windy and money = rich, then go shopping; or
  - If parents visiting = no and weather = windy and money = poor, then go to the cinema; or
  - If parents visiting = no and weather = rainy, then stay in.
Source: http://wwwhomes.doc.ic.ac.uk/sgc/teaching/v231/lecture10.html
14. Obtaining a DT through top-down induction
- How can we obtain a DT?
  - Perform a top-down search through the space of possible decision trees
  - Determine the attribute that best classifies the training data
  - Use this attribute as the root of the tree
  - Repeat this process for each branch from left to right
  - Proceed to the next level and determine the next best feature
  - Repeat until a leaf is reached
- How do we choose the best attribute?
  - Choose the attribute that yields the most information (i.e. the attribute with the highest information gain)
15. Information gain
- Information gain -> a reduction of entropy, E
- But what is entropy?
  - In thermodynamics, it is the amount of energy that cannot be used to do work
  - Here it is measured in bits
  - A measure of disorder in a system (high entropy = high disorder)
- E(S) = - sum over i = 1..c of p_i log2(p_i)
- where
  - S is the training data set
  - c is the number of target classes
  - p_i is the proportion of examples in S belonging to target class i
- Note: if your calculator doesn't do log2, use log2(x) = 1.443 ln(x) or 3.322 log10(x). For better accuracy, use log2(x) = ln(x)/ln(2) or log2(x) = log10(x)/log10(2)
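A minimal Python sketch of this definition, applied to a list of class labels (the representation is my own choice, not from the slides):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """E(S) = -sum_i p_i * log2(p_i), in bits, for a list of class labels."""
    n = len(labels)
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())

print(entropy(['H'] * 50 + ['T'] * 50))   # 1.0 bit for a fair coin
print(entropy(['H'] * 99 + ['T'] * 1))    # about 0.08 bit for a heavily rigged coin
```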
16. Entropy example
- A coin is flipped
  - If the coin is fair -> 50% chance of heads
  - Now, let us rig the coin -> so that 99% of the time heads comes up
- Let's look at this in terms of entropy
  - Two outcomes: heads, tails
  - Probabilities: p_heads, p_tails
  - E(0.5, 0.5) = -0.5 log2(0.5) - 0.5 log2(0.5) = 1 bit
  - E(0.01, 0.99) = -0.01 log2(0.01) - 0.99 log2(0.99) = 0.08 bit
  - If the probability of heads = 1, then entropy = 0
  - E(0, 1.0) = -0 log2(0) - 1.0 log2(1.0) = 0 bit (taking 0 log2(0) = 0)
17. Information Gain
- Information gain, G, is defined as
  Gain(S, A) = E(S) - sum over v in Values(A) of (|Sv| / |S|) * E(Sv)
- where
  - Values(A) is the set of all possible values of attribute A
  - Sv is the subset of S for which A has value v
  - |S| is the size of S and |Sv| is the size of Sv
- The information gain is the expected reduction in entropy caused by knowing the value of attribute A
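The same definition as a Python sketch, building on the entropy() helper and the play_tennis list given earlier (the record-of-dicts representation is my own):

```python
def information_gain(rows, attribute, target='Play'):
    """Gain(S, A) = E(S) - sum over values v of (|Sv|/|S|) * E(Sv)."""
    gain = entropy([r[target] for r in rows])     # E(S)
    n = len(rows)
    for v in {r[attribute] for r in rows}:        # values of A observed in S
        subset = [r[target] for r in rows if r[attribute] == v]
        gain -= (len(subset) / n) * entropy(subset)
    return gain
```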
18. Example entropy calculation
- Compute the entropy of the play-tennis example
- We have two classes, YES and NO
- We have 14 instances, with 9 classified as YES and 5 as NO
- i.e. the number of classes, c = 2
- E_YES = -(9/14) log2(9/14) = 0.41
- E_NO = -(5/14) log2(5/14) = 0.53
- E(S) = E_YES + E_NO = 0.94
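This value can be checked with the helpers sketched earlier:

```python
labels = [r['Play'] for r in play_tennis]   # 9 x 'Yes', 5 x 'No'
print(round(entropy(labels), 2))            # 0.94
```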
19. Example information gain calculation
- Compute the information gain for the attribute Wind in the play-tennis data set
- |S| = 14
- Attribute Wind
  - Two values: Weak and Strong
  - |S_weak| = 8
  - |S_strong| = 6
20. Example information gain calculation
- Now, let us determine E(S_weak)
  - Instances = 8, YES = 6, NO = 2, i.e. [6+, 2-]
  - E(S_weak) = -(6/8) log2(6/8) - (2/8) log2(2/8) = 0.81
21. Example information gain calculation
- Now, let us determine E(S_strong)
  - Instances = 6, YES = 3, NO = 3, i.e. [3+, 3-]
  - E(S_strong) = -(3/6) log2(3/6) - (3/6) log2(3/6) = 1.0
- Note: do not waste time computing when p_YES = p_NO - the entropy is always 1.0
22. Example information gain calculation
- Going back to the information gain computation for the attribute Wind
  - Gain(S, Wind) = 0.94 - (8/14) x 0.81 - (6/14) x 1.00 = 0.048
23. Example information gain calculation
- Now, compute the information gain for the attribute Humidity in the play-tennis data set
- |S| = 14
- Attribute Humidity
  - Two values: High and Normal
  - |S_high| = 7
  - |S_normal| = 7
  - For value High -> [3+, 4-]
  - For value Normal -> [6+, 1-]
24. Example information gain calculation (contd)
- Continuing with the attribute Humidity
  - E(S_high) = -(3/7) log2(3/7) - (4/7) log2(4/7) = 0.98
  - E(S_normal) = -(6/7) log2(6/7) - (1/7) log2(1/7) = 0.59
  - Gain(S, Humidity) = 0.94 - (7/14) x 0.98 - (7/14) x 0.59 = 0.15
- So, Humidity provides a GREATER information gain than Wind
25. Example information gain calculation
- Now, compute the information gain for the attributes Outlook and Temperature in the play-tennis data set
  - Gain(S, Outlook) = 0.25
  - Gain(S, Temp) = 0.03
  - Gain(S, Humidity) = 0.15
  - Gain(S, Wind) = 0.048
- So, the attribute with the highest information gain is OUTLOOK; therefore use Outlook as the root node
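These four figures can be reproduced with the earlier sketches:

```python
for attr in ('Outlook', 'Temp', 'Humidity', 'Wind'):
    print(attr, round(information_gain(play_tennis, attr), 3))
# Outlook 0.247, Temp 0.029, Humidity 0.152, Wind 0.048 -> Outlook wins
```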
26. DT: next level
- After determining Outlook as the root node, we need to expand the tree
- The Sunny branch has 5 instances: [2+, 3-]
  - E(S_sunny) = -(2/5) log2(2/5) - (3/5) log2(3/5) = 0.97
27. DT: next level
- Gain(S_sunny, Humidity) = 0.97 - (3/5) x 0.0 - (2/5) x 0.0 = 0.97
- Gain(S_sunny, Wind) = 0.97 - (3/5) x 0.918 - (2/5) x 1.0 = 0.019
- Gain(S_sunny, Temperature) = 0.97 - (2/5) x 0.0 - (2/5) x 1.0 - (1/5) x 0.0 = 0.57
- The highest information gain is for Humidity, so use this attribute
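The same comparison on the Sunny subset, using the earlier sketches:

```python
sunny = [r for r in play_tennis if r['Outlook'] == 'Sunny']
for attr in ('Humidity', 'Wind', 'Temp'):
    print(attr, round(information_gain(sunny, attr), 3))
# Humidity 0.971, Wind 0.020, Temp 0.571 -> Humidity is chosen under the Sunny branch
```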
28. Continue ... and the final DT
- Continue until all the examples are classified
  - Compute Gain(S_rainy, Wind), Gain(S_rainy, Humidity) and Gain(S_rainy, Temp)
  - Gain(S_rainy, Wind) is the highest
- All leaf nodes are then associated with training examples from the same class (entropy = 0)
- The attribute Temperature is not used
29. ID3 algorithm pseudocode
30. ID3 algorithm pseudocode (Mitchell)
- From Mitchell (1997) - not important for the exam
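As a stand-in for the pseudocode, the following is a minimal Python sketch of the same idea, using the play_tennis list and the information_gain() helper from earlier (the names are my own; this is not Mitchell's exact pseudocode):

```python
from collections import Counter

def id3(rows, attributes, target='Play'):
    """Recursive ID3 sketch: returns a class label (leaf) or a nested dict
    of the form {attribute: {value: subtree, ...}}."""
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:                 # all examples in one class -> leaf
        return labels[0]
    if not attributes:                        # no attributes left -> majority class
        return Counter(labels).most_common(1)[0][0]
    # choose the attribute with the highest information gain
    best = max(attributes, key=lambda a: information_gain(rows, a, target))
    remaining = [a for a in attributes if a != best]
    tree = {best: {}}
    for value in {r[best] for r in rows}:     # one branch per observed value
        subset = [r for r in rows if r[best] == value]
        tree[best][value] = id3(subset, remaining, target)
    return tree

print(id3(play_tennis, ['Outlook', 'Temp', 'Humidity', 'Wind']))
# Reproduces the tree above: Outlook at the root, Humidity under Sunny,
# Wind under Rainy, and Temperature unused.
```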
31. Search strategy in ID3
- Complete hypothesis space: any finite discrete-valued function can be expressed
- Incomplete search: searches incompletely through the hypothesis space, stopping once the tree is consistent with the data
- Single hypothesis: only one current hypothesis (the simplest one) is maintained
- No backtracking: once an attribute is selected, this cannot be changed. Problem: the result might not be the (globally) optimal solution
- Full training set: attributes are selected by computing information gain on the full training set. Advantage: robustness to errors. Problem: non-incremental
32. Lecture 2 summary
- From this lecture, you should be able to
  - Define concept, learning model, hypothesis, hypothesis space, consistent hypothesis, inductive learning, inductive bias, realisable and unrealisable tasks, and Occam's razor in the context of ML
  - Differentiate between generalisation and overfitting
  - Define entropy and information gain and know how to calculate them for a given data set
  - Explain the ID3 algorithm, how it works, and describe it in pseudocode
  - Apply the ID3 algorithm to a given data set