CC282 Decision trees - PowerPoint PPT Presentation
1
CC282 Decision trees
2
Lecture 2 - Outline
  • More ML principles
  • Concept learning
  • Hypothesis space
  • Generalisation and overfitting
  • Model (hypothesis) evaluation
  • Inductive learning
  • Inductive bias
  • Decision trees
  • ID3 algorithm (entropy, information gain)

3
Concept learning
  • Concept, c, is the problem to be learned
  • Example: a classification problem faced by an optician
  • Concept: whether or not to fit contact lenses based on the user's budget, the user's eye condition, the user's environment, etc.
  • Inputs, x: user's budget, user's eye condition, user's environment
  • Output, y: to fit or not to fit
  • A learning model is needed to learn a concept
  • The learning model should ideally:
  • Capture the training data, <x, y> -> descriptive ability
  • Generalise to unseen test data, <xnew, ?> -> predictive ability
  • Provide a plausible explanation of the learned concept, c -> explanatory ability
  • But descriptive and predictive abilities are generally considered sufficient

4
Learning a concept
  • Concept learning
  • Given many examples <input, output> of what c does, find a function h that approximates c
  • The number of examples is usually a small subset of all possible <input, output> pairs
  • h is known as a hypothesis (i.e. the learning model)
  • There may be a number of h that are candidate solutions - we select h from a hypothesis space H
  • If the hypothesis matches the behaviour of the target concept for all training data, it is a consistent hypothesis
  • Occam's razor
  • The simpler hypothesis that fits c is preferred
  • A simpler h means a shorter, smaller h
  • A simpler h is unlikely to fit merely by coincidence
  • Learning = search in H for an appropriate h
  • Realisable task: H contains an h that fits the concept
  • Unrealisable task: H does not contain an h that fits the concept

5
More terms - generalisation, overfitting, induction, deduction
  • Generalisation
  • The ability of the trained model to perform well on test data
  • Overfitting
  • The model learns the training data well but performs poorly on test data
  • Inductive learning (induction)
  • Learning a hypothesis by example: a system tries to induce a general rule/model from a set of observed instances/samples
  • Inductive bias
  • Since many choices of h exist in H, any preference for one hypothesis over another without prior knowledge is called bias
  • Not every hypothesis consistent with the training examples will generalise to unseen examples - the trick is to find the right bias
  • An unbiased learner
  • Can never generalise, so it is not practically useful
  • Deduction
  • The ML system gives an output (prediction, classification, etc.) based on the previously acquired learning

6
Generalisation and overfitting example
  • Assume we have inputs x and corresponding outputs y, and we wish to find a concept c that maps x to y
  • Examples of hypotheses:
  • h1 will give good generalisation
  • h2 is overfitted

7
Model (hypothesis) evaluation
  • We need a performance measure to estimate how well the model h approximates c, i.e. how good is h?
  • Possible evaluation methods:
  • Explanatory: gives a qualitative evaluation
  • Descriptive: gives a quantitative (numerical) evaluation
  • Explanatory evaluation
  • Does the model provide a plausible description of the learned concept?
  • Classification: does it base its classification on plausible rules?
  • Association: does it discover plausible relationships in the data?
  • Clustering: does it come up with plausible clusters?
  • The meaning of "plausible" is to be defined by the human expert
  • Hence, not popular in ML

8
Descriptive evaluation
  • Example: bowel cancer classification problem
  • True positives (TP): diseased patients identified as having cancer
  • True negatives (TN): healthy subjects identified as healthy
  • False negatives (FN): test identifies a cancer patient as healthy
  • False positives (FP): test identifies a healthy subject as having cancer
  • Precision = TP / (TP + FP)
  • Sensitivity (Recall) = TP / (TP + FN)
  • F measure (balanced F score) = 2 x Precision x Recall / (Precision + Recall)
  • Simple classification accuracy = (TP + TN) / (TP + TN + FP + FN)

Source: Wikipedia
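These measures follow directly from the four confusion-matrix counts. A minimal sketch (the function name and the example counts below are illustrative, not from the slides):

```python
def classification_metrics(tp, tn, fp, fn):
    """Standard measures computed from confusion-matrix counts."""
    precision = tp / (tp + fp)        # of predicted positives, how many are real
    recall = tp / (tp + fn)           # sensitivity: how many real positives found
    f_measure = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f_measure, accuracy

# e.g. 8 TP, 5 TN, 2 FP, 1 FN (hypothetical counts)
p, r, f, a = classification_metrics(8, 5, 2, 1)
```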
9
Descriptive evaluation (contd)
  • For prediction problems, the mean square error (MSE) is used:
  • MSE = (1/n) Σi (di - ai)^2
  • where
  • di is the desired output in the data set
  • ai is the actual output from the model
  • n is the number of instances in the data set
  • If n = 2, d1 = 1.0, a1 = 0.5, d2 = 0, a2 = 1.0:
  • the squared errors are 0.25 and 1.0, summing to 1.25, so MSE = 1.25/2 = 0.625
  • Sometimes the root mean square error is used instead: RMSE = sqrt(MSE)
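The MSE computation can be checked with a few lines of Python (a sketch; the function name is illustrative):

```python
def mse(desired, actual):
    """Mean square error: (1/n) * sum of squared differences."""
    n = len(desired)
    return sum((d - a) ** 2 for d, a in zip(desired, actual)) / n

# The slide's example: d = (1.0, 0), a = (0.5, 1.0)
# squared errors are 0.25 and 1.0, summing to 1.25; dividing by n = 2 gives 0.625
result = mse([1.0, 0], [0.5, 1.0])
```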

10
Decision trees (DT)
  • A simple form of inductive learning
  • Yet a successful learning algorithm
  • Consider the example of playing tennis
  • Attributes (features)
  • Outlook, Temperature, Humidity, Wind
  • Values
  • Descriptions of the features
  • E.g. Outlook values: sunny, cloudy, rainy
  • Target
  • Play
  • Represents the output of the model
  • Instances
  • Examples D1 to D14 of the dataset
  • Concept
  • Learn to decide whether to play tennis, i.e. find h from the given data set

Adapted from Mitchell, 1997
11
Decision trees (DT)
  • A decision tree takes a set of properties as input and provides a decision as output
  • Each row of the table corresponds to a path in the tree
  • A decision tree may form a more compact representation, especially if many attributes are irrelevant
  • DT could be considered as the learning method when:
  • Instances are describable by attribute-value pairs
  • The target function is discrete valued (e.g. YES, NO)
  • The training data are possibly noisy
  • It is not suitable (needs further adaptation):
  • When attribute values and/or the target are numerical values
  • E.g. attribute values Temp = 22 C, Wind = 25 mph
  • Target function = 70%, 30%
  • When some functions require an exponentially large decision tree
  • E.g. the parity function

12
Forming rules from DT
  • Example of a concept: Should I play tennis today?
  • Takes inputs (a set of attributes)
  • Outputs a decision (say YES/NO)
  • Each non-leaf node is an attribute
  • The first non-leaf node is the root node
  • Each leaf node is either Yes or No
  • Each link (branch) is labelled with the possible values of the associated attribute
  • Rule formation
  • A decision tree can be expressed as a disjunction of conjunctions
  • PLAY tennis IF (Outlook = Sunny ∧ Humidity = Normal) ∨ (Outlook = Cloudy) ∨ (Outlook = Rainy ∧ Wind = Weak)
  • ∨ is the disjunction operator (OR)
  • ∧ is the conjunction operator (AND)
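The same disjunction of conjunctions can be written as nested conditionals. A sketch of the slide's tree (the function name is illustrative):

```python
def play_tennis(outlook, humidity, wind):
    """The play-tennis decision tree as nested if/else rules."""
    if outlook == "Sunny":
        return "Yes" if humidity == "Normal" else "No"
    if outlook == "Cloudy":
        return "Yes"
    # remaining case: Rainy
    return "Yes" if wind == "Weak" else "No"
```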

The tree (reconstructed from the node labels):
  Outlook = Sunny -> Humidity: Normal -> Yes, High -> No
  Outlook = Cloudy -> Yes
  Outlook = Rainy -> Wind: Weak -> Yes, Strong -> No
13
Another DT example
  • Another example (from Lecture 1)
  • Reading the tree on the right

If the parents visiting = yes, then go to the cinema; or
if the parents visiting = no and weather = sunny, then play tennis; or
if the parents visiting = no and weather = windy and money = rich, then go shopping; or
if the parents visiting = no and weather = windy and money = poor, then go to the cinema; or
if the parents visiting = no and weather = rainy, then stay in.
Source: http://wwwhomes.doc.ic.ac.uk/sgc/teaching/v231/lecture10.html
14
Obtaining DT through top-down induction
  • How can we obtain a DT?
  • Perform a top-down search through the space of possible decision trees
  • Determine the attribute that best classifies the training data
  • Use this attribute as the root of the tree
  • Repeat this process for each branch, from left to right
  • Proceed to the next level and determine the next best feature
  • Repeat until a leaf is reached
  • How do we choose the best attribute?
  • Choose the attribute that yields the most information (i.e. the attribute with the highest information gain)

15
Information gain
  • Information gain -> a reduction of entropy, E
  • But what is entropy?
  • In physics: the amount of energy that cannot be used to do work
  • Measured in bits
  • A measure of disorder in a system (high entropy = disorder)
  • E(S) = - Σ(i=1..c) pi log2(pi)
  • where
  • S is the training data set
  • c is the number of target classes
  • pi is the proportion of examples in S belonging to target class i
  • Note: if your calculator doesn't do log2, use log2(x) ≈ 1.443 ln(x) or ≈ 3.322 log10(x). For exact results, use log2(x) = ln(x)/ln(2) or log2(x) = log10(x)/log10(2)

16
Entropy example
  • A coin is flipped
  • If the coin is fair -> 50% chance of heads
  • Now let us rig the coin -> so that 99% of the time heads comes up
  • Let's look at this in terms of entropy
  • Two outcomes: head, tail
  • Probabilities: p(head), p(tail)
  • E(0.5, 0.5) = -0.5 log2(0.5) - 0.5 log2(0.5) = 1 bit
  • E(0.01, 0.99) = -0.01 log2(0.01) - 0.99 log2(0.99) ≈ 0.08 bit
  • If the probability of heads = 1, then entropy = 0
  • E(0, 1.0) = -0 log2(0) - 1.0 log2(1.0) = 0 bit (taking 0 log2 0 = 0)
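The coin entropies can be verified with a short helper (a sketch; the 0 log2 0 = 0 convention is applied explicitly by skipping zero probabilities):

```python
import math

def entropy(probs):
    """Entropy in bits: E = -sum(p * log2(p)), with 0 * log2(0) taken as 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

fair = entropy([0.5, 0.5])      # the fair coin
rigged = entropy([0.01, 0.99])  # the rigged coin
certain = entropy([0, 1.0])     # a two-headed coin
```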

17
Information Gain
  • Information gain, G, is defined as:
  • G(S, A) = E(S) - Σ(v ∈ Values(A)) (|Sv| / |S|) E(Sv)
  • where
  • Values(A) is the set of all possible values of attribute A
  • Sv is the subset of S for which A has value v
  • |S| is the size of S and |Sv| is the size of Sv
  • The information gain is the expected reduction in entropy caused by knowing the value of attribute A
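The definition translates directly to code when each subset is represented by its per-class example counts (a sketch; the function names are illustrative):

```python
import math

def subset_entropy(counts):
    """Entropy of a set given its per-class example counts."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c)

def information_gain(parent_counts, value_splits):
    """G(S, A) = E(S) - sum over values v of (|Sv|/|S|) * E(Sv).
    value_splits holds one class-count list per value of A."""
    total = sum(parent_counts)
    return subset_entropy(parent_counts) - sum(
        sum(s) / total * subset_entropy(s) for s in value_splits)
```

With the play-tennis counts used later in the lecture, `information_gain([9, 5], [[6, 2], [3, 3]])` reproduces the 0.048 figure for Wind.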

18
Example entropy calculation
  • Compute the entropy of the play-tennis example
  • We have two classes, YES and NO
  • We have 14 instances, with 9 classified as YES and 5 as NO
  • i.e. number of classes, c = 2
  • E_YES = -(9/14) log2(9/14) = 0.41
  • E_NO = -(5/14) log2(5/14) = 0.53
  • E(S) = E_YES + E_NO = 0.94

19
Example information gain calculation
  • Compute the information gain for the attribute Wind in the play-tennis data set
  • |S| = 14
  • Attribute: Wind
  • Two values: weak and strong
  • |S_weak| = 8
  • |S_strong| = 6

20
Example information gain calculation
  • Now, let us determine E(S_weak)
  • Instances = 8; YES = 6, NO = 2, i.e. [6+, 2-]
  • E(S_weak) = -(6/8) log2(6/8) - (2/8) log2(2/8) = 0.81

21
Example information gain calculation
  • Now, let us determine E(S_strong)
  • Instances = 6; YES = 3, NO = 3, i.e. [3+, 3-]
  • E(S_strong) = -(3/6) log2(3/6) - (3/6) log2(3/6) = 1.0
  • Note: do not waste time calculating if p_YES = p_NO; the entropy is 1

Lecture 1 slides for CC282 Machine Learning, R.
Palaniappan, 2008
22
Example information gain calculation
  • Going back to the information gain computation for the attribute Wind:
  • G(S, Wind) = 0.94 - (8/14)(0.81) - (6/14)(1.00)
  • = 0.048

23
Example information gain calculation
  • Now, compute the information gain for the attribute Humidity in the play-tennis data set
  • |S| = 14
  • Attribute: Humidity
  • Two values: high and normal
  • |S_high| = 7
  • |S_normal| = 7
  • For value high -> [3+, 4-]
  • For value normal -> [6+, 1-]

24
Example information gain calculation (contd)
  • G(S, Humidity) = 0.94 - (7/14)(0.98) - (7/14)(0.59)
  • = 0.15

E(S_high) = -(3/7) log2(3/7) - (4/7) log2(4/7) = 0.98
E(S_normal) = -(6/7) log2(6/7) - (1/7) log2(1/7) = 0.59
So, Humidity provides GREATER information gain than Wind.
Lecture 2 slides for CC282 Machine Learning, R.
Palaniappan, 2008
25
Example information gain calculation
  • Now, compute the information gain for the attributes Outlook and Temperature in the play-tennis data set
  • Attribute: Outlook
  • Attribute: Temperature
  • Gain(S, Outlook) = 0.25
  • Gain(S, Temperature) = 0.03
  • Gain(S, Humidity) = 0.15
  • Gain(S, Wind) = 0.048
  • The attribute with the highest information gain is OUTLOOK, so use Outlook as the root node

26
DT next level
  • After determining OUTLOOK as the root node, we need to expand the tree
  • Outlook = Sunny gives 5 instances: [2+, 3-]
  • E(S_sunny) = -(2/5) log2(2/5) - (3/5) log2(3/5) = 0.97

27
DT next level
  • Gain(S_sunny, Humidity) = 0.97 - (3/5) 0.0 - (2/5) 0.0 = 0.97
  • Gain(S_sunny, Wind) = 0.97 - (3/5) 0.918 - (2/5) 1.0 = 0.019
  • Gain(S_sunny, Temperature) = 0.97 - (2/5) 0.0 - (2/5) 1.0 - (1/5) 0.0 = 0.57
  • Humidity has the highest information gain, so use this attribute
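These three gains can be re-derived from the class counts of the sunny subset. The counts below ([2+, 3-] overall, plus the per-value splits) are taken from the standard play-tennis table, which is not reproduced in this transcript:

```python
import math

def H(counts):
    """Entropy in bits from per-class counts."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c)

def gain(parent, splits):
    """Information gain of a split, all sets given as class counts."""
    total = sum(parent)
    return H(parent) - sum(sum(s) / total * H(s) for s in splits)

sunny = [2, 3]                                       # 2 Yes, 3 No
g_humidity = gain(sunny, [[0, 3], [2, 0]])           # High, Normal: both pure
g_wind = gain(sunny, [[1, 2], [1, 1]])               # Weak, Strong
g_temp = gain(sunny, [[0, 2], [1, 1], [1, 0]])       # Hot, Mild, Cool
```

Without intermediate rounding, g_wind comes out near 0.020; the slide's 0.019 results from using the rounded entropy 0.97.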

28
Continue ... and the final DT
  • Continue until all the examples are classified
  • Compare Gain(S_rainy, Wind), Gain(S_rainy, Humidity), and Gain(S_rainy, Temperature)
  • Gain(S_rainy, Wind) is the highest
  • All leaf nodes are associated with training examples from the same class (entropy = 0)
  • The attribute Temperature is never used

29
ID3 algorithm pseudocode
  • Sufficient for exam
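Since the pseudocode slide itself did not survive in this transcript, here is a hedged sketch of the ID3 loop in Python (names and the dict-based data layout are illustrative, not from the slides):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def id3(rows, labels, attributes):
    """rows: list of dicts mapping attribute -> value; labels: parallel classes.
    Returns a nested dict {attribute: {value: subtree}} or a leaf label."""
    if len(set(labels)) == 1:            # all examples in one class -> leaf
        return labels[0]
    if not attributes:                   # no attributes left -> majority class
        return Counter(labels).most_common(1)[0][0]

    def info_gain(a):                    # expected entropy reduction for a
        remainder = 0.0
        for v in {r[a] for r in rows}:
            sub = [l for r, l in zip(rows, labels) if r[a] == v]
            remainder += len(sub) / len(rows) * entropy(sub)
        return entropy(labels) - remainder

    best = max(attributes, key=info_gain)
    tree = {best: {}}
    for v in {r[best] for r in rows}:    # one branch per observed value
        sub_rows = [r for r in rows if r[best] == v]
        sub_labels = [l for r, l in zip(rows, labels) if r[best] == v]
        tree[best][v] = id3(sub_rows, sub_labels,
                            [a for a in attributes if a != best])
    return tree
```

On a toy data set where the label depends only on one attribute, the tree collapses to a single test on that attribute, mirroring how Outlook is chosen as the root above.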

30
ID3 algorithm pseudocode (Mitchell)
  • From Mitchell (1997); not important for the exam

31
Search strategy in ID3
  • Complete hypothesis space: any finite discrete-valued function can be expressed
  • Incomplete search: searches incompletely through the hypothesis space, stopping once the tree is consistent with the data
  • Single hypothesis: only one current hypothesis (the simplest one) is maintained
  • No backtracking: once an attribute is selected, this cannot be changed. Problem: the result might not be the (globally) optimal solution
  • Full training set: attributes are selected by computing information gain on the full training set. Advantage: robustness to errors. Problem: non-incremental

32
Lecture 2 summary
  • From this lecture, you should be able to:
  • Define concept, learning model, hypothesis, hypothesis space, consistent hypothesis, inductive learning, bias, realisable and unrealisable tasks, and Occam's razor in view of ML
  • Differentiate between generalisation and overfitting
  • Define entropy and information gain and know how to calculate them for a given data set
  • Explain the ID3 algorithm, how it works, and describe it in pseudocode
  • Apply the ID3 algorithm to a given data set