Title: Decision Tree and Concept Learning
1Decision Tree and Concept Learning
2Outline
- Types of learning
- Inductive learning
- non-incremental
- incremental
- Decision trees (non-incremental)
- Current best hypothesis (incremental)
- Candidate elimination (incremental)
3Why should programs learn?
- All the programs seen up to now have been static:
  - if we run the programs again on the same data, they will do exactly the same as before; they cannot learn from their experience
  - they require us to specify everything they will ever need to know at the outset.
- We would like programs to learn from their experience.
- We would like the possibility of learning being continuous.
- Learning is fundamental to intelligence.
4How do you learn?
- By being told; relies on
  - someone to do the telling
  - something to tell!
- By finding a teacher who provides a set of pre-classified examples (i.e. I/O pairs), or by taking actions and observing the correct answers: Supervised learning
- By searching for regularities in unclassified data (e.g. clusters): Unsupervised learning
- By trying things and seeing which outcomes are desirable (i.e. earn rewards), e.g. learning the heuristic evaluation function in game-playing: Reinforcement learning
5Inductive learning
- Learning by example (supervised learning)
- The teacher provides good training instances and the learner is expected to generalise.
- In knowledge acquisition it is often easier to get an expert to give examples than to give rules.
- This is how experimental science works (the 'teacher' is the natural world).
6Application of knowledge (deduction)
(Diagram: deduction - applying Knowledge to Input 1 gives Output 1, and applying it to Input 2 gives Output 2.)
7Inductive learning (induction)
(Diagram: induction - the pairs (Input 1, Output 1), (Input 2, Output 2), ..., (Input n, Output n) are used to construct Knowledge.)
8Knowledge as a function
- Knowledge (performance element) can be described as a function:
  - given a description, x, of an object or situation, the output is given by f(x), where f embodies the knowledge contained in the performance element.
- could be
- analytical mathematical function
- lookup table
- rule set (including STRIPS rules)
- neural network
- decision tree
- etc.
9Definition of inductive learning
- Given a set of input/output pairs (examples) (x, f(x))
  - where f is unknown, but the output f(x) is provided with its corresponding input, x
- find a function h(x) (the hypothesis) which best approximates f(x).
- 'finding' implies searching in a space of different possible hypotheses.
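To make this concrete, here is a minimal Python sketch (the example pairs and the three candidate hypotheses are invented for illustration): learning is a search through a space of candidate functions for one that agrees with the observed (x, f(x)) pairs.

```python
# A tiny, invented hypothesis space: learning = searching it for the best fit.
examples = [(0, 1), (1, 3), (2, 5), (3, 7)]          # observed (x, f(x)) pairs; f itself is unknown

hypotheses = {                                        # candidate functions h(x)
    "h1: x + 1":     lambda x: x + 1,
    "h2: 2x + 1":    lambda x: 2 * x + 1,
    "h3: x squared": lambda x: x * x,
}

def error(h):
    """Number of examples the hypothesis gets wrong."""
    return sum(1 for x, fx in examples if h(x) != fx)

best = min(hypotheses, key=lambda name: error(hypotheses[name]))
print(best)   # -> "h2: 2x + 1", the only hypothesis consistent with all the examples
```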
10Inductive learning
(Diagram: non-incremental - all the (Input, Output) pairs are presented together and produce Knowledge in one step.)
(Diagram: incremental - (Input 1, Output 1) produces Knowledge 1; (Input 2, Output 2) refines it into Knowledge 2; and so on.)
Assume that Knowledge 2 is more complete than Knowledge 1, etc.
11Non-incremental vs. Incremental
- Non-incremental
- learning from all examples at once
- Incremental
- learning by refining from successive examples
- If you haven't seen all possible examples, you can never know that the system is going to give the correct answer for a previously unseen example.
- You may never see all possible examples.
12Wait for a table at a restaurant
- To avoid arguments, you and your friends want to
have a clear decision procedure for the situation
where you turn up at a restaurant and have to
make a decision as to whether you will wait to
get a table.
- In advance you
- specify a number of attributes
- draw up a decision tree of your own preferences
- When you get to the restaurant you use the
decision tree to find the value of the goal
predicate WillWait.
13Attributes involved in the decision
- Is there anything else nearby? Near: yes / no
- How long will the wait be? Time: 0-10 / 10-30 / 30-60 / >60
- Does the restaurant have a bar? Bar: yes / no
- Is it the weekend? W/E: yes / no
- Are you hungry? Hun: yes / no
- How many tables are occupied? Occ: none / some / all
- Is it raining? Rain: yes / no
- Have you booked? Book: yes / no
- What type of restaurant? Type: Fren / Chin / Ital / US
- How expensive is the restaurant? Price: cheap / OK / exp
14 Agreed Decision Tree: Wait for a table in a restaurant?
15 Decision trees: Performance element
- Object or situation described by a set of discrete attributes
- Task is to classify the object
  - binary (yes/no)
  - or a member of a discrete set of possible outcomes
- An internal node represents a test on one attribute.
- An arc represents a possible value for that attribute.
- A leaf node indicates the classification resulting from following the path to that node.
- A decision tree can be converted to a set of rules.
- This example represents a group's subjective way of making a decision about whether or not to wait for a table at a restaurant.
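As a concrete illustration of this performance element, a tree can be held as nested nodes: each internal node names the attribute it tests, each arc is a value, each leaf is a classification. The tiny tree below is a hypothetical fragment, not the agreed restaurant tree.

```python
# An internal node is {"test": attribute, "branches": {value: subtree}}; a leaf is just "YES"/"NO".
tree = {
    "test": "Occ",
    "branches": {
        "none": "NO",
        "some": "YES",
        "all":  {"test": "Hun", "branches": {"yes": "YES", "no": "NO"}},
    },
}

def classify(tree, case):
    """Follow tests from the root until a leaf (a plain string) is reached."""
    while isinstance(tree, dict):
        tree = tree["branches"][case[tree["test"]]]
    return tree

print(classify(tree, {"Occ": "all", "Hun": "yes"}))   # -> YES
```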
16Task
- An outsider observes the group on a number of occasions.
- The values of all the attributes are noted.
- Can (s)he learn a decision tree that leads to the same conclusion for these examples and will predict future behaviour?
17Set of examples
  #  Near  Bar  W/E  Hun  Occ   Price  Rain  Book  Type  Time   Decide
  1  yes   no   no   yes  some  exp    no    yes   Fren  0-10   YES
  2  yes   no   no   yes  all   cheap  no    no    Chin  30-60  NO
  3  no    yes  no   no   some  cheap  no    no    US    0-10   YES
  4  yes   no   yes  yes  all   cheap  yes   no    Chin  10-30  YES
  5  yes   no   yes  no   all   exp    no    yes   Fren  >60    NO
  6  no    yes  no   yes  some  ok     yes   yes   Ital  0-10   YES
  7  no    yes  no   no   none  cheap  yes   no    US    0-10   NO
  8  no    no   no   yes  some  ok     yes   yes   Chin  0-10   YES
  9  no    yes  yes  no   all   cheap  yes   no    US    >60    NO
 10  yes   yes  yes  yes  all   exp    no    yes   Ital  10-30  NO
 11  no    no   no   no   none  cheap  no    no    Chin  0-10   NO
 12  yes   yes  yes  yes  all   cheap  no    no    US    30-60  YES
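For the code sketches on later slides it is convenient to hold these 12 examples as Python records. The layout is mine; in particular, "Price" is my name for the cheap/OK/exp column, which the slide leaves unnamed.

```python
# The 12 observed examples; each record maps attribute name -> value, plus the decision.
ATTRIBUTES = ["Near", "Bar", "W/E", "Hun", "Occ", "Price", "Rain", "Book", "Type", "Time"]

ROWS = [
    # Near  Bar    W/E   Hun    Occ     Price    Rain   Book   Type    Time     Decide
    ("yes", "no",  "no", "yes", "some", "exp",   "no",  "yes", "Fren", "0-10",  "YES"),
    ("yes", "no",  "no", "yes", "all",  "cheap", "no",  "no",  "Chin", "30-60", "NO"),
    ("no",  "yes", "no", "no",  "some", "cheap", "no",  "no",  "US",   "0-10",  "YES"),
    ("yes", "no",  "yes","yes", "all",  "cheap", "yes", "no",  "Chin", "10-30", "YES"),
    ("yes", "no",  "yes","no",  "all",  "exp",   "no",  "yes", "Fren", ">60",   "NO"),
    ("no",  "yes", "no", "yes", "some", "ok",    "yes", "yes", "Ital", "0-10",  "YES"),
    ("no",  "yes", "no", "no",  "none", "cheap", "yes", "no",  "US",   "0-10",  "NO"),
    ("no",  "no",  "no", "yes", "some", "ok",    "yes", "yes", "Chin", "0-10",  "YES"),
    ("no",  "yes", "yes","no",  "all",  "cheap", "yes", "no",  "US",   ">60",   "NO"),
    ("yes", "yes", "yes","yes", "all",  "exp",   "no",  "yes", "Ital", "10-30", "NO"),
    ("no",  "no",  "no", "no",  "none", "cheap", "no",  "no",  "Chin", "0-10",  "NO"),
    ("yes", "yes", "yes","yes", "all",  "cheap", "no",  "no",  "US",   "30-60", "YES"),
]

# Each example: a dict of attribute values plus its decision.
EXAMPLES = [dict(zip(ATTRIBUTES, row[:-1]), Decide=row[-1]) for row in ROWS]
```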
18Learning decision trees
- Can a program learn a decision tree by looking at these 12 examples?
- It could build a decision tree which covered only the 12 cases, i.e. had a path to a decision for those 12 only.
- That is not very useful: there are over 9000 possible situations using the attributes I have given, but such a tree is designed to deal with only the 12.
- The assumption is that there is a simpler solution.
- What we would like the system to do is to come up with a decision tree that is general enough to predict the outcome in all possible cases (including the 12).
- We would also like the smallest tree which satisfies this.
- Learning is non-incremental.
19Which attribute? Take them in order ..
20Discriminating Attributes
We will determine which attribute provides the
most information at each stage, and use that as
the root of a sub-tree.
Before splitting: yes {1, 3, 4, 6, 8, 12} / no {2, 5, 7, 9, 10, 11}

Split on "How many tables are occupied?":
  none: yes {} / no {7, 11}
  some: yes {1, 3, 6, 8} / no {}
  all:  yes {4, 12} / no {2, 5, 9, 10}

Split on "What type of restaurant?":
  French:  yes {1} / no {5}
  Italian: yes {6} / no {10}
  US:      yes {3, 12} / no {7, 9}
  Chinese: yes {4, 8} / no {2, 11}
21Splitting heuristic
- What do we mean by 'providing the most information'?
- Based on an information-theoretic measure (Shannon)
- Aims to minimise the number of tests needed to provide a classification
- "ID3, An algorithm for learning decision trees", J. R. Quinlan, 1979
22Information
- Want a numerical measure for each attribute
- Maximum when the attribute is perfect (provides perfect separation)
- Minimum when the attribute is useless (no separation)
- Suppose the attribute has n possible values and the ith value has prior probability Pi
- Information content of the attribute is
  I(P1, P2, ..., Pn) = -Σi Pi log2(Pi)
- Choose the attribute with the highest information content.
- A bit more complex: see R&N.
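A minimal Python sketch of this measure (function names are mine; EXAMPLES and ATTRIBUTES are the records defined after slide 17): the information content of the yes/no decision, and how much of it remains after splitting on an attribute. Choosing the attribute with the highest gain reproduces the intuition of slide 20 that occupancy is informative and restaurant type is not.

```python
from math import log2

def information(probabilities):
    """I(P1, ..., Pn) = -sum Pi * log2(Pi), ignoring zero probabilities."""
    return -sum(p * log2(p) for p in probabilities if p > 0)

def decision_entropy(examples):
    """Information content of the YES/NO decision over a set of examples."""
    n = len(examples)
    yes = sum(1 for e in examples if e["Decide"] == "YES")
    return information([yes / n, (n - yes) / n]) if n else 0.0

def remainder(examples, attribute):
    """Expected information still needed after splitting the examples on the attribute."""
    n = len(examples)
    total = 0.0
    for value in {e[attribute] for e in examples}:
        subset = [e for e in examples if e[attribute] == value]
        total += len(subset) / n * decision_entropy(subset)
    return total

# Information gained by splitting the 12 restaurant examples on an attribute:
gain = lambda attr: decision_entropy(EXAMPLES) - remainder(EXAMPLES, attr)
print(round(gain("Occ"), 3), round(gain("Type"), 3))   # about 0.541 and 0.0: occupancy is far more informative
```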
23 Measure of Impurity: GINI
- Gini index for a given node t:
  GINI(t) = 1 - Σj [p(j | t)]²
  (NOTE: p(j | t) is the relative frequency of class j at node t.)
- Maximum (1 - 1/nc, where nc is the number of classes) when records are equally distributed among all classes, implying least interesting information
- Minimum (0.0) when all records belong to one class, implying most interesting information
24Examples for computing GINI
Node with class counts C1 = 0, C2 = 6:
  P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
  Gini = 1 - P(C1)² - P(C2)² = 1 - 0 - 1 = 0

Node with class counts C1 = 1, C2 = 5:
  P(C1) = 1/6, P(C2) = 5/6
  Gini = 1 - (1/6)² - (5/6)² = 0.278

Node with class counts C1 = 2, C2 = 4:
  P(C1) = 2/6, P(C2) = 4/6
  Gini = 1 - (2/6)² - (4/6)² = 0.444
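The same three worked examples as a small Python sketch (the helper name is mine):

```python
def gini(counts):
    """Gini index of a node: 1 minus the sum over classes of p(j|t) squared."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(round(gini([0, 6]), 3))   # 0.0   - all records in one class (purest node)
print(round(gini([1, 5]), 3))   # 0.278
print(round(gini([2, 4]), 3))   # 0.444 - closest to an even split of the two classes
```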
25Splitting Based on GINI
- Used in CART, SLIQ, SPRINT.
- When a node p is split into k partitions (children), the quality of the split is computed as
  GINIsplit = Σi=1..k (ni / n) GINI(i)
  where ni = number of records at child i and n = number of records at node p.
26 Binary Attributes: Computing GINI Index
- Splits into two partitions
- Effect of weighting the partitions: larger and purer partitions are sought.
(Diagram: test B? sends records to Node N1 on Yes and Node N2 on No; N1 holds 7 records (5 of one class, 2 of the other) and N2 holds 5 records (1 and 4).)
  Gini(N1) = 1 - (5/7)² - (2/7)² = 0.408
  Gini(N2) = 1 - (1/5)² - (4/5)² = 0.320
  Gini(children) = 7/12 × 0.408 + 5/12 × 0.320 = 0.371
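A sketch reproducing the numbers above, reusing gini() from the sketch after slide 24; the class counts (5 and 2 in N1, 1 and 4 in N2) are read off the slide's fractions:

```python
def gini_split(partitions):
    """Quality of a split: the child Gini indices weighted by the fraction of records in each child."""
    n = sum(sum(part) for part in partitions)
    return sum(sum(part) / n * gini(part) for part in partitions)

# Node N1 holds 5 records of one class and 2 of the other; node N2 holds 1 and 4.
print(round(gini([5, 2]), 3))                  # 0.408
print(round(gini([1, 4]), 3))                  # 0.320
print(round(gini_split([[5, 2], [1, 4]]), 3))  # 0.371
```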
27 Categorical Attributes: Computing Gini Index
- For each distinct value, gather counts for each class in the dataset
- Use the count matrix to make decisions (e.g. for a multi-way split, one column of class counts per attribute value)
28Decision Tree Induction
- Induce(Examples, Attributes, Default) returns a decision tree
- (a) if Examples is empty, return a leaf node labelled with Default
- (b) else if all elements of Examples have the same decision D, return a leaf node labelled with D
- (c) else if Attributes is empty, return a leaf node labelled with Default
- (d) else remove from Attributes the attribute A which provides the most information
  - create a node A
  - for each possible value V of A
    - let E be the subset of Examples where the value of attribute A is V
    - let D be the majority decision in Examples
    - let N = Induce(E, Attributes, D)
    - add a directed arc from A, labelled V, ending at N
  - return A (together with the tree rooted on it)
29Learned Decision Tree
30Learned Decision Tree
31Learned Decision Tree
32Learned Decision Tree
33Agreed Decision Tree
34Problems (1)
- Problems with examples
- Missing the value of an attribute for an example
- Incorrect example (errors in data collection)
- Both can affect performance element and learning
- What do we do with continuous (or very-many-valued) attributes? (e.g. price)
  - discretise the attribute (e.g. cheap / OK / expensive)
  - normally done by hand, but it would be better if it could be done automatically
35 Continuous Attributes: Computing Gini Index
- Use binary decisions based on one value
- Several choices for the splitting value v
  - number of possible splitting values = number of distinct values
- Each splitting value has a count matrix associated with it
  - class counts in each of the partitions, A < v and A ≥ v
- Simple method to choose the best v:
  - for each v, scan the database to gather the count matrix and compute its Gini index
  - computationally inefficient! Repetition of work.
36 Continuous Attributes: Computing Gini Index (continued)
- For efficient computation, for each attribute:
  - sort the records on the attribute's values
  - linearly scan these values, each time updating the count matrix and computing the Gini index
  - choose the split position that has the least Gini index
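A sketch of the sort-then-scan idea on an invented numeric attribute (the data and names here are assumptions, not from the slides): sort once, then walk the boundaries between consecutive values, updating the class counts incrementally rather than rescanning the data for every candidate split. It reuses gini() from the sketch after slide 24.

```python
def best_numeric_split(records):
    """records: list of (value, cls) with cls in {'+', '-'}.  Returns (best split value, its weighted Gini)."""
    records = sorted(records)                       # one sort, then a single linear scan
    n = len(records)
    total = {"+": 0, "-": 0}
    for _, cls in records:
        total[cls] += 1
    left = {"+": 0, "-": 0}                         # counts for the partition A < v, updated incrementally
    best_v, best_score = None, float("inf")
    for i in range(n - 1):
        left[records[i][1]] += 1
        if records[i][0] == records[i + 1][0]:      # can only split between distinct values
            continue
        right = {c: total[c] - left[c] for c in total}
        nl, nr = i + 1, n - i - 1
        score = nl / n * gini(list(left.values())) + nr / n * gini(list(right.values()))
        if score < best_score:
            best_v, best_score = (records[i][0] + records[i + 1][0]) / 2, score
    return best_v, best_score

# e.g. incomes labelled by some yes/no outcome (invented numbers):
data = [(60, "-"), (70, "-"), (75, "-"), (85, "+"), (90, "+"), (95, "+"), (100, "-"), (120, "-")]
print(best_numeric_split(data))    # about (80.0, 0.3): the best cut falls between 75 and 85
```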
37Problems (2)
- Not enough examples
  - a branch for a value which has no examples (use the default)
- Not enough attributes
  - leaf nodes with both positive and negative classifications
- Over-fitting
  - too many degrees of freedom (questions)
  - use pruning to eliminate questions with negligible information gain
38Overfitting
(Plot: the same 4 data points fitted two ways)
  y = a1·x + a0: 4 data points, 2 degrees of freedom
  y = a5·x⁵ + a4·x⁴ + a3·x³ + a2·x² + a1·x + a0: 4 data points, 6 degrees of freedom
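A small sketch of the same effect with NumPy (the data are invented; a cubic with 4 degrees of freedom stands in for the slide's quintic so that the exact fit through 4 points is well defined): the flexible fit matches the training points perfectly but extrapolates far worse than the straight line.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0 + np.array([0.1, -0.2, 0.15, -0.05])   # roughly linear data with a little noise

line = np.polyfit(x, y, 1)     # a1*x + a0: cannot pass through every noisy point
cubic = np.polyfit(x, y, 3)    # a3*x^3 + ... + a0: passes exactly through all four points

x_new, y_true = 4.0, 2.0 * 4.0 + 1.0                      # an unseen input and its true value
print(abs(np.polyval(line, x_new) - y_true))              # small error (about 0.03)
print(abs(np.polyval(cubic, x_new) - y_true))             # much larger error (about 2.0)
```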
39How do we assess performance?
- Put the system into production and see how well it performs?
  - not safe in any domain where the results are important
- Save some of our examples and use them to test results:
  1. Get a set of examples
  2. Split into two subsets: training and test
  3. Learn using the training set
  4. Evaluate using the test set
- Only put into production when we are happy
- It is very important that the test set and training set have no examples in common
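A sketch of steps 1-4, reusing the EXAMPLES data and the induce()/majority() sketches from earlier slides; the 2/3 : 1/3 split ratio and the fall-back default for unseen branch values are my choices.

```python
import random

def predict(tree, case, default="NO"):
    """Follow the tree like classify() above, but fall back to a default for unseen branch values."""
    while isinstance(tree, dict):
        tree = tree["branches"].get(case[tree["test"]], default)
    return tree

def evaluate(examples, attributes, train_fraction=2 / 3, seed=0):
    """Shuffle, split into disjoint training and test sets, learn on one, score on the other."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    train, test = shuffled[:cut], shuffled[cut:]            # no example appears in both sets
    tree = induce(train, attributes, majority(train))
    correct = sum(1 for e in test if predict(tree, e) == e["Decide"])
    return correct / len(test)

print(evaluate(EXAMPLES, ATTRIBUTES))    # fraction of held-out examples classified correctly
```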
40 Application: GASOIL
- GASOIL is an expert system for designing gas/oil separation systems stationed off-shore.
- The design depends on proportions of gas, oil and water, flow rate, pressure, density, viscosity, temperature, and others.
- To build by hand would take 100 person-months.
- Built by decision-tree induction: 3 person-months.
- At the time (early '80s), GASOIL was the biggest expert system in the world, containing 2500 rules, and saved BP millions.
41 Application: Learning to fly
- Learning to fly a Cessna on a flight simulator.
- Three skilled pilots performed an assigned flight plan 30 times each.
- Each control action (e.g. on throttle, flaps) created an example.
- 90,000 examples, each described by 20 state variables and categorised by the action taken.
- A decision tree was created.
- It was converted into C and put into the simulator control loop.
- The program flies better than its teachers!
  - probably because generalisation cleans up occasional mistakes
42Incremental inductive learning
43Arch
(Picture: an arrangement of three blocks labelled 1, 2 and 3.)
44Not an arch
(Picture: an arrangement of three blocks labelled 1, 2 and 3.)
45Not an arch
(Picture: an arrangement of three blocks labelled 1, 2 and 3.)
46Arch
(Picture: an arrangement of three blocks labelled 1, 2 and 3.)
47Arch
(Picture: an arrangement of three blocks labelled 1, 2 and 3.)
48Not an arch
(Picture: an arrangement of three blocks labelled 1, 2 and 3.)
49Incremental learning
- Restrict ourselves to binary (yes/no) solutions
- Each hypothesis (performance element) predicts that a certain set of positive examples will satisfy the goal predicate, and that all other examples will not satisfy it.
- The problem is then to find a hypothesis that is consistent with the existing set of examples, and that can be made consistent with new examples.
- Aim to improve our hypothesis for every new example we get, and hope that we eventually get stability.
50True/False Positive/Negative
           positive                         negative
  true     hypothesis: yes / correct: yes   hypothesis: no / correct: no
  false    hypothesis: yes / correct: no    hypothesis: no / correct: yes

NB True/false applies to the prediction of the hypothesis.
51Incremental learning
- Some new examples will be consistent with our current hypothesis (i.e. true positives and true negatives), and so provide no more information.
- Some new examples will be false positives: the hypothesis predicts yes but the correct answer is no.
- Some new examples will be false negatives: the hypothesis predicts no but the correct answer is yes.
- Inductive learning is then the process of refining a hypothesis by narrowing it to eliminate false positives, and extending it to include false negatives.
52Generalisation and specialisation
(Diagrams: examples plotted as points, where + means the correct answer is yes and - means it is no; a closed boundary marks the region where the hypothesis says yes.
 Discover a false negative (a + outside the boundary): generalise, enlarging the region to include it.
 Discover a false positive (a - inside the boundary): specialise, shrinking the region to exclude it.)
53Current Best Hypothesis
- Let S = the set of examples
- Let O = {}, the set of old examples
- Select E, a positive example, from S
- Move E to O
- Construct H, a hypothesis consistent with E
- While S is not empty, select another example E
  - move E to O
  - if E is a false negative with respect to (w.r.t.) H then
    - generalise H so that H is consistent with all members of O
  - else if E is a false positive w.r.t. H then
    - specialise H so that H is consistent with all members of O
- Return H
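A skeleton of this loop in Python. The hypothesis representation and the generalise/specialise operations are deliberately left abstract (they are supplied by the arch example on slides 56-62), and the backtracking described on slide 58 is omitted.

```python
def current_best_hypothesis(examples, initial, predicts, generalise, specialise):
    """examples: list of (description, is_positive) pairs; initial: the first positive example's description.
    predicts(H, d) says whether hypothesis H classifies description d as positive.
    generalise(H, old) / specialise(H, old) return a revised H consistent with all the old examples."""
    hypothesis = initial        # construct H from the first positive example
    old = []                    # O: the examples processed so far
    for description, is_positive in examples:
        old.append((description, is_positive))
        says_yes = predicts(hypothesis, description)
        if says_yes and not is_positive:            # false positive: narrow the hypothesis
            hypothesis = specialise(hypothesis, old)
        elif not says_yes and is_positive:          # false negative: widen the hypothesis
            hypothesis = generalise(hypothesis, old)
        # true positives and true negatives need no action
    return hypothesis
```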
54Learning the definition of an "arch"
Suppose we are allowed to build arrangements of
any three objects from a set.
How can we get a computer system to learn, from
examples like these, the definition of an arch?
55Representation
- We need a language for describing objects and
concepts
  - attribute descriptions describe a single object in terms of its features
  - relational descriptions describe a composite object in terms of its components and the relationships between them
56The modelling language
(support X Y)   X supports Y; X and Y can take the values 1, 2 or 3
(touches X Y)   X touches Y (and Y touches X); X and Y can take the values 1, 2 or 3
(shape X S)     the shape of X is S; X can take the values 1, 2 or 3, and S can be triangle, rectangle, ellipse or square
(¬ ...)         negated predicates are allowed

Example descriptions (for the two pictured arrangements of blocks 1, 2 and 3):
  (shape 1 rectangle) (shape 2 rectangle) (shape 3 rectangle) (support 1 3) (support 2 3)
  (shape 1 rectangle) (shape 2 rectangle) (shape 3 ellipse) (support 1 3)
57The hypothesis (performance element)
The performance element would be a simple rule, for example:
  if (shape 1 rectangle) (shape 2 rectangle) (shape 3 rectangle) (support 1 3) (support 2 3)
  then yes, we have an arch
How do we learn it by being shown successive examples?
58Searching for Current Best
- To begin, we will take the first positive example to be our hypothesis.
- At each stage, we will choose a minimum generalisation or minimum specialisation of our current hypothesis that is consistent.
- If no such hypothesis is possible, then backtrack to the last point where we had a choice, and choose the next generalisation or specialisation.
- The problem is now search, but we have to specify how to generalise and specialise our definitions.
59Generalisation
- We can generalise a positive predicate, e.g. (shape 1 rectangle), by removing variable bindings:
  - (shape 1 ?)   the shape of 1 doesn't matter
  - (shape ? rectangle)   everything is a rectangle
  - (shape ? ?)   shape is irrelevant
- We can generalise a negated predicate, e.g. (¬ touch 1 ?), by adding variable bindings (i.e. making the predicate inside the negation more restrictive):
  - (¬ touch 1 2)
- We can generalise a hypothesis by generalising one of its predicates, or by removing a predicate.
  - (shape ? ?) is equivalent to removal
60Specialisation
- We can specialise a positive predicate, e.g. (shape 1 ?), by adding variable bindings:
  - (shape 1 rectangle)
- We can specialise a negated predicate, e.g. (¬ touch 1 2), by removing variable bindings:
  - (¬ touch 1 ?)
- We can specialise a hypothesis by specialising one of its predicates, or by adding a predicate.
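A sketch of these operations with predicates encoded as Python tuples: '?' stands for an unbound argument and ('not', p) for the negated predicate (¬ p). The encoding and the function names are mine.

```python
# A predicate is a tuple such as ("shape", 1, "rectangle"); ("not", p) negates p; "?" is a wildcard.

def drop_binding(pred, position):
    """Generalise a positive predicate by replacing one bound argument with '?'."""
    return pred[:position] + ("?",) + pred[position + 1:]

def add_binding(pred, position, value):
    """Specialise a positive predicate by binding a '?' argument to a value."""
    return pred[:position] + (value,) + pred[position + 1:]

def generalise_negated(neg, position, value):
    """Generalising ('not', p) means making p itself more restrictive, i.e. adding a binding."""
    return ("not", add_binding(neg[1], position, value))

def specialise_negated(neg, position):
    """Specialising ('not', p) means making p less restrictive, i.e. removing a binding."""
    return ("not", drop_binding(neg[1], position))

print(drop_binding(("shape", 1, "rectangle"), 2))             # ('shape', 1, '?')
print(add_binding(("shape", 1, "?"), 2, "rectangle"))         # ('shape', 1, 'rectangle')
print(generalise_negated(("not", ("touch", 1, "?")), 2, 2))   # ('not', ('touch', 1, 2))
print(specialise_negated(("not", ("touch", 1, 2)), 2))        # ('not', ('touch', 1, '?'))
```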
61Learning the concept of an arch
Example 1: true positive, no action
  (shape 1 rectangle) (shape 2 rectangle) (shape 3 rectangle) (support 1 3) (support 2 3)
Hypothesis 1:
  (shape 1 rectangle) (shape 2 rectangle) (shape 3 rectangle) (support 1 3) (support 2 3)

Example 2: false positive, specialise
  (shape 1 rectangle) (shape 2 rectangle) (shape 3 rectangle) (support 1 3) (support 2 3) (touch 1 2)
Hypothesis 2:
  (shape 1 rectangle) (shape 2 rectangle) (shape 3 rectangle) (support 1 3) (support 2 3) (¬ touch 1 2)
62Learning the concept of an arch
Example 3: true negative, no action
  (shape 1 rectangle) (shape 2 rectangle) (shape 3 ellipse) (support 1 3)
Hypothesis 3 (unchanged):
  (shape 1 rectangle) (shape 2 rectangle) (shape 3 rectangle) (support 1 3) (support 2 3) (¬ touch 1 2)

Example 4: false negative, generalise
  (shape 1 rectangle) (shape 2 rectangle) (shape 3 triangle) (support 1 3) (support 2 3)
Hypothesis 4:
  (shape 1 rectangle) (shape 2 rectangle) (shape 3 ?) (support 1 3) (support 2 3) (¬ touch 1 2)
63 Learning the concept of an arch
Example 5: true negative, no action
  (shape 1 rectangle) (shape 2 rectangle) (shape 3 triangle) (support 1 2) (support 2 3)
Hypothesis 5 (unchanged):
  (shape 1 rectangle) (shape 2 rectangle) (shape 3 ?) (support 1 3) (support 2 3) (¬ touch 1 2)
64Concept Hierarchies
a_shape
  polygon
    trapezium
    rectangle
      square
    triangle
      isosceles
      equilateral
  ellipse

(shape X isosceles) generalises to (shape X triangle)
(shape X triangle) generalises to (shape X polygon)
(shape X rectangle) generalises to (shape X polygon)
65Use of concept hierarchy
E.g. with the hierarchy, Hypothesis 4 would contain (shape 3 polygon).
If this were Example 6, a false negative, we would generalise:
  (Picture: blocks 1 and 2 supporting block 3, which is an ellipse)
and Hypothesis 6 would then contain (shape 3 a_shape).