Title: Decision Tree and Concept Learning
1Decision Tree and Concept Learning
2Outline
- Types of learning
- Inductive learning
- non-incremental
- incremental
- Decision trees (non-incremental)
- Current best hypothesis (incremental)
- Candidate elimination (incremental)
3Why should programs learn?
- All the programs seen up to now have been static:
  - if we run the programs again on the same data, they will do exactly the same as before; they cannot learn from their experience
  - they require us to specify everything they will ever need to know at the outset.
- We would like programs to learn from their experience.
- We would like the possibility of learning being continuous.
- Learning is fundamental to intelligence.
4How do you learn?
- By being told; relies on
  - someone to do the telling
  - something to tell!
- By finding a teacher who provides a set of pre-classified examples (i.e. I/O pairs), or by taking actions and observing the correct answers: Supervised learning
- By searching for regularities in unclassified data (e.g. clusters): Unsupervised learning
- By trying things and seeing which outcomes are desirable (i.e. earn rewards), e.g. learning the heuristic evaluation function in game-playing: Reinforcement learning
5Inductive learning
- Learning by example (supervised learning)
- The teacher provides good training instances and the learner is expected to generalise.
- In knowledge acquisition it is often easier to get an expert to give examples than to give rules.
- This is how experimental science works (the 'teacher' is the natural world).
6Application of knowledge (deduction)
(Diagram: deduction - applying Knowledge to Input 1 gives Output 1, and applying it to Input 2 gives Output 2.)
7Inductive learning (induction)
(Diagram: induction - the pairs (Input 1, Output 1), (Input 2, Output 2), ..., (Input n, Output n) are used to construct Knowledge.)
8Knowledge as a function
- Knowledge (performance element) can be described as a function:
  - given a description, x, of an object or situation, the output is given by f(x), where f embodies the knowledge contained in the performance element.
- could be
- analytical mathematical function
- lookup table
- rule set (including STRIPS rules)
- neural network
- decision tree
- etc.
9Definition of inductive learning
- Given a set of input/output pairs (examples) (x, f(x))
  - where f is unknown, but the output f(x) is provided with its corresponding input, x
- find a function h(x) (the hypothesis) which best approximates f(x).
- 'finding' implies searching in a space of different possible hypotheses.
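To make this concrete, here is a minimal Python sketch (the example pairs and the three candidate hypotheses are invented for illustration): learning is a search through a space of candidate functions for one that agrees with the observed (x, f(x)) pairs.

```python
# A tiny, invented hypothesis space: learning = searching it for the best fit.
examples = [(0, 1), (1, 3), (2, 5), (3, 7)]          # observed (x, f(x)) pairs; f itself is unknown

hypotheses = {                                        # candidate functions h(x)
    "h1: x + 1":     lambda x: x + 1,
    "h2: 2x + 1":    lambda x: 2 * x + 1,
    "h3: x squared": lambda x: x * x,
}

def error(h):
    """Number of examples the hypothesis gets wrong."""
    return sum(1 for x, fx in examples if h(x) != fx)

best = min(hypotheses, key=lambda name: error(hypotheses[name]))
print(best)   # -> "h2: 2x + 1", the only hypothesis consistent with all the examples
```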
10Inductive learning
(Diagram: non-incremental - all the (Input, Output) pairs are presented together and produce Knowledge in one step.)
(Diagram: incremental - (Input 1, Output 1) produces Knowledge 1; (Input 2, Output 2) refines it into Knowledge 2; and so on.)
Assume that Knowledge 2 is more complete than Knowledge 1, etc.
11Non-incremental vs. Incremental
- Non-incremental
- learning from all examples at once
- Incremental
- learning by refining from successive examples
- If you haven't seen all possible examples, you can never know that the system is going to give the correct answer for a previously unseen example.
- You may never see all possible examples.
12Wait for a table at a restaurant
- To avoid arguments, you and your friends want to
have a clear decision procedure for the situation
where you turn up at a restaurant and have to
make a decision as to whether you will wait to
get a table.
- In advance you
- specify a number of attributes
- draw up a decision tree of your own preferences
- When you get to the restaurant you use the
decision tree to find the value of the goal
predicate WillWait.
13Attributes involved in the decision
- Is there anything else nearby? Near: yes / no
- How long will the wait be? Time: 0-10 / 10-30 / 30-60 / >60
- Does the restaurant have a bar? Bar: yes / no
- Is it the weekend? W/E: yes / no
- Are you hungry? Hun: yes / no
- How many tables are occupied? Occ: none / some / all
- Is it raining? Rain: yes / no
- Have you booked? Book: yes / no
- What type of restaurant? Type: Fren / Chin / Ital / US
- How expensive is the restaurant? Price: cheap / OK / exp
14 Agreed Decision Tree: Wait for a table in a restaurant?
15 Decision trees: Performance element
- Object or situation described by a set of discrete attributes
- Task is to classify the object
  - binary (yes/no)
  - or a member of a discrete set of possible outcomes
- An internal node represents a test on one attribute.
- An arc represents a possible value for that attribute.
- A leaf node indicates the classification resulting from following the path to that node.
- A decision tree can be converted to a set of rules.
- This example represents a group's subjective way of making a decision about whether or not to wait for a table at a restaurant.
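As a concrete illustration of this performance element, a tree can be held as nested nodes: each internal node names the attribute it tests, each arc is a value, each leaf is a classification. The tiny tree below is a hypothetical fragment, not the agreed restaurant tree.

```python
# An internal node is {"test": attribute, "branches": {value: subtree}}; a leaf is just "YES"/"NO".
tree = {
    "test": "Occ",
    "branches": {
        "none": "NO",
        "some": "YES",
        "all":  {"test": "Hun", "branches": {"yes": "YES", "no": "NO"}},
    },
}

def classify(tree, case):
    """Follow tests from the root until a leaf (a plain string) is reached."""
    while isinstance(tree, dict):
        tree = tree["branches"][case[tree["test"]]]
    return tree

print(classify(tree, {"Occ": "all", "Hun": "yes"}))   # -> YES
```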
16Task
- An outsider observes the group on a number of occasions.
- The values of all the attributes are noted.
- Can (s)he learn a decision tree that leads to the same conclusion for these examples and will predict future behaviour?
17Set of examples
  #  Near  Bar  W/E  Hun  Occ   Price  Rain  Book  Type  Time   Decide
  1  yes   no   no   yes  some  exp    no    yes   Fren  0-10   YES
  2  yes   no   no   yes  all   cheap  no    no    Chin  30-60  NO
  3  no    yes  no   no   some  cheap  no    no    US    0-10   YES
  4  yes   no   yes  yes  all   cheap  yes   no    Chin  10-30  YES
  5  yes   no   yes  no   all   exp    no    yes   Fren  >60    NO
  6  no    yes  no   yes  some  ok     yes   yes   Ital  0-10   YES
  7  no    yes  no   no   none  cheap  yes   no    US    0-10   NO
  8  no    no   no   yes  some  ok     yes   yes   Chin  0-10   YES
  9  no    yes  yes  no   all   cheap  yes   no    US    >60    NO
 10  yes   yes  yes  yes  all   exp    no    yes   Ital  10-30  NO
 11  no    no   no   no   none  cheap  no    no    Chin  0-10   NO
 12  yes   yes  yes  yes  all   cheap  no    no    US    30-60  YES
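For the code sketches on later slides it is convenient to hold these 12 examples as Python records. The layout is mine; in particular, "Price" is my name for the cheap/OK/exp column, which the slide leaves unnamed.

```python
# The 12 observed examples; each record maps attribute name -> value, plus the decision.
ATTRIBUTES = ["Near", "Bar", "W/E", "Hun", "Occ", "Price", "Rain", "Book", "Type", "Time"]

ROWS = [
    # Near  Bar    W/E   Hun    Occ     Price    Rain   Book   Type    Time     Decide
    ("yes", "no",  "no", "yes", "some", "exp",   "no",  "yes", "Fren", "0-10",  "YES"),
    ("yes", "no",  "no", "yes", "all",  "cheap", "no",  "no",  "Chin", "30-60", "NO"),
    ("no",  "yes", "no", "no",  "some", "cheap", "no",  "no",  "US",   "0-10",  "YES"),
    ("yes", "no",  "yes","yes", "all",  "cheap", "yes", "no",  "Chin", "10-30", "YES"),
    ("yes", "no",  "yes","no",  "all",  "exp",   "no",  "yes", "Fren", ">60",   "NO"),
    ("no",  "yes", "no", "yes", "some", "ok",    "yes", "yes", "Ital", "0-10",  "YES"),
    ("no",  "yes", "no", "no",  "none", "cheap", "yes", "no",  "US",   "0-10",  "NO"),
    ("no",  "no",  "no", "yes", "some", "ok",    "yes", "yes", "Chin", "0-10",  "YES"),
    ("no",  "yes", "yes","no",  "all",  "cheap", "yes", "no",  "US",   ">60",   "NO"),
    ("yes", "yes", "yes","yes", "all",  "exp",   "no",  "yes", "Ital", "10-30", "NO"),
    ("no",  "no",  "no", "no",  "none", "cheap", "no",  "no",  "Chin", "0-10",  "NO"),
    ("yes", "yes", "yes","yes", "all",  "cheap", "no",  "no",  "US",   "30-60", "YES"),
]

# Each example: a dict of attribute values plus its decision.
EXAMPLES = [dict(zip(ATTRIBUTES, row[:-1]), Decide=row[-1]) for row in ROWS]
```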
18Learning decision trees
- Can a program learn a decision tree by looking at these 12 examples?
- It could build a decision tree which covered only the 12 cases, i.e. had a path to a decision for those 12 only.
- That is not very useful: there are over 9000 possible situations using the attributes I have given, but such a tree is designed to deal with only the 12.
- The assumption is that there is a simpler solution.
- What we would like the system to do is to come up with a decision tree that is general enough to predict the outcome in all possible cases (including the 12).
- We would also like the smallest tree which satisfies this.
- Learning is non-incremental.
19Which attribute? Take them in order ..
20Discriminating Attributes
We will determine which attribute provides the
most information at each stage, and use that as
the root of a sub-tree.
Before splitting: yes {1, 3, 4, 6, 8, 12} / no {2, 5, 7, 9, 10, 11}

Split on "How many tables are occupied?":
  none: yes {} / no {7, 11}
  some: yes {1, 3, 6, 8} / no {}
  all:  yes {4, 12} / no {2, 5, 9, 10}

Split on "What type of restaurant?":
  French:  yes {1} / no {5}
  Italian: yes {6} / no {10}
  US:      yes {3, 12} / no {7, 9}
  Chinese: yes {4, 8} / no {2, 11}
21Splitting heuristic
- What do we mean by 'providing the most information'?
- Based on an information-theoretic measure (Shannon)
- Aims to minimise the number of tests needed to provide a classification
- "ID3, An algorithm for learning decision trees", J. R. Quinlan, 1979
22Information
- Want a numerical measure for each attribute
- Maximum when the attribute is perfect (provides perfect separation)
- Minimum when the attribute is useless (no separation)
- Suppose the attribute has n possible values and the ith value has prior probability Pi
- Information content of the attribute is
  I(P1, P2, ..., Pn) = -Σi Pi log2(Pi)
- Choose the attribute with the highest information content.
- A bit more complex: see R&N.
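A minimal Python sketch of this measure (function names are mine; EXAMPLES and ATTRIBUTES are the records defined after slide 17): the information content of the yes/no decision, and how much of it remains after splitting on an attribute. Choosing the attribute with the highest gain reproduces the intuition of slide 20 that occupancy is informative and restaurant type is not.

```python
from math import log2

def information(probabilities):
    """I(P1, ..., Pn) = -sum Pi * log2(Pi), ignoring zero probabilities."""
    return -sum(p * log2(p) for p in probabilities if p > 0)

def decision_entropy(examples):
    """Information content of the YES/NO decision over a set of examples."""
    n = len(examples)
    yes = sum(1 for e in examples if e["Decide"] == "YES")
    return information([yes / n, (n - yes) / n]) if n else 0.0

def remainder(examples, attribute):
    """Expected information still needed after splitting the examples on the attribute."""
    n = len(examples)
    total = 0.0
    for value in {e[attribute] for e in examples}:
        subset = [e for e in examples if e[attribute] == value]
        total += len(subset) / n * decision_entropy(subset)
    return total

# Information gained by splitting the 12 restaurant examples on an attribute:
gain = lambda attr: decision_entropy(EXAMPLES) - remainder(EXAMPLES, attr)
print(round(gain("Occ"), 3), round(gain("Type"), 3))   # about 0.541 and 0.0: occupancy is far more informative
```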
23 Measure of Impurity: GINI
- Gini index for a given node t:
  GINI(t) = 1 - Σj [p(j | t)]²
  (NOTE: p(j | t) is the relative frequency of class j at node t.)
- Maximum (1 - 1/nc, where nc is the number of classes) when records are equally distributed among all classes, implying least interesting information
- Minimum (0.0) when all records belong to one class, implying most interesting information
24Examples for computing GINI
Node with class counts C1 = 0, C2 = 6:
  P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
  Gini = 1 - P(C1)² - P(C2)² = 1 - 0 - 1 = 0

Node with class counts C1 = 1, C2 = 5:
  P(C1) = 1/6, P(C2) = 5/6
  Gini = 1 - (1/6)² - (5/6)² = 0.278

Node with class counts C1 = 2, C2 = 4:
  P(C1) = 2/6, P(C2) = 4/6
  Gini = 1 - (2/6)² - (4/6)² = 0.444
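The same three worked examples as a small Python sketch (the helper name is mine):

```python
def gini(counts):
    """Gini index of a node: 1 minus the sum over classes of p(j|t) squared."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(round(gini([0, 6]), 3))   # 0.0   - all records in one class (purest node)
print(round(gini([1, 5]), 3))   # 0.278
print(round(gini([2, 4]), 3))   # 0.444 - closest to an even split of the two classes
```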
25Splitting Based on GINI
- Used in CART, SLIQ, SPRINT.
- When a node p is split into k partitions (children), the quality of the split is computed as
  GINIsplit = Σi=1..k (ni / n) GINI(i)
  where ni = number of records at child i and n = number of records at node p.
26 Binary Attributes: Computing GINI Index
- Splits into two partitions
- Effect of weighting the partitions: larger and purer partitions are sought.
(Diagram: test B? sends records to Node N1 on Yes and Node N2 on No; N1 holds 7 records (5 of one class, 2 of the other) and N2 holds 5 records (1 and 4).)
  Gini(N1) = 1 - (5/7)² - (2/7)² = 0.408
  Gini(N2) = 1 - (1/5)² - (4/5)² = 0.320
  Gini(children) = 7/12 × 0.408 + 5/12 × 0.320 = 0.371
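A sketch reproducing the numbers above, reusing gini() from the sketch after slide 24; the class counts (5 and 2 in N1, 1 and 4 in N2) are read off the slide's fractions:

```python
def gini_split(partitions):
    """Quality of a split: the child Gini indices weighted by the fraction of records in each child."""
    n = sum(sum(part) for part in partitions)
    return sum(sum(part) / n * gini(part) for part in partitions)

# Node N1 holds 5 records of one class and 2 of the other; node N2 holds 1 and 4.
print(round(gini([5, 2]), 3))                  # 0.408
print(round(gini([1, 4]), 3))                  # 0.320
print(round(gini_split([[5, 2], [1, 4]]), 3))  # 0.371
```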
27 Categorical Attributes: Computing Gini Index
- For each distinct value, gather counts for each class in the dataset
- Use the count matrix to make decisions (e.g. for a multi-way split, one column of class counts per attribute value)
28Decision Tree Induction
- Induce(Examples, Attributes, Default) returns a decision tree
- (a) if Examples is empty, return a leaf node labelled with Default
- (b) else if all elements of Examples have the same decision D, return a leaf node labelled with D
- (c) else if Attributes is empty, return a leaf node labelled with Default
- (d) else remove from Attributes the attribute A which provides the most information
  - create a node A
  - for each possible value V of A
    - let E be the subset of Examples where the value of attribute A is V
    - let D be the majority decision in Examples
    - let N = Induce(E, Attributes, D)
    - add a directed arc from A, labelled V, ending at N
  - return A (together with the tree rooted on it)
29Learned Decision Tree
30Learned Decision Tree
31Learned Decision Tree
32Learned Decision Tree
33Agreed Decision Tree
34Problems (1)
- Problems with examples
- Missing the value of an attribute for an example
- Incorrect example (errors in data collection)
- Both can affect performance element and learning
- What do we do with continuous (or very-many-valued) attributes? (e.g. price)
  - discretise the attribute (e.g. cheap / OK / expensive)
  - normally done by hand, but it would be better if it could be done automatically
35 Continuous Attributes: Computing Gini Index
- Use binary decisions based on one value
- Several choices for the splitting value v
  - number of possible splitting values = number of distinct values
- Each splitting value has a count matrix associated with it
  - class counts in each of the partitions, A < v and A ≥ v
- Simple method to choose the best v:
  - for each v, scan the database to gather the count matrix and compute its Gini index
  - computationally inefficient! Repetition of work.
36 Continuous Attributes: Computing Gini Index (continued)
- For efficient computation, for each attribute:
  - sort the records on the attribute's values
  - linearly scan these values, each time updating the count matrix and computing the Gini index
  - choose the split position that has the least Gini index
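A sketch of the sort-then-scan idea on an invented numeric attribute (the data and names here are assumptions, not from the slides): sort once, then walk the boundaries between consecutive values, updating the class counts incrementally rather than rescanning the data for every candidate split. It reuses gini() from the sketch after slide 24.

```python
def best_numeric_split(records):
    """records: list of (value, cls) with cls in {'+', '-'}.  Returns (best split value, its weighted Gini)."""
    records = sorted(records)                       # one sort, then a single linear scan
    n = len(records)
    total = {"+": 0, "-": 0}
    for _, cls in records:
        total[cls] += 1
    left = {"+": 0, "-": 0}                         # counts for the partition A < v, updated incrementally
    best_v, best_score = None, float("inf")
    for i in range(n - 1):
        left[records[i][1]] += 1
        if records[i][0] == records[i + 1][0]:      # can only split between distinct values
            continue
        right = {c: total[c] - left[c] for c in total}
        nl, nr = i + 1, n - i - 1
        score = nl / n * gini(list(left.values())) + nr / n * gini(list(right.values()))
        if score < best_score:
            best_v, best_score = (records[i][0] + records[i + 1][0]) / 2, score
    return best_v, best_score

# e.g. incomes labelled by some yes/no outcome (invented numbers):
data = [(60, "-"), (70, "-"), (75, "-"), (85, "+"), (90, "+"), (95, "+"), (100, "-"), (120, "-")]
print(best_numeric_split(data))    # about (80.0, 0.3): the best cut falls between 75 and 85
```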
37Problems (2)
- Not enough examples
  - a branch for a value which has no examples (use the default)
- Not enough attributes
  - leaf nodes with both positive and negative classifications
- Over-fitting
  - too many degrees of freedom (questions)
  - use pruning to eliminate questions with negligible information gain
38Overfitting
(Plot: the same 4 data points fitted two ways)
  y = a1·x + a0: 4 data points, 2 degrees of freedom
  y = a5·x⁵ + a4·x⁴ + a3·x³ + a2·x² + a1·x + a0: 4 data points, 6 degrees of freedom
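A small sketch of the same effect with NumPy (the data are invented; a cubic with 4 degrees of freedom stands in for the slide's quintic so that the exact fit through 4 points is well defined): the flexible fit matches the training points perfectly but extrapolates far worse than the straight line.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0 + np.array([0.1, -0.2, 0.15, -0.05])   # roughly linear data with a little noise

line = np.polyfit(x, y, 1)     # a1*x + a0: cannot pass through every noisy point
cubic = np.polyfit(x, y, 3)    # a3*x^3 + ... + a0: passes exactly through all four points

x_new, y_true = 4.0, 2.0 * 4.0 + 1.0                      # an unseen input and its true value
print(abs(np.polyval(line, x_new) - y_true))              # small error (about 0.03)
print(abs(np.polyval(cubic, x_new) - y_true))             # much larger error (about 2.0)
```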
39How do we assess performance?
- Put the system into production and see how well it performs?
  - not safe in any domain where the results are important
- Save some of our examples and use them to test results:
  1. Get a set of examples
  2. Split into two subsets: training and test
  3. Learn using the training set
  4. Evaluate using the test set
- Only put into production when we are happy
- It is very important that the test set and training set have no examples in common
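A sketch of steps 1-4, reusing the EXAMPLES data and the induce()/majority() sketches from earlier slides; the 2/3 : 1/3 split ratio and the fall-back default for unseen branch values are my choices.

```python
import random

def predict(tree, case, default="NO"):
    """Follow the tree like classify() above, but fall back to a default for unseen branch values."""
    while isinstance(tree, dict):
        tree = tree["branches"].get(case[tree["test"]], default)
    return tree

def evaluate(examples, attributes, train_fraction=2 / 3, seed=0):
    """Shuffle, split into disjoint training and test sets, learn on one, score on the other."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    train, test = shuffled[:cut], shuffled[cut:]            # no example appears in both sets
    tree = induce(train, attributes, majority(train))
    correct = sum(1 for e in test if predict(tree, e) == e["Decide"])
    return correct / len(test)

print(evaluate(EXAMPLES, ATTRIBUTES))    # fraction of held-out examples classified correctly
```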
40 Application: GASOIL
- GASOIL is an expert system for designing gas/oil separation systems stationed off-shore.
- The design depends on proportions of gas, oil and water, flow rate, pressure, density, viscosity, temperature, and others.
- To build by hand would take 100 person-months.
- Built by decision-tree induction: 3 person-months.
- At the time (early '80s), GASOIL was the biggest expert system in the world, containing 2500 rules, and saved BP millions.
41 Application: Learning to fly
- Learning to fly a Cessna on a flight simulator.
- Three skilled pilots performed an assigned flight plan 30 times each.
- Each control action (e.g. on throttle, flaps) created an example.
- 90,000 examples, each described by 20 state variables and categorised by the action taken.
- A decision tree was created.
- It was converted into C and put into the simulator control loop.
- The program flies better than its teachers!
  - probably because generalisation cleans up occasional mistakes
42Incremental inductive learning
43Arch
(Picture: an arrangement of three blocks labelled 1, 2 and 3.)
44Not an arch
(Picture: an arrangement of three blocks labelled 1, 2 and 3.)
45Not an arch
(Picture: an arrangement of three blocks labelled 1, 2 and 3.)
46Arch
(Picture: an arrangement of three blocks labelled 1, 2 and 3.)
47Arch
(Picture: an arrangement of three blocks labelled 1, 2 and 3.)
48Not an arch
(Picture: an arrangement of three blocks labelled 1, 2 and 3.)
49Incremental learning
- Restrict ourselves to binary (yes/no) solutions
- Each hypothesis (performance element) predicts that a certain set of positive examples will satisfy the goal predicate, and that all other examples will not satisfy it.
- The problem is then to find a hypothesis that is consistent with the existing set of examples, and that can be made consistent with new examples.
- Aim to improve our hypothesis for every new example we get, and hope that we eventually get stability.
50True/False Positive/Negative
           positive                         negative
  true     hypothesis: yes / correct: yes   hypothesis: no / correct: no
  false    hypothesis: yes / correct: no    hypothesis: no / correct: yes

NB True/false applies to the prediction of the hypothesis.
51Incremental learning
- Some new examples will be consistent with our current hypothesis (i.e. true positives and true negatives), and so provide no more information.
- Some new examples will be false positives: the hypothesis predicts yes but the correct answer is no.
- Some new examples will be false negatives: the hypothesis predicts no but the correct answer is yes.
- Inductive learning is then the process of refining a hypothesis by narrowing it to eliminate false positives, and extending it to include false negatives.
52Generalisation and specialisation
(Diagrams: examples plotted as points, where + means the correct answer is yes and - means it is no; a closed boundary marks the region where the hypothesis says yes.
 Discover a false negative (a + outside the boundary): generalise, enlarging the region to include it.
 Discover a false positive (a - inside the boundary): specialise, shrinking the region to exclude it.)
53Current Best Hypothesis
- Let S = the set of examples
- Let O = {}, the set of old examples
- Select E, a positive example, from S
- Move E to O
- Construct H, a hypothesis consistent with E
- While S is not empty, select another example E
  - move E to O
  - if E is a false negative with respect to (w.r.t.) H then
    - generalise H so that H is consistent with all members of O
  - else if E is a false positive w.r.t. H then
    - specialise H so that H is consistent with all members of O
- Return H
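A skeleton of this loop in Python. The hypothesis representation and the generalise/specialise operations are deliberately left abstract (they are supplied by the arch example on slides 56-62), and the backtracking described on slide 58 is omitted.

```python
def current_best_hypothesis(examples, initial, predicts, generalise, specialise):
    """examples: list of (description, is_positive) pairs; initial: the first positive example's description.
    predicts(H, d) says whether hypothesis H classifies description d as positive.
    generalise(H, old) / specialise(H, old) return a revised H consistent with all the old examples."""
    hypothesis = initial        # construct H from the first positive example
    old = []                    # O: the examples processed so far
    for description, is_positive in examples:
        old.append((description, is_positive))
        says_yes = predicts(hypothesis, description)
        if says_yes and not is_positive:            # false positive: narrow the hypothesis
            hypothesis = specialise(hypothesis, old)
        elif not says_yes and is_positive:          # false negative: widen the hypothesis
            hypothesis = generalise(hypothesis, old)
        # true positives and true negatives need no action
    return hypothesis
```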
54Learning the definition of an "arch"
Suppose we are allowed to build arrangements of
any three objects from a set.
How can we get a computer system to learn, from
examples like these, the definition of an arch?
55Representation
- We need a language for describing objects and
concepts
  - attribute descriptions describe a single object in terms of its features
  - relational descriptions describe a composite object in terms of its components and the relationships between them
56The modelling language
(support X Y)   X supports Y; X and Y can take the values 1, 2 or 3
(touches X Y)   X touches Y (and Y touches X); X and Y can take the values 1, 2 or 3
(shape X S)     the shape of X is S; X can take the values 1, 2 or 3, and S can be triangle, rectangle, ellipse or square
(¬ ...)         negated predicates are allowed

Example descriptions (for the two pictured arrangements of blocks 1, 2 and 3):
  (shape 1 rectangle) (shape 2 rectangle) (shape 3 rectangle) (support 1 3) (support 2 3)
  (shape 1 rectangle) (shape 2 rectangle) (shape 3 ellipse) (support 1 3)
57The hypothesis (performance element)
The performance element would be a simple rule, for example:
  if (shape 1 rectangle) (shape 2 rectangle) (shape 3 rectangle) (support 1 3) (support 2 3)
  then yes, we have an arch
How do we learn it by being shown successive examples?
58Searching for Current Best
- To begin, we will take the first positive example to be our hypothesis.
- At each stage, we will choose a minimum generalisation or minimum specialisation of our current hypothesis that is consistent.
- If no such hypothesis is possible, then backtrack to the last point where we had a choice, and choose the next generalisation or specialisation.
- The problem is now search, but we have to specify how to generalise and specialise our definitions.
59Generalisation
- We can generalise a positive predicate, e.g. (shape 1 rectangle), by removing variable bindings:
  - (shape 1 ?)   the shape of 1 doesn't matter
  - (shape ? rectangle)   everything is a rectangle
  - (shape ? ?)   shape is irrelevant
- We can generalise a negated predicate, e.g. (¬ touch 1 ?), by adding variable bindings (i.e. making the predicate inside the negation more restrictive):
  - (¬ touch 1 2)
- We can generalise a hypothesis by generalising one of its predicates, or by removing a predicate.
  - (shape ? ?) is equivalent to removal
60Specialisation
- We can specialise a positive predicate, e.g. (shape 1 ?), by adding variable bindings:
  - (shape 1 rectangle)
- We can specialise a negated predicate, e.g. (¬ touch 1 2), by removing variable bindings:
  - (¬ touch 1 ?)
- We can specialise a hypothesis by specialising one of its predicates, or by adding a predicate.
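A sketch of these operations with predicates encoded as Python tuples: '?' stands for an unbound argument and ('not', p) for the negated predicate (¬ p). The encoding and the function names are mine.

```python
# A predicate is a tuple such as ("shape", 1, "rectangle"); ("not", p) negates p; "?" is a wildcard.

def drop_binding(pred, position):
    """Generalise a positive predicate by replacing one bound argument with '?'."""
    return pred[:position] + ("?",) + pred[position + 1:]

def add_binding(pred, position, value):
    """Specialise a positive predicate by binding a '?' argument to a value."""
    return pred[:position] + (value,) + pred[position + 1:]

def generalise_negated(neg, position, value):
    """Generalising ('not', p) means making p itself more restrictive, i.e. adding a binding."""
    return ("not", add_binding(neg[1], position, value))

def specialise_negated(neg, position):
    """Specialising ('not', p) means making p less restrictive, i.e. removing a binding."""
    return ("not", drop_binding(neg[1], position))

print(drop_binding(("shape", 1, "rectangle"), 2))             # ('shape', 1, '?')
print(add_binding(("shape", 1, "?"), 2, "rectangle"))         # ('shape', 1, 'rectangle')
print(generalise_negated(("not", ("touch", 1, "?")), 2, 2))   # ('not', ('touch', 1, 2))
print(specialise_negated(("not", ("touch", 1, 2)), 2))        # ('not', ('touch', 1, '?'))
```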
61Learning the concept of an arch
Example 1: true positive, no action
  (shape 1 rectangle) (shape 2 rectangle) (shape 3 rectangle) (support 1 3) (support 2 3)
Hypothesis 1:
  (shape 1 rectangle) (shape 2 rectangle) (shape 3 rectangle) (support 1 3) (support 2 3)

Example 2: false positive, specialise
  (shape 1 rectangle) (shape 2 rectangle) (shape 3 rectangle) (support 1 3) (support 2 3) (touch 1 2)
Hypothesis 2:
  (shape 1 rectangle) (shape 2 rectangle) (shape 3 rectangle) (support 1 3) (support 2 3) (¬ touch 1 2)
62Learning the concept of an arch
Example 3: true negative, no action
  (shape 1 rectangle) (shape 2 rectangle) (shape 3 ellipse) (support 1 3)
Hypothesis 3 (unchanged):
  (shape 1 rectangle) (shape 2 rectangle) (shape 3 rectangle) (support 1 3) (support 2 3) (¬ touch 1 2)

Example 4: false negative, generalise
  (shape 1 rectangle) (shape 2 rectangle) (shape 3 triangle) (support 1 3) (support 2 3)
Hypothesis 4:
  (shape 1 rectangle) (shape 2 rectangle) (shape 3 ?) (support 1 3) (support 2 3) (¬ touch 1 2)
63 Learning the concept of an arch
Example 5: true negative, no action
  (shape 1 rectangle) (shape 2 rectangle) (shape 3 triangle) (support 1 2) (support 2 3)
Hypothesis 5 (unchanged):
  (shape 1 rectangle) (shape 2 rectangle) (shape 3 ?) (support 1 3) (support 2 3) (¬ touch 1 2)
64Concept Hierarchies
a_shape
  polygon
    trapezium
    rectangle
      square
    triangle
      isosceles
      equilateral
  ellipse

(shape X isosceles) generalises to (shape X triangle)
(shape X triangle) generalises to (shape X polygon)
(shape X rectangle) generalises to (shape X polygon)
65Use of concept hierarchy
E.g. with the hierarchy, Hypothesis 4 would contain (shape 3 polygon).
If this were Example 6, a false negative, we would generalise:
  (Picture: blocks 1 and 2 supporting block 3, which is an ellipse)
and Hypothesis 6 would then contain (shape 3 a_shape).