Decision Tree and Concept Learning
(Transcript of a 65-slide PowerPoint presentation, provided by georgemacl.)
1
Decision Tree and Concept Learning
2
Outline
  • Types of learning
  • Inductive learning
  • non-incremental
  • incremental
  • Decision trees (non-incremental)
  • Current best hypothesis (incremental)
  • Candidate elimination (incremental)

3
Why should programs learn?
  • All the programs seen up to now have been static
  • if we run the programs again on the same data,
    they will do exactly the same as before
  • they cannot learn from their experience
  • they require us to specify everything they will
    ever need to know, at the outset.
  • We would like programs to learn from their
    experience.
  • We would like the possibility of learning being
    continuous.
  • Learning is fundamental to intelligence.

4
How do you learn?
  • By being told; relies on
  • someone to do the telling
  • something to tell!
  • By finding a teacher, who provides a set of
    pre-classified examples (i.e. I/O pairs), or
    taking action and obtaining the correct answer
    from observations: Supervised learning
  • By searching for regularities in unclassified
    data (e.g. clusters): Unsupervised learning
  • By trying things, and seeing which outcomes are
    desirable (i.e. earn rewards), e.g. learning the
    heuristic evaluation function in game-playing:
    Reinforcement learning

5
Inductive learning
  • Learning by example (supervised learning)
  • teacher provides good training instances and the
    learner is expected to generalise
  • in knowledge acquisition it is often easier to
    get an expert to give examples than to give rules
  • this is how experimental science works (the
    'teacher' is the natural world)

6
Application of knowledge (deduction)
Output 1
Knowledge
Input 1
Output 2
Knowledge
Input 2
  • We have
  • inputs
  • knowledge
  • We get
  • outputs

7
Inductive learning (induction)
Output 1
Input 1
Knowledge
Output 2
Input 2


Input n
Output n
  • We have
  • inputs
  • outputs
  • We get
  • knowledge

8
Knowledge as a function
  • Knowledge (performance element) can be described
    as a function
  • given a description, x, for a given object or
    situation, the output is given by f(x) where f
    embodies the knowledge contained in the
    performance element.
  • could be
  • analytical mathematical function
  • lookup table
  • rule set (including STRIPS rules)
  • neural network
  • decision tree
  • etc.

9
Definition of inductive learning
  • Given a set of input/output pairs (examples)
    (x, f(x))
  • where f is unknown, but the output f(x) is
    provided with its corresponding input, x.
  • find a function, h(x) (hypothesis) which best
    approximates f(x).
  • finding implies searching in a space of
    different possible hypotheses.

10
Inductive learning
Non-incremental
Input 1
Output 1
Output 2
Knowledge
Input 2


Incremental
Input 1
Knowledge 1
Output 1
Output 2
Knowledge 2
Input 2

Assume that Knowledge 2 is more complete than
Knowledge 1, etc.
11
Non-incremental vs. Incremental
  • Non-incremental
  • learning from all examples at once
  • Incremental
  • learning by refining from successive examples
  • If you haven't seen all possible examples, you
    can never know that the system is going to give
    the correct answer for a previously unknown
    example.
  • You may never see all possible examples.

12
Wait for a table at a restaurant
  • To avoid arguments, you and your friends want to
    have a clear decision procedure for the situation
    where you turn up at a restaurant and have to
    make a decision as to whether you will wait to
    get a table.
  • In advance you
  • specify a number of attributes
  • draw up a decision tree of your own preferences
  • When you get to the restaurant you use the
    decision tree to find the value of the goal
    predicate Will wait

13
Attributes involved in the decision
  • Is there anything else nearby? Near: yes / no
  • How long will the wait be? Time: 0-10 / 10-30 /
    30-60 / >60
  • Does the restaurant have a bar? Bar: yes / no
  • Is it the weekend? W/E: yes / no
  • Are you hungry? Hun: yes / no
  • How many tables are occupied? Occ: none / some
    / all
  • Is it raining? Rain: yes / no
  • Have you booked? Book: yes / no
  • What type of restaurant? Type: Fren / Chin /
    Ital / US
  • How expensive is the restaurant? Price: cheap /
    OK / exp

14
Agreed Decision Tree: Wait for a table in a
restaurant?
15
Decision trees: Performance element
  • Object or situation described by a set of
    discrete attributes
  • Task is to classify the object
  • binary (yes/no)
  • a member of a discrete set of possible outcomes
  • An internal node represents a test on one
    attribute.
  • An arc represents a possible value for that
    attribute.
  • A leaf node indicates the classification
    resulting from following the path to that node.
  • A decision tree can be converted to a set of
    rules.
  • This example represents a group's subjective way
    of making a decision about whether or not to wait
    for a table at a restaurant.

16
Task
  • An outsider observes the group on a number of
    occasions.
  • The values of all the attributes are noted.
  • Can (s)he learn a decision tree that leads to the
    same conclusion for these examples and will
    predict future behaviour?

17
Set of examples
  • Near Bar W/E Hun Occ Rain Book
    Type Time Decide
  • 1 yes no no yes some exp no yes
    Fren 0-10 YES
  • 2 yes no no yes all cheap no no
    Chin 30-60 NO
  • 3 no yes no no some cheap no no US
    0-10 YES
  • 4 yes no yes yes all cheap yes no
    Chin 10-30 YES
  • 5 yes no yes no all exp no yes
    Fren gt60 NO
  • 6 no yes no yes some ok yes yes
    Ital 0-10 YES
  • 7 no yes no no none cheap yes no US
    0-10 NO
  • 8 no no no yes some ok yes yes
    Chin 0-10 YES
  • 9 no yes yes no all cheap yes no US
    gt60 NO
  • 10 yes yes yes yes all exp no yes
    Ital 10-30 NO
  • 11 no no no no none cheap no no
    Chin 0-10 NO
  • 12 yes yes yes yes all cheap no no US
    30-60 YES

18
Learning decision trees
  • Can a program learn a decision tree by looking at
    these 12 examples?
  • It could build a decision tree which covered only
    the 12 cases - i.e. had a path to a decision for
    those 12 only.
  • That is not very useful - there are over 9000
    possible situations using the attributes I have
    given, but the tree is designed to deal with the
    12.
  • The assumption is that there is a simpler
    solution.
  • What we would like the system to do is to come up
    with a decision tree that is general enough to
    predict the outcome in all possible cases
    (including the 12).
  • We would also like the smallest tree which
    satisfies this.
  • Learning is non-incremental.

19
Which attribute? Take them in order...
20
Discriminating Attributes
We will determine which attribute provides the
most information at each stage, and use that as
the root of a sub-tree.

All 12 examples: yes {1, 3, 4, 6, 8, 12}, no {2, 5, 7, 9, 10, 11}

How many tables are occupied?
  • all: yes {4, 12}, no {2, 5, 9, 10}
  • none: yes {}, no {7, 11}
  • some: yes {1, 3, 6, 8}, no {}

What type of restaurant?
  • French: yes {1}, no {5}
  • Chinese: yes {4, 8}, no {2, 11}
  • US: yes {3, 12}, no {7, 9}
  • Italian: yes {6}, no {10}
21
Splitting heuristic
  • What do we mean by "providing the most
    information"?
  • Based on an information-theoretic measure (Shannon)
  • Aims to minimise the number of tests needed to
    provide a classification
  • "ID3, An algorithm for learning decision trees",
    J. R. Quinlan, 1979.

22
Information
  • Want a numerical measure for each attribute
  • Maximum when attribute is perfect (provides
    perfect separation)
  • Minimum when attribute is useless (no separation)
  • Suppose the attribute has n possible values; the
    ith value has prior probability Pi
  • Information content of the attribute is
  • I(P1, P2, ..., Pn) = -Σi Pi log2 Pi
  • Choose the attribute with the highest information
    content.
  • (A bit more complex in full; see Russell & Norvig.)
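As a quick sketch (not part of the original slides), the formula above can be computed directly; the function below is a minimal illustration:

```python
import math

def information(probs):
    """Information content of a distribution, in bits:
    I(P1, ..., Pn) = -sum(Pi * log2(Pi)). Zero-probability terms contribute 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A 50/50 split carries 1 bit; a skewed split carries less.
print(information([0.5, 0.5]))    # -> 1.0
print(information([0.25, 0.75]))  # ≈ 0.811
```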

23
Measure of Impurity: GINI
  • Gini index for a given node t:
  • GINI(t) = 1 - Σj [p(j|t)]²
  • (NOTE: p(j|t) is the relative frequency of
    class j at node t).
  • Maximum (1 - 1/nc, for nc classes) when records
    are equally distributed among all classes,
    implying least interesting information
  • Minimum (0.0) when all records belong to one
    class, implying most interesting information

24
Examples for computing GINI
  • C1: 0, C2: 6. P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
    Gini = 1 - P(C1)² - P(C2)² = 1 - 0 - 1 = 0
  • C1: 1, C2: 5. P(C1) = 1/6, P(C2) = 5/6
    Gini = 1 - (1/6)² - (5/6)² = 0.278
  • C1: 2, C2: 4. P(C1) = 2/6, P(C2) = 4/6
    Gini = 1 - (2/6)² - (4/6)² = 0.444
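The slide's arithmetic can be reproduced with a few lines of Python; this is an illustrative sketch, with the per-class record counts of a node passed as a list:

```python
def gini(counts):
    """Gini index of a node from its per-class record counts:
    GINI = 1 - sum over classes of p(j|t)^2."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

# The three worked examples above:
print(round(gini([0, 6]), 3))  # -> 0.0
print(round(gini([1, 5]), 3))  # -> 0.278
print(round(gini([2, 4]), 3))  # -> 0.444
```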
25
Splitting Based on GINI
  • Used in CART, SLIQ, SPRINT.
  • When a node p is split into k partitions
    (children), the quality of the split is computed as
  • GINIsplit = Σi (ni / n) GINI(i)
  • where ni = number of records at child i,
  • and n = number of records at node p.

26
Binary Attributes: Computing GINI Index
  • Splits into two partitions
  • Effect of weighing partitions: larger and purer
    partitions are sought.

Split on attribute B? (Yes → Node N1, No → Node N2)
N1: C1 = 5, C2 = 2. Gini(N1) = 1 - (5/7)² - (2/7)² = 0.408
N2: C1 = 1, C2 = 4. Gini(N2) = 1 - (1/5)² - (4/5)² = 0.320
Gini(children) = 7/12 × 0.408 + 5/12 × 0.320 = 0.371
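A minimal sketch of the weighted computation, assuming each child is given as a list of per-class counts:

```python
def gini(counts):
    """Gini index of one node from its per-class record counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(children):
    """Weighted Gini of a split: sum over children of (n_i / n) * GINI(i)."""
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * gini(c) for c in children)

# Attribute B splits 12 records into N1 (5 of C1, 2 of C2)
# and N2 (1 of C1, 4 of C2), as on the slide:
print(round(gini_split([[5, 2], [1, 4]]), 3))  # -> 0.371
```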
27
Categorical Attributes: Computing Gini Index
  • For each distinct value, gather counts for each
    class in the dataset
  • Use the count matrix to make decisions

Multi-way split
28
Decision Tree Induction
  • Induce(Examples, Attributes, Default) → Decision
    tree
  • if (a) Examples = {}: return a leaf node;
    the leaf node is labelled with the Default
  • else if (b) all elements of Examples have the
    same decision D: return a leaf node;
    the leaf node is labelled with D
  • else if (c) Attributes = {}: return a leaf
    node; the leaf node is labelled with the Default
  • else (d) delete A, the element of Attributes
    which provides most information.
  • create a node A
  • for each possible value V of A
  • let E be the subset of Examples where the value
    of attribute A is V
  • let D be the majority decision in Examples
  • let N = Induce(E, Attributes, D)
  • add a directed arc from A, labelled V, ending
    at N
  • return A (together with the tree rooted on it)
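The pseudocode above can be sketched in Python. This is an illustrative implementation, not the original ID3 code: the 'decision' key and the tuple-of-dicts tree shape are representation choices made here, and "provides most information" is implemented as the lowest weighted entropy of the children.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of decisions, in bits."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def induce(examples, attributes, default):
    """Sketch of Induce: each example is a dict of attribute
    values plus a 'decision' key (an assumed representation)."""
    if not examples:                              # (a) no examples left
        return default
    decisions = [e['decision'] for e in examples]
    if len(set(decisions)) == 1:                  # (b) unanimous decision
        return decisions[0]
    if not attributes:                            # (c) no attributes left
        return default

    def split(exs, a):                            # group examples by value of a
        groups = {}
        for e in exs:
            groups.setdefault(e[a], []).append(e)
        return groups

    def remainder(a):                             # weighted entropy after split
        return sum(len(s) / len(examples) *
                   entropy([e['decision'] for e in s])
                   for s in split(examples, a).values())

    a = min(attributes, key=remainder)            # (d) most informative attribute
    rest = [x for x in attributes if x != a]
    majority = Counter(decisions).most_common(1)[0][0]
    return (a, {v: induce(s, rest, majority)
                for v, s in split(examples, a).items()})

# Tiny illustrative data set: Occ predicts the decision, Rain is noise.
data = [
    {'Occ': 'some', 'Rain': 'no',  'decision': 'YES'},
    {'Occ': 'none', 'Rain': 'no',  'decision': 'NO'},
    {'Occ': 'some', 'Rain': 'yes', 'decision': 'YES'},
    {'Occ': 'none', 'Rain': 'yes', 'decision': 'NO'},
]
print(induce(data, ['Occ', 'Rain'], 'NO'))
# -> ('Occ', {'some': 'YES', 'none': 'NO'})
```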

29-32
Learned Decision Tree
(the learned tree is built up over slides 29-32; diagrams not in the transcript)
33
Agreed Decision Tree
34
Problems (1)
  • Problems with examples
  • Missing the value of an attribute for an example
  • Incorrect example (errors in data collection)
  • Both can affect performance element and learning
  • What do we do with continuous (or very many)
    valued attributes? (e.g. price)
  • discretise the attribute (e.g. cheap / OK /
    expensive)
  • normally done by hand, but would be better if it
    could be done automatically.

35
Continuous Attributes: Computing Gini Index
  • Use binary decisions based on one value
  • Several choices for the splitting value
  • Number of possible splitting values = number of
    distinct values
  • Each splitting value has a count matrix
    associated with it
  • Class counts in each of the partitions, A < v and
    A ≥ v
  • Simple method to choose the best v
  • For each v, scan the database to gather the count
    matrix and compute its Gini index
  • Computationally inefficient! Repetition of work.

36
Continuous Attributes: Computing Gini Index...
  • For efficient computation for each attribute,
  • Sort the attribute on values
  • Linearly scan these values, each time updating
    the count matrix and computing gini index
  • Choose the split position that has the least gini
    index
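The sorted-scan idea can be sketched as follows, assuming binary splits of the form A < v versus A ≥ v and the attribute given as (value, label) pairs (names here are illustrative):

```python
from collections import Counter

def best_threshold(pairs):
    """One linear scan over sorted (value, label) pairs, updating the class
    counts on each side and keeping the split with the lowest weighted Gini.
    Returns (weighted_gini, threshold)."""
    data = sorted(pairs)
    n = len(data)
    left = Counter()
    right = Counter(label for _, label in data)

    def gini(counts, total):
        return 1.0 - sum((c / total) ** 2 for c in counts.values())

    best = (float('inf'), None)
    for i, (v, label) in enumerate(data[:-1]):
        left[label] += 1
        right[label] -= 1
        if data[i + 1][0] == v:      # can only split between distinct values
            continue
        k = i + 1
        w = k / n * gini(left, k) + (n - k) / n * gini(right, n - k)
        if w < best[0]:
            best = (w, (v + data[i + 1][0]) / 2)
    return best

print(best_threshold([(60, 'NO'), (70, 'NO'), (75, 'YES'), (85, 'YES')]))
# -> (0.0, 72.5)
```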

37
Problems (2)
  • Not enough examples
  • branch for value which has no examples (use
    default)
  • Not enough attributes
  • leaf nodes with positive and negative
    classifications
  • Over-fitting
  • too many degrees of freedom (questions)
  • use pruning to eliminate questions with
    negligible information gain

38
Overfitting
y
y a1x a0 4 data points 2 degrees of freedom
y a5x5 a4x4 a3x3 a2x2 a1 x
a0 4 data points 6 degrees of freedom
x
39
How do we assess performance?
  • Put the system into production and see how well
    it performs?
  • not safe in any domain where the results are
    important
  • Save some of our examples, and use them to test
    results.
  • 1. Get a set of examples
  • 2. Split into two subsets training and test
  • 3. Learn using the training set
  • 4. Evaluate using the test set
  • Only put into production when we are happy
  • It is very important that the test set and
    training set have no examples in common
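Steps 1-2 above can be sketched as a simple shuffle-and-cut; the function name and test fraction are illustrative choices, not from the slides:

```python
import random

def split_examples(examples, test_fraction=0.25, seed=0):
    """Shuffle, then split so the training and test sets share no examples."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

train, test = split_examples(range(12))
assert not set(train) & set(test)      # no examples in common
print(len(train), len(test))           # -> 9 3
```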

40
Application GASOIL
  • GASOIL is an expert system for designing gas/oil
    separation systems stationed off-shore.
  • The design depends on
  • proportions of gas, oil and water, flow rate,
    pressure, density, viscosity, temperature, and
    others.
  • To build by hand would take 100 person-months
  • Built by decision-tree induction: 3
    person-months
  • At the time (early '80s), GASOIL was the biggest
    Expert System in the world, containing 2500
    rules, and saved BP millions.

41
Application Learning to fly
  • Learning to fly a Cessna on a flight simulator.
  • Three skilled pilots performed an assigned flight
    plan 30 times each.
  • Each control action (e.g. on throttle, flaps)
    created an example.
  • 90,000 examples, each described by 20 state
    variables and each categorised by the action
    taken
  • Decision tree created.
  • Converted into C and put into the simulator
    control loop.
  • Program flies better than teachers!
  • probably because generalisation cleans up
    occasional mistakes

42
Incremental inductive learning
43
Arch
(slides 43-48 each show an arrangement of three blocks, labelled 1, 2
and 3; the figures are not in the transcript)
44
Not an arch
45
Not an arch
46
Arch
47
Arch
48
Not an arch
49
Incremental learning
  • Restrict ourselves to binary (yes/no) solutions
  • Each hypothesis (performance element) predicts
    that a certain set of positive examples will
    satisfy the goal predicate, and that all other
    examples will not satisfy it.
  • The problem is then to find a hypothesis that is
    consistent with the existing set of examples, and
    that can be made consistent with new examples.
  • Aim to improve our hypothesis for every new
    example we get and hope that we eventually get
    stability.

50
True/False Positive/Negative
                 positive                   negative
  true           hypothesis says yes;       hypothesis says no;
                 correct answer is yes      correct answer is no
  false          hypothesis says yes;       hypothesis says no;
                 correct answer is no       correct answer is yes

NB True/false applies to the prediction of the
hypothesis.
51
Incremental learning
  • Some new examples will be consistent with our
    current hypothesis (i.e. true positives and true
    negatives), and so provide no more information.
  • Some new examples will be false positive if the
    hypothesis predicts yes but the correct answer is
    no.
  • Some new examples will be false negative, if the
    hypothesis predicts no but the correct answer is
    yes.
  • Inductive learning is then the process of
    refining a hypothesis by narrowing it to
    eliminate false positives, and extending it to
    include false negatives.

52
Generalisation and specialisation
(Figure: examples plotted as + for yes and - for no, with a boundary
enclosing the examples the hypothesis says yes to. Discovering a false
negative, a + outside the boundary, means we generalise the hypothesis
to include it; discovering a false positive, a - inside the boundary,
means we specialise the hypothesis to exclude it.)
53
Current Best Hypothesis
  • Let S = the set of examples
  • Let O = {}, the set of old examples
  • Select E, a positive example, from S
  • Move E to O
  • Construct H, a hypothesis consistent with E
  • While S ≠ {}, select another example E
  • move E to O
  • if E is a false negative with respect to
    (w.r.t.) H then
  • generalise H so that H is consistent with all
    members of O
  • else if E is a false positive w.r.t. H then
  • specialise H so that H is consistent with all
    members of O
  • return H
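A much-simplified sketch of this loop, assuming hypotheses are sets of (attribute, value) literals that must all hold for a "yes". Backtracking is omitted, and generalisation is not re-checked against old negatives, so this is an illustration of the control flow, not a full implementation:

```python
def current_best(examples):
    """examples: list of (attribute dict, bool label) pairs."""
    it = iter(examples)
    old = []
    h = None
    for attrs, label in it:            # start from the first positive example
        old.append((attrs, label))
        if label:
            h = set(attrs.items())
            break
    for attrs, label in it:
        old.append((attrs, label))
        lits = set(attrs.items())
        if label and not h <= lits:    # false negative: generalise
            h &= lits                  # drop literals this positive violates
        elif not label and h <= lits:  # false positive: specialise
            # add a literal shared by every old positive but absent here
            common = set.intersection(*(set(a.items()) for a, p in old if p))
            extra = common - lits
            if not extra:
                raise ValueError('stuck: would need backtracking')
            h.add(sorted(extra)[0])
    return h

examples = [
    ({'shape': 'rect', 'colour': 'red'}, True),
    ({'shape': 'rect', 'colour': 'blue'}, True),     # false negative: generalise
    ({'shape': 'ellipse', 'colour': 'red'}, False),  # true negative: no action
]
print(current_best(examples))  # -> {('shape', 'rect')}
```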

54
Learning the definition of an "arch"
Suppose we are allowed to build arrangements of
any three objects from a set.
How can we get a computer system to learn, from
examples like these, the definition of an arch?
55
Representation
  • We need a language for describing objects and
    concepts
  • attribute descriptions describe a single object
    in terms of its features.
  • relational descriptions describe a composite
    object in terms of its components and the
    relationships between them.

56
The modelling language
(support X Y)   X supports Y; X and Y can take the values 1, 2 or 3
(touches X Y)   X touches Y (and Y touches X);
                X and Y can take the values 1, 2 or 3
(shape X S)     the shape of X is S; X can take the values 1, 2 or 3,
                S can be triangle, rectangle, ellipse, square
(¬ ...)         we allow negated predicates

Two example descriptions (block diagrams not in the transcript):
(shape 1 rectangle) (shape 2 rectangle) (shape 3 rectangle)
(support 1 3) (support 2 3)
(shape 1 rectangle) (shape 2 rectangle) (shape 3 ellipse)
(support 1 3)
57
The hypothesis (performance element)
The performance element would be a simple
rule, for example:
if (shape 1 rectangle) (shape 2 rectangle) (shape 3 rectangle)
   (support 1 3) (support 2 3)
then yes, we have an arch
How do we learn it by being shown successive examples?
58
Searching for Current Best
  • To begin, we will take the first positive example
    to be our hypothesis.
  • At each stage, we will choose a minimum
    generalisation or minimum specialisation of our
    current hypothesis that is consistent.
  • If no such hypothesis is possible, then backtrack
    to the last point where we had a choice, and
    choose the next generalisation or specialisation.
  • The problem is now search, but we have to specify
    how to generalise and specialise our definitions.

59
Generalisation
  • We can generalise a positive predicate, e.g.
    (shape 1 rectangle), by removing variable
    bindings
  • (shape 1 ?) the shape of 1 doesn't matter
  • (shape ? rectangle) everything is a
    rectangle
  • (shape ? ?) shape is irrelevant
  • We can generalise a negated predicate, e.g.
    ¬(touch 1 ?), by adding variable bindings (i.e.
    make the negated predicate more restrictive)
  • ¬(touch 1 2)
  • We can generalise a hypothesis by generalising
    one of its predicates, or by removing a
    predicate.
  • (shape ? ?) is equivalent to removal

60
Specialisation
  • We can specialise a positive predicate, e.g.
    (shape 1 ?), by adding variable bindings
  • (shape 1 rectangle)
  • We can specialise a negated predicate, e.g.
    ¬(touch 1 2), by removing variable bindings
  • ¬(touch 1 ?)
  • We can specialise a hypothesis by specialising
    one of its predicates, or by adding a predicate.

61
Learning the concept of an arch
Example 1: true positive - no action
Example:      (shape 1 rectangle) (shape 2 rectangle) (shape 3 rectangle)
              (support 1 3) (support 2 3)
Hypothesis 1: (shape 1 rectangle) (shape 2 rectangle) (shape 3 rectangle)
              (support 1 3) (support 2 3)

Example 2: false positive - specialise
Example:      (shape 1 rectangle) (shape 2 rectangle) (shape 3 rectangle)
              (support 1 3) (support 2 3) (touch 1 2)
Hypothesis 2: (shape 1 rectangle) (shape 2 rectangle) (shape 3 rectangle)
              (support 1 3) (support 2 3) ¬(touch 1 2)
62
Learning the concept of an arch
Example 3: true negative - no action
Example:      (shape 1 rectangle) (shape 2 rectangle) (shape 3 ellipse)
              (support 1 3)
Hypothesis 3: (shape 1 rectangle) (shape 2 rectangle) (shape 3 rectangle)
              (support 1 3) (support 2 3) ¬(touch 1 2)

Example 4: false negative - generalise
Example:      (shape 1 rectangle) (shape 2 rectangle) (shape 3 triangle)
              (support 1 3) (support 2 3)
Hypothesis 4: (shape 1 rectangle) (shape 2 rectangle) (shape 3 ?)
              (support 1 3) (support 2 3) ¬(touch 1 2)
63

Learning the concept of an arch
Example 5: true negative - no action
Example:      (shape 1 rectangle) (shape 2 rectangle) (shape 3 triangle)
              (support 1 2) (support 2 3)
Hypothesis 5: (shape 1 rectangle) (shape 2 rectangle) (shape 3 ?)
              (support 1 3) (support 2 3) ¬(touch 1 2)
64
Concept Hierarchies
a_shape
  polygon
    trapezium
    rectangle
      square
    triangle
      isosceles
      equilateral
  ellipse

(shape X isosceles) generalises to (shape X triangle)
(shape X triangle) generalises to (shape X polygon)
(shape X rectangle) generalises to (shape X polygon)
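One way to support minimal generalisation steps is a parent-pointer table over the hierarchy above; this encoding is an assumption made here, not something given on the slides:

```python
# Parent links encoding the concept hierarchy (assumed representation).
PARENT = {
    'isosceles': 'triangle', 'equilateral': 'triangle',
    'square': 'rectangle',
    'trapezium': 'polygon', 'rectangle': 'polygon', 'triangle': 'polygon',
    'polygon': 'a_shape', 'ellipse': 'a_shape',
}

def generalise_shape(s):
    """One minimal generalisation step: move one level up the hierarchy."""
    return PARENT.get(s)

print(generalise_shape('isosceles'))  # -> triangle
print(generalise_shape('rectangle'))  # -> polygon
print(generalise_shape('polygon'))    # -> a_shape
```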
65
Use of concept hierarchy
E.g. Hypothesis 4 would contain (shape 3 polygon).
If this were Example 6: false negative - generalise
(figure of the example not in the transcript)
then Hypothesis 6 would contain (shape 3 a_shape)