Advanced Artificial Intelligence Lecture 3: Learning

1
Advanced Artificial Intelligence
Lecture 3: Learning
  • Bob McKay
  • School of Computer Science and Engineering
  • College of Engineering
  • Seoul National University

2
Outline
  • Defining Learning
  • Kinds of Learning
  • Generalisation and Specialisation
  • Some Simple Learning Algorithms

3
References
  • Mitchell, Tom M.: Machine Learning, McGraw-Hill,
    1997, ISBN 0-07-115467-1

4
Defining a Learning System (Mitchell)
  • A program is said to learn from experience E
    with respect to some class of tasks T and
    performance measure P, if its performance at
    tasks in T, as measured by P, improves with
    experience E

5
Specifying a Learning System
  • Specifying the task T, the performance measure P,
    and the experience E defines the learning
    problem. Specifying the learning system requires
    us to define
  • Exactly what knowledge is to be learnt
  • How this knowledge is to be represented
  • How this knowledge is to be learnt

6
Specifying What is to be Learnt
  • Usually, the desired knowledge can be represented
    as a target valuation function V : I → D
  • It takes in information about the problem and
    gives back a desired decision
  • Often, it is unrealistic to expect to learn the
    ideal function V
  • All that is required is a 'good enough'
    approximation, V' : I → D
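  • As a minimal sketch (my own illustration; the
    attribute names anticipate the dataset used later
    in this lecture), V and its approximation V' are
    just functions from problem descriptions to
    decisions

```python
from typing import Tuple

# Illustrative types (assumed for this sketch, not from the lecture):
# an input description and a decision.
Conditions = Tuple[str, str, str, str, str, str]  # (Sky, AirT, Hum, Wnd, Wtr, Fcst)
Decision = bool                                   # enjoy the water sport?

# The ideal V : I -> D and a learnt approximation V' share this
# signature; learning is a search for a good enough approximation.
def v_approx(x: Conditions) -> Decision:
    sky, air_temp = x[0], x[1]
    return sky == "Sun" and air_temp == "Wrm"

print(v_approx(("Sun", "Wrm", "Nml", "Str", "Wrm", "Sam")))  # True
```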

7
Specifying How Knowledge is to be Represented
  • The function V must be represented symbolically,
    in some language L
  • The language may be a well-known language
  • Boolean expressions
  • Arithmetic functions
  • etc.
  • Or for some systems, the language may be defined
    by a grammar

8
Specifying How the Knowledge is to be Learnt
  • If the learning system is to be implemented, we
    must specify an algorithm A, which defines the
    way in which the system is to search the language
    L for an acceptable V
  • That is, we must specify a search algorithm

9
Structure of a Learning System
  • Four modules
  • The Performance System
  • The Critic
  • The Generaliser (or sometimes Specialiser)
  • The Experiment Generator

10
Performance Module
  • This is the system which actually uses the
    function V as we learn it
  • Learning Task
  • Learning to play checkers
  • Performance module
  • System for playing checkers
  • (i.e. it makes the checkers moves)

11
Critic Module
  • The critic module evaluates the performance of
    the current V
  • It produces a set of data from which the system
    can learn further

12
Generaliser/Specialiser Module
  • Takes a set of data and produces a new V for the
    system to run again

13
Experiment Generator
  • Takes the new V
  • It may also use the previous history of the
    system
  • Produces a new experiment for the performance
    system to undertake
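  • How the four modules might fit together can be
    pictured as a loop (a schematic sketch only:
    every name below is invented for illustration,
    and the stubs do no real learning)

```python
# Schematic sketch of the four-module structure (all names invented).

def experiment_generator(v, history):
    """Propose the next task, possibly using the run history."""
    return len(history)                    # placeholder: number the runs

def performance_system(v, experiment):
    """Use the current hypothesis V to perform the task."""
    return {"experiment": experiment, "moves": [v(experiment)]}

def critic(trace):
    """Evaluate a run, producing data the system can learn from."""
    return [(trace["experiment"], m) for m in trace["moves"]]

def generaliser(training_data, v):
    """Produce a new V from the training data (a real learner
    would revise the hypothesis here)."""
    return v

def learn(v, rounds=3):
    history = []
    for _ in range(rounds):
        experiment = experiment_generator(v, history)
        trace = performance_system(v, experiment)
        v = generaliser(critic(trace), v)
        history.append(trace)
    return v

learn(lambda experiment: "some move")
```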

14
The Importance of Bias
  • Important theoretical results from learning
    theory (PAC learning) tell us that learning
    without some presuppositions is infeasible.
  • Practical experience, of both machine and human
    learning, confirms this.
  • To learn effectively, we must limit the class of
    Vs.
  • Two approaches are used in machine learning
  • Language bias
  • Search bias
  • Combined bias: language and search bias are not
    mutually exclusive, and most learning systems
    feature both

15
Language Bias
  • The language L is restricted so that it cannot
    represent all possible target functions V
  • This is usually on the basis of some knowledge we
    have about the likely form of V
  • It introduces risk
  • Our system will fail if L does not contain an
    acceptable V

16
Search Bias
  • The order in which the system searches L is
    controlled, so that promising areas for V are
    searched first

17
The Downside: No Free Lunches
  • Wolpert and Macready's No Free Lunch theorem
    states, in effect, that averaged over all
    problems, all biases are equally good (or bad).
  • Conventional view
  • The choice of a learning system cannot be
    universal
  • It must be matched to the problem being solved
  • In most systems, the bias is not explicit
  • The ability to identify the language and search
    biases of a particular system is an important
    aspect of machine learning
  • Some more recent systems permit the explicit and
    flexible specification of both language and
    search biases

18
No Free Lunch: Does it Matter?
  • Alternative view
  • We aren't interested in all problems
  • We are only interested in problems which have
    solutions of less than some bounded complexity
  • (so that we can understand the solutions)
  • The No Free Lunch Theorem may not apply in this
    case

19
Some Dimensions of Learning
  • Induction vs Discovery
  • Guided learning vs learning from raw data
  • Learning How vs Learning That (vs Learning a
    Better That)
  • Stochastic vs Deterministic; Symbolic vs
    Subsymbolic
  • Clean vs Noisy Data
  • Discrete vs continuous variables
  • Attribute vs Relational Learning
  • The Importance of Background Knowledge

20
Induction vs Discovery
  • Has the target concept been previously
    identified?
  • Pearson cloud classifications from satellite
    data
  • vs
  • AutoClass and H-R diagrams
  • AM and prime numbers
  • BACON and Boyle's Law

21
Guided Learning vs Learning from Raw Data
  • Does the learning system require carefully
    selected examples and counterexamples, as in a
    teacher-student situation?
  • (allows fast learning)
  • CIGOL learning sort/merge
  • vs
  • Garvan institute's thyroid data

22
Learning How vs Learning That vs Learning a
Better That
  • Classifying handwritten symbols
  • Distinguishing vowel sounds (Sejnowski and
    Rosenberg)
  • Learning to fly a (simulated!) plane
  • vs
  • Michalski learning diagnosis of soy diseases
  • vs
  • Mitchell learning about chess forks

23
Stochastic vs Deterministic; Symbolic vs
Subsymbolic
  • Classifying handwritten symbols (stochastic,
    subsymbolic)
  • vs
  • Predicting plant distributions (stochastic,
    symbolic)
  • vs
  • Cloud classification (deterministic, symbolic)
  • vs
  • ? (deterministic, subsymbolic)

24
Clean vs Noisy Data
  • Learning to diagnose errors in programs
  • vs
  • Greater gliders in the Coolangubra

25
Discrete vs Continuous Variables
  • Quinlan's chess end games
  • vs
  • Pearson's clouds (e.g. cloud heights)

26
Attribute vs Relational Learning
  • Predicting plant distributions
  • vs
  • Predicting animal distributions
  • (because plants can't move, they don't care much
    about spatial relationships)

27
The importance of Background Knowledge
  • Learning about faults in a satellite power supply
  • General theory of electrical circuits
  • Knowledge about the particular circuit

28
Generalisation and Learning
  • What do we mean when we say of two propositions,
    S and G, that G is a generalisation of S?
  • Suppose Skippy is a grey kangaroo
  • We would regard 'Kangaroos are grey' as a
    generalisation of 'Skippy is grey'
  • In any world in which 'Kangaroos are grey' is
    true, 'Skippy is grey' will also be true
  • In other words, if G is a generalisation of
    specialisation S, then S is 'at least as true' as
    G,
  • That is, S is true in all states of the world in
    which G is, and perhaps in other states as well.

29
Generalisation and Inference
  • In logic, we assume that if S is true in all
    worlds in which G is, then
  • G ⇒ S
  • That is, G is a generalisation of S exactly when
    G implies S
  • So we can think of learning from S as a search
    for a suitable G for which G ⇒ S
  • In propositional learning, this is often used as
    a definition
  • G is more general than S if and only if G ⇒ S

30
Issues
  • Equating generalisation and logical implication
    is only useful if the validity of an implication
    can be readily computed
  • In the propositional calculus, validity is an
    exponential problem
  • In the predicate calculus, validity is an
    undecidable problem
  • So the definition is not universally useful
  • (although for some parts of logic, e.g. learning
    rules, it is perfectly adequate)
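  • To make the exponential cost concrete, here is a
    brute-force propositional validity checker (my
    own sketch, not from the lecture); it must
    examine all 2^n truth assignments of the n
    variables

```python
from itertools import product

def implies(p, q):
    return (not p) or q

def is_valid(formula, variables):
    """Brute-force validity: check the formula under all 2**n
    assignments; this enumeration is the exponential cost."""
    return all(formula(dict(zip(variables, vals)))
               for vals in product([False, True], repeat=len(variables)))

# Verify the claim on the next slide:
# (A and B -> G)  implies  (A and B and C -> G)  is valid.
f = lambda w: implies(implies(w["A"] and w["B"], w["G"]),
                      implies(w["A"] and w["B"] and w["C"], w["G"]))
print(is_valid(f, ["A", "B", "C", "G"]))   # True
```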

31
A Common Misunderstanding
  • Suppose we have two rules
  • 1) A ∧ B → G
  • 2) A ∧ B ∧ C → G
  • Clearly, we would want 1) to be a generalisation
    of 2)
  • This is OK with our definition, because
  • ((A ∧ B → G) ⇒ (A ∧ B ∧ C → G))
  • is valid
  • But the confusing thing is that ((A ∧ B ∧ C) ⇒
    (A ∧ B)) is also valid
  • If you only look at the hypotheses of the rules,
    rather than the whole rules, the implication runs
    the wrong way around
  • Note that some textbooks are themselves confused
    about this

32
Defining Generalisation
  • We could try to define the properties that
    generalisation must satisfy
  • So let's write down some axioms. We need some
    notation
  • We will write 'S <G G' as shorthand for 'S is
    less general than G'
  • Axioms
  • Transitivity: if A <G B and B <G C then also
    A <G C
  • Antisymmetry: if A <G B then it's not true that
    B <G A
  • Top: there is a unique element, ⊤, for which it
    is always true that A <G ⊤
  • Bottom: there is a unique element, ⊥, for which
    it is always true that ⊥ <G A

33
Picturing Generalisation
  • We can draw a 'picture' of a generalisation
    hierarchy satisfying these axioms

34
Specifying Generalisation
  • In a particular domain, the generalisation
    hierarchy may be defined in either of two ways
  • By giving a general definition of what
    generalisation means in that domain
  • Example: our earlier definition in terms of
    implication
  • By directly specifying the specialisation and
    generalisation operators that may be used to
    climb up and down the links in the generalisation
    hierarchy

35
Learning and Generalisation
  • How does learning relate to generalisation?
  • We can view most learning as an attempt to find
    an appropriate hypothesis that generalises the
    examples.
  • In noise free domains, we usually want the
    generalisation to cover all the examples.
  • Once we introduce noise, we want the
    generalisation to cover 'enough' examples, and
    the interesting bit is in defining what 'enough'
    is.
  • In our picture of a generalisation hierarchy,
    most learning algorithms can be viewed as methods
    for searching the hierarchy.
  • The examples can be pictured as locations low
    down in the hierarchy, and the learning algorithm
    attempts to find a location that is above all (or
    'enough') of them in the hierarchy, but usually,
    no higher 'than it needs to be'

36
Searching the Generalisation Hierarchy
  • The commonest approaches are
  • generalising search
  • the search is upward from the original examples,
    towards the more general hypotheses
  • specialising search
  • the search is downward from the most general
    hypothesis, towards the more special examples
  • Some algorithms use different approaches.
    Mitchell's version space approach, for example,
    tries to 'home in' on the right generalisation
    from both directions at once.

37
Completeness and Generalisation
  • Many approaches to axiomatising generalisation
    add an extra axiom
  • Completeness: for any set S of members of the
    generalisation hierarchy, there is a unique
    'least general generalisation' L, which satisfies
    two properties
  • 1) for every s in S, s <G L
  • 2) if any other L' satisfies 1), then L <G L'
  • If this definition is hard to understand, compare
    it with the definition of 'Least Upper Bound' in
    set theory, or of 'Least Common Multiple' in
    arithmetic
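  • For the conjunctive attribute-tuple language used
    later in this lecture, least general
    generalisations are easy to compute attribute by
    attribute (a sketch of my own, assuming neither
    hypothesis contains ∅)

```python
from functools import reduce

def lgg2(h1, h2):
    """Least general generalisation of two conjunctive hypotheses:
    keep a value where they agree, widen to '?' where they differ."""
    return tuple(a if a == b else "?" for a, b in zip(h1, h2))

def lgg(hypotheses):
    """The unique least general generalisation of a whole set,
    obtained by folding the pairwise operation."""
    return reduce(lgg2, hypotheses)

print(lgg([("Sun", "Wrm", "Nml", "Str", "Wrm", "Sam"),
           ("Sun", "Wrm", "High", "Str", "Wrm", "Sam")]))
# ('Sun', 'Wrm', '?', 'Str', 'Wrm', 'Sam')
```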

38
Restricting Generalisation
  • Let's go back to our original definition of
    generalisation
  • G generalises S iff G ⇒ S
  • In the general predicate calculus case, this
    relation is uncomputable, so it's not very useful
  • One approach to avoiding the problem is to limit
    the implications allowed

39
Generalisation and Substitution
  • Very commonly, the generalisations we want to
    make involve turning a constant into a variable.
  • So when we see a particular black crow, fred, we
    notice
  • crow(fred) → black(fred)
  • and we may wish to generalise this to
  • ∀X (crow(X) → black(X))
  • Notice that the original proposition can be
    recovered from the generalisation by substituting
    'fred' for the variable 'X'
  • The original is a substitution instance of the
    generalisation
  • So we could define a new, restricted
    generalisation
  • G subsumes S if S is a substitution instance of G
  • This is a special case of our earlier definition,
    because a substitution instance is always implied
    by the proposition it instantiates.
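  • A minimal sketch of the subsumption test for
    function-free atoms follows (my own illustration;
    the Prolog-style convention that capitalised
    names are variables is an assumption of the
    sketch)

```python
def is_variable(t):
    """By convention here (assumed), capitalised names are variables."""
    return isinstance(t, str) and t[:1].isupper()

def substitution_instance(general, specific):
    """Return the substitution making `general` equal to `specific`,
    or None if there is none.  Atoms are tuples like ("crow", "X");
    a minimal sketch for function-free atoms only."""
    if len(general) != len(specific) or general[0] != specific[0]:
        return None
    binding = {}
    for g, s in zip(general[1:], specific[1:]):
        if is_variable(g):
            if binding.setdefault(g, s) != s:
                return None        # variable already bound differently
        elif g != s:
            return None            # constants must match exactly
    return binding

# ("crow", "X") subsumes ("crow", "fred") via {X -> fred}
print(substitution_instance(("crow", "X"), ("crow", "fred")))
```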

40
Learning Algorithms
  • For the rest of this lecture, we will work with a
    specific learning dataset (due to Mitchell)

    Item  Sky   AirT  Hum   Wnd  Wtr   Fcst  Enjy
    1     Sun   Wrm   Nml   Str  Wrm   Sam   Yes
    2     Sun   Wrm   High  Str  Wrm   Sam   Yes
    3     Rain  Cold  High  Str  Wrm   Chng  No
    4     Sun   Wrm   High  Str  Cool  Chng  Yes

  • First, we look at a really simple algorithm,
    Maximally Specific Learning

41
Maximally Specific Learning
  • The learning language consists of sets of tuples,
    representing the values of these attributes
  • A '?' represents that any value is acceptable for
    this attribute
  • A particular value represents that only that
    value is acceptable for this attribute
  • A '∅' represents that no value is acceptable for
    this attribute
  • Thus (?, Cold, High, ?, ?, ?) represents the
    hypothesis that water sport is enjoyed only on
    cold, moist days
  • Note that our language is already heavily biased:
    only conjunctive hypotheses (hypotheses built
    with ∧) are allowed
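  • A possible encoding (an illustrative sketch of my
    own): each hypothesis becomes a 6-tuple, with '?'
    for 'any value' and Python's None standing in for
    ∅

```python
# Hypotheses over (Sky, AirT, Hum, Wnd, Wtr, Fcst): '?' accepts any
# value, a named value accepts only itself, None stands in for ∅.

def matches(hypothesis, example):
    """Does a conjunctive hypothesis accept an example?  None never
    equals an attribute value, so ∅ rejects every example."""
    return all(h == "?" or h == e for h, e in zip(hypothesis, example))

h = ("?", "Cold", "High", "?", "?", "?")      # 'cold, moist days' above
print(matches(h, ("Rain", "Cold", "High", "Str", "Wrm", "Chng")))  # True
print(matches(h, ("Sun", "Wrm", "Nml", "Str", "Wrm", "Sam")))      # False
```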

42
Find-S
  • Find-S is a simple algorithm: its initial
    hypothesis is that water sport is never enjoyed
  • It generalises the hypothesis as positive data
    items are encountered
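  • Using the tuple encoding above, Find-S is a
    single pass over the positive examples (a sketch;
    None stands in for ∅); it reproduces the trace on
    the next two slides

```python
def find_s(examples):
    """Find-S for the conjunctive tuple language: start with the most
    specific hypothesis and minimally generalise it over each positive
    example; negative examples are simply ignored."""
    n = len(examples[0][0])
    h = (None,) * n                            # (∅,∅,∅,∅,∅,∅)
    for x, label in examples:
        if label == "Yes":
            h = tuple(xi if hi in (None, xi) else "?"
                      for hi, xi in zip(h, x))
    return h

data = [(("Sun", "Wrm", "Nml", "Str", "Wrm", "Sam"), "Yes"),
        (("Sun", "Wrm", "High", "Str", "Wrm", "Sam"), "Yes"),
        (("Rain", "Cold", "High", "Str", "Wrm", "Chng"), "No"),
        (("Sun", "Wrm", "High", "Str", "Cool", "Chng"), "Yes")]
print(find_s(data))    # ('Sun', 'Wrm', '?', 'Str', '?', '?')
```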

43
Running Find-S
  • Initial Hypothesis
  • The most specific hypothesis (water sports are
    never enjoyed)
  • h ← (∅,∅,∅,∅,∅,∅)
  • After First Data Item
  • Water sport is enjoyed only under the conditions
    of the first item
  • h ← (Sun,Wrm,Nml,Str,Wrm,Sam)
  • After Second Data Item
  • Water sport is enjoyed only under the common
    conditions of the first two items
  • h ← (Sun,Wrm,?,Str,Wrm,Sam)

44
Running Find-S
  • After Third Data Item
  • Since this item is negative, it has no effect on
    the learning hypothesis
  • h ← (Sun,Wrm,?,Str,Wrm,Sam)
  • After Final Data Item
  • Further generalises the conditions encountered
  • h ← (Sun,Wrm,?,Str,?,?)

45
Discussion
  • We have found the most specific hypothesis
    corresponding to the dataset and the restricted
    (conjunctive) language
  • It is not clear it is the best hypothesis
  • If the best hypothesis is not conjunctive (e.g.
    if we enjoy swimming when it is warm or sunny),
    it will not be found
  • Find-S will not handle noise and inconsistencies
    well
  • In other languages (not using pure conjunction)
    there may be more than one maximally specific
    hypothesis; Find-S will not work well there

46
Version Spaces
  • One possible improvement on Find-S is to search
    many possible solutions in parallel
  • Consistency
  • A hypothesis h is consistent with a dataset D of
    training examples iff h assigns every element of
    the dataset the same label as the dataset does
  • Version Space
  • The version space with respect to the language L
    and the dataset D is the set of hypotheses h in
    the language L which are consistent with D
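  • Both definitions are direct in code (a sketch
    continuing the tuple encoding; matches is the
    same acceptance test as in the Find-S example)

```python
def matches(h, x):
    """A conjunctive hypothesis accepts x iff every attribute fits
    ('?' fits anything; None, our stand-in for ∅, fits nothing)."""
    return all(hi == "?" or hi == xi for hi, xi in zip(h, x))

def consistent(h, dataset):
    """h is consistent with D iff it labels every item as D does."""
    return all(matches(h, x) == (label == "Yes") for x, label in dataset)
```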

47
List-then-Eliminate
  • Obvious algorithm
  • The list-then-eliminate algorithm aims to find
    the version space in L for the given dataset D
  • It can thus return all hypotheses which could
    explain D
  • It works by beginning with L as its set of
    hypotheses H
  • As each item d of the dataset D is examined in
    turn, any hypotheses in H which are inconsistent
    with d are eliminated
  • The language L is usually large, and often
    infinite, so this algorithm is computationally
    infeasible as it stands
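  • Here is a sketch of list-then-eliminate for the
    tiny conjunctive language above; it is feasible
    only because the language has under a thousand
    hypotheses. The value sets follow Mitchell's
    attribute definitions (Wk, for Weak wind, is an
    assumption; it does not occur in the four
    examples)

```python
from itertools import product

def matches(h, x):
    return all(hi == "?" or hi == xi for hi, xi in zip(h, x))

def consistent(h, dataset):
    return all(matches(h, x) == (label == "Yes") for x, label in dataset)

def list_then_eliminate(dataset, values):
    """Enumerate every hypothesis in L and keep those consistent with
    D.  Hypotheses containing ∅ are omitted: they accept nothing, so
    they are consistent only with an all-negative dataset."""
    language = product(*[["?"] + list(v) for v in values])
    return [h for h in language if consistent(h, dataset)]

values = [("Sun", "Rain"), ("Wrm", "Cold"), ("Nml", "High"),
          ("Str", "Wk"), ("Wrm", "Cool"), ("Sam", "Chng")]
data = [(("Sun", "Wrm", "Nml", "Str", "Wrm", "Sam"), "Yes"),
        (("Sun", "Wrm", "High", "Str", "Wrm", "Sam"), "Yes"),
        (("Rain", "Cold", "High", "Str", "Wrm", "Chng"), "No"),
        (("Sun", "Wrm", "High", "Str", "Cool", "Chng"), "Yes")]
for h in list_then_eliminate(data, values):
    print(h)        # prints the six hypotheses of the version space
```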

48
Version Space Representation
  • One of the problems with the previous algorithm
    is the representation of the search space
  • We need to represent version spaces efficiently
  • General Boundary
  • The general boundary G with respect to language L
    and dataset D is the set of hypotheses h in L
    which are consistent with D, and for which there
    is no more general hypothesis in L which is
    consistent with D
  • Specific Boundary
  • The specific boundary S with respect to language
    L and dataset D is the set of hypotheses h in L
    which are consistent with D, and for which there
    is no more specific hypothesis in L which is
    consistent with D

49
Version Space Representation 2
  • A version space may be represented by its general
    and specific boundary
  • That is, given the general and specific
    boundaries, the whole version space may be
    recovered
  • The Candidate Elimination Algorithm traces the
    general and specific boundaries of the version
    space as more examples and counter-examples of
    the concept are seen
  • Positive examples are used to generalise the
    specific boundary
  • Negative examples permit the general boundary to
    be specialised.

50
Candidate Elimination Algorithm
  • Set G to the set of most general hypotheses in L
  • Set S to the set of most specific hypotheses in L
  • For each example d in D

51
Candidate Elimination Algorithm
  • If d is a positive example
  • Remove from G any hypothesis inconsistent with
    d
  • For each hypothesis s in S that is not
    consistent with d
  • Remove s from S
  • Add to S all minimal generalisations h of s
    such that h is consistent with d, and some
    member of G is more general than h
  • Remove from S any hypothesis that is more
    general than another hypothesis in S

52
Candidate Elimination Algorithm
  • If d is a negative example
  • Remove from S any hypothesis inconsistent with
    d
  • For each hypothesis g in G that is not
    consistent with d
  • Remove g from G
  • Add to G all minimal specialisations h of g
    such that h is consistent with d, and some
    member of S is more specific than h
  • Remove from G any hypothesis that is less
    general than another hypothesis in G
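  • Putting the last three slides together, here is a
    sketch of candidate elimination for the
    conjunctive tuple language (my own illustration;
    None stands in for ∅, and the value sets are as
    in the list-then-eliminate example)

```python
def matches(h, x):
    return all(hi == "?" or hi == xi for hi, xi in zip(h, x))

def more_general_or_equal(h1, h2):
    """h1 covers at least the examples that h2 covers."""
    return all(a == "?" or a == b or b is None for a, b in zip(h1, h2))

def candidate_elimination(dataset, values):
    n = len(values)
    S = [(None,) * n]                  # most specific boundary: all ∅
    G = [("?",) * n]                   # most general boundary: all ?
    for x, label in dataset:
        if label == "Yes":
            G = [g for g in G if matches(g, x)]
            S2 = []
            for s in S:
                if matches(s, x):
                    S2.append(s)
                    continue
                # the minimal generalisation of s that covers x
                h = tuple(xi if si in (None, xi) else "?"
                          for si, xi in zip(s, x))
                if any(more_general_or_equal(g, h) for g in G):
                    S2.append(h)
            # drop members more general than another member of S
            S = [s for s in S2 if not any(
                 t != s and more_general_or_equal(s, t) for t in S2)]
        else:
            S = [s for s in S if not matches(s, x)]
            G2 = []
            for g in G:
                if not matches(g, x):
                    G2.append(g)
                    continue
                # all minimal specialisations of g that exclude x
                for i, gi in enumerate(g):
                    if gi != "?":
                        continue
                    for v in values[i]:
                        if v != x[i]:
                            h = g[:i] + (v,) + g[i + 1:]
                            if any(more_general_or_equal(h, s) for s in S):
                                G2.append(h)
            # drop members less general than another member of G
            G = [g for g in G2 if not any(
                 h != g and more_general_or_equal(h, g) for h in G2)]
    return S, G

values = [("Sun", "Rain"), ("Wrm", "Cold"), ("Nml", "High"),
          ("Str", "Wk"), ("Wrm", "Cool"), ("Sam", "Chng")]
data = [(("Sun", "Wrm", "Nml", "Str", "Wrm", "Sam"), "Yes"),
        (("Sun", "Wrm", "High", "Str", "Wrm", "Sam"), "Yes"),
        (("Rain", "Cold", "High", "Str", "Wrm", "Chng"), "No"),
        (("Sun", "Wrm", "High", "Str", "Cool", "Chng"), "Yes")]
S, G = candidate_elimination(data, values)
print(S)   # [('Sun', 'Wrm', '?', 'Str', '?', '?')]
print(G)   # [('Sun', '?', '?', '?', '?', '?'), ('?', 'Wrm', '?', '?', '?', '?')]
```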

53
Summary
  • Defining Learning
  • Kinds of Learning
  • Generalisation and Specialisation
  • Some Simple Learning Algorithms
  • Find-S
  • Version Spaces
  • List-then-Eliminate
  • Candidate Elimination
