Title: Advanced Artificial Intelligence Lecture 3: Learning
1. Advanced Artificial Intelligence, Lecture 3: Learning
- Bob McKay
- School of Computer Science and Engineering
- College of Engineering
- Seoul National University
2. Outline
- Defining Learning
- Kinds of Learning
- Generalisation and Specialisation
- Some Simple Learning Algorithms
3. References
- Mitchell, Tom M., Machine Learning, McGraw-Hill, 1997, ISBN 0-07-115467-1
4. Defining a Learning System (Mitchell)
- A program is said to learn from experience E
with respect to some class of tasks T and
performance measure P, if its performance at
tasks in T, as measured by P, improves with
experience E
5. Specifying a Learning System
- Specifying the task T, the performance P and the experience E defines the learning problem.
- Specifying the learning system requires us to define
- Exactly what knowledge is to be learnt
- How this knowledge is to be represented
- How this knowledge is to be learnt
6. Specifying What is to be Learnt
- Usually, the desired knowledge can be represented as a target valuation function V: I → D
- It takes in information about the problem and gives back a desired decision
- Often, it is unrealistic to expect to learn the ideal function V
- All that is required is a good enough approximation V̂: I → D
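To make this concrete, here is a sketch of the checkers illustration from Mitchell's book (cited above); the choice of six linear features is illustrative, not prescribed:

```latex
% Ideal target function V for checkers, and a learnable approximation:
V(b) =
  \begin{cases}
    100   & \text{if } b \text{ is a final winning board}\\
    -100  & \text{if } b \text{ is a final losing board}\\
    0     & \text{if } b \text{ is a final drawn board}\\
    V(b') & \text{otherwise, } b' \text{ the best final board reachable from } b
  \end{cases}
\qquad
\hat{V}(b) = w_0 + w_1 x_1(b) + w_2 x_2(b) + \dots + w_6 x_6(b)
```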
7. Specifying How Knowledge is to be Represented
- The function V must be represented symbolically, in some language L
- The language may be a well-known language
- Boolean expressions
- Arithmetic functions
- ...
- Or for some systems, the language may be defined by a grammar
8. Specifying How the Knowledge is to be Learnt
- If the learning system is to be implemented, we must specify an algorithm A, which defines the way in which the system is to search the language L for an acceptable V
- That is, we must specify a search algorithm
9. Structure of a Learning System
- Four modules
- The Performance System
- The Critic
- The Generaliser (or sometimes Specialiser)
- The Experiment Generator
10. Performance Module
- This is the system which actually uses the function V as we learn it
- Learning Task
- Learning to play checkers
- Performance module
- System for playing checkers
- (i.e. makes the checkers moves)
11. Critic Module
- The critic module evaluates the performance of the current V
- It produces a set of data from which the system can learn further
12. Generaliser/Specialiser Module
- Takes a set of data and produces a new V for the system to run again
13. Experiment Generator
- Takes the new V
- Maybe also uses the previous history of the system
- Produces a new experiment for the performance system to undertake
14. The Importance of Bias
- Important theoretical results from learning theory (PAC learning) tell us that learning without some presuppositions is infeasible
- Practical experience, of both machine and human learning, confirms this
- To learn effectively, we must limit the class of Vs
- Two approaches are used in machine learning
- Language bias
- Search bias
- Combined bias
- Language and search bias are not mutually exclusive; most learning systems feature both
15. Language Bias
- The language L is restricted so that it cannot represent all possible target functions V
- This is usually on the basis of some knowledge we have about the likely form of V
- It introduces risk
- Our system will fail if L does not contain an acceptable V
16. Search Bias
- The order in which the system searches L is
controlled, so that promising areas for V are
searched first
17. The Downside: No Free Lunches
- Wolpert and Macready's No Free Lunch Theorem states, in effect, that averaged over all problems, all biases are equally good (or bad)
- Conventional view
- The choice of a learning system cannot be universal
- It must be matched to the problem being solved
- In most systems, the bias is not explicit
- The ability to identify the language and search biases of a particular system is an important aspect of machine learning
- Some more recent systems permit the explicit and flexible specification of both language and search biases
18. No Free Lunch: Does it Matter?
- Alternative view
- We aren't interested in all problems
- We are only interested in problems which have solutions of less than some bounded complexity
- (so that we can understand the solutions)
- The No Free Lunch Theorem may not apply in this case
19. Some Dimensions of Learning
- Induction vs Discovery
- Guided learning vs learning from raw data
- Learning How vs Learning That (vs Learning a Better That)
- Stochastic vs Deterministic; Symbolic vs Subsymbolic
- Clean vs Noisy Data
- Discrete vs Continuous Variables
- Attribute vs Relational Learning
- The Importance of Background Knowledge
20. Induction vs Discovery
- Has the target concept been previously identified?
- Pearson's cloud classifications from satellite data
- vs
- Autoclass and H-R diagrams
- AM and prime numbers
- BACON and Boyle's Law
21. Guided Learning vs Learning from Raw Data
- Does the learning system require carefully selected examples and counterexamples, as in a teacher-student situation?
- (allows fast learning)
- CIGOL learning sort/merge
- vs
- Garvan Institute's thyroid data
22. Learning How vs Learning That vs Learning a Better That
- Classifying handwritten symbols
- Distinguishing vowel sounds (Sejnowski & Rosenberg)
- Learning to fly a (simulated!) plane
- vs
- Michalski: learning diagnosis of soy diseases
- vs
- Mitchell: learning about chess forks
23. Stochastic vs Deterministic; Symbolic vs Subsymbolic
- Classifying handwritten symbols (stochastic, subsymbolic)
- vs
- Predicting plant distributions (stochastic, symbolic)
- vs
- Cloud classification (deterministic, symbolic)
- vs
- ? (deterministic, subsymbolic)
24. Clean vs Noisy Data
- Learning to diagnose errors in programs
- vs
- Greater gliders in the Coolangubra
25. Discrete vs Continuous Variables
- Quinlan's chess end games
- vs
- Pearson's clouds (e.g. cloud heights)
26. Attribute vs Relational Learning
- Predicting plant distributions
- vs
- Predicting animal distributions
- (because plants can't move, they don't care (much) about spatial relationships)
27. The Importance of Background Knowledge
- Learning about faults in a satellite power supply
- general electric circuit theory
- knowledge about the particular circuit
28. Generalisation and Learning
- What do we mean when we say of two propositions, S and G, that G is a generalisation of S?
- Suppose Skippy is a grey kangaroo
- We would regard 'Kangaroos are grey' as a generalisation of 'Skippy is grey'
- In any world in which 'Kangaroos are grey' is true, 'Skippy is grey' will also be true
- In other words, if G is a generalisation of specialisation S, then G is at least as strong a claim as S
- That is, S is true in all states of the world in which G is, and perhaps in other states as well
29. Generalisation and Inference
- In logic, we assume that if S is true in all worlds in which G is, then G ⇒ S
- That is, G is a generalisation of S exactly when G implies S
- So we can think of learning from S as a search for a suitable G for which G ⇒ S
- In propositional learning, this is often used as a definition
- G is more general than S if and only if G ⇒ S
30. Issues
- Equating generalisation and logical implication is only useful if the validity of an implication can be readily computed
- In the propositional calculus, validity is an exponential problem
- In the predicate calculus, validity is an undecidable problem
- So the definition is not universally useful
- (although for some parts of logic, e.g. learning rules, it is perfectly adequate)
31. A Common Misunderstanding
- Suppose we have two rules
- 1) A ∧ B → G
- 2) A ∧ B ∧ C → G
- Clearly, we would want 1 to be a generalisation of 2
- This is OK with our definition, because ((A ∧ B → G) ⇒ (A ∧ B ∧ C → G)) is valid
- But the confusing thing is that ((A ∧ B ∧ C) ⇒ (A ∧ B)) is also valid
- If you only look at the hypotheses of the rules, rather than the whole rules, the implication is the wrong way around
- Note that some textbooks are themselves confused about this
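Both validities can be checked mechanically. A minimal brute-force sketch in Python (the helper names are ours):

```python
from itertools import product

def valid(formula):
    """Brute-force propositional validity over all assignments to A, B, C, G."""
    return all(formula(*v) for v in product([False, True], repeat=4))

def implies(p, q):
    return (not p) or q

# Rule 1 generalises rule 2: (A and B -> G) => (A and B and C -> G) is valid
print(valid(lambda a, b, c, g:
            implies(implies(a and b, g), implies(a and b and c, g))))  # True

# ...yet between the antecedents alone, the implication runs the other
# way: (A and B and C) => (A and B) is valid
print(valid(lambda a, b, c, g: implies(a and b and c, a and b)))       # True
```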
32. Defining Generalisation
- We could try to define the properties that generalisation must satisfy
- So let's write down some axioms. We need some notation.
- We will write 'S <G G' as shorthand for 'S is less general than G'
- Axioms
- Transitivity: if A <G B and B <G C then also A <G C
- Antisymmetry: if A <G B then it is not true that B <G A
- Top: there is a unique element, ⊤, for which it is always true that A <G ⊤
- Bottom: there is a unique element, ⊥, for which it is always true that ⊥ <G A
33. Picturing Generalisation
- We can draw a 'picture' of a generalisation
hierarchy satisfying these axioms
34. Specifying Generalisation
- In a particular domain, the generalisation hierarchy may be defined in either of two ways
- By giving a general definition of what generalisation means in that domain
- Example: our earlier definition in terms of implication
- By directly specifying the specialisation and generalisation operators that may be used to climb up and down the links in the generalisation hierarchy
35. Learning and Generalisation
- How does learning relate to generalisation?
- We can view most learning as an attempt to find an appropriate generalisation that generalises the examples
- In noise-free domains, we usually want the generalisation to cover all the examples
- Once we introduce noise, we want the generalisation to cover 'enough' examples, and the interesting bit is in defining what 'enough' is
- In our picture of a generalisation hierarchy, most learning algorithms can be viewed as methods for searching the hierarchy
- The examples can be pictured as locations low down in the hierarchy, and the learning algorithm attempts to find a location that is above all (or 'enough') of them in the hierarchy, but usually no higher than it needs to be
36. Searching the Generalisation Hierarchy
- The commonest approaches are
- Generalising search
- The search is upward from the original examples, towards the more general hypotheses
- Specialising search
- The search is downward from the most general hypothesis, towards the more special examples
- Some algorithms use different approaches. Mitchell's version space approach, for example, tries to 'home in' on the right generalisation from both directions at once.
37. Completeness and Generalisation
- Many approaches to axiomatising generalisation add an extra axiom
- Completeness: for any set S of members of the generalisation hierarchy, there is a unique 'least general generalisation' L, which satisfies two properties
- 1) for every s in S, s <G L
- 2) if any other L' satisfies 1), then L <G L'
- If this definition is hard to understand, compare it with the definition of 'Least Upper Bound' in set theory, or of 'Least Common Multiple' in arithmetic
38. Restricting Generalisation
- Let's go back to our original definition of generalisation
- G generalises S iff G ⇒ S
- In the general predicate calculus case, this relation is uncomputable, so it's not very useful
- One approach to avoiding the problem is to limit the implications allowed
39. Generalisation and Substitution
- Very commonly, the generalisations we want to make involve turning a constant into a variable
- Say we see a particular black crow, fred, so we notice
- crow(fred) → black(fred)
- and we may wish to generalise this to
- ∀X (crow(X) → black(X))
- Notice that the original proposition can be recovered from the generalisation by substituting 'fred' for the variable 'X'
- The original is a substitution instance of the generalisation
- So we could define a new, restricted generalisation
- G subsumes S if S is a substitution instance of G
- This is an instance of our earlier definition, because a substitution instance is always implied by the generalisation
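A minimal sketch of the substitution-instance check, under two simplifying conventions of ours: a rule is flattened to a tuple of symbols, and variables are exactly the symbols starting with an upper-case letter:

```python
def substitution_instance(general, specific):
    """Is `specific` obtained from `general` by consistently substituting
    constants for variables?"""
    theta = {}                                   # the substitution built so far
    for g_sym, s_sym in zip(general, specific):
        if g_sym[:1].isupper():                  # a variable in the general rule
            if theta.setdefault(g_sym, s_sym) != s_sym:
                return False                     # variable already bound elsewhere
        elif g_sym != s_sym:                     # constants must match exactly
            return False
    return len(general) == len(specific)

# crow(fred) -> black(fred) is a substitution instance of crow(X) -> black(X)
print(substitution_instance(('crow', 'X', 'black', 'X'),
                            ('crow', 'fred', 'black', 'fred')))   # True
```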
40. Learning Algorithms
- For the rest of this lecture, we will work with a specific learning dataset (due to Mitchell)

  Item  Sky   AirT  Hum   Wnd  Wtr   Fcst  Enjy
  1     Sun   Wrm   Nml   Str  Wrm   Sam   Yes
  2     Sun   Wrm   High  Str  Wrm   Sam   Yes
  3     Rain  Cold  High  Str  Wrm   Chng  No
  4     Sun   Wrm   High  Str  Cool  Chng  Yes

- First, we look at a really simple algorithm, Maximally Specific Learning
41. Maximally Specific Learning
- The learning language consists of sets of tuples, representing the values of these attributes
- A '?' represents that any value is acceptable for this attribute
- A particular value represents that only that value is acceptable for this attribute
- A '∅' represents that no value is acceptable for this attribute
- Thus (?, Cold, High, ?, ?, ?) represents the hypothesis that water sport is enjoyed only on cold, moist days
- Note that our language is already heavily biased: only conjunctive hypotheses (hypotheses built with ∧) are allowed
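A minimal sketch of how such a hypothesis is tested against an instance (we use the string '0' to stand in for ∅, since ∅ matches nothing; the names and encoding are ours):

```python
def matches(hypothesis, instance):
    """Does a conjunctive hypothesis accept an instance?
    '?' accepts any value; a named value accepts only itself;
    '0' (our stand-in for the empty symbol) accepts nothing."""
    return all(h == '?' or h == x for h, x in zip(hypothesis, instance))

print(matches(('?', 'Cold', 'High', '?', '?', '?'),
              ('Rain', 'Cold', 'High', 'Str', 'Wrm', 'Chng')))  # True
print(matches(('?', 'Cold', 'High', '?', '?', '?'),
              ('Sun', 'Wrm', 'Nml', 'Str', 'Wrm', 'Sam')))      # False
```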
42. Find-S
- Find-S is a simple algorithm; its initial hypothesis is that water sport is never enjoyed
- It expands the hypothesis as positive data items are noted
43. Running Find-S
- Initial Hypothesis
- The most specific hypothesis (water sports are never enjoyed)
- h ← (∅, ∅, ∅, ∅, ∅, ∅)
- After First Data Item
- Water sport is enjoyed only under the conditions of the first item
- h ← (Sun, Wrm, Nml, Str, Wrm, Sam)
- After Second Data Item
- Water sport is enjoyed only under the common conditions of the first two items
- h ← (Sun, Wrm, ?, Str, Wrm, Sam)
44. Running Find-S
- After Third Data Item
- Since this item is negative, it has no effect on the learning hypothesis
- h ← (Sun, Wrm, ?, Str, Wrm, Sam)
- After Final Data Item
- Further generalises the conditions encountered
- h ← (Sun, Wrm, ?, Str, ?, ?)
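The whole run can be reproduced with a short sketch (assuming the dataset encoding above; '0' again stands in for ∅, and the function name is ours):

```python
def find_s(examples, n_attrs=6):
    """Find-S: start maximally specific, generalise just enough
    to cover each positive example."""
    h = ['0'] * n_attrs
    for x, positive in examples:
        if not positive:
            continue                      # negative items are ignored
        for i in range(n_attrs):
            if h[i] == '0':
                h[i] = x[i]               # adopt the first positive's value
            elif h[i] != x[i]:
                h[i] = '?'                # values disagree: generalise
        print(tuple(h))                   # hypothesis after each positive item
    return tuple(h)

data = [(('Sun', 'Wrm', 'Nml', 'Str', 'Wrm', 'Sam'), True),
        (('Sun', 'Wrm', 'High', 'Str', 'Wrm', 'Sam'), True),
        (('Rain', 'Cold', 'High', 'Str', 'Wrm', 'Chng'), False),
        (('Sun', 'Wrm', 'High', 'Str', 'Cool', 'Chng'), True)]

find_s(data)    # ends at ('Sun', 'Wrm', '?', 'Str', '?', '?'), as traced above
```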
45. Discussion
- We have found the most specific hypothesis corresponding to the dataset and the restricted (conjunctive) language
- It is not clear that it is the best hypothesis
- If the best hypothesis is not conjunctive (e.g. if we enjoy swimming if it's warm or sunny), it will not be found
- Find-S will not handle noise and inconsistencies well
- In other languages (not using pure conjunction) there may be more than one maximally specific hypothesis; Find-S will not work well here
46. Version Spaces
- One possible improvement on Find-S is to search many possible solutions in parallel
- Consistency
- A hypothesis h is consistent with a dataset D of training examples iff h gives the same answer on every element of the dataset as the dataset does
- Version Space
- The version space with respect to the language L and the dataset D is the set of hypotheses h in the language L which are consistent with D
47. List-then-Eliminate
- Obvious algorithm
- The list-then-eliminate algorithm aims to find the version space in L for the given dataset D
- It can thus return all hypotheses which could explain D
- It works by beginning with L as its set of hypotheses H
- As each item d of the dataset D is examined in turn, any hypotheses in H which are inconsistent with d are eliminated
- The language L is usually large, and often infinite, so this algorithm is computationally infeasible as it stands
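For our tiny conjunctive language, though, the space is small enough to enumerate. A sketch (restricting each attribute's domain to the values occurring in the data is an assumption of ours, made for brevity):

```python
from itertools import product

def matches(h, x):
    return all(hi == '?' or hi == xi for hi, xi in zip(h, x))

def list_then_eliminate(examples, domains):
    """Enumerate the whole conjunctive hypothesis space, then eliminate
    every hypothesis inconsistent with some training example."""
    space = list(product(*[('?',) + vals for vals in domains]))
    space.append(('0',) * len(domains))       # the all-rejecting hypothesis
    return [h for h in space
            if all(matches(h, x) == y for x, y in examples)]

domains = [('Sun', 'Rain'), ('Wrm', 'Cold'), ('Nml', 'High'),
           ('Str',), ('Wrm', 'Cool'), ('Sam', 'Chng')]
data = [(('Sun', 'Wrm', 'Nml', 'Str', 'Wrm', 'Sam'), True),
        (('Sun', 'Wrm', 'High', 'Str', 'Wrm', 'Sam'), True),
        (('Rain', 'Cold', 'High', 'Str', 'Wrm', 'Chng'), False),
        (('Sun', 'Wrm', 'High', 'Str', 'Cool', 'Chng'), True)]

for h in list_then_eliminate(data, domains):
    print(h)        # six hypotheses survive: this is the version space
```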
48. Version Space Representation
- One of the problems with the previous algorithm is the representation of the search space
- We need to represent version spaces efficiently
- General Boundary
- The general boundary G with respect to language L and dataset D is the set of hypotheses h in L which are consistent with D, and for which there is no more general hypothesis in L which is consistent with D
- Specific Boundary
- The specific boundary S with respect to language L and dataset D is the set of hypotheses h in L which are consistent with D, and for which there is no more specific hypothesis in L which is consistent with D
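The generality ordering these boundaries rely on is a simple attribute-wise test in our conjunctive language. A sketch (the name and the '0'-for-∅ encoding are ours):

```python
def more_general_or_equal(a, b):
    """True iff hypothesis a accepts every instance that b accepts:
    '?' in a covers anything, a named value in a needs the same value
    in b, and '0' (our stand-in for the empty symbol) in b, accepting
    nothing, is covered by anything."""
    return all(ai == '?' or ai == bi or bi == '0' for ai, bi in zip(a, b))

print(more_general_or_equal(('Sun', '?', '?', '?', '?', '?'),
                            ('Sun', 'Wrm', '?', 'Str', '?', '?')))   # True
print(more_general_or_equal(('Sun', 'Wrm', '?', 'Str', '?', '?'),
                            ('Sun', '?', '?', '?', '?', '?')))       # False
```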
49. Version Space Representation 2
- A version space may be represented by its general and specific boundaries
- That is, given the general and specific boundaries, the whole version space may be recovered
- The Candidate Elimination Algorithm traces the general and specific boundaries of the version space as more examples and counter-examples of the concept are seen
- Positive examples are used to generalise the specific boundary
- Negative examples permit the general boundary to be specialised
50. Candidate Elimination Algorithm
- Set G to the set of most general hypotheses in L
- Set S to the set of most specific hypotheses in L
- For each example d in D:
51. Candidate Elimination Algorithm
- If d is a positive example
- Remove from G any hypothesis inconsistent with d
- For each hypothesis s in S that is not consistent with d
- Remove s from S
- Add to S all minimal generalisations h of s such that h is consistent with d, and some member of G is more general than h
- Remove from S any hypothesis that is more general than another hypothesis in S
52. Candidate Elimination Algorithm
- If d is a negative example
- Remove from S any hypothesis inconsistent with d
- For each hypothesis g in G that is not consistent with d
- Remove g from G
- Add to G all minimal specialisations h of g such that h is consistent with d, and some member of S is more specific than h
- Remove from G any hypothesis that is less general than another hypothesis in G
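Putting slides 50-52 together, a compact sketch of the whole algorithm for our conjunctive language (the helper names, the '0'-for-∅ encoding, and the restriction of domains to observed values are all our assumptions):

```python
def matches(h, x):
    return all(hi == '?' or hi == xi for hi, xi in zip(h, x))

def more_general_or_equal(a, b):
    return all(ai == '?' or ai == bi or bi == '0' for ai, bi in zip(a, b))

def candidate_elimination(examples, domains):
    n = len(domains)
    G = {('?',) * n}                    # most general boundary
    S = {('0',) * n}                    # most specific boundary
    for x, positive in examples:
        if positive:
            G = {g for g in G if matches(g, x)}
            new_S = set()
            for s in S:
                if matches(s, x):
                    new_S.add(s)
                    continue
                # the unique minimal generalisation in this language
                h = tuple(xi if si == '0' else (si if si == xi else '?')
                          for si, xi in zip(s, x))
                if any(more_general_or_equal(g, h) for g in G):
                    new_S.add(h)
            S = {s for s in new_S       # drop any over-general member
                 if not any(t != s and more_general_or_equal(s, t)
                            for t in new_S)}
        else:
            S = {s for s in S if not matches(s, x)}
            new_G = set()
            for g in G:
                if not matches(g, x):
                    new_G.add(g)
                    continue
                # minimal specialisations: pin one '?' to a value that
                # disagrees with the negative example
                for i, gi in enumerate(g):
                    if gi != '?':
                        continue
                    for v in domains[i]:
                        if v != x[i]:
                            h = g[:i] + (v,) + g[i + 1:]
                            if any(more_general_or_equal(h, s) for s in S):
                                new_G.add(h)
            G = {g for g in new_G       # drop any over-specific member
                 if not any(t != g and more_general_or_equal(t, g)
                            for t in new_G)}
    return S, G

domains = [('Sun', 'Rain'), ('Wrm', 'Cold'), ('Nml', 'High'),
           ('Str',), ('Wrm', 'Cool'), ('Sam', 'Chng')]
data = [(('Sun', 'Wrm', 'Nml', 'Str', 'Wrm', 'Sam'), True),
        (('Sun', 'Wrm', 'High', 'Str', 'Wrm', 'Sam'), True),
        (('Rain', 'Cold', 'High', 'Str', 'Wrm', 'Chng'), False),
        (('Sun', 'Wrm', 'High', 'Str', 'Cool', 'Chng'), True)]
S, G = candidate_elimination(data, domains)
print('S:', S)   # {('Sun', 'Wrm', '?', 'Str', '?', '?')}
print('G:', G)   # {('Sun', '?', ...), ('?', 'Wrm', ...)}
```

On Mitchell's dataset this ends with the specific boundary found by Find-S above and a general boundary of two hypotheses, (Sun, ?, ?, ?, ?, ?) and (?, Wrm, ?, ?, ?, ?).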
53. Summary
- Defining Learning
- Kinds of Learning
- Generalisation and Specialisation
- Some Simple Learning Algorithms
- Find-S
- Version Spaces
- List-then-Eliminate
- Candidate Elimination