Title: 12 Clustering
1 12 Clustering
- What is clustering?
- Flat clustering
- K-means clustering, the EM algorithm
- Hierarchical clustering
- Distance-based approaches
- Conceptual clustering
- The Cobweb algorithm
- Using clustering for prediction
2 What is clustering?
- Find groups (clusters) of instances so that
- Instances in the same group are similar
- Instances in different groups are different
- Compared to classification
- Similarity: assign classes to examples
- Difference: the classes are not known in advance
- Hence also called "unsupervised learning"
- Classes (even taxonomies) are "invented"
[Figure: two scatter plots contrasting a classification problem (labelled points) with a clustering problem (unlabelled points)]
3 Examples
- Typical example: construct a taxonomy of, e.g., animals (an example of "hierarchical clustering")
- In machine learning / data mining
- Applicable, for instance, in marketing
- Identify typical customer groups, e.g., of car drivers
- Produce products (e.g., cars) aimed at one specific group
- Auxiliary method for other techniques
- E.g., constructing local models (cf. instance-based methods), ...
4 Similarity Measures
- How to measure similarity between instances?
- Similar problem as for instance-based methods
- Possible options
- Distance metric
- Euclidean distance (a code sketch follows below)
- Other...
- More general forms of similarity
- Do not necessarily satisfy the triangle inequality, symmetry, ...
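As a concrete illustration (not part of the original slides), a minimal Python sketch of the Euclidean distance between two instances described by numeric attributes:

    import math

    def euclidean(x, y):
        # Euclidean distance between two numeric attribute vectors
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

    print(euclidean([1.0, 2.0, 0.5], [0.0, 2.5, 1.5]))  # 1.5

More general similarity measures are simply other functions of two instances; they need not satisfy the metric axioms.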
5 Flat vs. Hierarchical Clustering
- Flat clustering
- Given a data set, return a partition
- Hierarchical clustering
- Combine clusters into larger clusters, etc., until 1 cluster = the full data set
- Gives rise to a cluster hierarchy or taxonomy
[Figure: dendrogram over items A-F illustrating a cluster hierarchy]
6 Extensional vs. Conceptual Clustering
- Extensional clustering
- Clusters are defined as sets of examples
- cf. statistics
- Conceptual clustering
- Clusters are described in some language
- Typical criteria for a good conceptual clustering
- High intra-cluster similarity
- Simple conceptual description of the clusters
7 Flat Extensional Clustering
- Flat clustering
- Given a set of unlabeled data
- Find clusters of similar instances
- "Similar" = close to each other in some space
- The number of clusters may be given
- Other quality criteria for clusters may be given
- Examples of algorithms
- LEADER: simple, fast, but not very good
- K-means, EM
8 Leader
- Input: data, some threshold distance D
- Clusters are represented by prototypes
- Algorithm (a code sketch follows below)
- Start with no clusters
- For each example e
- Find the first cluster prototype p for which dist(p, e) < D
- If found, add e to the cluster of p; otherwise make e a new prototype
- Very fast, but low-quality clusters
- E.g., the results depend on the order of the examples
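A minimal Python sketch of Leader, assuming a point-level distance such as the euclidean function above (the function and variable names are illustrative, not from the slides):

    def leader(examples, D, dist):
        clusters = []                          # list of (prototype, members) pairs
        for e in examples:
            for prototype, members in clusters:
                if dist(prototype, e) < D:     # first prototype within threshold D
                    members.append(e)
                    break
            else:                              # no close prototype: e becomes a new prototype
                clusters.append((e, [e]))
        return clusters

Because each example is assigned to the first sufficiently close prototype, presenting the examples in a different order can yield different clusters.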
9 K-means clustering
- Input: data, number of clusters K
- Algorithm (a code sketch follows below)
- Start with K random seeds for the clusters
- Repeat until no changes
- For all instances e
- Add e to the cluster of the closest seed
- Compute the centres of all clusters
- E.g., for numeric data, compute the average
- These centres are the seeds for the next iteration
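A minimal K-means sketch for numeric data (Python/NumPy; illustrative only, e.g. the seeding and stopping test are one of several reasonable choices):

    import numpy as np

    def kmeans(X, K, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        seeds = X[rng.choice(len(X), K, replace=False)]       # K random seeds
        for _ in range(n_iter):
            # assign every instance to the cluster of its closest seed
            dists = np.linalg.norm(X[:, None, :] - seeds[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # the cluster centres (averages) become the seeds of the next iteration
            new_seeds = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                  else seeds[k] for k in range(K)])
            if np.allclose(new_seeds, seeds):                  # no change: stop
                break
            seeds = new_seeds
        return labels, seeds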
10 The EM algorithm
- Expectation maximisation
- Similar to K-means, but
- Now we assume, e.g., normal distributions
- Examples are not assigned to 1 cluster, but partially to different clusters (proportionally to the distributions)
- EM is actually much more general than just clustering
- Finding a number of distributions that generate the data
- Building mixture models
- See Mitchell for details (a small sketch follows below)
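A minimal EM sketch for a mixture of two 1-D Gaussians (Python/NumPy). This is only one instance of the general scheme described above; the initialisation and the fixed number of iterations are assumptions of the sketch:

    import numpy as np

    def em_two_gaussians(x, n_iter=50):
        mu = np.array([x.min(), x.max()])       # crude initial means
        sigma = np.array([x.std(), x.std()])    # initial standard deviations
        pi = np.array([0.5, 0.5])               # mixing weights
        for _ in range(n_iter):
            # E-step: partial (soft) assignment of each example to each component
            dens = np.array([pi[k] / (sigma[k] * np.sqrt(2 * np.pi)) *
                             np.exp(-(x - mu[k]) ** 2 / (2 * sigma[k] ** 2))
                             for k in range(2)])
            resp = dens / dens.sum(axis=0)       # responsibilities
            # M-step: re-estimate the parameters from the soft assignments
            Nk = resp.sum(axis=1)
            mu = (resp * x).sum(axis=1) / Nk
            sigma = np.sqrt((resp * (x - mu[:, None]) ** 2).sum(axis=1) / Nk)
            pi = Nk / len(x)
        return mu, sigma, pi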
11 Hierarchical Extensional Clustering
- Top-down (divisive) methods
- Start with 1 cluster (whole data set)
- Divide it into subsets
- Subdivide subsets further, etc.
- Bottom-up (agglomerative) methods
- Start with singleton clusters
- Join closest clusters together
- Repeat until 1 cluster
12 Example Agglomerative Methods
- A number of well-known agglomerative methods exist
- Variants according to the definition of the closest clusters
- The distance criterion for examples is assumed known
- How does it generalise to a distance between clusters?
- Options (sketched in code below)
- Single linkage: distance between clusters = distance between the closest points
- Complete linkage: cluster distance = distance between the furthest points
- Average distance, ...
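A minimal sketch of the two most common options (Python; assumes a point-level distance function such as euclidean above):

    def single_linkage(c1, c2, dist):
        # cluster distance = distance between the closest pair of points
        return min(dist(a, b) for a in c1 for b in c2)

    def complete_linkage(c1, c2, dist):
        # cluster distance = distance between the furthest pair of points
        return max(dist(a, b) for a in c1 for b in c2)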
13 Conceptual clustering
- The previous methods just form groups of examples
- Conceptual clustering: also return a description of the groups in some language L
- The quality of the clusters depends on properties of the cluster description as well
- Typically, a simple cluster description in L is preferred
- The language defines the context in which the quality of a clustering is evaluated
- Whether 2 elements are in the same cluster may not only depend on themselves, but also on the other data
14 Example
- How would you cluster these points?
15 TDIDT for Clustering
- Decision tree = conceptual hierarchical clustering
- Each node = 1 cluster
- The tests from the root to a node = conjunctive description of the cluster
- Hence, the language L of cluster descriptions = conjunctions of attribute tests
16 Cobweb
- Well-known clustering algorithm (Fisher, 1987)
- Finds a conceptual hierarchical clustering
- Probabilistic description: a probability distribution of attribute values for each cluster
- Heuristic
- Maximize predictiveness and predictability of attributes
- Predictability: given the cluster, how well can you predict the attributes?
- Predictiveness: given the attributes, how well can you predict the cluster?
- Maximize both
17 Cobweb Algorithm
- Incremental algorithm
- For each example
- For each level (top to bottom) of the current taxonomy
- Change the current level using one of several operators
- Add the example to a cluster
- Create a new cluster (with 1 example)
- Merge clusters
- Split clusters
- Move down to the relevant subcluster
- Evaluation of a clustering: try to maximize a combination of predictiveness P(Ck | Ai = vij) and predictability P(Ai = vij | Ck) (Ck = cluster, Ai = attribute, vij = attribute value); a sketch of one standard such measure follows below
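The slide does not spell out the combination; the standard choice in Cobweb is Fisher's category utility. A minimal sketch for nominal attributes (Python; assumes non-empty clusters and examples represented as dicts from attribute to value):

    from collections import Counter

    def category_utility(clusters, attributes):
        all_examples = [e for c in clusters for e in c]
        n = len(all_examples)

        def sq_prob_sum(examples):
            # sum over attributes and values of P(Ai = vij)^2 within `examples`
            total = 0.0
            for a in attributes:
                counts = Counter(e[a] for e in examples)
                total += sum((cnt / len(examples)) ** 2 for cnt in counts.values())
            return total

        base = sq_prob_sum(all_examples)   # expected correct guesses without clusters
        cu = sum((len(c) / n) * (sq_prob_sum(c) - base) for c in clusters)
        return cu / len(clusters)          # averaged over the clusters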
18 Using clustering for prediction
- Clustering can be used for the prediction of any property
- Once clusters are found, a prediction is made in a 2-step process
- Given the known attributes of an instance, predict its cluster (OK if predictiveness is high)
- Given the cluster, predict the unknown attributes (OK if predictability is high)
- "Flexible prediction": it is not known in advance what will be given and what will need to be predicted
19
- If something is known about what will be predicted, the clustering process can be tuned
- Maximize the predictiveness of the attributes that will be given
- Maximize the predictability of the attributes that will need to be predicted
- Many learning approaches can be described in this "predictive clustering" framework
- Try, e.g., decision trees, instance-based learning
20 To Remember
- Important concepts
- Similarity measures, distances
- Flat vs. hierarchical clustering
- Extensional vs. conceptual clustering
- Clustering algorithms
- Leader, K-means, EM
- Single/complete linkage
- Cobweb
- Use of clustering for prediction
21 13 Induction of Rule Sets
- Representing theories with decision rules
- Induction of predictive rules
- Sequential covering approaches
- Induction of association rules
- The Apriori approach
- → Ch. 10 (partially)
22 Representing Theories with Decision Rules
- Previous representations
- Decision trees
- Numerical representations
- Popular representation for concept definitions: if-then rules
- IF <conditions> THEN belongs to concept
- How can we learn such rules?
- Trees can be converted to rules
- Using genetic algorithms
- With specific rule-learning methods
23 Sequential Covering Approaches
- Or: the separate-and-conquer approach
- General principle: learn rules 1 at a time
- Learn 1 rule that has
- High accuracy
- When it predicts something, it should be correct
- Any coverage
- It does not have to make a prediction for all examples, just for some of them
- Mark covered examples
- These have been taken care of; from now on, focus on the rest
- Repeat this until all examples are covered
24 Sequential Covering
- General algorithm for learning rule sets
- Based on the CN2 algorithm (Clark & Niblett)

function LearnRuleSet(Target, Attrs, Examples, Threshold):
    LearnedRules := ∅
    Rule := LearnOneRule(Target, Attrs, Examples)
    while performance(Rule, Examples) > Threshold do
        LearnedRules := LearnedRules ∪ {Rule}
        Examples := Examples \ {examples classified correctly by Rule}
        Rule := LearnOneRule(Target, Attrs, Examples)
    sort LearnedRules according to performance
    return LearnedRules
25 Learning One Rule
- To learn one rule
- Perform greedy search
- Could be top-down or bottom-up
- Top-down
- Start with maximally general rule
- Add literals one by one
- Bottom-up
- Start with maximally specific rule
- Remove literals one by one
26Learning One Rule
function LearnOneRule(Target, Attrs, Examples)
NewRule IF true THEN pos NewRuleNeg
Neg while NewRuleNeg not empty, do
add a new literal to the rule Candidates
generate candidate literals BestLit
argmaxL?Candidates performance(Specialise(NewRule,
L)) NewRule Specialise(NewRule,
BestLit) NewRuleNeg x?Neg x covered by
NewRule return NewRule function
Specialise(Rule, Lit) let Rule IF
conditions THEN pos return IF conditions
and Lit THEN pos
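A Python sketch of both functions for binary classification with nominal attributes (the data representation and the accuracy-based performance measure are assumptions of the sketch, not prescribed by the slides). Examples are dicts from attribute to value plus a boolean "class"; a rule is a dict of attribute tests:

    def covers(rule, example):
        return all(example.get(a) == v for a, v in rule.items())

    def accuracy(rule, examples):
        covered = [e for e in examples if covers(rule, e)]
        return sum(e["class"] for e in covered) / len(covered) if covered else 0.0

    def learn_one_rule(attrs, examples):
        rule = {}                                            # IF true THEN pos
        while any(covers(rule, e) and not e["class"] for e in examples):
            candidates = [(a, e[a]) for e in examples for a in attrs if a not in rule]
            if not candidates:
                break
            best = max(candidates,
                       key=lambda c: accuracy({**rule, c[0]: c[1]}, examples))
            rule[best[0]] = best[1]                          # specialise with the best literal
        return rule

    def learn_rule_set(attrs, examples, threshold=0.9):
        rules, remaining = [], list(examples)
        while True:
            rule = learn_one_rule(attrs, remaining)
            if not rule or accuracy(rule, remaining) <= threshold:
                break
            rules.append(rule)
            remaining = [e for e in remaining
                         if not (covers(rule, e) and e["class"])]   # drop covered positives
        return rules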
27 Illustration
[Figure omitted: positive and negative examples in instance space]
28 Illustration
[Figure omitted: the examples covered by the learned rule]
IF A AND B THEN pos
29 Some Options
- Options for learning 1 rule
- Top-down or bottom-up?
- Example-driven?
- Hill-climbing, beam search, ...?
- Learn rules for 1 class at a time, or for multiple classes?
- E.g., first learn a rule set for pos, then one for neg, vs. learning 1 set with pos and neg rules
- Learn an ordered or an unordered set of rules?
- Ordered: the 1st rule that applies will be used
- Allows for easy incorporation of exceptions
30 Illustration: Bottom-up vs. Top-down
- Bottom-up: typically more specific rules
- Top-down: typically more general rules
[Figure omitted]
31 Heuristics
- Heuristics
- When is a rule good?
- High accuracy
- Less important: high coverage
- Possible evaluation functions
- Accuracy
- A variant of accuracy: the m-estimate
- Entropy: more symmetry between pos and neg
- Post-pruning of rules
- Cf. what was done for decision trees
32 Example-driven Rule Induction
- Example: the AQ algorithms (Michalski et al.)
- For a given class C
- As long as there are uncovered examples for C
- Pick one such example e
- Consider He = the rules that cover this example
- Search top-down in He to find the best rule
- Much more efficient search
- The hypothesis spaces He are much smaller than H (the set of all rules)
- Less robust w.r.t. noise
- What if a noisy example is picked?
33 Discovery of Association Rules
- An example of descriptive induction, as opposed to predictive induction
- Predictive: learn a function that predicts, for new instances, the value of a certain attribute (e.g., the class)
- Descriptive: learn patterns in the data
- E.g., find groups of similar instances (clustering)
- E.g., find associations between attributes
34 Predictive Induction vs. Descriptive Induction
[Figure: instances plotted in a two-attribute space (X, Y), labelled a, b, c, + and -]
- Descriptive: find associations between any properties
- E.g., clusters (association X-Y)
- E.g., "a" occurs in the top left (association X,Y - a)
- Predictive: find associations between 1 specific property (+/-) and any other properties
35
- The difference is not clear-cut
- There are many views on the relationship between predictive and descriptive induction
- For instance, discriminatory induction
- Predictive induction: learn to discriminate + / -
- Can be done by performing descriptive induction on the separate classes
- Descriptive: find patterns that generally hold in the whole set
- Predictive: find patterns that hold for + and not for -
36 Association rules
- Association rules
- Similar to decision rules: IF ... THEN ...
- Describe relationships between sets of boolean attributes
- E.g., market basket analysis: learn which products are often bought together

IF bread AND butter THEN cheese (confidence 50%, support 5%)

Client  cheese  bread  butter  wine  jam  ham
1       yes     yes    yes     yes   no   yes
2       yes     no     yes     no    no   no
3       no      yes    yes     no    no   yes
...     ...     ...    ...     ...   ...  ...
37 Some Characteristics
- Association rule: IF a1, ..., an THEN an+1, ..., an+m
- Is characterised by
- Support: how many of all clients actually buy a1, ..., an+m?
- If too low, the rule is not very important
- Confidence: how many of the buyers of a1, ..., an also buy an+1, ..., an+m?
- Need not be close to 100 percent
- Even a small increase w.r.t. the normal level may be interesting
38 Searching for association rules
- Often very large databases have to be analysed
- An efficient algorithm is needed
- Repeatedly running normal (adapted) rule induction algorithms is not efficient
- Moreover, typical rule algorithms give a minimal set of rules that is sufficient to define a concept
- But we want all rules satisfying the criteria, not a minimal set
- The APRIORI algorithm (Agrawal et al., 1993)
- Parameters: min. support, min. confidence
- Works in 2 steps
- Step 1: find frequent sets
- Step 2: combine frequent sets into association rules
39 A Key Observation
- Observation
- Let freq(S) = the number of examples containing S
- Consider IF a1, ..., an THEN an+1, ..., an+m
- support = freq({a1, ..., an+m}) / freq(∅)   (freq(∅) = total number of examples)
- confidence = freq({a1, ..., an+m}) / freq({a1, ..., an})
- ⇒ all association rules with sufficient confidence and support can be derived from the list of "frequent sets" and their frequencies
- S is a "frequent set" iff freq(S) > min_support × freq(∅)
40 Finding Frequent Sets
- Step 1: find frequent sets
- Observation: if {a1, ..., ai} is not frequent, then {a1, ..., ai+1} is not frequent
- ⇒ breadth-first, general-to-specific search
- Find all frequent sets of cardinality 1
- Find all frequent sets of cardinality 2
- The set {a1, a2} can be frequent only if {a1} and {a2} are both frequent
- Many pairs are pruned before actually computing their frequency by looking at the data
- The others are "candidates" ⇒ need to check their frequencies
- Find all frequent sets of cardinality 3
- {a1, a2, a3} can be frequent only if {a1, a2}, {a2, a3} and {a1, a3} are frequent...
41
[Figure: part of the itemset lattice, with singletons (bread, ham, cheese, wine, butter, jam), pairs (bread+butter, bread+jam, bread+cheese, cheese+jam, butter+cheese, butter+jam) and triples (bread+butter+jam, bread+butter+cheese); frequent sets, infrequent sets and non-candidates are marked]
42
- Algorithm for finding frequent sets

min_freq := min_support × freq(∅)
d := 0
Q0 := {∅}                         /* Qd = candidates for level d */
F := ∅                            /* F = frequent sets */
while Qd ≠ ∅ do
    for all S in Qd: find freq(S)             /* data access */
    delete those S in Qd with freq(S) < min_freq
    F := F ∪ Qd
    compute Qd+1
    d := d+1
return F
43
- Offline computation of the new candidates (a runnable sketch follows below)
- "Offline" = without having to look at the examples

compute Qd+1 from Qd and F:
    Qd+1 := ∅
    for each S in Qd:
        for each item x not in S:
            S' := S ∪ {x}
            if each subset of S' obtained by removing 1 element of S' is in F
            then add S' to Qd+1
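A compact Python sketch of step 1, following the two pieces of pseudocode above (transactions are sets of items, min_support is a fraction of the number of transactions; the names are illustrative):

    from itertools import combinations

    def frequent_sets(transactions, min_support):
        n = len(transactions)
        min_freq = min_support * n

        def freq(S):
            return sum(1 for t in transactions if S <= t)    # data access

        F = {}                                               # frequent set -> frequency
        Q = [frozenset()]                                    # level-0 candidates
        items = set().union(*transactions)
        while Q:
            level = {S: freq(S) for S in Q}
            level = {S: f for S, f in level.items() if f >= min_freq}
            F.update(level)
            # offline candidate generation: extend by one item and keep only sets
            # all of whose one-element-smaller subsets were just found frequent
            Q = set()
            for S in level:
                for x in items - S:
                    S2 = S | {x}
                    if all(frozenset(sub) in level
                           for sub in combinations(S2, len(S2) - 1)):
                        Q.add(S2)
            Q = list(Q)
        return F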
44
- Step 2: infer association rules from the frequent sets (sketch below)
- If S ∪ {a} is in F and freq(S ∪ {a}) / freq(S) > min_confidence
- Then return the rule "IF S THEN a"

[Figure: frequencies of the frequent sets {bread, cheese}, {bread, butter}, {butter, cheese} (40, 50, 45 in the diagram) and {bread, butter, cheese} (20)]

IF bread AND butter THEN cheese: confidence 50% (20/40), support 5%
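A matching sketch of step 2, producing rules with a single item in the conclusion from the dictionary returned by the frequent_sets sketch above (Python; by the Apriori property every subset of a frequent set is itself in F):

    def association_rules(F, min_confidence):
        rules = []
        for S, f in F.items():
            for a in S:
                body = S - {a}
                if body in F:
                    conf = f / F[body]            # freq(S) / freq(S \ {a})
                    if conf > min_confidence:
                        rules.append((set(body), a, conf))
        return rules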
45 Post-Processing of Rules
- Often many association rules are found
- How to process them, as a human?
- Sort according to some criterion
- E.g., support or confidence
- Sometimes used: the statistical significance of the deviation
- p1 = P(cheese) = 0.6
- p2 = P(cheese | bread, butter) = 0.7
- A rule is more interesting if the deviation of p2 from p1 is statistically more significant
- Combines both support and confidence
- Other post-processing methods exist
- One might even use a query language to query the rule base
- Find all rules for which conditions ... hold
- Sort rules according to ...
46 To Remember
- Sequential covering approaches
- Different variants of the basic algorithm
- Association rules
- What they represent, the difference with predictive rules
- The Apriori algorithm in detail
47 14 Inductive Logic Programming
- Introduction: induction in logic
- A note on representations
- Practical motivation for ILP
- ILP and how it relates to other methods
- Some fundamental notions
- Learning methods in ILP
- Induction of Horn clauses
- First order association rules
- Other methods
48 A Logical Perspective on Induction
- Up till now, usually
- A set of data was given
- Some general model was to be induced
- We can generalise this
- Given certain knowledge
- Which can in principle be any type of statement
- Induce more general, plausible knowledge from it
- We need a general method for representing and reasoning about knowledge
49
- Logic is a suitable language
- Often used in practice for knowledge representation and reasoning
- Reasoning in logic is typically deductive reasoning
- We need to study inductive reasoning in logic
50 Deduction vs. induction
- Deduction: reasoning from general to specific
- Is "always correct", truth-preserving
- Induction: reasoning from specific to general; the inverse of deduction
- Not truth-preserving
- But there may be statistical evidence

Deduction: All men are mortal. Socrates is a man. Therefore Socrates is mortal.
Induction: Socrates is mortal. Socrates is a man. Therefore (plausibly) all men are mortal.
51 A Note on Representations
- So logic would allow us to obtain a more general kind of inductive reasoning...
- But do we actually need this in practice?
- Yes: not all problems are equally easy to describe in the previous settings
- Let's have a look at a number of different representational settings
52 The Attribute-Value Framework
- Up till now
- All data in one single table
- Each example is described by one vector of fixed attributes (one row in the table)
- The induced hypothesis contains conditions that compare an attribute with specific values
- The "standard" setting, or attribute-value setting
- Too limited in some cases!
53 More Complicated: the Multi-Instance Setting
- Example (Dietterich, 1996)
- A set of molecules, musk / non-musk
- Each molecule can have many different conformations
- If at least one of these conformations has a certain property, the molecule is a musk
- In other words: relate a property of a set to properties of its members
- Not easy to handle in the standard format
54 Multi-Instance Illustration
- We have to relate the class of one row to values in some other row
55 Even More Complicated Settings
- What if examples are described by sets, graphs, strings, ...?
- Cf. molecular structures
- The standard setting is often not usable
- Many specific algorithms have been devised
- Alternative
- Use a sufficiently general representation mechanism
- Motivates the study of induction in FOL
56 Example: Learning to Pass
[Figure: a game situation with numbered players]
57 Learning to pass (1)
red3 is close to blue7; red7 is close to blue4; blue7 passes to red3 → bad pass
red3 is close to blue7; red7 is close to blue4; blue7 passes to blue5 → good pass
red3 is close to blue7; red7 is close to blue4; blue7 passes to blue4 → bad pass
red3 is close to blue7; red7 is close to blue4; blue7 passes to red7 → bad pass
58 Learning to Pass: Pattern 1
red3 is close to blue7; red7 is close to blue4; blue7 passes to red3 → bad pass
red3 is close to blue7; red7 is close to blue4; blue7 passes to blue5 → good pass
red3 is close to blue7; red7 is close to blue4; blue7 passes to blue4 → bad pass
red3 is close to blue7; red7 is close to blue4; blue7 passes to red7 → bad pass

x passes to y, y is red ⇒ bad pass
59 Learning to Pass: Pattern 2
red3 is close to blue7; red7 is close to blue4; blue7 passes to red3 → bad pass
red3 is close to blue7; red7 is close to blue4; blue7 passes to blue5 → good pass
red3 is close to blue7; red7 is close to blue4; blue7 passes to blue4 → bad pass
red3 is close to blue7; red7 is close to blue4; blue7 passes to red7 → bad pass

x passes to y, y is red ⇒ bad pass
x passes to y, z is close to y, z is red ⇒ bad pass
60 Learning to pass: representation
- Could we represent soccer situations in one table?
- E.g., attributes "1 is close to 3", or "somebody is close to 5", ...
- Yes, but
- Many different attributes are necessary
- Attributes refer to specific player numbers
- Hypotheses are not expressible at the right level of abstraction
- No variables are possible in the hypotheses
61 Representations: Conclusions
- Having a good representation is important!
- Standard setting
- Limited expressive power
- High efficiency
- Sufficient in many cases!
- Other settings: only use when needed
- When a transformation into the standard setting is not feasible
62 Relating ILP to Other Learning Approaches
- The learning methods we discussed earlier can learn hypotheses that can be expressed in propositional logic
- Conditions are chosen from a fixed set of propositions, e.g., "Sky = sunny", ...
- ILP learners use the framework of first order predicate logic
- Predicates, variables, quantifiers
- Conditions are literals composed of a predicate symbol, variables, constants
- More expressive
63 What Exactly Is ILP?
- ILP = inductive logic programming
- Logic programming: programs are sets of first order rules ("Horn clauses")
- Inductive logic programming: learn such programs from examples
- More generally: learn logic formulae from examples
- ⇒ study induction in the framework of first order predicate logic
- Up till now: only propositional methods
- Hypotheses can be expressed in propositional logic
64 Logic Programming
- Logic program: definition of predicates in terms of existing predicates
- Practical language: Prolog

Definition: "a course is an advanced course iff it is difficult or has an advanced course as a prerequisite"
∀x (Advanced(x) ⟺ Difficult(x) ∨ ∃y (Prerequisite(y,x) ∧ Advanced(y)))

Representation as Horn clauses:
Advanced(x) ⟵ Difficult(x)
Advanced(x) ⟵ Prerequisite(y,x) ∧ Advanced(y)

Representation as a Prolog program:
advanced(X) :- difficult(X).
advanced(X) :- prerequisite(Y,X), advanced(Y).
65
- First order logic formulae can be general assertions (not necessarily definitions)
- They can represent any kind of knowledge

Assertion: "all people are male or female"
∀x: Human(x) ⇒ Male(x) ∨ Female(x)
Using Prolog-like notation: male(X) ; female(X) :- human(X)
66 Some terminology
- Terms refer to objects in the world
- Variables (X, Y, ...) or constants (a, b, 5, ...)
- Predicates: properties of / relationships between objects
- Predicates: human/1, male/1, father/2, ...
- Atom = predicate symbol + n argument terms
- E.g., human(X), father(luc, soetkin)
- Atoms evaluate to true or false
- Literal = a possibly negated atom
- human(luc), not father(soetkin,luc), ...
67
- Clause = a disjunction of literals in which all variables are universally quantified
- E.g., ∀x,y: father(x,y) ∨ ¬male(x) ∨ ¬parent(x,y)
- Equivalent to ∀x,y: father(x,y) ⟵ male(x) ∧ parent(x,y)
- Prolog notation: father(X,Y) :- male(X), parent(X,Y)
- Horn clause: contains at most 1 positive literal
- Variable substitution
- Changing a variable into another variable or a constant
- E.g., θ = {X/a, Y/b}
- Application of a substitution θ to a clause c: cθ
- E.g., father(X,Y)θ = father(a,b)
68
- Ground formula: contains no variables
- Fact: a clause consisting of 1 atom
- E.g., father(luc,soetkin)
- Ground facts are very useful to represent data
69 Inductive logic programming
- Induction of first order logic formulae from data

Prolog dataset:
male(luc). male(maarten). female(lieve). female(soetkin).
father(luc,soetkin). father(luc,maarten).
mother(lieve,soetkin). mother(lieve,maarten).
parent(luc,soetkin). ...

Knowledge discovered:
false :- male(X), female(X).
female(X) :- mother(X,Y).
male(X) :- father(X,Y).
parent(X,Y) :- father(X,Y).
parent(X,Y) :- mother(X,Y).
...
70
- Different settings are possible
- Learn a definition for one predicate
- Predictive induction
- E.g., learn a definition of "father"
- Very similar to learning rule sets
- Learn general patterns
- Descriptive induction
- E.g., learn relationships between male, female, ...
- false :- male(X), female(X).
- Similar to the discovery of association rules
71
- Representation of the data
- Propositional: 1 example is described by listing the true propositions (usually in table format)
- The size of the description is constant
- ILP: 1 example is essentially described by a set of related facts
- The size of this set may vary
72"Enjoy Sport" example
- Description of one example 1 row in table
Day Sky Airtemp Humidity Wind Water
Forecast EnjoySport 1 sunny warm
normal strong cool change yes
Possible propositional representation
Possible first order representation
day(1). sky(1, sunny). airtemp(1,
warm). humidity(1, normal). wind(1,
strong). water(1, cool). forecast(1,
change). enjoy_sport(1, yes).
"Skysunny" true "Skyrainy"
false "Airtempwarm" true ...
73"Bongard" example
- Classify examples based on internal structure
neg
pos
74 Bongard: Propositional Representation
- How to represent a drawing in 1 table?
- Attributes for each component of the drawing
- Attributes for relationships between components
- Several problems with this representation
- 1. Assumption of a fixed number of objects
- If there are fewer objects: leave attributes blank
- 2. Very large number of attributes
- Many of which are possibly blank or irrelevant
- 3. The meaning of the attributes is not well-defined
- Multiple representations of the same thing are possible
75
- Issues 1 and 2
- E.g., max. 5 objects in a drawing
- Attributes: Object1, Points1, Object2, Points2, ..., Object5, Points5, Inside12, Inside13, Inside14, Inside15, Inside21, Inside23, ..., Inside54
- The number of attributes is easily superlinear in the number of objects (here: quadratic)
- The more objects allowed, the more blank attributes
76
- Issue 3: consider this example
- Possible representations ("Inside" attributes left out); each row is an alternative representation of the same figure:

Fig.  Obj1      Points1  Obj2      Points2  Obj3    Points3  Class
1     Circle    -        Triangle  Down     -       -        pos
1     Triangle  Down     Circle    -        -       -        pos
1     -         -        Triangle  Down     Circle  -        pos
...

- How to represent the concept "contains a triangle pointing down"?
- IF Object2 = triangle AND Points2 = down THEN pos
- Does not work with each valid representation!
77
- The attribute-value table and the corresponding rules provide an incorrect level of abstraction
- Better representation
- 1 example = multiple rows (possibly in multiple tables)
- Learning algorithms need to be adapted!
78 Bongard: First Order Logic Representation
- First order logic representation

Drawing 1 (any number of objects allowed):
contains(1, o1). contains(1, o2). triangle(o1). points(o1, down). circle(o2). pos(1).

Hypothesis:
pos(X) :- contains(X,Y), triangle(Y), points(Y, down).

- The use of variables provides the right abstraction level for the hypothesis
79
- Equivalent representation as a relational database (cf. data mining)
- 1 example = a set of tuples instead of 1 tuple
- ILP = mining in multiple tuples / multiple relations
- An important issue in current data mining research
[Figure: relational tables Contains, Objects and Inside holding the information about example 1]
80 Background knowledge
- Additional advantage of first order logic: background knowledge about the domain can be expressed concisely

Without background knowledge (everything spelled out):
triangle(o1). polygon(o1). square(o2). polygon(o2). circle(o3).
square(o4). polygon(o4). square(o5). polygon(o5). ...

Background knowledge:
polygon(X) :- triangle(X).
polygon(X) :- square(X).

Data about examples:
triangle(o1). square(o2). circle(o3). square(o4). square(o5). ...
81 A Real World Example
- Find a "pharmacophore" in molecules
- Identify the substructure that causes a molecule to "dock" on certain other molecules
- Molecules are described by listing, for each atom: element, 3-D coordinates, ...
- Background knowledge defines the computation of Euclidean distance, ...
82
Background knowledge:
...
hacc(M,A) :- atm(M,A,o,2,_,_,_).
hacc(M,A) :- atm(M,A,o,3,_,_,_).
hacc(M,A) :- atm(M,A,s,2,_,_,_).
hacc(M,A) :- atm(M,A,n,ar,_,_,_).
zincsite(M,A) :- atm(M,A,du,_,_,_,_).
hdonor(M,A) :- atm(M,A,h,_,_,_,_), not(carbon_bond(M,A)), !.
...

Description of molecules:
atm(m1,a1,o,2,3.4304,-3.1160,0.0489).
atm(m1,a2,c,2,6.0334,-1.7760,0.6795).
atm(m1,a3,o,2,7.0265,-2.0425,0.0232).
...
bond(m1,a2,a3,2). bond(m1,a5,a6,1). bond(m1,a2,a4,1). bond(m1,a6,a7,du).
...

Hypothesis:
active(A) :-
    zincsite(A,B), hacc(A,C), hacc(A,D), hacc(A,E),
    dist(A,C,B,4.891,0.750), dist(A,C,D,3.753,0.750),
    dist(A,C,E,3.114,0.750), dist(A,D,B,8.475,0.750),
    dist(A,D,E,2.133,0.750), dist(A,E,B,7.899,0.750).
83
84 ILP Related to "Explanation Based Learning"
- EBL = explanation based learning
- One of the first approaches to use first order logic
- A special case of "analytical learning"
- Idea
- Specify a full domain theory explaining the examples
- Only part of the theory is actually relevant for the examples
- Finding an explanation for an example = finding the part of the full theory that is relevant for this example
85
- Also called "speed-up" learning
- Once more specific explanations are available, these can be used to make predictions much more efficiently than with the full theory
- Relation to ILP
- The language bias in ILP is very similar to the domain theory in EBL
- The ILP language bias can be seen as specifying a domain theory that is not necessarily correct
- That is what makes it inductive
- EBL could be simulated using ILP
86
[Figure: deduction vs. induction between some general theory, a hypothesis, and the observed data. EBL: correct, but not necessarily relevant explanations. ILP: possible, but not necessarily correct explanations.]
87 Conclusions
- Advantages of using first order logic
- More complex data can be represented
- Existing background knowledge can be represented
- A more powerful representation language for hypotheses
- Hence, useful for certain kinds of learning...
- "Structural" learning: examples have a complex structure
- "Relational" learning: relations between objects are important
- Also related to data mining in relational databases
- Inductive logic programming provides these
88 Fundamentals of inductive logic programming
- Notion of generality (cf. version spaces)
- How to specialise conditions?
- How to generalise conditions?
- Main concepts
- Operators for specialisation / generalisation
- θ-subsumption
- Inverse resolution
- Least general generalisation
89 Notion of generality
- Remember version spaces
- The notion of generality was very important
- <?, ?> ≥g <warm, ?> ≥g <warm, sunny>
- In first order logic: the same question...
- When is a concept definition more general than another one?
- The answer will allow us to implement, e.g., a general-to-specific search
- ... but it is more difficult to answer
90 One Option...
- A theory G is more general than a theory S if and only if G ⊨ S
- G ⊨ S: in every interpretation (set of facts) for which G is true, S is also true
- "G logically implies S"
- E.g., "all fruit tastes good" ⊨ "all apples taste good" (assuming apples are fruit)
- Note: we are talking about general theories here, not just concepts (↔ version spaces)
- The generality of concepts is a special case of this
91
- Induction = the inverse of deduction
- Deductive operators "⊢" exist that implement (or approximate) ⊨
- E.g., resolution (from logic programming)
- So, inverting these operators should yield inductive operators
- A basic technique in many inductive logic programming systems
92 Various frameworks for generality
- Depending on the form of G and S
- 1 clause / set of clauses / any first order theory
- Depending on the choice of ⊢ to invert
- θ-subsumption
- Resolution
- Implication
- Some frameworks are easier than others
93 Inverting resolution
- Resolution works very well for deductive systems (e.g., Prolog)
- Simple cases of resolution

Propositional:
  p ∨ ¬q,  q ∨ r   ⊢   p ∨ r
  p ← q,  q ← s    ⊢   p ← s

First order:
  p(X) ∨ ¬q(X),  q(X) ∨ ¬r(X,Y)   ⊢   p(X) ∨ ¬r(X,Y)
  p(a) ∨ ¬q(b),  q(X) ∨ ¬r(X,Y)   ⊢   p(a) ∨ ¬r(b,Y)     (with θ = {X/b})
94 The Resolution Rule

Two opposite literals (up to a substitution): liθ1 = ¬kjθ2

  l1 ∨ ... ∨ li ∨ ... ∨ ln        k1 ∨ ... ∨ kj ∨ ... ∨ km
  ---------------------------------------------------------
  (l1 ∨ ... ∨ li-1 ∨ li+1 ∨ ... ∨ ln ∨ k1 ∨ ... ∨ kj-1 ∨ kj+1 ∨ ... ∨ km) θ1θ2

E.g., p(X) :- q(X) and q(X) :- r(X,Y) yield p(X) :- r(X,Y)
      p(X) :- q(X) and q(a) yield p(a).
95 Example derivation

grandparent(X,Y) :- father(X,Z), parent(Z,Y)
father(X,Y) :- male(X), parent(X,Y)
  ⇒ grandparent(X,Y) :- male(X), parent(X,Z), parent(Z,Y)
male(jef)
  ⇒ grandparent(jef,Y) :- parent(jef,Z), parent(Z,Y)
parent(jef,an)
  ⇒ grandparent(jef,Y) :- parent(an,Y)
parent(an,paul)
  ⇒ grandparent(jef,paul)
96 Inverting Resolution
- Inverse resolution is much more difficult than resolution itself
- Different operators are needed (see further)
- No unique results
- Making 2 things equal can be done in only one way, but making things different can be done in many ways!
- Hence, a very large search space
- It turned out to be impractical, unless a human guides the generalisation process
- Interactive learners
97 Inverse resolution operators
- Some operators related to inverse resolution (A, B are conjunctions of literals)
- Absorption
- From q :- A and p :- A, B
- Infer p :- q, B
- Identification
- From p :- q, B and p :- A, B
- Infer q :- A
[Figure: "V" diagrams showing q :- A and p :- q, B resolving to p :- A, B]
98
- Intra-construction
- From p :- A, B and p :- A, C
- Infer q :- B and p :- A, q and q :- C
- Inter-construction
- From p :- A, B and q :- A, C
- Infer p :- r, B and r :- A and q :- r, C
[Figure: "W" diagrams for intra-construction (from p :- A,B and p :- A,C) and inter-construction (from p :- A,B and q :- A,C)]
99 Predicate invention
- With intra- and inter-construction, new predicates are invented
- E.g., apply intra-construction on
- grandparent(X,Y) :- father(X,Z), father(Z,Y)
- grandparent(X,Y) :- father(X,Z), mother(Z,Y)
- What predicate is invented?
100 Example derivation

grandparent(X,Y) :- father(X,Z), parent(Z,Y)
father(X,Y) :- male(X), parent(X,Y)
  ⇒ grandparent(X,Y) :- male(X), parent(X,Z), parent(Z,Y)
male(jef)
  ⇒ grandparent(jef,Y) :- parent(jef,Z), parent(Z,Y)
parent(jef,an)
  ⇒ grandparent(jef,Y) :- parent(an,Y)
parent(an,paul)
  ⇒ grandparent(jef,paul)
101 We need something simpler
- Inverse resolution
- Allows us to generalise sets of clauses
- But most of the time it is too complex
- Move towards the generalisation of single clauses
- The popular operators will be based on θ-subsumption
102 Theta-subsumption
- A clause c1 theta-subsumes another clause c2 if there exists a variable substitution such that c1 becomes a subset of c2
- c1 ≥θ c2 ⟺ ∃θ: c1θ ⊆ c2
- To check this, first write the clauses as disjunctions
- a, b, c :- d, e, f  ⟺  a ∨ b ∨ c ∨ ¬d ∨ ¬e ∨ ¬f
- Then try to replace variables by constants or other variables
- Intuition: c2 is a special case of c1
103 Theta-subsumption examples
- c1 = father(X,Y) :- parent(X,Y)
- c2 = father(X,Y) :- parent(X,Y), male(X)
- For the empty substitution θ: c1θ ⊆ c2, so c1 θ-subsumes c2
- c3 = father(luc,Y) :- parent(luc,Y)
- For θ = {X/luc}: c1θ = c3, so c1 θ-subsumes c3
- c2 and c3 do not θ-subsume one another
(A small θ-subsumption checker sketch follows below.)
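A small, illustrative θ-subsumption checker (Python). The representation is an assumption of the sketch: a clause is a list of literals, a literal is a (predicate, args) tuple where a leading "~" marks a negated (body) literal, and variables are strings starting with an upper-case letter:

    def is_var(t):
        return isinstance(t, str) and t[:1].isupper()

    def match(lit1, lit2, theta):
        # try to extend theta so that lit1 * theta equals lit2
        if lit1[0] != lit2[0] or len(lit1[1]) != len(lit2[1]):
            return None
        theta = dict(theta)
        for a, b in zip(lit1[1], lit2[1]):
            if is_var(a):
                if a in theta and theta[a] != b:
                    return None
                theta[a] = b
            elif a != b:
                return None
        return theta

    def theta_subsumes(c1, c2, theta=None):
        # c1 theta-subsumes c2 iff some substitution maps every literal of c1
        # onto a literal of c2 (i.e. c1 * theta is a subset of c2)
        theta = {} if theta is None else theta
        if not c1:
            return True
        first, rest = c1[0], c1[1:]
        return any(theta_subsumes(rest, c2, t)
                   for lit in c2
                   if (t := match(first, lit, theta)) is not None)

    # c1 = father(X,Y) :- parent(X,Y)   and   c3 = father(luc,Y) :- parent(luc,Y)
    c1 = [("father", ("X", "Y")), ("~parent", ("X", "Y"))]
    c3 = [("father", ("luc", "Y")), ("~parent", ("luc", "Y"))]
    print(theta_subsumes(c1, c3))        # True, via theta = {X/luc}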
104
- Try it yourself
- c1 = p(X,Y) :- q(X,Y)
- c2 = p(X,Y) :- q(X,Y), q(Y,X)
- c3 = p(Z,Z) :- q(Z,Z)
- c4 = p(a,a) :- q(a,a)
- Which clauses are θ-subsumed by which?
105 Searching for clauses
- Structure the hypothesis space (the space of clauses) according to θ-subsumption
- Operators for moving in this space
- Minimal specialisation operator
- From a clause c, derive d such that c ≥θ d and there is no e with c >θ e >θ d
- Minimal generalisation operator
- Usually starts from 2 clauses
- From c and d, find e such that e ≥θ c, e ≥θ d, and there is no f with e >θ f, f ≥θ c and f ≥θ d
106
- Note the similarity with propositional refinement
- IF Sky = sunny THEN EnjoySport = yes
- To specialise: add 1 condition
- IF Sky = sunny AND Humidity = low THEN EnjoySport = yes
- In first order logic
- c1 = father(X,Y) :- parent(X,Y)
- To specialise: find clauses θ-subsumed by c1
- father(X,Y) :- parent(X,Y), male(X)
- father(luc,X) :- parent(luc,X)
- father(X,X) :- parent(X,X)
- ...
107
- Properties of θ-subsumption
- Sound
- If c1 θ-subsumes c2 then c1 ⊨ c2
- Incomplete
- Possibly c1 ⊨ c2 without c1 θ-subsuming c2 (but only for recursive clauses)
- c1 = p(f(X)) :- p(X)
- c2 = p(f(f(X))) :- p(X)
- Checking whether c1 θ-subsumes c2 is decidable but NP-complete
- Transitive and reflexive, not anti-symmetric
- A "semi-order" relation
108
- A semi-order generates equivalence classes and a partial order on those equivalence classes
- Equivalence class: c1 ≡ c2 iff c1 ≥θ c2 and c2 ≥θ c1
- c1 and c2 are then called syntactic variants
- c1 is the reduced clause of c2 iff c1 contains a minimal subset of the literals of c2 that is still equivalent with c2
- If c1 and c2 are in different equivalence classes: either c1 ≥θ c2 or c2 ≥θ c1 or neither ⇒ antisymmetry ⇒ partial order
109
[Figure: lattice of equivalence classes of clauses. One class contains the syntactic variants p(X,Y) :- m(X,Y); p(X,Y) :- m(X,Y), m(X,Z); p(X,Y) :- m(X,Y), m(X,Z), m(X,U); ... (the first one is the reduced clause). It is the lgg of the classes of p(X,Y) :- m(X,Y), r(X) and p(X,Y) :- m(X,Y), s(X), whose glb is the class of p(X,Y) :- m(X,Y), s(X), r(X).]
110
- Since the equivalence classes are partially ordered, they form a lattice
- The least upper bound / greatest lower bound of two clauses always exists
- Infinite chains c1 >θ c2 >θ c3 >θ ... >θ c exist
- h(X,Y) :- p(X,Y)
- h(X,Y) :- p(X,Y), p(Y,Y2)
- h(X,Y) :- p(X,Y), p(Y,Y2), p(Y2,Y3)
- ...
- h(X,X) :- p(X,X)
111 Specialisation operators
- How to traverse the hypothesis space so that
- No hypotheses are generated more than once
- No hypotheses are skipped
- Shapiro: general-to-specific traversal using a refinement operator ρ
- ρ(c) yields the set of refinements of c
- ρ(c) = { c' | c' is a maximally general specialisation of c }
- ρ(c) ⊆ { c ∨ l | l is a literal } ∪ { cθ | θ is a substitution }
112
[Figure: part of a refinement graph starting from daughter(X,Y), with refinements such as daughter(X,X); daughter(X,Y) :- parent(X,Z); daughter(X,Y) :- parent(Y,X); daughter(X,Y) :- female(X); and further refinements daughter(X,Y) :- female(X), female(Y) and daughter(X,Y) :- female(X), parent(X,Y)]
113 A generalisation operator
- Start from 2 clauses and compute their least general generalisation (lgg)
- I.e., given 2 clauses, return the most specific single clause that is more general than both of them
- Definition of the lgg of terms
- (let si, tj denote any term, V a variable)
- lgg(f(s1,...,sn), f(t1,...,tn)) = f(lgg(s1,t1), ..., lgg(sn,tn))
- lgg(f(s1,...,sn), g(t1,...,tn)) = V   (for f ≠ g)
114
- lgg of literals
- lgg(p(s1,...,sn), p(t1,...,tn)) = p(lgg(s1,t1), ..., lgg(sn,tn))
- lgg(¬p(...), ¬p(...)) = ¬lgg(p(...), p(...))
- lgg(p(s1,...,sn), q(t1,...,tn)) is undefined
- lgg(p(...), ¬p(...)) and lgg(¬p(...), p(...)) are undefined
- lgg of clauses
- lgg(c1, c2) = { lgg(l1, l2) | l1 ∈ c1, l2 ∈ c2 and lgg(l1, l2) is defined }
(A small code sketch of these definitions follows below.)
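A small sketch of these definitions for function-free literals (Python; the representation follows the θ-subsumption sketch earlier, and the pairing of differing constants to fresh variables is shared across literals, as the clause-level definition requires):

    def lgg_terms(s, t, var_map):
        if s == t:
            return s
        if (s, t) not in var_map:
            var_map[(s, t)] = f"V{len(var_map)}"     # same pair -> same fresh variable
        return var_map[(s, t)]

    def lgg_literals(l1, l2, var_map):
        # a literal is (predicate, args); undefined for different predicates or signs
        if l1[0] != l2[0] or len(l1[1]) != len(l2[1]):
            return None
        return (l1[0], tuple(lgg_terms(a, b, var_map) for a, b in zip(l1[1], l2[1])))

    def lgg_clauses(c1, c2):
        var_map = {}
        lits = {lgg_literals(l1, l2, var_map) for l1 in c1 for l2 in c2}
        return lits - {None}

    # The example on the next slide: lgg of f(t,a) :- p(t,a), m(t), f(a)
    # and f(j,p) :- p(j,p), m(j), m(p)   is   f(X,Y) :- p(X,Y), m(X), m(Z)
    c1 = [("f", ("t", "a")), ("~p", ("t", "a")), ("~m", ("t",)), ("~f", ("a",))]
    c2 = [("f", ("j", "p")), ("~p", ("j", "p")), ("~m", ("j",)), ("~m", ("p",))]
    print(lgg_clauses(c1, c2))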
115 Applying lgg
- Example
- f(t,a) :- p(t,a), m(t), f(a)
- f(j,p) :- p(j,p), m(j), m(p)
- lgg: f(X,Y) :- p(X,Y), m(X), m(Z)
- Relative lgg (rlgg) (Plotkin, 1971)
- Relative to a "background theory" B (assume B is a set of facts)
- rlgg(e1, e2) = lgg(e1 :- B, e2 :- B)
116

Facts:
pos(1). pos(2).
contains(1,o1). contains(1,o2). contains(2,o3).
triangle(o1). triangle(o3).
points(o1,down). points(o3,down).
circle(o2).

lgg( (pos(1) :- contains(1,o1), contains(1,o2), triangle(o1), points(o1,down), circle(o2)),
     (pos(2) :- contains(2,o3), triangle(o3), points(o3,down)) )
= pos(X) :- contains(X,Y), triangle(Y), points(Y,down)
117 Conclusions
- We now have basic operators
- θ-subsumption-based, at the single clause level
- Specialisation operator ρ
- Generalisation operator lgg
- Inverse resolution: generalises a theory (a set of clauses)
- These can be used to build ILP systems
- Top-down: using specialisation operators
- Bottom-up: using generalisation operators
118 Rule induction
- Most inductive logic programming systems induce a concept definition in the form of a rule set
- Algorithms similar to the propositional algorithms
- FOIL → CN2
- Progol → AQ
119 FOIL (Quinlan)
- Learns a single concept, e.g., p(X,Y) :- ...
- To learn one clause (hill-climbing search)
- Start with the most general clause p(X,Y) :- true
- Repeat
- Add the best literal to the clause (i.e., the literal that most improves the quality of the clause)
- The new literal can also be a unification X=c or X=Y
- = applying a refinement operator under θ-subsumption
- Until no further improvement
120 FOIL Example: Learning One Clause

Positive: father(homer,bart). father(bill,chelsea).
Negative: father(marge,bart). father(hillary,chelsea). father(bart,chelsea).

parent(homer,bart). parent(marge,bart). parent(bill,chelsea). parent(hillary,chelsea).
male(homer). male(bart). male(bill). female(chelsea). female(marge).
121
(Same examples and background facts as on the previous slide.)

Candidate clauses:
father(X,Y) :- parent(X,Y).
father(X,Y) :- parent(Y,X).
father(X,Y) :- male(X).
father(X,Y) :- male(Y).
father(X,Y) :- female(X).
father(X,Y) :- female(Y).

father(X,Y) :- parent(X,Y) covers 2 positives and 2 negatives (2+, 2-).
122
(Same examples, facts and candidate clauses as on the previous slide.)

father(X,Y) :- male(X) covers 2 positives and 1 negative (2+, 1-) and is selected for further refinement.
123
(Same examples and facts as before.)

Candidate refinements:
father(X,Y) :- male(X).
father(X,Y) :- male(X), parent(X,Y).
father(X,Y) :- male(X), parent(Y,X).
father(X,Y) :- male(X), male(Y).
father(X,Y) :- male(X), female(X).
father(X,Y) :- male(X), female(Y).

father(X,Y) :- male(X), parent(X,Y) covers 2 positives and no negatives (2+, 0-).
124 FOIL: Learning Multiple Clauses
- To learn multiple clauses
- Repeat
- Learn a single clause c (see the previous algorithm)
- Add c to h
- Mark the positive examples covered by c as covered
- Until
- All positive examples are marked covered
- Or no more good clauses are found
125

likes(garfield, lasagna). likes(garfield, birds). likes(garfield, meat).
likes(garfield, jon). likes(garfield, odie).
edible(lasagna). edible(birds).
subject_to_cruelty(odie). subject_to_cruelty(jon). subject_to_cruelty(birds).

likes(garfield, X) :- edible(X).
3+, 0-
126

likes(garfield, lasagna). likes(garfield, birds). likes(garfield, meat).
likes(garfield, jon). likes(garfield, odie).
(italics = previously covered)

likes(garfield, X) :- edible(X).
likes(garfield, X) :- subject_to_cruelty(X).
2+, 0-
127 Some pitfalls
- Avoiding infinite recursion
- When recursive clauses are allowed, e.g., ancestor(X,Y) :- parent(X,Z), ancestor(Z,Y)
- Avoid learning parent(X,Y) :- parent(X,Y)
- It won't be useful, even though it is 100% correct
- Bonus for the introduction of new variables
- A literal may not yield any direct gain, but may introduce variables that are useful later

p(X) :- q(X)              covers p positives, n negatives
refine by adding age:
p(X) :- q(X), age(X,Y)    covers p positives, n negatives ⇒ no gain
128 Golem (Muggleton & Feng)
- Based on the rlgg operator
- To build one clause
- Look at 2 positive examples, find their rlgg, generalise using yet another example, ..., until no improvement in the quality of the clause
- Bottom-up search
- The result is very dependent on the choice of examples
- E.g., what if the true theory is { p(X) :- q(X), p(X) :- r(X) }?
129
- Try this for different couples of examples and pick the best clause found
- This reduces the dependency on the choice of the couple (if 1 of them is noisy, no good clause is found)
- Remove the covered positive examples, restart the process
- Repeat until no more good clauses are found
130 Progol (Muggleton)
- Top-down approach, but with a seed
- To find one clause
- Start with 1 positive example e
- Generate a hypothesis space He that contains only hypotheses covering at least this one example
- First generate the most specific clause c that covers e
- He contains every clause more general than c
- Perform an exhaustive top-down search in He, looking for the clause that maximizes compaction
- Compaction = size(covered examples) - size(clause)
131
- Repeat the process of finding one clause until no more good (= compaction-causing) clauses are found
- The compaction heuristic in principle allows no coverage of negatives
- This can be relaxed (accommodating noise)
132 Generation of bottom clause
- Language bias = the set of all acceptable clauses (chosen by the user)
- A specification of H (on the level of single clauses)
- Bottom clause ⊥ for example e = the most specific clause in the language bias covering e
- Constructed using inverse entailment
133
- Construction of ⊥
- If B ∧ H ⊨ e, then B ∧ ¬e ⊨ ¬H
- If H is a clause, ¬H is a conjunction of ground (skolemized) literals
- Compute ¬⊥ = all ground literals entailed by B ∧ ¬e
- ¬H must be a subset of these
- So B ∧ ¬e ⊨ ¬⊥ ⊨ ¬H
- Hence H ⊨ ⊥
134
- Some examples (Muggleton, 1995, New Generation Computing)

B: anim(X) :- pet(X).  pet(X) :- dog(X).
e: nice(X) :- dog(X).
⊥: nice(X) :- dog(X), pet(X), anim(X).

B: hasbeak(X) :- bird(X).  bird(X) :- vulture(X).
e: hasbeak(tweety).
⊥: hasbeak(tweety) ∨ bird(tweety) ∨ vulture(tweety).
135
- Example of (part of) a Progol run
- Learn to classify animals as mammals, reptiles, ...

generalise(class/2)?
Generalising class(dog,mammal).
Most specific clause is
class(A,mammal) :- has_milk(A), has_covering(A,hair