Title: 12 Clustering
1 12 Clustering
- What is clustering?
- Flat clustering
- K-means clustering, the EM algorithm
- Hierarchical clustering
- Distance-based approaches
- Conceptual clustering
- The Cobweb algorithm
- Using clustering for prediction
2 What is clustering?
- Find groups (clusters) of instances so that
- Instances in the same group are similar
- Instances in different groups are different
- Compared to classification
- Similarity: assign classes to examples
- Difference: the classes are not known in advance
- Hence also called "unsupervised learning"
- Classes (even taxonomies) are "invented"
[Figure: two scatter plots contrasting a classification problem (labelled points) with a clustering problem (unlabelled points)]
3 Examples
- Typical example: construct a taxonomy of, e.g., animals (an example of "hierarchical clustering")
- In machine learning / data mining
- Applicable, for instance, in marketing
- Identify typical customer groups, e.g., of car drivers
- Produce products (e.g., cars) aimed at one specific group
- Auxiliary method for other techniques
- E.g., constructing local models (cf. instance-based methods), ...
4 Similarity Measures
- How to measure similarity between instances?
- Similar problem as for instance-based methods
- Possible options
- Distance metric
- Euclidean distance (a code sketch follows below)
- Other...
- More general forms of similarity
- Do not necessarily satisfy the triangle inequality, symmetry, ...
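As a concrete illustration (not part of the original slides), a minimal Python sketch of the Euclidean distance between two instances described by numeric attributes:

    import math

    def euclidean(x, y):
        # Euclidean distance between two numeric attribute vectors
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

    print(euclidean([1.0, 2.0, 0.5], [0.0, 2.5, 1.5]))  # 1.5

More general similarity measures are simply other functions of two instances; they need not satisfy the metric axioms.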
5 Flat vs. Hierarchical Clustering
- Flat clustering
- Given a data set, return a partition
- Hierarchical clustering
- Combine clusters into larger clusters, etc., until 1 cluster = the full data set
- Gives rise to a cluster hierarchy or taxonomy
[Figure: dendrogram over items A-F illustrating a cluster hierarchy]
6 Extensional vs. Conceptual Clustering
- Extensional clustering
- Clusters are defined as sets of examples
- cf. statistics
- Conceptual clustering
- Clusters are described in some language
- Typical criteria for a good conceptual clustering
- High intra-cluster similarity
- Simple conceptual description of the clusters
7 Flat Extensional Clustering
- Flat clustering
- Given a set of unlabeled data
- Find clusters of similar instances
- "Similar" = close to each other in some space
- The number of clusters may be given
- Other quality criteria for clusters may be given
- Examples of algorithms
- LEADER: simple, fast, but not very good
- K-means, EM
8 Leader
- Input: data, some threshold distance D
- Clusters are represented by prototypes
- Algorithm (a code sketch follows below)
- Start with no clusters
- For each example e
- Find the first cluster prototype p for which dist(p, e) < D
- If found, add e to the cluster of p; otherwise make e a new prototype
- Very fast, but low-quality clusters
- E.g., the results depend on the order of the examples
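A minimal Python sketch of Leader, assuming a point-level distance such as the euclidean function above (the function and variable names are illustrative, not from the slides):

    def leader(examples, D, dist):
        clusters = []                          # list of (prototype, members) pairs
        for e in examples:
            for prototype, members in clusters:
                if dist(prototype, e) < D:     # first prototype within threshold D
                    members.append(e)
                    break
            else:                              # no close prototype: e becomes a new prototype
                clusters.append((e, [e]))
        return clusters

Because each example is assigned to the first sufficiently close prototype, presenting the examples in a different order can yield different clusters.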
9 K-means clustering
- Input: data, number of clusters K
- Algorithm (a code sketch follows below)
- Start with K random seeds for the clusters
- Repeat until no changes
- For all instances e
- Add e to the cluster of the closest seed
- Compute the centres of all clusters
- E.g., for numeric data, compute the average
- These centres are the seeds for the next iteration
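A minimal K-means sketch for numeric data (Python/NumPy; illustrative only, e.g. the seeding and stopping test are one of several reasonable choices):

    import numpy as np

    def kmeans(X, K, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        seeds = X[rng.choice(len(X), K, replace=False)]       # K random seeds
        for _ in range(n_iter):
            # assign every instance to the cluster of its closest seed
            dists = np.linalg.norm(X[:, None, :] - seeds[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # the cluster centres (averages) become the seeds of the next iteration
            new_seeds = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                  else seeds[k] for k in range(K)])
            if np.allclose(new_seeds, seeds):                  # no change: stop
                break
            seeds = new_seeds
        return labels, seeds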
10 The EM algorithm
- Expectation maximisation
- Similar to K-means, but
- Now we assume, e.g., normal distributions
- Examples are not assigned to 1 cluster, but partially to different clusters (proportionally to the distributions)
- EM is actually much more general than just clustering
- Finding a number of distributions that generate the data
- Building mixture models
- See Mitchell for details (a small sketch follows below)
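A minimal EM sketch for a mixture of two 1-D Gaussians (Python/NumPy). This is only one instance of the general scheme described above; the initialisation and the fixed number of iterations are assumptions of the sketch:

    import numpy as np

    def em_two_gaussians(x, n_iter=50):
        mu = np.array([x.min(), x.max()])       # crude initial means
        sigma = np.array([x.std(), x.std()])    # initial standard deviations
        pi = np.array([0.5, 0.5])               # mixing weights
        for _ in range(n_iter):
            # E-step: partial (soft) assignment of each example to each component
            dens = np.array([pi[k] / (sigma[k] * np.sqrt(2 * np.pi)) *
                             np.exp(-(x - mu[k]) ** 2 / (2 * sigma[k] ** 2))
                             for k in range(2)])
            resp = dens / dens.sum(axis=0)       # responsibilities
            # M-step: re-estimate the parameters from the soft assignments
            Nk = resp.sum(axis=1)
            mu = (resp * x).sum(axis=1) / Nk
            sigma = np.sqrt((resp * (x - mu[:, None]) ** 2).sum(axis=1) / Nk)
            pi = Nk / len(x)
        return mu, sigma, pi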
11 Hierarchical Extensional Clustering
- Top-down (divisive) methods
- Start with 1 cluster (whole data set)
- Divide it into subsets
- Subdivide subsets further, etc.
- Bottom-up (agglomerative) methods
- Start with singleton clusters
- Join closest clusters together
- Repeat until 1 cluster
12 Example Agglomerative Methods
- A number of well-known agglomerative methods exist
- Variants according to the definition of the closest clusters
- The distance criterion for examples is assumed known
- How does it generalise to a distance between clusters?
- Options (sketched in code below)
- Single linkage: distance between clusters = distance between the closest points
- Complete linkage: cluster distance = distance between the furthest points
- Average distance, ...
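A minimal sketch of the two most common options (Python; assumes a point-level distance function such as euclidean above):

    def single_linkage(c1, c2, dist):
        # cluster distance = distance between the closest pair of points
        return min(dist(a, b) for a in c1 for b in c2)

    def complete_linkage(c1, c2, dist):
        # cluster distance = distance between the furthest pair of points
        return max(dist(a, b) for a in c1 for b in c2)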
13 Conceptual clustering
- The previous methods just form groups of examples
- Conceptual clustering: also return a description of the groups in some language L
- The quality of the clusters depends on properties of the cluster description as well
- Typically, a simple cluster description in L is preferred
- The language defines the context in which the quality of a clustering is evaluated
- Whether 2 elements are in the same cluster may not only depend on themselves, but also on the other data
14 Example
- How would you cluster these points?
15 TDIDT for Clustering
- Decision tree = conceptual hierarchical clustering
- Each node = 1 cluster
- The tests from the root to a node = conjunctive description of the cluster
- Hence, the language L of cluster descriptions = conjunctions of attribute tests
16 Cobweb
- Well-known clustering algorithm (Fisher, 1987)
- Finds a conceptual hierarchical clustering
- Probabilistic description: a probability distribution of attribute values for each cluster
- Heuristic
- Maximize predictiveness and predictability of attributes
- Predictability: given the cluster, how well can you predict the attributes?
- Predictiveness: given the attributes, how well can you predict the cluster?
- Maximize both
17 Cobweb Algorithm
- Incremental algorithm
- For each example
- For each level (top to bottom) of the current taxonomy
- Change the current level using one of several operators
- Add the example to a cluster
- Create a new cluster (with 1 example)
- Merge clusters
- Split clusters
- Move down to the relevant subcluster
- Evaluation of a clustering: try to maximize a combination of predictiveness P(Ck | Ai = vij) and predictability P(Ai = vij | Ck) (Ck = cluster, Ai = attribute, vij = attribute value); a sketch of one standard such measure follows below
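The slide does not spell out the combination; the standard choice in Cobweb is Fisher's category utility. A minimal sketch for nominal attributes (Python; assumes non-empty clusters and examples represented as dicts from attribute to value):

    from collections import Counter

    def category_utility(clusters, attributes):
        all_examples = [e for c in clusters for e in c]
        n = len(all_examples)

        def sq_prob_sum(examples):
            # sum over attributes and values of P(Ai = vij)^2 within `examples`
            total = 0.0
            for a in attributes:
                counts = Counter(e[a] for e in examples)
                total += sum((cnt / len(examples)) ** 2 for cnt in counts.values())
            return total

        base = sq_prob_sum(all_examples)   # expected correct guesses without clusters
        cu = sum((len(c) / n) * (sq_prob_sum(c) - base) for c in clusters)
        return cu / len(clusters)          # averaged over the clusters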
18 Using clustering for prediction
- Clustering can be used for the prediction of any property
- Once clusters are found, a prediction is made in a 2-step process
- Given the known attributes of an instance, predict its cluster (OK if predictiveness is high)
- Given the cluster, predict the unknown attributes (OK if predictability is high)
- "Flexible prediction": it is not known in advance what will be given and what will need to be predicted
19
- If something is known about what will be predicted, the clustering process can be tuned
- Maximize the predictiveness of the attributes that will be given
- Maximize the predictability of the attributes that will need to be predicted
- Many learning approaches can be described in this "predictive clustering" framework
- Try, e.g., decision trees, instance-based learning
20 To Remember
- Important concepts
- Similarity measures, distances
- Flat vs. hierarchical clustering
- Extensional vs. conceptual clustering
- Clustering algorithms
- Leader, K-means, EM
- Single/complete linkage
- Cobweb
- Use of clustering for prediction
21 13 Induction of Rule Sets
- Representing theories with decision rules
- Induction of predictive rules
- Sequential covering approaches
- Induction of association rules
- The Apriori approach
- → Ch. 10 (partially)
22 Representing Theories with Decision Rules
- Previous representations
- Decision trees
- Numerical representations
- Popular representation for concept definitions: if-then rules
- IF <conditions> THEN belongs to concept
- How can we learn such rules?
- Trees can be converted to rules
- Using genetic algorithms
- With specific rule-learning methods
23 Sequential Covering Approaches
- Or: the separate-and-conquer approach
- General principle: learn rules 1 at a time
- Learn 1 rule that has
- High accuracy
- When it predicts something, it should be correct
- Any coverage
- It does not have to make a prediction for all examples, just for some of them
- Mark covered examples
- These have been taken care of; from now on, focus on the rest
- Repeat this until all examples are covered
24 Sequential Covering
- General algorithm for learning rule sets
- Based on the CN2 algorithm (Clark & Niblett)

function LearnRuleSet(Target, Attrs, Examples, Threshold):
    LearnedRules := ∅
    Rule := LearnOneRule(Target, Attrs, Examples)
    while performance(Rule, Examples) > Threshold do
        LearnedRules := LearnedRules ∪ {Rule}
        Examples := Examples \ {examples classified correctly by Rule}
        Rule := LearnOneRule(Target, Attrs, Examples)
    sort LearnedRules according to performance
    return LearnedRules
25 Learning One Rule
- To learn one rule
- Perform greedy search
- Could be top-down or bottom-up
- Top-down
- Start with maximally general rule
- Add literals one by one
- Bottom-up
- Start with maximally specific rule
- Remove literals one by one
26Learning One Rule
function LearnOneRule(Target, Attrs, Examples)
NewRule IF true THEN pos NewRuleNeg
Neg while NewRuleNeg not empty, do
add a new literal to the rule Candidates
generate candidate literals BestLit
argmaxL?Candidates performance(Specialise(NewRule,
L)) NewRule Specialise(NewRule,
BestLit) NewRuleNeg x?Neg x covered by
NewRule return NewRule function
Specialise(Rule, Lit) let Rule IF
conditions THEN pos return IF conditions
and Lit THEN pos
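A Python sketch of both functions for binary classification with nominal attributes (the data representation and the accuracy-based performance measure are assumptions of the sketch, not prescribed by the slides). Examples are dicts from attribute to value plus a boolean "class"; a rule is a dict of attribute tests:

    def covers(rule, example):
        return all(example.get(a) == v for a, v in rule.items())

    def accuracy(rule, examples):
        covered = [e for e in examples if covers(rule, e)]
        return sum(e["class"] for e in covered) / len(covered) if covered else 0.0

    def learn_one_rule(attrs, examples):
        rule = {}                                            # IF true THEN pos
        while any(covers(rule, e) and not e["class"] for e in examples):
            candidates = [(a, e[a]) for e in examples for a in attrs if a not in rule]
            if not candidates:
                break
            best = max(candidates,
                       key=lambda c: accuracy({**rule, c[0]: c[1]}, examples))
            rule[best[0]] = best[1]                          # specialise with the best literal
        return rule

    def learn_rule_set(attrs, examples, threshold=0.9):
        rules, remaining = [], list(examples)
        while True:
            rule = learn_one_rule(attrs, remaining)
            if not rule or accuracy(rule, remaining) <= threshold:
                break
            rules.append(rule)
            remaining = [e for e in remaining
                         if not (covers(rule, e) and e["class"])]   # drop covered positives
        return rules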
27 Illustration
[Figure omitted: positive and negative examples in instance space]
28 Illustration
[Figure omitted: the examples covered by the learned rule]
IF A AND B THEN pos
29 Some Options
- Options for learning 1 rule
- Top-down or bottom-up?
- Example-driven?
- Hill-climbing, beam search, ...?
- Learn rules for 1 class at a time, or for multiple classes?
- E.g., first learn a rule set for pos, then one for neg, vs. learning 1 set with pos and neg rules
- Learn an ordered or an unordered set of rules?
- Ordered: the 1st rule that applies will be used
- Allows for easy incorporation of exceptions
30 Illustration: Bottom-up vs. Top-down
- Bottom-up: typically more specific rules
- Top-down: typically more general rules
[Figure omitted]
31 Heuristics
- Heuristics
- When is a rule good?
- High accuracy
- Less important: high coverage
- Possible evaluation functions
- Accuracy
- A variant of accuracy: the m-estimate
- Entropy: more symmetry between pos and neg
- Post-pruning of rules
- Cf. what was done for decision trees
32 Example-driven Rule Induction
- Example: the AQ algorithms (Michalski et al.)
- For a given class C
- As long as there are uncovered examples for C
- Pick one such example e
- Consider He = the rules that cover this example
- Search top-down in He to find the best rule
- Much more efficient search
- The hypothesis spaces He are much smaller than H (the set of all rules)
- Less robust w.r.t. noise
- What if a noisy example is picked?
33 Discovery of Association Rules
- An example of descriptive induction, as opposed to predictive induction
- Predictive: learn a function that predicts, for new instances, the value of a certain attribute (e.g., the class)
- Descriptive: learn patterns in the data
- E.g., find groups of similar instances (clustering)
- E.g., find associations between attributes
34 Predictive Induction vs. Descriptive Induction
[Figure: instances plotted in a two-attribute space (X, Y), labelled a, b, c, + and -]
- Descriptive: find associations between any properties
- E.g., clusters (association X-Y)
- E.g., "a" occurs in the top left (association X,Y - a)
- Predictive: find associations between 1 specific property (+/-) and any other properties
35
- The difference is not clear-cut
- There are many views on the relationship between predictive and descriptive induction
- For instance, discriminatory induction
- Predictive induction: learn to discriminate + / -
- Can be done by performing descriptive induction on the separate classes
- Descriptive: find patterns that generally hold in the whole set
- Predictive: find patterns that hold for + and not for -
36 Association rules
- Association rules
- Similar to decision rules: IF ... THEN ...
- Describe relationships between sets of boolean attributes
- E.g., market basket analysis: learn which products are often bought together

IF bread AND butter THEN cheese (confidence 50%, support 5%)

Client  cheese  bread  butter  wine  jam  ham
1       yes     yes    yes     yes   no   yes
2       yes     no     yes     no    no   no
3       no      yes    yes     no    no   yes
...     ...     ...    ...     ...   ...  ...
37 Some Characteristics
- Association rule: IF a1, ..., an THEN an+1, ..., an+m
- Is characterised by
- Support: how many of all clients actually buy a1, ..., an+m?
- If too low, the rule is not very important
- Confidence: how many of the buyers of a1, ..., an also buy an+1, ..., an+m?
- Need not be close to 100 percent
- Even a small increase w.r.t. the normal level may be interesting
38 Searching for association rules
- Often very large databases have to be analysed
- An efficient algorithm is needed
- Repeatedly running normal (adapted) rule induction algorithms is not efficient
- Moreover, typical rule algorithms give a minimal set of rules that is sufficient to define a concept
- But we want all rules satisfying the criteria, not a minimal set
- The APRIORI algorithm (Agrawal et al., 1993)
- Parameters: min. support, min. confidence
- Works in 2 steps
- Step 1: find frequent sets
- Step 2: combine frequent sets into association rules
39 A Key Observation
- Observation
- Let freq(S) = the number of examples containing S
- Consider IF a1, ..., an THEN an+1, ..., an+m
- support = freq({a1, ..., an+m}) / freq(∅)   (freq(∅) = total number of examples)
- confidence = freq({a1, ..., an+m}) / freq({a1, ..., an})
- ⇒ all association rules with sufficient confidence and support can be derived from the list of "frequent sets" and their frequencies
- S is a "frequent set" iff freq(S) > min_support × freq(∅)
40 Finding Frequent Sets
- Step 1: find frequent sets
- Observation: if {a1, ..., ai} is not frequent, then {a1, ..., ai+1} is not frequent
- ⇒ breadth-first, general-to-specific search
- Find all frequent sets of cardinality 1
- Find all frequent sets of cardinality 2
- The set {a1, a2} can be frequent only if {a1} and {a2} are both frequent
- Many pairs are pruned before actually computing their frequency by looking at the data
- The others are "candidates" ⇒ need to check their frequencies
- Find all frequent sets of cardinality 3
- {a1, a2, a3} can be frequent only if {a1, a2}, {a2, a3} and {a1, a3} are frequent...
41
[Figure: part of the itemset lattice, with singletons (bread, ham, cheese, wine, butter, jam), pairs (bread+butter, bread+jam, bread+cheese, cheese+jam, butter+cheese, butter+jam) and triples (bread+butter+jam, bread+butter+cheese); frequent sets, infrequent sets and non-candidates are marked]
42
- Algorithm for finding frequent sets

min_freq := min_support × freq(∅)
d := 0
Q0 := {∅}                         /* Qd = candidates for level d */
F := ∅                            /* F = frequent sets */
while Qd ≠ ∅ do
    for all S in Qd: find freq(S)             /* data access */
    delete those S in Qd with freq(S) < min_freq
    F := F ∪ Qd
    compute Qd+1
    d := d+1
return F
43
- Offline computation of the new candidates (a runnable sketch follows below)
- "Offline" = without having to look at the examples

compute Qd+1 from Qd and F:
    Qd+1 := ∅
    for each S in Qd:
        for each item x not in S:
            S' := S ∪ {x}
            if each subset of S' obtained by removing 1 element of S' is in F
            then add S' to Qd+1
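A compact Python sketch of step 1, following the two pieces of pseudocode above (transactions are sets of items, min_support is a fraction of the number of transactions; the names are illustrative):

    from itertools import combinations

    def frequent_sets(transactions, min_support):
        n = len(transactions)
        min_freq = min_support * n

        def freq(S):
            return sum(1 for t in transactions if S <= t)    # data access

        F = {}                                               # frequent set -> frequency
        Q = [frozenset()]                                    # level-0 candidates
        items = set().union(*transactions)
        while Q:
            level = {S: freq(S) for S in Q}
            level = {S: f for S, f in level.items() if f >= min_freq}
            F.update(level)
            # offline candidate generation: extend by one item and keep only sets
            # all of whose one-element-smaller subsets were just found frequent
            Q = set()
            for S in level:
                for x in items - S:
                    S2 = S | {x}
                    if all(frozenset(sub) in level
                           for sub in combinations(S2, len(S2) - 1)):
                        Q.add(S2)
            Q = list(Q)
        return F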
44
- Step 2: infer association rules from the frequent sets (sketch below)
- If S ∪ {a} is in F and freq(S ∪ {a}) / freq(S) > min_confidence
- Then return the rule "IF S THEN a"

[Figure: frequencies of the frequent sets {bread, cheese}, {bread, butter}, {butter, cheese} (40, 50, 45 in the diagram) and {bread, butter, cheese} (20)]

IF bread AND butter THEN cheese: confidence 50% (20/40), support 5%
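A matching sketch of step 2, producing rules with a single item in the conclusion from the dictionary returned by the frequent_sets sketch above (Python; by the Apriori property every subset of a frequent set is itself in F):

    def association_rules(F, min_confidence):
        rules = []
        for S, f in F.items():
            for a in S:
                body = S - {a}
                if body in F:
                    conf = f / F[body]            # freq(S) / freq(S \ {a})
                    if conf > min_confidence:
                        rules.append((set(body), a, conf))
        return rules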
45 Post-Processing of Rules
- Often many association rules are found
- How to process them, as a human?
- Sort according to some criterion
- E.g., support or confidence
- Sometimes used: the statistical significance of the deviation
- p1 = P(cheese) = 0.6
- p2 = P(cheese | bread, butter) = 0.7
- A rule is more interesting if the deviation of p2 from p1 is statistically more significant
- Combines both support and confidence
- Other post-processing methods exist
- One might even use a query language to query the rule base
- Find all rules for which conditions ... hold
- Sort rules according to ...
46 To Remember
- Sequential covering approaches
- Different variants of the basic algorithm
- Association rules
- What they represent, the difference with predictive rules
- The Apriori algorithm in detail
47 14 Inductive Logic Programming
- Introduction: induction in logic
- A note on representations
- Practical motivation for ILP
- ILP and how it relates to other methods
- Some fundamental notions
- Learning methods in ILP
- Induction of Horn clauses
- First order association rules
- Other methods
48 A Logical Perspective on Induction
- Up till now, usually
- A set of data was given
- Some general model was to be induced
- We can generalise this
- Given certain knowledge
- Which can in principle be any type of statement
- Induce more general, plausible knowledge from it
- We need a general method for representing and reasoning about knowledge
49
- Logic is a suitable language
- Often used in practice for knowledge representation and reasoning
- Reasoning in logic is typically deductive reasoning
- We need to study inductive reasoning in logic
50 Deduction vs. induction
- Deduction: reasoning from general to specific
- Is "always correct", truth-preserving
- Induction: reasoning from specific to general; the inverse of deduction
- Not truth-preserving
- But there may be statistical evidence

Deduction: All men are mortal. Socrates is a man. Therefore Socrates is mortal.
Induction: Socrates is mortal. Socrates is a man. Therefore (plausibly) all men are mortal.
51 A Note on Representations
- So logic would allow us to obtain a more general kind of inductive reasoning...
- But do we actually need this in practice?
- Yes: not all problems are equally easy to describe in the previous settings
- Let's have a look at a number of different representational settings
52 The Attribute-Value Framework
- Up till now
- All data in one single table
- Each example is described by one vector of fixed attributes (one row in the table)
- The induced hypothesis contains conditions that compare an attribute with specific values
- The "standard" setting, or attribute-value setting
- Too limited in some cases!
53 More Complicated: the Multi-Instance Setting
- Example (Dietterich, 1996)
- A set of molecules, musk / non-musk
- Each molecule can have many different conformations
- If at least one of these conformations has a certain property, the molecule is a musk
- In other words: relate a property of a set to properties of its members
- Not easy to handle in the standard format
54 Multi-Instance Illustration
- We have to relate the class of one row to values in some other row
55 Even More Complicated Settings
- What if examples are described by sets, graphs, strings, ...?
- Cf. molecular structures
- The standard setting is often not usable
- Many specific algorithms have been devised
- Alternative
- Use a sufficiently general representation mechanism
- Motivates the study of induction in FOL
56 Example: Learning to Pass
[Figure: a game situation with numbered players]
57 Learning to pass (1)
red3 is close to blue7; red7 is close to blue4; blue7 passes to red3 → bad pass
red3 is close to blue7; red7 is close to blue4; blue7 passes to blue5 → good pass
red3 is close to blue7; red7 is close to blue4; blue7 passes to blue4 → bad pass
red3 is close to blue7; red7 is close to blue4; blue7 passes to red7 → bad pass
58 Learning to Pass: Pattern 1
red3 is close to blue7; red7 is close to blue4; blue7 passes to red3 → bad pass
red3 is close to blue7; red7 is close to blue4; blue7 passes to blue5 → good pass
red3 is close to blue7; red7 is close to blue4; blue7 passes to blue4 → bad pass
red3 is close to blue7; red7 is close to blue4; blue7 passes to red7 → bad pass

x passes to y, y is red ⇒ bad pass
59 Learning to Pass: Pattern 2
red3 is close to blue7; red7 is close to blue4; blue7 passes to red3 → bad pass
red3 is close to blue7; red7 is close to blue4; blue7 passes to blue5 → good pass
red3 is close to blue7; red7 is close to blue4; blue7 passes to blue4 → bad pass
red3 is close to blue7; red7 is close to blue4; blue7 passes to red7 → bad pass

x passes to y, y is red ⇒ bad pass
x passes to y, z is close to y, z is red ⇒ bad pass
60 Learning to pass: representation
- Could we represent soccer situations in one table?
- E.g., attributes "1 is close to 3", or "somebody is close to 5", ...
- Yes, but
- Many different attributes are necessary
- Attributes refer to specific player numbers
- Hypotheses are not expressible at the right level of abstraction
- No variables are possible in the hypotheses
61 Representations: Conclusions
- Having a good representation is important!
- Standard setting
- Limited expressive power
- High efficiency
- Sufficient in many cases!
- Other settings: only use when needed
- When a transformation into the standard setting is not feasible
62 Relating ILP to Other Learning Approaches
- The learning methods we discussed earlier can learn hypotheses that can be expressed in propositional logic
- Conditions are chosen from a fixed set of propositions, e.g., "Sky = sunny", ...
- ILP learners use the framework of first order predicate logic
- Predicates, variables, quantifiers
- Conditions are literals composed of a predicate symbol, variables, constants
- More expressive
63 What Exactly Is ILP?
- ILP = inductive logic programming
- Logic programming: programs are sets of first order rules ("Horn clauses")
- Inductive logic programming: learn such programs from examples
- More generally: learn logic formulae from examples
- ⇒ study induction in the framework of first order predicate logic
- Up till now: only propositional methods
- Hypotheses can be expressed in propositional logic
64 Logic Programming
- Logic program: definition of predicates in terms of existing predicates
- Practical language: Prolog

Definition: "a course is an advanced course iff it is difficult or has an advanced course as a prerequisite"
∀x (Advanced(x) ⟺ Difficult(x) ∨ ∃y (Prerequisite(y,x) ∧ Advanced(y)))

Representation as Horn clauses:
Advanced(x) ⟵ Difficult(x)
Advanced(x) ⟵ Prerequisite(y,x) ∧ Advanced(y)

Representation as a Prolog program:
advanced(X) :- difficult(X).
advanced(X) :- prerequisite(Y,X), advanced(Y).
65
- First order logic formulae can be general assertions (not necessarily definitions)
- They can represent any kind of knowledge

Assertion: "all people are male or female"
∀x: Human(x) ⇒ Male(x) ∨ Female(x)
Using Prolog-like notation: male(X) ; female(X) :- human(X)
66 Some terminology
- Terms refer to objects in the world
- Variables (X, Y, ...) or constants (a, b, 5, ...)
- Predicates: properties of / relationships between objects
- Predicates: human/1, male/1, father/2, ...
- Atom = predicate symbol + n argument terms
- E.g., human(X), father(luc, soetkin)
- Atoms evaluate to true or false
- Literal = a possibly negated atom
- human(luc), not father(soetkin,luc), ...
67
- Clause = a disjunction of literals in which all variables are universally quantified
- E.g., ∀x,y: father(x,y) ∨ ¬male(x) ∨ ¬parent(x,y)
- Equivalent to ∀x,y: father(x,y) ⟵ male(x) ∧ parent(x,y)
- Prolog notation: father(X,Y) :- male(X), parent(X,Y)
- Horn clause: contains at most 1 positive literal
- Variable substitution
- Changing a variable into another variable or a constant
- E.g., θ = {X/a, Y/b}
- Application of a substitution θ to a clause c: cθ
- E.g., father(X,Y)θ = father(a,b)
68
- Ground formula: contains no variables
- Fact: a clause consisting of 1 atom
- E.g., father(luc,soetkin)
- Ground facts are very useful to represent data
69 Inductive logic programming
- Induction of first order logic formulae from data

Prolog dataset:
male(luc). male(maarten). female(lieve). female(soetkin).
father(luc,soetkin). father(luc,maarten).
mother(lieve,soetkin). mother(lieve,maarten).
parent(luc,soetkin). ...

Knowledge discovered:
false :- male(X), female(X).
female(X) :- mother(X,Y).
male(X) :- father(X,Y).
parent(X,Y) :- father(X,Y).
parent(X,Y) :- mother(X,Y).
...
70
- Different settings are possible
- Learn a definition for one predicate
- Predictive induction
- E.g., learn a definition of "father"
- Very similar to learning rule sets
- Learn general patterns
- Descriptive induction
- E.g., learn relationships between male, female, ...
- false :- male(X), female(X).
- Similar to the discovery of association rules
71
- Representation of the data
- Propositional: 1 example is described by listing the true propositions (usually in table format)
- The size of the description is constant
- ILP: 1 example is essentially described by a set of related facts
- The size of this set may vary
72"Enjoy Sport" example
- Description of one example 1 row in table
Day Sky Airtemp Humidity Wind Water
Forecast EnjoySport 1 sunny warm
normal strong cool change yes
Possible propositional representation
Possible first order representation
day(1). sky(1, sunny). airtemp(1,
warm). humidity(1, normal). wind(1,
strong). water(1, cool). forecast(1,
change). enjoy_sport(1, yes).
"Skysunny" true "Skyrainy"
false "Airtempwarm" true ...
73"Bongard" example
- Classify examples based on internal structure
neg
pos
74 Bongard: Propositional Representation
- How to represent a drawing in 1 table?
- Attributes for each component of the drawing
- Attributes for relationships between components
- Several problems with this representation
- 1. Assumption of a fixed number of objects
- If there are fewer objects: leave attributes blank
- 2. Very large number of attributes
- Many of which are possibly blank or irrelevant
- 3. The meaning of the attributes is not well-defined
- Multiple representations of the same thing are possible
75
- Issues 1 and 2
- E.g., max. 5 objects in a drawing
- Attributes: Object1, Points1, Object2, Points2, ..., Object5, Points5, Inside12, Inside13, Inside14, Inside15, Inside21, Inside23, ..., Inside54
- The number of attributes is easily superlinear in the number of objects (here: quadratic)
- The more objects allowed, the more blank attributes
76
- Issue 3: consider this example
- Possible representations ("Inside" attributes left out); each row is an alternative representation of the same figure:

Fig.  Obj1      Points1  Obj2      Points2  Obj3    Points3  Class
1     Circle    -        Triangle  Down     -       -        pos
1     Triangle  Down     Circle    -        -       -        pos
1     -         -        Triangle  Down     Circle  -        pos
...

- How to represent the concept "contains a triangle pointing down"?
- IF Object2 = triangle AND Points2 = down THEN pos
- Does not work with each valid representation!
77
- The attribute-value table and the corresponding rules provide an incorrect level of abstraction
- Better representation
- 1 example = multiple rows (possibly in multiple tables)
- Learning algorithms need to be adapted!
78 Bongard: First Order Logic Representation
- First order logic representation

Drawing 1 (any number of objects allowed):
contains(1, o1). contains(1, o2). triangle(o1). points(o1, down). circle(o2). pos(1).

Hypothesis:
pos(X) :- contains(X,Y), triangle(Y), points(Y, down).

- The use of variables provides the right abstraction level for the hypothesis
79
- Equivalent representation as a relational database (cf. data mining)
- 1 example = a set of tuples instead of 1 tuple
- ILP = mining in multiple tuples / multiple relations
- An important issue in current data mining research
[Figure: relational tables Contains, Objects and Inside holding the information about example 1]
80 Background knowledge
- Additional advantage of first order logic: background knowledge about the domain can be expressed concisely

Without background knowledge (everything spelled out):
triangle(o1). polygon(o1). square(o2). polygon(o2). circle(o3).
square(o4). polygon(o4). square(o5). polygon(o5). ...

Background knowledge:
polygon(X) :- triangle(X).
polygon(X) :- square(X).

Data about examples:
triangle(o1). square(o2). circle(o3). square(o4). square(o5). ...
81 A Real World Example
- Find a "pharmacophore" in molecules
- Identify the substructure that causes a molecule to "dock" on certain other molecules
- Molecules are described by listing, for each atom: element, 3-D coordinates, ...
- Background knowledge defines the computation of Euclidean distance, ...
82
Background knowledge:
...
hacc(M,A) :- atm(M,A,o,2,_,_,_).
hacc(M,A) :- atm(M,A,o,3,_,_,_).
hacc(M,A) :- atm(M,A,s,2,_,_,_).
hacc(M,A) :- atm(M,A,n,ar,_,_,_).
zincsite(M,A) :- atm(M,A,du,_,_,_,_).
hdonor(M,A) :- atm(M,A,h,_,_,_,_), not(carbon_bond(M,A)), !.
...

Description of molecules:
atm(m1,a1,o,2,3.4304,-3.1160,0.0489).
atm(m1,a2,c,2,6.0334,-1.7760,0.6795).
atm(m1,a3,o,2,7.0265,-2.0425,0.0232).
...
bond(m1,a2,a3,2). bond(m1,a5,a6,1). bond(m1,a2,a4,1). bond(m1,a6,a7,du).
...

Hypothesis:
active(A) :-
    zincsite(A,B), hacc(A,C), hacc(A,D), hacc(A,E),
    dist(A,C,B,4.891,0.750), dist(A,C,D,3.753,0.750),
    dist(A,C,E,3.114,0.750), dist(A,D,B,8.475,0.750),
    dist(A,D,E,2.133,0.750), dist(A,E,B,7.899,0.750).
83
84 ILP Related to "Explanation Based Learning"
- EBL = explanation based learning
- One of the first approaches to use first order logic
- A special case of "analytical learning"
- Idea
- Specify a full domain theory explaining the examples
- Only part of the theory is actually relevant for the examples
- Finding an explanation for an example = finding the part of the full theory that is relevant for this example
85
- Also called "speed-up" learning
- Once more specific explanations are available, these can be used to make predictions much more efficiently than with the full theory
- Relation to ILP
- The language bias in ILP is very similar to the domain theory in EBL
- The ILP language bias can be seen as specifying a domain theory that is not necessarily correct
- That is what makes it inductive
- EBL could be simulated using ILP
86
[Figure: deduction vs. induction between some general theory, a hypothesis, and the observed data. EBL: correct, but not necessarily relevant explanations. ILP: possible, but not necessarily correct explanations.]
87 Conclusions
- Advantages of using first order logic
- More complex data can be represented
- Existing background knowledge can be represented
- A more powerful representation language for hypotheses
- Hence, useful for certain kinds of learning...
- "Structural" learning: examples have a complex structure
- "Relational" learning: relations between objects are important
- Also related to data mining in relational databases
- Inductive logic programming provides these
88 Fundamentals of inductive logic programming
- Notion of generality (cf. version spaces)
- How to specialise conditions?
- How to generalise conditions?
- Main concepts
- Operators for specialisation / generalisation
- θ-subsumption
- Inverse resolution
- Least general generalisation
89 Notion of generality
- Remember version spaces
- The notion of generality was very important
- <?, ?> ≥g <warm, ?> ≥g <warm, sunny>
- In first order logic: the same question...
- When is a concept definition more general than another one?
- The answer will allow us to implement, e.g., a general-to-specific search
- ... but it is more difficult to answer
90 One Option...
- A theory G is more general than a theory S if and only if G ⊨ S
- G ⊨ S: in every interpretation (set of facts) for which G is true, S is also true
- "G logically implies S"
- E.g., "all fruit tastes good" ⊨ "all apples taste good" (assuming apples are fruit)
- Note: we are talking about general theories here, not just concepts (↔ version spaces)
- The generality of concepts is a special case of this
91
- Induction = the inverse of deduction
- Deductive operators "⊢" exist that implement (or approximate) ⊨
- E.g., resolution (from logic programming)
- So, inverting these operators should yield inductive operators
- A basic technique in many inductive logic programming systems
92 Various frameworks for generality
- Depending on the form of G and S
- 1 clause / set of clauses / any first order theory
- Depending on the choice of ⊢ to invert
- θ-subsumption
- Resolution
- Implication
- Some frameworks are easier than others
93 Inverting resolution
- Resolution works very well for deductive systems (e.g., Prolog)
- Simple cases of resolution

Propositional:
  p ∨ ¬q,  q ∨ r   ⊢   p ∨ r
  p ← q,  q ← s    ⊢   p ← s

First order:
  p(X) ∨ ¬q(X),  q(X) ∨ ¬r(X,Y)   ⊢   p(X) ∨ ¬r(X,Y)
  p(a) ∨ ¬q(b),  q(X) ∨ ¬r(X,Y)   ⊢   p(a) ∨ ¬r(b,Y)     (with θ = {X/b})
94 The Resolution Rule

Two opposite literals (up to a substitution): liθ1 = ¬kjθ2

  l1 ∨ ... ∨ li ∨ ... ∨ ln        k1 ∨ ... ∨ kj ∨ ... ∨ km
  ---------------------------------------------------------
  (l1 ∨ ... ∨ li-1 ∨ li+1 ∨ ... ∨ ln ∨ k1 ∨ ... ∨ kj-1 ∨ kj+1 ∨ ... ∨ km) θ1θ2

E.g., p(X) :- q(X) and q(X) :- r(X,Y) yield p(X) :- r(X,Y)
      p(X) :- q(X) and q(a) yield p(a).
95 Example derivation

grandparent(X,Y) :- father(X,Z), parent(Z,Y)
father(X,Y) :- male(X), parent(X,Y)
  ⇒ grandparent(X,Y) :- male(X), parent(X,Z), parent(Z,Y)
male(jef)
  ⇒ grandparent(jef,Y) :- parent(jef,Z), parent(Z,Y)
parent(jef,an)
  ⇒ grandparent(jef,Y) :- parent(an,Y)
parent(an,paul)
  ⇒ grandparent(jef,paul)
96 Inverting Resolution
- Inverse resolution is much more difficult than resolution itself
- Different operators are needed (see further)
- No unique results
- Making 2 things equal can be done in only one way, but making things different can be done in many ways!
- Hence, a very large search space
- It turned out to be impractical, unless a human guides the generalisation process
- Interactive learners
97 Inverse resolution operators
- Some operators related to inverse resolution (A, B are conjunctions of literals)
- Absorption
- From q :- A and p :- A, B
- Infer p :- q, B
- Identification
- From p :- q, B and p :- A, B
- Infer q :- A
[Figure: "V" diagrams showing q :- A and p :- q, B resolving to p :- A, B]
98
- Intra-construction
- From p :- A, B and p :- A, C
- Infer q :- B and p :- A, q and q :- C
- Inter-construction
- From p :- A, B and q :- A, C
- Infer p :- r, B and r :- A and q :- r, C
[Figure: "W" diagrams for intra-construction (from p :- A,B and p :- A,C) and inter-construction (from p :- A,B and q :- A,C)]
99 Predicate invention
- With intra- and inter-construction, new predicates are invented
- E.g., apply intra-construction on
- grandparent(X,Y) :- father(X,Z), father(Z,Y)
- grandparent(X,Y) :- father(X,Z), mother(Z,Y)
- What predicate is invented?
100 Example derivation

grandparent(X,Y) :- father(X,Z), parent(Z,Y)
father(X,Y) :- male(X), parent(X,Y)
  ⇒ grandparent(X,Y) :- male(X), parent(X,Z), parent(Z,Y)
male(jef)
  ⇒ grandparent(jef,Y) :- parent(jef,Z), parent(Z,Y)
parent(jef,an)
  ⇒ grandparent(jef,Y) :- parent(an,Y)
parent(an,paul)
  ⇒ grandparent(jef,paul)
101 We need something simpler
- Inverse resolution
- Allows us to generalise sets of clauses
- But most of the time it is too complex
- Move towards the generalisation of single clauses
- The popular operators will be based on θ-subsumption
102 Theta-subsumption
- A clause c1 theta-subsumes another clause c2 if there exists a variable substitution such that c1 becomes a subset of c2
- c1 ≥θ c2 ⟺ ∃θ: c1θ ⊆ c2
- To check this, first write the clauses as disjunctions
- a, b, c :- d, e, f  ⟺  a ∨ b ∨ c ∨ ¬d ∨ ¬e ∨ ¬f
- Then try to replace variables by constants or other variables
- Intuition: c2 is a special case of c1
103 Theta-subsumption examples
- c1 = father(X,Y) :- parent(X,Y)
- c2 = father(X,Y) :- parent(X,Y), male(X)
- For the empty substitution θ: c1θ ⊆ c2, so c1 θ-subsumes c2
- c3 = father(luc,Y) :- parent(luc,Y)
- For θ = {X/luc}: c1θ = c3, so c1 θ-subsumes c3
- c2 and c3 do not θ-subsume one another
(A small θ-subsumption checker sketch follows below.)
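A small, illustrative θ-subsumption checker (Python). The representation is an assumption of the sketch: a clause is a list of literals, a literal is a (predicate, args) tuple where a leading "~" marks a negated (body) literal, and variables are strings starting with an upper-case letter:

    def is_var(t):
        return isinstance(t, str) and t[:1].isupper()

    def match(lit1, lit2, theta):
        # try to extend theta so that lit1 * theta equals lit2
        if lit1[0] != lit2[0] or len(lit1[1]) != len(lit2[1]):
            return None
        theta = dict(theta)
        for a, b in zip(lit1[1], lit2[1]):
            if is_var(a):
                if a in theta and theta[a] != b:
                    return None
                theta[a] = b
            elif a != b:
                return None
        return theta

    def theta_subsumes(c1, c2, theta=None):
        # c1 theta-subsumes c2 iff some substitution maps every literal of c1
        # onto a literal of c2 (i.e. c1 * theta is a subset of c2)
        theta = {} if theta is None else theta
        if not c1:
            return True
        first, rest = c1[0], c1[1:]
        return any(theta_subsumes(rest, c2, t)
                   for lit in c2
                   if (t := match(first, lit, theta)) is not None)

    # c1 = father(X,Y) :- parent(X,Y)   and   c3 = father(luc,Y) :- parent(luc,Y)
    c1 = [("father", ("X", "Y")), ("~parent", ("X", "Y"))]
    c3 = [("father", ("luc", "Y")), ("~parent", ("luc", "Y"))]
    print(theta_subsumes(c1, c3))        # True, via theta = {X/luc}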
104
- Try it yourself
- c1 = p(X,Y) :- q(X,Y)
- c2 = p(X,Y) :- q(X,Y), q(Y,X)
- c3 = p(Z,Z) :- q(Z,Z)
- c4 = p(a,a) :- q(a,a)
- Which clauses are θ-subsumed by which?
105 Searching for clauses
- Structure the hypothesis space (the space of clauses) according to θ-subsumption
- Operators for moving in this space
- Minimal specialisation operator
- From a clause c, derive d such that c ≥θ d and there is no e with c >θ e >θ d
- Minimal generalisation operator
- Usually starts from 2 clauses
- From c and d, find e such that e ≥θ c, e ≥θ d, and there is no f with e >θ f, f ≥θ c and f ≥θ d
106
- Note the similarity with propositional refinement
- IF Sky = sunny THEN EnjoySport = yes
- To specialise: add 1 condition
- IF Sky = sunny AND Humidity = low THEN EnjoySport = yes
- In first order logic
- c1 = father(X,Y) :- parent(X,Y)
- To specialise: find clauses θ-subsumed by c1
- father(X,Y) :- parent(X,Y), male(X)
- father(luc,X) :- parent(luc,X)
- father(X,X) :- parent(X,X)
- ...
107
- Properties of θ-subsumption
- Sound
- If c1 θ-subsumes c2 then c1 ⊨ c2
- Incomplete
- Possibly c1 ⊨ c2 without c1 θ-subsuming c2 (but only for recursive clauses)
- c1 = p(f(X)) :- p(X)
- c2 = p(f(f(X))) :- p(X)
- Checking whether c1 θ-subsumes c2 is decidable but NP-complete
- Transitive and reflexive, not anti-symmetric
- A "semi-order" relation
108
- A semi-order generates equivalence classes and a partial order on those equivalence classes
- Equivalence class: c1 ≡ c2 iff c1 ≥θ c2 and c2 ≥θ c1
- c1 and c2 are then called syntactic variants
- c1 is the reduced clause of c2 iff c1 contains a minimal subset of the literals of c2 that is still equivalent with c2
- If c1 and c2 are in different equivalence classes: either c1 ≥θ c2 or c2 ≥θ c1 or neither ⇒ antisymmetry ⇒ partial order
109
[Figure: lattice of equivalence classes of clauses. One class contains the syntactic variants p(X,Y) :- m(X,Y); p(X,Y) :- m(X,Y), m(X,Z); p(X,Y) :- m(X,Y), m(X,Z), m(X,U); ... (the first one is the reduced clause). It is the lgg of the classes of p(X,Y) :- m(X,Y), r(X) and p(X,Y) :- m(X,Y), s(X), whose glb is the class of p(X,Y) :- m(X,Y), s(X), r(X).]
110
- Since the equivalence classes are partially ordered, they form a lattice
- The least upper bound / greatest lower bound of two clauses always exists
- Infinite chains c1 >θ c2 >θ c3 >θ ... >θ c exist
- h(X,Y) :- p(X,Y)
- h(X,Y) :- p(X,Y), p(Y,Y2)
- h(X,Y) :- p(X,Y), p(Y,Y2), p(Y2,Y3)
- ...
- h(X,X) :- p(X,X)
111 Specialisation operators
- How to traverse the hypothesis space so that
- No hypotheses are generated more than once
- No hypotheses are skipped
- Shapiro: general-to-specific traversal using a refinement operator ρ
- ρ(c) yields the set of refinements of c
- ρ(c) = { c' | c' is a maximally general specialisation of c }
- ρ(c) ⊆ { c ∨ l | l is a literal } ∪ { cθ | θ is a substitution }
112
[Figure: part of a refinement graph starting from daughter(X,Y), with refinements such as daughter(X,X); daughter(X,Y) :- parent(X,Z); daughter(X,Y) :- parent(Y,X); daughter(X,Y) :- female(X); and further refinements daughter(X,Y) :- female(X), female(Y) and daughter(X,Y) :- female(X), parent(X,Y)]
113 A generalisation operator
- Start from 2 clauses and compute their least general generalisation (lgg)
- I.e., given 2 clauses, return the most specific single clause that is more general than both of them
- Definition of the lgg of terms
- (let si, tj denote any term, V a variable)
- lgg(f(s1,...,sn), f(t1,...,tn)) = f(lgg(s1,t1), ..., lgg(sn,tn))
- lgg(f(s1,...,sn), g(t1,...,tn)) = V   (for f ≠ g)
114
- lgg of literals
- lgg(p(s1,...,sn), p(t1,...,tn)) = p(lgg(s1,t1), ..., lgg(sn,tn))
- lgg(¬p(...), ¬p(...)) = ¬lgg(p(...), p(...))
- lgg(p(s1,...,sn), q(t1,...,tn)) is undefined
- lgg(p(...), ¬p(...)) and lgg(¬p(...), p(...)) are undefined
- lgg of clauses
- lgg(c1, c2) = { lgg(l1, l2) | l1 ∈ c1, l2 ∈ c2 and lgg(l1, l2) is defined }
(A small code sketch of these definitions follows below.)
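A small sketch of these definitions for function-free literals (Python; the representation follows the θ-subsumption sketch earlier, and the pairing of differing constants to fresh variables is shared across literals, as the clause-level definition requires):

    def lgg_terms(s, t, var_map):
        if s == t:
            return s
        if (s, t) not in var_map:
            var_map[(s, t)] = f"V{len(var_map)}"     # same pair -> same fresh variable
        return var_map[(s, t)]

    def lgg_literals(l1, l2, var_map):
        # a literal is (predicate, args); undefined for different predicates or signs
        if l1[0] != l2[0] or len(l1[1]) != len(l2[1]):
            return None
        return (l1[0], tuple(lgg_terms(a, b, var_map) for a, b in zip(l1[1], l2[1])))

    def lgg_clauses(c1, c2):
        var_map = {}
        lits = {lgg_literals(l1, l2, var_map) for l1 in c1 for l2 in c2}
        return lits - {None}

    # The example on the next slide: lgg of f(t,a) :- p(t,a), m(t), f(a)
    # and f(j,p) :- p(j,p), m(j), m(p)   is   f(X,Y) :- p(X,Y), m(X), m(Z)
    c1 = [("f", ("t", "a")), ("~p", ("t", "a")), ("~m", ("t",)), ("~f", ("a",))]
    c2 = [("f", ("j", "p")), ("~p", ("j", "p")), ("~m", ("j",)), ("~m", ("p",))]
    print(lgg_clauses(c1, c2))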
115 Applying lgg
- Example
- f(t,a) :- p(t,a), m(t), f(a)
- f(j,p) :- p(j,p), m(j), m(p)
- lgg: f(X,Y) :- p(X,Y), m(X), m(Z)
- Relative lgg (rlgg) (Plotkin, 1971)
- Relative to a "background theory" B (assume B is a set of facts)
- rlgg(e1, e2) = lgg(e1 :- B, e2 :- B)
116

Facts:
pos(1). pos(2).
contains(1,o1). contains(1,o2). contains(2,o3).
triangle(o1). triangle(o3).
points(o1,down). points(o3,down).
circle(o2).

lgg( (pos(1) :- contains(1,o1), contains(1,o2), triangle(o1), points(o1,down), circle(o2)),
     (pos(2) :- contains(2,o3), triangle(o3), points(o3,down)) )
= pos(X) :- contains(X,Y), triangle(Y), points(Y,down)
117 Conclusions
- We now have basic operators
- θ-subsumption-based, at the single clause level
- Specialisation operator ρ
- Generalisation operator lgg
- Inverse resolution: generalises a theory (a set of clauses)
- These can be used to build ILP systems
- Top-down: using specialisation operators
- Bottom-up: using generalisation operators
118 Rule induction
- Most inductive logic programming systems induce a concept definition in the form of a rule set
- Algorithms similar to the propositional algorithms
- FOIL → CN2
- Progol → AQ
119 FOIL (Quinlan)
- Learns a single concept, e.g., p(X,Y) :- ...
- To learn one clause (hill-climbing search)
- Start with the most general clause p(X,Y) :- true
- Repeat
- Add the best literal to the clause (i.e., the literal that most improves the quality of the clause)
- The new literal can also be a unification X=c or X=Y
- = applying a refinement operator under θ-subsumption
- Until no further improvement
120 FOIL Example: Learning One Clause

Positive: father(homer,bart). father(bill,chelsea).
Negative: father(marge,bart). father(hillary,chelsea). father(bart,chelsea).

parent(homer,bart). parent(marge,bart). parent(bill,chelsea). parent(hillary,chelsea).
male(homer). male(bart). male(bill). female(chelsea). female(marge).
121
(Same examples and background facts as on the previous slide.)

Candidate clauses:
father(X,Y) :- parent(X,Y).
father(X,Y) :- parent(Y,X).
father(X,Y) :- male(X).
father(X,Y) :- male(Y).
father(X,Y) :- female(X).
father(X,Y) :- female(Y).

father(X,Y) :- parent(X,Y) covers 2 positives and 2 negatives (2+, 2-).
122
(Same examples, facts and candidate clauses as on the previous slide.)

father(X,Y) :- male(X) covers 2 positives and 1 negative (2+, 1-) and is selected for further refinement.
123
(Same examples and facts as before.)

Candidate refinements:
father(X,Y) :- male(X).
father(X,Y) :- male(X), parent(X,Y).
father(X,Y) :- male(X), parent(Y,X).
father(X,Y) :- male(X), male(Y).
father(X,Y) :- male(X), female(X).
father(X,Y) :- male(X), female(Y).

father(X,Y) :- male(X), parent(X,Y) covers 2 positives and no negatives (2+, 0-).
124 FOIL: Learning Multiple Clauses
- To learn multiple clauses
- Repeat
- Learn a single clause c (see the previous algorithm)
- Add c to h
- Mark the positive examples covered by c as covered
- Until
- All positive examples are marked covered
- Or no more good clauses are found
125

likes(garfield, lasagna). likes(garfield, birds). likes(garfield, meat).
likes(garfield, jon). likes(garfield, odie).
edible(lasagna). edible(birds).
subject_to_cruelty(odie). subject_to_cruelty(jon). subject_to_cruelty(birds).

likes(garfield, X) :- edible(X).
3+, 0-
126

likes(garfield, lasagna). likes(garfield, birds). likes(garfield, meat).
likes(garfield, jon). likes(garfield, odie).
(italics = previously covered)

likes(garfield, X) :- edible(X).
likes(garfield, X) :- subject_to_cruelty(X).
2+, 0-
127 Some pitfalls
- Avoiding infinite recursion
- When recursive clauses are allowed, e.g., ancestor(X,Y) :- parent(X,Z), ancestor(Z,Y)
- Avoid learning parent(X,Y) :- parent(X,Y)
- It won't be useful, even though it is 100% correct
- Bonus for the introduction of new variables
- A literal may not yield any direct gain, but may introduce variables that are useful later

p(X) :- q(X)              covers p positives, n negatives
refine by adding age:
p(X) :- q(X), age(X,Y)    covers p positives, n negatives ⇒ no gain
128 Golem (Muggleton & Feng)
- Based on the rlgg operator
- To build one clause
- Look at 2 positive examples, find their rlgg, generalise using yet another example, ..., until no improvement in the quality of the clause
- Bottom-up search
- The result is very dependent on the choice of examples
- E.g., what if the true theory is { p(X) :- q(X), p(X) :- r(X) }?
129
- Try this for different couples of examples and pick the best clause found
- This reduces the dependency on the choice of the couple (if 1 of them is noisy, no good clause is found)
- Remove the covered positive examples, restart the process
- Repeat until no more good clauses are found
130 Progol (Muggleton)
- Top-down approach, but with a seed
- To find one clause
- Start with 1 positive example e
- Generate a hypothesis space He that contains only hypotheses covering at least this one example
- First generate the most specific clause c that covers e
- He contains every clause more general than c
- Perform an exhaustive top-down search in He, looking for the clause that maximizes compaction
- Compaction = size(covered examples) - size(clause)
131
- Repeat the process of finding one clause until no more good (= compaction-causing) clauses are found
- The compaction heuristic in principle allows no coverage of negatives
- This can be relaxed (accommodating noise)
132 Generation of bottom clause
- Language bias = the set of all acceptable clauses (chosen by the user)
- A specification of H (on the level of single clauses)
- Bottom clause ⊥ for example e = the most specific clause in the language bias covering e
- Constructed using inverse entailment
133
- Construction of ⊥
- If B ∧ H ⊨ e, then B ∧ ¬e ⊨ ¬H
- If H is a clause, ¬H is a conjunction of ground (skolemized) literals
- Compute ¬⊥ = all ground literals entailed by B ∧ ¬e
- ¬H must be a subset of these
- So B ∧ ¬e ⊨ ¬⊥ ⊨ ¬H
- Hence H ⊨ ⊥
134
- Some examples (Muggleton, 1995, New Generation Computing)

B: anim(X) :- pet(X).  pet(X) :- dog(X).
e: nice(X) :- dog(X).
⊥: nice(X) :- dog(X), pet(X), anim(X).

B: hasbeak(X) :- bird(X).  bird(X) :- vulture(X).
e: hasbeak(tweety).
⊥: hasbeak(tweety) ∨ bird(tweety) ∨ vulture(tweety).
135
- Example of (part of) a Progol run
- Learn to classify animals as mammals, reptiles, ...

generalise(class/2)?
Generalising class(dog,mammal).
Most specific clause is
class(A,mammal) :- has_milk(A), has_covering(A,hair