12 Clustering - PowerPoint PPT Presentation

Transcript and Presenter's Notes
1
12 Clustering
  • What is clustering?
  • Flat clustering
  • K-means clustering, EM algorithm
  • Hierarchical clustering
  • Distance-based approaches
  • Conceptual clustering
  • The Cobweb algorithm
  • Using clustering for prediction

2
What is clustering?
  • Find groups (clusters) of instances so that
  • Instances in the same group are similar
  • Instances in different groups are different
  • Compared to classification
  • Similarity: assign classes to examples
  • Difference: classes not known in advance
  • Hence also called "unsupervised learning"
  • Classes (even taxonomies) are "invented"


[Figure: two scatter plots, one labelled "Classification problem" (points marked + and -), one labelled "Clustering problem" (unlabelled points)]
3
Examples
  • Typical example: construct a taxonomy of, e.g., animals
  • Example of "hierarchical clustering"
  • In machine learning / data mining
  • Applicable, for instance, in marketing
  • identify typical customers, e.g., car drivers
  • Produce products (e.g., cars) that aim at one specific group
  • Auxiliary method for other techniques
  • e.g., constructing local models (cf. instance-based methods), ...

4
Similarity Measures
  • How to measure similarity between instances?
  • Similar problem as for instance-based methods
  • Possible options
  • Distance metric
  • Euclidean distance
  • Other...
  • More general forms of similarity
  • Do not necessarily satisfy triangle inequality, symmetry, ...

5
Flat vs. Hierarchical Clustering
  • Flat clustering
  • Given data set, return partition
  • Hierarchical clustering
  • Combine clusters into larger clusters, etc., until 1 cluster = full data set
  • Gives rise to a cluster hierarchy or taxonomy

[Figure: dendrogram joining clusters A, B, C, D, E, F into a hierarchy]
6
Extensional vs. Conceptual Clustering
  • Extensional clustering
  • Clusters defined as sets of examples
  • (cf. statistics)
  • Conceptual clustering
  • Clusters described in some language
  • Typical criteria for good conceptual clustering
  • High intra-cluster similarity
  • Simple conceptual description of clusters

7
Flat Extensional Clustering
  • Flat clustering
  • Given a set of unlabeled data
  • Find clusters of similar instances
  • "similar" = close to each other in some space
  • Number of clusters may be given
  • Other quality criteria for clusters may be given
  • Examples of algorithms
  • LEADER: simple, fast, but not very good
  • K-means, EM

8
Leader
  • Input: data, some threshold distance D
  • Clusters are represented by prototypes
  • Algorithm (see the sketch below)
  • Start with no clusters
  • For each example e
  • Find the first cluster prototype p for which dist(p, e) < D
  • If found, add e to the cluster of p; otherwise make e a new prototype
  • Very fast, but low-quality clusters
  • e.g., results depend on the order of the examples

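A minimal Python sketch of LEADER (my own illustration, not from the slides; the function names and the choice of Euclidean distance are assumptions), with instances represented as numeric tuples:

import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def leader(examples, d):
    """One pass over the data: assign each example to the first
    prototype within distance d, else make it a new prototype."""
    prototypes = []          # one prototype per cluster
    clusters = []            # clusters[i] = examples assigned to prototype i
    for e in examples:
        for i, p in enumerate(prototypes):
            if euclidean(p, e) < d:
                clusters[i].append(e)
                break
        else:                # no prototype close enough: e becomes one
            prototypes.append(e)
            clusters.append([e])
    return prototypes, clusters

Feeding the same examples in a different order can yield different prototypes, which is exactly the order dependence the slide mentions.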
9
K-means clustering
  • Input: data, number of clusters K
  • Algorithm (sketched in code below)
  • Start with K random seeds for clusters
  • Repeat until no changes
  • For all instances e
  • Add e to cluster of closest seed
  • Compute centres of all clusters
  • E.g., for numeric data compute average
  • These centres are the seeds for the next iteration

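A matching Python sketch of K-means (again my own illustration; the tuple representation and the equality-based convergence test are assumptions):

import math
import random

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(examples, k, max_iter=100):
    """Plain K-means on numeric tuples, starting from K random seeds."""
    seeds = random.sample(examples, k)
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for e in examples:                  # add e to cluster of closest seed
            i = min(range(k), key=lambda j: euclidean(seeds[j], e))
            clusters[i].append(e)
        # centres of all clusters become the seeds for the next iteration
        new_seeds = [tuple(sum(xs) / len(xs) for xs in zip(*c)) if c else seeds[i]
                     for i, c in enumerate(clusters)]
        if new_seeds == seeds:              # no changes: converged
            break
        seeds = new_seeds
    return clusters, seeds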
10
The EM algorithm
  • Expectation maximisation
  • Similar to K-means, but
  • now we assume, e.g., normal distributions
  • examples are not assigned to 1 cluster, but partially to different clusters (proportionally to the distribution); see the sketch below
  • EM is actually much more general than just clustering
  • find the number of distributions generating the data
  • building mixture models
  • See Mitchell for details

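To make the "partial assignment" idea concrete, a hedged sketch of EM for a one-dimensional mixture of Gaussians (my own illustration; the initialisation and the fixed iteration count are arbitrary choices, not part of the slides):

import math
import random

def em_gaussian_mixture(xs, k=2, iters=50):
    """Each x belongs partially to every cluster, proportionally
    to the cluster's (weighted) normal density at x."""
    mu = random.sample(xs, k)        # initial means
    sigma = [1.0] * k                # initial standard deviations
    w = [1.0 / k] * k                # mixing weights
    for _ in range(iters):
        # E-step: degree of membership of each x in each cluster
        resp = []
        for x in xs:
            dens = [w[j] * math.exp(-(x - mu[j]) ** 2 / (2 * sigma[j] ** 2))
                    / (sigma[j] * math.sqrt(2 * math.pi)) for j in range(k)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate parameters from the soft assignments
        for j in range(k):
            nj = sum(r[j] for r in resp)
            w[j] = nj / len(xs)
            mu[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            sigma[j] = math.sqrt(sum(r[j] * (x - mu[j]) ** 2
                                     for r, x in zip(resp, xs)) / nj) or 1e-6
    return w, mu, sigma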
11
Hierarchical Extensional Clustering
  • Top-down (divisive) methods
  • Start with 1 cluster (whole data set)
  • Divide it into subsets
  • Subdivide subsets further, etc.
  • Bottom-up (agglomerative) methods
  • Start with singleton clusters
  • Join closest clusters together
  • Repeat until 1 cluster

12
Example: Agglomerative Methods
  • A number of well-known agglomerative methods exist
  • Variants according to the definition of closest clusters
  • Distance criterion for examples assumed known
  • How does it generalise to a distance between clusters?
  • Options (see the code sketch below)
  • Single linkage: distance between clusters = distance between closest points
  • Complete linkage: distance between clusters = distance between furthest points
  • Average distance, ...

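The linkage options translate directly into code; a minimal sketch (function names are my own), together with the generic agglomerative loop from the previous slide:

def single_linkage(c1, c2, dist):
    """Cluster distance = distance between the closest points."""
    return min(dist(a, b) for a in c1 for b in c2)

def complete_linkage(c1, c2, dist):
    """Cluster distance = distance between the furthest points."""
    return max(dist(a, b) for a in c1 for b in c2)

def average_linkage(c1, c2, dist):
    """Cluster distance = average over all pairs."""
    return sum(dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

def agglomerate(clusters, cluster_dist, dist):
    """Join the two closest clusters until one remains;
    the list of merges encodes the resulting taxonomy."""
    merges = []
    while len(clusters) > 1:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: cluster_dist(clusters[ij[0]], clusters[ij[1]], dist))
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
    return merges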
13
Conceptual clustering
  • Previous methods just form groups of examples
  • Conceptual: return a description of the groups in some language L
  • Quality of clusters depends on properties of the cluster description as well
  • Typically, a simple cluster description in L is preferred
  • The language defines the context in which the quality of a clustering is evaluated
  • Whether 2 elements are in the same cluster may depend not only on themselves, but also on the other data

14
Example
  • How would you cluster these points?

15
TDIDT for Clustering
  • Decision tree = conceptual hierarchical clustering
  • Each node = 1 cluster
  • Tests from root to node = conjunctive description of the cluster
  • Hence, language L of cluster descriptions = conjunctions of attribute tests

16
Cobweb
  • Well-known clustering algorithm (Fisher, 1987)
  • Finds conceptual hierarchical clustering
  • Probabilistic description: probability distribution of attribute values for each cluster
  • Heuristic
  • Maximize predictiveness and predictability of attributes
  • Predictability: given the cluster, how well can you predict attributes?
  • Predictiveness: given attributes, how well can you predict the cluster?
  • Maximize both

17
Cobweb Algorithm
  • Incremental algorithm
  • For each example
  • For each level (top to bottom) of the current taxonomy
  • Change current level using one of several operators
  • Add example to cluster
  • Create new cluster (with 1 example)
  • Merge clusters
  • Split clusters
  • Move down to relevant subcluster
  • Evaluation of clustering: try to maximize a combination of predictiveness P(Ck | Ai = vij) and predictability P(Ai = vij | Ck) (Ck = cluster, Ai = attribute, vij = value); see the sketch below

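Fisher's Cobweb combines these two probabilities in the category utility measure; a sketch assuming examples are dicts of nominal attribute values (the formula is the standard one from Fisher's paper, summarised here rather than quoted from the slide):

def category_utility(clusters):
    """Average over clusters Ck of P(Ck) * (expected number of attribute
    values guessed correctly inside Ck minus the same expectation without
    clusters), normalised by the number of clusters."""
    n = sum(len(c) for c in clusters)
    all_examples = [e for c in clusters for e in c]
    attrs = {a for e in all_examples for a in e}

    def sq_sum(examples):
        # sum over attributes Ai and values vij of P(Ai = vij)^2
        s = 0.0
        for a in attrs:
            vals = [e[a] for e in examples if a in e]
            for v in set(vals):
                s += (vals.count(v) / len(examples)) ** 2
        return s

    base = sq_sum(all_examples)
    return sum(len(c) / n * (sq_sum(c) - base) for c in clusters) / len(clusters)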
18
Using clustering for prediction
  • Clustering can be used for prediction of any property
  • Once clusters are found, prediction is made in a 2-step process
  • Given known attributes of the instance, predict the cluster (OK if high predictiveness)
  • Given the cluster, predict the unknown attributes (OK if high predictability)
  • "flexible prediction": it is not known in advance what will be given and what will need to be predicted

19
  • If something is known about what will be
    predicted, clustering process can be tuned
  • Maximize predictiveness of attributes that will
    be given
  • Maximize predictability of attributes that will
    need to be predicted
  • Many learning approaches can be described in this
    "predictive clustering" framework
  • Try, e.g., decision trees, instance-based learning

20
To Remember
  • Important concepts
  • Similarity measures, distances
  • Flat vs. hierarchical clustering
  • Extensional vs. conceptual clustering
  • Clustering algorithms
  • Leader, K-means, EM
  • Single/complete linkage
  • Cobweb
  • Use of clustering for prediction

21
13 Induction of Rule Sets
  • Representing theories with decision rules
  • Induction of predictive rules
  • Sequential covering approaches
  • Induction of association rules
  • The Apriori approach
  • => Ch. 10 (partially)

22
Representing Theories with Decision Rules
  • Previous representations
  • decision trees
  • numerical representations
  • Popular representation for concept definitions: if-then rules
  • IF <conditions> THEN belongs to concept
  • How can we learn such rules?
  • Trees can be converted to rules
  • Using genetic algorithms
  • With specific rule-learning methods

23
Sequential Covering Approaches
  • Or: separate-and-conquer approach
  • General principle: learn rules 1 at a time
  • Learn 1 rule that has
  • High accuracy
  • When it predicts something, it should be correct
  • Any coverage
  • Does not have to make a prediction for all examples, just for some of them
  • Mark covered examples
  • These have been taken care of; from now on focus on the rest
  • Repeat this until all examples are covered

24
Sequential Covering
  • General algorithm for learning rule sets
  • Based on the CN2 algorithm (Clark & Niblett)

function LearnRuleSet(Target, Attrs, Examples, Threshold):
    LearnedRules := {}
    Rule := LearnOneRule(Target, Attrs, Examples)
    while performance(Rule, Examples) > Threshold, do
        LearnedRules := LearnedRules ∪ {Rule}
        Examples := Examples \ {examples classified correctly by Rule}
        Rule := LearnOneRule(Target, Attrs, Examples)
    sort LearnedRules according to performance
    return LearnedRules
25
Learning One Rule
  • To learn one rule
  • Perform greedy search
  • Could be top-down or bottom-up
  • Top-down
  • Start with maximally general rule
  • Add literals one by one
  • Bottom-up
  • Start with maximally specific rule
  • Remove literals one by one

26
Learning One Rule
function LearnOneRule(Target, Attrs, Examples):
    NewRule := IF true THEN pos
    NewRuleNeg := Neg
    while NewRuleNeg not empty, do
        /* add a new literal to the rule */
        Candidates := generate candidate literals
        BestLit := argmax over L in Candidates of performance(Specialise(NewRule, L))
        NewRule := Specialise(NewRule, BestLit)
        NewRuleNeg := {x in Neg : x covered by NewRule}
    return NewRule

function Specialise(Rule, Lit):
    let Rule = IF conditions THEN pos
    return IF conditions AND Lit THEN pos
27
Illustration
[Figure: instance space with positive (+) and negative (-) examples, illustrating how a rule is grown to cover positives]
28
Illustration
[Figure: the same instance space; the learned rule covers the positives in the region where both conditions hold]

IF A AND B THEN pos
29
Some Options
  • Options for learning 1 rule
  • Top-down or bottom-up?
  • Example-driven?
  • Hill-climbing, beam search, ...?
  • Learn rules for 1 class at a time, or for multiple classes?
  • E.g., first learn a ruleset for pos, then one for neg, vs. learning 1 set with pos and neg rules
  • Learn an ordered or unordered set of rules?
  • Ordered: the 1st rule that applies will be used
  • Allows for easy incorporation of exceptions

30
Illustration: Bottom-up vs. Top-down

Bottom-up: typically more specific rules
[Figure: instance space with + and - examples; the bottom-up rule covers a tight region around the positives, the top-down rule a broader one]
Top-down: typically more general rules
31
Heuristics
  • Heuristics
  • When is a rule good?
  • High accuracy
  • Less important: high coverage
  • Possible evaluation functions (sketched in code below)
  • Accuracy
  • A variant of accuracy: m-estimate
  • Entropy: more symmetry between pos and neg
  • Post-pruning of rules
  • Cf. what was done for decision trees

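A minimal sketch of these evaluation functions (the m-estimate's parameters m and prior are assumptions; the defaults below are common choices):

import math

def accuracy(p, n):
    """p, n = numbers of positives / negatives covered by the rule."""
    return p / (p + n)

def m_estimate(p, n, m=2, prior=0.5):
    """Accuracy smoothed towards a prior; robust when coverage is small."""
    return (p + m * prior) / (p + n + m)

def entropy(p, n):
    """Impurity of the covered examples (lower is better);
    symmetric in pos and neg, unlike accuracy."""
    if p == 0 or n == 0:
        return 0.0
    q = p / (p + n)
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)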
32
Example-driven Rule Induction
  • Example: AQ algorithms (Michalski et al.)
  • for a given class C
  • as long as there are uncovered examples for C
  • pick one such example e
  • consider He = rules that cover this example
  • search top-down in He to find the best rule
  • Much more efficient search
  • hypothesis spaces He much smaller than H (set of all rules)
  • Less robust w.r.t. noise
  • what if a noisy example is picked?

33
Discovery of Association Rules
  • An example of descriptive induction, as opposed to predictive induction
  • Predictive: learn a function that predicts for new instances the value of a certain attribute (e.g., the class)
  • Descriptive: learn patterns in the data
  • e.g., find groups of similar instances = clustering
  • e.g., find associations between attributes

34
Predictive Induction vs. Descriptive Induction


[Figure: two copies of a data set whose points are labelled a, b or c and marked + or -]

Descriptive: find associations between any properties, e.g. clusters (assoc. X-Y), or "a in top left" (assoc. X,Y -> a)
Predictive: find association between 1 specific property (+/-) and any other properties
35
  • Difference not clear-cut
  • Many views on the relationship between predictive and descriptive induction
  • For instance, discriminatory induction
  • Predictive induction: learn to discriminate + from -
  • Can be done by performing descriptive induction on separate classes
  • Descriptive: find patterns that generally hold in the whole set
  • Predictive: find patterns that hold for + and not for -

36
Association rules
  • Association rules
  • similar to decision rules: IF ... THEN ...
  • describe relationships between sets of boolean attributes
  • e.g., market basket analysis: learn which products are often bought together

IF bread AND butter THEN cheese    (confidence 50%, support 5%)

Client  cheese  bread  butter  wine  jam  ham
1       yes     yes    yes     yes   no   yes
2       yes     no     yes     no    no   no
3       no      yes    yes     no    no   yes
...     ...     ...    ...     ...   ...  ...
37
Some Characteristics
  • Association rule: IF a1, ..., an THEN an+1, ..., an+m
  • Is characterised by (see the sketch below)
  • Support: how many of all clients actually buy a1, ..., an+m?
  • if too low, the rule is not very important
  • Confidence: how many of the buyers of a1, ..., an also buy an+1, ..., an+m?
  • need not be close to 100 percent
  • even a small increase w.r.t. the normal level may be interesting

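A minimal sketch of the two measures over a set-of-sets database (the helper names are my own; the three baskets mirror the table rows above):

def support(itemset, baskets):
    """Fraction of all baskets that contain every item in `itemset`."""
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(body, head, baskets):
    """Among baskets containing `body`, the fraction also containing `head`."""
    return support(body | head, baskets) / support(body, baskets)

# Clients 1-3 from the table above, as sets of bought products:
baskets = [{"cheese", "bread", "butter", "wine", "ham"},
           {"cheese", "butter"},
           {"bread", "butter", "ham"}]
print(confidence({"bread", "butter"}, {"cheese"}, baskets))   # 0.5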
38
Searching for association rules
  • Often very large databases to be analysed
  • Efficient algorithm needed
  • Repeatedly running normal (adapted) rule
    induction algorithms is not efficient
  • Moreover, typical rule algorithms give a minimal
    set of rules that is sufficient to define a
    concept
  • But we want all rules satisfying criteria, not a
    minimal set
  • The APRIORI algorithm (Agrawal et al., 1993)
  • Parameters: min. support, min. confidence
  • Works in 2 steps
  • Step 1: find frequent sets
  • Step 2: combine frequent sets into association rules

39
A Key Observation
  • Observation
  • let freq(S) = number of examples containing S
  • consider IF a1, ..., an THEN an+1, ..., an+m
  • support = freq({a1, ..., an+m}) / freq(∅)
  • confidence = freq({a1, ..., an+m}) / freq({a1, ..., an})
  • => all association rules with sufficient confidence and support can be derived from the list of "frequent sets" + their frequencies
  • S is a "frequent set" iff freq(S) >= min_support * freq(∅)

40
Finding Frequent Sets
  • Step 1: find frequent sets
  • Observation: if {a1, ..., ai} is not frequent, then {a1, ..., ai+1} is not frequent
  • => breadth-first, general-to-specific search
  • find all frequent sets of cardinality 1
  • find all frequent sets of cardinality 2
  • {a1, a2} can be frequent only if {a1} and {a2} are both frequent
  • many pairs pruned before actually computing their frequency by looking at the data
  • others = "candidates" -> need to check frequencies
  • find all frequent sets of cardinality 3
  • {a1, a2, a3} frequent only if {a1, a2}, {a2, a3} and {a1, a3} frequent...

41
  • Example

[Figure: lattice of itemsets marked frequent, infrequent, or "not a candidate":
 {bread}, {ham}, {cheese}, {wine}, {butter}, {jam};
 {bread, butter}, {bread, jam}, {bread, cheese}, {cheese, jam}, {butter, cheese}, {butter, jam};
 {bread, butter, jam}, {bread, butter, cheese}]
42
  • Algorithm: finding frequent sets

min_freq := min_support * freq(∅)
d := 0; Q0 := {∅}            /* Qd = candidates for level d */
F := ∅                       /* F = frequent sets */
while Qd ≠ ∅ do
    for all S in Qd: find freq(S)        /* data access */
    delete those S in Qd with freq(S) < min_freq
    F := F ∪ Qd
    compute Qd+1
    d := d + 1
return F
43
  • Offline computation of new candidates
  • "offline" = without having to look at examples

compute Qd+1 from Qd and F:
    Qd+1 := ∅
    for each S in Qd:
        for each item x not in S:
            S' := S ∪ {x}
            if each subset of S' obtained by removing 1 element of S' is in F,
            then add S' to Qd+1
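Both steps fit in a short Python sketch (my own translation of the pseudocode above; the function names are assumptions, and baskets are Python sets):

from itertools import combinations

def apriori_frequent_sets(baskets, min_support):
    """Level-wise search: a candidate survives only if all of its
    subsets with one element less are already known to be frequent."""
    n = len(baskets)
    items = set().union(*baskets)
    min_freq = min_support * n
    freq = lambda s: sum(s <= b for b in baskets)     # data access
    candidates = {frozenset([x]) for x in items}
    frequent = {}                                     # frozenset -> frequency
    while candidates:
        level = {}
        for s in candidates:
            f = freq(s)
            if f >= min_freq:
                level[s] = f
        frequent.update(level)
        # offline computation of the next candidates
        candidates = {s | {x}
                      for s in level for x in items - s
                      if all(frozenset(c) in frequent
                             for c in combinations(s | {x}, len(s)))}
    return frequent

def apriori_rules(frequent, n_baskets, min_confidence):
    """Step 2: emit IF S THEN a whenever S and S ∪ {a} are frequent."""
    rules = []
    for s, f in frequent.items():
        for a in s:
            body = s - {a}
            if body in frequent and f / frequent[body] >= min_confidence:
                rules.append((set(body), a,
                              f / frequent[body],     # confidence
                              f / n_baskets))         # support
    return rules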
44
  • Step 2: infer association rules from frequent sets
  • if S ∪ {a} ∈ F and freq(S ∪ {a}) / freq(S) > min_confidence
  • then return rule "IF S THEN a"

[Figure: part of the frequent-set lattice with frequencies; freq({bread, butter}) = 40 and freq({bread, butter, cheese}) = 20, the values 50 and 45 belonging to the other two pairs {bread, cheese} and {butter, cheese}]

IF bread AND butter THEN cheese: confidence 50% (20/40), support 5%
45
Post-Processing of Rules
  • Often many association rules are found
  • How to process them, as a human?
  • sort according to some criterion
  • e.g., support or confidence
  • sometimes used: statistical significance of the deviation (see the sketch below)
  • p1 = P(cheese) = 0.6
  • p2 = P(cheese | bread, butter) = 0.7
  • the rule is more interesting if the deviation of p2 from p1 is statistically more significant
  • combines both support and confidence
  • Other post-processing methods exist
  • Might even use a query language to query the rule base
  • Find all rules for which conditions ... hold
  • Sort rules according to ...

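A hedged sketch of such a significance test using a normal approximation (the approximation and the sample size are my assumptions; the slide only names the idea):

import math

def deviation_z(p1, p2, n_cov):
    """z-score of the rule's confidence p2 deviating from the baseline
    probability p1, given that the rule body covers n_cov examples
    (normal approximation to the binomial)."""
    se = math.sqrt(p1 * (1 - p1) / n_cov)
    return (p2 - p1) / se

# P(cheese) = 0.6, P(cheese | bread, butter) = 0.7, body covering 200 baskets:
print(deviation_z(0.6, 0.7, 200))   # about 2.89: a significant deviation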
46
To Remember
  • Sequential covering approaches
  • Different variants of basic algorithm
  • Association rules
  • What they represent, difference with predictive
    rules
  • Apriori algorithm in detail

47
14 Inductive Logic Programming
  • Introduction: induction in logic
  • A note on representations
  • practical motivation for ILP
  • ILP and how it relates to other methods
  • Some fundamental notions
  • Learning methods in ILP
  • Induction of Horn clauses
  • First order association rules
  • Other methods

48
A Logical Perspective on Induction
  • Up till now, usually
  • A set of data was given
  • Some general model was to be induced
  • We can generalise this
  • Given certain knowledge
  • Can in principle be any type of statement
  • Induce more general, plausible knowledge from it
  • Need general method for representing and
    reasoning about knowledge

49
  • Logic is a suitable language
  • Often used in practice for knowledge
    representation and reasoning
  • reasoning in logic is typically deductive
    reasoning
  • Need to study inductive reasoning in logic

50
Deduction vs. induction
  • Deduction: reasoning from general to specific
  • is "always correct", truth-preserving
  • Induction: reasoning from specific to general = inverse of deduction
  • not truth-preserving
  • but there may be statistical evidence

Deduction:
    All men are mortal. Socrates is a man.
    => Socrates is mortal.

Induction:
    Socrates is mortal. Socrates is a man.
    => All men are mortal.
51
A Note on Representations
  • So logic would allow us to obtain more general
    kind of inductive reasoning...
  • But do we actually need this in practice?
  • Yes: not all problems are equally easy to describe in the previous settings
  • Let's have a look at a number of different representational settings

52
The Attribute-Value Framework
  • Up till now
  • all data in one single table
  • each example described by one vector of fixed
    attributes (one row in table)
  • induced hypothesis contains conditions that compare attributes with specific values
  • = the "standard setting", or "attribute-value setting"
  • too limited in some cases!

53
More Complicated: the Multi-Instance Setting
  • Example (Dietterich, 1996)
  • set of molecules, musk / non-musk
  • each molecule can have many different
    conformations
  • if at least one of these conformations has a
    certain property, the molecule is a musk
  • in other words: relate a property of a set to properties of its members
  • Not easy to handle in standard format

54
Multi-Instance Illustration
[Figure: a table in which one example spans several rows]
Have to relate the class of one row to values in some other row
55
Even More Complicated Settings
  • What if examples are described by sets, graphs, strings, ...?
  • Cf. molecular structures
  • Standard setting often not usable
  • Many specific algorithms devised
  • Alternative
  • use a sufficiently general representation
    mechanism
  • motivates study of induction in FOL

56
Example: Learning to Pass

[Figure: soccer situation with players numbered 4, 5 and 7]
57
Learning to pass (1)
red3 is close to blue7; red7 is close to blue4; blue7 passes to red3 => bad pass
red3 is close to blue7; red7 is close to blue4; blue7 passes to blue5 => good pass
red3 is close to blue7; red7 is close to blue4; blue7 passes to blue4 => bad pass
red3 is close to blue7; red7 is close to blue4; blue7 passes to red7 => bad pass
58
Learning to Pass: Pattern 1

red3 is close to blue7; red7 is close to blue4; blue7 passes to red3 => bad pass
red3 is close to blue7; red7 is close to blue4; blue7 passes to blue5 => good pass
red3 is close to blue7; red7 is close to blue4; blue7 passes to blue4 => bad pass
red3 is close to blue7; red7 is close to blue4; blue7 passes to red7 => bad pass

x passes to y, y is red => bad pass
59
Learning to Pass: Pattern 2

red3 is close to blue7; red7 is close to blue4; blue7 passes to red3 => bad pass
red3 is close to blue7; red7 is close to blue4; blue7 passes to blue5 => good pass
red3 is close to blue7; red7 is close to blue4; blue7 passes to blue4 => bad pass
red3 is close to blue7; red7 is close to blue4; blue7 passes to red7 => bad pass

x passes to y, y is red => bad pass
x passes to y, z is close to y, z is red => bad pass
60
Learning to pass: representation
  • Could we represent soccer situations in one table?
  • E.g., attribute "1 is close to 3", or "somebody is close to 5", ...
  • Yes, but
  • many different attributes necessary
  • attributes refer to specific player numbers
  • hypotheses not expressible at the right level of abstraction
  • no variables possible in hypotheses

61
Representations: Conclusions
  • Having a good representation is important!
  • Standard setting
  • limited expressive power
  • high efficiency
  • sufficient in many cases!
  • Other settings: only use when needed
  • when transformation into the standard setting is not feasible

62
Relating ILP to Other Learning Approaches
  • The learning methods discussed earlier learn hypotheses that can be expressed in propositional logic
  • conditions are chosen from a fixed set of propositions, e.g., "Sky = sunny", ...
  • ILP learners use the framework of first order predicate logic
  • predicates, variables, quantifiers
  • conditions are literals composed of predicate symbols, variables, constants
  • more expressive

63
What Exactly Is ILP?
  • ILP = inductive logic programming
  • logic programming: programs are sets of first order rules ("Horn clauses")
  • inductive logic programming: learn such programs from examples
  • more generally: learn logic formulae from examples
  • => study induction in the framework of first order predicate logic
  • Up till now: only propositional methods
  • hypotheses can be expressed in propositional logic

64
Logic Programming
  • Logic program: definition of predicates from existing predicates
  • practical language: Prolog

Definition: "a course is an advanced course iff it is difficult or has an advanced course as a prerequisite"

∀x (Advanced(x) <=> Difficult(x) ∨ ∃y (Prerequisite(y,x) ∧ Advanced(y)))

Representation as Horn clauses:
Advanced(x) <- Difficult(x)
Advanced(x) <- Prerequisite(y,x) ∧ Advanced(y)

Representation as Prolog program:
advanced(X) :- difficult(X).
advanced(X) :- prerequisite(Y,X), advanced(Y).
65
  • First order logic formulae can be general assertions (not necessarily definitions)
  • can represent any kind of knowledge

Assertion: "all people are male or female"

∀x (Human(x) -> Male(x) ∨ Female(x))

Using Prolog-like notation:
male(X) ; female(X) :- human(X)
66
Some terminology
  • Terms: refer to objects in the world
  • variables (X, Y, ...) or constants (a, b, 5, ...) (...)
  • Predicates: properties of / relationships between objects
  • predicate: human/1, male/1, father/2, ...
  • atom: predicate symbol + n arguments (terms)
  • e.g., human(X), father(luc, soetkin)
  • evaluate to true or false
  • literal: possibly negated atom
  • human(luc), not father(soetkin,luc), ...

67
  • Clause: disjunction of literals in which all variables are universally quantified
  • e.g., ∀x,y: father(x,y) ∨ ¬male(x) ∨ ¬parent(x,y)
  • equivalent to ∀x,y: father(x,y) <- male(x) ∧ parent(x,y)
  • Prolog notation: father(X,Y) :- male(X), parent(X,Y)
  • Horn clause: contains max. 1 positive literal
  • Variable substitution
  • changing a variable into another variable or a constant
  • e.g., θ = {X/a, Y/b}
  • application of a substitution θ to a clause c: cθ
  • e.g., father(X,Y)θ = father(a,b)

68
  • Ground formula: contains no variables
  • Fact: clause consisting of 1 atom
  • e.g., father(luc,soetkin)
  • Ground facts are very useful to represent data

69
Inductive logic programming
  • Induction of first order logic formulae from data

Prolog dataset:
male(luc). male(maarten).
female(lieve). female(soetkin).
father(luc,soetkin). father(luc,maarten).
mother(lieve,soetkin). mother(lieve,maarten).
parent(luc,soetkin). ...

Knowledge discovered:
false :- male(X), female(X).
female(X) :- mother(X,Y).
male(X) :- father(X,Y).
parent(X,Y) :- father(X,Y).
parent(X,Y) :- mother(X,Y). ...
70
  • Different settings possible
  • learn a definition for one predicate
  • = predictive induction
  • e.g., learn a definition of "father"
  • very similar to learning rule sets
  • learn general patterns
  • = descriptive induction
  • e.g., learn relationships between male, female, ...
  • false :- male(X), female(X).
  • similar to discovery of association rules

71
  • Representation of data
  • propositional: 1 example is described by listing true propositions (usually in table format)
  • size of the description is constant
  • ILP: 1 example is essentially described by a set of related facts
  • size of this set may vary

72
"Enjoy Sport" example
  • Description of one example = 1 row in the table

Day  Sky    Airtemp  Humidity  Wind    Water  Forecast  EnjoySport
1    sunny  warm     normal    strong  cool   change    yes

Possible propositional representation:
"Sky = sunny" = true, "Sky = rainy" = false, "Airtemp = warm" = true, ...

Possible first order representation:
day(1). sky(1, sunny). airtemp(1, warm). humidity(1, normal).
wind(1, strong). water(1, cool). forecast(1, change). enjoy_sport(1, yes).
73
"Bongard" example
  • Classify examples based on internal structure

[Figure: two groups of Bongard drawings, labelled neg and pos]
74
Bongard: Propositional Representation
  • How to represent a drawing in 1 table?
  • attributes for each component of the drawing
  • attributes for relationships between components
  • Several problems with this representation
  • 1. assumption of a fixed number of objects
  • if fewer objects: leave attributes blank
  • 2. very large number of attributes
  • many of which possibly blank or irrelevant
  • 3. meaning of attributes not well-defined
  • multiple representations of the same thing possible

75
  • Issues 1 and 2
  • e.g., max. 5 objects in a drawing
  • attributes: Object1, Points1, Object2, Points2, ..., Object5, Points5, Inside12, Inside13, Inside14, Inside15, Inside21, Inside23, ..., Inside54
  • number of attributes easily superlinear in the number of objects (here quadratic)
  • the more objects allowed, the more blank attributes

76
  • Issue 3: consider this example
  • Possible representations ("Inside" left out):

Fig.  Obj1      Points1  Obj2      Points2  Obj3    Points3  Class
1     Circle    -        Triangle  Down     -       -        pos
1     Triangle  Down     Circle    -        -       -        pos
1     -         -        Triangle  Down     Circle  -        pos
...   ...       ...      ...       ...      ...     ...      ...

How to represent the concept "contains triangle pointing down"?
IF Object2 = triangle AND Points2 = down THEN pos
Does not work with each valid representation!
77
  • Attribute-value table and corresponding rules provide an incorrect level of abstraction
  • Better representation
  • 1 example = multiple rows (possibly in multiple tables)
  • Learning algorithms need to be adapted!

78
Bongard: First Order Logic Representation
  • First order logic representation

Drawing 1 (any number of objects allowed):
contains(1, o1). contains(1, o2).
triangle(o1). points(o1, down). circle(o2).
pos(1).

pos(X) :- contains(X,Y), triangle(Y), points(Y, down).

(the use of variables provides the right abstraction level for the hypothesis)
79
  • Equivalent representation as relational database (cf. data mining)
  • 1 example = set of tuples instead of 1 tuple
  • ILP = mining in multiple tuples / multiple relations
  • important issue in current data mining research

[Figure: relational tables Contains, Objects and Inside holding the information about example 1]
80
Background knowledge
  • Additional advantage of first order logic: background knowledge about the domain can be expressed concisely

Without background knowledge, everything must be listed as facts:
triangle(o1). polygon(o1). square(o2). polygon(o2). circle(o3).
square(o4). polygon(o4). square(o5). polygon(o5). ...

Background knowledge:
polygon(X) :- triangle(X).
polygon(X) :- square(X).

Data about examples:
triangle(o1). square(o2). circle(o3). square(o4). square(o5). ...
81
A Real World Example
  • Find "pharmacophore" in molecules
  • identify substructure that causes it to "dock"
    on certain other molecules
  • Molecules described by listing for each atom in
    it element, 3-D coordinates, ...
  • Background defines computation of euclidean
    distance, ...

82
Background knowledge:
...
hacc(M,A) :- atm(M,A,o,2,_,_,_).
hacc(M,A) :- atm(M,A,o,3,_,_,_).
hacc(M,A) :- atm(M,A,s,2,_,_,_).
hacc(M,A) :- atm(M,A,n,ar,_,_,_).
zincsite(M,A) :- atm(M,A,du,_,_,_,_).
hdonor(M,A) :- atm(M,A,h,_,_,_,_), not(carbon_bond(M,A)), !.
...

Description of molecules:
atm(m1,a1,o,2,3.4304,-3.1160,0.0489).
atm(m1,a2,c,2,6.0334,-1.7760,0.6795).
atm(m1,a3,o,2,7.0265,-2.0425,0.0232).
...
bond(m1,a2,a3,2). bond(m1,a5,a6,1). bond(m1,a2,a4,1). bond(m1,a6,a7,du).
...

Hypothesis:
active(A) :- zincsite(A,B), hacc(A,C), hacc(A,D), hacc(A,E),
    dist(A,C,B,4.891,0.750), dist(A,C,D,3.753,0.750),
    dist(A,C,E,3.114,0.750), dist(A,D,B,8.475,0.750),
    dist(A,D,E,2.133,0.750), dist(A,E,B,7.899,0.750).
83
  • Some example molecules

[Figure: 3-D structures of example molecules]
84
ILP Related to "Explanation Based Learning"
  • EBL = explanation based learning
  • One of the first approaches to use first order logic
  • Special case of "analytical learning"
  • Idea
  • Specify a full domain theory explaining the examples
  • Only part of the theory is actually relevant for the examples
  • Find an explanation for an example = find that part of the full theory that is relevant for this example

85
  • Also called "speed-up" learning
  • Once more specific explanation are available,
    these can be used to make predictions much more
    efficiently than with full theory
  • Relation to ILP
  • Language bias in ILP is very similar to domain
    theory in EBL
  • ILP language bias can be seen as specifying a
    domain theory that is not necessarily correct
  • That's what makes it inductive
  • EBL could be simulated using ILP

86
[Figure: general theory at the top, observed data at the bottom, the hypothesis in between; deduction points downward from the theory, induction upward from the data.
 EBL: correct, but not necessarily relevant explanations.
 ILP: possible, but not necessarily correct explanations.]
87
Conclusions
  • Advantages of using first order logic
  • More complex data can be represented
  • Existing background knowledge can be represented
  • More powerful representation language for
    hypotheses
  • Hence, useful for certain kinds of learning...
  • "structural" learning examples have complex
    structure
  • "relational" learning relations between objects
    are important
  • Also related to data mining in relational databases
  • Inductive logic programming provides these

88
Fundamentals of inductive logic programming
  • Notion of generality (cf. versionspaces)
  • How to specialise conditions?
  • How to generalise conditions?
  • Main concepts
  • operators for specialisation / generalisation
  • θ-subsumption
  • inverse resolution
  • least general generalisation

89
Notion of generality
  • Remember versionspaces
  • the notion of generality was very important
  • <?,?> ≥g <warm,?> ≥g <warm,sunny>
  • In first order logic: same question...
  • when is a concept definition more general than another one?
  • the answer will allow us to implement, e.g., general-to-specific search
  • ... but more difficult to answer

90
One Option...
  • A theory G is more general than a theory S if and only if G ⊨ S
  • G ⊨ S: in every interpretation (set of facts) for which G is true, S is also true
  • "G logically implies S"
  • e.g., "all fruit tastes good" ⊨ "all apples taste good" (assuming apples are fruit)
  • Note: we are talking about general theories here, not just concepts (<-> versionspaces)
  • generality of concepts is a special case of this

91
  • Induction = inverse of deduction
  • Deductive operators "⊢" exist that implement (or approximate) ⊨
  • E.g., resolution (from logic programming)
  • So, inverting these operators should yield inductive operators
  • = basic technique in many inductive logic programming systems

92
Various frameworks for generality
  • Depending on the form of G and S
  • 1 clause / set of clauses / any first order theory
  • Depending on the choice of ⊢ to invert
  • θ-subsumption
  • resolution
  • implication
  • Some frameworks easier than others

93
Inverting resolution
  • Resolution works very well for deductive systems (e.g., Prolog)
  • Simple cases of resolution

Propositional:
p ∨ ¬q,  q ∨ r     ⊢    p ∨ r
p <- q,  q <- s    ⊢    p <- s

First order:
p(X) ∨ ¬q(X),  q(X) ∨ ¬r(X,Y)    ⊢    p(X) ∨ ¬r(X,Y)
p(a) ∨ ¬q(b),  q(X) ∨ ¬r(X,Y)    ⊢    p(a) ∨ ¬r(b,Y)      (θ = {X/b})
94
The Resolution Rule
  • General resolution rule

Two opposite literals (up to a substitution): liθ1 = ¬kjθ2

l1 ∨ ... ∨ li ∨ ... ∨ ln        k1 ∨ ... ∨ kj ∨ ... ∨ km
---------------------------------------------------------------------
(l1 ∨ ... ∨ li-1 ∨ li+1 ∨ ... ∨ ln ∨ k1 ∨ ... ∨ kj-1 ∨ kj+1 ∨ ... ∨ km)θ1θ2

e.g., p(X) :- q(X) and q(X) :- r(X,Y) yield p(X) :- r(X,Y)
      p(X) :- q(X) and q(a) yield p(a).
95
Example derivation
grandparent(X,Y) :- father(X,Z), parent(Z,Y)
father(X,Y) :- male(X), parent(X,Y)
  => grandparent(X,Y) :- male(X), parent(X,Z), parent(Z,Y)
male(jef)
  => grandparent(jef,Y) :- parent(jef,Z), parent(Z,Y)
parent(jef,an)
  => grandparent(jef,Y) :- parent(an,Y)
parent(an,paul)
  => grandparent(jef,paul)
96
Inverting Resolution
  • Inverse resolution is much more difficult than
    resolution itself
  • different operators needed (see further)
  • no unique results
  • making 2 things equal can be done in only one
    way, but making things different can be done in
    many ways!
  • hence, very large search space
  • Turned out to be impractical, unless a human
    guides the generalisation process
  • Interactive learners

97
Inverse resolution operators
  • Some operators related to inverse resolution (A, B are conjunctions of literals)
  • absorption
  • from q :- A and p :- A, B
  • infer p :- q, B
  • identification
  • from p :- q, B and p :- A, B
  • infer q :- A

[Diagram: each operator drawn as an inverted resolution step between the clauses above]
98
  • Intra-construction
  • from p :- A, B and p :- A, C
  • infer q :- B and p :- A, q and q :- C
  • Inter-construction
  • from p :- A, B and q :- A, C
  • infer p :- r, B and r :- A and q :- r, C

[Diagram: intra- and inter-construction drawn as inverted resolution steps; the invented predicates q and r appear in the new clauses]
99
Predicate invention
  • With intra- and inter-construction, new predicates are invented
  • E.g., apply intra-construction to
  • grandparent(X,Y) :- father(X,Z), father(Z,Y)
  • grandparent(X,Y) :- father(X,Z), mother(Z,Y)
  • What predicate is invented?

100
Example derivation
grandparent(X,Y) :- father(X,Z), parent(Z,Y)
father(X,Y) :- male(X), parent(X,Y)
  => grandparent(X,Y) :- male(X), parent(X,Z), parent(Z,Y)
male(jef)
  => grandparent(jef,Y) :- parent(jef,Z), parent(Z,Y)
parent(jef,an)
  => grandparent(jef,Y) :- parent(an,Y)
parent(an,paul)
  => grandparent(jef,paul)
101
We need something simpler
  • Inverse resolution
  • allows generalising sets of clauses
  • but most of the time it is too complex
  • Move towards generalisation of single clauses
  • popular operators will be based on θ-subsumption

102
Theta-subsumption
  • A clause c1 theta-subsumes another clause c2 if there exists a variable substitution θ such that c1θ becomes a subset of c2
  • c1 ≤θ c2  <=>  ∃θ: c1θ ⊆ c2
  • to check this, first write the clauses as disjunctions
  • a,b,c :- d,e,f   <=>   a ∨ b ∨ c ∨ ¬d ∨ ¬e ∨ ¬f
  • then try to replace variables by constants or other variables
  • Intuition: c2 is a special case of c1 (see the sketch below)

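A small Python sketch of the θ-subsumption test by backtracking (the encoding is my own: a clause is a list of (predicate, args) literals, body literals carry a "~" prefix, and arguments starting with an upper-case letter are variables):

def theta_subsumes(c1, c2):
    """True iff some substitution theta makes every literal of
    c1*theta appear in c2 (i.e., c1 theta-subsumes c2)."""
    def is_var(t):
        return t[:1].isupper()

    def match(lits, theta):
        if not lits:
            return True                         # all literals of c1 placed
        (p, args), rest = lits[0], lits[1:]
        for p2, args2 in c2:                    # try every literal of c2
            if p2 != p or len(args2) != len(args):
                continue
            t, ok = dict(theta), True
            for a, b in zip(args, args2):
                if is_var(a):
                    if t.setdefault(a, b) != b: # conflicting binding
                        ok = False
                        break
                elif a != b:                    # constant mismatch
                    ok = False
                    break
            if ok and match(rest, t):
                return True
        return False                            # backtrack

    return match(list(c1), {})

# father(X,Y) :- parent(X,Y)  vs  father(X,Y) :- parent(X,Y), male(X):
c1 = [("father", ("X", "Y")), ("~parent", ("X", "Y"))]
c2 = [("father", ("X", "Y")), ("~parent", ("X", "Y")), ("~male", ("X",))]
print(theta_subsumes(c1, c2))   # True
print(theta_subsumes(c2, c1))   # False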
103
Theta-subsumption examples
  • c1 = father(X,Y) :- parent(X,Y)
  • c2 = father(X,Y) :- parent(X,Y), male(X)
  • for the empty substitution θ = {}: c1θ ⊆ c2  =>  c1 θ-subsumes c2
  • c3 = father(luc,Y) :- parent(luc,Y)
  • for θ = {X/luc}: c1θ ⊆ c3  =>  c1 θ-subsumes c3
  • c2 and c3 do not θ-subsume one another

104
  • Try yourself
  • c1 = p(X,Y) :- q(X,Y)
  • c2 = p(X,Y) :- q(X,Y), q(Y,X)
  • c3 = p(Z,Z) :- q(Z,Z)
  • c4 = p(a,a) :- q(a,a)
  • Which clauses are θ-subsumed by which?

105
Searching for clauses
  • Structure the hypothesis space (space of clauses) according to θ-subsumption
  • Operators for moving in this space
  • minimal specialisation operator
  • from a clause c, derive d such that c ≤θ d and ¬∃e: c ≤θ e ≤θ d
  • minimal generalisation operator
  • usually starts from 2 clauses
  • from c and d, find e such that e ≤θ c, e ≤θ d and ¬∃f: e ≤θ f and f ≤θ c and f ≤θ d

106
  • Note the similarity with propositional refinement
  • IF Sky=sunny THEN EnjoySport=yes
  • To specialise: add 1 condition
  • IF Sky=sunny AND Humidity=low THEN EnjoySport=yes
  • In first order logic
  • c1 = father(X,Y) :- parent(X,Y)
  • To specialize: find clauses θ-subsumed by c1
  • father(X,Y) :- parent(X,Y), male(X)
  • father(luc,X) :- parent(luc,X)
  • father(X,X) :- parent(X,X)
  • ...

107
  • Properties of θ-subsumption
  • Sound
  • if c1 θ-subsumes c2 then c1 ⊨ c2
  • Incomplete
  • possibly c1 ⊨ c2 without c1 θ-subsuming c2 (but only for recursive clauses)
  • c1 = p(f(X)) :- p(X)
  • c2 = p(f(f(X))) :- p(X)
  • Checking whether c1 θ-subsumes c2 is decidable but NP-complete
  • Transitive and reflexive, not anti-symmetric
  • = "semi-order" relation

108
  • A semi-order generates equivalence classes + a partial order on those equivalence classes
  • equivalence class: c1 ~ c2 iff c1 ≤θ c2 and c2 ≤θ c1
  • c1 and c2 are then called syntactic variants
  • c1 is the reduced clause of c2 iff c1 contains a minimal subset of the literals of c2 that is still equivalent with c2
  • if c1 and c2 are in different equivalence classes, either c1 ≤θ c2 or c2 ≤θ c1 or neither => antisymmetry => partial order

109
[Figure: lattice of equivalence classes of clauses, each class a chain of syntactic variants with its reduced clause first:
  p(X,Y) :- m(X,Y)  ~  p(X,Y) :- m(X,Y), m(X,Z)  ~  p(X,Y) :- m(X,Y), m(X,Z), m(X,U)  ~  ...
  p(X,Y) :- m(X,Y), r(X)  ~  p(X,Y) :- m(X,Y), m(X,Z), r(X)  ~  ...
  p(X,Y) :- m(X,Y), s(X)  ~  p(X,Y) :- m(X,Y), m(X,Z), s(X)  ~  ...
  p(X,Y) :- m(X,Y), s(X), r(X)  ~  p(X,Y) :- m(X,Y), m(X,Z), s(X), r(X)  ~  ...
 the first class is the lgg of the r- and s-classes; the last is their glb]
110
  • Since the equivalence classes are partially ordered, they form a lattice
  • the least upper bound / greatest lower bound of two clauses always exists
  • Infinite chains c1 ≤θ c2 ≤θ c3 ≤θ ... ≤θ c exist
  • h(X,Y) :- p(X,Y)
  • h(X,Y) :- p(X,Y), p(Y,Y2)
  • h(X,Y) :- p(X,Y), p(Y,Y2), p(Y2,Y3)
  • ...
  • h(X,X) :- p(X,X)

111
Specialisation operators
  • How to traverse the hypothesis space so that
  • no hypotheses are generated more than once
  • no hypotheses are skipped
  • Shapiro: general-to-specific traversal using a refinement operator ρ
  • ρ(c) yields the set of refinements of c
  • ρ(c) = { c' | c' is a maximally general specialisation of c }
  • ρ(c) = { c ∨ l | l is a literal } ∪ { cθ | θ is a substitution }

112
[Figure: part of the refinement graph below daughter(X,Y):
 daughter(X,Y)
   daughter(X,X)
   daughter(X,Y) :- parent(X,Z)
   daughter(X,Y) :- parent(Y,X)
   daughter(X,Y) :- female(X)
     daughter(X,Y) :- female(X), female(Y)
     daughter(X,Y) :- female(X), parent(X,Y)
 ...]
113
A generalisation operator
  • Start from 2 clauses and compute their least general generalisation (lgg)
  • i.e., given 2 clauses, return the most specific single clause that is more general than both of them
  • Definition of the lgg of terms
  • (let si, ti denote any term, V a variable)
  • lgg(f(s1,...,sn), f(t1,...,tn)) = f(lgg(s1,t1), ..., lgg(sn,tn))
  • lgg(f(s1,...,sn), g(t1,...,tn)) = V

114
  • lgg of literals (see the sketch below)
  • lgg(p(s1,...,sn), p(t1,...,tn)) = p(lgg(s1,t1), ..., lgg(sn,tn))
  • lgg(¬p(...), ¬p(...)) = ¬lgg(p(...), p(...))
  • lgg(p(s1,...,sn), q(t1,...,tn)) is undefined
  • lgg(p(...), ¬p(...)) and lgg(¬p(...), p(...)) are undefined
  • lgg of clauses
  • lgg(c1,c2) = { lgg(l1,l2) | l1 ∈ c1, l2 ∈ c2 and lgg(l1,l2) defined }

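The definitions translate almost literally into code; a sketch reusing the literal encoding from the θ-subsumption sketch (strings for constants and variables, (functor, args) tuples for compound terms; all names are my own):

def lgg_terms(s, t, vmap):
    """lgg of two terms; the same pair of distinct terms is always
    replaced by the same fresh variable (recorded in vmap)."""
    if s == t:
        return s
    if (isinstance(s, tuple) and isinstance(t, tuple)
            and s[0] == t[0] and len(s[1]) == len(t[1])):
        return (s[0], tuple(lgg_terms(a, b, vmap) for a, b in zip(s[1], t[1])))
    return vmap.setdefault((s, t), "V%d" % len(vmap))

def lgg_literals(l1, l2, vmap):
    """lgg of two (predicate, args) literals; None when undefined
    (different predicate symbol, arity or sign)."""
    (p1, a1), (p2, a2) = l1, l2
    if p1 != p2 or len(a1) != len(a2):
        return None
    return (p1, tuple(lgg_terms(s, t, vmap) for s, t in zip(a1, a2)))

def lgg_clauses(c1, c2):
    """All pairwise defined lggs, with one shared variable map."""
    vmap, out = {}, set()
    for l1 in c1:
        for l2 in c2:
            l = lgg_literals(l1, l2, vmap)
            if l is not None:
                out.add(l)
    return out

# Slide example: f(t,a) :- p(t,a), m(t), f(a)  and  f(j,p) :- p(j,p), m(j), m(p)
c1 = [("f", ("t", "a")), ("~p", ("t", "a")), ("~m", ("t",)), ("~f", ("a",))]
c2 = [("f", ("j", "p")), ("~p", ("j", "p")), ("~m", ("j",)), ("~m", ("p",))]
print(lgg_clauses(c1, c2))
# four literals: f(V0,V1), ~p(V0,V1), ~m(V0), ~m(V2)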
115
Applying lgg
  • Example
  • f(t,a) :- p(t,a), m(t), f(a)
  • f(j,p) :- p(j,p), m(j), m(p)
  • lgg: f(X,Y) :- p(X,Y), m(X), m(Z)
  • Relative lgg (rlgg) (Plotkin, 1971)
  • relative to a "background theory" B (assume B is a set of facts)
  • rlgg(e1,e2) = lgg(e1 :- B, e2 :- B)
116
  • Example: Bongard problems

Facts for drawings 1 and 2:
pos(1).             pos(2).
contains(1,o1).     contains(2,o3).
contains(1,o2).
triangle(o1).       triangle(o3).
points(o1,down).    points(o3,down).
circle(o2).

lgg( (pos(1) :- contains(1,o1), contains(1,o2), triangle(o1), points(o1,down), circle(o2)),
     (pos(2) :- contains(2,o3), triangle(o3), points(o3,down)) )
= pos(X) :- contains(X,Y), triangle(Y), points(Y,down)
117
Conclusions
  • We now have basic operators
  • θ-subsumption-based, at the single clause level
  • specialization operator ρ
  • generalization operator lgg
  • inverse resolution: generalizes a theory (set of clauses)
  • These can be used to build ILP systems
  • Top-down: using specialization operators
  • Bottom-up: using generalization operators

118
Rule induction
  • Most inductive logic programming systems induce concept definitions in the form of a rule set
  • Algorithms similar to propositional algorithms
  • FOIL -> CN2
  • Progol -> AQ
  • (first order counterparts of the propositional learners CN2 and AQ)

119
FOIL (Quinlan)
  • Learns a single concept, e.g., p(X,Y) :- ...
  • To learn one clause (hill-climbing search; the gain heuristic is sketched below)
  • start with the most general clause p(X,Y) :- true
  • repeat
  • add the best literal to the clause (i.e., the literal that most improves the quality of the clause)
  • a new literal can also be a unification X = c or X = Y
  • = applying a refinement operator under θ-subsumption
  • until no further improvement

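FOIL's usual criterion for "best literal" is a weighted information gain; a sketch of it (the formula is the standard FOIL measure, stated here as an assumption since the slides leave the heuristic unspecified):

import math

def foil_gain(p0, n0, p1, n1):
    """Gain of specialising a clause: (p0, n0) = positives/negatives
    covered before adding the literal, (p1, n1) = after."""
    if p1 == 0:
        return 0.0
    before = math.log2(p0 / (p0 + n0))
    after = math.log2(p1 / (p1 + n1))
    return p1 * (after - before)

# father(X,Y) :- true covers 2+/3-; adding male(X) leaves 2+/1-:
print(foil_gain(2, 3, 2, 1))   # about 1.47 bits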
120
FOIL Example: Learning One Clause

+ father(homer,bart).        + father(bill,chelsea).
- father(marge,bart).        - father(hillary,chelsea).        - father(bart,chelsea).
parent(homer,bart). parent(marge,bart). parent(bill,chelsea). parent(hillary,chelsea).
male(homer). male(bart). male(bill). female(chelsea). female(marge).
121
(same examples and background facts as above)

Candidate refinements of father(X,Y) :- true:
father(X,Y) :- parent(X,Y).
father(X,Y) :- parent(Y,X).
father(X,Y) :- male(X).
father(X,Y) :- male(Y).
father(X,Y) :- female(X).
father(X,Y) :- female(Y).

(father(X,Y) :- parent(X,Y) covers 2+, 2-)
122
(same examples and background facts as above)

Candidate refinements of father(X,Y) :- true:
father(X,Y) :- parent(X,Y).
father(X,Y) :- parent(Y,X).
father(X,Y) :- male(X).
father(X,Y) :- male(Y).
father(X,Y) :- female(X).
father(X,Y) :- female(Y).

(father(X,Y) :- male(X) covers 2+, 1-: the best candidate)
123
(same examples and background facts as above)

Candidate refinements of father(X,Y) :- male(X):
father(X,Y) :- male(X), parent(X,Y).
father(X,Y) :- male(X), parent(Y,X).
father(X,Y) :- male(X), male(Y).
father(X,Y) :- male(X), female(X).
father(X,Y) :- male(X), female(Y).

(father(X,Y) :- male(X), parent(X,Y) covers 2+, 0-)
124
FOIL: Learning Multiple Clauses
  • To learn multiple clauses
  • repeat
  • learn a single clause c (see previous algorithm)
  • add c to h
  • mark positive examples covered by c as covered
  • until
  • all positive examples marked covered
  • or no more good clauses found

125
likes(garfield, lasagna).    likes(garfield, birds).    likes(garfield, meat).
likes(garfield, jon).        likes(garfield, odie).
edible(lasagna). edible(birds).
subject_to_cruelty(odie). subject_to_cruelty(jon). subject_to_cruelty(birds).

likes(garfield, X) :- edible(X).        (covers 3+, 0-)
126
likes(garfield, lasagna).    likes(garfield, birds).    likes(garfield, meat).
likes(garfield, jon).        likes(garfield, odie).
(italics = previously covered)

likes(garfield, X) :- edible(X).
likes(garfield, X) :- subject_to_cruelty(X).        (covers 2+, 0-)
127
Some pitfalls
  • Avoiding infinite recursion
  • when recursive clauses are allowed, e.g., ancestor(X,Y) :- parent(X,Z), ancestor(Z,Y)
  • avoid learning parent(X,Y) :- parent(X,Y)
  • won't be useful, even though it's 100% correct
  • Bonus for introduction of new variables
  • a literal may not yield any direct gain, but may introduce variables that are useful later

p(X) :- q(X)              covers p positives, n negatives
refine by adding age:
p(X) :- q(X), age(X,Y)    covers p positives, n negatives  ->  no gain
128
Golem (Muggleton & Feng)
  • Based on the rlgg operator
  • To build one clause
  • take 2 positive examples, find their rlgg, generalize using yet another example, until no improvement in the quality of the clause
  • = bottom-up search
  • Result very dependent on the choice of examples
  • e.g., what if the true theory is { p(X) :- q(X), p(X) :- r(X) }?

129
  • Try this for different couples, pick the best clause found
  • this reduces the dependency on the choice of the couple (if 1 of them is noisy, no good clause is found)
  • Remove covered positive examples, restart the process
  • Repeat until no more good clauses are found

130
Progol (Muggleton)
  • Top-down approach, but with a seed
  • To find one clause
  • start with 1 positive example e
  • generate a hypothesis space He that contains only hypotheses covering at least this one example
  • first generate the most specific clause c that covers e
  • He contains every clause more general than c
  • perform exhaustive top-down search in He, looking for the clause that maximizes compaction
  • Compaction = size(covered examples) - size(clause)

131
  • Repeat the process of finding one clause until no more good (= compaction-causing) clauses are found
  • The compaction heuristic in principle allows no coverage of negatives
  • can be relaxed (accommodating noise)

132
Generation of bottom clause
  • Language bias = set of all acceptable clauses (chosen by the user)
  • = specification of H (at the level of single clauses)
  • Bottom clause ⊥ for example e = the most specific clause in the language bias covering e
  • Constructed using inverse entailment

133
  • Construction of ⊥
  • if B ∧ H ⊨ e, then B ∧ ¬e ⊨ ¬H
  • if H is a clause, ¬H is a conjunction of ground (skolemized) literals
  • compute ¬⊥ = all ground literals entailed by B ∧ ¬e
  • ¬H must be a subset of these
  • so B ∧ ¬e ⊨ ¬⊥ ⊨ ¬H
  • hence H ⊨ ⊥
134
  • Some examples (Muggleton, 1995, New Generation Computing)

B:  anim(X) :- pet(X).  pet(X) :- dog(X).
e:  nice(X) :- dog(X).
⊥:  nice(X) :- dog(X), pet(X), anim(X).

B:  hasbeak(X) :- bird(X).  bird(X) :- vulture(X).
e:  hasbeak(tweety).
⊥:  hasbeak(tweety) ∨ bird(tweety) ∨ vulture(tweety).
135
  • Example of (part of) a Progol run
  • learn to classify animals as mammals, reptiles, ...

- generalise(class/2)?
Generalising class(dog,mammal).
Most specific clause is:
class(A,mammal) :- has_milk(A), has_covering(A,hair
...