Title: CIS 732, Lecture 33 (2007-04-11)
Slide 1: Lecture 33 of 42
Intro to Rule Learning
Wednesday, 11 April 2007
William H. Hsu
Department of Computing and Information Sciences, KSU
http://www.cis.ksu.edu/~bhsu
Readings: Chapter 10, Mitchell
Slide 2: Lecture Outline
- Readings: Sections 10.6-10.8, Mitchell; Section 21.4, Russell and Norvig
- Suggested Exercises: 10.5, Mitchell
- Induction as Inverse of Deduction
- Problem of inductive learning revisited
- Operators for automated deductive inference
- Resolution rule for deduction
- First-order predicate calculus (FOPC) and resolution theorem proving
- Inverting resolution
- Propositional case
- First-order case
- Inductive Logic Programming (ILP)
- Cigol
- Progol
Slide 3: Induction as Inverted Deduction: Design Principles
Slide 4: Induction as Inverted Deduction: Example
- Deductive Query
- Pairs ⟨u, v⟩ of people such that u is a child of v
- Relations (predicates)
- Child (target predicate)
- Father, Mother, Parent, Male, Female
- Learning Problem
- Formulation
- Concept learning: target function f is Boolean-valued, i.e., a target predicate
- Components
- Target function: f(xi) = Child (Bob, Sharon)
- xi: Male (Bob), Female (Sharon), Father (Sharon, Bob)
- B: Parent (x, y) ← Father (x, y).  Parent (x, y) ← Mother (x, y).
- What satisfies ∀ ⟨xi, f(xi)⟩ ∈ D . (B ∧ h ∧ xi) ⊢ f(xi)?
- h1: Child (u, v) ← Father (v, u).  (doesn't use B)
- h2: Child (u, v) ← Parent (v, u).  (uses B)
Slide 5: Perspectives on Learning and Inference
- Jevons (1874)
- First published insight that induction can be interpreted as inverted deduction
- "Induction is, in fact, the inverse operation of deduction, and cannot be conceived to exist without the corresponding operation, so that the question of relative importance cannot arise. Who thinks of asking whether addition or subtraction is the more important process in arithmetic? But at the same time much difference in difficulty may exist between a direct and inverse operation; it must be allowed that inductive investigations are of a far higher degree of difficulty and complexity than any questions of deduction."
- Aristotle (circa 330 B.C.)
- Early views on learning from observations (examples) and the interplay between induction and deduction
- "Scientific knowledge through demonstration [i.e., deduction] is impossible unless a man knows the primary immediate premises; we must get to know the primary premises by induction; for the method by which even sense-perception implants the universal is inductive."
Slide 6: Induction as Inverted Deduction: Operators
- Deductive Operators
- Have mechanical operators F for finding logically entailed conclusions C
- F(A, B) = C, where A ∧ B ⊢ C
- A, B, C: logical formulas
- F: deduction algorithm
- Intuitive idea: apply deductive inference (aka sequent) rules to A, B to generate C
- Inductive Operators
- Need operators O to find inductively inferred hypotheses (h, primary premises)
- O(B, D) = h, where ∀ ⟨xi, f(xi)⟩ ∈ D . (B ∧ h ∧ xi) ⊢ f(xi)
- B, D, h: logical formulas (D describes the observations)
- O: induction algorithm
Slide 7: Induction as Inverted Deduction: Advantages and Disadvantages
- Advantages (Pros)
- Subsumes earlier idea of finding h that fits the training data
- Domain theory B helps define the meaning of "fitting the data": (B ∧ h ∧ xi) ⊢ f(xi)
- Suggests algorithms that search H guided by B
- Theory-guided constructive induction [Donoho and Rendell, 1995]
- aka knowledge-guided constructive induction [Donoho, 1996]
- Disadvantages (Cons)
- Doesn't allow for noisy data
- Q: Why not?
- A: Consider what ∀ ⟨xi, f(xi)⟩ ∈ D . (B ∧ h ∧ xi) ⊢ f(xi) stipulates
- First-order logic gives a huge hypothesis space H
- Overfitting
- Intractability of calculating all acceptable hypotheses h
Slide 8: Deduction: Resolution Rule
Slide 9: Inverting Resolution: Example
Resolution (given C1 and C2, derive C):
  C1: Pass-Exam ∨ ¬Know-Material
  C2: Know-Material ∨ ¬Study
  C:  Pass-Exam ∨ ¬Study
Inverse Resolution (given C and C1, recover C2):
  C1: Pass-Exam ∨ ¬Know-Material
  C:  Pass-Exam ∨ ¬Study
  C2: Know-Material ∨ ¬Study
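Both directions of this example can be sketched in a few lines. This is an illustrative propositional implementation, not the slides' code: clauses are sets of string literals, with "~" marking negation.

```python
# Sketch of the propositional resolution operator and one inversion of it,
# using the slide's clauses.

def negate(lit):
    """Flip the negation marker on a literal."""
    return lit[1:] if lit.startswith("~") else "~" + lit

def resolve(c1, c2):
    """Return all resolvents of two clauses (sets of literals)."""
    out = []
    for lit in c1:
        if negate(lit) in c2:
            out.append((c1 - {lit}) | (c2 - {negate(lit)}))
    return out

C1 = {"Pass-Exam", "~Know-Material"}
C2 = {"Know-Material", "~Study"}
C  = {"Pass-Exam", "~Study"}

print(resolve(C1, C2) == [C])   # True: deduction recovers the resolvent

# Inverse resolution: given resolvent C and parent C1, one consistent choice
# for the other parent resolves on Know-Material.
C2_guess = (C - (C1 - {"~Know-Material"})) | {"Know-Material"}
print(C2_guess == C2)           # True
```

The inverse step is not unique in general (C2 could contain extra literals of C1), which is one source of the combinatorial explosion noted on the Progol slide.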
Slide 10: Inverted Resolution: Propositional Logic
Slide 11: Quick Review: First-Order Predicate Calculus (FOPC)
- Components of FOPC Formulas: Quick Intro to Terminology
- Constants: e.g., John, Kansas, 42
- Variables: e.g., Name, State, x
- Predicates: e.g., Father-Of, Greater-Than
- Functions: e.g., age, cosine
- Term: constant, variable, or function(term)
- Literals (atoms): Predicate(term) or its negation (e.g., ¬Greater-Than (age (John), 42))
- Clause: disjunction of literals with implicit universal quantification
- Horn clause: at most one positive literal (H ∨ ¬L1 ∨ ¬L2 ∨ … ∨ ¬Ln)
- FOPC: Representation Language for First-Order Resolution
- aka First-Order Logic (FOL)
- Applications
- Resolution using Horn clauses: logic programming (Prolog)
- Automated deduction (deductive inference), theorem proving
- Goal: learn first-order rules by inverting first-order resolution
Slide 12: First-Order Resolution
Slide 13: Inverted Resolution: First-Order Logic
Slide 14: Inverse Resolution Algorithm (Cigol): Example
Father (Tom, Bob)
Father (Shannon, Tom)
GrandChild (Bob, Shannon)
Slide 15: Progol
- Problem: Searching the Resolution Space Results in Combinatorial Explosion
- Solution Approach
- Reduce explosion by generating the most specific acceptable h
- Conduct general-to-specific search (cf. Find-G, CN2, Learn-One-Rule)
- Procedure
- 1. User specifies H by stating predicates, functions, and forms of arguments allowed for each
- 2. Progol uses a sequential covering algorithm
- FOR each ⟨xi, f(xi)⟩ DO
- Find the most specific hypothesis hi such that (B ∧ hi ∧ xi) ⊢ f(xi)
- Actually, considers only entailment within k steps
- 3. Conduct general-to-specific search bounded by the specific hypothesis hi, choosing the hypothesis with minimum description length
Slide 16: Learning First-Order Rules: Numerical versus Symbolic Approaches
- Numerical Approaches
- Method 1: learning classifiers and extracting rules
- Simultaneous covering: decision trees, ANNs
- NB: extraction methods may not be simple enumeration of the model
- Method 2: learning rules directly using numerical criteria
- Sequential covering algorithms and search
- Criteria: MDL (information gain), accuracy, m-estimate, other heuristic evaluation functions
- Symbolic Approaches
- Invert forward inference (deduction) operators
- Resolution rule
- Propositional and first-order variants
- Issues
- Need to control search
- Ability to tolerate noise (contradictions): paraconsistent reasoning
Slide 17: Learning Disjunctive Sets of Rules
- Method 1: Rule Extraction from Trees
- Learn decision tree
- Convert to rules
- One rule per root-to-leaf path
- Recall: can post-prune rules (drop preconditions to improve validation set accuracy)
- Method 2: Sequential Covering
- Idea: greedily (sequentially) find rules that apply to (cover) instances in D
- Algorithm
- Learn one rule with high accuracy, any coverage
- Remove positive examples (of target attribute) covered by this rule
- Repeat
Slide 18: Sequential Covering: Algorithm
- Algorithm Sequential-Covering (Target-Attribute, Attributes, D, Threshold)
- Learned-Rules ← {}
- New-Rule ← Learn-One-Rule (Target-Attribute, Attributes, D)
- WHILE Performance (New-Rule, D) > Threshold DO
- Learned-Rules.Add-Rule (New-Rule) // add new rule to set
- D.Remove-Covered-By (New-Rule) // remove examples covered by New-Rule
- New-Rule ← Learn-One-Rule (Target-Attribute, Attributes, D)
- Sort-By-Performance (Learned-Rules, Target-Attribute, D)
- RETURN Learned-Rules
- What Does Sequential-Covering Do?
- Learns one rule, New-Rule
- Takes out every example in D to which New-Rule applies (every covered example)
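The algorithm above can be sketched as follows. This is a minimal illustrative implementation over attribute-value data, not the slides' code; the inner rule learner is a stub that picks the single best (attribute, value) test by sample accuracy, whereas the full Learn-One-Rule specializes further.

```python
# Sketch of Sequential-Covering: learn a rule, remove the positives it
# covers, repeat while performance exceeds a threshold.
# A rule is a tuple of (attribute, value) preconditions; labels are 0/1.

def covers(rule, x):
    return all(x[a] == v for a, v in rule)

def accuracy(rule, data):
    covered = [y for x, y in data if covers(rule, x)]
    return sum(covered) / len(covered) if covered else 0.0

def learn_one_rule(attributes, data):
    """Stub: best single-test rule by sample accuracy."""
    candidates = [((a, v),) for a in attributes
                  for v in sorted({x[a] for x, _ in data})]
    return max(candidates, key=lambda r: accuracy(r, data))

def sequential_covering(attributes, data, threshold=0.9):
    rules, remaining = [], list(data)
    while remaining:
        rule = learn_one_rule(attributes, remaining)
        if accuracy(rule, remaining) < threshold:
            break
        rules.append(rule)
        # remove the positive examples this rule covers
        remaining = [(x, y) for x, y in remaining
                     if not (covers(rule, x) and y == 1)]
    return rules

data = [({"Outlook": "Sunny", "Wind": "Weak"},   1),
        ({"Outlook": "Sunny", "Wind": "Strong"}, 1),
        ({"Outlook": "Rain",  "Wind": "Strong"}, 0)]
print(sequential_covering(["Outlook", "Wind"], data))
# [(('Outlook', 'Sunny'),)]
```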
Slide 19: Learn-One-Rule: (Beam) Search for Preconditions
IF {} THEN Play-Tennis = Yes
Slide 20: Learn-One-Rule: Algorithm
- Algorithm Sequential-Covering (Target-Attribute, Attributes, D)
- Pos ← D.Positive-Examples()
- Neg ← D.Negative-Examples()
- WHILE NOT Pos.Empty() DO // learn new rule
- New-Rule ← Learn-One-Rule (Target-Attribute, Attributes, D)
- Learned-Rules.Add-Rule (New-Rule)
- Pos.Remove-Covered-By (New-Rule)
- RETURN (Learned-Rules)
- Algorithm Learn-One-Rule (Target-Attribute, Attributes, D)
- New-Rule ← most general rule possible
- New-Rule-Neg ← Neg
- WHILE NOT New-Rule-Neg.Empty() DO // specialize New-Rule
- 1. Candidate-Literals ← Generate-Candidates() // NB: rank by Performance()
- 2. Best-Literal ← argmax over L ∈ Candidate-Literals of Performance (Specialize-Rule (New-Rule, L), Target-Attribute, D) // all possible new constraints
- 3. New-Rule.Add-Precondition (Best-Literal) // add the best one
- 4. New-Rule-Neg ← New-Rule-Neg.Filter-By (New-Rule)
- RETURN (New-Rule)
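The Learn-One-Rule loop above (greedy general-to-specific specialization) can be sketched as follows. The data format and Performance choice (sample accuracy) are illustrative, and the sketch assumes some candidate literal always removes at least one covered negative.

```python
# Sketch of Learn-One-Rule: start with the most general rule (no
# preconditions) and greedily add the best precondition until no negative
# example remains covered.

def covers(rule, x):
    return all(x[a] == v for a, v in rule)

def performance(rule, data):
    """Sample accuracy of the rule on the examples it covers."""
    covered = [y for x, y in data if covers(rule, x)]
    return sum(covered) / len(covered) if covered else 0.0

def learn_one_rule(attributes, data):
    rule = ()                                   # most general rule possible
    neg = [(x, y) for x, y in data if y == 0]   # New-Rule-Neg
    while any(covers(rule, x) for x, _ in neg):
        # candidate literals: attribute-value tests seen in the data
        literals = [(a, x[a]) for a in attributes for x, _ in data
                    if (a, x[a]) not in rule]
        best = max(literals, key=lambda l: performance(rule + (l,), data))
        rule = rule + (best,)                   # add the best precondition
        neg = [(x, y) for x, y in neg if covers(rule, x)]
    return rule

data = [({"Outlook": "Sunny"}, 1), ({"Outlook": "Rain"}, 0)]
print(learn_one_rule(["Outlook"], data))  # (('Outlook', 'Sunny'),)
```

A beam-search variant (slide 21) would keep the k best partial rules instead of only the single best one at each step.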
Slide 21: Learn-One-Rule: Subtle Issues
- How Does Learn-One-Rule Implement Search?
- Effective approach: Learn-One-Rule organizes H in the same general fashion as ID3
- Difference
- Follows only the most promising branch in the tree at each step
- Only one attribute-value pair (versus splitting on all possible values)
- General-to-specific search (depicted in figure)
- Problem: greedy depth-first search is susceptible to local optima
- Solution approach: beam search (rank by performance, always expand k best)
- Easily generalizes to multi-valued target functions (how?)
- Designing an Evaluation Function to Guide Search
- Performance (Rule, Target-Attribute, D)
- Possible choices
- Entropy (i.e., information gain), as for ID3
- Sample accuracy: nc / n (correct rule predictions / total predictions)
- m-estimate: (nc + mp) / (n + m), where m ≡ weight and p ≡ prior of rule RHS
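The three candidate evaluation functions can be written out directly. A minimal sketch; nc, n, p, and m follow the slide's notation, and the labels passed to the entropy function are those of the examples the rule covers.

```python
# The three Performance choices from the slide: entropy of covered examples,
# sample accuracy, and the m-estimate of accuracy.

from math import log2

def entropy(covered_labels):
    """Binary entropy of the labels covered by a rule."""
    n = len(covered_labels)
    if n == 0:
        return 0.0
    p = sum(covered_labels) / n
    if p in (0.0, 1.0):
        return 0.0
    return -(p * log2(p) + (1 - p) * log2(1 - p))

def sample_accuracy(nc, n):
    # nc: correct rule predictions; n: total predictions
    return nc / n

def m_estimate(nc, n, p, m=2):
    # p: prior of the rule RHS; m: weight (equivalent sample size)
    return (nc + m * p) / (n + m)

print(sample_accuracy(4, 5))            # 0.8
print(round(m_estimate(4, 5, p=0.5), 3))  # (4 + 1) / 7 -> 0.714
print(round(entropy([1, 1, 1, 0]), 3))  # 0.811
```

The m-estimate shrinks the raw accuracy toward the prior p, which matters for rules that cover few examples.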
Slide 22: Variants of Rule Learning Programs
- Sequential or Simultaneous Covering of Data?
- Sequential: isolate components of hypothesis (e.g., search for one rule at a time)
- Simultaneous: whole hypothesis at once (e.g., search for whole tree at a time)
- General-to-Specific or Specific-to-General?
- General-to-specific: add preconditions; Find-G
- Specific-to-general: drop preconditions; Find-S
- Generate-and-Test or Example-Driven?
- Generate-and-test: search through syntactically legal hypotheses
- Example-driven: Find-S, Candidate-Elimination, Cigol (next time)
- Post-Pruning of Rules?
- Recall (Lecture 5): very popular overfitting recovery method
- What Statistical Evaluation Method?
- Entropy
- Sample accuracy (aka relative frequency)
- m-estimate of accuracy
Slide 23: First-Order Rules
- What Are First-Order Rules?
- Well-formed formulas (WFFs) of first-order predicate calculus (FOPC)
- Sentences of first-order logic (FOL)
- Example (recursive)
- Ancestor (x, y) ← Parent (x, y).
- Ancestor (x, y) ← Parent (x, z) ∧ Ancestor (z, y).
- Components of FOPC Formulas: Quick Intro to Terminology
- Constants: e.g., John, Kansas, 42
- Variables: e.g., Name, State, x
- Predicates: e.g., Father-Of, Greater-Than
- Functions: e.g., age, cosine
- Term: constant, variable, or function(term)
- Literals (atoms): Predicate(term) or its negation (e.g., ¬Greater-Than (age (John), 42))
- Clause: disjunction of literals with implicit universal quantification
- Horn clause: at most one positive literal (H ∨ ¬L1 ∨ ¬L2 ∨ … ∨ ¬Ln)
Slide 24: Learning First-Order Rules
- Why Do That?
- Can learn sets of rules such as
- Ancestor (x, y) ← Parent (x, y).
- Ancestor (x, y) ← Parent (x, z) ∧ Ancestor (z, y).
- General-purpose (Turing-complete) programming language: PROLOG
- Programs are such sets of rules (Horn clauses)
- Inductive logic programming (next time): a kind of program synthesis
- Caveat
- Arbitrary inference using first-order rules is semi-decidable
- Recursively enumerable but not recursive (reduction from the halting problem LH)
- Compare: resolution theorem-proving, arbitrary queries in Prolog
- Generally, may have to restrict power
- Inferential completeness
- Expressive power of Horn clauses
- Learning part
Slide 25: First-Order Rule: Example
- Prolog (FOPC) Rule for Classifying Web Pages
- [Slattery, 1997]
- Course (A) ←
- Has-Word (A, instructor),
- not Has-Word (A, good),
- Link-From (A, B),
- Has-Word (B, assign),
- not Link-From (B, C).
- Train: 31/31, test: 31/34
- How Are Such Rules Used?
- Implement search-based (inferential) programs
- References
- Chapters 1-10, Russell and Norvig
- Online resources at http://archive.comlab.ox.ac.uk/logic-prog.html
Slide 26: First-Order Inductive Learning (FOIL): Algorithm
- Algorithm FOIL (Target-Predicate, Predicates, D)
- Pos ← D.Filter-By (Target-Predicate) // examples for which it is true
- Neg ← D.Filter-By (Not (Target-Predicate)) // examples for which it is false
- WHILE NOT Pos.Empty() DO // learn new rule
- New-Rule ← Learn-One-First-Order-Rule (Target-Predicate, Predicates, D)
- Learned-Rules.Add-Rule (New-Rule)
- Pos.Remove-Covered-By (New-Rule)
- RETURN (Learned-Rules)
- Algorithm Learn-One-First-Order-Rule (Target-Predicate, Predicates, D)
- New-Rule ← the rule that predicts Target-Predicate with no preconditions
- New-Rule-Neg ← Neg
- WHILE NOT New-Rule-Neg.Empty() DO // specialize New-Rule
- 1. Candidate-Literals ← Generate-Candidates() // based on Predicates
- 2. Best-Literal ← argmax over L ∈ Candidate-Literals of FOIL-Gain (L, New-Rule, Target-Predicate, D) // all possible new literals
- 3. New-Rule.Add-Precondition (Best-Literal) // add the best one
- 4. New-Rule-Neg ← New-Rule-Neg.Filter-By (New-Rule)
- RETURN (New-Rule)
Slide 27: Specializing Rules in FOIL
- Learning Rule: P(x1, x2, …, xk) ← L1 ∧ L2 ∧ … ∧ Ln.
- Candidate Specializations
- Add a new literal to get a more specific Horn clause
- Form of literal
- Q(v1, v2, …, vr), where at least one of the vi in the created literal must already exist as a variable in the rule
- Equal(xj, xk), where xj and xk are variables already present in the rule
- The negation of either of the above forms of literals
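The candidate-generation step above can be sketched as follows. The representation (predicate name plus argument tuple, with "not" wrapping negations) is illustrative, and only one fresh variable is introduced per step for brevity.

```python
# Sketch of FOIL candidate specializations: literals Q(v1..vr) that reuse at
# least one existing rule variable, Equal(xj, xk) over existing variables,
# and the negations of both forms.

from itertools import product

def candidate_literals(rule_vars, predicates):
    """rule_vars: variables already in the rule; predicates: {name: arity}."""
    new_var = "v%d" % len(rule_vars)       # one fresh variable (illustrative)
    pool = rule_vars + [new_var]
    cands = []
    for name, arity in predicates.items():
        for args in product(pool, repeat=arity):
            # at least one argument must already be a rule variable
            if any(a in rule_vars for a in args):
                cands.append((name, args))
                cands.append(("not", (name, args)))
    for xj in rule_vars:
        for xk in rule_vars:
            if xj < xk:
                cands.append(("Equal", (xj, xk)))
                cands.append(("not", ("Equal", (xj, xk))))
    return cands

cands = candidate_literals(["x", "y"], {"Parent": 2})
print(("Parent", ("y", "v2")) in cands)   # True: reuses y, introduces v2
print(("Parent", ("v2", "v2")) in cands)  # False: no existing rule variable
```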
Slide 28: Information Gain in FOIL
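The figure for this slide is not reproduced here. The gain criterion FOIL uses (Quinlan, 1990; stated in Mitchell, Chapter 10) is FoilGain(L, R) = t · (log2(p1/(p1+n1)) − log2(p0/(p0+n0))), where p0, n0 count the positive and negative bindings of rule R, p1, n1 those of R with literal L added, and t is the number of positive bindings of R still covered after adding L. A direct computation, with illustrative counts:

```python
# FOIL-Gain: reward literals that raise the proportion of positive bindings,
# weighted by t, the positive bindings the specialized rule retains.

from math import log2

def foil_gain(p0, n0, p1, n1, t):
    return t * (log2(p1 / (p1 + n1)) - log2(p0 / (p0 + n0)))

# example counts (assumed, not from the slides): adding L takes the rule
# from 10 pos / 10 neg bindings to 8 pos / 2 neg, keeping t = 8 positives
print(round(foil_gain(p0=10, n0=10, p1=8, n1=2, t=8), 3))  # 5.425
```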
Slide 29: FOIL: Learning Recursive Rule Sets
- Recursive Rules
- So far, ignored the possibility of recursive WFFs
- New literals added to the rule body could refer to the target predicate itself
- i.e., the predicate occurs in the rule head
- Example
- Ancestor (x, y) ← Parent (x, z) ∧ Ancestor (z, y).
- Rule: IF Parent (x, z) ∧ Ancestor (z, y) THEN Ancestor (x, y)
- Learning Recursive Rules from Relations
- Given an appropriate set of training examples
- Can learn using FOIL-based search
- Requirement: Ancestor ∈ Predicates (symbol is a member of the candidate set)
- Recursive rules still have to outscore competing candidates on FOIL-Gain
- NB: how to ensure termination? (well-founded ordering, i.e., no infinite recursion)
- [Quinlan, 1990; Cameron-Jones and Quinlan, 1993]
Slide 30: FOIL: Summary
- Extends Sequential-Covering Algorithm
- Handles the case of learning first-order rules similar to Horn clauses
- Result: more powerful rules for the performance element (automated reasoning)
- General-to-Specific Search
- Adds literals (predicates and negations over functions, variables, constants)
- Can learn sets of recursive rules
- Caveat: might learn infinitely recursive rule sets
- Has been shown to successfully induce recursive rules in some cases
- Overfitting
- If no noise, might keep adding new literals until the rule covers no negative examples
- Solution approach: tradeoff (heuristic evaluation function on rules)
- Accuracy, coverage, complexity
- FOIL-Gain: an MDL function
- Overfitting recovery in FOIL: post-pruning
Slide 31: Terminology
- Induction and Deduction
- Induction: finding h such that ∀ ⟨xi, f(xi)⟩ ∈ D . (B ∧ h ∧ xi) ⊢ f(xi)
- Inductive learning: B ≡ background knowledge (inductive bias, etc.)
- Developing inverse deduction operators
- Deduction: finding entailed logical statements F(A, B) = C, where A ∧ B ⊢ C
- Inverse deduction: finding hypotheses O(B, D) = h, where ∀ ⟨xi, f(xi)⟩ ∈ D . (B ∧ h ∧ xi) ⊢ f(xi)
- Resolution rule: deductive inference rule (P ∨ L, ¬L ∨ R ⊢ P ∨ R)
- Propositional logic: Boolean terms, connectives (∧, ∨, ¬, ⇒)
- First-order predicate calculus (FOPC): well-formed formulas (WFFs), aka clauses (defined over literals, connectives, implicit quantifiers)
- Inverse entailment: inverse of the resolution operator
- Inductive Logic Programming (ILP)
- Cigol: ILP algorithm that uses inverse entailment
- Progol: sequential covering (general-to-specific search) algorithm for ILP
Slide 32: Summary Points
- Induction as Inverse of Deduction
- Problem of induction revisited
- Definition of induction
- Inductive learning as a specific case
- Role of induction, deduction in automated reasoning
- Operators for automated deductive inference
- Resolution rule (and operator) for deduction
- First-order predicate calculus (FOPC) and resolution theorem proving
- Inverting resolution
- Propositional case
- First-order case (inverse entailment operator)
- Inductive Logic Programming (ILP)
- Cigol: inverse entailment (very susceptible to combinatorial explosion)
- Progol: sequential covering, general-to-specific search using inverse entailment
- Next Week: Knowledge Discovery in Databases (KDD), Final Review