Title: Rule Induction
1Rule Induction
ACAI 05 ADVANCED COURSE ON KNOWLEDGE DISCOVERY
- Nada Lavrac
- Department of Knowledge Technologies
- Jožef Stefan Institute
- Ljubljana, Slovenia
2Talk outline
- Predictive vs. Descriptive DM
- Predictive rule induction
- Classification vs. estimation
- Classification rule induction
- Heuristics and rule quality evaluation
- Descriptive rule induction
- Predictive vs. Descriptive DM summary
3Types of DM tasks
- Predictive DM
- Classification (learning of rulesets, decision trees, ...)
- Prediction and estimation (regression)
- Predictive relational DM (RDM, ILP)
- Descriptive DM
- description and summarization
- dependency analysis (association rule learning)
- discovery of properties and constraints
- segmentation (clustering)
- subgroup discovery
- Text, Web and image analysis
4Predictive vs. descriptive induction
- Predictive induction: inducing classifiers for solving classification and prediction tasks
- Classification rule learning, Decision tree learning, ...
- Bayesian classifier, ANN, SVM, ...
- Data analysis through hypothesis generation and testing
- Descriptive induction: discovering interesting regularities in the data, uncovering patterns, ... for solving KDD tasks
- Symbolic clustering, Association rule learning, Subgroup discovery, ...
- Exploratory data analysis
5Predictive vs. descriptive induction: a rule learning perspective
- Predictive induction: induces rulesets acting as classifiers for solving classification and prediction tasks
- Descriptive induction: discovers individual rules describing interesting regularities in the data
- Therefore: different goals, different heuristics, different evaluation criteria
6Supervised vs. unsupervised learning: a rule learning perspective
- Supervised learning: rules are induced from labeled instances (training examples with class assignment) - usually used in predictive induction
- Unsupervised learning: rules are induced from unlabeled instances (training examples with no class assignment) - usually used in descriptive induction
- Exception: subgroup discovery
- Discovers individual rules describing interesting regularities in the data, induced from labeled examples
7Subgroups vs. classifiers
- Classifiers
- Classification rules aim at pure subgroups
- A set of rules forms a domain model
- Subgroups
- Rules describing subgroups aim at a significantly higher proportion of positives
- Each rule is an independent chunk of knowledge
- Link: subgroup discovery can be viewed as a form of cost-sensitive classification
8Talk outline
- Predictive vs. Descriptive DM
- Predictive rule induction
- Classification vs. estimation
- Classification rule induction
- Heuristics and rule quality evaluation
- Descriptive rule induction
- Predictive vs. Descriptive DM summary
9Predictive DM - Classification
- data are objects, characterized with attributes; objects belong to different classes (discrete labels)
- given objects described by attribute values, induce a model to predict the different classes
- decision trees, if-then rules, ...
10Illustrative example Contact lenses data
11Decision tree for contact lenses recommendation
12Illustrative example Customer data
13Induced decision trees
[Figure: decision trees induced from the customer data - splits on Income (≤ 102000 / > 102000), Age (≤ 58 / > 58), Gender (female / male) and Age (≤ 49 / > 49), with yes/no leaves]
14Predictive DM - Estimation
- often referred to as regression
- data are objects, characterized with attributes (discrete or continuous); classes of objects are continuous (numeric)
- given objects described with attribute values, induce a model to predict the numeric class value
- regression trees, linear and logistic regression, ANN, kNN, ...
15Illustrative example Customer data
16Customer data regression tree
[Figure: customer data regression tree - root split on Income (≤ 108000 / > 108000), then Age (≤ 42.5 / > 42.5), with numeric leaf predictions 12000, 16500 and 26700]
17Predicting algal biomass regression tree
[Figure: regression tree for predicting algal biomass - root split on Month (Jan.-June / July-Dec.), inner splits on Ptot (≤ 9.34 / > 9.34, ≤ 9.1 / > 9.1, ≤ 5.9 / > 5.9) and Si (≤ 10.1 / > 10.1, ≤ 2.13 / > 2.13), leaves giving mean ± standard deviation of biomass: 2.34 ± 1.65, 4.32 ± 2.07, 1.28 ± 1.08, 2.08 ± 0.71, 2.97 ± 1.09, 0.70 ± 0.34, 1.15 ± 0.21]
18Talk outline
- Predictive vs. Descriptive DM
- Predictive rule induction
- Classification vs. estimation
- Classification rule induction
- Heuristics and rule quality evaluation
- Descriptive rule induction
- Predictive vs. Descriptive DM summary
19Ruleset representation
- Rule base is a disjunctive set of conjunctive rules
- Standard forms of rules:
- IF Conditions THEN Class
- Class IF Conditions
- Class ← Conditions
- Examples:
- IF Outlook=Sunny ∧ Humidity=Normal THEN PlayTennis=Yes
- IF Outlook=Overcast THEN PlayTennis=Yes
- IF Outlook=Rain ∧ Wind=Weak THEN PlayTennis=Yes
- Form of CN2 rules: IF Conditions THEN MajClass [ClassDistr]
- Rule base: {R1, R2, R3, ..., DefaultRule}
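The following is a minimal Python sketch (not from the slides) of one way to represent such IF-THEN rules and to test whether a rule covers an example; the attribute names and dictionary-based examples are illustrative assumptions.

```python
from typing import Dict, List, Tuple

Condition = Tuple[str, str]          # (attribute, required value)
Rule = Tuple[List[Condition], str]   # (conjunctive body, predicted class)

def covers(rule: Rule, example: Dict[str, str]) -> bool:
    """A rule fires iff every condition in its body matches the example."""
    body, _ = rule
    return all(example.get(attr) == value for attr, value in body)

# the three PlayTennis=Yes rules from the slide
rules: List[Rule] = [
    ([("Outlook", "Sunny"), ("Humidity", "Normal")], "Yes"),
    ([("Outlook", "Overcast")], "Yes"),
    ([("Outlook", "Rain"), ("Wind", "Weak")], "Yes"),
]

example = {"Outlook": "Sunny", "Humidity": "Normal", "Wind": "Strong"}
print([cls for body, cls in rules if covers((body, cls), example)])  # ['Yes']
```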
20Classification Rule Learning
- Rule set representation
- Two rule learning approaches
- Learn decision tree, convert to rules
- Learn set/list of rules
- Learning an unordered set of rules
- Learning an ordered list of rules
- Heuristics, overfitting, pruning
21Decision tree vs. rule learning: splitting vs. covering
- Splitting (ID3, C4.5, J48, See5)
- Covering (AQ, CN2)
22PlayTennis Training examples
23PlayTennis Using a decision tree for
classification
[Figure: PlayTennis decision tree - Outlook=Sunny → test Humidity (High → No, Normal → Yes); Outlook=Overcast → Yes; Outlook=Rain → test Wind (Strong → No, Weak → Yes)]
Is Saturday morning OK for playing tennis?
Outlook=Sunny, Temperature=Hot, Humidity=High, Wind=Strong
PlayTennis = No, because Outlook=Sunny ∧ Humidity=High
24PlayTennis Converting a tree to rules
- IF Outlook=Sunny ∧ Humidity=Normal THEN PlayTennis=Yes
- IF Outlook=Overcast THEN PlayTennis=Yes
- IF Outlook=Rain ∧ Wind=Weak THEN PlayTennis=Yes
- IF Outlook=Sunny ∧ Humidity=High THEN PlayTennis=No
- IF Outlook=Rain ∧ Wind=Strong THEN PlayTennis=No
25Contact lens classification rules
- tear production=reduced => lenses=NONE [S=0, H=0, N=12]
- tear production=normal ∧ astigmatism=no => lenses=SOFT [S=5, H=0, N=1]
- tear production=normal ∧ astigmatism=yes ∧ spect. pre.=myope => lenses=HARD [S=0, H=3, N=2]
- tear production=normal ∧ astigmatism=yes ∧ spect. pre.=hypermetrope => lenses=NONE [S=0, H=1, N=2]
- DEFAULT lenses=NONE
26Unordered rulesets
- rule Class IF Conditions is learned by first determining Class and then Conditions
- NB: ordered sequence of classes C1, ..., Cn in RuleSet
- But unordered (independent) execution of rules when classifying a new instance: all rules are tried, the predictions of those covering the example are collected, and voting is used to obtain the final classification
- if no rule fires, then DefaultClass (majority class in E)
27Contact lens decision list
- Ordered (order-dependent) rules
- IF tear production=reduced THEN lenses=NONE
- ELSE /* tear production=normal */
- IF astigmatism=no THEN lenses=SOFT
- ELSE /* astigmatism=yes */
- IF spect. pre.=myope THEN lenses=HARD
- ELSE /* spect. pre.=hypermetrope */
- lenses=NONE
28Ordered set of rules: if-then-else decision lists
- rule Class IF Conditions is learned by first determining Conditions and then Class
- Notice: mixed sequence of classes C1, ..., Cn in RuleBase
- But ordered execution when classifying a new instance: rules are sequentially tried and the first rule that fires (covers the example) is used for classification
- Decision list {R1, R2, R3, ..., D}: rules Ri are interpreted as if-then-else rules
- If no rule fires, then DefaultClass (majority class in Ecur)
29Original covering algorithm (AQ, Michalski 1969, 86)
- Basic covering algorithm
- for each class Ci do
- Ei := Pi ∪ Ni (Pi pos., Ni neg.)
- RuleBase(Ci) := empty
- repeat {learn-set-of-rules}
- learn-one-rule R covering some positive examples and no negatives
- add R to RuleBase(Ci)
- delete from Pi all pos. ex. covered by R
- until Pi = empty
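A rough Python sketch of this basic covering loop (not the original AQ code); the learn_one_rule function and the rule's covers(example) method are assumed interfaces.

```python
def covering(positives, negatives, learn_one_rule):
    """Return rules for one class that together cover all positive examples."""
    rule_base = []
    remaining = list(positives)          # Pi: positive examples not yet covered
    while remaining:                     # until Pi is empty
        rule = learn_one_rule(remaining, negatives)  # covers some pos., no neg.
        rule_base.append(rule)
        # delete from Pi all positive examples covered by the new rule
        remaining = [e for e in remaining if not rule.covers(e)]
    return rule_base
```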
30Learning unordered set of rules (CN2, Clark and Niblett)
- RuleBase := empty
- for each class Ci do
- Ei := Pi ∪ Ni, RuleSet(Ci) := empty
- repeat {learn-set-of-rules}
- R := "Class = Ci IF Conditions", Conditions := true
- repeat {learn-one-rule}: R := "Class = Ci IF Conditions AND Cond" (general-to-specific beam search for the best R)
- until stopping criterion is satisfied (no negatives covered or Performance(R) < ThresholdR)
- add R to RuleSet(Ci)
- delete from Pi all positive examples covered by R
- until stopping criterion is satisfied (all positives covered or Performance(RuleSet(Ci)) < ThresholdRS)
- RuleBase := RuleBase ∪ RuleSet(Ci)
31Learn-one-rule Greedy vs. beam search
- learn-one-rule by greedy general-to-specific search: at each step select the best descendant, no backtracking
- beam search: maintain a list of k best candidates; at each step, descendants (specializations) of each of these k candidates are generated, and the resulting set is again reduced to the k best candidates
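A simplified Python sketch of a beam-search learn-one-rule in the spirit of CN2; candidate_conditions (predicates over examples) and evaluate (a heuristic such as Laplace accuracy or WRAcc) are assumed to be supplied by the caller and are not part of the original slides.

```python
def learn_one_rule(positives, negatives, candidate_conditions, evaluate, k=5):
    """General-to-specific beam search for the body (list of conditions)
    of one rule predicting the positive class."""
    beam = [[]]                              # start from the empty body: 'IF true'
    best = []
    while beam:
        # generate all one-condition specializations of every body in the beam
        refinements = [body + [c] for body in beam
                       for c in candidate_conditions if c not in body]
        if not refinements:
            break
        refinements.sort(key=lambda b: evaluate(b, positives, negatives),
                         reverse=True)
        beam = refinements[:k]               # keep the k best candidates
        if evaluate(beam[0], positives, negatives) > evaluate(best, positives, negatives):
            best = beam[0]
        # stop once the best candidate covers no negative examples
        if not any(all(cond(e) for cond in beam[0]) for e in negatives):
            break
    return best
```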
32Illustrative example Contact lenses data
33Learn-one-rule as heuristic search
- Lenses = hard IF true [S=5, H=4, N=15]
- specializations (one condition added):
- Lenses = hard IF Astigmatism=no [S=5, H=0, N=7]
- Lenses = hard IF Astigmatism=yes [S=0, H=4, N=8]
- Lenses = hard IF Tear prod.=reduced [S=0, H=0, N=12]
- Lenses = hard IF Tear prod.=normal [S=5, H=4, N=3]
- specializations of "Tear prod.=normal":
- Lenses = hard IF Tear prod.=normal AND Spect. pre.=myope [S=2, H=3, N=1]
- Lenses = hard IF Tear prod.=normal AND Spect. pre.=hyperm. [S=3, H=1, N=2]
- Lenses = hard IF Tear prod.=normal AND Astigmatism=yes [S=0, H=4, N=2]
- Lenses = hard IF Tear prod.=normal AND Astigmatism=no [S=5, H=0, N=1]
34Rule learning summary
- Hypothesis construction: find a set of n rules
- usually simplified by n separate rule constructions
- Rule construction: find a pair (Class, Cond)
- select rule head (class) and construct rule body, or
- construct rule body and assign rule head (in ordered algorithms)
- Body construction: find a set of m features
- usually simplified by adding to the rule body one feature at a time
35Talk outline
- Predictive vs. Descriptive DM
- Predictive rule induction
- Classification vs. estimation
- Classification rule induction
- Heuristics and rule quality evaluation
- Descriptive rule induction
- Predictive vs. Descriptive DM summary
36Evaluating rules and rulesets
- Predictive evaluation measures: maximizing accuracy, minimizing Error = 1 - Accuracy, avoiding overfitting
- Estimating accuracy: percentage of correct classifications
- on the training set
- on unseen / testing instances
- cross validation, leave-one-out, ...
- Other measures: comprehensibility (size), information contents (information score), significance, ...
- Other measures of rule interestingness for descriptive induction
37n-fold cross validation
- A method for accuracy estimation of classifiers
- Partition the set D into n disjoint, almost equally-sized folds Ti, where ∪i Ti = D
- for i = 1, ..., n do
- form a training set out of n-1 folds: Di := D \ Ti
- induce classifier Hi from the examples in Di
- use fold Ti for testing the accuracy of Hi
- Estimate the accuracy of the classifier by averaging the accuracies over the n folds Ti
38-41 [Figures: illustration of 3-fold cross validation - the data set D is partitioned into folds T1, T2, T3; training sets D1 = D\T1, D2 = D\T2, D3 = D\T3 are formed, and each fold Ti is in turn used for testing]
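A small Python sketch of the n-fold cross-validation procedure of slide 37; induce (returns a classifier from training examples) and accuracy (scores a classifier on a test fold) are assumed callables, not anything prescribed by the slides.

```python
import random

def cross_validation(data, induce, accuracy, n=10, seed=0):
    """Estimate classifier accuracy by averaging over n train/test splits."""
    examples = list(data)
    random.Random(seed).shuffle(examples)
    folds = [examples[i::n] for i in range(n)]        # n disjoint folds Ti
    accuracies = []
    for i in range(n):
        test = folds[i]                               # Ti
        train = [e for j, fold in enumerate(folds) if j != i for e in fold]  # Di = D \ Ti
        classifier = induce(train)                    # Hi induced from Di
        accuracies.append(accuracy(classifier, test)) # tested on Ti
    return sum(accuracies) / n
```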
42Overfitting and accuracy
- Typical relation between hypothesis size and accuracy
- Question: how to prune optimally?
43Overfitting
- Consider the error of hypothesis h over
- training data T: ErrorT(h)
- the entire distribution D of data: ErrorD(h)
- Hypothesis h ∈ H overfits training data T if there is an alternative hypothesis h' ∈ H such that
- ErrorT(h) < ErrorT(h'), and
- ErrorD(h) > ErrorD(h')
- Prune a hypothesis (decision tree, ruleset) to avoid overfitting T
44Avoiding overfitting
- Decision trees
- Pre-pruning (forward pruning): stop growing the tree, e.g., when a data split is not statistically significant or too few examples are in a split
- Post-pruning: grow the full tree, then post-prune
- Rulesets
- Pre-pruning (forward pruning): stop growing the rule, e.g., when too few examples are covered by the rule
- Post-pruning: construct a full ruleset, then prune
45Rule post-pruning (Quinlan 1993)
- Very frequently used method, e.g., in C4.5
- Procedure
- grow a full tree (allowing overfitting)
- convert the tree to an equivalent set of rules
- prune each rule independently of others
- sort final rules into a desired sequence for use
46Performance metrics
- Rule evaluation measures - aimed at avoiding overfitting
- Heuristics for guiding the search
- Heuristics for stopping the search
- Confusion matrix / contingency table for the evaluation of individual rules and rulesets
- Area under ROC evaluation (employing the confusion matrix information)
47Learn-one-rule PlayTennis training examples
48Learn-one-rule as search PlayTennis example
- PlayTennis = yes IF true
- PlayTennis = yes IF Wind=weak
- PlayTennis = yes IF Wind=strong
- PlayTennis = yes IF Humidity=high
- PlayTennis = yes IF Humidity=normal
- ...
- PlayTennis = yes IF Humidity=normal, Wind=weak
- PlayTennis = yes IF Humidity=normal, Wind=strong
- PlayTennis = yes IF Humidity=normal, Outlook=rain
- PlayTennis = yes IF Humidity=normal, Outlook=sunny
49Learn-one-rule as heuristic search PlayTennis example
- PlayTennis = yes IF true [9+, 5-] (14)
- PlayTennis = yes IF Wind=weak [6+, 2-] (8)
- PlayTennis = yes IF Wind=strong [3+, 3-] (6)
- PlayTennis = yes IF Humidity=high [3+, 4-] (7)
- PlayTennis = yes IF Humidity=normal [6+, 1-] (7)
- ...
- PlayTennis = yes IF Humidity=normal, Wind=weak
- PlayTennis = yes IF Humidity=normal, Wind=strong
- PlayTennis = yes IF Humidity=normal, Outlook=rain
- PlayTennis = yes IF Humidity=normal, Outlook=sunny [2+, 0-] (2)
50Heuristics for learn-one-rule PlayTennis example
- PlayTennis = yes [9+, 5-] (14)
- PlayTennis = yes ← Wind=weak [6+, 2-] (8); ← Wind=strong [3+, 3-] (6); ← Humidity=normal [6+, 1-] (7); ...
- PlayTennis = yes ← Humidity=normal ∧ Outlook=sunny [2+, 0-] (2); ...
- Estimating accuracy with a probability: A(Ci ← Cond) = p(Ci|Cond)
- Estimating probability with relative frequency: covered pos. ex. / all covered ex.
- [6+, 1-] (7): 6/7; [2+, 0-] (2): 2/2 = 1
51Probability estimates
- Relative frequency of covered positive examples: p(Cl|Cond) = n(Cl.Cond) / n(Cond)
- problems with small samples
- Laplace estimate: p(Cl|Cond) = (n(Cl.Cond) + 1) / (n(Cond) + k)
- assumes a uniform prior distribution of the k classes
- m-estimate: p(Cl|Cond) = (n(Cl.Cond) + m·pa(Cl)) / (n(Cond) + m)
- special case pa(Cl) = 1/k, m = k gives the Laplace estimate
- takes into account prior probabilities pa(C) instead of the uniform distribution
- independent of the number of classes k
- m is domain dependent (more noise, larger m)
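The probability estimates above written out as small Python functions - a sketch using the slide's counts: n_pos covered examples of the class, n_cov all covered examples, k classes, p_a the class prior.

```python
def relative_frequency(n_pos, n_cov):
    return n_pos / n_cov if n_cov else 0.0

def laplace(n_pos, n_cov, k):
    # assumes a uniform prior over the k classes
    return (n_pos + 1) / (n_cov + k)

def m_estimate(n_pos, n_cov, p_a, m):
    # the Laplace estimate is the special case p_a = 1/k, m = k
    return (n_pos + m * p_a) / (n_cov + m)

# e.g. a rule covering [2+, 0-] in a 2-class problem:
print(relative_frequency(2, 2), laplace(2, 2, 2), m_estimate(2, 2, 0.5, 2))
# -> 1.0 0.75 0.75
```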
52Learn-one-rule search heuristics
- Assume two classes (+, -); learn rules for the + class (Cl). Search for specializations R' of one rule R = Cl ← Cond from the RuleBase.
- Expected classification accuracy: A(R) = p(Cl|Cond)
- Informativity (info needed to specify that an example covered by Cond belongs to Cl): I(R) = -log2 p(Cl|Cond)
- Accuracy gain (increase in expected accuracy): AG(R', R) = p(Cl|Cond') - p(Cl|Cond)
- Information gain (decrease in the information needed): IG(R', R) = log2 p(Cl|Cond') - log2 p(Cl|Cond)
- Weighted measures favoring more general rules: WAG, WIG
- WAG(R', R) = p(Cond')/p(Cond) · (p(Cl|Cond') - p(Cl|Cond))
- Weighted relative accuracy trades off coverage and relative accuracy: WRAcc(R) = p(Cond) · (p(Cl|Cond) - pa(Cl))
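As a concrete illustration, the accuracy, informativity and WRAcc heuristics above can be computed from simple counts; this is a sketch assuming counts over the current training set (N examples, n_cond covered by the rule body, n_cl_cond of those in class Cl, n_cl in class Cl overall), not code from the slides.

```python
from math import log2

def accuracy(n_cl_cond, n_cond):                 # A(R) = p(Cl|Cond)
    return n_cl_cond / n_cond

def informativity(n_cl_cond, n_cond):            # I(R) = -log2 p(Cl|Cond)
    return -log2(n_cl_cond / n_cond)

def accuracy_gain(n_cl_cond_new, n_cond_new, n_cl_cond_old, n_cond_old):
    # AG(R', R) = p(Cl|Cond') - p(Cl|Cond)
    return accuracy(n_cl_cond_new, n_cond_new) - accuracy(n_cl_cond_old, n_cond_old)

def wracc(n_cl_cond, n_cond, n_cl, N):           # WRAcc(R) = p(Cond)(p(Cl|Cond) - p(Cl))
    return (n_cond / N) * (n_cl_cond / n_cond - n_cl / N)

# PlayTennis: rule 'yes IF Humidity=normal' covers [6+, 1-] of the 9+/5- examples
print(accuracy(6, 7), wracc(6, 7, 9, 14))        # 0.857..., 0.107...
```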
53What is high accuracy?
- Rule accuracy should be traded off against the default accuracy of the rule Cl ← true
- 68% accuracy is OK if there are 20% examples of that class in the training set, but bad if there are 80%
- Relative accuracy
- RAcc(Cl ← Cond) = p(Cl|Cond) - p(Cl)
54Weighted relative accuracy
- If a rule covers a single example, its accuracy is either 0% or 100%
- maximizing relative accuracy tends to produce many overly specific rules
- Weighted relative accuracy
- WRAcc(Cl ← Cond) = p(Cond) · (p(Cl|Cond) - p(Cl))
55Weighted relative accuracy
- WRAcc is a fundamental rule evaluation measure
- WRAcc can be used if you want to assess both accuracy and significance
- WRAcc can be used if you want to compare rules with different heads and bodies - an appropriate measure for use in descriptive induction, e.g., association rule learning
56Talk outline
- Predictive vs. Descriptive DM
- Predictive rule induction
- Classification vs. estimation
- Classification rule induction
- Heuristics and rule quality evaluation
- Descriptive rule induction
- Subgroup discovery
- Association rule learning
- Predictive vs. Descriptive DM summary
57Descriptive DM
- Often used for preliminary data analysis
- User gets a feel for the data and its structure
- Aims at deriving descriptions of characteristics
of the data - Visualization and descriptive statistical
techniques can be used
58Descriptive DM
- Description
- Data description and summarization: describe elementary and aggregated data characteristics (statistics, ...)
- Dependency analysis
- describe associations, dependencies, ...
- discovery of properties and constraints
- Segmentation
- Clustering: separate objects into subsets according to distance and/or similarity (clustering, SOM, visualization, ...)
- Subgroup discovery: find unusual subgroups that are significantly different from the majority (deviation detection w.r.t. the overall class distribution)
59Subgroup Discovery
- Given a population of individuals and a property of individuals that we are interested in
- Find population subgroups that are statistically most interesting, e.g., are as large as possible and have the most unusual statistical (distributional) characteristics w.r.t. the property of interest
60Subgroup interestingness
- Interestingness criteria
- As large as possible
- Class distribution as different as possible from the distribution in the entire data set
- Significant
- Surprising to the user
- Non-redundant
- Simple
- Useful - actionable
61Classification Rule Learning for Subgroup Discovery: Deficiencies
- Only the first few rules induced by the covering algorithm have sufficient support (coverage)
- Subsequent rules are induced from smaller and strongly biased example subsets (pos. examples not covered by previously induced rules), which hinders their ability to detect population subgroups
- Ordered rules are induced and interpreted sequentially as an if-then-else decision list
62CN2-SD Adapting CN2 Rule Learning to Subgroup Discovery
- Weighted covering algorithm
- Weighted relative accuracy (WRAcc) search heuristic, with added example weights
- Probabilistic classification
- Evaluation with different interestingness measures
63CN2-SD CN2 Adaptations
- General-to-specific search (beam search) for the best rules
- Rule quality measure
- CN2: Laplace: Acc(Class ← Cond) = p(Class|Cond) = (nc + 1) / (nrule + k)
- CN2-SD: Weighted Relative Accuracy
- WRAcc(Class ← Cond) = p(Cond) · (p(Class|Cond) - p(Class))
- Weighted covering approach (example weights)
- Significance testing (likelihood ratio statistic)
- Output: unordered rule sets (probabilistic classification)
64CN2-SD Weighted Covering
- Standard covering approach: covered examples are deleted from the current training set
- Weighted covering approach:
- weights assigned to examples
- covered positive examples are re-weighted in every iteration of the covering loop
- a count i stores how many times (by how many of the rules induced so far) a positive example has been covered: w(e,i), w(e,0) = 1
65CN2-SD Weighted Covering
- Additive weights: w(e,i) = 1/(i+1)
- w(e,i) - weight of a positive example e that has been covered i times
- Multiplicative weights: w(e,i) = γ^i, 0 < γ < 1
- note: γ = 1 → finds the same (first) rule again and again; γ = 0 → behaves like standard CN2
66CN2-SD Weighted WRAcc Search Heuristic
- Weighted relative accuracy (WRAcc) search heuristic, with added example weights
- WRAcc(Cl ← Cond) = p(Cond) · (p(Cl|Cond) - p(Cl))
- increased coverage, decreased number of rules, approximately equal accuracy (PKDD-2000)
67CN2-SD Weighted WRAcc Search Heuristic
- In the WRAcc computation, probabilities are estimated with relative frequencies, adapted to example weights:
- WRAcc(Cl ← Cond) = p(Cond) · (p(Cl|Cond) - p(Cl)) = n(Cond)/N · (n(Cl.Cond)/n(Cond) - n(Cl)/N)
- N - sum of the weights of all examples
- n(Cond) - sum of the weights of all covered examples
- n(Cl.Cond) - sum of the weights of all correctly covered examples
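A small Python sketch of these two CN2-SD ingredients - the additive/multiplicative example weights and WRAcc computed from weight sums; the function names and the default γ = 0.9 are illustrative assumptions, not the original implementation.

```python
def weight(i, scheme="additive", gamma=0.9):
    """Weight of a positive example that has been covered by i rules; w(e,0) = 1."""
    return 1.0 / (i + 1) if scheme == "additive" else gamma ** i

def weighted_wracc(w_all, w_cond, w_cl_cond, w_cl):
    """WRAcc(Cl <- Cond) = n(Cond)/N * (n(Cl.Cond)/n(Cond) - n(Cl)/N),
    where every count is a sum of current example weights."""
    if w_cond == 0:
        return 0.0
    return (w_cond / w_all) * (w_cl_cond / w_cond - w_cl / w_all)

# after a positive example has been covered twice:
print(weight(2), weight(2, "multiplicative"))   # 0.333..., 0.81
```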
68Probabilistic classification
- Unlike the ordered case of standard CN2, where rules are interpreted in an IF-THEN-ELSE fashion, in the unordered case and in CN2-SD all rules are tried and all rules that fire are collected
- If a clash occurs, a probabilistic method is used to resolve it
69Probabilistic classification
- A simplified example
- class=bird ← legs=2 ∧ feathers=yes [13, 0]
- class=elephant ← size=large ∧ flies=no [2, 10]
- class=bird ← beak=yes [20, 0]
- summed class distribution: [35, 10]
- Two-legged, feathered, large, non-flying animal with a beak?
- bird!
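A sketch of this voting scheme in Python: every firing rule contributes its class distribution and the class with the largest summed vote wins; the dictionary representation of the distributions is an illustrative choice.

```python
from collections import Counter

def classify(firing_rule_distributions, default="none"):
    """Sum the class distributions of all rules that fire and return the winner."""
    votes = Counter()
    for dist in firing_rule_distributions:       # e.g. {"bird": 13, "elephant": 0}
        votes.update(dist)
    return votes.most_common(1)[0][0] if votes else default

# the two-legged, feathered, large, non-flying animal with a beak fires all three rules:
print(classify([{"bird": 13, "elephant": 0},
                {"bird": 2, "elephant": 10},
                {"bird": 20, "elephant": 0}]))   # bird (35 vs. 10)
```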
70Talk outline
- Predictive vs. Descriptive DM
- Predictive rule induction
- Classification vs. estimation
- Classification rule induction
- Heuristics and rule quality evaluation
- Descriptive rule induction
- Subgroup discovery
- Association rule learning
- Predictive vs. Descriptive DM summary
71Association Rule Learning
- Rules X => Y: if X then Y
- X, Y: itemsets (records, conjunctions of items), where items/features are binary-valued attributes
- Transactions are itemsets (records) over items i1, i2, ..., i50, e.g. t1 = (1, 1, 1, ..., 0), t2 = (1, 0, ...), ...
- Example: market basket analysis
- peanuts ∧ chips => beer ∧ coke (0.05, 0.65)
- Support: Sup(X,Y) = |XY| / |D| = p(XY)
- Confidence: Conf(X,Y) = |XY| / |X| = Sup(X,Y) / Sup(X) = p(XY) / p(X) = p(Y|X)
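Support and confidence can be computed directly from these definitions; in the sketch below transactions are represented as Python sets of items, which is an illustrative choice rather than anything prescribed by the slides.

```python
def support(itemset, transactions):
    """Sup(Z) = |{t : Z subset of t}| / |D|"""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(x, y, transactions):
    """Conf(X, Y) = Sup(X union Y) / Sup(X) = p(Y|X)"""
    return support(x | y, transactions) / support(x, transactions)

# a toy transaction database (illustrative data, not the slide's example)
D = [{"peanuts", "chips", "beer"},
     {"peanuts", "chips", "beer", "coke"},
     {"peanuts", "bread"},
     {"chips", "coke"}]
print(support({"peanuts", "chips"}, D))               # 0.5
print(confidence({"peanuts", "chips"}, {"beer"}, D))  # 1.0
```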
72Association Rule Learning
- Given a set of transactions D
- Find all association rules that hold on the set of transactions and have support > MinSup and confidence > MinConf
- Procedure:
- find all large itemsets Z: Sup(Z) > MinSup
- split every large itemset Z into (X, Y) and compute Conf(X,Y) = Sup(X,Y)/Sup(X); if Conf(X,Y) > MinConf then X => Y (Sup(X,Y) > MinSup, as XY is large)
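A sketch of the second step of this procedure: every large itemset Z is split into X => Y, and the rule is kept if its confidence reaches MinConf (support is already guaranteed because Z itself is large). The confidence function from the previous sketch is assumed to be passed in.

```python
from itertools import combinations

def rules_from_itemset(Z, transactions, min_conf, confidence):
    """Yield (X, Y, conf) for every split of the large itemset Z whose
    confidence is at least min_conf."""
    items = sorted(Z)
    for r in range(1, len(items)):          # every non-empty proper subset as X
        for x in combinations(items, r):
            X, Y = set(x), Z - set(x)
            conf = confidence(X, Y, transactions)
            if conf >= min_conf:
                yield X, Y, conf
```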
73Induced association rules
- Age ≤ 52 ∧ BigSpender = no => Gender = male
- Age ≤ 52 ∧ BigSpender = no => Gender = male ∧ Income ≤ 73250
- Gender = male ∧ Age ≤ 52 ∧ Income ≤ 73250 => BigSpender = no
- ...
74Association Rule Learning for Classification
APRIORI-C
- Simplified APRIORI-C
- Discretise numeric attributes; for each discrete attribute with N values create N items
- Run APRIORI
- Collect rules whose right-hand side consists of a single target item, representing a value of the target attribute
75Association Rule Learning for Classification
APRIORI-C
- Improvements
- Creating rules Class ? Conditions during search
- Pruning of irrelevant items and itemsets
- Pre-processing Feature subset selection
- Post-processing Rule subset selection
76Association Rule Learning for Subgroup Discovery
Advantages
- May be used to create rules of the form Class ← Conditions
- Each rule is an independent chunk of knowledge, with
- high support and coverage: p(Class.Cond) > MinSup, p(Cond) > MinSup
- high confidence: p(Class|Cond) > MinConf
- all interesting rules found (complete search)
- Building small and easy-to-understand classifiers
- Appropriate for unbalanced class distributions
77Association Rule Learning for Subgroup Discovery
APRIORI-SD
- Further improvements
- Create a set of rules Class ← Conditions with APRIORI-C - advantage: an exhaustive set of rules above the MinConf and MinSup thresholds
- Order the set of induced rules by decreasing WRAcc
- Post-process: rule subset selection by a weighted covering approach
- Take the best rule w.r.t. WRAcc
- Decrease the weights of covered examples
- Reorder the remaining rules and repeat until a stopping criterion is satisfied:
- significance threshold
- WRAcc threshold
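A rough Python sketch of this weighted-covering post-processing loop; covers, the weighted wracc scorer and the max_rules cut-off are illustrative assumptions standing in for APRIORI-SD's actual parameters and stopping criteria.

```python
def select_rules(rules, examples, covers, wracc, max_rules=10):
    """Greedily pick rules by weighted WRAcc, down-weighting covered examples."""
    weights = {id(e): 1.0 for e in examples}     # w(e,0) = 1
    counts = {id(e): 0 for e in examples}        # how often e has been covered
    selected = []
    candidates = list(rules)
    while candidates and len(selected) < max_rules:
        best = max(candidates, key=lambda r: wracc(r, examples, weights))
        if wracc(best, examples, weights) <= 0:
            break                                # WRAcc threshold reached
        selected.append(best)
        candidates.remove(best)
        for e in examples:                       # decrease weights of covered examples
            if covers(best, e):
                counts[id(e)] += 1
                weights[id(e)] = 1.0 / (counts[id(e)] + 1)
    return selected
```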
78Talk outline
- Predictive vs. Descriptive DM
- Predictive rule induction
- Classification vs. estimation
- Classification rule induction
- Heuristics and rule quality evaluation
- Descriptive rule induction
- Predictive vs. Descriptive DM summary
79Predictive vs. descriptive induction Summary
- Predictive induction: induces rulesets acting as classifiers for solving classification and prediction tasks
- Rules are induced from labeled instances
- Descriptive induction: discovers individual rules describing interesting regularities in the data
- Rules are induced from unlabeled instances
- Exception: subgroup discovery
- Discovers individual rules describing interesting regularities in the data, induced from labeled examples
80Rule induction Literature
- P. Flach and N. Lavrac: Rule Induction, chapter in the book Intelligent Data Analysis (M. Berthold and D. Hand, eds.), Springer
- See references to other sources in this book chapter