Title: Rule Induction
1 Rule Induction
ACAI 05 ADVANCED COURSE ON KNOWLEDGE DISCOVERY
- Nada Lavrac
- Department of Knowledge Technologies
- Jožef Stefan Institute
- Ljubljana, Slovenia
2 Talk outline
- Predictive vs. Descriptive DM
- Predictive rule induction
- Classification vs. estimation
- Classification rule induction
- Heuristics and rule quality evaluation
- Descriptive rule induction
- Predictive vs. Descriptive DM summary
3 Types of DM tasks
- Predictive DM
  - Classification (learning of rulesets, decision trees, ...)
  - Prediction and estimation (regression)
  - Predictive relational DM (RDM, ILP)
- Descriptive DM
  - Description and summarization
  - Dependency analysis (association rule learning)
  - Discovery of properties and constraints
  - Segmentation (clustering)
  - Subgroup discovery
- Text, Web and image analysis
4 Predictive vs. descriptive induction
- Predictive induction: inducing classifiers for solving classification and prediction tasks
  - Classification rule learning, decision tree learning, ...
  - Bayesian classifier, ANN, SVM, ...
  - Data analysis through hypothesis generation and testing
- Descriptive induction: discovering interesting regularities in the data, uncovering patterns, ... for solving KDD tasks
  - Symbolic clustering, association rule learning, subgroup discovery, ...
  - Exploratory data analysis
5 Predictive vs. descriptive induction: a rule learning perspective
- Predictive induction: induces rulesets acting as classifiers for solving classification and prediction tasks
- Descriptive induction: discovers individual rules describing interesting regularities in the data
- Therefore: different goals, different heuristics, different evaluation criteria
6 Supervised vs. unsupervised learning: a rule learning perspective
- Supervised learning: rules are induced from labeled instances (training examples with class assignment) - usually used in predictive induction
- Unsupervised learning: rules are induced from unlabeled instances (training examples with no class assignment) - usually used in descriptive induction
- Exception: subgroup discovery
  - Discovers individual rules describing interesting regularities in the data, induced from labeled examples
7 Subgroups vs. classifiers
- Classifiers
  - Classification rules aim at pure subgroups
  - A set of rules forms a domain model
- Subgroups
  - Rules describing subgroups aim at a significantly higher proportion of positives
  - Each rule is an independent chunk of knowledge
- Link: SD can be viewed as a form of cost-sensitive classification
8 Talk outline
- Predictive vs. Descriptive DM
- Predictive rule induction
- Classification vs. estimation
- Classification rule induction
- Heuristics and rule quality evaluation
- Descriptive rule induction
- Predictive vs. Descriptive DM summary
9 Predictive DM - Classification
- Data are objects, characterized with attributes; objects belong to different classes (discrete labels)
- Given objects described by attribute values, induce a model to predict the different classes
- Decision trees, if-then rules, ...
10 Illustrative example: Contact lenses data
11 Decision tree for contact lenses recommendation
12 Illustrative example: Customer data
13 Induced decision trees
[Figure: induced decision trees with splits on Income (≤ 102000 / > 102000), Age (≤ 58 / > 58, ≤ 49 / > 49) and Gender (female / male), and leaf predictions yes / no.]
14 Predictive DM - Estimation
- Often referred to as regression
- Data are objects, characterized with attributes (discrete or continuous); classes of objects are continuous (numeric)
- Given objects described with attribute values, induce a model to predict the numeric class value
- Regression trees, linear and logistic regression, ANN, kNN, ...
15 Illustrative example: Customer data
16 Customer data regression tree
[Figure: regression tree with a root split on Income (≤ 108000 / > 108000) and a further split on Age (≤ 42.5 / > 42.5), with predicted values 12000, 16500 and 26700 at the leaves.]
17 Predicting algal biomass: regression tree
[Figure: regression tree splitting on Month (Jan.-June / July-Dec.), Ptot (thresholds 9.34, 9.1, 5.9) and Si (thresholds 10.1, 2.13), with predicted biomass values (mean ± deviation) such as 2.34±1.65, 4.32±2.07, 1.28±1.08, 2.08±0.71, 2.97±1.09, 0.70±0.34 and 1.15±0.21 at the leaves.]
18 Talk outline
- Predictive vs. Descriptive DM
- Predictive rule induction
- Classification vs. estimation
- Classification rule induction
- Heuristics and rule quality evaluation
- Descriptive rule induction
- Predictive vs. Descriptive DM summary
19 Ruleset representation
- A rule base is a disjunctive set of conjunctive rules
- Standard forms of rules:
  - IF Condition THEN Class
  - Class IF Conditions
  - Class ← Conditions
- Examples:
  - IF Outlook=Sunny ∧ Humidity=Normal THEN PlayTennis=Yes
  - IF Outlook=Overcast THEN PlayTennis=Yes
  - IF Outlook=Rain ∧ Wind=Weak THEN PlayTennis=Yes
- Form of CN2 rules: IF Conditions THEN MajClass [ClassDistr]
- Rule base: {R1, R2, R3, ..., DefaultRule}
20 Classification Rule Learning
- Rule set representation
- Two rule learning approaches
- Learn decision tree, convert to rules
- Learn set/list of rules
- Learning an unordered set of rules
- Learning an ordered list of rules
- Heuristics, overfitting, pruning
21 Decision tree vs. rule learning: splitting vs. covering
- Splitting (ID3, C4.5, J48, See5)
- Covering (AQ, CN2)
22 PlayTennis: Training examples
23 PlayTennis: Using a decision tree for classification
[Decision tree figure: root Outlook (Sunny / Overcast / Rain); the Sunny branch splits on Humidity (High → No, Normal → Yes); the Overcast branch predicts Yes; the Rain branch splits on Wind (Strong → No, Weak → Yes).]
- Is Saturday morning OK for playing tennis?
  Outlook=Sunny, Temperature=Hot, Humidity=High, Wind=Strong
  PlayTennis = No, because Outlook=Sunny ∧ Humidity=High
24 PlayTennis: Converting a tree to rules
- IF Outlook=Sunny ∧ Humidity=Normal THEN PlayTennis=Yes
- IF Outlook=Overcast THEN PlayTennis=Yes
- IF Outlook=Rain ∧ Wind=Weak THEN PlayTennis=Yes
- IF Outlook=Sunny ∧ Humidity=High THEN PlayTennis=No
- IF Outlook=Rain ∧ Wind=Strong THEN PlayTennis=No
25 Contact lens classification rules
- tear production=reduced → lenses=NONE [S=0, H=0, N=12]
- tear production=normal ∧ astigmatism=no → lenses=SOFT [S=5, H=0, N=1]
- tear production=normal ∧ astigmatism=yes ∧ spect. pre.=myope → lenses=HARD [S=0, H=3, N=2]
- tear production=normal ∧ astigmatism=yes ∧ spect. pre.=hypermetrope → lenses=NONE [S=0, H=1, N=2]
- DEFAULT lenses=NONE
26 Unordered rulesets
- A rule "Class IF Conditions" is learned by first determining Class and then Conditions
- NB: ordered sequence of classes C1, ..., Cn in RuleSet
- But unordered (independent) execution of rules when classifying a new instance: all rules are tried, the predictions of those covering the example are collected, and voting is used to obtain the final classification
- If no rule fires, then DefaultClass (majority class in E)
27 Contact lens decision list
- Ordered (order-dependent) rules:
- IF tear production=reduced THEN lenses=NONE
- ELSE /* tear production=normal */
- IF astigmatism=no THEN lenses=SOFT
- ELSE /* astigmatism=yes */
- IF spect. pre.=myope THEN lenses=HARD
- ELSE /* spect. pre.=hypermetrope */
- lenses=NONE
28 Ordered set of rules: if-then-else decision lists
- A rule "Class IF Conditions" is learned by first determining Conditions and then Class
- Notice the mixed sequence of classes C1, ..., Cn in RuleBase
- But ordered execution when classifying a new instance: rules are tried sequentially and the first rule that fires (covers the example) is used for classification
- Decision list {R1, R2, R3, ..., D}: rules Ri are interpreted as if-then-else rules
- If no rule fires, then DefaultClass (majority class in Ecur)
29 Original covering algorithm (AQ, Michalski 1969, 86)
- Basic covering algorithm (sketched in code below):
  - for each class Ci do
    - Ei := Pi ∪ Ni (Pi positive, Ni negative examples)
    - RuleBase(Ci) := empty
    - repeat {learn-set-of-rules}
      - learn-one-rule: find a rule R covering some positive examples and no negatives
      - add R to RuleBase(Ci)
      - delete from Pi all positive examples covered by R
    - until Pi is empty
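Below is a minimal Python sketch of this covering loop. The learn_one_rule(positives, negatives) helper is a hypothetical placeholder for the AQ rule-construction step; it is assumed to return a rule together with the positive examples it covers.

```python
# Sketch of the basic covering loop (AQ-style); illustrative, not the original implementation.
def covering(examples, classes, learn_one_rule):
    """examples: list of (attributes, class_label) pairs."""
    rule_base = {}
    for ci in classes:
        positives = [e for e in examples if e[1] == ci]
        negatives = [e for e in examples if e[1] != ci]
        rules = []
        while positives:                              # until Pi is empty
            # learn_one_rule is assumed to return (rule, covered_positives),
            # where the rule covers some positives and no negatives
            rule, covered = learn_one_rule(positives, negatives)
            if not covered:                           # nothing more can be covered
                break
            rules.append(rule)
            positives = [e for e in positives if e not in covered]
        rule_base[ci] = rules
    return rule_base
```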
30 Learning an unordered set of rules (CN2, Clark and Niblett)
- RuleBase := empty
- for each class Ci do
  - Ei := Pi ∪ Ni, RuleSet(Ci) := empty
  - repeat {learn-set-of-rules}
    - R := "Class = Ci IF Conditions", Conditions := true
    - repeat {learn-one-rule} R := "Class = Ci IF Conditions AND Cond" (general-to-specific beam search for the best R)
    - until stopping criterion is satisfied (no negatives covered or Performance(R) < ThresholdR)
    - add R to RuleSet(Ci)
    - delete from Pi all positive examples covered by R
  - until stopping criterion is satisfied (all positives covered or Performance(RuleSet(Ci)) < ThresholdRS)
  - RuleBase := RuleBase ∪ RuleSet(Ci)
31 Learn-one-rule: greedy vs. beam search
- learn-one-rule by greedy general-to-specific search: at each step select the best descendant, no backtracking
- Beam search: maintain a list of k best candidates; at each step, descendants (specializations) of each of these k candidates are generated, and the resulting set is again reduced to the k best candidates (see the sketch below)
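As an illustration of the beam-search variant, the sketch below refines rule bodies (sets of attribute=value tests) general-to-specific and keeps the k best candidates at each step, scored here with a Laplace estimate. The function name and the attribute-dictionary representation are illustrative assumptions, not CN2's actual data structures.

```python
def learn_one_rule_beam(positives, negatives, attributes, beam_width=5):
    """General-to-specific beam search for one rule body (a dict of attribute=value tests).
    positives / negatives: lists of attribute dicts. Illustrative sketch, not CN2 itself."""
    def covers(body, example):
        return all(example.get(a) == v for a, v in body.items())
    def score(body):                                   # Laplace estimate for two classes
        p = sum(covers(body, e) for e in positives)
        n = sum(covers(body, e) for e in negatives)
        return (p + 1) / (p + n + 2)
    beam, best = [dict()], dict()                      # start from the most general rule: 'true'
    while beam:
        candidates = []
        for body in beam:
            for attr, values in attributes.items():
                if attr in body:
                    continue
                for v in values:
                    candidates.append(dict(body, **{attr: v}))   # add one condition
        if not candidates:
            break
        candidates.sort(key=score, reverse=True)
        beam = candidates[:beam_width]                 # keep the k best specializations
        if score(beam[0]) > score(best):
            best = beam[0]
        else:
            break                                      # stop when refinement no longer helps
    return best
```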
32 Illustrative example: Contact lenses data
33 Learn-one-rule as heuristic search
[Search-space figure for the contact lenses data; class distributions are given as [S, H, N]:]
- Lenses=hard IF true
  - Lenses=hard IF Astigmatism=no [S=5, H=0, N=7]
  - Lenses=hard IF Astigmatism=yes [S=0, H=4, N=8]
  - Lenses=hard IF Tear prod.=reduced [S=0, H=0, N=12]
  - Lenses=hard IF Tear prod.=normal [S=5, H=4, N=3]
    - Lenses=hard IF Tear prod.=normal AND Spect. pre.=myope [S=2, H=3, N=1]
    - Lenses=hard IF Tear prod.=normal AND Spect. pre.=hypermetrope [S=3, H=1, N=2]
    - Lenses=hard IF Tear prod.=normal AND Astigmatism=yes [S=0, H=4, N=2]
    - Lenses=hard IF Tear prod.=normal AND Astigmatism=no [S=5, H=0, N=1]
34 Rule learning summary
- Hypothesis construction: find a set of n rules
  - usually simplified by n separate rule constructions
- Rule construction: find a pair (Class, Cond)
  - select the rule head (class) and construct the rule body, or
  - construct the rule body and assign the rule head (in ordered algorithms)
- Body construction: find a set of m features
  - usually simplified by adding one feature at a time to the rule body
35 Talk outline
- Predictive vs. Descriptive DM
- Predictive rule induction
- Classification vs. estimation
- Classification rule induction
- Heuristics and rule quality evaluation
- Descriptive rule induction
- Predictive vs. Descriptive DM summary
36 Evaluating rules and rulesets
- Predictive evaluation measures: maximizing accuracy, minimizing Error = 1 - Accuracy, avoiding overfitting
- Estimating accuracy: percentage of correct classifications
  - on the training set
  - on unseen / testing instances
  - cross validation, leave-one-out, ...
- Other measures: comprehensibility (size), information contents (information score), significance, ...
- Other measures of rule interestingness for descriptive induction
37 n-fold cross validation
- A method for accuracy estimation of classifiers
- Partition the set D into n disjoint, almost equally-sized folds Ti, where ∪i Ti = D
- for i = 1, ..., n do
  - form a training set out of n-1 folds: Di = D\Ti
  - induce classifier Hi from the examples in Di
  - use fold Ti for testing the accuracy of Hi
- Estimate the accuracy of the classifier by averaging the accuracies over the n folds Ti
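A short Python sketch of the procedure, assuming a hypothetical induce(train) function that returns a classifier with a predict(x) method; the fold construction and the interface are illustrative.

```python
import random

def cross_validate(data, induce, n=10, seed=0):
    """Estimate classifier accuracy by n-fold cross validation.
    data: list of (x, y) pairs; induce(train) must return an object with predict(x)."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    folds = [shuffled[i::n] for i in range(n)]               # n roughly equal-sized folds Ti
    accuracies = []
    for i, test in enumerate(folds):
        train = [e for j, fold in enumerate(folds) if j != i for e in fold]   # Di = D \ Ti
        model = induce(train)
        correct = sum(model.predict(x) == y for x, y in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / n                               # average accuracy over the n folds
```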
38-41 [Illustration of 3-fold cross validation: the data set D is partitioned into folds T1, T2, T3; in each iteration a classifier is induced from Di = D\Ti and tested on the held-out fold Ti.]
42 Overfitting and accuracy
- Typical relation between hypothesis size and accuracy
- Question: how to prune optimally?
43 Overfitting
- Consider the error of hypothesis h over
  - training data T: ErrorT(h)
  - the entire distribution D of the data: ErrorD(h)
- Hypothesis h ∈ H overfits training data T if there is an alternative hypothesis h' ∈ H such that
  - ErrorT(h) < ErrorT(h'), and
  - ErrorD(h) > ErrorD(h')
- Prune a hypothesis (decision tree, ruleset) to avoid overfitting T
44 Avoiding overfitting
- Decision trees
  - Pre-pruning (forward pruning): stop growing the tree, e.g., when a data split is not statistically significant or too few examples are in a split
  - Post-pruning: grow the full tree, then post-prune
- Rulesets
  - Pre-pruning (forward pruning): stop growing the rule, e.g., when too few examples are covered by the rule
  - Post-pruning: construct a full ruleset, then prune
45 Rule post-pruning (Quinlan 1993)
- Very frequently used method, e.g., in C4.5
- Procedure
- grow a full tree (allowing overfitting)
- convert the tree to an equivalent set of rules
- prune each rule independently of others
- sort final rules into a desired sequence for use
46 Performance metrics
- Rule evaluation measures - aimed at avoiding overfitting
  - Heuristics for guiding the search
  - Heuristics for stopping the search
- Confusion matrix / contingency table for the evaluation of individual rules and for ruleset evaluation
- Area under ROC evaluation (employing the confusion matrix information)
47 Learn-one-rule: PlayTennis training examples
48 Learn-one-rule as search: PlayTennis example
[Search-space figure:]
- Play tennis = yes IF true
  - Play tennis = yes IF Wind=weak
  - Play tennis = yes IF Wind=strong
  - Play tennis = yes IF Humidity=high
  - Play tennis = yes IF Humidity=normal
    - Play tennis = yes IF Humidity=normal, Wind=weak
    - Play tennis = yes IF Humidity=normal, Wind=strong
    - Play tennis = yes IF Humidity=normal, Outlook=rain
    - Play tennis = yes IF Humidity=normal, Outlook=sunny
49 Learn-one-rule as heuristic search: PlayTennis example
[Search-space figure with class distributions [pos+, neg-] (coverage):]
- Play tennis = yes IF true [9+, 5-] (14)
  - Play tennis = yes IF Wind=weak [6+, 2-] (8)
  - Play tennis = yes IF Wind=strong [3+, 3-] (6)
  - Play tennis = yes IF Humidity=high [3+, 4-] (7)
  - Play tennis = yes IF Humidity=normal [6+, 1-] (7)
    - Play tennis = yes IF Humidity=normal, Wind=weak
    - Play tennis = yes IF Humidity=normal, Wind=strong
    - Play tennis = yes IF Humidity=normal, Outlook=rain
    - Play tennis = yes IF Humidity=normal, Outlook=sunny [2+, 0-] (2)
50 Heuristics for learn-one-rule: PlayTennis example
- PlayTennis = yes [9+, 5-] (14)
- PlayTennis = yes ← Wind=weak [6+, 2-] (8)
  PlayTennis = yes ← Wind=strong [3+, 3-] (6)
  PlayTennis = yes ← Humidity=normal [6+, 1-] (7)
- PlayTennis = yes ← Humidity=normal ∧ Outlook=sunny [2+, 0-] (2)
- Estimating accuracy with a probability:
  - A(Ci ← Cond) = p(Ci | Cond)
- Estimating the probability with the relative frequency:
  - covered positive examples / all covered examples
  - [6+, 1-] (7): 6/7; [2+, 0-] (2): 2/2 = 1
51 Probability estimates
- Relative frequency of covered positive examples:
  - p(Cl | Cond) = n(Cl.Cond) / n(Cond)
  - problems with small samples
- Laplace estimate:
  - p(Cl | Cond) = (n(Cl.Cond) + 1) / (n(Cond) + k)
  - assumes a uniform prior distribution of the k classes
- m-estimate:
  - p(Cl | Cond) = (n(Cl.Cond) + m · pa(Cl)) / (n(Cond) + m)
  - special case: pa(Cl) = 1/k, m = k gives the Laplace estimate
  - takes into account the prior probabilities pa(Cl) instead of the uniform distribution
  - independent of the number of classes k
  - m is domain dependent (more noise, larger m)
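The three estimates written out as code, using n(Cl.Cond) (covered examples of the class) and n(Cond) (all covered examples); the function names and the worked example are illustrative.

```python
def relative_frequency(n_class_cond, n_cond):
    # p(Cl | Cond) ~ covered positive examples / all covered examples
    return n_class_cond / n_cond

def laplace(n_class_cond, n_cond, k):
    # assumes a uniform prior distribution over the k classes
    return (n_class_cond + 1) / (n_cond + k)

def m_estimate(n_class_cond, n_cond, prior, m):
    # prior = pa(Cl); with prior = 1/k and m = k this reduces to the Laplace estimate
    return (n_class_cond + m * prior) / (n_cond + m)

# Example: a rule covering 6 examples, 5 of them in the target class, k = 3 classes
print(relative_frequency(5, 6), laplace(5, 6, k=3), m_estimate(5, 6, prior=1/3, m=3))
```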
52 Learn-one-rule: search heuristics
- Assume two classes (+, -) and learn rules for the + class (Cl). Search for specializations R' of one rule R: Cl ← Cond from the RuleBase.
- Expected classification accuracy: A(R) = p(Cl | Cond)
- Informativity (information needed to specify that an example covered by Cond belongs to Cl): I(R) = -log2 p(Cl | Cond)
- Accuracy gain (increase in expected accuracy): AG(R', R) = p(Cl | Cond') - p(Cl | Cond)
- Information gain (decrease in the information needed): IG(R', R) = log2 p(Cl | Cond') - log2 p(Cl | Cond)
- Weighted measures favoring more general rules: WAG, WIG
  - WAG(R', R) = p(Cond')/p(Cond) · (p(Cl | Cond') - p(Cl | Cond))
- Weighted relative accuracy trades off coverage and relative accuracy: WRAcc(R) = p(Cond) · (p(Cl | Cond) - pa(Cl))
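A compact sketch of how these heuristics look when the probabilities are estimated by relative frequencies over example counts; the function and argument names are illustrative.

```python
from math import log2

def rule_heuristics(p_covered, n_covered, P, N):
    """Heuristics for a rule Cl <- Cond estimated from counts.
    p_covered / n_covered: covered positives / negatives; P / N: all positives / negatives."""
    covered, total = p_covered + n_covered, P + N
    acc = p_covered / covered                        # A(R) = p(Cl | Cond)
    info = -log2(acc) if acc > 0 else float("inf")   # I(R) = -log2 p(Cl | Cond)
    coverage = covered / total                       # p(Cond)
    wracc = coverage * (acc - P / total)             # WRAcc(R) = p(Cond)(p(Cl|Cond) - p(Cl))
    return {"accuracy": acc, "informativity": info, "coverage": coverage, "wracc": wracc}

# PlayTennis example: Humidity=normal covers [6+, 1-] out of [9+, 5-]
print(rule_heuristics(6, 1, P=9, N=5))
```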
53 What is high accuracy?
- Rule accuracy should be traded off against the default accuracy of the rule Cl ← true
  - 68% accuracy is OK if the class covers 20% of the examples in the training set, but bad if it covers 80%
- Relative accuracy:
  - RAcc(Cl ← Cond) = p(Cl | Cond) - p(Cl)
54 Weighted relative accuracy
- If a rule covers a single example, its accuracy is either 0% or 100%
  - maximizing relative accuracy tends to produce many overly specific rules
- Weighted relative accuracy:
  - WRAcc(Cl ← Cond) = p(Cond) · (p(Cl | Cond) - p(Cl))
55 Weighted relative accuracy
- WRAcc is a fundamental rule evaluation measure
- WRAcc can be used if you want to assess both accuracy and significance
- WRAcc can be used if you want to compare rules with different heads and bodies - an appropriate measure for use in descriptive induction, e.g., association rule learning
56 Talk outline
- Predictive vs. Descriptive DM
- Predictive rule induction
- Classification vs. estimation
- Classification rule induction
- Heuristics and rule quality evaluation
- Descriptive rule induction
- Subgroup discovery
- Association rule learning
- Predictive vs. Descriptive DM summary
57 Descriptive DM
- Often used for preliminary data analysis
- The user gets a feel for the data and its structure
- Aims at deriving descriptions of characteristics of the data
- Visualization and descriptive statistical techniques can be used
58 Descriptive DM
- Description
  - Data description and summarization: describe elementary and aggregated data characteristics (statistics, ...)
- Dependency analysis
  - describe associations, dependencies, ...
  - discovery of properties and constraints
- Segmentation
  - Clustering: separate objects into subsets according to distance and/or similarity (clustering, SOM, visualization, ...)
  - Subgroup discovery: find unusual subgroups that are significantly different from the majority (deviation detection w.r.t. the overall class distribution)
59 Subgroup Discovery
- Given: a population of individuals and a property of individuals we are interested in
- Find: population subgroups that are statistically "most interesting", e.g., are as large as possible and have the most unusual statistical (distributional) characteristics w.r.t. the property of interest
60 Subgroup interestingness
- Interestingness criteria:
  - As large as possible
  - Class distribution as different as possible from the distribution in the entire data set
  - Significant
  - Surprising to the user
  - Non-redundant
  - Simple
  - Useful - actionable
61 Classification Rule Learning for Subgroup Discovery: Deficiencies
- Only the first few rules induced by the covering algorithm have sufficient support (coverage)
- Subsequent rules are induced from smaller and strongly biased example subsets (positive examples not covered by previously induced rules), which hinders their ability to detect population subgroups
- Ordered rules are induced and interpreted sequentially as an if-then-else decision list
62 CN2-SD: Adapting CN2 Rule Learning to Subgroup Discovery
- Weighted covering algorithm
- Weighted relative accuracy (WRAcc) search heuristic, with added example weights
- Probabilistic classification
- Evaluation with different interestingness measures
63 CN2-SD: CN2 Adaptations
- General-to-specific (beam) search for the best rules
- Rule quality measure:
  - CN2: Laplace: Acc(Class ← Cond) = p(Class | Cond) = (nc + 1) / (nrule + k)
  - CN2-SD: Weighted Relative Accuracy: WRAcc(Class ← Cond) = p(Cond) · (p(Class | Cond) - p(Class))
- Weighted covering approach (example weights)
- Significance testing (likelihood ratio statistic)
- Output: unordered rule sets (probabilistic classification)
64 CN2-SD: Weighted Covering
- Standard covering approach: covered examples are deleted from the current training set
- Weighted covering approach:
  - weights are assigned to examples
  - covered positive examples are re-weighted in all covering loop iterations: store a count i of how many times (with how many rules induced so far) a positive example has been covered: w(e, i), w(e, 0) = 1
65 CN2-SD: Weighted Covering
- Additive weights: w(e, i) = 1/(i + 1), where w(e, i) is the weight of a positive example e covered i times
- Multiplicative weights: w(e, i) = γ^i, 0 < γ < 1
  - note: γ = 1 → find the same (first) rule again and again; γ = 0 → behaves as standard CN2
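A sketch of the weighted covering loop with both weighting schemes. The find_best_rule(examples, weights, target_class) helper is a hypothetical placeholder for the weighted-WRAcc rule search; it is assumed to return the best rule together with the indices of the examples it covers.

```python
def weighted_covering(examples, target_class, find_best_rule,
                      max_rules=10, scheme="additive", gamma=0.9):
    """CN2-SD style weighted covering (sketch). examples: list of (x, y) pairs."""
    weights = [1.0] * len(examples)                   # w(e, 0) = 1
    counts = [0] * len(examples)                      # i = how many rules cover e so far
    rules = []
    for _ in range(max_rules):
        rule, covered = find_best_rule(examples, weights, target_class)
        if not covered:
            break
        rules.append(rule)
        for idx in covered:
            if examples[idx][1] == target_class:      # only covered positives are re-weighted
                counts[idx] += 1
                if scheme == "additive":
                    weights[idx] = 1.0 / (counts[idx] + 1)   # w(e, i) = 1/(i + 1)
                else:
                    weights[idx] = gamma ** counts[idx]      # w(e, i) = gamma^i
    return rules
```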
66 CN2-SD: Weighted WRAcc Search Heuristic
- Weighted relative accuracy (WRAcc) search heuristic, with added example weights
- WRAcc(Cl ← Cond) = p(Cond) · (p(Cl | Cond) - p(Cl))
- Increased coverage, decreased number of rules, approximately equal accuracy (PKDD-2000)
67 CN2-SD: Weighted WRAcc Search Heuristic
- In the WRAcc computation, probabilities are estimated with relative frequencies, adapted to example weights:
- WRAcc(Cl ← Cond) = p(Cond) · (p(Cl | Cond) - p(Cl)) = n(Cond)/N · (n(Cl.Cond)/n(Cond) - n(Cl)/N)
  - N: sum of the weights of all examples
  - n(Cond): sum of the weights of all covered examples
  - n(Cl.Cond): sum of the weights of all correctly covered examples
68 Probabilistic classification
- Unlike the ordered case of standard CN2, where rules are interpreted in an IF-THEN-ELSE fashion, in the unordered case and in CN2-SD all rules are tried and all rules that fire are collected
- If a clash occurs, a probabilistic method is used to resolve the clash
69 Probabilistic classification
- A simplified example:
  - class=bird ← legs=2 ∧ feathers=yes [13, 0]
  - class=elephant ← size=large ∧ flies=no [2, 10]
  - class=bird ← beak=yes [20, 0]
  - summed distribution: [35, 10]
- A two-legged, feathered, large, non-flying animal with a beak?
- bird!
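The example can be reproduced with a few lines of Python; the rule representation (a condition function plus a class-count dictionary) is an illustrative assumption, not CN2-SD's internal format.

```python
def classify(rules, example):
    """rules: list of (condition, class_counts); condition(example) -> bool,
    class_counts maps each class to the number of training examples the rule covers."""
    votes = {}
    for condition, counts in rules:
        if condition(example):                        # collect all rules that fire
            for cls, n in counts.items():
                votes[cls] = votes.get(cls, 0) + n
    return max(votes, key=votes.get) if votes else None   # summed distribution decides

# The simplified example from the slide (counts are for bird / elephant):
rules = [
    (lambda e: e["legs"] == 2 and e["feathers"] == "yes", {"bird": 13, "elephant": 0}),
    (lambda e: e["size"] == "large" and e["flies"] == "no", {"bird": 2, "elephant": 10}),
    (lambda e: e["beak"] == "yes", {"bird": 20, "elephant": 0}),
]
animal = {"legs": 2, "feathers": "yes", "size": "large", "flies": "no", "beak": "yes"}
print(classify(rules, animal))                        # -> 'bird' (35 votes vs. 10)
```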
70 Talk outline
- Predictive vs. Descriptive DM
- Predictive rule induction
- Classification vs. estimation
- Classification rule induction
- Heuristics and rule quality evaluation
- Descriptive rule induction
- Subgroup discovery
- Association rule learning
- Predictive vs. Descriptive DM summary
71 Association Rule Learning
- Rules: X => Y, i.e., if X then Y
  - X, Y: itemsets (records, conjunctions of items), where items/features are binary-valued attributes
- Transactions are itemsets (records) over items i1, i2, ..., i50, e.g., t1 = (1, 1, 1, ..., 0), t2 = (1, 0, ...), ...
- Example - market basket analysis:
  - peanuts & chips => beer & coke (0.05, 0.65)
- Support: Sup(X,Y) = |XY| / |D| = p(XY)
- Confidence: Conf(X,Y) = |XY| / |X| = Sup(X,Y) / Sup(X) = p(XY) / p(X) = p(Y|X)
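Support and confidence computed over a small 0/1 transaction table; the data and helper names are illustrative.

```python
def support(transactions, itemset):
    """Fraction of transactions containing every item in the itemset."""
    hits = sum(all(t.get(item, 0) for item in itemset) for t in transactions)
    return hits / len(transactions)

def confidence(transactions, x, y):
    """Conf(X => Y) = Sup(X, Y) / Sup(X)."""
    return support(transactions, x | y) / support(transactions, x)

# Toy market-basket data: each transaction is a dict of item -> 0/1
transactions = [
    {"peanuts": 1, "chips": 1, "beer": 1, "coke": 1},
    {"peanuts": 1, "chips": 1, "beer": 0, "coke": 0},
    {"peanuts": 0, "chips": 1, "beer": 1, "coke": 0},
    {"peanuts": 1, "chips": 0, "beer": 1, "coke": 1},
]
print(support(transactions, {"peanuts", "chips"}))                       # Sup of {peanuts, chips}
print(confidence(transactions, {"peanuts", "chips"}, {"beer", "coke"}))  # Conf({peanuts, chips} => {beer, coke})
```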
72 Association Rule Learning
- Given: a set of transactions D
- Find: all association rules that hold on the set of transactions with support > MinSup and confidence > MinConf
- Procedure:
  - find all large itemsets Z with Sup(Z) > MinSup
  - split every large itemset Z into XY and compute Conf(X,Y) = Sup(X,Y) / Sup(X); if Conf(X,Y) > MinConf then output X => Y (Sup(X,Y) > MinSup, as XY is large)
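A brute-force sketch of this procedure (enumerate all itemsets above MinSup, then split each one into X => Y and keep the confident rules). A real APRIORI implementation prunes candidates level-wise instead of enumerating everything; the data and thresholds here are illustrative.

```python
from itertools import combinations

def association_rules(transactions, items, min_sup=0.3, min_conf=0.7):
    """Brute-force version of the procedure above (no level-wise candidate pruning)."""
    def sup(itemset):
        return sum(all(i in t for i in itemset) for t in transactions) / len(transactions)
    # 1. find all 'large' (frequent) itemsets Z with Sup(Z) >= MinSup
    large = [frozenset(z) for r in range(1, len(items) + 1)
             for z in combinations(items, r) if sup(z) >= min_sup]
    # 2. split every large itemset Z into X => Y and test the confidence
    rules = []
    for z in large:
        for r in range(1, len(z)):
            for x in map(frozenset, combinations(z, r)):
                y = z - x
                conf = sup(z) / sup(x)
                if conf >= min_conf:
                    rules.append((set(x), set(y), sup(z), conf))
    return rules

# Toy transactions, each a set of purchased items
transactions = [{"peanuts", "chips", "beer"}, {"peanuts", "chips"},
                {"chips", "beer", "coke"}, {"peanuts", "beer", "coke"}]
for x, y, s, c in association_rules(transactions, ["peanuts", "chips", "beer", "coke"]):
    print(x, "=>", y, f"(sup={s:.2f}, conf={c:.2f})")
```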
73 Induced association rules
- Age ≤ 52 ∧ BigSpender = no => Gender = male
- Age ≤ 52 ∧ BigSpender = no => Gender = male ∧ Income ≤ 73250
- Gender = male ∧ Age ≤ 52 ∧ Income ≤ 73250 => BigSpender = no
- ...
74 Association Rule Learning for Classification: APRIORI-C
- Simplified APRIORI-C:
  - Discretise numeric attributes; for each discrete attribute with N values, create N items
  - Run APRIORI
  - Collect the rules whose right-hand side consists of a single target item, representing a value of the target attribute
75 Association Rule Learning for Classification: APRIORI-C
- Improvements:
  - Creating rules Class ← Conditions during the search
  - Pruning of irrelevant items and itemsets
  - Pre-processing: feature subset selection
  - Post-processing: rule subset selection
76 Association Rule Learning for Subgroup Discovery: Advantages
- May be used to create rules of the form Class ← Conditions
- Each rule is an independent chunk of knowledge, with
  - high support and coverage: p(Class.Cond) > MinSup, p(Cond) > MinSup
  - high confidence: p(Class | Cond) > MinConf
- All interesting rules are found (complete search)
- Builds small and easy-to-understand classifiers
- Appropriate for unbalanced class distributions
77 Association Rule Learning for Subgroup Discovery: APRIORI-SD
- Further improvements:
  - Create a set of rules Class ← Conditions with APRIORI-C (advantage: an exhaustive set of rules above the MinConf and MinSup thresholds)
  - Order the set of induced rules by decreasing WRAcc
  - Post-process: rule subset selection by a weighted covering approach
    - Take the best rule w.r.t. WRAcc
    - Decrease the weights of covered examples
    - Reorder the remaining rules and repeat until a stopping criterion is satisfied:
      - significance threshold
      - WRAcc threshold
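A sketch of this post-processing step in Python. Rules are assumed to be objects exposing a covers(x) predicate, and the WRAcc threshold is used as the stopping criterion; both the interface and the parameter values are illustrative.

```python
def select_rule_subset(rules, examples, target_class, gamma=0.9, max_rules=5):
    """Weighted-covering selection from an APRIORI-C rule set (sketch).
    rules: objects with covers(x); examples: list of (x, y) pairs."""
    weights = [1.0] * len(examples)

    def weighted_wracc(rule):                          # WRAcc over sums of example weights
        n_total = sum(weights)
        n_cond = sum(w for (x, _), w in zip(examples, weights) if rule.covers(x))
        n_cl = sum(w for (_, y), w in zip(examples, weights) if y == target_class)
        n_cl_cond = sum(w for (x, y), w in zip(examples, weights)
                        if rule.covers(x) and y == target_class)
        if n_cond == 0:
            return 0.0
        return (n_cond / n_total) * (n_cl_cond / n_cond - n_cl / n_total)

    selected, remaining = [], list(rules)
    while remaining and len(selected) < max_rules:
        best = max(remaining, key=weighted_wracc)      # take the best rule w.r.t. WRAcc
        if weighted_wracc(best) <= 0:                  # WRAcc threshold as stopping criterion
            break
        selected.append(best)
        remaining.remove(best)
        for idx, (x, y) in enumerate(examples):        # decrease weights of covered examples
            if best.covers(x) and y == target_class:
                weights[idx] *= gamma
    return selected
```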
78 Talk outline
- Predictive vs. Descriptive DM
- Predictive rule induction
- Classification vs. estimation
- Classification rule induction
- Heuristics and rule quality evaluation
- Descriptive rule induction
- Predictive vs. Descriptive DM summary
79 Predictive vs. descriptive induction: Summary
- Predictive induction: induces rulesets acting as classifiers for solving classification and prediction tasks
  - Rules are induced from labeled instances
- Descriptive induction: discovers individual rules describing interesting regularities in the data
  - Rules are induced from unlabeled instances
- Exception: subgroup discovery
  - Discovers individual rules describing interesting regularities in the data, induced from labeled examples
80 Rule induction: Literature
- P. Flach and N. Lavrac: Rule Induction, chapter in the book Intelligent Data Analysis (M. Berthold and D. Hand, eds.), Springer
- See the references to other sources in this book chapter