Title: Mining Generalized Association Rules (Ramakrishnan Srikant, Rakesh Agrawal)
1. Mining Generalized Association Rules
Ramakrishnan Srikant, Rakesh Agrawal
- Data Mining Seminar, spring semester, 2003
- Prof. Amos Fiat
- Student Idit Haran
2. Outline
- Motivation
- Terms and definitions
- Interest measure
- Algorithms for mining generalized association rules
- Comparison
- Conclusions
3. Motivation
- We would like to find association rules of the form: Diapers → Beer
- There are different kinds of diapers: Huggies/Pampers, S/M/L, etc.
- There are different kinds of beers: Heineken/Maccabi, in a bottle/in a can, etc.
- The information on the bar-code is of the type: Huggies Diapers, M → Heineken Beer in a bottle
- Such a bar-code-level rule is not interesting, and probably will not have minimum support.
4. Taxonomy
(figure: example taxonomy - Jacket is a kind of Outerwear, which is a kind of Clothes; Shoes and Hiking Boots are kinds of Footwear)
5. Taxonomy - Example
- Let's say we found the rule Outerwear → Hiking Boots with minimum support and confidence.
- The rule Jackets → Hiking Boots may not have minimum support.
- The rule Clothes → Hiking Boots may not have minimum confidence.
6. Taxonomy
- Users are interested in generating rules that span different levels of the taxonomy.
- Rules at lower levels may not have minimum support.
- The taxonomy can be used to prune uninteresting or redundant rules.
- Multiple taxonomies may be present, for example: category, price (cheap/expensive), items-on-sale, etc.
- Multiple taxonomies may be modeled as a forest, or as a DAG.
7. Notations
8. Notations
- I = {i1, i2, ..., im} - the set of items.
- T - a transaction, a set of items T ⊆ I (we expect the items in T to be leaves in the taxonomy).
- D - the set of transactions.
- T supports item x if x is in T, or x is an ancestor of some item in T.
- T supports X ⊆ I if it supports every item in X.
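A minimal sketch of these definitions in Python (the toy taxonomy, item names, and helper names below are illustrative, not from the paper):

```python
# Toy taxonomy: child -> parent (a forest; a DAG would map each child to a set of parents).
PARENT = {
    "Jacket": "Outerwear",
    "Outerwear": "Clothes",
    "Shoes": "Footwear",
    "Hiking Boots": "Footwear",
}

def ancestors(item):
    """All proper ancestors of an item in the taxonomy."""
    result = set()
    while item in PARENT:
        item = PARENT[item]
        result.add(item)
    return result

def supports_item(transaction, x):
    """T supports item x if x is in T or x is an ancestor of some item in T."""
    return x in transaction or any(x in ancestors(i) for i in transaction)

def supports_itemset(transaction, itemset):
    """T supports X if it supports every item in X."""
    return all(supports_item(transaction, x) for x in itemset)
```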
9. Notations
- A generalized association rule is X → Y, where X ⊂ I, Y ⊂ I, X ∩ Y = ∅, and no item in Y is an ancestor of any item in X.
- The rule X → Y has confidence c in D if c% of the transactions in D that support X also support Y.
- The rule X → Y has support s in D if s% of the transactions in D support X ∪ Y.
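Building on the helpers above, a sketch of the support and confidence of a generalized rule (the function names and the example data are mine, not the paper's):

```python
def support(itemset, transactions):
    """Fraction of transactions that support the itemset (ancestors taken into account)."""
    return sum(supports_itemset(t, itemset) for t in transactions) / len(transactions)

def rule_support_confidence(X, Y, transactions):
    """Support and confidence of the generalized rule X -> Y."""
    sup_xy = support(X | Y, transactions)
    sup_x = support(X, transactions)
    return sup_xy, (sup_xy / sup_x if sup_x > 0 else 0.0)

# Example with the toy taxonomy above:
D = [{"Jacket", "Hiking Boots"}, {"Shoes"}, {"Jacket", "Shoes"}]
print(rule_support_confidence({"Outerwear"}, {"Hiking Boots"}, D))  # (0.333..., 0.5)
```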
10. Problem Statement
- Find all generalized association rules whose support and confidence are greater than the user-specified minimum support (called minsup) and minimum confidence (called minconf), respectively.
11. Example
12. Example
minsup = 30%, minconf = 60%
13. Observation 1
- If the set {x, y} has minimum support, so do {x̂, y}, {x, ŷ} and {x̂, ŷ} (where x̂ is an ancestor of x and ŷ is an ancestor of y).
- For example: if {Jacket, Shoes} has minsup, so will {Outerwear, Shoes}, {Jacket, Footwear}, and {Outerwear, Footwear}.
14. Observation 2
- If the rule x → y has minimum support and confidence, only x → ŷ is guaranteed to have both minsup and minconf.
- Example: the rule Outerwear → Hiking Boots has minsup and minconf.
- Then the rule Outerwear → Footwear also has both minsup and minconf.
15. Observation 2 (cont.)
- However, while the rules x̂ → y and x̂ → ŷ will have minsup, they may not have minconf.
- For example: the rules Clothes → Hiking Boots and Clothes → Footwear have minsup, but not necessarily minconf.
16. Interesting Rules - Previous Work
- A rule X → Y is not interesting if support(X → Y) ≈ support(X) × support(Y).
- Previous work does not consider the taxonomy.
- This interest measure pruned less than 1% of the rules on a real database.
17. Interesting Rules - Using the Taxonomy
- Milk → Cereal (8% support, 70% confidence)
- Milk is the parent of Skim Milk, and 25% of the sales of Milk are Skim Milk.
- We therefore expect Skim Milk → Cereal to have 2% support and 70% confidence.
18. R-Interesting Rules
- A rule X → Y is R-interesting w.r.t. an ancestor X̂ → Ŷ if

  real support(X → Y) / expected support of (X → Y) based on (X̂ → Ŷ) > R

  or

  real confidence(X → Y) / expected confidence of (X → Y) based on (X̂ → Ŷ) > R

- With R = 1.1, about 40-55% of the rules were pruned.
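A worked check of the R-interest test using the Milk/Cereal numbers from the previous slide; this is a sketch under the assumption that expected support scales with the descendant's share of the ancestor's sales, and the 4% actual support below is a made-up value for illustration:

```python
def expected_support(ancestor_rule_support, x_share_of_xhat, y_share_of_yhat=1.0):
    """Expected support of X -> Y based on the ancestor rule Xhat -> Yhat:
    scale the ancestor rule's support by X's share of Xhat (and Y's share of Yhat)."""
    return ancestor_rule_support * x_share_of_xhat * y_share_of_yhat

def is_r_interesting(real_sup, exp_sup, real_conf, exp_conf, R=1.1):
    """R-interesting if real support OR real confidence exceeds R times the expectation."""
    return real_sup > R * exp_sup or real_conf > R * exp_conf

# Milk -> Cereal: 8% support, 70% confidence; Skim Milk accounts for 25% of Milk sales.
exp_sup = expected_support(0.08, 0.25)   # expected 2% support for Skim Milk -> Cereal
exp_conf = 0.70                          # expected confidence is unchanged
# Suppose Skim Milk -> Cereal actually has 4% support and 70% confidence:
print(is_r_interesting(0.04, exp_sup, 0.70, exp_conf))  # True, since 0.04 > 1.1 * 0.02
```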
19. Problem Statement (new)
- Find all generalized R-interesting association rules (R is a user-specified minimum interest, called min-interest) whose support and confidence are greater than minsup and minconf, respectively.
20. Algorithms - 3 Steps
- 1. Find all itemsets whose support is greater than minsup. These itemsets are called frequent itemsets.
- 2. Use the frequent itemsets to generate the desired rules: if ABCD and AB are frequent, then conf(AB → CD) = support(ABCD) / support(AB) (see the sketch after this list).
- 3. Prune all uninteresting rules from this set.
- All the presented algorithms implement only step 1.
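Although the algorithms that follow implement only step 1, the rule-generation formula of step 2 is easy to sketch in Python (assuming a dict that maps each frequent itemset to its support; the names are illustrative):

```python
from itertools import combinations

def generate_rules(freq_supports, minconf):
    """freq_supports: dict mapping frozenset(itemset) -> support.
    Yields (antecedent, consequent, support, confidence) for every rule meeting minconf."""
    for itemset, sup in freq_supports.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                # conf(A -> B) = support(A u B) / support(A); A is frequent by downward closure.
                conf = sup / freq_supports[antecedent]
                if conf >= minconf:
                    yield antecedent, itemset - antecedent, sup, conf
```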
22. Algorithms (step 1)
- Input: Database, Taxonomy
- Output: All frequent itemsets
- 3 algorithms (same output, different run time): Basic, Cumulate, EstMerge
23. Algorithm Basic - Main Idea
- Is itemset X frequent?
- Does transaction T support X? (X contains items from different levels of the taxonomy; T contains only leaves.)
- T' = T ∪ ancestors(T)
- Answer: T supports X ⇔ X ⊆ T'
24. Algorithm Basic
(annotated pseudocode; the main steps of each pass:)
- Count item occurrences
- Generate new candidate k-itemsets
- Add all ancestors of each item in t to t, removing any duplicates
- Find the support of all the candidates
- Take only those with support over minsup
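A compact sketch of Basic's pass structure in Python; it reuses ancestors() from the notation slides and apriori_gen() from the next slide, and takes minsup as a fraction (all names are mine, not the paper's):

```python
def basic(transactions, minsup):
    """Apriori-style passes: extend each transaction with all ancestors of its items,
    then count candidates by plain set containment."""
    n = len(transactions)
    extended = [set(t) | {a for i in t for a in ancestors(i)} for t in transactions]
    # Frequent 1-itemsets.
    counts = {}
    for t in extended:
        for i in t:
            counts[frozenset([i])] = counts.get(frozenset([i]), 0) + 1
    frequent = {c: v / n for c, v in counts.items() if v / n >= minsup}
    k, current = 2, set(frequent)
    while current:
        candidates = apriori_gen(current, k)                       # join + prune (next slide)
        counts = {c: sum(c <= t for t in extended) for c in candidates}
        current = {c for c, v in counts.items() if v / n >= minsup}
        frequent.update({c: counts[c] / n for c in current})
        k += 1
    return frequent
```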
25. Candidate Generation
- p and q are two frequent (k-1)-itemsets that are identical in their first k-2 items.
- Join: add the last item of q to p.
- Prune: check all (k-1)-subsets of each candidate; remove any candidate that has an infrequent subset.
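A sketch of this candidate-generation step (the standard Apriori join and prune), assuming the frequent (k-1)-itemsets are given as a set of frozensets:

```python
from itertools import combinations

def apriori_gen(frequent_k_minus_1, k):
    """Join two frequent (k-1)-itemsets that agree on their first k-2 items,
    then prune candidates that have an infrequent (k-1)-subset."""
    ordered = [tuple(sorted(s)) for s in frequent_k_minus_1]
    joined = set()
    for p in ordered:
        for q in ordered:
            # Join step: identical first k-2 items, p's last item smaller than q's last item.
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                joined.add(frozenset(p + (q[-1],)))
    # Prune step: every (k-1)-subset of a candidate must itself be frequent.
    return {c for c in joined
            if all(frozenset(s) in frequent_k_minus_1 for s in combinations(c, k - 1))}
```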
26. Optimization 1
- Filtering the ancestors added to transactions
- We only need to add to transaction t the ancestors that appear in one of the candidates.
- If the original item does not appear in any candidate itemset, it can be dropped from the transaction.
- Example: with the single candidate {Clothes, Shoes}, a transaction t containing Jacket can be replaced with one containing Clothes instead.
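A small sketch of this filtering, assuming candidate_items is the set of all items that occur in some current candidate (the helper names are mine):

```python
def extend_filtered(transaction, candidate_items):
    """Optimization 1: extend a transaction with ancestors, then keep only the items
    that appear in at least one candidate; everything else cannot affect any count."""
    extended = set(transaction) | {a for i in transaction for a in ancestors(i)}
    return extended & candidate_items

# candidate_items would typically be built as: {i for c in candidates for i in c}
```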
27. Optimization 2
- Pre-computing ancestors
- Rather than finding the ancestors of each item by traversing the taxonomy graph, we can pre-compute the ancestors of each item.
- At the same time, we can drop ancestors that are not contained in any of the candidates.
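A sketch of the pre-computation, assuming a child -> parent dictionary as in the toy taxonomy above (for a DAG the walk would follow all parents):

```python
def precompute_ancestors(parent, candidate_items=None):
    """Optimization 2: build an item -> set-of-ancestors table once, optionally
    keeping only ancestors that occur in some candidate."""
    table = {}
    for item in set(parent) | set(parent.values()):
        anc, cur = set(), item
        while cur in parent:          # walk up the taxonomy (tree/forest case)
            cur = parent[cur]
            anc.add(cur)
        if candidate_items is not None:
            anc &= candidate_items
        table[item] = anc
    return table
```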
28. Optimization 3
- Pruning itemsets containing an item and its ancestor
- If we have Jacket and Outerwear, we will get the candidate {Jacket, Outerwear}, which is not interesting:
- support(Jacket) = support({Jacket, Outerwear})
- Deleting {Jacket, Outerwear} at k = 2 ensures such a candidate will not arise at k > 2 (because of the prune step of the candidate generation method).
- Therefore, we only need to prune candidates containing an item and its ancestor at k = 2; in the following passes no candidate will include an item together with its ancestor.
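A sketch of this pruning at k = 2, using an ancestor table like the one pre-computed above:

```python
def prune_item_with_ancestor(candidates_2, ancestor_table):
    """Optimization 3: drop 2-itemsets that pair an item with one of its ancestors.
    Doing this once at k = 2 suffices: the prune step of candidate generation then
    keeps such pairs from reappearing inside larger candidates."""
    return {c for c in candidates_2
            if not any(b in ancestor_table.get(a, set()) for a in c for b in c)}
```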
29. Algorithm Cumulate (Basic plus optimizations 1-3)
30. Stratification
- Candidates: {Clothes, Shoes}, {Outerwear, Shoes}, {Jacket, Shoes}
- If {Clothes, Shoes} does not have minimum support, we don't need to count either {Outerwear, Shoes} or {Jacket, Shoes}.
- So we count in steps: step 1 - count {Clothes, Shoes}; if it has minsup, step 2 - count {Outerwear, Shoes}; if that has minsup, step 3 - count {Jacket, Shoes}.
31. Version 1: Stratify
- Depth of an itemset:
- Itemsets with no parents have depth 0.
- Otherwise: depth(X) = max({depth(X̂) | X̂ is a parent of X}) + 1 (see the sketch after this list).
- The algorithm:
- Count all itemsets C0 of depth 0.
- Delete candidates that are descendants of the itemsets in C0 that didn't have minsup.
- Count the remaining itemsets at depth 1 (C1).
- Delete candidates that are descendants of the itemsets in C1 that didn't have minsup.
- Count the remaining itemsets at depth 2 (C2), etc.
32. Tradeoff Optimizations
(figure: tradeoff between the number of candidates counted and the number of passes over the DB)
- Cumulate: counts all candidates in a single pass.
- Stratify: counts each depth on a different pass.
- Optimization 1: count multiple depths together from a certain level.
- Optimization 2: count more than 20% of the candidates per pass.
33. Version 2: Estimate
- Estimate the candidates' support using a sample.
- 1st pass (C'k):
- count the candidates that are expected to have minsup (those with at least 0.9 × minsup in the sample),
- and count the candidates whose parents are expected to have minsup.
- 2nd pass (C''k):
- count the children of the candidates in C'k that were not expected to have minsup.
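A sketch of the sampling idea only (not the full C'k/C''k bookkeeping), reusing supports_itemset() from the notation slides; the 0.9 factor is the slack mentioned above:

```python
import random

def expected_frequent_on_sample(candidates, transactions, sample_frac, minsup):
    """Estimate each candidate's support on a random sample and return those
    expected to be frequent (at least 0.9 * minsup in the sample)."""
    sample = random.sample(transactions, max(1, int(sample_frac * len(transactions))))
    expected = set()
    for c in candidates:
        est = sum(supports_itemset(t, c) for t in sample) / len(sample)
        if est >= 0.9 * minsup:
            expected.add(c)
    return expected
```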
34. Example for Estimate
35. Version 3: EstMerge
- Motivation: eliminate the 2nd pass of algorithm Estimate.
- Implementation: count these candidates of C''k together with the candidates of the next pass, C'k+1.
- Restriction: to create C'k+1 we must assume that all candidates in C''k have minsup.
- The tradeoff: extra candidates counted by EstMerge vs. the extra pass made by Estimate.
36. Algorithm EstMerge
37. Stratify - Variants
38. Size of Sample
39. Size of Sample
40. Performance Evaluation
- Compare the running time of the 3 algorithms: Basic, Cumulate and EstMerge
- On synthetic data:
- effect of each parameter on performance
- On real data:
- Supermarket Data
- Department Store Data
41. Synthetic Data Generation
42. Minimum Support
43. Number of Transactions
44. Fanout
45. Number of Items
46. Reality Check
- Supermarket Data
- 548,000 items
- Taxonomy: 4 levels, 118 roots
- 1.5 million transactions
- Average of 9.6 items per transaction
- Department Store Data
- 228,000 items
- Taxonomy: 7 levels, 89 roots
- 570,000 transactions
- Average of 4.4 items per transaction
47. Results
48. Conclusions
- Cumulate and EstMerge were 2 to 5 times faster than Basic on all synthetic datasets. On the supermarket database they were 100 times faster!
- EstMerge was 25-30% faster than Cumulate.
- Both EstMerge and Cumulate exhibit linear scale-up with the number of transactions.
49. Summary
- The use of a taxonomy is necessary for finding association rules between items at any level of the hierarchy.
- The obvious solution (algorithm Basic) is not very fast.
- New algorithms that exploit the taxonomy are much faster.
- We can use the taxonomy to prune uninteresting rules.