Title: Mining Generalized Association Rules (Ramakrishnan Srikant, Rakesh Agrawal)
Slide 1: Mining Generalized Association Rules
Ramakrishnan Srikant, Rakesh Agrawal
- Data Mining Seminar, Spring Semester 2003
- Prof. Amos Fiat
- Student: Idit Haran
Slide 2: Outline
- Motivation
- Terms and Definitions
- Interest Measure
- Algorithms for mining generalized association rules
- Comparison
- Conclusions
Slide 3: Motivation
- Find association rules of the form: Diapers → Beer
- There are different kinds of diapers: Huggies/Pampers, S/M/L, etc.
- There are different kinds of beers: Heineken/Maccabi, in a bottle/in a can, etc.
- The information on the bar-code is of the type: Huggies Diapers, M → Heineken Beer in a bottle
- This preliminary rule is not interesting, and probably will not have minimum support.
Slide 4: Taxonomy
(Figure: the example taxonomy used throughout. Clothes subsumes Outerwear and Shirts; Outerwear subsumes Jackets and Ski Pants; Footwear subsumes Shoes and Hiking Boots.)
Slide 5: Taxonomy: Example
- Say we found the rule Outerwear → Hiking Boots with minimum support and confidence.
- The rule Jackets → Hiking Boots may not have minimum support.
- The rule Clothes → Hiking Boots may not have minimum confidence.
Slide 6: Taxonomy
- Users are interested in generating rules that span different levels of the taxonomy.
- Rules at lower levels may not have minimum support.
- The taxonomy can be used to prune uninteresting or redundant rules.
- Multiple taxonomies may be present, for example: category, price (cheap/expensive), items-on-sale, etc.
- Multiple taxonomies may be modeled as a forest, or as a DAG.
Slide 7: Notations
Slide 8: Notations
- I = {i1, i2, ..., im}: the set of items.
- T: a transaction, a set of items with T ⊆ I (we expect the items in T to be leaves of the taxonomy).
- D: the set of transactions.
- T supports an item x if x is in T or x is an ancestor of some item in T.
- T supports X ⊆ I if it supports every item in X.
Slide 9: Notations
- A generalized association rule X → Y: X ⊂ I, Y ⊂ I, X ∩ Y = ∅, and no item in Y is an ancestor of any item in X.
- The rule X → Y has confidence c in D if c% of the transactions in D that support X also support Y.
- The rule X → Y has support s in D if s% of the transactions in D support X ∪ Y.
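These definitions translate directly into code. Below is a minimal sketch (my own illustration, not from the paper), storing the taxonomy as a child-to-parents dict; the function names and data layout are assumptions:

```python
# A minimal sketch of the support definitions; names are illustrative.
TAXONOMY = {
    "Jacket": ["Outerwear"], "Ski Pants": ["Outerwear"],
    "Outerwear": ["Clothes"], "Shirt": ["Clothes"],
    "Shoes": ["Footwear"], "Hiking Boots": ["Footwear"],
}

def ancestors(item):
    """All ancestors of an item in the taxonomy DAG."""
    found, stack = set(), list(TAXONOMY.get(item, []))
    while stack:
        parent = stack.pop()
        if parent not in found:
            found.add(parent)
            stack.extend(TAXONOMY.get(parent, []))
    return found

def supports(transaction, itemset):
    """T supports X iff every x in X is in T or is an ancestor of an item in T."""
    extended = set(transaction)
    for item in transaction:
        extended |= ancestors(item)
    return set(itemset) <= extended

# Transaction 200 of the example supports {Outerwear, Footwear}:
print(supports({"Jacket", "Hiking Boots"}, {"Outerwear", "Footwear"}))  # True
```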
Slide 10: Problem Statement
- To find all generalized association rules that
have support and confidence greater than the
user-specified minimum support (called minsup)
and minimum confidence (called minconf)
respectively.
Slide 11: Example
Slide 12: Example

Database D:
Transaction   Items Bought
100           Shirt
200           Jacket, Hiking Boots
300           Ski Pants, Hiking Boots
400           Shoes
500           Shoes
600           Jacket

Frequent Itemsets:
Itemset                      Support
{Jacket}                     2
{Outerwear}                  3
{Clothes}                    4
{Shoes}                      2
{Hiking Boots}               2
{Footwear}                   4
{Outerwear, Hiking Boots}    2
{Clothes, Hiking Boots}      2
{Outerwear, Footwear}        2
{Clothes, Footwear}          2

Rules (minsup = 30%, minconf = 60%):
Rule                         Support   Confidence
Outerwear → Hiking Boots     33%       66.6%
Outerwear → Footwear         33%       66.6%
Hiking Boots → Outerwear     33%       100%
Hiking Boots → Clothes       33%       100%
Slide 13: Observation 1
- If the set {x, y} has minimum support, so do {x̂, y}, {x, ŷ} and {x̂, ŷ} (where x̂ and ŷ denote ancestors of x and y).
- For example: if {Jacket, Shoes} has minsup, so will {Outerwear, Shoes}, {Jacket, Footwear}, and {Outerwear, Footwear}.
Slide 14: Observation 2
- If the rule x → y has minimum support and confidence, only x → ŷ is guaranteed to have both minsup and minconf.
- The rule Outerwear → Hiking Boots has minsup and minconf.
- Therefore the rule Outerwear → Footwear has both minsup and minconf.
Slide 15: Observation 2 (cont.)
- However, while the rules x̂ → y and x̂ → ŷ will have minsup, they may not have minconf.
- For example: the rules Clothes → Hiking Boots and Clothes → Footwear have minsup, but not minconf.
Slide 16: Interesting Rules: Previous Work
- A rule X → Y is not interesting if support(X → Y) ≈ support(X) × support(Y).
- Previous work does not consider the taxonomy.
- This interest measure pruned less than 1% of the rules on a real database.
Slide 17: Interesting Rules: Using the Taxonomy
- Milk → Cereal (8% support, 70% confidence)
- Milk is the parent of Skim Milk, and 25% of the sales of Milk are sales of Skim Milk.
- We therefore expect Skim Milk → Cereal to have 2% support and 70% confidence.
Slide 18: R-Interesting Rules
- A rule X → Y is R-interesting w.r.t. an ancestor rule X̂ → Ŷ if

    support(X → Y) / expected support of X → Y based on X̂ → Ŷ  >  R

  or

    confidence(X → Y) / expected confidence of X → Y based on X̂ → Ŷ  >  R

- With R = 1.1, about 40-55% of the rules were pruned.
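As a rough illustration (not the paper's code), here is the support half of the test applied to the Milk/Cereal example from the previous slide; the measured support value is hypothetical:

```python
# Sketch of the R-interest support test (illustrative).
R = 1.1

def expected_support(ancestor_rule_support, specialization_share):
    # Milk -> Cereal has 8% support and Skim Milk is 25% of Milk sales,
    # so the expected support of Skim Milk -> Cereal is 0.25 * 0.08 = 2%.
    return ancestor_rule_support * specialization_share

actual = 0.021   # hypothetical measured support of Skim Milk -> Cereal
print(actual / expected_support(0.08, 0.25) > R)   # False: not R-interesting
```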
Slide 19: Problem Statement (new)
- To find all generalized R-interesting association
rules (R is a user-specified minimum interest
called min-interest) that have support and
confidence greater than minsup and minconf
respectively.
Slide 20: Algorithms: 3 Steps
- 1. Find all itemsets whose support is greater than minsup. These itemsets are called frequent itemsets.
- 2. Use the frequent itemsets to generate the desired rules: if ABCD and AB are frequent, then conf(AB → CD) = support(ABCD) / support(AB). (A sketch of this step follows below.)
- 3. Prune all uninteresting rules from this set.
- All the algorithms presented here implement only step 1.
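A minimal sketch of step 2 (my own, ignoring the taxonomy-specific restriction that no consequent item may be an ancestor of an antecedent item); it assumes the supports of all frequent itemsets are kept in a dict keyed by frozenset:

```python
from itertools import combinations

def generate_rules(support, minconf):
    """support: dict mapping frozenset itemsets to their support.
    Returns (antecedent, consequent, confidence) triples meeting minconf."""
    rules = []
    for itemset in support:
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                # every subset of a frequent itemset is frequent, so it is in the dict
                conf = support[itemset] / support[antecedent]
                if conf >= minconf:
                    rules.append((antecedent, itemset - antecedent, conf))
    return rules
```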
Slide 22: Algorithms (Step 1)
- Input: database, taxonomy
- Output: all frequent itemsets
- 3 algorithms (same output, different run-time): Basic, Cumulate, EstMerge
Slide 23: Algorithm Basic: Main Idea
- Is itemset X frequent?
- Does transaction T support X? (X contains items from different levels of the taxonomy; T contains only leaves.)
- Let T' = T ∪ ancestors(T).
- Answer: T supports X iff X ⊆ T'.
Slide 24: Algorithm Basic
Callouts from the pseudocode figure:
- Count item occurrences (first pass).
- Generate new k-itemset candidates from the frequent (k-1)-itemsets.
- Add all ancestors of each item in t to t, removing any duplicates.
- Find the support of all the candidates.
- Keep only those with support over minsup.
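A sketch of one counting pass of Basic under these assumptions (candidates as frozensets, `ancestors` as in the earlier sketch):

```python
from collections import defaultdict

def count_pass(database, candidates, ancestors):
    """One counting pass of Basic: extend each transaction with all
    ancestors of its items, then check every candidate against it."""
    counts = defaultdict(int)
    for t in database:
        extended = set(t)
        for item in t:
            extended |= ancestors(item)   # duplicates collapse in the set
        for cand in candidates:           # cand: frozenset of items
            if cand <= extended:
                counts[cand] += 1
    return counts

def frequent(counts, n_transactions, minsup):
    """Keep only candidates whose support clears minsup (a fraction)."""
    return {c for c, n in counts.items() if n / n_transactions >= minsup}
```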
Slide 25: Candidate Generation
- p and q are two frequent (k-1)-itemsets identical in their first k-2 items.
- Join: add the last item of q to p.
- Prune: check all the (k-1)-subsets of each candidate; remove any candidate with an infrequent subset.
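This is the standard Apriori join/prune step; a sketch, assuming the frequent (k-1)-itemsets are kept as sorted tuples:

```python
from itertools import combinations

def gen_candidates(frequent, k):
    """Join/prune generation of k-itemset candidates from sorted tuples
    of frequent (k-1)-itemsets."""
    freq = set(frequent)
    out = set()
    for p in frequent:
        for q in frequent:
            # join: same first k-2 items, and q's last item follows p's
            if p[:k - 2] == q[:k - 2] and p[k - 2] < q[k - 2]:
                cand = p + (q[k - 2],)
                # prune: drop the candidate if any (k-1)-subset is infrequent
                if all(sub in freq for sub in combinations(cand, k - 1)):
                    out.add(cand)
    return out
```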
Slide 26: Optimization 1
- Filtering the ancestors added to transactions:
- We only need to add to transaction t the ancestors that appear in one of the candidates.
- If the original item is not in any candidate itemset, it can be dropped from the transaction.
- Example: with candidate {Clothes, Shoes}, a transaction t = {Jacket, ...} can be replaced with {Clothes, ...}.
Slide 27: Optimization 2
- Pre-computing ancestors:
- Rather than finding the ancestors of each item by traversing the taxonomy graph, we can pre-compute the ancestors of each item.
- At the same time, we can drop ancestors that are not contained in any of the candidates.
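Optimizations 1 and 2 combine naturally into one pre-computed table; a sketch under the same assumptions as the earlier code (helper names are mine):

```python
def filtered_ancestor_table(items, ancestors, candidates):
    """Pre-compute, per item, only those ancestors (and the item itself)
    that occur in some candidate itemset."""
    needed = set().union(*candidates)   # items used by any candidate
    table = {}
    for item in items:
        kept = {a for a in ancestors(item) if a in needed}
        if item in needed:
            kept.add(item)              # the item itself is kept only if used
        table[item] = kept
    return table

# Extending a transaction then becomes a cheap union of pre-computed sets:
# extended = set().union(*(table[i] for i in t))
```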
Slide 28: Optimization 3
- Pruning itemsets that contain both an item and its ancestor:
- If we have Jacket and Outerwear, we will get the candidate {Jacket, Outerwear}, which is not interesting, since support(Jacket) = support({Jacket, Outerwear}).
- Deleting {Jacket, Outerwear} at k = 2 ensures it will not reappear for k > 2 (because of the prune step of the candidate generation method).
- Therefore we only need to prune candidates containing an item and its ancestor at k = 2; in later passes no candidate will include an item together with its ancestor.
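A sketch of this prune at k = 2, reusing the `ancestors` helper from the earlier sketch:

```python
def prune_item_ancestor_pairs(candidates_2, ancestors):
    """At k = 2, drop candidates pairing an item with its own ancestor;
    later passes then never regenerate such pairs."""
    return {
        c for c in candidates_2   # c: frozenset of two items
        if not any(b in ancestors(a) for a in c for b in c)
    }
```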
Slide 29: Algorithm Cumulate
(Figure: pseudocode. Cumulate is Basic extended with optimizations 1-3: filtered ancestors, a pre-computed ancestor table, and pruning of item-ancestor candidates at k = 2.)
Slide 30: Stratification
- Candidates: {Clothes, Shoes}, {Outerwear, Shoes}, {Jacket, Shoes}
- If {Clothes, Shoes} does not have minimum support, we do not need to count either {Outerwear, Shoes} or {Jacket, Shoes}.
- So we count in steps. Step 1: count {Clothes, Shoes}; if it has minsup, Step 2: count {Outerwear, Shoes}; if that has minsup, Step 3: count {Jacket, Shoes}.
Slide 31: Version 1: Stratify
- Depth of an itemset:
- Itemsets with no parents have depth 0.
- Otherwise: depth(X) = max({depth(X̂) : X̂ is a parent of X}) + 1.
- The algorithm (see the sketch below):
- Count all itemsets C0 of depth 0.
- Delete candidates that are descendants of the itemsets in C0 that did not have minsup.
- Count the remaining itemsets at depth 1 (C1).
- Delete candidates that are descendants of the itemsets in C1 that did not have minsup.
- Count the remaining itemsets at depth 2 (C2), etc.
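A sketch of the depth computation; `parents` is an assumed helper that yields the itemsets obtained by generalizing one item of X to one of that item's parents:

```python
def itemset_depth(itemset, parents, memo=None):
    """Depth of an itemset for Stratify: 0 if it has no parent itemsets,
    else 1 + the maximum depth among its parents."""
    if memo is None:
        memo = {}
    if itemset not in memo:
        ps = list(parents(itemset))
        memo[itemset] = 0 if not ps else 1 + max(
            itemset_depth(p, parents, memo) for p in ps)
    return memo[itemset]
```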
Slide 32: Tradeoff and Optimizations
(Figure: a tradeoff axis between the number of candidates counted and the number of passes over the DB, with Cumulate at one extreme and counting each depth in a different pass at the other.)
- Optimization 1: from a certain level on, count multiple depths together in one pass.
- Optimization 2: count more than 20% of the candidates per pass.
Slide 33: Version 2: Estimate
- Estimate candidate support using a sample:
- 1st pass (C'k):
- Count the candidates that are expected to have minsup (a candidate is expected to be frequent if it has at least 0.9 × minsup in the sample).
- Also count the candidates whose parents are expected to have minsup.
- 2nd pass (C''k):
- Count the children of the candidates in C'k that were not expected to have minsup.
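One loose reading of the first-pass candidate selection, sketched below; `sample_support` and `parents` are assumed helpers, and the exact membership rules are my interpretation of the slide:

```python
def first_pass_candidates(candidates, parents, sample_support, minsup):
    """Estimate's first pass (a sketch): count candidates whose sample
    support clears 0.9 * minsup, plus candidates all of whose parent
    itemsets are themselves expected to be frequent."""
    expected = {c for c in candidates if sample_support(c) >= 0.9 * minsup}
    promising = {c for c in candidates
                 if c not in expected
                 and all(p in expected for p in parents(c))}
    return expected | promising
```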
Slide 34: Example for Estimate

Candidate Itemset      Support in Sample   Support in DB (Scenario A)   Support in DB (Scenario B)
{Clothes, Shoes}       8                   7                            9
{Outerwear, Shoes}     4                   4                            6
{Jacket, Shoes}        2                   -                            -
Slide 35: Version 3: EstMerge
- Motivation: eliminate the 2nd pass of algorithm Estimate.
- Implementation: count the candidates of C''k (Estimate's 2nd-pass candidates) together with the candidates in Ck+1.
- Restriction: to create Ck+1, we assume that all candidates in C''k have minsup.
- The tradeoff: the extra candidates counted by EstMerge vs. the extra pass made by Estimate.
Slide 36: Algorithm EstMerge
Slide 37: Stratify: Variants
Slide 38: Size of Sample
Probabilities for real support p, sample threshold a, and sample size n (as given on the slide):

                p = 5%           p = 1%           p = 0.5%         p = 0.1%
                a=0.8p  a=0.9p   a=0.8p  a=0.9p   a=0.8p  a=0.9p   a=0.8p  a=0.9p
n = 1,000       0.32    0.76     0.80    0.95     0.89    0.97     0.98    0.99
n = 10,000      0.00    0.07     0.11    0.59     0.34    0.77     0.80    0.95
n = 100,000     0.00    0.00     0.00    0.01     0.00    0.07     0.12    0.60
n = 1,000,000   0.00    0.00     0.00    0.00     0.00    0.00     0.00    0.01
Slide 39: Size of Sample
Slide 40: Performance Evaluation
- Compare the running time of the 3 algorithms: Basic, Cumulate and EstMerge.
- On synthetic data:
  - the effect of each parameter on performance
- On real data:
  - supermarket data
  - department store data
Slide 41: Synthetic Data Generation

Parameter   Meaning                                                      Default Value
D           Number of transactions                                       1,000,000
T           Average size of the transactions                             10
I           Average size of the maximal potentially frequent itemsets    4
I           Number of maximal potentially frequent itemsets              10,000
N           Number of items                                              100,000
R           Number of roots                                              250
L           Number of levels                                             4-5
F           Fanout                                                       5
D           Depth-ratio (probability that an item in a rule comes from
            level i / probability that it comes from level i+1)          1
Slide 42: Minimum Support (performance graph)
Slide 43: Number of Transactions (performance graph)
Slide 44: Fanout (performance graph)
Slide 45: Number of Items (performance graph)
Slide 46: Reality Check
- Supermarket data:
  - 548,000 items
  - Taxonomy: 4 levels, 118 roots
  - 1.5 million transactions
  - Average of 9.6 items per transaction
- Department store data:
  - 228,000 items
  - Taxonomy: 7 levels, 89 roots
  - 570,000 transactions
  - Average of 4.4 items per transaction
Slide 47: Results
Slide 48: Conclusions
- Cumulate and EstMerge were 2 to 5 times faster than Basic on all synthetic datasets. On the supermarket database they were 100 times faster!
- EstMerge was 25-30% faster than Cumulate.
- Both EstMerge and Cumulate exhibit linear scale-up with the number of transactions.
Slide 49: Summary
- The use of a taxonomy is necessary for finding association rules between items at any level of the hierarchy.
- The obvious solution (algorithm Basic) is not very fast.
- New algorithms that exploit the taxonomy are much faster.
- We can use the taxonomy to prune uninteresting rules.