Title: Mining Generalized Association Rules (Ramakrishnan Srikant, Rakesh Agrawal)
Slide 1: Mining Generalized Association Rules
Ramakrishnan Srikant, Rakesh Agrawal
- Data Mining Seminar, Spring Semester 2003
- Prof. Amos Fiat
- Student: Idit Haran
Slide 2: Outline
- Motivation
- Terms and Definitions
- Interest Measure
- Algorithms for mining generalized association rules
- Comparison
- Conclusions
Slide 3: Motivation
- Find association rules of the form: Diapers → Beer
- There are different kinds of diapers: Huggies/Pampers, S/M/L, etc.
- There are different kinds of beers: Heineken/Maccabi, in a bottle/in a can, etc.
- The information on the bar-code is of the type: Huggies Diapers, M → Heineken Beer in a bottle
- This preliminary rule is not interesting, and probably will not have minimum support.
Slide 4: Taxonomy
(Figure: the example taxonomy used throughout. Clothes subsumes Outerwear and Shirts; Outerwear subsumes Jackets and Ski Pants; Footwear subsumes Shoes and Hiking Boots.)
Slide 5: Taxonomy: Example
- Say we found the rule Outerwear → Hiking Boots with minimum support and confidence.
- The rule Jackets → Hiking Boots may not have minimum support.
- The rule Clothes → Hiking Boots may not have minimum confidence.
Slide 6: Taxonomy
- Users are interested in generating rules that span different levels of the taxonomy.
- Rules at lower levels may not have minimum support.
- The taxonomy can be used to prune uninteresting or redundant rules.
- Multiple taxonomies may be present, for example: category, price (cheap/expensive), items-on-sale, etc.
- Multiple taxonomies may be modeled as a forest, or as a DAG.
Slide 7: Notations
Slide 8: Notations
- I = {i1, i2, ..., im}: the set of items.
- T: a transaction, a set of items with T ⊆ I (we expect the items in T to be leaves of the taxonomy).
- D: the set of transactions.
- T supports an item x if x is in T or x is an ancestor of some item in T.
- T supports X ⊆ I if it supports every item in X.
Slide 9: Notations
- A generalized association rule X → Y: X ⊂ I, Y ⊂ I, X ∩ Y = ∅, and no item in Y is an ancestor of any item in X.
- The rule X → Y has confidence c in D if c% of the transactions in D that support X also support Y.
- The rule X → Y has support s in D if s% of the transactions in D support X ∪ Y.
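These definitions translate directly into code. Below is a minimal sketch (my own illustration, not from the paper), storing the taxonomy as a child-to-parents dict; the function names and data layout are assumptions:

```python
# A minimal sketch of the support definitions; names are illustrative.
TAXONOMY = {
    "Jacket": ["Outerwear"], "Ski Pants": ["Outerwear"],
    "Outerwear": ["Clothes"], "Shirt": ["Clothes"],
    "Shoes": ["Footwear"], "Hiking Boots": ["Footwear"],
}

def ancestors(item):
    """All ancestors of an item in the taxonomy DAG."""
    found, stack = set(), list(TAXONOMY.get(item, []))
    while stack:
        parent = stack.pop()
        if parent not in found:
            found.add(parent)
            stack.extend(TAXONOMY.get(parent, []))
    return found

def supports(transaction, itemset):
    """T supports X iff every x in X is in T or is an ancestor of an item in T."""
    extended = set(transaction)
    for item in transaction:
        extended |= ancestors(item)
    return set(itemset) <= extended

# Transaction 200 of the example supports {Outerwear, Footwear}:
print(supports({"Jacket", "Hiking Boots"}, {"Outerwear", "Footwear"}))  # True
```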
Slide 10: Problem Statement
- To find all generalized association rules that
have support and confidence greater than the
user-specified minimum support (called minsup)
and minimum confidence (called minconf)
respectively.
Slide 11: Example
Slide 12: Example

Database D:
Transaction   Items Bought
100           Shirt
200           Jacket, Hiking Boots
300           Ski Pants, Hiking Boots
400           Shoes
500           Shoes
600           Jacket

Frequent Itemsets:
Itemset                      Support
{Jacket}                     2
{Outerwear}                  3
{Clothes}                    4
{Shoes}                      2
{Hiking Boots}               2
{Footwear}                   4
{Outerwear, Hiking Boots}    2
{Clothes, Hiking Boots}      2
{Outerwear, Footwear}        2
{Clothes, Footwear}          2

Rules (minsup = 30%, minconf = 60%):
Rule                         Support   Confidence
Outerwear → Hiking Boots     33%       66.6%
Outerwear → Footwear         33%       66.6%
Hiking Boots → Outerwear     33%       100%
Hiking Boots → Clothes       33%       100%
Slide 13: Observation 1
- If the set {x, y} has minimum support, so do {x̂, y}, {x, ŷ} and {x̂, ŷ} (where x̂ and ŷ denote ancestors of x and y).
- For example: if {Jacket, Shoes} has minsup, so will {Outerwear, Shoes}, {Jacket, Footwear}, and {Outerwear, Footwear}.
Slide 14: Observation 2
- If the rule x → y has minimum support and confidence, only x → ŷ is guaranteed to have both minsup and minconf.
- The rule Outerwear → Hiking Boots has minsup and minconf.
- Therefore the rule Outerwear → Footwear has both minsup and minconf.
Slide 15: Observation 2 (cont.)
- However, while the rules x̂ → y and x̂ → ŷ will have minsup, they may not have minconf.
- For example: the rules Clothes → Hiking Boots and Clothes → Footwear have minsup, but not minconf.
Slide 16: Interesting Rules: Previous Work
- A rule X → Y is not interesting if support(X → Y) ≈ support(X) × support(Y).
- Previous work does not consider the taxonomy.
- This interest measure pruned less than 1% of the rules on a real database.
Slide 17: Interesting Rules: Using the Taxonomy
- Milk → Cereal (8% support, 70% confidence)
- Milk is the parent of Skim Milk, and 25% of the sales of Milk are sales of Skim Milk.
- We therefore expect Skim Milk → Cereal to have 2% support and 70% confidence.
Slide 18: R-Interesting Rules
- A rule X → Y is R-interesting w.r.t. an ancestor rule X̂ → Ŷ if

    support(X → Y) / expected support of X → Y based on X̂ → Ŷ  >  R

  or

    confidence(X → Y) / expected confidence of X → Y based on X̂ → Ŷ  >  R

- With R = 1.1, about 40-55% of the rules were pruned.
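As a rough illustration (not the paper's code), here is the support half of the test applied to the Milk/Cereal example from the previous slide; the measured support value is hypothetical:

```python
# Sketch of the R-interest support test (illustrative).
R = 1.1

def expected_support(ancestor_rule_support, specialization_share):
    # Milk -> Cereal has 8% support and Skim Milk is 25% of Milk sales,
    # so the expected support of Skim Milk -> Cereal is 0.25 * 0.08 = 2%.
    return ancestor_rule_support * specialization_share

actual = 0.021   # hypothetical measured support of Skim Milk -> Cereal
print(actual / expected_support(0.08, 0.25) > R)   # False: not R-interesting
```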
Slide 19: Problem Statement (new)
- To find all generalized R-interesting association
rules (R is a user-specified minimum interest
called min-interest) that have support and
confidence greater than minsup and minconf
respectively.
Slide 20: Algorithms: 3 Steps
- 1. Find all itemsets whose support is greater than minsup. These itemsets are called frequent itemsets.
- 2. Use the frequent itemsets to generate the desired rules: if ABCD and AB are frequent, then conf(AB → CD) = support(ABCD) / support(AB). (A sketch of this step follows below.)
- 3. Prune all uninteresting rules from this set.
- All the algorithms presented here implement only step 1.
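A minimal sketch of step 2 (my own, ignoring the taxonomy-specific restriction that no consequent item may be an ancestor of an antecedent item); it assumes the supports of all frequent itemsets are kept in a dict keyed by frozenset:

```python
from itertools import combinations

def generate_rules(support, minconf):
    """support: dict mapping frozenset itemsets to their support.
    Returns (antecedent, consequent, confidence) triples meeting minconf."""
    rules = []
    for itemset in support:
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                # every subset of a frequent itemset is frequent, so it is in the dict
                conf = support[itemset] / support[antecedent]
                if conf >= minconf:
                    rules.append((antecedent, itemset - antecedent, conf))
    return rules
```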
Slide 22: Algorithms (Step 1)
- Input: database, taxonomy
- Output: all frequent itemsets
- 3 algorithms (same output, different run-time): Basic, Cumulate, EstMerge
Slide 23: Algorithm Basic: Main Idea
- Is itemset X frequent?
- Does transaction T support X? (X contains items from different levels of the taxonomy; T contains only leaves.)
- Let T' = T ∪ ancestors(T).
- Answer: T supports X iff X ⊆ T'.
Slide 24: Algorithm Basic
Callouts from the pseudocode figure:
- Count item occurrences (first pass).
- Generate new k-itemset candidates from the frequent (k-1)-itemsets.
- Add all ancestors of each item in t to t, removing any duplicates.
- Find the support of all the candidates.
- Keep only those with support over minsup.
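A sketch of one counting pass of Basic under these assumptions (candidates as frozensets, `ancestors` as in the earlier sketch):

```python
from collections import defaultdict

def count_pass(database, candidates, ancestors):
    """One counting pass of Basic: extend each transaction with all
    ancestors of its items, then check every candidate against it."""
    counts = defaultdict(int)
    for t in database:
        extended = set(t)
        for item in t:
            extended |= ancestors(item)   # duplicates collapse in the set
        for cand in candidates:           # cand: frozenset of items
            if cand <= extended:
                counts[cand] += 1
    return counts

def frequent(counts, n_transactions, minsup):
    """Keep only candidates whose support clears minsup (a fraction)."""
    return {c for c, n in counts.items() if n / n_transactions >= minsup}
```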
Slide 25: Candidate Generation
- p and q are two frequent (k-1)-itemsets identical in their first k-2 items.
- Join: add the last item of q to p.
- Prune: check all the (k-1)-subsets of each candidate; remove any candidate with an infrequent subset.
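This is the standard Apriori join/prune step; a sketch, assuming the frequent (k-1)-itemsets are kept as sorted tuples:

```python
from itertools import combinations

def gen_candidates(frequent, k):
    """Join/prune generation of k-itemset candidates from sorted tuples
    of frequent (k-1)-itemsets."""
    freq = set(frequent)
    out = set()
    for p in frequent:
        for q in frequent:
            # join: same first k-2 items, and q's last item follows p's
            if p[:k - 2] == q[:k - 2] and p[k - 2] < q[k - 2]:
                cand = p + (q[k - 2],)
                # prune: drop the candidate if any (k-1)-subset is infrequent
                if all(sub in freq for sub in combinations(cand, k - 1)):
                    out.add(cand)
    return out
```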
Slide 26: Optimization 1
- Filtering the ancestors added to transactions:
- We only need to add to transaction t the ancestors that appear in one of the candidates.
- If the original item is not in any candidate itemset, it can be dropped from the transaction.
- Example: with candidate {Clothes, Shoes}, a transaction t = {Jacket, ...} can be replaced with {Clothes, ...}.
Slide 27: Optimization 2
- Pre-computing ancestors:
- Rather than finding the ancestors of each item by traversing the taxonomy graph, we can pre-compute the ancestors of each item.
- At the same time, we can drop ancestors that are not contained in any of the candidates.
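Optimizations 1 and 2 combine naturally into one pre-computed table; a sketch under the same assumptions as the earlier code (helper names are mine):

```python
def filtered_ancestor_table(items, ancestors, candidates):
    """Pre-compute, per item, only those ancestors (and the item itself)
    that occur in some candidate itemset."""
    needed = set().union(*candidates)   # items used by any candidate
    table = {}
    for item in items:
        kept = {a for a in ancestors(item) if a in needed}
        if item in needed:
            kept.add(item)              # the item itself is kept only if used
        table[item] = kept
    return table

# Extending a transaction then becomes a cheap union of pre-computed sets:
# extended = set().union(*(table[i] for i in t))
```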
Slide 28: Optimization 3
- Pruning itemsets that contain both an item and its ancestor:
- If we have Jacket and Outerwear, we will get the candidate {Jacket, Outerwear}, which is not interesting, since support(Jacket) = support({Jacket, Outerwear}).
- Deleting {Jacket, Outerwear} at k = 2 ensures it will not reappear for k > 2 (because of the prune step of the candidate generation method).
- Therefore we only need to prune candidates containing an item and its ancestor at k = 2; in later passes no candidate will include an item together with its ancestor.
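A sketch of this prune at k = 2, reusing the `ancestors` helper from the earlier sketch:

```python
def prune_item_ancestor_pairs(candidates_2, ancestors):
    """At k = 2, drop candidates pairing an item with its own ancestor;
    later passes then never regenerate such pairs."""
    return {
        c for c in candidates_2   # c: frozenset of two items
        if not any(b in ancestors(a) for a in c for b in c)
    }
```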
Slide 29: Algorithm Cumulate
(Figure: pseudocode. Cumulate is Basic extended with optimizations 1-3: filtered ancestors, a pre-computed ancestor table, and pruning of item-ancestor candidates at k = 2.)
Slide 30: Stratification
- Candidates: {Clothes, Shoes}, {Outerwear, Shoes}, {Jacket, Shoes}
- If {Clothes, Shoes} does not have minimum support, we do not need to count either {Outerwear, Shoes} or {Jacket, Shoes}.
- So we count in steps. Step 1: count {Clothes, Shoes}; if it has minsup, Step 2: count {Outerwear, Shoes}; if that has minsup, Step 3: count {Jacket, Shoes}.
Slide 31: Version 1: Stratify
- Depth of an itemset:
- Itemsets with no parents have depth 0.
- Otherwise: depth(X) = max({depth(X̂) : X̂ is a parent of X}) + 1.
- The algorithm (see the sketch below):
- Count all itemsets C0 of depth 0.
- Delete candidates that are descendants of the itemsets in C0 that did not have minsup.
- Count the remaining itemsets at depth 1 (C1).
- Delete candidates that are descendants of the itemsets in C1 that did not have minsup.
- Count the remaining itemsets at depth 2 (C2), etc.
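A sketch of the depth computation; `parents` is an assumed helper that yields the itemsets obtained by generalizing one item of X to one of that item's parents:

```python
def itemset_depth(itemset, parents, memo=None):
    """Depth of an itemset for Stratify: 0 if it has no parent itemsets,
    else 1 + the maximum depth among its parents."""
    if memo is None:
        memo = {}
    if itemset not in memo:
        ps = list(parents(itemset))
        memo[itemset] = 0 if not ps else 1 + max(
            itemset_depth(p, parents, memo) for p in ps)
    return memo[itemset]
```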
Slide 32: Tradeoff and Optimizations
(Figure: a tradeoff axis between the number of candidates counted and the number of passes over the DB, with Cumulate at one extreme and counting each depth in a different pass at the other.)
- Optimization 1: from a certain level on, count multiple depths together in one pass.
- Optimization 2: count more than 20% of the candidates per pass.
Slide 33: Version 2: Estimate
- Estimate candidate support using a sample:
- 1st pass (C'k):
- Count the candidates that are expected to have minsup (a candidate is expected to be frequent if it has at least 0.9 × minsup in the sample).
- Also count the candidates whose parents are expected to have minsup.
- 2nd pass (C''k):
- Count the children of the candidates in C'k that were not expected to have minsup.
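One loose reading of the first-pass candidate selection, sketched below; `sample_support` and `parents` are assumed helpers, and the exact membership rules are my interpretation of the slide:

```python
def first_pass_candidates(candidates, parents, sample_support, minsup):
    """Estimate's first pass (a sketch): count candidates whose sample
    support clears 0.9 * minsup, plus candidates all of whose parent
    itemsets are themselves expected to be frequent."""
    expected = {c for c in candidates if sample_support(c) >= 0.9 * minsup}
    promising = {c for c in candidates
                 if c not in expected
                 and all(p in expected for p in parents(c))}
    return expected | promising
```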
Slide 34: Example for Estimate

Candidate Itemset      Support in Sample   Support in DB (Scenario A)   Support in DB (Scenario B)
{Clothes, Shoes}       8                   7                            9
{Outerwear, Shoes}     4                   4                            6
{Jacket, Shoes}        2                   -                            -
Slide 35: Version 3: EstMerge
- Motivation: eliminate the 2nd pass of algorithm Estimate.
- Implementation: count the candidates of C''k (Estimate's 2nd-pass candidates) together with the candidates in Ck+1.
- Restriction: to create Ck+1, we assume that all candidates in C''k have minsup.
- The tradeoff: the extra candidates counted by EstMerge vs. the extra pass made by Estimate.
Slide 36: Algorithm EstMerge
Slide 37: Stratify: Variants
Slide 38: Size of Sample
Probabilities for real support p, sample threshold a, and sample size n (as given on the slide):

                p = 5%           p = 1%           p = 0.5%         p = 0.1%
                a=0.8p  a=0.9p   a=0.8p  a=0.9p   a=0.8p  a=0.9p   a=0.8p  a=0.9p
n = 1,000       0.32    0.76     0.80    0.95     0.89    0.97     0.98    0.99
n = 10,000      0.00    0.07     0.11    0.59     0.34    0.77     0.80    0.95
n = 100,000     0.00    0.00     0.00    0.01     0.00    0.07     0.12    0.60
n = 1,000,000   0.00    0.00     0.00    0.00     0.00    0.00     0.00    0.01
Slide 39: Size of Sample
Slide 40: Performance Evaluation
- Compare the running time of the 3 algorithms: Basic, Cumulate and EstMerge.
- On synthetic data:
  - the effect of each parameter on performance
- On real data:
  - supermarket data
  - department store data
Slide 41: Synthetic Data Generation

Parameter   Meaning                                                      Default Value
D           Number of transactions                                       1,000,000
T           Average size of the transactions                             10
I           Average size of the maximal potentially frequent itemsets    4
I           Number of maximal potentially frequent itemsets              10,000
N           Number of items                                              100,000
R           Number of roots                                              250
L           Number of levels                                             4-5
F           Fanout                                                       5
D           Depth-ratio (probability that an item in a rule comes from
            level i / probability that it comes from level i+1)          1
Slide 42: Minimum Support (performance graph)
Slide 43: Number of Transactions (performance graph)
Slide 44: Fanout (performance graph)
Slide 45: Number of Items (performance graph)
Slide 46: Reality Check
- Supermarket data:
  - 548,000 items
  - Taxonomy: 4 levels, 118 roots
  - 1.5 million transactions
  - Average of 9.6 items per transaction
- Department store data:
  - 228,000 items
  - Taxonomy: 7 levels, 89 roots
  - 570,000 transactions
  - Average of 4.4 items per transaction
Slide 47: Results
Slide 48: Conclusions
- Cumulate and EstMerge were 2 to 5 times faster than Basic on all synthetic datasets. On the supermarket database they were 100 times faster!
- EstMerge was 25-30% faster than Cumulate.
- Both EstMerge and Cumulate exhibit linear scale-up with the number of transactions.
Slide 49: Summary
- The use of a taxonomy is necessary for finding association rules between items at any level of the hierarchy.
- The obvious solution (algorithm Basic) is not very fast.
- New algorithms that exploit the taxonomy are much faster.
- We can use the taxonomy to prune uninteresting rules.