Mining Generalized Association Rules (Ramakrishnan Srikant, Rakesh Agrawal) - PowerPoint PPT Presentation
Transcript and Presenter's Notes

Title: Mining Generalized Association Rules (Ramakrishnan Srikant, Rakesh Agrawal)


1
Mining Generalized Association Rules
Ramakrishnan Srikant, Rakesh Agrawal
  • Data Mining Seminar, Spring Semester 2003
  • Prof. Amos Fiat
  • Student: Idit Haran

2
Outline
  • Motivation
  • Terms and Definitions
  • Interest Measure
  • Algorithms for mining generalized association
    rules
  • Comparison
  • Conclusions

3
Motivation
  • Find association rules of the form: Diapers → Beer
  • There are different kinds of diapers: Huggies/Pampers, S/M/L, etc.
  • There are different kinds of beers: Heineken/Maccabi, in a bottle/in a
    can, etc.
  • The information on the bar-code is of the type: Huggies Diapers, M →
    Heineken Beer in a bottle
  • Such a specific rule is not interesting in itself, and probably will
    not have minimum support.

4
Taxonomy
  • is-a hierarchies
  • (Figure: example taxonomy - Jackets and Ski Pants are Outerwear;
    Outerwear and Shirts are Clothes; Shoes and Hiking Boots are Footwear.)

5
Taxonomy - Example
  • Say we found the rule Outerwear → Hiking Boots with minimum support
    and confidence.
  • The rule Jackets → Hiking Boots may not have minimum support.
  • The rule Clothes → Hiking Boots may not have minimum confidence.

6
Taxonomy
  • Users are interested in generating rules that span different levels of
    the taxonomy.
  • Rules at lower levels may not have minimum support.
  • The taxonomy can be used to prune uninteresting or redundant rules.
  • Multiple taxonomies may be present, for example: category, price
    (cheap/expensive), items-on-sale, etc.
  • Multiple taxonomies may be modeled as a forest, or as a DAG.

7
Notations
8
Notations
  • I = {i1, i2, …, im} - the set of items.
  • T - a transaction, a set of items with T ⊆ I (we expect the items in T
    to be leaves of the taxonomy).
  • D - the set of transactions.
  • T supports item x if x is in T or x is an ancestor of some item in T.
  • T supports X ⊆ I if it supports every item in X (see the sketch below).

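A minimal sketch of this support test (our illustration, not the paper's code). The taxonomy is assumed to be a child → parent map; item names are taken from the deck's running example.

```python
# Taxonomy as a child -> parent map (illustrative; from the deck's example).
PARENT = {
    "Jacket": "Outerwear", "Ski Pants": "Outerwear",
    "Outerwear": "Clothes", "Shirt": "Clothes",
    "Shoes": "Footwear", "Hiking Boots": "Footwear",
}

def ancestors(item):
    """All ancestors of an item, found by walking up the taxonomy."""
    found = set()
    while item in PARENT:
        item = PARENT[item]
        found.add(item)
    return found

def supports(transaction, itemset):
    """T supports X iff every item of X is in T or is an ancestor of an item in T."""
    extended = set(transaction)
    for item in transaction:
        extended |= ancestors(item)
    return set(itemset) <= extended

# Transaction 200 from the later example supports {Outerwear, Footwear}:
print(supports({"Jacket", "Hiking Boots"}, {"Outerwear", "Footwear"}))  # True
```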
9
Notations
  • A generalized association rule X → Y requires X ⊂ I, Y ⊂ I,
    X ∩ Y = ∅, and that no item in Y is an ancestor of any item in X.
  • The rule X → Y has confidence c in D if c% of the transactions in D
    that support X also support Y.
  • The rule X → Y has support s in D if s% of the transactions in D
    support X ∪ Y.

10
Problem Statement
  • To find all generalized association rules that
    have support and confidence greater than the
    user-specified minimum support (called minsup)
    and minimum confidence (called minconf)
    respectively.

11
Example
  • Recall the taxonomy

12
Example
Frequent Itemsets:
  Itemset                      Support
  {Jacket}                     2
  {Outerwear}                  3
  {Clothes}                    4
  {Shoes}                      2
  {Hiking Boots}               2
  {Footwear}                   4
  {Outerwear, Hiking Boots}    2
  {Clothes, Hiking Boots}      2
  {Outerwear, Footwear}        2
  {Clothes, Footwear}          2

Database D:
  Transaction   Items Bought
  100           Shirt
  200           Jacket, Hiking Boots
  300           Ski Pants, Hiking Boots
  400           Shoes
  500           Shoes
  600           Jacket

Rules (minsup = 30%, minconf = 60%):
  Rule                         Support   Confidence
  Outerwear → Hiking Boots     33%       66.6%
  Outerwear → Footwear         33%       66.6%
  Hiking Boots → Outerwear     33%       100%
  Hiking Boots → Clothes       33%       100%
13
Observation 1
  • If the set {x, y} has minimum support, so do {x̂, y}, {x, ŷ} and
    {x̂, ŷ} (where x̂ denotes an ancestor of x, and ŷ an ancestor of y).
  • For example: if {Jacket, Shoes} has minsup, so will {Outerwear, Shoes},
    {Jacket, Footwear}, and {Outerwear, Footwear}.

14
Observation 2
  • If the rule x → y has minimum support and confidence, only x → ŷ is
    guaranteed to have both minsup and minconf.
  • The rule Outerwear → Hiking Boots has minsup and minconf.
  • Hence the rule Outerwear → Footwear also has both minsup and minconf.

15
Observation 2 cont.
  • However, while the rules x̂ → y and x̂ → ŷ will have minsup, they may
    not have minconf.
  • For example: the rules Clothes → Hiking Boots and Clothes → Footwear
    have minsup, but not minconf.

16
Interesting Rules: Previous Work
  • A rule X → Y is not interesting if support(X → Y) ≈
    support(X) × support(Y).
  • Previous work does not consider the taxonomy.
  • This interest measure pruned less than 1% of the rules on a real
    database (a sketch of the test follows below).

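A minimal sketch of this earlier, taxonomy-free measure (our illustration; supports are fractions in [0, 1], and the tolerance parameter is an assumption of ours, not from the slides):

```python
# X -> Y is uninteresting when its support is about what independence of
# X and Y would predict. The tolerance is a hypothetical knob.
def is_interesting(support_xy, support_x, support_y, tolerance=0.05):
    expected = support_x * support_y
    return support_xy > (1 + tolerance) * expected

print(is_interesting(0.10, 0.30, 0.20))  # True: 0.10 is well above 0.06
print(is_interesting(0.06, 0.30, 0.20))  # False: matches independence
```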
17
Interesting Rules: Using the Taxonomy
  • Milk → Cereal (8% support, 70% confidence)
  • Milk is the parent of Skim Milk, and 25% of the sales of Milk are Skim
    Milk.
  • We would therefore expect Skim Milk → Cereal to have 2% support and
    70% confidence.

18
R-Interesting Rules
  • A rule X → Y is R-interesting w.r.t. an ancestor rule X̂ → Ŷ if
    (see the sketch below):
  • real support(X → Y) / expected support of X → Y based on X̂ → Ŷ > R, or
  • real confidence(X → Y) / expected confidence of X → Y based on X̂ → Ŷ > R.
  • With R = 1.1, about 40-55% of the rules were pruned.
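A hedged sketch of the R-interest support test (helper names are ours; the confidence test is analogous). The expected support of X → Y based on an ancestor rule scales the ancestor's support by each specialized item's share of its ancestor, as in the Skim Milk example above.

```python
def expected_support(ancestor_support, shares):
    """shares: one fraction per specialized item, e.g. 0.25 if Skim Milk
    accounts for 25% of Milk sales."""
    expected = ancestor_support
    for share in shares:
        expected *= share
    return expected

def is_r_interesting(real_support, ancestor_support, shares, R=1.1):
    return real_support > R * expected_support(ancestor_support, shares)

# Milk -> Cereal has 8% support and Skim Milk is 25% of Milk sales, so
# Skim Milk -> Cereal is expected at 2%; an actual 4% is R-interesting.
print(is_r_interesting(0.04, 0.08, [0.25]))  # True
```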
19
Problem Statement (new)
  • To find all generalized R-interesting association
    rules (R is a user-specified minimum interest
    called min-interest) that have support and
    confidence greater than minsup and minconf
    respectively.

20
Algorithms: 3 Steps
  • 1. Find all itemsets whose support is greater than minsup. These
    itemsets are called frequent itemsets.
  • 2. Use the frequent itemsets to generate the desired rules: if ABCD and
    AB are frequent, then conf(AB → CD) = support(ABCD) / support(AB)
    (see the sketch after this list).
  • 3. Prune all uninteresting rules from this set.
  • The presented algorithms implement only step 1.

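A sketch of step 2 (rule generation), under the assumption that `support` maps frozensets (the frequent itemsets) to their support counts; the extra generalized-rule constraint (no item in Y may be an ancestor of an item in X) is omitted here for brevity.

```python
from itertools import combinations

def generate_rules(support, minconf):
    rules = []
    for itemset in support:
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                if antecedent in support:  # every subset of a frequent set is frequent
                    conf = support[itemset] / support[antecedent]
                    if conf >= minconf:
                        rules.append((antecedent, itemset - antecedent, conf))
    return rules

# e.g. generate_rules({frozenset("AB"): 4, frozenset("A"): 5, ...}, 0.6)
```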
22
Algorithms (step 1)
  • Input: database, taxonomy
  • Output: all frequent itemsets
  • 3 algorithms (same output, different run-time): Basic, Cumulate,
    EstMerge

23
Algorithm Basic: Main Idea
  • Is itemset X frequent?
  • Does transaction T support X? (X contains items from different levels
    of the taxonomy; T contains only leaves.)
  • T' = T ∪ ancestors(T)
  • Answer: T supports X ⟺ X ⊆ T'

24
Algorithm Basic
The main loop (annotations from the slide's pseudocode; a sketch follows
below):
  • Count item occurrences.
  • Generate new k-itemset candidates.
  • Add all ancestors of each item in t to t, removing any duplicates.
  • Find the support of all the candidates.
  • Keep only those with support over minsup.
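An illustrative sketch of one such pass (ours, not the authors' code), reusing `ancestors` from the earlier sketch: every transaction is extended with the ancestors of its items before subset counting.

```python
def count_pass(database, candidates):
    """database: iterable of item sets; candidates: iterable of frozensets."""
    counts = {c: 0 for c in candidates}
    for t in database:
        extended = set(t)
        for item in t:
            extended |= ancestors(item)
        for c in counts:
            if c <= extended:          # candidate is a subset of extended t
                counts[c] += 1
    return counts

def frequent(counts, num_transactions, minsup):
    """Keep candidates whose fractional support reaches minsup."""
    return [c for c, n in counts.items() if n / num_transactions >= minsup]
```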
25
Candidate generation
  • Join step: p and q are two frequent (k-1)-itemsets identical in their
    first k-2 items; join them by adding the last item of q to p.
  • Prune step: check all (k-1)-subsets of each candidate, and remove any
    candidate with an infrequent subset (see the sketch below).
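A sketch of this join-and-prune step (a hypothetical helper of ours, with itemsets represented as sorted tuples):

```python
from itertools import combinations

def generate_candidates(frequent_k_minus_1):
    prev = {tuple(sorted(s)) for s in frequent_k_minus_1}
    candidates = set()
    for p in prev:
        for q in prev:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:        # join step
                cand = p + (q[-1],)
                subsets = combinations(cand, len(cand) - 1)
                if all(sub in prev for sub in subsets):   # prune step
                    candidates.add(cand)
    return candidates

print(generate_candidates([("A", "B"), ("A", "C"), ("B", "C")]))
# {('A', 'B', 'C')}
```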
26
Optimization 1
  • Filtering the ancestors added to transactions.
  • We only need to add to transaction t the ancestors that appear in one
    of the candidates.
  • If the original item is not in any candidate, it can be dropped from
    the transaction.
  • Example: the candidates are {Clothes, Shoes}; the transaction
    t = {Jacket} can be replaced with {Clothes}.

27
Optimization 2
  • Pre-computing ancestors.
  • Rather than finding the ancestors of each item by traversing the
    taxonomy graph, we can pre-compute the ancestors of every item once.
  • At the same time, we can drop ancestors that are not contained in any
    of the candidates (as sketched below).

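A sketch of this pre-computation (our own structure, reusing the child → parent map from the earlier sketch):

```python
def precompute_ancestors(parent, items_in_candidates):
    """Build an item -> ancestor-set table, keeping only ancestors that
    actually occur in some candidate itemset."""
    table = {}
    for item in parent:
        found = set()
        node = item
        while node in parent:
            node = parent[node]
            found.add(node)
        table[item] = found & items_in_candidates  # drop unused ancestors
    return table

# table = precompute_ancestors(PARENT, {"Clothes", "Footwear"})
```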
28
Optimization 3
  • Pruning itemsets containing an item and its ancestor.
  • If we have Jacket and Outerwear, we will get the candidate
    {Jacket, Outerwear}, which is not interesting:
  • support({Jacket}) = support({Jacket, Outerwear}).
  • Deleting {Jacket, Outerwear} at k = 2 ensures it will never arise for
    k > 2 (because of the prune step of the candidate generation method).
  • Therefore, we need to prune itemsets containing an item and its
    ancestor only at k = 2; in later passes no candidate will contain an
    item together with its ancestor.

29
Algorithm Cumulate
30
Stratification
  • Candidates: {Clothes, Shoes}, {Outerwear, Shoes}, {Jacket, Shoes}.
  • If {Clothes, Shoes} does not have minimum support, we don't need to
    count either {Outerwear, Shoes} or {Jacket, Shoes}.
  • We count in steps: step 1 - count {Clothes, Shoes}; if it has minsup,
    step 2 - count {Outerwear, Shoes}; if that has minsup, step 3 - count
    {Jacket, Shoes}.

31
Version 1: Stratify
  • Depth of an itemset:
  • itemsets with no parents are of depth 0;
  • otherwise, depth(X) = max({depth(X̂) : X̂ is a parent of X}) + 1.
  • The algorithm (see the sketch below):
  • Count all itemsets C0 of depth 0.
  • Delete candidates that are descendants of the itemsets in C0 that
    didn't have minsup.
  • Count the remaining itemsets at depth 1 (C1).
  • Delete candidates that are descendants of the itemsets in C1 that
    didn't have minsup.
  • Count the remaining itemsets at depth 2 (C2), etc.

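A simplified sketch of the depth computation (the structure is ours, not the paper's pseudocode). We assume a parent of an itemset is obtained by replacing one of its items with that item's taxonomy parent, and that depth is measured against the current candidate set.

```python
def parents_of(itemset, parent):
    """All itemsets reachable by generalizing exactly one item."""
    return {frozenset(itemset - {i}) | {parent[i]} for i in itemset if i in parent}

def itemset_depth(itemset, candidate_set, parent):
    """0 if no parent itemset is a candidate, else 1 + max parent depth."""
    ps = parents_of(itemset, parent) & candidate_set
    if not ps:
        return 0
    return 1 + max(itemset_depth(p, candidate_set, parent) for p in ps)

# Candidates can then be counted depth by depth, deleting descendants of
# any candidate that failed minsup before the next counting step.
```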
32
Tradeoff Optimizations
  • The tradeoff: number of candidates counted vs. passes over the DB.
  • Cumulate: count all candidates of a pass together.
  • Stratify: count each depth on a different pass.
  • Optimization 1: count multiple depths together from a certain level on.
  • Optimization 2: count more than 20% of the candidates per pass.
33
Version 2: Estimate
  • Estimating candidates' support using a sample (see the sketch below).
  • 1st pass (C'k):
  • count candidates that are expected to have minsup (candidates whose
    support in the sample is at least 0.9 × minsup), and
  • count candidates whose parents are expected to have minsup.
  • 2nd pass (C''k):
  • count children of candidates in C'k that were not expected to have
    minsup.

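A sketch of the sampling test (names are ours), reusing `count_pass` from the Basic sketch; the 0.9 slack factor is the one quoted on the slide.

```python
def expected_frequent(candidates, sample, minsup, slack=0.9):
    """Candidates whose support in the sample reaches slack * minsup."""
    counts = count_pass(sample, candidates)
    threshold = slack * minsup * len(sample)
    return {c for c in candidates if counts[c] >= threshold}

# First full pass counts expected_frequent(...) plus candidates whose
# parents were expected frequent; children of the rest wait for pass two.
```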
34
Example for Estimate
  • minsup = 5%

  Candidate Itemsets     Support in   Support in Database
                         Sample       Scenario A   Scenario B
  {Clothes, Shoes}       8%           7%           9%
  {Outerwear, Shoes}     4%           4%           6%
  {Jacket, Shoes}        2%           -            -
35
Version 3: EstMerge
  • Motivation: eliminate the 2nd pass of algorithm Estimate.
  • Implementation: count these candidates of Ck together with the
    candidates in Ck+1.
  • Restriction: to create Ck+1, we must assume that all candidates in Ck
    have minsup.
  • The tradeoff: extra candidates counted by EstMerge vs. the extra pass
    made by Estimate.

36
Algorithm EstMerge
37
Stratify - Variants
38
Size of Sample
  • Pr[support in sample < a], for an itemset with true support p and a
    sample of n transactions (a reconstruction sketch follows below):

               p = 5%           p = 1%           p = 0.5%         p = 0.1%
               a=.8p   a=.9p    a=.8p   a=.9p    a=.8p   a=.9p    a=.8p   a=.9p
  n=1000       0.32    0.76     0.80    0.95     0.89    0.97     0.98    0.99
  n=10,000     0.00    0.07     0.11    0.59     0.34    0.77     0.80    0.95
  n=100,000    0.00    0.00     0.00    0.01     0.00    0.07     0.12    0.60
  n=1,000,000  0.00    0.00     0.00    0.00     0.00    0.00     0.00    0.01

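The tabulated values are consistent with a Chernoff bound on the lower binomial tail; the following reconstruction is our reading of the numbers, not a formula quoted from the paper.

```python
# With true support p and sample size n, the number of supporting
# transactions is X ~ Binomial(n, p), and for a = (1 - d) * p the
# Chernoff bound gives Pr[X < (1-d)*n*p] <= (e^{-d} / (1-d)^{1-d})^{n*p}.
from math import exp, log

def chernoff_lower_tail(n, p, d):
    """Upper bound on Pr[sample support < (1 - d) * p]."""
    exponent = n * p * (-d - (1 - d) * log(1 - d))
    return exp(exponent)

print(round(chernoff_lower_tail(10000, 0.01, 0.2), 2))
# ~0.12; the corresponding table entry is 0.11
```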
39
Size of Sample
40
Performance Evaluation
  • Compare the running time of the 3 algorithms: Basic, Cumulate and
    EstMerge.
  • On synthetic data:
  • effect of each parameter on performance
  • On real data:
  • Supermarket Data
  • Department Store Data

41
Synthetic Data Generation
  Parameter                                                        Default Value
  D   Number of transactions                                       1,000,000
  T   Average size of the transactions                             10
  I   Average size of the maximal potentially frequent itemsets    4
  I   Number of maximal potentially frequent itemsets              10,000
  N   Number of items                                              100,000
  R   Number of roots                                              250
  L   Number of levels                                             4-5
  F   Fanout                                                       5
  D   Depth-ratio (probability that an item in a rule comes from
      level i / probability that it comes from level i+1)          1
42
Minimum Support
43
Number of Transactions
44
Fanout
45
Number of Items
46
Reality Check
  • Supermarket Data:
  • 548,000 items
  • Taxonomy: 4 levels, 118 roots
  • 1.5 million transactions
  • Average of 9.6 items per transaction
  • Department Store Data:
  • 228,000 items
  • Taxonomy: 7 levels, 89 roots
  • 570,000 transactions
  • Average of 4.4 items per transaction

47
Results
48
Conclusions
  • Cumulate and EstMerge were 2 to 5 times faster than Basic on all
    synthetic datasets. On the supermarket database they were 100 times
    faster!
  • EstMerge was 25-30% faster than Cumulate.
  • Both EstMerge and Cumulate exhibit linear scale-up with the number of
    transactions.

49
Summary
  • The use of a taxonomy is necessary for finding association rules
    between items at any level of the hierarchy.
  • The obvious solution (algorithm Basic) is not very fast.
  • New algorithms that exploit the taxonomy are much faster.
  • We can use the taxonomy to prune uninteresting rules.

50
  • THE END