1
Sampling Large Databases for Association Rules
  • Jingting Zeng
  • CIS 664 Presentation
  • March 13, 2007

2
Association Rules Outline
  • Association Rules Problem Overview
  • Association Rules Definitions
  • Previous Work on Association Rules
  • Toivonen's Algorithm
  • Experimental Results
  • Conclusion

3
Overview
  • Purpose
  • If people tend to buy A and B together, then a
    buyer of A is a good target for an advertisement
    for B.

4
The Market-Basket Example
  • Items frequently purchased together:
  • Bread ⇒ Peanut Butter
  • Uses
  • Placement
  • Advertising
  • Sales
  • Coupons
  • Objective: increase sales and reduce costs

5
Another Example
  • The same technology has other uses.
  • University course-enrollment data has been
    analyzed to find combinations of courses taken by
    the same students.

6
Scale of Problem
  • WalMart sells 100,000 items and can store
    hundreds of millions of baskets.
  • The Web has 100,000,000 words and several billion
    pages.

7
Association Rule Definitions
  • Set of items: I = {I1, I2, …, Im}
  • Transactions: D = {t1, t2, …, tn}, tj ⊆ I
  • Support of an itemset: percentage of transactions
    which contain that itemset.
  • Frequent itemset: itemset whose number of
    occurrences is above a threshold.

8
Association Rule Definitions
  • Association Rule (AR): an implication X ⇒ Y where
    X, Y ⊆ I and X ∩ Y = ∅
  • Support of AR (s), X ⇒ Y: percentage of
    transactions that contain X ∪ Y
  • Confidence of AR (α), X ⇒ Y: ratio of the number
    of transactions that contain X ∪ Y to the number
    that contain X

9
Example
  • B1 = {m, c, b}    B2 = {m, p, j}
  • B3 = {m, b}       B4 = {c, j}
  • B5 = {m, p, b}    B6 = {m, c, b, j}
  • B7 = {c, b, j}    B8 = {b, c}
  • Association Rule: {m, b} ⇒ c
  • Support = 2/8 = 25%
  • Confidence = 2/4 = 50%
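
The support and confidence numbers above can be checked in a few lines of Python (an illustrative sketch, not part of the original deck; `support` and `confidence` are hypothetical helper names, and the baskets are the eight from this slide):

```python
# Baskets B1..B8 from the slide, as Python sets of single-letter items.
baskets = [
    {"m", "c", "b"}, {"m", "p", "j"},
    {"m", "b"},      {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"},
    {"c", "b", "j"}, {"b", "c"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(lhs, rhs, baskets):
    """support(lhs ∪ rhs) divided by support(lhs)."""
    return support(set(lhs) | set(rhs), baskets) / support(lhs, baskets)

print(support({"m", "b", "c"}, baskets))       # 0.25 (= 25%)
print(confidence({"m", "b"}, {"c"}, baskets))  # 0.5  (= 50%)
```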

10
Association Rule Problem
  • Given a set of items I = {I1, I2, …, Im} and a
    database of transactions D = {t1, t2, …, tn} where
    ti = {Ii1, Ii2, …, Iik} and Iij ∈ I, the Association
    Rule Problem is to identify all association rules
    X ⇒ Y whose support and confidence meet given
    minimum thresholds.

11
Association Rule Techniques
  • Find all frequent itemsets
  • Generate strong association rules from the
    frequent itemsets

12
Apriori Algorithm
  • A two-pass approach called Apriori limits the
    need for main memory.
  • Key idea (monotonicity): if a set of items
    appears in at least s baskets, so does every subset.
  • Contrapositive for pairs: if item i does not
    appear in s baskets, then no pair including i can
    appear in s baskets.

13
Apriori Algorithm (contd.)
  • Pass 1: Read baskets and count in main memory the
    occurrences of each item.
  • Requires memory proportional only to the number
    of items.
  • Pass 2: Read baskets again and count in main
    memory only those pairs both of whose items were
    found in Pass 1 to occur at least s times.
  • Requires memory proportional to the square of the
    number of frequent items only.
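
The two passes can be sketched in Python (a minimal illustration, not the original implementation; `frequent_pairs` is a hypothetical name, and the baskets are the ones from the earlier example):

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(baskets, s):
    """Two-pass Apriori for pairs, with support threshold s (a count)."""
    # Pass 1: count occurrences of each individual item in main memory.
    item_counts = Counter(item for b in baskets for item in b)
    frequent_items = {i for i, c in item_counts.items() if c >= s}
    # Pass 2: count only pairs both of whose items are frequent.
    pair_counts = Counter()
    for b in baskets:
        for pair in combinations(sorted(set(b) & frequent_items), 2):
            pair_counts[pair] += 1
    return {p: c for p, c in pair_counts.items() if c >= s}

baskets = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]
print(frequent_pairs(baskets, 3))
```

With threshold s = 3 this finds the pairs {b,c}, {b,m}, and {c,j}; item p is dropped in Pass 1, so no pair containing p is ever counted in Pass 2.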

14
Partitioning
  • Divide the database into partitions D1, D2, …, Dp
  • Apply Apriori to each partition
  • Any large itemset must be large in at least one
    partition.

15
Partitioning Algorithm
  • Divide D into partitions D1, D2, …, Dp
  • For i = 1 to p do
  •   Li = Apriori(Di)
  • C = L1 ∪ … ∪ Lp
  • Count C on D to generate L
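
The steps above can be sketched in Python (an illustration; a brute-force `find_frequent` stands in for Apriori on each partition, and all names are hypothetical):

```python
from itertools import chain, combinations

def find_frequent(chunk, s):
    """Stand-in for Apriori on one partition: brute-force all itemsets
    up to size 2 with count >= s in the chunk."""
    items = sorted({i for b in chunk for i in b})
    cands = chain(((i,) for i in items), combinations(items, 2))
    return {c for c in cands if sum(set(c) <= b for b in chunk) >= s}

def partitioned_frequent(baskets, s, p):
    """Find locally frequent itemsets in each of p partitions (any
    globally large itemset must be large in at least one partition),
    then verify the union of candidates with one count over all of D."""
    n = len(baskets)
    chunks = [baskets[i * n // p:(i + 1) * n // p] for i in range(p)]
    candidates = set()
    for chunk in chunks:
        local_s = max(1, s * len(chunk) // n)  # scale threshold to chunk size
        candidates |= find_frequent(chunk, local_s)
    # Final pass: count every candidate on the whole database.
    return {c for c in candidates
            if sum(set(c) <= b for b in baskets) >= s}

baskets = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]
print(partitioned_frequent(baskets, 3, 2))
```

On the eight example baskets with s = 3 and p = 2, the final count keeps the singletons b, c, j, m and the pairs {b,c}, {b,m}, {c,j}, discarding locally frequent candidates such as {c,m} that are not frequent globally.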

16
Sampling
  • For large databases, sample the database and
    apply Apriori to the sample.
  • Potentially Frequent Itemsets (PL): large
    itemsets from the sample
  • Negative Border (BD⁻):
  • Generalization of Apriori-Gen applied to itemsets
    of varying sizes.
  • Minimal set of itemsets which are not in PL but
    whose subsets are all in PL.

17
Negative Border Example
  • Let the set of items be {A, …, F} and the
    itemsets in PL be
  • {A}, {B}, {C}, {F}, {A,B}, {A,C}, {A,F},
    {C,F}, {A,C,F}
  • The whole negative border is
  • {B,C}, {B,F}, {D}, {E}
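
For a small item universe the border can be computed by brute force (an illustrative Python sketch; `negative_border` is a hypothetical name):

```python
from itertools import combinations

def negative_border(universe, pl):
    """Minimal itemsets not in PL whose immediate subsets are all in PL.
    The empty set is treated as trivially in PL, which is why missing
    singletons like {D} and {E} land in the border."""
    pl = {frozenset(s) for s in pl} | {frozenset()}
    border = set()
    for k in range(1, len(universe) + 1):
        for xs in combinations(sorted(universe), k):
            x = frozenset(xs)
            # x is in the border if it is missing from PL but every
            # subset one item smaller is present.
            if x not in pl and all(x - {i} in pl for i in x):
                border.add(x)
    return border

pl = [{"A"}, {"B"}, {"C"}, {"F"}, {"A", "B"}, {"A", "C"},
      {"A", "F"}, {"C", "F"}, {"A", "C", "F"}]
print(negative_border("ABCDEF", pl))  # {B,C}, {B,F}, {D}, {E} as frozensets
```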

18
Toivonen's Algorithm
  • Start as in the simple algorithm, but lower the
    threshold slightly for the sample.
  • Example: if the sample is 1% of the baskets, use
    0.008s as the support threshold rather than
    0.01s.
  • The goal is to avoid missing any itemset that is
    frequent in the full set of baskets.

19
Toivonen's Algorithm (contd.)
  • Add to the itemsets that are frequent in the
    sample the negative border of these itemsets.
  • An itemset is in the negative border if it is not
    deemed frequent in the sample, but all its
    immediate subsets are.
  • Example: ABCD is in the negative border if and
    only if it is not frequent, but all of ABC, BCD,
    ACD, and ABD are.

20
Toivonen's Algorithm (contd.)
  • In a second pass, count all candidate frequent
    itemsets from the first pass, and also count the
    negative border.
  • If no itemset from the negative border turns out
    to be frequent, then the candidates found to be
    frequent in the whole data are exactly the
    frequent itemsets.

21
Toivonen's Algorithm (contd.)
  • What if something in the negative border turns
    out to be actually frequent?
  • We must start over again!
  • But by choosing the support threshold for the
    sample wisely, we can make the probability of
    failure low, while still keeping the number of
    itemsets checked on the second pass small enough
    to fit in main memory.
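
The second pass and the restart check can be sketched as follows (an illustration with hypothetical names; in the real algorithm, PL and its negative border would come from the sampling pass):

```python
def toivonen_second_pass(baskets, s, pl, border):
    """Count PL and its negative border over the full database.
    Returns (frequent, True) on success, or (None, False) if some
    border itemset is frequent and the algorithm must start over."""
    counts = {c: 0 for c in pl | border}
    for b in baskets:
        for c in counts:
            if c <= b:  # frozenset <= set: itemset contained in basket
                counts[c] += 1
    if any(counts[c] >= s for c in border):
        return None, False  # a border itemset is frequent: resample
    return {c for c in pl if counts[c] >= s}, True

baskets = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]
pl = {frozenset({"b"}), frozenset({"m"}), frozenset({"b", "m"})}
# {p} occurs only twice, so with s = 3 the border check passes
# and every itemset in PL is confirmed frequent.
print(toivonen_second_pass(baskets, 3, pl, {frozenset({"p"})}))
```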

22
Experiment
  • Synthetic data set characteristics (T = average
    row size, I = average size of the maximal
    frequent sets)

23
Experiment (contd.)
Lowered frequency thresholds () for probability
of missing any given frequent set is less than d
0.001
24
Number of trials with misses
25
Conclusions
  • Advantages:
  • Reduced failure probability, while keeping the
    candidate count low enough for main memory
  • Disadvantages:
  • Potentially large number of candidates in the
    second pass

26
  • Thank you!