Title: Sampling Large Databases for Association Rules
1. Sampling Large Databases for Association Rules
- Jingting Zeng
- CIS 664 Presentation
- March 13, 2007
2. Association Rules Outline
- Association Rules Problem Overview
- Association Rules Definitions
- Previous Work on Association Rules
- Toivonen's Algorithm
- Experiment Results
- Conclusion
3. Overview
- Purpose
- If people tend to buy A and B together, then a
buyer of A is a good target for an advertisement
for B.
4. The Market-Basket Example
- Items frequently purchased together
- Bread → Peanut Butter
- Uses
- Placement
- Advertising
- Sales
- Coupons
- Objective: increase sales and reduce costs
5. Another Example
- The same technology has other uses
- University course enrollment data has been
analyzed to find combinations of courses taken by
the same students
6. Scale of the Problem
- Wal-Mart sells 100,000 items and can store hundreds of millions of baskets.
- The Web has 100,000,000 words and several billion pages.
7. Association Rule Definitions
- Set of items: I = {I1, I2, …, Im}
- Transactions: D = {t1, t2, …, tn}, tj ⊆ I
- Support of an itemset: percentage of transactions that contain that itemset.
- Frequent itemset: itemset whose number of occurrences is above a threshold.
8. Association Rule Definitions (contd.)
- Association Rule (AR): an implication X → Y where X, Y ⊆ I and X ∩ Y = ∅
- Support of AR (s): X → Y: percentage of transactions that contain X ∪ Y
- Confidence of AR (α): X → Y: ratio of the number of transactions that contain X ∪ Y to the number that contain X
9. Example
- B1 = {m, c, b}    B2 = {m, p, j}
- B3 = {m, b}       B4 = {c, j}
- B5 = {m, p, b}    B6 = {m, c, b, j}
- B7 = {c, b, j}    B8 = {b, c}
- Association Rule: {m, b} → c
- Support = 2/8 = 25%
- Confidence = 2/4 = 50% (both values are checked in the sketch below)
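To make the arithmetic concrete, here is a minimal Python check of the support and confidence above, using exactly the eight baskets B1–B8 from this slide:

```python
# The eight example baskets from the slide.
baskets = [
    {"m", "c", "b"}, {"m", "p", "j"},
    {"m", "b"},      {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"},
    {"c", "b", "j"}, {"b", "c"},
]

X, Y = {"m", "b"}, {"c"}  # the rule {m, b} -> c

# Support: fraction of baskets containing X ∪ Y.
support = sum(1 for b in baskets if X | Y <= b) / len(baskets)

# Confidence: baskets containing X ∪ Y divided by baskets containing X.
confidence = (sum(1 for b in baskets if X | Y <= b)
              / sum(1 for b in baskets if X <= b))

print(support, confidence)  # 0.25 0.5, i.e. 25% and 50%
```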
10. Association Rule Problem
- Given a set of items I = {I1, I2, …, Im} and a database of transactions D = {t1, t2, …, tn}, where ti = {Ii1, Ii2, …, Iik} and Iij ∈ I, the Association Rule Problem is to identify all association rules X → Y with a minimum support and confidence threshold.
11. Association Rule Techniques
- Find all frequent itemsets
- Generate strong association rules from the frequent itemsets (sketched below)
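The second step can be sketched as follows. This is an illustrative sketch, not code from the paper: `generate_rules` and the `frequent` support-count map are hypothetical names, and the code relies on monotonicity to guarantee that every subset of a frequent itemset is also present in `frequent`.

```python
from itertools import combinations

def generate_rules(frequent, min_conf):
    """Emit strong rules X -> Y from a map of frequent itemsets
    (frozenset -> support count). Hypothetical helper for illustration."""
    rules = []
    for itemset, count in frequent.items():
        if len(itemset) < 2:
            continue
        # Try every nonempty proper subset as the left-hand side X.
        for r in range(1, len(itemset)):
            for X in map(frozenset, combinations(itemset, r)):
                # conf(X -> Y) = supp(X ∪ Y) / supp(X); X is in
                # `frequent` by monotonicity.
                conf = count / frequent[X]
                if conf >= min_conf:
                    rules.append((X, itemset - X, conf))
    return rules
```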
12. A-Priori Algorithm
- A two-pass approach called A-Priori limits the need for main memory.
- Key idea, monotonicity: if a set of items appears at least s times, so does every subset.
- Contrapositive for pairs: if item i does not appear in s baskets, then no pair including i can appear in s baskets.
13. A-Priori Algorithm (contd.)
- Pass 1: read baskets and count the occurrences of each item in main memory.
- Requires memory proportional only to the number of items.
- Pass 2: read baskets again and count in main memory only those pairs both of whose items were found in Pass 1 to occur at least s times.
- Requires memory proportional only to the square of the number of frequent items (both passes are sketched below).
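A compact Python sketch of the two passes, restricted (as on this slide) to counting pairs. It assumes the baskets are an in-memory list of sets and that s is an absolute count; `apriori_pairs` is an illustrative name, not the paper's code.

```python
from collections import Counter
from itertools import combinations

def apriori_pairs(baskets, s):
    """Two-pass A-Priori restricted to pairs, as described above."""
    # Pass 1: count individual items in main memory.
    item_counts = Counter(item for b in baskets for item in b)
    frequent_items = {i for i, c in item_counts.items() if c >= s}

    # Pass 2: count only pairs whose items both survived Pass 1.
    pair_counts = Counter()
    for b in baskets:
        survivors = sorted(i for i in b if i in frequent_items)
        for pair in combinations(survivors, 2):
            pair_counts[pair] += 1
    return {p: c for p, c in pair_counts.items() if c >= s}
```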
14. Partitioning
- Divide the database into partitions D1, D2, …, Dp
- Apply A-Priori to each partition
- Any large itemset must be large in at least one partition.
15. Partitioning Algorithm
- Divide D into partitions D1, D2, …, Dp
- For i = 1 to p do
-   Li = Apriori(Di)
- C = L1 ∪ … ∪ Lp
- Count C on D to generate L (see the sketch below)
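A sketch of this scheme under simple assumptions: `D` is a list of transaction sets, `s` is an absolute support count over all of `D`, and `local_apriori` is a stand-in for any routine (such as A-Priori) that returns the locally large itemsets of a partition; `partition_count` is an illustrative name.

```python
def partition_count(D, p, local_apriori, s):
    """Partitioning scheme above: union the locally large itemsets
    of each partition, then verify them in one full pass over D."""
    size = (len(D) + p - 1) // p
    partitions = [D[i:i + size] for i in range(0, len(D), size)]

    # Any globally large itemset is large in at least one partition,
    # so C is a superset of the true frequent itemsets L.
    C = set()
    for Di in partitions:
        # Scale the support threshold down to the partition's size.
        C |= local_apriori(Di, s * len(Di) / len(D))

    # Count every candidate in C over the whole database D.
    return {X for X in C if sum(1 for t in D if X <= t) >= s}
```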
16. Sampling
- Large databases
- Sample the database and apply A-Priori to the sample.
- Potentially Frequent Itemsets (PL): large itemsets from the sample
- Negative Border (BD−):
- Generalization of Apriori-Gen applied to itemsets of varying sizes.
- Minimal set of itemsets which are not in PL, but whose subsets are all in PL.
17. Negative Border Example
- Let the items be {A, …, F} and let PL contain the itemsets
- {A}, {B}, {C}, {F}, {A,B}, {A,C}, {A,F}, {C,F}, {A,C,F}
- The whole negative border is
- {B,C}, {B,F}, {D}, {E} (computed in the sketch below)
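The border of this example can be computed mechanically. Below is a brute-force sketch (`negative_border` is a hypothetical helper; a real implementation would generate candidates Apriori-Gen style rather than enumerating every subset of the item universe):

```python
from itertools import combinations

def negative_border(PL, items):
    """Itemsets not in PL all of whose immediate subsets are in PL.
    Singletons qualify vacuously, which yields {D} and {E} above."""
    PL = {frozenset(s) for s in PL}
    border = set()
    top = max((len(s) for s in PL), default=0) + 1
    for k in range(1, top + 1):
        for cand in map(frozenset, combinations(items, k)):
            if cand in PL:
                continue
            if k == 1 or all(cand - {i} in PL for i in cand):
                border.add(cand)
    return border

PL = [{'A'}, {'B'}, {'C'}, {'F'}, {'A', 'B'}, {'A', 'C'},
      {'A', 'F'}, {'C', 'F'}, {'A', 'C', 'F'}]
print(negative_border(PL, 'ABCDEF'))
# {B,C}, {B,F}, {D}, {E} -- matching the slide
```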
18. Toivonen's Algorithm
- Start as in the simple algorithm, but lower the threshold slightly for the sample.
- Example: if the sample is 1% of the baskets, use 0.008s as the support threshold rather than 0.01s.
- The goal is to avoid missing any itemset that is frequent in the full set of baskets.
19. Toivonen's Algorithm (contd.)
- Add to the itemsets that are frequent in the sample the negative border of these itemsets.
- An itemset is in the negative border if it is not deemed frequent in the sample, but all its immediate subsets are.
- Example: {A,B,C,D} is in the negative border if and only if it is not frequent in the sample, but all of {A,B,C}, {B,C,D}, {A,C,D}, and {A,B,D} are.
20. Toivonen's Algorithm (contd.)
- In a second pass, count all candidate frequent itemsets from the first pass, and also count the itemsets in the negative border.
- If no itemset from the negative border turns out to be frequent, then the candidates found to be frequent in the whole data are exactly the frequent itemsets.
21. Toivonen's Algorithm (contd.)
- What if we find that something in the negative border is actually frequent?
- We must start over again with a new sample!
- But by choosing the support threshold for the sample wisely, we can make the probability of failure low, while still keeping the number of itemsets checked on the second pass small enough to fit in main memory (one full round is sketched below).
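Putting slides 18–21 together, here is one round of the algorithm as a hedged, end-to-end sketch. It assumes the `negative_border` function from the earlier example is in scope, uses a deliberately naive exhaustive counter in place of A-Priori on the sample, and treats `s` as a fractional support threshold, so `lower=0.8` reproduces the 0.008-versus-0.01 example; `toivonen_round` and `frequent_itemsets` are illustrative names.

```python
import random
from itertools import combinations

def frequent_itemsets(baskets, min_count):
    """Naive exhaustive counter standing in for A-Priori on the sample."""
    items = sorted({i for b in baskets for i in b})
    result = set()
    for k in range(1, len(items) + 1):
        level = {frozenset(c) for c in combinations(items, k)
                 if sum(1 for b in baskets if set(c) <= b) >= min_count}
        if not level:
            break  # by monotonicity, no larger itemset can be frequent
        result |= level
    return result

def toivonen_round(D, items, s, sample_frac=0.01, lower=0.8):
    """One round: mine a sample at a lowered threshold, then make one
    full pass over PL plus its negative border (sketched earlier)."""
    sample = random.sample(D, max(1, int(len(D) * sample_frac)))
    PL = frequent_itemsets(sample, lower * s * len(sample))
    border = negative_border(PL, items)

    # Full pass: count the candidates and the border together.
    counts = {X: sum(1 for t in D if X <= t) for X in PL | border}
    frequent = {X for X, c in counts.items() if c >= s * len(D)}

    if frequent & border:
        return None  # a border itemset is frequent: resample and retry
    return frequent  # otherwise exactly the frequent itemsets of D
```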
22. Experiment
- Synthetic data set characteristics (T = average row size, I = average size of the maximal frequent sets)
23. Experiment (contd.)
Lowered frequency thresholds (%) such that the probability of missing any given frequent set is less than δ = 0.001
24. Number of trials with misses
25. Conclusions
- Advantages
- Reduced failure probability, while keeping the candidate count low enough for main memory
- Disadvantages
- Potentially large number of candidates in the second pass