Title: Association Rules
1Association Rules
2Computation Model
- Typically, data is kept in a flat file rather than a database system.
  - Stored on disk.
  - Stored basket-by-basket.
- The true cost of mining disk-resident data is usually the number of disk I/Os.
- In practice, association-rule algorithms read the data in passes: all baskets are read in turn.
- Thus, we measure the cost by the number of passes an algorithm takes.
3Main-Memory Bottleneck
- For many frequent-itemset algorithms, main memory is the critical resource.
  - As we read baskets, we need to count something, e.g., occurrences of pairs.
  - The number of different things we can count is limited by main memory.
  - Swapping counts in/out is a disaster.
4Finding Frequent Pairs
- The hardest problem often turns out to be finding the frequent pairs.
- We'll concentrate on how to do that, then discuss extensions to finding frequent triples, etc.
5Naïve Algorithm
- Read file once, counting in main memory the occurrences of each pair.
  - From each basket of n items, generate its $n(n-1)/2$ pairs by two nested loops.
- Fails if $(\#\text{items})^2$ exceeds main memory.
  - Remember: #items can be 100K (Wal-Mart) or 10B (Web pages).
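The naive one-pass count can be sketched as follows. This is a minimal illustration, not a tuned implementation: the function name `count_pairs_naive` and the toy baskets are assumptions made for the example, and `itertools.combinations` stands in for the two nested loops.

```python
from collections import Counter
from itertools import combinations

def count_pairs_naive(baskets):
    """Count every pair of items that co-occurs in a basket, in one pass."""
    counts = Counter()
    for basket in baskets:
        # Sort so that (a, b) and (b, a) map to the same key.
        items = sorted(set(basket))
        # combinations(items, 2) yields the n(n-1)/2 pairs of the nested loops.
        for pair in combinations(items, 2):
            counts[pair] += 1
    return counts

# Hypothetical toy baskets, for illustration only.
baskets = [["milk", "bread", "beer"], ["bread", "milk"], ["beer", "diapers"]]
pair_counts = count_pairs_naive(baskets)
```

The counter holds one entry per distinct pair seen, which is exactly what stops fitting in main memory when the number of items is large.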
6Details of Main-Memory Counting
- Two approaches:
  - (1) Count all pairs, using a triangular matrix.
  - (2) Keep a table of triples [i, j, c] = "the count of the pair of items {i, j} is c."
- (1) requires only 4 bytes/pair.
  - Note: assume integers are 4 bytes.
- (2) requires 12 bytes, but only for those pairs with count > 0.
7[Figure: Method (1) uses 4 bytes per pair; Method (2) uses 12 bytes per occurring pair.]
8Triangular-Matrix Approach (1)
- Number items 1, 2, ..., n.
  - Requires a table of size O(n) to map the original items to these numbers.
- Count {i, j} only if i < j.
- Keep pairs in the order
  - {1,2}, {1,3}, ..., {1,n},
  - {2,3}, {2,4}, ..., {2,n},
  - {3,4}, ..., {3,n},
  - ...
  - {n-1,n}.
9Triangular-Matrix Approach (2)
- Let n be the number of items. Find pair {i, j}, with i < j, at the position $(i-1)n - i(i+1)/2 + j$.
  - {1,2}, {1,3}, {1,4},
  - {2,3}, {2,4}
  - {3,4}
- Total number of pairs is $n(n-1)/2$; total bytes about $2n^2$.
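As a quick check of the position formula, the sketch below computes the 1-based position of each pair {i, j} with i < j and confirms that the pairs fill positions 1 through n(n-1)/2 in the stated order. The function name `pair_position` is an assumption made for this example.

```python
def pair_position(i, j, n):
    """1-based position of pair {i, j}, 1 <= i < j <= n, in the triangular order."""
    if not 1 <= i < j <= n:
        raise ValueError("need 1 <= i < j <= n")
    # i * (i + 1) is always even, so the integer division is exact.
    return (i - 1) * n - i * (i + 1) // 2 + j

n = 5
# Enumerate pairs in the order {1,2}, {1,3}, ..., {1,n}, {2,3}, ..., {n-1,n}.
positions = [pair_position(i, j, n)
             for i in range(1, n)
             for j in range(i + 1, n + 1)]
```

With a 0-based array, the count for {i, j} would live at index `pair_position(i, j, n) - 1`.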
10Details of Approach 2
- Total bytes used is about 12p, where p is the number of pairs that actually occur.
  - Beats triangular matrix if at most 1/3 of possible pairs actually occur.
- May require extra space for retrieval structure, e.g., a hash table.
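The 1/3 break-even point follows from comparing 4 bytes for each of the n(n-1)/2 possible pairs against 12 bytes for each of the p occurring pairs. A small sketch of that arithmetic, assuming 4-byte integers and ignoring the extra space of the retrieval structure (the function names are illustrative):

```python
def triangular_bytes(n, int_bytes=4):
    """Bytes for one counter per possible pair of n items."""
    return int_bytes * n * (n - 1) // 2

def triples_bytes(p, int_bytes=4):
    """Bytes for p stored triples (i, j, c), ignoring hash-table overhead."""
    return 3 * int_bytes * p

n = 100_000                      # e.g., 100K items
possible = n * (n - 1) // 2      # number of possible pairs
break_even = possible // 3       # triples are no worse up to this many occurring pairs
```

Any real retrieval structure adds overhead per triple, so in practice the triples method needs somewhat fewer than 1/3 of the pairs to occur before it wins.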
11Apriori Algorithm for pairs (1)
- A two-pass approach called a-priori limits the need for main memory.
- Key idea: monotonicity. If a set of items appears at least s times, so does every subset.
- Contrapositive for pairs: if item i does not appear in s baskets, then no pair including i can appear in s baskets.
12Apriori Algorithm for pairs (2)
- Pass 1: Read baskets and count in main memory the occurrences of each item.
  - Requires only memory proportional to the number of items.
- Pass 2: Read baskets again and count in main memory only those pairs both of whose elements were found in Pass 1 to be frequent.
  - Requires memory proportional to the square of the number of frequent items only.
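The two passes can be sketched as below. This is a minimal in-memory illustration: `apriori_pairs` and the toy baskets are assumptions made for the example, and the list is simply iterated twice where a real run would read the basket file from disk twice.

```python
from collections import Counter
from itertools import combinations

def apriori_pairs(baskets, s):
    """Return {pair: count} for pairs that appear in at least s baskets."""
    # Pass 1: count individual items.
    item_counts = Counter()
    for basket in baskets:
        item_counts.update(set(basket))
    frequent_items = {item for item, c in item_counts.items() if c >= s}

    # Pass 2: count only pairs whose two items are both frequent.
    pair_counts = Counter()
    for basket in baskets:
        kept = sorted(set(basket) & frequent_items)
        pair_counts.update(combinations(kept, 2))
    return {pair: c for pair, c in pair_counts.items() if c >= s}

# Hypothetical toy baskets, for illustration only.
baskets = [
    ["milk", "bread", "beer"],
    ["bread", "milk"],
    ["beer", "diapers"],
    ["milk", "bread", "diapers"],
]
frequent_pairs = apriori_pairs(baskets, s=3)
```

With s = 3 only milk and bread survive Pass 1, so Pass 2 keeps a single counter instead of one for every pair seen.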
13Detail for A-Priori
- You can use the triangular matrix method with n = number of frequent items.
  - Saves space compared with storing triples.
- Trick: number frequent items 1, 2, ... and keep a table relating new numbers to original item numbers.
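The renumbering trick might look like this. The names `renumber_frequent`, `new_number`, and `original`, and the item counts, are hypothetical and serve only to illustrate the two lookup tables.

```python
def renumber_frequent(item_counts, s):
    """Map each frequent item to a new number 1..m and keep the reverse table."""
    frequent = sorted(item for item, c in item_counts.items() if c >= s)
    new_number = {item: idx for idx, item in enumerate(frequent, start=1)}
    original = {idx: item for item, idx in new_number.items()}
    return new_number, original

# Hypothetical Pass 1 result: original item number -> count.
item_counts = {17: 5, 42: 1, 99: 3, 256: 4}
new_number, original = renumber_frequent(item_counts, s=3)

m = len(new_number)                  # number of frequent items
counts = [0] * (m * (m - 1) // 2)    # triangular matrix over frequent items only
```

In Pass 2, each basket item is translated through `new_number` (and dropped if absent) before its pairs are counted; `original` translates results back at the end.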
14Frequent Triples, Etc.
- For each k, we construct two sets of k-tuples:
  - $C_k$ = candidate k-tuples = those that might be frequent sets (support ≥ s) based on information from the pass for k-1.
  - $F_k$ = the set of truly frequent k-tuples.
15Lattice of Itemsets
Given d items, there are $2^d$ possible candidate itemsets.
16Illustrating Apriori Principle
17Full Apriori Algorithm
- Let k = 1
- Generate frequent itemsets of length 1
- Repeat until no new frequent itemsets are found
  - k = k + 1
  - Generate length-k candidate itemsets from length-(k-1) frequent itemsets
  - Prune candidate itemsets containing subsets of length k-1 that are infrequent
  - Count the support of each candidate by scanning the DB and eliminate candidates that are infrequent, leaving only those that are frequent
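The loop above can be sketched end to end as follows. This is one possible rendering under stated assumptions: the function name `apriori` is illustrative, candidates are generated by taking unions of two frequent (k-1)-itemsets (the simplest complete choice, not necessarily the most efficient), and the transactions are the nine-transaction data set used in the worked example in these slides with a minimum support count of 2.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {frozenset(itemset): support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    def support_counts(candidates):
        # One scan of the data: count the transactions containing each candidate.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return counts

    # k = 1: frequent single items.
    items = {item for t in transactions for item in t}
    counts = support_counts({frozenset([i]) for i in items})
    frequent = {c: n for c, n in counts.items() if n >= min_count}
    result = dict(frequent)

    k = 1
    while frequent:
        k += 1
        prev = set(frequent)
        # Generate: unions of two frequent (k-1)-itemsets with exactly k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune: drop candidates that have an infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in prev
                             for sub in combinations(c, k - 1))}
        # Count supports and keep only the frequent candidates.
        counts = support_counts(candidates)
        frequent = {c: n for c, n in counts.items() if n >= min_count}
        result.update(frequent)
    return result

transactions = [
    ["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"],
    ["I1", "I2", "I4"], ["I1", "I3"], ["I2", "I3"],
    ["I1", "I3"], ["I1", "I2", "I3", "I5"], ["I1", "I2", "I3"],
]
frequent_itemsets = apriori(transactions, min_count=2)
```

Each iteration of the `while` loop corresponds to one pass over the data, which is the cost measure used throughout these slides.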
18Candidate generation
- An effective candidate generation procedure:
  - Should avoid generating too many unnecessary candidates.
  - Must ensure that the candidate set is complete, i.e., no frequent itemset is left out.
  - Should not generate the same candidate itemset more than once.
19Data Set Example
s = 3
20Fk-1 × F1 Method
- Extend each frequent (k-1)-itemset with a frequent 1-itemset.
- Is it complete?
  - Yes, because every frequent k-itemset is composed of
    - a frequent (k-1)-itemset and
    - a frequent 1-itemset.
- However, it doesn't prevent the same candidate itemset from being generated more than once.
  - E.g., {Bread, Diapers, Milk} can be generated by merging
    - {Bread, Diapers} with {Milk},
    - {Bread, Milk} with {Diapers}, or
    - {Diapers, Milk} with {Bread}.
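The duplication is easy to see by counting how often each candidate is produced. In the sketch below, the function name and the sets `F1` and `F2` are assumptions chosen to match the Bread/Diapers/Milk example; they are not taken from the slide's data set.

```python
from collections import Counter

def extend_with_singletons(freq_k_minus_1, freq_items):
    """Naive F(k-1) x F1: count how many times each candidate is produced."""
    produced = Counter()
    for itemset in freq_k_minus_1:
        for item in freq_items:
            if item not in itemset:
                produced[frozenset(itemset) | {item}] += 1
    return produced

# Assumed frequent 1-itemsets and 2-itemsets, for illustration only.
F1 = ["Beer", "Bread", "Diapers", "Milk"]
F2 = [("Beer", "Diapers"), ("Bread", "Diapers"),
      ("Bread", "Milk"), ("Diapers", "Milk")]
produced = extend_with_singletons(F2, F1)
```

Here {Bread, Diapers, Milk} is produced three times, once for each of the merges listed above.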
21Lexicographic Order
- Avoid generating duplicate candidates by ensuring that the items in each frequent itemset are kept sorted in their lexicographic order.
- Each frequent (k-1)-itemset X is then extended with frequent items that are lexicographically larger than the items in X.
- For example, the itemset {Bread, Diapers} can be augmented with {Milk} since Milk is lexicographically larger than Bread and Diapers.
- However, we don't augment {Diapers, Milk} with {Bread} nor {Bread, Milk} with {Diapers} because they violate the lexicographic ordering condition.
- Why is it complete?
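A sketch of the ordered extension, under the same assumed `F1` and `F2` as in the Bread/Diapers/Milk example (hypothetical inputs, not the slide's data set). Each itemset is kept as a sorted tuple and extended only with items larger than its last element.

```python
def extend_in_order(freq_k_minus_1, freq_items):
    """F(k-1) x F1 with the ordering rule: extend X only with larger items."""
    candidates = []
    for itemset in freq_k_minus_1:
        x = tuple(sorted(itemset))
        for item in sorted(freq_items):
            if item > x[-1]:    # larger than the largest item of X
                candidates.append(x + (item,))
    return candidates

# Assumed frequent 1-itemsets and 2-itemsets, for illustration only.
F1 = ["Beer", "Bread", "Diapers", "Milk"]
F2 = [("Beer", "Diapers"), ("Bread", "Diapers"),
      ("Bread", "Milk"), ("Diapers", "Milk")]
candidates = extend_in_order(F2, F1)
```

Every candidate now has exactly one way to be produced: from its first k-1 items extended by its largest item.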
22Pruning
- E.g., merging {Beer, Diapers} with {Milk} is unnecessary because one of its subsets, {Beer, Milk}, is infrequent.
- Solution: Prune!
- How?
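One way to answer "How?" is to check every (k-1)-subset of a candidate against the frequent (k-1)-itemsets. The helper names and the set `F2` below are assumptions made for this sketch.

```python
from itertools import combinations

def has_infrequent_subset(candidate, freq_k_minus_1):
    """True if some (k-1)-subset of the k-item candidate is not frequent."""
    k = len(candidate)
    frequent = {frozenset(s) for s in freq_k_minus_1}
    return any(frozenset(sub) not in frequent
               for sub in combinations(candidate, k - 1))

def prune(candidates, freq_k_minus_1):
    """Keep only candidates all of whose (k-1)-subsets are frequent."""
    return [c for c in candidates
            if not has_infrequent_subset(c, freq_k_minus_1)]

# Assumed frequent 2-itemsets, for illustration only.
F2 = [("Beer", "Diapers"), ("Bread", "Diapers"),
      ("Bread", "Milk"), ("Diapers", "Milk")]
kept = prune([("Beer", "Diapers", "Milk"), ("Bread", "Diapers", "Milk")], F2)
```

The candidate {Beer, Diapers, Milk} is dropped without ever being counted, because {Beer, Milk} is not among the frequent pairs.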
23Fk-1 × F1 Example
{Beer, Diapers, Bread} and {Bread, Milk, Beer} aren't in fact generated if lexicographic order is considered.
24Fk-1 × Fk-1 Method
- Merge a pair of frequent (k-1)-itemsets only if their first k-2 items are identical.
- E.g., frequent itemsets {Bread, Diapers} and {Bread, Milk} are merged to form a candidate 3-itemset {Bread, Diapers, Milk}.
25Fk-1 × Fk-1 Method
- We don't merge {Beer, Diapers} with {Diapers, Milk} because the first item in both itemsets is different.
- But is this "don't merge" decision OK?
  - Indeed, if {Beer, Diapers, Milk} is a viable candidate, it would have been obtained by merging {Beer, Diapers} with {Beer, Milk} instead.
- Pruning
  - However, because each candidate is obtained by merging a pair of frequent (k-1)-itemsets, an additional candidate pruning step is needed to ensure that the remaining k-2 subsets of k-1 elements are frequent.
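The merge rule and the follow-up pruning step can be sketched together. The function name `merge_candidates` and the set `F2` are assumptions for this illustration; for simplicity the pruning here rechecks all k subsets of size k-1, including the two that are already known to be frequent.

```python
from itertools import combinations

def merge_candidates(freq_k_minus_1):
    """Merge sorted (k-1)-itemsets that share their first k-2 items, then prune."""
    prev = sorted(tuple(sorted(s)) for s in freq_k_minus_1)
    prev_set = set(prev)
    merged = []
    for a, b in combinations(prev, 2):   # a precedes b in lexicographic order
        if a[:-1] == b[:-1]:             # identical first k-2 items
            merged.append(a + (b[-1],))
    # Pruning step: every (k-1)-subset of a candidate must be frequent.
    kept = [c for c in merged
            if all(sub in prev_set for sub in combinations(c, len(c) - 1))]
    return merged, kept

# Assumed frequent 2-itemsets, for illustration only.
F2 = [("Beer", "Diapers"), ("Bread", "Diapers"),
      ("Bread", "Milk"), ("Diapers", "Milk")]
merged, kept = merge_candidates(F2)
```

With these inputs only {Bread, Diapers} and {Bread, Milk} share a first item, so {Beer, Diapers} is never merged with {Diapers, Milk}.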
26Fk-1 × Fk-1 Example
27Another Example
Min_sup_count = 2
TID   List of item IDs
T1    I1, I2, I5
T2    I2, I4
T3    I2, I3
T4    I1, I2, I4
T5    I1, I3
T6    I2, I3
T7    I1, I3
T8    I1, I2, I3, I5
T9    I1, I2, I3
28Generate C2 from F1 × F1
Min_sup_count = 2
(Transactions T1 to T9 as on slide 27.)
F1
Itemset    Sup. count
{I1}       6
{I2}       7
{I3}       6
{I4}       2
{I5}       2
C2
Itemset     Sup. count
{I1, I2}    4
{I1, I3}    4
{I1, I4}    1
{I1, I5}    2
{I2, I3}    4
{I2, I4}    2
{I2, I5}    2
{I3, I4}    0
{I3, I5}    1
{I4, I5}    0
29Generate C3 from F2 × F2
Min_sup_count = 2
(Transactions T1 to T9 as on slide 27.)
F2
Itemset     Sup. count
{I1, I2}    4
{I1, I3}    4
{I1, I5}    2
{I2, I3}    4
{I2, I4}    2
{I2, I5}    2
C3 (before pruning)
Itemset
{I1, I2, I3}
{I1, I2, I5}
{I1, I3, I5}
{I2, I3, I4}
{I2, I3, I5}
{I2, I4, I5}
Prune
Itemset         Result
{I1, I2, I3}    kept
{I1, I2, I5}    kept
{I1, I3, I5}    pruned: {I3, I5} is infrequent
{I2, I3, I4}    pruned: {I3, I4} is infrequent
{I2, I3, I5}    pruned: {I3, I5} is infrequent
{I2, I4, I5}    pruned: {I4, I5} is infrequent
F3
Itemset         Sup. count
{I1, I2, I3}    2
{I1, I2, I5}    2
30Generate C4 from F3 × F3
Min_sup_count = 2
(Transactions T1 to T9 as on slide 27.)
F3
Itemset         Sup. count
{I1, I2, I3}    2
{I1, I2, I5}    2
C4
Itemset
{I1, I2, I3, I5}
{I1, I2, I3, I5} is pruned because {I2, I3, I5} is infrequent, so C4 is empty and there are no frequent 4-itemsets.