Association Rules - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Association Rules


1
Association Rules
  • Apriori Algorithm

2
Computation Model
  • Typically, data is kept in a flat file rather
    than a database system.
  • Stored on disk.
  • Stored basket-by-basket.
  • The true cost of mining disk-resident data is
    usually the number of disk I/Os.
  • In practice, association-rule algorithms read the
    data in passes: all baskets are read in turn.
  • Thus, we measure the cost by the number of passes
    an algorithm takes.

3
Main-Memory Bottleneck
  • For many frequent-itemset algorithms, main memory
    is the critical resource.
  • As we read baskets, we need to count something,
    e.g., occurrences of pairs.
  • The number of different things we can count is
    limited by main memory.
  • Swapping counts in/out is a disaster.

4
Finding Frequent Pairs
  • The hardest problem often turns out to be finding
    the frequent pairs.
  • We'll concentrate on how to do that, then discuss
    extensions to finding frequent triples, etc.

5
Naïve Algorithm
  • Read file once, counting in main memory the
    occurrences of each pair.
  • From each basket of n items, generate its
    n(n-1)/2 pairs by two nested loops.
  • Fails if (#items)^2 exceeds main memory.
  • Remember: #items can be 100K (Wal-Mart) or 10B
    (Web pages).
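The naive pass can be sketched in a few lines of Python (a sketch only; the baskets here are made up for illustration):

```python
from itertools import combinations
from collections import Counter

def count_pairs_naive(baskets):
    """One pass over the file, counting every pair in main memory."""
    counts = Counter()
    for basket in baskets:
        # a basket of n items yields n(n-1)/2 pairs;
        # sorting makes (i, j) with i < j the canonical key
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts

baskets = [{"a", "b", "c"}, {"a", "c"}, {"b", "c"}]
counts = count_pairs_naive(baskets)
# ("a", "c") occurs in two baskets
```

The Counter holds one entry per distinct pair, which is exactly what overflows memory when the number of items is large.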

6
Details of Main-Memory Counting
  • Two approaches:
  • (1) Count all pairs, using a triangular matrix.
  • (2) Keep a table of triples [i, j, c]: the count of
    the pair of items {i, j} is c.
  • (1) requires only 4 bytes/pair.
  • Note: assume integers are 4 bytes.
  • (2) requires 12 bytes, but only for those pairs
    with count > 0.

7
(Diagram: Method (1), the triangular matrix, uses 4
bytes per pair; Method (2), the table of triples, uses
12 bytes per occurring pair.)
8
Triangular-Matrix Approach (1)
  • Number items 1, 2, ..., n.
  • Requires a table of size O(n^2).
  • Count {i, j} only if i < j.
  • Keep pairs in the order
  • {1,2}, {1,3}, ..., {1,n},
  • {2,3}, {2,4}, ..., {2,n},
  • {3,4}, ..., {3,n},
  • ...
  • {n-1,n}.

9
Triangular-Matrix Approach (2)
  • Let n be the number of items. Find pair {i, j}
    (i < j) at the position
  • (i-1)n - i(i+1)/2 + j
  • {1,2}, {1,3}, {1,4}, ...
  • {2,3}, {2,4}, ...
  • {3,4}, ...
  • Total number of pairs: n(n-1)/2; total bytes:
    about 2n^2.
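The position formula can be checked with a tiny helper (a sketch; `pair_position` is a name chosen here, not from the slides):

```python
def pair_position(i, j, n):
    """1-based position of pair {i, j} (i < j) in the layout
    {1,2}, {1,3}, ..., {1,n}, {2,3}, ..., {n-1,n}."""
    assert 1 <= i < j <= n
    return (i - 1) * n - i * (i + 1) // 2 + j

# With n = 4 the six pairs occupy positions 1..6:
# {1,2}->1, {1,3}->2, {1,4}->3, {2,3}->4, {2,4}->5, {3,4}->6
```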

10
Details of Approach 2
  • Total bytes used is about 12p, where p is the
    number of pairs that actually occur.
  • Beats triangular matrix if at most 1/3 of
    possible pairs actually occur.
  • May require extra space for retrieval structure,
    e.g., a hash table.

11
Apriori Algorithm for pairs (1)
  • A two-pass approach called a-priori limits the
    need for main memory.
  • Key idea: monotonicity. If a set of items appears
    at least s times, so does every subset of it.
  • Contrapositive for pairs: if item i does not
    appear in s baskets, then no pair including i
    can appear in s baskets.

12
Apriori Algorithm for pairs (2)
  • Pass 1: Read baskets and count in main memory the
    occurrences of each item.
  • Requires only memory proportional to the number
    of items.
  • Pass 2: Read baskets again and count in main
    memory only those pairs both of whose elements
    were found frequent in Pass 1.
  • Requires memory proportional to the square of the
    number of frequent items only.
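The two passes above can be sketched as follows (a sketch under the simplifying assumption that baskets are in-memory Python sets rather than a disk file):

```python
from itertools import combinations
from collections import Counter

def apriori_pairs(baskets, s):
    """Two-pass a-priori for frequent pairs, support threshold s."""
    # Pass 1: item counts need memory proportional to #items only.
    item_counts = Counter()
    for basket in baskets:
        item_counts.update(basket)
    frequent = {i for i, c in item_counts.items() if c >= s}

    # Pass 2: count a pair only if both members survived Pass 1.
    pair_counts = Counter()
    for basket in baskets:
        kept = sorted(i for i in basket if i in frequent)
        for pair in combinations(kept, 2):
            pair_counts[pair] += 1
    return {p: c for p, c in pair_counts.items() if c >= s}
```

With s = 2 and baskets {beer, diapers}, {beer, diapers, milk}, {beer, milk}, only the pairs (beer, diapers) and (beer, milk) come out frequent; (diapers, milk) is counted in Pass 2 but eliminated.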

13
Detail for A-Priori
  • You can use the triangular-matrix method with n =
    number of frequent items.
  • Saves space compared with storing triples.
  • Trick: renumber the frequent items 1, 2, ..., and
    keep a table relating the new numbers to the
    original item numbers.

14
Frequent Triples, Etc.
  • For each k, we construct two sets of k-tuples:
  • Ck = candidate k-tuples: those that might be
    frequent sets (support >= s), based on information
    from the pass for k-1.
  • Fk = the set of truly frequent k-tuples.

15
Lattice of Itemsets
Given d items, there are 2^d possible candidate
itemsets.
16
Illustrating Apriori Principle
17
Full Apriori Algorithm
  • Let k = 1.
  • Generate frequent itemsets of length 1.
  • Repeat until no new frequent itemsets are found:
  • k = k + 1.
  • Generate length-k candidate itemsets from
    length-(k-1) frequent itemsets.
  • Prune candidate itemsets containing subsets of
    length k-1 that are infrequent.
  • Count the support of each candidate by scanning
    the DB, and eliminate candidates that are
    infrequent, leaving only those that are frequent.
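The loop above can be sketched in Python (a compact sketch, not the slides' own code; frozensets stand in for itemsets and the "DB scan" is a loop over in-memory baskets):

```python
from itertools import combinations
from collections import Counter

def apriori(baskets, s):
    """All frequent itemsets at support threshold s."""
    baskets = [frozenset(b) for b in baskets]
    # k = 1: frequent single items
    counts = Counter(i for b in baskets for i in b)
    F = {frozenset([i]) for i, c in counts.items() if c >= s}
    frequent = set(F)
    k = 1
    while F:
        k += 1
        # Generate length-k candidates from length-(k-1) frequent itemsets.
        candidates = {a | b for a in F for b in F if len(a | b) == k}
        # Prune candidates with an infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in F
                             for sub in combinations(c, k - 1))}
        # Count support by scanning the DB; keep only the frequent ones.
        counts = Counter(c for b in baskets for c in candidates if c <= b)
        F = {c for c in candidates if counts[c] >= s}
        frequent |= F
    return frequent
```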

18
Candidate generation
  • An effective candidate generation procedure
  • should avoid generating too many unnecessary
    candidates,
  • must ensure that the candidate set is complete, and
  • should not generate the same candidate itemset
    more than once.

19
Data Set Example
s = 3
20
Fk-1 × F1 Method
  • Extend each frequent (k-1)-itemset with a
    frequent 1-itemset.
  • Is it complete?
  • Yes, because every frequent k-itemset is composed
    of
  • a frequent (k-1)-itemset and
  • a frequent 1-itemset.
  • However, it doesn't prevent the same candidate
    itemset from being generated more than once.
  • E.g., {Bread, Diapers, Milk} can be generated by
    merging
  • {Bread, Diapers} with {Milk},
  • {Bread, Milk} with {Diapers}, or
  • {Diapers, Milk} with {Bread}.

21
Lexicographic Order
  • Avoid generating duplicate candidates by ensuring
    that the items in each frequent itemset are kept
    sorted in their lexicographic order.
  • Each frequent (k-1)-itemset X is then extended
    with frequent items that are lexicographically
    larger than the items in X.
  • For example, the itemset {Bread, Diapers} can be
    augmented with {Milk} since Milk is
    lexicographically larger than Bread and Diapers.
  • However, we don't augment {Diapers, Milk} with
    Bread nor {Bread, Milk} with Diapers, because
    they violate the lexicographic ordering
    condition.
  • Why is it complete?
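The lexicographic extension rule can be sketched as follows (a sketch; `extend_lexicographic` is a name chosen here, with itemsets kept as sorted tuples):

```python
def extend_lexicographic(freq_k1, freq_items):
    """Fk-1 x F1 generation: extend each sorted frequent
    (k-1)-itemset only with frequent items larger than its
    last element, so each candidate is produced exactly once."""
    candidates = []
    for itemset in freq_k1:
        last = itemset[-1]
        for item in freq_items:
            if item > last:
                candidates.append(itemset + (item,))
    return candidates

# {Bread, Diapers} + Milk is generated;
# {Diapers, Milk} + Bread and {Bread, Milk} + Diapers are not.
```

It stays complete because any frequent k-itemset, kept sorted, is exactly its frequent (k-1)-prefix extended with its lexicographically largest item, and that extension is the one the rule allows.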

22
Pruning
  • E.g., merging {Beer, Diapers} with {Milk} is
    unnecessary, because one of its subsets, {Beer,
    Milk}, is infrequent.
  • Solution: Prune!
  • How?

23
Fk-1 × F1 Example
{Beer, Diapers, Bread} and {Bread, Milk, Beer} are
not in fact generated if lexicographic ordering is
considered.
24
Fk-1 × Fk-1 Method
  • Merge a pair of frequent (k-1)-itemsets only if
    their first k-2 items are identical.
  • E.g., the frequent itemsets {Bread, Diapers} and
    {Bread, Milk} are merged to form the candidate
    3-itemset {Bread, Diapers, Milk}.

25
Fk-1 × Fk-1 Method
  • We don't merge {Beer, Diapers} with {Diapers,
    Milk}, because the first items of the two
    itemsets differ.
  • But is this "don't merge" decision OK?
  • Yes: if {Beer, Diapers, Milk} were a viable
    candidate, it would have been obtained by merging
    {Beer, Diapers} with {Beer, Milk} instead.
  • Pruning
  • However, because each candidate is obtained by
    merging just one pair of frequent (k-1)-itemsets,
    an additional candidate pruning step is needed to
    ensure that the remaining k-2 subsets of k-1
    elements are frequent.
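The merge-then-prune step can be sketched like this (a sketch; `merge_candidates` is a name chosen here, with frequent (k-1)-itemsets as sorted tuples):

```python
from itertools import combinations

def merge_candidates(freq_k1):
    """Fk-1 x Fk-1: merge two sorted (k-1)-itemsets only when
    their first k-2 items are identical, then prune candidates
    that have an infrequent (k-1)-subset."""
    freq_set = set(freq_k1)
    size = len(freq_k1[0])          # = k - 1
    candidates = []
    for a in freq_k1:
        for b in freq_k1:
            # identical prefix, and a[-1] < b[-1] so each pair
            # merges once and the result stays sorted
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                candidates.append(a + (b[-1],))
    # Pruning: every (k-1)-subset must itself be frequent.
    return [c for c in candidates
            if all(sub in freq_set for sub in combinations(c, size))]
```

For instance, from the frequent pairs {Beer, Diapers}, {Bread, Diapers}, {Bread, Milk}, {Diapers, Milk}, only {Bread, Diapers} and {Bread, Milk} share a first item, and their merge {Bread, Diapers, Milk} survives pruning because all three of its 2-subsets are frequent.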

26
Fk-1 × Fk-1 Example
27
Another Example
Min_sup_count = 2
TID List of item IDs
T1 I1, I2, I5
T2 I2, I4
T3 I2, I3
T4 I1, I2, I4
T5 I1, I3
T6 I2, I3
T7 I1, I3
T8 I1, I2, I3, I5
T9 I1, I2, I3
28
Generate C2 from F1 × F1
Min_sup_count = 2
F1
Itemset Sup. count
I1 6
I2 7
I3 6
I4 2
I5 2
C2
Itemset Sup. count
I1,I2 4
I1,I3 4
I1,I4 1
I1,I5 2
I2,I3 4
I2,I4 2
I2,I5 2
I3,I4 0
I3,I5 1
I4,I5 0
29
Generate C3 from F2 × F2
Min_sup_count = 2
F2
Itemset Sup. count
I1,I2 4
I1,I3 4
I1,I5 2
I2,I3 4
I2,I4 2
I2,I5 2
C3 (merge F2 itemsets sharing their first item)
I1,I2,I3
I1,I2,I5
I1,I3,I5
I2,I3,I4
I2,I3,I5
I2,I4,I5
Prune: I1,I3,I5, I2,I3,I4, I2,I3,I5 and I2,I4,I5
are dropped, because each contains a 2-subset
(I3,I5, I3,I4 or I4,I5) that is not in F2.
F3
Itemset Sup. count
I1,I2,I3 2
I1,I2,I5 2
30
Generate C4 from F3 × F3
Min_sup_count = 2
F3
Itemset Sup. count
I1,I2,I3 2
I1,I2,I5 2
C4
Itemset
I1,I2,I3,I5
{I1,I2,I3,I5} is pruned because its subset
{I2,I3,I5} is infrequent, so no frequent 4-itemsets
exist.
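The numbers in this worked example can be verified by counting supports directly over the nine transactions (a quick check, not part of the slides):

```python
# The nine transactions from the slides; min support count is 2.
db = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
      {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
      {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"}]

def support(itemset):
    """Number of transactions containing every item of the itemset."""
    return sum(1 for t in db if set(itemset) <= t)

f3 = {t: support(t) for t in [("I1", "I2", "I3"), ("I1", "I2", "I5")]}
# both triples reach the threshold of 2
pruned = support(("I2", "I3", "I5"))
# 1: infrequent, which is why {I1,I2,I3,I5} is pruned from C4
```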