Mining Associations - PowerPoint PPT Presentation

Slides: 30
Provided by: jeff480

1
Mining Associations
  • Apriori Algorithm

2
Computation Model
  • Typically, data is kept in a flat file rather
    than a database system.
  • Stored on disk.
  • Stored basket-by-basket.
  • The true cost of mining disk-resident data is
    usually the number of disk I/Os.
  • In practice, association-rule algorithms read the
    data in passes: all baskets are read in turn.
  • Thus, we measure the cost by the number of passes
    an algorithm takes.

3
Main-Memory Bottleneck
  • For many frequent-itemset algorithms, main memory
    is the critical resource.
  • As we read baskets, we need to count something,
    e.g., occurrences of pairs.
  • The number of different things we can count is
    limited by main memory.
  • Swapping counts in/out is a disaster.

4
Finding Frequent Pairs
  • The hardest problem often turns out to be finding
    the frequent pairs.
  • We'll concentrate on how to do that, then discuss
    extensions to finding frequent triples, etc.

5
Naïve Algorithm
  • Read file once, counting in main memory the
    occurrences of each pair.
  • From each basket of n items, generate its
    n(n - 1)/2 pairs by two nested loops.
  • Fails if (#items)^2 exceeds main memory.
  • Remember: #items can be 100K (Wal-Mart) or 10B
    (Web pages).
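
A minimal Python sketch of the naive one-pass count may make the memory problem concrete. The function name and the toy baskets are hypothetical, not from the slides, and `itertools.combinations` stands in for the two nested loops. Note that a dictionary stores only pairs that actually occur, so this is closer to a table of counts than to a full (#items)^2 array.

```python
from collections import Counter
from itertools import combinations

def count_pairs_naive(baskets):
    """Count every pair of items that co-occurs in a basket, in one pass."""
    counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))      # de-duplicate and fix an order
        # combinations() replaces the two nested loops: it yields the
        # n(n-1)/2 pairs (a, b) with a < b for a basket of n items.
        for pair in combinations(items, 2):
            counts[pair] += 1
    return counts

# Hypothetical toy data for illustration only.
baskets = [["bread", "milk"],
           ["bread", "diapers", "milk"],
           ["beer", "diapers"]]
pair_counts = count_pairs_naive(baskets)
```

Every distinct pair gets its own counter here, which is exactly what stops scaling once the number of items is large.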

6
Details of Main-Memory Counting
  • Two approaches:
  • (1) Count all pairs, using a triangular matrix.
  • (2) Keep a table of triples [i, j, c], meaning the
    count of the pair of items {i, j} is c.
  • (1) requires only 4 bytes/pair.
  • Note: assume integers are 4 bytes.
  • (2) requires 12 bytes, but only for those pairs
    with count > 0.

7
[Figure: Method (1) takes 4 bytes per pair; Method (2) takes 12 bytes per occurring pair.]
8
Triangular-Matrix Approach (1)
  • Number items 1, 2, ..., n.
  • Count {i, j} only if i < j.
  • Keep pairs in the order
  • {1,2}, {1,3}, ..., {1,n},
  • {2,3}, {2,4}, ..., {2,n},
  • {3,4}, ..., {3,n},
  • ..., {n-1,n}.

9
Triangular-Matrix Approach (2)
  • Let n be the number of items. The count for pair
    {i, j}, with i < j, is at position
  • T(i, j) = (i - 1)n - i(i + 1)/2 + j
  • Example order for n = 4:
  • {1,2}, {1,3}, {1,4},
  • {2,3}, {2,4},
  • {3,4}
  • Total number of pairs: n(n - 1)/2; total bytes
    about 2n^2.
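
The position formula can be checked with a short Python sketch. It assumes the formula reads T(i, j) = (i - 1)n - i(i + 1)/2 + j with 1-based items and i < j, which reproduces the order {1,2}, {1,3}, ..., {n-1,n}. The function name and the sample baskets are hypothetical.

```python
def tri_index(i, j, n):
    """1-based position of pair {i, j}, 1 <= i < j <= n, in the triangular layout."""
    if not 1 <= i < j <= n:
        raise ValueError("need 1 <= i < j <= n")
    # i * (i + 1) is always even, so integer division is exact.
    return (i - 1) * n - i * (i + 1) // 2 + j

n = 4
order = [(i, j) for i in range(1, n + 1) for j in range(i + 1, n + 1)]
positions = [tri_index(i, j, n) for i, j in order]

# Counting with one flat array; slot 0 is unused so positions stay 1-based.
counts = [0] * (n * (n - 1) // 2 + 1)
for basket in [[1, 2, 4], [2, 4], [1, 3]]:   # hypothetical baskets
    items = sorted(set(basket))
    for a in range(len(items)):
        for b in range(a + 1, len(items)):
            counts[tri_index(items[a], items[b], n)] += 1
```

For n = 4 the six pairs map to positions 1 through 6 with no gaps, so the array needs exactly n(n - 1)/2 counters.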

10
Details of Approach 2
  • Total bytes used is about 12p, where p is the
    number of pairs that actually occur.
  • Beats triangular matrix if at most 1/3 of
    possible pairs actually occur.
  • May require extra space for retrieval structure,
    e.g., a hash table.
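
The 1/3 break-even point follows from simple arithmetic, sketched here in Python under the slides' assumption of 4-byte integers. The function names are hypothetical, and the sketch ignores the extra space a hash table or other retrieval structure would need.

```python
def triangular_bytes(n, int_size=4):
    """Method (1): one integer for each of the n(n-1)/2 possible pairs."""
    return int_size * n * (n - 1) // 2

def triples_bytes(p, int_size=4):
    """Method (2): three integers (i, j, c) for each of the p occurring pairs."""
    return 3 * int_size * p

n = 100_000                      # e.g., the 100K-item case
possible_pairs = n * (n - 1) // 2
break_even = possible_pairs // 3 # largest p at which method (2) is no worse
```

With these numbers both methods use the same space when exactly one third of the possible pairs occur; one more occurring pair tips the balance toward the triangular matrix.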

11
Apriori Algorithm for pairs (1)
  • A two-pass approach called a-priori limits the
    need for main memory.
  • Key idea: monotonicity. If a set of items
    appears at least s times, so does every subset.
  • Contrapositive for pairs: if item i does not
    appear in s baskets, then no pair including i
    can appear in s baskets.

12
Apriori Algorithm for pairs (2)
  • Pass 1: Read baskets and count in main memory the
    occurrences of each item.
  • Requires only memory proportional to #items.
  • Pass 2: Read baskets again and count in main
    memory only those pairs both of whose elements
    were found in Pass 1 to be frequent.
  • Requires memory proportional to the square of the
    number of frequent items only.
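
The two passes can be sketched in Python as follows. This is an illustration under stated assumptions, not the presenter's implementation: "frequent" is taken to mean appearing in at least s baskets, the baskets are held in a list instead of being re-read from disk, and the function name and toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations

def apriori_pairs(baskets, s):
    """Return {pair: count} for pairs that appear in at least s baskets."""
    # Pass 1: count individual items.
    item_counts = Counter()
    for basket in baskets:
        item_counts.update(set(basket))
    frequent_items = {item for item, c in item_counts.items() if c >= s}

    # Pass 2: count only pairs whose two items are both frequent.
    pair_counts = Counter()
    for basket in baskets:
        kept = sorted(set(basket) & frequent_items)
        for pair in combinations(kept, 2):
            pair_counts[pair] += 1
    return {pair: c for pair, c in pair_counts.items() if c >= s}

# Hypothetical toy data for illustration only.
baskets = [["bread", "milk"],
           ["bread", "diapers", "beer", "eggs"],
           ["milk", "diapers", "beer", "cola"],
           ["bread", "milk", "diapers", "beer"],
           ["bread", "milk", "diapers", "cola"]]
frequent_pairs = apriori_pairs(baskets, s=3)
```

In this toy data, "eggs" and "cola" fall below the threshold in Pass 1, so no pair containing them is ever counted in Pass 2.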

13
Detail for A-Priori
  • You can use the triangular matrix method with
    n = number of frequent items.
  • Trick: number frequent items 1, 2, ... and keep a
    table relating new numbers to original item
    numbers.
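
One way to sketch the renumbering trick in Python is shown here. The function name, the item ids, and the counts are all hypothetical, and "frequent" is assumed to mean a count of at least s.

```python
def renumber_frequent(item_counts, s):
    """Map each frequent original item id to a dense number 1..m, and back."""
    frequent = sorted(item for item, c in item_counts.items() if c >= s)
    new_id = {item: k for k, item in enumerate(frequent, start=1)}
    old_id = [None] + frequent   # old_id[k] is the original id of item number k
    return new_id, old_id

item_counts = {1007: 5, 12: 1, 530: 3, 88: 4, 9: 2}   # hypothetical Pass 1 counts
new_id, old_id = renumber_frequent(item_counts, s=3)
```

The dense numbers 1..m can then index a triangular matrix with n = m, and `old_id` translates results back to the original item numbers.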

14
Frequent Triples, Etc.
  • For each k, we construct two sets of k-itemsets:
  • Ck = candidate k-itemsets = those that might
    be frequent (support > s) based on information
    from the pass for k - 1.
  • Fk = the set of truly frequent k-itemsets.

15
Full Apriori Algorithm
  • Let k = 1.
  • Generate frequent itemsets of length 1.
  • Repeat until no new frequent itemsets are found:
      • k = k + 1.
      • Generate length-k candidate itemsets from
        length-(k-1) frequent itemsets.
      • Prune candidate itemsets containing subsets of
        length k-1 that are infrequent.
      • Count the support of each candidate by scanning
        the DB and eliminate candidates that are
        infrequent, leaving only those that are frequent.
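
The loop above can be sketched compactly in Python. This is an unoptimized illustration under stated assumptions: "frequent" means appearing in at least s baskets, candidates are formed by uniting any two frequent (k-1)-itemsets whose union has k items, and the function name and toy data are hypothetical.

```python
from itertools import combinations

def apriori(baskets, s):
    """Return {frozenset: support count} for all itemsets in at least s baskets."""
    baskets = [frozenset(b) for b in baskets]

    def count(candidates):
        # One scan of the data per level k.
        counts = {c: 0 for c in candidates}
        for basket in baskets:
            for c in candidates:
                if c <= basket:          # candidate is contained in the basket
                    counts[c] += 1
        return {c: n for c, n in counts.items() if n >= s}

    items = {item for b in baskets for item in b}
    frequent = count({frozenset([item]) for item in items})
    result = dict(frequent)
    k = 1
    while frequent:
        k += 1
        prev = set(frequent)
        # Generate: unions of two frequent (k-1)-itemsets that have k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in prev
                             for sub in combinations(c, k - 1))}
        frequent = count(candidates)
        result.update(frequent)
    return result

# Hypothetical toy data for illustration only.
baskets = [["bread", "milk"],
           ["bread", "diapers", "beer", "eggs"],
           ["milk", "diapers", "beer", "cola"],
           ["bread", "milk", "diapers", "beer"],
           ["bread", "milk", "diapers", "cola"]]
frequent_sets = apriori(baskets, s=3)
```

On this data the only candidate triple that survives pruning is {bread, diapers, milk}, and it appears in just two baskets, so the loop stops after k = 3.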

16
Illustrating Apriori
17
Candidate generation
  • Must ensure that the candidate set is complete.
  • Should not generate the same candidate itemset
    more than once.

18
Data Set Example
s = 3
19
Fk-1 × F1 Method
  • Extend each frequent (k-1)-itemset with a
    frequent 1-itemset.
  • Is it complete?
  • Yes, because every frequent k-itemset is composed
    of
  • a frequent (k-1)-itemset and
  • a frequent 1-itemset.
  • However, it doesn't prevent the same candidate
    itemset from being generated more than once.
  • E.g., {Bread, Diapers, Milk} can be generated by
    merging
  • {Bread, Diapers} with {Milk},
  • {Bread, Milk} with {Diapers}, or
  • {Diapers, Milk} with {Bread}.

20
Lexicographic Order
  • Keep each frequent itemset sorted in lexicographic
    order.
  • Each frequent (k-1)-itemset X is extended with
    frequent items that are lexicographically larger
    than the items in X.
  • Example
  • {Bread, Diapers} can be extended with {Milk}
  • {Bread, Milk} can't be extended with {Diapers}
  • {Diapers, Milk} can't be extended with {Bread}
  • Why is it complete?
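
A small Python sketch of extension in lexicographic order, using the itemsets from the example above and treating all three pairs as frequent for illustration. The function name is hypothetical. It also suggests an answer to the completeness question: a sorted k-itemset can only arise from its first k-1 items plus its last item, so each candidate is produced exactly once and none is missed.

```python
def extend_with_items(freq_prev, freq_items):
    """Fk-1 x F1: extend each sorted tuple with lexicographically larger items."""
    candidates = []
    for itemset in freq_prev:            # each itemset is a sorted tuple
        for item in sorted(freq_items):
            if item > itemset[-1]:       # larger than every item in the itemset
                candidates.append(itemset + (item,))
    return candidates

f2 = [("Bread", "Diapers"), ("Bread", "Milk"), ("Diapers", "Milk")]
f1 = ["Bread", "Diapers", "Milk"]
c3 = extend_with_items(f2, f1)
```

Only ("Bread", "Diapers") is extended, by "Milk"; the other two pairs already end in the largest item.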

21
Pruning
  • Merging {Beer, Diapers} with {Milk} is
    unnecessary. Why?
  • Because one of its subsets, {Beer, Milk}, is
    infrequent.
  • Solution: Prune!
  • How?

22
Fk-1 × F1 Example
{Beer, Diapers, Bread} and {Bread, Milk, Beer}
aren't in fact generated if lexicographic order
is considered.
23
Fk-1 × Fk-1 Method
  • Merge a pair of frequent (k-1)-itemsets only if
    their first k-2 items are identical.
  • E.g., frequent itemsets
  • {Bread, Diapers} and {Bread, Milk}
  • are merged to form a candidate 3-itemset
  • {Bread, Diapers, Milk}.

24
Fk-1 × Fk-1 Method
  • We don't merge {Beer, Diapers} with {Diapers,
    Milk} because the first item in both itemsets is
    different.
  • But is this "don't merge" decision OK?
  • Indeed, if {Beer, Diapers, Milk} is a viable
    candidate, it would have been obtained by merging
    {Beer, Diapers} with {Beer, Milk} instead.
  • Pruning
  • Because each candidate is obtained by merging a
    pair of frequent (k-1)-itemsets, an additional
    candidate pruning step is needed to ensure that
    the remaining k-2 subsets of k-1 elements are
    frequent.
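
The merge-then-prune step might be sketched in Python as follows. The function name is hypothetical, and the input sets F2 and F3 are assumptions chosen to agree with statements in the slides ({Beer, Milk} infrequent; {I2, I3, I5} infrequent); they are not copied from the missing figures.

```python
from itertools import combinations

def merge_candidates(freq_prev):
    """Fk-1 x Fk-1: merge sorted tuples sharing their first k-2 items, then prune."""
    freq_prev = sorted(freq_prev)
    known = set(freq_prev)
    candidates = []
    for a, b in combinations(freq_prev, 2):   # a comes before b in sorted order
        if a[:-1] == b[:-1]:                  # identical first k-2 items
            cand = a + (b[-1],)               # still sorted, since a[-1] < b[-1]
            # Prune: every (k-1)-subset of the candidate must be frequent.
            if all(sub in known for sub in combinations(cand, len(cand) - 1)):
                candidates.append(cand)
    return candidates

# Assumed F2, consistent with {Beer, Milk} being infrequent.
f2 = [("Beer", "Diapers"), ("Bread", "Diapers"),
      ("Bread", "Milk"), ("Diapers", "Milk")]
c3 = merge_candidates(f2)

# Assumed F3 in which {I2, I3, I5} is not frequent.
f3 = [("I1", "I2", "I3"), ("I1", "I2", "I5")]
c4 = merge_candidates(f3)
```

In the first call only ("Bread", "Diapers") and ("Bread", "Milk") share a prefix, so a single candidate results. In the second call the merged 4-itemset is discarded by the pruning check.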

25
Fk-1 × Fk-1 Example
26
Another Example
Min_sup_count = 2
27
Generate C2 from F1 × F1
Min_sup_count = 2
[Figure: table F1]
28
Generate C3 from F2 × F2
Min_sup_count = 2
[Figure: tables F2, C3 (after Prune), F3]
29
Generate C4 from F3 × F3
Min_sup_count = 2
[Figure: tables F3, C4]
{I1, I2, I3, I5} is pruned because {I2, I3, I5} is
infrequent.