Mining Associations - PowerPoint PPT Presentation

Slides: 30

Provided by: jeff480

Transcript and Presenter's Notes

1
Mining Associations

Apriori Algorithm

2
Computation Model

Typically, data is kept in a flat file rather
than a database system.
Stored on disk.
Stored basket-by-basket.
The true cost of mining disk-resident data is
usually the number of disk I/Os.
In practice, association-rule algorithms read the
data in passes: all baskets are read in turn.
Thus, we measure the cost by the number of passes
an algorithm takes.

3
Main-Memory Bottleneck

For many frequent-itemset algorithms, main memory
is the critical resource.
As we read baskets, we need to count something,
e.g., occurrences of pairs.
The number of different things we can count is
limited by main memory.
Swapping counts in/out is a disaster.

4
Finding Frequent Pairs

The hardest problem often turns out to be finding
the frequent pairs.
We'll concentrate on how to do that, then discuss
extensions to finding frequent triples, etc.

5
Naïve Algorithm

Read file once, counting in main memory the
occurrences of each pair.
From each basket of n items, generate its
n(n - 1)/2 pairs by two nested loops.
Fails if (#items)^2 exceeds main memory.
Remember: the number of items can be 100K (Wal-Mart)
or 10B (Web pages).
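
The naive one-pass count can be sketched in Python as follows. This is a minimal illustration, not the presenter's code; the function name and the toy baskets are assumptions made up for the example.

```python
from collections import Counter

def count_pairs_naive(baskets):
    """Count every pair of items that co-occurs in a basket, in one pass."""
    counts = Counter()
    for basket in baskets:
        # Deduplicate and sort so each pair has exactly one key (a, b) with a < b.
        items = sorted(set(basket))
        # Two nested loops generate the n(n - 1)/2 pairs of an n-item basket.
        for a in range(len(items)):
            for b in range(a + 1, len(items)):
                counts[(items[a], items[b])] += 1
    return counts

# Hypothetical toy data, not taken from the slides.
baskets = [["milk", "bread", "beer"], ["milk", "bread"], ["beer", "diapers"]]
pair_counts = count_pairs_naive(baskets)
```

The counter holds one entry per distinct pair seen, which is exactly what stops fitting in main memory when the number of items is large.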

6
Details of Main-Memory Counting

Two approaches:
(1) Count all pairs, using a triangular matrix.
(2) Keep a table of triples [i, j, c], meaning the count of
the pair of items {i, j} is c.
(1) requires only 4 bytes/pair.
Note: assume integers are 4 bytes.
(2) requires 12 bytes/pair, but only for those pairs
with count > 0.

7
[Figure: memory used by each method. Method (1): 4 bytes per pair. Method (2): 12 bytes per occurring pair.]

8
Triangular-Matrix Approach (1)

Number the items 1, 2, ..., n.
Count {i, j} only if i < j.
Keep pairs in the order
{1,2}, {1,3}, ..., {1,n},
{2,3}, {2,4}, ..., {2,n},
{3,4}, ..., {3,n},
...,
{n-1,n}.

9
Triangular-Matrix Approach (2)

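Let n be the number of items. The count for pair {i, j},
with i < j, is at position
T(i, j) = (i - 1)n - i(i + 1)/2 + j
The pairs are laid out row by row:
{1,2}, {1,3}, {1,4}, ...
{2,3}, {2,4}, ...
{3,4}, ...
Total number of pairs is n(n - 1)/2, so total bytes are
about 2n^2.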
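
A short check of the position formula, assuming it reads T(i, j) = (i - 1)n - i(i + 1)/2 + j with items numbered from 1. The function name tri_index is made up for this sketch.

```python
def tri_index(i, j, n):
    """1-based position of pair {i, j}, with 1 <= i < j <= n, in the triangular array."""
    if not 1 <= i < j <= n:
        raise ValueError("need 1 <= i < j <= n")
    # i * (i + 1) is always even, so integer division is exact.
    return (i - 1) * n - i * (i + 1) // 2 + j

n = 5
# Enumerate pairs in the storage order {1,2}, {1,3}, ..., {n-1,n}.
positions = [tri_index(i, j, n) for i in range(1, n) for j in range(i + 1, n + 1)]
```

For n = 5 the ten pairs land on positions 1 through 10 with no gaps, so an array of n(n - 1)/2 integers is enough.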

10
Details of Approach 2

Total bytes used is about 12p, where p is the
number of pairs that actually occur.
Beats triangular matrix if at most 1/3 of
possible pairs actually occur.
May require extra space for retrieval structure,
e.g., a hash table.
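
The 1/3 break-even point follows from comparing 12p bytes against 4 bytes for each of the n(n - 1)/2 possible pairs. A small sketch of that arithmetic, assuming 4-byte integers as on the earlier slide and ignoring any hash-table overhead; both function names are illustrative.

```python
def triangular_bytes(n_items, bytes_per_int=4):
    """Bytes for one count per possible pair: 4 * n(n - 1)/2."""
    return bytes_per_int * (n_items * (n_items - 1) // 2)

def triples_bytes(n_occurring_pairs, bytes_per_int=4):
    """Bytes for one [i, j, c] triple per pair that actually occurs: 12 * p."""
    return 3 * bytes_per_int * n_occurring_pairs

n = 100_000                          # item count mentioned for Wal-Mart
possible_pairs = n * (n - 1) // 2
break_even = possible_pairs // 3     # triples win when p is below this
```

With n = 100,000 the triangular matrix needs about 2 * 10^10 bytes, which matches the "about 2n^2" estimate.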

11
Apriori Algorithm for pairs (1)

A two-pass approach called A-Priori limits the
need for main memory.
Key idea: monotonicity. If a set of items
appears at least s times, so does every subset.
Contrapositive for pairs: if item i does not
appear in s baskets, then no pair including i
can appear in s baskets.
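
Monotonicity can be checked directly on a small data set. The baskets below are illustrative toy data, not the data set from the slides, and support is a helper name chosen for this sketch.

```python
def support(itemset, baskets):
    """Number of baskets that contain every item of itemset."""
    target = set(itemset)
    return sum(1 for basket in baskets if target <= set(basket))

# Illustrative toy baskets (an assumption, not the slides' figure).
baskets = [
    ["Bread", "Milk"],
    ["Bread", "Diapers", "Beer", "Eggs"],
    ["Milk", "Diapers", "Beer", "Cola"],
    ["Bread", "Milk", "Diapers", "Beer"],
    ["Bread", "Milk", "Diapers", "Cola"],
]
pair_support = support({"Beer", "Milk"}, baskets)
item_supports = [support({"Beer"}, baskets), support({"Milk"}, baskets)]
```

A pair can never appear in more baskets than its rarest item, which is why an infrequent item rules out every pair that contains it.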

12
Apriori Algorithm for pairs (2)

Pass 1: Read baskets and count in main memory the
occurrences of each item.
Requires only memory proportional to #items.
Pass 2: Read baskets again and count in main
memory only those pairs both of whose elements were
found in Pass 1 to be frequent.
Requires memory proportional to the square of the
number of frequent items only.
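
The two passes might look like this in Python. It is a sketch under assumptions: "frequent" is taken to mean a count of at least s, the baskets are illustrative toy data, and apriori_pairs is a made-up name.

```python
from collections import Counter
from itertools import combinations

def apriori_pairs(baskets, s):
    """Two-pass A-Priori for pairs; returns {pair: count} for counts >= s."""
    # Pass 1: count occurrences of each single item.
    item_counts = Counter(item for basket in baskets for item in set(basket))
    frequent_items = {item for item, c in item_counts.items() if c >= s}
    # Pass 2: count only pairs whose two elements are both frequent.
    pair_counts = Counter()
    for basket in baskets:
        kept = sorted(set(basket) & frequent_items)
        pair_counts.update(combinations(kept, 2))
    return {pair: c for pair, c in pair_counts.items() if c >= s}

# Illustrative toy baskets (an assumption, not the slides' figure).
baskets = [
    ["Bread", "Milk"],
    ["Bread", "Diapers", "Beer", "Eggs"],
    ["Milk", "Diapers", "Beer", "Cola"],
    ["Bread", "Milk", "Diapers", "Beer"],
    ["Bread", "Milk", "Diapers", "Cola"],
]
frequent_pairs = apriori_pairs(baskets, s=3)
```

Eggs and Cola never reach pass 2, so no counter is ever allocated for a pair that contains them.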

13
Detail for A-Priori

You can use the triangular matrix method with
n = number of frequent items.
Trick: number the frequent items 1, 2, ... and keep a
table relating new numbers to original item
numbers.
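
One possible shape for the renumbering table, assuming item counts from pass 1 are held in a dictionary keyed by original item number. The names and the sample counts are hypothetical.

```python
def renumber_frequent(item_counts, s):
    """Map frequent items to dense numbers 1..m and keep the reverse table."""
    frequent = sorted(item for item, c in item_counts.items() if c >= s)
    # new_id: original item number -> new number in 1..m
    new_id = {item: idx for idx, item in enumerate(frequent, start=1)}
    # old_item[new number] -> original item number; slot 0 is unused padding.
    old_item = [None] + frequent
    return new_id, old_item

# Hypothetical pass-1 counts keyed by original item number.
item_counts = {101: 5, 205: 1, 330: 4, 412: 7}
new_id, old_item = renumber_frequent(item_counts, s=3)
```

The dense numbers can then be fed straight into the triangular-matrix position formula with n equal to the number of frequent items.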

14
Frequent Triples, Etc.

For each k, we construct two sets of k-itemsets:
Ck = candidate k-itemsets = those that might
be frequent (support > s) based on information
from the pass for k - 1.
Fk = the set of truly frequent k-itemsets.

15
Full Apriori Algorithm

Let k = 1.
Generate frequent itemsets of length 1.
Repeat until no new frequent itemsets are found:
  k = k + 1.
  Generate length-k candidate itemsets from length
  k-1 frequent itemsets.
  Prune candidate itemsets containing subsets of
  length k-1 that are infrequent.
  Count the support of each candidate by scanning
  the DB and eliminate candidates that are
  infrequent, leaving only those that are frequent.
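
The loop above can be sketched compactly in Python. This is an illustration under assumptions, not a tuned implementation: candidates are generated by taking unions of frequent (k-1)-itemsets, "frequent" means a count of at least s, and the baskets are toy data.

```python
from itertools import combinations

def apriori(baskets, s):
    """Return {frozenset: count} for every itemset with count >= s."""
    baskets = [frozenset(b) for b in baskets]
    # k = 1: count single items.
    counts = {}
    for basket in baskets:
        for item in basket:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {iset: c for iset, c in counts.items() if c >= s}
    result = dict(frequent)
    k = 1
    while frequent:  # repeat until no new frequent itemsets are found
        k += 1
        # Generate length-k candidates from the frequent (k-1)-itemsets.
        prev = list(frequent)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune candidates that contain an infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in frequent
                             for sub in combinations(c, k - 1))}
        # Count support with one scan over the baskets; keep the frequent ones.
        counts = {c: sum(1 for b in baskets if c <= b) for c in candidates}
        frequent = {c: n for c, n in counts.items() if n >= s}
        result.update(frequent)
    return result

# Illustrative toy baskets (an assumption, not the slides' figure).
baskets = [
    ["Bread", "Milk"],
    ["Bread", "Diapers", "Beer", "Eggs"],
    ["Milk", "Diapers", "Beer", "Cola"],
    ["Bread", "Milk", "Diapers", "Beer"],
    ["Bread", "Milk", "Diapers", "Cola"],
]
result = apriori(baskets, s=3)
```

Each trip through the loop corresponds to one pass over the data, which is the cost measure used at the start of the deck.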

16
Illustrating Apriori
17
Candidate generation

Must ensure that the candidate set is complete.
Should not generate the same candidate itemset
more than once.

18
Data Set Example
s = 3
19
Fk-1 × F1 Method

Extend each frequent (k-1)-itemset with a
frequent 1-itemset.
Is it complete?
Yes, because every frequent k-itemset is composed
of
a frequent (k-1)-itemset and
a frequent 1-itemset.
However, it doesn't prevent the same candidate
itemset from being generated more than once.
E.g., {Bread, Diapers, Milk} can be generated by
merging
{Bread, Diapers} with {Milk},
{Bread, Milk} with {Diapers}, or
{Diapers, Milk} with {Bread}.
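
The duplicate generation is easy to reproduce. A minimal sketch, with extend_with_items as a made-up name and the three frequent pairs taken from the example above:

```python
def extend_with_items(freq_km1, freq_items):
    """Extend every frequent (k-1)-itemset with every frequent item not already in it."""
    generated = []
    for itemset in freq_km1:
        for item in freq_items:
            if item not in itemset:
                generated.append(frozenset(itemset) | {item})
    return generated

f2 = [("Bread", "Diapers"), ("Bread", "Milk"), ("Diapers", "Milk")]
f1 = ["Bread", "Diapers", "Milk"]
generated = extend_with_items(f2, f1)
```

Three merges are performed, yet all three produce the same 3-itemset, so two of them are wasted work.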

20
Lexicographic Order

Keep frequent itemsets sorted in lexicographic
order.
Each frequent (k-1)-itemset X is extended with
frequent items that are lexicographically larger
than the items in X.
Example:
{Bread, Diapers} can be extended with {Milk}.
{Bread, Milk} can't be extended with {Diapers}.
{Diapers, Milk} can't be extended with {Bread}.
Why is it complete?
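
One way to see completeness: a sorted k-itemset is its first k-1 items plus its largest item, and if the whole set is frequent then so is that (k-1)-item prefix, so the candidate is still reached exactly once. A sketch of the rule, with extend_lexicographic as an illustrative name:

```python
def extend_lexicographic(freq_km1, freq_items):
    """Extend each sorted (k-1)-itemset only with items larger than its last item."""
    candidates = []
    for itemset in freq_km1:
        itemset = tuple(sorted(itemset))
        for item in sorted(freq_items):
            if item > itemset[-1]:
                candidates.append(itemset + (item,))
    return candidates

f2 = [("Bread", "Diapers"), ("Bread", "Milk"), ("Diapers", "Milk")]
f1 = ["Bread", "Diapers", "Milk"]
candidates = extend_lexicographic(f2, f1)
```

Only {Bread, Diapers} can be extended, so the 3-itemset appears once instead of three times.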

21
Pruning

Merging {Beer, Diapers} with {Milk} is
unnecessary. Why?
Because one of its subsets, {Beer, Milk}, is
infrequent.
Solution: Prune!
How?
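
A plausible answer is to test every (k-1)-subset of a candidate against the frequent (k-1)-itemsets before counting it. The sketch below assumes a particular set of frequent pairs, chosen so that {Beer, Milk} is missing as in the slide; the function name is made up.

```python
from itertools import combinations

def has_infrequent_subset(candidate, freq_km1):
    """True if some (k-1)-subset of candidate is not a frequent (k-1)-itemset."""
    frequent = {frozenset(s) for s in freq_km1}
    k = len(candidate)
    return any(frozenset(sub) not in frequent
               for sub in combinations(candidate, k - 1))

# Assumed frequent pairs; {Beer, Milk} is deliberately absent.
f2 = [("Beer", "Diapers"), ("Bread", "Diapers"),
      ("Bread", "Milk"), ("Diapers", "Milk")]
prune_beer = has_infrequent_subset(("Beer", "Diapers", "Milk"), f2)
prune_bread = has_infrequent_subset(("Bread", "Diapers", "Milk"), f2)
```

A pruned candidate never needs a counter in the next pass over the data.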

22
Fk-1 × F1 Example
{Beer, Diapers, Bread} and {Bread, Milk, Beer}
aren't in fact generated if lexicographic order
is considered.
23
Fk-1 × Fk-1 Method

Merge a pair of frequent (k-1)-itemsets only if
their first k-2 items are identical.
E.g., frequent itemsets
{Bread, Diapers} and {Bread, Milk}
are merged to form a candidate 3-itemset
{Bread, Diapers, Milk}.

24
Fk-1 × Fk-1 Method

We don't merge {Beer, Diapers} with {Diapers,
Milk} because the first item in the two itemsets is
different.
But is this "don't merge" decision OK?
Indeed, if {Beer, Diapers, Milk} is a viable
candidate, it would have been obtained by merging
{Beer, Diapers} with {Beer, Milk} instead.
Pruning:
Because each candidate is obtained by merging a
pair of frequent (k-1)-itemsets, an additional
candidate pruning step is needed to ensure that
the remaining k-2 subsets of k-1 elements are
frequent.
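
The merge rule and the follow-up pruning step can be combined in one function. This is a sketch, not the presenter's code: merge_candidates is a made-up name, itemsets are kept as sorted tuples, and the frequent pairs are assumed for illustration.

```python
from itertools import combinations

def merge_candidates(freq_km1):
    """Merge (k-1)-itemsets that share their first k-2 items, then prune."""
    freq = sorted(tuple(sorted(s)) for s in freq_km1)
    freq_set = set(freq)
    candidates = []
    for a, b in combinations(freq, 2):
        if a[:-1] == b[:-1]:            # first k-2 items identical
            cand = a + (b[-1],)         # stays sorted because a < b
            # Prune: every (k-1)-subset must itself be frequent.
            if all(sub in freq_set
                   for sub in combinations(cand, len(cand) - 1)):
                candidates.append(cand)
    return candidates

# Assumed frequent pairs; {Beer, Milk} is deliberately absent.
f2 = [("Beer", "Diapers"), ("Bread", "Diapers"),
      ("Bread", "Milk"), ("Diapers", "Milk")]
c3 = merge_candidates(f2)
c3_with_beer_milk = merge_candidates(f2 + [("Beer", "Milk")])
```

Without {Beer, Milk} in the input, {Beer, Diapers, Milk} is never even formed, because no second pair starts with Beer. Once {Beer, Milk} is frequent, the candidate is produced by merging {Beer, Diapers} with {Beer, Milk}, as the slide argues.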

25
Fk-1 × Fk-1 Example
26
Another Example
Min_sup_count = 2
27
Generate C2 from F1 × F1
Min_sup_count = 2
[Figure: F1]
28
Generate C3 from F2 × F2
Min_sup_count = 2
[Figure: F2, Prune, C3, F3]
29
Generate C4 from F3 × F3
Min_sup_count = 2
[Figure: F3, C4]
{I1, I2, I3, I5} is pruned because {I2, I3, I5} is
infrequent.
