Title: Association rule mining
1. Association Rule Mining
- Prof. Navneet Goyal
- CSIS Department, BITS-Pilani
2. Association Rule Mining
- Find all rules of the form Itemset1 → Itemset2 having
  - support ≥ minsup threshold
  - confidence ≥ minconf threshold
- Brute-force approach
  - List all possible association rules
  - Compute the support and confidence for each rule
  - Prune rules that fail the minsup and minconf thresholds
  - ⇒ Computationally prohibitive!
3. Association Rule Mining
- Two-step process
  - FI (Frequent Itemset) generation
  - Rule generation
4. FI Generation
- Brute-force approach
  - Each itemset in the lattice is a candidate FI
  - Count the support of each candidate by scanning the database
  - Match each transaction against every candidate
  - Complexity ~ O(NMw) ⇒ expensive, since M = 2^d !!
    (N = no. of transactions, M = no. of candidates, w = max transaction width, d = no. of items)
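A minimal Python sketch of this brute-force approach (illustrative only, not from the slides; the toy transactions are hypothetical). It enumerates all M = 2^d candidate itemsets and scans all N transactions for each one, which is exactly why the approach is prohibitive:

```python
from itertools import combinations

def brute_force_frequent_itemsets(transactions, minsup):
    """Count every candidate itemset (M = 2^d of them) with a full scan of the
    N transactions -- the O(N * M * w) approach described above."""
    items = sorted({item for t in transactions for item in t})
    n = len(transactions)
    frequent = {}
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):               # M grows as 2^d
            cset = set(candidate)
            count = sum(1 for t in transactions if cset <= t)  # scan all N transactions
            if count / n >= minsup:
                frequent[candidate] = count
    return frequent

# hypothetical toy data
txs = [{"Bread", "Jelly"}, {"Bread", "PeanutButter"}, {"Bread", "Jelly", "PeanutButter"}]
print(brute_force_frequent_itemsets(txs, minsup=0.6))
```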
5. Computational Complexity
- Given d unique items:
  - Total number of itemsets = 2^d
  - Total number of possible association rules: R = 3^d − 2^(d+1) + 1
  - If d = 6, R = 602 rules
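The rule count comes from choosing a non-empty antecedent and then a non-empty consequent from the remaining items; a short derivation of this standard result, worked for d = 6:

```latex
R = \sum_{k=1}^{d-1} \binom{d}{k} \sum_{j=1}^{d-k} \binom{d-k}{j}
  = \sum_{k=1}^{d-1} \binom{d}{k} \left( 2^{d-k} - 1 \right)
  = 3^{d} - 2^{d+1} + 1,
\qquad d = 6 \;\Rightarrow\; R = 729 - 128 + 1 = 602.
```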
6. Association Rule Mining
- Two-step process
  - FI (Frequent Itemset) generation
  - Rule generation
7. Frequent Itemset Generation Strategies
- Reduce the number of candidates (M)
  - Complete search: M = 2^d
  - Use pruning techniques to reduce M
- Reduce the number of transactions (N)
  - Reduce the size of N as the size of the itemset increases
  - Used by DHP and vertical-based mining algorithms
- Reduce the number of comparisons (NM)
  - Use efficient data structures to store the candidates or transactions
  - No need to match every candidate against every transaction
8. Reducing Number of Candidates
- Apriori Principle
  - If an itemset is frequent, then all its subsets must be frequent
- The Apriori principle holds due to the following property of the support measure:
  - The support of an itemset never exceeds the support of its subsets
  - This is known as the anti-monotone property of support
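A simplified Python sketch of how this principle is used during candidate generation (illustrative; the function name and the toy frequent 2-itemsets are hypothetical, and the merge step is a basic pairwise union rather than the full F(k-1)×F(k-1) procedure):

```python
from itertools import combinations

def apriori_gen(frequent_k, k):
    """Generate (k+1)-candidates from frequent k-itemsets and prune any
    candidate that has an infrequent k-subset (the Apriori principle)."""
    freq = set(frequent_k)
    candidates = set()
    for a in freq:
        for b in freq:
            union = tuple(sorted(set(a) | set(b)))
            if len(union) == k + 1:
                # anti-monotone pruning: every k-subset must itself be frequent
                if all(s in freq for s in combinations(union, k)):
                    candidates.add(union)
    return candidates

# hypothetical frequent 2-itemsets
L2 = {("A", "B"), ("A", "C"), ("B", "C"), ("A", "D")}
print(apriori_gen(L2, 2))  # {('A','B','C')}; ('A','B','D') pruned since ('B','D') is not frequent
```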
9. Illustrating Apriori Principle
10. Sampling Algorithm
- The transaction DB can get very big!
- Sample the DB and apply Apriori to the sample
- Use a reduced minsup (smalls)
- Find large (frequent) itemsets from the sample using smalls
- Call this set of large itemsets Potentially Large (PL)
- Find the negative border (BD-) of PL
  - Minimal set of itemsets which are not in PL, but whose subsets are all in PL
11. Negative Border Example
- Let Items = {A, ..., F} and there are itemsets
  {A}, {B}, {C}, {F}, {A,B}, {A,C}, {A,F}, {C,F}, {A,C,F}
- The whole negative border is
  {B,C}, {B,F}, {D}, {E}
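A minimal Python sketch of computing BD-(PL) that reproduces the example above (illustrative only; the empty set is treated as trivially large, so any missing single item lands in the border):

```python
from itertools import combinations

def negative_border(pl, items):
    """BD-(PL): minimal itemsets not in PL whose immediate subsets all lie in PL."""
    pl = {frozenset(s) for s in pl}
    max_k = max(len(s) for s in pl) + 1
    border = set()
    for k in range(1, max_k + 1):
        for cand in combinations(sorted(items), k):
            cand = frozenset(cand)
            # every immediate subset must be in PL (vacuously true for 1-itemsets)
            subsets_ok = all(cand - {i} in pl or len(cand) == 1 for i in cand)
            if cand not in pl and subsets_ok:
                border.add(cand)
    return border

PL = [{'A'}, {'B'}, {'C'}, {'F'}, {'A', 'B'}, {'A', 'C'}, {'A', 'F'}, {'C', 'F'}, {'A', 'C', 'F'}]
print(negative_border(PL, 'ABCDEF'))   # -> {B,C}, {B,F}, {D}, {E}
```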
12. Sampling Algorithm Example
- Sample database = {t1, t2}
- Smalls = 20%, Min_sup = 40%
- PL = {{Br}, {PB}, {J}, {Br,J}, {Br,PB}, {J,PB}}
- BD-(PL) = {{M}, {Be}}
- C1 = PL ∪ BD-(PL) = {{Br}, {PB}, {J}, {M}, {Be}, {Br,J}, {Br,PB}, {J,PB}}
- First scan of the DB to find L with min_sup = 40% (an itemset must appear in at least 2 txs.)
13. Sampling Algorithm Example
- L = {{Br}, {PB}, {M}, {Be}, {Br,PB}}
- Set C2 = L
- BD-(C2) = {{Br,M}, {Br,Be}, {PB,M}, {PB,Be}, {M,Be}}
  (ignore those itemsets which we know are not large, e.g. {J} and its supersets)
- C3 = C2 ∪ BD-(C2) = {{Br}, {PB}, {M}, {Be}, {Br,PB}, {Br,M}, {Br,Be}, {PB,M}, {PB,Be}, {M,Be}}
14. Sampling Algorithm Example
- Now again find the negative border of C3
- BD-(C3) = {{Br,PB,M}, {Br,M,Be}, {Br,PB,Be}, {PB,M,Be}}
- C4 = C3 ∪ BD-(C3) = {{Br}, {PB}, {M}, {Be}, {Br,PB}, {Br,M}, {Br,Be}, {PB,M}, {PB,Be}, {M,Be}, {Br,PB,M}, {Br,M,Be}, {Br,PB,Be}, {PB,M,Be}}
- BD-(C4) = {{Br,PB,M,Be}}
15. Sampling Algorithm Example
- So finally C5 = {{Br}, {PB}, {M}, {Be}, {Br,PB}, {Br,M}, {Br,Be}, {PB,M}, {PB,Be}, {M,Be}, {Br,PB,M}, {Br,M,Be}, {Br,PB,Be}, {PB,M,Be}, {Br,PB,M,Be}}
- Now it is easy to see that BD-(C5) = ∅
- Do the scan of the DB (second scan) to find the frequent itemsets; while doing this scan you need not check itemsets already in L
- Final L = {{Br}, {PB}, {M}, {Be}, {Br,PB}}
16. Toivonen's Algorithm
- Start as in the simple algorithm, but lower the threshold slightly for the sample.
- Example: if the sample is 1% of the baskets, use 0.008 as the support threshold rather than 0.01.
- Goal is to avoid missing any itemset that is frequent in the full set of baskets.
17. Toivonen's Algorithm (contd.)
- Add to the itemsets that are frequent in the sample the negative border of these itemsets.
- An itemset is in the negative border if it is not deemed frequent in the sample, but all its immediate subsets are.
- Example: ABCD is in the negative border if and only if it is not frequent, but all of ABC, BCD, ACD, and ABD are.
18. Toivonen's Algorithm (contd.)
- In a second pass, count all candidate frequent itemsets from the first pass, and also count the negative border.
- If no itemset from the negative border turns out to be frequent, then the candidates found to be frequent in the whole data are exactly the frequent itemsets.
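A minimal Python sketch of this second-pass decision (illustrative; the dictionaries of full-database counts and the candidate/border sets are assumed to have been produced already):

```python
def toivonen_second_pass(full_counts, candidates, neg_border, minsup_count):
    """If nothing in the negative border is frequent in the full data, the
    frequent candidates are exactly the true frequent itemsets; otherwise
    the algorithm must start over with a fresh sample."""
    if any(full_counts.get(b, 0) >= minsup_count for b in neg_border):
        return None                       # failure: restart with a new sample
    return {c for c in candidates if full_counts.get(c, 0) >= minsup_count}
```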
19. Toivonen's Algorithm (contd.)
- What if we find something in the negative border is actually frequent?
- We must start over again!
- But by choosing the support threshold for the sample wisely, we can make the probability of failure low, while still keeping the number of itemsets checked on the second pass small enough to fit in main memory.
20. Conclusions
- Advantages
  - Reduced failure probability, while keeping the candidate count low enough for memory
- Disadvantages
  - Potentially large number of candidates in the second pass
21. Partitioning
- Divide the database into partitions D1, D2, ..., Dp
- Apply Apriori to each partition
- Any large itemset must be large in at least one partition
- DO YOU AGREE?
- Let's do the proof!
- Remember proof by contradiction?
22. Partitioning Algorithm
- Divide D into partitions D1, D2, ..., Dp (see the sketch below)
- For i = 1 to p do
  - Li = Apriori(Di)
- C = L1 ∪ ... ∪ Lp
- Count C on D to generate L
- Do we need to count?
- Is C = L?
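A minimal Python sketch of this partition-and-recount scheme (illustrative; `local_miner` stands in for any local FI algorithm such as Apriori and is an assumed callable, not defined on the slides). The contradiction argument from the previous slide is what makes the union a safe candidate set: if an itemset were below minsup in every partition, its total count would be below minsup · |D|, so it could not be large overall.

```python
def partition_mine(transactions, p, minsup, local_miner):
    """Partitioning algorithm: union the locally frequent itemsets of p
    partitions (first scan), then count that candidate set once over the
    whole database (second scan). Transactions are assumed to be sets;
    local_miner(part, minsup) returns hashable itemsets such as tuples."""
    n = len(transactions)
    size = -(-n // p)                                   # ceiling division
    parts = [transactions[i:i + size] for i in range(0, n, size)]
    candidates = set()
    for part in parts:                                  # scan 1: mine each partition
        candidates |= set(local_miner(part, minsup))
    counts = {c: sum(1 for t in transactions if set(c) <= t) for c in candidates}
    return {c: cnt for c, cnt in counts.items() if cnt / n >= minsup}   # scan 2
```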
23. Partitioning Example
- D1: L1 = {{Bread}, {Jelly}, {PeanutButter}, {Bread,Jelly}, {Bread,PeanutButter}, {Jelly,PeanutButter}, {Bread,Jelly,PeanutButter}}
- D2: L2 = {{Bread}, {Milk}, {PeanutButter}, {Bread,Milk}, {Bread,PeanutButter}, {Milk,PeanutButter}, {Bread,Milk,PeanutButter}, {Beer}, {Beer,Bread}, {Beer,Milk}}
- s = 10%
24. Partitioning
- Advantages
  - Adapts to available main memory
  - Easily parallelized
  - Maximum number of database scans is two
- Disadvantages
  - May have many candidates during the second scan
25. AR Generation from FIs
- So far we have seen algorithms for finding FIs
- Let's now look at how we can generate the ARs from FIs
- FIs are concerned only with support
- Time to bring in the concept of confidence
- For each FI l, generate all non-empty subsets of l
- For each non-empty proper subset s of l, output the rule
  s → (l − s) if σ(l) / σ(s) ≥ minconf
26. AR Generation from FIs
- For each frequent k-itemset Y, we can have up to 2^k − 2 ARs
  - Ignore empty antecedents/consequents
- Partition Y into 2 non-empty subsets X and Y−X, such that X → Y−X satisfies min_conf
- We need not worry about min_sup!
- Y = {1,2,3}
  - 6 candidate ARs: {1,2} → {3}, {1,3} → {2}, {2,3} → {1}, {1} → {2,3}, {2} → {1,3}, {3} → {1,2}
- Do we need any additional scans to find confidence?
  - For {1,2} → {3}, the confidence is σ({1,2,3}) / σ({1,2})
  - {1,2,3} is frequent, therefore {1,2} is also frequent, so there is no need to find support counts again (see the sketch below)
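A minimal Python sketch of this rule-generation step (illustrative; the support counts for Y = {1,2,3} below are hypothetical and would normally come from the FI-generation phase):

```python
from itertools import combinations

def rules_from_frequent_itemset(itemset, support, minconf):
    """Generate ARs from one frequent itemset using already-known support
    counts: every subset of a frequent itemset is frequent, so no extra
    database scan is needed to compute confidence."""
    itemset = frozenset(itemset)
    rules = []
    for r in range(1, len(itemset)):
        for antecedent in combinations(sorted(itemset), r):
            antecedent = frozenset(antecedent)
            conf = support[itemset] / support[antecedent]
            if conf >= minconf:
                rules.append((set(antecedent), set(itemset - antecedent), conf))
    return rules

# hypothetical support counts for Y = {1, 2, 3}
support = {frozenset(s): c for s, c in [
    ({1}, 6), ({2}, 7), ({3}, 6), ({1, 2}, 4), ({1, 3}, 4), ({2, 3}, 4), ({1, 2, 3}, 3)]}
print(rules_from_frequent_itemset({1, 2, 3}, support, minconf=0.7))
# keeps {1,2}->{3}, {1,3}->{2}, {2,3}->{1} (confidence 0.75 each)
```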
27. Rule Generation
- Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f → L − f satisfies the minimum confidence requirement
- If {A,B,C,D} is a frequent itemset, the candidate rules are
  ABC → D, ABD → C, ACD → B, BCD → A, A → BCD, B → ACD, C → ABD, D → ABC,
  AB → CD, AC → BD, AD → BC, BC → AD, BD → AC, CD → AB
- If |L| = k, then there are 2^k − 2 candidate association rules (ignoring L → ∅ and ∅ → L)
28. Rule Generation
- How to efficiently generate rules from frequent itemsets?
- In general, confidence does not have an anti-monotone property
  - c(ABC → D) can be larger or smaller than c(AB → D)
- But the confidence of rules generated from the same itemset does have an anti-monotone property
  - e.g., L = {A,B,C,D}: c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)
- Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule
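The inequality above follows directly from the anti-monotonicity of support; a short derivation (not on the slide):

```latex
c(ABC \to D) = \frac{\sigma(ABCD)}{\sigma(ABC)} \;\ge\;
c(AB \to CD) = \frac{\sigma(ABCD)}{\sigma(AB)} \;\ge\;
c(A \to BCD) = \frac{\sigma(ABCD)}{\sigma(A)},
\quad \text{since } \sigma(A) \ge \sigma(AB) \ge \sigma(ABC).
```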
29. Rule Generation for Apriori Algorithm
- (Figure: lattice of rules, showing a low-confidence rule and the rules pruned below it)
30. Next Class
- Time complexity of algorithms for finding FIs
- Efficient counting for FIs using a hash tree
- PCY algorithm for FIs
31. Computational Complexity
- Factors affecting the computational complexity of Apriori
  - Min_sup
  - No. of items (dimensionality)
  - No. of transactions
  - Average transaction width
32. (Figure slide — no transcript)
33. Computational Complexity
- No. of items (dimensionality)
  - More space is needed for storing the support counts of items
  - If the no. of FIs grows with dimensionality, the computation and I/O costs will increase because of the larger no. of candidates generated by the algorithm
- No. of transactions
  - Apriori makes repeated passes over the transaction DB
  - Run time increases as a result
- Average transaction width
  - The max. size of FIs increases as the average transaction size increases
  - More itemsets need to be examined during candidate generation and support counting
  - As width increases, more itemsets are contained in each transaction, which increases the hash tree traversal cost
34. (Figure slide — no transcript)
35. Support Counting
- Compare each transaction against every candidate itemset and update the support counts of candidates contained in the transaction
- Computationally expensive when the no. of transactions and the no. of candidates are large
- How to make it efficient?
- Enumerate all itemsets contained in a transaction and use them to update the support counts of their respective candidate itemsets (CIs)
  - T1 contains {1,2,3,5,6}, so it has 5C3 = 10 itemsets of size 3; some of these 10 will correspond to candidates in C3, the others are ignored
- How to make the matching operation efficient?
- Use a HASH TREE!!!
36. Support Counting
- Given a transaction t, what are the possible subsets of size 3?
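A minimal Python sketch of support counting by subset enumeration (illustrative; the candidate set is hypothetical). For the transaction {1,2,3,5,6} it enumerates the 10 subsets of size 3 and only those that are also candidates update a count:

```python
from itertools import combinations

def count_support(transactions, candidates, k):
    """Update candidate counts by enumerating the k-subsets of each transaction
    and matching them against the candidate set, instead of testing every
    candidate against every transaction."""
    counts = {c: 0 for c in candidates}
    for t in transactions:
        for subset in combinations(sorted(t), k):
            if subset in counts:
                counts[subset] += 1
    return counts

# hypothetical candidate 3-itemsets
cands = {(1, 2, 3), (1, 2, 5), (3, 5, 6), (2, 4, 6)}
print(count_support([{1, 2, 3, 5, 6}], cands, 3))
```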
37. Hash Tree
- Partition the CIs into different buckets and store them in a hash tree
- During support counting, itemsets in each transaction are also hashed into their appropriate buckets using the same hash function
38. Hash Tree
- Example: 3-itemsets
- All candidate 3-itemsets are hashed
- Enumerate all the 3-itemsets of the transaction
- All 3-itemsets contained in a transaction are also hashed
- Comparison of a 3-itemset of the transaction with all candidate 3-itemsets is avoided
- Comparison is required only within the appropriate bucket
- Saves time!
39. Hash Tree
- At each internal node, use the hash function h(p) = p mod 3
- All candidate itemsets are stored at the leaf nodes of the hash tree
- Suppose you have 15 candidate 3-itemsets (a sketch of building and probing this tree follows below):
  {1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}
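A minimal Python sketch of such a hash tree (illustrative; the class layout and the leaf capacity of 3 are assumptions, while the hash function h(p) = p mod 3 and the 15 candidates come from the slide):

```python
class HashTreeNode:
    """Hash tree for candidate 3-itemsets using h(p) = p mod 3 at each level."""
    def __init__(self, depth=0, max_leaf=3):
        self.depth, self.max_leaf = depth, max_leaf
        self.children = {}   # hash value -> child node (non-empty means internal node)
        self.itemsets = []   # candidates stored at a leaf

    def insert(self, itemset):
        if self.children:                                   # internal: route by hash
            self._child(itemset[self.depth] % 3).insert(itemset)
            return
        self.itemsets.append(itemset)
        if len(self.itemsets) > self.max_leaf and self.depth < len(itemset):
            stored, self.itemsets = self.itemsets, []       # split an overfull leaf
            for s in stored:
                self._child(s[self.depth] % 3).insert(s)

    def _child(self, h):
        return self.children.setdefault(h, HashTreeNode(self.depth + 1, self.max_leaf))

    def bucket(self, itemset):
        """Leaf whose candidates a 3-itemset of a transaction must be compared with."""
        node = self
        while node.children:
            node = node.children.get(itemset[node.depth] % 3, HashTreeNode())
        return node.itemsets


candidates = [(1, 4, 5), (1, 2, 4), (4, 5, 7), (1, 2, 5), (4, 5, 8), (1, 5, 9),
              (1, 3, 6), (2, 3, 4), (5, 6, 7), (3, 4, 5), (3, 5, 6), (3, 5, 7),
              (6, 8, 9), (3, 6, 7), (3, 6, 8)]
tree = HashTreeNode()
for c in candidates:
    tree.insert(c)
print(tree.bucket((3, 5, 6)))   # compare against this one bucket, not all 15 candidates
```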
40. Hash Tree
- (Figure: candidate hash tree; the hash function buckets items as 1,4,7 / 2,5,8 / 3,6,9; this slide hashes on 1, 4 or 7)
41. Hash Tree
- (Figure: candidate hash tree; hashing on 2, 5 or 8)
42. Association Rule Discovery: Hash Tree
- (Figure: candidate hash tree; hashing on 3, 6 or 9)
43. Subset Operation Using Hash Tree
- (Figure: matching a transaction's 3-itemsets against the hash tree)
44. Subset Operation Using Hash Tree
- (Figure: the transaction's 3-itemsets are hashed down to leaf buckets containing candidates such as {1 3 6}, {3 4 5}, {1 5 9})
45. Subset Operation Using Hash Tree
- (Figure) The transaction is matched against only 11 out of 15 candidates
46. Compact Representation of FIs
- Generally, the no. of FIs generated from a transaction DB can be very large
- It would be good if we could identify a small representative set of FIs from which all other FIs could be generated
- Two such representations:
  - Maximal FIs
  - Closed FIs
47. Maximal Frequent Itemset
- An itemset is maximal frequent if none of its immediate supersets is frequent
- (Figure: itemset lattice showing the maximal itemsets, the infrequent itemsets, and the border between them)
48. Maximal FIs
- Maximal FIs form the smallest set of itemsets from which all the FIs can be derived
- Maximal FIs do not contain the support information of their subsets
- An additional scan of the DB is needed to determine the support counts of the non-maximal FIs
49. Closed Itemset
- An itemset is closed if none of its immediate supersets has the same support as the itemset
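A minimal Python sketch that applies both definitions side by side (illustrative; assumes `frequent` maps every frequent itemset, as a frozenset, to its support count):

```python
def classify_itemsets(frequent):
    """Label each frequent itemset as closed and/or maximal frequent.
    Infrequent immediate supersets can be ignored for the 'closed' test,
    since their support is necessarily below that of a frequent itemset."""
    labels = {}
    for itemset, sup in frequent.items():
        freq_supersets = [s for s in frequent
                          if itemset < s and len(s) == len(itemset) + 1]
        labels[itemset] = {
            "closed": all(frequent[s] < sup for s in freq_supersets),
            "maximal": not freq_supersets,   # no immediate superset is frequent
        }
    return labels
```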
50. Maximal vs Closed Itemsets
- (Figure: itemset lattice annotated with the transaction IDs supporting each itemset; itemsets not supported by any transactions are marked)
51. Maximal vs Closed Frequent Itemsets
- (Figure: itemset lattice with minimum support = 2, marking itemsets that are closed but not maximal and itemsets that are closed and maximal; closed = 9, maximal = 4)
52. Maximal vs Closed Itemsets
53. AR Topics Remaining
- Breadth-first vs. Depth-first
- Horizontal vs. Vertical data layout
- Types of ARs
  - Boolean/Quantitative
  - Single/Multi-Dimensional
  - Single/Multi-Level
- Measuring Quality of Rules
54. Breadth-first vs. Depth-first
55. Breadth-first vs. Depth-first
56. Breadth-first vs. Depth-first
- For finding maximal frequent itemsets, which approach would you take?
  - BFS?
  - DFS?
57. Breadth-first vs. Depth-first
- DFS gives quick FI border detection!
- Once a maximal FI is found, substantial pruning can be performed on its subsets
  - If bcde is an MFI, then we need not visit the subtrees rooted at bd, be, c, d, and e, because they will not contain any MFI
  - If abc is an MFI, then we can only conclude that nodes such as ac and bc are not MFIs
  - If the support for abc is identical to that of ab, then abd and abe can be skipped because they will not contain any MFI