Association rule mining - PowerPoint PPT Presentation

1
Association rule mining
  • Prof. Navneet Goyal
  • CSIS Department, BITS-Pilani

2
Association Rule Mining
  • Find all rules of the form Itemset1 → Itemset2
    having
  • support ≥ minsup threshold
  • confidence ≥ minconf threshold
  • Brute-force approach
  • List all possible association rules
  • Compute the support and confidence for each rule
  • Prune rules that fail the minsup and minconf
    thresholds
  • ⇒ Computationally prohibitive!
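To make the two thresholds concrete, here is a minimal Python sketch of how support and confidence could be computed by scanning the transactions directly. The five-transaction grocery database and the function names are assumptions made for illustration; they are not taken from the slides.

```python
def count(transactions, itemset):
    """Number of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= frozenset(t))

def support(transactions, itemset):
    """Fraction of transactions that contain `itemset`."""
    return count(transactions, itemset) / len(transactions)

def confidence(transactions, lhs, rhs):
    """count(lhs ∪ rhs) / count(lhs) for the rule lhs -> rhs."""
    return count(transactions, set(lhs) | set(rhs)) / count(transactions, lhs)

# Hypothetical toy database (an assumption, not data from the slides).
DB = [
    {"Bread", "Jelly", "PeanutButter"},
    {"Bread", "PeanutButter"},
    {"Bread", "Milk", "PeanutButter"},
    {"Beer", "Bread"},
    {"Beer", "Milk"},
]

# {Bread, PeanutButter} occurs in 3 of the 5 transactions;
# 3 of the 4 transactions containing Bread also contain PeanutButter.
print(support(DB, {"Bread", "PeanutButter"}))
print(confidence(DB, {"Bread"}, {"PeanutButter"}))
```

A rule would be reported only if both values clear their respective thresholds.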

3
Association Rule Mining
  • 2 step process
  • FI generation
  • Rule generation

4
FI Generation
  • Brute-force approach
  • Each itemset in the lattice is a candidate FI
  • Count the support of each candidate by scanning
    the database
  • Match each transaction against every candidate
  • Complexity ~ O(NMw) ⇒ expensive, since M = 2^d!
    (N = no. of transactions, M = no. of candidates,
    w = max. transaction width)

5
Computational Complexity
  • Given d unique items
  • Total number of itemsets = 2^d
  • Total number of possible association rules:
    R = Σ_{k=1}^{d−1} [ C(d,k) × Σ_{j=1}^{d−k} C(d−k, j) ]
      = 3^d − 2^(d+1) + 1
  • If d = 6, R = 602 rules
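The rule count can be checked by brute force. The sketch below enumerates every pair of non-empty, disjoint itemsets (antecedent, consequent) over d items and compares the total with the closed form 3^d − 2^(d+1) + 1. The function names are illustrative assumptions.

```python
from itertools import combinations

def count_rules_brute_force(d):
    """Count rules X -> Y with X and Y non-empty and disjoint, over d items."""
    items = range(d)
    total = 0
    for k in range(1, d):                          # size of the antecedent X
        for lhs in combinations(items, k):
            rest = [i for i in items if i not in lhs]
            for j in range(1, len(rest) + 1):      # size of the consequent Y
                for _rhs in combinations(rest, j):
                    total += 1
    return total

def count_rules_closed_form(d):
    return 3**d - 2**(d + 1) + 1

print(count_rules_brute_force(6), count_rules_closed_form(6))
```

For d = 6 both functions give the 602 rules quoted on the slide.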
6
Association Rule Mining
  • 2 step process
  • FI generation
  • Rule generation

7
Frequent Itemset Generation Strategies
  • Reduce the number of candidates (M)
  • Complete search: M = 2^d
  • Use pruning techniques to reduce M
  • Reduce the number of transactions (N)
  • Reduce size of N as the size of itemset increases
  • Used by DHP and vertical-based mining algorithms
  • Reduce the number of comparisons (NM)
  • Use efficient data structures to store the
    candidates or transactions
  • No need to match every candidate against every
    transaction

8
Reducing Number of Candidates
  • Apriori Principle
  • If an itemset is frequent, then all its subsets
    must be frequent
  • Apriori principle holds due to the following
    property of the support measure
  • Support of an itemset never exceeds the support
    of its subsets
  • This is known as the anti-monotone property of
    support
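The Apriori principle translates into a level-wise search: a k-itemset becomes a candidate only if all of its (k−1)-subsets were found frequent. The following is a minimal sketch of that idea, not an optimized implementation; the function name, the use of absolute support counts, and the toy database are assumptions for illustration.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {frozenset: support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_count}
    result = dict(frequent)
    k = 2
    while frequent:
        prev = set(frequent)
        # Join: unions of two frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune (Apriori principle): every (k-1)-subset must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_count}
        result.update(frequent)
        k += 1
    return result

# Hypothetical toy database (an assumption, not data from the slides).
DB = [
    {"Bread", "Jelly", "PeanutButter"},
    {"Bread", "PeanutButter"},
    {"Bread", "Milk", "PeanutButter"},
    {"Beer", "Bread"},
    {"Beer", "Milk"},
]
for itemset, n in sorted(apriori(DB, 2).items(), key=lambda kv: sorted(kv[0])):
    print(sorted(itemset), n)
```

Because Jelly is infrequent at level 1, no itemset containing Jelly is ever generated as a candidate; that is the pruning the principle buys.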

9
Illustrating Apriori Principle
10
Sampling Algorithm
  • Tx. DB can get very big!
  • Sample the DB and apply apriori to the sample
  • Use reduced minsup (smalls)
  • Find large (frequent) itemsets from the sample
    using smalls
  • Call this set of large itemsets Potentially
    Large (PL)
  • Find the negative border (BD-) of PL
  • Minimal set of itemsets which are not in PL, but
    whose subsets are all in PL.

11
Negative Border Example
  • Let Items = {A, ..., F} and let the itemsets be
  • {A}, {B}, {C}, {F}, {A,B}, {A,C}, {A,F},
    {C,F}, {A,C,F}
  • The whole negative border is
  • {{B,C}, {B,F}, {D}, {E}}
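The negative border in this example can be computed mechanically. The sketch below assumes the input family is downward closed (every subset of a member is also a member), so checking the immediate subsets of each candidate is enough; the function name is an illustrative assumption.

```python
from itertools import combinations

def negative_border(items, family):
    """Itemsets not in `family` whose immediate subsets are all in `family`.

    Assumes `family` is downward closed. The empty set is treated as always
    present, so single items missing from `family` belong to the border.
    """
    family = {frozenset(s) for s in family}
    border = set()
    max_size = max((len(s) for s in family), default=0) + 1
    for k in range(1, max_size + 1):
        for cand in combinations(sorted(items), k):
            cand_set = frozenset(cand)
            if cand_set in family:
                continue
            if k == 1 or all(frozenset(sub) in family
                             for sub in combinations(cand, k - 1)):
                border.add(cand_set)
    return border

family = ["A", "B", "C", "F", "AB", "AC", "AF", "CF", "ACF"]
border = negative_border("ABCDEF", family)
print(sorted("".join(sorted(s)) for s in border))
```

Running it on the itemsets of the slide yields BC, BF, D and E, the border stated above.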

12
Sampling Algorithm Example
  • Sample database = {t1, t2}
  • smalls = 20%, min_sup = 40%
  • PL = {{Br}, {PB}, {J}, {Br,J}, {Br,PB}, {J,PB}}
  • BD-(PL) = {{M}, {Be}}
  • C1 = PL ∪ BD-(PL) = {{Br}, {PB}, {J}, {M}, {Be},
    {Br,J}, {Br,PB}, {J,PB}}
  • First scan of the DB to find L with min_sup = 40%
    (an itemset must appear in at least 2 txs.)

13
Sampling Algorithm Example
  • L = {{Br}, {PB}, {M}, {Be}, {Br,PB}}
  • Set C2 = L
  • BD-(C2) = {{Br,M}, {Br,Be}, {PB,M},
    {PB,Be}, {M,Be}}
  • (ignore those itemsets which we know are not
    large, e.g. J and its supersets)
  • C3 = C2 ∪ BD-(C2) = {{Br}, {PB}, {M}, {Be},
    {Br,PB}, {Br,M}, {Br,Be}, {PB,M}, {PB,Be}, {M,Be}}

14
Sampling Algorithm Example
  • Now again find the negative border of C3
  • BD-(C3) = {{Br,PB,M}, {Br,M,Be}, {Br,PB,Be},
    {PB,M,Be}}
  • C4 = C3 ∪ BD-(C3) = {{Br}, {PB}, {M}, {Be},
    {Br,PB}, {Br,M}, {Br,Be}, {PB,M},
    {PB,Be}, {M,Be}, {Br,PB,M}, {Br,M,Be},
    {Br,PB,Be}, {PB,M,Be}}
  • BD-(C4) = {{Br,PB,M,Be}}

15
Sampling Algorithm Example
  • So finally C5 = {{Br}, {PB}, {M}, {Be}, {Br,PB},
    {Br,M}, {Br,Be}, {PB,M}, {PB,Be}, {M,Be},
    {Br,PB,M}, {Br,M,Be}, {Br,PB,Be}, {PB,M,Be},
    {Br,PB,M,Be}}
  • Now it is easy to see that BD-(C5) = ∅
  • Do the scan of the DB (second scan) to find the
    frequent itemsets. While doing this scan you need
    not check the itemsets in L
  • Final L = {{Br}, {PB}, {M}, {Be}, {Br,PB}}

16
Toivonen's Algorithm
  • Start as in the simple algorithm, but lower the
    threshold slightly for the sample.
  • Example: if the sample is 1% of the baskets, use
    0.008 × s as the support threshold rather than
    0.01 × s, where s is the support threshold for the
    full data.
  • Goal is to avoid missing any itemset that is
    frequent in the full set of baskets.

17
Toivonen's Algorithm (contd.)
  • Add to the itemsets that are frequent in the
    sample the negative border of these itemsets.
  • An itemset is in the negative border if it is not
    deemed frequent in the sample, but all its
    immediate subsets are.
  • Example: ABCD is in the negative border if and
    only if it is not frequent, but all of ABC, BCD,
    ACD, and ABD are.

18
Toivonen's Algorithm (contd.)
  • In a second pass, count all candidate frequent
    itemsets from the first pass, and also count the
    negative border.
  • If no itemset from the negative border turns out
    to be frequent, then the candidates found to be
    frequent in the whole data are exactly the
    frequent itemsets.

19
Toivonen's Algorithm (contd.)
  • What if we find that something in the negative
    border is actually frequent?
  • We must start over again!
  • But by choosing the support threshold for the
    sample wisely, we can make the probability of
    failure low, while still keeping the number of
    itemsets checked on the second pass low enough
    for main-memory.
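One pass of the procedure described on the last four slides can be sketched as follows: mine the sample with a lowered threshold, add the negative border, count both sets on the full data, and report failure if any border itemset turns out to be frequent. The helper names, the brute-force miner, the absolute count thresholds, and the toy database are all assumptions for illustration.

```python
from itertools import combinations

def count(transactions, itemset):
    return sum(1 for t in transactions if itemset <= t)

def frequent_itemsets(transactions, min_count):
    """Brute-force miner; adequate only for tiny examples."""
    items = sorted(set().union(*transactions))
    return {frozenset(c)
            for k in range(1, len(items) + 1)
            for c in combinations(items, k)
            if count(transactions, frozenset(c)) >= min_count}

def negative_border(items, family):
    """Itemsets not in `family` whose immediate subsets all are (family downward closed)."""
    border = set()
    max_size = max((len(s) for s in family), default=0) + 1
    for k in range(1, max_size + 1):
        for cand in combinations(sorted(items), k):
            if frozenset(cand) in family:
                continue
            if k == 1 or all(frozenset(s) in family for s in combinations(cand, k - 1)):
                border.add(frozenset(cand))
    return border

def toivonen_pass(database, sample, min_count, sample_min_count):
    """Return (itemsets found frequent on the full data, failed flag)."""
    database = [frozenset(t) for t in database]
    sample = [frozenset(t) for t in sample]
    items = set().union(*database)
    candidates = frequent_itemsets(sample, sample_min_count)
    border = negative_border(items, candidates)
    frequent = {c for c in candidates | border if count(database, c) >= min_count}
    failed = any(c in frequent for c in border)   # a border itemset is frequent
    return frequent, failed

# Hypothetical toy database (an assumption, not data from the slides).
DB = [
    {"Bread", "Jelly", "PeanutButter"},
    {"Bread", "PeanutButter"},
    {"Bread", "Milk", "PeanutButter"},
    {"Beer", "Bread"},
    {"Beer", "Milk"},
]
print(toivonen_pass(DB, DB[:2], min_count=2, sample_min_count=1)[1])
print(toivonen_pass(DB, [DB[0], DB[2], DB[4]], min_count=2, sample_min_count=1)[1])
```

With the first sample, Milk and Beer sit in the negative border and are frequent in the full data, so the pass fails and a new sample is needed. With the second sample no border itemset is frequent, so the returned set is exactly the frequent itemsets.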

20
Conclusions
  • Advantages
  • Reduced failure probability, while keeping
    candidate-count low enough for memory
  • Disadvantages
  • Potentially large number of candidates in the
    second pass

21
Partitioning
  • Divide database into partitions D1, D2, ..., Dp
  • Apply Apriori to each partition
  • Any large itemset must be large in at least one
    partition
  • DO YOU AGREE?
  • Let's do the proof!
  • Remember proof by contradiction?

22
Partitioning Algorithm
  1. Divide D into partitions D1, D2, ..., Dp
  2. For i = 1 to p do
  3.   Li = Apriori(Di)
  4. C = L1 ∪ ... ∪ Lp
  5. Count C on D to generate L
  6. Do we need to count?
  7. Is C = L?
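The numbered steps above can be sketched in Python as shown below. A brute-force miner stands in for Apriori(Di), thresholds are given as integer percentages, and the two partitions are a hypothetical toy database; all of these are assumptions for illustration.

```python
from itertools import combinations

def count(transactions, itemset):
    return sum(1 for t in transactions if itemset <= t)

def local_frequent(partition, min_pct):
    """Brute-force stand-in for Apriori(Di); fine for tiny partitions only."""
    items = sorted(set().union(*partition))
    return {frozenset(c)
            for k in range(1, len(items) + 1)
            for c in combinations(items, k)
            if count(partition, frozenset(c)) * 100 >= min_pct * len(partition)}

def partition_mine(partitions, min_pct):
    partitions = [[frozenset(t) for t in p] for p in partitions]
    database = [t for p in partitions for t in p]
    local = [local_frequent(p, min_pct) for p in partitions]   # steps 2-3
    candidates = set().union(*local)                           # step 4: C = L1 ∪ ... ∪ Lp
    frequent = {c for c in candidates                          # step 5: count C on D
                if count(database, c) * 100 >= min_pct * len(database)}
    return local, candidates, frequent

# Hypothetical partitions (an assumption, not data from the slides).
D1 = [{"Bread", "Jelly", "PeanutButter"}, {"Bread", "PeanutButter"}]
D2 = [{"Bread", "Milk", "PeanutButter"}, {"Beer", "Bread"}, {"Beer", "Milk"}]

local, C, L = partition_mine([D1, D2], min_pct=40)
print(len(C), len(L))
```

With a 40% threshold on this toy data the candidate set C has 9 itemsets but only 5 survive the global count, so C is in general a superset of L and the second scan is needed.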

23
Partitioning Example
D1: L1 = {{Bread}, {Jelly}, {PeanutButter},
{Bread,Jelly}, {Bread,PeanutButter},
{Jelly,PeanutButter}, {Bread,Jelly,PeanutButter}}
D2: L2 = {{Bread}, {Milk}, {PeanutButter},
{Bread,Milk}, {Bread,PeanutButter},
{Milk,PeanutButter}, {Bread,Milk,PeanutButter}, {Beer},
{Beer,Bread}, {Beer,Milk}}
s = 10%
24
Partitioning
  • Advantages
  • Adapts to available main memory
  • Easily parallelized
  • Maximum number of database scans is two.
  • Disadvantages
  • May have many candidates during second scan.

25
AR Generation from FIs
  • So far we have seen algorithms for finding FIs
  • Let's now look at how we can generate the ARs from
    FIs
  • FIs are concerned only with support
  • Time to bring in the concept of confidence
  • For each FI l, generate all non-empty proper
    subsets of l
  • For each such subset s of l, output the rule
  • s → (l − s) if support(l) / support(s) ≥ min_conf

26
AR Generation from FIs
  • For each k-FI, Y, we can have up to 2^k − 2 ARs.
  • Ignore empty antecedents/consequents
  • Partition Y into 2 non-empty subsets X and Y − X,
    such that X → Y − X satisfies min_conf
  • We need not worry about min_sup!
  • Y = {1,2,3}
  • 6 candidate ARs: {1,2} → {3}, {1,3} → {2},
    {2,3} → {1}, {1} → {2,3}, {2} → {1,3}, {3} → {1,2}
  • Do we need any additional scans to find
    confidence?
  • For {1,2} → {3}, the confidence is
    σ({1,2,3}) / σ({1,2})
  • {1,2,3} is frequent, therefore {1,2} is also
    frequent. So no need to find support counts again
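Since every subset of a frequent itemset is itself frequent, its support count is already available, and rules can be generated from the stored counts without another database scan. The sketch below assumes a dictionary of support counts for all frequent itemsets; the counts themselves and the function name are made up for illustration.

```python
from itertools import combinations

def generate_rules(support, min_conf):
    """support: {frozenset: count} covering every frequent itemset.

    Returns (antecedent, consequent, confidence) triples. No database scan
    is needed because each antecedent is a subset of a frequent itemset and
    so already has an entry in `support`.
    """
    rules = []
    for itemset, n in support.items():
        for r in range(1, len(itemset)):             # non-empty proper subsets
            for lhs in combinations(sorted(itemset), r):
                lhs = frozenset(lhs)
                conf = n / support[lhs]
                if conf >= min_conf:
                    rules.append((lhs, itemset - lhs, conf))
    return rules

# Hypothetical support counts (an assumption, not data from the slides).
SUPPORT = {
    frozenset({1}): 4, frozenset({2}): 3, frozenset({3}): 3,
    frozenset({1, 2}): 3, frozenset({1, 3}): 2, frozenset({2, 3}): 2,
    frozenset({1, 2, 3}): 2,
}
for lhs, rhs, conf in generate_rules(SUPPORT, min_conf=0.6):
    print(sorted(lhs), "->", sorted(rhs), round(conf, 2))
```

With min_conf = 0 the itemset {1,2,3} contributes exactly the six candidate rules listed on the slide.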

27
Rule Generation
  • Given a frequent itemset L, find all non-empty
    subsets f ⊂ L such that f → L − f satisfies the
    minimum confidence requirement
  • If {A,B,C,D} is a frequent itemset, candidate
    rules:
  • ABC → D, ABD → C, ACD → B, BCD → A, A → BCD,
    B → ACD, C → ABD, D → ABC, AB → CD, AC → BD,
    AD → BC, BC → AD, BD → AC, CD → AB
  • If |L| = k, then there are 2^k − 2 candidate
    association rules (ignoring L → ∅ and ∅ → L)

28
Rule Generation
  • How to efficiently generate rules from frequent
    itemsets?
  • In general, confidence does not have an
    anti-monotone property
  • c(ABC → D) can be larger or smaller than c(AB → D)
  • But confidence of rules generated from the same
    itemset has an anti-monotone property
  • e.g., L = {A,B,C,D}: c(ABC → D) ≥ c(AB → CD) ≥
    c(A → BCD)
  • Confidence is anti-monotone w.r.t. number of
    items on the RHS of the rule
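The ordering of the three confidences follows directly from the anti-monotone property of support. Writing σ for the support count, all three rules share the same numerator, and only the denominator changes as items move to the right-hand side:

```latex
c(ABC \to D) = \frac{\sigma(ABCD)}{\sigma(ABC)}, \qquad
c(AB \to CD) = \frac{\sigma(ABCD)}{\sigma(AB)}, \qquad
c(A \to BCD) = \frac{\sigma(ABCD)}{\sigma(A)}

\sigma(ABC) \le \sigma(AB) \le \sigma(A)
\;\Longrightarrow\;
c(ABC \to D) \ge c(AB \to CD) \ge c(A \to BCD)
```

So once a rule fails min_conf, every rule from the same itemset whose consequent is a superset of the failed consequent can be skipped.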

29
Rule Generation for Apriori Algorithm
[Figure: lattice of rules, with a low-confidence rule marked]
30
Next Class
  • Time Complexity of algorithms for finding FIs
  • Efficient Counting for FIs using hash tree
  • PCY algorithm for FI

31
Computational Complexity
  • Factors affecting computational complexity of
    Apriori
  • Min_sup
  • No. of items (dimensionality)
  • No. of transactions
  • Average transaction width

32
(No Transcript)
33
Computational Complexity
  • No. of items (dimensionality)
  • More space for storing support counts of items
  • If the no. of FIs grows with dimensionality, the
    computation and I/O costs will increase because of
    the large no. of candidates generated by the algo.
  • No. of Transactions
  • Apriori makes repeated passes of the tr. DB
  • Run time increases as a result
  • Average Transaction width
  • Max. size of FIs increases as the avg. size of tx.
    increases
  • More itemsets need to be examined during
    candidate generation and support counting
  • As width increases, more itemsets are contained
    in each tx., which increases the number of hash
    tree traversals during support counting

34
(No Transcript)
35
Support Counting
  • Compare each tx. against every candidate itemset
    and update the support count of candidates
    contained in the tx.
  • Computationally expensive when the no. of txs. and
    the no. of candidates are large
  • How to make it efficient?
  • Enumerate all itemsets contained in a tx. and use
    them to update support counts of their respective
    CIs
  • T1 = {1,2,3,5,6} contains C(5,3) = 10 itemsets of
    size 3. Some of these 10 will correspond to
    candidates in C3; the others are ignored.
  • How to make matching operation efficient?
  • Use HASH TREE!!!
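The enumeration strategy can be sketched with a plain dictionary before bringing in a hash tree: list the k-subsets of each transaction and bump the counter of any subset that is a candidate. The candidate set C3 below and the function name are hypothetical, chosen only to illustrate the idea.

```python
from itertools import combinations

def count_supports(transactions, candidates, k):
    """Count each candidate k-itemset by enumerating k-subsets of every transaction."""
    counts = {frozenset(c): 0 for c in candidates}
    for t in transactions:
        for subset in combinations(sorted(t), k):
            key = frozenset(subset)
            if key in counts:            # subsets that are not candidates are ignored
                counts[key] += 1
    return counts

T1 = {1, 2, 3, 5, 6}
C3 = [{1, 2, 5}, {1, 3, 6}, {2, 3, 4}]   # hypothetical candidates

print(len(list(combinations(sorted(T1), 3))))   # number of 3-subsets of T1
print(count_supports([T1], C3, 3))
```

T1 yields 10 subsets of size 3; only {1,2,5} and {1,3,6} match a candidate here, and the remaining eight are discarded.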

36
Support Counting
Given a transaction t, what are the possible
subsets of size 3?
37
Hash Tree
  • Partition CIs into different buckets and store
    them in a hash tree
  • During support counting, itemsets in each tx. are
    also hashed into their appropriate buckets using
    the same hash function

38
Hash Tree
  • Example 3-itemset
  • All candidate 3-itemsets are hashed
  • Enumerate all the 3-itemsets of the tx.
  • All 3-itemsets contained in a transaction are
    also hashed
  • Comparison of a 3-itemset of tx. with all
    candidate 3-itemsets is avoided
  • Comparison is required to be done only in the
    appropriate bucket
  • Saves time!

39
Hash Tree
  • For each internal node use hash fn. h(p) = p mod
    3
  • All candidate itemsets are stored at the leaf
    nodes of the hash tree
  • Suppose you have 15 candidate 3-itemsets:
  • {1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8},
    {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5},
    {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}
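A small hash tree along these lines can be sketched as follows. Internal nodes branch on h(p) = p mod 3 applied to the item at the current depth, and a leaf is split when it holds more than a chosen number of candidates. The leaf capacity of 3, the dictionary-based node layout, and all names are assumptions; the resulting tree is not claimed to match the figure on the following slides.

```python
from itertools import combinations

class HashTree:
    def __init__(self, k, max_leaf=3, branches=3):
        self.k, self.max_leaf, self.branches = k, max_leaf, branches
        self.root = {"leaf": True, "items": []}

    def _h(self, item):
        return item % self.branches                  # h(p) = p mod 3 by default

    def insert(self, itemset):
        itemset = tuple(sorted(itemset))
        node, depth = self.root, 0
        while not node["leaf"]:                      # descend, hashing one item per level
            node = node["children"].setdefault(
                self._h(itemset[depth]), {"leaf": True, "items": []})
            depth += 1
        node["items"].append(itemset)
        if len(node["items"]) > self.max_leaf and depth < self.k:
            self._split(node, depth)

    def _split(self, node, depth):
        items = node.pop("items")
        node["leaf"], node["children"] = False, {}
        for it in items:
            child = node["children"].setdefault(
                self._h(it[depth]), {"leaf": True, "items": []})
            child["items"].append(it)
        for child in node["children"].values():
            if len(child["items"]) > self.max_leaf and depth + 1 < self.k:
                self._split(child, depth + 1)

    def find_leaf(self, itemset):
        """Return the candidates stored in the bucket that `itemset` hashes to."""
        node, depth = self.root, 0
        while not node["leaf"]:
            node = node["children"].get(self._h(itemset[depth]))
            if node is None:
                return []
            depth += 1
        return node["items"]

def count_with_hash_tree(tree, transactions, k):
    counts = {}
    for t in transactions:
        for subset in combinations(sorted(t), k):
            if subset in tree.find_leaf(subset):     # compare within one bucket only
                counts[subset] = counts.get(subset, 0) + 1
    return counts

CANDIDATES = [(1, 4, 5), (1, 2, 4), (4, 5, 7), (1, 2, 5), (4, 5, 8),
              (1, 5, 9), (1, 3, 6), (2, 3, 4), (5, 6, 7), (3, 4, 5),
              (3, 5, 6), (3, 5, 7), (6, 8, 9), (3, 6, 7), (3, 6, 8)]
tree = HashTree(k=3)
for c in CANDIDATES:
    tree.insert(c)
print(count_with_hash_tree(tree, [{1, 2, 3, 5, 6}], 3))
```

For the transaction {1,2,3,5,6}, only the candidates {1,2,5}, {1,3,6} and {3,5,6} are contained in it, and each 3-subset of the transaction is compared against a single leaf rather than against all 15 candidates.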

40
Hash tree
[Figure: candidate hash tree. The hash function groups items as {1,4,7}, {2,5,8}, {3,6,9}; this slide highlights the branch taken when hashing on 1, 4 or 7.]
41
Hash Tree
[Figure: candidate hash tree. The hash function groups items as {1,4,7}, {2,5,8}, {3,6,9}; this slide highlights the branch taken when hashing on 2, 5 or 8.]
42
Association Rule Discovery Hash tree
[Figure: candidate hash tree. The hash function groups items as {1,4,7}, {2,5,8}, {3,6,9}; this slide highlights the branch taken when hashing on 3, 6 or 9.]
43
Subset Operation Using Hash Tree
transaction
44
Subset Operation Using Hash Tree
transaction
1 3 6
3 4 5
1 5 9
45
Subset Operation Using Hash Tree
transaction
1 3 6
3 4 5
1 5 9
Match transaction against 11 out of 15 candidates
46
Compact Representation of FIs
  • Generally, the no. of FIs generated by a tx. DB
    can be very large
  • Good if we could identify a small representative
    set of FIs from which all other FIs could be
    generated
  • 2 such representations
  • Maximal FIs
  • Closed FIs

47
Maximal Frequent Itemset
A frequent itemset is maximal if none of its
immediate supersets is frequent
[Figure: itemset lattice showing the maximal itemsets, the infrequent itemsets, and the border between them]
48
Maximal FIs
  • Maximal FIs are the smallest set of itemsets from
    which all the FIs can be derived
  • Maximal FIs do not contain support information of
    their subsets
  • An additional scan of the DB is needed to
    determine the support count of the non-maximal
    FIs

49
Closed Itemset
  • An itemset is closed if none of its immediate
    supersets has the same support as the itemset
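Both compact representations can be derived from a table of frequent itemsets and their support counts. The sketch below applies the two definitions literally; the support counts and the function name are assumptions for illustration. Only frequent supersets need to be inspected for closedness, since an infrequent superset always has strictly lower support than a frequent itemset.

```python
def maximal_and_closed(support):
    """support: {frozenset: count} for all frequent itemsets."""
    maximal, closed = set(), set()
    for s, n in support.items():
        # Immediate frequent supersets: exactly one extra item.
        supersets = [t for t in support if len(t) == len(s) + 1 and s < t]
        if not supersets:
            maximal.add(s)               # no immediate superset is frequent
        if all(support[t] < n for t in supersets):
            closed.add(s)                # no immediate superset has the same support
    return maximal, closed

# Hypothetical support counts (an assumption, not data from the slides).
SUPPORT = {
    frozenset({1}): 4, frozenset({2}): 3, frozenset({3}): 3,
    frozenset({1, 2}): 3, frozenset({1, 3}): 2, frozenset({2, 3}): 2,
    frozenset({1, 2, 3}): 2,
}
maximal, closed = maximal_and_closed(SUPPORT)
print(sorted(sorted(s) for s in maximal))
print(sorted(sorted(s) for s in closed))
```

In this toy table {1,2,3} is the only maximal itemset, while {1}, {3}, {1,2} and {1,2,3} are closed; every maximal frequent itemset is also closed, but not the other way around.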

50
Maximal vs Closed Itemsets
[Figure: itemset lattice annotated with transaction IDs; some itemsets are not supported by any transaction]
51
Maximal vs Closed Frequent Itemsets
[Figure: itemset lattice with minimum support = 2, marking itemsets that are closed but not maximal and itemsets that are closed and maximal]
# Closed = 9, # Maximal = 4
52
Maximal vs Closed Itemsets
53
AR Topics Remaining
  • Breadth-first vs. Depth-first
  • Horizontal vs. Vertical data layout
  • Types of ARs
  • Boolean/Quantitative
  • Single/Multi-Dimensional
  • Single/Multi-Level
  • Measuring Quality of Rules

54
Breadth-first vs. Depth-first
55
Breadth-first vs. Depth-first
56
Breadth-first vs. Depth-first
  • For finding maximal frequent itemsets, which
    approach would you take?
  • BFS
  • DFS

57
Breadth-first vs. Depth-first
  • Quick FI border detection!
  • Once a maximal FI is found, substantial pruning
    can be performed on its subsets
  • If bcde is an MFI, then we need not visit the
    subtrees rooted at bd, be, c, d, and e because they
    will not contain any MFI
  • If abc is an MFI, then only nodes such as ac and
    bc are not MFIs.
  • If the support for abc is identical to that of ab,
    then the subtrees rooted at abd and abe can be
    skipped because they will not contain any MFI