Title: Association Analysis
1 Association Analysis
2 Association Rule Mining: Definition
- Given a set of records, each of which contains some number of items from a given collection,
- produce dependency rules that predict the occurrence of an item based on the occurrences of other items.
- Rules discovered: {Milk} → {Coke}, {Diaper, Milk} → {Beer}
3 Association Rules
- Marketing and sales promotion
- Let the discovered rule be {Bagels, …} → {Potato Chips}
- Potato Chips as the consequent ⇒ can be used to determine what should be done to boost its sales.
- Bagels in the antecedent ⇒ can be used to see which products would be affected if the store discontinued selling bagels.
4 Two key issues
- First: discovering patterns from a large transaction data set can be computationally expensive.
- Second: some of the discovered patterns are potentially spurious, because they may happen simply by chance.
5 Items and transactions
- Let
- I = {i1, i2, ..., id} be the set of all items in a market-basket data set, and
- T = {t1, t2, ..., tN} be the set of all transactions.
- Each transaction ti contains a subset of items chosen from I.
- Itemset
- A collection of one or more items
- Example: {Milk, Bread, Diaper}
- k-itemset
- An itemset that contains k items
- Transaction width
- The number of items present in a transaction
- A transaction tj is said to contain an itemset X if X is a subset of tj.
- E.g., the second transaction contains the itemset {Bread, Diapers} but not {Bread, Milk}.
6 Definition: Frequent Itemset
- Support count (σ)
- Frequency of occurrence of an itemset
- E.g. σ({Milk, Bread, Diaper}) = 2
- Support (s)
- Fraction of transactions that contain an itemset
- E.g. s({Milk, Bread, Diaper}) = 2/5 = σ/N
- Frequent itemset
- An itemset whose support is greater than or equal to a minsup threshold
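A minimal Python sketch of these definitions, assuming the standard five-transaction market-basket table these slides refer to (the table itself is not reproduced here):

```python
# Assumed example transactions (not shown on the slide itself).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """sigma(X): number of transactions that contain itemset X."""
    return sum(1 for t in transactions if set(itemset) <= t)

X = {"Milk", "Bread", "Diaper"}
sigma = support_count(X, transactions)   # 2
s = sigma / len(transactions)            # 2/5 = 0.4
print(sigma, s)
```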
7 Definition: Association Rule
- Association rule
- An implication expression of the form X → Y, where X and Y are itemsets
- Example: {Milk, Diaper} → {Beer}
- Rule evaluation metrics (X → Y)
- Support (s)
- Fraction of transactions that contain both X and Y
- Confidence (c)
- Measures how often items in Y appear in transactions that contain X
8 Why Use Support and Confidence?
- Support
- A rule with very low support may occur simply by chance.
- A rule with very low support is likely uninteresting.
- Confidence, for a given rule X → Y
- The higher the confidence, the more likely it is for Y to be present in transactions that contain X.
- Measures the reliability of the inference made by the rule.
- Is an estimate of the conditional probability of Y given X.
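A small sketch of both rule metrics for {Milk, Diaper} → {Beer}, again assuming the five-transaction example table:

```python
# Same assumed example transactions as in the previous sketch.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(itemset):
    return sum(1 for t in transactions if set(itemset) <= t)

X, Y = {"Milk", "Diaper"}, {"Beer"}
s = sigma(X | Y) / len(transactions)   # support    = 2/5 = 0.4
c = sigma(X | Y) / sigma(X)            # confidence = 2/3 ≈ 0.67
print(s, c)
```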
9 Association Rule Mining Task
- Given a set of transactions T, the goal of association rule mining is to find all rules having
- support ≥ minsup threshold
- confidence ≥ minconf threshold
- Brute-force approach
- List all possible association rules
- Compute the support and confidence for each rule
- Prune rules that fail the minsup and minconf thresholds
- ⇒ Computationally prohibitive!
10 Brute-force approach
- Suppose there are d items. We first choose k of the items to form the left-hand side of the rule. There are C(d, k) ways of doing this.
- Then there are C(d−k, i) ways to choose i of the remaining items to form the right-hand side of the rule, where 1 ≤ i ≤ d−k.
11 Brute-force approach
- The total number of possible rules is R = 3^d − 2^(d+1) + 1.
- For d = 6,
- 3^6 − 2^7 + 1 = 602 possible rules
- However, about 80% of these rules are discarded after applying minsup = 20% and minconf = 50%, so most of the computation is wasted.
- So it would be useful to prune rules early, without having to compute their support and confidence values.
An initial step toward improving the performance: decouple the support and confidence requirements.
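A quick check of the rule-count formula by direct summation over the C(d, k) and C(d−k, i) choices described above (Python, using math.comb):

```python
from math import comb

def total_rules(d):
    # Sum over left-hand sides of size k and right-hand sides of size i.
    return sum(comb(d, k) * sum(comb(d - k, i) for i in range(1, d - k + 1))
               for k in range(1, d))

d = 6
print(total_rules(d))            # 602
print(3**d - 2**(d + 1) + 1)     # 602, the closed form R = 3^d - 2^(d+1) + 1
```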
12 Basic Observations
Example rules:
{Milk, Diaper} → {Beer} (s=0.4, c=0.67)
{Milk, Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Beer} (s=0.4, c=0.5)
{Milk} → {Diaper, Beer} (s=0.4, c=0.5)
- Observations
- All the rules are binary partitions of the itemset {Milk, Diaper, Beer}
- Rules originating from the same itemset have identical support
- but can have different confidence
- We may therefore decouple the support and confidence requirements
- If the itemset is infrequent, then all six candidate rules can be pruned immediately, without having to compute their confidence values.
13 Mining Association Rules
- Two-step approach
- Frequent itemset generation
- Generate all itemsets whose support ≥ minsup
- these itemsets are called frequent itemsets
- Rule generation
- Generate high-confidence rules from each frequent itemset,
- where each rule is a binary partitioning of a frequent itemset (these rules are called strong rules)
- The computational requirements of frequent itemset generation are generally higher than those of rule generation.
- We focus first on frequent itemset generation.
14 Frequent Itemset Generation
Given d items, there are 2^d possible candidate itemsets.
15 Frequent Itemset Generation
- Brute-force approach
- Each itemset in the lattice is a candidate frequent itemset
- Count the support of each candidate by scanning the database
- Match each transaction against every candidate
- Complexity: O(NMw) ⇒ expensive, since M = 2^d !!!
- where N is the number of transactions, M the number of candidates, and w the maximum transaction width.
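A sketch of this brute-force scheme: every non-empty subset of the d items is a candidate (M = 2^d − 1) and each candidate is matched against every transaction. The six-item universe below is an assumption for illustration, matching the earlier example.

```python
from itertools import combinations

items = ["Beer", "Bread", "Coke", "Diaper", "Eggs", "Milk"]   # assumed item universe, d = 6
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

# Enumerate all 2^6 - 1 = 63 candidate itemsets in the lattice.
candidates = [frozenset(c) for k in range(1, len(items) + 1)
              for c in combinations(items, k)]

# O(N * M * w) work: every candidate is checked against every transaction.
support = {c: sum(1 for t in transactions if c <= t) for c in candidates}
```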
16 Frequent Itemset Generation Strategies
- Reduce the number of candidates (M)
- Complete search: M = 2^d
- Use pruning techniques to reduce M
- The Apriori principle is an effective way to eliminate candidate itemsets without counting their support values.
- Apriori principle
- If an itemset is frequent, then all of its subsets must also be frequent.
17 Reducing the Number of Candidates
- Apriori principle
- If an itemset is frequent, then all of its subsets must also be frequent.
- Conversely,
- if an itemset such as {a, b} is infrequent, then all of its supersets must be infrequent too.
- The Apriori principle holds due to the following property of the support measure:
- the support of an itemset never exceeds the support of its subsets.
- This is known as the anti-monotone property of support.
18 Illustrating the Apriori Principle
19 Illustrating the Apriori Principle
Items (1-itemsets)
Pairs (2-itemsets)
(No need to generate candidates involving Coke or Eggs)
Triplets (3-itemsets)
With the Apriori principle we need to keep only this triplet, because it's the only one whose subsets are all frequent.
Minimum support count = 3
If every itemset up to size 3 is considered: 6C1 + 6C2 + 6C3 = 6 + 15 + 20 = 41 candidates.
With support-based pruning: 6 + 6 + 1 = 13 candidates.
20 Apriori Algorithm
- Method (a sketch follows below)
- Let k = 1
- Generate frequent itemsets of length 1
- Repeat until no new frequent itemsets are identified:
- k = k + 1
- Generate length-k candidate itemsets from the length-(k−1) frequent itemsets
- Prune candidate itemsets that contain an infrequent subset of length k−1
- Count the support of each candidate by scanning the DB, and eliminate candidates that are infrequent
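A compact sketch of this loop (not an optimized implementation). It uses the F(k−1) × F(k−1) candidate generation and subset-based pruning described on the following slides:

```python
from itertools import combinations

def apriori(transactions, minsup_count):
    """Return all frequent itemsets (as frozensets) for a support-count threshold."""
    # Frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Fk = {x for x, n in counts.items() if n >= minsup_count}
    frequent = set(Fk)
    k = 1
    while Fk:
        k += 1
        # Generate length-k candidates: merge (k-1)-itemsets sharing their first k-2 items
        prev = sorted(tuple(sorted(x)) for x in Fk)
        Ck = {frozenset(a) | frozenset(b)
              for a, b in combinations(prev, 2) if a[:-1] == b[:-1]}
        # Prune candidates that contain an infrequent (k-1)-subset
        Ck = {c for c in Ck
              if all(frozenset(s) in Fk for s in combinations(c, k - 1))}
        # Count support with one scan of the DB, keep only the frequent candidates
        counts = {c: sum(1 for t in transactions if c.issubset(t)) for c in Ck}
        Fk = {c for c, n in counts.items() if n >= minsup_count}
        frequent |= Fk
    return frequent
```

Run on the five-transaction example assumed earlier with minsup_count = 3, the loop examines only 6 items, 6 pairs, and 1 triplet, i.e. the 13 candidates counted on the Apriori-illustration slide.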
21 Important Details of Apriori
- How to generate length-k candidates?
- Step 1: self-joining L(k−1)
- Step 2: pruning
- How to count the supports of candidates?
- Example of candidate generation (see the sketch below)
- L3 = {abc, abd, acd, ace, bcd}
- Self-joining: L3 * L3
- abcd from abc and abd
- acde from acd and ace
- Pruning
- acde is removed because ade is not in L3
- C4 = {abcd}
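The same example sketched directly in Python, with itemsets kept as sorted tuples:

```python
from itertools import combinations

L3 = [("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"), ("a", "c", "e"), ("b", "c", "d")]

# Self-join: merge pairs that agree on their first k-2 = 2 items.
joined = {x + (y[-1],) for x, y in combinations(L3, 2) if x[:-1] == y[:-1]}
# joined == {('a','b','c','d'), ('a','c','d','e')}

# Prune: keep only candidates whose 3-subsets are all in L3.
C4 = {c for c in joined if all(s in L3 for s in combinations(c, 3))}
# ('a','c','d','e') is removed because ('a','d','e') is not in L3
print(C4)   # {('a', 'b', 'c', 'd')}
```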
22 Challenges of Frequent Pattern Mining
- Challenges
- Multiple scans of the transaction database
- Huge number of candidate itemsets
- Tedious workload of support counting for candidates
- Improving Apriori: general ideas
- Reduce the number of transaction-database scans
- Shrink the number of candidates
- Facilitate support counting of candidates
23 Candidate generation and pruning
- An effective candidate generation procedure should
- avoid generating too many unnecessary candidates
- (an unnecessary candidate itemset is one in which at least one subset is infrequent),
- ensure that the candidate set is complete
- (no frequent itemsets are left out by the candidate generation procedure), and
- not generate the same candidate itemset more than once.
- E.g., the candidate itemset {a, b, c, d} can be generated in many ways:
- by merging {a, b, c} with {d},
- {c} with {a, b, d}, etc.
24 Brute force
- A brute-force method considers every k-itemset as a potential candidate and then applies the candidate pruning step to remove any unnecessary candidates.
25 F(k−1) × F1 Method
- Extend each frequent (k−1)-itemset with a frequent 1-itemset.
- Complete?
- Yes, because every frequent k-itemset is composed of a frequent (k−1)-itemset and a frequent 1-itemset.
- Problem
- It doesn't prevent the same candidate itemset from being generated more than once.
- E.g., {Bread, Diapers, Milk} can be generated by merging
- {Bread, Diapers} with {Milk},
- {Bread, Milk} with {Diapers}, or
- {Diapers, Milk} with {Bread}.
26 Lexicographic Order
- To avoid generating duplicate candidates,
- ensure that the items in each frequent itemset are kept sorted in lexicographic order.
- Each frequent (k−1)-itemset X is then extended only with frequent items that are lexicographically larger than the items in X.
- For example,
- the itemset {Bread, Diapers} can be augmented with {Milk}, since Milk is lexicographically larger than Bread and Diapers.
- We don't augment {Diapers, Milk} with {Bread} nor {Bread, Milk} with {Diapers}, because they violate the lexicographic ordering condition.
- Is it complete?
27 Lexicographic Order - Completeness
- Complete? Yes.
- Let (i1, ..., ik−1, ik) be a frequent k-itemset sorted in lexicographic order.
- Since it is frequent, by the Apriori principle, (i1, ..., ik−1) and (ik) are frequent as well.
- I.e., (i1, ..., ik−1) ∈ F(k−1) and (ik) ∈ F1.
- Since ik is lexicographically larger than i1, ..., ik−1,
- (i1, ..., ik−1) would be extended with (ik),
- giving (i1, ..., ik−1, ik) as a candidate k-itemset.
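A sketch of the F(k−1) × F1 procedure with the lexicographic restriction. The F1 and F2 values below are just the Bread/Diapers/Milk example, assumed frequent for illustration:

```python
# Assumed frequent itemsets, kept lexicographically sorted.
F2 = [("Bread", "Diapers"), ("Bread", "Milk"), ("Diapers", "Milk")]
F1 = ["Bread", "Diapers", "Milk"]

# Extend each (k-1)-itemset only with frequent items larger than its last (largest) item.
C3 = {x + (item,) for x in F2 for item in F1 if item > x[-1]}
print(C3)   # {('Bread', 'Diapers', 'Milk')} -- generated exactly once
```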
28 Still too many candidates
- E.g., merging {Beer, Diapers} with {Milk} is unnecessary because one of its subsets, {Beer, Milk}, is infrequent.
- Heuristics are available to reduce (prune) the number of unnecessary candidates.
- E.g., for a candidate k-itemset to be worthwhile,
- every item in the candidate must be contained in at least k−1 of the frequent (k−1)-itemsets.
- {Beer, Diapers, Milk} is a viable candidate 3-itemset only if
- every item in the candidate, including Beer, is contained in at least 2 frequent 2-itemsets.
- Since there is only one frequent 2-itemset containing Beer, all candidate itemsets involving Beer must be infrequent.
- Why?
- Because each item of a frequent k-itemset appears in k−1 of its (k−1)-subsets, and all of these subsets must be frequent.
29 F(k−1) × F1
30 F(k−1) × F(k−1) Method
- Merge a pair of frequent (k−1)-itemsets only if their first k−2 items are identical.
- E.g., {Bread, Diapers} and {Bread, Milk} → candidate 3-itemset {Bread, Diapers, Milk}.
- {Beer, Diapers} and {Diapers, Milk} are not merged.
- If {Beer, Diapers, Milk} were a viable candidate, it would have been obtained by merging {Beer, Diapers} with {Beer, Milk} instead.
- An additional candidate pruning step is needed to ensure that
- the remaining k−2 subsets of k−1 elements are frequent.
31 F(k−1) × F(k−1)
32 Example
Min_sup_count = 2
33 Generate C2 from F1 × F1
Min_sup_count = 2
[Figure: F1 and the resulting candidate set C2]
34 Generate C3 from F2 × F2
Min_sup_count = 2
[Figure: F2, the pruned candidate set C3, and F3]
35 Generate C4 from F3 × F3
Min_sup_count = 2
[Figure: F3 and the candidate set C4]
{I1, I2, I3, I5} is pruned because {I2, I3, I5} is infrequent.
36 Support Counting for Candidates
- Scan the database of transactions to determine the support of each candidate itemset.
- Brute force: match each transaction against every candidate.
- Too many comparisons!
- Better method: store the candidate itemsets in a hash structure.
- A transaction will then be matched only against candidates contained in a few buckets.
37 Hash Tree for Storing Candidate Itemsets
- Store the candidate itemsets in a hash structure.
- A transaction will be matched only against candidates contained in a few buckets.
- A hash tree can also be used for candidate generation.
38 Hash Tree for Candidate Itemsets
- Suppose you have 15 candidate itemsets of length 3 to be stored:
- {1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}
- You need
- a hash function (e.g., p mod 3), and
- a max leaf size: the maximum number of itemsets stored in a leaf node (if the number of candidate itemsets in a node exceeds the max leaf size, split the node).
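A minimal sketch of building such a hash tree, assuming the p mod 3 hash function and max leaf size 3 from this slide; the Node class and insert routine are illustrative names, not a standard API:

```python
class Node:
    def __init__(self):
        self.children = {}   # hash value -> child Node (used once this node is split)
        self.itemsets = []   # candidate itemsets stored in a leaf

def insert(node, itemset, depth=0, max_leaf_size=3, k=3):
    if node.children:                      # interior node: hash on the item at this depth
        h = itemset[depth] % 3
        insert(node.children.setdefault(h, Node()), itemset, depth + 1, max_leaf_size, k)
        return
    node.itemsets.append(itemset)          # leaf node: store the candidate
    if len(node.itemsets) > max_leaf_size and depth < k:
        stored, node.itemsets = node.itemsets, []          # split the over-full leaf
        for s in stored:
            h = s[depth] % 3
            insert(node.children.setdefault(h, Node()), s, depth + 1, max_leaf_size, k)

candidates = [(1,4,5), (1,2,4), (4,5,7), (1,2,5), (4,5,8), (1,5,9), (1,3,6),
              (2,3,4), (5,6,7), (3,4,5), (3,5,6), (3,5,7), (6,8,9), (3,6,7), (3,6,8)]
root = Node()
for c in candidates:
    insert(root, c)
# root.children now has three buckets keyed by first-item hash:
# 0 -> items 3, 6, 9;  1 -> items 1, 4, 7;  2 -> items 2, 5, 8
```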
39 Hash Tree for Candidate Itemsets
[Figure: the 15 candidate 3-itemsets hashed on the first item into three buckets]
Split nodes with more than 3 candidates using the second item.
40 Hash Tree for Candidate Itemsets
[Figure: the over-full bucket is split further on the second item]
Now split nodes using the third item.
41 Hash Tree for Candidate Itemsets
[Figure: the hash tree after splitting on the second and third items]
Now, split the remaining over-full node similarly.
42 Hash Tree for Candidate Itemsets
[Figure: the completed hash tree after all over-full nodes have been split]
43 Enumerate All Subsets of a Transaction
Given a (lexicographically ordered) transaction t, say {1, 2, 3, 5, 6}, how do we enumerate all of its subsets of size 3?
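In Python this is a one-liner with itertools.combinations, which emits the size-3 subsets of a sorted transaction in lexicographic order:

```python
from itertools import combinations

t = (1, 2, 3, 5, 6)                 # transaction, already sorted
for subset in combinations(t, 3):   # C(5, 3) = 10 subsets
    print(subset)                   # (1,2,3), (1,2,5), ..., (3,5,6)
```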
44 Matching a Transaction Against Candidates
[Figure: the transaction is hashed down the tree to locate candidate leaves]
45 Matching a Transaction Against Candidates
[Figure: hash function buckets 1,4,7 / 2,5,8 / 3,6,9 route the transaction's items down the tree]
46 Matching a Transaction Against Candidates
[Figure: leaves visited while matching the transaction]
The transaction is matched against only 7 out of the 15 candidates.
47 Trie for Storing Candidate Itemsets
Suppose you have 5 candidate itemsets of length 3: {A,C,D}, {A,E,G}, {A,E,L}, {A,E,M}, {K,M,N}.
- To match a transaction t against the candidates,
- we take all ordered k-subsets X of t and
- search for them in the trie structure.
- If X is found (as a candidate), the support count of this candidate is incremented by 1.
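A small sketch of this idea using nested dictionaries as the trie; the layout and the "#count" marker are illustrative choices, not a fixed format:

```python
def build_trie(candidates):
    root = {}
    for itemset in candidates:
        node = root
        for item in itemset:                # one trie level per item, in sorted order
            node = node.setdefault(item, {})
        node["#count"] = 0                  # marks the end of a candidate, holds its support count
    return root

def count_if_candidate(trie, subset):
    node = trie
    for item in subset:
        if item not in node:
            return False                    # this ordered subset is not a candidate
        node = node[item]
    if "#count" in node:
        node["#count"] += 1                 # subset found as a candidate: increment its support
        return True
    return False

trie = build_trie([("A","C","D"), ("A","E","G"), ("A","E","L"), ("A","E","M"), ("K","M","N")])
print(count_if_candidate(trie, ("A","E","G")))   # True
print(count_if_candidate(trie, ("A","C","G")))   # False
```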
48 Tries Can Store Frequent Itemsets Too
- Candidate generation becomes easy and fast:
- we can generate candidates from pairs of nodes that have the same parent
- (except for the last item, the two itemsets are the same).
- Association rules are produced much faster,
- since retrieving the support of an itemset is quicker.
- Remember that the trie was originally developed to quickly decide whether a word is included in a dictionary.