Title: Association Rule Mining
1. Association Rule Mining
- Instructor: Qiang Yang
- Thanks to Jiawei Han and Jian Pei
2. What Is Frequent Pattern Mining?
- Frequent pattern: a pattern (a set of items, a sequence, etc.) that occurs frequently in a database [AIS93]
- Frequent pattern mining: finding regularities in data
  - What products were often purchased together? Beer and diapers?!
  - What are the subsequent purchases after buying a PC?
  - What kinds of DNA are sensitive to this new drug?
  - Can we automatically classify web documents?
3. Why Is Frequent Pattern Mining an Essential Task in Data Mining?
- Foundation for many essential data mining tasks
  - Association, correlation, causality
  - Sequential patterns, temporal or cyclic association, partial periodicity, spatial and multimedia association
  - Associative classification, cluster analysis, iceberg cube, fascicles (semantic data compression)
- Broad applications
  - Basket data analysis, cross-marketing, catalog design, sales campaign analysis
  - Web log (click stream) analysis, DNA sequence analysis, etc.
4. Basic Concepts: Frequent Patterns and Association Rules
- Itemset X = {x1, …, xk}
- Find all the rules X → Y with minimum confidence and support
  - support, s: probability that a transaction contains X ∪ Y
  - confidence, c: conditional probability that a transaction having X also contains Y
Transaction-id Items bought
10 A, B, C
20 A, C
30 A, D
40 B, E, F
Let min_support = 50%, min_conf = 50%:
A → C (50%, 66.7%)
C → A (50%, 100%)
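As a sanity check on these numbers, the two rules can be verified with a few lines of Python (a minimal sketch; the `transactions` encoding and the helper names `support` and `confidence` are mine, not from the slides):

```python
# Support and confidence over the slide's four-transaction example.
transactions = [
    {"A", "B", "C"},   # Tid 10
    {"A", "C"},        # Tid 20
    {"A", "D"},        # Tid 30
    {"B", "E", "F"},   # Tid 40
]

def support(itemset, db):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in db) / len(db)

def confidence(lhs, rhs, db):
    """Pr(rhs | lhs) = support(lhs ∪ rhs) / support(lhs)."""
    return support(lhs | rhs, db) / support(lhs, db)

print(support({"A", "C"}, transactions))       # 0.5: A → C has 50% support
print(confidence({"A"}, {"C"}, transactions))  # 0.666...: 66.7% confidence
print(confidence({"C"}, {"A"}, transactions))  # 1.0: 100% confidence
```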
5. Concept: Frequent Itemsets
Outlook Temperature Humidity Play
sunny hot high no
sunny hot high no
overcast hot high yes
rainy mild high yes
rainy cool normal yes
rainy cool normal no
overcast cool normal yes
sunny mild high no
sunny cool normal yes
rainy mild normal yes
sunny mild normal yes
overcast mild high yes
overcast hot normal yes
rainy mild high no
- Minimum support = 2
  - {sunny, hot, no}
  - {sunny, hot, high, no}
  - {rainy, normal}
- Minimum support = 3
  - ?
- How strong is {sunny, no}?
  - Count
  - Percentage
6. Concept: Itemset → Rules
- {sunny, hot, no}: Outlook=sunny, Temp=hot, Play=no
- Generate a rule:
  - Outlook=sunny and Temp=hot → Play=no
- How strong is this rule?
  - Support of the rule
    - = support of the itemset {sunny, hot, no} = 2 = Pr(sunny, hot, no)
    - Either expressed in count form or percentage form
  - Confidence = Pr(Play=no | Outlook=sunny, Temp=hot)
- In general, for LHS → RHS, Confidence = Pr(RHS | LHS)
- Confidence
  - = Pr(RHS | LHS)
  - = count(LHS and RHS) / count(LHS)
- What is the confidence of Outlook=sunny → Play=no?
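The closing question can be answered by direct counting over the weather table (a sketch; the attribute=value item encoding is my own): Outlook=sunny occurs in 5 rows, 3 of which have Play=no, so the confidence is 3/5 = 60%.

```python
# Confidence of Outlook=sunny → Play=no over the 14-row weather table.
rows = [
    {"Outlook=sunny", "Temp=hot", "Humidity=high", "Play=no"},
    {"Outlook=sunny", "Temp=hot", "Humidity=high", "Play=no"},
    {"Outlook=overcast", "Temp=hot", "Humidity=high", "Play=yes"},
    {"Outlook=rainy", "Temp=mild", "Humidity=high", "Play=yes"},
    {"Outlook=rainy", "Temp=cool", "Humidity=normal", "Play=yes"},
    {"Outlook=rainy", "Temp=cool", "Humidity=normal", "Play=no"},
    {"Outlook=overcast", "Temp=cool", "Humidity=normal", "Play=yes"},
    {"Outlook=sunny", "Temp=mild", "Humidity=high", "Play=no"},
    {"Outlook=sunny", "Temp=cool", "Humidity=normal", "Play=yes"},
    {"Outlook=rainy", "Temp=mild", "Humidity=normal", "Play=yes"},
    {"Outlook=sunny", "Temp=mild", "Humidity=normal", "Play=yes"},
    {"Outlook=overcast", "Temp=mild", "Humidity=high", "Play=yes"},
    {"Outlook=overcast", "Temp=hot", "Humidity=normal", "Play=yes"},
    {"Outlook=rainy", "Temp=mild", "Humidity=high", "Play=no"},
]

def count(itemset):
    """Number of rows containing every item in `itemset`."""
    return sum(itemset <= r for r in rows)

lhs, rhs = {"Outlook=sunny"}, {"Play=no"}
print(count(lhs))                     # 5
print(count(lhs | rhs) / count(lhs))  # 0.6
```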
7. 6.1.3 Types of Association Rules
- Quantitative
  - Age(X, 30..39) and income(X, 42K..48K) → buys(X, TV)
- Single vs. multiple dimensions
  - buys(X, computer) → buys(X, financial software)
  - Multi-dimensional: the quantitative example above
- Levels of abstraction
  - Age(X, ..) → buys(X, laptop computer)
  - Age(X, ..) → buys(X, computer)
- Extensions
  - Max pattern
  - Closed itemset
8. Frequent Patterns
- Patterns = itemsets
  - {i1, i2, …, in}, where each item is a pair (attribute = value)
- Frequent patterns
  - Itemsets whose support > minimum support
- Support
  - = count(itemset) / count(database)
9. Max-patterns
- Max-pattern: a frequent pattern without a proper frequent super-pattern
- BCDE and ACD are max-patterns
- BCD is not a max-pattern
Tid Items
10 A, B, C, D, E
20 B, C, D, E
30 A, C, D, F
Min_sup = 2
10. Frequent Max Patterns
- Succinct expression of frequent patterns
  - Let {a, b, c} be frequent
  - Then {a, b}, {b, c}, {a, c} must also be frequent
  - Then {a}, {b}, {c} must also be frequent
  - By writing down {a, b, c} once, we save a lot of computation
- Max pattern
  - If {a, b, c} is a frequent max pattern, then {a, b, c, x} is NOT a frequent pattern, for any other item x.
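The definition can be checked mechanically on the three-transaction database from the previous slide, with min_sup = 2 (a brute-force sketch over all itemsets; the helper names are mine):

```python
from itertools import combinations

# Verify the max-patterns of the slide-9 example database.
db = [set("ABCDE"), set("BCDE"), set("ACDF")]
min_sup = 2
items = sorted(set().union(*db))

def sup(itemset):
    """Number of transactions containing the itemset."""
    return sum(itemset <= t for t in db)

# Enumerate every frequent itemset by brute force (6 items, 63 subsets).
frequent = {frozenset(c)
            for k in range(1, len(items) + 1)
            for c in combinations(items, k)
            if sup(set(c)) >= min_sup}

def is_max(p):
    """Frequent, and no frequent proper superset exists."""
    return p in frequent and not any(p < q for q in frequent)

print(is_max(frozenset("BCDE")))  # True
print(is_max(frozenset("ACD")))   # True
print(is_max(frozenset("BCD")))   # False: BCDE is a frequent superset
```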
11. Find Frequent Max Patterns
Outlook Temperature Humidity Play
sunny hot high no
sunny hot high no
overcast hot high yes
rainy mild high yes
rainy cool normal yes
rainy cool normal no
overcast cool normal yes
sunny mild high no
sunny cool normal yes
rainy mild normal yes
sunny mild normal yes
overcast mild high yes
overcast hot normal yes
rainy mild high no
- Minimum support = 2
- Is {sunny, hot, no} a frequent max pattern??
12. Closed Patterns
- A closed itemset X has no proper superset X′ such that every transaction containing X also contains X′
- {a, b}, {a, b, d}, {a, b, c} are frequent closed patterns
  - But {a, b} is not a max pattern
- Concise representation of frequent patterns
  - Reduces the number of patterns and rules
- N. Pasquier et al., ICDT'99
Min_sup = 2
TID Items
10 a, b, c
20 a, b, c
30 a, b, d
40 a, b, d
50 c, e, f
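Closedness is easy to test on this database: an itemset is closed iff adding any single extra item strictly drops its support (a sketch with my own helper names, min_sup = 2):

```python
# Check closed patterns on the slide-12 five-transaction database.
db = [set("abc"), set("abc"), set("abd"), set("abd"), set("cef")]

def sup(itemset):
    """Number of transactions containing the itemset."""
    return sum(itemset <= t for t in db)

def is_closed(p):
    """Closed iff every one-item extension has strictly smaller support.
    (If some proper superset had equal support, a one-item extension would too.)"""
    extra = set().union(*db) - p
    return all(sup(p | {x}) < sup(p) for x in extra)

print(is_closed(set("ab")))   # True: support 4, every extension drops it
print(is_closed(set("abc")))  # True
print(is_closed(set("abd")))  # True
print(is_closed(set("a")))    # False: {a, b} has the same support (4)
```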
13. Mining Association Rules: an Example
Min. support = 50%, Min. confidence = 50%
Transaction-id Items bought
10 A, B, C
20 A, C
30 A, D
40 B, E, F
Frequent pattern Support
A 75%
B 50%
C 50%
A, C 50%
- For rule A → C:
  - support = support({A, C}) = 50%
  - confidence = support({A, C}) / support(A) = 66.6%
14. Method 1: Apriori, a Candidate Generation-and-Test Approach
- Any subset of a frequent itemset must be frequent
  - If {beer, diaper, nuts} is frequent, so is {beer, diaper}
  - Every transaction having {beer, diaper, nuts} also contains {beer, diaper}
- Apriori pruning principle: if there is any itemset which is infrequent, its supersets should not be generated/tested!
- Method
  - Generate length-(k+1) candidate itemsets from length-k frequent itemsets, and
  - test the candidates against the DB
- Performance studies show its efficiency and scalability
- Agrawal & Srikant 1994; Mannila et al. 1994
15. The Apriori Algorithm: An Example

Database TDB:
Tid Items
10 A, C, D
20 B, C, E
30 A, B, C, E
40 B, E

1st scan → C1:
Itemset sup
{A} 2
{B} 3
{C} 3
{D} 1
{E} 3

L1 (min_sup = 2):
Itemset sup
{A} 2
{B} 3
{C} 3
{E} 3

C2 (generated from L1):
{A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}

2nd scan → C2 with counts:
Itemset sup
{A, B} 1
{A, C} 2
{A, E} 1
{B, C} 2
{B, E} 3
{C, E} 2

L2:
Itemset sup
{A, C} 2
{B, C} 2
{B, E} 3
{C, E} 2

C3: {B, C, E}

3rd scan → L3:
Itemset sup
{B, C, E} 2
16. The Apriori Algorithm
- Pseudo-code:
  - Ck: candidate itemsets of size k
  - Lk: frequent itemsets of size k
  - L1 = {frequent items}
  - for (k = 1; Lk ≠ ∅; k++) do begin
    - Ck+1 = candidates generated from Lk
    - for each transaction t in database do
      - increment the count of all candidates in Ck+1 that are contained in t
    - Lk+1 = candidates in Ck+1 with min_support
  - end
  - return ∪k Lk
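The pseudo-code above can be sketched as one compact Python function (names are mine; for brevity, candidate generation here takes unions of pairs from Lk rather than the ordered self-join shown on a later slide, then applies the same subset-pruning):

```python
from itertools import combinations

def apriori(db, min_sup):
    """Return all frequent itemsets of `db` (list of sets) as frozensets."""
    db = [frozenset(t) for t in db]
    items = sorted(set().union(*db))
    # L1: frequent 1-itemsets
    Lk = {frozenset([i]) for i in items
          if sum(i in t for t in db) >= min_sup}
    all_frequent = set(Lk)
    while Lk:
        k = len(next(iter(Lk)))
        # Join step: unions of pairs in Lk that yield (k+1)-itemsets.
        Ck1 = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        # Prune step: every k-subset of a candidate must be in Lk.
        Ck1 = {c for c in Ck1
               if all(frozenset(s) in Lk for s in combinations(c, k))}
        # Count supports in one pass over the database.
        Lk = {c for c in Ck1 if sum(c <= t for t in db) >= min_sup}
        all_frequent |= Lk
    return all_frequent

# The slide-15 example database, min_sup = 2.
db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
result = apriori(db, min_sup=2)
print(frozenset({"B", "C", "E"}) in result)  # True: L3 from the example
print(frozenset({"A", "B"}) in result)       # False: support is only 1
```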
17. Important Details of Apriori
- How to generate candidates?
  - Step 1: self-joining Lk
  - Step 2: pruning
- How to count supports of candidates?
18. Example of Candidate Generation
- L3 = {abc, abd, acd, ace, bcd}
- Self-joining: L3 * L3
  - abcd from abc and abd
  - acde from acd and ace
- Pruning:
  - acde is removed because ade is not in L3
- C4 = {abcd}
19. How to Generate Candidates?
- Suppose the items in Lk-1 are listed in an order
- Step 1: self-joining Lk-1
  - insert into Ck
  - select p.item1, p.item2, …, p.itemk-1, q.itemk-1
  - from Lk-1 p, Lk-1 q
  - where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
- Step 2: pruning
  - forall itemsets c in Ck do
    - forall (k-1)-subsets s of c do
      - if (s is not in Lk-1) then delete c from Ck
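The ordered self-join and prune steps above can be sketched in Python with itemsets kept as sorted tuples, so the "equal prefixes, smaller last item" join condition applies directly (the function name is mine); on the slide-18 example it yields C4 = {abcd}:

```python
from itertools import combinations

def gen_candidates(Lk):
    """Self-join Lk (sorted k-tuples) and prune candidates with an
    infrequent (k)-subset, returning C(k+1)."""
    k = len(Lk[0])
    Lset = set(Lk)
    Ck1 = []
    for p in Lk:
        for q in Lk:
            # Self-join: equal first k-1 items, p's last item < q's last item.
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                c = p + (q[-1],)
                # Prune: every k-subset of c must be in Lk.
                if all(s in Lset for s in combinations(c, k)):
                    Ck1.append(c)
    return Ck1

L3 = [("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"),
      ("a", "c", "e"), ("b", "c", "d")]
print(gen_candidates(L3))  # [('a', 'b', 'c', 'd')]; acde pruned (ade not in L3)
```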
20. How to Count Supports of Candidates?
- Why is counting supports of candidates a problem?
  - The total number of candidates can be very large
  - One transaction may contain many candidates
- Method
  - Candidate itemsets are stored in a hash-tree
  - A leaf node of the hash-tree contains a list of itemsets and counts
  - An interior node contains a hash table
  - A subset function finds all the candidates contained in a transaction
21. Speeding Up Association Rules
- Dynamic Hashing and Pruning technique
Thanks to Cheng Hong and Hu Haibo
22. DHP: Reduce the Number of Candidates
- A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent
  - Candidates: a, b, c, d, e
  - Hash entries: {ab, ad, ae}, {bd, be, de}
  - Frequent 1-itemsets: a, b, d, e
  - ab is not a candidate 2-itemset if the sum of the counts of ab, ad, ae is below the support threshold
- J. Park, M. Chen, and P. Yu. An effective hash-based algorithm for mining association rules. SIGMOD'95
23. Still Challenging: the Niche for DHP
- DHP (Park '95): Dynamic Hashing and Pruning
- Candidate large 2-itemsets are huge
  - DHP: trims them using hashing
- The transaction database is so huge that one scan per iteration is costly
  - DHP: prunes both the number of transactions and the number of items in each transaction after each iteration
24. What Does It Look Like?
- Apriori: generate candidate set → count support
- DHP: generate candidate set → count support → make new hash table
25. Hash Table Construction
- Consider 2-itemsets; all items are numbered i1, i2, …, in. Any pair {x, y} is hashed according to:
- Hash function (bucket):
  - h({x, y}) = ((order of x) × 10 + (order of y)) mod 7
- Example
  - Items A, B, C, D, E with orders 1, 2, 3, 4, 5
  - h({C, E}) = (3 × 10 + 5) mod 7 = 0
  - Thus, {C, E} belongs to bucket 0.
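The hash function is a one-liner (a sketch; the `order` dictionary encodes the item numbering from the slide):

```python
# The slide's bucket hash: items are numbered 1..5.
order = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5}

def h(x, y):
    """Bucket index of the pair {x, y}."""
    return (order[x] * 10 + order[y]) % 7

print(h("C", "E"))  # 0: (30 + 5) % 7, so {C, E} lands in bucket 0
```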
26. How to Trim Candidate Itemsets
- In the k-th iteration, hash all appearing (k+1)-itemsets into a hash table, counting all the occurrences of an itemset in the corresponding bucket.
- In the (k+1)-th iteration, examine each candidate itemset to see if its corresponding bucket count is above the support threshold (a necessary condition).
27. Example
TID Items
100 A C D
200 B C E
300 A B C E
400 B E
Figure 1. An example transaction database
28. Generation of C1 and L1 (1st iteration)
C1:
Itemset Sup
A 2
B 3
C 3
D 1
E 3
L1:
Itemset Sup
A 2
B 3
C 3
E 3
29. Hash Table Construction
- Find all 2-itemsets of each transaction
TID 2-itemsets
100 {A C} {A D} {C D}
200 {B C} {B E} {C E}
300 {A B} {A C} {A E} {B C} {B E} {C E}
400 {B E}
30. Hash Table Construction (2)
- Hash function: h({x, y}) = ((order of x) × 10 + (order of y)) mod 7
- Hash table (bucket: contents, count):
  - bucket 0: {A D}, {C E}, {C E} (count 3)
  - bucket 1: {A E} (count 1)
  - bucket 2: {B C}, {B C} (count 2)
  - bucket 3: empty (count 0)
  - bucket 4: {B E}, {B E}, {B E} (count 3)
  - bucket 5: {A B} (count 1)
  - bucket 6: {A C}, {C D}, {A C} (count 3)
31. C2 Generation (2nd iteration)
C2 of Apriori:
{A B}, {A C}, {A E}, {B C}, {B E}, {C E}
Bucket counts for L1 * L1:
{A B}: 1
{A C}: 3
{A E}: 1
{B C}: 2
{B E}: 3
{C E}: 3
Resulting C2 (DHP):
{A C}, {B C}, {B E}, {C E}
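Slides 29-31 can be tied together in one sketch: hash every 2-itemset of every transaction into the 7 buckets, then keep only the candidate pairs from L1 * L1 whose bucket count reaches the minimum support of 2 (variable names are mine):

```python
from itertools import combinations

# The Figure-1 database and the slide's item numbering.
db = [set("ACD"), set("BCE"), set("ABCE"), set("BE")]
order = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5}

def h(pair):
    """Bucket index of a 2-itemset."""
    x, y = sorted(pair)
    return (order[x] * 10 + order[y]) % 7

# Count every 2-itemset occurrence per bucket.
buckets = [0] * 7
for t in db:
    for pair in combinations(sorted(t), 2):
        buckets[h(pair)] += 1
print(buckets)  # [3, 1, 2, 0, 3, 1, 3], as in the hash-table slide

# DHP's C2: candidate pairs whose bucket count meets min support (2).
L1 = ["A", "B", "C", "E"]  # frequent 1-items (D pruned)
C2 = [p for p in combinations(L1, 2) if buckets[h(p)] >= 2]
print(C2)  # [('A', 'C'), ('B', 'C'), ('B', 'E'), ('C', 'E')]
```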
32. Effective Database Pruning
- Apriori
  - Does not prune the database
  - Prunes Ck by support counting on the original database
- DHP
  - More efficient support counting can be achieved on the pruned database
33. Performance Comparison
34. Performance Comparison (2)
35. Conclusion
- Effective hash-based algorithm for candidate itemset generation
- Two-phase transaction database pruning
- Much more efficient (in time and space) than the Apriori algorithm