Title: Mining Association Rules
1. Mining Association Rules
- Association Rules
- Apriori Algorithm
- AprioriTid and AprioriHybrid Algorithms
- Introduction to generalized association rules
- Director: Professor Walid Aref
- By: Bin Li
- 01/29/03
2. Data Mining - Knowledge Discovery in Databases (KDD)
- Databases to be mined: relational, transactional, object-oriented, spatial, text, multimedia, WWW, etc.
- Technical approaches: associations, sequential patterns, classifiers, clustering, etc.
- Techniques utilized: database-oriented, data warehouse (OLAP), machine learning, statistics, neural networks, etc.
- Applications: retail, telecommunication, banking, fraud analysis, molecular biology, DNA mining, stock market analysis, Web mining, astronomy, Web log analysis, etc.
3. General data mining technical approaches
- Associations: a customer buying item X often also buys item Y (X => Y). Example: 90% of transactions that purchase bread and butter also purchase milk, i.e., {bread, butter} => milk.
- Sequential patterns: discover the set of purchases that frequently precede the purchase of a microwave oven.
- Classifiers: in credit card analysis, a customer record may be classified with a Good, Medium, or Poor tag based on the customer's credit history.
- Clustering: group similar items together. It is like having points in a space and partitioning them so that points that are close to each other fall into the same partition.
4. Definitions
Let I = {i1, i2, ..., im} be a set of ordered literals, called items. Let D be a set of transactions, where each transaction T has a unique identifier TID and is a set of items such that T ⊆ I. If a set of items X satisfies X ⊆ T, we say T contains X. An association rule is an implication of the form X => Y, where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅. The rule X => Y holds in the transaction set D with confidence c if c% of the transactions in D that contain X also contain Y. The rule X => Y has support s in the transaction set D if s% of the transactions in D contain X ∪ Y.
5. Samples
Database: a small transaction table (shown on the slide).
- Items: {1, 2, 3, 4, 5, 6}
- One possible association rule: {1} => {3}
- Confidence: percentage of the transactions that contain 1 which also contain 3 - here 66%
- Support: percentage of the transactions that contain both 1 and 3 - here 50%
In statistical terms, for A => B: Support = P(A ∪ B) and Confidence = P(B|A), where P denotes probability.
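To make the two measures concrete, here is a minimal Python sketch. The toy database below is an assumption (it is not the table from the slide), chosen only so that the numbers come out to the 50% support and 66% confidence quoted above.

# Assumed toy database over items {1,...,6}; not the slide's actual table.
transactions = [
    {1, 3, 6},
    {1, 3, 5},
    {1, 2, 4},
    {2, 5, 6},
]

def support(itemset, transactions):
    # Fraction of transactions that contain every item of `itemset`.
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(lhs, rhs, transactions):
    # P(rhs | lhs) = support(lhs ∪ rhs) / support(lhs).
    return support(lhs | rhs, transactions) / support(lhs, transactions)

print(support({1, 3}, transactions))       # 0.5    -> 50% support for {1} => {3}
print(confidence({1}, {3}, transactions))  # 0.666  -> 66% confidence for {1} => {3}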
6. Association Rules
Given:
- A set of transactions; each transaction is a set of items
- A user-specified minimum support
- A user-specified minimum confidence
Find:
- All association rules whose support and confidence are greater than the minimum support and minimum confidence.
7. Problem Decomposition
Step 1. Find all frequent itemsets: sets of items whose support is greater than the user-specified minimum support.
Step 2. Generate association rules from the frequent itemsets: keep the rules whose confidence is greater than the user-specified minimum confidence.
Step 2 is quite straightforward: for every large (frequent) itemset l, all subsets of l are considered as rule antecedents, and there is little room for optimization (a sketch follows below). => We focus on Step 1.
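Since Step 2 is only described in prose, here is a hedged Python sketch of rule generation from a single frequent itemset; the function name and the toy support counts are illustrative assumptions, not code from the papers.

from itertools import combinations

def generate_rules(freq_itemset, support_count, min_conf):
    # For one frequent itemset, emit every rule X => (itemset - X) whose
    # confidence = support(itemset) / support(X) reaches min_conf.
    # `support_count` maps frozensets to their support counts.
    rules = []
    whole = frozenset(freq_itemset)
    items = sorted(whole)
    for r in range(1, len(items)):                 # all non-empty proper subsets
        for lhs in combinations(items, r):
            lhs = frozenset(lhs)
            conf = support_count[whole] / support_count[lhs]
            if conf >= min_conf:
                rules.append((set(lhs), set(whole - lhs), conf))
    return rules

# Illustrative (assumed) support counts for {1, 3} and its subsets.
counts = {frozenset({1}): 3, frozenset({3}): 3, frozenset({1, 3}): 2}
print(generate_rules({1, 3}, counts, min_conf=0.6))
# -> [({1}, {3}, 0.666...), ({3}, {1}, 0.666...)]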
8. Direct algorithm
Ck = all k-itemsets; Lk = frequent k-itemsets
L1 = {frequent 1-itemsets};
for (k = 1; Lk != Ø; k++) do begin
    Ck+1 = all (k+1)-itemsets;   // notice!
    for each transaction t in the database do
        Increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with minimum support;
end
Answer = ∪k Lk
(For example, for items {1,2,3,4,5}:
 C1 = {{1},{2},{3},{4},{5}},
 C2 = {{1,2},{1,3},{1,4},{1,5}, ...},
 C3 = {{1,2,3},{1,2,4},{1,2,5}, ...})
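To make the cost of this direct approach concrete, here is a small Python sketch (an illustration assumed for these slides, not code from the paper) that enumerates all 2^n - 1 non-empty itemsets and counts their support against the transactions.

from itertools import combinations

def brute_force_frequent_itemsets(transactions, items, min_support):
    # Enumerate every non-empty itemset over `items` (2^n - 1 of them) and
    # keep those whose support (as a fraction) reaches min_support.
    n = len(transactions)
    frequent = {}
    for k in range(1, len(items) + 1):             # candidate sizes 1..n
        for cand in combinations(sorted(items), k):
            cand = frozenset(cand)
            count = sum(1 for t in transactions if cand <= t)
            if count / n >= min_support:
                frequent[cand] = count
    return frequent

transactions = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
print(brute_force_frequent_itemsets(transactions, {1, 2, 3, 4, 5}, 0.5))
# With 5 items this checks 2^5 - 1 = 31 candidates; with 1000 items it would
# have to check 2^1000 - 1 of them, which is why the direct approach is not feasible.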
9. Sample
Items: {1, 2, 3, 4, 5}; minimum support: 50%
Answer = L1 ∪ L2 ∪ L3
10. How many itemsets must be verified?
For 4 items {1, 2, 3, 4}:
{1}, {2}, {3}, {4}, {1,2}, {1,3}, {1,4}, {2,3}, {2,4}, {3,4}, {1,2,3}, {1,2,4}, {1,3,4}, {2,3,4}, {1,2,3,4}
Total: 15 = 2^4 - 1 itemsets
For 5 items {1, 2, 3, 4, 5}:
{1}, {2}, {3}, {4}, {5}, {1,2}, {1,3}, {1,4}, {1,5}, {2,3}, {2,4}, {2,5}, ...
Total: 31 = 2^5 - 1 itemsets
How many itemsets for 1000 items? 2^1000 - 1!!! Correct, but not feasible!
11. Enhancement
- If {A} and {B} are frequent, we must count {A,B} to determine whether it is frequent. But if {A,B} is determined not to be frequent, it is unnecessary to count {A,B,C}, {A,B,D}, {A,B,C,D}, etc.
- Every subset of a frequent itemset is also frequent.
- This is the idea behind the Apriori algorithm!
12. Apriori Algorithm
Ck = candidate itemsets of size k; Lk = frequent itemsets of size k
L1 = {frequent 1-itemsets};
for (k = 1; Lk != Ø; k++) do begin
    Ck+1 = Apriori-gen(Lk);   // the only difference!
    for each transaction t in the database do
        Increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with minimum support;
end
Answer = ∪k Lk
Apriori-gen?
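The counting step of each pass can be sketched in Python as follows. This is a simplified illustration of the pseudocode above (Apriori-gen itself is sketched after the next slide), and the toy data is assumed.

def count_and_filter(candidates, transactions, min_support):
    # One Apriori pass: count how many transactions contain each candidate
    # in C(k+1), then keep the candidates that reach the minimum support.
    counts = {frozenset(c): 0 for c in candidates}
    for t in transactions:                   # single scan of the database
        for c in counts:
            if c <= t:                       # candidate contained in t
                counts[c] += 1
    n = len(transactions)
    return {c: cnt for c, cnt in counts.items() if cnt / n >= min_support}

# Illustrative C2 over the same assumed toy database as before.
transactions = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
C2 = [{1, 2}, {1, 3}, {1, 5}, {2, 3}, {2, 5}, {3, 5}]
print(count_and_filter(C2, transactions, 0.5))
# Keeps {1,3}, {2,3}, {2,5}, {3,5} with counts 2, 2, 3, 2.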
13. Apriori-gen(Lk)
- Step 1: Self-join
  insert into Ck+1
  select p.item1, p.item2, ..., p.itemk, q.itemk
  from Lk p, Lk q
  where p.item1 = q.item1, ..., p.itemk-1 = q.itemk-1, p.itemk < q.itemk
- Step 2: Prune
  forall itemsets c in Ck+1 do
    forall k-subsets s of c do
      if (s is not in Lk) then delete c from Ck+1
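A minimal Python sketch of the self-join and prune steps, assuming Lk is given as a collection of k-itemsets; it mirrors the SQL-style pseudocode above but is not the paper's implementation.

from itertools import combinations

def apriori_gen(Lk, k):
    # Generate candidate (k+1)-itemsets from the frequent k-itemsets Lk.
    Lk = set(map(frozenset, Lk))
    ordered = [tuple(sorted(s)) for s in Lk]
    # Step 1: self-join - merge two k-itemsets that share their first k-1 items.
    candidates = set()
    for p in ordered:
        for q in ordered:
            if p[:k - 1] == q[:k - 1] and p[k - 1] < q[k - 1]:
                candidates.add(frozenset(p + (q[k - 1],)))
    # Step 2: prune - drop candidates that have an infrequent k-subset.
    return {c for c in candidates
            if all(frozenset(s) in Lk for s in combinations(c, k))}

# The sample from the next slide: L2 = {{1,2},{1,4},{2,3},{2,5},{3,5}}.
L2 = [{1, 2}, {1, 4}, {2, 3}, {2, 5}, {3, 5}]
print(apriori_gen(L2, 2))
# -> {frozenset({2, 3, 5})}; {1,2,4} is pruned because {2,4} is not in L2.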
14. Sample
- Apriori-gen(L2 = {{1,2}, {1,4}, {2,3}, {2,5}, {3,5}})
- 1) Self-join L2 * L2:
  - {1,2} and {1,4} => {1,2,4}
  - {2,3} and {2,5} => {2,3,5}
- 2) Prune:
  - {2,4} is not in L2, so delete {1,2,4}
- C3 = {{2,3,5}}
15. Sample
Answer = L1 ∪ L2 ∪ L3
16. Apriori optimizations: AprioriTid and AprioriHybrid
- AprioriTid algorithm
- Apriori runs Apriori-gen and then counts the candidates against each transaction in the database on every pass.
- AprioriTid instead regards the database as a set of 1-itemsets and, on each pass, rebuilds and uses a transformed database of (k+1)-itemsets derived from the k-itemsets (instead of rescanning the original database), discarding itemsets that cannot be part of any large itemset.
- Example of building 2-itemsets from 1-itemsets (see the sketch after this list):
  - {2,3,5} (original database) => {{2,3},{2,5},{3,5}} (new database), when {2}, {3}, and {5} are determined to be large itemsets.
  - {1,3,4} => {{1,3}}; item 4 is discarded because it cannot be in any large itemset, when only {1} and {3} are determined to be large itemsets.
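A hedged Python sketch of the database transformation described in the example above. The data structures are illustrative simplifications; the actual AprioriTid algorithm in the VLDB 1994 paper stores candidate IDs per TID rather than raw itemsets.

from itertools import combinations

def transform_database(encoded_db, Lk, k):
    # Rebuild the encoded database for the next pass: for each transaction,
    # build the (k+1)-itemsets supported by its frequent k-itemsets and drop
    # transactions that become empty.
    Lk = set(map(frozenset, Lk))
    new_db = []
    for entries in encoded_db:        # entries: the k-itemsets present in one transaction
        present = {s for s in entries if s in Lk}          # keep frequent k-itemsets only
        items = sorted(set().union(*present)) if present else []
        bigger = {frozenset(c) for c in combinations(items, k + 1)
                  if all(frozenset(s) in Lk for s in combinations(c, k))}
        if bigger:
            new_db.append(bigger)
    return new_db

# Pass from 1-itemsets to 2-itemsets, mirroring the example above.
L1 = [{1}, {2}, {3}, {5}]
encoded = [{frozenset({i}) for i in t} for t in ({2, 3, 5}, {1, 3, 4})]
print(transform_database(encoded, L1, 1))
# The first transaction becomes {{2,3},{2,5},{3,5}}; the second becomes {{1,3}} (item 4 dropped).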
17. AprioriTid vs. Apriori
Per-pass execution times of Apriori and AprioriTid (T10.I4.D100K, minimum support 0.75%)
18. AprioriHybrid algorithm: a mixture of Apriori and AprioriTid
- From the comparison of Apriori and AprioriTid, we can see that Apriori does better than AprioriTid in the earlier passes, while AprioriTid beats Apriori in the later passes. The hybrid algorithm AprioriHybrid therefore uses Apriori in the initial passes and switches to AprioriTid in the later passes.
19. Mining generalized association rules - an introduction via three concepts
Concept 1: Taxonomy
Items: Jackets, Shirts, Shoes; minimum support: 15%
Possible results:
- Jackets => Shoes (10%): does not hold.
- Shirts => Shoes (10%): does not hold.
When we use the taxonomy "Jackets is-a Clothes, Shirts is-a Clothes" and add Clothes to the items, one possible result is:
- Clothes => Shoes (18%): holds! Why not 20%? (Transactions that contain both jackets and shirts are counted only once for Clothes.) Is it valuable? Sure!
20. What's more?
- "Jackets is-a Clothes, Shirts is-a Clothes" is a taxonomy over product categories. Items could also be classified by price, brand, product group, etc.
- More interesting and useful association rules can be found this way!
21. Concept 2: Interestingness
- Items: Jackets, Shirts, Shoes, Clothes
- Minimum support: 10%
- Two possible results:
  - Shirts => Shoes (10%): holds.
  - Clothes => Shoes (20%): holds.
- When we know:
  - Support(Shirts)
  - Support(Clothes)
- Are both of these results interesting, or do we only need to keep one?
22. Concept 3: The Basic algorithm
- To mine generalized association rules, we can run the basic Apriori-style algorithm on extended transactions, where each transaction is extended with the taxonomy ancestors of its items (see the sketch below)!
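A minimal sketch of the extended-transactions idea in Python: each transaction is augmented with the ancestors of its items, after which any of the frequent-itemset sketches above can be run unchanged. The taxonomy encoding and item names below are assumptions for illustration.

# Assumed taxonomy: child -> parent (e.g., "jacket" is-a "clothes").
taxonomy = {"jacket": "clothes", "shirt": "clothes", "shoes": "footwear"}

def ancestors(item, taxonomy):
    # All ancestors of `item`, following child -> parent edges to the root.
    result = set()
    while item in taxonomy:
        item = taxonomy[item]
        result.add(item)
    return result

def extend_transaction(transaction, taxonomy):
    # Add every ancestor of every item, so rules at any taxonomy level
    # (e.g., Clothes => Shoes) can be discovered by the basic algorithm.
    extended = set(transaction)
    for item in transaction:
        extended |= ancestors(item, taxonomy)
    return extended

print(extend_transaction({"jacket", "shoes"}, taxonomy))
# -> {'jacket', 'shoes', 'clothes', 'footwear'}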
23. Recent research: frequent pattern mining - a more efficient algorithm
From "Mining Frequent Patterns without Candidate Generation" - Han, Pei, Yin (1999)
24. Papers
- Rakesh Agrawal, Tomasz Imielinski, Arun N. Swami: Mining Association Rules between Sets of Items in Large Databases. SIGMOD Conference 1993: 207-216. (DBLP: conf/sigmod/AgrawalIS93)
- Rakesh Agrawal, Ramakrishnan Srikant: Fast Algorithms for Mining Association Rules in Large Databases. VLDB 1994: 487-499. (DBLP: conf/vldb/AgrawalS94)
- Ramakrishnan Srikant, Rakesh Agrawal: Mining Generalized Association Rules. VLDB 1995: 407-419. (DBLP: conf/vldb/SrikantA95)