Title: Mining Association Rules
1. Mining Association Rules
- Association Rules
- Apriori Algorithm
- AprioriTid and AprioriHybrid Algorithms
- Introduction to generalized association rules
- Director: Professor Walid Aref
- By: Bin Li
- 01/29/03
2. Data Mining - Knowledge Discovery in Databases (KDD)
- Databases to be mined: relational, transactional, object-oriented, spatial, text, multimedia, WWW, etc.
- Technical approaches: associations, sequential patterns, classifiers, clustering, etc.
- Techniques utilized: database-oriented, data warehouse (OLAP), machine learning, statistics, neural networks, etc.
- Applications: retail, telecommunication, banking, fraud analysis, molecular biology, DNA mining, stock market analysis, Web mining, astronomy, Web log analysis, etc.
3. General data mining technical approaches
- Associations: a customer buying item X often also buys item Y (X => Y). Example: 90% of transactions that purchase bread and butter also purchase milk, i.e., {bread, butter} => milk.
- Sequential patterns: discover the set of purchases that frequently precede the purchase of a microwave oven.
- Classifiers: in credit card analysis, a customer record may be classified with a Good, Medium, or Poor tag based on the customer's credit history.
- Clustering: group similar items together. It is like having points in a space and partitioning them so that points that are close to each other fall into the same partition.
4. Definitions
Let I = {i1, i2, ..., im} be a set of ordered literals, called items. Let D be a set of transactions, where each transaction T has a unique identifier TID and is a set of items such that T ⊆ I. If a set of items X satisfies X ⊆ T, we say T contains X. An association rule is an implication of the form X => Y, where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅. The rule X => Y holds in the transaction set D with confidence c if c% of the transactions in D that contain X also contain Y. The rule X => Y has support s in the transaction set D if s% of the transactions in D contain X ∪ Y.
5. Samples
Database: a small transaction table (shown on the slide).
- Items: {1, 2, 3, 4, 5, 6}
- One possible association rule: {1} => {3}
- Confidence: percentage of the transactions that contain 1 which also contain 3 - here 66%
- Support: percentage of the transactions that contain both 1 and 3 - here 50%
In statistical terms, for A => B: Support = P(A ∪ B) and Confidence = P(B|A), where P denotes probability.
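To make the two measures concrete, here is a minimal Python sketch. The toy database below is an assumption (it is not the table from the slide), chosen only so that the numbers come out to the 50% support and 66% confidence quoted above.

# Assumed toy database over items {1,...,6}; not the slide's actual table.
transactions = [
    {1, 3, 6},
    {1, 3, 5},
    {1, 2, 4},
    {2, 5, 6},
]

def support(itemset, transactions):
    # Fraction of transactions that contain every item of `itemset`.
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(lhs, rhs, transactions):
    # P(rhs | lhs) = support(lhs ∪ rhs) / support(lhs).
    return support(lhs | rhs, transactions) / support(lhs, transactions)

print(support({1, 3}, transactions))       # 0.5    -> 50% support for {1} => {3}
print(confidence({1}, {3}, transactions))  # 0.666  -> 66% confidence for {1} => {3}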
6. Association Rules
Given:
- A set of transactions; each transaction is a set of items
- A user-specified minimum support
- A user-specified minimum confidence
Find:
- All association rules whose support and confidence are greater than the minimum support and minimum confidence.
7. Problem Decomposition
Step 1. Find all frequent itemsets: sets of items whose support is greater than the user-specified minimum support.
Step 2. Generate association rules from the frequent itemsets: keep the rules whose confidence is greater than the user-specified minimum confidence.
Step 2 is quite straightforward: for every large (frequent) itemset l, all subsets of l are considered as rule antecedents, and there is little room for optimization (a sketch follows below). => We focus on Step 1.
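Since Step 2 is only described in prose, here is a hedged Python sketch of rule generation from a single frequent itemset; the function name and the toy support counts are illustrative assumptions, not code from the papers.

from itertools import combinations

def generate_rules(freq_itemset, support_count, min_conf):
    # For one frequent itemset, emit every rule X => (itemset - X) whose
    # confidence = support(itemset) / support(X) reaches min_conf.
    # `support_count` maps frozensets to their support counts.
    rules = []
    whole = frozenset(freq_itemset)
    items = sorted(whole)
    for r in range(1, len(items)):                 # all non-empty proper subsets
        for lhs in combinations(items, r):
            lhs = frozenset(lhs)
            conf = support_count[whole] / support_count[lhs]
            if conf >= min_conf:
                rules.append((set(lhs), set(whole - lhs), conf))
    return rules

# Illustrative (assumed) support counts for {1, 3} and its subsets.
counts = {frozenset({1}): 3, frozenset({3}): 3, frozenset({1, 3}): 2}
print(generate_rules({1, 3}, counts, min_conf=0.6))
# -> [({1}, {3}, 0.666...), ({3}, {1}, 0.666...)]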
8. Direct algorithm
Ck = all k-itemsets; Lk = frequent k-itemsets
L1 = {frequent 1-itemsets};
for (k = 1; Lk != Ø; k++) do begin
    Ck+1 = all (k+1)-itemsets;   // notice!
    for each transaction t in the database do
        Increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with minimum support;
end
Answer = ∪k Lk
(For example, for items {1,2,3,4,5}:
 C1 = {{1},{2},{3},{4},{5}},
 C2 = {{1,2},{1,3},{1,4},{1,5}, ...},
 C3 = {{1,2,3},{1,2,4},{1,2,5}, ...})
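To make the cost of this direct approach concrete, here is a small Python sketch (an illustration assumed for these slides, not code from the paper) that enumerates all 2^n - 1 non-empty itemsets and counts their support against the transactions.

from itertools import combinations

def brute_force_frequent_itemsets(transactions, items, min_support):
    # Enumerate every non-empty itemset over `items` (2^n - 1 of them) and
    # keep those whose support (as a fraction) reaches min_support.
    n = len(transactions)
    frequent = {}
    for k in range(1, len(items) + 1):             # candidate sizes 1..n
        for cand in combinations(sorted(items), k):
            cand = frozenset(cand)
            count = sum(1 for t in transactions if cand <= t)
            if count / n >= min_support:
                frequent[cand] = count
    return frequent

transactions = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
print(brute_force_frequent_itemsets(transactions, {1, 2, 3, 4, 5}, 0.5))
# With 5 items this checks 2^5 - 1 = 31 candidates; with 1000 items it would
# have to check 2^1000 - 1 of them, which is why the direct approach is not feasible.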
9. Sample
Items: {1, 2, 3, 4, 5}; minimum support: 50%
Answer = L1 ∪ L2 ∪ L3
10. How many itemsets must be verified?
For 4 items {1, 2, 3, 4}:
{1}, {2}, {3}, {4}, {1,2}, {1,3}, {1,4}, {2,3}, {2,4}, {3,4}, {1,2,3}, {1,2,4}, {1,3,4}, {2,3,4}, {1,2,3,4}
Total: 15 = 2^4 - 1 itemsets
For 5 items {1, 2, 3, 4, 5}:
{1}, {2}, {3}, {4}, {5}, {1,2}, {1,3}, {1,4}, {1,5}, {2,3}, {2,4}, {2,5}, ...
Total: 31 = 2^5 - 1 itemsets
How many itemsets for 1000 items? 2^1000 - 1!!! Correct, but not feasible!
11. Enhancement
- If {A} and {B} are frequent, we must count {A,B} to determine whether it is frequent. But if {A,B} is determined not to be frequent, it is unnecessary to count {A,B,C}, {A,B,D}, {A,B,C,D}, etc.
- Every subset of a frequent itemset is also frequent.
- This is the idea behind the Apriori algorithm!
12. Apriori Algorithm
Ck = candidate itemsets of size k; Lk = frequent itemsets of size k
L1 = {frequent 1-itemsets};
for (k = 1; Lk != Ø; k++) do begin
    Ck+1 = Apriori-gen(Lk);   // the only difference!
    for each transaction t in the database do
        Increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with minimum support;
end
Answer = ∪k Lk
Apriori-gen?
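The counting step of each pass can be sketched in Python as follows. This is a simplified illustration of the pseudocode above (Apriori-gen itself is sketched after the next slide), and the toy data is assumed.

def count_and_filter(candidates, transactions, min_support):
    # One Apriori pass: count how many transactions contain each candidate
    # in C(k+1), then keep the candidates that reach the minimum support.
    counts = {frozenset(c): 0 for c in candidates}
    for t in transactions:                   # single scan of the database
        for c in counts:
            if c <= t:                       # candidate contained in t
                counts[c] += 1
    n = len(transactions)
    return {c: cnt for c, cnt in counts.items() if cnt / n >= min_support}

# Illustrative C2 over the same assumed toy database as before.
transactions = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
C2 = [{1, 2}, {1, 3}, {1, 5}, {2, 3}, {2, 5}, {3, 5}]
print(count_and_filter(C2, transactions, 0.5))
# Keeps {1,3}, {2,3}, {2,5}, {3,5} with counts 2, 2, 3, 2.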
13. Apriori-gen(Lk)
- Step 1: Self-join
  insert into Ck+1
  select p.item1, p.item2, ..., p.itemk, q.itemk
  from Lk p, Lk q
  where p.item1 = q.item1, ..., p.itemk-1 = q.itemk-1, p.itemk < q.itemk
- Step 2: Prune
  forall itemsets c in Ck+1 do
    forall k-subsets s of c do
      if (s is not in Lk) then delete c from Ck+1
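A minimal Python sketch of the self-join and prune steps, assuming Lk is given as a collection of k-itemsets; it mirrors the SQL-style pseudocode above but is not the paper's implementation.

from itertools import combinations

def apriori_gen(Lk, k):
    # Generate candidate (k+1)-itemsets from the frequent k-itemsets Lk.
    Lk = set(map(frozenset, Lk))
    ordered = [tuple(sorted(s)) for s in Lk]
    # Step 1: self-join - merge two k-itemsets that share their first k-1 items.
    candidates = set()
    for p in ordered:
        for q in ordered:
            if p[:k - 1] == q[:k - 1] and p[k - 1] < q[k - 1]:
                candidates.add(frozenset(p + (q[k - 1],)))
    # Step 2: prune - drop candidates that have an infrequent k-subset.
    return {c for c in candidates
            if all(frozenset(s) in Lk for s in combinations(c, k))}

# The sample from the next slide: L2 = {{1,2},{1,4},{2,3},{2,5},{3,5}}.
L2 = [{1, 2}, {1, 4}, {2, 3}, {2, 5}, {3, 5}]
print(apriori_gen(L2, 2))
# -> {frozenset({2, 3, 5})}; {1,2,4} is pruned because {2,4} is not in L2.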
14. Sample
- Apriori-gen(L2 = {{1,2}, {1,4}, {2,3}, {2,5}, {3,5}})
- 1) Self-join L2 * L2:
  - {1,2} and {1,4} => {1,2,4}
  - {2,3} and {2,5} => {2,3,5}
- 2) Prune:
  - {2,4} is not in L2, so delete {1,2,4}
- C3 = {{2,3,5}}
15. Sample
Answer = L1 ∪ L2 ∪ L3
16. Apriori optimizations: AprioriTid and AprioriHybrid
- AprioriTid algorithm
- Apriori runs Apriori-gen and then counts the candidates against each transaction in the database on every pass.
- AprioriTid instead regards the database as a set of 1-itemsets and, on each pass, rebuilds and uses a transformed database of (k+1)-itemsets derived from the k-itemsets (instead of rescanning the original database), discarding itemsets that cannot be part of any large itemset.
- Example of building 2-itemsets from 1-itemsets (see the sketch after this list):
  - {2,3,5} (original database) => {{2,3},{2,5},{3,5}} (new database), when {2}, {3}, and {5} are determined to be large itemsets.
  - {1,3,4} => {{1,3}}; item 4 is discarded because it cannot be in any large itemset, when only {1} and {3} are determined to be large itemsets.
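A hedged Python sketch of the database transformation described in the example above. The data structures are illustrative simplifications; the actual AprioriTid algorithm in the VLDB 1994 paper stores candidate IDs per TID rather than raw itemsets.

from itertools import combinations

def transform_database(encoded_db, Lk, k):
    # Rebuild the encoded database for the next pass: for each transaction,
    # build the (k+1)-itemsets supported by its frequent k-itemsets and drop
    # transactions that become empty.
    Lk = set(map(frozenset, Lk))
    new_db = []
    for entries in encoded_db:        # entries: the k-itemsets present in one transaction
        present = {s for s in entries if s in Lk}          # keep frequent k-itemsets only
        items = sorted(set().union(*present)) if present else []
        bigger = {frozenset(c) for c in combinations(items, k + 1)
                  if all(frozenset(s) in Lk for s in combinations(c, k))}
        if bigger:
            new_db.append(bigger)
    return new_db

# Pass from 1-itemsets to 2-itemsets, mirroring the example above.
L1 = [{1}, {2}, {3}, {5}]
encoded = [{frozenset({i}) for i in t} for t in ({2, 3, 5}, {1, 3, 4})]
print(transform_database(encoded, L1, 1))
# The first transaction becomes {{2,3},{2,5},{3,5}}; the second becomes {{1,3}} (item 4 dropped).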
17. AprioriTid vs. Apriori
Per-pass execution times of Apriori and AprioriTid (T10.I4.D100K, minimum support 0.75%)
18. AprioriHybrid algorithm: a mixture of Apriori and AprioriTid
- From the comparison of Apriori and AprioriTid, we can see that Apriori does better than AprioriTid in the earlier passes, while AprioriTid beats Apriori in the later passes. The hybrid algorithm AprioriHybrid therefore uses Apriori in the initial passes and switches to AprioriTid in the later passes.
19. Mining generalized association rules - an introduction via three concepts
Concept 1: Taxonomy
Items: Jackets, Shirts, Shoes; minimum support: 15%
Possible results:
- Jackets => Shoes (10%): does not hold.
- Shirts => Shoes (10%): does not hold.
When we use the taxonomy "Jackets is-a Clothes, Shirts is-a Clothes" and add Clothes to the items, one possible result is:
- Clothes => Shoes (18%): holds! Why not 20%? (Transactions that contain both jackets and shirts are counted only once for Clothes.) Is it valuable? Sure!
20. What's more?
- "Jackets is-a Clothes, Shirts is-a Clothes" is a taxonomy over product categories. Items could also be classified by price, brand, product group, etc.
- More interesting and useful association rules can be found this way!
21. Concept 2: Interestingness
- Items: Jackets, Shirts, Shoes, Clothes
- Minimum support: 10%
- Two possible results:
  - Shirts => Shoes (10%): holds.
  - Clothes => Shoes (20%): holds.
- When we know:
  - Support(Shirts)
  - Support(Clothes)
- Are both of these results interesting, or do we only need to keep one?
22. Concept 3: The Basic algorithm
- To mine generalized association rules, we can run the basic Apriori-style algorithm on extended transactions, where each transaction is extended with the taxonomy ancestors of its items (see the sketch below)!
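A minimal sketch of the extended-transactions idea in Python: each transaction is augmented with the ancestors of its items, after which any of the frequent-itemset sketches above can be run unchanged. The taxonomy encoding and item names below are assumptions for illustration.

# Assumed taxonomy: child -> parent (e.g., "jacket" is-a "clothes").
taxonomy = {"jacket": "clothes", "shirt": "clothes", "shoes": "footwear"}

def ancestors(item, taxonomy):
    # All ancestors of `item`, following child -> parent edges to the root.
    result = set()
    while item in taxonomy:
        item = taxonomy[item]
        result.add(item)
    return result

def extend_transaction(transaction, taxonomy):
    # Add every ancestor of every item, so rules at any taxonomy level
    # (e.g., Clothes => Shoes) can be discovered by the basic algorithm.
    extended = set(transaction)
    for item in transaction:
        extended |= ancestors(item, taxonomy)
    return extended

print(extend_transaction({"jacket", "shoes"}, taxonomy))
# -> {'jacket', 'shoes', 'clothes', 'footwear'}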
23. Recent research: frequent pattern mining - a more efficient algorithm
From "Mining Frequent Patterns without Candidate Generation" - Han, Pei, Yin (1999)
24. Papers
- Rakesh Agrawal, Tomasz Imielinski, Arun N. Swami: Mining Association Rules between Sets of Items in Large Databases. SIGMOD Conference 1993: 207-216. (DBLP: conf/sigmod/AgrawalIS93)
- Rakesh Agrawal, Ramakrishnan Srikant: Fast Algorithms for Mining Association Rules in Large Databases. VLDB 1994: 487-499. (DBLP: conf/vldb/AgrawalS94)
- Ramakrishnan Srikant, Rakesh Agrawal: Mining Generalized Association Rules. VLDB 1995: 407-419. (DBLP: conf/vldb/SrikantA95)