Mining Association Rules - PowerPoint PPT Presentation

1
Mining Association Rules
  • Association Rules
  • Apriori Algorithm
  • AprioriTid and AprioriHybrid Algorithm
  • Introduction to generalized association rules
  • Director: Professor Walid Aref
  • By: Bin Li
  • 01/29/03

2
Data Mining - Knowledge Discovery in Databases (KDD)
  • Databases to be mined: relational, transactional, object-oriented, spatial, text, multimedia, WWW, etc.
  • Technical approaches: associations, sequential patterns, classifiers, clustering, etc.
  • Techniques utilized: database-oriented, data warehouse (OLAP), machine learning, statistics, neural networks, etc.
  • Applications adapted: retail, telecommunication, banking, fraud analysis, molecular biology, DNA mining, stock market analysis, Web mining, astronomy, Web log analysis, etc.
3
General data mining technical approaches
  • Associations: A customer buying item X often also buys item Y (X => Y). For example, 90% of transactions that purchase bread and butter also purchase milk: {bread, butter} => {milk}.
  • Sequential patterns: Discover the set of purchases that frequently precedes the purchase of a microwave oven.
  • Classifiers: In credit card analysis, a customer record may be classified as Good, Medium, or Poor according to the customer's credit history.
  • Clustering: Group similar items together. Given some points in space, partition them so that points close to each other fall into the same partition.
4
Definitions
Let I = {i1, i2, ..., im} be a set of ordered literals, called items.
Let D be a set of transactions, each with a unique identifier TID.
Each transaction T is a set of items such that T ⊆ I.
If a set of items X ⊆ T, we say T contains X.
An association rule is an implication of the form X => Y, where X ⊂ I, Y ⊂ I, and X ∩ Y = Ø.
The rule X => Y holds in the transaction set D with confidence c if c% of transactions in D that contain X also contain Y.
The rule X => Y has support s in the transaction set D if s% of transactions in D contain X ∪ Y.
5
Samples
Database
  • Items: {1, 2, 3, 4, 5, 6}
  • One possible association rule: 1 => 3
  • Confidence: the percentage of transactions containing 1 that also contain 3: 66%
  • Support: the percentage of transactions that contain both 1 and 3: 50%

From a statistical point of view, for A => B: Support = p(A and B), Confidence = p(B | A), where p denotes probability.
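As a minimal sketch of the two measures, the Python below computes support and confidence for the rule 1 => 3. The four transactions are hypothetical (the slide's database did not survive in this transcript) and were chosen so the results agree with the 50% support and roughly 66% confidence quoted above. The function names are illustrative only.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, lhs, rhs):
    """Estimate of p(rhs | lhs): support(lhs and rhs) / support(lhs)."""
    both = set(lhs) | set(rhs)
    return support(transactions, both) / support(transactions, lhs)

# Hypothetical database over items 1..6 (assumed, not the one on the slide).
transactions = [{1, 2, 3}, {1, 3, 5}, {1, 4}, {2, 6}]

print(support(transactions, {1, 3}))       # share of transactions with both 1 and 3
print(confidence(transactions, {1}, {3}))  # share of transactions with 1 that also have 3
```

Here two of the four transactions contain both items and three contain item 1, so the support is 2/4 and the confidence is 2/3.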
6
Association Rules
Given:
  • A set of transactions, where each transaction is a set of items
  • A user-specified minimum support
  • A user-specified minimum confidence
Find: All association rules whose support and confidence are greater than the minimum support and minimum confidence.
7
Problem Decomposition
Step 1. Find all frequent itemsets: sets of items whose support is greater than the user-specified minimum support.
Step 2. Generate the association rules from the frequent itemsets: rules whose confidence is greater than the user-specified minimum confidence.
Step 2 is quite straightforward. For every large (frequent) itemset l, all subsets a of l are considered, and the rule a => (l - a) is kept if support(l) / support(a) reaches the minimum confidence. There is little in this process to optimize.
=> We focus on step 1.
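Step 2 can be sketched in a few lines of Python. This is an illustrative sketch, not code from the presentation: the `supports` table is hypothetical, the name `generate_rules` is made up, and the threshold is treated as "at least" the minimum confidence.

```python
from itertools import combinations

def generate_rules(supports, min_confidence):
    """Return (lhs, rhs, confidence) for every rule lhs => rhs that can be
    formed from a frequent itemset and reaches `min_confidence`.
    `supports` maps each frequent itemset (a frozenset) to its support."""
    rules = []
    for itemset, sup in supports.items():
        for size in range(1, len(itemset)):          # proper, non-empty subsets
            for lhs in combinations(sorted(itemset), size):
                lhs = frozenset(lhs)
                # lhs is frequent too, so its support is in the table.
                conf = sup / supports[lhs]
                if conf >= min_confidence:
                    rules.append((lhs, itemset - lhs, conf))
    return rules

# Hypothetical supports of the frequent itemsets (assumed values).
supports = {frozenset({1}): 0.75, frozenset({3}): 0.5, frozenset({1, 3}): 0.5}

for lhs, rhs, conf in generate_rules(supports, min_confidence=0.7):
    print(set(lhs), "=>", set(rhs), round(conf, 2))
```

The lookup `supports[lhs]` never fails because every subset of a frequent itemset is itself frequent, which is why step 1 must return all frequent itemsets and not only the largest ones.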
8
Direct algorithms
Ck: all k-itemsets
Lk: frequent k-itemsets

L1 = {frequent 1-itemsets};
for (k = 1; Lk != Ø; k++) do begin
    Ck+1 = {all (k+1)-itemsets};   // notice!
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with minimum support
end
Answer = ∪k Lk

(For example, for items {1,2,3,4,5}: C1 = {{1},{2},{3},{4},{5}}, C2 = {{1,2},{1,3},{1,4},{1,5}, ...}, C3 = {{1,2,3},{1,2,4},{1,2,5}, ...})
9
Sample
Items: {1, 2, 3, 4, 5}; minimum support: 50%
Answer: L1 ∪ L2 ∪ L3
10
How many itemsets to be verified?
For 4 items {1, 2, 3, 4}:
{1}, {2}, {3}, {4}, {1,2}, {1,3}, {1,4}, {2,3}, {2,4}, {3,4}, {1,2,3}, {1,2,4}, {1,3,4}, {2,3,4}, {1,2,3,4}
Total: 15 = 2^4 - 1 itemsets
For 5 items {1, 2, 3, 4, 5}:
{1}, {2}, {3}, {4}, {5}, {1,2}, {1,3}, {1,4}, {1,5}, {2,3}, {2,4}, {2,5}, ...
Total: 31 = 2^5 - 1 itemsets
How many itemsets for 1000 items? 2^1000 - 1! Correct but not feasible!
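The counts above can be checked with a short Python sketch that enumerates every non-empty subset. The helper name `all_itemsets` is made up for illustration.

```python
from itertools import combinations

def all_itemsets(items):
    """Every non-empty subset of `items`, smallest sizes first."""
    items = sorted(items)
    return [set(c)
            for k in range(1, len(items) + 1)
            for c in combinations(items, k)]

print(len(all_itemsets([1, 2, 3, 4])))     # 15, i.e. 2**4 - 1
print(len(all_itemsets([1, 2, 3, 4, 5])))  # 31, i.e. 2**5 - 1

# For 1000 items the count is far too large to enumerate;
# Python can still report how many decimal digits it has.
print(len(str(2**1000 - 1)))               # 302
```

A number with 302 decimal digits rules out counting every itemset directly, which motivates the pruning idea on the next slide.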
11
Enhancement
  • If A and B are frequent, we must measure AB to
    determine whether it is frequent. But if AB is
    found not to be frequent, it is unnecessary
    to measure ABC, ABD, ABCD, etc.
  • Every subset of a frequent itemset is also
    frequent.
  • => Apriori algorithm!

12
Apriori Algorithm
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent 1-itemsets};
for (k = 1; Lk != Ø; k++) do begin
    Ck+1 = Apriori-gen(Lk);   // the only difference!
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with minimum support
end
Answer = ∪k Lk

Apriori-gen?
13
Apriori-gen(Lk)
  • Step 1: Self-join
  • insert into Ck+1
  • select p.item1, p.item2, ..., p.itemk, q.itemk
  • from Lk p, Lk q
  • where p.item1 = q.item1, ..., p.itemk-1 = q.itemk-1,
    p.itemk < q.itemk
  • Step 2: Prune
  • forall itemsets c in Ck+1 do
  • forall k-subsets s of c do
  • if (s is not in Lk) then delete c from Ck+1

14
Sample
  • Apriori-gen(L2 = {{1,2}, {1,4}, {2,3}, {2,5}, {3,5}})
  • 1) Self-join: L2 * L2
  • {1,2} and {1,4} => {1,2,4}
  • {2,3} and {2,5} => {2,3,5}
  • 2) Prune
  • {2,4} is not in L2, so delete {1,2,4}
  • C3 = {{2,3,5}}
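The join and prune steps can be sketched in Python and run on the L2 from this sample. Representing itemsets as sorted tuples is an assumption made here for convenience, and the function is an illustrative sketch, not the authors' code.

```python
from itertools import combinations

def apriori_gen(frequent_k):
    """Build candidate (k+1)-itemsets from frequent k-itemsets.
    Itemsets are sorted tuples so that 'first k-1 items' is well defined."""
    frequent_k = set(frequent_k)
    if not frequent_k:
        return set()
    k = len(next(iter(frequent_k)))
    # Step 1, self-join: pair itemsets that agree on their first k-1 items.
    joined = {p + (q[-1],)
              for p in frequent_k for q in frequent_k
              if p[:-1] == q[:-1] and p[-1] < q[-1]}
    # Step 2, prune: drop c if any of its k-subsets is not frequent.
    return {c for c in joined
            if all(s in frequent_k for s in combinations(c, k))}

L2 = {(1, 2), (1, 4), (2, 3), (2, 5), (3, 5)}
print(apriori_gen(L2))  # {(2, 3, 5)}
```

The join produces (1, 2, 4) and (2, 3, 5); the prune step removes (1, 2, 4) because (2, 4) is not in L2.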

15
Sample
Answer: L1 ∪ L2 ∪ L3
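Putting the loop and candidate generation together, a compact Apriori sketch in Python might look as follows. The four-transaction database is an assumption (the slide's table is missing from this transcript), and "minimum support" is treated as "at least" the threshold. The candidate generator is repeated so the block runs on its own.

```python
from itertools import combinations

def apriori_gen(frequent_k, k):
    """Self-join frequent k-itemsets (sorted tuples), then prune."""
    joined = {p + (q[-1],) for p in frequent_k for q in frequent_k
              if p[:-1] == q[:-1] and p[-1] < q[-1]}
    return {c for c in joined
            if all(s in frequent_k for s in combinations(c, k))}

def apriori(transactions, min_support):
    """Return [L1, L2, ...]: the frequent itemsets of each size."""
    n = len(transactions)
    candidates = {(item,) for t in transactions for item in t}
    levels, k = [], 1
    while candidates:
        counts = dict.fromkeys(candidates, 0)
        for t in transactions:              # one database scan per pass
            for c in candidates:
                if t.issuperset(c):
                    counts[c] += 1
        frequent = {c for c, cnt in counts.items() if cnt >= min_support * n}
        if not frequent:
            break
        levels.append(frequent)
        candidates = apriori_gen(frequent, k)
        k += 1
    return levels

# Assumed database over items 1..5 (not recovered from the slide).
database = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
levels = apriori(database, min_support=0.5)
for size, level in enumerate(levels, start=1):
    print(f"L{size}:", sorted(level))
```

With this database the passes stop after L3 = {(2, 3, 5)}, since no 4-itemset candidate can be generated from a single 3-itemset.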
16
Optimizations in the Apriori family: AprioriTid and AprioriHybrid
  • AprioriTid algorithm
  • Apriori, plus Apriori-gen applied to each of the
    transactions in the database.
  • Treating the database as sets of 1-itemsets, rebuild
    the (k+1)-itemsets from the k-itemsets in each pass and
    use them instead of the database, discarding the
    itemsets that cannot be large itemsets.
  • Example of building 2-itemsets from 1-itemsets:
  • {2,3,5} (original database) =>
    {{2,3}, {2,5}, {3,5}} (new database) when {2}, {3},
    and {5} are determined to be large itemsets.
  • {1,3,4} => {{1,3}}: item 4, which cannot be part of
    a large itemset, is discarded when {1} and {3}
    are determined to be large itemsets.
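A rough Python sketch of the AprioriTid idea follows: after the first pass, each transaction is replaced by the set of candidate itemsets it contains, and later passes read that encoding instead of the raw database. The database is an assumption that includes the two transactions from the example above, and the function names are illustrative. Candidate generation is repeated so the block is self-contained.

```python
from itertools import combinations

def apriori_gen(frequent_k):
    """Self-join plus prune on frequent k-itemsets stored as sorted tuples."""
    joined = {p + (q[-1],) for p in frequent_k for q in frequent_k
              if p[:-1] == q[:-1] and p[-1] < q[-1]}
    return {c for c in joined
            if all(s in frequent_k for s in combinations(c, len(c) - 1))}

def apriori_tid(transactions, min_count):
    """`transactions` maps TID -> set of items. Returns all frequent itemsets."""
    # Pass 1: encode each transaction as the set of its 1-itemsets.
    encoded = {tid: {(i,) for i in items} for tid, items in transactions.items()}
    counts = {}
    for itemsets in encoded.values():
        for c in itemsets:
            counts[c] = counts.get(c, 0) + 1
    frequent = {c for c, n in counts.items() if n >= min_count}
    answer = set(frequent)
    while frequent:
        candidates = apriori_gen(frequent)
        counts = dict.fromkeys(candidates, 0)
        next_encoded = {}
        for tid, itemsets in encoded.items():
            # c is in the transaction iff both k-subsets that generated it are.
            present = {c for c in candidates
                       if c[:-1] in itemsets and c[:-2] + c[-1:] in itemsets}
            for c in present:
                counts[c] += 1
            if present:                     # empty entries are dropped entirely
                next_encoded[tid] = present
        encoded = next_encoded
        frequent = {c for c, n in counts.items() if n >= min_count}
        answer |= frequent
    return answer

# Assumed database; TIDs 100 and 200 hold the two transactions from the example.
db = {100: {1, 3, 4}, 200: {2, 3, 5}, 300: {1, 2, 3, 5}, 400: {2, 5}}
print(sorted(apriori_tid(db, min_count=2), key=lambda c: (len(c), c)))
```

The raw database is scanned only once here. The encoding can be larger than the database in early passes and much smaller in later ones, which is consistent with the per-pass comparison discussed on the following slides.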

17
AprioriTid vs. Apriori
[Figure: per-pass execution times of Apriori and AprioriTid (T10.I4.D100K, minsup = 0.75%)]
18
AprioriHybrid algorithm: a mixture of the Apriori and
AprioriTid algorithms
  • From the comparison of AprioriTid and Apriori,
    we can see that Apriori does better
    than AprioriTid in the earlier passes, while
    AprioriTid beats Apriori in the later passes. A
    hybrid algorithm, AprioriHybrid, uses Apriori in
    the initial passes and switches to AprioriTid in
    the later passes.

19
Mining generalized association rules:
introduction of three heuristic concepts
Concept 1. Taxonomy
  • Items: Jackets, Shirts, Shoes
  • Minimum support: 15%
  • One possible result: Jackets => Shoes (10%) does not hold. Shirts => Shoes (10%) does not hold.
  • Now consider that Jackets and Shirts are each (is-a) Clothes, and add Clothes to the items.
  • One possible result: Clothes => Shoes (18%) holds!
  • Why not 20%? A transaction that contains both Jackets and Shirts counts only once toward Clothes.
  • Is it valuable? Sure!
20
What's more?
  • "Jackets and Shirts is-a Clothes" is a taxonomy
    by category. The items could also be
    classified by price, brand,
    product group, etc.
  • More interesting and useful association rules could
    be found in this way!

21
Concept 2. Interestingness
  • Items: Jackets, Shirts, Shoes, Clothes
  • Minimum support: 10%
  • Two possible results:
  • Shirts => Shoes (10%) holds.
  • Clothes => Shoes (20%) holds.
  • When we know
  • Support(Shirts)
  • Support(Clothes)
  • Are both of these results interesting? Or
    do we need to keep only one?
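One way to approach the question, in the spirit of the generalized-rules paper listed under Papers, is to compare a specialized rule's actual support with the support expected from its ancestor rule. The sketch below is only illustrative: the item supports for Shirts and Clothes are hypothetical (the slide's values did not survive), the threshold `r` is an assumed user parameter, and the function names are made up.

```python
def expected_support(ancestor_rule_support, item_support, ancestor_item_support):
    """Support expected for the specialized rule if the item behaved like an
    average member of its ancestor category."""
    return ancestor_rule_support * item_support / ancestor_item_support

def is_interesting(actual_support, expected, r):
    """Keep the specialized rule only if it beats expectation by a factor r."""
    return actual_support >= r * expected

# Rule supports in percent, taken from the slide.
clothes_shoes = 20
shirts_shoes = 10
# Item supports in percent: hypothetical values, NOT from the slide.
support_shirts = 15
support_clothes = 30

exp = expected_support(clothes_shoes, support_shirts, support_clothes)
print(exp)                                       # 10.0
print(is_interesting(shirts_shoes, exp, r=1.1))  # False
```

Under these assumed numbers, Shirts => Shoes carries no information beyond Clothes => Shoes, so only the more general rule would be kept.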

22
Concept 3. Basic algorithm
  • To mine generalized association rules, we can
    run the same frequent-itemset algorithm on extended
    transactions, in which each transaction is extended
    with the ancestors of its items.
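A minimal sketch of extended transactions in Python, using the "Jackets and Shirts is-a Clothes" taxonomy from the earlier slides. The five transactions are hypothetical, each item is assumed to have at most one parent, and the helper names are illustrative.

```python
def ancestors(item, parent):
    """All ancestors of `item`, following child -> parent links upward."""
    found = []
    while item in parent:
        item = parent[item]
        found.append(item)
    return found

def extend(transaction, parent):
    """The transaction plus every ancestor of each of its items."""
    extended = set(transaction)
    for item in transaction:
        extended.update(ancestors(item, parent))
    return extended

def support(db, itemset):
    """Fraction of transactions in `db` containing all items of `itemset`."""
    return sum(set(itemset) <= t for t in db) / len(db)

parent = {"Jackets": "Clothes", "Shirts": "Clothes"}   # is-a edges
# Hypothetical transactions (assumed for illustration).
transactions = [
    {"Jackets", "Shoes"},
    {"Shirts", "Shoes"},
    {"Jackets", "Shirts", "Shoes"},
    {"Shirts"},
    {"Shoes"},
]
extended = [extend(t, parent) for t in transactions]

print(support(extended, {"Jackets", "Shoes"}))  # 0.4
print(support(extended, {"Shirts", "Shoes"}))   # 0.4
print(support(extended, {"Clothes", "Shoes"}))  # 0.6
```

The support of {Clothes, Shoes} is 0.6, not 0.4 + 0.4, because the transaction holding both Jackets and Shirts is counted once. Any frequent-itemset miner, such as Apriori, can then be run on `extended` unchanged.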

23
Recent research: frequent pattern mining, a
more efficient algorithm
From "Mining Frequent Patterns without Candidate
Generation", Han, Pei, Yin (1999)
24
Papers
  • Rakesh Agrawal, Tomasz Imielinski, Arun N. Swami: Mining Association Rules between Sets of Items in Large Databases. SIGMOD Conference 1993: 207-216. DBLP: conf/sigmod/AgrawalIS93
  • Rakesh Agrawal, Ramakrishnan Srikant: Fast Algorithms for Mining Association Rules in Large Databases. VLDB 1994: 487-499. DBLP: conf/vldb/AgrawalS94
  • Ramakrishnan Srikant, Rakesh Agrawal: Mining Generalized Association Rules. VLDB 1995: 407-419. DBLP: conf/vldb/SrikantA95