Title: Fast Algorithm for Mining Association Rules
1Fast Algorithm for Mining Association Rules
- Rakesh Agrawal, Ramakrishnan Srikant
- Presenter: Zhiwei Zhan
2Overview
- Introduction
- Problem Decomposition
- Old and new Algorithms
- Performance Analysis
- Conclusion
3Introduction
- Data mining is motivated by the decision-support problems of large retail organizations.
- The massive amounts of sales data they collect are referred to as basket data.
- Purpose: mine association rules over basket data.
- An example of such a rule: 98% of customers that purchase tires and auto accessories also get automotive services done.
4Introduction
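- Formal statement in the paper:
- I = {i1, i2, ..., im}: a set of items
- D: a set of transactions, where each transaction T is a set of items with T ⊆ I
- Association rule: an implication of the form X ⇒ Y, where X ⊂ I, Y ⊂ I and X ∩ Y = ∅
- X ⇒ Y has support s in D if s% of transactions in D contain X ∪ Y
- X ⇒ Y has confidence c if c% of transactions in D that contain X also contain Y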
5Introduction
- Example of support and confidence
- Suppose we have 100 transactions, 80 of which contain diaper and 50 of which contain both beer and diaper.
- Support(diaper ⇒ beer) = 50/100
- Confidence(diaper ⇒ beer) = support({diaper, beer})/support(diaper) = 50/80
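The arithmetic above can be checked with a minimal Python sketch. The function names and the toy transaction list are illustrative assumptions; in particular, the 20 transactions without diaper are filled with a made-up item, since the slide only gives the counts.

```python
from fractions import Fraction

def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))

def confidence(transactions, antecedent, consequent):
    """support(antecedent ∪ consequent) / support(antecedent)."""
    both = frozenset(antecedent) | frozenset(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Toy data matching the counts on the slide: 100 transactions, 80 with diaper,
# 50 of those also with beer. The other 20 hold a made-up item (assumption).
transactions = (
    [frozenset({"diaper", "beer"})] * 50
    + [frozenset({"diaper"})] * 30
    + [frozenset({"milk"})] * 20
)

print(support(transactions, {"diaper", "beer"}))       # 1/2
print(confidence(transactions, {"diaper"}, {"beer"}))  # 5/8
```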
6Introduction
- The problem of mining association rules is to generate all association rules whose support and confidence are greater than minsup and minconf, respectively.
- minsup and minconf: the user-specified minimum support and minimum confidence.
7Problem Decomposition: Phase One
- Find the itemsets whose transaction support is above the minimum support; these are the large itemsets.
- The support for an itemset is the number of transactions that contain the itemset.
8Problem Decomposition: Phase Two
- Use the large itemsets to generate the desired rules.
- E.g. if ABCD and AB are large itemsets, we can determine whether the rule AB ⇒ CD holds by computing its confidence = supp(ABCD)/supp(AB); if confidence > minconf, the rule holds.
9Phase One
- Discovering Large Itemsets
10Basic intuition
- Any subset of a large itemset must be large. E.g. if {A, B} is a large itemset, then {A} and {B} must also be large.
- So candidate itemsets having k items can be generated by joining large itemsets having k-1 items and deleting those candidates that contain any subset that is not large.
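The join-and-delete idea can be sketched in Python as follows. This is a minimal illustration rather than the paper's pseudocode: itemsets are represented as sorted tuples (an assumption), and the function name `apriori_gen` simply mirrors the name used in the deck.

```python
from itertools import combinations

def apriori_gen(large_prev):
    """Build candidate k-itemsets from the large (k-1)-itemsets.

    `large_prev` is a collection of equal-length sorted tuples.
    """
    large_prev = set(large_prev)
    if not large_prev:
        return set()
    k = len(next(iter(large_prev))) + 1
    # Join step: merge two itemsets that agree on their first k-2 items.
    joined = {
        p + (q[-1],)
        for p in large_prev
        for q in large_prev
        if p[:-1] == q[:-1] and p[-1] < q[-1]
    }
    # Prune step: drop any candidate with a (k-1)-subset that is not large.
    return {
        c for c in joined
        if all(s in large_prev for s in combinations(c, k - 1))
    }

L3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}
print(apriori_gen(L3))  # {(1, 2, 3, 4)}
```

Here the join produces {1,2,3,4} and {1,3,4,5}; the latter is deleted because its subset {1,4,5} is not in L3.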
11Algorithm Apriori-Notation
12Algorithm Apriori
13Algorithm Apriori-gen
14AIS/SETM's candidate generation
- AIS/SETM
- In pass k of the algorithm, a transaction t is read, and some of the large items in t are added to those elements of L_{k-1} that t contains. An item is added to an element only if:
- it is not in that element yet, and
- it occurs later in the lexicographic ordering than any of the items in that element (for numbers, n+1 occurs later than n).
15Example of AIS/SETM's candidate generation
- L3 = {{1,2,3}, {1,2,4}, {1,3,4}, {1,3,5}, {2,3,4}}
- t = {1,2,3,4,5}
- From the above, the candidate itemsets are generated as follows:
- {1,2,3} → {1,2,3,4}, {1,2,3,5}
- {1,2,4} → {1,2,4,5}
- {1,3,4} → {1,3,4,5}
- {1,3,5} → none
- {2,3,4} → {2,3,4,5}
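The example can be reproduced with a short Python sketch of the per-transaction extension step. The function name `ais_extend` is hypothetical, itemsets are sorted tuples, and every item of the transaction is assumed to be large, which is a simplification.

```python
def ais_extend(large_prev, transaction):
    """Candidates generated from one transaction, AIS/SETM style.

    Each large itemset contained in the transaction is extended with every
    item of the transaction that sorts after the itemset's last item.
    Assumption: all items in the transaction are themselves large.
    """
    t = sorted(transaction)
    t_set = set(t)
    candidates = {}
    for itemset in sorted(large_prev):
        if set(itemset) <= t_set:
            candidates[itemset] = [
                itemset + (item,) for item in t if item > itemset[-1]
            ]
    return candidates

L3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}
generated = ais_extend(L3, {1, 2, 3, 4, 5})
for itemset, extensions in generated.items():
    print(itemset, "->", extensions)
print(sum(len(e) for e in generated.values()))  # 5 candidates in total
```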
16Apriori-gen vs AIS/SETM
- In Apriori-gen, only one candidate itemset will be considered in the 4th pass: {1,2,3,4}.
- In AIS/SETM, five candidate itemsets will be considered (as shown on the previous slide).
17Algorithm AprioriTid
18 Algorithm AprioriTid-Example
19Important feature of AprioriTid
- The database is not used at all for counting the support of candidate itemsets after the first pass. Rather, an encoding of the candidate itemsets used in the previous pass (the set C̄_{k-1}) is employed for this purpose.
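A rough Python sketch of this idea is given below. It is not the paper's pseudocode: the dictionary-based encoding, the function names, the inclusive minimum-support test and the four-transaction database are all assumptions chosen for illustration. The point it shows is that the database is read once, and every later pass reads only the per-transaction sets of candidates kept from the previous pass.

```python
from collections import Counter
from itertools import combinations

def apriori_gen(large_prev, k):
    """Join + prune: candidate k-itemsets from large (k-1)-itemsets (sorted tuples)."""
    joined = {p + (q[-1],) for p in large_prev for q in large_prev
              if p[:-1] == q[:-1] and p[-1] < q[-1]}
    return {c for c in joined
            if all(s in large_prev for s in combinations(c, k - 1))}

def apriori_tid(database, minsup_count):
    """Return {itemset: count} for all large itemsets.

    `database` maps a transaction id to a set of items. It is read only once,
    to build the pass-1 encoding; later passes read the encoding instead.
    """
    # Pass 1: the encoding holds, per transaction, its 1-itemsets.
    encoded = {tid: {(i,) for i in items} for tid, items in database.items()}
    counts = Counter(c for cands in encoded.values() for c in cands)
    large = {c: n for c, n in counts.items() if n >= minsup_count}
    result = dict(large)
    k = 2
    while large:
        candidates = apriori_gen(set(large), k)
        counts = Counter()
        next_encoded = {}
        for tid, prev in encoded.items():
            # c is in the transaction if both (k-1)-itemsets that generated it
            # are: c without its last item, and c without its second-to-last.
            present = {c for c in candidates
                       if c[:-1] in prev and c[:-2] + c[-1:] in prev}
            if present:  # transactions with no candidates drop out
                next_encoded[tid] = present
                counts.update(present)
        large = {c: n for c, n in counts.items() if n >= minsup_count}
        result.update(large)
        encoded = next_encoded
        k += 1
    return result

# Made-up toy database (assumption), minimum support of 2 transactions.
database = {100: {1, 3, 4}, 200: {2, 3, 5}, 300: {1, 2, 3, 5}, 400: {2, 5}}
large_itemsets = apriori_tid(database, 2)
print(sorted(large_itemsets.items()))
```

Because transactions that contain no candidate are dropped from the encoding, the encoding tends to shrink in later passes.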
20A summary for Apriori/AprioriTid vs AIS/SETM
- Apriori and AprioriTid differ fundamentally from AIS and SETM in which candidate itemsets are counted in a pass and in how those candidates are generated.
- As a result, Apriori/AprioriTid generate fewer candidate itemsets than AIS/SETM, keeping only those that could possibly be large.
21Phase Two
22Discovering rules
- Two algorithms will be introduced
- 1. The simple algorithm
- 2. The fast algorithm
23Simple Algorithm: ideas behind
- If a subset a of a large itemset L does not generate a rule, there is no need to consider generating rules from subsets of a.
- E.g. if ABC ⇒ D does not hold, then AB ⇒ CD does not hold either (since support(ABC) ≤ support(AB), conf(ABC ⇒ D) ≥ conf(AB ⇒ CD)).
24Simple Algorithm: Basic Steps
- 1. For every large itemset L, find all non-empty subsets of L (in a recursive depth-first fashion).
- 2. For every such subset a, output the rule a ⇒ (L-a) if conf > minconf (conf = support(L)/support(a)).
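The two steps can be sketched in Python as a recursive procedure. This is an illustrative sketch under several assumptions: itemsets are sorted tuples, `supports` is a hypothetical dictionary of made-up support counts, and the threshold is treated as inclusive (conf >= minconf), whereas the slide writes it as a strict inequality. Recursion stops below any subset whose rule fails, following the idea on the previous slide.

```python
from itertools import combinations

def genrules(supports, large, antecedent, minconf, rules):
    """Test a => (large - a) for every (m-1)-subset a of `antecedent`,
    recursing only into subsets whose rule holds."""
    m = len(antecedent)
    for a in combinations(antecedent, m - 1):
        conf = supports[large] / supports[a]
        if conf >= minconf:
            rules.add((a, tuple(i for i in large if i not in a)))
            if m - 1 > 1:
                genrules(supports, large, a, minconf, rules)

def simple_rules(supports, large, minconf):
    rules = set()  # a set, because a subset can be reached by several paths
    genrules(supports, large, large, minconf, rules)
    return rules

# Made-up support counts for the large itemset ABC and its subsets (assumption).
supports = {
    ("A",): 6, ("B",): 5, ("C",): 4,
    ("A", "B"): 4, ("A", "C"): 3, ("B", "C"): 4,
    ("A", "B", "C"): 3,
}
for antecedent, consequent in sorted(simple_rules(supports, ("A", "B", "C"), 0.75)):
    print(antecedent, "=>", consequent)
```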
25 Simple Algorithm
26Example of simple algorithm
- Conditions
- Large itemset: ABCDE
- Only two rules with one item in the consequent hold: ACDE ⇒ B, ABCE ⇒ D
- For genrules(ABCDE, ACDE), it will test the following rules by computing the confidence of each one:
- ACD ⇒ BE, ADE ⇒ BC, CDE ⇒ BA, ACE ⇒ BD
27Fast algorithm: ideas behind
- If a rule (L-c) ⇒ c holds, then all rules of the form (L-c') ⇒ c' also hold, where c' is a non-empty subset of c.
- E.g. if AB ⇒ CD holds, then ABC ⇒ D and ABD ⇒ C must also hold (since conf(AB ⇒ CD) ≤ conf(ABC ⇒ D)).
28Fast algorithm: Basic steps
- 1. Generate all rules with one item in the consequent.
- 2. Use the consequents of these rules and the apriori-gen function to generate all possible consequents with two items.
- 3. Generate the rules with the results from step 2 as possible consequents (computing the confidence of each candidate rule).
- 4. Go to step 2, but generate all possible consequents with three (four, five, etc.) items.
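The four steps can be sketched in Python as a loop over growing consequents. As before, this is a sketch under assumptions rather than the paper's pseudocode: itemsets are sorted tuples, `supports` holds made-up counts, `fast_rules` is a hypothetical name, and the confidence threshold is treated as inclusive.

```python
from itertools import combinations

def apriori_gen(prev):
    """Join + prune over consequents, all sorted tuples of the same length."""
    prev = set(prev)
    if not prev:
        return set()
    k = len(next(iter(prev))) + 1
    joined = {p + (q[-1],) for p in prev for q in prev
              if p[:-1] == q[:-1] and p[-1] < q[-1]}
    return {c for c in joined if all(s in prev for s in combinations(c, k - 1))}

def fast_rules(supports, large, minconf):
    def antecedent_of(consequent):
        return tuple(i for i in large if i not in consequent)

    def holds(consequent):
        return supports[large] / supports[antecedent_of(consequent)] >= minconf

    rules = []
    # Step 1: rules with a one-item consequent.
    consequents = {(i,) for i in large if holds((i,))}
    while consequents:
        rules.extend((antecedent_of(h), h) for h in sorted(consequents))
        # Stop once a larger consequent would leave the antecedent empty.
        if len(next(iter(consequents))) + 1 >= len(large):
            break
        # Steps 2-4: grow consequents by one item, keep those whose rule holds.
        consequents = {h for h in apriori_gen(consequents) if holds(h)}
    return rules

# Made-up support counts for the large itemset ABC and its subsets (assumption).
supports = {
    ("A",): 6, ("B",): 5, ("C",): 4,
    ("A", "B"): 4, ("A", "C"): 3, ("B", "C"): 4,
    ("A", "B", "C"): 3,
}
print(fast_rules(supports, ("A", "B", "C"), 0.75))
print(fast_rules(supports, ("A", "B", "C"), 0.8))
```

With the higher threshold only B survives as a one-item consequent, so no two-item consequent is generated and no further confidence is computed.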
29Fast algorithm
30Example of fast algorithm
- Here we use the same conditions as in the example for the simple algorithm.
- Large itemset: ABCDE
- Two rules with one item in the consequent: ACDE ⇒ B, ABCE ⇒ D
- Hm = {B, D}
- Hm+1 = {BD}
- So ACE ⇒ BD is the only rule with a two-item consequent that can possibly hold.
31Comparison of simple and fast algorithm
- Large itemset: ABCDE
- Two rules with one item in the consequent: ACDE ⇒ B, ABCE ⇒ D
- For genrules(ABCDE, ACDE), the simple algorithm will test the following rules:
- ACD ⇒ BE, ADE ⇒ BC, CDE ⇒ BA, ACE ⇒ BD
- But the first three rules cannot hold; only ACE ⇒ BD can possibly hold.
- E.g. if ACD ⇒ BE held, then ABCD ⇒ E would also have to hold, but it does not.
- The situation is similar for genrules(ABCDE, ABCE).
32Summary for simple/fast algorithm
- The fast algorithm generates fewer candidate rules, keeping only those that could actually hold; some candidate rules that the simple algorithm tests are ruled out in advance by the fast algorithm.
33Good News
- Now we have seen almost all the algorithms, so no more pseudocode!
34Performance AnalysisSynthetic data
35Performance AnalysisSynthetic data
36Performance Analysis Reality check-Retail
sales data
- The data consists of the sales transactions from
one store over a short period of time
37Performance Analysis Reality check-Mail
order data
- The data consists of two datasets: the items ordered by a customer in a single mail order, and all the items ordered by a customer across all orders.
38Comparison of Apriori/AprioriTid in different
passes over same dataset
39Why is AprioriTid better in later passes?
- In the later passes, the number of candidate itemsets shrinks, but Apriori still examines every transaction in the database, while AprioriTid only scans the encoded set C̄_{k-1}, which by this time has become smaller than the database.
40Algorithm AprioriHybrid
- Based on the observations from the previous slides, the AprioriHybrid algorithm is introduced.
- It uses Apriori in the initial passes and switches to AprioriTid when it expects that the set C̄_k at the end of the pass will fit in memory.
- The experimental results show that AprioriHybrid performs better in most cases.
41Conclusion
- Two new algorithms, Apriori and AprioriTid, are presented for discovering all important association rules in a large database of transactions.
- In the experiments, the new algorithms always perform better than the old ones (AIS and SETM).
- The best features of the new algorithms can be combined into a hybrid algorithm (AprioriHybrid), which is feasible for real applications involving very large databases.