Fast Algorithm for Mining Association Rules

1
Fast Algorithm for Mining Association Rules
  • Rakesh Agrawal, Ramakrishnan Srikant
  • Presenter: Zhiwei Zhan

2
Overview
  • Introduction
  • Problem Decomposition
  • Old and New Algorithms
  • Performance Analysis
  • Conclusion

3
Introduction
  • Data mining is motivated by the decision-support
    problems of large retail organizations.
  • Massive amounts of sales data are referred to as
    basket data.
  • Purpose: mine association rules over basket
    data.
  • An example of a rule: 98% of customers that
    purchase tires and auto accessories also get
    automotive services done.

4
Introduction
  • Formal statement in the paper:
  • I: the set of items
  • D: a set of transactions, each a subset of I
  • Association rule: an implication X => Y, where X
    and Y are disjoint subsets of I
  • X => Y has support s in D if s% of transactions
    in D contain X ∪ Y
  • X => Y has confidence c if c% of transactions in
    D that contain X also contain Y

5
Introduction
  • Example of support and confidence
  • If we have 100 transactions, 80 of which contain
    diaper, and 50 of which contain both beer and
    diaper
  • Support(diaper => beer) = 50/100 = 50%
  • Confidence(diaper => beer) = support(diaper ∪
    beer)/support(diaper) = 50/80 = 62.5%
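The arithmetic above can be checked in a couple of lines (a sketch; the variable names are mine):

```python
# Support/confidence for diaper => beer in the 100-transaction example.
n_transactions = 100
n_diaper = 80            # transactions containing diaper
n_diaper_and_beer = 50   # transactions containing both diaper and beer

support = n_diaper_and_beer / n_transactions   # fraction containing both items
confidence = n_diaper_and_beer / n_diaper      # of those with diaper, fraction with beer

print(support, confidence)   # 0.5 0.625
```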

6
Introduction
  • The problem of mining association rules is to
    generate all association rules whose support and
    confidence are greater than minsup and minconf.
  • minsup, minconf: the user-specified minimum
    support and confidence

7
Problem Decomposition: Phase One
  • Find the itemsets whose transaction support is
    above the minimum support; these are the large
    itemsets.
  • The support for an itemset is the number of
    transactions that contain the itemset.

8
Problem Decomposition: Phase Two
  • Use the large itemsets to generate the desired
    rules.
  • E.g. if ABCD and AB are large itemsets, then we
    can determine whether the rule AB => CD holds by
    computing its confidence = supp(ABCD)/supp(AB);
    if confidence > minconf, the rule holds.

9
Phase One
  • Discovering Large Itemsets

10
Basic intuition
  • Any subset of a large itemset must be large.
    E.g. if {A, B} is a large itemset, then {A} and
    {B} must also be large.
  • So, candidate itemsets having k items can be
    generated by joining large itemsets having k-1
    items, and deleting those that contain any subset
    that is not large.
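The join-and-prune intuition above can be sketched in Python (itemsets represented as sorted tuples; this is my sketch of the apriori-gen idea, not the paper's exact pseudocode):

```python
from itertools import combinations

def apriori_gen(L_prev):
    """Candidate generation: join large (k-1)-itemsets that share their
    first k-2 items, then prune candidates with a non-large subset."""
    L_prev = set(L_prev)
    k_minus_1 = len(next(iter(L_prev)))
    candidates = set()
    for a in L_prev:
        for b in L_prev:
            # Join step: identical prefixes, a's last item before b's.
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                c = a + (b[-1],)
                # Prune step: every (k-1)-subset of c must be large.
                if all(s in L_prev for s in combinations(c, k_minus_1)):
                    candidates.add(c)
    return candidates

L3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}
print(apriori_gen(L3))   # {(1, 2, 3, 4)} -- (1,3,4,5) is pruned
```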

11
Algorithm Apriori: Notation

12
Algorithm Apriori
13
Algorithm Apriori-gen
14
AIS/SETMs candidate generation
  • AIS/SETM
  • In pass k of the algorithm, a transaction t is
    read, and candidates are generated by extending
    the large (k-1)-itemsets contained in t with
    large items of t, where each extending item
  • is not already in the itemset being extended,
    and
  • occurs later in the lexicographic ordering than
    any of the items in that itemset (for numbers,
    n+1 occurs later than n)

15
Example of AIS/SETMs candidate generation
  • L3 = {{1,2,3}, {1,2,4}, {1,3,4}, {1,3,5},
    {2,3,4}}
  • t = {1,2,3,4,5}
  • From the above, we generate the candidate large
    itemsets as below:
  • {1,2,3} -> {1,2,3,4}, {1,2,3,5}
  • {1,2,4} -> {1,2,4,5}
  • {1,3,4} -> {1,3,4,5}
  • {1,3,5} -> none
  • {2,3,4} -> {2,3,4,5}
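A sketch of this extension scheme (itemsets as sorted tuples; it assumes every single item of t is large, as in this example, and the function name is mine):

```python
def ais_setm_extend(L_prev, t):
    """Extend each large (k-1)-itemset contained in t with items of t
    that occur later in the ordering than the itemset's last item."""
    t = sorted(t)
    candidates = []
    for l in sorted(L_prev):
        if set(l).issubset(t):
            for item in t:
                if item > l[-1]:            # later in lexicographic order
                    candidates.append(l + (item,))
    return candidates

L3 = [(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)]
print(ais_setm_extend(L3, (1, 2, 3, 4, 5)))
# the five candidates from this slide
```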

16
Apriori-gen vs AIS/SETM
  • In Apriori-gen, only one candidate itemset is
    considered in the 4th pass: {1,2,3,4}
  • In AIS/SETM, five candidate itemsets are
    considered (as shown on the previous slide)

17
Algorithm AprioriTid
18
Algorithm AprioriTid-Example
19
Important feature of AprioriTid
  • The database is not used at all for counting the
    support of candidate itemsets after the first
    pass. Rather, an encoding of the candidate
    itemsets used in the previous pass is employed
    for this purpose (the set C̄_k).
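A minimal sketch of this encoding idea, assuming the previous pass's encoding is stored as a map from TID to the set of (k-1)-itemsets that transaction contains (the representation and names are mine; the paper's pseudocode differs in detail):

```python
from collections import defaultdict

def count_with_cbar(c_bar_prev, candidates):
    """Count candidate k-itemsets from the previous pass's encoding
    instead of rescanning the database, and build the next encoding."""
    counts = defaultdict(int)
    c_bar = {}
    for tid, prev_sets in c_bar_prev.items():
        contained = set()
        for c in candidates:                  # c: sorted tuple of k items
            # t contains c iff it contains both (k-1)-itemsets that
            # generated c in the join step.
            if c[:-1] in prev_sets and c[:-2] + c[-1:] in prev_sets:
                counts[c] += 1
                contained.add(c)
        if contained:               # transactions with no candidates drop
            c_bar[tid] = contained  # out, so the encoding keeps shrinking
    return counts, c_bar

# Toy database: the pass-1 encoding maps each TID to its 1-itemsets.
c_bar_1 = {100: {(1,), (3,), (4,)}, 200: {(2,), (3,), (5,)},
           300: {(1,), (2,), (3,), (5,)}, 400: {(2,), (5,)}}
C2 = [(1, 2), (1, 3), (1, 5), (2, 3), (2, 5), (3, 5)]
counts, c_bar_2 = count_with_cbar(c_bar_1, C2)
print(dict(counts))
```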

20
A summary: Apriori/AprioriTid vs AIS/SETM
  • Apriori and AprioriTid differ fundamentally from
    AIS and SETM in which candidate itemsets are
    counted in a pass and in how those itemsets are
    generated.
  • As a result, Apriori/AprioriTid generate fewer
    candidate itemsets (only those that could
    possibly be real large itemsets) than AIS/SETM.

21
Phase Two
  • Discovering Rules

22
Discovering rules
  • Two algorithms will be introduced
  • 1. The simple algorithm
  • 2. The fast algorithm

23
Simple Algorithm: ideas behind
  • If a subset a of a large itemset L does not
    generate a rule, then there is no need to
    consider generating rules from the subsets of a.
  • E.g. if ABC => D does not hold, then
  • AB => CD does not hold either (since
    support(ABC) <= support(AB), so
    conf(ABC => D) >= conf(AB => CD))

24
Simple Algorithm: Basic Steps
  • 1. For every large itemset L, find all non-empty
    subsets of L (in a recursive depth-first
    fashion).
  • 2. For every such subset a, output the rule
    a => (L - a) if conf > minconf, where
    conf = support(L)/support(a)
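The two steps above can be sketched as a recursive function (itemsets as frozensets, `support` a dict of support counts; this representation is mine, and the confidence test uses >= here):

```python
from itertools import combinations

def genrules(L, support, minconf, a=None, rules=None):
    """Simple algorithm: try every non-empty proper subset a of L as an
    antecedent of a => (L - a), depth-first.  If a rule fails, the
    subsets of its antecedent are never tried (the pruning idea from
    the previous slide)."""
    L = frozenset(L)
    if rules is None:
        rules = set()
    if a is None:
        a = L
    if len(a) <= 1:
        return rules
    for sub in combinations(sorted(a), len(a) - 1):
        sub = frozenset(sub)
        conf = support[L] / support[sub]     # conf = support(L)/support(a)
        if conf >= minconf:
            rules.add((sub, L - sub))
            genrules(L, support, minconf, sub, rules)
    return rules

support = {frozenset(s): n for s, n in
           [('ABC', 2), ('AB', 2), ('AC', 3), ('BC', 2),
            ('A', 4), ('B', 2), ('C', 3)]}
rules = genrules('ABC', support, minconf=0.8)
# AB => C, BC => A, B => AC hold; AC => B fails, so its subsets are skipped
```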

25
Simple Algorithm
26
Example of simple algorithm
  • Conditions:
  • Large itemset: ABCDE
  • Two rules hold: ACDE => B, ABCE => D
  • For genrules(ABCDE, ACDE), it will test the
    following rules by computing the confidence of
    each one:
  • ACD => BE, ADE => BC, CDE => AB, ACE => BD

27
Fast algorithm: ideas behind
  • If a rule (L - c) => c holds, then all rules of
    the form (L - c') => c' also hold, where c' is a
    non-empty subset of c.
  • E.g. if AB => CD holds, then ABC => D and
    ABD => C must also hold (since
    conf(AB => CD) <= conf(ABC => D))

28
Fast algorithm: Basic steps
  • 1. Generate all rules with one item in the
    consequent.
  • 2. Use the consequents of these rules and the
    apriori-gen function to generate all possible
    consequents with two items.
  • 3. Generate the rules with the results from step
    2 as possible consequents (computing the
    confidence of each candidate rule).
  • 4. Go back to step 2, but generate all possible
    consequents with three (then four, five, etc.)
    items.
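The steps above can be sketched as follows (itemsets as frozensets, `support` a dict of support counts; the representation is mine, and the apriori-gen join on consequents is written as an equivalent all-subsets check):

```python
from itertools import combinations

def ap_genrules(L, support, minconf):
    """Fast algorithm: grow candidate consequents level by level from
    the consequents of the rules that already hold."""
    L = frozenset(L)
    # Step 1: all rules with a single item in the consequent.
    H = [frozenset([i]) for i in L
         if support[L] / support[L - {i}] >= minconf]
    rules = [(L - h, h) for h in H]
    m = 1
    while H and m + 1 < len(L):
        # Step 2: candidate (m+1)-item consequents whose m-item subsets
        # are all consequents of rules that hold.
        items = sorted(set().union(*H))
        H_next = []
        for c in combinations(items, m + 1):
            c = frozenset(c)
            if all(frozenset(s) in H for s in combinations(sorted(c), m)):
                # Step 3: keep the consequent only if its rule holds.
                if support[L] / support[L - c] >= minconf:
                    H_next.append(c)
                    rules.append((L - c, c))
        # Step 4: repeat with larger consequents.
        H, m = H_next, m + 1
    return rules

# Supports chosen (hypothetically) so that ACDE => B and ABCE => D hold.
support = {frozenset(s): n for s, n in
           [('ABCDE', 4), ('ACDE', 5), ('ABCE', 5), ('ABDE', 10),
            ('ABCD', 10), ('BCDE', 10), ('ACE', 5)]}
rules = ap_genrules('ABCDE', support, minconf=0.8)
# only ACE => BD is tested at the next level, and it holds
```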

29
Fast algorithm
30
Example of fast algorithm
  • Here we use the same conditions as in the
    example for the simple algorithm!
  • Large itemset: ABCDE
  • Two rules with one item in the consequent:
    ACDE => B, ABCE => D
  • H1 = {B, D}
  • H2 = apriori-gen(H1) = {BD}
  • So, ACE => BD will be the only rule that can
    possibly hold.

31
Comparison of simple and fast algorithm
  • Large itemset: ABCDE
  • Two rules with one item in the consequent:
    ACDE => B, ABCE => D
  • For genrules(ABCDE, ACDE), it will test the
    following rules:
  • ACD => BE, ADE => BC, CDE => AB, ACE => BD
  • But the first three rules are guaranteed not to
    hold; only ACE => BD can possibly hold.
  • E.g. if ACD => BE held, then ABCD => E would
    also have to hold, but actually it does not.
  • The situation is similar for
    genrules(ABCDE, ABCE).

32
Summary for simple/fast algorithm
  • The fast algorithm generates fewer candidate
    association rules that could be real rules:
    some candidate rules tested by the simple
    algorithm are known in advance to be useless
    and are skipped by the fast algorithm.

33
Good News
  • Now we have seen almost all the algorithms, so
    no more pseudo code!

34
Performance Analysis: Synthetic Data
35
Performance Analysis: Synthetic Data
36
Performance Analysis: Reality Check - Retail Sales Data
  • The data consists of the sales transactions from
    one store over a short period of time.

37
Performance Analysis: Reality Check - Mail Order Data
  • The data consists of the items ordered by a
    customer in a single mail order, as well as all
    the items ordered by a customer across all
    orders.

38
Comparison of Apriori/AprioriTid in different
passes over the same dataset
39
Why is AprioriTid better in later passes?
  • In the later passes, the number of candidate
    itemsets shrinks, but Apriori still examines
    every transaction in the database, while
    AprioriTid only scans C̄_k, which by this time
    has become smaller than the database.

40
Algorithm AprioriHybrid
  • Based on the observations from the previous
    slides, the AprioriHybrid algorithm is
    introduced.
  • It uses Apriori in the initial passes and
    switches to AprioriTid when it expects that the
    set C̄_k at the end of the pass will fit in
    memory.
  • The experimental results show that AprioriHybrid
    performs better in most cases.

41
Conclusion
  • Two new algorithms, Apriori and AprioriTid, for
    discovering all important association rules in a
    large database of transactions.
  • The new algorithms always perform better than
    the old ones.
  • The best features of the new algorithms can be
    combined into a hybrid algorithm (AprioriHybrid)
    that is highly feasible for real applications
    involving very large databases.