Fast Algorithm for Mining Association Rules - PowerPoint PPT Presentation

Slides: 42
Provided by: joey57

1
Fast Algorithm for Mining Association Rules
  • Rakesh Agrawal, Ramakrishnan Srikant
  • Presenter: Zhiwei Zhan

2
Overview
  • Introduction
  • Problem Decomposition
  • Old and new Algorithms
  • Performance Analysis
  • Conclusion

3
Introduction
  • Data mining is motivated by the decision-support
    problems of large retail organizations.
  • The massive amounts of sales data collected are
    referred to as basket data
  • Purpose: mine association rules over basket
    data
  • An example of such a rule: 98% of customers that
    purchase tires and auto accessories also get
    automotive services done.

4
Introduction
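  • Formal statement in the paper
  • $I = \{i_1, i_2, \ldots, i_m\}$: a set of items
  • $D$: a set of transactions, where each
    transaction $T \subseteq I$
  • Association rule: $X \Rightarrow Y$, where
    $X \subset I$, $Y \subset I$, and
    $X \cap Y = \emptyset$
  • $X \Rightarrow Y$ has support $s$ in $D$ if $s\%$
    of transactions in $D$ contain $X \cup Y$
  • $X \Rightarrow Y$ has confidence $c$ if $c\%$ of
    transactions in $D$ that contain $X$ also
    contain $Y$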

5
Introduction
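  • Example of support and confidence
  • Suppose we have 100 transactions, 80 of which
    contain diapers and 50 of which contain both
    beer and diapers
  • $\mathrm{support}(\text{diaper} \Rightarrow \text{beer}) = 50/100 = 50\%$
  • $\mathrm{confidence}(\text{diaper} \Rightarrow \text{beer}) = \mathrm{support}(\text{diaper} \cup \text{beer}) / \mathrm{support}(\text{diaper}) = 50/80 = 62.5\%$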
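
The arithmetic on this slide can be checked with a minimal Python sketch. The function names `support` and `confidence` and the filler item "milk" are illustrative assumptions; only the counts 100, 80, and 50 come from the slide.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = frozenset(antecedent) | frozenset(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# 100 made-up transactions matching the slide's counts:
# 80 contain diaper, and 50 of those also contain beer.
transactions = (
    [frozenset({"diaper", "beer"})] * 50
    + [frozenset({"diaper"})] * 30
    + [frozenset({"milk"})] * 20  # "milk" is a hypothetical filler item
)

rule_support = support(transactions, {"diaper", "beer"})
rule_confidence = confidence(transactions, {"diaper"}, {"beer"})
```

Here `rule_support` corresponds to 50/100 and `rule_confidence` to 50/80.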

6
Introduction
  • The problem of mining association rules is to
    generate all association rules whose support and
    confidence are greater than minsup and minconf,
    respectively
  • minsup and minconf: the user-specified minimum
    support and minimum confidence

7
Problem Decomposition: Phase One
  • Find all itemsets whose transaction support is
    above the minimum support; these are called
    large itemsets.
  • The support for an itemset is the number of
    transactions that contain the itemset.

8
Problem Decomposition: Phase Two
  • Use the large itemsets to generate the desired
    rules.
  • E.g., if ABCD and AB are large itemsets, then we
    can determine whether the rule $AB \Rightarrow CD$
    holds by computing its confidence
    $= \mathrm{supp}(ABCD)/\mathrm{supp}(AB)$; if
    confidence $>$ minconf, then the rule holds.

9
Phase One
  • Discovering Large Itemsets

10
Basic intuition
  • Any subset of a large itemset must be large.
    E.g., if {A, B} is a large itemset, then {A} and
    {B} must also be large.
  • So, candidate itemsets having k items can be
    generated by joining large itemsets having k-1
    items, and deleting those that contain any subset
    that is not large.
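
The join-and-delete idea above can be sketched in Python as follows. This is an illustrative sketch, not the pseudocode from the missing slides; the name `apriori_gen` and the representation of itemsets as sorted tuples are assumptions.

```python
from itertools import combinations


def apriori_gen(large_prev):
    """Build k-item candidates from the large (k-1)-itemsets."""
    large_prev = {tuple(sorted(s)) for s in large_prev}
    if not large_prev:
        return set()
    k = len(next(iter(large_prev))) + 1
    # Join step: merge two itemsets that agree on all but their last item.
    joined = {
        p + (q[-1],)
        for p in large_prev
        for q in large_prev
        if p[:-1] == q[:-1] and p[-1] < q[-1]
    }
    # Prune step: delete candidates with a (k-1)-subset that is not large.
    return {
        c for c in joined
        if all(sub in large_prev for sub in combinations(c, k - 1))
    }


L3 = [{1, 2, 3}, {1, 2, 4}, {1, 3, 4}, {1, 3, 5}, {2, 3, 4}]
C4 = apriori_gen(L3)
```

For this input the join step produces (1, 2, 3, 4) and (1, 3, 4, 5); the prune step then deletes (1, 3, 4, 5) because its subset (1, 4, 5) is not large.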

11
Algorithm Apriori: Notation

12
Algorithm Apriori
13
Algorithm Apriori-gen
14
AIS/SETM candidate generation
  • AIS/SETM
  • In pass k of the algorithm, a transaction t is
    read, and some large items in t are added to the
    elements of $L_{k-1}$ contained in t
  • An added item is not yet in the element of
    $L_{k-1}$ being extended.
  • An added item occurs later in the lexicographic
    ordering than any of the items in that element
    (for numbers, n+1 occurs later than n)

15
Example of AIS/SETM candidate generation
  • $L_3$ = {{1,2,3}, {1,2,4}, {1,3,4}, {1,3,5}, {2,3,4}}
  • t = {1,2,3,4,5}
  • From the above, we generate the candidate
    itemsets below
  • {1,2,3} → {1,2,3,4}, {1,2,3,5}
  • {1,2,4} → {1,2,4,5}
  • {1,3,4} → {1,3,4,5}
  • {1,3,5} → none
  • {2,3,4} → {2,3,4,5}
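
The extension step in this example can be reproduced with a short Python sketch. The function name `ais_extend` is hypothetical, and the code assumes that all five items in the transaction are large on their own, which the slide implies but does not state.

```python
def ais_extend(large_prev, transaction, large_items):
    """Extend each large itemset found in the transaction, AIS/SETM style.

    Returns a dict mapping each matching large itemset to the candidates
    obtained by appending one later large item from the transaction.
    """
    t = set(transaction)
    candidates = {}
    for itemset in large_prev:
        base = tuple(sorted(itemset))
        if not set(base) <= t:
            continue  # this large itemset is not present in the transaction
        # Only items that are large, occur in t, and come later in the
        # ordering than every item already in the itemset may be appended.
        candidates[base] = [
            base + (item,)
            for item in sorted(t)
            if item in large_items and item > base[-1]
        ]
    return candidates


L3 = [{1, 2, 3}, {1, 2, 4}, {1, 3, 4}, {1, 3, 5}, {2, 3, 4}]
t = {1, 2, 3, 4, 5}
# Assumption: every item in t is a large item.
extended = ais_extend(L3, t, large_items={1, 2, 3, 4, 5})
total_candidates = sum(len(v) for v in extended.values())
```

Unlike a join-and-prune generator, this step never checks whether the subsets of a new candidate are large, so it yields five candidates from this single transaction.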

16
Apriori-gen vs AIS/SETM
  • With apriori-gen, only one candidate itemset is
    considered in the 4th pass: {1,2,3,4}
  • With AIS/SETM, five candidate itemsets are
    considered (as shown on the previous slide)

17
Algorithm AprioriTid
18
Algorithm AprioriTid: Example
19
Important feature of AprioriTid
  • The database is not used at all for counting the
    support of candidate itemsets after the first
    pass. Rather, an encoding of the candidate
    itemsets used in the previous pass is employed
    for this purpose (the set $\overline{C}_{k-1}$).
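
A rough Python sketch of this feature follows. It is written from the description above rather than from the slide's pseudocode, which did not survive in the transcript. The names `apriori_tid`, `c_bar`, and `min_count` (the smallest transaction count that still qualifies as large) are assumptions, and the four-transaction database is made up for illustration.

```python
from itertools import combinations


def apriori_gen(large_prev):
    """Join large (k-1)-itemsets, then prune candidates with a non-large subset."""
    large_prev = set(large_prev)
    if not large_prev:
        return set()
    k = len(next(iter(large_prev))) + 1
    joined = {p + (q[-1],) for p in large_prev for q in large_prev
              if p[:-1] == q[:-1] and p[-1] < q[-1]}
    return {c for c in joined
            if all(s in large_prev for s in combinations(c, k - 1))}


def apriori_tid(database, min_count):
    """database: dict TID -> set of items. Returns large itemset -> count."""
    # Pass 1 is the only pass that reads the database itself.
    counts = {}
    for items in database.values():
        for i in items:
            counts[(i,)] = counts.get((i,), 0) + 1
    large = {c: n for c, n in counts.items() if n >= min_count}
    result = dict(large)
    # Encoding for pass 1: each transaction as the set of its 1-itemsets.
    c_bar = {tid: {(i,) for i in items} for tid, items in database.items()}
    while large:
        candidates = apriori_gen(large)
        counts = dict.fromkeys(candidates, 0)
        next_c_bar = {}
        for tid, present in c_bar.items():
            # A candidate is in the transaction if the two (k-1)-itemsets
            # it was joined from were both present in the previous pass.
            found = {c for c in candidates
                     if c[:-1] in present and c[:-2] + c[-1:] in present}
            for c in found:
                counts[c] += 1
            if found:  # transactions with no candidates drop out
                next_c_bar[tid] = found
        large = {c: n for c, n in counts.items() if n >= min_count}
        result.update(large)
        c_bar = next_c_bar
    return result


db = {100: {1, 3, 4}, 200: {2, 3, 5}, 300: {1, 2, 3, 5}, 400: {2, 5}}
large_itemsets = apriori_tid(db, min_count=2)
```

Note that `db` is touched only in the first pass; every later pass iterates over `c_bar`, which shrinks as transactions stop containing any candidate.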

20
A summary: Apriori/AprioriTid vs. AIS/SETM
  • Apriori and AprioriTid differ fundamentally from
    AIS and SETM in which candidate itemsets are
    counted in a pass and in how those candidates
    are generated.
  • As a result, Apriori/AprioriTid generate fewer
    candidate itemsets than AIS/SETM, keeping only
    those that could possibly be large.

21
Phase Two
  • Discovering Rules

22
Discovering rules
  • Two algorithms will be introduced
  • 1. The simple algorithm
  • 2. The fast algorithm

23
Simple Algorithm: ideas behind
  • If a subset a of a large itemset L does not
    generate a rule, then there is no need to
    consider generating rules from subsets of a
  • E.g., if $ABC \Rightarrow D$ does not hold, then
    $AB \Rightarrow CD$ does not hold either: since
    $\mathrm{support}(ABC) \leq \mathrm{support}(AB)$,
    we have $\mathrm{conf}(ABC \Rightarrow D) \geq \mathrm{conf}(AB \Rightarrow CD)$

24
Simple Algorithm: Basic Steps
  • 1. For every large itemset L, find all non-empty
    subsets of L (in a recursive depth-first
    fashion).
  • 2. For every such subset a, output the rule
    $a \Rightarrow (L - a)$ if conf $>$ minconf, where
    conf $= \mathrm{support}(L)/\mathrm{support}(a)$
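
The two steps above can be sketched in Python as a recursive procedure. This is a sketch under assumptions, not the slide's pseudocode: the signature of `genrules`, the tuple representation of itemsets, and the support counts in the example are all made up for illustration. The recursion only descends into subsets of an antecedent that produced a rule.

```python
from itertools import combinations


def genrules(large, support, minconf, antecedent=None, rules=None):
    """Collect rules a => (large - a) whose confidence exceeds minconf.

    large is a sorted tuple; support maps sorted tuples to support counts.
    """
    if antecedent is None:
        antecedent = large
    if rules is None:
        rules = set()  # a set, since a subset can be reached by several paths
    for a in combinations(antecedent, len(antecedent) - 1):
        if not a:
            continue  # the empty antecedent is not a rule
        conf = support[large] / support[a]
        if conf > minconf:
            consequent = tuple(i for i in large if i not in a)
            rules.add((a, consequent))
            if len(a) > 1:
                # Depth-first: only subsets of a successful antecedent
                # are explored further.
                genrules(large, support, minconf, a, rules)
    return rules


# Hypothetical support counts for the large itemset ABC and its subsets.
support = {
    ("A", "B", "C"): 2,
    ("A", "B"): 2, ("A", "C"): 3, ("B", "C"): 4,
    ("A",): 4, ("B",): 4, ("C",): 5,
}
rules = genrules(("A", "B", "C"), support, minconf=0.6)
```

With these counts, the rules AB ⇒ C (confidence 2/2) and AC ⇒ B (confidence 2/3) are output, while BC ⇒ A (confidence 2/4) fails, so the subsets of BC are never explored from that branch.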

25
Simple Algorithm
26
Example of simple algorithm
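  • Conditions
  • Large itemset: ABCDE
  • Two rules with one item in the consequent hold:
    $ACDE \Rightarrow B$ and $ABCE \Rightarrow D$
  • For genrules(ABCDE, ACDE), it will test the
    following rules by computing the confidence of
    each one
  • $ACD \Rightarrow BE$, $ADE \Rightarrow BC$, $CDE \Rightarrow AB$, $ACE \Rightarrow BD$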

27
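Fast algorithm: ideas behind
  • If a rule $(L - c) \Rightarrow c$ holds, then all
    rules of the form $(L - \tilde{c}) \Rightarrow \tilde{c}$
    also hold, where $\tilde{c}$ is a non-empty
    subset of $c$.
  • E.g., if $AB \Rightarrow CD$ holds, then
    $ABC \Rightarrow D$ and $ABD \Rightarrow C$ must
    also hold (since $\mathrm{conf}(AB \Rightarrow CD) \leq \mathrm{conf}(ABC \Rightarrow D)$)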

28
Fast algorithm: Basic steps
  • 1. Generate all rules with one item in the
    consequent.
  • 2. Use the consequents of these rules and the
    apriori-gen function to generate all possible
    consequents with two items.
  • 3. Generate the rules with the results from step
    2 as possible consequents (computing the
    confidence of each candidate rule).
  • 4. Repeat from step 2 with the consequents of
    the rules that held, generating all possible
    consequents with three (four, five, etc.) items.
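
The four steps above can be sketched in Python as a loop over growing consequents. This is an illustrative sketch, not the slide's pseudocode: the names `fast_rules` and `apriori_gen` are assumptions, and the support counts are made up so that exactly two one-item-consequent rules hold for the large itemset ABCDE.

```python
from itertools import combinations


def apriori_gen(prev):
    """Join (k-1)-item sets, then prune those with a subset missing from prev."""
    prev = set(prev)
    if not prev:
        return set()
    k = len(next(iter(prev))) + 1
    joined = {p + (q[-1],) for p in prev for q in prev
              if p[:-1] == q[:-1] and p[-1] < q[-1]}
    return {c for c in joined if all(s in prev for s in combinations(c, k - 1))}


def fast_rules(large, support, minconf):
    """Return rules (antecedent, consequent) for one large itemset."""
    def holds(consequent):
        antecedent = tuple(i for i in large if i not in consequent)
        return antecedent, support[large] / support[antecedent] > minconf

    rules = []
    # Step 1: rules with a single item in the consequent.
    consequents = set()
    for item in large:
        antecedent, ok = holds((item,))
        if antecedent and ok:
            rules.append((antecedent, (item,)))
            consequents.add((item,))
    # Steps 2-4: grow consequents one item at a time, keeping only those
    # built from consequents of rules that held.
    while consequents and len(large) > len(next(iter(consequents))) + 1:
        kept = set()
        for h in apriori_gen(consequents):
            antecedent, ok = holds(h)
            if ok:
                rules.append((antecedent, h))
                kept.add(h)
        consequents = kept
    return rules


L = tuple("ABCDE")
# Hypothetical counts; only the itemsets the sketch looks up are listed.
support = {
    L: 6,
    tuple("ACDE"): 8, tuple("ABCE"): 8,
    tuple("ABCD"): 20, tuple("ABDE"): 20, tuple("BCDE"): 20,
    tuple("ACE"): 10,
}
rules = fast_rules(L, support, minconf=0.5)
```

The support table deliberately omits ACD, ADE, and CDE: since `support` is a plain dict, the run would raise a `KeyError` if the sketch ever tried to test a rule with one of those antecedents.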

29
Fast algorithm
30
Example of fast algorithm
  • Here we use the same conditions as in the
    example for the simple algorithm
  • Large itemset: ABCDE
  • Two rules with one item in the consequent hold:
    $ACDE \Rightarrow B$ and $ABCE \Rightarrow D$
  • $H_m = \{B, D\}$
  • $H_{m+1} = \{BD\}$
  • So $ACE \Rightarrow BD$ is the only rule with a
    two-item consequent that can possibly hold.

31
Comparison of simple and fast algorithm
  • Large itemset: ABCDE
  • Two rules with one item in the consequent hold:
    $ACDE \Rightarrow B$ and $ABCE \Rightarrow D$
  • For genrules(ABCDE, ACDE), the simple algorithm
    tests the following rules
  • $ACD \Rightarrow BE$, $ADE \Rightarrow BC$, $CDE \Rightarrow AB$, $ACE \Rightarrow BD$
  • But the first three rules certainly do not hold!
    Only $ACE \Rightarrow BD$ can possibly hold.
  • E.g., if $ACD \Rightarrow BE$ held, then
    $ABCD \Rightarrow E$ would also have to hold, but
    it does not.
  • The situation is similar for
    genrules(ABCDE, ABCE)

32
Summary for simple/fast algorithm
  • The fast algorithm generates fewer candidate
    rules, keeping only those that could actually
    hold; some candidate rules that the simple
    algorithm tests are ruled out in advance by the
    fast algorithm.

33
Good News
  • Now we have seen almost all the algorithms, so no
    more pseudocode!

34
Performance Analysis: Synthetic data
35
Performance Analysis: Synthetic data
36
Performance Analysis: Reality check (retail sales data)
  • The data consists of the sales transactions from
    one store over a short period of time

37
Performance Analysis: Reality check (mail order data)
  • Two datasets: in one, a transaction consists of
    the items ordered by a customer in a single mail
    order; in the other, of all the items ordered by
    a customer across all orders

38
Comparison of Apriori/AprioriTid in different passes over the same dataset
39
Why is AprioriTid better in later passes?
  • In the later passes, the number of candidate
    itemsets decreases, but Apriori still examines
    every transaction in the database, while
    AprioriTid only scans $\overline{C}_k$, which by
    this time has become smaller than the database.

40
Algorithm AprioriHybrid
  • Based on the observations from the previous
    slides, the AprioriHybrid algorithm is introduced
  • It uses Apriori in the initial passes and
    switches to AprioriTid when it expects that the
    set $\overline{C}_k$ at the end of the pass will
    fit in memory.
  • The experimental results show that AprioriHybrid
    performs better in most cases.

41
Conclusion
  • Two new algorithms, Apriori and AprioriTid, are
    presented for discovering all important
    association rules in a large database of
    transactions.
  • In the experiments, the new algorithms always
    perform better than the old ones (AIS and SETM).
  • The best features of the new algorithms can be
    combined into a hybrid algorithm (AprioriHybrid),
    which is well suited to real applications
    involving very large databases.