Transcript and Presenter's Notes

Title: David Corne and Nick Taylor, Heriot-Watt University, dwcorne@gmail.com


1
Data Mining (and machine learning)
  • DM Lecture 4: The Apriori Algorithm

2
Reading
The main technical material in this lecture (the
Apriori algorithm and its variants) is based on
"Fast Algorithms for Mining Association Rules",
by Rakesh Agrawal and Ramakrishnan Srikant,
IBM Almaden Research Center. The PDF is on my
teaching resources page.
3
Basket data
  • A very common type of data, often also called
    transaction data.
  • The next slide shows an example transaction
    database, where each record represents a
    transaction between (usually) a customer and a
    shop. Each record in a supermarket's transaction
    DB, for example, corresponds to a basket of
    specific items.

4
[Table: the example transaction database of 20 records;
each record is an ID plus a basket of items drawn from
apples, beer, cheese, dates, eggs, fish, glue, honey,
ice-cream.]
5
Numbers
  • Our example transaction DB has 20 records of
    supermarket transactions, from a supermarket that
    only sells 9 things.
  • One month in a large supermarket with five stores
    spread around a reasonably sized city might
    easily yield a DB of 20,000,000 baskets, each
    containing a set of products from a pool of
    around 1,000.

6
Discovering Rules: a common and useful application
of data mining
  • A rule is something like this:
  • "If a basket contains apples and cheese, then it
    also contains beer."
  • Any such rule has two associated measures:
  • confidence: when the "if" part is true, how
    often is the "then" part true? This is the same
    as accuracy.
  • coverage or support: how much of the database
    contains the "if" part?

7
Example
  • What are the confidence and coverage of:
  • "If the basket contains beer and cheese, then it
    also contains honey"?

2/20 of the records contain both beer and cheese,
so coverage is 10%.
Of these 2, 1 contains honey, so confidence is 50%.
A Python sketch of this calculation follows.
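As an illustration (not from the original slides), here is a
minimal Python sketch that computes the coverage and confidence of
an "if X then Y" rule over a list of baskets; the function name and
the toy data are my own.

    # Coverage and confidence of the rule "if X then Y",
    # where X and Y are sets of items and each basket is a set.
    def rule_stats(baskets, if_items, then_items):
        if_items, then_items = set(if_items), set(then_items)
        n_if = sum(1 for b in baskets if if_items <= b)
        n_both = sum(1 for b in baskets if (if_items | then_items) <= b)
        coverage = n_if / len(baskets)
        confidence = n_both / n_if if n_if else 0.0
        return coverage, confidence

    # A hypothetical 20-record stand-in for the slide's DB:
    baskets = [{"beer", "cheese", "honey"}, {"beer", "cheese"}] + [{"apples"}] * 18
    print(rule_stats(baskets, {"beer", "cheese"}, {"honey"}))  # (0.1, 0.5)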
8
[Table: the same 20-record transaction database,
repeated for reference.]
9
Interesting/Useful rules
  • Statistically, anything that is interesting is
    something that happens significantly more often
    than you would expect by chance.
  • E.g. basic statistical analysis of basket data
    may show that 10% of baskets contain bread, and
    4% of baskets contain washing-up powder. I.e. if
    you choose a basket at random:
  • There is a probability of 0.1 that it contains
    bread.
  • There is a probability of 0.04 that it contains
    washing-up powder.

10
Bread and washing-up powder
  • What is the probability of a basket containing
    both bread and washing-up powder? The laws of
    probability say:
  • If these two things are independent, the chance
    is 0.1 × 0.04 = 0.004.
  • That is, we would expect 0.4% of baskets to
    contain both bread and washing-up powder.

11
Interesting means surprising
  • We therefore have a prior expectation that just
    4 in 1,000 baskets should contain both bread and
    washing-up powder.
  • If we investigate, and discover that really it
    is 20 in 1,000 baskets, then we will be very
    surprised. It tells us that:
  • Something is going on in shoppers' minds: bread
    and washing-up powder are connected in some way.
  • There may be ways to exploit this discovery: put
    the powder and bread at opposite ends of the
    supermarket?

12
Finding surprising rules
  • Suppose we ask: what is the most surprising rule
    in this database? This would be, presumably, a
    rule whose accuracy is more different from its
    expected accuracy than any other's. But it also
    has to have a suitable level of coverage, or else
    it may be just a statistical blip, and/or
    unexploitable.
  • Looking only at rules of the form
  • "if the basket contains X and Y, then it also
    contains Z",
  • our realistic numbers tell us that there may be
    around 500,000,000 distinct possible rules. For
    each of these we need to work out its accuracy
    and coverage by trawling through a database of
    around 20,000,000 basket records: roughly 10^16
    operations (see the back-of-envelope check below).
  • Yes, it's easy to use simple statistics to work
    out the confidence and coverage of a given rule.
    But interesting DM is all about searching
    through, somehow, 500,000,000 (or usually
    immensely more) rules to sniff out what may be
    the interesting ones.
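To see where those numbers come from, here is a back-of-envelope
check in Python (my own sketch, assuming a pool of 1,000 items as
the slide suggests):

    from math import comb

    n_items = 1_000
    n_records = 20_000_000

    # Rules "if X and Y then Z": choose the unordered pair {X, Y},
    # then any distinct third item Z.
    n_rules = comb(n_items, 2) * (n_items - 2)
    print(n_rules)              # 498501000 -- about 5 * 10**8
    print(n_rules * n_records)  # about 10**16 naive basket checks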

13
Here are some interesting ones in our mini basket
DB
  • If a basket contains glue, then it also contains
    either beer or eggs
  • confidence 100%, coverage 25%
  • If a basket contains apples and dates, then it
    also contains honey
  • confidence 100%, coverage 20%

14
The Apriori Algorithm
  • There is nothing very special or clever about
    this algorithm; however, it is simple, fast, and
    very good at finding interesting rules of a
    specific kind in baskets or other transaction
    data. It is used a lot in the R&D departments of
    retailers in industry (or by consultancies who do
    work for them).
  • But note that we will now talk about itemsets
    instead of rules. Also, the coverage of a rule is
    the same as the support of an itemset.
  • Don't get confused!

15
Find rules in two stages
  • Agrawal and colleagues divided the problem of
    finding good rules into two phases:
  • 1. Find all itemsets with a specified minimal
    support (coverage). An itemset is just a
    specific set of items, e.g. {apples, cheese}. The
    Apriori algorithm can efficiently find all
    itemsets whose coverage is above a given minimum.
  • 2. Use these itemsets to help generate interesting
    rules. Having done stage 1, we have considerably
    narrowed down the possibilities, and can do
    reasonably fast processing of the large itemsets
    to generate candidate rules.

16
Terminology
  • k-itemset: a set of k items. E.g.
  • {beer, cheese, eggs} is a 3-itemset
  • {cheese} is a 1-itemset
  • {honey, ice-cream} is a 2-itemset
  • support: an itemset has support s if s% of the
    records in the DB contain that itemset (a code
    sketch follows).
  • minimum support: the Apriori algorithm starts
    with the specification of a minimum level of
    support, and will focus on itemsets with this
    level or above.
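A minimal Python sketch of the support computation (the function
name is my own):

    def support(baskets, itemset):
        # Percentage of baskets containing every item in itemset.
        itemset = set(itemset)
        return 100.0 * sum(1 for b in baskets if itemset <= b) / len(baskets)

On the earlier hypothetical mini-DB, support(baskets, {"beer", "cheese"})
would return 10.0.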

17
Terminology
  • large itemset: doesn't mean an itemset with many
    items. It means one whose support is at least the
    minimum support.
  • Lk: the set of all large k-itemsets in the DB.
  • Ck: a set of candidate large k-itemsets. The
    algorithm we will look at generates this set,
    which contains all the k-itemsets that might be
    large, and then eventually generates the set
    above.

18
Terminology
  • Sets: let A be a set (A = {cat, dog}),
  • let B be a set (B = {dog, eel, rat}),
  • and let C = {eel, rat}.
  • I use A ∪ B to mean A union B.
  • So A ∪ B = {cat, dog, eel, rat}.
  • When X is a subset of Y, I use Y - X to mean
    the set of things in Y which are not in X. E.g.
  • B - C = {dog}
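The same example in Python's set notation (just a sketch mirroring
the slide):

    A = {"cat", "dog"}
    B = {"dog", "eel", "rat"}
    C = {"eel", "rat"}

    print(A | B)  # union: {'cat', 'dog', 'eel', 'rat'} (order may vary)
    print(B - C)  # set difference: {'dog'}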

19
[Table: the same transaction database, with the items
abbreviated to a, b, c, d, e, f, g, h, i.]
E.g. the 3-itemset {a, b, h} has support 15%;
the 2-itemset {a, i} has support 0%; the 4-itemset
{b, c, d, h} has support 5%. If minimum support
is 10%, then {b} is a large itemset, but {b, c,
d, h} is a small itemset!
20
The Apriori algorithm for finding large itemsets
efficiently in big DBs
  • 1  Find all large 1-itemsets
  • 2  For (k = 2; while Lk-1 is non-empty; k++)
  • 3    Ck = apriori-gen(Lk-1)
  • 4    For each c in Ck, initialise c.count
       to zero
  • 5    For all records r in the DB
  • 6      Cr = subset(Ck, r); for each c in Cr,
       c.count++
  • 7    Set Lk = all c in Ck whose count
       >= minsup
  • 8  /* end */ -- return all of the Lk
       sets.

21
Explaining the Apriori Algorithm
  • 1  Find all large 1-itemsets
  • To start off, we simply find all of the large
    1-itemsets. This is done by a basic
    scan of the DB. We take each item in turn, and
    count the number of times that item appears in a
    basket. In our running example, suppose minimum
    support were 60%; then the only large 1-itemsets
    would be {a}, {b}, {c}, {d} and {f}. So we get
  • L1 = { {a}, {b}, {c}, {d}, {f} }

22
Explaining the Apriori Algorithm
  • 1  Find all large 1-itemsets
  • 2  For (k = 2; while Lk-1 is non-empty; k++)
  • We already have L1. This next bit just means
    that the remainder of the algorithm generates L2,
    L3, and so on until we get to an Lk that's
    empty.
  • How these are generated is like this:

23
Explaining the Apriori Algorithm
  • 1  Find all large 1-itemsets
  • 2  For (k = 2; while Lk-1 is non-empty; k++)
  • 3    Ck = apriori-gen(Lk-1)
  • Given the large (k-1)-itemsets, this step
    generates some candidate k-itemsets that might be
    large. Because of how apriori-gen works, the set
    Ck is guaranteed to contain all the large
    k-itemsets, but it also contains some that will
    turn out not to be large.

24
Explaining the Apriori Algorithm
  • 1  Find all large 1-itemsets
  • 2  For (k = 2; while Lk-1 is non-empty; k++)
  • 3    Ck = apriori-gen(Lk-1)
  • 4    For each c in Ck, initialise c.count to
       zero
  • We are going to work out the support for each of
    the candidate k-itemsets in Ck, by working out
    how many times each of these itemsets appears in
    a record in the DB. This step starts us off by
    initialising these counts to zero.

25
Explaining the Apriori Algorithm
  • 1  Find all large 1-itemsets
  • 2  For (k = 2; while Lk-1 is non-empty; k++)
  • 3    Ck = apriori-gen(Lk-1)
  • 4    For each c in Ck, initialise c.count to
       zero
  • 5    For all records r in the DB
  • 6      Cr = subset(Ck, r); for each c in Cr,
       c.count++
  • We now take each record r in the DB and do this:
    get all the candidate k-itemsets from Ck that
    are contained in r. For each of these, update its
    count.

26
Explaining the Apriori Algorithm
  • 1  Find all large 1-itemsets
  • 2  For (k = 2; while Lk-1 is non-empty; k++)
  • 3    Ck = apriori-gen(Lk-1)
  • 4    For each c in Ck, initialise c.count to
       zero
  • 5    For all records r in the DB
  • 6      Cr = subset(Ck, r); for each c in Cr,
       c.count++
  • 7    Set Lk = all c in Ck whose count
       >= minsup
  • Now we have the count for every candidate. Those
    whose count is big enough are valid large
    itemsets of the right size. We therefore now have
    Lk. We now go back into the for loop of line 2
    and start working towards finding Lk+1.

27
Explaining the Apriori Algorithm
  • 1  Find all large 1-itemsets
  • 2  For (k = 2; while Lk-1 is non-empty; k++)
  • 3    Ck = apriori-gen(Lk-1)
  • 4    For each c in Ck, initialise c.count to
       zero
  • 5    For all records r in the DB
  • 6      Cr = subset(Ck, r); for each c in Cr,
       c.count++
  • 7    Set Lk = all c in Ck whose count
       >= minsup
  • 8  /* end */ -- return all of the Lk
       sets.
  • We finish at the point where we get an empty Lk.
    The algorithm returns all of the (non-empty) Lk
    sets, which gives us an excellent start in
    finding interesting rules (although the large
    itemsets themselves will usually be very
    interesting and useful). A Python sketch of the
    whole loop follows.
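Pulling the eight lines together, here is a minimal runnable Python
sketch of the main loop. The naming is my own; itemsets are
frozensets so they can be counted in a dictionary, minsup is an
absolute count, and apriori_gen is the join-and-prune helper
sketched after the prune-step slide below.

    from collections import defaultdict

    def apriori(baskets, minsup):
        # Line 1: find all large 1-itemsets by a basic scan of the DB.
        counts = defaultdict(int)
        for r in baskets:
            for item in r:
                counts[frozenset([item])] += 1
        L = {c for c, n in counts.items() if n >= minsup}
        large = set(L)
        k = 2
        while L:                                  # line 2
            Ck = apriori_gen(L, k)                # line 3: candidate k-itemsets
            counts = {c: 0 for c in Ck}           # line 4: zero the counters
            for r in baskets:                     # lines 5-6: count occurrences
                for c in Ck:
                    if c <= r:
                        counts[c] += 1
            L = {c for c, n in counts.items() if n >= minsup}  # line 7
            large |= L
            k += 1
        return large                              # line 8: all the Lk sets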

28
apriori-gen notes
  • Suppose we have worked out that the large
    2-itemsets are:
  • L2 = { {milk, noodles}, {milk, tights},
    {noodles, quorn} }
  • apriori-gen now generates 3-itemsets that all may
    be large.
  • An obvious way to do this would be to generate
    all of the possible 3-itemsets that you can make
    from {milk, noodles, tights, quorn}.


  • But this would include, for
    example, {milk, tights, quorn}. Now, if this
    really was a large 3-itemset, that would mean the
    number of records containing all three is >=
    minsup;
  • this means it would have to be true that the
    number of records containing {tights, quorn} is
    >= minsup. But it can't be, because this is not
    one of the large 2-itemsets.

29
apriori-gen: the join step
  • apriori-gen is clever in generating not too many
    candidate large itemsets, while making sure not
    to lose any that do turn out to be large.
  • To explain it, we need to note that there is
    always an ordering of the items. We will assume
    alphabetical order, and that the data structures
    used always keep members of a set in alphabetical
    order. a < b will mean that a comes before b in
    alphabetical order.
  • Suppose we have Lk and wish to generate Ck+1.
  • First we take every distinct pair of sets in
    Lk,
  • {a1, a2, ..., ak} and {b1, b2, ..., bk}, and do
    this:
  • in all cases where
  • {a1, a2, ..., ak-1} = {b1, b2, ..., bk-1} and
    ak < bk, {a1, a2, ..., ak, bk} is a candidate
    (k+1)-itemset.

30
An illustration of that
  • Suppose the 2-itemsets are:
  • L2 = { {milk, noodles}, {milk, tights},
    {noodles, quorn},
  • {noodles, peas}, {noodles, tights} }
  • The pairs that satisfy this condition,
  • {a1, a2, ..., ak-1} = {b1, b2, ..., bk-1} and
    ak < bk,
  • are: {milk, noodles} with {milk, tights};
    {noodles, peas} with {noodles, quorn};
  • {noodles, peas} with {noodles,
    tights}; {noodles, quorn} with {noodles, tights}
  • So the candidate 3-itemsets are:
  • {milk, noodles, tights}, {noodles, peas,
    quorn},
  • {noodles, peas, tights}, {noodles, quorn,
    tights}

31
apriori-gen: the prune step
  • Now we have some candidate (k+1)-itemsets, and
    are guaranteed to have all of the ones that
    possibly could be large, but we have the chance
    to maybe prune out some more before we enter the
    next stage of Apriori, which counts their support.
  • In the prune step, we take the candidate
    (k+1)-itemsets we have, and remove any for which
    some k-subset of it is not a large k-itemset.
    Such an itemset couldn't possibly be a large
    (k+1)-itemset.
  • E.g. in the current example, we have (n =
    noodles, etc.):
  • L2 = { {milk, n}, {milk, tights}, {n, quorn},
    {n, peas}, {n, tights} }
  • And the candidate (k+1)-itemsets so far: {m, n, t},
    {n, p, q}, {n, p, t}, {n, q, t}
  • Now, {p, q} is not a large 2-itemset, so {n, p, q}
    is pruned.
  • {p, t} is not a large 2-itemset, so {n, p, t} is
    pruned.
  • {q, t} is not a large 2-itemset, so {n, q, t} is
    pruned.
  • After this we finally have C3 = { {milk, noodles,
    tights} }. A Python sketch of the join and prune
    steps follows.
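Here is a matching sketch of apriori_gen (my own code, assuming
items are sortable so the alphabetical-ordering trick applies). The
join step merges pairs of (k-1)-itemsets that agree on everything
but their last item; the prune step drops candidates with any
(k-1)-subset outside the large set.

    from itertools import combinations

    def apriori_gen(Lprev, k):
        # Lprev: set of frozensets, the large (k-1)-itemsets.
        candidates = set()
        prev = sorted(tuple(sorted(s)) for s in Lprev)
        # Join step: a and b share their first k-2 items; since prev is
        # sorted, a's last item precedes b's, so a + b's last item is a
        # candidate (k)-itemset.
        for a, b in combinations(prev, 2):
            if a[:-1] == b[:-1]:
                candidates.add(frozenset(a + (b[-1],)))
        # Prune step: every (k-1)-subset of a candidate must be large.
        return {c for c in candidates
                if all(frozenset(sub) in Lprev
                       for sub in combinations(sorted(c), k - 1))}

Running apriori_gen on the slide's L2 (with items as strings)
reproduces C3 = { {milk, noodles, tights} }.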

32
Appendix

33
A full run-through of Apriori
[Table: the transaction database D for this example:
20 records over the items a, b, c, d, e, f, g.]
We will assume this is our transaction database D,
and we will assume minsup is 4 (20%).
This will not be run through in the lecture; it
is here to help with revision.
34
First we find all the large 1-itemsets, i.e., in
this case, all the 1-itemsets that are contained
by at least 4 records in the DB. In this example,
that's all of them. So L1 = { {a}, {b}, {c},
{d}, {e}, {f}, {g} }. Now we set k = 2 and run
apriori-gen to generate C2. The join step when
k = 2 just gives us the set of all
alphabetically ordered pairs from L1, and we
cannot prune any away, so we have C2 = { {a, b},
{a, c}, {a, d}, {a, e}, {a, f}, {a, g}, {b, c},
{b, d}, {b, e}, {b, f}, {b, g}, {c, d},
{c, e}, {c, f}, {c, g}, {d, e}, {d, f},
{d, g}, {e, f}, {e, g}, {f, g} }
35
So we have C2 = { {a, b}, {a, c}, {a, d}, {a, e},
{a, f}, {a, g}, {b, c}, {b, d}, {b, e}, {b, f},
{b, g}, {c, d}, {c, e}, {c, f}, {c, g}, {d, e},
{d, f}, {d, g}, {e, f}, {e, g}, {f, g} }. Line
4 of the Apriori algorithm now tells us to set a
counter for each of these to 0. Line 5 now
prepares us to take each record in the DB in
turn, and find which of those in C2 are contained
in it. The first record, r1, is {a, b, d, g}.
Those of C2 it contains are {a, b}, {a, d},
{a, g}, {b, d}, {b, g} and {d, g}.
Hence Cr1 = { {a, b}, {a, d}, {a, g}, {b, d},
{b, g}, {d, g} }, and the rest
of line 6 tells us to increment the counters of
these itemsets. The second record, r2, is {c,
d, e}; Cr2 = { {c, d}, {c, e}, {d, e} }, and
we increment the counters for these three
itemsets. After all 20 records, we look
at the counters, and in this case we find
that the itemsets with counters >= minsup (4)
are {a, c}, {a, d}, {c, d}, {c, e} and {c, f}.
So L2 = { {a, c}, {a, d}, {c, d}, {c, e}, {c, f} }
36
So we have L2 = { {a, c}, {a, d}, {c, d}, {c, e},
{c, f} }. We now set k = 3 and run apriori-gen on
L2. The join step finds the following pairs
that meet the required pattern: {a, c} with {a, d};
{c, d} with {c, e}; {c, d} with {c, f}; {c, e}
with {c, f}. This leads to the candidate 3-itemsets
{a, c, d}, {c, d, e}, {c, d, f} and {c, e, f}. We
prune {c, d, e} since {d, e} is not in L2. We
prune {c, d, f} since {d, f} is not in L2. We
prune {c, e, f} since {e, f} is not in L2. We are
left with C3 = { {a, c, d} }. We now run lines 5-7
to count how many records contain {a, c, d}. The
count is 4, so L3 = { {a, c, d} }
37
So we have L3 = { {a, c, d} }. We now set k = 4,
but when we run apriori-gen on L3 we get the empty
set, and hence eventually we find that L4 is empty.
This means we now finish, and return the set of all
of the non-empty Lk sets; these are all of the
large itemsets. Result: { {a}, {b}, {c}, {d}, {e},
{f}, {g}, {a, c}, {a, d}, {c, d}, {c, e}, {c,
f}, {a, c, d} }. Each large itemset is
intrinsically interesting, and may be of business
value. Simple rule-generation algorithms can now
use the large itemsets as a starting point.
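For completeness, here is a quick hypothetical run of the two
sketched functions on a toy database (my own data, not the slides'
20-record DB, which isn't reproduced in this transcript):

    baskets = [set("abdg"), set("cde"), set("acd"), set("acd"),
               set("acd"), set("acdf"), set("ce"), set("cf"),
               set("ce"), set("cf"), set("ab"), set("ad")]
    for itemset in sorted(apriori(baskets, minsup=4), key=sorted):
        print(sorted(itemset))
    # The large itemsets of this toy data turn out to be
    # {a}, {c}, {d}, {a, c}, {a, d}, {c, d} and {a, c, d}.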