David Corne and Nick Taylor, Heriot-Watt University, dwcorne@gmail.com

Slides: 38

Provided by: macs6


1
Data Mining (and machine learning)

DM Lecture 4: The A Priori Algorithm

2
Reading
The main technical material (the Apriori
algorithm and its variants) in this lecture is
based on "Fast Algorithms for Mining Association
Rules", by Rakesh Agrawal and Ramakrishnan Srikant,
IBM Almaden Research Center. The PDF is on my
teaching resources page.
3
Basket data

A very common type of data, often also called
transaction data.
The next slide shows an example transaction database,
where each record represents a transaction
between (usually) a customer and a shop. Each
record in a supermarket's transaction DB, for
example, corresponds to a basket of specific
items.

4
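[Table: example transaction database with columns ID, apples, beer,
cheese, dates, eggs, fish, glue, honey, ice-cream; the rows are not
preserved in this transcript.]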
5
Numbers

Our example transaction DB has 20 records of
supermarket transactions, from a supermarket that
only sells 9 things.
One month in a large supermarket with five stores
spread around a reasonably sized city might
easily yield a DB of 20,000,000 baskets, each
containing a set of products from a pool of
around 1,000.

6
Discovering Rules: a common and useful application
of data mining

A rule is something like this:
If a basket contains apples and cheese, then it
also contains beer.
Any such rule has two associated measures:
confidence: when the "if" part is true, how
often is the "then" bit true? This is the same as
accuracy.
coverage (or support): how much of the database
contains the "if" part?
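
The two measures can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `rule_measures` and the four toy baskets are made up here, and are not the lecture's 20-record database.

```python
def rule_measures(baskets, if_items, then_items):
    """Return (confidence, coverage) of the rule if_items -> then_items, as fractions."""
    if_items, then_items = set(if_items), set(then_items)
    # Baskets in which the "if" part is true.
    matching = [b for b in baskets if if_items <= b]
    coverage = len(matching) / len(baskets)
    if not matching:
        return None, coverage  # confidence is undefined when no basket matches
    # Of those, the baskets in which the "then" part is also true.
    hits = sum(1 for b in matching if then_items <= b)
    return hits / len(matching), coverage


# Hypothetical toy baskets (assumption, not the lecture's data).
toy_baskets = [
    {"apples", "cheese", "beer"},
    {"apples", "cheese"},
    {"beer", "eggs"},
    {"apples", "cheese", "beer", "honey"},
]
confidence, coverage = rule_measures(toy_baskets, {"apples", "cheese"}, {"beer"})
print(confidence, coverage)
```

Here 3 of the 4 toy baskets contain apples and cheese (coverage 75%), and 2 of those 3 also contain beer (confidence about 67%).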

7
Example

What are the confidence and coverage of:
If the basket contains beer and cheese, then it
also contains honey

2/20 of the records contain both beer and cheese,
so coverage is 10%.
Of these 2, 1 contains honey, so confidence is 50%.
8
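[Table: the example transaction database again (columns ID, apples,
beer, cheese, dates, eggs, fish, glue, honey, ice-cream); the rows are
not preserved in this transcript.]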
9
Interesting/Useful rules

Statistically, anything that is interesting is
something that happens significantly more than
you would expect by chance.
E.g. basic statistical analysis of basket data
may show that 10% of baskets contain bread, and
4% of baskets contain washing-up powder. I.e., if
you choose a basket at random:
There is a probability 0.1 that it contains
bread.
There is a probability 0.04 that it contains
washing-up powder.

10
Bread and washing-up powder

What is the probability of a basket containing
both bread and washing-up powder? The laws of
probability say:
If these two things are independent, the chance is
0.1 × 0.04 = 0.004.
That is, we would expect 0.4% of baskets to
contain both bread and washing-up powder.

11
Interesting means surprising

We therefore have a prior expectation that just
4 in 1,000 baskets should contain both bread and
washing-up powder.
If we investigate, and discover that really it
is 20 in 1,000 baskets, then we will be very
surprised. It tells us that:
Something is going on in shoppers' minds: bread
and washing-up powder are connected in some way.
There may be ways to exploit this discovery: put
the powder and bread at opposite ends of the
supermarket?
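
The bread and washing-up powder arithmetic can be checked directly. The sketch below uses only the figures from the slides; the variable names are assumptions. The ratio of observed to expected co-occurrence is often called "lift" in the association-rule literature, a term these slides do not use.

```python
import math

p_bread = 0.10    # 10% of baskets contain bread
p_powder = 0.04   # 4% of baskets contain washing-up powder

# Expected joint probability if the two items were independent.
expected_both = p_bread * p_powder
# Observed joint frequency from the slide: 20 in 1,000 baskets.
observed_both = 20 / 1000

# How many times more often the pair occurs than independence predicts.
surprise_ratio = observed_both / expected_both
print(f"expected {expected_both:.3f}, observed {observed_both:.3f}, "
      f"ratio {surprise_ratio:.1f}")
```

A ratio near 1 would mean nothing surprising is going on; here the pair turns up about five times more often than chance alone predicts.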

12
Finding surprising rules

Suppose we ask: what is the most surprising rule
in this database? This would be, presumably, a
rule whose accuracy is more different from its
expected accuracy than any others. But it also
has to have a suitable level of coverage, or else
it may be just a statistical blip, and/or
unexploitable.
Looking only at rules of the form
if basket contains X and Y, then it also
contains Z
our realistic numbers tell us that there may be
around 500,000,000 distinct possible rules. For
each of these we need to work out its accuracy
and coverage, by trawling through a database of
around 20,000,000 basket records: c. 10^16
operations.
Yes, it's easy to use simple statistics to work
out the confidence and coverage of a given rule.
But interesting DM is all about searching
through, somehow, 500,000,000 (or usually
immensely more) rules to sniff out what may be
the interesting ones.
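
The size of this search space can be reproduced with a quick calculation. The counting convention below is an assumption (X and Y form an unordered pair, Z is any third item, and each rule costs one pass over the DB); the slides give only the rounded totals.

```python
from math import comb

n_items = 1_000          # size of the product pool
n_baskets = 20_000_000   # records in the DB

# Choose the unordered pair {X, Y}, then any one of the remaining items as Z.
n_rules = comb(n_items, 2) * (n_items - 2)
# One look at every basket for every rule.
n_checks = n_rules * n_baskets

print(f"{n_rules:,} rules, about {n_checks:.1e} basket checks")
```

Under these assumptions there are just under 500 million rules and roughly 10^16 basket checks, in line with the figures quoted above.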

13
Here are some interesting ones in our mini basket
DB

If a basket contains glue, then it also contains
either beer or eggs
confidence 100%, coverage 25%
If a basket contains apples and dates, then it
also contains honey
confidence 100%, coverage 20%

14
The A Priori Algorithm

There is nothing very special or clever about
this algorithm; however, it is simple, fast, and
very good at finding interesting rules of a
specific kind in baskets or other transaction
data. It is used a lot in the R&D Depts of
retailers in industry (or by consultancies who do
work for them).
But note that we will now talk about itemsets
instead of rules. Also, the coverage of a rule is
the same as the support of an itemset.
Don't get confused!

15
Find rules in two stages

Agrawal and colleagues divided the problem of
finding good rules into two phases:
1. Find all itemsets with a specified minimal
support (coverage). An itemset is just a
specific set of items, e.g. {apples, cheese}. The
Apriori algorithm can efficiently find all
itemsets whose coverage is at least a given minimum.
2. Use these itemsets to help generate interesting
rules. Having done stage 1, we have considerably
narrowed down the possibilities, and can do
reasonably fast processing of the large itemsets
to generate candidate rules.

16
Terminology

k-itemset: a set of k items. E.g.
{beer, cheese, eggs} is a 3-itemset
{cheese} is a 1-itemset
{honey, ice-cream} is a 2-itemset
support: an itemset has support s% if s% of the
records in the DB contain that itemset.
minimum support: the Apriori algorithm starts
with the specification of a minimum level of
support, and will focus on itemsets with this
level or above.

17
Terminology

large itemset: doesn't mean an itemset with many
items. It means one whose support is at least
minimum support.
Lk: the set of all large k-itemsets in the DB.
Ck: a set of candidate large k-itemsets. The
algorithm we will look at generates this set,
which contains all the k-itemsets that might be
large, and then eventually generates the set
above (Lk).
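
As a rough illustration of "support" and "large itemset", here is a minimal Python sketch. The helper names `support` and `is_large` and the five-record DB are assumptions made up for this example.

```python
def support(baskets, itemset):
    """Percentage of baskets that contain every item in itemset."""
    itemset = set(itemset)
    containing = sum(1 for b in baskets if itemset <= b)
    return 100.0 * containing / len(baskets)


def is_large(baskets, itemset, min_support):
    """An itemset is 'large' when its support is at least min_support (in %)."""
    return support(baskets, itemset) >= min_support


# Hypothetical five-record DB (assumption, not the lecture's data).
toy_db = [
    {"beer", "cheese", "eggs"},
    {"beer", "cheese"},
    {"cheese"},
    {"honey", "ice-cream"},
    {"beer", "eggs"},
]
print(support(toy_db, {"beer", "cheese"}))           # 2 of the 5 baskets
print(is_large(toy_db, {"cheese"}, 50))              # 3 of 5 baskets, i.e. 60%
print(is_large(toy_db, {"honey", "ice-cream"}, 50))  # 1 of 5 baskets, i.e. 20%
```

Note that {cheese} is "large" here even though it has only one item, while the 2-itemset {honey, ice-cream} is not.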

18
Terminology

sets: Let A be a set (A = {cat, dog}) and
let B be a set (B = {dog, eel, rat})
and
let C = {eel, rat}.
I use A + B to mean A union B.
So A + B = {cat, dog, eel, rat}.
When X is a subset of Y, I use Y \ X to
mean
the set of things in Y which are not in
X. E.g.
B \ C = {dog}
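
For readers who want to try these set operations, Python's built-in `set` type has them directly. Python spells union as `|` and difference as `-`, which is Python's notation rather than the lecture's.

```python
A = {"cat", "dog"}
B = {"dog", "eel", "rat"}
C = {"eel", "rat"}

union_ab = A | B   # A union B
diff_bc = B - C    # the things in B which are not in C

print(sorted(union_ab))
print(sorted(diff_bc))
print(C <= B)      # subset test: is C a subset of B?
```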

19
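[Table: transaction database with columns ID, a, b, c, d, e, f,
g, h, i; the rows are not preserved in this transcript.]
E.g. 3-itemset {a, b, h} has support 15%;
2-itemset {a, i} has support 0%;
4-itemset {b, c, d, h} has support 5%.
If minimum support is 10%, then {b} is a large itemset,
but {b, c, d, h} is a small itemset!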
20
The Apriori algorithm for finding large itemsets
efficiently in big DBs

1 Find all large 1-itemsets
2 For (k = 2; while Lk-1 is non-empty; k++)
3    Ck = apriori-gen(Lk-1)
4    For each c in Ck, initialise c.count to zero
5    For all records r in the DB
6       Cr = subset(Ck, r); for each c in Cr, c.count++
7    Set Lk = all c in Ck whose count >= minsup
8 /* end */ Return all of the Lk sets.

21
Explaining the Apriori Algorithm

1 Find all large 1-itemsets
To start off, we simply find all of the large
1-itemsets. This is done by a basic
scan of the DB. We take each item in turn, and
count the number of times that item appears in a
basket. In our running example, suppose minimum
support was 60%; then the only large 1-itemsets
would be {a}, {b}, {c}, {d} and {f}. So we get
L1 = {{a}, {b}, {c}, {d}, {f}}
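
Line 1 could be sketched like this in Python, using one pass over the DB with a counter. The function name, the toy DB, and the choice to express minimum support as a record count (3 of 5 records, i.e. 60%) are all assumptions for illustration.

```python
from collections import Counter


def large_1_itemsets(baskets, minsup_count):
    """Return the large 1-itemsets as a set of frozensets, using one scan of the DB."""
    counts = Counter()
    for basket in baskets:
        # Baskets are sets, so each item is counted at most once per basket.
        counts.update(basket)
    return {frozenset([item]) for item, n in counts.items() if n >= minsup_count}


# Hypothetical five-record DB (assumption, not the lecture's running example).
toy_db = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c", "e"},
    {"b", "c"},
    {"a", "e"},
]
L1 = large_1_itemsets(toy_db, minsup_count=3)
print(sorted(sorted(s) for s in L1))
```

In this toy DB, item e appears in only 2 of the 5 baskets, so {e} is not large.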

22
Explaining the Apriori Algorithm

1 Find all large 1-itemsets
2 For (k = 2; while Lk-1 is non-empty; k++)
We already have L1. This next bit just means
that the remainder of the algorithm generates L2,
L3, and so on until we get to an Lk that's
empty.
How these are generated is like this:

23
Explaining the Apriori Algorithm

1 Find all large 1-itemsets
2 For (k = 2; while Lk-1 is non-empty; k++)
3    Ck = apriori-gen(Lk-1)
Given the large (k-1)-itemsets, this step
generates some candidate k-itemsets that might be
large. Because of how apriori-gen works, the set
Ck is guaranteed to contain all the large
k-itemsets, but also contains some that will turn
out not to be large.

24
Explaining the Apriori Algorithm

1 Find all large 1-itemsets
2 For (k = 2; while Lk-1 is non-empty; k++)
3    Ck = apriori-gen(Lk-1)
4    For each c in Ck, initialise c.count to zero
We are going to work out the support for each of
the candidate k-itemsets in Ck, by working out
how many times each of these itemsets appears in
a record in the DB. This step starts us off by
initialising these counts to zero.

25
Explaining the Apriori Algorithm

1 Find all large 1-itemsets
2 For (k = 2; while Lk-1 is non-empty; k++)
3    Ck = apriori-gen(Lk-1)
4    For each c in Ck, initialise c.count to zero
5    For all records r in the DB
6       Cr = subset(Ck, r); for each c in Cr, c.count++
We now take each record r in the DB and do this:
get all the candidate k-itemsets from Ck that
are contained in r. For each of these, update its
count.
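
The counting pass of lines 4 to 6 might look like this in Python. The slides do not say how subset(Ck, r) is implemented, so the plain filter used below is an assumption chosen for readability, not speed; the function name and the toy data are also made up.

```python
def count_candidates(baskets, candidates):
    """Zero a counter per candidate, then scan every record once (lines 4 to 6)."""
    counts = {c: 0 for c in candidates}          # line 4
    for r in baskets:                            # line 5
        Cr = [c for c in candidates if c <= r]   # line 6: subset(Ck, r)
        for c in Cr:
            counts[c] += 1                       # line 6: c.count++
    return counts


# Hypothetical candidates and records (assumption, for illustration only).
C2 = [frozenset(p) for p in (("a", "b"), ("a", "d"), ("c", "d"), ("c", "e"))]
toy_db = [{"a", "b", "d", "g"}, {"c", "d", "e"}, {"a", "d"}]

counts = count_candidates(toy_db, C2)
for c in C2:
    print(sorted(c), counts[c])
```

Candidates are stored as `frozenset` objects so that they can be used as dictionary keys.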

26
Explaining the Apriori Algorithm

1 Find all large 1-itemsets
2 For (k = 2; while Lk-1 is non-empty; k++)
3    Ck = apriori-gen(Lk-1)
4    For each c in Ck, initialise c.count to zero
5    For all records r in the DB
6       Cr = subset(Ck, r); for each c in Cr, c.count++
7    Set Lk = all c in Ck whose count >= minsup
Now we have the count for every candidate. Those
whose count is big enough are valid large
itemsets of the right size. We therefore now have
Lk. We now go back into the for loop of line 2
and start working towards finding Lk+1.

27
Explaining the Apriori Algorithm

1 Find all large 1-itemsets
2 For (k = 2; while Lk-1 is non-empty; k++)
3    Ck = apriori-gen(Lk-1)
4    For each c in Ck, initialise c.count to zero
5    For all records r in the DB
6       Cr = subset(Ck, r); for each c in Cr, c.count++
7    Set Lk = all c in Ck whose count >= minsup
8 /* end */ Return all of the Lk sets.
We finish at the point where we get an empty Lk.
The algorithm returns all of the (non-empty) Lk
sets, which gives us an excellent start in
finding interesting rules (although the large
itemsets themselves will usually be very
interesting and useful).
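
Putting the eight lines together, a compact Python sketch of the whole loop might look like the following. It is an illustration under assumptions, not the authors' implementation: minsup is given as a record count, itemsets are `frozenset` objects, the inner counting loop is a plain scan, and the toy DB is invented. The `apriori_gen` helper follows the join and prune steps described on the later slides.

```python
from itertools import combinations


def apriori_gen(L_prev, k):
    """Build candidate k-itemsets from the large (k-1)-itemsets (join, then prune)."""
    prev = sorted(tuple(sorted(s)) for s in L_prev)
    candidates = set()
    for a, b in combinations(prev, 2):
        # Join: same first k-2 items, and a's last item comes before b's last item.
        if a[:-1] == b[:-1] and a[-1] < b[-1]:
            cand = a + (b[-1],)
            # Prune: every (k-1)-subset of the candidate must itself be large.
            if all(frozenset(sub) in L_prev for sub in combinations(cand, k - 1)):
                candidates.add(frozenset(cand))
    return candidates


def apriori(baskets, minsup_count):
    """Return {k: Lk} for every non-empty Lk; minsup_count is a number of records."""
    items = {item for b in baskets for item in b}
    L = {frozenset([i]) for i in items
         if sum(1 for b in baskets if i in b) >= minsup_count}   # line 1
    result, k = {}, 1
    while L:                                                     # line 2
        result[k] = L
        k += 1
        Ck = apriori_gen(L, k)                                   # line 3
        counts = {c: 0 for c in Ck}                              # line 4
        for r in baskets:                                        # line 5
            for c in Ck:                                         # line 6
                if c <= r:
                    counts[c] += 1
        L = {c for c in Ck if counts[c] >= minsup_count}         # line 7
    return result                                                # line 8


# Hypothetical five-record DB (assumption, not the lecture's data).
toy_db = [
    {"a", "c", "d"},
    {"a", "c", "d"},
    {"a", "c"},
    {"c", "d", "e"},
    {"b", "e"},
]
result = apriori(toy_db, minsup_count=2)
for k, Lk in result.items():
    print(k, sorted(sorted(s) for s in Lk))
```

With a minimum count of 2, this toy DB yields four large 1-itemsets, three large 2-itemsets and one large 3-itemset, and the loop stops when L4 comes out empty.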

28
apriori-gen notes

Suppose we have worked out that the large
2-itemsets are
L2 = {{milk, noodles}, {milk, tights},
{noodles, quorn}}
apriori-gen now generates 3-itemsets that all may
be large.
An obvious way to do this would be to generate
all of the possible 3-itemsets that you can make
from {milk, noodles, tights, quorn}.
But this would include, for
example, {milk, tights, quorn}. Now, if this
really was a large 3-itemset, that would mean the
number of records containing all three is >=
minsup;
this means it would have to be true that the
number of records containing {tights, quorn} is
>= minsup. But it can't be, because this is not
one of the large 2-itemsets.

29
apriori-gen: the join step

apriori-gen is clever in generating not too many
candidate large itemsets, while making sure not to
lose any that do turn out to be large.
To explain it, we need to note that there is
always an ordering of the items. We will assume
alphabetical order, and that the data structures
used always keep members of a set in alphabetical
order. a < b will mean that a comes before b in
alphabetical order.
Suppose we have Lk and wish to generate Ck+1.
First we take every distinct pair of sets in
Lk,
{a1, a2, ..., ak} and {b1, b2, ..., bk}, and do
this:
in all cases where
{a1, a2, ..., ak-1} = {b1, b2, ..., bk-1}, and ak <
bk, {a1, a2, ..., ak, bk} is a candidate
(k+1)-itemset.

30
An illustration of that

Suppose the 2-itemsets are
L2 = {{milk, noodles}, {milk, tights},
{noodles, quorn},
{noodles, peas}, {noodles, tights}}
The pairs that satisfy this
{a1, a2, ..., ak-1} = {b1, b2, ..., bk-1}, and ak <
bk,
are:
{milk, noodles} and {milk, tights}
{noodles, peas} and {noodles, quorn}
{noodles, peas} and {noodles, tights}
{noodles, quorn} and {noodles, tights}
So the candidate 3-itemsets are
{milk, noodles, tights}, {noodles, peas,
quorn},
{noodles, peas, tights}, {noodles, quorn,
tights}
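
The join step on its own can be sketched in Python and run on the L2 from this illustration. The function name `join_step` and the use of sorted tuples to keep each itemset in alphabetical order are assumptions of this sketch.

```python
from itertools import combinations


def join_step(L_k):
    """Pair up itemsets in Lk that agree on all but their last item (join step only)."""
    # Sorted tuples keep the members of each itemset in alphabetical order.
    ordered = sorted(tuple(sorted(s)) for s in L_k)
    candidates = []
    for a, b in combinations(ordered, 2):
        if a[:-1] == b[:-1] and a[-1] < b[-1]:
            candidates.append(a + (b[-1],))
    return candidates


L2 = [
    {"milk", "noodles"}, {"milk", "tights"}, {"noodles", "quorn"},
    {"noodles", "peas"}, {"noodles", "tights"},
]
joined = join_step(L2)
for cand in joined:
    print(cand)
```

This produces the same four candidate 3-itemsets as the illustration above, before any pruning.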

31
apriori-gen: the prune step

Now we have some candidate (k+1)-itemsets, and are
guaranteed to have all of the ones that possibly
could be large, but we have the chance to maybe
prune out some more before we enter the next
stage of Apriori that counts their support.
In the prune step, we take the candidate
(k+1)-itemsets we have, and remove any for which some
k-subset of it is not a large k-itemset. Such
a candidate couldn't possibly be a large (k+1)-itemset.
E.g. in the current example, we have (n =
noodles, etc.)
L2 = {{milk, n}, {milk, tights}, {n, quorn},
{n, peas}, {n, tights}}
And candidate (k+1)-itemsets so far: {m, n, t},
{n, p, q}, {n, p, t}, {n, q, t}
Now, {p, q} is not a large 2-itemset, so {n, p, q} is
pruned.
{p, t} is not a large 2-itemset, so {n, p, t} is pruned.
{q, t} is not a large 2-itemset, so {n, q, t} is pruned.
After this we finally have C3 = {{milk, noodles,
tights}}
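
The prune step can be sketched separately in Python, taking the joined candidates as input. The function name `prune_step` is an assumption; the data is the example from this slide.

```python
from itertools import combinations


def prune_step(candidates, L_k):
    """Keep only the candidates whose every k-subset is a large k-itemset."""
    large = {frozenset(s) for s in L_k}
    kept = []
    for cand in candidates:
        k = len(cand) - 1
        if all(frozenset(sub) in large for sub in combinations(sorted(cand), k)):
            kept.append(cand)
    return kept


L2 = [
    {"milk", "noodles"}, {"milk", "tights"}, {"noodles", "quorn"},
    {"noodles", "peas"}, {"noodles", "tights"},
]
candidates = [
    {"milk", "noodles", "tights"}, {"noodles", "peas", "quorn"},
    {"noodles", "peas", "tights"}, {"noodles", "quorn", "tights"},
]
C3 = prune_step(candidates, L2)
print([sorted(c) for c in C3])
```

Only {milk, noodles, tights} survives, because each of the other three candidates has a 2-subset that is missing from L2.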

32
Appendix

33
A full run through of Apriori
[Table: transaction database D with columns ID, a, b, c, d, e, f,
g; the rows are not preserved in this transcript.]
We will assume this is our transaction database D
and we will assume minsup is 4 (20%).
This will not be run through in the lecture; it
is here to help with revision.
34
First we find all the large 1-itemsets. I.e., in
this case, all the 1-itemsets that are contained
by at least 4 records in the DB. In this example,
that's all of them. So, L1 = {{a}, {b}, {c},
{d}, {e}, {f}, {g}}. Now we set k = 2 and run
apriori-gen to generate C2. The join step when
k = 2 just gives us the set of all
alphabetically ordered pairs from L1, and we
cannot prune any away, so we have C2 = {{a, b},
{a, c}, {a, d}, {a, e}, {a, f}, {a, g}, {b, c},
{b, d}, {b, e}, {b, f}, {b, g}, {c, d},
{c, e}, {c, f}, {c, g}, {d, e}, {d, f},
{d, g}, {e, f}, {e, g}, {f, g}}
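
When k = 2 the candidate set is simply every alphabetically ordered pair of large items, which can be checked with `itertools.combinations`. The variable names here are assumptions.

```python
from itertools import combinations

L1_items = ["a", "b", "c", "d", "e", "f", "g"]

# All alphabetically ordered pairs drawn from the seven large 1-itemsets.
C2 = list(combinations(L1_items, 2))

print(len(C2))
print(C2[0], C2[-1])
```

Seven items give 7 × 6 / 2 = 21 pairs, matching the 21 candidate 2-itemsets listed above.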
35
So we have C2 = {{a, b}, {a, c}, {a, d}, {a, e},
{a, f}, {a, g}, {b, c}, {b, d}, {b, e}, {b, f},
{b, g}, {c, d}, {c, e}, {c, f}, {c, g}, {d, e},
{d, f}, {d, g}, {e, f}, {e, g}, {f, g}}. Line
4 of the Apriori algorithm now tells us to set a
counter for each of these to 0. Line 5 now
prepares us to take each record in the DB in
turn, and find which of those in C2 are contained
in it. The first record r1 is {a, b, d, g}.
Those of C2 it contains are {a, b}, {a, d},
{a, g}, {b, d}, {b, g}, {d, g}.
Hence Cr1 = {{a, b}, {a, d}, {a, g}, {b, d},
{b, g}, {d, g}}, and the rest
of line 6 tells us to increment the counters of
these itemsets. The second record r2 is {c,
d, e}; Cr2 = {{c, d}, {c, e}, {d, e}}, and
we increment the counters for these three
itemsets. After all 20 records, we look
at the counters, and in this case we will find
that the itemsets with counters >= minsup (4)
are {a, c}, {a, d}, {c, d}, {c, e} and {c, f}.
So, L2 = {{a, c}, {a, d}, {c, d}, {c, e}, {c, f}}
36
So we have L2 = {{a, c}, {a, d}, {c, d}, {c, e},
{c, f}}. We now set k = 3 and run apriori-gen on
L2. The join step finds the following pairs
that meet the required pattern: {a, c} and {a, d};
{c, d} and {c, e}; {c, d} and {c, f}; {c, e} and
{c, f}. This leads to the candidate 3-itemsets
{a, c, d}, {c, d, e}, {c, d, f}, {c, e, f}. We
prune {c, d, e} since {d, e} is not in L2. We
prune {c, d, f} since {d, f} is not in L2. We
prune {c, e, f} since {e, f} is not in L2. We are
left with C3 = {{a, c, d}}. We now run lines 5-7,
to count how many records contain {a, c, d}. The
count is 4, so L3 = {{a, c, d}}
37
So we have L3 = {{a, c, d}}. We now set k = 4, but
when we run apriori-gen on L3 we get the empty
set, and hence eventually we find L4 = {}. This
means we now finish, and return the set of all of
the non-empty Ls; these are all of the large
itemsets. Result: {{a}, {b}, {c}, {d}, {e},
{f}, {g}, {a, c}, {a, d}, {c, d}, {c, e},
{c, f}, {a, c, d}}. Each large itemset is
intrinsically interesting, and may be of business
value. Simple rule-generation algorithms can now
use the large itemsets as a starting point.
