Title: Association Analysis
1 Association Analysis
2 Association Rule Mining: Definition
- Given a set of records, each of which contains some number of items from a given collection,
- produce dependency rules that predict the occurrence of an item based on the occurrences of other items.
- Rules discovered: {Milk} → {Coke}, {Diaper, Milk} → {Beer}
3 Association Rules
- Marketing and sales promotion
- Let the discovered rule be {Bagels, …} → {Potato Chips}
- Potato Chips as the consequent ⇒ can be used to determine what should be done to boost its sales.
- Bagels in the antecedent ⇒ can be used to see which products would be affected if the store discontinued selling bagels.
4 Two key issues
- First: discovering patterns from a large transaction data set can be computationally expensive.
- Second: some of the discovered patterns are potentially spurious, because they may happen simply by chance.
5 Items and transactions
- Let
- I = {i1, i2, ..., id} be the set of all items in a market-basket data set, and
- T = {t1, t2, ..., tN} be the set of all transactions.
- Each transaction ti contains a subset of items chosen from I.
- Itemset
- A collection of one or more items
- Example: {Milk, Bread, Diaper}
- k-itemset
- An itemset that contains k items
- Transaction width
- The number of items present in a transaction
- A transaction tj is said to contain an itemset X if X is a subset of tj.
- E.g., the second transaction contains the itemset {Bread, Diapers} but not {Bread, Milk}.
6 Definition: Frequent Itemset
- Support count (σ)
- Frequency of occurrence of an itemset
- E.g. σ({Milk, Bread, Diaper}) = 2
- Support (s)
- Fraction of transactions that contain an itemset
- E.g. s({Milk, Bread, Diaper}) = 2/5 = σ/N
- Frequent itemset
- An itemset whose support is greater than or equal to a minsup threshold
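A minimal Python sketch of these definitions, assuming the standard five-transaction market-basket table these slides refer to (the table itself is not reproduced here):

```python
# Assumed example transactions (not shown on the slide itself).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """sigma(X): number of transactions that contain itemset X."""
    return sum(1 for t in transactions if set(itemset) <= t)

X = {"Milk", "Bread", "Diaper"}
sigma = support_count(X, transactions)   # 2
s = sigma / len(transactions)            # 2/5 = 0.4
print(sigma, s)
```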
7 Definition: Association Rule
- Association rule
- An implication expression of the form X → Y, where X and Y are itemsets
- Example: {Milk, Diaper} → {Beer}
- Rule evaluation metrics (X → Y)
- Support (s)
- Fraction of transactions that contain both X and Y
- Confidence (c)
- Measures how often items in Y appear in transactions that contain X
8 Why Use Support and Confidence?
- Support
- A rule with very low support may occur simply by chance.
- A rule with very low support is likely uninteresting.
- Confidence, for a given rule X → Y
- The higher the confidence, the more likely it is for Y to be present in transactions that contain X.
- Measures the reliability of the inference made by the rule.
- Is an estimate of the conditional probability of Y given X.
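A small sketch of both rule metrics for {Milk, Diaper} → {Beer}, again assuming the five-transaction example table:

```python
# Same assumed example transactions as in the previous sketch.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(itemset):
    return sum(1 for t in transactions if set(itemset) <= t)

X, Y = {"Milk", "Diaper"}, {"Beer"}
s = sigma(X | Y) / len(transactions)   # support    = 2/5 = 0.4
c = sigma(X | Y) / sigma(X)            # confidence = 2/3 ≈ 0.67
print(s, c)
```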
9 Association Rule Mining Task
- Given a set of transactions T, the goal of association rule mining is to find all rules having
- support ≥ minsup threshold
- confidence ≥ minconf threshold
- Brute-force approach
- List all possible association rules
- Compute the support and confidence for each rule
- Prune rules that fail the minsup and minconf thresholds
- ⇒ Computationally prohibitive!
10 Brute-force approach
- Suppose there are d items. We first choose k of the items to form the left-hand side of the rule. There are C(d, k) ways of doing this.
- Then there are C(d−k, i) ways to choose i of the remaining items to form the right-hand side of the rule, where 1 ≤ i ≤ d−k.
11 Brute-force approach
- The total number of possible rules is R = 3^d − 2^(d+1) + 1.
- For d = 6,
- 3^6 − 2^7 + 1 = 602 possible rules
- However, about 80% of these rules are discarded after applying minsup = 20% and minconf = 50%, so most of the computation is wasted.
- So it would be useful to prune rules early, without having to compute their support and confidence values.
An initial step toward improving the performance: decouple the support and confidence requirements.
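A quick check of the rule-count formula by direct summation over the C(d, k) and C(d−k, i) choices described above (Python, using math.comb):

```python
from math import comb

def total_rules(d):
    # Sum over left-hand sides of size k and right-hand sides of size i.
    return sum(comb(d, k) * sum(comb(d - k, i) for i in range(1, d - k + 1))
               for k in range(1, d))

d = 6
print(total_rules(d))            # 602
print(3**d - 2**(d + 1) + 1)     # 602, the closed form R = 3^d - 2^(d+1) + 1
```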
12 Basic Observations
Example rules:
{Milk, Diaper} → {Beer} (s=0.4, c=0.67)
{Milk, Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Beer} (s=0.4, c=0.5)
{Milk} → {Diaper, Beer} (s=0.4, c=0.5)
- Observations
- All the rules are binary partitions of the itemset {Milk, Diaper, Beer}
- Rules originating from the same itemset have identical support
- but can have different confidence
- We may therefore decouple the support and confidence requirements
- If the itemset is infrequent, then all six candidate rules can be pruned immediately, without having to compute their confidence values.
13 Mining Association Rules
- Two-step approach
- Frequent itemset generation
- Generate all itemsets whose support ≥ minsup
- these itemsets are called frequent itemsets
- Rule generation
- Generate high-confidence rules from each frequent itemset,
- where each rule is a binary partitioning of a frequent itemset (these rules are called strong rules)
- The computational requirements of frequent itemset generation are generally higher than those of rule generation.
- We focus first on frequent itemset generation.
14 Frequent Itemset Generation
Given d items, there are 2^d possible candidate itemsets.
15 Frequent Itemset Generation
- Brute-force approach
- Each itemset in the lattice is a candidate frequent itemset
- Count the support of each candidate by scanning the database
- Match each transaction against every candidate
- Complexity: O(NMw) ⇒ expensive, since M = 2^d !!!
- where N is the number of transactions, M the number of candidates, and w the maximum transaction width.
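A sketch of this brute-force scheme: every non-empty subset of the d items is a candidate (M = 2^d − 1) and each candidate is matched against every transaction. The six-item universe below is an assumption for illustration, matching the earlier example.

```python
from itertools import combinations

items = ["Beer", "Bread", "Coke", "Diaper", "Eggs", "Milk"]   # assumed item universe, d = 6
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

# Enumerate all 2^6 - 1 = 63 candidate itemsets in the lattice.
candidates = [frozenset(c) for k in range(1, len(items) + 1)
              for c in combinations(items, k)]

# O(N * M * w) work: every candidate is checked against every transaction.
support = {c: sum(1 for t in transactions if c <= t) for c in candidates}
```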
16 Frequent Itemset Generation Strategies
- Reduce the number of candidates (M)
- Complete search: M = 2^d
- Use pruning techniques to reduce M
- The Apriori principle is an effective way to eliminate candidate itemsets without counting their support values.
- Apriori principle
- If an itemset is frequent, then all of its subsets must also be frequent.
17 Reducing the Number of Candidates
- Apriori principle
- If an itemset is frequent, then all of its subsets must also be frequent.
- Conversely,
- if an itemset such as {a, b} is infrequent, then all of its supersets must be infrequent too.
- The Apriori principle holds due to the following property of the support measure:
- the support of an itemset never exceeds the support of its subsets.
- This is known as the anti-monotone property of support.
18 Illustrating the Apriori Principle
19 Illustrating the Apriori Principle
Items (1-itemsets)
Pairs (2-itemsets)
(No need to generate candidates involving Coke or Eggs)
Triplets (3-itemsets)
With the Apriori principle we need to keep only this triplet, because it's the only one whose subsets are all frequent.
Minimum support count = 3
If every itemset up to size 3 is considered: 6C1 + 6C2 + 6C3 = 6 + 15 + 20 = 41 candidates.
With support-based pruning: 6 + 6 + 1 = 13 candidates.
20 Apriori Algorithm
- Method (a sketch follows below)
- Let k = 1
- Generate frequent itemsets of length 1
- Repeat until no new frequent itemsets are identified:
- k = k + 1
- Generate length-k candidate itemsets from the length-(k−1) frequent itemsets
- Prune candidate itemsets that contain an infrequent subset of length k−1
- Count the support of each candidate by scanning the DB, and eliminate candidates that are infrequent
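A compact sketch of this loop (not an optimized implementation). It uses the F(k−1) × F(k−1) candidate generation and subset-based pruning described on the following slides:

```python
from itertools import combinations

def apriori(transactions, minsup_count):
    """Return all frequent itemsets (as frozensets) for a support-count threshold."""
    # Frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Fk = {x for x, n in counts.items() if n >= minsup_count}
    frequent = set(Fk)
    k = 1
    while Fk:
        k += 1
        # Generate length-k candidates: merge (k-1)-itemsets sharing their first k-2 items
        prev = sorted(tuple(sorted(x)) for x in Fk)
        Ck = {frozenset(a) | frozenset(b)
              for a, b in combinations(prev, 2) if a[:-1] == b[:-1]}
        # Prune candidates that contain an infrequent (k-1)-subset
        Ck = {c for c in Ck
              if all(frozenset(s) in Fk for s in combinations(c, k - 1))}
        # Count support with one scan of the DB, keep only the frequent candidates
        counts = {c: sum(1 for t in transactions if c.issubset(t)) for c in Ck}
        Fk = {c for c, n in counts.items() if n >= minsup_count}
        frequent |= Fk
    return frequent
```

Run on the five-transaction example assumed earlier with minsup_count = 3, the loop examines only 6 items, 6 pairs, and 1 triplet, i.e. the 13 candidates counted on the Apriori-illustration slide.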
21 Important Details of Apriori
- How to generate length-k candidates?
- Step 1: self-joining L(k−1)
- Step 2: pruning
- How to count the supports of candidates?
- Example of candidate generation (see the sketch below)
- L3 = {abc, abd, acd, ace, bcd}
- Self-joining: L3 * L3
- abcd from abc and abd
- acde from acd and ace
- Pruning
- acde is removed because ade is not in L3
- C4 = {abcd}
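The same example sketched directly in Python, with itemsets kept as sorted tuples:

```python
from itertools import combinations

L3 = [("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"), ("a", "c", "e"), ("b", "c", "d")]

# Self-join: merge pairs that agree on their first k-2 = 2 items.
joined = {x + (y[-1],) for x, y in combinations(L3, 2) if x[:-1] == y[:-1]}
# joined == {('a','b','c','d'), ('a','c','d','e')}

# Prune: keep only candidates whose 3-subsets are all in L3.
C4 = {c for c in joined if all(s in L3 for s in combinations(c, 3))}
# ('a','c','d','e') is removed because ('a','d','e') is not in L3
print(C4)   # {('a', 'b', 'c', 'd')}
```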
22 Challenges of Frequent Pattern Mining
- Challenges
- Multiple scans of the transaction database
- Huge number of candidate itemsets
- Tedious workload of support counting for candidates
- Improving Apriori: general ideas
- Reduce the number of transaction-database scans
- Shrink the number of candidates
- Facilitate support counting of candidates
23 Candidate generation and pruning
- An effective candidate generation procedure should
- avoid generating too many unnecessary candidates
- (an unnecessary candidate itemset is one in which at least one subset is infrequent),
- ensure that the candidate set is complete
- (no frequent itemsets are left out by the candidate generation procedure), and
- not generate the same candidate itemset more than once.
- E.g., the candidate itemset {a, b, c, d} can be generated in many ways:
- by merging {a, b, c} with {d},
- {c} with {a, b, d}, etc.
24 Brute force
- A brute-force method considers every k-itemset as a potential candidate and then applies the candidate pruning step to remove any unnecessary candidates.
25 F(k−1) × F1 Method
- Extend each frequent (k−1)-itemset with a frequent 1-itemset.
- Complete?
- Yes, because every frequent k-itemset is composed of a frequent (k−1)-itemset and a frequent 1-itemset.
- Problem
- It doesn't prevent the same candidate itemset from being generated more than once.
- E.g., {Bread, Diapers, Milk} can be generated by merging
- {Bread, Diapers} with {Milk},
- {Bread, Milk} with {Diapers}, or
- {Diapers, Milk} with {Bread}.
26 Lexicographic Order
- To avoid generating duplicate candidates,
- ensure that the items in each frequent itemset are kept sorted in lexicographic order.
- Each frequent (k−1)-itemset X is then extended only with frequent items that are lexicographically larger than the items in X.
- For example,
- the itemset {Bread, Diapers} can be augmented with {Milk}, since Milk is lexicographically larger than Bread and Diapers.
- We don't augment {Diapers, Milk} with {Bread} nor {Bread, Milk} with {Diapers}, because they violate the lexicographic ordering condition.
- Is it complete?
27 Lexicographic Order - Completeness
- Complete? Yes.
- Let (i1, ..., ik−1, ik) be a frequent k-itemset sorted in lexicographic order.
- Since it is frequent, by the Apriori principle, (i1, ..., ik−1) and (ik) are frequent as well.
- I.e., (i1, ..., ik−1) ∈ F(k−1) and (ik) ∈ F1.
- Since ik is lexicographically larger than i1, ..., ik−1,
- (i1, ..., ik−1) would be extended with (ik),
- giving (i1, ..., ik−1, ik) as a candidate k-itemset.
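A sketch of the F(k−1) × F1 procedure with the lexicographic restriction. The F1 and F2 values below are just the Bread/Diapers/Milk example, assumed frequent for illustration:

```python
# Assumed frequent itemsets, kept lexicographically sorted.
F2 = [("Bread", "Diapers"), ("Bread", "Milk"), ("Diapers", "Milk")]
F1 = ["Bread", "Diapers", "Milk"]

# Extend each (k-1)-itemset only with frequent items larger than its last (largest) item.
C3 = {x + (item,) for x in F2 for item in F1 if item > x[-1]}
print(C3)   # {('Bread', 'Diapers', 'Milk')} -- generated exactly once
```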
28 Still too many candidates
- E.g., merging {Beer, Diapers} with {Milk} is unnecessary because one of its subsets, {Beer, Milk}, is infrequent.
- Heuristics are available to reduce (prune) the number of unnecessary candidates.
- E.g., for a candidate k-itemset to be worthwhile,
- every item in the candidate must be contained in at least k−1 of the frequent (k−1)-itemsets.
- {Beer, Diapers, Milk} is a viable candidate 3-itemset only if
- every item in the candidate, including Beer, is contained in at least 2 frequent 2-itemsets.
- Since there is only one frequent 2-itemset containing Beer, all candidate itemsets involving Beer must be infrequent.
- Why?
- Because each item of a frequent k-itemset appears in k−1 of its (k−1)-subsets, and all of these subsets must be frequent.
29 F(k−1) × F1
30 F(k−1) × F(k−1) Method
- Merge a pair of frequent (k−1)-itemsets only if their first k−2 items are identical.
- E.g., {Bread, Diapers} and {Bread, Milk} → candidate 3-itemset {Bread, Diapers, Milk}.
- {Beer, Diapers} and {Diapers, Milk} are not merged.
- If {Beer, Diapers, Milk} were a viable candidate, it would have been obtained by merging {Beer, Diapers} with {Beer, Milk} instead.
- An additional candidate pruning step is needed to ensure that
- the remaining k−2 subsets of k−1 elements are frequent.
31 F(k−1) × F(k−1)
32 Example
Min_sup_count = 2
33 Generate C2 from F1 × F1
Min_sup_count = 2
[Figure: F1 and the resulting candidate set C2]
34 Generate C3 from F2 × F2
Min_sup_count = 2
[Figure: F2, the pruned candidate set C3, and F3]
35 Generate C4 from F3 × F3
Min_sup_count = 2
[Figure: F3 and the candidate set C4]
{I1, I2, I3, I5} is pruned because {I2, I3, I5} is infrequent.
36 Support Counting for Candidates
- Scan the database of transactions to determine the support of each candidate itemset.
- Brute force: match each transaction against every candidate.
- Too many comparisons!
- Better method: store the candidate itemsets in a hash structure.
- A transaction will then be matched only against candidates contained in a few buckets.
37 Hash Tree for Storing Candidate Itemsets
- Store the candidate itemsets in a hash structure.
- A transaction will be matched only against candidates contained in a few buckets.
- A hash tree can also be used for candidate generation.
38 Hash Tree for Candidate Itemsets
- Suppose you have 15 candidate itemsets of length 3 to be stored:
- {1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}
- You need
- a hash function (e.g., p mod 3), and
- a max leaf size: the maximum number of itemsets stored in a leaf node (if the number of candidate itemsets in a node exceeds the max leaf size, split the node).
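A minimal sketch of building such a hash tree, assuming the p mod 3 hash function and max leaf size 3 from this slide; the Node class and insert routine are illustrative names, not a standard API:

```python
class Node:
    def __init__(self):
        self.children = {}   # hash value -> child Node (used once this node is split)
        self.itemsets = []   # candidate itemsets stored in a leaf

def insert(node, itemset, depth=0, max_leaf_size=3, k=3):
    if node.children:                      # interior node: hash on the item at this depth
        h = itemset[depth] % 3
        insert(node.children.setdefault(h, Node()), itemset, depth + 1, max_leaf_size, k)
        return
    node.itemsets.append(itemset)          # leaf node: store the candidate
    if len(node.itemsets) > max_leaf_size and depth < k:
        stored, node.itemsets = node.itemsets, []          # split the over-full leaf
        for s in stored:
            h = s[depth] % 3
            insert(node.children.setdefault(h, Node()), s, depth + 1, max_leaf_size, k)

candidates = [(1,4,5), (1,2,4), (4,5,7), (1,2,5), (4,5,8), (1,5,9), (1,3,6),
              (2,3,4), (5,6,7), (3,4,5), (3,5,6), (3,5,7), (6,8,9), (3,6,7), (3,6,8)]
root = Node()
for c in candidates:
    insert(root, c)
# root.children now has three buckets keyed by first-item hash:
# 0 -> items 3, 6, 9;  1 -> items 1, 4, 7;  2 -> items 2, 5, 8
```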
39 Hash Tree for Candidate Itemsets
[Figure: the 15 candidate 3-itemsets hashed on the first item into three buckets]
Split nodes with more than 3 candidates using the second item.
40 Hash Tree for Candidate Itemsets
[Figure: the over-full bucket is split further on the second item]
Now split nodes using the third item.
41 Hash Tree for Candidate Itemsets
[Figure: the hash tree after splitting on the second and third items]
Now, split the remaining over-full node similarly.
42 Hash Tree for Candidate Itemsets
[Figure: the completed hash tree after all over-full nodes have been split]
43 Enumerate All Subsets of a Transaction
Given a (lexicographically ordered) transaction t, say {1, 2, 3, 5, 6}, how do we enumerate all of its subsets of size 3?
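In Python this is a one-liner with itertools.combinations, which emits the size-3 subsets of a sorted transaction in lexicographic order:

```python
from itertools import combinations

t = (1, 2, 3, 5, 6)                 # transaction, already sorted
for subset in combinations(t, 3):   # C(5, 3) = 10 subsets
    print(subset)                   # (1,2,3), (1,2,5), ..., (3,5,6)
```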
44 Matching a Transaction Against Candidates
[Figure: the transaction is hashed down the tree to locate candidate leaves]
45 Matching a Transaction Against Candidates
[Figure: hash function buckets 1,4,7 / 2,5,8 / 3,6,9 route the transaction's items down the tree]
46 Matching a Transaction Against Candidates
[Figure: leaves visited while matching the transaction]
The transaction is matched against only 7 out of the 15 candidates.
47 Trie for Storing Candidate Itemsets
Suppose you have 5 candidate itemsets of length 3: {A,C,D}, {A,E,G}, {A,E,L}, {A,E,M}, {K,M,N}.
- To match a transaction t against the candidates,
- we take all ordered k-subsets X of t and
- search for them in the trie structure.
- If X is found (as a candidate), the support count of this candidate is incremented by 1.
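A small sketch of this idea using nested dictionaries as the trie; the layout and the "#count" marker are illustrative choices, not a fixed format:

```python
def build_trie(candidates):
    root = {}
    for itemset in candidates:
        node = root
        for item in itemset:                # one trie level per item, in sorted order
            node = node.setdefault(item, {})
        node["#count"] = 0                  # marks the end of a candidate, holds its support count
    return root

def count_if_candidate(trie, subset):
    node = trie
    for item in subset:
        if item not in node:
            return False                    # this ordered subset is not a candidate
        node = node[item]
    if "#count" in node:
        node["#count"] += 1                 # subset found as a candidate: increment its support
        return True
    return False

trie = build_trie([("A","C","D"), ("A","E","G"), ("A","E","L"), ("A","E","M"), ("K","M","N")])
print(count_if_candidate(trie, ("A","E","G")))   # True
print(count_if_candidate(trie, ("A","C","G")))   # False
```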
48 Tries Can Store Frequent Itemsets Too
- Candidate generation becomes easy and fast:
- we can generate candidates from pairs of nodes that have the same parent
- (except for the last item, the two itemsets are the same).
- Association rules are produced much faster,
- since retrieving the support of an itemset is quicker.
- Remember that the trie was originally developed to quickly decide whether a word is included in a dictionary.