Title: Data Mining, Session 5: Fast Discovery of Association Rules

1. Data Mining, Session 5: Fast Discovery of Association Rules
- Luc Dehaspe
- K.U.L. Computer Science Department
2. Course overview
- Sessions 2-3: Data preparation
- This session: Data Mining
3. Previous session: classification task
- decision trees
- scaling up
- this session: a descriptive task
4. Overview
Based on: Rakesh Agrawal, Heikki Mannila, Ramakrishnan Srikant, Hannu Toivonen, and A. Inkeri Verkamo. Fast Discovery of Association Rules. Chapter 12 in Advances in Knowledge Discovery and Data Mining, Fayyad et al. (Eds.), MIT Press, 1996.
- Introduction: representation, task
- Algorithms
- generate-and-test
- two-step approach
- Empirical results
- AprioriHybrid
- Sampling
5. Association rules
- IF-THEN rules that show relationships between items
- e.g., which products are bought together?
6. Representation: market baskets
(figure: transaction/item table; note its sparsity)
7. Representation
- I = {i1, i2, ..., im}: a set of literals called items
- e.g., I = {banana, cheese, floppy, pizza, wine}
- D = {T1, T2, ..., Tn}: a set of transactions with Tj ⊆ I (1 ≤ j ≤ n)
- e.g., D as in the market-basket table
- a set of items X ⊆ I is called an itemset
- transaction T contains itemset X iff X ⊆ T
- e.g., transaction 100 contains itemset X = {banana, pizza}
8. Representation
- an association rule is an implication of the form X ⇒ Y
- where itemset X ⊂ I, itemset Y ⊂ I, and X ∩ Y = ∅
- e.g., {banana, floppy} ⇒ {cheese, wine}
- rule X ⇒ Y holds in transaction set D with confidence c iff c% of the transactions in D that contain X also contain Y
- e.g., {banana, floppy} ⇒ {cheese, wine} has confidence 1/2 = 50%
9. Representation
- rule X ⇒ Y has support s in transaction set D iff s% of the transactions in D contain X ∪ Y
- e.g., {banana, floppy} ⇒ {cheese, wine} has support 1/4 = 25%
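The two definitions above can be sketched directly in Python. The transaction table D below is an assumption: the slides' original table is a figure that did not survive extraction, so this is a reconstruction consistent with the example numbers (50% confidence, 25% support for {banana, floppy} ⇒ {cheese, wine}).

```python
def confidence_and_support(transactions, X, Y):
    """Confidence and support of rule X => Y, per slides 8-9.

    `transactions` maps TIDs to sets of items; X and Y are itemsets.
    Assumes at least one transaction contains X.
    """
    n = len(transactions)
    with_x = [t for t in transactions.values() if X <= t]
    with_xy = [t for t in with_x if Y <= t]
    support = len(with_xy) / n               # fraction containing X ∪ Y
    confidence = len(with_xy) / len(with_x)  # of those with X, fraction with Y
    return confidence, support

# assumed toy reconstruction of the slides' transaction table D
D = {100: {'banana', 'floppy', 'pizza'},
     200: {'cheese', 'pizza', 'wine'},
     300: {'banana', 'cheese', 'floppy', 'pizza', 'wine'},
     400: {'cheese', 'floppy', 'wine'}}

conf, supp = confidence_and_support(D, {'banana', 'floppy'}, {'cheese', 'wine'})
# conf = 0.5 (50%), supp = 0.25 (25%), matching the examples on slides 8-9
```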
10. Task
Given a set of transactions D, generate all association rules that have at least minimum support and minimum confidence.
- user-specified thresholds minsup and minconf
- 0% < minsup ≤ 100%
- 0% < minconf ≤ 100%
- e.g., given D, generate all association rules with support at least 50% and confidence at least 25%
11. Fast Discovery of Association Rules
- Introduction: representation, task
- Algorithms
- generate-and-test
- two-step approach
- Empirical results
- AprioriHybrid
- Sampling
12. Naïve algorithm: generate-and-test
- for each possible association rule:
- compute support and confidence
- if both are sufficient, add the rule to the result
- problematic complexity
- exponential in the number of items
- m items → (3^m − 2^(m+1) + 1) rules
- e.g., 5 items → 180 rules
- 20 items → ≈ 3.5 × 10^9 rules
- 1 pass over the data per rule
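The rule count (3^m − 2^(m+1) + 1) on the slide can be checked against a brute-force enumeration; this small sketch (function names are mine, not the slides') counts every ordered pair of disjoint non-empty itemsets over m items.

```python
from itertools import combinations

def rule_count_formula(m):
    # closed form from slide 12: 3^m - 2^(m+1) + 1
    return 3 ** m - 2 ** (m + 1) + 1

def rule_count_brute_force(m):
    """Count rules X => Y: ordered pairs of disjoint non-empty itemsets."""
    items = range(m)
    count = 0
    for xs in range(1, m + 1):
        for X in combinations(items, xs):
            rest = [i for i in items if i not in X]
            for ys in range(1, len(rest) + 1):
                count += sum(1 for _ in combinations(rest, ys))
    return count
```

For m = 5 both give 180, matching the slide; for m = 20 the formula gives 3,486,784,401 − 2,097,152 + 1 ≈ 3.5 × 10^9.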
13. Two-step approach
Phase 1: find all itemsets with support ≥ minsup ("large itemsets")
- e.g., 75% {cheese}
- e.g., 50% {cheese, floppy, wine}
Phase 2: generate all association rules with high confidence and support from the large itemsets
14. Phase 1: finding all large itemsets
- multiple passes over the data, one per level in the space of potential itemsets (breadth-first search), until no new large itemsets are found
- start with a seed set of large itemsets
- use the seed to generate potentially large itemsets, called candidate itemsets
- evaluate the candidate itemsets in a single pass over the data
- use the support counts to select the candidate itemsets that are actually large
- the actually large itemsets become the seed for the next pass
- initial seed: the set of all large singleton itemsets
15. Apriori (support-counting loop)
forall transactions t ∈ D do
  forall candidates c ∈ Ck contained in t do
    c.count++
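Only the counting loop survives on the slide; below is a hedged, self-contained Python sketch of the whole level-wise search of slide 14. All names are mine, the simple pairwise join stands in for the sorted-prefix join of slide 18, and the transaction table is an assumed reconstruction consistent with the slides' example supports.

```python
from itertools import combinations

def apriori(transactions, minsup_count):
    """Level-wise search for all large itemsets (slides 14-15).

    `transactions` maps TIDs to sets of items; `minsup_count` is the
    minimum support as an absolute transaction count.
    """
    # initial seed: all large 1-itemsets
    counts = {}
    for t in transactions.values():
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    large = {c for c, n in counts.items() if n >= minsup_count}
    all_large = {c: counts[c] for c in large}

    k = 2
    while large:
        # candidate generation: join large (k-1)-itemsets, then prune
        candidates = set()
        for a in large:
            for b in large:
                union = a | b
                if len(union) == k and all(
                        frozenset(s) in large
                        for s in combinations(union, k - 1)):
                    candidates.add(union)
        # candidate evaluation: one pass over the data (slide 15's loop)
        counts = dict.fromkeys(candidates, 0)
        for t in transactions.values():
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        large = {c for c in candidates if counts[c] >= minsup_count}
        all_large.update((c, counts[c]) for c in large)
        k += 1
    return all_large

# assumed toy reconstruction of the slides' transaction table D
D = {100: {'banana', 'floppy', 'pizza'},
     200: {'cheese', 'pizza', 'wine'},
     300: {'banana', 'cheese', 'floppy', 'pizza', 'wine'},
     400: {'cheese', 'floppy', 'wine'}}

result = apriori(D, 2)  # minsup = 50% of 4 transactions
```

On this data {cheese} is large with support 3 (75%) and {cheese, floppy, wine} with support 2 (50%), matching slide 13.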
16. Apriori iteration: candidate generation / candidate evaluation
17. Apriori: candidate generation
- input: the set of all large (k-1)-itemsets
- output: a superset of the set of all large k-itemsets
- join step
- join large (k-1)-itemsets with large (k-1)-itemsets
- produces a superset of the final set of level-k candidates
- prune step
- delete all itemsets that have a (k-1)-subset that is not large
18. Apriori: candidate generation, join step
- select 2 large (k-1)-itemsets that share their first k-2 items
- construct a level-k candidate by appending the last item of the second selected itemset to the first selected itemset
19. Apriori: candidate generation, prune step
Property: for any itemset in Lk with minimum support, every subset of size k-1 must also have minimum support.
- delete candidates that have a (k-1)-subset that is not large
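The join and prune steps of slides 18-19 can be sketched as follows. The function name `apriori_gen` follows the paper's terminology; itemsets are represented as sorted tuples so that the shared-prefix join of slide 18 can be done on a sorted list.

```python
from itertools import combinations

def apriori_gen(large_prev, k):
    """Candidate generation: join + prune (slides 17-19).

    `large_prev` is the set of large (k-1)-itemsets, each a sorted tuple;
    returns the level-k candidates.
    """
    prev = sorted(large_prev)
    prev_set = set(prev)
    candidates = []
    # join step: combine two (k-1)-itemsets sharing their first k-2 items
    for i, a in enumerate(prev):
        for b in prev[i + 1:]:
            if a[:k - 2] != b[:k - 2]:
                break  # sorted order: no later b shares the prefix
            candidates.append(a + (b[-1],))
    # prune step: drop candidates with a (k-1)-subset that is not large
    return [c for c in candidates
            if all(s in prev_set for s in combinations(c, k - 1))]

# example: the join of {1,2,3} and {1,2,4} gives {1,2,3,4}, which survives
# the prune; {1,3,4,5} is pruned because {1,4,5} is not large
L3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}
C4 = apriori_gen(L3, 4)
```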
20. Apriori: prune step
(figure: a candidate is pruned because it has a subset that is less frequent than minsup)
21. Apriori: candidate evaluation
- input: candidate itemsets stored in a hash-tree
- output: support counts of all candidates
22. Apriori: candidate evaluation, building the hash-tree
23. Apriori: candidate evaluation, building the hash-tree
(figure: the hash-tree of candidates; a counter is associated with each leaf node)
24.-27. Apriori: candidate evaluation, finding the candidates contained in a transaction
(figures: the hash-tree of candidate 3-itemsets BCF, BCP, BCW, BFP, BFW, BPW, CFP, CFW, CPW, FPW is traversed for transaction TID 300; a counter is associated with each leaf node)
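The hash-tree of slides 21-27 exists to find, for each transaction, which candidates it contains without testing every candidate. A minimal sketch of the same computation, with a plain dict of sorted tuples standing in for the hash-tree (an intentional simplification, and the transaction table is an assumed reconstruction):

```python
from itertools import combinations

def count_candidates(transactions, candidates, k):
    """Support counting for level-k candidates (slides 21-27, simplified).

    Each candidate is keyed by its sorted items in a dict; for every
    transaction we probe all of its k-subsets. The slides' hash-tree
    plays the role of this dict, but prunes subtrees during traversal.
    """
    counts = {tuple(sorted(c)): 0 for c in candidates}
    for t in transactions.values():
        for subset in combinations(sorted(t), k):
            if subset in counts:
                counts[subset] += 1
    return counts

# assumed toy reconstruction of the slides' transaction table D
D = {100: {'banana', 'floppy', 'pizza'},
     200: {'cheese', 'pizza', 'wine'},
     300: {'banana', 'cheese', 'floppy', 'pizza', 'wine'},
     400: {'cheese', 'floppy', 'wine'}}

C3 = [('cheese', 'floppy', 'wine'),
      ('banana', 'floppy', 'pizza'),
      ('cheese', 'pizza', 'wine')]
counts = count_candidates(D, C3, 3)
```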
28. Alternative to Apriori: AprioriTid
- database D is not used for counting support after the first pass
- instead, a list tC of candidates is associated with each transaction t
- initially, the set TC1 of all candidate lists for level 1 equals D
- initially, L1 contains all large 1-itemsets
29. Alternative to Apriori: AprioriTid
- apply candidate generation to L1 to generate C2
30. Alternative to Apriori: AprioriTid
- take entry 100 from TC1 and determine the candidates from C2 contained in transaction 100:
- t100C2 = { c ∈ C2 | (c minus its last item) ∈ t100C1 and (c minus its one-but-last item) ∈ t100C1 }, e.g. containing BF
- increment the counters of these candidates in C2 (e.g., BF to 1)
- repeat for all entries of TC1
31. Alternative to Apriori: AprioriTid
- copy the itemsets with sufficient support to L2
32. Alternative to Apriori: AprioriTid
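Slides 28-32 can be sketched end-to-end as follows. This is an assumed simplification, not the paper's exact pseudocode: candidate generation is a brute-force join + prune over frozensets, and a candidate is deemed contained in a transaction when all of its (k-1)-subsets appear in the transaction's previous candidate list (the prune step guarantees those subsets were level-(k-1) candidates, so this agrees with the two-subset test on slide 30). The transaction table is again an assumed reconstruction.

```python
from itertools import combinations

def apriori_tid(transactions, minsup_count):
    """AprioriTid sketch: after the first pass, D is never rescanned;
    TC_k maps each TID to the level-k candidates it contains."""
    # first (and only) pass over D: counts for 1-itemsets, and TC1
    counts = {}
    tc = {}  # TC_k: TID -> set of candidate k-itemsets contained in it
    for tid, t in transactions.items():
        tc[tid] = {frozenset([i]) for i in t}
        for c in tc[tid]:
            counts[c] = counts.get(c, 0) + 1
    large = {c for c, n in counts.items() if n >= minsup_count}
    all_large = {c: counts[c] for c in large}

    k = 2
    while large:
        # generate C_k from L_{k-1}: brute-force join, then prune
        candidates = {a | b for a in large for b in large if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(s) in large
                             for s in combinations(c, k - 1))}
        # build TC_k from TC_{k-1} instead of rescanning D (slide 30)
        counts = dict.fromkeys(candidates, 0)
        new_tc = {}
        for tid, prev in tc.items():
            contained = {c for c in candidates
                         if all(frozenset(s) in prev
                                for s in combinations(c, k - 1))}
            if contained:
                new_tc[tid] = contained
                for c in contained:
                    counts[c] += 1
        tc = new_tc
        # copy itemsets with sufficient support to L_k (slide 31)
        large = {c for c in candidates if counts[c] >= minsup_count}
        all_large.update((c, counts[c]) for c in large)
        k += 1
    return all_large

# assumed toy reconstruction of the slides' transaction table D
D = {100: {'banana', 'floppy', 'pizza'},
     200: {'cheese', 'pizza', 'wine'},
     300: {'banana', 'cheese', 'floppy', 'pizza', 'wine'},
     400: {'cheese', 'floppy', 'wine'}}

result2 = apriori_tid(D, 2)
```

On this data AprioriTid finds the same large itemsets as Apriori, as it must; the difference is only in what gets scanned.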
33. Two-step approach
Phase 1: find all itemsets with support ≥ minsup ("large itemsets")
- e.g., 75% {cheese}
- e.g., 50% {cheese, floppy, wine}
Phase 2: generate all association rules with high confidence and support
34. Phase 2: generating rules
- minsup requirement
- start from a large itemset X
- consider rules A ⇒ (X \ A), where A ⊂ X
- by definition, rule A ⇒ (X \ A) has support s in transaction set D iff s% of the transactions in D contain A ∪ (X \ A) = X
- minconf requirement
- (support X) / (support A) at least minconf
- e.g., X = CFW, A = CF, rule CF ⇒ W
- support of the rule = support of CFW = 2
- confidence of the rule = (support CFW) / (support CF) = 1
- if cheese and floppy, then always wine
35. Phase 2: generating rules
- support of a subset A' of A ≥ support of A
- so the confidence of A' ⇒ (X \ A'), i.e. (support X) / (support A'), cannot be more than the confidence of A ⇒ (X \ A), i.e. (support X) / (support A)
- if A ⇒ (X \ A) is bad, then A' ⇒ (X \ A') is also bad
- if A' ⇒ (X \ A') is ok, then A ⇒ (X \ A) is also ok
- e.g., X = {cheese, floppy, wine}
- if cheese ⇒ {floppy, wine} is ok, then
- {cheese, floppy} ⇒ wine is ok, and
- {cheese, wine} ⇒ floppy is ok
36. Phase 2: generating rules
- select a large k-itemset (k > 1)
- first generate all rules with one item in the consequent
- apply candidate generation (see Apriori above) to generate all possible consequents with 2 items
- compute the confidence of the rules and store those that are ok
- repeat until all rules with one item in the condition have been tried
37. Phase 2: generating rules
- select a large k-itemset (k > 1), e.g., CFW
- 1-item consequents C, F, W:
- FW ⇒ C (conf 100%)
- CW ⇒ F (conf 66%)
- CF ⇒ W (conf 100%)
- apriori-gen({C, F, W}) = {CF, CW, FW}, giving the 2-item consequents:
- W ⇒ CF (conf 66%)
- F ⇒ CW (conf 66%)
- C ⇒ FW (conf 66%)
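Phase 2 can be sketched compactly. This is an assumed simplification of slides 36-37: instead of growing consequents with apriori-gen, every non-empty proper subset A of a large itemset X is tried directly as an antecedent (correct, just without the pruning the slides describe). The support table below is an assumption chosen to match the CFW example.

```python
from itertools import combinations

def gen_rules(supports, minconf):
    """Rule generation for Phase 2 (slides 34-37, simplified).

    `supports` maps frozenset itemsets to absolute support counts and
    must contain every large itemset together with all of its subsets.
    Returns (antecedent, consequent, confidence) triples.
    """
    rules = []
    for X, supp_x in supports.items():
        if len(X) < 2:
            continue
        for r in range(1, len(X)):
            for A in map(frozenset, combinations(X, r)):
                conf = supp_x / supports[A]  # (support X) / (support A)
                if conf >= minconf:
                    rules.append((A, X - A, conf))
    return rules

# assumed support table consistent with the CFW example on slide 37
supports = {frozenset('C'): 3, frozenset('F'): 3, frozenset('W'): 3,
            frozenset('CF'): 2, frozenset('CW'): 3, frozenset('FW'): 2,
            frozenset('CFW'): 2}

rules = gen_rules(supports, 1.0)  # only the 100%-confidence rules
```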
38. Fast Discovery of Association Rules
- Introduction: representation, task
- Algorithms
- generate-and-test
- two-step approach
- Empirical results
- AprioriHybrid
- Sampling
39. Empirical results: synthetic data
- synthetic transaction data
- transaction sizes clustered around a mean
- sizes of large itemsets clustered around a mean
- method
- generate 2000 large itemsets from 1000 items
- the size of each itemset is picked from a Poisson distribution with mean |I| = 2, 4, or 6
- the weight of an itemset is the probability that it will be picked (weights sum to 1)
- generate |D| = 100,000 transactions
- transaction size picked from a Poisson distribution with mean |T| = 5, 10, or 20
- for the scale-up experiment: |D| = 10 million transactions
40. Empirical results: synthetic data
(figure: Apriori runtimes)
41. Empirical results
42. Fast Discovery of Association Rules
- Introduction: representation, task
- Algorithms
- generate-and-test
- two-step approach
- Empirical results
- AprioriHybrid
- Sampling
43. Algorithm AprioriHybrid
- AprioriTid replaces the pass over the data by a pass over TCk
- effective when TCk becomes small compared to the size of the database
- AprioriTid beats Apriori
- when the TCk sets fit in memory
- and the distribution of large itemsets has a long tail
- hybrid algorithm AprioriHybrid:
- use Apriori in the initial passes
- switch to AprioriTid when TCk is expected to fit in memory
44. Algorithm AprioriHybrid
- heuristic used for switching
- estimate the size of TCk from Ck:
- size(TCk) ≈ Σ_{c ∈ Ck} support(c) + number of transactions
- if TCk fits in memory and the number of candidates is decreasing, then switch to AprioriTid
- AprioriHybrid outperforms Apriori and AprioriTid in almost all cases
- a little worse if the switch happens in the last pass
- (cost of switching without its benefits)
- AprioriHybrid up to 30% better than Apriori, up to 60% better than AprioriTid
45. Algorithm AprioriHybrid: scale-up experiment
46. Fast Discovery of Association Rules
- Introduction: representation, task
- Algorithms
- generate-and-test
- two-step approach
- Empirical results
- AprioriHybrid
- Sampling
47. Sampling
- the running time of the Apriori family of algorithms is bounded by O(|C| · |D|)
- |C| denotes the sum of the sizes of the candidates considered
- |D| denotes the size of the database
- sampling is a possible way to reduce running time
- let s be the true support of itemset X
- take a random sample with replacement of size h from the database
- let x be the number of transactions in the sample containing X
- x is binomially distributed: h trials, probability of success s
- the probability that the estimated support is off by at least ε is bounded by a quantity exponential in h:
- Pr[x > h(s + ε)] < e^(−2ε²h)
48. Sampling
- support off by at most 1%: thousands of examples are sufficient
- sampling is not effective for supports of fractions of a percent
- the completeness guarantee of finding all rules satisfying minsup and minconf is lost
49. Fast Discovery of Association Rules: conclusions
- Introduction: representation, task
- Algorithms
- generate-and-test
- two-step approach
- Empirical results
- AprioriHybrid
- Sampling
50. Fast Discovery of Association Rules: next
- Parallel algorithms
- Mining sequence data
- Case-study