Title: Data Mining, Session 5: Fast Discovery of Association Rules

1. Data Mining, Session 5: Fast Discovery of Association Rules
- Luc Dehaspe
- K.U.L. Computer Science Department
2. Course overview
- Sessions 2-3: Data preparation
- This session: Data Mining
3. Previous session: classification task
- decision trees
- scaling up
- this session: a descriptive task
4. Overview
Based on: Rakesh Agrawal, Heikki Mannila, Ramakrishnan Srikant, Hannu Toivonen, and A. Inkeri Verkamo. Fast Discovery of Association Rules. Chapter 12 in Advances in Knowledge Discovery and Data Mining, Fayyad et al. (Eds.), MIT Press, 1996.
- Introduction: representation, task
- Algorithms
- generate-and-test
- two-step approach
- Empirical results
- AprioriHybrid
- Sampling
5. Association rules
- IF-THEN rules that show relationships between items
- e.g., which products are bought together?
6. Representation: market baskets
(figure: transaction/item table; note its sparsity)
7. Representation
- I = {i1, i2, ..., im}: a set of literals called items
- e.g., I = {banana, cheese, floppy, pizza, wine}
- D = {T1, T2, ..., Tn}: a set of transactions with Tj ⊆ I (1 ≤ j ≤ n)
- e.g., D as in the market-basket table
- a set of items X ⊆ I is called an itemset
- transaction T contains itemset X iff X ⊆ T
- e.g., transaction 100 contains itemset X = {banana, pizza}
8. Representation
- an association rule is an implication of the form X ⇒ Y
- where itemset X ⊂ I, itemset Y ⊂ I, and X ∩ Y = ∅
- e.g., {banana, floppy} ⇒ {cheese, wine}
- rule X ⇒ Y holds in transaction set D with confidence c iff c% of the transactions in D that contain X also contain Y
- e.g., {banana, floppy} ⇒ {cheese, wine} has confidence 1/2 = 50%
9. Representation
- rule X ⇒ Y has support s in transaction set D iff s% of the transactions in D contain X ∪ Y
- e.g., {banana, floppy} ⇒ {cheese, wine} has support 1/4 = 25%
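The two definitions above can be sketched directly in Python. The transaction table D below is an assumption: the slides' original table is a figure that did not survive extraction, so this is a reconstruction consistent with the example numbers (50% confidence, 25% support for {banana, floppy} ⇒ {cheese, wine}).

```python
def confidence_and_support(transactions, X, Y):
    """Confidence and support of rule X => Y, per slides 8-9.

    `transactions` maps TIDs to sets of items; X and Y are itemsets.
    Assumes at least one transaction contains X.
    """
    n = len(transactions)
    with_x = [t for t in transactions.values() if X <= t]
    with_xy = [t for t in with_x if Y <= t]
    support = len(with_xy) / n               # fraction containing X ∪ Y
    confidence = len(with_xy) / len(with_x)  # of those with X, fraction with Y
    return confidence, support

# assumed toy reconstruction of the slides' transaction table D
D = {100: {'banana', 'floppy', 'pizza'},
     200: {'cheese', 'pizza', 'wine'},
     300: {'banana', 'cheese', 'floppy', 'pizza', 'wine'},
     400: {'cheese', 'floppy', 'wine'}}

conf, supp = confidence_and_support(D, {'banana', 'floppy'}, {'cheese', 'wine'})
# conf = 0.5 (50%), supp = 0.25 (25%), matching the examples on slides 8-9
```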
10. Task
Given a set of transactions D, generate all association rules that have at least minimum support and minimum confidence.
- user-specified thresholds minsup and minconf
- 0% < minsup ≤ 100%
- 0% < minconf ≤ 100%
- e.g., given D, generate all association rules with support at least 50% and confidence at least 25%
11. Fast Discovery of Association Rules
- Introduction: representation, task
- Algorithms
- generate-and-test
- two-step approach
- Empirical results
- AprioriHybrid
- Sampling
12. Naïve algorithm: generate-and-test
- for each possible association rule:
- compute support and confidence
- if both are sufficient, add the rule to the result
- problematic complexity
- exponential in the number of items
- m items → (3^m − 2^(m+1) + 1) rules
- e.g., 5 items → 180 rules
- 20 items → ≈ 3.5 × 10^9 rules
- 1 pass over the data per rule
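The rule count (3^m − 2^(m+1) + 1) on the slide can be checked against a brute-force enumeration; this small sketch (function names are mine, not the slides') counts every ordered pair of disjoint non-empty itemsets over m items.

```python
from itertools import combinations

def rule_count_formula(m):
    # closed form from slide 12: 3^m - 2^(m+1) + 1
    return 3 ** m - 2 ** (m + 1) + 1

def rule_count_brute_force(m):
    """Count rules X => Y: ordered pairs of disjoint non-empty itemsets."""
    items = range(m)
    count = 0
    for xs in range(1, m + 1):
        for X in combinations(items, xs):
            rest = [i for i in items if i not in X]
            for ys in range(1, len(rest) + 1):
                count += sum(1 for _ in combinations(rest, ys))
    return count
```

For m = 5 both give 180, matching the slide; for m = 20 the formula gives 3,486,784,401 − 2,097,152 + 1 ≈ 3.5 × 10^9.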
13. Two-step approach
Phase 1: find all itemsets with support ≥ minsup ("large itemsets")
- e.g., 75% {cheese}
- e.g., 50% {cheese, floppy, wine}
Phase 2: generate all association rules with high confidence and support from the large itemsets
14. Phase 1: finding all large itemsets
- multiple passes over the data, one per level in the space of potential itemsets (breadth-first search), until no new large itemsets are found
- start with a seed set of large itemsets
- use the seed to generate potentially large itemsets, called candidate itemsets
- evaluate the candidate itemsets in a single pass over the data
- use the support counts to select the candidate itemsets that are actually large
- the actually large itemsets become the seed for the next pass
- initial seed: the set of all large singleton itemsets
15. Apriori (support-counting loop)
forall transactions t ∈ D do
  forall candidates c ∈ Ck contained in t do
    c.count++
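Only the counting loop survives on the slide; below is a hedged, self-contained Python sketch of the whole level-wise search of slide 14. All names are mine, the simple pairwise join stands in for the sorted-prefix join of slide 18, and the transaction table is an assumed reconstruction consistent with the slides' example supports.

```python
from itertools import combinations

def apriori(transactions, minsup_count):
    """Level-wise search for all large itemsets (slides 14-15).

    `transactions` maps TIDs to sets of items; `minsup_count` is the
    minimum support as an absolute transaction count.
    """
    # initial seed: all large 1-itemsets
    counts = {}
    for t in transactions.values():
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    large = {c for c, n in counts.items() if n >= minsup_count}
    all_large = {c: counts[c] for c in large}

    k = 2
    while large:
        # candidate generation: join large (k-1)-itemsets, then prune
        candidates = set()
        for a in large:
            for b in large:
                union = a | b
                if len(union) == k and all(
                        frozenset(s) in large
                        for s in combinations(union, k - 1)):
                    candidates.add(union)
        # candidate evaluation: one pass over the data (slide 15's loop)
        counts = dict.fromkeys(candidates, 0)
        for t in transactions.values():
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        large = {c for c in candidates if counts[c] >= minsup_count}
        all_large.update((c, counts[c]) for c in large)
        k += 1
    return all_large

# assumed toy reconstruction of the slides' transaction table D
D = {100: {'banana', 'floppy', 'pizza'},
     200: {'cheese', 'pizza', 'wine'},
     300: {'banana', 'cheese', 'floppy', 'pizza', 'wine'},
     400: {'cheese', 'floppy', 'wine'}}

result = apriori(D, 2)  # minsup = 50% of 4 transactions
```

On this data {cheese} is large with support 3 (75%) and {cheese, floppy, wine} with support 2 (50%), matching slide 13.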
16. Apriori iteration: candidate generation / candidate evaluation
17. Apriori: candidate generation
- input: the set of all large (k-1)-itemsets
- output: a superset of the set of all large k-itemsets
- join step
- join large (k-1)-itemsets with large (k-1)-itemsets
- produces a superset of the final set of level-k candidates
- prune step
- delete all itemsets that have a (k-1)-subset that is not large
18. Apriori: candidate generation, join step
- select 2 large (k-1)-itemsets that share their first k-2 items
- construct a level-k candidate by appending the last item of the second selected itemset to the first selected itemset
19. Apriori: candidate generation, prune step
Property: for any itemset in Lk with minimum support, every subset of size k-1 must also have minimum support.
- delete candidates that have a (k-1)-subset that is not large
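The join and prune steps of slides 18-19 can be sketched as follows. The function name `apriori_gen` follows the paper's terminology; itemsets are represented as sorted tuples so that the shared-prefix join of slide 18 can be done on a sorted list.

```python
from itertools import combinations

def apriori_gen(large_prev, k):
    """Candidate generation: join + prune (slides 17-19).

    `large_prev` is the set of large (k-1)-itemsets, each a sorted tuple;
    returns the level-k candidates.
    """
    prev = sorted(large_prev)
    prev_set = set(prev)
    candidates = []
    # join step: combine two (k-1)-itemsets sharing their first k-2 items
    for i, a in enumerate(prev):
        for b in prev[i + 1:]:
            if a[:k - 2] != b[:k - 2]:
                break  # sorted order: no later b shares the prefix
            candidates.append(a + (b[-1],))
    # prune step: drop candidates with a (k-1)-subset that is not large
    return [c for c in candidates
            if all(s in prev_set for s in combinations(c, k - 1))]

# example: the join of {1,2,3} and {1,2,4} gives {1,2,3,4}, which survives
# the prune; {1,3,4,5} is pruned because {1,4,5} is not large
L3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}
C4 = apriori_gen(L3, 4)
```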
20. Apriori: prune step
(figure: a candidate is pruned because it has a subset that is less frequent than minsup)
21. Apriori: candidate evaluation
- input: candidate itemsets stored in a hash-tree
- output: support counts of all candidates
22. Apriori: candidate evaluation, building the hash-tree
23. Apriori: candidate evaluation, building the hash-tree
(figure: the hash-tree of candidates; a counter is associated with each leaf node)
24.-27. Apriori: candidate evaluation, finding the candidates contained in a transaction
(figures: the hash-tree of candidate 3-itemsets BCF, BCP, BCW, BFP, BFW, BPW, CFP, CFW, CPW, FPW is traversed for transaction TID 300; a counter is associated with each leaf node)
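The hash-tree of slides 21-27 exists to find, for each transaction, which candidates it contains without testing every candidate. A minimal sketch of the same computation, with a plain dict of sorted tuples standing in for the hash-tree (an intentional simplification, and the transaction table is an assumed reconstruction):

```python
from itertools import combinations

def count_candidates(transactions, candidates, k):
    """Support counting for level-k candidates (slides 21-27, simplified).

    Each candidate is keyed by its sorted items in a dict; for every
    transaction we probe all of its k-subsets. The slides' hash-tree
    plays the role of this dict, but prunes subtrees during traversal.
    """
    counts = {tuple(sorted(c)): 0 for c in candidates}
    for t in transactions.values():
        for subset in combinations(sorted(t), k):
            if subset in counts:
                counts[subset] += 1
    return counts

# assumed toy reconstruction of the slides' transaction table D
D = {100: {'banana', 'floppy', 'pizza'},
     200: {'cheese', 'pizza', 'wine'},
     300: {'banana', 'cheese', 'floppy', 'pizza', 'wine'},
     400: {'cheese', 'floppy', 'wine'}}

C3 = [('cheese', 'floppy', 'wine'),
      ('banana', 'floppy', 'pizza'),
      ('cheese', 'pizza', 'wine')]
counts = count_candidates(D, C3, 3)
```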
28. Alternative to Apriori: AprioriTid
- database D is not used for counting support after the first pass
- instead, a list tC of candidates is associated with each transaction t
- initially, the set TC1 of all candidate lists for level 1 equals D
- initially, L1 contains all large 1-itemsets
29. Alternative to Apriori: AprioriTid
- apply candidate generation to L1 to generate C2
30. Alternative to Apriori: AprioriTid
- take entry 100 from TC1 and determine the candidates from C2 contained in transaction 100:
- t100C2 = { c ∈ C2 | (c minus its last item) ∈ t100C1 and (c minus its one-but-last item) ∈ t100C1 }, e.g. containing BF
- increment the counters of these candidates in C2 (e.g., BF to 1)
- repeat for all entries of TC1
31. Alternative to Apriori: AprioriTid
- copy the itemsets with sufficient support to L2
32. Alternative to Apriori: AprioriTid
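Slides 28-32 can be sketched end-to-end as follows. This is an assumed simplification, not the paper's exact pseudocode: candidate generation is a brute-force join + prune over frozensets, and a candidate is deemed contained in a transaction when all of its (k-1)-subsets appear in the transaction's previous candidate list (the prune step guarantees those subsets were level-(k-1) candidates, so this agrees with the two-subset test on slide 30). The transaction table is again an assumed reconstruction.

```python
from itertools import combinations

def apriori_tid(transactions, minsup_count):
    """AprioriTid sketch: after the first pass, D is never rescanned;
    TC_k maps each TID to the level-k candidates it contains."""
    # first (and only) pass over D: counts for 1-itemsets, and TC1
    counts = {}
    tc = {}  # TC_k: TID -> set of candidate k-itemsets contained in it
    for tid, t in transactions.items():
        tc[tid] = {frozenset([i]) for i in t}
        for c in tc[tid]:
            counts[c] = counts.get(c, 0) + 1
    large = {c for c, n in counts.items() if n >= minsup_count}
    all_large = {c: counts[c] for c in large}

    k = 2
    while large:
        # generate C_k from L_{k-1}: brute-force join, then prune
        candidates = {a | b for a in large for b in large if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(s) in large
                             for s in combinations(c, k - 1))}
        # build TC_k from TC_{k-1} instead of rescanning D (slide 30)
        counts = dict.fromkeys(candidates, 0)
        new_tc = {}
        for tid, prev in tc.items():
            contained = {c for c in candidates
                         if all(frozenset(s) in prev
                                for s in combinations(c, k - 1))}
            if contained:
                new_tc[tid] = contained
                for c in contained:
                    counts[c] += 1
        tc = new_tc
        # copy itemsets with sufficient support to L_k (slide 31)
        large = {c for c in candidates if counts[c] >= minsup_count}
        all_large.update((c, counts[c]) for c in large)
        k += 1
    return all_large

# assumed toy reconstruction of the slides' transaction table D
D = {100: {'banana', 'floppy', 'pizza'},
     200: {'cheese', 'pizza', 'wine'},
     300: {'banana', 'cheese', 'floppy', 'pizza', 'wine'},
     400: {'cheese', 'floppy', 'wine'}}

result2 = apriori_tid(D, 2)
```

On this data AprioriTid finds the same large itemsets as Apriori, as it must; the difference is only in what gets scanned.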
33. Two-step approach
Phase 1: find all itemsets with support ≥ minsup ("large itemsets")
- e.g., 75% {cheese}
- e.g., 50% {cheese, floppy, wine}
Phase 2: generate all association rules with high confidence and support
34. Phase 2: generating rules
- minsup requirement
- start from a large itemset X
- consider rules A ⇒ (X \ A), where A ⊂ X
- by definition, rule A ⇒ (X \ A) has support s in transaction set D iff s% of the transactions in D contain A ∪ (X \ A) = X
- minconf requirement
- (support X) / (support A) at least minconf
- e.g., X = CFW, A = CF, rule CF ⇒ W
- support of the rule = support of CFW = 2
- confidence of the rule = (support CFW) / (support CF) = 1
- if cheese and floppy, then always wine
35. Phase 2: generating rules
- support of a subset A' of A ≥ support of A
- so the confidence of A' ⇒ (X \ A'), i.e. (support X) / (support A'), cannot be more than the confidence of A ⇒ (X \ A), i.e. (support X) / (support A)
- if A ⇒ (X \ A) is bad, then A' ⇒ (X \ A') is also bad
- if A' ⇒ (X \ A') is ok, then A ⇒ (X \ A) is also ok
- e.g., X = {cheese, floppy, wine}
- if cheese ⇒ {floppy, wine} is ok, then
- {cheese, floppy} ⇒ wine is ok, and
- {cheese, wine} ⇒ floppy is ok
36. Phase 2: generating rules
- select a large k-itemset (k > 1)
- first generate all rules with one item in the consequent
- apply candidate generation (see Apriori above) to generate all possible consequents with 2 items
- compute the confidence of the rules and store those that are ok
- repeat until all rules with one item in the condition have been tried
37. Phase 2: generating rules
- select a large k-itemset (k > 1), e.g., CFW
- 1-item consequents C, F, W:
- FW ⇒ C (conf 100%)
- CW ⇒ F (conf 66%)
- CF ⇒ W (conf 100%)
- apriori-gen({C, F, W}) = {CF, CW, FW}, giving the 2-item consequents:
- W ⇒ CF (conf 66%)
- F ⇒ CW (conf 66%)
- C ⇒ FW (conf 66%)
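Phase 2 can be sketched compactly. This is an assumed simplification of slides 36-37: instead of growing consequents with apriori-gen, every non-empty proper subset A of a large itemset X is tried directly as an antecedent (correct, just without the pruning the slides describe). The support table below is an assumption chosen to match the CFW example.

```python
from itertools import combinations

def gen_rules(supports, minconf):
    """Rule generation for Phase 2 (slides 34-37, simplified).

    `supports` maps frozenset itemsets to absolute support counts and
    must contain every large itemset together with all of its subsets.
    Returns (antecedent, consequent, confidence) triples.
    """
    rules = []
    for X, supp_x in supports.items():
        if len(X) < 2:
            continue
        for r in range(1, len(X)):
            for A in map(frozenset, combinations(X, r)):
                conf = supp_x / supports[A]  # (support X) / (support A)
                if conf >= minconf:
                    rules.append((A, X - A, conf))
    return rules

# assumed support table consistent with the CFW example on slide 37
supports = {frozenset('C'): 3, frozenset('F'): 3, frozenset('W'): 3,
            frozenset('CF'): 2, frozenset('CW'): 3, frozenset('FW'): 2,
            frozenset('CFW'): 2}

rules = gen_rules(supports, 1.0)  # only the 100%-confidence rules
```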
38. Fast Discovery of Association Rules
- Introduction: representation, task
- Algorithms
- generate-and-test
- two-step approach
- Empirical results
- AprioriHybrid
- Sampling
39. Empirical results: synthetic data
- synthetic transaction data
- transaction sizes clustered around a mean
- sizes of large itemsets clustered around a mean
- method
- generate 2000 large itemsets from 1000 items
- the size of each itemset is picked from a Poisson distribution with mean |I| = 2, 4, or 6
- the weight of an itemset is the probability that it will be picked (weights sum to 1)
- generate |D| = 100,000 transactions
- transaction size picked from a Poisson distribution with mean |T| = 5, 10, or 20
- for the scale-up experiment: |D| = 10 million transactions
40. Empirical results: synthetic data
(figure: Apriori runtimes)
41. Empirical results
42. Fast Discovery of Association Rules
- Introduction: representation, task
- Algorithms
- generate-and-test
- two-step approach
- Empirical results
- AprioriHybrid
- Sampling
43. Algorithm AprioriHybrid
- AprioriTid replaces the pass over the data by a pass over TCk
- effective when TCk becomes small compared to the size of the database
- AprioriTid beats Apriori
- when the TCk sets fit in memory
- and the distribution of large itemsets has a long tail
- hybrid algorithm AprioriHybrid:
- use Apriori in the initial passes
- switch to AprioriTid when TCk is expected to fit in memory
44. Algorithm AprioriHybrid
- heuristic used for switching
- estimate the size of TCk from Ck:
- size(TCk) ≈ Σ_{c ∈ Ck} support(c) + number of transactions
- if TCk fits in memory and the number of candidates is decreasing, then switch to AprioriTid
- AprioriHybrid outperforms Apriori and AprioriTid in almost all cases
- a little worse if the switch happens in the last pass
- (cost of switching without its benefits)
- AprioriHybrid up to 30% better than Apriori, up to 60% better than AprioriTid
45. Algorithm AprioriHybrid: scale-up experiment
46. Fast Discovery of Association Rules
- Introduction: representation, task
- Algorithms
- generate-and-test
- two-step approach
- Empirical results
- AprioriHybrid
- Sampling
47. Sampling
- the running time of the Apriori family of algorithms is bounded by O(|C| · |D|)
- |C| denotes the sum of the sizes of the candidates considered
- |D| denotes the size of the database
- sampling is a possible way to reduce running time
- let s be the true support of itemset X
- take a random sample with replacement of size h from the database
- let x be the number of transactions in the sample containing X
- x is binomially distributed: h trials, probability of success s
- the probability that the estimated support is off by at least ε is bounded by a quantity exponential in h:
- Pr[x > h(s + ε)] < e^(−2ε²h)
48. Sampling
- support off by at most 1%: thousands of examples are sufficient
- sampling is not effective for supports of fractions of a percent
- the completeness guarantee of finding all rules satisfying minsup and minconf is lost
49. Fast Discovery of Association Rules: conclusions
- Introduction: representation, task
- Algorithms
- generate-and-test
- two-step approach
- Empirical results
- AprioriHybrid
- Sampling
50. Fast Discovery of Association Rules: next
- Parallel algorithms
- Mining sequence data
- Case-study