Data Mining, Session 5: Fast Discovery of Association Rules

1
Data Mining, Session 5: Fast Discovery of Association Rules
  • Luc Dehaspe
  • K.U.L. Computer Science Department

2
Course overview
(diagram: Sessions 2-3, data preparation; data mining)
3
Previous session: classification task
  • decision trees
  • scaling up
  • this session: a descriptive task

4
Overview
Rakesh Agrawal, Heikki Mannila, Ramakrishnan Srikant, Hannu Toivonen, and A. Inkeri Verkamo. "Fast Discovery of Association Rules." Chapter 12 in Advances in Knowledge Discovery and Data Mining, Fayyad et al. (Eds.), MIT Press, 1996.
  • Introduction: representation, task
  • Algorithms
  • generate-and-test
  • two-step approach
  • Empirical results
  • AprioriHybrid
  • Sampling

5
Association rules
  • IF-THEN rules that show relationships between items
  • e.g., which products are bought together?

6
Representation: market baskets
(figure: sparse item-transaction matrix)
7
Representation
  • I = {i1, i2, …, im}: a set of literals called items
  • e.g., I = {banana, cheese, floppy, pizza, wine}
  • D = {T1, T2, …, Tn}: a set of transactions with Tj ⊆ I (1 ≤ j ≤ n)
  • e.g., D (table)
  • a set of items X ⊆ I is called an itemset
  • transaction T contains itemset X iff X ⊆ T
  • e.g., transaction 100 contains itemset X = {banana, pizza}

8
Representation
  • an association rule is an implication of the form
  • X ⇒ Y
  • where itemset X ⊆ I, itemset Y ⊆ I, and X ∩ Y = ∅
  • e.g., {banana, floppy} ⇒ {cheese, wine}
  • rule X ⇒ Y holds in transaction set D with confidence c iff c% of the transactions in D that contain X also contain Y
  • e.g., {banana, floppy} ⇒ {cheese, wine}

confidence = 1/2 = 50%
9
Representation
  • rule X ⇒ Y has support s in transaction set D iff s% of the transactions in D contain X ∪ Y
  • e.g., {banana, floppy} ⇒ {cheese, wine}

support = 1/4 = 25%
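These two definitions translate directly into Python. The transaction table D below is a hypothetical reconstruction (the slides' actual table was a graphic lost from this transcript), chosen so that the example rule {banana, floppy} ⇒ {cheese, wine} has the 50% confidence and 25% support quoted above.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y, transactions):
    """Fraction of the transactions containing x that also contain y."""
    return support(set(x) | set(y), transactions) / support(x, transactions)

# Hypothetical D (the slide's table is lost from the transcript):
D = [frozenset(t) for t in (
    {"banana", "pizza"},                     # TID 100
    {"banana", "floppy", "cheese", "wine"},  # TID 200
    {"banana", "floppy", "pizza"},           # TID 300
    {"cheese", "wine"},                      # TID 400
)]

print(support({"banana", "floppy", "cheese", "wine"}, D))       # 0.25
print(confidence({"banana", "floppy"}, {"cheese", "wine"}, D))  # 0.5
```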
10
Task
Given a set of transactions D, generate all association rules that have minimum support and confidence
  • user-specified thresholds minsup and minconf
  • 0 < minsup ≤ 100%
  • 0 < minconf ≤ 100%
  • e.g., given D
  • generate all association rules with support at least 50% and confidence at least 25%

11
Fast Discovery of Association Rules
  • Introduction: representation, task
  • Algorithms
  • generate-and-test
  • two-step approach
  • Empirical results
  • AprioriHybrid
  • Sampling

12
Naïve algorithm: generate-and-test
  • for each association rule
  • compute support and confidence
  • if both are sufficient, add to result
  • Problematic complexity
  • exponential in the number of items
  • m items → 3^m − 2^(m+1) + 1 rules
  • e.g., 5 items: 180 rules
  • 20 items: ≈3.5 × 10^9 rules
  • 1 pass over the data per rule
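The count 3^m − 2^(m+1) + 1 follows because each item independently goes into the antecedent, the consequent, or neither (3^m assignments), minus the assignments where either side ends up empty. A quick check of the slide's figures:

```python
def rule_count(m):
    """Number of association rules over m items.

    Each item is in the antecedent, the consequent, or absent (3**m
    assignments); subtract the 2 * 2**m assignments with an empty side,
    adding back the doubly-subtracted all-absent case: 3**m - 2**(m+1) + 1.
    """
    return 3**m - 2**(m + 1) + 1

print(rule_count(5))   # 180
print(rule_count(20))  # 3484687250, i.e. about 3.5e9
```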

13
Two-step approach
Phase 1: find all itemsets with support > minsup ("large itemsets")
75% {cheese}
50% {cheese, floppy, wine}
Phase 2: generate all association rules with high confidence and support
14
Phase 1: finding all large itemsets
  • multiple passes over the data, one per level in the space of potential itemsets (breadth-first search)
  • until no new itemsets are found:
  • start with a seed set of large itemsets
  • use the seed to generate potentially large itemsets, called candidate itemsets
  • evaluate the candidate itemsets in a single pass over the data
  • use the support counts to select the candidate itemsets that are actually large
  • the actually large itemsets become the seed for the next pass
  • initial seed: the set of all large singleton itemsets

15
Apriori
forall transactions t ∈ D do
  forall candidates c ∈ Ck contained in t do
    c.count++
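The levelwise loop these slides describe can be sketched in Python. This is a minimal sketch, not the paper's implementation: candidates live in plain sets rather than the hash-tree of the later slides, a naive pairwise join stands in for the full apriori-gen join/prune, and the transaction table is a hypothetical stand-in for the slides' lost example.

```python
from collections import Counter

def large_itemsets(D, minsup):
    """Breadth-first (levelwise) search for all large itemsets (sketch)."""
    n = len(D)
    # initial seed: the large 1-itemsets
    counts = Counter(frozenset([i]) for t in D for i in t)
    large = {s for s, c in counts.items() if c / n >= minsup}
    result, k = set(large), 2
    while large:
        seeds = list(large)
        # generate level-k candidates by joining seeds (naive join)
        candidates = {a | b for a in seeds for b in seeds if len(a | b) == k}
        # evaluate all candidates in a single pass over the data
        large = {c for c in candidates
                 if sum(c <= t for t in D) / n >= minsup}
        result |= large
        k += 1
    return result

# hypothetical transactions (the slides' table is lost)
D = [frozenset(t) for t in (
    {"banana", "pizza"},
    {"banana", "floppy", "cheese", "wine"},
    {"banana", "floppy", "pizza"},
    {"cheese", "wine"},
)]
print(sorted(map(sorted, large_itemsets(D, 0.5))))
```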
16
Apriori iteration: candidate generation / evaluation
17
Apriori candidate generation
  • input: the set of all large (k-1)-itemsets
  • output: a superset of the set of all large k-itemsets
  • join step
  • join large (k-1)-itemsets with large (k-1)-itemsets
  • produces a superset of the final set of level-k candidates
  • prune step
  • delete all itemsets that have a (k-1)-subset that is not large

18
Apriori candidate generation: join step
  • select 2 large (k-1)-itemsets that share their first k-2 items
  • construct a level-k candidate by appending the last item of the second selected itemset to the first selected itemset

19
Apriori candidate generation: prune step
Property: every subset of an itemset with minimum support also has minimum support; in particular, every (k-1)-subset of an itemset in Lk must be in Lk-1
  • delete candidates that have a (k-1)-subset that is not large
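Both steps fit in a few lines when large (k-1)-itemsets are kept as sorted tuples; a sketch (the function name apriori_gen follows the paper's terminology, the rest is illustrative):

```python
from itertools import combinations

def apriori_gen(L_prev):
    """Generate level-k candidates from the large (k-1)-itemsets L_prev.

    join:  merge two sorted (k-1)-itemsets sharing their first k-2 items;
    prune: drop candidates with a (k-1)-subset that is not large.
    """
    L_prev = {tuple(sorted(s)) for s in L_prev}
    k = len(next(iter(L_prev))) + 1
    candidates = set()
    for a in L_prev:
        for b in L_prev:
            if a[:-1] == b[:-1] and a[-1] < b[-1]:    # join step
                candidates.add(a + (b[-1],))
    return {c for c in candidates                      # prune step
            if all(s in L_prev for s in combinations(c, k - 1))}

L2 = [("B", "C"), ("B", "F"), ("C", "F"), ("C", "W"), ("F", "W")]
print(apriori_gen(L2))  # {('B', 'C', 'F'), ('C', 'F', 'W')}
```

With this L2, the join also produces nothing from ("F", "W") (no partner sharing its first item), and both joined candidates survive the prune because all their 2-subsets are in L2.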

20
Apriori prune step
(figure: a candidate is pruned because one of its subsets is less frequent)
21
Apriori candidate evaluation
  • input: candidate itemsets, stored in a hash-tree
  • output: support counts of all candidates

22
Apriori candidate evaluation: building the hash-tree
23
Apriori candidate evaluation: building the hash-tree
(figure: hash-tree of candidates; a counter is associated with each leaf node)
24
Apriori candidate evaluation: finding candidates contained in a transaction
(figure, shown in several animation steps: transaction BCFW, TID 300, is matched against the hash-tree of candidate 3-itemsets BCF, BCP, BCW, BFP, BFW, BPW, CFP, CFW, CPW, FPW; a counter is associated with each leaf node)
28
An alternative to Apriori: AprioriTid
  • the database D is not used for counting support after the first pass
  • instead, a list of candidates t.C is associated with each transaction t
  • initially, the set TC1 of all candidates for level 1 equals D
  • initially, L1 contains all large 1-itemsets

29
An alternative to Apriori: AprioriTid
  • Apply candidate generation to L1 to generate C2

30
An alternative to Apriori: AprioriTid
  • take entry 100 from TC1 and determine the candidates from C2 contained in transaction 100

t100.C2 = { c ∈ C2 | (c minus its last item) ∈ t100.C1 and (c minus its one-but-last item) ∈ t100.C1 } = {BF}
  • increment the counters in C2
  • repeat for all entries from TC1

31
An alternative to Apriori: AprioriTid
  • copy the itemsets with sufficient support to L2
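One AprioriTid pass over TCk can be sketched as follows; a minimal sketch with hypothetical encodings (candidates as sorted tuples of single-letter items, TC as a dict from TID to the candidates the transaction contains), whereas the paper stores candidate IDs:

```python
def aprioritid_pass(TC_prev, Ck):
    """Count Ck's support from TC_prev instead of from the database D.

    A level-k candidate c is in transaction t iff both of its "parent"
    (k-1)-itemsets (c minus its last item, c minus its one-but-last
    item) are among the candidates recorded for t in TC_prev.
    """
    counts = {c: 0 for c in Ck}
    TCk = {}
    for tid, cands in TC_prev.items():
        present = {c for c in Ck
                   if c[:-1] in cands and c[:-2] + c[-1:] in cands}
        for c in present:
            counts[c] += 1
        if present:                 # transactions with no candidates drop out
            TCk[tid] = present
    return TCk, counts

# hypothetical TC1: level-1 candidates per transaction, as 1-tuples
TC1 = {100: {("B",), ("P",)},
       200: {("B",), ("C",), ("F",), ("W",)}}
C2 = {("B", "F"), ("B", "P"), ("C", "W")}
TC2, counts = aprioritid_pass(TC1, C2)
print(TC2)     # {100: {('B', 'P')}, 200: {('B', 'F'), ('C', 'W')}}
print(counts)  # {... ('B', 'F'): 1, ('B', 'P'): 1, ('C', 'W'): 1 ...}
```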

32
An alternative to Apriori: AprioriTid
33
Two-step approach
Phase 1: find all itemsets with support > minsup ("large itemsets")
75% {cheese}
50% {cheese, floppy, wine}
Phase 2: generate all association rules with high confidence and support
34
Phase 2: generating rules
  • minsup requirement
  • start from a large itemset X
  • consider rules A ⇒ (X \ A), where A ⊂ X
  • definition:
  • rule A ⇒ (X \ A) has support s in transaction set D iff s% of the transactions in D contain A ∪ (X \ A) = X
  • minconf requirement
  • (support X) / (support A) at least minconf
  • e.g., X = CFW, A = CF, rule CF ⇒ W
  • support of the rule = support CFW = 2
  • confidence of the rule = (support CFW) / (support CF) = 1
  • if cheese and floppy, then always wine
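With the support counts already in hand from Phase 1, checking one rule is a single division; a sketch using the slide's counts (support CFW = 2, support CF = 2, out of a hypothetical 4 transactions):

```python
def rule_ok(support_X, support_A, n, minconf, minsup):
    """Accept rule A => (X \\ A)?  Its support is support(X)/n and its
    confidence is support(X) / support(A); both must clear the thresholds."""
    return support_X / n >= minsup and support_X / support_A >= minconf

# slide example: X = CFW, A = CF, rule CF => W
print(rule_ok(support_X=2, support_A=2, n=4, minconf=0.5, minsup=0.5))  # True
```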

35
Phase 2: generating rules
  • support of a subset AS of A ≥ support of A
  • so the confidence of AS ⇒ (X \ AS), i.e. (support X / support AS), cannot be more than the confidence of A ⇒ (X \ A), i.e. (support X / support A)
  • if A ⇒ (X \ A) is bad, then AS ⇒ (X \ AS) is also bad
  • if AS ⇒ (X \ AS) is ok, then A ⇒ (X \ A) is also ok
  • e.g., X = {cheese, floppy, wine}
  • if cheese ⇒ {floppy, wine} is ok, then
  • {cheese, floppy} ⇒ wine is ok, and
  • {cheese, wine} ⇒ floppy is ok

36
Phase 2: generating rules
  • select a large k-itemset (k > 1)
  • first generate all rules with one item in the consequent
  • apply candidate generation (see Apriori above) to generate all possible consequents with 2 items, and so on
  • compute the confidence of the rules and store those that are ok
  • repeat until all rules with one item in the condition have been tried
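This consequent-growing loop can be sketched as below, reusing an apriori-gen-style join on the confident consequents; `supports` is a hypothetical dict of itemset → count produced by Phase 1:

```python
def gen_rules(X, supports, minconf):
    """All confident rules A => (X \\ A) from large itemset X (sketch).

    Start with 1-item consequents; larger consequents are grown only
    from confident smaller ones, since moving items out of the
    antecedent can only lower confidence.
    """
    X = tuple(sorted(X))
    rules, consequents = [], [(i,) for i in X]
    while consequents and len(consequents[0]) < len(X):
        ok = []
        for c in consequents:
            antecedent = tuple(i for i in X if i not in c)
            conf = supports[X] / supports[antecedent]
            if conf >= minconf:
                ok.append(c)
                rules.append((antecedent, c, conf))
        # join confident consequents sharing all but their last item
        consequents = sorted({a + (b[-1],) for a in ok for b in ok
                              if a[:-1] == b[:-1] and a[-1] < b[-1]})
    return rules

# hypothetical support counts consistent with the next slide's example
supports = {("C", "F", "W"): 2, ("C", "F"): 2, ("C", "W"): 3, ("F", "W"): 2,
            ("C",): 3, ("F",): 3, ("W",): 3}
for antecedent, consequent, conf in gen_rules(("C", "F", "W"), supports, 0.9):
    print(antecedent, "=>", consequent, round(conf, 2))
```

At minconf 90% only FW ⇒ C and CF ⇒ W survive; at minconf 60% all six rules do.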

37
Phase 2: generating rules
  • select a large k-itemset (k > 1), e.g., CFW
  • 1-item consequents: C, F, W
  • FW ⇒ C (conf 100%)
  • CW ⇒ F (conf 66%)
  • CF ⇒ W (conf 100%)
  • apriori-gen({C, F, W}) = {CF, CW, FW}
  • W ⇒ CF (conf 66%)
  • F ⇒ CW (conf 66%)
  • C ⇒ FW (conf 66%)

38
Fast Discovery of Association Rules
  • Introduction: representation, task
  • Algorithms
  • generate-and-test
  • two-step approach
  • Empirical results
  • AprioriHybrid
  • Sampling

39
Empirical results: synthetic data
  • synthetic transaction data
  • transaction sizes clustered around a mean
  • sizes of large itemsets clustered around a mean
  • Method
  • generate 2000 large itemsets from 1000 items
  • size of each itemset picked from a Poisson distribution, mean |I| = 2, 4, or 6
  • weight of an itemset = probability that it will be picked (weights sum to 1)
  • generate |D| = 100,000 transactions
  • size picked from a Poisson distribution, mean |T| = 5, 10, or 20
  • for the scale-up experiment, |D| = 10 million transactions

40
Empirical results: synthetic data
(figure: Apriori runtimes)
41
Empirical results
42
Fast Discovery of Association Rules
  • Introduction: representation, task
  • Algorithms
  • generate-and-test
  • two-step approach
  • Empirical results
  • AprioriHybrid
  • Sampling

43
Algorithm AprioriHybrid
  • AprioriTid replaces pass over data by pass over
    TCk
  • effective when TCk becomes small compared to size
    of database
  • AprioriTid beats Apriori
  • when TCk sets fit in memory
  • distribution of large itemsets has long tail
  • Hybrid algorithm AprioriHybrid
  • use Apriori in initial passes
  • switch to AprioriTid when TCk expected to fit in
    memory

44
Algorithm AprioriHybrid
  • heuristic used for switching
  • estimate the size of TCk from Ck
  • size(TCk) ≈ Σ_{candidates c ∈ Ck} support(c) + number of transactions
  • if TCk fits in memory and the number of candidates is decreasing, then switch to AprioriTid
  • AprioriHybrid outperforms Apriori and AprioriTid in almost all cases
  • a little worse if the switch pass is the last one
  • cost of switching without the benefits
  • AprioriHybrid up to 30% better than Apriori, up to 60% better than AprioriTid
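The heuristic itself is a one-line estimate plus a trend check; a sketch (the memory limit here is counted in entries rather than bytes, a simplifying assumption):

```python
def should_switch(candidate_supports, num_transactions,
                  prev_num_candidates, memory_limit):
    """AprioriHybrid switching heuristic (sketch).

    Estimated size of TCk: one entry per (transaction, candidate)
    containment, i.e. the sum of the candidates' support counts,
    plus one entry per transaction.
    """
    est_size = sum(candidate_supports.values()) + num_transactions
    shrinking = len(candidate_supports) < prev_num_candidates
    return est_size <= memory_limit and shrinking

# toy call: 2 candidates with support counts 2 each, 4 transactions
print(should_switch({("B", "F"): 2, ("C", "W"): 2}, 4, 10,
                    memory_limit=50))  # True
```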

45
Algorithm AprioriHybrid: scale-up experiment (figure)
46
Fast Discovery of Association Rules
  • Introduction: representation, task
  • Algorithms
  • generate-and-test
  • two-step approach
  • Empirical results
  • AprioriHybrid
  • Sampling

47
Sampling
  • running time of the Apriori family of algorithms is bounded by O(|C| · |D|)
  • |C| denotes the sum of the sizes of the candidates considered
  • |D| denotes the size of the database
  • sampling is a possible way to reduce running time
  • let s be the true support of itemset X
  • take a random sample with replacement of size h from the database
  • x = number of transactions in the sample containing X
  • x is binomially distributed: h trials, probability of success s
  • the probability that the estimated support is off by at least ε is bounded by a quantity exponential in h:
  • Pr[x > h(s + ε)] < e^(−2ε²h)
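Rearranging this Chernoff bound gives the sample size needed to push the failure probability below a chosen δ; a sketch (δ = 1% and the 1% / 0.1% accuracy figures are illustrative choices, not from the slides):

```python
import math

def sample_size(eps, delta):
    """Smallest h with exp(-2 * eps**2 * h) <= delta, i.e. the sample size
    at which overestimating the support by eps is this unlikely."""
    return math.ceil(math.log(1 / delta) / (2 * eps**2))

print(sample_size(0.01, 0.01))   # 23026: off by 1% needs tens of thousands
print(sample_size(0.001, 0.01))  # off by 0.1% needs ~100x more samples
```

The 1/ε² dependence is why the next slide notes that sampling stops paying off for supports of fractions of a percent.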

48
Sampling
  • support off by 1%: thousands of examples are sufficient
  • sampling is not effective for supports of fractions of a percent
  • completeness (the guarantee of finding all rules satisfying minsup and minconf) is lost

49
Fast Discovery of Association Rules: conclusions
  • Introduction: representation, task
  • Algorithms
  • generate-and-test
  • two-step approach
  • Empirical results
  • AprioriHybrid
  • Sampling

50
Fast Discovery of Association Rules: next
  • Parallel algorithms
  • Mining sequence data
  • Case-study