Transcript and Presenter's Notes

Title: Association Rule Mining


1
Association Rule Mining
  • Instructor: Qiang Yang
  • Thanks to Jiawei Han and Jian Pei

2
What Is Frequent Pattern Mining?
  • Frequent pattern: a pattern (a set of items, a
    sequence, etc.) that occurs frequently in a
    database [AIS93]
  • Frequent pattern mining: finding regularities in
    data
  • What products were often purchased together?
    Beer and diapers?!
  • What are the subsequent purchases after buying a
    PC?
  • What kinds of DNA are sensitive to this new drug?
  • Can we automatically classify web documents?

3
Why Is Frequent Pattern Mining an Essential Task
in Data Mining?
  • Foundation for many essential data mining tasks
  • Association, correlation, causality
  • Sequential patterns, temporal or cyclic
    association, partial periodicity, spatial and
    multimedia association
  • Associative classification, cluster analysis,
    iceberg cube, fascicles (semantic data
    compression)
  • Broad applications
  • Basket data analysis, cross-marketing, catalog
    design, sales campaign analysis
  • Web log (click stream) analysis, DNA sequence
    analysis, etc.

4
Basic Concepts: Frequent Patterns and Association
Rules
  • Itemset X = {x1, ..., xk}
  • Find all the rules X → Y with minimum confidence
    and support
  • support, s: probability that a transaction
    contains X ∪ Y
  • confidence, c: conditional probability that a
    transaction containing X also contains Y

Transaction-id Items bought
10 A, B, C
20 A, C
30 A, D
40 B, E, F
Let min_support = 50%, min_conf = 50%. Then:
A → C (50%, 66.7%), C → A (50%, 100%)
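
These numbers are easy to verify in code. Below is a
minimal Python sketch (ours, not from the deck) that
computes support and confidence over the four
transactions above:

    transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]

    def support(itemset, transactions):
        """Fraction of transactions containing every item of `itemset`."""
        return sum(1 for t in transactions if itemset <= t) / len(transactions)

    def confidence(lhs, rhs, transactions):
        """Pr(RHS | LHS) = support(LHS ∪ RHS) / support(LHS)."""
        return support(lhs | rhs, transactions) / support(lhs, transactions)

    print(support({"A", "C"}, transactions))       # 0.5   -> 50% support for A → C
    print(confidence({"A"}, {"C"}, transactions))  # 0.667 -> 66.7% confidence
    print(confidence({"C"}, {"A"}, transactions))  # 1.0   -> 100% confidence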
5
Concept: Frequent Itemsets
Outlook Temperature Humidity Play
sunny hot high no
sunny hot high no
overcast hot high yes
rainy mild high yes
rainy cool normal yes
rainy cool normal no
overcast cool normal yes
sunny mild high no
sunny cool normal yes
rainy mild normal yes
sunny mild normal yes
overcast mild high yes
overcast hot normal yes
rainy mild high no
  • Minimum support = 2
  • {sunny, hot, no}
  • {sunny, hot, high, no}
  • {rainy, normal}
  • Minimum support = 3
  • ?
  • How strong is {sunny, no}?
  • Count
  • Percentage

6
Concept: Itemsets → Rules
  • {sunny, hot, no} = {Outlook=sunny, Temp=hot,
    Play=no}
  • Generate a rule:
  • Outlook=sunny and Temp=hot → Play=no
  • How strong is this rule?
  • Support of the rule
  •   = support of the itemset {sunny, hot, no} = 2
      = Pr(sunny, hot, no)
  • Either expressed in count form or percentage form
  • Confidence = Pr(Play=no | Outlook=sunny,
    Temp=hot)
  • In general, for LHS → RHS: Confidence = Pr(RHS | LHS)
  • Confidence
  •   = Pr(RHS | LHS)
  •   = count(LHS and RHS) / count(LHS)
  • What is the confidence of Outlook=sunny → Play=no?
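
The closing question can be answered directly from
the weather table on the previous slide. A small
Python sketch (ours, not from the deck):

    # The 14 (Outlook, Temperature, Humidity, Play) records from the table.
    records = [
        ("sunny", "hot", "high", "no"), ("sunny", "hot", "high", "no"),
        ("overcast", "hot", "high", "yes"), ("rainy", "mild", "high", "yes"),
        ("rainy", "cool", "normal", "yes"), ("rainy", "cool", "normal", "no"),
        ("overcast", "cool", "normal", "yes"), ("sunny", "mild", "high", "no"),
        ("sunny", "cool", "normal", "yes"), ("rainy", "mild", "normal", "yes"),
        ("sunny", "mild", "normal", "yes"), ("overcast", "mild", "high", "yes"),
        ("overcast", "hot", "normal", "yes"), ("rainy", "mild", "high", "no"),
    ]

    lhs = sum(1 for o, _, _, p in records if o == "sunny")                 # 5
    both = sum(1 for o, _, _, p in records if o == "sunny" and p == "no")  # 3
    print(both / lhs)  # 0.6 -> Outlook=sunny → Play=no holds with 60% confidence

In count form: 3 of the 5 sunny transactions have
Play=no, so the confidence is 3/5 = 60%.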

7
6.1.3 Types of Association Rules
  • Quantitative
  • age(X, 30..39) and income(X, 42K..48K) →
    buys(X, TV)
  • Single vs. multiple dimensions
  • buys(X, computer) → buys(X, financial software)
    is single-dimensional; the example above is
    multi-dimensional
  • Levels of abstraction
  • age(X, ..) → buys(X, laptop computer)
  • age(X, ..) → buys(X, computer)
  • Extensions
  • Max-patterns
  • Closed itemsets

8
Frequent Patterns
  • Patterns = Itemsets
  • {i1, i2, ..., in}, where each item is a pair
    (Attribute = value)
  • Frequent patterns
  • Itemsets whose support > minimum support
  • Support
  • count(itemset) / count(database)

9
Max-patterns
  • Max-pattern: a frequent pattern that has no
    proper frequent super-pattern
  • BCDE and ACD are max-patterns
  • BCD is not a max-pattern (its superset BCDE is
    frequent)

Tid Items
10  A, B, C, D, E
20  B, C, D, E
30  A, C, D, F

Min_sup = 2
10
Frequent Max Patterns
  • Succinct expression of frequent patterns
  • Let {a, b, c} be frequent
  • Then {a, b}, {b, c}, {a, c} must also be
    frequent
  • Then {a}, {b}, {c} must also be frequent
  • By writing down {a, b, c} once, we save a lot of
    computation
  • Max-pattern
  • If {a, b, c} is a frequent max-pattern, then {a,
    b, c, x} is NOT a frequent pattern, for any other
    item x

11
Find Frequent Max-Patterns
Outlook Temperature Humidity Play
sunny hot high no
sunny hot high no
overcast hot high yes
rainy mild high yes
rainy cool normal yes
rainy cool normal no
overcast cool normal yes
sunny mild high no
sunny cool normal yes
rainy mild normal yes
sunny mild normal yes
overcast mild high yes
overcast hot normal yes
rainy mild high no
  • Minimum support = 2
  • Is {sunny, hot, no} a frequent max-pattern?

12
Closed Patterns
  • A closed itemset X has no superset X' such that
    every transaction containing X also contains X'
  • {a, b}, {a, b, d}, {a, b, c} are frequent closed
    patterns
  • But {a, b} is not a max-pattern
  • Concise representation of frequent patterns
  • Reduces the number of patterns and rules
  • N. Pasquier et al., ICDT'99

Min_sup = 2

TID Items
10  a, b, c
20  a, b, c
30  a, b, d
40  a, b, d
50  c, e, f
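
Both notions can be checked by brute force on this
toy database. A Python sketch (ours; the exhaustive
enumeration is exponential and only suitable for
tiny examples like this one):

    from itertools import combinations

    transactions = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "b", "d"},
                    {"a", "b", "d"}, {"c", "e", "f"}]
    min_sup = 2
    items = sorted(set().union(*transactions))

    def count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # Enumerate all frequent itemsets by brute force.
    frequent = {frozenset(c)
                for k in range(1, len(items) + 1)
                for c in combinations(items, k)
                if count(set(c)) >= min_sup}

    # Closed: no frequent proper superset has the same support.
    closed = {x for x in frequent
              if not any(x < y and count(y) == count(x) for y in frequent)}
    # Max: no frequent proper superset at all.
    maximal = {x for x in frequent if not any(x < y for y in frequent)}

    print(sorted(map(sorted, closed)))
    # [['a','b'], ['a','b','c'], ['a','b','d'], ['c']] -- the slide lists the first three
    print(sorted(map(sorted, maximal)))
    # [['a','b','c'], ['a','b','d']] -- {a, b} is closed but not maximal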
13
Mining Association Rules: an Example
Min. support = 50%; Min. confidence = 50%
Transaction-id Items bought
10 A, B, C
20 A, C
30 A, D
40 B, E, F
Frequent pattern  Support
{A}               75%
{B}               50%
{C}               50%
{A, C}            50%
  • For rule A → C:
  • support = support(A ∪ C) = 50%
  • confidence = support(A ∪ C) / support(A) = 66.6%

14
Method 1: Apriori, a Candidate Generation-and-Test
Approach
  • Any subset of a frequent itemset must be frequent
  • If {beer, diaper, nuts} is frequent, so is {beer,
    diaper}
  • Every transaction containing {beer, diaper, nuts}
    also contains {beer, diaper}
  • Apriori pruning principle: if there is any
    itemset that is infrequent, its supersets should
    not be generated/tested!
  • Method
  • generate length-(k+1) candidate itemsets from
    length-k frequent itemsets, and
  • test the candidates against the DB
  • Performance studies show its efficiency and
    scalability
  • Agrawal & Srikant 1994; Mannila et al. 1994

15
The Apriori Algorithm: An Example (min_sup = 2)

Database TDB:
Tid Items
10  A, C, D
20  B, C, E
30  A, B, C, E
40  B, E

1st scan: count all 1-itemsets (C1), keep the
frequent ones (L1):

C1:
Itemset  sup
{A}      2
{B}      3
{C}      3
{D}      1
{E}      3

L1:
Itemset  sup
{A}      2
{B}      3
{C}      3
{E}      3

C2, generated from L1:
{A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}

2nd scan: count the C2 candidates, keep the
frequent ones (L2):

C2:
Itemset  sup
{A, B}   1
{A, C}   2
{A, E}   1
{B, C}   2
{B, E}   3
{C, E}   2

L2:
Itemset  sup
{A, C}   2
{B, C}   2
{B, E}   3
{C, E}   2

C3, generated from L2:
{B, C, E}

3rd scan:

L3:
Itemset    sup
{B, C, E}  2
16
The Apriori Algorithm
  • Pseudo-code:
        Ck: candidate itemsets of size k
        Lk: frequent itemsets of size k

        L1 = {frequent items};
        for (k = 1; Lk != ∅; k++) do begin
            Ck+1 = candidates generated from Lk;
            for each transaction t in database do
                increment the count of all candidates
                in Ck+1 that are contained in t
            Lk+1 = candidates in Ck+1 with min_support
        end
        return ∪k Lk;
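
For concreteness, here is one way to render this
pseudocode as runnable Python. It is a minimal
sketch, not the original implementation: candidate
generation uses the self-join-and-prune scheme of
slides 18-19, and support counting is a plain scan
rather than the hash tree of slide 20.

    from itertools import combinations

    def apriori(transactions, min_support):
        """Return {itemset: count} for all frequent itemsets."""
        transactions = [frozenset(t) for t in transactions]

        # L1: count single items, keep the frequent ones.
        counts = {}
        for t in transactions:
            for item in t:
                key = frozenset([item])
                counts[key] = counts.get(key, 0) + 1
        Lk = {s: c for s, c in counts.items() if c >= min_support}
        frequent, k = dict(Lk), 1

        while Lk:
            # Self-join Lk, then prune candidates with an infrequent k-subset.
            prev = set(Lk)
            candidates = {p | q for p in prev for q in prev
                          if len(p | q) == k + 1
                          and all(frozenset(s) in prev
                                  for s in combinations(p | q, k))}
            # One database scan counts the surviving candidates.
            counts = {c: 0 for c in candidates}
            for t in transactions:
                for c in candidates:
                    if c <= t:
                        counts[c] += 1
            Lk = {s: c for s, c in counts.items() if c >= min_support}
            frequent.update(Lk)
            k += 1
        return frequent

    # The TDB of the example above; reproduces L1, L2 and L3 = {B, C, E}: 2.
    tdb = [["A","C","D"], ["B","C","E"], ["A","B","C","E"], ["B","E"]]
    for s, sup in sorted(apriori(tdb, 2).items(),
                         key=lambda kv: (len(kv[0]), sorted(kv[0]))):
        print(sorted(s), sup)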

17
Important Details of Apriori
  • How to generate candidates?
  • Step 1: self-joining Lk
  • Step 2: pruning
  • How to count supports of candidates?

18
Example of Candidate Generation
  • L3 = {abc, abd, acd, ace, bcd}
  • Self-joining: L3 ⋈ L3
  • abcd from abc and abd
  • acde from acd and ace
  • Pruning
  • acde is removed because ade is not in L3
  • C4 = {abcd}

19
How to Generate Candidates?
  • Suppose the items in Lk-1 are listed in an order
  • Step 1: self-joining Lk-1
        insert into Ck
        select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
        from Lk-1 p, Lk-1 q
        where p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2,
              p.itemk-1 < q.itemk-1
  • Step 2: pruning
        forall itemsets c in Ck do
            forall (k-1)-subsets s of c do
                if (s is not in Lk-1) then delete c from Ck
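
In Python, the same join-and-prune step might look
as follows (a sketch, assuming itemsets are kept as
sorted tuples); it reproduces the slide-18 example:

    from itertools import combinations

    def generate_candidates(L_prev, k):
        """Self-join Lk-1 on the first k-2 items, then prune."""
        prev = {tuple(sorted(s)) for s in L_prev}
        Ck = set()
        for p in prev:
            for q in prev:
                # Join: first k-2 items equal, last item of p < last item of q.
                if p[:-1] == q[:-1] and p[-1] < q[-1]:
                    c = p + (q[-1],)
                    # Prune: every (k-1)-subset of c must be in Lk-1.
                    if all(s in prev for s in combinations(c, k - 1)):
                        Ck.add(c)
        return Ck

    # Slide-18 example: L3 = {abc, abd, acd, ace, bcd} gives C4 = {abcd}.
    L3 = [("a","b","c"), ("a","b","d"), ("a","c","d"),
          ("a","c","e"), ("b","c","d")]
    print(generate_candidates(L3, 4))  # {('a','b','c','d')}; acde pruned, ade not in L3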

20
How to Count Supports of Candidates?
  • Why is counting the supports of candidates a
    problem?
  • The total number of candidates can be huge
  • One transaction may contain many candidates
  • Method
  • Candidate itemsets are stored in a hash-tree
  • Leaf node of hash-tree contains a list of
    itemsets and counts
  • Interior node contains a hash table
  • Subset function finds all the candidates
    contained in a transaction
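
A simplified hash tree might be coded as below. This
is an illustrative sketch, not the original data
structure: the hash function, bucket count (7), and
leaf capacity (3) are arbitrary choices, and a
per-transaction set guards against counting a
candidate twice when a leaf is reached by two paths.

    from collections import defaultdict

    NBUCKETS, MAX_LEAF = 7, 3  # illustrative sizes, not from the slides

    def h(item):
        return hash(item) % NBUCKETS

    class Node:
        """A leaf holds a candidate list; an interior node hashes onward."""
        def __init__(self, candidates, k, depth=0):
            self.candidates, self.children = candidates, {}
            if len(candidates) > MAX_LEAF and depth < k:
                groups = defaultdict(list)
                for c in candidates:           # split on the item at this depth
                    groups[h(c[depth])].append(c)
                self.candidates = None
                self.children = {b: Node(cs, k, depth + 1)
                                 for b, cs in groups.items()}

    def find_contained(node, t, start, found):
        """Collect stored candidates that are subsets of sorted transaction t."""
        if node.candidates is not None:        # leaf: test each candidate
            found.update(c for c in node.candidates if set(c) <= set(t))
            return
        for i in range(start, len(t)):         # interior: hash each remaining item
            child = node.children.get(h(t[i]))
            if child is not None:
                find_contained(child, t, i + 1, found)

    def count_supports(candidates, transactions, k):
        tree = Node(sorted(candidates), k)
        counts = {c: 0 for c in candidates}
        for t in transactions:
            found = set()                      # avoids duplicate leaf visits
            find_contained(tree, tuple(sorted(t)), 0, found)
            for c in found:
                counts[c] += 1
        return counts

    # The C2 candidates and TDB of slide 15; reproduces slide 15's C2 counts.
    C2 = [("A","B"), ("A","C"), ("A","E"), ("B","C"), ("B","E"), ("C","E")]
    tdb = [("A","C","D"), ("B","C","E"), ("A","B","C","E"), ("B","E")]
    print(count_supports(C2, tdb, 2))  # ('A','C'): 2, ('B','E'): 3, ('C','E'): 2, ...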

21
Speeding Up Association Rules
  • The Dynamic Hashing and Pruning (DHP) technique

Thanks to Cheng Hong and Hu Haibo
22
DHP Reduce the Number of Candidates
  • A k-itemset whose corresponding hash bucket
    count is below the threshold cannot be frequent
  • Candidates: a, b, c, d, e
  • Hash buckets: {ab, ad, ae}, {bd, be, de}
  • Frequent 1-itemsets: a, b, d, e
  • ab is not a candidate 2-itemset if the sum of the
    counts of ab, ad, and ae is below the support
    threshold
  • J. Park, M. Chen, and P. Yu. An effective
    hash-based algorithm for mining association
    rules. SIGMOD'95

23
Still Challenging: the Niche for DHP
  • DHP (Park '95): Dynamic Hashing and Pruning
  • The set of candidate large 2-itemsets is huge
  • DHP trims it using hashing
  • The transaction database is so huge that one scan
    per iteration is costly
  • DHP prunes both the number of transactions and
    the number of items in each transaction after
    each iteration

24
What does it look like?

Apriori:  Generate candidate set → Count support
DHP:      Generate candidate set → Count support →
          Make new hash table
25
Hash Table Construction
  • Consider 2-itemsets. All items are numbered
    i1, i2, ..., in. Any pair (x, y) is hashed
    according to
  • Hash function (bucket number):
  • h({x, y}) = ((order of x) × 10 + (order of y)) mod 7
  • Example
  • Items A, B, C, D, E have orders 1, 2, 3, 4, 5
  • h({C, E}) = (3 × 10 + 5) mod 7 = 0
  • Thus {C, E} belongs to bucket 0.
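
As a quick check, the bucket computation in Python
(item orders as on the slide):

    order = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5}

    def h(x, y):
        """Bucket number for the 2-itemset {x, y}, x before y in item order."""
        return (order[x] * 10 + order[y]) % 7

    print(h("C", "E"))  # 0 -> {C, E} lands in bucket 0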

26
How to trim candidate itemsets
  • In iteration k, hash all appearing (k+1)-itemsets
    into a hash table, counting all occurrences of an
    itemset in its corresponding bucket
  • In iteration k+1, examine each candidate itemset
    to see whether its corresponding bucket count is
    above the support threshold (a necessary
    condition)

27
Example
TID Items
100 A C D
200 B C E
300 A B C E
400 B E
Figure 1. An example transaction database
28
Generation of C1 and L1 (1st iteration)

C1:
Itemset Sup
A       2
B       3
C       3
D       1
E       3

L1:
Itemset Sup
A       2
B       3
C       3
E       3
29
Hash Table Construction
  • Find all 2-itemsets of each transaction

TID  2-itemsets
100  {A C} {A D} {C D}
200  {B C} {B E} {C E}
300  {A B} {A C} {A E} {B C} {B E} {C E}
400  {B E}
30
Hash Table Construction (2)
  • Hash function:
  • h({x, y}) = ((order of x) × 10 + (order of y)) mod 7
  • Hash table:

bucket 0: {C E} {C E} {A D}   count 3
bucket 1: {A E}               count 1
bucket 2: {B C} {B C}         count 2
bucket 3: (empty)             count 0
bucket 4: {B E} {B E} {B E}   count 3
bucket 5: {A B}               count 1
bucket 6: {A C} {C D} {A C}   count 3
31
C2 Generation (2nd iteration)

C2 of Apriori       Bucket count
(L1 ⋈ L1)
{A, B}              1
{A, C}              3
{A, E}              1
{B, C}              2
{B, E}              3
{C, E}              3

Resulting C2 (bucket count ≥ min support):
{A, C}
{B, C}
{B, E}
{C, E}
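
The whole first DHP pass can be sketched in a few
lines of Python (illustrative, not the paper's
code); it reproduces the bucket counts of slide 30
and the trimmed C2 above:

    from itertools import combinations

    # Figure 1 database, item orders A=1 .. E=5 from slide 25, min support 2.
    tdb = [("A","C","D"), ("B","C","E"), ("A","B","C","E"), ("B","E")]
    order = {c: i + 1 for i, c in enumerate("ABCDE")}
    min_sup = 2

    def h(pair):
        x, y = sorted(pair, key=order.get)
        return (order[x] * 10 + order[y]) % 7

    # Pass 1: hash every 2-itemset of every transaction into a bucket.
    buckets = [0] * 7
    for t in tdb:
        for pair in combinations(t, 2):
            buckets[h(pair)] += 1
    print(buckets)  # [3, 1, 2, 0, 3, 1, 3] -- the bucket counts of slide 30

    # Pass 2: keep only pairs over L1 whose bucket count reaches min_sup.
    L1 = ["A", "B", "C", "E"]  # D was infrequent
    C2 = [p for p in combinations(L1, 2) if buckets[h(p)] >= min_sup]
    print(C2)  # [('A','C'), ('B','C'), ('B','E'), ('C','E')] -- AB and AE pruned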
32
Effective Database Pruning
  • Apriori
  • Does not prune the database
  • Prunes Ck by support counting on the original
    database
  • DHP
  • More efficient support counting can be achieved
    on the pruned database

33
Performance Comparison
34
Performance Comparison (2)
35
Conclusion
  • An effective hash-based algorithm for candidate
    itemset generation
  • Two-phase transaction database pruning
  • Much more efficient (in time and space) than the
    Apriori algorithm