Title: Advanced Topics in Data Mining: Association Rules
1Advanced Topics in Data MiningAssociation Rules
2What Is Association Mining?
- Association Rule Mining
- Finding frequent patterns, associations,
correlations, or causal structures among item
sets in transaction databases, relational
databases, and other information repositories - Applications
- Market basket analysis (marketing strategy items
to put on sale at reduced prices),
cross-marketing, catalog design, shelf space
layout design, etc - Examples
- Rule form Body Head Support, Confidence.
- buys(x, Computer) buys(x, Software) 2,
60 - major(x, CS) takes(x, DB) grade(x, A)
1, 75
3Market Basket Analysis
Typically, association rules are considered
interesting if they satisfy both a minimum
support threshold and a minimum confidence
threshold.
4Rule Measures Support and Confidence
- Let minimum support 50, and minimum confidence
50, we have - A ? C 50, 66.6
- C ? A 50, 100
5Support Confidence
6Association Rule Basic Concepts
- Given
- (1) database of transactions,
- (2) each transaction is a list of items
(purchased by a customer in a visit) - Find all rules that correlate the presence of one
set of items with that of another set of items - Find all the rules A ? B with minimum confidence
and support - support, s, P(A ? B)
- confidence, c, P(BA)
7Association Rule MiningA Road Map
- Boolean vs. quantitative associations (Based on
the types of values handled in the rule set) - buys(x, SQLServer) buys(x, DM Book) ?
buys(x, DBMiner) 0.2, 60 - age(x, 30..39) income(x, 42..48K) ? buys(x,
PC) 1, 75 - Single dimension vs. multiple dimensional
associations - Single level vs. multiple-level analysis (Based
on the levels of abstractions involved in the
rule set)
8Terminologies
- Item
- I1, I2, I3,
- A, B, C,
- Itemset
- I1, I1, I7, I2, I3, I5,
- A, A, G, B, C, E,
- 1-Itemset
- I1, I2, A,
- 2-Itemset
- I1, I7, I3, I5, A, G,
9Terminologies
- K-Itemset
- If the length of the itemset is K
- Frequent K-Itemset
- If the length of the itemset is K and the itemset
satisfies a minimum support threshold. - Association Rule
- If a rule satisfies both a minimum support
threshold and a minimum confidence threshold
10Analysis
- The number of itemsets of a given cardinality
tends to grow exponentially
11Mining Association Rules Apriori Principle
Min. support 50 Min. confidence 50
- For rule A ? C
- support support(A ? C) 50
- confidence support(A ? C)/support(A)
66.6 - The Apriori principle
- Any subset of a frequent itemset must be frequent
12Mining Frequent Itemsets the Key Step
- Find the frequent itemsets the sets of items
that have minimum support - A subset of a frequent itemset must also be a
frequent itemset - i.e., if AB is a frequent itemset, both A and
B should be a frequent itemset - Iteratively find frequent itemsets with
cardinality from 1 to k (k-itemset) - Use the frequent itemsets to generate
association rules
13Example
14Apriori Algorithm
15Apriori Algorithm
16Apriori Algorithm
17Example of Generating Candidates
- L3abc, abd, acd, ace, bcd
- Self-joining L3L3
- abcd from abc and abd
- acde from acd and ace
- Pruning
- acde is removed because ade is not in L3
- C4abcd
18Another Example 1
19Another Example 2
20Is Apriori Fast Enough? Performance Bottlenecks
- The core of the Apriori algorithm
- Use frequent (k1)-itemsets to generate candidate
frequent k-itemsets - Use database scan to collect counts for the
candidate itemsets - The bottleneck of Apriori
- Huge candidate sets
- 104 frequent 1-itemset will generate 107
candidate 2-itemsets - To discover a frequent pattern of size 100, e.g.,
a1, a2, , a100, one needs to generate 2100 ?
1030 candidates. - Multiple scans of database
- Needs (n 1) scans, n is the length of the
longest pattern
21Demo-IBM Intelligent Minner
22Demo Database
23(No Transcript)
24(No Transcript)
25(No Transcript)
26Methods to Improve Aprioris Efficiency
- Hash-based itemset counting A k-itemset whose
corresponding hashing bucket count is below the
threshold cannot be frequent - Transaction reduction A transaction that does
not contain any frequent k-itemset is useless in
subsequent scans - Partitioning Any itemset that is potentially
frequent in DB must be frequent in at least one
of the partitions of DB - Sampling mining on a subset of given data, lower
support threshold a method to determine the
completeness
27Partitioning
28Hash-Based Itemset Counting
29Compare Apriori DHP (Direct Hash Pruning)
Apriori
30Compare Apriori DHP
DHP
31DHP Database Trimming
32DHP (Direct Hash Pruning)
- A database has four transactions
- Let min_sup 50
33Example Apriori
34Example DHP
35Example DHP
36Example DHP
37Mining Frequent Patterns Without Candidate
Generation
- Compress a large database into a compact,
Frequent-Pattern tree (FP-tree) structure - highly condensed, but complete for frequent
pattern mining - avoid costly database scans
- Develop an efficient, FP-tree-based frequent
pattern mining method - A divide-and-conquer methodology decompose
mining tasks into smaller ones - Avoid candidate generation sub-database test
only!
38Construct FP-tree from a Transaction DB
39Construction Steps
- Scan DB once, find frequent 1-itemset (single
item pattern) - Order frequent items in frequency descending
order - Sorting DB according to the frequency descending
order - Scan DB again, construct FP-tree
40Benefits of the FP-Tree Structure
- Completeness
- never breaks a long pattern of any transaction
- preserves complete information for frequent
pattern mining - Compactness
- reduce irrelevant informationinfrequent items
are gone - frequency descending ordering more frequent
items are more likely to be shared - never be larger than the original database (if
not count node-links and counts) - Compression ratio could be over 100
41Frequent Pattern Growth
Order frequent items in frequency descending order
- For I5
- I1, I5, I2, I5, I2, I1, I5
- For I4
- I2, I4
- For I3
- I1, I3, I2, I3, I2, I1, I3
- For I1
- I2, I1
42Frequent Pattern Growth
- For I5
- I1, I5, I2, I5, I2, I1, I5
- For I4
- I2, I4
- For I3
- I1, I3, I2, I3, I2, I1, I3
- For I1
- I2, I1
Sub DB
Trimming Databases
Sub DB
Sub DB
FP-tree
Sub DB
Conditional FP-tree from Conditional Pattern-Base
43Conditional FP-tree
Conditional FP-tree from Conditional
Pattern-Base for I3
44Mining Results Using FP-tree
- For I5 (???NULL)
- Conditional Pattern Base
- (I2I11), (I2I1I31)
- Conditional FP-tree
- Generate Frequent Itemsets
- I22 Rule I2I52
- I12 Rule I1I52
- I2I12 Rule I2I1I52
? NULL 2
Item ID Support count Node Link
I2 2
I1 2
? I2 2
? I1 2
45Mining Results Using FP-tree
- For I4
- Conditional Pattern Base
- (I2I11), (I21)
- Conditional FP-tree
- Generate Frequent Itemsets
- I22 Rule I2I42
? NULL 2
Item ID Support count Node Link
I2 2
? I2 2
46Mining Results Using FP-tree
- For I3
- Conditional Pattern Base
- (I2I12), (I22), (I12)
- Conditional FP-tree
? NULL 4
Item ID Support count Node Link
I2 4
I1 4
? I2 4
? I1 2
? I1 2
47Mining Results Using FP-tree
- For I1/I3
- Conditional Pattern Base
- (NULL2), (I22)
- Conditional FP-tree
- Generate Frequent Itemsets
- Null4 Rule I1I34
- I22 Rule I2I1I32
? NULL 4
Item ID Support count Node Link
I2 2
? I2 2
48Mining Results Using FP-tree
- For I2/I3
- Conditional Pattern Base
- (NULL4)
- Conditional FP-tree
- Generate Frequent Itemsets
- Null Rule I2I34
? NULL 4
49Mining Results Using FP-tree
- For I1
- Conditional Pattern Base
- (NULL2), (I24)
- Conditional FP-tree
- Generate Frequent Itemsets
- I24 Rule I2I14
? NULL 6
Item ID Support count Node Link
I2 4
? I2 4
50Mining Results Using FP-tree
51Mining Frequent PatternsUsing FP-tree
- General idea (divide-and-conquer)
- Recursively grow frequent pattern path using the
FP-tree - Method
- For each item, construct its conditional
pattern-base, and then its conditional FP-tree - Repeat the process on each newly created
conditional FP-tree - Until the resulting FP-tree is empty, or it
contains only one path (single path will generate
all the combinations of its sub-paths, each of
which is a frequent pattern)
52Major Steps to Mine FP-tree
- Construct conditional pattern base for each item
in the FP-tree - Construct conditional FP-tree from each
conditional pattern-base - Recursively mine conditional FP-trees and grow
frequent patterns obtained so far - If the conditional FP-tree contains a single
path, simply enumerate all the patterns
53Virtual Items in Association Mining
- Different region exhibit different selling
patterns. Thus, including as virtual item the
information on the location or the type of stores
(existing or new) where the purchase was made
will enable the comparisons between locations or
types within a single chain - Virtual item may include information on whether
the purchase was made with cash, a credit card or
check. The inclusion of such virtual item allows
to analyze the association between the payment
method and items purchased. - Virtual item may include information on the day
of the week or the time of the day the
transaction occurred. The inclusion of such
virtual item allows to analyze the association
between the transaction time and items purchased
54Virtual Items An Example
55Dissociation Rules
- A dissociation rule is similar to an association
rule except that it can have not item-name in
the condition or the result of the rule - A and not B C
- A and D not E
- Dissociation rules can be generated by a simple
adaptation of the association rule analysis
56Discussions
- The size of a typical transaction grows because
it now includes inverted items - The total number of items used in the analysis
doubles - Since the amount of computation grows
exponentially with the number of items, doubling
the number of items seriously degrades
performance - The frequency of the inverted items tends to be
much larger than the frequency of the original
items. So, it tends to produce rules in which all
items are inverted. These rules are less likely
to be actionable. - not A and not B not C
- It is useful to invert only the most frequent
items in the set used for analysis. It is also
useful to invert some items whose inverses are of
interest.
57Interestingness Measurements
- Subjective Measures
- A rule (pattern) is interesting if
- it is unexpected (surprising to the user)
- actionable (the user can do something with it)
- Objective Measures
- Two popular measurements
- Support
- confidence
58Criticism to Support and Confidence
- Example 1
- Among 5000 students
- 3000 play basketball
- 3750 eat cereal
- 2000 both play basket ball and eat cereal
- play basketball ? eat cereal 40, 66.7 is
misleading because the overall percentage of
students eating cereal is 75 which is higher
than 66.7.
59Criticism to Support and Confidence
- Example 2
- X and Y positively correlated,
- X and Z, negatively related
- support and confidence of
- XgtZ dominates
- We need a measure of dependent or correlated
events
60Criticism to Support and Confidence
- Improvement (Correlation)
- Taking both P(A) and P(B) in consideration
- P(AB)P(B)P(A), if A and B are independent
events - A and B negatively correlated, if the value is
less than 1otherwise A and B positively
correlated - When improvement is less than 1, negating the
result produces a better rule - X gt NOT Z
61Multiple-Level Association Rules
- Items often form hierarchy
- Items at the lower level are expected to have
lower support - Rules regarding itemsets at appropriate levels
could be quite useful - Transaction database can be encoded based on
dimensions and levels - We can explore multi-level mining
62Transaction Database
63Concept Hierarchy
64Mining Multi-Level Associations
- A top_down, progressive deepening approach
- First find high-level strong rulesmilk bread
20, 60 - Then find their lower-level weaker rules2
milk wheat bread 6, 50 - Variations at mining multiple-level association
rules. - Cross-level association rules (Generalized Asso.
Rules)2 milk Wonder wheat bread - Association rules with multiple, alternative
hierarchies2 milk Wonder bread
65Multi-level Association Uniform Support vs.
Reduced Support
- Uniform Support the same minimum support for all
levels - One minimum support threshold is needed.
- Lower level items do not occur as frequently. If
support threshold - too high ? miss low level associations
- too low ? generate too many high level
associations - Reduced Support reduced minimum support at lower
levels - There are 4 search strategies
- Level-by-level independent
- Level-cross filtering by k-itemset
- Level-cross filtering by single item
- Controlled level-cross filtering by single item
66 Uniform Support
?
- Optimization Technique
- The search avoids examining itemsets containing
any item whose ancestors do not have minimum
support.
67Uniform Support An Example
L1 2,3,4,5,6 L2 23,24,25,26,34,45,46,56 L3
234,245,246,256,456 L4 2456
min_sup50
L1 2,3,4 L2 23,24,34 L3 234
min_sup60
L1 2,3,6,8,9 L2 23,68,69,89 L3 689
min_sup50
L1 9
min_sup60
68Uniform Support An Example
All
Cheese
Crab
Milk
3
1
2
1
2
6
3
5
Kings Crab
Sunset Milk
Dairyland Milk
Dairyland Cheese
Best Cheese
Bread
Apple
Pie
4
5
6
4
9
7
10
8
Best Bread
Wonder Bread
Goldenfarm Apple
Westcoast Bread
Tasty Pie
69Uniform Support An Example
L1 2,3,4,5,6 L2 23,24,25,26,34,45,46,56 L3
234,245,246,256,456 L4 2456
min_sup50
Apriori/DHP FP Growth
(1)
(2)
min_sup50
L1 2,3,6,8,9
Scan DB
C2 23,36,29,69,28,68,39,89 C3 239,369,289,689
(3)
min_sup50
Scan DB
L2 23,68,69,89 L3 689
70Uniform Support An Example
All
Cheese
Crab
Milk
3
1
2
1
2
6
3
5
Kings Crab
Sunset Milk
Dairyland Milk
Dairyland Cheese
Best Cheese
Bread
Apple
Pie
4
5
6
4
9
7
10
8
Best Bread
Wonder Bread
Goldenfarm Apple
Westcoast Bread
Tasty Pie
71Reduced Support
- Each level of abstraction has its own minimum
support threshold.
72Search Strategies forReduced Support
- There are 4 search strategies
- Level-by-level independent
- Full-Breadth Search
- No pruning
- No background knowledge of frequent itemsets is
used for pruning - Level-cross filtering by single item
- An item at the ith level is examined if and only
if its parent node at the (i-1)th level is
frequent. - Level-cross filtering by k-itemset
- An k-itemset at the ith level is examined if and
only if its corresponding parent k-itemset at the
(i-1)th level is frequent. - Controlled level-cross filtering by single item
73Level-Cross Filtering by Single Item
74Reduced Support An Example
L1 2,3,4 L2 23,24,34 L3 234
min_sup60
Apriori/DHP FP Growth
(1)
(1)
min_sup50
L1 2,3,6,9
Scan DB
(2)
min_sup50
L2 23,69
Apriori/DHP FP Growth
(3)
75Reduced Support An Example
All
Cheese
Crab
Milk
3
1
2
1
2
6
3
5
Kings Crab
Sunset Milk
Dairyland Milk
Dairyland Cheese
Best Cheese
Bread
Apple
Pie
4
5
6
4
9
7
10
8
Best Bread
Wonder Bread
Goldenfarm Apple
Westcoast Bread
Tasty Pie
76Level-Cross Filtering by K-Itemset
?
?
77Reduced Support An Example
L1 2,3,4 L2 23,24,34 L3 234
min_sup60
Apriori/DHP FP Growth
(1)
(2)
min_sup50
L1 2,3,6,9
Scan DB
C2 23,36,29,69,39 C3 239,369
(3)
min_sup50
Scan DB
L2 23,69
78Reduced Support An Example
All
Cheese
Crab
Milk
3
1
2
1
2
6
3
5
Kings Crab
Sunset Milk
Dairyland Milk
Dairyland Cheese
Best Cheese
Bread
Apple
Pie
4
5
6
4
9
7
10
8
Best Bread
Wonder Bread
Goldenfarm Apple
Westcoast Bread
Tasty Pie
79Reduced Support
- Level-by-level independent
- It is very relaxed in that it may lead to
examining numerous infrequent items at low
levels, finding associations between items of
little importance. - Level-cross filtering by k-itemset
- It allows the mining system to examine only the
children of frequent k-itemsets. - This restriction is very strong in that there
usually are not many k-itemsets. - Many valuable patterns may be filtered out.
- Level-cross filtering by single item
- A compromise between the above two approaches
- This method may miss associations between low
level items that are frequent based on a reduced
minimum support, but whose ancestors do not
satisfy minimum support.
80Controlled Level-Cross Filtering by Single Item
81Reduced Support An Example
L1 2,3,4 L2 23,24,34 L3 234
min_sup60
Apriori/DHP FP Growth
level_passage_sup50 L1 2,3,4,5,6
(1)
(1)
min_sup50
L1 2,3,6,8,9
Scan DB
(2)
L2 23,68,69,89 L3 689
min_sup50
Apriori/DHP FP Growth
(3)
82Reduced Support An Example
All
Cheese
Crab
Milk
3
1
2
1
2
6
3
5
Kings Crab
Sunset Milk
Dairyland Milk
Dairyland Cheese
Best Cheese
Bread
Apple
Pie
4
5
6
4
9
7
10
8
Best Bread
Wonder Bread
Goldenfarm Apple
Westcoast Bread
Tasty Pie
83Multi-Dimensional Association
- Single-Dimensional (Intra-Dimension) Rules
Single Dimension (Predicate) with Multiple
Occurrences. - buys(X, milk) ? buys(X, bread)
- Multi-Dimensional Rules ? 2 Dimensions
- Inter-dimension association rules (no repeated
predicates) - age(X,19-25) ? occupation(X,student) ?
buys(X,coke) - hybrid-dimension association rules (repeated
predicates) - age(X,19-25) ? buys(X, popcorn) ? buys(X,
coke)
84Summary
- Association rule mining
- Probably the most significant contribution from
the database community in KDD - A large number of papers have been published
- Some Important Issues
- Generalized Association Rules
- Multiple-Level Association Rules
- Association Analysis in Other Types of Data
- Spatial Data, Multimedia Data, Time Series Data,
etc. - Weighted Association Rules
- Quantitative Association Rules
85Weighted Association Rules
- Why Weighted Association Analysis?
- In previous work, all items in a transactional
database are treated uniformly - Items are given weights to reflect their
importance to the user - The weights may correspond to special promotions
on some products, or the profitability of
different items - Some products may be under promotion and hence
are more interesting, or some products are more
profitable and hence rules concerning them are of
greater values
86Weighted Association Rules
- A simple attempt to solve this problem is to
eliminate the items with small weights - However, a rule for a heavy weighted item may
also consist of low weighted items - Is Apriori algorithm feasible?
- Apriori algorithm depends on the downward closure
property which governs that subsets of a frequent
itemset are also frequent - However, it is not true for the weighted case
87Weighted Association RulesAn Example
- Total Benefits 500
- Benefits for the First Transaction
(403030202010 10)160 - Benefits for the Second Transaction (403020
201010)130 - Benefits for the Third Transaction
(4030201010) 110 - Benefits for the Fourth Transaction
(3030201010)100 - Suppose Weighted_Min_Sup 40
- Minimum Benefits 500 40 200
88An Example
- Minimum Benefits 500 40 200
- Itemset 3,5,6,7
- Benefits 70
- Support Count (Frequency) 3
- 70 3 210 gt 200 ?3,5,6,7is a Frequent
Itemset - Itemset 3,5,6
- Benefits 60
- Support Count ( Frequency) 3
- 60 3 180 lt 200 ?3,5,6 is not a Frequent
Itemset
Apriori Principle can not be applied!
89K-Support Bound
- If Y is a frequent q-itemset
- Support_Count(Y) ? (Weighted_Min_Sup
Total_Benefits) / Benefits(Y) - Example
- 3,5,6,7 is a Frequent 4-Itemset
- Support_Count(3,5,6,7) 3 ? (40 500) /
Benefits(3,5,6,7) (40 500) / 70 2.857 - If X is a frequent k-itemset containing q-itemset
Y - Minimum_Support_Count(X) ? (Weighted_Min_Sup
Total_Benefits) / (Benefits(Y) (k-q) Maximum
Remaining Weights) - Example
- X is a Frequent 5-Itemset containing 3,5,6,7
- Minimum_Support_Count(X) ? (40 500) / (70
40) 1.81
K-Support Bound
90K-Support Bound
- Itemset 1,2
- Benefits 70
- Support_Count(1,2) 1 lt (40 500) /
Benefits(1,2) (40 500) / 70 2.857 - 1,2 is not a Frequent Itemset
- If X is a frequent 3-itemset containing 1,2
- Minimum_Support_Count(X) ? (40 500) / (70
30) 2 - But, Maximum_Support_Count(X) 1
- No frequent 3-itemsets containing 1,2
- If X is a frequent 4-itemset containing 1,2
- Minimum_Support_Count(X) ? (40 500) / (70 30
20) 1.667 - But, Maximum_Support_Count(X) 1
- No frequent 4-itemsets containing 1,2
- Similarly, no frequent 5, 6, 7-itemsets
containing 1,2 - The algorithm is designed based on this k-support
bound
91MINWAL Algorithm
92Step by Step
- Input Product Transactional Databases
Weighted_Min_Sup 50
Total Profits 1380
93Step 2, 7
- Search(D)
- This subroutine finds out the maximum transaction
size in that transactional database D - Size 4 in this case
- Counting(D, w)
- This subroutine cumulates the support counts of
the 1-itemsets - The k-support bounds of each 1-itemset will be
calculated, and the 1-itemsets with support
counts greater than any of the k-support bounds
will be kept in C1
94Step 7
?
K-Support Bound ? (50 1380) / (10 90) 6.9
95Step 11
- Join(Ck-1)
- The Join step generates Ck from Ck-1 as Apriori
Algorithm - If we have 1, 2, 3, 1, 2, 4 in Ck-1 1, 2, 3,
4 will be generated in Ck - In this case,
- C1 1 (4), 2 (5), 4 (6), 5 (7)
- C2 Join(C1) 12, 14, 15, 24, 25, 45
Support_Count
96Step 12
- Prune(Ck)
- The itemset will be pruned in either of the
following cases - A subset of the candidate itemset in Ck does not
exist in Ck-1 - Estimate an upper bound on the support count (SC)
of the joined itemset X, which is the minimum
support count among the k different (k-1)-subsets
of X in Ck-1. If the estimated upper bound on
the support count shows that the itemset X cannot
be a subset of any large itemset in the coming
passes (from the calculation of k-support bounds
for all itemsets), that itemset will be pruned - In this case,
- C2 Prune(C2) 12 (4), 14 (4), 15 (4), 24 (5),
25 (5), 45 (6) - Using K-Support-Bound
Estimated_Support_Count
97Step 12
- Prune(Ck)
- Using K-Support-Bound (No one is pruned)
98Step 13
- Checking(Ck, D)
- Scan DB Generate Lk
?
?
C2
L2
99Step 11, 12
- Join(C2)
- C2 15 (4), 24 (5), 25 (5), 45 (6)
- C3 Join(C2) 245
- Prune(C3)
- C3 Prune(C3) 245 (5)
- Using K-Support-Bound (No one is pruned)
100Step 13
- Checking(C3, D) Scan DB
- C3 L3 245
- Finally, L 45, 245
101Step 15
- Generate Rules for L 45, 245
- 4 ? 5 (confidence 100)
- 5 ? 4 (confidence 85.7)
- 24 ? 5 (confidence 100)
- 25 ? 4 (confidence 100)
- 45 ? 2 (confidence 83.3)
- 2 ? 45 (confidence 100)
- 4 ? 25 (confidence 83.3)
- 5 ? 24 (confidence 71.4)
Min_Conf90
102Generalized Association Rules
103Quantitative Association Rules
- Let min_sup 50, we have
- A, B 60
- B, D 70
- A, B, D 50
- A(1..2), B(3) 50
- A(1..2), B(3..5) 60
- A(1..2), B(1..5) 60
- B(1..5), D(1..2) 60
- B(3..5), D(1..3) 60
- B(1..5), D(1..3) 70
- B(1..3), D(1..2) 50
- B(3..5), D(1..2) 50
- A(1..2), B(3..5), D(1..2) 50
?