Title: Mining Association Rules in Large Databases
1. Mining Association Rules in Large Databases
- Association rule mining
- Mining single-dimensional Boolean association rules from transactional databases
- Mining multilevel association rules from transactional databases
- Mining multidimensional association rules from transactional databases and data warehouses
- From association mining to correlation analysis
- Constraint-based association mining
- Summary
2. Rule Measures: Support and Confidence
(Venn-diagram figure lost in conversion: customers buying diapers, customers buying beer, and customers buying both)
- Find all the rules X ∧ Y ⇒ Z with minimum support and confidence
- support, s: probability that a transaction contains {X, Y, Z}
- confidence, c: conditional probability that a transaction having {X, Y} also contains Z
- Let minimum support = 50% and minimum confidence = 50%; we have
- A ⇒ C (50%, 66.6%)
- C ⇒ A (50%, 100%)
3. Association Rule Mining
- Given a set of transactions, find rules that will
predict the occurrence of an item based on the
occurrences of other items in the transaction
Market-basket transactions (table lost in conversion)
Example of Association Rules:
{Diaper} ⇒ {Beer}
{Milk, Bread} ⇒ {Eggs, Coke}
{Beer, Bread} ⇒ {Milk}
Implication means co-occurrence, not causality!
4. Definition: Frequent Itemset
- Itemset
- A collection of one or more items
- Example: {Milk, Bread, Diaper}
- k-itemset
- An itemset that contains k items
- Support count (σ)
- Frequency of occurrence of an itemset
- E.g. σ({Milk, Bread, Diaper}) = 2
- Support
- Fraction of transactions that contain an itemset
- E.g. s({Milk, Bread, Diaper}) = 2/5
- Frequent Itemset
- An itemset whose support is greater than or equal to a minsup threshold (see the sketch below)
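A minimal sketch of these definitions in Python. The five transactions are illustrative, chosen to reproduce the counts on this slide (the slide's transaction table itself was lost in conversion), and the function names are my own:

```python
# Illustrative market-basket transactions consistent with the counts on this slide.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """sigma(X): number of transactions containing every item of X."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """s(X): fraction of transactions containing X."""
    return support_count(itemset, transactions) / len(transactions)

print(support_count({"Milk", "Bread", "Diaper"}, transactions))  # 2
print(support({"Milk", "Bread", "Diaper"}, transactions))        # 0.4  (= 2/5)
```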
5. Definition: Association Rule
- Association Rule
- An implication expression of the form X ⇒ Y, where X and Y are itemsets
- Example: {Milk, Diaper} ⇒ {Beer}
- Rule Evaluation Metrics
- Support (s)
- Fraction of transactions that contain both X and Y
- Confidence (c)
- Measures how often items in Y appear in transactions that contain X (see the worked example below)
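Written out (a standard formulation, not reproduced from a lost slide graphic), with the numbers worked from the illustrative transactions above:

```latex
s(X \Rightarrow Y) = \frac{\sigma(X \cup Y)}{|T|}, \qquad
c(X \Rightarrow Y) = \frac{\sigma(X \cup Y)}{\sigma(X)}
```

For {Milk, Diaper} ⇒ {Beer}: s = σ({Milk, Diaper, Beer}) / |T| = 2/5 = 0.4 and c = σ({Milk, Diaper, Beer}) / σ({Milk, Diaper}) = 2/3 ≈ 0.67, matching the figures quoted two slides below.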
6. Association Rule Mining Task
- Given a set of transactions T, the goal of association rule mining is to find all rules having
- support ≥ minsup threshold
- confidence ≥ minconf threshold
- Brute-force approach
- List all possible association rules
- Compute the support and confidence for each rule
- Prune rules that fail the minsup and minconf thresholds
- ⇒ Computationally prohibitive!
7. Mining Association Rules
Example of Rules:
{Milk, Diaper} ⇒ {Beer} (s = 0.4, c = 0.67)
{Milk, Beer} ⇒ {Diaper} (s = 0.4, c = 1.0)
{Diaper, Beer} ⇒ {Milk} (s = 0.4, c = 0.67)
{Beer} ⇒ {Milk, Diaper} (s = 0.4, c = 0.67)
{Diaper} ⇒ {Milk, Beer} (s = 0.4, c = 0.5)
{Milk} ⇒ {Diaper, Beer} (s = 0.4, c = 0.5)
- Observations
- All the above rules are binary partitions of the same itemset {Milk, Diaper, Beer}
- Rules originating from the same itemset have identical support but can have different confidence
- Thus, we may decouple the support and confidence requirements
8. Mining Association Rules: An Example
Min. support 50%, min. confidence 50%
- For rule A ⇒ C:
- support = support({A, C}) = 50%
- confidence = support({A, C}) / support({A}) = 66.6%
- The Apriori principle:
- Any subset of a frequent itemset must be frequent
9. Mining Frequent Itemsets: the Key Step
- Find the frequent itemsets: the sets of items that have minimum support
- A subset of a frequent itemset must also be a frequent itemset
- i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets
- Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets)
- Use the frequent itemsets to generate association rules
10. Mining Association Rules
- Two-step approach:
- Frequent Itemset Generation
- Generate all itemsets whose support ≥ minsup
- Rule Generation
- Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset
- Frequent itemset generation is still computationally expensive
11. Frequent Itemset Generation
Given d items, there are 2^d possible candidate itemsets
12. Frequent Itemset Generation
- Brute-force approach:
- Each itemset in the lattice is a candidate frequent itemset
- Count the support of each candidate by scanning the database
- Match each transaction against every candidate
- Complexity O(NMw) ⇒ expensive, since M = 2^d !!!
13. Computational Complexity
- Given d unique items:
- Total number of itemsets = 2^d
- Total number of possible association rules: R = 3^d - 2^(d+1) + 1
- If d = 6, R = 602 rules
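A quick check of that closed form for the d = 6 case quoted on the slide (a hedged one-liner, not part of the original deck):

```python
d = 6
R = 3**d - 2**(d + 1) + 1   # number of possible association rules over d items
print(R)                    # 602
```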
14. Frequent Itemset Generation Strategies
- Reduce the number of candidates (M)
- Complete search: M = 2^d
- Use pruning techniques to reduce M
- Reduce the number of transactions (N)
- Reduce the size of N as the size of the itemset increases
- Used by DHP and vertical-based mining algorithms
- Reduce the number of comparisons (NM)
- Use efficient data structures to store the candidates or transactions
- No need to match every candidate against every transaction
15. Reducing the Number of Candidates
- Apriori principle:
- If an itemset is frequent, then all of its subsets must also be frequent
- The Apriori principle holds due to the following property of the support measure:
- The support of an itemset never exceeds the support of its subsets
- This is known as the anti-monotone property of support (stated formally below)
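The property in symbols (a standard statement, added here since the slide's formula graphic did not survive):

```latex
\forall X, Y : \; (X \subseteq Y) \Rightarrow s(X) \geq s(Y)
```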
16. Illustrating the Apriori Principle
17. The Apriori Algorithm
- Join step: Ck is generated by joining Lk-1 with itself
- Prune step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset
- Pseudo-code (a runnable sketch follows this slide):
- Ck: candidate itemsets of size k
- Lk: frequent itemsets of size k
- L1 = {frequent items}
- for (k = 1; Lk != ∅; k++) do begin
-   Ck+1 = candidates generated from Lk
-   for each transaction t in the database do
-     increment the count of all candidates in Ck+1 that are contained in t
-   Lk+1 = candidates in Ck+1 with min_support
- end
- return ∪k Lk
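A compact Python sketch of the same level-wise loop, hedged: candidate generation here is the naive join of frequent (k-1)-itemsets sharing a (k-2)-prefix followed by subset pruning, and all function and variable names are my own rather than the deck's:

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return all frequent itemsets (as frozensets) with support count >= minsup."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})
    # L1: frequent 1-itemsets
    current = {frozenset([i]) for i in items
               if sum(i in t for t in transactions) >= minsup}
    frequent = set(current)
    k = 2
    while current:
        # Join step: merge itemsets into k-itemsets, then prune by the Apriori principle
        candidates = set()
        for a in current:
            for b in current:
                union = a | b
                if len(union) == k and all(frozenset(s) in current
                                           for s in combinations(union, k - 1)):
                    candidates.add(union)
        # Count step: one scan of the database per level
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        current = {c for c, n in counts.items() if n >= minsup}
        frequent |= current
        k += 1
    return frequent

# Usage with the illustrative transactions from earlier:
# print(apriori(transactions, minsup=3))
```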
18. The Apriori Algorithm: Example
(Figure lost in conversion: a walk-through on database D, scanning D to build C1 and L1, joining to C2, scanning D for L2, then C3, a final scan of D, and L3.)
19. Generate Hash Tree
- Suppose you have 15 candidate itemsets of length 3:
- {1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}
- You need:
- A hash function
- A max leaf size: the maximum number of itemsets stored in a leaf node (if the number of candidate itemsets exceeds the max leaf size, split the node)
20. Association Rule Discovery: Hash Tree
(Candidate hash tree figure lost in conversion)
- Hash function: items 1, 4, 7 go to the first branch; 2, 5, 8 to the second; 3, 6, 9 to the third
- At the root, hash on item 1, 4 or 7 (see the sketch below)
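A minimal sketch of that hash function, assuming the items are the integers 1-9 used in the example; only the 1,4,7 / 2,5,8 / 3,6,9 bucket layout is taken from the slide, the rest is illustrative:

```python
def hash_branch(item):
    """Map an item to a child branch: {1,4,7} -> 0, {2,5,8} -> 1, {3,6,9} -> 2."""
    return (item - 1) % 3

# Routing candidate {1, 4, 5}: level 1 hashes on 1 -> branch 0,
# level 2 hashes on 4 -> branch 0, level 3 hashes on 5 -> branch 1.
print([hash_branch(i) for i in (1, 4, 5)])  # [0, 0, 1]
```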
21. How to Generate Candidates?
- Suppose the items in Lk-1 are listed in an order
- Step 1: self-joining Lk-1
- insert into Ck
- select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
- from Lk-1 p, Lk-1 q
- where p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
- Step 2: pruning
- forall itemsets c in Ck do
- forall (k-1)-subsets s of c do
- if (s is not in Lk-1) then delete c from Ck
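The same join-then-prune step as a hedged Python sketch; itemsets are kept as sorted tuples so the shared prefix is explicit, and the names are mine rather than the deck's:

```python
from itertools import combinations

def apriori_gen(L_prev, k):
    """Generate candidate k-itemsets from frequent (k-1)-itemsets L_prev (sorted tuples)."""
    candidates = set()
    for p in L_prev:
        for q in L_prev:
            # Join step: same (k-2)-prefix, and p's last item precedes q's last item
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                c = p + (q[-1],)
                # Prune step: every (k-1)-subset of c must itself be frequent
                if all(s in L_prev for s in combinations(c, k - 1)):
                    candidates.add(c)
    return candidates

# Example from the "Generating Candidates" slide below:
L3 = {("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"), ("a", "c", "e"), ("b", "c", "d")}
print(apriori_gen(L3, 4))  # {('a', 'b', 'c', 'd')}; acde is pruned because ade is not in L3
```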
22. How to Count Supports of Candidates?
- Why is counting supports of candidates a problem?
- The total number of candidates can be very large
- One transaction may contain many candidates
- Method:
- Candidate itemsets are stored in a hash tree
- A leaf node of the hash tree contains a list of itemsets and counts
- An interior node contains a hash table
- The subset function finds all the candidates contained in a transaction
23. Example of Generating Candidates
- L3 = {abc, abd, acd, ace, bcd}
- Self-joining: L3 * L3
- abcd from abc and abd
- acde from acd and ace
- Pruning:
- acde is removed because ade is not in L3
- C4 = {abcd}
24. Methods to Improve Apriori's Efficiency
- Hash-based itemset counting: a k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent
- Transaction reduction: a transaction that does not contain any frequent k-itemset is useless in subsequent scans
- Partitioning: any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
- Sampling: mine on a subset of the given data with a lowered support threshold, plus a method to determine the completeness
- Dynamic itemset counting: add new candidate itemsets only when all of their subsets are estimated to be frequent
25. Compact Representation of Frequent Itemsets
- Some itemsets are redundant because they have identical support to their supersets
- The number of frequent itemsets can be very large (the counting illustration on this slide was lost in conversion)
- We need a compact representation
26. Maximal Frequent Itemset
- An itemset is maximal frequent if none of its immediate supersets is frequent
(Figure lost in conversion: itemset lattice showing the border between frequent and infrequent itemsets, with the maximal itemsets just inside the border)
27. Closed Itemset
- An itemset is closed if none of its immediate supersets has the same support as the itemset (see the sketch below)
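A small, hedged sketch of how the two definitions differ in code, given a table of itemset supports; the helper name and the expected input shape are my own, not from the slides:

```python
def classify(supports, minsup):
    """Split frequent itemsets into 'closed' and 'maximal' per the definitions above.

    supports: dict mapping frozenset -> support count; it must include the
    immediate supersets of every frequent itemset being classified.
    """
    frequent = {s for s, n in supports.items() if n >= minsup}
    closed, maximal = set(), set()
    for itemset in frequent:
        supersets = [t for t in supports if len(t) == len(itemset) + 1 and itemset < t]
        if not any(supports[t] == supports[itemset] for t in supersets):
            closed.add(itemset)        # no immediate superset has the same support
        if not any(t in frequent for t in supersets):
            maximal.add(itemset)       # no immediate superset is frequent
    return closed, maximal
```

Every maximal frequent itemset is also closed (a frequent superset with equal support would itself be frequent), which is why the maximal count on the later slide is smaller than the closed count.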
28. Maximal vs Closed Itemsets
(Figure lost in conversion: an itemset lattice annotated with the transaction IDs supporting each node; itemsets not supported by any transaction are marked)
29. Maximal vs Closed Frequent Itemsets
(Lattice figure lost in conversion: with minimum support = 2, nodes are annotated "closed but not maximal" and "closed and maximal")
- Closed = 9, Maximal = 4
30. Maximal vs Closed Itemsets
31. Is Apriori Fast Enough? Performance Bottlenecks
- The core of the Apriori algorithm:
- Use frequent (k-1)-itemsets to generate candidate frequent k-itemsets
- Use database scans and pattern matching to collect counts for the candidate itemsets
- The bottleneck of Apriori: candidate generation
- Huge candidate sets:
- 10^4 frequent 1-itemsets will generate more than 10^7 candidate 2-itemsets
- To discover a frequent pattern of size 100, e.g., {a1, a2, ..., a100}, one needs to generate 2^100 ≈ 10^30 candidates
- Multiple scans of the database:
- Needs (n + 1) scans, where n is the length of the longest pattern
32. Alternative Methods for Frequent Itemset Generation
- Representation of the database:
- horizontal vs. vertical data layout
33. FP-growth Algorithm
- Use a compressed representation of the database, the FP-tree
- Once an FP-tree has been constructed, use a recursive divide-and-conquer approach to mine the frequent itemsets (see the sketch below)
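A hedged sketch of the FP-tree structure and its insertion step. The node fields and function names are my own; the frequency-descending item ordering and header-table node-links follow the description on the later slides:

```python
from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item = item          # item label, or None for the root
        self.count = 1            # number of transactions sharing this prefix path
        self.parent = parent
        self.children = {}        # item -> FPNode

def build_fp_tree(transactions, minsup):
    """Build an FP-tree; returns (root, header) where header maps item -> list of its nodes."""
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[item] += 1
    frequent = {i for i, n in counts.items() if n >= minsup}
    root, header = FPNode(None, None), defaultdict(list)
    for t in transactions:
        # Keep only frequent items, in frequency-descending order
        items = sorted((i for i in t if i in frequent), key=lambda i: -counts[i])
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)   # node-link used later for mining
            else:
                child.count += 1
            node = child
    return root, header
```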
34. FP-tree Construction
(Figure lost in conversion: after reading TID 1 the tree is null -> A:1 -> B:1; after reading TID 2 a second branch B:1 -> C:1 -> D:1 hangs off the null root)
35. FP-Tree Construction
(Figure lost in conversion: the complete FP-tree for the transaction database, with a null root, node counts such as A:7, B:5, B:3, C:3, C:1, D:1 and E:1, and a header table linking all nodes that carry the same item)
Pointers are used to assist frequent itemset generation
36. FP-growth
- Conditional pattern base for D: P = {(A:1, B:1, C:1), (A:1, B:1), (A:1, C:1), (A:1), (B:1, C:1)}
- Recursively apply FP-growth on P
- Frequent itemsets found (with sup > 1): AD, BD, CD, ACD, BCD
(Accompanying FP-tree figure lost in conversion)
37. Tree Projection
- Set enumeration tree
- Possible extensions: E(A) = {B, C, D, E}
- Possible extensions: E(ABC) = {D, E}
38. Tree Projection
- Items are listed in lexicographic order
- Each node P stores the following information:
- Itemset for node P
- List of possible lexicographic extensions of P: E(P)
- Pointer to the projected database of its ancestor node
- Bit vector recording which transactions in the projected database contain the itemset
39. Projected Database
- Projected database for node A (original database table lost in conversion)
- For each transaction T, the projected transaction at node A is T ∩ E(A) (see the sketch below)
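A tiny hedged illustration of that projection; E(A) is taken from the set-enumeration-tree slide above, while the sample transaction is made up:

```python
def project(transaction, extensions):
    """Projected transaction at a node: keep only the items in the node's extension set."""
    return set(transaction) & set(extensions)

E_A = {"B", "C", "D", "E"}                 # E(A) from the set enumeration tree
print(project({"A", "B", "C", "E"}, E_A))  # {'B', 'C', 'E'}
```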
40. Benefits of the FP-tree Structure
- Completeness:
- never breaks a long pattern of any transaction
- preserves complete information for frequent pattern mining
- Compactness:
- reduces irrelevant information: infrequent items are gone
- frequency-descending ordering: more frequent items are more likely to be shared
- never larger than the original database (not counting node-links and counts)
- Example: for the Connect-4 DB, the compression ratio can be over 100
41. Mining Frequent Patterns Using the FP-tree
- General idea (divide-and-conquer):
- Recursively grow frequent pattern paths using the FP-tree
- Method:
- For each item, construct its conditional pattern base, and then its conditional FP-tree
- Repeat the process on each newly created conditional FP-tree
- Until the resulting FP-tree is empty, or it contains only one path (a single path generates all the combinations of its sub-paths, each of which is a frequent pattern)
42. Major Steps to Mine an FP-tree
- Construct the conditional pattern base for each node in the FP-tree
- Construct the conditional FP-tree from each conditional pattern base
- Recursively mine the conditional FP-trees and grow the frequent patterns obtained so far
- If a conditional FP-tree contains a single path, simply enumerate all the patterns
43. Step 1: From FP-tree to Conditional Pattern Base
- Starting from the frequent-item header table of the FP-tree
- Traverse the FP-tree by following the link of each frequent item
- Accumulate all of the transformed prefix paths of that item to form its conditional pattern base
Conditional pattern bases:
item | conditional pattern base
  c  | f:3
  a  | fc:3
  b  | fca:1, f:1, c:1
  m  | fca:2, fcab:1
  p  | fcam:2, cb:1
44. Properties of the FP-tree for Conditional Pattern Base Construction
- Node-link property:
- For any frequent item ai, all the possible frequent patterns that contain ai can be obtained by following ai's node-links, starting from ai's head in the FP-tree header table
- Prefix-path property:
- To calculate the frequent patterns for a node ai in a path P, only the prefix sub-path of ai in P needs to be accumulated, and its frequency count should carry the same count as node ai
45. Step 2: Construct Conditional FP-tree
- For each pattern base:
- Accumulate the count for each item in the base
- Construct the FP-tree for the frequent items of the pattern base
- m-conditional pattern base: fca:2, fcab:1
- All frequent patterns concerning m: m, fm, cm, am, fcm, fam, cam, fcam
(Figure lost in conversion: the global FP-tree with header table f:4, c:4, a:3, b:3, m:3, p:3 and the m-conditional FP-tree derived from it)
46. Step 3: Recursively Mine the Conditional FP-tree
- Conditional pattern base of "am": (fc:3)
- Conditional pattern base of "cm": (f:3), giving the cm-conditional FP-tree f:3
- Conditional pattern base of "cam": (f:3), giving the cam-conditional FP-tree f:3
47. FP-growth vs. Apriori: Scalability with the Support Threshold
(Performance figure lost in conversion; data set T25I20D10K)
48. FP-growth vs. Tree-Projection: Scalability with the Support Threshold
(Performance figure lost in conversion; data set T25I20D100K)
49. Rule Generation
- How do we efficiently generate rules from frequent itemsets?
- In general, confidence does not have an anti-monotone property
- c(ABC ⇒ D) can be larger or smaller than c(AB ⇒ D)
- But confidence of rules generated from the same itemset does have an anti-monotone property
- e.g., for L = {A, B, C, D}: c(ABC ⇒ D) ≥ c(AB ⇒ CD) ≥ c(A ⇒ BCD)
- Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule (see the sketch below)
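A hedged sketch of rule generation from a single frequent itemset. It simply enumerates the binary partitions and keeps the high-confidence ones; the level-wise consequent-growing optimization described on the next slides is omitted for brevity, and the names are mine:

```python
from itertools import combinations

def support(itemset, transactions):
    """Fraction of transactions containing every item of itemset."""
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)

def generate_rules(itemset, transactions, minconf):
    """Yield (antecedent, consequent, confidence) for all confident rules from one frequent itemset."""
    itemset = frozenset(itemset)
    for r in range(1, len(itemset)):
        for lhs in combinations(itemset, r):
            lhs = frozenset(lhs)
            conf = support(itemset, transactions) / support(lhs, transactions)
            if conf >= minconf:
                yield lhs, itemset - lhs, conf

# e.g. list(generate_rules({"Milk", "Diaper", "Beer"}, transactions, minconf=0.6))
```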
50. Rule Generation for the Apriori Algorithm
(Figure lost in conversion: the lattice of rules for one itemset, with a low-confidence rule and the rules pruned beneath it)
51. Rule Generation for the Apriori Algorithm
- A candidate rule is generated by merging two rules that share the same prefix in the rule consequent
- join(CD ⇒ AB, BD ⇒ AC) would produce the candidate rule D ⇒ ABC
- Prune rule D ⇒ ABC if its subset AD ⇒ BC does not have high confidence
52. Effect of Support Distribution
- Many real data sets have a skewed support distribution
(Figure lost in conversion: support distribution of a retail data set)
53. Effect of Support Distribution
- How do we set an appropriate minsup threshold?
- If minsup is set too high, we could miss itemsets involving interesting rare items (e.g., expensive products)
- If minsup is set too low, mining becomes computationally expensive and the number of itemsets becomes very large
- Using a single minimum support threshold may not be effective
54. Iceberg Queries
- Iceberg query: compute aggregates over one attribute or a set of attributes only for those groups whose aggregate value is above a certain threshold
- Example:
    select P.custID, P.itemID, sum(P.qty)
    from purchase P
    group by P.custID, P.itemID
    having sum(P.qty) > 10
- Compute iceberg queries efficiently with an Apriori-style strategy:
- First compute the lower dimensions
- Then compute higher dimensions only when all the lower ones are above the threshold
55. Pattern Evaluation
- Association rule algorithms tend to produce too many rules
- many of them are uninteresting or redundant
- e.g., {A, B, C} ⇒ {D} is redundant if it has the same support and confidence as {A, B} ⇒ {D}
- Interestingness measures can be used to prune or rank the derived patterns
- In the original formulation of association rules, support and confidence are the only measures used
56. Application of Interestingness Measures
57. Computing an Interestingness Measure
- Given a rule X ⇒ Y, the information needed to compute its interestingness can be obtained from a contingency table
Contingency table for X ⇒ Y:
          Y      ¬Y
  X      f11    f10   | f1+
 ¬X      f01    f00   | f0+
         f+1    f+0   | |T|
- Used to define various measures:
- support, confidence, lift, Gini, J-measure, etc.
58. Drawback of Confidence
          Coffee   ¬Coffee
  Tea        15         5   |  20
 ¬Tea        75         5   |  80
             90        10   | 100
59. Interestingness Measurements
- Objective measures
- Two popular measurements:
- support, and
- confidence
- Subjective measures (Silberschatz & Tuzhilin, KDD'95)
- A rule (pattern) is interesting if
- it is unexpected (surprising to the user), and/or
- actionable (the user can do something with it)
60. Criticism of Support and Confidence
- Example 1 (Aggarwal & Yu, PODS'98)
- Among 5000 students:
- 3000 play basketball
- 3750 eat cereal
- 2000 both play basketball and eat cereal
- "play basketball ⇒ eat cereal" [40%, 66.7%] is misleading because the overall percentage of students eating cereal is 75%, which is higher than 66.7%
- "play basketball ⇒ not eat cereal" [20%, 33.3%] is far more accurate, although it has lower support and confidence
61. Statistical Independence
- Population of 1000 students
- 600 students know how to swim (S)
- 700 students know how to bike (B)
- 420 students know how to swim and bike (S, B)
- P(S ∧ B) = 420/1000 = 0.42
- P(S) × P(B) = 0.6 × 0.7 = 0.42
- P(S ∧ B) = P(S) × P(B) ⇒ statistical independence
- P(S ∧ B) > P(S) × P(B) ⇒ positively correlated
- P(S ∧ B) < P(S) × P(B) ⇒ negatively correlated
62. Statistical-based Measures
- Measures that take into account statistical dependence, such as lift/interest (the slide's formula graphic was lost in conversion; see the definition below)
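The most common such measure, lift (also called interest), restated here in its standard form since the original formula image did not survive; it is consistent with the 0.75/0.9 computation on the next slide:

```latex
\mathrm{Lift}(X \Rightarrow Y) \;=\; \frac{c(X \Rightarrow Y)}{s(Y)} \;=\; \frac{P(X, Y)}{P(X)\,P(Y)}
```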
63. Example: Lift/Interest
          Coffee   ¬Coffee
  Tea        15         5   |  20
 ¬Tea        75         5   |  80
             90        10   | 100
- Association rule: Tea ⇒ Coffee
- Confidence = P(Coffee | Tea) = 0.75
- but P(Coffee) = 0.9
- Lift = 0.75/0.9 = 0.8333 (< 1, therefore Tea and Coffee are negatively associated)
64. Drawback of Lift & Interest
          Y    ¬Y
  X      10     0   |  10
 ¬X       0    90   |  90
         10    90   | 100

          Y    ¬Y
  X      90     0   |  90
 ¬X       0    10   |  10
         90    10   | 100
Here Lift = 0.1/(0.1 × 0.1) = 10 for the first table but only 0.9/(0.9 × 0.9) ≈ 1.11 for the second, even though X and Y co-occur in 90% of transactions in the second.
Statistical independence: if P(X,Y) = P(X)P(Y), then Lift = 1
65. There are many measures proposed in the literature. Some measures are good for certain applications, but not for others. What criteria should we use to determine whether a measure is good or bad? What about Apriori-style support-based pruning? How does it affect these measures?
66. Constraint-Based Mining
- Interactive, exploratory mining of gigabytes of data?
- Could it be real? Yes, by making good use of constraints!
- What kinds of constraints can be used in mining?
- Knowledge type constraint: classification, association, etc.
- Data constraint: SQL-like queries
- e.g., find product pairs sold together in Vancouver in Dec. '98
- Dimension/level constraints:
- in relevance to region, price, brand, customer category
- Rule constraints:
- e.g., small sales (price < 10) triggers big sales (sum > 200)
- Interestingness constraints:
- e.g., strong rules (min_support ≥ 3%, min_confidence ≥ 60%)
67. Rule Constraints in Association Mining
- Two kinds of rule constraints:
- Rule form constraints: meta-rule guided mining
- e.g., P(x, y) ∧ Q(x, w) ⇒ takes(x, "database systems")
- Rule (content) constraints: constraint-based query optimization (Ng, et al., SIGMOD'98)
- e.g., sum(LHS) < 100 ∧ min(LHS) > 20 ∧ count(LHS) > 3 ∧ sum(RHS) > 1000
- 1-variable vs. 2-variable constraints (Lakshmanan, et al., SIGMOD'99):
- 1-var: a constraint confining only one side (L/R) of the rule, e.g., as shown above
- 2-var: a constraint confining both sides (L and R)
- e.g., sum(LHS) < min(RHS) ∧ max(RHS) < 5 × sum(LHS)
68. Constraint-Based Association Query
- Database: (1) trans(TID, Itemset), (2) itemInfo(Item, Type, Price)
- A constrained association query (CAQ) is of the form {(S1, S2) | C},
- where C is a set of constraints on S1, S2, including the frequency constraint
- A classification of (single-variable) constraints:
- Class constraint: S ⊆ A, e.g. S ⊆ Item
- Domain constraint:
- S θ v, θ ∈ {=, ≠, <, ≤, >, ≥}, e.g. S.Price < 100
- v θ S, θ is ∈ or ∉, e.g. snacks ∉ S.Type
- V θ S, or S θ V, θ ∈ {⊆, ⊂, ⊄, =, ≠}
- e.g. {snacks, sodas} ⊆ S.Type
- Aggregation constraint: agg(S) θ v, where agg is one of {min, max, sum, count, avg} and θ ∈ {=, ≠, <, ≤, >, ≥}
- e.g. count(S1.Type) = 1, avg(S2.Price) < 100
69. Constrained Association Query Optimization Problem
- Given a CAQ = {(S1, S2) | C}, the algorithm should be:
- sound: it only finds frequent sets that satisfy the given constraints C
- complete: all frequent sets satisfying the given constraints C are found
- A naive solution:
- Apply Apriori to find all frequent sets, and then test them for constraint satisfaction one by one
- Our approach:
- Comprehensively analyze the properties of constraints and try to push them as deeply as possible inside the frequent set computation
70. Anti-monotone and Monotone Constraints
- A constraint Ca is anti-monotone iff for any pattern S not satisfying Ca, none of the super-patterns of S can satisfy Ca
- A constraint Cm is monotone iff for any pattern S satisfying Cm, every super-pattern of S also satisfies it
71. Succinct Constraint
- A subset of items Is ⊆ I is a succinct set if it can be expressed as σp(I) for some selection predicate p, where σ is the selection operator
- SP ⊆ 2^I is a succinct power set if there is a fixed number of succinct sets I1, ..., Ik ⊆ I, such that SP can be expressed in terms of the strict power sets of I1, ..., Ik using union and minus
- A constraint Cs is succinct provided SAT_Cs(I) is a succinct power set
72. Convertible Constraint
- Suppose all items in patterns are listed in a total order R
- A constraint C is convertible anti-monotone iff a pattern S satisfying the constraint implies that each suffix of S w.r.t. R also satisfies C
- A constraint C is convertible monotone iff a pattern S satisfying the constraint implies that each pattern of which S is a suffix w.r.t. R also satisfies C
73. Relationships Among Categories of Constraints
(Figure lost in conversion: a diagram relating succinctness, anti-monotonicity, monotonicity, convertible constraints, and inconvertible constraints)
74. Property of Constraints: Anti-Monotone
- Anti-monotonicity: if a set S violates the constraint, any superset of S violates the constraint
- Examples:
- sum(S.Price) ≤ v is anti-monotone
- sum(S.Price) ≥ v is not anti-monotone
- sum(S.Price) = v is partly anti-monotone
- Application:
- Push sum(S.Price) ≤ 1000 deeply into the iterative frequent set computation (see the sketch below)
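A hedged sketch of pushing an anti-monotone constraint into level-wise mining: once a candidate violates sum(price) ≤ budget, no superset can ever satisfy it, so it can be dropped before any support counting. The price table, budget, and names are illustrative, not from the slides:

```python
prices = {"beer": 8, "diaper": 12, "milk": 3, "bread": 2}   # illustrative item prices

def violates_budget(itemset, budget):
    """Anti-monotone constraint sum(S.Price) <= budget: once violated, always violated."""
    return sum(prices[i] for i in itemset) > budget

def prune_candidates(candidates, budget):
    """Drop candidates that already violate the constraint, before support counting."""
    return [c for c in candidates if not violates_budget(c, budget)]

# Keeps {'milk', 'bread'} (cost 5) and drops {'beer', 'diaper'} (cost 20 > 10),
# so the second set never reaches the database scan and neither do its supersets.
print(prune_candidates([{"milk", "bread"}, {"beer", "diaper"}], budget=10))
```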
75. Characterization of Anti-Monotonicity Constraints
Constraint                        | Anti-monotone?
S θ v, θ ∈ {=, ≤, ≥}              | yes
v ∈ S                             | no
S ⊇ V                             | no
S ⊆ V                             | yes
S = V                             | partly
min(S) ≤ v                        | no
min(S) ≥ v                        | yes
min(S) = v                        | partly
max(S) ≤ v                        | yes
max(S) ≥ v                        | no
max(S) = v                        | partly
count(S) ≤ v                      | yes
count(S) ≥ v                      | no
count(S) = v                      | partly
sum(S) ≤ v                        | yes
sum(S) ≥ v                        | no
sum(S) = v                        | partly
avg(S) θ v, θ ∈ {=, ≤, ≥}         | convertible
(frequent constraint)             | (yes)
76. Example of Convertible Constraints: Avg(S) ≥ v
- Let R be the value-descending order over the set of items
- e.g., I = {9, 8, 6, 4, 3, 1}
- Avg(S) ≥ v is convertible monotone w.r.t. R
- If S is a suffix of S1, then avg(S1) ≥ avg(S)
- {8, 4, 3} is a suffix of {9, 8, 4, 3}
- avg({9, 8, 4, 3}) = 6 ≥ avg({8, 4, 3}) = 5
- If S satisfies avg(S) ≥ v, so does S1
- {8, 4, 3} satisfies the constraint avg(S) ≥ 4, so does {9, 8, 4, 3}
77. Property of Constraints: Succinctness
- Succinctness:
- For any sets S1 and S2 satisfying C, S1 ∪ S2 satisfies C
- Given A1, the set of size-1 itemsets satisfying C, any set S satisfying C is based on A1, i.e., it contains a subset belonging to A1
- Example:
- sum(S.Price) ≥ v is not succinct
- min(S.Price) ≤ v is succinct
- Optimization:
- If C is succinct, then C is pre-counting prunable. The satisfaction of the constraint alone is not affected by the iterative support counting.
78. Characterization of Constraints by Succinctness
Constraint                        | Succinct?
S θ v, θ ∈ {=, ≤, ≥}              | yes
v ∈ S                             | yes
S ⊇ V                             | yes
S ⊆ V                             | yes
S = V                             | yes
min(S) ≤ v                        | yes
min(S) ≥ v                        | yes
min(S) = v                        | yes
max(S) ≤ v                        | yes
max(S) ≥ v                        | yes
max(S) = v                        | yes
count(S) ≤ v                      | weakly
count(S) ≥ v                      | weakly
count(S) = v                      | weakly
sum(S) ≤ v                        | no
sum(S) ≥ v                        | no
sum(S) = v                        | no
avg(S) θ v, θ ∈ {=, ≤, ≥}         | no
(frequent constraint)             | (no)
79. Why Is the Big Pie Still There?
- More on constraint-based mining of associations:
- Boolean vs. quantitative associations
- Association on discrete vs. continuous data
- From association to correlation and causal structure analysis:
- Association does not necessarily imply correlation or causal relationships
- From intra-transaction associations to inter-transaction associations:
- e.g., break the barriers of transactions (Lu, et al., TOIS'99)
- From association analysis to classification and clustering analysis:
- e.g., clustering association rules
80. Summary
- Association rule mining
- is probably the most significant contribution from the database community to KDD
- A large number of papers have been published
- Many interesting issues have been explored
- An interesting research direction:
- Association analysis on other types of data: spatial data, multimedia data, time-series data, etc.