Title: Chapter 5: Mining Frequent Patterns, Association and Correlations
1. Chapter 5: Mining Frequent Patterns, Association and Correlations
- Basic concepts and a road map
- Efficient and scalable frequent itemset mining methods
- Mining various kinds of association rules
- From association mining to correlation analysis
- Constraint-based association mining
- Summary
2. What Is Frequent Pattern Analysis?
- Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
- First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining
- Motivation: finding inherent regularities in data
  - What products were often purchased together? Beer and diapers?!
  - What are the subsequent purchases after buying a PC?
  - What kinds of DNA are sensitive to this new drug?
  - Can we automatically classify web documents?
- Applications
  - Basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis
3. An Example: Market Basket Analysis
4. Why Is Freq. Pattern Mining Important?
- Discloses an intrinsic and important property of data sets
- Forms the foundation for many essential data mining tasks
  - Association, correlation, and causality analysis
  - Sequential, structural (e.g., sub-graph) patterns
  - Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
  - Classification: associative classification
  - Cluster analysis: frequent pattern-based clustering
  - Data warehousing: iceberg cube and cube-gradient
  - Semantic data compression: fascicles
- Broad applications
5. Basic Concepts: Frequent Patterns and Association Rules
- Itemset X = {x1, …, xk}
- Find all the rules X → Y with minimum support and confidence
  - support, s: probability that a transaction contains X ∪ Y
  - confidence, c: conditional probability that a transaction having X also contains Y
Let sup_min = 50%, conf_min = 50%.
Freq. Pat.: {A:3, B:3, D:4, E:3, AD:3}
Association rules: A → D (60%, 100%), D → A (60%, 75%)
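These definitions can be made concrete with a short sketch. The five transactions below are an assumption for illustration; they are consistent with the supports quoted above (A:3, B:3, D:4, E:3, AD:3).

```python
# Transactions assumed for illustration (5 baskets over items A..F).
transactions = [
    {"A", "B", "D"},
    {"A", "C", "D"},
    {"A", "D", "E"},
    {"B", "E", "F"},
    {"B", "C", "D", "E", "F"},
]

def support(itemset, db):
    """P(transaction contains every item of itemset)."""
    return sum(itemset <= t for t in db) / len(db)

def confidence(x, y, db):
    """P(transaction contains y | it contains x) = sup(x ∪ y) / sup(x)."""
    return support(x | y, db) / support(x, db)

print(support({"A", "D"}, transactions))                 # 0.6: A → D has s = 60%
print(confidence({"A"}, {"D"}, transactions))            # 1.0: A → D has c = 100%
print(round(confidence({"D"}, {"A"}, transactions), 2))  # 0.75: D → A has c = 75%
```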
6. Closed Patterns and Max-Patterns
- A long pattern contains a combinatorial number of sub-patterns; e.g., {a1, …, a100} contains C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1 ≈ 1.27×10^30 sub-patterns!
- Solution: mine closed patterns and max-patterns instead
- An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X (proposed by Pasquier, et al. @ ICDT'99)
- An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X (proposed by Bayardo @ SIGMOD'98)
- Closed pattern is a lossless compression of freq. patterns
  - Reduces the # of patterns and rules
7. Why Closed Patterns and Max-Patterns
Let sup_min = 50%, conf_min = 50%
- Freq. Pat.: A:4, B:3, D:3
  - AB:3, AD:3, BD:3
  - ABD:3
- Freq. Closed Pat.: A:4, ABD:3
8. Why Closed Patterns and Max-Patterns (cont.)
Let sup_min = 50%, conf_min = 50%
- Frequent patterns
  - A:3, B:5, D:4, F:3
  - AB:3, AD:3, BD:4, BF:3
  - ABD:3
- Frequent Closed patterns
  - B:5, BD:4, BF:3, ABD:3 (D:4 is not closed: its superset BD has the same support)
9. Closed Patterns and Max-Patterns
- Exercise. DB = {<a1, …, a100>, <a1, …, a50>}
  - min_sup = 1
- What is the set of closed itemsets?
  - <a1, …, a100>: 1
  - <a1, …, a50>: 2
- What is the set of max-patterns?
  - <a1, …, a100>: 1
- What is the set of all patterns?
  - 2^100 − 1!!
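Given the frequent itemsets and their supports, the closed and max patterns can be identified mechanically. A minimal sketch, using the support counts from the slide-7 example above:

```python
# Support counts from the slide-7 example (A:4, B:3, D:3, ..., ABD:3).
freq = {
    frozenset("A"): 4, frozenset("B"): 3, frozenset("D"): 3,
    frozenset("AB"): 3, frozenset("AD"): 3, frozenset("BD"): 3,
    frozenset("ABD"): 3,
}

def closed_patterns(freq):
    """X is closed: no frequent proper superset has the same support."""
    return {x: s for x, s in freq.items()
            if not any(x < y and s == sy for y, sy in freq.items())}

def max_patterns(freq):
    """X is maximal: no frequent proper superset at all."""
    return {x: s for x, s in freq.items()
            if not any(x < y for y in freq)}

print(closed_patterns(freq))   # closed: {A}:4 and {A,B,D}:3
print(max_patterns(freq))      # maximal: {A,B,D}:3
```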
10. Itemsets
- A set of items is referred to as an itemset.
- An itemset that contains k items is a k-itemset.
- Example
  - The set {computer, antivirus_software} is a 2-itemset.
- The support of an itemset is the number of transactions that contain the itemset.
11. Association Rules
- Let I = {I1, I2, …, Im} be a set of items and D be a set of database transactions, where each transaction T is a set of items such that T ⊆ I
- An association rule is an implication of the form A → B, where A ⊂ I, B ⊂ I, and A ∩ B = ∅
- The support and confidence of the rule A → B are defined as
  - support(A → B) = P(A ∪ B)
  - confidence(A → B) = P(B | A)
12. Find Strong Association Rules
- Association rule mining can be viewed as a two-step process
  - Find all frequent itemsets
  - Generate strong association rules from the frequent itemsets
13. Generate Association Rules
- Given a set of frequent itemsets, association rules can be generated as follows
  - For each frequent itemset l, generate all nonempty proper subsets of l
  - For every nonempty subset s of l, output the rule s → (l − s) if support_count(l) / support_count(s) ≥ min_conf
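The two steps above can be sketched in a few lines. The support counts are assumed for illustration (absolute counts out of 5 transactions, matching the worked example on slide 14):

```python
from itertools import combinations

# Support counts (out of 5 transactions) assumed for illustration.
sup = {frozenset("A"): 3, frozenset("B"): 5, frozenset("D"): 4,
       frozenset("AB"): 3, frozenset("AD"): 3, frozenset("BD"): 4,
       frozenset("ABD"): 3}

def rules_from(l, sup, min_conf):
    """All rules s -> (l - s) from frequent itemset l meeting min_conf."""
    out = []
    items = sorted(l)
    for r in range(1, len(items)):          # nonempty proper subsets of l
        for s in map(frozenset, combinations(items, r)):
            conf = sup[l] / sup[s]          # conf(s -> l-s) = sup(l) / sup(s)
            if conf >= min_conf:
                out.append((set(s), set(l - s), conf))
    return out

for lhs, rhs, conf in rules_from(frozenset("ABD"), sup, 0.5):
    print(sorted(lhs), "->", sorted(rhs), f"conf={conf:.0%}")
```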
14. An Example
Let sup_min = 50%, conf_min = 50%.
Frequent patterns: A:3, B:5, D:4, F:3; AB:3, AD:3, BD:4, BF:3; ABD:3
- Generate association rules
  - From AB: A → B (60%, 100%), B → A (60%, 60%)
  - From AD: A → D (60%, 100%), D → A (60%, 75%)
  - From BD: B → D (80%, 80%), D → B (80%, 100%)
  - From BF: B → F (60%, 60%), F → B (60%, 100%)
  - From ABD:
    - A → BD (60%, 100%), B → AD (60%, 60%), D → AB (60%, 75%)
    - BD → A (60%, 75%), AD → B (60%, 100%), AB → D (60%, 100%)
15. An Example
Let sup_min = 50%, conf_min = 50%.
Frequent Closed patterns: B:5, BD:4, BF:3, ABD:3
- Generate association rules
  - From BF: B → F (60%, 60%), F → B (60%, 100%)
  - From BD: B → D (80%, 80%), D → B (80%, 100%)
  - From ABD:
    - A → BD (60%, 100%), B → AD (60%, 60%), D → AB (60%, 75%)
    - BD → A (60%, 75%), AD → B (60%, 100%), AB → D (60%, 100%)
    - A → B (60%, 100%), B → A (60%, 60%)
    - A → D (60%, 100%), D → A (60%, 75%)
16. Chapter 5: Mining Frequent Patterns, Association and Correlations
- Basic concepts and a road map
- Efficient and scalable frequent itemset mining methods
- Mining various kinds of association rules
- From association mining to correlation analysis
- Constraint-based association mining
- Summary
17. Scalable Methods for Mining Frequent Patterns
- The downward closure property of frequent patterns
  - Any subset of a frequent itemset must be frequent
  - If {beer, diaper, nuts} is frequent, so is {beer, diaper}
  - i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}
- Scalable mining methods: three major approaches
  - Apriori (Agrawal & Srikant @VLDB'94)
  - Freq. pattern growth (FPgrowth: Han, Pei & Yin @SIGMOD'00)
  - Vertical data format approach (Charm: Zaki & Hsiao @SDM'02)
18. Apriori: A Candidate Generation-and-Test Approach
- Apriori pruning principle: if there is any itemset which is infrequent, its supersets should not be generated/tested! (Agrawal & Srikant @VLDB'94; Mannila, et al. @KDD'94)
- Method
  - Initially, scan DB once to get the frequent 1-itemsets
  - Generate length (k+1) candidate itemsets from length k frequent itemsets
  - Test the candidates against DB
  - Terminate when no frequent or candidate set can be generated
19. The Apriori Algorithm: An Example
Sup_min = 2
[Figure: starting from database TDB, the 1st scan yields C1 and its frequent subset L1; self-joining L1 gives C2, which the 2nd scan reduces to L2; the 3rd scan turns C3 into L3.]
20. The Apriori Algorithm
- Pseudo-code:
  Ck: candidate itemset of size k
  Lk: frequent itemset of size k
  L1 = {frequent items};
  for (k = 1; Lk != ∅; k++) do begin
      Ck+1 = candidates generated from Lk;
      for each transaction t in database do
          increment the count of all candidates in Ck+1 that are contained in t
      Lk+1 = candidates in Ck+1 with min_support
  end
  return ∪k Lk;
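The pseudo-code above can be turned into a runnable sketch. This is a minimal, unoptimized version; the four-transaction database and min_sup = 2 are assumed for illustration (they match the worked example on slide 23):

```python
from itertools import combinations

def apriori(db, min_sup):
    """Return {itemset: support_count} for all frequent itemsets (sketch)."""
    db = [frozenset(t) for t in db]
    # L1: frequent 1-itemsets
    candidates = {frozenset([i]) for t in db for i in t}
    Lk = {c for c in candidates if sum(c <= t for t in db) >= min_sup}
    result, k = {}, 1
    while Lk:
        result.update({c: sum(c <= t for t in db) for c in Lk})
        # Generate C(k+1): join Lk with itself, then prune any candidate
        # that has an infrequent k-subset (downward closure).
        Ck1 = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        Ck1 = {c for c in Ck1
               if all(frozenset(s) in Lk for s in combinations(c, k))}
        Lk = {c for c in Ck1 if sum(c <= t for t in db) >= min_sup}
        k += 1
    return result

# Four transactions and min_sup = 2 assumed for illustration.
db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
freq = apriori(db, min_sup=2)
print(sorted(("".join(sorted(s)), n) for s, n in freq.items()))
```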
21. Important Details of Apriori
- How to generate candidates?
  - Step 1: self-joining Lk
  - Step 2: pruning
- How to count supports of candidates?
- Example of candidate generation
  - L3 = {abc, abd, acd, ace, bcd}
  - Self-joining: L3 ⋈ L3
    - abcd from abc and abd
    - acde from acd and ace
  - Pruning
    - acde is removed because ade is not in L3
  - C4 = {abcd}
22. How to Generate Candidates?
- Suppose the items in Lk-1 are listed in an order
- Step 1: self-joining Lk-1
  insert into Ck
  select p.item1, p.item2, …, p.itemk-1, q.itemk-1
  from Lk-1 p, Lk-1 q
  where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
- Step 2: pruning
  forall itemsets c in Ck do
      forall (k-1)-subsets s of c do
          if (s is not in Lk-1) then delete c from Ck
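The SQL-style join and prune above can be sketched directly. Itemsets are kept as sorted tuples so that "the first k-2 items agree and p.item_{k-1} < q.item_{k-1}" can be checked positionally:

```python
def gen_candidates(Lk_1):
    """Self-join then prune; itemsets are sorted tuples of items."""
    Lk_1 = sorted(Lk_1)
    # Step 1: self-join. p and q agree on the first k-2 items, and the
    # sorted order guarantees p.item_{k-1} < q.item_{k-1}.
    Ck = [p + (q[-1],)
          for i, p in enumerate(Lk_1)
          for q in Lk_1[i + 1:]
          if p[:-1] == q[:-1]]
    # Step 2: prune candidates that have an infrequent (k-1)-subset.
    Lset = set(Lk_1)
    return [c for c in Ck
            if all(c[:j] + c[j + 1:] in Lset for j in range(len(c)))]

L3 = [("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"),
      ("a", "c", "e"), ("b", "c", "d")]
print(gen_candidates(L3))   # [('a', 'b', 'c', 'd')]: acde is pruned
```

This reproduces the slide-21 example: abcd and acde come out of the join, and acde is pruned because ade is not in L3.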
23. An Example
- Sup_min = 2
- If we list the items as A, B, C, D, E
- Get frequent 1-itemsets: A:2, B:3, C:3, E:3
- Generate candidate 2-itemsets: AB, AC, AE, BC, BE, CE
- Get the support of the candidate itemsets to find the frequent 2-itemsets: L2 = {AC:2, BC:2, CE:2, BE:3}
- Generate the candidate 3-itemsets: BCE
  select p.item1, p.item2, …, p.itemk-1, q.itemk-1
  from L2 p, L2 q
  where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
- Find the frequent 3-itemsets: BCE:2
[Figure: database TDB]
24. Challenges of Frequent Pattern Mining
- Challenges
  - Multiple scans of the transaction database
  - Huge number of candidates
  - Tedious workload of support counting for candidates
- Improving Apriori: general ideas
  - Reduce passes of transaction database scans
  - Shrink the number of candidates
  - Facilitate support counting of candidates
25. Partition: Scan Database Only Twice
- Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
  - Scan 1: partition the database and find local frequent patterns
  - Scan 2: consolidate global frequent patterns
- A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. In VLDB'95
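The two-scan idea can be sketched as follows. The brute-force local miner, the split into two partitions, and the sample database are all assumptions for illustration, not the paper's actual algorithm:

```python
from itertools import combinations

def local_frequent(part, min_frac):
    """Brute-force frequent itemsets within one partition (sketch only)."""
    items = sorted({i for t in part for i in t})
    found = set()
    for k in range(1, len(items) + 1):
        for c in map(frozenset, combinations(items, k)):
            if sum(c <= t for t in part) >= min_frac * len(part):
                found.add(c)
    return found

def partition_mine(db, min_frac, n_parts=2):
    size = (len(db) + n_parts - 1) // n_parts
    parts = [db[i:i + size] for i in range(0, len(db), size)]
    # Scan 1: any globally frequent itemset is locally frequent somewhere,
    # so the union of local results is a complete global candidate set.
    cands = set().union(*(local_frequent(p, min_frac) for p in parts))
    # Scan 2: count the candidates once against the whole database.
    return {c for c in cands
            if sum(c <= t for t in db) >= min_frac * len(db)}

db = [frozenset(t) for t in
      [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]]
print(sorted("".join(sorted(s)) for s in partition_mine(db, 0.5)))
```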
26. Sampling for Frequent Patterns
- Select a sample of the original database; mine frequent patterns within the sample using Apriori
- Scan the database once to verify the frequent itemsets found in the sample; only borders of the closure of frequent patterns are checked
  - Example: check abcd instead of ab, ac, …, etc.
- Scan the database again to find missed frequent patterns
- H. Toivonen. Sampling large databases for association rules. In VLDB'96
27DIC Reduce Number of Scans
ABCD
- Once both A and D are determined frequent, the
counting of AD begins - Once all length-2 subsets of BCD are determined
frequent, the counting of BCD begins
ABC
ABD
ACD
BCD
AB
AC
BC
AD
BD
CD
Transactions
1-itemsets
B
C
D
A
2-itemsets
Apriori
Itemset lattice
1-itemsets
2-items
S. Brin R. Motwani, J. Ullman, and S. Tsur.
Dynamic itemset counting and implication rules
for market basket data. In SIGMOD97
3-items
DIC
28. Bottleneck of Frequent-pattern Mining
- Multiple database scans are costly
- Mining long patterns needs many passes of scanning and generates lots of candidates
  - To find the frequent itemset i1i2…i100
    - # of scans: 100
    - # of candidates: C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1 ≈ 1.27×10^30!
- Bottleneck: candidate generation and test
- Can we avoid candidate generation?
29. Implications of the Methodology
- Mining closed frequent itemsets and max-patterns
  - CLOSET (DMKD'00)
- Mining sequential patterns
  - FreeSpan (KDD'00), PrefixSpan (ICDE'01)
- Constraint-based mining of frequent patterns
  - Convertible constraints (KDD'00, ICDE'01)
- Computing iceberg data cubes with complex measures
  - H-tree and H-cubing algorithm (SIGMOD'01)
30. MaxMiner: Mining Max-patterns
- 1st scan: find frequent items
  - A, B, C, D, E
- 2nd scan: find support for the potential max-patterns
  - AB, AC, AD, AE, ABCDE
  - BC, BD, BE, BCDE
  - CD, CE, CDE
  - DE
- Since BCDE is a max-pattern, there is no need to check BCD, BDE, CDE in a later scan
- R. Bayardo. Efficiently mining long patterns from databases. In SIGMOD'98
31. CHARM: Mining by Exploring Vertical Data Format
- Vertical format: t(AB) = {T11, T25, …}
  - tid-list: list of trans.-ids containing an itemset
- Deriving closed patterns based on vertical intersections
  - t(X) = t(Y): X and Y always happen together
  - t(X) ⊂ t(Y): a transaction having X always has Y
- Using diffsets to accelerate mining
  - Only keep track of differences of tids
  - t(X) = {T1, T2, T3}, t(XY) = {T1, T3}
  - Diffset(XY, X) = {T2}
- Eclat/MaxEclat (Zaki et al. @KDD'97), VIPER (P. Shenoy et al. @SIGMOD'00), CHARM (Zaki & Hsiao @SDM'02)
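A minimal sketch of tid-list intersection and diffsets; the tid-lists for items A and B are made up for illustration, while the diffset example follows the slide:

```python
# Vertical layout: each item maps to the set of transaction ids
# containing it (tid-lists assumed for illustration).
tids = {"A": {1, 2, 3}, "B": {1, 4, 5}}

t_AB = tids["A"] & tids["B"]   # tid-list of AB: support(AB) = |t(A) ∩ t(B)|
print(t_AB, len(t_AB))

# Diffset: record only what X's tid-list loses when item Y is added.
t_X, t_XY = {1, 2, 3}, {1, 3}
diffset = t_X - t_XY           # Diffset(XY, X) = {T2}, as on the slide
print(diffset)
```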
33. Chapter 5: Mining Frequent Patterns, Association and Correlations
- Basic concepts and a road map
- Efficient and scalable frequent itemset mining methods
- Mining various kinds of association rules
- From association mining to correlation analysis
- Constraint-based association mining
- Summary
34. Mining Various Kinds of Association Rules
- Mining multilevel association
- Mining multidimensional association
- Mining quantitative association
- Mining interesting correlation patterns
35. Mining Multiple-Level Association Rules
- Items often form hierarchies
- Flexible support settings
  - Items at the lower level are expected to have lower support
36. Mining Multiple-Level Association Rules (cont.)
- Association rules generated from mining data at multiple levels of abstraction are called multilevel association rules
  - Using uniform minimum support for all levels
  - Using reduced minimum support at lower levels
  - Using item- or group-based minimum support
- Example
  - buys(X, "laptop") → buys(X, "HP printer") [support = 8%, confidence = 70%]
  - buys(X, "IBM laptop") → buys(X, "HP printer") [support = 2%, confidence = 72%]
37. Multi-level Association: Redundancy Filtering
- Some rules may be redundant due to ancestor relationships between items.
- Example
  - milk → wheat bread [support = 8%, confidence = 70%]
  - 2% milk → wheat bread [support = 2%, confidence = 72%]
- We say the first rule is an ancestor of the second rule.
- A rule is redundant if its support is close to the expected value, based on the rule's ancestor.
38. Mining Multi-Dimensional Association
- Association rules that involve two or more predicates are referred to as multidimensional association rules
- Example
  - age(X, "20..29") ∧ occupation(X, "student") → buys(X, "laptop")
39. Mining Multi-Dimensional Association (cont.)
- Single-dimensional rules
  - buys(X, "milk") → buys(X, "bread")
- Multi-dimensional rules: ≥ 2 dimensions or predicates
  - Inter-dimension assoc. rules (no repeated predicates)
    - age(X, "19-25") ∧ occupation(X, "student") → buys(X, "coke")
  - Hybrid-dimension assoc. rules (repeated predicates)
    - age(X, "19-25") ∧ buys(X, "popcorn") → buys(X, "coke")
- Categorical attributes: finite number of possible values, no ordering among values; data cube approach
- Quantitative attributes: numeric, implicit ordering among values; discretization, clustering, and gradient approaches
40. Mining Quantitative Associations
- Techniques can be categorized by how numerical attributes, such as age or salary, are treated
  - Static discretization based on predefined concept hierarchies (data cube methods)
  - Dynamic discretization based on data distribution (quantitative rules, e.g., Agrawal & Srikant @SIGMOD'96)
  - Clustering: distance-based association (e.g., Yang & Miller @SIGMOD'97)
    - One-dimensional clustering, then association
  - Deviation (such as Aumann and Lindell @KDD'99)
    - Sex = female => Wage: mean = $7/hr (overall mean = $9)
41. Chapter 5: Mining Frequent Patterns, Association and Correlations
- Basic concepts and a road map
- Efficient and scalable frequent itemset mining methods
- Mining various kinds of association rules
- From association mining to correlation analysis
- Constraint-based association mining
- Summary
42. Strong Rules Are Not Necessarily Interesting
- Example: of 10,000 transactions at AllElectronics, 6,000 included computer games, 7,500 included videos, and 4,000 included both computer games and videos.
- buys(X, "computer games") → buys(X, "videos") [s = 40%, c = 66%]
- Is the above association rule interesting?
  - Hint: overall, 75% of the 10,000 customers bought videos, while only 66% of the computer-game-buying customers bought videos
43. Interestingness Measure: Correlations (Lift)
- play basketball → eat cereal [40%, 66.7%] is misleading
  - The overall % of students eating cereal is 75% > 66.7%
- play basketball → not eat cereal [20%, 33.3%] is more accurate, although with lower support and confidence
- Measure of dependent/correlated events: lift
44. Lift
- Given a rule A → B, lift measures the correlation between A and B:
  lift(A, B) = P(A ∪ B) / (P(A) P(B))
- The occurrence of itemset A is independent of the occurrence of itemset B if lift(A, B) = 1; otherwise:
  - if lift(A, B) > 1, the occurrence of A is positively correlated with the occurrence of B
  - if lift(A, B) < 1, the occurrence of A is negatively correlated with the occurrence of B
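Using the AllElectronics numbers from slide 42 (10,000 transactions; 6,000 with games, 7,500 with videos, 4,000 with both), lift can be computed directly:

```python
# Counts from the games/videos example on slide 42.
n, n_game, n_video, n_both = 10_000, 6_000, 7_500, 4_000

# lift(A, B) = P(A ∪ B) / (P(A) * P(B))
lift = (n_both / n) / ((n_game / n) * (n_video / n))
print(round(lift, 3))   # 0.889 < 1: games and videos are negatively correlated
```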
45. χ²
- Suppose A has c distinct values a1, a2, …, ac and B has r distinct values b1, b2, …, br. The data tuples can be shown as an r × c contingency table.
- The χ² statistic is defined as
  χ² = Σi Σj (nij − eij)² / eij
  where nij is the observed frequency of the joint event (ai, bj) and eij is the expected frequency of (ai, bj), which can be computed as
  eij = count(A = ai) × count(B = bj) / n
46. Example
- Since the χ² value is large and the observed count (game, video) = 4,000 is less than the expected value 4,500, buying game and buying video are negatively correlated.
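A sketch of the χ² computation: the 2×2 contingency table below is filled in from the slide-42 counts (10,000 transactions; 6,000 games, 7,500 videos, 4,000 both), and each expected count is row_total × col_total / n:

```python
# Observed 2x2 contingency table for games vs. videos.
n = 10_000
obs = {("game", "video"): 4_000, ("game", "no video"): 2_000,
       ("no game", "video"): 3_500, ("no game", "no video"): 500}
row = {"game": 6_000, "no game": 4_000}
col = {"video": 7_500, "no video": 2_500}

# chi2 = sum over cells of (observed - expected)^2 / expected,
# with expected = row_total * col_total / n.
chi2 = sum((o - row[a] * col[b] / n) ** 2 / (row[a] * col[b] / n)
           for (a, b), o in obs.items())
print(round(chi2, 1))   # 555.6, with observed 4,000 < expected 4,500
```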
47. Other Correlation Measures
- All-confidence: given an itemset X = {i1, i2, …, ik}, the all-confidence of X is defined as
  all_conf(X) = sup(X) / max{sup(ij) : ij ∈ X}
- Cosine: given itemsets A and B, the cosine measure of A and B is defined as
  cosine(A, B) = sup(A ∪ B) / √(sup(A) × sup(B))
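Both measures can be sketched in a few lines; the support counts here are illustrative, not taken from a particular slide:

```python
from math import sqrt

# Illustrative support counts for items A, B and the itemset AB.
sup = {"A": 3, "B": 5, "AB": 3}

# all_conf(X) = sup(X) / max item support in X
all_conf = sup["AB"] / max(sup["A"], sup["B"])
# cosine(A, B) = sup(A ∪ B) / sqrt(sup(A) * sup(B))
cosine = sup["AB"] / sqrt(sup["A"] * sup["B"])
print(round(all_conf, 2), round(cosine, 2))   # 0.6 0.77
```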
48. Are Lift and χ² Good Measures of Correlation?
- "Buy walnuts → buy milk [1%, 80%]" is misleading if 85% of customers buy milk
- Support and confidence are not good at representing correlations
- So many interestingness measures! (Tan, Kumar & Srivastava @KDD'02)
49. Which Measures Should Be Used?
- Lift and χ² are not good measures for correlations in large transactional DBs
- all_conf or coherence could be good measures (Omiecinski @TKDE'03)
- Both all_conf and coherence have the downward closure property
- Efficient algorithms can be derived for mining (Lee et al. @ICDM'03sub)
50. Chapter 5: Mining Frequent Patterns, Association and Correlations
- Basic concepts and a road map
- Efficient and scalable frequent itemset mining methods
- Mining various kinds of association rules
- From association mining to correlation analysis
- Constraint-based association mining
- Summary
51. Constraint-based (Query-Directed) Mining
- Finding all the patterns in a database autonomously? Unrealistic!
  - The patterns could be too many but not focused!
- Data mining should be an interactive process
  - The user directs what is to be mined using a data mining query language (or a graphical user interface)
- Constraint-based mining
  - User flexibility: provides constraints on what is to be mined
  - System optimization: the system explores such constraints for efficient, constraint-based mining
52. Constraints in Data Mining
- Knowledge type constraint
  - Classification, association, etc.
- Data constraint: using SQL-like queries
  - Find product pairs sold together in stores in Chicago in Dec.'02
- Dimension/level constraint
  - In relevance to region, price, brand, customer category
- Rule (or pattern) constraint
  - Small sales (price < $10) triggers big sales (sum > $200)
- Interestingness constraint
  - Strong rules: min_support ≥ 3%, min_confidence ≥ 60%
53. Chapter 5: Mining Frequent Patterns, Association and Correlations
- Basic concepts and a road map
- Efficient and scalable frequent itemset mining methods
- Mining various kinds of association rules
- From association mining to correlation analysis
- Constraint-based association mining
- Summary
54. Frequent-Pattern Mining: Summary
- Frequent pattern mining: an important task in data mining
- Scalable frequent pattern mining methods
  - Apriori (candidate generation & test)
  - Projection-based (FPgrowth, CLOSET, ...)
  - Vertical format approach (CHARM, ...)
- Mining a variety of rules and interesting patterns
- Constraint-based mining
- Mining sequential and structured patterns
- Extensions and applications
55. Frequent-Pattern Mining: Research Problems
- Mining fault-tolerant frequent, sequential and structured patterns
  - Patterns that allow limited faults (insertion, deletion, mutation)
- Mining truly interesting patterns
  - Surprising, novel, concise, …
- Application exploration
  - E.g., DNA sequence analysis and bio-pattern classification
- Invisible data mining