Title: Mining Frequent Patterns, Association, and Correlations
1 Mining Frequent Patterns, Association, and Correlations
2 The Course
[Course map diagram. Legend: DS = Data source, DW = Data warehouse, DM = Data Mining, DP = Staging Database. Chapters: OLAP (Ch4), Association (Ch5), Classification (Ch6), Clustering (Ch7).]
3 Motivation
- In this shopping basket the customer bought tomatoes, carrots, bananas, bread, eggs, milk, etc.
- How does demographic information affect what the customer buys?
- Is bread usually bought with milk?
- Does a specific milk brand make any difference?
- Is bread bought when both milk and eggs are bought together?
- Where should we place the tomatoes in the store to maximize their sales?
4 Mining Frequent Patterns, Association, and Correlations
- Basic concepts
- Efficient and scalable frequent itemset mining methods
- Association Rule Mining
- Mining various kinds of association rules
- From association mining to correlation analysis
- Constraint-based association mining
- Summary
5 Definition: Frequent Itemset
- Itemset
  - A collection of one or more items
  - Example: {Milk, Bread, Sugar}
- k-itemset
  - An itemset that contains k items
- Support count (P)
  - Frequency of occurrence of an itemset
  - E.g. P({Bread, Milk, Sugar}) = 2
- Support (s)
  - Fraction of transactions that contain an itemset
  - E.g. s({Bread, Milk, Sugar}) = 2/5
- Frequent Itemset
  - An itemset whose support is greater than or equal to a minsup threshold

TID  Items
1    Bread, Milk
2    Bread, Coffee, Eggs, Sugar
3    Milk, Coffee, Coke, Sugar
4    Bread, Coffee, Milk, Sugar
5    Bread, Coke, Milk, Sugar
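To make the definitions above concrete, here is a minimal Python sketch (function names are illustrative, not from the slides) that computes the support count and support of an itemset over the transaction table on this slide:

```python
# Transactions from the table above (one set of items per TID)
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Coffee", "Eggs", "Sugar"},
    {"Milk", "Coffee", "Coke", "Sugar"},
    {"Bread", "Coffee", "Milk", "Sugar"},
    {"Bread", "Coke", "Milk", "Sugar"},
]

def support_count(itemset, transactions):
    """Support count: number of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """Support: fraction of transactions containing `itemset`."""
    return support_count(itemset, transactions) / len(transactions)

print(support_count({"Bread", "Milk", "Sugar"}, transactions))  # 2
print(support({"Bread", "Milk", "Sugar"}, transactions))        # 0.4
```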
6 Mining Frequent Patterns, Association and Correlations
- Basic concepts
- Efficient and scalable frequent itemset mining methods
- Association Rule Mining
- Mining various kinds of association rules
- From association mining to correlation analysis
- Constraint-based association mining
- Summary
7 Frequent Itemset Generation
Given d items, there are 2^d possible candidate itemsets.
8 Frequent Itemset Generation
- There are a number of algorithms to generate frequent itemsets, some of which are:
  - Brute force
  - Apriori based
    - Simple
    - Hash
    - Partitioning
    - Sampling
  - FP-growth
  - Vertical data format
9 -- Brute-Force
- Each itemset in the lattice is a candidate frequent itemset
- Count the support of each candidate by scanning the database
- Match each transaction against every candidate
- Complexity O(NMw): expensive, since M = 2^d !!!

Transactions
TID  Items
1    Bread, Milk
2    Bread, Coffee, Eggs, Sugar
3    Milk, Coffee, Coke, Sugar
4    Bread, Coffee, Milk, Sugar
5    Bread, Coke, Milk, Sugar
(N = number of transactions, M = number of candidate itemsets, w = maximum transaction width)
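As a concrete (if impractical) illustration of the O(NMw) cost, here is a hedged Python sketch of the brute-force approach: it enumerates all 2^d - 1 candidate itemsets and scans every transaction for each one.

```python
from itertools import combinations

def brute_force_frequent(transactions, minsup):
    """Enumerate every candidate itemset (M = 2^d - 1 of them) and count each
    against every transaction -- the O(N*M*w) approach described above."""
    items = sorted(set().union(*transactions))               # the d distinct items
    frequent = {}
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):                   # all candidates of size k
            cand = frozenset(cand)
            count = sum(1 for t in transactions if cand <= t) # scan the N transactions
            if count / len(transactions) >= minsup:
                frequent[cand] = count
    return frequent
```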
10 Frequent Itemset Generation Strategies
- Reduce the number of candidates (M)
  - Complete search: M = 2^d
  - Use pruning techniques to reduce M
- Reduce the number of transactions (N)
  - Reduce the size of N as the size of the itemset increases
  - Used by vertical-based mining algorithms
- Reduce the number of comparisons (NM)
  - Use efficient data structures to store the candidates or transactions
  - No need to match every candidate against every transaction
11 - Reducing Number of Candidates
- Apriori principle
  - If an itemset is frequent, then all of its subsets must also be frequent
- The Apriori principle holds due to the following property of the support measure:
  - The support of an itemset never exceeds the support of its subsets
  - This is known as the anti-monotone property of support
12 Illustrating Apriori Principle
13 Illustrating Apriori Principle
Minimum support count = 3

1-itemsets
Itemset   Count
Bread     4
Coke      2
Milk      4
Coffee    3
Sugar     4
Eggs      1

2-itemsets
Itemset        Count
Bread,Milk     3
Bread,Coffee   2
Bread,Sugar    3
Milk,Coffee    2
Milk,Sugar     3
Coffee,Sugar   3

3-itemsets
Itemset            Count
Bread,Milk,Sugar   2

If every subset is considered: 6C1 + 6C2 + 6C3 = 6 + 15 + 20 = 41 candidates.
With support-based pruning: 6 + 6 + 1 = 13 candidates (no need to generate candidates involving Coke or Eggs).
14 ---- The Apriori Algorithm: An Example (Sup_min = 2)

Database TDB
TID  Items
10   A, C, D
20   B, C, E
30   A, B, C, E
40   B, E

1st scan -> C1
Itemset  sup
{A}      2
{B}      3
{C}      3
{D}      1
{E}      3

L1 (D is pruned, sup < 2)
Itemset  sup
{A}      2
{B}      3
{C}      3
{E}      3

C2 (generated from L1), counted in the 2nd scan
Itemset  sup
{A, B}   1
{A, C}   2
{A, E}   1
{B, C}   2
{B, E}   3
{C, E}   2

L2
Itemset  sup
{A, C}   2
{B, C}   2
{B, E}   3
{C, E}   2

C3 (generated from L2), counted in the 3rd scan
Itemset    sup
{B, C, E}  2

L3
Itemset    sup
{B, C, E}  2
15 - Apriori Algorithm
- Method
  - Let k = 1
  - Generate frequent itemsets of length 1
  - Repeat until no new frequent itemsets are identified:
    - Generate length (k+1) candidate itemsets from length-k frequent itemsets
    - Prune candidate itemsets containing subsets of length k that are infrequent
    - Count the support of each candidate by scanning the DB
    - Eliminate candidates that are infrequent, leaving only those that are frequent
16 ---- The Apriori Algorithm
- Pseudo-code:
  Ck: candidate itemsets of size k
  Lk: frequent itemsets of size k

  L1 = {frequent items};
  for (k = 1; Lk != empty; k++) do begin
      Ck+1 = candidates generated from Lk;
      for each transaction t in the database do
          increment the count of all candidates in Ck+1 that are contained in t;
      Lk+1 = candidates in Ck+1 with min_support;
  end
  return the union of all Lk;
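The pseudo-code above translates fairly directly into Python. The sketch below is a minimal, unoptimized version: it uses a simple pairwise join of Lk with itself (rather than the ordered self-join shown two slides later) plus subset pruning, and one database pass per level to count candidates.

```python
from itertools import combinations

def apriori(transactions, minsup_count):
    """Level-wise Apriori (a sketch of the pseudo-code above)."""
    transactions = [frozenset(t) for t in transactions]
    # L1: frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {s for s, c in counts.items() if c >= minsup_count}
    frequent = {s: c for s, c in counts.items() if c >= minsup_count}
    k = 1
    while Lk:
        # generate C(k+1): pairwise union of itemsets in Lk, keeping only candidates
        # of size k+1 whose k-subsets are all frequent (pruning)
        Ck1 = set()
        for a in Lk:
            for b in Lk:
                u = a | b
                if len(u) == k + 1 and all(frozenset(s) in Lk for s in combinations(u, k)):
                    Ck1.add(u)
        # one pass over the database to count every surviving candidate
        counts = {c: 0 for c in Ck1}
        for t in transactions:
            for c in Ck1:
                if c <= t:
                    counts[c] += 1
        Lk = {c for c, n in counts.items() if n >= minsup_count}
        frequent.update({c: counts[c] for c in Lk})
        k += 1
    return frequent

# The TDB from the example two slides back, with Sup_min = 2
tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(apriori(tdb, 2))   # includes frozenset({'B', 'C', 'E'}): 2
```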
17 ---- Important Details of Apriori
- How to generate candidates?
  - Step 1: self-joining Lk
  - Step 2: pruning
- How to count supports of candidates?
- Example of candidate generation
  - L3 = {abc, abd, acd, ace, bcd}
  - Self-joining: L3 * L3
    - abcd from abc and abd
    - acde from acd and ace
  - Pruning
    - acde is removed because ade is not in L3
  - C4 = {abcd}
18 ---- How to Generate Candidates?
- Suppose the items in Lk-1 are listed in an order
- Step 1: self-joining Lk-1
  insert into Ck
  select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
  from Lk-1 p, Lk-1 q
  where p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
- Step 2: pruning
  forall itemsets c in Ck do
      forall (k-1)-subsets s of c do
          if (s is not in Lk-1) then delete c from Ck
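The SQL-like self-join and prune steps can be sketched in Python as follows, with itemsets kept as sorted tuples; the example at the bottom reproduces the L3 case from the previous slide (the function name is illustrative):

```python
from itertools import combinations

def apriori_gen(Lk_minus_1):
    """Self-join L(k-1) on the first k-2 items, then prune any candidate that
    has an infrequent (k-1)-subset."""
    Lk_minus_1 = sorted(Lk_minus_1)
    prev = set(Lk_minus_1)
    Ck = []
    for p in Lk_minus_1:
        for q in Lk_minus_1:
            # join condition: identical prefix, and p's last item < q's last item
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                cand = p + (q[-1],)
                # prune: every (k-1)-subset of the candidate must be in L(k-1)
                if all(s in prev for s in combinations(cand, len(cand) - 1)):
                    Ck.append(cand)
    return Ck

# L3 = {abc, abd, acd, ace, bcd}
L3 = [("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"), ("a", "c", "e"), ("b", "c", "d")]
print(apriori_gen(L3))   # [('a', 'b', 'c', 'd')] -- acde is pruned because ade is not in L3
```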
19 Factors Affecting Complexity
- Choice of minimum support threshold
  - Lowering the support threshold results in more frequent itemsets; this may increase the number of candidates and the max length of frequent itemsets
- Dimensionality (number of items) of the data set
  - More space is needed to store the support count of each item
  - If the number of frequent items also increases, both computation and I/O costs may increase
- Size of database
  - Since Apriori makes multiple passes, the run time of the algorithm may increase with the number of transactions
- Average transaction width
  - Transaction width increases with denser data sets
  - This may increase the max length of frequent itemsets and traversals of the hash tree (the number of subsets in a transaction increases with its width)
20 -- Improving the Efficiency of Apriori
- Reduce the number of comparisons
- Reduce the number of transactions
- Partitioning the data to find candidate itemsets
- Sampling: mine on a subset of the given data
21 --- Reducing Number of Comparisons
- Candidate counting: scan the database of transactions to determine the support of each candidate itemset. To reduce the number of comparisons, store the candidates in a hash structure.
- Instead of matching each transaction against every candidate, match it against only the candidates contained in the hashed buckets.

h(x, y) = ((order of x) * 10 + (order of y)) mod 7

Hash table H2:
Bucket 0 (count 2): {A,D}, {C,E}
Bucket 1 (count 2): {A,E}, {A,E}
Bucket 2 (count 4): {B,C}, {B,C}, {B,C}, {B,C}
Bucket 3 (count 2): {B,D}, {B,D}
Bucket 4 (count 2): {B,E}, {B,E}
Bucket 5 (count 4): {A,B}, {A,B}, {A,B}, {A,B}
Bucket 6 (count 4): {A,C}, {A,C}, {A,C}, {A,C}

If min_sup = 3, the 2-itemsets in buckets 0, 1, 3, and 4 cannot be frequent, so they should not be included in C2.
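A hedged sketch of the hashing idea: every 2-itemset of every transaction is hashed into one of 7 buckets with the function on this slide (the item order A=1 ... E=5 is an assumption); buckets whose count falls below min_sup cannot contain a frequent 2-itemset, so their candidates can be dropped from C2. The exact bucket contents depend on which database is used.

```python
from itertools import combinations

# Assumed item order for the hash function on the slide: A=1, B=2, C=3, D=4, E=5
order = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5}

def h(x, y):
    """Bucket address for the 2-itemset {x, y}, with x before y in the item order."""
    return (order[x] * 10 + order[y]) % 7

def hash_2itemsets(transactions, num_buckets=7):
    """Hash every 2-itemset of every transaction into a bucket and count."""
    bucket_count = [0] * num_buckets
    bucket_content = [[] for _ in range(num_buckets)]
    for t in transactions:
        for x, y in combinations(sorted(t, key=order.get), 2):
            b = h(x, y)
            bucket_count[b] += 1
            bucket_content[b].append((x, y))
    return bucket_count, bucket_content

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
counts, contents = hash_2itemsets(tdb)
# Any 2-itemset that only hashes to buckets with count < min_sup can be pruned from C2
```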
22 --- Reduce the Number of Transactions
- A transaction that doesn't contain any frequent k-itemset cannot contain any frequent j-itemset for any j > k. Such a transaction can therefore be marked or removed from subsequent scans of the database for j-itemsets.
23 --- Partitioning the Data to Find Candidate Itemsets
- Requires just 2 database scans to mine the frequent itemsets, provided the size of each partition fits in the available memory.
- It consists of 2 phases (a sketch follows below):
  - Phase 1
    - The algorithm partitions the D transactions into n partitions.
    - For each partition, find the local frequent itemsets. A local frequent itemset has a support count >= min_sup * (the number of transactions in that partition).
    - For each itemset, special data structures record the TIDs of the transactions containing the itemset.
  - Phase 2
    - A second scan of D is conducted to find the actual support of each local frequent itemset.
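A minimal sketch of the two phases, assuming minsup is given as a fraction and local_miner is any in-memory frequent-itemset miner (for instance the Apriori sketch shown earlier); the names and the partitioning-by-slicing are illustrative only:

```python
def partition_mining(transactions, minsup, n_partitions, local_miner):
    """Two-phase partitioning: mine each partition locally, then verify the
    union of local frequent itemsets with one full scan of D."""
    size = len(transactions)
    step = -(-size // n_partitions)                     # ceiling division: partition size
    # Phase 1: mine each partition with a proportional local support count
    candidates = set()
    for i in range(0, size, step):
        part = transactions[i:i + step]
        local_count = int(minsup * len(part))           # minsup fraction of this partition
        candidates |= set(local_miner(part, max(local_count, 1)))
    # Phase 2: a second scan of D gives the actual support of every candidate
    result = {}
    for cand in candidates:
        count = sum(1 for t in transactions if cand <= t)
        if count >= minsup * size:
            result[cand] = count
    return result
```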
24 --- Partitioning the Data to Find Candidate Itemsets
- Partitioning of the data
- Data set partitioning generates frequent itemsets based on finding frequent itemsets in subsets (partitions) of D.
25 --- Sampling: Mine on a Subset of the Given Data
- Pick a random sample S of the given data D (make sure S fits in the available memory).
- Search for frequent itemsets in S instead of D. You can lower the support threshold to reduce the number of missed frequent itemsets.
- Find the set Ls of frequent itemsets in S.
- The rest of the database can be used to compute the actual frequencies of the itemsets in Ls.
- If Ls doesn't contain all the frequent itemsets in D, then a second pass is needed.
26 Bottleneck of Frequent-Pattern Mining
- Multiple database scans are costly
- Mining long patterns needs many passes of scanning and generates lots of candidates
  - To find the frequent itemset i1 i2 ... i100:
    - # of scans: 100
    - # of candidates: (100 choose 1) + (100 choose 2) + ... + (100 choose 100) = 2^100 - 1, about 1.27 * 10^30 !
- Bottleneck: candidate generation and test
- Can we avoid candidate generation?
  - Yes, if we use the FP-growth algorithm (see next slide)
27 FP-growth: Another Method for Frequent Itemset Generation
- Use a compressed representation of the database using an FP-tree
- Once an FP-tree has been constructed, a recursive divide-and-conquer approach is used to mine the frequent itemsets (a minimal construction sketch follows below).
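A minimal sketch of FP-tree construction (node and function names are illustrative; the header table is kept as simple lists of nodes rather than linked pointers): items are counted, infrequent items are dropped, and each transaction is inserted as a path of descending-frequency items, sharing prefixes with earlier transactions.

```python
from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 1
        self.children = {}

def build_fptree(transactions, minsup_count):
    """Build an FP-tree from a list of transactions (sets of items)."""
    # 1st scan: global item counts; drop infrequent items
    freq = defaultdict(int)
    for t in transactions:
        for item in t:
            freq[item] += 1
    freq = {i: c for i, c in freq.items() if c >= minsup_count}

    # 2nd scan: insert each transaction as a path ordered by descending count
    root = FPNode(None, None)
    header = defaultdict(list)                # item -> all tree nodes labelled with it
    for t in transactions:
        path = sorted((i for i in t if i in freq), key=lambda i: (-freq[i], i))
        node = root
        for item in path:
            if item in node.children:
                node.children[item].count += 1
            else:
                node.children[item] = FPNode(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
    return root, header
```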
28 FP-Tree Construction
[Figure: after reading TID=1 the tree is null -> A:1 -> B:1; after reading TID=2 a second branch null -> B:1 -> C:1 -> D:1 is added.]
29 FP-Tree Construction
[Figure: the complete FP-tree built from the transaction database, together with a header table whose pointers link all nodes carrying the same item.]
Pointers are used to assist frequent itemset generation.
30 FP-growth
- Build the conditional pattern base for E: P = {(A:1, C:1, D:1), (A:1, D:1), (B:1, C:1)}
- Recursively apply FP-growth on P
[Figure: the FP-tree with the prefix paths of E highlighted.]
31 FP-growth
- Conditional tree for E
- Conditional pattern base for E: P = {(A:1, C:1, D:1, E:1), (A:1, D:1, E:1), (B:1, C:1, E:1)}
- Count for E is 3: {E} is a frequent itemset
- Recursively apply FP-growth on P
[Figure: the conditional FP-tree for E.]
32 FP-growth
- Conditional tree for D within the conditional tree for E
- Conditional pattern base for D within the conditional base for E: P = {(A:1, C:1, D:1), (A:1, D:1)}
- Count for D is 2: {D, E} is a frequent itemset
- Recursively apply FP-growth on P
[Figure: the conditional FP-tree for D within E.]
33 FP-growth
- Conditional tree for C within D within E
- Conditional pattern base for C within D within E: P = {(A:1, C:1)}
- Count for C is 1: {C, D, E} is NOT a frequent itemset
[Figure: the conditional FP-tree for C within D within E.]
34 FP-growth
- Conditional tree for A within D within E
- Count for A is 2: {A, D, E} is a frequent itemset
- Next step: construct the conditional tree for C within the conditional tree for E
- Continue until exploring the conditional tree for A (which has only node A)
[Figure: the conditional FP-tree for A within D within E (a single node A:2).]
35 Benefits of the FP-tree Structure
36 Why Is FP-growth the Winner?
- Divide-and-conquer:
  - Decompose both the mining task and the DB according to the frequent patterns obtained so far
  - Leads to focused search of smaller databases
- Other factors:
  - No candidate generation, no candidate test
  - Compressed database: FP-tree structure
  - No repeated scan of the entire database
  - Basic ops: counting local frequent items and building sub FP-trees; no pattern search and matching
37 Mining Frequent Itemsets Using Vertical Data Format
- For each item, store a list of transaction ids (TIDs): the vertical data layout (TID-list)
38 Mining Frequent Itemsets Using Vertical Data Format
- Determine the support of any k-itemset by intersecting the TID-lists of two of its (k-1)-subsets.
- Advantage: very fast support counting
- Disadvantage: intermediate TID-lists may become too large for memory
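A small sketch of the vertical layout and of support counting by TID-list intersection, using the market-basket table from earlier slides:

```python
def to_vertical(transactions):
    """Vertical data layout: item -> set of TIDs containing it."""
    tidlists = {}
    for tid, t in enumerate(transactions, start=1):
        for item in t:
            tidlists.setdefault(item, set()).add(tid)
    return tidlists

T = [{"Bread", "Milk"}, {"Bread", "Coffee", "Eggs", "Sugar"}, {"Milk", "Coffee", "Coke", "Sugar"},
     {"Bread", "Coffee", "Milk", "Sugar"}, {"Bread", "Coke", "Milk", "Sugar"}]
v = to_vertical(T)

# Support of the 3-itemset {Bread, Milk, Sugar} by intersecting the TID-lists
# of two of its 2-subsets:
bread_milk = v["Bread"] & v["Milk"]       # TID-list of {Bread, Milk}
bread_sugar = v["Bread"] & v["Sugar"]     # TID-list of {Bread, Sugar}
print(bread_milk & bread_sugar)           # {4, 5} -> support count 2
```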
39 - Compact Representation of Frequent Itemsets
- Some itemsets are redundant because they have identical support to their supersets
- The number of frequent itemsets can be very large
- We need a compact representation
40 -- Maximal Frequent Itemset
- An itemset is maximal frequent if none of its immediate supersets is frequent
[Figure: the itemset lattice with the border between frequent and infrequent itemsets; the maximal itemsets lie just inside the border.]
41-- Closed Itemset
- An itemset is closed if none of its immediate
supersets has the same support as the itemset
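As a sketch of how the two notions relate, the helper below classifies each frequent itemset (given as an {itemset: support} dictionary, e.g. from the Apriori sketch earlier) as closed and/or maximal; the function name is illustrative. For frequent itemsets it suffices to compare against frequent immediate supersets, since a superset with the same support is necessarily frequent too.

```python
def classify_itemsets(frequent):
    """frequent: dict mapping frozenset itemsets to their support counts.
    Returns, for each itemset, whether it is closed and whether it is maximal."""
    labels = {}
    for s, sup in frequent.items():
        # frequent immediate supersets of s
        supersets = [x for x in frequent if s < x and len(x) == len(s) + 1]
        labels[s] = {
            "closed": all(frequent[x] != sup for x in supersets),
            "maximal": len(supersets) == 0,
        }
    return labels

freq = {frozenset("B"): 3, frozenset("E"): 3, frozenset({"B", "E"}): 3}
print(classify_itemsets(freq))   # {B} and {E} are not closed; {B, E} is closed and maximal
```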
42 -- Maximal vs Closed Itemsets
[Figure: the itemset lattice annotated with the transaction ids supporting each itemset; itemsets not supported by any transaction are marked.]
43 -- Maximal vs Closed Frequent Itemsets
[Figure: the lattice with minimum support = 2; itemsets that are closed but not maximal, and itemsets that are both closed and maximal, are marked. Closed = 9, Maximal = 4.]
44 -- Maximal vs Closed Itemsets
45 -- Mining Closed Frequent Itemsets
- A naive approach
  - Generate all possible frequent itemsets, then remove the non-closed itemsets
- A recommended methodology: search for frequent closed itemsets during mining
  - Itemset merging: if Y appears in every occurrence of X, then Y is merged with X
  - Sub-itemset pruning: if Y is a superset of X and sup(X) = sup(Y), then X and all of X's descendants in the set enumeration tree can be pruned
  - Efficient subset checking: use a compressed pattern tree, which is similar in structure to the FP-tree except that its branches store closed itemsets
  - Item skipping: if a local frequent item has the same support in several header tables at different levels, one can prune it from the header tables at higher levels (used in depth-first mining of closed itemsets, which we don't cover)
46 Mining Frequent Patterns, Association and Correlations
- Basic concepts
- Efficient and scalable frequent itemset mining methods
- Association Rule Mining
- Mining various kinds of association rules
- From association mining to correlation analysis
- Constraint-based association mining
- Summary
47 - Association Rule Mining
- Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction

Market-basket transactions
TID  Items
1    Bread, Milk
2    Bread, Coffee, Eggs, Sugar
3    Milk, Coffee, Coke, Sugar
4    Bread, Coffee, Milk, Sugar
5    Bread, Coke, Milk, Sugar

Examples of association rules:
{Sugar} -> {Coffee},  {Milk, Bread} -> {Eggs, Coke},  {Coffee, Bread} -> {Milk}
48 -- Definition: Association Rule
- Association Rule
  - An implication expression of the form X -> Y, where X and Y are itemsets
  - Example: {Milk, Sugar} -> {Coffee}
- Rule Evaluation Metrics
  - Support (s)
    - Fraction of transactions that contain both X and Y
  - Confidence (c)
    - Measures how often items in Y appear in transactions that contain X

TID  Items
1    Bread, Milk
2    Bread, Coffee, Eggs, Sugar
3    Milk, Coffee, Coke, Sugar
4    Bread, Coffee, Milk, Sugar
5    Bread, Coke, Milk, Sugar

{Milk, Sugar} -> {Coffee}
s = P(Milk, Sugar, Coffee) / |T| = 2/5 = 0.4
c = P(Milk, Sugar, Coffee) / P(Milk, Sugar) = 2/3 = 0.67
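These two metrics can be computed directly from the table; a minimal Python sketch (illustrative names only):

```python
T = [{"Bread", "Milk"}, {"Bread", "Coffee", "Eggs", "Sugar"}, {"Milk", "Coffee", "Coke", "Sugar"},
     {"Bread", "Coffee", "Milk", "Sugar"}, {"Bread", "Coke", "Milk", "Sugar"}]

def rule_metrics(X, Y, transactions):
    """Support and confidence of the rule X -> Y (X and Y are sets of items)."""
    n = len(transactions)
    both = sum(1 for t in transactions if X | Y <= t)   # transactions containing X and Y
    lhs = sum(1 for t in transactions if X <= t)        # transactions containing X
    return both / n, both / lhs

s, c = rule_metrics({"Milk", "Sugar"}, {"Coffee"}, T)
print(round(s, 2), round(c, 2))   # 0.4 0.67
```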
49-- Example
50 -- Computational Complexity
- Given d unique items:
  - Total number of itemsets = 2^d
  - Total number of possible association rules: R = 3^d - 2^(d+1) + 1
  - If d = 6, R = 602 rules
51 -- Association Rule Mining Task
- Given a set of transactions T, the goal of association rule mining is to find all rules having
  - support >= minsup threshold
  - confidence >= minconf threshold
- Brute-force approach
  - List all possible association rules
  - Compute the support and confidence for each rule
  - Prune rules that fail the minsup and minconf thresholds
  - => Computationally prohibitive!
52 -- Mining Association Rules

Examples of rules from the itemset {Milk, Sugar, Coffee}:
{Milk, Sugar} -> {Coffee}   (s = 0.4, c = 0.67)
{Milk, Coffee} -> {Sugar}   (s = 0.4, c = 1.0)
{Sugar, Coffee} -> {Milk}   (s = 0.4, c = 0.67)
{Coffee} -> {Milk, Sugar}   (s = 0.4, c = 0.67)
{Sugar} -> {Milk, Coffee}   (s = 0.4, c = 0.5)
{Milk} -> {Sugar, Coffee}   (s = 0.4, c = 0.5)

TID  Items
1    Bread, Milk
2    Bread, Coffee, Eggs, Sugar
3    Milk, Coffee, Coke, Sugar
4    Bread, Coffee, Milk, Sugar
5    Bread, Coke, Milk, Sugar

- Observations
  - All the above rules are binary partitions of the same itemset {Milk, Sugar, Coffee}
  - Rules originating from the same itemset have identical support but can have different confidence
  - Thus, we may decouple the support and confidence requirements
53 -- Mining Association Rules
- Two-step approach:
  - 1. Frequent Itemset Generation
    - Generate all itemsets whose support >= minsup
  - 2. Rule Generation
    - Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset (see the sketch below)
- Frequent itemset generation is still computationally expensive
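Step 2 can be sketched as follows: for one frequent itemset, enumerate every binary partition into antecedent and consequent and keep the rules whose confidence clears minconf (a hedged sketch; in practice the supports would come from the itemset-generation step rather than fresh database scans):

```python
from itertools import combinations

T = [{"Bread", "Milk"}, {"Bread", "Coffee", "Eggs", "Sugar"}, {"Milk", "Coffee", "Coke", "Sugar"},
     {"Bread", "Coffee", "Milk", "Sugar"}, {"Bread", "Coke", "Milk", "Sugar"}]

def rules_from_itemset(itemset, transactions, minconf):
    """All rules X -> (itemset - X) from one frequent itemset, filtered by confidence."""
    itemset = frozenset(itemset)
    n = len(transactions)
    sup_full = sum(1 for t in transactions if itemset <= t)
    rules = []
    for r in range(1, len(itemset)):
        for lhs in combinations(itemset, r):
            lhs = frozenset(lhs)
            sup_lhs = sum(1 for t in transactions if lhs <= t)
            conf = sup_full / sup_lhs
            if conf >= minconf:
                rules.append((set(lhs), set(itemset - lhs), sup_full / n, conf))
    return rules

for lhs, rhs, s, c in rules_from_itemset({"Milk", "Sugar", "Coffee"}, T, 0.6):
    print(lhs, "->", rhs, "s =", round(s, 2), "c =", round(c, 2))
```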
54 Mining Frequent Patterns, Association and Correlations
- Basic concepts
- Efficient and scalable frequent itemset mining methods
- Association Rule Mining
- Mining various kinds of association rules
- From association mining to correlation analysis
- Constraint-based association mining
- Summary
55 - Mining Various Kinds of Association Rules
- Mining multilevel associations
- Mining multidimensional associations
- Mining quantitative associations
56 -- Mining Multiple-Level Association Rules
- Items often form hierarchies
- Flexible support settings
  - Items at the lower level are expected to have lower support
57 --- Multi-level Association: Redundancy Filtering
- Some rules may be redundant due to ancestor relationships between items.
- Example
  - Laptop -> HP printer [support = 8%, confidence = 70%]
  - IBM laptop -> HP printer [support = 2%, confidence = 72%]
- We say the first rule is an ancestor of the second rule.
- A rule is redundant if its support is close to the expected value, based on the rule's ancestor.
58 -- Mining Multi-Dimensional Associations
- Single-dimensional rules:
  - buys(X, "milk") -> buys(X, "bread")
- Multi-dimensional rules: >= 2 dimensions or predicates
  - Inter-dimension assoc. rules (no repeated predicates)
    - age(X, "19-25") AND occupation(X, "student") -> buys(X, "coke")
  - Hybrid-dimension assoc. rules (repeated predicates)
    - age(X, "19-25") AND buys(X, "popcorn") -> buys(X, "coke")
- Categorical attributes: finite number of possible values, no ordering among values (data cube approach)
- Quantitative attributes: numeric, implicit ordering among values (discretization, clustering, and gradient approaches)
59 -- Mining Quantitative Associations
- Techniques can be categorized by how numerical attributes, such as age or salary, are treated:
  - Static discretization based on predefined concept hierarchies (data cube methods)
  - Dynamic discretization based on the data distribution
60 --- Static Discretization of Quantitative Attributes
- Discretized prior to mining using a concept hierarchy.
- Numeric values are replaced by ranges.
- In a relational database, finding all frequent k-predicate sets requires k or k+1 table scans.
- A data cube is well suited for mining.
- The cells of an n-dimensional cuboid correspond to the predicate sets.
- Mining from data cubes can be much faster.
61 -- Quantitative Association Rules
- Numeric attributes are dynamically discretized
  - Such that the confidence or compactness of the rules mined is maximized
- 2-D quantitative association rules: A_quan1 AND A_quan2 -> A_cat
- Cluster adjacent association rules to form general rules using a 2-D grid
- Example:
  age(X, "34-35") AND income(X, "30-50K") -> buys(X, "high resolution TV")
62 Mining Frequent Patterns, Association and Correlations
- Basic concepts
- Efficient and scalable frequent itemset mining methods
- Association Rule Mining
- Mining various kinds of association rules
- From association mining to correlation analysis
- Summary
63 - Finding Interesting Association Rules
- Depending on the minimum support and confidence values, the user may generate a large number of rules to analyze and assess
- How can we filter out the rules that are potentially the most interesting?
- Whether a rule is interesting (or not) can be evaluated either objectively or subjectively
- The ultimate subjective evaluation by users cannot be quantified or anticipated; it differs from user to user
- That is why objective interestingness measures, based on the statistical information present in D, were developed
64 -- Finding Interesting Association Rules
- The subjective evaluation of association rules often boils down to checking whether a given rule is unexpected (i.e., it surprises the user) and actionable (i.e., the user can do something useful based on the rule). Rules can be:
  - useful, when they provide high-quality, actionable information, e.g. Pepsi -> chips
  - trivial, when they are valid and supported by data, but useless since they confirm well-known facts, e.g. milk -> bread
  - inexplicable, when they concern valid and new facts but cannot be utilized, e.g. grocery_store -> milk_is_sold_as_often_as_bread
65 -- Finding Interesting Association Rules
- In most cases, the confidence and support values associated with each rule are used as the objective measures to select the most interesting rules
  - Rules with higher values of these measures than other rules are preferred
- Although this simple approach works in many cases, we will show that sometimes rules that have high confidence and support may be uninteresting and even misleading
66 -- Finding Interesting Association Rules
- Example
  - Assume that a transactional data set from a grocery store contains tea and coffee as the frequent items
  - 2,000 transactions were recorded, and among them:
    - in 1,200 transactions the customers bought tea
    - in 1,650 transactions the customers bought coffee
    - in 900 transactions the customers bought both

            tea    not tea   total
coffee      900    750       1650
not coffee  300    50        350
total       1200   800       2000
67 -- Finding Interesting Association Rules
- Example
  - Given a minimum support threshold of 40% and a minimum confidence threshold of 70%, the rule tea -> coffee [support = 45%, confidence = 75%] would be generated
  - On the other hand, due to its low support and confidence values, the rule tea -> not coffee [support = 15%, confidence = 25%] would not be generated
  - The latter rule is by far more accurate, while the first may be misleading
68 -- Finding Interesting Association Rules
- Example
  - Consider the tea -> coffee [support = 45%, confidence = 75%] rule
  - The probability of buying coffee is 82.5%, while the confidence of tea -> coffee is lower and equals 75%
  - Coffee and tea are negatively associated, i.e., buying one results in a decrease in buying the other
  - Obviously, using this rule would not be a wise decision

            tea    not tea   total
coffee      900    750       1650
not coffee  300    50        350
total       1200   800       2000
69 -- Finding Interesting Association Rules
- An alternative approach to evaluating the interestingness of association rules is to use measures based on correlation
- For a rule A -> B, the itemset A is independent of the occurrence of the itemset B if P(A and B) = P(A)P(B). Otherwise, itemsets A and B are dependent and correlated as events.
- The correlation measure (also referred to as lift or interest) between itemsets A and B is defined as:
  corr(A, B) = P(A and B) / (P(A) P(B))
70 -- Finding Interesting Association Rules
- Correlation (lift) measure
  - If the correlation value is less than 1, then the occurrence of A is negatively correlated with (inhibits) the occurrence of B
  - If the value is greater than 1, then A and B are positively correlated, which means that the occurrence of one implies (promotes) the occurrence of the other
  - If the correlation equals 1, then A and B are independent, i.e., there is no correlation between these itemsets
- The correlation value for the tea -> coffee rule equals
  0.45 / (0.6 * 0.825) = 0.45 / 0.495 = 0.91
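The correlation (lift) value can be computed either from a contingency table, as above, or directly from transactions; a small sketch:

```python
def lift(A, B, transactions):
    """Correlation / lift of A and B: P(A and B) / (P(A) * P(B))."""
    n = len(transactions)
    p_a = sum(1 for t in transactions if A <= t) / n
    p_b = sum(1 for t in transactions if B <= t) / n
    p_ab = sum(1 for t in transactions if A | B <= t) / n
    return p_ab / (p_a * p_b)

# The tea/coffee numbers from the contingency table above:
print(0.45 / (0.6 * 0.825))   # ~0.91 < 1, so tea and coffee are negatively correlated
```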
71 END