Title: Mining Frequent Patterns, Association, and Correlations
1 Mining Frequent Patterns, Association, and Correlations
2 The Course
[Course map diagram. Legend: DS = Data source, DW = Data warehouse, DM = Data Mining, DP = Staging Database. Chapters: OLAP (Ch4), Association (Ch5), Classification (Ch6), Clustering (Ch7).]
3 Motivation
- In this shopping basket the customer bought tomatoes, carrots, bananas, bread, eggs, milk, etc.
- How does demographic information affect what the customer buys?
- Is bread usually bought with milk?
- Does a specific milk brand make any difference?
- Is bread bought when both milk and eggs are bought together?
- Where should we place the tomatoes in the store to maximize their sales?
4 Mining Frequent Patterns, Association, and Correlations
- Basic concepts
- Efficient and scalable frequent itemset mining methods
- Association Rule Mining
- Mining various kinds of association rules
- From association mining to correlation analysis
- Constraint-based association mining
- Summary
5 Definition: Frequent Itemset
- Itemset
  - A collection of one or more items
  - Example: {Milk, Bread, Sugar}
- k-itemset
  - An itemset that contains k items
- Support count (P)
  - Frequency of occurrence of an itemset
  - E.g. P({Bread, Milk, Sugar}) = 2
- Support (s)
  - Fraction of transactions that contain an itemset
  - E.g. s({Bread, Milk, Sugar}) = 2/5
- Frequent Itemset
  - An itemset whose support is greater than or equal to a minsup threshold

TID  Items
1    Bread, Milk
2    Bread, Coffee, Eggs, Sugar
3    Milk, Coffee, Coke, Sugar
4    Bread, Coffee, Milk, Sugar
5    Bread, Coke, Milk, Sugar
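To make the definitions above concrete, here is a minimal Python sketch (function names are illustrative, not from the slides) that computes the support count and support of an itemset over the transaction table on this slide:

```python
# Transactions from the table above (one set of items per TID)
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Coffee", "Eggs", "Sugar"},
    {"Milk", "Coffee", "Coke", "Sugar"},
    {"Bread", "Coffee", "Milk", "Sugar"},
    {"Bread", "Coke", "Milk", "Sugar"},
]

def support_count(itemset, transactions):
    """Support count: number of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """Support: fraction of transactions containing `itemset`."""
    return support_count(itemset, transactions) / len(transactions)

print(support_count({"Bread", "Milk", "Sugar"}, transactions))  # 2
print(support({"Bread", "Milk", "Sugar"}, transactions))        # 0.4
```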
6 Mining Frequent Patterns, Association and Correlations
- Basic concepts
- Efficient and scalable frequent itemset mining methods
- Association Rule Mining
- Mining various kinds of association rules
- From association mining to correlation analysis
- Constraint-based association mining
- Summary
7 Frequent Itemset Generation
Given d items, there are 2^d possible candidate itemsets.
8 Frequent Itemset Generation
- There are a number of algorithms to generate frequent itemsets, some of which are:
  - Brute force
  - Apriori based
    - Simple
    - Hash
    - Partitioning
    - Sampling
  - FP-growth
  - Vertical data format
9 -- Brute-Force
- Each itemset in the lattice is a candidate frequent itemset
- Count the support of each candidate by scanning the database
- Match each transaction against every candidate
- Complexity O(NMw): expensive, since M = 2^d !!!

Transactions
TID  Items
1    Bread, Milk
2    Bread, Coffee, Eggs, Sugar
3    Milk, Coffee, Coke, Sugar
4    Bread, Coffee, Milk, Sugar
5    Bread, Coke, Milk, Sugar
(N = number of transactions, M = number of candidate itemsets, w = maximum transaction width)
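As a concrete (if impractical) illustration of the O(NMw) cost, here is a hedged Python sketch of the brute-force approach: it enumerates all 2^d - 1 candidate itemsets and scans every transaction for each one.

```python
from itertools import combinations

def brute_force_frequent(transactions, minsup):
    """Enumerate every candidate itemset (M = 2^d - 1 of them) and count each
    against every transaction -- the O(N*M*w) approach described above."""
    items = sorted(set().union(*transactions))               # the d distinct items
    frequent = {}
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):                   # all candidates of size k
            cand = frozenset(cand)
            count = sum(1 for t in transactions if cand <= t) # scan the N transactions
            if count / len(transactions) >= minsup:
                frequent[cand] = count
    return frequent
```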
10 Frequent Itemset Generation Strategies
- Reduce the number of candidates (M)
  - Complete search: M = 2^d
  - Use pruning techniques to reduce M
- Reduce the number of transactions (N)
  - Reduce the size of N as the size of the itemset increases
  - Used by vertical-based mining algorithms
- Reduce the number of comparisons (NM)
  - Use efficient data structures to store the candidates or transactions
  - No need to match every candidate against every transaction
11 - Reducing Number of Candidates
- Apriori principle
  - If an itemset is frequent, then all of its subsets must also be frequent
- The Apriori principle holds due to the following property of the support measure:
  - The support of an itemset never exceeds the support of its subsets
  - This is known as the anti-monotone property of support
12 Illustrating Apriori Principle
13 Illustrating Apriori Principle
Minimum support count = 3

1-itemsets
Itemset   Count
Bread     4
Coke      2
Milk      4
Coffee    3
Sugar     4
Eggs      1

2-itemsets
Itemset        Count
Bread,Milk     3
Bread,Coffee   2
Bread,Sugar    3
Milk,Coffee    2
Milk,Sugar     3
Coffee,Sugar   3

3-itemsets
Itemset            Count
Bread,Milk,Sugar   2

If every subset is considered: 6C1 + 6C2 + 6C3 = 6 + 15 + 20 = 41 candidates.
With support-based pruning: 6 + 6 + 1 = 13 candidates (no need to generate candidates involving Coke or Eggs).
14 ---- The Apriori Algorithm: An Example (Sup_min = 2)

Database TDB
TID  Items
10   A, C, D
20   B, C, E
30   A, B, C, E
40   B, E

1st scan -> C1
Itemset  sup
{A}      2
{B}      3
{C}      3
{D}      1
{E}      3

L1 (D is pruned, sup < 2)
Itemset  sup
{A}      2
{B}      3
{C}      3
{E}      3

C2 (generated from L1), counted in the 2nd scan
Itemset  sup
{A, B}   1
{A, C}   2
{A, E}   1
{B, C}   2
{B, E}   3
{C, E}   2

L2
Itemset  sup
{A, C}   2
{B, C}   2
{B, E}   3
{C, E}   2

C3 (generated from L2), counted in the 3rd scan
Itemset    sup
{B, C, E}  2

L3
Itemset    sup
{B, C, E}  2
15 - Apriori Algorithm
- Method
  - Let k = 1
  - Generate frequent itemsets of length 1
  - Repeat until no new frequent itemsets are identified:
    - Generate length (k+1) candidate itemsets from length-k frequent itemsets
    - Prune candidate itemsets containing subsets of length k that are infrequent
    - Count the support of each candidate by scanning the DB
    - Eliminate candidates that are infrequent, leaving only those that are frequent
16 ---- The Apriori Algorithm
- Pseudo-code:
  Ck: candidate itemsets of size k
  Lk: frequent itemsets of size k

  L1 = {frequent items};
  for (k = 1; Lk != empty; k++) do begin
      Ck+1 = candidates generated from Lk;
      for each transaction t in the database do
          increment the count of all candidates in Ck+1 that are contained in t;
      Lk+1 = candidates in Ck+1 with min_support;
  end
  return the union of all Lk;
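The pseudo-code above translates fairly directly into Python. The sketch below is a minimal, unoptimized version: it uses a simple pairwise join of Lk with itself (rather than the ordered self-join shown two slides later) plus subset pruning, and one database pass per level to count candidates.

```python
from itertools import combinations

def apriori(transactions, minsup_count):
    """Level-wise Apriori (a sketch of the pseudo-code above)."""
    transactions = [frozenset(t) for t in transactions]
    # L1: frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {s for s, c in counts.items() if c >= minsup_count}
    frequent = {s: c for s, c in counts.items() if c >= minsup_count}
    k = 1
    while Lk:
        # generate C(k+1): pairwise union of itemsets in Lk, keeping only candidates
        # of size k+1 whose k-subsets are all frequent (pruning)
        Ck1 = set()
        for a in Lk:
            for b in Lk:
                u = a | b
                if len(u) == k + 1 and all(frozenset(s) in Lk for s in combinations(u, k)):
                    Ck1.add(u)
        # one pass over the database to count every surviving candidate
        counts = {c: 0 for c in Ck1}
        for t in transactions:
            for c in Ck1:
                if c <= t:
                    counts[c] += 1
        Lk = {c for c, n in counts.items() if n >= minsup_count}
        frequent.update({c: counts[c] for c in Lk})
        k += 1
    return frequent

# The TDB from the example two slides back, with Sup_min = 2
tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(apriori(tdb, 2))   # includes frozenset({'B', 'C', 'E'}): 2
```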
17 ---- Important Details of Apriori
- How to generate candidates?
  - Step 1: self-joining Lk
  - Step 2: pruning
- How to count supports of candidates?
- Example of candidate generation
  - L3 = {abc, abd, acd, ace, bcd}
  - Self-joining: L3 * L3
    - abcd from abc and abd
    - acde from acd and ace
  - Pruning
    - acde is removed because ade is not in L3
  - C4 = {abcd}
18 ---- How to Generate Candidates?
- Suppose the items in Lk-1 are listed in an order
- Step 1: self-joining Lk-1
  insert into Ck
  select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
  from Lk-1 p, Lk-1 q
  where p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
- Step 2: pruning
  forall itemsets c in Ck do
      forall (k-1)-subsets s of c do
          if (s is not in Lk-1) then delete c from Ck
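The SQL-like self-join and prune steps can be sketched in Python as follows, with itemsets kept as sorted tuples; the example at the bottom reproduces the L3 case from the previous slide (the function name is illustrative):

```python
from itertools import combinations

def apriori_gen(Lk_minus_1):
    """Self-join L(k-1) on the first k-2 items, then prune any candidate that
    has an infrequent (k-1)-subset."""
    Lk_minus_1 = sorted(Lk_minus_1)
    prev = set(Lk_minus_1)
    Ck = []
    for p in Lk_minus_1:
        for q in Lk_minus_1:
            # join condition: identical prefix, and p's last item < q's last item
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                cand = p + (q[-1],)
                # prune: every (k-1)-subset of the candidate must be in L(k-1)
                if all(s in prev for s in combinations(cand, len(cand) - 1)):
                    Ck.append(cand)
    return Ck

# L3 = {abc, abd, acd, ace, bcd}
L3 = [("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"), ("a", "c", "e"), ("b", "c", "d")]
print(apriori_gen(L3))   # [('a', 'b', 'c', 'd')] -- acde is pruned because ade is not in L3
```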
19 Factors Affecting Complexity
- Choice of minimum support threshold
  - Lowering the support threshold results in more frequent itemsets; this may increase the number of candidates and the max length of frequent itemsets
- Dimensionality (number of items) of the data set
  - More space is needed to store the support count of each item
  - If the number of frequent items also increases, both computation and I/O costs may increase
- Size of database
  - Since Apriori makes multiple passes, the run time of the algorithm may increase with the number of transactions
- Average transaction width
  - Transaction width increases with denser data sets
  - This may increase the max length of frequent itemsets and traversals of the hash tree (the number of subsets in a transaction increases with its width)
20 -- Improving the Efficiency of Apriori
- Reduce the number of comparisons
- Reduce the number of transactions
- Partitioning the data to find candidate itemsets
- Sampling: mine on a subset of the given data
21 --- Reducing Number of Comparisons
- Candidate counting: scan the database of transactions to determine the support of each candidate itemset. To reduce the number of comparisons, store the candidates in a hash structure.
- Instead of matching each transaction against every candidate, match it against only the candidates contained in the hashed buckets.

h(x, y) = ((order of x) * 10 + (order of y)) mod 7

Hash table H2:
Bucket 0 (count 2): {A,D}, {C,E}
Bucket 1 (count 2): {A,E}, {A,E}
Bucket 2 (count 4): {B,C}, {B,C}, {B,C}, {B,C}
Bucket 3 (count 2): {B,D}, {B,D}
Bucket 4 (count 2): {B,E}, {B,E}
Bucket 5 (count 4): {A,B}, {A,B}, {A,B}, {A,B}
Bucket 6 (count 4): {A,C}, {A,C}, {A,C}, {A,C}

If min_sup = 3, the 2-itemsets in buckets 0, 1, 3, and 4 cannot be frequent, so they should not be included in C2.
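A hedged sketch of the hashing idea: every 2-itemset of every transaction is hashed into one of 7 buckets with the function on this slide (the item order A=1 ... E=5 is an assumption); buckets whose count falls below min_sup cannot contain a frequent 2-itemset, so their candidates can be dropped from C2. The exact bucket contents depend on which database is used.

```python
from itertools import combinations

# Assumed item order for the hash function on the slide: A=1, B=2, C=3, D=4, E=5
order = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5}

def h(x, y):
    """Bucket address for the 2-itemset {x, y}, with x before y in the item order."""
    return (order[x] * 10 + order[y]) % 7

def hash_2itemsets(transactions, num_buckets=7):
    """Hash every 2-itemset of every transaction into a bucket and count."""
    bucket_count = [0] * num_buckets
    bucket_content = [[] for _ in range(num_buckets)]
    for t in transactions:
        for x, y in combinations(sorted(t, key=order.get), 2):
            b = h(x, y)
            bucket_count[b] += 1
            bucket_content[b].append((x, y))
    return bucket_count, bucket_content

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
counts, contents = hash_2itemsets(tdb)
# Any 2-itemset that only hashes to buckets with count < min_sup can be pruned from C2
```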
22 --- Reduce the Number of Transactions
- A transaction that doesn't contain any frequent k-itemset cannot contain any frequent j-itemset for any j > k. Such a transaction can therefore be marked or removed from subsequent scans of the database for j-itemsets.
23 --- Partitioning the Data to Find Candidate Itemsets
- Requires just 2 database scans to mine the frequent itemsets, provided the size of each partition fits in the available memory.
- It consists of 2 phases (a sketch follows below):
  - Phase 1
    - The algorithm partitions the D transactions into n partitions.
    - For each partition, find the local frequent itemsets. A local frequent itemset has a support count >= min_sup * (the number of transactions in that partition).
    - For each itemset, special data structures record the TIDs of the transactions containing the itemset.
  - Phase 2
    - A second scan of D is conducted to find the actual support of each local frequent itemset.
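A minimal sketch of the two phases, assuming minsup is given as a fraction and local_miner is any in-memory frequent-itemset miner (for instance the Apriori sketch shown earlier); the names and the partitioning-by-slicing are illustrative only:

```python
def partition_mining(transactions, minsup, n_partitions, local_miner):
    """Two-phase partitioning: mine each partition locally, then verify the
    union of local frequent itemsets with one full scan of D."""
    size = len(transactions)
    step = -(-size // n_partitions)                     # ceiling division: partition size
    # Phase 1: mine each partition with a proportional local support count
    candidates = set()
    for i in range(0, size, step):
        part = transactions[i:i + step]
        local_count = int(minsup * len(part))           # minsup fraction of this partition
        candidates |= set(local_miner(part, max(local_count, 1)))
    # Phase 2: a second scan of D gives the actual support of every candidate
    result = {}
    for cand in candidates:
        count = sum(1 for t in transactions if cand <= t)
        if count >= minsup * size:
            result[cand] = count
    return result
```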
24 --- Partitioning the Data to Find Candidate Itemsets
- Partitioning of the data
- Data set partitioning generates frequent itemsets based on finding frequent itemsets in subsets (partitions) of D.
25 --- Sampling: Mine on a Subset of the Given Data
- Pick a random sample S of the given data D (make sure S fits in the available memory).
- Search for frequent itemsets in S instead of D. You can lower the support threshold to reduce the number of missed frequent itemsets.
- Find the set Ls of frequent itemsets in S.
- The rest of the database can be used to compute the actual frequencies of the itemsets in Ls.
- If Ls doesn't contain all the frequent itemsets in D, then a second pass is needed.
26 Bottleneck of Frequent-Pattern Mining
- Multiple database scans are costly
- Mining long patterns needs many passes of scanning and generates lots of candidates
  - To find the frequent itemset i1 i2 ... i100:
    - # of scans: 100
    - # of candidates: (100 choose 1) + (100 choose 2) + ... + (100 choose 100) = 2^100 - 1, about 1.27 * 10^30 !
- Bottleneck: candidate generation and test
- Can we avoid candidate generation?
  - Yes, if we use the FP-growth algorithm (see next slide)
27 FP-growth: Another Method for Frequent Itemset Generation
- Use a compressed representation of the database using an FP-tree
- Once an FP-tree has been constructed, a recursive divide-and-conquer approach is used to mine the frequent itemsets (a minimal construction sketch follows below).
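A minimal sketch of FP-tree construction (node and function names are illustrative; the header table is kept as simple lists of nodes rather than linked pointers): items are counted, infrequent items are dropped, and each transaction is inserted as a path of descending-frequency items, sharing prefixes with earlier transactions.

```python
from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 1
        self.children = {}

def build_fptree(transactions, minsup_count):
    """Build an FP-tree from a list of transactions (sets of items)."""
    # 1st scan: global item counts; drop infrequent items
    freq = defaultdict(int)
    for t in transactions:
        for item in t:
            freq[item] += 1
    freq = {i: c for i, c in freq.items() if c >= minsup_count}

    # 2nd scan: insert each transaction as a path ordered by descending count
    root = FPNode(None, None)
    header = defaultdict(list)                # item -> all tree nodes labelled with it
    for t in transactions:
        path = sorted((i for i in t if i in freq), key=lambda i: (-freq[i], i))
        node = root
        for item in path:
            if item in node.children:
                node.children[item].count += 1
            else:
                node.children[item] = FPNode(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
    return root, header
```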
28 FP-Tree Construction
[Figure: after reading TID=1 the tree is null -> A:1 -> B:1; after reading TID=2 a second branch null -> B:1 -> C:1 -> D:1 is added.]
29 FP-Tree Construction
[Figure: the complete FP-tree built from the transaction database, together with a header table whose pointers link all nodes carrying the same item.]
Pointers are used to assist frequent itemset generation.
30 FP-growth
- Build the conditional pattern base for E: P = {(A:1, C:1, D:1), (A:1, D:1), (B:1, C:1)}
- Recursively apply FP-growth on P
[Figure: the FP-tree with the prefix paths of E highlighted.]
31 FP-growth
- Conditional tree for E
- Conditional pattern base for E: P = {(A:1, C:1, D:1, E:1), (A:1, D:1, E:1), (B:1, C:1, E:1)}
- Count for E is 3: {E} is a frequent itemset
- Recursively apply FP-growth on P
[Figure: the conditional FP-tree for E.]
32 FP-growth
- Conditional tree for D within the conditional tree for E
- Conditional pattern base for D within the conditional base for E: P = {(A:1, C:1, D:1), (A:1, D:1)}
- Count for D is 2: {D, E} is a frequent itemset
- Recursively apply FP-growth on P
[Figure: the conditional FP-tree for D within E.]
33 FP-growth
- Conditional tree for C within D within E
- Conditional pattern base for C within D within E: P = {(A:1, C:1)}
- Count for C is 1: {C, D, E} is NOT a frequent itemset
[Figure: the conditional FP-tree for C within D within E.]
34 FP-growth
- Conditional tree for A within D within E
- Count for A is 2: {A, D, E} is a frequent itemset
- Next step: construct the conditional tree for C within the conditional tree for E
- Continue until exploring the conditional tree for A (which has only node A)
[Figure: the conditional FP-tree for A within D within E (a single node A:2).]
35 Benefits of the FP-tree Structure
36 Why Is FP-growth the Winner?
- Divide-and-conquer:
  - Decompose both the mining task and the DB according to the frequent patterns obtained so far
  - Leads to focused search of smaller databases
- Other factors:
  - No candidate generation, no candidate test
  - Compressed database: FP-tree structure
  - No repeated scan of the entire database
  - Basic ops: counting local frequent items and building sub FP-trees; no pattern search and matching
37 Mining Frequent Itemsets Using Vertical Data Format
- For each item, store a list of transaction ids (TIDs): the vertical data layout (TID-list)
38 Mining Frequent Itemsets Using Vertical Data Format
- Determine the support of any k-itemset by intersecting the TID-lists of two of its (k-1)-subsets.
- Advantage: very fast support counting
- Disadvantage: intermediate TID-lists may become too large for memory
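A small sketch of the vertical layout and of support counting by TID-list intersection, using the market-basket table from earlier slides:

```python
def to_vertical(transactions):
    """Vertical data layout: item -> set of TIDs containing it."""
    tidlists = {}
    for tid, t in enumerate(transactions, start=1):
        for item in t:
            tidlists.setdefault(item, set()).add(tid)
    return tidlists

T = [{"Bread", "Milk"}, {"Bread", "Coffee", "Eggs", "Sugar"}, {"Milk", "Coffee", "Coke", "Sugar"},
     {"Bread", "Coffee", "Milk", "Sugar"}, {"Bread", "Coke", "Milk", "Sugar"}]
v = to_vertical(T)

# Support of the 3-itemset {Bread, Milk, Sugar} by intersecting the TID-lists
# of two of its 2-subsets:
bread_milk = v["Bread"] & v["Milk"]       # TID-list of {Bread, Milk}
bread_sugar = v["Bread"] & v["Sugar"]     # TID-list of {Bread, Sugar}
print(bread_milk & bread_sugar)           # {4, 5} -> support count 2
```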
39 - Compact Representation of Frequent Itemsets
- Some itemsets are redundant because they have identical support to their supersets
- The number of frequent itemsets can be very large
- We need a compact representation
40 -- Maximal Frequent Itemset
- An itemset is maximal frequent if none of its immediate supersets is frequent
[Figure: the itemset lattice with the border between frequent and infrequent itemsets; the maximal itemsets lie just inside the border.]
41-- Closed Itemset
- An itemset is closed if none of its immediate
supersets has the same support as the itemset
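As a sketch of how the two notions relate, the helper below classifies each frequent itemset (given as an {itemset: support} dictionary, e.g. from the Apriori sketch earlier) as closed and/or maximal; the function name is illustrative. For frequent itemsets it suffices to compare against frequent immediate supersets, since a superset with the same support is necessarily frequent too.

```python
def classify_itemsets(frequent):
    """frequent: dict mapping frozenset itemsets to their support counts.
    Returns, for each itemset, whether it is closed and whether it is maximal."""
    labels = {}
    for s, sup in frequent.items():
        # frequent immediate supersets of s
        supersets = [x for x in frequent if s < x and len(x) == len(s) + 1]
        labels[s] = {
            "closed": all(frequent[x] != sup for x in supersets),
            "maximal": len(supersets) == 0,
        }
    return labels

freq = {frozenset("B"): 3, frozenset("E"): 3, frozenset({"B", "E"}): 3}
print(classify_itemsets(freq))   # {B} and {E} are not closed; {B, E} is closed and maximal
```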
42 -- Maximal vs Closed Itemsets
[Figure: the itemset lattice annotated with the transaction ids supporting each itemset; itemsets not supported by any transaction are marked.]
43 -- Maximal vs Closed Frequent Itemsets
[Figure: the lattice with minimum support = 2; itemsets that are closed but not maximal, and itemsets that are both closed and maximal, are marked. Closed = 9, Maximal = 4.]
44 -- Maximal vs Closed Itemsets
45 -- Mining Closed Frequent Itemsets
- A naive approach
  - Generate all possible frequent itemsets, then remove the non-closed itemsets
- A recommended methodology: search for frequent closed itemsets during mining
  - Itemset merging: if Y appears in every occurrence of X, then Y is merged with X
  - Sub-itemset pruning: if Y is a superset of X and sup(X) = sup(Y), then X and all of X's descendants in the set enumeration tree can be pruned
  - Efficient subset checking: use a compressed pattern tree, which is similar in structure to the FP-tree except that its branches store closed itemsets
  - Item skipping: if a local frequent item has the same support in several header tables at different levels, one can prune it from the header tables at higher levels (used in depth-first mining of closed itemsets, which we don't cover)
46 Mining Frequent Patterns, Association and Correlations
- Basic concepts
- Efficient and scalable frequent itemset mining methods
- Association Rule Mining
- Mining various kinds of association rules
- From association mining to correlation analysis
- Constraint-based association mining
- Summary
47 - Association Rule Mining
- Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction

Market-basket transactions
TID  Items
1    Bread, Milk
2    Bread, Coffee, Eggs, Sugar
3    Milk, Coffee, Coke, Sugar
4    Bread, Coffee, Milk, Sugar
5    Bread, Coke, Milk, Sugar

Examples of association rules:
{Sugar} -> {Coffee},  {Milk, Bread} -> {Eggs, Coke},  {Coffee, Bread} -> {Milk}
48 -- Definition: Association Rule
- Association Rule
  - An implication expression of the form X -> Y, where X and Y are itemsets
  - Example: {Milk, Sugar} -> {Coffee}
- Rule Evaluation Metrics
  - Support (s)
    - Fraction of transactions that contain both X and Y
  - Confidence (c)
    - Measures how often items in Y appear in transactions that contain X

TID  Items
1    Bread, Milk
2    Bread, Coffee, Eggs, Sugar
3    Milk, Coffee, Coke, Sugar
4    Bread, Coffee, Milk, Sugar
5    Bread, Coke, Milk, Sugar

{Milk, Sugar} -> {Coffee}
s = P(Milk, Sugar, Coffee) / |T| = 2/5 = 0.4
c = P(Milk, Sugar, Coffee) / P(Milk, Sugar) = 2/3 = 0.67
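These two metrics can be computed directly from the table; a minimal Python sketch (illustrative names only):

```python
T = [{"Bread", "Milk"}, {"Bread", "Coffee", "Eggs", "Sugar"}, {"Milk", "Coffee", "Coke", "Sugar"},
     {"Bread", "Coffee", "Milk", "Sugar"}, {"Bread", "Coke", "Milk", "Sugar"}]

def rule_metrics(X, Y, transactions):
    """Support and confidence of the rule X -> Y (X and Y are sets of items)."""
    n = len(transactions)
    both = sum(1 for t in transactions if X | Y <= t)   # transactions containing X and Y
    lhs = sum(1 for t in transactions if X <= t)        # transactions containing X
    return both / n, both / lhs

s, c = rule_metrics({"Milk", "Sugar"}, {"Coffee"}, T)
print(round(s, 2), round(c, 2))   # 0.4 0.67
```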
49-- Example
50 -- Computational Complexity
- Given d unique items:
  - Total number of itemsets = 2^d
  - Total number of possible association rules: R = 3^d - 2^(d+1) + 1
  - If d = 6, R = 602 rules
51 -- Association Rule Mining Task
- Given a set of transactions T, the goal of association rule mining is to find all rules having
  - support >= minsup threshold
  - confidence >= minconf threshold
- Brute-force approach
  - List all possible association rules
  - Compute the support and confidence for each rule
  - Prune rules that fail the minsup and minconf thresholds
  - => Computationally prohibitive!
52 -- Mining Association Rules

Examples of rules from the itemset {Milk, Sugar, Coffee}:
{Milk, Sugar} -> {Coffee}   (s = 0.4, c = 0.67)
{Milk, Coffee} -> {Sugar}   (s = 0.4, c = 1.0)
{Sugar, Coffee} -> {Milk}   (s = 0.4, c = 0.67)
{Coffee} -> {Milk, Sugar}   (s = 0.4, c = 0.67)
{Sugar} -> {Milk, Coffee}   (s = 0.4, c = 0.5)
{Milk} -> {Sugar, Coffee}   (s = 0.4, c = 0.5)

TID  Items
1    Bread, Milk
2    Bread, Coffee, Eggs, Sugar
3    Milk, Coffee, Coke, Sugar
4    Bread, Coffee, Milk, Sugar
5    Bread, Coke, Milk, Sugar

- Observations
  - All the above rules are binary partitions of the same itemset {Milk, Sugar, Coffee}
  - Rules originating from the same itemset have identical support but can have different confidence
  - Thus, we may decouple the support and confidence requirements
53 -- Mining Association Rules
- Two-step approach:
  - 1. Frequent Itemset Generation
    - Generate all itemsets whose support >= minsup
  - 2. Rule Generation
    - Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset (see the sketch below)
- Frequent itemset generation is still computationally expensive
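Step 2 can be sketched as follows: for one frequent itemset, enumerate every binary partition into antecedent and consequent and keep the rules whose confidence clears minconf (a hedged sketch; in practice the supports would come from the itemset-generation step rather than fresh database scans):

```python
from itertools import combinations

T = [{"Bread", "Milk"}, {"Bread", "Coffee", "Eggs", "Sugar"}, {"Milk", "Coffee", "Coke", "Sugar"},
     {"Bread", "Coffee", "Milk", "Sugar"}, {"Bread", "Coke", "Milk", "Sugar"}]

def rules_from_itemset(itemset, transactions, minconf):
    """All rules X -> (itemset - X) from one frequent itemset, filtered by confidence."""
    itemset = frozenset(itemset)
    n = len(transactions)
    sup_full = sum(1 for t in transactions if itemset <= t)
    rules = []
    for r in range(1, len(itemset)):
        for lhs in combinations(itemset, r):
            lhs = frozenset(lhs)
            sup_lhs = sum(1 for t in transactions if lhs <= t)
            conf = sup_full / sup_lhs
            if conf >= minconf:
                rules.append((set(lhs), set(itemset - lhs), sup_full / n, conf))
    return rules

for lhs, rhs, s, c in rules_from_itemset({"Milk", "Sugar", "Coffee"}, T, 0.6):
    print(lhs, "->", rhs, "s =", round(s, 2), "c =", round(c, 2))
```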
54 Mining Frequent Patterns, Association and Correlations
- Basic concepts
- Efficient and scalable frequent itemset mining methods
- Association Rule Mining
- Mining various kinds of association rules
- From association mining to correlation analysis
- Constraint-based association mining
- Summary
55 - Mining Various Kinds of Association Rules
- Mining multilevel associations
- Mining multidimensional associations
- Mining quantitative associations
56 -- Mining Multiple-Level Association Rules
- Items often form hierarchies
- Flexible support settings
  - Items at the lower level are expected to have lower support
57 --- Multi-level Association: Redundancy Filtering
- Some rules may be redundant due to ancestor relationships between items.
- Example
  - Laptop -> HP printer [support = 8%, confidence = 70%]
  - IBM laptop -> HP printer [support = 2%, confidence = 72%]
- We say the first rule is an ancestor of the second rule.
- A rule is redundant if its support is close to the expected value, based on the rule's ancestor.
58 -- Mining Multi-Dimensional Associations
- Single-dimensional rules:
  - buys(X, "milk") -> buys(X, "bread")
- Multi-dimensional rules: >= 2 dimensions or predicates
  - Inter-dimension assoc. rules (no repeated predicates)
    - age(X, "19-25") AND occupation(X, "student") -> buys(X, "coke")
  - Hybrid-dimension assoc. rules (repeated predicates)
    - age(X, "19-25") AND buys(X, "popcorn") -> buys(X, "coke")
- Categorical attributes: finite number of possible values, no ordering among values (data cube approach)
- Quantitative attributes: numeric, implicit ordering among values (discretization, clustering, and gradient approaches)
59 -- Mining Quantitative Associations
- Techniques can be categorized by how numerical attributes, such as age or salary, are treated:
  - Static discretization based on predefined concept hierarchies (data cube methods)
  - Dynamic discretization based on the data distribution
60 --- Static Discretization of Quantitative Attributes
- Discretized prior to mining using a concept hierarchy.
- Numeric values are replaced by ranges.
- In a relational database, finding all frequent k-predicate sets requires k or k+1 table scans.
- A data cube is well suited for mining.
- The cells of an n-dimensional cuboid correspond to the predicate sets.
- Mining from data cubes can be much faster.
61 -- Quantitative Association Rules
- Numeric attributes are dynamically discretized
  - Such that the confidence or compactness of the rules mined is maximized
- 2-D quantitative association rules: A_quan1 AND A_quan2 -> A_cat
- Cluster adjacent association rules to form general rules using a 2-D grid
- Example:
  age(X, "34-35") AND income(X, "30-50K") -> buys(X, "high resolution TV")
62 Mining Frequent Patterns, Association and Correlations
- Basic concepts
- Efficient and scalable frequent itemset mining methods
- Association Rule Mining
- Mining various kinds of association rules
- From association mining to correlation analysis
- Summary
63 - Finding Interesting Association Rules
- Depending on the minimum support and confidence values, the user may generate a large number of rules to analyze and assess
- How can we filter out the rules that are potentially the most interesting?
- Whether a rule is interesting (or not) can be evaluated either objectively or subjectively
- The ultimate subjective evaluation by users cannot be quantified or anticipated; it differs from user to user
- That is why objective interestingness measures, based on the statistical information present in D, were developed
64 -- Finding Interesting Association Rules
- The subjective evaluation of association rules often boils down to checking whether a given rule is unexpected (i.e., it surprises the user) and actionable (i.e., the user can do something useful based on the rule). Rules can be:
  - useful, when they provide high-quality, actionable information, e.g. Pepsi -> chips
  - trivial, when they are valid and supported by data, but useless since they confirm well-known facts, e.g. milk -> bread
  - inexplicable, when they concern valid and new facts but cannot be utilized, e.g. grocery_store -> milk_is_sold_as_often_as_bread
65 -- Finding Interesting Association Rules
- In most cases, the confidence and support values associated with each rule are used as the objective measures to select the most interesting rules
  - Rules with higher values of these measures than other rules are preferred
- Although this simple approach works in many cases, we will show that sometimes rules that have high confidence and support may be uninteresting and even misleading
66 -- Finding Interesting Association Rules
- Example
  - Assume that a transactional data set from a grocery store contains tea and coffee as the frequent items
  - 2,000 transactions were recorded, and among them:
    - in 1,200 transactions the customers bought tea
    - in 1,650 transactions the customers bought coffee
    - in 900 transactions the customers bought both

            tea    not tea   total
coffee      900    750       1650
not coffee  300    50        350
total       1200   800       2000
67 -- Finding Interesting Association Rules
- Example
  - Given a minimum support threshold of 40% and a minimum confidence threshold of 70%, the rule tea -> coffee [support = 45%, confidence = 75%] would be generated
  - On the other hand, due to its low support and confidence values, the rule tea -> not coffee [support = 15%, confidence = 25%] would not be generated
  - The latter rule is by far more accurate, while the first may be misleading
68 -- Finding Interesting Association Rules
- Example
  - Consider the tea -> coffee [support = 45%, confidence = 75%] rule
  - The probability of buying coffee is 82.5%, while the confidence of tea -> coffee is lower and equals 75%
  - Coffee and tea are negatively associated, i.e., buying one results in a decrease in buying the other
  - Obviously, using this rule would not be a wise decision

            tea    not tea   total
coffee      900    750       1650
not coffee  300    50        350
total       1200   800       2000
69 -- Finding Interesting Association Rules
- An alternative approach to evaluating the interestingness of association rules is to use measures based on correlation
- For a rule A -> B, the itemset A is independent of the occurrence of the itemset B if P(A and B) = P(A)P(B). Otherwise, itemsets A and B are dependent and correlated as events.
- The correlation measure (also referred to as lift or interest) between itemsets A and B is defined as:
  corr(A, B) = P(A and B) / (P(A) P(B))
70 -- Finding Interesting Association Rules
- Correlation (lift) measure
  - If the correlation value is less than 1, then the occurrence of A is negatively correlated with (inhibits) the occurrence of B
  - If the value is greater than 1, then A and B are positively correlated, which means that the occurrence of one implies (promotes) the occurrence of the other
  - If the correlation equals 1, then A and B are independent, i.e., there is no correlation between these itemsets
- The correlation value for the tea -> coffee rule equals
  0.45 / (0.6 * 0.825) = 0.45 / 0.495 = 0.91
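The correlation (lift) value can be computed either from a contingency table, as above, or directly from transactions; a small sketch:

```python
def lift(A, B, transactions):
    """Correlation / lift of A and B: P(A and B) / (P(A) * P(B))."""
    n = len(transactions)
    p_a = sum(1 for t in transactions if A <= t) / n
    p_b = sum(1 for t in transactions if B <= t) / n
    p_ab = sum(1 for t in transactions if A | B <= t) / n
    return p_ab / (p_a * p_b)

# The tea/coffee numbers from the contingency table above:
print(0.45 / (0.6 * 0.825))   # ~0.91 < 1, so tea and coffee are negatively correlated
```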
71 END