Title: Mining Association Rules in Large Databases
1. Mining Association Rules in Large Databases
- Association rule mining
- Mining single-dimensional Boolean association rules from transactional databases
- Mining multilevel association rules from transactional databases
- Mining multidimensional association rules from transactional databases and data warehouses
- From association mining to correlation analysis
- Constraint-based association mining
- Summary
2. Rule Measures: Support and Confidence
(Venn-diagram figure lost in conversion: customers buying diapers, customers buying beer, and customers buying both)
- Find all the rules X ∧ Y ⇒ Z with minimum support and confidence
- support, s: probability that a transaction contains {X, Y, Z}
- confidence, c: conditional probability that a transaction having {X, Y} also contains Z
- Let minimum support = 50% and minimum confidence = 50%; we have
- A ⇒ C (50%, 66.6%)
- C ⇒ A (50%, 100%)
3. Association Rule Mining
- Given a set of transactions, find rules that will
predict the occurrence of an item based on the
occurrences of other items in the transaction
Market-basket transactions (table lost in conversion)
Example of Association Rules:
{Diaper} ⇒ {Beer}
{Milk, Bread} ⇒ {Eggs, Coke}
{Beer, Bread} ⇒ {Milk}
Implication means co-occurrence, not causality!
4. Definition: Frequent Itemset
- Itemset
- A collection of one or more items
- Example: {Milk, Bread, Diaper}
- k-itemset
- An itemset that contains k items
- Support count (σ)
- Frequency of occurrence of an itemset
- E.g. σ({Milk, Bread, Diaper}) = 2
- Support
- Fraction of transactions that contain an itemset
- E.g. s({Milk, Bread, Diaper}) = 2/5
- Frequent Itemset
- An itemset whose support is greater than or equal to a minsup threshold (see the sketch below)
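A minimal sketch of these definitions in Python. The five transactions are illustrative, chosen to reproduce the counts on this slide (the slide's transaction table itself was lost in conversion), and the function names are my own:

```python
# Illustrative market-basket transactions consistent with the counts on this slide.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """sigma(X): number of transactions containing every item of X."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """s(X): fraction of transactions containing X."""
    return support_count(itemset, transactions) / len(transactions)

print(support_count({"Milk", "Bread", "Diaper"}, transactions))  # 2
print(support({"Milk", "Bread", "Diaper"}, transactions))        # 0.4  (= 2/5)
```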
5. Definition: Association Rule
- Association Rule
- An implication expression of the form X ⇒ Y, where X and Y are itemsets
- Example: {Milk, Diaper} ⇒ {Beer}
- Rule Evaluation Metrics
- Support (s)
- Fraction of transactions that contain both X and Y
- Confidence (c)
- Measures how often items in Y appear in transactions that contain X (see the worked example below)
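Written out (a standard formulation, not reproduced from a lost slide graphic), with the numbers worked from the illustrative transactions above:

```latex
s(X \Rightarrow Y) = \frac{\sigma(X \cup Y)}{|T|}, \qquad
c(X \Rightarrow Y) = \frac{\sigma(X \cup Y)}{\sigma(X)}
```

For {Milk, Diaper} ⇒ {Beer}: s = σ({Milk, Diaper, Beer}) / |T| = 2/5 = 0.4 and c = σ({Milk, Diaper, Beer}) / σ({Milk, Diaper}) = 2/3 ≈ 0.67, matching the figures quoted two slides below.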
6. Association Rule Mining Task
- Given a set of transactions T, the goal of association rule mining is to find all rules having
- support ≥ minsup threshold
- confidence ≥ minconf threshold
- Brute-force approach
- List all possible association rules
- Compute the support and confidence for each rule
- Prune rules that fail the minsup and minconf thresholds
- ⇒ Computationally prohibitive!
7. Mining Association Rules
Example of Rules:
{Milk, Diaper} ⇒ {Beer} (s = 0.4, c = 0.67)
{Milk, Beer} ⇒ {Diaper} (s = 0.4, c = 1.0)
{Diaper, Beer} ⇒ {Milk} (s = 0.4, c = 0.67)
{Beer} ⇒ {Milk, Diaper} (s = 0.4, c = 0.67)
{Diaper} ⇒ {Milk, Beer} (s = 0.4, c = 0.5)
{Milk} ⇒ {Diaper, Beer} (s = 0.4, c = 0.5)
- Observations
- All the above rules are binary partitions of the same itemset {Milk, Diaper, Beer}
- Rules originating from the same itemset have identical support but can have different confidence
- Thus, we may decouple the support and confidence requirements
8. Mining Association Rules: An Example
Min. support 50%, min. confidence 50%
- For rule A ⇒ C:
- support = support({A, C}) = 50%
- confidence = support({A, C}) / support({A}) = 66.6%
- The Apriori principle:
- Any subset of a frequent itemset must be frequent
9. Mining Frequent Itemsets: the Key Step
- Find the frequent itemsets: the sets of items that have minimum support
- A subset of a frequent itemset must also be a frequent itemset
- i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets
- Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets)
- Use the frequent itemsets to generate association rules
10. Mining Association Rules
- Two-step approach:
- Frequent Itemset Generation
- Generate all itemsets whose support ≥ minsup
- Rule Generation
- Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset
- Frequent itemset generation is still computationally expensive
11. Frequent Itemset Generation
Given d items, there are 2^d possible candidate itemsets
12. Frequent Itemset Generation
- Brute-force approach:
- Each itemset in the lattice is a candidate frequent itemset
- Count the support of each candidate by scanning the database
- Match each transaction against every candidate
- Complexity O(NMw) ⇒ expensive, since M = 2^d !!!
13. Computational Complexity
- Given d unique items:
- Total number of itemsets = 2^d
- Total number of possible association rules: R = 3^d - 2^(d+1) + 1
- If d = 6, R = 602 rules
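A quick check of that closed form for the d = 6 case quoted on the slide (a hedged one-liner, not part of the original deck):

```python
d = 6
R = 3**d - 2**(d + 1) + 1   # number of possible association rules over d items
print(R)                    # 602
```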
14. Frequent Itemset Generation Strategies
- Reduce the number of candidates (M)
- Complete search: M = 2^d
- Use pruning techniques to reduce M
- Reduce the number of transactions (N)
- Reduce the size of N as the size of the itemset increases
- Used by DHP and vertical-based mining algorithms
- Reduce the number of comparisons (NM)
- Use efficient data structures to store the candidates or transactions
- No need to match every candidate against every transaction
15. Reducing the Number of Candidates
- Apriori principle:
- If an itemset is frequent, then all of its subsets must also be frequent
- The Apriori principle holds due to the following property of the support measure:
- The support of an itemset never exceeds the support of its subsets
- This is known as the anti-monotone property of support (stated formally below)
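The property in symbols (a standard statement, added here since the slide's formula graphic did not survive):

```latex
\forall X, Y : \; (X \subseteq Y) \Rightarrow s(X) \geq s(Y)
```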
16. Illustrating the Apriori Principle
17. The Apriori Algorithm
- Join step: Ck is generated by joining Lk-1 with itself
- Prune step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset
- Pseudo-code (a runnable sketch follows this slide):
- Ck: candidate itemsets of size k
- Lk: frequent itemsets of size k
- L1 = {frequent items}
- for (k = 1; Lk != ∅; k++) do begin
-   Ck+1 = candidates generated from Lk
-   for each transaction t in the database do
-     increment the count of all candidates in Ck+1 that are contained in t
-   Lk+1 = candidates in Ck+1 with min_support
- end
- return ∪k Lk
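A compact Python sketch of the same level-wise loop, hedged: candidate generation here is the naive join of frequent (k-1)-itemsets sharing a (k-2)-prefix followed by subset pruning, and all function and variable names are my own rather than the deck's:

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return all frequent itemsets (as frozensets) with support count >= minsup."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})
    # L1: frequent 1-itemsets
    current = {frozenset([i]) for i in items
               if sum(i in t for t in transactions) >= minsup}
    frequent = set(current)
    k = 2
    while current:
        # Join step: merge itemsets into k-itemsets, then prune by the Apriori principle
        candidates = set()
        for a in current:
            for b in current:
                union = a | b
                if len(union) == k and all(frozenset(s) in current
                                           for s in combinations(union, k - 1)):
                    candidates.add(union)
        # Count step: one scan of the database per level
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        current = {c for c, n in counts.items() if n >= minsup}
        frequent |= current
        k += 1
    return frequent

# Usage with the illustrative transactions from earlier:
# print(apriori(transactions, minsup=3))
```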
18. The Apriori Algorithm: Example
(Figure lost in conversion: a walk-through on database D, scanning D to build C1 and L1, joining to C2, scanning D for L2, then C3, a final scan of D, and L3.)
19. Generate Hash Tree
- Suppose you have 15 candidate itemsets of length 3:
- {1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}
- You need:
- A hash function
- A max leaf size: the maximum number of itemsets stored in a leaf node (if the number of candidate itemsets exceeds the max leaf size, split the node)
20. Association Rule Discovery: Hash Tree
(Candidate hash tree figure lost in conversion)
- Hash function: items 1, 4, 7 go to the first branch; 2, 5, 8 to the second; 3, 6, 9 to the third
- At the root, hash on item 1, 4 or 7 (see the sketch below)
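A minimal sketch of that hash function, assuming the items are the integers 1-9 used in the example; only the 1,4,7 / 2,5,8 / 3,6,9 bucket layout is taken from the slide, the rest is illustrative:

```python
def hash_branch(item):
    """Map an item to a child branch: {1,4,7} -> 0, {2,5,8} -> 1, {3,6,9} -> 2."""
    return (item - 1) % 3

# Routing candidate {1, 4, 5}: level 1 hashes on 1 -> branch 0,
# level 2 hashes on 4 -> branch 0, level 3 hashes on 5 -> branch 1.
print([hash_branch(i) for i in (1, 4, 5)])  # [0, 0, 1]
```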
21. How to Generate Candidates?
- Suppose the items in Lk-1 are listed in an order
- Step 1: self-joining Lk-1
- insert into Ck
- select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
- from Lk-1 p, Lk-1 q
- where p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
- Step 2: pruning
- forall itemsets c in Ck do
- forall (k-1)-subsets s of c do
- if (s is not in Lk-1) then delete c from Ck
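The same join-then-prune step as a hedged Python sketch; itemsets are kept as sorted tuples so the shared prefix is explicit, and the names are mine rather than the deck's:

```python
from itertools import combinations

def apriori_gen(L_prev, k):
    """Generate candidate k-itemsets from frequent (k-1)-itemsets L_prev (sorted tuples)."""
    candidates = set()
    for p in L_prev:
        for q in L_prev:
            # Join step: same (k-2)-prefix, and p's last item precedes q's last item
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                c = p + (q[-1],)
                # Prune step: every (k-1)-subset of c must itself be frequent
                if all(s in L_prev for s in combinations(c, k - 1)):
                    candidates.add(c)
    return candidates

# Example from the "Generating Candidates" slide below:
L3 = {("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"), ("a", "c", "e"), ("b", "c", "d")}
print(apriori_gen(L3, 4))  # {('a', 'b', 'c', 'd')}; acde is pruned because ade is not in L3
```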
22. How to Count Supports of Candidates?
- Why is counting supports of candidates a problem?
- The total number of candidates can be very large
- One transaction may contain many candidates
- Method:
- Candidate itemsets are stored in a hash tree
- A leaf node of the hash tree contains a list of itemsets and counts
- An interior node contains a hash table
- The subset function finds all the candidates contained in a transaction
23. Example of Generating Candidates
- L3 = {abc, abd, acd, ace, bcd}
- Self-joining: L3 * L3
- abcd from abc and abd
- acde from acd and ace
- Pruning:
- acde is removed because ade is not in L3
- C4 = {abcd}
24. Methods to Improve Apriori's Efficiency
- Hash-based itemset counting: a k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent
- Transaction reduction: a transaction that does not contain any frequent k-itemset is useless in subsequent scans
- Partitioning: any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
- Sampling: mine on a subset of the given data with a lowered support threshold, plus a method to determine the completeness
- Dynamic itemset counting: add new candidate itemsets only when all of their subsets are estimated to be frequent
25. Compact Representation of Frequent Itemsets
- Some itemsets are redundant because they have identical support to their supersets
- The number of frequent itemsets can be very large (the counting illustration on this slide was lost in conversion)
- We need a compact representation
26. Maximal Frequent Itemset
- An itemset is maximal frequent if none of its immediate supersets is frequent
(Figure lost in conversion: itemset lattice showing the border between frequent and infrequent itemsets, with the maximal itemsets just inside the border)
27. Closed Itemset
- An itemset is closed if none of its immediate supersets has the same support as the itemset (see the sketch below)
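A small, hedged sketch of how the two definitions differ in code, given a table of itemset supports; the helper name and the expected input shape are my own, not from the slides:

```python
def classify(supports, minsup):
    """Split frequent itemsets into 'closed' and 'maximal' per the definitions above.

    supports: dict mapping frozenset -> support count; it must include the
    immediate supersets of every frequent itemset being classified.
    """
    frequent = {s for s, n in supports.items() if n >= minsup}
    closed, maximal = set(), set()
    for itemset in frequent:
        supersets = [t for t in supports if len(t) == len(itemset) + 1 and itemset < t]
        if not any(supports[t] == supports[itemset] for t in supersets):
            closed.add(itemset)        # no immediate superset has the same support
        if not any(t in frequent for t in supersets):
            maximal.add(itemset)       # no immediate superset is frequent
    return closed, maximal
```

Every maximal frequent itemset is also closed (a frequent superset with equal support would itself be frequent), which is why the maximal count on the later slide is smaller than the closed count.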
28. Maximal vs Closed Itemsets
(Figure lost in conversion: an itemset lattice annotated with the transaction IDs supporting each node; itemsets not supported by any transaction are marked)
29. Maximal vs Closed Frequent Itemsets
(Lattice figure lost in conversion: with minimum support = 2, nodes are annotated "closed but not maximal" and "closed and maximal")
- Closed = 9, Maximal = 4
30. Maximal vs Closed Itemsets
31. Is Apriori Fast Enough? Performance Bottlenecks
- The core of the Apriori algorithm:
- Use frequent (k-1)-itemsets to generate candidate frequent k-itemsets
- Use database scans and pattern matching to collect counts for the candidate itemsets
- The bottleneck of Apriori: candidate generation
- Huge candidate sets:
- 10^4 frequent 1-itemsets will generate more than 10^7 candidate 2-itemsets
- To discover a frequent pattern of size 100, e.g., {a1, a2, ..., a100}, one needs to generate 2^100 ≈ 10^30 candidates
- Multiple scans of the database:
- Needs (n + 1) scans, where n is the length of the longest pattern
32. Alternative Methods for Frequent Itemset Generation
- Representation of the database:
- horizontal vs. vertical data layout
33. FP-growth Algorithm
- Use a compressed representation of the database, the FP-tree
- Once an FP-tree has been constructed, use a recursive divide-and-conquer approach to mine the frequent itemsets (see the sketch below)
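A hedged sketch of the FP-tree structure and its insertion step. The node fields and function names are my own; the frequency-descending item ordering and header-table node-links follow the description on the later slides:

```python
from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item = item          # item label, or None for the root
        self.count = 1            # number of transactions sharing this prefix path
        self.parent = parent
        self.children = {}        # item -> FPNode

def build_fp_tree(transactions, minsup):
    """Build an FP-tree; returns (root, header) where header maps item -> list of its nodes."""
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[item] += 1
    frequent = {i for i, n in counts.items() if n >= minsup}
    root, header = FPNode(None, None), defaultdict(list)
    for t in transactions:
        # Keep only frequent items, in frequency-descending order
        items = sorted((i for i in t if i in frequent), key=lambda i: -counts[i])
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)   # node-link used later for mining
            else:
                child.count += 1
            node = child
    return root, header
```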
34. FP-tree Construction
(Figure lost in conversion: after reading TID 1 the tree is null -> A:1 -> B:1; after reading TID 2 a second branch B:1 -> C:1 -> D:1 hangs off the null root)
35. FP-Tree Construction
(Figure lost in conversion: the complete FP-tree for the transaction database, with a null root, node counts such as A:7, B:5, B:3, C:3, C:1, D:1 and E:1, and a header table linking all nodes that carry the same item)
Pointers are used to assist frequent itemset generation
36. FP-growth
- Conditional pattern base for D: P = {(A:1, B:1, C:1), (A:1, B:1), (A:1, C:1), (A:1), (B:1, C:1)}
- Recursively apply FP-growth on P
- Frequent itemsets found (with sup > 1): AD, BD, CD, ACD, BCD
(Accompanying FP-tree figure lost in conversion)
37. Tree Projection
- Set enumeration tree
- Possible extensions: E(A) = {B, C, D, E}
- Possible extensions: E(ABC) = {D, E}
38. Tree Projection
- Items are listed in lexicographic order
- Each node P stores the following information:
- Itemset for node P
- List of possible lexicographic extensions of P: E(P)
- Pointer to the projected database of its ancestor node
- Bit vector recording which transactions in the projected database contain the itemset
39. Projected Database
- Projected database for node A (original database table lost in conversion)
- For each transaction T, the projected transaction at node A is T ∩ E(A) (see the sketch below)
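A tiny hedged illustration of that projection; E(A) is taken from the set-enumeration-tree slide above, while the sample transaction is made up:

```python
def project(transaction, extensions):
    """Projected transaction at a node: keep only the items in the node's extension set."""
    return set(transaction) & set(extensions)

E_A = {"B", "C", "D", "E"}                 # E(A) from the set enumeration tree
print(project({"A", "B", "C", "E"}, E_A))  # {'B', 'C', 'E'}
```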
40. Benefits of the FP-tree Structure
- Completeness:
- never breaks a long pattern of any transaction
- preserves complete information for frequent pattern mining
- Compactness:
- reduces irrelevant information: infrequent items are gone
- frequency-descending ordering: more frequent items are more likely to be shared
- never larger than the original database (not counting node-links and counts)
- Example: for the Connect-4 DB, the compression ratio can be over 100
41. Mining Frequent Patterns Using the FP-tree
- General idea (divide-and-conquer):
- Recursively grow frequent pattern paths using the FP-tree
- Method:
- For each item, construct its conditional pattern base, and then its conditional FP-tree
- Repeat the process on each newly created conditional FP-tree
- Until the resulting FP-tree is empty, or it contains only one path (a single path generates all the combinations of its sub-paths, each of which is a frequent pattern)
42. Major Steps to Mine an FP-tree
- Construct the conditional pattern base for each node in the FP-tree
- Construct the conditional FP-tree from each conditional pattern base
- Recursively mine the conditional FP-trees and grow the frequent patterns obtained so far
- If a conditional FP-tree contains a single path, simply enumerate all the patterns
43. Step 1: From FP-tree to Conditional Pattern Base
- Starting from the frequent-item header table of the FP-tree
- Traverse the FP-tree by following the link of each frequent item
- Accumulate all of the transformed prefix paths of that item to form its conditional pattern base
Conditional pattern bases:
item | conditional pattern base
  c  | f:3
  a  | fc:3
  b  | fca:1, f:1, c:1
  m  | fca:2, fcab:1
  p  | fcam:2, cb:1
44. Properties of the FP-tree for Conditional Pattern Base Construction
- Node-link property:
- For any frequent item ai, all the possible frequent patterns that contain ai can be obtained by following ai's node-links, starting from ai's head in the FP-tree header table
- Prefix-path property:
- To calculate the frequent patterns for a node ai in a path P, only the prefix sub-path of ai in P needs to be accumulated, and its frequency count should carry the same count as node ai
45. Step 2: Construct Conditional FP-tree
- For each pattern base:
- Accumulate the count for each item in the base
- Construct the FP-tree for the frequent items of the pattern base
- m-conditional pattern base: fca:2, fcab:1
- All frequent patterns concerning m: m, fm, cm, am, fcm, fam, cam, fcam
(Figure lost in conversion: the global FP-tree with header table f:4, c:4, a:3, b:3, m:3, p:3 and the m-conditional FP-tree derived from it)
46. Step 3: Recursively Mine the Conditional FP-tree
- Conditional pattern base of "am": (fc:3)
- Conditional pattern base of "cm": (f:3), giving the cm-conditional FP-tree f:3
- Conditional pattern base of "cam": (f:3), giving the cam-conditional FP-tree f:3
47. FP-growth vs. Apriori: Scalability with the Support Threshold
(Performance figure lost in conversion; data set T25I20D10K)
48. FP-growth vs. Tree-Projection: Scalability with the Support Threshold
(Performance figure lost in conversion; data set T25I20D100K)
49. Rule Generation
- How do we efficiently generate rules from frequent itemsets?
- In general, confidence does not have an anti-monotone property
- c(ABC ⇒ D) can be larger or smaller than c(AB ⇒ D)
- But confidence of rules generated from the same itemset does have an anti-monotone property
- e.g., for L = {A, B, C, D}: c(ABC ⇒ D) ≥ c(AB ⇒ CD) ≥ c(A ⇒ BCD)
- Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule (see the sketch below)
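A hedged sketch of rule generation from a single frequent itemset. It simply enumerates the binary partitions and keeps the high-confidence ones; the level-wise consequent-growing optimization described on the next slides is omitted for brevity, and the names are mine:

```python
from itertools import combinations

def support(itemset, transactions):
    """Fraction of transactions containing every item of itemset."""
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)

def generate_rules(itemset, transactions, minconf):
    """Yield (antecedent, consequent, confidence) for all confident rules from one frequent itemset."""
    itemset = frozenset(itemset)
    for r in range(1, len(itemset)):
        for lhs in combinations(itemset, r):
            lhs = frozenset(lhs)
            conf = support(itemset, transactions) / support(lhs, transactions)
            if conf >= minconf:
                yield lhs, itemset - lhs, conf

# e.g. list(generate_rules({"Milk", "Diaper", "Beer"}, transactions, minconf=0.6))
```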
50. Rule Generation for the Apriori Algorithm
(Figure lost in conversion: the lattice of rules for one itemset, with a low-confidence rule and the rules pruned beneath it)
51. Rule Generation for the Apriori Algorithm
- A candidate rule is generated by merging two rules that share the same prefix in the rule consequent
- join(CD ⇒ AB, BD ⇒ AC) would produce the candidate rule D ⇒ ABC
- Prune rule D ⇒ ABC if its subset AD ⇒ BC does not have high confidence
52. Effect of Support Distribution
- Many real data sets have a skewed support distribution
(Figure lost in conversion: support distribution of a retail data set)
53. Effect of Support Distribution
- How do we set an appropriate minsup threshold?
- If minsup is set too high, we could miss itemsets involving interesting rare items (e.g., expensive products)
- If minsup is set too low, mining becomes computationally expensive and the number of itemsets becomes very large
- Using a single minimum support threshold may not be effective
54. Iceberg Queries
- Iceberg query: compute aggregates over one attribute or a set of attributes only for those groups whose aggregate value is above a certain threshold
- Example:
    select P.custID, P.itemID, sum(P.qty)
    from purchase P
    group by P.custID, P.itemID
    having sum(P.qty) > 10
- Compute iceberg queries efficiently with an Apriori-style strategy:
- First compute the lower dimensions
- Then compute higher dimensions only when all the lower ones are above the threshold
55. Pattern Evaluation
- Association rule algorithms tend to produce too many rules
- many of them are uninteresting or redundant
- e.g., {A, B, C} ⇒ {D} is redundant if it has the same support and confidence as {A, B} ⇒ {D}
- Interestingness measures can be used to prune or rank the derived patterns
- In the original formulation of association rules, support and confidence are the only measures used
56. Application of Interestingness Measures
57. Computing an Interestingness Measure
- Given a rule X ⇒ Y, the information needed to compute its interestingness can be obtained from a contingency table
Contingency table for X ⇒ Y:
          Y      ¬Y
  X      f11    f10   | f1+
 ¬X      f01    f00   | f0+
         f+1    f+0   | |T|
- Used to define various measures:
- support, confidence, lift, Gini, J-measure, etc.
58. Drawback of Confidence
          Coffee   ¬Coffee
  Tea        15         5   |  20
 ¬Tea        75         5   |  80
             90        10   | 100
59. Interestingness Measurements
- Objective measures
- Two popular measurements:
- support, and
- confidence
- Subjective measures (Silberschatz & Tuzhilin, KDD'95)
- A rule (pattern) is interesting if
- it is unexpected (surprising to the user), and/or
- actionable (the user can do something with it)
60. Criticism of Support and Confidence
- Example 1 (Aggarwal & Yu, PODS'98)
- Among 5000 students:
- 3000 play basketball
- 3750 eat cereal
- 2000 both play basketball and eat cereal
- "play basketball ⇒ eat cereal" [40%, 66.7%] is misleading because the overall percentage of students eating cereal is 75%, which is higher than 66.7%
- "play basketball ⇒ not eat cereal" [20%, 33.3%] is far more accurate, although it has lower support and confidence
61. Statistical Independence
- Population of 1000 students
- 600 students know how to swim (S)
- 700 students know how to bike (B)
- 420 students know how to swim and bike (S, B)
- P(S ∧ B) = 420/1000 = 0.42
- P(S) × P(B) = 0.6 × 0.7 = 0.42
- P(S ∧ B) = P(S) × P(B) ⇒ statistical independence
- P(S ∧ B) > P(S) × P(B) ⇒ positively correlated
- P(S ∧ B) < P(S) × P(B) ⇒ negatively correlated
62. Statistical-based Measures
- Measures that take into account statistical dependence, such as lift/interest (the slide's formula graphic was lost in conversion; see the definition below)
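The most common such measure, lift (also called interest), restated here in its standard form since the original formula image did not survive; it is consistent with the 0.75/0.9 computation on the next slide:

```latex
\mathrm{Lift}(X \Rightarrow Y) \;=\; \frac{c(X \Rightarrow Y)}{s(Y)} \;=\; \frac{P(X, Y)}{P(X)\,P(Y)}
```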
63. Example: Lift/Interest
          Coffee   ¬Coffee
  Tea        15         5   |  20
 ¬Tea        75         5   |  80
             90        10   | 100
- Association rule: Tea ⇒ Coffee
- Confidence = P(Coffee | Tea) = 0.75
- but P(Coffee) = 0.9
- Lift = 0.75/0.9 = 0.8333 (< 1, therefore Tea and Coffee are negatively associated)
64. Drawback of Lift & Interest
          Y    ¬Y
  X      10     0   |  10
 ¬X       0    90   |  90
         10    90   | 100

          Y    ¬Y
  X      90     0   |  90
 ¬X       0    10   |  10
         90    10   | 100
Here Lift = 0.1/(0.1 × 0.1) = 10 for the first table but only 0.9/(0.9 × 0.9) ≈ 1.11 for the second, even though X and Y co-occur in 90% of transactions in the second.
Statistical independence: if P(X,Y) = P(X)P(Y), then Lift = 1
65. There are many measures proposed in the literature. Some measures are good for certain applications, but not for others. What criteria should we use to determine whether a measure is good or bad? What about Apriori-style support-based pruning? How does it affect these measures?
66. Constraint-Based Mining
- Interactive, exploratory mining of gigabytes of data?
- Could it be real? Yes, by making good use of constraints!
- What kinds of constraints can be used in mining?
- Knowledge type constraint: classification, association, etc.
- Data constraint: SQL-like queries
- e.g., find product pairs sold together in Vancouver in Dec. '98
- Dimension/level constraints:
- in relevance to region, price, brand, customer category
- Rule constraints:
- e.g., small sales (price < 10) triggers big sales (sum > 200)
- Interestingness constraints:
- e.g., strong rules (min_support ≥ 3%, min_confidence ≥ 60%)
67. Rule Constraints in Association Mining
- Two kinds of rule constraints:
- Rule form constraints: meta-rule guided mining
- e.g., P(x, y) ∧ Q(x, w) ⇒ takes(x, "database systems")
- Rule (content) constraints: constraint-based query optimization (Ng, et al., SIGMOD'98)
- e.g., sum(LHS) < 100 ∧ min(LHS) > 20 ∧ count(LHS) > 3 ∧ sum(RHS) > 1000
- 1-variable vs. 2-variable constraints (Lakshmanan, et al., SIGMOD'99):
- 1-var: a constraint confining only one side (L/R) of the rule, e.g., as shown above
- 2-var: a constraint confining both sides (L and R)
- e.g., sum(LHS) < min(RHS) ∧ max(RHS) < 5 × sum(LHS)
68. Constraint-Based Association Query
- Database: (1) trans(TID, Itemset), (2) itemInfo(Item, Type, Price)
- A constrained association query (CAQ) is of the form {(S1, S2) | C},
- where C is a set of constraints on S1, S2, including the frequency constraint
- A classification of (single-variable) constraints:
- Class constraint: S ⊆ A, e.g. S ⊆ Item
- Domain constraint:
- S θ v, θ ∈ {=, ≠, <, ≤, >, ≥}, e.g. S.Price < 100
- v θ S, θ is ∈ or ∉, e.g. snacks ∉ S.Type
- V θ S, or S θ V, θ ∈ {⊆, ⊂, ⊄, =, ≠}
- e.g. {snacks, sodas} ⊆ S.Type
- Aggregation constraint: agg(S) θ v, where agg is one of {min, max, sum, count, avg} and θ ∈ {=, ≠, <, ≤, >, ≥}
- e.g. count(S1.Type) = 1, avg(S2.Price) < 100
69. Constrained Association Query Optimization Problem
- Given a CAQ = {(S1, S2) | C}, the algorithm should be:
- sound: it only finds frequent sets that satisfy the given constraints C
- complete: all frequent sets satisfying the given constraints C are found
- A naive solution:
- Apply Apriori to find all frequent sets, and then test them for constraint satisfaction one by one
- Our approach:
- Comprehensively analyze the properties of constraints and try to push them as deeply as possible inside the frequent set computation
70. Anti-monotone and Monotone Constraints
- A constraint Ca is anti-monotone iff for any pattern S not satisfying Ca, none of the super-patterns of S can satisfy Ca
- A constraint Cm is monotone iff for any pattern S satisfying Cm, every super-pattern of S also satisfies it
71. Succinct Constraint
- A subset of items Is ⊆ I is a succinct set if it can be expressed as σp(I) for some selection predicate p, where σ is the selection operator
- SP ⊆ 2^I is a succinct power set if there is a fixed number of succinct sets I1, ..., Ik ⊆ I, such that SP can be expressed in terms of the strict power sets of I1, ..., Ik using union and minus
- A constraint Cs is succinct provided SAT_Cs(I) is a succinct power set
72. Convertible Constraint
- Suppose all items in patterns are listed in a total order R
- A constraint C is convertible anti-monotone iff a pattern S satisfying the constraint implies that each suffix of S w.r.t. R also satisfies C
- A constraint C is convertible monotone iff a pattern S satisfying the constraint implies that each pattern of which S is a suffix w.r.t. R also satisfies C
73. Relationships Among Categories of Constraints
(Figure lost in conversion: a diagram relating succinctness, anti-monotonicity, monotonicity, convertible constraints, and inconvertible constraints)
74. Property of Constraints: Anti-Monotone
- Anti-monotonicity: if a set S violates the constraint, any superset of S violates the constraint
- Examples:
- sum(S.Price) ≤ v is anti-monotone
- sum(S.Price) ≥ v is not anti-monotone
- sum(S.Price) = v is partly anti-monotone
- Application:
- Push sum(S.Price) ≤ 1000 deeply into the iterative frequent set computation (see the sketch below)
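A hedged sketch of pushing an anti-monotone constraint into level-wise mining: once a candidate violates sum(price) ≤ budget, no superset can ever satisfy it, so it can be dropped before any support counting. The price table, budget, and names are illustrative, not from the slides:

```python
prices = {"beer": 8, "diaper": 12, "milk": 3, "bread": 2}   # illustrative item prices

def violates_budget(itemset, budget):
    """Anti-monotone constraint sum(S.Price) <= budget: once violated, always violated."""
    return sum(prices[i] for i in itemset) > budget

def prune_candidates(candidates, budget):
    """Drop candidates that already violate the constraint, before support counting."""
    return [c for c in candidates if not violates_budget(c, budget)]

# Keeps {'milk', 'bread'} (cost 5) and drops {'beer', 'diaper'} (cost 20 > 10),
# so the second set never reaches the database scan and neither do its supersets.
print(prune_candidates([{"milk", "bread"}, {"beer", "diaper"}], budget=10))
```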
75. Characterization of Anti-Monotonicity Constraints
Constraint                        | Anti-monotone?
S θ v, θ ∈ {=, ≤, ≥}              | yes
v ∈ S                             | no
S ⊇ V                             | no
S ⊆ V                             | yes
S = V                             | partly
min(S) ≤ v                        | no
min(S) ≥ v                        | yes
min(S) = v                        | partly
max(S) ≤ v                        | yes
max(S) ≥ v                        | no
max(S) = v                        | partly
count(S) ≤ v                      | yes
count(S) ≥ v                      | no
count(S) = v                      | partly
sum(S) ≤ v                        | yes
sum(S) ≥ v                        | no
sum(S) = v                        | partly
avg(S) θ v, θ ∈ {=, ≤, ≥}         | convertible
(frequent constraint)             | (yes)
76. Example of Convertible Constraints: Avg(S) ≥ v
- Let R be the value-descending order over the set of items
- e.g., I = {9, 8, 6, 4, 3, 1}
- Avg(S) ≥ v is convertible monotone w.r.t. R
- If S is a suffix of S1, then avg(S1) ≥ avg(S)
- {8, 4, 3} is a suffix of {9, 8, 4, 3}
- avg({9, 8, 4, 3}) = 6 ≥ avg({8, 4, 3}) = 5
- If S satisfies avg(S) ≥ v, so does S1
- {8, 4, 3} satisfies the constraint avg(S) ≥ 4, so does {9, 8, 4, 3}
77. Property of Constraints: Succinctness
- Succinctness:
- For any sets S1 and S2 satisfying C, S1 ∪ S2 satisfies C
- Given A1, the set of size-1 itemsets satisfying C, any set S satisfying C is based on A1, i.e., it contains a subset belonging to A1
- Example:
- sum(S.Price) ≥ v is not succinct
- min(S.Price) ≤ v is succinct
- Optimization:
- If C is succinct, then C is pre-counting prunable. The satisfaction of the constraint alone is not affected by the iterative support counting.
78. Characterization of Constraints by Succinctness
Constraint                        | Succinct?
S θ v, θ ∈ {=, ≤, ≥}              | yes
v ∈ S                             | yes
S ⊇ V                             | yes
S ⊆ V                             | yes
S = V                             | yes
min(S) ≤ v                        | yes
min(S) ≥ v                        | yes
min(S) = v                        | yes
max(S) ≤ v                        | yes
max(S) ≥ v                        | yes
max(S) = v                        | yes
count(S) ≤ v                      | weakly
count(S) ≥ v                      | weakly
count(S) = v                      | weakly
sum(S) ≤ v                        | no
sum(S) ≥ v                        | no
sum(S) = v                        | no
avg(S) θ v, θ ∈ {=, ≤, ≥}         | no
(frequent constraint)             | (no)
79. Why Is the Big Pie Still There?
- More on constraint-based mining of associations:
- Boolean vs. quantitative associations
- Association on discrete vs. continuous data
- From association to correlation and causal structure analysis:
- Association does not necessarily imply correlation or causal relationships
- From intra-transaction associations to inter-transaction associations:
- e.g., break the barriers of transactions (Lu, et al., TOIS'99)
- From association analysis to classification and clustering analysis:
- e.g., clustering association rules
80. Summary
- Association rule mining
- is probably the most significant contribution from the database community to KDD
- A large number of papers have been published
- Many interesting issues have been explored
- An interesting research direction:
- Association analysis on other types of data: spatial data, multimedia data, time-series data, etc.