Title: Mining Association Rules
1. Mining Association Rules
2. Data Mining Overview
- Data Mining
- Data warehouses and OLAP (On-Line Analytical Processing)
- Association Rules Mining
- Clustering: hierarchical and partitional approaches
- Classification: decision trees and Bayesian classifiers
- Sequential Patterns Mining
- Advanced topics: outlier detection, web mining
3. Association Rules Background
- Given (1) a database of transactions, where (2) each transaction is a list of items (purchased by a customer in one visit)
- Find all association rules that satisfy user-specified minimum support and minimum confidence
- Example: 30% of transactions that contain beer also contain diapers; 5% of transactions contain both of these items
  - 30% is the confidence of the rule
  - 5% is the support of the rule
- We are interested in finding all such rules rather than verifying whether a single rule holds
4. Rule Measures: Support and Confidence
[Venn diagram: customers who buy beer, customers who buy diapers, and customers who buy both]
- Find all the rules X ∧ Y ⇒ Z with minimum confidence and support
  - support, s: probability that a transaction contains X ∪ Y ∪ Z
  - confidence, c: conditional probability that a transaction having X ∪ Y also contains Z
- With minimum support 50% and minimum confidence 50%, we have
  - A ⇒ C (50%, 66.6%)
  - C ⇒ A (50%, 100%)
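In formulas (a standard formulation consistent with the bullets above, not verbatim from the slide), for a rule X ⇒ Z over a database D:

```latex
\mathrm{support}(X \Rightarrow Z) \;=\; P(X \cup Z)
  \;=\; \frac{|\{\,T \in D : X \cup Z \subseteq T\,\}|}{|D|},
\qquad
\mathrm{confidence}(X \Rightarrow Z) \;=\; P(Z \mid X)
  \;=\; \frac{\mathrm{support}(X \cup Z)}{\mathrm{support}(X)}.
```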
5. Application Examples
- Market Basket Analysis
  - * ⇒ Maintenance Agreement (what should the store do to boost Maintenance Agreement sales?)
  - Home Electronics ⇒ * (what other products should the store stock up on if it has a sale on Home Electronics?)
  - Attached mailing in direct marketing
- Detecting ping-ponging of patients
  - Transaction: patient
  - Item: doctor/clinic visited by the patient
  - Support of the rule: number of common patients
  - HIC Australia success story
6. Problem Statement
- I = {i1, i2, ..., im}: a set of literals, called items
- Transaction T: a set of items such that T ⊆ I
- Database D: a set of transactions
- A transaction T contains X, a set of items in I, if X ⊆ T
- An association rule is an implication of the form X ⇒ Y, where X, Y ⊂ I (with X ∩ Y = ∅)
- The rule X ⇒ Y holds in the transaction set D with confidence c if c% of the transactions in D that contain X also contain Y
- The rule X ⇒ Y has support s in the transaction set D if s% of the transactions in D contain X ∪ Y
- Goal: find all rules whose support and confidence exceed the user-specified minimum support and minimum confidence
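These definitions in runnable form (a minimal Python sketch; the four-transaction database is a hypothetical example, chosen to reproduce the A ⇒ C and C ⇒ A numbers from slide 4):

```python
# Hypothetical toy database: each transaction is a set of items.
D = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]

def support(itemset, D):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= T for T in D) / len(D)

def confidence(X, Y, D):
    """Conditional probability that a transaction with X also has Y."""
    return support(X | Y, D) / support(X, D)

print(support({"A"} | {"C"}, D))    # 0.5      -> A => C has 50% support
print(confidence({"A"}, {"C"}, D))  # 0.666... -> A => C has 66.6% confidence
print(confidence({"C"}, {"A"}, D))  # 1.0      -> C => A has 100% confidence
```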
7. Association Rule Mining: A Road Map
- Boolean vs. quantitative associations (based on the types of values handled)
  - buys(x, "SQLServer") ∧ buys(x, "DMBook") ⇒ buys(x, "DBMiner") [0.2%, 60%]
  - age(x, "30..39") ∧ income(x, "42..48K") ⇒ buys(x, "PC") [1%, 75%]
- Single-dimensional vs. multi-dimensional associations (see the examples above)
- Single-level vs. multiple-level analysis
  - What brands of beer are associated with what brands of diapers?
- Various extensions
  - Correlation, causality analysis
    - Association does not necessarily imply correlation or causality
  - Enforced constraints
    - E.g., do small sales (sum < 100) trigger big buys (sum > 1,000)?
8. Problem Decomposition
- 1. Find all sets of items that have minimum support (frequent itemsets)
- 2. Use the frequent itemsets to generate the desired rules
9. Problem Decomposition: Example
For min support 50% (2 transactions) and min confidence 50%:
For the rule Shoes ⇒ Jacket:
- Support = sup(Shoes, Jacket) = 50%
- Confidence = 66.6%
Jacket ⇒ Shoes has 50% support and 100% confidence.
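The transaction table behind this slide did not survive extraction; numbers consistent with the quoted figures would be, e.g., four transactions with Shoes in three, Jacket in two, and both together in two:

```latex
\mathrm{conf}(\mathit{Shoes} \Rightarrow \mathit{Jacket})
  = \frac{\mathrm{sup}(\{\mathit{Shoes},\mathit{Jacket}\})}{\mathrm{sup}(\{\mathit{Shoes}\})}
  = \frac{2/4}{3/4} = 66.6\%,
\qquad
\mathrm{conf}(\mathit{Jacket} \Rightarrow \mathit{Shoes})
  = \frac{2/4}{2/4} = 100\%.
```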
10. Discovering Rules
- Naïve algorithm:
    for each frequent itemset l do
      for each proper non-empty subset c of l do
        if (support(l) / support(l - c) > minconf) then
          output the rule (l - c) ⇒ c,
          with confidence = support(l) / support(l - c)
          and support = support(l)
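The same naïve algorithm in Python (a sketch; `frequent` and `support` are assumed inputs, with `support` covering every frequent itemset, so every antecedent l - c is present by the subset property of slide 12):

```python
from itertools import combinations

def naive_rules(frequent, support, minconf):
    """Naive rule generation: for every frequent itemset l and every
    proper non-empty subset c of l (the consequent), emit the rule
    (l - c) => c when support(l) / support(l - c) exceeds minconf.
    `frequent` is an iterable of frozensets; `support` maps each
    frequent itemset to its support."""
    rules = []
    for l in frequent:
        for r in range(1, len(l)):                # size of the consequent c
            for c in map(frozenset, combinations(l, r)):
                conf = support[l] / support[l - c]
                if conf > minconf:
                    rules.append((l - c, c, support[l], conf))
    return rules
```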
11. Discovering Rules (2)
- Lemma: if consequent c generates a valid rule, so do all non-empty subsets of c (e.g., if X ⇒ YZ holds, then XY ⇒ Z and XZ ⇒ Y hold as well)
- Example: consider the frequent itemset ABCDE
  - If ACDE ⇒ B and ABCE ⇒ D are the only one-consequent rules with minimum confidence, then
  - ACE ⇒ BD is the only other rule that needs to be tested
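The lemma gives an Apriori-style generation over consequents: candidate (k+1)-consequents are merged from valid k-consequents and kept only if all their k-subsets are valid. A sketch (function name and representation are mine, not from the slide):

```python
from itertools import combinations

def candidate_consequents(valid):
    """Merge valid k-consequents (frozensets of equal size) into
    candidate (k+1)-consequents, keeping only those whose k-subsets
    are all valid -- the lemma above, applied Apriori-style."""
    k = len(next(iter(valid)))
    merged = {a | b for a, b in combinations(valid, 2) if len(a | b) == k + 1}
    return {c for c in merged
            if all(frozenset(s) in valid for s in combinations(c, k))}

# Slide's example: the valid one-consequents for ABCDE are {B} and {D},
# so the only candidate two-consequent is {B, D} (rule ACE => BD).
print(candidate_consequents({frozenset("B"), frozenset("D")}))
```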
12. Mining Frequent Itemsets: the Key Step
- Find the frequent itemsets: the sets of items that have minimum support
  - A subset of a frequent itemset must also be a frequent itemset
    - i.e., if AB is a frequent itemset, both A and B must be frequent itemsets
  - Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets)
- Use the frequent itemsets to generate association rules.
13. The Apriori Algorithm
- Lk: set of frequent itemsets of size k (those with min support)
- Ck: set of candidate itemsets of size k (potentially frequent itemsets)
- Pseudocode (a Python sketch follows):
    L1 = {frequent items}
    for (k = 1; Lk ≠ ∅; k++) do begin
      Ck+1 = candidates generated from Lk
      for each transaction t in the database do
        increment the count of all candidates in Ck+1 that are contained in t
      Lk+1 = candidates in Ck+1 with min_support
    end
    return ∪k Lk
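A self-contained Python sketch of this loop (candidate generation inlined; the 4-transaction database in the demo is hypothetical, chosen to be consistent with the min support of 50% = 2 transactions on the next slide):

```python
from itertools import combinations
from collections import defaultdict

def apriori(D, minsup):
    """Apriori sketch. D is a list of transactions (sets of items),
    minsup an absolute count. Returns {itemset: support_count} for
    all frequent itemsets, i.e. the union of the Lk."""
    counts = defaultdict(int)
    for t in D:                                   # first scan: L1
        for item in t:
            counts[frozenset([item])] += 1
    Lk = {c: n for c, n in counts.items() if n >= minsup}
    frequent = dict(Lk)
    k = 1
    while Lk:
        # Candidate generation: join Lk with itself, then prune any
        # candidate that has an infrequent k-subset (slide 15).
        Ck1 = set()
        for a, b in combinations(Lk, 2):
            c = a | b
            if len(c) == k + 1 and all(
                    frozenset(s) in Lk for s in combinations(c, k)):
                Ck1.add(c)
        # One scan of D: count candidates contained in each transaction.
        counts = defaultdict(int)
        for t in D:
            for c in Ck1:
                if c <= t:
                    counts[c] += 1
        Lk = {c: n for c, n in counts.items() if n >= minsup}
        frequent.update(Lk)
        k += 1
    return frequent

# Hypothetical database; min support 50% = 2 transactions.
D = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
for itemset, n in sorted(apriori(D, 2).items(),
                         key=lambda x: (len(x[0]), sorted(x[0]))):
    print(set(itemset), n)
```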
14. The Apriori Algorithm: Example
Min support 50% (2 transactions).
[Worked example; the tables did not survive extraction: scan database D to count C1 and keep L1; self-join L1 into C2, scan D, keep L2; self-join L2 into C3, scan D, keep L3.]
15. How to Generate Candidates?
- Suppose the items in Lk-1 are listed in order
- Step 1: self-joining Lk-1
    insert into Ck
    select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
    from Lk-1 p, Lk-1 q
    where p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
- Step 2: pruning
    forall itemsets c in Ck do
      forall (k-1)-subsets s of c do
        if (s is not in Lk-1) then delete c from Ck
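The same two steps in Python (my own sketch; itemsets are sorted tuples so the join condition "first k-2 items equal, last item of p smaller" reads off directly). Run on the example from the next slide, it yields C4 = {abcd}:

```python
from itertools import combinations

def apriori_gen(Lk_1):
    """Candidate generation: self-join Lk-1 with itself, then prune
    every candidate with an infrequent (k-1)-subset.
    Lk_1 is a set of sorted (k-1)-tuples."""
    size = len(next(iter(Lk_1)))
    # Step 1: self-join on the first k-2 items.
    Ck = {p + (q[-1],) for p in Lk_1 for q in Lk_1
          if p[:-1] == q[:-1] and p[-1] < q[-1]}
    # Step 2: prune candidates with a (k-1)-subset not in Lk-1.
    return {c for c in Ck if all(s in Lk_1 for s in combinations(c, size))}

L3 = {("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"),
      ("a", "c", "e"), ("b", "c", "d")}
print(apriori_gen(L3))  # {('a','b','c','d')}; acde pruned (ade not in L3)
```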
16. Example of Generating Candidates
- L3 = {abc, abd, acd, ace, bcd}
- Self-joining: L3 ⋈ L3
  - abcd from abc and abd
  - acde from acd and ace
- Pruning:
  - acde is removed because ade is not in L3
- C4 = {abcd}
17. How to Count Supports of Candidates?
- Why is counting supports of candidates a problem?
  - The total number of candidates can be huge
  - One transaction may contain many candidates
- Method:
  - Candidate itemsets are stored in a hash-tree
  - A leaf node of the hash-tree contains a list of itemsets and counts
  - An interior node contains a hash table
  - The subset function finds all the candidates contained in a transaction
18. Hash-tree Search
- Given a transaction T and a candidate set Ck, find all members of Ck contained in T
- Assume an ordering on the items
- Start from the root; use every item in T to go to the next node
- If you are at an interior node and you just used item i, then branch on each item that comes after i in T
- If you are at a leaf node, check the itemsets stored there
19. Methods to Improve Apriori's Efficiency
- Transaction reduction: a transaction that does not contain any frequent k-itemset is useless in subsequent scans
- Partitioning: any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
- Sampling: mine on a subset of the given data with a lowered support threshold, plus a method to determine completeness
- Dynamic itemset counting: add new candidate itemsets only when all of their subsets are estimated to be frequent
20. Is Apriori Fast Enough? Performance Bottlenecks
- The core of the Apriori algorithm:
  - Use frequent (k-1)-itemsets to generate candidate frequent k-itemsets
  - Use database scans and pattern matching to collect counts for the candidate itemsets
- The bottleneck of Apriori: candidate generation
  - Huge candidate sets:
    - 10^4 frequent 1-itemsets will generate on the order of 10^7 candidate 2-itemsets
    - To discover a frequent pattern of size 100, e.g., {a1, a2, ..., a100}, one needs to generate 2^100 ≈ 10^30 candidates
  - Multiple scans of the database:
    - Needs (n + 1) scans, where n is the length of the longest pattern
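The arithmetic behind these counts (standard back-of-the-envelope steps, not spelled out on the slide):

```latex
\binom{10^4}{2} \;=\; \frac{10^4\,(10^4 - 1)}{2} \;\approx\; 5 \times 10^{7}
\quad\text{(order of } 10^7\text{)},
\qquad
2^{100} \;=\; \left(2^{10}\right)^{10} \;\approx\; \left(10^3\right)^{10} \;=\; 10^{30}.
```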
21. Max-Miner
- Max-Miner finds long patterns efficiently: the maximal frequent patterns
- Instead of checking all subsets of a long pattern, it tries to detect long patterns early
- Scales roughly linearly with the size of the longest patterns
22. Max-Miner: the Idea
[Figure: set-enumeration tree of the ordered set {1, 2, 3, 4}]
    level 0: {}  (the empty head, root)
    level 1: 1   2   3   4
    level 2: 1,2  1,3  1,4  2,3  2,4  3,4
    level 3: 1,2,3  1,2,4  1,3,4  2,3,4
    level 4: 1,2,3,4
- Pruning: (1) subset infrequency, (2) superset frequency
- Each node is a candidate group g: h(g), the head, is the itemset of the node; t(g), the tail, is an ordered set that contains all items that can appear in the subnodes
- Example: h(1) = {1} and t(1) = {2, 3, 4}
23. Max-Miner: Pruning
- When we count the support of a candidate group g, we also compute the support of h(g), h(g) ∪ t(g), and h(g) ∪ {i} for each i in t(g)
- If h(g) ∪ t(g) is frequent, then stop expanding the node g and report the union as a frequent itemset
- If h(g) ∪ {i} is infrequent, then remove i from all subnodes (i.e., remove i from the tail of any group after g)
- Expand the node g by one level and do the same
24. The Algorithm
Max-Miner(T):
    Set of candidate groups C ← ∅
    Set of itemsets F ← Gen-Initial-Groups(T, C)
    while C is not empty do
      scan T to count the support of all candidate groups in C
      for each g in C s.t. h(g) ∪ t(g) is frequent do
        F ← F ∪ {h(g) ∪ t(g)}
      Set of candidate groups Cnew ← ∅
      for each g in C such that h(g) ∪ t(g) is infrequent do
        F ← F ∪ {Gen-Sub-Nodes(g, Cnew)}
      C ← Cnew
      remove from F any itemset with a proper superset in F
      remove from C any group g s.t. h(g) ∪ t(g) has a superset in F
    return F
25. The Algorithm (2)
Gen-Initial-Groups(T, C):
    scan T to obtain F1, the set of frequent 1-itemsets
    impose an ordering on the items in F1
    for each item i in F1 other than the greatest item do
      let g be a new candidate group with h(g) = {i}
        and t(g) = {j : j follows i in the ordering}
      C ← C ∪ {g}
    return F1 (and the updated C, of course)

Gen-Sub-Nodes(g, C):  /* generation of new itemsets at the next level */
    remove any item i from t(g) if h(g) ∪ {i} is infrequent
    reorder the items in t(g)
    for each i in t(g) other than the greatest do
      let g' be a new candidate group with h(g') = h(g) ∪ {i}
        and t(g') = {j : j in t(g) and j follows i in t(g)}
      C ← C ∪ {g'}
    return h(g) ∪ {m}, where m is the greatest item in t(g), or h(g) if t(g) is empty
26. Item Ordering
- By re-ordering the items we try to increase the effectiveness of superset-frequency pruning
- Very frequent items have a higher probability of being contained in long patterns
- Put these items at the end of the ordering, so that they appear in many tails