Title: Frequent Item Mining
1Frequent Item Mining
2What is data mining?
- Pattern Mining?
- What patterns?
- Why are they useful?
3Definition Frequent Itemset
- Itemset
- A collection of one or more items
- Example: {Milk, Bread, Diaper}
- k-itemset
- An itemset that contains k items
- Support count (σ)
- Frequency of occurrence of an itemset
- E.g. σ({Milk, Bread, Diaper}) = 2
- Support
- Fraction of transactions that contain an itemset
- E.g. s({Milk, Bread, Diaper}) = 2/5
- Frequent Itemset
- An itemset whose support is greater than or equal
to a minsup threshold
4Frequent Itemsets Mining
TID Transactions
100 A, B, E
200 B, D
300 A, B, E
400 A, C
500 B, C
600 A, C
700 A, B
800 A, B, C, E
900 A, B, C
1000 A, C, E
- Minimum support level: 50%
- Frequent itemsets: {A}, {B}, {C}, {A,B}, {A,C}
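A minimal Python sketch of the support definitions, applied to the ten transactions listed on this slide. The function names (`support_count`, `support`, `frequent_itemsets_bruteforce`) are illustrative and do not come from the slides, and the exhaustive enumeration is only practical for a handful of items.

```python
from itertools import combinations

# The ten transactions from the table on this slide, keyed by TID.
DB = {
    100: {"A", "B", "E"}, 200: {"B", "D"}, 300: {"A", "B", "E"},
    400: {"A", "C"}, 500: {"B", "C"}, 600: {"A", "C"},
    700: {"A", "B"}, 800: {"A", "B", "C", "E"}, 900: {"A", "B", "C"},
    1000: {"A", "C", "E"},
}

def support_count(itemset, db):
    """Number of transactions that contain every item of `itemset`."""
    items = set(itemset)
    return sum(1 for t in db.values() if items <= t)

def support(itemset, db):
    """Fraction of transactions that contain `itemset`."""
    return support_count(itemset, db) / len(db)

def frequent_itemsets_bruteforce(db, minsup):
    """Test every non-empty subset of the item universe against minsup."""
    items = sorted(set().union(*db.values()))
    result = {}
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):
            s = support(candidate, db)
            if s >= minsup:
                result[candidate] = s
    return result

frequent = frequent_itemsets_bruteforce(DB, 0.5)
for itemset, s in sorted(frequent.items()):
    print(itemset, s)
```

With a 50% threshold this keeps exactly the five itemsets listed on the slide.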
5Three Different Views of FIM
- Transactional Database
- How do we store a transactional database?
- Horizontal, Vertical, Transaction-Item Pair
- Binary Matrix
- Bipartite Graph
- How is FIM formulated in these different settings?
6Frequent Itemset Generation
Given d items, there are 2^d possible candidate itemsets
7Frequent Itemset Generation
- Brute-force approach
- Each itemset in the lattice is a candidate frequent itemset
- Count the support of each candidate by scanning the database
- Match each transaction against every candidate
- Complexity ~ O(NMw), where N is the number of transactions, M the number of candidates and w the maximum transaction width => expensive, since M = 2^d
8Reducing Number of Candidates
- Apriori principle
- If an itemset is frequent, then all of its subsets must also be frequent
- Apriori principle holds due to the following property of the support measure
- Support of an itemset never exceeds the support of its subsets
- This is known as the anti-monotone property of support
9Illustrating Apriori Principle
10Illustrating Apriori Principle
Items (1-itemsets)
Pairs (2-itemsets) (No need to generate candidates involving Coke or Eggs)
Minimum Support = 3
Triplets (3-itemsets)
If every subset is considered, 6C1 + 6C2 + 6C3 = 41
With support-based pruning, 6 + 6 + 1 = 13
11Apriori
R. Agrawal and R. Srikant. Fast algorithms for
mining association rules. VLDB, 487-499, 1994
13How to Generate Candidates?
- Suppose the items in L_{k-1} are listed in an order
- Step 1: self-joining L_{k-1}
- insert into C_k
- select p.item_1, p.item_2, ..., p.item_{k-1}, q.item_{k-1}
- from L_{k-1} p, L_{k-1} q
- where p.item_1 = q.item_1, ..., p.item_{k-2} = q.item_{k-2}, p.item_{k-1} < q.item_{k-1}
- Step 2: pruning
- forall itemsets c in C_k do
- forall (k-1)-subsets s of c do
- if (s is not in L_{k-1}) then delete c from C_k
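The join and prune steps can be sketched in Python as follows. This is an illustrative version, not the code from the Agrawal and Srikant paper: itemsets are represented as sorted tuples, and the names `apriori_gen` and `apriori` as well as the small level-wise driver are assumptions made for this example. The test database is the ten-transaction table from the earlier slide.

```python
from itertools import combinations

def apriori_gen(prev_frequent):
    """Build C_k from L_{k-1}. Itemsets are sorted tuples of items."""
    prev = sorted(prev_frequent)
    prev_set = set(prev)
    candidates = []
    for i, p in enumerate(prev):
        for q in prev[i + 1:]:
            # Join step: same first k-2 items; p's last item < q's last item
            # follows from the lexicographic ordering of `prev`.
            if p[:-1] == q[:-1]:
                c = p + (q[-1],)
                # Prune step: every (k-1)-subset of c must be in L_{k-1}.
                if all(s in prev_set for s in combinations(c, len(c) - 1)):
                    candidates.append(c)
    return candidates

def apriori(transactions, min_count):
    """Level-wise search: count candidates, keep frequent ones, generate next level."""
    transactions = [frozenset(t) for t in transactions]
    level = [(i,) for i in sorted({i for t in transactions for i in t})]
    frequent = {}
    while level:
        counts = {c: sum(1 for t in transactions if t.issuperset(c)) for c in level}
        current = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(current)
        level = apriori_gen(current)
    return frequent

# Join/prune on a small hand-made L_3.
L3 = [("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"), ("a", "c", "e"), ("b", "c", "d")]
C4 = apriori_gen(L3)  # acde is pruned because ade is not in L_3

DB = [{"A", "B", "E"}, {"B", "D"}, {"A", "B", "E"}, {"A", "C"}, {"B", "C"},
      {"A", "C"}, {"A", "B"}, {"A", "B", "C", "E"}, {"A", "B", "C"}, {"A", "C", "E"}]
result = apriori(DB, min_count=5)
```

In the run on `DB`, the candidate ABC is generated by the join but removed by the prune step, because BC is not frequent at a count of 5.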
14Challenges of Frequent Itemset Mining
- Challenges
- Multiple scans of transaction database
- Huge number of candidates
- Tedious workload of support counting for candidates
- Improving Apriori: general ideas
- Reduce passes of transaction database scans
- Shrink number of candidates
- Facilitate support counting of candidates
15Alternative Methods for Frequent Itemset
Generation
- Representation of Database
- horizontal vs vertical data layout
16ECLAT
- For each item, store a list of transaction ids
(tids)
TID-list
17ECLAT
- Determine support of any k-itemset by intersecting tid-lists of two of its (k-1) subsets.
- 3 traversal approaches
- top-down, bottom-up and hybrid
- Advantage: very fast support counting
- Disadvantage: intermediate tid-lists may become too large for memory
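A small Python sketch of the tid-list idea, using the ten-transaction database from the earlier slide. It uses one possible depth-first traversal and is not the original ECLAT implementation; `vertical_layout` and `eclat` are names chosen for this example.

```python
from collections import defaultdict

DB = {
    100: {"A", "B", "E"}, 200: {"B", "D"}, 300: {"A", "B", "E"},
    400: {"A", "C"}, 500: {"B", "C"}, 600: {"A", "C"},
    700: {"A", "B"}, 800: {"A", "B", "C", "E"}, 900: {"A", "B", "C"},
    1000: {"A", "C", "E"},
}

def vertical_layout(db):
    """Convert the horizontal layout (TID -> items) to item -> set of TIDs."""
    tidlists = defaultdict(set)
    for tid, items in db.items():
        for item in items:
            tidlists[item].add(tid)
    return dict(tidlists)

def eclat(tidlists, min_count):
    """Depth-first search; support of an extension = size of the tid-list intersection."""
    frequent = {}

    def extend(prefix, candidates):
        # `candidates` holds (item, tidlist) pairs that are frequent together with `prefix`.
        for i, (item, tids) in enumerate(candidates):
            itemset = prefix + (item,)
            frequent[itemset] = len(tids)
            next_candidates = []
            for other, other_tids in candidates[i + 1:]:
                common = tids & other_tids
                if len(common) >= min_count:
                    next_candidates.append((other, common))
            extend(itemset, next_candidates)

    start = sorted(((i, t) for i, t in tidlists.items() if len(t) >= min_count),
                   key=lambda pair: pair[0])
    extend((), start)
    return frequent

tidlists = vertical_layout(DB)
frequent = eclat(tidlists, min_count=5)
```

Here the support of {A, B} is obtained as `len(tidlists["A"] & tidlists["B"])`, without rescanning the transactions.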
20FP-growth Algorithm
- Use a compressed representation of the database using an FP-tree
- Once an FP-tree has been constructed, it uses a recursive divide-and-conquer approach to mine the frequent itemsets
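The construction step can be sketched in Python as below. This is a simplified illustration under stated assumptions: the header table is kept as a plain list of nodes per item instead of linked node pointers, items are ordered by descending frequency with ties broken alphabetically, and the input is the ten-transaction database from the earlier slide, not the database shown in the FP-tree figures. The names `FPNode`, `build_fp_tree` and `conditional_pattern_base` are chosen for this example.

```python
from collections import Counter

class FPNode:
    def __init__(self, item, parent):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}  # item -> FPNode

def build_fp_tree(transactions, min_count):
    """Insert each transaction as a path, sharing common prefixes."""
    counts = Counter(i for t in transactions for i in set(t))
    keep = {i for i, n in counts.items() if n >= min_count}

    def order(item):
        return (-counts[item], item)  # most frequent items first

    root = FPNode(None, None)
    header = {}  # item -> list of tree nodes carrying that item
    for t in transactions:
        node = root
        for item in sorted(set(t) & keep, key=order):
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                header.setdefault(item, []).append(child)
            child.count += 1
            node = child
    return root, header

def conditional_pattern_base(item, header):
    """Prefix paths leading to each node of `item`, with that node's count."""
    base = []
    for node in header.get(item, []):
        path = []
        parent = node.parent
        while parent is not None and parent.item is not None:
            path.append(parent.item)
            parent = parent.parent
        if path:
            base.append((tuple(reversed(path)), node.count))
    return base

DB = [{"A", "B", "E"}, {"B", "D"}, {"A", "B", "E"}, {"A", "C"}, {"B", "C"},
      {"A", "C"}, {"A", "B"}, {"A", "B", "C", "E"}, {"A", "B", "C"}, {"A", "C", "E"}]
root, header = build_fp_tree(DB, min_count=2)
base_E = conditional_pattern_base("E", header)
```

The recursive mining step would build a new, smaller FP-tree from each conditional pattern base; it is left out to keep the sketch short.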
21FP-tree construction
[Figure: FP-tree after reading TID 1 (null, A:1, B:1) and after reading TID 2 (null, A:1, B:1, plus B:1, C:1, D:1)]
22FP-Tree Construction
[Figure: transaction database, the resulting FP-tree (node counts include A:7, B:5, B:3, C:3, C:1, D:1, E:1) and its header table]
Pointers are used to assist frequent itemset generation
23FP-growth
Conditional pattern base for D: P = {(A:1, B:1, C:1), (A:1, B:1), (A:1, C:1), (A:1), (B:1, C:1)}
Recursively apply FP-growth on P
Frequent itemsets found (with sup > 1): AD, BD, CD, ACD, BCD
[Figure: FP-tree paths that end in D]
25Compact Representation of Frequent Itemsets
- Some itemsets are redundant because they have the same support as their supersets
- Number of frequent itemsets
- Need a compact representation
26Maximal Frequent Itemset
An itemset is maximal frequent if none of its
immediate supersets is frequent
Maximal Itemsets
Border
Infrequent Itemsets
27Closed Itemset
- An itemset is closed if none of its immediate
supersets has the same support as the itemset
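The two definitions can be checked with a short Python sketch. It reuses the ten-transaction database from the earlier slide with a minimum support count of 4, a threshold chosen here only because it gives an itemset that is frequent but not closed. Function names are illustrative, and the frequent itemsets are found by exhaustive enumeration for brevity.

```python
from itertools import combinations

DB = [{"A", "B", "E"}, {"B", "D"}, {"A", "B", "E"}, {"A", "C"}, {"B", "C"},
      {"A", "C"}, {"A", "B"}, {"A", "B", "C", "E"}, {"A", "B", "C"}, {"A", "C", "E"}]

def frequent_itemsets(transactions, min_count):
    """Exhaustive enumeration; fine for five items, not for real data."""
    items = sorted({i for t in transactions for i in t})
    out = {}
    for k in range(1, len(items) + 1):
        for c in combinations(items, k):
            n = sum(1 for t in transactions if t.issuperset(c))
            if n >= min_count:
                out[frozenset(c)] = n
    return out

def maximal_and_closed(frequent):
    """Classify frequent itemsets by looking at their immediate frequent supersets.

    An immediate superset with the same support as a frequent itemset is itself
    frequent, so checking only frequent supersets is enough for closedness.
    """
    maximal, closed = set(), set()
    for itemset, n in frequent.items():
        supersets = [s for s in frequent
                     if len(s) == len(itemset) + 1 and itemset < s]
        if not supersets:
            maximal.add(itemset)
        if all(frequent[s] != n for s in supersets):
            closed.add(itemset)
    return maximal, closed

frequent = frequent_itemsets(DB, min_count=4)
maximal, closed = maximal_and_closed(frequent)
```

In this run {E} is frequent but not closed, because {A, E} has the same support count of 4; every maximal itemset is also closed.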
28Maximal vs Closed Itemsets
Transaction Ids
Not supported by any transactions
29Maximal vs Closed Frequent Itemsets
Closed but not maximal
Minimum support 2
Closed and maximal
# Closed = 9, # Maximal = 4
30Maximal vs Closed Itemsets
31Association Rule Mining and FIM
32Research Questions
- How to efficiently enumerate Maximal Frequent Itemsets?
- How about Closed Frequent Itemsets?
33Association Rule Mining
- Given a set of transactions, find rules that will
predict the occurrence of an item based on the
occurrences of other items in the transaction
Market-Basket transactions
Example of Association Rules
{Diaper} → {Beer}, {Beer, Bread} → {Milk}
Implication means co-occurrence, not causality!
34Definition Association Rule
- Association Rule
- An implication expression of the form X → Y, where X and Y are itemsets
- Example: {Milk, Diaper} → {Beer}
- Rule Evaluation Metrics
- Support (s)
- Fraction of transactions that contain both X and Y
- Confidence (c)
- Measures how often items in Y appear in transactions that contain X
35Association Rule Mining Task
- Given a set of transactions T, the goal of association rule mining is to find all rules having
- support ≥ minsup threshold
- confidence ≥ minconf threshold
- Brute-force approach
- List all possible association rules
- Compute the support and confidence for each rule
- Prune rules that fail the minsup and minconf thresholds
- ⇒ Computationally prohibitive!
36Mining Association Rules
Example of Rules
{Milk, Diaper} → {Beer} (s = 0.4, c = 0.67)
{Milk, Beer} → {Diaper} (s = 0.4, c = 1.0)
{Diaper, Beer} → {Milk} (s = 0.4, c = 0.67)
{Beer} → {Milk, Diaper} (s = 0.4, c = 0.67)
{Diaper} → {Milk, Beer} (s = 0.4, c = 0.5)
{Milk} → {Diaper, Beer} (s = 0.4, c = 0.5)
- Observations
- All the above rules are binary partitions of the same itemset {Milk, Diaper, Beer}
- Rules originating from the same itemset have identical support but can have different confidence
- Thus, we may decouple the support and confidence requirements
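The support and confidence figures quoted on this slide can be reproduced with a short Python sketch. The market-basket table itself is not part of this transcript, so the five baskets below are an assumption: they are a commonly used textbook example that is consistent with every value quoted here. The function names are illustrative.

```python
from itertools import combinations

# Assumed market-basket data (not shown in the transcript).
BASKETS = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def count(itemset, baskets):
    """Support count: number of baskets containing all items of `itemset`."""
    return sum(1 for b in baskets if set(itemset) <= b)

def rule_metrics(lhs, rhs, baskets):
    """Return (support, confidence) of the rule lhs -> rhs.

    Assumes lhs occurs in at least one basket.
    """
    both = count(set(lhs) | set(rhs), baskets)
    return both / len(baskets), both / count(lhs, baskets)

def binary_partition_rules(itemset, baskets):
    """All rules lhs -> rhs where lhs and rhs split `itemset` into two non-empty parts."""
    items = sorted(itemset)
    rules = []
    for r in range(1, len(items)):
        for lhs in combinations(items, r):
            rhs = tuple(i for i in items if i not in lhs)
            s, c = rule_metrics(lhs, rhs, baskets)
            rules.append((lhs, rhs, s, c))
    return rules

rules = binary_partition_rules({"Milk", "Diaper", "Beer"}, BASKETS)
for lhs, rhs, s, c in rules:
    print(", ".join(lhs), "->", ", ".join(rhs), f"(s={s:.2f}, c={c:.2f})")
```

All six rules share the support of the underlying itemset, while the confidence depends on the support of the left-hand side alone.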
37Mining Association Rules
- Two-step approach
- Frequent Itemset Generation
- Generate all itemsets whose support ≥ minsup
- Rule Generation
- Generate high confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset
- Frequent itemset generation is still computationally expensive
38Computational Complexity
- Given d unique items
- Total number of itemsets = 2^d
- Total number of possible association rules: R = 3^d - 2^(d+1) + 1
If d = 6, R = 602 rules
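The count of 602 rules for d = 6 can be checked in Python by summing over itemset sizes: every k-itemset (k ≥ 2) splits into a non-empty left-hand and right-hand side in 2^k - 2 ways. The closed form 3^d - 2^(d+1) + 1 used for comparison is the standard expression for this sum; the function names are illustrative.

```python
from math import comb

def count_rules(d):
    """Sum over all itemsets of size k >= 2 of the 2^k - 2 binary partitions."""
    return sum(comb(d, k) * (2 ** k - 2) for k in range(2, d + 1))

def closed_form(d):
    """Closed-form expression for the same count."""
    return 3 ** d - 2 ** (d + 1) + 1

print(count_rules(6), closed_form(6))
```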
39Rule Generation
- Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f → L - f satisfies the minimum confidence requirement
- If {A,B,C,D} is a frequent itemset, candidate rules:
- ABC → D, ABD → C, ACD → B, BCD → A, A → BCD, B → ACD, C → ABD, D → ABC, AB → CD, AC → BD, AD → BC, BC → AD, BD → AC, CD → AB
- If |L| = k, then there are 2^k - 2 candidate association rules (ignoring L → ∅ and ∅ → L)
40Rule Generation
- How to efficiently generate rules from frequent itemsets?
- In general, confidence does not have an anti-monotone property
- c(ABC → D) can be larger or smaller than c(AB → D)
- But confidence of rules generated from the same itemset has an anti-monotone property
- e.g., L = {A,B,C,D}: c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)
- Confidence is anti-monotone w.r.t. number of items on the RHS of the rule
41Rule Generation for Apriori Algorithm
Lattice of rules
Low Confidence Rule
42Rule Generation for Apriori Algorithm
- Candidate rule is generated by merging two rules that share the same prefix in the rule consequent
- join(CD → AB, BD → AC) would produce the candidate rule D → ABC
- Prune rule D → ABC if its subset AD → BC does not have high confidence
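A Python sketch of level-wise rule generation with confidence-based pruning. It is a simplification of the merge described on this slide: consequents are merged by set union and then filtered, which yields the same surviving candidates as the prefix-based join followed by pruning. The basket data is an assumed textbook example (the table is not in this transcript), and `generate_rules` is a name chosen for this example.

```python
from itertools import combinations

# Assumed market-basket data (not shown in the transcript).
BASKETS = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def count(itemset, baskets):
    return sum(1 for b in baskets if set(itemset) <= b)

def generate_rules(itemset, baskets, minconf):
    """Grow rule consequents level by level, keeping only confident rules."""
    itemset = frozenset(itemset)
    total = count(itemset, baskets)
    rules = []
    consequents = [frozenset([i]) for i in sorted(itemset)]
    while consequents:
        kept = []
        for rhs in consequents:
            lhs = itemset - rhs
            if not lhs:
                continue  # the rule {} -> itemset is ignored
            conf = total / count(lhs, baskets)
            if conf >= minconf:
                rules.append((lhs, rhs, conf))
                kept.append(rhs)
        if not kept:
            break
        size = len(kept[0]) + 1
        kept_set = set(kept)
        # Merge surviving consequents into consequents one item larger.
        merged = {a | b for a, b in combinations(kept, 2) if len(a | b) == size}
        # Prune: every smaller consequent must itself have produced a confident rule.
        consequents = [m for m in merged
                       if all(frozenset(s) in kept_set
                              for s in combinations(m, size - 1))]
    return rules

rules = generate_rules({"Milk", "Diaper", "Beer"}, BASKETS, minconf=0.6)
```

Because confidence cannot increase when items move from the left-hand side to the right-hand side, a consequent is only extended if all of its sub-consequents already gave confident rules.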
43Beyond Itemsets
- Sequence Mining
- Finding frequent subsequences from a collection of sequences
- Graph Mining
- Finding frequent (connected) subgraphs from a collection of graphs
- Tree Mining
- Finding frequent (embedded) subtrees from a set of trees/graphs
- Geometric Structure Mining
- Finding frequent substructures from 3-D or 2-D geometric graphs
- Among others
44Frequent Pattern Mining
45Why Is Frequent Pattern Mining So Important?
- Application Domains
- Business, biology, chemistry, WWW, computer/networking security
- Summarizing the underlying datasets, providing key insights
- Basic tools for other data mining tasks
- Association rule mining
- Classification
- Clustering
- Change Detection
- etc
46- Network motifs: recurring patterns that occur significantly more often than in randomized networks
- Do motifs have specific roles in the network?
- Many possible distinct subgraphs
47The 13 three-node connected subgraphs
48The 199 4-node directed connected subgraphs
And it grows fast for larger subgraphs: 9,364 5-node subgraphs, 1,530,843 6-node subgraphs
49Finding network motifs an overview
- Generation of a suitable random ensemble (reference networks)
- Network motifs detection process
- Count how many times each subgraph appears
- Compute statistical significance for each subgraph: probability of appearing in random networks as often as in the real network (P-value or Z-score)
50Ensemble of networks
Real = 5, Rand = 0.5 ± 0.6, Z-score (standard deviations) = 7.5
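The Z-score on this slide is the distance between the real count and the mean count in the random ensemble, measured in standard deviations. A minimal Python sketch follows; the function names and the toy list of random counts are made up for illustration, and using the population standard deviation is an assumption.

```python
from statistics import mean, pstdev

def z_score(real_count, rand_mean, rand_std):
    """Standard deviations by which the real count exceeds the random mean."""
    return (real_count - rand_mean) / rand_std

def motif_z_score(real_count, random_counts):
    """Same, with mean and standard deviation taken from an ensemble of counts."""
    return z_score(real_count, mean(random_counts), pstdev(random_counts))

# Values from the slide: Real = 5, Rand = 0.5 +/- 0.6
print(round(z_score(5, 0.5, 0.6), 2))

# Hypothetical ensemble of subgraph counts in four randomized networks.
print(motif_z_score(5, [0, 1, 0, 1]))
```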
51Performance and Scalability Apriori
Implementation
52Apriori
R. Agrawal and R. Srikant. Fast algorithms for
mining association rules. VLDB, 487-499, 1994
53Challenges of Frequent Itemset Mining
- Challenges
- Multiple scans of transaction database
- Huge number of candidates
- Tedious workload of support counting for candidates
- Improving Apriori: general ideas
- Reduce passes of transaction database scans
- Shrink number of candidates
- Facilitate support counting of candidates
54Reducing Number of Comparisons
- Candidate counting
- Scan the database of transactions to determine the support of each candidate itemset
- To reduce the number of comparisons, store the candidates in a hash structure
- Instead of matching each transaction against every candidate, match it against candidates contained in the hashed buckets
55Generate Hash Tree
- Suppose you have 15 candidate itemsets of length 3:
- {1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}
- You need
- Hash function
- Max leaf size: max number of itemsets stored in a leaf node (if number of candidate itemsets exceeds max leaf size, split the node)
56Association Rule Discovery Hash tree
Hash Function
Candidate Hash Tree
1,4,7
3,6,9
2,5,8
Hash on 1, 4 or 7
57Association Rule Discovery Hash tree
Hash Function
Candidate Hash Tree
1,4,7
3,6,9
2,5,8
Hash on 2, 5 or 8
58Association Rule Discovery Hash tree
Hash Function
Candidate Hash Tree
1,4,7
3,6,9
2,5,8
Hash on 3, 6 or 9
59Subset Operation
Given a transaction t, what are the possible
subsets of size 3?
60Subset Operation Using Hash Tree
transaction
61Subset Operation Using Hash Tree
transaction
1 3 6
3 4 5
1 5 9
62Subset Operation Using Hash Tree
transaction
1 3 6
3 4 5
1 5 9
Match transaction against 11 out of 15 candidates
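The hash tree and the subset operation can be sketched in Python as follows. The sketch assumes the hash function h(x) = x mod 3 (which groups 1,4,7 / 2,5,8 / 3,6,9 as on the slides), a maximum leaf size of 3, and the transaction {1, 2, 3, 5, 6}; the transaction is not given in this transcript and is chosen only for illustration. The tree built here is not guaranteed to have the same shape as the one in the figures, so the number of candidates actually compared may differ from the figure quoted above.

```python
CANDIDATES = [(1, 4, 5), (1, 2, 4), (4, 5, 7), (1, 2, 5), (4, 5, 8),
              (1, 5, 9), (1, 3, 6), (2, 3, 4), (5, 6, 7), (3, 4, 5),
              (3, 5, 6), (3, 5, 7), (6, 8, 9), (3, 6, 7), (3, 6, 8)]
K = 3          # length of every candidate
MAX_LEAF = 3   # assumed maximum number of itemsets per leaf

def h(item):
    return item % 3  # 1,4,7 -> 1; 2,5,8 -> 2; 3,6,9 -> 0

class Node:
    def __init__(self):
        self.children = {}  # hash value -> Node (internal nodes)
        self.itemsets = []  # stored candidates (leaf nodes)
        self.is_leaf = True

def insert(node, itemset, depth=0):
    """Descend by hashing the item at position `depth`; split overfull leaves."""
    if not node.is_leaf:
        child = node.children.setdefault(h(itemset[depth]), Node())
        insert(child, itemset, depth + 1)
        return
    node.itemsets.append(itemset)
    if len(node.itemsets) > MAX_LEAF and depth < K:
        node.is_leaf = False
        pending, node.itemsets = node.itemsets, []
        for s in pending:
            insert(node, s, depth)

def candidates_in(node, transaction, start=0, depth=0, found=None):
    """Collect candidates contained in the sorted tuple `transaction`."""
    found = set() if found is None else found
    if node.is_leaf:
        items = set(transaction)
        found.update(c for c in node.itemsets if set(c) <= items)
        return found
    # Pick the next item, leaving enough items for the remaining positions.
    for i in range(start, len(transaction) - (K - depth) + 1):
        child = node.children.get(h(transaction[i]))
        if child is not None:
            candidates_in(child, transaction, i + 1, depth + 1, found)
    return found

root = Node()
for c in CANDIDATES:
    insert(root, c)

matched = candidates_in(root, (1, 2, 3, 5, 6))
```

Only the leaves reachable by hashing items of the transaction are visited, so most candidates are never compared against it.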
63Prefix Tree Representation
Efficient Implementations of Apriori and Eclat, Christian Borgelt, FIMI'03
64Prefix Tree
65Prefix Tree Structure for Counting
66Other key optimization
- Recording the items
- Why is this relevant?
- Transaction Tree
- Organize transaction into trees
- Count through two trees
67Scalability
- How to handle very large datasets?
- The dataset cannot be stored in main memory
- Performance of out-of-core datasets / performance of in-core datasets
68Partition Scan Database Only Twice
- Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
- Scan 1: partition database and find local frequent patterns
- Scan 2: consolidate global frequent patterns
- A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. In VLDB'95
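The two-scan idea can be sketched in Python as below. This is an illustration, not the algorithm from the Savasere et al. paper: local mining simply enumerates all subsets of each transaction, which is only feasible for very short transactions, and the names `local_frequent` and `partition_mine` are chosen for this example. The data is the ten-transaction database from the earlier slide.

```python
from itertools import combinations
from math import ceil

def local_frequent(partition, minsup):
    """Itemsets whose support within this partition reaches minsup."""
    min_count = ceil(minsup * len(partition))
    counts = {}
    for t in partition:
        items = sorted(t)
        for k in range(1, len(items) + 1):
            for c in combinations(items, k):
                counts[c] = counts.get(c, 0) + 1
    return {c for c, n in counts.items() if n >= min_count}

def partition_mine(transactions, minsup, n_parts):
    size = ceil(len(transactions) / n_parts)
    parts = [transactions[i:i + size] for i in range(0, len(transactions), size)]

    # Scan 1: the union of local frequent itemsets is the global candidate set.
    candidates = set()
    for part in parts:
        candidates |= local_frequent(part, minsup)

    # Scan 2: count every candidate once over the whole database.
    counts = {c: 0 for c in candidates}
    for t in transactions:
        for c in candidates:
            if t.issuperset(c):
                counts[c] += 1
    return {c: n for c, n in counts.items() if n >= minsup * len(transactions)}

DB = [{"A", "B", "E"}, {"B", "D"}, {"A", "B", "E"}, {"A", "C"}, {"B", "C"},
      {"A", "C"}, {"A", "B"}, {"A", "B", "C", "E"}, {"A", "B", "C"}, {"A", "C", "E"}]
result = partition_mine(DB, minsup=0.5, n_parts=2)
```

An itemset that is infrequent in every partition cannot reach the global threshold, so no globally frequent itemset is missed by the first scan.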
69DHP Reduce the Number of Candidates
- A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent
- Candidates: a, b, c, d, e
- Hash entries: {ab, ad, ae}, {bd, be, de}
- Frequent 1-itemsets: a, b, d, e
- ab is not a candidate 2-itemset if the sum of the counts of {ab, ad, ae} is below the support threshold
- J. Park, M. Chen, and P. Yu. An effective hash-based algorithm for mining association rules. In SIGMOD'95
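A Python sketch of the hash-based filter for 2-itemsets. The bucket function and the number of buckets are arbitrary choices made for this example (they are not from the DHP paper), and the data is the ten-transaction database from the earlier slide.

```python
from collections import Counter
from itertools import combinations

def bucket(pair, n_buckets):
    """Deterministic toy hash of an item pair (an assumption of this sketch)."""
    return sum(ord(ch) for item in pair for ch in item) % n_buckets

def dhp_candidate_pairs(transactions, min_count, n_buckets=7):
    item_counts = Counter()
    bucket_counts = [0] * n_buckets
    # Single scan: count items and hash every pair of each transaction into a bucket.
    for t in transactions:
        items = sorted(t)
        item_counts.update(items)
        for pair in combinations(items, 2):
            bucket_counts[bucket(pair, n_buckets)] += 1
    frequent_items = sorted(i for i, n in item_counts.items() if n >= min_count)
    # A pair stays a candidate only if both items are frequent AND its bucket
    # count reaches the threshold; the bucket count is an upper bound on the
    # pair's support because several pairs may share a bucket.
    return [p for p in combinations(frequent_items, 2)
            if bucket_counts[bucket(p, n_buckets)] >= min_count]

DB = [{"A", "B", "E"}, {"B", "D"}, {"A", "B", "E"}, {"A", "C"}, {"B", "C"},
      {"A", "C"}, {"A", "B"}, {"A", "B", "C", "E"}, {"A", "B", "C"}, {"A", "C", "E"}]
pairs = dhp_candidate_pairs(DB, min_count=5)
```

With these choices the pair (B, C) is dropped before counting because its bucket total is below the threshold, even though B and C are each frequent.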
70Sampling for Frequent Patterns
- Select a sample of the original database, mine frequent patterns within the sample using Apriori
- Scan the database once to verify the frequent itemsets found in the sample; only borders of the closure of frequent patterns are checked
- Example: check abcd instead of ab, ac, ..., etc.
- Scan the database again to find missed frequent patterns
- H. Toivonen. Sampling large databases for association rules. In VLDB'96
71DIC Reduce Number of Scans
- Once both A and D are determined frequent, the counting of AD begins
- Once all length-2 subsets of BCD are determined frequent, the counting of BCD begins
[Figure: itemset lattice over A, B, C, D (from the 1-itemsets up to ABCD) and a timeline over the transactions showing when Apriori and DIC start counting 1-itemsets, 2-itemsets and 3-itemsets]
S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In SIGMOD'97
72References
- R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. SIGMOD, 207-216, 1993.
- R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB, 487-499, 1994.
- R. J. Bayardo. Efficiently mining long patterns from databases. SIGMOD, 85-93, 1998.
73References
- Christian Borgelt, Efficient Implementations of Apriori and Eclat, FIMI'03
- Ferenc Bodon, A fast APRIORI implementation, FIMI'03
- Ferenc Bodon, A Survey on Frequent Itemset Mining, Technical Report, Budapest University of Technology and Economics, 2006
74Important websites
- FIMI workshop
- Not only Apriori and FIM
- FP-tree, ECLAT, Closed, Maximal
- http://fimi.cs.helsinki.fi/
- Christian Borgelt's website
- http://www.borgelt.net/software.html
- Ferenc Bodon's website
- http://www.cs.bme.hu/bodon/en/apriori/