Title: Frequent Patterns I
1. Frequent Patterns I
2. Outline
- Introduction
- What is frequent pattern mining?
- What is association rule mining?
- Methods for association rule mining
- Extensions of frequent patterns
3. Introduction
- Topics
- Association Rule
- Sequential Patterns
- Graph Mining
- Clustering and Outlier Detection
- Classification and Prediction
- Regression
- Pattern Interestingness
- Dimensionality Reduction
4. Introduction
- Applications
- Bioinformatics
- Web mining
- Text mining
- Visualization
- Financial data analysis
- Intrusion detection
5. Introduction
- Data mining and KDD (SIGKDD CDROM)
  - Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.
  - Journals: Data Mining and Knowledge Discovery, KDD Explorations
- Database systems (SIGMOD CD ROM)
  - Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA
  - Journals: ACM-TODS, IEEE-TKDE, JIIS, J. ACM, etc.
- AI & Machine Learning
  - Conferences: Machine learning (ICML), AAAI, IJCAI, COLT (Learning Theory), etc.
  - Journals: Machine Learning, Artificial Intelligence, etc.
6. Introduction
- Statistics
  - Conferences: Joint Stat. Meeting, etc.
  - Journals: Annals of Statistics, etc.
- Bioinformatics
  - Conferences: ISMB, RECOMB, PSB, CSB, BIBE, etc.
  - Journals: J. of Computational Biology, Bioinformatics, etc.
- Visualization
  - Conference proceedings: CHI, ACM-SIGGraph, etc.
  - Journals: IEEE Trans. Visualization and Computer Graphics, etc.
7. What Is Frequent Pattern Mining?
- Frequent patterns: patterns (sets of items, sequences, etc.) that occur frequently in a database [AIS93]
- Frequent pattern mining: finding regularities in data
  - What products were often purchased together?
  - Beer and diapers?!
  - What are the subsequent purchases after buying a car?
  - Can we automatically profile customers?
8. Basics
- Itemset: a set of items
  - E.g., acm = {a, c, m}
- Support of an itemset: the number of transactions that contain it
  - Sup(acm) = 3
- Given min_sup = 3, acm is a frequent pattern
- Frequent pattern mining: find all frequent patterns in a database
[Table: transaction database TDB]
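The definitions above can be sketched in a few lines of Python. The five-transaction database below is an assumption made for illustration (the slide's TDB table is not reproduced here); it is chosen so that Sup(acm) = 3. The helper names `support` and `is_frequent` are likewise illustrative.

```python
# Toy transaction database (hypothetical, chosen so that Sup(acm) = 3).
TDB = [
    {"f", "a", "c", "d", "g", "i", "m", "p"},
    {"a", "b", "c", "f", "l", "m", "o"},
    {"b", "f", "h", "j", "o"},
    {"b", "c", "k", "s", "p"},
    {"a", "f", "c", "e", "l", "p", "m", "n"},
]


def support(itemset, transactions):
    """Count the transactions that contain every item of `itemset`."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t)


def is_frequent(itemset, transactions, min_sup):
    """An itemset is frequent when its support reaches min_sup."""
    return support(itemset, transactions) >= min_sup


print(support("acm", TDB))         # 3
print(is_frequent("acm", TDB, 3))  # True
```

Raising min_sup to 4 would make acm infrequent on this toy data, since only three transactions contain all of a, c, and m.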
9. Association Rule Mining: A Road Map
- Boolean vs. quantitative associations
  - age(x, 30..39) ∧ income(x, 42..48K) → buys(x, car) [1%, 75%]
- Single-dimensional vs. multi-dimensional associations
- Single-level vs. multiple-level analysis
  - What brands of beers are associated with what brands of diapers?
10. Extensions & Applications
- Correlation, causality analysis & mining interesting rules
- Max-patterns and frequent closed itemsets
- Sequential patterns
- Periodic patterns
- Structural patterns
11. Frequent Pattern Mining Methods
- Apriori and its variations/improvements
- Mining frequent patterns without candidate generation
- Mining max-patterns and closed itemsets
- Mining multi-dimensional, multi-level frequent patterns with flexible support constraints
- Interestingness: correlation and causality
12. Apriori: Candidate Generation-and-Test
- Any subset of a frequent itemset must also be frequent: an anti-monotone property
  - A transaction containing {beer, diaper, nuts} also contains {beer, diaper}
  - {beer, diaper, nuts} is frequent ⇒ {beer, diaper} must also be frequent
- No superset of any infrequent itemset should be generated or tested
  - Many item combinations can be pruned
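A small sketch of how the anti-monotone property is used for pruning: a k-itemset can be discarded without counting if any of its (k-1)-subsets is infrequent. The basket data, min_sup = 2, and the helper names are assumptions for illustration only.

```python
from itertools import combinations


def support(itemset, transactions):
    """Number of transactions containing every item of `itemset`."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t)


def has_infrequent_subset(candidate, frequent_smaller):
    """True if some (k-1)-subset of the k-itemset `candidate` is not frequent."""
    k = len(candidate)
    return any(frozenset(s) not in frequent_smaller
               for s in combinations(candidate, k - 1))


# Hypothetical market-basket data.
baskets = [
    frozenset({"beer", "diaper", "nuts"}),
    frozenset({"beer", "diaper"}),
    frozenset({"beer", "milk"}),
    frozenset({"diaper", "nuts"}),
]
min_sup = 2

# A superset can never have higher support than its subset.
assert (support({"beer", "diaper", "nuts"}, baskets)
        <= support({"beer", "diaper"}, baskets))

# Frequent pairs on this data: {beer, diaper} and {diaper, nuts}.
items = sorted(set().union(*baskets))
frequent_pairs = {frozenset(p) for p in combinations(items, 2)
                  if support(p, baskets) >= min_sup}

# {beer, nuts} is infrequent, so {beer, diaper, nuts} is pruned untested.
print(has_infrequent_subset(("beer", "diaper", "nuts"), frequent_pairs))  # True
```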
13. Apriori-based Mining
- Generate length-(k+1) candidate itemsets from length-k frequent itemsets, and
- Test the candidates against the DB
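14. Apriori Algorithm
- A level-wise, candidate-generation-and-test approach (Agrawal & Srikant 1994)
[Figure: example run on database D with min_sup = 2. Scan D to count the 1-candidates and obtain the frequent 1-itemsets; form the 2-candidates, scan D and count to obtain the frequent 2-itemsets; form the 3-candidates and scan D to obtain the frequent 3-itemsets.]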
15. The Apriori Algorithm
- Ck: candidate itemsets of size k
- Lk: frequent itemsets of size k
- L1 = {frequent 1-itemsets}
- for (k = 1; Lk ≠ ∅; k++) do
  - Ck+1 = candidates generated from Lk
  - for each transaction t in database do: increment the count of all candidates in Ck+1 that are contained in t
  - Lk+1 = candidates in Ck+1 with min_support
- return ∪k Lk
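The level-wise loop above might be sketched in Python as follows. This is a minimal illustration, not a tuned implementation: candidates are formed by taking unions of two frequent k-itemsets and keeping those whose k-subsets are all frequent, which yields the same candidate set as a join-and-prune step but is simpler to write. The four-transaction database and min_sup = 2 are assumptions for illustration.

```python
from itertools import combinations


def apriori(transactions, min_sup):
    """Return {frozenset itemset: support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    # L1: count single items in one scan.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {c: n for c, n in counts.items() if n >= min_sup}

    result = dict(Lk)
    k = 1
    while Lk:
        # C(k+1): unions of two frequent k-itemsets with k+1 items,
        # kept only if every k-subset is frequent (Apriori pruning).
        candidates = set()
        for a, b in combinations(Lk, 2):
            union = a | b
            if len(union) == k + 1 and all(
                frozenset(s) in Lk for s in combinations(union, k)
            ):
                candidates.add(union)

        # One database scan: count the candidates contained in each transaction.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1

        # L(k+1): candidates that reach min_sup.
        Lk = {c: n for c, n in counts.items() if n >= min_sup}
        result.update(Lk)
        k += 1
    return result


# Hypothetical database, min_sup = 2.
D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
freq = apriori(D, 2)
print(len(freq))                   # 9
print(freq[frozenset({2, 3, 5})])  # 2
```

On this toy data the loop stops after level 3, because the single frequent 3-itemset cannot be joined with anything to form a 4-candidate.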
16. Important Details of Apriori
- How to generate candidates?
  - Step 1: self-joining Lk
  - Step 2: pruning
- How to count supports of candidates?
17. How to Generate Candidates?
- Suppose the items in Lk-1 are listed in an order
- Step 1: self-joining Lk-1
  - INSERT INTO Ck
  - SELECT p.item1, p.item2, ..., p.itemk-1, q.itemk-1
  - FROM Lk-1 p, Lk-1 q
  - WHERE p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
- Step 2: pruning
  - For each itemset c in Ck do
    - For each (k-1)-subset s of c do: if (s is not in Lk-1) then delete c from Ck
18. Example of Candidate Generation
- L3 = {abc, abd, acd, ace, bcd}
- Self-joining: L3 * L3
  - abcd from abc and abd
  - acde from acd and ace
- Pruning:
  - acde is removed because ade is not in L3
- C4 = {abcd}
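The self-join and pruning steps can be sketched as one Python function and checked against the L3 example. Itemsets are represented as tuples of items in sorted order; the function name `apriori_gen` and this representation are choices made for the sketch.

```python
from itertools import combinations


def apriori_gen(L_prev):
    """Generate Ck from Lk-1; itemsets are tuples of items in sorted order."""
    L_prev = sorted(set(L_prev))
    frequent = set(L_prev)

    # Step 1: self-join. Combine p and q when they agree on the first
    # k-2 items and p's last item precedes q's last item.
    candidates = []
    for i, p in enumerate(L_prev):
        for q in L_prev[i + 1:]:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                candidates.append(p + (q[-1],))

    # Step 2: prune. Drop any candidate with a (k-1)-subset outside Lk-1.
    return [c for c in candidates
            if all(s in frequent for s in combinations(c, len(c) - 1))]


L3 = [tuple(s) for s in ["abc", "abd", "acd", "ace", "bcd"]]
C4 = apriori_gen(L3)
print(["".join(c) for c in C4])  # ['abcd']
```

The join produces abcd and acde; acde is then dropped because its subset ade is not in L3, matching the example.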
19. How to Count Supports of Candidates?
- Why is counting supports of candidates a problem?
  - The total number of candidates can be huge
  - One transaction may contain many candidates
- Method:
  - Candidate itemsets are stored in a hash-tree
  - A leaf node of the hash-tree contains a list of itemsets and counts
  - An interior node contains a hash table
  - Subset function: finds all candidates contained in a transaction
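One way the hash-tree idea could look in Python is sketched below. It is a simplified illustration under several assumptions: candidates are sorted tuples of length k, interior nodes hash the item at the current depth into `fanout` buckets, and a leaf is split once it holds more than `max_leaf` itemsets. The class names, parameters, and toy database are all hypothetical.

```python
from itertools import combinations


class Node:
    def __init__(self):
        self.children = None  # interior node: dict mapping hash bucket -> Node
        self.counts = {}      # leaf node: dict mapping itemset -> support count


class HashTree:
    """Hash tree for candidate k-itemsets stored as sorted tuples."""

    def __init__(self, k, fanout=3, max_leaf=2):
        self.k, self.fanout, self.max_leaf = k, fanout, max_leaf
        self.root = Node()

    def _bucket(self, item):
        return hash(item) % self.fanout

    def insert(self, itemset):
        node, depth = self.root, 0
        while node.children is not None:  # walk down the interior nodes
            node = node.children.setdefault(self._bucket(itemset[depth]), Node())
            depth += 1
        node.counts[itemset] = 0
        if len(node.counts) > self.max_leaf and depth < self.k:
            self._split(node, depth)

    def _split(self, node, depth):
        # Turn an overfull leaf into an interior node hashing on item `depth`.
        old, node.counts, node.children = node.counts, {}, {}
        for itemset, n in old.items():
            child = node.children.setdefault(self._bucket(itemset[depth]), Node())
            child.counts[itemset] = n
        for child in node.children.values():
            if len(child.counts) > self.max_leaf and depth + 1 < self.k:
                self._split(child, depth + 1)

    def count(self, transaction):
        """Subset function: increment every candidate contained in the transaction."""
        t = tuple(sorted(set(transaction)))
        self._visit(self.root, t, set(t), 0, set())

    def _visit(self, node, t, tset, start, visited):
        if node.children is None:
            if id(node) not in visited:  # a leaf may be reached by several paths
                visited.add(id(node))
                for itemset in node.counts:
                    if tset.issuperset(itemset):
                        node.counts[itemset] += 1
            return
        for i in range(start, len(t)):
            child = node.children.get(self._bucket(t[i]))
            if child is not None:
                self._visit(child, t, tset, i + 1, visited)

    def all_counts(self):
        stack, out = [self.root], {}
        while stack:
            node = stack.pop()
            if node.children is None:
                out.update(node.counts)
            else:
                stack.extend(node.children.values())
        return out


# Hypothetical database and candidate 2-itemsets.
D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
tree = HashTree(k=2)
for c in combinations([1, 2, 3, 5], 2):
    tree.insert(c)
for t in D:
    tree.count(t)
print(tree.all_counts()[(2, 5)])  # 3
```

The benefit over testing every candidate against every transaction is that a transaction only visits the leaves its own items hash to, so most candidates are never compared at all.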
20. Challenges of Frequent Pattern Mining
- Challenges
  - Multiple scans of transaction database
  - Huge number of candidates
  - Tedious work of support counting for candidates
- Improving Apriori: general ideas
  - Reduce number of transaction database scans
  - Shrink number of candidates
  - Facilitate support counting of candidates
21. DIC: Reduce Number of Scans
- Once both A and D are determined frequent, the counting of AD can begin
- Once all length-2 subsets of BCD are determined frequent, the counting of BCD can begin
- S. Brin, R. Motwani, J. Ullman, and S. Tsur, 1997
[Figure: itemset lattice over items A, B, C, D (1-itemsets A, B, C, D; 2-itemsets AB, AC, BC, AD, BD, CD; 3-itemsets ABC, ABD, ACD, BCD; 4-itemset ABCD), with a timeline over the transactions comparing when Apriori and DIC start counting 1-, 2-, and 3-itemsets.]
22. DHP: Reduce the Number of Candidates
- A hashing bucket count < min_sup ⇒ every candidate in the bucket is infrequent
  - Candidates: a, b, c, d, e
  - Hash entries: {ab, ad, ae}, {bd, be, de}
  - Large 1-itemsets: a, b, d, e
  - The sum of counts of ab, ad, ae < min_sup ⇒ ab should not be a candidate 2-itemset
- J. Park, M. Chen, and P. Yu, 1995
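A minimal sketch of the hashing idea for 2-itemsets: while counting single items in the first scan, every pair in each transaction is also hashed into a small table of bucket counts, and a pair of frequent items becomes a candidate only if its bucket count reaches min_sup. The hash function `(10 * x + y) % n_buckets`, the integer item IDs, and the toy database are assumptions for illustration.

```python
from itertools import combinations


def bucket(pair, n_buckets):
    """Hypothetical hash function for a pair of integer item IDs."""
    x, y = pair
    return (10 * x + y) % n_buckets


def dhp_candidate_pairs(transactions, min_sup, n_buckets=7):
    """Return (all pairs of frequent items, pairs surviving the bucket filter)."""
    item_count = {}
    bucket_count = [0] * n_buckets

    # Single scan: count items and hash every pair into a bucket.
    for t in transactions:
        items = sorted(set(t))
        for item in items:
            item_count[item] = item_count.get(item, 0) + 1
        for pair in combinations(items, 2):
            bucket_count[bucket(pair, n_buckets)] += 1

    frequent_items = sorted(i for i, n in item_count.items() if n >= min_sup)
    all_pairs = list(combinations(frequent_items, 2))

    # A bucket count is an upper bound on the support of each pair in it,
    # so a pair in a bucket below min_sup cannot be frequent.
    kept = [p for p in all_pairs
            if bucket_count[bucket(p, n_buckets)] >= min_sup]
    return all_pairs, kept


D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
all_pairs, kept = dhp_candidate_pairs(D, min_sup=2)
print(len(all_pairs))  # 6
print(kept)            # [(1, 3), (2, 3), (2, 5), (3, 5)]
```

The filter never discards a frequent pair, but it may keep infrequent ones when several pairs collide in a bucket, so the surviving candidates still have to be counted exactly.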
23. Partition: Scan Database Only Twice (Distributed Computing)
- Partition the database into n partitions
- Itemset X is frequent ⇒ X is frequent in at least one partition
  - Scan 1: partition the database and find local frequent patterns
  - Scan 2: consolidate global frequent patterns
- A. Savasere, E. Omiecinski, and S. Navathe, 1995
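The two scans might be sketched as follows. This illustration assumes min_sup is given as a relative support in percent, so that the same threshold scales to each partition, and it uses a brute-force miner as a stand-in for whatever in-memory algorithm processes a partition. Function names and the toy database are hypothetical.

```python
from itertools import combinations
from math import ceil


def local_frequent(part, min_pct):
    """Brute-force miner for one partition (exponential; for small toy data only)."""
    counts = {}
    for t in part:
        items = sorted(set(t))
        for k in range(1, len(items) + 1):
            for s in combinations(items, k):
                counts[s] = counts.get(s, 0) + 1
    # Integer comparison avoids floating-point rounding of the threshold.
    return {s for s, n in counts.items() if n * 100 >= min_pct * len(part)}


def partition_mine(transactions, min_pct, n_parts):
    """Return {itemset tuple: global support count} using two scans."""
    size = ceil(len(transactions) / n_parts)
    parts = [transactions[i:i + size]
             for i in range(0, len(transactions), size)]

    # Scan 1: the union of local frequent itemsets is the global candidate set.
    candidates = set()
    for part in parts:
        candidates |= local_frequent(part, min_pct)

    # Scan 2: count every candidate against the whole database.
    result = {}
    for c in candidates:
        n = sum(1 for t in transactions if set(c) <= set(t))
        if n * 100 >= min_pct * len(transactions):
            result[c] = n
    return result


D = [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]
freq = partition_mine(D, min_pct=50, n_parts=2)
print(len(freq))        # 9
print(freq[(2, 3, 5)])  # 2
```

No frequent itemset is lost: if an itemset fell below the relative threshold in every partition, its total count would fall below the global threshold as well.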
24. Sampling for Frequent Patterns
- Select a sample of the original database, mine frequent patterns within the sample using Apriori
- Scan the database once to verify frequent itemsets found in the sample; only the borders of the closure of frequent patterns are checked
  - Example: check abcd instead of ab, ac, ..., etc.
- Scan the database again to find missed frequent patterns
- H. Toivonen, 1996
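A simplified sketch of the sample-then-verify idea follows. It departs from the slide in two stated ways: it verifies every itemset found in the sample rather than only the borders, and it mines the sample with a brute-force routine instead of Apriori. It also counts the sample's negative border (the minimal itemsets not frequent in the sample); if any of those turns out to be frequent in the full database, some frequent patterns may have been missed and a second scan is needed. Relative support in percent, a known item vocabulary `items`, and all function names are assumptions.

```python
import random
from itertools import combinations


def mine(transactions, min_pct):
    """Brute-force frequent itemsets (sorted tuples); for small toy data only."""
    counts = {}
    for t in transactions:
        its = sorted(set(t))
        for k in range(1, len(its) + 1):
            for s in combinations(its, k):
                counts[s] = counts.get(s, 0) + 1
    n = len(transactions)
    return {s for s, c in counts.items() if c * 100 >= min_pct * n}


def negative_border(S, items):
    """Minimal itemsets outside S: not in S, but every proper subset is in S."""
    border = {(i,) for i in items if (i,) not in S}
    for p in S:
        for i in items:
            if i > p[-1]:
                c = p + (i,)
                if c not in S and all(
                    s in S for s in combinations(c, len(c) - 1)
                ):
                    border.add(c)
    return border


def sample_and_verify(transactions, min_pct, sample_size, items, seed=0):
    """Return (itemsets verified frequent, whether a second scan is needed)."""
    rng = random.Random(seed)
    sample = rng.sample(transactions, sample_size)
    S = mine(sample, min_pct)
    to_check = S | negative_border(S, sorted(items))

    # One full scan: exact counts for the sample's patterns and their border.
    n = len(transactions)
    frequent = set()
    for c in to_check:
        count = sum(1 for t in transactions if set(c) <= set(t))
        if count * 100 >= min_pct * n:
            frequent.add(c)

    # A frequent border itemset means the sample missed something.
    need_second_scan = any(c not in S for c in frequent)
    return frequent, need_second_scan


D = [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]
frequent, need_second_scan = sample_and_verify(
    D, min_pct=50, sample_size=2, items=[1, 2, 3, 4, 5])
print(sorted(frequent), need_second_scan)
```

When no border itemset is frequent, the verified set is complete, because any missed frequent itemset would have a minimal missing subset that lies on the border and is itself frequent.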