Title: An Efficient Polynomial Delay Algorithm for Pseudo Frequent Itemset Mining
1. An Efficient Polynomial Delay Algorithm for Pseudo Frequent Itemset Mining
- Takeaki Uno (National Institute of Informatics)
- Hiroki Arimura (Hokkaido University)
2/Oct/2007 Discovery Science 2007

2. Frequent Pattern Mining
- The problem of finding all frequently appearing patterns in a (large-scale) database
  - database: transactions, trees, strings, graphs, vectors
  - pattern: subset, tree, path, sequence, graph, geograph
[Figure: example databases: a table marking which entries appear in the experiments ex1 to ex4, and genome information given as strings over A, T, G, C]

3. This Research
- We address transaction databases
  - transaction database: each record (transaction) T of the database D is a subset of the itemset E, i.e., $\forall T \in D,\ T \subseteq E$
  - frequent itemset: a subset of E included in at least s transactions (s is the minimum support threshold)
- Problems
  - too many patterns are found to pick out the valuable ones
  - inclusion is strict, so errors in the data cannot be handled
  ⇒ "patterns ambiguously included in many transactions" are important
We introduce an ambiguous inclusion relation and propose an efficient mining algorithm

4. Related Works
- Frequent itemset mining with such ambiguity is called fault-tolerant pattern, degenerate pattern, or soft occurrence mining
  - one ambiguity for inclusion: "a pattern is included if the ratio of included items is more than the threshold"
  - another approach: find combinations of an itemset and a transaction set such that few pairs of item and transaction violate the inclusion relation
  - similarity is used for string matching and homology search
- Little "enumeration type" research with completeness
We look at practical models and algorithms from the viewpoint of algorithm theory

5. Notations for F.I.M.
- For an itemset K:
  - occurrence of K: a transaction of D that includes K
  - Occ(K) (occurrence set of K): the set of occurrences of K
  - frq(K) (frequency of K): the size of Occ(K)
D = { {1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2} }
Occ({1,2}) = { {1,2,5,6,7,9}, {1,2,7,8,9} }
Occ({2,7,9}) = { {1,2,5,6,7,9}, {1,2,7,8,9}, {2,7,9} }
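
The notation above can be checked directly on the slide's database. The following is a minimal sketch; the function names `occ` and `frq` simply mirror the slide's notation and are not taken from any library.

```python
# The example database D from the slide, one frozenset per transaction.
D = [
    frozenset({1, 2, 5, 6, 7, 9}),
    frozenset({2, 3, 4, 5}),
    frozenset({1, 2, 7, 8, 9}),
    frozenset({1, 7, 9}),
    frozenset({2, 7, 9}),
    frozenset({2}),
]


def occ(itemset, database):
    """Occ(K): the transactions of the database that include every item of K."""
    k = frozenset(itemset)
    return [t for t in database if k <= t]


def frq(itemset, database):
    """frq(K): the number of occurrences of K."""
    return len(occ(itemset, database))


print(frq({1, 2}, D))     # 2
print(frq({2, 7, 9}, D))  # 3
```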

6. Frequent Itemset
- Frequent itemset: an itemset with frequency no less than s (s is called the minimum support (threshold))
- Ex.) itemsets included in no less than 3 transactions of D:
  {1} {2} {7} {9} {1,7} {1,9} {2,7} {2,9} {7,9} {1,7,9} {2,7,9}
D = { {1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2} }
Frequent itemset mining: the problem of enumerating all frequent itemsets for a given database D and minimum support s
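
As a sanity check of the example, a brute-force sketch can list every non-empty itemset with frequency at least s. It tries all subsets of the items, so it is only meant for toy databases; the function name `frequent_itemsets` is illustrative.

```python
from itertools import combinations

D = [
    frozenset({1, 2, 5, 6, 7, 9}),
    frozenset({2, 3, 4, 5}),
    frozenset({1, 2, 7, 8, 9}),
    frozenset({1, 7, 9}),
    frozenset({2, 7, 9}),
    frozenset({2}),
]


def frequent_itemsets(database, s):
    """Return all non-empty itemsets included in at least s transactions.

    Brute force over every subset of the items: exponential in general.
    """
    items = sorted(set().union(*database))
    result = []
    for size in range(1, len(items) + 1):
        for cand in combinations(items, size):
            if sum(1 for t in database if t.issuperset(cand)) >= s:
                result.append(cand)
    return result


for itemset in frequent_itemsets(D, 3):
    print(itemset)
```

With s = 3 this prints the eleven itemsets listed in the example, from (1,) up to (2, 7, 9).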
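
7. Inclusion with Ambiguity
- Ambiguous inclusion relation for an itemset P and a transaction T
  - popular definition: $|P \cap T| / |P| \ge \theta$ for a threshold $\theta < 1$
  ⇒ frequent itemsets lose monotonicity
  ⇒ there can be a frequent itemset such that "every subset of it is infrequent"
  ⇒ high computational cost
Examples for $\theta = 0.6$:
  {1,2,3} is included in {1,2,4,5} (2/3 ≥ 0.6)
  {1,2,3,4,5,6,7} is included in {1,3,5,6,7} (5/7 ≥ 0.6)
  {1,2,3} is not included in {1,4,5} (1/3 < 0.6)
For $\theta = 0.6$ and the transactions {1,2}, {2,3}, {1,3}: {1,2,3} is included in all of them, but none of its non-empty proper subsets is included in all of them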
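
The loss of monotonicity can be reproduced in a few lines. This is a sketch under the ratio definition stated on the slide; the names `ratio_includes` and `ratio_frequency` are made up for illustration, and exact fractions are used to avoid floating-point comparisons.

```python
from fractions import Fraction
from itertools import combinations


def ratio_includes(pattern, transaction, theta):
    """True if |P ∩ T| / |P| >= theta (P must be non-empty)."""
    p, t = set(pattern), set(transaction)
    return Fraction(len(p & t), len(p)) >= theta


def ratio_frequency(pattern, database, theta):
    """Number of transactions that include the pattern under the ratio rule."""
    return sum(1 for t in database if ratio_includes(pattern, t, theta))


theta = Fraction(3, 5)  # the slide's threshold 0.6
db = [{1, 2}, {2, 3}, {1, 3}]

# The full pattern is included in all three transactions ...
print((1, 2, 3), ratio_frequency({1, 2, 3}, db, theta))
# ... while every non-empty proper subset is included in fewer of them.
for size in (1, 2):
    for sub in combinations([1, 2, 3], size):
        print(sub, ratio_frequency(sub, db, theta))
```

With a minimum support of 3, {1,2,3} is therefore frequent although none of its non-empty proper subsets is, so a search that prunes on infrequent subsets would miss it.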
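
8. k-pseudo Inclusion
- Use a threshold on the number of non-included items
  - k-pseudo inclusion: $|P \setminus T| \le k$ for a threshold $k \ge 0$ (k-pseudo occurrence / occurrence set / frequency are defined accordingly)
  ⇒ monotonicity is kept
  ⇒ able to find characterizations such as "many transactions include at least 3 items of P"
Examples for k = 1:
  {1,2,3} is 1-pseudo included in {1,2,4,5} (one item, 3, is missing)
  {1,2,3,4,5,6,7} is not 1-pseudo included in {1,3,5,6,7} (two items, 2 and 4, are missing)
  {1,2,3} is not 1-pseudo included in {1,4,5} (two items, 2 and 3, are missing)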
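
The three examples and the monotone property can be checked with a short sketch. The helper name `pseudo_includes` is illustrative; the definition is the one given on the slide.

```python
from itertools import combinations


def pseudo_includes(pattern, transaction, k):
    """True if the transaction misses at most k items of the pattern."""
    return len(set(pattern) - set(transaction)) <= k


print(pseudo_includes({1, 2, 3}, {1, 2, 4, 5}, 1))                 # True
print(pseudo_includes({1, 2, 3, 4, 5, 6, 7}, {1, 3, 5, 6, 7}, 1))  # False
print(pseudo_includes({1, 2, 3}, {1, 4, 5}, 1))                    # False

# Monotonicity: removing items from P can only reduce the number of
# missing items, so every subset of a k-pseudo included pattern is
# k-pseudo included as well.
P, T, k = (1, 2, 3, 4, 5, 6, 7), {1, 3, 5, 6, 7}, 2
assert pseudo_includes(P, T, k)
assert all(
    pseudo_includes(sub, T, k)
    for size in range(len(P) + 1)
    for sub in combinations(P, size)
)
```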

9. k-Pseudo Frequent Itemset
- k-pseudo frequent itemset: an itemset k-pseudo included in at least s transactions of D
D = { {1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2} }
1-pseudo frequent itemsets for s = 3:
  {1,2,3} {1,2,4} {1,2,5} {1,2,7} {1,2,9} {1,3,7} {1,3,9}
  {1,4,7} {1,4,9} {1,5,7} {1,5,9} {1,6,7} {1,6,9}
  {1,7,8} {1,7,9} {1,8,9} {2,3,7} {2,3,9} {2,4,7}
  {2,4,9} {2,5,7} {2,5,8} {2,5,9} {2,6,7} {2,6,9}
  {2,7,8} {2,7,9} {2,8,9} {3,7,9} {4,7,9} {5,7,9}
  {6,7,9} {7,8,9} {1,2,7,9} {1,3,7,9}
  {1,4,7,9} {1,5,7,9} {1,6,7,9} {1,7,8,9} {2,3,7,9}
  {2,4,7,9} {2,5,7,9} {2,6,7,9} {2,7,8,9}
Many trivial patterns. How to enumerate efficiently?
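
A few entries of this example can be spot-checked with a direct computation of the k-pseudo frequency. This is a minimal sketch; `pseudo_frq` and `pseudo_frequent_of_size` are illustrative names, and the second function is a brute force intended only for small databases.

```python
from itertools import combinations

D = [
    frozenset({1, 2, 5, 6, 7, 9}),
    frozenset({2, 3, 4, 5}),
    frozenset({1, 2, 7, 8, 9}),
    frozenset({1, 7, 9}),
    frozenset({2, 7, 9}),
    frozenset({2}),
]


def pseudo_frq(pattern, database, k):
    """k-pseudo frequency: number of transactions missing at most k items."""
    p = frozenset(pattern)
    return sum(1 for t in database if len(p - t) <= k)


def pseudo_frequent_of_size(database, k, s, size):
    """All k-pseudo frequent itemsets with the given number of items."""
    items = sorted(set().union(*database))
    return [c for c in combinations(items, size)
            if pseudo_frq(c, database, k) >= s]


print(pseudo_frq({1, 2, 3}, D, 1))     # 3: {1,2,3} is 1-pseudo frequent for s = 3
print(pseudo_frq({2, 5, 6}, D, 1))     # 2: {2,5,6} is not
print(pseudo_frq({1, 2, 7, 9}, D, 1))  # 4
# Every single item is 1-pseudo included in every transaction.
print(len(pseudo_frequent_of_size(D, 1, 3, 1)))  # 9
```

The last line shows where the "many trivial patterns" come from: for k = 1 every single item already counts as frequent.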

10. Enumeration using Monotonicity
- Pseudo frequent itemsets have the monotone property, so a simple backtrack algorithm works
- For each k-pseudo frequent itemset P, compute the k-pseudo frequency of each $P \cup e$
  - if the k-pseudo frequency of $P \cup e$ is no less than s, make a recursive call to enumerate the k-pseudo frequent itemsets including $P \cup e$
Polynomial time enumeration
How to compute efficiently?
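
The backtrack scheme can be sketched as follows. The sketch recomputes each frequency from scratch, so it shows only the search structure, not the efficient occurrence updates discussed later. Extending P only by items larger than its maximum item is an assumption added here to avoid generating an itemset twice; the function names are illustrative.

```python
D = [
    frozenset({1, 2, 5, 6, 7, 9}),
    frozenset({2, 3, 4, 5}),
    frozenset({1, 2, 7, 8, 9}),
    frozenset({1, 7, 9}),
    frozenset({2, 7, 9}),
    frozenset({2}),
]


def pseudo_frq(pattern, database, k):
    """k-pseudo frequency: number of transactions missing at most k items."""
    p = frozenset(pattern)
    return sum(1 for t in database if len(p - t) <= k)


def enumerate_pseudo_frequent(database, k, s):
    """Backtrack over itemsets, stored as increasing tuples of items."""
    items = sorted(set().union(*database))
    found = []

    def backtrack(p):
        found.append(p)
        for e in items:
            if p and e <= p[-1]:
                continue  # only extend with items beyond the current maximum
            q = p + (e,)
            # Monotonicity: if q is infrequent, so is every superset of q,
            # so the branch can be cut here.
            if pseudo_frq(q, database, k) >= s:
                backtrack(q)

    if pseudo_frq((), database, k) >= s:
        backtrack(())
    return found


print(len(enumerate_pseudo_frequent(D, 0, 3)))  # 12: the empty set plus 11 itemsets
```

With k = 0 the procedure reduces to ordinary frequent itemset enumeration, which is why the eleven frequent itemsets of the earlier example reappear (together with the empty set).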

11. Computing k-Pseudo Occurrences
- Define $Occ_h(P) = \{\, T \in D : |P \setminus T| = h \,\}$
  ⇒ the set of transactions missing exactly h items of P
  ⇒ $Occ_{\le k}(P) = \bigcup_{h \le k} Occ_h(P)$
- $Occ_h(P \cup e) = (Occ_h(P) \cap Occ(e)) \cup (Occ_{h-1}(P) \setminus Occ(e))$
  ⇒ the pseudo occurrence set is updated by taking intersections
- compute $Occ_h(P) \cap Occ(e)$ for all pairs of e and h
[Figure: example transactions over the items A to G, grouped into Occ0, Occ1 and Occ2 for an itemset P]
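
The update rule for $Occ_h$ can be checked against a direct computation. The sketch below stores each level as a set of transaction indices; `occ_levels` and `extend_levels` are illustrative names, and the item e is assumed not to belong to P.

```python
D = [
    frozenset({1, 2, 5, 6, 7, 9}),
    frozenset({2, 3, 4, 5}),
    frozenset({1, 2, 7, 8, 9}),
    frozenset({1, 7, 9}),
    frozenset({2, 7, 9}),
    frozenset({2}),
]


def occ_levels(pattern, database, k):
    """Direct computation: levels[h] holds the indices of the transactions
    that miss exactly h items of the pattern, for h = 0..k."""
    p = frozenset(pattern)
    levels = [set() for _ in range(k + 1)]
    for i, t in enumerate(database):
        h = len(p - t)
        if h <= k:
            levels[h].add(i)
    return levels


def extend_levels(levels, e, database):
    """Levels of P ∪ e from the levels of P (e not in P): a transaction
    containing e stays on its level, one without e moves down a level."""
    occ_e = {i for i, t in enumerate(database) if e in t}
    new_levels = []
    for h, level in enumerate(levels):
        kept = level & occ_e
        moved = (levels[h - 1] - occ_e) if h > 0 else set()
        new_levels.append(kept | moved)
    return new_levels


levels = occ_levels({2, 7}, D, 1)
print(extend_levels(levels, 9, D))  # [{0, 2, 4}, {3}]
print(occ_levels({2, 7, 9}, D, 1))  # the same levels, computed directly
```

Transactions that already miss k items and do not contain e drop out entirely, which is how the occurrence sets shrink as the recursion goes deeper.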

12. Taking Intersections Efficiently
- $Occ_h(P \cup e) = (Occ_h(P) \cap Occ(e)) \cup (Occ_{h-1}(P) \setminus Occ(e))$
  ⇒ has the same properties as usual occurrences
  ⇒ many existing techniques for updating occurrence sets can be used (down project, delivery, bitmap)
- Database reduction (FP-tree) is also available
- In deeper levels of the recursion, fewer transactions have to be scanned, so the computation is fast
Database (transaction: items)
  A: 1,2,5,6,7,9
  B: 2,3,4,5
  C: 1,2,7,8,9
  D: 1,7,9
  E: 2,7,9
  F: 2
Per-item occurrences (item: transactions)
  1: A,C,D
  2: A,B,C,E,F
  3: B
  4: B
  5: A,B
  6: A
  7: A,C,D,E
  8: C
  9: A,C,D,E
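
The per-item occurrence lists in this example can be produced by one scan of the database. The sketch below is only an illustration of that transposed layout, not of any particular published technique; `transpose` is a made-up name.

```python
db = {
    "A": {1, 2, 5, 6, 7, 9},
    "B": {2, 3, 4, 5},
    "C": {1, 2, 7, 8, 9},
    "D": {1, 7, 9},
    "E": {2, 7, 9},
    "F": {2},
}


def transpose(database):
    """Map each item to the sorted labels of the transactions containing it."""
    vertical = {}
    for label in sorted(database):
        for item in database[label]:
            vertical.setdefault(item, []).append(label)
    return vertical


vertical = transpose(db)
for item in sorted(vertical):
    print(item, vertical[item])

# Ordinary occurrences of {7, 9} as an intersection of two item lists.
print(sorted(set(vertical[7]) & set(vertical[9])))  # ['A', 'C', 'D', 'E']
```

Once the lists exist, each $Occ_h(P) \cap Occ(e)$ only touches the transactions listed for e.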

13. Using Bottom-wideness
- Backtrack (depth-first search) generates several recursive calls in each iteration
  ⇒ the computation tree spreads exponentially going down
  ⇒ the computation time is dominated by the bottom-level iterations of the recursion tree
[Figure: recursion tree; iterations near the root take a long time, iterations near the bottom take a short time]
Since few occurrences have to be computed at the lower levels, the amortized computation time is reduced to that of the bottom levels

14. For Large Minimum Support
- When s is large, we access many transactions at the bottom levels
  ⇒ the improvement from bottom-wideness is not drastic
- Reduce the database to speed up the bottom levels
  - (1) delete items less than the maximum item in P
  - (2) delete items that are infrequent on the occurrence set database (since they are never added in the recursive calls)
  - (3) unify identical transactions
- In practice, the database size is constant at the bottom levels
Example: P = {1,3}, k = 1, s = 4
  {1,3,5}
  {1,2,3,4,6}
  {1,7}
  {2,3,4,6,7}
  {3,4,5,6,7}
  {2,3,4,6,7}
No big difference from small s
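
One possible reading of the three reduction steps is sketched below on the example with P = {1,3}, k = 1, s = 4. The representation is an assumption made for this sketch, not something the slide specifies: each k-pseudo occurrence is reduced to a pair (h, remaining items), where h is the number of items of P it misses, and identical pairs are merged with a multiplicity. The name `reduce_database` is illustrative.

```python
from collections import Counter


def reduce_database(pattern, database, k, s):
    """Return {(h, remaining_items): multiplicity} over the k-pseudo
    occurrences of P, where h = number of items of P the transaction misses."""
    p = frozenset(pattern)
    top = max(p)
    # Keep only the k-pseudo occurrences, remembering their miss counts.
    occs = [(len(p - t), t) for t in database if len(p - t) <= k]
    # (1) only items greater than the maximum item of P can still be added.
    candidates = {e for _, t in occs for e in t if e > top}
    # Occurrences missing fewer than k items tolerate one more missing item.
    slack = sum(1 for h, _ in occs if h < k)
    # (2) keep e only if P ∪ e is still k-pseudo frequent.
    keep = {
        e for e in candidates
        if slack + sum(1 for h, t in occs if h == k and e in t) >= s
    }
    # (3) merge occurrences that became identical.
    return Counter((h, frozenset(t & keep)) for h, t in occs)


db = [
    frozenset({1, 3, 5}),
    frozenset({1, 2, 3, 4, 6}),
    frozenset({1, 7}),
    frozenset({2, 3, 4, 6, 7}),
    frozenset({3, 4, 5, 6, 7}),
    frozenset({2, 3, 4, 6, 7}),
]

for (h, items), count in reduce_database({1, 3}, db, k=1, s=4).items():
    print(h, sorted(items), count)
```

In this reading item 5 is dropped, because P ∪ {5} is 1-pseudo included in only three transactions, and the six occurrences collapse into four distinct records.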

15. Small Trivial Patterns
- Under k-pseudo inclusion, every itemset of size no more than k is included in every transaction
- itemsets of size slightly greater than k are also included in many transactions
  ⇒ many small and trivial frequent itemsets
- We want to ignore these itemsets in practice
  ⇒ consider the problem of directly finding pseudo frequent itemsets of size l

16. Directly Finding Large Itemset
- Exponential time is needed if all itemsets of size l are searched
  ⇒ pruning unnecessary search is crucial
  ⇒ take candidates according to partial structure
- Let P be a k-pseudo frequent itemset of size l
- WLOG, $P = \{1, \dots, l\}$, with the items sorted in increasing order of $|Occ_k(P) \setminus Occ(e)|$
- Consider the (k-1)-pseudo frequency of the itemset $\{1, \dots, y\}$
- Any transaction in $Occ_k(P) \setminus Occ(e)$ with $e > y$ (k-1)-pseudo includes $\{1, \dots, y\}$

17. Search Route to Itemset of Size l
- Any transaction in $Occ_k(P) \setminus Occ(e)$ with $e > y$ (k-1)-pseudo includes $\{1, \dots, y\}$
  ⇒ $Occ_{\le k-1}(\{1, \dots, y\}) \supseteq \bigcup_{e = y+1, \dots, |P|} (Occ_k(P) \setminus Occ(e))$
- the average of $|Occ_k(P) \setminus Occ(e)|$ over $e \in P$ is no less than $(k / |P|)\,|Occ_k(P)|$
- the items are sorted in increasing order of $|Occ_k(P) \setminus Occ(e)|$
  ⇒ $|Occ_{\le k-1}(\{1, \dots, y\})| \ge |Occ_k(P)| \cdot (|P| - y) / |P|$  (partial frequency condition)
There is a sequence of itemsets from the empty set to P composed only of itemsets satisfying the partial frequency condition

18. Example for Partial Frequency Condition
- Itemsets satisfying the partial frequency condition, for k = 1, s = 3, l = 3
D = { {1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2} }
1-pseudo frequent itemsets satisfying the partial frequency condition:
  {1} {2} {5} {7} {9}
  {1,2} {1,5} {1,6} {1,7} {1,8} {1,9} {2,3} {2,4}
  {2,5} {2,6} {2,7} {2,8} {2,9} {3,5} {4,5} {5,6}
  {5,7} {5,9} {6,7} {6,9} {7,8} {7,9} {8,9}
The number of frequent itemsets to be searched decreases
⇒ an efficient search is expected
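
The example list can be reproduced by a small level-wise search. The sketch rests on an assumption: the partial frequency condition for an itemset Q is taken to be (k-1)-pseudo frequency of Q ≥ s·(l − |Q|)/l, which is the bound above with the target frequency replaced by s. It collects the itemsets reachable from the empty set through itemsets meeting this threshold, using a set to discard duplicates; this is a plain search, not the reverse search introduced later. All function names are illustrative, and k ≥ 1 is assumed.

```python
D = [
    frozenset({1, 2, 5, 6, 7, 9}),
    frozenset({2, 3, 4, 5}),
    frozenset({1, 2, 7, 8, 9}),
    frozenset({1, 7, 9}),
    frozenset({2, 7, 9}),
    frozenset({2}),
]


def pseudo_frq(pattern, database, k):
    """k-pseudo frequency: number of transactions missing at most k items."""
    p = frozenset(pattern)
    return sum(1 for t in database if len(p - t) <= k)


def satisfies_pfc(q, database, k, s, l):
    """Assumed condition: (k-1)-pseudo frequency of Q >= s * (l - |Q|) / l."""
    return pseudo_frq(q, database, k - 1) * l >= s * (l - len(q))


def pfc_levels(database, k, s, l):
    """levels[y]: itemsets of size y (y < l) reachable from the empty set
    by adding one item at a time while the assumed condition holds."""
    items = sorted(set().union(*database))
    levels = [{frozenset()}]
    for _ in range(1, l):
        nxt = set()
        for q in levels[-1]:
            for e in items:
                if e not in q and satisfies_pfc(q | {e}, database, k, s, l):
                    nxt.add(q | {e})
        levels.append(nxt)
    return levels


def pseudo_frequent_of_size_l(database, k, s, l):
    """Extend the last level by one item and keep the k-pseudo frequent sets."""
    items = sorted(set().union(*database))
    last = pfc_levels(database, k, s, l)[-1]
    return {q | {e} for q in last for e in items
            if e not in q and pseudo_frq(q | {e}, database, k) >= s}


levels = pfc_levels(D, k=1, s=3, l=3)
print(sorted(sorted(q) for q in levels[1]))  # [[1], [2], [5], [7], [9]]
print(len(levels[2]))                        # 23
```

Under this assumption the search visits 5 single items and 23 pairs, the same itemsets as in the slide's list, before testing candidates of size 3.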

19. Restricted Search Route by P.F.C.
- Any k-pseudo frequent itemset of size l can be found by passing through itemsets satisfying the partial frequency condition
  ⇒ let's do a backtrack search
- There always exists an item whose removal leaves an itemset satisfying the condition
- Tail extension is not available (removal of the tail may violate the condition)
- Simple hill climbing generates duplicates
- So, use a generation rule to avoid duplication (reverse search)

20. Reverse Search for P.F.C.
- Rule: generate the itemset P from the itemset $P \setminus e$ that maximizes $|Occ_{\le k-1}(P \setminus e)|$ (ties are broken by choosing the minimum index)
- ReverseSearch(P)
    1. if $|P| = l$ then output P; return
    2. for each $e \notin P$ do
         if $P \cup e$ is a k-pseudo frequent itemset satisfying P.F.C. then
           if e is the item of $P \cup e$ whose removal maximizes the (k-1)-pseudo frequency then
             ReverseSearch($P \cup e$)
    3. end for
- $|Occ_{\le k-1}(P \setminus e)|$ can be computed efficiently by existing methods
$O(|P| \cdot \|D\|)$ time for one iteration
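
The generation rule can be sketched as follows. Two assumptions are made that the slide does not spell out: the partial frequency condition is taken to be (k-1)-pseudo frequency ≥ s·(l − |Q|)/l, and "minimum index" is read as removing the smallest item among ties. Frequencies are recomputed from scratch, so the sketch illustrates the parent rule only and not the stated time bound; k ≥ 1 is assumed and all names are illustrative.

```python
D = [
    frozenset({1, 2, 5, 6, 7, 9}),
    frozenset({2, 3, 4, 5}),
    frozenset({1, 2, 7, 8, 9}),
    frozenset({1, 7, 9}),
    frozenset({2, 7, 9}),
    frozenset({2}),
]


def pseudo_frq(pattern, database, k):
    """k-pseudo frequency: number of transactions missing at most k items."""
    p = frozenset(pattern)
    return sum(1 for t in database if len(p - t) <= k)


def reverse_search(database, k, s, l):
    """Enumerate k-pseudo frequent itemsets of size l, generating each
    itemset only from its unique parent."""
    items = sorted(set().union(*database))

    def f(q):
        # (k-1)-pseudo frequency of q
        return pseudo_frq(q, database, k - 1)

    def admissible(q):
        # k-pseudo frequent and meeting the assumed partial frequency condition
        return (pseudo_frq(q, database, k) >= s
                and f(q) * l >= s * (l - len(q)))

    def parent(q):
        # Remove the item whose removal maximizes the (k-1)-pseudo frequency.
        # max() returns the first maximum, so ties go to the smallest item.
        e = max(sorted(q), key=lambda x: f(q - {x}))
        return q - {e}

    out = []

    def search(p):
        if len(p) == l:
            out.append(p)
            return
        for e in items:
            if e in p:
                continue
            child = p | {e}
            if admissible(child) and parent(child) == p:
                search(child)

    search(frozenset())
    return out


result = reverse_search(D, k=1, s=3, l=3)
print(len(result) == len(set(result)))  # True: no itemset is produced twice
```

Because every itemset has exactly one parent, no table of visited itemsets is needed, which is the point of the rule.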

21. Conclusion
- Introduced an ambiguous inclusion relation in which at most k items of the pattern are not included
- Pseudo frequent itemset mining under this inclusion (monotonicity, intersection, many small trivial patterns)
- Reverse search for directly finding frequent itemsets of a fixed size
Future works
- implementation and experiments
- extension of the technique to other pattern mining
- approach to inclusion with "ratio r"