Title: Ambiguous Frequent Itemset Mining and Polynomial Delay Enumeration
1Ambiguous Frequent Itemset Mining and Polynomial
Delay Enumeration
- Takeaki Uno(1), Hiroki Arimura(2)
- (1) National Institute of Informatics, JAPAN
- (The Graduate University for Advanced Science)
- (2) Hokkaido University, JAPAN
May/25/2008 PAKDD 2008
2Frequent Pattern Mining
- Problem of finding all frequently appearing patterns in a given database
- database: transaction database (itemsets), trees, graphs, vectors
- patterns: itemsets, trees, paths/cycles, graphs, geometric graphs
[Figure: example databases (experiment records, genome sequences) from which frequently appearing patterns are extracted]
3Researches on Pattern Mining
- So many studies and applications on itemsets, sequences, trees, graphs, geometric graphs
- Thanks to efficient algorithms, we can say that any simple structure can be enumerated in practically short time
- One of the next problems is how to handle noise, error, and ambiguity
- → usual inclusion is too strict
- → we want to find patterns mostly included in many records
We consider ambiguous appearance of patterns
4Related Works on Ambiguity
- It is popular to detect ambiguous XXXX
- → dense substructures: clustering, community discovery
- → homology search on genome sequences
- Heuristic search is popular because of the difficulty of modeling and computation
- Advantage: usually works efficiently
- Problem: not easy to understand what is found
- much more cost for additional conditions (for each solution)
- Here we look at the problem from an algorithmic point of view
- (efficient models arising from efficient computation)
5Itemset Mining
- In this talk, we focus on itemset mining
- transaction database D: each record, called a transaction, is a subset of the itemset E, that is, ∀T ∈ D, T ⊆ E
- Occ(P): set of transactions including P
- frq(P) = |Occ(P)|: number of transactions including P
- P is a frequent itemset ⇔ frq(P) ≥ s (s is the minimum support)
- Problem is to enumerate all frequent itemsets in D
We introduce ambiguous inclusion for frequent
itemset mining
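The definitions of Occ(P) and frq(P) can be sketched in a few lines of Python. This is only an illustration: the function names, the tiny database, and the brute-force enumeration over all item combinations are assumptions made for the example, not the algorithm of the talk.

```python
from itertools import combinations


def occ(database, pattern):
    """Occ(P): transactions of the database that include every item of P."""
    return [t for t in database if pattern <= t]


def frq(database, pattern):
    """frq(P) = |Occ(P)|."""
    return len(occ(database, pattern))


def frequent_itemsets(database, s):
    """Brute-force listing of all non-empty itemsets P with frq(P) >= s."""
    items = sorted(set().union(*database))
    return [
        frozenset(c)
        for k in range(1, len(items) + 1)
        for c in combinations(items, k)
        if frq(database, frozenset(c)) >= s
    ]


# Hypothetical toy database (three transactions over items 1..5).
D = [frozenset({1, 3, 4}), frozenset({2, 4, 5}), frozenset({1, 2})]
print(frequent_itemsets(D, s=2))
```

With minimum support 2, only single items appear in two transactions here, so no itemset of size two or more is reported.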
6Related works
- fault-tolerant pattern, degenerate pattern, soft occurrence, etc.
- mainly two approaches
- (1) generalize inclusion
- (1-a) the ratio of included items ≥ θ → include
- → lose monotonicity: no subset may be frequent in the worst case
- → several heuristic-search-based algorithms
- (1-b) at most k items are not included → include
- → satisfy monotonicity: so many small itemsets are frequent
- → maximal enumeration or complete enumeration with small k
Example: transactions {1,2}, {2,3}, {1,3}; θ = 66%
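The two generalized inclusions can be compared on the three transactions {1,2}, {2,3}, {1,3}. The sketch below is an illustration under assumptions: the function names are made up for the example, θ is taken as 0.66, and k = 1 is an arbitrary choice.

```python
def includes_by_ratio(t, pattern, theta):
    """(1-a): t 'includes' P if at least a theta fraction of P's items are in t."""
    return len(t & pattern) >= theta * len(pattern)


def includes_with_misses(t, pattern, k):
    """(1-b): t 'includes' P if at most k items of P are missing from t."""
    return len(pattern - t) <= k


def frequency(database, pattern, includes):
    """Number of transactions that include the pattern under the given rule."""
    return sum(1 for t in database if includes(t, pattern))


D = [{1, 2}, {2, 3}, {1, 3}]


def by_ratio(t, p):
    return includes_by_ratio(t, p, 0.66)


def by_misses(t, p):
    return includes_with_misses(t, p, 1)


for p in ({1, 2, 3}, {1, 2}, {1}):
    print(sorted(p), frequency(D, p, by_ratio), frequency(D, p, by_misses))
```

Under the ratio rule, {1,2,3} is included by all three transactions while {1,2} and {1} are not, which is the loss of monotonicity noted in (1-a). Under the at-most-k rule the frequency never increases when items are added.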
7Related works 2
- (2) find pairs of itemset and transaction set such that few of them do not satisfy inclusion
- → equivalent to finding a dense submatrix, or dense bicluster
- so many equivalent patterns will be found
- → mainly, heuristic search for finding one such dense substructure
- ambiguity on the transaction set
- → an itemset can have many partners
[Figure: transaction-item matrix with axes labeled "items" and "transactions"]
We introduce a new model for (2) to avoid
redundancy, and propose an efficient depth-first
search type algorithm
8Average Inclusion
- inclusion ratio of t for P ⇔ |t ∩ P| / |P|
- average inclusion ratio of transaction set T for P
- ⇔ average of inclusion ratio over all transactions in T
- ⇔ Σ_{t∈T} |t ∩ P| / (|P| × |T|)
- ⇔ equivalent to dense submatrix/subgraph of the transaction-item inclusion matrix/graph
- For a density threshold θ, maximum co-occurrence size cov(P) of itemset P
- ⇔ maximum size of a transaction set s.t. average inclusion ratio ≥ θ
Example: transactions {1,3,4}, {2,4,5}, {1,2}
Average inclusion ratio: {2,3} → 50%, {4,5} → 50%, {1,2} → 66%
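The inclusion ratios of the example can be recomputed directly from the definition. A minimal sketch, assuming transactions and patterns are represented as Python frozensets; the function names are chosen for this illustration only.

```python
from fractions import Fraction


def inclusion_ratio(t, pattern):
    """|t ∩ P| / |P| for a single transaction t."""
    return Fraction(len(t & pattern), len(pattern))


def average_inclusion_ratio(transactions, pattern):
    """Sum of |t ∩ P| over T, divided by |P| * |T|."""
    total = sum(len(t & pattern) for t in transactions)
    return Fraction(total, len(pattern) * len(transactions))


D = [frozenset({1, 3, 4}), frozenset({2, 4, 5}), frozenset({1, 2})]
for p in ({2, 3}, {4, 5}, {1, 2}):
    ratio = average_inclusion_ratio(D, frozenset(p))
    print(sorted(p), ratio)
```

Exact fractions are used so that the value 2/3 for {1,2} is not blurred by floating-point rounding.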
9Problem Definition
- For a density threshold θ,
- the maximum co-occurrence size cov(P) of itemset P
- ⇔ maximum size of a transaction set s.t. average inclusion ratio ≥ θ
- Ambiguous frequent itemset: itemset P s.t. cov(P) ≥ s
- (s: minimum support)
- Ambiguous frequent itemsets are not monotone!!
Example: θ = 66%, transactions {1,3,4}, {2,4,5}, {1,2}
cov({3}) = 1, cov({2}) = 3, cov({1,3}) = 2, cov({1,2}) = 3
Ambiguous frequent itemset enumeration: the problem of outputting all ambiguous frequent itemsets for a given database D, density threshold θ, and minimum support s
The goal is to develop an efficient algorithm for
this problem
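The cov values of the example can be reproduced with a short sketch. It assumes θ = 0.66 and computes cov(P) greedily: transactions are taken in decreasing order of the number of items of P they contain, and the longest prefix whose average inclusion ratio stays at least θ is kept. The function name and representation are assumptions of this illustration.

```python
def cov(database, pattern, theta):
    """Maximum size of a transaction set with average inclusion ratio >= theta."""
    # The best set of size k consists of the k transactions with the most hits,
    # and its average ratio can only go down as k grows.
    hits = sorted((len(t & pattern) for t in database), reverse=True)
    size, total = 0, 0
    for h in hits:
        if total + h < theta * len(pattern) * (size + 1):
            break
        total += h
        size += 1
    return size


D = [{1, 3, 4}, {2, 4, 5}, {1, 2}]
for p in ({3}, {2}, {1, 3}, {1, 2}):
    print(sorted(p), cov(D, p, 0.66))
```

The output shows the non-monotonicity: cov({3}) is smaller than cov({1,3}), so with s = 2 the itemset {1,3} would be ambiguous frequent although its subset {3} is not.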
10Hardness for Branch-and-Bound
- A straightforward approach to this problem is branch-and-bound
- In each iteration, divide the problem into two non-empty problems by the inclusion of an item
Checking the existence of an ambiguous frequent itemset is NP-complete (Theorem 1)
11Is This Really Hard?
- We proved NP-hardness for "very dense graphs"
- → unclear for middle dense graphs
- → polynomial time enumeration is not impossible
polynomial time in (input size) (output size)
[Figure: density scale with regions labeled "hard", "easy", "?????", "easy"]
12Efficient Algorithm: Idea of Reverse Search
- We don't use branch and bound, but use reverse search
- Define an acyclic parent-child relation on all objects to be found
Depth-first search on the rooted tree induced by
the relation
Recursively find children to search, thus an
algorithm for finding all children is sufficient
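The idea can be sketched generically: once a routine listing the children of an object is available, the traversal itself is a plain recursion. The example below is a toy instance chosen for illustration (all subsets of a small item list, where the parent of a set is obtained by removing its largest item); it is not the parent relation used for ambiguous frequent itemsets.

```python
def reverse_search(root, children):
    """Depth-first traversal of the tree induced by a parent-child relation."""
    yield root
    for child in children(root):
        yield from reverse_search(child, children)


def children_of(items):
    """Toy relation: parent(P) = P minus its largest item."""
    def children(p):
        # A child is obtained by adding an item larger than every item of p,
        # so removing the largest item of the child gives back p.
        return [p | {e} for e in items if not p or e > max(p)]
    return children


subsets = list(reverse_search(frozenset(), children_of([1, 2, 3])))
print([sorted(p) for p in subsets])
```

Every subset is visited exactly once because each has a unique parent, so no table of already visited objects is needed.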
13Neighboring Relation
- AmbiOcc(P) of an ambiguous frequent itemset P
- ⇔ lexicographically minimum one among transaction sets whose average inclusion ratio ≥ θ and size = cov(P)
- e(P): the item e in P s.t. the number of transactions in AmbiOcc(P) including e is the minimum (ties are broken by taking the minimum index)
- the parent Prt(P) of P: P \ e(P)
Example: θ = 66%, s = 4
A: 1,3,4,7  B: 2,4,5  C: 1,2,7  D: 1,4,5,7  E: 2,3,6  F: 3,4,6
{1,4,5}: transactions ordered by inclusion ratio D | A, B | C, F | E; AmbiOcc({1,4,5}) = {D, A, B, C}
e(P) = 5, Prt({1,4,5}) = {1,4}; AmbiOcc({1,4}) = {D, A, B, C, F}
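The example can be checked with a small sketch of AmbiOcc, e(P) and Prt(P). It assumes θ = 0.66, names the transactions A to F as on the slide, and interprets "lexicographically minimum" as breaking ties between equally good transactions by their name; the function names are hypothetical.

```python
D = {
    "A": {1, 3, 4, 7}, "B": {2, 4, 5}, "C": {1, 2, 7},
    "D": {1, 4, 5, 7}, "E": {2, 3, 6}, "F": {3, 4, 6},
}
THETA = 0.66


def ambi_occ(db, pattern, theta):
    """Greedy choice in decreasing order of (included items, name)."""
    ranked = sorted(db, key=lambda name: (-len(db[name] & pattern), name))
    chosen, total = [], 0
    for name in ranked:
        hits = len(db[name] & pattern)
        if total + hits < theta * len(pattern) * (len(chosen) + 1):
            break
        chosen.append(name)
        total += hits
    return chosen  # len(chosen) equals cov(pattern)


def removed_item(db, pattern, theta):
    """e(P): item of P in the fewest transactions of AmbiOcc(P), smallest on ties."""
    occ = ambi_occ(db, pattern, theta)
    return min(sorted(pattern), key=lambda i: sum(i in db[n] for n in occ))


def parent(db, pattern, theta):
    """Prt(P) = P minus e(P)."""
    return pattern - {removed_item(db, pattern, theta)}


print(ambi_occ(D, {1, 4, 5}, THETA))
print(removed_item(D, {1, 4, 5}, THETA), parent(D, {1, 4, 5}, THETA))
print(ambi_occ(D, {1, 4}, THETA))
```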
14Properties of Parent
- The parent Prt(P) of P: P \ e(P)
- → uniquely defined
- The average inclusion ratio of AmbiOcc(P) does not decrease when P is replaced by Prt(P)
- → Prt(P) is an ambiguous frequent itemset
- |Prt(P)| < |P| (parent is always smaller)
- → the relation is acyclic, and induces a tree (rooted at ∅)
15Enumeration Tree
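- The relation is acyclic, and induces a tree (rooted at ∅)
- We call the tree enumeration tree
Example: θ = 66%, s = 4
A: 1,3,4,7  B: 2,4,5  C: 1,2,7  D: 1,4,5,7  E: 2,3,6  F: 3,4,6
[Figure: enumeration tree rooted at ∅, with nodes {1}, {2}, {3}, {4}, {7}, {1,7}, {3,4}, {4,5}, {1,4}, {4,7}, {1,4,7}, {1,4,5}, {1,3,4}, {3,4,7}, {4,5,7}, {1,2,7}, {1,3,7}, {1,5,7}, {1,3,4,7}, {1,4,5,7}]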
16Listing Children
- To perform a depth-first search on the enumeration tree, what we have to do is find all children of a given itemset
- P = Prt(P') is obtained by removing an item from P'
- → a child P' of P is obtained by adding an item to P
- → to find all children, we examine all possible items
17Check Candidates
- An item addition does not always yield a child
- → They are just candidates
- If the parent of a candidate P' = P∪e is P (i.e., e(P') = e), P' is a child of P
- → checking by computing e(P∪e), for each candidate P∪e
Theorem
Enumeration is done in O(||D||n) time for each ambiguous frequent itemset
18Algorithm Description
- Algorithm AFIM (P: pattern, D: database)
- output P
- compute cov(P∪e) for all items e not in P
- for each e s.t. cov(P∪e) ≥ s do
- compute AmbiOcc(P∪e)
- compute e(P∪e)
- if e(P∪e) = e then call AFIM (P∪e, D)
- done
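The steps above can be sketched as a runnable Python function. This is a straightforward, unoptimized illustration under stated assumptions: cov and AmbiOcc are recomputed from scratch for every candidate, θ is taken as 0.66, ties are broken by transaction name and by smallest item, and the root ∅ is output unconditionally (which assumes the database has at least s transactions). The names `afim`, `ambi_occ` and `removed_item` are hypothetical.

```python
def afim(db, theta, s):
    """Depth-first enumeration of ambiguous frequent itemsets (naive sketch)."""
    items = sorted(set().union(*db.values()))
    found = []

    def ambi_occ(pattern):
        # Greedy choice in decreasing order of (included items, name).
        ranked = sorted(db, key=lambda name: (-len(db[name] & pattern), name))
        chosen, total = [], 0
        for name in ranked:
            hits = len(db[name] & pattern)
            if total + hits < theta * len(pattern) * (len(chosen) + 1):
                break
            chosen.append(name)
            total += hits
        return chosen  # len(chosen) equals cov(pattern)

    def removed_item(pattern, occ):
        # e(P): item in the fewest transactions of AmbiOcc(P), smallest on ties.
        return min(sorted(pattern), key=lambda i: sum(i in db[n] for n in occ))

    def search(pattern):
        found.append(pattern)                      # output P
        for e in items:
            if e in pattern:
                continue
            candidate = pattern | {e}
            occ = ambi_occ(candidate)
            if len(occ) < s:                       # cov(P ∪ e) < s: not frequent
                continue
            if removed_item(candidate, occ) == e:  # the parent of the candidate is P
                search(candidate)

    search(frozenset())
    return found


D = {
    "A": {1, 3, 4, 7}, "B": {2, 4, 5}, "C": {1, 2, 7},
    "D": {1, 4, 5, 7}, "E": {2, 3, 6}, "F": {3, 4, 6},
}
result = afim(D, theta=0.66, s=4)
print(sorted(sorted(p) for p in result))
```

Because each itemset is generated only from its unique parent, the list contains no duplicates without any bookkeeping of previously output itemsets.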
19Computing cov(P?e)
- A transaction set whose size and average inclusion ratio are equal to those of AmbiOcc(P∪e) is obtained by choosing transactions in the decreasing order of inclusion ratio
- cov(P) cov(P∪e) always holds
- for any transactions T and T' such that the inclusion ratio of T for P is larger than that of T'
- → the inclusion ratio of T for P∪e is no less than that of T'
- → we can restrict the choice to transactions in AmbiOcc(P), to compute cov(P∪e)
20Example of Computing cov
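- computation of cov(P∪e) for P = {1,4} and e = 5
Example: θ = 66%, s = 4
A: 1,3,4,7  B: 2,4,5  C: 1,2,7  D: 1,4,5,7  E: 2,3,6  F: 3,4,6
Ordering for {1,4}: D, A (inc. 2 items) | B, C, F (inc. 1 item) | E (inc. no item); AmbiOcc({1,4}) = {D, A, B, C, F}
Ordering for {1,4,5}: D (inc. 3 items) | A, B (inc. 2 items) | C, F (inc. 1 item) | E (inc. no item); AmbiOcc({1,4,5}) = {D, A, B, C}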
21Efficient Computation of covs
- For efficient computation, we classify transactions by inclusion ratio
- When we compute cov(P∪e), we compute the intersection of each group and Occ(e)
- → inclusion ratio increases, for transactions included in Occ(e)
- → by moving such transactions, classification for P∪e is obtained
- This task for all items is done efficiently by Delivery, which takes O(||G||) time where ||G|| is the sum of transaction sizes in group G
- → computation of cov(P∪e) can be done in linear time
[Figure: transactions grouped by number of missing items: 0 miss, 1 miss, 2 miss, 3 miss, 4 miss, 5 miss]
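The classification and the Delivery step can be sketched as follows. This is an illustrative reading, not the authors' code: transactions are grouped by the number of items of P they include (rather than by misses), one pass over each group counts, for every item e outside P, how many of its transactions contain e, and cov(P∪e) is then derived from group sizes alone. The names and θ = 0.66 are assumptions.

```python
from collections import defaultdict


def classify(db, pattern):
    """Group transaction names by the number of pattern items they include."""
    groups = defaultdict(list)
    for name, t in db.items():
        groups[len(t & pattern)].append(name)
    return groups


def deliver(db, groups, pattern):
    """moved[e][h]: number of transactions in group h that contain item e."""
    moved = defaultdict(lambda: defaultdict(int))
    for h, names in groups.items():
        for name in names:            # one scan of each transaction in the group
            for e in db[name]:
                if e not in pattern:
                    moved[e][h] += 1
    return moved


def cov_after_adding(groups, moved_e, new_size, theta):
    """cov(P ∪ e) from the groups of P and the per-group counts for e."""
    sizes = defaultdict(int)
    for h, names in groups.items():
        m = moved_e.get(h, 0)
        sizes[h + 1] += m             # transactions containing e gain one hit
        sizes[h] += len(names) - m    # the others stay in their group
    size = total = 0
    for h in sorted(sizes, reverse=True):
        for _ in range(sizes[h]):
            if total + h < theta * new_size * (size + 1):
                return size
            total += h
            size += 1
    return size


D = {
    "A": {1, 3, 4, 7}, "B": {2, 4, 5}, "C": {1, 2, 7},
    "D": {1, 4, 5, 7}, "E": {2, 3, 6}, "F": {3, 4, 6},
}
P = {1, 4}
groups = classify(D, P)
moved = deliver(D, groups, P)
covs = {e: cov_after_adding(groups, moved[e], len(P) + 1, 0.66) for e in sorted(moved)}
print(covs)
```

For P = {1,4} this gives cov({1,4,5}) = 4, in agreement with the example, without rescanning the database once per item.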
22Computing AmbiOcc and e
- Computation of AmbiOcc(P∪e) needs greedy choice of transactions, in the decreasing order of (inclusion ratio, index)
- Computation of e(P∪e) needs the intersection of AmbiOcc(P∪e) and Occ(i) for each i ∈ P → Delivery
- → need O(||D||) time in the worst case
- However, when cov(P) is small, not so many transactions may be scanned, thus we expect the average computation time is not so long
23Bottom-wideness
- DFS generates several recursive calls in each iteration
- → Recursion tree grows exponentially, by going down
- → Computation time is dominated by the lowest levels
- Computation time decreases by going down
[Figure: recursion tree; long time per iteration near the top, short time near the bottom]
Near the bottom levels, computation time may be close to s, thus an iteration may take O(st) time where t is the average size of transactions
24Computational Experiments
- CPU: Pentium M 1.1GHz
- memory: 256MB
- OS: Windows XP, Cygwin
- Code: C
- Compiler: gcc 2.3
- Test instances are taken from benchmark datasets for frequent itemset mining
25BMS-WebView 2
- Real-world web access data (sparse; transaction size 4.5)
26Mushroom
- Real-world machine learning data of mushrooms (density 1/3)
27Possibility for Further Improvements
- Ratio of unnecessary operations, non-maximal
patterns
28Conclusion
- Introduced a new model for frequent itemset mining with ambiguous inclusion relation, which avoids redundancy
- Showed a hardness result for branch-and-bound
- Showed efficiency on practical (sparse) datasets
- Future Works
- Reduce the time complexity and fill the gap from the practice
- Efficient models and computation for maximal ones
- Application of the technique to other problems (ambiguous pattern mining for graph, tree, vector data, etc.)