Ambiguous Frequent Itemset Mining and Polynomial Delay Enumeration

1
Ambiguous Frequent Itemset Mining and Polynomial
Delay Enumeration

Takeaki Uno(1), Hiroki Arimura(2)
(1) National Institute of Informatics, JAPAN
(The Graduate University for Advanced Studies)
(2) Hokkaido University, JAPAN

May/25/2008 PAKDD 2008
2
Frequent Pattern Mining

Problem of finding all frequently appearing patterns in a given database
database: transaction database (itemsets), tree, graph, vector
patterns: itemset, tree, path/cycle, graph, geometric graph

[Figure: a database (experiment records, genome sequences) from which frequently appearing patterns are extracted]
3
Research on Pattern Mining

So many studies and applications on itemsets, sequences, trees, graphs, geometric graphs
Thanks to efficient algorithms, we can say that any simple structure can be enumerated in practically short time
One of the next problems is how to handle noise, error, and ambiguity
→ usual inclusion is too strict
→ we want to find patterns mostly included in many records

We consider ambiguous appearance of patterns
4
Related Works on Ambiguity

It is popular to detect ambiguous XXXX
→ dense substructures: clustering, community discovery
→ homology search on genome sequences
Heuristic search is popular because of the difficulty of modeling and computation
Advantage: usually works efficiently
Problem: not easy to understand what is found; much more cost for additional conditions (for each solution)
Here we look at the problem from an algorithmic point of view (efficient models arising from efficient computation)

5
Itemset Mining

In this talk, we focus on itemset mining
transaction database D: each record, called a transaction, is a subset of the itemset E, that is, ∀T ∈ D, T ⊆ E
Occ(P): the set of transactions including P
frq(P) = |Occ(P)|: the number of transactions including P
P is a frequent itemset ⇔ frq(P) ≥ s (s is the minimum support)
The problem is to enumerate all frequent itemsets in D

We introduce ambiguous inclusion for frequent itemset mining
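The definitions of Occ(P) and frq(P) can be sketched in a few lines of Python. This is only an illustration: the function names and the toy database are assumptions, not part of the talk, and the brute-force enumeration is exponential in the number of items.

```python
from itertools import combinations

def occ(database, pattern):
    """Occ(P): the transactions that include every item of P."""
    return [t for t in database if pattern <= t]

def frq(database, pattern):
    """frq(P) = |Occ(P)|."""
    return len(occ(database, pattern))

def frequent_itemsets(database, s):
    """Brute force: every itemset P with frq(P) >= s (illustration only)."""
    items = sorted(set().union(*database))
    result = []
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            if frq(database, frozenset(combo)) >= s:
                result.append(set(combo))
    return result

# Toy database (hypothetical, not taken from the slides).
D = [{1, 2, 3}, {1, 2}, {2, 3}, {1, 2, 4}]
print(frq(D, {1, 2}))           # 3
print(frequent_itemsets(D, 3))  # [set(), {1}, {2}, {1, 2}]
```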
6
Related works

fault-tolerant pattern, degenerate pattern, soft occurrence, etc.
mainly two approaches
(1) generalize inclusion
(1-a) the ratio of included items ≥ θ ⇒ include
→ loses monotonicity: no subset may be frequent in the worst case
→ several heuristic-search-based algorithms
(1-b) at most k items are not included ⇒ include
→ satisfies monotonicity, but so many small itemsets are frequent
→ maximal enumeration, or complete enumeration with small k

Example: transactions {1,2}, {2,3}, {1,3}; θ = 66%
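The two generalized inclusions can be compared on the three-transaction example, reading "66" as θ = 66%. The sketch below is a hypothetical illustration (function names and the choice k = 1 are assumptions): under (1-a) the superset {1,2,3} is counted in more transactions than its subset {1}, while under (1-b) with k = 1 every single item is "included" everywhere.

```python
from fractions import Fraction

def ratio_include(t, P, theta):
    """(1-a): t includes P if at least a theta fraction of P's items are in t."""
    return len(t & P) >= theta * len(P)

def k_miss_include(t, P, k):
    """(1-b): t includes P if at most k items of P are missing from t."""
    return len(P - t) <= k

def freq(database, P, include):
    """Number of transactions that include P under the given inclusion test."""
    return sum(1 for t in database if include(t, P))

D = [{1, 2}, {2, 3}, {1, 3}]
theta = Fraction(66, 100)

# (1-a): frequency of the subset {1} versus the superset {1,2,3}.
f_sub = freq(D, {1}, lambda t, P: ratio_include(t, P, theta))
f_sup = freq(D, {1, 2, 3}, lambda t, P: ratio_include(t, P, theta))

# (1-b) with k = 1: removing items from P never adds misses.
g_sub = freq(D, {1}, lambda t, P: k_miss_include(t, P, 1))
g_sup = freq(D, {1, 2, 3}, lambda t, P: k_miss_include(t, P, 1))

print(f_sub, f_sup)  # 2 3  (not monotone)
print(g_sub, g_sup)  # 3 3
```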
7
Related works 2

(2) find pairs of an itemset and a transaction set such that few of them do not satisfy inclusion
→ equivalent to finding a dense submatrix, or dense bicluster; so many equivalent patterns will be found
→ mainly, heuristic search for finding one such dense substructure
ambiguity on the transaction set
→ an itemset can have many partners

[Figure: transaction-item matrix; axes: items, transactions]
We introduce a new model for (2) to avoid redundancy, and propose an efficient depth-first-search-type algorithm
8
Average Inclusion

inclusion ratio of t for P: |t ∩ P| / |P|
average inclusion ratio of a transaction set T for P
→ average of the inclusion ratio over all transactions in T
= Σ_{t ∈ T} |t ∩ P| / (|P| · |T|)
→ equivalent to a dense submatrix/subgraph of the transaction-item inclusion matrix/graph
For a density threshold θ, the maximum co-occurrence size cov(P) of an itemset P
= the maximum size of a transaction set s.t. average inclusion ratio ≥ θ

Example: transactions {1,3,4}, {2,4,5}, {1,2}
average inclusion ratio: {2,3} → 50%, {4,5} → 50%, {1,2} → 66%
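The example values can be reproduced with a short sketch. The function names are assumptions, P is assumed to be non-empty, and exact fractions are used so that 2/3 is not rounded.

```python
from fractions import Fraction

def inclusion_ratio(t, P):
    """|t ∩ P| / |P| for a single transaction t (P must be non-empty)."""
    return Fraction(len(t & P), len(P))

def average_inclusion_ratio(T, P):
    """Average of the inclusion ratio over all transactions in T."""
    return sum(inclusion_ratio(t, P) for t in T) / len(T)

T = [{1, 3, 4}, {2, 4, 5}, {1, 2}]
for P in ({2, 3}, {4, 5}, {1, 2}):
    print(sorted(P), average_inclusion_ratio(T, P))
# [2, 3] 1/2
# [4, 5] 1/2
# [1, 2] 2/3
```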
9
Problem Definition

For a density threshold θ, the maximum co-occurrence size cov(P) of an itemset P
= the maximum size of a transaction set s.t. average inclusion ratio ≥ θ
Ambiguous frequent itemset: an itemset P s.t. cov(P) ≥ s (s: minimum support)
Ambiguous frequent itemsets are not monotone!

Example: transactions {1,3,4}, {2,4,5}, {1,2}; θ = 66%
cov({3}) = 1, cov({2}) = 3, cov({1,3}) = 2, cov({1,2}) = 3
Ambiguous frequent itemset enumeration: the problem of outputting all ambiguous frequent itemsets for a given database D, density threshold θ, and minimum support s
The goal is to develop an efficient algorithm for this problem
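A minimal sketch of cov(P), assuming a greedy selection: for each size k, the k transactions with the most items of P give the best possible average, so it is enough to scan prefixes of the sorted list. The function name is hypothetical, and θ = 66% is represented as an exact fraction.

```python
from fractions import Fraction

def cov(database, P, theta):
    """Largest k such that the k best transactions have average inclusion ratio >= theta."""
    hits = sorted((len(t & P) for t in database), reverse=True)
    best = total = 0
    for k, h in enumerate(hits, start=1):
        total += h
        # average ratio = total / (|P| * k); compare without division
        if total >= theta * len(P) * k:
            best = k
    return best

D = [{1, 3, 4}, {2, 4, 5}, {1, 2}]
theta = Fraction(66, 100)
for P in ({3}, {2}, {1, 3}, {1, 2}):
    print(sorted(P), cov(D, P, theta))
# [3] 1
# [2] 3
# [1, 3] 2
# [1, 2] 3
```

Note that cov({3}) = 1 is smaller than cov({1,3}) = 2, which is the non-monotonicity stated on the slide.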
10
Hardness for Branch-and-Bound

A straightforward approach to this problem is branch-and-bound
In each iteration, divide the problem into two non-empty problems by the inclusion of an item

Checking the existence of an ambiguous frequent itemset is NP-complete (Theorem 1)
11
Is This Really Hard?

We proved NP-hardness for "very dense graphs"
→ unclear for graphs of middle density
→ polynomial-time enumeration is not impossible

polynomial time in (input size) and (output size)
[Figure: difficulty regions labeled "hard", "easy", "?????", "easy"]
12
Efficient Algorithm: Idea of Reverse Search

We do not use branch and bound, but use reverse search
Define an acyclic parent-child relation on all objects to be found
[Figure: objects connected by the parent-child relation]
Depth-first search on the rooted tree induced by the relation
Recursively find children to search; thus an algorithm for finding all children is sufficient
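The reverse search idea can be sketched generically: given a root and a function that lists the children of an object, a depth-first traversal visits every object exactly once. The toy parent relation below (remove the largest element of a subset) is an assumption chosen for illustration and is not the relation used in the talk.

```python
def reverse_search(root, children):
    """Depth-first traversal of the tree induced by a parent-child relation."""
    yield root
    for child in children(root):
        yield from reverse_search(child, children)

def subset_children(n):
    """Toy relation on subsets of {1..n}: parent = subset minus its largest element."""
    def children(P):
        start = max(P, default=0) + 1
        # Adding any element larger than max(P) gives a subset whose parent is P.
        return [P | {e} for e in range(start, n + 1)]
    return children

subsets = list(reverse_search(frozenset(), subset_children(3)))
print(len(subsets))  # 8
```

Because each object has exactly one parent, no visited set is needed to avoid duplicates.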
13
Neighboring Relation

AmbiOcc(P) of an ambiguous frequent itemset P
= the lexicographically minimum one among the transaction sets whose average inclusion ratio ≥ θ and whose size = cov(P)
e(P): the item e in P s.t. the number of transactions in AmbiOcc(P) including e is the minimum (ties are broken by taking the minimum index)
the parent Prt(P) of P = P \ e(P)

Example: θ = 66%, s = 4
A = {1,3,4,7}, B = {2,4,5}, C = {1,2,7}, D = {1,4,5,7}, E = {2,3,6}, F = {3,4,6}
P = {1,4,5}: transactions by number of included items: D | A, B | C, F | E; AmbiOcc({1,4,5}) = {D, A, B, C}
e(P) = 5, so Prt({1,4,5}) = {1,4}; AmbiOcc({1,4}) = {D, A, B, C, F}
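The example can be checked with a small sketch. It assumes that AmbiOcc(P) is obtained greedily, taking transactions in decreasing order of included items and breaking ties by transaction name; the function names and this tie-breaking rule are assumptions made for illustration.

```python
from fractions import Fraction

def ambi_occ(database, P, theta):
    """Greedy AmbiOcc(P): longest prefix, in (more hits, smaller name) order,
    whose average inclusion ratio is >= theta."""
    order = sorted(database, key=lambda name: (-len(database[name] & P), name))
    best = total = 0
    for k, name in enumerate(order, start=1):
        total += len(database[name] & P)
        if total >= theta * len(P) * k:
            best = k
    return order[:best]

def weakest_item(database, P, theta):
    """e(P): item of P in the fewest transactions of AmbiOcc(P) (ties: smallest item)."""
    chosen = ambi_occ(database, P, theta)
    return min(P, key=lambda e: (sum(e in database[n] for n in chosen), e))

def parent(database, P, theta):
    """Prt(P) = P without e(P)."""
    return P - {weakest_item(database, P, theta)}

D = {"A": {1, 3, 4, 7}, "B": {2, 4, 5}, "C": {1, 2, 7},
     "D": {1, 4, 5, 7}, "E": {2, 3, 6}, "F": {3, 4, 6}}
theta = Fraction(66, 100)

print(sorted(ambi_occ(D, {1, 4, 5}, theta)))  # ['A', 'B', 'C', 'D']
print(weakest_item(D, {1, 4, 5}, theta))      # 5
print(parent(D, {1, 4, 5}, theta))            # {1, 4}
print(sorted(ambi_occ(D, {1, 4}, theta)))     # ['A', 'B', 'C', 'D', 'F']
```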
14
Properties of Parent

The parent Prt(P) of P = P \ e(P)
→ uniquely defined
Removing e(P) from P does not decrease the average inclusion ratio of AmbiOcc(P)
→ Prt(P) is an ambiguous frequent itemset
|Prt(P)| < |P| (the parent is always smaller)
→ the relation is acyclic, and induces a tree (rooted at ∅)

15
Enumeration Tree

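The relation is acyclic, and induces a tree (rooted at ∅)
We call the tree the enumeration tree

Example: θ = 66%, s = 4
A = {1,3,4,7}, B = {2,4,5}, C = {1,2,7}, D = {1,4,5,7}, E = {2,3,6}, F = {3,4,6}
[Figure: enumeration tree rooted at ∅; nodes shown: {1}, {2}, {3}, {4}, {7}; {1,7}, {3,4}, {4,5}, {1,4}, {4,7}; {1,4,7}, {1,4,5}, {1,3,4}, {3,4,7}, {4,5,7}, {1,2,7}, {1,3,7}, {1,5,7}; {1,3,4,7}, {1,4,5,7}]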
16
Listing Children

To perform a depth-first search on the enumeration tree, all we have to do is find all children of a given itemset
Prt(P) is obtained by removing an item from P
→ a child P' of P is obtained by adding an item to P
→ to find all children, we examine all possible items

[Figure: enumeration tree rooted at ∅]
17
Check Candidates

An item addition does not always yield a child
→ the results are just candidates
If the parent of a candidate P' = P ∪ e is P (i.e., e(P') = e), then P' is a child of P
→ check by computing e(P ∪ e) for each candidate P ∪ e

Theorem
Enumeration is done in O(Dn) time for each ambiguous frequent itemset
18
Algorithm Description

Algorithm AFIM ( P: pattern, D: database )
  output P
  compute cov(P ∪ e) for all items e not in P
  for each e s.t. cov(P ∪ e) ≥ s do
    compute AmbiOcc(P ∪ e)
    compute e(P ∪ e)
    if e(P ∪ e) = e then call AFIM ( P ∪ e, D )
  done
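The pseudocode can be turned into a small runnable sketch. It is not the optimized algorithm of the talk: cov and AmbiOcc are recomputed from scratch for every candidate, ties among transactions are broken by list index (an assumption), and the root ∅ is output unconditionally, which assumes |D| ≥ s. All function names are hypothetical.

```python
from fractions import Fraction

def ambi_occ(db, P, theta):
    """Indices of AmbiOcc(P): longest greedy prefix with average inclusion ratio >= theta."""
    order = sorted(range(len(db)), key=lambda i: (-len(db[i] & P), i))
    best = total = 0
    for k, i in enumerate(order, start=1):
        total += len(db[i] & P)
        if total >= theta * len(P) * k:
            best = k
    return order[:best]

def weakest_item(db, P, chosen):
    """e(P): item of P contained in the fewest chosen transactions (ties: smallest item)."""
    return min(P, key=lambda e: (sum(e in db[i] for i in chosen), e))

def afim(db, theta, s, P=frozenset()):
    """Yield P, then recursively every descendant of P in the enumeration tree."""
    yield P
    for e in sorted(set().union(*db) - P):
        child = P | {e}
        chosen = ambi_occ(db, child, theta)        # len(chosen) == cov(child)
        if len(chosen) >= s and weakest_item(db, child, chosen) == e:
            yield from afim(db, theta, s, child)   # child's parent is P

# Transactions A..F from the slides, in list order.
D = [{1, 3, 4, 7}, {2, 4, 5}, {1, 2, 7}, {1, 4, 5, 7}, {2, 3, 6}, {3, 4, 6}]
theta = Fraction(66, 100)
found = list(afim(D, theta, 4))
print(frozenset({1, 4, 5}) in found)  # True
print(frozenset({5}) in found)        # False
```

Each ambiguous frequent itemset is reached only from its unique parent, so the generator never outputs the same itemset twice.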

19
Computing cov(P?e)

A transaction set whose size and average inclusion ratio are equal to those of AmbiOcc(P ∪ e) is obtained by choosing transactions in the decreasing order of inclusion ratio
cov(P) ≥ cov(P ∪ e) always holds when P ∪ e is a child of P
For any transactions T and T' such that the inclusion ratio of T for P is larger than that of T'
→ the inclusion ratio of T for P ∪ e is no less than that of T'
→ we can restrict the choice to transactions in AmbiOcc(P) to compute cov(P ∪ e)

20
Example of Computing cov

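computation of cov(P ∪ e) for P = {1,4} and e = 5

θ = 66%, s = 4
A = {1,3,4,7}, B = {2,4,5}, C = {1,2,7}, D = {1,4,5,7}, E = {2,3,6}, F = {3,4,6}
Transactions grouped for P = {1,4}: D, A (inc. 2 items) | B, C, F (inc. 1 item) | E (inc. no item); AmbiOcc({1,4}) = {D, A, B, C, F}
Transactions grouped for P ∪ e = {1,4,5}: D (inc. 3 items) | A, B (inc. 2 items) | C, F (inc. 1 item) | E; AmbiOcc({1,4,5}) = {D, A, B, C}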
21
Efficient Computation of covs

For efficient computation, we classify transactions by inclusion ratio
When we compute cov(P ∪ e), we compute the intersection of each group and Occ(e)
→ the inclusion ratio increases for transactions included in Occ(e)
→ by moving such transactions, the classification for P ∪ e is obtained
This task for all items is done efficiently by Delivery, which takes O(||G||) time, where ||G|| is the sum of transaction sizes in group G
→ computation of cov(P ∪ e) can be done in linear time

[Figure: transaction groups labeled by number of missing items: 0 miss, 1 miss, 2 miss, 3 miss, 4 miss, 5 miss]
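The grouping idea can be sketched as follows: keep, for each transaction, the number of items of P it contains, bump that count for the transactions in Occ(e), and then scan the groups from the best to the worst. This sketch does not implement Delivery; the function names and the index-based representation of Occ(e) are assumptions.

```python
from fractions import Fraction

def hits_after_adding(hits, occ_e):
    """hits[i] = |t_i ∩ P|; returns |t_i ∩ (P ∪ {e})| given Occ(e) as a set of indices."""
    return [h + (i in occ_e) for i, h in enumerate(hits)]

def cov_from_hits(hits, pattern_size, theta):
    """cov for a pattern of the given size, scanning groups from most to fewest hits."""
    groups = [0] * (pattern_size + 1)      # groups[h] = number of transactions with h hits
    for h in hits:
        groups[h] += 1
    best = size = total = 0
    for h in range(pattern_size, -1, -1):
        for _ in range(groups[h]):
            size += 1
            total += h
            if total >= theta * pattern_size * size:
                best = size
    return best

theta = Fraction(66, 100)
# Transactions A..F of the slides, P = {1,4}: hits per transaction in order A, B, C, D, E, F.
hits_P = [2, 1, 1, 2, 0, 1]
occ_5 = {1, 3}                             # item 5 appears in B and D
hits_Pe = hits_after_adding(hits_P, occ_5)

print(hits_Pe)                             # [2, 2, 1, 3, 0, 1]
print(cov_from_hits(hits_P, 2, theta))     # 5
print(cov_from_hits(hits_Pe, 3, theta))    # 4
```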
22
Computing AmbiOcc and e

Computation of AmbiOcc(P ∪ e) needs a greedy choice of transactions, in the decreasing order of (inclusion ratio, index)
Computation of e(P ∪ e) needs the intersection of AmbiOcc(P ∪ e) and Occ(i) for each i ∈ P → Delivery
→ needs O(D) time in the worst case
However, when cov(P) is small, not so many transactions may be scanned; thus we expect that the average computation time is not so long

23
Bottom-wideness

DFS generates several recursive calls in each iteration
→ the recursion tree grows exponentially as we go down
→ computation time is dominated by the lowest levels
Computation time of an iteration decreases as we go down

[Figure: recursion tree; iterations near the root take a long time, iterations near the bottom take a short time]
Near the bottom levels, the number of scanned transactions may be close to s; thus an iteration may take O(st) time, where t is the average size of transactions
24
Computational Experiments

CPU: Pentium M 1.1GHz, memory: 256MB
OS: Windows XP with Cygwin
Code: C
Compiler: gcc 2.3
Test instances are taken from benchmark datasets for frequent itemset mining

25
BMS-WebView 2

Real-world web access data (sparse; transaction size 4.5)

26
Mushroom

Real-world machine learning data on mushrooms (density 1/3)

27
Possibility for Further Improvements

Ratio of unnecessary operations, non-maximal
patterns

28
Conclusion

Introduced a new model for frequent itemset mining with an ambiguous inclusion relation, which avoids redundancy
Showed a hardness result for branch-and-bound
Showed efficiency on practical (sparse) datasets
Future Works
Reduce the time complexity and fill the gap between theory and practice
Efficient models and computation for maximal ones
Application of the technique to other problems (ambiguous pattern mining for graph, tree, vector data, etc.)