Ambiguous Frequent Itemset Mining and Polynomial Delay Enumeration

1
Ambiguous Frequent Itemset Mining and Polynomial
Delay Enumeration
  • Takeaki Uno(1), Hiroki Arimura(2)
  • (1) National Institute of Informatics, JAPAN
  • (The Graduate University for Advanced Studies)
  • (2) Hokkaido University, JAPAN

May/25/2008 PAKDD 2008
2
Frequent Pattern Mining
  • Problem of finding all frequently appearing
    patterns in a given database
  • database: transaction database (itemsets), trees,
    graphs, vectors
  • patterns: itemsets, trees, paths/cycles, graphs,
    geometric graphs

[Figure: frequently appearing patterns are extracted from a database, e.g. records of experiments or genome sequences]
3
Research on Pattern Mining
  • Many studies and applications on itemsets,
    sequences, trees, graphs, geometric graphs
  • Thanks to efficient algorithms, we can say that
    any simple structure can be enumerated in
    practically short time
  • One of the next problems is how to handle
    noise, errors, and ambiguity
  • → usual inclusion is too strict
  • → we want to find patterns mostly included in
    many records

We consider ambiguous appearance of patterns
4
Related Works on Ambiguity
  • It is popular to detect ambiguous XXXX
  • → dense substructures: clustering, community
    discovery
  • → homology search on genome sequences
  • Heuristic search is popular because of the
    difficulty of modeling and computation
  • Advantage: usually works efficiently
  • Problem: not easy to understand what is found
  • much more cost for additional conditions (for
    each solution)
  • Here we look at the problem from an algorithmic
    point of view
  • (efficient models arising from efficient
    computation)

5
Itemset Mining
  • In this talk, we focus on itemset mining
  • transaction database D: each record, called a
    transaction, is a subset of the item set E, that is,
    ∀T ∈ D, T ⊆ E
  • Occ(P): set of transactions including P
  • frq(P) = |Occ(P)|: number of transactions including P
  • P is a frequent itemset ⇔ frq(P) ≥ s (s is the minimum
    support)
  • Problem is to enumerate all frequent itemsets
    in D

We introduce ambiguous inclusion for frequent
itemset mining
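
The definitions on this slide can be sketched in a few lines of Python. This is only an illustration: the function names and the toy database are assumptions, not part of the talk.

```python
def occ(db, p):
    """Occ(P): indices of the transactions that include every item of P."""
    return [i for i, t in enumerate(db) if p <= t]

def frq(db, p):
    """frq(P) = |Occ(P)|."""
    return len(occ(db, p))

def is_frequent(db, p, s):
    """P is frequent iff frq(P) >= s (s is the minimum support)."""
    return frq(db, p) >= s

# Toy database (hypothetical, not taken from the slides).
db = [frozenset(t) for t in ([1, 2, 5], [1, 2, 3], [2, 3], [1, 2, 4])]
print(occ(db, frozenset({1, 2})))  # [0, 1, 3]
print(frq(db, frozenset({1, 2})))  # 3
```

Here inclusion is exact: a transaction either contains all of P or does not count at all.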
6
Related works
  • fault-tolerant pattern, degenerate pattern, soft
    occurrence, etc.
  • mainly two approaches
  • (1) generalize inclusion
  • (1-a) the ratio of included items ≥ θ →
    include
  • → loses monotonicity: no subset may be frequent
    in the worst case
  • → several heuristic-search-based algorithms
  • (1-b) at most k items are not included →
    include
  • → satisfies monotonicity: so many small itemsets
    are frequent
  • → maximal enumeration or complete enumeration
    with small k

Example: transactions {1,2}, {2,3}, {1,3} with θ = 66%
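
One possible reading of the small example on this slide, sketched in Python: under rule (1-a), {1,2,3} is "included" in all three transactions, yet every proper subset is included in fewer of them, so monotonicity is lost. The function names and the choice k = 1 for rule (1-b) are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import combinations

def includes_by_ratio(t, p, theta):
    """(1-a): t 'includes' P if at least a theta fraction of P's items are in t."""
    return Fraction(len(t & p), len(p)) >= theta

def includes_with_misses(t, p, k):
    """(1-b): t 'includes' P if at most k items of P are missing from t."""
    return len(p - t) <= k

def freq(db, p, rule, param):
    """Number of transactions that include P under the given rule."""
    return sum(1 for t in db if rule(t, p, param))

db = [frozenset(t) for t in ([1, 2], [2, 3], [1, 3])]
theta = Fraction(66, 100)
full = frozenset({1, 2, 3})
subsets = [frozenset(c) for r in (1, 2) for c in combinations(full, r)]

print(freq(db, full, includes_by_ratio, theta))                     # 3
print(max(freq(db, q, includes_by_ratio, theta) for q in subsets))  # 2
```

With minimum support 3, {1,2,3} is frequent under (1-a) while none of its non-empty proper subsets is. Under (1-b) with k = 1, every subset is included in all three transactions, which illustrates why so many small itemsets become frequent.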
7
Related works 2
  • (2) find pairs of an itemset and a transaction set
    such that few of them do not satisfy inclusion
  • → equivalent to finding a dense submatrix, or
    dense bicluster
  • so many equivalent patterns will be found
  • → mainly, heuristic search for
  • finding one such dense substructure
  • ambiguity on the transaction set
  • → an itemset can have many partners

[Figure: transaction-item matrix with axes "items" and "transactions"]
We introduce a new model for (2) to avoid
redundancy, and propose an efficient depth-first
search type algorithm
8
Average Inclusion
  • inclusion ratio of t for P: |t ∩ P| / |P|
  • average inclusion ratio of transaction set T
    for P
  • → average of inclusion ratio over all
    transactions in T
  • → Σ_{t ∈ T} |t ∩ P| / ( |P| × |T| )
  • → equivalent to dense submatrix/subgraph of
    transaction-item inclusion matrix/graph
  • For a density threshold θ, maximum
    co-occurrence size cov(P) of itemset P
  • → maximum size of transaction set s.t. average
    inclusion ratio ≥ θ

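Example: transactions {1,3,4}, {2,4,5}, {1,2}
Average inclusion ratio over all three transactions: {2,3} → 50%, {4,5} → 50%, {1,2} → 66%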
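
The ratios in this example can be reproduced with a short Python sketch. The function names are illustrative assumptions; exact fractions are used so that 2/3 is not rounded.

```python
from fractions import Fraction

def inclusion_ratio(t, p):
    """|t ∩ P| / |P| for a single transaction t."""
    return Fraction(len(t & p), len(p))

def avg_inclusion_ratio(ts, p):
    """Average of the inclusion ratio over all transactions in ts."""
    return sum(inclusion_ratio(t, p) for t in ts) / len(ts)

db = [frozenset(t) for t in ([1, 3, 4], [2, 4, 5], [1, 2])]
for p in ({2, 3}, {4, 5}, {1, 2}):
    print(sorted(p), avg_inclusion_ratio(db, frozenset(p)))
# [2, 3] 1/2
# [4, 5] 1/2
# [1, 2] 2/3
```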
9
Problem Definition
  • For a density threshold θ,
  • the maximum co-occurrence size cov(P) of itemset
    P
  • → maximum size of transaction set s.t. average
    inclusion ratio ≥ θ
  • Ambiguous frequent itemset: itemset P s.t.
    cov(P) ≥ s
  • (s: minimum support)
  • Ambiguous frequent itemsets
  • are not monotone!

Example: transactions {1,3,4}, {2,4,5}, {1,2}; θ = 66%
cov({3}) = 1, cov({2}) = 3, cov({1,3}) = 2, cov({1,2}) = 3
Ambiguous frequent itemset enumeration: the
problem of outputting all ambiguous frequent
itemsets for a given database D, density threshold
θ, and minimum support s
The goal is to develop an efficient algorithm for
this problem
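
A minimal sketch of cov(P), assuming the greedy view used later in the talk: for a fixed size k, the k most-included transactions maximize the average inclusion ratio, so cov(P) is the largest k whose top-k average is still at least θ. The function name and the use of exact fractions are choices made for this sketch.

```python
from fractions import Fraction

def cov(db, p, theta):
    """Largest k such that the k most-included transactions average >= theta."""
    hits = sorted((len(t & p) for t in db), reverse=True)
    best = total = 0
    for k, h in enumerate(hits, start=1):
        total += h
        if Fraction(total, k * len(p)) >= theta:
            best = k
    return best

db = [frozenset(t) for t in ([1, 3, 4], [2, 4, 5], [1, 2])]
theta = Fraction(66, 100)
for p in ({3}, {2}, {1, 3}, {1, 2}):
    print(sorted(p), cov(db, frozenset(p), theta))
# [3] 1
# [2] 3
# [1, 3] 2
# [1, 2] 3
```

The output shows the non-monotonicity stated on the slide: with s = 2, {1,3} is an ambiguous frequent itemset although its subset {3} is not.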
10
Hardness for Branch-and-Bound
  • A straightforward approach to this problem is
    branch-and-bound
  • In each iteration, divide the problem into two non-empty
    problems by the inclusion of an item


Checking the existence of an ambiguous frequent
itemset is NP-complete (Theorem 1)
11
Is This Really Hard?
  • We proved NP-hardness for "very dense graphs"
  • → unclear for graphs of medium density
  • → polynomial time enumeration is not ruled out

polynomial time in (input size) and (output size)
[Figure: difficulty regions labeled "hard", "easy", and "?"]
12
Efficient Algorithm Idea of Reverse Search
  • We don't use branch-and-bound, but reverse
    search
  • Define an acyclic parent-child relation on all
    objects to be found

Depth-first search on the rooted tree induced by
the relation
Recursively find children to search, thus an
algorithm for finding all children is sufficient
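
The idea can be sketched generically in Python. The toy parent-child relation below (on subsets of {1, ..., n}, where the parent of a set is the set minus its largest element) is an assumption chosen only to keep the example small; it is not the relation used in the talk.

```python
def reverse_search(root, children):
    """Depth-first traversal of the tree induced by a parent-child relation.

    `children(x)` must return exactly the objects whose parent is x.
    """
    stack = [root]
    while stack:
        node = stack.pop()
        yield node
        stack.extend(children(node))

def children(s, n=3):
    """Toy relation: parent(S) = S minus its largest element."""
    start = max(s, default=0) + 1
    return [s | {i} for i in range(start, n + 1)]

found = list(reverse_search(frozenset(), children))
print(len(found))  # 8
```

Because each object has exactly one parent, every object is visited exactly once and no table of already-visited objects is needed.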
13
Neighboring Relation
  • AmbiOcc(P) of an ambiguous frequent itemset P
  • → lexicographically minimum one among
    transaction sets whose average inclusion ratio ≥ θ
    and size = cov(P)
  • e(P): the item e in P s.t. the number of transactions in
    AmbiOcc(P) including e is the minimum (ties are
    broken by taking the minimum index)
  • the parent Prt(P) of P: P \ e(P)

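Example: θ = 66%, s = 4
A: {1,3,4,7}  B: {2,4,5}  C: {1,2,7}  D: {1,4,5,7}  E: {2,3,6}  F: {3,4,6}
AmbiOcc({1,4}) = {D, A, B, C, F}
Transactions ordered by inclusion in {1,4,5}: D | A, B | C, F | E
AmbiOcc({1,4,5}) = {D, A, B, C}
e({1,4,5}) = 5, Prt({1,4,5}) = {1,4}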
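
The example above can be checked with a short Python sketch. It assumes that AmbiOcc(P) can be obtained greedily, taking transactions in decreasing order of inclusion ratio with ties broken by name; the function names are illustrative.

```python
from fractions import Fraction

DB = {"A": {1, 3, 4, 7}, "B": {2, 4, 5}, "C": {1, 2, 7},
      "D": {1, 4, 5, 7}, "E": {2, 3, 6}, "F": {3, 4, 6}}
THETA = Fraction(66, 100)

def ambi_occ(p):
    """Greedy choice: most-included transactions first, ties by name (assumption)."""
    order = sorted(DB, key=lambda name: (-len(DB[name] & p), name))
    chosen, total = [], 0
    for name in order:
        total += len(DB[name] & p)
        if Fraction(total, len(p) * (len(chosen) + 1)) < THETA:
            break  # adding this transaction would drop the average below theta
        chosen.append(name)
    return chosen

def e_item(p):
    """e(P): item of P contained in the fewest transactions of AmbiOcc(P)."""
    occ = ambi_occ(p)
    # sorted(p) makes min() break ties by the smallest item index
    return min(sorted(p), key=lambda i: sum(i in DB[t] for t in occ))

def parent(p):
    """Prt(P) = P minus e(P)."""
    return p - {e_item(p)}

p = frozenset({1, 4, 5})
print(ambi_occ(frozenset({1, 4})))  # ['A', 'D', 'B', 'C', 'F']
print(ambi_occ(p))                  # ['D', 'A', 'B', 'C']
print(e_item(p), sorted(parent(p))) # 5 [1, 4]
```

Items 1 and 4 each appear in three transactions of AmbiOcc({1,4,5}), item 5 in only two, so 5 is removed.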
14
Properties of Parent
  • The parent Prt(P) of P: P \ e(P)
  • → uniquely defined
  • Average inclusion ratio of AmbiOcc(P) does not
    decrease when e(P) is removed from P
  • → Prt(P) is an ambiguous frequent itemset
  • |Prt(P)| < |P| (parent is always smaller)
  • → the relation is acyclic, and induces a tree
    (rooted at ∅)

15
Enumeration Tree
  • The relation is acyclic, and induces a tree
    (rooted at ∅)
  • We call this tree the enumeration tree

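Example: θ = 66%, s = 4
A: {1,3,4,7}  B: {2,4,5}  C: {1,2,7}  D: {1,4,5,7}  E: {2,3,6}  F: {3,4,6}
[Figure: enumeration tree rooted at ∅; nodes listed by size]
size 1: {1}, {2}, {3}, {4}, {7}
size 2: {1,7}, {3,4}, {4,5}, {1,4}, {4,7}
size 3: {1,4,7}, {1,4,5}, {1,3,4}, {3,4,7}, {4,5,7}, {1,2,7}, {1,3,7}, {1,5,7}
size 4: {1,3,4,7}, {1,4,5,7}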
16
Listing Children
  • To perform a depth-first search on the enumeration
    tree, what we have to do is find all children
    of a given itemset
  • Prt(P') is obtained by removing an item
    from P'
  • → a child P' of P is obtained by adding an item
    to P
  • → to find all children, we examine all possible
    items

17
Check Candidates
  • An item addition does not always yield a child
  • → They are just candidates
  • If the parent of a candidate P' = P∪e is P
    (i.e. e(P') = e),
  • P' is a child of P
  • → check by computing e(P∪e) for each
    candidate P∪e

Theorem
Enumeration is done in O(||D|| n) time for each
ambiguous frequent itemset
18
Algorithm Description
  • Algorithm AFIM ( P: pattern, D: database )
  • output P
  • compute cov(P∪e) for all items e not in P
  • for each e s.t. cov(P∪e) ≥ s do
  • compute AmbiOcc(P∪e)
  • compute e(P∪e)
  • if e(P∪e) = e then call AFIM ( P∪e, D )
  • done
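
A compact Python sketch of the procedure above, run on the six-transaction example from the earlier slides. It is a plain re-implementation under stated assumptions, not the authors' C code: AmbiOcc is computed greedily by (inclusion ratio, index), cov is recomputed from scratch for every candidate, and none of the speed-ups from the following slides are used. All function names are illustrative.

```python
from fractions import Fraction

def ambi_occ(db, p, theta):
    """Greedy choice by (inclusion desc, index asc); its length equals cov(P)."""
    hits = [len(t & p) for t in db]
    order = sorted(range(len(db)), key=lambda i: (-hits[i], i))
    chosen, total = [], 0
    for i in order:
        total += hits[i]
        if Fraction(total, len(p) * (len(chosen) + 1)) < theta:
            break
        chosen.append(i)
    return chosen

def removed_item(db, p, theta):
    """e(P): item of P in the fewest transactions of AmbiOcc(P), ties by index."""
    occ = ambi_occ(db, p, theta)
    return min(sorted(p), key=lambda x: sum(x in db[i] for i in occ))

def afim(db, items, theta, s, p=frozenset(), out=None):
    """Reverse search from the root (the empty itemset); returns all visited itemsets."""
    out = [] if out is None else out
    out.append(p)
    for e in sorted(items - p):
        q = p | {e}
        # q is a child of p iff it is ambiguous frequent and e(q) == e
        if len(ambi_occ(db, q, theta)) >= s and removed_item(db, q, theta) == e:
            afim(db, items, theta, s, q, out)
    return out

db = [frozenset(t) for t in
      ([1, 3, 4, 7], [2, 4, 5], [1, 2, 7], [1, 4, 5, 7], [2, 3, 6], [3, 4, 6])]
items = frozenset().union(*db)
result = afim(db, items, Fraction(66, 100), 4)
print(sorted(sorted(p) for p in result if p))
```

The returned list starts with the empty itemset, which serves only as the root of the enumeration tree.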

19
Computing cov(P∪e)
  • A transaction set whose size and average
    inclusion ratio are equal to those of AmbiOcc(P∪e) is
    obtained by choosing transactions in the
    decreasing order of inclusion ratio
  • cov(P) ≥ cov(P∪e) holds whenever P∪e is a child of P
  • for any transactions T and T' such that the
    inclusion ratio of T for P is larger than that of T'
  • → the inclusion ratio of T for P∪e is no
    less than that of T'
  • → we can restrict the choice to transactions in
    AmbiOcc(P), to compute cov(P∪e)

20
Example of Computing cov
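  • computation of cov(P∪e) for P = {1,4} and e = 5

θ = 66%, s = 4
A: {1,3,4,7}  B: {2,4,5}  C: {1,2,7}  D: {1,4,5,7}  E: {2,3,6}  F: {3,4,6}
For P = {1,4}:
  inc. 2 items: D, A | inc. 1 item: B, C, F | inc. no item: E
  → AmbiOcc({1,4}) = {D, A, B, C, F}
For P∪e = {1,4,5}:
  inc. 3 items: D | inc. 2 items: A, B | inc. 1 item: C, F | inc. no item: E
  → AmbiOcc({1,4,5}) = {D, A, B, C}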
21
Efficient Computation of covs
  • For efficient computation, we classify
    transactions by inclusion ratio
  • When we compute cov(P∪e), we compute the
    intersection of each group and Occ(e)
  • → inclusion ratio increases, for transactions
    included in Occ(e)
  • → by moving such transactions, the classification
    for P∪e is obtained
  • This task for all items is done efficiently by
    Delivery, which takes O(||G||) time where ||G||
    is the sum of transaction sizes in group G →
    computation of cov(P∪e) can be done in linear time

[Figure: transactions classified into groups by number of missing items: 0 miss, 1 miss, 2 miss, 3 miss, 4 miss, 5 miss]
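
A rough Python sketch of the grouping idea, using the same six-transaction example. Transactions are bucketed by the number of items of P they miss; when e is added, transactions containing e stay in their bucket and all others move down by one. This sketch tests membership directly instead of using Delivery, so it only illustrates the bookkeeping, not the stated time bound. All function names are assumptions.

```python
from fractions import Fraction

DB = {"A": {1, 3, 4, 7}, "B": {2, 4, 5}, "C": {1, 2, 7},
      "D": {1, 4, 5, 7}, "E": {2, 3, 6}, "F": {3, 4, 6}}

def classify(p):
    """groups[m] = names of transactions missing exactly m items of P."""
    groups = [[] for _ in range(len(p) + 1)]
    for name in sorted(DB):
        groups[len(p - DB[name])].append(name)
    return groups

def regroup(groups, e):
    """Classification for P∪e, derived from the classification for P."""
    new = [[] for _ in range(len(groups) + 1)]
    for m, group in enumerate(groups):
        for name in group:
            new[m if e in DB[name] else m + 1].append(name)
    return new

def cov_from_groups(groups, theta):
    """Take transactions from the 0-miss group downward while the average stays >= theta."""
    size_p = len(groups) - 1
    taken = total = 0
    for m, group in enumerate(groups):
        for _ in group:
            if Fraction(total + size_p - m, size_p * (taken + 1)) < theta:
                return taken
            taken, total = taken + 1, total + size_p - m
    return taken

theta = Fraction(66, 100)
g14 = classify(frozenset({1, 4}))
g145 = regroup(g14, 5)
print(g14)    # [['A', 'D'], ['B', 'C', 'F'], ['E']]
print(g145)   # [['D'], ['A', 'B'], ['C', 'F'], ['E']]
print(cov_from_groups(g14, theta), cov_from_groups(g145, theta))  # 5 4
```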
22
Computing AmbiOcc and e
  • Computation of AmbiOcc(P∪e) needs a greedy
    choice of transactions, in the decreasing order
    of (inclusion ratio, index)
  • Computation of e(P∪e) needs the intersection of
    AmbiOcc(P∪e) and Occ(i) for each i ∈ P → Delivery
  • → needs O(||D||) time in the worst case
  • However, when cov(P) is small, not so many
    transactions may be scanned, thus we expect that the
    average computation time is not so long

23
Bottom-wideness
  • DFS generates several recursive calls
    in each iteration
  • → Recursion tree grows exponentially, by going
    down
  • → Computation time is dominated by the lowest
    levels
  • Computation time per iteration decreases by going down

[Figure: recursion tree; iterations near the root take a long time, iterations near the bottom take a short time]
Near the bottom levels, cov(P) may be close to s,
so few transactions are scanned and an iteration may take O(st) time
where t is the average size of transactions
24
Computational Experiments
  • CPU: Pentium M 1.1GHz
  • memory: 256MB
  • OS: Windows XP with Cygwin
  • Code: C
  • Compiler: gcc 2.3
  • Test instances are taken from benchmark
    datasets for frequent itemset mining

25
BMS-WebView 2
  • Real-world web access data (sparse; average
    transaction size 4.5)

26
Mushroom
  • Real-world machine learning data on
    mushrooms (density 1/3)

27
Possibility for Further Improvements
  • Ratio of unnecessary operations, non-maximal
    patterns

28
Conclusion
  • Introduced a new model for frequent itemset
    mining with ambiguous inclusion relation, which
    avoids redundancy
  • Showed a hardness result for branch-and-bound
  • Showed efficiency on practical (sparse)
    datasets
  • Future Work
  • Reduce the time complexity to close the gap
    with practical performance
  • Efficient models and computation for maximal
    ones
  • Application of the technique to the other
    problems
  • (ambiguous pattern mining for graph, tree, vector
    data, etc.)