An Efficient Polynomial Delay Algorithm for Pseudo Frequent Itemset Mining


Title: An Efficient Polynomial Delay Algorithm for Pseudo Frequent Itemset Mining


1
An Efficient Polynomial Delay Algorithm
for Pseudo Frequent Itemset Mining
  • Takeaki Uno (National Institute of Informatics)
  • Hiroki Arimura (Hokkaido University)

2/Oct/2007 Discovery Science 2007
2
Frequent Pattern Mining
  • problem of finding all frequently appearing
    patterns in a (large-scale) database
  • database: transactions, trees, strings, graphs,
    vectors
  • pattern: subset, tree, path, sequence, graph,
    geograph

[Figure: two example databases. One is a table of experiments (ex1 to ex4) and their outcomes; the other is genome information, i.e., DNA sequences such as ATGCGCCGTA TAGCGGGTGG TTCGCGTTAG ...]
3
This Research
  • addresses transaction databases
  • transaction database: each record (transaction)
    $T$ of the database $D$ is a subset of the itemset $E$,
    i.e., $\forall T \in D,\ T \subseteq E$
  • frequent itemset: a subset of $E$ included in at
    least $s$ transactions ($s$: minimum support threshold)
  • problems
  • - there are too many patterns to find the valuable
    ones among them
  • - inclusion is strict, which makes it hard to deal with errors
  • ⇒ "patterns ambiguously included in many
    transactions" are important

We introduce an ambiguous inclusion relation and propose
an efficient mining algorithm
4
Related Works
  • Frequent itemset mining with such ambiguity goes
    by several names:
  • fault-tolerant pattern, degenerate pattern, soft
    occurrence
  • - ambiguity of inclusion: "a pattern is
    included if the ratio of its included items
    is more than the threshold"
  • - another approach: find combinations of an
    itemset and a transaction set such that few pairs of
    item and transaction do not satisfy the
    inclusion relation
  • - similarity is used for string matching
    and homology search
  • Little "enumeration type" research with
    completeness

We look at practical models and algorithms from the
viewpoint of algorithm theory
5
Notations for F.I.M.
  • For an itemset $K$:
  • occurrence of $K$: a transaction of $D$ including $K$
  • $Occ(K)$, the occurrence set of $K$: the set of
    occurrences of $K$
  • $frq(K)$, the frequency of $K$: the size of $Occ(K)$

$D = \{\{1,2,5,6,7,9\}, \{2,3,4,5\}, \{1,2,7,8,9\}, \{1,7,9\}, \{2,7,9\}, \{2\}\}$
$Occ(\{1,2\}) = \{\{1,2,5,6,7,9\}, \{1,2,7,8,9\}\}$
$Occ(\{2,7,9\}) = \{\{1,2,5,6,7,9\}, \{1,2,7,8,9\}, \{2,7,9\}\}$
6
Frequent Itemset
  • Frequent itemset: an itemset with frequency no
    less than $s$
  • ($s$ is called the minimum support (threshold))
  • Ex.)

$D = \{\{1,2,5,6,7,9\}, \{2,3,4,5\}, \{1,2,7,8,9\}, \{1,7,9\}, \{2,7,9\}, \{2\}\}$
Itemsets included in no less than 3 transactions:
$\{1\}, \{2\}, \{7\}, \{9\}, \{1,7\}, \{1,9\}, \{2,7\}, \{2,9\}, \{7,9\}, \{1,7,9\}, \{2,7,9\}$

Frequent itemset mining: the problem of enumerating
all frequent itemsets for a given database $D$ and
minimum support $s$
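The notation of the last two slides can be checked on the example database with a few lines of Python. This is only a minimal sketch: the function names `occ`, `frq`, and `frequent_itemsets` are chosen here for illustration, and the brute-force loop over all subsets is an assumption made to keep the example short, not an algorithm from the talk.

```python
from itertools import combinations

# Example database from the slides: each transaction is a set of items.
D = [{1, 2, 5, 6, 7, 9}, {2, 3, 4, 5}, {1, 2, 7, 8, 9}, {1, 7, 9}, {2, 7, 9}, {2}]

def occ(K):
    """Occ(K): the transactions of D that include itemset K."""
    return [T for T in D if K <= T]

def frq(K):
    """frq(K): the number of occurrences of K."""
    return len(occ(K))

def frequent_itemsets(s):
    """Brute force (illustration only): test every nonempty itemset against s."""
    items = sorted(set().union(*D))
    return [set(K) for r in range(1, len(items) + 1)
            for K in combinations(items, r) if frq(set(K)) >= s]
```

With `s = 3`, `frequent_itemsets(3)` returns the eleven itemsets listed on the slide, from `{1}` to `{2, 7, 9}`.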
7
Inclusion with Ambiguity
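  • Ambiguous inclusion relation for an itemset $P$ and a
    transaction $T$
  • Popular definition: $|P \cap T| / |P| \ge \theta$ for a
    threshold $\theta < 1$
  • ⇒ loses the monotonicity of frequent itemsets
  • ⇒ there can be a frequent itemset such that "every
    nonempty proper subset of it is infrequent"
  • ⇒ high computation cost

Examples for $\theta = 0.6$:
$\{1,2,3\}$ is included in $\{1,2,4,5\}$ ($2/3 \ge 0.6$)
$\{1,2,3,4,5,6,7\}$ is included in $\{1,3,5,6,7\}$ ($5/7 \ge 0.6$)
$\{1,2,3\}$ is not included in $\{1,4,5\}$ ($1/3 < 0.6$)
For $D = \{\{1,2\}, \{2,3\}, \{1,3\}\}$ and $\theta = 0.6$: $\{1,2,3\}$ is included in all three transactions, but none of its nonempty proper subsets is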
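The loss of monotonicity under the ratio definition can be reproduced directly. The sketch below assumes the three-transaction example on this slide is the database {1,2}, {2,3}, {1,3} with a threshold of 0.6; the function names are hypothetical.

```python
def ratio_included(P, T, theta):
    """Ratio-based inclusion: at least a theta fraction of P's items appear in T."""
    return len(P & T) >= theta * len(P)

def ratio_frq(P, D, theta):
    """Number of transactions of D that ratio-include P."""
    return sum(1 for T in D if ratio_included(P, T, theta))

# Small database in which the full itemset is "more frequent" than its subsets.
D_SMALL = [{1, 2}, {2, 3}, {1, 3}]
SUBSETS = [{1}, {2}, {3}, {1, 2}, {2, 3}, {1, 3}]
```

Here `ratio_frq({1, 2, 3}, D_SMALL, 0.6)` is 3, while every itemset in `SUBSETS` has ratio frequency at most 2, so a search that prunes on infrequent subsets would miss `{1, 2, 3}` for a minimum support of 3.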
8
k-pseudo Inclusion
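  • Use a threshold on the number of non-included items
  • k-pseudo inclusion: $|P \setminus T| \le k$ for a threshold
    $k \ge 0$
  • (likewise: k-pseudo occurrence / occurrence set /
    frequency)
  • ⇒ monotonicity is kept
  • ⇒ able to find characterizations such as
    "many transactions include at least 3
    items of $P$"

Examples for $k = 1$:
$\{1,2,3\}$ is 1-pseudo included in $\{1,2,4,5\}$ (only item 3 is missing)
$\{1,2,3,4,5,6,7\}$ is not 1-pseudo included in $\{1,3,5,6,7\}$ (items 2 and 4 are missing)
$\{1,2,3\}$ is not 1-pseudo included in $\{1,4,5\}$ (items 2 and 3 are missing)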
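A minimal sketch of k-pseudo inclusion and k-pseudo frequency follows, using Python sets. The names `pseudo_included` and `pseudo_frq` are illustrative, and the database `D` is the six-transaction example used on the earlier slides.

```python
def pseudo_included(P, T, k):
    """k-pseudo inclusion: at most k items of P are missing from T."""
    return len(P - T) <= k

def pseudo_frq(P, D, k):
    """k-pseudo frequency: number of transactions that k-pseudo include P."""
    return sum(1 for T in D if pseudo_included(P, T, k))

D = [{1, 2, 5, 6, 7, 9}, {2, 3, 4, 5}, {1, 2, 7, 8, 9}, {1, 7, 9}, {2, 7, 9}, {2}]
```

Monotonicity follows because removing an item from `P` can only shrink `P - T`: for instance `pseudo_frq({1, 2, 3}, D, 1)` is 3, and the subset `{1, 2}` has 1-pseudo frequency 6.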
9
k Pseudo Frequent Itemset
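  • k-pseudo frequent itemset: an itemset k-pseudo
    included in at least $s$ transactions of $D$

$D = \{\{1,2,5,6,7,9\}, \{2,3,4,5\}, \{1,2,7,8,9\}, \{1,7,9\}, \{2,7,9\}, \{2\}\}$
1-pseudo frequent itemsets for $s = 3$ (sizes 3 and 4 shown):
$\{1,2,3\}, \{1,2,4\}, \{1,2,5\}, \{1,2,7\}, \{1,2,9\}, \{1,3,7\}, \{1,3,9\},$
$\{1,4,7\}, \{1,4,9\}, \{1,5,7\}, \{1,5,9\}, \{1,6,7\}, \{1,6,9\},$
$\{1,7,8\}, \{1,7,9\}, \{1,8,9\}, \{2,3,7\}, \{2,3,9\}, \{2,4,7\},$
$\{2,4,9\}, \{2,5,7\}, \{2,5,8\}, \{2,5,9\}, \{2,6,7\}, \{2,6,9\},$
$\{2,7,8\}, \{2,7,9\}, \{2,8,9\}, \{3,7,9\}, \{4,7,9\}, \{5,7,9\},$
$\{6,7,9\}, \{7,8,9\}, \{1,2,7,9\}, \{1,3,7,9\},$
$\{1,4,7,9\}, \{1,5,7,9\}, \{1,6,7,9\}, \{1,7,8,9\}, \{2,3,7,9\},$
$\{2,4,7,9\}, \{2,5,7,9\}, \{2,6,7,9\}, \{2,7,8,9\}$

Many trivial patterns. How to enumerate them efficiently?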
10
Enumeration using Monotonicity
  • Pseudo frequent itemsets have the monotone property,
    so a simple backtrack algorithm works
  • For each k-pseudo frequent itemset $P$, compute the
    k-pseudo frequency of each $P \cup \{e\}$
  • If the k-pseudo frequency of $P \cup \{e\}$
    is no less than $s$, make a recursive
    call to enumerate the k-pseudo frequent itemsets
    including $P \cup \{e\}$

Polynomial time enumeration
How to compute efficiently?
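The backtrack enumeration can be sketched as follows. Two points are assumptions made for this illustration rather than details stated on the slide: items are positive integers, and duplicates are avoided by only adding items larger than the current maximum of P (tail extension). Frequencies are recomputed from scratch here, which the following slides improve on.

```python
D = [{1, 2, 5, 6, 7, 9}, {2, 3, 4, 5}, {1, 2, 7, 8, 9}, {1, 7, 9}, {2, 7, 9}, {2}]
ITEMS = sorted(set().union(*D))

def pseudo_frq(P, k):
    """k-pseudo frequency of P: transactions missing at most k items of P."""
    return sum(1 for T in D if len(P - T) <= k)

def backtrack(P, k, s, out):
    """Report P, then extend it by each item larger than its maximum (assumed rule)."""
    out.append(sorted(P))
    tail = max(P, default=0)  # items are assumed to be positive integers
    for e in ITEMS:
        if e > tail and pseudo_frq(P | {e}, k) >= s:
            backtrack(P | {e}, k, s, out)

def enumerate_pseudo_frequent(k, s):
    """All k-pseudo frequent itemsets of D, including the empty set."""
    out = []
    if pseudo_frq(set(), k) >= s:
        backtrack(set(), k, s, out)
    return out
```

For `k = 1` and `s = 3` the result contains 33 itemsets of size 3 and 11 of size 4, matching the list on the previous slide, plus the smaller itemsets that the slide does not show. With `k = 0` the same code enumerates ordinary frequent itemsets.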
11
Computing k-Pseudo Occurrences
  • Define $Occ_h(P) = \{T \in D : |P \setminus T| = h\}$
  • ⇒ the set of transactions missing exactly $h$
    items of $P$
  • ⇒ $Occ_{\le k}(P) = \bigcup_{h \le k} Occ_h(P)$
  • $Occ_h(P \cup \{e\}) = (Occ_h(P) \cap Occ(\{e\})) \cup (Occ_{h-1}(P) \setminus Occ(\{e\}))$
  • ⇒ the pseudo occurrence set is updated
    by taking intersections:
    compute $Occ_h(P) \cap Occ(\{e\})$
    for all pairs of $e$ and $h$

[Figure: example transactions over items A to G, grouped into $Occ_0(P)$, $Occ_1(P)$, and $Occ_2(P)$ for an itemset $P$]
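The update rule can be sketched with one set of transaction indices per value of h. The helper names `occ_item`, `initial_buckets`, and `extend` are hypothetical, and the database is the six-transaction example from the earlier slides, indexed from 0.

```python
D = [{1, 2, 5, 6, 7, 9}, {2, 3, 4, 5}, {1, 2, 7, 8, 9}, {1, 7, 9}, {2, 7, 9}, {2}]

def occ_item(e):
    """Indices of the transactions that contain item e."""
    return {i for i, T in enumerate(D) if e in T}

def initial_buckets(k):
    """Buckets for the empty itemset: nothing is missing, so all of D is in bucket 0."""
    return [set(range(len(D)))] + [set() for _ in range(k)]

def extend(buckets, e):
    """Buckets of P plus item e, computed from the buckets of P (e not in P)."""
    occ_e = occ_item(e)
    new = []
    for h in range(len(buckets)):
        keep = buckets[h] & occ_e                             # e present: still h missing
        gain = buckets[h - 1] - occ_e if h > 0 else set()     # e absent: h-1 becomes h
        new.append(keep | gain)
    return new
```

Starting from `initial_buckets(1)` and extending by items 1, 2, and 3 gives an empty bucket 0 and bucket 1 equal to `{0, 1, 2}`, so the 1-pseudo frequency of `{1, 2, 3}` is 3. Transactions that would move past bucket k are simply dropped, which is why only intersections and differences with `occ_item(e)` are needed.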
12
Taking Intersections Efficiently
  • $Occ_h(P \cup \{e\}) = (Occ_h(P) \cap Occ(\{e\})) \cup (Occ_{h-1}(P) \setminus Occ(\{e\}))$
  • ⇒ pseudo occurrences have the same properties as
    usual occurrences
  • ⇒ many existing techniques for updating
    occurrence sets can be used
    (down project, delivery, bitmap)
  • Database reduction (FP-tree)
    is also available
  • In deeper levels of the recursion,
    fewer transactions have to be scanned,
    so the computation is fast

Transactions (ID: items):
1: A,C,D   2: A,B,C,E,F   3: B   4: B   5: A,B   6: A   7: A,C,D,E   8: C   9: A,C,D,E
Occurrences (item: transaction IDs):
A: 1,2,5,6,7,9   B: 2,3,4,5   C: 1,2,7,8,9   D: 1,7,9   E: 2,7,9   F: 2
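The two views of the example database on this slide are transposes of each other. The sketch below builds the per-item occurrence lists in one scan of the transactions, which is one plausible reading of the "delivery" technique named above; the function name `deliver` and the dictionary layout are assumptions made for illustration.

```python
from collections import defaultdict

# Example from the slide: transaction ID -> items.
transactions = {1: "ACD", 2: "ABCEF", 3: "B", 4: "B", 5: "AB",
                6: "A", 7: "ACDE", 8: "C", 9: "ACDE"}

def deliver(transactions):
    """Scan each transaction once and append its ID to the list of every item it has."""
    occ = defaultdict(list)
    for tid in sorted(transactions):
        for item in transactions[tid]:
            occ[item].append(tid)
    return dict(occ)
```

Once the lists exist, an intersection such as `set(occ["A"]) & set(occ["C"])` gives the transactions containing both items without rescanning the database.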
13
Using Bottom-wideness
  • Backtracking (depth-first search) generates
    several recursive calls in each iteration
  • ⇒ The computation tree spreads exponentially as
    it goes down
  • ⇒ The computation time is dominated by the
    bottom-level iterations of the recursion tree

Since few occurrences have to be computed in the
lower levels, the amortized computation time is
reduced to that of the bottom levels
[Figure: recursion tree; iterations near the root take a long time, iterations near the bottom take a short time]
14
For Large Minimum Support
  • When $s$ is large, we access many transactions on
    the bottom levels
  • ⇒ The improvement from bottom-wideness is not
    drastic
  • Reduce the database to speed up the bottom levels
  • (1) Delete items less than the maximum item in $P$
  • (2) Delete items that are infrequent in the
    occurrence set database
    (since they are never added in the recursive calls)
  • (3) Unify identical transactions
  • In practice, the database size is constant in the
    bottom levels

Example: $P = \{1,3\}$, $k = 1$, $s = 4$
$\{1,3,5\}$
$\{1,2,3,4,6\}$
$\{1,7\}$
$\{2,3,4,6,7\}$
$\{3,4,5,6,7\}$
$\{2,3,4,6,7\}$
⇒ No big difference from small $s$
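The three reduction steps can be sketched on the example above. Several details are assumptions made for this illustration: items up to and including the maximum of P are dropped, an item e counts as "infrequent" when P plus e is not k-pseudo frequent, and transactions are merged only when both their remaining items and their number of missing items h agree (two transactions with the same items but different h behave differently later). The function name `reduce_database` is hypothetical.

```python
from collections import Counter

def reduce_database(D, P, k, s):
    """Return {(remaining items, h): multiplicity} for the k-pseudo occurrences of P."""
    tail = max(P)
    # Keep only k-pseudo occurrences of P, remembering how many items of P each misses.
    occs = [(T, len(P - T)) for T in D if len(P - T) <= k]
    # Transactions with h < k include P plus any single extra item.
    slack = sum(1 for _, h in occs if h < k)
    candidates = {e for T, _ in occs for e in T if e > tail}          # step (1)
    keep = {e for e in candidates                                     # step (2)
            if slack + sum(1 for T, h in occs if h == k and e in T) >= s}
    return Counter((frozenset(T & keep), h) for T, h in occs)         # step (3)

D_EXAMPLE = [{1, 3, 5}, {1, 2, 3, 4, 6}, {1, 7}, {2, 3, 4, 6, 7},
             {3, 4, 5, 6, 7}, {2, 3, 4, 6, 7}]
```

For `P = {1, 3}`, `k = 1`, `s = 4`, item 5 is removed because only three transactions 1-pseudo include `{1, 3, 5}`, and the last three transactions collapse into a single entry `{4, 6, 7}` with h = 1 and multiplicity 3.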
15
Small Trivial Patterns
  • Under k-pseudo inclusion, any itemset of size
    no more than $k$ is included in every transaction
  • itemsets of size slightly greater than $k$ are also
    included in many transactions
  • ⇒ Many small and trivial frequent itemsets
  • We want to ignore these itemsets in practice
  • ⇒ Consider the problem of directly finding
    pseudo frequent itemsets of size $l$

16
Directly Finding Large Itemset
  • Searching all itemsets of size $l$ needs exponential
    time
  • ⇒ Pruning unnecessary search is crucial
  • ⇒ Take candidates according to partial structure
  • Let $P$ be a k-pseudo frequent itemset of size $l$
  • WLOG, $P = \{1,\dots,l\}$, with the items
    sorted in increasing order of
    $|Occ_k(P) \setminus Occ(\{e\})|$
  • Consider the (k-1)-pseudo frequency of the itemset
    $\{1,\dots,y\}$
  • Any transaction in $Occ_k(P) \setminus Occ(\{e\})$, $e > y$,
    (k-1)-pseudo includes $\{1,\dots,y\}$

17
Search Route to Itemset of Size l
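  • Any transaction in $Occ_k(P) \setminus Occ(\{e\})$, $e > y$,
    (k-1)-pseudo includes $\{1,\dots,y\}$
  • ⇒ $Occ_{\le k-1}(\{1,\dots,y\}) \supseteq \bigcup_{e = y+1,\dots,|P|} (Occ_k(P) \setminus Occ(\{e\}))$
  • the average of $|Occ_k(P) \setminus Occ(\{e\})|$ over $e \in P$ is no less than
    $(k / |P|)\,|Occ_k(P)|$
  • the items $1,\dots,|P|$ are sorted in increasing order of
    $|Occ_k(P) \setminus Occ(\{e\})|$
  • ⇒ $|Occ_{\le k-1}(\{1,\dots,y\})| \ge |Occ_k(P)|\,(|P|-y)/|P|$
    (partial frequency condition)

There is a sequence of itemsets from the empty set to
$P$ composed only of itemsets satisfying the partial
frequency condition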
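The counting argument behind the bound can be written out step by step. The derivation below is a sketch under one reading of the slide's notation: $Occ_k(P)$ is the set of transactions missing exactly $k$ items of $P$ (as defined on slide 11), $Occ_{\le k-1}$ is the (k-1)-pseudo occurrence set, $k \ge 1$, $l = |P|$, and $a_e$ is shorthand introduced here. The last line, which brings in the minimum support $s$, is an added step and not text from the slide.

```latex
% Shorthand (introduced here): a_e = |Occ_k(P) \setminus Occ(\{e\})| for e \in P = \{1,\dots,l\},
% with items indexed so that a_1 \le a_2 \le \dots \le a_l.
\begin{align*}
\sum_{e \in P} a_e &= \sum_{T \in Occ_k(P)} |P \setminus T| = k\,|Occ_k(P)|
  && \text{(each such $T$ misses exactly $k$ items)} \\
\sum_{e > y} a_e &\ge (l - y)\,\frac{k}{l}\,|Occ_k(P)|
  && \text{(the $l-y$ largest values sum to at least $l-y$ times the average)} \\
\Bigl|\bigcup_{e > y} \bigl(Occ_k(P) \setminus Occ(\{e\})\bigr)\Bigr|
  &\ge \frac{1}{k} \sum_{e > y} a_e
  && \text{(each $T$ is counted at most $k$ times)} \\
|Occ_{\le k-1}(\{1,\dots,y\})| &\ge \frac{l - y}{l}\,|Occ_k(P)|
  && \text{(the union is contained in the left-hand set)} \\
|Occ_{\le k-1}(\{1,\dots,y\})| &\ge |Occ_{\le k-1}(P)| + \frac{l - y}{l}\,|Occ_k(P)|
  \ge \frac{l - y}{l}\,|Occ_{\le k}(P)| \ge \frac{l - y}{l}\,s
  && \text{(since $Occ_{\le k-1}(P) \subseteq Occ_{\le k-1}(\{1,\dots,y\})$)}
\end{align*}
```

The final inequality is the form that can be tested during a search without knowing $P$ in advance: an itemset $Q$ of size $y$ is worth extending toward size $l$ only if its (k-1)-pseudo frequency is at least $s\,(l-y)/l$.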
18
Example for Partial Frequency Condition
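  • Itemsets satisfying the partial frequency
    condition,
    for $k = 1$, $s = 3$, $l = 3$

$D = \{\{1,2,5,6,7,9\}, \{2,3,4,5\}, \{1,2,7,8,9\}, \{1,7,9\}, \{2,7,9\}, \{2\}\}$
1-pseudo frequent itemsets satisfying the
partial frequency condition:
$\{1\}, \{2\}, \{5\}, \{7\}, \{9\},$
$\{1,2\}, \{1,5\}, \{1,6\}, \{1,7\}, \{1,8\}, \{1,9\}, \{2,3\}, \{2,4\},$
$\{2,5\}, \{2,6\}, \{2,7\}, \{2,8\}, \{2,9\}, \{3,5\}, \{4,5\}, \{5,6\},$
$\{5,7\}, \{5,9\}, \{6,7\}, \{6,9\}, \{7,8\}, \{7,9\}, \{8,9\}$

The number of itemsets to be searched decreases
⇒ an efficient search is expected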
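The list on this slide can be reproduced under one reading of the partial frequency condition: an itemset Q of size y qualifies when its (k-1)-pseudo frequency is at least s(l - y)/l, and only itemsets reachable from the empty set through qualifying itemsets are kept. Both the reading and the names `satisfies_pfc` and `routes` are assumptions of this sketch; it does not additionally filter by 1-pseudo frequency.

```python
D = [{1, 2, 5, 6, 7, 9}, {2, 3, 4, 5}, {1, 2, 7, 8, 9}, {1, 7, 9}, {2, 7, 9}, {2}]
ITEMS = sorted(set().union(*D))

def pseudo_frq(P, k):
    """k-pseudo frequency of P: transactions missing at most k items of P."""
    return sum(1 for T in D if len(P - T) <= k)

def satisfies_pfc(Q, k, s, l):
    """Assumed condition: |Occ_{<=k-1}(Q)| >= s * (l - |Q|) / l, tested without division."""
    return pseudo_frq(Q, k - 1) * l >= s * (l - len(Q))

def routes(k, s, l):
    """Qualifying itemsets of sizes 1..l-1, grown level by level from the empty set."""
    level = {frozenset()}
    found = []
    for _ in range(1, l):
        level = {Q | {e} for Q in level for e in ITEMS
                 if e not in Q and satisfies_pfc(Q | {e}, k, s, l)}
        found.append(sorted(sorted(Q) for Q in level))
    return found
```

For `k = 1`, `s = 3`, `l = 3`, the first level is `[[1], [2], [5], [7], [9]]` and the second level holds the 23 pairs shown above. The pair `{3, 4}` occurs in one transaction but is absent because neither `{3}` nor `{4}` qualifies at size 1.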
19
Restricted Search Route by P.F.C.
  • Any k-pseudo frequent itemset of size $l$ can be
    found by passing only through itemsets satisfying
    the partial frequency condition
  • ⇒ Let's do a backtrack search
  • There always exists an item whose removal yields an
    itemset satisfying the condition
  • Tail extension is not available
    (removal of the tail may violate the condition)
  • Simple hill climbing generates duplicates
  • So, use a generation rule to avoid duplication
    (reverse search)

20
Reverse Search for P.F.C.
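  • Rule: generate an itemset $P$ from $P \setminus \{e\}$, where
    $e$ maximizes $|Occ_{\le k-1}(P \setminus \{e\})|$
    (ties are broken by choosing the minimum
    index)
  • ReverseSearch($P$)
  • 1. if $|P| = l$ then output $P$; return
  • 2. for each $e \notin P$ do
  • if $P \cup \{e\}$ is a k-pseudo frequent itemset
    satisfying the P.F.C. then
  • if $e$ maximizes $|Occ_{\le k-1}((P \cup \{e\}) \setminus \{e'\})|$
    over $e' \in P \cup \{e\}$ then
    ReverseSearch($P \cup \{e\}$)
  • 3. end for
  • $|Occ_{\le k-1}(P \setminus \{e\})|$ can be computed efficiently by
    existing methods

$O(|P| \cdot \|D\|)$ time for one iteration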
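The procedure above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the partial frequency condition is read as "(k-1)-pseudo frequency of Q is at least s(l - |Q|)/l", k is at least 1, frequencies are recomputed naively instead of being updated incrementally, and the names `is_node`, `parent`, and `reverse_search` are hypothetical.

```python
D = [{1, 2, 5, 6, 7, 9}, {2, 3, 4, 5}, {1, 2, 7, 8, 9}, {1, 7, 9}, {2, 7, 9}, {2}]
ITEMS = sorted(set().union(*D))

def pseudo_frq(P, k):
    """k-pseudo frequency of P: transactions missing at most k items of P."""
    return sum(1 for T in D if len(P - T) <= k)

def is_node(Q, k, s, l):
    """Q is k-pseudo frequent and satisfies the (assumed) partial frequency condition."""
    return pseudo_frq(Q, k) >= s and pseudo_frq(Q, k - 1) * l >= s * (l - len(Q))

def parent(Q, k):
    """Remove the item whose removal maximizes the (k-1)-pseudo frequency.
    max() returns the first maximum, so ties go to the smallest item."""
    best = max(sorted(Q), key=lambda e: pseudo_frq(Q - {e}, k - 1))
    return Q - {best}

def reverse_search(P, k, s, l, out):
    """Visit only the children whose parent is P, so each itemset is reached once."""
    if len(P) == l:
        out.append(sorted(P))
        return
    for e in ITEMS:
        if e in P:
            continue
        child = P | {e}
        if is_node(child, k, s, l) and parent(child, k) == P:
            reverse_search(child, k, s, l, out)
```

A call such as `reverse_search(frozenset(), 1, 3, 3, out)` fills `out` with the 1-pseudo frequent itemsets of size 3, each exactly once. For example `{2, 5, 8}` is reached along the route from the empty set through `{2}` and `{2, 5}`, because removing 8 leaves the pair with the highest ordinary frequency.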
21
Conclusion
  • Introduced an ambiguous inclusion relation in
    which at most $k$ items of the pattern are not
    included
  • Pseudo frequent itemset mining under this
    inclusion (monotonicity, intersection, many
    small trivial patterns)
  • Reverse search for directly finding frequent
    itemsets of a fixed size

Future work
implementation and experiments; extension of
the technique to other pattern mining; an approach
to inclusion with "ratio r"