An Efficient Polynomial Delay Algorithm for Pseudo Frequent Itemset Mining presentation

1
An Efficient Polynomial Delay Algorithm
for Pseudo Frequent Itemset Mining

Takeaki Uno (National Institute of Informatics)
Hiroki Arimura (Hokkaido University)

2/Oct/2007 Discovery Science 2007
2
Frequent Pattern Mining

The problem of finding all frequently appearing patterns in a (large-scale) database
database: transaction, tree, string, graph, vector
pattern: subset, tree, path, sequence, graph, geograph

[Figure: example databases (a table of experiments ex1 to ex4, and genome sequence data) with the frequent patterns found in them]
3
This Research

We address transaction databases
transaction database: a database D in which each record (transaction) T is a subset of the itemset E, i.e., ∀T ∈ D, T ⊆ E
frequent itemset: a subset of E included in at least s transactions (s: minimum support threshold)
Problems
- too many patterns are found to pick out the valuable ones
- inclusion is too strict to deal with errors
→ "patterns ambiguously included in many transactions" are important

We introduce an ambiguous inclusion, and propose an efficient mining algorithm
4
Related Works

Frequent itemset mining with ambiguity goes by several names:
fault-tolerant pattern, degenerate pattern, soft occurrence
- one ambiguity for inclusion: "a pattern is included if the ratio of included items is more than the threshold"
- another approach: find combinations of an itemset and a transaction set such that few pairs of item and transaction do not satisfy the inclusion relation
- similarity is used for string matching and homology search
Little "enumeration type" research with completeness

We look at practical models and algorithms from the viewpoint of algorithm theory
5
Notations for F.I.M.

For an itemset K:
occurrence of K: a transaction of D including K
Occ(K), occurrence set of K: the set of occurrences of K
frq(K), frequency of K: the size of Occ(K)

D = { {1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2} }
Occ({1,2}) = { {1,2,5,6,7,9}, {1,2,7,8,9} }
Occ({2,7,9}) = { {1,2,5,6,7,9}, {1,2,7,8,9}, {2,7,9} }
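The notation above can be tried out directly on the example database. The following is a minimal sketch, assuming transactions are stored as Python sets; the function names occ and frq simply mirror the slide's Occ and frq and are not taken from any implementation by the authors.

```python
# Example database D from the slide; each transaction is a set of items.
D = [
    {1, 2, 5, 6, 7, 9},
    {2, 3, 4, 5},
    {1, 2, 7, 8, 9},
    {1, 7, 9},
    {2, 7, 9},
    {2},
]

def occ(K, database):
    """Occ(K): the transactions of the database that include every item of K."""
    K = set(K)
    return [T for T in database if K <= T]

def frq(K, database):
    """frq(K): the number of occurrences of K."""
    return len(occ(K, database))

print(occ({1, 2}, D))
print(frq({2, 7, 9}, D))  # 3
```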
6
Frequent Itemset

Frequent itemset: an itemset with frequency no less than s
(s is called the minimum support (threshold))
Ex.)

D = { {1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2} }
Itemsets included in no less than 3 transactions:
{1} {2} {7} {9} {1,7} {1,9} {2,7} {2,9} {7,9} {1,7,9} {2,7,9}
Frequent itemset mining: the problem of enumerating all frequent itemsets for a given database D and minimum support s
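The example can be reproduced by brute force. This is only an illustrative sketch that tests every non-empty itemset over the items of D (exponential in the number of items); it is not the enumeration method of the talk, and the function name frequent_itemsets is an assumption.

```python
from itertools import combinations

D = [
    {1, 2, 5, 6, 7, 9},
    {2, 3, 4, 5},
    {1, 2, 7, 8, 9},
    {1, 7, 9},
    {2, 7, 9},
    {2},
]

def frq(K, database):
    """Number of transactions that include every item of K."""
    K = set(K)
    return sum(1 for T in database if K <= T)

def frequent_itemsets(database, s):
    """Brute force: return every non-empty itemset with frequency >= s."""
    items = sorted(set().union(*database))
    result = []
    for size in range(1, len(items) + 1):
        for K in combinations(items, size):
            if frq(K, database) >= s:
                result.append(K)
    return result

for K in frequent_itemsets(D, 3):
    print(K)
```

With s = 3 this lists the eleven itemsets given on the slide, from (1,) up to (2, 7, 9).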
7
Inclusion with Ambiguity

Ambiguous inclusion relation for itemset P and transaction T
Popular definition: |P∩T| / |P| ≥ θ for a threshold θ < 1
→ monotonicity of frequent itemsets is lost
→ there can be a frequent itemset such that "every subset of it is infrequent"
→ computation is costly

Examples for θ = 0.6:
{1,2,3} is included in {1,2,4,5} (2/3 ≥ 0.6)
{1,2,3,4,5,6,7} is included in {1,3,5,6,7} (5/7 ≥ 0.6)
{1,2,3} is not included in {1,4,5} (1/3 < 0.6)
For the transactions {1,2}, {2,3}, {1,3}: {1,2,3} is included in all of them, but no non-empty proper subset of it is
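The loss of monotonicity can be checked on the three-transaction example. A small sketch follows, assuming the ratio-based definition |P∩T| / |P| ≥ θ; exact fractions are used so that θ = 0.6 is compared without rounding, and the names ratio_included and ratio_frq are hypothetical.

```python
from fractions import Fraction

def ratio_included(P, T, theta):
    """Ratio-based ambiguous inclusion: |P & T| / |P| >= theta (P non-empty)."""
    P, T = set(P), set(T)
    return Fraction(len(P & T), len(P)) >= theta

def ratio_frq(P, database, theta):
    """Number of transactions that ambiguously include P."""
    return sum(1 for T in database if ratio_included(P, T, theta))

theta = Fraction(3, 5)  # 0.6
small_db = [{1, 2}, {2, 3}, {1, 3}]

print(ratio_frq({1, 2, 3}, small_db, theta))  # 3
print(ratio_frq({1, 2}, small_db, theta))     # 1
print(ratio_frq({1}, small_db, theta))        # 2
```

The superset {1,2,3} has a higher frequency than its subsets {1,2} and {1}, so pruning by "an infrequent itemset has no frequent superset" is not valid under this definition.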
8
k-pseudo Inclusion

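Use a threshold on the number of non-included items
k-pseudo inclusion: |P \ T| ≤ k for a threshold k ≥ 0
(k-pseudo occurrence / occurrence set / frequency are defined accordingly)
→ monotonicity is kept
→ able to find characterizations such as "many transactions include at least 3 items of P"

Examples for k = 1:
{1,2,3} is 1-pseudo included in {1,2,4,5} (only 3 is missing)
{1,2,3,4,5,6,7} is not 1-pseudo included in {1,3,5,6,7} (2 and 4 are missing)
{1,2,3} is not 1-pseudo included in {1,4,5} (2 and 3 are missing)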
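The three examples, and the k-pseudo frequency built on top of the inclusion test, can be sketched as follows. The function names pseudo_included and pseudo_frq are assumptions made for this illustration, and D is the example database used on the other slides.

```python
D = [
    {1, 2, 5, 6, 7, 9},
    {2, 3, 4, 5},
    {1, 2, 7, 8, 9},
    {1, 7, 9},
    {2, 7, 9},
    {2},
]

def pseudo_included(P, T, k):
    """k-pseudo inclusion: at most k items of P are missing from T."""
    return len(set(P) - set(T)) <= k

def pseudo_frq(P, database, k):
    """k-pseudo frequency: number of transactions k-pseudo including P."""
    return sum(1 for T in database if pseudo_included(P, T, k))

print(pseudo_included({1, 2, 3}, {1, 2, 4, 5}, 1))                 # True
print(pseudo_included({1, 2, 3, 4, 5, 6, 7}, {1, 3, 5, 6, 7}, 1))  # False
print(pseudo_included({1, 2, 3}, {1, 4, 5}, 1))                    # False
print(pseudo_frq({1, 2, 3}, D, 1))                                 # 3
```

Removing an item from P can only shrink P \ T, so a subset of P is k-pseudo included wherever P is; this is the monotonicity the slide refers to.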
9
k Pseudo Frequent Itemset

k-pseudo frequent itemset: an itemset k-pseudo included in at least s transactions of D

D = { {1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2} }
1-pseudo frequent itemsets for s = 3:
{1,2,3} {1,2,4} {1,2,5} {1,2,7} {1,2,9} {1,3,7} {1,3,9}
{1,4,7} {1,4,9} {1,5,7} {1,5,9} {1,6,7} {1,6,9}
{1,7,8} {1,7,9} {1,8,9} {2,3,7} {2,3,9} {2,4,7}
{2,4,9} {2,5,7} {2,5,8} {2,5,9} {2,6,7} {2,6,9}
{2,7,8} {2,7,9} {2,8,9} {3,7,9} {4,7,9} {5,7,9}
{6,7,9} {7,8,9} {1,2,7,9} {1,3,7,9}
{1,4,7,9} {1,5,7,9} {1,6,7,9} {1,7,8,9} {2,3,7,9}
{2,4,7,9} {2,5,7,9} {2,6,7,9} {2,7,8,9}
Many trivial patterns. How to enumerate them efficiently?
10
Enumeration using Monotonicity

Pseudo frequent itemsets have the monotone property, so a simple backtrack algorithm works
For each k-pseudo frequent itemset P, compute the k-pseudo frequency of P∪{e} for each item e
If the k-pseudo frequency of P∪{e} is no less than s, make a recursive call to enumerate the k-pseudo frequent itemsets including P∪{e}

Polynomial time enumeration
How to compute efficiently?
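The backtrack scheme can be sketched as below. One assumption goes beyond the slide: to avoid generating the same itemset twice, the sketch only adds items larger than the last item of P (tail extension), which is safe here because every subset of a k-pseudo frequent itemset is itself k-pseudo frequent. The frequency is recomputed from scratch at each step for clarity, not for speed, and the function names are hypothetical.

```python
D = [
    {1, 2, 5, 6, 7, 9},
    {2, 3, 4, 5},
    {1, 2, 7, 8, 9},
    {1, 7, 9},
    {2, 7, 9},
    {2},
]

def pseudo_frq(P, database, k):
    """Number of transactions missing at most k items of P."""
    P = set(P)
    return sum(1 for T in database if len(P - T) <= k)

def enumerate_pseudo_frequent(database, k, s):
    """Backtracking enumeration of all k-pseudo frequent itemsets (as sorted tuples)."""
    items = sorted(set().union(*database))
    out = []

    def backtrack(P):
        out.append(P)
        for e in items:
            # Tail extension (assumed): only add items larger than the last item of P.
            if P and e <= P[-1]:
                continue
            Q = P + (e,)
            if pseudo_frq(Q, database, k) >= s:
                backtrack(Q)

    if pseudo_frq((), database, k) >= s:
        backtrack(())
    return out

result = enumerate_pseudo_frequent(D, 1, 3)
large = [P for P in result if len(P) >= 3]
print(len(large))  # 44
```

For k = 1 and s = 3 the itemsets of size three or more are the 44 itemsets listed on the previous slide; the output also contains the empty set and the smaller, trivial itemsets.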
11
Computing k-Pseudo Occurrences

Define Occ_h(P) = { T ∈ D : |P \ T| = h }
→ the set of transactions missing exactly h items of P
→ Occ_≤k(P) = ∪_{h≤k} Occ_h(P)
Occ_h(P∪{e}) = (Occ_h(P) ∩ Occ({e})) ∪ (Occ_{h-1}(P) \ Occ({e}))
→ the pseudo occurrence set is updated by taking intersections:
compute Occ_h(P) ∩ Occ({e}) for all pairs of e and h

[Figure: transactions over the items A to G, grouped into Occ_0, Occ_1 and Occ_2 for an itemset P]
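The update rule for Occ_h can be checked against a direct computation. The sketch below stores each Occ_h(P) as a set of transaction indices; the function names occ_levels and extend, and the use of the example database D from the earlier slides, are assumptions made for illustration.

```python
D = [
    {1, 2, 5, 6, 7, 9},
    {2, 3, 4, 5},
    {1, 2, 7, 8, 9},
    {1, 7, 9},
    {2, 7, 9},
    {2},
]

def occ_levels(P, database, k):
    """Occ_h(P) for h = 0..k, each as a set of transaction indices."""
    P = set(P)
    levels = [set() for _ in range(k + 1)]
    for t, T in enumerate(database):
        h = len(P - T)
        if h <= k:
            levels[h].add(t)
    return levels

def extend(levels, occ_e):
    """Occ_h of P plus a new item e, from Occ_h(P) and Occ({e}) (e not in P)."""
    new_levels = []
    for h in range(len(levels)):
        kept = levels[h] & occ_e  # still missing exactly h items
        moved = levels[h - 1] - occ_e if h > 0 else set()  # now missing one more
        new_levels.append(kept | moved)
    return new_levels

occ_7 = {t for t, T in enumerate(D) if 7 in T}
updated = extend(occ_levels({1, 2}, D, 1), occ_7)
print(updated)  # [{0, 2}, {3, 4}]
```

The updated sets agree with occ_levels({1, 2, 7}, D, 1), so the k-pseudo frequency of the extended itemset is obtained without rescanning the whole database.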
12
Taking Intersections Efficiently

Occ_h(P∪{e}) = (Occ_h(P) ∩ Occ({e})) ∪ (Occ_{h-1}(P) \ Occ({e}))
→ has the same properties as usual occurrence sets
→ many existing techniques for updating occurrence sets can be used (down project, delivery, bitmap)
Database reduction (FP-tree) is also available
In deeper levels of the recursion, fewer transactions need to be scanned, so the computation is fast

Transactions (ID: items)
1: A,C,D
2: A,B,C,E,F
3: B
4: B
5: A,B
6: A
7: A,C,D,E
8: C
9: A,C,D,E

Occurrences (item: transaction IDs)
A: 1,2,5,6,7,9
B: 2,3,4,5
C: 1,2,7,8,9
D: 1,7,9
E: 2,7,9
F: 2
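The two views of the example database are transposes of each other. The sketch below builds the per-item occurrence lists in one pass over the transactions; it is a plain transposition written for illustration, not a reproduction of the down project, delivery or bitmap techniques named on the slide, and the function name to_vertical is an assumption.

```python
from collections import defaultdict

transactions = {
    1: {"A", "C", "D"},
    2: {"A", "B", "C", "E", "F"},
    3: {"B"},
    4: {"B"},
    5: {"A", "B"},
    6: {"A"},
    7: {"A", "C", "D", "E"},
    8: {"C"},
    9: {"A", "C", "D", "E"},
}

def to_vertical(transactions):
    """Map each item to the set of IDs of the transactions containing it."""
    vertical = defaultdict(set)
    for tid, items in transactions.items():
        for item in items:
            vertical[item].add(tid)
    return dict(vertical)

vertical = to_vertical(transactions)
print(sorted(vertical["A"]))                  # [1, 2, 5, 6, 7, 9]
print(sorted(vertical["C"] & vertical["D"]))  # [1, 7, 9]
```

Once the occurrence lists exist, the occurrences of an itemset such as {C, D} are obtained by intersecting the lists of its items.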
13
Using Bottom-wideness

Backtrack (depth-first search) generates several recursive calls in each iteration
→ the computation tree spreads exponentially going down
→ the computation time is dominated by the bottom-level iterations of the recursion tree

[Figure: recursion tree; iterations take a long time near the root and a short time near the bottom]
Since few occurrences have to be computed in the lower levels, the amortized computation time is reduced to that of the bottom levels
14
For Large Minimum Support

When s is large, we access many transactions on the bottom levels
→ the improvement by bottom-wideness is not drastic
Reduce the database to speed up the bottom levels
(1) delete the items less than the maximum item in P
(2) delete the items that are infrequent in the occurrence set database (since they are never added in the recursive calls)
(3) unify identical transactions
The database size is constant in the bottom levels in practice

Example: P = {1,3}, k = 1, s = 4
{1,3,5}
{1,2,3,4,6}
{1,7}
{2,3,4,6,7}
{3,4,5,6,7}
{2,3,4,6,7}
No big difference from the case of small s
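The three reduction steps can be sketched on the example with P = {1,3}, k = 1, s = 4. This is one possible reading, not the authors' implementation: each occurrence keeps its number h of missing items of P, "infrequent" is taken to mean that P plus the item is not k-pseudo frequent, and identical reduced transactions are merged with a multiplicity. The function name reduce_database is hypothetical.

```python
from collections import Counter

def reduce_database(P, database, k, s):
    """Return a Counter mapping (h, reduced transaction) to its multiplicity."""
    P = set(P)
    max_item = max(P)
    # Keep only the k-pseudo occurrences of P, remembering h = |P minus T|.
    # (1) Drop the items of P and every item smaller than the maximum item of P.
    projected = [
        (len(P - T), {e for e in T if e > max_item})
        for T in database
        if len(P - T) <= k
    ]
    # (2) Adding item e keeps every occurrence with h < k, and keeps an
    #     occurrence with h == k only if it contains e.
    always = sum(1 for h, _ in projected if h < k)
    contains = Counter()
    for h, T in projected:
        if h == k:
            contains.update(T)
    candidates = set().union(*(T for _, T in projected))
    keep = {e for e in candidates if always + contains[e] >= s}
    # (3) Unify identical reduced transactions.
    return Counter((h, frozenset(T & keep)) for h, T in projected)

example = [
    {1, 3, 5},
    {1, 2, 3, 4, 6},
    {1, 7},
    {2, 3, 4, 6, 7},
    {3, 4, 5, 6, 7},
    {2, 3, 4, 6, 7},
]
for (h, T), count in sorted(reduce_database({1, 3}, example, 1, 4).items(),
                            key=lambda item: (item[0][0], sorted(item[0][1]))):
    print(h, sorted(T), count)
```

In this example item 5 is dropped in step (2), after which three of the six transactions become identical and are stored once with multiplicity 3.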
15
Small Trivial Patterns

Under the k-pseudo inclusion, any itemset of size no more than k is included in every transaction
Itemsets of size a bit greater than k are also included in many transactions
→ many small and trivial frequent itemsets
We want to ignore these itemsets in practice
→ consider the problem of directly finding pseudo frequent itemsets of size l

16
Directly Finding Large Itemset

Searching all itemsets of size l needs exponential time
→ pruning unnecessary search is crucial
→ take candidates according to partial structure
Let P be a k-pseudo frequent itemset of size l
WLOG, P = {1,…,l}, sorted in increasing order of |Occ_k(P) \ Occ({e})|
Consider the (k-1)-pseudo frequency of the itemset {1,…,y}
Any transaction in Occ_k(P) \ Occ({e}), e > y, (k-1)-pseudo includes {1,…,y}

17
Search Route to Itemset of Size l

Any transaction in Occ_k(P) \ Occ({e}), e > y, (k-1)-pseudo includes {1,…,y}
→ |Occ_{k-1}({1,…,y})| ≥ |∪_{e=y+1,…,|P|} (Occ_k(P) \ Occ({e}))|
The average of |Occ_k(P) \ Occ({e})| is no less than (k / |P|) |Occ_k(P)|
1,…,y are sorted in increasing order of |Occ_k(P) \ Occ({e})|
→ |Occ_{k-1}({1,…,y})| ≥ |Occ_k(P)| (|P| - y) / |P|

Partial frequency condition
There is a sequence of itemsets from the empty set to P composed only of itemsets satisfying the partial frequency condition
18
Example for Partial Frequency Condition

Itemsets satisfying the partial frequency condition, for k = 1, s = 3, l = 3

D = { {1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2} }
1-pseudo frequent itemsets satisfying the partial frequency condition:
{1} {2} {5} {7} {9}
{1,2} {1,5} {1,6} {1,7} {1,8} {1,9} {2,3} {2,4}
{2,5} {2,6} {2,7} {2,8} {2,9} {3,5} {4,5} {5,6}
{5,7} {5,9} {6,7} {6,9} {7,8} {7,9} {8,9}
The number of frequent itemsets to be searched decreases
→ an efficient search is expected
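The list on this slide can be reproduced under one particular reading of the partial frequency condition, stated here as an assumption: an itemset Q of size y < l qualifies when its (k-1)-pseudo frequency is at least s(l - y)/l, and only itemsets reachable from the empty set through qualifying itemsets are kept. The sketch checks only this condition and reachability, not k-pseudo frequency, and the function names are hypothetical.

```python
D = [
    {1, 2, 5, 6, 7, 9},
    {2, 3, 4, 5},
    {1, 2, 7, 8, 9},
    {1, 7, 9},
    {2, 7, 9},
    {2},
]

def pseudo_frq(Q, database, k):
    """Number of transactions missing at most k items of Q."""
    Q = set(Q)
    return sum(1 for T in database if len(Q - T) <= k)

def satisfies_pfc(Q, database, k, s, l):
    """Assumed reading: (k-1)-pseudo frequency of Q >= s * (l - |Q|) / l."""
    return pseudo_frq(Q, database, k - 1) * l >= s * (l - len(Q))

def reachable_pfc_itemsets(database, k, s, l):
    """Itemsets of size 1..l-1 reachable from the empty set via qualifying itemsets."""
    items = sorted(set().union(*database))
    level = {frozenset()}
    found = []
    for _ in range(1, l):
        next_level = set()
        for Q in level:
            for e in items:
                if e not in Q:
                    R = Q | {e}
                    if satisfies_pfc(R, database, k, s, l):
                        next_level.add(R)
        found.extend(sorted(tuple(sorted(R)) for R in next_level))
        level = next_level
    return found

for Q in reachable_pfc_itemsets(D, 1, 3, 3):
    print(Q)
```

With k = 1, s = 3, l = 3 this prints the 5 single items and 23 pairs shown above. The pair {3,4} is absent because neither {3} nor {4} qualifies, so it cannot be reached.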
19
Restricted Search Route by P.F.C.

Any k-pseudo frequent itemset of size l can be found by passing only through itemsets satisfying the partial frequency condition
→ let's do a backtrack search
There always exists an item whose removal satisfies the condition
Tail extension is not available (removal of the tail may violate the condition)
Simple hill climbing generates duplicates
So, use a generation rule to avoid duplication (reverse search)

20
Reverse Search for P.F.C.

Rule: generate itemset P from P \ {e}, where e maximizes |Occ_{k-1}(P \ {e})|
(ties are broken by choosing the minimum index)
ReverseSearch (P)
1. if |P| = l then output P; return
2. for each e ∉ P do
     if P∪{e} is a k-pseudo frequent itemset satisfying P.F.C. then
       if e maximizes |Occ_{k-1}((P∪{e}) \ {e'})| over e' ∈ P∪{e} then
         ReverseSearch (P∪{e})
3. end for
|Occ_{k-1}(P \ {e})| can be efficiently computed by existing methods

O(|P| |D|) time for one iteration
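The ReverseSearch procedure can be sketched on the running example D with k = 1, s = 3, l = 3. Two assumptions are made and should be read as such: the partial frequency condition is taken as "(k-1)-pseudo frequency of Q at least s(l - |Q|)/l", and the quantity maximized by the rule is taken as the (k-1)-pseudo frequency of the itemset with one item removed. Frequencies are recomputed from scratch, so this illustrates the search order rather than the stated time bound.

```python
D = [
    {1, 2, 5, 6, 7, 9},
    {2, 3, 4, 5},
    {1, 2, 7, 8, 9},
    {1, 7, 9},
    {2, 7, 9},
    {2},
]

def pseudo_frq(Q, database, k):
    """Number of transactions missing at most k items of Q."""
    return sum(1 for T in database if len(Q - T) <= k)

def satisfies_pfc(Q, database, k, s, l):
    """Assumed reading of the partial frequency condition (see lead-in)."""
    return pseudo_frq(Q, database, k - 1) * l >= s * (l - len(Q))

def parent(Q, database, k):
    """Remove the item whose removal maximizes the (k-1)-pseudo frequency.

    max() returns the first maximal element, so scanning the items in
    increasing order breaks ties by the minimum index.
    """
    best = max(sorted(Q), key=lambda e: pseudo_frq(Q - {e}, database, k - 1))
    return Q - {best}

def reverse_search(P, items, database, k, s, l, out):
    if len(P) == l:
        out.append(tuple(sorted(P)))
        return
    for e in items:
        if e in P:
            continue
        Q = P | {e}
        if (pseudo_frq(Q, database, k) >= s
                and satisfies_pfc(Q, database, k, s, l)
                and parent(Q, database, k) == P):
            reverse_search(Q, items, database, k, s, l, out)

items = sorted(set().union(*D))
found = []
reverse_search(frozenset(), items, D, 1, 3, 3, found)
print(len(found))  # 33
```

Every itemset has exactly one parent, so each itemset of size l is output once; on this example the output is the 33 itemsets of size 3 from the earlier list of 1-pseudo frequent itemsets.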
21
Conclusion

Introduced an ambiguous inclusion relation in which at most k items of the pattern are not included
Pseudo frequent itemset mining under this inclusion (monotonicity, intersection, many small trivial patterns)
Reverse search for directly finding frequent itemsets of a fixed size

Future work
- implementation and experiments
- extension of the technique to other pattern mining problems
- approach to inclusion with "ratio r"
