Frequent Closed Pattern Search By Row and Feature Enumeration

Description:

Author: Administrator. Last modified by: pf. Created Date: 2/4/2003 4:33:56 PM. Document presentation format: On-screen Show. Company: NUS.

Slides: 39

Provided by: uncEdu

Learn more at: http://www.cs.unc.edu

Transcript and Presenter's Notes

1
Frequent Closed Pattern Search By Row and Feature
Enumeration
2
Outline

Problem Definition
Related Work Feature Enumeration Algorithms
CARPENTER Row Enumeration Algorithm
COBBLER Combined Enumeration Algorithm

3
Problem Definition

Frequent Closed Pattern
1) frequent pattern: its support value is at least the threshold
2) closed pattern: there exists no superset which has the same support value
Problem Definition
Given a dataset D whose records consist of features, our problem is to discover all frequent closed patterns with respect to a user-defined support threshold.
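
As an illustration of the two definitions, here is a minimal brute-force sketch in Python. It assumes the four-row example table used on the Transpose Table slide and treats a pattern as frequent when its support is at least minsup; the names `DATASET`, `support`, and `frequent_closed_patterns` are hypothetical and not part of the presentation. It is only meant for tiny inputs, since it tests every feature subset.

```python
from itertools import combinations

# Example table from the slides: row id -> set of features.
DATASET = {
    "r1": {"a", "b", "c"},
    "r2": {"b", "c", "d"},
    "r3": {"b", "c", "d"},
    "r4": {"d"},
}


def support(pattern, dataset):
    """Number of rows that contain every feature of the pattern."""
    return sum(1 for feats in dataset.values() if pattern <= feats)


def frequent_closed_patterns(dataset, minsup):
    """Brute force: enumerate every non-empty feature subset."""
    features = sorted(set().union(*dataset.values()))
    frequent = {}
    for k in range(1, len(features) + 1):
        for combo in combinations(features, k):
            s = support(set(combo), dataset)
            if s >= minsup:  # assumption: "frequent" means support >= minsup
                frequent[frozenset(combo)] = s
    # A pattern is closed if no proper superset has the same support.
    # Any such superset is itself frequent, so checking `frequent` suffices.
    return {
        p: s
        for p, s in frequent.items()
        if not any(p < q and s == t for q, t in frequent.items())
    }


closed = frequent_closed_patterns(DATASET, minsup=2)
```

With minsup = 2 this yields the patterns bc, bcd, and d, which matches the result listed on the Row Enumeration Tree slide.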

4
Related Work

Searching strategy: breadth-first search, depth-first search
Data format: horizontal format, vertical format
Data compression method: diffset, fp-tree, etc.

5
Typical Algorithms

CLOSET
feature enumeration
horizontal format
depth-first search
fp-tree technique

APRIORI
feature enumeration
horizontal format
breadth-first search
CHARM
feature enumeration
vertical format
depth-first search
diffset technique

6
CARPENTER

CARPENTER stands for Closed Pattern Discovery by
Transposing Tables that are Extremely Long
Motivation
Algorithm
Prune Method
Experiment

7
Motivation

Bioinformatics datasets typically contain a large number of features and a small number of rows.
The running time of most previous algorithms increases exponentially with the average length of the transactions.
CARPENTER's search space is much smaller than that of the previous algorithms on this kind of dataset, and it therefore performs better.

8
Algorithm

The main idea of CARPENTER is to mine the dataset row-wise.
Two steps:
First, transpose the dataset.
Second, search in the row enumeration tree.

9
Transpose Table

Features: a, b, c, d.
Rows: r1, r2, r3, r4.

Original table:
r1: a b c
r2: b c d
r3: b c d
r4: d

Transposed table:
a: r1
b: r1 r2 r3
c: r1 r2 r3
d: r2 r3 r4

Projected table, project on (r2 r3):
b:
c:
d: r4
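
The transpose and projection steps can be sketched in Python as follows. This is an illustrative reading of the slide, not the authors' code: the helper names `transpose` and `project` are hypothetical, and the projection is assumed to keep only features shared by all chosen rows and to list only rows that come later in the enumeration order.

```python
def transpose(dataset):
    """Turn row -> features into feature -> rows."""
    table = {}
    for row_id, feats in dataset.items():
        for f in feats:
            table.setdefault(f, set()).add(row_id)
    return table


def project(transposed, rows, row_order):
    """Project the transposed table on a set of rows.

    Keeps the features that occur in every row of `rows` and, for each of
    them, only the rows ordered after the last chosen row (assumed reading).
    """
    chosen = set(rows)
    last = max(row_order.index(r) for r in chosen)
    later = set(row_order[last + 1:])
    return {f: rs & later for f, rs in transposed.items() if chosen <= rs}


ORIGINAL = {
    "r1": {"a", "b", "c"},
    "r2": {"b", "c", "d"},
    "r3": {"b", "c", "d"},
    "r4": {"d"},
}
ROW_ORDER = ["r1", "r2", "r3", "r4"]

transposed = transpose(ORIGINAL)
projected = project(transposed, ["r2", "r3"], ROW_ORDER)
```

Here `projected` maps b and c to empty row sets and d to r4, as in the projected table on the slide.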
10
Row Enumeration Tree
According to the transposed table, we build the row enumeration tree, which enumerates row ids in a pre-defined order.
We do a depth-first search in the row enumeration tree without any pruning strategies.

Transposed table:
a: r1
b: r1 r2 r3
c: r1 r2 r3
d: r2 r3 r4

Row enumeration tree (row set: itemset):
r1: abc
  r1 r2: bc
    r1 r2 r3: bc
      r1 r2 r3 r4: (empty)
    r1 r2 r4: (empty)
  r1 r3: bc
    r1 r3 r4: (empty)
  r1 r4: (empty)
r2: bcd
  r2 r3: bcd
    r2 r3 r4: d
  r2 r4: d
r3: bcd
  r3 r4: d
r4: d

Result for minsup = 2:
bc: r1 r2 r3
bcd: r2 r3
d: r2 r3 r4
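
The unpruned depth-first search over the row enumeration tree can be sketched in Python as below. This is a simplified illustration under stated assumptions, not CARPENTER itself: it works on the original (untransposed) table, intersects feature sets along each path, and recomputes the full support set of every itemset it meets. The names `row_enumeration` and `ROWS` are hypothetical.

```python
ROWS = {
    "r1": {"a", "b", "c"},
    "r2": {"b", "c", "d"},
    "r3": {"b", "c", "d"},
    "r4": {"d"},
}


def row_enumeration(dataset, minsup):
    """Visit every node of the row enumeration tree, with no pruning."""
    order = sorted(dataset)  # pre-defined row order
    closed = {}

    def visit(next_idx, itemset):
        if itemset:
            # The itemset shared by a set of rows is closed; its support
            # is the set of all rows that contain it.
            rows = frozenset(r for r in order if itemset <= dataset[r])
            if len(rows) >= minsup:
                closed[frozenset(itemset)] = rows
        for i in range(next_idx, len(order)):
            visit(i + 1, itemset & dataset[order[i]])

    for i, r in enumerate(order):
        visit(i + 1, set(dataset[r]))
    return closed


result = row_enumeration(ROWS, minsup=2)
```

For minsup = 2 the sketch returns bc with rows r1 r2 r3, bcd with rows r2 r3, and d with rows r2 r3 r4.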
11
Prune Method 1

In the enumeration tree, the depth of a node is the corresponding support value.
Prune a branch if there won't be enough depth in that branch, which means the support of patterns found in the branch cannot reach the minimum support.

Example (minsup = 4):
r2: bcd (depth 1, sup 1, 2 sub-nodes)
  r2 r3: bcd
  r2 r4: d
The maximum support value in branch r2 is 3; therefore prune this branch.
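
The depth bound behind this pruning rule is simple arithmetic, sketched here in Python. The function names are hypothetical, and the sketch assumes that the best case for a branch is that every remaining row can still be added below the current node.

```python
def max_support_in_branch(depth, remaining_rows):
    """Upper bound on support: current depth plus all rows still available."""
    return depth + remaining_rows


def should_prune(depth, remaining_rows, minsup):
    """Prune when even the deepest possible node stays below minsup."""
    return max_support_in_branch(depth, remaining_rows) < minsup


# Node r2 from the slide: depth 1, two rows (r3, r4) still available.
bound = max_support_in_branch(1, 2)
prune_r2 = should_prune(1, 2, minsup=4)
```

For node r2 the bound is 3, which is below minsup = 4, so `prune_r2` is true.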
12
Prune Method 2

If rj has 100% support in the projected table of ri, prune the branch of rj.

Branch of r2 before pruning:
r2: bcd
  r2 r3: bcd
    r2 r3 r4: d
  r2 r4: d

Projected table of r2:
b: r3
c: r3
d: r3 r4

Projected table after r3 is removed:
b:
c:
d: r4

Branch after pruning:
r2 r3: bcd
  r2 r3 r4: d

r3 has 100% support in the projected table of r2; therefore branch r2 r3 is pruned and the whole branch is reconstructed.
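
A small Python sketch of the check behind this rule: find the rows that occur in every tuple of a projected table, then drop them from the table. The helper names are hypothetical, and merging such rows into the current node is an assumed reading of "the whole branch is reconstructed".

```python
def rows_with_full_support(projected):
    """Rows that appear in every tuple of the projected table."""
    tuples = [set(rows) for rows in projected.values()]
    if not tuples:
        return set()
    return set.intersection(*tuples)


def remove_rows(projected, rows):
    """Projected table with the given rows removed from every tuple."""
    return {f: set(rs) - rows for f, rs in projected.items()}


# Projected table of r2 from the slide.
projected_r2 = {"b": {"r3"}, "c": {"r3"}, "d": {"r3", "r4"}}

full = rows_with_full_support(projected_r2)
reduced = remove_rows(projected_r2, full)
```

Here `full` contains only r3, and `reduced` leaves b and c empty and d with r4, matching the second table on the slide.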
13
Prune Method 3

At any node in the enumeration tree, if the
corresponding itemset of the node has been found
before, we prune the branch rooted at this node.

r2: bcd
  r2 r3: bcd
  r2 r4: d
r3: bcd
  r3 r4: d
Since itemset bcd has been found before, the branch rooted at r3 will be pruned.
14
Performance

We compare 3 algorithms, CARPENTER, CHARM and
CLOSET.
Dataset (Lung Cancer) has 181 rows with 12533
features.
We set 3 parameters, minsup, Length Ratio and Row
Ratio.

15
minsup
Lung Cancer, 181 rows, length ratio 0.6, row ratio 1.
Running time of CARPENTER changes from 3 to 14 seconds.
16
Length Ratio
Lung Cancer, 181 rows, minsup 7 (4%), row ratio 1.
Running time of CARPENTER changes from 3 to 33 seconds.
17
Row Ratio
Lung Cancer, 181 rows, length ratio 0.6, minsup 7 (4%).
Running time of CARPENTER changes from 9 to 178 seconds.
18
Conclusion

We propose an algorithm called CARPENTER for finding closed patterns on long biological datasets.
CARPENTER performs row enumeration instead of column enumeration, since the number of rows in such datasets is significantly smaller than the number of features.
Performance studies show that CARPENTER is much more efficient at finding closed patterns than existing feature enumeration algorithms.

19
COBBLER

Motivation
Algorithm
Performance

20
Motivation

With the development of CARPENTER, existing algorithms can be separated into two groups.
Feature enumeration: CHARM, CLOSET, etc.
Row enumeration: CARPENTER
We have two motivations for combining these two enumeration methods.

21
Motivation

1. These two enumeration methods each have their own advantages on different types of datasets. Given a dataset, the characteristics of its sub-datasets may change.

[Figure: a dataset and the sub-dataset obtained by projecting it; the labels are "more features than rows" and "more rows than features".]

2. Given a dataset with both a large number of rows and a large number of features, neither a pure row enumeration algorithm nor a pure feature enumeration algorithm can handle the dataset.
22
Algorithm

There are two main points in the COBBLER algorithm:
How to build an enumeration tree for COBBLER.
How to decide when the algorithm should switch from one enumeration method to the other.
Therefore, we introduce the ideas of the dynamic enumeration tree and the switching condition.

23
Dynamic Enumeration Tree

We call the new kind of enumeration tree used in COBBLER the dynamic enumeration tree.
In a dynamic enumeration tree, different sub-trees may use different enumeration methods.

Original table:
r1: a b c
r2: a c d
r3: b c
r4: d

Transposed table:
a: r1 r2
b: r1 r3
c: r1 r2 r3
d: r2 r4

We use this table as an example in the later discussion.
24
Single Enumeration Tree
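Original table:
r1: a b c
r2: a c d
r3: b c
r4: d

Feature enumeration (itemset: rows):
a: r1 r2
  ab: r1
    abc: r1
      abcd: (empty)
    abd: (empty)
  ac: r1 r2
    acd: r2
  ad: r2
b: r1 r3
  bc: r1 r3
    bcd: (empty)
  bd: (empty)
c: r1 r2 r3
  cd: r2
d: r2 r4

Row enumeration (row set: itemset):
r1: abc
  r1 r2: ac
    r1 r2 r3: c
      r1 r2 r3 r4: (empty)
    r1 r2 r4: (empty)
  r1 r3: bc
    r1 r3 r4: (empty)
  r1 r4: (empty)
r2: acd
  r2 r3: c
    r2 r3 r4: (empty)
  r2 r4: d
r3: bc
  r3 r4: (empty)
r4: d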
25
Dynamic Enumeration Tree
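Feature enumeration to row enumeration
[Figure: dynamic enumeration tree. The top level enumerates features (a: r1 r2; b: r1 r3; c: r1 r2 r3; d: r2 r4), and the sub-trees switch to row enumeration on the projected tables.]
Example, sub-tree of a:
projected table: b: r1; c: r1 r2; d: r2
row enumeration: r1: bc; r2: cd; r1 r2: c
patterns found: abc: r1; ac: r1 r2; acd: r2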
26
Dynamic Enumeration Tree
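Row enumeration to feature enumeration
[Figure: dynamic enumeration tree. The top level enumerates rows (r1: abc; r2: acd; r3: bc; r4: d), and the sub-trees switch to feature enumeration on the projected tables.]
Example, sub-tree of r1:
projected table: a: r2; b: r3; c: r2 r3
feature enumeration: a: r2; ab: (empty); ac: r2; b: r3; bc: r3; c: r2 r3
patterns found: ac: r1 r2; bc: r1 r3; c: r1 r2 r3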
27
Dynamic Enumeration Tree

When we use different conditions to decide the switching, the structure of the dynamic enumeration tree will change.
No matter how it switches, the resulting set of closed patterns will be the same as the result of a single enumeration method.
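
The claim that both enumeration directions reach the same closed patterns can be checked on the example table with a brute-force Python sketch. This is not COBBLER; it only compares the two pure strategies, with no support threshold, and the function names are hypothetical. Row enumeration intersects the feature sets of each row subset, while feature enumeration takes each feature subset that occurs in some row and closes it by intersecting the rows that contain it.

```python
from itertools import combinations

TABLE = {
    "r1": {"a", "b", "c"},
    "r2": {"a", "c", "d"},
    "r3": {"b", "c"},
    "r4": {"d"},
}


def closed_by_rows(table):
    """Itemsets shared by each non-empty subset of rows."""
    rows = sorted(table)
    out = set()
    for k in range(1, len(rows) + 1):
        for combo in combinations(rows, k):
            items = set.intersection(*(table[r] for r in combo))
            if items:
                out.add(frozenset(items))
    return out


def closed_by_features(table):
    """Closure of each feature subset that occurs in at least one row."""
    feats = sorted(set().union(*table.values()))
    out = set()
    for k in range(1, len(feats) + 1):
        for combo in combinations(feats, k):
            rows = [r for r in table if set(combo) <= table[r]]
            if rows:
                out.add(frozenset(set.intersection(*(table[r] for r in rows))))
    return out


by_rows = closed_by_rows(TABLE)
by_features = closed_by_features(TABLE)
```

On this table both functions return the same six patterns: abc, acd, ac, bc, c, and d.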

28
Switching Condition

The main idea of the switching condition is to estimate the processing time of an enumeration sub-tree, i.e., a row enumeration sub-tree or a feature enumeration sub-tree.
We first define some characteristics.

29
Switching Condition
30
Switching Condition

Suppose r = 10, S(f1) = 0.8, S(f2) = 0.5, S(f3) = 0.5, S(f4) = 0.3 and minsup = 2.
Then the estimated deepest node under f1 is f1f2f3, since
S(f1) × S(f2) × S(f3) × r = 2 ≥ minsup
S(f1) × S(f2) × S(f3) × S(f4) × r = 0.6 < minsup
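
The worked example can be reproduced with a short Python sketch. Because the slide with the definitions did not survive in this transcript, the sketch assumes that r is the number of rows, that S(f) is the fraction of rows containing feature f, and that a node is kept while the estimated support is at least minsup. The function name `estimated_deepest_node` is hypothetical.

```python
def estimated_deepest_node(r, supports, minsup):
    """Extend the path feature by feature while the estimated support,
    r times the product of the S(f) values, stays at least minsup.

    `supports` is a list of (feature, S) pairs in enumeration order.
    """
    estimate = float(r)
    path = []
    for name, s in supports:
        if estimate * s < minsup:
            break
        estimate *= s
        path.append(name)
    return path, estimate


path, estimate = estimated_deepest_node(
    r=10,
    supports=[("f1", 0.8), ("f2", 0.5), ("f3", 0.5), ("f4", 0.3)],
    minsup=2,
)
```

The path stops at f1, f2, f3 with an estimated support of 2; adding f4 would lower the estimate to 0.6, which is below minsup.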

31
Experiments

We compare 3 algorithms, COBBLER, CHARM and
CLOSET.
One real-life dataset and one synthetic dataset.
We set 3 parameters, minsup, Length Ratio and Row
Ratio.

32
minsup
Synthetic data
Real-life data (thrombin)
33
Length and Row ratio
Synthetic data
34
Discussion

The combination of row and feature enumeration also brings some disadvantages:
The cost of calculating the switching condition and the cost of bad decisions.
The increased cost of pruning, since two sets of pruning rules must be maintained.

35
Discussion

We may use other, more complicated data structures in our algorithm to improve the performance, e.g., the vertical data format and the diffset technique.
A more efficient switching condition may improve the algorithm further.

36
Conclusion

The COBBLER algorithm gives better performance on datasets where the advantage of switching can be shown, e.g., complex datasets or datasets with both a large number of rows and a large number of features.
For data with simple characteristics, a single enumeration algorithm may be better.

37
Future Work