Title: Frequent Closed Pattern Search By Row and Feature Enumeration
1Frequent Closed Pattern Search By Row and Feature
Enumeration
2Outline
- Problem Definition
- Related Work: Feature Enumeration Algorithms
- CARPENTER: Row Enumeration Algorithm
- COBBLER Combined Enumeration Algorithm
3Problem Definition
- Frequent Closed Pattern
  1) Frequent pattern: its support value is at least the threshold.
  2) Closed pattern: there exists no proper superset which has the same support value.
- Problem Definition: Given a dataset D whose records consist of features, our problem is to discover all frequent closed patterns with respect to a user-defined support threshold.
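The two definitions above can be checked by brute force on a tiny dataset. The sketch below is only an illustration of the definitions, not one of the algorithms discussed in these slides; the function names are hypothetical, the dataset is the four-row table used later on the Transpose Table slide, and "frequent" is taken to mean support of at least minsup.

```python
from itertools import combinations

# Four-row example table (row id -> set of features).
DATASET = {
    "r1": {"a", "b", "c"},
    "r2": {"b", "c", "d"},
    "r3": {"b", "c", "d"},
    "r4": {"d"},
}


def support(pattern, dataset):
    """Number of rows that contain every feature of the pattern."""
    return sum(1 for features in dataset.values() if pattern <= features)


def is_closed(pattern, dataset):
    """A pattern is closed if adding any single feature lowers its support."""
    all_features = set().union(*dataset.values())
    base = support(pattern, dataset)
    return all(
        support(pattern | {f}, dataset) < base
        for f in all_features - pattern
    )


def frequent_closed_patterns(dataset, minsup):
    """Brute force over all feature subsets (exponential, for illustration only)."""
    all_features = sorted(set().union(*dataset.values()))
    found = {}
    for size in range(1, len(all_features) + 1):
        for combo in combinations(all_features, size):
            pattern = frozenset(combo)
            sup = support(pattern, dataset)
            if sup >= minsup and is_closed(pattern, dataset):
                found[pattern] = sup
    return found


result = frequent_closed_patterns(DATASET, minsup=2)
for pattern, sup in sorted(result.items(), key=lambda kv: sorted(kv[0])):
    print("".join(sorted(pattern)), sup)
```

Checking single-feature extensions is enough for closedness, because any superset with the same support implies a one-feature extension with the same support.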
4Related Work
- Searching Strategy: breadth-first search, depth-first search
- Data Format: horizontal format, vertical format
- Data Compression Method: diffset, fp-tree, etc.
5Typical Algorithms
- CLOSET
  - feature enumeration
  - horizontal format
  - depth-first search
  - fp-tree technique
- APRIORI
  - feature enumeration
  - horizontal format
  - breadth-first search
- CHARM
  - feature enumeration
  - vertical format
  - depth-first search
  - diffset technique
6CARPENTER
- CARPENTER stands for Closed Pattern Discovery by Transposing Tables that are Extremely Long
- Motivation
- Algorithm
- Prune Method
- Experiment
7Motivation
- Bioinformatic datasets typically contain a large number of features with a small number of rows.
- The running time of most of the previous algorithms increases exponentially with the average length of the transactions.
- CARPENTER's search space is much smaller than that of the previous algorithms on this kind of dataset, and it therefore has better performance.
8Algorithm
- The main idea of CARPENTER is to mine the dataset row-wise.
- Two steps:
  - First, transpose the dataset.
  - Second, search in the row enumeration tree.
9Transpose Table
- Features: a, b, c, d.
- Rows: r1, r2, r3, r4.

Original table:
r1  a b c
r2  b c d
r3  b c d
r4  d

Transposed table (after transpose):
a  r1
b  r1 r2 r3
c  r1 r2 r3
d  r2 r3 r4

Projected table (project on (r2 r3)):
b
c
d  r4
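The transpose and projection steps on this slide can be sketched as follows. This is a minimal illustration under assumptions, not CARPENTER's actual data structures: the function names are hypothetical, and row ids are compared as strings, which matches the enumeration order only for ids such as r1 to r4.

```python
# Original table from the slide (row id -> features).
TABLE = {
    "r1": ["a", "b", "c"],
    "r2": ["b", "c", "d"],
    "r3": ["b", "c", "d"],
    "r4": ["d"],
}


def transpose(table):
    """Turn row -> features into feature -> sorted list of row ids."""
    transposed = {}
    for row_id, features in table.items():
        for feature in features:
            transposed.setdefault(feature, []).append(row_id)
    return {f: sorted(rows) for f, rows in sorted(transposed.items())}


def project(transposed, row_ids):
    """Keep features that contain all of row_ids; list only the later rows.

    Assumption: string comparison of row ids follows the enumeration order.
    """
    wanted = set(row_ids)
    last = max(row_ids)
    projected = {}
    for feature, rows in transposed.items():
        if wanted <= set(rows):
            projected[feature] = [r for r in rows if r > last]
    return projected


transposed = transpose(TABLE)
projected = project(transposed, ["r2", "r3"])
print(transposed)
print(projected)
```

On the slide's table, the projection on (r2 r3) keeps b, c and d, and only d still has a later row (r4).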
10Row Enumeration Tree
- According to the transposed table, we build the row enumeration tree, which enumerates row ids in a pre-defined order.
- We do a depth-first search in the row enumeration tree without any pruning strategy.

Row enumeration tree (each node: row set {features shared by those rows}):
r1 {abc}
  r1 r2 {bc}
    r1 r2 r3 {bc}
      r1 r2 r3 r4 {}
    r1 r2 r4 {}
  r1 r3 {bc}
    r1 r3 r4 {}
  r1 r4 {}
r2 {bcd}
  r2 r3 {bcd}
    r2 r3 r4 {d}
  r2 r4 {d}
r3 {bcd}
  r3 r4 {d}
r4 {d}

Result with minsup = 2: bc (r1 r2 r3), bcd (r2 r3), d (r2 r3 r4)
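The unpruned depth-first search over row sets can be sketched as below. It is a simplified illustration, not the CARPENTER implementation: it intersects feature sets directly instead of using projected transposed tables, the names are hypothetical, and the support of each itemset is obtained by re-scanning all rows.

```python
# Original table from the Transpose Table slide (row id -> features).
TABLE = {
    "r1": frozenset("abc"),
    "r2": frozenset("bcd"),
    "r3": frozenset("bcd"),
    "r4": frozenset("d"),
}


def row_enumeration(table, minsup):
    """Visit every row combination depth-first and collect frequent closed patterns."""
    rows = sorted(table)  # pre-defined row order
    closed = {}

    def visit(start, chosen, common):
        for i in range(start, len(rows)):
            row = rows[i]
            # Features shared by all rows chosen so far plus this row.
            new_common = table[row] if not chosen else common & table[row]
            if new_common:
                # The intersection of a row set is a closed pattern;
                # its support is every row that contains it.
                supporting = [r for r in rows if new_common <= table[r]]
                if len(supporting) >= minsup:
                    closed[new_common] = supporting
            # No pruning: keep descending even when nothing is shared.
            visit(i + 1, chosen + [row], new_common)

    visit(0, [], frozenset())
    return closed


patterns = row_enumeration(TABLE, minsup=2)
for itemset, supporting in patterns.items():
    print("".join(sorted(itemset)), supporting)
```

With minsup = 2 this reproduces the result listed on the slide: bc, bcd and d.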
11Prune Method 1
- In the enumeration tree, the depth of a node is the corresponding support value.
- Prune a branch if there won't be enough depth in that branch, which means the support of patterns found in the branch cannot reach the minimum support.

Example with minsup = 4:
r2 {bcd}   (depth 1, sup 1, 2 sub-nodes)
  r2 r3 {bcd}
  r2 r4 {d}

The maximum support value in branch r2 is 1 + 2 = 3, therefore this branch is pruned.
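The depth bound used by this pruning rule is simple arithmetic and can be written as a small check. The helper names below are hypothetical, and the sketch follows the slide's convention that the depth of a node equals its support in the tree.

```python
def max_support_in_branch(depth, remaining_rows):
    """Deepest reachable node: current depth plus every row still available below it."""
    return depth + remaining_rows


def should_prune(depth, remaining_rows, minsup):
    """Prune when even the deepest node of the branch cannot reach minsup."""
    return max_support_in_branch(depth, remaining_rows) < minsup


# Slide example: node r2 has depth 1 and two rows (r3, r4) left to add.
print(max_support_in_branch(1, 2))
print(should_prune(1, 2, minsup=4))
```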
12Prune Method 2
- If rj has 100% support in the projected table of ri, prune the branch of rj.

Branch r2 before pruning:
r2 {bcd}
  r2 r3 {bcd}
    r2 r3 r4 {d}
  r2 r4 {d}

Projected table of r2:
b  r3
c  r3
d  r3 r4

Projected table of r2 r3:
b
c
d  r4

Branch after reconstruction:
r2 r3 {bcd}
  r2 r3 r4 {d}

r3 has 100% support in the projected table of r2, therefore branch r2 r3 is pruned and the whole branch is reconstructed.
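The "100% support" test can be sketched as an intersection over the tuples of a projected table. This is an illustrative helper with a hypothetical name, not the authors' code; it only detects the rows that trigger the rule and does not rebuild the tree.

```python
def rows_with_full_support(projected):
    """Rows that occur in every tuple of a projected table (feature -> later rows)."""
    tuples = [set(rows) for rows in projected.values()]
    if not tuples:
        return set()
    return set.intersection(*tuples)


# Projected tables taken from the slide.
projected_r2 = {"b": ["r3"], "c": ["r3"], "d": ["r3", "r4"]}
projected_r2_r3 = {"b": [], "c": [], "d": ["r4"]}

print(rows_with_full_support(projected_r2))     # r3 occurs in all three tuples
print(rows_with_full_support(projected_r2_r3))  # no row occurs in every tuple
```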
13Prune Method 3
- At any node in the enumeration tree, if the
corresponding itemset of the node has been found
before, we prune the branch rooted at this node.
r2 {bcd}
  r2 r3 {bcd}
  r2 r4 {d}
r3 {bcd}
  r3 r4 {d}

Since itemset bcd has been found before, the branch rooted at r3 will be pruned.
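A small sketch of this check on the example table is given below. It applies only this third rule to a plain depth-first row enumeration and records which nodes are visited; the names are hypothetical, and the sketch makes no claim about how the rule interacts with the other two pruning methods in the real algorithm.

```python
# Original table from the Transpose Table slide (row id -> features).
TABLE = {
    "r1": frozenset("abc"),
    "r2": frozenset("bcd"),
    "r3": frozenset("bcd"),
    "r4": frozenset("d"),
}


def enumerate_with_prune3(table):
    """Depth-first row enumeration that skips a branch whose itemset was seen before."""
    rows = sorted(table)
    found = set()   # itemsets discovered so far
    visited = []    # row sets of the nodes actually examined

    def visit(start, chosen, common):
        for i in range(start, len(rows)):
            row = rows[i]
            node = chosen + (row,)
            itemset = table[row] if not chosen else common & table[row]
            visited.append(node)
            if itemset in found:
                continue  # found before: do not descend into this branch
            found.add(itemset)
            visit(i + 1, node, itemset)

    visit(0, (), frozenset())
    return found, visited


found, visited = enumerate_with_prune3(TABLE)
print(len(visited), "of 15 nodes examined")
print(("r3", "r4") in visited)
```

On this table the node r3 is examined, its itemset bcd is already known from r2, and the node r3 r4 below it is never visited.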
14Performance
- We compare 3 algorithms: CARPENTER, CHARM and CLOSET.
- The dataset (Lung Cancer) has 181 rows with 12533 features.
- We set 3 parameters: minsup, Length Ratio and Row Ratio.
15minsup
Lung Cancer, 181 rows, length ratio 0.6, row ratio 1. Running time of CARPENTER changes from 3 to 14 seconds.
16Length Ratio
Lung Cancer, 181 rows, sup 7 (4%), row ratio 1. Running time of CARPENTER changes from 3 to 33 seconds.
17Row Ratio
Lung Cancer, 181 rows, length ratio 0.6, sup 7 (4%). Running time of CARPENTER changes from 9 to 178 seconds.
18Conclusion
- We propose an algorithm called CARPENTER for finding closed patterns on long biological datasets.
- CARPENTER performs row enumeration instead of column enumeration, since the number of rows in such datasets is significantly smaller than the number of features.
- Performance studies show that CARPENTER is much more efficient in finding closed patterns than existing feature enumeration algorithms.
19COBBLER
- Motivation
- Algorithm
- Performance
20Motivation
- With the development of CARPENTER, existing algorithms can be separated into two groups:
  - Feature enumeration: CHARM, CLOSET, etc.
  - Row enumeration: CARPENTER
- We have two motivations to combine these two enumeration methods.
21Motivation
- 1. These two enumeration methods have their own advantages on different types of datasets. Given a dataset, the characteristics of its sub-datasets may change.
  [Figure: a dataset is projected to a sub-dataset; labels: "more features than rows", "more rows than features"]
- 2. Given a dataset with both a large number of rows and a large number of features, a single row enumeration algorithm or a single feature enumeration algorithm cannot handle the dataset.
22Algorithm
- There are two main points in the COBBLER algorithm:
  - How to build an enumeration tree for COBBLER.
  - How to decide when the algorithm should switch from one enumeration to another.
- Therefore, we will introduce the ideas of the dynamic enumeration tree and the switching condition.
23Dynamic Enumeration Tree
- We call the new kind of enumeration tree used in COBBLER the dynamic enumeration tree.
- In a dynamic enumeration tree, different sub-trees may use different enumeration methods.

Original table:
r1  a b c
r2  a c d
r3  b c
r4  d

Transposed table:
a  r1 r2
b  r1 r3
c  r1 r2 r3
d  r2 r4

We use this table as an example in the later discussion.
24Single Enumeration Tree
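Feature enumeration tree (each node: feature set {rows containing it}):
a {r1 r2}
  ab {r1}
    abc {r1}
      abcd {}
    abd {}
  ac {r1 r2}
    acd {r2}
  ad {r2}
b {r1 r3}
  bc {r1 r3}
    bcd {}
  bd {}
c {r1 r2 r3}
  cd {r2}
d {r2 r4}

Row enumeration tree (each node: row set {features shared by those rows}):
r1 {abc}
  r1 r2 {ac}
    r1 r2 r3 {c}
      r1 r2 r3 r4 {}
    r1 r2 r4 {}
  r1 r3 {bc}
    r1 r3 r4 {}
  r1 r4 {}
r2 {acd}
  r2 r3 {c}
    r2 r3 r4 {}
  r2 r4 {d}
r3 {bc}
  r3 r4 {}
r4 {d}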
25Dynamic Enumeration Tree
[Figure: dynamic enumeration tree that switches from feature enumeration to row enumeration. Example from the figure: under feature a (rows r1 r2), the projected table is b: r1; c: r1 r2; d: r2, and row enumeration over it gives abc (r1), ac (r1 r2), acd (r2).]
26Dynamic Enumeration Tree
[Figure: dynamic enumeration tree that switches from row enumeration to feature enumeration. Example from the figure: under row r1 (features a b c), feature enumeration visits a (r2), ab, ac (r2), b (r3), bc (r3), c (r2 r3), which gives ac (r1 r2), bc (r1 r3), c (r1 r2 r3).]
27Dynamic Enumeration Tree
- When we use different conditions to decide the switching, the structure of the dynamic enumeration tree will change.
- No matter how it switches, the result set of closed patterns will be the same as the result of the single enumeration.
28Switching Condition
- The main idea of the switching condition is to estimate the processing time of an enumeration sub-tree, i.e., a row enumeration sub-tree or a feature enumeration sub-tree.
- Define some notation.
29Switching Condition
30Switching Condition
- Suppose r = 10, S(f1) = 0.8, S(f2) = 0.5, S(f3) = 0.5, S(f4) = 0.3 and minsup = 2.
- Then the estimated deepest node under f1 is f1f2f3, since
  - S(f1) × S(f2) × S(f3) × r = 2 ≥ minsup
  - S(f1) × S(f2) × S(f3) × S(f4) × r = 0.6 < minsup
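The arithmetic in this example can be reproduced with a short sketch. It only multiplies the S values along the path and stops when the estimate drops below minsup, which is how the example reads; the function name is hypothetical, and the full switching condition of COBBLER is not shown on these slides and is not implemented here. Exact fractions are used so that the boundary case 2 ≥ 2 is not affected by floating-point rounding.

```python
from fractions import Fraction


def estimated_deepest_node(num_rows, s_values, minsup):
    """Return (depth, estimate) of the deepest node whose estimated support is >= minsup."""
    estimate = Fraction(num_rows)
    depth = 0
    for s in s_values:
        if estimate * s < minsup:
            break  # adding this feature would drop below minsup
        estimate *= s
        depth += 1
    return depth, estimate


# Values from the slide: r = 10, S(f1)..S(f4), minsup = 2.
s_values = [Fraction("0.8"), Fraction("0.5"), Fraction("0.5"), Fraction("0.3")]
depth, estimate = estimated_deepest_node(10, s_values, minsup=2)
print(depth, estimate)                  # depth 3 corresponds to node f1f2f3
print(float(estimate * s_values[3]))    # estimate after adding f4
```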
31Experiments
- We compare 3 algorithms: COBBLER, CHARM and CLOSET.
- One real-life dataset and one synthetic dataset.
- We set 3 parameters: minsup, Length Ratio and Row Ratio.
32minsup
Synthetic data
Real-life data (thrombin)
33Length and Row ratio
Synthetic data
34Discussion
- The combination of row and feature enumeration also has some disadvantages:
  - The cost of calculating the switching condition and the cost of bad decisions.
  - The increased cost of pruning, since two sets of pruning methods must be maintained.
35Discussion
- We may use other, more complicated data structures in our algorithm to improve the performance, e.g., the vertical data format and the diffset technique.
- A more efficient switching condition may improve the algorithm further.
36Conclusion
- The COBBLER algorithm gives better performance on datasets where the advantage of switching can be shown, e.g., complex datasets or datasets with both a large number of rows and a large number of features.
- For data with simple characteristics, a single enumeration algorithm may be better.
37Future Work
- Use other data structures and techniques in the algorithm.
- Extend COBBLER to handle datasets that cannot fit into memory.
38Thanks