Frequent Closed Pattern Search By Row and Feature Enumeration

Description:

Title: Slide 1 Author: Administrator Last modified by: pf Created Date: 2/4/2003 4:33:56 PM Document presentation format: On-screen Show Company: NUS – PowerPoint PPT presentation

Slides: 39
Provided by: uncEdu
Learn more at: http://www.cs.unc.edu

Transcript and Presenter's Notes


1
Frequent Closed Pattern Search By Row and Feature
Enumeration
2
Outline
  • Problem Definition
  • Related Work Feature Enumeration Algorithms
  • CARPENTER Row Enumeration Algorithm
  • COBBLER Combined Enumeration Algorithm

3
Problem Definition
  • Frequent closed pattern:
    1) frequent pattern: its support value is at
    least the threshold
    2) closed pattern: there exists no superset which
    has the same support value
  • Problem definition: given a dataset D whose
    records consist of features, our problem is to
    discover all frequent closed patterns with
    respect to a user-defined support threshold.
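
The two conditions above can be sketched in a few lines of Python. This is only an illustration: the toy table, the function names, and the choice of "support >= minsup" as the frequency test are assumptions, not part of the slides.

```python
# Hypothetical toy dataset: row id -> set of features.
DATASET = {
    "r1": {"a", "b", "c"},
    "r2": {"b", "c", "d"},
    "r3": {"b", "c", "d"},
    "r4": {"d"},
}

def support(pattern, dataset):
    """Number of rows that contain every feature of the pattern."""
    return sum(1 for feats in dataset.values() if pattern <= feats)

def is_closed(pattern, dataset):
    """True if no proper superset of the pattern has the same support.

    Equivalent test: the pattern equals the intersection of all rows
    that contain it (its closure).
    """
    rows = [feats for feats in dataset.values() if pattern <= feats]
    if not rows:
        return False
    return set.intersection(*rows) == set(pattern)

def is_frequent_closed(pattern, dataset, minsup):
    """Assumes 'frequent' means support >= minsup."""
    return support(pattern, dataset) >= minsup and is_closed(pattern, dataset)
```

For instance, {b} is frequent at minsup 2 but not closed, because {b, c} occurs in exactly the same rows.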

4
Related Work
  • Searching strategy: breadth-first vs. depth-first
    search
  • Data format: horizontal format vs. vertical
    format
  • Data compression method: diffset, FP-tree, etc.

5
Typical Algorithms
  • CLOSET: feature enumeration, horizontal format,
    depth-first search, FP-tree technique
  • APRIORI: feature enumeration, horizontal format,
    breadth-first search
  • CHARM: feature enumeration, vertical format,
    depth-first search, diffset technique

6
CARPENTER
  • CARPENTER stands for Closed Pattern Discovery by
    Transposing Tables that are Extremely Long
  • Motivation
  • Algorithm
  • Prune Method
  • Experiment

7
Motivation
  • Bioinformatic datasets typically contain a large
    number of features with a small number of rows.
  • The running time of most previous algorithms
    increases exponentially with the average
    length of the transactions.
  • CARPENTER's search space is much smaller than
    that of the previous algorithms on this kind of
    dataset, and it therefore performs better.

8
Algorithm
  • The main idea of CARPENTER is to mine the dataset
    row-wise.
  • Two steps:
  • First, transpose the dataset.
  • Second, search in the row enumeration tree.

9
Transpose Table
  • Features: a, b, c, d.
  • Rows: r1, r2, r3, r4.

Original table:
r1  a b c
r2  b c d
r3  b c d
r4  d

Transposed table (transpose of the original table):
a  r1
b  r1 r2 r3
c  r1 r2 r3
d  r2 r3 r4

Projected table (transposed table projected on r2 r3):
b
c
d  r4
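
The transposition and projection on this slide can be sketched as follows. The function names and the rule used for projection (keep the features shared by all chosen rows, and list only the rows that come later in the enumeration order) are assumptions inferred from the example tables, not code from the presentation.

```python
ORDER = ["r1", "r2", "r3", "r4"]          # pre-defined row order
ORIGINAL = {
    "r1": ["a", "b", "c"],
    "r2": ["b", "c", "d"],
    "r3": ["b", "c", "d"],
    "r4": ["d"],
}

def transpose(table):
    """Turn {row: features} into {feature: set of rows}."""
    transposed = {}
    for row, feats in table.items():
        for f in feats:
            transposed.setdefault(f, set()).add(row)
    return transposed

def project(transposed, rows, order):
    """Project the transposed table on a set of rows.

    Keeps only features that occur in every row of `rows`; for each
    one, lists the rows that come after the last row of `rows`.
    """
    last = max(order.index(r) for r in rows)
    later = set(order[last + 1:])
    return {f: rs & later for f, rs in transposed.items() if set(rows) <= rs}

TRANSPOSED = transpose(ORIGINAL)
PROJECTED = project(TRANSPOSED, ["r2", "r3"], ORDER)
```

With the slide's data, PROJECTED keeps b and c with no remaining rows and d with r4, matching the projected table shown.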
10
Row Enumeration Tree
  • According to the transposed table, we build the
    row enumeration tree, which enumerates row ids
    in a pre-defined order.
  • We do a depth-first search in the row enumeration
    tree without any pruning strategies.

Transposed table:
a  r1
b  r1 r2 r3
c  r1 r2 r3
d  r2 r3 r4

Row enumeration tree (each node: row set {itemset}):
r1 {abc}
  r1 r2 {bc}
    r1 r2 r3 {bc}
      r1 r2 r3 r4 {}
    r1 r2 r4 {}
  r1 r3 {bc}
    r1 r3 r4 {}
  r1 r4 {}
r2 {bcd}
  r2 r3 {bcd}
    r2 r3 r4 {d}
  r2 r4 {d}
r3 {bcd}
  r3 r4 {d}
r4 {d}

Result for minsup = 2: bc {r1 r2 r3}, bcd {r2 r3}, d {r2 r3 r4}
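
A minimal sketch of this unpruned depth-first row enumeration is given below. It is an illustration under assumptions rather than the CARPENTER implementation: the function name is hypothetical, the support of each itemset is recounted directly from the table, and "support >= minsup" is assumed as the frequency test.

```python
TABLE = {
    "r1": {"a", "b", "c"},
    "r2": {"b", "c", "d"},
    "r3": {"b", "c", "d"},
    "r4": {"d"},
}

def row_enumeration(table, minsup):
    """Depth-first walk over all row sets, with no pruning.

    Each node's itemset is the intersection of its rows' features.
    Returns {itemset as sorted string: support}.
    """
    rows = sorted(table)
    found = {}

    def visit(start, items):
        for i in range(start, len(rows)):
            feats = table[rows[i]]
            new_items = set(feats) if items is None else items & feats
            if new_items:
                # Support = number of rows containing the whole itemset.
                sup = sum(1 for r in rows if new_items <= table[r])
                if sup >= minsup:
                    found["".join(sorted(new_items))] = sup
            visit(i + 1, new_items)   # no pruning: keep descending

    visit(0, None)
    return found

RESULT = row_enumeration(TABLE, minsup=2)
```

On the slide's table, RESULT holds bc with support 3, bcd with support 2, and d with support 3, the same three patterns listed for minsup 2.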
11
Prune Method 1
  • In the enumeration tree, the depth of a node is
    the corresponding support value.
  • Prune a branch if there won't be enough depth in
    that branch, which means the support of patterns
    found in the branch cannot reach the minimum
    support.

Example (minsup = 4):
r2 {bcd}
  r2 r3 {bcd}
  r2 r4 {d}

Node r2 is at depth 1 (sup 1) and has 2 sub-nodes.
The maximum support value in branch r2 will be 3,
therefore prune this branch.
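
The depth test behind this pruning rule can be written as a one-line check. The function name and its parameters are hypothetical, and the sketch assumes a branch is worth keeping when depth plus remaining rows is at least minsup.

```python
def can_reach_minsup(depth, rows_left, minsup):
    """True if some node below can still reach support >= minsup.

    depth     -- number of rows already in the node (its support so far)
    rows_left -- rows that can still be appended below this node
    """
    return depth + rows_left >= minsup

# Slide example: node r2 has depth 1 and only r3, r4 left to add.
KEEP_R2 = can_reach_minsup(depth=1, rows_left=2, minsup=4)
```

KEEP_R2 is False here because 1 + 2 = 3 is below the minimum support of 4, so the branch is pruned.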
12
Prune Method 2
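  • If rj has 100% support in the projected table of
    ri, prune the branch of rj.

Branch of r2 before pruning:
r2 {bcd}
  r2 r3 {bcd}
    r2 r3 r4 {d}
  r2 r4 {d}

Projected table of r2:
b  r3
c  r3
d  r3 r4

Projected table of r2 r3:
b
c
d  r4

Branch after pruning:
r2 r3 {bcd}
  r2 r3 r4 {d}

r3 has 100% support in the projected table of
r2 (it occurs in every tuple), therefore branch
r2 r3 is pruned and the whole branch is
reconstructed.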
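
Finding the rows with 100% support in a projected table amounts to intersecting the row lists of all its tuples. The sketch below assumes that reading of the slide; the function name is hypothetical.

```python
def rows_with_full_support(projected):
    """Rows that occur in every tuple (feature) of a projected table."""
    tuples = list(projected.values())
    if not tuples:
        return set()
    return set.intersection(*tuples)

# Projected table of r2 from the slide.
PROJECTED_R2 = {"b": {"r3"}, "c": {"r3"}, "d": {"r3", "r4"}}
FULL = rows_with_full_support(PROJECTED_R2)
```

FULL contains only r3, which is the row the slide merges into the r2 node.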
13
Prune Method 3
  • At any node in the enumeration tree, if the
    corresponding itemset of the node has been found
    before, we prune the branch rooted at this node.

r2 {bcd}
  r2 r3 {bcd}
  r2 r4 {d}
r3 {bcd}
  r3 r4 {d}

Since itemset bcd has been found before, the
branch rooted at r3 will be pruned.
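
The check can be sketched by keeping a set of itemsets already seen during the depth-first search. This is a simplified illustration, not the CARPENTER procedure: the function name is hypothetical, the check is applied literally at every node (so it also cuts r2 r3, which the slides handle with Prune Method 2), and branches with an empty itemset are skipped as an extra assumption. The sketch makes no claim that this naive form is safe on every dataset.

```python
TABLE = {
    "r1": {"a", "b", "c"},
    "r2": {"b", "c", "d"},
    "r3": {"b", "c", "d"},
    "r4": {"d"},
}

def row_enumeration_with_seen_check(table):
    """Depth-first row enumeration that cuts a branch whose itemset
    was already found. Returns (itemsets found, pruned nodes)."""
    rows = sorted(table)
    seen = set()
    pruned = []

    def visit(start, prefix, items):
        for i in range(start, len(rows)):
            node = prefix + (rows[i],)
            feats = table[rows[i]]
            new_items = frozenset(feats) if items is None else items & feats
            if not new_items:
                continue              # assumption: empty itemset, nothing to extend
            if new_items in seen:
                pruned.append(node)   # itemset found before: prune this branch
                continue
            seen.add(new_items)
            visit(i + 1, node, new_items)

    visit(0, (), None)
    return seen, pruned

SEEN, PRUNED = row_enumeration_with_seen_check(TABLE)
```

On the slide's table the node r3 appears in PRUNED because bcd was already recorded at r2, and SEEN still contains abc, bc, bcd and d.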
14
Performance
  • We compare 3 algorithms, CARPENTER, CHARM and
    CLOSET.
  • Dataset (Lung Cancer) has 181 rows with 12533
    features.
  • We set 3 parameters, minsup, Length Ratio and Row
    Ratio.

15
minsup
Lung Cancer, 181 rows, length ratio 0.6, row ratio
1. Running time of CARPENTER changes from 3 to
14 seconds.
16
Length Ratio
Lung Cancer, 181 rows, sup 7 (4%), row ratio
1. Running time of CARPENTER changes from 3 to 33
seconds.
17
Row Ratio
Lung Cancer, 181 rows, length ratio 0.6, sup 7
(4%). Running time of CARPENTER changes from 9 to
178 seconds.
18
Conclusion
  • We propose an algorithm called CARPENTER for
    finding closed patterns in long biological
    datasets.
  • CARPENTER performs row enumeration instead of
    column enumeration since the number of rows in
    such datasets is significantly smaller than the
    number of features.
  • Performance studies show that CARPENTER is much
    more efficient in finding closed patterns
    compared to existing feature enumeration
    algorithms.

19
COBBLER
  • Motivation
  • Algorithm
  • Performance

20
Motivation
  • With the development of CARPENTER, existing
    algorithms can be separated into two groups:
  • Feature enumeration: CHARM, CLOSET, etc.
  • Row enumeration: CARPENTER
  • We have two motivations for combining these two
    enumeration methods.

21
Motivation
  • 1. These two enumeration methods each have their
    own advantages on different types of datasets.
    Given a dataset, the characteristics of its
    sub-datasets may change.

Figure: a dataset is projected to a sub-dataset; one
has more features than rows, the other has more rows
than features.

  • 2. Given a dataset with both a large number of
    rows and a large number of features, neither a
    single row enumeration algorithm nor a single
    feature enumeration algorithm can handle the
    dataset.
22
Algorithm
  • There are two main points in the COBBLER
    algorithm:
  • How to build an enumeration tree for COBBLER.
  • How to decide when the algorithm should switch
    from one enumeration method to the other.
  • Therefore, we introduce the ideas of the dynamic
    enumeration tree and the switching condition.

23
Dynamic Enumeration Tree
  • We call the new kind of enumeration tree used in
    COBBLER the dynamic enumeration tree.
  • In a dynamic enumeration tree, different
    sub-trees may use different enumeration methods.

Original table:
r1  a b c
r2  a c d
r3  b c
r4  d

Transposed table:
a  r1 r2
b  r1 r3
c  r1 r2 r3
d  r2 r4

We use this table as an example in the later discussion.
24
Single Enumeration Tree
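Original table:
r1  a b c
r2  a c d
r3  b c
r4  d

Feature enumeration tree (each node: feature set {rows}):
a {r1r2}
  ab {r1}
    abc {r1}
      abcd {}
    abd {}
  ac {r1r2}
    acd {r2}
  ad {r2}
b {r1r3}
  bc {r1r3}
    bcd {}
  bd {}
c {r1r2r3}
  cd {r2}
d {r2r4}

Row enumeration tree (each node: row set {features}):
r1 {abc}
  r1r2 {ac}
    r1r2r3 {c}
      r1r2r3r4 {}
    r1r2r4 {}
  r1r3 {bc}
    r1r3r4 {}
  r1r4 {}
r2 {acd}
  r2r3 {c}
    r2r3r4 {}
  r2r4 {d}
r3 {bc}
  r3r4 {}
r4 {d}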
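
The feature enumeration tree for the example table can be generated with a short depth-first walk. This is a sketch for illustration only: the function name is hypothetical, no pruning is applied, and nodes are listed in pre-order as (feature set, rows) pairs.

```python
TABLE = {"r1": "abc", "r2": "acd", "r3": "bc", "r4": "d"}

def feature_enumeration_nodes(table):
    """List every node of the feature enumeration tree in pre-order.

    Each node is (features as a string, ids of rows containing all of them).
    """
    features = sorted({f for feats in table.values() for f in feats})
    nodes = []

    def visit(start, prefix, rows):
        for i in range(start, len(features)):
            f = features[i]
            # Rows of the parent node that also contain feature f.
            new_rows = [r for r in rows if f in table[r]]
            nodes.append((prefix + f, "".join(new_rows)))
            visit(i + 1, prefix + f, new_rows)

    visit(0, "", sorted(table))
    return nodes

NODES = feature_enumeration_nodes(TABLE)
```

NODES has 15 entries, one per non-empty feature set, beginning with a {r1r2} and ending with d {r2r4}.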
25
Dynamic Enumeration Tree
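Feature enumeration to row enumeration

Feature enumeration sub-tree under a:
a {r1r2}
  ab {r1}
    abc {r1}
      abcd {}
    abd {}
  ac {r1r2}
    acd {r2}
  ad {r2}

Projected table of a:
b  r1
c  r1 r2
d  r2

Row enumeration in the projected table of a:
a {r1r2}
  r1 {bc}
    r1r2 {c}
  r2 {cd}

Closed patterns found under a: abc {r1}, ac {r1r2}, acd {r2}

Dynamic enumeration tree:
a {r1r2}
  r1 {bc}
    r1r2 {c}
  r2 {cd}
b {r1r3}
  r1 {c}
    r1r3 {c}
  r3 {c}
c {r1r2r3}
  r2 {d}
d {r2r4}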
26
Dynamic Enumeration Tree
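Row enumeration to feature enumeration

Row enumeration sub-tree under r1:
r1 {abc}
  r1r2 {ac}
    r1r2r3 {c}
      r1r2r3r4 {}
    r1r2r4 {}
  r1r3 {bc}
    r1r3r4 {}
  r1r4 {}

Feature enumeration in the projected table of r1:
r1 {abc}
  a {r2}
    ab {}
    ac {r2}
  b {r3}
    bc {r3}
  c {r2r3}

Closed patterns found under r1: ac {r1r2}, bc {r1r3}, c {r1r2r3}

Dynamic enumeration tree:
r1 {abc}
  a {r2}
    ab {}
    ac {r2}
  b {r3}
    bc {r3}
  c {r2r3}
r2 {acd}
  a {r1}
    ac {r1}
      acd {}
    ad {}
  c {r1r3}
    cd {}
  d {r4}
r3 {bc}
  b {r1}
    bc {r1}
  c {r1r2}
r4 {d}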
27
Dynamic Enumeration Tree
  • When we use different conditions to decide the
    switching, the structure of the dynamic
    enumeration tree will change.
  • No matter how it switches, the result set of
    closed patterns will be the same as the result of
    a single enumeration.
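
The claim that the enumeration method does not change the result can be checked on the example table by comparing the two pure enumerations. The sketch below does not implement COBBLER's switching; both functions are hypothetical brute-force versions, and no minimum support is applied.

```python
from itertools import combinations

TABLE = {
    "r1": {"a", "b", "c"},
    "r2": {"a", "c", "d"},
    "r3": {"b", "c"},
    "r4": {"d"},
}

def closed_by_rows(table):
    """Row enumeration: intersect the features of every non-empty row set."""
    rows = sorted(table)
    out = set()
    for k in range(1, len(rows) + 1):
        for combo in combinations(rows, k):
            items = frozenset.intersection(*(frozenset(table[r]) for r in combo))
            if items:
                out.add(items)
    return out

def closed_by_features(table):
    """Feature enumeration: keep feature sets that equal their own closure."""
    features = sorted(set().union(*table.values()))
    out = set()
    for k in range(1, len(features) + 1):
        for combo in combinations(features, k):
            pattern = frozenset(combo)
            rows = [r for r in table if pattern <= table[r]]
            if rows:
                closure = frozenset.intersection(*(frozenset(table[r]) for r in rows))
                if closure == pattern:
                    out.add(pattern)
    return out

BY_ROWS = closed_by_rows(TABLE)
BY_FEATURES = closed_by_features(TABLE)
```

Both sets contain the same six closed patterns for this table: abc, ac, acd, bc, c and d.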

28
Switching Condition
  • The main idea of the switching condition is to
    estimate the processing time of an enumeration
    sub-tree, i.e., a row enumeration sub-tree or a
    feature enumeration sub-tree.
  • Define some notation.

29
Switching Condition
30
Switching Condition
  • Suppose r = 10, S(f1) = 0.8, S(f2) = 0.5,
    S(f3) = 0.5, S(f4) = 0.3 and minsup = 2.
  • Then the estimated deepest node under f1 is
    f1f2f3, since
  • S(f1) × S(f2) × S(f3) × r = 2 ≥ minsup
  • S(f1) × S(f2) × S(f3) × S(f4) × r = 0.6 < minsup
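
The arithmetic in this example can be reproduced with a small helper. The function name and argument layout are assumptions, and the sketch covers only the depth estimate shown on this slide, not the full switching condition. Exact fractions are used so the comparison with minsup is not affected by floating-point rounding.

```python
from fractions import Fraction

def estimated_deepest_node(r, supports, minsup):
    """Extend the path while the estimated support stays >= minsup.

    r        -- number of rows
    supports -- ordered list of (feature name, S(f) as a decimal string)
    """
    path = []
    estimate = Fraction(r)
    for name, s in supports:
        estimate *= Fraction(s)
        if estimate < minsup:
            break
        path.append(name)
    return path

SUPPORTS = [("f1", "0.8"), ("f2", "0.5"), ("f3", "0.5"), ("f4", "0.3")]
DEEPEST = estimated_deepest_node(10, SUPPORTS, 2)
```

DEEPEST is the path f1, f2, f3: the estimate is 2 after f3 and drops to 0.6 once f4 is included.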

31
Experiments
  • We compare 3 algorithms, COBBLER, CHARM and
    CLOSET.
  • One real-life dataset and one synthetic dataset.
  • We set 3 parameters, minsup, Length Ratio and Row
    Ratio.

32
minsup
Figure panels: synthetic data; real-life data (thrombin)
33
Length and Row ratio
Figure panel: synthetic data
34
Discussion
  • The combination of row and feature enumeration
    also has some disadvantages:
  • The cost of calculating the switching condition
    and the cost of a bad decision.
  • The increased cost of pruning, since two sets of
    pruning mechanisms must be maintained.

35
Discussion
  • We may use other, more sophisticated data
    structures in our algorithm to improve the
    performance, e.g., the vertical data format and
    the diffset technique.
  • A more efficient switching condition may
    improve the algorithm further.

36
Conclusion
  • The COBBLER algorithm gives better performance on
    datasets where the advantage of switching can be
    shown, e.g., complex datasets or datasets that
    have both a large number of rows and features.
  • For data with simple characteristics, a single
    enumeration algorithm may be better.

37
Future Work
  • Use other data structures and techniques in the
    algorithm.
  • Extend COBBLER to handle datasets that cannot
    fit into memory.

38
Thanks