Applying Pruning Techniques to Single-Class Emerging Substring Mining

1
Applying Pruning Techniques to Single-Class
Emerging Substring Mining

Speaker: Sarah Chan
Supervisor: Dr. B. C. M. Kao
M.Phil. Probation Talk
CSIS DB Seminar
Aug 30, 2002

2
Presentation Outline

Introduction
The single-class ES mining problem
Data structure: merged suffix tree
Algorithms: baseline, s-pruning, g-pruning,
l-pruning
Performance evaluation
Conclusions

3
Introduction

Emerging Substrings (ESs)
A new type of KDD patterns
Substrings whose supports (or frequencies)
increase significantly from one class to another
(measured by a growth rate)
Motivation: Emerging Patterns (EPs) by Dong and
Li
Jumping Emerging Substrings (JESs) as a
specialization of ESs
Substrings which can only be found in one class
but not others

4
Introduction

Emerging Substrings (ESs)
Usefulness
Capture sharp contrasts between datasets, or
trends over time
Provide knowledge for building sequence
classifiers
Applications (virtually endless)
Language identification, purchase behavior
analysis, financial data analysis,
bioinformatics, melody track selection, web-log
mining, content-based e-mail processing systems, ...

5
Introduction

Mining ESs
Brute-force approach
To enumerate all possible substrings in the
database, find their support counts in each
class, and check growth rate
But a huge sequence database contains millions of
sequences (GenBank had 15 million sequences in
2001), and
the number of substrings in a sequence grows
quadratically with sequence length (a typical
human genome has 3 billion characters)
→ Too many candidates
→ Expensive in terms of time (O(|D|² n³))
and memory
Other shortcomings: repeated substrings, common
substrings, ... (please refer to seminar020201)
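
As a rough illustration of the brute-force approach described on this slide, the Python sketch below enumerates every substring, counts the sequences that contain it in each class, and then checks support and growth rate. The function name, the thresholds `rho_s` and `rho_g` (the support and growth rate thresholds defined later in the talk), and the toy data are assumptions made for illustration; none of them come from the slides.

```python
import math
from collections import Counter


def brute_force_es(target, opponent, rho_s, rho_g):
    """Return substrings whose support in `target` is at least rho_s and whose
    growth rate from `opponent` to `target` is at least rho_g."""

    def sequence_counts(dataset):
        counts = Counter()
        for seq in dataset:
            # A set, so that a sequence contributes at most once per substring.
            subs = {seq[i:j]
                    for i in range(len(seq))
                    for j in range(i + 1, len(seq) + 1)}
            counts.update(subs)
        return counts

    c_target = sequence_counts(target)
    c_opp = sequence_counts(opponent)
    result = set()
    for s, k in c_target.items():
        supp_t = k / len(target)
        supp_o = c_opp[s] / len(opponent)  # Counter returns 0 for unseen keys
        rate = math.inf if supp_o == 0 else supp_t / supp_o
        if supp_t >= rho_s and rate >= rho_g:
            result.add(s)
    return result


# Hypothetical toy data (not the example from the slides).
print(brute_force_es(["abc", "abd", "ab"], ["b", "bc", "c"], 0.5, 2))
```

Every substring of every sequence is materialized here, which is exactly the candidate explosion the slide warns about.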

6
Introduction

Mining ESs
An Apriori-like approach
E.g. if both abcd and bcde are frequent in D,
generate candidate abcde
Find frequent substrings and check growth rate
Still requires many database scans
A candidate may not be contained in any sequence
in D
Apriori property does not hold for ESs: abcde can
be an ES even if neither abcd nor bcde is

We need algorithms which are more efficient
and
which allow us to filter out ES candidates

7
Introduction

Mining ESs
Our approach: a suffix tree-based framework
A compact way of storing all substrings, with
support counters maintained
Deal with suffixes (not substrings) of sequences
Do not consider substrings not existing in the
database
Time complexity: O(lg(|Σ|) |D| n²)
Techniques for pruning of ES candidates can be
easily applied

8
Basic Definitions

Sequence
An ordered set of symbols over an alphabet Σ
Class
In a sequence database, each sequence σi has a
class label Ci ∈ C, the set of all class labels
σ does not belong to Ck ⇔ σ belongs to C'k
Dataset
If database D is associated with m class labels,
we can partition D into m datasets, such that all
sequences in dataset Di have class label Ci
σ ∉ Dk ⇔ σ ∈ D'k

9
Basic Definitions

Count and support of string s in dataset D
count_D(s) = no. of sequences in D that contain s
supp_D(s) = count_D(s) / |D|
Growth rate of string s from D1 to D2
growthRate_D1→D2(s) = supp_D2(s) / supp_D1(s)
growth rate = 0 if supp_D1(s) = supp_D2(s) = 0
growth rate = ∞ if supp_D1(s) = 0 and supp_D2(s) > 0
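
The definitions on this slide translate almost directly into code. The following is a minimal Python sketch; the function names and the two toy datasets are hypothetical and are not the example used in the talk.

```python
import math


def count(dataset, s):
    """Number of sequences in `dataset` that contain s as a substring."""
    return sum(1 for seq in dataset if s in seq)


def supp(dataset, s):
    """Support of s: the fraction of sequences in `dataset` containing s."""
    return count(dataset, s) / len(dataset)


def growth_rate(d1, d2, s):
    """Growth rate of s from d1 to d2, with the two special cases for supp_d1 = 0."""
    s1, s2 = supp(d1, s), supp(d2, s)
    if s1 == 0:
        return 0.0 if s2 == 0 else math.inf
    return s2 / s1


# Hypothetical toy datasets.
d1 = ["abcd", "bd", "a", "c"]
d2 = ["abd", "bc", "cd", "b"]
print(growth_rate(d1, d2, "b"))  # (3/4) / (2/4)
```

Note that `count` counts sequences, not occurrences, so a string that appears twice in one sequence still contributes once.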

10
ES and JES

Emerging Substring (ES)
Given ρs and ρg, a string s is an ES from D'k to
Dk (or s is an ES of Ck) if these hold:
support condition: supp_Dk(s) ≥ ρs
growth rate condition: growthRate_D'k→Dk(s) ≥ ρg
Jumping Emerging Substring (JES)
It is an ES with ∞ growth rate
JES of Ck: supp_D'k(s) = 0 and supp_Dk(s) > 0
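
The two conditions can be checked from raw counts alone. The sketch below assumes that both thresholds are inclusive (support at least `rho_s`, growth rate at least `rho_g`); the function and parameter names are hypothetical.

```python
import math


def is_es(c_target, c_opp, n_target, n_opp, rho_s, rho_g):
    """True if a string with the given counts is an ES of the target class.

    c_target, c_opp: number of sequences containing the string in each dataset.
    n_target, n_opp: dataset sizes.
    """
    supp_t = c_target / n_target
    supp_o = c_opp / n_opp
    if supp_t == 0 or supp_t < rho_s:
        return False                      # support condition fails
    rate = math.inf if supp_o == 0 else supp_t / supp_o
    return rate >= rho_g                  # growth rate condition


def is_jes(c_target, c_opp, n_target, n_opp, rho_s):
    """True if the string is a jumping ES: an ES that is absent from the opponent class."""
    return c_opp == 0 and is_es(c_target, c_opp, n_target, n_opp, rho_s, math.inf)


print(is_es(3, 2, 4, 4, rho_s=0.5, rho_g=1.5))
print(is_jes(2, 0, 4, 4, rho_s=0.5))
```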

11
ES and JES

Example

With ρg = 1.5:
ESs from D2 to D1: a, abc, bcd, abcd
ESs from D1 to D2: b, abd
12
ES and JES

Example

With ρg = 1.5:
ESs from D2 to D1: a, abc, bcd, abcd
ESs from D1 to D2: b, abd
growthRate_D1→D2(b) = (3/4) / (2/4) = 1.5
13
ES and JES

Example

With ρg = 1.5:
ESs from D2 to D1: a, abc, bcd, abcd
ESs from D1 to D2: b, abd
(JESs are underlined on the slide)
14
The ES Mining Problem

The ES mining problem
Given a database D, the set C of all class
labels, a support threshold ρs and a growth rate
threshold ρg, to discover the set of all ESs for
each class Cj ∈ C
The single-class ES mining problem
A target class Ck is specified and our goal is to
discover the set of all ESs of Ck
C'k: opponent class

15
Merged Suffix Tree

Suffix tree
Represent all the substrings of a length-n
sequence in O(n) space
Merged suffix tree
Represent all the substrings of all sequences in
a dataset Dk in O(|Dk| n) space
Each node has a support counter for each dataset
Each node is associated with a substring and
related to one or more substrings
Each edge is denoted as an index range [istart,
iend)
E.g. if σ = abcd, then σ[1, 3) = ab
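
The half-open index range can be read as follows. This is a small Python sketch that assumes 1-based positions, as the example on this slide suggests; `edge_label` is a hypothetical helper name.

```python
def edge_label(sigma, i_start, i_end):
    """Return the symbols of sigma covered by the 1-based, half-open range [i_start, i_end)."""
    return sigma[i_start - 1:i_end - 1]


print(edge_label("abcd", 1, 3))  # the slide's example range
```

Storing two indices per edge instead of the label itself is what keeps the tree's space linear in the total sequence length.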

16
Merged Suffix Tree

Example
(c1, c2) = (count in Ck, count in C'k)

[Figure: example merged suffix tree with a node labeled A]
17
Merged Suffix Tree

Example
(c1, c2) = (count in Ck, count in C'k)

count_Dk(a) = 2, count_D'k(a) = 1
[Figure: example merged suffix tree with a node labeled A]
18
Merged Suffix Tree

Example
Node Y is associated with abcd (concatenation)
and related to abc and abcd (all share Y's
counters)
An implicit node Z is associated with abc

[Figure: example merged suffix tree with nodes Z and Y labeled]
19
Algorithms

The baseline algorithm
Consists of 3 phases
Three pruning techniques
Support threshold pruning (s-pruning algorithm)
Growth rate threshold pruning (g-pruning
algorithm)
Length threshold pruning (l-pruning algorithm)

20
Baseline Algorithm

1. Construction Phase (C-Phase)
A merged tree MT is built from all the sequences
of the target class Ck: each suffix sj of each
sequence is matched against substrings in the
tree
Update c1 counter for substrings contained in sj
(but a sequence should not contribute twice to
the same counter)
Explicitize implicit nodes when necessary
When a mismatch occurs, add a new edge and a new
leaf to represent the unmatched part of sj
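
The counting logic of the C-Phase can be sketched in Python as below. To keep the example short it uses an uncompressed suffix trie (one node per symbol) rather than the compressed merged suffix tree of the talk, so there are no implicit nodes to explicitize; this substitution, the class and function names, and the `last_seq` marker used to avoid double counting are all assumptions made for illustration.

```python
class Node:
    def __init__(self):
        self.c1 = 0         # number of target-class sequences containing this substring
        self.last_seq = -1  # id of the last sequence that incremented c1
        self.children = {}  # symbol -> Node


def build_tree(target_seqs):
    """C-Phase on a suffix trie: insert every suffix of every target sequence."""
    root = Node()
    for seq_id, seq in enumerate(target_seqs):
        for start in range(len(seq)):          # one pass per suffix
            node = root
            for symbol in seq[start:]:
                # On a mismatch, setdefault adds the missing node.
                node = node.children.setdefault(symbol, Node())
                if node.last_seq != seq_id:    # a sequence counts at most once
                    node.c1 += 1
                    node.last_seq = seq_id
    return root


def find(root, s):
    """Return the node for substring s, or None if s is not in the tree."""
    node = root
    for symbol in s:
        node = node.children.get(symbol)
        if node is None:
            return None
    return node


root = build_tree(["abab", "abc"])
print(find(root, "ab").c1)  # "ab" occurs twice in "abab" but that sequence counts once
```

A trie uses far more space than a suffix tree, so this is only meant to show the matching and counter updates, not the space bound.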

21
Baseline Algorithm

1. Construction Phase (C-Phase)
Example

[Figure: matching a suffix against MT. Callouts: update of c1 counter; explicitization of implicit node; update of edges]
22
Baseline Algorithm

1. Construction Phase (C-Phase)
Example

[Figure: matching a suffix against MT. Callout: addition of new edge and leaf node]
23
Baseline Algorithm

2. Update Phase (U-Phase)
MT is updated with all the sequences of the
opponent class C'k
Only update the c2 counter for substrings that are
already present in the tree; do not introduce
any substring that is present only in D'k
Only internal nodes will be added (no new leaf
nodes)
Resultant tree: the updated MT
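
A sketch of the U-Phase in Python follows, again on an uncompressed suffix trie rather than the talk's merged suffix tree (an assumption made to keep the code short). In a trie every substring already has its own node, so the update adds no nodes at all; in the compressed tree the same step may have to split edges, which is where the new internal nodes come from. All names here are hypothetical.

```python
class Node:
    def __init__(self):
        self.c1 = 0        # count in the target class
        self.c2 = 0        # count in the opponent class
        self.mark = None   # last (class, sequence id) that touched this node
        self.children = {}


def build_tree(target_seqs):
    """C-Phase on an uncompressed suffix trie."""
    root = Node()
    for seq_id, seq in enumerate(target_seqs):
        for start in range(len(seq)):
            node = root
            for symbol in seq[start:]:
                node = node.children.setdefault(symbol, Node())
                if node.mark != ("target", seq_id):
                    node.mark = ("target", seq_id)
                    node.c1 += 1
    return root


def update_tree(root, opponent_seqs):
    """U-Phase: increment c2 along existing paths only; never add substrings."""
    for seq_id, seq in enumerate(opponent_seqs):
        for start in range(len(seq)):
            node = root
            for symbol in seq[start:]:
                node = node.children.get(symbol)
                if node is None:
                    break  # the rest of this suffix is not in the tree
                if node.mark != ("opponent", seq_id):
                    node.mark = ("opponent", seq_id)
                    node.c2 += 1


def find(root, s):
    node = root
    for symbol in s:
        node = node.children.get(symbol)
        if node is None:
            return None
    return node


root = build_tree(["abc", "abd"])
update_tree(root, ["bcx", "ab", "ab"])
print(find(root, "ab").c1, find(root, "ab").c2)
```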

24
Baseline Algorithm

3. eXtraction Phase (X-Phase)
All ESs of Ck are extracted by a pre-order tree
traversal on MT
At each node X, we check the values of its
counters, ρs and ρg, to determine whether its
related substrings can satisfy both the support
and growth rate conditions
If the related substrings of a node X cannot
fulfill the support condition, we can ignore the
subtree rooted at X
Baseline algorithm: C-U-X phases
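
The three phases can be put together in a short Python sketch. It uses an uncompressed suffix trie in place of the merged suffix tree and assumes inclusive thresholds; both choices, and all names, are illustrative assumptions. The point to notice is the `continue` in `extract`: a node that fails the support condition is skipped together with its subtree, because no extension of a substring can have higher support.

```python
class Node:
    def __init__(self):
        self.c1 = 0
        self.c2 = 0
        self.children = {}


def _insert(root, seq, field, create):
    """Walk every suffix of seq, bumping `field` once per node for this sequence."""
    seen = set()
    for start in range(len(seq)):
        node = root
        for symbol in seq[start:]:
            nxt = node.children.get(symbol)
            if nxt is None:
                if not create:
                    break
                nxt = node.children[symbol] = Node()
            node = nxt
            if id(node) not in seen:
                seen.add(id(node))
                setattr(node, field, getattr(node, field) + 1)


def build(target, opponent):
    root = Node()
    for seq in target:            # C-Phase
        _insert(root, seq, "c1", create=True)
    for seq in opponent:          # U-Phase
        _insert(root, seq, "c2", create=False)
    return root


def extract(root, n_target, n_opp, rho_s, rho_g):
    """X-Phase: pre-order traversal that collects all ESs of the target class."""
    found = []
    stack = [("", root)]
    while stack:
        prefix, node = stack.pop()
        if node is not root:
            if node.c1 < rho_s * n_target:
                continue          # support fails here and in the whole subtree
            # growth rate >= rho_g, written without division
            if node.c2 == 0 or node.c1 * n_opp >= rho_g * node.c2 * n_target:
                found.append(prefix)
        for symbol in sorted(node.children, reverse=True):
            stack.append((prefix + symbol, node.children[symbol]))
    return found


target, opponent = ["abc", "abd", "ab"], ["b", "bc", "c"]
root = build(target, opponent)
print(extract(root, len(target), len(opponent), rho_s=0.5, rho_g=2))
```

A node that fails only the growth rate condition is not a reason to stop: its extensions may still qualify, so the traversal continues below it.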

25
s-Pruning Algorithm

Observations
The c2 counter of each substring in MT would be
updated in the U-Phase if the substring is
contained in some sequence in D'k
If a substring is infrequent with respect to Dk,
it does not qualify as an ES of Ck, and its
descendant nodes will not even be visited in the
X-Phase
Pruning idea
To prune infrequent substrings in MT after the
C-Phase

26
s-Pruning Algorithm

ρs-Pruning Phase (Ps-Phase)
With the use of ρs, all substrings being
infrequent in Dk are pruned by a pre-order
traversal on MT
Resultant tree: MTs (input to the U-Phase)
s-pruning algorithm: C-Ps-U-X phases
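
The Ps-Phase can be sketched in Python as a traversal that cuts off every subtree whose root is infrequent in the target class. As in any such simplification, the uncompressed suffix trie, the recursive traversal, and the names below are assumptions for illustration rather than the talk's implementation.

```python
class Node:
    def __init__(self):
        self.c1 = 0
        self.last_seq = -1
        self.children = {}


def build_tree(target_seqs):
    """C-Phase on an uncompressed suffix trie."""
    root = Node()
    for seq_id, seq in enumerate(target_seqs):
        for start in range(len(seq)):
            node = root
            for symbol in seq[start:]:
                node = node.children.setdefault(symbol, Node())
                if node.last_seq != seq_id:
                    node.c1 += 1
                    node.last_seq = seq_id
    return root


def s_prune(node, min_count):
    """Ps-Phase: drop every child (and its subtree) with c1 below min_count."""
    for symbol in list(node.children):
        child = node.children[symbol]
        if child.c1 < min_count:
            del node.children[symbol]
        else:
            s_prune(child, min_count)


def substrings(node, prefix=""):
    """All substrings stored in the tree, in pre-order."""
    out = []
    for symbol in sorted(node.children):
        out.append(prefix + symbol)
        out.extend(substrings(node.children[symbol], prefix + symbol))
    return out


target = ["abc", "abd", "ab"]
root = build_tree(target)
print(len(substrings(root)))
s_prune(root, min_count=0.5 * len(target))   # rho_s = 0.5
print(substrings(root))
```

Because the opponent sequences are matched only against what is left, a smaller tree after this phase means less work in the U-Phase.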

27
g-Pruning Algorithm

Observations
As sequences in D'k are being added to MT, value
of the c2 counter of some nodes would become
larger
→ Support of these nodes' related substrings in
D'k is monotonically increasing
→ Ratio of the support of these substrings in Dk
to that in D'k is monotonically decreasing
At some point, this ratio may become less than
ρg. When this happens, these substrings have
actually lost their candidature for being ESs of
Ck

28
g-Pruning Algorithm

Pruning idea
To prune substrings in MT as soon as they are
found to be failing the growth rate requirement
ρg-Update Phase (Ug-Phase)
When the support count of a substring in D'k
increases, check if it still satisfies the growth
rate condition. If not, prune substring by path
compression or node deletion
Supported by [istart, iq, iend) representation of
edges
g-pruning algorithm: C-Ug-X phases

29
l-Pruning Algorithm

Observations
Longer substrings often have lower support than
shorter ones → less likely to fulfill the support
condition for ESs
It is not desirable to append these longer
substrings to the tree in the C-Phase and
subsequently prune them in the Ps-Phase (for the
s-pruning algorithm)
Pruning idea
To limit the length of substrings to be added to
MT in the tree construction phase

30
l-Pruning Algorithm

ρl-Construction Phase (Cl-Phase)
Only match min(|sj|, ρl) symbols of each suffix
against the tree (ignore the remainder) → a
smaller MT is built
Unlike the previous two pruning approaches, it
may result in ES loss
l-pruning algorithm: Cl-U-X phases
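
In code, the Cl-Phase differs from the plain construction phase only in truncating each suffix before it is matched. The Python sketch below shows this on an uncompressed suffix trie; the trie, the parameter name `rho_l`, and the helper names are assumptions for illustration.

```python
class Node:
    def __init__(self):
        self.c1 = 0
        self.last_seq = -1
        self.children = {}


def build_tree(target_seqs, rho_l):
    """Cl-Phase: insert at most rho_l symbols of each suffix."""
    root = Node()
    for seq_id, seq in enumerate(target_seqs):
        for start in range(len(seq)):
            node = root
            # Slicing yields min(len(suffix), rho_l) symbols.
            for symbol in seq[start:start + rho_l]:
                node = node.children.setdefault(symbol, Node())
                if node.last_seq != seq_id:
                    node.c1 += 1
                    node.last_seq = seq_id
    return root


def substrings(node, prefix=""):
    """All substrings stored in the tree, in pre-order."""
    out = []
    for symbol in sorted(node.children):
        out.append(prefix + symbol)
        out.extend(substrings(node.children[symbol], prefix + symbol))
    return out


print(substrings(build_tree(["abcd"], rho_l=2)))
```

With `rho_l=2` the substring abc never enters the tree, so if it were an ES it would be lost, which is the trade-off noted on this slide.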

31
Summary of Phases

Baseline: C-U-X
s-pruning: C-Ps-U-X (earlier use of ρs)
g-pruning: C-Ug-X (earlier use of ρg)
l-pruning: Cl-U-X (addition of ρl)

Combination of the use of pruning techniques
<s, g>, <l, s>, <l, g>, <l, s, g>

32
Performance Evaluation

Dataset: CI3 (music features in MIDI tracks)

Goal: to extract ESs from target class melody
(opponent class: non-melody)
Assumptions: all sequences are pre-stored in
memory (appended in a vector, with the starting and
ending positions of each sequence recorded)

33
Number of ESs Mined
34
Take a look at the tree size

When ρs = 0.50, ρg = 2

35
Baseline Algorithm: C-U-X

Performance: same for all ρs and ρg
Time: about 35 s

36
s-Pruning Algorithm: C-Ps-U-X

Faster than baseline alg. by 25-45%
But reduction in time < reduction in tree size
Performance improves with an increase in ρs; same for all ρg

37
g-Pruning Algorithm: C-Ug-X

When ρg = ∞, faster than baseline alg. by 2-5%
When ρg = 2 or 5, slower than baseline alg. by
1-4%
Performance improves with an increase in ρg; same for all ρs

38
sg-Pruning Algorithm: C-Ps-Ug-X

Faster than baseline, s-pruning, g-pruning
alg. (all cases)
Faster than baseline alg. by 31-54% (ρg = 2 or 5),
47-81% (ρg = ∞)
Performance improves with an increase in ρs and ρg

39
Target Class = Melody
(ρg = 2)

Performance of algorithms
(fastest) sg-pruning > s-pruning > baseline >
g-pruning

40
What If Target Class = Non-Melody?
(ρg = 2)

Performance of algorithms
(fastest) s-pruning > sg-pruning > baseline >
g-pruning

41
What If Target Class = Non-Melody?

sg-pruning performs worse than s-pruning
Due to overhead in node creation (g-pruning
requires one more index for each edge)
Not much performance gain with s-pruning (just
3-5%) or sg-pruning (1-3%)
Bottleneck: formation of MT (over 93% of the time is
spent in the C-Phase)
In fact, these pruning techniques are very
effective since much time is saved in the U-Phase:
42-80% (for s-pruning) and 54-85% (for sg-pruning)

42
l-Pruning Algorithm: Loss of ESs
[Figure: ES loss by ρl for each (ρs, ρg) setting. Avg. seq. length: 331; max. seq. length: 1085]

Except when ρs = 0.25, there is loss of
non-jumping ESs only when ρl < 20 (15 for the
case of JESs)

43
l-Pruning Algorithm: Time Saved
[Figure: time saved by ρl for each (ρs, ρg) setting. Avg. seq. length: 331; max. seq. length: 1085]

Time saved becomes obvious when ρl < 100
For ρs ≥ 0.50, can save over 30% of the time without ES
loss

44
To be Explored . . .

ls-pruning
lg-pruning
lsg-pruning

45
Conclusions

ESs of a class are substrings which occur more
frequently in that class than in other
classes.
ESs are useful features as they capture
distinguishing characteristics of data classes.
We have proposed a suffix tree-based framework
for mining ESs.

46
Conclusions

Three basic techniques for pruning ES candidates
have been described, and most of them have been
shown to be effective.
Future work: to study whether pruning techniques
can be efficiently applied to suffix tree merging
algorithms or other ES mining models.

47
Applying Pruning Techniques to Single-Class
Emerging Substring Mining
- The End -
