Title: Applying Pruning Techniques to Single-Class Emerging Substring Mining
1 Applying Pruning Techniques to Single-Class Emerging Substring Mining
- Speaker: Sarah Chan
- Supervisor: Dr. B. C. M. Kao
- M.Phil. Probation Talk
- CSIS DB Seminar
- Aug 30, 2002
 
2 Presentation Outline
- Introduction
- The single-class ES mining problem
- Data structure: merged suffix tree
- Algorithms: baseline, s-pruning, g-pruning, l-pruning
- Performance evaluation
- Conclusions
 
3 Introduction
- Emerging Substrings (ESs)
 - A new type of KDD pattern
 - Substrings whose supports (or frequencies) increase significantly from one class to another (measured by a growth rate)
 - Motivation: Emerging Patterns (EPs) by Dong and Li
 - Jumping Emerging Substrings (JESs) as a specialization of ESs
 - Substrings which can be found in only one class but not in the others

4 Introduction
- Emerging Substrings (ESs)
 - Usefulness
 - Capture sharp contrasts between datasets, or trends over time
 - Provide knowledge for building sequence classifiers
 - Applications (virtually endless)
 - Language identification, purchase behavior analysis, financial data analysis, bioinformatics, melody track selection, web-log mining, content-based e-mail processing systems, ...

5 Introduction
- Mining ESs
 - Brute-force approach
 - Enumerate all possible substrings in the database, find their support counts in each class, and check the growth rate
 - But a huge sequence database contains millions of sequences (GenBank had 15 million sequences in 2001), and
 - The number of substrings in a sequence grows quadratically with sequence length (a typical human genome has 3 billion characters)
 - ⇒ Too many candidates
 - ⇒ Expensive in terms of time ( O(|D|^2 n^3) ) and memory
 - Other shortcomings: repeated substrings, common substrings, ... (please refer to seminar020201)

6 Introduction
- Mining ESs
 - An Apriori-like approach
 - E.g. if both abcd and bcde are frequent in D, generate candidate abcde
 - Find frequent substrings and check the growth rate
 - Still requires many database scans
 - A candidate may not be contained in any sequence in D
 - The Apriori property does not hold for ESs: abcde can be an ES even if neither abcd nor bcde is
 - ⇒ We need algorithms which are more efficient and which allow us to filter out ES candidates
 
7 Introduction
- Mining ESs
 - Our approach: a suffix tree-based framework
 - A compact way of storing all substrings, with support counters maintained
 - Deals with suffixes (not substrings) of sequences
 - Does not consider substrings that do not exist in the database
 - Time complexity: O( lg|Σ| · |D| · n^2 )
 - Techniques for pruning ES candidates can be easily applied

8 Basic Definitions
- Sequence
 - An ordered list of symbols over an alphabet Σ
- Class
 - In a sequence database, each sequence σ_i has a class label C_i ∈ C, the set of all class labels
 - σ does not belong to C_k ⇔ σ belongs to C_k'
- Dataset
 - If database D is associated with m class labels, we can partition D into m datasets, such that all sequences in dataset D_i have class label C_i
 - σ ∉ D_k ⇔ σ ∈ D_k'
 
9 Basic Definitions
- Count and support of string s in dataset D
 - count_D(s) = no. of sequences in D that contain s
 - supp_D(s) = count_D(s) / |D|
- Growth rate of string s from D1 to D2
 - growthRate_{D1→D2}(s) = supp_D2(s) / supp_D1(s)
 - growth rate = 0 if supp_D1(s) = supp_D2(s) = 0
 - growth rate = ∞ if supp_D1(s) = 0 and supp_D2(s) > 0
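
The definitions above translate almost directly into code. The following is a minimal Python sketch, assuming each dataset is simply a list of strings; the function names and the toy datasets `d1` and `d2` are illustrative assumptions and do not come from the talk.

```python
import math

def count(dataset, s):
    """Number of sequences in `dataset` that contain `s` as a substring."""
    return sum(1 for seq in dataset if s in seq)

def supp(dataset, s):
    """Support of `s`: the fraction of sequences in `dataset` containing it."""
    return count(dataset, s) / len(dataset)

def growth_rate(d1, d2, s):
    """Growth rate of `s` from d1 to d2, including the two zero-support cases."""
    s1, s2 = supp(d1, s), supp(d2, s)
    if s1 == 0:
        return math.inf if s2 > 0 else 0.0
    return s2 / s1

# Hypothetical toy datasets (not the example shown on the slides).
d1 = ["abc", "bd", "ac", "cd"]
d2 = ["ab", "bc", "ba", "d"]

print(growth_rate(d1, d2, "b"))   # (3/4) / (2/4)
print(growth_rate(d1, d2, "ba"))  # absent from d1, present in d2
print(growth_rate(d1, d2, "zz"))  # absent from both datasets
```

Counting each sequence at most once (rather than counting occurrences) is what makes support a fraction between 0 and 1.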
 
10 ES and JES
- Emerging Substring (ES)
 - Given ρ_s and ρ_g, a string s is an ES from D_k' to D_k (or s is an ES of C_k) if both of these hold:
 - support condition: supp_{D_k}(s) ≥ ρ_s
 - growth rate condition: growthRate_{D_k'→D_k}(s) ≥ ρ_g
- Jumping Emerging Substring (JES)
 - An ES with ∞ growth rate
 - JES of C_k: supp_{D_k'}(s) = 0 and supp_{D_k}(s) > 0
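
As a reference point for the two conditions, here is a brute-force sketch in Python. It is not the suffix tree method of this talk: it simply enumerates every substring of the target dataset and tests the support and growth rate conditions, assuming both thresholds are inclusive. The names `mine_es`, `rho_s`, `rho_g` and the toy datasets are assumptions made for illustration.

```python
import math

def substrings(seq):
    """All distinct non-empty substrings of `seq`."""
    return {seq[i:j] for i in range(len(seq)) for j in range(i + 1, len(seq) + 1)}

def supp(dataset, s):
    """Fraction of sequences in `dataset` that contain `s`."""
    return sum(1 for seq in dataset if s in seq) / len(dataset)

def mine_es(target, opponent, rho_s, rho_g):
    """Return {substring: growth rate} for every ES of the target class."""
    candidates = set().union(*(substrings(seq) for seq in target))
    result = {}
    for s in candidates:
        supp_target = supp(target, s)
        if supp_target < rho_s:          # support condition fails
            continue
        supp_opp = supp(opponent, s)
        growth = math.inf if supp_opp == 0 else supp_target / supp_opp
        if growth >= rho_g:              # growth rate condition holds
            result[s] = growth
    return result

# Hypothetical toy datasets (not the example shown on the slides).
target = ["ab", "bc", "ba", "d"]
opponent = ["abc", "bd", "ac", "cd"]

print(mine_es(target, opponent, rho_s=0.5, rho_g=1.5))
print(mine_es(target, opponent, rho_s=0.25, rho_g=math.inf))  # JESs only
```

Setting `rho_g` to infinity keeps only the substrings that never occur in the opponent dataset, which is the JES case.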
 
12 ES and JES
With ρ_g = 1.5:
ESs from D2 to D1: a, abc, bcd, abcd
ESs from D1 to D2: b, abd
growthRate_{D1→D2}(b) = (3/4) / (2/4) = 1.5

13 ES and JES
With ρ_g = 1.5:
ESs from D2 to D1: a, abc, bcd, abcd
ESs from D1 to D2: b, abd
(JESs are underlined)

14 The ES Mining Problem
- The ES mining problem
 - Given a database D, the set C of all class labels, a support threshold ρ_s and a growth rate threshold ρ_g, discover the set of all ESs for each class C_j ∈ C
- The single-class ES mining problem
 - A target class C_k is specified, and our goal is to discover the set of all ESs of C_k
 - C_k': opponent class
 
15 Merged Suffix Tree
- Suffix tree
 - Represents all the substrings of a length-n sequence in O(n) space
- Merged suffix tree
 - Represents all the substrings of all sequences in a dataset D_k in O(|D_k| n) space
 - Each node has a support counter for each dataset
 - Each node is associated with a substring and related to one or more substrings
 - Each edge is denoted as an index range [i_start, i_end)
 - E.g. if σ = abcd, then σ[1, 3) = ab
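
To make the edge representation concrete, here is a small Python sketch of one possible node layout. The field names (`seq_id`, `i_start`, `i_end`, `counts`, `children`) are assumptions for illustration; the slides only state that each node carries one counter per dataset and that edges are index ranges. Indices are 1-based and half-open, as in the σ[1, 3) example.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    seq_id: int = -1    # stored sequence that the incoming edge label points into
    i_start: int = 0    # 1-based start of the edge label (inclusive)
    i_end: int = 0      # 1-based end of the edge label (exclusive)
    counts: list = field(default_factory=lambda: [0, 0])  # [c1, c2]
    children: dict = field(default_factory=dict)          # first symbol -> Node

def edge_label(sequences, node):
    """Recover the edge label from its index range instead of storing a copy."""
    seq = sequences[node.seq_id]
    return seq[node.i_start - 1:node.i_end - 1]

sequences = ["abcd"]
ab = Node(seq_id=0, i_start=1, i_end=3)   # edge label "ab"
cd = Node(seq_id=0, i_start=3, i_end=5)   # edge label "cd"
ab.children["c"] = cd

print(edge_label(sequences, ab), edge_label(sequences, cd))
```

Storing two integers per edge, rather than the label itself, is what keeps the space linear in the total sequence length.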
 
17 Merged Suffix Tree
- Example
 - (c1, c2) = (count in C_k, count in C_k')
 - count_{D_k}(a) = 2, count_{D_k'}(a) = 1
[Figure: example merged suffix tree, with a node labeled A]

18 Merged Suffix Tree
- Example
 - Node Y is associated with abcd (concatenation)
 - and related to abc and abcd (all share Y's counters)
 - An implicit node Z is associated with abc
[Figure: example merged suffix tree, with nodes Y and Z labeled]

19 Algorithms
- The baseline algorithm
 - Consists of 3 phases
- Three pruning techniques
 - Support threshold pruning (s-pruning algorithm)
 - Growth rate threshold pruning (g-pruning algorithm)
 - Length threshold pruning (l-pruning algorithm)
 
20 Baseline Algorithm
- 1. Construction Phase (C-Phase)
 - A merged tree MT is built from all the sequences of the target class C_k: each suffix s_j of each sequence is matched against substrings in the tree
 - Update the c1 counter for substrings contained in s_j (but a sequence should not contribute twice to the same counter)
 - Explicitize implicit nodes when necessary
 - When a mismatch occurs, add a new edge and a new leaf to represent the unmatched part of s_j
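
The counting logic of the C-Phase can be sketched as follows. This is a simplification under a stated assumption: it uses an uncompressed trie (one symbol per edge) instead of a compressed suffix tree, so implicit nodes never arise and no explicitization is needed. The names `Node`, `construct` and `c1_of` are illustrative.

```python
class Node:
    """One trie node; the path from the root spells its substring."""
    def __init__(self):
        self.children = {}     # symbol -> Node
        self.c1 = 0            # number of target sequences containing the substring
        self.last_seq = None   # id of the last sequence that incremented c1

def construct(target):
    """Insert every suffix of every target sequence, counting each sequence once."""
    root = Node()
    for seq_id, seq in enumerate(target):
        for j in range(len(seq)):                 # suffix s_j = seq[j:]
            node = root
            for symbol in seq[j:]:
                # On a mismatch, setdefault adds the missing node.
                node = node.children.setdefault(symbol, Node())
                if node.last_seq != seq_id:       # avoid double counting
                    node.last_seq = seq_id
                    node.c1 += 1
    return root

def c1_of(root, s):
    """Look up the c1 counter of substring `s` (0 if `s` is not in the trie)."""
    node = root
    for symbol in s:
        node = node.children.get(symbol)
        if node is None:
            return 0
    return node.c1

mt = construct(["abab", "abc"])
print(c1_of(mt, "ab"), c1_of(mt, "abab"), c1_of(mt, "d"))
```

The `last_seq` marker is one simple way to honor the rule that a sequence should not contribute twice to the same counter: "ab" occurs twice in "abab" but is counted once for it.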

21 Baseline Algorithm
- 1. Construction Phase (C-Phase)
 - Example
[Figure: C-Phase example tree, annotated with: update of c1 counter; explicitization of implicit node; update of edges]

22 Baseline Algorithm
- 1. Construction Phase (C-Phase)
 - Example
[Figure: C-Phase example tree, annotated with: addition of new edge and leaf node]

23 Baseline Algorithm
- 2. Update Phase (U-Phase)
 - MT is updated with all the sequences of the opponent class C_k'
 - Only update the c2 counter for substrings that are already present in the tree; do not introduce any substring that is present only in D_k'
 - Only internal nodes will be added (no new leaf nodes)
 - Resultant tree: the updated MT
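
The difference between the two phases can be shown with one scanning routine and a flag. This is a sketch on an uncompressed trie, which is an assumption: there, the U-Phase adds no nodes at all, whereas the compressed tree of the talk may gain internal nodes. The names `scan`, `slot`, `grow` and `lookup` are illustrative.

```python
class Node:
    def __init__(self):
        self.children = {}      # symbol -> Node
        self.counts = [0, 0]    # [c1, c2]
        self.last_seen = None   # (slot, sequence id) that last touched this node

def scan(root, dataset, slot, grow):
    """slot=0, grow=True acts as the C-Phase; slot=1, grow=False as the U-Phase."""
    for seq_id, seq in enumerate(dataset):
        mark = (slot, seq_id)
        for j in range(len(seq)):
            node = root
            for symbol in seq[j:]:
                nxt = node.children.get(symbol)
                if nxt is None:
                    if not grow:
                        break                     # U-Phase: never add new substrings
                    nxt = node.children[symbol] = Node()
                node = nxt
                if node.last_seen != mark:        # count each sequence once
                    node.last_seen = mark
                    node.counts[slot] += 1

def lookup(root, s):
    """Return [c1, c2] for substring `s`, or None if `s` is not in the tree."""
    node = root
    for symbol in s:
        node = node.children.get(symbol)
        if node is None:
            return None
    return node.counts

# Hypothetical toy datasets (not the example shown on the slides).
target = ["ab", "bc", "ba", "d"]
opponent = ["abc", "bd", "ac", "cd"]

mt = Node()
scan(mt, target, slot=0, grow=True)      # C-Phase
scan(mt, opponent, slot=1, grow=False)   # U-Phase
print(lookup(mt, "b"), lookup(mt, "ba"), lookup(mt, "ac"))
```

Substrings that occur only in the opponent dataset, such as "ac" here, never enter the tree, since they cannot be ESs of the target class.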
 
24 Baseline Algorithm
- 3. eXtraction Phase (X-Phase)
 - All ESs of C_k are extracted by a pre-order tree traversal on MT
 - At each node X, we check the values of its counters against ρ_s and ρ_g to determine whether its related substrings satisfy both the support and growth rate conditions
 - If the related substrings of a node X cannot fulfill the support condition, we can ignore the subtree rooted at X
- Baseline algorithm: C-U-X phases
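
A sketch of the extraction traversal follows. To keep it self-contained, the helper `build_counts_trie` is a stand-in for the C- and U-Phases that fills the counters by direct counting on an uncompressed trie; that helper, the dictionary-based node layout, and the inclusive thresholds are all assumptions for illustration.

```python
import math

def build_counts_trie(target, opponent):
    """Stand-in for the C- and U-Phases: trie of target substrings with c1 and c2."""
    root = {"children": {}}
    for seq in target:
        for j in range(len(seq)):
            node = root
            for k in range(j, len(seq)):
                s = seq[j:k + 1]
                child = node["children"].get(seq[k])
                if child is None:
                    child = {"children": {},
                             "c1": sum(s in q for q in target),
                             "c2": sum(s in q for q in opponent)}
                    node["children"][seq[k]] = child
                node = child
    return root

def extract(root, n_target, n_opp, rho_s, rho_g):
    """Pre-order traversal that skips any subtree failing the support condition."""
    found = []
    def visit(node, prefix):
        for symbol, child in sorted(node["children"].items()):
            s = prefix + symbol
            supp_target = child["c1"] / n_target
            if supp_target < rho_s:
                continue               # extensions of s cannot be more frequent
            supp_opp = child["c2"] / n_opp
            growth = math.inf if supp_opp == 0 else supp_target / supp_opp
            if growth >= rho_g:
                found.append(s)
            visit(child, s)
    visit(root, "")
    return found

# Hypothetical toy datasets (not the example shown on the slides).
target = ["ab", "bc", "ba", "d"]
opponent = ["abc", "bd", "ac", "cd"]
mt = build_counts_trie(target, opponent)
print(extract(mt, len(target), len(opponent), rho_s=0.5, rho_g=1.5))
```

Note that only the support condition justifies skipping a subtree: a node that fails the growth rate condition may still have descendants that pass it, so the traversal continues below it.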
 
25 s-Pruning Algorithm
- Observations
 - The c2 counter of each substring in MT is updated in the U-Phase if the substring is contained in some sequence in D_k'
 - If a substring is infrequent with respect to D_k, it does not qualify as an ES of C_k, and none of its descendant nodes will even be visited in the X-Phase
- Pruning idea
 - Prune infrequent substrings in MT after the C-Phase

26 s-Pruning Algorithm
- ρ_s-Pruning Phase (Ps-Phase)
 - With the use of ρ_s, all substrings that are infrequent in D_k are pruned by a pre-order traversal on MT
 - Resultant tree: MT_s (input to the U-Phase)
- s-pruning algorithm: C-Ps-U-X phases
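
The pruning pass itself is short. The sketch below builds an uncompressed trie for the target class (an assumption; the talk uses a compressed merged suffix tree) and then removes every node whose support in the target dataset falls below the threshold. The names `construct`, `prune_infrequent` and `size` are illustrative.

```python
class Node:
    def __init__(self):
        self.children = {}     # symbol -> Node
        self.c1 = 0            # count in the target dataset
        self.last_seq = None   # last sequence id that incremented c1

def construct(target):
    """C-Phase on an uncompressed trie, counting each sequence once per node."""
    root = Node()
    for seq_id, seq in enumerate(target):
        for j in range(len(seq)):
            node = root
            for symbol in seq[j:]:
                node = node.children.setdefault(symbol, Node())
                if node.last_seq != seq_id:
                    node.last_seq = seq_id
                    node.c1 += 1
    return root

def prune_infrequent(node, n_target, rho_s):
    """Pre-order pass: drop each child (and its subtree) whose support < rho_s."""
    for symbol in list(node.children):
        child = node.children[symbol]
        if child.c1 / n_target < rho_s:
            del node.children[symbol]
        else:
            prune_infrequent(child, n_target, rho_s)

def size(node):
    """Number of nodes below `node`."""
    return sum(1 + size(child) for child in node.children.values())

target = ["ab", "bc", "ba", "d"]   # hypothetical toy dataset
mt = construct(target)
before = size(mt)
prune_infrequent(mt, len(target), rho_s=0.5)
after = size(mt)
print(before, after)
```

Deleting the whole subtree is safe because a longer substring can never be contained in more sequences than its prefix, so no ES is lost by this pass.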
 
27 g-Pruning Algorithm
- Observations
 - As sequences in D_k' are being added to MT, the value of the c2 counter of some nodes becomes larger
 - ⇒ Support of these nodes' related substrings in D_k' is monotonically increasing
 - ⇒ Ratio of the support of these substrings in D_k to that in D_k' is monotonically decreasing
 - At some point, this ratio may become less than ρ_g. When this happens, these substrings have lost their candidature for being ESs of C_k

28 g-Pruning Algorithm
- Pruning idea
 - Prune substrings in MT as soon as they are found to fail the growth rate requirement
- ρ_g-Update Phase (Ug-Phase)
 - When the support count of a substring in D_k' increases, check whether it still satisfies the growth rate condition. If not, prune the substring by path compression or node deletion
 - Supported by the [i_start, i_q, i_end) representation of edges
- g-pruning algorithm: C-Ug-X phases
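
The check performed on each counter increment can be sketched as below. This is a deliberate simplification: on an uncompressed trie, a failing substring is only flagged as `pruned` and its counter is frozen, instead of being physically removed by path compression or node deletion as in the talk. The node stays in place because longer substrings beneath it may still satisfy the growth rate condition. All names are illustrative assumptions.

```python
class Node:
    def __init__(self):
        self.children = {}     # symbol -> Node
        self.c1 = 0            # count in the target dataset
        self.c2 = 0            # count in the opponent dataset
        self.last_seq = None   # last (dataset tag, sequence id) counted here
        self.pruned = False    # True once the growth rate condition has failed

def construct(target):
    """C-Phase on an uncompressed trie."""
    root = Node()
    for seq_id, seq in enumerate(target):
        for j in range(len(seq)):
            node = root
            for symbol in seq[j:]:
                node = node.children.setdefault(symbol, Node())
                if node.last_seq != ("t", seq_id):
                    node.last_seq = ("t", seq_id)
                    node.c1 += 1
    return root

def update_with_g_pruning(root, opponent, n_target, rho_g):
    """U-Phase that re-checks the growth rate whenever a c2 counter increases."""
    n_opp = len(opponent)
    for seq_id, seq in enumerate(opponent):
        for j in range(len(seq)):
            node = root
            for symbol in seq[j:]:
                node = node.children.get(symbol)
                if node is None:
                    break
                if node.pruned or node.last_seq == ("o", seq_id):
                    continue           # already rejected, or already counted
                node.last_seq = ("o", seq_id)
                node.c2 += 1
                # c1 is fixed and c2 only grows, so a failure here is final.
                if (node.c1 / n_target) / (node.c2 / n_opp) < rho_g:
                    node.pruned = True

def survivors(node, prefix=""):
    """Substrings still satisfying the growth rate condition, in pre-order."""
    out = []
    for symbol, child in sorted(node.children.items()):
        if not child.pruned:
            out.append(prefix + symbol)
        out.extend(survivors(child, prefix + symbol))
    return out

# Hypothetical toy datasets (not the example shown on the slides).
target = ["ab", "bc", "ba", "d"]
opponent = ["abc", "bd", "ac", "cd"]
mt = construct(target)
update_with_g_pruning(mt, opponent, len(target), rho_g=1.5)
print(survivors(mt))
```

The survivors still have to pass the support condition in the X-Phase; this pass only applies the growth rate threshold earlier.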
 
29 l-Pruning Algorithm
- Observations
 - Longer substrings often have lower support than shorter ones ⇒ less likely to fulfill the support condition for ESs
 - It is not desirable to append these longer substrings to the tree in the C-Phase and subsequently prune them in the Ps-Phase (for the s-pruning algorithm)
- Pruning idea
 - Limit the length of substrings to be added to MT in the tree construction phase

30 l-Pruning Algorithm
- ρ_l-Construction Phase (Cl-Phase)
 - Only match the first min(|s_j|, ρ_l) symbols of each suffix against the tree (ignore the remainder) ⇒ a smaller MT is built
 - Unlike the previous two pruning approaches, it may result in ES loss
- l-pruning algorithm: Cl-U-X phases
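
The length limit amounts to truncating each suffix before it is matched. A minimal sketch on an uncompressed trie (an assumption, as is the parameter name `rho_l`) shows how the tree shrinks:

```python
class Node:
    def __init__(self):
        self.children = {}     # symbol -> Node
        self.c1 = 0            # count in the target dataset
        self.last_seq = None   # last sequence id that incremented c1

def construct(target, rho_l=None):
    """C-Phase; with rho_l set, only the first rho_l symbols of each suffix are used."""
    root = Node()
    for seq_id, seq in enumerate(target):
        for j in range(len(seq)):
            suffix = seq[j:] if rho_l is None else seq[j:j + rho_l]
            node = root
            for symbol in suffix:
                node = node.children.setdefault(symbol, Node())
                if node.last_seq != seq_id:
                    node.last_seq = seq_id
                    node.c1 += 1
    return root

def size(node):
    """Number of nodes below `node`."""
    return sum(1 + size(child) for child in node.children.values())

full = construct(["abcd"])
limited = construct(["abcd"], rho_l=2)
print(size(full), size(limited))
```

Any ES longer than `rho_l` symbols is never inserted, which is why this technique, unlike the other two, can lose ESs.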
 
31 Summary of Phases
- Baseline: C-U-X
- s-pruning: C-Ps-U-X (earlier use of ρ_s)
- g-pruning: C-Ug-X (earlier use of ρ_g)
- l-pruning: Cl-U-X (addition of ρ_l)

- Combinations of the pruning techniques
 - <s, g>, <l, s>, <l, g>, <l, s, g>
 
32 Performance Evaluation
- Dataset: CI3 (music feature in MIDI tracks)
- Goal: to extract ESs from target class melody (opponent class: non-melody)
- Assumptions: all sequences are pre-stored in memory (appended in a vector, with the starting and ending positions of each sequence recorded)

33 Number of ESs Mined

34 Take a look at the tree size

35 Baseline Algorithm: C-U-X
- Performance is the same for all ρ_s and ρ_g
- Time: about 35 s
 
36 s-Pruning Algorithm: C-Ps-U-X
- Faster than the baseline algorithm by 25-45%
- But reduction in time < reduction in tree size
- Performance improves as ρ_s increases; same for all ρ_g
 
37 g-Pruning Algorithm: C-Ug-X
- When ρ_g = ∞, faster than the baseline algorithm by 2-5%
- When ρ_g = 2 or 5, slower than the baseline algorithm by 1-4%
- Performance improves as ρ_g increases; same for all ρ_s
 
38 sg-Pruning Algorithm: C-Ps-Ug-X
- Faster than the baseline, s-pruning and g-pruning algorithms (all cases)
- Faster than the baseline algorithm by 31-54% (ρ_g = 2 or 5) and 47-81% (ρ_g = ∞)
- Performance improves as ρ_s and ρ_g increase
 
39 Target Class: Melody (ρ_g = 2)
- Performance of algorithms
 - (fastest) sg-pruning > s-pruning > baseline > g-pruning

40 What If Target Class = Non-Melody? (ρ_g = 2)
- Performance of algorithms
 - (fastest) s-pruning > sg-pruning > baseline > g-pruning

41 What If Target Class = Non-Melody?
- sg-pruning performs worse than s-pruning
 - Due to overhead in node creation (g-pruning requires one more index for each edge)
- Not much performance gain with s-pruning (just 3-5%) or sg-pruning (1-3%)
 - Bottleneck: formation of MT (over 93% of the time is spent in the C-Phase)
- In fact, these pruning techniques are very effective, since much time is saved in the U-Phase
 - 42-80% (for s-pruning) and 54-85% (for sg-pruning)
 
42 l-Pruning Algorithm: Loss of ESs
[Figure: loss of ESs for various ρ_l, ρ_s and ρ_g; avg. seq. length = 331, max. seq. length = 1085]
- Except when ρ_s = 0.25, there is loss of non-jumping ESs only when ρ_l < 20 (15 for the case of JESs)

43 l-Pruning Algorithm: Time Saved
[Figure: time saved for various ρ_l, ρ_s and ρ_g; avg. seq. length = 331, max. seq. length = 1085]
- Time saved becomes obvious when ρ_l < 100
- For ρ_s ≥ 0.50, can save over 30% of the time without ES loss

44 To be Explored . . .
- ls-pruning
- lg-pruning
- lsg-pruning
 
45 Conclusions
- ESs of a class are substrings which occur more frequently in that class than in other classes.
- ESs are useful features as they capture distinguishing characteristics of data classes.
- We have proposed a suffix tree-based framework for mining ESs.

46 Conclusions
- Three basic techniques for pruning ES candidates have been described, and most of them have proven effective.
- Future work: to study whether pruning techniques can be efficiently applied to suffix tree merging algorithms or other ES mining models.