Title: An Efficient Index Structure for String Databases
 1An Efficient Index Structure for String Databases
- Tamer Kahveci 
 - Ambuj K. Singh 
 - Presented By 
 - Atul Ugalmugale/Nikita Rasam
 
  2- Issue 
 - Quickly find the substrings in a large database 
that are similar to a given query string, 
using a small index structure. 
 - In some applications we store, search and analyze 
long sequences of discrete characters, which we 
call strings. 
 - There is a frequent need to find similarities 
between genetic data, web data and event 
sequences. 
  3- Applications 
 - Information Retrieval: A typical application of 
information retrieval is text searching. Given a 
large collection of documents and some text 
keywords, we want to find the documents that 
contain these keywords. 
 - Keywords typed into a web search are often 
misspelled: by "mtallica" we usually mean "metallica". 
  4- Computational Biology: The problem is similar in 
computational biology. Here we have a long DNA 
sequence and we want to find subsequences in it 
that approximately match a query sequence. 
 - ATGCATACGATCGATT 
 - TGCAATGGCTTAGCTA 
 - Animal species from the same family are bound to 
have more similar DNA. 
  5- Video data can be viewed as an event sequence if 
some pre-specified set of events are detected and 
stored as a sequence. Searching similar event 
subsequences can be used to find related video 
segments. 
  6- String search algorithms proposed so far are 
in-memory algorithms. 
 - They scan the whole database for each query. 
 - The size of the string database grows faster than the 
available memory capacity, and extensive memory 
requirements make the search techniques 
impractical. 
 - They suffer from disk I/Os when the database is too 
large. 
 - Performance deteriorates for long query patterns.
 
  7- Similarity Metrics 
 - The difference between two strings s1 and s2 is 
generally defined as the minimum number of edit 
operations needed to transform s1 into s2, called 
the edit distance (ED). 
 - Edit operations: 
 - Insert 
 - Delete 
 - Replace 
 
  8- Suppose we have two strings x and y, 
 - e.g. x = kitten, y = sitting, 
 - and we want to transform x into y. 
 - A closer look: 
 - k i t t e n 
 - s i t t i n g 
 - 1st step: kitten → sitten (Replace) 
 - 2nd step: sitten → sittin (Replace) 
 - 3rd step: sittin → sitting (Insert) 
 - What is the edit distance between survey and 
surgery? 
 - s u r v e y → s u r g e y: replace (1) 
 - s u r g e y → s u r g e r y: insert (1) 
 - Edit distance = 2 
 
  9- In the general version of edit distance, 
different operations may have different costs, or 
the costs may depend on the characters involved. 
 - For example, replacement could be more expensive 
than insertion, or replacing a with o could 
be less expensive than replacing a with k. 
 - This is called the weighted edit distance.
 
  10- Global Alignment 
 - The global alignment (or similarity) of s1 and s2 is 
defined as the maximum-valued alignment of s1 and 
s2. 
 - Given two strings S1 and S2, a global alignment 
of them is obtained by inserting spaces into S1 
and S2, or at their ends, so that they are of the same 
length, and then writing them one against the 
other. 
 - Example: S1 = qacdbd, S2 = qawdb 
 - q a c _ d b d 
 - q a _ w d b _ 
 - Edits and alignments are dual: 
 - A sequence of edits can be converted into a 
global alignment. 
 - An alignment can be converted into a sequence of 
edits. 
  11- Local Alignment 
 - Given two strings X and Y, find two substrings x 
and y of X and Y, respectively, such that their 
alignment score (in the global sense) is maximum 
over all pairs of such substrings (empty 
substrings are allowed). 
 - Scoring of one aligned pair of characters x, y: 
 - $S(x, y) = 2$ if $x = y$ 
 - $S(x, y) = -2$ if $x \neq y$ 
 - $S(x, y) = -1$ if $x = \_$ or $y = \_$ 
 - Example: X = pqraxabcstvq, Y = yxaxbacsll; x = axabcs, y = axbacs 
 - a x a b _ c s 
 - a x _ b a c s 
 - Score: 2 + 2 - 1 + 2 - 1 + 2 + 2 = 8 
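The local alignment score defined above can be computed with the standard Smith-Waterman recurrence. The slides do not name that algorithm, so this is a minimal sketch under that assumption; the function name `local_alignment_score` and its parameter names are illustrative only.

```python
def local_alignment_score(X, Y, match=2, mismatch=-2, gap=-1):
    """Best local alignment score of X and Y (Smith-Waterman recurrence)."""
    prev = [0] * (len(Y) + 1)   # scores for the previous row of the table
    best = 0                    # empty substrings are allowed, so never negative
    for i in range(1, len(X) + 1):
        cur = [0] * (len(Y) + 1)
        for j in range(1, len(Y) + 1):
            s = match if X[i - 1] == Y[j - 1] else mismatch
            cur[j] = max(
                0,                 # start a new local alignment here
                prev[j - 1] + s,   # align X[i] with Y[j]
                prev[j] + gap,     # align X[i] with a space
                cur[j - 1] + gap,  # align Y[j] with a space
            )
            best = max(best, cur[j])
        prev = cur
    return best


print(local_alignment_score("pqraxabcstvq", "yxaxbacsll"))  # 8
```

With the scoring values from the slide (2, -2, -1) the sketch reproduces the score of 8 for the example strings.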
 12String Matching Problem
- Whole Matching: 
 - Find the edit distance ED(q, s) between a data 
string s and a query string q. 
 - Substring Matching: 
 - Consider all substrings s[i..j] of s that are 
close to the query string. 
 - Two Types of Queries: 
 - Range search seeks all the substrings of S that 
are within an edit distance of r to a given query 
q (r = range of the query). 
 - K-nearest neighbor search seeks the K closest 
substrings of S to q. 
  13Challenges in solving the substring matching 
problem
- Finding the edit distance is very costly in 
terms of both time and space. 
 - The strings in the database may be very long. 
 - The database size for most applications grows 
exponentially. 
 - New approach to overcome these challenges: 
 - Define a lower bound distance for substring 
searching. 
 - Improve this lower bound by using the idea of 
wavelet transformation. 
 - Use the MRS index structure, based on the 
aforementioned distance formulations. 
  14A dynamic programming algorithm for computing the 
edit distance
- Problem: find the edit distance between strings x 
and y. 
 - Create a $(|x|+1) \times (|y|+1)$ matrix C, where $C_{i,j}$ 
represents the minimum number of operations to 
match $x_{1..i}$ with $y_{1..j}$. The matrix is constructed 
as follows: 
 - $C_{i,0} = i$ 
 - $C_{0,j} = j$ 
 - $C_{i,j} = \min(C_{i-1,j-1} + cost,\ C_{i,j-1} + 1,\ C_{i-1,j} + 1)$, where the three terms correspond to replace, insert and delete 
 - $cost = 0$ if $x_i = y_j$, else 1
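The recurrence above translates almost line by line into code. The following is a minimal sketch with unit costs; the function name `edit_distance` is illustrative and not taken from the slides.

```python
def edit_distance(x, y):
    """Unit-cost edit distance between strings x and y."""
    m, n = len(x), len(y)
    # C[i][j] = minimum number of operations to match x[:i] with y[:j]
    C = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        C[i][0] = i            # delete the first i characters of x
    for j in range(n + 1):
        C[0][j] = j            # insert the first j characters of y
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            C[i][j] = min(
                C[i - 1][j - 1] + cost,  # replace (or keep on a match)
                C[i][j - 1] + 1,         # insert
                C[i - 1][j] + 1,         # delete
            )
    return C[m][n]


print(edit_distance("kitten", "sitting"))  # 3
print(edit_distance("survey", "surgery"))  # 2
```

Both time and space are proportional to |x| * |y| in this form, which is the cost the deck refers to when it calls the edit distance expensive.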
 
  15How do we perform substring search?
- The same dynamic programming algorithm can be 
used to find the substrings most similar to a 
query string q. 
 - The difference is that we set $C_{0,j} = 0$ for all j, 
since any text position could be the potential 
start of a match. 
 - If the similarity distance bound is k, we report 
all positions j where $C_{m,j} \le k$ (m is the last row, 
$m = |q|$). 
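A minimal sketch of this variant follows. It keeps only one column of the matrix at a time; the function name `approximate_matches` and the choice to report 1-based end positions are assumptions made for illustration.

```python
def approximate_matches(q, text, k):
    """End positions j (1-based) where some substring of text ending at j
    is within edit distance k of the query q."""
    m = len(q)
    prev = list(range(m + 1))        # column j = 0: C[i][0] = i
    hits = []
    for j in range(1, len(text) + 1):
        cur = [0] * (m + 1)          # C[0][j] = 0: a match may start anywhere
        for i in range(1, m + 1):
            cost = 0 if q[i - 1] == text[j - 1] else 1
            cur[i] = min(
                prev[i - 1] + cost,  # replace (or keep on a match)
                prev[i] + 1,         # extra character in the text
                cur[i - 1] + 1,      # query character missing from the text
            )
        if cur[m] <= k:              # last row holds the full query
            hits.append(j)
        prev = cur
    return hits


print(approximate_matches("abc", "xxabcxx", 0))  # [5]
print(approximate_matches("abc", "xxabcxx", 1))  # [4, 5, 6]
```

Every query still scans the whole text, which is the behaviour the index structure is meant to avoid.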
  16Frequency Vector
- Let s be a string from the alphabet $\Sigma = \{\alpha_1, \ldots, \alpha_\sigma\}$. 
Let $n_i$ be the number of occurrences of the 
character $\alpha_i$ in s for $1 \le i \le \sigma$; then 
 - the frequency vector is $f(s) = [n_1, \ldots, n_\sigma]$. 
 - Example: 
 - s = AATGATAG 
 - f(s) = [nA, nC, nG, nT] = [4, 0, 2, 2] 
 - Let $f(s) = [v_1, \ldots, v_\sigma]$ be the frequency 
vector of s; then $\sum_{i=1}^{\sigma} v_i = |s|$. 
 - An edit operation on s has one of the following 
effects on f(s), for $1 \le i, j \le \sigma$ and $i \neq j$: 
 - $v_i := v_i + 1$ 
 - $v_i := v_i - 1$ 
 - $v_i := v_i + 1$ and $v_j := v_j - 1$ 
 
  17Effect of Edit Operations on Frequency Vector
- Delete: decreases an entry by 1. 
 - Insert: increases an entry by 1. 
 - Replace: Insert + Delete. 
 - Example: 
 - s = AATGATAG ⇒ f(s) = [4, 0, 2, 2] 
 - (del. G) s = AAT.ATAG ⇒ f(s) = [4, 0, 1, 2] 
 - (ins. C) s = AACTATAG ⇒ f(s) = [4, 1, 1, 2] 
 - (A → C) s = ACCTATAG ⇒ f(s) = [3, 2, 1, 2]
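The example can be checked mechanically. This is a small sketch that assumes the DNA alphabet in the order A, C, G, T, as in the example; the helper name `frequency_vector` is illustrative.

```python
from collections import Counter


def frequency_vector(s, alphabet="ACGT"):
    """Number of occurrences of each alphabet character in s."""
    counts = Counter(s)
    return [counts[ch] for ch in alphabet]


print(frequency_vector("AATGATAG"))  # [4, 0, 2, 2]
print(frequency_vector("AATATAG"))   # after deleting G:       [4, 0, 1, 2]
print(frequency_vector("AACTATAG"))  # after inserting C:      [4, 1, 1, 2]
print(frequency_vector("ACCTATAG"))  # after replacing A by C: [3, 2, 1, 2]
```

Each edit changes at most two entries, by one each, which is what makes the frequency vector usable for bounding the edit distance.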
 
  18Frequency Distance
- Let u and v be integer points in $\sigma$-dimensional 
space. The frequency distance $FD_1(u, v)$ between 
u and v is defined as the minimum number of steps 
needed to go from u to v (or equivalently from 
v to u) by moving to a neighbor point at each 
step. 
 - Recall the frequency vector $f(s) = [n_1, \ldots, n_\sigma]$. 
 - Let $s_1$ and $s_2$ be two strings from the alphabet 
$\Sigma = \{\alpha_1, \ldots, \alpha_\sigma\}$; then 
 - $FD_1(f(s_1), f(s_2)) \le ED(s_1, s_2)$ 
 
  19An Approximation to ED: Frequency Distance (FD1)
- s = AATGATAG ⇒ f(s) = [4, 0, 2, 2] 
 - q = ACTTAGC ⇒ f(q) = [2, 2, 1, 2] 
 - pos = (4 - 2) + (2 - 1) = 3 
 - neg = (2 - 0) = 2 
 - FD1(f(s), f(q)) = 3 
 - ED(q, s) = 4 
 - FD1(f(s1), f(s2)) = max{pos, neg}. 
 - FD1(f(s1), f(s2)) ≤ ED(s1, s2).
 
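  20Frequency Distance Calculation
/* u and v are σ-dimensional integer points */
Algorithm FD1(u, v)
 - posDistance = negDistance = 0 
 - For i = 1 to σ: 
 -   if u_i > v_i then posDistance += u_i - v_i 
 -   else negDistance += v_i - u_i 
 - FD1(u, v) = max{posDistance, negDistance} 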
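A direct sketch of this calculation, following the pos/neg definition used in the worked example on the previous slide. The function name `fd1` is illustrative.

```python
def fd1(u, v):
    """Frequency distance between two integer points of equal dimension."""
    pos = sum(a - b for a, b in zip(u, v) if a > b)  # entries where u exceeds v
    neg = sum(b - a for a, b in zip(u, v) if a < b)  # entries where v exceeds u
    return max(pos, neg)


# f(s) for s = AATGATAG and f(q) for q = ACTTAGC, taken from the example
print(fd1([4, 0, 2, 2], [2, 2, 1, 2]))  # 3
```

The calculation is linear in the alphabet size, so it is far cheaper than the quadratic edit distance it bounds from below.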
 
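  21Wavelet Vector Computation
- Let $s = c_0 c_1 \cdots c_{n-1}$ be a string from the alphabet 
$\Sigma = \{\alpha_1, \ldots, \alpha_\sigma\}$. The k-th level wavelet 
transformation $\psi_k(s)$, $0 \le k \le \log_2 n$, of s is defined as 
$\psi_k(s) = [v_{k,0}, \ldots, v_{k,n/2^k - 1}]$, where $v_{k,i} = [A_{k,i}, B_{k,i}]$ and 
 - $A_{k,i} = f(c_i)$ if $k = 0$ 
 - $A_{k,i} = A_{k-1,2i} + A_{k-1,2i+1}$ if $0 < k \le \log_2 n$ 
 - $B_{k,i} = 0$ if $k = 0$ 
 - $B_{k,i} = A_{k-1,2i} - A_{k-1,2i+1}$ if $0 < k \le \log_2 n$ 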
 22Using Local Information: Wavelet Decomposition of 
Strings
- s = AATGATAC ⇒ f(s) = [4, 1, 1, 2] 
 - s = AATG + ATAC = s1 + s2 
 - f(s1) = [2, 0, 1, 1] 
 - f(s2) = [2, 1, 0, 1] 
 - ψ1(s) = f(s1) + f(s2) = [4, 1, 1, 2] 
 - ψ2(s) = f(s1) - f(s2) = [0, -1, 1, 0]
 
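  23Wavelet Decomposition of a String: General Idea
- $A_{i,j} = f(s(j 2^i : (j+1) 2^i - 1))$ (first wavelet coefficient) 
 - $B_{i,j} = A_{i-1,2j} - A_{i-1,2j+1}$ (second wavelet coefficient) 
[Figure: ψ(s), built from the first and second wavelet coefficients.]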
 24Wavelet Transformation Example
- s = TCAC, n = |s| = 4 
 - ψ0(s) = [v0,0, v0,1, v0,2, v0,3] 
 -       = [(A0,0, B0,0), (A0,1, B0,1), (A0,2, B0,2), (A0,3, B0,3)] 
 -       = [(f(t), 0), (f(c), 0), (f(a), 0), (f(c), 0)] 
 -       = [([0,0,1], 0), ([0,1,0], 0), ([1,0,0], 0), ([0,1,0], 0)] 
 - ψ1(s) = [([0,1,1], [0,-1,1]), ([1,1,0], [1,-1,0])] 
 - ψ2(s) = [([1,2,1], [-1,0,1])] 
 - In each pair, the first entry is the first wavelet coefficient 
and the second entry is the second wavelet coefficient. 
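The example can be reproduced with a short sketch of the level-by-level computation. It assumes the string length is a power of two and uses the three-letter alphabet (a, c, t) implied by the example vectors; the function name `wavelet_levels` is illustrative.

```python
def wavelet_levels(s, alphabet):
    """Return [psi_0, psi_1, ...]; each level is a list of (A, B) pairs."""
    n = len(s)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    zero = [0] * len(alphabet)
    # Level 0: A is the frequency vector of a single character, B is zero.
    level = [([int(ch == a) for a in alphabet], zero[:]) for ch in s]
    levels = [level]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), 2):
            left, right = level[i][0], level[i + 1][0]
            A = [l + r for l, r in zip(left, right)]  # sum of the two halves
            B = [l - r for l, r in zip(left, right)]  # difference of the halves
            nxt.append((A, B))
        level = nxt
        levels.append(level)
    return levels


levels = wavelet_levels("TCAC", "ACT")
print(levels[1])  # [([0, 1, 1], [0, -1, 1]), ([1, 1, 0], [1, -1, 0])]
print(levels[2])  # [([1, 2, 1], [-1, 0, 1])]
```

At the top level, A is the frequency vector of the whole string and B records how the characters are split between its two halves.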
 25Wavelet Distance Calculation 
 26Maximum Frequency Distance Calculation
- FD(s1, s2) = max{FD1(f(s1), f(s2)), FD2(ψ(s1), ψ(s2))} 
 - FD1 is the Frequency Distance. 
 - FD2 is the Wavelet Distance. 
 
  27MRS-Index Structure Creation
[Figure, slides 27-32: a window of size w = 2^a is slid over the string s1. The window is slid c times, where c is the box capacity, giving the entry T(a,1) in the index row for w = 2^a; the process continues along s1.]
 33Using Different Resolutions
[Figure: the same construction at several resolutions, with row T(a,1), ... for w = 2^a and row T(a+1,1), ... for w = 2^(a+1).]
 34MRS-Index Structure 
 35MRS-index properties
- Relative MBR volume (precision) decreases when 
 - c increases. 
 - w decreases. 
 - MBRs are highly clustered.
[Plot: box volume versus box capacity.]
 36Frequency Distance to an MBR
- Let q be the query string of length $2^i$, where $a \le i \le \ldots$ 
 - Given an MBR B, we define $FD(q, B) = \min_{s \in B} FD(q, s)$. 
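As a rough illustration of the idea, the sketch below computes the FD1 distance from a query frequency vector to an axis-aligned box given by its lower and upper corners. This is an assumption-laden simplification: it covers only FD1 (not the wavelet distance), it treats every integer point in the box as a candidate, and the names `fd1_to_box`, `low` and `high` are hypothetical. The result is therefore a lower bound on FD1(q, s) for any s whose frequency vector lies in the box, not necessarily the exact quantity used by the authors.

```python
def fd1(u, v):
    """Frequency distance between two integer points."""
    pos = sum(a - b for a, b in zip(u, v) if a > b)
    neg = sum(b - a for a, b in zip(u, v) if a < b)
    return max(pos, neg)


def fd1_to_box(q, low, high):
    """FD1 from point q to the nearest point of the box [low, high]."""
    # Only dimensions where q falls outside the box contribute.
    pos = sum(qi - hi for qi, hi in zip(q, high) if qi > hi)
    neg = sum(lo - qi for qi, lo in zip(q, low) if qi < lo)
    return max(pos, neg)


low, high = [3, 0, 1, 2], [4, 1, 2, 2]       # hypothetical box
q = [2, 2, 1, 2]                             # f(q) for q = ACTTAGC
print(fd1_to_box(q, low, high))              # 1
print(fd1(q, [4, 0, 2, 2]))                  # 3, a point inside the box
```

Because the box distance never exceeds the distance to any point inside the box, a box can be skipped safely whenever its distance already exceeds the search radius.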
 37Range Search Algorithm 
 38Range Queries
1. Partition the query string into subqueries at 
the various resolutions available in the index.
2. Perform a partial range query for each 
subquery on the corresponding row of the index 
structure, and refine ε.
3. Read the disk pages corresponding to the last 
result set and postprocess them to eliminate 
false retrievals.
[Figure: index rows for window sizes w = 2^4, 2^5, 2^6 and 2^7 over the data strings s1, s2, ..., sd; the query q is partitioned into subqueries q1, q2 and q3.]
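Step 1 can be sketched as a greedy split of the query into pieces whose lengths match the window sizes in the index. The slides do not spell out the partitioning rule, so the longest-first order, the function name `partition_query` and the handling of a leftover tail are all assumptions made for illustration.

```python
def partition_query(q, window_sizes):
    """Split q, longest window first, into pieces of available window sizes.

    Returns (pieces, rest), where rest is a tail shorter than the
    smallest window and therefore not covered by any index row.
    """
    pieces, start = [], 0
    for w in sorted(window_sizes, reverse=True):
        # Take as many pieces of this size as still fit.
        while len(q) - start >= w:
            pieces.append(q[start:start + w])
            start += w
    return pieces, q[start:]


# Window sizes 2^4 .. 2^7, as in the figure for this slide
pieces, rest = partition_query("A" * 208, [16, 32, 64, 128])
print([len(p) for p in pieces], len(rest))  # [128, 64, 16] 0
```

Each piece would then be searched on the index row for its own window size, with the remaining error tolerance passed on from one subquery to the next.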
 39K-Nearest Neighbor Algorithm 
 40k-Nearest Neighbor Query [KSF96, SK98]
[Figure, slides 40-43: a k-nearest neighbor query with k = 3; r = edit distance to the 3rd closest substring.]
 44Experimental Settings
- w = {128, 256, 512, 1024}. 
 - Human chromosomes from www.ncbi.nlm.nih.gov: 
 - chr02, chr18, chr21, chr22 
 - Plotted results are from the chr18 dataset. 
 - Queries are selected randomly from the data set, with 
512 ≤ |q| ≤ 10000. 
 - An NFA-based technique [BYN99] is implemented for 
comparison. 
  45Experimental Results 1: Effect of Box Capacity 
(10-NN)
- The cost of the MRS-index increases as the box 
capacity increases. 
 - The cost of the MRS-index is much lower than that of the 
NFA technique for all these box capacities. 
 - Although using two wavelet coefficients slightly 
improves the performance for the same box 
capacity, the size of the index structure is 
doubled. For the same amount of memory, the single 
coefficient version performs better. 
  46Experimental Results 2: Effect of Window Size 
(10-NN)
- The MRS-index structure outperforms the NFA 
technique for all the window sizes. 
 - The performance of the MRS-index structure itself 
improves as the window size increases. 
  47Experimental Results 3: k-NN queries
- Although the performance of the MRS-index structure drops 
for large values of k, it still performs better 
than the NFA technique. 
 - Achieved speedups of up to 45 for 10 nearest 
neighbors. The speedup for 200 nearest neighbors 
is 3. 
 - As the number of nearest neighbors increases, the 
performance of the MRS-index structure approaches 
that of the NFA technique. 
  48Experimental Results 4: Range Queries
- The MRS-index structure performed up to 12 times 
faster than the NFA technique. The performance of 
the MRS-index structure improved when the queries 
were selected from different data strings. This is 
because the DNA strings have a high self-similarity. 
 - The performance of the MRS-index structure 
deteriorates as the error rate increases. This is 
because the size of the candidate set increases 
as the error rate increases. 
  49Discussion
- In-memory (index size is 1-2% of the database 
size). 
 - Lossless search. 
 - 3 to 45 times faster than the NFA technique for k-NN 
queries. 
 - 2 to 12 times faster than the NFA technique for range 
queries. 
 - Can be used to speed up any previously defined 
technique. 
  50THANK YOU