Title: An Efficient Index Structure for String Databases
1 An Efficient Index Structure for String Databases
- Tamer Kahveci
- Ambuj K. Singh
Department of Computer Science, University of California, Santa Barbara
http://www.cs.ucsb.edu/tamer
2 Whole/Substring Matching Problem
- Quickly find substrings in a database that are similar to a given query string, using a small index structure (1-2% of the database size).
[Figure: a database string and a query string]
3 String Similarity
- Motivation
- Applications
  - Genetic sequence databases, NCBI
  - Text databases, spell checkers, web search.
  - Video databases (e.g. VIRAGE, MEDIA360)
- Database size is too large. Most of the available techniques are in-memory.
- Space requirement of current indexes is too large.
[Figure: plot of base pairs (millions) by year]
4 Outline
- Motivation & background
- Our contribution
  - Frequency vector, frequency distance & wavelet transform
  - Multi-resolution index structure
  - k-NN & range queries
- Experimental results
- Conclusion
5 Notation
- q: query string.
- m, n: lengths of strings.
- r: range query radius.
- $\epsilon = r/|q|$: error rate.
6 String Similarity: an example
q:   A C T - - T A G C
ops:   R   I I       D
s:   A A T G A T A G -
(R = replace, I = insert, D = delete)
7 Background
- Edit operations
  - Insert
  - Delete
  - Replace
- Edit distance (ED) between s1 and s2: the minimum number of edit operations needed to transform s1 into s2.
- Finding the edit distance is costly.
  - O(mn) time and space with dynamic programming [NW70, SW81], where m and n are the lengths of s1 and s2.
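The O(mn) dynamic program mentioned above can be sketched as follows. This is a minimal textbook formulation with unit costs for insert, delete, and replace; it is not necessarily the exact formulation of [NW70] or [SW81], and the function name `edit_distance` is an illustrative choice.

```python
def edit_distance(s1, s2):
    """Unit-cost edit distance (insert, delete, replace) by dynamic programming."""
    m, n = len(s1), len(s2)
    # dist[i][j] = edit distance between the prefixes s1[:i] and s2[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i  # delete all i characters of s1[:i]
    for j in range(n + 1):
        dist[0][j] = j  # insert all j characters of s2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # delete s1[i-1]
                dist[i][j - 1] + 1,         # insert s2[j-1]
                dist[i - 1][j - 1] + cost,  # match or replace
            )
    return dist[m][n]


print(edit_distance("ACTTAGC", "AATGATAG"))  # 4
```

The table has (m+1)(n+1) cells and each cell takes constant work, which gives the O(mn) time and space bound. Keeping only the previous row would reduce the space to O(n) while leaving the time at O(mn).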
8 Related Work
- Lossless search
  - Online
    - [Mye86] (Myers): reduces the space requirement to O(rn), where r is the query radius.
    - [WM92] (Wu, Manber): binary masks, O(rn).
    - [BYN99] (Baeza-Yates, Navarro): NFA.
  - Offline (index based)
    - [Mye94] (Myers): condensed r-neighborhood.
    - [BYN97] (Baeza-Yates, Navarro): dictionary.
- Lossy search
  - [AG90] (Altschul, Gish): BLAST.
  - FASTA, SENSEI, MegaBLAST, WU-BLAST, PHI-BLAST, FLASH, QUASAR, REPuter, MUMmer.
  - [GWWV00] (Giladi, Walker, Wang, Volkmuth): SST-Tree.
9 Outline
- Motivation & background
- Our contribution
  - Frequency vector, frequency distance & wavelet transform
  - Multi-resolution index structure
  - k-NN & range queries
- Experimental results
- Conclusion
10 Frequency Vector
- Let $s$ be a string over the alphabet $\Sigma = \{\alpha_1, \ldots, \alpha_\sigma\}$. Let $n_i$ be the number of occurrences of the character $\alpha_i$ in $s$, for $1 \le i \le \sigma$. Then the frequency vector is $f(s) = [n_1, \ldots, n_\sigma]$.
- Example
  - s = AATGATAG
  - f(s) = [nA, nC, nG, nT] = [4, 0, 2, 2]
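As a small illustration of the definition, the frequency vector can be computed by counting characters. The function name `frequency_vector` and the default DNA alphabet order A, C, G, T are assumptions chosen to match the example on this slide.

```python
from collections import Counter


def frequency_vector(s, alphabet="ACGT"):
    """Count the occurrences of each alphabet character in s, in alphabet order."""
    counts = Counter(s)
    # Counter returns 0 for characters that never occur (C in the example).
    return [counts[ch] for ch in alphabet]


print(frequency_vector("AATGATAG"))  # [4, 0, 2, 2]
```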
11 Effect of Edit Operations on Frequency Vector
- Delete decreases an entry by 1.
- Insert increases an entry by 1.
- Replace = Insert + Delete.
- Example
  - s = AATGATAG => f(s) = [4, 0, 2, 2]
  - (del. G) s = AAT.ATAG => f(s) = [4, 0, 1, 2]
  - (ins. C) s = AACTATAG => f(s) = [4, 1, 1, 2]
  - (A→C) s = ACCTATAG => f(s) = [3, 2, 1, 2]
12 An Approximation to ED: Frequency Distance (FD1)
- s = AATGATAG => f(s) = [4, 0, 2, 2]
- q = ACTTAGC => f(q) = [2, 2, 1, 2]
- pos = (4-2) + (2-1) = 3
- neg = (2-0) = 2
- FD1(f(s), f(q)) = 3
- ED(q, s) = 4
- $\mathrm{FD1}(f(s_1), f(s_2)) = \max\{\mathit{pos}, \mathit{neg}\}$.
- $\mathrm{FD1}(f(s_1), f(s_2)) \le \mathrm{ED}(s_1, s_2)$.
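The computation on this slide can be sketched in a few lines. Here pos is read as the total surplus of the first vector over the second and neg as the total surplus of the second over the first, which matches the worked numbers above; the names `frequency_vector` and `fd1` are illustrative. The intuition for the lower bound: an insert or delete changes one entry by 1, and a replace changes two entries by 1 in opposite directions, so a single edit operation can lower max{pos, neg} by at most 1.

```python
from collections import Counter


def frequency_vector(s, alphabet="ACGT"):
    """Occurrences of each alphabet character in s, in alphabet order."""
    counts = Counter(s)
    return [counts[ch] for ch in alphabet]


def fd1(u, v):
    """Frequency distance: max of the positive and negative surpluses."""
    pos = sum(a - b for a, b in zip(u, v) if a > b)  # entries where u exceeds v
    neg = sum(b - a for a, b in zip(u, v) if b > a)  # entries where v exceeds u
    return max(pos, neg)


fs = frequency_vector("AATGATAG")  # [4, 0, 2, 2]
fq = frequency_vector("ACTTAGC")   # [2, 2, 1, 2]
print(fd1(fs, fq))  # 3
```

Because FD1 never exceeds ED, a candidate whose FD1 to the query is larger than the radius r can be discarded without computing its edit distance, and no true answer is lost.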
13 An Illustration of Frequency Distance & Edit Distance
[Figure: frequency vectors v1 and v2]
14 Using Local Information: Wavelet Decomposition of Strings
- s = AATGATAC => f(s) = [4, 1, 1, 2]
- s = AATG + ATAC = s1 + s2
- f(s1) = [2, 0, 1, 1]
- f(s2) = [2, 1, 0, 1]
- $\psi_1(s) = f(s_1) + f(s_2) = [4, 1, 1, 2]$
- $\psi_2(s) = f(s_1) - f(s_2) = [0, -1, 1, 0]$
15 Wavelet Decomposition of a String: General Idea
- $A_{i,j} = f\big(s(j 2^i : (j+1) 2^i - 1)\big)$
- $B_{i,j} = A_{i-1,2j} - A_{i-1,2j+1}$
[Figure: $\psi(s)$, labeled with the first wavelet coefficient and the second wavelet coefficient]
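The two recurrences can be sketched bottom-up. This is a minimal sketch that assumes the string length is a power of two and uses 0-based positions; it relies on the fact that the frequency vector of a concatenation is the sum of the frequency vectors of its parts, so each $A_{i,j}$ is obtained by adding two vectors from the level below. The function name `wavelet_levels` is hypothetical.

```python
def wavelet_levels(s, alphabet="ACGT"):
    """Return (A, B) with A[i][j] = A_{i,j} and B[i][j] = B_{i,j}; B[0] is unused."""
    n = len(s)
    if n == 0 or n & (n - 1):
        raise ValueError("this sketch assumes len(s) is a power of two")
    # Level 0: A_{0,j} is the frequency vector of the single character s[j].
    A = [[[1 if ch == a else 0 for a in alphabet] for ch in s]]
    B = [None]
    while len(A[-1]) > 1:
        prev = A[-1]
        pairs = [(prev[2 * j], prev[2 * j + 1]) for j in range(len(prev) // 2)]
        # A_{i,j} covers both halves, so it is the sum of the two child vectors.
        A.append([[x + y for x, y in zip(left, right)] for left, right in pairs])
        # B_{i,j} = A_{i-1,2j} - A_{i-1,2j+1}
        B.append([[x - y for x, y in zip(left, right)] for left, right in pairs])
    return A, B


A, B = wavelet_levels("AATGATAC")
top = len(A) - 1  # top level, i = log2(8) = 3
print(A[top][0], B[top][0])  # [4, 1, 1, 2] [0, -1, 1, 0]
```

For s = AATGATAC the top-level pair reproduces the sum and difference of the frequency vectors of the two halves AATG and ATAC.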
16 Wavelet Decomposition & ED
- Define $\mathrm{FD}(s_1, s_2) = \max\{\mathrm{FD1}, \mathrm{FD2}\}$.
17 Outline
- Motivation & background
- Our contribution
  - Frequency vector, frequency distance & wavelet transform
  - Multi-resolution index structure
  - k-NN and range queries
- Experimental results
- Conclusion
18-23 MRS-Index Structure Creation
[Figure sequence over string s1: a window of size w = 2^a; the window slides c times (c = box capacity); the final frame is labeled T_{a,1}, w = 2^a]
24 Using Different Resolutions
[Figure: string s1 with T_{a,1} (w = 2^a) and T_{a+1,1} (w = 2^{a+1})]
25 MRS-Index Structure
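The creation steps on the preceding slides (slide a window of size w over a string, group every c consecutive window positions into one box, repeat for several window sizes) can be sketched as below. This is a simplified reading under stated assumptions, not the authors' implementation: it stores only frequency vectors rather than both wavelet coefficients, it recomputes every window from scratch instead of updating incrementally, and the names `build_row` and `build_index` are hypothetical.

```python
from collections import Counter


def frequency_vector(s, alphabet="ACGT"):
    """Occurrences of each alphabet character in s, in alphabet order."""
    counts = Counter(s)
    return [counts[ch] for ch in alphabet]


def build_row(s, w, c, alphabet="ACGT"):
    """Group the frequency vectors of c consecutive windows of size w into boxes.

    Returns a list of (start, low, high): the first window position covered by
    the box and the per-dimension minimum and maximum of its vectors (an MBR).
    """
    last = len(s) - w  # last valid window start; negative if s is shorter than w
    boxes = []
    for start in range(0, last + 1, c):
        positions = range(start, min(start + c, last + 1))
        vectors = [frequency_vector(s[p:p + w], alphabet) for p in positions]
        low = [min(col) for col in zip(*vectors)]
        high = [max(col) for col in zip(*vectors)]
        boxes.append((start, low, high))
    return boxes


def build_index(strings, window_sizes, c, alphabet="ACGT"):
    """Assumed layout: one row of boxes per (window size, string) pair."""
    return {w: [build_row(s, w, c, alphabet) for s in strings] for w in window_sizes}


for box in build_row("AATGATAC", w=4, c=2):
    print(box)
```

Sliding the window by one position changes each count by at most 1, so consecutive vectors are close together and the boxes stay small; a larger c stores fewer boxes at the cost of looser bounds.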
26 MRS-index properties
- Relative MBR volume (precision) decreases when
  - c increases.
  - w decreases.
- MBRs are highly clustered.
[Figure: box volume vs. box capacity]
27 Outline
- Motivation & background
- Our contribution
  - Frequency vector, frequency distance & wavelet transform
  - Multi-resolution index structure
  - k-NN & range queries
- Experimental results
- Conclusion
28 Range Queries [KS01]
[Figure: MRS index over database strings s1, s2, ..., sd, with one row per window size w = 2^4, 2^5, 2^6, 2^7]
29-32 k-Nearest Neighbor Query [KSF96, SK98]
[Figure sequence: k = 3; r = edit distance to the 3rd closest substring]
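One way to read these slides is as a two-phase search: pick k candidates, take r as the largest true edit distance among them, then run a range query with radius r. The sketch below follows that reading under loud assumptions: it scans all windows of length |q| by brute force instead of using the index, it uses FD1 on frequency vectors as the only filter, and the name `knn_windows` is hypothetical. It repeats small helper functions so that it runs on its own.

```python
import heapq
from collections import Counter


def frequency_vector(s, alphabet="ACGT"):
    counts = Counter(s)
    return [counts[ch] for ch in alphabet]


def fd1(u, v):
    pos = sum(a - b for a, b in zip(u, v) if a > b)
    neg = sum(b - a for a, b in zip(u, v) if b > a)
    return max(pos, neg)


def edit_distance(s1, s2):
    """Unit-cost edit distance, keeping only one previous row of the table."""
    prev = list(range(len(s2) + 1))
    for i, a in enumerate(s1, 1):
        curr = [i]
        for j, b in enumerate(s2, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = curr
    return prev[-1]


def knn_windows(text, q, k, alphabet="ACGT"):
    """Return up to k (edit distance, position) pairs for windows of length len(q)."""
    w = len(q)
    fq = frequency_vector(q, alphabet)
    bounds = []
    for p in range(len(text) - w + 1):
        sub = text[p:p + w]
        bounds.append((fd1(frequency_vector(sub, alphabet), fq), p, sub))
    if not bounds or k <= 0:
        return []
    # Phase 1: the k windows with the smallest lower bound give a radius r.
    candidates = heapq.nsmallest(k, bounds)
    r = max(edit_distance(sub, q) for _, _, sub in candidates)
    # Phase 2: range query with radius r; windows with FD1 > r cannot qualify.
    verified = [(edit_distance(sub, q), p) for lb, p, sub in bounds if lb <= r]
    return sorted(verified)[:k]


print(knn_windows("AATGATAGACTTAGCAATG", "ACTTAGC", k=3))
```

The result is exact under these assumptions: at least k windows lie within distance r, and every window within distance r has FD1 at most r, so none of the true k nearest windows is filtered out.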
33 Outline
- Motivation & background
- Our contribution
- Experimental results
- Conclusion
34 Experimental Settings
- w = 128, 256, 512, 1024.
- Human chromosomes from NCBI (www.ncbi.nlm.nih.gov)
  - chr02, chr18, chr21, chr22
  - Plotted results are from the chr18 dataset.
- Queries are selected randomly from the data set, with $512 \le |q| \le 10000$.
- An NFA-based technique [BYN99] is implemented for comparison.
35 Experimental Results 1: Effect of Box Capacity (10-NN)
36 Experimental Results 2: Effect of Window Size (10-NN)
37 Experimental Results 3: k-NN queries
38 Experimental Results 4: Range Queries
39 Outline
- Motivation & background
- Our contribution
- Experimental results
- Discussion & conclusion
40 Discussion
- In-memory (index size is 1-2% of the database size).
- Lossless search.
- 3 to 45 times faster than the NFA technique for k-NN queries.
- 2 to 12 times faster than the NFA technique for range queries.
- Can be used to speed up any previously defined technique.
41 Future Work
- Extend to weighted edit distance and affine gaps.
- Extend to local similarity (substring/substring) search.
- Compare the quality of answers and speed to BLAST (lossy search).
- Use as a preprocessing step for BLAST.
- Apply the MRS index structure to larger alphabet sizes (e.g. protein sequences).
42 Related Work
- Lossless search
  - Online
    - [Mye86] (Myers): reduces the space requirement to O(rn), where r is the query radius.
    - [WM92] (Wu, Manber): binary masks, O(rn).
    - [BYN99] (Baeza-Yates, Navarro): NFA.
  - Offline (index based)
    - [Mye94] (Myers): condensed r-neighborhood.
    - [BYN97] (Baeza-Yates, Navarro): dictionary.
- Lossy search
  - [AG90] (Altschul, Gish): BLAST.
  - FASTA, SENSEI, MegaBLAST, WU-BLAST, PHI-BLAST, FLASH, QUASAR, REPuter, MUMmer.
  - [GWWV00] (Giladi, Walker, Wang, Volkmuth): SST-Tree.
43 Related Work (Similar problems)
- [BYP92] (Baeza-Yates, Perleberg): only replace is allowed.
- [Gus97] (Gusfield): exact matching, suffix trees.
- [JKS00] (Jagadish, Koudas, Srivastava): exact matching with wild-cards for multidimensional strings, elided trees and R-tree.
44 THANK YOU
45 Frequency Distance to an MBR