An Efficient Index Structure for String Databases

Slides: 46

Provided by: tam7152

Learn more at: https://www.cise.ufl.edu

Transcript and Presenter's Notes

Title: An Efficient Index Structure for String Databases

1
An Efficient Index Structure for String Databases

Tamer Kahveci
Ambuj K. Singh

Department of Computer Science, University of
California, Santa Barbara
http://www.cs.ucsb.edu/tamer
2
Whole/Substring Matching Problem

Quickly find substrings in a database that are
similar to a given query string, using a
small index structure (1-2% of the database size).

database string
query string
3
String Similarity

Motivation
Applications
Genetic sequence databases, NCBI
Text databases, spell checkers, web search.
Video databases (e.g. VIRAGE, MEDIA360)
Database size is too large. Most of the
techniques available are in-memory.
Space requirement of current indexes is too large.

[Figure: base pairs (millions) by year]
4
Outline

Motivation and background
Our contribution
Frequency vector, frequency distance, and wavelet
transform
Multi-resolution index structure
k-NN and range queries
Experimental results
Conclusion

5
Notation

q: query string.
m, n: lengths of strings.
r: range query radius.
ε = r/|q|: error rate.

6
String Similarity: an example

A C T - - T A G C
  R   I I       D
A A T G A T A G -

7
Background

Edit operations
Insert
Delete
Replace
Edit distance (ED) between s1 and s2: the minimum
number of edit operations needed to transform s1 into s2.
Finding the edit distance is costly:
O(mn) time and space with dynamic programming
[NW70, SW81], where m and n are the lengths of s1
and s2.
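
The dynamic-programming computation mentioned above can be sketched as follows. This is a minimal illustration of the textbook O(mn) table with unit costs, not the presenters' code; the function name `edit_distance` is chosen here for illustration.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Minimum number of insert/delete/replace operations turning s1 into s2."""
    m, n = len(s1), len(s2)
    # dist[i][j] holds the edit distance between s1[:i] and s2[:j].
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i  # delete all i characters of s1[:i]
    for j in range(n + 1):
        dist[0][j] = j  # insert all j characters of s2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            replace_cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,                 # delete s1[i-1]
                dist[i][j - 1] + 1,                 # insert s2[j-1]
                dist[i - 1][j - 1] + replace_cost,  # replace or match
            )
    return dist[m][n]


print(edit_distance("ACTTAGC", "AATGATAG"))  # 4
```

The table has (m+1)(n+1) cells, which is where the O(mn) time and space on this slide comes from.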

8
Related Work

Lossless search
Online
[Mye86] (Myers): reduces the space requirement to
O(rn), where r is the query radius.
[WM92] (Wu, Manber): binary masks, O(rn).
[BYN99] (Baeza-Yates, Navarro): NFA.
Offline (index based)
[Mye94] (Myers): condensed r-neighborhood.
[BYN97] (Baeza-Yates, Navarro): dictionary.
Lossy search
[AG90] (Altschul, Gish): BLAST.
FASTA, SENSEI, MegaBLAST, WU-BLAST, PHI-BLAST,
FLASH, QUASAR, REPuter, MUMmer.
[GWWV00] (Giladi, Walker, Wang, Volkmuth): SST-Tree.

9
Outline

Motivation and background
Our contribution
Frequency vector, frequency distance, and wavelet
transform
Multi-resolution index structure
k-NN and range queries
Experimental results
Conclusion

10
Frequency Vector

Let s be a string over the alphabet Σ = {α1, ...,
ασ}. Let ni be the number of occurrences of the
character αi in s for 1 ≤ i ≤ σ; then the
frequency vector is f(s) = [n1, ..., nσ].
Example
s = AATGATAG
f(s) = [nA, nC, nG, nT] = [4, 0, 2, 2]
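
A small sketch of the frequency vector for the DNA alphabet. The alphabet order A, C, G, T follows the example on this slide; the function name `frequency_vector` is an illustrative choice.

```python
from typing import List

ALPHABET = "ACGT"  # assumed order, matching [nA, nC, nG, nT] on the slide


def frequency_vector(s: str, alphabet: str = ALPHABET) -> List[int]:
    """Count how often each alphabet character occurs in s."""
    return [s.count(ch) for ch in alphabet]


print(frequency_vector("AATGATAG"))  # [4, 0, 2, 2]
```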

11
Effect of Edit Operations on Frequency Vector

Delete: decreases an entry by 1.
Insert: increases an entry by 1.
Replace = Insert + Delete.
Example
s = AATGATAG => f(s) = [4, 0, 2, 2]
(del. G), s = AAT.ATAG => f(s) = [4, 0, 1, 2]
(ins. C), s = AACTATAG => f(s) = [4, 1, 1, 2]
(A→C), s = ACCTATAG => f(s) = [3, 2, 1, 2]
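
The example above can be replayed in a few lines. This is only an illustrative check; the slicing positions are chosen here to reproduce the three strings on the slide, and the helper name `frequency_vector` is hypothetical.

```python
from typing import List


def frequency_vector(s: str, alphabet: str = "ACGT") -> List[int]:
    """Count how often each alphabet character occurs in s."""
    return [s.count(ch) for ch in alphabet]


s0 = "AATGATAG"
s1 = s0[:3] + s0[4:]        # delete the G at index 3  -> "AATATAG"
s2 = s1[:2] + "C" + s1[2:]  # insert a C at index 2    -> "AACTATAG"
s3 = s2[:1] + "C" + s2[2:]  # replace the A at index 1 -> "ACCTATAG"

for s in (s0, s1, s2, s3):
    print(s, frequency_vector(s))
```

Each delete or insert changes exactly one entry by 1, and the replace changes two entries (one up, one down), which is why it behaves like an insert plus a delete.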

12
An Approximation to ED: Frequency Distance (FD1)

s = AATGATAG => f(s) = [4, 0, 2, 2]
q = ACTTAGC => f(q) = [2, 2, 1, 2]
pos = (4-2) + (2-1) = 3
neg = (2-0) = 2
FD1(f(s), f(q)) = 3
ED(q, s) = 4
FD1(f(s1), f(s2)) = max{pos, neg}.
FD1(f(s1), f(s2)) ≤ ED(s1, s2).
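
A minimal sketch of FD1 as defined on this slide: sum the positive and negative entry differences separately and take the larger sum. The function names are illustrative, and the alphabet order A, C, G, T is assumed from the earlier example.

```python
from typing import List


def frequency_vector(s: str, alphabet: str = "ACGT") -> List[int]:
    """Count how often each alphabet character occurs in s."""
    return [s.count(ch) for ch in alphabet]


def frequency_distance(u: List[int], v: List[int]) -> int:
    """FD1: the larger of the total surplus and the total deficit of u over v."""
    pos = sum(a - b for a, b in zip(u, v) if a > b)
    neg = sum(b - a for a, b in zip(u, v) if b > a)
    return max(pos, neg)


fs = frequency_vector("AATGATAG")  # [4, 0, 2, 2]
fq = frequency_vector("ACTTAGC")   # [2, 2, 1, 2]
print(frequency_distance(fs, fq))  # 3
```

Because FD1 never exceeds the edit distance, it can be used to discard candidates cheaply: if FD1 is already larger than the query radius r, the expensive edit-distance computation can be skipped.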

13
An Illustration of Frequency Distance and Edit
Distance
[Figure: frequency vectors v1 and v2]
14
Using Local Information: Wavelet Decomposition of
Strings

s = AATGATAC => f(s) = [4, 1, 1, 2]
s = AATG + ATAC = s1 + s2
f(s1) = [2, 0, 1, 1]
f(s2) = [2, 1, 0, 1]
ψ1(s) = f(s1) + f(s2) = [4, 1, 1, 2]
ψ2(s) = f(s1) - f(s2) = [0, -1, 1, 0]
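
The two coefficients in this example can be computed with a short sketch: split the string in half, then add and subtract the frequency vectors of the halves. The function name `wavelet_coefficients` and the even-length restriction are assumptions made here for illustration.

```python
from typing import List, Tuple


def frequency_vector(s: str, alphabet: str = "ACGT") -> List[int]:
    """Count how often each alphabet character occurs in s."""
    return [s.count(ch) for ch in alphabet]


def wavelet_coefficients(s: str) -> Tuple[List[int], List[int]]:
    """Return (f(s1) + f(s2), f(s1) - f(s2)) for the two halves s1, s2 of s."""
    if len(s) % 2 != 0:
        raise ValueError("this sketch assumes an even-length string")
    half = len(s) // 2
    f1 = frequency_vector(s[:half])
    f2 = frequency_vector(s[half:])
    first = [a + b for a, b in zip(f1, f2)]
    second = [a - b for a, b in zip(f1, f2)]
    return first, second


print(wavelet_coefficients("AATGATAC"))  # ([4, 1, 1, 2], [0, -1, 1, 0])
```

The first coefficient equals the frequency vector of the whole string; the second records how the characters are distributed between the two halves, which is the local information the slide title refers to.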

15
Wavelet Decomposition of a String: General Idea

A_{i,j} = f(s(j 2^i : (j+1) 2^i - 1))
B_{i,j} = A_{i-1,2j} - A_{i-1,2j+1}

First wavelet coefficient
Second wavelet coefficient
ψ(s)
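
A hedged sketch of the general scheme: level i holds the frequency vectors of consecutive blocks of length 2^i (the A values), and each B value is the difference of the two child blocks one level down. It assumes the string length is a power of two; the function name `decompose` and the list-of-levels layout are illustrative choices, not taken from the slides.

```python
from typing import List, Tuple

Vector = List[int]


def decompose(s: str, alphabet: str = "ACGT") -> Tuple[List[List[Vector]], List[List[Vector]]]:
    """Return (A, B), where A[i][j] and B[i][j] follow the recurrences on this slide."""
    n = len(s)
    if n == 0 or n & (n - 1):
        raise ValueError("this sketch assumes len(s) is a power of two")
    # Level 0: A[0][j] is the frequency vector of the single character s[j].
    A = [[[1 if ch == a else 0 for a in alphabet] for ch in s]]
    B = [[]]  # no differences are defined at level 0
    while len(A[-1]) > 1:
        prev = A[-1]
        pairs = [(prev[2 * j], prev[2 * j + 1]) for j in range(len(prev) // 2)]
        # The frequency vector of a block is the sum of its two halves.
        A.append([[x + y for x, y in zip(left, right)] for left, right in pairs])
        # B is the difference of the same two halves.
        B.append([[x - y for x, y in zip(left, right)] for left, right in pairs])
    return A, B


A, B = decompose("AATGATAC")
print(A[-1][0], B[-1][0])  # [4, 1, 1, 2] [0, -1, 1, 0]
```

The top-level pair (A[-1][0], B[-1][0]) corresponds to the first and second wavelet coefficients of the whole string.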
16
Wavelet Decomposition and ED

Define FD(s1, s2) = max{FD1, FD2}.

17
Outline

Motivation and background
Our contribution
Frequency vector, frequency distance, and wavelet
transform
Multi-resolution index structure
k-NN and range queries
Experimental results
Conclusion

18
MRS-Index Structure Creation
s1
w = 2^a
19
MRS-Index Structure Creation
s1
20
MRS-Index Structure Creation
s1
21
MRS-Index Structure Creation
s1
...
slide c times
c = box capacity
22
MRS-Index Structure Creation
s1
...
23
MRS-Index Structure Creation
s1
T_{a,1}
...
w = 2^a
24
Using Different Resolutions
s1
T_{a,1}
...
w = 2^a
T_{a+1,1}
...
w = 2^(a+1)
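
The creation steps shown on the preceding slides (slide a window of length w = 2^a over a string, group c consecutive windows into one box, repeat for other resolutions) might be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: it indexes only the frequency vector of each window, represents a box as a pair of lower and upper corners, and uses hypothetical names `build_row` and `build_index`.

```python
from typing import Dict, List, Tuple

Box = Tuple[List[int], List[int]]  # (lower corner, upper corner) of an MBR


def build_row(s: str, w: int, c: int, alphabet: str = "ACGT") -> List[Box]:
    """MBRs for one string at window size w, each covering up to c windows."""
    if len(s) < w:
        return []
    index = {ch: k for k, ch in enumerate(alphabet)}
    vec = [0] * len(alphabet)
    for ch in s[:w]:
        vec[index[ch]] += 1
    boxes: List[Box] = []
    low, high, count = vec[:], vec[:], 1
    for start in range(1, len(s) - w + 1):
        # Slide by one position: drop the leftmost character, add the new one.
        vec[index[s[start - 1]]] -= 1
        vec[index[s[start + w - 1]]] += 1
        if count == c:
            # The current box is full, so close it and start a new one.
            boxes.append((low, high))
            low, high, count = vec[:], vec[:], 1
        else:
            low = [min(a, b) for a, b in zip(low, vec)]
            high = [max(a, b) for a, b in zip(high, vec)]
            count += 1
    boxes.append((low, high))
    return boxes


def build_index(strings: List[str], resolutions: List[int], c: int) -> Dict[int, List[List[Box]]]:
    """One row of boxes per database string, for every window size."""
    return {w: [build_row(s, w, c) for s in strings] for w in resolutions}


print(build_row("AATGATAG", w=4, c=2))
```

Updating the vector incrementally keeps the cost per window constant, and storing one box per c windows rather than one vector per window is what keeps the index small.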
25
MRS-Index Structure
26
MRS-index properties

Relative MBR volume (Precision) decreases when
c increases.
w decreases.
MBRs are highly clustered.

[Figure: box volume vs. box capacity]
27
Outline

Motivation and background
Our contribution
Frequency vector, frequency distance, and wavelet
transform
Multi-resolution index structure
k-NN and range queries
Experimental results
Conclusion

28
Range Queries [KS01]
[Figure: database strings s1, s2, ..., sd, with rows labeled w = 2^4, w = 2^5, w = 2^6, w = 2^7]
29
k-Nearest Neighbor Query [KSF96, SK98]
k = 3
30
k-Nearest Neighbor Query
k = 3
r = edit distance to the 3rd closest substring
31
k-Nearest Neighbor Query
r
k = 3
32
k-Nearest Neighbor Query
k = 3
33
Outline

Motivation and background
Our contribution
Experimental results
Conclusion

34
Experimental Settings

w = 128, 256, 512, 1024.
Human chromosomes from NCBI (www.ncbi.nlm.nih.gov):
chr02, chr18, chr21, chr22.
Plotted results are from the chr18 dataset.
Queries are selected randomly from the data set, with
512 ≤ |q| ≤ 10000.
An NFA-based technique [BYN99] is implemented for
comparison.

35
Experimental Results 1: Effect of Box Capacity
(10-NN)
36
Experimental Results 2: Effect of Window Size
(10-NN)
37
Experimental Results 3: k-NN Queries
38
Experimental Results 4: Range Queries
39
Outline

Motivation and background
Our contribution
Experimental results
Discussion and conclusion

40
Discussion

In-memory (index size is 1-2% of the database
size).
Lossless search.
3 to 45 times faster than the NFA technique for k-NN
queries.
2 to 12 times faster than the NFA technique for range
queries.
Can be used to speed up any previously defined
technique.

41
Future Work

Extend to weighted edit distance and affine gaps.
Extend to local similarity (substring/substring)
search.
Compare the quality of answers and speed to BLAST
(lossy search).
Use as a preprocessing step to BLAST.
Apply the MRS-index structure to larger alphabets
(e.g., protein sequences).

42
Related Work

Lossless search
Online
[Mye86] (Myers): reduces the space requirement to
O(rn), where r is the query radius.
[WM92] (Wu, Manber): binary masks, O(rn).
[BYN99] (Baeza-Yates, Navarro): NFA.
Offline (index based)
[Mye94] (Myers): condensed r-neighborhood.
[BYN97] (Baeza-Yates, Navarro): dictionary.
Lossy search
[AG90] (Altschul, Gish): BLAST.
FASTA, SENSEI, MegaBLAST, WU-BLAST, PHI-BLAST,
FLASH, QUASAR, REPuter, MUMmer.
[GWWV00] (Giladi, Walker, Wang, Volkmuth): SST-Tree.

43
Related Work (Similar problems)

[BYP92] (Baeza-Yates, Perleberg): only replace is
allowed.
[Gus97] (Gusfield): exact matching, suffix trees.
[JKS00] (Jagadish, Koudas, Srivastava): exact
matching with wildcards for multidimensional
strings, elided trees and R-tree.

44
THANK YOU
45
Frequency Distance to an MBR
