An Efficient Index Structure for String Databases

1
An Efficient Index Structure for String Databases
  • Tamer Kahveci
  • Ambuj K. Singh

Department of Computer Science, University of
California, Santa Barbara
http://www.cs.ucsb.edu/tamer
2
Whole/Substring Matching Problem
  • Quickly find substrings in a database that are
    similar to a given query string, using a small
    index structure (1-2% of the database size).

[Figure labels: database string, query string]
3
String Similarity
  • Motivation
  • Applications
  • Genetic sequence databases, NCBI
  • Text databases, spell checkers, web search.
  • Video databases (e.g. VIRAGE, MEDIA360)
  • Database size is too large. Most of the
    techniques available are in-memory.
  • Space requirement of current indexes is too large.

[Figure axes: Base Pairs (millions), Year]
4
Outline
  • Motivation and background
  • Our contribution
  • Frequency vector, frequency distance and wavelet
    transform
  • Multi-resolution index structure
  • k-NN and range queries
  • Experimental results
  • Conclusion

5
Notation
  • q: query string.
  • m, n: lengths of strings.
  • r: range query radius.
  • ε = r/|q|: error rate.

6
String Similarity: an example
  • A C T - - T A G C
  •   R   I I       D
  • A A T G A T A G -

7
Background
  • Edit operations
  • Insert
  • Delete
  • Replace
  • Edit distance (ED) between s1 and s2: the minimum
    number of edit operations to transform s1 into s2.
  • Finding the edit distance is costly:
  • O(mn) time and space with dynamic programming
    [NW70, SW81], where m and n are the lengths of s1
    and s2.
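
As a concrete reference for the dynamic program mentioned above, here is a minimal Python sketch with unit costs for insert, delete, and replace. The function name `edit_distance` is an illustrative choice, not something from the slides. The sketch keeps only two rows of the table, so it runs in O(mn) time but uses O(n) space rather than the full O(mn) table.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Unit-cost edit distance (insert, delete, replace) by dynamic programming."""
    m, n = len(s1), len(s2)
    prev = list(range(n + 1))            # row for the empty prefix of s1
    for i in range(1, m + 1):
        curr = [i] + [0] * n             # column 0: delete the first i chars of s1
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            curr[j] = min(prev[j] + 1,          # delete s1[i-1]
                          curr[j - 1] + 1,      # insert s2[j-1]
                          prev[j - 1] + cost)   # match or replace
        prev = curr
    return prev[n]


# Strings taken from the example slides (q = ACTTAGC, s = AATGATAG).
print(edit_distance("ACTTAGC", "AATGATAG"))  # 4
```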

8
Related Work
  • Lossless search
  • Online
  • [Mye86] (Myers): reduces the space requirement to
    O(rn), where r is the query radius.
  • [WM92] (Wu, Manber): binary masks, O(rn).
  • [BYN99] (Baeza-Yates, Navarro): NFA.
  • Offline (index based)
  • [Mye94] (Myers): condensed r-neighborhood.
  • [BYN97] (Baeza-Yates, Navarro): dictionary.
  • Lossy search
  • [AG90] (Altschul, Gish): BLAST.
  • FASTA, SENSEI, MegaBLAST, WU-BLAST, PHI-BLAST,
    FLASH, QUASAR, REPuter, MUMmer.
  • [GWWV00] (Giladi, Walker, Wang, Volkmuth): SST-Tree.

9
Outline
  • Motivation and background
  • Our contribution
  • Frequency vector, frequency distance and wavelet
    transform
  • Multi-resolution index structure
  • k-NN and range queries
  • Experimental results
  • Conclusion

10
Frequency Vector
  • Let s be a string from the alphabet Σ = {α1, ...,
    ασ}. Let ni be the number of occurrences of the
    character αi in s for 1 ≤ i ≤ σ; then the
  • frequency vector is f(s) = [n1, ..., nσ].
  • Example
  • s = AATGATAG
  • f(s) = [nA, nC, nG, nT] = [4, 0, 2, 2]
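
The definition above can be sketched in a few lines of Python. The name `frequency_vector` and the constant `DNA_ALPHABET` are illustrative assumptions; the alphabet order A, C, G, T is chosen to match the example on this slide.

```python
from collections import Counter

DNA_ALPHABET = "ACGT"  # assumed order, matching [nA, nC, nG, nT] in the example


def frequency_vector(s: str, alphabet: str = DNA_ALPHABET) -> list:
    """Return [n_1, ..., n_sigma]: the count of each alphabet character in s."""
    counts = Counter(s)
    return [counts[ch] for ch in alphabet]


print(frequency_vector("AATGATAG"))  # [4, 0, 2, 2]
```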

11
Effect of Edit Operations on Frequency Vector
  • Delete decreases an entry by 1.
  • Insert increases an entry by 1.
  • Replace = Insert + Delete
  • Example
  • s = AATGATAG => f(s) = [4, 0, 2, 2]
  • (del. G), s = AAT.ATAG => f(s) = [4, 0, 1, 2]
  • (ins. C), s = AACTATAG => f(s) = [4, 1, 1, 2]
  • (A→C), s = ACCTATAG => f(s) = [3, 2, 1, 2]

12
An Approximation to ED: Frequency Distance (FD1)
  • s = AATGATAG => f(s) = [4, 0, 2, 2]
  • q = ACTTAGC => f(q) = [2, 2, 1, 2]
  • pos = (4-2) + (2-1) = 3
  • neg = (2-0) = 2
  • FD1(f(s),f(q)) = 3
  • ED(q,s) = 4
  • FD1(f(s1),f(s2)) = max{pos, neg}.
  • FD1(f(s1),f(s2)) ≤ ED(s1,s2).
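
The pos/neg computation on this slide can be reproduced with a short Python sketch. The function names `frequency_vector` and `fd1` are illustrative; `pos` sums the entries where the first vector exceeds the second, and `neg` sums the entries where it falls short.

```python
from collections import Counter


def frequency_vector(s, alphabet="ACGT"):
    """Count of each alphabet character in s (alphabet order is an assumption)."""
    counts = Counter(s)
    return [counts[ch] for ch in alphabet]


def fd1(u, v):
    """Frequency distance: the larger of total surplus (pos) and total deficit (neg)."""
    pos = sum(a - b for a, b in zip(u, v) if a > b)
    neg = sum(b - a for a, b in zip(u, v) if b > a)
    return max(pos, neg)


fs = frequency_vector("AATGATAG")  # [4, 0, 2, 2]
fq = frequency_vector("ACTTAGC")   # [2, 2, 1, 2]
print(fd1(fs, fq))                 # 3
```

Because computing FD1 only needs the two frequency vectors, it is far cheaper than the O(mn) edit distance it bounds from below.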

13
An Illustration of Frequency Distance and Edit
Distance
[Figure labels: v1, v2]
14
Using Local Information: Wavelet Decomposition of
Strings
  • s = AATGATAC => f(s) = [4, 1, 1, 2]
  • s = AATG + ATAC = s1 + s2
  • f(s1) = [2, 0, 1, 1]
  • f(s2) = [2, 1, 0, 1]
  • ψ1(s) = f(s1) + f(s2) = [4, 1, 1, 2]
  • ψ2(s) = f(s1) - f(s2) = [0, -1, 1, 0]
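
The two coefficients in this example can be checked with a small Python sketch. The function name `first_two_coefficients` is an illustrative assumption, and the sketch assumes an even-length string so that the two halves have equal length, as in the example.

```python
from collections import Counter


def frequency_vector(s, alphabet="ACGT"):
    """Count of each alphabet character in s (alphabet order is an assumption)."""
    counts = Counter(s)
    return [counts[ch] for ch in alphabet]


def first_two_coefficients(s, alphabet="ACGT"):
    """Split s into halves s1, s2; return (f(s1) + f(s2), f(s1) - f(s2))."""
    half = len(s) // 2  # assumes len(s) is even
    f1 = frequency_vector(s[:half], alphabet)
    f2 = frequency_vector(s[half:], alphabet)
    total = [a + b for a, b in zip(f1, f2)]
    diff = [a - b for a, b in zip(f1, f2)]
    return total, diff


print(first_two_coefficients("AATGATAC"))  # ([4, 1, 1, 2], [0, -1, 1, 0])
```

The sum is just the frequency vector of the whole string; the difference records how the characters are distributed between the two halves, which is the local information the plain frequency vector loses.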

15
Wavelet Decomposition of a String: General Idea
  • A_{i,j} = f(s(j·2^i : (j+1)·2^i - 1))
  • B_{i,j} = A_{i-1,2j} - A_{i-1,2j+1}
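
A hedged Python sketch of the general recurrence follows. It assumes 0-based positions and a string whose length is a power of two, and it builds every level bottom-up using the fact that the frequency vector of a block is the sum of the vectors of its two halves. The name `wavelet_levels` and the nested-list layout `A[i][j]`, `B[i][j]` are illustrative choices.

```python
def wavelet_levels(s, alphabet="ACGT"):
    """Return (A, B) where A[i][j] is the frequency vector of the j-th block of
    length 2**i and B[i][j] = A[i-1][2j] - A[i-1][2j+1]."""
    n = len(s)
    if n == 0 or n & (n - 1) != 0:
        raise ValueError("this sketch handles power-of-two lengths only")
    # Level 0: each block is a single character.
    A = [[[1 if ch == c else 0 for c in alphabet] for ch in s]]
    B = [[]]  # no difference coefficients exist at level 0
    while len(A[-1]) > 1:
        below = A[-1]
        pairs = list(zip(below[0::2], below[1::2]))
        A.append([[x + y for x, y in zip(left, right)] for left, right in pairs])
        B.append([[x - y for x, y in zip(left, right)] for left, right in pairs])
    return A, B


A, B = wavelet_levels("AATGATAC")
print(A[3][0], B[3][0])  # [4, 1, 1, 2] [0, -1, 1, 0]
```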

[Figure labels: first wavelet coefficient, second wavelet coefficient, ψ(s)]
16
Wavelet Decomposition and ED
  • Define FD(s1,s2) = max{FD1, FD2}.

17
Outline
  • Motivation and background
  • Our contribution
  • Frequency vector, frequency distance and wavelet
    transform
  • Multi-resolution index structure
  • k-NN and range queries
  • Experimental results
  • Conclusion

18
MRS-Index Structure Creation
[Figure labels: s1, w = 2^a]
19
MRS-Index Structure Creation
[Figure label: s1]
20
MRS-Index Structure Creation
[Figure label: s1]
21
MRS-Index Structure Creation
[Figure labels: s1; slide c times; c = box capacity]
22
MRS-Index Structure Creation
[Figure label: s1]
23
MRS-Index Structure Creation
[Figure labels: s1, T_{a,1}, w = 2^a]
24
Using Different Resolutions
[Figure labels: s1; T_{a,1}, w = 2^a; T_{a+1,1}, w = 2^{a+1}]
25
MRS-Index Structure
26
MRS-index properties
  • Relative MBR volume (Precision) decreases when
  • c increases.
  • w decreases.
  • MBRs are highly clustered.

[Figure axes: Box volume, Box capacity]
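
To make the roles of the window size w and the box capacity c concrete, here is a rough Python sketch of one way boxes could be built: slide a window of length w over a string, map each window position to a point, and enclose every c consecutive points in a minimum bounding rectangle (MBR). This is a reading of the slide labels, not a confirmed description of the authors' construction. In particular, the sketch uses plain frequency vectors as points (a simplification), and the name `build_mbrs` and the (low, high) corner representation are assumptions.

```python
from collections import Counter


def build_mbrs(s, w, c, alphabet="ACGT"):
    """Slide a length-w window over s and box every c consecutive frequency vectors.

    Returns a list of (low, high) corner pairs, one per MBR.
    """
    points = []
    for start in range(len(s) - w + 1):
        counts = Counter(s[start:start + w])
        points.append([counts[ch] for ch in alphabet])
    mbrs = []
    for k in range(0, len(points), c):
        group = points[k:k + c]
        low = [min(col) for col in zip(*group)]    # per-dimension minimum
        high = [max(col) for col in zip(*group)]   # per-dimension maximum
        mbrs.append((low, high))
    return mbrs


for low, high in build_mbrs("AATGATAG", w=4, c=3):
    print(low, high)
```

Sliding the window by one position changes the frequency vector by at most one decrement and one increment, so consecutive points stay close together; that is one plausible reason the boxes remain small even though each one summarizes c windows.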
27
Outline
  • Motivation and background
  • Our contribution
  • Frequency vector, frequency distance and wavelet
    transform
  • Multi-resolution index structure
  • k-NN and range queries
  • Experimental results
  • Conclusion

28
Range Queries [KS01]
[Figure labels: strings s1, s2, ..., sd; window sizes w = 2^4, w = 2^5, w = 2^6, w = 2^7]
29
k-Nearest Neighbor Query [KSF96, SK98]
k = 3
30
k-Nearest Neighbor Query
k = 3
r = edit distance to the 3rd closest substring
31
k-Nearest Neighbor Query
r
k = 3
32
k-Nearest Neighbor Query
k = 3
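
The two-phase idea these slides point to (pick k candidates, take r as the true distance to the k-th of them, then run a range query with radius r) can be sketched as follows. This is a simplified illustration under stated assumptions: it searches a flat list of whole strings rather than substrings indexed by MBRs, it uses FD1 as the only lower bound, and the function names are hypothetical.

```python
import heapq
from collections import Counter


def frequency_vector(s, alphabet="ACGT"):
    counts = Counter(s)
    return [counts[ch] for ch in alphabet]


def fd1(u, v):
    pos = sum(a - b for a, b in zip(u, v) if a > b)
    neg = sum(b - a for a, b in zip(u, v) if b > a)
    return max(pos, neg)


def edit_distance(s1, s2):
    prev = list(range(len(s2) + 1))
    for i, a in enumerate(s1, start=1):
        curr = [i]
        for j, b in enumerate(s2, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = curr
    return prev[-1]


def knn(query, strings, k):
    """Return the k (distance, string) pairs closest to query in edit distance."""
    if not strings or k <= 0:
        return []
    fq = frequency_vector(query)
    lower = [(fd1(frequency_vector(s), fq), s) for s in strings]
    # Phase 1: the k strings with the smallest lower bound give a radius r
    # that is at least the true k-th nearest-neighbor distance.
    candidates = heapq.nsmallest(k, lower)
    r = max(edit_distance(query, s) for _, s in candidates)
    # Phase 2: range query with radius r. Since FD1 <= ED, anything with
    # FD1 > r cannot be among the k nearest and is skipped without computing ED.
    survivors = [s for lb, s in lower if lb <= r]
    return sorted((edit_distance(query, s), s) for s in survivors)[:k]


database = ["AATGATAG", "ACTTAGC", "ACTTAGG", "TTTTTTTT", "ACGTACGT"]
print(knn("ACTTAGC", database, k=3))
```

The result is exact (lossless) because the filter only discards strings whose lower bound already exceeds r; the saving comes from the edit distance computations that are avoided.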
33
Outline
  • Motivation and background
  • Our contribution
  • Experimental results
  • Conclusion

34
Experimental Settings
  • w = {128, 256, 512, 1024}.
  • Human chromosomes from NCBI (www.ncbi.nlm.nih.gov):
  • chr02, chr18, chr21, chr22
  • Plotted results are from the chr18 dataset.
  • Queries are selected randomly from the data set,
    with 512 ≤ |q| ≤ 10000.
  • An NFA-based technique [BYN99] is implemented for
    comparison.

35
Experimental Results 1: Effect of Box Capacity
(10-NN)
36
Experimental Results 2: Effect of Window Size
(10-NN)
37
Experimental Results 3: k-NN queries
38
Experimental Results 4: Range Queries
39
Outline
  • Motivation and background
  • Our contribution
  • Experimental results
  • Discussion and conclusion

40
Discussion
  • In-memory (index size is 1-2% of the database
    size).
  • Lossless search.
  • 3 to 45 times faster than the NFA technique for
    k-NN queries.
  • 2 to 12 times faster than the NFA technique for
    range queries.
  • Can be used to speed up any previously defined
    technique.

41
Future Work
  • Extend to weighted edit distance and affine gaps.
  • Extend to local similarity (substring/substring)
    search.
  • Compare the quality of answers and speed to BLAST
    (lossy search).
  • Use as a preprocessing step to BLAST.
  • Apply the MRS index structure to larger alphabet
    sizes (e.g., protein sequences).

43
Related Work (Similar problems)
  • [BYP92] (Baeza-Yates, Perleberg): only replace is
    allowed.
  • [Gus97] (Gusfield): exact matching, suffix trees.
  • [JKS00] (Jagadish, Koudas, Srivastava): exact
    matching with wild-cards for multidimensional
    strings, elided trees and R-tree.

44
THANK YOU
45
Frequency Distance to an MBR