Title: Project Results
1Project Results
- Alexandra Martinez
- Computational Molecular Biology
- CISE, University of Florida
- Spring 2004
2Outline
- Project Overview
- Implementation
- Results
- Benchmark
- Future Work
3Project Overview
- Map the strings of the database into an integer space.
- Frequency Vector
- Vector of Wavelet Coefficients
- Use a distance function in this integer space, which is a lower bound on the edit distance.
- Frequency Distance
- Wavelet Distance
- Cluster the vectors of consecutive substrings into Minimum Bounding Rectangles (MBRs).
- Obtain an array of MBRs for different resolutions.
5Implementation
- Test database: Plant Genomics Database (Kishore).
- Index built over nucleotide sequences.
- Programming done in Java (platform independent).
- Interaction with the DB through JDBC/ODBC.
- Index stores two vectors for each nucleotide sequence in the DB:
- Frequency Vector
- Wavelet Vector
- String matching based on the maximum of:
- Frequency Distance
- Wavelet Distance
- Cluster WT-vectors of consecutive substrings into Minimum Bounding Rectangles (MBRs) at different resolutions.
6Frequency Distance Calculation
- $FD_1(u, v) = \max\{posDist, negDist\}$
7Wavelet Vector Computation
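- $A_{k,i} = \begin{cases} f(c_i) & k = 0 \\ A_{k-1,2i} + A_{k-1,2i+1} & 0 < k \le \log_2 n \end{cases}$
- $B_{k,i} = \begin{cases} 0 & k = 0 \\ A_{k-1,2i} - A_{k-1,2i+1} & 0 < k \le \log_2 n \end{cases}$
- $0 \le i \le (n/2^k) - 1$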
8Wavelet Distance Calculation
9Maximum Frequency Distance Calculation
- $FD(s_1, s_2) = \max\{FD_1(f(s_1), f(s_2)),\ FD_2(\psi(s_1), \psi(s_2))\}$
- FD1 is the Frequency Distance
- FD2 is the Wavelet Distance
11DB Augmented with Vectors
12Output Samples
- Species name: allium_cepa
- Gene type: ces_a9
- See output file with vectors and MBRs
- See WT vectors for substrings of different sizes: 16, 32, 64
- See some query results for:
- Populus tremuloides (5)
- Arabidopsis thaliana (1)
- Allium cepa (1)
13Performance of Dist Functions
15Comparing Query Results
- Joint tests
- 5 queries from Populus tremuloides (a6-a10)
- 1 query from Arabidopsis thaliana (a1)
- 1 query from Allium cepa (a31)
16Populus tremuloides, ces_a6
- Query from Populus tremuloides, ces_a6
- Error for WT search: 5
- Query results
- common sequences in output set: 2
- common species in output set: 4
- common genus in output set: 4
- See comparative table
17Populus tremuloides, ces_a7
- Query from Populus tremuloides, ces_a7
- Error for WT search: 8
- Query results
- common sequences in output set: 5
- common species in output set: 3
- common genus in output set: 3
- See comparative table
18Populus tremuloides, ces_a8
- Query from Populus tremuloides, ces_a8
- Error for WT search: 10
- Query results
- common sequences in output set: 4
- common species in output set: 8
- common genus in output set: 8
- See comparative table
19Populus tremuloides, ces_a9
- Query from Populus tremuloides, ces_a9
- Error for WT search: 5
- Query results
- common sequences in output set: 1
- common species in output set: 5
- common genus in output set: 7
- See comparative table
20Populus tremuloides, ces_a10
- Query from Populus tremuloides, ces_a10
- Error for WT search: 8
- Query results
- common sequences in output set: 2
- common species in output set: 4
- common genus in output set: 4
- See comparative table
21Allium cepa, ces_a31
- Query from Allium cepa, ces_a31
- Error for WT search: 6
- Query results
- common sequences in output set: 3
- common species in output set: 3
- common genus in output set: 4
- See comparative table
22Arabidopsis thaliana, ces_a1
- Query from Arabidopsis thaliana, ces_a1
- Error for WT search: 6
- Query results
- common sequences in output set: 0
- common species in output set: 0
- common genus in output set: 0
- See comparative table
24Future Work
- Add a query capability to the web interface of the database that makes use of this index.
- Improve performance for substring matching.
- Strategy: use local information.
- Compute vectors of substrings rather than vectors of the entire sequence.
- Group vectors into MBRs.
- Compute distance to MBRs.
- Tackle the superstring matching problem.
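One way the "compute distance to MBRs" step could look is sketched below. It is only an illustration under stated assumptions: the class and method names are hypothetical, the box is given by per-dimension bounds `lo` and `hi`, and the bound is obtained by applying the FD1 rule (max of positive and negative excess) only to the coordinates where the query vector falls outside the box. That value never exceeds FD1 between the query vector and any vector inside the box, so it can be used to discard boxes safely; the slides do not say which box distance the project will use.

```java
class MbrDistanceSketch {
    /**
     * Lower bound on FD1(q, v) for every vector v with lo[d] <= v[d] <= hi[d].
     * Only the part of q that sticks out of the box contributes.
     */
    public static int lowerBoundToMbr(int[] q, int[] lo, int[] hi) {
        if (q.length != lo.length || q.length != hi.length) {
            throw new IllegalArgumentException("Dimension mismatch");
        }
        int pos = 0; // excess of q above the box
        int neg = 0; // deficit of q below the box
        for (int d = 0; d < q.length; d++) {
            if (q[d] > hi[d]) {
                pos += q[d] - hi[d];
            } else if (q[d] < lo[d]) {
                neg += lo[d] - q[d];
            }
        }
        return Math.max(pos, neg);
    }

    public static void main(String[] args) {
        int[] q = {4, 0, 2, 2};
        int[] lo = {1, 0, 1, 1};
        int[] hi = {2, 0, 1, 2};
        System.out.println(lowerBoundToMbr(q, lo, hi)); // 3
    }
}
```

A box whose lower bound already exceeds the query threshold can be skipped without examining the substrings it covers.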
25References
- T. Kahveci, A. K. Singh. Efficient Index Structures for String Databases. VLDB 2001: 351-360.
26THANKS
27APPENDIX
28Frequency Vectors
29Frequency Vector
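- $s$: string from alphabet $\Sigma = \{\alpha_1, \ldots, \alpha_\sigma\}$
- $n_i$: number of occurrences of $\alpha_i$ in $s$ ($1 \le i \le \sigma$)
- Define the frequency vector of $s$ as $f(s) = [n_1, \ldots, n_\sigma]$
- Example
- $s = \text{AATGATAG}$
- $f(s) = [n_A, n_C, n_G, n_T] = [4, 0, 2, 2]$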
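The definition can be illustrated with a minimal Java sketch. The class name, method name, and the fixed alphabet order [A, C, G, T] are assumptions chosen to match the example; this is not the project's actual code.

```java
import java.util.Arrays;

class FrequencyVectorSketch {
    // Assumed alphabet order [A, C, G, T], as in the slide example.
    static final String ALPHABET = "ACGT";

    /** Counts how often each alphabet symbol occurs in s. */
    public static int[] frequencyVector(String s) {
        int[] counts = new int[ALPHABET.length()];
        for (char c : s.toUpperCase().toCharArray()) {
            int idx = ALPHABET.indexOf(c);
            if (idx < 0) {
                throw new IllegalArgumentException("Symbol not in alphabet: " + c);
            }
            counts[idx]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(frequencyVector("AATGATAG"))); // [4, 0, 2, 2]
    }
}
```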
30Frequency Distance (FD1): A Lower Bound on the ED
- Define FD1(u, v) as the minimum number of steps needed to go from u to v (or vice versa) by moving to a neighbor point at each step.
- Two points u and v in $\sigma$-dimensional space are neighbors if one of them can be obtained from the other by a single edit operation.
31Frequency Distance Example
- $s = \text{AATGATAG} \Rightarrow f(s) = [4, 0, 2, 2]$
- $t = \text{ACTTAGC} \Rightarrow f(t) = [2, 2, 1, 2]$
- $pos = (4-2) + (2-1) = 3$
- $neg = (2-0) = 2$
- $FD_1(f(s), f(t)) = 3$
- $ED(s, t) = 4$
- $FD_1(f(s), f(t)) = \max\{pos, neg\}$
- $FD_1(f(s), f(t)) \le ED(s, t)$
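The pos/neg computation in this example can be written as a short Java sketch. The class and method names are hypothetical; the logic follows the max{pos, neg} rule stated on the slide, where pos sums the coordinates in which the first vector is larger and neg sums those in which it is smaller.

```java
class FrequencyDistanceSketch {
    /** FD1(u, v) = max(pos, neg) over two frequency vectors of equal dimension. */
    public static int frequencyDistance(int[] u, int[] v) {
        if (u.length != v.length) {
            throw new IllegalArgumentException("Vectors must have the same dimension");
        }
        int pos = 0; // total amount by which u exceeds v
        int neg = 0; // total amount by which v exceeds u
        for (int i = 0; i < u.length; i++) {
            int diff = u[i] - v[i];
            if (diff > 0) {
                pos += diff;
            } else {
                neg -= diff;
            }
        }
        return Math.max(pos, neg);
    }

    public static void main(String[] args) {
        int[] fs = {4, 0, 2, 2}; // f(AATGATAG)
        int[] ft = {2, 2, 1, 2}; // f(ACTTAGC)
        System.out.println(frequencyDistance(fs, ft)); // 3
    }
}
```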
32Wavelet Vectors
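33Wavelet Transformation: String Decomposition
- [Figure: decomposition of a string into $\psi(s)$ over levels $k$ and positions $i$, with the first and second wavelet coefficients labeled]
- $A_{k,i} = A_{k-1,2i} + A_{k-1,2i+1}$, $0 < k \le \log_2 n$
- $B_{k,i} = A_{k-1,2i} - A_{k-1,2i+1}$, $0 \le i \le (n/2^k) - 1$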
34Wavelet Distance (FD2): A Lower Bound on the ED
- Maximum Frequency Distance: $FD(s_1, s_2) = \max\{FD_1(f(s_1), f(s_2)),\ FD_2(\psi(s_1), \psi(s_2))\}$
35Wavelet Transformation Example
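- $s = \text{TCAC}$, $n = |s| = 4$
- $\psi_0(s) = [v_{0,0}, v_{0,1}, v_{0,2}, v_{0,3}]$
- $= [(A_{0,0}, B_{0,0}), (A_{0,1}, B_{0,1}), (A_{0,2}, B_{0,2}), (A_{0,3}, B_{0,3})]$
- $= [(f(t), 0), (f(c), 0), (f(a), 0), (f(c), 0)]$
- $= [([0,0,0,1], 0), ([0,1,0,0], 0), ([1,0,0,0], 0), ([0,1,0,0], 0)]$
- $\psi_1(s) = [([0,1,0,1], [0,-1,0,1]), ([1,1,0,0], [1,-1,0,0])]$
- $\psi_2(s) = [([1,2,0,1], [-1,0,0,1])]$
- In each pair, the first component is the first wavelet coefficient and the second component is the second wavelet coefficient.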
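The level-by-level computation in this example can be reproduced with a small Java sketch. It assumes the alphabet order [A, C, G, T] and a string length that is a power of two; the class name, method name, and the nested-array layout are illustrative choices, not the project's data structures.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class WaveletTransformSketch {
    // Assumed alphabet order [A, C, G, T].
    static final String ALPHABET = "ACGT";

    /**
     * Returns levels 0..log2(n). For level k and position i,
     * levels.get(k)[i][0] is A_{k,i} and levels.get(k)[i][1] is B_{k,i}.
     */
    public static List<int[][][]> transform(String s) {
        int n = s.length();
        if (n == 0 || (n & (n - 1)) != 0) {
            throw new IllegalArgumentException("Length must be a power of two");
        }
        int sigma = ALPHABET.length();
        int[][][] current = new int[n][2][sigma];
        for (int i = 0; i < n; i++) {
            int idx = ALPHABET.indexOf(Character.toUpperCase(s.charAt(i)));
            if (idx < 0) {
                throw new IllegalArgumentException("Symbol not in alphabet: " + s.charAt(i));
            }
            current[i][0][idx] = 1; // A_{0,i} = f(c_i); B_{0,i} stays all zeros
        }
        List<int[][][]> levels = new ArrayList<>();
        levels.add(current);
        while (current.length > 1) {
            int[][][] next = new int[current.length / 2][2][sigma];
            for (int i = 0; i < next.length; i++) {
                for (int d = 0; d < sigma; d++) {
                    int left = current[2 * i][0][d];
                    int right = current[2 * i + 1][0][d];
                    next[i][0][d] = left + right; // A_{k,i}
                    next[i][1][d] = left - right; // B_{k,i}
                }
            }
            levels.add(next);
            current = next;
        }
        return levels;
    }

    public static void main(String[] args) {
        List<int[][][]> levels = transform("TCAC");
        System.out.println(Arrays.deepToString(levels.get(2))); // [[[1, 2, 0, 1], [-1, 0, 0, 1]]]
    }
}
```

Note that the first coefficient at the top level is the frequency vector of the whole string.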
36MRS Index Structure
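37MRS Index Creation
- [Figure: string s1 with a window of size $w = 2^a$; MBR]
38MRS Index Creation
- [Figure: string s1; the window contents are transformed]
39MRS Index Creation
- [Figure: string s1; MBR]
40MRS Index Creation
- [Figure: string s1; the window slides c times, where c = box capacity; MBR]
41MRS Index Creation
- [Figure: string s1; MBRs containing wavelet coefficients of substrings of s1]
42MRS Index Creation
- [Figure: string s1 with $T_{a,1}$, the tree of MBRs for a resolution of $w = 2^a$ over s1]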
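The sliding-window construction shown on these slides can be approximated by the following Java sketch. It is a simplification under stated assumptions: each window is represented only by its frequency vector (which equals the first wavelet coefficient of the whole window) rather than the full wavelet vector, the alphabet order is [A, C, G, T], and the names `Mbr` and `buildMbrs` are hypothetical. Every c consecutive window vectors are enclosed in one box.

```java
import java.util.ArrayList;
import java.util.List;

class MbrSketch {
    // Assumed alphabet order [A, C, G, T].
    static final String ALPHABET = "ACGT";

    /** Axis-aligned box: lo[d] <= x[d] <= hi[d] for every vector x it covers. */
    static final class Mbr {
        final int[] lo;
        final int[] hi;

        Mbr(int[] first) {
            lo = first.clone();
            hi = first.clone();
        }

        void extend(int[] v) {
            for (int d = 0; d < v.length; d++) {
                lo[d] = Math.min(lo[d], v[d]);
                hi[d] = Math.max(hi[d], v[d]);
            }
        }
    }

    static int index(char c) {
        int idx = ALPHABET.indexOf(Character.toUpperCase(c));
        if (idx < 0) {
            throw new IllegalArgumentException("Symbol not in alphabet: " + c);
        }
        return idx;
    }

    /** Slides a window of size w over s and boxes every c consecutive vectors. */
    public static List<Mbr> buildMbrs(String s, int w, int c) {
        if (w <= 0 || c <= 0 || s.length() < w) {
            throw new IllegalArgumentException("Need 0 < w <= |s| and c > 0");
        }
        int[] freq = new int[ALPHABET.length()];
        for (int i = 0; i < w; i++) {
            freq[index(s.charAt(i))]++;
        }
        List<Mbr> boxes = new ArrayList<>();
        Mbr box = new Mbr(freq);
        int inBox = 1;
        for (int start = 1; start + w <= s.length(); start++) {
            // Slide by one symbol: drop the leftmost, add the new rightmost.
            freq[index(s.charAt(start - 1))]--;
            freq[index(s.charAt(start + w - 1))]++;
            if (inBox == c) {
                boxes.add(box); // box is full, start a new one
                box = new Mbr(freq);
                inBox = 1;
            } else {
                box.extend(freq);
                inBox++;
            }
        }
        boxes.add(box);
        return boxes;
    }

    public static void main(String[] args) {
        List<Mbr> boxes = buildMbrs("AATGATAG", 4, 3);
        System.out.println(boxes.size()); // 2
    }
}
```

Updating the window vector incrementally keeps the cost per slide constant, which is one reason frequency-style vectors suit this kind of index.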
43Using Different Resolutions
- [Figure: string s1 with $T_{a,1}$ for window size $w = 2^a$ and $T_{a+1,1}$ for window size $w = 2^{a+1}$]
44MRS Index Structure
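- [Figure: grid of indexes over the database strings $j$ ($1 \le j \le d$) and the resolution levels $i$ ($a \le i \le b$)]
- $T_{i,j}$: index for the $j$-th string and window size $2^i$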
45Range Queries
46Range Queries
1. Partition the query string into subqueries at the various resolutions available in the index.
2. Perform a partial range query for each subquery on the corresponding row of the index structure, and refine $\varepsilon$.
3. Disk pages corresponding to the last result set are read, and postprocessing is done to eliminate false retrievals.
- [Figure: index rows for window sizes $w = 2^4, 2^5, 2^6, 2^7$ over database strings s1, s2, ..., sd; query q partitioned into subqueries q1, q2, q3]
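Step 1 of the range query can be sketched in Java as follows. The slides do not state the partitioning rule, so this sketch assumes a greedy longest-first split of the query length into the window sizes $2^a, \ldots, 2^b$ held by the index; the class and method names are hypothetical, and any tail shorter than $2^a$ is simply left uncovered.

```java
import java.util.ArrayList;
import java.util.List;

class QueryPartitionSketch {
    /**
     * Splits a query of the given length into subquery lengths drawn from
     * {2^b, ..., 2^a}, largest first (assumed greedy rule).
     */
    public static List<Integer> partition(int queryLength, int a, int b) {
        if (queryLength < 0 || a < 0 || b < a || b > 30) {
            throw new IllegalArgumentException("Need queryLength >= 0 and 0 <= a <= b <= 30");
        }
        List<Integer> lengths = new ArrayList<>();
        int remaining = queryLength;
        for (int i = b; i >= a; i--) {
            int w = 1 << i; // window size at resolution i
            while (remaining >= w) {
                lengths.add(w);
                remaining -= w;
            }
        }
        return lengths;
    }

    public static void main(String[] args) {
        // Resolutions 2^4..2^7, as in the figure.
        System.out.println(partition(208, 4, 7)); // [128, 64, 16]
    }
}
```

With a query of length 208 and resolutions 4 through 7, this yields three subqueries, each of which would be searched in the index row for its own window size.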