Project Results - PowerPoint PPT Presentation

1
Project Results
  • Alexandra Martinez
  • Computational Molecular Biology
  • CISE, University of Florida
  • Spring 2004

2
Outline
  • Project Overview
  • Implementation
  • Results
  • Benchmark
  • Future Work

3
Project Overview
  • Map the strings of the DB into an integer space:
  • Frequency Vector
  • Vector of Wavelet Coefficients
  • Use a distance function in this integer space
    that is a lower bound on the edit distance:
  • Frequency Distance
  • Wavelet Distance
  • Cluster the vectors of consecutive substrings
    into Minimum Bounding Rectangles (MBRs).
  • Obtain an array of MBRs for different resolutions.

4
Outline
  • Project Overview
  • Implementation
  • Results
  • Benchmark
  • Future Work

5
Implementation
  • Test database: Plant Genomics Database (Kishore).
  • Index built over nucleotide sequences.
  • Programming done in Java (platform independent).
  • Interaction with the DB through JDBC/ODBC.
  • Index stores two vectors for each nucleotide
    sequence in the DB:
  • Frequency Vector
  • Wavelet Vector
  • String matching based on the maximum of:
  • Frequency Distance
  • Wavelet Distance
  • Cluster WT-vectors of consecutive substrings into
    Minimum Bounding Rectangles (MBRs) at different
    resolutions.

6
Frequency Distance Calculation
  • FD1(u, v) = max { posDist, negDist }

7
Wavelet Vector Computation
  • A_{k,i} = f(c_i), if k = 0
  • A_{k,i} = A_{k-1,2i} + A_{k-1,2i+1}, if 0 < k ≤ log2 n
  • B_{k,i} = 0, if k = 0
  • B_{k,i} = A_{k-1,2i} - A_{k-1,2i+1}, if 0 < k ≤ log2 n
  • where 0 ≤ i ≤ (n/2^k) - 1

8
Wavelet Distance Calculation
9
Maximum Frequency Distance Calculation
  • FD(s1, s2) = max { FD1(f(s1), f(s2)), FD2(ψ(s1), ψ(s2)) }
  • FD1 is the Frequency Distance
  • FD2 is the Wavelet Distance

10
Outline
  • Problem Definition
  • Implementation
  • Results
  • Benchmark
  • Future Work

11
DB Augmented with Vectors
12
Output Samples
  • Species name: allium_cepa
  • Gene type: ces_a9
  • See output file with vectors and MBRs
  • See WT vectors for substrings of different sizes:
    16, 32, 64
  • See some query results for:
  • Populus tremuloides (5)
  • Arabidopsis thaliana (1)
  • Allium cepa (1)

13
Performance of Dist Functions
14
Outline
  • Problem Definition
  • Implementation
  • Results
  • Benchmark
  • Future Work

15
Comparing Query Results
  • Joint tests
  • 5 queries from Populus tremuloides (a6-a10)
  • 1 query from Arabidopsis thaliana (a1)
  • 1 query from Allium cepa (a31)

16
Populus tremuloides, ces_a6
  • Query from Populus tremuloides, ces_a6
  • Error for WT search: 5
  • Query results:
  • Common sequences in output set: 2
  • Common species in output set: 4
  • Common genus in output set: 4
  • See comparative table

17
Populus tremuloides, ces_a7
  • Query from Populus tremuloides, ces_a7
  • Error for WT search: 8
  • Query results:
  • Common sequences in output set: 5
  • Common species in output set: 3
  • Common genus in output set: 3
  • See comparative table

18
Populus tremuloides, ces_a8
  • Query from Populus tremuloides, ces_a8
  • Error for WT search: 10
  • Query results:
  • Common sequences in output set: 4
  • Common species in output set: 8
  • Common genus in output set: 8
  • See comparative table

19
Populus tremuloides, ces_a9
  • Query from Populus tremuloides, ces_a9
  • Error for WT search: 5
  • Query results:
  • Common sequences in output set: 1
  • Common species in output set: 5
  • Common genus in output set: 7
  • See comparative table

20
Populus tremuloides, ces_a10
  • Query from Populus tremuloides, ces_a10
  • Error for WT search: 8
  • Query results:
  • Common sequences in output set: 2
  • Common species in output set: 4
  • Common genus in output set: 4
  • See comparative table

21
Allium cepa, ces_a31
  • Query from Allium cepa, ces_a31
  • Error for WT search: 6
  • Query results:
  • Common sequences in output set: 3
  • Common species in output set: 3
  • Common genus in output set: 4
  • See comparative table

22
Arabidopsis thaliana, ces_a1
  • Query from Arabidopsis thaliana, ces_a1
  • Error for WT search: 6
  • Query results:
  • Common sequences in output set: 0
  • Common species in output set: 0
  • Common genus in output set: 0
  • See comparative table

23
Outline
  • Project Overview
  • Implementation
  • Results
  • Benchmark
  • Future Work

24
Future Work
  • Add a query capability to the web interface of
    the database that makes use of this index.
  • Improve performance for substring matching.
  • Strategy: use local information.
  • Compute vectors of substrings rather than vectors
    of the entire sequence.
  • Group vectors into MBRs.
  • Compute distance to MBRs.
  • Tackle the superstring matching problem.
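
One way the "Compute distance to MBRs" step could look for the frequency distance is sketched below. This is an assumption, not a formula taken from the slides: the class name `MbrDistanceSketch`, the method name, and the representation of an MBR as a pair of corner vectors `low` and `high` are all hypothetical. The idea is that clamping the query vector into the box minimises both the positive and the negative part of FD1 at once, so no vector inside the box can be closer to the query than the returned value.

```java
// Illustrative sketch; names and the MBR representation are hypothetical.
final class MbrDistanceSketch {
    /**
     * Lower bound on FD1 between the query vector q and any vector that
     * lies inside the box [low, high] (bounds inclusive, per dimension).
     */
    static int fd1ToMbr(int[] q, int[] low, int[] high) {
        if (q.length != low.length || q.length != high.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        int pos = 0; // total amount by which q exceeds the box
        int neg = 0; // total amount by which q falls short of the box
        for (int i = 0; i < q.length; i++) {
            if (q[i] > high[i]) {
                pos += q[i] - high[i];
            } else if (q[i] < low[i]) {
                neg += low[i] - q[i];
            }
        }
        return Math.max(pos, neg);
    }

    public static void main(String[] args) {
        int[] low = {1, 0, 1, 1};
        int[] high = {2, 0, 1, 2};
        System.out.println(fd1ToMbr(new int[] {0, 2, 1, 1}, low, high));
    }
}
```

A box whose bound already exceeds the allowed error can be skipped without looking at the substrings it contains.
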

25
References
  • T. Kahveci, A. K. Singh. Efficient Index
    Structures for String Databases. VLDB 2001,
    pp. 351-360.

26
THANKS
27
APPENDIX
28
Frequency Vectors
29
Frequency Vector
  • s: string from alphabet Σ = {α_1, ..., α_σ}
  • n_i: number of occurrences of α_i in s (1 ≤ i ≤ σ)
  • Define the frequency vector of s as f(s) = [n_1,
    ..., n_σ]
  • Example:
  • s = AATGATAG
  • f(s) = [n_A, n_C, n_G, n_T] = [4, 0, 2, 2]
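
The definition above can be sketched in Java, the language the slides name for the implementation. The class name `FrequencyVectorSketch`, the method name, and the fixed alphabet order A, C, G, T are assumptions made for illustration; the project's actual code is not shown in the slides.

```java
// Illustrative sketch; names are hypothetical, not taken from the project.
final class FrequencyVectorSketch {
    // Assumed alphabet order, matching the slide's example: A, C, G, T.
    static final String ALPHABET = "ACGT";

    /** Counts how often each alphabet symbol occurs in s. */
    static int[] frequencyVector(String s) {
        int[] counts = new int[ALPHABET.length()];
        for (int i = 0; i < s.length(); i++) {
            int idx = ALPHABET.indexOf(s.charAt(i));
            if (idx < 0) {
                throw new IllegalArgumentException("symbol outside alphabet: " + s.charAt(i));
            }
            counts[idx]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Prints [4, 0, 2, 2] for the slide's example string.
        System.out.println(java.util.Arrays.toString(frequencyVector("AATGATAG")));
    }
}
```
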

30
Frequency Distance (FD1): A Lower Bound on the ED
  • Define FD1(u, v) as the minimum number of steps
    needed to go from u to v (or vice versa) by
    moving to a neighbor point at each step.
  • Two points u and v in σ-dimensional space are
    neighbors if one of them can be obtained from the
    other by a single edit operation.

31
Frequency Distance Example
  • s = AATGATAG => f(s) = [4, 0, 2, 2]
  • t = ACTTAGC => f(t) = [2, 2, 1, 2]
  • pos = (4-2) + (2-1) = 3
  • neg = (2-0) = 2
  • FD1(f(s), f(t)) = 3
  • ED(s, t) = 4
  • FD1(f(s), f(t)) = max { pos, neg }
  • FD1(f(s), f(t)) ≤ ED(s, t)
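
The example can be checked with a short Java sketch. It assumes the frequency vectors are already available as `int` arrays in the order A, C, G, T; the class and method names are hypothetical. A textbook dynamic-programming edit distance is included only to compare against the lower bound, and is not claimed to be the project's code.

```java
// Illustrative sketch; names are hypothetical, not taken from the project.
final class FrequencyDistanceSketch {
    /** FD1(u, v) = max { pos, neg } over the per-symbol count differences. */
    static int fd1(int[] u, int[] v) {
        if (u.length != v.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        int pos = 0; // sum of surpluses of u over v
        int neg = 0; // sum of surpluses of v over u
        for (int i = 0; i < u.length; i++) {
            int d = u[i] - v[i];
            if (d > 0) {
                pos += d;
            } else {
                neg -= d;
            }
        }
        return Math.max(pos, neg);
    }

    /** Standard edit distance (insert, delete, substitute), two-row DP. */
    static int editDistance(String s, String t) {
        int[] prev = new int[t.length() + 1];
        for (int j = 0; j <= t.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= s.length(); i++) {
            int[] cur = new int[t.length() + 1];
            cur[0] = i;
            for (int j = 1; j <= t.length(); j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(prev[j] + 1, cur[j - 1] + 1), prev[j - 1] + cost);
            }
            prev = cur;
        }
        return prev[t.length()];
    }

    public static void main(String[] args) {
        int[] fs = {4, 0, 2, 2}; // f(AATGATAG)
        int[] ft = {2, 2, 1, 2}; // f(ACTTAGC)
        System.out.println(fd1(fs, ft));                          // 3
        System.out.println(editDistance("AATGATAG", "ACTTAGC")); // 4
    }
}
```
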

32
Wavelet Vectors
33
Wavelet Transformation: String Decomposition
[Figure: decomposition of ψ(s) into first and second wavelet coefficients, by level k and position i]
  • A_{k,i} = A_{k-1,2i} + A_{k-1,2i+1}, 0 < k ≤ log2 n
  • B_{k,i} = A_{k-1,2i} - A_{k-1,2i+1}, 0 ≤ i ≤ (n/2^k) - 1

34
Wavelet Distance (FD2): A Lower Bound on the ED
  • Maximum Frequency Distance:
  • FD(s1, s2) = max { FD1(f(s1), f(s2)), FD2(ψ(s1), ψ(s2)) }

35
Wavelet Transformation Example
  • s = TCAC, n = |s| = 4
  • ψ_0(s) = [v_{0,0}, v_{0,1}, v_{0,2}, v_{0,3}]
  •   = [(A_{0,0}, B_{0,0}), (A_{0,1}, B_{0,1}), (A_{0,2},
    B_{0,2}), (A_{0,3}, B_{0,3})]
  •   = [(f(T), 0), (f(C), 0), (f(A), 0), (f(C), 0)]
  •   = [([0,0,0,1], 0), ([0,1,0,0], 0), ([1,0,0,0],
    0), ([0,1,0,0], 0)]
  • ψ_1(s) = [([0,1,0,1], [0,-1,0,1]),
    ([1,1,0,0], [1,-1,0,0])]
  • ψ_2(s) = [([1,2,0,1], [-1,0,0,1])]
  • In each pair, the first entry is the first wavelet
    coefficient (A) and the second entry is the second
    wavelet coefficient (B).

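The recurrence behind this example can be sketched in Java as follows. The class name `WaveletSketch`, the method `coefficients`, the alphabet order A, C, G, T, and the restriction to strings whose length is a power of two are assumptions made for illustration, not details confirmed by the slides.

```java
// Illustrative sketch; names are hypothetical, not taken from the project.
final class WaveletSketch {
    static final String ALPHABET = "ACGT"; // assumed order, as in the example

    /**
     * Returns the level-k coefficients of s, with result[i][0] = A_{k,i}
     * and result[i][1] = B_{k,i}. Requires |s| to be a power of two and
     * 0 <= k <= log2 |s|.
     */
    static int[][][] coefficients(String s, int k) {
        int n = s.length();
        if (n == 0 || (n & (n - 1)) != 0) {
            throw new IllegalArgumentException("length must be a power of two");
        }
        if (k < 0 || k > Integer.numberOfTrailingZeros(n)) {
            throw new IllegalArgumentException("k out of range");
        }
        int dim = ALPHABET.length();
        // Level 0: A_{0,i} = f(c_i) and B_{0,i} = 0.
        int[][] a = new int[n][dim];
        int[][] b = new int[n][dim];
        for (int i = 0; i < n; i++) {
            int idx = ALPHABET.indexOf(s.charAt(i));
            if (idx < 0) {
                throw new IllegalArgumentException("symbol outside alphabet");
            }
            a[i][idx] = 1;
        }
        // Each higher level takes sums and differences of adjacent A vectors.
        for (int level = 1; level <= k; level++) {
            int[][] nextA = new int[a.length / 2][dim];
            int[][] nextB = new int[a.length / 2][dim];
            for (int i = 0; i < nextA.length; i++) {
                for (int d = 0; d < dim; d++) {
                    nextA[i][d] = a[2 * i][d] + a[2 * i + 1][d];
                    nextB[i][d] = a[2 * i][d] - a[2 * i + 1][d];
                }
            }
            a = nextA;
            b = nextB;
        }
        int[][][] result = new int[a.length][][];
        for (int i = 0; i < a.length; i++) {
            result[i] = new int[][] {a[i], b[i]};
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints [[[1, 2, 0, 1], [-1, 0, 0, 1]]], matching ψ_2(TCAC) above.
        System.out.println(java.util.Arrays.deepToString(coefficients("TCAC", 2)));
    }
}
```
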
36
MRS Index Structure
37
MRS Index Creation
[Figure: string s1 with a window of size w = 2^a, and an MBR]
38
MRS Index Creation
[Figure: the window on s1 is transformed]
39
MRS Index Creation
[Figure: string s1 and an MBR]
40
MRS Index Creation
[Figure: the window slides c times over s1 (c = box capacity) to fill an MBR]
41
MRS Index Creation
[Figure: MBRs containing wavelet coefficients of substrings of s1]
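
The sliding-window construction on these slides can be sketched roughly as below. As a simplification, the sketch boxes the frequency vector of each window, which by the recurrence for A equals the first wavelet coefficient at the top level of that window; the second coefficient is left out. The names `MbrSketch` and `buildMbrs`, the parameters `w` (window size) and `c` (box capacity), and the low/high corner representation are hypothetical.

```java
// Illustrative sketch; names and the MBR representation are hypothetical.
final class MbrSketch {
    static final String ALPHABET = "ACGT"; // assumed alphabet order

    static int index(char ch) {
        int idx = ALPHABET.indexOf(ch);
        if (idx < 0) {
            throw new IllegalArgumentException("symbol outside alphabet: " + ch);
        }
        return idx;
    }

    /**
     * Slides a window of size w over s and groups every c consecutive
     * window vectors into one box. Returns mbrs[b][0] = low corner and
     * mbrs[b][1] = high corner of box b.
     */
    static int[][][] buildMbrs(String s, int w, int c) {
        int windows = s.length() - w + 1;
        if (w <= 0 || c <= 0 || windows <= 0) {
            throw new IllegalArgumentException("invalid window size or capacity");
        }
        int dim = ALPHABET.length();
        int[][][] mbrs = new int[(windows + c - 1) / c][2][dim];
        int[] freq = new int[dim];
        for (int i = 0; i < w; i++) {
            freq[index(s.charAt(i))]++; // vector of the first window
        }
        for (int start = 0; start < windows; start++) {
            if (start > 0) {
                // Shift the window by one symbol: drop the left, add the right.
                freq[index(s.charAt(start - 1))]--;
                freq[index(s.charAt(start + w - 1))]++;
            }
            int[][] box = mbrs[start / c];
            boolean first = start % c == 0;
            for (int d = 0; d < dim; d++) {
                box[0][d] = first ? freq[d] : Math.min(box[0][d], freq[d]);
                box[1][d] = first ? freq[d] : Math.max(box[1][d], freq[d]);
            }
        }
        return mbrs;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.deepToString(buildMbrs("AATGATAG", 4, 3)));
    }
}
```
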
42
MRS Index Creation
[Figure: T_{a,1}, the tree of MBRs for a resolution of w = 2^a over s1]
43
Using Different Resolutions
[Figure: trees T_{a,1} (w = 2^a) and T_{a+1,1} (w = 2^(a+1)) over s1]
44
MRS Index Structure
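[Figure: grid of indices T_{i,j}, with database strings j (1 ≤ j ≤ d) on one axis and resolution levels i (a ≤ i ≤ b) on the other]
  • T_{i,j}: index for the j-th string and window size 2^i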
45
Range Queries
46
Range Queries
1. Partition the query string into subqueries at
the various resolutions available in the index.
2. Perform a partial range query for each
subquery on the corresponding row of the index
structure, and refine ε.
3. Disk pages corresponding to the last result set
are read, and postprocessing is done to eliminate
false retrievals.
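
Step 1 can be illustrated with a small Java sketch. It assumes a greedy, largest-window-first split of the query length into window sizes 2^i with a ≤ i ≤ b; the slides do not state the exact partitioning rule, so the rule, the class name `QueryPartitionSketch`, and the method name are all hypothetical.

```java
// Illustrative sketch; the greedy rule and all names are assumptions.
final class QueryPartitionSketch {
    /**
     * Splits queryLength into subquery lengths 2^i with a <= i <= b,
     * taking the largest available window first. Any remainder shorter
     * than 2^a is left uncovered.
     */
    static java.util.List<Integer> partition(int queryLength, int a, int b) {
        if (queryLength < 0 || a < 0 || b < a || b > 30) {
            throw new IllegalArgumentException("invalid arguments");
        }
        java.util.List<Integer> sizes = new java.util.ArrayList<>();
        int remaining = queryLength;
        for (int i = b; i >= a; i--) {
            int w = 1 << i; // window size at resolution i
            while (remaining >= w) {
                sizes.add(w);
                remaining -= w;
            }
        }
        return sizes;
    }

    public static void main(String[] args) {
        // With resolutions 2^4 .. 2^7, a query of length 208 splits
        // into three subqueries: [128, 64, 16].
        System.out.println(partition(208, 4, 7));
    }
}
```
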
[Figure: index grid with one column per database string (s1, s2, ..., sd) and one row per window size (w = 2^4, 2^5, 2^6, 2^7); the query q is split into subqueries q1, q2, q3]