Title: Similarity Searches in Sequence Databases
1. Similarity Searches in Sequence Databases
- Sang-Hyun Park
- KMeD Research Group
- Computer Science Department
- University of California, Los Angeles
2. Contents
- Introduction
- Whole Sequence Searches
- Subsequence Searches
- Segment-Based Subsequence Searches
- Multi-Dimensional Subsequence Searches
- Conclusion
3. What is a Sequence?
- A sequence is an ordered list of elements.
- S = ⟨14.3, 18.2, 22.0, 22.4, 19.5, 17.1, 15.8, 15.1⟩
- Sequences are a principal data format in many applications.
4. What is a Similarity Search?
- Similarity search finds sequences whose changing patterns are similar to that of a query sequence.
- Example
- Detect stocks with similar growth patterns
- Find persons with similar voice clips
- Find patients whose brain tumors have similar evolution patterns
- Similarity search helps in clustering, data mining, and rule discovery.
5. Classification of Similarity Search
- Similarity searches are classified as
- Whole sequence searches
- Subsequence searches
- Example
- S = ⟨1, 2, 3⟩
- Subsequences(S): ⟨1⟩, ⟨2⟩, ⟨3⟩, ⟨1,2⟩, ⟨2,3⟩, ⟨1,2,3⟩
- In whole sequence searches, the sequence S itself is compared with a query sequence Q.
- In subsequence searches, every possible subsequence of S can be compared with a query sequence q.
6. Similarity Measure
- Lp distance metric
- L1: Manhattan distance or city-block distance
- L2: Euclidean distance
- L∞: maximum distance over all element pairs
- Requires that the two sequences have the same length
7. Similarity Measure (2)
- Time Warping Distance
- Originally introduced in the area of speech recognition
- Allows sequences to be stretched along the time axis
- ⟨3,5,6⟩ → ⟨3,3,5,6⟩ → ⟨3,3,3,5,6⟩ → ⟨3,3,3,5,5,6⟩ → …
- Each element of a sequence can be mapped to one or more neighboring elements of another sequence.
- Useful in applications where sequences may be of different lengths or different sampling rates
- Example: Q = ⟨10, 15, 20⟩ is matched against S = ⟨10, 15, 16, 20⟩.
8. Similarity Measure (3)
- Time Warping Distance (2)
- Defined recursively (S[2:] denotes the suffix of S starting at its second element)
- DTW(S, Q) = DBASE(S[1], Q[1]) + min{ DTW(S, Q[2:]), DTW(S[2:], Q), DTW(S[2:], Q[2:]) }
- DBASE(S[1], Q[1]) = |S[1] − Q[1]|^p
- Computed by the dynamic programming technique in O(|S||Q|)
(Figure: S and Q split into their first elements S[1], Q[1] and the remaining suffixes S[2:], Q[2:].)
9. Similarity Measure (4)
- Time Warping Distance (3)
- S = ⟨4,5,6,7,6,6⟩, Q = ⟨3,4,3⟩
- When using L1 as DBASE, DTW(S, Q) = 12 (reproduced by the sketch below)
(Figure: the cumulative distance table; each cell holds |Si − Qj| + min(V1, V2, V3), where V1, V2, V3 are the three already-filled neighboring cells.)
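The recursion on slide 8 is normally evaluated bottom-up with a dynamic programming table. The C sketch below is a minimal version of that computation, assuming the L1 base distance used on slide 9; the function and variable names are mine, not those of the original implementation. Running it on slide 9's example reproduces the stated value of 12.

```c
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

/* Base distance between a pair of elements; L1, as on slide 9. */
static double dbase(double a, double b) { return fabs(a - b); }

static double min3(double a, double b, double c) {
    double m = a < b ? a : b;
    return m < c ? m : c;
}

/* Time warping distance by dynamic programming, O(|S||Q|) time. */
double dtw(const double *s, int n, const double *q, int m) {
    double *d = malloc((size_t)(n + 1) * (m + 1) * sizeof *d);
    #define D(i, j) d[(i) * (m + 1) + (j)]
    D(0, 0) = 0.0;
    for (int i = 1; i <= n; i++) D(i, 0) = HUGE_VAL;   /* empty Q */
    for (int j = 1; j <= m; j++) D(0, j) = HUGE_VAL;   /* empty S */
    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= m; j++)
            D(i, j) = dbase(s[i - 1], q[j - 1]) +
                      min3(D(i - 1, j), D(i, j - 1), D(i - 1, j - 1));
    double result = D(n, m);
    #undef D
    free(d);
    return result;
}

int main(void) {
    double s[] = {4, 5, 6, 7, 6, 6};
    double q[] = {3, 4, 3};
    printf("DTW(S, Q) = %g\n", dtw(s, 6, q, 3));  /* prints 12, as on slide 9 */
    return 0;
}
```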
10. False Alarm and False Dismissal
- False Alarm
- Candidates not similar to a query.
- Minimize false alarms for efficiency
- False Dismissal
- Similar sequences not retrieved by index search
- Avoid false dismissals for correctness
(Figure: within the set of data sequences, the candidate set and the set of truly similar sequences overlap; candidates that are not similar are false alarms, and similar sequences missed by the index search are false dismissals.)
11. Contents
- Introduction
- Whole Sequence Searches
- Subsequence Searches
- Segment-Based Subsequence Searches
- Multi-Dimensional Subsequence Searches
- Conclusion
12. Problem Definition
- Input
- Set of data sequences S
- Query sequence Q
- Distance tolerance ε
- Output
- Set of data sequences whose distances to Q are within ε
- Similarity Measure
- Time warping distance function, DTW
- L∞ as the distance function for each element pair
- If the distance of every element pair is within ε, then DTW(S, Q) ≤ ε.
13. Previous Approaches
- Naïve Scan [Ber96]
- Read every data sequence from the database
- Apply the dynamic programming technique
- For m data sequences with average length L, O(mL|Q|)
- FastMap-Based Technique [Yi98]
- Use the FastMap technique for feature extraction
- Map features into multi-dimensional points
- Use Euclidean distance in the index space for filtering
- Could not guarantee no false dismissals
14. Previous Approaches (2)
- LB-Scan [Yi98]
- Read every data sequence from the database
- Apply a lower-bound distance function Dlb which satisfies the following lower-bounding theorem
- Dlb(S, Q) > ε ⇒ DTW(S, Q) > ε
- Faster than the original time warping distance function (O(|S|+|Q|) vs. O(|S||Q|))
- Guarantees no false dismissals
- Based on sequential scanning
15. Proposed Approach
- Goal
- No false dismissal
- High query processing performance
- Sketch
- Extract a time-warping invariant feature vector
- Build a multi-dimensional index
- Use a lower-bound distance function for filtering
16. Proposed Approach (2)
- Feature Extraction
- F(S) = ⟨ First(S), Last(S), Max(S), Min(S) ⟩
- F(S) is invariant to the time warping transformation.
- Distance Function for Feature Vectors (sketched in C below)
- DFT(F(S), F(Q)) = max{ |First(S) − First(Q)|, |Last(S) − Last(Q)|, |Max(S) − Max(Q)|, |Min(S) − Min(Q)| }
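A minimal C sketch of the feature vector and of DFT, assuming the max-of-absolute-differences form reconstructed above; the struct and function names are illustrative, not names from the original implementation.

```c
#include <math.h>

/* Time-warping-invariant feature vector ⟨First, Last, Max, Min⟩. */
typedef struct { double first, last, max, min; } Feature;

Feature extract_feature(const double *s, int n) {
    Feature f = { s[0], s[n - 1], s[0], s[0] };
    for (int i = 1; i < n; i++) {
        if (s[i] > f.max) f.max = s[i];
        if (s[i] < f.min) f.min = s[i];
    }
    return f;
}

/* D_FT: the maximum of the component-wise differences, computed in O(1). */
double dft(Feature a, Feature b) {
    double d1 = fabs(a.first - b.first), d2 = fabs(a.last - b.last);
    double d3 = fabs(a.max - b.max),     d4 = fabs(a.min - b.min);
    double m = d1 > d2 ? d1 : d2;
    if (d3 > m) m = d3;
    if (d4 > m) m = d4;
    return m;
}
```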
17. Proposed Approach (3)
- Distance Function for Feature Vectors (2)
- Satisfies the lower-bounding theorem
- DFT(F(S), F(Q)) > ε ⇒ DTW(S, Q) > ε
- More accurate than the Dlb proposed in LB-Scan
- Faster than Dlb (O(1) vs. O(|S|+|Q|))
18. Proposed Approach (4)
- Indexing
- Build a multi-dimensional index from the set of feature vectors
- Index entry: ⟨ First(S), Last(S), Max(S), Min(S), Identifier(S) ⟩
- Query Processing
- Extract the feature vector F(Q)
- Perform a range query in the index space to find data points included in the following query rectangle (see the containment sketch below)
- ⟨ [First(Q) − ε, First(Q) + ε], [Last(Q) − ε, Last(Q) + ε], [Max(Q) − ε, Max(Q) + ε], [Min(Q) − ε, Min(Q) + ε] ⟩
- Perform post-processing to discard false alarms
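The rectangle above is what the range query hands to the multi-dimensional index. The fragment below only illustrates the containment test on a single index entry, assuming an entry stores the four features plus an identifier; in the proposed method this range query is executed by the R-tree search, not by scanning entries, and the names here are illustrative.

```c
/* Hypothetical index entry: the four features plus a sequence id. */
typedef struct { double first, last, max, min; long id; } IndexEntry;

/* True if the entry falls inside the 4-D query rectangle of slide 18. */
int in_query_rectangle(IndexEntry e, double qf, double ql,
                       double qmax, double qmin, double eps) {
    return e.first >= qf   - eps && e.first <= qf   + eps &&
           e.last  >= ql   - eps && e.last  <= ql   + eps &&
           e.max   >= qmax - eps && e.max   <= qmax + eps &&
           e.min   >= qmin - eps && e.min   <= qmin + eps;
}
```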
19. Performance Evaluation
- Implementation
- Implemented in C on a UNIX operating system
- An R-tree is used as the multi-dimensional index.
- Experimental Setup
- S&P 500 stock data set (m = 545, L = 232)
- Random walk synthetic data set
- Sun SPARC Ultra-5
20. Performance Evaluation (2)
- Filtering Ratio
- Better than LB-Scan
21. Performance Evaluation (3)
- Query Processing Time
- Faster than LB-Scan and Naïve-Scan
22. Contents
- Introduction
- Whole Sequence Searches
- Subsequence Searches
- Segment-Based Subsequence Searches
- Multi-Dimensional Subsequence Searches
- Conclusion
23. Problem Definition
- Input
- Set of data sequences S
- Query sequence q
- Distance tolerance ε
- Output
- Set of subsequences whose distances to q are within ε
- Similarity Measure
- Time warping distance function, DTW
- Any Lp metric as the distance function for element pairs
24. Previous Approaches
- Naïve-Scan [Ber96]
- Read every data subsequence from the database
- Apply the dynamic programming technique
- For m data sequences with average length L, O(mL²|q|)
25. Previous Approaches (2)
- ST-Index [Fal94]
- Assumes that the minimum query length (w) is known in advance
- Locates a sliding window of size w at every possible location
- Extracts a feature vector inside the window
- Maps each feature vector into a point and groups trails into MBRs (Minimum Bounding Rectangles)
- Uses Euclidean distance in the index space for filtering
- Could not guarantee no false dismissals
26. Proposed Approach
- Goal
- No false dismissal
- High performance
- Support diverse similarity measures
- Sketch
- Convert into sequences of discrete symbols
- Build a sparse suffix tree
- Use a lower-bound distance function for filtering
- Apply branch-pruning to reduce the search space
27. Proposed Approach (2)
- Conversion
- Generate categories from the distribution of element values
- Maximum-entropy method
- Equal-interval method
- DISC method
- Convert each element to the symbol of the corresponding category (see the sketch below)
- Example
- A = [0, 1.0], B = [1.1, 2.0], C = [2.1, 3.0], D = [3.1, 4.0]
- S = ⟨1.3, 1.6, 2.9, 3.3, 1.5, 0.1⟩
- SC = ⟨B, B, C, D, B, A⟩
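A small C sketch of the conversion step, hard-coding the example category table above; the table layout and names are illustrative only (the deck's maximum-entropy, equal-interval, and DISC methods would generate the category boundaries).

```c
#include <stdio.h>

/* Category table from the slide's example:
 * A = [0, 1.0], B = [1.1, 2.0], C = [2.1, 3.0], D = [3.1, 4.0]. */
typedef struct { double lo, hi; char symbol; } Category;

static const Category cats[] = {
    {0.0, 1.0, 'A'}, {1.1, 2.0, 'B'}, {2.1, 3.0, 'C'}, {3.1, 4.0, 'D'}
};

/* Map one element value to the symbol of the category containing it. */
char to_symbol(double v) {
    for (size_t i = 0; i < sizeof cats / sizeof cats[0]; i++)
        if (v >= cats[i].lo && v <= cats[i].hi) return cats[i].symbol;
    return '?';  /* value outside every category range */
}

int main(void) {
    double s[] = {1.3, 1.6, 2.9, 3.3, 1.5, 0.1};
    for (int i = 0; i < 6; i++) putchar(to_symbol(s[i]));
    putchar('\n');  /* prints BBCDBA, matching SC on the slide */
    return 0;
}
```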
28. Proposed Approach (3)
- Indexing
- Extract suffixes from the sequences of discrete symbols.
- Example
- From S1C = ⟨A, B, B, A⟩,
- we extract four suffixes: ABBA, BBA, BA, A
29. Proposed Approach (4)
- Indexing (2)
- Build a suffix tree.
- The suffix tree was originally proposed to retrieve substrings exactly matching a query string.
- A suffix tree consists of nodes and edges.
- Each suffix is represented by a path from the root node to a leaf node.
- The labels on the path from the root to an internal node Ni represent the longest common prefix of the suffixes under Ni.
- The suffix tree is built with O(mL) computation and space complexity.
30. Proposed Approach (4)
- Indexing (3)
- Example: the suffix tree built from S1C = ⟨A, B, B, A⟩ and S2C = ⟨A, B⟩ (a simplified trie sketch follows below)
(Figure: the suffix tree; its root branches on A and B, and its leaves mark the suffixes S1C[1:], S1C[2:], S1C[3:], S1C[4:], S2C[1:], and S2C[2:].)
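For illustration, the sketch below indexes every suffix of the converted sequences in a plain suffix trie (one node per symbol). This is not the compacted, O(mL) suffix tree construction referred to on slide 29, just a minimal structure showing how the suffixes of S1C and S2C end up sharing prefixes; all names are mine.

```c
#include <stdlib.h>
#include <string.h>

#define ALPHABET 26  /* one slot per category symbol 'A'..'Z' */

/* Naive suffix trie: one node per symbol, enough to illustrate how the
 * suffixes of the symbol sequences are indexed and share prefixes. */
typedef struct Node {
    struct Node *child[ALPHABET];
    int is_suffix_end;               /* a suffix of some sequence ends here */
} Node;

static Node *new_node(void) { return calloc(1, sizeof(Node)); }

/* Insert one suffix (a string of category symbols) into the trie. */
static void insert_suffix(Node *root, const char *suffix) {
    Node *cur = root;
    for (; *suffix; suffix++) {
        int c = *suffix - 'A';
        if (!cur->child[c]) cur->child[c] = new_node();
        cur = cur->child[c];
    }
    cur->is_suffix_end = 1;
}

/* Index every suffix of a converted sequence such as "ABBA". */
void index_sequence(Node *root, const char *sc) {
    for (size_t i = 0; i < strlen(sc); i++)
        insert_suffix(root, sc + i);
}

int main(void) {
    Node *root = new_node();
    index_sequence(root, "ABBA");  /* S1C from slide 30 */
    index_sequence(root, "AB");    /* S2C from slide 30 */
    return 0;
}
```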
31. Proposed Approach (5)
(Figure: query processing pipeline; a query (q, ε) drives index searching over the suffix tree, which yields candidates, and post-processing against the data sequences produces the answers.)
32. Proposed Approach (6)
- Index Searching
- Visit each node of the suffix tree by depth-first traversal.
- Build a lower-bound distance table for q and the edge labels.
- Inspect the last columns of newly added rows to find candidates.
- Apply branch-pruning to reduce the search space (a sketch of the test follows below).
- Branch-pruning theorem
- If all columns of the last row of the distance table have values larger than the distance tolerance ε, adding more rows to this table cannot yield new values less than or equal to ε.
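A minimal C rendering of the branch-pruning test: once every entry in the last row of the lower-bound distance table exceeds ε, the subtree below the current edge can be skipped. The row used in main() is a made-up example, not one of the tables from slide 33.

```c
#include <stdio.h>

/* Branch-pruning test (slide 32): if every column of the last row of the
 * lower-bound distance table exceeds epsilon, extending the table with
 * more rows cannot produce a value <= epsilon, so the whole subtree
 * below the current suffix-tree edge can be skipped. */
int can_prune(const double *last_row, int ncols, double epsilon) {
    for (int j = 0; j < ncols; j++)
        if (last_row[j] <= epsilon) return 0;  /* still promising: descend */
    return 1;                                  /* prune this branch */
}

int main(void) {
    double row[] = {2.1, 2.6, 4.0};            /* illustrative last row */
    printf("%s\n", can_prune(row, 3, 1.5) ? "prune" : "descend");
    return 0;
}
```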
33. Proposed Approach (7)
- Index Searching (2)
- Example: q = ⟨2, 2, 1⟩, ε = 1.5
(Figure: lower-bound distance tables built for q against the edge labels A, B, and D while traversing nodes N1 through N4; branches whose last-row entries all exceed ε are pruned.)
34. Proposed Approach (8)
- Lower-Bound Distance Function DTW-LB
- DBASE-LB(A, v) =
- 0, if v is within the range of A (A.min ≤ v ≤ A.max)
- (A.min − v)^p, if v is smaller than A.min
- (v − A.max)^p, if v is larger than A.max
(Figure: the three cases; the possible minimum distance between v and any value in the category range [A.min, A.max] is 0, (A.min − v)^p, or (v − A.max)^p, respectively.)
35. Proposed Approach (9)
- Lower-Bound Distance Function DTW-LB (2)
- Satisfies the lower-bounding theorem
- DTW-LB(sC, q) > ε ⇒ DTW(s, q) > ε
- Computation complexity O(|sC||q|)
- DTW-LB(sC, q) = DBASE-LB(sC[1], q[1]) + min{ DTW-LB(sC, q[2:]), DTW-LB(sC[2:], q), DTW-LB(sC[2:], q[2:]) } (sketched below)
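A C sketch of DBASE-LB and DTW-LB, assuming p = 1 in the base distance and representing each category symbol by its value range; the types and names are illustrative, not those of the original implementation.

```c
#include <math.h>
#include <stdlib.h>

/* A category symbol is an interval of element values. */
typedef struct { double min, max; } Range;

/* DBASE-LB from slide 34 with p = 1: the smallest possible distance
 * between any value inside the category range and the query element v. */
static double dbase_lb(Range a, double v) {
    if (v < a.min) return a.min - v;
    if (v > a.max) return v - a.max;
    return 0.0;
}

static double min3(double a, double b, double c) {
    double m = a < b ? a : b;
    return m < c ? m : c;
}

/* DTW-LB: the same dynamic program as DTW, but over category ranges,
 * so it never exceeds the true time warping distance. */
double dtw_lb(const Range *sc, int n, const double *q, int m) {
    double *d = malloc((size_t)(n + 1) * (m + 1) * sizeof *d);
    #define D(i, j) d[(i) * (m + 1) + (j)]
    D(0, 0) = 0.0;
    for (int i = 1; i <= n; i++) D(i, 0) = HUGE_VAL;
    for (int j = 1; j <= m; j++) D(0, j) = HUGE_VAL;
    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= m; j++)
            D(i, j) = dbase_lb(sc[i - 1], q[j - 1]) +
                      min3(D(i - 1, j), D(i, j - 1), D(i - 1, j - 1));
    double result = D(n, m);
    #undef D
    free(d);
    return result;
}
```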
36. Proposed Approach (10)
- Computation Complexity
- m is the number of data sequences.
- L is the average length of the data sequences.
- The left expression is for index searching.
- The right expression is for post-processing.
- RP (≥ 1) is the reduction factor from branch-pruning.
- RD (≥ 1) is the reduction factor from sharing distance tables.
- n is the number of subsequences requiring post-processing.
37. Proposed Approach (11)
- Sparse Indexing
- The index size is linear in the number of suffixes stored.
- To reduce the index size, we build a sparse suffix tree (SST).
- That is, we store the suffix SC[i:] only if SC[i] ≠ SC[i−1].
- Compaction Ratio
- Example
- SC = ⟨A, A, A, A, C, B, B⟩
- Store only three suffixes (SC[1:], SC[5:], and SC[6:])
- Compaction ratio C = 7/3
38. Proposed Approach (12)
- Sparse Indexing (2)
- When traversing the suffix tree, we need to find non-stored suffixes and compute their distances to q.
- Assume that the first k elements of sC have the same value.
- Then sC[1:] is stored, but sC[i:] (i = 2, 3, …, k) is not stored.
- For non-stored suffixes, we introduce another lower-bound distance function:
- DTW-LB2(sC[i:], q) = DTW-LB(sC, q) − (i − 1) × DBASE-LB(sC[1], q[1])
- DTW-LB2 satisfies the lower-bounding theorem.
- DTW-LB2 is O(1) when DTW-LB(sC, q) is given (see the fragment below).
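Expressed in C, the O(1) bound for a non-stored suffix is just the formula above, reusing quantities already computed during traversal by the dtw_lb/dbase_lb sketch after slide 35. The subtraction sign is my reading of the garbled slide text, so treat it as an assumption.

```c
/* O(1) lower bound (slide 38) for the non-stored suffix sC[i:], where the
 * first k symbols of sC are identical and 2 <= i <= k. The arguments are
 * DTW-LB(sC, q) and DBASE-LB(sC[1], q[1]), both already available from
 * the earlier dtw_lb/dbase_lb sketch. */
double dtw_lb2(double dtw_lb_sc_q, int i, double dbase_lb_first) {
    return dtw_lb_sc_q - (i - 1) * dbase_lb_first;
}
```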
39. Proposed Approach (13)
- Sparse Indexing (3)
- With sparse indexing, the complexity becomes
- m is the number of data sequences.
- L is the average length of the data sequences.
- C is the compaction ratio.
- n is the number of subsequences requiring post-processing.
- RP (≥ 1) is the reduction factor from branch-pruning.
- RD (≥ 1) is the reduction factor from sharing distance tables.
40. Performance Evaluation
- Implementation
- Implemented in C on a UNIX operating system
- Experimental Setup
- S&P 500 stock data set (m = 545, L = 232)
- Random walk synthetic data set
- Maximum-Entropy (ME) categorization
- Disk-based suffix tree construction algorithm
- Sun SPARC Ultra-5
41. Performance Evaluation (2)
- Comparison with Naïve-Scan
- Increasing distance tolerances
- S&P 500 stock data set, |q| = 20
42. Performance Evaluation (3)
- Scalability Test
- Increasing average length of data sequences
- Random-walk data set, |q| = 20, m = 200
43. Performance Evaluation (4)
- Scalability Test (2)
- Increasing total number of data sequences
- Random-walk data set, |q| = 20, L = 200
44. Contents
- Introduction
- Whole Sequence Searches
- Subsequence Searches
- Segment-Based Subsequence Searches
- Multi-Dimensional Subsequence Searches
- Conclusion
45. Introduction
- We extend the proposed subsequence searching method to large sequence databases.
- In the retrieval of similar subsequences with the time warping distance function,
- Sequential scanning is O(mL²|q|).
- The proposed method is O(mL²|q| / R) (R ≥ 1).
- The quadratic dependence on L makes both search algorithms suffer severe performance degradation when L is very large.
- For a database with long sequences, we need a new searching scheme that is linear in L.
46. SBASS
- We propose a new searching scheme: the Segment-Based Subsequence Searching scheme (SBASS).
- Sequences are divided into a series of piece-wise segments.
- When a query sequence q with k segments is submitted, q is compared with those subsequences that consist of k consecutive data segments.
- The lengths of the segments may differ.
- SS denotes the segmented sequence of S.
- S = ⟨4,5,8,9,11,8,4,3⟩, |S| = 8
- SS = ⟨⟨4,5,8,9,11⟩, ⟨8,4,3⟩⟩, |SS| = 2
47. SBASS (2)
- Only four subsequences of SS are compared with qS:
- ⟨SS1, SS2⟩, ⟨SS2, SS3⟩, ⟨SS3, SS4⟩, ⟨SS4, SS5⟩
(Figure: S is divided into five segments SS1 through SS5, and the query is divided into two segments qS1 and qS2.)
48. SBASS (3)
- For the SBASS scheme, we define the piece-wise time warping distance function (where k = |qS| = |sS|); one reading of it is sketched below.
- Sequential scanning for the SBASS scheme is O(mL|q|).
- We introduce an indexing technique with O(mL|q| / R) (R ≥ 1).
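The slide's formula for the piece-wise distance did not survive extraction. A natural reading, given that the query's k segments are aligned with k consecutive data segments, is the sum of ordinary time warping distances over the aligned segment pairs; the sketch below implements that reading and should be treated as an assumption, with dtw() referring to the routine sketched after slide 9.

```c
/* Piece-wise time warping distance, sketched as the sum of DTW distances
 * over the k aligned segment pairs (an assumed reading of slide 48). */
typedef struct { const double *elems; int len; } Segment;

double dtw(const double *s, int n, const double *q, int m);  /* earlier sketch */

double piecewise_dtw(const Segment *ss, const Segment *qs, int k) {
    double total = 0.0;
    for (int i = 0; i < k; i++)
        total += dtw(ss[i].elems, ss[i].len, qs[i].elems, qs[i].len);
    return total;
}
```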
49. Sketch of Proposed Approach
- Indexing
- Convert sequences to segmented sequences.
- Extract a feature vector from each segment.
- Categorize the feature vectors.
- Convert segmented sequences to sequences of symbols.
- Construct a suffix tree from the sequences of symbols.
- Query Processing
- Traverse the suffix tree to find candidates.
- Discard false alarms in post processing.
50. Segmentation
- Approach (a rough sketch in C follows below)
- Divide at peak points.
- Divide further if the maximum deviation from the interpolation line is too large.
- Eliminate noise.
- Compaction ratio C = |S| / |SS|
(Figure: a sequence divided at its peak points, with an additional split where the deviation from the interpolation line is too large, and with small noisy segments eliminated.)
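A rough C sketch of the first segmentation rule only (cutting at peak and valley points); the deviation-based splitting and noise-elimination steps, and any thresholds they would need, are left out because the slide does not specify them. On S = ⟨4,5,8,9,11,8,4,3⟩ from slide 46 this produces the two segments ⟨4,5,8,9,11⟩ and ⟨8,4,3⟩ shown there.

```c
#include <stdio.h>

/* Cut the sequence after every local maximum or minimum (peak point).
 * Returns the number of cut positions; segments = cuts + 1. */
int segment_at_peaks(const double *s, int n, int *cuts, int max_cuts) {
    int ncuts = 0;
    for (int i = 1; i + 1 < n && ncuts < max_cuts; i++) {
        int peak   = s[i] > s[i - 1] && s[i] > s[i + 1];
        int valley = s[i] < s[i - 1] && s[i] < s[i + 1];
        if (peak || valley)
            cuts[ncuts++] = i + 1;  /* next segment starts after the extremum */
    }
    return ncuts;
}

int main(void) {
    double s[] = {4, 5, 8, 9, 11, 8, 4, 3};  /* S from slide 46 */
    int cuts[8];
    int n = segment_at_peaks(s, 8, cuts, 8);
    for (int i = 0; i < n; i++)
        printf("cut before index %d\n", cuts[i]);  /* prints: cut before index 5 */
    return 0;
}
```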
51. Feature Extraction
- From each subsequence segment, extract a feature vector.
- The feature vector contains V1 (the first element value), VL (the last element value), L (the segment length), and the maximum deviations above and below the interpolation line.
(Figure: a segment annotated with V1, VL, and its length L.)
52. Categorization and Index Construction
- Categorization
- Group similar feature vectors together using multi-dimensional categorization methods such as the Multi-attribute Type Abstraction Hierarchy (MTAH).
- Assign a unique symbol to each category.
- Convert segmented sequences to sequences of symbols.
- S = ⟨4,5,8,8,8,8,9,11,8,4,3⟩
- SS = ⟨⟨4,5,8,8,8,8,9,11⟩, ⟨8,4,3⟩⟩
- SF = ⟨(4,11,8,2,1), (8,3,3,0,1.5)⟩
- SC = ⟨A, B⟩
- From the sequences of symbols, construct the suffix tree.
53. Query Processing
- For query processing, we calculate the lower-bound distances between symbols and keep them in a table.
- Given the query sequence q and the distance tolerance ε:
- Convert q to qS and then to qC.
- Search the suffix tree to find those subsequences whose lower-bound distances to qC are within ε.
- Discard false alarms in post-processing.
54. Query Processing (2)
(Figure: the query (q, ε) is converted to qS and then to qC; index searching over the suffix tree yields candidates, and post-processing against the data sequences returns the answers.)
55. Computation Complexity
- Sequential scanning is O(mL|q|).
- The complexity of the proposed search algorithm is
- n is the number of subsequences contained in the candidates.
- C is the compaction ratio, i.e., the average number of elements per segment.
- RD (≥ 1) is the reduction factor from sharing edges of the suffix tree.
56. Performance Evaluation
- Test set: pseudo-periodic synthetic sequences
- m = 100, L = 10,000
- Achieved up to 6.5 times speed-up compared to sequential scanning.
(Figure: query processing time in seconds versus distance tolerance (0.2 to 1.0) for sequential scanning (SeqScan) and our approach.)
57. Contents
- Introduction
- Whole Sequence Searches
- Subsequence Searches
- Segment-Based Subsequence Searches
- Multi-Dimensional Subsequence Searches
- Conclusion
58. Introduction
- So far, we have assumed that elements have single-dimensional numeric values.
- Now, we consider multi-dimensional sequences:
- Image sequences
- Video streams
(Figure: a medical image sequence.)
59. Introduction (2)
- In multi-dimensional sequences, elements are represented by feature vectors.
- S = ⟨S1, …, SN⟩, Si = (Si1, …, SiF)
- Our proposed subsequence searching techniques are extended to the retrieval of similar multi-dimensional subsequences.
60. Introduction (3)
- Multi-Dimensional Time Warping Distance
- DMTW(S, Q) = DMBASE(S[1], Q[1]) + min{ DMTW(S, Q[2:]), DMTW(S[2:], Q), DMTW(S[2:], Q[2:]) }
- DMBASE(S[1], Q[1]) = Σi Wi × |S[1]i − Q[1]i| (sketched below)
- F is the number of features in each element; the sum runs over i = 1, …, F.
- Wi is the weight of the i-th dimension.
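A minimal C sketch of the weighted base distance DMBASE, assuming the absolute-difference form of the per-dimension term reconstructed above; the names are illustrative. With the attribute weights of slide 63, w would hold {0.3, 0.1, 0.6}.

```c
#include <math.h>

/* Weighted base distance between two F-dimensional elements (slide 60):
 * a weighted sum of per-dimension differences. The absolute-difference
 * form of each term is an assumption. */
double dmbase(const double *s_elem, const double *q_elem,
              const double *w, int F) {
    double d = 0.0;
    for (int i = 0; i < F; i++)
        d += w[i] * fabs(s_elem[i] - q_elem[i]);
    return d;
}
```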
61. Sketch of Our Approach
- Indexing
- Categorize the multi-dimensional element values using MTAH.
- Assign unique symbols to the categories.
- Convert the multi-dimensional sequences into sequences of symbols.
- Construct a suffix tree from the set of symbol sequences.
- Query Processing
- Traverse the suffix tree.
- Find candidates whose lower-bound distances to q are within ε.
- Do post-processing to discard false alarms.
62. Application to KMeD
- In the KMeD environment, the proposed technique is applied to the retrieval of medical image sequences whose spatio-temporal characteristics are similar to those of the query sequence.
- KMeD [CCT95] has the following features:
- Query by both image and alphanumeric contents
- Model the temporal, spatial, and evolutionary nature of objects
- Formulate queries using conceptual and imprecise terms
- Support cooperative processing
63. Application to KMeD (2)
- Query
- Medical image sequence
- Attribute names and their relative weights
- Distance tolerance
- Example weights: Size (0.3), Circularity (0.1), DistFromLV (0.6)
64. Application to KMeD (3)
(Figure: system flow; a query passes through query analysis guided by a user model, then contour extraction and feature extraction; similarity search with the distance function over the index structure and the medical image sequences returns matching sequences for visual presentation, with user feedback looping back.)
65. Contents
- Introduction
- Whole Sequence Searches
- Subsequence Searches
- Segment-Based Subsequence Searches
- Multi-Dimensional Subsequence Searches
- Conclusion
66. Summary
- A sequence is an ordered list of elements.
- Similarity search helps in clustering and data mining.
- For sequences of different lengths or different sampling rates, the time warping distance is useful.
- We proposed a whole sequence searching method using a spatial access method and a lower-bound distance function.
- We proposed a subsequence searching method using a suffix tree and lower-bound distance functions.
- We proposed a segment-based subsequence searching method for large sequence databases.
- We extended the subsequence searching method to the retrieval of similar multi-dimensional subsequences.
67. Contributions
- We proposed a tighter and faster lower-bound distance function for efficient whole sequence searches without false dismissals.
- We demonstrated the feasibility of using the time warping similarity measure on a suffix tree.
- We introduced the branch-pruning theorem and a fast lower-bound distance function for efficient subsequence searches without false dismissals.
- We applied categorization and sparse indexing for scalability.
- We applied the proposed technique to a real application (KMeD).