Similarity Searches in Sequence Databases

Provided by: sanghy
Learn more at: http://web.cs.ucla.edu

1
Similarity Searches in Sequence Databases
  • Sang-Hyun Park
  • KMeD Research Group
  • Computer Science Department
  • University of California, Los Angeles

2
Contents
  • Introduction
  • Whole Sequence Searches
  • Subsequence Searches
  • Segment-Based Subsequence Searches
  • Multi-Dimensional Subsequence Searches
  • Conclusion

3
What is Sequence?
  • A sequence is an ordered list of elements.
  • S = ⟨14.3, 18.2, 22.0, 22.4, 19.5, 17.1, 15.8,
    15.1⟩
  • Sequences are a principal data format in many
    applications.

4
What is Similarity Search?
  • Similarity search finds sequences whose changing
    patterns are similar to that of a query sequence.
  • Example
  • Detect stocks with similar growth patterns
  • Find persons with similar voice clips
  • Find patients whose brain tumors have similar
    evolution patterns
  • Similarity search helps in clustering, data
    mining, and rule discovery.

5
Classification of Similarity Search
  • Similarity Searches are classified as
  • Whole sequence searches
  • Subsequence searches
  • Example
  • S = ⟨1,2,3⟩
  • Subsequences(S) = ⟨1⟩, ⟨2⟩, ⟨3⟩, ⟨1,2⟩, ⟨2,3⟩,
    ⟨1,2,3⟩
  • In whole sequence searches, the
    sequence S itself is compared with a query
    sequence Q.
  • In subsequence searches, every
    possible subsequence of S can be compared with a
    query sequence q.

6
Similarity Measure
  • Lp Distance Metric
  • L1: Manhattan distance or city-block distance
  • L2: Euclidean distance
  • L∞: maximum distance over all element pairs
  • Requires that the two sequences have the same
    length
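
The three metrics listed above can be sketched in a few lines. This is a minimal illustration only; the function name `lp_distance` and the sample values are assumptions, not taken from the slides.

```python
import math

def lp_distance(s, q, p):
    """Lp distance between two equal-length sequences; p may be math.inf."""
    if len(s) != len(q):
        raise ValueError("Lp distance requires sequences of the same length")
    diffs = [abs(a - b) for a, b in zip(s, q)]
    if p == math.inf:
        return max(diffs)              # L-infinity: largest element-pair difference
    return sum(d ** p for d in diffs) ** (1.0 / p)

s = [1.0, 2.0, 3.0]                    # made-up sample data
q = [2.0, 4.0, 6.0]
manhattan = lp_distance(s, q, 1)       # L1
euclidean = lp_distance(s, q, 2)       # L2
maximum = lp_distance(s, q, math.inf)  # L-infinity
```

The length check mirrors the last bullet: an Lp metric is undefined for sequences of different lengths, which is what motivates the time warping distance.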

7
Similarity Measure (2)
  • Time Warping Distance
  • Originally introduced in the area of speech
    recognition
  • Allows sequences to be stretched along the time
    axis
  • ⟨3,5,6⟩ → ⟨3,3,5,6⟩ → ⟨3,3,3,5,6⟩ →
    ⟨3,3,3,5,5,6⟩ → ...
  • Each element of a sequence can be mapped to one
    or more neighboring elements of another sequence.
  • Useful in applications where sequences may be of
    different lengths or different sampling rates

Q = ⟨10, 15, 20⟩
S = ⟨10, 15, 16, 20⟩
8
Similarity Measure (3)
  • Time Warping Distance (2)
  • Defined recursively
  • Computed by dynamic programming technique,
    O(|S||Q|)

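$$\mathrm{DTW}(S, Q) = \mathrm{DBASE}(S[1], Q[1]) + \min\{\mathrm{DTW}(S, Q[2{:}-]),\ \mathrm{DTW}(S[2{:}-], Q),\ \mathrm{DTW}(S[2{:}-], Q[2{:}-])\}$$
$$\mathrm{DBASE}(S[1], Q[1]) = |S[1] - Q[1]|^{P}$$
[Figure: S split into its first element S[1] and the remainder S[2:-]; Q split into Q[1] and Q[2:-].]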
9
Similarity Measure (4)
  • Time Warping Distance (3)
  • S = ⟨4,5,6,7,6,6⟩, Q = ⟨3,4,3⟩
  • When using L1 as DBASE, DTW(S, Q) = 12

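[Figure: cumulative distance table. The cell for (S[i], Q[j]) holds |S[i] − Q[j]| + min(V1, V2, V3), where V1, V2, and V3 are the three neighboring cells already computed.]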
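
The dynamic programming computation can be sketched as below. This is a minimal illustration, assuming |a − b|^p as the element-pair base distance; the function name `dtw` is not from the slides. It fills the table from the first elements forward, which yields the same value as the recursive definition on suffixes.

```python
import math

def dtw(s, q, p=1):
    """Time warping distance with |a - b|**p as the element-pair base distance."""
    n, m = len(s), len(q)
    # table[i][j] = distance between the first i elements of s and the first j of q
    table = [[math.inf] * (m + 1) for _ in range(n + 1)]
    table[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            base = abs(s[i - 1] - q[j - 1]) ** p
            # each cell adds its own base distance to the cheapest of its three neighbors
            table[i][j] = base + min(table[i - 1][j],
                                     table[i][j - 1],
                                     table[i - 1][j - 1])
    return table[n][m]

example = dtw([4, 5, 6, 7, 6, 6], [3, 4, 3])      # S and Q from the slide, L1 base
stretched = dtw([10, 15, 20], [10, 15, 16, 20])   # sequences of different lengths
```

With the L1 base distance, `example` evaluates to 12, matching the value stated on the slide. The two nested loops make the O(|S||Q|) cost visible.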
10
False Alarm and False Dismissal
  • False Alarm
  • Candidates not similar to a query.
  • Minimize false alarms for efficiency
  • False Dismissal
  • Similar sequences not retrieved by index search
  • Avoid false dismissals for correctness

[Figure: data sequences, candidates, and similar sequences. Candidates that are not similar are false alarms; similar sequences outside the candidate set are false dismissals.]
11
Contents
  • Introduction
  • Whole Sequence Searches
  • Subsequence Searches
  • Segment-Based Subsequence Searches
  • Multi-Dimensional Subsequence Searches
  • Conclusion

12
Problem Definition
  • Input
  • Set of data sequences S
  • Query sequence Q
  • Distance tolerance ε
  • Output
  • Set of data sequences whose distances to Q are
    within ε
  • Similarity Measure
  • Time warping distance function, DTW
  • L∞ as a distance function for each element pair
  • If the distance of every element pair is within
    ε, then DTW(S,Q) ≤ ε.

13
Previous Approaches
  • Naïve-Scan [Ber96]
  • Read every data sequence from database
  • Apply dynamic programming technique
  • For m data sequences with average length L,
    O(mL|Q|)
  • FastMap-Based Technique [Yi98]
  • Use FastMap technique for feature extraction
  • Map features into multi-dimensional points
  • Use Euclidean distance in index space for
    filtering
  • Cannot guarantee the absence of false dismissals

14
Previous Approaches (2)
  • LB-Scan [Yi98]
  • Read every data sequence from database
  • Apply the lower-bound distance function Dlb which
    satisfies the following lower-bound theorem
  • Dlb(S,Q) > ε ⇒ DTW(S,Q) > ε
  • Faster than the original time warping distance
    function (O(|S|+|Q|) vs. O(|S||Q|))
  • Guarantee no false dismissal
  • Based on sequential scanning

15
Proposed Approach
  • Goal
  • No false dismissal
  • High query processing performance
  • Sketch
  • Extract a time-warping invariant feature vector
  • Build a multi-dimensional index
  • Use a lower-bound distance function for filtering

16
Proposed Approach (2)
  • Feature Extraction
  • F(S) = ⟨First(S), Last(S), Max(S), Min(S)⟩
  • F(S) is invariant to time warping transformation.
  • Distance Function for Feature Vectors

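$$\mathrm{DFT}(F(S), F(Q)) = \max\{\,|\mathrm{First}(S) - \mathrm{First}(Q)|,\ |\mathrm{Last}(S) - \mathrm{Last}(Q)|,\ |\mathrm{Max}(S) - \mathrm{Max}(Q)|,\ |\mathrm{Min}(S) - \mathrm{Min}(Q)|\,\}$$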
17
Proposed Approach (3)
  • Distance Function for Feature Vectors (2)
  • Satisfies lower-bounding theorem
  • DFT(F(S),F(Q)) > ε ⇒ DTW(S,Q) > ε
  • More accurate than Dlb proposed in LB-Scan
  • Faster than Dlb (O(1) vs. O(|S|+|Q|))
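
The feature vector and its distance function can be sketched as follows, together with an empirical check of the lower-bounding property. This is an illustration under one stated assumption: with L∞ as the element-pair distance, the time warping distance is taken to be the largest element-pair difference along the best warping path (a maximum instead of a sum). The function names are hypothetical.

```python
import math
import random

def features(s):
    """F(S) = (First(S), Last(S), Max(S), Min(S))."""
    return (s[0], s[-1], max(s), min(s))

def d_ft(fs, fq):
    """Largest absolute difference between corresponding features; O(1)."""
    return max(abs(a - b) for a, b in zip(fs, fq))

def dtw_max(s, q):
    """Assumed L-infinity form of DTW: largest element-pair difference
    along the best warping path."""
    n, m = len(s), len(q)
    table = [[math.inf] * (m + 1) for _ in range(n + 1)]
    table[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(table[i - 1][j], table[i][j - 1], table[i - 1][j - 1])
            table[i][j] = max(abs(s[i - 1] - q[j - 1]), best_prev)
    return table[n][m]

# Empirical check on random sequences of random lengths (seeded for repeatability).
rng = random.Random(0)
pairs = [([rng.uniform(0, 10) for _ in range(rng.randint(1, 8))],
          [rng.uniform(0, 10) for _ in range(rng.randint(1, 8))])
         for _ in range(200)]
holds = all(d_ft(features(s), features(q)) <= dtw_max(s, q) for s, q in pairs)
```

Because the feature distance never exceeds the warping distance in this check, discarding a sequence whose feature distance is above the tolerance cannot cause a false dismissal. Stretching a sequence along the time axis leaves its first, last, maximum, and minimum values unchanged, which is why the features are warping invariant.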

18
Proposed Approach (4)
  • Indexing
  • Build a multi-dimensional index from a set of
    feature vectors
  • Index entry = ⟨First(S), Last(S), Max(S), Min(S),
    Identifier(S)⟩
  • Query Processing
  • Extract a feature vector F(Q)
  • Perform range queries in index space to find data
    points included in the following query rectangle
  • ([First(Q) − ε, First(Q) + ε], [Last(Q) − ε,
    Last(Q) + ε],
  • [Max(Q) − ε, Max(Q) + ε], [Min(Q) − ε,
    Min(Q) + ε])
  • Perform post-processing to discard false alarms
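
The filter-and-refine steps above can be sketched as follows. Two things here are assumptions made for illustration: a plain list scanned linearly stands in for the R-tree, and the exact distance is the L∞ form of time warping (maximum element-pair difference along the best path). The data sequences and identifiers are made up.

```python
import math

def features(s):
    return (s[0], s[-1], max(s), min(s))

def dtw_max(s, q):
    """Assumed L-infinity form of the time warping distance."""
    n, m = len(s), len(q)
    table = [[math.inf] * (m + 1) for _ in range(n + 1)]
    table[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(table[i - 1][j], table[i][j - 1], table[i - 1][j - 1])
            table[i][j] = max(abs(s[i - 1] - q[j - 1]), best_prev)
    return table[n][m]

def build_index(database):
    """Entries (First, Last, Max, Min, identifier); a list stands in for the R-tree."""
    return [features(s) + (ident,) for ident, s in database.items()]

def range_query(index, q, eps):
    """Identifiers whose feature point lies inside the query rectangle."""
    fq = features(q)
    return [entry[4] for entry in index
            if all(fq[k] - eps <= entry[k] <= fq[k] + eps for k in range(4))]

def search(database, index, q, eps):
    candidates = range_query(index, q, eps)
    # Post-processing: the exact distance discards false alarms.
    return [ident for ident in candidates if dtw_max(database[ident], q) <= eps]

database = {                      # hypothetical data sequences
    "s1": [10, 15, 16, 20],
    "s2": [10, 30, 20],
    "s3": [10, 20, 10, 20],
}
index = build_index(database)
q = [10, 15, 20]
candidates = range_query(index, q, 2)
answers = search(database, index, q, 2)
```

In this toy run, `s2` is rejected by the rectangle alone, `s3` passes the filter but is dropped in post-processing (a false alarm), and `s1` is returned.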

19
Performance Evaluation
  • Implementation
  • Implemented with C on UNIX operating system
  • R-tree is used as a multi-dimensional index.
  • Experimental Setup
  • S&P 500 stock data set (m = 545, L = 232)
  • Random walk synthetic data set
  • SunSparc Ultra-5

20
Performance Evaluation (2)
  • Filtering Ratio
  • Better than LB-Scan

21
Performance Evaluation (3)
  • Query Processing Time
  • Faster than LB-Scan and Naïve-Scan

22
Contents
  • Introduction
  • Whole Sequence Searches
  • Subsequence Searches
  • Segment-Based Subsequence Searches
  • Multi-Dimensional Subsequence Searches
  • Conclusion

23
Problem Definition
  • Input
  • Set of data sequences S
  • Query sequence q
  • Distance tolerance ε
  • Output
  • Set of subsequences whose distances to q are
    within ε
  • Similarity Measure
  • Time warping distance function, DTW
  • Any Lp metric as a distance function for element
    pairs

24
Previous Approaches
  • Naïve-Scan [Ber96]
  • Read every data subsequence from database
  • Apply dynamic programming technique
  • For m data sequences with average length L,
    O(mL²|q|)

25
Previous Approaches (2)
  • ST-Index [Fal94]
  • Assume that the minimum query length (w) is known
    in advance.
  • Place a sliding window of size w at every
    possible location
  • Extract a feature vector inside the window
  • Map each feature vector into a point and group
    trails into MBRs (Minimum Bounding Rectangles)
  • Use Euclidean distance in index space for
    filtering
  • Cannot guarantee the absence of false dismissals

26
Proposed Approach
  • Goal
  • No false dismissal
  • High performance
  • Support diverse similarity measures
  • Sketch
  • Convert into sequences of discrete symbols
  • Build a sparse suffix tree
  • Use a lower-bound distance function for filtering
  • Apply branch-pruning to reduce the search space

27
Proposed Approach (2)
  • Conversion
  • Generate categories from the distribution of
    element values
  • Maximum-entropy method
  • Equal-interval method
  • DISC method
  • Convert element to the symbol of the
    corresponding category
  • Example
  • A = [0, 1.0], B = [1.1, 2.0], C = [2.1, 3.0], D =
    [3.1, 4.0]
  • S = ⟨1.3, 1.6, 2.9, 3.3, 1.5, 0.1⟩
  • SC = ⟨B, B, C, D, B, A⟩
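
The conversion step can be sketched with the four category ranges of the example. The function names are hypothetical, and the category boundaries are simply hard-coded here; how they would be derived (maximum-entropy, equal-interval, or DISC) is outside this sketch.

```python
# Category ranges copied from the slide's example: symbol, lower bound, upper bound.
CATEGORIES = [("A", 0.0, 1.0), ("B", 1.1, 2.0), ("C", 2.1, 3.0), ("D", 3.1, 4.0)]

def to_symbol(value, categories=CATEGORIES):
    """Return the symbol of the category whose range contains value."""
    for symbol, low, high in categories:
        if low <= value <= high:
            return symbol
    raise ValueError(f"{value} falls outside every category")

def to_symbols(sequence, categories=CATEGORIES):
    return [to_symbol(v, categories) for v in sequence]

sc = to_symbols([1.3, 1.6, 2.9, 3.3, 1.5, 0.1])
```

Running it on the slide's S reproduces SC = ⟨B, B, C, D, B, A⟩. Values that fall between two listed ranges (for example 1.05) raise an error in this sketch, since the example ranges leave small gaps.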

28
Proposed Approach (3)
  • Indexing
  • Extract suffixes from sequences of discrete
    symbols.
  • Example
  • From S1C = ⟨A, B, B, A⟩,
  • we extract four suffixes: ABBA, BBA, BA, A
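
Suffix extraction is a one-liner; the sketch below uses a Python string of symbols as a stand-in for the symbol sequence (the function name is hypothetical).

```python
def suffixes(symbols):
    """Every suffix of a symbol sequence, longest first."""
    return [symbols[i:] for i in range(len(symbols))]

result = suffixes("ABBA")
```

For S1C this yields ABBA, BBA, BA, and A, one suffix per starting position.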

29
Proposed Approach (4)
  • Indexing (2)
  • Build a suffix tree.
  • The suffix tree was originally proposed to
    retrieve substrings that exactly match the query
    string.
  • A suffix tree consists of nodes and edges.
  • Each suffix is represented by the path from the
    root node to a leaf node.
  • Labels on the path from the root to the internal
    node Ni represent the longest common prefix of
    the suffixes under Ni.
  • The suffix tree is built with O(mL) computation
    and space complexity.

30
Proposed Approach (4)
  • Indexing (3)
  • Example suffix tree from S1C = ⟨A, B, B, A⟩ and
    S2C = ⟨A, B⟩

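[Figure: suffix tree for S1C and S2C. Edges are labeled with the symbols A and B; leaves are labeled with the suffixes S1C[1:-], S1C[2:-], S1C[3:-], S1C[4:-], S2C[1:-], and S2C[2:-].]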
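
The structure can be approximated with a plain suffix trie, sketched below for the two example sequences. This is a simplification, not the compact suffix tree of the slides: it keeps one node per symbol and therefore does not reach the O(mL) size bound. The dictionary layout and function names are assumptions made for illustration.

```python
def build_suffix_trie(sequences):
    """sequences: dict of name -> string of symbols.
    Each node is {'children': {symbol: node}, 'ends': [(name, start), ...]}."""
    root = {"children": {}, "ends": []}
    for name, symbols in sequences.items():
        for start in range(len(symbols)):
            node = root
            for symbol in symbols[start:]:
                node = node["children"].setdefault(
                    symbol, {"children": {}, "ends": []})
            node["ends"].append((name, start + 1))   # 1-based start position
    return root

def find(root, pattern):
    """(name, start) of every exact occurrence of pattern."""
    node = root
    for symbol in pattern:
        if symbol not in node["children"]:
            return []
        node = node["children"][symbol]
    found, stack = [], [node]
    while stack:                      # collect every suffix below the matched node
        current = stack.pop()
        found.extend(current["ends"])
        stack.extend(current["children"].values())
    return sorted(found)

trie = build_suffix_trie({"S1C": "ABBA", "S2C": "AB"})
occurrences = find(trie, "AB")
```

Both sequences share the path A, B from the root, so a single lookup for "AB" reports an occurrence in each. This sharing of common prefixes is what later allows distance tables to be shared during index searching.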
31
Proposed Approach (5)
  • Query Processing

[Figure: query processing flow. The query (q, ε) goes to index searching over the suffix tree, which produces candidates; post-processing against the data sequences produces the answers.]
32
Proposed Approach (6)
  • Index Searching
  • Visit each node of suffix tree by depth-first
    traversal.
  • Build lower-bound distance table for q and edge
    labels.
  • Inspect the last columns of newly added rows to
    find candidates.
  • Apply branch-pruning to reduce the search space.
  • Branch-pruning theorem
  • If all columns of the last row of the distance
    table have values larger than the distance
    tolerance ε, adding more rows to this table
    cannot yield new values less than or equal to ε.
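
The row-by-row table and the pruning rule can be sketched for a single root-to-leaf path. This is a simplified illustration, not the full tree traversal: categories are given as (min, max) ranges, the base distance between a range and a value is the smallest difference the range allows, and all names are hypothetical.

```python
import math

def base_lb(category, v, p=1):
    """Smallest possible |x - v|**p for any x inside the category range."""
    low, high = category
    if v < low:
        return (low - v) ** p
    if v > high:
        return (v - high) ** p
    return 0.0

def next_row(prev_row, category, q, p=1):
    """Add one table row for one more symbol on the path.
    prev_row[j] bounds the distance between the symbols read so far and q[:j]."""
    row = [math.inf]                       # non-empty data vs. empty query
    for j in range(1, len(q) + 1):
        best_prev = min(prev_row[j], row[j - 1], prev_row[j - 1])
        row.append(base_lb(category, q[j - 1], p) + best_prev)
    return row

def scan_path(path, q, eps, p=1):
    """Return (candidate prefix lengths, number of rows actually built)."""
    row = [0.0] + [math.inf] * len(q)      # table before any symbol is read
    candidates, rows_built = [], 0
    for depth, category in enumerate(path, start=1):
        row = next_row(row, category, q, p)
        rows_built += 1
        if row[-1] <= eps:                 # last column covers the whole query
            candidates.append(depth)
        if min(row[1:]) > eps:             # branch pruning: deeper rows only grow
            break
    return candidates, rows_built

A, B, D = (0.0, 1.0), (1.1, 2.0), (3.1, 4.0)
pruned = scan_path([D, D, B, A], [2, 2, 1], 1.5)
kept = scan_path([B, A], [2, 2, 1], 1.5)
```

With q = ⟨2, 2, 1⟩ and ε = 1.5, the path starting with D, D is abandoned after two rows because every entry of the second row exceeds ε, while the path B, A yields candidates at both depths. Pruning is safe because each new entry is a non-negative base distance added to an entry of the previous row or of the current row.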

33
Proposed Approach (7)
  • Index Searching (2)
  • Example: q = ⟨2, 2, 1⟩, ε = 1.5

[Figure: distance tables built while traversing suffix tree nodes N1 to N4 along edges labeled A, B, and D; each table has the elements of q as columns and the edge symbols as rows.]
34
Proposed Approach (8)
  • Lower-Bound Distance Function DTW-LB

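$$
\mathrm{DBASE\text{-}LB}(A, v) =
\begin{cases}
0 & \text{if } v \text{ is within the range of } A \ (A.\mathrm{min} \le v \le A.\mathrm{max}) \\
(A.\mathrm{min} - v)^{P} & \text{if } v \text{ is smaller than } A.\mathrm{min} \\
(v - A.\mathrm{max})^{P} & \text{if } v \text{ is larger than } A.\mathrm{max}
\end{cases}
$$
[Figure: the three positions of v relative to the range [A.min, A.max], with possible minimum distances 0, (A.min − v)^P, and (v − A.max)^P.]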
35
Proposed Approach (9)
  • Lower-Bound Distance Function DTW-LB (2)
  • satisfies the lower-bounding theorem
  • DTW-LB(sC, q) > ε ⇒ DTW(s, q) > ε
  • computation complexity O(|sC||q|)

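$$\mathrm{DTW\text{-}LB}(sC, q) = \mathrm{DBASE\text{-}LB}(sC[1], q[1]) + \min\{\mathrm{DTW\text{-}LB}(sC, q[2{:}-]),\ \mathrm{DTW\text{-}LB}(sC[2{:}-], q),\ \mathrm{DTW\text{-}LB}(sC[2{:}-], q[2{:}-])\}$$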
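
The lower-bounding property can be checked empirically with the sketch below. It reuses the four category ranges from the conversion example and draws random values on a 0.1 grid so that every value falls in some category; the helper names and the shared `warp` routine are assumptions made for illustration.

```python
import math
import random

CATEGORIES = {"A": (0.0, 1.0), "B": (1.1, 2.0), "C": (2.1, 3.0), "D": (3.1, 4.0)}

def to_symbol(v):
    for symbol, (low, high) in CATEGORIES.items():
        if low <= v <= high:
            return symbol
    raise ValueError(v)

def base_lb(symbol, v, p=1):
    """DBASE-LB: distance from v to the nearest point of the symbol's range."""
    low, high = CATEGORIES[symbol]
    if v < low:
        return (low - v) ** p
    if v > high:
        return (v - high) ** p
    return 0.0

def warp(rows, q, base):
    """Cumulative warping table shared by DTW and DTW-LB."""
    n, m = len(rows), len(q)
    table = [[math.inf] * (m + 1) for _ in range(n + 1)]
    table[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            table[i][j] = base(rows[i - 1], q[j - 1]) + min(
                table[i - 1][j], table[i][j - 1], table[i - 1][j - 1])
    return table[n][m]

def dtw(s, q, p=1):
    return warp(s, q, lambda a, b: abs(a - b) ** p)

def dtw_lb(sc, q, p=1):
    return warp(sc, q, lambda symbol, v: base_lb(symbol, v, p))

rng = random.Random(1)
def random_sequence():
    return [rng.randint(0, 40) / 10 for _ in range(rng.randint(1, 8))]

trials = [(random_sequence(), random_sequence()) for _ in range(200)]
holds = all(dtw_lb([to_symbol(v) for v in s], q) <= dtw(s, q) for s, q in trials)
```

Since every element-pair term of DTW-LB is at most the corresponding term of DTW, the bound carries through the whole table, so filtering on DTW-LB cannot drop a true answer.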
36
Proposed Approach (10)
  • Computation Complexity
  • m is the number of data sequences.
  • L is the average length of data sequences.
  • The left expression is for index searching.
  • The right expression is for post-processing.
  • RP (≥ 1) is the reduction factor by
    branch-pruning.
  • RD (≥ 1) is the reduction factor by sharing
    distance tables.
  • n is the number of subsequences requiring
    post-processing.

37
Proposed Approach (11)
  • Sparse Indexing
  • The index size is linear in the number of
    suffixes stored.
  • To reduce the index size, we build a sparse
    suffix tree (SST).
  • That is, we store the suffix SC[i:-] only if
    SC[i] ≠ SC[i−1].
  • Compaction Ratio
  • Example
  • SC = ⟨A, A, A, A, C, B, B⟩
  • store only three suffixes (SC[1:-], SC[5:-], and
    SC[6:-])
  • compaction ratio C = 7/3
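
The storage rule and the compaction ratio can be sketched as below (hypothetical function name; a string of symbols stands in for SC, and the first suffix is always stored because it has no predecessor).

```python
from fractions import Fraction

def stored_suffix_starts(symbols):
    """1-based start positions i whose symbol differs from the previous one."""
    return [i + 1 for i in range(len(symbols))
            if i == 0 or symbols[i] != symbols[i - 1]]

sc = "AAAACBB"
starts = stored_suffix_starts(sc)
ratio = Fraction(len(sc), len(starts))   # all suffixes / stored suffixes
```

For the example sequence this keeps the suffixes starting at positions 1, 5, and 6, giving the compaction ratio 7/3.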

38
Proposed Approach (12)
  • Sparse Indexing (2)
  • When traversing the suffix tree, we need to find
    non-stored suffixes and compute their distances
    to q.
  • Assume that the first k elements of sC have the
    same value.
  • Then, sC[1:-] is stored but sC[i:-] (i = 2, 3, ..., k)
    is not stored.
  • For non-stored suffixes,
  • we introduce another lower-bound distance
    function.
  • DTW-LB2(sC[i:-], q) = DTW-LB(sC, q) − (i − 1)
    × DBASE-LB(sC[1], q[1])
  • DTW-LB2 satisfies the lower-bounding theorem.
  • DTW-LB2 is O(1) when DTW-LB(sC, q) is given.

39
Proposed Approach (13)
  • Sparse Indexing (3)
  • With sparse indexing, the complexity becomes
  • m is the number of data sequences.
  • L is the average length of data sequences.
  • C is the compaction ratio.
  • n is the number of subsequences requiring
    post-processing.
  • RP (≥ 1) is the reduction factor by
    branch-pruning.
  • RD (≥ 1) is the reduction factor by sharing
    distance tables.

40
Performance Evaluation
  • Implementation
  • Implemented with C on UNIX operating system
  • Experimental Setup
  • S&P 500 stock data set (m = 545, L = 232)
  • Random walk synthetic data set
  • Maximum-Entropy (ME) categorization
  • Disk-based suffix tree construction algorithm
  • SunSparc Ultra-5

41
Performance Evaluation (2)
  • Comparison with Naïve-Scan
  • increasing distance-tolerances
  • S&P 500 stock data set, |q| = 20

42
Performance Evaluation (3)
  • Scalability Test
  • increasing average length of data sequences
  • random-walk data set, |q| = 20, m = 200

43
Performance Evaluation (4)
  • Scalability Test (2)
  • increasing total number of data sequences
  • random-walk data set, |q| = 20, L = 200

44
Contents
  • Introduction
  • Whole Sequence Searches
  • Subsequence Searches
  • Segment-Based Subsequence Searches
  • Multi-Dimensional Subsequence Searches
  • Conclusion

45
Introduction
  • We extend the proposed subsequence searching
    method to large sequence databases.
  • In the retrieval of similar subsequences with
    time warping distance function,
  • Sequential scanning is O(mL²|q|).
  • The proposed method is O(mL²|q| / R) (R ≥ 1).
  • The quadratic dependence on L makes both search
    algorithms degrade severely when L is very large.
  • For a database with long sequences, we need a new
    searching scheme that is linear in L.

46
SBASS
  • We propose a new searching scheme, the
    Segment-Based Subsequence Searching scheme (SBASS).
  • Sequences are divided into a series of piece-wise
    segments.
  • When a query sequence q with k segments is
    submitted, q is compared with those subsequences
    which consist of k consecutive data segments.
  • The lengths of segments may be different.
  • SS represents the segmented sequence of S.
  • S = ⟨4,5,8,9,11,8,4,3⟩, |S| = 8
  • SS = ⟨⟨4,5,8,9,11⟩, ⟨8,4,3⟩⟩, |SS| = 2

47
SBASS (2)
  • Only four subsequences of SS are compared with
    qS.
  • ⟨SS[1],SS[2]⟩, ⟨SS[2],SS[3]⟩, ⟨SS[3],SS[4]⟩,
    ⟨SS[4],SS[5]⟩

[Figure: data sequence S divided into the segments SS[1] to SS[5] of SS, and query qS with the two segments qS[1] and qS[2].]
48
SBASS (3)
  • For the SBASS scheme, we define the piece-wise
    time warping distance function (where k = |qS| =
    |sS|).
  • Sequential scanning for the SBASS scheme is
    O(mL|q|).
  • We introduce an indexing technique with
    O(mL|q|/R) (R ≥ 1).

49
Sketch of Proposed Approach
  • Indexing
  • Convert sequences to segmented sequences.
  • Extract a feature vector from each segment.
  • Categorize feature vectors.
  • Convert segmented sequences to sequences of
    symbols.
  • Construct suffix tree from sequences of symbols.
  • Query Processing
  • Traverse the suffix tree to find candidates.
  • Discard false alarms in post processing.

50
Segmentation
  • Approach
  • Divide at peak points.
  • Divide further if maximum deviation from
    interpolation line is too large.
  • Eliminate noise.
  • Compaction Ratio (C) = |S| / |SS|

[Figure: example sequence with a point of too-large deviation and a noisy region marked.]
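
The first step of the approach, dividing at peak points, can be sketched as below. Two simplifications are assumed: both local maxima and local minima are treated as peak points, and the further-division and noise-elimination steps are omitted. The function name is hypothetical.

```python
def segment_at_turning_points(sequence):
    """Cut the sequence wherever the direction of change reverses.
    The turning point itself closes the current segment."""
    segments, current, trend = [], [sequence[0]], 0
    for previous, value in zip(sequence, sequence[1:]):
        step = (value > previous) - (value < previous)   # +1 up, -1 down, 0 flat
        if step != 0 and trend != 0 and step != trend:
            segments.append(current)                      # direction reversed
            current = []
        current.append(value)
        if step != 0:
            trend = step                                  # flat runs keep the trend
    segments.append(current)
    return segments

s = [4, 5, 8, 9, 11, 8, 4, 3]
ss = segment_at_turning_points(s)
compaction_ratio = len(s) / len(ss)      # |S| / |SS|
```

On the SBASS example this produces the two segments ⟨4,5,8,9,11⟩ and ⟨8,4,3⟩, so the compaction ratio is 8 / 2 = 4.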
51
Feature Extraction
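  • From each subsequence segment, extract a feature
    vector
  • (V1, VL, L, δ+, δ−)

[Figure: a segment annotated with its first value V1, last value VL, length L, and the deviations δ+ and δ−.]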
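
A possible reading of the five features is sketched below. The interpretation of the last two components is an assumption inferred from the deck's worked example, not stated on the slide: they are taken to be the largest deviations above and below the straight line interpolated between the first and last values of the segment. The function name is hypothetical.

```python
def segment_features(segment):
    """(first value, last value, length, max deviation above the
    interpolation line, max deviation below it). The last two meanings
    are an assumption inferred from the slides' example."""
    first, last, length = segment[0], segment[-1], len(segment)
    slope = (last - first) / (length - 1) if length > 1 else 0.0
    deviations = [value - (first + slope * i) for i, value in enumerate(segment)]
    above = max(0.0, max(deviations))
    below = max(0.0, -min(deviations))
    return (first, last, length, above, below)

f1 = segment_features([4, 5, 8, 8, 8, 8, 9, 11])
f2 = segment_features([8, 4, 3])
```

Under this reading, the two segments used in the categorization example give (4, 11, 8, 2, 1) and (8, 3, 3, 0, 1.5), which agrees with the feature vectors listed there.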
52
Categorization and Index Construction
  • Categorization
  • Group similar feature vectors together using
    multi-dimensional categorization methods like
    Multi-attribute Type Abstraction Hierarchy
    (MTAH).
  • Assign unique symbol to each category
  • Convert segmented sequences to sequences of
    symbols.
  • S = ⟨4,5,8,8,8,8,9,11,8,4,3⟩
  • SS = ⟨⟨4,5,8,8,8,8,9,11⟩, ⟨8,4,3⟩⟩
  • SF = ⟨(4,11,8,2,1), (8,3,3,0,1.5)⟩
  • SC = ⟨A,B⟩
  • From sequences of symbols, construct the suffix
    tree.

53
Query Processing
  • For query processing, we calculate lower-bound
    distances between symbols and keep them in a table.
  • Given the query sequence q and the distance
    tolerance ε,
  • Convert q to qS and then to qC.
  • Search the suffix tree to find those subsequences
    whose lower-bound distances to qC are within ε.
  • Discard false alarms in post processing.

54
Query Processing (2)
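[Figure: query processing flow. The query (q, ε) is converted to qS and then to qC; index searching over the suffix tree produces candidates; post-processing against the data sequences produces the answers.]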
55
Computation Complexity
  • Sequential scanning is O(mL|q|).
  • Complexity of the proposed search algorithm is
  • n is the number of subsequences contained in
    candidates.
  • C is the compaction ratio or the average number
    of elements in segments.
  • RD (≥ 1) is the reduction factor by sharing edges
    of the suffix tree.

56
Performance Evaluation
  • Test set: pseudo-periodic synthetic sequences
  • m = 100, L = 10,000
  • Achieved up to 6.5 times speed-up compared to
    sequential scanning.

[Figure: query time (sec, 10 to 60) versus distance tolerance (0.2 to 1.0) for SeqScan and our approach.]
57
Contents
  • Introduction
  • Whole Sequence Searches
  • Subsequence Searches
  • Segment-Based Subsequence Searches
  • Multi-Dimensional Subsequence Searches
  • Conclusion

58
Introduction
  • So far, we assumed that elements have
    single-dimensional numeric values.
  • Now, we consider multi-dimensional sequences.
  • Image Sequences
  • Video Streams

Medical Image Sequence
59
Introduction (2)
  • In multi-dimensional sequences, elements are
    represented by feature vectors.
  • S = ⟨S[1], ..., S[N]⟩, S[i] = (S[i][1], ...,
    S[i][F])
  • Our proposed subsequence searching techniques are
    extended to the retrieval of similar
    multi-dimensional subsequences.

60
Introduction (3)
  • Multi-Dimensional Time Warping Distance
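  • $\mathrm{DMTW}(S, Q) = \mathrm{DMBASE}(S[1], Q[1]) + \min\{\mathrm{DMTW}(S, Q[2{:}-]),\ \mathrm{DMTW}(S[2{:}-], Q),\ \mathrm{DMTW}(S[2{:}-], Q[2{:}-])\}$
  • $\mathrm{DMBASE}(S[1], Q[1]) = \sum_{i=1}^{F} \left( W_i \times |S[1][i] - Q[1][i]| \right)$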
  • F is the number of features in each element.
  • Wi is the weight of i-th dimension.
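
The multi-dimensional distance can be sketched as below. The base distance is assumed to be a weighted sum of absolute per-feature differences; the exact form on the original slide is not fully legible in this transcript. The weights reuse the 0.3 / 0.1 / 0.6 split that the deck gives for Size, Circularity, and DistFromLV; the element values are made up.

```python
import math

def weighted_base(a, b, weights):
    """Assumed base distance: weighted sum of absolute feature differences."""
    return sum(w * abs(x - y) for w, x, y in zip(weights, a, b))

def dmtw(s, q, weights):
    """Time warping distance over sequences of feature vectors."""
    n, m = len(s), len(q)
    table = [[math.inf] * (m + 1) for _ in range(n + 1)]
    table[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            table[i][j] = weighted_base(s[i - 1], q[j - 1], weights) + min(
                table[i - 1][j], table[i][j - 1], table[i - 1][j - 1])
    return table[n][m]

weights = (0.3, 0.1, 0.6)                       # Size, Circularity, DistFromLV
s = [(10, 0.5, 3), (12, 0.5, 3)]                # hypothetical feature vectors
q = [(10, 0.5, 3), (10, 0.5, 3), (12, 0.5, 4)]
distance = dmtw(s, q, weights)
```

Only the element-pair distance changes relative to the one-dimensional case; the warping recurrence itself is the same, which is why the suffix-tree search carries over once elements are categorized into symbols.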

61
Sketch of Our Approach
  • Indexing
  • Categorize multi-dimensional element values using
    MTAH.
  • Assign unique symbols to categories.
  • Convert multi-dimensional sequences into
    sequences of symbols.
  • Construct suffix tree from a set of sequences of
    symbols.
  • Query Processing
  • Traverse suffix tree.
  • Find candidates whose lower-bound distances to q
    are within ε.
  • Do post processing to discard false alarms.

62
Application to KMeD
  • In the environment of KMeD, the proposed
    technique is applied to the retrieval of medical
    image sequences having similar spatio-temporal
    characteristics to those of the query sequence.
  • KMeD [CCT95] has the following features:
  • Query by both image and alphanumeric contents
  • Model temporal, spatial and evolutionary nature
    of objects
  • Formulate queries using conceptual and imprecise
    terms
  • Support cooperative processing

63
Application to KMeD (2)
  • Query
  • Medical Image Sequence
  • Attribute names and their relative weights
  • Distance tolerance

Size (0.3)
Circularity (0.1)
DistFromLV (0.6)
64
Application to KMeD (3)
Query
Query Analysis
User Model
Contour Extraction
Feature Extraction
Distance Function
matching seq.
Visual Presentation
Similarity Searches
feedback
medical image seq.
index structure
65
Contents
  • Introduction
  • Whole Sequence Searches
  • Subsequence Searches
  • Segment-Based Subsequence Searches
  • Multi-Dimensional Subsequence Searches
  • Conclusion

66
Summary
  • A sequence is an ordered list of elements.
  • Similarity search helps in clustering and data
    mining.
  • For sequences of different lengths or different
    sampling rates, time warping distance is useful.
  • We proposed a whole sequence searching method
    with a spatial access method and a lower-bound
    distance function.
  • We proposed a subsequence searching method with
    a suffix tree and lower-bound distance functions.
  • We proposed a segment-based subsequence
    searching method for large sequence databases.
  • We extended the subsequence searching method to
    the retrieval of similar multi-dimensional
    subsequences.

67
Contribution
  • We proposed a tighter and faster lower-bound
    distance function for efficient whole sequence
    searches without false dismissal.
  • We demonstrated the feasibility of using time
    warping similarity measure on a suffix tree.
  • We introduced the branch pruning theorem and the
    fast lower-bound distance function for efficient
    subsequence searches without false dismissal.
  • We applied categorization and sparse indexing for
    scalability.
  • We applied the proposed technique to a real
    application (KMeD).