Title: Fast Subsequence Matching in Timeseries Databases
1Fast Subsequence Matching in Time-series Databases
- C. Faloutsos, M. Ranganathan, and Y. Manolopoulos
- University of Maryland, Department of Computer Science and Institute for Systems Research
- In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 419--429, Minneapolis, May 1994.
- Presented by Kathy Gray, Barkha Raisoni
2Overview
- The Problem
- Background and Related Work
- Main Approach
- Subsequence Matching
- Performance Results
- Summary
3The Problem
- Business, Financial, Stock
  - find companies whose stock prices move similarly
  - find other companies whose sales patterns are similar to those of our company's product
- Scientific
  - find past days in which the solar magnetic wind showed patterns similar to today's (predictions of the earth's magnetic field)
4Problem Overview
- Need fast searching methods that search a database of real-valued time series to locate subsequences that are similar to a query subsequence
- fast and correct
- small space overhead
- dynamic
- handle data sequences of varying length
5Similarity Queries
- Whole Matching
  - Given N data sequences of real numbers S1, S2, ..., SN and a query sequence Q, we want to find those data sequences that are within a certain tolerance ε (distance from Q)
  - Use a distance-preserving transform, such as the Discrete Fourier Transform (DFT), to extract f features from each sequence (i.e., the first f DFT coefficients), thus mapping it into a point in the f-dimensional feature space
  - then use any spatial access method (e.g., R-trees)
  - exploits the assumption that data and query sequences have the same length
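The whole-matching steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the DFT uses orthonormal 1/sqrt(n) scaling so that Euclidean distances are preserved, a linear scan over feature points stands in for the spatial access method, and the names `dft_features`, `euclidean`, and `whole_match` are hypothetical.

```python
import cmath
import math

def dft_features(seq, f):
    """First f DFT coefficients of seq, with orthonormal 1/sqrt(n) scaling (assumed)."""
    n = len(seq)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(seq)) / math.sqrt(n)
        for k in range(f)
    ]

def euclidean(a, b):
    """Euclidean distance; works for real sequences and complex feature vectors."""
    return math.sqrt(sum(abs(x - y) ** 2 for x, y in zip(a, b)))

def whole_match(data, query, eps, f=2):
    """Filter in feature space, then verify candidates on the raw sequences."""
    qf = dft_features(query, f)
    # Filter step: a linear scan stands in for the spatial index here.
    candidates = [s for s in data if euclidean(dft_features(s, f), qf) <= eps]
    # Post-processing step: discard false alarms using the true distance.
    return [s for s in candidates if euclidean(s, query) <= eps]
```

Because the feature distance never exceeds the true distance under this scaling, the filter step may admit false alarms but should not drop a true match.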
6Similarity Queries
- Subsequence Matching
  - Given a collection of N real-valued sequences of varying length, S1, S2, ..., SN
  - User specifies a query subsequence Q of variable length Len(Q) and a tolerance ε (maximum acceptable dissimilarity, or distance)
  - Find all sequences Si (1 ≤ i ≤ N), along with the correct offsets k, such that the subsequence Si[k : k+Len(Q)-1] matches the query subsequence
  - D(Q, Si[k : k+Len(Q)-1]) ≤ ε
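As a baseline for the problem statement above, here is a minimal Python sketch of the sequential scan that any indexed method has to beat. It assumes Euclidean distance and 0-based offsets; the function names are illustrative only.

```python
import math

def distance(a, b):
    """Euclidean distance between two equal-length real sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def sequential_scan(sequences, query, eps):
    """Return (i, k) pairs: sequence index i and 0-based offset k such that
    the window of length Len(Q) starting at k is within eps of the query."""
    m = len(query)
    hits = []
    for i, s in enumerate(sequences):
        # Sequences shorter than the query yield an empty range and are skipped.
        for k in range(len(s) - m + 1):
            if distance(query, s[k:k + m]) <= eps:
                hits.append((i, k))
    return hits
```

Every offset of every sequence is compared against the query, which is what makes this approach slow on large collections.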
7Related Work
- Indexing in text and DNA databases
  - can be viewed as 1-dimensional sequences
  - but consist of discrete symbols vs. continuous numbers
  - makes a difference in the feature extraction
- Queries on time sequences, color images, or 3-D brain scans (whole matching)
- F-index method
  - apply DFT
  - store first few numbers (DFT coefficients)
  - sequence mapped into a point in f-dimensional space
  - points are then organized in an R-tree
8Related Work (contd)
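- F-index should not result in false dismissals for range queries
- Condition to be satisfied
  - Dfeature(F(O1), F(O2)) ≤ D(O1, O2)
- where
  - O1, O2 are the objects being compared (e.g., the query and a qualifying object)
  - D is the actual distance between the objects
  - Dfeature is the Euclidean distance in feature space
  - F(O) is the feature vector of object O
- Note: proof is given in the paper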
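For the DFT-based features, a short derivation shows why the no-false-dismissal condition holds. It is a sketch under the assumption of an orthonormal DFT (scaling by 1/\sqrt{n}), with x, y two sequences of length n and X, Y their transforms; the symbols are introduced here for illustration.

```latex
\[
D^2(x, y) \;=\; \sum_{t=0}^{n-1} |x_t - y_t|^2
        \;=\; \sum_{k=0}^{n-1} |X_k - Y_k|^2
        \;\ge\; \sum_{k=0}^{f-1} |X_k - Y_k|^2
        \;=\; D_{feature}^2\bigl(F(x), F(y)\bigr)
\]
```

The first equality is the definition of Euclidean distance, the second is Parseval's theorem, and the inequality follows from dropping the non-negative terms for k ≥ f. Hence any object within the tolerance of the query in the original space is also within the tolerance in feature space.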
9Main Approach
- Generalize the whole-matching problem: answer approximate-match queries for subsequences of arbitrary length
- Map each data sequence into a small set of multidimensional rectangles in feature space
- These rectangles can then be indexed using spatial access methods, such as R-trees
- Small space overhead, orders-of-magnitude savings over sequential scan
10Sub-Trail (ST)-Index
- Assume that queries have a minimum duration w (e.g., w = 7 days)
- Divide data sequences into
  - sliding windows of width w
  - thus producing trails
  - i.e., a data sequence of length Len(S) is mapped to a trail in feature space of Len(S)-w+1 points
- Index these trails using the I-naive method
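The sliding-window mapping can be sketched in Python as below. This is an illustrative sketch, not the paper's code: it assumes the features are the first f DFT coefficients with orthonormal scaling, and the names `dft_features` and `trail` are hypothetical.

```python
import cmath
import math

def dft_features(window, f):
    """First f DFT coefficients of one window (orthonormal scaling assumed)."""
    n = len(window)
    return tuple(
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(window)) / math.sqrt(n)
        for k in range(f)
    )

def trail(seq, w, f=2):
    """Map seq to its trail: one feature point per sliding window of width w."""
    return [dft_features(seq[k:k + w], f) for k in range(len(seq) - w + 1)]
```

For example, a sequence of 10 values with w = 7 yields a trail of 10 - 7 + 1 = 4 points, one per window offset.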
11I-naive method
- Given a query of length w and tolerance ε
- Extract the features of the query and search the spatial access method with a range query of radius ε
- retrieved points correspond to promising sequences
- discard false alarms (outside the actual distance tolerance)
- Result: the complete desired answer set
12I-naive Inefficient
- Twice as slow as Sequential Scan!
- 1:f increase in storage requirements
- R-tree very tall and slow
- Solution
  - exploit the fact that successive points of a trail are similar
  - divide trail into sub-trails
  - represent each with its minimum bounding rectangle (MBR)
  - storage of only a few MBRs required!
13ST-Index Example
- MBRs belonging to the same trail may overlap
14ST-index features
- Map each data sequence into a set of rectangles in feature space
- Significant improvement with respect to space and response time
- We have to store for each MBR
  - t_start, t_end
  - Unique identifier for the data sequence (sequence_id)
  - Extent of the MBR in each dimension
  - (F1low, F1high, F2low, F2high)
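A possible in-memory shape for one stored MBR entry is sketched below in Python. The field names follow the slide (t_start, t_end, sequence_id, per-dimension extent); the class name `SubTrailMBR`, the helper `bounding_box`, and the use of real-valued feature tuples are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

@dataclass
class SubTrailMBR:
    sequence_id: int           # which data sequence the sub-trail belongs to
    t_start: int               # offset of the first window in the sub-trail
    t_end: int                 # offset of the last window in the sub-trail
    low: Tuple[float, ...]     # (F1low, F2low, ...)
    high: Tuple[float, ...]    # (F1high, F2high, ...)

def bounding_box(sequence_id: int, t_start: int,
                 points: Sequence[Tuple[float, ...]]) -> SubTrailMBR:
    """Enclose consecutive trail points, starting at offset t_start, in one MBR."""
    dims = range(len(points[0]))
    low = tuple(min(p[d] for p in points) for d in dims)
    high = tuple(max(p[d] for p in points) for d in dims)
    return SubTrailMBR(sequence_id, t_start, t_start + len(points) - 1, low, high)
```

Storing one such record per sub-trail, rather than one point per window, is where the space saving comes from.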
15ST-index Node Structure
- Level above leaves: each entry holds F1_min, F1_max, F2_min, F2_max
- Leaf level: each entry holds Sequence_id, T_start, T_end, F1_min, F1_max, F2_min, F2_max
16Now Barkha...
- Questions
  - Insertions: how to divide a sequence's trail in feature space into sub-trails
  - Queries: how to handle queries, especially those that are longer than w
- Performance Results
- Summary
17Sub-trail size
- Aim
  - To find an optimal way to divide a trail in feature space into sub-trails
- Simple solutions for sub-trails
  - Pack points into sub-trails according to a pre-determined fixed number. No optimal value!
  - Make the sub-trail size a function of the length of the stored sequence, e.g. √Len(S)
18Sub-trail size (contd.)
- Both of these methods show poor results
- I-fixed method
  - Index with sub-trails of fixed length
  - I-naïve method is a special case of I-fixed with sub-trail length set to 1
- I-adaptive method
  - Group points into sub-trails using a greedy algorithm
  - Uses a cost function that tries to estimate the number of disk accesses
19Example
- I-fixed method (sub-trails of fixed length 3) vs. I-adaptive method
20Algorithm Divide-to-Sub-trails
- Definition of marginal cost
  - Consider a sub-trail of k points with an MBR of sides L = (L1, L2, ..., Ln)
  - Then the marginal cost of each point in this sub-trail is
  - mc = DA(L)/k, where DA(L) is the estimated number of disk accesses
- Assign the first point of the trail to a (trivial) sub-trail
- FOR each successive point
  - IF it increases the marginal cost of the current sub-trail
  - THEN start another sub-trail
  - ELSE include it in the current sub-trail
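The greedy grouping can be sketched in Python as follows. The slide does not spell out the disk-access estimate, so this sketch assumes a product-form cost, the product of (side + 0.5) over all dimensions, for points in a feature space normalized to the unit cube; that formula and the names `da` and `divide_to_subtrails` are assumptions for illustration.

```python
def da(low, high):
    """Assumed disk-access estimate for an MBR: product of (side length + 0.5)."""
    cost = 1.0
    for lo, hi in zip(low, high):
        cost *= (hi - lo) + 0.5
    return cost

def divide_to_subtrails(points):
    """Greedily split a trail (list of feature points) into sub-trails."""
    if not points:
        return []
    subtrails = [[points[0]]]
    low, high = list(points[0]), list(points[0])
    for p in points[1:]:
        current = subtrails[-1]
        new_low = [min(a, b) for a, b in zip(low, p)]
        new_high = [max(a, b) for a, b in zip(high, p)]
        mc_old = da(low, high) / len(current)
        mc_new = da(new_low, new_high) / (len(current) + 1)
        if mc_new > mc_old:
            # Adding p would raise the marginal cost: start a new sub-trail.
            subtrails.append([p])
            low, high = list(p), list(p)
        else:
            current.append(p)
            low, high = new_low, new_high
    return subtrails
```

With this cost, nearby successive points are absorbed into the current sub-trail because the MBR barely grows, while a distant point inflates the MBR enough to trigger a new sub-trail.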
21Query length
- Query of length w
- Algorithm Search_Short
  - Query sequence is mapped to a point qf in feature space; the query region is a sphere around qf with radius ε
  - Retrieve the sub-trails whose MBRs intersect the query region, using the index
  - Examine the corresponding subsequences of the data sequences to discard the false alarms
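The MBR-intersection step of Search_Short can be illustrated with a small Python sketch. It assumes real-valued feature coordinates, represents each MBR as a plain tuple, and uses a linear scan in place of the spatial index; the function names are hypothetical, and the final verification against the raw data is left out.

```python
import math

def min_dist_to_box(q, low, high):
    """Smallest Euclidean distance from point q to an axis-aligned box."""
    total = 0.0
    for x, lo, hi in zip(q, low, high):
        if x < lo:
            total += (lo - x) ** 2
        elif x > hi:
            total += (x - hi) ** 2
    return math.sqrt(total)

def search_short(mbrs, qf, eps):
    """Return (sequence_id, t_start, t_end) for every MBR that the query sphere touches.

    mbrs: iterable of (sequence_id, t_start, t_end, low, high) tuples.
    """
    return [
        (sid, t_start, t_end)
        for sid, t_start, t_end, low, high in mbrs
        if min_dist_to_box(qf, low, high) <= eps
    ]
```

Each returned triple names a range of offsets in one data sequence; those subsequences would then be read and compared with the query to discard false alarms.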
22Query length (contd)
- Queries of length greater than w
  - Complicated, as the ST-index only knows subsequences of length w
- Solution proposed: Prefix Search
  - Select a subsequence of Q of length w (e.g. the prefix)
  - Use the ST-index to search for data subsequences that match the prefix
  - Returns a superset of the qualifying subsequences
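Prefix Search can be illustrated with the following Python sketch on a single data sequence. A brute-force window scan stands in for the ST-index lookup, Euclidean distance is assumed, and the names are illustrative rather than taken from the paper.

```python
import math

def distance(a, b):
    """Euclidean distance between two equal-length real sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def prefix_search(seq, query, w, eps):
    """Return (candidates, matches) as lists of 0-based offsets into seq."""
    m = len(query)
    prefix = query[:w]
    # Step 1: offsets whose first w values match the query prefix within eps.
    # (In the real method this set would come from the index, not a scan.)
    candidates = [k for k in range(len(seq) - m + 1)
                  if distance(prefix, seq[k:k + w]) <= eps]
    # Step 2: verify the full query at each candidate to drop false alarms.
    matches = [k for k in candidates if distance(query, seq[k:k + m]) <= eps]
    return candidates, matches
```

The candidate list is a superset of the final answer: every true match must also match on its prefix, but a prefix match may fail once the rest of the query is compared.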
23Query length (contd.)
- Lemma
  - If two sequences S and Q of the same length l agree within tolerance ε
  - Then any pair (S[i:j], Q[i:j]) of corresponding subsequences agrees within the same tolerance ε
  - D(S, Q) ≤ ε ⇒ D(S[i:j], Q[i:j]) ≤ ε (1 ≤ i ≤ j ≤ l)
- Note: proof is given in the paper
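For Euclidean distance, the reasoning behind the lemma fits in one line. This is a sketch assuming D is the Euclidean distance and s_t, q_t denote the t-th values of S and Q; it is not a reproduction of the paper's proof.

```latex
\[
D^2\bigl(S[i:j],\, Q[i:j]\bigr) \;=\; \sum_{t=i}^{j} (s_t - q_t)^2
\;\le\; \sum_{t=1}^{l} (s_t - q_t)^2 \;=\; D^2(S, Q) \;\le\; \varepsilon^2
\qquad (1 \le i \le j \le l)
\]
```

Restricting the sum to positions i through j only removes non-negative terms, so the distance between corresponding subsequences cannot exceed the distance between the full sequences.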
24Query length (contd.)
- Algorithm Search_Long (MultiPiece method)
  - Query sequence Q is broken into p sub-queries, corresponding to p spheres in feature space with radius ε/√p
  - ST-index is used to retrieve the sub-trails whose MBRs intersect at least one of the sub-query regions
  - Examine the corresponding subsequences of the data to discard the false alarms
- Method based on Lemma 3 of the paper
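The reduced radius can be motivated by a short argument, sketched here under the assumptions that D is the Euclidean distance and that Q and a matching data subsequence S' are each cut into p disjoint, aligned pieces Q_1, ..., Q_p and S'_1, ..., S'_p (notation introduced for illustration).

```latex
\[
\sum_{j=1}^{p} D^2(Q_j, S'_j) \;=\; D^2(Q, S') \;\le\; \varepsilon^2
\quad\Longrightarrow\quad
\min_{1 \le j \le p} D(Q_j, S'_j) \;\le\; \frac{\varepsilon}{\sqrt{p}}
\]
```

If every piece had squared distance greater than \varepsilon^2 / p, the sum over all p pieces would exceed \varepsilon^2, a contradiction. So at least one piece of any true match falls inside its sub-query sphere, and searching all p spheres does not miss a qualifying subsequence.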
25Performance Results
- Stock price sequence and its trail in the feature space of the 0th and 1st DFT coefficients
26Performance Results (contd..)
Index space vs. average sub-trail length
- I-fixed gives varying results depending on the length of its sub-trails
- I-naïve method: 24 MB, 2 times slower than sequential scanning!
- I-adaptive method: 5Kb
27Performance Results (contd.)
Relative response time of sequential scanning vs. proposed method
- Analysis for query length equal to w: Len(Q) = w = 512
- Proposed method achieves from 3 up to 100 times better response time for selectivities in the range from 10^-4 to 10
28Performance Results (contd..)
Relative wall clock time vs. selectivity in log-log scale
- Analysis for query length greater than w: Len(Q) = 512, w = 128
- I-adaptive method outperforms sequential scanning by a factor of 2 to 40
29Performance Results (contd.)
For random walk data, in log-log scale
- Points generated with a starting value of 1.5 and a step increment of 0.001
- Method outperforms sequential scanning by a factor of approximately 100 down to 10 for selectivities up to 10
30Summary
- Proposed idea maps data sequences into a set of boxes in feature space
- Method efficiently handles approximate and exact queries for subsequence matching
- Generalization of the whole-matching case
- Achieves orders of magnitude savings over sequential scanning
- Small space overhead, dynamic, provably correct
- Future work: extension of the method to
  - 2-dimensional gray-scale images, and then in general to n-dimensional vector fields