Fast Subsequence Matching in Timeseries Databases

About This Presentation

Title:

Fast Subsequence Matching in Timeseries Databases

Description:

on Management of Data, pages 419--429, Minneapolis, May 1994. presented by ... find companies whose stock prices move similarly ...

Transcript and Presenter's Notes

1
Fast Subsequence Matching in Time-series
Databases

C. Faloutsos, M. Ranganathan, and Y.
Manolopoulos
University of Maryland, Department of Computer
Science and Institute for Systems Research
In Proc. ACM SIGMOD Int. Conf. on Management of
Data, pages 419--429, Minneapolis, May 1994.
Presented by Kathy Gray, Barkha Raisoni

2
Overview

The Problem
Background and Related Work
Main Approach
Subsequence Matching
Performance Results
Summary

3
The Problem

Business, Financial, Stock
find companies whose stock prices move similarly
find other companies whose sales patterns are
similar to those of our company's product
Scientific
find past days in which the solar magnetic wind
showed patterns similar to today's (predictions of
the earth's magnetic field)

4
Problem Overview

Need fast searching methods that will search a
database with time-series of real numbers to
locate subsequences that are similar to a query
subsequence
fast and correct
small space overhead
dynamic
handle varying length data sequences

5
Similarity Queries

Whole Matching
Given N data sequences of real numbers
S1, S2, ..., SN and a query sequence Q, we want to
find those data sequences that are within a
certain tolerance ε of Q (ε is a distance from Q)
Use a distance-preserving transform, such as the
Discrete Fourier Transform (DFT), to extract f
features from each sequence (i.e., the first f DFT
coefficients), thus mapping sequences into points in
the f-dimensional feature space
Then use any spatial access method (e.g., R-trees)
Exploits the assumption that data and query sequences
have the same length
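The feature-extraction step can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `dft_features` and `euclidean` and the sample values are assumptions, and the DFT is written out with the standard library and scaled by 1/sqrt(n) so that it preserves Euclidean distance.

```python
import cmath
import math

def dft_features(seq, f):
    """Return the first f coefficients of the orthonormal DFT of seq."""
    n = len(seq)
    scale = 1.0 / math.sqrt(n)  # 1/sqrt(n) scaling keeps the transform distance preserving
    return [
        scale * sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(seq))
        for k in range(f)
    ]

def euclidean(a, b):
    """Euclidean distance between two equal-length real or complex vectors."""
    return math.sqrt(sum(abs(x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical data sequence and query of the same length (whole matching).
s = [1.0, 2.0, 4.0, 3.0, 5.0, 4.0, 6.0, 7.0]
q = [1.5, 2.5, 3.5, 3.0, 4.5, 4.5, 6.5, 6.0]

true_dist = euclidean(s, q)
feature_dist = euclidean(dft_features(s, 2), dft_features(q, 2))
print(feature_dist <= true_dist)  # the feature distance never exceeds the true distance
```

Because the feature distance can only underestimate the true distance, a range query in feature space returns a superset of the true answers, which are then verified against the raw sequences.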

6
Similarity Queries

Subsequence Matching
Given a collection of N sequences of real numbers
of varying lengths, S1, S2, ..., SN
User specifies a query subsequence Q of variable
length Len(Q) and a tolerance ε (maximum
acceptable dissimilarity, or distance)
Find all sequences Si (1 ≤ i ≤ N), along with the
correct offsets k, such that the subsequence
Si[k : k + Len(Q) - 1] matches the query subsequence:
D(Q, Si[k : k + Len(Q) - 1]) ≤ ε
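As a baseline, the problem statement translates directly into a sequential scan. The sketch below assumes Euclidean distance and uses 0-based sequence indices and offsets; `sequential_scan` and the sample data are illustrative names and values, not taken from the slides.

```python
import math

def distance(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def sequential_scan(sequences, query, eps):
    """Return all (i, k) such that the window of sequences[i] starting at
    offset k, of length len(query), is within eps of the query."""
    m = len(query)
    matches = []
    for i, s in enumerate(sequences):
        for k in range(len(s) - m + 1):  # every possible offset
            if distance(query, s[k:k + m]) <= eps:
                matches.append((i, k))
    return matches

data = [[1.0, 2.0, 3.0, 4.0, 5.0], [5.0, 1.0, 2.0, 3.1]]
print(sequential_scan(data, [1.0, 2.0, 3.0], 0.2))  # [(0, 0), (1, 1)]
```

Every offset of every sequence is compared against the query, which is the cost the index-based method is meant to avoid.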

7
Related Work

Indexing in text and DNA databases
can be viewed as 1-dimensional sequences
but consist of discrete symbols vs. continuous
numbers
makes a difference in the feature extraction
Queries on time sequences, color images, or 3-D
brain scans (whole matching)
F-index method
apply DFT
store first few numbers (DFT coefficients)
sequence mapped into a point in f-dimensional
space
points are then organized in R-tree

8
Related Work (contd)

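F-index should not result in false dismissals for
range queries
Condition to be satisfied:
Dfeature(F(O), F(Q)) ≤ D(O, Q)
where
O is the qualifying object and Q is the query
D is the distance between the objects themselves
Dfeature is the Euclidean distance in feature space
F(O) is the feature vector of O
Note: the proof is given in the paper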
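One way to see why keeping only the first f DFT coefficients cannot cause false dismissals is Parseval's theorem. The derivation below is a sketch under two assumptions: D is the Euclidean distance, and the DFT is scaled by 1/sqrt(n). The symbols x_t, q_t (samples of the object O and the query Q) and X_k, Q_k (their DFT coefficients) are introduced here for illustration; the paper's own proof may be phrased differently.

```latex
\[
D^2(O, Q) \;=\; \sum_{t=0}^{n-1} |x_t - q_t|^2
          \;=\; \sum_{k=0}^{n-1} |X_k - Q_k|^2
          \;\ge\; \sum_{k=0}^{f-1} |X_k - Q_k|^2
          \;=\; D_{feature}^2\bigl(F(O), F(Q)\bigr)
\]
```

Dropping coefficients can only shrink the distance, so any object within ε of the query in the original space is also within ε of it in feature space.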

9
Main Approach

Generalize the whole-matching problem: answer
approximate-match queries for subsequences of
arbitrary lengths
Map each data sequence into a small set of
multidimensional rectangles in feature space
These rectangles can then be indexed using
spatial access methods, such as R-trees
Small space overhead; orders-of-magnitude
savings over sequential scan

10
Sub-Trail (ST)-Index

Assume that queries have a minimum duration w
(e.g., w = 7 days)
Divide data sequences into
sliding windows of width w,
thus producing trails
i.e., a data sequence of length Len(S) is mapped to a
trail in feature space of Len(S) - w + 1 points
Index these trails using the I-naive method
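The sliding-window mapping can be sketched as below. The helper names `window_features` and `trail` are assumptions made for this illustration, and the features are the first f coefficients of a 1/sqrt(w)-scaled DFT, as in the whole-matching case.

```python
import cmath
import math

def window_features(window, f):
    """First f orthonormal DFT coefficients of one window."""
    n = len(window)
    scale = 1.0 / math.sqrt(n)
    return tuple(
        scale * sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(window))
        for k in range(f)
    )

def trail(seq, w, f):
    """Map seq to its trail: one feature point per sliding window of width w."""
    return [window_features(seq[k:k + w], f) for k in range(len(seq) - w + 1)]

s = [float(v) for v in range(10)]  # hypothetical data sequence of length 10
points = trail(s, w=4, f=2)
print(len(points))  # Len(S) - w + 1 = 7
```

Consecutive windows share w - 1 samples, so consecutive points of the trail tend to lie close together in feature space.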

11
I-naive method

Given a query of length w and tolerance ε
Extract the features of the query and search the
spatial access method with a range query of radius ε
Retrieved points correspond to promising
sequences
Discard false alarms (candidates whose actual distance
exceeds the tolerance)
What remains is the complete desired answer set

12
I-naive Inefficient

Twice as slow as sequential scan!
1:f increase in storage requirements
R-tree becomes very tall and slow
Solution
exploit the fact that successive points of a trail are
similar
divide the trail into sub-trails
represent each sub-trail with its minimum bounding
rectangle (MBR)
storage of only a few MBRs required!

13
ST-Index Example

MBRs belonging to the same trail may overlap

14
ST-index features

Map data sequence into set of rectangles in
feature space
Significant improvement with respect to space and
response time
For each MBR we have to store
t_start, t_end
Unique identifier for the data sequence (sequence_id)
Extent of the MBR in each dimension
(F1_low, F1_high, F2_low, F2_high)
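A possible in-memory shape for these entries is sketched below. The class `SubTrailMBR`, the function `fixed_subtrail_mbrs`, and the sample points are hypothetical; the sketch cuts the trail into fixed-size sub-trails purely to have something to bound, and it assumes real-valued feature coordinates (a complex DFT coefficient would contribute two such coordinates).

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

@dataclass
class SubTrailMBR:
    sequence_id: int
    t_start: int              # offset of the first window in the sub-trail
    t_end: int                # offset of the last window in the sub-trail
    low: Tuple[float, ...]    # lower corner, one value per feature dimension
    high: Tuple[float, ...]   # upper corner, one value per feature dimension

def fixed_subtrail_mbrs(sequence_id: int,
                        points: Sequence[Tuple[float, ...]],
                        size: int) -> List[SubTrailMBR]:
    """Cut a trail into consecutive sub-trails of `size` points and bound each one."""
    mbrs = []
    for start in range(0, len(points), size):
        chunk = points[start:start + size]
        dims = range(len(chunk[0]))
        low = tuple(min(p[d] for p in chunk) for d in dims)
        high = tuple(max(p[d] for p in chunk) for d in dims)
        mbrs.append(SubTrailMBR(sequence_id, start, start + len(chunk) - 1, low, high))
    return mbrs

trail_points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.1), (1.1, 1.0)]
mbrs = fixed_subtrail_mbrs(7, trail_points, 3)
print(mbrs[0])
```

Five trail points are stored as two rectangles here, which is where the space saving over indexing every point individually comes from.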

15
ST-index Node Structure

Level above leaves: F1_min, F1_max, F2_min, F2_max
Leaf level: Sequence_id, T_start, T_end, F1_min, F1_max, F2_min, F2_max

16
Now Barkha...

Questions
Insertions - how to divide its trail in feature
space into sub-trails
Queries - how to handle queries, especially those
that are longer than w
Performance Results
Summary

17
Sub-trail size

Aim
To find an optimal way to divide a trail in feature
space into sub-trails
Possible solutions
Pack points into sub-trails according to a
pre-determined fixed number. No optimal value!
Use a function of the length of the stored sequence
as the sub-trail size, e.g. √Len(S)

18
Sub-trail size (contd.)

Both methods show poor results
I-fixed method
Uses an index with fixed-size sub-trails
The I-naïve method is a special case of I-fixed with
sub-trail length set to 1
I-adaptive method
Groups points into sub-trails with a greedy algorithm
Uses a cost function that tries to estimate the number of
disk accesses

19
Example: I-fixed method vs. I-adaptive method
I-fixed: sub-trails of fixed length 3

20
Algorithm Divide-to-Sub-trails

Definition of marginal cost
Consider a sub-trail of k points with an MBR of sizes
L1, L2, ..., LN
Then the marginal cost of each point in this sub-trail is
mc = DA(L)/k, where DA(L) is the estimated number of
disk accesses
Assign the first point of the trail in a
(trivial) sub-trail
FOR each successive point
IF it increases the marginal cost of the
current sub-trail
THEN start another sub-trail
ELSE include it in the current sub-trail
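The greedy loop above can be sketched in a few lines. The slide does not say how DA(L) is computed; this sketch assumes DA(L) is the product of (L_i + 0.5) over the MBR sides, with feature space normalized to the unit hypercube, which is an assumption about the paper's estimate rather than something stated here. The function names and sample points are illustrative.

```python
from math import prod

def mbr_sides(points):
    """Side lengths of the minimum bounding rectangle of the given points."""
    dims = range(len(points[0]))
    return [max(p[d] for p in points) - min(p[d] for p in points) for d in dims]

def marginal_cost(points):
    """mc = DA(L) / k, with DA(L) ASSUMED to be prod(L_i + 0.5)."""
    return prod(side + 0.5 for side in mbr_sides(points)) / len(points)

def divide_to_subtrails(points):
    """Greedy split: start a new sub-trail when a point raises the marginal cost."""
    subtrails = [[points[0]]]          # first point forms a trivial sub-trail
    for p in points[1:]:
        current = subtrails[-1]
        if marginal_cost(current + [p]) > marginal_cost(current):
            subtrails.append([p])      # the point is too far away: start a new sub-trail
        else:
            current.append(p)          # cheap to absorb: extend the current sub-trail
    return subtrails

trail_points = [(0.0, 0.0), (0.05, 0.0), (0.1, 0.05), (2.0, 2.0), (2.05, 2.0)]
parts = divide_to_subtrails(trail_points)
print([len(part) for part in parts])  # [3, 2]
```

The first three points are close together and share one rectangle; the jump to (2.0, 2.0) would inflate that rectangle, so a new sub-trail starts there.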

21
Query length

Query of length w
Algorithm Search_Short
Query sequence is mapped to a point qf in feature space;
the query region is a sphere around qf with radius ε
Retrieve the sub-trails whose MBRs intersect the
query region using the index
Examine the corresponding subsequences of the data
sequences to discard the false alarms
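The intersection test between the query sphere and an MBR can be sketched as follows. A linear pass over a list stands in for the spatial index here, and the tuple layout `(low, high, payload)` and the payload labels are assumptions made for this illustration.

```python
import math

def min_dist_to_mbr(point, low, high):
    """Smallest Euclidean distance from point to the rectangle [low, high]."""
    return math.sqrt(sum(
        max(lo - x, 0.0, x - hi) ** 2 for x, lo, hi in zip(point, low, high)
    ))

def candidate_subtrails(query_point, eps, mbrs):
    """Return the payloads of MBRs that intersect the sphere of radius eps
    around query_point. A real index would prune whole subtrees instead."""
    return [payload for low, high, payload in mbrs
            if min_dist_to_mbr(query_point, low, high) <= eps]

# Hypothetical MBRs: (low corner, high corner, label of the sub-trail).
mbrs = [
    ((0.0, 0.0), (0.2, 0.2), "S1[0:2]"),
    ((1.0, 1.0), (1.1, 1.1), "S1[3:4]"),
]
print(candidate_subtrails((0.3, 0.1), 0.15, mbrs))  # ['S1[0:2]']
```

Each surviving sub-trail is only a candidate: its underlying subsequences still have to be compared with the query in the original space to remove false alarms.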

22
Query length (contd)

Queries of length greater than w
Complicated, as the ST-index only knows about
subsequences of length w
Proposed solution: Prefix Search
Select a subsequence of Q of length w
(e.g., the prefix)
Use ST-index to search for data subsequences that
match the prefix
Returns superset of qualifying subsequences

23
Query length (contd.)

Lemma
If two sequences S and Q of the same length l agree
within tolerance ε,
then any pair (S[i:j], Q[i:j]) of corresponding
subsequences agrees within the same tolerance ε.
D(S, Q) ≤ ε ⇒ D(S[i:j], Q[i:j]) ≤ ε (1 ≤ i ≤ j ≤ l)
Note: the proof is given in the paper
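For the Euclidean distance the lemma follows in one line, since a sum of non-negative terms can only shrink when terms are dropped. This is a sketch assuming D is Euclidean, with s_t and q_t denoting the t-th samples of S and Q (notation introduced here); the paper's proof may differ in presentation.

```latex
\[
D^2(S, Q) \;=\; \sum_{t=1}^{l} (s_t - q_t)^2
          \;\ge\; \sum_{t=i}^{j} (s_t - q_t)^2
          \;=\; D^2\bigl(S[i:j],\, Q[i:j]\bigr),
          \qquad 1 \le i \le j \le l
\]
```

This is why searching with only a length-w prefix of the query returns a superset of the qualifying subsequences: every true match also matches on the prefix.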

24
Query length (contd.)

Algorithm Search_Long (MultiPiece method)
Query sequence Q is broken into p sub-queries,
corresponding to p spheres in feature space with
radius ε/√p
The ST-index is used to retrieve the sub-trails whose
MBRs intersect at least one of the sub-query regions
Examine the corresponding subsequences of the data to
discard the false alarms
Method is based on Lemma 3 of the paper
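The reason the radius shrinks to ε/√p is that the squared distances of the p pieces add up to the squared distance of the whole, so if the whole is within ε then at least one piece must be within ε/√p. The sketch below checks that filter condition directly on raw values for one aligned candidate; it does not use an index, it ignores any leftover tail when Len(Q) is not a multiple of w, and the function names and data are assumptions.

```python
import math

def distance(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def split_query(query, w):
    """Break the query into consecutive pieces of length w (leftover tail ignored)."""
    p = len(query) // w
    return [query[i * w:(i + 1) * w] for i in range(p)]

def multipiece_candidate(query, candidate, w, eps):
    """True if at least one aligned piece is within eps / sqrt(p) of the candidate."""
    pieces = split_query(query, w)
    radius = eps / math.sqrt(len(pieces))
    return any(
        distance(piece, candidate[i * w:(i + 1) * w]) <= radius
        for i, piece in enumerate(pieces)
    )

q = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
s = [1.1, 2.0, 3.0, 4.3, 5.0, 6.0]
eps = 0.4
print(distance(q, s) <= eps, multipiece_candidate(q, s, 3, eps))  # True True
```

A candidate that fails the test for every piece cannot be within ε of the full query, so it can be dismissed safely; a candidate that passes still needs the final check against the full query.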

25
Performance Results

Stock price sequence and its trail of 0th and 1st
DFT coefficients

26
Performance Results (contd.)

Index space vs. average sub-trail length
I-fixed gives varying results depending on the
length of its sub-trails
I-naïve method: 24 MB, 2 times slower than the
sequential scanning method!
I-adaptive method: 5 Kb

27
Performance Results (contd.)

Relative response time of sequential scanning vs. the
proposed method
Analysis for query length equal to w
Proposed method achieves from 3 up to 100 times better
response time for selectivities in the range from 10^-4
to 10
Len(Q) = w = 512

28
Performance Results (contd.)

Relative wall clock time vs. selectivity in
log-log scale
Analysis for query length greater than w
I-adaptive method outperforms sequential scanning
by 2 to 40 times
Len(Q) = 512, w = 128

29
Performance Results (contd.)

For random walk data, in log-log scale
Points generated with a starting value of 1.5
and a step increment of 0.001
Method outperforms sequential scanning by approx.
100 to 10 times for selectivities up to 10

30
Summary

Proposed idea maps data sequences into a set of boxes
in feature space
Method efficiently handles approximate and exact
queries for subsequence matching
Generalization of the whole-matching case
Achieves orders of magnitude savings over
sequential scanning
Small space overhead, dynamic, and provably
correct
Future work: extending the method to
2-dimensional gray-scale images, and then in
general to n-dimensional vector fields