Fast Subsequence Matching in Timeseries Databases

Description:

on Management of Data, pages 419--429, Minneapolis, May 1994. presented by ... find companies whose stock prices move similarly ... – PowerPoint PPT presentation


1
Fast Subsequence Matching in Time-Series Databases
  • C. Faloutsos, M. Ranganathan, and Y.
    Manolopoulos
  • University of Maryland, Department of Computer
    Science and Institute for Systems Research
  • In Proc. ACM SIGMOD Int. Conf. on Management of
    Data, pages 419--429, Minneapolis, May 1994.
  • Presented by Kathy Gray and Barkha Raisoni

2
Overview
  • The Problem
  • Background and Related Work
  • Main Approach
  • Subsequence Matching
  • Performance Results
  • Summary

3
The Problem
  • Business, Financial, Stock
  • find companies whose stock prices move similarly
  • find other companies whose sales patterns are
    similar to those of our company's product
  • Scientific
  • find past days in which the solar magnetic wind
    showed patterns similar to today's (predictions of
    the Earth's magnetic field)

4
Problem Overview
  • Need fast searching methods that will search a
    database with time-series of real numbers to
    locate subsequences that are similar to a query
    subsequence
  • fast and correct
  • small space overhead
  • dynamic
  • handle varying length data sequences

5
Similarity Queries
  • Whole Matching
  • Given N data sequences of real numbers
    S1, S2, ..., SN and a query sequence Q, we want to
    find those data sequences that are within a
    certain tolerance ε (distance from Q)
  • Use a distance-preserving transform, such as the
    Discrete Fourier Transform (DFT), to extract f
    features from each sequence (i.e., the first f DFT
    coefficients), thus mapping the sequences to points
    in the f-dimensional feature space
  • Then use any spatial access method (e.g., R-trees)
  • Exploits the assumption that data and query
    sequences have the same length

6
Similarity Queries
  • Subsequence Matching
  • Given a collection of N sequences of real numbers
    S1, S2, ..., SN, of varying lengths
  • User specifies a query subsequence Q of variable
    length Len(Q) and a tolerance ε (the maximum
    acceptable dissimilarity, or distance)
  • Find all sequences Si (1 ≤ i ≤ N), along with the
    correct offsets k, such that the subsequence
    Si[k : k + Len(Q) - 1] matches the query subsequence:
  • D(Q, Si[k : k + Len(Q) - 1]) ≤ ε
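
The problem statement above can be made concrete with a brute-force baseline. The following is a minimal sketch of a sequential scan, assuming D is the Euclidean distance; the function names and the sample data are illustrative and do not come from the slides.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def sequential_scan(sequences, query, eps):
    """Return all (i, k) such that sequences[i][k : k + len(query)]
    lies within distance eps of query.

    Indices are 0-based here, unlike the 1-based notation on the slide.
    """
    m = len(query)
    hits = []
    for i, s in enumerate(sequences):
        # Try every offset at which the query fits inside the sequence.
        for k in range(len(s) - m + 1):
            if euclidean(s[k:k + m], query) <= eps:
                hits.append((i, k))
    return hits

# Illustrative data: two sequences of different lengths.
data = [[1.0, 2.0, 3.0, 4.0, 5.0], [5.0, 4.0, 3.0]]
matches = sequential_scan(data, [3.0, 4.0], 0.5)
```

This scan is the baseline that the indexing method is compared against: it compares the query against every possible offset of every sequence.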

7
Related Work
  • Indexing in text and DNA databases
  • can be viewed as 1-dimensional sequences
  • but consist of discrete symbols vs. continuous
    numbers
  • makes a difference in the feature extraction
  • Queries on time sequences, color images, or 3-D
    brain scans (whole matching)
  • F-index method
  • apply DFT
  • store first few numbers (DFT coefficients)
  • sequence mapped into a point in f-dimensional
    space
  • points are then organized in R-tree

8
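Related Work (contd.)
  • F-index should not result in false dismissals for
    range queries
  • Condition to be satisfied: the distance in feature
    space must never exceed the actual distance,
    Dfeature(F(O1), F(O2)) ≤ D(O1, O2)
  • where
  • O1, O2 are objects (e.g., a qualifying sequence
    and the query)
  • Dfeature is the Euclidean distance in feature space
  • F(O) is the feature vector of O
  • Note: the proof is given in the paper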
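
The no-false-dismissal condition says that the distance between feature vectors must not exceed the distance between the original objects. The sketch below checks this numerically for truncated DFT features. It assumes the orthonormal DFT (scaled by 1/sqrt(n)), under which Parseval's theorem makes the full transform distance-preserving; the helper names and sample values are illustrative only.

```python
import cmath
import math

def dft_features(x, f):
    """First f coefficients of the orthonormal DFT (1/sqrt(n) scaling)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        / math.sqrt(n)
        for k in range(f)
    ]

def dist(a, b):
    """Euclidean distance; works for real or complex vectors."""
    return math.sqrt(sum(abs(u - v) ** 2 for u, v in zip(a, b)))

# Two illustrative sequences of equal length.
s = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 5.0, 7.0]
q = [1.5, 2.5, 2.0, 4.0, 4.5, 6.5, 5.0, 6.0]

d_object = dist(s, q)
# Keeping only f = 2 coefficients drops non-negative terms from the sum,
# so the feature distance can only shrink.
d_feature = dist(dft_features(s, 2), dft_features(q, 2))
lower_bound_holds = d_feature <= d_object + 1e-12
```

Because the feature distance underestimates the true distance, a range query in feature space may return false alarms but never misses a qualifying object.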

9
Main Approach
  • Generalize the whole-matching problem: answer
    approximate-match queries for subsequences of
    arbitrary lengths
  • Map each data sequence into a small set of
    multidimensional rectangles in feature space
  • These rectangles can then be indexed using
    spatial access methods, such as R-trees
  • Small space overhead; orders-of-magnitude
    savings over sequential scan

10
Sub-Trail (ST)-Index
  • Assume that queries have a minimum length w
    (e.g., w = 7 days)
  • Slide a window of width w over each data sequence
    and extract the features of each window
  • This produces a trail in feature space
  • i.e., a data sequence of length Len(S) is mapped to
    a trail of Len(S) - w + 1 points in feature space
  • The straightforward way to index these trails is
    the I-naive method
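
A short sketch of how a sliding window turns a sequence into a trail. The feature function `mean_and_slope` is a hypothetical two-dimensional stand-in chosen to keep the example small; the slides use the first few DFT coefficients instead.

```python
def mean_and_slope(window):
    """Hypothetical 2-D feature (NOT the DFT features from the slides):
    the window mean and the difference between its last and first value."""
    return (sum(window) / len(window), window[-1] - window[0])

def trail(sequence, w, features=mean_and_slope):
    """Map a sequence to its trail: one feature point per window of width w."""
    return [features(sequence[k:k + w]) for k in range(len(sequence) - w + 1)]

# Illustrative data: 6 samples and w = 3 give 6 - 3 + 1 = 4 trail points.
prices = [10.0, 11.0, 12.0, 11.0, 13.0, 14.0]
points = trail(prices, w=3)
```

Successive windows share w - 1 samples, so neighbouring trail points tend to lie close together.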

11
I-naive method
  • Given a query of length w and tolerance ε, extract
    the features of the query and search the spatial
    access method with a range query of radius ε
  • Retrieved points correspond to promising
    subsequences
  • Discard false alarms (those whose actual distance
    exceeds the tolerance)
  • The remaining subsequences form the complete
    answer set

12
I-naive Inefficient
  • Twice as slow as Sequential Scan!
  • 1:f increase in storage requirements
  • R-tree becomes very tall and slow
  • Solution
  • exploit fact that successive points of trail are
    similar
  • divide trail into sub-trails
  • represent each with its minimum bounding
    rectangle (MBR)
  • storage of only a few MBRs required!

13
ST-Index Example
  • Figure: MBRs belonging to the same trail may overlap

14
ST-index features
  • Map each data sequence into a set of rectangles in
    feature space
  • Significant improvement with respect to space and
    response time
  • For each MBR we have to store:
  • t_start, t_end (offsets of the first and last
    windows covered by the sub-trail)
  • Unique identifier for the data sequence (sequence_id)
  • Extent of the MBR in each dimension
  • (F1_low, F1_high, F2_low, F2_high)
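
The fields listed above map naturally onto a small record type. This is a minimal sketch; the class name `SubTrailEntry` and the helper `make_entry` are hypothetical, not taken from the slides.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class SubTrailEntry:
    """One stored MBR: which sequence, which offsets, and its extent."""
    sequence_id: int
    t_start: int
    t_end: int
    low: Tuple[float, ...]   # lower bound in each feature dimension
    high: Tuple[float, ...]  # upper bound in each feature dimension

def make_entry(sequence_id, t_start, points):
    """Build the entry for a sub-trail whose first point has offset t_start."""
    dims = range(len(points[0]))
    low = tuple(min(p[d] for p in points) for d in dims)
    high = tuple(max(p[d] for p in points) for d in dims)
    return SubTrailEntry(sequence_id, t_start, t_start + len(points) - 1, low, high)

# Illustrative sub-trail of three 2-D feature points starting at offset 3.
entry = make_entry(7, 3, [(1.0, 4.0), (2.0, 3.5), (1.5, 5.0)])
```

Storing one such entry per sub-trail, rather than one point per window, is what keeps the space overhead small.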

15
ST-index Node Structure
  • Level above leaves: each entry stores F1_min,
    F1_max, F2_min, F2_max
  • Leaf level: each entry stores Sequence_id, T_start,
    T_end, F1_min, F1_max, F2_min, F2_max

16
Now Barkha...
  • Questions
  • Insertions - how to divide its trail in feature
    space into sub-trails
  • Queries - how to handle queries, especially those
    that are longer than w
  • Performance Results
  • Summary

17
Sub-trail size
  • Aim
  • To find an optimal way to divide a trail in feature
    space into sub-trails
  • Straightforward solutions
  • Pack a pre-determined, fixed number of points
    into each sub-trail. There is no optimal value!
  • Make the sub-trail size a function of the length of
    the stored sequence, e.g., √Len(S)

18
Sub-trail size (contd.)
  • Both of these methods show poor results
  • I-fixed method used
  • Index built with fixed-length sub-trails
  • The I-naïve method is a special case of I-fixed,
    with the sub-trail length set to 1
  • I-adaptive method
  • Groups points into sub-trails with a greedy algorithm
  • Uses a cost function that estimates the number of
    disk accesses

19
Example: I-fixed method vs. I-adaptive method
  • Figure: I-fixed shown with sub-trails of fixed length 3

20
Algorithm Divide-to-Sub-trails
  • Definition of marginal cost
  • Consider a sub-trail of k points with an MBR of sides
    L = (L1, L2, ..., Ln)
  • Then the marginal cost of each point in this
    sub-trail is
  • mc = DA(L) / k, where DA(L) is the estimated number
    of disk accesses
  • Assign the first point of the trail to a
    (trivial) sub-trail
  • FOR each successive point
  • IF it increases the marginal cost of the
    current sub-trail
  • THEN start another sub-trail
  • ELSE include it in the current sub-trail
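
The greedy loop above can be sketched in a few lines. The slide does not give a formula for DA, so the cost estimate used here, the product of (side + 0.5) over all dimensions with feature space normalized to the unit hypercube, is an assumption, and the function names are illustrative. The trail is assumed to be non-empty.

```python
import math

def disk_accesses(sides):
    """ASSUMED cost estimate (not stated on the slide): product of
    (side + 0.5) over all dimensions, for a unit-hypercube feature space."""
    return math.prod(side + 0.5 for side in sides)

def marginal_cost(points):
    """mc = DA(L) / k for a sub-trail of k points whose MBR has sides L."""
    dims = range(len(points[0]))
    sides = [max(p[d] for p in points) - min(p[d] for p in points) for d in dims]
    return disk_accesses(sides) / len(points)

def divide_to_subtrails(trail):
    """Greedy split: open a new sub-trail whenever adding the next point
    would raise the marginal cost of the current one."""
    subtrails = [[trail[0]]]
    for point in trail[1:]:
        current = subtrails[-1]
        if marginal_cost(current + [point]) > marginal_cost(current):
            subtrails.append([point])
        else:
            current.append(point)
    return subtrails

# Illustrative trail: three nearby points, then a jump to a distant pair.
example = [(0.10, 0.10), (0.12, 0.11), (0.14, 0.12), (0.80, 0.90), (0.82, 0.91)]
parts = divide_to_subtrails(example)
```

In this example the jump to the fourth point would inflate the MBR, so the algorithm closes the first sub-trail after three points and starts a second one.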

21
Query length
  • Query of length w
  • Algorithm Search_Short
  • The query sequence is mapped to a point qf in
    feature space; the query region is a sphere around
    qf with radius ε
  • Retrieve the sub-trails whose MBRs intersect the
    query region, using the index
  • Examine the corresponding subsequences of the data
    sequences to discard the false alarms
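
The retrieval step reduces to testing whether a rectangle intersects a sphere. A minimal sketch follows, in which a linear scan over a list of MBRs stands in for the spatial index; the tuple layout `(id, low, high)` and the function names are assumptions made for illustration.

```python
import math

def min_dist_to_mbr(q, low, high):
    """Smallest Euclidean distance from point q to the rectangle [low, high]."""
    return math.sqrt(
        sum(max(l - x, 0.0, x - h) ** 2 for x, l, h in zip(q, low, high))
    )

def candidate_subtrails(mbrs, q_f, eps):
    """Ids of sub-trails whose MBR intersects the sphere of radius eps
    around q_f. A real index would prune most MBRs without visiting them."""
    return [sid for sid, low, high in mbrs if min_dist_to_mbr(q_f, low, high) <= eps]

# Illustrative MBRs in a 2-D feature space.
mbrs = [
    ("a", (0.0, 0.0), (1.0, 1.0)),
    ("b", (3.0, 3.0), (4.0, 4.0)),
]
hits = candidate_subtrails(mbrs, (1.5, 0.5), 0.6)
```

Each returned sub-trail is only a candidate: the corresponding raw subsequences still have to be compared against the query to discard false alarms.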

22
Query length (contd.)
  • Queries of length greater than w
  • Complicated, as the ST-index only knows subsequences
    of length w
  • Solution proposed: Prefix Search
  • Select a subsequence of Q of length w
    (e.g., the prefix)
  • Use the ST-index to search for data subsequences that
    match the prefix
  • Returns a superset of the qualifying subsequences

23
Query length (contd.)
  • Lemma
  • If two sequences S and Q of the same length l agree
    within tolerance ε,
  • then any pair (S[i:j], Q[i:j]) of corresponding
    subsequences agrees within the same tolerance ε:
  • D(S, Q) ≤ ε ⇒ D(S[i:j], Q[i:j]) ≤ ε   (1 ≤ i ≤ j ≤ l)
  • Note: the proof is given in the paper
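
A one-line sketch of why the lemma holds, assuming D is the Euclidean distance and writing S[i:j] for the subsequence of S from position i to position j. Restricting the sum to a sub-range only drops non-negative terms:

```latex
D^2(S, Q) \;=\; \sum_{t=1}^{l} (s_t - q_t)^2
\;\ge\; \sum_{t=i}^{j} (s_t - q_t)^2
\;=\; D^2\bigl(S[i:j],\, Q[i:j]\bigr),
\qquad 1 \le i \le j \le l .
```

So D(S, Q) ≤ ε implies D(S[i:j], Q[i:j]) ≤ ε, which is why searching with only a prefix of the query cannot cause false dismissals.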

24
Query length (contd.)
  • Algorithm Search_Long (MultiPiece method)
  • The query sequence Q is broken into p sub-queries,
    corresponding to p spheres in feature space, each
    with radius ε/√p
  • The ST-index is used to retrieve the sub-trails whose
    MBRs intersect at least one of the sub-query regions
  • Examine the corresponding subsequences of the data to
    discard the false alarms
  • Method based on Lemma 3 of the paper
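
The radius ε/√p can be motivated with a short argument. This is a sketch assuming D is the Euclidean distance and that Q and a candidate subsequence S are cut into p disjoint pieces q_1, ..., q_p and s_1, ..., s_p of equal length; the piece notation is introduced here for illustration.

```latex
D^2(S, Q) \;=\; \sum_{r=1}^{p} D^2(s_r, q_r).
\qquad
\text{If } D(s_r, q_r) > \frac{\varepsilon}{\sqrt{p}} \text{ for every } r,
\text{ then }
D^2(S, Q) \;>\; p \cdot \frac{\varepsilon^2}{p} \;=\; \varepsilon^2 .
```

By contraposition, any S with D(S, Q) ≤ ε has at least one piece within ε/√p of the matching query piece, so retrieving the sub-trails that intersect at least one of the p smaller spheres misses no qualifying subsequence.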

25
Performance Results
  • Figure: a stock price sequence and its trail in the
    feature space of the 0th and 1st DFT coefficients

26
Performance Results (contd.)
  • Figure: index space vs. average sub-trail length
  • I-fixed gives varying results, depending on the
    length of its sub-trails
  • I-naïve method: 24 MB; 2 times slower than
    sequential scanning!
  • I-adaptive method: 5 Kb

27
Performance Results (contd.)
  • Figure: relative response time of sequential
    scanning vs. the proposed method
  • Analysis for query length equal to w
    (Len(Q) = w = 512)
  • The proposed method achieves 3 up to 100 times better
    response time for selectivities in the range from
    10^-4 to 10

28
Performance Results (contd.)
  • Figure: relative wall-clock time vs. selectivity,
    in log-log scale
  • Analysis for query length greater than w
    (Len(Q) = 512, w = 128)
  • The I-adaptive method outperforms sequential
    scanning by a factor of 2 to 40

29
Performance Results (contd.)
  • Figure: results for random-walk data, in log-log
    scale
  • Points generated with a starting value of 1.5 and
    a step increment of 0.001
  • The method outperforms sequential scanning by a
    factor of approximately 100 down to 10, for
    selectivities up to 10

30
Summary
  • The proposed idea maps data sequences into a set of
    boxes in feature space
  • The method efficiently handles approximate and exact
    queries for subsequence matching
  • Generalization of the whole-matching case
  • Achieves orders-of-magnitude savings over
    sequential scanning
  • Small space overhead, dynamic, and provably
    correct
  • Future work: extending the method to
    2-dimensional gray-scale images, and then to
    general n-dimensional vector fields