Fast SubSequence Matching in TimeSeries Databases

1
Fast Sub-Sequence Matching in Time-Series
Databases
  • Michael Käser

2
Outline
  • Time-series databases
  • Building an index
  • Answering queries
  • Evaluation of the method
  • Conclusion

3
The paper
  • Published in 1994
  • Received the Best Paper award at SIGMOD 1994

4
Definition of time-series databases
  • Each row is a sequence of numbers
  • Sequence lengths can vary
  • How does this differ from other sequence data such as text or
    DNA?
  • The values are samples of a continuous signal taken at
    regular intervals
  • They are not discrete symbols

5
Applications for time-series databases
  • Financial data
  • Astronomical data
  • Weather data
  • Sociological data
  • many more

6
Example database
7
Searching
  • A query on the database has two properties
  • Query sequence R
  • Query distance e
  • Queries can be categorized by their distance and
    by the length of R

8
Query distance
  • Allows searching for similar data
  • Distance of 0 is exact search
  • Distances between sequences are calculated using
    the Euclidean distance function

9
Length of query
  • Same length as the data: searching is easy
  • Shorter than the data: do a comparison at every
    possible offset
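
The comparison at every offset is the sequential scan that the rest of the talk tries to beat. A minimal Python sketch of it follows; the function names `euclidean` and `sequential_scan` and the sample numbers are made up for illustration and do not come from the slides.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equally long sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def sequential_scan(data, query, e):
    """Return every offset where `query` matches `data` within distance e."""
    n = len(query)
    return [i for i in range(len(data) - n + 1)
            if euclidean(data[i:i + n], query) <= e]

data = [1.0, 2.0, 3.0, 2.0, 1.0, 2.0, 3.1]
exact = sequential_scan(data, [2.0, 3.0], 0.0)    # distance 0: exact search
similar = sequential_scan(data, [2.0, 3.0], 0.2)  # distance > 0: similarity search
print(exact, similar)
```

Every query touches every offset of every sequence, which is why this approach gets slow on large databases.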

10
What should be achieved
  • Sequential searching on the sequences is slow
  • The new search method should:
  • Improve performance for all query types
  • Require little space overhead
  • Not miss any matching sequences
  • (It may, however, generate a few false alarms)

11
How it is achieved
  • Step 1: Extract features from the sequences
  • Step 2: Add support for short queries
  • Step 3: Store in an efficient data structure
  • Step 4: Query the index

12
Step 1: Extracting features
  • Compress the information of a complete sequence
    into a smaller number of features
  • Number of features f should be defined in advance
  • Transform each sequence to a point in the
    f-dimensional feature space

13
Discrete Fourier Transform
  • Transforms a sequence into another sequence of the same
    length
  • Each element of the transformed sequence holds
    information about all elements of the original
    sequence
  • Transformed elements are complex numbers

14
DFT for feature extraction
  • Cut off the transformed sequence after f elements
  • Use the amplitude of each complex number
  • The distance between transformed sequences is never
    larger than the original distance, so no match is missed
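
A small Python sketch of this feature extraction, under the assumption that the transform is scaled by 1/√n so that it preserves distances. The name `dft_features` and the two sample sequences are illustrative only.

```python
import cmath
import math

def dft_features(seq, f):
    """Amplitudes of the first f DFT coefficients of seq (1/sqrt(n) scaling)."""
    n = len(seq)
    features = []
    for k in range(f):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(seq))
        features.append(abs(coeff) / math.sqrt(n))
    return features

def euclidean(a, b):
    """Euclidean distance between two equally long sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a = [1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0, 2.0]
b = [1.5, 2.0, 2.5, 4.0, 3.5, 2.0, 1.0, 1.0]
d_original = euclidean(a, b)
d_features = euclidean(dft_features(a, 2), dft_features(b, 2))
print(d_features <= d_original)  # the feature distance is a lower bound
```

Dropping coefficients and taking amplitudes can only shrink the distance, so a search in feature space may return extra candidates but never loses a true match.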

15
Extracting features in the example
16
Step 2: Extending the index to subsequences
  • Define a minimum query length w
  • Use a sliding window over the original data
  • At each window position extract features
  • All transformed points of subsequences form the
    trail of a sequence in the feature space
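
The sliding window can be sketched in a few lines of Python. The helper `dft_features` (amplitudes of the first f DFT coefficients, scaled by 1/√n) and the sine-wave test sequence are assumptions made for this illustration.

```python
import cmath
import math

def dft_features(window, f):
    """Amplitudes of the first f DFT coefficients of window (1/sqrt(n) scaling)."""
    n = len(window)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(window))) / math.sqrt(n)
            for k in range(f)]

def trail(seq, w, f):
    """Feature points of all windows of length w, ordered by offset."""
    return [dft_features(seq[i:i + w], f) for i in range(len(seq) - w + 1)]

seq = [math.sin(0.3 * t) for t in range(20)]
points = trail(seq, w=8, f=2)
print(len(points))  # one point per window position
```

A sequence of length n yields n - w + 1 points, and neighbouring windows share most of their elements, so consecutive points tend to lie close together.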

17
Generating trails in the example
18
Example of trails
19
Step 3: Storage of trails
  • Storing all the points in a trail requires a lot
    of space
  • Searching in all the points is much slower than
    pure sequential searching
  • An efficient data structure for spatial data has
    to be used

20
The R-Tree
  • Data structure for saving multi-dimensional areas
    (i.e. rectangles)
  • Content is in leaf nodes
  • Inner nodes hold minimum bounding rectangles (MBRs)
    around their child nodes
  • Rectangles can overlap
  • Good algorithms for inserting and deleting exist
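
To make the structure concrete, here is a much simplified Python sketch of the search side only. It is not a full R-tree: the tree is built by hand, and insertion, deletion and node splitting are left out. The names `Node`, `overlaps`, `bounding_rect` and `search` are invented for this example.

```python
def overlaps(a, b):
    """True if rectangles a and b, each given as (low, high) corners, intersect."""
    (a_low, a_high), (b_low, b_high) = a, b
    return all(al <= bh and bl <= ah
               for al, ah, bl, bh in zip(a_low, a_high, b_low, b_high))

def bounding_rect(rects):
    """Minimum bounding rectangle of several rectangles."""
    dims = range(len(rects[0][0]))
    low = [min(r[0][d] for r in rects) for d in dims]
    high = [max(r[1][d] for r in rects) for d in dims]
    return (low, high)

class Node:
    def __init__(self, children=None, entries=None):
        # A leaf holds (rect, payload) entries; an inner node holds child nodes.
        self.children = children or []
        self.entries = entries or []
        rects = [c.rect for c in self.children] + [r for r, _ in self.entries]
        self.rect = bounding_rect(rects)

def search(node, window):
    """Collect the payloads of all leaf entries whose rectangle intersects window."""
    if not overlaps(node.rect, window):
        return []  # nothing below this node can intersect
    hits = [payload for rect, payload in node.entries if overlaps(rect, window)]
    for child in node.children:
        hits.extend(search(child, window))
    return hits

leaf1 = Node(entries=[(([0, 0], [1, 1]), "A"), (([2, 0], [3, 1]), "B")])
leaf2 = Node(entries=[(([5, 5], [6, 6]), "C")])
root = Node(children=[leaf1, leaf2])
print(search(root, ([0.5, 0.5], [2.5, 0.8])))
```

Whole subtrees are skipped as soon as their bounding rectangle misses the search window, which is where the speed-up over checking every rectangle comes from.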

21
R-tree example
22
Using the R-tree to store the trails
  • Split each trail into a number of sub-trails
  • Put a minimum bounding rectangle around each sub-trail
  • Store it together with the sequence id and the offsets
  • How should the trails be split?
  • A fixed number of points per sub-trail is not
    optimal
  • Use an adaptive algorithm that tries to minimize the
    number of disk accesses
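
One way to picture an adaptive split is the greedy Python sketch below: a point joins the current sub-trail only if that does not raise the cost per point. The cost function (product of the rectangle's side lengths, each increased by 0.5) is an assumption standing in for an estimate of disk accesses, and the names `mbr`, `cost` and `split_trail` are illustrative; this is not claimed to be the exact algorithm from the paper.

```python
def mbr(points):
    """Minimum bounding rectangle of a list of points, as (low, high)."""
    dims = range(len(points[0]))
    low = [min(p[d] for p in points) for d in dims]
    high = [max(p[d] for p in points) for d in dims]
    return low, high

def cost(points):
    """Assumed access-cost estimate: product of (side length + 0.5)."""
    low, high = mbr(points)
    c = 1.0
    for lo, hi in zip(low, high):
        c *= (hi - lo) + 0.5
    return c

def split_trail(trail):
    """Greedy split; returns (first_offset, last_offset, (low, high)) per sub-trail."""
    pieces = []
    start, current = 0, [trail[0]]
    for i in range(1, len(trail)):
        candidate = current + [trail[i]]
        if cost(candidate) / len(candidate) <= cost(current) / len(current):
            current = candidate  # cost per point did not rise: keep growing
        else:
            pieces.append((start, i - 1, mbr(current)))
            start, current = i, [trail[i]]
    pieces.append((start, len(trail) - 1, mbr(current)))
    return pieces

trail = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 5.0)]
pieces = split_trail(trail)
print([(first, last) for first, last, _ in pieces])
```

In this toy trail the jump from (0.2, 0.1) to (5.0, 5.0) would inflate the rectangle, so the sketch starts a new sub-trail there instead of cutting at a fixed point count.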

23
Example: Selecting sub-trails
24
Step 4: Querying the index
  • Use only the first w elements of the query
  • Extract the features of this prefix
  • Represent it as a sphere around the feature point
    with the query distance as radius
  • Intersect the sphere with the R-tree nodes
  • Add the offsets associated with each intersecting
    sub-trail to the candidate set
  • Compute the true distance for every candidate and
    discard false alarms
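
The query steps can be put together in a short Python sketch. Several simplifications are assumed: a flat list of rectangles stands in for the R-tree, sub-trails have a fixed size `chunk`, only one sequence is indexed, and all function names are invented for this example.

```python
import cmath
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dft_features(window, f):
    """Amplitudes of the first f DFT coefficients of window (1/sqrt(n) scaling)."""
    n = len(window)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(window))) / math.sqrt(n)
            for k in range(f)]

def build_index(seq, w, f, chunk):
    """Sub-trail rectangles as (first_offset, last_offset, low, high) tuples."""
    points = [dft_features(seq[i:i + w], f) for i in range(len(seq) - w + 1)]
    index = []
    for start in range(0, len(points), chunk):
        part = points[start:start + chunk]
        low = [min(p[d] for p in part) for d in range(f)]
        high = [max(p[d] for p in part) for d in range(f)]
        index.append((start, start + len(part) - 1, low, high))
    return index

def distance_to_rect(point, low, high):
    """Smallest distance from point to the rectangle [low, high]."""
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2
                         for x, l, h in zip(point, low, high)))

def query(seq, index, q, e, w, f):
    """Offsets where q matches seq within distance e; requires len(q) >= w."""
    center = dft_features(q[:w], f)  # features of the first w elements
    matches = []
    for first, last, low, high in index:
        if distance_to_rect(center, low, high) > e:
            continue  # the query sphere misses this rectangle
        for i in range(first, last + 1):  # candidates: verify the true distance
            candidate = seq[i:i + len(q)]
            if len(candidate) == len(q) and euclidean(candidate, q) <= e:
                matches.append(i)
    return matches

seq = [math.sin(0.2 * t) + 0.1 * math.cos(1.3 * t) for t in range(200)]
w, f = 16, 2
index = build_index(seq, w, f, chunk=8)
q = seq[40:60]
result = query(seq, index, q, 0.05, w, f)
print(result)
```

Only the offsets inside intersecting rectangles are compared against the raw data; the final distance check is what removes the false alarms.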

25
Better method
  • Split query into p parts of length w
  • Do a query for each part
  • Merge the results
  • The query distance for each part can be reduced to e/√p
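
The reduced per-part distance follows directly from the Euclidean distance. In the short derivation below, R_1, ..., R_p denote the parts of the query R and S_1, ..., S_p the aligned parts of a matching subsequence S; these names and the distance symbol D are introduced here for illustration.

```latex
D(R, S)^2 = \sum_{j=1}^{p} D(R_j, S_j)^2 \le e^2
\;\Longrightarrow\;
\min_{j} D(R_j, S_j)^2 \le \frac{e^2}{p}
\;\Longrightarrow\;
\min_{j} D(R_j, S_j) \le \frac{e}{\sqrt{p}}
```

At least one part of every true match therefore lies within e/√p of the corresponding query part, so taking the union of the per-part candidates misses nothing while the smaller radius produces fewer false alarms.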

26
Evaluation
  • Tested on a real database with 329,000 points
  • Minimum query length w of 512
  • Queries of length 512 were 3 to 100 times faster than a sequential scan
  • Longer queries were 2 to 40 times faster
  • Index size was 5 KB

27
Evaluation
28
Conclusion
  • The proposed method is fast on real-world data
  • Influential paper
  • A lot of follow-up research builds on it
  • Reducing false alarms
  • Adding constraints to the query
  • Streaming time series
  • Improvements to R-trees
  • and many more (250 citations)

29
Your questions?