Title: Fast SubSequence Matching in TimeSeries Databases
1Fast Sub-Sequence Matching in Time-Series
Databases
2Outline
- Time-series databases
- Building an index
- Answering queries
- Evaluation of the method
- Conclusion
3The paper
- Published in 1994
- Awarded Best paper at SIGMOD 1994
4Definition of time-series databases
- Each row is a sequence of numbers
- Sequence lengths can vary
- How does it differ from other sequence data such as
text or DNA?
- Values come from a continuous signal sampled at
regular intervals
- They are not discrete symbols
5Applications for time-series databases
- Financial data
- Astronomical data
- Weather data
- Sociological data
- many more
6Example database
7Searching
- A query on the database has two properties
- Query sequence R
- Query distance e
- Queries can be categorized by their distance and
by the length of R
8Query distance
- Allows searching for similar data
- Distance of 0 is exact search
- Distances between sequences are calculated using
the Euclidean distance function
9Length of query
- Same length as data: searching is easy
- Shorter than data: do a comparison at every
possible offset
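The baseline that the index is meant to beat can be sketched as follows. This is a minimal illustration, not code from the paper; the function names `euclidean` and `sequential_scan` and the toy data are assumptions made for the example.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def sequential_scan(data, query, eps):
    """Compare the query at every possible offset of one data sequence.

    Returns the offsets whose subsequence lies within distance eps.
    """
    n, m = len(data), len(query)
    return [i for i in range(n - m + 1)
            if euclidean(data[i:i + m], query) <= eps]

# Toy data (hypothetical values)
data = [1.0, 2.0, 3.0, 2.0, 1.0, 2.0, 3.0]
print(sequential_scan(data, [2.0, 3.0], 0.0))  # distance 0: exact search
print(sequential_scan(data, [2.0, 3.1], 0.2))  # distance > 0: similarity search
```

Every offset of every sequence is visited, which is why this approach scales poorly.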
10What should be achieved
- Sequential searching on the sequences is slow
- The new search method should
- Improve performance for all query types
- Require little space overhead
- Not miss any matching sequences
- (But may generate a few false alarms)
11How it is achieved
- Step 1 Extract information of sequences
- Step 2 Add support for short queries
- Step 3 Store in efficient data structure
- Step 4 Query the index
12Step 1 Extracting features
- Compress the information of a complete sequence
into a smaller number of features
- Number of features f should be defined in advance
- Transform each sequence to a point in the
f-dimensional feature space
13Discrete Fourier Transformation
- Transforms sequence into another sequence of same
length
- Each element of the transformed sequence holds
information about all elements of the original
sequence
- Transformed elements are complex numbers
14DFT for feature extraction
- Cut off transformed sequence after f elements
- Use amplitude of complex number
- Distance between transformed sequences is never
larger than the original distance, so no matching
sequence is missed
15Extracting features in the example
16Step 2 Extend index for subsequences
- Define a minimum query length w
- Use a sliding window over the original data
- At each window position extract features
- All transformed points of subsequences form the
trail of a sequence in the feature space
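The sliding window can be sketched like this. It is an illustration under assumptions: the features are amplitudes of the first f DFT coefficients as on the earlier slide, and the names `features` and `trail` as well as the sine-wave input are hypothetical.

```python
import cmath
import math

def features(window, f):
    """Amplitudes of the first f DFT coefficients of one window."""
    n = len(window)
    coeffs = [
        sum(x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(window)) / math.sqrt(n)
        for k in range(f)
    ]
    return tuple(abs(c) for c in coeffs)

def trail(seq, w, f):
    """One f-dimensional point per window position (offset 0, 1, 2, ...)."""
    return [features(seq[i:i + w], f) for i in range(len(seq) - w + 1)]

seq = [math.sin(0.3 * t) for t in range(20)]
points = trail(seq, w=8, f=2)
print(len(points))  # one point per offset: 20 - 8 + 1 = 13
```

Neighbouring windows share all but one value, so consecutive points tend to lie close together, which is what makes the result look like a trail.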
17Generating trails in the example
18Example of trails
19Step 3 Storage of trails
- Storing all the points in a trail requires a lot
of space
- Searching in all the points is much slower than
pure sequential searching
- An efficient data structure for spatial data has
to be used
20The R-Tree
- Data structure for saving multi-dimensional areas
(i.e. rectangles)
- Content is in leaf nodes
- Other nodes are minimum bounding rectangles
around the child nodes
- Rectangles can overlap
- Good algorithms for inserting and deleting exist
21R-tree example
22Using the R-tree to store the trails
- Split each trail into a number of sub trails
- Put a rectangle around the sub trail
- Save it together with sequence id and offsets
- How should the trails be split?
- Fixed number of points per sub trail is not
optimal
- Use an adaptive algorithm that minimizes the
number of disk accesses
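One possible greedy splitting heuristic is sketched below. It is an assumption-laden illustration, not necessarily the paper's exact algorithm: the cost estimate (product of side lengths plus 0.5, divided by the number of points) presumes a feature space normalized to the unit cube, and the names `mbr`, `marginal_cost`, and `split_trail` are hypothetical.

```python
def mbr(points):
    """Minimum bounding rectangle as (lows, highs)."""
    lows = [min(c) for c in zip(*points)]
    highs = [max(c) for c in zip(*points)]
    return lows, highs

def marginal_cost(points):
    """Assumed estimate of disk accesses per point covered by the MBR."""
    lows, highs = mbr(points)
    cost = 1.0
    for lo, hi in zip(lows, highs):
        cost *= (hi - lo) + 0.5
    return cost / len(points)

def split_trail(points):
    """Start a new sub trail whenever adding the next point raises the cost.

    Returns (first_offset, last_offset, mbr) for each sub trail.
    """
    subtrails, start = [], 0
    for i in range(1, len(points)):
        current = points[start:i]
        if marginal_cost(points[start:i + 1]) > marginal_cost(current):
            subtrails.append((start, i - 1, mbr(current)))
            start = i
    subtrails.append((start, len(points) - 1, mbr(points[start:])))
    return subtrails

# Hypothetical trail with a jump between offsets 2 and 3
pts = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 5.1)]
for first, last, box in split_trail(pts):
    print(first, last, box)
```

Unlike a fixed number of points per sub trail, this keeps tightly clustered points in one small rectangle and breaks the trail where it jumps.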
23Example Selecting sub-trails
24Step 4 Querying the index
- Use only the first w elements of query
- Extract the features of the query
- Represent it as a circle around the feature point
with query distance as radius
- Intersect with R-tree nodes
- Add the offsets associated with each matching
child node to the result set
- Recalculate every distance in the result set and
discard false alarms
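The intersection test between the query circle and a rectangle can be sketched as follows. For brevity the sketch scans a flat list of sub-trail rectangles instead of descending a real R-tree; the entry layout `(seq_id, first_offset, last_offset, lows, highs)`, the function names, and the sample values are assumptions.

```python
import math

def mindist(point, lows, highs):
    """Smallest Euclidean distance from a point to a rectangle."""
    total = 0.0
    for p, lo, hi in zip(point, lows, highs):
        d = max(lo - p, 0.0, p - hi)  # 0 when p is inside [lo, hi]
        total += d * d
    return math.sqrt(total)

def candidates(entries, q_point, eps):
    """Offsets of all sub trails whose rectangle touches the query circle."""
    hits = []
    for seq_id, first, last, lows, highs in entries:
        if mindist(q_point, lows, highs) <= eps:
            hits.extend((seq_id, off) for off in range(first, last + 1))
    return hits

# Hypothetical index entries: two sub trails of sequence "s1"
entries = [
    ("s1", 0, 2, (0.0, 0.0), (0.2, 0.1)),
    ("s1", 3, 4, (5.0, 5.0), (5.1, 5.1)),
]
print(candidates(entries, (0.3, 0.1), 0.15))
```

Each returned (sequence id, offset) pair is only a candidate: the true distance still has to be recomputed on the raw data, because a rectangle can touch the circle even when no point of its sub trail does.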
25Better method
- Split query into p parts of length w
- Do a query for each part
- Merge the results
- The query distance for each part can be reduced to
e / sqrt(p)
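One way to see why a per-part radius of e / sqrt(p) is enough: assume the query R has length exactly p·w and is cut into disjoint parts r_1, ..., r_p, with s_1, ..., s_p the aligned parts of a matching subsequence S (the symbols r_i, s_i, and D for the Euclidean distance are notation introduced here). The squared distance is then the sum of the squared part distances, so at least one part must fall within the smaller radius:

```latex
D^2(R, S) = \sum_{i=1}^{p} D^2(r_i, s_i) \le e^2
\;\Longrightarrow\;
\min_{1 \le i \le p} D^2(r_i, s_i) \le \frac{e^2}{p}
\;\Longrightarrow\;
\exists\, i :\; D(r_i, s_i) \le \frac{e}{\sqrt{p}}
```

Every true match is therefore found by at least one of the p part queries, and the smaller radius keeps the number of false alarms per part query down.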
26Evaluation
- Tested on a real database with 329000 points
- Minimal query length w of 512
- Queries of length 512 were 3 to 100 times faster
- Longer queries were 2 to 40 times faster
- Index size was 5 KB
27Evaluation
28Conclusion
- Proposed method works fast for real-world data
- Influential paper
- A lot of research based on it
- Reducing false alarms
- Adding constraints to the query
- Streaming Time Series
- Improvements in R-Trees
- many more (250 citations)
29Your questions?