iSAX: Indexing and Mining Terabyte Sized Time Series

Learn more at: http://www.cs.ucr.edu

1

iSAX: Indexing and Mining Terabyte Sized Time Series

Jin Shieh, Eamonn Keogh
Computer Science & Engineering Department, University of California, Riverside
2
Outline

Introduction
Motivating example
iSAX representation
Indexing time series
Experimental evaluation
Conclusion

3
Introduction

Our work extends a popular symbolic
representation of time series to allow for the
indexing and retrieval of millions of time series
Symbolic Aggregate approXimation (SAX)
Represent a time series T of length n in
w-dimensional space using PAA:
$\bar{T} = \bar{t}_1, \ldots, \bar{t}_w$
Where the i-th element of $\bar{T}$ is
$\bar{t}_i = \frac{w}{n} \sum_{j = \frac{n}{w}(i-1)+1}^{\frac{n}{w} i} t_j$
Then discretize $\bar{T}$ into a vector of symbols
Breakpoints map each PAA value to a symbol from a
small alphabet of cardinality a
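The two steps above (PAA, then discretization) can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes n is divisible by w, that the series is not constant, and that breakpoints are the equiprobable quantiles of a standard normal, as is usual in the SAX literature. The function names are hypothetical.

```python
from statistics import NormalDist, mean, pstdev

def znorm(series):
    """Rescale a series to zero mean and unit standard deviation."""
    mu, sigma = mean(series), pstdev(series)
    return [(x - mu) / sigma for x in series]

def paa(series, w):
    """Piecewise Aggregate Approximation: the mean of each of w equal segments."""
    n = len(series)
    if n % w != 0:
        raise ValueError("this sketch assumes n is divisible by w")
    seg = n // w
    return [sum(series[i * seg:(i + 1) * seg]) / seg for i in range(w)]

def breakpoints(a):
    """The a - 1 cut points splitting a standard normal into a equiprobable regions."""
    return [NormalDist().inv_cdf(k / a) for k in range(1, a)]

def sax_word(series, w, a):
    """Map each PAA value to the index (0 .. a-1) of the region it falls in."""
    cuts = breakpoints(a)
    return [sum(value >= c for c in cuts) for value in paa(znorm(series), w)]

word = sax_word([-2, -2, -1, -1, 1, 1, 2, 2], w=4, a=4)
```

Symbols are numbered from 0 for the lowest region, so a steadily rising series yields a non-decreasing word.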

4
Introduction (cont.)

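SAX is lower bounding
Given two SAX representations T^a and S^a, a lower
bound to the Euclidean distance between the
original series is
$MINDIST(T^a, S^a) = \sqrt{\frac{n}{w}} \sqrt{\sum_{i=1}^{w} dist(t_i, s_i)^2}$
dist(t_i, s_i) is the smallest distance between the
breakpoints that characterize each symbol, 0 if
the symbols' regions coincide or are adjacent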
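A minimal sketch of MINDIST for two words of the same cardinality follows. It assumes symbols are integers 0 .. a-1 numbered from the lowest region and that breakpoints are standard-normal quantiles; `symbol_dist` and `mindist` are illustrative names rather than an established API.

```python
from math import sqrt
from statistics import NormalDist

def breakpoints(a):
    """The a - 1 cut points splitting a standard normal into a equiprobable regions."""
    return [NormalDist().inv_cdf(k / a) for k in range(1, a)]

def symbol_dist(i, j, cuts):
    """Gap between the regions of symbols i and j; 0 if they are equal or adjacent."""
    lo, hi = min(i, j), max(i, j)
    if hi - lo <= 1:
        return 0.0
    # Region lo ends at cuts[lo]; region hi starts at cuts[hi - 1].
    return cuts[hi - 1] - cuts[lo]

def mindist(word_t, word_s, n, a):
    """Lower bound on the Euclidean distance between the original length-n series."""
    w = len(word_t)
    cuts = breakpoints(a)
    total = sum(symbol_dist(t, s, cuts) ** 2 for t, s in zip(word_t, word_s))
    return sqrt(n / w) * sqrt(total)

# Two length-8 series whose PAA values sit in the lowest and highest regions.
bound = mindist([0, 0], [3, 3], n=8, a=4)
```

For the constant series of -1 and +1 (length 8) the true Euclidean distance is sqrt(32), and the bound stays below it, as a lower bound must.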

5
Motivating Example

Why not just index using SAX?
For example, index 1,000,000 time series using
SAX
Choose SAX parameters
cardinality 8, word length 4
8^4 = 4,096 possible SAX word labels
Place time series which map to the same label in
the same file on disk
Compute label for query and retrieve matching
file
Time series in file likely to be good approximate
matches
Average label occupancy 1,000,000/4,096 ≈ 244
(reasonable)
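The arithmetic and the "one file per label" idea can be sketched as follows. A dictionary stands in for the files on disk, and the helper names are hypothetical.

```python
from collections import defaultdict

cardinality, word_length, n_series = 8, 4, 1_000_000
n_labels = cardinality ** word_length      # 8^4 distinct SAX words
avg_occupancy = n_series / n_labels        # entries per label if spread uniformly

def bucket_by_label(words):
    """Group series ids by SAX word; each bucket stands in for one file on disk."""
    buckets = defaultdict(list)
    for series_id, word in enumerate(words):
        buckets[tuple(word)].append(series_id)
    return buckets

def approximate_matches(buckets, query_word):
    """Return the ids stored under the query's label (empty if the label is unused)."""
    return buckets.get(tuple(query_word), [])

buckets = bucket_by_label([[0, 1, 2, 3], [7, 7, 0, 0], [0, 1, 2, 3]])
hits = approximate_matches(buckets, [0, 1, 2, 3])
```

A query whose label was never used gets an empty result, which is one symptom of the problem described on the next slide.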

6
Motivating Example (cont.)

In practice, the distribution of time series to
SAX word labels is not uniform!
Some labels are empty
Others hold a disproportionate percentage of the
dataset
Ideal condition: we want to give a threshold th
and have the number of entries n mapped to each
label satisfy 1 ≤ n ≤ th
Favor larger n
How can we achieve this? We need to make SAX more
flexible
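One way to check the ideal condition 1 ≤ n ≤ th on a given dataset is to count how many labels are empty, within the threshold, or overfull. The sketch below assumes words are sequences of integer symbols; `occupancy_report` is a hypothetical helper.

```python
from collections import Counter
from itertools import product

def occupancy_report(words, cardinality, word_length, th):
    """Classify every possible label as empty, within threshold, or overfull."""
    counts = Counter(tuple(word) for word in words)
    report = {"empty": 0, "ok": 0, "overfull": 0}
    for label in product(range(cardinality), repeat=word_length):
        n = counts.get(label, 0)
        if n == 0:
            report["empty"] += 1
        elif n <= th:
            report["ok"] += 1
        else:
            report["overfull"] += 1
    return report

# Toy data: five series share one label, one series has another, two labels are unused.
report = occupancy_report([[0, 0]] * 5 + [[1, 0]], cardinality=2, word_length=2, th=3)
```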

7
iSAX Representation

SAX uses a single hard-coded cardinality
Unable to differentiate only on dimensions of
interest
We will show that the indexing problem can be
solved if we extend SAX to allow:
Different cardinalities within a single word
Comparison of words with different cardinalities
We call this extension indexable SAX (iSAX)

8
iSAX Representation (cont.)

Multi-resolution property
Readily convert to any lower resolution that
differs by a power of two
Lower bounding distance between iSAX words
enforced through examination of both sets of
breakpoints
iSAX offers a bit aware, quantized,
multi-resolution representation with variable
granularity

Cardinality 16: {12, 13, 6, 1} = {1100, 1101, 0110, 0001}
Cardinality 8:  { 6,  6, 3, 0} = {110, 110, 011, 000}
Cardinality 4:  { 3,  3, 1, 0} = {11, 11, 01, 00}
Cardinality 2:  { 1,  1, 0, 0} = {1, 1, 0, 0}
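The multi-resolution property amounts to dropping low-order bits, so conversion to a lower cardinality is a right shift. The sketch below reproduces the example word {12, 13, 6, 1}. The function names are hypothetical, and the comparison only tests whether two words agree at their common resolution; it does not compute a distance.

```python
def reduce_cardinality(symbol, from_bits, to_bits):
    """Demote a symbol to a lower cardinality by dropping its low-order bits."""
    if to_bits > from_bits:
        raise ValueError("cannot add resolution that was never stored")
    return symbol >> (from_bits - to_bits)

def reduce_word(word, from_bits, to_bits):
    """Demote every symbol of a word from one bit width to a smaller one."""
    return [reduce_cardinality(s, from_bits, to_bits) for s in word]

def same_region(word_a, bits_a, word_b, bits_b):
    """Compare two iSAX words whose symbols may use different numbers of bits."""
    for sa, ba, sb, bb in zip(word_a, bits_a, word_b, bits_b):
        common = min(ba, bb)
        if reduce_cardinality(sa, ba, common) != reduce_cardinality(sb, bb, common):
            return False
    return True

word16 = [12, 13, 6, 1]                 # 4 bits per symbol
word8 = reduce_word(word16, 4, 3)
word4 = reduce_word(word16, 4, 2)
word2 = reduce_word(word16, 4, 1)
```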
9
Indexing with iSAX

Split a set of time series represented by a
common iSAX word into mutually exclusive subsets
(using multi-resolution property)
Increase cardinality along d dimensions of a word
of length w, 1 ≤ d ≤ w
Fan-out rate bounded by 2^d
Iterative doubling
Given a base cardinality b, cardinality at i-th
increase is b · 2^i
Breakpoints align: those of a lower cardinality
overlap with a subset of those of the next
Allows for index structures which are
hierarchical, with non-overlapping regions, and a
controlled fan-out rate
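A split along one dimension can be sketched as appending one bit to that symbol, which yields two children that partition the parent's region. The representation (a list of symbols plus a parallel list of bit widths) and the helper names are assumptions made for illustration.

```python
def split_word(word, bits, dim):
    """Split one iSAX word into two children by doubling cardinality along dim."""
    children = []
    for extra_bit in (0, 1):
        child_word = list(word)
        child_bits = list(bits)
        child_word[dim] = (word[dim] << 1) | extra_bit
        child_bits[dim] = bits[dim] + 1
        children.append((child_word, child_bits))
    return children

def cardinality_after(b, i):
    """Cardinality after the i-th doubling of base cardinality b."""
    return b * 2 ** i

def max_fan_out(d):
    """Upper bound on the number of children when d dimensions gain one bit each."""
    return 2 ** d

children = split_word([1, 0], [1, 1], dim=0)
```

Shifting either child's refined symbol right by one bit recovers the parent's symbol, which is why the subsets are mutually exclusive and the regions do not overlap.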

10
Indexing with iSAX (cont.)

Simple tree-based index (base cardinality b, word
length w, threshold th)
Hierarchically subdivides SAX space until the
number of entries in each subspace falls within th
Leaf nodes point to index files on disk
Internal nodes designate a split in SAX space
Approximate Search
Similar time series often represented by same
iSAX word
Traverse index until leaf
Match iSAX representation at each level
Apply heuristics if no match
Exact Search
Leverage approximate search
Prune search space
Lower bounding distance

11
Experimental Evaluation

We conduct experiments to identify
characteristics of the iSAX representation
Tightness of the lower bound
Indexing performance on massive datasets
Applicability to data-mining algorithms

12
Tightness of Lower Bounds

TLB = LowerBoundDist(T,S) / EuclideanDist(T,S)
For a given dataset
Time series length: 480, 960, 1440, 1920
Bytes available for representation: 16, 24, 32,
40
Results similar across thirty datasets

13
Indexing Performance on Massive Datasets

Indexed random walk datasets of 1, 2, 4, 8
million time series of length 256
Parameters: b = 4, w = 8, th = 100
Generated 39,255; 57,365; 92,209; and 162,340
index files, respectively
Approximate Search (1000 queries)
Exact Search (100 queries)

Avg. Time/Query (min)
                   1M        2M        4M        8M
Exact Search       3.8       5.8       9.0       14.1
Sequential Scan    71.5      104.8     168.8     297.6

Avg. Disk Accesses/Query
                   1M        2M        4M        8M
Exact Search       2115.3    3172.5    4925.3    7719.1
Sequential Scan    39255     57365     92209     162340
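To read the two tables as ratios, the short script below computes the speedup of exact search over sequential scan and the fraction of index files touched per query. The numbers are copied from the tables; the variable names are illustrative.

```python
sizes = ["1M", "2M", "4M", "8M"]
exact_minutes = [3.8, 5.8, 9.0, 14.1]
scan_minutes = [71.5, 104.8, 168.8, 297.6]
exact_accesses = [2115.3, 3172.5, 4925.3, 7719.1]
scan_accesses = [39255, 57365, 92209, 162340]

# How many times faster exact search is than a sequential scan.
speedup = {s: scan / exact
           for s, exact, scan in zip(sizes, exact_minutes, scan_minutes)}

# Share of all index files that an exact search reads on average.
fraction_read = {s: exact / scan
                 for s, exact, scan in zip(sizes, exact_accesses, scan_accesses)}

for s in sizes:
    print(f"{s}: {speedup[s]:.1f}x faster, {fraction_read[s]:.1%} of index files read")
```

At every dataset size the index answers exact queries more than 18 times faster than a scan while reading under 6% of the index files.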
14
Data Mining

Definition: Time Series Set Difference (TSSD)
(A,B). Given two collections of time series A and
B, the time series set difference is the
subsequence in A whose distance from its nearest
neighbor in B is maximal
Electrocardiogram dataset from a 45-year-old male
subject with suspected sleep-disordered
breathing
7.2 hours as reference set B (1,000,000 time
series)
8 minutes 39 seconds as novel set A (20,000
time series) where the patient woke up

The Time Series Set Difference discovered between
ECGs recorded during a waking cycle and the
previous 7.2 hours (respiration pattern change in
accordance with change in sleep stages)
15
Data Mining (cont.)

Solutions
1) Sequential scan A across B
2) Exact search each entry in A using index on B
3) Leverage approximate and exact search
Order A by approximate search distance in a queue
Perform exact search using index on B in
descending distance
Suspend if distance becomes lower than next entry
in the queue
If search completes, return as TSSD

                     Distance Computations   Disk Accesses   Est. Time
1) Sequential Scan   20,000,000,000          31,196          6.25 days
2) Exact Search      325,604,200             5,676,400       1.04 days
3) Leveraged         2,365,553               43,779          34 minutes
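The leveraged strategy can be sketched with a priority queue. This is a simplified variant under stated assumptions: brute-force functions stand in for the index-backed approximate and exact searches, and instead of suspending an exact search partway through, each exact search runs to completion and its result is re-queued. The approach relies on the approximate distance never being smaller than the true nearest-neighbor distance, which holds whenever it is a distance to some actual member of B.

```python
import heapq
from math import dist

def exact_nn_distance(query, B):
    """Brute-force stand-in for exact search using the index on B."""
    return min(dist(query, b) for b in B)

def approx_nn_distance(query, B, sample_step=10):
    """Stand-in for approximate search: best distance among a few entries of B.
    It can only overestimate the true nearest-neighbor distance."""
    return min(dist(query, b) for b in B[::sample_step])

def tssd(A, B):
    """Return (index into A, distance) of the entry of A farthest from its NN in B."""
    # Max-heap via negated distances; the flag records whether a distance is exact.
    heap = [(-approx_nn_distance(a, B), i, False) for i, a in enumerate(A)]
    heapq.heapify(heap)
    while heap:
        neg_d, i, is_exact = heapq.heappop(heap)
        if is_exact:
            # Every remaining entry has an upper bound no larger than this distance.
            return i, -neg_d
        heapq.heappush(heap, (-exact_nn_distance(A[i], B), i, True))
    return None

B = [(float(x), float(x % 5)) for x in range(50)]
A = [(3.0, 9.0), (60.0, 0.0), (10.0, 2.5)]
best_index, best_distance = tssd(A, B)
```

Entries whose upper bound never reaches the top of the queue are never searched exactly, which is where the savings in distance computations come from.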
16
Conclusion

Introduced the iSAX representation and shown how
it can be used for indexing time series
Demonstrated scalability and efficacy on massive
datasets
Showed how approximate and exact search can be
used in conjunction to produce exact results on
data mining problems