Title: Mining Time Series Data
1. Mining Time Series Data
CS240B Notes by Carlo Zaniolo, UCLA CS Dept.
With slides from "A Tutorial on Indexing and Mining Time Series Data", ICDM '01: The 2001 IEEE International Conference on Data Mining, November 29, San Jose.
Dr. Eamonn Keogh, Computer Science & Engineering Department, University of California, Riverside, Riverside, CA 92521. eamonn@cs.ucr.edu
2. Outline
- Introduction, Motivation
- Similarity Measures
- Properties of distance measures
- Preprocessing the data
- Time warped measures
- Indexing Time Series
- Dimensionality reduction
- Discrete Fourier Transform
- Discrete Wavelet Transform
- Singular Value Decomposition
- Piecewise Linear Approximation
- Symbolic Approximation
- Piecewise Aggregate Approximation
- Adaptive Piecewise Constant Approximation
- Summary, Conclusions
3. What are Time Series?
25.1750 25.2250 25.2500 25.2500
25.2750 25.3250 25.3500 25.3500
25.4000 25.4000 25.3250 25.2250
25.2000 25.1750 .. .. 24.6250
24.6750 24.6750 24.6250 24.6250
24.6250 24.6750 24.7500
A time series is a collection of observations
made sequentially in time.
Note that virtually all similarity measurements,
indexing and dimensionality reduction techniques
discussed in this tutorial can be used with other
data types.
4. Time Series are Ubiquitous! I
- People measure things...
- The president's approval rating.
- Their blood pressure.
- The annual rainfall in Riverside.
- The value of their Yahoo stock.
- The number of web hits per second.
- and things change over time.
Thus time series occur in virtually every medical, scientific and business domain.
5. Time Series are Ubiquitous! II
A random sample of 4,000 graphics from 15 of the world's newspapers published from 1974 to 1989 found that more than 75% of all graphics were time series (Tufte, 1983).
6. Time Series Similarity
Defining the similarity between two time series is at the heart of most time series data mining applications/tasks:
- Classification
- Clustering
- Rule Discovery (e.g., a rule with support s = 0.5 and confidence c = 0.3)
- Query by Content (given a query Q as a template)
Thus time series similarity will be the primary focus of this tutorial.
7. Why is Working With Time Series so Difficult? Part I
Answer: How do we work with very large databases?
- 1 hour of EKG data: 1 gigabyte.
- Typical weblog: 5 gigabytes per week.
- Space Shuttle database: 158 gigabytes and growing.
- Macho database: 2 terabytes, updated with 3 gigabytes per day.
Since most of the data lives on disk (or tape), we need a representation of the data we can efficiently manipulate.
8. Why is Working With Time Series so Difficult? Part II
Answer: We are dealing with subjective notions of similarity.
The definition of similarity depends on the user, the domain and the task at hand. We need to be able to handle this subjectivity.
9. Why is Working With Time Series so Difficult? Part III
Answer: Miscellaneous data handling problems.
- Differing data formats.
- Differing sampling rates.
- Noise, missing values, etc.
10. Similarity Matching Problem, Flavor 1: Whole Matching
Given a query Q (template), a reference database C = {C1, ..., C10} and a distance measure, find the Ci that best matches Q.
[Figure: Q compared against the ten database sequences; C6 is the best match.]
11. Similarity Matching Problem, Flavor 2: Subsequence Matching
Given a query Q (template), a reference database C and a distance measure, find the location in C that best matches Q.
[Figure: the best matching subsection of a long sequence in database C.]
Note that we can always convert subsequence matching to whole matching by sliding a window across the long sequence and copying the window contents.
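As a minimal illustration of that conversion (a sketch in Python/NumPy; the function name is ours, not from the tutorial):

import numpy as np

def sliding_windows(long_series, m):
    # Convert subsequence matching to whole matching: extract every
    # length-m window of the long sequence as its own candidate series.
    n = len(long_series)
    return np.array([long_series[i:i + m] for i in range(n - m + 1)])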
12. After all that background we might have forgotten what we are doing and why we care! So here is a simple motivator and review...
You go to the doctor because of chest pains. Your ECG looks strange. Your doctor wants to search a database to find similar ECGs, in the hope that they will offer clues about your condition...
Two questions:
- How do we define similar?
- How do we search quickly?
13. Similarity is Always Subjective (i.e., it depends on the application)
- "All models are wrong, but some are useful..."
This slide was taken from "A Practical Time-Series Tutorial with MATLAB", presented at ECML/PKDD 2005 by Michalis Vlachos.
14. Distance Functions
- Metric
- Euclidean distance
- Correlation
- Triangle inequality: d(x,z) <= d(x,y) + d(y,z)
- Example of pruning with the triangle inequality: assume d(Q, bestMatch) = 20 and d(Q,B) = 150. Then, since d(A,B) = 20, d(Q,A) >= d(Q,B) - d(B,A) = 150 - 20 = 130. We do not need to get A from disk (see the sketch after this list).
- Non-Metric
- Time warping
- LCSS (longest common subsequence)
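To make the pruning idea concrete, here is a minimal sketch in Python/NumPy. It assumes a hypothetical precomputed matrix pairwise of distances between database sequences; the names are illustrative, not from the tutorial.

import numpy as np

def euclidean(a, b):
    return float(np.linalg.norm(a - b))

def search_with_pruning(query, database, pairwise):
    # Linear scan that skips candidates via the triangle inequality:
    # d(Q, C_i) >= d(Q, C_last) - d(C_last, C_i), so if that lower bound
    # already exceeds the best distance, C_i need not be read from disk.
    best_dist, best_idx = np.inf, None
    last_idx = None
    for i, candidate in enumerate(database):
        if last_idx is not None and last_dist - pairwise[last_idx][i] > best_dist:
            continue  # cannot beat the best match; skip this candidate
        last_idx, last_dist = i, euclidean(query, candidate)
        if last_dist < best_dist:
            best_dist, best_idx = last_dist, i
    return best_idx, best_dist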
15. Preprocessing the Data Before Distance Calculations
- If we naively try to measure the distance between two raw time series, we may get very unintuitive results.
- This is because Euclidean distance is very sensitive to some distortions in the data. For most problems these distortions are not meaningful, and thus we can and should remove them.
- In the next 4 slides I will discuss the 4 most common distortions, and how to remove them:
- Offset translation
- Amplitude scaling
- Linear trend
- Noise
16. Transformation I: Offset Translation
Q = Q - mean(Q)
C = C - mean(C)
Then compute D(Q, C).
[Figure: the two series before and after subtracting their means.]
17. Transformation II: Amplitude Scaling
Q = (Q - mean(Q)) / std(Q)
C = (C - mean(C)) / std(C)
Then compute D(Q, C).
[Figure: the two series before and after amplitude scaling.]
18. Transformation III: Linear Trend
The intuition behind removing linear trend is this: fit the best-fitting straight line to the time series, then subtract that line from the time series.
[Figure: a series after offset translation and amplitude scaling, and the same series with the linear trend removed.]
19. Transformation IV: Noise
Q = smooth(Q)
C = smooth(C)
Then compute D(Q, C).
The intuition behind removing noise is this: average each datapoint's value with its neighbors.
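The four transformations above are easy to state in code. A minimal NumPy sketch (the function names and the window width w are our own choices, not from the tutorial):

import numpy as np

def remove_offset(x):
    # Transformation I: offset translation
    return x - x.mean()

def rescale_amplitude(x):
    # Transformation II: amplitude scaling (z-normalization)
    return (x - x.mean()) / x.std()

def remove_linear_trend(x):
    # Transformation III: subtract the best-fit straight line
    t = np.arange(len(x))
    slope, intercept = np.polyfit(t, x, 1)
    return x - (slope * t + intercept)

def smooth(x, w=5):
    # Transformation IV: average each point with its neighbors
    return np.convolve(x, np.ones(w) / w, mode='same')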
20. A Quick Experiment to Demonstrate the Utility of Preprocessing the Data
[Figure: two dendrograms over the same nine series. Left: clustered using Euclidean distance on the raw data. Right: clustered using Euclidean distance on the raw data after removing noise, linear trend, offset translation and amplitude scaling.]
21. Summary of Preprocessing
The raw time series may have distortions which we should remove before clustering, classification etc. Of course, sometimes the distortions are the most interesting thing about the data; the above is only a general rule. We should keep these problems in mind as we consider the high-level representations of time series which we will encounter later (Fourier transforms, wavelets etc.), since these representations often allow us to handle distortions in elegant ways.
22. Dynamic Time Warping
Fixed time axis: sequences are aligned one-to-one.
Warped time axis: nonlinear alignments are possible.
Note: We will first see the utility of DTW, then see how it is calculated.
23. Utility of Dynamic Time Warping: Example II, Data Mining
Power-demand time series. Each sequence corresponds to a week's demand for power in a Dutch research facility in 1997 [van Selow 1999]. (In one of the weeks, Wednesday was a national holiday.)
24. Hierarchical Clustering with Euclidean Distance <Group Average Linkage>
The two 5-day weeks are correctly grouped. Note, however, that the three 4-day weeks are not clustered together. Also, the two 3-day weeks are not clustered together.
25. Hierarchical Clustering with Dynamic Time Warping <Group Average Linkage>
The two 5-day weeks are correctly grouped. The three 4-day weeks are clustered together. The two 3-day weeks are also clustered together.
26. Dynamic Time Warping (how does it work?)
The intuition is that we copy an element multiple times so as to achieve a better matching.
Euclidean distance: d = 1 for T1 = [1, 1, 2, 2] and T2 = [1, 2, 2, 2].
Warping distance: d = 0 for T1 = [1, 1, 2, 2] and T2 = [1, 2, 2, 2] (both 1s of T1 align with the single 1 of T2).
27. Computing the Dynamic Time Warp Distance I
Note that the input sequences can be of different lengths: Q has length n, C has length p.
28. Computing the Dynamic Time Warp Distance II
Every possible mapping from Q to C can be represented as a warping path in the n-by-p search matrix. We simply want to find the cheapest one. Although there are exponentially many such paths, we can find the cheapest in only quadratic time using dynamic programming.
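A minimal sketch of that dynamic program in Python/NumPy (a textbook DTW recurrence, not the tutorial's exact code):

import numpy as np

def dtw(q, c):
    # D[i, j] = cost of the cheapest warping path aligning q[:i] with c[:j]
    n, p = len(q), len(c)
    D = np.full((n + 1, p + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, p + 1):
            cost = (q[i - 1] - c[j - 1]) ** 2
            D[i, j] = cost + min(D[i - 1, j],      # q[i-1] also maps to an earlier c
                                 D[i, j - 1],      # c[j-1] also maps to an earlier q
                                 D[i - 1, j - 1])  # one-to-one step
    return np.sqrt(D[n, p])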
29. Complexity of Time Warping
Time taken to create hierarchical clustering of the power-demand time series:
- Time to create the dendrogram using Euclidean distance: 1.2 seconds
- Time to create the dendrogram using dynamic time warping: 3.40 hours
How to speed it up:
- Approach 1: Complexity is O(n^2). We can reduce it to O(n) simply by restricting the warping path.
- Approach 2: Approximate the time series with some compressed or downsampled representation, and do DTW on the new representation (see the sketch below).
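Both speed-ups can be sketched on top of the dtw() function above; the band width w and the downsampling factor are illustrative assumptions, not values from the tutorial.

import numpy as np

def dtw_banded(q, c, w):
    # Approach 1: only allow cells with |i - j| <= w (a Sakoe-Chiba style
    # band), reducing the filled cells from O(n^2) toward O(n * w).
    n, p = len(q), len(c)
    D = np.full((n + 1, p + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(p, i + w) + 1):
            cost = (q[i - 1] - c[j - 1]) ** 2
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return np.sqrt(D[n, p])

def dtw_approx(q, c, factor=8):
    # Approach 2: downsample both series, then warp the short versions
    # (uses the dtw() sketch from the previous slide).
    return dtw(q[::factor], c[::factor])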
30. Fast Approximations to Dynamic Time Warp Distance II
[Figure: DTW on the raw data took 22.7 sec; on the approximation, 1.3 sec.]
...strong visual evidence suggests it works well. Good experimental evidence of the utility of the approach on clustering, classification and query-by-content problems has also been demonstrated.
31. Weighted Distance Measures I
Intuition: For some queries, different parts of the sequence are more important.
Weighting features is a well-known technique in the machine learning community to improve classification and the quality of clustering; a sketch of a weighted distance follows.
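A minimal sketch of such a measure, assuming one weight per time point (our own formulation of the standard weighted Euclidean distance):

import numpy as np

def weighted_euclidean(q, c, w):
    # Points with larger w_i contribute more to the distance.
    # Setting all weights to 1 recovers the ordinary Euclidean distance.
    return float(np.sqrt(np.sum(w * (q - c) ** 2)))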
32. Relevance Feedback for Time Series
The original query, and its weight vector. Initially, all weights are the same.
Note: In this example we are using a piecewise linear approximation of the data. We will learn more about this representation later.
33. The initial query is executed, and the five best matches are shown (in the dendrogram).
One by one, the 5 best matching sequences will appear, and the user will rank them from very bad (-3) to very good (+3).
34. Based on the user feedback, both the shape and the weight vector of the query are changed.
The new query can be executed. The hope is that the query shape and weights will converge to the optimal query.
Two papers consider relevance feedback for time series, among them: L. Wu, C. Faloutsos, K. Sycara, T. Payne. FALCON: Feedback Adaptive Loop for Content-Based Retrieval. VLDB 2000: 297-306.
35. Motivating Example Revisited...
You go to the doctor because of chest pains. Your ECG looks strange. Your doctor wants to search a database to find similar ECGs, in the hope that they will offer clues about your condition...
Two questions:
- How do we define similar?
- How do we search quickly?
36. Indexing Time Series
- We have seen techniques for assessing the similarity of two time series.
- However, we have not addressed the problem of finding the best match to a query in a large database...
- We need some way to index the data...
- This topic is extensively discussed in the literature; we will not discuss it here for lack of time, and it might not be applicable to data streams.
37. Compression / Dimensionality Reduction
- Project all sequences into a new space, and search this space instead.
38. An Example of a Dimensionality Reduction Technique
The graphic shows a time series C with n = 128 points. The raw data used to produce the graphic is also reproduced as a column of numbers (just the first 30 or so points are shown).
39. Dimensionality Reduction (cont.)
We can decompose the data into 64 pure sine waves using the Discrete Fourier Transform (just the first few sine waves are shown). The Fourier coefficients are reproduced as a column of numbers (just the first 30 or so coefficients are shown). Note that at this stage we have not done dimensionality reduction; we have merely changed the representation...
40. An Example of a Dimensionality Reduction Technique III
Fourier coefficients:
1.5698 1.0485 0.7160 0.8406
0.3709 0.4670 0.2667 0.1928
0.1635 0.1602 0.0992 0.1282
0.1438 0.1416 0.1400 0.1412
0.1530 0.0795 0.1013 0.1150
0.1801 0.1082 0.0812 0.0347
0.0052 0.0017 0.0002 ...
Truncated Fourier coefficients:
1.5698 1.0485 0.7160 0.8406
0.3709 0.4670 0.2667 0.1928
n = 128, N = 8, compression ratio = 1/16
...however, note that the first few sine waves tend to be the largest (equivalently, the magnitudes of the Fourier coefficients tend to decrease as you move down the column). We can therefore truncate most of the small coefficients with little effect. We have discarded 15/16 of the data.
41. An Example of a Dimensionality Reduction Technique IV
Sorted truncated Fourier coefficients:
1.5698 1.0485 0.7160 0.8406
0.2667 0.1928 0.1438 0.1416
Instead of taking the first few coefficients, we could take the best coefficients. This can help greatly in terms of approximation quality, but makes indexing hard (impossible?). Note this applies also to wavelets.
42. Compressed Representations
43. Discrete Fourier Transform I
Basic idea: Represent the time series as a linear combination of sines and cosines, but keep only the first n/2 coefficients. Why n/2 coefficients? Because each sine wave requires 2 numbers, for the phase (w) and amplitude (A, B).
[Figure: a series X, its DFT approximation X', and the first several sinusoidal basis functions. Portrait: Jean Fourier, 1768-1830.]
Excellent free Fourier primer: Hagit Shatkay, "The Fourier Transform - a Primer", Technical Report CS-95-37, Department of Computer Science, Brown University, 1995. http://www.ncbi.nlm.nih.gov/CBBresearch/Postdocs/Shatkay/
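A minimal sketch of this representation with NumPy's FFT (each complex coefficient carries the amplitude and phase of one sinusoid; the function names are ours):

import numpy as np

def dft_reduce(x, n_coeffs):
    # Keep only the first n_coeffs complex Fourier coefficients.
    return np.fft.rfft(x)[:n_coeffs]

def dft_reconstruct(coeffs, n):
    # Zero-pad back to full length and invert, giving the smooth X'.
    F = np.zeros(n // 2 + 1, dtype=complex)
    F[:len(coeffs)] = coeffs
    return np.fft.irfft(F, n)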
44. Discrete Fourier Transform II
Pros and cons of DFT as a time series representation:
- Good ability to compress most natural signals.
- Fast, off-the-shelf DFT algorithms exist: O(n log n).
- (Weakly) able to support time-warped queries.
- Difficult to deal with sequences of different lengths.
- Cannot support weighted distance measures.
Note: The related transform DCT uses only cosine basis functions. It does not seem to offer any particular advantages over DFT.
45. Discrete Wavelet Transform I
Basic idea: Represent the time series as a linear combination of wavelet basis functions, but keep only the first N coefficients. Although there are many different types of wavelets, researchers in time series mining/indexing generally use Haar wavelets. Haar wavelets seem to be as powerful as the other wavelets for most problems and are very easy to code.
[Portrait: Alfred Haar, 1885-1933.]
Excellent free wavelets primer: Stollnitz, E., DeRose, T., & Salesin, D. (1995). Wavelets for computer graphics: A primer. IEEE Computer Graphics and Applications.
46. X = [8, 4, 1, 3]
h1 = 4 = mean(8, 4, 1, 3)
h2 = 2 = mean(8, 4) - h1
h3 = 2 = (8 - 4) / 2
h4 = -1 = (1 - 3) / 2
I have converted a raw time series X = [8, 4, 1, 3] into the Haar wavelet representation H = [4, 2, 2, -1]. We can convert the Haar representation back to the raw signal with no loss of information...
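The averaging-and-differencing scheme above generalizes to any power-of-two length. A minimal sketch (our own code, not the tutorial's):

import numpy as np

def haar(x):
    # Repeatedly replace the series by pairwise averages, recording the
    # pairwise half-differences as detail coefficients.
    x = np.asarray(x, dtype=float)
    details = []
    while len(x) > 1:
        avg = (x[0::2] + x[1::2]) / 2
        diff = (x[0::2] - x[1::2]) / 2
        details = list(diff) + details   # finer details go later in H
        x = avg
    return [x[0]] + details

# haar([8, 4, 1, 3]) returns [4.0, 2.0, 2.0, -1.0], matching H above.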
47. Discrete Wavelet Transform III
Pros and cons of wavelets as a time series representation:
- Good ability to compress stationary signals.
- Fast linear-time algorithms for DWT exist.
- Able to support some interesting non-Euclidean similarity measures.
- Works best if N is 2^some_integer (a power of two); otherwise wavelets approximate the left side of the signal at the expense of the right side.
- Cannot support weighted distance measures.
Open question: We have only considered one type of wavelet; there are many others. Are the other wavelets better for indexing? YES: I. Popivanov, R. Miller. Similarity Search Over Time Series Data Using Wavelets. ICDE 2002. NO: K. Chan and A. Fu. Efficient Time Series Matching by Wavelets. ICDE 1999. Obviously, this question is still open...
48. Singular Value Decomposition
Basic idea: Represent the time series as a linear combination of eigenwaves, but keep only the first N coefficients. SVD is similar to the Fourier and wavelet approaches in that we represent the data in terms of a linear combination of shapes (in this case, eigenwaves). SVD differs in that the eigenwaves are data dependent. SVD has been successfully used in the text processing community (where it is known as Latent Semantic Indexing) for many years, but it is computationally expensive.
Good free SVD primer: "Singular Value Decomposition - A Primer", Sonia Leach.
[Figure: X and its SVD approximation X'. Portraits: James Joseph Sylvester, 1814-1897; Camille Jordan, 1838-1921; Eugenio Beltrami, 1835-1899.]
49. Singular Value Decomposition (cont.)
How do we create the eigenwaves?
We have previously seen that we can regard time series as points in high-dimensional space. We can rotate the axes such that axis 1 is aligned with the direction of maximum variance, axis 2 is aligned with the direction of maximum variance orthogonal to axis 1, etc. Since the first few eigenwaves contain most of the variance of the signal, the rest can be truncated with little loss.
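A minimal sketch of building eigenwaves with NumPy's SVD, assuming the database is a matrix X with one time series per row (the names are ours):

import numpy as np

def svd_reduce(X, N):
    # Center the data, factor it, and keep the top-N right singular
    # vectors: these are the data-dependent "eigenwaves".
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    eigenwaves = Vt[:N]
    coords = Xc @ eigenwaves.T   # N-dimensional representation of each series
    return eigenwaves, coords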
50. Piecewise Linear Approximation I
Basic idea: Represent the time series as a sequence of straight lines. Lines could be connected, in which case we are allowed N/2 lines; if lines are disconnected, we are allowed only N/3 lines. Personal experience on dozens of datasets suggests disconnected is better. Also, only disconnected allows a lower-bounding Euclidean approximation.
- Each connected line segment has: length, left_height (the right_height can be inferred by looking at the next segment).
- Each disconnected line segment has: length, left_height, right_height.
[Figure: X and its PLA approximation X'. Portrait: Karl Friedrich Gauss, 1777-1855.]
51. Piecewise Linear Approximation II
How do we obtain the piecewise linear approximation?
- The optimal solution is O(n^2 N), which is too slow for data mining.
- A vast body of work on faster heuristic solutions to the problem can be classified into the following classes (CRatio denotes the compression ratio):
- Top-Down: O(n^2 N)
- Bottom-Up: O(n / CRatio)
- Sliding Window: O(n / CRatio)
- Other (genetic algorithms, randomized algorithms, B-spline wavelets, MDL, etc.)
- A recent extensive empirical evaluation of all approaches suggests that Bottom-Up is the best approach overall; a sketch follows.
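A minimal sketch of the Bottom-Up heuristic (our own simplified version: repeatedly merge the adjacent pair of segments whose combined best-fit line has the smallest residual error):

import numpy as np

def fit_error(x, lo, hi):
    # Residual error of the least-squares line over x[lo:hi].
    if hi - lo < 3:
        return 0.0               # two points are fit exactly
    t = np.arange(lo, hi)
    resid = np.polyfit(t, x[lo:hi], 1, full=True)[1]
    return float(resid[0]) if len(resid) else 0.0

def bottom_up_pla(x, max_segments):
    # Start from many tiny segments, then greedily merge the cheapest
    # adjacent pair until only max_segments remain.
    bounds = list(range(0, len(x), 2)) + [len(x)]
    while len(bounds) - 1 > max_segments:
        costs = [fit_error(x, bounds[i], bounds[i + 2])
                 for i in range(len(bounds) - 2)]
        i = int(np.argmin(costs))
        del bounds[i + 1]        # merge segments i and i+1
    return bounds                # segment boundaries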
52. Piecewise Linear Approximation III
Pros and cons of PLA as a time series representation:
- Good ability to compress natural signals.
- Fast linear-time algorithms for PLA exist.
- Able to support some interesting non-Euclidean similarity measures, including weighted measures, relevance feedback, fuzzy queries...
- Already widely accepted in some communities (e.g., biomedical).
- Not (currently) indexable by any data structure (but does allow fast sequential scanning).
53. Symbolic Approximation
Basic idea: Convert the time series into an alphabet of discrete symbols, and use string indexing techniques to manage the data. Potentially an interesting idea, but all the papers thus far are very ad hoc.
[Figure: X and its symbolic approximation X' = C U U C D C U D. Key: C = Constant, U = Up, D = Down.]
Pros and cons of symbolic approximation as a time series representation:
- Potentially, we could take advantage of a wealth of techniques from the very mature field of string processing.
- There is no known technique to allow the support of Euclidean queries.
- It is not clear how we should discretize the time series (discretize the values, the slope, shapes? how big an alphabet? etc.); one simple slope-based possibility is sketched below.
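As one concrete (and admittedly ad hoc) possibility, here is a sketch that discretizes the slope of equal-length segments into the C/U/D alphabet used above; the segment count and flatness threshold are arbitrary assumptions:

import numpy as np

def symbolize(x, n_segments=8, flat=0.1):
    # Fit a line to each segment; call it Constant, Up or Down by slope.
    # Assumes len(x) >= 2 * n_segments so each piece has >= 2 points.
    pieces = np.array_split(np.asarray(x, dtype=float), n_segments)
    out = []
    for p in pieces:
        slope = np.polyfit(np.arange(len(p)), p, 1)[0]
        out.append('C' if abs(slope) < flat else ('U' if slope > 0 else 'D'))
    return ''.join(out)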
54. Piecewise Aggregate Approximation I
Basic idea: Represent the time series as a sequence of box basis functions. Note that each box is the same length.
Given the reduced-dimensionality representation, we can calculate the approximate Euclidean distance as
DR(Q, C) = sqrt(n / N) * sqrt(sum over i = 1..N of (qbar_i - cbar_i)^2)
Independently introduced by two sets of authors: Keogh, Chakrabarti, Pazzani & Mehrotra, KAIS (2000); Byoung-Kee Yi & Christos Faloutsos, VLDB (2000).
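A minimal sketch of PAA and of the approximate distance above (function names are ours):

import numpy as np

def paa(x, N):
    # Mean of each of N equal-length frames.
    return np.array([frame.mean()
                     for frame in np.array_split(np.asarray(x, dtype=float), N)])

def paa_distance(q_bar, c_bar, n):
    # DR(Q, C) = sqrt(n/N) * sqrt(sum_i (qbar_i - cbar_i)^2);
    # this never overestimates the true Euclidean distance.
    N = len(q_bar)
    return float(np.sqrt(n / N) * np.sqrt(np.sum((q_bar - c_bar) ** 2)))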
55. Piecewise Aggregate Approximation II
Pros and cons of PAA as a time series representation:
- Extremely fast to calculate.
- As efficient as other approaches (empirically).
- Supports queries of arbitrary lengths.
- Can support any Minkowski metric.
- Supports non-Euclidean measures.
- Supports weighted Euclidean distance.
- Simple! Intuitive!
- If visualized directly, looks aesthetically unpleasing.
56. Adaptive Piecewise Constant Approximation I
Basic idea: Generalize PAA to allow the piecewise constant segments to have arbitrary lengths. Note that we now need 2 coefficients to represent each segment: its value and its length. The representation is a sequence of pairs <cv1, cr1>, <cv2, cr2>, <cv3, cr3>, <cv4, cr4>.
The intuition is this: many signals have little detail in some places, and high detail in other places. APCA can adaptively fit itself to the data, achieving a better approximation; a greedy sketch follows.
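For illustration only, a greedy bottom-up sketch that produces such <value, length> pairs by merging adjacent constant segments; the published APCA algorithm instead starts from a truncated Haar transform.

import numpy as np

def apca(x, max_segments):
    # Greedily merge the adjacent pair of constant segments whose merged
    # segment has the smallest squared error around its mean.
    x = np.asarray(x, dtype=float)
    bounds = list(range(0, len(x), 2)) + [len(x)]
    def err(lo, hi):
        seg = x[lo:hi]
        return float(np.sum((seg - seg.mean()) ** 2))
    while len(bounds) - 1 > max_segments:
        costs = [err(bounds[i], bounds[i + 2]) for i in range(len(bounds) - 2)]
        i = int(np.argmin(costs))
        del bounds[i + 1]
    # <cv_i, cr_i>: the mean value and right endpoint of each segment.
    return [(float(x[lo:hi].mean()), hi) for lo, hi in zip(bounds[:-1], bounds[1:])]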
57. Adaptive Piecewise Constant Approximation II
The high quality of APCA had been noted by many researchers. However, it was believed that the representation could not be indexed, because some coefficients represent values and some represent lengths. However, an indexing method was discovered! (SIGMOD 2001 best paper award.) Unfortunately, it is non-trivial to understand and implement.
58. Adaptive Piecewise Constant Approximation
Pros and cons of APCA as a time series representation:
- Fast to calculate: O(n).
- More efficient than other approaches (on some datasets).
- Supports queries of arbitrary lengths.
- Supports non-Euclidean measures.
- Supports weighted Euclidean distance.
- Supports fast exact queries, and even faster approximate queries, on the same data structure.
- Somewhat complex implementation.
- If visualized directly, looks aesthetically unpleasing.
59. Conclusion
This is just an introduction, with many unavoidable omissions:
- There are dozens of papers that offer new distance measures.
- Hidden Markov models do have a sound basis, but don't scale well.
- Time series analysis remains a hot area of research, and the most recent papers have not been discussed here.