Title: Applications of Sketch Based Techniques to Data Mining Problems
1Applications of Sketch Based Techniques to Data
Mining Problems
- Nikos Koudas
- AT&T Labs Research
- joint work with
- G. Cormode, P. Indyk, S. Muthukrishnan
2Taming Massive Data Sets
- Requirements of data mining algorithms
- operate on very large data sets
- scalability
- incremental
- Most data mining algorithms have super-linear complexity
- Deploying mining algorithms on very large data sets will most likely result in terrible performance
3Sketching
- Reduce the dimensionality of the data in a systematic way, constructing short data summaries
- Effectively reduce data volume.
- Deploy mining algorithms on the reduced data volume.
- Main issue: preserve the data properties that the mining algorithm is concerned with (e.g., distances) in the reduced data volume.
4Applications
- Clustering time series data
- Clustering tabular data sets
5Introduction
- Time series data abound in many applications.
- Financial, performance data, geographical and meteorological information, solar and space data, etc.
- Various works deal with management and analysis aspects of time series data
- Indexing, storage and retrieval
- Analysis and mining (forecasting, outlier and deviation detection, etc.)
- Active research area in many research communities.
6Representative Trends
Relaxed period
Average trend
7Usage Examples
- Mining/Prediction
- Identifying periodic trends
- Uncovering unexpected periodic trends
- Performance management
- Networking (routing, traffic engineering, bandwidth allocation)
- System Tuning
- Financial databases
- Cyclic behavior
8Definitions
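- Given a time series $V = (v_1, \ldots, v_n)$ and an integer $T$, define $V(T) = \{ V_i(T) = (v_{iT+1}, v_{iT+2}, \ldots, v_{iT+T}) \}$, $0 \le i \le n/T - 1$
- Define
- $C_i(V(T)) = \sum_{1 \le j \le n/T-1} D(V_i(T), V_j(T))$, where $D$ is a distance between vectors
- Thus
- Relaxed Period $= \min C_0(V(T))$, for $T \in [l, u]$
- Average Trend $= \min_i C_i(V(T))$, for $T \in [l, u]$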
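The definitions above can be evaluated directly by brute force. The following is a minimal sketch, not the authors' implementation. It assumes $D$ is the Euclidean distance, drops any trailing values that do not fill a whole vector, and sums over all vectors $j$ (including $j = i$, which contributes zero). The function names `split_into_vectors`, `cluster_cost`, `relaxed_period` and `average_trend` are hypothetical.

```python
import numpy as np

def split_into_vectors(v, T):
    """Return the floor(n/T) consecutive, non-overlapping length-T vectors of v."""
    m = len(v) // T
    return np.asarray(v[: m * T], dtype=float).reshape(m, T)

def cluster_cost(vectors, i):
    """C_i: sum of Euclidean distances (assumed D) from vector i to every vector."""
    return float(np.linalg.norm(vectors - vectors[i], axis=1).sum())

def relaxed_period(v, l, u):
    """Return (T, cost) minimizing C_0(V(T)) over T in [l, u]; ties go to the smallest T."""
    best_T, best_cost = None, float("inf")
    for T in range(l, u + 1):
        cost = cluster_cost(split_into_vectors(v, T), 0)
        if cost < best_cost:
            best_T, best_cost = T, cost
    return best_T, best_cost

def average_trend(v, l, u):
    """Return (T, i, cost) minimizing C_i(V(T)) over T in [l, u] and all vectors i."""
    best = (None, None, float("inf"))
    for T in range(l, u + 1):
        vectors = split_into_vectors(v, T)
        for i in range(len(vectors)):
            cost = cluster_cost(vectors, i)
            if cost < best[2]:
                best = (T, i, cost)
    return best
```

For a fixed T, `relaxed_period` does work linear in n and `average_trend` does quadratic work in the number of vectors, which is where the cost of the exact approach comes from.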
9Definitions
- [Figure: Relaxed period. A time series of n points is split into n/T vectors for each T in [l, u].]
- [Figure: Average trend. n/T vectors for each T in [l, u], each of them a candidate average trend.]
10Algorithms
- There exists a quadratic algorithm for identifying relaxed periods.
- There exists a cubic algorithm for identifying average trends.
- Simply evaluate the clustering for each T in [l, u]: it takes linear time to evaluate relaxed periods for each T and quadratic time to evaluate average trends.
11Algorithms
- But can we really run these?
- Consider the length of sessions in an AT&T service for each second for a year: it is more than 31M values and approximately 256MB.
- Consider running the previous algorithms for, say, 10 years or on a finer time scale.
- Both brute force algorithms are impractical.
- Can we run faster on large datasets, too large to be in memory?
12Our Approach
- Identify representative trends faster but provide approximate answers
- General approach (expresses various notions of representative trends)
- Provides guaranteed approximation performance, with high probability
- We present our approach in the following steps:
- Define the sketch of a vector
- Algorithms for finding the sketch of all sub-vectors of width T
- Determine the sketch of all sub-vectors of width in a given range.
13Sketch of a vector
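- Given a vector $t$ of length $l$, we generate its sketch $S(t)$ as follows:
- Pick a random vector $u$ of length $l$, by picking each component $u_i$ from a normal distribution $N(0,1)$ (normalized to 1).
- Define
- $S(t)_i = t \cdot u = \sum_j t_j u_j$, where a fresh random vector $u$ is drawn for each of the $k$ sketch components $i$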
14Sketch Properties
- Theorem
- For any given set $L$ of vectors of length $l$, for a fixed $\epsilon < 1/2$, if $k = 9 \log|L| / \epsilon^2$, then for any pair of vectors $u, w$ in $L$ we have
- $(1-\epsilon)\|u-w\|^2 \le \|S(u)-S(w)\|^2 \le (1+\epsilon)\|u-w\|^2$ with probability 1/2.
- By increasing $k$ we can increase the probability of success
- This is the Johnson-Lindenstrauss (JL) lemma.
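The distance-preservation property can be checked empirically. The sketch below is illustrative only: it assumes a Gaussian projection scaled by $1/\sqrt{k}$ (one common normalization, not confirmed by the slides), and the names `make_projection` and `sketch` are hypothetical.

```python
import numpy as np

def make_projection(l, k, rng):
    """k random vectors of length l with i.i.d. N(0,1) entries, scaled by 1/sqrt(k) (assumed)."""
    return rng.standard_normal((k, l)) / np.sqrt(k)

def sketch(t, projection):
    """Sketch of t: one dot product per random vector, giving k values."""
    return projection @ np.asarray(t, dtype=float)

rng = np.random.default_rng(0)
l, k = 1000, 400                      # original length and sketch size (arbitrary choices)
P = make_projection(l, k, rng)
u = rng.standard_normal(l)
w = rng.standard_normal(l)

true_sq = float(np.sum((u - w) ** 2))
sketch_sq = float(np.sum((sketch(u, P) - sketch(w, P)) ** 2))
ratio = sketch_sq / true_sq           # close to 1 when the sketch preserves the distance
print(f"squared-distance ratio: {ratio:.3f}")
```

Larger `k` tightens the ratio around 1 at the price of bigger sketches, which is the trade-off the theorem quantifies.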
15Fixed window sketches
- Compute all sketches of sub-vectors of length $l$ in a sequence of length $n$.
- There are $n-l+1$ such sub-vectors.
- A straightforward application of JL would require $O(nlk)$ time, since there are $O(n)$ sub-vectors, each of length $l$, and the sketch is of size $k$; this is not practical.
16Key Observation
- We can compute ALL such sketches fast by using the Fast Fourier Transform (FFT).
- The problem of computing sketches of all sub-vectors of length $l$ simultaneously is exactly the problem of computing the convolution of two vectors $t$ and $u$
- Given two vectors $A[1 \ldots a]$ and $B[1 \ldots b]$, their convolution is $C[1 \ldots a+b]$ where
- $C[k] = \sum_{1 \le i \le b} A[k-i]\, B[i]$ for $2 \le k \le a+b$
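As a rough illustration of this observation, the sliding dot products of a series with one random vector can be obtained from a single FFT-based convolution with the reversed vector. This is a sketch under assumptions: it uses NumPy's `numpy.fft.rfft` and `numpy.fft.irfft`, 0-based indexing, and the hypothetical names `sliding_sketch_component` and `all_sketches`.

```python
import numpy as np

def sliding_sketch_component(t, u):
    """Dot product of u with every length-len(u) window of t, via FFT convolution."""
    t = np.asarray(t, dtype=float)
    u = np.asarray(u, dtype=float)
    n, l = len(t), len(u)
    size = n + l - 1                                  # length of the full linear convolution
    spectrum = np.fft.rfft(t, size) * np.fft.rfft(u[::-1], size)
    full = np.fft.irfft(spectrum, size)
    return full[l - 1 : n]                            # the n - l + 1 fully overlapping positions

def all_sketches(t, l, k, rng):
    """Sketches of all n - l + 1 sub-vectors of length l; returns (sketches, random vectors)."""
    vectors = rng.standard_normal((k, l))
    columns = [sliding_sketch_component(t, u) for u in vectors]
    return np.stack(columns, axis=1), vectors
```

Each of the k components costs one convolution of size about n, instead of n separate dot products of length l.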
17Example
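- Series: (2, 1, 3, 1), sub-vectors of length 2
- Convolution with (-0.97, -0.2) gives (-0.4, -2.14, -1.57, -3.11, -0.97); the middle three entries are $S_1^0, S_2^0, S_3^0$
- Convolution with (0.11, 0.99) gives (1.98, 1.21, 3.08, 1.32, 0.11); the middle three entries are $S_1^1, S_2^1, S_3^1$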
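The numbers on this slide can be reproduced with a plain convolution. The snippet below is a small check, assuming the series is (2, 1, 3, 1) and that each random vector is reversed before convolving, which is what makes the middle outputs equal to the sliding dot products. The helper name `conv_sketches` is hypothetical.

```python
import numpy as np

def conv_sketches(t, u):
    """Return the full convolution of t with reversed u, and its fully overlapping part."""
    full = np.convolve(t, u[::-1])
    return full, full[len(u) - 1 : len(t)]

t = np.array([2.0, 1.0, 3.0, 1.0])
full0, s0 = conv_sketches(t, np.array([-0.97, -0.2]))
full1, s1 = conv_sketches(t, np.array([0.11, 0.99]))

print(np.round(full0, 2), np.round(s0, 2))
print(np.round(full1, 2), np.round(s1, 2))
```

The entry for the sub-vector (2, 1) with the second random vector is 2(0.11) + 1(0.99) = 1.21, matching the second value of the second output.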
18Computing all sketches of width in a given range
- Compute all sketches of all sub-vectors of length between $l$ and $u$.
- Brute force is cubic and prohibitive
- Applying our observation would be quadratic and still prohibitive
- Can we compute all sketches of width in a given range faster?
19Approach
- We will construct a pool of sketches that we will pre-compute and store. Following this preprocessing we will be able to determine the sketch of any sub-vector in $O(1)$ fairly accurately.
- Pick an $L$ with $l \le L \le u$ and construct all sketches of length $L$ as before using convolutions; this is $O(n \log^2 n)$ in the worst case. Assume for now that $L$ is a power of 2. We actually construct two such pools, $S_1$ and $S_2$.
20Approach
- Consider any vector $t[i, \ldots, i+j-1]$; we have two cases:
- $j$ is some power of 2 ($L$): in this case we have it in the pool and we can look it up in $O(1)$
- $2^r < j < 2^{r+1}$: in this case we can compute the sketch as follows
- $S(t[i, \ldots, i+j-1])_m = S_1(t[i, \ldots, i+2^r-1])_m + S_2(t[i+j-2^r, \ldots, i+j-1])_m$ for each sketch component $m$; both terms belong to the pool
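The two-pool composition can be sketched as follows. This is an illustrative reading of the slide, not the authors' code: it builds the pools with a direct convolution for clarity, scales the random vectors by $1/\sqrt{k}$ (an assumption), and uses the hypothetical names `sliding_dots`, `build_pool` and `sketch_any` with 0-based indices.

```python
import numpy as np

def sliding_dots(t, u):
    """Dot product of u with every length-len(u) window of t."""
    return np.convolve(t, u[::-1], mode="valid")

def build_pool(t, L, k, rng):
    """Sketches (k components each) of every window of length L; row index = window start."""
    vectors = rng.standard_normal((k, L)) / np.sqrt(k)
    return np.stack([sliding_dots(t, u) for u in vectors], axis=1)

def sketch_any(pool1, pool2, i, j, L):
    """Sketch of t[i : i + j] for L <= j < 2L, where L = 2^r is the pooled window length."""
    if not L <= j < 2 * L:
        raise ValueError("j must satisfy L <= j < 2L")
    if j == L:
        return pool1[i]                      # exact pool lookup
    # Prefix of length L from pool 1 plus suffix of length L from pool 2.
    return pool1[i] + pool2[i + j - L]

rng = np.random.default_rng(7)
t = rng.standard_normal(200)
L, k, j = 16, 512, 24
pool1 = build_pool(t, L, k, rng)             # the two pools use independent random vectors
pool2 = build_pool(t, L, k, rng)

a_start, b_start = 10, 100
diff = sketch_any(pool1, pool2, a_start, j, L) - sketch_any(pool1, pool2, b_start, j, L)
true_sq = float(np.sum((t[a_start:a_start + j] - t[b_start:b_start + j]) ** 2))
ratio = float(np.sum(diff ** 2)) / true_sq
print(f"squared-distance ratio: {ratio:.3f}")
```

Because the prefix and suffix overlap in the middle of the window, the overlapping entries are counted twice, which is why the composed sketch can overestimate squared distances by up to a factor of about 2.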
21Example
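- [Figure: series (2, 1, 3, 1, 2, 3, 2, 1); the sketch S(U) of a sub-vector U is assembled from one sketch in pool S1 and one sketch in pool S2.]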
22Why is this enough?
- Theorem
- For any given set $L$ of vectors of length $l$, for fixed $\epsilon < 1/2$, if $k = 9 \log|L| / \epsilon^2$, then for any pair of vectors $u, w$ in $L$
- $(1-\epsilon)\|u-w\|^2 \le \|S(u)-S(w)\|^2 \le 2(1+\epsilon)\|u-w\|^2$ with probability 1/2
23Putting it all together
- Given $V$ and the range $[l, u]$
- Relaxed period
- Compute sketches in time $O(n \log(u-l)\, k \log u)$
- Consider every $T$ in $[l, u]$ and compute $C_0(V(T))$ for every $T$.
- Choosing $k$ as described will guarantee that we are within a factor of $2+\epsilon$ of the true relaxed period
- Average trends
- Proceed similarly by evaluating $C_i(V(T))$ for every $i$
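The overall flow for relaxed periods can be illustrated end to end. For brevity this sketch draws a fresh random projection for each T instead of reading sketches from a precomputed pool, so it shows the approximation idea rather than the speed-up. It assumes Euclidean distance for $D$; the function names and the synthetic series are hypothetical.

```python
import numpy as np

def windows(v, T):
    """The floor(n/T) consecutive, non-overlapping length-T vectors of v."""
    m = len(v) // T
    return v[: m * T].reshape(m, T)

def c0(vectors):
    """C_0: sum of Euclidean distances from the first vector to every vector."""
    return float(np.linalg.norm(vectors - vectors[0], axis=1).sum())

def relaxed_period_exact(v, l, u):
    return min(range(l, u + 1), key=lambda T: c0(windows(v, T)))

def relaxed_period_sketched(v, l, u, k, rng):
    def sketched_cost(T):
        # Project each length-T vector down to k values, then cluster the sketches.
        projection = rng.standard_normal((T, k)) / np.sqrt(k)
        return c0(windows(v, T) @ projection)
    return min(range(l, u + 1), key=sketched_cost)

rng = np.random.default_rng(1)
pattern = rng.standard_normal(50)
series = np.tile(pattern, 40) + 0.01 * rng.standard_normal(2000)  # period 50 plus small noise

exact_T = relaxed_period_exact(series, 40, 60)
approx_T = relaxed_period_sketched(series, 40, 60, k=8, rng=rng)
print(exact_T, approx_T)
```

Here every distance is computed on 8 values instead of 40 to 60; with pooled sketches the projection step itself would also be replaced by lookups.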
24Implementation Issues
- Computing Sketches
- The pool of sketches can be computed with a single pass over the data set. We only need to keep a window's worth of data across successive sketch computations.
- Retrieving Sketches
- Required sketches are retrieved by performing random I/O. However, across successive evaluations for various values of T, the required sketches are related, so random I/O can be limited through prefetching.
25Experimental Evaluation
- Real data from a service AT&T provides (utilization information).
- Size varying from 16MB (approx. 1 month) to 256MB (approx. 1 year) worth of data.
- Evaluated
- Time to construct sketches
- Scalability of sketch construction
- Efficiency of the proposed convolution based technique
- Time to compute relaxed period and average trends
- Computing sketches from scratch and with pre-computed sketches
- Using brute force approaches
- Accuracy of sketching
- Comparison with other time series reduction techniques
26Time to construct sketches
27Time to construct sketches
28Time to construct without convolution
29Time to construct sketches without convolution
30Computing relaxed periods
31Computing Relaxed Periods
32Computing relaxed period with precomputed sketches
33Computing Relaxed Periods Without Precomputed
Sketches
34Brute Force Algorithms
35Brute Force Algorithms
36Computing Average Trend
37Computing Average Trend
38Accuracy of Sketches
39Clustering Tabular Data
- Many applications produce data in two dimensional array form.
- Consider traditional telecommunication applications
- Data are collected from a variety of collection stations across the country, recording call volume at some temporal granularity.
- 2d call volume data set (spatial ordering of collection stations versus time) recording temporal call activity, approx. 18MB/day.
41Clustering tabular data
- Data elements to be clustered are rectangular data regions.
- Clustering might reveal interesting similarities (in call volume and time) between geographical regions.
- One month: 600MB of data.
- Sketch rectangular regions
- Extend sketches in 2d
- Sketching with respect to any $L_p$ norm, $p \in (0, 2]$
43Summary of results
- Sketch construction scales nicely with respect to data volume and sketch size.
- Convolution based sketch computation is very effective.
- Sketch based approach is orders of magnitude better than brute force for computing relaxed periods and average trends.
- Performance benefits increase for larger data sets.
- If sketches are pre-computed, clustering can be performed in seconds even for very large data sets.
- In practice sketches of low dimensionality provide great accuracy.
- Compared with other dimensionality reduction techniques, the sketch based approach is more accurate and effective.
44Conclusions
- Scalability to large data volumes is a requirement of the data mining process.
- Effectively reduce data volume using sketches.
- Preserve data properties required by mining algorithms (e.g., various distances).
- Core techniques; various algorithms could benefit from them.
- Very large performance benefits, small loss in accuracy.
45Contact
- koudas_at_research.att.com
- www.research.att.com/koudas