Review - PowerPoint PPT Presentation

About This Presentation

Title:

Review

Description:

Indexing Time Series using GEMINI (GEneric Multimedia INdexIng) ... of (often large) observational data sets to find unsuspected relationships and ... – PowerPoint PPT presentation

Slides: 46

Provided by: gkol

Learn more at: https://www.cs.bu.edu

Transcript and Presenter's Notes

Title: Review

1
Review
2
Time Series Data
A time series is a collection of observations
made sequentially in time.
25.1750 25.1750 25.2250 25.2500
25.2500 25.2750 25.3250 25.3500
25.3500 25.4000 25.4000 25.3250
25.2250 25.2000 25.1750 .. ..
24.6250 24.6750 24.6750 24.6250
24.6250 24.6250 24.6750 24.7500
[Figure: plot of the series; axes: value, time]
3
Time Series Problems (from a database
perspective)

The Similarity Problem
$X = x_1, x_2, \ldots, x_n$ and $Y = y_1, y_2, \ldots, y_n$
Define and compute Sim(X, Y)
E.g., do stocks X and Y have similar movements?
Efficiently retrieve similar time series
(Similarity Queries)

4
Similarity Models

Euclidean and Lp based
Dynamic Time Warping
Edit Distance and LCS based
Probabilistic (using Markov Models)
Landmarks
How appropriate a similarity model is depends on
the application

5
Euclidean model
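
A minimal sketch of the Euclidean model, generalized to the Lp family named on the previous slide. The function name `lp_distance` is illustrative and not from the slides; it assumes the two series have the same length.

```python
def lp_distance(x, y, p=2):
    """L_p distance between two equal-length sequences (p=2 is Euclidean)."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    # Sum the p-th powers of the pointwise differences, then take the p-th root.
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)
```

With p=1 this is the Manhattan distance; larger p gives more weight to the largest pointwise difference.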
6
Dynamic Time Warping [Berndt, Clifford, 1994]

Allows acceleration-deceleration of signals along
the time dimension
Basic idea
Consider $X = x_1, x_2, \ldots, x_n$ and $Y = y_1, y_2, \ldots, y_n$
We are allowed to extend each sequence by
repeating elements
Euclidean distance is now calculated between the
extended sequences $X'$ and $Y'$

7
Dynamic Time Warping [Berndt, Clifford, 1994]
8
Restrictions on Warping Paths

Monotonicity
Path should not go down or to the left
Continuity
No elements may be skipped in a sequence
Warping Window
$|i - j| \le w$

9
Formulation

Let $D(i, j)$ refer to the dynamic time warping
distance between the subsequences
$x_1, x_2, \ldots, x_i$
$y_1, y_2, \ldots, y_j$
$D(i, j) = |x_i - y_j| + \min \{ D(i-1, j),\; D(i-1, j-1),\; D(i, j-1) \}$
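
The recurrence can be filled in with a dynamic-programming table. The sketch below is one possible implementation, not the presenter's code: the function name `dtw_distance`, the base case D(0, 0) = 0, and the optional `window` parameter (the warping-window width w from the previous slide) are assumptions made for illustration.

```python
def dtw_distance(x, y, window=None):
    """Dynamic time warping distance with |x_i - y_j| as the local cost.

    `window` is an optional warping-window width w, enforcing |i - j| <= w.
    """
    n, m = len(x), len(y)
    if window is None:
        window = max(n, m)            # no constraint
    window = max(window, abs(n - m))  # the end cell (n, m) must stay reachable
    inf = float("inf")
    # D[i][j] holds the DTW distance between x[:i] and y[:j].
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - window), min(m, i + window) + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i - 1][j - 1], D[i][j - 1])
    return D[n][m]
```

The table costs O(nm) time and space without a window; a window of width w restricts the work to a band of about n(2w + 1) cells.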

10
Basic LCS Idea

X = 3, 2, 5, 7, 4, 8, 10, 7
Y = 2, 5, 4, 7, 3, 10, 8, 6
LCS = 2, 5, 7, 10

Sim(X, Y) = |LCS| or Sim(X, Y) = |LCS| / n
(LCS: Longest Common Subsequence)
Edit distance is another possibility
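
A minimal sketch of the LCS similarity on the slide's example, using the textbook dynamic program. The function names are illustrative, and taking n as the longer of the two lengths is an assumption (the slide's sequences have equal length).

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of x and y."""
    n, m = len(x), len(y)
    # L[i][j] is the LCS length of x[:i] and y[:j].
    L = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    return L[n][m]


def lcs_similarity(x, y):
    """|LCS| / n, with n taken as the longer length (an assumption)."""
    n = max(len(x), len(y))
    return lcs_length(x, y) / n if n else 0.0


X = [3, 2, 5, 7, 4, 8, 10, 7]
Y = [2, 5, 4, 7, 3, 10, 8, 6]
print(lcs_length(X, Y), lcs_similarity(X, Y))  # 4 0.5
```

For real-valued series, exact equality is usually replaced by a tolerance test such as |x_i - y_j| < epsilon; that variant is not shown here.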
11
Indexing Time Series using GEMINI

(GEneric Multimedia INdexIng)
Extract a few numerical features, for a quick
and dirty test

12
GEMINI - Pictorially
[Figure: each series mapped to a point in feature space; axes: e.g., avg and e.g., std]
13
GEMINI

Solution: 'quick-and-dirty' filter
extract n features (numbers, e.g., avg, etc.)
map into a point in n-d feature space
organize points with off-the-shelf spatial access
method (SAM)
discard false alarms

14
GEMINI

Important Q: how to guarantee no false
dismissals?
A1: preserve distances (but difficult/impossible)
A2: Lower-bounding lemma: if the mapping 'makes
things look closer', then there are no false
dismissals

15
Feature Extraction

How to extract the features? How to define the
feature space?
Fourier transform
Wavelet transform
Averages of segments (Histograms or APCA)

16
Piecewise Aggregate Approximation (PAA)
Original time series (n-dimensional
vector): $S = s_1, s_2, \ldots, s_n$
N-segment PAA representation (N-d vector):
$\bar{S} = sv_1, sv_2, \ldots, sv_N$
PAA representation satisfies the lower bounding
lemma (Keogh, Chakrabarti, Mehrotra and Pazzani,
2000; Yi and Faloutsos, 2000)
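
A hedged sketch of how PAA plugs into the GEMINI filter-and-refine scheme. It assumes the series length n is a multiple of N and uses the standard PAA lower bound, sqrt(n/N) times the Euclidean distance between the two PAA vectors. All function names are illustrative. The database is scanned linearly here; a real system would precompute the PAA points and store them in a spatial access method.

```python
import math


def paa(series, n_segments):
    """Mean of each of n_segments equal-length frames of the series."""
    n = len(series)
    if n % n_segments != 0:
        raise ValueError("series length must be a multiple of n_segments")
    size = n // n_segments
    return [sum(series[k * size:(k + 1) * size]) / size for k in range(n_segments)]


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def paa_lower_bound(px, py, n):
    """sqrt(n/N) * Euclidean distance between two N-segment PAA vectors."""
    return math.sqrt(n / len(px)) * euclidean(px, py)


def range_query(query, database, radius, n_segments):
    """Indices of all series within `radius` of `query` (filter, then refine)."""
    q = paa(query, n_segments)
    hits = []
    for idx, s in enumerate(database):
        # Filter: the bound never exceeds the true distance, so no false dismissals.
        if paa_lower_bound(q, paa(s, n_segments), len(query)) > radius:
            continue
        # Refine: discard false alarms using the true distance.
        if euclidean(query, s) <= radius:
            hits.append(idx)
    return hits
```

Because the bound only ever underestimates the true distance, the filter step can produce false alarms but never false dismissals.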
17
Can we improve upon PAA?
N-segment PAA representation (N-d vector):
$\bar{S} = sv_1, sv_2, \ldots, sv_N$
18
Dimensionality Reduction

Many problems (like time-series and image
similarity) can be expressed as proximity
problems in a high dimensional space
Given a query point we try to find the points
that are close
But in high-dimensional spaces things are
different!

19
MDS (multidimensional scaling)

Input: a set of N items, the pairwise
(dis)similarities, and the dimensionality k
Optimization criterion:
$\text{stress} = \left( \sum_{i,j} \left( D(S_i, S_j) - D(S_i^k, S_j^k) \right)^2 \Big/ \sum_{i,j} D(S_i, S_j)^2 \right)^{1/2}$
where $D(S_i, S_j)$ is the distance between time
series $S_i$, $S_j$, and $D(S_i^k, S_j^k)$ is the Euclidean
distance of the k-dim representations
Steepest descent algorithm:
start with an assignment (time series to k-dim
point)
minimize stress by moving points
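
A minimal sketch of the stress criterion only, not of the steepest-descent search. It assumes a symmetric distance matrix `dist` and a list `points` of k-dimensional coordinates, both illustrative names. Summing over pairs i < j gives the same ratio as summing over all ordered pairs when the distances are symmetric.

```python
import math


def stress(dist, points):
    """Stress of a k-dim embedding `points` against original distances `dist`."""
    num = den = 0.0
    n = len(points)
    for i in range(n):
        for j in range(i + 1, n):
            d_embedded = math.dist(points[i], points[j])  # Euclidean distance in k-d
            num += (dist[i][j] - d_embedded) ** 2
            den += dist[i][j] ** 2
    return math.sqrt(num / den)
```

A stress of 0 means the embedding reproduces every pairwise distance exactly; collapsing all points onto one location gives a stress of 1.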

20
FastMap [Faloutsos and Lin, 1995]

Maps objects to k-dimensional points so that
distances are preserved well
It is an approximation of Multidimensional
Scaling
Works even when only distances are known
Is efficient, and allows efficient query
transformation

21
Other DR methods

PCA (Principal Component Analysis)
Move the center of the dataset to the
origin. Define the covariance matrix $A^T A$.
Use SVD and project the items on the first k
eigenvectors
Random projections

22
What is Data Mining?

Data Mining is
(1) The efficient discovery of previously
unknown, valid, potentially useful,
understandable patterns in large datasets
(2) The analysis of (often large) observational
data sets to find unsuspected relationships and
to summarize the data in novel ways that are both
understandable and useful to the data owner

24
Association Rules

Given: (1) a database of transactions, (2) each
transaction is a list of items (purchased by a
customer in a visit)
Find all association rules that satisfy
user-specified minimum support and minimum
confidence
Example: 30% of transactions that contain beer
also contain diapers; 5% of transactions contain
these items
30%: confidence of the rule
5%: support of the rule
We are interested in finding all rules rather
than verifying if a rule holds
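
A small sketch of the two measures. The function names and the toy transactions are made up for illustration and are not the slide's data.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)


def confidence(transactions, lhs, rhs):
    """Confidence of the rule lhs -> rhs: support(lhs and rhs) / support(lhs)."""
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)


# Hypothetical toy data.
baskets = [
    {"beer", "diapers"},
    {"beer", "chips"},
    {"milk"},
    {"beer", "diapers", "milk"},
]
print(support(baskets, {"beer", "diapers"}))       # 0.5
print(confidence(baskets, {"beer"}, {"diapers"}))  # 2 of the 3 beer baskets
```

Support is measured over all transactions, while confidence is measured only over the transactions that contain the left-hand side of the rule.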

25
Problem Decomposition

1. Find all sets of items that have minimum
support (frequent itemsets)
2. Use the frequent itemsets to generate the
desired rules

26
Mining Frequent Itemsets

Apriori
Key idea: every subset of a frequent itemset must
also be a frequent itemset (anti-monotonicity)
Max-miner
Idea: instead of checking all subsets of a long
pattern, try to detect long patterns early
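
A compact level-wise sketch of the Apriori idea, written for clarity rather than speed. The function name `apriori` and its return format are assumptions. Max-miner is not shown.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def frequent(candidates):
        # Count each candidate with one pass over the transactions.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        return {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}

    level = frequent({frozenset([i]) for t in transactions for i in t})
    result = dict(level)
    k = 2
    while level:
        prev = set(level)
        # Join: unite pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune (anti-monotonicity): every (k-1)-subset must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        level = frequent(candidates)
        result.update(level)
        k += 1
    return result
```

The prune step is where anti-monotonicity pays off: a candidate with any infrequent subset is dropped before its support is ever counted.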

27
FP-tree

Compress a large database into a compact
Frequent-Pattern tree (FP-tree) structure:
highly condensed, but complete for frequent
pattern mining
Create the tree and then run the algorithm
recursively over the tree (conditional base for
each item)

28
Association Rules

Multi-level association rules: each attribute has
a hierarchy. Find rules per level or at different
levels
Quantitative association rules:
numerical attributes
Other methods to find correlation:
lift, correlation coefficient

29
What is Cluster Analysis?

Cluster: a collection of data objects
Similar to one another within the same cluster
Dissimilar to the objects in other clusters
Cluster analysis:
grouping a set of data objects into clusters
Typical applications:
As a stand-alone tool to get insight into data
distribution
As a preprocessing step for other algorithms

30
Major Clustering Approaches

Partitioning algorithms: construct various
partitions and then evaluate them by some
criterion
Hierarchical algorithms: create a hierarchical
decomposition of the set of data (or objects)
using some criterion
Density-based algorithms: based on connectivity
and density functions
Model-based: a model is hypothesized for each of
the clusters and the idea is to find the best fit
of the data to that model

31
Partitioning Algorithms: Basic Concept

Partitioning method: construct a partition of a
database D of n objects into a set of k clusters
Given a k, find a partition of k clusters that
optimizes the chosen partitioning criterion
Global optimum: exhaustively enumerate all
partitions
Heuristic methods: k-means and k-medoids
algorithms
k-means (MacQueen, 1967): each cluster is
represented by the center of the cluster
k-medoids or PAM (Partitioning Around Medoids)
(Kaufman & Rousseeuw, 1987): each cluster is
represented by one of the objects in the cluster
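
A minimal sketch of the k-means heuristic in its common Lloyd-style form (assign each point to the nearest center, then recompute the centers). This is one standard formulation, not necessarily the variant the presenter had in mind. The function name, the random choice of initial centers, and the `seed` parameter are assumptions. The returned `sse` is the square-error score of the final partition.

```python
import random


def kmeans(points, k, iterations=100, seed=0):
    """Lloyd-style k-means on a list of equal-length tuples.

    Returns (centers, labels, sse).
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial centers: k of the input points

    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    labels = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        if new_labels == labels:
            break  # converged: no point changed cluster
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old center if a cluster went empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    sse = sum(sq_dist(p, centers[l]) for p, l in zip(points, labels))
    return centers, labels, sse
```

The result depends on the initial centers, so in practice the procedure is often restarted with several seeds and the partition with the lowest `sse` is kept.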

32
Optimization problem

The goal is to optimize a score function
The most commonly used is the square error
criterion

33
CLARANS (Randomized CLARA)

CLARANS (A Clustering Algorithm based on
Randomized Search) (Ng and Han, 1994)
CLARANS draws a sample of neighbors dynamically
The clustering process can be presented as
searching a graph where every node is a potential
solution, that is, a set of k medoids
If a local optimum is found, CLARANS starts
with a new randomly selected node in search of a
new local optimum
It is more efficient and scalable than both PAM
and CLARA

34
Hierarchical Clustering

Use the distance matrix as the clustering criterion.
This method does not require the number of clusters k
as an input, but needs a termination condition

35
HAC

Different approaches to merge clusters:
Min distance
Average distance
Max distance
Distance of the centers
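
A naive sketch of agglomerative clustering with the four merge criteria listed above. The function names and the linkage labels `"min"`, `"max"`, `"average"`, and `"centers"` are illustrative. Stopping at k clusters is used as the termination condition, which is an assumption.

```python
import math


def cluster_distance(a, b, linkage="min"):
    """Distance between clusters a and b (lists of points) under a linkage rule."""
    if linkage == "centers":
        center = lambda c: [sum(col) / len(c) for col in zip(*c)]
        return math.dist(center(a), center(b))
    pairwise = [math.dist(p, q) for p in a for q in b]
    if linkage == "min":      # single link
        return min(pairwise)
    if linkage == "max":      # complete link
        return max(pairwise)
    if linkage == "average":  # average link
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown linkage: {linkage}")


def hac(points, k, linkage="min"):
    """Repeatedly merge the two closest clusters until only k remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        pairs = ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]], linkage))
        merged = clusters.pop(j)  # j > i, so index i is unaffected
        clusters[i].extend(merged)
    return clusters
```

This version recomputes every cluster distance at each step, which is far too slow for large inputs; it is meant only to make the four criteria concrete.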

36
BIRCH

BIRCH: Balanced Iterative Reducing and Clustering
using Hierarchies, by Zhang, Ramakrishnan, Livny
(SIGMOD'96)
Incrementally construct a CF (Clustering Feature)
tree, a hierarchical data structure for
multiphase clustering
Phase 1: scan DB to build an initial in-memory CF
tree (a multi-level compression of the data that
tries to preserve the inherent clustering
structure of the data)
Phase 2: use an arbitrary clustering algorithm to
cluster the leaf nodes of the CF-tree

37
CURE (Clustering Using REpresentatives)

CURE: proposed by Guha, Rastogi & Shim, 1998
Stops the creation of a cluster hierarchy if a
level consists of k clusters
Uses multiple representative points to evaluate
the distance between clusters, adjusts well to
arbitrarily shaped clusters, and avoids the
single-link effect

38
Density-Based Clustering Methods

Clustering based on density (local cluster
criterion), such as density-connected points
Major features:
Discover clusters of arbitrary shape
Handle noise
One scan
Need density parameters as termination condition
Several interesting studies:
DBSCAN: Ester, et al. (KDD'96)
OPTICS: Ankerst, et al. (SIGMOD'99)
DENCLUE: Hinneburg & D. Keim (KDD'98)
CLIQUE: Agrawal, et al. (SIGMOD'98)

39
Model based clustering

Assume the data are generated from K probability
distributions
Typically Gaussian distributions
Soft or probabilistic version of K-means clustering
Need to find the distribution parameters:
EM Algorithm

40
Classification

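Given old data about customers and payments,
predict a new applicant's loan eligibility.

[Figure: data on previous customers (attributes: Age, Salary, Profession, Location, Customer type) is fed to a classifier, which produces decision rules (on Salary, threshold 5 L, and on Prof., value Exec) that label a new applicant's data as good/bad]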
41
Decision trees

Tree where internal nodes are simple decision
rules on one or more attributes and leaf nodes
are predicted class labels.

[Figure: example decision tree with tests on Salary, Prof. (teaching), and Age]
42
Building tree

GrowTree(TrainingData D)
    Partition(D)

Partition(Data D)
    if (all points in D belong to the same class) then
        return
    for each attribute A do
        evaluate splits on attribute A
    use best split found to partition D into D1 and D2
    Partition(D1)
    Partition(D2)

43
Split Criteria

Select the attribute that is best for
classification.
Information Gain
Gini Index
$\text{Gini}(D) = 1 - \sum_j p_j^2$

$\text{Gini}_{split}(D) = \frac{n_1}{n}\,\text{gini}(D_1) + \frac{n_2}{n}\,\text{gini}(D_2)$
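
A short sketch of the Gini index and of the "evaluate splits on attribute A" step for a single numeric attribute. The function names and the choice of binary splits of the form value <= t are assumptions made for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini(D) = 1 - sum_j p_j^2, where p_j is the relative frequency of class j."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_split(left, right):
    """Size-weighted Gini index of a binary partition into D1 and D2."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_threshold(values, labels):
    """Best split `value <= t` on one numeric attribute: returns (score, t)."""
    best = (float("inf"), None)
    for t in sorted(set(values))[:-1]:  # the largest value would leave D2 empty
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        best = min(best, (gini_split(left, right), t))
    return best
```

A lower split score is better: 0 means both partitions are pure, and a two-class node with an even mix scores 0.5.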
44
SLIQ (Supervised Learning In Quest)

Decision-tree classifier for data mining
Design goals:
Able to handle large disk-resident training sets
No restrictions on training-set size

45
Bayesian Classification

Probabilistic approach based on Bayes' theorem
MAP (maximum a posteriori) hypothesis

46
Naïve Bayes Classifier (I)

A simplifying assumption: attributes are
conditionally independent
Greatly reduces the computation cost: only count
the class distribution.
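
A minimal counting-based sketch of a naive Bayes classifier for categorical attributes, choosing the MAP class. The function names and the toy loan data are hypothetical. No smoothing is applied, so an attribute value never seen with a class gives that class probability zero.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (attribute index, class) -> value counts
    for row, c in zip(rows, labels):
        for a, v in enumerate(row):
            value_counts[(a, c)][v] += 1
    return class_counts, value_counts


def predict(model, row):
    """MAP class: argmax over c of P(c) * product over a of P(row[a] | c)."""
    class_counts, value_counts = model
    total = sum(class_counts.values())

    def score(c):
        p = class_counts[c] / total
        for a, v in enumerate(row):
            p *= value_counts[(a, c)][v] / class_counts[c]
        return p

    return max(class_counts, key=score)


# Hypothetical toy data: (salary level, profession) -> customer type.
rows = [("high", "exec"), ("high", "exec"), ("low", "teacher"), ("low", "exec")]
labels = ["good", "good", "bad", "bad"]
model = train_naive_bayes(rows, labels)
```

Training is a single counting pass over the data, which is the cost reduction the slide refers to.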

47
Bayesian Belief Networks (I)
[Figure: Bayesian belief network over the variables Age, FamilyH, Mass, Diabetes, Insulin, and Glucose]
The conditional probability table for the
variable Mass

      (FH, A)   (FH, ~A)   (~FH, A)   (~FH, ~A)
M     0.7       0.8        0.5        0.1
~M    0.3       0.2        0.5        0.9
