Title: Mining Massive RFID, Trajectory, and Traffic Data Sets
1Mining Massive RFID, Trajectory, and Traffic Data Sets
KDD'08 Tutorial
- Jiawei Han, Jae-Gil Lee, Hector Gonzalez, Xiaolei Li
- ACM SIGKDD'08 Conference Tutorial
- Las Vegas, NV
- August 24, 2008
2Tutorial Outline
- Part I. RFID Data Mining
- Part II. Trajectory Data Mining
- Part III. Traffic Data Mining
- Part IV. Conclusions
3Part 1. RFID Data Mining
- Introduction to RFID Data
- Why RFID Data Warehousing and Mining?
- RFID Data Warehousing
- Mining RFID Data Sets
- Conclusions
4RFID Technology
- Radio Frequency Identification (RFID)
- Technology that allows a sensor (reader) to read,
from a distance, and without line of sight, a
unique electronic product code (EPC) associated
with a tag
5Broad Applications of RFID Technology
- Supply chain management: real-time inventory tracking
- Retail: active shelves monitor product availability
- Access control: toll collection, credit cards, building access
- Airline luggage management: reduce lost/misplaced luggage
- Medical: implant patients with a tag that contains their medical history
- Pet identification: implant an RFID tag with the pet owner's information
6Inventory Management
How many pens should we reorder?
7Asset Tracking
British Airways loses 20 million bags a year
8Electronic Toll Collection
Illinois: 1 million drivers a day use I-Pass
9RFID System (Tag, Reader, Database)
Source www.belgravium.com
10RFID Data Warehousing and Mining
Figure: architecture of the system. RFID data from sites 1 through k passes through data cleaning into a warehousing engine and the RFID warehouse; a mining engine on top supports flow mining, traffic mining, and other analyses. Key references:
- Data cleaning: "Cost-Conscious Cleaning of Massive RFID Data Sets", Gonzalez et al., ICDE'07
- Warehousing engine: "Warehousing and Analyzing Massive RFID Data Sets", Gonzalez et al., ICDE'06 (Best Student Paper)
- RFID warehouse / workflow mining: "Mining Compressed Commodity Workflows From Massive RFID Data Sets", Gonzalez et al., CIKM'06
- Flow mining: "FlowCube: Constructing RFID FlowCubes for Multi-Dimensional Analysis of Commodity Flows", Gonzalez et al., VLDB'06
- Traffic mining: "Adaptive Fastest Path Computation on a Road Network: A Traffic Mining Approach", Gonzalez et al., VLDB'07
11Part 1. RFID Data Mining
- Introduction to RFID Data
- Why RFID Data Warehousing and Mining?
- RFID Data Warehousing
- Mining RFID Data Sets
- Conclusions
12Challenges of RFID Data Sets
- Data generated by RFID systems is enormous (petabytes in scale!) due to redundancy and a low level of abstraction
- Walmart is expected to generate 7 terabytes of RFID data per day
- Data analysis requirements
- Highly compact summary of the data
- OLAP operations on a multi-dimensional view of the data
- Preserving the path structure of RFID data for analysis
- Efficiently drilling down to individual tags when an interesting pattern is discovered
13Example Trajectory
(Factory, T1, T2)
(Shipping, T3, T4)
(Warehouse, T5, T6)
(Shelf, T7, T8)
(Checkout, T9, T10)
14Data Generation
EPC (L1,T1)(L2,T2)(Ln,Tn)
EPC, Location, Time_in, Time_out
EPC, Location, Time
15RFID Data Warehouse Modeling
- Three models in typical RFID applications
- Bulky movements supply-chain management
- Scattered movements E-pass tollway system
- No movements fixed location sensor networks
- Different applications may require different data
warehouse systems - Our discussion will focus on bulky movements
16Why RFID-Warehousing?
- Lossless compression for bulky movement data
- Significantly reduces the size of the RFID data set by removing redundancy and grouping objects that move and stay together
- Data cleaning: reasoning based on more complete information
- Multi-reading, miss-reading, error-reading, bulky movement, ...
- Multi-dimensional summary, multiple views
- Multi-dimensional view: product, location, time, ...
- Store manager: check item movements from the backroom to different shelves in his store
- Region manager: collapse intra-store movements and look at distribution centers, warehouses, and stores
17Example A Supply Chain Store
- A retailer with 3,000 stores, selling 10,000 items a day per store
- Each item moves 10 times on average before being sold
- Movement recorded as (EPC, location, second)
- Data volume: 300 million tuples per day (after redundancy removal)
- OLAP query: costly to answer if it means scanning 1 billion tuples
- e.g., what is the avg time for outerwear items to move from the warehouse to the checkout counter in March 2006?
- Mining query
- e.g., is there a correlation between the time spent in transportation and milk going rotten in store S?
18Part 1. RFID Data Mining
- Introduction to RFID Data
- Why RFID Data Warehousing and Mining?
- RFID Data Warehousing
- Mining RFID Data Sets
- Conclusions
19Cleaning of RFID Data Records
- Raw data
- (EPC, location, time)
- Duplicate records due to multiple readings of a product at the same location
- (r1, l1, t1) (r1, l1, t2) ... (r1, l1, t10)
- Cleansed data: store the minimal information and remove the raw data
- (EPC, location, time_in, time_out)
- (r1, l1, t1, t10)
- Warehousing can help fill in missing records and correct wrongly-registered information (see the sketch below)
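To make the cleansing step concrete, here is a minimal Python sketch (not the tutorial's own code) of collapsing duplicate raw readings into stay records; the tag and location names are illustrative.

```python
def to_stay_records(raw_readings):
    """Collapse raw (EPC, location, time) readings into cleansed
    (EPC, location, time_in, time_out) stay records: consecutive
    readings of a tag at the same location become one record."""
    stays = []
    open_stays = {}                      # EPC -> [location, time_in, time_out]
    for epc, loc, t in sorted(raw_readings, key=lambda r: (r[0], r[2])):
        stay = open_stays.get(epc)
        if stay is not None and stay[0] == loc:
            stay[2] = t                  # same tag, same place: extend the stay
        else:
            if stay is not None:         # the tag moved: close the previous stay
                stays.append((epc, stay[0], stay[1], stay[2]))
            open_stays[epc] = [loc, t, t]
    for epc, (loc, t_in, t_out) in open_stays.items():
        stays.append((epc, loc, t_in, t_out))
    return stays

# Example: ten duplicate readings of r1 at l1 collapse to one record
raw = [("r1", "l1", t) for t in range(1, 11)]
print(to_stay_records(raw))              # [('r1', 'l1', 1, 10)]
```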
20What is a Data Warehouse?
Figure: operational data from sites 1 through k is extracted, transformed, and loaded into the warehouse, which materializes cubes (Cube 1, Cube 2, ..., Cube N) for OLAP over all possible groupings of (Product, Location, Time): products such as DVD, PC, TV; locations such as Chicago, Boston, New York; times Q1-Q4; e.g., cells (DVD, Chicago, All), (Q1, All, All), (All, All, All).
21Why Do We Need a New Design?
- Ex. What is the avg time that milk coming from farm A via truck 1 stays at the Champaign Walmart?
Paths are lost in the aggregation
22Data Compression
Raw Data (EPC, Reader, Time)
-> [lossless: redundancy elimination] Cleansed Data (EPC, Reader, T_in, T_out)
-> [lossless: bulky movement compression] Stay (GID, Reader, T_in, T_out)
-> [lossy: path and item abstraction] Stay (GID, Locale, Day1, Day2)
23Bulky Object Movements
Figure: a GID hierarchy (1; 1.1, 1.2; 1.1.1, 1.1.2; 1.1.1.1, 1.1.1.2; ...) in which a single record such as (i1, i2, ..., i10000, Dist Center 1, 01/01/08, 01/03/08) summarizes 10,000 items that moved together.
24Data Compression with GID
- Bulky object movements
- Objects often move and stay together
- If 1,000 packs of soda stay together at the distribution center, register a single record:
- (GID, distribution center, time_in, time_out)
- GID is a generalized identifier that represents the 1,000 packs that stayed together at the distribution center (see the figure and sketch below)
Figure: packaging hierarchy from factory to store shelves: 10 packs (12 sodas each) -> 20 cases (1,000 packs) -> 10 pallets (1,000 cases), moving from the factory through Dist. Centers 1 and 2 to shelves 1 and 2 in stores 1 and 2.
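As a rough illustration of the grouping idea, the sketch below (a simplification: it groups only tags whose entire stay history is identical, whereas the actual warehouse uses hierarchical GIDs that split as groups diverge) assigns one GID per group and one stay record per group per stop.

```python
from collections import defaultdict
from itertools import count

def compress_with_gids(stay_records):
    """Group tags whose full (location, time_in, time_out) history is
    identical under one generalized identifier (GID), emitting one stay
    record per group per stop plus a map from GID to member EPCs."""
    histories = defaultdict(list)
    for epc, loc, t_in, t_out in sorted(stay_records, key=lambda r: (r[0], r[2])):
        histories[epc].append((loc, t_in, t_out))

    groups = defaultdict(list)               # identical history -> member EPCs
    for epc, hist in histories.items():
        groups[tuple(hist)].append(epc)

    gid_counter = count(1)
    gid_map, gid_stays = {}, []
    for hist, epcs in groups.items():
        gid = f"g{next(gid_counter)}"
        gid_map[gid] = epcs
        for loc, t_in, t_out in hist:
            gid_stays.append((gid, loc, t_in, t_out))
    return gid_stays, gid_map

# 1,000 packs with the same history produce a single stay record per stop
records = [(f"pack{i}", "dist_center_1", 10, 20) for i in range(1000)]
stays, mapping = compress_with_gids(records)
print(stays)                                  # [('g1', 'dist_center_1', 10, 20)]
print(len(mapping["g1"]))                     # 1000
```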
25Movement Graph Producer Consumer
Configurations
26Non-Spatial Generalization
Figure: item concept hierarchy. Category level: Clothing; type level (the interesting level): Outerwear, Shoes; SKU level: Shirt, Jacket; cleansed RFID database / EPC level: Shirt 1, ..., Shirt n.
27Path Generalization
Figure: two views of the same path dist. center -> truck -> backroom -> shelf -> checkout. The store view keeps the backroom, shelf, and checkout stages and collapses transportation; the transportation view keeps dist. center and truck and collapses the in-store stages.
28RFID-Cube Architecture
29RFID Cuboid
30Example RFID Cuboid
Cleansed RFID Database
Stay Table
Map Table
31Design Decisions Stay vs. Transition
Figure: a location l with n upstream locations l1, ..., ln and m downstream locations l_{n+1}, ..., l_{n+m}.
- Measure of items at location l
- Transition: n + m retrievals
- Stay: 1 retrieval
- Measure of items from li to lj
- Transition: 1 retrieval
- Stay: 2 retrievals
32Design Decisions EPC vs. GID Lists
How many pallets traveled the path (l1, l7, l13)?
- With EPC lists: (r1,l1,t1,t2) (r1,l2,t3,t4) (r2,l1,t1,t2) (r2,l2,t3,t4) ... (rk,l1,t1,t2) (rk,l2,t3,t4)
- Retrieve all EPCs with location in {l1, l7, l13}
- With GID lists: (g1,l1,t1,t2) (g2,l2,t3,t4)
- Retrieve all GIDs with location in {l1, l7, l13}
- Savings
- #GIDs << #EPCs
33GID Naming
- GID name encodes the path
- Benefit - speed: reduces the number of GID intersections
- Cost - space: grows with the number of locations and the path length
Figure: a path tree over locations l1, ..., l6 in which each node's GID extends its parent's GID (0.0, 0.1; 0.0.0, 0.1.0, 0.1.1; 0.0.0.0, 0.1.0.1, ...).
34RFID Cuboid Construction
- Build a path prefix-tree
- For each node
- GID = parent GID + unique id
- Aggregate the measure for items at each leaf under the node
- Generate stay records, merging if necessary
35Compression by Data/Path Generalization
- Data generalization
- Analysis usually takes place at a much higher level of abstraction than the one present in raw RFID data
- Aggregate object movements into fewer records
- If the analysis only needs time at the day level, merge minute-level records into coarser (e.g., day-level) records
- Path generalization: merge and/or collapse path segments
- Uninteresting path segments can be ignored or merged
- Multiple item movements within the same store may be uninteresting to a regional manager and thus merged
36Three RFID-Cuboids
- Stay table (GIDs, location, time_in, time_out, measures)
- Records information on items that stay together at a given location
- If transitions were recorded instead, queries would be difficult to answer: lots of intersections needed
- Map table (GID, <GID1, ..., GIDn>)
- Links together the stages that belong to the same path; provides additional compression and query-processing efficiency
- A high-level GID points to lower-level GIDs
- If complete EPC lists were saved instead: high IO cost to retrieve long lists, costly query processing
- Information table (EPC list, attribute 1, ..., attribute n)
- Records path-independent attributes of the items, e.g., color, manufacturer, price
37Algorithm Example
Figure: a path tree over locations l1, ..., l6 and the corresponding stay table. Each node carries a GID, a time interval, and a count, e.g., (0.0, [t1,t10], 3) and (0.1, [t1,t8], 3) at the top level; (0.0.0, [t20,t30], 3), (0.1.0, [t20,t30], 3), and (0.1.1, [t10,t20], 2) for tags r8, r9 below; down to leaf records such as (0.0.0.0, [t40,t60], 3) for r1, r2, r3; (0.1.0.0, [t40,t60], 2) for r5, r6; and (0.1.0.1, [t35,t50], 1) for r7.
38RFID-Cuboid Construction Algorithm
- Build a prefix tree for the paths in the cleansed database
- For each node, record a separate measure for each group of items that share the same leaf and information record
- Assign GIDs to each node
- GID = parent GID + unique id
- Each node generates a stay record for each distinct measure
- If multiple nodes share the same location, time, and measure, generate a single record with multiple GIDs (a minimal sketch follows)
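A minimal Python sketch of the prefix-tree construction; time intervals and the merging of equal records are omitted, and the GID of a child simply appends a child index to its parent's GID, as in the GID-naming slide.

```python
def build_path_tree(paths):
    """Build a path prefix tree with hierarchical GIDs
    (GID = parent GID + '.' + child index) and return stay records
    of the form (GID, location, count of tags under the node)."""
    root = {"gid": "0", "count": 0, "children": {}}
    for epc, locations in paths:              # each path is an ordered location list
        node = root
        for loc in locations:
            child = node["children"].get(loc)
            if child is None:
                child = {"gid": f"{node['gid']}.{len(node['children'])}",
                         "count": 0, "children": {}}
                node["children"][loc] = child
            child["count"] += 1               # tags whose path goes through this node
            node = child

    stay_records = []
    def collect(node):
        for loc, child in node["children"].items():
            stay_records.append((child["gid"], loc, child["count"]))
            collect(child)
    collect(root)
    return stay_records

# Example: three tags follow l1->l3->l5, two tags follow l1->l3->l6
paths = [("r1", ["l1", "l3", "l5"]), ("r2", ["l1", "l3", "l5"]),
         ("r3", ["l1", "l3", "l5"]), ("r4", ["l1", "l3", "l6"]),
         ("r5", ["l1", "l3", "l6"])]
print(build_path_tree(paths))
```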
39Algorithm Properties
- Construction Time
- Single scan of cleansed data
- Compression
- Lossless compression for a given abstraction level
- In our experiments we get 80% lossless compression at the level of abstraction of the raw data
40From RFID-Cuboids to RFID-Warehouse
- Which cuboids to materialize?
- Minimum interesting level
- Popular RFID-Cuboids
- How?
- Run algorithm
- Input From the smallest materialized RFID-Cuboid
that is at a lower level of abstraction
41Query Processing
- Traditional OLAP operations
- Roll up, drill down, slice, and dice
- Implemented efficiently with traditional techniques, e.g., what is the avg time spent by milk at the shelf?
  σ_{stay.location = 'shelf', info.product = 'milk'} (stay ⋈_{gid} info)
- Path selection (new operation)
- Compute an aggregate measure on the tags that travel through a set of locations and that match a selection criterion on path-independent dimensions
  q = < σ_c info, (σ_{c1} stage_1, ..., σ_{ck} stage_k) >
42Query Processing (II)
- Query: what is the average time spent from l3 to l5?
- GIDs for l3: <0.0.0>, <0.1.0>
- GIDs for l5: <0.0.0.0>, <0.1.0.1>
- Prefix pairs: p1 = (<0.0.0>, <0.0.0.0>), p2 = (<0.1.0>, <0.1.0.1>)
- Retrieve the stay records for each pair (including intermediate steps) and compute the measure (a sketch of the prefix test follows)
- Savings: no EPC list intersection; remember that each EPC list may contain millions of different tags, and retrieving them is a significant IO cost
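With dotted GID names, the prefix-pair step reduces to a string test, as in this small sketch (the GID values are the ones from the example above).

```python
def prefix_pairs(gids_from, gids_to):
    """Pair GIDs of a source location with GIDs of a destination location
    whenever the source GID is a path-tree ancestor of the destination GID
    (a dotted prefix), so the tags flowed from one location to the other
    without any EPC-list intersection."""
    pairs = []
    for g_from in gids_from:
        for g_to in gids_to:
            if g_to.startswith(g_from + "."):   # ancestor test on encoded paths
                pairs.append((g_from, g_to))
    return pairs

# Example from the slide: l3 -> l5
print(prefix_pairs(["0.0.0", "0.1.0"], ["0.0.0.0", "0.1.0.1"]))
# [('0.0.0', '0.0.0.0'), ('0.1.0', '0.1.0.1')]
```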
43Performance Study RFID-Cube Compression
Compression vs. cleansed data size: P = 1000, B = (500, 150, 40, 8, 1), k = 5. Lossless compression: the cuboid is at the same level of abstraction as the cleansed RFID database.
Compression vs. data bulkiness: P = 1000, N = 1,000,000, k = 5. The map table gives significant benefits for bulky data; for data where items move individually, we are better off using tag lists.
44From Distribution Center Model to Gateway-Based
Movement Model
- Gateway-based movement model
- Supply-chain movement is a merge-shuffle-split
process - Three types of gateways
- Out, In, In/Out
- Multiple hierarchies for compression and
exploration - Location, Time, Path, Gateway, Info
45Part 1. RFID Data Mining
- Introduction to RFID Data
- Why RFID Data Warehousing and Mining?
- RFID Data Warehousing
- Mining RFID Data Sets
- Conclusions
46Mining RFID Data Sets
- Data cleaning by data mining
- RFID data flow analysis
- Path-based classification and cluster analysis
- Frequent pattern and sequential pattern analysis
- Outlier analysis in RFID data
- Linking RFID data mining with others
47Data Cleaning by Data Mining
- The RFID data warehouse substantially compresses the RFID data and facilitates efficient and systematic data analysis
- Data cleaning is essential to RFID applications
- Multiple readings, missed readings, reading errors, etc.
- How does the RFID warehouse facilitate data cleaning?
- Multiple readings: automatically resolved during compression
- Missed readings: gaps can be stitched by a simple look-around
- Reading errors: use future positions to resolve discrepancies
- Data mining helps data cleaning
- Multiple cleaning methods can be cross-validated
- Cost-sensitive method selection by data mining
48Cost-Conscious Cleaning of RFID Data (Gonzalez et
al. 07)
- Unreliable system
- 50% loss rate
- Interference: water, metal, speed
- Large data volume
- Thousands of readers
- Millions of tags
- Key idea
- Use inexpensive cleaning methods first, escalate only when necessary
49DBN-Based Cleaning (DBNs: Dynamic Bayesian Networks)
- No need to remember recent tag readings; we just update our belief that the item is present given the readings
- Dynamically give more weight to recent observations
- Differentiate between the two cases (uncertain vs. certain presence)
Figure: a two-slice DBN with hidden states Present(t-1) and Present(t) connected by the transition model, and observed detections Detect(t-1) and Detect(t) connected to the hidden states by the observation model; the new belief state is obtained from the old belief state via the transition model and the observation model (a recursive Bayes filter, sketched below).
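A minimal sketch of the recursive belief update behind DBN-style cleaning; the transition and observation probabilities below are illustrative assumptions, not values from the paper.

```python
def update_belief(prior_present, detected,
                  p_stay=0.9,      # P(present_t | present_{t-1}), assumed
                  p_arrive=0.05,   # P(present_t | absent_{t-1}), assumed
                  p_detect=0.7,    # P(detect | present), reader sensitivity, assumed
                  p_false=0.01):   # P(detect | absent), false positives, assumed
    """One step of a recursive Bayes filter: predict with the transition
    model, then reweight by the observation model."""
    # Prediction step (transition model)
    predicted = prior_present * p_stay + (1.0 - prior_present) * p_arrive
    # Correction step (observation model)
    if detected:
        num = predicted * p_detect
        den = num + (1.0 - predicted) * p_false
    else:
        num = predicted * (1.0 - p_detect)
        den = num + (1.0 - predicted) * (1.0 - p_false)
    return num / den                 # posterior belief that the item is present

# A missed reading lowers but does not erase the belief
belief = 0.95
for obs in [True, True, False, True]:
    belief = update_belief(belief, obs)
    print(round(belief, 3))
```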
50Cleaning Sequence
- A cleaning method is a classifier
- For a tag case (EPC, time, history of readings) it assigns a label (location) and gives a confidence value for the prediction
- The cost of applying a method is proportional to its CPU, memory, and amortized training costs
- Given a set of tag cases and cleaning methods, determine the best method application order to maximize accuracy and minimize cost
- e.g., C(M1) = 1, C(M2) = 1.5, C(M3) = 0.5, C(Error) = 0.5
- S_{D,M} = M1 -> M3 -> M2
- Greedy algorithm: at each iteration choose the cheapest cleaning method (including the error cost) for the set of tag cases still not correctly classified (see the sketch below)
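The greedy ordering can be sketched as follows; `methods` and `classify_fn` are hypothetical stand-ins for real cleaning classifiers, and the cost model is a simplification of the one described on the slide.

```python
def greedy_cleaning_sequence(tag_cases, methods, error_cost=0.5):
    """Greedy ordering of cleaning methods: at each step pick the method
    with the lowest total cost over the still-unresolved tag cases, where
    cases it leaves misclassified are charged `error_cost` each.
    `methods` maps a name to (apply_cost, classify_fn); classify_fn(case)
    returns True when the method resolves the case correctly."""
    remaining = list(tag_cases)
    available = dict(methods)
    sequence = []
    while remaining and available:
        def total_cost(item):
            _, (cost, classify) = item
            misses = sum(1 for case in remaining if not classify(case))
            return cost * len(remaining) + error_cost * misses
        name, (cost, classify) = min(available.items(), key=total_cost)
        sequence.append(name)
        remaining = [case for case in remaining if not classify(case)]
        del available[name]
    return sequence
```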
51Cleaning Plan
- The cleaning plan is a decision tree
- Internal nodes are tag features
- Leaves contain all tag cases matching the conditions on the branch, and define the optimal cleaning sequence to use on such cases
- Tag cases have features that can be used to segment them
Induction Algorithm
- Traditional top-down induction of decision trees [Quinlan, ML 86]
- Split the tag cases as long as the split reduces cleaning costs, i.e., the cleaning-sequence cost before the split exceeds the average cost of the cleaning sequences after the split
52Experimental Result
- Setup
- Diverse environment, different levels of noise,
tag speed, and reader locations - Results
- Cleaning plan wins in accuracy and cost
- In general, DBN outperforms smoothing window
methods
53Mining RFID Data Sets
- Data cleaning by data mining
- RFID data flow analysis
- Path-based classification and cluster analysis
- Frequent pattern and sequential pattern analysis
- Outlier analysis in RFID data
- Linking RFID data mining with others
54RFID Data A Path Database View
- From raw tuples to cleansed data: a stay-table view
- Raw tuples: <EPC, location, time>
- Stay view: (EPC, location, time_in, time_out)
- A data-flow view of RFID data: path form
- <EPC, (l1,t1), (l2,t2), ..., (lk,tk)>, where li = location i and ti = duration i
- The paths can be augmented with path-independent dimensions to get a path database of the form
- <Product, Manufacturer, Price, Color, (l1,t1), ..., (lk,tk)>
(path-independent dimensions followed by path stages)
55What Can Product Flows Tell?
Why was the Milk discarded?
Correlation between operator and returns?
56Summarizing Flows FlowGraph
- Tree-shaped workflow
- Nodes: locations
- Edges: transitions
- Each node is annotated with
- the distribution of durations at the node
- the distribution of transition probabilities
- significant duration and transition exceptions
Figure: a FlowGraph rooted at the factory with branches through truck, warehouse, storage, backroom, and shelf.
57FlowGraph Example
Figure: an example FlowGraph from the factory through dist. center, truck, warehouse, shelf, and checkout, with transition probabilities on the edges (e.g., truck -> shelf 0.67, truck -> warehouse 0.33, shelf -> checkout 1.00). Nodes carry duration distributions and conditional exceptions, e.g., duration dist. {1: 0.67, 2: 0.33} with transition exceptions given (t,1): shelf 0.5, warehouse 0.5 and given (t,2): shelf 1.0, warehouse 0.0; another node has duration dist. {1: 0.2, 2: 0.8} with duration exceptions given (f,5): {1: 0.0, 2: 1.0} and given (f,10): {1: 0.5, 2: 0.5}.
58FlowCube
- A data cube computed on the path database by grouping entries that share the same values on the path-independent dimensions
- Each cuboid has an associated level in the item and path abstraction lattices
- Level in the item lattice, e.g., (product category, country, price)
- Level in the path lattice, e.g., (<transportation, factory, backroom, shelf, checkout>, hour)
- The measure for each cell in the FlowCube is a FlowGraph computed on the paths aggregated in the cell
59FlowCube Example
Figure: a cuboid for <product type, brand>. The FlowGraph for cell 3 is factory -> truck (1.0), truck -> shelf (0.67), truck -> warehouse (0.33), shelf -> checkout (1.0).
60Cubing FlowGraphs FlowCube
- Fact Table Path Table (EPC, path)
- Dimensions
- Path independent dimensions
- Product, Vendor, Price, etc
- Abstraction Level
- Each dimension has a concept hierarchy
- Paths aggregated according to location, time
- Measure
- FlowGraph
61FlowCube Example
Figure: the cuboid for <product type, brand> again, with the FlowGraph for cell 3 (factory -> truck -> {shelf -> checkout, warehouse}).
62Cells to Compute
- Frequent cells (iceberg FlowCube)
- Min support: the number of paths in the cell
- The FlowGraph is statistically significant
- Non-redundant cells
- Redundant cell: one that can be inferred from other cells
- e.g., the flow patterns for (Milk) are the same as for (Milk 2)
- Compression: keep the non-redundant, general cells
63FlowCube Computation - Ideas
- Transform paths into a transaction database
- Mine frequent path segments
- Mine frequent dimension combinations
- Cross pruning
- If (Factory -> Shelf) is infrequent for NorthEast
- it has to be infrequent in MA
- If (Laptop, MN) is infrequent
- it has to be infrequent for (Factory -> Shelf)
64Transaction Encoding
Figure: transaction encoding example. A Jacket (under Outerwear, under Clothing in the product hierarchy) with path (factory,10)(dist,2)(truck,1)(shelf,5)(checkout,0) is encoded as a transaction whose items are path segments at multiple abstraction levels, e.g., (factory dist truck, 1) and (factory Transportation, 1).
65One Step Algorithm
Path DB -> Encode Transactions -> Freq. Pattern Mining -> Freq. Cells + Freq. Paths -> Build FlowGraphs -> FlowCube
- Integrated pruning
- Pre-counting at level k+1
- Prune non-related stages
- Prune parent-child cells
66Two Step Algorithm
Path DB -> Cubing (non-spatial) -> Cube -> Freq. path mining per cell (cell 1, cell 2, ..., cell n) -> Build FlowGraphs
- FlowGraph is a holistic measure: no shared computation
- Wasted effort: no cross pruning
- One cell at a time: high IO cost
67Mining RFID Data Sets
- Data cleaning by data mining
- RFID data flow analysis
- Path-based classification and cluster analysis
- Frequent pattern and sequential pattern analysis
- Outlier analysis in RFID data
- Linking RFID data mining with others
68Path- or Segment- Based Classification and
Cluster Analysis
- Classification: given class labels (e.g., broken goods vs. quality ones), construct path-related predictive models
- Take paths or segments as motifs and use motif-based, high-dimensional features for classification
- Clustering: group similar paths, or similar stays or movements of RFID objects, together with other multi-dimensional information, into clusters
- It is essential to define new distance measures and constraints for effective clustering
69Mining RFID Data Sets
- Data cleaning by data mining
- RFID data flow analysis
- Path-based classification and cluster analysis
- Frequent pattern and sequential pattern analysis
- Outlier analysis in RFID data
- Linking RFID data mining with others
70Frequent Pattern and Sequential Pattern Analysis
- Frequent patterns and sequential patterns can be related to movement segments and paths
- Taking movement segments and paths as base units, one can perform multi-dimensional frequent pattern and sequential pattern analysis
- Correlation analysis can be performed in a similar way
- Correlation components can be stays, move segments, and paths
- Efficient and scalable algorithms can be developed using the warehouse model
71Mining RFID Data Sets
- Data cleaning by data mining
- RFID data flow analysis
- Path-based classification and cluster analysis
- Frequent pattern and sequential pattern analysis
- Outlier analysis in RFID data
- Linking RFID data mining with others
72Outlier Analysis in RFID Data
- Outlier detection in RFID data is a by-product of other mining tasks
- Data flow analysis: detect those not in the major flows
- Classification: treat outliers and normal data as different class labels
- Cluster analysis: identify those that deviate substantially from the major clusters
- Trend analysis: those not following the major trend
- Frequent pattern and sequential pattern analysis: anomalous patterns
73Mining RFID Data Sets
- Data cleaning by data mining
- RFID data flow analysis
- Path-based classification and cluster analysis
- Frequent pattern and sequential pattern analysis
- Outlier analysis in RFID data
- Linking RFID data mining with others
74Linking RFID Mining with Others
- The RFID warehouse and cube model makes data mining better organized and more efficient
- Real-time RFID data mining will need further development of stream data mining methods
- Stream cubing and high-dimensional OLAP are two key methods that will benefit RFID mining
- RFID data mining is still a young, largely unexplored field
- RFID data mining has close links with sensor data mining, moving object data mining, and stream data mining
- Thus it will benefit from rich studies in those fields
75Part 1. RFID Data Mining
- Introduction to RFID Data
- Why RFID Data Warehousing and Mining?
- RFID Data Warehousing
- Mining RFID Data Sets
- Conclusions
76Part I Conclusions
- A new RFID warehouse model
- Allows efficient and flexible analysis of RFID
data in multidimensional space - Preserves the structure of the data
- Compresses data by exploiting bulky movements,
concept hierarchies, and path collapsing - Mining RFID data
- Powerful mining mechanisms can be constructed
with RFID data warehouse - Flowgraph analysis, data cleaning,
classification, clustering, trend analysis,
frequent/sequential pattern analysis, outlier
analysis - Lots can be done in RFID data analysis
77Part II. Trajectory Data Mining
- Introduction to Trajectory Data
- Pattern Mining
- Clustering
- Classification
- Outlier Detection
78Trajectory Data
- A trajectory is a sequence of the location and
timestamp of a moving object
Hurricanes
Turtles
Vehicles
Vessels
79Importance of Analysis on Trajectory Data
- The world becomes more and more mobile
- Prevalence of mobile devices such as cell phones,
smart phones, and PDAs - Satellite, sensor, RFID, and wireless
technologies have been improved rapidly - Tremendous amounts of trajectory data of moving
objects
80Research Impacts
- Trajectory data mining has many important,
real-world applications driven by the real need - Homeland security (e.g., border monitoring)
- Law enforcement (e.g., video surveillance)
- Weather forecast
- Traffic control
- Location-based service
-
81Part II. Trajectory Data Mining
- Introduction to Trajectory Data
- Pattern Mining
- Clustering
- Classification
- Outlier Detection
82Trajectory Pattern (Giannotti et al. 07)
- A trajectory pattern should describe the
movements of objects both in space and in time
83Definition of Trajectory Patterns
- A Trajectory Pattern (T-pattern) is a couple (s, a)
- s = <(x0,y0), ..., (xk,yk)> is a sequence of k+1 locations
- a = <a1, ..., ak> are the transition times (annotations)
- Also written as (x0,y0) --a1--> (x1,y1) --a2--> ... --ak--> (xk,yk)
- A T-pattern Tp occurs in a trajectory if the trajectory contains a subsequence S such that
- each (xi,yi) in Tp matches a point (x'i,y'i) in S, and
- the transition times in Tp are similar to those in S
84Characteristics of T-Patterns
- Routes between two consecutive regions are not relevant
- Absolute times are not relevant
Figure: two 1-hour movements from A to B along different routes are not discriminated; likewise, a 1-hour movement at 9 a.m. and a 1-hour movement at 5 p.m. are not discriminated.
85T-Pattern Mining
- 1. Convert each trajectory to a sequence, i.e.,
by converting a location (x,y) into a region
86- 2. Execute the TAS (temporally annotated
sequence) algorithm, developed by the same
authors, over the set of converted trajectories - A TAS is a sequential pattern annotated with
typical transition times between its elements - The algorithm of TAS mining is an extension of
PrefixSpan so as to accommodate transition times
87Sample T-Patterns
Data source: trucks in Athens (273 trajectories)
88Periodic Pattern (Mamoulis et al. 04)
- In many applications, objects follow the same
routes (approximately) over regular time
intervals - e.g., Bob wakes up at the same time and then
follows, more or less, the same route to his work
everyday
89Definition of Periodic Patterns
- Let S be a sequence of n spatial locations l0, l1, ..., l_{n-1}, representing the movement of an object over a long history
- Let T << n be an integer called the period
- A periodic pattern P is defined by a sequence r0 r1 ... r_{T-1} of length T that appears in S more than min_sup times
- For every ri in P, either ri = * (wildcard) or the location l_{jT+i} is inside region ri for each period j in which P appears
90Periodic Pattern Mining
- 1. Obtain frequent 1-patterns
- Divide the sequence S of locations into T spatial datasets, one for each offset of the period, i.e., locations l_i, l_{i+T}, ..., l_{i+(m-1)T} go to a set R_i (a sketch of this split follows)
- Perform DBSCAN on each dataset
- e.g., five clusters discovered in datasets R1, R2, R3, R4, and R6
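A minimal sketch of the split by period offset (the DBSCAN step over each R_i is not shown).

```python
from collections import defaultdict

def split_by_offset(locations, period):
    """Split a location sequence l0, l1, ..., l_{n-1} into `period` spatial
    datasets R_0, ..., R_{period-1}, where R_i collects the locations
    l_i, l_{i+period}, l_{i+2*period}, ... Each R_i would then be clustered
    (e.g., with DBSCAN) to form the candidate regions of 1-patterns."""
    datasets = defaultdict(list)
    for j, loc in enumerate(locations):
        datasets[j % period].append(loc)
    return dict(datasets)

# Example: a period of 3 time units over 9 observations
locs = [(0, 0), (5, 5), (9, 9), (0, 1), (5, 6), (9, 8), (1, 0), (4, 5), (8, 9)]
print(split_by_offset(locs, period=3))
```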
91- 2. Find longer patterns Two methods
- Bottom-up, level-wise technique
- Generate k-patterns using a pair of (k-1)-patterns whose first k-2 non-* regions are in the same positions
- Use a variant of the Apriori-TID algorithm
Figure: 2-length patterns built from regions in R1, R2, R3 (e.g., r1a, r2b, r3c, r1d, r2e, r3f, ...) are joined into 3-length patterns such as r1a r2b r3c and r1d r2e r3f.
92- Faster top-down approach
- Replace each location in S with the id of the cluster it belongs to, or with * if the location belongs to no cluster
- Use a sequence mining algorithm to quickly discover all frequent patterns of the form r0 r1 ... r_{T-1}, where each ri is a cluster in the set Ri or *
- Create a max-subpattern tree and traverse the tree in top-down, breadth-first order
93Four Kinds of Relative Motion Patterns (Laube et
al. 04, Gudmundsson et al. 07)
- Flock (parameters m > 1 and r > 0): at least m entities are within a circular region of radius r and they move in the same direction
- Leadership (parameters m > 1, r > 0, and s > 0): at least m entities are within a circular region of radius r, they move in the same direction, and at least one of the entities was already heading in this direction for at least s time steps
- Convergence (parameters m > 1 and r > 0): at least m entities will pass through the same circular region of radius r (assuming they keep their direction)
- Encounter (parameters m > 1 and r > 0): at least m entities will be simultaneously inside the same circular region of radius r (assuming they keep their speed and direction)
A sketch of a single-time-step flock check follows.
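A rough single-time-step flock check in Python. This is only a sketch: it tests disks centered at the entities, which approximates "some disk of radius r", and it makes "move in the same direction" concrete with an angle tolerance that is not part of the original definition.

```python
import math

def is_flock(points, headings, m, r, angle_tol=math.radians(20)):
    """Return True if at least m entities lie inside some disk of radius r
    (approximated by disks centered at each entity) and their headings
    agree within `angle_tol` radians."""
    for cx, cy in points:
        inside = [i for i, (x, y) in enumerate(points)
                  if math.hypot(x - cx, y - cy) <= r]
        if len(inside) < m:
            continue
        hs = [headings[i] for i in inside]
        spread = max(hs) - min(hs)            # naive heading spread (no wrap-around)
        if spread <= angle_tol:
            return True
    return False

points = [(0, 0), (1, 0.5), (0.5, 1), (10, 10)]
headings = [0.10, 0.15, 0.12, 2.0]            # radians
print(is_flock(points, headings, m=3, r=2))   # True
```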
94An example of a flock pattern for p1, p2, and p3 at the 8th time step; this is also a leadership pattern with p2 as the leader
A convergence pattern if m = 4, for p2, p3, p4, and p5
95- Algorithms Exact and approximate algorithms are
developed - Flock Use the higher-order Voronoi diagram
- Leadership Check the leader condition
additionally -
t is multiplicative factor in all time bounds
96An Extension of Flock Patterns (Benkert et al.
06, Gudmundsson and Kreveld 07)
- A new definition considers multiple time steps, whereas the previous definition considers only one time step
- Flock: a flock in a time interval I, where the duration of I is at least k, consists of at least m entities such that for every point in time within I there is a disk of radius r that contains all the m entities
- e.g., a flock through 3 time steps
97Computing Flock Patterns
- Approximate flocks
- Convert overlapping segments of length k to points in a 2k-dimensional space
- Find 2k-d pipes that contain at least m points
- Longest-duration flocks
- For every entity v, compute a cylindrical region and the intervals from the intersection of the cylinders
- Pick the longest one
98An Extension of Leadership Patterns (Andersson
et al. 07)
- Leadership: we have a leadership pattern if there is an entity that is a leader of at least m entities for at least k time units
- An entity ej is said to be a leader at time [tx, ty], for time-points tx and ty, if and only if ej does not follow anyone at time [tx, ty] and ej is followed by sufficiently many entities at time [tx, ty]
Figure: ei follows ej; the follow condition relates their direction vectors di, dj and an angle threshold β.
99Reporting Leadership Patterns
- Algorithm: build and use the follow-arrays
- e.g., store nonnegative integers specifying for how many past consecutive unit-time intervals ej has been following ei (ej -> ei)
100Trajectory Join (Bakalov et al. 05)
- Identify all pairs of similar trajectories between two datasets; deal with the restricted version of the problem where a temporal predicate is specified by the query
- e.g., identify the pairs of trucks that were never apart from each other by more than 1 mile this morning
- Definition: given two sets of object trajectories R and S, a threshold e and a time interval dt, the result of the trajectory join query is the subset V of pairs <Ri, Sj> (where Ri ∈ R, Sj ∈ S) such that during the time interval dt the distance Ddt(Ri, Sj) ≤ e for any pair in V
- Ri and Sj are sub-trajectories for the time interval dt
101Evaluation of Trajectory Join
- Use the Piecewise Aggregate Approximation (PAA) and then reduce trajectories to strings, e.g., a4 a3 a2 a1 a2
- Introduce a distance function for strings that appropriately lower-bounds the distance function Ddt for trajectories
- Propose a pruning heuristic for reducing the number of trajectory pairs that need to be examined
102Time-Relaxed Trajectory Join (Bakalov et al. 05)
- Here, the interval dt can be anywhere in each
trajectory - Definition Two trajectories match if there exist
time intervals of the same length dt such that
the distance between the locations of the two
trajectories during these intervals is no more
than the spatial threshold e
103Evaluation of Time-Relaxed Trajectory Join
- Approximate raw trajectories using symbolic representations; each trajectory is represented as a string
- Generate all subsequences of length k for each string (assume dt completely covers a total of k frames)
- Compare all pairs of subsequences and obtain the candidates where the distance is no more than k·e
- Verify the candidates by accessing the raw trajectories
- Provide two heuristics for reducing false positives
104Hot Motion Path (Sacharidis et al. 08)
- Identify hot motion paths followed by moving objects over a sliding window, with guarantees
- Motion path: a directed line segment approximating the objects' movement
- Hotness: the number of objects crossing a motion path within the window
- Guarantees: a user-defined tolerance e for approximating the location of an object at a given time
105Example of Hot Motion Paths
- Consider 4 moving objects and their trajectories
- 1. Extract motion paths
- 2. Calculate hotness
- 3. Select the hottest (hotness 2)
106Finding Hot Motion Paths
- System setting
- Objects communicate with the central coordinator
- Two-tiered approach
- Object side RayTrace algorithm
- Update locations only when the object falls
outside a filter - Coordinator side SinglePath strategy
- Discover motion paths using lightweight indexes
107Part II. Trajectory Data Mining
- Introduction to Trajectory Data
- Pattern Mining
- Clustering
- Classification
- Outlier Detection
108Moving Object Clustering
- A moving cluster is a set of objects that move close to each other for a long time interval
- Note: moving clusters and flock patterns are essentially the same
- Formal definition [Kalnis et al. 05]
- A moving cluster is a sequence of (snapshot) clusters c1, c2, ..., ck such that for each timestamp i (1 ≤ i < k), |ci ∩ ci+1| / |ci ∪ ci+1| ≥ θ, where 0 < θ ≤ 1 (a sketch follows)
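The definition can be checked directly, as in this small sketch where each snapshot cluster is a set of object ids.

```python
def is_moving_cluster(snapshot_clusters, theta):
    """Check the moving-cluster condition: for every pair of consecutive
    snapshot clusters the Jaccard overlap |ci ∩ ci+1| / |ci ∪ ci+1|
    must be at least theta."""
    for c_now, c_next in zip(snapshot_clusters, snapshot_clusters[1:]):
        union = c_now | c_next
        if not union:
            return False
        if len(c_now & c_next) / len(union) < theta:
            return False
    return True

clusters = [{1, 2, 3, 4}, {2, 3, 4, 5}, {3, 4, 5, 6}]
print(is_moving_cluster(clusters, theta=0.5))   # True: each overlap is 3/5
```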
109Retrieval of Moving Clusters (Kalnis et al. 05)
- Basic algorithm (MC1)
- 1. Perform DBSCAN for each time slice
- 2. For each pair of a cluster c and a moving cluster g, check whether g can be extended by c
- If yes, g is used at the next iteration
- If no, g is returned as a result
- Improvements
- MC2: avoid redundant checks (improves Step 2)
- MC3: reduce the number of DBSCAN executions (improves Step 1)
110Moving Micro-Clusters (Li et al. 04)
- A group of objects that are not only close to each other, but also likely to move together for a while
- It is desirable to provide multi-level data analysis for prohibitively large datasets: a moving micro-cluster can be viewed as a single moving object
- Initial moving micro-clusters are obtained using a generic clustering algorithm; then, split and collision events are identified
111Trajectory Clustering
- Group similar trajectories into the same cluster
- 1. Whole Trajectory Clustering
- Probabilistic Clustering
- Density-Based Clustering TF-OPTICS
- 2. Partial Trajectory Clustering
- The Partition-and-Group Framework
112Probabilistic Trajectory Clustering (Gaffney and
Smyth 99)
- Basic assumption: the data are produced in the following generative manner
- An individual is drawn randomly from the population of interest
- The individual is assigned to cluster k with probability wk; these are the prior weights on the K clusters
- Given that an individual belongs to cluster k, there is a density function fk(yj | θk) which generates an observed data item yj for individual j
113- The probability density function of observed trajectories is a mixture density: P(yj | xj) = Σ_k wk fk(yj | xj, θk)
- fk(yj | xj, θk) is the k-th density component, wk is its weight, and θk is the set of parameters for the k-th component
- θk and wk can be estimated from the trajectory data using the Expectation-Maximization (EM) algorithm
114Clustering Results For Hurricanes (Camargo et al.
06)
Figure: tracks of Atlantic named tropical cyclones, 1970-2003, with the mean regression trajectory of each cluster.
115Density-Based Trajectory Clustering (Nanni and
Pedreschi 06)
- Define the distance between whole trajectories
- A trajectory is represented as a sequence of locations and timestamps
- The distance between trajectories is the average distance between the objects over all timestamps
- Use the OPTICS algorithm on trajectories
- e.g., a reachability plot over (X, Y, Time) revealing four clusters
116Temporal Focusing TF-OPTICS
- In a real environment, not all time intervals
have the same importance - e.g., urban traffic In rush hours, many people
move from home to work, and vice versa - Clustering trajectories only in meaningful time
intervals can produce more interesting results - TF-OPTICS aims at searching the most meaningful
time intervals, which allows us to isolate the
clusters of higher quality
117TF-OPTICS
- Define the quality of a clustering
- Take account of both high-density clusters and
low-density noise - Can be computed directly from the reachability
plot - Find the time interval that maximizes the quality
- 1. Choose an initial random time interval
- 2. Calculate the quality of neighborhood
intervals generated by increasing or decreasing
the starting or ending times - 3. Repeat Step 2 as long as the quality increases
118Partition-and-Group Framework (Lee et al. 07)
- Existing algorithms group trajectories as a whole, so they might not be able to find similar portions of trajectories
- e.g., a common behavior cannot be discovered when TR1-TR5 move in totally different directions overall
- The partition-and-group framework is proposed to discover common sub-trajectories
Figure: five trajectories TR1-TR5 that diverge globally but share a common sub-trajectory.
119Usefulness of Common Sub-Trajectories
- Discovering common sub-trajectories is very
useful, especially if we have regions of special
interest - Hurricane Landfall Forecasts
- Meteorologists will be interested in the common
behaviors of hurricanes near the coastline or at
sea (i.e., before landing) - Effects of Roads and Traffic on Animal Movements
- Zoologists will be interested in the common
behaviors of animals near the road where the
traffic rate has been varied
120Overall Procedure
- Two phases: partitioning and grouping
Figure: (1) a set of trajectories is partitioned into a set of line segments; (2) the line segments are grouped into a cluster, from which a representative trajectory is derived. Note: a representative trajectory is a common sub-trajectory.
121Partitioning Phase
- Identify the points where the behavior of a trajectory changes rapidly: characteristic points
- An optimal set of characteristic points is found by using the minimum description length (MDL) principle
- Partition the trajectory at every characteristic point
Figure: a trajectory with its characteristic points and the resulting trajectory partitions.
122Overview of the MDL Principle
- The MDL principle has been widely used in information theory
- The MDL cost consists of two components, L(H) and L(D|H), where H is the hypothesis and D the data
- L(H) is the length, in bits, of the description of the hypothesis
- L(D|H) is the length, in bits, of the description of the data when encoded with the help of the hypothesis
- The best hypothesis H to explain D is the one that minimizes the sum L(H) + L(D|H)
123MDL Formulation
- Finding the optimal partitioning translates to finding the best hypothesis using the MDL principle
- H: a set of trajectory partitions; D: a trajectory
- L(H): the sum of the lengths of all trajectory partitions
- L(D|H): the sum of the differences between the trajectory and its trajectory partitions
- L(H) measures conciseness; L(D|H) measures preciseness (a sketch follows)
124Grouping Phase (1/2)
- Find the clusters of trajectory partitions using density-based clustering (i.e., DBSCAN)
- A density-connected component forms a cluster, e.g., {L1, L2, L3, L4, L5, L6}
125Grouping Phase (2/2)
- Describe the overall movement of the trajectory partitions that belong to the cluster
Figure: a red line is the representative trajectory, a blue line the average direction vector, and pink lines the line segments in a density-connected set.
126Sample Clustering Results
7 Clusters from Hurricane Data
570 hurricanes (1950-2004)
A red line: a representative trajectory
127- 2 Clusters from Deer Data
128Part II. Trajectory Data Mining
- Introduction to Trajectory Data
- Pattern Mining
- Clustering
- Classification
- Outlier Detection
129Trajectory Classification
- Predict the class labels of moving objects based
on their trajectories and other features - 1. Machine learning techniques
- Studied mostly in pattern recognition,
bioengineering, and video surveillance - The hidden Markov model (HMM)
- 2. TraClass Trajectory classification using
hierarchical region-based and trajectory-based
clustering
130Machine Learning for Trajectory Classification
(Sbalzarini et al. 02)
- Compare various machine learning techniques for biological trajectory classification
- Data encoding
- For the hidden Markov model, a whole trajectory is encoded as a sequence of momentary speeds
- For the other techniques, a whole trajectory is encoded as the mean and the minimum of its speed, i.e., a vector in R^2
- Two 3-class datasets: trajectories of living cells taken from the scales of the fish Gillichthys mirabilis
- Temperature dataset: 10°C, 20°C, and 30°C
- Acclimation dataset: three different fish populations
131Machine Learning Techniques Used
- k-nearest neighbors (KNN)
- A previously unseen pattern x is simply assigned to the class to which the majority of its k nearest neighbors belongs
- Gaussian mixtures with expectation maximization (GMM)
- Support vector machines (SVM)
- Hidden Markov models (HMM)
- Training: determine the model parameters λ = (A, B, π) that maximize P[x | λ] for a given observation x
- Evaluation: given an observation x = O1, ..., OT and a model λ = (A, B, π), compute the probability P[x | λ] that the observation x has been produced by a source described by λ
132Figure: temperature data set; acclimation data set.
133Vehicle Trajectory Classification (Fraile and
Maybank 98)
- 1. The measurement sequence is divided into overlapping segments
- 2. In each segment, the trajectory of the car is approximated by a smooth function and then assigned to one of four categories: ahead, left, right, or stop
- 3. In this way, the list of segments is reduced to a string of symbols drawn from the set {a, l, r, s}
- 4. The string of symbols is classified using the hidden Markov model (HMM)
134Use of the HMM for Classification
- Classification of the global motions of a car is carried out using an HMM
- The HMM contains four states, in order A, L, R, S, which are the true states of the car: ahead, turning left, turning right, stopped
- The HMM has four output symbols, in order a, l, r, s, which are the symbols obtained from the measurement segments
- The Viterbi algorithm is used to obtain the sequence of internal states (a sketch follows)
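A generic Viterbi decoder with a toy A/L/R/S model; the transition and emission probabilities are illustrative assumptions, not values estimated from driving data.

```python
import numpy as np

def viterbi(obs, start_p, trans_p, emit_p):
    """Return the most likely hidden-state index sequence for the
    observed symbol indices `obs` under an HMM (log-space dynamic
    programming with backtracking)."""
    n_states = len(start_p)
    T = len(obs)
    delta = np.zeros((T, n_states))
    back = np.zeros((T, n_states), dtype=int)
    delta[0] = np.log(start_p) + np.log(emit_p[:, obs[0]])
    for t in range(1, T):
        scores = delta[t - 1][:, None] + np.log(trans_p)   # (prev, cur)
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + np.log(emit_p[:, obs[t]])
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

states = ["A", "L", "R", "S"]                    # ahead, left, right, stop
symbols = {"a": 0, "l": 1, "r": 2, "s": 3}
start = np.array([0.7, 0.1, 0.1, 0.1])
trans = np.full((4, 4), 0.1) + np.eye(4) * 0.6   # sticky states, rows sum to 1
emit = np.full((4, 4), 0.05) + np.eye(4) * 0.80  # mostly faithful symbols
obs = [symbols[c] for c in "aaasrr"]
print([states[i] for i in viterbi(obs, start, trans, emit)])
```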
135Experimental Result
Measurement sequence
Observed symbols
Sequence of inferred states
This measurement sequence means the driver stops
and then turns to the right
136Motion Trajectory Classification (Bashir et al.
07)
- Motion trajectories
- Tracking results from video trackers, sign language data, measurements gathered from wired glove interfaces, and so on
- Application scenarios
- Sports video (e.g., soccer video) analysis: player movements -> a strategy
- Sign and gesture recognition: hand movements -> a particular word
137The HMM-Based Algorithm
- 1. Trajectories are segmented at points of change
in curvature - 2. Sub-trajectories are represented by their
Principal Component Analysis (PCA) coefficients - 3. The PCA coefficients are represented using a
GMM for each class - 4. An HMM is built for each class, where the
state of the HMM is a sub-trajectory and is
modeled by a mixture of Gaussians
138Use of the HMM for Classification
- Training and parameter estimation
- The Baum-Welch algorithm is used to estimate the
parameters - Classification
- The PCA coefficient vectors of input trajectories
after segmentation are posed as an observation
sequence to each HMM (i.e., constructed for each
class) - The maximum likelihood (ML) estimate of the test
trajectory for each HMM is computed - The class is determined to be the one that has
the largest maximum likelihood
139Experimental Result
- Datasets
- The Australian Sign Language dataset (ASL)
- 83 classes (words), 5,727 trajectories
- A sport video data set (HJSL)
- 2 classes, 40 trajectories of high jump and 68
trajectories of slalom skiing objects - Accuracy
140Common Characteristics of Previous Methods
- Use the shapes of whole trajectories to do
classification - Encode a whole trajectory into a feature vector
- Convert a whole trajectory into a string or a
sequence of the momentary speed or - Model a whole trajectory using the HMM
- Note Although a few methods segment
trajectories, the main purpose is to approximate
or smooth trajectories before using the HMM
141TraClass Trajectory Classification Based on
Clustering
- Motivation
- Discriminative features are likely to appear at
parts of trajectories, not at whole trajectories - Discriminative features appear not only as common
movement patterns, but also as regions - Solution
- Extract features in a top-down fashion, first by
region-based clustering and then by
trajectory-based clustering
142Intuition and Working Example
- Parts of trajectories near the container port and
near the refinery enable us to distinguish
between container ships and tankers even if they
share common long paths - Those in the fishery enable us to recognize
fishing boats even if they have no common path
there
143Figure: the TraClass pipeline from trajectory partitions through region-based clustering and trajectory-based clustering to the extracted features.
144Class-Conscious Trajectory Partitioning
- 1. Trajectories are partitioned based on their shapes, as in the partition-and-group framework
- 2. Trajectory partitions are further partitioned by the class labels
- The real interest here is to guarantee that trajectory partitions do not span the class boundaries
Figure: non-discriminative vs. discriminative partitioning of trajectories from class A and class B, with additional partitioning points at the class boundary.
145Region-Based Clustering
- Objective Discover regions that have
trajectories mostly of one class regardless of
their movement patterns - Algorithm Find a better partitioning alternately
for the X and Y axes as long as the MDL cost
decreases - The MDL cost is formulated to achieve both
homogeneity and conciseness
146Trajectory-Based Clustering
- Objective Discover sub-trajectories that
indicate common movement patterns of each class - Algorithm Extend the partition-and-group
framework for classification purposes so that the
class labels are incorporated into trajectory
clustering - If an e-neighborhood contains trajectory
partitions mostly of the same class, it is used
for clustering otherwise, it is discarded
immediately
147Selection of Trajectory-Based Clusters
- After trajectory-based clusters are found, highly
discriminative clusters are selected for
effective classification - If the average distance from a specific cluster
to other clusters of different classes is high,
the discriminative power of the cluster is high - e.g.,
Class A Class B
C2
C1
C1 is more discriminative than C2
148Overall Procedure of TraClass
- 1. Partition trajectories
- 2. Perform region-based clustering
- 3. Perform trajectory-based clustering
- 4. Select discriminative trajectory-based clusters
- 5. Convert each trajectory into a feature vector
- Each feature is either a region-based cluster or a trajectory-based cluster
- The i-th entry of a feature vector is the frequency with which the i-th feature occurs in the trajectory
- 6. Feed the feature vectors to the SVM (a sketch of step 5 follows)
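A sketch of step 5; the `features` below are placeholder membership tests standing in for the learned region-based and trajectory-based clusters.

```python
def to_feature_vector(trajectory_partitions, features):
    """Count how often each feature (a membership test such as 'falls in
    region-based cluster i' or 'matches trajectory-based cluster j')
    occurs among the partitions of one trajectory."""
    vector = [0] * len(features)
    for part in trajectory_partitions:
        for i, belongs_to_feature in enumerate(features):
            if belongs_to_feature(part):
                vector[i] += 1
    return vector

# Illustrative features: membership tests over (x, y) midpoints of partitions
features = [lambda p: p[0] < 5,        # "western region" cluster
            lambda p: p[1] > 8]        # "northern movement" cluster
partitions = [(1, 2), (3, 9), (7, 9)]
print(to_feature_vector(partitions, features))   # [2, 2]
```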
149Classification Results
- Datasets
- Animal: three classes, i.e., three species: elk, deer, and cattle
- Vessel: two classes, i.e., two vessels
- Hurricane: two classes, i.e., category 2 and category 3 hurricanes
- Methods
- TB-ONLY: perform trajectory-based clustering only
- RB-TB: perform both types of clustering
- Results
150Extracted Features
Features: 10 region-based clusters + 37 trajectory-based clusters
Data: three classes
Accuracy: 83.3%
151Part II. Trajectory Data Mining
- Introduction to Trajectory Data
- Pattern Mining
- Clustering
- Classification
- Outlier Detection
152Trajectory Outlier Detection
- Detect trajectory outliers that are grossly
different from or inconsistent with the remaining
set of trajectories - 1. Whole Trajectory Outlier Detection
- An unsupervised method
- A supervised method based on classification
- 2. Integration with multi-dimensional information
- 3. Partial Trajectory Outlier Detection
- The Partition-and-Detect Framework
153A Distance-Based Approach (Knorr and Ng 00)
- Define the distance between two whole trajectories
- A whole trajectory is represented by a summary of its movement, and the distance between two whole trajectories is defined over these summaries (the defining formulas are not preserved in this text)
154- Apply a distance-based approach to the detection of trajectory outliers
- An object O in a dataset T is a DB(p, D)-outlier if at least a fraction p of the objects in T lie at a distance greater than D from O
- Unsupervised learning (a sketch follows)
155Sample Trajectory Outliers
- Detect outliers from person trajectories in a room
The entire data set
The outliers only
156Use of the Neural Network (Owens and Hunter 00)
- A whole trajectory is encod