Title: Mining Massive RFID, Trajectory, and Traffic Data Sets
1Mining Massive RFID, Trajectory, and Traffic Data Sets
KDD'08 Tutorial
- Jiawei Han, Jae-Gil Lee, Hector Gonzalez, Xiaolei Li
- ACM SIGKDD'08 Conference Tutorial
- Las Vegas, NV
- August 24, 2008
2Tutorial Outline
- Part I. RFID Data Mining
- Part II. Trajectory Data Mining
- Part III. Traffic Data Mining
- Part IV. Conclusions
3Part 1. RFID Data Mining
- Introduction to RFID Data
- Why RFID Data Warehousing and Mining?
- RFID Data Warehousing
- Mining RFID Data Sets
- Conclusions
4RFID Technology
- Radio Frequency Identification (RFID)
- Technology that allows a sensor (reader) to read,
from a distance, and without line of sight, a
unique electronic product code (EPC) associated
with a tag
5Broad Applications of RFID Technology
- Supply chain management: real-time inventory tracking
- Retail: active shelves monitor product availability
- Access control: toll collection, credit cards, building access
- Airline luggage management: reduce lost/misplaced luggage
- Medical: implant patients with a tag that contains their medical history
- Pet identification: implant an RFID tag with the pet owner's information
6Inventory Management
How many pens should we reorder?
7Asset Tracking
British Airways loses 20 million bags a year
8Electronic Toll Collection
Illinois: 1 million drivers a day use I-Pass
9RFID System (Tag, Reader, Database)
Source www.belgravium.com
10RFID Data Warehousing and Mining
Figure: architecture of the system. RFID data from sites 1 through k passes through data cleaning into a warehousing engine and the RFID warehouse; a mining engine on top supports flow mining, traffic mining, and other analyses. Key references:
- Data cleaning: "Cost-Conscious Cleaning of Massive RFID Data Sets", Gonzalez et al., ICDE'07
- Warehousing engine: "Warehousing and Analyzing Massive RFID Data Sets", Gonzalez et al., ICDE'06 (Best Student Paper)
- RFID warehouse / workflow mining: "Mining Compressed Commodity Workflows From Massive RFID Data Sets", Gonzalez et al., CIKM'06
- Flow mining: "FlowCube: Constructing RFID FlowCubes for Multi-Dimensional Analysis of Commodity Flows", Gonzalez et al., VLDB'06
- Traffic mining: "Adaptive Fastest Path Computation on a Road Network: A Traffic Mining Approach", Gonzalez et al., VLDB'07
11Part 1. RFID Data Mining
- Introduction to RFID Data
- Why RFID Data Warehousing and Mining?
- RFID Data Warehousing
- Mining RFID Data Sets
- Conclusions
12Challenges of RFID Data Sets
- Data generated by RFID systems is enormous (petabytes in scale!) due to redundancy and a low level of abstraction
- Walmart is expected to generate 7 terabytes of RFID data per day
- Data analysis requirements
- Highly compact summary of the data
- OLAP operations on a multi-dimensional view of the data
- Preserving the path structure of RFID data for analysis
- Efficiently drilling down to individual tags when an interesting pattern is discovered
13Example Trajectory
(Factory, T1, T2)
(Shipping, T3, T4)
(Warehouse, T5, T6)
(Shelf, T7, T8)
(Checkout, T9, T10)
14Data Generation
EPC (L1,T1)(L2,T2)(Ln,Tn)
EPC, Location, Time_in, Time_out
EPC, Location, Time
15RFID Data Warehouse Modeling
- Three models in typical RFID applications
- Bulky movements supply-chain management
- Scattered movements E-pass tollway system
- No movements fixed location sensor networks
- Different applications may require different data
warehouse systems - Our discussion will focus on bulky movements
16Why RFID-Warehousing?
- Lossless compression for bulky movement data
- Significantly reduces the size of the RFID data set by removing redundancy and grouping objects that move and stay together
- Data cleaning: reasoning based on more complete information
- Multi-reading, miss-reading, error-reading, bulky movement, ...
- Multi-dimensional summary, multiple views
- Multi-dimensional view: product, location, time, ...
- Store manager: check item movements from the backroom to different shelves in his store
- Region manager: collapse intra-store movements and look at distribution centers, warehouses, and stores
17Example A Supply Chain Store
- A retailer with 3,000 stores, selling 10,000 items a day per store
- Each item moves 10 times on average before being sold
- Movement recorded as (EPC, location, second)
- Data volume: 300 million tuples per day (after redundancy removal)
- OLAP query: costly to answer if it means scanning 1 billion tuples
- e.g., what is the avg time for outerwear items to move from the warehouse to the checkout counter in March 2006?
- Mining query
- e.g., is there a correlation between the time spent in transportation and milk going rotten in store S?
18Part 1. RFID Data Mining
- Introduction to RFID Data
- Why RFID Data Warehousing and Mining?
- RFID Data Warehousing
- Mining RFID Data Sets
- Conclusions
19Cleaning of RFID Data Records
- Raw data
- (EPC, location, time)
- Duplicate records due to multiple readings of a product at the same location
- (r1, l1, t1) (r1, l1, t2) ... (r1, l1, t10)
- Cleansed data: store the minimal information and remove the raw data
- (EPC, location, time_in, time_out)
- (r1, l1, t1, t10)
- Warehousing can help fill in missing records and correct wrongly-registered information (see the sketch below)
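To make the cleansing step concrete, here is a minimal Python sketch (not the tutorial's own code) of collapsing duplicate raw readings into stay records; the tag and location names are illustrative.

```python
def to_stay_records(raw_readings):
    """Collapse raw (EPC, location, time) readings into cleansed
    (EPC, location, time_in, time_out) stay records: consecutive
    readings of a tag at the same location become one record."""
    stays = []
    open_stays = {}                      # EPC -> [location, time_in, time_out]
    for epc, loc, t in sorted(raw_readings, key=lambda r: (r[0], r[2])):
        stay = open_stays.get(epc)
        if stay is not None and stay[0] == loc:
            stay[2] = t                  # same tag, same place: extend the stay
        else:
            if stay is not None:         # the tag moved: close the previous stay
                stays.append((epc, stay[0], stay[1], stay[2]))
            open_stays[epc] = [loc, t, t]
    for epc, (loc, t_in, t_out) in open_stays.items():
        stays.append((epc, loc, t_in, t_out))
    return stays

# Example: ten duplicate readings of r1 at l1 collapse to one record
raw = [("r1", "l1", t) for t in range(1, 11)]
print(to_stay_records(raw))              # [('r1', 'l1', 1, 10)]
```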
20What is a Data Warehouse?
Figure: operational data from sites 1 through k is extracted, transformed, and loaded into the warehouse, which materializes cubes (Cube 1, Cube 2, ..., Cube N) for OLAP over all possible groupings of (Product, Location, Time): products such as DVD, PC, TV; locations such as Chicago, Boston, New York; times Q1-Q4; e.g., cells (DVD, Chicago, All), (Q1, All, All), (All, All, All).
21Why Do We Need a New Design?
- Ex. What is the avg time that milk coming from farm A via truck 1 stays at the Champaign Walmart?
Paths are lost in the aggregation
22Data Compression
Raw Data (EPC, Reader, Time)
-> [lossless: redundancy elimination] Cleansed Data (EPC, Reader, T_in, T_out)
-> [lossless: bulky movement compression] Stay (GID, Reader, T_in, T_out)
-> [lossy: path and item abstraction] Stay (GID, Locale, Day1, Day2)
23Bulky Object Movements
Figure: a GID hierarchy (1; 1.1, 1.2; 1.1.1, 1.1.2; 1.1.1.1, 1.1.1.2; ...) in which a single record such as (i1, i2, ..., i10000, Dist Center 1, 01/01/08, 01/03/08) summarizes 10,000 items that moved together.
24Data Compression with GID
- Bulky object movements
- Objects often move and stay together
- If 1,000 packs of soda stay together at the distribution center, register a single record:
- (GID, distribution center, time_in, time_out)
- GID is a generalized identifier that represents the 1,000 packs that stayed together at the distribution center (see the figure and sketch below)
Figure: packaging hierarchy from factory to store shelves: 10 packs (12 sodas each) -> 20 cases (1,000 packs) -> 10 pallets (1,000 cases), moving from the factory through Dist. Centers 1 and 2 to shelves 1 and 2 in stores 1 and 2.
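As a rough illustration of the grouping idea, the sketch below (a simplification: it groups only tags whose entire stay history is identical, whereas the actual warehouse uses hierarchical GIDs that split as groups diverge) assigns one GID per group and one stay record per group per stop.

```python
from collections import defaultdict
from itertools import count

def compress_with_gids(stay_records):
    """Group tags whose full (location, time_in, time_out) history is
    identical under one generalized identifier (GID), emitting one stay
    record per group per stop plus a map from GID to member EPCs."""
    histories = defaultdict(list)
    for epc, loc, t_in, t_out in sorted(stay_records, key=lambda r: (r[0], r[2])):
        histories[epc].append((loc, t_in, t_out))

    groups = defaultdict(list)               # identical history -> member EPCs
    for epc, hist in histories.items():
        groups[tuple(hist)].append(epc)

    gid_counter = count(1)
    gid_map, gid_stays = {}, []
    for hist, epcs in groups.items():
        gid = f"g{next(gid_counter)}"
        gid_map[gid] = epcs
        for loc, t_in, t_out in hist:
            gid_stays.append((gid, loc, t_in, t_out))
    return gid_stays, gid_map

# 1,000 packs with the same history produce a single stay record per stop
records = [(f"pack{i}", "dist_center_1", 10, 20) for i in range(1000)]
stays, mapping = compress_with_gids(records)
print(stays)                                  # [('g1', 'dist_center_1', 10, 20)]
print(len(mapping["g1"]))                     # 1000
```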
25Movement Graph Producer Consumer
Configurations
26Non-Spatial Generalization
Figure: item concept hierarchy. Category level: Clothing; type level (the interesting level): Outerwear, Shoes; SKU level: Shirt, Jacket; cleansed RFID database / EPC level: Shirt 1, ..., Shirt n.
27Path Generalization
Figure: two views of the same path dist. center -> truck -> backroom -> shelf -> checkout. The store view keeps the backroom, shelf, and checkout stages and collapses transportation; the transportation view keeps dist. center and truck and collapses the in-store stages.
28RFID-Cube Architecture
29RFID Cuboid
30Example RFID Cuboid
Cleansed RFID Database
Stay Table
Map Table
31Design Decisions Stay vs. Transition
Figure: a location l with n upstream locations l1, ..., ln and m downstream locations l_{n+1}, ..., l_{n+m}.
- Measure of items at location l
- Transition: n + m retrievals
- Stay: 1 retrieval
- Measure of items from li to lj
- Transition: 1 retrieval
- Stay: 2 retrievals
32Design Decisions EPC vs. GID Lists
How many pallets traveled the path (l1, l7, l13)?
- With EPC lists: (r1,l1,t1,t2) (r1,l2,t3,t4) (r2,l1,t1,t2) (r2,l2,t3,t4) ... (rk,l1,t1,t2) (rk,l2,t3,t4)
- Retrieve all EPCs with location in {l1, l7, l13}
- With GID lists: (g1,l1,t1,t2) (g2,l2,t3,t4)
- Retrieve all GIDs with location in {l1, l7, l13}
- Savings
- #GIDs << #EPCs
33GID Naming
- GID name encodes the path
- Benefit - speed: reduces the number of GID intersections
- Cost - space: grows with the number of locations and the path length
Figure: a path tree over locations l1, ..., l6 in which each node's GID extends its parent's GID (0.0, 0.1; 0.0.0, 0.1.0, 0.1.1; 0.0.0.0, 0.1.0.1, ...).
34RFID Cuboid Construction
- Build a path prefix-tree
- For each node
- GID = parent GID + unique id
- Aggregate the measure for items at each leaf under the node
- Generate stay records, merging if necessary
35Compression by Data/Path Generalization
- Data generalization
- Analysis usually takes place at a much higher level of abstraction than the one present in raw RFID data
- Aggregate object movements into fewer records
- If the analysis only needs time at the day level, merge minute-level records into coarser (e.g., day-level) records
- Path generalization: merge and/or collapse path segments
- Uninteresting path segments can be ignored or merged
- Multiple item movements within the same store may be uninteresting to a regional manager and thus merged
36Three RFID-Cuboids
- Stay table (GIDs, location, time_in, time_out, measures)
- Records information on items that stay together at a given location
- If transitions were recorded instead, queries would be difficult to answer: lots of intersections needed
- Map table (GID, <GID1, ..., GIDn>)
- Links together the stages that belong to the same path; provides additional compression and query-processing efficiency
- A high-level GID points to lower-level GIDs
- If complete EPC lists were saved instead: high IO cost to retrieve long lists, costly query processing
- Information table (EPC list, attribute 1, ..., attribute n)
- Records path-independent attributes of the items, e.g., color, manufacturer, price
37Algorithm Example
Figure: a path tree over locations l1, ..., l6 and the corresponding stay table. Each node carries a GID, a time interval, and a count, e.g., (0.0, [t1,t10], 3) and (0.1, [t1,t8], 3) at the top level; (0.0.0, [t20,t30], 3), (0.1.0, [t20,t30], 3), and (0.1.1, [t10,t20], 2) for tags r8, r9 below; down to leaf records such as (0.0.0.0, [t40,t60], 3) for r1, r2, r3; (0.1.0.0, [t40,t60], 2) for r5, r6; and (0.1.0.1, [t35,t50], 1) for r7.
38RFID-Cuboid Construction Algorithm
- Build a prefix tree for the paths in the cleansed database
- For each node, record a separate measure for each group of items that share the same leaf and information record
- Assign GIDs to each node
- GID = parent GID + unique id
- Each node generates a stay record for each distinct measure
- If multiple nodes share the same location, time, and measure, generate a single record with multiple GIDs (a minimal sketch follows)
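A minimal Python sketch of the prefix-tree construction; time intervals and the merging of equal records are omitted, and the GID of a child simply appends a child index to its parent's GID, as in the GID-naming slide.

```python
def build_path_tree(paths):
    """Build a path prefix tree with hierarchical GIDs
    (GID = parent GID + '.' + child index) and return stay records
    of the form (GID, location, count of tags under the node)."""
    root = {"gid": "0", "count": 0, "children": {}}
    for epc, locations in paths:              # each path is an ordered location list
        node = root
        for loc in locations:
            child = node["children"].get(loc)
            if child is None:
                child = {"gid": f"{node['gid']}.{len(node['children'])}",
                         "count": 0, "children": {}}
                node["children"][loc] = child
            child["count"] += 1               # tags whose path goes through this node
            node = child

    stay_records = []
    def collect(node):
        for loc, child in node["children"].items():
            stay_records.append((child["gid"], loc, child["count"]))
            collect(child)
    collect(root)
    return stay_records

# Example: three tags follow l1->l3->l5, two tags follow l1->l3->l6
paths = [("r1", ["l1", "l3", "l5"]), ("r2", ["l1", "l3", "l5"]),
         ("r3", ["l1", "l3", "l5"]), ("r4", ["l1", "l3", "l6"]),
         ("r5", ["l1", "l3", "l6"])]
print(build_path_tree(paths))
```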
39Algorithm Properties
- Construction Time
- Single scan of cleansed data
- Compression
- Lossless compression for a given abstraction level
- In our experiments we get 80% lossless compression at the level of abstraction of the raw data
40From RFID-Cuboids to RFID-Warehouse
- Which cuboids to materialize?
- Minimum interesting level
- Popular RFID-Cuboids
- How?
- Run algorithm
- Input From the smallest materialized RFID-Cuboid
that is at a lower level of abstraction
41Query Processing
- Traditional OLAP operations
- Roll up, drill down, slice, and dice
- Implemented efficiently with traditional techniques, e.g., what is the avg time spent by milk at the shelf?
  σ_{stay.location = 'shelf', info.product = 'milk'} (stay ⋈_{gid} info)
- Path selection (new operation)
- Compute an aggregate measure on the tags that travel through a set of locations and that match a selection criterion on path-independent dimensions
  q = < σ_c info, (σ_{c1} stage_1, ..., σ_{ck} stage_k) >
42Query Processing (II)
- Query: what is the average time spent from l3 to l5?
- GIDs for l3: <0.0.0>, <0.1.0>
- GIDs for l5: <0.0.0.0>, <0.1.0.1>
- Prefix pairs: p1 = (<0.0.0>, <0.0.0.0>), p2 = (<0.1.0>, <0.1.0.1>)
- Retrieve the stay records for each pair (including intermediate steps) and compute the measure (a sketch of the prefix test follows)
- Savings: no EPC list intersection; remember that each EPC list may contain millions of different tags, and retrieving them is a significant IO cost
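With dotted GID names, the prefix-pair step reduces to a string test, as in this small sketch (the GID values are the ones from the example above).

```python
def prefix_pairs(gids_from, gids_to):
    """Pair GIDs of a source location with GIDs of a destination location
    whenever the source GID is a path-tree ancestor of the destination GID
    (a dotted prefix), so the tags flowed from one location to the other
    without any EPC-list intersection."""
    pairs = []
    for g_from in gids_from:
        for g_to in gids_to:
            if g_to.startswith(g_from + "."):   # ancestor test on encoded paths
                pairs.append((g_from, g_to))
    return pairs

# Example from the slide: l3 -> l5
print(prefix_pairs(["0.0.0", "0.1.0"], ["0.0.0.0", "0.1.0.1"]))
# [('0.0.0', '0.0.0.0'), ('0.1.0', '0.1.0.1')]
```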
43Performance Study RFID-Cube Compression
Compression vs. cleansed data size: P = 1000, B = (500, 150, 40, 8, 1), k = 5. Lossless compression: the cuboid is at the same level of abstraction as the cleansed RFID database.
Compression vs. data bulkiness: P = 1000, N = 1,000,000, k = 5. The map table gives significant benefits for bulky data; for data where items move individually, we are better off using tag lists.
44From Distribution Center Model to Gateway-Based
Movement Model
- Gateway-based movement model
- Supply-chain movement is a merge-shuffle-split
process - Three types of gateways
- Out, In, In/Out
- Multiple hierarchies for compression and
exploration - Location, Time, Path, Gateway, Info
45Part 1. RFID Data Mining
- Introduction to RFID Data
- Why RFID Data Warehousing and Mining?
- RFID Data Warehousing
- Mining RFID Data Sets
- Conclusions
46Mining RFID Data Sets
- Data cleaning by data mining
- RFID data flow analysis
- Path-based classification and cluster analysis
- Frequent pattern and sequential pattern analysis
- Outlier analysis in RFID data
- Linking RFID data mining with others
47Data Cleaning by Data Mining
- The RFID data warehouse substantially compresses the RFID data and facilitates efficient and systematic data analysis
- Data cleaning is essential to RFID applications
- Multiple readings, missed readings, reading errors, etc.
- How does the RFID warehouse facilitate data cleaning?
- Multiple readings: automatically resolved during compression
- Missed readings: gaps can be stitched by a simple look-around
- Reading errors: use future positions to resolve discrepancies
- Data mining helps data cleaning
- Multiple cleaning methods can be cross-validated
- Cost-sensitive method selection by data mining
48Cost-Conscious Cleaning of RFID Data (Gonzalez et
al. 07)
- Unreliable system
- 50% loss rate
- Interference: water, metal, speed
- Large data volume
- Thousands of readers
- Millions of tags
- Key idea
- Use inexpensive cleaning methods first, escalate only when necessary
49DBN-Based Cleaning (DBNs: Dynamic Bayesian Networks)
- No need to remember recent tag readings; we just update our belief that the item is present given the readings
- Dynamically give more weight to recent observations
- Differentiate between the two cases (uncertain vs. certain presence)
Figure: a two-slice DBN with hidden states Present(t-1) and Present(t) connected by the transition model, and observed detections Detect(t-1) and Detect(t) connected to the hidden states by the observation model; the new belief state is obtained from the old belief state via the transition model and the observation model (a recursive Bayes filter, sketched below).
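A minimal sketch of the recursive belief update behind DBN-style cleaning; the transition and observation probabilities below are illustrative assumptions, not values from the paper.

```python
def update_belief(prior_present, detected,
                  p_stay=0.9,      # P(present_t | present_{t-1}), assumed
                  p_arrive=0.05,   # P(present_t | absent_{t-1}), assumed
                  p_detect=0.7,    # P(detect | present), reader sensitivity, assumed
                  p_false=0.01):   # P(detect | absent), false positives, assumed
    """One step of a recursive Bayes filter: predict with the transition
    model, then reweight by the observation model."""
    # Prediction step (transition model)
    predicted = prior_present * p_stay + (1.0 - prior_present) * p_arrive
    # Correction step (observation model)
    if detected:
        num = predicted * p_detect
        den = num + (1.0 - predicted) * p_false
    else:
        num = predicted * (1.0 - p_detect)
        den = num + (1.0 - predicted) * (1.0 - p_false)
    return num / den                 # posterior belief that the item is present

# A missed reading lowers but does not erase the belief
belief = 0.95
for obs in [True, True, False, True]:
    belief = update_belief(belief, obs)
    print(round(belief, 3))
```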
50Cleaning Sequence
- A cleaning method is a classifier
- For a tag case (EPC, time, history of readings) it assigns a label (location) and gives a confidence value for the prediction
- The cost of applying a method is proportional to its CPU, memory, and amortized training costs
- Given a set of tag cases and cleaning methods, determine the best method application order to maximize accuracy and minimize cost
- e.g., C(M1) = 1, C(M2) = 1.5, C(M3) = 0.5, C(Error) = 0.5
- S_{D,M} = M1 -> M3 -> M2
- Greedy algorithm: at each iteration choose the cheapest cleaning method (including the error cost) for the set of tag cases still not correctly classified (see the sketch below)
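The greedy ordering can be sketched as follows; `methods` and `classify_fn` are hypothetical stand-ins for real cleaning classifiers, and the cost model is a simplification of the one described on the slide.

```python
def greedy_cleaning_sequence(tag_cases, methods, error_cost=0.5):
    """Greedy ordering of cleaning methods: at each step pick the method
    with the lowest total cost over the still-unresolved tag cases, where
    cases it leaves misclassified are charged `error_cost` each.
    `methods` maps a name to (apply_cost, classify_fn); classify_fn(case)
    returns True when the method resolves the case correctly."""
    remaining = list(tag_cases)
    available = dict(methods)
    sequence = []
    while remaining and available:
        def total_cost(item):
            _, (cost, classify) = item
            misses = sum(1 for case in remaining if not classify(case))
            return cost * len(remaining) + error_cost * misses
        name, (cost, classify) = min(available.items(), key=total_cost)
        sequence.append(name)
        remaining = [case for case in remaining if not classify(case)]
        del available[name]
    return sequence
```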
51Cleaning Plan
- The cleaning plan is a decision tree
- Internal nodes are tag features
- Leaves contain all tag cases matching the conditions on the branch, and define the optimal cleaning sequence to use on such cases
- Tag cases have features that can be used to segment them
Induction Algorithm
- Traditional top-down induction of decision trees [Quinlan, ML 86]
- Split the tag cases as long as the split reduces cleaning costs, i.e., the cleaning-sequence cost before the split exceeds the average cost of the cleaning sequences after the split
52Experimental Result
- Setup
- Diverse environment, different levels of noise,
tag speed, and reader locations - Results
- Cleaning plan wins in accuracy and cost
- In general, DBN outperforms smoothing window
methods
53Mining RFID Data Sets
- Data cleaning by data mining
- RFID data flow analysis
- Path-based classification and cluster analysis
- Frequent pattern and sequential pattern analysis
- Outlier analysis in RFID data
- Linking RFID data mining with others
54RFID Data A Path Database View
- From raw tuples to cleansed data: a stay-table view
- Raw tuples: <EPC, location, time>
- Stay view: (EPC, location, time_in, time_out)
- A data-flow view of RFID data: path form
- <EPC, (l1,t1), (l2,t2), ..., (lk,tk)>, where li = location i and ti = duration i
- The paths can be augmented with path-independent dimensions to get a path database of the form
- <Product, Manufacturer, Price, Color, (l1,t1), ..., (lk,tk)>
(path-independent dimensions followed by path stages)
55What Can Product Flows Tell?
Why was the Milk discarded?
Correlation between operator and returns?
56Summarizing Flows FlowGraph
- Tree-shaped workflow
- Nodes: locations
- Edges: transitions
- Each node is annotated with
- the distribution of durations at the node
- the distribution of transition probabilities
- significant duration and transition exceptions
Figure: a FlowGraph rooted at the factory with branches through truck, warehouse, storage, backroom, and shelf.
57FlowGraph Example
Figure: an example FlowGraph from the factory through dist. center, truck, warehouse, shelf, and checkout, with transition probabilities on the edges (e.g., truck -> shelf 0.67, truck -> warehouse 0.33, shelf -> checkout 1.00). Nodes carry duration distributions and conditional exceptions, e.g., duration dist. {1: 0.67, 2: 0.33} with transition exceptions given (t,1): shelf 0.5, warehouse 0.5 and given (t,2): shelf 1.0, warehouse 0.0; another node has duration dist. {1: 0.2, 2: 0.8} with duration exceptions given (f,5): {1: 0.0, 2: 1.0} and given (f,10): {1: 0.5, 2: 0.5}.
58FlowCube
- A data cube computed on the path database by grouping entries that share the same values on the path-independent dimensions
- Each cuboid has an associated level in the item and path abstraction lattices
- Level in the item lattice, e.g., (product category, country, price)
- Level in the path lattice, e.g., (<transportation, factory, backroom, shelf, checkout>, hour)
- The measure for each cell in the FlowCube is a FlowGraph computed on the paths aggregated in the cell
59FlowCube Example
Figure: a cuboid for <product type, brand>. The FlowGraph for cell 3 is factory -> truck (1.0), truck -> shelf (0.67), truck -> warehouse (0.33), shelf -> checkout (1.0).
60Cubing FlowGraphs FlowCube
- Fact Table Path Table (EPC, path)
- Dimensions
- Path independent dimensions
- Product, Vendor, Price, etc
- Abstraction Level
- Each dimension has a concept hierarchy
- Paths aggregated according to location, time
- Measure
- FlowGraph
61FlowCube Example
Figure: the cuboid for <product type, brand> again, with the FlowGraph for cell 3 (factory -> truck -> {shelf -> checkout, warehouse}).
62Cells to Compute
- Frequent cells (iceberg FlowCube)
- Min support: the number of paths in the cell
- The FlowGraph is statistically significant
- Non-redundant cells
- Redundant cell: one that can be inferred from other cells
- e.g., the flow patterns for (Milk) are the same as for (Milk 2)
- Compression: keep the non-redundant, general cells
63FlowCube Computation - Ideas
- Transform paths into a transaction database
- Mine frequent path segments
- Mine frequent dimension combinations
- Cross pruning
- If (Factory -> Shelf) is infrequent for NorthEast
- it has to be infrequent in MA
- If (Laptop, MN) is infrequent
- it has to be infrequent for (Factory -> Shelf)
64Transaction Encoding
Figure: transaction encoding example. A Jacket (under Outerwear, under Clothing in the product hierarchy) with path (factory,10)(dist,2)(truck,1)(shelf,5)(checkout,0) is encoded as a transaction whose items are path segments at multiple abstraction levels, e.g., (factory dist truck, 1) and (factory Transportation, 1).
65One Step Algorithm
Path DB -> Encode Transactions -> Freq. Pattern Mining -> Freq. Cells + Freq. Paths -> Build FlowGraphs -> FlowCube
- Integrated pruning
- Pre-counting at level k+1
- Prune non-related stages
- Prune parent-child cells
66Two Step Algorithm
Path DB -> Cubing (non-spatial) -> Cube -> Freq. path mining per cell (cell 1, cell 2, ..., cell n) -> Build FlowGraphs
- FlowGraph is a holistic measure: no shared computation
- Wasted effort: no cross pruning
- One cell at a time: high IO cost
67Mining RFID Data Sets
- Data cleaning by data mining
- RFID data flow analysis
- Path-based classification and cluster analysis
- Frequent pattern and sequential pattern analysis
- Outlier analysis in RFID data
- Linking RFID data mining with others
68Path- or Segment- Based Classification and
Cluster Analysis
- Classification: given class labels (e.g., broken goods vs. quality ones), construct path-related predictive models
- Take paths or segments as motifs and use motif-based, high-dimensional features for classification
- Clustering: group similar paths, or similar stays or movements of RFID objects, together with other multi-dimensional information, into clusters
- It is essential to define new distance measures and constraints for effective clustering
69Mining RFID Data Sets
- Data cleaning by data mining
- RFID data flow analysis
- Path-based classification and cluster analysis
- Frequent pattern and sequential pattern analysis
- Outlier analysis in RFID data
- Linking RFID data mining with others
70Frequent Pattern and Sequential Pattern Analysis
- Frequent patterns and sequential patterns can be related to movement segments and paths
- Taking movement segments and paths as base units, one can perform multi-dimensional frequent pattern and sequential pattern analysis
- Correlation analysis can be performed in a similar way
- Correlation components can be stays, move segments, and paths
- Efficient and scalable algorithms can be developed using the warehouse model
71Mining RFID Data Sets
- Data cleaning by data mining
- RFID data flow analysis
- Path-based classification and cluster analysis
- Frequent pattern and sequential pattern analysis
- Outlier analysis in RFID data
- Linking RFID data mining with others
72Outlier Analysis in RFID Data
- Outlier detection in RFID data is a by-product of other mining tasks
- Data flow analysis: detect those not in the major flows
- Classification: treat outliers and normal data as different class labels
- Cluster analysis: identify those that deviate substantially from the major clusters
- Trend analysis: those not following the major trend
- Frequent pattern and sequential pattern analysis: anomalous patterns
73Mining RFID Data Sets
- Data cleaning by data mining
- RFID data flow analysis
- Path-based classification and cluster analysis
- Frequent pattern and sequential pattern analysis
- Outlier analysis in RFID data
- Linking RFID data mining with others
74Linking RFID Mining with Others
- The RFID warehouse and cube model makes data mining better organized and more efficient
- Real-time RFID data mining will need further development of stream data mining methods
- Stream cubing and high-dimensional OLAP are two key methods that will benefit RFID mining
- RFID data mining is still a young, largely unexplored field
- RFID data mining has close links with sensor data mining, moving object data mining, and stream data mining
- Thus it will benefit from rich studies in those fields
75Part 1. RFID Data Mining
- Introduction to RFID Data
- Why RFID Data Warehousing and Mining?
- RFID Data Warehousing
- Mining RFID Data Sets
- Conclusions
76Part I Conclusions
- A new RFID warehouse model
- Allows efficient and flexible analysis of RFID
data in multidimensional space - Preserves the structure of the data
- Compresses data by exploiting bulky movements,
concept hierarchies, and path collapsing - Mining RFID data
- Powerful mining mechanisms can be constructed
with RFID data warehouse - Flowgraph analysis, data cleaning,
classification, clustering, trend analysis,
frequent/sequential pattern analysis, outlier
analysis - Lots can be done in RFID data analysis
77Part II. Trajectory Data Mining
- Introduction to Trajectory Data
- Pattern Mining
- Clustering
- Classification
- Outlier Detection
78Trajectory Data
- A trajectory is a sequence of the location and
timestamp of a moving object
Hurricanes
Turtles
Vehicles
Vessels
79Importance of Analysis on Trajectory Data
- The world becomes more and more mobile
- Prevalence of mobile devices such as cell phones,
smart phones, and PDAs - Satellite, sensor, RFID, and wireless
technologies have been improved rapidly - Tremendous amounts of trajectory data of moving
objects
80Research Impacts
- Trajectory data mining has many important,
real-world applications driven by the real need - Homeland security (e.g., border monitoring)
- Law enforcement (e.g., video surveillance)
- Weather forecast
- Traffic control
- Location-based service
-
81Part II. Trajectory Data Mining
- Introduction to Trajectory Data
- Pattern Mining
- Clustering
- Classification
- Outlier Detection
82Trajectory Pattern (Giannotti et al. 07)
- A trajectory pattern should describe the
movements of objects both in space and in time
83Definition of Trajectory Patterns
- A Trajectory Pattern (T-pattern) is a couple (s, a)
- s = <(x0,y0), ..., (xk,yk)> is a sequence of k+1 locations
- a = <a1, ..., ak> are the transition times (annotations)
- Also written as (x0,y0) --a1--> (x1,y1) --a2--> ... --ak--> (xk,yk)
- A T-pattern Tp occurs in a trajectory if the trajectory contains a subsequence S such that
- each (xi,yi) in Tp matches a point (x'i,y'i) in S, and
- the transition times in Tp are similar to those in S
84Characteristics of T-Patterns
- Routes between two consecutive regions are not relevant
- Absolute times are not relevant
Figure: two 1-hour movements from A to B along different routes are not discriminated; likewise, a 1-hour movement at 9 a.m. and a 1-hour movement at 5 p.m. are not discriminated.
85T-Pattern Mining
- 1. Convert each trajectory to a sequence, i.e.,
by converting a location (x,y) into a region
86- 2. Execute the TAS (temporally annotated
sequence) algorithm, developed by the same
authors, over the set of converted trajectories - A TAS is a sequential pattern annotated with
typical transition times between its elements - The algorithm of TAS mining is an extension of
PrefixSpan so as to accommodate transition times
87Sample T-Patterns
Data source: trucks in Athens (273 trajectories)
88Periodic Pattern (Mamoulis et al. 04)
- In many applications, objects follow the same
routes (approximately) over regular time
intervals - e.g., Bob wakes up at the same time and then
follows, more or less, the same route to his work
everyday
89Definition of Periodic Patterns
- Let S be a sequence of n spatial locations l0, l1, ..., l_{n-1}, representing the movement of an object over a long history
- Let T << n be an integer called the period
- A periodic pattern P is defined by a sequence r0 r1 ... r_{T-1} of length T that appears in S more than min_sup times
- For every ri in P, either ri = * (wildcard) or the location l_{jT+i} is inside region ri for each period j in which P appears
90Periodic Pattern Mining
- 1. Obtain frequent 1-patterns
- Divide the sequence S of locations into T spatial datasets, one for each offset of the period, i.e., locations l_i, l_{i+T}, ..., l_{i+(m-1)T} go to a set R_i (a sketch of this split follows)
- Perform DBSCAN on each dataset
- e.g., five clusters discovered in datasets R1, R2, R3, R4, and R6
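A minimal sketch of the split by period offset (the DBSCAN step over each R_i is not shown).

```python
from collections import defaultdict

def split_by_offset(locations, period):
    """Split a location sequence l0, l1, ..., l_{n-1} into `period` spatial
    datasets R_0, ..., R_{period-1}, where R_i collects the locations
    l_i, l_{i+period}, l_{i+2*period}, ... Each R_i would then be clustered
    (e.g., with DBSCAN) to form the candidate regions of 1-patterns."""
    datasets = defaultdict(list)
    for j, loc in enumerate(locations):
        datasets[j % period].append(loc)
    return dict(datasets)

# Example: a period of 3 time units over 9 observations
locs = [(0, 0), (5, 5), (9, 9), (0, 1), (5, 6), (9, 8), (1, 0), (4, 5), (8, 9)]
print(split_by_offset(locs, period=3))
```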
91- 2. Find longer patterns Two methods
- Bottom-up, level-wise technique
- Generate k-patterns using a pair of (k-1)-patterns whose first k-2 non-* regions are in the same positions
- Use a variant of the Apriori-TID algorithm
Figure: 2-length patterns built from regions in R1, R2, R3 (e.g., r1a, r2b, r3c, r1d, r2e, r3f, ...) are joined into 3-length patterns such as r1a r2b r3c and r1d r2e r3f.
92- Faster top-down approach
- Replace each location in S with the id of the cluster it belongs to, or with * if the location belongs to no cluster
- Use a sequence mining algorithm to quickly discover all frequent patterns of the form r0 r1 ... r_{T-1}, where each ri is a cluster in the set Ri or *
- Create a max-subpattern tree and traverse the tree in top-down, breadth-first order
93Four Kinds of Relative Motion Patterns (Laube et
al. 04, Gudmundsson et al. 07)
- Flock (parameters m > 1 and r > 0): at least m entities are within a circular region of radius r and they move in the same direction
- Leadership (parameters m > 1, r > 0, and s > 0): at least m entities are within a circular region of radius r, they move in the same direction, and at least one of the entities was already heading in this direction for at least s time steps
- Convergence (parameters m > 1 and r > 0): at least m entities will pass through the same circular region of radius r (assuming they keep their direction)
- Encounter (parameters m > 1 and r > 0): at least m entities will be simultaneously inside the same circular region of radius r (assuming they keep their speed and direction)
A sketch of a single-time-step flock check follows.
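A rough single-time-step flock check in Python. This is only a sketch: it tests disks centered at the entities, which approximates "some disk of radius r", and it makes "move in the same direction" concrete with an angle tolerance that is not part of the original definition.

```python
import math

def is_flock(points, headings, m, r, angle_tol=math.radians(20)):
    """Return True if at least m entities lie inside some disk of radius r
    (approximated by disks centered at each entity) and their headings
    agree within `angle_tol` radians."""
    for cx, cy in points:
        inside = [i for i, (x, y) in enumerate(points)
                  if math.hypot(x - cx, y - cy) <= r]
        if len(inside) < m:
            continue
        hs = [headings[i] for i in inside]
        spread = max(hs) - min(hs)            # naive heading spread (no wrap-around)
        if spread <= angle_tol:
            return True
    return False

points = [(0, 0), (1, 0.5), (0.5, 1), (10, 10)]
headings = [0.10, 0.15, 0.12, 2.0]            # radians
print(is_flock(points, headings, m=3, r=2))   # True
```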
94An example of a flock pattern for p1, p2, and p3 at the 8th time step; this is also a leadership pattern with p2 as the leader
A convergence pattern if m = 4, for p2, p3, p4, and p5
95- Algorithms Exact and approximate algorithms are
developed - Flock Use the higher-order Voronoi diagram
- Leadership Check the leader condition
additionally -
t is multiplicative factor in all time bounds
96An Extension of Flock Patterns (Benkert et al.
06, Gudmundsson and Kreveld 07)
- A new definition considers multiple time steps, whereas the previous definition considers only one time step
- Flock: a flock in a time interval I, where the duration of I is at least k, consists of at least m entities such that for every point in time within I there is a disk of radius r that contains all the m entities
- e.g., a flock through 3 time steps
97Computing Flock Patterns
- Approximate flocks
- Convert overlapping segments of length k to points in a 2k-dimensional space
- Find 2k-d pipes that contain at least m points
- Longest-duration flocks
- For every entity v, compute a cylindrical region and the intervals from the intersection of the cylinders
- Pick the longest one
98An Extension of Leadership Patterns (Andersson
et al. 07)
- Leadership: we have a leadership pattern if there is an entity that is a leader of at least m entities for at least k time units
- An entity ej is said to be a leader at time [tx, ty], for time-points tx and ty, if and only if ej does not follow anyone at time [tx, ty] and ej is followed by sufficiently many entities at time [tx, ty]
Figure: ei follows ej; the follow condition relates their direction vectors di, dj and an angle threshold β.
99Reporting Leadership Patterns
- Algorithm: build and use the follow-arrays
- e.g., store nonnegative integers specifying for how many past consecutive unit-time intervals ej has been following ei (ej -> ei)
100Trajectory Join (Bakalov et al. 05)
- Identify all pairs of similar trajectories between two datasets; deal with the restricted version of the problem where a temporal predicate is specified by the query
- e.g., identify the pairs of trucks that were never apart from each other by more than 1 mile this morning
- Definition: given two sets of object trajectories R and S, a threshold e and a time interval dt, the result of the trajectory join query is the subset V of pairs <Ri, Sj> (where Ri ∈ R, Sj ∈ S) such that during the time interval dt the distance Ddt(Ri, Sj) ≤ e for any pair in V
- Ri and Sj are sub-trajectories for the time interval dt
101Evaluation of Trajectory Join
- Use the Piecewise Aggregate Approximation (PAA) and then reduce trajectories to strings, e.g., a4 a3 a2 a1 a2
- Introduce a distance function for strings that appropriately lower-bounds the distance function Ddt for trajectories
- Propose a pruning heuristic for reducing the number of trajectory pairs that need to be examined
102Time-Relaxed Trajectory Join (Bakalov et al. 05)
- Here, the interval dt can be anywhere in each
trajectory - Definition Two trajectories match if there exist
time intervals of the same length dt such that
the distance between the locations of the two
trajectories during these intervals is no more
than the spatial threshold e
103Evaluation of Time-Relaxed Trajectory Join
- Approximate raw trajectories using symbolic representations; each trajectory is represented as a string
- Generate all subsequences of length k for each string (assume dt completely covers a total of k frames)
- Compare all pairs of subsequences and obtain the candidates where the distance is no more than k·e
- Verify the candidates by accessing the raw trajectories
- Provide two heuristics for reducing false positives
104Hot Motion Path (Sacharidis et al. 08)
- Identify hot motion paths followed by moving objects over a sliding window, with guarantees
- Motion path: a directed line segment approximating the objects' movement
- Hotness: the number of objects crossing a motion path within the window
- Guarantees: a user-defined tolerance e for approximating the location of an object at a given time
105Example of Hot Motion Paths
- Consider 4 moving objects and their trajectories
- 1. Extract motion paths
- 2. Calculate hotness
- 3. Select the hottest (hotness 2)
106Finding Hot Motion Paths
- System setting
- Objects communicate with the central coordinator
- Two-tiered approach
- Object side RayTrace algorithm
- Update locations only when the object falls
outside a filter - Coordinator side SinglePath strategy
- Discover motion paths using lightweight indexes
107Part II. Trajectory Data Mining
- Introduction to Trajectory Data
- Pattern Mining
- Clustering
- Classification
- Outlier Detection
108Moving Object Clustering
- A moving cluster is a set of objects that move close to each other for a long time interval
- Note: moving clusters and flock patterns are essentially the same
- Formal definition [Kalnis et al. 05]
- A moving cluster is a sequence of (snapshot) clusters c1, c2, ..., ck such that for each timestamp i (1 ≤ i < k), |ci ∩ ci+1| / |ci ∪ ci+1| ≥ θ, where 0 < θ ≤ 1 (a sketch follows)
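The definition can be checked directly, as in this small sketch where each snapshot cluster is a set of object ids.

```python
def is_moving_cluster(snapshot_clusters, theta):
    """Check the moving-cluster condition: for every pair of consecutive
    snapshot clusters the Jaccard overlap |ci ∩ ci+1| / |ci ∪ ci+1|
    must be at least theta."""
    for c_now, c_next in zip(snapshot_clusters, snapshot_clusters[1:]):
        union = c_now | c_next
        if not union:
            return False
        if len(c_now & c_next) / len(union) < theta:
            return False
    return True

clusters = [{1, 2, 3, 4}, {2, 3, 4, 5}, {3, 4, 5, 6}]
print(is_moving_cluster(clusters, theta=0.5))   # True: each overlap is 3/5
```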
109Retrieval of Moving Clusters (Kalnis et al. 05)
- Basic algorithm (MC1)
- 1. Perform DBSCAN for each time slice
- 2. For each pair of a cluster c and a moving cluster g, check whether g can be extended by c
- If yes, g is used at the next iteration
- If no, g is returned as a result
- Improvements
- MC2: avoid redundant checks (improves Step 2)
- MC3: reduce the number of DBSCAN executions (improves Step 1)
110Moving Micro-Clusters (Li et al. 04)
- A group of objects that are not only close to each other, but also likely to move together for a while
- It is desirable to provide multi-level data analysis for prohibitively large datasets: a moving micro-cluster can be viewed as a single moving object
- Initial moving micro-clusters are obtained using a generic clustering algorithm; then, split and collision events are identified
111Trajectory Clustering
- Group similar trajectories into the same cluster
- 1. Whole Trajectory Clustering
- Probabilistic Clustering
- Density-Based Clustering TF-OPTICS
- 2. Partial Trajectory Clustering
- The Partition-and-Group Framework
112Probabilistic Trajectory Clustering (Gaffney and
Smyth 99)
- Basic assumption: the data are produced in the following generative manner
- An individual is drawn randomly from the population of interest
- The individual is assigned to cluster k with probability wk; these are the prior weights on the K clusters
- Given that an individual belongs to cluster k, there is a density function fk(yj | θk) which generates an observed data item yj for individual j
113- The probability density function of observed trajectories is a mixture density: P(yj | xj) = Σ_k wk fk(yj | xj, θk)
- fk(yj | xj, θk) is the k-th density component, wk is its weight, and θk is the set of parameters for the k-th component
- θk and wk can be estimated from the trajectory data using the Expectation-Maximization (EM) algorithm
114Clustering Results For Hurricanes (Camargo et al.
06)
Figure: tracks of Atlantic named tropical cyclones, 1970-2003, with the mean regression trajectory of each cluster.
115Density-Based Trajectory Clustering (Nanni and
Pedreschi 06)
- Define the distance between whole trajectories
- A trajectory is represented as a sequence of locations and timestamps
- The distance between trajectories is the average distance between the objects over all timestamps
- Use the OPTICS algorithm on trajectories
- e.g., a reachability plot over (X, Y, Time) revealing four clusters
116Temporal Focusing TF-OPTICS
- In a real environment, not all time intervals
have the same importance - e.g., urban traffic In rush hours, many people
move from home to work, and vice versa - Clustering trajectories only in meaningful time
intervals can produce more interesting results - TF-OPTICS aims at searching the most meaningful
time intervals, which allows us to isolate the
clusters of higher quality
117TF-OPTICS
- Define the quality of a clustering
- Take account of both high-density clusters and
low-density noise - Can be computed directly from the reachability
plot - Find the time interval that maximizes the quality
- 1. Choose an initial random time interval
- 2. Calculate the quality of neighborhood
intervals generated by increasing or decreasing
the starting or ending times - 3. Repeat Step 2 as long as the quality increases
118Partition-and-Group Framework (Lee et al. 07)
- Existing algorithms group trajectories as a whole, so they might not be able to find similar portions of trajectories
- e.g., a common behavior cannot be discovered when TR1-TR5 move in totally different directions overall
- The partition-and-group framework is proposed to discover common sub-trajectories
Figure: five trajectories TR1-TR5 that diverge globally but share a common sub-trajectory.
119Usefulness of Common Sub-Trajectories
- Discovering common sub-trajectories is very
useful, especially if we have regions of special
interest - Hurricane Landfall Forecasts
- Meteorologists will be interested in the common
behaviors of hurricanes near the coastline or at
sea (i.e., before landing) - Effects of Roads and Traffic on Animal Movements
- Zoologists will be interested in the common
behaviors of animals near the road where the
traffic rate has been varied
120Overall Procedure
- Two phases: partitioning and grouping
Figure: (1) a set of trajectories is partitioned into a set of line segments; (2) the line segments are grouped into a cluster, from which a representative trajectory is derived. Note: a representative trajectory is a common sub-trajectory.
121Partitioning Phase
- Identify the points where the behavior of a trajectory changes rapidly: characteristic points
- An optimal set of characteristic points is found by using the minimum description length (MDL) principle
- Partition the trajectory at every characteristic point
Figure: a trajectory with its characteristic points and the resulting trajectory partitions.
122Overview of the MDL Principle
- The MDL principle has been widely used in information theory
- The MDL cost consists of two components, L(H) and L(D|H), where H is the hypothesis and D the data
- L(H) is the length, in bits, of the description of the hypothesis
- L(D|H) is the length, in bits, of the description of the data when encoded with the help of the hypothesis
- The best hypothesis H to explain D is the one that minimizes the sum L(H) + L(D|H)
123MDL Formulation
- Finding the optimal partitioning translates to finding the best hypothesis using the MDL principle
- H: a set of trajectory partitions; D: a trajectory
- L(H): the sum of the lengths of all trajectory partitions
- L(D|H): the sum of the differences between the trajectory and its trajectory partitions
- L(H) measures conciseness; L(D|H) measures preciseness (a sketch follows)
124Grouping Phase (1/2)
- Find the clusters of trajectory partitions using density-based clustering (i.e., DBSCAN)
- A density-connected component forms a cluster, e.g., {L1, L2, L3, L4, L5, L6}
125Grouping Phase (2/2)
- Describe the overall movement of the trajectory partitions that belong to the cluster
Figure: a red line is the representative trajectory, a blue line the average direction vector, and pink lines the line segments in a density-connected set.
126Sample Clustering Results
7 Clusters from Hurricane Data
570 hurricanes (1950-2004)
A red line: a representative trajectory
127- 2 Clusters from Deer Data
128Part II. Trajectory Data Mining
- Introduction to Trajectory Data
- Pattern Mining
- Clustering
- Classification
- Outlier Detection
129Trajectory Classification
- Predict the class labels of moving objects based
on their trajectories and other features - 1. Machine learning techniques
- Studied mostly in pattern recognition,
bioengineering, and video surveillance - The hidden Markov model (HMM)
- 2. TraClass Trajectory classification using
hierarchical region-based and trajectory-based
clustering
130Machine Learning for Trajectory Classification
(Sbalzarini et al. 02)
- Compare various machine learning techniques for biological trajectory classification
- Data encoding
- For the hidden Markov model, a whole trajectory is encoded as a sequence of momentary speeds
- For the other techniques, a whole trajectory is encoded as the mean and the minimum of its speed, i.e., a vector in R^2
- Two 3-class datasets: trajectories of living cells taken from the scales of the fish Gillichthys mirabilis
- Temperature dataset: 10°C, 20°C, and 30°C
- Acclimation dataset: three different fish populations
131Machine Learning Techniques Used
- k-nearest neighbors (KNN)
- A previously unseen pattern x is simply assigned to the class to which the majority of its k nearest neighbors belongs
- Gaussian mixtures with expectation maximization (GMM)
- Support vector machines (SVM)
- Hidden Markov models (HMM)
- Training: determine the model parameters λ = (A, B, π) that maximize P[x | λ] for a given observation x
- Evaluation: given an observation x = O1, ..., OT and a model λ = (A, B, π), compute the probability P[x | λ] that the observation x has been produced by a source described by λ
132Figure: temperature data set; acclimation data set.
133Vehicle Trajectory Classification (Fraile and
Maybank 98)
- 1. The measurement sequence is divided into overlapping segments
- 2. In each segment, the trajectory of the car is approximated by a smooth function and then assigned to one of four categories: ahead, left, right, or stop
- 3. In this way, the list of segments is reduced to a string of symbols drawn from the set {a, l, r, s}
- 4. The string of symbols is classified using the hidden Markov model (HMM)
134Use of the HMM for Classification
- Classification of the global motions of a car is carried out using an HMM
- The HMM contains four states, in order A, L, R, S, which are the true states of the car: ahead, turning left, turning right, stopped
- The HMM has four output symbols, in order a, l, r, s, which are the symbols obtained from the measurement segments
- The Viterbi algorithm is used to obtain the sequence of internal states (a sketch follows)
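A generic Viterbi decoder with a toy A/L/R/S model; the transition and emission probabilities are illustrative assumptions, not values estimated from driving data.

```python
import numpy as np

def viterbi(obs, start_p, trans_p, emit_p):
    """Return the most likely hidden-state index sequence for the
    observed symbol indices `obs` under an HMM (log-space dynamic
    programming with backtracking)."""
    n_states = len(start_p)
    T = len(obs)
    delta = np.zeros((T, n_states))
    back = np.zeros((T, n_states), dtype=int)
    delta[0] = np.log(start_p) + np.log(emit_p[:, obs[0]])
    for t in range(1, T):
        scores = delta[t - 1][:, None] + np.log(trans_p)   # (prev, cur)
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + np.log(emit_p[:, obs[t]])
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

states = ["A", "L", "R", "S"]                    # ahead, left, right, stop
symbols = {"a": 0, "l": 1, "r": 2, "s": 3}
start = np.array([0.7, 0.1, 0.1, 0.1])
trans = np.full((4, 4), 0.1) + np.eye(4) * 0.6   # sticky states, rows sum to 1
emit = np.full((4, 4), 0.05) + np.eye(4) * 0.80  # mostly faithful symbols
obs = [symbols[c] for c in "aaasrr"]
print([states[i] for i in viterbi(obs, start, trans, emit)])
```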
135Experimental Result
Measurement sequence
Observed symbols
Sequence of inferred states
This measurement sequence means the driver stops
and then turns to the right
136Motion Trajectory Classification (Bashir et al.
07)
- Motion trajectories
- Tracking results from video trackers, sign language data, measurements gathered from wired glove interfaces, and so on
- Application scenarios
- Sports video (e.g., soccer video) analysis: player movements -> a strategy
- Sign and gesture recognition: hand movements -> a particular word
137The HMM-Based Algorithm
- 1. Trajectories are segmented at points of change
in curvature - 2. Sub-trajectories are represented by their
Principal Component Analysis (PCA) coefficients - 3. The PCA coefficients are represented using a
GMM for each class - 4. An HMM is built for each class, where the
state of the HMM is a sub-trajectory and is
modeled by a mixture of Gaussians
138Use of the HMM for Classification
- Training and parameter estimation
- The Baum-Welch algorithm is used to estimate the
parameters - Classification
- The PCA coefficient vectors of input trajectories
after segmentation are posed as an observation
sequence to each HMM (i.e., constructed for each
class) - The maximum likelihood (ML) estimate of the test
trajectory for each HMM is computed - The class is determined to be the one that has
the largest maximum likelihood
139Experimental Result
- Datasets
- The Australian Sign Language dataset (ASL)
- 83 classes (words), 5,727 trajectories
- A sport video data set (HJSL)
- 2 classes, 40 trajectories of high jump and 68
trajectories of slalom skiing objects - Accuracy
140Common Characteristics of Previous Methods
- Use the shapes of whole trajectories to do
classification - Encode a whole trajectory into a feature vector
- Convert a whole trajectory into a string or a
sequence of the momentary speed or - Model a whole trajectory using the HMM
- Note Although a few methods segment
trajectories, the main purpose is to approximate
or smooth trajectories before using the HMM
141TraClass Trajectory Classification Based on
Clustering
- Motivation
- Discriminative features are likely to appear at
parts of trajectories, not at whole trajectories - Discriminative features appear not only as common
movement patterns, but also as regions - Solution
- Extract features in a top-down fashion, first by
region-based clustering and then by
trajectory-based clustering
142Intuition and Working Example
- Parts of trajectories near the container port and
near the refinery enable us to distinguish
between container ships and tankers even if they
share common long paths - Those in the fishery enable us to recognize
fishing boats even if they have no common path
there
143Figure: the TraClass pipeline from trajectory partitions through region-based clustering and trajectory-based clustering to the extracted features.
144Class-Conscious Trajectory Partitioning
- 1. Trajectories are partitioned based on their shapes, as in the partition-and-group framework
- 2. Trajectory partitions are further partitioned by the class labels
- The real interest here is to guarantee that trajectory partitions do not span the class boundaries
Figure: non-discriminative vs. discriminative partitioning of trajectories from class A and class B, with additional partitioning points at the class boundary.
145Region-Based Clustering
- Objective Discover regions that have
trajectories mostly of one class regardless of
their movement patterns - Algorithm Find a better partitioning alternately
for the X and Y axes as long as the MDL cost
decreases - The MDL cost is formulated to achieve both
homogeneity and conciseness
146Trajectory-Based Clustering
- Objective Discover sub-trajectories that
indicate common movement patterns of each class - Algorithm Extend the partition-and-group
framework for classification purposes so that the
class labels are incorporated into trajectory
clustering - If an e-neighborhood contains trajectory
partitions mostly of the same class, it is used
for clustering otherwise, it is discarded
immediately
147Selection of Trajectory-Based Clusters
- After trajectory-based clusters are found, highly
discriminative clusters are selected for
effective classification - If the average distance from a specific cluster
to other clusters of different classes is high,
the discriminative power of the cluster is high - e.g.,
Class A Class B
C2
C1
C1 is more discriminative than C2
148Overall Procedure of TraClass
- 1. Partition trajectories
- 2. Perform region-based clustering
- 3. Perform trajectory-based clustering
- 4. Select discriminative trajectory-based clusters
- 5. Convert each trajectory into a feature vector
- Each feature is either a region-based cluster or a trajectory-based cluster
- The i-th entry of a feature vector is the frequency with which the i-th feature occurs in the trajectory
- 6. Feed the feature vectors to the SVM (a sketch of step 5 follows)
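A sketch of step 5; the `features` below are placeholder membership tests standing in for the learned region-based and trajectory-based clusters.

```python
def to_feature_vector(trajectory_partitions, features):
    """Count how often each feature (a membership test such as 'falls in
    region-based cluster i' or 'matches trajectory-based cluster j')
    occurs among the partitions of one trajectory."""
    vector = [0] * len(features)
    for part in trajectory_partitions:
        for i, belongs_to_feature in enumerate(features):
            if belongs_to_feature(part):
                vector[i] += 1
    return vector

# Illustrative features: membership tests over (x, y) midpoints of partitions
features = [lambda p: p[0] < 5,        # "western region" cluster
            lambda p: p[1] > 8]        # "northern movement" cluster
partitions = [(1, 2), (3, 9), (7, 9)]
print(to_feature_vector(partitions, features))   # [2, 2]
```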
149Classification Results
- Datasets
- Animal: three classes, i.e., three species: elk, deer, and cattle
- Vessel: two classes, i.e., two vessels
- Hurricane: two classes, i.e., category 2 and category 3 hurricanes
- Methods
- TB-ONLY: perform trajectory-based clustering only
- RB-TB: perform both types of clustering
- Results
150Extracted Features
Features: 10 region-based clusters + 37 trajectory-based clusters
Data: three classes
Accuracy: 83.3%
151Part II. Trajectory Data Mining
- Introduction to Trajectory Data
- Pattern Mining
- Clustering
- Classification
- Outlier Detection
152Trajectory Outlier Detection
- Detect trajectory outliers that are grossly
different from or inconsistent with the remaining
set of trajectories - 1. Whole Trajectory Outlier Detection
- An unsupervised method
- A supervised method based on classification
- 2. Integration with multi-dimensional information
- 3. Partial Trajectory Outlier Detection
- The Partition-and-Detect Framework
153A Distance-Based Approach (Knorr and Ng 00)
- Define the distance between two whole trajectories
- A whole trajectory is represented by a summary of its movement, and the distance between two whole trajectories is defined over these summaries (the defining formulas are not preserved in this text)
154- Apply a distance-based approach to the detection of trajectory outliers
- An object O in a dataset T is a DB(p, D)-outlier if at least a fraction p of the objects in T lie at a distance greater than D from O
- Unsupervised learning (a sketch follows)
155Sample Trajectory Outliers
- Detect outliers from person trajectories in a room
The entire data set
The outliers only
156Use of the Neural Network (Owens and Hunter 00)
- A whole trajectory is encod