Title: Geographic Privacy-aware Knowledge Discovery
1Geographic Privacy-aware Knowledge Discovery and Delivery
- Fosca Giannotti (1), Dino Pedreschi (1), Yannis Theodoridis (2)
- (1) KDD Lab, University of Pisa and ISTI-CNR, Italy, www-kdd.isti.cnr.it
- (2) InfoLab, University of Piraeus, Greece, infolab.cs.unipi.gr
Tutorial at EDBT 2009, St Petersburg, 25th March 2009
2Mobile devices and services
- Large diffusion of mobile devices, mobile
services and location-based services
3Wireless networks as mobility data collectors
- Wireless network infrastructures are the nerves of our territory
- besides offering their services, they gather highly informative traces about human mobile activities
- UbiComp infrastructure will further push this phenomenon
- Miniaturization, wearability and pervasiveness will produce traces of increasing
- positioning accuracy
- semantic richness
4Which mobility data?
- Location data from mobile phones, i.e. cell positions in the GSM/UMTS network.
- Location data from GPS-equipped devices (Galileo in the (near?) future)
- Nokia (and other) mobile phones have an on-board GPS receiver and can transmit GPS tracks by SMS/MMS
- Location data from
- peer-to-peer mobile networks
- intelligent transportation environments (VANET)
- ad hoc sensor networks, RFIDs (radio-frequency IDs)
5What can we learn from mobility data ...
6Real-time density estimation in urban areas
The senseable project: http://senseable.mit.edu/grazrealtime/
8More ambitiously: mobility patterns
9From mobility data to mobility patterns
10GeoPKDD
- (for) Geographic Privacy-aware Knowledge Discovery and Delivery
- Towards an archaeology of the present?
- A scenario of great opportunities and risks
- mining mobility data can yield useful knowledge
- but individual privacy is at risk.
- A new multidisciplinary research area is emerging at this crossroads, with potential for broad social and economic impact
- F. Giannotti and D. Pedreschi (Eds.): Mobility, Data Mining and Privacy. Springer, 2008.
11- A paradigmatic project: GeoPKDD
- http://www.geopkdd.eu
- A European FP6/FET project
- (Dec. 2005 - Mar. 2009)
- 30 researchers involved (18 young researchers)
- 80 conference papers, 34 journal papers, 2 books, 7 workshops
- 30 specific algorithms, 2 integration project platforms
12The GeoPKDD scenario
- From the analysis of the traces of our mobile phones it is possible to reconstruct our mobile behaviour, the way we collectively move
- This knowledge may help us improve decision-making in many mobility-related issues
- Planning traffic and public mobility systems in metropolitan areas
- Planning physical communication networks
- Localizing new services in our towns
- Forecasting traffic-related phenomena
- Organizing logistics systems
- Avoiding repeating mistakes
- Detecting changes in a timely manner.
13The big picture
14GSM network, WSN, GPS
End user
Mobility manager
Mobility Patterns
Mobility Data
Privacy and anonymity protection
Raw data
15GeoPKDD research issues
Spatio-temporal patterns
- Trajectory Warehouse
- Privacy-preserving OLAP
- Spatio-temporal models for moving objects
- Moving Object DB
- Geographic reasoning
- Visual Analytics
- ST data mining methods
- Data mining query languages
- Privacy-preserving data mining
16Key questions
- How to reconstruct a trajectory from raw logs, how to store and query trajectory data?
- How to classify trajectories according to means of transportation (pedestrian, private vehicle, public transportation vehicle, ...)?
- Which spatio-temporal patterns and/or models are useful abstractions of mobility data?
- How to compute such patterns and models efficiently?
- Privacy protection and anonymity: how to make such concepts formally precise and measurable?
- How to find an optimal trade-off between privacy protection and quality of the analysis?
17A guided tour on GeoPKDD
- Mobility data management
- Acquiring and storing trajectories in MODs
- Location-aware querying
- Trajectory indexing
- Mobility data mining
- Trajectory warehousing and OLAP
- Mobility data mining
- Visual analytics for mobility data
- Privacy aspects on mobility data
- Preserving anonymity
- (Semantic enriched) Geographic KDD process
- Combining Mining and Querying
- Ontological framework for end-user querying and reasoning
- Outlook
18A guided tour on GeoPKDD
- Mobility data management
- Acquiring and storing trajectories in MODs
- Location-aware querying
- Trajectory indexing
- Mobility data mining and the geographic KDD process
- Trajectory warehousing and OLAP
- Mobility data mining
- Visual analytics for mobility data
- Privacy aspects on mobility data
- Preserving anonymity
- (Semantic enriched) Geographic KDD process
- Combining Mining and Querying
- Ontological framework for end-user querying and reasoning
- Outlook
19Acquiring and Storing Trajectories in MODs
- About mobility data
- The trajectory reconstruction problem
- MOD engines
20Mobility Data
- Typical structure and size
N   Time               Lat        Lon       Height  Course  Speed   PDOP  State  NSat
8   22/03/07 08:51:52  50.777132  7.205580  67.6    345.4   21.817  3.8   1808   4
9   22/03/07 08:51:56  50.777352  7.205435  68.4    35.6    14.223  3.8   1808   4
10  22/03/07 08:51:59  50.777415  7.205543  68.3    112.7   25.298  3.8   1808   4
11  22/03/07 08:52:03  50.777317  7.205877  68.8    119.8   32.447  3.8   1808   4
12  22/03/07 08:52:06  50.777185  7.206202  68.1    124.1   30.058  3.8   1808   4
13  22/03/07 08:52:09  50.777057  7.206522  67.9    117.7   34.003  3.8   1808   4
14  22/03/07 08:52:12  50.776925  7.206858  66.9    117.5   37.151  3.8   1808   4
15  22/03/07 08:52:15  50.776813  7.207263  67.0    99.2    39.188  3.8   1808   4
16  22/03/07 08:52:18  50.776780  7.207745  68.8    90.6    41.170  3.8   1808   4
17  22/03/07 08:52:21  50.776803  7.208262  71.1    82.0    35.058  3.8   1808   4
18  22/03/07 08:52:24  50.776832  7.208682  68.6    117.1   11.371  3.8   1808   4
21Location data producers GSM, GPS, WiFi
Streaming data manager Trajectory reconstructor
Streaming location data are received
Trajectory data are reconstructed
Moving Object Database
22The trajectory reconstruction problem
- From raw location data (obj-id, x, y, t)
- To trajectory data (obj-id, traj-id, (x, y, t))
a sample of a user's movement (GPS recordings)
a sample of reconstructed trajectories
23Reconstructing trajectories
- Collected raw data represent time-stamped geo-locations
- Raw points arrive in bulk sets
- We need a filter that decides whether a new series of data is to be appended to an existing trajectory or not
- Prospective parameters
- Tolerance distance
- Temporal gap
- Spatial gap
- Maximum speed
- Maximum noise duration
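A minimal sketch of such a filter, assuming one object's raw points are given as (x, y, t) tuples in metres and seconds. The function name `reconstruct` and all threshold values are illustrative assumptions, not GeoPKDD settings; the tolerance distance and maximum noise duration parameters are left out for brevity.

```python
from math import hypot

def reconstruct(points, max_temporal_gap=300.0, max_spatial_gap=500.0, max_speed=50.0):
    """Split one object's (x, y, t) points into a list of trajectories.

    Thresholds are hypothetical: seconds, metres and metres per second.
    """
    trajectories = []
    current = []
    for x, y, t in sorted(points, key=lambda p: p[2]):
        if current:
            px, py, pt = current[-1]
            dt = t - pt
            dist = hypot(x - px, y - py)
            if dt <= 0:
                continue  # repeated timestamp: ignore the point
            if dist / dt > max_speed:
                continue  # implausible jump: treat the point as noise
            if dt > max_temporal_gap or dist > max_spatial_gap:
                # gap too large: close the current trajectory, start a new one
                trajectories.append(current)
                current = []
        current.append((x, y, t))
    if current:
        trajectories.append(current)
    return trajectories

raw = [(0, 0, 0), (10, 0, 10), (20, 0, 20), (5000, 0, 21),
       (30, 0, 30), (40, 0, 1000), (50, 0, 1010)]
trajs = reconstruct(raw)
print([len(t) for t in trajs])
```

In this toy input the point at t = 21 is dropped as noise (it would require an impossible speed), and the silence between t = 30 and t = 1000 exceeds the temporal gap, so the points end up in two trajectories.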
24Moving Objects Database Systems
- Traditional database technology has been extended into Moving Object Databases (MODs) that handle modeling, indexing and query processing issues for trajectories
- Spatial and temporal dimensions are considered first-class citizens.
- Both past and current (as well as anticipated future) positions of moving objects are of interest.
- Several prototype MODs
- DOMINO (Wolfson et al.): NGITS'02, EDBT'02, ICDT'05, ...
- PLACE (Aref et al.): SSDBM'04, VLDB'04, ...
- SECONDO (Güting et al.): IDEAS'00, ICDE'05, MDM'06
- HERMES (Pelekis et al.): EDBT'06, SIGMOD'08
25DOMINO
- Databases for Moving Objects tracking (http://www.cs.uic.edu/wolfson/html/mobile.html)
- Built on top of a DBMS using a three-layer approach
- Utilizes dynamic attributes for future predicted locations
- Manages uncertainty that is inherent in future motion plans
- Supports various location models
- Exact point location
- An area in which the object is located
- An approximate motion plan
- A complete motion plan
26SECONDO
- An Extensible DBMS Architecture and Prototype (http://dna.fernuni-hagen.de/Secondo.html/index.html)
- A generic DBMS framework that can be filled with implementations of various data models (relational, object-oriented, or XML) and data types (spatial data, moving objects)
- A database is a set of SECONDO objects of the form (name, type, value), where type is one of (about 20) implemented algebras
- Query optimizer includes
- optimization of conjunctive queries
- selectivity estimation
- implementation of an SQL-like query language
- Built on top of Berkeley DB.
[Figure: SECONDO architecture. Components shown: GUI, Optimizer, Kernel, Command Manager, Query Processor and Catalog, operators Op 1, Op 2, ..., Op n, Storage Manager and Tools]
27The PLACE Server
- Pervasive Location-Aware Computing Environments (http://www.cs.purdue.edu/place/)
- Continuous evaluation of queries over spatio-temporal data streams
- Shared execution among concurrent continuous queries
- Built inside a DB engine
- Incremental evaluation of continuous queries
- Spatio-temporal query operators
28The Hermes MOD engine
- Data model
- Current location as a function of time over the starting location
- linear vs. arc movement
- A palette of ADTs
- Moving Point, Moving Rectangle, Moving Polygon, etc.
- on top of Oracle Spatial Cartridge
- Indexing support
- TB-tree for trajectories, R-tree for stationary spatial data
29Location-aware querying
- Traditional vs. advanced queries
30What kind of queries?
- The nature of trajectory data lets us query them with a variety of operators
- Coordinate-based
- Range
- Nearest Neighbor
- Similarity-based
- Trajectory-based
- Topological
- Derived information
- Combined
31Coordinate-based queries
- Spatial (range or NN) search
- Find all trajectories that were inside area A at time instant t (or time interval I), or
- Find the trajectory that was closest to point B at time instant t (or time interval I)
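The two query types can be sketched over an in-memory collection, assuming each trajectory is a time-ordered list of (x, y, t) samples with linear movement between samples. The names `position_at`, `range_query` and `nearest_neighbor` are hypothetical, and the linear scan stands in for the index-based evaluation a real MOD would use.

```python
from math import hypot

def position_at(traj, t):
    """Linearly interpolated (x, y) of a trajectory at time t, or None if undefined."""
    for (x0, y0, t0), (x1, y1, t1) in zip(traj, traj[1:]):
        if t0 <= t <= t1:
            f = 0.0 if t1 == t0 else (t - t0) / (t1 - t0)
            return (x0 + f * (x1 - x0), y0 + f * (y1 - y0))
    return None

def range_query(trajs, area, t):
    """Ids of trajectories located inside area = (xmin, ymin, xmax, ymax) at time t."""
    xmin, ymin, xmax, ymax = area
    hits = []
    for oid, traj in trajs.items():
        pos = position_at(traj, t)
        if pos is not None and xmin <= pos[0] <= xmax and ymin <= pos[1] <= ymax:
            hits.append(oid)
    return sorted(hits)

def nearest_neighbor(trajs, point, t):
    """Id of the trajectory closest to `point` at time t, or None if none is defined."""
    best = None
    for oid, traj in trajs.items():
        pos = position_at(traj, t)
        if pos is None:
            continue
        d = hypot(pos[0] - point[0], pos[1] - point[1])
        if best is None or d < best[1]:
            best = (oid, d)
    return best[0] if best else None

trajs = {
    "a": [(0, 0, 0), (10, 0, 10)],
    "b": [(0, 10, 0), (10, 10, 10)],
    "c": [(100, 100, 20), (110, 100, 30)],
}
print(range_query(trajs, (4, -1, 6, 1), t=5))
print(nearest_neighbor(trajs, (5, 8), t=5))
```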
32Trajectory-based queries
- Topological / directional search
- Find all trajectories that entered (crossed, left, bypassed, etc.) or were located west (south, etc.) of an area, or
- Find all trajectories that crossed (met, etc.) or were located left of (right of, in front of, etc.) a query trajectory TQ
33... but even more advanced queries
- Most-similar-trajectories
- [Frentzos et al. 2007] Given a query trajectory TQ, show me the k most similar trajectories to TQ (perhaps constrained in space and/or time)
- Motion patterns
- [Hadjieleftheriou et al. 2005] Find objects that crossed through region A at time t1, came as close as possible to point B at a later time t2 and then stopped inside circle C during interval (t3, t4)
34Trajectory Similarity Queries
- Issues
- How do we measure (dis-)similarity between two trajectories?
- Similarity variations
- Similarity in space, in time, in derived info (e.g. speed, acceleration, direction)
- Similarity queries have been studied extensively in the time-series literature
- Here, things are different! Where you are and at what time are important.
- In time series there is no spatial component, and we typically start with normalization
35Motion Pattern Queries
- Trajectories represent behavior over time: they capture the evolution of a movement
- Can we query the behavior / motion of trajectories?
- Yes! We can use complex motion patterns
- A Motion Pattern (MP) query is a time-ordered sequence of primitive queries
- Qmp = Q1 -> Q2 -> Q3 ..., where each Qi is a primitive query
- The time-ordering of the spatial predicates may be explicit or implicit
- MP queries are different!
- They are not typical similarity queries
- It is not the same predicate that holds for the duration of an interval
- They are not typical range/NN queries
- We can now choose separate predicates at different times
36Trajectory Indexing
- Indexing in native vs. parametric space
- Trajectory-oriented indexing techniques
37Two approaches
- Indexing in the Native Space
- Typically approximate using MBRs, then index these MBRs
- Advantage: we can use R-trees etc.; can also index other moving objects (areas etc.)
- Disadvantage: trajectories are lines, thus MBRs add extensive empty space
- Indexing in the Parametric Space
- approximate each trajectory by a function (typically a polynomial), then index the function's coefficients
- Advantage: better approximation
- Disadvantage: need to translate between Native and Parametric spaces; a better approximation means more coefficients
38Indexing in the Native Space
- Traditional approaches
- One MBR per trajectory
- Too much empty space
- vs. one MBR per segment
- Too many objects
- Can we do any better?
- Q: Where can we cut for MBRs?
- A: Balancing this tradeoff [Hadjieleftheriou et al. 2002]
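The tradeoff can be made concrete with a small sketch that cuts a polyline into a given number of consecutive pieces and sums the area of the resulting MBRs. The equal-length cutting rule and the helper names (`mbr`, `area`, `split_mbrs`) are assumptions for illustration only; this is not the splitting algorithm of Hadjieleftheriou et al.

```python
from math import ceil

def mbr(points):
    """Minimum bounding rectangle (xmin, ymin, xmax, ymax) of a list of (x, y) points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))

def area(box):
    xmin, ymin, xmax, ymax = box
    return (xmax - xmin) * (ymax - ymin)

def split_mbrs(points, pieces):
    """Cut the polyline into at most `pieces` runs of consecutive segments, one MBR each."""
    n_segments = len(points) - 1
    step = ceil(n_segments / pieces)
    # consecutive runs share their boundary point, so no segment is lost
    return [mbr(points[i:i + step + 1]) for i in range(0, n_segments, step)]

diagonal = [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]
for pieces in (1, 2, 4):
    boxes = split_mbrs(diagonal, pieces)
    print(pieces, len(boxes), sum(area(b) for b in boxes))
```

For this diagonal trajectory the total covered area drops from 16 (one MBR) to 8 (two MBRs) to 4 (one MBR per segment), while the number of objects to index grows from 1 to 4.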
39Trajectory-oriented indexing techniques
- Indexing movement in free space
- The multi-version R-tree (MVR-tree)
- The trajectory-bundle tree (TB-tree)
- Indexing network-constrained movement
- The fixed-network-restricted tree (FNR-tree)
40MVR-tree
- (Kumar et al. 1998), (Kollios et al. 2001), (Tao and Papadias, 2001)
- The idea is to use a multi-version structure (MVR-tree) to index the trajectory approximations
- Why multi-versioning?
- A traditional R-tree considers time as another dimension
- for example, (x, y, t) creates a 3D R-tree
- Instead, an MVR-tree effectively provides a separate 2D R-tree, indexing each time-slice
41TB-tree
- (Pfoser et al. 2000) Maintains the trajectory concept
- Each node consists of segments of a single trajectory
- nodes corresponding to the same trajectory are linked together in a chain
- Effective for trajectory-oriented queries
- Implemented in the Hermes MOD engine using Oracle's indexing extensibility
42FNR-tree
- FNR-tree (Frentzos, 2003)
- a forest of 1D (temporal) R-trees on top of a 2D (spatial) R-tree
- There is an additional Parent 1D R-tree which indexes the temporal intervals of the 1D R-trees' leaf nodes
43Indexing in the Parametric Space
- Each trajectory is a collection of functions
- Recall the definition of a trajectory
- A trajectory T is defined as T = <oid, t0, (f1(t), t1), (f2(t), t2), ...>, where
- f1(t), f2(t), ... are functions of time, fj representing movement during time interval [tj-1, tj],
- t1, t2, ... are time stamps in chronological order
- If each fi(t) is a linear function, the trajectory becomes a polyline
44Indexing in the Parametric Space (cont.)
- A first approach [Porkaew et al. 2001]: use the parameters of each function fi(t) as the keys in the index structure.
- Problem: hundreds of functions per trajectory
- Result: large storage overhead, reduced efficiency
- [Cai and Ng, 2004]: approximate each trajectory with a Chebyshev polynomial and use its coefficients for indexing
- Easy to compute
- Almost identical to the optimal minimax polynomial
- Focused on similarity queries
- Over entire trajectories of equal length
- Same degree polynomials for all trajectories
45Summary on Mobility Data Management
- From spatial and spatio-temporal to moving object databases
- Research has touched almost all aspects, from data modeling to efficient storage and retrieval
- Open issues
- Efficient query processing for location-based services (LBS)
- Indexing both archived and prospective locations
- MOD architecture: centralized vs. distributed vs. stream-oriented
- Exotic applications: mobile computer vision, etc.
46A guided tour on GeoPKDD
- Mobility data management
- Acquiring and storing trajectories in MODs
- Location-aware querying
- Trajectory indexing
- Mobility data mining and the geographic KDD process
- Trajectory warehousing and OLAP
- Mobility data mining and reasoning
- Visual analytics for mobility data
- Privacy aspects on mobility data
- Preserving anonymity
- (Semantic enriched) Geographic KDD process
- Combining Mining and Querying
- Ontological framework for end-user querying and reasoning
- Outlook
47Trajectory Warehousing and OLAP
- From DW to TDW
- Trajectory-based OLAP
- ETL and the distinct count problem
48Data warehousing (DW)
- Widely investigated for conventional, non-spatial data.
- Some research on spatial DW, with pioneering work by Han et al. in 1998.
- Spatial and non-spatial dimensions and measures.
- OLAP operations in a spatial data cube.
- Recent research direction: developing spatio-temporal DW and supporting spatio-temporal OLAP operations in order to extract summarized spatio-temporal information.
- Useful for traffic supervision systems, transportation and supply chain management, mobile e-commerce.
- Focus on methods for an efficient implementation of spatio-temporal aggregate queries.
49Trajectory data warehousing (TDW)
- TDW should
- extract aggregate information from MOD
- support a variety of dimensions (temporal, spatial, thematic, ...) and measures (about space, time and their derivatives)
- Storing measures associated with facts, concerning the set of trajectories crossing the cell
- -> aggregate information in base cells
- Challenges
- design of a trajectory-oriented data cube
- high volume and complex nature of data: special query processing requirements
- extensions of traditional aggregation techniques to produce summary information for OLAP analysis
50A trajectory warehouse system architecture
data analyst (desktop)
data producers (mobile)
location data (obj-id, x, y, t) (not trajectories) are generated
trajectory stream manager
moving object database
trajectory data cube
geo-layers
trajectory data (obj-id, traj-id, (x, y, t)) are reconstructed
Geographical context is considered
aggregated trajectory data are computed (ETL procedure)
51Basic definitions and schemas
- Trajectory
- Moving Object Database D = {T1, T2, ..., TN}
- Trajectory Data Warehouse
- Dimensions: Spatial, Temporal, Object Profile
- Measures: count (trajectories), count (users), avg (distance traveled), avg (travel duration), avg (speed), avg (abs (acceleration))
52OLAP (spatio-temporal aggregation)
53ETL processing: loading
- Loading data into the dimension tables -> straightforward
- Loading data into the fact table -> complex
- Fill in the measures with the appropriate numeric values
- In order to calculate the measures, we have to extract the portions of the trajectories that fit into the base cells of the cube
- alternative solutions
- cell-oriented
- trajectory-oriented
54Aggregating measures in the cube
[Figure: regions R and R1 to R6, with trajectories crossing R4, R5 and R6]
At the lowest hierarchy level:
- count of trajectories in R4 = 3
- count of trajectories in R5 = 2
- count of trajectories in R6 = 1
Count of trajectories in R = 6 (according to traditional roll-up). Correct answer: 3 (!!), because the contents (trajectories) of the partitions overlap.
- A naïve solution is to query back the raw data.
- Can we do something better?
55The distinct count problem
- During the ETL process, measures can be computed accurately by executing MOD queries
- Once the fact table has been fed, aggregate-only information is stored inside the TDW (no trajectory / user ids)
- When rolling up, COUNT_USERS, COUNT_TRAJECTORIES and, hence, all other measures defined over COUNT_TRAJECTORIES are subject to the distinct count problem [Tao et al. 2004]
- if an object remains in the query region for several timestamps during the query interval, instead of counting this object once, it is counted multiple times in the result
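A small sketch of why rolling up stored counts overcounts. The assignment of trajectory ids to cells below is invented so that the base counts match the 3 / 2 / 1 example (one trajectory crosses all three cells, one crosses two, one stays in a single cell); in a real TDW these ids are exactly what is no longer available after loading.

```python
# Hypothetical trajectory ids per base cell (not stored in an aggregate-only TDW).
cells = {
    "R4": {"t1", "t2", "t3"},
    "R5": {"t1", "t2"},
    "R6": {"t1"},
}

# What the fact table keeps: one count per base cell.
base_counts = {cell: len(ids) for cell, ids in cells.items()}

# Traditional roll-up: sum the stored counts.
rollup_sum = sum(base_counts.values())

# Correct answer: number of distinct trajectories, which needs the ids.
distinct_count = len(set().union(*cells.values()))

print(base_counts, rollup_sum, distinct_count)
```

The summed roll-up gives 6 while only 3 distinct trajectories exist, which is the gap that approximate distinct-count techniques try to close without going back to the raw data.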
56Mobility Data Mining
- Trajectory pattern mining
- Trajectory clustering
57Examples of mobility patterns
- Trajectory clustering
- Cluster trajectories
- For each cluster, find the mean trajectory to
represent/classify
- Frequent pattern mining
- Discover frequent routes, etc.
58Trajectory pattern mining
59Q: What is a trajectory pattern?
60A: A spatio-temporal sequential pattern
- [Giannotti et al. 2007] A sequence of visited regions, frequently visited in the specified order with similar transition times
61T-Pattern discovery
1- Find Regions of Interest
2- Find similar trajectories in space and time
3- Extract patterns
62T-Patterns for trajectories
- A Trajectory Pattern (T-pattern) is a pair (s, α)
- s = <(x0,y0), ..., (xk,yk)> is a sequence of k+1 locations
- α = <α1, ..., αk> are the transition times (annotations)
- also written as (x0,y0) -α1-> (x1,y1) -α2-> ... -αk-> (xk,yk)
- A T-pattern Tp occurs in a trajectory if it contains a sub-sequence S such that
- each (xi,yi) in Tp matches a point (xi',yi') in S, and
- the transition times in Tp are similar to those in S
63Continuity issues (space and time)
- What does "matches" mean in space/time?
- The same exact spatial location (x,y) usually never occurs twice
- The same exact transition times usually do not occur twice
- Solution: allow approximation
- a notion of spatial neighborhood
- a notion of temporal tolerance
64T-Pattern approximate occurrence
- Two points match if one falls within a spatial neighborhood N() of the other
- Two transition times match if their temporal difference is at most τ
- Example
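A minimal sketch of an approximate-occurrence check, assuming the spatial neighborhood N() is a circle of radius `eps` and the temporal tolerance is `tau`. The function name `occurs`, the circular neighborhood and the backtracking search are illustrative choices, not the T-pattern miner itself.

```python
from math import hypot

def occurs(pattern_points, transitions, traj, eps, tau):
    """True if the T-pattern approximately occurs in traj.

    pattern_points: k+1 locations (x, y); transitions: k transition times;
    traj: time-ordered list of (x, y, t) samples.
    """
    def search(i, start, prev_t):
        if i == len(pattern_points):
            return True  # every pattern location has been matched
        px, py = pattern_points[i]
        for j in range(start, len(traj)):
            x, y, t = traj[j]
            if hypot(x - px, y - py) > eps:
                continue  # outside the spatial neighborhood
            if i > 0 and abs((t - prev_t) - transitions[i - 1]) > tau:
                continue  # transition time differs by more than tau
            if search(i + 1, j + 1, t):
                return True
        return False

    return search(0, 0, None)

pattern_points = [(0, 0), (10, 0)]
transitions = [20]
traj = [(0.5, 0.0, 0), (5.0, 0.0, 10), (9.5, 0.2, 22)]
print(occurs(pattern_points, transitions, traj, eps=1.0, tau=5))
print(occurs(pattern_points, transitions, traj, eps=1.0, tau=1))
```

The trajectory reaches the second location 22 time units after the first, so it matches a 20-unit transition under a tolerance of 5 but not under a tolerance of 1.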
67Computing general T-Patterns
- T-pattern mining can be mapped to a density estimation problem over R^(3n-1)
- 2 dimensions for each (x,y) in the pattern (2n)
- 1 dimension for each transition (n-1)
- Density computed by
- mapping each sub-sequence of n points of each input trajectory to R^(3n-1)
- drawing an influence area for each point (composition of N() and τ)
- Too computationally expensive, heuristics needed!!!
68Approach 1: predefined regions
- Fix a set of pre-defined regions of interest
- Map each (x,y) of the trajectory to its region
- Sample pattern
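The mapping step can be sketched as follows, assuming regions of interest are axis-aligned rectangles given by name. The region names, the rectangle representation and the helper `to_region_sequence` are hypothetical; points falling in no region are simply skipped.

```python
def to_region_sequence(traj, regions):
    """Translate (x, y, t) samples into a sequence of (region, entry_time) pairs.

    regions: dict name -> (xmin, ymin, xmax, ymax). Consecutive samples in the
    same region are merged into one visit.
    """
    seq = []
    for x, y, t in traj:
        for name, (xmin, ymin, xmax, ymax) in regions.items():
            if xmin <= x <= xmax and ymin <= y <= ymax:
                if not seq or seq[-1][0] != name:
                    seq.append((name, t))
                break  # regions are assumed not to overlap
    return seq

regions = {"A": (0, 0, 2, 2), "B": (8, 0, 10, 2)}
traj = [(1, 1, 0), (1.5, 1, 5), (5, 1, 10), (9, 1, 20)]
visits = to_region_sequence(traj, regions)
print(visits)
```

The resulting sequence of region visits with entry times is the kind of input from which transition times between regions (here 20 time units from A to B) can be collected.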
69Approach 2: static discovered regions
- Detect significant regions through spatial clustering
- Map each (x,y) of the trajectory to its region
- Sample pattern
70Approach 3: dynamic discovered regions
- Dynamic discovery of dense regions
- Regions are located at each step of the pattern generation
- Sample pattern
1. Considering all trajectories, A is a cluster / dense region
2. Considering only trajectories that visit A, B is a cluster
3. 20 mins is a typical time for pattern A -> B
71Sample T-patterns
Data source: Athens trucks, 273 trajectories (source: www.rtreeportal.org)
72Ongoing work on T-pattern mining
- Application-oriented assessments on large, real datasets show that T-patterns are many and difficult to evaluate
- A starting point for further model construction, rather than a final product
- Simplification of output transition times
- The most complex info for end users
- Study relations with
- Geographic background knowledge, such as points of interest and road network
- Privacy issues: are T-patterns safe? Can we use T-patterns to protect (anonymize) original data?
- Reasoning on trajectories and patterns
73Location prediction based on T-patterns
- T-pattern mining extracts a set of local patterns from a global set of data.
- Can we use these patterns to build a global model to predict the next location? Yes! [Pinelli et al. 2008]
Global model (Ptree)
Local patterns (T-pattern)
74Trajectory Clustering
75Trajectory Clustering
- Questions
- Which distance between trajectories?
- Which kind of clustering?
- What is a cluster mean in our case?
- A representative trajectory?
76 Which distance?
- Average Euclidean distance
- Synchronized behaviour distance
- Similar objects are almost always in the same place at the same time
- Computed on the whole trajectory
- Computational aspects
- Cost: O(|τ1| + |τ2|) (|τ| = number of points in τ)
- It is a metric => efficient indexing methods allowed, e.g. [Frentzos et al. 2007]
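A sketch of a synchronized average Euclidean distance, assuming trajectories are time-ordered (x, y, t) lists with linear movement between samples. It approximates the time average by sampling at `steps + 1` equally spaced instants, which is a simplification and not the exact linear-time computation; the function names are hypothetical.

```python
from math import hypot

def position_at(traj, t):
    """Linearly interpolated (x, y) of a time-ordered [(x, y, t), ...] list at time t."""
    for (x0, y0, t0), (x1, y1, t1) in zip(traj, traj[1:]):
        if t0 <= t <= t1:
            f = 0.0 if t1 == t0 else (t - t0) / (t1 - t0)
            return (x0 + f * (x1 - x0), y0 + f * (y1 - y0))
    raise ValueError("trajectory is not defined at time %r" % (t,))

def avg_euclidean_distance(traj1, traj2, t_start, t_end, steps=100):
    """Average distance between the two objects at the same instants in [t_start, t_end]."""
    total = 0.0
    for k in range(steps + 1):
        t = t_start + (t_end - t_start) * k / steps
        x1, y1 = position_at(traj1, t)
        x2, y2 = position_at(traj2, t)
        total += hypot(x1 - x2, y1 - y2)
    return total / (steps + 1)

a = [(0, 0, 0), (10, 0, 10)]
b = [(0, 3, 0), (10, 3, 10)]   # same movement, shifted by 3 units
c = [(0, 0, 0), (10, 10, 10)]  # drifts away from a over time
print(avg_euclidean_distance(a, b, 0, 10))
print(avg_euclidean_distance(a, c, 0, 10))
```

Because the time window is an explicit argument, the same function can be evaluated on a sub-interval of the trajectories' common lifespan.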
77Which kind of clustering?
- General requirements
- Non-spherical clusters should be allowed
- E.g. a traffic jam along a road: "snake-shaped" cluster
- Tolerance to noise
- Low computational cost
- Applicability to complex, possibly non-vectorial data
- A suitable candidate: density-based clustering
- OPTICS (Ankerst et al., 1999)
- -> T(rajectory)-OPTICS
78A sample dataset
- A set of trajectories forming 4 clusters + noise (synthetic)
79T-OPTICS vs. HAC and K-means
K-means
HAC-average
Reachability plot (objects reordering for distance distribution)
T-OPTICS
ε threshold
80Extension 1: Temporal focusing
- Different time intervals can show different behaviours
- E.g. objects that are close to each other within a time interval can be far apart in other periods of time
- The time interval becomes a parameter
- E.g. rush hours vs. low traffic times
- Already supported by the distance measure
- Just compute D(τ1, τ2)|T' on a time interval T' ⊆ T
- Problem: significant T' are not always known a priori
- An automated mechanism is needed to find them
- Nanni, Pedreschi. Time-focused clustering of trajectories of moving objects. J. of Intelligent Information Systems, 2006
81Extension 2: visually-driven clustering
- Progressive refinement through visually-driven exploration
- Progressively complex similarity functions
- Scalability
- Index structures to support efficient neighborhood queries for trajectory clustering
- Progressive clustering by sampling
- Incremental clustering and concept drift
82Interactive density-based trajectory clustering
- Rinzivillo, Pedreschi, Nanni, Giannotti, Andrienko, Andrienko. Visually-driven analysis of movement data by progressive clustering. J. of Information Visualization, 2008
83Progressive clustering
- First, create large clusters of trajectories using the "common ends" distance function
- Concentrate on the (big) cluster of inward trajectories (routes towards the city center)
- Refine by creating subclusters using a more sophisticated distance function (route similarity)
84Looking for frequent stops and moves
85Clusters of typical trips
86Cluster 1: from work to home
Observation: the eastern route is chosen more often
87Cluster 2: from home to work
Observation: the eastern route is chosen much more often
88MILANO data on the map
89Five biggest (sub-)clusters of trajectories towards the city centre
Dark grey: moves occurring in trajectories from several clusters
90Clustering trajectories on route similarity
Left: peripheral routes; middle: inward routes; right: outward routes.
- Rinzivillo, Pedreschi, Nanni, Giannotti, Andrienko, Andrienko. Visually-driven analysis of movement data by progressive clustering. J. of Information Visualization, 2008
91Cluster-based Classification of Large Trajectory Datasets
- Gennady Andrienko, Natalia Andrienko: Fraunhofer IAIS, Sankt Augustin, Germany
- Salvatore Rinzivillo, Mirco Nanni, Fosca Giannotti: ISTI-CNR, Pisa, Italy
- Dino Pedreschi: Università di Pisa, Pisa, Italy
92Motivation
- Massive collections of GPS tracks are rough approximations of complex human activities
- Challenge
- develop analysis techniques capable of mastering the complexity of the data and extracting meaningful abstractions
- A trajectory clustering problem
- find, for the spatial area and the time interval under analysis, the natural clusters of similar trajectories and attach semantics to them
93The process at a glance
- Given a trajectory dataset D, extract a sample D' of trajectories from D
- Apply OPTICS with a suitable distance function d and get a set of density-based clusters C1, C2, ..., Cm
- For each cluster Ci
- Select s specimens as its representatives
- Visually inspect and refine the selected specimens. The set of the specimens for all clusters forms a classifier
- Apply the classifier to the remaining trajectories, attaching each new trajectory to the closest specimens. The trajectories with no close specimen remain unclassified
- Repeat the whole process for the unclassified trajectories
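The classification step can be sketched as a nearest-specimen rule with a distance threshold. The threshold value, the function names and the simple "common ends" distance used here (mean of the start-to-start and end-to-end distances) are assumptions for illustration; the clustering and the visual refinement of specimens are not shown.

```python
from math import hypot

def common_ends_distance(a, b):
    """Toy distance: mean of the distances between start points and between end points."""
    d_start = hypot(a[0][0] - b[0][0], a[0][1] - b[0][1])
    d_end = hypot(a[-1][0] - b[-1][0], a[-1][1] - b[-1][1])
    return (d_start + d_end) / 2

def classify(trajectories, specimens, distance, threshold):
    """Attach each trajectory to the cluster of its closest specimen.

    specimens: dict cluster_id -> list of specimen trajectories.
    Trajectories with no specimen within `threshold` get the label None.
    """
    labels = {}
    for tid, traj in trajectories.items():
        best_cluster, best_d = None, float("inf")
        for cid, specs in specimens.items():
            for spec in specs:
                d = distance(traj, spec)
                if d < best_d:
                    best_cluster, best_d = cid, d
        labels[tid] = best_cluster if best_d <= threshold else None
    return labels

specimens = {"north": [[(0, 0), (0, 10)]], "east": [[(0, 0), (10, 0)]]}
trajectories = {
    "t1": [(0, 1), (0, 9)],
    "t2": [(1, 0), (5, 0), (9, 0)],
    "t3": [(50, 50), (60, 60)],
}
labels = classify(trajectories, specimens, common_ends_distance, threshold=3.0)
print(labels)
```

Trajectories left with the label None are the ones that would go into the next round of sampling and clustering.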
94Classification of the original DB: for each candidate trajectory, find the closest specimen and attach it to the corresponding cluster
Sampling
Find Specimens
The trajectories attached to the cluster after the classification
T-OPTICS
95Summary on Mobility Data Mining
- Data Analysis and Knowledge Discovery in MODs is here to stay
- It is the opportunity to discover, from the digital traces of human activity, the knowledge that lets us understand, timely and precisely, the way we live, the way we use our time and our land.
- Open issues
- Integrating the mining process into MODs
- Interactive, progressive KDD customized to users' needs
96A guided tour on GeoPKDD
- Mobility data management
- Acquiring and storing trajectories in MODs
- Location-aware querying
- Trajectory indexing
- Mobility data mining and the geographic KDD process
- Trajectory warehousing and OLAP
- Mobility data mining and reasoning
- Visual analytics for mobility data
- Privacy aspects on mobility data
- Preserving anonymity
- (Semantic enriched) Geographic KDD process
- Combining Mining and Querying
- Ontological framework for end-user querying and reasoning
- Outlook
97From opportunities to threats
- Personal mobility data, as gathered by the wireless networks, are extremely sensitive
- Their disclosure may represent a brutal violation of privacy protection rights, i.e., the right to keep confidential
- the places we visit
- the places we live or work at
- the people we meet
98The naïve scientist's view
- Knowing the exact identity of individuals is not needed for analytical purposes
- De-identified mobility data are enough to reconstruct aggregate movement behaviour, pertaining to groups of people.
- Reasoning coherent with European data protection laws: personal data, once made anonymous, are not subject to privacy law restrictions
- Is this reasoning correct?
99Unfortunately not!
- Making data (reasonably) anonymous is not easy.
- Sometimes, it is possible to reconstruct the exact identities from the de-identified data.
- Many famous examples of re-identification
- Dalenius
- Governor of Massachusetts' clinical records (Sweeney's experiment, 2001)
- AOL August 2006 crisis: user re-identified from search logs
- Two main sources of danger
- Many observations on the same anonymous subject
- Linking data, after joining separate datasets
100Spatio-temporal linkage in Mobility Data
[Figure: Id 34567 travels between locations A and B almost every day Mon-Fri, between 7:45 and 8:15 and again between 17:45 and 18:15]
- By intersecting the phone directories of locations A and B we find that only one individual lives in A and works in B.
- Id 34567 = Prof. Smith
- Then you discover that on Saturday night Id 34567 usually drives to the city's red-light district
101Preserving anonymity
- Anonymity-preserving mobility mining
102How do people (try to) stay anonymous?
- either by camouflage
- pretending to be someone else or somewhere else
- or by hiding in the crowd
- becoming indistinguishable among many others
103Concepts for Location Privacy
- Location Perturbation / Randomization
- The user location is represented with a fake value
- Privacy protection is achieved by the fact that the reported location is false
- The accuracy and the amount of privacy mainly depend on how far the reported location is from the exact location
104Concepts for Location Privacy
- Spatial Cloaking / Generalization
- The user's exact location is represented as a region that includes it
- An adversary knows that the user is located in the region, but has no clue where exactly
- The area of the region determines the trade-off between user privacy and accuracy
105Concepts for Location Privacy
- Spatio-temporal generalization
- In addition to the spatial dimension, also generalize the temporal dimension
[Figure: generalized region in the X, Y, T space]
106Concepts for Location Privacy
- k-anonymity
- The user's position is generalized to a region containing at least k users
- The user is indistinguishable among other k-1 users
- The area largely depends on the surrounding environment.
- A value of k = 100 may result in a very small area in downtown Hong Kong, or a very large area in the desert.
10-anonymity
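A minimal sketch of k-anonymous spatial cloaking, assuming all current user positions are known to a trusted anonymizer. It reports the smallest cell of a doubling grid that contains the user and at least k users in total; the grid scheme, the function name `cloak` and its parameters are illustrative assumptions, not a specific published algorithm.

```python
from math import floor

def cloak(user, all_users, k, base_cell=1.0, max_levels=30):
    """Return a grid cell (xmin, ymin, xmax, ymax) holding `user` and >= k users.

    all_users must include the user itself. The cell size doubles at each
    level until enough users fall in the same cell; None if that never happens.
    """
    size = base_cell
    for _ in range(max_levels):
        cx, cy = floor(user[0] / size), floor(user[1] / size)
        count = sum(1 for (x, y) in all_users
                    if floor(x / size) == cx and floor(y / size) == cy)
        if count >= k:
            return (cx * size, cy * size, (cx + 1) * size, (cy + 1) * size)
        size *= 2
    return None

me = (0.5, 0.5)
downtown = [me, (0.1, 0.1), (0.2, 0.3), (0.9, 0.8)]
desert = [me, (30.0, 2.0), (5.0, 50.0)]
print(cloak(me, downtown, k=3))
print(cloak(me, desert, k=3))
```

With the same k = 3, the dense setting is satisfied by a 1 x 1 cell while the sparse one needs a 64 x 64 cell, which mirrors the downtown versus desert contrast. Using grid-aligned cells rather than a box centred on the user avoids revealing the exact position as the centre of the region.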
107Trajectory anonymization
- Several variants developed in GeoPKDD
- Abul, Bonchi, Nanni (Pisa KDD LAB). Int. Conf. on Data Engineering, ICDE 2008
- Nergiz, Atzori, Saygin (Sabanci Univ. and Pisa KDD LAB). ACM GIS 2008
- Gkoulalas-Divanis, Verykios (Univ. Thessaly). 2007 (submitted)
- Monreale, Pensa, Pedreschi, Pinelli. PILBA 2008
- Yarovoy, Bonchi, Lakshmanan, Wang. EDBT 2009
- Common goal: construct an anonymized version of a trajectory dataset, preserving some target analytical properties
- Different techniques adopted
108Anonymity preserving mobility mining
- Never Walk Alone (Bonchi et al. 2008)
- Trade uncertainty for anonymity: trajectories that are close up to the uncertainty threshold are indistinguishable
- Combine k-anonymity and perturbation
- Two steps
- Cluster trajectories into groups of k similar ones (removing outliers)
- Perturb trajectories in a cluster so that each one is close to every other, up to the uncertainty threshold
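The two steps can be sketched as below. This is a simplified illustration, not the published Never Walk Alone algorithm: it assumes all trajectories are sampled at the same timestamps, uses a greedy grouping, and pulls points towards the per-timestamp centroid. The names `dist`, `cluster`, `perturb`, `max_radius` and `delta` are hypothetical.

```python
import math


def dist(a, b):
    """Largest pointwise distance between two time-aligned trajectories."""
    return max(math.hypot(p[0] - q[0], p[1] - q[1]) for p, q in zip(a, b))


def cluster(trajs, k, max_radius):
    """Step 1: greedily form groups of exactly k similar trajectories (k >= 2).

    Trajectories that cannot be grouped within max_radius are dropped as outliers.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    remaining = list(range(len(trajs)))
    clusters = []
    while len(remaining) >= k:
        pivot = remaining[0]
        nearest = sorted(remaining[1:], key=lambda i: dist(trajs[pivot], trajs[i]))[:k - 1]
        if dist(trajs[pivot], trajs[nearest[-1]]) <= max_radius:
            clusters.append([pivot] + nearest)
            remaining = [i for i in remaining if i != pivot and i not in nearest]
        else:
            remaining = remaining[1:]  # the pivot is treated as an outlier
    return clusters


def perturb(group, delta):
    """Step 2: move points so that each lies within delta / 2 of the centroid.

    By the triangle inequality, any two trajectories are then within delta.
    """
    out = [list(t) for t in group]
    for j in range(len(group[0])):
        cx = sum(t[j][0] for t in group) / len(group)
        cy = sum(t[j][1] for t in group) / len(group)
        for i, t in enumerate(group):
            dx, dy = t[j][0] - cx, t[j][1] - cy
            d = math.hypot(dx, dy)
            if d > delta / 2:
                scale = (delta / 2) / d
                out[i][j] = (cx + dx * scale, cy + dy * scale)
    return out


trajs = [
    [(0, 0), (1, 0), (2, 0)],
    [(0, 1), (1, 1), (2, 1)],
    [(0, 5), (1, 5), (2, 5)],
    [(0, 6), (1, 6), (2, 6)],
    [(50, 50), (51, 50), (52, 50)],  # far from everything else: an outlier
]
delta = 0.5
clusters = cluster(trajs, k=2, max_radius=2.0)
anonymized = perturb([trajs[i] for i in clusters[0]], delta)
```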
109Sample results (dataset: Oldenburg, synthetic)
(Figure panels: original data, anonymized data)
110Key open challenges
- Define an acceptable formal measure of anonymity protection
- Probability of re-identification (in a given context)
- A (technically supported) juridical issue!
- Sampling: a necessity and an opportunity!
- Necessary for performance/feasibility of data mining from massive mobility datasets
- Good for anonymity (re-identification probability decreases)
111Summary on Privacy Aspects
- Today, tracking is an everytime / everywhere process
- Therefore, privacy-preservation is a must!
- What is required
- Privacy-aware KDD process
- Much already in the literature
- Privacy-aware MOD management
- Not so much!
112A guided tour on GeoPKDD
- Mobility data management
- Acquiring and storing trajectories in MODs
- Location-aware querying
- Trajectory indexing
- Mobility data mining and the geographic KDD process
- Trajectory warehousing and OLAP
- Mobility data mining and reasoning
- Visual analytics for mobility data
- Privacy aspects on mobility data
- Preserving anonymity
- (Semantic enriched) Geographic KDD process
- Combining Mining and Querying
- Ontological framework for end-user querying and reasoning
- Outlook
113Incorporating semantics: a step towards the user
- Can data and patterns be re-combined and queried?
- Can data mining tasks be more accurate if data are semantically enriched?
- Can we deduce something new from data and patterns?
114Why a DMQL?
- Data, patterns/models and background knowledge need to be combined
- Find the patterns that involve trajectories crossing a polluted area during rush hours
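The example query above combines three ingredients: trajectories (data), patterns (models) and a polluted area plus rush hours (background knowledge). A rough sketch of that combination follows. The data layout (samples as `(x, y, hour)`, patterns as lists of supporting trajectory ids, the area as a bounding box) is an assumption made for illustration and is not how a DMQL would be implemented.

```python
def crosses(trajectory, area, hours):
    """True if any sample (x, y, hour) lies inside `area` during `hours`.

    Only sampled points are checked, not the segments between them.
    """
    min_x, min_y, max_x, max_y = area
    start, end = hours
    return any(
        min_x <= x <= max_x and min_y <= y <= max_y and start <= hour < end
        for x, y, hour in trajectory
    )


def patterns_involving(patterns, trajectories, area, hours):
    """Ids of patterns supported by at least one trajectory that crosses the area."""
    crossing = {tid for tid, traj in trajectories.items() if crosses(traj, area, hours)}
    return [pid for pid, support in patterns.items() if crossing & set(support)]


trajectories = {
    1: [(0, 0, 7), (5, 5, 8), (9, 9, 9)],   # inside the area at 8:00
    2: [(0, 0, 12), (5, 5, 13)],            # inside the area, but at 13:00
    3: [(20, 20, 8), (21, 21, 9)],          # rush hour, but elsewhere
}
patterns = {"A": [1, 2], "B": [2, 3], "C": [3]}
polluted_area = (4, 4, 6, 6)
rush_hours = (7, 10)
result = patterns_involving(patterns, trajectories, polluted_area, rush_hours)
```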
115A unifying framework: DEDALUS
The queries between data and/or models can be expressed in an object-relational language using the Hermes package and the TAS package. For example:
1. Select all TASs belonging to a certain trajectory (e.g. id = 3):
   SELECT Patterns.id
   FROM Patterns, Trajectories
   WHERE Trajectories.id = 3
     AND Patterns.TAS.f_memberships(Trajectories.trajectory)
2. Select all trajectories belonging to a TAS included in a polluted area:
   SELECT Trajectories.id
   FROM Patterns, Trajectories, Polluted_Areas
   WHERE Patterns.f_membership(Trajectories.trajectory)
     AND Polluted_Areas.geometry.include(Patterns.TAS.get_geometry())
(Figure: the Patterns table, labelled Id, Number, Object, Pattern_TAS)
116Building mobility data mining applications
- requires reasoning on a richer form of knowledge about mobility
- Geographic semantics
- Landmarks and interesting places
- Road network
- Landscape
-
- Movement semantics
- stops and moves
- Purposes of movement
- means of transportation
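To make "stops and moves" concrete, here is a hedged sketch that splits a trajectory into stop and move episodes. It assumes a simple rule (a stop is a run of samples that stays within `radius` of its first sample for at least `min_duration`); the rule, the function name and both parameters are illustrative, not the definition used in GeoPKDD.

```python
import math


def stops_and_moves(points, radius, min_duration):
    """Split [(x, y, t), ...] into ("stop" | "move", [points]) episodes."""
    episodes = []
    i, n = 0, len(points)
    while i < n:
        # Extend the run while samples stay within `radius` of the run's first sample.
        j = i
        while j + 1 < n and math.hypot(
            points[j + 1][0] - points[i][0], points[j + 1][1] - points[i][1]
        ) <= radius:
            j += 1
        if points[j][2] - points[i][2] >= min_duration:
            episodes.append(("stop", points[i:j + 1]))
            i = j + 1
        else:
            # Too short to be a stop: the sample belongs to a move.
            if episodes and episodes[-1][0] == "move":
                episodes[-1][1].append(points[i])
            else:
                episodes.append(("move", [points[i]]))
            i += 1
    return episodes


points = [
    (0, 0, 0), (0.1, 0, 60), (0, 0.1, 120), (0.1, 0.1, 600),  # lingering near (0, 0)
    (5, 5, 660),                                             # in transit
    (10, 10, 720), (10, 10.1, 780), (10.1, 10, 1500),        # lingering near (10, 10)
]
episodes = stops_and_moves(points, radius=0.5, min_duration=300)
kinds = [kind for kind, _ in episodes]
```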
117End user
(Figure labels: GSM network; Where should I go next?; Multimedia Geo; Mobility Database; Mobility models)
118Semantic Trajectory Data
- Physical Trajectory
- e.g. GPS recording over some period of time
- Semantic Trajectory
- places where a person stayed
- means of transportation
- combination of above elements for higher-level
description
(Figure: a trajectory annotated with the labels home, bus stop, way to work, work)
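The step from a physical to a semantic trajectory can be sketched as labelling each sample with a nearby known place and merging consecutive samples that share a label. The place list, the `radius` parameter and the fallback label "moving" are assumptions made for this example.

```python
import math


def semantic_trajectory(points, places, radius):
    """Collapse [(x, y, t), ...] into [(label, t_start, t_end), ...] episodes."""

    def label_of(x, y):
        for name, (px, py) in places.items():
            if math.hypot(x - px, y - py) <= radius:
                return name
        return "moving"

    episodes = []
    for x, y, t in points:
        label = label_of(x, y)
        if episodes and episodes[-1][0] == label:
            # Same place as the previous sample: extend the current episode.
            episodes[-1] = (label, episodes[-1][1], t)
        else:
            episodes.append((label, t, t))
    return episodes


places = {"home": (0, 0), "bus stop": (2, 0), "work": (10, 0)}
gps = [
    (0, 0, 0), (0.1, 0, 10), (1, 0, 20), (2, 0, 30),
    (2, 0.1, 40), (6, 0, 50), (10, 0, 60), (10, 0.1, 70),
]
semantic = semantic_trajectory(gps, places, radius=0.3)
labels = [label for label, _, _ in semantic]
```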
119Semantic (frequent) patterns
120How to enrich?
- An ontological framework enables a progressive
semantic enrichment of mobility data and patterns
121ATHENA: The ontological framework
How should a movement ontology be developed?
How should a data ontology be mapped onto a database?
(Figure labels: Query, Movement Ontology, Application Ontology, Data Ontology, patterns, trajectories, geography)
122Athena: Querying and Reasoning
(Figure labels: Data Ontology, Application Ontology, ONTOLOGY SYSTEM)
123A guided tour on GeoPKDD
- Mobility data management
- Acquiring and storing trajectories in MODs
- Location-aware querying
- Trajectory indexing
- Mobility data mining and the geographic KDD process
- Trajectory warehousing and OLAP
- Mobility data mining and reasoning
- Visual analytics for mobility data
- Privacy aspects on mobility data
- Preserving anonymity
- (Semantic enriched) Geographic KDD process
- Combining Mining and Querying
- Ontological framework for end-user querying and reasoning
- Outlook
124Outlook
- (Privacy-preserving) Mobility Data Acquisition, Querying, and Mining strive for a win-win situation
- Obtaining the advantages of collective mobility knowledge without inadvertently disclosing any individual mobility knowledge.
- A word of wisdom: solutions can only be obtained via an alliance of technology, legal regulations, and social norms (Rakesh Agrawal)
- GeoPKDD.eu is in the mix, shaping up the area of privacy-preserving (PP) mobility data mining
- Challenge: UbiComp will flood us with new complex data (in a decentralized setting)
- data miners have only begun to scratch the surface of this problem
125 trying to accomplish a long-time dream
126Acknowledgements
- We are grateful to all the GeoPKDD researchers, who made the project successful through their results and contributed actively to this tutorial
- They are too many to be listed here; their work has been cited throughout these notes
GeoPKDD is a project under the FP6 / FET Programme of the European Commission, FET-Open contract nr 014915 (Dec. 2005 - Mar. 2009)
127Selected literature on
- Mobility Data Modeling and MOD engines
- de Almeida, V.T. et al. (2006) Querying Moving Objects in SECONDO. Proceedings of MDM.
- Behr, T. and Güting, R.H. (2005) Fuzzy Spatial Objects: An Algebra Implementation in SECONDO. Proceedings of ICDE.
- Cao, H. and Wolfson, O. (2005) Nonmaterialized Motion Information in Transport Networks. Proceedings of ICDT.
- Chen, C.X. and Zaniolo, C. (2000) SQLST: A Spatio-Temporal Data Model and Query Language. Proceedings of ER.
- Cheng, R. et al. (2004) Efficient Indexing Methods for Probabilistic Threshold Queries over Uncertain Data. Proceedings of VLDB.
- Dieker, S. and Güting, R.H. (2000) Plug and Play with Query Algebras: SECONDO - A Generic DBMS Development Environment. Proceedings of IDEAS.
- Güting, R.H. et al. (2000) A Foundation for Representing and Querying Moving Objects. ACM Transactions on Database Systems, 25(1): 1-42.
- Güting, R.H. et al. (2006) Modeling and querying moving objects in networks. VLDB Journal, 15(2): 165-190.
- Karimi, H. and Liu, X. (2003) A Predictive Location Model for Location-Based Services. Proceedings of ACM-GIS.
- Mokbel, M.F. et al. (2004a) Continuous Query Processing of Spatio-temporal Data Streams in PLACE. Proceedings of SSDBM.
- Mokbel, M.F. et al. (2004b) PLACE: A Query Processor for Handling Real-time Spatio-temporal Data Streams. Proceedings of VLDB.
128Selected literature on
- Mobility Data Modeling and MOD engines (cont.)
- Mokhtar, H. and Su, J. (2005) A Query Language for Moving Object Trajectories. Proceedings of SSDBM.
- Patroumpas, K. and Sellis, T.K. (2004) Managing Trajectories of Moving Objects as Data Streams. Proceedings of STDBM.
- Pelekis, N. and Theodoridis, Y. (2007) An Oracle Data Cartridge for Moving Objects. Technical Report, TR-2007-04, University of Piraeus.
- Pelekis, N. et al. (2004) Literature Review of Spatio-temporal Database Models. Knowledge Engineering Review, 19(3): 235-274.
- Pelekis, N. et al. (2006) Hermes - A Framework for Location-Based Data Management. Proceedings of EDBT.
- Pelekis, N. et al. (2008) HERMES: aggregative LBS via a trajectory DB engine. Proceedings of ACM SIGMOD.
- Pfoser, D. and Jensen, C.S. (1999) Capturing the Uncertainty of Moving-Object Representations. Proceedings of SSD.
- Schlieder, C. et al. (2001) Location Modeling for Intentional Behavior in Spatial Partonomies. Proceedings of Location Modeling for Ubiquitous Computing Workshop.
- Sistla, P. et al. (1997) Modeling and Querying Moving Objects. Proceedings of ICDE.
- Trajcevski, G. et al. (2002) The geometry of uncertainty in moving objects databases. Proceedings of EDBT.
- Trajcevski, G. et al. (2004) Managing uncertainty in moving objects databases. ACM Transactions on Database Systems, 29(3): 463-507.
129Selected literature on
- Mobility Data Modeling and MOD engines (cont.)
- Wolfson, O. (2002) Moving Objects Information Management: The Database Challenge. Proceedings of NGITS.
- Wolfson, O. et al. (1998) Moving Objects Databases: Issues and Solutions. Proceedings of SSDBM.
- Wolfson, O. et al. (1999) Updating and Querying Databases that Track Mobile Units. Distributed and Parallel Databases, 7(3): 257-387.
130Selected literature on
- MOD Query Processing
- Benetis, R. et al. (2002) Nearest Neighbor and Reverse Nearest Neighbor Queries for Moving Objects. Proceedings of IDEAS.
- Frentzos, E. et al. (2005) Nearest Neighbor Search on Moving Object Trajectories. Proceedings of SSTD.
- Frentzos, E. et al. (2007) Index-based Most Similar Trajectory Search. Proceedings of ICDE.
- Gedik, B. and Liu, L. (2004) MobiEyes: Distributed Processing of Continuously Moving Queries on Moving Objects in a Mobile System. Proceedings of EDBT.
- Jensen, C.S. et al. (2003) Nearest Neighbor Queries in Road Networks. Proceedings of ACM-GIS.
- Lema, J.A.C. et al. (2003) Algorithms for Moving Objects Databases. The Computer Journal, 46(6): 680-712.
- Li, F. et al. (2005) On Trip Planning Queries in Spatial Databases. Proceedings of SSTD.
- Mokbel, M.F. et al. (2004) SINA: Scalable Incremental Processing of Continuous Queries in Spatio-temporal Databases. Proceedings of ACM SIGMOD.
- Papadias, D. et al. (2003) Query Processing in Spatial Network Databases. Proceedings of VLDB.
- Pelekis, N. et al. (2007) Similarity Search in Trajectory Databases. Proceedings of TIME.
- Pfoser, D. and Jensen, C.S. (2001) Querying the Trajectories of On-line Mobile Objects. Proceedings of MobiDE.
- Porkaew, K. et al. (2001) Querying Mobile Objects in Spatio-Temporal Databases. Proceedings of SSTD.
- Sankaranarayanan, J. et al. (2005) Efficient Query Processing on Spatial Networks. Proceedings of ACM-GIS.
- Seydim, A.V. et al. (2001) Location Dependent Query Processing. Proceedings of MobiDE.
- Tao, Y. et al. (2002) Continuous Nearest Neighbor Search. Proceedings of VLDB.
- Xia, T. and Zhang, D. (2006) Continuous Reverse Nearest Neighbor Monitoring. Proceedings of ICDE.
131Selected literature on
- MOD Indexing
- Cai, Y. and Ng, R.T. (2004) Indexing Spatio-Temporal Trajectories with Chebyshev Polynomials. Proceedings of ACM SIGMOD.
- Frentzos, E. (2003) Indexing Objects Moving on Fixed Networks. Proceedings of SSTD.
- Hadjieleftheriou, M. et al. (2006) Indexing Spatio-temporal Archives. VLDB Journal, 15(2): 143-164.
- Kollios, G. et al. (2001) Indexing Animated Objects Using Spatiotemporal Access Methods. IEEE Transactions on Knowledge and Data Engineering, 13(5): 758-777.
- Myllymaki, J. and Kaufman, J. (2003) High-Performance Spatial Indexing for Location-Based Services. Proceedings of WWW.
- Ni, J. and Ravishankar, C.V. (2007) Indexing Spatio-Temporal Trajectories with Efficient Polynomial Approximations. IEEE Transactions on Knowledge and Data Engineering, 19(5): 663-678.
- Pfoser, D. et al. (2000) Novel Approaches to the Indexing of Moving Object Trajectories. Proceedings of VLDB.
- Rasetic, S. et al. (2005) A Trajectory Splitting Model for Efficient Spatio-Temporal Indexing. Proceedings of VLDB.
- Saltenis, S. et al. (2000) Indexing the Positions of Continuously Moving Objects. Proceedings of ACM SIGMOD.
- Saltenis, S. and Jensen, C.S. (2002) Indexing of Moving Objects for Location-Based Services. Proceedings of ICDE.
- Tao, Y. and Papadias, D. (2001) MV3R-Tree: A Spatio-Temporal Access Method for Timestamp and Interval Queries. Proceedings of VLDB.
132Selected literature on
- Mobility Data Warehousing
- Han, J. et al. (1998) Selective Materialization: An Efficient Method for Spatial Data Cube Construction. Proceedings of PAKDD.
- Jensen, C.S. et al. (2004) Multidimensional data modeling for location-based services. The VLDB Journal, 13: 1-21.
- Leonardi, L. et al. (2009) Frequent Spatio-Temporal Patterns in Trajectory Data Warehouses. Proceedings of ACM SAC.
- Marketos, G. et al. (2008) Building Real World Trajectory Warehouses. Proceedings of MobiDE.
- Orlando, S. et al. (2007) Spatio-Temporal Aggregations in Trajectory Data Warehouses. Proceedings of DaWaK.
- Pelekis, N. et al. (2008) Towards Trajectory Data Warehouses. Chapter in Mobility, Data Mining and Privacy: Geographic Knowledge Discovery. Springer-Verlag, 2008.
- Shekhar, S. et al. (2001) Map Cube: a Visualization Tool for Spatial Data Warehouses. Chapter in Geographic Data Mining and Knowledge Discovery. Taylor and Francis.
- Tao, Y. et al. (2004) Spatio-Temporal Aggregation Using Sketches. Proceedings of ICDE.
133Selected literature on
- Mobility Pattern Querying and Mining
- Cao, H. et al. (2005) Mining frequent spatio-temporal sequential patterns. Proceedings of ICDM.
- Djafri, N. et al. (2002) Spatio-temporal evolution: querying patterns of change in databases. Proceedings of ACM-GIS.
- Giannotti, F. et al. (2006) Efficient Mining of Temporally Annotated Sequences. Proceedings of SDM.
- Giannotti, F. et al. (2007) Trajectory Pattern Mining. Proceedings of KDD.
- Hadjieleftheriou, M. et al. (2005) Complex Spatio-Temporal Pattern Queries. Proceedings of VLDB.
- Horvitz, E. et al. (2005) Prediction, expectation, and surprise: Methods, designs, and study of a deployed traffic forecasting service. Proceedings of Conference on Uncertainty in Artificial Intelligence.
- Kalnis, P. et al. (2005) On discovering moving clusters in spatio-temporal data. Proceedings of SSTD.
- van Kreveld, M. et al. (2007) Efficient Detection of Motion Patterns in Spatio-Temporal Data Sets. GeoInformatica, 11(2): 195-215.
- Laube, P. et al. (2005) Discovering relative motion patterns in groups of moving point objects. Int. Journal of Geographical Information Science, 19(6): 639-668.
- Li, X. et al. (2007) Traffic density-based discovery of hot routes in road networks. Proceedings of SSTD.
- Liu, Y. et al. (2006) A scalable distributed stream mining system for highway traffic data. Proceedings of PKDD.
- Mamoulis, N. et al. (2004) Mining, indexing, and querying historical spatiotemporal data. Proceedings of KDD.
- du Mouza, C. and Rigaux, P. (2005) Mobility Patterns. GeoInformatica, 9(4): 297-319.
134Selected literature on
- Mobility Pattern Querying and Mining (cont.)
- Nakata, T. and Takeuchi, J. (2004) Mining traffic data from probe-car system for travel time prediction. Proceedings of KDD.
- Qu, Y. et al. (2003) Supporting Movement Pattern
Queries in User-Specified Scales. IEEE
Transacti