Smarter Outlier Detection and Deeper Understanding of Large-Scale Taxi Trip Records: A Case Study of NYC

About This Presentation

Slides: 18

Provided by: zha117

Learn more at: https://www.cs.uic.edu

Transcript and Presenter's Notes

1
Smarter Outlier Detection and Deeper Understanding of Large-Scale Taxi Trip Records: A Case Study of NYC

Jianting Zhang
Department of Computer Science
The City College of New York

2
Outline

Introduction
Background and Related Work
Method and Discussions
Experiments and Results
Summary

3
Introduction

Taxi trip records
300 million trips in about two years
170 million trips (300 million passengers) in 2009
About 1/5 of subway ridership and 1/3 of bus ridership in NYC
The dataset is not perfect...

13,000 Medallion taxi cabs
License priced at $600,000 in 2007
Car services and taxi services are separate
Only taxis with a Medallion license may be hailed on the street (the rule may be changing outside Manhattan...)

4
Introduction
Messed up on purpose due to privacy concerns
5
Introduction

In addition
Some of the data fields are empty
Pickup and drop-off locations can be in the Hudson River
The recorded trip distance/duration can be unreasonable
...

Outlier detection is needed for data cleaning
Handling 170 million trips becomes easier with the help of U2SOD-DB
6
Background and Related Work

Existing approaches to outlier detection in urban computing
Thresholding, e.g., 200 m < dist < 30 km
Locating values in unusual ranges of distributions
Spatial analysis within a region or a land use type
Matching trajectories with road segments and treating unmatched ones as outliers
Some techniques require complete GPS traces, while we only have O-D (origin-destination) locations
Large-scale shortest path computing has not been used for outlier detection
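
The thresholding approach listed above can be illustrated with a minimal Python sketch. The bounds come from the example on this slide; the function name and the sample distances are hypothetical and only serve the illustration.

```python
# Bounds taken from the slide's example (200 m < dist < 30 km).
MIN_DIST_M = 200.0
MAX_DIST_M = 30_000.0


def is_threshold_outlier(dist_m):
    """Return True when a trip distance falls outside the open range (200 m, 30 km)."""
    return not (MIN_DIST_M < dist_m < MAX_DIST_M)


# Made-up trip distances in metres, for illustration only.
trips_m = [150.0, 1_200.0, 45_000.0, 8_500.0]
flags = [is_threshold_outlier(d) for d in trips_m]
```

A fixed range like this is cheap to apply but ignores where the trip happened, which is the gap the network-based method in this talk tries to address.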

7
Background and Related Work

Shortest path computation
Dijkstra and A*
New generation algorithms
Contraction Hierarchy (CH) based

Open source implementations of CH
MoNav
OSRM
Much faster than the ArcGIS Network Analyst (NA) module
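
For reference, a minimal sketch of the classic Dijkstra algorithm named on this slide, written in plain Python. It is not a Contraction Hierarchy implementation and does not use MoNav or OSRM; the adjacency-list format and the toy graph are assumptions made for the example.

```python
import heapq


def dijkstra(graph, source, target):
    """Return (distance, path) for a shortest path, or (inf, []) if unreachable.

    graph maps a node to a list of (neighbor, non-negative edge length) pairs.
    """
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue  # stale heap entry
        done.add(u)
        if u == target:
            break
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if target not in dist:
        return float("inf"), []
    # Walk the predecessor links back from the target to rebuild the path.
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    path.reverse()
    return dist[target], path


# Hypothetical toy street network with edge lengths in arbitrary units.
toy = {
    "A": [("B", 1.0), ("C", 4.0)],
    "B": [("C", 2.0), ("D", 5.0)],
    "C": [("D", 1.0)],
    "D": [],
}
d, p = dijkstra(toy, "A", "D")
```

CH-based engines answer the same query after a preprocessing step, which is what makes them practical for millions of origin-destination pairs.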

8
Background and Related Work

Network Centrality (Brandes, 2008)

Can be easily derived after shortest paths are computed

Mapping node/edge betweenness centrality can reveal the connection strengths among different parts of cities
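
One way to read "easily derived after shortest paths are computed" is to count, for each street edge, how many trips have a computed shortest path that traverses it. The sketch below does exactly that. It is a trip-weighted tally and an assumption about the aggregation, not the general betweenness definition over all node pairs; the function name and sample paths are hypothetical.

```python
from collections import Counter


def accumulate_edge_counts(paths_with_trips):
    """Sum trip counts over every directed edge that appears on a path.

    paths_with_trips is an iterable of (node_list, number_of_trips) pairs.
    """
    counts = Counter()
    for path, n_trips in paths_with_trips:
        for u, v in zip(path, path[1:]):
            counts[(u, v)] += n_trips
    return counts


# Made-up shortest paths, each shared by some number of trips.
paths = [
    (["A", "B", "C", "D"], 3),
    (["B", "C"], 2),
    (["A", "B"], 1),
]
edge_counts = accumulate_edge_counts(paths)
```

Edges with high counts are the ones that carry the most computed trips, which is what a betweenness map would highlight.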

9
Method and Discussions
Raw taxi trip data
Match pickup/drop-off point locations to street segments within distance D0
Successful? If not: Type I outlier (spatial analysis)
If so: assign pickup/drop-off nodes by picking the closer ones
Compute shortest path
CD > D1 AND CD > W × RD? If so: Type II outlier (network analysis)
If not: update centrality measurements
CD: computed shortest distance; RD: recorded trip distance
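
The two decision points in this flow can be sketched in Python as follows. This is a reading of the flowchart rather than the author's code: the snapping and shortest-path steps are assumed to have run already, the function name and return labels are hypothetical, and the default values for D0, D1 and W are borrowed from the experiments slide.

```python
def classify_trip(pickup_snap_ft, dropoff_snap_ft, computed_mi, recorded_mi,
                  d0_ft=200.0, d1_mi=3.0, w=2.0):
    """Label a trip as 'type I', 'type II' or 'ok'.

    pickup_snap_ft / dropoff_snap_ft: distance from each point to its nearest
    street segment (assumed precomputed by a spatial join).
    computed_mi: computed shortest distance (CD).
    recorded_mi: recorded trip distance (RD).
    """
    # Type I: a pickup or drop-off point cannot be matched within D0.
    if pickup_snap_ft > d0_ft or dropoff_snap_ft > d0_ft:
        return "type I"
    # Type II: CD > D1 AND CD > W * RD.
    if computed_mi > d1_mi and computed_mi > w * recorded_mi:
        return "type II"
    # Otherwise the trip would contribute to the centrality measurements.
    return "ok"


labels = [
    classify_trip(350.0, 20.0, 5.0, 5.0),  # pickup too far from any street
    classify_trip(30.0, 40.0, 6.0, 1.0),   # shortest path far exceeds recorded distance
    classify_trip(30.0, 40.0, 2.5, 1.0),   # below D1, so not flagged
    classify_trip(30.0, 40.0, 6.0, 5.5),   # recorded distance is plausible
]
```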
10
Method and Discussions

The approach is approximate in nature
Taxi drivers do not always follow the shortest path
Especially for short trips and in heavily congested areas
But we only care about aggregated centrality measurements, and the errors have a chance to cancel each other out
Increasing D0 will reduce the number of type I outliers, but the locations might be mismatched with segments
Reducing D1 and/or W will increase the number of type II outliers but may generate false positives.
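
The sensitivity to W can be seen on a handful of made-up trips. The sketch below counts type II outliers under the rule CD > D1 AND CD > W × RD for three values of W; the helper name and the sample (CD, RD) pairs are hypothetical.

```python
def count_type_ii(trips, d1_mi, w):
    """Count trips whose computed distance CD exceeds both D1 and W * RD."""
    return sum(1 for cd, rd in trips if cd > d1_mi and cd > w * rd)


# Made-up (CD, RD) pairs in miles, for illustration only.
sample_trips = [(6.0, 1.0), (4.0, 1.8), (3.5, 3.0), (2.0, 0.5), (8.0, 7.5)]

# Lowering W flags more trips as type II outliers.
counts_by_w = {w: count_type_ii(sample_trips, 3.0, w) for w in (3.0, 2.0, 1.1)}
```

With W close to 1, a trip such as (3.5, 3.0) gets flagged even though a recorded distance slightly shorter than the computed one may be ordinary measurement noise, which is the false-positive risk noted above.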

11
Experiments and Results
Overall distributions of trip distance, time, speed and fare
12
Experiments and Results

D0 = 200 feet, D1 = 3 miles, W = 2
166 million trips, 25 million unique
2.5 million (1.5%) type I outliers
18,000 type II outliers
Shortest path computation completes in less than 2 hours (5,952 seconds) on a single CPU core (2.26 GHz)

Mapping of Computed Shortest Paths Overlaid with NYC Community Districts Map
13
Experiments and Results
Examples of Detected Type II Outliers
14
Experiments and Results
Mapping Betweenness Centralities (All hours)
15
Experiments and Results
16
Summary

Large-scale taxi trip records are error-prone due to a combination of device-, human- and information-system-induced errors; outlier detection and data cleaning are important preprocessing steps.
Our approach detects outliers that cannot be snapped to street segments (through spatial analysis) and/or have significant differences between computed shortest distances and recorded trip distances (through network analysis).
The work is preliminary; a more comprehensive framework is needed (e.g., incorporating pickup and drop-off times, trip duration and fare information).
It would be interesting to generate dynamic betweenness maps under different traffic conditions, e.g., peak/non-peak, morning/afternoon and weekdays/weekends, and to explore connection strengths among NYC regions.