Smarter Outlier Detection and Deeper Understanding of Large-Scale Taxi Trip Records: A Case Study of NYC

Slides: 18
Provided by: zha117
Learn more at: https://www.cs.uic.edu
1
Smarter Outlier Detection and Deeper
Understanding of Large-Scale Taxi Trip Records: A
Case Study of NYC
  • Jianting Zhang
  • Department of Computer Science
  • The City College of New York

2
Outline
  • Introduction
  • Background and Related Work
  • Method and Discussions
  • Experiments and Results
  • Summary

3
Introduction
  • Taxi trip records
  • 300 million trips in about two years
  • 170 million trips (300 million passengers) in
    2009
  • 1/5 of the subway ridership and 1/3 of the bus
    ridership in NYC
  • The dataset is not perfect...
  • 13,000 Medallion taxi cabs
  • License priced at $600,000 in 2007
  • Car services and taxi services are separate
  • Only taxis with a Medallion license may be hailed
    (the rule may be changing outside Manhattan...)

4
Introduction
Messed up on purpose due to privacy concerns
5
Introduction
  • In addition
  • Some of the data fields are empty
  • Pickup and drop-off locations can be in Hudson
    River
  • The recorded trip distance/duration can be
    unreasonable
  • ...

Outlier detection is needed for data cleaning
Handling 170 million trips becomes easier
with the help of U2SOD-DB
6
Background and Related Work
  • Existing approaches to outlier detection in
    urban computing
  • Thresholding, e.g., 200 m < dist < 30 km
  • Locating values in unusual ranges of distributions
  • Spatial analysis within a region or a land use
    type
  • Matching trajectories with road segments and
    treating unmatched ones as outliers
  • Some techniques require complete GPS traces, while
    we only have O-D locations
  • Large-scale shortest path computation has not been
    used for outlier detection
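
The thresholding approach listed above can be sketched in a few lines. This is only an illustration: the bounds come from the slide's example (200 m < dist < 30 km), while the Trip record, its field names and the sample values are hypothetical.

```python
from dataclasses import dataclass

# Bounds taken from the slide's example: 200 m < dist < 30 km.
MIN_DIST_M = 200.0
MAX_DIST_M = 30_000.0


@dataclass
class Trip:
    trip_id: int        # hypothetical identifier
    distance_m: float   # recorded trip distance in meters


def is_threshold_outlier(trip, lo=MIN_DIST_M, hi=MAX_DIST_M):
    """Flag a trip whose recorded distance is outside the open interval (lo, hi)."""
    return not (lo < trip.distance_m < hi)


# Made-up sample records, for illustration only.
trips = [Trip(1, 150.0), Trip(2, 4_200.0), Trip(3, 45_000.0)]
outlier_ids = [t.trip_id for t in trips if is_threshold_outlier(t)]
```

A fixed threshold is cheap to apply but knows nothing about the street network, which is the gap the shortest-path comparison in this work is meant to fill.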

7
Background and Related Work
  • Shortest path computation
  • Dijkstra and A*
  • New generation algorithms
  • Contraction Hierarchy (CH) based
  • Open source implementations of CH
  • MoNav
  • OSRM
  • Much faster than the ArcGIS Network Analyst (NA) module
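
For reference, a minimal sketch of the classical Dijkstra baseline named above. It is not the Contraction Hierarchy technique used by MoNav or OSRM, which adds a preprocessing step to answer queries much faster. The adjacency-list layout and the toy graph are assumptions made for illustration.

```python
import heapq


def dijkstra(graph, source, target):
    """Return (distance, path) from source to target, or (inf, []) if unreachable.

    graph: dict mapping node -> list of (neighbor, edge_length) pairs,
    with non-negative edge lengths.
    """
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue  # stale heap entry
        done.add(u)
        if u == target:
            break
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if target not in done:
        return float("inf"), []
    # Walk the predecessor links back from target to source.
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    path.reverse()
    return dist[target], path


# Toy directed graph, for illustration only.
toy = {
    "A": [("B", 1.0), ("C", 4.0)],
    "B": [("C", 1.5), ("D", 5.0)],
    "C": [("D", 1.0)],
    "D": [],
}
length, route = dijkstra(toy, "A", "D")
```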

8
Background and Related Work
  • Network Centrality (Brandes, 2008)
  • Can be easily derived after shortest paths are
    computed
  • Mapping node/edge betweenness centrality can reveal
    the connection strengths among different parts of
    cities
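
One way to read "easily derived after shortest paths are computed" is to count how many trips pass over each edge of their computed shortest paths. The sketch below does that. It is a trip-weighted usage count under that assumption, not Brandes' all-pairs betweenness algorithm, and the function name and sample paths are hypothetical.

```python
from collections import Counter


def accumulate_edge_usage(paths_with_counts):
    """Count how many trips traverse each directed edge.

    paths_with_counts: iterable of (node_path, trip_count), where node_path
    is the list of nodes on one computed shortest path and trip_count is the
    number of trips that share that path.
    """
    usage = Counter()
    for path, n_trips in paths_with_counts:
        for u, v in zip(path, path[1:]):
            usage[(u, v)] += n_trips
    return usage


# Made-up paths, for illustration only.
paths = [
    (["A", "B", "C", "D"], 3),
    (["B", "C"], 2),
    (["D", "C"], 1),
]
usage = accumulate_edge_usage(paths)
```

Edges are kept directed here because one-way streets make (u, v) and (v, u) different segments. Node counts can be accumulated the same way by iterating over the interior nodes of each path.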

9
Method and Discussions
Raw taxi trip data
  • Match pickup/drop-off point locations to street
    segments within distance D0
  • Successful? If not: Type I outlier (spatial
    analysis)
  • If so: assign pickup/drop-off nodes by picking the
    closer ones
  • Compute shortest path
  • CD > D1 AND CD > W × RD? If so: Type II outlier
    (network analysis)
  • Otherwise: update centrality measurements
  • CD: computed shortest distance; RD: recorded trip
    distance
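
The flow above can be sketched as follows. This is a simplified illustration and not the presented system: coordinates are assumed to be planar (in feet), snapping is a brute-force scan over all segments, and the shortest-distance routine is passed in as a callable. All function and parameter names are hypothetical.

```python
import math


def point_segment_distance(p, a, b):
    """Planar distance from point p to segment a-b (all (x, y) tuples)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp to its endpoints.
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg_len_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def snap_to_node(p, segments, coords, d0):
    """Return the closer endpoint of the nearest segment within d0, else None."""
    best = None
    for u, v in segments:
        d = point_segment_distance(p, coords[u], coords[v])
        if d <= d0 and (best is None or d < best[0]):
            best = (d, u, v)
    if best is None:
        return None
    _, u, v = best
    return min((u, v), key=lambda n: math.dist(p, coords[n]))


def classify_trip(pickup, dropoff, rd, segments, coords, shortest_dist, d0, d1, w):
    """Return 'type_I', 'type_II' or 'ok' for one trip (rd = recorded distance)."""
    o = snap_to_node(pickup, segments, coords, d0)
    d = snap_to_node(dropoff, segments, coords, d0)
    if o is None or d is None:
        return "type_I"            # could not be matched to a street segment
    cd = shortest_dist(o, d)       # computed shortest distance
    if cd > d1 and cd > w * rd:
        return "type_II"           # computed distance far exceeds recorded one
    return "ok"


# Toy network on a straight line, units in feet (illustration only).
coords = {"A": (0.0, 0.0), "B": (1000.0, 0.0), "C": (20000.0, 0.0)}
segments = [("A", "B"), ("B", "C")]
line_dist = lambda o, d: abs(coords[o][0] - coords[d][0])
params = dict(segments=segments, coords=coords, shortest_dist=line_dist,
              d0=200.0, d1=3 * 5280.0, w=2.0)

r1 = classify_trip((10.0, 50.0), (19990.0, 30.0), 5000.0, **params)
r2 = classify_trip((10.0, 50.0), (19990.0, 30.0), 19500.0, **params)
r3 = classify_trip((500.0, 900.0), (19990.0, 30.0), 19500.0, **params)
```

At the scale of 170 million trips, the linear scan in snap_to_node would need to be replaced by a spatial index. It is kept here only to make the matching rule explicit.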
10
Method and Discussions
  • The approach is approximate in nature
  • Taxi drivers do not always follow shortest path
  • Especially for short trips and heavily congested
    areas
  • But we only care about aggregated centrality
    measurements and the errors have a chance to be
    cancelled out by each other
  • Increasing D0 will reduce the number of type I
    outliers, but the locations might be mismatched
    with segments
  • Reducing D1 and/or W will increase the number of
    type II outliers but may generate false positives.
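
The effect of D1 and W can be seen on a small synthetic example. The (computed, recorded) distance pairs below are made up for illustration and are not taken from the NYC data. The rule is the one stated in the method: a trip is a type II outlier when CD > D1 and CD > W × RD.

```python
def count_type_ii(trips, d1, w):
    """Count trips with CD > d1 and CD > w * RD.

    trips: iterable of (computed_dist, recorded_dist) pairs in the same unit.
    """
    return sum(1 for cd, rd in trips if cd > d1 and cd > w * rd)


# Synthetic (CD, RD) pairs in miles, for illustration only.
trips = [(4.0, 1.0), (3.5, 1.5), (2.5, 0.5), (6.0, 5.5), (10.0, 4.0)]

# Sweep both parameters and record the number of flagged trips.
sweep = {
    (d1, w): count_type_ii(trips, d1, w)
    for d1 in (2.0, 3.0)
    for w in (2.0, 3.0)
}
```

Lowering either parameter can only flag more trips, never fewer, so the count grows monotonically as the thresholds are relaxed. Whether the extra trips are real errors is what the false-positive caveat is about.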

11
Experiments and Results
Overall distributions of trip distance, time,
speed and fare
12
Experiments and Results
  • D0 = 200 feet, D1 = 3 miles, W = 2
  • 166 million trips, 25 million unique
  • 2.5 million (1.5%) type I outliers
  • 18,000 type II outliers
  • Shortest path computation completes in less than
    2 hours (5,952 seconds) on a single CPU core
    (2.26 GHz)
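
The reported figures can be cross-checked with a few lines of arithmetic. One assumption is made here and labeled in the code: the slide does not say what the "25 million unique" items are, so the throughput estimate treats them as the distinct origin-destination pairs for which a shortest path was computed.

```python
# Figures as reported on the slide.
total_trips = 166_000_000
type_i_outliers = 2_500_000
type_ii_outliers = 18_000
runtime_s = 5_952

# ASSUMPTION: "25 million unique" is read as distinct O-D pairs, i.e. the
# number of shortest-path queries actually run. The slide does not say so.
unique_od_pairs = 25_000_000

type_i_share = type_i_outliers / total_trips      # fraction of all trips
type_ii_share = type_ii_outliers / total_trips    # fraction of all trips
runtime_h = runtime_s / 3600                      # seconds -> hours
paths_per_second = unique_od_pairs / runtime_s    # valid only under the assumption

print(f"type I share:  {type_i_share:.2%}")
print(f"type II share: {type_ii_share:.4%}")
print(f"runtime:       {runtime_h:.2f} h")
print(f"throughput:    {paths_per_second:.0f} paths/s (assumed)")
```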

Mapping of Computed Shortest Paths Overlaid with
NYC Community Districts Map
13
Experiments and Results
Examples of Detected Type II Outliers
14
Experiments and Results
Mapping Betweenness Centralities (All hours)
15
Experiments and Results
16
Summary
  • Large-scale taxi trip records are error-prone due
    to a combination of device-, human- and information
    system-induced errors; outlier detection and
    data cleaning are important preprocessing steps.
  • Our approach detects outliers that cannot be
    snapped to street segments (through spatial
    analysis) and/or have significant differences
    between computed shortest distances and recorded
    trip distances (through network analysis).
  • The work is preliminary - a more comprehensive
    framework is needed (e.g., incorporating pickup
    and drop-off times, trip duration and fare
    information)
  • It would be interesting to generate dynamics of
    betweenness maps at different traffic conditions,
    e.g., peak/non-peak, morning/afternoon and
    weekdays/weekends, and explore connection
    strengths among NYC regions.

17
Q&A
jzhang_at_cs.ccny.cuny.edu