smarter outlier detection and deeper understanding of large-scale taxi trip records: a case study of...
TRANSCRIPT
![Page 1: Smarter Outlier Detection and Deeper Understanding of Large-Scale Taxi Trip Records: A Case Study of NYC Jianting Zhang Department of Computer Science](https://reader037.vdocuments.us/reader037/viewer/2022103022/56649cf75503460f949c6c30/html5/thumbnails/1.jpg)
Smarter Outlier Detection and Deeper Understanding of Large-Scale Taxi Trip
Records: A Case Study of NYC
Jianting Zhang
Department of Computer Science
The City College of New York
![Page 2: Smarter Outlier Detection and Deeper Understanding of Large-Scale Taxi Trip Records: A Case Study of NYC Jianting Zhang Department of Computer Science](https://reader037.vdocuments.us/reader037/viewer/2022103022/56649cf75503460f949c6c30/html5/thumbnails/2.jpg)
Outline
•Introduction
•Background and Related Work
•Method and Discussions
•Experiments and Results
•Summary
![Page 3: Smarter Outlier Detection and Deeper Understanding of Large-Scale Taxi Trip Records: A Case Study of NYC Jianting Zhang Department of Computer Science](https://reader037.vdocuments.us/reader037/viewer/2022103022/56649cf75503460f949c6c30/html5/thumbnails/3.jpg)
Introduction
3
Taxi trip records•~300 million trips in about two years•~170 million trips (300 million passengers) in 2009•1/5 of that of subway riders and 1/3 of that of bus riders in NYC •The dataset is not perfect...
•13,000 Medallion taxi cabs
•License priced at $600, 000 in 2007
•Car services and taxi services are separate
•Only taxis with Medallion license are for hail (the rule could be under changing outside Manhattan...)
![Page 4: Smarter Outlier Detection and Deeper Understanding of Large-Scale Taxi Trip Records: A Case Study of NYC Jianting Zhang Department of Computer Science](https://reader037.vdocuments.us/reader037/viewer/2022103022/56649cf75503460f949c6c30/html5/thumbnails/4.jpg)
Introduction
Medallion#Shift#Trip#
Trip_Pickup_DateTimeTrip_Dropoff_DateTime
Trip_Pickup_LocationTrip_Dropoff_Location
Start_LonStart_LatEnd_LonEnd_Lat
Payment_TypeSurchargeTotal_AmtRate_Code
Passenger_CountFare_AmtTolls_AmtTip_AmtTrip_Time
Trip_Distance
vendor_namedate_loadedstore_and_forward
time_between_servicedistance_between_service
Start_Zip_CodeEnd_Zip_Code
start_xstart_yend_xend_y (local projection)
1
2
3
4
5
6
78 9
1110
Meshed up on purpose due to privacy concerns
![Page 5: Smarter Outlier Detection and Deeper Understanding of Large-Scale Taxi Trip Records: A Case Study of NYC Jianting Zhang Department of Computer Science](https://reader037.vdocuments.us/reader037/viewer/2022103022/56649cf75503460f949c6c30/html5/thumbnails/5.jpg)
Introduction• In addition:
– Some of the data fields are empty– Pickup and drop-off locations can
be in Hudson River – The recorded trip distance/duration
can be unreasonable– ...
Outlier detections for data cleaning are needed
Mission can be easier to handle 170 million trips with the help of U2SOD-DB
![Page 6: Smarter Outlier Detection and Deeper Understanding of Large-Scale Taxi Trip Records: A Case Study of NYC Jianting Zhang Department of Computer Science](https://reader037.vdocuments.us/reader037/viewer/2022103022/56649cf75503460f949c6c30/html5/thumbnails/6.jpg)
Background and Related Work
•Existing approaches for outlier detection for urban computing
•Thresholding: e.g. 200m < dist < 30km
•Locating in unusual ranges of distributions
•Spatial analysis: within a region or a land use type
•Matching trajectory with road segments – treat unmatched ones as outliers
•Some techniques require complete GPS traces while we only have O-D locations
•Large-scale Shortest path computing has not been used for outlier detection
![Page 7: Smarter Outlier Detection and Deeper Understanding of Large-Scale Taxi Trip Records: A Case Study of NYC Jianting Zhang Department of Computer Science](https://reader037.vdocuments.us/reader037/viewer/2022103022/56649cf75503460f949c6c30/html5/thumbnails/7.jpg)
Background and Related Work• Shortest path computation
– Dijkstra and A* – New generation algorithms– Contraction Hierarchy (CH) based
•Open source implementations of CH:
•MoNav
•OSRM
•Much faster than ArcGIS NA module
![Page 8: Smarter Outlier Detection and Deeper Understanding of Large-Scale Taxi Trip Records: A Case Study of NYC Jianting Zhang Department of Computer Science](https://reader037.vdocuments.us/reader037/viewer/2022103022/56649cf75503460f949c6c30/html5/thumbnails/8.jpg)
Background and Related Work
• Network Centrality (Brandes, 2008)
Vtvs st
stB
vvC
)(
)(Node based
Edge based
Vtvs st
stB
eeC
)(
)(
•Can be easily derived after shortest paths are computed
•Mapping node/edge between centrality can reveal the connection strengths among different parts of cities
![Page 9: Smarter Outlier Detection and Deeper Understanding of Large-Scale Taxi Trip Records: A Case Study of NYC Jianting Zhang Department of Computer Science](https://reader037.vdocuments.us/reader037/viewer/2022103022/56649cf75503460f949c6c30/html5/thumbnails/9.jpg)
Method and DiscussionsRaw Taxi trip data
Match pickup/drop-off point locations to street
segments within Distance D0
CD >D1 AND CD>W*RD?
Assign pickup/drop-off nodes by picking closer ones
Type I outlier (spatial analysis)
Compute shortest path
CD: Compute shortest distanceRD: Recorded trip distance
Type II outlier (network analysis)
Successful?
Aggregate unique (sid,tid) pairs
Update centrality measurements
![Page 10: Smarter Outlier Detection and Deeper Understanding of Large-Scale Taxi Trip Records: A Case Study of NYC Jianting Zhang Department of Computer Science](https://reader037.vdocuments.us/reader037/viewer/2022103022/56649cf75503460f949c6c30/html5/thumbnails/10.jpg)
Method and Discussions
• The approach is approximate in nature– Taxi drivers do not always follow shortest path– Especially for short trips and heavily congested areas – But we only care about aggregated centrality
measurements and the errors have a chance to be cancelled out by each other
• Increasing D0 will reduce # of type I outliners, but the locations might be mismatched with segments
• Reducing D1 and/or W will increase # of type II outliers but may generate false positives.
![Page 11: Smarter Outlier Detection and Deeper Understanding of Large-Scale Taxi Trip Records: A Case Study of NYC Jianting Zhang Department of Computer Science](https://reader037.vdocuments.us/reader037/viewer/2022103022/56649cf75503460f949c6c30/html5/thumbnails/11.jpg)
Experiments and Results
Count-Distance Distribution
0
5000000
10000000
15000000
20000000
<= 0
.0
( 0
.8,
1.0]
( 1
.8,
2.0]
( 2
.8,
3.0]
( 3
.8,
4.0]
( 4
.8,
5.0]
( 5
.8,
6.0]
( 6
.8,
7.0]
( 7
.8,
8.0]
( 8
.8,
9.0]
( 9
.8, 1
0.0]
( 10
.8, 1
1.0]
( 11
.8, 1
2.0]
( 12
.8, 1
3.0]
( 13
.8, 1
4.0]
( 14
.8, 1
5.0]
( 15
.8, 1
6.0]
( 16
.8, 1
7.0]
( 17
.8, 1
8.0]
( 18
.8, 1
9.0]
( 19
.8, 2
0.0]
Trip Distance (mile)
Cou
nt
Count-Time Distribution
0
5000000
10000000
15000000
20000000
<=
0
.0
( 2
.0,
3.0
]
( 5
.0,
6.0
]
( 8
.0,
9.0
]
( 1
1.0
, 1
2.0
]
( 1
4.0
, 1
5.0
]
( 1
7.0
, 1
8.0
]
( 2
0.0
, 2
1.0
]
( 2
3.0
, 2
4.0
]
( 2
6.0
, 2
7.0
]
( 2
9.0
, 3
0.0
]
( 3
2.0
, 3
3.0
]
( 3
5.0
, 3
6.0
]
( 3
8.0
, 3
9.0
]
( 4
1.0
, 4
2.0
]
( 4
4.0
, 4
5.0
]
( 4
7.0
, 4
8.0
]
> 5
0.0
TripTime (Minute)
Co
un
t
Count-Speed Distribution
0
5000000
10000000
15000000
20000000
<= 0
.0( 1
.0, 2
.0]( 3
.0, 4
.0]( 5
.0, 6
.0]( 7
.0, 8
.0]( 9
.0, 10
.0]( 1
1.0, 1
2.0]
( 13.0
, 14.0
]( 1
5.0, 1
6.0]
( 17.0
, 18.0
]( 1
9.0, 2
0.0]
( 21.0
, 22.0
]( 2
3.0, 2
4.0]
( 25.0
, 26.0
]( 2
7.0, 2
8.0]
( 29.0
, 30.0
]( 3
1.0, 3
2.0]
( 33.0
, 34.0
]( 3
5.0, 3
6.0]
( 37.0
, 38.0
]( 3
9.0, 4
0.0]
( 41.0
, 42.0
]( 4
3.0, 4
4.0]
( 45.0
, 46.0
]( 4
7.0, 4
8.0]
( 49.0
, 50.0
]
Speed (MPH)
Coun
t
Count-Fare Distribution
05000000
1000000015000000200000002500000030000000
<= 0
.0(
1.0,
2.0
](
3.0,
4.0
](
5.0,
6.0
](
7.0,
8.0
](
9.0,
10.
0]( 1
1.0,
12.
0]( 1
3.0,
14.
0]( 1
5.0,
16.
0]( 1
7.0,
18.
0]( 1
9.0,
20.
0]( 2
1.0,
22.
0]( 2
3.0,
24.
0]( 2
5.0,
26.
0]( 2
7.0,
28.
0]( 2
9.0,
30.
0]( 3
1.0,
32.
0]( 3
3.0,
34.
0]( 3
5.0,
36.
0]( 3
7.0,
38.
0]( 3
9.0,
40.
0]( 4
1.0,
42.
0]( 4
3.0,
44.
0]( 4
5.0,
46.
0]( 4
7.0,
48.
0]( 4
9.0,
50.
0]
Fare ($)
Coun
t
Over all distributions of trip distance, time, speed and fare
![Page 12: Smarter Outlier Detection and Deeper Understanding of Large-Scale Taxi Trip Records: A Case Study of NYC Jianting Zhang Department of Computer Science](https://reader037.vdocuments.us/reader037/viewer/2022103022/56649cf75503460f949c6c30/html5/thumbnails/12.jpg)
Experiments and Results
Mapping of Computed Shortest Paths Overlaid with NYC Community Districts Map
•D0=200 feet, D1=3 miles, W=2
•166 million trips, 25 million unique
•~2.5 millions (1.5%) type I outliers
•~ 18,000 type II outliers
•Shortest path computation completes in less than 2 hours (5,952 seconds) on a single CPU core (2.26 GHZ)
![Page 13: Smarter Outlier Detection and Deeper Understanding of Large-Scale Taxi Trip Records: A Case Study of NYC Jianting Zhang Department of Computer Science](https://reader037.vdocuments.us/reader037/viewer/2022103022/56649cf75503460f949c6c30/html5/thumbnails/13.jpg)
Experiments and Results
Examples of Detected Type II Outliers
![Page 14: Smarter Outlier Detection and Deeper Understanding of Large-Scale Taxi Trip Records: A Case Study of NYC Jianting Zhang Department of Computer Science](https://reader037.vdocuments.us/reader037/viewer/2022103022/56649cf75503460f949c6c30/html5/thumbnails/14.jpg)
Experiments and ResultsMapping Betweenness Centralities (All hours)
![Page 15: Smarter Outlier Detection and Deeper Understanding of Large-Scale Taxi Trip Records: A Case Study of NYC Jianting Zhang Department of Computer Science](https://reader037.vdocuments.us/reader037/viewer/2022103022/56649cf75503460f949c6c30/html5/thumbnails/15.jpg)
Experiments and Results
00H 02H 04H 06H 08H 10H
12H 14H 16H18H
20H 22H
Legend:
Mapping Betweenness Centralities (bi-hourly)
![Page 16: Smarter Outlier Detection and Deeper Understanding of Large-Scale Taxi Trip Records: A Case Study of NYC Jianting Zhang Department of Computer Science](https://reader037.vdocuments.us/reader037/viewer/2022103022/56649cf75503460f949c6c30/html5/thumbnails/16.jpg)
Summary
• Large-scale taxi trip records are error-prone due to a combination of device, human and information system induced errors – outlier detection and data cleaning are important preprocess steps.
• Our approach detects outliers that can not be snapped to street segments (through spatial analysis) and/or have significant differences between computed shortest distances and recorded trip distances (network analysis)
• The work is preliminary - a more comprehensive framework is needed (e.g., incorporating pickup and drop-off times, trip duration and fare information)
• It would be interesting to generate dynamics of betweenness maps at different traffic conditions, e.g., peak/non-peak, morning/afternoon and weekdays/weekends, and explore connection strengths among NYC regions.