my metro -- the smart way to get around, delays

My Metro The smart way to get around

Upload: daniel-lamblin

Post on 13-Apr-2017


Category:

Data & Analytics



TRANSCRIPT

Page 1: My metro -- The smart way to get around, delays

My Metro
The smart way to get around

Page 2: My metro -- The smart way to get around, delays

My Metro
The smart way to get around, delays

Page 3: My metro -- The smart way to get around, delays

Motivation

The MTA’s subway delays never seem to be announced in actionable timeframes.

Inspired by the Swiss Rail 2000 option of subscribing to your morning and evening commuter trains:

I only want updates on my specific train within its specific time.

Page 4: My metro -- The smart way to get around, delays

Goal for Users

Instead of trip planning, inform the user(s) of the posted scheduled trains, the frequency of alerts for that schedule, and the variation of times off that schedule.
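As a rough illustration of the two numbers named above (alert frequency and variation off the schedule), here is a minimal sketch. The tuple layout and the function name `schedule_stats` are assumptions made for this example, not part of the project.

```python
from statistics import mean, pstdev


def schedule_stats(observations):
    """Summarise one subscribed train.

    observations: list of (scheduled_epoch, actual_epoch, had_alert) tuples,
    one per day the train ran. This layout is hypothetical.
    """
    if not observations:
        raise ValueError("need at least one observation")
    # Signed deviation in seconds: positive means the train ran late.
    deviations = [actual - scheduled for scheduled, actual, _ in observations]
    alert_days = sum(1 for _, _, had_alert in observations if had_alert)
    return {
        "alert_frequency": alert_days / len(observations),
        "mean_delay_s": mean(deviations),
        "delay_stddev_s": pstdev(deviations),
    }


# Made-up history for one train: four days, one of them with an alert.
history = [
    (1000, 1060, False),
    (1000, 1000, False),
    (1000, 1300, True),
    (1000, 1120, False),
]
stats = schedule_stats(history)
```

With this made-up history the alert frequency is 0.25 and the mean delay is 120 seconds.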

Users should get a push notification for any alerts affecting their selected trains.
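One way to decide who gets a push notification is to match each alert against the stored subscriptions by route and time window. The sketch below assumes a hypothetical `Subscription` record and expresses times as minutes after midnight. Actually sending the push is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subscription:
    # Hypothetical record: one user, one train line, one daily time window.
    user: str
    route_id: str
    start_minute: int  # minutes after midnight, inclusive
    end_minute: int    # minutes after midnight, inclusive


def users_to_notify(subscriptions, alert_route_id, alert_minute):
    """Return the users whose subscribed route and window match the alert."""
    return sorted({
        s.user
        for s in subscriptions
        if s.route_id == alert_route_id
        and s.start_minute <= alert_minute <= s.end_minute
    })


subs = [
    Subscription("ana", "4", 8 * 60, 8 * 60 + 30),
    Subscription("ben", "4", 17 * 60, 17 * 60 + 30),
    Subscription("cy", "L", 8 * 60, 9 * 60),
]
morning = users_to_notify(subs, "4", 8 * 60 + 10)
```

Here an alert on the 4 train at 08:10 reaches only the user with a morning subscription to that line.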

Future work might include identifying indicators of impending alerts or un-alerted delays for that train, for example previous delays or weather.

Page 5: My metro -- The smart way to get around, delays

Pipeline

GTFS


Page 8: My metro -- The smart way to get around, delays

Challenges

The GTFS realtime feed is in Protocol Buffers; while a solid technology, it is not conducive to rapid prototyping.

The format suits a snapshot of the current MTA train system, but it is poor for a yearly overview or for finding trends.

Many of the important details, like train names, routes, and stations, are noted by ID only, and some clunky CSV files are provided to explain them. These can vary over time.

It is compact: about 50 GB of input for one year.

Then 1.5 GB for two years of turnstile data, plus a few MB for the network GeoJSON, station, route, and schedule information.

Page 9: My metro -- The smart way to get around, delays

Challenges

GTFS is an extended Protocol Buffer, and NYCT extensions apply.

Each feed message comes in at ~100 KB and includes route changes, train locations, updates on next-stop expectations, and alerts with plain-text notes.

This leads to an HDFS small-files problem, resolved by translating to HFiles.

The approach is to extract only the useful information, keyed by timestamp, train, and location, to build an HBase HFile, and then to aggregate stats on it.
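The extract-and-aggregate step can be sketched in plain Python, without HBase. The key layout `train|stop|timestamp`, the snapshot dictionaries, and the `delay_s` field are all hypothetical. The rows are sorted by key because HFiles hold their entries in key order.

```python
from collections import defaultdict


def row_key(train_id, stop_id, timestamp):
    # Zero-pad the timestamp so lexicographic order matches time order.
    return f"{train_id}|{stop_id}|{timestamp:010d}"


def compact(snapshots):
    """Flatten many small feed snapshots into one key-sorted list of rows."""
    rows = {}
    for snap in snapshots:
        for upd in snap["updates"]:
            key = row_key(upd["train_id"], upd["stop_id"], snap["timestamp"])
            rows[key] = upd["delay_s"]
    return sorted(rows.items())


def mean_delay_by_train(rows):
    """Aggregate the compacted rows into a mean delay per train."""
    totals = defaultdict(lambda: [0, 0])  # train_id -> [sum, count]
    for key, delay in rows:
        train_id = key.split("|", 1)[0]
        totals[train_id][0] += delay
        totals[train_id][1] += 1
    return {train: total / count for train, (total, count) in totals.items()}


# Two made-up snapshots, deliberately out of time order.
snapshots = [
    {"timestamp": 1424845777,
     "updates": [{"train_id": "A1", "stop_id": "S2", "delay_s": 90}]},
    {"timestamp": 1424845747,
     "updates": [{"train_id": "A1", "stop_id": "S1", "delay_s": 30},
                 {"train_id": "B7", "stop_id": "S1", "delay_s": 0}]},
]
rows = compact(snapshots)
means = mean_delay_by_train(rows)
```

Merging many small snapshots into one sorted structure is what sidesteps the small-files problem. The statistics then run over the compacted rows rather than over individual feed files.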

Page 10: My metro -- The smart way to get around, delays

Example deserialized input. ~15 pages for 1 request.

    "header": {
      "gtfs_realtime_version": "1.0",
      "timestamp": 1424845747,
      "nyct_feed_header": {
        "trip_replacement_period": [
          {
            "route_id": "1",
            "replacement_period": {
              "end": 1424847547
            }
          },
          {
            "route_id": "2",
            "replacement_period": {
              "end": 1424847547
            }
          },
          ….
          ….
          {
            "route_id": "3",
            "replacement_period": {
              "end": 1424847547
            }
          },
          {
            "route_id": "4",
            "replacement_period": {
              "end": 1424847547
            }
          },
          {
            "route_id": "5",
            "replacement_period": {
              "end": 1424847547
            }

Etc.; some 400 KB of text from GTFS per feed request.
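To show how little of each message is needed downstream, here is a small sketch that reads a header shaped like the fragment above and reports, per route, how far the replacement period extends past the feed timestamp. The sample string is a shortened copy of the first two entries, wrapped in an outer object so that it parses as JSON. The function name `replacement_windows` is made up for this example.

```python
import json

# Shortened copy of the example header, wrapped in braces to be valid JSON.
sample = """
{
  "header": {
    "gtfs_realtime_version": "1.0",
    "timestamp": 1424845747,
    "nyct_feed_header": {
      "trip_replacement_period": [
        {"route_id": "1", "replacement_period": {"end": 1424847547}},
        {"route_id": "2", "replacement_period": {"end": 1424847547}}
      ]
    }
  }
}
"""


def replacement_windows(feed):
    """Map route_id to seconds between the feed timestamp and period end."""
    header = feed["header"]
    feed_time = header["timestamp"]
    periods = header["nyct_feed_header"]["trip_replacement_period"]
    return {
        p["route_id"]: p["replacement_period"]["end"] - feed_time
        for p in periods
    }


windows = replacement_windows(json.loads(sample))
```

For the values shown in the example, each replacement period ends 1800 seconds (30 minutes) after the feed timestamp.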

Page 11: My metro -- The smart way to get around, delays

Outlook on Open Sourced Data Pipelines

Current open source projects are in the mode of:

● Make this pipeline work
● Find ways to improve its speed and capacity

Thus little effort is currently going into simplifying distribution and setup except where it is for profit. I project this will change and improve in 5 years.

To draw an analogy: open-MR solutions are currently like Apache httpd, MySQL, and PHP were before the LAMP stack was packaged. Once dependencies were in package maintenance, adoption and utility took off.

Page 12: My metro -- The smart way to get around, delays

Daniel Pascal Lamblin

BS-CS 1999; years at:

Google
DoubleClick Studio componentized ad authoring web app with campaign reporting, and internal reporting on reporting

Travelocity & Sabre GDS
Recommend Travel Experiences, with shared and reviewed trip journals

EMC Corp
PetaSite support: PB of tape

Raised in Dubai

A 4 year old took this →

Page 13: My metro -- The smart way to get around, delays

Sources

Realtime:

➔ The MTA has a realtime feed of subway train locations for the 1, 2, 3, 4, 5, 6, S, and L trains.

➔ This includes service alerts.

Engineered:

➔ User data for subscriptions to trains and notifications of personal delays would need to be generated for demonstration use.

Static:

➔ The MTA provides batches of historical data for these too. datamine-history.s3.amazonaws.com/2015-09-01-09-{01,06,11,16,26,31,36,41,46,51,56}

➔ Weekly turnstile data can be used to generate realistic interest in transit throughout the day.

➔ Station IDs and geolocations
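The historical batch address listed under Static uses shell-style brace notation for the minute suffixes. A minimal sketch of expanding it into one address per snapshot follows. The minute list is copied exactly as given, and the `https` scheme is an assumption, since the slide shows no scheme. Nothing is downloaded here.

```python
# Prefix and minute suffixes copied from the slide. The scheme is assumed.
PREFIX = "datamine-history.s3.amazonaws.com/2015-09-01-09-"
MINUTES = ["01", "06", "11", "16", "26", "31", "36", "41", "46", "51", "56"]


def snapshot_urls(prefix=PREFIX, minutes=MINUTES, scheme="https"):
    """Expand the brace pattern into one address per listed minute."""
    return [f"{scheme}://{prefix}{minute}" for minute in minutes]


urls = snapshot_urls()
```

Each resulting address names one snapshot file in the 09:00 hour of 2015-09-01.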