daniel lamblin week4_demo

Upload: daniel-lamblin

Post on 13-Apr-2017

Category:

Data & Analytics


TRANSCRIPT

Page 1: Daniel lamblin week4_demo

MTA Delay & Monitoring

Daily delay data on lines 1-6, L & S

Page 2: Daniel lamblin week4_demo

Motivation

The MTA’s subway delays are rarely announced in actionable timeframes.

Most riders are only interested in knowing about issues on their daily commute at their regular travel time.

When Switzerland completed its Swiss Rail 2000 initiative (late, in 2004), it enabled users to subscribe to their regular train and be notified early of issues.

Page 3: Daniel lamblin week4_demo

Web UI Demo

The demo is linked from dlamblin.github.io/mta-delay-monitoring

Page 4: Daniel lamblin week4_demo

Processing pipeline

GTFS

Page 5: Daniel lamblin week4_demo

Challenges: Resolved

GTFS is in Protocol Buffers; while a solid technology, it is not conducive to rapid prototyping; cluster configuration and reconfiguration.

Format is ‘good’ for a current snapshot, but poor for a yearly overview.

Original format is compact; only 50GB for 2 years of GTFS

Then 1.5GB for 2 years of turnstile data, and a few MB for the network GeoJSON, station, route and schedule information.
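As a rough sanity check on these figures, the sketch below converts the quoted totals into average per-day and per-snapshot sizes. It assumes decimal gigabytes, 365-day years, and one archived snapshot every 5 minutes (a guess based on the minute suffixes in the historical batch names listed under Sources); all names in the code are illustrative.

```python
# Back-of-the-envelope sizing from the totals quoted on the slide.
GTFS_TOTAL_BYTES = 50e9        # "50GB for 2 years of GTFS" (decimal GB assumed)
TURNSTILE_TOTAL_BYTES = 1.5e9  # "1.5GB" for 2 years of turnstile data
DAYS = 2 * 365                 # two 365-day years (assumption)
SNAPSHOT_INTERVAL_MIN = 5      # assumption: one archived snapshot every 5 minutes


def per_day(total_bytes: float, days: int = DAYS) -> float:
    """Average number of bytes stored per day."""
    return total_bytes / days


def per_snapshot(total_bytes: float, days: int = DAYS,
                 interval_min: int = SNAPSHOT_INTERVAL_MIN) -> float:
    """Average bytes per snapshot, given a fixed snapshot interval."""
    snapshots_per_day = 24 * 60 // interval_min
    return per_day(total_bytes, days) / snapshots_per_day


if __name__ == "__main__":
    print(f"GTFS: {per_day(GTFS_TOTAL_BYTES) / 1e6:.1f} MB/day")
    print(f"GTFS: {per_snapshot(GTFS_TOTAL_BYTES) / 1e3:.0f} kB/snapshot")
    print(f"Turnstile: {per_day(TURNSTILE_TOTAL_BYTES) / 1e6:.1f} MB/day")
```

Under these assumptions the GTFS archive averages roughly 68 MB per day, which is small per day but adds up when a whole year has to be scanned for an overview.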

Page 6: Daniel lamblin week4_demo

Challenges: Remaining

Originally aimed for a repeatable, containerized, deployable pipeline on Kubernetes, with replication controllers and monitoring, but the timeframe gave:

➢ A manually configured cluster of 4 EC2 m4.large nodes.

➢ Running shared HDFS, Kafka, Spark, HBase, web-api.

➢ Would have output additional views for manual analysis, e.g. with Drill.

➢ Generating sample user loads.

Page 7: Daniel lamblin week4_demo

Clusters, sizes, throughput etc.

Plan to show some measurements of data sizes, possibly # of user notifications supported.

Page 8: Daniel lamblin week4_demo

Daniel Pascal Lamblin

BS-CS 1999; years at:

Google: DoubleClick Studio componentized ad authoring web app with campaign reporting, and internal reporting on reporting

Travelocity & Sabre GDS: Recommend Travel Experiences, with shared and reviewed trip journals

EMC Corp: PetaSite support (PB of tape)

Raised in Dubai

A 4 year old took this →

Page 9: Daniel lamblin week4_demo

Sources

Realtime:

➔ The MTA has a realtime feed of subway train locations for the 1, 2, 3, 4, 5, 6, S and L trains.

➔ This includes service alerts.

Engineered:

➔ User data for subscriptions to trains and notifications of personal delays would need to be generated for demonstration use.
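One way such demonstration data could be generated is sketched below. The `Subscription` fields, the `generate_subscriptions` helper, and the station identifiers are all hypothetical; the hourly weights stand in for whatever ridership profile is available. Line and station are drawn independently here, a simplification that can pair a station with a line that does not serve it.

```python
import random
from dataclasses import dataclass

LINES = ["1", "2", "3", "4", "5", "6", "L", "S"]  # lines covered by the feed


@dataclass(frozen=True)
class Subscription:
    user_id: int
    line: str
    station_id: str  # hypothetical station identifier
    hour: int        # hour of day (0-23) of the user's regular trip


def generate_subscriptions(n_users, station_ids, hourly_weights, seed=0):
    """Draw n_users synthetic subscriptions.

    hourly_weights holds 24 non-negative weights; hours with larger
    weights are proportionally more likely to be chosen.
    """
    if len(hourly_weights) != 24:
        raise ValueError("expected one weight per hour of the day")
    rng = random.Random(seed)  # seeded so the demo data is reproducible
    hours = rng.choices(range(24), weights=hourly_weights, k=n_users)
    return [
        Subscription(user_id=i, line=rng.choice(LINES),
                     station_id=rng.choice(station_ids), hour=hour)
        for i, hour in enumerate(hours)
    ]


if __name__ == "__main__":
    # Made-up profile with morning and evening peaks, for illustration only.
    weights = [1] * 24
    weights[8] = weights[18] = 10
    for sub in generate_subscriptions(5, ["S001", "S002"], weights):
        print(sub)
```

Seeding the generator keeps sample user loads repeatable between runs, which helps when comparing pipeline behaviour across configurations.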

Static:

➔ The MTA provides batches of historical data for these too, e.g. datamine-history.s3.amazonaws.com/2015-09-01-09-{01,06,11,16,26,31,36,41,46,51,56}

➔ Weekly turnstile data can be used to generate realistic interest in transit throughout the day.

➔ Station IDs and geolocations
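The historical batch location above is written with shell-style brace notation. A minimal sketch of expanding it into individual object names follows; `expand_braces` is an illustrative helper, not part of any MTA tooling, and the pattern is copied verbatim from the slide (no URL scheme is given there, so none is added).

```python
import re
from itertools import product


def expand_braces(pattern: str) -> list[str]:
    """Expand every {a,b,c} group in pattern, shell brace-expansion style.

    Nested braces are not supported.
    """
    parts = re.split(r"\{([^{}]*)\}", pattern)
    # re.split with one capture group alternates literal text (even
    # indices) with the captured group contents (odd indices).
    choices = [
        part.split(",") if i % 2 else [part]
        for i, part in enumerate(parts)
    ]
    return ["".join(combo) for combo in product(*choices)]


# Pattern copied verbatim from the slide.
PATTERN = ("datamine-history.s3.amazonaws.com/"
           "2015-09-01-09-{01,06,11,16,26,31,36,41,46,51,56}")

if __name__ == "__main__":
    for name in expand_braces(PATTERN):
        print(name)
```

Each expanded name corresponds to one batch for the 09:00 hour of 2015-09-01; other hours and dates would presumably follow the same naming scheme, though the slide only shows this one example.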