Daniel lamblin week4_demo
![Page 1: Daniel lamblin week4_demo](https://reader038.vdocuments.us/reader038/viewer/2022100801/58a47aa71a28aba34c8b6c17/html5/thumbnails/1.jpg)
MTA Delay & Monitoring
Daily delay data on lines 1-6, L & S
![Page 2: Daniel lamblin week4_demo](https://reader038.vdocuments.us/reader038/viewer/2022100801/58a47aa71a28aba34c8b6c17/html5/thumbnails/2.jpg)
Motivation
The MTA’s subway delays are rarely announced in actionable timeframes.
Most riders are only interested in knowing about issues on their daily commute at their regular travel time.
When Switzerland completed its Swiss Rail 2000 initiative (late, in 2004), it enabled users to subscribe to their regular train, notifying them early of issues.
![Page 3: Daniel lamblin week4_demo](https://reader038.vdocuments.us/reader038/viewer/2022100801/58a47aa71a28aba34c8b6c17/html5/thumbnails/3.jpg)
Web UI Demo
The demo is linked from dlamblin.github.io/mta-delay-monitoring
![Page 4: Daniel lamblin week4_demo](https://reader038.vdocuments.us/reader038/viewer/2022100801/58a47aa71a28aba34c8b6c17/html5/thumbnails/4.jpg)
Processing pipeline
GTFS
![Page 5: Daniel lamblin week4_demo](https://reader038.vdocuments.us/reader038/viewer/2022100801/58a47aa71a28aba34c8b6c17/html5/thumbnails/5.jpg)
Challenges: Resolved
GTFS is in Protocol Buffers; while a solid technology, it is not conducive to rapid prototyping.
Cluster configuration and reconfiguration.
Format is ‘good’ for a current snapshot, but poor for a yearly overview.
Original format is compact: only 50 GB for 2 years of GTFS.
Then 1.5 GB for 2 years of turnstile data, and a few MB for the network GeoJSON, station, route and schedule information.
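One way to get from per-snapshot data to a yearly overview is to roll the snapshots up per day and per route. The following is only a minimal sketch, assuming the feed has already been decoded into flat records. The `Observation` type, its fields, the `daily_mean_delay` helper and the sample values are hypothetical, not taken from the MTA feed or from the pipeline shown in these slides.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timezone
from statistics import mean


@dataclass(frozen=True)
class Observation:
    """Hypothetical flattened record taken from one feed snapshot."""
    timestamp: int      # POSIX seconds of the snapshot
    route_id: str       # e.g. "1", "6", "L"
    delay_seconds: int  # assumed to be computed upstream


def daily_mean_delay(observations):
    """Return {(iso_date, route_id): mean delay in seconds}.

    Days are cut in UTC to keep the sketch short; a real rollup would
    probably use New York local time instead.
    """
    buckets = defaultdict(list)
    for obs in observations:
        day = datetime.fromtimestamp(obs.timestamp, tz=timezone.utc).date().isoformat()
        buckets[(day, obs.route_id)].append(obs.delay_seconds)
    return {key: mean(values) for key, values in buckets.items()}


# Made-up sample: two snapshots five minutes apart on 2015-09-01 (UTC).
sample = [
    Observation(1441098060, "L", 120),
    Observation(1441098360, "L", 60),
    Observation(1441098060, "6", 0),
]
summary = daily_mean_delay(sample)
```

Keeping one row per day and route makes a year of data small enough to chart directly, at the cost of losing the per-train detail that the raw snapshots carry.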
![Page 6: Daniel lamblin week4_demo](https://reader038.vdocuments.us/reader038/viewer/2022100801/58a47aa71a28aba34c8b6c17/html5/thumbnails/6.jpg)
Challenges: Remaining
Originally aimed for a repeatable, containerized, deployable pipeline with Kubernetes, with replication controllers and monitoring, but the timeframe gave:
➢ A manually configured cluster of 4 EC2 m4.large nodes.
➢ Running shared HDFS, Kafka, Spark, HBase, web-api.
➢ Would have output additional views for manual analysis, e.g. with Drill.
➢ Generating sample user loads.
![Page 7: Daniel lamblin week4_demo](https://reader038.vdocuments.us/reader038/viewer/2022100801/58a47aa71a28aba34c8b6c17/html5/thumbnails/7.jpg)
Clusters, sizes, throughput etc.
Plan to show some measurements of data sizes, possibly # of user notifications supported.
![Page 8: Daniel lamblin week4_demo](https://reader038.vdocuments.us/reader038/viewer/2022100801/58a47aa71a28aba34c8b6c17/html5/thumbnails/8.jpg)
Daniel Pascal Lamblin
BS-CS 1999; years at:
Google
DoubleClick Studio componentized ad authoring web app with campaign reporting, and internal reporting on reporting
Travelocity & Sabre GDS
Recommend Travel Experiences, with shared and reviewed trip journals
EMC Corp
PetaSite support: PB of tape
Raised in Dubai
A 4 year old took this →
![Page 9: Daniel lamblin week4_demo](https://reader038.vdocuments.us/reader038/viewer/2022100801/58a47aa71a28aba34c8b6c17/html5/thumbnails/9.jpg)
Sources
Realtime:
➔ The MTA has a realtime feed of subway train locations for the 1, 2, 3, 4, 5, 6, S and L trains.
➔ This includes service alerts.
Engineered:
➔ User data for subscriptions to trains and notifications of personal delays would need to be generated for demonstration use.
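Generating that user data could look roughly like the sketch below. It assumes a subscription is just a rider, a route and a regular travel time, and that demand per hour is supplied as 24 relative weights (which could be derived from turnstile counts). The `Subscription` type, `generate_subscriptions` and the weights are made up for illustration and are not the generator used for the demo.

```python
import random
from dataclasses import dataclass

ROUTES = ["1", "2", "3", "4", "5", "6", "S", "L"]  # lines covered by the realtime feed


@dataclass(frozen=True)
class Subscription:
    """Hypothetical record: one rider's regular train and travel time."""
    user_id: int
    route_id: str
    minute_of_day: int  # 0..1439, local time


def generate_subscriptions(count, hourly_weights, seed=0):
    """Draw `count` synthetic subscriptions.

    hourly_weights holds 24 non-negative numbers giving relative demand
    per hour of the day; hours with larger weights are picked more often.
    """
    if len(hourly_weights) != 24:
        raise ValueError("expected one weight per hour of the day")
    rng = random.Random(seed)  # seeded so that demo loads are repeatable
    hours = rng.choices(range(24), weights=hourly_weights, k=count)
    return [
        Subscription(
            user_id=i,
            route_id=rng.choice(ROUTES),
            minute_of_day=hour * 60 + rng.randrange(60),
        )
        for i, hour in enumerate(hours)
    ]


# Made-up weights with morning and evening peaks, for illustration only.
weights = [1] * 6 + [5, 9, 9, 5] + [3] * 6 + [6, 9, 8, 4] + [2] * 4
subs = generate_subscriptions(1000, weights)
```

Seeding the generator keeps a sample load reproducible between runs, which helps when comparing throughput measurements.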
Static:
➔ The MTA provides batches of historical data for these too, for example datamine-history.s3.amazonaws.com/2015-09-01-09-{01,06,11,16,26,31,36,41,46,51,56}
➔ Weekly turnstile data can be used to generate realistic interest in transit throughout the day.
➔ Station IDs and geolocations
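The braces in the historical data path above are shell-style shorthand for several object names. As a minimal sketch, the pattern can be expanded into individual URLs as follows. The `expand_braces` helper is illustrative and handles only a single brace group, the `https://` scheme is an assumption, and nothing here checks that the objects are still available.

```python
import re


def expand_braces(pattern):
    """Expand a single {a,b,c} group in `pattern` into a list of strings."""
    match = re.search(r"\{([^{}]*)\}", pattern)
    if match is None:
        return [pattern]  # nothing to expand
    head, tail = pattern[:match.start()], pattern[match.end():]
    return [head + option + tail for option in match.group(1).split(",")]


# Pattern copied from the slide; the https:// scheme is an assumption.
PATTERN = "datamine-history.s3.amazonaws.com/2015-09-01-09-{01,06,11,16,26,31,36,41,46,51,56}"
urls = ["https://" + name for name in expand_braces(PATTERN)]
```

Each resulting entry names one snapshot from the 09:00 hour of 2015-09-01, and the list can then be handed to whatever downloader the pipeline uses.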