kuan lin chen-week5_demo
TRANSCRIPT
How to solve it?
• Need to know the number of bikes at each station.
• First attempt: report the number of bikes every minute
Current Approach
• Report the number of bikes every minute
• NOT fault-tolerant
Station ID   Count
1            5
2            10
3            15
4            16
5            8
My Approach
• Compute the number of bikes at each station from the history of the trip logs
• Properties of the log: raw, immutable, perpetual
Station ID   Event           Timestamp
1            Add 1 bike      2015/06/22 10:07:00
2            Add 1 bike      2015/06/22 10:08:00
1            Remove 1 bike   2015/06/22 10:20:00
3            Add 1 bike      2015/06/22 10:21:00
2            Remove 1 bike   2015/06/22 10:40:00
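Deriving counts from the event history can be sketched roughly as below. This is a minimal illustration, not the talk's actual implementation: the tuple layout `(station_id, delta, timestamp)`, the function names, and the assumption that every station starts at zero bikes unless an `initial` mapping is passed are all hypothetical. The sample rows are the ones shown on the slide.

```python
from collections import defaultdict
from datetime import datetime

# Sample events from the slide, encoded as (station_id, delta, timestamp).
# delta is +1 for "Add 1 bike" and -1 for "Remove 1 bike" (assumed encoding).
EVENTS = [
    (1, +1, "2015/06/22 10:07:00"),
    (2, +1, "2015/06/22 10:08:00"),
    (1, -1, "2015/06/22 10:20:00"),
    (3, +1, "2015/06/22 10:21:00"),
    (2, -1, "2015/06/22 10:40:00"),
]


def parse_ts(text):
    """Parse the timestamp format used on the slide."""
    return datetime.strptime(text, "%Y/%m/%d %H:%M:%S")


def counts_at(events, as_of, initial=None):
    """Replay every event up to and including `as_of` and return bike counts.

    `initial` is an optional {station_id: count} starting state; stations
    not listed there are assumed to start at zero.
    """
    counts = defaultdict(int, initial or {})
    cutoff = parse_ts(as_of)
    for station_id, delta, ts in events:
        if parse_ts(ts) <= cutoff:
            counts[station_id] += delta
    return dict(counts)


if __name__ == "__main__":
    print(counts_at(EVENTS, "2015/06/22 10:30:00"))
```

Because the log is never modified, the same replay can be rerun at any time to rebuild the counts, which is what a once-a-minute snapshot cannot offer if a report is lost.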
Data
• The actual log data from Bay Area Bike Share has many fields: Trip ID, Duration, Start Date, Start Station, Start Terminal, End Date, End Station, End Terminal, Bike #, Subscription Type, Zip Code
• For my project, I only need the start/end station ID and the start/end date
• So I generated all of my data
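As a rough illustration of why four fields are enough, a trip record reduced to start/end station ID and start/end date can be split into two station events. The `Trip` type, its field names, and the `(station_id, delta, timestamp)` event layout are assumptions made for this sketch, not the format used in the project.

```python
from typing import NamedTuple


class Trip(NamedTuple):
    """Hypothetical reduced trip record with only the four needed fields."""
    start_station: int
    start_date: str
    end_station: int
    end_date: str


def trip_to_events(trip):
    """Split one trip into two (station_id, delta, timestamp) events.

    A bike leaves the start station (-1) at the start date and
    arrives at the end station (+1) at the end date.
    """
    return [
        (trip.start_station, -1, trip.start_date),
        (trip.end_station, +1, trip.end_date),
    ]


if __name__ == "__main__":
    trip = Trip(1, "2015/06/22 10:20:00", 2, "2015/06/22 10:40:00")
    for event in trip_to_events(trip):
        print(event)
```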
Data Pipeline
• Ingestion: Kafka
• Real-time streaming: Spark Streaming
• HDFS, Spark
• Cassandra
• Front-end service (Flask)
About me
• Kuan-Lin Chen
• Master of Engineering in Computer Science, Cornell University, class of 2015
• Bachelor of Science in Computer Science & Math, University of Wisconsin-Madison, class of 2013
• I served in the military police from 2013 to 2014.
Bay Area Bike Share Overview
• Launched on August 29, 2013
  – ~70 stations
  – ~700 bikes
  – Dock count per station: 11 to 27 (average 17.7)
• Looking to expand to 7,000 bikes by 2017
  – Potential big data problem
How big could the data be?
• California is divided into 58 counties and contains 482 municipalities (cities or towns).
• Assume each city has 40 stations and each station has 30 docks, half of which hold bikes (600 bikes per city)
• Each bike is used 72 times / day (20 min / trip, around the clock)
• Each simple log entry is 30 bytes
• 30 bytes × 72 trips × 2 log entries per trip (remove at start, add at end) × 600 bikes × 482 cities ≈ 1.2 GB / day
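The estimate can be reproduced step by step. In this sketch the constant names are illustrative, and reading the factor of 2 as two log entries per trip (one removal, one return) is an assumption based on the add/remove event format.

```python
# Inputs taken from the slide.
CITIES = 482                 # municipalities in California
STATIONS_PER_CITY = 40
DOCKS_PER_STATION = 30
MINUTES_PER_TRIP = 20
BYTES_PER_ENTRY = 30

# Assumption: each trip writes two log entries (remove at start, add at end).
ENTRIES_PER_TRIP = 2

# Only half of the docks hold a bike.
bikes_per_city = STATIONS_PER_CITY * DOCKS_PER_STATION // 2

# A bike in continuous use makes one 20-minute trip after another all day.
trips_per_bike_per_day = 24 * 60 // MINUTES_PER_TRIP

bytes_per_day = (
    BYTES_PER_ENTRY
    * trips_per_bike_per_day
    * ENTRIES_PER_TRIP
    * bikes_per_city
    * CITIES
)

if __name__ == "__main__":
    print(f"bikes per city: {bikes_per_city}")
    print(f"trips per bike per day: {trips_per_bike_per_day}")
    print(f"log volume: {bytes_per_day / 1e9:.2f} GB/day")
```

The product is 1,249,344,000 bytes, which is the roughly 1.2 GB / day quoted above. Note that 72 trips per bike per day is an upper bound, since it assumes every bike is ridden around the clock.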