Fantasy League Sports with Big Data Technologies
FANTASY LEAGUE SPORTS
Fantastical, fast, and furious fantasy stats.
By: Silvia Oliveros
WHY FANTASY LEAGUES?
• I like watching sports.
• Large fan base (41 million people).
• Simulate my own site with a user base of 5 million.
WEBSITE
PIPELINE
• Data ingestion: User Data (Information, Roster) and NFL (Play-by-Play) → Kafka
• Real-time / speed layer: Spark Streaming
• Batch layer: HDFS → Spark
• Serving layer: Cassandra → Flask
DATA INGESTION
User Data (Information, Roster) and NFL (Play-by-Play) → Kafka
[Slide shows examples of User Data (Roster) and Play-by-Play Data.]
Why Kafka?
• Two consumers send data to HDFS and Spark Streaming.
• Potential real-time changes in roster information (future).
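The "two consumers" point can be sketched with a toy in-memory model. This is not the Kafka client API: `TopicLog` and `Consumer` are hypothetical names, and the play records are made up. The sketch only shows the property being relied on: each consumer keeps its own offset into the same log, so both the HDFS writer and the Spark Streaming feed see every message. In Kafka this corresponds to the two consumers belonging to different consumer groups.

```python
from dataclasses import dataclass, field


@dataclass
class TopicLog:
    """Toy stand-in for a Kafka topic: an append-only list of messages."""
    messages: list = field(default_factory=list)

    def produce(self, message):
        self.messages.append(message)


@dataclass
class Consumer:
    """Each consumer tracks its own offset, so one consumer reading a
    message does not remove it for the other."""
    name: str
    offset: int = 0

    def poll(self, log):
        # Return everything appended since this consumer's last poll.
        new = log.messages[self.offset:]
        self.offset = len(log.messages)
        return new


plays = TopicLog()
hdfs_consumer = Consumer("hdfs")            # would write to HDFS (batch layer)
streaming_consumer = Consumer("streaming")  # would feed Spark Streaming

# Made-up play-by-play messages.
plays.produce({"player": "QB1", "yards": 12})
plays.produce({"player": "RB7", "yards": 3})

batch_copy = hdfs_consumer.poll(plays)
speed_copy = streaming_consumer.poll(plays)
```

Both `batch_copy` and `speed_copy` contain the same two plays, which is what lets the batch and speed layers work from one ingestion point.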
REAL-TIME / SPEED LAYER
When a new play comes in:
• Lookup (roster data)
• Generate information
• Aggregate points by user
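The speed-layer steps can be sketched in plain Python, without Spark Streaming, to show the data flow. Everything specific here is an assumption for illustration: the roster contents, the record fields, and the scoring rule (1 point per 10 yards, 6 per touchdown) are hypothetical and not taken from the talk.

```python
from collections import defaultdict

# Hypothetical roster data: user id -> set of NFL player ids on that team.
ROSTERS = {
    "user_a": {"QB1", "RB7"},
    "user_b": {"RB7", "WR3"},
}


def score_play(play):
    """Assumed scoring rule: 1 point per 10 yards, 6 per touchdown."""
    return play["yards"] // 10 + 6 * play["touchdowns"]


def users_with_player(player_id, rosters):
    """Lookup step: which users have this player on their roster?"""
    return [user for user, players in rosters.items() if player_id in players]


def process_play(play, rosters, totals):
    """Generate points for one play and add them to each affected user."""
    points = score_play(play)
    for user in users_with_player(play["player"], rosters):
        totals[user] += points
    return totals


totals = defaultdict(int)
incoming_plays = [
    {"player": "RB7", "yards": 10, "touchdowns": 0},
    {"player": "QB1", "yards": 20, "touchdowns": 1},
]
for play in incoming_plays:
    process_play(play, ROSTERS, totals)
```

After both plays, `totals` holds 9 points for `user_a` and 1 point for `user_b`. In a streaming job the same lookup and aggregation would run once per micro-batch rather than in a simple loop.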
BATCH LAYER
• Spark on top of HDFS
• Admin queries (updated once every 24 hours):
  • Top users
  • Demographic breakdown
• User and player queries:
  • Historical fantasy points per game / week
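Two of the batch queries above, top users and historical points per week, can be sketched as simple aggregations. This is plain Python over a made-up list of records, not the Spark job from the talk; the `(user, week, points)` layout and the function names are assumptions. In Spark, each function would roughly correspond to a keyed reduction over the data stored in HDFS.

```python
from collections import defaultdict

# Hypothetical historical records: (user, week, fantasy points).
HISTORY = [
    ("user_a", 1, 9),
    ("user_b", 1, 4),
    ("user_a", 2, 3),
    ("user_c", 2, 15),
    ("user_b", 2, 7),
]


def points_per_week(history):
    """User query: total points per (user, week)."""
    table = defaultdict(int)
    for user, week, points in history:
        table[(user, week)] += points
    return dict(table)


def top_users(history, n):
    """Admin query: the n users with the most points overall.
    Ties are broken alphabetically by user id."""
    totals = defaultdict(int)
    for user, _week, points in history:
        totals[user] += points
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


weekly = points_per_week(HISTORY)
leaders = top_users(HISTORY, 2)
```

Because the admin views are refreshed only once every 24 hours, recomputing them from the full history on each run is a reasonable fit for this kind of batch job.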
SERVING LAYER
• Cassandra: multiple queries require different tables with efficient schemas.
• Flask: API for both analysts and users of the website.
• D3 graphs
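The "different tables for different queries" idea can be sketched with two dictionaries standing in for two Cassandra tables. This is a toy model, not a Cassandra schema: the table names, keys, and records are hypothetical. It shows the usual pattern of writing the same fact into one table per query, each keyed by what that query looks up.

```python
from collections import defaultdict

# Two "tables" holding the same facts, each keyed for one query.
points_by_user = defaultdict(list)  # key: user -> [(week, points)]
points_by_week = defaultdict(list)  # key: week -> [(user, points)]


def write_points(user, week, points):
    """Write one fact to every table that serves a query over it."""
    points_by_user[user].append((week, points))
    points_by_week[week].append((user, points))


write_points("user_a", 1, 9)
write_points("user_b", 1, 4)
write_points("user_a", 2, 3)

# Query 1: one user's history is a single-key read.
user_history = points_by_user["user_a"]

# Query 2: a week's leaderboard is also a single-key read, then a sort.
week_one = sorted(points_by_week[1], key=lambda row: -row[1])
```

The trade-off is extra writes and storage in exchange for reads that never need to scan or join across tables.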
LESSONS LEARNED
• Technologies: Spark, Spark Streaming, Cassandra
• Scalability in Spark Streaming for different operations (number of records vs number of nodes)
• Spark Streaming's saveAs output functions write many small files, even after repartitioning; to deal with that in HDFS, I wrote a function that appends to a single file.
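The append-to-a-single-file workaround can be sketched as follows. The author's actual function is not shown in the slides, so this is only an illustration of the idea: the local filesystem stands in for HDFS, and `append_batch` and the record strings are hypothetical. A real version would need an HDFS client that supports append.

```python
import os
import tempfile


def append_batch(path, records):
    """Append one micro-batch of records to a single file, one per line.
    Empty batches write nothing, so they leave no empty output behind."""
    if not records:
        return 0
    with open(path, "a", encoding="utf-8") as out:
        for record in records:
            out.write(record + "\n")
    return len(records)


# Local temp directory used here as a stand-in for an HDFS directory.
output_path = os.path.join(tempfile.mkdtemp(), "plays.txt")

# Three micro-batches, one of them empty.
batches = [["play-1", "play-2"], [], ["play-3"]]
written = sum(append_batch(output_path, batch) for batch in batches)

with open(output_path, encoding="utf-8") as f:
    lines = f.read().splitlines()
```

All three records end up in one file instead of one small file per batch, which keeps the number of files in the output directory constant as the stream runs.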
SILVIA OLIVEROS
• M.S. Computer Engineering, Purdue University
• Developed Visual Analytics Tools for DHS Partners:
• Coast Guard
• Dietary Survey (NHANES)
[email protected]/soliverost