fantasy league sports with big data technologies

FANTASY LEAGUE SPORTS

Fantastical, fast, and furious fantasy stats.

By: Silvia Oliveros

WHY FANTASY LEAGUES?

• I like watching sports.

• Large fan base (41 million people).

• Simulate my own site with a 5 million user base.

WEBSITE

http://www.apple.com

PIPELINE

[Pipeline diagram. Data ingestion: user data (roster information) and NFL play-by-play data enter through Kafka. Batch layer: HDFS and Spark. Real-time / speed layer: Spark Streaming. Serving layer: Cassandra and Flask.]

DATA INGESTION


[Diagram: roster and NFL play-by-play data flow into Kafka.]


User Data (Roster):

Play-by-Play Data:


Why Kafka?

Two consumers send data to HDFS and to Spark Streaming.

Potential real-time changes in roster information (future).
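The fan-out described above can be illustrated with a small in-memory model. This is only a sketch, not Kafka client code: `TopicLog`, the group names, and the record fields are hypothetical, and the model assumes a single partition. It shows the property that makes two consumers possible: each consumer group keeps its own offset into the same append-only log, so both receive every record.

```python
from collections import defaultdict


class TopicLog:
    """In-memory stand-in for one topic partition (an append-only log)."""

    def __init__(self):
        self._records = []
        # Each consumer group keeps its own read offset into the same log.
        self._offsets = defaultdict(int)

    def produce(self, record):
        """Append one record to the end of the log."""
        self._records.append(record)

    def poll(self, group):
        """Return records this group has not seen yet and advance its offset."""
        start = self._offsets[group]
        batch = self._records[start:]
        self._offsets[group] = len(self._records)
        return batch


plays = TopicLog()
plays.produce({"player": "QB-1", "yards": 12})
plays.produce({"player": "RB-7", "yards": 3})

# Both groups receive every record, independently of each other.
hdfs_batch = plays.poll("hdfs-writer")
stream_batch = plays.poll("spark-streaming")
```

Because offsets are tracked per group, the batch path can lag behind or replay without affecting the streaming path.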


REAL-TIME / SPEED LAYER





When a new play comes in:

• Look up the roster data.

• Generate information.

• Aggregate points by user.
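The three steps above can be sketched in plain Python. This is a minimal illustration rather than the Spark Streaming job from the talk: the `ROSTER` mapping, the record fields, and the scoring rule in `score_play` are all hypothetical assumptions, and `process_batch` stands in for the work done on one micro-batch.

```python
from collections import defaultdict

# Hypothetical roster lookup: NFL player id -> users who own that player.
ROSTER = {
    "QB-1": ["alice", "bob"],
    "RB-7": ["alice"],
}


def score_play(play):
    """Hypothetical scoring rule: 0.1 points per yard, 6 per touchdown."""
    return 0.1 * play["yards"] + (6 if play.get("touchdown") else 0)


def process_batch(plays, totals):
    """Apply one micro-batch of plays to the running per-user totals."""
    for play in plays:
        points = score_play(play)  # generate information for the play
        for user in ROSTER.get(play["player"], []):  # roster lookup
            totals[user] += points  # aggregate points by user
    return totals


totals = process_batch(
    [
        {"player": "QB-1", "yards": 20, "touchdown": True},
        {"player": "RB-7", "yards": 10},
    ],
    defaultdict(float),
)
```

Passing the same `totals` mapping into successive calls keeps a running score per user across batches.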




BATCH LAYER





• Spark on top of HDFS

• Admin queries (updated once every 24 hours):

  • Top Users

  • Demographic Breakdown

• User and Player queries:

  • Historical fantasy points per game / week
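Two of the batch queries listed above can be sketched as plain aggregations. This is an illustration in standard-library Python, not the Spark job itself: the `HISTORY` records and both function names are hypothetical, and a real batch run would read the full history from HDFS.

```python
from collections import defaultdict

# Hypothetical historical records: (user, week, fantasy points).
HISTORY = [
    ("alice", 1, 42.5),
    ("bob", 1, 30.0),
    ("alice", 2, 18.0),
    ("carol", 2, 55.0),
]


def points_per_week(history):
    """Historical fantasy points keyed by (user, week)."""
    table = defaultdict(float)
    for user, week, points in history:
        table[(user, week)] += points
    return dict(table)


def top_users(history, n):
    """The n users with the highest total points across all weeks."""
    totals = defaultdict(float)
    for user, _week, points in history:
        totals[user] += points
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]


weekly = points_per_week(HISTORY)
leaders = top_users(HISTORY, 2)
```

Both functions make a full pass over the history, which is why results like these suit a periodic batch refresh rather than per-request computation.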


SERVING LAYER



Cassandra: multiple queries require different tables with efficient schemas.

Flask: API for both analysts and users of the website.

D3 graphs
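The point about needing a different table per query can be modeled in memory. This sketch is not the schema from the talk and does not use a Cassandra driver: the two "tables" are plain dictionaries with hypothetical names, chosen so that each query reads from a single key, much as a query-first design aims for single-partition reads.

```python
from collections import defaultdict

# Table 1: keyed by user, then week (serves the per-user history query).
points_by_user = defaultdict(dict)
# Table 2: keyed by week (serves the weekly leaderboard query).
points_by_week = defaultdict(list)


def write_points(user, week, points):
    """Write the same fact to both tables, once per query pattern."""
    points_by_user[user][week] = points
    points_by_week[week].append((points, user))


def user_history(user):
    """Read one key: all weeks for one user, in week order."""
    return sorted(points_by_user[user].items())


def week_leaderboard(week, n):
    """Read one key: the top n (points, user) pairs for one week."""
    return sorted(points_by_week[week], reverse=True)[:n]


write_points("alice", 1, 42.5)
write_points("bob", 1, 30.0)
write_points("alice", 2, 18.0)
```

The trade-off is duplicated writes in exchange for reads that never scan or join, which suits an API that serves many users at once.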

LESSONS LEARNED

• Technologies: Spark, Spark Streaming, Cassandra

• Scalability in Spark Streaming for different operations (number of records vs number of nodes)

• The Spark Streaming saveAs function writes many small files, even after repartitioning. To deal with that in HDFS, I wrote a function that appends to a single file.
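The append workaround might look roughly like the following. This is a sketch under assumptions, not the function from the talk: it works on a local directory with `pathlib` rather than on HDFS (which would need an HDFS client), the name `append_parts` is hypothetical, and it assumes the small files follow a `part-*` naming pattern.

```python
from pathlib import Path


def append_parts(parts_dir, target):
    """Append every part-* file in parts_dir to target, then delete the part.

    Returns the number of part files that were merged.
    """
    parts = sorted(Path(parts_dir).glob("part-*"))
    # Binary append mode keeps earlier contents and adds new data at the end.
    with open(target, "ab") as out:
        for part in parts:
            out.write(part.read_bytes())
            part.unlink()
    return len(parts)
```

Calling it after each batch keeps one growing file in place of many small ones; the target should not itself match `part-*`, or it would be picked up as input.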

SILVIA OLIVEROS

• M.S. Computer Engineering, Purdue University

• Developed Visual Analytics Tools for DHS Partners:

  • Coast Guard

  • Dietary Survey (NHANES)

[email protected]/soliverost
