Data at Spotify

Posted on 15-Jan-2015


DESCRIPTION

Data infrastructure at Spotify

TRANSCRIPT

June 12, 2014

Danielle Jabin (Data Engineer, A/B Testing)

Data at Spotify

I’m Danielle Jabin

•  Data Engineer in the Stockholm office
•  A/B testing infrastructure

•  California born & raised
•  If I can survive a Swedish winter, so can you!

•  Studied Computer Science, Statistics, and Real Estate through the M&T program at the University of Pennsylvania


Over 40 million active users

As of June 9, 2014  


Access to more than 20 million songs


Big Data

•  40 million Monthly Active Users
•  20+ million tracks
•  1.5 TB of compressed data from users per day
•  64 TB of data generated in Hadoop each day (including replication factor of 3)



So how much data is that?

Let’s compare: 64 TB

•  293,203,072 books (200 pages or 240,000 characters each)
•  16,777,216 MP3 files (with 4 MB average file size)
•  22,369,600 images (with 3 MB average file size)


That’s a lot of selfies


How do we use this data?

Use Cases

•  Reporting
•  Business Analytics
•  Operational Analytics
•  Product Features

Reporting

•  Reporting to labels, licensors, partners, and advertisers
•  We support our partners

Business Analytics

•  Analyzing growth, user behavior, sign-up funnels, etc.
•  Company KPIs
•  NPS analysis

Operational Analytics

•  Root cause analysis
•  Latency analysis
•  Better capacity planning (servers, people, bandwidth)

Product Features

•  Discover and Radio
•  Top lists
•  Personalized recommendations
•  A/B Testing
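A core building block of A/B testing infrastructure is assigning users to groups deterministically. A minimal sketch of one common approach, hashing the user ID together with the test name (illustrative only, not Spotify's actual implementation; the names here are hypothetical):

```python
import hashlib

def assign_bucket(user_id: str, test_name: str,
                  buckets=("control", "treatment")) -> str:
    """Deterministically assign a user to an A/B test group.

    Hashing user_id together with the test name keeps a user's
    assignment stable within one test, while assignments across
    different tests remain independent.
    """
    digest = hashlib.sha256(f"{test_name}:{user_id}".encode()).hexdigest()
    return buckets[int(digest, 16) % len(buckets)]

# Same user, same test -> always the same bucket:
print(assign_bucket("user-42", "new-radio-ui"))
```

Because assignment is a pure function of (user, test), no assignment table needs to be stored or synchronized across services.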


How do we collect this data?

The three pillars of our Data Infrastructure:

•  Kafka: Collection
•  Hadoop: Processing
•  Databases: Analytics/Visualization

This is Dave. Data Engineer at Spotify by day…

…chiptune DJ Demoscene Time Machine by night.

Let’s listen to Dave’s song

Kafka

•  High volume pub-sub system

•  “Producers publish messages to Kafka topics, and consumers subscribe to these topics and consume the messages.”

Kafka

•  Robust and scalable solution for collection of logs
•  Fast data transfer
•  Low CPU overhead
•  Built-in partitioning, replication, and fault-tolerance
•  Consumers can pull data at different rates
•  Able to handle extremely high volumes
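The key property listed above, that consumers pull at their own pace, follows from Kafka's log-plus-offset design. A toy in-memory model of that core idea (illustration only; real Kafka adds partitioning, replication, and durable storage):

```python
from collections import defaultdict

class MiniLog:
    """Toy model of Kafka's core idea: producers append messages to a
    topic's log, and each consumer group tracks its own offset, so
    different consumers can read the same topic at different rates."""

    def __init__(self):
        self.topics = defaultdict(list)     # topic -> append-only log
        self.offsets = defaultdict(int)     # (group, topic) -> next offset

    def publish(self, topic, message):
        self.topics[topic].append(message)

    def consume(self, group, topic, max_messages=10):
        start = self.offsets[(group, topic)]
        batch = self.topics[topic][start:start + max_messages]
        self.offsets[(group, topic)] += len(batch)
        return batch

log = MiniLog()
log.publish("playback", {"user": "u1", "track": "t9"})
log.publish("playback", {"user": "u2", "track": "t3"})

# A batch loader and a realtime consumer read independently:
print(log.consume("hadoop-loader", "playback"))                  # both messages
print(log.consume("realtime-dash", "playback", max_messages=1))  # just the first
```

Because the broker only stores the log and each group remembers its own position, a slow consumer never blocks a fast one.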

Other people listened too!

Hadoop

•  Processes and stores massive amounts of unstructured data across a distributed cluster

•  Grew from one 37-node cluster to 690 nodes today
•  28 PB of storage
•  The largest Hadoop cluster in Europe

Hadoop

•  Entering the land of optimizations
•  Data retention policy
•  Move to JVM-based languages

•  MapReduce languages: Python with Hadoop Streaming, Java, Hive, Pig, Scala
•  Moving to Crunch, JVM-based, for speed and scalability
•  Sprunch: a Crunch wrapper for Scala, open-sourced by Spotify

•  Luigi: Spotify's open-sourced scheduler, written in Python
•  A simple and easy way to chain jobs
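The MapReduce model behind all of these languages can be sketched in plain Python: a mapper emits (key, 1) pairs and a reducer sums them per key. Under Hadoop Streaming the mapper and reducer would be separate scripts reading stdin and writing stdout, with the framework sorting between them; the log format here is hypothetical:

```python
import itertools

def mapper(lines):
    # Emit (track, 1) for every play event ("user<TAB>track" lines).
    for line in lines:
        user, track = line.split("\t")
        yield track, 1

def reducer(pairs):
    # Sort pairs by key, then sum counts within each key group,
    # mirroring Hadoop's shuffle-and-sort phase between map and reduce.
    for track, group in itertools.groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield track, sum(count for _, count in group)

logs = ["u1\tt9", "u2\tt3", "u3\tt9"]
print(dict(reducer(mapper(logs))))   # {'t3': 1, 't9': 2}
```

A scheduler like Luigi then chains such jobs, so the output of one (e.g. plays per track) becomes the declared input of the next (e.g. daily top lists).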

What if we want to know more?


Databases

•  Aggregates from Hadoop put into PostgreSQL or Cassandra

•  Sqoop •  Core data can be used and manipulated for various needs

•  Ad hoc queries •  Dashboards
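This Hadoop-to-database step can be sketched with the standard library: aggregates computed in Hadoop get loaded into a relational store that dashboards and ad hoc queries hit. Here sqlite3 stands in for PostgreSQL (in production this transfer is what Sqoop automates), and the table and figures are hypothetical:

```python
import sqlite3

# Daily aggregates as they might land after a Hadoop job (made-up numbers).
daily_plays = [("2014-06-09", "t9", 1_200_000),
               ("2014-06-09", "t3", 950_000)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE plays (day TEXT, track TEXT, count INTEGER)")
conn.executemany("INSERT INTO plays VALUES (?, ?, ?)", daily_plays)

# An ad hoc query a dashboard might run: top tracks for a day.
top = conn.execute(
    "SELECT track, count FROM plays WHERE day = ? ORDER BY count DESC",
    ("2014-06-09",),
).fetchall()
print(top)   # [('t9', 1200000), ('t3', 950000)]
```

The point of the split is that Hadoop does the heavy aggregation once, while the small result set lives in a database built for low-latency interactive queries.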


Questions?

A/B testing questions? Find me!


Thank you!
