Data at Spotify

Posted on 15-Jan-2015


DESCRIPTION

Data infrastructure at Spotify

TRANSCRIPT

June 12, 2014

Danielle Jabin (Data Engineer, A/B Testing)

Data at Spotify

I’m Danielle Jabin

•  Data Engineer in the Stockholm office
•  A/B testing infrastructure

•  California born & raised
•  If I can survive a Swedish winter, so can you!

•  Studied Computer Science, Statistics, and Real Estate through the M&T program at the University of Pennsylvania


Over 40 million active users

As of June 9, 2014  


Access to more than 20 million songs


Big Data

•  40 million Monthly Active Users
•  20+ million tracks
•  1.5 TB of compressed data from users per day
•  64 TB of data generated in Hadoop each day (including replication factor of 3)



So how much data is that?

Let’s compare: 64 TB

•  293,203,072 books (200 pages or 240,000 characters each)
•  16,777,216 MP3 files (with 4 MB average file size)
•  22,369,600 images (with 3 MB average file size)


That’s a lot of selfies


How do we use this data?

Use Cases

•  Reporting
•  Business Analytics
•  Operational Analytics
•  Product Features

Reporting

•  Reporting to labels, licensors, partners, and advertisers
•  We support our partners

Business Analytics

•  Analyzing growth, user behavior, sign-up funnels, etc.
•  Company KPIs
•  NPS analysis

Operational Analytics

•  Root cause analysis
•  Latency analysis
•  Better capacity planning (servers, people, bandwidth)

Product Features

•  Discover and Radio
•  Top lists
•  Personalized recommendations
•  A/B Testing
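A core building block of A/B testing infrastructure is assigning users to groups deterministically. A minimal sketch of one common approach, hashing the user ID together with the test name (illustrative only, not Spotify's actual implementation; the names here are hypothetical):

```python
import hashlib

def assign_bucket(user_id: str, test_name: str,
                  buckets=("control", "treatment")) -> str:
    """Deterministically assign a user to an A/B test group.

    Hashing user_id together with the test name keeps a user's
    assignment stable within one test, while assignments across
    different tests remain independent.
    """
    digest = hashlib.sha256(f"{test_name}:{user_id}".encode()).hexdigest()
    return buckets[int(digest, 16) % len(buckets)]

# Same user, same test -> always the same bucket:
print(assign_bucket("user-42", "new-radio-ui"))
```

Because assignment is a pure function of (user, test), no assignment table needs to be stored or synchronized across services.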


How do we collect this data?

The three pillars of our Data Infrastructure:

•  Kafka: Collection
•  Hadoop: Processing
•  Databases: Analytics/Visualization

This is Dave. Data Engineer at Spotify by day…

…chiptune DJ Demoscene Time Machine by night.

Let’s listen to Dave’s song

Kafka

•  High volume pub-sub system

•  “Producers publish messages to Kafka topics, and consumers subscribe to these topics and consume the messages.”

Kafka

•  Robust and scalable solution for collection of logs
•  Fast data transfer
•  Low CPU overhead
•  Built-in partitioning, replication, and fault-tolerance
•  Consumers can pull data at different rates
•  Able to handle extremely high volumes
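The key property listed above, that consumers pull at their own pace, follows from Kafka's log-plus-offset design. A toy in-memory model of that core idea (illustration only; real Kafka adds partitioning, replication, and durable storage):

```python
from collections import defaultdict

class MiniLog:
    """Toy model of Kafka's core idea: producers append messages to a
    topic's log, and each consumer group tracks its own offset, so
    different consumers can read the same topic at different rates."""

    def __init__(self):
        self.topics = defaultdict(list)     # topic -> append-only log
        self.offsets = defaultdict(int)     # (group, topic) -> next offset

    def publish(self, topic, message):
        self.topics[topic].append(message)

    def consume(self, group, topic, max_messages=10):
        start = self.offsets[(group, topic)]
        batch = self.topics[topic][start:start + max_messages]
        self.offsets[(group, topic)] += len(batch)
        return batch

log = MiniLog()
log.publish("playback", {"user": "u1", "track": "t9"})
log.publish("playback", {"user": "u2", "track": "t3"})

# A batch loader and a realtime consumer read independently:
print(log.consume("hadoop-loader", "playback"))                  # both messages
print(log.consume("realtime-dash", "playback", max_messages=1))  # just the first
```

Because the broker only stores the log and each group remembers its own position, a slow consumer never blocks a fast one.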

Other people listened too!

Hadoop

•  Processes and stores massive amounts of unstructured data across a distributed cluster

•  Grew from one 37-node cluster to 690 nodes today
•  28 PB of storage
•  The largest Hadoop cluster in Europe

Hadoop

•  Entering the land of optimizations
•  Data retention policy
•  Move to JVM-based languages

•  MapReduce languages: Python with Hadoop Streaming, Java, Hive, Pig, Scala
•  Moving to Crunch, JVM-based, for speed and scalability
•  Sprunch: a Crunch wrapper for Scala, open-sourced by Spotify

•  Luigi: Spotify's open-sourced scheduler, written in Python
•  A simple and easy way to chain jobs
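The MapReduce model behind all of these languages can be sketched in plain Python: a mapper emits (key, 1) pairs and a reducer sums them per key. Under Hadoop Streaming the mapper and reducer would be separate scripts reading stdin and writing stdout, with the framework sorting between them; the log format here is hypothetical:

```python
import itertools

def mapper(lines):
    # Emit (track, 1) for every play event ("user<TAB>track" lines).
    for line in lines:
        user, track = line.split("\t")
        yield track, 1

def reducer(pairs):
    # Sort pairs by key, then sum counts within each key group,
    # mirroring Hadoop's shuffle-and-sort phase between map and reduce.
    for track, group in itertools.groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield track, sum(count for _, count in group)

logs = ["u1\tt9", "u2\tt3", "u3\tt9"]
print(dict(reducer(mapper(logs))))   # {'t3': 1, 't9': 2}
```

A scheduler like Luigi then chains such jobs, so the output of one (e.g. plays per track) becomes the declared input of the next (e.g. daily top lists).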

What if we want to know more?


Databases

•  Aggregates from Hadoop put into PostgreSQL or Cassandra

•  Sqoop •  Core data can be used and manipulated for various needs

•  Ad hoc queries •  Dashboards
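This Hadoop-to-database step can be sketched with the standard library: aggregates computed in Hadoop get loaded into a relational store that dashboards and ad hoc queries hit. Here sqlite3 stands in for PostgreSQL (in production this transfer is what Sqoop automates), and the table and figures are hypothetical:

```python
import sqlite3

# Daily aggregates as they might land after a Hadoop job (made-up numbers).
daily_plays = [("2014-06-09", "t9", 1_200_000),
               ("2014-06-09", "t3", 950_000)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE plays (day TEXT, track TEXT, count INTEGER)")
conn.executemany("INSERT INTO plays VALUES (?, ?, ?)", daily_plays)

# An ad hoc query a dashboard might run: top tracks for a day.
top = conn.execute(
    "SELECT track, count FROM plays WHERE day = ? ORDER BY count DESC",
    ("2014-06-09",),
).fetchall()
print(top)   # [('t9', 1200000), ('t3', 950000)]
```

The point of the split is that Hadoop does the heavy aggregation once, while the small result set lives in a database built for low-latency interactive queries.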


Questions?

A/B testing questions? Find me!


Thank you!
