playlist recommendations @ spotify

Post on 12-Jan-2017

Category:

Engineering

TRANSCRIPT

Playlist Recommendations @ Spotify

Nikhil Tibrewal

@nikhil_tibrewal

Who am I?

Nikhil Tibrewal (Nick-hill)

● Data Engineer on Lambda squad (Spotify’s primary ML team)
● Graduated from Carnegie Mellon University in Dec 2013
● B.Sc. in Computer Science + additional major in Econ
● Been part of Spotify band for ~1.5 years
● Worked on a range of projects, primarily Playlist Recommendations

Spotify in numbers

● Started in 2006, 58 markets
● 75M+ active users, 20M+ paying
● 30M+ songs, 20K new per day
● 1.5+ billion playlists
● 1 TB logs per day

Recommendations so far on Spotify

● Discover tab
● Radio
● Related Artists
● Discover Weekly
● Playlist recs on “Now” Strip

For Ellie Goulding

“Now” Strip

Human curated playlist

Recommended playlist

But… how are playlist recs generated?

Quick Overview!

● Recommend only human curated playlists (1000+)
  ○ Well-designed cover images
  ○ Thorough descriptions
  ○ Title reflects content

[Figure: example playlists labeled “Good” and “Bad”]

ANNOY (Approximate Nearest Neighbors Oh Yeah), created at Spotify

https://github.com/spotify/annoy

Quick Overview!

● Recommendations pipeline: Candidate Generation
  ○ Generate N-dimensional track vectors from collaborative filtering
  ○ Vectorize playlists:
    ■ Playlist vector derived from track vectors in playlist
  ○ Use Annoy to store playlist vectors in N-dimensional space
  ○ Vectorize user taste as well:
    ■ User vector derived from user listening history
  ○ User and playlist vectors in same space!
  ○ Query for nearest playlists to user from Annoy tree

annoyTree.getNearest(seedVector, K)
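
The candidate generation steps above can be sketched in a few lines of Python. This is only an illustration under assumptions not stated on the slides: playlist and user vectors are taken as the plain mean of track vectors (the slides only say "derived from"), similarity is cosine, and a brute-force scan stands in for the Annoy tree. All names and the toy 2-dimensional vectors are made up.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_playlists(seed_vector, playlist_vectors, k):
    """Brute-force stand-in for annoyTree.getNearest(seedVector, K)."""
    ranked = sorted(
        playlist_vectors,
        key=lambda name: cosine_similarity(seed_vector, playlist_vectors[name]),
        reverse=True,
    )
    return ranked[:k]


# Toy track vectors; in the talk these come from collaborative filtering.
track_vectors = {
    "rock_1": [1.0, 0.0],
    "rock_2": [0.9, 0.1],
    "jazz_1": [0.0, 1.0],
    "jazz_2": [0.1, 0.9],
}
playlists = {
    "Rock Classics": ["rock_1", "rock_2"],
    "Late Night Jazz": ["jazz_1", "jazz_2"],
    "Mixed Bag": ["rock_1", "jazz_1"],
}

# Assumption: a playlist vector is the mean of its track vectors.
playlist_vectors = {
    name: mean_vector([track_vectors[t] for t in tracks])
    for name, tracks in playlists.items()
}

# Assumption: a user vector is the mean over the listening history,
# which places users and playlists in the same space.
listening_history = ["rock_1", "rock_2", "rock_1"]
user_vector = mean_vector([track_vectors[t] for t in listening_history])

top = nearest_playlists(user_vector, playlist_vectors, k=2)
print(top)
```

With 1000+ playlists a linear scan like this would still work; an approximate index such as Annoy matters more as the number of vectors or queries grows.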

Quick Overview!

● Recommendations pipeline: Ranking Model
  ○ Use genre information, demographics data, and playlist popularity data to further rank recommendations
    ■ John: 21, USA, likes rock
    ■ Should get rock playlist recs that are popular in USA and amongst 21-year-olds
  ○ Apply post-processing steps that shuffle and add variety to avoid repetition
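
A rough sketch of the ranking and post-processing idea, using the John example. The slides do not describe the actual model, so everything here is hypothetical: the weighted-sum score, the weights, the field names, and the rule of dropping recently shown playlists before shuffling.

```python
import random


def rank_candidates(candidates, user, weights=(2.0, 1.0, 1.0)):
    """Sort candidates by a hypothetical weighted sum of three signals."""
    w_genre, w_country, w_age = weights

    def score(playlist):
        genre_match = 1.0 if playlist["genre"] in user["genres"] else 0.0
        country_pop = playlist["popularity_by_country"].get(user["country"], 0.0)
        age_pop = playlist["popularity_by_age"].get(user["age"], 0.0)
        return w_genre * genre_match + w_country * country_pop + w_age * age_pop

    return sorted(candidates, key=score, reverse=True)


def diversify(ranked, recently_shown, top_n, rng):
    """Drop recently shown playlists, keep the top N, then shuffle them."""
    fresh = [p for p in ranked if p["name"] not in recently_shown]
    head = fresh[:top_n]
    rng.shuffle(head)
    return head


john = {"age": 21, "country": "US", "genres": {"rock"}}
candidates = [
    {"name": "Pop Hits", "genre": "pop",
     "popularity_by_country": {"US": 0.9}, "popularity_by_age": {21: 0.9}},
    {"name": "Rock Sweden", "genre": "rock",
     "popularity_by_country": {"US": 0.1, "SE": 0.9}, "popularity_by_age": {21: 0.3}},
    {"name": "Rock USA", "genre": "rock",
     "popularity_by_country": {"US": 0.9}, "popularity_by_age": {21: 0.8}},
]

ranked = rank_candidates(candidates, john)
shown = diversify(ranked, recently_shown={"Rock Sweden"}, top_n=2,
                  rng=random.Random(0))
print([p["name"] for p in ranked])
```

Passing the random generator in explicitly keeps the shuffle reproducible in tests while still varying what a user sees in production.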

90% of DAUs have recs!

Quick Overview!

● Infrastructure
  ○ Luigi to manage workflow (also built at Spotify)
  ○ Entire pipeline written in Scalding
  ○ 1200+ node Hadoop cluster to run jobs
  ○ Cassandra (~dozen nodes for playlist recs)
  ○ Java backend micro-services serving recs

Quick Overview!

"Scalding is comprised of a DSL (domain-specific language) that makes MapReduce computations look like Scala’s collection API and is a wrapper for Cascading to make it easy to define jobs, test and data sources on an HDFS" (http://cascading.io/customer/twitter/)

Scalding w.r.t. Playlist Recs

● Used Python back in the day
  ○ Inputs and outputs were tab separated
  ○ Complexity UP => Difficulty to maintain UP
  ○ Hard to write tests
● Scalding provided compile time error checks
  ○ Catch errors early
  ○ Define schemas (e.g. Avro)
● Can use Parquet + Avro for input/output
  ○ Easy to write and read data
  ○ Records with a lot of fields!
  ○ Lesson: Parquet hurts performance w/ fat columns (nested data structs)

Scalding w.r.t. Playlist Recs

● Data quality
  ○ Hadoop counters wrappers in extended Scalding library code
  ○ Verify counters within reasonable ranges
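
The "verify counters within reasonable ranges" step might look roughly like the sketch below. It is not the Spotify library code: the counter names, the ranges, and the `check_counters` helper are all invented for illustration, and the counters are a plain dict instead of real Hadoop counters.

```python
def check_counters(counters, expected_ranges):
    """Return a list of violations; an empty list means every counter is in range."""
    violations = []
    for name, (low, high) in expected_ranges.items():
        value = counters.get(name)
        if value is None:
            violations.append(f"{name}: missing")
        elif not (low <= value <= high):
            violations.append(f"{name}: {value} outside [{low}, {high}]")
    return violations


# Hypothetical counters collected from a finished job.
job_counters = {"playlists_read": 1200, "users_with_recs": 40}

# Hypothetical bounds a healthy run is expected to stay within.
expected = {
    "playlists_read": (1000, 5000),
    "users_with_recs": (1000, 10**9),
    "tracks_vectorized": (1, 10**9),
}

problems = check_counters(job_counters, expected)
for problem in problems:
    print("data quality check failed:", problem)
```

Failing the job when the list is non-empty stops bad data from flowing into later pipeline steps.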

Scalding w.r.t. Playlist Recs

● Pipeline tolerance
  ○ Job failures are normal, and annoying with big jobs
  ○ Scalding checkpoints
  ○ Lesson: checkpoint itself is a map-reduce job and has the same caveats
  ○ Still very helpful!
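
The idea behind checkpointing can be shown with a small local sketch: persist an expensive intermediate result, and on a rerun read it back instead of recomputing. This is not Scalding's checkpoint API; the `checkpointed` helper, the JSON file format, and the step names are assumptions made for the example.

```python
import json
import os
import tempfile


def checkpointed(path, compute):
    """Return the saved result at `path` if present, else compute and save it."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    result = compute()
    # Write to a temporary file first so a crash never leaves a half-written
    # checkpoint behind. Note that writing the checkpoint is itself work that
    # costs time and can fail, which mirrors the lesson on the slide.
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(result, f)
    os.replace(tmp_path, path)
    return result


calls = []


def expensive_step():
    calls.append(1)  # track how often the step really runs
    return {"playlist_count": 3}


with tempfile.TemporaryDirectory() as workdir:
    checkpoint_path = os.path.join(workdir, "step1.json")
    first = checkpointed(checkpoint_path, expensive_step)   # computes and saves
    second = checkpointed(checkpoint_path, expensive_step)  # loads from disk

print(len(calls))
```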

Scalding w.r.t. Playlist Recs

● Job runtimes
  ○ Common solutions: more reducers and code optimizations
  ○ Speculative execution for larger jobs
  ○ Caveat: can take up unnecessary resources

Scalding w.r.t. Playlist Recs

● Memory issues
  ○ Used Sparkey indices in Python (developed at Spotify, now open source)
    ■ “Simple constant key/value storage lib for read-heavy systems with infrequent large bulk inserts”
    ■ Replicated to all mappers
  ○ Complex jobs in Scalding => higher memory config for jobs with Sparkey
  ○ Lesson: trade memory resources for MAYBE a little more time with joins

bigPipe.join(exSparkeyPipe)

https://github.com/spotify/sparkey
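
The trade-off in the lesson above can be illustrated with two ways of attaching metadata to records. In this sketch a plain dict stands in for a Sparkey index replicated to every mapper, and a group-by-key function stands in for a distributed join such as `bigPipe.join(exSparkeyPipe)`. Neither uses the real Sparkey or Scalding APIs, and the data is made up.

```python
from collections import defaultdict


def map_side_lookup(records, lookup):
    """Each worker holds the whole lookup table in memory (Sparkey-style)."""
    return [(key, value, lookup[key]) for key, value in records if key in lookup]


def reduce_side_join(left, right):
    """Group both inputs by key, then pair values; no full table in memory per worker."""
    groups = defaultdict(lambda: ([], []))
    for key, value in left:
        groups[key][0].append(value)
    for key, value in right:
        groups[key][1].append(value)
    return [
        (key, left_value, right_value)
        for key in sorted(groups)
        for left_value in groups[key][0]
        for right_value in groups[key][1]
    ]


plays = [("t1", "user_a"), ("t2", "user_b"), ("t1", "user_c"), ("t9", "user_d")]
track_genres = [("t1", "rock"), ("t2", "jazz")]

via_lookup = map_side_lookup(plays, dict(track_genres))
via_join = reduce_side_join(plays, track_genres)
print(sorted(via_lookup) == sorted(via_join))
```

Both paths produce the same rows. The lookup needs memory proportional to the table on every worker, while the join pays for an extra shuffle instead.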

Scalding w.r.t. Playlist Recs

● Driven
  ○ “A sophisticated tool that collects telemetry data from running Scalding / Cascading jobs on a cluster and presenting them in an intriguing User Interface.”
  ○ http://cascading.io/

Scalding w.r.t. Playlist Recs

● Other awesome benefits
  ○ Active community + big players
  ○ Data pipeline flows naturally follow the functional paradigm - essentially writing Scala code

Scalding w.r.t. Playlist Recs

Productivity without sacrificing performance!

Status: Completed

Spotify is hiring!

Nikhil Tibrewal

@nikhil_tibrewal
