python

Post on 20-Jan-2015

22nd October 2012

Python <3 Content systems - managing millions of tracks for the masses

Tuesday, October 23, 12

> 18 M tracks

> 1 century of listening

> 20 k new tracks added per day

> 500 M playlists

> Available in 15 Countries

> 15 M active users*

* Users active within the previous 30 days

Service overview

...

Storage

User

Search

Metadata

AP

Content pipeline: Label A, Label B, Label C, Label D -> Ingestion

Image: Steve Jurvetson (CC BY 2.0) http://www.flickr.com/photos/jurvetson/916142/

XMLXMLXMLXML

Background image: lord enfield (CC BY 2.0) http://www.flickr.com/photos/42424413@N06/5064658450/

Ingestion

Ingestion: Delivery formats

~ 10 different incoming XML formats

- Proprietary formats (majors)

- Spotify delivery format (mostly indies)

Thousands of lines of source-specific code
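One way to picture the source-specific code is a set of small parsers, one per delivery format, that all emit the same normalized record. The sketch below is only an illustration: it uses the standard library's `xml.etree.ElementTree` instead of lxml so it runs without extra packages, and the two feed layouts, the root tags and the field names are invented here, not any label's real format.

```python
import xml.etree.ElementTree as ET

# Hypothetical feeds: two labels describing the same release differently.
FEED_A = "<release><title>Purple Rain</title><performer>Prince</performer></release>"
FEED_B = '<album name="Purple Rain"><artist name="Prince"/></album>'


def parse_format_a(root):
    """Format A (made up): values live in child element text."""
    return {"album": root.findtext("title"), "artist": root.findtext("performer")}


def parse_format_b(root):
    """Format B (made up): values live in attributes."""
    return {"album": root.get("name"), "artist": root.find("artist").get("name")}


# Registry keyed by the root tag of the delivery; real feeds would need a
# sturdier way to detect the format.
PARSERS = {"release": parse_format_a, "album": parse_format_b}


def ingest(xml_text):
    """Parse one delivery and return a record in the common shape."""
    root = ET.fromstring(xml_text)
    try:
        parser = PARSERS[root.tag]
    except KeyError:
        raise ValueError("unknown delivery format: %s" % root.tag)
    return parser(root)
```

Both `ingest(FEED_A)` and `ingest(FEED_B)` return the same dictionary, so everything downstream only has to understand one shape.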

Data model [simplified]

[Diagram: Album, Track, Artist, Disc, Rights, Audio and Transcoding entities, linked by 1 and * multiplicities]

Ingestion

lxml and XSLT with extensions for parsing/transforming XML

Ingestion: XPath extensions

>>> def formerlify(_, name):
...     return 'The artist formerly known as %s' % name

>>> # Namespace stuff
>>> from lxml import etree
>>> ns = etree.FunctionNamespace('http://my.org/myfunctions')
>>> ns['hello'] = formerlify
>>> ns.prefix = 'f'

>>> root = etree.XML('<a><b>Prince</b></a>')
>>> print(root.xpath('f:hello(string(b))'))
The artist formerly known as Prince

http://lxml.de/extensions.html#xpath-extension-functions

Ingestion

Fun (?!) fact: largest XML file seen so far had 3.3 million rows taking up 350 MB of disk space

Bible apparently fits in 3 MB XML

>>> import timeit
>>> timeit.timeit('e.parse("huge.xml")', setup='import lxml.etree as e', number=5) / 5
4.19...

>>> timeit.timeit('e.parse("huge.xml")', setup='import xml.etree.cElementTree as e', number=5) / 5
4.78...

>>> timeit.timeit('e.parse("huge.xml")', setup='import xml.etree.ElementTree as e', number=5) / 5
55.39...
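The timings above load the whole document with `parse`. For files of this size a streaming pass is another option; the sketch below uses the standard library's `iterparse` and is not something the slides show. The element name `track` and the sample document are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET


def count_tracks(source, tag="track"):
    """Stream through an XML document and count elements named `tag`.

    iterparse yields each element once its end tag has been read, so the
    element can be processed and then cleared instead of keeping all of
    its content in memory.
    """
    count = 0
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            count += 1
            elem.clear()  # drop the element's children and text
    return count


# Tiny in-memory stand-in for a large delivery file.
sample = io.BytesIO(
    b"<delivery>"
    b"<track><title>A</title></track>"
    b"<track><title>B</title></track>"
    b"</delivery>"
)
```

`count_tracks` accepts a file name or a file object, so the same function works on an on-disk delivery.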

Content pipeline: Label A, Label B, Label C, Label D -> Ingestion -> Merge

Centralized vs. aggregated cataloging

Requires merging!

Requires humans!

Image: Nicolas Genin (CC BY 2.0) http://www.flickr.com/photos/22785954@N08

Metadata - challenges

Content pipeline: Label A, Label B, Label C, Label D -> Ingestion -> Merge -> Curation/enrichment

Ambiguous artists - thesis work

• User input

• Machine learning

• Matching against external sources

• Feature selection (#matches per external source, len(name), country-count, multilingual)

• Matchings + preprocessing in Python

Content matching

(16 * 10 ** 6) ** 2 = A large number

Reduce search space:
>>> from unicodedata import normalize
>>> key = ''.join(normalize('NFD', char)[0].lower() for char in title)[:5]

Side note: Levenshtein (edit) distance is a heavy operation

-> sped up about 4x with PyPy (or use a C extension)
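Comparing every title with every other title means (16 * 10 ** 6) ** 2 = 2.56 * 10 ** 14 comparisons, which is why the key matters. The sketch below shows one way the key could be used: bucket titles by key and run the expensive edit distance only inside each bucket. It assumes the key is the first five normalized characters; the bucket logic, the distance threshold of 2 and the pure-Python Levenshtein function are illustrative choices, not the talk's implementation.

```python
from collections import defaultdict
from itertools import combinations
from unicodedata import normalize


def blocking_key(title, length=5):
    """Strip accents, lowercase, and keep the first `length` characters."""
    return "".join(normalize("NFD", char)[0].lower() for char in title)[:length]


def levenshtein(a, b):
    """Classic dynamic-programming edit distance, one row at a time."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def candidate_pairs(titles, max_distance=2):
    """Compare titles only within the same bucket instead of all against all."""
    buckets = defaultdict(list)
    for title in titles:
        buckets[blocking_key(title)].append(title)
    pairs = []
    for bucket in buckets.values():
        for left, right in combinations(bucket, 2):
            if levenshtein(left.lower(), right.lower()) <= max_distance:
                pairs.append((left, right))
    return pairs
```

The trade-off is that two titles differing within the first five characters land in different buckets and are never compared, so the key length is a tuning knob between speed and recall.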

Automatic data processing will never be perfect

Patch it!

Content pipeline: Label A, Label B, Label C, Label D -> Ingestion -> Merge -> Curation/enrichment -> Transcoding

Transcoding

Asynchronous

RabbitMQ + amqplib

Master / workers
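The master/workers shape can be sketched without a broker. In the example below an in-process `queue.Queue` stands in for RabbitMQ and threads stand in for worker processes, so it shows only the pattern, not the amqplib API. The `transcode` function, the track ids and the format names are placeholders invented for illustration.

```python
import queue
import threading


def transcode(track_id, fmt):
    """Placeholder for the real work; returns a made-up output name."""
    return "%s.%s" % (track_id, fmt)


def worker(jobs, results):
    """Take jobs off the queue until a None sentinel arrives."""
    while True:
        job = jobs.get()
        if job is None:
            break
        results.put(transcode(*job))


def run_master(track_ids, formats, num_workers=3):
    """Master: enqueue one job per (track, format) and collect the results."""
    jobs, results = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(jobs, results))
               for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for track_id in track_ids:
        for fmt in formats:
            jobs.put((track_id, fmt))
    for _ in threads:
        jobs.put(None)  # one stop signal per worker
    for thread in threads:
        thread.join()
    return sorted(results.get() for _ in range(len(track_ids) * len(formats)))
```

Because the master only enqueues jobs and never waits on an individual worker, adding workers raises throughput without changing the master.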

Content pipeline: Label A, Label B, Label C, Label D -> Ingestion -> Merge -> Curation/enrichment -> Transcoding -> Indexing

Index build

• Nightly batch job on db-dumps

• Previously mostly Python, but now moved to Java for performance reasons

• But still lots of Python helper scripts :)

Content pipeline: Label A, Label B, Label C, Label D -> Ingestion -> Merge -> Curation/enrichment -> Transcoding -> Indexing -> Publishing -> On-site live services, e.g. search, browse

Distribution/publish

Index A

Index B

Index C

Service A

Service B

Service C

Scheduling being migrated to ZooKeeper

image: http://www.flickr.com/photos/seattlemunicipalarchives/with/3797940791/

Distribution/publish

Staged rollout

Distribution/publish

Exponential back-off
waiting 5s ...
waiting 10s ...
waiting 30s ...
waiting 60s ...
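A retry loop with exponential back-off might look like the sketch below. It doubles the wait after each failure and caps it at 60 seconds, giving 5, 10, 20, 40, 60, which is close to but not identical to the waits shown on the slide. The function names, the attempt limit and the injectable `sleep` parameter are assumptions made for illustration.

```python
import time


def backoff_delays(initial=5, factor=2, maximum=60):
    """Yield waits that grow by `factor` each time, capped at `maximum`."""
    delay = initial
    while True:
        yield min(delay, maximum)
        delay *= factor


def publish_with_retry(publish, max_attempts=5, sleep=time.sleep):
    """Call `publish` until it succeeds or the attempts run out."""
    delays = backoff_delays()
    for attempt in range(1, max_attempts + 1):
        try:
            return publish()
        except Exception:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            wait = next(delays)
            print("waiting %ds ..." % wait)
            sleep(wait)
```

Passing `sleep` as a parameter keeps the loop testable: a test can substitute a function that records the waits instead of actually sleeping.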

Store ’da data

Choice of database

Depends on the use case - duh!

• PostgreSQL (e.g. user service)

• Cassandra (e.g. playlist service)

• Tokyo Cabinet (e.g. browse service)

• Lucene (search service)

• HDFS

PostgreSQL

[Pic. of elephant]

Image: http2007 (CC BY 2.0) http://www.flickr.com/photos/42424413@N06/5064658450/

Redundancy + scaling: master/slave

Joins and subqueries - let the query planner roll!

Python?
- psycopg2 + SQL-queries
- SQLAlchemy migrator for versioning of db-schemas

Tip! Server side, aka named, cursors:
import psycopg2
conn = psycopg2.connect(database='huge_db', user='postgres', password='secret')
sscursor = conn.cursor('my_cursor')
sscursor.execute('SELECT * FROM big_table')
rows = sscursor.fetchmany(1000)
...
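The point of a named cursor is to pull rows in batches rather than all at once, and the loop that usually follows `fetchmany` works with any DB-API cursor. The sketch below uses an in-memory `sqlite3` database as a stand-in so it runs without a PostgreSQL server. sqlite3 has no server-side cursors, so this shows only the batching loop; the `iter_rows` helper and the table contents are made up for illustration.

```python
import sqlite3


def iter_rows(cursor, batch_size=1000):
    """Yield rows from a DB-API cursor one batch at a time."""
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        for row in rows:
            yield row


# Stand-in for a large table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE big_table (id INTEGER)")
conn.executemany("INSERT INTO big_table VALUES (?)", [(i,) for i in range(2500)])

cursor = conn.cursor()
cursor.execute("SELECT id FROM big_table ORDER BY id")
total = sum(1 for _ in iter_rows(cursor))
```

With psycopg2 the same generator could be handed a named cursor, in which case each `fetchmany` call asks the server for the next batch.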

Scaling the content pipeline

What to scale for?

Size of catalog

# Users

Thank you

henok@spotify.com

Distribution/publish

Popen + gevent (although IO-bound)

import subprocess

import gevent
from gevent import monkey

monkey.patch_all()

# Poll instead of blocking in wait(), so other greenlets can run meanwhile
def _wait(self):
    while True:
        res = self.poll()
        if res is not None:
            return res
        gevent.sleep(0.1)

subprocess.Popen.wait = _wait