python

Upload: henok80

Post on 20-Jan-2015

Page 1: Python

22nd October 2012

Python <3 Content systems - managing millions of tracks for the masses

Tuesday, October 23, 12

Page 14: Python

> 18 M tracks

> 1 century of listening

> 20 k new tracks added per day

> 500 M playlists

> Available in 15 Countries

> 15 M active users*

* Users active within the previous 30 days

Page 24: Python

Service overview

...

Storage

User

Search

Metadata

AP

Page 26: Python

Content pipeline

Label A, Label B, Label C, Label D

Ingestion

Image: Steve Jurvetson (CC BY 2.0) http://www.flickr.com/photos/jurvetson/916142/

Page 27: Python

Ingestion

XML XML XML XML

Background image: lord enfield (CC BY 2.0) http://www.flickr.com/photos/42424413@N06/5064658450/

Page 32: Python

Ingestion: Delivery formats

~ 10 different incoming XML formats

- Proprietary formats (majors)

- Spotify delivery format (mostly indies)

Thousands of lines of source-specific code
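One way to keep source-specific parsing contained is to put each delivery format behind a common interface. The sketch below is only an illustration under stated assumptions: the two formats (`label_a`, `label_b`), their XML layouts, and the `Track` record are all made up, and it uses the standard-library parser instead of the lxml and XSLT setup the talk describes.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, List, NamedTuple


class Track(NamedTuple):
    """Hypothetical common record that every source format is mapped onto."""
    title: str
    artist: str


def parse_label_a(xml_text: str) -> List[Track]:
    # Made-up format A: <release><track title="..." artist="..."/></release>
    root = ET.fromstring(xml_text)
    return [Track(t.get("title"), t.get("artist")) for t in root.iter("track")]


def parse_label_b(xml_text: str) -> List[Track]:
    # Made-up format B: <album><song><name>..</name><by>..</by></song></album>
    root = ET.fromstring(xml_text)
    return [Track(s.findtext("name"), s.findtext("by")) for s in root.iter("song")]


# One parser per delivery format; a new source means one new entry here.
PARSERS: Dict[str, Callable[[str], List[Track]]] = {
    "label_a": parse_label_a,
    "label_b": parse_label_b,
}


def ingest(source: str, xml_text: str) -> List[Track]:
    """Dispatch a delivery to the parser registered for its source."""
    return PARSERS[source](xml_text)


delivery_a = '<release><track title="Kiss" artist="Prince"/></release>'
delivery_b = "<album><song><name>Kiss</name><by>Prince</by></song></album>"
```

Both deliveries end up as the same `Track` record, so everything downstream of ingestion only has to deal with one shape of data.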

Page 33: Python

Data model [simplified]

Entities: Album, Track, Artist, Disc, Rights, Audio, Transcoding

[Diagram: relations between the entities carry 1 and * cardinalities]

Page 34: Python

Ingestion

lxml and XSLT with extensions for parsing/transforming XML

Page 35: Python

Ingestion: XPath extensions

>>> from lxml import etree
>>> def formerlify(_, name):
...     return 'The artist formerly known as %s' % name
...
>>> # Register the function in a custom XPath function namespace
>>> ns = etree.FunctionNamespace('http://my.org/myfunctions')
>>> ns['formerlify'] = formerlify
>>> ns.prefix = 'f'
>>> root = etree.XML('<a><b>Prince</b></a>')
>>> print(root.xpath('f:formerlify(string(b))'))
The artist formerly known as Prince

http://lxml.de/extensions.html#xpath-extension-functions

Page 39: Python

Ingestion

>>> import timeit
>>> # Average seconds per parse over 5 runs
>>> timeit.timeit('e.parse("huge.xml")', setup='import lxml.etree as e', number=5) / 5
4.19...
>>> timeit.timeit('e.parse("huge.xml")', setup='import xml.etree.cElementTree as e', number=5) / 5
4.78...
>>> timeit.timeit('e.parse("huge.xml")', setup='import xml.etree.ElementTree as e', number=5) / 5
55.39...

Fun (?!) fact: largest XML file seen so far had 3.3 million rows taking up 350 MB of disk space

The Bible apparently fits in 3 MB of XML
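For files of this size, an incremental parse is an alternative to loading the whole tree at once. This is a minimal sketch using the standard library's `iterparse`, not something shown in the talk; the `<track>` element name and the in-memory sample are assumptions for illustration.

```python
import io
import xml.etree.ElementTree as ET


def count_tracks(source) -> int:
    """Count <track> elements without keeping their full subtrees in memory."""
    count = 0
    # iterparse yields each element once its closing tag has been read.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "track":
            count += 1
            # Free the element's children and text; only an empty shell
            # stays attached to the parent.
            elem.clear()
    return count


# Small in-memory stand-in for a huge delivery file.
sample = io.BytesIO(
    b"<catalog>" + b"<track><title>t</title></track>" * 1000 + b"</catalog>"
)
n_tracks = count_tracks(sample)
```

`lxml.etree` offers an `iterparse` with a similar interface, so the same pattern should carry over to the faster parser.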

Page 42: Python

Content pipeline

Label A, Label B, Label C, Label D

Ingestion

Merge

Image: Steve Jurvetson (CC BY 2.0) http://www.flickr.com/photos/jurvetson/916142/

Page 43: Python

Centralized vs. aggregated cataloging

Requires merging!

Requires humans!

Page 44: Python

Image: Nicolas Genin (CC BY 2.0) http://www.flickr.com/photos/22785954@N08

Metadata - challenges

Page 48: Python

Content pipeline

Label A, Label B, Label C, Label D

Ingestion

Merge

Curation/enrichment

Image: Steve Jurvetson (CC BY 2.0) http://www.flickr.com/photos/jurvetson/916142/

Page 54: Python

Ambiguous artists - thesis work

• User input

• Machine learning

• Matching against external sources

• Feature selection (#matches per external source, len(name), country-count, multilingual)

• Matching + preprocessing in Python
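The feature list above could be turned into a numeric vector roughly as follows. This is a hedged sketch: the function name, the argument shapes, the placeholder source names, and the reading of "multilingual" as "more than one language seen" are all assumptions, not details from the thesis work.

```python
from typing import Dict, List, Set


def artist_features(
    name: str,
    matches_by_source: Dict[str, int],
    countries: Set[str],
    languages: Set[str],
    sources: List[str],
) -> List[float]:
    """Turn one artist entry into a numeric feature vector."""
    # Number of matches per external source, in a fixed source order
    features = [float(matches_by_source.get(s, 0)) for s in sources]
    features.append(float(len(name)))        # len(name)
    features.append(float(len(countries)))   # country-count
    # multilingual flag (assumed meaning: more than one language observed)
    features.append(1.0 if len(languages) > 1 else 0.0)
    return features


SOURCES = ["source_x", "source_y"]  # placeholder external catalogues
vector = artist_features(
    name="Nirvana",
    matches_by_source={"source_x": 3, "source_y": 1},
    countries={"US", "GB"},
    languages={"en", "sv"},
    sources=SOURCES,
)
```

A vector like this is what a classifier would consume to decide whether one artist name hides several distinct artists.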

Page 58: Python

Content matching

(16 * 10 ** 6) ** 2 = 2.56 * 10 ** 14, i.e. a large number

Reduce search space:

>>> from unicodedata import normalize
>>> key = ''.join(normalize('NFD', char)[0].lower() for char in title)[:5]

Side note: Levenshtein (edit) distance is a heavy operation

-> sped up about 4x with PyPy (or use a C extension)
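The two ideas on this slide can be combined: group titles by the normalized prefix key, then run the expensive edit distance only within each group. The sketch below is an illustration under assumptions, not the production matcher; the function names, the 5-character key length, and the distance threshold of 2 are all hypothetical choices.

```python
from collections import defaultdict
from itertools import combinations
from unicodedata import normalize


def blocking_key(title: str, length: int = 5) -> str:
    """Accent-stripped, lower-cased prefix of the title."""
    return "".join(normalize("NFD", ch)[0].lower() for ch in title)[:length]


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, one row at a time."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]


def candidate_pairs(titles, max_distance=2):
    """Compare titles only within the same block."""
    blocks = defaultdict(list)
    for title in titles:
        blocks[blocking_key(title)].append(title)
    pairs = []
    for block in blocks.values():
        for a, b in combinations(block, 2):
            if levenshtein(a.lower(), b.lower()) <= max_distance:
                pairs.append((a, b))
    return pairs


titles = ["Bohemian Rhapsody", "Bohémian Rapsody", "Purple Rain"]
pairs = candidate_pairs(titles)
```

The trade-off is that two titles differing within their first five characters land in different blocks and are never compared, so blocking buys speed at the cost of some recall.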

Page 60: Python

Automatic data processing will never be perfect

Patch it!
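One possible reading of "Patch it!" is a layer of manual corrections applied on top of the automatically produced metadata. The sketch below assumes exactly that; the patch structure, the record ids, and the function name are hypothetical and not taken from the talk.

```python
from typing import Dict, Tuple

# Manual corrections keyed by (record id, field); all names are hypothetical.
Patches = Dict[Tuple[str, str], str]


def apply_patches(record_id: str, record: Dict[str, str], patches: Patches) -> Dict[str, str]:
    """Return a copy of the record with human-made overrides applied."""
    patched = dict(record)
    for (target_id, field), value in patches.items():
        if target_id == record_id:
            patched[field] = value
    return patched


automatic = {"title": "Purple Rain", "artist": "prince"}
patches = {("track:1", "artist"): "Prince"}
result = apply_patches("track:1", automatic, patches)
```

Keeping patches separate from the automatic output means a re-run of the pipeline does not wipe out human corrections.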

Page 65: Python

Content pipeline

Label A, Label B, Label C, Label D

Ingestion

Merge

Curation/enrichment

Transcoding

Image: Steve Jurvetson (CC BY 2.0) http://www.flickr.com/photos/jurvetson/916142/

Page 66: Python

Transcoding

Asynchronous

RabbitMQ + amqplib

Master / workers
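The master/workers pattern can be sketched with the standard library alone. Note the swap: this example uses an in-process `queue.Queue` and threads where the talk uses RabbitMQ with amqplib, and the `transcode` function and the `.ogg` suffix are placeholders rather than the real transcoding step.

```python
import queue
import threading


def transcode(track_id: str) -> str:
    # Placeholder for the real (slow) transcoding step.
    return f"{track_id}.ogg"


def worker(jobs: queue.Queue, results: queue.Queue) -> None:
    while True:
        track_id = jobs.get()
        if track_id is None:  # sentinel: no more work
            jobs.task_done()
            break
        results.put(transcode(track_id))
        jobs.task_done()


def run_master(track_ids, n_workers=3):
    """Publish jobs, let workers consume them, then collect the results."""
    jobs, results = queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(jobs, results))
        for _ in range(n_workers)
    ]
    for t in threads:
        t.start()
    for track_id in track_ids:
        jobs.put(track_id)
    for _ in threads:
        jobs.put(None)  # one sentinel per worker
    jobs.join()
    for t in threads:
        t.join()
    return sorted(results.get() for _ in range(len(track_ids)))


outputs = run_master(["a", "b", "c", "d"])
```

With a message broker in place of the in-process queue, the workers can run on separate machines and the master does not have to wait for them.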

Page 72: Python

Content pipeline

Label A, Label B, Label C, Label D

Ingestion

Merge

Curation/enrichment

Transcoding

Indexing

Image: Steve Jurvetson (CC BY 2.0) http://www.flickr.com/photos/jurvetson/916142/

Page 76: Python

Index build

• Nightly batch job on db-dumps

• Previously mostly Python, but now moved to Java for performance reasons

• But still lots of Python helper scripts :)

Page 83: Python

Content pipeline

Label A, Label B, Label C, Label D

Ingestion

Merge

Curation/enrichment

Transcoding

Indexing

Publishing

On-site live services, e.g. search, browse

Image: Steve Jurvetson (CC BY 2.0) http://www.flickr.com/photos/jurvetson/916142/

Page 89: Python

Distribution/publish

Index A

Index B

Index C

Service A

Service B

Service C

Page 90: Python

Scheduling being migrated to ZooKeeper

image: http://www.flickr.com/photos/seattlemunicipalarchives/with/3797940791/

Page 91: Python

Distribution/publish

Staged rollout
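A staged rollout can be sketched as deploying to growing batches of hosts and stopping as soon as a batch looks unhealthy. This is a generic illustration, not the mechanism used in the talk; the `deploy` and `healthy` callbacks, the host names, and the stage sizes are all hypothetical.

```python
from typing import Callable, List, Sequence


def staged_rollout(
    hosts: Sequence[str],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
    stage_sizes: Sequence[int] = (1, 2),
) -> List[str]:
    """Deploy to growing batches of hosts, stopping after the first unhealthy batch."""
    done: List[str] = []
    remaining = list(hosts)
    sizes = list(stage_sizes)
    while remaining:
        # Use the configured stage sizes first, then take everything that is left.
        size = sizes.pop(0) if sizes else len(remaining)
        batch, remaining = remaining[:size], remaining[size:]
        for host in batch:
            deploy(host)
            done.append(host)
        if not all(healthy(h) for h in batch):
            break  # halt before the bad index reaches more hosts
    return done


deployed: List[str] = []
reached = staged_rollout(
    hosts=["s1", "s2", "s3", "s4", "s5", "s6"],
    deploy=deployed.append,
    healthy=lambda host: host != "s3",  # pretend s3 fails its health check
)
```

In this run the rollout stops after the second batch, so three of six hosts never receive the faulty index.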

Page 97: Python

Distribution/publish

Exponential back-off

waiting 5s ...

waiting 10s ...

waiting 30s ...

waiting 60s ...
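A retry loop with the delays from the slide might look like the sketch below. The function name, the `action` callback, and the injectable `sleep` parameter are assumptions made for illustration; the delay schedule (5, 10, 30, 60 seconds) is the one shown on the slide, which grows quickly without being a strict doubling.

```python
import time
from typing import Callable, Sequence

# Delays as shown on the slide, in seconds.
DELAYS = (5, 10, 30, 60)


def retry_with_backoff(
    action: Callable[[], bool],
    delays: Sequence[float] = DELAYS,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call action until it succeeds, waiting longer after each failure."""
    if action():
        return True
    for delay in delays:
        print(f"waiting {delay}s ...")
        sleep(delay)
        if action():
            return True
    return False  # gave up after the last delay


# Usage with a fake action that succeeds on its third call and a no-op sleep.
attempts = []


def flaky() -> bool:
    attempts.append(1)
    return len(attempts) >= 3


waited = []
ok = retry_with_backoff(flaky, sleep=waited.append)
```

Passing `sleep` as a parameter keeps the example testable without actually waiting; in real use the default `time.sleep` applies.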

Page 99: Python

Store ’da data

Page 106: Python

Choice of database

Depends on the use case - duh!

• PostgreSQL (e.g. user service)

• Cassandra (e.g. playlist service)

• Tokyo Cabinet (e.g. browse service)

• Lucene (search service)

• HDFS

Page 107: Python

PostgreSQL

[Pic. of elephant]

Image: http2007 (CC BY 2.0) http://www.flickr.com/photos/42424413@N06/5064658450/

Page 108: Python

PostgreSQL

Redundancy + scaling: master/slave

Page 109: Python

PostgreSQL

Joins and subqueries - let the query planner roll!

Page 112: Python

PostgreSQL

Python?

- psycopg2 + SQL queries

- SQLAlchemy migrator for versioning of db-schemas

Tip! Server-side, aka named, cursors:

import psycopg2

conn = psycopg2.connect(database='huge_db', user='postgres', password='secret')
# Passing a name makes this a server-side cursor: rows stay on the server until fetched
sscursor = conn.cursor('my_cursor')
sscursor.execute('SELECT * FROM big_table')
rows = sscursor.fetchmany(1000)
...

Page 113: Python

Scaling the content pipeline

What to scale for?

Page 114: Python

Scaling the content pipeline

Size of catalog

Page 115: Python

Scaling the content pipeline

Number of users

Page 117: Python

Distribution/publish

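Popen + gevent (although IO-bound)

import subprocess

import gevent
import gevent.monkey

gevent.monkey.patch_all()

def _wait(self):
    # Poll the child process and yield to other greenlets between polls,
    # instead of blocking everything in the original Popen.wait()
    while True:
        res = self.poll()
        if res is not None:
            return res
        gevent.sleep(0.1)

subprocess.Popen.wait = _wait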