MongoDB for storing humongous music database

Working with Humongous Music Database MongoDB Prasoon Kumar #HyderabadDataScienceGroup

Upload: prasoon-kumar

Post on 20-Jun-2015


DESCRIPTION

MusicBrainz is an encyclopedia of music tracks, artists and albums. It is available as a PostgreSQL database under a CC license. Two approaches to loading the database into MongoDB are examined: in the first, 4 tables are denormalized in Postgres and then loaded into MongoDB; in the second, the tables are loaded into MongoDB and denormalized into a single collection there. We also show MongoDB's full-text index.

TRANSCRIPT

Page 1: MongoDB for storing humongous music database

Working with Humongous Music Database

MongoDB

Prasoon Kumar

#HyderabadDataScienceGroup

Page 2: MongoDB for storing humongous music database

Agenda

•  MongoDB Features

•  Bulk Import

•  Full Text Index creation

•  Full Text Search

•  MusicBrainz Database

Page 3: MongoDB for storing humongous music database

MUSIC BRAINZ

Page 4: MongoDB for storing humongous music database

What is MusicBrainz?

•  MusicBrainz is a community-maintained open source encyclopedia of music information.

•  This means that anyone - including you - can help contribute to the project by adding information about your favorite artists and their related works.

•  Robert Kaye founded MusicBrainz. The project has grown rapidly from a one-man operation to an international community of enthusiasts who appreciate both music and music metadata.

Page 5: MongoDB for storing humongous music database

MusicBrainz

•  Along the way, the scope of the project has expanded from its origins as a mere CDDB replacement to today, where MusicBrainz has become a true encyclopedia of music.

•  As an encyclopedia and as a community, MusicBrainz exists solely to collect as much information about music as we can without discriminating or preferring one "type" of music over another.

Page 6: MongoDB for storing humongous music database

MusicBrainz Database

The MusicBrainz Database is where all of the various pieces of information we collect about music are stored, from artists and their releases to works and their composers, and of course much more. The majority of the data in the MusicBrainz Database is placed in the Public Domain, which means that anyone can download the data and use it in any way they see fit. The remaining data is released under a Creative Commons Attribution-NonCommercial-ShareAlike 2.0 license.

Page 7: MongoDB for storing humongous music database

MongoDB

Document Database

Open-Source

General Purpose

Page 8: MongoDB for storing humongous music database

Scalability

Auto-Sharding

•  Increase capacity as you go

•  Commodity and cloud architectures

•  Improved operational simplicity and cost visibility

Page 9: MongoDB for storing humongous music database

Drivers & Ecosystem

Support for the most popular languages and frameworks

•  Java

•  Python

•  Perl

•  Ruby

•  Morphia

•  MEAN Stack

Page 10: MongoDB for storing humongous music database

Music Mongo

•  Load (import)

•  Run

–  Exact match

–  Full text search

•  Todo

–  Application interface

Page 11: MongoDB for storing humongous music database

AWS Setup

s0 54.225.100.65

s1 54.235.157.214

s2 54.225.100.42

Client & mongos 54.225.100.39

config 184.73.195.120

Page 12: MongoDB for storing humongous music database

Relevant schema of MusicBrainz:

Page 13: MongoDB for storing humongous music database

Import strategies

•  Denormalized from source DB

–  Import TSV into PostgreSQL

–  Export joined tables from PostgreSQL

–  mongoimport TSV

•  Separate collections from TSV

–  mongoimport TSVs into temporary collections

–  “Join” temporary collections in client (PyMongo) and insert into the destination collection
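As a rough illustration of the mongoimport step shared by both strategies, the sketch below builds a mongoimport command line from Python. The database, collection, file and column names are placeholders and do not come from the talk. It assumes the TSV file has no header row, so the column names are passed with `--fields`.

```python
import shlex
import subprocess


def mongoimport_args(db, collection, tsv_path, fields, host=None):
    """Build the argument list for importing a header-less TSV file."""
    args = [
        "mongoimport",
        "--db", db,
        "--collection", collection,
        "--type", "tsv",
        "--fields", ",".join(fields),
        "--file", tsv_path,
    ]
    if host:
        # For a sharded cluster this would point at the mongos router.
        args += ["--host", host]
    return args


def run_import(args):
    """Run mongoimport; needs the tool on PATH and a reachable server."""
    subprocess.run(args, check=True)


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    cmd = mongoimport_args("musicbrainz", "track_tmp", "track.tsv",
                           ["id", "recording", "medium", "name"])
    print(shlex.join(cmd))
```

Building the argument list separately from running it makes it easy to print and inspect the command before pointing it at a real cluster.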

Page 14: MongoDB for storing humongous music database

Steps for creating denormalized table:
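A minimal sketch of what such a denormalizing join could look like, assuming heavily simplified columns for the recording, artist_credit, track, release and medium tables. The column names, the `track_denormalized` table name and the sample rows are assumptions made for illustration. The talk used PostgreSQL; `sqlite3` is used here only so the sketch runs without a server.

```python
import sqlite3

# Simplified, assumed columns; the real MusicBrainz tables carry many more.
SCHEMA = """
CREATE TABLE artist_credit (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE recording (id INTEGER PRIMARY KEY, name TEXT, artist_credit INTEGER);
CREATE TABLE "release" (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE medium (id INTEGER PRIMARY KEY, "release" INTEGER, position INTEGER);
CREATE TABLE track (id INTEGER PRIMARY KEY, recording INTEGER, medium INTEGER,
                    position INTEGER, name TEXT);
"""

# One wide row per track, ready to be exported as TSV.
DENORMALIZE = """
CREATE TABLE track_denormalized AS
SELECT t.id       AS track_id,
       t.name     AS track_name,
       t.position AS track_position,
       r.name     AS recording_name,
       ac.name    AS artist_name,
       m.position AS medium_position,
       rel.name   AS release_name
FROM track t
JOIN recording r      ON r.id = t.recording
JOIN artist_credit ac ON ac.id = r.artist_credit
JOIN medium m         ON m.id = t.medium
JOIN "release" rel    ON rel.id = m."release";
"""


def denormalize_demo():
    """Load a tiny made-up sample, run the join, and return the wide rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO artist_credit VALUES (?, ?)", (1, "Sample Artist"))
    conn.execute("INSERT INTO recording VALUES (?, ?, ?)", (10, "Sample Song", 1))
    conn.execute('INSERT INTO "release" VALUES (?, ?)', (100, "Sample Album"))
    conn.execute("INSERT INTO medium VALUES (?, ?, ?)", (1000, 100, 1))
    conn.execute("INSERT INTO track VALUES (?, ?, ?, ?, ?)",
                 (5, 10, 1000, 3, "Sample Song"))
    conn.executescript(DENORMALIZE)
    rows = conn.execute(
        "SELECT track_id, track_name, artist_name, release_name "
        "FROM track_denormalized"
    ).fetchall()
    conn.close()
    return rows
```

The wide table can then be exported as TSV and loaded with mongoimport in a single pass.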

Page 15: MongoDB for storing humongous music database

Client join
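The client-side join could be sketched roughly as follows. The field names and the shape of the output document are assumptions, not the talk's actual code. The functions work on plain iterables of dicts so the sketch runs without a server; with PyMongo the inputs could be cursors over the temporary collections.

```python
def index_by_id(docs):
    """Map each document's `id` field to the document for fast lookups."""
    return {doc["id"]: doc for doc in docs}


def join_tracks(tracks, recordings, mediums, releases, artist_credits):
    """Yield one denormalized document per track.

    Each argument is any iterable of dicts; with PyMongo these could be
    cursors such as db.track_tmp.find() (collection names are hypothetical).
    """
    recording_by_id = index_by_id(recordings)
    medium_by_id = index_by_id(mediums)
    release_by_id = index_by_id(releases)
    credit_by_id = index_by_id(artist_credits)

    for track in tracks:
        recording = recording_by_id.get(track["recording"])
        medium = medium_by_id.get(track["medium"])
        if recording is None or medium is None:
            continue  # skip tracks with dangling references
        release = release_by_id.get(medium["release"], {})
        credit = credit_by_id.get(recording["artist_credit"], {})
        yield {
            "name": track["name"],
            "position": track["position"],
            "recording": recording["name"],
            "artist": credit.get("name"),
            "release": release.get("name"),
        }


def batches(docs, size=1000):
    """Group documents into lists, e.g. for collection.insert_many(batch)."""
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch
```

Holding the lookup tables in memory keeps the sketch short; for tables with millions of rows, indexed queries against the temporary collections may be a better fit.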

Page 16: MongoDB for storing humongous music database

Import statistics

recording:

2013-11-11T22:02:51.213+0000 imported 12817015 objects real 69m49.949s

artist_credit:

2013-11-11T22:04:41.469+0000 imported 756247 objects real 1m50.256s

track:

2013-11-11T22:48:59.423+0000 imported 15427255 objects real 44m17.973s

release:

2013-11-11T22:53:06.627+0000 imported 1208854 objects real 4m7.183s

medium:

2013-11-11T22:57:45.030+0000 imported 1343234 objects real 4m38.414s
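To compare the imports, the counts and wall-clock times above can be turned into objects per second. The sketch below copies the figures from this slide; the helper names are illustrative.

```python
import re

# (objects imported, wall-clock time) as reported on the slide.
STATS = {
    "recording": (12817015, "69m49.949s"),
    "artist_credit": (756247, "1m50.256s"),
    "track": (15427255, "44m17.973s"),
    "release": (1208854, "4m7.183s"),
    "medium": (1343234, "4m38.414s"),
}


def to_seconds(elapsed):
    """Convert a `time`-style duration such as '4m7.183s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", elapsed)
    if match is None:
        raise ValueError(f"unrecognized duration: {elapsed!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


def throughput(stats):
    """Return objects per second for each collection."""
    return {name: count / to_seconds(elapsed)
            for name, (count, elapsed) in stats.items()}


if __name__ == "__main__":
    for name, rate in throughput(STATS).items():
        print(f"{name}: {rate:.0f} objects/s")
```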

Page 17: MongoDB for storing humongous music database

Import via Postgres

Operation          Time
Postgres Import    08m11s
Denormalize        14m57s
Export             00m29s

                   (Unsharded)   (Sharded)
MongoDB Import     14m59s        12m15s
Index              07m45s        02m35s
Overall            45m23s        40m13s

Page 18: MongoDB for storing humongous music database

Indexes & Sharding

Page 19: MongoDB for storing humongous music database

Indexes & Sharding - Text Index
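A hedged sketch of how a text index and a full-text query might be expressed with PyMongo. It assumes the `name` field is the one indexed, which the talk does not confirm, and it uses the `$text` query operator, which requires MongoDB 2.6 or later; the talk may have used a different interface. The helpers only build the index and query documents, so they run without a server.

```python
def text_index_keys(fields):
    """Key specification for a text index over the given fields."""
    # "text" is the value of the pymongo.TEXT constant.
    return [(field, "text") for field in fields]


def text_search_filter(terms):
    """Query document for a full-text search on the text index."""
    return {"$text": {"$search": terms}}


def search(collection, terms, limit=10):
    """Run a relevance-ranked full-text query on a PyMongo collection."""
    score = {"score": {"$meta": "textScore"}}
    cursor = collection.find(text_search_filter(terms), score)
    return list(cursor.sort([("score", {"$meta": "textScore"})]).limit(limit))


# Possible usage against a live deployment (host is a placeholder):
#   from pymongo import MongoClient
#   coll = MongoClient("mongodb://<mongos-host>:27017")["musicbrainz2"]["records3"]
#   coll.create_index(text_index_keys(["name"]))
#   for doc in search(coll, "yellow submarine"):
#       print(doc["name"], doc["score"])
```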

Page 20: MongoDB for storing humongous music database

Indexes & Sharding - Shard key

musicbrainz2.records3

shard key: { "name" : 1, "_id" : 1 }

chunks:

shard0002 18

shard0000 18

shard0001 18
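The sharding setup shown above could be issued from PyMongo roughly as follows. This is a sketch, not the commands used in the talk. It builds the `enableSharding` and `shardCollection` admin commands for the namespace and shard key from this slide; the `apply_commands` helper is illustrative and needs a client connected to the mongos router.

```python
def shard_commands(namespace, key):
    """Build the admin commands that shard `namespace` on `key`.

    `key` is a list of (field, direction) pairs so that the order of a
    compound shard key is preserved.
    """
    db_name = namespace.split(".", 1)[0]
    return [
        {"enableSharding": db_name},
        {"shardCollection": namespace, "key": dict(key)},
    ]


def apply_commands(client, commands):
    """Send each command to the admin database of a PyMongo client."""
    for command in commands:
        client.admin.command(command)


if __name__ == "__main__":
    for command in shard_commands("musicbrainz2.records3",
                                  [("name", 1), ("_id", 1)]):
        print(command)
```

If the collection already contains data, an index that starts with the shard key has to exist before `shardCollection` is run.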

Page 21: MongoDB for storing humongous music database

Thank You

team = {

members: [“Jonathan”, “Prasoon”], company: “MongoDB” }

@prasoonk