MongoDB for Storing a Humongous Music Database
DESCRIPTION
MusicBrainz is an encyclopedia of music tracks, artists and albums. It is available as a PostgreSQL database under a CC license. Two approaches to loading the database into MongoDB are examined. In the first, 4 tables are denormalized in Postgres and then loaded into MongoDB. In the second, the tables are loaded into MongoDB and denormalized into a single collection there. We also show MongoDB's full-text index.
TRANSCRIPT
Working with Humongous Music Database
MongoDB
Prasoon Kumar
#HyderabadDataScienceGroup
Agenda
• MongoDB Features
• Bulk Import
• Full Text Index creation
• Full Text Search
• Musicbrainz Database
MUSIC BRAINZ
What is MusicBrainz?
• MusicBrainz is a community-maintained open source encyclopedia of music information.
• This means that anyone - including you - can help contribute to the project by adding information about your favorite artists and their related works.
• Robert Kaye founded MusicBrainz. The project has grown rapidly from a one-man operation to an international community of enthusiasts who appreciate both music and music metadata.
MusicBrainz
• Along the way, the scope of the project has expanded from its origins as a mere CDDB replacement to today, where MusicBrainz has become a true encyclopedia of music.
• As an encyclopedia and as a community, MusicBrainz exists solely to collect as much information about music as we can without discriminating or preferring one "type" of music over another.
MusicBrainz Database
The MusicBrainz Database is where all of the various pieces of information we collect about music are stored, from artists and their releases to works and their composers, and of course much more. The majority of the data in the MusicBrainz Database is placed in the Public Domain, which means that anyone can download the data and use it in any way they see fit. The remaining data is released under a Creative Commons Attribution-NonCommercial-ShareAlike 2.0 license.
MongoDB
Document Database
Open-Source
General Purpose
Scalability
Auto-Sharding
• Increase capacity as you go
• Commodity and cloud architectures
• Improved operational simplicity and cost visibility
Drivers & Ecosystem
Support for the most popular languages and frameworks
Java
Python
Perl
Ruby
Morphia
MEAN Stack
Music Mongo
• Load (import)
• Run
– Exact match
– Full text search
• Todo
– Application interface
AWS Setup
s0 54.225.100.65
s1 54.235.157.214
s2 54.225.100.42
Client & mongos 54.225.100.39
config 184.73.195.120
Relevant schema of MusicBrainz:
Import strategies
• Denormalized from source DB
– Import TSV in PostgreSQL
– Export joined tables from PostgreSQL
– mongoimport TSV
• Separate collections from TSV
– mongoimport TSVs into temporary collections
– “Join” temporary collections in client (PyMongo) and insert to destination collection
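As a rough sketch of the mongoimport TSV step that both strategies share, the snippet below only assembles the command line and does not run it. The database name musicbrainz2 is taken from the shard status later in the deck. The file name, the host default and the field list are assumptions made for illustration, not the columns actually used in the talk.

```python
import shlex


def mongoimport_tsv_command(db, collection, tsv_path, fields, host="localhost:27017"):
    """Build (but do not run) a mongoimport invocation for a headerless TSV file."""
    if not fields:
        raise ValueError("a headerless TSV needs an explicit field list")
    return [
        "mongoimport",
        "--host", host,
        "--db", db,
        "--collection", collection,
        "--type", "tsv",
        "--fields", ",".join(fields),
        "--file", tsv_path,
    ]


# Hypothetical file name and field list; the real MusicBrainz dump has more columns.
cmd = mongoimport_tsv_command(
    db="musicbrainz2",
    collection="recording",
    tsv_path="recording.tsv",
    fields=["id", "gid", "name", "artist_credit", "length"],
)
print(shlex.join(cmd))
```

Passing the list to subprocess.run(cmd, check=True) would execute the import; pointing host at the mongos router is what a sharded setup would presumably need.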
Steps for creating denormalized table:
Client join
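A minimal sketch of what a client-side "join" can look like, written against plain iterables of dicts so that it runs without a server. With PyMongo the same function could be fed cursors from the temporary collections and its output passed to insert_many on the destination collection. The field names (id, name, artist_credit, recording, medium, release, length) and the shape of the output document are assumptions based on the MusicBrainz schema, not the talk's actual code.

```python
def index_by_id(docs):
    """Map each document's 'id' to the document, for constant-time lookups."""
    return {doc["id"]: doc for doc in docs}


def join_tracks(tracks, recordings, artist_credits, mediums, releases):
    """Yield one denormalized document per track.

    The four lookup tables are held in memory while tracks are streamed.
    A missing reference produces None for the fields that depend on it.
    """
    rec_by_id = index_by_id(recordings)
    ac_by_id = index_by_id(artist_credits)
    med_by_id = index_by_id(mediums)
    rel_by_id = index_by_id(releases)

    for track in tracks:
        recording = rec_by_id.get(track["recording"], {})
        credit = ac_by_id.get(recording.get("artist_credit"), {})
        medium = med_by_id.get(track["medium"], {})
        release = rel_by_id.get(medium.get("release"), {})
        yield {
            "name": track["name"],
            "artist": credit.get("name"),
            "release": release.get("name"),
            "length": recording.get("length"),
        }


# Tiny made-up sample rows, only to show the shape of the result.
joined = list(join_tracks(
    tracks=[{"id": 1, "name": "Song A", "recording": 10, "medium": 100}],
    recordings=[{"id": 10, "name": "Song A", "artist_credit": 7, "length": 215000}],
    artist_credits=[{"id": 7, "name": "Some Artist"}],
    mediums=[{"id": 100, "release": 1000}],
    releases=[{"id": 1000, "name": "Some Album"}],
))
print(joined)
```

Holding the lookup tables in memory is a simplification. For collections with millions of documents, indexed lookups against the temporary collections would be one alternative.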
Import statistics
Collection      Objects imported   Finished                       Elapsed (real)
recording       12817015           2013-11-11T22:02:51.213+0000   69m49.949s
artist_credit   756247             2013-11-11T22:04:41.469+0000   1m50.256s
track           15427255           2013-11-11T22:48:59.423+0000   44m17.973s
release         1208854            2013-11-11T22:53:06.627+0000   4m7.183s
medium          1343234            2013-11-11T22:57:45.030+0000   4m38.414s
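To compare the collections on an equal footing, the figures above can be converted to objects per second. The counts and durations are copied from the slide; the helper name and the output layout are only illustrative.

```python
import re


def parse_real(duration):
    """Convert a `time`-style duration such as '69m49.949s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", duration)
    if match is None:
        raise ValueError(f"unrecognized duration: {duration!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


# (collection, objects imported, elapsed real time), copied from the slide.
IMPORTS = [
    ("recording", 12817015, "69m49.949s"),
    ("artist_credit", 756247, "1m50.256s"),
    ("track", 15427255, "44m17.973s"),
    ("release", 1208854, "4m7.183s"),
    ("medium", 1343234, "4m38.414s"),
]

for name, objects, real in IMPORTS:
    rate = objects / parse_real(real)
    print(f"{name:14s} {rate:8.0f} objects/s")
```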
Import via Postgres
Operation          Time
Postgres Import    08m11s
Denormalize        14m57s
Export             00m29s

                   (Unsharded)   (Sharded)
MongoDB Import     14m59s        12m15s
Index              07m45s        02m35s
Overall            45m23s        40m13s
Indexes & Sharding
Indexes & Sharding - Text Index
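A hedged sketch of creating and querying a text index from PyMongo. It assumes a PyMongo Collection object and assumes the text index is on the name field. The $text query operator used here requires MongoDB 2.6 or later; MongoDB 2.4 exposed text search through a separate text command instead.

```python
def create_name_text_index(collection):
    """Create a text index on the 'name' field and return the index name."""
    # "text" is the index type string that PyMongo also exposes as pymongo.TEXT.
    return collection.create_index([("name", "text")])


def search_by_name(collection, phrase, limit=10):
    """Return up to `limit` documents matching `phrase`, best matches first."""
    cursor = (
        collection.find(
            {"$text": {"$search": phrase}},
            {"score": {"$meta": "textScore"}},  # add the relevance score to each result
        )
        .sort([("score", {"$meta": "textScore"})])
        .limit(limit)
    )
    return list(cursor)
```

An exact match, by contrast, would be an ordinary equality query such as collection.find({"name": "Some Title"}), which can use a regular index on name rather than the text index.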
Indexes & Sharding - Shard key
musicbrainz2.records3
shard key: { "name" : 1, "_id" : 1 }
chunks:
shard0002 18
shard0000 18
shard0001 18
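The shard key above could be set up along these lines. The helper only builds the two admin command documents, enableSharding and shardCollection, which are real MongoDB commands. Sending them through PyMongo to the mongos router, as in the commented usage, is an assumption about how the cluster was driven and is not stated on the slides.

```python
def sharding_commands(namespace, shard_key):
    """Return the admin commands that shard a namespace like 'database.collection'."""
    db_name, _, coll_name = namespace.partition(".")
    if not db_name or not coll_name:
        raise ValueError("namespace must look like 'database.collection'")
    return [
        {"enableSharding": db_name},
        # dict() keeps the given field order, which matters for compound shard keys.
        {"shardCollection": namespace, "key": dict(shard_key)},
    ]


commands = sharding_commands("musicbrainz2.records3", [("name", 1), ("_id", 1)])
print(commands)

# Assumed usage with PyMongo connected to the mongos router:
#   from pymongo import MongoClient
#   client = MongoClient("mongodb://<mongos-host>:27017")
#   for command in commands:
#       client.admin.command(command)
```

If the collection already contains data, MongoDB expects an index that starts with the shard key to exist before shardCollection is run.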
Thank You
team = {
members: [“Jonathan”, “Prasoon”],
company: “MongoDB”
}
@prasoonk