mongodb at the energy frontier

33
MongoNYC, May, 2012 MongoDB at the energy frontier Valentin Kuznetsov, Cornell University 1 Monday, May 21, 12

Upload: valentin-kuznetsov

Post on 21-Jun-2015

2.082 views

Category:

Technology


1 download

TRANSCRIPT

Page 1: MongoDB at the energy frontier

MongoNYC, May, 2012

MongoDB at the energy frontierValentin Kuznetsov, Cornell University

1Monday, May 21, 12

Page 2: MongoDB at the energy frontier

Outline

✤ CMS :: LHC :: CERN

✤ Data Aggregation System and MongoDB

✤ Experience

✤ Summary

2Monday, May 21, 12

Page 3: MongoDB at the energy frontier

CMS :: LHC :: CERN

Large Hadron Collider located at CERN, Geneva, SwitzerlandCMS is one of the 4 experiments to probe our knowledge of particle

interactions and search for a new physics

3Monday, May 21, 12

Page 4: MongoDB at the energy frontier

CMS :: LHC :: CERN

Compact Muon Solenoid (CMS)

4Monday, May 21, 12

Page 5: MongoDB at the energy frontier

CMS :: LHC :: CERN

Typical proton-proton collision in CMS detector5Monday, May 21, 12

Page 6: MongoDB at the energy frontier

CMS :: LHC :: CERN

✤ 40 countries, 172 institutions, more then 3000 scientists

✤ CMS experiment produces a few PB of real data each year and we collect ~TB of meta-data

✤ CMS relies on GRID infrastructure for data processing and uses 100+ computing centers word-wide

✤ CMS software consists of 4M lines of C++(framework), 2M lines of python (data management), plus Java, perl, etc.

✤ ORACLE, MySQL, SQLite, NoSQL

6Monday, May 21, 12

Page 7: MongoDB at the energy frontier

Dilemma

DBS

SiteDB

Phedex

GenDB

LumiDB

RunDB

PSetDB

Data

Quality

Overview

How I can findmy data?

7Monday, May 21, 12

Page 8: MongoDB at the energy frontier

Motivations

✤ Users want to query different data services without knowing about their existence

✤ Users want to combine information from different data services

✤ Some users may have domain knowledge, but they need to query X services, using Y interface and dealing with Z data formats to get our data

block,site

lumi

site

DBSrun, file, block, site,config, tier, dataset,lumi, parameters, ....

LumiDBlumi, luminosity, hltpath

SiteDBsite, admin, site.status, ..

Phedexblock, file, block.replica,file.replica, se, node, ...

GenDBgenerator, xsection, process, decay, ...

RunSummaryrun, trigger, detector, ...

DataQualitytrigger, ecal, hcal, ...

run,lumi

run

MC id

Overviewcountry, node, region, ..

Parameter Set DBCMSSW parameters

run

Service Eparam1, param2, ..Service D

param1, param2, ..Service Cparam1, param2, ..Service B

param1, param2, ..Service Aparam1, param2, ..

pset

Data Aggregation System

8Monday, May 21, 12

Page 9: MongoDB at the energy frontier

Implementation idea

✤ When we talk we may use different languages (English, French, etc.) or different conventions (pounds vs kg)

✤ In order to establish communication we use translation, dictionary, thesaurus

9Monday, May 21, 12

Page 10: MongoDB at the energy frontier

Implementation idea

10Monday, May 21, 12

Page 11: MongoDB at the energy frontier

Pros

✤ Separate data management from discovery service

✤ Data are safe and secure

✤ Pluggable architecture (new translations)

✤ Users never bother with interface, naming and schema conflicts, data-formats, security policies

✤ Information is aggregated in a real-time over distributed services

✤ Data consistency checks for free

✤ DB and API changes are transparent for end-users11Monday, May 21, 12

Page 12: MongoDB at the energy frontier

Cons

✤ DAS does not own the data

✤ lots of writes/reads/translations

✤ Data-services are real bottleneck

✤ nothing is guaranteed, e.g. service can go down, no control of its performance, requested data can be really large, etc.

✤ cache often and preemptive

MongoDB to rescue !!!

12Monday, May 21, 12

Page 13: MongoDB at the energy frontier

Data Aggregation System

DAS webserver

dbs

sitedb

phedex

lumidb

runsum

DAS cache

DAS Analytics

CPU core

DAS core

DAS core

DAS Cache server

record query, APIcall to Analytics

Fetch popularqueries/APIs

Invoke the same API(params)Update cache periodically

DAS mapping Map data-service

output to DASrecords

mapping

par

ser

����������

dat

a-se

rvic

es

DAS merge

plu

gin

s

aggregator

UI

RESTful interface

DAS robot

13Monday, May 21, 12

Page 14: MongoDB at the energy frontier

Mapping DB

✤ Holds translation between user keywords and data-service APIs, resolve naming conflicts, etc.

✤ city=Ithaca query translates into Google API call

{'das2api': [{'api_param': 'q', 'das_key': 'city.name', 'pattern': ''}], 'daskeys': [{'key': 'city', 'map': 'city.name', 'pattern': ''}], 'expire': 3600, 'format': 'JSON', 'params': {'output': 'json', 'q': 'required'}, 'system': 'google_maps', 'url': 'http://maps.google.com/maps/geo', 'urn': 'google_geo_maps'}

14Monday, May 21, 12

Page 15: MongoDB at the energy frontier

Analytics DB

✤ Keep tracks of user queries, data-service API calls

{'api': {'params': {'q': 'Ithaca', 'output': 'json'}, 'name': 'google_geo_maps'}, 'qhash': '7272bdeac45174823d3a4ea240c124ec', 'system': 'google_maps', 'counter': 5}

✤ Used by DAS analytics daemons to pre-fetch “hot” queries

✤ ValueHotSpot look-up data by popular values

✤ KeyHotSpot look-up data by popular key

✤ QueryMaintainer to keep given query always in cache

15Monday, May 21, 12

Page 16: MongoDB at the energy frontier

Caching DB

✤ Data coming out from data-service providers are translated into JSON and stored into cache collection

✤ naming translation are performed at this level

✤ Data records from cache collection are processed on common key, e.g. city.name, and merged into merge collection

cache collection merge collection

{'city': {'name': 'Ithaca', 'lat':42, 'lng':-76, 'zip':14850}}

{'city': {'name': 'Ithaca', 'lat':42, 'lng':-76}}{'city': {'name': 'Ithaca', 'zip':14850}}

16Monday, May 21, 12

Page 17: MongoDB at the energy frontier

DAS workflow

✤ Query parser

✤ Query DAS merge collection

✤ Query DAS cache collection

✤ invoke call to data service

✤ write to analytics

✤ Aggregate results

✤ Represent results on web UI or via command line interface

query

parser

queryDAS merge

Aggregator

queryDAS cache

querydata-services

DASmerge

DAScache

noyes

noyes

results

DASMapping

DASAnalytics

Web UI

DASlogging

DAScore

17Monday, May 21, 12

Page 18: MongoDB at the energy frontier

Example

18Monday, May 21, 12

Page 19: MongoDB at the energy frontier

DAS QL & MongoDB QL

✤ DAS Query Language built on top of MongoDB QL; it represents MongoDB QL in human readable form

✤ UI level:

block dataset=/a/b/c | grep block.size | count(block.size)

✤ DB level:

col.find(spec={‘dataset.name’:‘/a/b/c’}, fields=[block.size]).count()

✤ We enrich QL with additional filters (grep, sort, unique) and implement set of coroutines for aggregator functions

19Monday, May 21, 12

Page 20: MongoDB at the energy frontier

DAS & MongoDB

✤ DAS works with 15 distributed data-services

✤ their size vary, on average O(100GB)

✤ DAS uses 40 MongoDB collections

✤ caching, mapping, analytics, logging (normal, capped, gridfs cols)

✤ DAS inserts/deletes O(1M) records on a daily basis

✤ We operate on a single 64-bit Linux node with 8 CPUs, 24 GB of RAM and 1TB of disk space, sharding were tested, but it is not enabled

20Monday, May 21, 12

Page 21: MongoDB at the energy frontier

MongoDB benefits

✤ Fast I/O and schema-less database are ideal for cache implementation

✤ you’re not limited by key:value approach

✤ Flexible query language allows to build domain specific QL

✤ stay on par with SQL

✤ No administrative costs with DB

✤ easy to install and maintain

21Monday, May 21, 12

Page 22: MongoDB at the energy frontier

MongoDB issues (ver 2.0.X)

✤ We were unable to directly store DAS queries into analytics collection, due to the dot constrain, e.g. {‘a.b’:1}

✤ queries <=> storage format {‘key’:‘a.b’, ‘value’:1}

✤ Scons is not suitable in fully controlled build environment

✤ it removes $PATH/$LD_LIBRARY_PATH for compiler commands; it forces to use -L/lib64. As a result we used wrappers.

✤ Uncompressed field names and limitation with pagination/aggregation

✤ should be addressed in new MongoDB aggregation framework22Monday, May 21, 12

Page 23: MongoDB at the energy frontier

Tradeoffs

✤ Query collisions: DAS does not own the data and there is no transactions, we rely on query status and update it accordingly

✤ Index choice: initially one per select key, later one per query hash

✤ Storage size: we compromise storage vs data flexibility vs naming conventions

✤ Speed: we compromise simple data access vs conglomerate of restrictions (naming, security policies, interfaces, etc.), but we tuning-up our data-service APIs based on query patterns

23Monday, May 21, 12

Page 24: MongoDB at the energy frontier

Results

✤ The service in production over one year

✤ Users authenticated via GRID certificates and DAS uses proxy server to pass credentials to back-end services

✤ Single query request yields few thousand records and resolved within few seconds

✤ Pluggable architecture allows to query your service(s)

✤ unit tests are done against public data-services, e.g. Google, IP look-up, etc.

24Monday, May 21, 12

Page 25: MongoDB at the energy frontier

NoSQL @ CERN

✤ MongoDB is used by other experiments at CERN

✤ logging, monitoring, data analytics

✤ MongoDB is not the only NoSQL solution used at CERN

✤ One size does not fit all

✤ CouchDB, Cassandra, HBase, etc.

✤ There is on-going discussion between experiments and CERN IT about adoption of NoSQL

25Monday, May 21, 12

Page 26: MongoDB at the energy frontier

Summary

✤ CMS experiment built Data Aggregation System as an intelligent cache to query distributed data-services

✤ MongoDB is used as DAS back-end

✤ During first year of operation we did not experience any significant problems

✤ I’d like to thank MongoDB team and its community for their constant support

✤ Questions? Contact: [email protected]

✤ https://github.com/vkuznet/DAS/26Monday, May 21, 12

Page 27: MongoDB at the energy frontier

Back-up slides

27Monday, May 21, 12

Page 28: MongoDB at the energy frontier

From query to results

QueryAPI

lookupMerge results

Aggreator

Data servicegenerator

Data servicegenerator

Data servicegenerator

Aggreator

Aggreator

28Monday, May 21, 12

Page 29: MongoDB at the energy frontier

From query to results

QueryAPI

lookupMerge results

Aggreator

Data servicegenerator

Data servicegenerator

Data servicegenerator

Aggreator

Aggreator

28Monday, May 21, 12

Page 30: MongoDB at the energy frontier

From query to results

QueryAPI

lookupMerge results

Aggreator

Data servicegenerator

Data servicegenerator

Data servicegenerator

Aggreator

Aggreator

block dataset=/a/b/c

MongoDB spec

Mapping DBholds

relationships

28Monday, May 21, 12

Page 31: MongoDB at the energy frontier

From query to results

QueryAPI

lookupMerge results

Aggreator

Data servicegenerator

Data servicegenerator

Data servicegenerator

Aggreator

Aggreator

block dataset=/a/b/c

MongoDB spec

Mapping DBholds

relationships

Caching DBholds

service records

28Monday, May 21, 12

Page 32: MongoDB at the energy frontier

From query to results

QueryAPI

lookupMerge results

Aggreator

Data servicegenerator

Data servicegenerator

Data servicegenerator

Aggreator

Aggreator

block dataset=/a/b/c

MongoDB spec

Mapping DBholds

relationships

Caching DBholds

service records

Merge DBholds

merged records

28Monday, May 21, 12

Page 33: MongoDB at the energy frontier

From query to results

QueryAPI

lookupMerge results

Aggreator

Data servicegenerator

Data servicegenerator

Data servicegenerator

Aggreator

Aggreator

block dataset=/a/b/c

MongoDB spec

Mapping DBholds

relationships

Caching DBholds

service records

Merge DBholds

merged records

28Monday, May 21, 12