DESCRIPTION
MongoDB talk from MongoNYC 2012

TRANSCRIPT
NY Times + MongoDB
Lessons Learnt - Deep Kapadia (NYT R&D)
Why MongoDB?
● Quick prototyping
● Flexible schema
  ○ Easy to dump data from 3rd-party data sources
    ■ bit.ly
    ■ Twitter
  ○ Schema may change depending on the need of the day or which metrics we are interested in
● Quick scaling if needed
6 Months ago
● Started out with two weeks of data
● Single MongoDB instance
  ○ No replication
  ○ No backups
  ○ No monitoring
● Data was stored locally on ephemeral storage on an EC2 instance
● Logs were stored locally
Technology stack
● Amazon EC2
● MongoDB 2.0.x (started with 1.8)
● Python 2.7.2
● pymongo 2.1.x
● Tornado 2.2, load balanced over Nginx
● Little bit of Ruby on Rails (going away soon)
● Custom WebGL-based framework for visualization
Monitoring
● Monitoring tools
  ○ db.serverStatus() + cron + email
  ○ 10gen's MongoDB Monitoring Service (MMS)
  ○ M/Monit - very basic
  ○ Nagios
● At the very least, monitor
  ○ The mongod process
  ○ Memory usage
  ○ CPU usage
  ○ Disk usage
● Desirable to monitor EVERYTHING
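The simplest option above (db.serverStatus() + cron + email) can be sketched as a small shell script; the script path, cron schedule, alert address, and the choice of metrics are all hypothetical.

```shell
#!/bin/sh
# mongo_health.sh - mail a few db.serverStatus() numbers to the ops list.
# Install via cron, e.g.: */5 * * * * mongodb /usr/local/bin/mongo_health.sh
STATUS=$(mongo --quiet --eval '
    var s = db.serverStatus();
    print("resident MB:  " + s.mem.resident);
    print("connections:  " + s.connections.current);')
echo "$STATUS" | mail -s "mongod status on $(hostname)" ops@example.com
```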
Replication
● REALLY EASY to set up
● 4-node replica set
  ○ Primary
  ○ Secondary
  ○ Arbiter
  ○ Delayed secondary - priority 0
    ■ Never gets elected as primary
● All instances are m1.large, except the Arbiter, which is a t1.micro
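A sketch of that four-node topology as a replica set initiation in the mongo shell; the set name and hostnames are hypothetical, and slaveDelay is what makes the priority-0 member a delayed secondary.

```shell
# Run once against the intended primary (requires running mongod instances).
mongo db1.example.com/admin <<'EOF'
rs.initiate({
  _id: "nytrs",
  members: [
    { _id: 0, host: "db1.example.com:27017" },                    // primary (m1.large)
    { _id: 1, host: "db2.example.com:27017" },                    // secondary (m1.large)
    { _id: 2, host: "arb.example.com:27017", arbiterOnly: true }, // arbiter (t1.micro)
    { _id: 3, host: "db3.example.com:27017",
      priority: 0, slaveDelay: 3600 }                             // delayed secondary, never primary
  ]
})
EOF
```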
Replication
● Be aware of how your drivers handle failover
  ○ pymongo raises AutoReconnect
● Decide up front how you want to handle failover
  ○ Lose data, or
  ○ Idempotent writes (keep trying until the write is successful... up to a certain number of times)
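One way to implement the "keep trying until the write succeeds" option is a bounded retry loop around the write. This is a sketch with a stand-in AutoReconnect class so it runs without pymongo (the real exception is pymongo.errors.AutoReconnect); the retry count and delay are arbitrary.

```python
import time

class AutoReconnect(Exception):   # stand-in for pymongo.errors.AutoReconnect
    pass

MAX_RETRIES = 5

def retry_write(write_fn, max_retries=MAX_RETRIES, delay=0.5):
    """Run write_fn, retrying on AutoReconnect up to max_retries times.

    Only safe for idempotent writes (e.g. upserts keyed on _id), since a
    write may have reached the server before the exception was raised.
    """
    for attempt in range(max_retries):
        try:
            return write_fn()
        except AutoReconnect:
            if attempt == max_retries - 1:
                raise                 # give up: surface the failure
            time.sleep(delay)         # wait for the set to elect a new primary
```

The delay matters: a replica set election typically takes a few seconds, so retrying immediately in a tight loop just burns the retry budget before a new primary exists.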
Storage
● Do not use local storage on EC2
  ○ Ephemeral - does not survive if the instance is stopped or terminated
● If using EC2, use EBS
  ○ Persistent
  ○ Easy to snapshot
  ○ Can be detached and attached to a different server
  ○ Can be RAID'ed for reliability and performance
● Note: EBS is known to have inconsistent performance characteristics
● Throughput is limited by the ~1 Gbps network link to EBS
Storage
● We started with RAID 10 on EBS
  ○ Difficult to image
  ○ Slightly steep learning curve if you are not used to tinkering with RAID/LVM
● When using RAID, you need to freeze the filesystem before snapshotting
  ○ xfs_freeze
● If not on EC2, just use filesystem snapshots
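The freeze-then-snapshot sequence might look like this; the mount point and volume IDs are hypothetical, and ec2-create-snapshot is from the old EC2 API tools.

```shell
# Consistent snapshot of a RAID-10 array of EBS volumes: flush and lock
# mongod, freeze the XFS filesystem, snapshot every member volume, thaw.
mongo --eval 'printjson(db.runCommand({fsync: 1, lock: 1}))'
xfs_freeze -f /data/db
for vol in vol-aaaa1111 vol-bbbb2222 vol-cccc3333 vol-dddd4444; do
    ec2-create-snapshot "$vol" -d "mongo raid10 $(date +%F)"
done
xfs_freeze -u /data/db
mongo --eval 'printjson(db.fsyncUnlock())'
```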
Logs
● Store logs on an EBS volume
  ○ Logs can still be viewed in case your server goes down and cannot be restarted
● Rotate your logs - please!
  ○ From the mongo shell: db.runCommand("logRotate")
  ○ From the command line:
    ■ kill -SIGUSR1 <mongod process id>
    ■ killall -SIGUSR1 mongod
  ○ logrotate
    ■ Still requires a post-rotate kill command
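A hypothetical /etc/logrotate.d/mongod entry; the postrotate block is the "post-rotate kill command" mentioned above, since SIGUSR1 is what tells mongod to rotate and reopen its log file.

```
/var/log/mongodb/mongod.log {
    daily
    rotate 14
    compress
    delaycompress
    missingok
    notifempty
    postrotate
        /usr/bin/killall -SIGUSR1 mongod
    endscript
}
```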
Backups and Restore
● Snapshot EBS volumes
  ○ --journal is your friend
  ○ Need to use fsync + lock if journaling is disabled
● Restoring from snapshots is easy
  ○ Create a new volume from the snapshot
  ○ Mount the volume on an EC2 instance
● Caveat: when using RAID, you still need fsync + lock even if journaling is enabled, because the member volumes are snapshotted at slightly different moments.
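With journaling on (the default for 64-bit mongod from 2.0), a single-volume snapshot is crash-consistent by itself. A sketch using the old EC2 API tools; all volume, snapshot, and instance IDs are hypothetical.

```shell
# Snapshot a single journaled EBS volume (no fsync+lock needed):
ec2-create-snapshot vol-0123abcd -d "mongo backup $(date +%F)"

# Restore: new volume from the snapshot, attach it, mount it, start mongod.
ec2-create-volume --snapshot snap-0456efgh -z us-east-1a
ec2-attach-volume vol-89ab0123 -i i-deadbeef -d /dev/sdf
mount /dev/xvdf /data/db
```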
Backups and Restore
● mongodump/mongorestore
  ○ Can be run while the DB is still running
  ○ No need to lock the DB
  ○ Can back up and restore individual collections or even partial collections
  ○ Restore rebuilds indexes
● Automate your backups
  ○ https://github.com/micahwedemeyer/automongobackup
Backups and Restore
● Use --oplog with mongodump and --oplogReplay with mongorestore
● Backups and restores can be slow if your data runs to a few hundred GB
  ○ Plan for it
● Use incremental backups
  ○ Possible with mongodump/mongorestore
    ■ mongodump -q
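The flags above combine roughly like this; the database, collection, numeric "ts" field, and backup paths are hypothetical.

```shell
# Full point-in-time dump of a running replica set member:
mongodump --oplog -o /backups/full-$(date +%F)

# Incremental: only documents stamped after the previous backup,
# selected with mongodump's -q filter.
mongodump -d mydb -c events -q '{"ts": {"$gte": 1335830400}}' \
    -o /backups/incr-$(date +%F)

# Restore the full dump, replaying the captured oplog entries:
mongorestore --oplogReplay /backups/full-2012-05-01
```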
Understand your Data
● Know whether your application is write-heavy or read-heavy
● Separate write-heavy collections from read-heavy collections
● Minimize indexes on write-heavy collections
● Separate operational data from data used for mining/analytics if possible
Querying
● db.<collection>.find({x:123})
  ○ Returns the entire document for each match
  ○ Will be slow if you have large documents
● $exists, $nin & $ne are not very efficient
  ○ Try setting default values for keys instead of using $exists
  ○ Try using $gt and $lt instead of $ne if possible (numerics)
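The two substitutions above can be sketched as pymongo-style filter documents (plain Python dicts); the "clicks" field and its default of 0 are hypothetical.

```python
# Instead of testing for key presence with $exists...
exists_query = {"clicks": {"$exists": True}}
# ...write a default value (e.g. 0) on insert, then query on the value,
# which an index on "clicks" can serve:
default_query = {"clicks": {"$gte": 0}}

# Instead of $ne, which cannot use an index efficiently...
ne_query = {"clicks": {"$ne": 0}}
# ...cover the same numeric values with two ranges:
range_query = {"$or": [{"clicks": {"$lt": 0}},
                       {"clicks": {"$gt": 0}}]}
```

The $or form works because each clause is an index-friendly range scan, whereas $ne forces the server to examine the whole key range.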
Querying
● Limit the data you return to only what you need
  ○ Use range queries
  ○ Limit the number of results
  ○ Limit the number of keys returned
● Increase or decrease the batch size for a cursor based on your needs
  ○ Returning a batch is a network operation
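A pymongo 2.x-flavoured sketch of the same advice, shown as the pieces of a find() call; the field names ("ts", "clicks") and all the numbers are hypothetical.

```python
# Against a live collection this would be:
#   cursor = coll.find(query, projection, limit=LIMIT).batch_size(BATCH_SIZE)
query = {"ts": {"$gte": 1335830400, "$lt": 1335916800}}  # range query (one day)
projection = {"ts": 1, "clicks": 1}   # return only the keys we need
LIMIT = 100                           # cap the number of results
BATCH_SIZE = 500                      # documents fetched per network round trip
```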
Indexes
● Use indexes judiciously
● Create indexes to match your query keys
● Understand what you get with a compound index
  ○ db.collection.ensureIndex({a:1, b:1})
    ■ Gives you an index on a, and on a & b, but not on b alone
    ■ Ascending/descending may sometimes matter when using compound indexes
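The prefix rule can be checked with explain() in the mongo shell; the collection name is hypothetical and a running mongod is assumed.

```shell
mongo mydb <<'EOF'
db.events.ensureIndex({a: 1, b: 1})
db.events.find({a: 5}).explain().cursor        // "BtreeCursor a_1_b_1" - prefix a works
db.events.find({a: 5, b: 7}).explain().cursor  // "BtreeCursor a_1_b_1" - a & b works
db.events.find({b: 7}).explain().cursor        // "BasicCursor" - b alone cannot use it
EOF
```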
Indexes (continued)
● One index per query rule:
  ○ Queries on multiple keys cannot use multiple indexes - use a compound index instead
    ■ $or is an exception
● Make sure that all your indexes fit in memory
  ○ db.<collection>.totalIndexSize()
● Sometimes indexes may not be helpful
  ○ Low-selectivity indexes
● Use explain()
Other details
● Pay attention to the limitations of the MongoDB version and the driver you are using
  ○ e.g. $exists does not use indexes prior to 2.0
  ○ e.g. $and is not supported in 1.8
● Design for performance
  ○ Iterate over schema design if it does not perform
  ○ Sometimes it is better to normalize than to store everything in one large document
  ○ Archive historical data to a warehouse
    ■ For mining/analytics
Administration tools
● We use RockMongo
● But there are many other tools available
  ○ http://www.mongodb.org/display/DOCS/Admin+UIs
● Create read-only users for developers if needed.
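In the 2.0-era shell, a read-only account is the third argument to db.addUser(); the database name and credentials here are hypothetical.

```shell
mongo mydb <<'EOF'
db.addUser("dev_reader", "s3cret", true)   // true => readOnly
EOF
```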