DESCRIPTION
MongoDB talk from MongoNYC 2012

TRANSCRIPT
NY Times + MongoDB
Lessons Learnt - Deep Kapadia (NYT R&D)
Why MongoDB?
● Quick prototyping
● Flexible schema
  ○ Easy to dump data from 3rd-party data sources
    ■ bit.ly
    ■ Twitter
  ○ Schema may change depending on the need of the day or which metrics we are interested in
● Quick scaling if needed
6 Months ago
● Started out with two weeks of data
● Single MongoDB instance
  ○ No replication
  ○ No backups
  ○ No monitoring
● Data was stored locally on ephemeral storage on an EC2 instance
● Logs were stored locally
Technology stack
● Amazon EC2
● MongoDB 2.0.x (started with 1.8)
● Python 2.7.2
● pymongo 2.1.x
● Tornado 2.2, load balanced over Nginx
● Little bit of Ruby on Rails (going away soon)
● Custom WebGL-based framework for visualization
Monitoring
● Monitoring tools
  ○ db.serverStatus() + cron + email
  ○ 10gen's MongoDB Monitoring Service (MMS)
  ○ M/Monit - very basic
  ○ Nagios
● At the very least, monitor
  ○ The mongod process
  ○ Memory usage
  ○ CPU usage
  ○ Disk usage
● Desirable to monitor EVERYTHING
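The simplest option above (db.serverStatus() + cron + email) can be sketched as a small shell script; the script path, cron schedule, alert address, and the choice of metrics are all hypothetical.

```shell
#!/bin/sh
# mongo_health.sh - mail a few db.serverStatus() numbers to the ops list.
# Install via cron, e.g.: */5 * * * * mongodb /usr/local/bin/mongo_health.sh
STATUS=$(mongo --quiet --eval '
    var s = db.serverStatus();
    print("resident MB:  " + s.mem.resident);
    print("connections:  " + s.connections.current);')
echo "$STATUS" | mail -s "mongod status on $(hostname)" ops@example.com
```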
Replication
● REALLY EASY to set up
● 4-node replica set
  ○ Primary
  ○ Secondary
  ○ Arbiter
  ○ Delayed secondary - priority 0
    ■ Never gets elected as primary
● All instances are m1.large, except the Arbiter, which is a t1.micro
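A sketch of that four-node topology as a replica set initiation in the mongo shell; the set name and hostnames are hypothetical, and slaveDelay is what makes the priority-0 member a delayed secondary.

```shell
# Run once against the intended primary (requires running mongod instances).
mongo db1.example.com/admin <<'EOF'
rs.initiate({
  _id: "nytrs",
  members: [
    { _id: 0, host: "db1.example.com:27017" },                    // primary (m1.large)
    { _id: 1, host: "db2.example.com:27017" },                    // secondary (m1.large)
    { _id: 2, host: "arb.example.com:27017", arbiterOnly: true }, // arbiter (t1.micro)
    { _id: 3, host: "db3.example.com:27017",
      priority: 0, slaveDelay: 3600 }                             // delayed secondary, never primary
  ]
})
EOF
```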
Replication
● Be aware of how your drivers handle failover
  ○ pymongo raises AutoReconnect
● Decide up front how you want to handle failover
  ○ Lose data, or
  ○ Idempotent writes (keep trying until the write is successful... up to a certain number of times)
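One way to implement the "keep trying until the write succeeds" option is a bounded retry loop around the write. This is a sketch with a stand-in AutoReconnect class so it runs without pymongo (the real exception is pymongo.errors.AutoReconnect); the retry count and delay are arbitrary.

```python
import time

class AutoReconnect(Exception):   # stand-in for pymongo.errors.AutoReconnect
    pass

MAX_RETRIES = 5

def retry_write(write_fn, max_retries=MAX_RETRIES, delay=0.5):
    """Run write_fn, retrying on AutoReconnect up to max_retries times.

    Only safe for idempotent writes (e.g. upserts keyed on _id), since a
    write may have reached the server before the exception was raised.
    """
    for attempt in range(max_retries):
        try:
            return write_fn()
        except AutoReconnect:
            if attempt == max_retries - 1:
                raise                 # give up: surface the failure
            time.sleep(delay)         # wait for the set to elect a new primary
```

The delay matters: a replica set election typically takes a few seconds, so retrying immediately in a tight loop just burns the retry budget before a new primary exists.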
Storage
● Do not use local storage on EC2
  ○ Ephemeral - does not survive if the instance is stopped or terminated
● If using EC2, use EBS
  ○ Persistent
  ○ Easy to snapshot
  ○ Can be detached and attached to a different server
  ○ Can be RAID'ed for reliability and performance
● Note: EBS is known to have inconsistent performance characteristics
● Throughput is limited by the ~1 Gbps network link to EBS
Storage
● We started with RAID 10 on EBS
  ○ Difficult to image
  ○ Slightly steep learning curve if you are not used to tinkering with RAID/LVM
● When using RAID, you need to freeze the filesystem before snapshotting
  ○ xfs_freeze
● If not on EC2, just use filesystem snapshots
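The freeze-then-snapshot sequence might look like this; the mount point and volume IDs are hypothetical, and ec2-create-snapshot is from the old EC2 API tools.

```shell
# Consistent snapshot of a RAID-10 array of EBS volumes: flush and lock
# mongod, freeze the XFS filesystem, snapshot every member volume, thaw.
mongo --eval 'printjson(db.runCommand({fsync: 1, lock: 1}))'
xfs_freeze -f /data/db
for vol in vol-aaaa1111 vol-bbbb2222 vol-cccc3333 vol-dddd4444; do
    ec2-create-snapshot "$vol" -d "mongo raid10 $(date +%F)"
done
xfs_freeze -u /data/db
mongo --eval 'printjson(db.fsyncUnlock())'
```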
Logs
● Store logs on an EBS volume
  ○ Logs can still be viewed in case your server goes down and cannot be restarted
● Rotate your logs - please!
  ○ From the mongo shell: db.runCommand("logRotate")
  ○ From the command line:
    ■ kill -SIGUSR1 <mongod process id>
    ■ killall -SIGUSR1 mongod
  ○ logrotate
    ■ Still requires a post-rotate kill command
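A hypothetical /etc/logrotate.d/mongod entry; the postrotate block is the "post-rotate kill command" mentioned above, since SIGUSR1 is what tells mongod to rotate and reopen its log file.

```
/var/log/mongodb/mongod.log {
    daily
    rotate 14
    compress
    delaycompress
    missingok
    notifempty
    postrotate
        /usr/bin/killall -SIGUSR1 mongod
    endscript
}
```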
Backups and Restore
● Snapshot EBS volumes
  ○ --journal is your friend
  ○ Need to use fsync + lock if journaling is disabled
● Restoring from snapshots is easy
  ○ Create a new volume from the snapshot
  ○ Mount the volume on an EC2 instance
● Caveat: when using RAID, you still need fsync + lock even if journaling is enabled, because the member volumes are snapshotted at slightly different moments.
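With journaling on (the default for 64-bit mongod from 2.0), a single-volume snapshot is crash-consistent by itself. A sketch using the old EC2 API tools; all volume, snapshot, and instance IDs are hypothetical.

```shell
# Snapshot a single journaled EBS volume (no fsync+lock needed):
ec2-create-snapshot vol-0123abcd -d "mongo backup $(date +%F)"

# Restore: new volume from the snapshot, attach it, mount it, start mongod.
ec2-create-volume --snapshot snap-0456efgh -z us-east-1a
ec2-attach-volume vol-89ab0123 -i i-deadbeef -d /dev/sdf
mount /dev/xvdf /data/db
```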
Backups and Restore
● mongodump/mongorestore
  ○ Can be run while the DB is still running
  ○ No need to lock the DB
  ○ Can back up and restore individual collections or even partial collections
  ○ Restore rebuilds indexes
● Automate your backups
  ○ https://github.com/micahwedemeyer/automongobackup
Backups and Restore
● Use --oplog with mongodump and --oplogReplay with mongorestore
● Backups and restores can be slow if your data runs to a few hundred GB
  ○ Plan for it
● Use incremental backups
  ○ Possible with mongodump/mongorestore
    ■ mongodump -q
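The flags above combine roughly like this; the database, collection, numeric "ts" field, and backup paths are hypothetical.

```shell
# Full point-in-time dump of a running replica set member:
mongodump --oplog -o /backups/full-$(date +%F)

# Incremental: only documents stamped after the previous backup,
# selected with mongodump's -q filter.
mongodump -d mydb -c events -q '{"ts": {"$gte": 1335830400}}' \
    -o /backups/incr-$(date +%F)

# Restore the full dump, replaying the captured oplog entries:
mongorestore --oplogReplay /backups/full-2012-05-01
```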
Understand your Data
● Know whether your application is write-heavy or read-heavy
● Separate write-heavy collections from read-heavy collections
● Minimize indexes on write-heavy collections
● Separate operational data from data used for mining/analytics if possible
Querying
● db.<collection>.find({x:123})
  ○ Returns the entire document for each match
  ○ Will be slow if you have large documents
● $exists, $nin & $ne are not very efficient
  ○ Try setting default values for keys instead of using $exists
  ○ Try using $gt and $lt instead of $ne if possible (numerics)
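The two substitutions above can be sketched as pymongo-style filter documents (plain Python dicts); the "clicks" field and its default of 0 are hypothetical.

```python
# Instead of testing for key presence with $exists...
exists_query = {"clicks": {"$exists": True}}
# ...write a default value (e.g. 0) on insert, then query on the value,
# which an index on "clicks" can serve:
default_query = {"clicks": {"$gte": 0}}

# Instead of $ne, which cannot use an index efficiently...
ne_query = {"clicks": {"$ne": 0}}
# ...cover the same numeric values with two ranges:
range_query = {"$or": [{"clicks": {"$lt": 0}},
                       {"clicks": {"$gt": 0}}]}
```

The $or form works because each clause is an index-friendly range scan, whereas $ne forces the server to examine the whole key range.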
Querying
● Limit the data you return to only what you need
  ○ Use range queries
  ○ Limit the number of results
  ○ Limit the number of keys returned
● Increase or decrease the batch size for a cursor based on your needs
  ○ Returning a batch is a network operation
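A pymongo 2.x-flavoured sketch of the same advice, shown as the pieces of a find() call; the field names ("ts", "clicks") and all the numbers are hypothetical.

```python
# Against a live collection this would be:
#   cursor = coll.find(query, projection, limit=LIMIT).batch_size(BATCH_SIZE)
query = {"ts": {"$gte": 1335830400, "$lt": 1335916800}}  # range query (one day)
projection = {"ts": 1, "clicks": 1}   # return only the keys we need
LIMIT = 100                           # cap the number of results
BATCH_SIZE = 500                      # documents fetched per network round trip
```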
Indexes
● Use indexes judiciously
● Create indexes to match your query keys
● Understand what you get with a compound index
  ○ db.collection.ensureIndex({a:1, b:1})
    ■ Gives you an index on a, and on a & b, but not on b alone
    ■ Ascending/descending may sometimes matter when using compound indexes
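The prefix rule can be checked with explain() in the mongo shell; the collection name is hypothetical and a running mongod is assumed.

```shell
mongo mydb <<'EOF'
db.events.ensureIndex({a: 1, b: 1})
db.events.find({a: 5}).explain().cursor        // "BtreeCursor a_1_b_1" - prefix a works
db.events.find({a: 5, b: 7}).explain().cursor  // "BtreeCursor a_1_b_1" - a & b works
db.events.find({b: 7}).explain().cursor        // "BasicCursor" - b alone cannot use it
EOF
```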
Indexes (continued)
● One index per query rule:
  ○ Queries on multiple keys cannot use multiple indexes - use a compound index instead
    ■ $or is an exception
● Make sure that all your indexes fit in memory
  ○ db.<collection>.totalIndexSize()
● Sometimes indexes may not be helpful
  ○ Low-selectivity indexes
● Use explain()
Other details
● Pay attention to the limitations of the MongoDB version and the driver you are using
  ○ e.g. $exists does not use indexes prior to 2.0
  ○ e.g. $and is not supported in 1.8
● Design for performance
  ○ Iterate over schema design if it does not perform
  ○ Sometimes it is better to normalize than to store everything in one large document
  ○ Archive historical data to a warehouse
    ■ For mining/analytics
Administration tools
● We use RockMongo
● But there are many other tools available
  ○ http://www.mongodb.org/display/DOCS/Admin+UIs
● Create read-only users for developers if needed.
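In the 2.0-era shell, a read-only account is the third argument to db.addUser(); the database name and credentials here are hypothetical.

```shell
mongo mydb <<'EOF'
db.addUser("dev_reader", "s3cret", true)   // true => readOnly
EOF
```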