mongodb hacks of frustration

29
2013 MongoDB Hacks of Frustration Foursquare Hacks for a Better Mongo MongoNYC June 21, 2013 Leo Kim Software Engineer, Foursquare

Upload: mongodb

Post on 29-Jun-2015

2.209 views

Category:

Technology


0 download

TRANSCRIPT

Page 1: MongoDB Hacks of Frustration

2013

MongoDB Hacks of FrustrationFoursquare Hacks for a Better Mongo

MongoNYCJune 21, 2013

Leo KimSoftware Engineer, Foursquare

Page 2: MongoDB Hacks of Frustration

2013

Agenda

• About Foursquare

• Vital stats

• Our hacks/tools

• Questions

2013

Page 3: MongoDB Hacks of Frustration

2013

What is Foursquare?

Foursquare helps you explore the world around you.

Meet up with friends, discover new places, and save money using your phone.

Page 4: MongoDB Hacks of Frustration

2013

Big stats

35,000,000+ people

4,000,000,000+ check-ins

55,000,000+ points of interest

1,300,000+ merchants

2013

Page 5: MongoDB Hacks of Frustration

2013

Moar stats● 5MM-6MM checkins a day

● ~4K-5K qps against our API, ~150K-300K qps against Mongo

● 11 clusters, 8 sharded + 3 replica sets

– Mongo 2.2.{3,4}

● ~4TB of data and indexes, all of it kept in memory

– 24-core Intel, 192GB RAM, 4 120GB SSD drives

● Extensive use of sharding + replica set ReadPreference.secondary reads

2013

Page 6: MongoDB Hacks of Frustration

2013

Mongo has scaled with us

We have been using Mongo for three years.

Mongo has enabled us to scale our service along with our growing user base. It has given us the flexibility and agility to innovate in an exciting space where there is still much to figure out.

2013

Page 7: MongoDB Hacks of Frustration

2013

Still, some things to deal with● Monitoring

– MMS is good, but always could use more stats to narrow down on pain points.

● General maintenance

– Constant struggle with data fragmentation

● Sharding

– No load-based balancing (SERVER-2472)

– Overhead of all-shards queries

– Bugs can leave duplicate records on the wrong shards

2013

Page 8: MongoDB Hacks of Frustration

2013

Monitoring hack: “ODash”

2013

Page 9: MongoDB Hacks of Frustration

2013

Monitoring hack: “Mongolyzer”

2013

Page 10: MongoDB Hacks of Frustration

2013

Monitoring hack: “Telemetry”

2013

Page 11: MongoDB Hacks of Frustration

2013

Data size and fragmentation

• Problem: Even with bare metal and SSDs, fragmentation can degrade performance by increasing data size beyond available memory.

– Can also be an issue with autobalancing as chunk moves induce increased paging and further degrade I/O

• We have enough replicas (~400) where we need to do this regularly

2013

Page 12: MongoDB Hacks of Frustration

2013

Alerts!

2013

Page 13: MongoDB Hacks of Frustration

2013

Hack: “Mackinac”

• (Mostly) automated repair script• “Kill file” mongod

– Drain queries gracefully from mongod– About kill files:

http://engineering.foursquare.com/2012/06/20/stability-in-the-midst-of-chaos/

• Stops mongod• Resyncs from primary

– We considered running compact() but it doesn't reclaim disk space. May revisit this though.

2013

Page 14: MongoDB Hacks of Frustration

2013

Hack: “Mackinac”

2013

Page 15: MongoDB Hacks of Frustration

2013

Hack: Shard key checker

• Checks for shard key usage in the app• Loads shard keys from mongo config servers,

matches keys against a given {query, document}• e.g.

db.users({ _id : 12345 }) // shard key match!db.users({ twitter: “[email protected]” }) // shard key miss!

• Why use this?

2013

Page 16: MongoDB Hacks of Frustration

2013

Detect all-shards queries!

• Problem: All-shards queries– Not all our queries use shard keys (unfortunately)

– Use up connections, network traffic, overhead in query processing on mongod + mongos

– Gets worse with more shards

– What happens if one of the replicas is not responding?

● Solution: Measure by intercepting queries with shard key checker and count misses

2013

Page 17: MongoDB Hacks of Frustration

2013

All-shards queries

[2013-06-09 19:24:27,706] WARN c.f.boot.RogueShardKeyChecker - Possible all-shards query: db.venues.find({ "del" : { "$ne" : true} , "mayor" : xxxx , "closed" : { "$ne" : true}}).sort({ "mayor_count" : -1})

[2013-06-09 19:24:28,296] WARN c.f.boot.RogueShardKeyChecker - Possible all-shards query: db.tips.find({ "uid" : xxxx}, { "_id" : 1})

[2013-06-09 19:24:28,326] WARN c.f.boot.RogueShardKeyChecker - Possible all-shards query: db.photos.find({ "uid" : xxxx})

[2013-06-09 19:24:28,696] WARN c.f.boot.RogueShardKeyChecker - Possible all-shards query: db.comments2.find({ "c.u" : xxxx})

[2013-06-09 19:24:32,246] WARN c.f.boot.RogueShardKeyChecker - Possible all-shards query: db.user_venue_aggregations2.find({ "_id" : { "$gte" : { "u" : xxxx , "v" : { "$oid" : "000000000000000000000000"}} , "$lte" : { "u" : xxxx , "v" : { "$oid" : "000000000000000000000000"}}}})

2013

Page 18: MongoDB Hacks of Frustration

2013

All-shards queries

2013

Page 19: MongoDB Hacks of Frustration

2013

Find hot chunks!

• Problem: Autobalancer balances by data size, but not load.

– Checkins shard key → {u : 1}

– Imagine a group of users who check in a bunch.

– Imagine the balancer putting all those users on the same shard, or even the same chunk.

• Solution: Intercept queries with shard key checker and bucket hits by chunk

2013

Page 20: MongoDB Hacks of Frustration

2013

Hack: Hot chunk detector

• Need to do a little more to make this work with the shard key checker

– Create trees of chunk ranges per-collection

– Match shard keys from queries to chunk ranges, accumulate counts

2013

Page 21: MongoDB Hacks of Frustration

2013

Hack: Hot chunk detector

2013

Page 22: MongoDB Hacks of Frustration

2013

Deeper hack: Hot chunk mover

• A standalone process reading from JSON endpoint on hot chunk detector

– Identifies the hottest chunk

– Attempts to split it (if necessary)

– Move the chunk to the “coldest” shard

• Subject to same problems as regular chunk moves, can disrupt latencies

• Currently using a hit ratio of p9999/p50 to identify hot chunk

• Work in progress

2013

Page 23: MongoDB Hacks of Frustration

2013

Sample hot chunk detector json

2013

Page 24: MongoDB Hacks of Frustration

2013

Fix data integrity!

• Problem: Application doesn't always clean up after itself properly.

– Duplicate documents can exist on multiple shards

• Solution: Compare the document shard key from the host replica against the canonical metadata in shard key checker (i.e. where the document should “live”)

2013

Page 25: MongoDB Hacks of Frustration

2013

Hack: “Chunksanity”

• Simple algorithm:– Connects to each shard

– Iterates through each document in each collection

– Verifies that the document is correctly placed according to chunk data in mongo config server

– Deletes any incorrectly placed documents

• Heavyweight process, run only periodically on specific suspicious collections

2013

Page 26: MongoDB Hacks of Frustration

2013

Sample Chunksanity logging

[2013-05-29 18:23:36,128] [main] INFO c.f.m.chunksanity.MongoChunkSanity - Logging misplaced docs to localhost/production_misplaced_docs

[2013-05-29 18:23:36,301] [main] INFO c.f.m.chunksanity.MongoChunkSanity - Verifying chunks on users at office-canary-2/10.101.1.175:26190

[2013-05-29 18:23:37,971] [ForkJoinPool-1-worker-1] INFO c.f.m.chunksanity.MongoChunkSanity - Looking at collection foursquare.users on shard shard0007 using filter: { } and shardKey { "_id" : 1.0}

[2013-05-29 18:24:06,892] [ForkJoinPool-1-worker-2] INFO c.f.m.chunksanity.MongoChunkSanity - Done with foursquare.users on shard users20 0/xxxx misplaced in 28911 ms

2013

Page 27: MongoDB Hacks of Frustration

20132013

Questions?

[email protected]/jobs

2013

Page 28: MongoDB Hacks of Frustration
Page 29: MongoDB Hacks of Frustration