mongodb schema design: four real-world examples
TRANSCRIPT
Perl Engineer & Evangelist, 10genMike Friedman
#MongoDBdays
Schema DesignFour Real-World Use Cases
Single Table En
Agenda• Why is schema design important• 4 Real World Schemas
– Inbox– History– Indexed Attributes– Multiple Identities
• Conclusions
Why is Schema Design important?
• Largest factor for a performant system• Schema design with MongoDB is
different• RDBMS – "What answers do I have?"• MongoDB – "What question will I have?"
#1 - Message Inbox
Let’s getSocial
Sending Messages
?
Design Goals• Efficiently send new messages to
recipients• Efficiently read inbox
Reading my Inbox
?
3 Approaches (there are more)• Fan out on Read• Fan out on Write• Fan out on Write with Bucketing
// Shard on "from"db.shardCollection( "mongodbdays.inbox", { from: 1 } )
// Make sure we have an index to handle inbox readsdb.inbox.ensureIndex( { to: 1, sent: 1 } )
msg = {from: "Joe",to: [ "Bob", "Jane" ],
sent: new Date(), message: "Hi!",
}
// Send a messagedb.inbox.save( msg )
// Read my inboxdb.inbox.find( { to: "Joe" } ).sort( { sent: -1 } )
Fan out on read
Fan out on read – Send Message
Shard 1 Shard 2 Shard 3
Send Message
Fan out on read – Inbox Read
Shard 1 Shard 2 Shard 3
Read Inbox
Considerations• One document per message sent • Reading an inbox means finding all
messages with my own name in the recipient field
• Requires scatter-gather on sharded cluster• Then a lot of random IO on a shard to find
everything
// Shard on “recipient” and “sent” db.shardCollection( "mongodbdays.inbox", { ”recipient”: 1, ”sent”: 1 } )
msg = {from: "Joe",to: [ "Bob", "Jane" ],
sent: new Date(), message: "Hi!",
}
// Send a messagefor ( recipient in msg.to ) {
msg.recipient = msg.to[recipient]db.inbox.save( msg );
}
// Read my inboxdb.inbox.find( { recipient: "Joe" } ).sort( { sent: -1 } )
Fan out on write
Fan out on write – Send Message
Shard 1 Shard 2 Shard 3
Send Message
Fan out on write– Read Inbox
Shard 1 Shard 2 Shard 3
Read Inbox
Considerations• One document per recipient• Reading my inbox is just finding all of the
messages with me as the recipient• Can shard on recipient, so inbox reads hit
one shard• But still lots of random IO on the shard
// Shard on “owner / sequence”db.shardCollection( "mongodbdays.inbox", { owner: 1, sequence: 1 } )db.shardCollection( "mongodbdays.users", { user_name: 1 } )
msg = {from: "Joe",to: [ "Bob", "Jane" ],
sent: new Date(), message: "Hi!",
}
Fan out on write with buckets
// Send a messagefor( recipient in msg.to) { count = db.users.findAndModify({
query: { user_name: msg.to[recipient] }, update: { "$inc": { "msg_count": 1 } }, upsert: true, new: true }).msg_count;
sequence = Math.floor(count / 50);
db.inbox.update({ owner: msg.to[recipient], sequence: sequence }, { $push: { "messages": msg } },{ upsert: true } );
}
// Read my inboxdb.inbox.find( { owner: "Joe" } ).sort ( { sequence: -1 } ).limit( 2 )
Fan out on write with buckets
Fan out on write with buckets• Each “inbox” document is an array of
messages• Append a message onto “inbox” of
recipient• Bucket inboxes so there’s not too many
messages per document• Can shard on recipient, so inbox reads hit
one shard• 1 or 2 documents to read the whole inbox
Fan out on write with buckets - Send
Shard 1 Shard 2 Shard 3
Send Message
Fan out on write with buckets - Read
Shard 1 Shard 2 Shard 3
Read Inbox
#2 – History
Design Goals• Need to retain a limited amount of history
e.g.– Hours, Days, Weeks– May be legislative requirement (e.g. HIPPA, SOX,
DPA)• Need to query efficiently by
– match– ranges
3 Approaches (there are more)• Bucket by Number of messages• Fixed size Array• Bucket by Date + TTL Collections
db.inbox.find() { owner: "Joe", sequence: 25, messages: [ { from: "Joe", to: [ "Bob", "Jane" ], sent: ISODate("2013-03-01T09:59:42.689Z"), message: "Hi!" }, …] }
// Query with a date rangedb.inbox.find ({owner: "friend1", messages: { $elemMatch: {sent:{$gte: ISODate("…") }}}})
// Remove elements based on a datedb.inbox.update({owner: "friend1" }, { $pull: { messages: { sent: { $gte: ISODate("…") } } } } )
Inbox – Bucket by # messages
Considerations• Shrinking documents, space can be
reclaimed with– db.runCommand ( { compact: '<collection>' } )
• Removing the document after the last element in the array as been removed– { "_id" : …, "messages" : [ ], "owner" : "friend1", "sequence" : 0 }
msg = { from: "Your Boss", to: [ "Bob" ],
sent: new Date(), message: "CALL ME NOW!"
}
// 2.4 Introduces $each, $sort and $slice for $pushdb.messages.update(
{ _id: 1 }, { $push: { messages: { $each: [ msg ],
$sort: { sent: 1 },
$slice: -50 }
} })
Maintain the latest – Fixed Size Array
Considerations• Need to compute the size of the array
based on retention period
// messages: one doc per user per daydb.inbox.findOne(){ _id: 1, to: "Joe", sequence: ISODate("2013-02-04T00:00:00.392Z"), messages: [ ] }// Auto expires data after 31536000 seconds = 1 yeardb.messages.ensureIndex( { sequence: 1 }, { expireAfterSeconds: 31536000 } )
TTL Collections
#3 – Indexed Attributes
Design Goal• Application needs to stored a variable
number of attributes e.g.– User defined Form– Meta Data tags
• Queries needed– Equality– Range based
• Need to be efficient, regardless of the number of attributes
2 Approaches (there are more)• Attributes as Embedded Document• Attributes as Objects in an Array
db.files.insert( { _id: "local.0", attr: { type: "text", size: 64, created: ISODate("..." } } )db.files.insert( { _id: "local.1", attr: { type: "text", size: 128} } )db.files.insert( { _id: "mongod", attr: { type: "binary", size: 256, created: ISODate("...") } } )// Need to create an index for each item in the sub-documentdb.files.ensureIndex( { "attr.type": 1 } )db.files.find( { "attr.type": "text"} )// Can perform range queriesdb.files.ensureIndex( { "attr.size": 1 } )db.files.find( { "attr.size": { $gt: 64, $lte: 16384 } } )
Attributes as a Sub-Document
Considerations• Each attribute needs an Index• Each time you extend, you add an index• Lots and lots of indexes
db.files.insert( {_id: "local.0", attr: [ { type: "text" },
{ size: 64 },
{ created: ISODate("...") } ] } )
db.files.insert( { _id: "local.1", attr: [ { type: "text" },
{ size: 128 } ] } )
db.files.insert( { _id: "mongod", attr: [ { type: "binary" },
{ size: 256 }, { created: ISODate("...") } ] } )
db.files.ensureIndex( { attr: 1 } )
Attributes as Objects in Array
Considerations• Only one index needed on attr• Can support range queries, etc.• Index can be used only once per query
#4 – Multiple Identities
Design Goal• Ability to look up by a number of
different identities e.g.• Username• Email address• FB Handle• LinkedIn URL
2 Approaches (there are more)• Identifiers in a single document• Separate Identifiers from Content
db.users.findOne(){ _id: "joe", email: "[email protected], fb: "joe.smith", // facebook li: "joe.e.smith", // linkedin other: {…}}
// Shard collection by _iddb.shardCollection("mongodbdays.users", { _id: 1 } )// Create indexes on each keydb.users.ensureIndex( { email: 1} )db.users.ensureIndex( { fb: 1 } )db.users.ensureIndex( { li: 1 } )
Single Document by User
Read by _id (shard key)
Shard 1 Shard 2 Shard 3
find( { _id: "joe"} )
Considerations• Lookup by shard key is routed to 1 shard• Lookup by other identifier is scatter
gathered across all shards• Secondary keys cannot have a unique
index
// Create unique indexdb.identities.ensureIndex( { identifier : 1} , { unique: true} )
// Create a document for each users documentdb.identities.save( { identifier : { hndl: "joe" }, user: "1200-42" } )db.identities.save( { identifier : { email: "[email protected]" }, user: "1200-42" } )db.identities.save( { identifier : { li: "joe.e.smith" }, user: "1200-42" } )
// Shard collection by _iddb.shardCollection( "mydb.identities", { identifier : 1 } )// Create unique indexdb.users.ensureIndex( { _id: 1} , { unique: true} )// Shard collection by _iddb.shardCollection( "mydb.users", { _id: 1 } )
Document per Identity
Read requires 2 reads
Shard 1 Shard 2 Shard 3
db.identities.find({"identifier" : { "hndl" : "joe" }})
db.users.find( { _id: "1200-42"} )
Considerations• Lookup to Identities is a routed query• Lookup to Users is a routed query• Unique indexes available
Conclusion
Summary• Multiple ways to model a domain problem• Understand the key uses cases of your
app• Balance between ease of query vs. ease
of write• Random IO should be avoided
Perl Engineer & Evangelist, 10gen
Mike Friedman
#MongoDBdays
Thank You