schema design
TRANSCRIPT
Schema Design
Perl Engineer & Evangelist, MongoDB
Mike Friedman
Agenda
• What is a Record?
• Core Concepts
• What is an Entity?
• Associating Entities
• General Recommendations
All application development isSchema Design
Success comes fromProper Data Structure
What is a Record?
Key → Value
• One-dimensional storage
• Single value is a blob
• Query on key only
• No schema
• Value cannot be updated, only replaced
Key Blob
Relational
• Two-dimensional storage (tuples)
• Each field contains a single value
• Query on any field
• Very structured schema (table)
• In-place updates
• Normalization process requires many tables, joins, indexes, and poor data locality
PrimaryKey
Document
• N-dimensional storage
• Each field can contain 0, 1, many, or embedded values
• Query on any field & level
• Flexible schema
• Inline updates *
• Embedding related data has optimal data locality, requires fewer indexes, has better performance
_id
Core Concepts
Traditional Schema DesignFocus on data storage
Document Schema DesignFocus on data use
Another way to think about itTraditional:What answers do I have?
Document:What questions do I have?
Three Building Blocks ofDocument Schema Design
1 – Flexibility
• Choices for schema design
• Each record can have different fields
• Common structure can be enforced by application
• Easy to evolve as needed
2 – ArraysMultiple Values per Field
• Each field can be:– Absent– Set to null– Set to a single value– Set to an array of many values
• Query for any matching value– Can be indexed and each value in the array is in
the index
3 - Embedded Documents
• An acceptable value is a document
• Nested documents provide structure
• Query any field at any level– Can be indexed
What is an Entity?
An Entity
• Object in your model
• Associations with other entities
An Entity
• Object in your model
• Associations with other entities
Referencing (Relational)
Embedding (Document)
has_one embeds_one
belongs_to embedded_in
has_many embeds_many
has_and_belongs_to_manyMongoDB has both referencing and embedding for
universal coverage
Let's model something togetherHow about a business card?
Business Card
Referencing
Addresses
{“_id”: 1,“street”: “10260 Bandley
Dr”,“city”: “Cupertino”,“state”: “CA”,“zip_code”: ”95014”,“country”: “USA”
}
Contacts
{ “_id”: 2, “name”: “Steven Jobs”, “title”: “VP, New Product Development”, “company”: “Apple Computer”, “phone”: “408-996-1010”, “address_id”: 1}
Embedding
Contacts
{ “_id”: 2, “name”: “Steven Jobs”, “title”: “VP, New Product Development”, “company”: “Apple Computer”, “address”: {
“street”: “10260 Bandley Dr”, “city”: “Cupertino”, “state”: “CA”, “zip_code”: ”95014”, “country”: “USA”
}, “phone”: “408-996-1010”}
Relational Schema
Contact
• name• compan
y• title• phone
Address
• street• city• state• zip_cod
e
Contact
• name• company• adress
• Street• City• State• Zip
• title• phone
• address• street• city• State• zip_cod
e
Document Schema
How are they different? Why?
Contact
• name• compan
y• title• phone
Address
• street• city• state• zip_cod
e
Contact
• name• company• adress
• Street• City• State• Zip
• title• phone
• address• street• city• state• zip_cod
e
Schema Flexibility
{ “name”: “Steven Jobs”, “title”: “VP, New Product Development”, “company”: “Apple Computer”, “address”: {
“street”: “10260 Bandley Dr”, “city”: “Cupertino”, “state”: “CA”, “zip_code”: ”95014”
}, “phone”: “408-996-1010”}
{ “name”: “Larry Page”, “url”: “http://google.com/”, “title”: “CEO”, “company”: “Google!”, “email”: “[email protected]”, “address”: { “street”: “555 Bryant, #106”, “city”: “Palo Alto”, “state”: “CA”, “zip_code”: “94301” } “phone”: “650-618-1499”, “fax”: “650-330-0100”}
Example
Let’s Look at anAddress Book
Address Book
• What questions do I have?
• What are my entities?
• What are my associations?
Address Book Entity-Relationship
Contacts• name• company• title
Addresses
• type• street• city• state• zip_code
Phones• type• number
Emails• type• address
Thumbnails
• mime_type• data
Portraits• mime_type• data
Groups• name
N
1
N
1
N
N
N
1
1
1
11
Twitters• name• location• web• bio
1
1
Associating Entities
One to One
Contacts• name• company• title
Addresses
• type• street• city• state• zip_code
Phones• type• number
Emails• type• address
Thumbnails
• mime_type• data
Portraits• mime_type• data
Groups• name
N
1
N
1
N
N
N
1
1
1
11
Twitters• name• location• web• bio
1
1
One to OneSchema Design Choices
contact• twitter_id
twitter1 1
contact twitter• contact_id1 1
Redundant to track relationship on both sides • Both references must be updated for consistency
• May save a fetch?
Contact• twitter
twitter 1
One to OneGeneral Recommendation
• Full contact info all at once– Contact embeds twitter• Parent-child relationship
– “contains”
• No additional data duplication• Can query or index on embedded field
– e.g., “twitter.name”
Contact• twitter
twitter 1
One to Many
Contacts• name• company• title
Addresses
• type• street• city• state• zip_code
Phones• type• number
Emails• type• address
Thumbnails
• mime_type• data
Portraits• mime_type• data
Groups• name
N
1
N
1
N
N
N
1
1
1
11
Twitters• name• location• web• bio
1
1
One to ManySchema Design Choices
contact• phone_ids: [
]phone1 N
contact phone• contact_id1 N
Redundant to track relationship on both sides • Both references must be updated for consistency
• Not possible in relational DBs• Save a fetch?
Contact• phones
phoneN
One to ManyGeneral Recommendation
• Full contact info all at once– Contact embeds multiple phones• Parent-children relationship
– “contains”
• No additional data duplication• Can query or index on any field
– e.g., { “phones.type”: “mobile” }– Exceptional cases…• Scaling: maximum document size is 16MB
Contact• phones
phoneN
Many to Many
Contacts• name• company• title
Addresses
• type• street• city• state• zip_code
Phones• type• number
Emails• type• address
Thumbnails
• mime_type• data
Portraits• mime_type• data
Groups• name
N
1
N
1
N
N
N
1
1
1
11
Twitters• name• location• web• bio
1
1
Many to ManyTraditional Relational Association
Join table
Contacts• name• company• title• phone
Groups• name
GroupContacts
• group_id• contact_id
Use arrays instead
X
Many to ManySchema Design Choices
group• contact_ids:
[ ]contactN N
groupcontact• group_ids:
[ ]N N
Redundant to track relationship on both sides • Both references must be
updated for consistency
Redundant to track relationship on both sides • Duplicated data must be
updated for consistency
group• contacts
contactN
contact• groups
group N
Many to ManyGeneral Recommendation
• Depends on use case1. Simple address book• Contact references groups
2. Corporate email groups• Group embeds contacts for performance
• Exceptional cases– Scaling: maximum document size is 16MB– Scaling may affect performance and working set
groupcontact• group_ids:
[ ]N N
Contacts• name• company• title
addresses• type• street• city• state• zip_code
phones• type• number
emails• type• address
thumbnail• mime_type• data
Portraits• mime_type• data
Groups• name
N
1
N
1
twitter• name• location• web• bio
N
N
N
1
1
Document model - holistic and efficient representation
Contact document example
{
“name” : “Gary J. Murakami, Ph.D.”,
“company” : “MongoDB, Inc.”,
“title” : “Lead Engineer”,
“twitter” : {
“name” : “Gary Murakami”, “location” : “New Providence, NJ”,
“web” : “http://www.nobell.org”
},
“portrait_id” : 1,
“addresses” : [
{ “type” : “work”, “street” : ”229 W 43rd St.”, “city” : “New York”, “zip_code” : “10036” }
],
“phones” : [
{ “type” : “work”, “number” : “1-866-237-8815 x8015” }
],
“emails” : [
{ “type” : “work”, “address” : “[email protected]” },
{ “type” : “home”, “address” : “[email protected]” }
]
}
Working Set
To reduce the working set, consider…
• Reference bulk data, e.g., portrait
• Reference less-used data instead of embedding – Extract into referenced child document
Also for performance issues with large documents
General Recommendations
Legacy Migration
1. Copy existing schema & some data to MongoDB
2. Iterate schema design developmentMeasure performance, find bottlenecks, and embed
1. one to one associations first2. one to many associations next3. many to many associations
3. Migrate full dataset to new schema
New Software Application? Embed by default
Embedding over Referencing • Embedding is a bit like pre-joined data
– BSON (Binary JSON) document ops are easy for the server
• Embed (90/10 following rule of thumb)– When the “one” or “many” objects are viewed in
the context of their parent– For performance– For atomicity
• Reference– When you need more scaling– For easy consistency with “many to many”
associations without duplicated data
It’s All About Your Application
• Programs+Databases = (Big) Data Applications
• Your schema is the impedance matcher– Design choices: normalize/denormalize,
reference/embed– Melds programming with MongoDB for best of
both– Flexible for development and change
• Programs×MongoDB = Great Big Data Applications
Thank You
Perl Engineer & Evangelist, MongoDB
Mike Friedman