Azure DocumentDB: Deep Dive into Advanced Features
Aravind Ramachandran, Program Manager, Azure DocumentDB, @arkramac
Andrew Liu, Program Manager, Azure DocumentDB, @aliuy8
A Quick Recap…
3 V’s of data: Endless possibilities
Learning
Gaming
Retail
Telematics
Mobile Apps
IoT
Velocity: High throughput with low latency
Volume: Massive amounts of data
Variety: Schema-freedom
The 2x2s of database tradeoffs
[Figure: six 2x2 tradeoff charts]
• Latency (low to high) vs. Durability (low to high)
• Schema/Index Management (agnostic to required) vs. Query (poor to rich)
• Availability (low to high) vs. Programmability (low to high)
• Scale (static to elastic) vs. Distribution (single DC to world)
• Scale (low to high) vs. Txn Scope (single item to multiple items)
• Performance Isolation (noisy neighbor to airtight) vs. TCO (low to high)
DocumentDB: Capabilities
Guaranteed low latency
• <10ms reads / <15ms writes @ P99
• Requests are served from the local region
• Write-optimized, latch-free database engine designed for SSDs and low-latency access
• Synchronous and automatic document indexing at sustained ingestion rates
Elastic and limitless global scale
• Independently scale throughput and storage, locally and globally
• Transparent partition management and routing
Multiple consistency levels
• Multiple well-defined consistency levels
• Intuitive programming model for relaxed consistency models
• Clear PACELC tradeoffs and 99.99% availability SLAs
SQL and JavaScript, schema-free
• Automatic tree-path-based indexing
• No schemas or secondary indices required upfront
• SQL and JavaScript language-integrated queries
• Hash, range, and spatial indexes
• Multi-document, JavaScript language-integrated transactions
DocumentDB resource model
1. Resources
• Identified by their logical and stable URI
• Represented as JSON documents
• Partitioned, and span machines, clusters and regions
2. Resource model
• Stateless interaction (HTTP and TCP)
• Hierarchical overlay atop the partitioning model
3. Partitioning model
• Grid partitioning: horizontal based on hash/range, and vertical across regions
• Each partition made highly available via a replica set
[Figure: partition sets made up of partitions, each backed by a replica-set; local distribution within a region and global distribution across US-East, US-West and N Europe]
Accessing DocumentDB
[Figure: clients reach the DocumentDB service through two paths]
• DocumentDB client SDKs and tools (Java, .NET, …) and the DocumentDB Hadoop and Spark connectors, over TCP/SSL or HTTPS, using JSON, SQL and JavaScript
• MongoDB wire protocol: drivers for MongoDB (Java, .NET, Ruby, …), the MongoDB toolchain, MongoDB client drivers and the Parse SDK, using BSON
Let’s talk about…
• Modeling JSON Documents
• Collections and Scaling
• Query and Indexing
• Global Distribution
• Tips and Best Practices
Everything you need to know to build blazing fast, planet-scale applications!
Let’s talk about JSON documents
"With great power comes great responsibility"
- Uncle Ben
DocumentDB gives you the power of true schema-freedom.
Generally de-normalize… but don't just do it blindly.
How do approaches differ?
• Data normalization
• ORM
• Come as you are
Modeling Data: The Relational Way
[Figure: relational schema with tables Person, Address, ContactDetail and ContactDetailType, plus the link table PersonContactDetailLnk (PersonId, ContactDetailId), related through their Id columns]
Modeling Data: The Document Way
[Figure: a single Person document (Id) that embeds Addresses (Address…) and ContactDetails (ContactDetail…)]
{
  "id": "0ec1ab0c-de08-4e42-a429-...",
  "addresses": [
    { "street": "1 Redmond Way", "city": "Redmond", "state": "WA", "zip": 98052 }
  ],
  "contactDetails": [
    { "type": "home", "detail": "555-1212" },
    { "type": "email", "detail": "[email protected]" }
  ],
  ...
}
To embed, or to reference, that is the question
Data modeling with denormalization
{
  "id": "1",
  "firstName": "Thomas",
  "lastName": "Andersen",
  "addresses": [
    { "line1": "100 Some Street", "line2": "Unit 1", "city": "Seattle", "state": "WA", "zip": 98012 }
  ],
  "contactDetails": [
    { "email": "[email protected]" },
    { "phone": "+1 555 555-5555", "extension": 5555 }
  ]
}
Try to model your entity as a self-contained document.
Generally, use embedded data models when:
• There are "contains" relationships between entities
• There are one-to-few relationships between entities
• Embedded data changes infrequently
• Embedded data won’t grow without bounds
• Embedded data is integral to data in a document
Denormalizing typically provides better read performance.
Data modeling with referencing
In general, use normalized data models when:
• Write performance is more important than read performance
• Representing one-to-many relationships
• Representing many-to-many relationships
• Related data changes frequently
Tradeoffs:
• Provides more flexibility than embedding
• More round trips to read data
User document
{ "id": "xyz", "username": "user xyz" }
Address document
{ "id": "address_xyz", "userid": "xyz", "address": { … } }
Contact details document
{ "id": "contact_xyz", "userid": "xyz", "email": "[email protected]", "phone": "555 5555" }
Normalizing typically provides better write performance.
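The read-side cost of referencing can be made concrete with a small sketch. It assumes a hypothetical in-memory `FakeCollection` that only counts lookups; it is not a DocumentDB SDK, and the city and phone values are illustrative.

```python
# Hypothetical in-memory stand-in for a collection; it only counts lookups.
class FakeCollection:
    def __init__(self, docs):
        self.docs = {d["id"]: d for d in docs}
        self.round_trips = 0  # simulated requests sent to the store

    def read(self, doc_id):
        self.round_trips += 1
        return self.docs[doc_id]


# Embedded model: one self-contained document per user.
embedded = FakeCollection([
    {"id": "xyz", "username": "user xyz",
     "address": {"city": "Seattle"},
     "contact": {"phone": "555 5555"}},
])

# Referenced model: three documents tied together by "userid".
referenced = FakeCollection([
    {"id": "xyz", "username": "user xyz"},
    {"id": "address_xyz", "userid": "xyz", "address": {"city": "Seattle"}},
    {"id": "contact_xyz", "userid": "xyz", "phone": "555 5555"},
])


def load_profile_embedded(col, user_id):
    doc = col.read(user_id)
    return {"username": doc["username"],
            "city": doc["address"]["city"],
            "phone": doc["contact"]["phone"]}


def load_profile_referenced(col, user_id):
    user = col.read(user_id)
    address = col.read("address_" + user_id)
    contact = col.read("contact_" + user_id)
    return {"username": user["username"],
            "city": address["address"]["city"],
            "phone": contact["phone"]}


profile_a = load_profile_embedded(embedded, "xyz")
profile_b = load_profile_referenced(referenced, "xyz")
print(embedded.round_trips, referenced.round_trips)
```

Both functions return the same profile, but the referenced model needs three lookups where the embedded model needs one. In exchange, an address change in the referenced model rewrites only the small address document.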
No magic bullet
Hybrid approach: model on a property level (as opposed to record level)
Optimize your data model for your workload… (as opposed to blindly following types)
Modeling impacts RU due to document size
Hybrid models
Author document
{
  "id": "1",
  "firstName": "Thomas",
  "lastName": "Andersen",
  "countOfBooks": 3,
  "books": [1, 2, 3],
  "images": [
    {"thumbnail": "http://....png"},
    {"profile": "http://....png"}
  ]
}
Book document
{
  "id": 1,
  "name": "DocumentDB 101",
  "authors": [
    {"id": 1, "name": "Thomas Andersen", "thumbnail": "http://....png"},
    {"id": 2, "name": "William Wakefield", "thumbnail": "http://....png"}
  ]
}
Collections + Elastic Scale
Elastic scale
Measuring Throughput (Request Units)
• Each replica gets a fixed budget of request units
• Request Unit/sec (RU) is the normalized currency for % IOPS, % CPU and % memory
• Operations consume request units (RUs):
  READ: GET Document
  INSERT: POST Documents
  REPLACE: PUT Document
  Query: POST Documents
  …
[Figure: incoming requests against a replica, from quiescent through Min RU/sec (no throttling) up to Max RU/sec, beyond which requests are rate limited]
• Requests get rate limited if they exceed the SLA
• Customers pay for reserved request units by the hour
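A minimal sketch of the reserved-throughput idea, assuming a simple fixed one-second window. The class name, the window model and the per-request charges are illustrative assumptions, not the service's actual rate-limiting algorithm.

```python
class RequestUnitBudget:
    """Fixed-window RU budget; a simplified model for illustration only."""

    def __init__(self, ru_per_second):
        self.ru_per_second = ru_per_second
        self.window = None    # the one-second window currently tracked
        self.consumed = 0.0   # RUs charged so far in that window

    def try_consume(self, now_seconds, charge):
        window = int(now_seconds)
        if window != self.window:
            # A new second starts with a fresh budget.
            self.window = window
            self.consumed = 0.0
        if self.consumed + charge > self.ru_per_second:
            return False      # this request would be rate limited
        self.consumed += charge
        return True


budget = RequestUnitBudget(ru_per_second=10)
results = [
    budget.try_consume(0.1, 4),  # 4 of 10 RUs used
    budget.try_consume(0.5, 4),  # 8 of 10 RUs used
    budget.try_consume(0.9, 4),  # would reach 12, so it is rejected
    budget.try_consume(1.2, 4),  # next second, budget is fresh again
]
print(results)
```

The third request is rejected because it would exceed the reserved 10 RU for that second; a client would typically retry it after a short delay.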
What are partitions?
[Figure: a collection divided into Partition 1, Partition 2, …, Partition i, …, Partition n]
[Figure: the same collection with Partition Key = city; documents for London, Paris, New York, Houston, Chicago, New Delhi, Mumbai, Boston, Berlin, … are spread across the partitions]
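One way to picture the routing is a stable hash of the partition key value. This is a sketch under assumptions: the SHA-256 hash, the modulo placement and the partition count of 4 are illustrative choices, not the hash function or placement scheme the service uses.

```python
import hashlib


def partition_for(partition_key, partition_count):
    """Map a partition key value to a partition index with a stable hash."""
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partition_count


cities = ["London", "Paris", "New York", "Houston", "Chicago",
          "New Delhi", "Mumbai", "Boston", "Berlin"]

# Group the city values by the partition they hash to.
placement = {}
for city in cities:
    placement.setdefault(partition_for(city, 4), []).append(city)

for partition in sorted(placement):
    print(partition, placement[partition])
```

Every document with the same city always lands on the same partition, which is what lets a query filtered on city go to a single partition.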
Partitioning patterns
• Writes should scale across partition keys
• Reads should minimize cross-partition lookups
[Figure: requests fanning out over Partition 1, Partition 2, …, Partition i, …, Partition n]
Recipe for Choosing Partition Key
• Start with the workload: is it read heavy or write heavy?
• Top Queries – Look for commonly filtered properties
• Transaction Boundary
• Avoid Storage + Performance Bottlenecks
• Aim for high cardinality… More partition key values = happiness
• Examples: Partition by TenantId or DeviceId… composite w/ Timestamp
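The cardinality advice can be checked with a small experiment. The workload below (1,000 writes from 100 devices across 2 tenants), the hash function and the partition count are hypothetical values chosen only to show the effect.

```python
import hashlib
from collections import Counter


def partition_for(key, partition_count):
    """Stable hash placement; an illustrative stand-in for the real scheme."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partition_count


def writes_per_partition(keys, partition_count):
    """Count how many writes each partition receives for the given keys."""
    return Counter(partition_for(k, partition_count) for k in keys)


# Hypothetical workload: (tenant, device) pairs for 1,000 writes.
writes = [("tenant%d" % (i % 2), "device%d" % (i % 100)) for i in range(1000)]

by_tenant = writes_per_partition([t for t, _ in writes], 10)
by_device = writes_per_partition([d for _, d in writes], 10)

print("partitions used by tenant key:", len(by_tenant))
print("partitions used by device key:", len(by_device))
print("hottest partition (tenant key):", max(by_tenant.values()))
print("hottest partition (device key):", max(by_device.values()))
```

With only two tenant values, at most two partitions ever receive writes, so one of them absorbs at least half the load. The device key has 100 distinct values and spreads the same writes far more evenly.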
Let's talk about Query and Indexing
Query and Indexing Demo
DocumentDB: SQL and JavaScript queries
Document 1
{
  "locations": [
    { "country": "Germany", "city": "Berlin" },
    { "country": "France", "city": "Paris" }
  ],
  "headquarter": "Belgium",
  "exports": [{ "city": "Moscow" }, { "city": "Athens" }]
}
Document 2
{
  "locations": [{ "country": "Germany", "city": "Bonn", "revenue": 200 }],
  "headquarter": "Italy",
  "exports": [
    { "city": "Berlin", "dealers": [{ "name": "Hans" }] },
    { "city": "Athens" }
  ]
}
[Figure: each document drawn as a tree; interior nodes are property names and array positions (locations, headquarter, exports, 0, 1, country, city, revenue, dealers, name) and leaves are values (Germany, Berlin, France, Paris, Moscow, Athens, Belgium, Bonn, 200, Italy, Hans)]
SQL
SELECT C.locations FROM company C WHERE C.headquarter = "Belgium"
JavaScript
function businessLogic() {
  var country = "Belgium";
  __.filter(function(x) { return x.headquarter === country; });
}
Result
{
  "results": [
    {
      "locations": [
        { "country": "Germany", "city": "Berlin" },
        { "country": "France", "city": "Paris" }
      ]
    }
  ]
}
[Figure: the result drawn as a tree: results, 0, locations, 0 and 1, country and city, with leaves Germany, Berlin, France, Paris]
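For readers who want to trace the query by hand, here is a plain-Python sketch of the same filter and projection over the two sample documents. The function name `select_locations` is an illustrative assumption; this runs in memory and does not call DocumentDB.

```python
companies = [
    {"locations": [{"country": "Germany", "city": "Berlin"},
                   {"country": "France", "city": "Paris"}],
     "headquarter": "Belgium",
     "exports": [{"city": "Moscow"}, {"city": "Athens"}]},
    {"locations": [{"country": "Germany", "city": "Bonn", "revenue": 200}],
     "headquarter": "Italy",
     "exports": [{"city": "Berlin", "dealers": [{"name": "Hans"}]},
                 {"city": "Athens"}]},
]


def select_locations(docs, headquarter):
    """Project 'locations' from each document whose 'headquarter' matches."""
    return {"results": [{"locations": d["locations"]}
                        for d in docs
                        if d.get("headquarter") == headquarter]}


result = select_locations(companies, "Belgium")
print(result)
```

Only the first document has its headquarter in Belgium, so the result holds that document's two locations and nothing from the second document.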
Indexing under the hood
• Logically the index is a union of all the document trees
• Structure is contributed by the interior nodes; instance values are the leaves
• Structural information and instance values are normalized into a unifying concept of JSON-Path
[Figure: the common structure shared by the document trees]
Terms                  Postings List
$/location/0/          1, 2
location/0/country/    1, 2
location/0/city/       1, 2
0/country/Germany      1, 2
1/country/France       2
…                      …
0/city/Moscow          2
0/dealers/0            2
[Figure: index term layouts over the path location, 0, country, Germany for range & ORDER BY queries, wildcard queries, and spatial queries (coordinates)]
Dynamic encoding of postings list (E-WAH/differential)
Check out our VLDB paper, here!
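The idea of treating the index as a union of document trees can be sketched as an inverted index from path terms to postings lists. This is a simplification under assumptions: it stores one full root-to-leaf path per term and a plain sorted list of document ids, whereas the slides describe shorter path segments and compressed postings.

```python
from collections import defaultdict


def paths(node, prefix="$"):
    """Yield one root-to-leaf path string per leaf value of a JSON tree."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from paths(value, prefix + "/" + key)
    elif isinstance(node, list):
        for position, value in enumerate(node):
            yield from paths(value, prefix + "/" + str(position))
    else:
        yield prefix + "/" + str(node)


def build_index(docs_by_id):
    """Map every path term to the sorted ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, doc in docs_by_id.items():
        for term in paths(doc):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


docs = {
    1: {"locations": [{"country": "Germany", "city": "Berlin"},
                      {"country": "France", "city": "Paris"}],
        "headquarter": "Belgium"},
    2: {"locations": [{"country": "Germany", "city": "Bonn"}],
        "headquarter": "Italy"},
}

index = build_index(docs)
print(index["$/locations/0/country/Germany"])
print(index["$/headquarter/Belgium"])
```

An equality filter such as headquarter = "Belgium" then becomes a single term lookup, and no schema had to be declared to build the index.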
Queries that use the index
• Equality: =
• Range: <, >, <=, >=
• ORDER BY
• String operators: STARTSWITH
• Spatial operators: ST_WITHIN and ST_DISTANCE
• Array operators: ARRAY_CONTAINS
• Schema operators: IS_DEFINED, IS_NUMBER, IS_STRING, …
Indexing Policies
Configuration                 Level            Options
Automatic                     Per collection   True (default) or False; override with each document write
Indexing Mode                 Per collection   Consistent, Lazy, and None; None for KV workloads
Included and excluded paths   Per path         Individual path or recursive includes (? and *)
Indexing Type                 Per path         Supports Hash, Range, and Spatial
Indexing Precision            Per path         Supports 1-100 per path (and max); tradeoff between storage, query RUs and write RUs
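As an illustration of how these options combine, here is a sketch of a policy written as a Python dictionary and serialized to JSON. The property names follow the indexing policy JSON shape as the editor recalls it from the DocumentDB documentation, so check them against the current reference. The excluded path `/images/*` and the precision values are hypothetical, and -1 is assumed to mean maximum precision.

```python
import json

# Illustrative policy only; verify property names against the official docs.
indexing_policy = {
    "automatic": True,             # index every document unless overridden
    "indexingMode": "consistent",  # alternatives: "lazy", "none"
    "includedPaths": [
        {
            "path": "/*",          # recursive include of the whole tree
            "indexes": [
                {"kind": "Range", "dataType": "Number", "precision": -1},
                {"kind": "Hash", "dataType": "String", "precision": 3},
            ],
        }
    ],
    # Hypothetical path: skip indexing for properties never used in queries.
    "excludedPaths": [{"path": "/images/*"}],
}

policy_json = json.dumps(indexing_policy, indent=2)
print(policy_json)
```

Excluding paths that are never queried is one way to trade query flexibility for lower storage and write RU cost.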
Let’s talk about Planet-Scale
Guaranteed low latency
“I want my data wherever my users are.”
Guaranteed high availability
Globally. With policy based failover.
99.99%
Multi-region DocumentDB databases
[Figure: a DocumentDB collection as partition sets; partitions and replica-sets are distributed locally within a region and globally across US-East, US-West and India]
[Figure: a DocumentDB collection provisioned at 2M RUs, with primary replica-sets in one region and secondary replica-sets in two other regions, each region at 2M RUs]
Total RUs = Provisioned RUs x Number of regions
In this example: 2M RUs x 3 regions = 6M RUs
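The same arithmetic as a tiny helper, useful when estimating throughput for other region counts. The function name is an illustrative assumption.

```python
def total_request_units(provisioned_rus, region_count):
    """Total RUs = provisioned RUs per region x number of regions."""
    return provisioned_rus * region_count


total = total_request_units(2_000_000, 3)
print(total)  # 6000000
```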
Programmable data consistency
“It’s hard to write distributed apps.”
Strong consistency, High latency
Eventual consistency, Low latency
Consistency Levels
• PACELC Theorem and the associated tradeoffs
• Strong, Eventual, Bounded Staleness, and Session
[Figure: Strong, Bounded Staleness, Session, Eventual. Left to right: weaker consistency, better read scalability, lower write latency. Each level shows a client talking to one primary (P) and two secondaries (S).]
Bounded Staleness
• Consistent prefix reads
• Reads lag behind writes by K prefixes or T interval
Session
• Monotonic reads, monotonic writes, and read-your-writes guarantee
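The read-your-writes guarantee of session consistency can be sketched with a session token. This is a toy model under assumptions: the `Replica` and `Session` classes, the sequence numbers and the fallback to the primary are illustrative and do not describe the service's actual protocol.

```python
class Replica:
    """A replica that has applied writes up to some sequence number."""

    def __init__(self):
        self.data = {}
        self.applied = 0  # highest write sequence number applied here

    def apply(self, seq, key, value):
        self.data[key] = value
        self.applied = seq


class Session:
    """Client session that remembers the sequence number of its last write."""

    def __init__(self, primary, secondary):
        self.primary = primary
        self.secondary = secondary
        self.token = 0

    def write(self, key, value):
        seq = self.primary.applied + 1
        self.primary.apply(seq, key, value)
        self.token = seq
        return seq

    def read(self, key):
        # Use the secondary only if it has caught up with this session.
        if self.secondary.applied >= self.token:
            return self.secondary.data.get(key)
        return self.primary.data.get(key)


primary, secondary = Replica(), Replica()
session = Session(primary, secondary)

seq = session.write("greeting", "hello")
stale = secondary.data.get("greeting")     # raw secondary read: not there yet
before = session.read("greeting")          # session read falls back to primary
secondary.apply(seq, "greeting", "hello")  # replication catches up
after = session.read("greeting")           # now served by the secondary
print(stale, before, after)
```

A reader outside the session could still see the stale value on the lagging secondary, which is the kind of behavior weaker levels allow in exchange for lower latency.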
Global Distribution Demo
DocumentDB Recent Updates
• Automatic Expiration via Time-To-Live (TTL)
• Expanded Geo-Spatial support for Polygons and Lines
• Preview support for:
  • Local Emulator
  • IP Filtering
  • Self-Service Backup + Restore
  • Protocol Support for MongoDB
Q&A and more resources…
AskDocDB@microsoft
Follow @DocumentDB
Use #DocumentDB
documentdb.com
#azure-documentDB
Session Evaluations
Your feedback is important and valuable.
3 ways to access:
• Go to passSummit.com
• Download the GuideBook App and search: PASS Summit 2016
• Follow the QR code link displayed on session signage throughout the conference venue and in the program guide
Submit by 5pm Friday November 6th to WIN prizes
Thank You
Learn more from Azure [email protected] or follow @DocumentDB