Azure DocumentDB
TRANSCRIPT
Italian Virtual Chapter – 19.10.2016
Azure DocumentDB
Marco Parenzan
Microsoft MVP for Azure
Microsoft Azure Trainer @ Cloud Academy SAGL
Community Lead [email protected]
@marco_parenzan
DocumentDB
◇ Fully managed
◇ Schema agnostic
◇ Scalable
◇ Tunable consistency levels
◇ Tunable indexing policies
◇ Familiar SQL syntax for querying
◇ JavaScript execution
Documents
Developer Appeal
◇ A document is a JSON document
◇ DocumentDB is a schemaless database
◇ Resilient to iterative schema changes
◇ Promotes code-first development (mapping objects to JSON)
◇ Low impedance as an object / JSON store; no ORM required
◇ Richer query and indexing (compared to KV stores)
◇ It just works
◇ It's fast
◇ It's great for Catalog Data, Preference and State, Event Store, User Generated Content, Data Exchange
Train yourself with ViewModels
◇ Implement a real contract: something to exchange from Presentation to BL/DA
◇ ViewModel = a model that is functional just for presentation, not persistence
■ No more Ids
■ No more null fields
■ No more grayed/hidden fields
■ No more graphs
■ No more joins
■ No multiple roles per entity (just one)
◇ Well represented in JSON
Come as you are
Data normalization
ORM
Embedding vs. Referencing
Embedding vs. referencing
[Diagram: embed | reference]
Referencing
◇ Representing one-to-many relationships.
◇ Representing many-to-many relationships.
◇ Related data changes frequently.
◇ Referenced data could be unbounded.
◇ Provides more flexibility than embedding
■ More round trips to read data
◇ Normalizing typically provides better write performance
Embedding
◇ There are "contains" relationships between entities.
◇ There are one-to-few relationships between entities.
◇ There is embedded data that changes infrequently.
◇ There is embedded data that won't grow without bound.
◇ There is embedded data that is integral to data in a document.
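The two modelling styles can be sketched with plain JavaScript objects. The document shapes and the loadById helper are hypothetical, chosen only for illustration; they come neither from the talk nor from a DocumentDB SDK.

```javascript
// Embedded: a one-to-few "contains" relationship lives inside the parent document.
const personEmbedded = {
  id: "person-1",
  name: "Ada",
  addresses: [
    { city: "Pordenone", country: "IT" },
    { city: "Seattle", country: "US" }
  ]
};

// Referenced: unbounded or frequently changing data lives in its own documents.
const store = new Map([
  ["person-1", { id: "person-1", name: "Ada", orderIds: ["order-1", "order-2"] }],
  ["order-1", { id: "order-1", total: 40 }],
  ["order-2", { id: "order-2", total: 2 }]
]);

let roundTrips = 0;

// Hypothetical stand-in for a single-document read against the database.
function loadById(id) {
  roundTrips += 1;
  return store.get(id);
}

// Reading a person with orders costs one read for the parent plus one per reference.
function loadPersonWithOrders(id) {
  const person = loadById(id);
  const orders = person.orderIds.map((orderId) => loadById(orderId));
  return { ...person, orders };
}

const loaded = loadPersonWithOrders("person-1");
console.log(loaded.orders.length, roundTrips);
```

The embedded form is read in one go, while the referenced form pays one extra read per reference, which is the "more round trips" cost listed on the Referencing slide.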
Resource Model
◇ DocumentDB is Platform as a Service
■ No on-premises version
◇ RESTful API
■ All DocumentDB elements are public and accessible via a resource URI
◇ Resource
■ JSON resources
Resource Model Items
Database Account > Databases > Collections > Documents
Database Account
◇ Unit of Authorization
◇ Unit of Consistency
Unit of Authorization
◇ Master keys
■ Upon creation of a DocumentDB account, two master keys (primary and secondary) are created. These keys enable full administrative access to all resources within the DocumentDB account.
◇ Read-only keys
■ Upon creation of a DocumentDB account, two read-only keys (primary and secondary) are created. These keys enable read-only access to all resources within the DocumentDB account.
Unit of Consistency
◇ Query / transaction throughput (and reliability, i.e. surviving hardware failure) depend on replication!
■ All writes to the primary are replicated across two secondary replicas
■ All reads are distributed across three copies
■ "Scalability of throughput": allowing different clients to read from different replicas helps prevent bottlenecks
◇ BUT replication takes time!
■ Potential scenario: some clients are reading while another is writing
■ Now the data is stale (out of date), inconsistent!
Tweakable Consistency
◇ Trade-off: speed (performance & availability) or consistency (data correctness)?
■ "Does every read need the MOST current data?"
■ "Or do I need every request to be handled, and handled quickly?"
◇ 4 options …
■ Strong, Session, Bounded Staleness, Eventual
■ Default consistency for the entire database account…
■ On a per-collection basis in a future release
■ On a per-query basis (optional parameter on the CreateDocumentQuery method)
Stale data
◇ ViewModel is state
■ ViewModel is state disconnected from our truth (the DB!)
■ ViewModel is state duplicated from the DB
■ Many users can duplicate (n-uplicate) state from the DB
◇ So… which is reality?
■ You have STALE data, you have a lot of smell
◇ What smells?
■ Copies of the data that are not the truth
◇ Entity can be a lie, because it says "that will be the state"
CAP Theorem
◇ Consistency:
■ All nodes should see the same data at the same time
◇ Availability:
■ Node failures do not prevent survivors from continuing to operate
◇ Partition tolerance:
■ The system continues to operate despite network partitions
◇ A distributed system can satisfy any two of these guarantees at the same time, but not all three
Strong
◇ Client always sees completely consistent data
◇ Slowest reads / writes
◇ Mission critical: e.g. stock market, banking, airline reservation
Session
◇ Default: an even trade-off between performance & availability vs. data correctness
◇ Client reads its own writes, but other clients reading this same data might see older values
Bounded Staleness
◇ Client might see old data, but it can specify a limit for how old that data can be (e.g. 2 seconds)
◇ Updates happen in the order received
◇ Similar to Session consistency, but speeds up reads while still preserving the order of updates
Eventual
◇ Client might see old data for as long as it takes a write to propagate to all replicas
◇ High performance & availability, but a client might sometimes read out-of-date information or see updates out of order
Setting Consistency
◇ At the database account level (see preview portal)
◇ On a per-read or per-query basis (optional parameter on the CreateDocumentQuery method)
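The difference between Strong, Session and Eventual reads can be illustrated with a toy model. The ToyReplicaSet class below is an assumption made up for this sketch: it is not how DocumentDB is implemented and it uses no SDK calls. It only mimics one primary, two secondaries and a replication delay.

```javascript
// Toy model: one primary (replica 0) and two secondaries that receive writes
// only when replicate() is called. This is NOT DocumentDB's implementation.
class ToyReplicaSet {
  constructor() {
    // Each replica keeps its data and the sequence number of the last write applied.
    this.replicas = [0, 1, 2].map(() => ({ data: new Map(), lsn: 0 }));
    this.log = []; // writes accepted by the primary
  }

  // Writes always go to the primary; the returned number acts as a session token.
  write(key, value) {
    const primary = this.replicas[0];
    primary.lsn += 1;
    primary.data.set(key, value);
    this.log.push({ lsn: primary.lsn, key, value });
    return primary.lsn;
  }

  // Bring every secondary up to date with the primary.
  replicate() {
    for (const replica of this.replicas.slice(1)) {
      for (const entry of this.log) {
        if (entry.lsn > replica.lsn) {
          replica.data.set(entry.key, entry.value);
          replica.lsn = entry.lsn;
        }
      }
    }
  }

  // Eventual: answer from whichever replica is asked, however stale it is.
  readEventual(key, replicaIndex) {
    return this.replicas[replicaIndex].data.get(key);
  }

  // Session: use the replica only if it has seen the client's own last write.
  readSession(key, replicaIndex, sessionToken) {
    const replica = this.replicas[replicaIndex];
    const source = replica.lsn >= sessionToken ? replica : this.replicas[0];
    return source.data.get(key);
  }

  // Strong: always answer with the latest committed write.
  readStrong(key) {
    return this.replicas[0].data.get(key);
  }
}

const replicaSet = new ToyReplicaSet();
const sessionToken = replicaSet.write("greeting", "ciao");
const staleRead = replicaSet.readEventual("greeting", 1);               // secondary not caught up yet
const sessionRead = replicaSet.readSession("greeting", 1, sessionToken); // falls back to the primary
replicaSet.replicate();
const freshRead = replicaSet.readEventual("greeting", 1);
```

In this model the eventual read before replication misses the write, while the session read sees it because the writer carries a token for its own write.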
Globally Distributed
◇ Azure DocumentDB gives you the ability to cheat the speed of light!
◇ Not just for disaster recovery… DocumentDB is unreasonably highly available
◇ Replicate data across any number of regions of your choice
◇ Low-latency access to your data around the globe
◇ Dynamically configure your write and read regions
Databases
◇ Unit of Namespace
Collections
DocumentDB Performance
◇ Data is saved on SSD
◇ All writes to the primary are replicated across two secondary replicas
■ (Replicas are spread across different hardware in the same region to protect against failures)
◇ All reads are distributed across the three copies (when and how depends on the consistency level for the database account and the query)
Collections
◇ A unit of scale for transactions
■ for stored procedures and triggers
◇ A unit of query throughput
■ (capacity units allocated uniformly across all collections)
◇ A unit of replication
■ A collection is replicated three times
◇ A container of JSON documents
■ JSON docs inside a collection can vary dramatically
Collections
[Diagram: resource hierarchy. Database Account > Databases > Users > Permissions; Databases > Collections > Documents > Attachments; Collections > Stored Procedures, Triggers, User Defined Functions (each written in JS)]
Unit of query throughput
◇ Collection-based RU reservation
■ (Capacity units allocated uniformly across all collections)
◇ Standard pricing tier with hourly billing
◇ Performance levels can be adjusted
◇ Each collection = 10 GB of SSD
■ Limit of 100 collections (1 TB)
■ Soft limit, can be lifted as needed per account (with Support)
Performance levels
Request Units
◇ Predictable performance
◇ Each DocumentDB collection has reserved throughput in terms of request units (RUs)
◇ Normalized currency across database operations
◇ RU=
◇ RUs offer accurate accounting in the face of diverse database operations

Operation | RU Consumed
Reading a single 1 KB document | 1
Reading a single 2 KB document | 2
Query with a simple predicate for a 1 KB document | 3
Creating a single 1 KB document with 10 JSON properties (consistent indexing) | 14
Creating a single 1 KB document with 100 JSON properties (consistent indexing) | 20
Replacing a single 1 KB document | 28
Executing a stored procedure with two document creates | 30
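As a rough sizing aid, the per-operation charges in the table can be turned into a back-of-the-envelope estimate. The function name and the operation keys below are made up for this sketch, and the charges are the slide's illustrative figures, not measured values.

```javascript
// Per-operation charges copied from the table above (illustrative figures).
const RU_COST = {
  read1KB: 1,
  read2KB: 2,
  query1KB: 3,
  create1KB10Props: 14,
  create1KB100Props: 20,
  replace1KB: 28
};

// workload maps an operation name to how many times it runs per second.
function estimateRuPerSecond(workload) {
  return Object.entries(workload).reduce((total, [operation, perSecond]) => {
    if (!Object.prototype.hasOwnProperty.call(RU_COST, operation)) {
      throw new Error("Unknown operation: " + operation);
    }
    return total + RU_COST[operation] * perSecond;
  }, 0);
}

// 100 reads, 20 simple queries and 10 small creates per second:
// 100 * 1 + 20 * 3 + 10 * 14 = 300 RU/sec
const estimate = estimateRuPerSecond({ read1KB: 100, query1KB: 20, create1KB10Props: 10 });
console.log(estimate);
```

An estimate like this can then be compared with the throughput reserved for the collection.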
DEMO
Partitioning
Why Partition?
◇ Data size
■ A single collection holds 10 GB
◇ Throughput
■ 3 performance tiers with a max of 2,500 RU/sec
Partitioning our data
[Diagram: Request > Collection]
Partitioning our data
[Diagram: Requests > Partition 1, Partition 2 (logical grouping)]
Partitioning - Hash
Evenly distribute across n partitions (algorithmic) …
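Hash partitioning can be sketched in a few lines. The FNV-1a hash, the function names and the collection names are choices made for this illustration; the talk does not say which hash function a real resolver uses.

```javascript
// 32-bit FNV-1a over the UTF-16 code units of the key (good enough for a sketch).
function fnv1a(key) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < key.length; i++) {
    hash ^= key.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // keep the value an unsigned 32-bit integer
  }
  return hash;
}

// Map a partition key to one of the collections by hashing it.
function resolveHashPartition(partitionKey, collections) {
  return collections[fnv1a(partitionKey) % collections.length];
}

const hashCollections = ["coll0", "coll1", "coll2", "coll3"];
const hashTarget = resolveHashPartition("user-42", hashCollections);
console.log(hashTarget);
```

The same key always resolves to the same collection. Note that with a plain modulo, changing the number of collections moves most keys to a different one.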
Partitioning - Range
Keep current data hot, warm historical data, scale down older data, purge / archive
[Diagram: bracket marking the current period]
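A range scheme keyed on time might look like the sketch below. The one-collection-per-month layout, the "readings-YYYY-MM" naming and the retention rule are assumptions made for illustration.

```javascript
// Route a document to a collection named after the month of its timestamp.
// The "readings-YYYY-MM" naming scheme is an assumption for this sketch.
function resolveRangePartition(timestamp) {
  const date = new Date(timestamp);
  const year = date.getUTCFullYear();
  const month = String(date.getUTCMonth() + 1).padStart(2, "0");
  return `readings-${year}-${month}`;
}

// Collections older than the retention window are candidates for purge / archive.
function isExpired(collectionName, now, retentionMonths) {
  const [, year, month] = collectionName.match(/(\d{4})-(\d{2})$/);
  const current = new Date(now);
  const ageInMonths =
    (current.getUTCFullYear() - Number(year)) * 12 +
    (current.getUTCMonth() + 1 - Number(month));
  return ageInMonths > retentionMonths;
}

const currentCollection = resolveRangePartition("2016-10-19T10:00:00Z");
console.log(currentCollection);
```

Only the collection for the current period needs a high performance level; older ones can be scaled down or dropped as a whole.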
Partitioning - Lookup
Home tenant / user to a specific partition. Use "master" lookup.
Tenant | Partition Id
Customer | 1
Big Customer | 2
Another | 3
Cache this shard map to avoid making the lookup the bottleneck
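The lookup-plus-cache idea can be sketched as follows. The fetchPartitionFromMaster function is a hypothetical stand-in for the round trip to the master lookup store; the shard map values are the ones from the slide.

```javascript
// Shard map taken from the slide's table: tenant -> partition id.
const SHARD_MAP = new Map([
  ["Customer", 1],
  ["Big Customer", 2],
  ["Another", 3]
]);

let masterLookups = 0;

// Hypothetical stand-in for the round trip to the "master" lookup store.
function fetchPartitionFromMaster(tenant) {
  masterLookups += 1;
  if (!SHARD_MAP.has(tenant)) {
    throw new Error("Unknown tenant: " + tenant);
  }
  return SHARD_MAP.get(tenant);
}

const shardCache = new Map();

// Consult the cache first so the master lookup does not become the bottleneck.
function resolveTenantPartition(tenant) {
  if (!shardCache.has(tenant)) {
    shardCache.set(tenant, fetchPartitionFromMaster(tenant));
  }
  return shardCache.get(tenant);
}

const firstLookup = resolveTenantPartition("Big Customer");
const secondLookup = resolveTenantPartition("Big Customer"); // served from the cache
console.log(firstLookup, secondLookup, masterLookups);
```

A real cache would also need an invalidation rule for when a tenant is moved to another partition, which this sketch leaves out.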
Indexing
Index policies
◇ Customize index management, including storage overhead, throughput and query consistency
■ range, hash and spatial indexes
■ included and excluded paths
■ indexing mode: consistent or lazy
■ index precision
■ online, in-place index transformations

{
  "indexingMode": "consistent",
  "automatic": true,
  "includedPaths": [
    {
      "path": "/*",
      "indexes": [
        { "kind": "Range", "dataType": "Number", "precision": -1 },
        { "kind": "Hash", "dataType": "String", "precision": 3 },
        { "kind": "Spatial", "dataType": "Point" }
      ]
    }
  ],
  "excludedPaths": []
}
Indexing Policies
Configuration | Level | Options | Notes
Automatic | Per collection | True (default) or False | Override with each document write
Indexing Mode | Per collection | Consistent or Lazy | Lazy for eventual updates / bulk ingestion
Included and excluded paths | Per path | Individual path or recursive includes (? and *) |
Indexing Type | Per path | Supports Hash (default) and Range | Hash for equality, Range for range queries
Indexing Precision | Per path | Supports 3 to 7 per path | Trade-off between storage, query RUs and write RUs
Indexing Paths
Path | Description / use case
/ | Default path for collection. Recursive and applies to whole document tree.
/"prop"/? | Serve queries like the following (with Hash or Range types respectively): SELECT * FROM collection c WHERE c.prop = "value"; SELECT * FROM collection c WHERE c.prop > 5
/"prop"/* | All paths under the specified label.
/"prop"/"subprop"/ | Used during query execution to prune documents that do not have the specified path.
/"prop"/"subprop"/? | Serve queries (with Hash or Range types respectively): SELECT * FROM collection c WHERE c.prop.subprop = "value"; SELECT * FROM collection c WHERE c.prop.subprop > 5
Indexing tips
◇ Use lazy indexing for faster peak-time ingestion rates
◇ Exclude unused paths from indexing for faster writes
◇ Specify the range index path type for all paths used in range queries
◇ Vary index precision to trade off write performance, query performance and storage
◇ http://azure.microsoft.com/blog/2015/01/27/performance-tips-for-azure-documentdb-part-2/
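The first three tips can be combined in a policy object of the same shape as the JSON on the "Index policies" slide. The buildIngestionPolicy helper, the "/rawPayload/*" path and the { path } shape of the excludedPaths entries are assumptions for this sketch; check them against the indexing policy reference before use.

```javascript
// Build an indexing policy tuned for bulk ingestion.
function buildIngestionPolicy(excludedPaths) {
  return {
    indexingMode: "lazy", // favour ingestion rate over immediately consistent indexes
    automatic: true,
    includedPaths: [
      {
        path: "/*",
        indexes: [
          { kind: "Range", dataType: "Number", precision: -1 }, // range queries on numbers
          { kind: "Hash", dataType: "String", precision: 3 }    // equality queries on strings
        ]
      }
    ],
    // Assumed entry shape: one { path } object per excluded path.
    excludedPaths: excludedPaths.map((path) => ({ path }))
  };
}

// "/rawPayload/*" is a hypothetical property that is stored but never queried.
const ingestionPolicy = buildIngestionPolicy(["/rawPayload/*"]);
console.log(JSON.stringify(ingestionPolicy, null, 2));
```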
Querying
Query
◇ Query over heterogeneous documents without defining schema or managing indexes
◇ Query arbitrary paths, properties and values without specifying secondary indexes or indexing hints
◇ Execute queries with consistent results
◇ Supported SQL features: predicates, iterations (arrays), sub-queries, logical operators, UDFs, intra-document JOINs, JSON transforms
◇ In general, more predicates result in a larger request charge.
◇ Additional predicates can help if they result in narrowing the overall result set.

LINQ Query

    from book in client.CreateDocumentQuery<Book>(collectionSelfLink)
    where book.Title == "War and Peace"
    select book;

    from book in client.CreateDocumentQuery<Book>(collectionSelfLink)
    where book.Author.Name == "Leo Tolstoy"
    select book.Author;

SQL Query Grammar

    -- Nested lookup against index
    SELECT B.Author
    FROM Books B
    WHERE B.Author.Name = "Leo Tolstoy"

    -- Transformation, Filters, Array access
    SELECT { Name: B.Title, Author: B.Author.Name }
    FROM Books B
    WHERE B.Price > 10 AND B.Language[0] = "English"

    -- Joins, User Defined Functions (UDF)
    SELECT udf.CalculateRegionalTax(B.Price, "USA", "WA")
    FROM Books B
    JOIN L IN B.Languages
    WHERE L.Language = "Russian"
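The intra-document JOIN is easy to misread as a join between documents. A small emulation over an in-memory array shows the semantics: each document is paired only with the elements of its own array. The sample books and the joinLanguages helper are made up for this sketch.

```javascript
// Emulates: SELECT B.Title FROM Books B JOIN L IN B.Languages WHERE L.Language = "Russian"
const books = [
  { Title: "War and Peace", Languages: [{ Language: "Russian" }, { Language: "English" }] },
  { Title: "Il Gattopardo", Languages: [{ Language: "Italian" }] }
];

function joinLanguages(documents, predicate) {
  // JOIN pairs each document with each element of its own array, never across documents.
  return documents.flatMap((B) =>
    (B.Languages || [])
      .filter((L) => predicate(L))
      .map(() => ({ Title: B.Title }))
  );
}

const russianTitles = joinLanguages(books, (L) => L.Language === "Russian");
console.log(russianTitles);
```

A document with several matching array elements would appear once per match, just as a JOIN produces one row per (document, element) pair.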
DEMO
Programmability
Query with user-defined function
◇ The complexity of a query impacts the request units consumed for an operation:
◇ Use of user-defined functions (UDFs)
■ SELECT or WHERE clauses
◇ To take advantage of indexing, try to have at least one filter against an indexed property when leveraging a UDF in the WHERE clause.

    function region(doc) {
        switch (doc.Location.Region) {
            case 0: return "North";
            case 1: return "Middle";
            case 2: return "South";
        }
    }
Executing Stored Procedures
◇ Execute "explicit" JavaScript code on a collection

    function count(filterQuery, continuationToken) {
        var collection = getContext().getCollection();

        // MAX number of docs to process in one batch; when reached, return to client/request continuation.
        // Intentionally set low to demonstrate the concept. This can be much higher. Try experimenting.
        // We've had it in the high thousands before seeing the stored procedure time out.
        var maxResult = 25;

        // The number of documents counted.
        var result = 0;

        // tryQuery is defined in the part of the procedure not shown on the slide.
        tryQuery(continuationToken);
    }
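The slide shows only the opening of the procedure. One possible completion is sketched below, together with a mock context so that it can be run locally. The helpers tryQuery, query, onReadDocuments and setBody, and the whole makeMockContext function, are assumptions for illustration and not the code used in the talk. The mock also invokes callbacks synchronously, which the real service does not.

```javascript
function count(filterQuery, continuationToken) {
  var collection = getContext().getCollection();
  var maxResult = 25; // stop and hand back a continuation after this many docs
  var result = 0;     // number of documents counted in this execution

  tryQuery(continuationToken);

  function tryQuery(nextToken) {
    var options = { continuation: nextToken, pageSize: maxResult };
    // If the batch limit is reached, or the server refuses more work,
    // return what we have together with the continuation token.
    if (result >= maxResult || !query(options)) {
      setBody(nextToken);
    }
  }

  function query(options) {
    var link = collection.getSelfLink();
    return filterQuery
      ? collection.queryDocuments(link, filterQuery, options, onReadDocuments)
      : collection.readDocuments(link, options, onReadDocuments);
  }

  function onReadDocuments(err, docFeed, responseOptions) {
    if (err) throw new Error("Error while reading documents: " + err);
    result += docFeed.length;
    if (responseOptions.continuation) {
      tryQuery(responseOptions.continuation);
    } else {
      setBody(null); // no more pages: this batch is the last one
    }
  }

  function setBody(token) {
    getContext().getResponse().setBody({ count: result, continuationToken: token });
  }
}

// Hypothetical in-memory stand-in for the server-side context, for local runs only.
function makeMockContext(docs) {
  var body = null;
  var response = {
    setBody: function (b) { body = b; },
    getBody: function () { return body; }
  };
  var collection = {
    getSelfLink: function () { return "dbs/mock/colls/mock"; },
    readDocuments: function (link, options, callback) {
      var start = options.continuation ? Number(options.continuation) : 0;
      var end = start + (options.pageSize || 10);
      var next = end < docs.length ? String(end) : undefined;
      callback(null, docs.slice(start, end), { continuation: next });
      return true; // the request was accepted
    }
  };
  return {
    getCollection: function () { return collection; },
    getResponse: function () { return response; }
  };
}

// Local run: keep calling with the returned continuation until it is null.
const mockContext = makeMockContext(new Array(60).fill({}));
globalThis.getContext = () => mockContext;
let totalCount = 0;
let nextToken;
do {
  count(null, nextToken);
  const body = mockContext.getResponse().getBody();
  totalCount += body.count;
  nextToken = body.continuationToken;
} while (nextToken);
console.log(totalCount);
```

Each execution counts at most one batch, so the client adds up the partial counts, which keeps every single execution well inside the procedure's time limit.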
Triggers
◇ Execute "implicit" JavaScript code on CRUD operations (Insert, Update, Delete) on collections

    function normalize() {
        var collection = getContext().getCollection();
        var collectionLink = collection.getSelfLink();
        var doc = getContext().getRequest().getBody();

        var newDoc = {
            "Sensor": { "Id": doc.sensorId, "Class": 0 },
            "Degree": { "Value": doc.degreeValue, "Type": 0 },
            "Location": {
                "Name": doc.locationName,
                "Region": doc.locationRegion,
                "Longitude": doc.locationLong,
                "Latitude": doc.locationLat
            },
            "id": doc.id
        };

        // Update the request -- this is what is going to be inserted.
        getContext().getRequest().setBody(newDoc);
    }
Conclusions
◇ DocumentDB is a RESTful service
◇ Documents define the unit of cost, measured in Request Units
◇ Database Account defines accessibility and consistency
◇ Database is a namespace placeholder
◇ Collections are the unit of scale
Usage: what is DocumentDB for?
◇ User generated content
◇ Many specific data (varbinary(MAX) in SQL)
◇ Catalog data
◇ Log data
◇ User preferences data
◇ Device sensor data
◇ IoT use cases commonly share some patterns in how they ingest, process and store data. First, these systems allow for data intake that can ingest bursts of data from device sensors in various locales. Next, these systems process and analyze streaming data to derive real-time insights. And last but not least, most if not all data will eventually land in a data store for ad hoc querying and offline analytics.
Any questions?
You can find me at: [email protected] / @marco_parenzan
Thanks!