azure documentdb

Italian Virtual Chapter – 19.10.2016

Azure DocumentDb

Marco Parenzan

Microsoft MVP for Azure

Microsoft Azure Trainer @ Cloud Academy SAGL

Community Lead [email protected]

@marco_parenzan



Document Db
◇ Fully managed
◇ Schema agnostic
◇ Scalable
◇ Tunable consistency levels
◇ Tunable indexing policies
◇ Familiar SQL syntax for querying
◇ JavaScript execution


Documents


Developer Appeal
◇ A document is a JSON document
◇ DocumentDb is a schemaless Db
◇ Resilient to iterative schema changes
◇ Promotes code-first development (mapping objects to JSON)
◇ Low impedance as an object / JSON store; no ORM required
◇ Richer query and indexing (compared to KV stores)
◇ It just works
◇ It’s fast
◇ It’s great for Catalog Data, Preference and State, Event Store, User Generated Content, Data Exchange


Train yourself with ViewModels
◇ Implement a real contract: something to exchange from Presentation to BL/DA
◇ ViewModel = a model that is functional just for presentation, not persistence
￭ No more Ids
￭ No more null fields
￭ No more grayed/hidden fields
￭ No more graphs
￭ No more joins
￭ No multiple roles per entity (just one)
◇ Represented well in JSON


Come as you are

Data normalization

ORM

Embedding vs. Referencing


Embedding vs. referencing

[Figure: two panels, "embed" and "reference"]


Referencing
◇ Representing one-to-many relationships
◇ Representing many-to-many relationships
◇ Related data changes frequently
◇ Referenced data could be unbounded
◇ Provides more flexibility than embedding
￭ More round trips to read data
◇ Normalizing typically provides better write performance


Embedding
◇ There are "contains" relationships between entities
◇ There are one-to-few relationships between entities
◇ There is embedded data that changes infrequently
◇ There is embedded data that won't grow without bound
◇ There is embedded data that is integral to data in a document
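To make the embedding vs. referencing trade-off concrete, here is a minimal sketch in plain JavaScript. The documents, property names, and the in-memory `orders` store are all hypothetical; the point is only that embedded data arrives with the parent document, while referenced data costs one extra lookup per reference.

```javascript
// Hypothetical sample data: a person with a small, stable set of addresses.
// Embedding keeps the addresses inside the parent document.
const embeddedPerson = {
  id: "1",
  name: "Ada",
  addresses: [{ city: "Rome" }, { city: "Milan" }]
};

// Referencing stores related items as separate documents and keeps only ids.
const referencedPerson = { id: "1", name: "Ada", orderIds: ["o1", "o2"] };
const orders = new Map([
  ["o1", { id: "o1", total: 10 }],
  ["o2", { id: "o2", total: 25 }]
]);

// Embedded data is already in the document: no extra lookups.
function readCities(person) {
  return person.addresses.map(a => a.city);
}

// Referenced data needs one lookup (round trip) per referenced document.
function readOrders(person, store) {
  let roundTrips = 0;
  const result = person.orderIds.map(id => {
    roundTrips++;
    return store.get(id);
  });
  return { result, roundTrips };
}
```

Against a real database each `store.get` would be a network request, which is the "more round trips" cost listed under Referencing.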


Resource Model
◇ DocumentDb is Platform as a Service
￭ No on-premises version
◇ RESTful API
￭ All DocDb elements are public and accessible through a resource URI
◇ Resource
￭ JSON resources


Resource Model Items

Database Account → Databases → Collections → Documents
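Because every item in this hierarchy is addressable as a resource, a path to it can be assembled from the ids of its ancestors. The sketch below assumes the ID-based link layout `dbs/{database}/colls/{collection}/docs/{document}`; the helper name `resourceLink` and the sample ids are hypothetical.

```javascript
// Builds an ID-based resource link for a database, a collection, or a document.
// Assumes the "dbs/.../colls/.../docs/..." layout; the helper is illustrative only.
function resourceLink(databaseId, collectionId, documentId) {
  if (!databaseId) throw new Error("databaseId is required");
  if (documentId !== undefined && collectionId === undefined) {
    throw new Error("a document link needs a collectionId");
  }
  let link = "dbs/" + databaseId;
  if (collectionId !== undefined) link += "/colls/" + collectionId;
  if (documentId !== undefined) link += "/docs/" + documentId;
  return link;
}

const bookLink = resourceLink("library", "books", "war-and-peace");
```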


Database Account
◇ Unit of Authorization
◇ Unit of Consistency


Unit of Authorization
◇ Master keys
￭ Upon creation of a DocumentDB account, two master keys (primary and secondary) are created. These keys enable full administrative access to all resources within the DocumentDB account.
◇ Read-only keys
￭ Upon creation of a DocumentDB account, two read-only keys (primary and secondary) are created. These keys enable read-only access to all resources within the DocumentDB account.


Unit of Consistency
◇ Query / transaction throughput (and reliability – i.e., hardware failure) depend on replication!
￭ All writes to the primary are replicated across two secondary replicas
￭ All reads are distributed across three copies
￭ “Scalability of throughput” – allowing different clients to read from different replicas helps prevent bottlenecks
◇ BUT replication takes time!
￭ Potential scenario: some clients are reading while another is writing
￭ Now, the data is stale (out-of-date), inconsistent!
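The stale-read scenario can be reproduced with a toy model. This is only a sketch of the idea, not how the service replicates: the `ReplicaSet` class, its explicit `replicate()` step, and the sample key are hypothetical.

```javascript
// Toy model: one primary (index 0) and two secondaries; replication is
// applied only when replicate() is called, to make the delay visible.
class ReplicaSet {
  constructor() {
    this.replicas = [new Map(), new Map(), new Map()];
    this.pending = [];
  }
  write(key, value) {
    this.replicas[0].set(key, value); // the primary sees the write immediately
    this.pending.push([key, value]);  // the secondaries see it later
  }
  replicate() {
    for (const [key, value] of this.pending) {
      this.replicas[1].set(key, value);
      this.replicas[2].set(key, value);
    }
    this.pending = [];
  }
  read(key, replicaIndex) {
    return this.replicas[replicaIndex].get(key);
  }
}

const replicaSet = new ReplicaSet();
replicaSet.write("price", 10);
replicaSet.replicate();
replicaSet.write("price", 12);                   // not replicated yet
const fresh = replicaSet.read("price", 0);       // read from the primary
const stale = replicaSet.read("price", 1);       // read from a secondary
replicaSet.replicate();
const caughtUp = replicaSet.read("price", 1);    // the secondary has caught up
```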


Tweakable Consistency
◇ Trade-off: speed (performance & availability) or consistency (data correctness)?
￭ “Does every read need the MOST current data?”
￭ “Or do I need every request to be handled and handled quickly?”
◇ 4 options …
￭ Strong, Session, Bounded Staleness, Eventual
￭ Default consistency for the entire Db…
￭ On a per-collection basis in a future release
￭ On a per-query basis (optional parameter on the CreateDocumentQuery method)


Stale data
◇ ViewModel is state
￭ ViewModel is state disconnected from our truth (the DB!)
￭ ViewModel is state duplicated from the DB
￭ Many users can duplicate (n-uplicate) state from the DB
◇ So… which is reality?
￭ You have STALE data, you have a lot of smell
◇ What smells?
￭ Copies of the data that are not the truth
◇ Entity can be a lie, because it says “that will be the state”


CAP Theorem
◇ Consistency:
￭ All nodes should see the same data at the same time
◇ Availability:
￭ Node failures do not prevent survivors from continuing to operate
◇ Partition-tolerance:
￭ The system continues to operate despite network partitions
◇ A distributed system can satisfy any two of these guarantees at the same time but not all three


Strong
◇ Client always sees completely consistent data
◇ Slowest reads / writes
◇ Mission critical: e.g. stock market, banking, airline reservation


Session
◇ Default – even trade-off between performance & availability vs. data correctness
◇ Client reads its own writes, but other clients reading this same data might see older values
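A rough way to picture "reads its own writes" is a client that remembers the version of its latest write and refuses to read from a replica that is behind it. The sketch below is a toy model under that assumption; `Replica`, `SessionClient`, and the `sessionToken` field are hypothetical names, not the service's implementation.

```javascript
// Toy replica: stores data and the version of the last write it has applied.
class Replica {
  constructor() {
    this.version = 0;
    this.data = new Map();
  }
  apply(version, key, value) {
    this.data.set(key, value);
    this.version = version;
  }
}

// Toy client: tracks the newest version written in its own session.
class SessionClient {
  constructor(primary, secondary) {
    this.primary = primary;
    this.secondary = secondary;
    this.sessionToken = 0;
  }
  write(key, value) {
    const version = this.primary.version + 1;
    this.primary.apply(version, key, value);
    this.sessionToken = version;
  }
  read(key) {
    // Use the secondary only if it has caught up with this session's writes.
    const caughtUp = this.secondary.version >= this.sessionToken;
    const source = caughtUp ? this.secondary : this.primary;
    return source.data.get(key);
  }
}

const primary = new Replica();
const secondary = new Replica();
const writer = new SessionClient(primary, secondary);
const otherClient = new SessionClient(primary, secondary);

writer.write("status", "shipped");              // not yet on the secondary
const writerSees = writer.read("status");       // served by the primary
const otherSees = otherClient.read("status");   // served by the lagging secondary
```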


Bounded Staleness
◇ Client might see old data, but it can specify a limit for how old that data can be (e.g. 2 seconds)
◇ Updates happen in the order received
◇ Similar to Session consistency, but speeds up reads while still preserving the order of updates
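The staleness bound can be pictured as a filter on which replicas are allowed to answer a read. This is a toy check under the assumption that lag is measured in milliseconds since the primary's last write; `eligibleReplicas` and the sample replicas are hypothetical.

```javascript
// Returns the names of replicas whose lag behind the primary is within the bound.
function eligibleReplicas(replicas, primaryLastWriteMs, maxStalenessMs) {
  return replicas
    .filter(r => primaryLastWriteMs - r.lastAppliedMs <= maxStalenessMs)
    .map(r => r.name);
}

// Hypothetical state: the primary last wrote at t = 10000 ms.
const candidates = [
  { name: "replica-a", lastAppliedMs: 9500 }, // 500 ms behind
  { name: "replica-b", lastAppliedMs: 7000 }  // 3000 ms behind
];
const allowed = eligibleReplicas(candidates, 10000, 2000); // 2 second bound
```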


Eventual
◇ Client might see old data for as long as it takes a write to propagate to all replicas
◇ High performance & availability, but a client might sometimes read out-of-date information or see updates out of order


Setting Consistency
◇ At the database level (see preview portal)
◇ On a per-read or per-query basis (optional parameter on the CreateDocumentQuery method)


Globally Distributed
◇ Azure DocumentDB gives you the ability to cheat the speed of light!
◇ Not just for disaster recovery… DocumentDB is unreasonably highly available
◇ Replicate data across any number of regions of your choice
◇ Low-latency access to your data around the globe
◇ Dynamically configure your write and read regions


Databases
◇ Unit of Namespace


Collections


DocumentDb Performance
◇ Data is saved on SSD
◇ All writes to the primary are replicated across two secondary replicas
￭ (Replicas are spread across different hardware in the same region to protect against failures)
◇ All reads are distributed across the three copies (when and how depends on the consistency level for the db account and the query)


Collections
◇ A unit of scale for transactions
￭ for stored procedures and triggers
◇ A unit of query throughput
￭ (capacity units allocated uniformly across all collections)
◇ A unit of replication
￭ A collection is replicated three times
◇ A container of JSON documents
￭ JSON docs inside of a collection can vary dramatically


Collections

[Diagram: resource hierarchy. Database Account → Databases → Users (Permissions) and Collections; Collections → Documents (Attachments), Stored Procedures (JS), Triggers (JS), User Defined Functions (JS)]


Unit of query throughput
◇ Collection-based RU reservation
￭ (Capacity units allocated uniformly across all collections)
◇ Standard pricing tier with hourly billing
◇ Performance levels can be adjusted
◇ Each collection = 10GB of SSD
￭ Limit of 100 collections (1 TB)
￭ Soft limit, can be lifted as needed per account (with Support)


Performance levels


Request Units
◇ Predictable Performance
◇ Each DocumentDB collection has reserved throughput in terms of request units (RUs)
◇ Normalized currency across database operations
◇ RU=
◇ RUs offer accurate accounting in face of diverse database operations

Operation | RU Consumed
Reading a single 1KB document | 1
Reading a single 2KB document | 2
Query with a simple predicate for a 1KB document | 3
Creating a single 1 KB document with 10 JSON properties (consistent indexing) | 14
Creating a single 1 KB document with 100 JSON properties (consistent indexing) | 20
Replacing a single 1 KB document | 28
Executing a stored procedure with two create documents | 30
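Since RUs are a normalized currency, a workload's required throughput can be estimated by multiplying each operation's RU cost by its rate and summing. The sketch below uses the costs from the table above; the operation names and the sample workload are hypothetical.

```javascript
// RU cost per operation, taken from the Request Units table.
const RU_COST = {
  read1KB: 1,
  query1KB: 3,
  create1KB10Props: 14,
  replace1KB: 28
};

// Sums the RU/s needed; opsPerSecond maps an operation name to its rate.
function requiredRUs(opsPerSecond) {
  return Object.entries(opsPerSecond).reduce((total, [op, rate]) => {
    if (!(op in RU_COST)) throw new Error("Unknown operation: " + op);
    return total + RU_COST[op] * rate;
  }, 0);
}

// Hypothetical workload: 500 reads, 20 creates and 5 replaces per second.
// 500 * 1 + 20 * 14 + 5 * 28 = 920 RU/s
const neededRUs = requiredRUs({ read1KB: 500, create1KB10Props: 20, replace1KB: 5 });
```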


DEMO


Partitioning


Why Partition?
◇ Data Size
￭ A single collection holds 10GB
◇ Throughput
￭ 3 performance tiers with a max of 2,500 RU/sec
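These two limits give a quick lower bound on how many collections (partitions) a workload needs: enough to hold the data and enough to serve the peak throughput. A minimal sketch using the 10GB and 2,500 RU/sec figures from this slide; the function name and sample workloads are hypothetical.

```javascript
const COLLECTION_SIZE_GB = 10;   // storage per collection, from the slide
const COLLECTION_MAX_RUS = 2500; // top performance tier, from the slide

// The binding constraint is whichever of size or throughput needs more collections.
function collectionsNeeded(dataGB, peakRUs) {
  const bySize = Math.ceil(dataGB / COLLECTION_SIZE_GB);
  const byThroughput = Math.ceil(peakRUs / COLLECTION_MAX_RUS);
  return Math.max(1, bySize, byThroughput);
}

const sizeBound = collectionsNeeded(35, 4000);      // size needs 4, throughput needs 2
const throughputBound = collectionsNeeded(5, 6000); // size needs 1, throughput needs 3
```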


Partitioning our data

[Diagram: a single collection receiving every request]

Partitioning our data

[Diagram: requests routed to Partition 1 and Partition 2, each a logical grouping]


Partitioning - Hash
◇ Evenly distribute across n partitions (algorithmic) …
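A hash partitioner can be sketched as "hash the partition key, then take it modulo the number of collections". The hash function below is a simple djb2-style hash chosen for illustration, not the one any SDK uses, and the collection links are hypothetical.

```javascript
// Simple deterministic string hash (djb2-style, XOR variant); illustration only.
function hashString(key) {
  let hash = 5381;
  for (let i = 0; i < key.length; i++) {
    hash = ((hash * 33) ^ key.charCodeAt(i)) >>> 0; // keep it an unsigned 32-bit value
  }
  return hash;
}

// Maps a partition key to one of the collections.
function resolveHashPartition(partitionKey, collectionLinks) {
  if (collectionLinks.length === 0) throw new Error("no collections configured");
  return collectionLinks[hashString(partitionKey) % collectionLinks.length];
}

// Hypothetical collections acting as partitions.
const hashPartitions = ["dbs/db/colls/p0", "dbs/db/colls/p1", "dbs/db/colls/p2"];
const targetForUser = resolveHashPartition("user-42", hashPartitions);
```

The same key always resolves to the same collection, but changing the number of collections remaps most keys, which is the usual cost of modulo-based hashing.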


Partitioning - Range
◇ Keep current data hot, warm historical data, scale down older data, purge / archive

[Diagram: partitions laid out by range, with the current period marked]
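Range partitioning can be sketched as a sorted list of key ranges, each pointing at a collection. The example below assumes ISO-formatted date strings as the partition key (they compare correctly as plain strings); the range map and collection links are hypothetical.

```javascript
// Hypothetical range map: each entry covers [from, to) on ISO date strings.
const rangeMap = [
  { from: "2016-01-01", to: "2016-07-01", collection: "dbs/db/colls/events-2016-h1" },
  { from: "2016-07-01", to: "2017-01-01", collection: "dbs/db/colls/events-2016-h2" }
];

// Finds the collection whose range contains the given date.
function resolveRangePartition(isoDate, ranges) {
  const match = ranges.find(r => isoDate >= r.from && isoDate < r.to);
  if (!match) throw new Error("No partition covers " + isoDate);
  return match.collection;
}

const currentPeriod = resolveRangePartition("2016-10-19", rangeMap);
```

With this layout, older ranges can be scaled down or archived one collection at a time without touching the current period.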


Partitioning - Lookup
◇ Home tenant / user to a specific partition. Use "master" lookup.

Tenant | Partition Id
Customer | 1
Big Customer | 2
Another | 3

◇ Cache this shard map to avoid making the lookup the bottleneck
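The lookup approach with a cached shard map can be sketched as follows. The class name, the `fetchShardMap` loader, and the `fetchCount` counter are hypothetical; the tenants and partition ids are the ones from the slide's table.

```javascript
// Resolves a tenant to its partition id, loading the shard map only once.
class LookupPartitionResolver {
  constructor(fetchShardMap) {
    this.fetchShardMap = fetchShardMap; // hypothetical loader for the "master" lookup
    this.cache = null;
    this.fetchCount = 0;                // counts how often the master lookup is hit
  }
  resolve(tenant) {
    if (this.cache === null) {
      this.cache = new Map(Object.entries(this.fetchShardMap()));
      this.fetchCount++;
    }
    if (!this.cache.has(tenant)) throw new Error("Unknown tenant: " + tenant);
    return this.cache.get(tenant);
  }
}

const lookupResolver = new LookupPartitionResolver(() => ({
  "Customer": 1,
  "Big Customer": 2,
  "Another": 3
}));
const bigCustomerPartition = lookupResolver.resolve("Big Customer");
const anotherPartition = lookupResolver.resolve("Another"); // served from the cache
```

A real cache would also need an invalidation rule for when a tenant is moved to a different partition; that part is left out here.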


Indexing


Index policies
◇ Customize index management, including storage overhead, throughput and query consistency
￭ range, hash and spatial indexes
￭ included and excluded paths
￭ indexing mode: consistent or lazy
￭ index precision
￭ online, in-place index transformations

{
  "indexingMode": "consistent",
  "automatic": true,
  "includedPaths": [
    {
      "path": "/*",
      "indexes": [
        { "kind": "Range", "dataType": "Number", "precision": -1 },
        { "kind": "Hash", "dataType": "String", "precision": 3 },
        { "kind": "Spatial", "dataType": "Point" }
      ]
    }
  ],
  "excludedPaths": []
}


Indexing Policies

Configuration | Level | Options
Automatic | Per collection | True (default) or False. Override with each document write
Indexing Mode | Per collection | Consistent or Lazy. Lazy for eventual updates/bulk ingestion
Included and excluded paths | Per path | Individual path or recursive includes (? and *)
Indexing Type | Per path | Supports Hash (default) and Range. Hash for equality, Range for range queries
Indexing Precision | Per path | Supports 3 – 7 per path. Trade off storage, query RUs and write RUs


Indexing Paths

Path | Description/use case
/ | Default path for collection. Recursive and applies to whole document tree.
/"prop"/? | Serve queries like the following (with Hash or Range types respectively): SELECT * FROM collection c WHERE c.prop = "value" and SELECT * FROM collection c WHERE c.prop > 5
/"prop"/* | All paths under the specified label.
/"prop"/"subprop"/ | Used during query execution to prune documents that do not have the specified path.
/"prop"/"subprop"/? | Serve queries (with Hash or Range types respectively): SELECT * FROM collection c WHERE c.prop.subprop = "value" and SELECT * FROM collection c WHERE c.prop.subprop > 5


Indexing tips
◇ Use lazy indexing for faster peak time ingestion rates
◇ Exclude unused paths from indexing for faster writes
◇ Specify range index path type for all paths used in range queries
◇ Vary index precision for write vs query performance vs storage tradeoffs
◇ http://azure.microsoft.com/blog/2015/01/27/performance-tips-for-azure-documentdb-part-2/
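Several of these tips can be combined into one policy object. The sketch below builds a policy in the same shape as the JSON on the "Index policies" slide; the helper `buildIndexingPolicy`, its options, and the property paths are hypothetical, and the path notation simply follows the "Indexing Paths" table, so it may need adjusting for a given API version.

```javascript
// Builds an indexing policy object: lazy or consistent mode, Range indexes
// on selected paths, and excluded paths that are never indexed.
function buildIndexingPolicy({ lazy = false, rangePaths = [], excludedPaths = [] } = {}) {
  const rangeIndex = { kind: "Range", dataType: "Number", precision: -1 };
  const includedPaths = [
    // Default path: Hash index on strings, as in the slide's sample policy.
    { path: "/*", indexes: [{ kind: "Hash", dataType: "String", precision: 3 }] },
    ...rangePaths.map(path => ({ path, indexes: [rangeIndex] }))
  ];
  return {
    indexingMode: lazy ? "lazy" : "consistent",
    automatic: true,
    includedPaths,
    excludedPaths: excludedPaths.map(path => ({ path }))
  };
}

// Hypothetical collection: bulk-ingested documents, range queries on "price",
// and a large "rawPayload" subtree that is never queried.
const ingestPolicy = buildIndexingPolicy({
  lazy: true,
  rangePaths: ['/"price"/?'],
  excludedPaths: ['/"rawPayload"/*']
});
```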


Querying


Query
◇ Query over heterogeneous documents without defining schema or managing indexes
◇ Query arbitrary paths, properties and values without specifying secondary indexes or indexing hints
◇ Execute queries with consistent results
◇ Supported SQL features: predicates, iterations (arrays), sub-queries, logical operators, UDFs, intra-document JOINs, JSON transforms
◇ In general, more predicates result in a larger request charge.
◇ Additional predicates can help if they result in narrowing the overall result set.

LINQ Query

from book in client.CreateDocumentQuery<Book>(collectionSelfLink)
where book.Title == "War and Peace"
select book;

from book in client.CreateDocumentQuery<Book>(collectionSelfLink)
where book.Author.Name == "Leo Tolstoy"
select book.Author;

SQL Query Grammar

-- Nested lookup against index
SELECT B.Author
FROM Books B
WHERE B.Author.Name = "Leo Tolstoy"

-- Transformation, Filters, Array access
SELECT { Name: B.Title, Author: B.Author.Name }
FROM Books B
WHERE B.Price > 10 AND B.Language[0] = "English"

-- Joins, User Defined Functions (UDF)
SELECT udf.CalculateRegionalTax(B.Price, "USA", "WA")
FROM Books B
JOIN L IN B.Languages
WHERE L.Language = "Russian"
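The intra-document JOIN is the least familiar of these features: it pairs each document with each element of one of its own arrays, rather than joining two collections. An in-memory equivalent in JavaScript may help; the sample books are hypothetical and this is only an analogy, not how the query engine evaluates it.

```javascript
// Hypothetical sample documents shaped like the Books in the queries above.
const books = [
  {
    Title: "War and Peace", Price: 20, Author: { Name: "Leo Tolstoy" },
    Languages: [{ Language: "Russian" }, { Language: "English" }]
  },
  {
    Title: "Emma", Price: 8, Author: { Name: "Jane Austen" },
    Languages: [{ Language: "English" }]
  }
];

// FROM Books B JOIN L IN B.Languages: one (B, L) pair per array element.
function joinLanguages(docs) {
  return docs.flatMap(B => B.Languages.map(L => ({ B, L })));
}

// WHERE L.Language = "Russian", then project the title.
const russianTitles = joinLanguages(books)
  .filter(({ L }) => L.Language === "Russian")
  .map(({ B }) => B.Title);
```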


DEMO


Programmability


function region(doc) {
    switch (doc.Location.Region) {
        case 0: return "North";
        case 1: return "Middle";
        case 2: return "South";
    }
}

Query with user-defined function
◇ The complexity of a query impacts the request units consumed for an operation:
◇ Use of user-defined functions (UDFs)
￭ SELECT or WHERE clauses
◇ To take advantage of indexing, try to have at least one filter against an indexed property when leveraging a UDF in the WHERE clause.
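The last tip can be illustrated locally: if a cheap filter runs first, the UDF is evaluated on fewer documents. The sketch reuses the `region` function from the slide; the sample documents, the `Type` property, and the call counter are hypothetical, and the in-memory filter only stands in for an indexed predicate.

```javascript
// UDF from the slide: maps a numeric region code to a label.
function region(doc) {
  switch (doc.Location.Region) {
    case 0: return "North";
    case 1: return "Middle";
    case 2: return "South";
  }
}

// Hypothetical documents.
const places = [
  { id: "a", Type: "store", Location: { Region: 0 } },
  { id: "b", Type: "store", Location: { Region: 2 } },
  { id: "c", Type: "office", Location: { Region: 2 } }
];

let udfCalls = 0; // counts how many documents reach the UDF

function southStores(docs) {
  return docs
    .filter(d => d.Type === "store") // stands in for the indexed predicate
    .filter(d => { udfCalls++; return region(d) === "South"; })
    .map(d => d.id);
}

const southStoreIds = southStores(places);
```

Here the UDF runs on two documents instead of three; on a large collection the difference is what keeps the request charge down.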


Executing Stored Procedures
◇ Execute “explicit” JavaScript code on a collection

function count(filterQuery, continuationToken) {
    var collection = getContext().getCollection();

    // Maximum number of docs to process in one batch; when reached, return to
    // the client and request a continuation. Intentionally set low to
    // demonstrate the concept. This can be much higher; try experimenting.
    // We've had it into the high thousands before seeing the stored procedure time out.
    var maxResult = 25;

    // The number of documents counted.
    var result = 0;

    // tryQuery is not shown on this slide.
    tryQuery(continuationToken);
}
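The batching idea behind `maxResult` and `continuationToken` can be tried outside the database. The sketch below is a plain JavaScript analogy, not the stored procedure itself: `readPage` is a hypothetical in-memory feed, and the continuation token is simply the next offset.

```javascript
// Hypothetical in-memory feed standing in for a collection: returns one page
// of documents plus a continuation token (the next offset), or null at the end.
function readPage(allDocs, continuation, pageSize) {
  const start = continuation === null ? 0 : continuation;
  const page = allDocs.slice(start, start + pageSize);
  const next = start + pageSize < allDocs.length ? start + pageSize : null;
  return { page, continuation: next };
}

// Server side: count in bounded batches. Stop once maxResult documents have
// been counted and hand the continuation token back to the caller.
function countBatch(allDocs, continuation, maxResult, pageSize) {
  let count = 0;
  let token = continuation;
  do {
    const result = readPage(allDocs, token, pageSize);
    count += result.page.length;
    token = result.continuation;
  } while (token !== null && count < maxResult);
  return { count, continuationToken: token };
}

// Client side: call again until no continuation token comes back.
function countAll(allDocs, maxResult, pageSize) {
  let total = 0;
  let token = null;
  let calls = 0;
  do {
    const batch = countBatch(allDocs, token, maxResult, pageSize);
    total += batch.count;
    token = batch.continuationToken;
    calls++;
  } while (token !== null);
  return { total, calls };
}

// 60 hypothetical documents, batches of about 25, pages of 10.
const sampleDocs = Array.from({ length: 60 }, (_, i) => ({ id: String(i) }));
const countSummary = countAll(sampleDocs, 25, 10);
```

Keeping each batch bounded is what lets a long-running count finish within the execution limit of a single stored procedure call.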


Conclusions


Conclusions
◇ DocumentDb is a RESTful service
◇ Documents define the unit of cost, measured in Request Units
◇ Database Account defines accessibility and consistency
◇ Database is a namespace placeholder
◇ Collections are the unit of scale


Usage: what is DocumentDb for?
◇ User generated content
◇ Many specific data (varbinary(MAX) in SQL)
◇ Catalog data
◇ Log data
◇ User preferences data
◇ Device sensor data
◇ IoT use cases commonly share some patterns in how they ingest, process and store data. First, these systems allow for data intake that can ingest bursts of data from device sensors of various locales. Next, these systems process and analyze streaming data to derive real time insights. And last but not least, most if not all data will eventually land in a data store for ad hoc querying and offline analytics.


Any questions?

You can find me at: [email protected] / @marco_parenzan

Thanks!
