Azure DocumentDB: Advanced Features for Large-Scale Apps

{ "name": "Andrew Liu", "e-mail": "[email protected]", "twitter": "@aliuy8"}

First… a Rant

managing servers makes me cry

structuring data is really hard

managing schema and indexes makes me angry

DocumentDB

NoSQL… as a service!

Let's talk about…

• A quick recap on NoSQL

• Big Data Challenges

• Partitioning, Data Modeling, Stored Procedures

• Q&A

NoSQL in a nutshell

• NoSQL is a buzzword

• NoSQL is varied
  • Key-value
  • Wide-column
  • Graph
  • Document-oriented

Documents

Schema-agnostic JSON store for hierarchical and de-normalized data at scale

Not these documents

Perfect for these:

{
  "name": "SmugMug",
  "permalink": "smugmug",
  "homepage_url": "http://www.smugmug.com",
  "blog_url": "http://blogs.smugmug.com/",
  "category_code": "photo_video",
  "products": [
    { "name": "SmugMug", "permalink": "smugmug" }
  ],
  "offices": [
    {
      "description": "",
      "address1": "67 E. Evelyn Ave",
      "address2": "",
      "zip_code": "94041",
      "city": "Mountain View",
      "state_code": "CA",
      "country_code": "USA",
      "latitude": 37.390056,
      "longitude": -122.067692
    }
  ]
}

Azure DocumentDB

Blazing fast, planet-scale NoSQL service

• Elastic, limitless scale: millions of RPS, many TBs of data, transparent partitioning

• Guaranteed low latency: <10 ms reads and <15 ms writes at P99

• Globally replicated: low-latency access around the globe!

• Schema freedom: automatic indexing, easy-to-learn query grammar, multi-record transactions

• 99.99% SLAs for availability, latency, and throughput

How does this fit in the Azure family?

“If all you have is a hammer, everything looks like a nail”

- Abraham Maslow

The database renaissance!

Choose the right tools for the right job

Problem 1: Variety

Item | Author | Pages | Language
Harry Potter and the Sorcerer’s Stone | J.K. Rowling | 309 | English
Game of Thrones: A Song of Ice and Fire | George R.R. Martin | 864 | English

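Item | Author | Pages | Language
Harry Potter and the Sorcerer’s Stone | J.K. Rowling | 309 | English
Game of Thrones: A Song of Ice and Fire | George R.R. Martin | 864 | English
Lenovo Thinkpad X1 Carbon | ??? | ??? | ???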

Item | Author | Pages | Language | Processor | Memory | Storage
Harry Potter and the Sorcerer’s Stone | J.K. Rowling | 309 | English | ??? | ??? | ???
Game of Thrones: A Song of Ice and Fire | George R.R. Martin | 864 | English | ??? | ??? | ???
Lenovo Thinkpad X1 Carbon | ??? | ??? | ??? | Core i7 3.3ghz | 8 GB | 256 GB SSD

What a waste of space…

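Item | Author | Pages | Language
Harry Potter and the Sorcerer’s Stone | J.K. Rowling | 309 | English
Game of Thrones: A Song of Ice and Fire | George R.R. Martin | 864 | English

Item | CPU | Memory | Storage
Lenovo Thinkpad X1 Carbon | Core i7 3.3ghz | 8 GB | 256 GB SSD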

More tables!

Okay… What if I have 100,000 product types?

Or I have varying features for a single product type?

ProductId | Item
1 | Harry Potter and the Sorcerer’s Stone
2 | Game of Thrones: A Song of Ice and Fire
3 | Lenovo Thinkpad X1 Carbon

ProductId | Attribute | Value
1 | Author | J.K. Rowling
1 | Pages | 309
… | |
2 | Author | George R.R. Martin
2 | Pages | 864
… | |
3 | Processor | Core i7 3.3ghz
3 | Memory | 8 GB
… | |

{
  "ItemType": "Book",
  "Title": "Harry Potter and the Sorcerer's Stone",
  "Author": "J.K. Rowling",
  "Pages": "309",
  "Languages": [ "English", "Spanish", "Portuguese", "Russian", "French" ]
}

{
  "ItemType": "Laptop",
  "Name": "Lenovo Thinkpad X1 Carbon",
  "Processor": "Core i7 3.3 Ghz",
  "Memory": "8 GB DDR3L SDRAM",
  "Storage": "256 GB SSD",
  "Graphics": "Intel HD Graphics 4400",
  "Weight": "1 pound"
}

It just works.
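To make the contrast with the sparse tables concrete, here is a minimal sketch of two differently shaped documents living side by side. It uses a plain in-memory array rather than DocumentDB itself, and the helper name findByProperty is hypothetical.

```javascript
// Two documents with different shapes stored side by side (values taken from the slides).
const catalog = [
  { ItemType: "Book", Title: "Harry Potter and the Sorcerer's Stone", Author: "J.K. Rowling" },
  { ItemType: "Laptop", Name: "Lenovo Thinkpad X1 Carbon", Processor: "Core i7 3.3 Ghz", Memory: "8 GB DDR3L SDRAM" }
];

// Hypothetical helper: return every document whose property equals the given value.
// Documents that lack the property are simply skipped; no placeholder columns are needed.
function findByProperty(documents, property, value) {
  return documents.filter((doc) => doc[property] === value);
}

console.log(findByProperty(catalog, "Author", "J.K. Rowling").length); // 1
console.log(findByProperty(catalog, "Processor", "Core i7 3.3 Ghz")[0].Name);
```

Neither document carries "???" fields for attributes it does not have, which is the point of the slide.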

Problem 2: Scale (Volume and Velocity)

Let’s begin with a Story

Indexing JSON and fighting zombies at SCALE

Next Games

Game development studio based in Helsinki, Finland

65 employees

Develop F2P mobile games for iOS and Android

Based on own & licensed IP

The Walking Dead TV show

Drama about a zombie walker apocalypse on AMC

First cable drama to beat broadcast shows

Most watched cable TV show in the US (16M users)

The Challenge

Scale with expectation of millions of users on Day 1

Deliver real-time responsiveness for a lag-free gaming experience

Highly competitive – high scores and global leaderboards critical

More Users, More Problems

The Results

#1 in Apple App Store free apps during launch week

>1M downloads

~1B queries per day

99% of queries served in under 10 ms (P99)

Just throw some data in a database!

Not that easy…

Why is this such a hard problem?

• Caches: scoreboard keeps updating…

• SQL database: need to shard; schema and index management; loss of relational benefits

• Azure Table Storage: secondary indexes, latency, throughput

Planet-Scale NoSQL

Horizontal scaling for storage and throughput

High performance with SSDs and automatic indexing

Operating on a global scale

Partitioning

Fact: Managing shards is really painful.

Elastic Scale

Good news: DocumentDB has done all the heavy lifting.

Request Units

A Request Unit (RU) is the normalized currency for the resources a request consumes:

• % Memory

• % IOPS

• % CPU

Each replica gets a fixed budget of Request Units

Operation | REST request
READ | GET resource (returns a resource or resource set)
INSERT | POST resource
REPLACE | PUT resource
DELETE | DELETE resource
QUERY | POST documents + SQL
EXECUTE | POST sprocs + args

Predictable Performance

Most important metric in DocumentDB!
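The fixed-budget idea can be sketched as a counter that is refilled every second. This is only an illustration: the class name RequestUnitBudget and the per-operation costs are assumptions, and the real service computes charges per request and answers over-budget requests with HTTP 429.

```javascript
// Hypothetical per-operation costs in RUs; real charges depend on document size,
// indexing, and query shape, and are reported by the service per request.
const ASSUMED_COST = { read: 1, insert: 5, query: 10 };

// A replica with a fixed budget of request units per one-second window.
class RequestUnitBudget {
  constructor(unitsPerSecond) {
    this.unitsPerSecond = unitsPerSecond;
    this.remaining = unitsPerSecond;
  }

  // Try to spend the RUs for one operation; returns false when the budget is exhausted.
  tryConsume(operation) {
    const cost = ASSUMED_COST[operation];
    if (cost === undefined) throw new Error("Unknown operation: " + operation);
    if (cost > this.remaining) return false; // caller should back off and retry
    this.remaining -= cost;
    return true;
  }

  // Call at the start of each new one-second window.
  refill() {
    this.remaining = this.unitsPerSecond;
  }
}

const budget = new RequestUnitBudget(12);
console.log(budget.tryConsume("query"));  // true  (2 RUs left)
console.log(budget.tryConsume("insert")); // false (needs 5, only 2 left)
console.log(budget.tryConsume("read"));   // true  (1 RU left)
budget.refill();
console.log(budget.tryConsume("insert")); // true
```

Because every operation is priced in the same unit, performance stays predictable: a request either fits in the remaining budget or is rejected quickly.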

Partitioned Collections

What’s left? Choosing a Partition Key

Choosing a Partition Key

• Workload – Read vs Write heavy?

• Top Queries

• Transaction Boundary

• Avoid Storage + Performance Bottlenecks

• Multi-Tenancy: Tenant Size

• Examples: partition by tenant, device, timestamp, or composite
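Why the key choice matters for storage and performance bottlenecks can be sketched with a toy hash partitioner. The hash below (FNV-1a) and the function names are assumptions for illustration; the slides do not describe the hash the service uses.

```javascript
// Simple deterministic string hash (FNV-1a, 32-bit). An assumption for illustration;
// the service's real hash function is not described in the slides.
function hashKey(key) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < key.length; i++) {
    hash ^= key.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// Map a partition key value to one of `partitionCount` partitions.
function partitionFor(keyValue, partitionCount) {
  return hashKey(String(keyValue)) % partitionCount;
}

// Count how many documents land on each partition for a chosen key property.
function distribution(documents, keyProperty, partitionCount) {
  const counts = new Array(partitionCount).fill(0);
  for (const doc of documents) {
    counts[partitionFor(doc[keyProperty], partitionCount)] += 1;
  }
  return counts;
}

// Hypothetical data: 1000 walkers that all share one country value.
const walkers = [];
for (let i = 0; i < 1000; i++) {
  walkers.push({ walkerId: "walker-" + i, country: "FI" });
}
console.log(distribution(walkers, "walkerId", 4)); // spread over all four partitions
console.log(distribution(walkers, "country", 4));  // all 1000 documents on one partition
```

A high-cardinality key such as walkerId spreads load, while a key with few distinct values concentrates storage and throughput on a single partition.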

Creating partitioned collections

using Microsoft.Azure.Documents;
using Microsoft.Azure.Documents.Client;

// Pre-defined collection: fixed performance level (S3)
DocumentCollection collectionSpec = new DocumentCollection { Id = "Walkers" };
RequestOptions options = new RequestOptions { OfferType = "S3" };

DocumentCollection documentCollection = await client.CreateDocumentCollectionAsync(
    "dbs/" + database.Id, collectionSpec, options);

// Partitioned collection: partition key path plus provisioned throughput.
// Variables are renamed so both snippets can share one scope.
DocumentCollection partitionedSpec = new DocumentCollection { Id = "Walkers" };
partitionedSpec.PartitionKey.Paths.Add("/walkerId");
int collectionThroughput = 100000;
RequestOptions partitionedOptions = new RequestOptions { OfferThroughput = collectionThroughput };

DocumentCollection partitionedCollection = await client.CreateDocumentCollectionAsync(
    "dbs/" + database.Id, partitionedSpec, partitionedOptions);

Let's talk about a physics problem

Globally Distributed

• Not just for disaster recovery…. DocumentDB is unreasonably highly available

• Replicate data across any # of regions of your choice

• Low-latency access to your data around the globe

• Dynamically configure your write and read regions
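One way to picture region selection is a client that walks an ordered preference list and takes the first region it can reach. This is a sketch under assumptions: the function name, the fallback to the write region, and the sample region names are illustrative, not the SDK's documented algorithm.

```javascript
// Pick the first preferred region that is currently available; fall back to the
// write region when none of the preferred regions can be used (assumed behavior).
function chooseReadRegion(preferredLocations, availableRegions, writeRegion) {
  for (const region of preferredLocations) {
    if (availableRegions.includes(region)) return region;
  }
  return writeRegion;
}

const preferred = ["East US 2", "West US"];
console.log(chooseReadRegion(preferred, ["West US", "North Europe"], "North Europe")); // West US
console.log(chooseReadRegion(preferred, ["North Europe"], "North Europe"));            // North Europe
```

Listing the nearest region first is what gives each deployment of the app low-latency reads.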

Azure DocumentDB gives you the ability to cheat the speed of light!

… with well-defined consistency models!

Strong → Bounded Staleness → Session → Eventual

Left to right: relaxed consistency => better performance and availability

Consistency Level | Strong | Bounded Staleness | Session | Eventual
Total global order | Yes | Yes, outside of the “staleness window” | No, partial “session” order | No
Consistent prefix guarantee | Yes | Yes | Yes | Yes
Monotonic reads | Yes | Yes, across regions outside of the staleness window and within a region all the time | Yes, for the given session | No
Monotonic writes | Yes | Yes | Yes | Yes
Read your writes | Yes | Yes (in the write region) | Yes | No
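The "read your writes" row can be illustrated with a toy store that has one write region and one lagging replica. Everything here is an assumption made for the sketch (the ToyStore class, using a sequence number as the session token); it is not how the service is implemented.

```javascript
// A toy replicated store: one write region and one lagging read replica.
// Every write gets an increasing sequence number; the replica applies writes
// only when replicate() is called, which models replication lag.
class ToyStore {
  constructor() {
    this.primary = new Map();
    this.replica = new Map();
    this.primarySeq = 0; // sequence number of the latest write
    this.replicaSeq = 0; // latest write the replica has applied
    this.pending = [];   // writes not yet shipped to the replica
  }

  // Write to the primary and return a session token (the write's sequence number).
  write(key, value) {
    this.primarySeq += 1;
    this.primary.set(key, value);
    this.pending.push({ seq: this.primarySeq, key, value });
    return this.primarySeq;
  }

  // Ship all pending writes to the replica, in order.
  replicate() {
    for (const w of this.pending) {
      this.replica.set(w.key, w.value);
      this.replicaSeq = w.seq;
    }
    this.pending = [];
  }

  // Eventual read: always served by the replica, which may be stale.
  readEventual(key) {
    return this.replica.get(key);
  }

  // Session read: use the replica only if it has caught up to the session token;
  // otherwise fall back to the primary so the session sees its own writes.
  readSession(key, sessionToken) {
    const source = this.replicaSeq >= sessionToken ? this.replica : this.primary;
    return source.get(key);
  }

  // Strong read: always served from the latest committed state.
  readStrong(key) {
    return this.primary.get(key);
  }
}

const store = new ToyStore();
const token = store.write("score", 100);
console.log(store.readEventual("score"));       // undefined (replica has not caught up)
console.log(store.readSession("score", token)); // 100
store.replicate();
console.log(store.readEventual("score"));       // 100
```

The session read costs no more than an eventual read once the replica has caught up, which hints at why relaxing consistency buys performance.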

Strong consistency, High latency

Eventual consistency, Low latency

[Pie chart: observed distribution of consistency levels. Legend: Bounded Staleness, Eventual, Session, Strong. Slice labels as extracted: 27%, 3%, 54%, 16%.]

App-defined regional preferences

ConnectionPolicy docClientConnectionPolicy = new ConnectionPolicy
{
    ConnectionMode = ConnectionMode.Direct,
    ConnectionProtocol = Protocol.Tcp
};

docClientConnectionPolicy.PreferredLocations.Add(LocationNames.EastUS2);
docClientConnectionPolicy.PreferredLocations.Add(LocationNames.WestUS);

docClient = new DocumentClient(
    new Uri("https://myglobaldb.documents.azure.com:443"),
    "PARvqUuBw2QTO4rRXr6d1GnLCR7VinERcYrBQvDRh6EDTJLOHtZxgjTS4pv8nQv2Lg1QQLBLfO6TVziOZKvYow==",
    docClientConnectionPolicy);

Enjoy true schema-freedom

Automatic Indexing

• Index is a union of all the document trees

[Figure: the trees of documents 1 and 2 share a common structure]

Terms | Postings List/Values
$/location/0/ | 1, 2
location/0/country/ | 1, 2
location/0/city/ | 1, 2
0/country/Germany | 1, 2
1/country/France | 2
… | …
0/city/Moscow | 2
0/dealers/0 | 2

http://aka.ms/docdbvldb
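The terms-to-postings idea can be sketched as a small inverted index over full document paths. This is a simplification for illustration: the function names are hypothetical, the city "Berlin" is made up, and the real index stores shorter path segments as shown in the table above.

```javascript
// Collect every root-to-leaf path of a JSON value as a string term,
// e.g. { location: [ { country: "Germany" } ] } -> "$/location/0/country/Germany".
function pathTerms(value, prefix = "$") {
  if (value !== null && typeof value === "object") {
    // Arrays and objects both contribute their keys (array indexes become path segments).
    return Object.keys(value).flatMap((key) => pathTerms(value[key], prefix + "/" + key));
  }
  return [prefix + "/" + String(value)];
}

// Build an inverted index: term -> ids of the documents containing that term,
// in the order the documents were indexed.
function buildIndex(documents) {
  const index = new Map();
  for (const doc of documents) {
    for (const term of new Set(pathTerms(doc.body))) {
      if (!index.has(term)) index.set(term, []);
      index.get(term).push(doc.id);
    }
  }
  return index;
}

const index = buildIndex([
  { id: 1, body: { location: [{ country: "Germany", city: "Berlin" }] } },
  { id: 2, body: { location: [{ country: "Germany", city: "Moscow" }, { country: "France" }] } }
]);
console.log(index.get("$/location/0/country/Germany")); // [ 1, 2 ]
console.log(index.get("$/location/1/country/France"));  // [ 2 ]
```

Every path is indexed as the document arrives, so a lookup on any property is a single map access with no schema declared up front.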

No need to define secondary indices / schema hints!


Index policies

Customize index management, including storage overhead, throughput, and query consistency

• Range, hash, and spatial indexes
• Included and excluded paths
• Indexing mode: consistent or lazy
• Index precision
• Online, in-place index transformations

{
  "indexingMode": "consistent",
  "automatic": true,
  "includedPaths": [
    {
      "path": "/*",
      "indexes": [
        { "kind": "Range", "dataType": "Number", "precision": -1 },
        { "kind": "Hash", "dataType": "String", "precision": 3 },
        { "kind": "Spatial", "dataType": "Point" }
      ]
    }
  ],
  "excludedPaths": []
}

-- Nested lookup against index
SELECT Books.Author
FROM Books
WHERE Books.Author.Name = "Leo Tolstoy"

-- Transformation, Filters, Array access
SELECT { Name: Books.Title, Author: Books.Author.Name }
FROM Books
WHERE Books.Price > 10 AND Books.Languages[0] = "English"

-- Joins, User Defined Functions (UDF); UDFs are invoked with the udf. prefix
SELECT udf.CalculateRegionalTax(Books.Price, "USA", "WA")
FROM Books
JOIN LanguagesArr IN Books.Languages
WHERE LanguagesArr.Language = "Russian"

SQL Query Grammar

Query over schema-free JSON
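The JOIN in the last query works within each document: it pairs a book with each element of its own Languages array. A rough in-memory equivalent is sketched below; the helper name and the sample books are hypothetical.

```javascript
// Intra-document JOIN: pair each document with each element of one of its arrays.
// No other documents are involved, unlike a relational join across tables.
function joinWithArray(documents, arrayProperty) {
  return documents.flatMap((doc) =>
    (doc[arrayProperty] || []).map((element) => ({ doc, element })));
}

// Hypothetical sample data shaped like the Books documents the queries assume.
const books = [
  { Title: "War and Peace", Price: 12, Languages: [{ Language: "Russian" }, { Language: "English" }] },
  { Title: "Hypothetical Book", Price: 8, Languages: [{ Language: "English" }] }
];

// Roughly: SELECT Books.Title FROM Books JOIN LanguagesArr IN Books.Languages
//          WHERE LanguagesArr.Language = "Russian"
const russianTitles = joinWithArray(books, "Languages")
  .filter((pair) => pair.element.Language === "Russian")
  .map((pair) => pair.doc.Title);

console.log(russianTitles); // [ 'War and Peace' ]
```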

JavaScript as a Modern Day T-SQL

Transactional Integrated JavaScript

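Database (stored procedure):

// Swap the items of two players inside a single transaction.
// __ is shorthand for getContext().getCollection().
function swapItems(playerId1, playerId2) {
  __.filter(
    function (document) {
      return document.id == playerId1 || document.id == playerId2;
    },
    function (err, playersToSwap) {
      if (err) throw err;
      if (playersToSwap.length != 2) throw new Error('Unable to find both players, abort');

      var player1 = playersToSwap[0], player2 = playersToSwap[1];
      var player1ItemTemp = player1.item;
      player1.item = player2.item;
      player2.item = player1ItemTemp;

      // A thrown error rolls back every write made by the procedure.
      __.replaceDocument(player1._self, player1, function (err) {
        if (err) throw new Error('Unable to update players, abort');
        __.replaceDocument(player2._self, player2, function (err) {
          if (err) throw new Error('Unable to update players, abort');
        });
      });
    });
}

Client:

client.executeStoredProcedureAsync("procs/1234", ["MasterChief", "SolidSnake"])
  .then(function (response) {
    console.log("success!");
  }, function (err) {
    console.log("Failed to swap!", err);
  });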
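The all-or-nothing behavior of a stored procedure can be mimicked in memory by working on a private copy and publishing it only on success. This is a sketch of the idea, not the database's mechanism; runTransaction, the Map-based store, and the item values are assumptions.

```javascript
// Run `operation` against a private copy of the data; publish the copy only if
// the operation finishes without throwing. A thrown error leaves the data untouched.
function runTransaction(store, operation) {
  const workingCopy = new Map();
  for (const [id, doc] of store) workingCopy.set(id, { ...doc });
  operation(workingCopy); // may throw, in which case nothing below runs
  store.clear();
  for (const [id, doc] of workingCopy) store.set(id, doc);
}

// Swap the items of two players, aborting if either player is missing.
function swapItems(data, playerId1, playerId2) {
  const player1 = data.get(playerId1);
  const player2 = data.get(playerId2);
  if (!player1 || !player2) throw new Error("Unable to update players, abort");
  const temp = player1.item;
  player1.item = player2.item;
  player2.item = temp;
}

const players = new Map([
  ["MasterChief", { id: "MasterChief", item: "rifle" }],
  ["SolidSnake", { id: "SolidSnake", item: "bandana" }]
]);
runTransaction(players, (data) => swapItems(data, "MasterChief", "SolidSnake"));
console.log(players.get("MasterChief").item); // bandana
```

If the operation throws halfway through, the original map is never modified, so no player ends up holding both items or neither.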

Getting Started

Fully managed as a service

API and Toolchain Options

DocumentDB

REST over HTTPS/TCP

Java .NET

PowerBI

Tip: Data Modeling

Data modeling with denormalization

{
  "id": "1",
  "firstName": "Thomas",
  "lastName": "Andersen",
  "addresses": [
    {
      "line1": "100 Some Street",
      "line2": "Unit 1",
      "city": "Seattle",
      "state": "WA",
      "zip": 98012
    }
  ],
  "contactDetails": [
    { "email": "[email protected]" },
    { "phone": "+1 555 555-5555", "extension": 5555 }
  ]
}

Try to model your entity as a self-contained document

Generally, use embedded data models when:

• There are "contains" relationships between entities
• There are one-to-few relationships between entities
• Embedded data changes infrequently
• Embedded data won’t grow without bound
• Embedded data is integral to data in a document

Denormalizing typically provides better read performance

Data modeling with referencing

In general, use normalized data models when:

• Write performance is more important than read performance
• Representing one-to-many relationships
• Representing many-to-many relationships
• Related data changes frequently

Referencing provides more flexibility than embedding, at the cost of more round trips to read data

User document

{ "id": "xyz", "username": "user xyz" }

Address document

{ "id": "address_xyz", "userid": "xyz", "address": { … } }

Contact details document

{ "id": "contact_xyz", "userid": "xyz", "email": "[email protected]", "phone": "555 5555" }

Normalizing typically provides better write performance
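The read-side cost of referencing can be sketched by counting lookups against a small in-memory store. The CountingStore class, the loader functions, and the sample city are hypothetical; each lookup stands in for one round trip to the database.

```javascript
// A tiny in-memory document store that counts lookups (stand-ins for round trips).
class CountingStore {
  constructor(documents) {
    this.documents = new Map(documents.map((doc) => [doc.id, doc]));
    this.reads = 0;
  }
  get(id) {
    this.reads += 1;
    return this.documents.get(id);
  }
}

// Embedded model: one document holds the user, address, and contact details.
function loadEmbeddedUser(store, userId) {
  return store.get(userId);
}

// Referenced model: address and contact details live in their own documents,
// using the id convention from the slides ("address_<userid>", "contact_<userid>").
function loadReferencedUser(store, userId) {
  const user = store.get(userId);
  const address = store.get("address_" + userId);
  const contact = store.get("contact_" + userId);
  return { ...user, address: address.address, phone: contact.phone };
}

const embedded = new CountingStore([
  { id: "xyz", username: "user xyz", address: { city: "Seattle" }, phone: "555 5555" }
]);
const referenced = new CountingStore([
  { id: "xyz", username: "user xyz" },
  { id: "address_xyz", userid: "xyz", address: { city: "Seattle" } },
  { id: "contact_xyz", userid: "xyz", phone: "555 5555" }
]);

loadEmbeddedUser(embedded, "xyz");
loadReferencedUser(referenced, "xyz");
console.log(embedded.reads);   // 1
console.log(referenced.reads); // 3
```

The trade runs the other way on writes: changing a phone number in the referenced model touches only the small contact document.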

No magic bullet

Think about how your data is going to be written and read, and model accordingly

Hybrid models ~ denormalize + reference + aggregate

Author document

{
  "id": "1",
  "firstName": "Thomas",
  "lastName": "Andersen",
  "countOfBooks": 3,
  "books": [1, 2, 3],
  "images": [
    { "thumbnail": "http://....png" },
    { "profile": "http://....png" }
  ]
}

Book document

{
  "id": 1,
  "name": "DocumentDB 101",
  "authors": [
    { "id": 1, "name": "Thomas Andersen", "thumbnail": "http://....png" },
    { "id": 2, "name": "William Wakefield", "thumbnail": "http://....png" }
  ]
}
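The "aggregate" part of the hybrid model means keeping derived values such as countOfBooks up to date at write time. A minimal sketch follows; the function name addBookToAuthor is hypothetical, and the document mirrors the author document above.

```javascript
// Add a book to an author document in the hybrid model: keep the list of book ids
// (a reference) and the pre-aggregated count in step, so readers never have to count.
function addBookToAuthor(author, bookId) {
  if (author.books.includes(bookId)) return author; // already recorded, nothing to do
  const books = author.books.concat(bookId);
  return { ...author, books, countOfBooks: books.length };
}

const author = { id: "1", firstName: "Thomas", lastName: "Andersen", countOfBooks: 3, books: [1, 2, 3] };
const updated = addBookToAuthor(author, 4);
console.log(updated.countOfBooks); // 4
console.log(updated.books);        // [ 1, 2, 3, 4 ]
```

The write does slightly more work so that a profile page can show the book count without a query.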

Quick Tips

• De-normalize data where appropriate

• Collections != Tables

• Tuning / Perf
  • Consistency Levels
  • Index Policies
  • Understand Query Costs / Limits / Avoid Scans
  • Pre-aggregate where possible

Thank You

Get started with Azure DocumentDB

http://www.azure.com/docdb

Query Demo: https://www.documentdb.com/sql/demo

Andrew Liu

[email protected]

@aliuy8
