Under the Covers of DynamoDB

DESCRIPTION

In this session you'll learn about the decisions that went into designing and building DynamoDB, and how it allows you to stay focused on your application while enjoying single-digit latencies at any scale. We'll dive deep on how to model data, maintain maximum throughput, and drive analytics against your data, while profiling real-world use cases, tips and tricks from customers running on DynamoDB today.

TRANSCRIPT

Under the Covers of DynamoDB

Matt Wood

Principal Data Scientist

@mza

Hello.

Overview

1. Getting started

2. Data modeling

3. Partitioning

4. Replication & Analytics

5. Customer story: Localytics

1. Getting started

DynamoDB is a managed NoSQL database service.

Store and retrieve any amount of data.

Serve any level of request traffic.

Without the operational burden.

Consistent, predictable performance.

Single digit millisecond latency.

Backed by solid-state drives.

Flexible data model.

Key/attribute pairs. No schema required.

Easy to create. Easy to adjust.

Seamless scalability.

No table size limits. Unlimited storage.

No downtime.

Durable.

Consistent, disk-only writes.

Replication across data centers and availability zones.

Without the operational burden.

Focus on your app.

Two decisions + three clicks = ready for use:

Primary keys

Level of throughput

Provisioned throughput.

Reserve IOPS for reads and writes.

Scale up or down at any time.

Pay per capacity unit.

Priced per hour of provisioned throughput.

Write throughput.

Size of item x writes per second

$0.0065 for 10 write units
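As a worked example (assuming items at or below the write unit size of 1 KB): sustaining 10 writes per second of 1 KB items requires 10 write units, which at the price above costs $0.0065 per hour.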

Consistent writes.

Atomic increment and decrement.

Optimistic concurrency control: conditional writes.

Transactions.

Item level transactions only.

Puts, updates and deletes are ACID.
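A minimal sketch of a conditional write and an atomic increment using boto3, the current Python SDK (the talk itself predates this expression syntax); the "scores" table and its attributes are hypothetical:

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("scores")  # hypothetical table

# Conditional write: fails with ConditionalCheckFailedException
# if an item with this key already exists.
table.put_item(
    Item={"user_id": "mza", "game": "tetris", "score": 0},
    ConditionExpression="attribute_not_exists(user_id)",
)

# Atomic increment: no read-modify-write race.
table.update_item(
    Key={"user_id": "mza", "game": "tetris"},
    UpdateExpression="ADD score :inc",
    ExpressionAttributeValues={":inc": 100},
)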

Read throughput.

Strong or eventual consistency.

Strongly consistent reads: provisioned units = size of item x reads per second. $0.0065 per hour for 50 units.

Eventually consistent reads: provisioned units = (size of item x reads per second) / 2. $0.0065 per hour for 100 units.

Same latency expectations.

Mix and match at ‘read time’.
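For example, reading 1 KB items 100 times per second (assuming items at or below the read unit size) needs 100 provisioned units with strong consistency, or 50 with eventual consistency; that halving is what the pricing above reflects.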

Provisioned throughput is managed by DynamoDB.

Data is partitioned and managed by DynamoDB.

Indexed data storage.

$0.25 per GB per month.

Tiered bandwidth pricing:

aws.amazon.com/dynamodb/pricing

Reserved capacity.

Up to 53% savings for a 1-year reservation.

Up to 76% savings for a 3-year reservation.

Authentication.

Session-based to minimize latency.

Uses the Amazon Security Token Service.

Handled by AWS SDKs.

Integrates with IAM.

Monitoring.

CloudWatch metrics: latency, consumed read and write throughput, errors and throttling.

Libraries, mappers and mocks.

ColdFusion, Django, Erlang, Java, .Net,

Node.js, Perl, PHP, Python, Ruby

http://j.mp/dynamodb-libs

2. Data modeling

id = 100   date = 2012-05-16-09-00-10   total = 25.00
id = 101   date = 2012-05-15-15-00-11   total = 35.00
id = 101   date = 2012-05-16-12-00-10   total = 100.00
id = 102   date = 2012-03-20-18-23-10   total = 20.00
id = 102   date = 2012-03-20-18-23-10   total = 120.00

Table: the whole collection of rows above.

Item: a single row (for example, the id = 100 row).

Attribute: a single name/value pair within an item (for example, total = 25.00).

Where is the schema?

Tables do not require a formal schema.

Items are an arbitrarily sized hash.

Indexing.

Items are indexed by primary and secondary keys.

Primary keys can be composite.

Secondary keys are local to the table.

For a table with ID, Date and Total attributes:

Hash key only: ID.

Composite primary key: ID as the hash key, Date as the range key.

With a local secondary index: Total as a secondary range key.
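The two decisions from earlier (keys and throughput) map directly onto one API call; a boto3 sketch, where the "orders" table name and attribute types are assumptions mirroring the example above:

import boto3

client = boto3.client("dynamodb")

client.create_table(
    TableName="orders",  # hypothetical
    AttributeDefinitions=[
        {"AttributeName": "id", "AttributeType": "N"},
        {"AttributeName": "date", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "id", "KeyType": "HASH"},     # hash key
        {"AttributeName": "date", "KeyType": "RANGE"},  # range key
    ],
    ProvisionedThroughput={
        "ReadCapacityUnits": 10,
        "WriteCapacityUnits": 10,
    },
)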

Programming DynamoDB.

Small but perfectly formed API.

CreateTable

UpdateTable

DeleteTable

DescribeTable

ListTables

Query

Scan

PutItem

GetItem

UpdateItem

DeleteItem

BatchGetItem

BatchWriteItem


Conditional updates.

PutItem, UpdateItem, DeleteItem can take

optional conditions for operation.

UpdateItem performs atomic increments.

One API call, multiple items.

BatchGet returns multiple items by key.

Throughput is measured by IO, not API calls.

BatchWrite performs up to 25 put or delete operations.
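boto3's batch_writer helper, built on BatchWriteItem, illustrates this: it buffers puts and deletes and flushes them to the service in chunks of up to 25 (a sketch; the table and item shape are hypothetical):

import boto3

table = boto3.resource("dynamodb").Table("scores")  # hypothetical

with table.batch_writer() as batch:  # flushes 25 operations at a time
    for i in range(100):
        batch.put_item(Item={"user_id": f"user-{i}", "game": "tetris", "score": 0})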


Query vs Scan

Query returns items by key.

Scan reads the whole table sequentially.

Query patterns

Retrieve all items by hash key.

Range key conditions: ==, <, >, >=, <=, begins with, between.

Counts. Top and bottom n values.

Paged responses.
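A sketch of these query patterns in boto3 (table and attribute names are assumptions taken from the example that follows):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Scores")  # hypothetical

# Hash key plus a range key condition, top-10 by range key (descending).
resp = table.query(
    KeyConditionExpression=Key("user_id").eq("mza") & Key("game").begins_with("t"),
    ScanIndexForward=False,  # descending order over the range key
    Limit=10,
)
items = resp["Items"]

# Paged responses: follow LastEvaluatedKey until the result set is exhausted.
while "LastEvaluatedKey" in resp:
    resp = table.query(
        KeyConditionExpression=Key("user_id").eq("mza") & Key("game").begins_with("t"),
        ScanIndexForward=False,
        Limit=10,
        ExclusiveStartKey=resp["LastEvaluatedKey"],
    )
    items += resp["Items"]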

EXAMPLE 1: Mapping relationships.

Players

user_id = mza        location = Cambridge   joined = 2011-07-04
user_id = jeffbarr   location = Seattle     joined = 2012-01-20
user_id = werner     location = Worldwide   joined = 2011-05-15

Scores

user_id = mza      game = angry-birds   score = 11,000
user_id = mza      game = tetris        score = 1,223,000
user_id = werner   game = bejewelled    score = 55,000

Leader boards

game = angry-birds   score = 11,000      user_id = mza
game = tetris        score = 1,223,000   user_id = mza
game = tetris        score = 9,000,000   user_id = jeffbarr

Query for scores by user: the Scores table is keyed on user_id.

High scores by game: the Leader boards table is keyed on game.
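Both access patterns become single queries against each table's hash key; a short boto3 sketch (table names assumed):

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")

# Scores by user: Scores is keyed on user_id.
my_scores = dynamodb.Table("Scores").query(
    KeyConditionExpression=Key("user_id").eq("mza")
)["Items"]

# High scores by game: Leader boards is keyed on game with score as the
# range key, so descending order yields the top scores first.
top_tetris = dynamodb.Table("LeaderBoards").query(
    KeyConditionExpression=Key("game").eq("tetris"),
    ScanIndexForward=False,
)["Items"]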

EXAMPLE 2: Storing large items.

Unlimited storage.

Unlimited attributes per item.

Unlimited items per table.

Maximum of 64 KB per item.

message_id = 1   part = 1   message = <first 64k>
message_id = 1   part = 2   message = <second 64k>
message_id = 1   part = 3   message = <third 64k>

Split across items.

message_id = 1   message = http://s3.amazonaws.com...
message_id = 2   message = http://s3.amazonaws.com...
message_id = 3   message = http://s3.amazonaws.com...

Store a pointer to S3.
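A sketch of the pointer pattern in boto3 (bucket, key and table names are hypothetical):

import boto3

s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table("messages")

# Store the oversized payload in S3...
s3.put_object(Bucket="my-messages", Key="1", Body=b"<large message body>")

# ...and keep only a pointer to it in DynamoDB.
table.put_item(Item={
    "message_id": 1,
    "message": "http://s3.amazonaws.com/my-messages/1",
})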

EXAMPLE 3: Time series data.

Hot and cold tables.

April (hot):

event_id = 1000   timestamp = 2013-04-16-09-59-01   key = value
event_id = 1001   timestamp = 2013-04-16-09-59-02   key = value
event_id = 1002   timestamp = 2013-04-16-09-59-02   key = value

March (cold):

event_id = 1000   timestamp = 2013-03-01-09-59-01   key = value
event_id = 1001   timestamp = 2013-03-01-09-59-02   key = value
event_id = 1002   timestamp = 2013-03-01-09-59-02   key = value

One table per month: April → March → February → January → December.

Archive data.

Move old data to S3: lower cost.

Still available for analytics.

Run queries across hot and cold data with Elastic MapReduce.
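One way to sketch the monthly layout: derive the hot table's name from the current month and route writes to it (the naming scheme and table creation ahead of time are assumptions, not from the talk):

from datetime import datetime, timezone
import boto3

dynamodb = boto3.resource("dynamodb")

def hot_table():
    # e.g. "events_2013_04" during April 2013; assumes the table was
    # created ahead of time.
    return dynamodb.Table(f"events_{datetime.now(timezone.utc):%Y_%m}")

hot_table().put_item(Item={
    "event_id": 1000,
    "timestamp": "2013-04-16-09-59-01",
    "key": "value",
})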

3. Partitioning

Uniform workload.

Data stored across multiple partitions.

Data is primarily distributed by primary key.

Provisioned throughput is divided evenly across partitions.

To achieve and maintain full provisioned throughput, spread workload evenly across hash keys.

Non-uniform workload.

May be throttled, even at high levels of provisioned throughput.

BEST PRACTICE 1: Distinct values for hash keys.

Hash key elements should have a high number of distinct values.

user_id = mza        first_name = Matt     last_name = Wood
user_id = jeffbarr   first_name = Jeff     last_name = Barr
user_id = werner     first_name = Werner   last_name = Vogels
user_id = simone     first_name = Simone   last_name = Brunozzi
...

Lots of users with unique user_id.

Workload well distributed across hash key.

BEST PRACTICE 2: Avoid limited hash key values.

Hash key elements should have a high number of distinct values.

status = 200   date = 2012-04-01-00-00-01
status = 404   date = 2012-04-01-00-00-01
status = 404   date = 2012-04-01-00-00-01
status = 404   date = 2012-04-01-00-00-01

Small number of status codes.

Uneven, non-uniform workload.

BEST PRACTICE 3: Model for even distribution.

Access by hash key value should be evenly distributed across the dataset.

mobile_id = 100   access_date = 2012-04-01-00-00-01
mobile_id = 100   access_date = 2012-04-01-00-00-02
mobile_id = 100   access_date = 2012-04-01-00-00-03
mobile_id = 100   access_date = 2012-04-01-00-00-04
...

Large number of devices.

A small number of devices are much more popular than others.

Workload unevenly distributed.

mobile_id = 100.1   access_date = 2012-04-01-00-00-01
mobile_id = 100.2   access_date = 2012-04-01-00-00-02
mobile_id = 100.3   access_date = 2012-04-01-00-00-03
mobile_id = 100.4   access_date = 2012-04-01-00-00-04
...

Sample access pattern.

Workload randomized by hash key.
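A sketch of this randomization in boto3: append a small random suffix on write, and fan reads out across every suffix (the shard count and table name are assumptions):

import random
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("device_access")  # hypothetical

SHARDS = 10  # mobile_id 100 is written as 100.1 .. 100.10

def record_access(mobile_id, access_date):
    # Writes spread evenly across SHARDS hash key values.
    suffix = random.randint(1, SHARDS)
    table.put_item(Item={"mobile_id": f"{mobile_id}.{suffix}",
                         "access_date": access_date})

def read_accesses(mobile_id):
    # Reads must fan out across every shard and merge the results.
    items = []
    for shard in range(1, SHARDS + 1):
        resp = table.query(
            KeyConditionExpression=Key("mobile_id").eq(f"{mobile_id}.{shard}"))
        items += resp["Items"]
    return items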

4. Replication & Analytics

Seamless scale.

Scalable methods for data processing.

Scalable methods for backup/restore.

Amazon Elastic MapReduce.

Managed Hadoop service for data-intensive workflows.

aws.amazon.com/emr

create external table items_db
  (id string, votes bigint, views bigint)
stored by 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
tblproperties (
  "dynamodb.table.name" = "items",
  "dynamodb.column.mapping" = "id:id,votes:votes,views:views"
);

select id, votes, views
from items_db
order by views desc;

5. Customer story: Localytics

Mohit Dilawari

Director of Engineering

@mdilawari

DynamoDB @ Localytics


About Localytics

• Mobile App Analytics Service

• 750+ Million Devices and over 20,000 Apps

• Customers include: …and many more.


About the Development Team

• Small team of four managing the entire AWS infrastructure (~100 EC2 instances)

• Experts in big data

• Leveraging Amazon's services has been the key to our success

• Large-scale users of: SQS, S3, ELB, RDS, Route53, ElastiCache, EMR

…and of course DynamoDB


Why DynamoDB?

Set it and Forget it


Our use-case: Dedup Data

• Each datapoint includes a globally unique ID

• Mobile traffic over 2G/3G will upload periodic duplicate data

• We accept data up to a 28-day window


First Design for Dedup table

Unique ID: aaaaaaaaaaaaaaaaaaaaaaaaa333333333333333

Table Name = dedup_table

ID
aaaaaaaaaaaaaaaaaaaaaaaaa111111111111111
aaaaaaaaaaaaaaaaaaaaaaaaa222222222222222
aaaaaaaaaaaaaaaaaaaaaaaaa333333333333333  ← inserted

"Test and set" in a single operation.
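The "test and set" is a single conditional put; a boto3 sketch of the idea (the function name and error handling are illustrative, not Localytics' code):

import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("dedup_table")

def seen_before(unique_id):
    try:
        table.put_item(
            Item={"ID": unique_id},
            # Insert only if no item with this ID exists yet.
            ConditionExpression="attribute_not_exists(ID)",
        )
        return False  # inserted: first time this ID has been seen
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return True  # already present: a duplicate
        raise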


Optimization One - Data Aging

• Partition by month

• Create the new table the day before the month starts

• Need to keep two months of data


Optimization One - Data Aging

Unique ID: bbbbbbbbbbbbbbbbbbbbbbbbb333333333333333

Check previous month:

Table Name = March2013_dedup

ID
aaaaaaaaaaaaaaaaaaaaaaaaa111111111111111
aaaaaaaaaaaaaaaaaaaaaaaaa222222222222222

Not here!


Optimization One - Data Aging

Unique ID: bbbbbbbbbbbbbbbbbbbbbbbbb333333333333333

Test and set in current month:

Table Name = April2013_dedup

ID
bbbbbbbbbbbbbbbbbbbbbbbbb111111111111111
bbbbbbbbbbbbbbbbbbbbbbbbb222222222222222
bbbbbbbbbbbbbbbbbbbbbbbbb333333333333333  ← inserted


Optimization Two

• Reduce the index size: reduces costs

• Each item has a 100-byte overhead, which is substantial

• Combine multiple IDs together into one record

• Split each ID into two halves: the first half is the key, the second half is added to a set


Optimization Two - Use Sets

Unique ID: ccccccccccccccccccccccccccc999999999999999

Prefix                        Values
aaaaaaaaaaaaaaaaaaaaaaaaa     [111111111111111, 222222222222222, 333333333333333]
bbbbbbbbbbbbbbbbbbbbbbbbb     [444444444444444, 555555555555555, 666666666666666]
ccccccccccccccccccccccccccc   [777777777777777, 888888888888888, 999999999999999 ← added]
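A sketch of the set optimization in boto3: use the prefix as the key and ADD the suffix to a string set; ReturnValues lets one call double as the membership test (the split point, attribute names and helper are assumptions):

import boto3

table = boto3.resource("dynamodb").Table("dedup_table")

def seen_before(unique_id, split=25):
    prefix, suffix = unique_id[:split], unique_id[split:]
    resp = table.update_item(
        Key={"Prefix": prefix},
        # ADD to a string set; boto3 maps a Python set to a DynamoDB set.
        UpdateExpression="ADD Suffixes :s",
        ExpressionAttributeValues={":s": {suffix}},
        # Return the set as it was before the update, so a single call
        # is both the test and the set.
        ReturnValues="UPDATED_OLD",
    )
    old = resp.get("Attributes", {}).get("Suffixes", set())
    return suffix in old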


Optimization Three - Combine Months

• Go back to a single table

Prefix March2013 April2013

aaaaaaaaaa... [111111111111111, 22222222222... [1212121212121212, 3434343434....

bbbbbbbbbb... [444444444444444, 555555555.... [4545454545454545, 6767676767.....

ccccccccccc... [777777777777777, 888888888... [8989898989898989, 1313131313....

One Operation 1. Delete February2013 Field 2. Check ID in March2013 • Test and Set into April 2013


Recap

Compare plans for 20 billion IDs per month:

Plan                   Storage costs   Read costs   Write costs   Total    Savings
Naive (after a year)   $8400           $0           $4000         $12400   -
Data aging             $900            $350         $4000         $5250    57%
Using sets             $150            $350         $4000         $4500    64%
Multiple months        $150            $0           $4000         $4150    67%


Thank You @mdilawari

Summary

1. Getting started

2. Data modeling

3. Partitioning

4. Replication & Analytics

5. Customer story: Localytics

Free tier.

aws.amazon.com/dynamodb
