nosql revolution: under the covers of distributed systems at scale (spot401) | aws re:invent 2013

@ksshams @swami_79

SPOT 401 - Leading the NoSQL Revolution:

under the covers of Distributed Systems @ scale

@ksshams @swami_79

what are we covering?

The evolution of large scale

distributed systems @ Amazon from

the 90’s to today

The lessons we learned and insights you can employ in your own distributed systems

@ksshams @swami_79

let’s start with a story about a little

company called amazon.com

@ksshams @swami_79

once upon a time...

(in 2000)

episode 1

@ksshams @swami_79

a thousand miles

away...

(seattle)

@ksshams @swami_79

amazon.com - a rapidly growing Internet based retail

business relied on relational databases

@ksshams @swami_79

we had 1000s of independent services

@ksshams @swami_79

each service managed its own state in RDBMs

@ksshams @swami_79

RDBMs are actually kind of cool

@ksshams @swami_79

first of all... SQL!!

@ksshams @swami_79

so it is easier to query..

@ksshams @swami_79

easier to learn

@ksshams @swami_79

they are as versatile as a swiss army knife

complex queries

key-value access

transactions analytics

@ksshams @swami_79

RDBMs are *very* similar to

Swiss Army Knives

@ksshams @swami_79

but sometimes.. swiss army knifes..

can be more than what you bargained for

@ksshams @swami_79

partitioning

easy re-

partitioning HARD..

@ksshams @swami_79

so we bought

bigger boxes...

@ksshams @swami_79

Q4 was hard-work at Amazon

benchmark new

hardware

migrate to new

hardware

repartition databases

pray ...

@ksshams @swami_79

RDBMs availability challenges..

@ksshams @swami_79

then.. (in 2005)

episode 2

@ksshams @swami_79

amazon dynamo predecessor to

dynamoDB

specialist tool :

•limited querying capabilities

•simpler consistency

replicated DHT with consistent hashing

optimistic replication

“sloppy quorum”

anti-entropy mechanism

object versioning

@ksshams @swami_79

dynamo had many benefits • higher availability

• we traded it off for eventual consistency

• incremental scalability

• no more repartitioning

• no need to architect apps for peak

• just add boxes

• simpler querying model ==>> predictable performance

@ksshams @swami_79

but dynamo was not perfect...

lacked strong consistency

@ksshams @swami_79

scaling was easier, but...

@ksshams @swami_79

steep learning curve

@ksshams @swami_79

dynamo was a product ... ==>> not a service...

@ksshams @swami_79

then.. (in 2012)

episode 3

@ksshams @swami_79

“Even though we have years of experience with large, complex

NoSQL architectures, we are happy to be finally out of the

business of managing it ourselves.” - Don MacAskill, CEO

• NoSQL database

• fast & predictable performance

• seamless scalability

• easy administration

DynamoDB

@ksshams @swami_79

services, services, services

@ksshams @swami_79

amazon.com’s experience with services

@ksshams @swami_79

how do you create a successful service?

@ksshams @swami_79

with great services, comes great responsibility

@ksshams @swami_79

Architect

Customer

@ksshams @swami_79

DynamoDB Goals and Philosophies

never compromise on durability scale is our

problem easy to use

scale in rps consistent and low

latencies

@ksshams @swami_79

how to build these large scale services?

@ksshams @swami_79

don’t compromise on durability…

@ksshams @swami_79

don’t compromise on… availability

@ksshams @swami_79

plan for success, plan for scalability

@ksshams @swami_79

Fault tolerant design is key.. • Everything fails all the time

• Planning for failures is not easy

• How do you ensure your recovery strategies work correctly?

@ksshams @swami_79

Byzantine General Problem

@ksshams @swami_79

A simple 2-way replication system of a traditional database…

Primary Standby

Writes

@ksshams @swami_79 @ksshams @swami_79

S is dead, need to trigger new

replica

P is dead, need to promote myself

Improved Replication: Quorum

Replica

Writes Replica

Quorum: Successful write on a majority

Not so easy..

Replica B

Replica C

Writes from

client A Replica A

Replica D

New member in the

Should I continue to serve reads?

Should I start a new quorum?

Replica E Replica F

Reads and

Writes from

client B

Classic Split Brain Issue in Replicated systems leading to lost writes!

@ksshams @swami_79

Building correct distributed systems is not straight forward..

• How do you handle replica failures?

• How do you ensure there is not a parallel

quorum?

• How do you handle partial failures of replicas?

• How do you handle concurrent failures?

correctness is hard, but necessary

Formal Methods

to minimize bugs, we must have a precise description of the design

Formal Methods

code is too detailed

how would you express partial failures or concurrency?

design documents and diagrams are vague & imprecise

Formal Methods

law of large numbers is your friend,

so design for scale

until you hit large numbers

@ksshams @swami_79

TLA+ to the rescue?

@ksshams @swami_79

PlusCal

@ksshams @swami_79

formal methods are necessary

but not sufficient..

@ksshams @swami_79

customer

@ksshams @swami_79

forget to test - no, serious don’t ly

embrace failure and don’t be surprised

simulate failures at unit

test level

fault injection testing

datacenter testing

network brown out testing

scale testing

testing is a lifelong journey

@ksshams @swami_79

testing is necessary but not sufficient..

@ksshams @swami_79

Architect

Customer

@ksshams @swami_79

release cycle

gamma simulate real

one box does it work?

phased deployment

treading lightly

monitor does it still

@ksshams @swami_79

Monitor customer behavior

@ksshams @swami_79

measuring customer experience is key

don’t be satisfied by average - look at

99 percentile

@ksshams @swami_79

understand the scaling dimensions

@ksshams @swami_79

understand how your service will be abused

@ksshams @swami_79

let’s see these rules in action through a true story

@ksshams @swami_79

we were building distributed systems all over

amazon.com

@ksshams @swami_79

we needed a uniform and correct way to do

consensus..

@ksshams @swami_79

so we built a paxos lock library service

@ksshams @swami_79

such a service is so much more useful than just leader election.. it became a distributed

state store

@ksshams @swami_79

such a service is so much more useful than just leader election.. or a distributed state

wait wait.. you’re telling me if I poll,

I can detect node failure?

@ksshams @swami_79

we acted quickly - and scaled up our entire fleet

with more nodes

doh!!!!

we slowed consensus...

@ksshams @swami_79

understand the scaling dimensions

& scale them independently...

@ksshams @swami_79

State Store

a lock service has 3 components..

@ksshams @swami_79

State Store

they must be scaled independently..

@ksshams @swami_79

State Store

@ksshams @swami_79

State Store

Let’s Go Over The demo from this morning

stream ingestion

Real-time tweet analytics using DynamoDB

• Stream from Kinesis to DynamoDB

• What data do want in real-time? • (per-second, top words)

• How does DynamoDB help? • Atomic counters (per-word counts in that second)

• Indexed queries (top N word-counts in that second

Time Word Count

2013-10-13T12:00 Earth 9

2013-10-13T12:00 Mars 10

2013-10-13T12:00 Pluto 5

2013-10-13T12:03 Earth 8

Time Count Word

2013-10-13T12:00 5 Pluto

2013-10-13T12:00 9 Earth

2013-10-13T12:00 10 Mars

2013-10-13T12:03 8 Earth

WordCount Table Local Secondary Index

DynamoDB cost: $0.25 / hr

stream ingestion

Aggregate queries using Redshift

• Simple Redshift connector (buffer files, store in s3, call copy command)

• Manifest copy connector • 2 streams

• transaction table for deduplication

• manifest copy

Right tool for right job…

• Canal -> DynamoDB -> Redshift -> Glacier…

You are not done yet..

• Listen to customer feedback

• Iterate..

Example: DynamoDB

• Start with immediate needs of reliable, super scalable, low latency datastore

• Iterate • Developers wanted flexible query: Local Secondary Indexes

• Developers wanted parallel loads: Parallel Scans

• Mobile developers wanted direct access to their datastore: Fine-grained Access Control

• Mobile developers wanted geo-awareness: Geospatial library

• Developers wanted DynamoDB on their laptop: DynamoDB Local

• Developers wanted richer query: Global Secondary Indexes

• We will continue to innovate..

Sacred Tenets in Distributed Systems

plan for success – plan for scalability

don’t compromise durability for performance

plan for failures - fault -tolerance is key consistent performance

is important release - think of blast radius

insist on correctness

@ksshams @swami_79

understand scaling

dimensions

observe how service is

scalability

over features

strive for

correctness

relentlessly test

monitor like a hawk

@ksshams @swami_79

Please give us your feedback on this

presentation

Don’t miss SPOT 201!!!

SPOT 401

@ksshams @swami_79

nosql revolution: under the covers of distributed systems at scale (spot401) | aws re:invent 2013

formal methods

leader election

scaled independently

dont compromise

failures

amazon

dynamodb

services

Technology

feedback on aws re:invent 2016

aws re:invent 2016: large-scale aws migrations (ent204)

mobile game architectures on aws (mbl201) | aws re:invent...

(sec201) aws security keynote address | aws re:invent 2014

migrating my.t-mobile.com to aws (ent214) | aws re:invent...

aws re:invent re:flection - spot pricing

netapp private storage for aws (ent216) | aws re:invent 2013

(sov203) understanding aws storage options | aws re:invent...

zero to sixty: aws opsworks (dmg202) | aws re:invent 2013

aws security – keynote address (sec101) | aws re:invent...

(spot301) aws innovation at scale | aws re:invent 2014

aws re:invent 2016 photo report

aws re:invent - accelerating research

(app304) aws cloudformation best practices | aws re:invent...

aws security ideas - re:invent 2016

aws re:invent 2016: new launch! introducing aws greengrass...

aws re:invent 2016 fast forward

(ent302) cost optimization on aws | aws re:invent 2014

aws re:invent 2016: relational and nosql databases on aws:...

data replication options in aws (arc302) | aws re:invent...