Transcript
Page 1: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

SPOT 401 - Leading the NoSQL Revolution:

under the covers of Distributed Systems @ scale

Page 2: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

what are we covering?

The evolution of large scale

distributed systems @ Amazon from

the 90’s to today

The lessons we learned and insights you can employ in your own distributed systems

Page 3: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

let’s start with a story about a little

company called amazon.com

Page 4: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

once upon a time...

(in 2000)

episode 1

Page 5: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

a thousand miles

away...

(seattle)

Page 6: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

amazon.com - a rapidly growing Internet based retail

business relied on relational databases

Page 7: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

we had 1000s of independent services

Page 8: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

each service managed its own state in RDBMs

Page 9: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

RDBMs are actually kind of cool

Page 10: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

first of all... SQL!!

Page 11: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

so it is easier to query..

Page 12: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

easier to learn

Page 13: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

they are as versatile as a swiss army knife

complex queries

key-value access

transactions analytics

Page 14: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

RDBMs are *very* similar to

Swiss Army Knives

Page 15: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

but sometimes.. swiss army knifes..

can be more than what you bargained for

Page 16: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

partitioning

easy re-

partitioning HARD..

Page 17: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

so we bought

bigger boxes...

Page 18: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

Q4 was hard-work at Amazon

benchmark new

hardware

migrate to new

hardware

repartition databases

pray ...

Page 19: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

RDBMs availability challenges..

Page 20: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

then.. (in 2005)

episode 2

Page 21: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

amazon dynamo predecessor to

dynamoDB

specialist tool :

•limited querying capabilities

•simpler consistency

replicated DHT with consistent hashing

optimistic replication

“sloppy quorum”

anti-entropy mechanism

object versioning

Page 22: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

dynamo had many benefits • higher availability

• we traded it off for eventual consistency

• incremental scalability

• no more repartitioning

• no need to architect apps for peak

• just add boxes

• simpler querying model ==>> predictable performance

Page 23: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

but dynamo was not perfect...

lacked strong consistency

Page 24: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

but dynamo was not perfect...

scaling was easier, but...

Page 25: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

but dynamo was not perfect...

steep learning curve

Page 26: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

but dynamo was not perfect...

dynamo was a product ... ==>> not a service...

Page 27: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

then.. (in 2012)

episode 3

Page 28: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

ADMIN

“Even though we have years of experience with large, complex

NoSQL architectures, we are happy to be finally out of the

business of managing it ourselves.” - Don MacAskill, CEO

• NoSQL database

• fast & predictable performance

• seamless scalability

• easy administration

DynamoDB

Page 29: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

services, services, services

Page 30: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

amazon.com’s experience with services

Page 31: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

how do you create a successful service?

Page 32: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

with great services, comes great responsibility

Page 33: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

Architect

Customer

Page 34: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

DynamoDB Goals and Philosophies

never compromise on durability scale is our

problem easy to use

scale in rps consistent and low

latencies

Page 35: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

how to build these large scale services?

Page 36: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

don’t compromise on durability…

Page 37: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

don’t compromise on… availability

Page 38: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

plan for success, plan for scalability

Page 39: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

Page 40: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

Fault tolerant design is key.. • Everything fails all the time

• Planning for failures is not easy

• How do you ensure your recovery strategies work correctly?

Page 41: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

Byzantine General Problem

Page 42: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

A simple 2-way replication system of a traditional database…

Primary Standby

Writes

Page 43: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79 @ksshams @swami_79

P S

S is dead, need to trigger new

replica

P is dead, need to promote myself

P’

Page 44: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79 @ksshams @swami_79

Improved Replication: Quorum

Replica

Replica

Writes Replica

Quorum: Successful write on a majority

Page 45: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

Not so easy..

Replica B

Replica C

Writes from

client A Replica A

Replica D

New member in the

group

Should I continue to serve reads?

Should I start a new quorum?

Replica E Replica F

Reads and

Writes from

client B

Classic Split Brain Issue in Replicated systems leading to lost writes!

Page 46: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

Building correct distributed systems is not straight forward..

• How do you handle replica failures?

• How do you ensure there is not a parallel

quorum?

• How do you handle partial failures of replicas?

• How do you handle concurrent failures?

Page 47: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

correctness is hard, but necessary

Page 48: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

Formal Methods

Page 49: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

Formal Methods

to minimize bugs, we must have a precise description of the design

Page 50: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

Formal Methods

code is too detailed

how would you express partial failures or concurrency?

design documents and diagrams are vague & imprecise

Page 51: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

Formal Methods

law of large numbers is your friend,

so design for scale

until you hit large numbers

Page 52: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

TLA+ to the rescue?

Page 53: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

PlusCal

Page 54: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

formal methods are necessary

but not sufficient..

Page 55: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

customer

Page 56: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

forget to test - no, serious don’t ly

Page 57: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

embrace failure and don’t be surprised

simulate failures at unit

test level

fault injection testing

datacenter testing

network brown out testing

scale testing

Page 58: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

testing is a lifelong journey

Page 59: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

testing is necessary but not sufficient..

Page 60: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

Architect

Customer

Page 61: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

release cycle

gamma simulate real

world

one box does it work?

phased deployment

treading lightly

monitor does it still

work?

Page 62: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

Monitor customer behavior

Page 63: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

measuring customer experience is key

don’t be satisfied by average - look at

99 percentile

Page 64: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

understand the scaling dimensions

Page 65: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

understand how your service will be abused

Page 66: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

let’s see these rules in action through a true story

Page 67: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

we were building distributed systems all over

amazon.com

Page 68: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

we needed a uniform and correct way to do

consensus..

Page 69: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

so we built a paxos lock library service

Page 70: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

such a service is so much more useful than just leader election.. it became a distributed

state store

Page 71: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

such a service is so much more useful than just leader election.. or a distributed state

store

wait wait.. you’re telling me if I poll,

I can detect node failure?

Page 72: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

we acted quickly - and scaled up our entire fleet

with more nodes

doh!!!!

we slowed consensus...

Page 73: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

understand the scaling dimensions

& scale them independently...

Page 74: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

State Store

a lock service has 3 components..

Page 75: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

State Store

they must be scaled independently..

Page 76: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

State Store

they must be scaled independently..

Page 77: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

State Store

they must be scaled independently..

Page 78: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

Let’s Go Over The demo from this morning

Page 79: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

stream ingestion

Page 80: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

stream ingestion

Page 81: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

stream ingestion

Page 82: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

Real-time tweet analytics using DynamoDB

• Stream from Kinesis to DynamoDB

• What data do want in real-time? • (per-second, top words)

• How does DynamoDB help? • Atomic counters (per-word counts in that second)

• Indexed queries (top N word-counts in that second

Page 83: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

Time Word Count

2013-10-13T12:00 Earth 9

2013-10-13T12:00 Mars 10

2013-10-13T12:00 Pluto 5

2013-10-13T12:03 Earth 8

Time Count Word

2013-10-13T12:00 5 Pluto

2013-10-13T12:00 9 Earth

2013-10-13T12:00 10 Mars

2013-10-13T12:03 8 Earth

WordCount Table Local Secondary Index

Page 84: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

DynamoDB cost: $0.25 / hr

Page 85: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

stream ingestion

Page 86: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

Aggregate queries using Redshift

• Simple Redshift connector (buffer files, store in s3, call copy command)

• Manifest copy connector • 2 streams

• transaction table for deduplication

• manifest copy

Page 87: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

Right tool for right job…

• Canal -> DynamoDB -> Redshift -> Glacier…

Page 88: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

You are not done yet..

• Listen to customer feedback

• Iterate..

Page 89: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

Example: DynamoDB

• Start with immediate needs of reliable, super scalable, low latency datastore

• Iterate • Developers wanted flexible query: Local Secondary Indexes

• Developers wanted parallel loads: Parallel Scans

• Mobile developers wanted direct access to their datastore: Fine-grained Access Control

• Mobile developers wanted geo-awareness: Geospatial library

• Developers wanted DynamoDB on their laptop: DynamoDB Local

• Developers wanted richer query: Global Secondary Indexes

• We will continue to innovate..

Page 90: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79 @ksshams @swami_79

Sacred Tenets in Distributed Systems

plan for success – plan for scalability

don’t compromise durability for performance

plan for failures - fault -tolerance is key consistent performance

is important release - think of blast radius

insist on correctness

Page 91: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

understand scaling

dimensions

observe how service is

used

scalability

over features

strive for

correctness

relentlessly test

monitor like a hawk

Page 92: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

Please give us your feedback on this

presentation

Don’t miss SPOT 201!!!

SPOT 401

Page 93: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79


Top Related