distributing your data

52
HELLO DEVDAY PDX 2010

Upload: justin-marney

Post on 12-May-2015

1.093 views

Category:

Technology


0 download

DESCRIPTION

A presentation on a few of the fundamental concepts behind distributed data stores. Presented at Devnation Portland 2010 http://devnation.us/events/10.

TRANSCRIPT

Page 1: Distributing Your Data

HELLO DEVDAY

PDX 2010

Page 2: Distributing Your Data

Justin Marney

HELLO DEVDAY

PDX 2010

Page 3: Distributing Your Data

Justin Marney

HELLO DEVDAY

Viget Labs

PDX 2010

Page 4: Distributing Your Data

Justin Marney

HELLO DEVDAY

Viget Labs

GMU BCS ‘06

PDX 2010

Page 5: Distributing Your Data

Justin Marney

Viget Labs

GMU BCS ‘06

Ruby ‘07

PDX 2010

Page 6: Distributing Your Data

Viget ‘08

Viget Labs

GMU BCS ‘06

Ruby ‘07

PDX 2010

Page 7: Distributing Your Data

PDX 2010

Page 8: Distributing Your Data

PDX 2010

Page 9: Distributing Your Data

DISTRIBUTING YOUR DATA

PDX 2010

Page 10: Distributing Your Data

DISTRIBUTING YOUR DATA

WHY?

PDX 2010

Page 11: Distributing Your Data

DISTRIBUTING YOUR DATA

WHY?

web applications are judged by their level of availability

PDX 2010

Page 12: Distributing Your Data

web applications are judged by their level of availability

WHY?

ability to continue operating during failure scenarios

DISTRIBUTING YOUR DATA

PDX 2010

Page 13: Distributing Your Data

web applications are judged by their level of availability

WHY?

ability to continue operating during failure scenarios

ability to manage availability during failure scenarios

PDX 2010

Page 14: Distributing Your Data

web applications are judged by their level of availability

WHY?

ability to continue operating during failure scenarios

ability to manage availability during node failure

increase throughput

PDX 2010

Page 15: Distributing Your Data

web applications are judged by their level of availability

ability to continue operating during failure scenarios

ability to manage availability during node failure

increase throughput

increase durability

PDX 2010

Page 16: Distributing Your Data

ability to continue operating during failure scenarios

ability to manage availability during node failure

increase throughput

increase durability

increase scalability

PDX 2010

Page 17: Distributing Your Data

SCALABILITY"I can add twice as much X and get twice as much Y."

X = processor, RAM, disks, servers, bandwidth

Y = throughput, storage space, uptime

PDX 2010

Page 18: Distributing Your Data

SCALABILITYscalability is a ratio.

2:2 = linear scalability ratio

scalability ratio allows you to predict how much it will cost you to grow.

PDX 2010

Page 19: Distributing Your Data

SCALABILITYUP/DOWN/VERTICAL/HORIZONTAL/L/R/L/R/A/B/START

PDX 2010

Page 20: Distributing Your Data

SCALABILITYUPgrow your infrastructuremultiple data centershigher bandwidthfaster machines

PDX 2010

Page 21: Distributing Your Data

SCALABILITYDOWNshrink your infrastructuremobileset-toplaptop

PDX 2010

Page 22: Distributing Your Data

SCALABILITYVERTICALadd to a single nodeCPURAMRAID

PDX 2010

Page 23: Distributing Your Data

SCALABILITYHORIZONTALadd more nodesdistribute the loadcommodity costlimited only by capital

PDX 2010

Page 24: Distributing Your Data

@gary_hustwit: Dear Twitter: when a World Cup match is at the 90th minute, you might want to turn on a few more servers.

PDX 2010

Page 25: Distributing Your Data

ASYNCHRONOUSA distributed transaction is bound by availability of all nodes.

PDX 2010

Page 26: Distributing Your Data

ASYNCHRONOUSA distributed transaction is bound by availability of all nodes.

(.99^1) = .99

(.99^2) = .98

(.99^3) = .97

PDX 2010

Page 27: Distributing Your Data

ASYNCHRONOUSAsynchronous systems operate without the concept of global state.

The concurrency model more accurately reflects the real world.

PDX 2010

Page 28: Distributing Your Data

ASYNCHRONOUSAsynchronous systems operate without the concept of global state.

The concurrency model more accurately reflects the real world.

What about my ACID!?

PDX 2010

Page 29: Distributing Your Data

Consistent

ACIDAtomic

Isolated

Durable

Series of database operations either all occur, or nothing occurs.

Transaction does not violate any integrity constraints during execution.

Cannot access data that is modified during an incomplete transaction.

Transactions that have committed will survive permanently.

PDX 2010

Page 30: Distributing Your Data

ACIDDefines a set of characteristics that aim to ensure consistency.

What happens when we realize that in order scale we need to distribute our data and handle asynchronous operations?

PDX 2010

Page 31: Distributing Your Data

ACIDWithout global state, no Atomicity.

Without a linear timeline, no transactions and no Isolation.

The canonical location of data might not exist, therefore no D.

PDX 2010

Page 32: Distributing Your Data

Without A, I, or D, Consistency in terms of entity integrity is no longer guaranteed.

PDX 2010

Page 33: Distributing Your Data

CAP TheoremEric Brewer @ 2000 Principles of Distributed Computing (PODC).

Seth Gilbert and Nancy Lynch published a formal proof in 2002.

PDX 2010

Page 34: Distributing Your Data

CAP AcronymConsistency: Multiple values for the same piece of data are not allowed.

Availability: If a non-failing node can be reached the system functions.

Partition-Tolerance: Regardless of packet loss, if a non-failing node is reached the system functions.

PDX 2010

Page 35: Distributing Your Data

CAP Theorem

Consistency, Availability, Partition-Tolerance: Choose One...

PDX 2010

Page 36: Distributing Your Data

Single node systems bound by CAP.

CAP Theorem

100% Partition-tolerant100% ConsistentNo Availability Guarantee

PDX 2010

Page 37: Distributing Your Data

Multi-node systems bound by CAP.

AP : Dynamo, no ACID

CP : Quorum, distributed databases

CA : DT, 2PC, ACID

CAP Theorem

PDX 2010

Page 38: Distributing Your Data

CAP doesn't say AP systems are the solution to your problem.

CAP Theorem

Most systems are a hybrid of CA, CP, & AP.

Not an absolute decision.

PDX 2010

Page 39: Distributing Your Data

Understand the trade-offs and use that understanding to build a system that fails predictably.

CAP Theorem

Enables you to build a system that degrades gracefully during a failure.

PDX 2010

Page 40: Distributing Your Data

BASEDan Pritchett

Associate for Computing Machinery Queue, 2008

BASE: An ACID Alternative

PDX 2010

Page 41: Distributing Your Data

Basically AvailableSoft StateEventually Consistent

BASE: An ACID Alternative

BASE

PDX 2010

Page 42: Distributing Your Data

Basically AvailableSoft StateEventually Consistent

BASE: An ACID Alternative

BASE

PDX 2010

Page 43: Distributing Your Data

Eventually ConsistentRename to Managed Consistency.

Does not mean probable or hopeful or indefinite time in the future.

Describes what happens during a failure.

PDX 2010

Page 44: Distributing Your Data

Eventually ConsistentDuring certain scenarios a decision must be made to either return inconsistent data or deny a request.

EC allows you control the level of consistency vs. availability in your application.

PDX 2010

Page 45: Distributing Your Data

Eventually ConsistentIn order to achieve availability in an asynchronous system, accept that failures are going to happen.

Understand failure points and know what you are willing to give up in order to achieve availability.

PDX 2010

Page 46: Distributing Your Data

How can we model the operations we perform on our data to be asynchronous & EC?

PDX 2010

Page 47: Distributing Your Data

Model system as a network of independent components.

Partition components along functional boundaries.

Don't interact with your data as one big global state.

PDX 2010

Page 48: Distributing Your Data

This doesn't meant every part of your system must operate this way!

Use ACID 2.0 to help identify and architect components than can.

PDX 2010

Page 49: Distributing Your Data

ACID 2.0

Commutative

Associative

Idempotent

Distributed

Order of operations does not change the result.

Operations can be aggregated in any order.

Operation can be applied multiple times without changing the result.

Operations are distributed and processed asynchronously.

PDX 2010

Page 50: Distributing Your Data

OPS BROSIncremental scalability

Homogeneous node responsibilities

Heterogeneous node capabilities

PDX 2010