Data Distribution Theory


DESCRIPTION

This covers some key concepts and techniques for distributing data across many nodes, cutting across products ranging from caches to databases. CAVEAT: If you haven't seen me present this in person, slides 7 and 12 won't make much sense. Will be uploading a video version before long.

TRANSCRIPT

Page 1: Data Distribution Theory


Data Distribution Theory

Will LaForest, Senior Director at 10gen
[email protected] | @WLaForest

Open source, high performance database

Page 2: Data Distribution Theory


• I will be talking about distributing data across machines connected by a reliable network

• I will NOT be talking about on-disk arrangements (well, maybe a little)

• I will NOT be talking about replication
  – This has some overlap but in most respects can be considered orthogonal

• There is a ton of implementation minutiae from technology to technology that I will try to avoid

What am I talking about?

Page 3: Data Distribution Theory


• Need to scale to handle more
• Cost effective to scale horizontally by distributing
• Fundamentally limited by some resource
  – Memory
  – IO capacity
  – Disk
  – CPU

• Lots of systems need to distribute
  – Web servers/app servers
  – File systems
  – Databases
  – Caches

Why Distribute?

Page 4: Data Distribution Theory


• It's always been this way
  – From time to time people forget
    • Stateful MTS objects
    • Stateful EJBs

• Concurrent access to data is not as simple
  – We will set aside fencing/locking for another day

• RDBMSs were not built for distributed computing
  – Not surprising, since the theory is from 40 years ago
  – The model works because joins are fast
  – BUT generically efficient distributed joins are difficult
  – Ditto for distributed transactions

Stateless stuff is easy to distribute

Page 5: Data Distribution Theory


• Given a data record, what node do I store it on?

• Round robin / "random"
  – Evenly distributes across a set of servers
  – Doesn't take rebalancing into account
  – Expiring a lot of data? Not too bad (MVCC, expiring cache)
  – MarkLogic

• Hash based (not talking the pipe)
  – Many search engines & caches
  – Amazon Dynamo (Cassandra*)

• Range based
  – BigTable (HBase)
  – MongoDB

Ways to distribute

Page 6: Data Distribution Theory


• Distribute on the hash of some attribute
• A simple way is hash(att) mod N
  – What happens when N changes (we add a node)? See the sketch below.
• The industry standard is consistent hashing

• Pros
  – Evenly distributes across nodes
  – Avoids hot spots
  – Great for high write throughput

• Cons
  – No data locality
  – Scattered reads on each node
  – Scatter-gather on all queries

Hash Based
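A minimal sketch of that rebalancing question, assuming MD5 as the hash and illustrative node/key names: with hash(att) mod N, going from 3 to 4 nodes remaps roughly three quarters of all keys, while a consistent-hash ring only moves the keys that fall between the new node and its predecessor.

```python
import hashlib
from bisect import bisect

def h(s):
    """Stable integer hash with an even distribution (MD5 here)."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class HashRing:
    """Bare-bones consistent hash ring: each node sits at hash(name)."""
    def __init__(self, nodes):
        self.points = sorted((h(n), n) for n in nodes)

    def node_for(self, key):
        # First node clockwise from hash(key), wrapping past the top of the ring.
        i = bisect(self.points, (h(key),)) % len(self.points)
        return self.points[i][1]

keys = [f"user{i}" for i in range(10_000)]

# hash(att) mod N: adding a fourth node remaps ~75% of the keys.
moved_mod = sum(h(k) % 3 != h(k) % 4 for k in keys)

# Consistent hashing: only ~1/N of the keys change owners.
ring3 = HashRing(["node1", "node2", "node3"])
ring4 = HashRing(["node1", "node2", "node3", "node4"])
moved_ring = sum(ring3.node_for(k) != ring4.node_for(k) for k in keys)

print(f"mod-N moved {moved_mod} of 10000 keys; the ring moved {moved_ring}")
```

With a single point per node the ring's balance can be poor; the next two slides address that with multiple points per node.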

Page 7: Data Distribution Theory


Hash Ring

[Figure: a hash ring spanning 0 to 2^128, with circled numbers 1–4 marking nodes and letters A–E marking data points]

• Circles represent nodes, placed at hash(hostname)
• Letters represent data points, placed at hash(attribute)

What's wrong with this specific ring?

Page 8: Data Distribution Theory


Getting a good distribution

• Use a hash algorithm with an even distribution (like MD5 or SHA-1)

• Use multiple points or replicas on the hash ring
  – Instead of just hash("Host1")
  – hash("Host1-1") .. hash("Host1-r")

• Running simulations, you get a plot that looks like this (see Tom White reference, and the sketch below)
  – Based on 10 nodes and 10k points

[Plot: standard deviation of keys per node (y axis: 5–100) vs. points per node (x axis: 1, 5, 20, 100, 500); the deviation falls as the number of points grows]
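Here is a sketch of that simulation in the spirit of Tom White's post, using the slide's parameters (10 nodes, 10,000 data points; the helper names are mine): the spread of keys per node shrinks as each node gets more points on the ring.

```python
import hashlib
import statistics
from bisect import bisect
from collections import Counter

def h(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def load_stdev(num_nodes=10, num_keys=10_000, replicas=1):
    """Standard deviation of keys per node with `replicas` ring points per node."""
    points = sorted(
        (h(f"node{n}-{r}"), f"node{n}")
        for n in range(num_nodes)
        for r in range(replicas)
    )
    counts = Counter()
    for i in range(num_keys):
        # Assign each key to the first ring point clockwise from its hash.
        j = bisect(points, (h(f"key{i}"),)) % len(points)
        counts[points[j][1]] += 1
    return statistics.pstdev(counts[f"node{n}"] for n in range(num_nodes))

for r in (1, 5, 20, 100, 500):  # the x-axis values from the slide's plot
    print(f"{r:>3} points per node: stdev ~ {load_stdev(replicas=r):.1f}")
```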

Page 9: Data Distribution Theory


Hash Ring

[Figure: hash ring with each node's points interleaved around the circle, labeled node-replica: 1-1 … 1-R, 2-1 … 2-R, 3-1 … 3-R]

• We have R replicas for each node
• The hash ring could also be used to determine replicas, by using the same strategy with the data
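One way to read that last bullet (my own sketch, not any specific product's scheme): store each record on the first R distinct nodes found walking clockwise from its hash, so the ring drives replica placement as well as primary placement.

```python
import hashlib
from bisect import bisect

def h(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def replica_nodes(points, key, r=2):
    """First r distinct nodes clockwise from hash(key) on a sorted ring."""
    start = bisect(points, (h(key),))
    owners = []
    for i in range(len(points)):
        node = points[(start + i) % len(points)][1]
        if node not in owners:
            owners.append(node)
            if len(owners) == r:
                break
    return owners

# A ring with 5 virtual points per node, as on the previous slides.
points = sorted(
    (h(f"node{n}-{v}"), f"node{n}") for n in (1, 2, 3) for v in range(5)
)
print(replica_nodes(points, "some-record"))  # two distinct nodes holding copies
```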

Page 10: Data Distribution Theory


• Also known as sharding

• Distribute based upon an attribute (the key)
  – Or multiple keys (compound)

• Pros
  – Better for reads
  – Data locality, so…
  – Queries/reads with shard-attribute terms avoid scatter
  – Data can be arranged in contiguous blocks
  – Range queries on the key are possible (unlike with hash-based indexing)

Range based

Page 11: Data Distribution Theory


• Cons
  – Requires more consideration a priori
  – Pick the right shard key
  – Can develop hot spots
  – Leads to more data balancing activities

• Chunking can be done on many levels
  – BigTable breaks into tablets
  – MongoDB uses "chunks"

Range based, continued

Page 12: Data Distribution Theory


Range Based Partitioning

[Figure: the key space from -∞ to ∞, split at Abrams, Isaac, LaForest, Meyer, and Scheichi, with each range mapped to a node]

• Pick a key (or keys) to partition on
• Map the key space to the nodes
• Range-to-node mappings are adjusted to keep data as evenly distributed as possible

• In this example we are partitioning by last name (see the lookup sketch below)

• What happens if we partition by hash(attribute)?
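A minimal sketch of that range-to-node lookup, using the slide's split points (the chunk-to-node assignment is illustrative): keep the boundaries sorted and binary-search each key into its range.

```python
from bisect import bisect_right

# Split points from the slide; chunk i covers keys in [splits[i-1], splits[i]),
# with -infinity and +infinity at the ends (6 chunks for 5 split points).
splits = ["Abrams", "Isaac", "LaForest", "Meyer", "Scheichi"]

# Illustrative assignment of chunks to nodes; a real system rebalances this map.
chunk_to_node = {0: "node1", 1: "node1", 2: "node2",
                 3: "node2", 4: "node3", 5: "node3"}

def node_for(last_name):
    """Binary-search the key into its chunk, then look up the owning node."""
    return chunk_to_node[bisect_right(splits, last_name)]

print(node_for("Knuth"))      # falls in [Isaac, LaForest) -> node2 here
print(node_for("Zimmerman"))  # above Scheichi -> last chunk -> node3 here
```

As for the slide's last question: partitioning by hash(attribute) would use the same mechanism, but it destroys the key ordering, so range queries on the key fall back to scatter-gather.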

Page 13: Data Distribution Theory


• Use a key with high cardinality
  – Sufficient granularity for your "chunks"

• What are your write vs. read requirements?

• Read and query king?
  1. The shard key should be something most of your queries use
  2. Also something that distributes reads evenly (avoiding read hotspots)
  3. Read scaling can sometimes be accommodated by replication

• Write throughput the biggest concern?
  1. You may want to consider partitioning on a hash
  2. Avoid hot spots
  3. What happens if you shard on a systematically increasing key? (See the sketch below.)

Sharding Considerations
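A quick illustration of that last question (a sketch; the shard count and split points are made up): with range partitioning, a monotonically increasing key such as a timestamp or auto-increment id sends every new write to the last chunk, while hashing the key spreads the same writes evenly.

```python
import hashlib
from collections import Counter

def h(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

NUM_SHARDS = 3
writes = range(100_000, 101_000)   # new keys from an ever-increasing counter

# Range based on the raw key: split points were chosen from older, smaller
# keys, so every new key is larger than both and lands on the last shard.
boundaries = [40_000, 80_000]      # hypothetical splits from historical data
range_hits = Counter(sum(k >= b for b in boundaries) for k in writes)

# Hash based: the same writes spread evenly, at the cost of data locality.
hash_hits = Counter(h(str(k)) % NUM_SHARDS for k in writes)

print("range:", dict(range_hits))  # all 1000 writes hit the last shard
print("hash: ", dict(hash_hits))   # roughly 333 writes per shard
```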

Page 14: Data Distribution Theory


• Karger et al., "Consistent Hashing and Random Trees" (1997)
  – One of the original papers on consistent hashing

• Tom White, "Consistent Hashing"
  – Great blog post on consistent hashing

Citations