dstore: recovery-friendly, self-managing clustered hash table andy huang and armando fox stanford...

DStore: Recovery-friendly, self-managing DStore: Recovery-friendly, self-managing clustered hash tableclustered hash table

Andy Huang and Armando FoxAndy Huang and Armando FoxStanford UniversityStanford University

© 2003 Andy Huang

OutlineOutline

Proposal Proposal Why? – The goalWhy? – The goal

What? – The class of state we focus onWhat? – The class of state we focus on

How? – Technique for achieving the goalHow? – Technique for achieving the goal

Quorum algorithm and recovery resultsQuorum algorithm and recovery results

Repartitioning algorithm and availability resultsRepartitioning algorithm and availability results

ConclusionConclusion

© 2003 Andy Huang

Why? What? and How?Why? What? and How?

© 2003 Andy Huang

Why? Simplify state-managementWhy? Simplify state-management

SIMPLESIMPLE COMPLEXCOMPLEX

ConfiguratioConfigurationn

““plug-and-play”plug-and-play” repartitionrepartition

RecoveryRecovery simple & non-simple & non-intrusiveintrusive

unavailability unavailability ~minutes~minutes

FrontendsApp Servers DB/FS

LAN

LAN

© 2003 Andy Huang

User preferences: User preferences: Explicit: name, address, etc.Explicit: name, address, etc.

Implicit: usage statistics (Amazon’s “items viewed”)Implicit: usage statistics (Amazon’s “items viewed”)

Collaborative workflow data:Collaborative workflow data: Examples: insurance claims, human resources filesExamples: insurance claims, human resources files

What? Non-transactional dataWhat? Non-transactional data

read-read-mostlymostly

(catalogs)(catalogs)

non-transactional r/wnon-transactional r/w(user prefs, workflow data)(user prefs, workflow data)

transactionaltransactional(billing)(billing)

© 2003 Andy Huang

Hypothesis: A state store designed for non-Hypothesis: A state store designed for non-transactional data can be decoupled so that it transactional data can be decoupled so that it can be managed like a stateless systemcan be managed like a stateless system

Technique 1: Expose a hash table APITechnique 1: Expose a hash table API Repartitioning scheme is simple (no complex data Repartitioning scheme is simple (no complex data

dependencies)dependencies)

Technique 2: Use quorums (read/write Technique 2: Use quorums (read/write ≥ ≥ majority)majority) Recovery is simple (no special case recovery Recovery is simple (no special case recovery

mechanism)mechanism)

Recovery is non-intrusive (data available throughout)Recovery is non-intrusive (data available throughout)

How? Decouple using hash table and How? Decouple using hash table and quorumsquorums

© 2003 Andy Huang

Brick: stores dataBrick: stores data

Dlib: exposes hash Dlib: exposes hash table API to app table API to app server and executes server and executes quorum-based quorum-based reads/writes on bricks reads/writes on bricks

Replica groups: Replica groups: bricks storing the bricks storing the same portion of the same portion of the key space are in the key space are in the same same replica groupreplica group

Architecture overviewArchitecture overview

Dli

b

App Servers

LAND

lib

Bricks

© 2003 Andy Huang

Quorum algorithmQuorum algorithm

© 2003 Andy Huang

Algorithm: Wavering readsAlgorithm: Wavering reads

No two-phase commit No two-phase commit (complicates recovery (complicates recovery and introduces coupling)and introduces coupling)

CC11 attempts to write, but attempts to write, but

fails before completionfails before completion

Quorum property Quorum property violated: reading a violated: reading a majority doesn’t majority doesn’t guarantee latest value is guarantee latest value is returnedreturned

Result: wavering readsResult: wavering readsR1 R2 R3

x = 0

C1 C2

0

read

1

read

0

read

1

write(1)

© 2003 Andy Huang

Algorithm: Read writebackAlgorithm: Read writeback

Idea: commit partial Idea: commit partial write when it is first readwrite when it is first read

Commit pointCommit point Before x=0Before x=0

After x=1After x=1

Proven linearizable under Proven linearizable under fail-stop modelfail-stop model

C1 R1 R2 R3

x = 0

C2

0

read

read

11

1

write(1)

© 2003 Andy Huang

Algorithm: Crash recoveryAlgorithm: Crash recovery

Fail-stop not an accurate Fail-stop not an accurate model: implies client that model: implies client that generated the request generated the request fails permanentlyfails permanently

With writeback, commit With writeback, commit point occurs sometime in point occurs sometime in the futurethe future

A writer expects request A writer expects request to succeed or fail, not be to succeed or fail, not be “in-progress”“in-progress”

read0

read

R1 R2 R3

x = 0

C1 C2

1

write(1)

write

11

© 2003 Andy Huang

Algorithm: Write in-progressAlgorithm: Write in-progress

Requirement: write must Requirement: write must be committed/aborted on be committed/aborted on the next readthe next read

Record “write in-Record “write in-progress” on clientprogress” on client On submit: write “start” On submit: write “start”

cookiecookie

On return: write “end” On return: write “end” cookiecookie

On read: if “start” cookie On read: if “start” cookie has no matching “end,” has no matching “end,” read allread all

R1 R2 R3

x = 0

C1 C2

read

1

write

11

1

write(1)

© 2003 Andy Huang

Algorithm: The common caseAlgorithm: The common case

Write all, wait for a majorityWrite all, wait for a majority Normally, all replicas perform the writeNormally, all replicas perform the write

Read majorityRead majority Normally, replicas return non-conflicting valuesNormally, replicas return non-conflicting values

Writeback performed when a brick fails or when it Writeback performed when a brick fails or when it is temporarily overloaded and missed some writesis temporarily overloaded and missed some writes

Read all performed when an app server failsRead all performed when an app server fails

© 2003 Andy Huang

Results: Simple, non-intrusive recoveryResults: Simple, non-intrusive recovery

Normal operationNormal operation: : majority must complete majority must complete writewrite

FailureFailure: if fewer than a : if fewer than a majority fail, writes can majority fail, writes can succeedsucceed

RecoveryRecovery: equivalent to : equivalent to missing a few writes missing a few writes under normal operationunder normal operation SimpleSimple: no special code: no special code

Non-intrusiveNon-intrusive: availability : availability throughoutthroughout

© 2003 Andy Huang

Benchmark: Simple, non-intrusive recoveryBenchmark: Simple, non-intrusive recovery

read

0 K

5 K

10 K

15 K

Th

rou

gh

pu

t (r

eq/s

ec)

write

0 K

1 K

2 K

3 K

0 30 60 90 120 150 180

Time (sec)

Thro

ughp

ut (r

eq/s

ec)

Benchmark:Benchmark: t=60 sect=60 sec

one brick killedone brick killed

t=120 sect=120 secbrick restartedbrick restarted

Summary:Summary: Data available Data available

during failure and during failure and recoveryrecovery

Recovering brick Recovering brick restores restores throughput in throughput in secondsseconds

© 2003 Andy Huang

Algorithm: Online repartitioningAlgorithm: Online repartitioning

Split replica group ID Split replica group ID (rgid), but announce (rgid), but announce bothboth

Take a brick offline Take a brick offline (looks just like a failure)(looks just like a failure)

Copy data to new brickCopy data to new brick

Change rgid and bring Change rgid and bring both bricks onlineboth bricks online

00 10

00 10

00 10

00 10

00 10

00 10

00 10

00 10

00 10

10

00 10

00 10

00 10

© 2003 Andy Huang

Benchmark: Online repartitioningBenchmark: Online repartitioning

Benchmark:Benchmark: t=120 sect=120 sec

group 0 repartitionedgroup 0 repartitioned

t=240 sect=240 secgroup 1 repartitionedgroup 1 repartitioned

Non-intrusive:Non-intrusive: Data available during Data available during

entire processentire process

Appears as if brick just Appears as if brick just failed and recovered failed and recovered (but there are now (but there are now more bricks)more bricks)

0 K

10 K

20 K

30 K

40 K

0 60 120 180 240 300 360

Time (sec)

Th

rou

gh

pu

t (r

eq/s

ec)

© 2003 Andy Huang

ConclusionConclusion

Goal: Simplify management for non-transactional Goal: Simplify management for non-transactional datadata

Techniques: Expose hash table API and use quorumsTechniques: Expose hash table API and use quorums

Results:Results: Recovery is simple and non-intrusiveRecovery is simple and non-intrusive

Repartitioning can be done fully onlineRepartitioning can be done fully online

Next steps:Next steps: True “plug-and-play” – automatically repartition when bricks True “plug-and-play” – automatically repartition when bricks

are added/removed (simplified by hash table partitioning are added/removed (simplified by hash table partitioning scheme)scheme)

Questions: [email protected]: [email protected]

dstore: recovery-friendly, self-managing clustered hash table andy huang and armando fox stanford...

Documents