DStore: Recovery-friendly, self-managing clustered hash table
Andy Huang and Armando Fox
Stanford University
© 2003 Andy Huang
Outline
Proposal
  Why? The goal
  What? The class of state we focus on
  How? Technique for achieving the goal
Quorum algorithm and recovery results
Repartitioning algorithm and availability results
Conclusion

Why? What? and How?

Why? Simplify state-management
                 SIMPLE                    COMPLEX
Configuration    “plug-and-play”           repartition
Recovery         simple & non-intrusive    unavailability ~minutes
[Figure: Frontends, App Servers, and DB/FS tiers connected by LANs]

What? Non-transactional data
User preferences:
  Explicit: name, address, etc.
  Implicit: usage statistics (Amazon’s “items viewed”)
Collaborative workflow data:
  Examples: insurance claims, human resources files
[Figure: classes of state: read-mostly (catalogs), non-transactional r/w (user prefs, workflow data), transactional (billing)]

How? Decouple using hash table and quorums
Hypothesis: A state store designed for non-transactional data can be decoupled so that it can be managed like a stateless system
Technique 1: Expose a hash table API
  Repartitioning scheme is simple (no complex data dependencies)
Technique 2: Use quorums (read/write ≥ majority)
  Recovery is simple (no special case recovery mechanism)
  Recovery is non-intrusive (data available throughout)

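The majority requirement in Technique 2 rests on a counting argument: any two majorities of the same set of replicas share at least one replica. A minimal Python sketch of that argument follows; the function names and the replica counts tried are illustrative assumptions, not part of the talk.

```python
from itertools import combinations

def majority(n: int) -> int:
    """Smallest number of replicas that forms a majority of n."""
    return n // 2 + 1

def majorities_always_intersect(n: int) -> bool:
    """Brute-force check that every pair of majority-sized sets overlaps."""
    q = majority(n)
    replicas = range(n)
    return all(set(a) & set(b)
               for a in combinations(replicas, q)
               for b in combinations(replicas, q))

for n in (3, 4, 5):
    print(n, majority(n), majorities_always_intersect(n))
```

Because a read majority always overlaps the majority that completed a write, at least one contacted replica holds the latest completed value.
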
Architecture overview
Brick: stores data
Dlib: exposes hash table API to app server and executes quorum-based reads/writes on bricks
Replica groups: bricks storing the same portion of the key space are in the same replica group
[Figure: App Servers, each with a Dlib, connected over a LAN to Bricks]

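One way a Dlib could route a key to its replica group is sketched below. The slides do not say how keys map to groups, so everything here is an assumption for illustration: the SHA-1 hash, the 16-bit width, matching a replica group ID against the low-order bits of the hash, and the brick names.

```python
import hashlib

def key_bits(key: str, width: int = 16) -> str:
    """Hash the key and return its low-order `width` bits as a bit string."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    value = int.from_bytes(digest, "big") % (1 << width)
    return format(value, f"0{width}b")

def replica_group_for(key: str, rgids: list[str]) -> str:
    """Return the one rgid that matches the low-order bits of the key's hash."""
    bits = key_bits(key)
    matches = [rgid for rgid in rgids if bits.endswith(rgid)]
    if len(matches) != 1:
        raise ValueError(f"rgids must partition the key space, got {matches}")
    return matches[0]

# Hypothetical layout: two replica groups with three bricks each.
GROUPS = {"0": ["brick-a", "brick-b", "brick-c"],
          "1": ["brick-d", "brick-e", "brick-f"]}

for key in ("user:42", "claim:7"):
    rgid = replica_group_for(key, list(GROUPS))
    print(key, "->", rgid, GROUPS[rgid])
```

Since the mapping depends only on the key, no lookup table of individual keys has to be kept consistent across app servers.
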
Quorum algorithm

Algorithm: Wavering reads
No two-phase commit (complicates recovery and introduces coupling)
C1 attempts to write, but fails before completion
Quorum property violated: reading a majority doesn’t guarantee latest value is returned
Result: wavering reads
[Figure: clients C1 and C2 and replicas R1, R2, R3, initially x = 0; C1 issues write(1) and C2's successive reads return 0, 1, 0]

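The wavering-read scenario can be reproduced with a few lines of Python. This is a toy simulation, not the DStore implementation: the (timestamp, value) pairs and the particular majorities C2 happens to contact are assumptions chosen to show the effect.

```python
# Each replica holds a (timestamp, value) pair for x. Timestamps are an
# assumption used here to pick the newest value among the replies.
replicas = {"R1": (0, 0), "R2": (0, 0), "R3": (0, 0)}

def read_majority(contacted):
    """Return the newest value among the contacted replicas (no writeback)."""
    assert len(contacted) >= 2, "need a majority of the three replicas"
    return max(replicas[name] for name in contacted)[1]

# C1 starts write(1) but fails after updating only R1.
replicas["R1"] = (1, 1)

# C2 reads three times; each read happens to reach a different majority.
observed = [read_majority(["R2", "R3"]),
            read_majority(["R1", "R2"]),
            read_majority(["R2", "R3"])]
print(observed)  # prints [0, 1, 0]
```

The value of x appears to move forward and then back, even though every read contacted a majority.
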
Algorithm: Read writeback
Idea: commit partial write when it is first read
Commit point
  Before: x=0
  After: x=1
Proven linearizable under fail-stop model
[Figure: clients C1 and C2 and replicas R1, R2, R3, initially x = 0; C1 issues write(1) and C2 issues two reads, the first of which writes 1 back]

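A read with writeback might look like the following sketch. The `Brick` class, the integer timestamps, and the choice to write back to the same bricks that were read are assumptions made for illustration; the slides only state that a partial write is committed when it is first read.

```python
class Brick:
    """In-memory stand-in for a brick; stores (timestamp, value) per key."""
    def __init__(self):
        self.data = {}

    def read(self, key):
        return self.data.get(key, (0, None))

    def write(self, key, ts, value):
        # Keep only the newest version so stale writebacks are harmless.
        if ts > self.read(key)[0]:
            self.data[key] = (ts, value)

def read_with_writeback(bricks, contacted, key):
    """Read a majority; if replies conflict, push the newest value to them."""
    assert len(contacted) >= len(bricks) // 2 + 1
    replies = [bricks[i].read(key) for i in contacted]
    ts, value = max(replies, key=lambda r: r[0])
    if any(r[0] != ts for r in replies):
        for i in contacted:  # the writeback commits the partial write
            bricks[i].write(key, ts, value)
    return value

bricks = [Brick() for _ in range(3)]
for b in bricks:
    b.write("x", 1, 0)       # x = 0 everywhere
bricks[0].write("x", 2, 1)   # partial write(1) reaches one brick only

before = read_with_writeback(bricks, [1, 2], "x")  # partial write not yet seen
first = read_with_writeback(bricks, [0, 1], "x")   # sees it and writes it back
after = read_with_writeback(bricks, [1, 2], "x")   # later majorities agree
print(before, first, after)  # prints 0 1 1
```

Once the first read has written the value back to a majority, every later majority read overlaps that majority, so reads no longer waver.
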
Algorithm: Crash recovery
Fail-stop not an accurate model: implies client that generated the request fails permanently
With writeback, commit point occurs sometime in the future
A writer expects request to succeed or fail, not be “in-progress”
[Figure: clients C1 and C2 and replicas R1, R2, R3, initially x = 0; C1 issues write(1) and C2 issues reads]

Algorithm: Write in-progress
Requirement: write must be committed/aborted on the next read
Record “write in-progress” on client
  On submit: write “start” cookie
  On return: write “end” cookie
  On read: if “start” cookie has no matching “end,” read all
[Figure: clients C1 and C2 and replicas R1, R2, R3, initially x = 0; C1 issues write(1), followed by a read]

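The cookie bookkeeping can be sketched as follows. The `ClientCookies` class, the request IDs, and the helper names are hypothetical. The sketch only decides how many replicas the next read must contact; how the read then commits or aborts the write is not shown.

```python
class ClientCookies:
    """Hypothetical per-client record of writes that were started and ended."""
    def __init__(self):
        self.started = set()
        self.ended = set()

    def write_in_progress(self, key):
        return any(k == key for k, _ in self.started - self.ended)

def write(cookies, key, request_id, do_write):
    cookies.started.add((key, request_id))  # on submit: "start" cookie
    do_write()                              # may raise if the app server fails
    cookies.ended.add((key, request_id))    # on return: "end" cookie

def replicas_to_read(cookies, key, n_replicas):
    """Read all replicas after an unfinished write, otherwise a majority."""
    if cookies.write_in_progress(key):
        return n_replicas
    return n_replicas // 2 + 1

def crash():
    raise RuntimeError("app server failed mid-write")

cookies = ClientCookies()
write(cookies, "x", 1, lambda: None)
normal = replicas_to_read(cookies, "x", 3)

try:
    write(cookies, "x", 2, crash)
except RuntimeError:
    pass
after_crash = replicas_to_read(cookies, "x", 3)
print(normal, after_crash)  # prints 2 3
```
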
Algorithm: The common case
Write all, wait for a majority
  Normally, all replicas perform the write
Read majority
  Normally, replicas return non-conflicting values
Writeback performed when a brick fails or when it is temporarily overloaded and missed some writes
Read all performed when an app server fails

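The write path in the common case can be sketched as below. It is a synchronous toy model: a real Dlib would presumably send requests in parallel and return as soon as a majority has acknowledged. The `Brick` class and its `up` flag are assumptions for illustration.

```python
class Brick:
    """In-memory stand-in for a brick that may be down."""
    def __init__(self, up=True):
        self.up = up
        self.data = {}

    def write(self, key, ts, value):
        """Apply the write and acknowledge it; a down brick never acknowledges."""
        if not self.up:
            return False
        self.data[key] = (ts, value)
        return True

def write_all_wait_majority(bricks, key, ts, value):
    """Send the write to every brick; succeed if a majority acknowledges."""
    acks = sum(brick.write(key, ts, value) for brick in bricks)
    return acks >= len(bricks) // 2 + 1

one_down = [Brick(), Brick(), Brick(up=False)]
two_down = [Brick(), Brick(up=False), Brick(up=False)]
print(write_all_wait_majority(one_down, "x", 1, "a"))  # prints True
print(write_all_wait_majority(two_down, "x", 1, "a"))  # prints False
```

With three bricks, the write still succeeds when one brick is down, because two acknowledgements form a majority.
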
Recovery results

Results: Simple, non-intrusive recovery
Normal operation: majority must complete write
Failure: if fewer than a majority fail, writes can succeed
Recovery: equivalent to missing a few writes under normal operation
  Simple: no special code
  Non-intrusive: availability throughout

Benchmark: Simple, non-intrusive recovery
[Figure: read throughput (0 to 15 K req/sec) and write throughput (0 to 3 K req/sec) versus time (0 to 180 sec)]
Benchmark:
  t=60 sec: one brick killed
  t=120 sec: brick restarted
Summary:
  Data available during failure and recovery
  Recovering brick restores throughput in seconds

Repartitioning algorithm & Availability results

Algorithm: Online repartitioning
Split replica group ID (rgid), but announce both
Take a brick offline (looks just like a failure)
Copy data to new brick
Change rgid and bring both bricks online
[Figure: bricks labeled with replica group IDs 00 and 10 at each step of the split]

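The rgid split and data copy can be sketched as follows. The slides show only the IDs 00 and 10, so several details here are assumptions: that an rgid is split by prepending one bit, that it is matched against the low-order bits of a SHA-1 hash of the key, and that each resulting brick ends up holding only its half of the keys.

```python
import hashlib

def hash_bits(key: str, width: int = 16) -> str:
    """Low-order `width` bits of the key's hash, as a bit string."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return format(int.from_bytes(digest, "big") % (1 << width), f"0{width}b")

def split_rgid(rgid: str) -> tuple[str, str]:
    """Split a group by prepending one bit, e.g. '0' -> ('00', '10')."""
    return "0" + rgid, "1" + rgid

def repartition(old_brick: dict, rgid: str):
    """Divide an offline brick's data between the two halves of the split."""
    keep_id, move_id = split_rgid(rgid)
    kept = {k: v for k, v in old_brick.items() if hash_bits(k).endswith(keep_id)}
    moved = {k: v for k, v in old_brick.items() if hash_bits(k).endswith(move_id)}
    return (keep_id, kept), (move_id, moved)

# A brick in group "0" holds only keys whose hash bits end in "0".
old = {f"k{i}": i for i in range(100) if hash_bits(f"k{i}").endswith("0")}
(keep_id, kept), (move_id, moved) = repartition(old, "0")
print(keep_id, len(kept), move_id, len(moved))
```

While the brick is offline for the copy, the rest of its replica group keeps serving both halves, which is why the process looks like an ordinary failure and recovery.
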
Benchmark: Online repartitioning
[Figure: throughput (0 to 40 K req/sec) versus time (0 to 360 sec)]
Benchmark:
  t=120 sec: group 0 repartitioned
  t=240 sec: group 1 repartitioned
Non-intrusive:
  Data available during entire process
  Appears as if brick just failed and recovered (but there are now more bricks)

Conclusion
Goal: Simplify management for non-transactional data
Techniques: Expose hash table API and use quorums
Results:
  Recovery is simple and non-intrusive
  Repartitioning can be done fully online
Next steps:
  True “plug-and-play”: automatically repartition when bricks are added/removed (simplified by hash table partitioning scheme)
Questions: [email protected]