TRANSCRIPT
1
Haryadi Gunawi, UC Berkeley
Towards Reliable Cloud Storage
2
Outline: Research background; FATE and DESTINI
3
Yesterday’s storage
Local storage
Storage Servers
4
Tomorrow’s storage
Laptop → Cloud Storage (e.g., Google)
5
Tomorrow’s storage
Internet-service FS: GoogleFS, HadoopFS, CloudStore, ...
Custom Storage: Facebook Haystack Photo Store, Microsoft StarTrack (Map Apps), Amazon S3, EBS, ...
Key-Value Store: Cassandra, Voldemort, ...
Structured Storage: Yahoo! PNUTS, Google BigTable, HBase, ...
“This is not just data. It’s my life. And I would be sick if I lost it.” [CNN ’10]
+ Replication + Scale-up + Migration + ...
Reliable?
Highly-available?
Headlines
6
Cloudy with a chance of failure
7
Outline: Research background; FATE and DESTINI
- Motivation
- FATE
- DESTINI
- Evaluation
Joint work with: Thanh Do, Pallavi Joshi, Peter Alvaro, Joseph Hellerstein, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, Koushik Sen, Dhruba Borthakur
8
Failure recovery ...
Cloud: thousands of commodity machines. “Rare (HW) failures become frequent” [Hamilton]
Failure recovery “… has to come from the software” [Dean] and “… must be a first-class op” [Ramakrishnan et al.]
... hard to get right
9
Google Chubby (lock service): four occasions of data loss; whole-system downtime; more problems after injecting multiple failures (randomly)
Google BigTable (key-value store): a Chubby bug affects BigTable availability
More details? How often? Other problems and implications?
10
Open-source cloud projects: very valuable bug/issue repositories! HDFS (1400+), ZooKeeper (900+), Cassandra (1600+), HBase (3000+), Hadoop (7000+)
HDFS JIRA study: 1300 issues over 4 years (April 2006 to July 2010); selected recovery problems caused by hardware failures: 91 recovery bugs/issues
... hard to get right
Implications     Count
Data loss        13
Unavailability   48
Corruption       19
Misc.            10
11
Why? Testing is not advanced enough [Google]. Failure model: multiple, diverse failures.
Recovery is under-specified [Hamilton]: lots of custom recovery; the implementation is complex.
Need two advancements: exercise complex failure modes, and write specifications and test the implementation.
12
Cloud software + cloud testing:
FATE: Failure Testing Service
DESTINI: Declarative Testing Specifications
(Inject failures X1, X2 into the cloud software; does it violate the specs?)
13
Outline: Research background; FATE and DESTINI
- Motivation
- FATE (Architecture, Failure exploration)
- DESTINI
- Evaluation
14
HadoopFS (HDFS) Write Protocol
(Pipeline diagrams: client C, master M, datanodes 1, 2, 3; stages: Alloc Req, Setup, Data Transfer; injected failures X1, X2.)
No failures
Setup-Stage Recovery: recreate a fresh pipeline (1, 2, 4)
Data-Transfer Recovery: continue on surviving nodes (1, 2)
15
Failures and FATE
Failures can happen anytime (different stages, different recovery), anywhere (N2 crashes, and then N3), and of any type (bad disks, partitioned nodes/racks).
FATE: systematically exercise multiple, diverse failures. How? Need to “remember” failures, via failure IDs.
16
Failure IDs: an abstraction of I/O failures
Building failure IDs: intercept every I/O and inject possible failures
- Ex: crash, network partition, disk failure (LSE/corruption)
Example (Node2 ← Node3): I/O information = OutputStream.read() in BlockReceiver.java, <stack trace>, Net I/O from N3 to N2, “Data Ack”; Failure = Crash After → Failure ID: 25
Note: FIDs A, B, C, ...
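A failure ID can be pictured as a value object that combines static I/O context with the failure to inject, so that exercised failures can be "remembered" across runs. The sketch below is illustrative only; the class and field names are assumptions, not the actual FATE implementation.

```java
import java.util.Objects;

// Hypothetical sketch of FATE's failure-ID abstraction: equal I/O
// context plus equal injected failure means the same failure ID, so
// a hash set of FailureId objects records what has been exercised.
public class FailureId {
    final String ioCall;     // e.g. "OutputStream.read() in BlockReceiver.java"
    final String ioContext;  // e.g. "Net I/O from N3 to N2, Data Ack"
    final String failure;    // e.g. "Crash After"

    FailureId(String ioCall, String ioContext, String failure) {
        this.ioCall = ioCall;
        this.ioContext = ioContext;
        this.failure = failure;
    }

    @Override public boolean equals(Object o) {
        if (!(o instanceof FailureId)) return false;
        FailureId f = (FailureId) o;
        return ioCall.equals(f.ioCall) && ioContext.equals(f.ioContext)
            && failure.equals(f.failure);
    }

    @Override public int hashCode() {
        return Objects.hash(ioCall, ioContext, failure);
    }
}
```

Because two injections at the same I/O point with the same failure type compare equal, a `Set<FailureId>` suffices to decide whether a failure is new.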
17
FATE Architecture
Target system (e.g. Hadoop FS) running on the Java SDK.
Workload Driver:
    while (new FIDs) { hdfs.write() }
The AspectJ failure surface intercepts I/O, sends I/O info to the Failure Server, which decides: fail or no fail?
Brute-force exploration
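The driver loop on this slide can be sketched in plain Java. The `FailureServer` interface and method names below are stand-ins invented for illustration; the real workload would call `hdfs.write()` where the comment indicates.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch of FATE's workload-driver loop: keep re-running
// the workload while the failure server still reports failure IDs
// that have not been exercised yet.
public class WorkloadDriver {
    // Mock stand-in for the real failure server (names are assumed).
    interface FailureServer {
        Set<String> newFailureIds();  // FIDs observed for this workload
        void scheduleNext();          // pick the next failure(s) to inject
    }

    static int runWorkload(FailureServer server) {
        int experiments = 0;
        Set<String> covered = new HashSet<>();
        while (true) {
            Set<String> fresh = new HashSet<>(server.newFailureIds());
            fresh.removeAll(covered);          // keep only unexercised FIDs
            if (fresh.isEmpty()) break;        // nothing new: done
            covered.addAll(fresh);
            server.scheduleNext();
            experiments++;                     // hdfs.write() would run here
        }
        return experiments;
    }
}
```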
18
(Pipeline diagrams with failure points A, B, C.)
1 failure / run: Exp #1: A, Exp #2: B, Exp #3: C
2 failures / run: AB, AC, BC
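Brute-force exploration, as shown on the slide, simply enumerates experiments: each single failure ID, or each pair of failure IDs. A minimal sketch (names are illustrative):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Sketch of brute-force failure exploration: with one failure per run
// each failure ID is one experiment; with two failures per run, every
// unordered pair of distinct IDs is one experiment.
public class BruteForce {
    static List<List<String>> experiments(List<String> fids, int failuresPerRun) {
        List<List<String>> exps = new ArrayList<>();
        if (failuresPerRun == 1) {
            for (String f : fids) exps.add(Collections.singletonList(f));
        } else {  // two failures per run
            for (int i = 0; i < fids.size(); i++)
                for (int j = i + 1; j < fids.size(); j++)
                    exps.add(Arrays.asList(fids.get(i), fids.get(j)));
        }
        return exps;
    }
}
```

For FIDs {A, B, C} this yields exactly the slide's experiments: A, B, C with one failure per run, and AB, AC, BC with two.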
19
Outline: Research background; FATE and DESTINI
- Motivation
- FATE (Architecture, Failure exploration: challenge and solution)
- DESTINI
- Evaluation
20
Combinatorial explosion: we exercised over 40,000 unique combinations of 1, 2, and 3 failures per run; 80 hours of testing time!
New challenge: combinatorial explosion. We need smarter exploration strategies.
(2 failures / run over failure points A, B at nodes 1, 2, 3: A1·A2, A1·B2, B1·A2, B1·B2, ...)
21
Pruning multiple failures. Properties of multiple failures: pairwise dependent failures and pairwise independent failures.
Goal: exercise distinct recovery behaviors. Key insight: some failures result in similar recovery. Result: > 10x faster, and found the same bugs.
22
Dependent failures: build a failure dependency graph.
Inject single failures first and record the subsequent dependent IDs.
- Ex: X depends on A. Brute-force would run: AX, BX, CX, DX, CY, DY.
Recovery clustering: two clusters, {X} and {X, Y}.
Only exercise distinct clusters; pick one failure ID that triggers each recovery cluster. Result: AX, CX, CY.

FID   Subsequent FIDs
A     X
B     X
C     X, Y
D     X, Y
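The clustering step above can be sketched as follows: first failures whose subsequent-FID sets are equal are assumed to trigger the same recovery, so only one representative per cluster is paired with each subsequent failure. This is a sketch under that assumption; the names are illustrative.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of dependent-failure pruning via recovery clustering:
// group first failures by the set of subsequent FIDs they trigger,
// keep one representative per group, and pair it with each
// subsequent failure in that group.
public class DependentPruning {
    static List<String> prunedPairs(Map<String, Set<String>> subsequent) {
        // Cluster key = the subsequent-FID set; value = representative FID.
        Map<Set<String>, String> representative = new LinkedHashMap<>();
        for (Map.Entry<String, Set<String>> e : subsequent.entrySet())
            representative.putIfAbsent(e.getValue(), e.getKey());

        List<String> pairs = new ArrayList<>();
        for (Map.Entry<Set<String>, String> e : representative.entrySet())
            for (String second : e.getKey())
                pairs.add(e.getValue() + second);
        return pairs;
    }
}
```

With the dependency table from the slide (A→{X}, B→{X}, C→{X,Y}, D→{X,Y}) this reduces the six brute-force pairs to AX, CX, CY, matching the slide's result.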
23
Independent failures (1)
Independent combinations: FP² × N(N−1). With FP = 2 and N = 3, the total is 24.
Symmetric code: just pick two nodes, reducing N(N−1) to 2, i.e., FP² × 2.
(Failure points A and B at each of nodes 1, 2, 3.)
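The two counts on this slide follow directly from the formulas; a tiny sketch makes the arithmetic explicit (my reading of the slide's notation: FP failure points per node, N nodes, ordered node pairs):

```java
// Combination counts from the slide: brute force explores
// FP^2 * N * (N - 1) combinations over ordered node pairs, while
// code symmetry lets us fix two nodes, leaving FP^2 * 2.
public class IndependentCount {
    static int bruteForce(int fp, int n) { return fp * fp * n * (n - 1); }
    static int symmetric(int fp)         { return fp * fp * 2; }
}
```

With FP = 2 and N = 3 this gives 24 brute-force combinations versus 8 after exploiting symmetry.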
24
Independent failures (2): the FP² bottleneck
With FP = 4, the total is 16; a real example has FP = 15.
Recovery clustering: cluster failure points A and B if fail(A) == fail(B). This reduces FP² to FP²_clustered, e.g., from 15 FPs to 8 clustered FPs.
(Failure points A1, B1, C1, D1 and A2, B2, C2, D2, before and after clustering.)
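Clustering failure points by their failure behavior can be sketched as picking one representative per behavior. The `fail` oracle below is a hypothetical stand-in for whatever behavior comparison the slide's fail(A) == fail(B) test denotes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Sketch of failure-point clustering: two failure points fall into the
// same cluster when the (assumed) behavior oracle maps them to the
// same value; only one representative per cluster is kept.
public class FpClustering {
    static List<String> clusterReps(List<String> fps, Function<String, String> fail) {
        Map<String, String> reps = new LinkedHashMap<>();
        for (String fp : fps) reps.putIfAbsent(fail.apply(fp), fp);
        return new ArrayList<>(reps.values());
    }
}
```

Combinations are then drawn from the representatives only, shrinking FP² to the square of the cluster count.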
25
FATE Summary
Exercises multiple, diverse failures via the failure-ID abstraction.
Challenge: combinatorial explosion of multiple failures.
Smart failure exploration strategies: > 10x improvement, built on top of the failure-ID abstraction.
Current limitations: no I/O reordering; manual workload setup.
26
Outline: Research background; FATE and DESTINI
- Motivation
- FATE
- DESTINI (Overview, Building specifications)
- Evaluation
27
DESTINI: declarative specs. Is the system correct under failures? We need to write specifications.
“[It is] great to document (in a spec) the HDFS write protocol ... but we shouldn't spend too much time on it; ... a formal spec may be overkill for a protocol we plan to deprecate imminently.”
(Implementation vs. specs, with injected failures X1, X2.)
28
Declarative specs. How should we write specifications?
They should be developer friendly (clear, concise, easy). Existing approaches:
- Unit tests (ugly, bloated, not formal)
- Others are too verbose and long
Our choice: a declarative relational logic language (Datalog). Key: it is easy to express logical relations.
29
Specs in DESTINI. How to write specs? As violations, expectations, and facts.
How to write recovery specs? “... recovery is under-specified” [Hamilton]. We need precise failure events and precise check timings.
How to test the implementation? Interpose I/O calls (lightweight) and deduce expectations and facts from I/O events.
30
Outline: Research background; FATE and DESTINI
- Motivation
- FATE
- DESTINI (Overview, Building specifications)
- Evaluation
31
Specification template:

violationTable(…) :- expectationTable(…), NOT-IN actualTable(…)

“Throw a violation if an expectation is different from the actual behavior.”

Datalog syntax: head() :- predicates(), … where “:-” denotes derivation and “,” denotes AND.
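Operationally, this template is just a set difference: every expected tuple that is missing from the actual table becomes a violation. A minimal sketch in plain Java (the table contents are illustrative):

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the spec template "violation :- expectation NOT-IN actual":
// the violation table is the expected tuples minus the actual tuples.
public class SpecTemplate {
    static Set<String> violations(Set<String> expected, Set<String> actual) {
        Set<String> v = new HashSet<>(expected);
        v.removeAll(actual);  // keep only expectations with no matching fact
        return v;
    }
}
```

On the upcoming recovery-bug example, expectedNodes = {(B, Node1), (B, Node2)} against actualNodes = {(B, Node1)} yields exactly the violation (B, Node2).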
32
Data-transfer recovery: “Replicas should exist on the surviving nodes.”

incorrectNodes(B, N) :- expectedNodes(B, N), NOT-IN actualNodes(B, N);

expectedNodes(Block, Node): (B, Node1), (B, Node2)
actualNodes(Block, Node): (B, Node1), (B, Node2)
incorrectNodes(Block, Node): (empty)
33
Recovery bug:

incorrectNodes(B, N) :- expectedNodes(B, N), NOT-IN actualNodes(B, N);

expectedNodes(Block, Node): (B, Node1), (B, Node2)
actualNodes(Block, Node): (B, Node1)
incorrectNodes(Block, Node): (B, Node2)
34
Building expectations. Ex: which nodes should have the blocks? Deduce expectations from I/O events (italic).

Client → Master: getBlockPipe(…) “Give me 3 nodes for B” → [Node1, Node2, Node3]

#2: expectedNodes(B, N) :- getBlockPipe(B, N);

expectedNodes(Block, Node): (B, Node1), (B, Node2), (B, Node3)

#1: incorrectNodes(B, N) :- expectedNodes(B, N), NOT-IN actualNodes(B, N);
35
Updating expectations:

DEL expectedNodes(B, N) :- expectedNodes(B, N), fateCrashNode(N);

expectedNodes(Block, Node): (B, Node1), (B, Node2), (B, Node3)

#1: incorrectNodes(B, N) :- expectedNodes(B, N), NOT-IN actualNodes(B, N);
#2: expectedNodes(B, N) :- getBlockPipe(B, N);
36
Precise failure events:

DEL expectedNodes(B, N) :- expectedNodes(B, N), fateCrashNode(N), writeStage(B, Stage), Stage == “DataTransfer”;

Different stages lead to different recovery behaviors: data-transfer recovery vs. setup-stage recovery.

#1: incorrectNodes(B, N) :- expectedNodes(B, N), NOT-IN actualNodes(B, N)
#2: expectedNodes(B, N) :- getBlockPipe(B, N)
#3: DEL expectedNodes(B, N) :- expectedNodes(B, N), fateCrashNode(N), writeStage(B, Stage), Stage == “DataTransfer”
#4: writeStage(B, “DataTransfer”) :- writeStage(B, “Setup”), nodesCnt(Nc), acksCnt(Ac), Nc == Ac
#5: nodesCnt(B, COUNT<N>) :- pipeNodes(B, N)
#6: pipeNodes(B, N) :- getBlockPipe(B, N)
#7: acksCnt(B, COUNT<A>) :- setupAcks(B, P, “OK”)
#8: setupAcks(B, P, A) :- setupAck(B, P, A)
37
Facts. Ex: which nodes actually store the valid blocks? Deduced from disk I/O events in eight Datalog rules.

actualNodes(Block, Node): (B, Node1)

#1: incorrectNodes(B, N) :- expectedNodes(B, N), NOT-IN actualNodes(B, N)
38
Violation and check timing
Recovery ≠ invariant: while recovery is ongoing, invariants are temporarily violated, and we don't want false alarms.
We need precise check timings. Ex: check upon block completion:

#1: incorrectNodes(B, N) :- expectedNodes(B, N), NOT-IN actualNodes(B, N), completeBlock(B);
39
Data-transfer recovery spec:

r1  incorrectNodes (B, N) :- cnpComplete (B), expectedNodes (B, N), NOT-IN actualNodes (B, N);
r2  pipeNodes (B, Pos, N) :- getBlkPipe (UFile, B, Gs, Pos, N);
r3  expectedNodes (B, N) :- getBlkPipe (UFile, B, Gs, Pos, N);
r4  DEL expectedNodes (B, N) :- fateCrashNode (N), pipeStage (B, Stg), Stg == 2, expectedNodes (B, N);
r5  setupAcks (B, Pos, Ack) :- cdpSetupAck (B, Pos, Ack);
r6  goodAcksCnt (B, COUNT<Ack>) :- setupAcks (B, Pos, Ack), Ack == 'OK';
r7  nodesCnt (B, COUNT<Node>) :- pipeNodes (B, Pos, Node);
r8  pipeStage (B, Stg) :- nodesCnt (B, NCnt), goodAcksCnt (B, ACnt), NCnt == ACnt, Stg := 2;
r9  blkGenStamp (B, Gs) :- dnpNextGenStamp (B, Gs);
r10 blkGenStamp (B, Gs) :- cnpGetBlkPipe (UFile, B, Gs, _, _);
r11 diskFiles (N, File) :- fsCreate (N, File);
r12 diskFiles (N, Dst) :- fsRename (N, Src, Dst), diskFiles (N, Src);
r13 DEL diskFiles (N, Src) :- fsRename (N, Src, Dst), diskFiles (N, Src);
r14 fileTypes (N, File, Type) :- diskFiles (N, File), Type := Util.getType(File);
r15 blkMetas (N, B, Gs) :- fileTypes (N, File, Type), Type == metafile, Gs := Util.getGs(File);
r16 actualNodes (B, N) :- blkMetas (N, B, Gs), blkGenStamp (B, Gs);

(Callouts on the slide: failure event (crash); expectation; actual facts; I/O events.)
40
More detailed specs
This spec only says “something is wrong.” Why? Where is the bug?
Let's write more detailed specs.
41
Adding specs
First analysis: the client's pipeline excludes Node2; why? Maybe the client gets a bad ack for Node2:
errBadAck (N) :- dataAck (N, “Error”), liveNodes (N)
Second analysis: the client gets a bad ack for Node2; why? Maybe Node1 could not communicate with Node2:
errBadConnect (N, TgtN) :- dataTransfer (N, TgtN, “Terminated”), liveNodes (TgtN)
We catch the bug! Node2 cannot talk to Node3 (which crashed); Node2 terminates all its connections (including Node1's!); Node1 thinks Node2 is dead.
42
Catch bugs with more specs
More detailed specs catch bugs closer to the source and earlier in time.
Views: global view, client's view, datanode's view.
43
More ... Design patterns:
- Add detailed specs
- Refine existing specs
- Write specs from different views (global, client, datanode)
- Incorporate diverse failures (crashes, network partitions)
- Express different violations (data loss, unavailability)
44
Outline: Research background; FATE and DESTINI
- Motivation
- FATE
- DESTINI
- Evaluation
45
Evaluation
Implementation complexity: ~6000 LOC in Java.
Targets three popular cloud systems:
- HadoopFS (primary target): the underlying storage for Hadoop/MapReduce
- ZooKeeper: a distributed synchronization service
- Cassandra: a distributed key-value store
Recovery bugs: found 22 new HDFS bugs (confirmed), including data-loss and unavailability bugs; reproduced 51 old bugs.
46
Availability bug: “If multiple racks are available (reachable), a block should be stored in a minimum of two racks.”

errorSingleRack(B) :- rackCnt(B, Cnt), Cnt == 1, blkRacks(B, R), connected(R, Rb), endOfReplicationMonitor (_);

“Throw a violation if a block is only stored in one rack, but that rack is connected to another rack.”

FATE injects rack partitioning (Rack #1 / Rack #2, client, replication monitor). Availability bug! #replicas = 3, but the locations are not checked, and B is not migrated to Rack #2.

Derived tuples: rackCnt: (B, 1); blkRacks: (B, R1); connected: (R1, R2); errorSingleRack: (B)
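The single-rack check on this slide can be sketched as a plain Java predicate over the same tuples; the method and parameter names are illustrative, not the DESTINI runtime.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;

// Sketch of the errorSingleRack spec: flag a block stored on exactly
// one rack when that rack can reach at least one other rack (so the
// replication monitor should have migrated a replica).
public class SingleRackCheck {
    static boolean errorSingleRack(Set<String> blockRacks,
                                   Map<String, Set<String>> connected) {
        if (blockRacks.size() != 1) return false;       // rackCnt must be 1
        String rack = blockRacks.iterator().next();
        Set<String> reachable =
            connected.getOrDefault(rack, Collections.emptySet());
        return !reachable.isEmpty();                    // another rack reachable
    }
}
```

On the slide's scenario (B on R1 only, R1 connected to R2), the predicate fires; once B has replicas on both racks, it does not.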
47
Pruning efficiency: reduces the number of experiments by an order of magnitude (each experiment takes 4–9 seconds), while finding the same number of bugs (in our experience).
(Bar chart: # experiments, brute force vs. pruned, for Write and Append workloads with 2 and 3 crashes; values shown include 7720, 618, and 5000.)
48
Specification simplicity, compared to related work:

Framework                  #Chks   Lines/Chk
D3S [NSDI ’08]             10      53
Pip [NSDI ’06]             44      43
WiDS [NSDI ’07]            15      22
P2 Monitor [EuroSys ’06]   11      12
DESTINI                    74      5
49
Summary
Cloud systems: all good, but they must manage failures; performance, reliability, and availability all depend on failure recovery.
FATE and DESTINI: explore multiple, diverse failures systematically; facilitate declarative recovery specifications; a unified framework.
Real-world adoption is in progress.
50
Thanks! Questions?
END