TRANSCRIPT
1
Haryadi Gunawi, UC Berkeley
Towards Reliable Cloud Storage
2
Outline: Research background; FATE and DESTINI
3
Yesterday’s storage
Local storage
Storage Servers
4
Tomorrow’s storage
Laptop → Cloud Storage (e.g., Google)
5
Tomorrow’s storage
Internet-service FS: GoogleFS, HadoopFS, CloudStore, ...
Custom Storage: Facebook Haystack Photo Store, Microsoft StarTrack (Map Apps), Amazon S3, EBS, ...
Key-Value Store: Cassandra, Voldemort, ...
Structured Storage: Yahoo! PNUTS, Google BigTable, HBase, ...
“This is not just data. It’s my life. And I would be sick if I lost it.” [CNN ’10]
+ Replication + Scale-up + Migration + ...
Reliable?
Highly-available?
Headlines
6
Cloudy with a chance of failure
7
Outline: Research background; FATE and DESTINI
- Motivation
- FATE
- DESTINI
- Evaluation
Joint work with: Thanh Do, Pallavi Joshi, Peter Alvaro, Joseph Hellerstein, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, Koushik Sen, Dhruba Borthakur
8
Failure recovery ...
Cloud: thousands of commodity machines. “Rare (HW) failures become frequent” [Hamilton]
Failure recovery “… has to come from the software” [Dean] and “… must be a first-class op” [Ramakrishnan et al.]
... hard to get right
9
Google Chubby (lock service): four occasions of data loss; whole-system downtime; more problems after injecting multiple failures (randomly)
Google BigTable (key-value store): a Chubby bug affects BigTable availability
More details? How often? Other problems and implications?
10
Open-source cloud projects: very valuable bug/issue repositories! HDFS (1400+), ZooKeeper (900+), Cassandra (1600+), HBase (3000+), Hadoop (7000+)
HDFS JIRA study: 1300 issues over 4 years (April 2006 to July 2010); selected recovery problems caused by hardware failures: 91 recovery bugs/issues
... hard to get right
Implications     Count
Data loss        13
Unavailability   48
Corruption       19
Misc.            10
11
Why? Testing is not advanced enough [Google]. Failure model: multiple, diverse failures.
Recovery is under-specified [Hamilton]: lots of custom recovery; the implementation is complex.
Need two advancements: exercise complex failure modes, and write specifications and test the implementation.
12
Cloud software + cloud testing:
FATE: Failure Testing Service
DESTINI: Declarative Testing Specifications
(Inject failures X1, X2 into the cloud software; does it violate the specs?)
13
Outline: Research background; FATE and DESTINI
- Motivation
- FATE (Architecture, Failure exploration)
- DESTINI
- Evaluation
14
HadoopFS (HDFS) Write Protocol
(Pipeline diagrams: client C, master M, datanodes 1, 2, 3; stages: Alloc Req, Setup, Data Transfer; injected failures X1, X2.)
No failures
Setup-Stage Recovery: recreate a fresh pipeline (1, 2, 4)
Data-Transfer Recovery: continue on surviving nodes (1, 2)
15
Failures and FATE
Failures can happen anytime (different stages, different recovery), anywhere (N2 crashes, and then N3), and of any type (bad disks, partitioned nodes/racks).
FATE: systematically exercise multiple, diverse failures. How? Need to “remember” failures, via failure IDs.
16
Failure IDs: an abstraction of I/O failures
Building failure IDs: intercept every I/O and inject possible failures
- Ex: crash, network partition, disk failure (LSE/corruption)
Example (Node2 ← Node3): I/O information = OutputStream.read() in BlockReceiver.java, <stack trace>, Net I/O from N3 to N2, “Data Ack”; Failure = Crash After → Failure ID: 25
Note: FIDs A, B, C, ...
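A failure ID can be pictured as a value object that combines static I/O context with the failure to inject, so that exercised failures can be "remembered" across runs. The sketch below is illustrative only; the class and field names are assumptions, not the actual FATE implementation.

```java
import java.util.Objects;

// Hypothetical sketch of FATE's failure-ID abstraction: equal I/O
// context plus equal injected failure means the same failure ID, so
// a hash set of FailureId objects records what has been exercised.
public class FailureId {
    final String ioCall;     // e.g. "OutputStream.read() in BlockReceiver.java"
    final String ioContext;  // e.g. "Net I/O from N3 to N2, Data Ack"
    final String failure;    // e.g. "Crash After"

    FailureId(String ioCall, String ioContext, String failure) {
        this.ioCall = ioCall;
        this.ioContext = ioContext;
        this.failure = failure;
    }

    @Override public boolean equals(Object o) {
        if (!(o instanceof FailureId)) return false;
        FailureId f = (FailureId) o;
        return ioCall.equals(f.ioCall) && ioContext.equals(f.ioContext)
            && failure.equals(f.failure);
    }

    @Override public int hashCode() {
        return Objects.hash(ioCall, ioContext, failure);
    }
}
```

Because two injections at the same I/O point with the same failure type compare equal, a `Set<FailureId>` suffices to decide whether a failure is new.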
17
FATE Architecture
Target system (e.g. Hadoop FS) running on the Java SDK.
Workload Driver:
    while (new FIDs) { hdfs.write() }
The AspectJ failure surface intercepts I/O, sends I/O info to the Failure Server, which decides: fail or no fail?
Brute-force exploration
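The driver loop on this slide can be sketched in plain Java. The `FailureServer` interface and method names below are stand-ins invented for illustration; the real workload would call `hdfs.write()` where the comment indicates.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch of FATE's workload-driver loop: keep re-running
// the workload while the failure server still reports failure IDs
// that have not been exercised yet.
public class WorkloadDriver {
    // Mock stand-in for the real failure server (names are assumed).
    interface FailureServer {
        Set<String> newFailureIds();  // FIDs observed for this workload
        void scheduleNext();          // pick the next failure(s) to inject
    }

    static int runWorkload(FailureServer server) {
        int experiments = 0;
        Set<String> covered = new HashSet<>();
        while (true) {
            Set<String> fresh = new HashSet<>(server.newFailureIds());
            fresh.removeAll(covered);          // keep only unexercised FIDs
            if (fresh.isEmpty()) break;        // nothing new: done
            covered.addAll(fresh);
            server.scheduleNext();
            experiments++;                     // hdfs.write() would run here
        }
        return experiments;
    }
}
```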
18
(Pipeline diagrams with failure points A, B, C.)
1 failure / run: Exp #1: A, Exp #2: B, Exp #3: C
2 failures / run: AB, AC, BC
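Brute-force exploration, as shown on the slide, simply enumerates experiments: each single failure ID, or each pair of failure IDs. A minimal sketch (names are illustrative):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Sketch of brute-force failure exploration: with one failure per run
// each failure ID is one experiment; with two failures per run, every
// unordered pair of distinct IDs is one experiment.
public class BruteForce {
    static List<List<String>> experiments(List<String> fids, int failuresPerRun) {
        List<List<String>> exps = new ArrayList<>();
        if (failuresPerRun == 1) {
            for (String f : fids) exps.add(Collections.singletonList(f));
        } else {  // two failures per run
            for (int i = 0; i < fids.size(); i++)
                for (int j = i + 1; j < fids.size(); j++)
                    exps.add(Arrays.asList(fids.get(i), fids.get(j)));
        }
        return exps;
    }
}
```

For FIDs {A, B, C} this yields exactly the slide's experiments: A, B, C with one failure per run, and AB, AC, BC with two.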
19
Outline: Research background; FATE and DESTINI
- Motivation
- FATE (Architecture, Failure exploration: challenge and solution)
- DESTINI
- Evaluation
20
Combinatorial explosion: we exercised over 40,000 unique combinations of 1, 2, and 3 failures per run; 80 hours of testing time!
New challenge: combinatorial explosion. We need smarter exploration strategies.
(2 failures / run over failure points A, B at nodes 1, 2, 3: A1·A2, A1·B2, B1·A2, B1·B2, ...)
21
Pruning multiple failures. Properties of multiple failures: pairwise dependent failures and pairwise independent failures.
Goal: exercise distinct recovery behaviors. Key insight: some failures result in similar recovery. Result: > 10x faster, and found the same bugs.
22
Dependent failures: build a failure dependency graph.
Inject single failures first and record the subsequent dependent IDs.
- Ex: X depends on A. Brute-force would run: AX, BX, CX, DX, CY, DY.
Recovery clustering: two clusters, {X} and {X, Y}.
Only exercise distinct clusters; pick one failure ID that triggers each recovery cluster. Result: AX, CX, CY.

FID   Subsequent FIDs
A     X
B     X
C     X, Y
D     X, Y
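The clustering step above can be sketched as follows: first failures whose subsequent-FID sets are equal are assumed to trigger the same recovery, so only one representative per cluster is paired with each subsequent failure. This is a sketch under that assumption; the names are illustrative.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of dependent-failure pruning via recovery clustering:
// group first failures by the set of subsequent FIDs they trigger,
// keep one representative per group, and pair it with each
// subsequent failure in that group.
public class DependentPruning {
    static List<String> prunedPairs(Map<String, Set<String>> subsequent) {
        // Cluster key = the subsequent-FID set; value = representative FID.
        Map<Set<String>, String> representative = new LinkedHashMap<>();
        for (Map.Entry<String, Set<String>> e : subsequent.entrySet())
            representative.putIfAbsent(e.getValue(), e.getKey());

        List<String> pairs = new ArrayList<>();
        for (Map.Entry<Set<String>, String> e : representative.entrySet())
            for (String second : e.getKey())
                pairs.add(e.getValue() + second);
        return pairs;
    }
}
```

With the dependency table from the slide (A→{X}, B→{X}, C→{X,Y}, D→{X,Y}) this reduces the six brute-force pairs to AX, CX, CY, matching the slide's result.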
23
Independent failures (1)
Independent combinations: FP² × N(N−1). With FP = 2 and N = 3, the total is 24.
Symmetric code: just pick two nodes, reducing N(N−1) to 2, i.e., FP² × 2.
(Failure points A and B at each of nodes 1, 2, 3.)
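The two counts on this slide follow directly from the formulas; a tiny sketch makes the arithmetic explicit (my reading of the slide's notation: FP failure points per node, N nodes, ordered node pairs):

```java
// Combination counts from the slide: brute force explores
// FP^2 * N * (N - 1) combinations over ordered node pairs, while
// code symmetry lets us fix two nodes, leaving FP^2 * 2.
public class IndependentCount {
    static int bruteForce(int fp, int n) { return fp * fp * n * (n - 1); }
    static int symmetric(int fp)         { return fp * fp * 2; }
}
```

With FP = 2 and N = 3 this gives 24 brute-force combinations versus 8 after exploiting symmetry.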
24
Independent failures (2): the FP² bottleneck
With FP = 4, the total is 16; a real example has FP = 15.
Recovery clustering: cluster failure points A and B if fail(A) == fail(B). This reduces FP² to FP²_clustered, e.g., from 15 FPs to 8 clustered FPs.
(Failure points A1, B1, C1, D1 and A2, B2, C2, D2, before and after clustering.)
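Clustering failure points by their failure behavior can be sketched as picking one representative per behavior. The `fail` oracle below is a hypothetical stand-in for whatever behavior comparison the slide's fail(A) == fail(B) test denotes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Sketch of failure-point clustering: two failure points fall into the
// same cluster when the (assumed) behavior oracle maps them to the
// same value; only one representative per cluster is kept.
public class FpClustering {
    static List<String> clusterReps(List<String> fps, Function<String, String> fail) {
        Map<String, String> reps = new LinkedHashMap<>();
        for (String fp : fps) reps.putIfAbsent(fail.apply(fp), fp);
        return new ArrayList<>(reps.values());
    }
}
```

Combinations are then drawn from the representatives only, shrinking FP² to the square of the cluster count.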
25
FATE Summary
Exercises multiple, diverse failures via the failure-ID abstraction.
Challenge: combinatorial explosion of multiple failures.
Smart failure exploration strategies: > 10x improvement, built on top of the failure-ID abstraction.
Current limitations: no I/O reordering; manual workload setup.
26
Outline: Research background; FATE and DESTINI
- Motivation
- FATE
- DESTINI (Overview, Building specifications)
- Evaluation
27
DESTINI: declarative specs. Is the system correct under failures? We need to write specifications.
“[It is] great to document (in a spec) the HDFS write protocol ... but we shouldn't spend too much time on it; ... a formal spec may be overkill for a protocol we plan to deprecate imminently.”
(Implementation vs. specs, with injected failures X1, X2.)
28
Declarative specs. How should we write specifications?
They should be developer friendly (clear, concise, easy). Existing approaches:
- Unit tests (ugly, bloated, not formal)
- Others are too verbose and long
Our choice: a declarative relational logic language (Datalog). Key: it is easy to express logical relations.
29
Specs in DESTINI. How to write specs? As violations, expectations, and facts.
How to write recovery specs? “... recovery is under-specified” [Hamilton]. We need precise failure events and precise check timings.
How to test the implementation? Interpose I/O calls (lightweight) and deduce expectations and facts from I/O events.
30
Outline: Research background; FATE and DESTINI
- Motivation
- FATE
- DESTINI (Overview, Building specifications)
- Evaluation
31
Specification template:

violationTable(…) :- expectationTable(…), NOT-IN actualTable(…)

“Throw a violation if an expectation is different from the actual behavior.”

Datalog syntax: head() :- predicates(), … where “:-” denotes derivation and “,” denotes AND.
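Operationally, this template is just a set difference: every expected tuple that is missing from the actual table becomes a violation. A minimal sketch in plain Java (the table contents are illustrative):

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the spec template "violation :- expectation NOT-IN actual":
// the violation table is the expected tuples minus the actual tuples.
public class SpecTemplate {
    static Set<String> violations(Set<String> expected, Set<String> actual) {
        Set<String> v = new HashSet<>(expected);
        v.removeAll(actual);  // keep only expectations with no matching fact
        return v;
    }
}
```

On the upcoming recovery-bug example, expectedNodes = {(B, Node1), (B, Node2)} against actualNodes = {(B, Node1)} yields exactly the violation (B, Node2).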
32
Data-transfer recovery: “Replicas should exist on the surviving nodes.”

incorrectNodes(B, N) :- expectedNodes(B, N), NOT-IN actualNodes(B, N);

expectedNodes(Block, Node): (B, Node1), (B, Node2)
actualNodes(Block, Node): (B, Node1), (B, Node2)
incorrectNodes(Block, Node): (empty)
33
Recovery bug:

incorrectNodes(B, N) :- expectedNodes(B, N), NOT-IN actualNodes(B, N);

expectedNodes(Block, Node): (B, Node1), (B, Node2)
actualNodes(Block, Node): (B, Node1)
incorrectNodes(Block, Node): (B, Node2)
34
Building expectations. Ex: which nodes should have the blocks? Deduce expectations from I/O events (italic).

Client → Master: getBlockPipe(…) “Give me 3 nodes for B” → [Node1, Node2, Node3]

#2: expectedNodes(B, N) :- getBlockPipe(B, N);

expectedNodes(Block, Node): (B, Node1), (B, Node2), (B, Node3)

#1: incorrectNodes(B, N) :- expectedNodes(B, N), NOT-IN actualNodes(B, N);
35
Updating expectations:

DEL expectedNodes(B, N) :- expectedNodes(B, N), fateCrashNode(N);

expectedNodes(Block, Node): (B, Node1), (B, Node2), (B, Node3)

#1: incorrectNodes(B, N) :- expectedNodes(B, N), NOT-IN actualNodes(B, N);
#2: expectedNodes(B, N) :- getBlockPipe(B, N);
36
Precise failure events:

DEL expectedNodes(B, N) :- expectedNodes(B, N), fateCrashNode(N), writeStage(B, Stage), Stage == “DataTransfer”;

Different stages lead to different recovery behaviors: data-transfer recovery vs. setup-stage recovery.

#1: incorrectNodes(B, N) :- expectedNodes(B, N), NOT-IN actualNodes(B, N)
#2: expectedNodes(B, N) :- getBlockPipe(B, N)
#3: DEL expectedNodes(B, N) :- expectedNodes(B, N), fateCrashNode(N), writeStage(B, Stage), Stage == “DataTransfer”
#4: writeStage(B, “DataTransfer”) :- writeStage(B, “Setup”), nodesCnt(Nc), acksCnt(Ac), Nc == Ac
#5: nodesCnt(B, COUNT<N>) :- pipeNodes(B, N)
#6: pipeNodes(B, N) :- getBlockPipe(B, N)
#7: acksCnt(B, COUNT<A>) :- setupAcks(B, P, “OK”)
#8: setupAcks(B, P, A) :- setupAck(B, P, A)
37
Facts. Ex: which nodes actually store the valid blocks? Deduced from disk I/O events in eight Datalog rules.

actualNodes(Block, Node): (B, Node1)

#1: incorrectNodes(B, N) :- expectedNodes(B, N), NOT-IN actualNodes(B, N)
38
Violation and check timing
Recovery ≠ invariant: while recovery is ongoing, invariants are temporarily violated, and we don't want false alarms.
We need precise check timings. Ex: check upon block completion:

#1: incorrectNodes(B, N) :- expectedNodes(B, N), NOT-IN actualNodes(B, N), completeBlock(B);
39
Data-transfer recovery spec:

r1  incorrectNodes (B, N) :- cnpComplete (B), expectedNodes (B, N), NOT-IN actualNodes (B, N);
r2  pipeNodes (B, Pos, N) :- getBlkPipe (UFile, B, Gs, Pos, N);
r3  expectedNodes (B, N) :- getBlkPipe (UFile, B, Gs, Pos, N);
r4  DEL expectedNodes (B, N) :- fateCrashNode (N), pipeStage (B, Stg), Stg == 2, expectedNodes (B, N);
r5  setupAcks (B, Pos, Ack) :- cdpSetupAck (B, Pos, Ack);
r6  goodAcksCnt (B, COUNT<Ack>) :- setupAcks (B, Pos, Ack), Ack == 'OK';
r7  nodesCnt (B, COUNT<Node>) :- pipeNodes (B, Pos, Node);
r8  pipeStage (B, Stg) :- nodesCnt (B, NCnt), goodAcksCnt (B, ACnt), NCnt == ACnt, Stg := 2;
r9  blkGenStamp (B, Gs) :- dnpNextGenStamp (B, Gs);
r10 blkGenStamp (B, Gs) :- cnpGetBlkPipe (UFile, B, Gs, _, _);
r11 diskFiles (N, File) :- fsCreate (N, File);
r12 diskFiles (N, Dst) :- fsRename (N, Src, Dst), diskFiles (N, Src);
r13 DEL diskFiles (N, Src) :- fsRename (N, Src, Dst), diskFiles (N, Src);
r14 fileTypes (N, File, Type) :- diskFiles (N, File), Type := Util.getType(File);
r15 blkMetas (N, B, Gs) :- fileTypes (N, File, Type), Type == metafile, Gs := Util.getGs(File);
r16 actualNodes (B, N) :- blkMetas (N, B, Gs), blkGenStamp (B, Gs);

(Callouts on the slide: failure event (crash); expectation; actual facts; I/O events.)
40
More detailed specs
This spec only says “something is wrong.” Why? Where is the bug?
Let's write more detailed specs.
41
Adding specs
First analysis: the client's pipeline excludes Node2; why? Maybe the client gets a bad ack for Node2:
errBadAck (N) :- dataAck (N, “Error”), liveNodes (N)
Second analysis: the client gets a bad ack for Node2; why? Maybe Node1 could not communicate with Node2:
errBadConnect (N, TgtN) :- dataTransfer (N, TgtN, “Terminated”), liveNodes (TgtN)
We catch the bug! Node2 cannot talk to Node3 (which crashed); Node2 terminates all its connections (including Node1's!); Node1 thinks Node2 is dead.
42
Catch bugs with more specs
More detailed specs catch bugs closer to the source and earlier in time.
Views: global view, client's view, datanode's view.
43
More ... Design patterns:
- Add detailed specs
- Refine existing specs
- Write specs from different views (global, client, datanode)
- Incorporate diverse failures (crashes, network partitions)
- Express different violations (data loss, unavailability)
44
Outline: Research background; FATE and DESTINI
- Motivation
- FATE
- DESTINI
- Evaluation
45
Evaluation
Implementation complexity: ~6000 LOC in Java.
Targets three popular cloud systems:
- HadoopFS (primary target): the underlying storage for Hadoop/MapReduce
- ZooKeeper: a distributed synchronization service
- Cassandra: a distributed key-value store
Recovery bugs: found 22 new HDFS bugs (confirmed), including data-loss and unavailability bugs; reproduced 51 old bugs.
46
Availability bug: “If multiple racks are available (reachable), a block should be stored in a minimum of two racks.”

errorSingleRack(B) :- rackCnt(B, Cnt), Cnt == 1, blkRacks(B, R), connected(R, Rb), endOfReplicationMonitor (_);

“Throw a violation if a block is only stored in one rack, but that rack is connected to another rack.”

FATE injects rack partitioning (Rack #1 / Rack #2, client, replication monitor). Availability bug! #replicas = 3, but the locations are not checked, and B is not migrated to Rack #2.

Derived tuples: rackCnt: (B, 1); blkRacks: (B, R1); connected: (R1, R2); errorSingleRack: (B)
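The single-rack check on this slide can be sketched as a plain Java predicate over the same tuples; the method and parameter names are illustrative, not the DESTINI runtime.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;

// Sketch of the errorSingleRack spec: flag a block stored on exactly
// one rack when that rack can reach at least one other rack (so the
// replication monitor should have migrated a replica).
public class SingleRackCheck {
    static boolean errorSingleRack(Set<String> blockRacks,
                                   Map<String, Set<String>> connected) {
        if (blockRacks.size() != 1) return false;       // rackCnt must be 1
        String rack = blockRacks.iterator().next();
        Set<String> reachable =
            connected.getOrDefault(rack, Collections.emptySet());
        return !reachable.isEmpty();                    // another rack reachable
    }
}
```

On the slide's scenario (B on R1 only, R1 connected to R2), the predicate fires; once B has replicas on both racks, it does not.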
47
Pruning efficiency: reduces the number of experiments by an order of magnitude (each experiment takes 4–9 seconds), while finding the same number of bugs (in our experience).
(Bar chart: # experiments, brute force vs. pruned, for Write and Append workloads with 2 and 3 crashes; values shown include 7720, 618, and 5000.)
48
Specification simplicity, compared to related work:

Framework                  #Chks   Lines/Chk
D3S [NSDI ’08]             10      53
Pip [NSDI ’06]             44      43
WiDS [NSDI ’07]            15      22
P2 Monitor [EuroSys ’06]   11      12
DESTINI                    74      5
49
Summary
Cloud systems: all good, but they must manage failures; performance, reliability, and availability all depend on failure recovery.
FATE and DESTINI: explore multiple, diverse failures systematically; facilitate declarative recovery specifications; a unified framework.
Real-world adoption is in progress.
50
Thanks! Questions?
END