Robustness in the Salus scalable block store
Yang Wang, Manos Kapritsos, Zuocheng Ren, Prince Mahajan, Jeevitha Kirubanandam, Lorenzo Alvisi, and Mike Dahlin
University of Texas at Austin

Upload: jerome-king
Posted on 04-Jan-2016

TRANSCRIPT

Page 1:

Robustness in the Salus scalable block store

Yang Wang, Manos Kapritsos, Zuocheng Ren, Prince Mahajan, Jeevitha Kirubanandam,

Lorenzo Alvisi, and Mike Dahlin
University of Texas at Austin

Page 2:

Storage: Do not lose data

Past: Fsck, IRON, ZFS, …

Now & Future

Page 3:

Salus: Provide robustness to scalable systems

Scalable systems: crash failures + certain disk corruption (GFS/Bigtable, HDFS/HBase, WAS, …)

Strong protections: arbitrary failures (BFT, end-to-end verification, …)

Salus sits between the two.

Page 4:

Start from “Scalable”


[Figure: file system, key-value store, database, and block store placed on scalable vs. robust axes]

Borrow scalable designs

Page 5:

Scalable architecture

Metadata server

Data servers

Clients

Parallel data transfer

Data is replicated for durability and availability.

Computing node: no persistent storage

Storage nodes

Page 6:

Problems

• No ordering guarantees for writes
• Single points of failure:
  – Computing nodes can corrupt data.
  – Clients can accept corrupted data.

Page 7:

1. No ordering guarantees for writes

[Figure: clients send writes 1–3 through the metadata server to the data servers; the writes may be applied out of order or only partially]

Block store requires barrier semantics.

Page 8:

2. Computing nodes can corrupt data

[Figure: a single computing node sits on the write path between clients and data servers, making it a single point of failure]

Page 9:

3. Client can accept corrupted data

[Figure: the client reads from a single node, which is a single point of failure]

Does a simple disk checksum work? No: it cannot prevent errors in memory, etc.

Page 10:

Salus: Overview

Metadata server

Block drivers (single driver per virtual disk)

Pipelined commit

Active storage

End-to-end verification


Failure model: almost arbitrary failures, but no impersonation

Page 11:

Pipelined commit

• Goal: barrier semantics
  – A request can be marked as a barrier.
  – All previous requests must be executed before it.

• Naïve solution:
  – Wait at each barrier: loses parallelism

• Looks similar to a distributed transaction
  – Well-known solution: two-phase commit (2PC)

Page 12:

Problem of 2PC

[Figure: 2PC per batch. The driver sends writes 1–3 (batch i) and 4–5 (batch i+1), separated by barriers, to the servers; each batch has its own leader and prepares independently.]

Page 13:

Problem of 2PC

[Figure: same setup; each batch's leader commits independently]

The problem can still happen: we need an ordering guarantee between different batches.
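To make the failure mode concrete, here is a toy sketch (not Salus code; `naive_commit_order` is invented for illustration) of what goes wrong when each batch runs its own 2PC and commits as soon as its own participants are prepared, with no cross-batch coordination:

```python
# Hypothetical sketch: per-batch 2PC with no coordination across batches.
# Each batch commits the moment its own phase 1 (prepare) finishes.
def naive_commit_order(prepare_done_times):
    """prepare_done_times[i] = time at which batch i finishes phase 1.
    Returns the order in which the batches would commit."""
    return sorted(range(len(prepare_done_times)),
                  key=lambda i: prepare_done_times[i])

# Batch 1 prepares faster than batch 0 (e.g. batch 0 hit a slow disk),
# so batch 1 commits first. A crash at that moment leaves batch 1
# durable but batch 0 missing: a "hole" that violates barrier semantics.
order = naive_commit_order([5.0, 2.0])
print(order)  # [1, 0]
```

This is exactly the gap that pipelined commit closes by forcing batch i to wait for batch i-1's commit.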

Page 14:

Pipelined commit

[Figure: pipelined commit. Data for batches i and i+1 transfers in parallel, but the leader of batch i waits for "batch i-1 committed" before committing, and batch i+1 waits for batch i: parallel data transfer, sequential commit.]
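The pipeline above can be sketched in a few lines; this is a minimal illustration (the `pipelined_commit` helper and its batch representation are invented for this sketch, not Salus's actual protocol), in which every batch's data transfer runs concurrently while commits are chained through per-batch events:

```python
import random
import threading
import time

def pipelined_commit(batches):
    """Commit batches in order while their data transfers run in parallel.

    Each batch 'prepares' (transfers data) concurrently with the others,
    but may only commit after its predecessor has committed, preserving
    barrier semantics without stalling the data path."""
    commit_log = []                          # order in which batches commit
    log_lock = threading.Lock()
    events = [threading.Event() for _ in range(len(batches) + 1)]
    events[0].set()                          # "batch -1" is trivially committed

    def run_batch(i, writes):
        # Phase 1 (parallel): transfer/prepare this batch's writes.
        time.sleep(random.uniform(0, 0.01))  # simulate variable transfer time
        # Phase 2 (sequential): wait for the predecessor, then commit.
        events[i].wait()
        with log_lock:
            commit_log.append(i)
        events[i + 1].set()

    threads = [threading.Thread(target=run_batch, args=(i, b))
               for i, b in enumerate(batches)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return commit_log

print(pipelined_commit([["w1", "w2"], ["w3"], ["w4", "w5"]]))  # [0, 1, 2]
```

However the random transfer delays interleave, the commit log always comes out in batch order.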

Page 15:

Outline

• Challenges
• Salus
  – Pipelined commit
  – Active storage
  – Scalable end-to-end checks

• Evaluation

Page 16:

Active storage

• Goal: a single node cannot corrupt data
• Well-known solution: BFT replication
  – Problem: requires at least 2f+1 replicas

• Salus' goal: use f+1 computing nodes
  – f+1 is enough to guarantee safety
  – What about availability/liveness?

Page 17:

Active storage: unanimous consent

[Figure: a single computing node in front of the storage nodes]

Page 18:

Active storage: unanimous consent

[Figure: f+1 computing nodes in front of the storage nodes]

• Unanimous consent:
  – All updates must be agreed on by the f+1 computing nodes.

• Additional benefit:
  – Colocating computing and storage saves network bandwidth.
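A storage node's acceptance rule can be sketched as follows; this is a toy model (the `unanimous_consent` helper and the string certificates are invented for illustration), capturing why f+1 matching endorsements suffice for safety:

```python
def unanimous_consent(certificates):
    """Apply an update only if all f+1 computing nodes endorsed the
    *same* update (unanimous consent). With at most f faulty nodes,
    at least one endorser is correct, so a single faulty computing
    node cannot sneak a corrupted write past the storage nodes."""
    first = certificates[0]
    if all(c == first for c in certificates):
        return ("apply", first)
    # Disagreement means some computing node is faulty; since we may
    # not know which one, the whole set of computing nodes is replaced.
    return ("replace-computing-nodes", None)

print(unanimous_consent(["write(block=9, v=7)"] * 2))
# ('apply', 'write(block=9, v=7)')
print(unanimous_consent(["write(block=9, v=7)", "write(block=9, v=8)"]))
# ('replace-computing-nodes', None)
```

Note the trade-off the slides describe: one disagreeing replica blocks progress (liveness), which is why the recovery path on the next slides replaces the entire computing-node set.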

Page 19:

Active storage: restore availability

[Figure: computing nodes and storage nodes]

• What if something goes wrong?
  – Problem: we may not know which node is faulty.

• Replace all computing nodes

Page 20:

Active storage: restore availability

[Figure: computing nodes and storage nodes]

• What if something goes wrong?
  – Problem: we may not know which node is faulty.

• Replace all computing nodes
  – Computing nodes do not have persistent state.

Page 21:

Discussion

• Does Salus achieve BFT with f+1 replication?
  – No. A fork can happen under a combination of failures.

• Is it worth using 2f+1 replicas to eliminate this case?
  – No. Forks can still happen as a result of other events (e.g., geo-replication).

Page 22:

Outline

• Challenges
• Salus
  – Pipelined commit
  – Active storage
  – Scalable end-to-end checks

• Evaluation

Page 23:

Evaluation

• Is Salus robust against failures?

• Do the additional mechanisms hurt scalability?

Page 24:

Is Salus robust against failures?


• Always ensures barrier semantics
  – Experiment: crash the block driver
  – Result: requests after "holes" are discarded

• A single computing node cannot corrupt data
  – Experiment: inject errors into a computing node
  – Result: storage nodes reject it and trigger replacement

• Block drivers never accept corrupted data
  – Experiment: inject errors into a computing/storage node
  – Result: the block driver detects the errors and retries other nodes

Page 25:

Do the additional mechanisms hurt scalability?

Page 26:

Conclusion

• Pipelined commit
  – Writes in parallel
  – Provides ordering guarantees

• Active storage
  – Eliminates single points of failure
  – Consumes less network bandwidth


High robustness ≠ Low performance

Page 27:

Salus: scalable end-to-end verification

• Goal: read safely from one node
  – The client should be able to verify the reply.
  – If the reply is corrupted, the client retries another node.

• Well-known solution: Merkle tree
  – Problem: scalability

• Salus' solution:
  – Single writer
  – Distribute the tree among servers

Page 28:

Scalable end-to-end verification

[Figure: the Merkle tree is distributed across servers 1–4; the client maintains only the top of the tree]

The client does not need to store anything persistently: it can rebuild the top tree from the servers.
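This split can be sketched as follows; a minimal illustration (class and method names are invented, and each server holds just two blocks for brevity, not Salus's actual layout), showing a client that keeps only the per-server subtree roots and verifies each read against them:

```python
import hashlib

def h(*parts):
    """Hash one or more strings together (SHA-256, hex digest)."""
    m = hashlib.sha256()
    for p in parts:
        m.update(p.encode())
    return m.hexdigest()

class Server:
    """Stores a range of blocks plus the Merkle subtree over them
    (two blocks per server here, for brevity)."""
    def __init__(self, blocks):
        self.blocks = list(blocks)
    def subtree_root(self):
        return h(h(self.blocks[0]), h(self.blocks[1]))
    def read(self, i):
        # Return the block plus its Merkle proof (the sibling's hash).
        return self.blocks[i], h(self.blocks[1 - i])

class Client:
    """Keeps only the top of the tree: one subtree root per server.
    Nothing here is persistent; the tops can always be rebuilt by
    asking each server for its subtree root."""
    def __init__(self, servers):
        self.servers = servers
        self.tops = [s.subtree_root() for s in servers]
    def verified_read(self, server_idx, block_idx):
        block, sibling = self.servers[server_idx].read(block_idx)
        left, right = ((h(block), sibling) if block_idx == 0
                       else (sibling, h(block)))
        if h(left, right) != self.tops[server_idx]:
            raise ValueError("corrupted reply: retry another replica")
        return block

servers = [Server(["a0", "a1"]), Server(["b0", "b1"])]
client = Client(servers)
assert client.verified_read(1, 0) == "b0"
servers[1].blocks[0] = "corrupted"        # a faulty node mutates data
try:
    client.verified_read(1, 0)
except ValueError:
    print("corruption detected")
```

Because each server only maintains the subtree over its own blocks, the tree scales with the number of servers while the client's state stays small.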

Page 29:

Scalable and robust storage

More hardware + more complex software → more failures

Page 30:

Start from “Scalable”

Scalable

Robust

Page 31:

Pipelined commit - challenge

• Is 2PC slow?
  – Locking? Single writer, so no locking.
  – Phase 2 network overhead? Small compared to phase 1.
  – Phase 2 disk overhead? Eliminated; the complexity is pushed to recovery.

• Ordering across transactions
  – Easy when failure-free.
  – Complicates recovery.

Key challenge: Recovery

Page 32:

Scalable architecture

Computation node:
• No persistent storage
• Data forwarding, garbage collection, etc.
• Tablet server (Bigtable), region server (HBase), stream manager (WAS), …

Computation node

Storage nodes

Page 33:

Salus: Overview

• Block store interface (Amazon EBS):
  – Volume: a fixed number of fixed-size blocks
  – Single block driver per volume: barrier semantics

• Failure model: arbitrary failures, but no identity forging
• Key ideas:
  – Pipelined commit: write in parallel and in order
  – Active storage: replicate computation nodes
  – Scalable end-to-end checks: the client verifies replies

Page 34:

Complex recovery

• Still need to guarantee barrier semantics.

• What if servers fail concurrently?

• What if different replicas diverge?

Page 35:

Support concurrent users

• No perfect solution yet

• Give up linearizability
  – Causal or Fork-Join-Causal consistency (Depot)

• Give up scalability
  – SUNDR

Page 36:

How to handle storage node failures?

• Computing nodes generate certificates when writing to storage nodes.

• If a storage node fails, the computing nodes agree on the current state.
  – Approach: look for the longest valid prefix.

• After that, re-replicate the data.

• As long as one correct storage node is available, there is no data loss.
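One possible reading of "look for the longest valid prefix" can be sketched as a toy scan; this is an invented simplification (the `longest_valid_prefix` helper and its agreement rule are assumptions, not Salus's actual recovery protocol), keeping entries only while the surviving logs agree and each entry's certificate checks out:

```python
def longest_valid_prefix(replica_logs, certificate_ok):
    """Toy recovery sketch: scan the surviving storage nodes' logs in
    order and keep the longest prefix on which all logs agree and every
    entry carries a valid certificate from the computing nodes."""
    prefix = []
    for entries in zip(*replica_logs):      # stops at the shortest log
        first = entries[0]
        if all(e == first for e in entries) and certificate_ok(first):
            prefix.append(first)
        else:
            break
    return prefix

logs = [["w1", "w2", "w3"],
        ["w1", "w2"],          # this replica crashed before receiving w3
        ["w1", "w2", "w3"]]
print(longest_valid_prefix(logs, lambda e: True))  # ['w1', 'w2']
```

Everything after the agreed prefix is discarded, after which the data in the prefix is re-replicated onto fresh storage nodes.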

Page 37:

Why barrier semantics?

• Goal: provide atomicity
• Example (file system): to append to a file, we need to modify a data block and the corresponding pointer to that block.
• It is bad if the pointer is modified but the data is not.
• Solution: modify the data first and then the pointer. The pointer write carries a barrier.
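The append example can be made concrete with a small simulation; this is an invented sketch (addresses like `blk7`/`inode3` and the helpers are hypothetical), checking that at every possible crash point the pointer is never durable without its data:

```python
def append_writes(data_addr, data, ptr_addr, ptr):
    # Write the data block first; the pointer write carries the barrier,
    # so it can never be reordered ahead of the data.
    return [
        {"addr": data_addr, "value": data, "barrier": False},
        {"addr": ptr_addr,  "value": ptr,  "barrier": True},
    ]

def crash_after(writes, n):
    """Disk state if the system crashes once the first n writes are durable."""
    return {w["addr"]: w["value"] for w in writes[:n]}

writes = append_writes("blk7", "new bytes", "inode3", "-> blk7")
for n in range(len(writes) + 1):
    state = crash_after(writes, n)
    # Invariant: the pointer is never on disk without its data.
    assert "inode3" not in state or "blk7" in state
print("barrier invariant holds at every crash point")
```

If the two writes were allowed to reorder, the crash state `{"inode3": "-> blk7"}` would be reachable: a pointer to garbage, which is exactly what barrier semantics rule out.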

Page 38:

Cost of robustness

[Figure: cost vs. robustness. From low to high: Crash < Salus < Eliminate fork (use 2f+1 replication) < Fully BFT (use signatures and N-way programming)]