![Page 1: Robustness in the Salus scalable block store Yang Wang, Manos Kapritsos, Zuocheng Ren, Prince Mahajan, Jeevitha Kirubanandam, Lorenzo Alvisi, and Mike](https://reader033.vdocuments.us/reader033/viewer/2022042822/56649f225503460f94c3a4c1/html5/thumbnails/1.jpg)
Robustness in the Salus scalable block store
Yang Wang, Manos Kapritsos, Zuocheng Ren, Prince Mahajan, Jeevitha Kirubanandam,
Lorenzo Alvisi, and Mike DahlinUniversity of Texas at Austin
1
![Page 2: Robustness in the Salus scalable block store Yang Wang, Manos Kapritsos, Zuocheng Ren, Prince Mahajan, Jeevitha Kirubanandam, Lorenzo Alvisi, and Mike](https://reader033.vdocuments.us/reader033/viewer/2022042822/56649f225503460f94c3a4c1/html5/thumbnails/2.jpg)
Storage: Do not lose data
Fsck, IRON, ZFS, …Past
Now & Future
2
![Page 3: Robustness in the Salus scalable block store Yang Wang, Manos Kapritsos, Zuocheng Ren, Prince Mahajan, Jeevitha Kirubanandam, Lorenzo Alvisi, and Mike](https://reader033.vdocuments.us/reader033/viewer/2022042822/56649f225503460f94c3a4c1/html5/thumbnails/3.jpg)
Salus: Provide robustness to scalable systems
Scalable systems: Crash failure + certain disk corruption(GFS/Bigtable, HDFS/HBase, WAS, …)
Strong protections: Arbitrary failure (BFT, End-to-end verification, …) Salus
3
![Page 4: Robustness in the Salus scalable block store Yang Wang, Manos Kapritsos, Zuocheng Ren, Prince Mahajan, Jeevitha Kirubanandam, Lorenzo Alvisi, and Mike](https://reader033.vdocuments.us/reader033/viewer/2022042822/56649f225503460f94c3a4c1/html5/thumbnails/4.jpg)
Start from “Scalable”
4
File systemFile system Key valueKey value DatabaseDatabaseBlock storeBlock storeBlock storeBlock store
Scalable
Robust
Key valueKey value
Borrow scalable designs
![Page 5: Robustness in the Salus scalable block store Yang Wang, Manos Kapritsos, Zuocheng Ren, Prince Mahajan, Jeevitha Kirubanandam, Lorenzo Alvisi, and Mike](https://reader033.vdocuments.us/reader033/viewer/2022042822/56649f225503460f94c3a4c1/html5/thumbnails/5.jpg)
Scalable architecture
Metadata server
Data servers
Clients
Parallel data transfer
Data is replicated for durability and availability5
Computing node:• No persistent storage
Storage nodes
![Page 6: Robustness in the Salus scalable block store Yang Wang, Manos Kapritsos, Zuocheng Ren, Prince Mahajan, Jeevitha Kirubanandam, Lorenzo Alvisi, and Mike](https://reader033.vdocuments.us/reader033/viewer/2022042822/56649f225503460f94c3a4c1/html5/thumbnails/6.jpg)
Problems
?
• No ordering guarantees for writes• Single points of failures:
– Computing nodes can corrupt data.– Client can accept corrupted data.
6
![Page 7: Robustness in the Salus scalable block store Yang Wang, Manos Kapritsos, Zuocheng Ren, Prince Mahajan, Jeevitha Kirubanandam, Lorenzo Alvisi, and Mike](https://reader033.vdocuments.us/reader033/viewer/2022042822/56649f225503460f94c3a4c1/html5/thumbnails/7.jpg)
1. No ordering guarantees for writes
Metadata server
Data servers
Clients
Block store requires barrier semantics
7
1 2 3
1 2
![Page 8: Robustness in the Salus scalable block store Yang Wang, Manos Kapritsos, Zuocheng Ren, Prince Mahajan, Jeevitha Kirubanandam, Lorenzo Alvisi, and Mike](https://reader033.vdocuments.us/reader033/viewer/2022042822/56649f225503460f94c3a4c1/html5/thumbnails/8.jpg)
2. Computing nodes can corrupt data
Metadata server
Data servers
Clients
Single point of failure
8
![Page 9: Robustness in the Salus scalable block store Yang Wang, Manos Kapritsos, Zuocheng Ren, Prince Mahajan, Jeevitha Kirubanandam, Lorenzo Alvisi, and Mike](https://reader033.vdocuments.us/reader033/viewer/2022042822/56649f225503460f94c3a4c1/html5/thumbnails/9.jpg)
3. Client can accept corrupted data
Metadata server
Data servers
Clients
Single point of failure
Does a simple disk checksum work?Can not prevent errors in memory, etc.
9
![Page 10: Robustness in the Salus scalable block store Yang Wang, Manos Kapritsos, Zuocheng Ren, Prince Mahajan, Jeevitha Kirubanandam, Lorenzo Alvisi, and Mike](https://reader033.vdocuments.us/reader033/viewer/2022042822/56649f225503460f94c3a4c1/html5/thumbnails/10.jpg)
Salus: Overview
Metadata server
Block drivers(single driver per virtual disk)
Pipelined commit
Active storage
End-to-end verification
10
Failure model: Almost arbitrary but no impersonation
![Page 11: Robustness in the Salus scalable block store Yang Wang, Manos Kapritsos, Zuocheng Ren, Prince Mahajan, Jeevitha Kirubanandam, Lorenzo Alvisi, and Mike](https://reader033.vdocuments.us/reader033/viewer/2022042822/56649f225503460f94c3a4c1/html5/thumbnails/11.jpg)
Pipelined commit
• Goal: barrier semantics– A request can be marked as a barrier.– All previous ones must be executed before it.
• Naïve solution:– Waits at a barrier: Lose parallelism
• Look similar to distributed transaction– Well-known solution: Two phase commit (2PC)
11
![Page 12: Robustness in the Salus scalable block store Yang Wang, Manos Kapritsos, Zuocheng Ren, Prince Mahajan, Jeevitha Kirubanandam, Lorenzo Alvisi, and Mike](https://reader033.vdocuments.us/reader033/viewer/2022042822/56649f225503460f94c3a4c1/html5/thumbnails/12.jpg)
Problem of 2PC
1 2 3
4 5 4 5
Prepared
Driver
Servers
Leader
Prepared
Leader
Batch i
Batch i+1
1 3
2
12
1 2 3 4 5
Barriers
![Page 13: Robustness in the Salus scalable block store Yang Wang, Manos Kapritsos, Zuocheng Ren, Prince Mahajan, Jeevitha Kirubanandam, Lorenzo Alvisi, and Mike](https://reader033.vdocuments.us/reader033/viewer/2022042822/56649f225503460f94c3a4c1/html5/thumbnails/13.jpg)
Problem of 2PC
1 2 3
4 5 4 5
Driver
Servers
Leader
Leader
Batch i
Batch i+1
1 3
2
13
Commit
Commit
Problem can still happen.Need ordering guarantee between different batches.
![Page 14: Robustness in the Salus scalable block store Yang Wang, Manos Kapritsos, Zuocheng Ren, Prince Mahajan, Jeevitha Kirubanandam, Lorenzo Alvisi, and Mike](https://reader033.vdocuments.us/reader033/viewer/2022042822/56649f225503460f94c3a4c1/html5/thumbnails/14.jpg)
Pipelined commit
1 2 3
4 5
1 3
2
4 5
Lead of batch i-1
Driver
Servers
Leader
Commit
CommitLeader
Batch i
Batch i+1
Parallel data transfer
14
Batch i-1 committed
Batch i committed
Sequential commit
![Page 15: Robustness in the Salus scalable block store Yang Wang, Manos Kapritsos, Zuocheng Ren, Prince Mahajan, Jeevitha Kirubanandam, Lorenzo Alvisi, and Mike](https://reader033.vdocuments.us/reader033/viewer/2022042822/56649f225503460f94c3a4c1/html5/thumbnails/15.jpg)
Outline
• Challenges• Salus
– Pipelined commit– Active storage– Scalable end-to-end checks
• Evaluation
15
![Page 16: Robustness in the Salus scalable block store Yang Wang, Manos Kapritsos, Zuocheng Ren, Prince Mahajan, Jeevitha Kirubanandam, Lorenzo Alvisi, and Mike](https://reader033.vdocuments.us/reader033/viewer/2022042822/56649f225503460f94c3a4c1/html5/thumbnails/16.jpg)
Active storage
• Goal: a single node cannot corrupt data• Well-known solution: BFT replication
– Problem: require at least 2f+1 replicas
• Salus’ goal: Use f+1 computing nodes– f+1 is enough to guarantee safety– How about availability/liveness?
16
![Page 17: Robustness in the Salus scalable block store Yang Wang, Manos Kapritsos, Zuocheng Ren, Prince Mahajan, Jeevitha Kirubanandam, Lorenzo Alvisi, and Mike](https://reader033.vdocuments.us/reader033/viewer/2022042822/56649f225503460f94c3a4c1/html5/thumbnails/17.jpg)
Active storage: unanimous consentComputing node
Storage nodes
17
![Page 18: Robustness in the Salus scalable block store Yang Wang, Manos Kapritsos, Zuocheng Ren, Prince Mahajan, Jeevitha Kirubanandam, Lorenzo Alvisi, and Mike](https://reader033.vdocuments.us/reader033/viewer/2022042822/56649f225503460f94c3a4c1/html5/thumbnails/18.jpg)
Active storage: unanimous consentComputing nodes
Storage nodes
• Unanimous consent:– All updates must be agreed by f+1 computing nodes.
• Additional benefit: – Colocate computing and storage: save network bandwidth
18
![Page 19: Robustness in the Salus scalable block store Yang Wang, Manos Kapritsos, Zuocheng Ren, Prince Mahajan, Jeevitha Kirubanandam, Lorenzo Alvisi, and Mike](https://reader033.vdocuments.us/reader033/viewer/2022042822/56649f225503460f94c3a4c1/html5/thumbnails/19.jpg)
Active storage: restore availabilityComputing nodes
Storage nodes
• What if something goes wrong?– Problem: we may not know which one is faulty.
• Replace all computing nodes
19
![Page 20: Robustness in the Salus scalable block store Yang Wang, Manos Kapritsos, Zuocheng Ren, Prince Mahajan, Jeevitha Kirubanandam, Lorenzo Alvisi, and Mike](https://reader033.vdocuments.us/reader033/viewer/2022042822/56649f225503460f94c3a4c1/html5/thumbnails/20.jpg)
Active storage: restore availabilityComputing nodes
Storage nodes
• What if something goes wrong?– Problem: we may not know which one is faulty.
• Replace all computing nodes– Computing nodes do not have persistent states.
20
![Page 21: Robustness in the Salus scalable block store Yang Wang, Manos Kapritsos, Zuocheng Ren, Prince Mahajan, Jeevitha Kirubanandam, Lorenzo Alvisi, and Mike](https://reader033.vdocuments.us/reader033/viewer/2022042822/56649f225503460f94c3a4c1/html5/thumbnails/21.jpg)
Discussion
• Does Salus achieve BFT with f+1 replication?– No. Fork can happen in a combination of failures.
• Is it worth using 2f+1 replicas to eliminate this case?– No. Fork can still happen as a result of other events (e.g.
geo-replication).
21
![Page 22: Robustness in the Salus scalable block store Yang Wang, Manos Kapritsos, Zuocheng Ren, Prince Mahajan, Jeevitha Kirubanandam, Lorenzo Alvisi, and Mike](https://reader033.vdocuments.us/reader033/viewer/2022042822/56649f225503460f94c3a4c1/html5/thumbnails/22.jpg)
Outline
• Challenges• Salus
– Pipelined commit– Active storage– Scalable end-to-end checks
• Evaluation
22
![Page 23: Robustness in the Salus scalable block store Yang Wang, Manos Kapritsos, Zuocheng Ren, Prince Mahajan, Jeevitha Kirubanandam, Lorenzo Alvisi, and Mike](https://reader033.vdocuments.us/reader033/viewer/2022042822/56649f225503460f94c3a4c1/html5/thumbnails/23.jpg)
Evaluation
• Is Salus robust against failures?
• Do the additional mechanisms hurt scalability?
23
![Page 24: Robustness in the Salus scalable block store Yang Wang, Manos Kapritsos, Zuocheng Ren, Prince Mahajan, Jeevitha Kirubanandam, Lorenzo Alvisi, and Mike](https://reader033.vdocuments.us/reader033/viewer/2022042822/56649f225503460f94c3a4c1/html5/thumbnails/24.jpg)
Is Salus robust against failures?
24
• Always ensure barrier semantics– Experiment: Crash block driver– Result: Requests after “holes” are discarded
• Single computing node cannot corrupt data– Experiment: Inject errors into computing node– Result: Storage nodes reject it and trigger replacement
• Block drivers never accept corrupted data– Experiment: Inject errors into computing/storage node– Result: Block driver detects errors and retries other nodes
![Page 25: Robustness in the Salus scalable block store Yang Wang, Manos Kapritsos, Zuocheng Ren, Prince Mahajan, Jeevitha Kirubanandam, Lorenzo Alvisi, and Mike](https://reader033.vdocuments.us/reader033/viewer/2022042822/56649f225503460f94c3a4c1/html5/thumbnails/25.jpg)
Do the additional mechanisms hurt scalability?
25
![Page 26: Robustness in the Salus scalable block store Yang Wang, Manos Kapritsos, Zuocheng Ren, Prince Mahajan, Jeevitha Kirubanandam, Lorenzo Alvisi, and Mike](https://reader033.vdocuments.us/reader033/viewer/2022042822/56649f225503460f94c3a4c1/html5/thumbnails/26.jpg)
Conclusion
• Pipelined commit– Write in parallel – Provide ordering guarantees
• Active Storage– Eliminate single point of failures– Consume less network bandwidth
26
High robustness ≠ Low performance
![Page 27: Robustness in the Salus scalable block store Yang Wang, Manos Kapritsos, Zuocheng Ren, Prince Mahajan, Jeevitha Kirubanandam, Lorenzo Alvisi, and Mike](https://reader033.vdocuments.us/reader033/viewer/2022042822/56649f225503460f94c3a4c1/html5/thumbnails/27.jpg)
Salus: scalable end-to-end verification
• Goal: read safely from one node– The client should be able to verify the reply.– If corrupted, the client retries another node.
• Well-known solution: Merkle tree– Problem: scalability
• Salus’ solution:– Single writer– Distribute the tree among servers
27
![Page 28: Robustness in the Salus scalable block store Yang Wang, Manos Kapritsos, Zuocheng Ren, Prince Mahajan, Jeevitha Kirubanandam, Lorenzo Alvisi, and Mike](https://reader033.vdocuments.us/reader033/viewer/2022042822/56649f225503460f94c3a4c1/html5/thumbnails/28.jpg)
Scalable end-to-end verification
Server 1 Server 2 Server 3 Server 4
Client maintains the top tree.
Client does not need to store anything persistently.It can rebuild the top tree from the servers.
28
![Page 29: Robustness in the Salus scalable block store Yang Wang, Manos Kapritsos, Zuocheng Ren, Prince Mahajan, Jeevitha Kirubanandam, Lorenzo Alvisi, and Mike](https://reader033.vdocuments.us/reader033/viewer/2022042822/56649f225503460f94c3a4c1/html5/thumbnails/29.jpg)
Scalable and robust storage
More hardware More complex softwareMore failures
29
![Page 30: Robustness in the Salus scalable block store Yang Wang, Manos Kapritsos, Zuocheng Ren, Prince Mahajan, Jeevitha Kirubanandam, Lorenzo Alvisi, and Mike](https://reader033.vdocuments.us/reader033/viewer/2022042822/56649f225503460f94c3a4c1/html5/thumbnails/30.jpg)
Start from “Scalable”
Scalable
Robust
30
![Page 31: Robustness in the Salus scalable block store Yang Wang, Manos Kapritsos, Zuocheng Ren, Prince Mahajan, Jeevitha Kirubanandam, Lorenzo Alvisi, and Mike](https://reader033.vdocuments.us/reader033/viewer/2022042822/56649f225503460f94c3a4c1/html5/thumbnails/31.jpg)
Pipelined commit - challenge
• Is 2PC slow?– Locking? Single writer, no locking– Phase 2 network overhead? Small compare to Phase 1.– Phase 2 disk overhead? Eliminate that, push
complexity to recovery
• Order across transactions– Easy when failure-free.– Complicate recovery.
Key challenge: Recovery
31
![Page 32: Robustness in the Salus scalable block store Yang Wang, Manos Kapritsos, Zuocheng Ren, Prince Mahajan, Jeevitha Kirubanandam, Lorenzo Alvisi, and Mike](https://reader033.vdocuments.us/reader033/viewer/2022042822/56649f225503460f94c3a4c1/html5/thumbnails/32.jpg)
Scalable architecture
Computation node:• No persistent storage• Data forwarding, garbage collection, etc• Tablet server (Bigtable), Region server (HBase) , Stream Manager (WAS), …
32
Computation node
Storage nodes
![Page 33: Robustness in the Salus scalable block store Yang Wang, Manos Kapritsos, Zuocheng Ren, Prince Mahajan, Jeevitha Kirubanandam, Lorenzo Alvisi, and Mike](https://reader033.vdocuments.us/reader033/viewer/2022042822/56649f225503460f94c3a4c1/html5/thumbnails/33.jpg)
Salus: Overview
• Block store interface (Amazon EBS):– Volume: a fixed number of fixed-size blocks– Single block driver per volume: barrier semantic
• Failure model: arbitrary but no identity forge• Key ideas:
– Pipelined commit – write in parallel and in order– Active storage – replicate computation nodes– Scalable end-to-end checks – client verifies replies
33
![Page 34: Robustness in the Salus scalable block store Yang Wang, Manos Kapritsos, Zuocheng Ren, Prince Mahajan, Jeevitha Kirubanandam, Lorenzo Alvisi, and Mike](https://reader033.vdocuments.us/reader033/viewer/2022042822/56649f225503460f94c3a4c1/html5/thumbnails/34.jpg)
Complex recovery
• Still need to guarantee barrier semantic.
• What if servers fail concurrently?
• What if different replicas diverge?
34
![Page 35: Robustness in the Salus scalable block store Yang Wang, Manos Kapritsos, Zuocheng Ren, Prince Mahajan, Jeevitha Kirubanandam, Lorenzo Alvisi, and Mike](https://reader033.vdocuments.us/reader033/viewer/2022042822/56649f225503460f94c3a4c1/html5/thumbnails/35.jpg)
Support concurrent users
• No perfect solution now
• Give up linearizability– Casual or Fork-Join-Casual (Depot)
• Give up scalability– SUNDR
35
![Page 36: Robustness in the Salus scalable block store Yang Wang, Manos Kapritsos, Zuocheng Ren, Prince Mahajan, Jeevitha Kirubanandam, Lorenzo Alvisi, and Mike](https://reader033.vdocuments.us/reader033/viewer/2022042822/56649f225503460f94c3a4c1/html5/thumbnails/36.jpg)
How to handle storage node failures?
• Computing nodes generate certificates when writing to storage nodes
• If a storage node fails, computing nodes agree on the current states
• Approach: look for longest and valid prefix• After that, re-replicate data• As long as one correct storage node is
available, no data loss
36
![Page 37: Robustness in the Salus scalable block store Yang Wang, Manos Kapritsos, Zuocheng Ren, Prince Mahajan, Jeevitha Kirubanandam, Lorenzo Alvisi, and Mike](https://reader033.vdocuments.us/reader033/viewer/2022042822/56649f225503460f94c3a4c1/html5/thumbnails/37.jpg)
Why barrier semantics?
• Goal: provide atomicity• Example (file system): to append to a file, we
need to modify a data block and the corresponding pointer to that block
• It’s bad if pointer is modified but data is not.• Solution: modify data first and then pointer.
Pointer is attached with a barrier.
37
![Page 38: Robustness in the Salus scalable block store Yang Wang, Manos Kapritsos, Zuocheng Ren, Prince Mahajan, Jeevitha Kirubanandam, Lorenzo Alvisi, and Mike](https://reader033.vdocuments.us/reader033/viewer/2022042822/56649f225503460f94c3a4c1/html5/thumbnails/38.jpg)
Cost of robustness
38
Robustness
Cost
CrashSalus
Eliminate fork
Fully BFT
Use 2f+1 replication
Use signature and N-way programming