Exalt NSDI clean - USENIX · Exalt: Empowering Researchers to ...
Exalt: Empowering Researchers to Evaluate Large-Scale Storage Systems
Yang Wang, Manos Kapritsos, Lara Schmidt, Lorenzo Alvisi, and Mike Dahlin
The University of Texas at Austin
We need to evaluate our prototypes
Design Implementation
Evaluation
Industrial deployment: tens of PBs, thousands of nodes
Researchers: hundreds of TBs, hundreds of nodes
• Salus (Wang et al. NSDI 13): 108 servers
• Eiger (Lloyd et al. NSDI 13): 256 servers
• Spanner (Corbett et al. OSDI 12): Hundreds of servers
How does one validate the scalability of a storage system?
Extrapolation?
• Measure with a small cluster
• Predict the behavior at full scale
• Assumption:
  – Resource consumption grows linearly with scale
[Figure: resource utilization (CPU, network) versus scale; measured values at 100 nodes: 10% and 5%]
Extrapolate: The system can scale to 1,000 nodes.
• The assumption may not hold: resource consumption may not grow linearly with scale
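The extrapolation argument can be made concrete with a little arithmetic. The sketch below is illustrative only: it takes the highest utilization shown in the figure (10% at 100 nodes) and computes where that resource would saturate, first under the linear assumption and then under a hypothetical quadratic growth model. The quadratic model and the method names are assumptions for illustration, not something the slides specify.

```java
public class LinearExtrapolation {
    /**
     * Node count at which a resource reaches 100% utilization, assuming
     * utilization grows linearly with the number of nodes.
     */
    static long maxNodesLinear(long measuredNodes, double utilizationPercent) {
        return (long) Math.floor(measuredNodes * (100.0 / utilizationPercent));
    }

    /**
     * Same question under a hypothetical model in which utilization grows
     * with the square of the number of nodes.
     */
    static long maxNodesQuadratic(long measuredNodes, double utilizationPercent) {
        return (long) Math.floor(measuredNodes * Math.sqrt(100.0 / utilizationPercent));
    }

    public static void main(String[] args) {
        // 10% utilization measured on 100 nodes.
        System.out.println("linear:    " + maxNodesLinear(100, 10.0));    // 1000
        System.out.println("quadratic: " + maxNodesQuadratic(100, 10.0)); // 316
    }
}
```

Under the linear assumption the measurement supports the 1,000-node claim; if growth were quadratic, the same measurement would predict saturation at roughly a third of that scale, which is why the assumption matters.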
Can we run prototypes at full scale?
• Colocate multiple processes on one node
[Figure: multiple processes colocated on each machine]
Problem: Limited I/O resource
Data content doesn’t affect system behavior
• Clients can write/read synthetic data
• Abstract away data on I/O devices
• Reduce resource requirement of each process
How to abstract away data?
• Discard data? (David, Agrawal et al. FAST 2011)
  – Doesn’t work with large-scale storage systems
[Figure: the upper layer (Bigtable, HBase, …) stores data and metadata; the lower layer (GFS, HDFS, …) treats all bytes as data]
• Our approach: Compress data
Requirements of compression
• CPU efficient
  – General-purpose algorithms (e.g. Gzip) are CPU heavy
• High compression ratio
• Lossless compression
• Be able to work with mixed data and metadata
Challenge: Data mixed with metadata
• System may add metadata
• System may split data (possibly nondeterministically)
[Figure: metadata mixed with client data]
Key: Locate metadata inside data
Solution: Tardis data pattern
• Make data distinguishable from metadata
  – Flag: sequence of bytes that does not appear in metadata
• Efficiently locate metadata: Follow sorted pattern
  – Marker: number of remaining bytes to the end
[Figure: Tardis pattern of repeated (Flag, Marker) pairs with markers 1016, 1008, 1000, ..., 0; 1KB data chunk and 4-byte flags and markers]
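The layout described on this slide can be sketched as a generator that fills a chunk with (flag, marker) pairs. This is a minimal illustration, not the Exalt implementation: the flag value, class name, and method names are hypothetical, and the sketch assumes big-endian 4-byte fields and a chunk size that is a multiple of 8.

```java
import java.nio.ByteBuffer;

public class TardisPattern {
    // Hypothetical 4-byte flag; in practice it must not occur in metadata.
    static final int FLAG = 0x7A5D1E3C;
    static final int UNIT = 8; // 4-byte flag + 4-byte marker

    /** Fills a chunk of the given size with (flag, marker) pairs. */
    static byte[] makeChunk(int size) {
        if (size % UNIT != 0) {
            throw new IllegalArgumentException("size must be a multiple of " + UNIT);
        }
        ByteBuffer buf = ByteBuffer.allocate(size);
        // Each marker holds the number of bytes remaining after its own pair.
        for (int remaining = size - UNIT; remaining >= 0; remaining -= UNIT) {
            buf.putInt(FLAG);
            buf.putInt(remaining);
        }
        return buf.array();
    }

    /** Reads the marker of the pair that starts at the given offset. */
    static int markerAt(byte[] chunk, int offset) {
        return ByteBuffer.wrap(chunk).getInt(offset + 4);
    }

    public static void main(String[] args) {
        byte[] chunk = makeChunk(1024);
        // Prints: 1016 1008 ... 0
        System.out.println(markerAt(chunk, 0) + " " + markerAt(chunk, 8)
                + " ... " + markerAt(chunk, 1016));
    }
}
```

Because the markers count down to zero, any fragment of the chunk still tells a reader how far it is from the end of the original chunk, even after the system splits the data.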
Tardis compression
• Search for flag
• Retrieve marker, skip 504 bytes
• Search for flag again
• Retrieve marker, skip 1016 bytes: Hit the end of chunk
[Figure: original data containing Tardis runs (markers 504 ... 0 and 1016 1008) is mapped to compressed data, where each run is recorded as "Tardis starting point:length", here Tardis 1024:16 and Tardis 512:512]
33,000 times faster than gzip
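The scan described above (search for a flag, read the marker, skip ahead) can be sketched as follows. This is an illustrative reading of the slide rather than the Exalt code: the flag value, class name, and record format are hypothetical, and the sketch assumes that the "starting point" of a run is its first marker plus the 8-byte pair size, which reproduces the 512:512 and 1024:16 labels in the figure. For brevity it records only the length of non-Tardis bytes; a lossless compressor would keep those bytes verbatim.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

public class TardisCompressor {
    static final int FLAG = 0x7A5D1E3C; // hypothetical flag value
    static final int UNIT = 8;          // 4-byte flag + 4-byte marker

    /** Builds a slice of Tardis data: pairs with markers firstMarker, firstMarker - 8, ... */
    static byte[] pattern(int firstMarker, int length) {
        ByteBuffer b = ByteBuffer.allocate(length);
        for (int m = firstMarker; b.remaining() >= UNIT; m -= UNIT) {
            b.putInt(FLAG).putInt(m);
        }
        return b.array();
    }

    /** Scans buf and returns one record per segment, in order. */
    static List<String> compress(byte[] buf) {
        ByteBuffer in = ByteBuffer.wrap(buf);
        List<String> out = new ArrayList<>();
        int pos = 0;
        int literalStart = 0;
        while (pos + UNIT <= buf.length) {
            // Search for the flag one byte at a time.
            if (in.getInt(pos) != FLAG || in.getInt(pos + 4) < 0) {
                pos++;
                continue;
            }
            int marker = in.getInt(pos + 4); // bytes remaining after this pair
            if (pos > literalStart) {
                out.add("Literal " + (pos - literalStart));
            }
            // Skip the rest of the run without reading it, stopping at the
            // end of the buffer if the run was cut short.
            int runEnd = (int) Math.min((long) pos + UNIT + marker, buf.length);
            out.add("Tardis " + ((long) marker + UNIT) + ":" + (runEnd - pos));
            pos = runEnd;
            literalStart = pos;
        }
        if (buf.length > literalStart) {
            out.add("Literal " + (buf.length - literalStart));
        }
        return out;
    }

    public static void main(String[] args) {
        ByteBuffer data = ByteBuffer.allocate(4 + 512 + 3 + 16);
        data.put(new byte[] {1, 2, 3, 4}).put(pattern(504, 512)); // metadata + tail of a chunk
        data.put(new byte[] {9, 9, 9}).put(pattern(1016, 16));    // metadata + head of a chunk
        // Prints: [Literal 4, Tardis 512:512, Literal 3, Tardis 1024:16]
        System.out.println(compress(data.array()));
    }
}
```

The speed comes from the skip: once a marker is read, the scan jumps over the whole run instead of examining each byte, so the work is proportional to the amount of metadata rather than the amount of data.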
How to find an appropriate flag?
• Scan all metadata: Expensive
• Observation: Tardis is only used for testing
• A randomly chosen 8-byte flag works
  – HDFS
  – HBase
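A back-of-the-envelope calculation suggests why a random flag is a reasonable choice. The slides do not give this argument; the sketch below is an added illustration that assumes the metadata bytes are independent of the flag choice. A uniformly random 64-bit flag matches at any fixed offset with probability 2^-64, so by a union bound the chance that it appears anywhere in 2^k bytes of metadata is at most 2^(k-64). The class and method names are hypothetical.

```java
import java.security.SecureRandom;

public class FlagCollisionBound {
    /** Picks a uniformly random 8-byte flag. */
    static long randomFlag() {
        return new SecureRandom().nextLong();
    }

    /**
     * Union bound on the probability that a random flag of flagBits bits
     * appears somewhere in 2^log2Bytes bytes of metadata.
     */
    static double bound(int log2Bytes, int flagBits) {
        return Math.min(1.0, Math.scalb(1.0, log2Bytes - flagBits));
    }

    public static void main(String[] args) {
        System.out.printf("flag = 0x%016x%n", randomFlag());
        // 1 TiB of metadata is 2^40 bytes.
        System.out.println("bound for 1 TiB of metadata: " + bound(40, 64));
    }
}
```

For 1 TiB of metadata the bound is 2^-24, about 6 in 100 million, and since Tardis is only used for testing, an unlucky choice costs a rerun rather than lost data.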
Testing with Tardis
• Run potential bottleneck nodes in real mode.
• Run most nodes in emulated mode.
[Figure: clients, processes, network, and the bottleneck node]
Implementation
• Emulated devices: disk, network, and memory
• Disk and network: Transparent emulation
  – Byte code instrumentation (BCI)
  – Usage: java -Xbootclasspath exalt.jar <original app>
• Memory: Require code modification
  – None for HDFS; 71 LOC for HBase
Case studies
• Apply our emulator to HDFS and HBase
  – Measure their scalability
  – When we find a problem, analyze its root cause, and fix it
• Testbed:
  – Texas Advanced Computing Center (TACC)
Scalability of HDFS
Increase number of RPC threads
Put debug information in tmpfs
Same as reported by HDFS developers
Disable sync Put metadata in tmpfs
One problem of HDFS: Big files
HDFS performance degrades as a file grows large.
    long[] computeContentSummary(long[] summary) {
        long bytes = 0;
        // Visits every block of the file, so the cost grows with file size.
        for (Block blk : blocks) {
            bytes += blk.getNumBytes();
        }
        summary[0] += bytes;
        ……
    }
Applying Exalt more broadly
• CPU intensive systems?
  – DieCast (Gupta et al. NSDI 2008)
• Data sensitive applications/benchmarks?
  – Record (on a large testbed) and replay (on a small one)
• The target system modifies data?
  – Ad-hoc solutions for de-duplication, encryption, etc.
Conclusion
Industry
Researchers
https://code.google.com/p/exalt/