Less is More: 2X Storage Efficiency with HDFS Erasure Coding
![Page 1: Less is More: 2X Storage Efficiency with HDFS Erasure Coding](https://reader035.vdocuments.us/reader035/viewer/2022062401/586fde9e1a28ab18428b6c73/html5/thumbnails/1.jpg)
LESS IS MORE
2X storage efficiency with HDFS erasure coding
![Page 2: Less is More: 2X Storage Efficiency with HDFS Erasure Coding](https://reader035.vdocuments.us/reader035/viewer/2022062401/586fde9e1a28ab18428b6c73/html5/thumbnails/2.jpg)
Replication is Expensive
HDFS inherits 3-way replication from Google File System
- Simple, scalable and robust
200% storage overhead
Secondary replicas rarely accessed
![Page 3: Less is More: 2X Storage Efficiency with HDFS Erasure Coding](https://reader035.vdocuments.us/reader035/viewer/2022062401/586fde9e1a28ab18428b6c73/html5/thumbnails/3.jpg)
Erasure Coding Saves Storage
Simplified Example: storing 2 bits
Replication: 1 0 stored as 1 0 1 0 (2 extra bits)
XOR Coding: 1 0 stored as 1 0 1, where 1 ⊕ 0 = 1 (1 extra bit)
Same data durability
- can lose any 1 bit
Half the storage overhead
Slower recovery
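The 2-bit example can be sketched in a few lines of Python. This is a minimal illustration, not HDFS code; the helper names `xor_parity` and `recover` are made up for this sketch.

```python
def xor_parity(bits):
    """Return the XOR of all bits, i.e. the single parity bit."""
    parity = 0
    for b in bits:
        parity ^= b
    return parity


def recover(stored, lost_index):
    """Rebuild the bit at lost_index by XOR-ing every surviving bit."""
    survivors = [b for i, b in enumerate(stored) if i != lost_index]
    return xor_parity(survivors)


data = [1, 0]
stored = data + [xor_parity(data)]  # 2 data bits + 1 parity bit

# Any single lost bit (data or parity) can be rebuilt from the other two.
for i in range(len(stored)):
    assert recover(stored, i) == stored[i]
```

Recovery has to read all surviving bits and recompute, whereas replication only copies one surviving replica, which is one way to read the "slower recovery" point.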
![Page 4: Less is More: 2X Storage Efficiency with HDFS Erasure Coding](https://reader035.vdocuments.us/reader035/viewer/2022062401/586fde9e1a28ab18428b6c73/html5/thumbnails/4.jpg)
Erasure Coding Saves Storage
Facebook
- f4 stores 65PB of BLOBs in EC
Windows Azure Storage (WAS)
- A PB of new data every 1~2 days
- All “sealed” data stored in EC
Google File System
- Large portion of data stored in EC
![Page 5: Less is More: 2X Storage Efficiency with HDFS Erasure Coding](https://reader035.vdocuments.us/reader035/viewer/2022062401/586fde9e1a28ab18428b6c73/html5/thumbnails/5.jpg)
Roadmap
Background of EC
- Redundancy Theory
- EC in Distributed Storage Systems
HDFS-EC architecture
- Choosing Block Layout
- NameNode: Generalizing the Block Concept
- Client: Parallel I/O
- DataNode: Background Reconstruction
Hardware-accelerated Codec Framework
![Page 6: Less is More: 2X Storage Efficiency with HDFS Erasure Coding](https://reader035.vdocuments.us/reader035/viewer/2022062401/586fde9e1a28ab18428b6c73/html5/thumbnails/6.jpg)
Durability and Efficiency
Data Durability = How many simultaneous failures can be tolerated?
Storage Efficiency = What portion of storage holds useful data?
3-way Replication:
Data Durability = 2
Storage Efficiency = 1/3 (33%)
(figure legend: useful data, redundant data)
![Page 7: Less is More: 2X Storage Efficiency with HDFS Erasure Coding](https://reader035.vdocuments.us/reader035/viewer/2022062401/586fde9e1a28ab18428b6c73/html5/thumbnails/7.jpg)
Durability and Efficiency
XOR:
Data Durability = 1
Storage Efficiency = 2/3 (67%)
(figure legend: useful data, redundant data)

| X | Y | X ⊕ Y |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |

Y = 0 ⊕ 1 = 1
![Page 8: Less is More: 2X Storage Efficiency with HDFS Erasure Coding](https://reader035.vdocuments.us/reader035/viewer/2022062401/586fde9e1a28ab18428b6c73/html5/thumbnails/8.jpg)
Durability and Efficiency
Reed-Solomon (RS):
Data Durability = 2
Storage Efficiency = 4/6 (67%)
Very flexible!
![Page 9: Less is More: 2X Storage Efficiency with HDFS Erasure Coding](https://reader035.vdocuments.us/reader035/viewer/2022062401/586fde9e1a28ab18428b6c73/html5/thumbnails/9.jpg)
Durability and Efficiency

| Scheme | Data Durability | Storage Efficiency |
|---|---|---|
| Single Replica | 0 | 100% |
| 3-way Replication | 2 | 33% |
| XOR with 6 data cells | 1 | 86% |
| RS (6,3) | 3 | 67% |
| RS (10,4) | 4 | 71% |
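The comparison figures follow directly from the two definitions. The sketch below recomputes them, assuming that a code with k data cells and m parity cells tolerates any m lost cells (true for XOR with m = 1 and for Reed-Solomon). The function names are illustrative only.

```python
from fractions import Fraction


def replication(copies):
    """(durability, efficiency) for n-way replication."""
    return copies - 1, Fraction(1, copies)


def erasure_code(data_cells, parity_cells):
    """(durability, efficiency) for k data cells plus m parity cells."""
    return parity_cells, Fraction(data_cells, data_cells + parity_cells)


schemes = {
    "Single Replica": replication(1),
    "3-way Replication": replication(3),
    "XOR with 6 data cells": erasure_code(6, 1),
    "RS (6,3)": erasure_code(6, 3),
    "RS (10,4)": erasure_code(10, 4),
}

for name, (durability, efficiency) in schemes.items():
    print(f"{name:24s} durability={durability} efficiency={float(efficiency):.0%}")
```

RS (6,3) stores 1.5 bytes per useful byte where 3-way replication stores 3, which is where the 2X in the title comes from.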
![Page 10: Less is More: 2X Storage Efficiency with HDFS Erasure Coding](https://reader035.vdocuments.us/reader035/viewer/2022062401/586fde9e1a28ab18428b6c73/html5/thumbnails/10.jpg)
EC in Distributed Storage
Block Layout: Contiguous Layout
(figure: file ranges 0~128M, 128~256M, …, 640~768M are stored whole as block 0 on DataNode 0, block 1 on DataNode 1, …, block 5 on DataNode 5; parity on DataNode 6, …)
Data Locality 👍🏻
Small Files 👎🏻
![Page 11: Less is More: 2X Storage Efficiency with HDFS Erasure Coding](https://reader035.vdocuments.us/reader035/viewer/2022062401/586fde9e1a28ab18428b6c73/html5/thumbnails/11.jpg)
EC in Distributed Storage
Block Layout: Striped Layout
(figure: the file is cut into cells 0~1M, 1~2M, …, 5~6M, 6~7M, … that are spread across block 0 on DataNode 0, block 1 on DataNode 1, …, block 5 on DataNode 5; parity on DataNode 6, …)
Data Locality 👎🏻
Small Files 👍🏻
Parallel I/O 👍🏻
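The difference between the two layouts comes down to how a file offset maps to a data block. The following sketch uses the sizes shown on the slides (128M blocks, 1M cells, 6 data blocks) and assumes cells are assigned round-robin within a single block group. The function names are hypothetical.

```python
MB = 1024 * 1024
BLOCK_SIZE = 128 * MB  # contiguous layout: one block per 128M range
CELL_SIZE = 1 * MB     # striped layout: 1M cells
DATA_BLOCKS = 6        # block 0 .. block 5


def contiguous_location(offset):
    """Map a file offset to (data block index, offset inside that block)."""
    return offset // BLOCK_SIZE, offset % BLOCK_SIZE


def striped_location(offset):
    """Same mapping when cells are assigned round-robin to the data blocks."""
    cell = offset // CELL_SIZE
    stripe, block = divmod(cell, DATA_BLOCKS)
    return block, stripe * CELL_SIZE + offset % CELL_SIZE


# The first 6M of a file sits on one block when contiguous,
# but touches all six data blocks when striped.
contiguous_blocks = {contiguous_location(i * MB)[0] for i in range(6)}
striped_blocks = {striped_location(i * MB)[0] for i in range(6)}
```

Here `contiguous_blocks` contains only block 0 while `striped_blocks` contains blocks 0 through 5. That fits the slides' ratings: striping spreads even a small file over a full stripe and enables parallel I/O, at the cost of data locality.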
![Page 12: Less is More: 2X Storage Efficiency with HDFS Erasure Coding](https://reader035.vdocuments.us/reader035/viewer/2022062401/586fde9e1a28ab18428b6c73/html5/thumbnails/12.jpg)
EC in Distributed Storage
Spectrum:

| | Replication | Erasure Coding |
|---|---|---|
| Striping | Ceph, Quantcast File System | Ceph, Quantcast File System |
| Contiguous | HDFS | Facebook f4, Windows Azure |
![Page 13: Less is More: 2X Storage Efficiency with HDFS Erasure Coding](https://reader035.vdocuments.us/reader035/viewer/2022062401/586fde9e1a28ab18428b6c73/html5/thumbnails/13.jpg)
![Page 14: Less is More: 2X Storage Efficiency with HDFS Erasure Coding](https://reader035.vdocuments.us/reader035/viewer/2022062401/586fde9e1a28ab18428b6c73/html5/thumbnails/14.jpg)
Choosing Block Layout
Assuming (6,3) coding
• Small files: < 1 block
• Medium: 1~6 blocks
• Large: > 6 blocks (1 group)

| Profile | Metric | small | medium | large | Observation |
|---|---|---|---|---|---|
| Cluster A | file count | 96.29% | 1.86% | 1.85% | Top 2% files occupy ~65% space |
| Cluster A | space usage | 26.06% | 9.33% | 64.61% | |
| Cluster B | file count | 86.59% | 11.38% | 2.03% | Top 2% files occupy ~40% space |
| Cluster B | space usage | 23.89% | 36.03% | 40.08% | |
| Cluster C | file count | 99.64% | 0.36% | 0.00% | Dominated by small files |
| Cluster C | space usage | 76.05% | 20.75% | 3.20% | |
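A profile like the ones above can be produced by bucketing file sizes with the small/medium/large thresholds. The sketch below assumes a 128M block size (the size used in the layout slides) and runs on made-up sample data. `classify` and `profile` are illustrative names, not part of any HDFS tool.

```python
from collections import Counter

MB = 1024 * 1024


def classify(file_size, block_size=128 * MB, data_blocks=6):
    """Bucket a file by size, assuming (6,3) coding."""
    if file_size < block_size:
        return "small"   # < 1 block
    if file_size <= data_blocks * block_size:
        return "medium"  # 1~6 blocks
    return "large"       # > 6 blocks (more than 1 group)


def profile(file_sizes):
    """Return {bucket: (share of file count, share of space usage)}."""
    count, space = Counter(), Counter()
    for size in file_sizes:
        bucket = classify(size)
        count[bucket] += 1
        space[bucket] += size
    total_files, total_space = len(file_sizes), sum(file_sizes)
    return {
        bucket: (count[bucket] / total_files, space[bucket] / total_space)
        for bucket in ("small", "medium", "large")
    }


# Hypothetical cluster: 98 files of 1M and 2 files of 1024M.
sizes = [1 * MB] * 98 + [1024 * MB] * 2
shares = profile(sizes)
```

In this made-up sample the two large files are 2% of the file count but hold most of the space, the same shape as the Cluster A profile.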
![Page 15: Less is More: 2X Storage Efficiency with HDFS Erasure Coding](https://reader035.vdocuments.us/reader035/viewer/2022062401/586fde9e1a28ab18428b6c73/html5/thumbnails/15.jpg)
Choosing Block Layout
(figure label: Current HDFS)
![Page 16: Less is More: 2X Storage Efficiency with HDFS Erasure Coding](https://reader035.vdocuments.us/reader035/viewer/2022062401/586fde9e1a28ab18428b6c73/html5/thumbnails/16.jpg)
Generalizing Block: NameNode
Mapping Logical and Storage Blocks
Too Many Storage Blocks?
Hierarchical Naming Protocol:
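The figure for this slide is not part of the transcript. As a hypothetical illustration of how a hierarchical naming scheme can keep the number of tracked IDs down, the sketch below reserves the low-order bits of an ID for a block's index inside its group, so every storage block ID can be derived from the group ID. The 4-bit width and all function names are assumptions made for this sketch.

```python
INDEX_BITS = 4                      # assumption: low 4 bits hold the index in the group
INDEX_MASK = (1 << INDEX_BITS) - 1  # 0b1111


def storage_block_id(group_id, index):
    """Derive the ID of the index-th storage block in a block group."""
    if group_id & INDEX_MASK:
        raise ValueError("group IDs must leave the index bits clear")
    if not 0 <= index <= INDEX_MASK:
        raise ValueError("index does not fit in the reserved bits")
    return group_id | index


def group_of(block_id):
    """Recover the block group ID from any storage block ID."""
    return block_id & ~INDEX_MASK


def index_of(block_id):
    """Recover the position of a storage block inside its group."""
    return block_id & INDEX_MASK


group_id = 0x1230
block_id = storage_block_id(group_id, 7)
```

With a scheme of this kind, a report about `block_id` can be resolved to its group with one mask operation, so the metadata server needs only one entry per logical block group rather than one per storage block.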
![Page 17: Less is More: 2X Storage Efficiency with HDFS Erasure Coding](https://reader035.vdocuments.us/reader035/viewer/2022062401/586fde9e1a28ab18428b6c73/html5/thumbnails/17.jpg)
Client Parallel Writing
(figure components: queue, multiple streamers, Coordinator)
![Page 18: Less is More: 2X Storage Efficiency with HDFS Erasure Coding](https://reader035.vdocuments.us/reader035/viewer/2022062401/586fde9e1a28ab18428b6c73/html5/thumbnails/18.jpg)
Client Parallel Reading
(figure: client reading from data blocks and parity)
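A parallel striped read with on-the-fly recovery can be sketched as follows. This is a simplified illustration, not the HDFS client: it uses a single XOR parity cell instead of Reed-Solomon, and `fetch`, `xor_cells` and `read_stripe` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def xor_cells(cells):
    """Byte-wise XOR of equally sized cells."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*cells))


def read_stripe(fetch, data_blocks):
    """Fetch the data cells and the parity cell in parallel.

    fetch(i) returns the bytes of cell i, or None if that cell is lost.
    At most one lost data cell can be rebuilt from the XOR parity.
    """
    with ThreadPoolExecutor(max_workers=data_blocks + 1) as pool:
        cells = list(pool.map(fetch, range(data_blocks + 1)))
    missing = [i for i, cell in enumerate(cells[:data_blocks]) if cell is None]
    if len(missing) > 1 or (missing and cells[data_blocks] is None):
        raise OSError("too many lost cells for a single XOR parity")
    if missing:
        survivors = [cell for cell in cells if cell is not None]
        cells[missing[0]] = xor_cells(survivors)
    return b"".join(cells[:data_blocks])


# Six 4-byte data cells plus one parity cell; cell 2 is lost.
data = [bytes([i] * 4) for i in range(1, 7)]
store = data + [xor_cells(data)]
store[2] = None
result = read_stripe(lambda i: store[i], 6)
```

The read still returns the original bytes because the lost cell equals the XOR of the five surviving data cells and the parity cell.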
![Page 19: Less is More: 2X Storage Efficiency with HDFS Erasure Coding](https://reader035.vdocuments.us/reader035/viewer/2022062401/586fde9e1a28ab18428b6c73/html5/thumbnails/19.jpg)
Reconstruction on DataNode
Important to avoid delay on the critical path
- Especially if original data is lost
Integrated with Replication Monitor
- Under-protected EC blocks scheduled together with under-replicated blocks
- New priority algorithms
New ErasureCodingWorker component on DataNode
![Page 20: Less is More: 2X Storage Efficiency with HDFS Erasure Coding](https://reader035.vdocuments.us/reader035/viewer/2022062401/586fde9e1a28ab18428b6c73/html5/thumbnails/20.jpg)
![Page 21: Less is More: 2X Storage Efficiency with HDFS Erasure Coding](https://reader035.vdocuments.us/reader035/viewer/2022062401/586fde9e1a28ab18428b6c73/html5/thumbnails/21.jpg)
Acceleration with Intel ISA-L
1 legacy coder
- From Facebook’s HDFS-RAID project
2 new coders
- Pure Java: code improvement over HDFS-RAID
- Native coder with Intel’s Intelligent Storage Acceleration Library (ISA-L)
![Page 22: Less is More: 2X Storage Efficiency with HDFS Erasure Coding](https://reader035.vdocuments.us/reader035/viewer/2022062401/586fde9e1a28ab18428b6c73/html5/thumbnails/22.jpg)
Microbenchmark: Codec Calculation
![Page 23: Less is More: 2X Storage Efficiency with HDFS Erasure Coding](https://reader035.vdocuments.us/reader035/viewer/2022062401/586fde9e1a28ab18428b6c73/html5/thumbnails/23.jpg)
Microbenchmark: Codec Calculation
![Page 24: Less is More: 2X Storage Efficiency with HDFS Erasure Coding](https://reader035.vdocuments.us/reader035/viewer/2022062401/586fde9e1a28ab18428b6c73/html5/thumbnails/24.jpg)
Microbenchmark: HDFS I/O
![Page 25: Less is More: 2X Storage Efficiency with HDFS Erasure Coding](https://reader035.vdocuments.us/reader035/viewer/2022062401/586fde9e1a28ab18428b6c73/html5/thumbnails/25.jpg)
Hive-on-Spark — locality sensitive
![Page 26: Less is More: 2X Storage Efficiency with HDFS Erasure Coding](https://reader035.vdocuments.us/reader035/viewer/2022062401/586fde9e1a28ab18428b6c73/html5/thumbnails/26.jpg)
Conclusion
Erasure coding expands effective storage space by ~50%!
HDFS-EC phase I implements erasure coding in striped block layout
Upstream effort (HDFS-7285):
- Design finalized Nov. 2014
- Development started Jan. 2015
- 218 commits, ~25k LoC change
- Broad collaboration: Cloudera, Intel, Hortonworks, Huawei, Yahoo (Japan)
Phase II will support contiguous block layout for better locality
![Page 27: Less is More: 2X Storage Efficiency with HDFS Erasure Coding](https://reader035.vdocuments.us/reader035/viewer/2022062401/586fde9e1a28ab18428b6c73/html5/thumbnails/27.jpg)
Acknowledgements
Cloudera
- Andrew Wang, Aaron T. Myers, Colin McCabe, Todd Lipcon, Silvius Rus
Intel
- Kai Zheng, Uma Maheswara Rao G, Vinayakumar B, Yi Liu, Weihua Jiang
Hortonworks
- Jing Zhao, Tsz Wo Nicholas Sze
Huawei
- Walter Su, Rakesh R, Xinwei Qin
Yahoo (Japan)
- Gao Rui, Kai Sasaki, Takuya Fukudome, Hui Zheng