Native Erasure Coding Support Inside HDFS (presentation)
TRANSCRIPT
![Page 5: Native erasure coding support inside hdfs presentation](https://reader031.vdocuments.us/reader031/viewer/2022020301/586fde9e1a28ab18428b6c75/html5/thumbnails/5.jpg)
Replication is Expensive
§ HDFS inherits 3-way replication from Google File System
- Simple, scalable and robust
§ 200% storage overhead
§ Secondary replicas rarely accessed
[Diagram: NameNode; one Block with a Replica on each of DataNode0, DataNode1 and DataNode2]
![Page 14: Native erasure coding support inside hdfs presentation](https://reader031.vdocuments.us/reader031/viewer/2022020301/586fde9e1a28ab18428b6c75/html5/thumbnails/14.jpg)
Erasure Coding Saves Storage
§ Simplified Example: storing 2 bits
§ Same data durability
- can lose any 1 bit
§ Half the storage overhead
§ Slower recovery
Replication: 1 0 | 1 0 (2 extra bits)
XOR Coding: 1 0 | 1 ⊕ 0 = 1 (1 extra bit)
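The 2-bit example can be checked with a minimal Python sketch. The function names `xor_encode` and `xor_recover` are illustrative and do not come from the slides. The sketch stores two data bits plus one parity bit and rebuilds any single lost bit from the two survivors.

```python
def xor_encode(x, y):
    """Store two data bits plus one parity bit (1 extra bit instead of 2)."""
    return [x, y, x ^ y]


def xor_recover(stored):
    """Rebuild the single lost bit (marked None) from the two survivors.

    Because x ^ y ^ parity == 0, the missing value is always the XOR
    of the other two, whichever position was lost.
    """
    missing = stored.index(None)
    survivors = [bit for bit in stored if bit is not None]
    repaired = list(stored)
    repaired[missing] = survivors[0] ^ survivors[1]
    return repaired


damaged = xor_encode(1, 0)
damaged[0] = None            # lose the first data bit
print(xor_recover(damaged))  # the original three bits are restored
```

Recovery has to read both surviving bits, whereas replication only needs to read one surviving copy. That is one way to see the "slower recovery" bullet.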
![Page 18: Native erasure coding support inside hdfs presentation](https://reader031.vdocuments.us/reader031/viewer/2022020301/586fde9e1a28ab18428b6c75/html5/thumbnails/18.jpg)
Erasure Coding Saves Storage
§ Facebook
- f4 stores 65PB of BLOBs in EC
§ Windows Azure Storage (WAS)
- A PB of new data every 1~2 days
- All “sealed” data stored in EC
§ Google File System
- Large portion of data stored in EC
![Page 22: Native erasure coding support inside hdfs presentation](https://reader031.vdocuments.us/reader031/viewer/2022020301/586fde9e1a28ab18428b6c75/html5/thumbnails/22.jpg)
Roadmap
§ Background of EC
- Redundancy Theory
- EC in Distributed Storage Systems
§ HDFS-EC architecture
- Choosing Block Layout
- NameNode: Generalizing the Block Concept
- Client: Parallel I/O
- DataNode: Background Reconstruction
§ Hardware-accelerated Codec Framework
![Page 23: Native erasure coding support inside hdfs presentation](https://reader031.vdocuments.us/reader031/viewer/2022020301/586fde9e1a28ab18428b6c75/html5/thumbnails/23.jpg)
Durability and Efficiency
Data Durability = How many simultaneous failures can be tolerated?
Storage Efficiency = What portion of storage holds useful data?
![Page 29: Native erasure coding support inside hdfs presentation](https://reader031.vdocuments.us/reader031/viewer/2022020301/586fde9e1a28ab18428b6c75/html5/thumbnails/29.jpg)
Durability and Efficiency
3-way Replication:
Data Durability = 2
Storage Efficiency = 1/3 (33%)
[Diagram: NameNode; one Block with a Replica on each of DataNode0, DataNode1 and DataNode2; one replica is useful data, the other two are redundant data]
![Page 33: Native erasure coding support inside hdfs presentation](https://reader031.vdocuments.us/reader031/viewer/2022020301/586fde9e1a28ab18428b6c75/html5/thumbnails/33.jpg)
Durability and Efficiency
XOR:

| X | Y | X ⊕ Y |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |

Y = 0 ⊕ 1 = 1
Data Durability = 1
![Page 35: Native erasure coding support inside hdfs presentation](https://reader031.vdocuments.us/reader031/viewer/2022020301/586fde9e1a28ab18428b6c75/html5/thumbnails/35.jpg)
Durability and Efficiency
XOR: X and Y are useful data, X ⊕ Y is redundant data

| X | Y | X ⊕ Y |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |

Data Durability = 1
Storage Efficiency = 2/3 (67%)
![Page 39: Native erasure coding support inside hdfs presentation](https://reader031.vdocuments.us/reader031/viewer/2022020301/586fde9e1a28ab18428b6c75/html5/thumbnails/39.jpg)
Durability and Efficiency
Reed-Solomon (RS):
Data Durability = 2
Storage Efficiency = 4/6 (67%)
Very flexible!
![Page 56: Native erasure coding support inside hdfs presentation](https://reader031.vdocuments.us/reader031/viewer/2022020301/586fde9e1a28ab18428b6c75/html5/thumbnails/56.jpg)
Durability and Efficiency

| | Data Durability | Storage Efficiency |
|---|---|---|
| Single Replica | 0 | 100% |
| 3-way Replication | 2 | 33% |
| XOR with 6 data cells | 1 | 86% |
| RS (6,3) | 3 | 67% |
| RS (10,4) | 4 | 71% |
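The figures in this comparison follow from two counts per scheme: the number of data units and the number of redundant units. The short Python sketch below recomputes them. It assumes replication can be modelled as 1 data unit plus extra copies, and the function and variable names are illustrative only.

```python
def durability_and_efficiency(data_units, redundant_units):
    """Return (failures tolerated, fraction of storage holding useful data)."""
    total = data_units + redundant_units
    return redundant_units, data_units / total


# (data units, redundant units) per scheme; replication modelled as 1 + copies.
SCHEMES = {
    "Single Replica": (1, 0),
    "3-way Replication": (1, 2),
    "XOR with 6 data cells": (6, 1),
    "RS (6,3)": (6, 3),
    "RS (10,4)": (10, 4),
}


def summary():
    """Return rows of (scheme, durability, efficiency in whole percent)."""
    rows = []
    for name, (k, m) in SCHEMES.items():
        durability, efficiency = durability_and_efficiency(k, m)
        rows.append((name, durability, round(efficiency * 100)))
    return rows


for name, durability, percent in summary():
    print(f"{name:24} durability={durability} efficiency={percent}%")
```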
![Page 64: Native erasure coding support inside hdfs presentation](https://reader031.vdocuments.us/reader031/viewer/2022020301/586fde9e1a28ab18428b6c75/html5/thumbnails/64.jpg)
EC in Distributed Storage
Block Layout: Contiguous Layout
File: 0~128M | 128~256M | … | 640~768M
- block 0 (0~128M) on DataNode 0
- block 1 (128~256M) on DataNode 1
- …
- block 5 (640~768M) on DataNode 5
- parity on DataNode 6, …
Data Locality: good
Small Files: bad
![Page 68: Native erasure coding support inside hdfs presentation](https://reader031.vdocuments.us/reader031/viewer/2022020301/586fde9e1a28ab18428b6c75/html5/thumbnails/68.jpg)
EC in Distributed Storage
Block Layout: Striped Layout
File cells: 0~1M | 1~2M | … | 5~6M | 6~7M | …
- block 0 on DataNode 0: cells 0~1M, 6~7M, …
- block 1 on DataNode 1: cell 1~2M, …
- …
- block 5 on DataNode 5: cell 5~6M, …
- parity on DataNode 6, …
Data Locality: bad
Small Files: good
Parallel I/O: good
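The difference between the two layouts comes down to how a file offset maps to a block. The sketch below is a minimal Python illustration, assuming the sizes shown on the slides (128M blocks, 1M cells, 6 data blocks) and a single block group. The function names are hypothetical and are not HDFS APIs.

```python
MB = 1024 * 1024
BLOCK_SIZE = 128 * MB   # block size used in the slides' example
CELL_SIZE = 1 * MB      # striping cell size used in the slides' example
DATA_BLOCKS = 6         # data blocks per group, as in a (6,3) scheme


def locate_contiguous(offset):
    """Contiguous layout: each block holds one consecutive 128M range."""
    return offset // BLOCK_SIZE, offset % BLOCK_SIZE


def locate_striped(offset):
    """Striped layout: 1M cells are dealt round-robin across the data blocks."""
    cell = offset // CELL_SIZE
    stripe, block = divmod(cell, DATA_BLOCKS)
    return block, stripe * CELL_SIZE + offset % CELL_SIZE


# Both functions return (block index, offset inside that block).
print(locate_contiguous(130 * MB))  # falls in block 1
print(locate_striped(6 * MB))       # cell 6~7M wraps around to block 0
```

With striping, even a file of a few megabytes is spread over several DataNodes, so it can be protected by parity and read in parallel, but no single DataNode holds a long local range of the file.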
![Page 69: Native erasure coding support inside hdfs presentation](https://reader031.vdocuments.us/reader031/viewer/2022020301/586fde9e1a28ab18428b6c75/html5/thumbnails/69.jpg)
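EC in Distributed Storage
Spectrum (axes: Replication vs. Erasure Coding, Striping vs. Contiguous):
- Striping: Ceph and Quantcast File System, each shown under both Replication and Erasure Coding
- Contiguous: HDFS (Replication); Facebook f4 and Windows Azure (Erasure Coding)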
![Page 74: Native erasure coding support inside hdfs presentation](https://reader031.vdocuments.us/reader031/viewer/2022020301/586fde9e1a28ab18428b6c75/html5/thumbnails/74.jpg)
Choosing Block Layout
• Assuming (6,3) coding
• Small files: < 1 block
• Medium: 1~6 blocks
• Large: > 6 blocks (1 group)

| Cluster | Metric | small | medium | large |
|---|---|---|---|---|
| A | file count | 96.29% | 1.86% | 1.85% |
| A | space usage | 26.06% | 9.33% | 64.61% |
| B | file count | 86.59% | 11.38% | 2.03% |
| B | space usage | 23.89% | 36.03% | 40.08% |
| C | file count | 99.64% | 0.36% | 0.00% |
| C | space usage | 76.05% | 20.75% | 3.20% |

Cluster A Profile: Top 2% files occupy ~65% space
Cluster B Profile: Top 2% files occupy ~40% space
Cluster C Profile: Dominated by small files
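The small / medium / large split can be written as a simple size check. The Python sketch below is illustrative only. It assumes a 128M block size (taken from the block-layout slides, not stated on this one) and treats a file of exactly 6 blocks as medium. `classify` is a hypothetical helper, not part of HDFS.

```python
MB = 1024 * 1024


def classify(size_bytes, block_size=128 * MB, data_blocks=6):
    """Bucket a file by size relative to one (6,3) block group."""
    if size_bytes < block_size:
        return "small"    # less than 1 block
    if size_bytes <= data_blocks * block_size:
        return "medium"   # 1~6 blocks
    return "large"        # more than 6 blocks, i.e. more than 1 group


for size in (10 * MB, 300 * MB, 2048 * MB):
    print(size // MB, "MB ->", classify(size))
```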
![Page 75: Native erasure coding support inside hdfs presentation](https://reader031.vdocuments.us/reader031/viewer/2022020301/586fde9e1a28ab18428b6c75/html5/thumbnails/75.jpg)
Choosing Block Layout
[Diagram: grid with axes Replication / Erasure Coding and Striping / Contiguous, marking Current HDFS, Phase 1.1, Phase 1.2, Phase 2 (Future work) and Phase 3 (Future work)]
![Page 79: Native erasure coding support inside hdfs presentation](https://reader031.vdocuments.us/reader031/viewer/2022020301/586fde9e1a28ab18428b6c75/html5/thumbnails/79.jpg)
Generalizing Block: NameNode
Mapping Logical and Storage Blocks
Too Many Storage Blocks?
Hierarchical Naming Protocol:
![Page 87: Native erasure coding support inside hdfs presentation](https://reader031.vdocuments.us/reader031/viewer/2022020301/586fde9e1a28ab18428b6c75/html5/thumbnails/87.jpg)
Client Parallel Writing
[Diagram: a queue feeding multiple streamers (streamer, streamer, …, streamer), each writing to its own DataNode, plus a Coordinator]
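The streamer / queue structure in this diagram can be sketched in a few lines of Python. This is a toy model under loud assumptions, not the HDFS client code. It gives each streamer its own queue, replaces Reed-Solomon with a single XOR parity cell per stripe, uses 4-byte cells, models DataNodes as in-memory buffers, zero-pads the last stripe, and does not model the Coordinator. All names are hypothetical.

```python
import queue
import threading

CELL = 4   # bytes per cell (tiny, for illustration)
K = 3      # number of data streamers (assumed)


def xor_cells(cells):
    """XOR equal-length cells together to form one parity cell."""
    out = bytearray(CELL)
    for cell in cells:
        for i, value in enumerate(cell):
            out[i] ^= value
    return bytes(out)


def streamer(q, datanode):
    """Drain one queue and append every cell to one DataNode buffer."""
    while True:
        cell = q.get()
        if cell is None:   # sentinel: the client has finished writing
            break
        datanode.extend(cell)


def striped_write(data):
    """Split data into stripes and hand the cells to K data streamers + 1 parity streamer."""
    queues = [queue.Queue() for _ in range(K + 1)]
    datanodes = [bytearray() for _ in range(K + 1)]
    threads = [threading.Thread(target=streamer, args=(q, d))
               for q, d in zip(queues, datanodes)]
    for t in threads:
        t.start()
    stripe = K * CELL
    for off in range(0, len(data), stripe):
        chunk = data[off:off + stripe].ljust(stripe, b"\0")
        cells = [chunk[i * CELL:(i + 1) * CELL] for i in range(K)]
        for q, cell in zip(queues, cells + [xor_cells(cells)]):
            q.put(cell)
    for q in queues:
        q.put(None)
    for t in threads:
        t.join()
    return [bytes(d) for d in datanodes]


nodes = striped_write(bytes(range(24)))
print([list(n) for n in nodes])
```

Because every streamer drains its own queue, one slow DataNode only delays its own queue while the client keeps filling the others.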
![Page 91: Native erasure coding support inside hdfs presentation](https://reader031.vdocuments.us/reader031/viewer/2022020301/586fde9e1a28ab18428b6c75/html5/thumbnails/91.jpg)
Client Parallel Reading
[Diagram: the client reads from DataNode, DataNode, DataNode, DataNode, … DataNode; label: parity]
![Page 92: Native erasure coding support inside hdfs presentation](https://reader031.vdocuments.us/reader031/viewer/2022020301/586fde9e1a28ab18428b6c75/html5/thumbnails/92.jpg)
Reconstruction on DataNode
§ Important to avoid delay on the critical path
- Especially if original data is lost
§ Integrated with Replication Monitor
- Under-protected EC blocks scheduled together with under-replicated blocks
- New priority algorithms
§ New ErasureCodingWorker component on DataNode
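The idea of scheduling under-protected EC blocks together with under-replicated blocks can be illustrated with a single ranking by "how many more failures can this block survive". The slides do not describe the actual priority algorithm, so the sketch below is only one plausible ordering; the function names, the tuple layout, and the block ids are all hypothetical.

```python
def remaining_tolerance(live_units: int, min_units_to_read: int) -> int:
    """Number of further failures a block can absorb before data is lost.

    Assumption for this sketch: a replicated block needs 1 live replica,
    and a 6 data + 3 parity striped group needs any 6 live units.
    """
    return live_units - min_units_to_read


def repair_order(blocks):
    """Return block ids, most at-risk first.

    `blocks` is a list of (block_id, kind, live_units, min_units_to_read).
    sorted() is stable, so equally urgent blocks keep their input order.
    """
    ranked = sorted(blocks, key=lambda b: remaining_tolerance(b[2], b[3]))
    return [block_id for block_id, _kind, _live, _minimum in ranked]


# Hypothetical work list mixing both kinds of block.
pending = [
    ("ec-group-B", "striped", 7, 6),     # 2 of 9 units lost, tolerates 1 more
    ("replica-A", "replicated", 1, 1),   # 2 of 3 replicas lost, tolerates 0 more
    ("replica-C", "replicated", 2, 1),   # 1 of 3 replicas lost, tolerates 1 more
]
order = repair_order(pending)
print(order)
```

Ranking both kinds of block on one scale is what lets a single monitor handle them in the same queue; a real scheduler would presumably weigh more factors than this.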
![Page 93: Native erasure coding support inside hdfs presentation](https://reader031.vdocuments.us/reader031/viewer/2022020301/586fde9e1a28ab18428b6c75/html5/thumbnails/93.jpg)
Roadmap
§ Background of EC
- Redundancy Theory
- EC in Distributed Storage Systems
§ HDFS-EC architecture
- Choosing Block Layout
- NameNode: Generalizing the Block Concept
- Client: Parallel I/O
- DataNode: Background Reconstruction
§ Hardware-accelerated Codec Framework
![Page 94: Native erasure coding support inside hdfs presentation](https://reader031.vdocuments.us/reader031/viewer/2022020301/586fde9e1a28ab18428b6c75/html5/thumbnails/94.jpg)
Acceleration with Intel ISA-L
§ 1 legacy coder
- From Facebook’s HDFS-RAID project
§ 2 new coders
- Pure Java: code improvement over HDFS-RAID
- Native coder with Intel’s Intelligent Storage Acceleration Library (ISA-L)
![Page 95: Native erasure coding support inside hdfs presentation](https://reader031.vdocuments.us/reader031/viewer/2022020301/586fde9e1a28ab18428b6c75/html5/thumbnails/95.jpg)
Microbenchmark: Codec Calculation
![Page 96: Native erasure coding support inside hdfs presentation](https://reader031.vdocuments.us/reader031/viewer/2022020301/586fde9e1a28ab18428b6c75/html5/thumbnails/96.jpg)
Microbenchmark: HDFS I/O
![Page 97: Native erasure coding support inside hdfs presentation](https://reader031.vdocuments.us/reader031/viewer/2022020301/586fde9e1a28ab18428b6c75/html5/thumbnails/97.jpg)
Conclusion
![Page 98: Native erasure coding support inside hdfs presentation](https://reader031.vdocuments.us/reader031/viewer/2022020301/586fde9e1a28ab18428b6c75/html5/thumbnails/98.jpg)
![Page 99: Native erasure coding support inside hdfs presentation](https://reader031.vdocuments.us/reader031/viewer/2022020301/586fde9e1a28ab18428b6c75/html5/thumbnails/99.jpg)
![Page 100: Native erasure coding support inside hdfs presentation](https://reader031.vdocuments.us/reader031/viewer/2022020301/586fde9e1a28ab18428b6c75/html5/thumbnails/100.jpg)
![Page 101: Native erasure coding support inside hdfs presentation](https://reader031.vdocuments.us/reader031/viewer/2022020301/586fde9e1a28ab18428b6c75/html5/thumbnails/101.jpg)
Conclusion
§ Erasure coding expands effective storage space by ~50%!
§ HDFS-EC phase I implements erasure coding in striped block layout
§ Upstream effort (HDFS-7285):
- Design finalized Nov. 2014
- Development started Jan. 2015
- 218 commits, ~25k LoC change
- Broad collaboration: Cloudera, Intel, Hortonworks, Huawei, Yahoo (Japan)
§ Phase II will support contiguous block layout for better locality
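The storage figures in the talk can be checked with quick arithmetic. The sketch below assumes the 6 data + 3 parity layout drawn in the block-layout slides (DataNode0 to DataNode5 for data, DataNode6 to DataNode8 for parity) and compares it with 3-way replication; the function names are illustrative only.

```python
def storage_overhead(data_units: int, redundancy_units: int) -> float:
    """Extra raw storage as a fraction of the logical data size."""
    return redundancy_units / data_units


def raw_per_logical_byte(data_units: int, redundancy_units: int) -> float:
    """Raw bytes consumed for every logical byte stored."""
    return (data_units + redundancy_units) / data_units


# 3-way replication: 1 original + 2 extra replicas.
replication_overhead = storage_overhead(1, 2)
# Assumed erasure-coded layout: 6 data units + 3 parity units.
ec_overhead = storage_overhead(6, 3)

print(f"replication overhead: {replication_overhead:.0%}")
print(f"6+3 EC overhead:      {ec_overhead:.0%}")
print(f"raw bytes per logical byte: "
      f"{raw_per_logical_byte(1, 2)} vs {raw_per_logical_byte(6, 3)}")
```

Under these assumptions the raw footprint per logical byte drops from 3.0 to 1.5, while both schemes still survive the loss of any two units (the 6+3 layout survives three).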
![Page 102: Native erasure coding support inside hdfs presentation](https://reader031.vdocuments.us/reader031/viewer/2022020301/586fde9e1a28ab18428b6c75/html5/thumbnails/102.jpg)
Acknowledgements
§ Cloudera
- Andrew Wang, Aaron T. Myers, Colin McCabe, Todd Lipcon, Silvius Rus
§ Intel
- Kai Zheng, Uma Maheswara Rao G, Vinayakumar B, Yi Liu, Weihua Jiang
§ Hortonworks
- Jing Zhao, Tsz Wo Nicholas Sze
§ Huawei
- Walter Su, Rakesh R, Xinwei Qin
§ Yahoo (Japan)
- Gao Rui, Kai Sasaki, Takuya Fukudome, Hui Zheng
![Page 103: Native erasure coding support inside hdfs presentation](https://reader031.vdocuments.us/reader031/viewer/2022020301/586fde9e1a28ab18428b6c75/html5/thumbnails/103.jpg)
Just merged to trunk!
![Page 104: Native erasure coding support inside hdfs presentation](https://reader031.vdocuments.us/reader031/viewer/2022020301/586fde9e1a28ab18428b6c75/html5/thumbnails/104.jpg)
Questions?
![Page 105: Native erasure coding support inside hdfs presentation](https://reader031.vdocuments.us/reader031/viewer/2022020301/586fde9e1a28ab18428b6c75/html5/thumbnails/105.jpg)
Erasure Coding: A type of Error Correction Coding
![Page 106: Native erasure coding support inside hdfs presentation](https://reader031.vdocuments.us/reader031/viewer/2022020301/586fde9e1a28ab18428b6c75/html5/thumbnails/106.jpg)
EC in Distributed Storage
Spectrum:
![Page 107: Native erasure coding support inside hdfs presentation](https://reader031.vdocuments.us/reader031/viewer/2022020301/586fde9e1a28ab18428b6c75/html5/thumbnails/107.jpg)
EC in Distributed Storage
Block Layout: Contiguous
[Diagram: the file is divided into 128 MB ranges (0~128M, 128~256M, …, 640~768M). Each range is stored whole as one block: block 0 on DataNode0, block 1 on DataNode1, …, block 5 on DataNode5 (data). DataNode6 … DataNode8 hold parity.]
![Page 108: Native erasure coding support inside hdfs presentation](https://reader031.vdocuments.us/reader031/viewer/2022020301/586fde9e1a28ab18428b6c75/html5/thumbnails/108.jpg)
![Page 109: Native erasure coding support inside hdfs presentation](https://reader031.vdocuments.us/reader031/viewer/2022020301/586fde9e1a28ab18428b6c75/html5/thumbnails/109.jpg)
![Page 110: Native erasure coding support inside hdfs presentation](https://reader031.vdocuments.us/reader031/viewer/2022020301/586fde9e1a28ab18428b6c75/html5/thumbnails/110.jpg)
EC in Distributed Storage
Block Layout: Contiguous
[Diagram: data blocks on DataNode0 … DataNode5, parity on DataNode6 … DataNode8, shown here for a file that does not fill every 128 MB range]
Data Locality: good
Small Files: poor
![Page 111: Native erasure coding support inside hdfs presentation](https://reader031.vdocuments.us/reader031/viewer/2022020301/586fde9e1a28ab18428b6c75/html5/thumbnails/111.jpg)
EC in Distributed Storage
Block Layout: Striping
[Diagram: the file is cut into 1 MB ranges (0~1M, 1~2M, …, 5~6M, …, 127~128M) that are spread across the blocks on DataNode0 … DataNode5 (data). DataNode6 … DataNode8 hold parity.]
![Page 112: Native erasure coding support inside hdfs presentation](https://reader031.vdocuments.us/reader031/viewer/2022020301/586fde9e1a28ab18428b6c75/html5/thumbnails/112.jpg)
![Page 113: Native erasure coding support inside hdfs presentation](https://reader031.vdocuments.us/reader031/viewer/2022020301/586fde9e1a28ab18428b6c75/html5/thumbnails/113.jpg)
![Page 114: Native erasure coding support inside hdfs presentation](https://reader031.vdocuments.us/reader031/viewer/2022020301/586fde9e1a28ab18428b6c75/html5/thumbnails/114.jpg)
EC in Distributed Storage
Block Layout: Striping
[Diagram: 1 MB ranges of the file spread across DataNode0 … DataNode5 (data), parity on DataNode6 … DataNode8]
Data Locality: poor
Small Files: good
Parallel I/O: good
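The difference between the two layouts comes down to how a file offset maps to a DataNode. The sketch below uses the sizes visible in the diagrams (128 MB blocks, 1 MB striping ranges, six data nodes) and assumes plain round-robin placement of cells for the striped case; the function names and the round-robin rule are assumptions for illustration, not taken from the slides.

```python
MB = 1024 * 1024
BLOCK_SIZE = 128 * MB   # contiguous layout: one 128 MB range per block
CELL_SIZE = 1 * MB      # striped layout: 1 MB ranges
DATA_UNITS = 6          # DataNode0 .. DataNode5 hold data


def locate_contiguous(offset: int):
    """Return (block index, offset inside that block) for a file offset."""
    return offset // BLOCK_SIZE, offset % BLOCK_SIZE


def locate_striped(offset: int):
    """Return (data unit index, offset inside that unit's block).

    Assumes cells are dealt round-robin across the data units.
    """
    cell_index = offset // CELL_SIZE
    stripe = cell_index // DATA_UNITS
    unit = cell_index % DATA_UNITS
    return unit, stripe * CELL_SIZE + offset % CELL_SIZE


# A 6 MB sequential read touches one block in the contiguous layout...
contiguous_units = {locate_contiguous(o)[0] for o in range(0, 6 * MB, MB)}
# ...but all six data units in the striped layout.
striped_units = {locate_striped(o)[0] for o in range(0, 6 * MB, MB)}
print(contiguous_units, striped_units)
```

This is why striping helps parallel I/O and small files (even a few megabytes spread over all data nodes) but hurts data locality (no single node holds a large contiguous piece of the file).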
![Page 115: Native erasure coding support inside hdfs presentation](https://reader031.vdocuments.us/reader031/viewer/2022020301/586fde9e1a28ab18428b6c75/html5/thumbnails/115.jpg)
Client Parallel Writing
[Diagram: DFSStripedOutputStream; dataQueue 0 … dataQueue 4; DataStreamer 0 … DataStreamer 4; blk_1009 … blk_1013 forming one blockGroup; Coordinator: allocate new blockGroup]
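The queue-per-streamer structure in the diagram can be mimicked with threads. The following is a toy sketch only: it treats all five streamers as data streamers, omits parity generation and the Coordinator, uses a 4-byte cell instead of the 1 MB ranges shown in the striping diagram, and writes into in-memory buffers rather than to DataNodes. None of the names below come from the HDFS code base.

```python
import queue
import threading

CELL_SIZE = 4        # tiny cell for demonstration purposes
NUM_STREAMERS = 5    # mirrors DataStreamer 0..4 in the diagram
_DONE = object()     # sentinel telling a streamer to stop


def streamer(data_queue: queue.Queue, sink: bytearray) -> None:
    """Drain one queue into one in-memory buffer (a stand-in for a DataNode)."""
    while True:
        cell = data_queue.get()
        if cell is _DONE:
            return
        sink.extend(cell)


def striped_write(payload: bytes):
    """Split payload into cells and hand them round-robin to the streamers."""
    queues = [queue.Queue() for _ in range(NUM_STREAMERS)]
    sinks = [bytearray() for _ in range(NUM_STREAMERS)]
    threads = [threading.Thread(target=streamer, args=(q, s))
               for q, s in zip(queues, sinks)]
    for t in threads:
        t.start()
    for start in range(0, len(payload), CELL_SIZE):
        cell_index = start // CELL_SIZE
        queues[cell_index % NUM_STREAMERS].put(payload[start:start + CELL_SIZE])
    for q in queues:
        q.put(_DONE)
    for t in threads:
        t.join()
    return [bytes(s) for s in sinks]


blocks = striped_write(b"abcdefghijklmnopqrstuvwx")
print(blocks)
```

Each queue preserves the order of its own cells, so the per-block contents are deterministic even though the streamers run concurrently.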
![Page 116: Native erasure coding support inside hdfs presentation](https://reader031.vdocuments.us/reader031/viewer/2022020301/586fde9e1a28ab18428b6c75/html5/thumbnails/116.jpg)
Client Parallel Reading
[Diagram: Stripe 0, Stripe 1 and Stripe 2 laid out across DataNode 0, DataNode 1, DataNode 2, DataNode 2, DataNode 3, grouped into (data blocks) and (parity blocks). Cells are marked "requested", "recovery read", or "all zero".]
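The "requested", "recovery read" and "all zero" labels suggest the following flow: read the requested cells, and if one is unavailable, read the rest of its stripe plus parity to rebuild it, treating cells past the end of the file as zeros. The sketch below shows that flow with a single XOR parity cell, as in the simplified XOR example earlier in the talk. It is a deliberate simplification, not the codec HDFS-EC actually uses, and all names are hypothetical.

```python
from functools import reduce


def xor_cells(cells):
    """Byte-wise XOR of equal-length cells."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*cells))


def read_stripe(data_cells, parity_cell):
    """Return the data cells of one stripe, rebuilding at most one missing cell.

    data_cells: list of bytes, with None for a cell that could not be read.
    Cells beyond the end of the file should be passed as all-zero bytes.
    """
    missing = [i for i, cell in enumerate(data_cells) if cell is None]
    if not missing:
        return list(data_cells)
    if len(missing) > 1:
        raise ValueError("a single XOR parity cell can rebuild only one cell")
    # Recovery read: every surviving data cell plus the parity cell.
    survivors = [cell for cell in data_cells if cell is not None]
    rebuilt = xor_cells(survivors + [parity_cell])
    result = list(data_cells)
    result[missing[0]] = rebuilt
    return result


# A stripe whose last cell lies past the end of the file ("all zero").
cells = [b"\x01\x02", b"\x10\x20", b"\x00\x00"]
parity = xor_cells(cells)
recovered = read_stripe([cells[0], None, cells[2]], parity)
print(recovered)
```

A recovery read costs extra I/O against every other unit in the stripe, which is why the diagram shows many "recovery read" cells for a handful of "requested" ones.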