Checksum Strategies for Data in Volatile Memory
Authors: Humayun Arafat (Ohio State), Sriram Krishnamoorthy (PNNL), P. Sadayappan (Ohio State)
Motivation
• In exascale systems, failures will become more frequent as the number of processors increases
• The typical current approach to fault tolerance is to checkpoint to stable storage
• Soft errors can affect individual data blocks
• Multiple data blocks might be corrupted before the errors can be detected efficiently
• We focus on developing an approach that can tolerate multiple hard errors and soft errors
Fault Tolerant Data in Volatile Memory
• Efficient checksum-based approach to fault tolerance for data in volatile memory systems
• The developed scheme is applicable in multiple scenarios
  • Online recovery of large read-only data structures with low storage overhead
  • Online recovery from soft errors in blocked data
  • Online recovery of read/write data via in-memory checkpointing
• The approach uses a logical multi-dimensional view of the data to be protected
Design
• Recover exact data
• Inspiration from Algorithm-Based Fault Tolerance (ABFT)
• Low overhead
Checksum Design
• Checksum operator
  • XOR
• Multi-dimensional checksums
  • Increase tolerance
• Checksums co-located with data
  • Reduce space overhead
• Distributed checksums
  • Reduce overhead and increase tolerance
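The choice of XOR as the checksum operator can be illustrated with a minimal sketch. It is not the authors' implementation; the helper name `xor_blocks` and the sample byte blocks are assumptions made for illustration. Because XOR is its own inverse, a lost block is rebuilt bit for bit from the surviving blocks and the checksum, which is what "recover exact data" requires.

```python
from functools import reduce

def xor_blocks(blocks):
    """Bytewise XOR of equally sized byte blocks (hypothetical helper)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three illustrative data blocks of equal size.
data = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
checksum = xor_blocks(data)

# Simulate losing block 1, then rebuild it from the survivors plus the checksum.
survivors = [data[0], data[2]]
recovered = xor_blocks(survivors + [checksum])
```

Unlike a floating-point sum, XOR introduces no rounding, so the recovered block is identical to the lost one.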
One Dimensional Checksum
One Dimensional Checksum
One Dimensional Checksum
• Recover checksum
• Recover data
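The two recovery cases named on this slide can be sketched as follows, assuming one integer-valued block per process and a single checksum over all of them. The function names and the use of `None` to mark a lost block are illustrative assumptions, not taken from the slides.

```python
from functools import reduce
from operator import xor

def checksum_1d(blocks):
    """XOR of all data blocks."""
    return reduce(xor, blocks, 0)

def recover_1d(blocks, checksum):
    """Repair a single loss in a 1D layout.

    `blocks` uses None for a lost data block; `checksum` is None if the
    checksum itself was lost. This sketch tolerates exactly one loss.
    """
    missing = [i for i, b in enumerate(blocks) if b is None]
    if checksum is None:
        if missing:
            raise ValueError("checksum and data lost together")
        # Recover checksum: recompute it from the intact data.
        return list(blocks), checksum_1d(blocks)
    if len(missing) > 1:
        raise ValueError("1D checksum tolerates only one lost block")
    repaired = list(blocks)
    if missing:
        # Recover data: XOR the survivors with the checksum.
        survivors = [b for b in blocks if b is not None]
        repaired[missing[0]] = checksum_1d(survivors) ^ checksum
    return repaired, checksum
```

A second simultaneous loss is rejected rather than guessed at, which is the limitation that motivates the higher-dimensional layouts.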
Two Dimensional Checksum
Checksum and Data Distribution
Two Dimensional Checksum
• Checksum calculation
• Recovery
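A two-dimensional checksum can be sketched by viewing the blocks as a logical grid with one XOR checksum per row and one per column. The code below is an illustration under stated assumptions: blocks are integers, lost blocks are marked `None`, the checksums themselves survive, and the function names are hypothetical. Recovery repeatedly repairs any row or column that has exactly one gap, so some multi-block losses become recoverable.

```python
from functools import reduce
from operator import xor

def checksums_2d(grid):
    """Return (row checksums, column checksums) for a rectangular grid."""
    row_cs = [reduce(xor, row, 0) for row in grid]
    col_cs = [reduce(xor, col, 0) for col in zip(*grid)]
    return row_cs, col_cs

def recover_2d(grid, row_cs, col_cs):
    """Fill None entries using row and column checksums; raise if stuck."""
    grid = [list(row) for row in grid]
    n_rows, n_cols = len(grid), len(grid[0])
    progress = True
    while progress:
        progress = False
        for r in range(n_rows):
            gaps = [c for c in range(n_cols) if grid[r][c] is None]
            if len(gaps) == 1:
                known = (v for v in grid[r] if v is not None)
                grid[r][gaps[0]] = reduce(xor, known, row_cs[r])
                progress = True
        for c in range(n_cols):
            gaps = [r for r in range(n_rows) if grid[r][c] is None]
            if len(gaps) == 1:
                known = (grid[r][c] for r in range(n_rows) if grid[r][c] is not None)
                grid[gaps[0]][c] = reduce(xor, known, col_cs[c])
                progress = True
    if any(v is None for row in grid for v in row):
        raise ValueError("loss pattern not recoverable with 2D checksums")
    return grid
```

In this sketch, three losses arranged in an L shape are repaired, while four losses at the corners of a rectangle leave every affected row and column with two gaps and cannot be resolved.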
Three Dimensional Checksum
Three Dimensional Checksum Distribution
Checksum Overhead
– One Dimension
– Two Dimensions
– Three Dimensions
– d Dimensions
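The overhead formulas from this slide did not survive in the transcript. As a rough illustration only, and not a reconstruction of the authors' formulas, the sketch below assumes a logical grid with `side` blocks along each of `dims` dimensions and one checksum block per line of blocks along every dimension. Under that assumption the number of checksum blocks is `dims * side**(dims - 1)`, so the space overhead relative to the data is `dims / side`.

```python
def checksum_overhead(side, dims):
    """Checksum blocks divided by data blocks for a side**dims grid.

    Assumes one checksum block per line of `side` blocks along each
    dimension (an illustrative model, not the slide's formula).
    """
    data_blocks = side ** dims
    checksum_blocks = dims * side ** (dims - 1)
    return checksum_blocks / data_blocks

for dims in (1, 2, 3):
    print(dims, checksum_overhead(10, dims))
```

Under this model, for a fixed side length each added dimension adds one more checksum line per block, and the overhead shrinks as the side length grows.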
Experiments
• Cray XE6 system (NERSC Hopper)
• 6,384 nodes with Gemini interconnect
• Peak bandwidth 8.3 GB/s per direction
• Two twelve-core 2.1 GHz AMD ‘MagnyCours’ processors per node (24 cores per node) and 32 GB DDR3 memory
• Intel C++ compiler 13 and Cray MPI 6.0.1
Checksum Calculation Time 1D, 2D and 3D
Fault Recovery
Soft Error
• A soft error can change data in memory
• The unit of failure is a block of data inside a process, not the entire process
• Lower overhead than handling an entire process failure
• Fewer tolerable failures
Soft Error
Soft Error Equations
• 1D block
• 2D block
2D Soft Error Checksum
2D Soft Error Recovery
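Soft-error recovery differs from recovery after a lost process because the corrupted block is still present and must first be located. A minimal sketch, assuming integer blocks in a logical 2D grid, trusted row and column checksums, and at most one corrupted block (the function name `locate_and_fix` is hypothetical): the mismatching row and column intersect at the bad block, and the row mismatch value is the exact bit pattern to undo.

```python
from functools import reduce
from operator import xor

def locate_and_fix(grid, row_cs, col_cs):
    """Find and repair one silently corrupted block in place.

    Returns the (row, column) of the repaired block, or None if every
    checksum matches. Raises if the mismatch pattern is not a single block.
    """
    bad_rows = [r for r, row in enumerate(grid)
                if reduce(xor, row, 0) != row_cs[r]]
    bad_cols = [c for c, col in enumerate(zip(*grid))
                if reduce(xor, col, 0) != col_cs[c]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) != 1 or len(bad_cols) != 1:
        raise ValueError("mismatch pattern is not a single corrupted block")
    r, c = bad_rows[0], bad_cols[0]
    # XOR of the row with its checksum gives exactly the flipped bits.
    flipped_bits = reduce(xor, grid[r], row_cs[r])
    grid[r][c] ^= flipped_bits
    return r, c
```

The sketch treats the checksums as correct; a real scheme would also have to consider corruption of the checksum blocks themselves.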
Summary
• In-memory checkpointing, low-overhead protection for read-only data, recovery from soft errors
• XOR-based checksum to recover exact data
• Multidimensional checksum calculation to increase fault tolerance
• Co-location of the checksums with the data
• Scalable design to ensure low space overhead
THANK YOU
Questions?