


GVR Project: http://gvr.cs.uchicago.edu

Using Global View Resilience (GVR) to add Resilience to Exascale Applications

Hajime Fujita (1,2), Nan Dun (1,2), Aiman Fang (1), Zachary A. Rubenstein (1), Ziming Zheng (3), Kamil Iskra (2), Jeff Hammond (4), Anshu Dubey (5), Pavan Balaji (2), Andrew A. Chien (1,2)

(1) University of Chicago, (2) Argonne National Laboratory, (3) HP Vertica, (4) Intel Labs, (5) Lawrence Berkeley National Laboratory

Motivation

• Current high-performance systems have reached 10^15 FLOPS, and progress continues toward 10^18 FLOPS.
• Exascale systems comprise millions of components, leading to higher error rates. It is anticipated that the mean time between failures (MTBF) could be less than an hour [1].
• Resilience becomes a major concern.
• A new tool that addresses resilience issues is needed.

Global View Resilience

• A new library that exploits a global view of data and adds reliability to globally visible data [2, 3].
• Key features:
  • Multi-version, multi-stream distributed arrays: preserve critical application data in a fine-grained manner and enable powerful recovery from complex errors such as latent errors
  • Open resilience: maximizes recoverable errors through cross-layer partnership and leverages application-level error handling with unified error handlers
• Portable, flexible, application-controlled resilience.
• Demonstrated usable, scalable resilience with a gentle slope and flexible forward error recovery.
• Implemented as a library that can be used together with other libraries (e.g. MPI, Trilinos), allowing gradual migration of existing applications, or as a backend for other libraries and programming models.

Multi-version, Multi-stream, Distributed Arrays

Global View
• Exploits a global-view data model, which enables irregular, adaptive algorithms and exascale variability
• Provides an abstraction of data representation that offers resilience and seamless integration of the various components of the memory/storage hierarchy

Multi-version, Multi-stream
• Computation phases form "versions" of data
• A GVR array can preserve multiple versions at the application's request
• The application can retrieve an arbitrary version for flexible recovery
• Multiple versions are useful in many ways, e.g. rolling back to an old version in the presence of latent errors
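The multi-version idea can be illustrated with a small single-process toy. This is only a sketch: the names toy_array, toy_put, toy_version_inc, toy_get_version, and toy_restore are hypothetical and are not the GVR API, and the toy keeps full in-memory copies rather than a distributed, reliable store.

```c
#include <assert.h>
#include <string.h>

#define LEN 4           /* elements per array (toy size) */
#define MAX_VERSIONS 8  /* snapshots kept (toy limit) */

/* Toy stand-in for a multi-version array; NOT the GVR data structure. */
typedef struct {
    double current[LEN];              /* working copy */
    double saved[MAX_VERSIONS][LEN];  /* full copies of past versions */
    int    num_saved;                 /* number of snapshots taken */
} toy_array;

void toy_init(toy_array *a) { memset(a, 0, sizeof *a); }

/* Write one element of the working copy. */
void toy_put(toy_array *a, int i, double v) {
    assert(i >= 0 && i < LEN);
    a->current[i] = v;
}

/* Close the current computation phase by snapshotting the working copy.
   Returns the version number of the snapshot, or -1 if the toy is full. */
int toy_version_inc(toy_array *a) {
    if (a->num_saved == MAX_VERSIONS) return -1;
    memcpy(a->saved[a->num_saved], a->current, sizeof a->current);
    return a->num_saved++;
}

/* Read one element from a given saved version. */
double toy_get_version(const toy_array *a, int version, int i) {
    assert(version >= 0 && version < a->num_saved);
    assert(i >= 0 && i < LEN);
    return a->saved[version][i];
}

/* Roll the working copy back to a saved version. */
void toy_restore(toy_array *a, int version) {
    assert(version >= 0 && version < a->num_saved);
    memcpy(a->current, a->saved[version], sizeof a->current);
}
```

Because every version stays readable, a caller that discovers a latent error can skip the newest snapshot and restore an older one that predates the corruption.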

Non-uniform, Proportional Resilience
• Applications can specify which data are more important in order to manage reliability overheads
• Portable, controllable resilience
• Application-semantics-based error detection and recovery

Open Resilience

• Unified Error Signaling and Handling
  • Errors from different sources (e.g. HW, OS, runtime, application) are gathered into the GVR library, then dispatched to an application-defined error handler through the unified error signaling interface
  • Allows applications to supply their own error checkers and handlers
  • Enables one error handler to serve a broader class of errors, leveraging the effort spent writing error handlers
• Framework for Flexible Cross-layer Error Handling
  • Cross-layer collaboration among system components maximizes the chance of error recovery, as the application may be able to handle complex errors that cannot be handled in lower-level components (e.g. system software or hardware)
  • Error signals are dynamically matched with error handlers to enhance the generalizability, specializability, composability, and flexibility of error handlers
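The dynamic matching of error signals to handlers might look roughly like the following toy dispatcher. It is a sketch under assumptions: the types error_desc, predicate_fn, and handler_fn, and the functions register_handler and raise_error, are invented for illustration and do not reproduce GVR's interface or its matching rules. Here the first registered handler whose predicate accepts the error wins.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical error sources and descriptor; not GVR's actual types. */
typedef enum { SRC_HARDWARE, SRC_OS, SRC_RUNTIME, SRC_APPLICATION } error_source;

typedef struct {
    error_source source;
    int          severity;   /* larger means worse (toy scale) */
} error_desc;

typedef int  (*predicate_fn)(const error_desc *);          /* nonzero if handler applies */
typedef void (*handler_fn)(const error_desc *, void *state);

typedef struct { predicate_fn matches; handler_fn handle; } registration;

#define MAX_HANDLERS 8
static registration table[MAX_HANDLERS];
static size_t       table_len = 0;

/* Register a handler together with the predicate that selects its errors. */
int register_handler(predicate_fn p, handler_fn h) {
    if (table_len == MAX_HANDLERS) return -1;
    table[table_len].matches = p;
    table[table_len].handle  = h;
    table_len++;
    return 0;
}

/* Dispatch: run the first registered handler whose predicate matches.
   Returns 1 if some handler ran, 0 if the error went unhandled. */
int raise_error(const error_desc *e, void *state) {
    for (size_t i = 0; i < table_len; i++) {
        if (table[i].matches(e)) {
            table[i].handle(e, state);
            return 1;
        }
    }
    return 0;
}

/* Example predicates and handlers; each handler records its action in *state. */
int is_severe(const error_desc *e) { return e->severity >= 5; }
int any_error(const error_desc *e) { (void)e; return 1; }
void do_rollback(const error_desc *e, void *state) { (void)e; *(int *)state = 1; }
void do_repair(const error_desc *e, void *state)   { (void)e; *(int *)state = 2; }
```

Registering is_severe with do_rollback first and any_error with do_repair second gives a specialized handler for severe errors and a general fallback for everything else, regardless of which layer raised the error.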

Rich Application Studies

• Molecular Dynamics: miniMD (SNL Mantevo Project), ddcMD (LLNL)
• Linear Solvers: miniFE (SNL Mantevo Project), PCG, GMRES
• Computation Library: Trilinos (SNL)
• Monte Carlo Neutron Transport: OpenMC (ANL CESAR co-design center)
• Adaptive Mesh Refinement Framework: Chombo (LBL)

GVR-augmented ddcMD

    /* User-defined error handler */
    recovery_func(gds, error_descriptor) {
        /* Perform rollback: read the preserved data back */
        GDS_get(local_data_structure, gds);
        GDS_resume_global(gds, error_descriptor);
    }

    main() {
        /* Create global array data structure */
        GDS_alloc(&gds);
        /* Create error predicate */
        GDS_create_error(&pred); ...
        /* Register the specific error handler */
        GDS_register_global_error_handler(gds, pred, recovery_func);

        /* Molecular dynamics simulation loop */
        simulation_loop() {
            /* Actual computation work */
            computation();
            if (error_detected()) {  /* Error detection & signaling */
                /* Create error descriptor for the error */
                error_descriptor = GDS_create_error_descriptor(...); ...
                /* Raise the global error */
                GDS_raise_global_error(gds, error_descriptor);
                continue;
            }
            GDS_put(local_data_structure, gds);
            if (snapshot_point) {  /* Take snapshot of correct states */
                GDS_version_inc(gds);
            }
        }
    }
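The listing above is schematic, with arguments and types elided. The same detect, roll back, and recompute pattern can be exercised in a self-contained toy that does not use GVR at all. Everything in it is an assumption made for illustration: the sim_state type, the running-sum "computation", the injected fault, and SNAPSHOT_INTERVAL are hypothetical, and a local struct copy stands in for the preserved version.

```c
#include <assert.h>

#define SNAPSHOT_INTERVAL 3   /* steps between snapshots (hypothetical) */

typedef struct { int step; long value; } sim_state;

/* One step of "computation": value becomes the running sum 1 + 2 + ... + step. */
static void compute_step(sim_state *s) {
    s->step += 1;
    s->value += s->step;
}

/* Application-level check: the running sum has a known closed form. */
static int error_detected(const sim_state *s) {
    return s->value != (long)s->step * (s->step + 1) / 2;
}

/* Run total_steps steps, corrupting the state once right after step
   fault_step (pass 0 for a fault-free run). On detection, roll back to the
   last snapshot and recompute. Stores the rollback count in *rollbacks. */
long run_simulation(int total_steps, int fault_step, int *rollbacks) {
    sim_state state = { 0, 0 };
    sim_state snapshot = state;      /* plays the role of the preserved version */
    int fault_pending = (fault_step > 0);
    *rollbacks = 0;

    while (state.step < total_steps) {
        compute_step(&state);
        if (fault_pending && state.step == fault_step) {
            state.value += 1000;     /* injected corruption */
            fault_pending = 0;
        }
        if (error_detected(&state)) {
            state = snapshot;        /* rollback, as the handler's get would do */
            (*rollbacks)++;
            continue;
        }
        if (state.step % SNAPSHOT_INTERVAL == 0) {
            snapshot = state;        /* snapshot of a verified-correct state */
        }
    }
    return state.value;
}
```

Calling run_simulation(10, 5, &n) returns 55 with n equal to 1: the fault after step 5 is caught, the state returns to the snapshot taken at step 3, and steps 4 and 5 are recomputed. Snapshots are taken only after the check passes, which mirrors the "snapshot of correct states" comment in the listing.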

Forward Error Recovery in OpenMC

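[Figure: forward error recovery in OpenMC, showing execution over batches with versions A and B. Legend: tally w/o error; tally having error but not detected; tally with error detected; corrected tally; batch in which the error occurred. The recovery step is annotated with +, -, - operations.]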
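Since the figure did not survive extraction, the following toy shows one way additive tallies can permit forward recovery. It is an assumption for illustration, not necessarily the scheme used in the OpenMC study: the contribution of a bad batch is recovered as the difference between the two preserved versions that bracket it, removed from the current tally, and replaced by a recomputed value, so later batches need not be rerun. The names tally_history, accumulate, and forward_recover are hypothetical.

```c
#include <assert.h>

#define NBATCH 6

/* Versions of a cumulative tally: version[k] is the tally after batch k
   (version[0] is the empty tally before any batch). */
typedef struct { double version[NBATCH + 1]; } tally_history;

/* Accumulate per-batch contributions, keeping a version after every batch. */
void accumulate(tally_history *h, const double contrib[NBATCH]) {
    h->version[0] = 0.0;
    for (int k = 1; k <= NBATCH; k++)
        h->version[k] = h->version[k - 1] + contrib[k - 1];
}

/* Forward recovery: batch `bad` (1-based) turned out to be corrupted.
   Remove its contribution, recovered from the two bracketing versions,
   and add the recomputed contribution instead. Later batches are kept. */
double forward_recover(const tally_history *h, int bad, double recomputed) {
    assert(bad >= 1 && bad <= NBATCH);
    double bad_contrib = h->version[bad] - h->version[bad - 1];
    return h->version[NBATCH] - bad_contrib + recomputed;
}
```

With contributions {1, 2, 300, 4, 5, 6}, where the third batch should have contributed 3, forward_recover(&h, 3, 3.0) yields 21, the same total a fault-free run would produce.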

APIs


• Creating Global View structures
  • Create: GDS_alloc(), GDS_create()
• Global View Data Access
  • Data: GDS_put(), GDS_get()
  • Consistency: GDS_fence(), GDS_wait()
  • Accumulate: GDS_acc(), GDS_compare_and_swap()
• Versioning
  • Create: GDS_version_inc()
  • Navigate: GDS_get_version_number(), GDS_move_to_next(), GDS_move_to_prev(), GDS_move_to_newest()
• Error Signaling and Handling
  • Application checking, signaling, correction: GDS_register_global_error_handler(), GDS_register_local_handler()
  • System signaling, integrated recovery: GDS_raise_global_error(), GDS_raise_local_error()


[Figure labels: Put, Get, Put; Check, Error, Repair]

Performance Study

To measure the runtime overhead of GVR, we ran experiments with OpenMC, ddcMD, and Chombo. The OpenMC and ddcMD experiments were run on the Midway high-performance computing cluster at the University of Chicago Research Computing Center, whereas the Chombo experiments were run on NERSC Edison. The MPI libraries were MVAPICH2-2.0 on Midway and Cray MPT 7.0.0 on Edison.

OpenMC [7]: http://mit-crpg.github.io/openmc/

[Figure: Overhead (y-axis, ticks 0 to 1.5) versus number of processes (8, 64, 256). Series: Native; overhead of 30m interval versioning; overhead of 15m interval versioning; overhead of 5m interval versioning. Overhead labels in extraction order: 0.4%, −0.21%, 0.27%; −0.14%, 0.8%, 0.08%; −0.48%, 0.35%, 0.55%.]

ddcMD [8]: https://www.llnl.gov/

[Figure: Overhead (y-axis, ticks 0 to 1.5) versus number of processes (8, 64, 256, 512). Series: Native; GVR; overhead of 30m interval versioning; overhead of 15m interval versioning; overhead of 5m interval versioning. Overhead labels in extraction order: 0.21%, −1.13%, −1.23%, 0.45%; −0.29%, −6.20%, −1.42%, 0.69%; −0.14%, −5.86%, −0.98%, 1.44%; 0.06%, −4.73%, −1.63%, 4.62%.]

Chombo [9]: https://commons.lbl.gov/display/chombo/

[Figure: Overhead (y-axis, ticks 0 to 1.5) versus number of processes (128, 256, 1024). Series: Native; linked with GVR; overhead of 30m interval versioning; overhead of 5m interval versioning. Overhead labels in extraction order: 0.64%, 0.22%, 0.38%; 0.59%, 0.29%, 0.89%; 1.30%, 2.12%, 3.24%.]
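The percentages reported for the three applications are presumably relative to the native (GVR-free) run time. The poster does not spell out the formula, so the definition below is an assumption, and overhead_percent is a hypothetical helper rather than part of GVR.

```c
#include <assert.h>

/* Relative runtime overhead, in percent, of a GVR-enabled run over a native
   run. Negative values mean the GVR-enabled run happened to finish faster. */
double overhead_percent(double native_seconds, double gvr_seconds) {
    assert(native_seconds > 0.0);
    return (gvr_seconds - native_seconds) / native_seconds * 100.0;
}
```

Under this reading, a run that takes 250 s against a 200 s native baseline has a 25% overhead, and small negative figures simply indicate that the difference between the two runs is within run-to-run variation.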

GVR Gentle Slope

GVR can be easily applied to existing applications.
• No architectural changes
• Minimal (mostly <1%) code change

Code/App         Size (LOC)   Changed (LOC)   Leverage Global View   Change SW Architecture
Trilinos/PCG     300K         <1%             Yes                    No
Trilinos/GMRES   300K         <1%             Yes                    No
OpenMC           30K          <2%             Yes                    No
ddcMD            110K         <0.3%           Yes                    No
Chombo           500K         <1%             Yes                    No

Summary

• GVR: Portable, flexible, application-controlled resilience
  • Established model: use cases, extensive application partnership studies
  • Realized systems: several generations of prototypes, iteration informed by application studies
  • Gentle slope: <1% code change, negligible overhead
  • Scalable to exascale resilience: high error rates and latent and silent errors
• Application studies: Gentle slope, flexible forward error recovery
  • Numerous studies: incremental adoption, useful today
  • Compatible with existing software architectures
  • Enables exploitation of knowledge from all levels (app semantics-based recovery)
  • Enables all kinds of error recovery desired so far
• Maximize recoverable errors (open resilience)
  • Defined unified signaling and handling framework
  • Numerous examples of use
  • "Open resilience" can catalyze a cross-layer resilience ecosystem

Future Work

• Efficient multi-version implementation, including efficient differences, compression, and efficient exploitation of NVRAM
• Work with community to establish Open Resilience APIs, infrastructure and portable error types/handling
• Additional application studies, scalability
• Efficient portability studies, varying underlying hardware

References

[1] F. Cappello et al., Toward exascale resilience. International Journal of High Performance Computing Applications, 2009.
[2] The GVR Team, Global View Resilience (GVR) Documentation, Release 1.0. Technical Report. University of Chicago. Oct 28, 2014.
[3] The GVR Team, How Applications use GVR: Use Cases. Technical Report. University of Chicago. April 28, 2014.
[4] H. Fujita et al., Log-Structured Global Array for Efficient Multi-Version Snapshots. Submitted for publication, 2014.
[5] N. Dun et al., Data Decomposition in Monte Carlo Neutron Transport Simulations using Global View Arrays. Submitted for publication, 2014.
[6] Z. Zheng et al., Fault Tolerance in an Inner-outer Solver: A GVR-enabled Case Study. 11th International Meeting High Performance Computing for Computational Science-VECPAR, 2014.
[7] P. Romano and B. Forget, The OpenMC Monte Carlo Particle Transport Code. Annals of Nuclear Energy, 2013.
[8] F. Streitz et al., Simulating Solidification in Metals at High Pressure: The Drive to Petascale Computing. Journal of Physics: Conference Series, 2006.
[9] P. Colella et al., Chombo Software Package for AMR Applications Design Document. Lawrence Berkeley National Laboratory, 2009.

Acknowledgments

We thank Mark Hoemmen, Mike Heroux, and Keita Teranishi for useful discussions on Trilinos and linear solvers; Brian van Straalen for advice on Chombo; Ignacio Laguna and David Richards for insights on ddcMD; and John R. Tramm and Andrew R. Siegel for support on OpenMC.

ASCR X-Stack Awards DE-SC0008603/57K68-00-145

We gratefully acknowledge the computing resources provided on Midway, a high-performance computing cluster operated by the Research Computing Center at The University of Chicago. This research used resources of the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.

GVR 1.0 Release!

• Release Features
  • Portable application-controlled resilience and recovery with incremental code change
  • Versioned distributed arrays with global naming (a portable abstraction)
  • Reliable storage of the versioned arrays in memory, local disk/SSD, or global file system
  • Whole version navigation and efficient restoration
  • Partial version efficient restoration (incremental "materialization")
  • Independent array versioning (each at its own pace)
  • Open Resilience framework to maximize cross-layer error handling
  • C native APIs and Fortran bindings
• Easy install: MPI-3 compatible library, standard "autotools" preparation, requiring no root privilege
• Platforms: x86-64 Linux cluster, Cray XC30 and IBM Blue Gene/Q
• Applications: ddcMD, Trilinos, Chombo, OpenMC, and more in the future!
• For more information, please refer to http://gvr.cs.uchicago.edu