User-Friendly Checkpointing and Stateful Preemption in HPC Environments Using
Evergrid Availability Services
Keith D. Ball, PhD
Evergrid, Inc.
Oklahoma Supercomputing Symposium, October 3, 2007
Overview
• Challenges for Large HPC Systems
• Availability Services (AVS) Checkpointing
• AVS Performance
• Preemptive Scheduling
• AVS Integration with LSF on Topdawg
• Conclusions
• About Evergrid
Challenges for Large HPC Systems
• Robustness & fault tolerance: long runs + many nodes = increased likelihood of failure. Need to ensure the real-time and compute-time “investment” in long computations.
• Scheduling: need a “stateful preemption” capability for efficient and optimal fair-share scheduling. Without stateful preemption, high-priority jobs terminate low-priority jobs, forcing them to restart from the beginning; this increases average throughput time and decreases the utilization rate.
• Maintenance: long time to quiesce the system; hard to do scheduled (or emergency) maintenance without killing jobs.
Relentless Effect of Scale
[Chart: MTBF (hours, 0-140) vs. system size N (2,048 to 131,072 nodes), one curve per per-node reliability R = 0.9999, 0.99999, 0.999999]

MTBF = 1 / (1 − R^N)
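To make the curve concrete: if each node survives a given hour with probability R, all N nodes survive it with probability R^N, so the expected time between system-level failures under this slide's simple model is

```latex
\mathrm{MTBF} = \frac{1}{1 - R^{N}}
\qquad\text{e.g. } R = 0.99999,\; N = 8192:\;
R^{N} = e^{8192\ln 0.99999} \approx e^{-0.0819} \approx 0.921,\;
\mathrm{MTBF} \approx \frac{1}{0.079} \approx 12.7\ \text{hours}.
```

Even a tenfold improvement to R = 0.999999 lifts the same 8,192-node system only to roughly 122 hours between failures, which is why CP/R, rather than raw hardware reliability, is the practical lever at scale.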
What Happens in the Real World?
System        #CPUs    Reliability
ASCI Q         8,192   MTBI: 6.5 hrs (114 unplanned outages/month)
ASCI White     8,192   MTBF: 5 hrs ('01), 40 hrs ('05)
PSC Lemieux    3,016   MTBI: 9.7 hrs
Google        15,000   20 reboots/day, 2-3% replaced/yr

Source: D. Reed, High-end computing: The challenge of scale, May 2004
Solution Requirements
How about a checkpoint/restart (CP/R) capability?

Need the following features to be useful in HPC systems:

• “Just works”: allows users to do their research (and not more programming!)
• No recoding or recompiling: allows application developers to focus on their domain (and not system programming)
• Requires transparent, standardized CP/R: restart and/or migrate applications between machines without side effects
• CP/R must be automatic and integrate with existing resource managers and schedulers to fully realize its potential
Evergrid Availability Services (AVS)
• Implemented via a dynamically linked library, libavs.so
  – Uses the LD_PRELOAD environment variable: no recompiling!
• Completely asynchronous and concurrent CP/R
• Incremental checkpointing
  – Fixed upper bound on checkpoint file size
  – Tunable “page” size
• Application/OS transparent
  – Migration capable
  – Stateful preemption for both serial and parallel jobs
• Integrates with commercial and open-source queuing systems: LSF, PBS, Torque/Maui, etc.
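Because interposition happens through LD_PRELOAD, launching under AVS only changes the job's environment. A hypothetical launch might look like the following; the library path and application name are illustrative assumptions, not Evergrid-documented defaults:

```
# Interpose AVS under an unmodified binary: the dynamic loader resolves
# intercepted libc/MPI symbols through libavs.so before the real libraries.
# The path and binary name below are illustrative assumptions.
export LD_PRELOAD=/opt/evergrid/lib64/libavs.so
mpirun -np 64 ./simulation    # runs unmodified; AVS checkpoints it
```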
Technology: OS Abstraction Layer
[Diagram: on each server in the pool, the AVS OS abstraction layer sits in user space between the application (with its libraries) and the OS; nodes are connected by an interconnect]

OS Abstraction Layer:
• Decouples applications from the operating system
• Transparent fault tolerance for stateful applications
• Preemptive scheduling

Key Features:
• Distributed: N nodes running the same or different apps
• Transparent: no modifications to the OS or application
• Performance: <5% overhead
What Do We Checkpoint?
AVS virtualizes the following resources used by applications to ensure transparent CP/R:

• Memory: heap, mmap()’d pages, stack, registers, selected shared libs
• Files: open descriptors, STDIO streams, STDIN/STDOUT, file contents (copy-on-write), links, directories
• Network: BSD sockets, IP addresses, MVAPICH 0.9.8, OFED, VAPI
• Process: process ID, process group ID, thread ID, fork() parent/child, shared memory, semaphores
Checkpoint Storage Modes

Shared-filesystem checkpointing:
• Best for jobs using fewer (< 16) processors
• Works with NFS, Lustre, GPFS, SAN, …

Local-disk checkpointing:
• More efficient for large distributed computations
• Provides “mirrored checkpointing”:
  – Backs up the checkpoint in case checkpointing fails or corrupts the local copy
  – Provides redundancy: the checkpoint is automatically recovered from the mirror if the local disk/machine fails
Local Disk Checkpointing & Mirroring
Interoperability
Application types:
• Parallel/distributed
• Serial
• Shared-memory (testing)
• Stock MPICH, MVAPICH (customized), OpenMPI (underway)

Interconnect fabrics:
• Infiniband, Ethernet (p4, “GigE”), 10GigE
• Potential Myrinet support via OpenMPI

Operating systems:
• RHEL 4, 5 (+ CentOS, Fedora)
• SLES 9, 10

Architecture: 64-bit Linux (x86_64)

Supported platforms, apps, etc. are customer-driven.
Tested Codes
QA-certified codes and compilers, with many more in the pipeline:

Compilers:
• Pathscale
• Intel Fortran
• Portland Group
• GNU Compilers

Benchmarks:
• Linpack
• NAS
• STREAM
• IOzone
• TI-06 (DoD) apps

Academic codes:
• LAMMPS, Amber, VASP
• MPIBlast, ClustalW-MPI
• ARPS, WRF
• HYCOMM

Commercial codes:
• LS-DYNA
• StarCD (CFD)
• Cadence and other EDA apps underway
Runtime & Checkpoint Overhead

Virtualization and checkpoint overheads are negligible (< 5%) with most workloads:

[Table: runtime overhead across five benchmark workloads; the workload names appear only in the original chart]
With AVS:        0.2%   1.4%   0.9%   1.2%   0.5%
W/ 1 chkpt/hr:   2.6%   3.3%   1.3%   2.1%   3.5%
Memory Overhead
On a per-node basis, the RAM overhead is constant.
Preemptive Scheduling
[Diagram: jobs from high-priority and low-priority queues share the running pool; preempted low-priority jobs are checkpointed and later resumed]

Increases server utilization & job throughput by 10-50%, depending on the priority mix.
Integration with LSF: Topdawg @ OU
Topdawg cluster at OSCER:
• 512 dual-core Xeon 3.20 GHz, 2MB cache, 4GB RAM
• RHEL 4.3, kernel 2.6.9-55.EL_lustre-1.4.11smp
• Platform LSF 6.1 as resource manager and scheduler

Objective: set up two queues with preemption (“lowpri” and “hipri”):
• lowpri: long/unlimited run time, but preemptable by hipri
• hipri: time-limited, but can preempt lowpri jobs

Example: a long-running (24-hour) low-priority clustalw-mpi job can be preempted by 4-6 hour ARPS and WRF jobs.
Checkpointing and preemption under LSF:
• Uses echkpnt and erestart for checkpointing/preempting and restarting
• Allows use of custom methods “echkpnt.xxx” and “erestart.xxx”
• Checkpoint method defined as an environment variable, or in lsf.conf
• Checkpoint directory, interval, and preemption defined in lsb.queues
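The queue-level pieces might look like the following in lsb.queues. The Begin Queue/End Queue stanzas and the CHKPNT and PREEMPTION keywords are standard LSF syntax, but the directory, interval, and priority values here are illustrative assumptions, not Topdawg's actual configuration:

```
# Illustrative lsb.queues stanzas (values are assumptions)
Begin Queue
QUEUE_NAME = lowpri
PRIORITY   = 10
CHKPNT     = /scratch/checkpoints 60   # checkpoint dir and period (minutes)
PREEMPTION = PREEMPTABLE               # may be preempted by hipri
End Queue

Begin Queue
QUEUE_NAME = hipri
PRIORITY   = 50
PREEMPTION = PREEMPTIVE[lowpri]        # may preempt jobs in lowpri
End Queue
```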
Evergrid integration of AVS into LSF:
• Introduces methods echkpnt.evergrid and erestart.evergrid to handle start, checkpointing, and restart under AVS
• Uses Topdawg variables MPI_INTERCONNECT and MPI_COMPILER to determine parallel vs. serial, IB vs. p4, and run-time compiler libs
• User sources only one standard Evergrid script from within the bsub script!
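The method-name convention can be sketched with a tiny mock dispatcher. The `echkpnt.<method>` naming and the `LSB_ECHKPNT_METHOD` selector follow LSF's custom-checkpoint convention, but this function is only an illustration, not Evergrid's actual echkpnt.evergrid:

```shell
# Mock of LSF's custom checkpoint-method dispatch: LSF resolves the external
# checkpoint command name to "echkpnt.<method>", where <method> comes from
# the LSB_ECHKPNT_METHOD environment variable (or lsf.conf). This sketch
# only prints the resolved command; the real echkpnt.evergrid would invoke AVS.
resolve_echkpnt() {
    method="${LSB_ECHKPNT_METHOD:-default}"
    echo "echkpnt.${method} job=$1"
}

LSB_ECHKPNT_METHOD=evergrid
resolve_echkpnt 1234    # prints: echkpnt.evergrid job=1234
```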
In environment (/etc/bashrc, /etc/csh.cshrc):
export EVERGRID_BASEDIR=/opt/evergrid
Before starting job:
export MPI_COMPILER=gcc
export MPI_INTERCONNECT=infiniband
At the top of your bsub script:
## Load the Evergrid and LSF integration:
source $EVERGRID_BASEDIR/bsub/env/evergrid-avs-lsf.src
Submitting a long-running preemptable job:
bsub -q lowpri < clustalw-job.bsub
Submitting a high-priority job:
bsub -q hipri < arps-job.bsub
What’s Underway
• Working with OpenMPI: support for Myrinet
• Growing list of supported applications: EDA, simulation, …
• Configuring LSF for completely transparent integration
Conclusions
Evergrid’s Availability Services provides:

• Transparent, scalable checkpointing for HPC applications
• Compute-time overhead of < 5% for most applications
• Bounded, nominal memory overhead
• Protection from the impact of hardware faults
• Assurance that jobs run to completion
• Seamless integration with resource managers and schedulers for preemptive scheduling, maintenance, and job recovery
Reference
Ruscio, J.F., Heffner, M.A., and Varadarajan, S., IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2007, 26-30 March 2007, pp. 1-10.
About Evergrid
Founded: Feb. 2004 by Dr. Srinidhi Varadarajan (VA Tech, “SystemX”) and B. J. Arun

Team: 50+ employees; R&D in Blacksburg, VA and Pune, India; HQ in Fremont, CA

Patents: 1 patent pending, 6 patents filed, 2 in process

Vision: build a vertically integrated management system that makes multi-datacenter scale-out infrastructures behave as a single managed entity

HPC: Cluster Availability Management Suite (CAMS): Availability Services (AVS), Resource Manager (RSM)

Enterprise: DataCenter Management Suite: AVS, RSM Enterprise, Live Migration, Load Manager, Provisioning
Acknowledgements
• NSF (funding for original research)
• OSCER (Henry Neeman, Brett Zimmerman, David Akin, Jim White)
Finding Out More
To find out more about Evergrid Software, contact:

Keith Ball: [email protected]

Sales: Natalie Van Unen, 617-784-8445, [email protected]
Partnering opportunities: Mitchell Ratner, 510-668-0500 ext. 5058, [email protected]

http://www.evergrid.com

Note: Evergrid will be at booth 2715 at the SC07 conference in Reno, Nevada, Nov. 13-15. Come by for a demo and a presentation on other products.