ecmwf 2014 craig tierney final1 · ncar (lsf) noaa (moab/torque) noaa/ornl (moab/torque) noaa...
![Page 1: ECMWF 2014 Craig Tierney final1 · NCAR (LSF) NOAA (Moab/Torque) NOAA/ORNL (Moab/Torque) NOAA (Moab/Torque) NOAA/WCOSS (LSF) Sites Running Rocoto U. of Miami (LSF) Coastal Carolina](https://reader031.vdocuments.us/reader031/viewer/2022021904/5ba3f94709d3f2af168c8851/html5/thumbnails/1.jpg)
Craig Tierney (1)
Nathan Dauchy (2)
Chris Harrop (1)
Forrest Hobbs (3)
(1) Cooperative Institute for Research in Environmental Sciences, University of Colorado at Boulder
(2) Computer Sciences Corporation
(3) National Oceanic and Atmospheric Administration, Earth System Research Laboratory, Global Systems Division
What is Deadline Driven Science?
• Meeting the completion deadline is critical to the value of the workflow
– Real‐time experiments
– Guidance products
• Similar to operational, except
– No guarantees provided to product users
– No impact to life and property when runs are missed
What Are the Challenges?
• Most R&D HPC Systems
– FIFO queue, possibly with fair‐share
– Large mix of users, job sizes, varying operating modes
• Complex time, file, and job dependencies
• Need guarantees to meet deadlines
• Need reliable/resilient/robust workflow management
• No operational staff to monitor job completion
Solutions need to fit our philosophy of portability
Standing Reservations
Workflow Management
Distributed CRON
Workflow Management with Rocoto
Chris Harrop
What is Workflow Management?
Describe and manage the execution of a collection of tasks in a scientific application.
That’s Easy!!!
What is Workflow Management?
Ensure completion of workflows with complex task, file, and time dependencies, on systems where component failures are a matter of when, not if, and where no human is actively monitoring jobs.
That’s Not So Easy…
Rocoto
• Supports weather and climate community
modeling paradigms
• Runs in user‐space
• Portable across many different batch
systems
– Moab/Torque, LSF, Grid Engine, SLURM
Rocoto manages nearly all work done by the Development Testbed Center (http://www.dtcenter.org/)
Rocoto – Key Features
• Real‐time and retrospective modes
• Fault Tolerance
• Complex dependencies based on Time, File and Task
• Generic and portable batch specifications
• Multi‐threaded job submission
• Workflow throttling
• Meta‐tasks conveniently describe multiple similar tasks
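To make the dependency feature concrete, here is a minimal Python sketch of how a workflow manager might decide that a task is ready to submit. It is not Rocoto's implementation: the `Task` fields, the `is_ready` function, and the example values are hypothetical names chosen for illustration.

```python
import os
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Task:
    """Hypothetical task record; not Rocoto's data model."""
    name: str
    after_tasks: list = field(default_factory=list)  # task dependencies
    need_files: list = field(default_factory=list)   # file dependencies
    not_before: datetime = None                      # time dependency


def is_ready(task, succeeded, now):
    """Return True when every task, file, and time dependency is satisfied."""
    tasks_ok = all(name in succeeded for name in task.after_tasks)
    files_ok = all(os.path.exists(path) for path in task.need_files)
    time_ok = task.not_before is None or now >= task.not_before
    return tasks_ok and files_ok and time_ok


model = Task(
    name="model",
    after_tasks=["preprocess"],
    not_before=datetime(2014, 10, 1, 12, 0),
)
now = datetime(2014, 10, 1, 12, 5)
print(is_ready(model, succeeded=set(), now=now))           # preprocess not done yet
print(is_ready(model, succeeded={"preprocess"}, now=now))  # all dependencies met
```

A manager that is re-run every few minutes can call a check like this for each pending task and submit only the ones that pass.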
Sites Running Rocoto
• NCAR (LSF)
• NOAA (Moab/Torque)
• NOAA/ORNL (Moab/Torque)
• NOAA (Moab/Torque)
• NOAA/WCOSS (LSF)
• U. of Miami (LSF)
• Coastal Carolina U. (SLURM)
• U. of Wisconsin (SLURM, Grid Engine)
• Presidency of Meteorology and Environment, Saudi Arabia (Torque)
• U. of Maryland (SLURM)
• NREL (Moab/Torque)
• Thomas J. Watson Research Center, IBM (SLURM)
• IBM Research Laboratory, China (SLURM)
• U. of Colorado at Boulder (SLURM)
A Typical Workflow
[Figure: workflow diagram with the stages Input Data, Pre-processing, Model, Output Data, Post-processing, Grid Interpolation, Verification, and Graphics. Annotations: one to several cores; one to many cores; many output files; one set of post tasks per output file.]
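The stages named in the workflow figure form a dependency graph. As a rough sketch, the standard-library `graphlib` module can produce one valid execution order for such a graph. The task names and the edges between them are assumptions made for illustration; the slide does not spell them out.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each key depends on the tasks in its set. Names and edges are assumed
# for illustration only.
workflow = {
    "preprocess": set(),
    "model": {"preprocess"},
    "post_f00": {"model"},
    "interp_f00": {"post_f00"},
    "verify_f00": {"interp_f00"},
    "graphics_f00": {"interp_f00"},
}

# static_order() yields every task after all of its dependencies.
order = list(TopologicalSorter(workflow).static_order())
print(order)
```

With many output files, the post-processing chain would be repeated once per file, which is where the per-file task sets in the figure come from.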
CASE: High Resolution Rapid Refresh
• 15 hour forecast, runs every hour
• 3 km resolution
• Continental U.S. domain
• Used in aviation, severe weather, and renewable energy forecasting
• Up to 263 different tasks per run
– Data Preparation
– Data Assimilation
– Model Execution
– Post Processing and Visualization
CASE: High Resolution Rapid Refresh
• Dependency trees vary depending on start time
• Uses meta‐tasks to describe each forecast hour
• Complex dependencies allow the workflow to advance in the absence of timely data arrival
HRRR was transitioned to operations at the National Weather Service in September 2014
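The idea of describing each forecast hour from a single template can be sketched in Python as follows. This illustrates the concept only, not Rocoto's meta-task syntax: `expand_metatask`, the `run_post.sh` command, and the assumption that a 15 hour forecast produces output for hours 0 through 15 are all hypothetical.

```python
def expand_metatask(name_template, command_template, forecast_hours):
    """Expand one templated task definition into a task per forecast hour."""
    return [
        {
            "name": name_template.format(fhr=fhr),
            "command": command_template.format(fhr=fhr),
        }
        for fhr in forecast_hours
    ]


# Assumption: a 15 hour forecast writes output for hours 0 through 15.
tasks = expand_metatask("post_f{fhr:02d}", "run_post.sh --fhr {fhr:02d}", range(16))
print(len(tasks))        # 16
print(tasks[3]["name"])  # post_f03
```

Writing the task once and expanding it keeps the workflow description short even when a run contains hundreds of near-identical tasks.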
Distributed, Highly‐Available CRON Services
Craig Tierney
Nathan Dauchy
Why we use CRON
• Weather forecasting is driven by the clock!
• Model cycles start every 1‐6 hours
• Workflow management scripts run every
1‐5 minutes
• Input/output data pull/push/sync
• Systems management scripts
D‐CRON – Distributed CRON System
• Provide a unified crontab across the
system
• Distribute cron tasks across multiple systems
• Peer‐to‐peer reliability daemon
• Functionality is transparent to the users
D‐CRON Secondary Benefits
• Fewer help tickets from users asking why their workflows did not start or complete
• No more questions about “lost” crontabs
• Operations staff no longer need to monitor and maintain individual front‐end nodes
How D‐CRON Works
User creates a crontab entry.
# Realtime FIM run
0-57/3 * * * * rocotorun -w FIMG8UJET.xml -d FIMG8UJET.db
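The minute field `0-57/3` in the crontab entry above means "every third minute from 0 through 57". The small Python sketch below expands that field to show how often the workflow manager is launched. `expand_minute_field` is a hypothetical helper that handles only the `start-end/step` form, not the full cron syntax.

```python
def expand_minute_field(field):
    """Expand a cron minute field of the form 'start-end/step'."""
    span, _, step = field.partition("/")
    start, _, end = span.partition("-")
    return list(range(int(start), int(end) + 1, int(step or 1)))


minutes = expand_minute_field("0-57/3")
print(minutes[:4], "...", minutes[-1])
print(len(minutes), "launches per hour")
```

The entry therefore starts `rocotorun` 20 times per hour, which falls within the 1‐5 minute cadence mentioned earlier for workflow management scripts.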
The user crontab is transparently modified to work with the D‐CRON system
Prior to each scheduling iteration, status of all service nodes is checked.
Service1 Service2 Service3 … ServiceN
Work is distributed to all service nodes.
Hash function is used to determine which node does the work.
Local CRON daemon executes the work on a single node.
Work will always be scheduled on the same node unless there is an issue with the service node.
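The slides do not say which hash function D‐CRON uses. One technique with the stated property (an entry stays on the same node unless that node has a problem) is rendezvous hashing, sketched below in Python. `pick_node` and the node names are hypothetical, and this is an illustration of the idea rather than the D‐CRON implementation.

```python
import hashlib


def pick_node(entry, healthy_nodes):
    """Pick the healthy node with the highest hash score for this entry."""
    def score(node):
        digest = hashlib.sha256(f"{node}|{entry}".encode()).hexdigest()
        return int(digest, 16)
    return max(healthy_nodes, key=score)


entry = "0-57/3 * * * * rocotorun -w FIMG8UJET.xml -d FIMG8UJET.db"
nodes = ["Service1", "Service2", "Service3"]

first = pick_node(entry, nodes)
again = pick_node(entry, nodes)        # same node on every iteration
others = [n for n in nodes if n != first]
fallback = pick_node(entry, others)    # used only if the chosen node is unhealthy
print(first, again, fallback)
```

Because every node can compute the same scores independently, each one can decide locally whether it is the node that should run the entry. Losing a node that was not chosen leaves the assignment unchanged.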
How D‐CRON is used
• 81522 CRON tasks launched daily on Jet
– Versus 48140 batch jobs (Sept. 2014)
• 123785 CRON tasks launched daily on Zeus
– Versus 80239 batch jobs (Sept. 2014)
Guaranteeing Resources for Real‐Time Experiments
Craig Tierney, CIRES
Christopher Harrop, CIRES
Standing Reservations
• Pre‐allocated blocks of the system that guarantee availability
• Finite reservation
– Can be released by the user when not needed
• Infinite reservation
Typical Standing Reservations
[Figure: Nodes versus Time, with an Epoch marker. Stages: Pre‐Processing (1 to several cores), Model Run(s) (10s‐1000s of cores), Post‐Processing (1 to several cores).]
Infinite Reservations (IR)
• No end time
• Required when models cannot cold‐start
• On our system, IRs are stressful
– Cause problems with the scheduler
– Often block non‐realtime jobs from using unused reserved resources
• In 2014, we moved to a system based on preemption
– Reduced stress on the system
– Allowed for more non‐realtime work
[Figure: Utilization (pct) and Backlog (Hours) versus Simulation Time (hours, 0 to 22). Series: Util, NoRes; Util, WithRes; Res Usage; Backlog, NoRes; Backlog, WithRes.]
When there are reservations, utilization drops because no jobs can be backfilled just before the reservation starts.
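The backfill effect can be illustrated with a one-line rule: on nodes covered by an upcoming reservation, a job can start only if its requested walltime ends before the reservation begins. The Python sketch below is a simplified model, not the behavior of any particular scheduler. `can_backfill` and the 12-hour reservation start are assumed values for illustration.

```python
def can_backfill(now, walltime, reservation_start):
    """A job may start only if it would finish before the reservation begins."""
    return now + walltime <= reservation_start


reservation_start = 12.0  # hours; assumed value
for now in (6.0, 9.0, 11.0):
    fits = can_backfill(now, walltime=2.0, reservation_start=reservation_start)
    print(f"t={now:>4}: 2-hour job can start -> {fits}")
```

As the reservation approaches, fewer queued jobs are short enough to fit, so the affected nodes sit idle and utilization dips.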
Backlog is larger with reservations, especially during the reservation and afterwards as the system tries to drain the backlog.
Current Reservation Usage
• 2014 Hurricane Season
– 25196 total cores
– 105 reservations per day, 50% of total core hours
– Maximum of 8332 cores available via preemption
(33% of available core hours)
83% of total resources under reservation/preemption
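The percentages on this slide can be checked with a few lines of Python. The inputs are taken from the bullets above. Treating the 83% figure as the sum of the reservation share and the preemption share is an assumption about how it was derived.

```python
total_cores = 25196
preempt_cores = 8332
reserved_pct = 50  # share of total core hours under reservation, from the slide

preempt_pct = 100 * preempt_cores / total_cores
print(f"preemption share: {preempt_pct:.1f}%")
print(f"combined: {reserved_pct + round(preempt_pct)}%")
```

The preemption share comes out at about 33%, which together with the 50% under reservation gives the 83% quoted above.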
Summary
• Portable and resilient workflow management allows
us to reliably complete experiments
• Extending CRON services to be distributed improves
fault‐tolerance and reduces support requirements
• Using standing reservations allows real‐time
experiments to reliably finish in traditional R&D HPC
environments
Backup Slides