
Page 1: Experiences Running Seismic Hazard Workflows

Experiences Running Seismic Hazard Workflows

Scott Callaghan
Southern California Earthquake Center
University of Southern California
SC13 Workflow BoF

Page 2: Experiences Running Seismic Hazard Workflows

Overview of SCEC Computational Problem

• Domain: Probabilistic Seismic Hazard Analysis (PSHA)

• PSHA computation defined as a workflow with two parts (a minimal Pegasus sketch follows this list):

  1. "Strain Green Tensor Workflow"
     – ~10 jobs, the largest about 4,000 core-hours
     – Output: two 20 GB files (40 GB total)

  2. "Post-processing Workflow"
     – High throughput
     – ~410,000 serial jobs, runtimes less than 1 min
     – Output: 14,000 files totaling 12 GB

• Uses Pegasus-WMS, HTCondor, Globus
• Calculates hazard for 1 location
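Since the slides name Pegasus-WMS, a two-stage dependency like this can be described with the Pegasus DAX3 Python API. The sketch below is only an illustration of the pattern, not the actual CyberShake code; the executable and file names are hypothetical.

    # Minimal sketch of a two-stage Pegasus workflow (DAX3 Python API).
    # Executable and file names are hypothetical, not the real CyberShake codes.
    from Pegasus.DAX3 import ADAG, Job, File, Link

    dax = ADAG("cybershake_site")

    # Stage 1: parallel Strain Green Tensor (SGT) job producing a large output file
    sgt_out = File("sgt_x.grm")
    sgt = Job(name="sgt_generator")
    sgt.addArguments("--component", "x")
    sgt.uses(sgt_out, link=Link.OUTPUT)
    dax.addJob(sgt)

    # Stage 2: one of the many small post-processing tasks consuming the SGT output
    post = Job(name="seismogram_synth")
    post.addArguments("--rupture", "0")
    post.uses(sgt_out, link=Link.INPUT)
    dax.addJob(post)

    # Post-processing runs only after the SGT job succeeds
    dax.depends(parent=sgt, child=post)

    with open("cybershake_site.dax", "w") as f:
        dax.writeXML(f)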

Page 3: Experiences Running Seismic Hazard Workflows

Recent PSHA Research Calculation Using Workflows – CyberShake Study 13.4

• PSHA calculation exploring the impact of alternative velocity models and alternative codes was defined as 1144 workflows

• Parallel Strain Green Tensor (SGT) workflow on Blue Waters (see the sketch below)
  – No remote job submission support in Spring 2013
  – Created a basic workflow system using qsub dependencies
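A minimal sketch of that qsub-dependency pattern, assuming PBS/Torque's "-W depend=afterok:" syntax as available on Blue Waters at the time; the batch script names are hypothetical.

    # Sketch of chaining jobs with qsub dependencies (PBS/Torque "afterok").
    # Script names are hypothetical; error handling is omitted for brevity.
    import subprocess

    def submit(script, after=None):
        """Submit a batch script, optionally depending on an earlier job ID."""
        cmd = ["qsub"]
        if after:
            cmd += ["-W", "depend=afterok:" + after]
        cmd.append(script)
        # qsub prints the new job ID on stdout
        return subprocess.check_output(cmd).decode().strip()

    sgt_x = submit("sgt_x.pbs")                # first SGT component
    sgt_y = submit("sgt_y.pbs")                # second SGT component
    submit("check_x_outputs.pbs", after=sgt_x) # runs only if sgt_x succeeds
    submit("check_y_outputs.pbs", after=sgt_y)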

• Post-processing workflow on Stampede (illustrated after this list)
  – Pegasus-MPI-Cluster to manage serial jobs
  – A single job is submitted to Stampede
  – Master-worker paradigm; work is distributed to workers
  – Relatively little output, so writes were aggregated in the master
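The master-worker idea with writes aggregated in the master can be illustrated with a small MPI sketch. This is not Pegasus-MPI-Cluster itself, just the same pattern expressed with mpi4py; the task list and "computation" are placeholders.

    # Illustration of the master-worker pattern with writes aggregated in the
    # master (rank 0). Not Pegasus-MPI-Cluster, just a sketch of the same idea.
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()

    TASKS = list(range(100))          # placeholder work items

    if rank == 0:
        results, next_task, active = [], 0, 0
        status = MPI.Status()
        # seed every worker with one task, or a stop signal if none remain
        for w in range(1, size):
            if next_task < len(TASKS):
                comm.send(TASKS[next_task], dest=w)
                next_task += 1
                active += 1
            else:
                comm.send(None, dest=w)
        # collect results and keep workers busy until all tasks are done
        while active > 0:
            result = comm.recv(source=MPI.ANY_SOURCE, status=status)
            results.append(result)
            worker = status.Get_source()
            if next_task < len(TASKS):
                comm.send(TASKS[next_task], dest=worker)
                next_task += 1
            else:
                comm.send(None, dest=worker)
                active -= 1
        # the master does all the writing, so many tiny files become one
        with open("aggregated_output.txt", "w") as out:
            for r in results:
                out.write(str(r) + "\n")
    else:
        # workers compute and send small results back; they never write files
        while True:
            task = comm.recv(source=0)
            if task is None:
                break
            comm.send(task * task, dest=0)   # placeholder "computation"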

Page 4: Experiences Running Seismic Hazard Workflows

CyberShake Study 13.4 Workflow Performance Metrics

• April 17, 2013 – June 17, 2013 (62 days)
  – Running jobs 60% of the time

• Blue Waters usage – parallel SGT calculations:
  – Average of 19,300 cores, 8 jobs
  – Average queue time: 401 sec (34% of job runtime)

• Stampede usage – many-task post-processing:
  – Average of 1,860 cores, 4 jobs
  – 470 million tasks executed (177 tasks/sec; see the check below)
  – 21,912 jobs total
  – Average queue time: 422 sec (127% of job runtime)
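The total task count lines up with the per-workflow figures on the earlier slide; a quick back-of-the-envelope check (my arithmetic, using only numbers quoted on the slides):

    # Consistency check of the reported CyberShake Study 13.4 totals.
    workflows = 1144                # workflows in the study
    tasks_per_workflow = 410_000    # ~serial post-processing tasks per workflow
    total_tasks = workflows * tasks_per_workflow
    print(f"{total_tasks:,}")       # ~469 million, matching the ~470 million reported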

Page 5: Experiences Running Seismic Hazard Workflows

Constructing Workflow Capabilities using HPC System Services

• Our workflows rely on multiple services
  – Remote gatekeeper
  – Remote GridFTP server
  – Local GridFTP server

• Currently, issues are detected via workflow failure

• It would be helpful if there were automated checks that everything is operational (a pre-flight sketch follows this list)

• Additional factors
  – Valid X509 proxy
  – Available disk space
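A lightweight pre-flight script can cover several of these checks before any workflow is submitted. A sketch under the assumption that Globus Toolkit command-line tools and default ports are in use; the host names, paths, and thresholds are placeholders.

    # Pre-flight checks before submitting workflows: service reachability,
    # X509 proxy lifetime, and local disk space. Hosts and paths are placeholders.
    import shutil, socket, subprocess

    def port_open(host, port, timeout=10):
        """Return True if a TCP connection to host:port succeeds."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    checks = {
        # GridFTP servers listen on port 2811 by default
        "remote gridftp": port_open("gridftp.example.edu", 2811),
        "local gridftp":  port_open("localhost", 2811),
        # GRAM gatekeeper default port
        "remote gatekeeper": port_open("gatekeeper.example.edu", 2119),
    }

    # X509 proxy: grid-proxy-info -timeleft prints the remaining seconds
    timeleft = int(subprocess.check_output(["grid-proxy-info", "-timeleft"]).decode().strip())
    checks["valid X509 proxy"] = timeleft > 3600      # at least an hour left

    # Disk space on the workflow scratch area (threshold is arbitrary here)
    free_gb = shutil.disk_usage("/scratch").free / 2**30
    checks["disk space"] = free_gb > 100

    for name, ok in checks.items():
        print(f"{name}: {'OK' if ok else 'FAILED'}")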

Page 6: Experiences Running Seismic Hazard Workflows

Why Define PSHA Calculations as Multiple Workflows?

• CyberShake Study 13.4 used 1144 workflows
• Too many jobs to merge into 1 workflow
• Currently, we create 1144 workflows and monitor submission via a cron job (sketched below)
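The cron-driven approach can be sketched roughly as follows. This is a simplification of the idea, not SCEC's actual script: the directory layout and the RUNNING/DONE marker files are hypothetical, and is_running() could in practice wrap pegasus-status or condor_q instead.

    # Sketch of a cron-driven submitter that keeps a bounded number of the
    # 1144 workflows running at a time.
    import subprocess
    from pathlib import Path

    MAX_CONCURRENT = 40
    SUBMIT_DIRS = sorted(Path("/work/cybershake/submit").iterdir())

    def is_running(submit_dir):
        """Hypothetical check of whether a workflow is still active."""
        return (submit_dir / "RUNNING").exists()

    running = [d for d in SUBMIT_DIRS if is_running(d)]
    pending = [d for d in SUBMIT_DIRS
               if not (d / "DONE").exists() and d not in running]

    for d in pending[: MAX_CONCURRENT - len(running)]:
        # pegasus-run starts a previously planned workflow from its submit directory
        subprocess.check_call(["pegasus-run", str(d)])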

Page 7: Experiences Running Seismic Hazard Workflows

Improving Workflow Capabilities

• Workflow tools could provide higher-level organization (a hypothetical interface is sketched below)
  – Specify a set of workflows as a group
  – Workflows are submitted and monitored
  – Errors detected and the group paused
  – Easy to obtain status and progress
  – Statistics are aggregated over the group
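One way to picture that higher-level organization is as a small "workflow group" abstraction. This is purely a hypothetical interface illustrating the capabilities listed above, not an existing tool's API.

    # Hypothetical interface for managing a set of workflows as one group.
    from dataclasses import dataclass, field
    from enum import Enum

    class State(Enum):
        PENDING = "pending"
        RUNNING = "running"
        DONE = "done"
        FAILED = "failed"

    @dataclass
    class WorkflowGroup:
        name: str
        workflows: dict = field(default_factory=dict)   # workflow id -> State

        def add(self, wf_id):
            self.workflows[wf_id] = State.PENDING

        def pause_on_error(self):
            """If any member failed, stop submitting new members."""
            return any(s is State.FAILED for s in self.workflows.values())

        def progress(self):
            """Fraction of member workflows that have completed."""
            done = sum(1 for s in self.workflows.values() if s is State.DONE)
            return done / len(self.workflows) if self.workflows else 0.0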

Page 8: Experiences Running Seismic Hazard Workflows

Summary of SCEC Workflow Needs

• Workflow tools for large calculations on leadership-class systems that do not support remote job submission
  – Need capabilities to define job dependencies, manage failure and retry, and track files using a DBMS

• Light-weight workflow tools that can be included in scientific software releases
  – For unsophisticated users, the complexity of workflow tools should not outweigh their advantages

• Methods to define workflows with interactive (people-in-the-loop) stages
  – Needed to automate computational pathways that have manual stages

• Methods to gather both scientific and computational metadata into databases
  – Needed to ensure reproducibility and support user interfaces (a SQLite sketch follows this list)
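For the last point, even a small relational store goes a long way. A sketch of recording per-job scientific and computational metadata with SQLite; the schema and field values are illustrative, not an SCEC schema.

    # Sketch of capturing per-job metadata in SQLite for reproducibility.
    import sqlite3, time

    conn = sqlite3.connect("workflow_metadata.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS job_runs (
            job_id       TEXT,
            site         TEXT,    -- hazard site name (scientific metadata)
            code_version TEXT,    -- which code / velocity model was used
            host         TEXT,    -- where it ran (computational metadata)
            cores        INTEGER,
            runtime_sec  REAL,
            finished_at  REAL
        )
    """)

    def record(job_id, site, code_version, host, cores, runtime_sec):
        conn.execute(
            "INSERT INTO job_runs VALUES (?, ?, ?, ?, ?, ?, ?)",
            (job_id, site, code_version, host, cores, runtime_sec, time.time()),
        )
        conn.commit()

    # Example entry (values are illustrative)
    record("sgt_x_0001", "USC", "sgt-code-1.0", "bluewaters", 4000, 3600.0)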