
Page 1: Experiences Running Seismic Hazard Workflows

Experiences Running Seismic Hazard Workflows

Scott Callaghan
Southern California Earthquake Center
University of Southern California
SC13 Workflow BoF

Page 2: Experiences Running Seismic Hazard Workflows

Overview of SCEC Computational Problem

• Domain: Probabilistic Seismic Hazard Analysis (PSHA)

• PSHA computation defined as a workflow with two parts (a minimal Pegasus sketch follows this list):

  1. "Strain Green Tensor Workflow"
     – ~10 jobs, the largest about 4,000 core-hours
     – Output: two 20 GB files (40 GB total)

  2. "Post-processing Workflow"
     – High throughput
     – ~410,000 serial jobs, runtimes less than 1 min
     – Output: 14,000 files totaling 12 GB

• Uses Pegasus-WMS, HTCondor, Globus
• Calculates hazard for 1 location
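Since the slides name Pegasus-WMS, a two-stage dependency like this can be described with the Pegasus DAX3 Python API. The sketch below is only an illustration of the pattern, not the actual CyberShake code; the executable and file names are hypothetical.

    # Minimal sketch of a two-stage Pegasus workflow (DAX3 Python API).
    # Executable and file names are hypothetical, not the real CyberShake codes.
    from Pegasus.DAX3 import ADAG, Job, File, Link

    dax = ADAG("cybershake_site")

    # Stage 1: parallel Strain Green Tensor (SGT) job producing a large output file
    sgt_out = File("sgt_x.grm")
    sgt = Job(name="sgt_generator")
    sgt.addArguments("--component", "x")
    sgt.uses(sgt_out, link=Link.OUTPUT)
    dax.addJob(sgt)

    # Stage 2: one of the many small post-processing tasks consuming the SGT output
    post = Job(name="seismogram_synth")
    post.addArguments("--rupture", "0")
    post.uses(sgt_out, link=Link.INPUT)
    dax.addJob(post)

    # Post-processing runs only after the SGT job succeeds
    dax.depends(parent=sgt, child=post)

    with open("cybershake_site.dax", "w") as f:
        dax.writeXML(f)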

Page 3: Experiences Running Seismic Hazard Workflows

Recent PSHA Research Calculation Using Workflows – CyberShake Study 13.4

• PSHA calculation exploring the impact of alternative velocity models and alternative codes was defined as 1144 workflows

• Parallel Strain Green Tensor (SGT) workflow on Blue Waters (see the sketch below)
  – No remote job submission support in Spring 2013
  – Created a basic workflow system using qsub dependencies
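A minimal sketch of that qsub-dependency pattern, assuming PBS/Torque's "-W depend=afterok:" syntax as available on Blue Waters at the time; the batch script names are hypothetical.

    # Sketch of chaining jobs with qsub dependencies (PBS/Torque "afterok").
    # Script names are hypothetical; error handling is omitted for brevity.
    import subprocess

    def submit(script, after=None):
        """Submit a batch script, optionally depending on an earlier job ID."""
        cmd = ["qsub"]
        if after:
            cmd += ["-W", "depend=afterok:" + after]
        cmd.append(script)
        # qsub prints the new job ID on stdout
        return subprocess.check_output(cmd).decode().strip()

    sgt_x = submit("sgt_x.pbs")                # first SGT component
    sgt_y = submit("sgt_y.pbs")                # second SGT component
    submit("check_x_outputs.pbs", after=sgt_x) # runs only if sgt_x succeeds
    submit("check_y_outputs.pbs", after=sgt_y)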

• Post-processing workflow on Stampede (illustrated after this list)
  – Pegasus-MPI-Cluster to manage serial jobs
  – A single job is submitted to Stampede
  – Master-worker paradigm; work is distributed to workers
  – Relatively little output, so writes were aggregated in the master
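The master-worker idea with writes aggregated in the master can be illustrated with a small MPI sketch. This is not Pegasus-MPI-Cluster itself, just the same pattern expressed with mpi4py; the task list and "computation" are placeholders.

    # Illustration of the master-worker pattern with writes aggregated in the
    # master (rank 0). Not Pegasus-MPI-Cluster, just a sketch of the same idea.
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()

    TASKS = list(range(100))          # placeholder work items

    if rank == 0:
        results, next_task, active = [], 0, 0
        status = MPI.Status()
        # seed every worker with one task, or a stop signal if none remain
        for w in range(1, size):
            if next_task < len(TASKS):
                comm.send(TASKS[next_task], dest=w)
                next_task += 1
                active += 1
            else:
                comm.send(None, dest=w)
        # collect results and keep workers busy until all tasks are done
        while active > 0:
            result = comm.recv(source=MPI.ANY_SOURCE, status=status)
            results.append(result)
            worker = status.Get_source()
            if next_task < len(TASKS):
                comm.send(TASKS[next_task], dest=worker)
                next_task += 1
            else:
                comm.send(None, dest=worker)
                active -= 1
        # the master does all the writing, so many tiny files become one
        with open("aggregated_output.txt", "w") as out:
            for r in results:
                out.write(str(r) + "\n")
    else:
        # workers compute and send small results back; they never write files
        while True:
            task = comm.recv(source=0)
            if task is None:
                break
            comm.send(task * task, dest=0)   # placeholder "computation"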

Page 4: Experiences Running Seismic Hazard Workflows

CyberShake Study 13.4 Workflow Performance Metrics

• April 17, 2013 – June 17, 2013 (62 days)
  – Running jobs 60% of the time

• Blue Waters usage – parallel SGT calculations:
  – Average of 19,300 cores, 8 jobs
  – Average queue time: 401 sec (34% of job runtime)

• Stampede usage – many-task post-processing:
  – Average of 1,860 cores, 4 jobs
  – 470 million tasks executed (177 tasks/sec; see the check below)
  – 21,912 jobs total
  – Average queue time: 422 sec (127% of job runtime)
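The total task count lines up with the per-workflow figures on the earlier slide; a quick back-of-the-envelope check (my arithmetic, using only numbers quoted on the slides):

    # Consistency check of the reported CyberShake Study 13.4 totals.
    workflows = 1144                # workflows in the study
    tasks_per_workflow = 410_000    # ~serial post-processing tasks per workflow
    total_tasks = workflows * tasks_per_workflow
    print(f"{total_tasks:,}")       # ~469 million, matching the ~470 million reported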

Page 5: Experiences Running Seismic Hazard Workflows

Constructing Workflow Capabilities using HPC System Services

• Our workflows rely on multiple services
  – Remote gatekeeper
  – Remote GridFTP server
  – Local GridFTP server

• Currently, issues are detected via workflow failure

• It would be helpful if there were automated checks that everything is operational (a pre-flight sketch follows this list)

• Additional factors
  – Valid X509 proxy
  – Available disk space
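A lightweight pre-flight script can cover several of these checks before any workflow is submitted. A sketch under the assumption that Globus Toolkit command-line tools and default ports are in use; the host names, paths, and thresholds are placeholders.

    # Pre-flight checks before submitting workflows: service reachability,
    # X509 proxy lifetime, and local disk space. Hosts and paths are placeholders.
    import shutil, socket, subprocess

    def port_open(host, port, timeout=10):
        """Return True if a TCP connection to host:port succeeds."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    checks = {
        # GridFTP servers listen on port 2811 by default
        "remote gridftp": port_open("gridftp.example.edu", 2811),
        "local gridftp":  port_open("localhost", 2811),
        # GRAM gatekeeper default port
        "remote gatekeeper": port_open("gatekeeper.example.edu", 2119),
    }

    # X509 proxy: grid-proxy-info -timeleft prints the remaining seconds
    timeleft = int(subprocess.check_output(["grid-proxy-info", "-timeleft"]).decode().strip())
    checks["valid X509 proxy"] = timeleft > 3600      # at least an hour left

    # Disk space on the workflow scratch area (threshold is arbitrary here)
    free_gb = shutil.disk_usage("/scratch").free / 2**30
    checks["disk space"] = free_gb > 100

    for name, ok in checks.items():
        print(f"{name}: {'OK' if ok else 'FAILED'}")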

Page 6: Experiences Running Seismic Hazard Workflows

Why Define PSHA Calculations as Multiple Workflows?

• CyberShake Study 13.4 used 1144 workflows
• Too many jobs to merge into 1 workflow
• Currently, we create 1144 workflows and monitor submission via a cron job (sketched below)
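The cron-driven approach can be sketched roughly as follows. This is a simplification of the idea, not SCEC's actual script: the directory layout and the RUNNING/DONE marker files are hypothetical, and is_running() could in practice wrap pegasus-status or condor_q instead.

    # Sketch of a cron-driven submitter that keeps a bounded number of the
    # 1144 workflows running at a time.
    import subprocess
    from pathlib import Path

    MAX_CONCURRENT = 40
    SUBMIT_DIRS = sorted(Path("/work/cybershake/submit").iterdir())

    def is_running(submit_dir):
        """Hypothetical check of whether a workflow is still active."""
        return (submit_dir / "RUNNING").exists()

    running = [d for d in SUBMIT_DIRS if is_running(d)]
    pending = [d for d in SUBMIT_DIRS
               if not (d / "DONE").exists() and d not in running]

    for d in pending[: MAX_CONCURRENT - len(running)]:
        # pegasus-run starts a previously planned workflow from its submit directory
        subprocess.check_call(["pegasus-run", str(d)])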

Page 7: Experiences Running Seismic Hazard Workflows

Improving Workflow Capabilities

• Workflow tools could provide higher-level organization (a hypothetical interface is sketched below)
  – Specify a set of workflows as a group
  – Workflows are submitted and monitored
  – Errors detected and the group paused
  – Easy to obtain status and progress
  – Statistics are aggregated over the group
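One way to picture that higher-level organization is as a small "workflow group" abstraction. This is purely a hypothetical interface illustrating the capabilities listed above, not an existing tool's API.

    # Hypothetical interface for managing a set of workflows as one group.
    from dataclasses import dataclass, field
    from enum import Enum

    class State(Enum):
        PENDING = "pending"
        RUNNING = "running"
        DONE = "done"
        FAILED = "failed"

    @dataclass
    class WorkflowGroup:
        name: str
        workflows: dict = field(default_factory=dict)   # workflow id -> State

        def add(self, wf_id):
            self.workflows[wf_id] = State.PENDING

        def pause_on_error(self):
            """If any member failed, stop submitting new members."""
            return any(s is State.FAILED for s in self.workflows.values())

        def progress(self):
            """Fraction of member workflows that have completed."""
            done = sum(1 for s in self.workflows.values() if s is State.DONE)
            return done / len(self.workflows) if self.workflows else 0.0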

Page 8: Experiences Running Seismic Hazard Workflows

Summary of SCEC Workflow Needs

• Workflow tools for large calculations on leadership-class systems that do not support remote job submission
  – Need capabilities to define job dependencies, manage failure and retry, and track files using a DBMS

• Light-weight workflow tools that can be included in scientific software releases
  – For unsophisticated users, the complexity of workflow tools should not outweigh their advantages

• Methods to define workflows with interactive (people-in-the-loop) stages
  – Needed to automate computational pathways that have manual stages

• Methods to gather both scientific and computational metadata into databases
  – Needed to ensure reproducibility and support user interfaces (a SQLite sketch follows this list)
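For the last point, even a small relational store goes a long way. A sketch of recording per-job scientific and computational metadata with SQLite; the schema and field values are illustrative, not an SCEC schema.

    # Sketch of capturing per-job metadata in SQLite for reproducibility.
    import sqlite3, time

    conn = sqlite3.connect("workflow_metadata.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS job_runs (
            job_id       TEXT,
            site         TEXT,    -- hazard site name (scientific metadata)
            code_version TEXT,    -- which code / velocity model was used
            host         TEXT,    -- where it ran (computational metadata)
            cores        INTEGER,
            runtime_sec  REAL,
            finished_at  REAL
        )
    """)

    def record(job_id, site, code_version, host, cores, runtime_sec):
        conn.execute(
            "INSERT INTO job_runs VALUES (?, ?, ?, ?, ?, ?, ?)",
            (job_id, site, code_version, host, cores, runtime_sec, time.time()),
        )
        conn.commit()

    # Example entry (values are illustrative)
    record("sgt_x_0001", "USC", "sgt-code-1.0", "bluewaters", 4000, 3600.0)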