
The Parallel System for Integrating Impact Models and Sectors (pSIMS)

Joshua Elliott, Computation Institute, University of Chicago and Argonne National Laboratory, [email protected]
David Kelly, Computation Institute, University of Chicago and Argonne National Laboratory, [email protected]
Neil Best, Computation Institute, University of Chicago and Argonne National Laboratory, [email protected]
Michael Wilde, Computation Institute, University of Chicago and Argonne National Laboratory, [email protected]
Michael Glotter, Department of Geophysical Sciences, University of Chicago, [email protected]
Ian Foster, Computation Institute, University of Chicago and Argonne National Laboratory, [email protected]

ABSTRACT
We present a framework for massively parallel simulations of climate impact models in agriculture and forestry: the parallel System for Integrating Impact Models and Sectors (pSIMS). This framework comprises a) tools for ingesting large amounts of data from various sources and standardizing them to a versatile and compact data type; b) tools for translating this standard data type into the custom formats required for point-based impact models in agriculture and forestry; c) a scalable parallel framework for performing large ensemble simulations on various computer systems, from small local clusters to supercomputers and even distributed grids and clouds; d) tools and data standards for reformatting outputs for easy analysis and visualization; and e) a methodology and tools for aggregating simulated measures to arbitrary spatial scales such as administrative districts (counties, states, nations) or relevant environmental demarcations such as watersheds and river basins. We present the technical elements of this framework and the results of an example climate impact assessment and validation exercise that involved large parallel computations on XSEDE.

Categories and Subject Descriptors

I.6.8 [Simulation and Modeling]: Parallel – parallel simulation

of climate vulnerabilities, impacts, and adaptations.

General Terms

Algorithms, Performance, Design, Languages.

Keywords

Climate change Impacts, Adaptation, and Vulnerabilities (VIA);

Parallel Computing; Data processing and standardization; Crop

modeling; Multi-model; Ensemble; Uncertainty

1. INTRODUCTION AND DESIGN
Understanding the vulnerability and response of human society to

climate change is necessary for sound decision-making in climate

policy. However, progress on these important research questions

is made difficult by the fact that science and information products

must be integrated across vastly different spatial and temporal

scales. Biophysical and environmental responses to global change

generally depend strongly on environmental (e.g., soil type),

socioeconomic (e.g., farm management), and climatic factors that

can vary substantially over regions at high spatial resolution.

Global Gridded Crop Models (GGCMs) are designed to capture

this spatial heterogeneity and simulate crop yields and climate

impacts at continental or global scales. Site-based GGCMs, like

those described here, aggregate local high-resolution simulations,

and are often limited by data availability and quality at the scales

required by a large-scale campaign. Obtaining the data inputs

necessary for a comprehensive high-resolution assessment of crop

yields and climate impacts typically requires tremendous effort by

researchers, who must catalog, assimilate, test, and process

multiple data sources with vastly different spatial and temporal

scales and extents. Accessing, understanding, scaling, and

integrating diverse data typically involves a labor-intensive and

error-prone process that creates a custom data processing pipeline.

A comparably complex set of transformations must often be

performed once impact simulations have been completed to

produce information products for a wide range of stakeholders,

including farmers, policy-makers, markets, and agro-business

interests. Thus, simulation outputs, like inputs, must be available

at all scales and in familiar and easy to use formats.

To address these challenges and thus facilitate access to high-

resolution climate impact modeling we are developing a suite of

tools, data, and models called the parallel System for Integrating

Impact Models and Sectors (pSIMS). Using an integrated multi-

model multi-sector simulation approach, pSIMS leverages

automated data ingest and transformation pipelines and high-

performance computing to enable researchers to address key

challenges. We present in this paper the pSIMS structure, data,

and methodology; describe a prototype use case, validation

methodology, and key input datasets; and summarize features of

the software and computational architecture that enable large-

scale simulations.


Our goals in developing this framework are fourfold, to: 1)

develop tools to assimilate relevant data from arbitrary sources; 2)

enable large ensemble simulations, at high resolution and with

continental/global extent, of the impacts of a changing climate on

primary industries; 3) produce tools that can aggregate simulation

output consistently to arbitrary boundaries; and 4) characterize

uncertainties inherent in the estimates by using multiple models

and simulating with large ensembles of inputs.

Figure 1 shows the principal elements of pSIMS including both

automated steps (the focus of this paper) and those that require an

interactive component (such as specification and aggregation).

The framework is separated into four major components: data

ingest and standardization, user initiated campaign specification,

parallel implementation of specified simulations, and aggregation

to arbitrary decision-relevant scales.

pSIMS is designed to support integration of any point-based

climate impact model that requires standard daily weather data as

inputs. We developed the framework prototype with versions 4.0

and 4.5 of the Decision Support System for Agrotechnology

Transfer (DSSAT [15]). DSSAT 4.5 supports models of 28

distinct crops using two underlying biophysical models (CERES

and CropGRO). Additionally, we have prototyped and tested a

parallel version of the CenW [16] forest growth simulation model

and are actively integrating additional crop yield and climate

impact models starting with the widely used Agricultural

Production Systems Simulator (APSIM) [23]. As of this writing,

continental- to global-scale simulation experiments ranging in

resolution from 3–30 arcminutes have been conducted on four

crop and one tree (Pinus radiata) species using a variety of

weather/climate inputs and scenarios.

2. THE pSIMS DATA INPUT PIPELINE
The minimum weather data requirements for crop and climate impact models such as DSSAT, APSIM, and CenW are typically:

- Daily maximum temperature (degrees C at 2 m above the ground surface)
- Daily minimum temperature (degrees C at 2 m)
- Daily average downward shortwave radiation flux (W/m2, measured at the ground surface)
- Total daily precipitation (mm/day at the ground surface)

In addition, it is sometimes also desirable to include daily average wind speeds and surface humidity if available.

Hundreds of observational datasets, model-based reanalyses, and

multi-decadal climate model simulation outputs at regional and

global scales exist that can be used to drive climate impact

simulations. For some use-cases, one or another data product may

be demonstrably superior, but most often a clear understanding of

the range of outcomes and uncertainty requires simulations with

an ensemble of inputs. For this reason, input data requirements for

high-resolution climate impact experiments can be large and data

processing and management can be challenging.

One major challenge in dealing with long daily time-series of

high-resolution weather data is that input products are typically

stored one or more variables at a time in large spatial raster files

segmented into annual, monthly, or sometimes even daily or sub-

daily time-slices. Point-based impact models, on the other hand,

typically require custom ASCII data types that encode long time-

series of daily data for all required variables for a single point into

one file. The process of accessing hundreds or thousands of

spatial raster files O(10^5–10^6) times to extract the time-series for

each point is time-consuming and expensive [20]. To ameliorate

this challenge, we have established a NetCDF4-based data-type

for the pSIMS framework—identified by a .psims.nc4

extension—and tools for translating existing archives into this

format (Figure 2).
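To make the cost of this transposition concrete, the sketch below shows the per-point extraction that would otherwise have to be repeated against the full raster archive for every simulation site. It is a minimal illustration only, written against the Python xarray library with placeholder file, coordinate, and variable names rather than pSIMS code.

# Minimal sketch of raster-to-point transposition (placeholder names).
# Each annual raster holds one variable for all grid cells; a point-based
# model instead needs all years of all variables for a single cell.
import xarray as xr

def extract_point_series(files, lat, lon):
    # Open the stack of annual rasters and concatenate them along time.
    ds = xr.open_mfdataset(files, combine="by_coords")
    # Select the single grid cell nearest the requested coordinates.
    point = ds.sel(lat=lat, lon=lon, method="nearest")
    # Force the read; this touches every input file once per site, which is
    # the repeated cost that the .psims.nc4 archive is designed to avoid.
    return point.load()

series = extract_point_series(["tmax_1980.nc4", "tmax_1981.nc4"], 41.9, -87.6)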

Each .psims.nc4 file represents a single grid cell or simulation site

within the study area. Its contents are one or more 1 x 1 x T

arrays, one per variable, where T is the number of time steps. The

time coordinates of each array are explicit in the definition of their

dimensions, a strategy that facilitates near-real-time updates from

upstream sources. The spatial coordinates of each array are also

explicit in the definition of their dimensions, which facilitates

downstream spatial merging. Variables are grouped and named so

as to avoid namespace collisions with variables from other

sources in downstream merging; e.g., “narr/tmax” refers to the daily maximum temperature variable extracted from the North American Regional Reanalysis (NARR) dataset and “cfsrr/tmax” to the daily maximum temperature from the NOAA Climate Forecast System Reanalysis and Reforecast (CFSRR) dataset.

Because a given .psims.nc4 file contains information about a single point location, the longitude and latitude vectors used as array dimensions are both of length one. Therefore the typical weather variable is represented as a 1 x 1 x T array, where T is the number of time steps in the source data. In contrast, arrays containing forecast variables have two time dimensions: the time the forecast is made and the time of the prediction. This use of two time dimensions makes it possible to follow the evolution of a series of forecasts made over a period of time for a particular future date, as for example when forecasts of crop yields at a particular location are refined over time as more information becomes available.

Figure 1: Schematic of the pSIMS workflow. 1) Data ingest takes data from numerous publicly available archives in arbitrary file formats and data types. 2) The standardization step reconstitutes each of these datasets into the highly portable point-based .psims.nc4 format. 3) The specification step defines a simulation campaign for pSIMS by choosing a .psims.nc4 dataset or subset and one or more climate impact models (and the requisite custom input and output data translators) from the code library. 4) In the translation step, individual .psims.nc4 files (each representing one grid cell in the spatial simulation) are converted into the custom file formats required by the models. 5) In the simulation step, individual simulation calls within the campaign are managed by Swift on the available high-performance resource. 6) In the output reformatting step, model outputs (dozens or even hundreds of time-series variables from each run) are extracted from model-specific custom output formats and translated into a standard .psims.out format. 7) Finally, in the aggregation step, output variables are masked as necessary and aggregated to arbitrary decision-relevant spatial or temporal scales.

pSIMS input files are named according to their (row, col) tuple in the global grid and organized in a corresponding directory structure (e.g., /narr/361/1168/361_1168.psims.nc4) so that the terminal

directories hold data for a single site. This improves the

performance of parallel reads and writes on shared filesystems

like GPFS and minimizes clutter while browsing the tree. Because

files represent points, the resolution is only set by the organization

of the enclosing directory tree and is recorded in an archive

metadata file, as well as in the metadata of each .psims.nc4 file.
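As a concrete illustration of this layout, the sketch below writes one per-site file with a source-named group, so that a NARR daily maximum temperature series appears as narr/tmax. It is a minimal example using the Python netCDF4 library; the coordinates, time span, and data values are placeholders, and pSIMS's own translation tools are not shown.

# Sketch of writing a single-site .psims.nc4 file (placeholder values).
import numpy as np
from netCDF4 import Dataset

nc = Dataset("361_1168.psims.nc4", "w", format="NETCDF4")

# Spatial dimensions have length one because the file holds a single site.
nc.createDimension("longitude", 1)
nc.createDimension("latitude", 1)
nc.createDimension("time", None)  # unlimited, so later updates can be appended

lon = nc.createVariable("longitude", "f8", ("longitude",))
lat = nc.createVariable("latitude", "f8", ("latitude",))
time = nc.createVariable("time", "f8", ("time",))
lon[:] = [-87.6]
lat[:] = [41.9]
time.units = "days since 1980-01-01"
time[0:365] = np.arange(365)

# Group variables by source dataset to avoid namespace collisions on merge.
narr = nc.createGroup("narr")
tmax = narr.createVariable("tmax", "f4", ("time", "latitude", "longitude"),
                           zlib=True)
tmax.units = "degrees C"
tmax[0:365, 0, 0] = np.random.uniform(-10.0, 35.0, size=365)  # placeholder data

nc.close()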

Climate data from observations and simulations is widely

available in open and easily accessible archives, such as those

accessible through the Earth System Grid (ESG) [30], NOAA’s NOMADS

servers [26], and NASA’s Modeling and Assimilation Data

Information Services Center (MDISC). These datasets have

substantial metadata—often based on Climate and Forecasting

(CF) conventions [8]—and are frequently identified by a unique

Digital Object Identifier (DOI) [21], making provenance and

tracking relatively straightforward. Such data is often archived at

uniform grid resolutions, but scales can vary from a few

kilometers to one or more degrees and frequencies from an hour

to a month. Furthermore, different map projections mean that

transformations are sometimes necessary before data can be used.
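For example, a dataset delivered on a projected grid must have its cell coordinates converted to geographic longitude and latitude before sites can be matched against the global grid. A minimal sketch of that single step, using the Python pyproj library and an arbitrary example projection (not necessarily the one used by any particular archive), is:

# Sketch of converting projected grid coordinates to longitude/latitude.
from pyproj import Transformer

# EPSG:5070 (a conterminous-US Albers projection) is only an example; each
# source dataset declares its own coordinate reference system.
to_lonlat = Transformer.from_crs("EPSG:5070", "EPSG:4326", always_xy=True)

x, y = 500000.0, 1900000.0            # projected coordinates of one grid cell
lon, lat = to_lonlat.transform(x, y)  # geographic coordinates for matching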

3. AN EXAMPLE ASSESSMENT STUDY
We now present an example of a maize yield assessment and

validation exercise conducted in the conterminous US at five-

arcminute spatial resolution. For each grid cell in the study region,

we simulate eight management configurations (detailing fertilizer

and irrigation application) from 1980-2009 using the CERES-

maize model (Figure 3).

Simulations are driven by observation- and reanalysis-based data products comprising NCEP Climate Forecast System Reanalysis (CFSR) temperatures [13], NOAA Climate Prediction Center (CPC) US Unified Precipitation [23], and the NASA Surface Radiation Budget (SRB) solar radiation dataset [27].

Figure 3: Time-series yields for a single site for four fertilizer

application rates w/ (solid) and w/o (dashed) irrigation.

The gridded yield for crop i in grid cell x at time t takes the form

  Y_i(x, t) = Y_i(Q(x, t), S(x), M_i(x, t)),

where we denote explicitly the dependence of yield on local climate (Q; including temperature, precipitation, solar radiation, and CO2), soils (S), and farm management (M; including planting dates, cultivars, fertilizers, and irrigation). Figure 4 shows

summary measures (median and standard deviation) over the

historical simulation period for a single management

configuration (rainfed with high nitrogen fertilizer application

rates) from this campaign.

High-resolution yield maps allow researchers to visualize the

changing patterns of crop yields. However, such measures must

typically be aggregated at some decision-relevant environmental

or political scale, such as county, watershed, state, river basin,

nation, or continent, for use in models or decision-maker analyses.

To this end, we are developing tools to aggregate yields and

climate impact measures to arbitrary spatial scales. For a given

individual region R, crop i, and time t, the basic problem of re-scaling local yields (Y) to regional production (P) takes the form

  P_i(R, t) = Σ_{x ∈ R} W_i(x, t) Y_i(x, t),

where the weighting function for aggregation, W_i(x, t), is the area in location x at time t that is engaged in the production of crop i.
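A minimal sketch of this weighted aggregation, written with NumPy arrays of gridded yields, crop areas, and a boolean region mask (all hypothetical names, standing in for the pSIMS aggregation tools), is:

# Sketch of aggregating gridded yields to a region (hypothetical arrays).
import numpy as np

def regional_production(yields, crop_area, region_mask):
    # yields:      simulated yield per unit area in each grid cell
    # crop_area:   area in each cell engaged in producing the crop (the weight W_i)
    # region_mask: True for the cells that belong to region R
    return np.sum(yields[region_mask] * crop_area[region_mask])

def regional_average_yield(yields, crop_area, region_mask):
    # Dividing production by total harvested area gives an area-weighted
    # yield that can be compared with survey statistics for the region.
    return (regional_production(yields, crop_area, region_mask)
            / np.sum(crop_area[region_mask]))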

Figure 2: Expanded schematic of data types and processing pipeline from Figure 1. The steps in the pipeline are as follows. 1) Data is ingested from arbitrary sources in various file formats and data types. 2) If necessary, the data is transformed to a geographic projection. 3) For each land grid cell, the full time series of daily data is extracted (in parallel) and converted to the .psims.nc4 format. 4) Each set of .psims.nc4 files extracted from a dataset is then organized into an archive for long-term storage. 5) If the input dataset is still being updated, we ingest and process the updates at regular intervals and 6) append them to the existing .psims.nc4 archive.

A further use of aggregation is to produce measures of yield that

can be compared with survey data at various spatial scales for

model validation and uncertainty quantification. For example,

yield measures are aggregated to county and state level and

compared with survey data from the US National Agricultural

Statistics Service (NASS) [3] by calculating time-series

correlations between simulated and observed values (Figure 5).

Next we consider RMSE measures for the simulated vs. observed

yields at the various scales. Figure 6 plots the simulated vs.

observed yields at the county and state level: in total, 60,423 observations over 30 years and 2,373 counties, and 1,225 observations over 41 states, respectively. The root mean square

error (RMSE) between observed yields and prototype simulated

yields is 25% of the sample mean, and an unconstrained linear fit

of the simulation vs. observation results has an R² of 0.6. At the state level, the RMSE is 15.7% of the sample mean, and at the national level it is reduced to 10.8% (13.6 bushels/acre) over the test

period. Given the simplicity of the experiment, these results

compare favorably with other recent spatial simulation/validation

exercises at the state level.
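The comparison itself reduces to a few lines once simulated and reported yields are tabulated by county and year. The sketch below assumes Python pandas DataFrames with placeholder column names; it is an illustration of the metrics, not the validation code used for the figures.

# Sketch of county-level validation against survey yields (placeholder columns).
import numpy as np
import pandas as pd

def validate(sim, obs):
    # sim and obs each have columns: county, year, yield
    merged = sim.merge(obs, on=["county", "year"], suffixes=("_sim", "_obs"))

    # Per-county time-series correlation, as mapped in Figure 5.
    corr = merged.groupby("county").apply(
        lambda g: g["yield_sim"].corr(g["yield_obs"]))

    # Pooled RMSE, reported as a percentage of the observed sample mean.
    err = merged["yield_sim"] - merged["yield_obs"]
    rmse_pct = 100.0 * np.sqrt(np.mean(err ** 2)) / merged["yield_obs"].mean()
    return corr, rmse_pct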

Figure 4: Simulation outputs of a single pDSSAT campaign. Maps of

the median (top) and standard deviation (bottom) of simulated

potential rainfed/high-input maize yield.

4. USE OF Swift PARALLEL SCRIPTING
Each pSIMS model run requires the execution of 10,000 to

120,000 fairly small serial jobs, one per geographic grid cell, and

each using one CPU core for a few seconds to a few minutes. The

Swift [29] parallel scripting language has made it straightforward

to write and execute pSIMS runs, using highly portable and

system-independent scripts such as the example in Listing 1. Space

does not permit a detailed description of this program, but in brief,

it defines a machine-independent interface to the DSSAT

executable, loads a list of grids on which that executable is to be

run, and then defines a set of DSSAT tasks, one per grid.

The Swift language is implicitly parallel, high-level, and

functional. It automates the difficult and science-distracting tasks

of distributing tasks and data across multiple remote systems, and

of retrying failing tasks and restarting failing workflow runs. Its

runtime automatically manages the execution of tens of thousands

of small single-core or multi-core jobs, and dynamically packs

those jobs tightly onto multiple nodes of the allocated computing

resources to maximize system utilization. Swift automates the

acquisition of nodes; inter-job data dependencies; throttling;

scheduling and dispatch of work to cores; and retry of failing jobs.

Swift is used in many science domains, including earth systems,

energy economics, theoretical chemistry, protein science and

graph analysis. It is simple to install and use, enabling new classes of users to become productive in a remarkably short time.

Figure 5: Time-series correlations between county level aggregates of

simulated maize yields and NASS statistics from 1980-2009 for

pDSSAT simulations using CFSR temperatures, CPC precipitation,

and SRB solar. Only counties for which NASS records a minimum of

six years of data with an average of more than 500 maize cultivated

hectares per year are shown.

The key to Swift’s ability to execute large numbers of small tasks

efficiently on large parallel computers is its use of a two-level

scheduling strategy. Internally, Swift launches a pilot job called a

“coaster” on each node within a resource pool [12]. The Swift

runtime then manages the dispatch of application invocations,

plus any data that these tasks require, to those coasters, which

manage their execution on compute nodes. As tasks finish, Swift

schedules more work to those nodes, achieving high CPU

utilization even for fine-grained workloads. Individual tasks can

be serial, OpenMP, MPI, or other parallel applications.
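The essential idea can be conveyed by a toy, single-machine analogue (not Swift's implementation): a fixed pool of long-lived workers repeatedly pulls short tasks, so start-up and scheduling costs are paid once per worker rather than once per task.

# Toy analogue of two-level scheduling: persistent workers stream many short tasks.
from concurrent.futures import ProcessPoolExecutor

def run_cell(cell_id):
    # Stand-in for one short, single-core model run on one grid cell.
    return cell_id, sum(i * i for i in range(10000))

if __name__ == "__main__":
    cells = range(100000)
    # The worker pool (the "pilot" level) is created once; tasks stream through it.
    with ProcessPoolExecutor(max_workers=8) as pool:
        for cell_id, result in pool.map(run_cell, cells, chunksize=256):
            pass  # collect or write per-cell outputs here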

Swift makes computing location independent, allowing us to run

pSIMS on a variety of grids, supercomputers, clouds, and clusters,

with the same pSIMS script used on multiple distributed sites and

diverse computing resources. Figure 7 shows a typical execution

scenario, in which pSIMS is run across two University of Chicago

campus resources: the UChicago Campus Computing Cooperative

(UC3) [6] and the UChicago Research Computing Center cluster

“Midway.” Over the past two years the RDCEP team has

performed a large amount of impact modeling using this

framework. In the past 10 months alone, more than 80 pSIMS

impact runs have been performed, totaling over 5.6 million

DSSAT runs and a growing number of runs with other models

such as CenW (see Table 1).

Figure 6: Simulated county-level (top) and state-level (bottom)

average corn yields plotted against USDA survey data. Over 1980-

2009, there are 60,423 data points from 2,373 counties and 1,225 data points from 41 states. The red line is y = x, with an RMSE between the simulated and observed yields of 26 bu/A (25% of the sample mean) at the county level and 18.6 bu/A (15.7% of the sample mean) at the state level. The black line is the best-fit linear model, with R² = 0.60 at the county level and 0.73 at the state level.

We have run pSIMS over the UC3 campus computing cooperative

[6], Open Science Grid [22], and UChicago Midway research

cluster; the XSEDE clusters Ranger and its successor Stampede;

and, in the past, on Beagle, PADS, and Raven. Through Swift’s

interface to the Nimbus cloudinit.d [5], we have run on the NSF

FutureGrid and can readily run on commercial cloud services.

Figure 7: Typical Swift configuration for pSIMS execution.

Listing 1: Swift script for the pSIMS ensemble simulation study

Table 1: Summary of campaign execution by project, including the total number of jobs in each campaign, the total number of simulation units (jobs x scenarios x years), the total model CPU time, and the total size of the outputs generated.

Project           Campaigns   Sim Units (Billion)   CPU Hours (K)   Jobs (M)   Output data (TBytes)
NARCCAP USA       16          1.3                   13              1.9        0.47
ISI-MIP Global    80          11.8                  216             4.38       4.14
Prediction 2012   2           0.2                   2               0.24       0.5

Qualitatively, our experience with the pSIMS framework and

Swift has been a striking confirmation of the fact that portable

scripts can indeed enable the use of diverse and distributed HPC

resources for many-task computing by scientists with no prior

experience in HPC and with little time to learn and apply such

skills. After the scripts were developed and solidified initially by

developers on the Swift team, execution and extension of the

framework was managed by the project science team, who had no

prior experience with Swift and only modest prior experience in

shell and Perl scripting.

Whether due to bugs, human error, machine interruption, or other issues, science workflows must often be restarted or re-run. Swift provides a simple and intuitive re-execution mechanism to pick up interrupted simulations where they left off and complete the simulation campaign.

// 1) Define RunDSSAT function as an interface to the DSSAT program
// a) Specify the function prototype
app (file tar_out, file part_out) RunDSSAT (string x_file,
                                            file scenario[],
                                            file weather[],
                                            file common[],
                                            file exe[],
                                            file perl[],
                                            file wrapper)
// b) Specify how to invoke the DSSAT executable
{
  bash "-c"
    @strcats(" ", "chmod +x ./RunDSSAT.sh ; ./RunDSSAT.sh ",
             x_file,
             @tar_out, @arg("years"), @arg("num_scenarios"),
             @arg("variables"), @arg("crop"),
             @arg("executable"), @scenario, @weather,
             @common, @exe, @perl);
}

// 2) Read the list of grids over which computation is to be performed
string gridLists[] = readData("gridList.txt");
string xfile = @arg("xfile");

// 3) Invoke RunDSSAT once for each grid; different calls are concurrent
foreach g, i in gridLists {
  // a) Map the various input and output files to Swift variables
  file tar_output <single_file_mapper;
                   file=@strcat("output/", gridLists[i], "output.tar.gz")>;
  file part_output <single_file_mapper;
                    file=@strcat("parts/", gridLists[i], ".part")>;
  file scenario[] <filesys_mapper;
                   location=@strcat(@arg("scenarios"), "/", gridLists[i]),
                   pattern="*">;
  file weather[] <filesys_mapper;
                  location=@strcat(@arg("weather"), "/", gridLists[i]),
                  pattern="*">;
  file common[] <filesys_mapper; location=@arg("refdata"), pattern="*">;
  file exe[] <filesys_mapper; location=@arg("bindata"), pattern="*.EXE">;
  file perl[] <filesys_mapper; location=@arg("bindata"), pattern="*.pl">;
  file wrapper <single_file_mapper; file="RunDSSAT.sh">;

  // b) Call RunDSSAT
  (tar_output, part_output) = RunDSSAT(xfile, scenario, weather, common,
                                       exe, perl, wrapper);
}
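Because every campaign-specific detail in the script is read either from gridList.txt or from command-line arguments via @arg (xfile, weather, scenarios, years, crop, and so on), the same script is reused unchanged across campaigns and computing sites; only the argument values and the site configuration change between runs. An invocation therefore resembles the following, where the argument values are placeholders shown only for illustration:

  swift psims.swift -xfile=EXP.MZX -weather=/narr -scenarios=/campaigns/maize -years=30 -crop=maize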

pSIMS has also proven its value in enabling entire projects (millions of jobs) to be re-executed when application code errors were discovered.

Swift enables us to take advantage of new systems and systems

with idle opportunities (e.g., over holidays). We recently

demonstrated such powerful multi-system runs using a similar

geospatially gridded application for analyzing MODIS land use

data. The user was able to dispatch 3,000 jobs from a single

submit host (on Midway) to five different computer systems

(UC3, Midway, Beagle, MWT2, and OSG). The user initially

validated correct execution by processing 10 files on a basic login

host, and then scaled up to a 3,000-file dataset, changing only the

dataset name and a site-specification list to get to the resources.

Swift’s transparent multi-site capability expanded the scope of these computations from one node to hundreds or thousands of cores (about 500 cores were used for most production runs, and tests of full-scale runs were executed at up to 4,096 cores on the XSEDE "Ranger" system). The user did not need to check which sites were busy or adjust arcane scripts to reach these resources.

The throughput achieved by Swift varies according to the nature

of the system on which it is run. For example, Figure 8 shows an

execution profile for a heavily loaded University of Chicago UC3

system. The “load” (number of concurrently running tasks) varies

greatly over time, due to other users’ activity. At the other end of

the scale, on systems such as the (former) TACC Ranger and its

successor Stampede, where local login hosts are not suited for

running even modest multi-core jobs such as the Swift Java client,

Swift provides the flexibility of running both the client and all its

workers within a single multi-node HPC job. The performance of

such a run is shown in Figure 9 below. This performance profile

reflects the absence of competition for cluster resources at the

time of execution. Because the allocated partition was almost

completely idle, Swift was able to quickly ramp up the system and

keep it fully utilized for a significant portion of the run. Similar

flexibility exists to run Swift from an interactive compute node

and to send application invocation tasks to the compute nodes

composed from one or more separate HPC jobs.

Swift has also been instrumental in the preparation of input data

for our DSSAT simulation campaigns. In the case of the NARR

data it was necessary to process hundreds of thousands of six-hour

weather reports in individual gridded binary (GRB) files. Swift

managed the format conversion, spatial resampling, aggregation

to daily statistics, and collation into annual netCDF files as precursors to the individual point-based time series used in the simulation. By exposing generic command-line utilities for manipulating the various file types as Swift functions, it is possible to express the required transformations in an intuitive fashion.
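The daily-statistics step, for instance, reduces to a simple resampling once the six-hourly records have been converted to NetCDF. The sketch below shows only that one transformation, using the Python xarray library with placeholder file and variable names; in pSIMS the equivalent utilities are invoked through Swift.

# Sketch of reducing six-hourly fields to daily statistics (placeholder names).
import xarray as xr

ds = xr.open_dataset("narr_air_2m_1980.nc")  # six-hourly data, already converted from GRB
daily = xr.Dataset({
    "tmax": ds["air_2m"].resample(time="1D").max(),  # daily maximum temperature
    "tmin": ds["air_2m"].resample(time="1D").min(),  # daily minimum temperature
})
daily.to_netcdf("narr_daily_1980.nc")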

5. FUTURE OBJECTIVES
Ongoing efforts related to the development of this framework fall into two general themes: facilitating the application of other biophysical crop production models and enabling the management

of simulation campaigns through web-based portals.

The process of transforming a new weather input dataset is a

painstaking one that has been difficult to generalize so far, so our

approach has been to decouple the data preparation phase of the

processing pipeline from the simulation and summarizing phases.

The point of contact between these activities is the .psims.nc4

format. As an intermediate representation of the data that drives our simulations, it gives us a fixed target against which to implement translations both from upstream data archives and into the formats expected by the various crop modeling packages.

Figure 8: Task execution progress for the ~120,000-task simulation

campaign described in Section 3. This particular run was performed

on the University of Chicago’s UC3 system. Time is in seconds.

Figure 9: Results from a large scale pSIMS run on XSEDE’s Ranger

supercomputer. Using 256 nodes (4096 cores), the full run completes

in 16 minutes.

We are integrating pSIMS into two different web-based eScience

portals to improve usability and expand access to this powerful

tool. The Framework to Advance Climate, Economics, and

Impacts Investigations with Information Technology (FACE-IT)

is a prototype eScience portal and workflow engine based on

Galaxy [11]. As a first step towards integration of pSIMS with

FACE-IT, we have prototyped a pSIMS workflow in Galaxy: see

Figure 10. Working with researchers from iPlant and TACC, we

prototyped pSIMS as an application in the iPlant Discovery

Environment [18]: see Figure 11. The latter application has been

run on Stampede and Ranger.

6. DISCUSSION
The parallel System for Integrating Impact Models and Sectors (pSIMS) is a new framework for efficient implementation of

large-scale assessments of climate vulnerabilities, impacts, and

adaptations across multiple sectors and at unprecedented scales.

pSIMS delivers two advances in software and applications for

climate impact modeling: 1) a high-performance data ingest and

processing pipeline that generates a standardized and evolving collection of socio-economic, environmental, and climate input datasets based on a portable and efficient new data format; and 2)

a code base to enable large-scale simulations at high resolution of

the impacts of a changing climate on primary production

(agriculture, livestock, and forestry) using arbitrary point-based

climate impact models.

Figure 10: Prototype Galaxy-based FACE-IT portal

Figure 11: Prototype pSIMS integration in iPlant.

Both advances are enabled by high-performance computing

leveraged with the Swift parallel scripting language. pSIMS also

contains tools for data translation of both the input and output

formats from various models; specifications for integrating

translators developed in AgMIP; and tools for aggregation and

scaling of the resulting simulation outputs to arbitrary spatial and

temporal scales relevant for decision-support, validation, and

downstream model coupling. This framework has been used for

high-resolution crop yield and climate impact assessments at the

US [10] and global levels [9, 25].

ACKNOWLEDGMENTS
We thank Pierre Riteau and Kate Keahey for help running pSIMS

on cloud resources with Nimbus; Stephen Welch of AgMIP, Dan

Stanzione of iPlant and the Texas Advanced Computing Center

(TACC), and John Fonner and Matthew Vaughn of TACC for

help executing Swift workflows on Ranger and Stampede, and

under the iPlant portal; and Ravi Madduri of Argonne and

UChicago for help placing pSIMS under the Galaxy portal. This

work was supported in part by the National Science Foundation

under grants SBE-0951576 and GEO-1215910. Swift is supported

in part by NSF grant OCI-1148443. Computing for this project is

provided by a number of sources including the XSEDE Stampede

machine at TACC, University of Chicago Computing

Cooperative, the University of Chicago Research Computing

Center, and through the NIH with resources provided by the

Computation Institute and the Biological Sciences Division of the

University of Chicago and Argonne National Laboratory under

grant S10 RR029030-01. NASA SRB data was obtained from the

NASA Langley Research Center Atmospheric Sciences Data

Center NASA/GEWEX SRB Project. CPC US Unified

Precipitation data obtained from the NOAA/OAR/ESRL PSD,

Boulder, Colorado, USA [2].

REFERENCES
1. AgMIP source code repository. [Accessed April 1, 2013];

Available from: https://github.com/agmip.

2. Earth System Research Laboratory, Physical Sciences

Division. [Accessed April 1, 2013]; Available from:

http://www.esrl.noaa.gov/psd/.

3. National Agricultural Statistics Service, County Data Release

Schedule. [Accessed April 1, 2013]; Available from:

http://1.usa.gov/bJ4VQ6.

4. Bondeau, A., Smith, P.C., Zaehle, S., Schaphoff, S., Lucht,

W., Cramer, W., Gerten, D., Lotze-Campen, H., Müller, C.,

Reichstein, M. and Smith, B. Modelling the role of agriculture

for the 20th century global terrestrial carbon balance. Global

Change Biology, 13(3):679-706, 2007.

5. Bresnahan, J., Freeman, T., LaBissoniere, D. and Keahey, K.,

Managing Appliance Launches in Infrastructure Clouds.

TG'11: 2011 TeraGrid Conference: Extreme Digital

Discovery, Salt Lake City, UT, USA, 2011.

6. Bryant, L., UC3: A Framework for Cooperative Computing at

the University of Chicago Open Science Grid Computing

Infrastructures Community Workshop,

http://1.usa.gov/ZXum6q, University of California Santa

Cruz, 2012.

7. Deryng, D., Sacks, W.J., Barford, C.C. and Ramankutty, N.

Simulating the effects of climate and agricultural management

practices on global crop yield. Global Biogeochemical Cycles,

25(2), 2011.

8. Eaton, B., Gregory, J., Drach, B., Taylor, K., Hankin, S.,

Caron, J., Signell, R., Bentley, P., Rappa, G., Höck, H.,

Pamment, A. and Juckes, M. NetCDF Climate and Forecast

(CF) Metadata Conventions, version 1.5. Lawrence Livermore

National Laboratory, http://cf-pcmdi.llnl.gov/documents/cf-

conventions/1.5/cf-conventions.html, 2010.

9. Elliott, J., Deryng, D., Muller, C., Frieler, K., Konzmann, M.,

Gerten, D., Glotter, M., Florke, M., Wada, Y., Eisner, S.,

Folberth, C., Foster, I., S. Gosling, Haddeland, I., Khabarov,

N., Ludwig, F., Masaki, Y., Olin, S., Rosenzweig, C., Ruane,

A., Satoh, Y., Schmid, E., Stacke, T., Tang, Q. and Wisser, D.

Constraints and potentials of future irrigation water

availability on global agricultural production under climate

change. Proceedings of the National Academy of Sciences,

Submitted to ISI-MIP Special Issue, 2013.

10. Elliott, J., Glotter, M., Best, N., Boote, K.J., Jones, J.W.,

Hatfield, J.L., Rosenzweig, C., Smith, L.A. and Foster, I.

Predicting Agricultural Impacts of Large-Scale Drought: 2012

and the Case for Better Modeling. RDCEP Working Paper

No. 13-01, http://ssrn.com/abstract=2222269, 2013.

11. Goecks, J., Nekrutenko, A., Taylor, J. and The Galaxy Team

Galaxy: a comprehensive approach for supporting accessible,

reproducible, and transparent computational research in the

life sciences. Genome Biol, 11(8):R86, 2010.

12. Hategan, M., Wozniak, J. and Maheshwari, K., Coasters:

Uniform Resource Provisioning and Access for Clouds and

Grids. Fourth IEEE International Conference on Utility and

Cloud Computing (UCC '11), Washington, DC, USA, 2011,

IEEE Computer Society, 114-121.

13. Higgins, R.W. Improved US Precipitation Quality Control

System and Analysis. NCEP/Climate Prediction Center

ATLAS No. 6 (in preparation), 2000.

14. Izaurralde, R.C., Williams, J.R., McGill, W.B., Rosenberg,

N.J. and Jakas, M.C.Q. Simulating soil C dynamics with

EPIC: Model description and testing against long-term data.

Ecological Modelling, 192(3–4):362-384, 2006.

15. Jones, J.W., Hoogenboom, G., Porter, C.H., Boote, K.J.,

Batchelor, W.D., Hunt, L.A., Wilkens, P.W., Singh, U.,

Gijsman, A.J. and Ritchie, J.T. The DSSAT cropping system

model. European Journal of Agronomy, 18(3–4):235-265,

2003.

16. Kirschbaum, M.U.F. CenW, a forest growth model with

linked carbon, energy, nutrient and water cycles. Ecological

Modelling, 118(1):17-59, 1999.

17. Leemans, R. and Solomon, A.M. Modeling the potential

change in yield and distribution of the earth's crops under a

warmed climate. Environmental Protection Agency, Corvallis,

OR (United States). Environmental Research Lab., 1993.

18. Lenards, A., Merchant, N. and Stanzione, D., Building an

environment to facilitate discoveries for plant sciences. 2011

ACM Workshop on Gateway Computing Environments (GCE

'11), 2011, ACM, 51-58.

19. Liu, J., Williams, J.R., Zehnder, A.J.B. and Yang, H. GEPIC –

modelling wheat yield and crop water productivity with high

resolution on a global scale. Agricultural Systems, 94(2):478-

493, 2007.

20. Malik, T., Best, N., Elliott, J., Madduri, R. and Foster, I.,

Improving the efficiency of subset queries on raster images.

HPDGIS '11: Second International Workshop on High

Performance and Distributed Geographic Information

Systems, Chicago, Illinois, USA, 2011, ACM, 34-37.

21. Paskin, N. Digital Object Identifiers for scientific data. Data

Science Journal, 4:12-20, 2005.

22. Pordes, R., Petravick, D., Kramer, B., Olson, D., Livny, M.,

Roy, A., Avery, P., Blackburn, K., Wenaus, T., Würthwein,

F., Foster, I., Gardner, R., Wilde, M., Blatecky, A., McGee, J.

and Quick, R., The Open Science Grid. Scientific Discovery

through Advanced Computing (SciDAC) Conference, 2007.

23. Probert, M.E., Dimes, J.P., Keating, B.A., Dalal, R.C. and

Strong, W.M. APSIM's water and nitrogen modules and

simulation of the dynamics of water and nitrogen in fallow

systems. Agricultural Systems, 56(1):1-28, 1998.

24. Rosenzweig, C., Jones, J.W., Hatfield, J.L., Ruane, A.C.,

Boote, K.J., Thorburn, P., Antle, J.M., Nelson, G.C., Porter,

C., Janssen, S., Asseng, S., Basso, B., Ewert, F., Wallach, D.,

Baigorria, G. and Winter, J.M. The Agricultural Model

Intercomparison and Improvement Project (AgMIP):

Protocols and pilot studies. Agricultural and Forest

Meteorology, 170(0):166-182, 2013.

25. Rosenzweig, C. and others Assessing agricultural risks of

climate change in the 21st century in a global gridded crop

model intercomparison. Proceedings of the National Academy

of Sciences, Submitted to ISI-MIP Special Issue, 2013.

26. Rutledge, G.K., Alpert, J. and Ebisuzaki, W. NOMADS: A

Climate and Weather Model Archive at the National Oceanic

and Atmospheric Administration. Bulletin of the American

Meteorological Society, 87(3):327-341, 2006.

27. Stackhouse, P.W., Gupta, S.K., Cox, S.J., Mikovitz, J.C.,

Zhang, T. and Hinkelman, L.M. The NASA/GEWEX Surface

Radiation Budget Release 3.0: 24.5-Year Dataset. GEWEX

News, 21(1):10-12, 2011.

28. Waha, K., van Bussel, L.G.J., Müller, C. and Bondeau, A.

Climate-driven simulation of global crop sowing dates.

Global Ecology and Biogeography, 21(2):247-259, 2012.

29. Wilde, M., Foster, I., Iskra, K., Beckman, P., Zhang, Z.,

Espinosa, A., Hategan, M., Clifford, B. and Raicu, I. Parallel

Scripting for Applications at the Petascale and Beyond. IEEE

Computer, 42(11):50-60, 2009.

30. Williams, D.N., Ananthakrishnan, R., Bernholdt, D.E.,

Bharathi, S., Brown, D., Chen, M., Chervenak, A.L.,

Cinquini, L., Drach, R., Foster, I.T., Fox, P., Fraser, D.,

Garcia, J., Hankin, S., Jones, P., Middleton, D.E., Schwidder,

J., Schweitzer, R., Schuler, R., Shoshani, A., Siebenlist, F.,

Sim, A., Strand, W.G., Su, M. and Wilhelmi, N. The Earth

System Grid: Enabling Access to Multi-Model Climate

Simulation Data. Bulletin of the American Meteorological

Society, 90(2):195-205, 2009.