High Performance Computing of Hydrologic
Models Using HTCondor
Spencer Taylor
A project report submitted to the faculty of Brigham Young University
in partial fulfillment of the requirements for the degree of
Master of Science
Norman L. Jones, Chair
Everett James Nelson
Gus P. Williams
Department of Civil and Environmental Engineering
Brigham Young University
April 2013
Copyright © 2013 Spencer Taylor
All Rights Reserved
ABSTRACT
High Performance Computing of Hydrologic
Models Using HTCondor
Spencer Taylor
Department of Civil and Environmental Engineering, BYU
Master of Science
“Big Iron” supercomputers and commercial cloud resources (Amazon, Google, Microsoft, etc.) are some of the most prominent resources for high-performance computing (HPC) needs. These resources have many advantages such as scalability and computational speed. However, the limited access to supercomputers and the cost associated with cloud systems may prohibit water resources engineers and planners from using HPC in water applications. The goal of this project is to demonstrate an alternative model of HPC for water resource stakeholders who would benefit from an autonomous pool of free and accessible computing resources. To demonstrate this concept, a system called HTCondor was used at Brigham Young University in conjunction with the scripting language Python to parallelize intensive stochastic computations with a hydrologic model called the Gridded Surface Subsurface Hydrologic Analyst (GSSHA). HTCondor is open-source software developed by the University of Wisconsin-Madison that provides access to the processors of idle computers for performing computational jobs on a local network. We found that performing stochastic simulations with GSSHA using the HTCondor system significantly reduces overall computational time for simulations involving multiple model runs and improves modeling efficiency. Hence, HTCondor is an accessible and free solution that can be applied to many projects under different circumstances using various water resources modeling software.
Keywords: cloud computing, daemon, GDM, GSSHA, high-performance computing, high-throughput computing, HTCondor, parallelization, pool, Python
ACKNOWLEDGEMENTS
It is with immense gratitude that I acknowledge the support and help of my advisor Dr.
Norm L. Jones of the Civil and Environmental Engineering Department at Brigham Young
University for guiding me through this project. I am also grateful for Dr. Everett James Nelson
and Dr. Gus P. Williams for their input and enthusiasm for my project. Further thanks must also
be given to my associates Dr. Kris Latu, Nathan Swain, and Scott Christensen for working
closely with me on this project.
This material is based upon work supported by the National Science Foundation under
Grant No. 1135482.
TABLE OF CONTENTS
LIST OF TABLES ...................................................................................................................... vii
LIST OF FIGURES ..................................................................................................................... ix
1 Introduction ......................................................................................................................... 11
1.1 Problem Statement ........................................................................................................ 13
2 Literature Review ............................................................................................................... 15
3 Stochastic Simulations Using GSSHA and HTCondor ......................................................... 21
3.1 HTCondor ..................................................................................................................... 21
3.1.1 Master Computer ...................................................................................................... 23
3.1.2 Worker Computers .................................................................................................... 24
3.1.3 The BYU HTCondor Pool ........................................................................................ 24
3.1.4 HTCondor Workflow ................................................................................................ 25
3.1.5 Universe Environments ............................................................................................. 26
3.2 External HTCondor Resources ..................................................................................... 28
3.2.1 Cloud Resources ....................................................................................................... 28
3.2.2 Supercomputing Resources ....................................................................................... 29
3.2.3 External HTCondor Pools ......................................................................................... 29
3.3 GSSHA ......................................................................................................................... 30
3.4 Stochastic Simulations and Statistical Convergence .................................................... 32
3.5 Using Python to Connect GSSHA and HTCondor ....................................................... 35
4 Case Study ........................................................................................................................... 41
5 Conclusions .......................................................................................................................... 45
5.1 HTCondor ..................................................................................................................... 45
5.2 Mary Lou ...................................................................................................................... 46
5.3 Amazon Cloud .............................................................................................................. 46
5.4 Future Work .................................................................................................................. 47
REFERENCES ............................................................................................................................ 48
Appendix A. Python Scripts .................................................................................................... 50
A.1 Scripts for Method One – Statistical Convergence .......................................................... 50
A.2 Scripts for Method Two – Static Number of Jobs ........................................................... 58
LIST OF TABLES
Table 1. List of HTCondor universes and descriptions .........................................................27
Table 2. Comparison of the two stochastic GSSHA simulation methods .............................40
LIST OF FIGURES
Figure 1. HTCondor network structure ..................................................................................22
Figure 2. Stochastic convergence of average peak discharge ................................................34
Figure 3. Diagram of how statistical convergence is determined for this project .................35
Figure 4. Python script interaction with HTCondor ..............................................................38
Figure 5. Plot of the time it took to complete certain numbers of iterations .........................42
Figure 6. A-D are graphs of statistical convergence for a tolerance of 0.001 .......................42
Figure 7. Graph of statistical convergence for a tolerance of 0.0001 ....................................44
1 INTRODUCTION
Water resource applications are often modeled one instance at a time on a single desktop
computer. In cases of smaller models, such circumstances provide enough computing power to
complete a model in a reasonable time frame. Some water resource applications, on the other
hand, are composed of multiple model instances and larger model domains requiring more
computing power than the average desktop computer alone can provide. For example,
researchers at Brigham Young University (BYU) are in the process of creating a web application
for generating stochastic simulations of hydrologic models as a part of a larger National Science
Foundation (NSF) initiative known as the Cyber-Infrastructure to Advance High Performance
Water Modeling (CI-WATER) project. An ongoing and computationally intensive modeling
project such as this requires high-performance computing (HPC) which provides access to large
amounts of computing resources that can manage multiple jobs at the same time. Common
methods of HPC include traditional supercomputers, which are designed to process as many
floating-point operations per second (FLOPS) as possible. The problems with these resources are
that they can be prohibitively expensive and therefore their accessibility is limited to a small
number of users within the hydrological community. While FLOPS is a good measure of
computational speed, it may not translate directly as a good metric for computational
performance when considering actual jobs completed. High-throughput computing (HTC) is a
form of HPC that uses a large number of relatively slower processors over a long period of time
to complete massive amounts of computations. HTC measures computational performance in
terms of jobs completed during a long period of time such as a week or month. HTCondor is a
scheduling and networking software developed by the University of Wisconsin-Madison (UW-
Madison) that accomplishes this task by linking large numbers of existing desktop computers
together to create one computing resource (Thain, Tannenbaum, & Livny, 2006).
This project was not an attempt to compare scheduling software; however, a
brief justification for the selection of HTCondor will be given.
queuing programs that are similar to HTCondor that could be used to meet various scheduling
and networking needs (Epema, Livny, van Dantzig, Evers, & Pruyne, 1996). What sets
HTCondor apart is its ability to manage many different types of computing resources and
operating systems while also facilitating job creation as well as job scheduling. Its unique
strengths include the ability to utilize opportunistic computing as well as checkpointing jobs to
dynamically match jobs and computational resources and allow jobs to be interrupted and
resumed at a later time (Thain, Tannenbaum, & Livny, 2005). It is difficult to gauge how many
instances of HTCondor have been installed since its creation in 1984. Because of its open-source
nature it is not sold and licensed like proprietary software and is therefore harder to track. It had
been confirmed at one point that HTCondor had been installed on over 60,000 central
processing units (CPUs) in 39 countries, demonstrating how trusted and reliable its performance
is (Thain et al., 2006). Having been created at UW-Madison, HTCondor has undergone extensive
testing and continues to be the subject of intensive research and development (Litzkow & Livny,
1990). HTCondor also offers a system that can be scaled to fit the needs of almost any collection
of computing resources from a half dozen desktop computers to a mix of hundreds of HPC
clusters and cloud resources as well as desktop CPUs (Bradley et al., 2011).
1.1 Problem Statement
The objective of this project is to demonstrate how HTCondor can be implemented at
BYU to provide a robust and economical computational environment that can support web-based
applications that perform hydrologic stochastic simulations.
2 LITERATURE REVIEW
As the literature suggests, there is a substantial number of queuing and scheduling
software packages available. This review will focus on literature that explains HTCondor’s strengths and
utility as they pertain to the specific instance of HTCondor at BYU. Although the literature
explains HTCondor’s utility in a variety of settings and applications, this project report will apply
each insight to how HTCondor can benefit hydrologic web applications at BYU.
Epema et al. (1996) provide evidence and explanation of how HTCondor can help to
provide a robust computational environment by connecting several pools of resources at BYU to
process jobs from a web application. They discuss the ability and usefulness of connecting
HTCondor pools, thus creating a flock of Condor resources that, in some cases, may span
continents while providing increased computational power to all connected. A flock was created
that connected resources in Madison (USA), Amsterdam and Delft (Netherlands), Geneva
(Switzerland), Warsaw (Poland), and Dubna (Russia). The flocking mechanism could make use of
the massive amounts of idle resources that sweep around the globe every 24 hours. The
HTCondor philosophy places emphasis on maintaining the control of the owners over their
workstations throughout the flock of resources. The three guiding principles of the HTCondor
system are as follows:
1. Condor should have no impact on owners.
2. Condor should be fully responsible for matching jobs to resources and informing
users of job progress.
3. Condor should preserve the environment on the machine on which the job was
submitted.
Bradley et al. (2011) provide examples of how HTCondor can be scaled to increase its
usefulness as an organization expands. They also examine the optimal scalability and deployment
of HTCondor. Each new version of Condor increases the system’s ability to schedule, match, and
collect more jobs in a pool of computing resources while decreasing wasted time during data
transfer. This information instills confidence that HTCondor will be able to meet the needs of
BYU web applications as the computing need grows.
Wright (2001) explains how the opportunistic nature of HTCondor’s scheduling
paradigm can benefit throughput, which will be needed to maximize the performance of any web
application that BYU researchers will create. HTCondor assumes that resources are not
constantly available and responds accordingly by continuously updating the resources being
matched to jobs. Other queuing systems assume that all resources are dedicated, which is not
always the case due to CPU maintenance and connection failures. This, as well as the ability to
suspend jobs when owners reclaim their computers, makes HTCondor more robust and reliable
than some alternatives. Also discussed in this paper is how the parallel “universe” is used to
support message passing interface (MPI) jobs with a combination of opportunistic and dedicated
scheduling. The parallel “universe” is used to run each job over multiple machines and
processors, which could be extremely useful in running instances of large parallel models in
conjunction with web applications.
Chervenak et al. (2007) provide insights on the types of jobs that would run most
efficiently on an HTCondor system helping us to diagnose which of BYU’s web applications are
good candidates for the different computational environments or “universes” in HTCondor.
HTCondor can be slowed down by transferring large data sets with each job’s submission.
Increases in wasted overhead time occur when submitting massive numbers of jobs to
HTCondor. The Open Science Grid (OSG) relies on HTCondor to “launch workflow tasks and
manage dependencies between operations” for a large flock of HTCondor pools at academic
institutions.
Thain et al. (2005) explain the philosophy behind the HTCondor workflow and show how
its developers have kept up with the data and computing industry to produce a relevant and
reliable system. These facts are useful when considering the future of HTCondor’s
implementation at BYU to make sure that HTCondor will continue to be supported and grow
with its user’s needs. Much discussion is given in this paper about how HTCondor has interacted
with other projects in the past and how it has evolved along with the field of distributed
computing. A few of HTCondor’s unique strengths include High Throughput Computing (HTC)
and Opportunistic Computing which are accomplished, in great part, due to Class Ads, Job
Checkpoint and Migration, and Remote System Calls. The Condor philosophy consists of four
fundamental points:
1. Let communities grow naturally.
2. Leave the owner in control, whatever the cost.
3. Plan without being picky.
4. Lend and borrow.
Bradley et al. (2011) explain how well suited HTCondor systems are to handling
“embarrassingly parallel” computations such as the stochastic simulations that are outlined in
this paper. Monte Carlo simulation is a procedure for estimating the value of some parameter. By
randomly sampling a feasible distribution of input parameters, a probable output parameter can
be generated. With a Monte Carlo simulation we can keep an average of the equally likely outputs
to produce a mean value and survey the collection of outputs to produce a description of their
distribution. Such simulations are a good candidate for parallel distribution of computations
because each iteration can run independently of all the others. The master-worker paradigm
consists of one central or master computer which manages the workload on a number of worker
computers. HTCondor creates a large environment of heterogeneous resources that can be very
efficient at computing large amounts of data. They also define and discuss the importance of
checkpoints in HTCondor. Checkpoints are images made by HTCondor of its jobs that make it
possible to vacate a job from one worker computer and then finish it on another if the first
computer’s owner reclaims it. Checkpoints allow for high throughput by reducing the amount of
computational work that is lost when jobs are evacuated after an owner reclaims a computer.
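To illustrate why such simulations parallelize so well, the following Python sketch runs a serial Monte Carlo loop of the kind HTCondor would distribute. The model function is a hypothetical stand-in for a single hydrologic model run, not actual GSSHA code, and the input distribution is invented for illustration:

```python
# Sketch of a Monte Carlo simulation (illustrative only; the "model"
# is a hypothetical surrogate for one hydrologic model run).
import random
import statistics

def model(rainfall_mm):
    """Hypothetical stand-in for one model run: returns a peak
    discharge as a simple function of the sampled input."""
    return 0.8 * rainfall_mm + random.gauss(0, 2)

def monte_carlo(n_iterations, seed=42):
    random.seed(seed)
    # Randomly sample a feasible distribution of the input parameter...
    outputs = [model(random.uniform(20, 60)) for _ in range(n_iterations)]
    # ...then summarize the equally likely outputs.
    return statistics.mean(outputs), statistics.stdev(outputs)

mean_q, sd_q = monte_carlo(1000)
```

Because each call to model() depends only on its own sampled input, the iterations could be split across worker computers and their outputs averaged afterward, which is exactly the structure HTCondor exploits.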
Thain et al. (2006) provide useful facts and figures about the history of HTCondor that
help justify using it instead of other scheduling software for our web
applications. It is very difficult to measure the reach of open-source software that is used at the
scale that HTCondor is. The figures in this report are given based on data taken from 2000 to
2005. The following figures give reasonable limits to the size and spread to which HTCondor
can and has been deployed. The median size of an HTCondor pool is only ten hosts. On 2
February 2005, Condor was deployed on at least 60,000 CPUs in 1,265 pools in 39 distinct
countries. The largest HTCondor pool that has been created as of 2005 was 5,608 CPUs.
Couvares, Kosar, Roy, Weber, and Wenger (2007) explain HTCondor’s ability to
handle complex, interdependent computational processes, which may prove useful when
generating statistics for particularly large or involved stochastic simulations. The directed acyclic
graph manager (DAGMan) provides a means for users to express job dependencies in HTCondor
and allows the queue to manage the order of execution. This paper also provides insight into how
jobs are submitted and run with HTCondor using daemons.
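As a sketch of DAGMan’s dependency syntax (the file names here are hypothetical), a small DAG input file expressing “run A, then B and C in parallel, then D once both finish” might look like this:

```
# diamond.dag -- hypothetical DAGMan input file
JOB A a.sub
JOB B b.sub
JOB C c.sub
JOB D d.sub
PARENT A CHILD B C
PARENT B C CHILD D
```

The DAG would be submitted to the queue with the condor_submit_dag command, and DAGMan would then release each job only after its parents complete.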
3 STOCHASTIC SIMULATIONS USING GSSHA AND HTCONDOR
This project consisted of the following three parts: the scheduling software HTCondor, the
Gridded Surface Subsurface Hydrologic Analyst (GSSHA) model, and the interpreted language
Python. The following sections give an explanation of each component and explain its role in the
overall structure of the hydrologic stochastic simulations that were generated at BYU.
3.1 HTCondor
HTCondor is a scheduling and networking program developed by the University of
Wisconsin-Madison in 1984 that provides a high-throughput computing environment by linking
various types of computational resources, including desktop computers, to create one unified
computational entity (Thain et al., 2005). In this way computational jobs can be sent and run on
multiple computers. Figure 1 shows a simplified diagram of how an HTCondor pool may be
connected. The pool of idle resources can contain potentially hundreds or thousands of
processors.
Figure 1. HTCondor network structure
This linked network of computational resources, or pool, is connected by a central
manager or master computer. The master computer manages the job queue and assigns jobs to
the idle processors of the rest of the computers on the network. All of the machines that are
configured to receive jobs from or submit jobs to the master computer are referred to as worker
computers. In addition to the locally connected pool of resources, the master computer may be
temporarily connected to external computational resources such as supercomputer nodes, cloud
resources or other HTCondor pools (Epema et al., 1996). HTCondor provides many settings and
environments to meet the specialized needs of users through its job scripts and universe
environments. Moreover, HTCondor can operate with any combination of operating systems,
including Windows, UNIX, and Linux. What makes this system so appealing is that no
monetary investment needs to be made by an organization to utilize HTCondor. The
computational resources are already in place in the form of personal desktop computers and the
software is open source and easy to download and install (Center for High Throughput
Computing, 2013).
3.1.1 Master Computer
There is one master computer per pool or cluster of HTCondor resources. The master
computer uses small programs called daemons, which continually run in the background on the
computer’s processors. Using the collector daemon, the master computer communicates with each
of the worker computers connected to its pool and keeps track of the working status of each of
the computer’s processors. The negotiator daemon matches jobs in the queue to available
resources found by the collector and the scheduling daemon employs a similar strategy to assign
and keep track of the status of each of the jobs that have been submitted to the master computer.
The master computer matches jobs that have been submitted to the available idle resources based
on the requirements specified in the job script. Master computers may also communicate with
external computing resources and match them to compatible jobs as desired (Goux, Linderoth,
Yoder, & Goux, 2000). The configuration of the master computer must be altered to have it
connect to external resources. To create a flock of HTCondor pools one master computer must be
connected to another by using the “FLOCK_TO” and “FLOCK_FROM” variables in their
respective configuration files. To connect a master computer to cloud computing resources the
appropriate “GRID” variable for the desired server type must be set in the master computer’s
configuration file.
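As a sketch of the flocking setup just described (the hostnames are placeholders, not BYU machines), the relevant lines in a master computer’s configuration file might read:

```
# condor_config fragment on the local master computer (hostnames hypothetical)
FLOCK_TO   = master.remotelab.example.edu    # pools this master may send jobs to
FLOCK_FROM = master.gradoffice.example.edu   # pools allowed to send jobs here
```

The corresponding variables would be set on the other master computer so that each pool authorizes the other.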
3.1.2 Worker Computers
Worker computers are what provide the computational power behind the HTCondor pool.
To keep computers in the pool, HTCondor follows the philosophy that owners must not feel
burdened. This is accomplished by giving careful consideration to owners’ freedom and to when a
processor should be viewed as idle. Rules defining when to consider a processor idle and ready to accept
jobs from the master computer are included in the configuration of each worker computer. The
rules that govern the status of each worker computer configured to accept jobs may vary
throughout the pool. A worker computer may accept jobs under one of the following conditions: always accept
jobs, accept jobs once the CPU has been idle for 15 minutes, accept jobs once the keyboard has
been idle for 15 minutes, or accept jobs once both the CPU and keyboard have been idle for
15 minutes (Litzkow & Livny, 1990). Worker computers can be configured to either submit
and/or execute jobs in the pool. Worker computers use daemons to periodically send updates on
their status to the master computer in preparation to receive jobs (Wright, 2001). A worker
computer that can only submit jobs will rely on a scheduling daemon to forward the job on to the
master computer to be matched with a worker computer configured to execute jobs. A worker
computer that can only execute jobs will rely on the start daemon to obtain and process jobs from
the master computer and return resulting files back through the master computer to the original
submission computer.
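For example, the idleness rules described above might be expressed in a worker computer’s configuration file with expressions like the following (a simplified sketch; HTCondor’s default policy expressions are more elaborate):

```
# Accept jobs only after the keyboard has been idle 15 minutes (in seconds)
# and the CPU load is low; suspend a job promptly when the owner returns.
START   = (KeyboardIdle > 15 * 60) && (LoadAvg < 0.3)
SUSPEND = (KeyboardIdle < 60)
```

Varying these expressions from machine to machine is what allows the rules governing each worker to differ throughout the pool.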
3.1.3 The BYU HTCondor Pool
In our small pilot pool I set up my desktop workstation to be the master computer for
seven other workstations in our graduate office. This pilot pool is a good representation of the
flexibility of the HTCondor philosophy. We had a heterogeneous collection of resources that
were set up with varying configurations and system environments to test the capabilities of this
system for our research needs and potential needs in the future. Six of the worker computers had
Windows operating systems. Five of these Windows machines had four 2GB processors and
were configured to accept jobs after the keyboard and CPU were idle for 15 minutes. The sixth
Windows computer had eight 1GB processors and was configured to always accept jobs. The last
computer was a Mac laptop.
After we experienced success computing GSSHA models with our pilot HTCondor pool
we worked to expand the resources that were available to us. We were able to create another pool
with 36 Windows desktop machines in one of the computer labs on campus. We connected this
new larger pool with our pilot pool using the flock configuration options in the two master
computers. This 500% increase in our computing resources significantly increased our capacity
to run stochastic simulations. This project caught the attention of other researchers that would
like to join in on the opportunity of free distributed computing in the future. We currently have
plans to convert an additional computer lab in the Civil and Environmental Engineering
Department to the BYU HTCondor network.
3.1.4 HTCondor Workflow
The HTCondor workflow revolves around the mechanism of job scripts. Job scripts
contain a list of specifications that define a job’s computational requirements, its input files, and
a batch executable that will get the job running once it is matched to a worker computer. Job
scripts are submitted to HTCondor for scheduling by using the condor_submit command in the
command line or terminal of the operating system that your job will be submitted from.
HTCondor then interprets the job script and makes a request to the master computer to make a
survey of the available resources and match the job to the appropriate idle resources for processing.
Once a match is made the master computer identifies the executable that will run on the
receiving worker computer as well as additional input files and either references or copies them
over to the receiving worker computer depending on the “universe” designated in the job script
(Couvares et al., 2007). After the master computer assigns the job to an idle resource it marks the
resource’s status as “busy” and also shows the job’s status as “running” in the queue. Once the
job finishes HTCondor copies any new files that were created by the job back to the computer
that submitted it. The job is removed from the queue and the resource that processed the job
regains an “idle” status.
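As a concrete sketch of the workflow just described, a minimal “vanilla universe” job script for a single model run might look like this (the executable and file names are hypothetical):

```
# run_model.sub -- hypothetical HTCondor job script
universe                = vanilla
executable              = run_model.bat
transfer_input_files    = model.prj, model.cmt
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
output                  = job.out
error                   = job.err
log                     = job.log
queue
```

The script would be submitted with the condor_submit command, and the job’s progress could then be followed in the queue with condor_q.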
3.1.5 Universe Environments
Different jobs may lend themselves better to different computing environments
depending on runtime, available resources, and the relationship of interdependencies between
jobs. HTCondor currently provides eight different environments called “universes”, each of
which enables the user to capitalize on opportunistic computing in unique ways (Center for High
Throughput Computing, 2013). The “universe” that a job is run on is determined by declaring it
in the job script that is used to submit each job to HTCondor. Table 1 below lists the HTCondor
universe environments and provides a brief description of each one’s functionality.
Table 1. List of HTCondor universes and descriptions
Universe Description of Computational Environment
standard Allows for job checkpointing and Remote System Calls (same for most universes)
vanilla Does not provide job checkpointing nor Remote System Calls
grid Connects to Cloud Resources such as Amazon’s EC2, the OSG, etc.
java Runs Java jobs on any operating system using a Java virtual machine
scheduler Runs jobs that will not be preempted if the owner reclaims the worker machine
local Runs jobs directly on submission computer
parallel Allows for MPI on multiple worker machines
VM Creates Virtual Machines as needed for HTCondor jobs
The features of the “standard universe” are available with some variation to most other
“universes”. One important standard feature is the ability to checkpoint jobs. Checkpoints are
created periodically on the worker computer as a job runs providing an image of the job’s
progress. The checkpoint image allows a job to be transferred and finished on another worker
computer in the case that it is preempted by the owner reclaiming the current worker computer
where the job is being processed. Checkpoints slow down jobs but may reduce system-wide
runtime in the case of longer-running jobs that have a greater likelihood of wasting computing
time if they get preempted. Another useful standard feature known as Remote System Calls
(RSC) allows worker computers to reference files and protocols on the submission computer.
This is useful when each job iteration requires the same input files or must reference a program
on the submission computer. Although RSC is convenient it can put stress on a network’s
connection in some unique cases (Chervenak et al., 2007).
I used the “vanilla universe” for this research because it ran fast for my specific job needs
and is simple to set up. The jobs that I tested stochastic simulations on complete in about 15
minutes, which is relatively fast, and have a good chance of completing before preemption. I
sacrificed the option of having checkpoints to gain computational speed with my short jobs.
Also, due to the small (600 KB total) inputs required by my test models of GSSHA there was no
trouble copying them over from the submission computer to the receiving worker computer.
3.2 External HTCondor Resources
External resources may include supercomputer or cloud computing nodes as well as links
to other master computers and their HTCondor pool of processors. Providing this flexibility in
the system is a powerful feature of HTCondor because it allows users to combine and employ
several computing methods as desired.
3.2.1 Cloud Resources
It is also possible to include commercial cloud resources as part of an HTCondor pool.
This requires specific settings in the job script and configuration file of the master computer. In
the configuration file the cloud resource types and server locations must be explicitly identified
in the “GRID” variable so that the master computer knows where to look to match and send jobs.
The “Universe” or environment of the job must be set to “GRID” in the job script and its
“Grid_Resource” option must be set to the specific cloud resource that you wish to use. Also, an
optional variable can be set in the job script to set a price threshold for jobs sent to the cloud, so
that jobs will only be run if the price per central processing unit (CPU) for an hour falls below
the threshold. For example, if you were using Amazon’s Elastic Compute Cloud (EC2) you
could set the “ec2_spot_price” variable to “0.011” so that HTCondor would send jobs to the
cloud only if the cost per CPU hour was $0.011 or less. One problem with creating a price
threshold for EC2 spot resources is that if the price per CPU hour goes above the threshold while
a job is still running, it may be terminated before it is finished. The spot resource setting
is another feature that makes HTCondor ideal for high-throughput computing on a budget
(Center for High Throughput Computing, 2013).
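As a sketch, the job script settings for an EC2 spot job of the kind described above might look like the following (the key files and AMI identifier are placeholders):

```
# Job script fragment for an Amazon EC2 spot instance (values hypothetical)
universe              = grid
grid_resource         = ec2 https://ec2.amazonaws.com/
ec2_access_key_id     = /path/to/access_key_file
ec2_secret_access_key = /path/to/secret_key_file
ec2_ami_id            = ami-xxxxxxxx
ec2_spot_price        = 0.011
queue
```

With ec2_spot_price set, HTCondor bids for spot capacity and starts the job only while the going rate stays at or below $0.011 per CPU hour.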
3.2.2 Supercomputing Resources
Supercomputer nodes are treated exactly like worker computers when connected to an
HTCondor pool. HTCondor would be installed on the HPC nodes to replace any previous
scheduling manager and they would become available in the master computer’s list of resources.
In the case of a pool of mixed HPC and desktop resources a user can have a job computed on a
certain selection of resources by designating “Class Ad” requirements. Class Ads are a record of
attributes that HTCondor keeps for each of its available resources (Thain et al., 2005). These
attributes include operating system type, processor size, IP address, and network name, among
others.
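For example, a submit script for a mixed pool could restrict a job to 64-bit Windows machines with at least 2 GB of memory by adding a requirements expression built from standard machine Class Ad attributes (the exact attribute values depend on the pool):

```
requirements = (OpSys == "WINDOWS") && (Arch == "X86_64") && (Memory >= 2048)
```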
3.2.3 External HTCondor Pools
Other HTCondor pools can also be linked to the local pool as external resources. Jobs are
sent first to the local master computer, and any jobs that remain idle locally will be sent to the
external pool if it has sufficient resources available that fit the job criteria. External HTCondor
pools can be linked together with the FLOCK_FROM and FLOCK_TO variables in the condor
configuration file on each master computer. A master computer may be set up to send jobs to an
external pool, while not receiving additional jobs from any external pools. To connect two
HTCondor pools they do not need to be on the same sub-network, but they must share a reliable
network connection. There are many examples of single HTCondor flocks that span several
countries and thousands of miles with reliable connections (Epema et al., 1996).
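As a sketch, flocking between two pools is enabled with one configuration line on each central manager (the hostnames here are hypothetical):

```
# condor_config on the local central manager (allowed to send jobs out)
FLOCK_TO = cm.remote-pool.example.edu

# condor_config on the remote central manager (allowed to accept those jobs)
FLOCK_FROM = cm.local-pool.example.edu
```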
A prime example of how a flock of HTCondor pools can be a powerful resource was
created by Epema et al. (1996). This flock facilitated a connection to several separate pools in
five different countries that can be utilized by all its participants by submitting jobs to
HTCondor. One advantage of such a widespread system of linked computing resources is that as
the work day rolls to a close in one time zone it frees up massive amounts of computing
resources for other areas that are just beginning their work day. One disadvantage of having a large
flock of computing resources is that the overhead of communicating between pools may increase
job times (Chervenak et al., 2007).
3.3 GSSHA
In conjunction with the computational power of HTCondor this project used the Gridded
Surface Subsurface Hydrologic Analyst (GSSHA) model. The GSSHA model is a fully
distributed physics-based model that can predict runoff in a watershed (Downer, Ogden, & Byrd,
2008). Stochastic simulations done with GSSHA lend themselves very well to the high-throughput computing environment because they require a great number of completely
independent computations (Basney, Raman, & Livny, 1999).
To produce a stochastic simulation with a GSSHA model some of the input files must be
altered. The files that I focused on for this project were the mapping table (CMT) file, project
(PRJ) file, and the outlet hydrograph (OTL) file. The CMT file contains most of the important
physical parameters that could be stochastically modeled with BYU’s web applications (Downer
et al., 2008). The PRJ file is the file that the GSSHA executable uses to find and reference all of
the other input files including the CMT file. The OTL file is the GSSHA output file that contains
a list of times and discharge rates that make up a hydrograph. For my project I used the peak
discharge from the OTL file as my parameter of interest to determine statistical convergence.
Because I used the HTCondor “vanilla universe” to test my scripts and complete
stochastic simulations with GSSHA I had to send the GSSHA executable file (gssha.exe) with
each job. If the GSSHA executable had been located in the same directory on every
worker computer I could have referenced it locally, but since it was not in the same location and
because it is small (900 KB) I copied it over with each job. Along with the executable, GSSHA
requires several input files that also had to be copied over to each worker computer.
copy files from one computer to another for a job in HTCondor they must be referenced in the
job script. As GSSHA runs it generates several output files, a few of which are large files that
were not needed in the statistical calculations. These larger, unnecessary outputs were deleted
from the executing worker computer before the job completed, so as to eliminate the time required
to copy them back to the submitting computer. Most universes in HTCondor use one processor
per job, which suits the version of GSSHA that I used in my tests. However, the parallel
universe has potential to utilize the message passing interface (MPI) that is available in a recent
release of GSSHA and distribute the jobs over multiple processors. This would reduce runtimes
on larger models.
3.4 Stochastic Simulations and Statistical Convergence
Stochastic simulations will be the most computationally intensive service of our CI-
WATER web applications, and therefore computing such simulations would potentially be a
bottleneck in our system. Using HTCondor helps us to alleviate this potential computing
bottleneck and make our web application design more robust and useful. HTCondor will provide
the job queuing and submission engine for the back end of these web services. To perform
stochastic simulations we needed to find a computing interface that could schedule and distribute
hundreds of individual jobs. Stochastic computations are a perfect candidate for HTCondor
because they are considered to be “embarrassingly parallel”. This means that each stochastic job
is 100% independent of the other stochastic jobs, and therefore easily distributed among separate
processors. In my test runs of GSSHA I performed stochastic simulations with the overland
roughness parameters that are found in the CMT file. I assumed that the uncertainty of the
roughness value occurred according to a normal distribution with a certain standard deviation
about a mean value. Using Python I generated a random value for each roughness and rewrote
the CMT file. The new CMT file was then copied over with all of the original inputs and
GSSHA was run on HTCondor. The varied roughness values in the CMT file produced variation
in the outlet hydrograph or OTL file of the GSSHA model. I was then able to generate statistics
based on all of the different OTL files that represented the possibilities of discharge in a
watershed based on the uncertainty inherent in the overland roughness input parameter. This
same type of simulation could be reproduced for any input parameter or combination of
parameters for any GSSHA model.
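The randomization step described above can be sketched in a few lines of Python. The ±20% bounds and the standard deviation of one-sixth of the allowed range follow the appendix scripts, but the function name is illustrative rather than the exact implementation:

```python
import random

def randomize_roughness(mean, rng=random):
    """Draw an overland roughness value from a normal distribution
    centered on the calibrated mean, rejecting draws that fall
    outside +/-20% of the mean."""
    lo, hi = mean * 0.8, mean * 1.2
    stdev = (hi - lo) / 6.0  # +/-3 standard deviations span the bounds
    x = rng.gauss(mean, stdev)
    while not (lo <= x <= hi):
        x = rng.gauss(mean, stdev)
    return x

# each stochastic job gets its own independently perturbed value
values = [randomize_roughness(0.05) for _ in range(5)]
```

Writing one such value per roughness ID into a fresh CMT file is what produces the variation in the outlet hydrographs.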
An important part of running stochastic simulations efficiently with limited resources is
testing for statistical convergence. In the context of a Monte Carlo simulation, statistical
convergence is reached when we continue randomly sampling the solution space but the average
value of the model result of interest (peak discharge for example) ceases to change significantly.
Since the statistic that we are interested in has stopped changing, more samples in the solution
space are not needed and we can stop the simulation. One of the Python scripts that I used to do
stochastic simulations with GSSHA required a specific number of iterations to perform, all of
which must run before statistics could be calculated on the whole job. The other Python script
calculated statistics during the job as each iteration completed. Using this method I was able to
monitor the last five changes to the average peak discharge for the whole job and once the
average of those changes fell below a set tolerance value I assumed statistical convergence had
been reached for the simulation and canceled the rest of the simulations in the HTCondor queue.
A sample graph of the change in average peak discharge per number of jobs completed is shown
in Figure 2 below.
Figure 2. Stochastic convergence of average peak discharge

Once statistical convergence was reached the remaining unnecessary jobs were removed
from the HTCondor queue by the Python script and a results file was written summarizing the
calculated statistics. Testing for statistical convergence and removing excess jobs from the queue
allows us to reduce wasted CPU usage and complete jobs faster. To start a stochastic simulation
using these Python scripts and HTCondor the user would put a conservative guess at how many
jobs to run. The idea is to guess high and let the Python script abort the process once
convergence is achieved. If by chance the user does not guess a high enough number of jobs, the
Python script will terminate normally once all of the scheduled jobs have run and still create a
final stochastic results file. Figure 3 shows the logic employed by the Python script to achieve
statistical convergence.
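The convergence rule described above (averaging the last five changes in the running mean peak discharge and comparing against a tolerance) can be sketched as follows; the function and variable names are illustrative, not the exact implementation:

```python
def has_converged(peaks, tol=0.001, window=5):
    """Return True when the average of the last `window` changes in the
    running mean peak discharge falls below `tol`."""
    if len(peaks) <= window:
        return False
    # running mean of peak discharge after each completed job
    means = [sum(peaks[:i]) / i for i in range(1, len(peaks) + 1)]
    changes = [abs(means[i] - means[i - 1]) for i in range(1, len(means))]
    return sum(changes[-window:]) / window < tol
```

Once this test passes, the remaining queued iterations can be removed from HTCondor (e.g., with condor_rm).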
Figure 3. Diagram of how statistical convergence is determined for this project
3.5 Using Python to Connect GSSHA and HTCondor
Python has been chosen as the interpreted language of choice for the CI-WATER project.
The web-based applications, database queries, and input/output tools are being constructed in
Python. To support consistency with the CI-WATER project, Python was the scripting language
used to interact with HTCondor as well as generate the varied inputs for GSSHA to create
stochastic simulations. The Python scripts that I developed have four purposes.
1. Make and organize a file structure that can dynamically manage inputs and outputs
for the jobs sent to and run on HTCondor.
2. Generate randomized GSSHA input files and HTCondor job and executable files in
their respective directories.
a. The Python script creates a job file that defines the environment of the
HTCondor job including which files the current job will need as inputs, how
many iterations of the job to put in the queue, and how to handle the outputs
upon completion of the job.
b. The script also creates the executable file that HTCondor will use to run the
job on the receiving worker computer. For this application in a Windows
operating system this file is a batch script which simply starts another Python
script that randomizes the specified GSSHA inputs and then runs the GSSHA
model.
c. The CMT or mapping table file is an input file for GSSHA where most of the
physical parameters of a watershed are located. The Python script rewrites
this file from the calibrated GSSHA inputs and randomly selects new
parameter values within a normal distribution. For my tests I selected the
overland roughness parameter to vary randomly.
d. My Python script also makes alterations to the PRJ or project file. This
GSSHA file references all of the other inputs and therefore had to be
rewritten to include the correct path to the new CMT file.
3. Using the subprocess module in Python my scripts submit the job file to HTCondor
for processing as well as start the GSSHA model running on the receiving worker
computer. Having Python function as the middleware for starting various processes
and programs allows for simpler user interaction with the system.
4. Another role that Python plays in these stochastic GSSHA simulations is that it
parses the results folder on the submission computer and generates statistics as
outputs are copied back from the worker computers through HTCondor. While the
Python script sniffs the results folder and processes new outputs that arrive it keeps a
running calculation of the average peak discharge from the outlet hydrograph that
GSSHA produces. If the average stops changing significantly then the Python script
considers the simulation to have reached statistical convergence and sends a
command to HTCondor to remove the remaining iterations of that specific job from
the queue. Testing for statistical convergence and removing superfluous jobs allows
us to use computing time more efficiently.
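Purposes 2 and 3 can be illustrated with a short sketch. The submit-file keywords are standard HTCondor commands, while the file names and helper functions are hypothetical:

```python
import subprocess

def write_job_file(path, n_iterations):
    """Write a vanilla-universe HTCondor job file (purpose 2a).
    The input file names are placeholders for the real GSSHA inputs."""
    lines = [
        "Universe = vanilla",
        "Executable = run_gssha.bat",
        "Transfer_input_files = gssha.exe,oneRndGSSHA.py,project.prj,project.cmt",
        "Should_transfer_files = YES",
        "When_to_transfer_output = ON_EXIT",
        "Queue " + str(n_iterations),
    ]
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")
    return lines

def submit(path):
    # purpose 3: Python acts as middleware, handing the job to HTCondor
    return subprocess.call(["condor_submit", path])
```

The Queue command at the end is what lets a single job file fan out into hundreds of independent iterations.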
I developed and tested two methods for parallelizing GSSHA and running stochastic
simulations. The first let the randomization of the inputs be performed within each job iteration
that ran on the worker computers and it used statistical convergence to cut the stochastic
simulation short if possible. The second method performed all of the input randomization on the
submission computer before the job was submitted to HTCondor and it simply executed a user-defined number of job iterations without statistical convergence.
The first method required two separate Python scripts called “sendOneRndCondor.py”
and “oneRndGSSHA.py”. The “sendOneRndCondor.py” script set up and submitted the
stochastic simulation to HTCondor from the submission computer. The “oneRndGSSHA.py”
script was submitted along with the job and produced the randomized CMT file and ran GSSHA
on the executing worker computers with the new input. A limitation of this method was that
Python was required on all of the worker computers to produce the random CMT file. A benefit
of this method was that it was more computationally efficient because it computed the stochastic
results as jobs finished and outputs returned from the queue. The “sendOneRndCondor.py” script
constantly parsed the results folder and generated a dynamic average peak discharge each time a
new OTL file was written back from HTCondor. It also calculated a convergence test value by
averaging the previous five changes to the average peak discharge and compared it to the
convergence tolerance set by the user. Once the test value became less than the tolerance the
Python script considered statistical convergence to be reached and then removed the remaining
jobs from the queue and generated a final statistical results file. Figure 4, below, shows how the
Python scripts from the first method interact with HTCondor and GSSHA to efficiently generate
stochastic simulations. HTCondor acts as a link between the submission and worker computers by
following the job description in the job file.
(Figure 4 diagram: the Submission Computer holds the job script, sendOneRndCondor.py, the
results folder, oneRndGSSHA.py, gssha.exe, and the GSSHA inputs; the job script is submitted
to HTCondor, which copies oneRndGSSHA.py, gssha.exe, and the GSSHA inputs over to the
Accepting Computer, where a batch file runs oneRndGSSHA.py to produce new GSSHA inputs
and run gssha.exe; the GSSHA outputs are then copied back to the results folder.)
Figure 4. Python script interaction with HTCondor
The second method used one Python script called “multRndCondorGSSHA.py” which
created all of the randomized CMT files on the submission computer before submitting each job
iteration individually to HTCondor. This method was limited in that it did not test for statistical
convergence, but instead ran the exact number of jobs the user requested. One possible benefit of
this method is that it did not require the receiving worker computers to have Python installed on
them. The GSSHA executable was the only thing that had to be run on the receiving computers.
Once the last job finished and the final outputs were written back to the submission computer the
Python script generated a final results file. This method is less complicated than the first, but it
does not utilize HTCondor to its full potential, nor does it use compute time as efficiently, since
it does not vary the number of job iterations that get completed based on statistical convergence.
Table 2 summarizes a comparison of the two methods that were developed to test
stochastic simulations of GSSHA on HTCondor. The comparisons are based on the four
functions of Python that were outlined as necessary for completing this project.
Table 2. Comparison of the two stochastic GSSHA simulation methods
Manage File Structure
  Method 1: Creates a new results folder on the submission computer. Creates one job that
  resides with all of its inputs and outputs directly in the results folder.
  Method 2: Creates a new results folder on the submission computer. Creates a new folder for
  each job’s inputs and outputs within the results folder.

Write Files
  Method 1: Writes one job file that produces multiple iterations. Writes one batch executable
  referenced by every job iteration. Writes a new project (prj) file for each job iteration on the
  receiving computer. Writes a new mapping table (cmt) file for each job iteration on the
  receiving computer.
  Method 2: Writes a job file for each job iteration. Writes a batch executable for each job
  iteration. Writes a new project file for each job iteration on the submission computer. Writes a
  new mapping table file for each job iteration on the submission computer.

Submit Jobs to HTCondor
  Method 1: Single job script queues multiple jobs in HTCondor. Sends original GSSHA files
  with job to receiving computers. Sends gssha.exe with job. Sends a second Python script with
  job to receiving computers. Batch executable calls the second Python script on the receiving
  computer.
  Method 2: Each job script queues one job in HTCondor. Sends original and altered GSSHA
  files with job to receiving computers. Sends gssha.exe with job. Does not send any Python
  scripts with job to receiving computer. Batch executable calls gssha.exe directly on the
  receiving computer.

Parse Outputs – Generate Statistics
  Method 1: Parses outputs as they are written back to the submission computer. Calculates
  statistics continually and terminates remaining job iterations if statistics converge.
  Method 2: Only parses outputs after all job iterations have completed. Calculates statistics
  once after all job iterations have completed.
4 CASE STUDY
In developing my Python scripts for this project I began with very simplistic GSSHA
models that had few inputs and only required a minute or so to run. Once the scripts were ready I
tested them on models of various sizes and configurations to be sure that the scripts were robust
enough to ingest any GSSHA model that I had encountered. For testing stochastic simulations
with the finalized scripts I used the long term advanced groundwater model from the GSSHA
tutorials. This particular GSSHA model had six precipitation events and used hydro-
meteorological (HMET) data to simulate a two week period. This GSSHA model took an
average of 14 minutes to run on a normal desktop computer using version 4.0 of the GSSHA
executable which does not support MPI. To run 150 instances of this model in series it would
take roughly 2,100 minutes or 35 hours.
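The serial runtime estimate above is simple arithmetic:

```python
# Back-of-envelope timing from the case study numbers
runs = 150
minutes_per_run = 14
serial_minutes = runs * minutes_per_run
serial_hours = serial_minutes / 60.0
print(serial_minutes, serial_hours)  # 2100 35.0
```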
Because of the nature of HTCondor each stochastic simulation ran on a different number
of processors ranging from about 80 to 140. As expected, with about 100 times the
computational power of normal circumstances I was able to essentially reduce the runtime by
a factor of 100. Figure 5 shows the relationship of the time in minutes that it took several
stochastic simulations to complete versus the number of iterations required to reach statistical
convergence. This plot shows a rough linear relationship between simulation run time and
the number of iterations. It is difficult to confirm this linear relationship, however, due to the varied
number of available computing resources for each simulation. More testing would be required to
establish a concrete relationship between the runtime and number of jobs that may exist on our
specific HTCondor pool.
Figure 5. Plot of the time it took to complete certain numbers of iterations
Using a convergence tolerance of 0.001 this model required from 15.9 to 33.3 minutes to
run between 115 and 185 jobs. Below, Figure 6 shows four graphs of different stochastic
simulations that were run on the HTCondor pool at BYU using that tolerance. The number of
processors available for these simulations fluctuated between 80 and 140 processors and the
average runtime was about 22 minutes. Using the convergence tolerance of 0.001 the average
peak discharge ranged from 9.54 to 9.72 cms.
(Figure 5 axes: time [min] versus number of job iterations completed; linear trendline R² = 0.9382.)
The spread of average peak discharge values obtained from a tolerance of 0.001 seems too
wide to demonstrate true convergence, so I ran simulations with a tolerance of 0.0001. Figure 7
shows the graph of statistical convergence for one of the stochastic simulations generated with a
tolerance of 0.0001. To make sure that I was going to run enough jobs to achieve convergence
with this tolerance I sent 10,000 job iterations to HTCondor for processing. This simulation
shows how using convergence can save on compute time because only 1,181 iterations had to
complete to achieve the tolerance that I was aiming for. About 8,800 jobs were removed from
HTCondor before they ever had to run, which saved potentially hundreds of CPU hours from
being wasted on this simulation. Using a convergence tolerance of 0.0001 the average peak
discharge ranged from 9.54 to 9.67 cms.

Figure 6. A-D are graphs of statistical convergence for a tolerance of 0.001 (each panel plots
the change in average peak discharge against the number of job iterations completed)
Figure 7. Graph of statistical convergence for a tolerance of 0.0001
The spread of average peak discharge values obtained from a tolerance of 0.0001 also
seems too wide to demonstrate confidence in the convergence algorithm. Instead of decreasing
the tolerance, we need to increase the number of previous changes in average peak discharge that
are averaged to determine convergence. To achieve a suitable convergence algorithm multiple
simulations would need to be run using a greater number of previous change values. By doing
this we will be able to improve the stochastic simulation and have more confidence in the
resulting statistics.
5 CONCLUSIONS
The conclusions from this project are as follows:
5.1 HTCondor
HTCondor is a useful computing resource for running stochastic simulations. One of the
main benefits of HTCondor is that it works with both internal and external computing resources.
HTCondor provides a robust queuing environment that has been continuously developed since 1984. New
research is being conducted with HTCondor every year and it continues to gain more and more
functionality while maintaining its original purpose.
HTCondor is well-developed scheduling software that contains much of the functionality
that researchers associated with the CI-WATER project need to process the jobs that will be
generated from web-based hydrologic tools. It can marshal enough computing power to handle intense
stochastic simulations and the simplicity to manage single computational jobs as well. This
project also serves as a successful case study of two ways Python might be utilized to
automate a great deal of HTCondor’s functionality. Using the scripts developed in this project as
a pattern, HTCondor could be used for many other applications besides GSSHA jobs.
5.2 Mary Lou
BYU’s supercomputer, Mary Lou, is a powerful collection of HPC clusters that is useful
for many computational applications. For a majority of the computational needs for our web
applications, however, we have found that the environment that HTCondor provides is much
more suited to our research. The supercomputer does not have the ideal operating system nor the
Python environment for the applications that we would like to support with our hydrologic web
interfaces which include options for stochastic simulations. Also, the supercomputer’s queue is
completely controlled by BYU’s Fulton Supercomputing Lab (FSL), which makes it difficult
to connect it to our web applications for quick turnaround of jobs.
Supercomputing with the FSL produces slower job turnaround than HTCondor because,
as researchers, we share the supercomputer with dozens of other departments. We have to
compete for run time on the supercomputer, and as we use more time than other research
accounts our priority falls in comparison. A single job that we send to Mary Lou can take hours
to reach the queue while similar jobs only have to wait 30 minutes at most with HTCondor.
5.3 Amazon Cloud
Commercial cloud computing services are innovative and can be very useful, but because
of the cost involved they are probably best-suited as a flexible extension of HTCondor. In this
way cloud resources could provide vast amounts of computational power on short notice in the
case of a fluctuation in computational need, while an HTCondor pool of local computers would
serve as the main source of computational power for our web applications a majority of the time.
5.4 Future Work
With the successful completion of this preliminary assessment of HTCondor’s potential
there are several tasks that will naturally follow. First, more testing could be done using different
HTCondor universes to provide even more flexibility and a stronger fit for various types of jobs.
Also, the Texas Water Mapper could be connected to the HTCondor pool of resources to begin
testing its computational performance on an existing web-based application. Tests need to be
conducted on the ease and reliability of connecting to and running jobs on cloud resources based
on cost per CPU hour. The Python code for this project could be rewritten to be object oriented
so that it fits in with CI-WATER’s coding standards and provides a more general coding model
for interfacing with HTCondor.
REFERENCES
Basney, Jim, Raman, Rajesh, & Livny, Miron. (1999). High throughput Monte Carlo. Paper presented at the Proceedings of the Ninth SIAM Conference on Parallel Processing for Scientific Computing.
Bradley, D, St Clair, T, Farrellee, M, Guo, Z, Livny, M, Sfiligoi, I, & Tannenbaum, T. (2011). An update on the scalability limits of the condor batch system. Paper presented at the Journal of Physics: Conference Series.
Center for High Throughput Computing, University of Wisconsin-Madison. (2013). HTCondor Version 7.8.8 Manual. Retrieved from http://research.cs.wisc.edu/htcondor/manual/v7.8/ref.html
Chervenak, Ann, Deelman, Ewa, Livny, Miron, Su, Mei-Hui, Schuler, Rob, Bharathi, Shishir, . . . Vahi, Karan. (2007). Data placement for scientific applications in distributed environments. Paper presented at the Grid Computing, 2007 8th IEEE/ACM International Conference on.
Couvares, Peter, Kosar, Tevfik, Roy, Alain, Weber, Jeff, & Wenger, Kent. (2007). Workflow Management in Condor. In I. Taylor, E. Deelman, D. Gannon & M. Shields (Eds.), Workflows for e-Science (pp. 357-375): Springer London.
Downer, Charles W, Ogden, Fred L, & Byrd, A.R. (2008). Gridded Surface Subsurface Hydrologic Analysis Version 4.0 for WMS 8.1 GSSHAWIKI User’s Manual: ERDC Technical Report, Engineer Research and Development Center, Vicksburg, Mississippi.
Epema, D. H. J., Livny, M., van Dantzig, R., Evers, X., & Pruyne, J. (1996). A worldwide flock of Condors: Load sharing among workstation clusters. Future Generation Computer Systems, 12(1), 53-65. doi: http://dx.doi.org/10.1016/0167-739X(95)00035-Q
Goux, Jean-Pierre, Linderoth, Jeff, Yoder, Michael, & Goux, Jean-pierre. (2000). Metacomputing and the master-worker paradigm. Paper presented at the Preprint MCS/ANL-P792-0200, Mathematics and Computer Science Division, Argonne National Laboratory, Argonne.
Litzkow, Mike, & Livny, Miron. (1990). Experience with the condor distributed batch system. Paper presented at the Experimental Distributed Systems, 1990. Proceedings., IEEE Workshop on.
Thain, Douglas, Tannenbaum, Todd, & Livny, Miron. (2005). Distributed computing in practice: the Condor experience. Concurrency and Computation: Practice and Experience, 17(2-4), 323-356. doi: 10.1002/cpe.938
Thain, Douglas, Tannenbaum, Todd, & Livny, Miron. (2006). How to measure a large open‐source distributed system. Concurrency and Computation: Practice and Experience, 18(15), 1989-2019.
Wright, Derek. (2001). Cheap cycles from the desktop to the dedicated cluster: combining opportunistic and dedicated scheduling with Condor. Paper presented at the Conference on Linux clusters: the HPC revolution.
APPENDIX A. PYTHON SCRIPTS
The following entries are scripts that I developed and tested for generating stochastic
simulations of GSSHA on HTCondor. These scripts are designed to be placed into a folder that
contains only one pre-set-up GSSHA model. These scripts may be run from any drive on the
submission computer, including external drives, as long as the drive remains connected until the
simulation is finished so that HTCondor can continually write back results to that drive. As far as
I have been able to test, these scripts are capable of running any GSSHA model.
A.1 Scripts for Method One – Statistical Convergence
oneRndGSSHA.py – This script generates one run of a GSSHA model with a randomized CMT file.
import math, os, random, subprocess, sys

def pargssha(argv):
    baseFilePath = argv[0]
    gssha = argv[1]
    project = argv[2]
    projectName = argv[3]
    b = argv[4]
    # set altered .prj filepath and remove if it exists
    projectNew = (baseFilePath + "project{0}.prj").format(b)
    if os.path.exists(projectNew):
        os.remove(projectNew)
    # write altered .prj file
    arg1 = [baseFilePath, gssha, project, projectName, b, projectNew]
    readCmt(arg1)
    writePrj(arg1)
    subprocess.call([gssha, projectNew])

def writePrj(argv):
    baseFilePath = argv[0]
    project = argv[2]
    b = argv[4]
    projectNew = argv[5]
    for file in os.listdir(baseFilePath):
        if file.endswith('.prj'):
            project = baseFilePath + file
    newPrjFile = open(projectNew, 'a')
    for line in open(project):
        if line.split()[0] == "SUMMARY":
            lineBy = line.strip().split()
            newPrjFile.write((lineBy[0] + "\t\t\t\t\t " + '"' + "project{0}.sum" + '"\n').format(b))
        elif line.split()[0] == "OUTLET_HYDRO":
            lineBy = line.strip().split()
            newPrjFile.write((lineBy[0] + "\t\t\t " + '"' + "project{0}.otl" + '"\n').format(b))
        elif line.split()[0] == "MAPPING_TABLE":
            lineBy = line.strip().split()
            newPrjFile.write((lineBy[0] + "\t\t\t " + '"' + "project{0}.cmt" + '"\n').format(b))
        else:
            newPrjFile.write(line)
    newPrjFile.close()

def readCmt(argv):
    baseFilePath = argv[0]
    b = argv[4]
    for file in os.listdir(baseFilePath):
        if file.endswith('.cmt'):
            cmtFilePath = baseFilePath + file
    newCmt = (baseFilePath + "project{0}.cmt").format(b)
    cmt = open(cmtFilePath)
    new = open(newCmt, 'a')  # this creates a new file
    nextLine = cmt.readline()
    new.write(nextLine)
    nextLine = cmt.readline()
    while nextLine.split()[0] == "INDEX_MAP":
        new.write(nextLine)
        nextLine = cmt.readline()
    new.write(nextLine)
    nextLine = cmt.readline()
    numIDs = int(nextLine.split()[-1])
    new.write(nextLine)
    nextLine = cmt.readline()
    new.write(nextLine)
    for i in range(0, numIDs):
        nextLine = cmt.readline()
        lineBy = nextLine.split()
        mean = float(lineBy[-1])
        min1 = mean*0.8
        max1 = mean*1.2
        stdev = (max1-min1)/6
        num = normalDist(mean, stdev, min1, max1)
        repLine = nextLine.replace(lineBy[-1], str(num)[:7])
        new.write(repLine)
    for i in range(0, 500):
        nextLine = cmt.readline()
        new.write(nextLine)
    new.close()

def normalDist(mean, stdev, min1, max1):
    x = (min1-5.0)
    while max1 > x < min1:
        x = gauss()*stdev + mean
    return x

def gauss():
    fac = 0.0
    r = 1.5
    V1 = 0.0
    V2 = 0.0
    rnd = random.random
    while r >= 1:
        V1 = 2*rnd() - 1
        V2 = 2*rnd() - 1
        r = V1**2 + V2**2
    fac = (-2*math.log(r)/r)**(1/2.0)
    gauss = V2*fac
    return gauss

# set basepath
baseFilePath = os.getcwd() + "\\"
# set gssha.exe path
gssha = baseFilePath + "gssha.exe"
# set main .prj file path
allFiles = os.listdir(baseFilePath)
for file in allFiles:
    if file.endswith('.prj'):
        project = baseFilePath + file
        projectName = file.split(".")[0]
b = random.randrange(0, 1000000000)
arg1 = [baseFilePath, gssha, project, projectName, b]
pargssha(arg1)
sendOneRndCondor.py – Sends multiple instances of “oneRndGSSHA.py” to HTCondor to be
processed and parses the outputs from those jobs to generate statistics on peak discharge. This
script terminates the simulations based on statistical convergence.
import os, subprocess, time, re def parseResults(resultFilePath,numRuns): numRuns = float(numRuns) maxPeak = [0.0,0.0] minPeak = [100000000000000000000000000000000.0,0.0] sumPeak = [0.0,0.0] sqrsumPeak = [0.0,0.0] avePeak = [0.0,0.0] stdPeak = [0.0,0.0] varPeak = [0.0,0.0] listPeak = [] listTime = [] for file in os.listdir(resultFilePath): if file.endswith('.otl'): oltFilePath = resultFilePath + file peak = extractPeakFlow(oltFilePath) sumPeak[0] = sumPeak[0] + peak[0] sumPeak[1] = sumPeak[1] + peak[1] listPeak.append(peak[0]) listTime.append(peak[1]) if peak[0] > maxPeak[0]: maxPeak = peak elif peak[0] < minPeak[0]: minPeak = peak avePeak[0] = sumPeak[0]/numRuns avePeak[1] = sumPeak[1]/numRuns
    for i in range(0,len(listPeak)):
        varPeak[0] = varPeak[0] + ((listPeak[i]-avePeak[0])**2)
        varPeak[1] = varPeak[1] + ((listTime[i]-avePeak[1])**2)
    stdPeak[0] = (varPeak[0]/numRuns)**(0.5)
    stdPeak[1] = (varPeak[1]/numRuns)**(0.5)
    stats = [avePeak,stdPeak]
    return stats

def extractPeakFlow(otlFilePath):
    openOtl = open(otlFilePath)
    peak = [0.0,0.0]
    flow = 0.0
    for line in openOtl:
        lineBy = line.split()
        flow = float(lineBy[1])
        if flow > peak[0]:
            peak[0] = flow
            # include the time for comparison
            peak[1] = float(lineBy[0])
    return peak

def quitCondor(jobNum):
    #This will only remove the current job
    subprocess.call(["condor_rm",jobNum])

def getIdleJobNum(resultFilePath, jobQcommand):
    subprocess.call([jobQcommand])
    idleJobNum = 1
    for jobfile in os.listdir(resultFilePath):
        if jobfile == "jobQ.out":
            for line in open(resultFilePath+jobfile):
                if 'jobs' in line:
                    idleJobNum = int(re.search(r'(\d+) jobs',line).group(0).split()[0])
    return idleJobNum

def fileCollector(resultFilePath,numJobs):
    resultFileList = set()
    oldStats = 0.0
    delta = []
    oldNumFiles = len(resultFileList)
    eps = 1
    #this pulls the job number (cluster #) of the job out of the submit.out file
    for jobfile in os.listdir(resultFilePath):
        if jobfile == "submit.out":
            for line in open(resultFilePath+jobfile):
                if 'cluster' in line:
                    jobNum = re.search(r'cluster (\d+)',line).group(0).split()[1]
    for jobfile in os.listdir(resultFilePath):
        if jobfile.endswith(".job"):
            for line in open(resultFilePath+jobfile):
                if 'Queue' in line:
                    qNum = int(re.search(r'Queue (\d+)',line).group(0).split()[1])
    """drive = resultFilePath.split(':')[0]
    jobQcommand = resultFilePath + "jobQ.bat"
    file = open(jobQcommand, 'a')
    file.write(drive + ": \n")
    file.write("cd " + resultFilePath + "\n")
    file.write("condor_q " + jobNum + " > jobQ.out\n")
    file.close()"""
    #test for convergence and then remove the remaining jobs from HTCondor
    while eps > 0.0001:
        resultsFiles = os.listdir(resultFilePath)
        for file in resultsFiles:
            if file.endswith('.otl'):
                resultFileList.add(file)
        numFiles = len(resultFileList)
        if numFiles == oldNumFiles:
            #if you didn't send enough jobs for convergence this should get you out of the loop
            if numFiles == numJobs:
                eps = 0
        elif oldNumFiles < numFiles:
            newStats = parseResults(resultFilePath,numFiles)
            delta.append((oldStats-newStats[0][0]))
            if len(delta) > 5:
                eps = (abs(delta[-1])+abs(delta[-2])+abs(delta[-3])+abs(delta[-4])+abs(delta[-5]))/5
            oldStats = newStats[0][0]
        oldNumFiles = numFiles
    quitCondor(jobNum)
    results = [newStats,delta,numFiles,eps]
    return results

def pargssha(baseFilePath,resultFilePath,b,numJobs):
    #write job, exe, and submit files
    fileList = fileFinder(baseFilePath)
    writeJob(resultFilePath,b,numJobs,fileList)
    writeExe(resultFilePath,b)
    writeSubmit(resultFilePath,b)
    submitCommand = resultFilePath + "submit.bat"
    subprocess.call([submitCommand])

def writeJob(resultFilePath,b,numJobs,fileList):
    pr = "project%s." %(b)
    newJobFile = resultFilePath + pr + "job"
    fileStr = ""
    file = open(newJobFile, 'a')
    file.write("Universe = vanilla\n")
    file.write(("Executable = {0}bat\n").format(pr))
    file.write('Requirements = Arch == "X86_64" && OpSys == "WINDOWS"\n')
    file.write("Request_Memory = 1200 Mb\n")
    file.write(("Log = {0}log.txt\n").format(pr))
    file.write(("Output = {0}out.txt\n").format(pr))
    file.write(("Error = {0}err.txt\n").format(pr))
    file.write("transfer_executable = TRUE\n")
    file.write("transfer_input_files = " + fileList[0])
    for i in range(1,len(fileList)):
        fileStr = fileStr + "," + fileList[i]
    file.write(fileStr + "\n")
    file.write("should_transfer_files = YES\n")
    file.write("when_to_transfer_output = ON_EXIT_OR_EVICT\n")
    file.write(("Queue {0}\n").format(numJobs))
    file.close()

def writeExe(resultFilePath,b):
    newExe = resultFilePath + "project%i.bat" %(b)
    file = open(newExe, 'a')
    file.write("C:\Python26\ArcGIS10.0\python.exe oneRndGSSHA.py > cmdout.txt\n")
    file.close()

def writeSubmit(resultFilePath,b):
    drive = resultFilePath.split(':')[0] # this allows the jobs to be sent from a thumb drive
    newBat = resultFilePath + "submit.bat"
    file = open(newBat, 'a')
    file.write(drive + ": \n")
    file.write("cd " + resultFilePath + "\n")
    file.write(("condor_submit " + resultFilePath + "project{0}.job > submit.out\n").format(b))
    file.close()

def fileFinder(baseFilePath):
    fileList = []
    files = os.listdir(baseFilePath)
    for file in files:
        if file.endswith('.prj'):
            project = baseFilePath + file
            projectName = file.split(".")[0]
    for file in files:
        if file.startswith(projectName):
            if file.endswith(".ohl") or file.endswith(".rec") or file.endswith(".gmh") or file.endswith(".dep"):
                pass
            else:
                fileList.append(baseFilePath + file)
        elif file.endswith(".idx"):
            fileList.append(baseFilePath + file)
        elif file.endswith("gssha.exe"):
            fileList.append(baseFilePath + file)
        elif file.endswith("oneRndGSSHA.py"):
            fileList.append(baseFilePath + file)
        elif file.startswith("HMET"):
            fileList.append(baseFilePath + file)
    return fileList

def main():
    t1 = time.clock()
    #set basepath
    baseFilePath = os.getcwd() + "\\"
    #set and clear results folder
    resultFilePath = baseFilePath + "Results1\\"
    b = 2
    while os.path.exists(resultFilePath):
        resultFilePath = (baseFilePath + "Results{0}\\").format(b)
        b = b + 1
    b = b - 1 #"b" is the number of the current Results folder
    #Make current Results path
    if not os.path.exists(resultFilePath):
        os.makedirs(resultFilePath)
    numJobs = 500
    pargssha(baseFilePath,resultFilePath,b,numJobs)
    results = fileCollector(resultFilePath,numJobs)
    dt = (time.clock()) - t1
    resultsOutFile = open((resultFilePath + "result{0}.out").format(b),'a')
    resultsOutFile.write("main Results = \n")
    for i in range(0,len(results)):
        resultsOutFile.write(str(results[i]) + "\n")
    resultsOutFile.write(("Finished in {0} seconds.\n").format(int(round(dt,0))))
    resultsOutFile.close()
    csv = open((resultFilePath + "result{0}.csv").format(b),'a')
    for i in range(0,len(results[1])):
        csv.write(str(i+1) + "," + str(results[1][i]) + "\n")
    csv.close()

if __name__ == '__main__':
    main()
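For reference, writeJob above emits an ordinary HTCondor submit description file. With a hypothetical cluster tag b = 42, two input files, and numJobs = 500, the generated project42.job would look like the following (the file paths are illustrative, not taken from the thesis runs):

```text
Universe = vanilla
Executable = project42.bat
Requirements = Arch == "X86_64" && OpSys == "WINDOWS"
Request_Memory = 1200 Mb
Log = project42.log.txt
Output = project42.out.txt
Error = project42.err.txt
transfer_executable = TRUE
transfer_input_files = C:\GSSHA\gssha.exe,C:\GSSHA\oneRndGSSHA.py
should_transfer_files = YES
when_to_transfer_output = ON_EXIT_OR_EVICT
Queue 500
```

The single "Queue 500" statement is what queues all 500 stochastic runs as one cluster, which is why fileCollector can remove every remaining job at once with condor_rm on the cluster number.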
A.2 Scripts for Method Two – Static Number of Jobs
multRndCondorGSSHA.py – This script runs a fixed number of randomized GSSHA models on
HTCondor and parses the outputs to generate a statistical results file. All of the randomized CMT
files are created on the submission computer prior to submitting jobs to HTCondor.
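The randomization applied to each mapping-table (CMT) value can be summarized on its own. The sketch below is illustrative rather than part of the script: the function name randomize_parameter is mine, and it uses the standard-library random.gauss in place of the script's hand-rolled gauss(). Each calibrated parameter is replaced by a draw from a normal distribution centered on its mean, truncated to ±20% of the mean, with the standard deviation chosen so the truncation bounds sit three standard deviations from the mean.

```python
import random

def randomize_parameter(mean, spread=0.2):
    """Draw a parameter value from a normal distribution centered on
    the calibrated mean, truncated to within +/- spread (20%) of it."""
    lo, hi = mean * (1 - spread), mean * (1 + spread)
    stdev = (hi - lo) / 6.0   # bounds at +/- 3 sigma
    x = lo - 1.0              # force at least one draw
    while x < lo or x > hi:   # rejection sampling enforces the truncation
        x = random.gauss(mean, stdev)
    return x
```

Because the bounds lie three standard deviations out, roughly 99.7% of draws are accepted on the first try, so the rejection loop rarely iterates more than once.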
import math, os, random, subprocess, sys, threading, time

t1 = time.clock()

def pargssha(argv):
    baseFilePath = argv[0]
    project = argv[1]
    projectName = argv[2]
    stuff = argv[3]
    b = argv[4]
    i = argv[5]
    gssha = argv[6]
    resultFilePath = argv[7]
    run = argv[8]
    cmt = argv[9]
    fileList = argv[10]
    #create a new work folder to copy altered .prj file into
    newPath = (resultFilePath + "NewGSSHAfolder{0}_{1}\\").format(b,i)
    if not os.path.exists(newPath):
        os.makedirs(newPath)
    #set altered .prj filepath and remove if it exists
    projectNew = (newPath + "project{0}_{1}.prj").format(b,i)
    if os.path.exists(projectNew):
        os.remove(projectNew)
    rain = ""
    #write altered .prj file
    arg1 = [baseFilePath,newPath,project,projectNew,projectName,stuff,rain,b,i,gssha,resultFilePath,run,cmt,fileList]
    readCmt(arg1)
    writePrj(arg1)
    writeJob(arg1)
    writeExe(arg1)
    writeSubmit(arg1)
    submitCommand = newPath + "submit.bat"
    subprocess.call([submitCommand])

def writeJob(argv):
    baseFilePath = argv[0]
    newPath = argv[1]
    projectName = argv[4]
    b = argv[7]
    i = argv[8]
    gssha = argv[9]
    fileList = argv[13]
    pr = "project%i_%i." %(b,i)
    newJobFile = newPath + pr + "job"
    fileStr = ""
    file = open(newJobFile, 'a')
    file.write("Universe = vanilla\n")
    file.write(("Executable = {1}bat\n").format(newPath,pr))
    file.write('Requirements = Arch == "X86_64" && OpSys == "WINDOWS"\n')
    file.write("Request_Memory = 1200 Mb\n")
    file.write(("Log = {0}log.txt\n").format(pr))
    file.write(("Output = {0}out.txt\n").format(pr))
    file.write(("Error = {0}err.txt\n").format(pr))
    file.write("transfer_executable = TRUE\n")
    file.write(("transfer_input_files = " + gssha + ",{0}prj,{0}cmt").format(pr))
    for i in range(0,len(fileList)):
        fileStr = fileStr + "," + fileList[i]
    file.write(fileStr + "\n")
    file.write("should_transfer_files = YES\n")
    file.write("when_to_transfer_output = ON_EXIT_OR_EVICT\n")
    file.write("Queue\n")
    file.close()

def writeExe(argv):
    newPath = argv[1]
    projectName = argv[4]
    b = argv[7]
    i = argv[8]
    pr = "project%i_%i." %(b,i)
    newExe = newPath + pr + "bat"
    file = open(newExe, 'a')
    file.write(("gssha.exe {0}prj > cmdout.txt\n").format(pr))
    file.write(("del {0}.dep /F /Q\n").format(projectName))
    file.write(("del {0}.rec /F /Q\n").format(projectName))
    file.write(("del {0}.ghm /F /Q\n").format(projectName))
    file.write("del maskmap /F /Q\n")
    #file.write("del out_time.out /F /Q\n")
    file.close()

def writeSubmit(argv):
    newPath = argv[1]
    b = argv[7]
    i = argv[8]
    drive = newPath.split(':')[0]
    newBat = newPath + "submit.bat"
    file = open(newBat, 'a')
    file.write(drive + ": \n")
    file.write("cd " + newPath + "\n")
    file.write(("condor_submit " + newPath + "project{0}_{1}.job\n").format(b,i))
    file.close()

def writePrj(argv):
    baseFilePath = argv[0]
    newPath = argv[1]
    project = argv[2]
    projectNew = argv[3]
    projectName = argv[4]
    stuff = argv[5]
    rain = argv[6]
    b = argv[7]
    i = argv[8]
    idx = argv[13]
    for file in os.listdir(baseFilePath):
        if file.endswith('.prj'):
            project = baseFilePath + file
    newPrjFile = open(projectNew, 'a')
    for line in open(project):
        if line.split()[0] == "SUMMARY":
            lineBy = line.strip().split()
            newPrjFile.write((lineBy[0] + "\t\t\t\t\t " + '"' + "project{0}_{1}.sum" + '"\n').format(b,i))
        elif line.split()[0] == "OUTLET_HYDRO":
            lineBy = line.strip().split()
            newPrjFile.write((lineBy[0] + "\t\t\t " + '"' + "project{0}_{1}.otl" + '"\n').format(b,i))
        elif line.split()[0] == "MAPPING_TABLE":
            lineBy = line.strip().split()
            newPrjFile.write((lineBy[0] + "\t\t\t " + '"' + "project{0}_{1}.cmt" + '"\n').format(b,i))
        else:
            newPrjFile.write(line)
    newPrjFile.close()

def readCmt(argv):
    baseFilePath = argv[0]
    newPath = argv[1]
    b = argv[7]
    i = argv[8]
    for file in os.listdir(baseFilePath):
        if file.endswith('.cmt'):
            cmtFilePath = baseFilePath + file
    newCmt = (newPath + "project{0}_{1}.cmt").format(b,i)
    cmt = open(cmtFilePath)
    new = open(newCmt, 'a') #this creates a new file
    nextLine = cmt.readline()
    new.write(nextLine)
    nextLine = cmt.readline()
    while nextLine.split()[0] == "INDEX_MAP":
        new.write(nextLine)
        nextLine = cmt.readline()
    new.write(nextLine)
    nextLine = cmt.readline()
    numIDs = int(nextLine.split()[-1])
    new.write(nextLine)
    nextLine = cmt.readline()
    new.write(nextLine)
    for i in range(0,numIDs):
        nextLine = cmt.readline()
        lineBy = nextLine.split()
        mean = float(lineBy[-1])
        min1 = mean*0.8
        max1 = mean*1.2
        stdev = (max1-min1)/6
        num = normalDist(mean,stdev,min1,max1)
        repLine = nextLine.replace(lineBy[-1], str(num)[:7])
        new.write(repLine)
    for i in range(0,500):
        nextLine = cmt.readline()
        new.write(nextLine)
    new.close()

def normalDist(mean,stdev,min1,max1):
    x = (min1-5.0)
    # redraw until the sample falls inside [min1, max1]
    while x < min1 or x > max1:
        x = gauss()*stdev + mean
    return x

def gauss():
    # Marsaglia polar method for a standard normal deviate
    fac = 0.0
    r = 1.5
    V1 = 0.0
    V2 = 0.0
    rnd = random.random
    while r >= 1:
        V1 = 2*rnd() - 1
        V2 = 2*rnd() - 1
        r = V1**2 + V2**2
    fac = (-2*math.log(r)/r)**(1/2.0)
    return V2*fac

def parseResults(path,numRuns,b,timeSum):
    resultFilePath = path
    numRuns = float(numRuns)
    openResultsSum = open((resultFilePath + "ResultSummary%i.txt" %b), 'a')
    maxPeak = [0.0,0.0]
    minPeak = [100000000000000000000000000000000.0,0.0]
    sumPeak = [0.0,0.0]
    sqrsumPeak = [0.0,0.0]
    avePeak = [0.0,0.0]
    stdPeak = [0.0,0.0]
    varPeak = [0.0,0.0]
    listPeak = []
    listTime = []
    for dir in os.listdir(resultFilePath):
        currentFolder = resultFilePath + dir + "/"
        if os.path.isdir(currentFolder):
            for file in os.listdir(currentFolder):
                if file.endswith('.sum'):
                    sumFilePath = currentFolder + file
                elif file.endswith('.otl'):
                    oltFilePath = currentFolder + file
                    peak = extractPeakFlow(oltFilePath)
                    sumPeak[0] = sumPeak[0] + peak[0]
                    sumPeak[1] = sumPeak[1] + peak[1]
                    listPeak.append(peak[0])
                    listTime.append(peak[1])
                    if peak[0] > maxPeak[0]:
                        maxPeak = peak
                    elif peak[0] < minPeak[0]:
                        minPeak = peak
    avePeak[0] = sumPeak[0]/numRuns
    avePeak[1] = sumPeak[1]/numRuns
    for i in range(0,len(listPeak)):
        varPeak[0] = varPeak[0] + ((listPeak[i]-avePeak[0])**2)
        varPeak[1] = varPeak[1] + ((listTime[i]-avePeak[1])**2)
    stdPeak[0] = (varPeak[0]/numRuns)**(0.5)
    stdPeak[1] = (varPeak[1]/numRuns)**(0.5)
    openResultsSum.write("MaxPeak Discharge = %f @ Time = %f\n" %(maxPeak[0],maxPeak[1]))
    openResultsSum.write("MinPeak Discharge = %f @ Time = %f\n" %(minPeak[0],minPeak[1]))
    openResultsSum.write("AvePeak Discharge = %f AveTime to Peak = %f\n" %(avePeak[0],avePeak[1]))
    openResultsSum.write("StDevPeak Discharge = %f StDevTime to Peak = %f\n" %(stdPeak[0],stdPeak[1]))
    openResultsSum.write(timeSum + "\n")
    openResultsSum.close()

def extractPeakFlow(path):
    otlFilePath = path
    openOtl = open(otlFilePath)
    peak = [0.0,0.0]
    flow = 0.0
    for line in openOtl:
        lineBy = line.split()
        flow = float(lineBy[1])
        if flow > peak[0]:
            peak[0] = flow
            # include the time for comparison
            peak[1] = float(lineBy[0])
    return peak

def fileFinder(baseFilePath):
    fileList = []
    files = os.listdir(baseFilePath)
    for file in files:
        if file.startswith(projectName):
            if file.endswith('.prj') or file.endswith(".cmt") or file.endswith(".ohl") or file.endswith(".rec") or file.endswith(".gmh"):
                pass
            else:
                fileList.append(baseFilePath + file)
        elif file.endswith(".idx"):
            fileList.append(baseFilePath + file)
        elif file.startswith("HMET"):
            fileList.append(baseFilePath + file)
    return fileList

#set basepath
baseFilePath = os.path.split(sys.argv[0])[0] + "\\"
print baseFilePath

#set and clear results folder
resultFilePath = baseFilePath + "Results1\\"
b = 2
while os.path.exists(resultFilePath):
    resultFilePath = (baseFilePath + "Results{0}\\").format(b)
    b = b + 1
b = b - 1 #"b" is the number of the current Results folder

#Make current Results path
if not os.path.exists(resultFilePath):
    os.makedirs(resultFilePath)

#set gssha.exe path
gssha = baseFilePath + "gssha.exe"
print "The path to gssha.exe is " + gssha

#set main .prj, .cmt, and .idx file paths
project = ""
allFiles = os.listdir(baseFilePath)
for file in allFiles:
    if file.endswith('.prj'):
        projectName = file.split(".")[0]
cmt = ""
fileList = fileFinder(baseFilePath)
numRuns = 100
run = numRuns + 1 # run-1 is the number of threads that will be generated
for i in range(1, run):
    print "Processing Project%i_%i..." %(b,i)
    #set variable to look for in main .prj file
    stuff = "RAIN_INTENSITY"
    arg1 = [baseFilePath,project,projectName,stuff,b,i,gssha,resultFilePath,run,cmt,fileList]
    #create thread "i" and run gssha model
    thread = threading.Thread(target=pargssha, args=[arg1])
    thread.start()
    for i in range(0,1000000):
        pass
    thread.join()

for i in range(1,run):
    checkFile = (resultFilePath + "NewGSSHAfolder{0}_{1}/project{0}_{1}.sum").format(b,i)
    while not os.path.exists(checkFile):
        pass
    print "WORKING..."

dt = (time.clock()) - t1
timeSum = ("Finished {0} Projects in {1} seconds").format(i,int(round(dt,0)))
parseResults(resultFilePath,numRuns,b,timeSum)
print "FINISHED!"