TRANSCRIPT
1
Applications Development for the Computational Grid
David AbramsonFaculty of Information TechnologyMonash University
2
Overview
- New methods in scientific discovery: e-Science & e-Research
- Computational platforms: the Grid and the Web
- Supporting a software lifecycle: the role of Grid services & middleware
- Software lifecycle tools: applications development, deployment, test and debugging, execution
- Examples from Monash tools
3
Scientific discovery
e-Science & e-Research
4
e-Science
Pre-Internet: theorize and/or experiment, alone or in small teams; publish a paper.
Post-Internet:
- Construct and mine large databases of observational or simulation data
- Develop simulations & analyses
- Access specialized devices remotely
- Exchange information within distributed multidisciplinary teams
Source: Ian Foster
5
6
Typical Grid Applications
Characteristics:
- High-performance computation
- Distributed infrastructure
- Instruments are first-class resources
- Lots of data: not just bigger, fundamentally different
Some examples:
- In silico biology (see MyGrid)
- Earthquake simulation
- Virtual observatory
- Dynamic aircraft maintenance
- High-energy physics
- Medical applications
- Environmental questions
7
Computational Platforms
Grid and Web Services
8
The Grid
- Infrastructure ("middleware" & "services") for establishing, managing, and evolving multi-organizational federations: dynamic, autonomous, domain independent; on-demand, ubiquitous access to computing, data, and services
- Mechanisms for creating and managing workflow within such federations: new capabilities constructed dynamically and transparently from distributed services; service-oriented, virtualization
Source: Ian Foster
9
What is a Grid?
Three key criteria. A Grid:
- coordinates distributed resources ...
- using standard, open, general-purpose protocols and interfaces ...
- to deliver non-trivial qualities of service.
What is not a Grid? A cluster, a network-attached storage device, a scientific instrument, a network, etc. Each may be an important component of a Grid, but by itself does not constitute a Grid.
Source: Ian Foster
10
The (Power) Grid: On-Demand Access to Electricity
[Chart: electricity demand over time; quality and economies of scale come from aggregating many consumers.]
Source: Ian Foster
11
By Analogy, A Computing Grid
- Decouple production and consumption
- Enable on-demand access
- Achieve economies of scale
- Enhance consumer flexibility
- Enable new devices
On a variety of scales: department, campus, enterprise, Internet.
Source: Ian Foster
12
Grid and Web Services Convergence
The definition of WSRF means that the Grid and Web services communities can move forward on a common base.
Source: Globus Alliance
13
Supporting the Software Lifecycle
14
Why is this challenging?
Write software for local workstation
15
Why is this challenging?
Build heterogeneous testbed
16
Why is this challenging?
Deploy Software
17
Why is this challenging?
Test Software
18
Why is this challenging?
Build, schedule & execute the virtual application
19
Why is this challenging?
Interpret results
20
But this is what I do well
21
Can we support this process better?
Deploy & Build
Execution
Applications Development
Test & Debug
22
Grid Services & Middleware
23
Building Software for the Grid (courtesy IBM)
- Applications: Geo Sciences, Environmental Sciences, Life & Pharmaceutical Sciences
- Middleware: Globus GT4, Condor, APST
- Platform/Infrastructure: Unix, Windows, JVM, TCP/IP, MPI, .NET Runtime, VPN, SSH
24
Building Software for the Grid (courtesy IBM)
The same stack, with the middleware layer split and bonded to the layers above and below:
- Applications: Geo Sciences, Environmental Sciences, Life & Pharmaceutical Sciences
- Upper Middleware & Tools
- Lower Middleware: Globus GT4, Condor, APST
- Platform/Infrastructure: Unix, Windows, JVM, TCP/IP, MPI, .NET Runtime, VPN, SSH
25
Building Software for the Grid
- Applications: Geo Sciences, Environmental Sciences, Life & Pharmaceutical Sciences
- Lower Middleware: Globus GT4, Web Services, Shibboleth, SRB
- Platform/Infrastructure: Unix, Windows, JVM, TCP/IP, MPI, .NET Runtime, VPN, SSH
26
Building Software for the Grid
- Applications: Geo Sciences, Environmental Sciences, Life & Pharmaceutical Sciences
- (Semantic Gap between the applications and the middleware)
- Lower Middleware: Globus GT4, Web Services, Shibboleth, SRB
- Platform/Infrastructure: Unix, Windows, JVM, TCP/IP, MPI, .NET Runtime, VPN, SSH
27
Coding to underwear

def build_rsl_file(executable, args, stagein=[], stageout=[], cleanup=[]):
    tocleanup = []
    stderr = t5temp.mktempfile()
    stdout = t5temp.mktempfile()
    rstderr = '${GLOBUS_USER_HOME}/.nimrod/' + os.path.basename(stderr)
    rstdout = '${GLOBUS_USER_HOME}/.nimrod/' + os.path.basename(stdout)

    rslfile = t5temp.mktempfile()
    f = open(rslfile, 'w')
    f.write("<job>\n  <executable>%s</executable>\n" % executable)
    for arg in args:
        f.write("  <argument>%s</argument>\n" % str(arg))
    f.write("  <stdout>%s</stdout>\n" % rstdout)
    f.write("  <stderr>%s</stderr>\n" % rstderr)
    # User defined stage-in section
    if stagein:
        f.write("  <fileStageIn>")
        for src, dest, leave in stagein:
            if not leave:
                tocleanup.append(dest)
            f.write("""<transfer>
    <sourceUrl>gsiftp://%s%s</sourceUrl>
    <destinationUrl>file:///${GLOBUS_USER_HOME}/.nimrod/%s</destinationUrl>
  </transfer>""" % (hostname, src, dest))
        f.write("\n\t</fileStageIn>\n")
    f.write("  <fileStageOut>")
    # User defined stage-out files section ...
28
Software Layers
- Applications: Geo Sciences, Environmental Sciences, Life & Pharmaceutical Sciences
- Upper Middleware/Tools
- Lower Middleware: Globus GT4, Web Services, Shibboleth, SRB
- Platform/Infrastructure: Unix, Windows, JVM, TCP/IP, MPI, .NET Runtime, VPN, SSH
29
Software Layers
- Applications: Geo Sciences, Environmental Sciences, Life & Pharmaceutical Sciences
- Upper Middleware/Tools: Nimrod, Nimrod Portal & WS, DistANT, Motor, Worqbench, Guard, REMUS, GriddLeS, Kepler, ActiveSheets, spanning Development, Deploy, Test/Debug and Execution
- Lower Middleware: Globus GT4, Web Services, Shibboleth, SRB
- Platform/Infrastructure: Unix, Windows, JVM, TCP/IP, MPI, .NET Runtime, VPN, SSH
30
Applications Development
31
Applications Development on the Grid
New applications:
- Code to middleware standards
- Significant effort
- Exciting new distributed applications
- Numerous programming techniques
Legacy applications:
- Were built before the Grid, and are fragile
- File-based IO; may be sequential
- Leverage old codes to produce a new virtual application
- Amenable to Grid workflows
32
Approaches to Grid programming
General-purpose workflows (a generic solution):
- Workflow editor
- Scheduler
Special-purpose workflows (solve one class of problem):
- Specification language
- Scheduler
33
Grid Workflows
[Software layers diagram repeated from slide 29.]
34
Genomics: Promoter Identification Workflow
Source: Matt Coleman (LLNL)
Source: Ilkay Altintas, SDSC
35
Ecology: GARP Analysis Pipeline for Invasive Species Prediction
[Workflow diagram: EcoGrid queries retrieve species presence & absence points (a) and environmental layers (b) for the native range and the invasion area from registered EcoGrid databases. Layer integration produces integrated layers (c); data sampling yields training and test samples (d); GARP computes a rule set (e). Map generation and validation produce native-range and invasion-area prediction maps (f) with model quality parameters (g); selected prediction maps (h) and generated metadata are archived back to the EcoGrid.]
Source: NSF SEEK (Deana Pennington et al., UNM); Ilkay Altintas, SDSC
36
Source: NIH BIRN (Jeffrey Grethe, UCSD)
Source: Ilkay Altintas, SDSC
37
The KEPLER GUI (Vergil)
Drag and drop utilities, director and actor libraries.
Source: Ilkay Altintas, SDSC
38
A Generic Web Service Actor: given a WSDL and the name of an operation of a web service, it dynamically customizes itself to implement and execute that method.
Configure - select service operation
Source: Ilkay Altintas, SDSC
39
Kepler Directors orchestrate workflow:
- Synchronous Data Flow: consumer actors are not started until the producer completes; files are copied from producer to consumer.
- Process Networks: all actors execute concurrently; communication is through TCP/IP sockets or dedicated IO.
The IO modes produce different performance results, and actors need to be coded to support specific IO modes.
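The contrast between the two director styles can be sketched with a toy process network: each actor runs concurrently in its own thread and streams tokens over a channel, rather than waiting for the producer to finish as under synchronous data flow. The names and the queue-based channel are illustrative stand-ins, not Kepler's actual actor API.

```python
import threading
import queue

def producer(out_q, items):
    # Each actor runs in its own thread, as in a process-network director.
    for x in items:
        out_q.put(x)
    out_q.put(None)  # end-of-stream token

def consumer(in_q, results):
    # Consumes tokens as they arrive, concurrently with the producer.
    while True:
        x = in_q.get()
        if x is None:
            break
        results.append(x * 2)

def run_process_network(items):
    q = queue.Queue()          # stands in for a socket or dedicated IO channel
    results = []
    t1 = threading.Thread(target=producer, args=(q, items))
    t2 = threading.Thread(target=consumer, args=(q, results))
    t1.start(); t2.start()     # both actors execute concurrently
    t1.join(); t2.join()
    return results
```

Under a synchronous-data-flow director, the consumer would instead start only after the producer had written its complete output file.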
40
Parameter Sweep Workflows with Nimrod
[Software layers diagram repeated from slide 29.]
41
Nimrod ... supports workflows for robust design and search:
- Vary parameters, execute programs, copy data in and out
- Sequential and parallel dependencies
- A computational economy drives scheduling; computation is scheduled near data when appropriate
- Uses distributed high-performance platforms, with an upper-middleware broker for resource discovery
- Wide community adoption
Nimrod Roadmap (1994-2006)
[Timeline: Nimrod, Nimrod/G, EnFuzion (www.axceleon.com), Nimrod/O, Nimrod/OI, Active Sheets (Excel), Nimrod/WS, Nimrod/K]
42
Parameter Studies & Search
Study or search the behaviour of some of the output variables against a range of different input scenarios:
- Design optimization
- Allows robust analysis
- More realistic simulations
Computations are loosely coupled (file transfer); there is a very wide range of applications.
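The core of a parameter study is the cross product of the input scenarios. A minimal sketch follows; the names are illustrative, and a real Nimrod experiment would stage files and dispatch each scenario as a grid job rather than call a Python function.

```python
import itertools

def parameter_sweep(model, parameter_space):
    """Run `model` once for every point in the cross product of parameters.

    `parameter_space` maps parameter names to the values to explore,
    mirroring the 'vary parameters, execute programs' pattern above.
    """
    names = list(parameter_space)
    runs = []
    for values in itertools.product(*(parameter_space[n] for n in names)):
        scenario = dict(zip(names, values))
        runs.append((scenario, model(**scenario)))  # one loosely coupled job
    return runs

# Toy 'simulation'; a real sweep would execute a binary with these inputs.
def model(length, radius):
    return length * radius ** 2
```

Each `(scenario, result)` pair corresponds to one independent job that a scheduler is free to place on any available resource.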
43
Nimrod scales from local to remote resources
Office
Department
Organisation
Nation
44
From Quantum chemistry to aircraft design
Drug Docking Aerofoil Design
45
Nimrod Development Cycle
Prepare jobs using the portal → jobs scheduled and executed dynamically, sent to available machines → results displayed & interpreted.
46
Experimental Design
We want to evaluate the effects of parameters and parameter combinations. The "Design of Experiments" approach:
- Dates back to 1950
- Is extensively used to generate the minimum number of "right" experiments
New support in Nimrod/G: specify the resolution of the experiment.
[Plot: estimated effects (from -0.8 to 0.8) of individual parameters and two-factor combinations, e.g. B, C, T, D, BC, BT, CT, DT. Legend: A = Kncx, B = KPCa, C = KgL, D = Kgammacf, E = PropLocalNCX, F = constKmCa, G = constKmNa, H = constksat, J = Kryrmax, K = Kvmax, L = Kgto, M = KVSR, N = KVSS, O = Kr_xfer, P = K_IpCamax, Q = K_KmpCa, R = KbL, S = KaL, T = KfL]
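The Design of Experiments idea can be illustrated with a two-level full factorial design and a main-effect estimate. This is a textbook sketch with illustrative names, not Nimrod/G's implementation.

```python
import itertools

def two_level_design(factors):
    """Full two-level factorial design: every combination of -1/+1 levels."""
    return [dict(zip(factors, levels))
            for levels in itertools.product((-1, +1), repeat=len(factors))]

def main_effect(runs, responses, factor):
    """Average response at the high level minus average at the low level."""
    hi = [y for r, y in zip(runs, responses) if r[factor] == +1]
    lo = [y for r, y in zip(runs, responses) if r[factor] == -1]
    return sum(hi) / len(hi) - sum(lo) / len(lo)
```

Fractional designs reduce the run count further by confounding high-order interactions, which is how a minimum number of "right" experiments is chosen.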
47
Optimization using Nimrod/O
- Nimrod/G allows exploration of design scenarios: search by enumeration
- Nimrod/O searches for local/global minima based on an objective function: "How do I minimise the cost of this design?" "How do I maximize the life of this object?"
- The objective function is evaluated by a computational model, which is computationally expensive
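The loop Nimrod/O automates can be sketched as a local search that treats each objective evaluation as an expensive model run. The coordinate-search strategy below is an illustrative stand-in for the BFGS, simplex and genetic algorithms Nimrod/O actually dispatches.

```python
def coordinate_search(objective, start, step=0.5, min_step=1e-3):
    """Minimise an expensive objective by local coordinate search.

    Each call to `objective` stands in for a full model run that an
    optimization framework would dispatch to the grid as a batch job.
    """
    point = list(start)
    best = objective(point)
    while step > min_step:
        improved = False
        for i in range(len(point)):
            for delta in (+step, -step):
                trial = list(point)
                trial[i] += delta
                value = objective(trial)   # one expensive model evaluation
                if value < best:
                    point, best, improved = trial, value, True
        if not improved:
            step /= 2  # refine the search when no neighbour is better
    return point, best

# Toy objective with its minimum at (1, -2).
f = lambda p: (p[0] - 1) ** 2 + (p[1] + 2) ** 2
```

Because each evaluation is an independent job, the candidate points of one iteration can be farmed out in parallel, which is exactly where a grid scheduler earns its keep.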
48
How Nimrod/O Works
[Diagram: optimization algorithms (BFGS, Simplex, Genetic Algorithm) request function evaluations from a dispatcher; the dispatcher turns them into jobs described by a Nimrod plan file, which Nimrod or EnFuzion executes on a grid or cluster.]
49
Interactive Design
- Human in the optimization loop
- Use population-based methods
- Rank solutions
50
Interactive Design
51
Deployment
[Software layers diagram repeated from slide 29.]
52
Why is this challenging?
Deploy Software
53
Deployment: two approaches
- Hide the heterogeneity: use local knowledge about the instruction set, machine structure, file system, I/O system, and installed libraries; build on VM technology; provide services for deployment
- Expose the heterogeneity: build an integrated framework that knows about the testbed; support the user in managing differences; build on IDE technology
54
[Diagram: a deployment service. Application source is compiled by .NET compilers on the client machine into intermediate code or an application binary; the deployment service installs it on the grid resource (via GRAM) among the installed applications and returns an application handle used to execute it. The grid resource runs the .NET Runtime and a .NET parallel virtual machine on Globus/OGSA.]
Hide the heterogeneity:
- Grid build files based on ANT
- Create a deployment space consistent with GT GRAM
55
Motor Runtime: A VM for HPC
Our approach is runtime-internal. Why do Java & .NET support web services, UI, security and other libraries as part of the standard environment? Because that functionality is then guaranteed. Similarly, we aim to provide guaranteed HPC functionality.
56
Leveraging IDEs: Worqbench. Use the Eclipse IDE to support users; the testbed is a first-class object in the IDE.
57
Test and Debug
[Software layers diagram repeated from slide 29.]
58
Why is this challenging?
Test Software
59
Grid-level basic debugging
[Diagram: the hardware and software layers involved in debugging across the grid.]
60
Relative debugging
What do you do when you move your application to another node of the Grid and it stops working? Subtle errors can be introduced through changes:
- by the programmer
- in the environment (DLL Hell)
The programmer must understand the application intimately to locate the source of errors, and can spend much time:
- tracing program state to locate the source of an error
- understanding how code changes may have resulted in errors
Relative debugging is about automating this process: a hybrid test-and-debug methodology.
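The comparison step can be sketched as follows: capture corresponding variables from the working and failing versions at matching breakpoints, then report where they diverge. This is a toy illustration of the idea, not Guard's actual interface.

```python
def compare_states(reference, suspect, tolerance=1e-9):
    """Compare corresponding variables captured from two runs.

    `reference` and `suspect` map variable names to values dumped at
    matching breakpoints in the working and failing versions; any
    divergence localises where the ported code first goes wrong.
    """
    differences = []
    for name, ref_value in reference.items():
        if name not in suspect:
            differences.append((name, ref_value, None))
        elif abs(suspect[name] - ref_value) > tolerance:
            differences.append((name, ref_value, suspect[name]))
    return differences
```

A tolerance matters in practice: runs on machines with different word sizes or endianness can legitimately differ in the low-order bits without being wrong.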
61
Relative Debugging on the Grid
[Diagram: a client running GUARD connects over TCP/IP to two servers running the application: one big-endian, 64-bit; the other little-endian, 32-bit.]
62
[Diagram: the relative debugging cycle. Build assertions over the source code (simple and complex data types), run both applications, and when results differ, visualize the differences.]
63
Execution
[Software layers diagram repeated from slide 29.]
64
Why is this challenging?
Build, schedule & execute the virtual application
65
The Nimrod Portal
66
Nimrod's Runtime Machinery
[Chart: number of running jobs (0 to 12) over time (0 to 54 minutes) on each resource: Linux cluster - Monash (20), Sun - ANL (5), SP2 - ANL (5), SGI - ANL (15), SGI - ISI (10).]
A soft real-time scheduling problem.
67
Flexible Grid Workflows
[Diagram: a climate workflow couples the CCAM global model to DARLAM and several RCM regional models and a CIT model, running across a vector machine, a shared-memory multiprocessor, a Linux cluster and a mainframe. Global climate data (temperature, pressure, etc.) drives regional weather data, producing ozone concentration contours.]
All models provided by CSIRO Division of Atmospheric Research.
68
GriddLeS
Legacy applications need to be shielded from IO details in the Grid:
- Local files
- Remote files
- Replicated files
- Producer-consumer pipes
We don't want to lock in the IO model when the application is written (or even Grid-enabled); the choice of IO model should be dynamic and late-bound.
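A late-bound IO choice can be sketched as a file multiplexer that consults a name-service table at open() time, so the same program binds to a local file, a remote copy, or a pipe without being rewritten. The class and table below are illustrative stand-ins for the GriddLeS components, not their real API.

```python
class FileMultiplexer:
    """Late-bound open(): a name service decides at run time whether a
    logical file name maps to a local file, a remote copy, or a pipe.

    The mapping table plays the role of the GriddLeS name server; the
    remote handler is a simplified stand-in for a GridFTP-style client.
    """
    def __init__(self, name_service):
        self.name_service = name_service  # logical name -> (kind, location)

    def open(self, logical_name):
        kind, location = self.name_service[logical_name]
        if kind == "local":
            return open(location)              # ordinary file-system call
        if kind == "remote":
            return self._open_remote(location)
        raise ValueError("unknown IO binding: %r" % kind)

    def _open_remote(self, location):
        raise NotImplementedError("remote transport not modelled here")
```

Rebinding a workflow from files to pipes then becomes a change to the mapping table, not to the legacy application.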
69
Flexible IO in GriddLeS
[Diagram: a legacy application issues open(), close(), read(), write() and seek() calls to a file multiplexer, which routes them to a local file, a remote file, a remote application process, or a cache backed by replicas (Replica 1-3).]
70
GriddLeS Architecture
[Diagram: each application's read/write calls pass through a file multiplexer that dispatches to a local file client (local file system), a remote file client (GridFTP server), a grid buffer client (grid buffer server), a GriddLeS Name Server (GNS) client, or a replication service (GRS) client backed by SRB, Globus replication, or GFarm.]
71
Nimrod/K
A new project to integrate:
- the special-purpose functions of Nimrod/G/O
- general-purpose workflows from Kepler
- the IO model from GriddLeS
with better integration with portals and more flexible scheduling.
72
Can we support this process better?
Applications Development → Deploy & Build → Test & Debug → Execution
Support scientists to do what they do best: science, through a combination of middleware and software tools.
73
Acknowledgements (Monash Grid Research)
Research Fellows: Colin Enticott, Slavisa Garic, Jagan Kommineni, Tom Peachy, Jeff Tan
PhD Students: Shahaan Ayyub, Philip Chan, Tim Ho, Donny Kurniawan, Wojtek Goscinski, Aaron Searle
Funding & Support: CRC for Enterprise Distributed Systems (DSTC), Australian Research Council, GrangeNet (DCITA), Australian Partnership for Advanced Computing (APAC), Microsoft, Sun Microsystems, IBM, Hewlett Packard, Axceleon
74
Questions?
www.csse.monash.edu.au/~davida
75
Plan File

parameter energy label "Variable Photon Energy" float select anyof 0.03 0.05 0.1 0.2 0.3 default 0.03 0.05 0.1 0.2 0.3;
parameter iseed integer random from 0 to 10000;
parameter length label "Length of collecting electrode" float select anyof .8 .9 1 default .8 .9 1;
parameter radius label "Radius" float select anyof 0.0625 0.0725 0.0825 default 0.0625 0.0725 0.0825;

task nodestart
    copy NE2611.dat node:.
    copy ne2611.skel node:.
endtask

task main
    node:substitute ne2611.skel NE2611.INP
    node:execute ne2611.xx
    copy node:NE2611.OP ne2611out.$jobname
    copy node:stderr ne2611.time.$jobname
endtask
Burnoff of the Australian savanna: does it affect the climate? Testing the PRAGMA Testbed.
K. Görgen, A. Lynch, C. Enticott*, J. Beringer, D. Abramson**, P. Uotila, N. Tapper
School of Geography and Environmental Science; * Distributed Systems Technology Centre; ** School of Computer Science and Software Engineering
77
Savanna Burnoff
- Extensive savanna ecosystems in northern Australia: 25% of Australia
- Vegetation: spinifex/tussock grasslands; forest/open woodland
- Warm, semi-arid tropical climate
- Primary land uses: pastoralism, mining, tourism, Aboriginal land management
(Tropical Savannas CRC)
78
Motivation
- Extensive savanna ecosystems in northern Australia
- Changing fire regime
- Fires lead to abrupt changes in surface properties: surface energy budgets, partitioning of convective fluxes, increased soil heat flux → modified surface-atmosphere coupling
- Sensitivity study: do the fires' effects on atmospheric processes lead to changes in the highly variable precipitation regime of the Australian monsoon?
- Many potential impacts (e.g. agricultural productivity)
(J. Beringer)
79
Experiment Design
- Combination of atmospheric modelling (C-CAM), re-analysis and observational data
- C-CAM simulations:
  - Part I, 1974 to 1978: spinup
  - Part II, 1979 to 1999: control run (no fires/succession), plus real fires/succession under selected scenarios
- ~90 independent runs (fire/succession scenarios) for sensitivity studies → 1890 years of simulations
80
Use of Grid Computing
- 90 parallel independent model runs
- Single-CPU model version of parallelized C-CAM (MPI)
- Distribution of forcing data repositories to cluster sites (~80 GB), 250 MB forcing data per month
- Machine-independent data formats (NetCDF)
- Architecture-specific, validated C-CAM executables
- ~1.5 months CPU time for one experiment (90 experiments total)
- Robust, portable, self-controlling model system incl. all processing tools and restart files
- PRAGMA Testbed:
  - Can we get enough nodes to complete the experiment?
  - Can we maintain a testbed for 1.5 months?
  - Can we maintain a node up for 0.5 days?
  - Can we make this routine for climate modelers?
81
[Chart: progress (0 to 100) over time from Mar 08 2006 to Aug 19 2006 on the testbed machines mahar, rocks-52, ume, jupiter, pragma001, amata1 and tgc, plus the total.]