Ian C. Smith
ULGrid – Experiments in providing a campus grid

TRANSCRIPT
Overview

Current Liverpool systems
PC Condor pool
Job management in ULGrid using Condor-G
The ULGrid portal
Storage Resource Broker
Future developments
Questions
Current Liverpool campus systems

ulgbc1 – 24 dual-processor Athlon nodes, 0.5 TB storage, GigE
ulgbc2 – 38 single-processor nodes, 0.6 TB storage, GigE
ulgbc3 / lv1.nw-grid.ac.uk – NW-GRID: 44 dual-core, dual-processor nodes, 3 TB storage, GigE; HCC: 35 dual-core, dual-processor nodes, 5 TB storage, InfiniPath
ulgbc4 / lv2.nw-grid.ac.uk – 94 single-core nodes, 8 TB RAID storage, Myrinet
PC Condor pool – ~300 Managed Windows Service PCs
Diagram: High Capability Cluster and NW-GRID (front-end node: ulgbc3 or lv1.nw-grid.ac.uk)

NW-GRID: 44 dual-processor, dual-core nodes (176 cores), 2.2 GHz, 8 GB RAM, 146 GB disk, Gigabit Ethernet interconnect, Panasas disk subsystem (3 TB)
HCC: 35 dual-processor, dual-core nodes (140 cores), 2.4 GHz, 8 GB RAM, 200 GB disk, InfiniPath interconnect, SATA RAID disk subsystem (5.2 TB)
PC Condor Pool

allows jobs to be run remotely on MWS teaching centre PCs at times when they would otherwise be idle (~300 machines currently)
provides high-throughput rather than high-performance computing (maximises the number of jobs which can be processed in a given time)
only suitable for DOS-based applications running in batch mode
no communication between processes possible ("pleasantly parallel" applications only)
statically linked executables work best (although DLLs can be handled)
can access application files on a network-mapped drive
long-running jobs need to use Condor DAGMan
authentication of users prior to job submission via ordinary University security systems (NIS+/LDAP)
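Using DAGMan for long-running work amounts to splitting the computation into shorter stages so that each stage fits into a window when a PC is free. A minimal sketch of the idea (stage and file names are hypothetical, not from the slides):

```
# stages.dag – run stage2 only after stage1 completes successfully
JOB  stage1  stage1.sub
JOB  stage2  stage2.sub
PARENT stage1 CHILD stage2
```

with one ordinary submit file per stage, e.g.:

```
# stage1.sub (illustrative)
universe             = vanilla
executable           = stage1.bat
transfer_input_files = model.in
queue
```

The whole chain is then submitted with `condor_submit_dag stages.dag`, and DAGMan resubmits any stage that fails.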
Condor and power saving

power saving is employed on all teaching centre PCs by default
machines power down automatically if idle for > 30 min with no user logged in, but ...
... a machine running a Condor job remains powered up until the job completes
the NIC remains active, allowing remote wake-on-LAN
the submit host detects when the number of idle jobs exceeds the number of idle machines and wakes up the pool as necessary
a couple of teaching centres remain "always available" for testing etc.
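The wake-up step relies on standard wake-on-LAN: the submit host broadcasts a "magic packet" (6 bytes of 0xFF followed by the target MAC repeated 16 times) and the sleeping PC's NIC powers the machine on. A minimal sketch of how such a packet could be built and sent (function names and the broadcast address are illustrative, not from the slides):

```python
import socket

def magic_packet(mac: str) -> bytes:
    """Build a WoL magic packet: 6 bytes of 0xFF followed by the
    target MAC address repeated 16 times (102 bytes in total)."""
    mac_bytes = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    if len(mac_bytes) != 6:
        raise ValueError("expected a 6-byte MAC address")
    return b"\xff" * 6 + mac_bytes * 16

def wake(mac: str, broadcast: str = "255.255.255.255", port: int = 9) -> None:
    """Send the magic packet as a UDP broadcast; port 9 (discard)
    is the conventional choice for wake-on-LAN."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(magic_packet(mac), (broadcast, port))
```

The submit host would call `wake()` once per sleeping machine until enough slots are available for the queued jobs.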
Architecture diagram: the Condor pool spans Teaching Centre 1, Teaching Centre 2, and other centres; users log in to a Condor submit host and a Condor portal, with a Condor central manager and a Condor view server coordinating the pool.
Condor research applications

molecular statics and dynamics (Engineering)
prediction of shapes and properties of molecules using quantum mechanics (Chemistry)
modelling of avian influenza propagation in poultry flocks (Vet Science)
modelling of E. coli propagation in dairy cattle (Vet Science)
model parameter optimization using Genetic Algorithms (Electronic Engineering)
computational fluid dynamics (Engineering)
numerical simulation of ocean current circulation (Earth and Ocean Science)
numerical simulation of geodynamo magnetic field (Earth and Ocean Science)
Figure: boundary layer fluctuations induced by freestream streamwise vortices (flow direction indicated).
Figure: boundary layer 'streaky structures' induced by freestream streamwise vortices (flow direction indicated).
ULGrid aims

provide a user-friendly single point of access to cluster resources
Globus based, with authentication through UK e-Science certificates
job submission should be no more difficult than using a conventional batch system
users should be able to determine easily which resources are available
meta-scheduling of jobs
users should be able to monitor the progress of all jobs easily
jobs can be single-process or MPI
job submission from either the command line (qsub-style script) or the web
ULGrid implementation

originally tried Transfer-queue-over-Globus (ToG) from EPCC for job submission, but ...
messy to integrate with SGE
limited reporting of job status
no meta-scheduling possible
decided to switch to Condor-G

Globus Monitoring and Discovery Service (MDS) originally used to publish job status and resource info, but ...
very difficult to configure
hosts mysteriously vanish because of timeouts (processor overload? network delays? who knows)
all hosts occasionally disappear after a single cluster reboot
eventually used Apache web servers to publish information in the form of Condor ClassAds
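The ClassAds served by those Apache instances are plain attribute = value records that the Condor-G matchmaker can ingest. An illustrative resource ad is sketched below; the attribute values are hypothetical, though `gatekeeper_url` and `Name` correspond to the attributes matched against in the submission file shown later:

```
MyType          = "Machine"
Name            = "ulgbc1.liv.ac.uk"
gatekeeper_url  = "ulgbc1.liv.ac.uk/jobmanager-sge"
OpSys           = "LINUX"
Arch            = "X86_64"
```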
Condor-G pros

familiar and reliable interface for job submission and monitoring
very effective at hiding the Globus middleware layer
meta-scheduling possible through the use of ClassAds
automatic renewal of proxies on remote machines
proxy expiry handled gracefully
workflows can be implemented using DAGMan
nice sysadmin features, e.g. fair-share scheduling, changeable user priorities, accounting
Condor-G cons
the user interface is different from SGE, PBS, etc.
limited file staging facilities
limited reporting of remote job status
the user still has to deal directly with Globus certificates
matchmaking can be slow
Local enhancements to Condor-G
extended resource specifications – e.g. parallel environment, queue
extended file staging
‘Virtual Console’ - streaming of output files from remotely running jobs
reporting of remote job status (e.g. running, idle, error)
modified version of LeSC SGE jobmanager runs on all clusters
web interface
MyProxy server for storage/retrieval of e-Science certificates
automatic proxy certificate renewal using MyProxy server
Specifying extended job attributes

without RSL schema extensions:

    globusrsl = ( environment = (transfer_input_files file1,file2,file3)\
                  (transfer_output_files file4,file5)\
                  (parallel_environment mpi2) )

with RSL schema extensions:

    globusrsl = (transfer_input_files = file1,file2,file3)\
                (transfer_output_files = file4,file5)\
                (parallel_environment = mpi2)

or ...

    globusrsl = (parallel_environment = mpi2)
    transfer_input_files = file1, file2, file3
    transfer_output_files = file4, file5
Typical Condor-G job submission file
    universe = globus
    globusscheduler = $$(gatekeeper_url)
    x509userproxy = /opt2/condor_data/ulgrid/certs/bonarlaw.cred
    requirements = ( TARGET.gatekeeper_url =!= UNDEFINED ) && \
                   ( name == "ulgbc1.liv.ac.uk" )
    output = condori_5e_66_cart.out
    error = condori_5e_66_cart.err
    log = condori_5e_66_cart.log
    executable = condori_5e_66_cart_$$(Name)
    globusrsl = ( input_working_directory = $ENV(PWD) )\
                ( job_name = condori_5e_66_cart )( job_type = script )\
                ( stream_output_files = pcgamess.out )
    transfer_input_files = pcgamess.in
    notification = never
    queue
Architecture diagram: users log in to the Condor-G submit host or the Condor-G portal; the Condor-G central manager matches jobs (via Condor ClassAds) to the NW-GRID cluster (ulgbc3), the CSD-Physics cluster (ulgbc2), the NW-GRID/POL cluster (ulgp4), and the CSD AMD cluster (ulgbc1), with a MyProxy server for credentials and Globus file staging between hosts.
Storage Resource Broker (SRB)

open-source grid middleware developed by the San Diego Supercomputer Center allowing distributed storage of data
absolute filenames reflect the logical structure of data rather than its physical location (unlike NFS)
meta-data allows annotation of files so that results can be searched easily at a later date
high-speed data movement through parallel transfers
several interfaces available: shell (Scommands), Windows GUI (InQ), X Windows GUI, web browser (MySRB); also APIs for C/C++, Java, and Python
provides most of the functionality needed to build a data grid
many other features
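The central idea, logical names decoupled from physical replicas plus searchable meta-data, can be illustrated with a toy catalogue. This is a sketch of the concept only, not the SRB API; all names and the `srb://` path are made up for illustration:

```python
# Toy illustration of an MCAT-style catalogue: logical names map to
# physical replicas, and attribute queries locate files by meta-data.
from dataclasses import dataclass, field

@dataclass
class Entry:
    replicas: list                      # physical locations holding the data
    metadata: dict = field(default_factory=dict)

class Catalogue:
    def __init__(self):
        self.entries = {}               # logical name -> Entry

    def register(self, logical, replicas, **metadata):
        self.entries[logical] = Entry(list(replicas), dict(metadata))

    def locate(self, logical):
        # callers see only the logical name; the catalogue resolves it
        # to a physical replica
        return self.entries[logical].replicas[0]

    def search(self, **criteria):
        # return logical names whose meta-data matches all criteria
        return [name for name, e in self.entries.items()
                if all(e.metadata.get(k) == v for k, v in criteria.items())]

cat = Catalogue()
cat.register("/ulgrid/results/run42.out",
             ["srb://vault1.liv.ac.uk/data/0042"],
             application="pcgamess", molecule="5e_66_cart")
print(cat.search(application="pcgamess"))  # -> ['/ulgrid/results/run42.out']
```

In SRB the same roles are played by the MCAT server (the catalogue) and the data vaults (the replicas), with annotation and search exposed through the Scommands and APIs listed above.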
Architecture diagram: the Condor-G central manager/submit host stages files via Globus to the NW-GRID cluster (ulgbc3), the CSD-Physics cluster (ulgbc2), the NW-GRID/POL cluster (ulgp4), and the CSD AMD cluster (ulgbc1); the SRB MCAT server holds the meta-data while SRB data vaults provide distributed storage for the 'real' data.
Future developments

make increased use of SRB for file staging and archiving of results in ULGrid
expand job submission to other NW-GRID sites (and NGS?)
encourage use of Condor-G for job submission on ULGrid/NW-GRID
incorporate more applications into the portal
publish more information in Condor-G ClassAds
provide better support for long-running jobs via the portal and improved reporting of job status
Further Information
http://www.liv.ac.uk/e-science/ulgrid