
Page 1: Harvesting unused clock cycles with Condor

Ian C. Smith*
*Advanced Research Computing, The University of Liverpool

Page 2: Overview

what is Condor?

High Performance versus High Throughput Computing

Condor fundamentals

setting up and running a Condor pool

the University of Liverpool Condor pool

example applications

Page 3: What is Condor?

a specialized system for delivering High Throughput Computing

a harvester of unused computing resources

developed by the Computer Science Dept at the University of Wisconsin in the late '80s

free and (now) open source software

widely used in academia and increasingly in industry

available for many platforms: Linux, Solaris, AIX, Windows XP/Vista/7, Mac OS

Page 4: HPC vs HTC (1)

High Performance Computing (HPC) delivers large amounts of computing power over relatively short periods of time (peak FLOPS ratings important)

can also provide lots of memory, large amounts of fast (parallel) storage

fairly exotic hardware, may need plenty of TLC

large capital outlay on hardware

need to run specialised parallel (MPI) codes to get the benefit (can run serial codes but these are a poor use of resources)

users run relatively small numbers of parallel jobs

essential for certain time-critical applications

Page 5: HPC vs HTC (2)

High Throughput Computing (HTC) allows many computational tasks to be completed over a long period of time (peak FLOPS ratings not so important)

users are more concerned with running large numbers of jobs over a long time span than with a few short-burst computations

makes use of existing commodity hardware (e.g. desktop PCs)

small capital outlay on hardware possible

generally only limited memory and storage available

mostly aimed at running concurrent serial jobs (although MPI and PVM are supported by Condor)

Page 6: Types of Condor application

typically large numbers of independent calculations ("pleasantly parallel")

data parallel applications – split large datasets into smaller parts and analyse independently, e.g. biological sequence analysis, processing of census data

optimisation problems, e.g. microprocessor design and testing

applications based on Monte Carlo methods, e.g. radiotherapy treatment analysis, epidemiological studies

Pages 7-11: A "typical" Condor pool

[Diagram, built up over five slides: a central manager, a submit host, a submit/execute host and several execute hosts. All hosts send ClassAds to the central manager; the central manager returns match information to the submit and execute hosts; the submit host then sends jobs to the matched execute hosts, and the execute hosts return their output to the submit host.]

Page 12: ClassAds and Matchmaking

ClassAds are a fundamental part of Condor, similar to classified advertisements in a paper

"Job Ads" represent jobs to Condor (similar to "wanted" ads)

"Machine Ads" represent compute resources in a Condor pool (similar to "for sale" ads)

the Condor central manager matches Machine Ads to Job Ads and hence machines to jobs

Job Ads are created using submit description files
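
To make this concrete, here is an invented fragment of each kind of ad (the host name and attribute values are hypothetical, not taken from the slides); the central manager pairs ads whose requirements are satisfied by the other side's attributes:

# Machine Ad fragment, advertised by an execute host
MyType  = "Machine"
Name    = "slot1@pc042.liv.ac.uk"     # hypothetical execute host
Arch    = "Intel"
OpSys   = "WINNT51"
Memory  = 2048                        # MB
KFlops  = 1145000

# Job Ad fragment, generated from a submit description file
MyType       = "Job"
Owner        = "einstein"
Cmd          = "example.exe"
Requirements = ( Arch == "Intel" ) && ( OpSys == "WINNT51" )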

Page 13: Simple submit description file

# simple submit description file
# (anything following a # is a comment and is ignored by Condor)
# this would be used for Windows XP based execute hosts

universe = vanilla
executable = example.exe                                   # what to run
output = stdout.out$(PROCESS)                              # job's standard output
log = mylog.log$(PROCESS)                                  # log job's activities
transfer_input_files = common.txt, myinput$(PROCESS).txt   # input files needed
requirements = ( Arch == "Intel" ) && ( OpSys == "WINNT51" )   # what machines to run on
queue 2                                                    # number of jobs to queue

Page 14: Requirements and Rank

The Requirements expression determines where (and when) a job will run; Rank is used to express a preference, e.g.

Requirements = ( OpSys == "WINNT51" ) && \    # Windows XP OS wanted
               ( Arch == "Intel" ) && \       # Intel/compatible processor
               ( Memory >= 2000 ) && \        # want at least 2 GB memory and
               ( Disk >= 33554432 ) && \      # at least 32 GB of free disk
               ( HAS_MATLAB == TRUE ) && \    # must have MATLAB installed
               ( ( ClockMin > 1020 ) || \     # only run jobs after 5 pm OR ...
                 ( ClockDay == 6 ) || ( ClockDay == 0 ) )   # at weekends

Rank = Kflops   # run on machines with best floating point performance first

Page 15: Job submission and monitoring

[einstein@submit ~]$ condor_submit example.sub
Submitting job(s).
2 job(s) submitted to cluster 100.

[einstein@submit ~]$ condor_q

-- Submitter: submit.chtc.wisc.edu : <128.104.55.9:51883> : submit.chtc.wisc.edu
  ID    OWNER       SUBMITTED     RUN_TIME     ST PRI SIZE CMD
  1.0   sagan       7/22 14:19    172+21:28:36 R  0   22.0 checkprogress.cron
  2.0   heisenberg  1/13 13:59    0+00:00:00   I  0    0.0 env
  3.0   hawking     1/15 19:18    0+04:29:33   R  0    0.0 script.sh
  4.0   hawking     1/15 19:33    0+00:00:00   R  0    0.0 script.sh
  5.0   hawking     1/15 19:33    0+00:00:00   H  0    0.0 script.sh
  6.0   hawking     1/15 19:34    0+00:00:00   R  0    0.0 script.sh
...
 96.0   bohr        4/5  13:46    0+00:00:00   I  0    0.0 c2b_dops.sh
 97.0   bohr        4/5  13:46    0+00:00:00   I  0    0.0 c2b_dops.sh
 98.0   bohr        4/5  13:52    0+00:00:00   I  0    0.0 c2b_dopc.sh
 99.0   bohr        4/5  13:52    0+00:00:00   I  0    0.0 c2b_dopc.sh
100.0   einstein    4/5  13:55    0+00:00:00   I  0    0.0 cosmos

557 jobs; 402 idle, 145 running, 1 held

[einstein@submit ~]$

Page 16: Condor policies

Condor supports a wide range of policies for when to start jobs (a configuration sketch follows this list), e.g.

run jobs only outside office hours

run jobs only if the load average on the host is small and there has been no recent user activity

run jobs at any time on one core (at low priority)

run only jobs submitted by certain users

There is also a wide choice of what to do when a job is about to be interrupted, e.g.

suspend the job for a limited time then let it resume

checkpoint the job and migrate it to another machine

kill off the job immediately
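
A sketch of how such a policy might be expressed in an execute host's configuration; the thresholds below are illustrative assumptions, not actual pool settings:

# condor_config.local on an execute host (illustrative values only)
# start a job only after 5 minutes without keyboard/mouse activity
# and only if the non-Condor load average is low
START    = ( KeyboardIdle > 300 ) && ( (LoadAvg - CondorLoadAvg) < 0.3 )
SUSPEND  = ( KeyboardIdle < 10 )      # the user has returned: suspend the job
CONTINUE = ( KeyboardIdle > 300 )     # resume once the machine is idle again
PREEMPT  = ( Activity == "Suspended" ) && \
           ( (CurrentTime - EnteredCurrentActivity) > 600 )   # evict after 10 min suspended
KILL     = FALSE                      # allow evicted jobs to exit gracefully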

Page 17: UNIX or Windows execute hosts? (1)

UNIX is Condor's natural environment

not widely installed on desktop machines (but depends on institution...)

supports the Condor "standard universe" containing many useful features (a condor_compile sketch follows this list):

checkpointing allows jobs to be migrated from one machine to another without loss of useful work

Remote Procedure Calls give transparent access to files on the submit host

streaming of standard output (stdout) from jobs to the submit host

network filesystems work well, making installation and configuration much easier

leverages the large amount of scientific and engineering code which has been developed under UNIX
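
As a minimal sketch (program name hypothetical): a serial code is relinked against Condor's libraries with condor_compile and then submitted to the standard universe:

% condor_compile gcc -o myprog myprog.c    # relink for the standard universe

# standard universe submit description file
universe   = standard
executable = myprog
output     = myprog.out
log        = myprog.log
queue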

Page 18: UNIX or Windows execute hosts? (2)

Windows is the world's most widely installed OS – a rich source of execute hosts

many commercial 3rd party applications run on Windows

using shared (network) filesystems can be difficult under Condor

only supports the "vanilla" Condor universe:

no checkpointing – evicted jobs may waste a lot of cycles

all input and output files need to be transferred to/from the execute host

output streaming not supported

may be difficult to port "legacy" UNIX codes (although Cygwin and Co-Linux can make life easier)

Windows support from the U-W Condor Team tends to lag behind UNIX

Page 19: Setting up a Condor pool

best to start off small and build the pool up slowly

need to understand Condor fundamentals:

the role of the Condor processes and how they interact (see the configuration sketch below)

the life-cycle of jobs

ClassAds and matchmaking

avoid firewalls if possible (may be easier said than done...)

talk to central IT services (particularly the network and PC teams)

submit hosts may need to be fairly high spec if large numbers of jobs are to be run – ideally want a multi-core/processor machine (quad core at least)

plenty of memory (say 8 GB or more)

large, fast access filestore (e.g. 1 TB RAID)
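
As a minimal configuration sketch of how the process roles are assigned (host name hypothetical): each host's DAEMON_LIST determines whether it acts as central manager, submit host or execute host:

# all hosts point at the central manager
CONDOR_HOST = manager.example.ac.uk

# central manager: runs the matchmaking daemons
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR

# submit host: queues jobs and manages them once matched
DAEMON_LIST = MASTER, SCHEDD

# execute host: advertises its resources and runs jobs
DAEMON_LIST = MASTER, STARTD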

Page 20: Where to go for help

Read The Fine Manual!

log files contain a lot of useful information

take a look at the presentations, tutorials and "how-to recipes" on the Condor website: www.cs.wisc.edu/condor

search the condor-users mailing list archive: lists.cs.wisc.edu/archive/condor-users

subscribe to the condor-users mailing list

join the Campus Grids SIG: wikis.nesc.ac.uk/escinet/Campus_Grids

commercial support is also available (e.g. Cycle Computing)

Page 21: University of Liverpool Condor Pool

contains around 400 machines running the University's Managed Windows Service (currently XP but moving to Windows 7 soon)

most have 2.33 GHz Intel Core 2 processors with 2 GB RAM and an 80 GB disk, configured with two job slots per machine

single submission point for Condor jobs provided by a Sun Solaris V445 SMP server

policy is to run jobs during office hours only after at least 5 minutes of inactivity and with a low load average, and at any time outside office hours

a job will be killed off if it is running when a user logs in to a PC

web interface for specific applications

support for running large numbers of MATLAB jobs

Page 22: Condor service caveats

only suitable for DOS-based applications running in batch mode

no communication between processes possible ("pleasantly parallel" applications only)

statically linked executables work best (although Condor can cope with DLLs)

all files needed by an application must be present on the local disk (cannot access network drives)

shorter jobs are more likely to run to completion (10-20 min seems to work best)

very long running jobs can be accommodated using Condor DAGMan or user-level checkpointing (details available soon on the Condor website)
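
A sketch of the DAGMan approach to long-running work (job and file names hypothetical): the run is split into short sequential segments, each likely to finish before being interrupted, and DAGMan enforces the ordering:

# segments.dag: run three short segments one after another
JOB  seg1  segment1.sub
JOB  seg2  segment2.sub
JOB  seg3  segment3.sub
PARENT seg1 CHILD seg2
PARENT seg2 CHILD seg3

The DAG is submitted with condor_submit_dag segments.dag; if a segment fails, DAGMan writes a rescue DAG so the run can be restarted from the last completed segment.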

Page 23: Running MATLAB jobs under Condor

many users prefer to create applications using MATLAB rather than traditional compiled languages (e.g. FORTRAN, C)

need to create a standalone application from the M-file(s) using the MATLAB compiler (a sketch follows below)

the standalone application can run without a MATLAB license

run-time libraries still need to be accessible to MATLAB jobs

nearly all toolbox functions are available to standalone applications

simple (but powerful) file I/O makes checkpointing easier

see the Liverpool Condor website for more information
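
A minimal sketch of the compilation and submission steps (file names hypothetical); mcc -m builds a standalone executable from an M-file:

% mcc -m my_analysis.m      # produces the standalone executable

# vanilla universe submit file fragment for the compiled job
universe   = vanilla
executable = my_analysis.exe
transfer_input_files = my_input.mat
queue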

Page 24: Power-saving and Green IT at Liverpool

we have around 2 000 centrally managed classroom PCs across campus which were previously powered up overnight, at weekends and during vacations

the original power-saving policy was to power off machines after 30 minutes of inactivity; we now hibernate them after 15 minutes of inactivity

this policy has reduced wasteful inactivity time by ~200 000 - 250 000 hours per week (equivalent to 20-25 MWh), leading to an estimated saving of approx. £125 000 p.a.

3rd party power management software (PowerMAN) prevents machines from hibernating whilst Condor jobs are running

Condor's own power management features allow machines to be woken up automatically according to demand

Page 25: Condor-G and Grid Computing

Condor-G is an extension to Condor allowing job submission to remote resources using Globus (a submit file sketch follows below)

provides a familiar Condor-like interface to users, hiding the underlying middleware complexity

we have used Condor-G to give users grid access to a variety of HPC resources:

local HPC clusters (UL-Grid)

NW-Grid resources at Daresbury Lab, Lancaster and Manchester

National Grid Service facilities

Grid Computing Server tools provide a batch environment similar to that of cluster systems (e.g. Sun Grid Engine)

a web portal removes the need for command line use completely
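
An illustrative Condor-G submit description file (the gatekeeper address is hypothetical); the grid universe and grid_resource line direct the job to a remote Globus resource:

universe      = grid
grid_resource = gt2 grid-node.example.ac.uk/jobmanager-pbs   # hypothetical Globus gatekeeper
executable    = analyse
output        = analyse.out
log           = analyse.log
queue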

Page 26: Radiotherapy example

a 3D model of normal tissue was developed in which complications are generated when 'irradiated' [1]

the aim is to provide insight into the connection between dose-distribution characteristics, different organ architectures and complication rates, beyond that of analytical methods

code written in MATLAB and compiled into a standalone executable

a set of 800 simulations took ~36 hours to run on the Condor pool but would require 4-5 months of computing time on a single PC

several dozen sets of simulations have since been completed

[1] Rutkowska E., Baker C.R. and Nahum A.E. Mechanistic simulation of normal-tissue damage in radiotherapy - implications for dose-volume analyses. Phys. Med. Biol. 55 (2010) 2121-2136.

Page 27: Personalised Medicine example

the project is a Genome-Wide Association Study which aims to identify genetic predictors of response to anti-epileptic drugs

tries to identify regions of the human genome that differ between individuals (referred to as SNPs)

800 patients genotyped at 500 000 SNPs along the entire genome

test statistically the association between SNPs and outcomes (e.g. time to withdrawal of a drug due to adverse effects)

a very large data-parallel problem – ideal for Condor

divide the datasets into small partitions so that individual jobs run for 15-30 minutes

a batch of 26 chromosomes (2 600 jobs) required ~5 hours of compute time on Condor but ~5 weeks on a single PC

Page 28: Epidemiology example

researchers have simulated the consequences of an incursion of H5N1 avian influenza into the British poultry flock [2]

a Monte Carlo type method – highly parallel

original code written in MATLAB and compiled into a standalone application

individual simulations take only 10-15 minutes to run – ideal for Condor

requires ~10 000 - 20 000 simulations per scenario

would have needed several years of compute time on a single machine; on Condor it needed a few weeks

[2] Sharkey K.J., Bowers R.G., Morgan K.L., Robinson S.E. and Christley R.M. Epidemiological consequences of an incursion of highly pathogenic H5N1 avian influenza into the British poultry flock. Proc. R. Soc. B 2008 275, 19-28.

Page 29: Further Information

http://www.liv.ac.uk/e-science/

[email protected]