condor

Condor Distributed parallelized computing Eric Marshall Associate Director for Research Technology Rutgers University

Upload: ericwilliammarshall

Post on 08-Jan-2017

TRANSCRIPT

Page 1: Condor

Condor

Distributed parallelized computing

Eric Marshall

Associate Director for Research Technology

Rutgers University

Page 2: Condor

Office of Instructional and Research Technology

A little background - why use clusters?

• Computing power - more is more

• Omelets - a dozen chicken eggs work as well as one ostrich egg and are cheaper

• Big tasks can run for months

• Moore’s law (doubling the number of transistors on a chip every 24 months) still only nibbles at the Grand Challenges

Page 3: Condor

What is Condor?

• Via the Condor Project home page - Condor is a “specialized workload management system”

• Via Wikipedia - Condor is a “high-throughput computing software framework for coarse grain distributed parallelization of computationally intensive tasks”

Page 4: Condor

What is Condor again?

• Via Eric - Condor is a “way to get work done on computers that no one is using right now”

• Condor is an application that runs your programs on other computers

• Condor is a job scheduler, aka a batch scheduler

Page 5: Condor

Why use a batch scheduler?

• Generally high performance machines are expensive enough that people would like to have them in use all the time

• Generally people have noticed that computers are good at doing mundane tasks like waiting for one program to finish and then starting another

• Generally people use batch schedulers to ensure that a resource, like a cluster, is being fully used without being overwhelmed and without sitting idle, merely generating heat

Page 6: Condor

Condor twist on idle resources

• The clever folks at University of Wisconsin-Madison expanded the notion of idle resources to include all desktop computers that were not in use at the moment

• Like the SETI@home screensaver idea, Condor runs jobs on computers that are otherwise idle

• Once a user moves a mouse or touches a key, Condor gets out of the way and lets the user have the full machine back

Page 7: Condor

How does Condor work? Configure who can submit jobs

Page 8: Condor

How does Condor work? Configure where to run jobs

Page 9: Condor

How does Condor work? Submitting jobs
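
As a rough illustration of the submission step: a user describes the job in a short submit description file and hands that file to condor_submit. The Python helper below only builds the text of such a file. The helper name `make_submit_description`, the program name `analyze`, and the output file names are hypothetical, and the vanilla universe is assumed.

```python
def make_submit_description(executable, arguments, n_jobs, universe="vanilla"):
    """Return the text of a Condor submit description file.

    $(Process) is a Condor macro that expands to 0, 1, ..., n_jobs - 1,
    so each queued job gets its own output and error file.
    """
    if n_jobs < 1:
        raise ValueError("n_jobs must be at least 1")
    lines = [
        f"universe   = {universe}",
        f"executable = {executable}",
        f"arguments  = {arguments}",
        "output     = job.$(Process).out",   # hypothetical file names
        "error      = job.$(Process).err",
        "log        = job.log",
        f"queue {n_jobs}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical program and arguments, for illustration only.
    print(make_submit_description("analyze", "--chunk $(Process)", 10), end="")
```

Saving the generated text to a file and passing that file to condor_submit would place ten copies of the job on the queue, each with a different $(Process) value.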

Page 10: Condor

Condor - more of the big picture

• Developed by the University of Wisconsin-Madison

• Cost: Free (Open source-ish license)

• Overhead: Need a ‘Master server’ and workstations. Each workstation runs a daemon that watches user I/O and CPU load. When a workstation has been idle for two hours, a job from the batch queue is assigned to the workstation and will run until the daemon detects a keystroke, mouse motion, or high non-Condor CPU usage. At that point, the job will be removed from the workstation and placed back on the batch queue.
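
The policy described above can be sketched as a small simulation. This is only a toy model of the behavior on the slide, not Condor's actual daemon code. The class `Workstation`, the function `step`, and the CPU threshold of 0.3 are assumptions made for illustration; the two-hour idle limit comes from the slide.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional

IDLE_LIMIT_SECONDS = 2 * 60 * 60   # two hours, as stated on the slide
CPU_LOAD_LIMIT = 0.3               # assumed threshold for "high" non-Condor load


@dataclass
class Workstation:
    name: str
    idle_seconds: int = 0          # time since last keystroke or mouse motion
    other_cpu_load: float = 0.0    # load from non-Condor processes
    job: Optional[str] = None      # job currently running here, if any

    def owner_is_active(self) -> bool:
        return self.idle_seconds == 0 or self.other_cpu_load > CPU_LOAD_LIMIT


def step(queue: deque, machines: list) -> None:
    """Apply one round of the policy to every workstation."""
    for m in machines:
        if m.job is not None and m.owner_is_active():
            # Owner is back: remove the job and return it to the batch queue.
            queue.append(m.job)
            m.job = None
        elif (m.job is None and queue
              and m.idle_seconds >= IDLE_LIMIT_SECONDS
              and m.other_cpu_load <= CPU_LOAD_LIMIT):
            # Machine has been idle long enough: hand it the next job.
            m.job = queue.popleft()


if __name__ == "__main__":
    queue = deque(["job-1", "job-2"])
    lab_pc = Workstation("lab-pc", idle_seconds=3 * 60 * 60)
    step(queue, [lab_pc])
    print(lab_pc.job, list(queue))   # job-1 ['job-2']
    lab_pc.idle_seconds = 0          # the owner touches the keyboard
    step(queue, [lab_pc])
    print(lab_pc.job, list(queue))   # None ['job-2', 'job-1']
```

Note that the evicted job goes to the back of the queue in this sketch; how a real pool orders returned jobs is a scheduling decision this model does not try to capture.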

Page 11: Condor

Condor - yet more of the big picture

• Condor can run both sequential and parallel jobs. Sequential jobs can be run in several different "universes", including "vanilla", which provides the ability to run most "batch ready" programs, and "standard universe", in which the target application is re-linked with the Condor I/O library, which provides for remote job I/O and job checkpointing.

• Condor supports the standard Message Passing Interface (MPI) and Parallel Virtual Machine (PVM) as well as a Globus module

• Supported on Windows 2000, 2003, XP, Vista

• Supported on Solaris 8, 9, 10

• Supported on Linux Red Hat (7.1 and on), SuSE (8 & 9), Debian 3.1

• Supported on Mac OS X 10.3 and on

• Other supported platforms
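
To make the checkpointing idea mentioned above concrete: in the standard universe, Condor saves a job's state transparently once the program has been re-linked. The Python sketch below shows only the concept, done by hand at the application level, which is roughly what a vanilla-universe job would have to do for itself. The functions `load_checkpoint`, `save_checkpoint`, and `run`, and the idea of a JSON checkpoint file, are hypothetical and are not part of Condor.

```python
import json
import os


def load_checkpoint(path):
    """Return saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_i": 0, "total": 0}


def save_checkpoint(path, state):
    """Write state to a temporary file, then rename it into place,
    so an interruption mid-write never leaves a corrupt checkpoint."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def run(path, n, every=1000, stop_at=None):
    """Sum the integers 0..n-1, checkpointing every `every` steps.

    `stop_at` simulates the job being evicted at that step.
    Returns the total, or None if the job was interrupted.
    """
    state = load_checkpoint(path)
    for i in range(state["next_i"], n):
        if stop_at is not None and i == stop_at:
            return None          # evicted; work since the last checkpoint is lost
        state["total"] += i
        state["next_i"] = i + 1
        if state["next_i"] % every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

A job interrupted partway through can be started again with the same checkpoint path; it resumes from the last saved step instead of from zero, which is the property that makes eviction from a borrowed workstation tolerable.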

Page 12: Condor

Condor pluses

• No need to recompile jobs (recompile for checkpointing, however)

• Users do not have to worry about details, logons, etc. on remote machines

• Respects the owner of the remote system

• Flexible system for matching resources with requests
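
The matching of resources with requests can be pictured as follows. Condor describes both machines and jobs with ClassAds and pairs them up; the Python below is only a toy model of that two-sided idea, not the ClassAd language itself. The functions `matches` and `find_match`, the machine names, and the attribute names and values in the example ads are assumptions chosen for illustration.

```python
def matches(job, machine):
    """Two-sided match: the job's requirements must accept the machine,
    and the machine's requirements must accept the job."""
    return job["requirements"](machine) and machine["requirements"](job)


def find_match(job, machines):
    """Return the acceptable machine the job ranks highest, or None."""
    candidates = [m for m in machines if matches(job, m)]
    if not candidates:
        return None
    return max(candidates, key=job["rank"])


if __name__ == "__main__":
    # Hypothetical machine ads; each machine owner states who may use it.
    machines = [
        {"name": "lab-01", "OpSys": "LINUX", "Memory": 2048,
         "requirements": lambda job: job["owner"] != "guest"},
        {"name": "lab-02", "OpSys": "WINDOWS", "Memory": 4096,
         "requirements": lambda job: True},
        {"name": "lab-03", "OpSys": "LINUX", "Memory": 1024,
         "requirements": lambda job: True},
    ]
    # Hypothetical job ad: needs Linux and at least 1024 MB of memory.
    job = {
        "owner": "alice",
        "requirements": lambda m: m["OpSys"] == "LINUX" and m["Memory"] >= 1024,
        "rank": lambda m: m["Memory"],       # prefer more memory
    }
    print(find_match(job, machines)["name"])  # prints lab-01
```

The point of the two-sided check is that the machine owner's conditions count as much as the job's, which is how the system can respect the owner of the remote machine.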

Page 13: Condor

Condor limitations

• Most platforms other than Linux/Unix only support the “vanilla universe”, which means no checkpointing, so a job sleeps or is killed outright

• “Standard universe” checkpointing cannot handle simple multi-process jobs - no fork(), no exec(), no system()

• No interprocess communication - no pipes, semaphores or shared memory

• No reading or writing of files larger than 2 GB

• Limits on signals, timers and file locks

• Network communication must be brief

Page 14: Condor

Questions?

Website: The Condor Project

http://www.cs.wisc.edu/condor/

Eric Marshall

Office of Instructional and Research Technology

[email protected] 445-2262