Distributed Computing in Practice: The Condor Experience
CS739, 11/4/2013

Todd Tannenbaum
Center for High Throughput Computing
Department of Computer Sciences
University of Wisconsin-Madison
The Condor High Throughput Computing System
“Condor is a high-throughput distributed computing system. Like other batch systems, Condor provides a job management mechanism, scheduling policy, priority scheme, resource monitoring, and resource management….While similar to other conventional batch systems, Condor's novel architecture allows it to perform well in environments where other batch systems are weak: high-throughput computing and opportunistic computing.”
[Word cloud: Grids, Clouds, Map-Reduce, eScience, Cyber Infrastructure, SaaS, HPC, HTC, eResearch, Web Services, Virtual Machines, HTPC, MultiCore, GPUs, HDFS, IaaS, SLA, QoS, Open Source, Green Computing, Master-Worker, Workflows, Cyber Security, High Availability, Workforce, 100Gb]
Definitional Criteria for a Distributed Processing System
- Multiplicity of resources
- Component interconnection
- Unity of control
- System transparency
- Component autonomy
P.H. Enslow and T.G. Saponas, "Distributed and Decentralized Control in Fully Distributed Processing Systems," Technical Report, 1981
A thinker's guide to the most important trends of the new decade
“The goal shouldn't be to eliminate failure; it should be to build a system resilient enough to withstand it”
“The real secret of our success is that we learn from the past, and then we forget it. Unfortunately, we're dangerously close to forgetting the most important lessons of our own history: how to fail gracefully and how to get back on our feet with equal grace.”
"In Defense of Failure" by Megan McArdle, Mar. 11, 2010
Claims for "benefits" provided by Distributed Processing Systems

- High Availability and Reliability
- High System Performance
- Ease of Modular and Incremental Growth
- Automatic Load and Resource Sharing
- Good Response to Temporary Overloads
- Easy Expansion in Capacity and/or Function
P.H. Enslow, “What is a Distributed Data Processing System?” Computer, January 1978
High Throughput Computing (HTC)

- Goal: provide large amounts of fault-tolerant computational power over prolonged periods of time by effectively utilizing all resources
- High Performance Computing (HPC) vs. HTC: FLOPS vs. FLOPY
  - FLOPY != FLOPS * (number of seconds in a year)
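The FLOPY point can be made concrete with a toy calculation (the numbers below are illustrative, not from the talk): operations completed per year depend on availability as much as on raw speed.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600

def flopy(flops, availability):
    """Floating-point operations completed in a year, given sustained
    FLOPS and the fraction of the year the machine actually works
    on the workload."""
    return flops * SECONDS_PER_YEAR * availability

# A 10 GFLOPS machine usable 5% of the year...
fast_but_scarce = flopy(10e9, 0.05)
# ...delivers fewer operations than a 1 GFLOPS machine usable 95% of it.
slow_but_steady = flopy(1e9, 0.95)
print(fast_but_scarce < slow_but_steady)  # True
```

This is why HTC optimizes for sustained throughput over months, not peak speed over seconds.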
Opportunistic Computing

- Ability to use resources whenever they are available, without requiring one hundred percent availability
- Very attractive due to the realities of distributed ownership
- [Related: Volunteer Computing]
Philosophy of Flexibility

- Let communities grow naturally.
  - Build structures that permit but do not require cooperation.
  - Relationships, obligations, and schemata will develop according to user necessity.
- Leave the owner in control.
- Plan without being picky.
  - Better to be flexible than fast!
- Lend and borrow.
Todd's Helpful Tips

- Architect in terms of responsibility instead of expected functionality
  - Delegate like Ronald Reagan. How?
- Plan for failure!
  - Leases everywhere, 2PC, belt and suspenders, ...
- Never underestimate the value of code that has withstood the test of time in the field
- Always keep the end-to-end picture in mind when deciding upon layer functionality
End-to-End Thinking

- Code reuse creates tensions...
- Example: the network layer uses checksums to ensure correctness, yet a link-level checksum alone cannot guarantee end-to-end correctness; only a check at the endpoints can
The Condor Kernel
Step 1: User submits a job
How are jobs described?

- ClassAds and Matchmaking
  - Name-value pairs
  - Values can be literal data or expressions
  - Expressions can refer to the match candidate
  - Requirements and Rank are treated specially
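As a sketch of what this looks like, a job ad and a machine ad each carry Requirements and Rank expressions that refer to attributes of the match candidate (the specific attribute values below are illustrative):

```
# Job ClassAd
MyType = "Job"
Owner = "todd"
ImageSize = 200
Requirements = (Arch == "INTEL") && (OpSys == "LINUX") && (Memory >= ImageSize)
Rank = KFlops

# Machine ClassAd
MyType = "Machine"
Arch = "INTEL"
OpSys = "LINUX"
Memory = 4096
Requirements = (LoadAvg < 0.3) && (KeyboardIdle > 15 * 60)
Rank = (Owner == "todd")
```

During matchmaking, each side's Requirements must evaluate to True against the other ad, and Rank expresses preference among acceptable matches.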
Interesting ClassAd Properties

- Semi-structured data
  - No fixed schema
  - "Let communities grow naturally..."
- Uses three-valued logic
  - Expressions evaluate to True, False, or Undefined
- Bilateral
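A toy sketch of the three-valued logic in illustrative Python (not Condor's implementation): referencing an attribute the other ad never defined yields Undefined rather than an error, and Undefined propagates through && and || unless the other operand already decides the result.

```python
UNDEF = "Undefined"  # stands in for ClassAd's Undefined value

def lookup(ad, name):
    """Referencing a missing attribute yields Undefined, not an error."""
    return ad.get(name, UNDEF)

def tv_and(a, b):
    if a is False or b is False:   # False decides, even against Undefined
        return False
    if a is UNDEF or b is UNDEF:
        return UNDEF
    return True

def tv_or(a, b):
    if a is True or b is True:     # True decides, even against Undefined
        return True
    if a is UNDEF or b is UNDEF:
        return UNDEF
    return False

machine = {"Memory": 4096}         # no "HasJava" attribute: no fixed schema

print(tv_and(machine["Memory"] >= 200, lookup(machine, "HasJava")))  # Undefined
print(tv_or(True, lookup(machine, "HasJava")))                       # True
```

This is what lets ads with no fixed schema still be compared safely.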
Plan for Failure, Lend and Borrow

- What if the schedd crashes during job submission?
  - ARIES-style recovery log
  - Two-phase commit: Prepare, the schedd answers with Job ID 106, then Commit 106

Job queue log on disk:

    Begin Transaction
    105 Owner Todd
    105 Cmd /bin/hi
    ...
    End Transaction
    Begin Transaction
    106 Owner Fred
    106 Cmd /bin/bye

(Job 106's transaction has no End Transaction record yet; if the schedd crashes here, recovery discards it.)
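A minimal sketch of the recovery idea in illustrative Python (the log handling is simplified, not Condor's actual format or code): on restart, replay the log and keep only transactions that reached End Transaction.

```python
def recover(log_lines):
    """Replay a job-queue log; a transaction survives a crash only if
    its 'End Transaction' record made it to disk."""
    jobs, pending = {}, None
    for line in log_lines:
        line = line.strip()
        if line == "Begin Transaction":
            pending = {}
        elif line == "End Transaction":
            if pending is not None:
                for (job_id, attr), val in pending.items():
                    jobs.setdefault(job_id, {})[attr] = val
                pending = None
        elif pending is not None and line:
            job_id, attr, val = line.split(None, 2)
            pending[(job_id, attr)] = val
    return jobs

log = [
    "Begin Transaction",
    "105 Owner Todd",
    "105 Cmd /bin/hi",
    "End Transaction",
    "Begin Transaction",   # crash: job 106 never reached End Transaction
    "106 Owner Fred",
    "106 Cmd /bin/bye",
]
print(recover(log))  # only job 105 is recovered
```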
The Condor Kernel
Step 2: Matchmaking
Matchmaking Protocol

- Each component (Agent, Matchmaker, Resource) is an independent service with a distinct responsibility
- The centralized Matchmaker is very lightweight: after the Match step, it is not involved
- Claims can be reused, delegated, ...

Steps: 1. Advertise, 2. Match, 3. Claim
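The three steps can be sketched in illustrative Python (a toy model, not the real protocol or the ClassAd language; Requirements are modeled as callables over both ads, and the names are hypothetical):

```python
def requirements_met(my_ad, other_ad):
    """Evaluate one side's Requirements against the match candidate."""
    return my_ad["Requirements"](my_ad, other_ad)

def matchmake(job_ads, machine_ads):
    """Step 2: pair up ads whose Requirements both hold. The matchmaker
    hands back the pairs and keeps no further role in the claim."""
    matches, claimed = [], set()
    for job in job_ads:
        for i, machine in enumerate(machine_ads):
            if i in claimed:
                continue
            if requirements_met(job, machine) and requirements_met(machine, job):
                matches.append((job, machine))
                claimed.add(i)
                break
    return matches

# Step 1: advertise
jobs = [{"Owner": "todd", "ImageSize": 200,
         "Requirements": lambda me, other: other["Memory"] >= me["ImageSize"]}]
machines = [{"Name": "vulture", "Memory": 4096,
             "Requirements": lambda me, other: other["ImageSize"] <= me["Memory"]}]

# Step 3: the claim is established directly between the matched pair
for job, machine in matchmake(jobs, machines):
    print(job["Owner"], "claims", machine["Name"])
```

After matchmake returns, the matchmaker holds no state about the pair; the claim is negotiated directly between agent and resource, which is also what makes claims reusable and delegable.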
Federation via Direct Flocking

- Federation many years later was easy because of the clearly defined roles of services
Globus and GRAM

- GRAM = Grid Resource Access and Management
  - "De facto standard" protocol for submission to a site scheduler
- Problems
  - Dropped many end-to-end features, like exit codes!
  - Couples resource allocation and job execution: early binding of a specific job to a specific queue
Solution: Late-binding via GlideIn
Split Execution Environments

- Standard Universe
  - Process Checkpoint/Restart
  - Remote I/O
- Java Universe
Process Checkpointing

- Condor's process checkpointing mechanism saves the entire state of a process into a checkpoint file
  - Memory, CPU, I/O, etc.
- The process can then be restarted from right where it left off
- Typically no changes to your job's source code are needed; however, your job must be relinked with Condor's Standard Universe support library
Relinking Your Job for Standard Universe
To do this, just place condor_compile in front of the command you normally use to link your job:

    % condor_compile gcc -o myjob myjob.c

or:

    % condor_compile f77 -o myjob filea.f fileb.f
When will Condor checkpoint your job?
- Periodically, if desired (for fault tolerance)
- When your job is preempted by a higher-priority job
  - Preempt/Resume scheduling is powerful!
- When your job is vacated because the execution machine becomes busy
Remote System Calls

- I/O system calls are trapped and sent back to the submit machine
- Allows transparent migration across administrative domains
  - Checkpoint on machine A, restart on B
- No source code changes required
- Language independent
- Opportunities for application steering
The Condor Kernel
Remote I/O

[Diagram: condor_schedd and condor_shadow on the submit machine; condor_startd, condor_starter, and the job (linked with the I/O library) on the execute machine; the shadow accesses the file on the job's behalf]
Java Universe Job

    universe = java
    executable = Main.class
    jar_files = MyLibrary.jar
    input = infile
    output = outfile
    arguments = Main 1 2 3
    queue
Why not use Vanilla Universe for Java jobs?
- Java Universe provides more than just inserting "java" at the start of the execute line
  - Knows which machines have a JVM installed
  - Knows the location, version, and performance of the JVM on each machine
  - Can differentiate the JVM exit code from the program exit code
  - Can report Java exceptions
DAGMan

- Directed Acyclic Graph Manager
- DAGMan allows you to specify the dependencies between your Condor jobs, so it can manage them automatically for you.
- Example: "Don't run job B until job A has completed successfully."
What is a DAG?
- A DAG is the data structure used by DAGMan to represent these dependencies.
- Each job is a "node" in the DAG.
- Each node can have any number of "parent" or "child" nodes, as long as there are no loops!

        Job A
       /     \
    Job B   Job C
       \     /
        Job D
Defining a DAG

- A DAG is defined by a .dag file, listing each of its nodes and their dependencies:

      # diamond.dag
      Job A a.sub
      Job B b.sub
      Job C c.sub
      Job D d.sub
      Parent A Child B C
      Parent B C Child D

- Each node will run the Condor job specified by its accompanying Condor submit file
Security

- CEDAR
- Job Sandboxing
Questions?