
Condor Project, Computer Sciences Department, University of Wisconsin-Madison

What’s new in Condor? What’s coming up?

Condor Week 2009


Release Situation

› Stable Series
Current: Condor v7.2.2 (April 14, 2009)
Last year: Condor v7.0.1 (Feb 27, 2008)

› Development Series
Current: Condor v7.3.0 (Feb 24, 2009)
• v7.3.1 “any day”
Last year: Condor v7.1.0 (April 1, 2008)

› How long is development taking?
v6.9 series: ~18 months
v7.1 series: ~12 months
v7.3 series: plan says done in July 2009


New Ports In 7.2.0 and Beyond

› Full ports: Debian 5.0 x86 & x86_64

› Also added condor_compile support for gfortran


Big new goodies in v7.0

› Virtual Machine Universe

› Scalability Improvements

› GCB Improvements

› Privilege Separation

› New Quill

› “Crondor”


Big new goodies in v7.2

› Job Router

› Startd and Job Router hooks

› DAGMan tagging and splicing

› Green Computing started

› GLEXEC

› Concurrency Limits


Job Router

› Automated way to let jobs run on a wider array of resources
Transform jobs into different forms
Reroute jobs to different destinations


What is “job routing”?


Original (vanilla) job:

Universe = "vanilla"
Executable = "sim"
Arguments = "seed=345"
Output = "stdout.345"
Error = "stderr.345"
ShouldTransferFiles = True
WhenToTransferOutput = "ON_EXIT"

Routed (grid) job:

Universe = "grid"
GridType = "gt2"
GridResource = "cmsgrid01.hep.wisc.edu/jobmanager-condor"
Executable = "sim"
Arguments = "seed=345"
Output = "stdout"
Error = "stderr"
ShouldTransferFiles = True
WhenToTransferOutput = "ON_EXIT"

[Figure: the JobRouter consults a routing table (Site 1, Site 2, …), transforms the original (vanilla) job into the routed (grid) job, and feeds the final status back to the original job.]


Routing is just site-level matchmaking

› With feedback from job queue
• number of jobs currently routed to site X
• number of idle jobs routed to site X
• rate of recent success/failure at site X

› And with power to modify job ad
• change attribute values (e.g. Universe)
• insert new attributes (e.g. GridResource)
• add a “portal” grid proxy if desired
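For reference, a minimal sketch of what one routing-table entry could look like in condor_config; the route name and the MaxIdleJobs throttle are invented for illustration, and the destination reuses the gatekeeper from the previous slide:

JOB_ROUTER_ENTRIES = \
   [ name = "Site 1"; \
     GridResource = "gt2 cmsgrid01.hep.wisc.edu/jobmanager-condor"; \
     MaxIdleJobs = 10; \
   ]

The GridResource given in a route is what ends up in the routed copy of the job, which is roughly how the vanilla ad above becomes the grid ad.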


Startd Job Hooks

› Users wanted to take advantage of Condor’s resource management daemon (condor_startd) to run jobs, but they had their own scheduling system.
Specialized scheduling needs
Jobs live in their own database or other storage rather than a Condor job queue


Job Router Hooks

› Truly transform jobs, not just reroute them
E.g. stuff a job into a virtual machine (either VM universe or Amazon EC2)

› Hooks invoked like startd ones
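As a rough sketch of the wiring in condor_config (the keyword and script path below are hypothetical):

JOB_ROUTER_HOOK_KEYWORD = VMIFY
# Receives the candidate job ad and emits the transformed (routed) ad
VMIFY_HOOK_TRANSLATE_JOB = /usr/local/libexec/vmify_translate_job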


Our solution

› Make a system of generic “hooks” that you can plug into:
A hook is a point during the life-cycle of a job where the Condor daemons will invoke an external program
Hook Condor to your existing job management system without modifying the Condor code
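A minimal sketch of the startd side of this in condor_config; the keyword and script paths are made up:

# Which set of hooks this machine uses
STARTD_JOB_HOOK_KEYWORD = MYSCHED
# Asked periodically for work; prints a job ClassAd on stdout when there is some
MYSCHED_HOOK_FETCH_WORK = /usr/local/libexec/mysched_fetch_work
# Invoked when the claim is evicted, so the external system can react
MYSCHED_HOOK_EVICT_CLAIM = /usr/local/libexec/mysched_evict_claim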


DAGMan Depth First Example
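A one-line sketch of how depth-first node submission is typically requested, assuming it is controlled by the DAGMAN_SUBMIT_DEPTH_FIRST configuration knob:

# Submit ready nodes depth-first instead of the default breadth-first order
DAGMAN_SUBMIT_DEPTH_FIRST = True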


Category Example

[Figure: a DAG with a Setup node and a Cleanup node; in between, three “Big job” nodes each fan out to three “Small job” nodes. The big jobs are throttled with “Run <= 2” and the small jobs with “Run <= 5”.]
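A sketch of the DAG-file lines behind throttles like those; the node and category names are invented:

# Assign each node to a throttling category
CATEGORY BigJob1   BigJobs
CATEGORY BigJob2   BigJobs
CATEGORY BigJob3   BigJobs
CATEGORY SmallJob1 SmallJobs
CATEGORY SmallJob2 SmallJobs
# ...and so on for the remaining small-job nodes

# At most 2 big jobs and 5 small jobs run at once
MAXJOBS BigJobs   2
MAXJOBS SmallJobs 5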


DAGMan Splicing

[Figure: a top-level DAG with nodes A and B; three copies of a diamond-shaped DAG are spliced in as X, Y, and Z, producing nodes X+A…X+D, Y+A…Y+D, and Z+A…Z+D between A and B.]

# Example Use Case
JOB A A.sub
JOB B B.sub
SPLICE X diamond.dag
SPLICE Y diamond.dag
SPLICE Z diamond.dag
PARENT A CHILD X Y Z
PARENT X Y Z CHILD B
# Notice scoping of node!

Splicing creates one “in memory” DAG. No subdags means no extra condor_dagmans.


Green Computing

› The startd has the ability to place a machine into a low power state (Standby, Hibernate, Soft-Off, etc.)
Controlled by HIBERNATE and HIBERNATE_CHECK_INTERVAL
If all slots return non-zero, then the machine is powered down; otherwise it continues running.
› Machine ClassAd contains all information required for a client to wake it up
Condor can wake it up; there is also a standalone tool. This was NOT as easy as it should be.
› Machines in “Offline State”
Stored persistently to disk
Lots of other uses
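A minimal sketch of a hibernation policy in condor_config; the 20-minute threshold and the "RAM" (suspend-to-RAM) target are illustrative choices, not defaults:

# How often (in seconds) each slot re-evaluates HIBERNATE
HIBERNATE_CHECK_INTERVAL = 300
# Sleep after 20 unclaimed, idle minutes; otherwise report "NONE" (stay awake)
ShouldHibernate = ( (State == "Unclaimed") && (Activity == "Idle") && \
                    ((CurrentTime - EnteredCurrentActivity) > 1200) )
HIBERNATE = ifThenElse( $(ShouldHibernate), "RAM", "NONE" )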


Concurrency Limits

› Limit job execution based on admin-defined consumable resources
E.g. licenses

› Can have many different limits

› Jobs say what resources they need

› Negotiator enforces limits pool-wide


Concurrency Example

› Negotiator config file
MATLAB_LIMIT = 5
NFS_LIMIT = 20

› Job submit file
concurrency_limits = matlab,nfs:3

This requests 1 Matlab token and 3 NFS tokens


Other goodies in v7.2

› ALLOW/DENY_CLIENT

› Job queue backup on local disk

› PREEMPTION_REQUIREMENTS and RANK can reference additional attributes in the negotiator about group resource usage

› A start on dynamic provisioning in the startd

› $$([])


Dynamic Slot Partitioning

› Divide slots into chunks sized for matched jobs

› Readvertise remaining resources

› Partitionable resources are cpus, memory, and disk

› See Matt Farrellee’s talk
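A sketch of how a partitionable slot might be configured and requested; the knob and submit-command names below reflect how the feature later shipped, so treat them as assumptions here:

# condor_config on the execute node: one big partitionable slot
SLOT_TYPE_1               = cpus=100%, memory=100%, disk=100%
SLOT_TYPE_1_PARTITIONABLE = TRUE
NUM_SLOTS_TYPE_1          = 1
NUM_SLOTS                 = 1

# submit file: the job states the chunk it needs
# (request_memory is in MB, request_disk in KB)
request_cpus   = 1
request_memory = 1024
request_disk   = 2097152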


Dynamic Partitioning Caveats

› Cannot preempt original slot or group of sub-slots
Potential starvation of jobs with large resource requirements

› Partitioning happens once per slot each negotiation cycle
Scheduling of large slots may be slow


New Variable Substitution

› $$(Foo) in submit file
Existing feature
Attribute Foo from machine ad substituted

› $$([Memory * 0.9]) in submit file
New feature
Expression is evaluated and then substituted
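A short sketch of the new form in a submit file; the executable and flag name are invented:

# Pass roughly 90% of the matched machine's memory (in MB) to the program
executable = analyze
arguments  = --max-mem=$$([Memory * 0.9])
output     = analyze.out
error      = analyze.err
queue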


More Info For Preemption

› New attributes for these preemption expressions in the negotiator…
PREEMPTION_REQUIREMENTS
PREEMPTION_RANK

› Used for controlling preemption due to user priorities
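Purely as an illustration (the group-usage attribute names here are assumptions, not taken from the slides), such a policy might look like:

# Preempt only when the incoming user has a meaningfully better (lower) priority
# and the incoming user's group is still under its quota
PREEMPTION_REQUIREMENTS = ( RemoteUserPrio > SubmitterUserPrio * 1.2 ) && \
                          ( SubmitterGroupResourcesInUse < SubmitterGroupQuota )
# Among preemptable machines, prefer the ones running the worst-priority user
PREEMPTION_RANK = RemoteUserPrio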


Right then.

What about v7.3.x and beyond?

Terms of License

Any and all dates in these slides are relative from a date hereby unspecified in the event of a likely situation involving a frequent condition. Viewing, use, reproduction, display, modification and redistribution of these slides, with or without modification, in source and binary forms, is permitted only after a deposit by said user into PayPal accounts registered to Todd Tannenbaum ….


Some tasty dishes cooking in the Condor kitchen

Special guest, Julia Child!


Already served (leftovers)

› CCB – Condor Connection Broker
Dan Bradley’s presentation

› Bring checkpoint/restart to Vanilla jobs
Pete Keller’s presentation re DMTCP

› Asynch notification of events to fill a hole in Condor’s web service API
Jungha Woo’s presentation

› Grid Universe improvements
Xin Zhao’s presentation


Data “Drinks”

Wando Fishbowl Anyone?


Condor + Hadoop FS!

› Lots of hard work by Faisal Khan
› Motivation
Condor + HDFS = 2 + 2 = 5 !!! A synergy exists (next slide)
• Hadoop as distributed storage system
• Condor as cluster management system

Large number of distributed disks in a compute cluster

Managing disk as a resource


Condor + HDFS

› Dhruba Borthakur’s talk

› Synergy
Condor knows a lot about its cluster

• Capability of individual machines in terms of available memory, CPU load, disk space etc.

• Availability of JRE (Java Universe)

Condor can easily automate housekeeping jobs, e.g.
• rebalancing data blocks
• implementing user file quotas


Condor + HDFS

› Synergy
Failover

• High availability daemon in Condor ClassAds

• Let clients know the current IP of name server

• Heartbeat


condor_hdfs daemon

› Main integration point of HDFS within Condor

› Configures HDFS cluster based on existing condor_config files

› Runs under condor_master and can be controlled by existing Condor utilities

› Publishes interesting parameters to the Collector, e.g. IP address, node type, disk activity

› Currently deployed at UW-Madison
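A sketch of what running it under the master might look like; the daemon name and the node-type knob are assumptions about this pre-release feature:

# Let condor_master start and supervise the HDFS daemon alongside the others
DAEMON_LIST = $(DAEMON_LIST), HDFS
# Hypothetical knob marking this machine as an HDFS data node rather than the name node
HDFS_NODETYPE = HDFS_DATANODE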


Condor + HDFS : Next Steps

› FileNode Failover

› Block placement policies & management

› Thinking about how Condor can steer jobs to the data
Via a ClassAd function used in the RANK expression?

› Integrate with File Transfer Mechanism…


More Job Sandbox Options

› Condor’s File Transfer mechanism
Currently moves files between submit and execute hosts (shadow and starter).
Next: Files can have URLs (see the sketch below)
• HTTP
• HDFS
How about Condor’s SPOOL?

› Need to schedule movement?
New Stork
Mehmet Balman’s presentation
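A sketch of how URL-valued inputs might eventually look in a submit file; the syntax was still being designed at this point and the URL is a placeholder:

universe                = vanilla
executable              = sim
# Mix a URL-backed input with an ordinary local file
transfer_input_files    = http://www.example.com/dataset.tar.gz, params.txt
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
queue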


Virtual Meatchine Dishes


Virtual Machine Sandboxing

› We have the Virtual Machine Universe…
Great for provisioning
Nitin Narkhede’s presentation

› … and now we are exploring different mechanisms to run a job inside a VM.

› Benefits
Isolate the job from execute host.
Stage custom execution environments.
Sandbox and control the job execution.


One way to do it – via the Condor Job Router

› Hard work by Varghese Mathew

› Ordinary Jobs & VM Universe Jobs.

› Job router – transform a job into a new form.

› Job router hook picks them up, sets them up inside a VM job, and submits the VM job.

› On completion, the job router hook extracts output from the VM and returns it to the original job.


Different Flavors

› Script Inside VM

› Starter Inside VM

› Personal Condor Inside VM

› VM joins the pool as an execute node

› All different ways to bind a job to a specific virtual machine.


Speaking of VM Universe…

› Adding VM Universe support for:
VMWare Server 2.x
KVM
• Done via libvirt
• Future VM systems added to libvirt should be easy to add
VMWare ESX, ESXi
(see the configuration sketch after this list)

› Thank you community for contributions!
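A minimal sketch of enabling the VM universe on an execute node; the values below (KVM, 1 GB cap, two concurrent VMs) are illustrative, not recommendations:

# Hypervisor that the VM universe should drive on this machine
VM_TYPE = kvm
# Largest memory (in MB) a single VM universe job may use here
VM_MEMORY = 1024
# How many VM universe jobs may run on this machine at once
VM_MAX_NUMBER = 2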


“Lightweight Jobs” Salad


Fast, quick, light jobs

› Options to put a Condor job on a diet
› Diet ideas:

Leave the luggage at home! No job file sandbox, everything in the job ad.

Don’t pay for strong semantic guarantees if you don’t need ’em. Define expectations on entry, update, completion.

› Want to honor scheduling policy, however.


Some small side dishes

Julia, a spy who really knew her eggs…


› Non-blocking communication via threads
Refer to Dan/Igor’s talk
Especially all the security session roundtrips
The USCMS scalability testbed needs 70 collectors to support ~20k dynamic machines; replaced with 1 collector w/ threading code. 70:1, baby!!!!!

› Configuration knob management
Think of about:config in Firefox
Hard-coded configurations now possible

› Nested groups


Mmmm, tasty Condor Wiki


Scheduling Dessert

Pabst and Jack, a dessert favorite!


Back to Green Computing

› The startd has the ability to place a machine into a low power state. (Standby, Hibernate, Soft-Off, etc.).

› Machine ClassAd contains all information required for a client to wake it up

› Machines in “Offline State”
Stored persistently to disk

› NOW… have the matchmaker publish “match pressure” into these offline ads, enabling policies for auto-wakeup


Scheduling in Condor Today

[Figure: central managers (CM), multiple schedds, and many startds making up the pool.]

› Distributed Ownership
› Settings reflect 3 separate viewpoints:

Pool manager, Resource Owner, Job Submitter


But some sites want to use Condor like this:

[Figure: a single schedd submitting to many startds.]

› Just one submission point (schedd)
› All resources owned by one entity
› We can do better for these sites.
Policy configurations are complicated.
Some useful policies are not present because they are hard to do in a wide-area distributed system.
Today the dedicated “scheduler” only supports FIFO and a naive Best Fit algorithm.


So what to do?

[Figure: the same single-schedd, many-startd setup.]

› Give the schedd more scheduling options.
Examples: why can’t the schedd do priority preemption without the matchmaker’s help? Or move jobs from slow to fast claimed resources?


Thank you to an awesome community!!!