
Evaluating Condor for Enterprise Use: A UBS Case Study

April 26, 2006

Gregg Cooke, IT Technical Council

GENERALLY ACCESSIBLE


Overview

Context: Why UBS Uses Grids

Tests: What Did We Look At?

Results: Strengths & Limitations

SECTION 1

The Context: Grids in an Investment Bank


Grids at UBS

Specifically, when we say “grid” we mean a computational cluster
– Condor fits the definition closely

Other terminology:

What do we mean by “grid”?

Condor term        UBS term
Pool               Grid
Job cluster        Job
Job                Task
Virtual machine    Engine or Node
Central Manager    Broker or Manager


Grids at UBS

Complex, long-running calculations include:
– Monte Carlo simulations of risk exposure
– Black-Scholes option valuations on portfolios of stock options
– Valuation of complicated “exotic” financial instruments
Speed of computation directly correlates to volume of sales
Accuracy of risk exposure calculation directly correlates to reserve cash
Calculations constructed by quantitative analysts (“quants”)
– Write code that’s easy to change, not code that’s particularly efficient or parallelized

Why do we use grids?
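As a rough illustration of why such valuations suit a grid, the sketch below prices European calls with the standard Black-Scholes formula and splits a portfolio into independent chunks, one per task. It is not UBS code: the portfolio figures, the `split_into_tasks` and `run_task` helpers, and the chunk size are all hypothetical.

```python
import math


def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def black_scholes_call(spot, strike, rate, vol, years):
    """Price of a European call under the Black-Scholes model."""
    d1 = (math.log(spot / strike) + (rate + 0.5 * vol ** 2) * years) / (
        vol * math.sqrt(years)
    )
    d2 = d1 - vol * math.sqrt(years)
    return spot * norm_cdf(d1) - strike * math.exp(-rate * years) * norm_cdf(d2)


def split_into_tasks(positions, task_size):
    """Cut a portfolio into independent chunks, one chunk per grid task."""
    return [positions[i:i + task_size] for i in range(0, len(positions), task_size)]


def run_task(chunk, rate):
    """Work done by one engine: value every position in its chunk."""
    return sum(qty * black_scholes_call(s, k, rate, v, t) for qty, s, k, v, t in chunk)


# Hypothetical portfolio: (quantity, spot, strike, volatility, years to expiry).
portfolio = [
    (10, 100.0, 100.0, 0.20, 1.0),
    (5, 100.0, 110.0, 0.25, 0.5),
    (20, 50.0, 45.0, 0.30, 2.0),
]

# Here the "engines" run one after another; on a grid each chunk would be a task.
tasks = split_into_tasks(portfolio, task_size=2)
total = sum(run_task(chunk, rate=0.05) for chunk in tasks)
```

Because no chunk depends on another, the same job can be spread over as many engines as there are chunks.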


Current Grid Environment at UBS

10 separate production grids totaling 3000+ engines
– All separate grids…some 60-engine, some 2000-engine
– 1 million tasks per day
Wide variety of platforms, languages, architectures
– C/C++, C#, Java on Windows or Linux
– Service-oriented vs. batch-oriented, embarrassingly parallel vs. workflow
– Rarely any greenfield development
Dedicated deployment & operations teams (“GSD”)
– Straddle the development / operations worlds
– Focused on meeting business SLAs
– Strong drivers of what grid platform we use

How do we build & run our grids?


Typical UBS Grid Environment

[Figure: Trader Desktop, Manager, and Engine-1 … Engine-N, connected by flows labeled job specification, task input, task results, task assignments, and job status.]

Quants
• write the calculations
• part of the business
GSD
• makes app meet SLAs
• faces off with business
Dev
• builds & tests the application
• uses quant code, partners with GSD

SECTION 2

The Tests: Function, not Performance


How to Test Condor?

No performance tests…instead:
Determine the functional limits of Condor
Determine how Condor integrates with existing enterprise systems
Port one or more projects to use Condor and measure:
– Porting effort
– Opportunities for new functionality (and cost of lost functionality)
– Operational impact

Feasibility Study: is Condor suitable for use within our enterprise?


The Tests

Scheduling capabilities
– Various combinations of Requirements, Rank, Start, Suspend, etc. rules
Administrative capabilities
– Features of command line tools, common admin practices
Interaction model
– Integrating Condor with an app: APIs, SOAP interface, command line interface
Robustness and resilience
– Failover options, long-term stability, task retry, realtime reconfiguration, etc.
Usability
– Impact to the user when a Condor engine is installed on their desktop

And…scheduling latency…

We tested the following aspects of Condor:
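The Requirements, Rank, and Start rules named above can be pictured with a toy matchmaker. This is only a sketch of the idea in Python, not Condor's ClassAd language or its negotiation algorithm; the machine attributes, the `start_allowed` policy, and the `match` function are hypothetical.

```python
def start_allowed(machine):
    """Machine-side Start rule (hypothetical): run jobs only after 15 idle minutes."""
    return machine["keyboard_idle_s"] > 900


def match(job, machines):
    """Return the name of the best machine for a job, or None if nothing qualifies."""
    candidates = [
        m for m in machines
        if start_allowed(m) and job["requirements"](m)
    ]
    if not candidates:
        return None
    # Rank orders the machines that passed both filters; the highest value wins.
    return max(candidates, key=job["rank"])["name"]


machines = [
    {"name": "engine-1", "os": "LINUX", "memory_mb": 8192, "keyboard_idle_s": 30},
    {"name": "engine-2", "os": "LINUX", "memory_mb": 4096, "keyboard_idle_s": 1200},
    {"name": "engine-3", "os": "LINUX", "memory_mb": 2048, "keyboard_idle_s": 5000},
    {"name": "engine-4", "os": "WINDOWS", "memory_mb": 16384, "keyboard_idle_s": 5000},
]

job = {
    # Job-side Requirements rule: which machines are acceptable at all.
    "requirements": lambda m: m["os"] == "LINUX" and m["memory_mb"] >= 1024,
    # Job-side Rank rule: among acceptable machines, prefer more memory.
    "rank": lambda m: m["memory_mb"],
}

best = match(job, machines)
```

Here engine-1 has the most memory among the Linux machines but is excluded by the Start rule because its user is active, so the job lands on engine-2.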


Scheduling Latency

Applications may be designed with a given scheduling latency in mind
– We can control how long our code takes…we cannot control the scheduling latency
– Redevelopment is often a major undertaking
We were expecting a very short (100 msec) deterministic scheduling latency
– Condor’s is much longer (1 min or more) and nondeterministic
– Condor does have an alternative (COD) but it changes the expected behavior of the grid
Impact on testing: new set of questions!
– “Does Condor’s scheduling latency present a problem for our applications?”
– “Do we have applications that were not developed with assumptions about the scheduling latency?”
– “Are there other aspects of Condor’s performance that offset the scheduling latency concerns?”
– “Can we measure the performance of our applications on Condor without regard to scheduling latency?”

Definition: the interval between the initial request and when the first engine starts working on your task
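To see why this interval matters for short tasks, here is a back-of-the-envelope model. It assumes the latency is paid once up front, that all engines then start together, and that every task takes the same time. The 0.1 s and 60 s latencies come from the figures above; the 100 tasks, 100 engines, and 10-second task length are hypothetical.

```python
import math


def job_completion_seconds(latency_s, n_tasks, n_engines, task_s):
    """Latency up front, then tasks run in equal-length waves across the engines."""
    waves = math.ceil(n_tasks / n_engines)
    return latency_s + waves * task_s


def latency_share(latency_s, n_tasks, n_engines, task_s):
    """Fraction of the total job time spent waiting for the first engine to start."""
    return latency_s / job_completion_seconds(latency_s, n_tasks, n_engines, task_s)


# Same hypothetical job under the expected latency and the observed one.
expected = latency_share(0.1, n_tasks=100, n_engines=100, task_s=10.0)
observed = latency_share(60.0, n_tasks=100, n_engines=100, task_s=10.0)
```

Under these assumptions the wait is about 1% of the job at 100 msec and about 86% at one minute; the longer the tasks, the smaller that share becomes.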

SECTION 3

The Result: Condor as a Functional Benchmark


What We Love About Condor

Incredibly powerful expression-based scheduling policy
No-impact desktop cycle scavenging
Easy reconfiguration
Anything that can be run from a command line can be a task

Too many to list…here are the top four:

But, Condor has limits too…


What Condor Needs to Better Support UBS

Administrative interface
Code deployment
Scheduling latency
Job submission APIs

Important: remember that these conclusions are only relevant to UBS!

This is only what we found, based on our context…your mileage may vary

We found issues in four key areas:


Administration Interface

What we expected:
– A nice GUI admin console similar to others our operations personnel are familiar with
What we found:
– A rich command-line administration interface, but no GUI
Our conclusion:
– At UBS, Condor will not be used by operations teams that cannot accept a command-line admin interface
– These are usually Windows teams…Unix teams don’t seem to have as much bias
What this means for the Condor community:
– A GUI admin console will make Condor more acceptable to enterprise users
– Web-based is best
– Doesn’t have to be fancy…just needs to be point & click (and stable, of course)
– Work being done at Indiana University on a Condor portal is a start

Our conclusions:


Code Deployment

What we expected:
– Automatic task code deployment done once and refreshed automatically when the grid system senses a change in a central repository
What we found:
– Automatic task code deployment every time a job is submitted
Our conclusion:
– At UBS, Condor causes problems for applications with huge (15Mb+) task codes and short tasks because the network transmission time impacts job completion time
What this means for the Condor community:
– To make Condor more acceptable to enterprise users, task code should be cached at the engines and refreshed only when it changes
– Fortunately, this is being worked on by the Condor Project!
– We’ve watched commercial grid vendors implement this…it is not an easy feature!

Our conclusions:
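The caching behavior asked for above can be sketched as follows. This is one possible approach under stated assumptions, not how Condor or any commercial grid product implements it: the `EngineCodeCache` class, the in-memory `repository`, and the use of a SHA-256 digest to detect changes are all hypothetical.

```python
import hashlib


class EngineCodeCache:
    """Per-engine cache that transfers task code only when its digest changes."""

    def __init__(self, fetch):
        self._fetch = fetch   # callable: name -> bytes (the costly network transfer)
        self._store = {}      # name -> (digest, code bytes)
        self.transfers = 0    # number of times the code was actually fetched

    def get(self, name, expected_digest):
        cached = self._store.get(name)
        if cached is not None and cached[0] == expected_digest:
            return cached[1]  # cache hit: no network transfer
        code = self._fetch(name)
        self.transfers += 1
        self._store[name] = (hashlib.sha256(code).hexdigest(), code)
        return code


# Hypothetical central repository holding the current task code.
repository = {"pricer": b"version 1 of the task code"}


def digest_of(name):
    """Digest the manager would send along with each task assignment."""
    return hashlib.sha256(repository[name]).hexdigest()


cache = EngineCodeCache(fetch=lambda name: repository[name])

# Two jobs against unchanged code: only the first one pays for a transfer.
cache.get("pricer", digest_of("pricer"))
cache.get("pricer", digest_of("pricer"))
transfers_before_change = cache.transfers

# The repository changes, so the next job triggers exactly one refresh.
repository["pricer"] = b"version 2 of the task code"
cache.get("pricer", digest_of("pricer"))
transfers_after_change = cache.transfers
```

Sending only a digest with each task keeps the per-job cost small; the full task code crosses the network once per change instead of once per submission.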


Scheduling Latency

What we expected:
– Negligibly small latency that’s deterministic enough for us to predict job completion times
What we found:
– Latencies that depend on configuration settings and complexity of classads
Our conclusion:
– At UBS, Condor cannot be used for tasks that require less than 3.5 min to complete or where the total job completion time must be easily predictable
– However, even though our highest-value applications require short deterministic scheduling latencies, there are many more lower-value applications that aren’t sensitive to scheduling latency

Our conclusions:


Application Programmer’s Interface

What we expected:
– Nice, well-designed APIs for all our favorite languages
What we found:
– A command line interface and a maturing SOAP interface
Our conclusion:
– Once the SOAP interface matures, UBS programmers will be more amenable to using Condor
What this means for the Condor community:
– Full speed ahead on the SOAP interface!
– Make sure all of the functionality available in the command-line interface is available in the SOAP interface

Our conclusions:


Condor at UBS

Teaching new teams how to grid their applications
– Condor is an excellent exploration and learning environment
– Has already accelerated at least one team
A functional benchmark for all things grid
– Condor is a crucible where new and innovative grid ideas get tried and refined
– Many of these features will prove valuable for commercial vendors to embrace
  – Check-pointing & task migration
  – Expression-based scheduling policy
  – User-centric cycle scavenging
Non-critical batch-oriented applications with standalone or SOAP-enabled service code, with operations teams that don’t mind a command line administration interface
– There are lots and lots of non-critical batch-oriented apps with standalone services
– There are not a lot of operations teams that will tolerate a command line interface…

We will continue to use Condor for:
