Evaluating Condor for Enterprise Use: A UBS Case Study
April 26, 2006
Gregg Cooke, IT Technical Council
GENERALLY ACCESSIBLE
Overview
Context: Why UBS Uses Grids
Tests: What Did We Look At?
Results: Strengths & Limitations
SECTION 1
The Context: Grids in an Investment Bank
Grids at UBS
What do we mean by “grid”?
Specifically, when we say “grid” we mean a computational cluster
  – Condor fits the definition closely
Other terminology:
  Condor term        UBS term
  Pool               Grid
  Job cluster        Job
  Job                Task
  Virtual machine    Engine or Node
  Central Manager    Broker or Manager
Grids at UBS
Why do we use grids?
Complex, long-running calculations include:
  – Monte Carlo simulations of risk exposure
  – Black-Scholes option valuations on portfolios of stock options
  – Valuation of complicated “exotic” financial instruments
Speed of computation directly correlates to volume of sales
Accuracy of risk exposure calculation directly correlates to reserve cash
Calculations constructed by quantitative analysts (“quants”)
  – Write code that’s easy to change, not code that’s particularly efficient or parallelized
Current Grid Environment at UBS
How do we build & run our grids?
10 separate production grids totaling 3000+ engines
  – All separate grids…some 60-engine, some 2000-engine
  – 1 million tasks per day
Wide variety of platforms, languages, architectures
  – C/C++, C#, Java on Windows or Linux
  – Service-oriented vs. batch-oriented, embarrassingly parallel vs. workflow
  – Rarely any greenfield development
Dedicated deployment & operations teams (“GSD”)
  – Straddle the development / operations worlds
  – Focused on meeting business SLAs
  – Strong drivers of what grid platform we use
Typical UBS Grid Environment
[Diagram: Trader Desktop, Manager, and Engine-1 through Engine-N, connected by flows labeled job specification, job status, task assignments, task input, and task results]
Quants
  • Write the calculations
  • Part of the business
GSD
  • Makes app meet SLAs
  • Faces off with business
Dev
  • Builds & tests the application
  • Uses quant code, partners with GSD
SECTION 2
The Tests: Function, not Performance
How to Test Condor?
Feasibility Study: is Condor suitable for use within our enterprise?
No performance tests…instead:
Determine the functional limits of Condor
Determine how Condor integrates with existing enterprise systems
Port one or more projects to use Condor and measure:
  – Porting effort
  – Opportunities for new functionality (and cost of lost functionality)
  – Operational impact
The Tests
We tested the following aspects of Condor:
Scheduling capabilities
  – Various combinations of Requirements, Rank, Start, Suspend, etc. rules
Administrative capabilities
  – Features of command line tools, common admin practices
Interaction model
  – Integrating Condor with an app: APIs, SOAP interface, command line interface
Robustness and resilience
  – Failover options, long-term stability, task retry, realtime reconfiguration, etc.
Usability
  – Impact to the user when a Condor engine is installed on their desktop
And…scheduling latency…
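The scheduling tests exercise Condor's expression-based policy: a job's Requirements expression filters candidate engines, its Rank expression orders the ones that remain, and an engine's Start expression decides whether it is willing to run the job. The following is a minimal Python sketch of that matching idea only. It is not Condor's ClassAd language or its negotiator, and the attribute names (`os`, `memory_mb`, `mips`), the `match` helper, and the sample engines are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Engine:
    name: str
    os: str
    memory_mb: int
    mips: int
    # Engine-side policy (analogous to Start): will this engine run the job?
    start: Callable[["Job"], bool] = lambda job: True


@dataclass
class Job:
    owner: str
    # Job-side policy: which engines are acceptable, and which are preferred?
    requirements: Callable[[Engine], bool]
    rank: Callable[[Engine], float]


def match(job: Job, engines: Sequence[Engine]) -> Optional[Engine]:
    """Return the highest-ranked engine acceptable to both sides, or None."""
    candidates = [e for e in engines if job.requirements(e) and e.start(job)]
    return max(candidates, key=job.rank, default=None)


# Hypothetical pool, for illustration only.
engines = [
    Engine("engine-1", "LINUX", 2048, 900),
    Engine("engine-2", "LINUX", 4096, 1200),
    Engine("engine-3", "WINDOWS", 8192, 1500),
    # A desktop that only accepts jobs from one owner (made-up policy).
    Engine("desktop-7", "LINUX", 4096, 2000,
           start=lambda job: job.owner == "risk"),
]

job = Job(
    owner="quant",
    requirements=lambda e: e.os == "LINUX" and e.memory_mb >= 4096,
    rank=lambda e: e.mips,
)

best = match(job, engines)
print(best.name if best else "no match")  # engine-2
```

Here `desktop-7` has the highest rank but refuses the job through its own policy, so the match falls to `engine-2`. Testing "various combinations" of rules amounts to varying these predicates on both sides and checking which engine is chosen.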
Scheduling Latency
Definition: the interval between the initial request and when the first engine starts working on your task
Applications may be designed with a given scheduling latency in mind
  – We can control how long our code takes…we cannot control the scheduling latency
  – Redevelopment is often a major undertaking
We were expecting a very short (100 ms) deterministic scheduling latency
  – Condor’s is much longer (1 min or more) and nondeterministic
  – Condor does have an alternative, Computing on Demand (COD), but it changes the expected behavior of the grid
Impact on testing: new set of questions!
  – “Does Condor’s scheduling latency present a problem for our applications?”
  – “Do we have applications that were not developed with assumptions about the scheduling latency?”
  – “Are there other aspects of Condor’s performance that offset the scheduling latency concerns?”
  – “Can we measure the performance of our applications on Condor without regard to scheduling latency?”
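Using the slide's definition, scheduling latency can be computed from two timestamps per task: when the request was made and when an engine began work. The sketch below assumes such timestamps are available in some log. The `TaskEvent` record, its field names, and the sample values are hypothetical and are not measurements from the UBS tests.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class TaskEvent:
    job_id: str
    submitted_at: float  # seconds: when the request was made
    started_at: float    # seconds: when an engine began executing the task


def scheduling_latency(events):
    """Per job: earliest task start minus earliest submission time."""
    by_job = {}
    for ev in events:
        sub, start = by_job.get(ev.job_id, (ev.submitted_at, ev.started_at))
        by_job[ev.job_id] = (min(sub, ev.submitted_at),
                             min(start, ev.started_at))
    return {job: start - sub for job, (sub, start) in by_job.items()}


# Made-up events, for illustration only.
events = [
    TaskEvent("A", 0.0, 62.0),
    TaskEvent("A", 0.0, 65.0),
    TaskEvent("B", 10.0, 95.5),
]

latencies = scheduling_latency(events)
print(latencies)
print("mean:", mean(latencies.values()),
      "spread:", pstdev(latencies.values()))
```

The mean addresses the "how long" question, and the spread is one simple way to quantify how nondeterministic the latency is.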
SECTION 3
The Result: Condor as a Functional Benchmark
What We Love About Condor
Too many to list…here are the top four:
Incredibly powerful expression-based scheduling policy
No-impact desktop cycle scavenging
Easy reconfiguration
Anything that can be run from a command line can be a task
But, Condor has limits too…
What Condor Needs to Better Support UBS
We found issues in four key areas:
Administrative interface
Code deployment
Scheduling latency
Job submission APIs
Important: remember that these conclusions are only relevant to UBS!
This is only what we found, based on our context…your mileage may vary
Administration Interface
Our conclusions:
What we expected:
  – A nice GUI admin console similar to others our operations personnel are familiar with
What we found:
  – A rich command-line administration interface, but no GUI
Our conclusion:
  – At UBS, Condor will not be used by operations teams that cannot accept a command-line admin interface
  – These are usually Windows teams…Unix teams don’t seem to have as much bias
What this means for the Condor community:
  – A GUI admin console will make Condor more acceptable to enterprise users
  – Web-based is best
  – Doesn’t have to be fancy…just needs to be point & click (and stable, of course)
  – Work being done at Indiana University on a Condor portal is a start
Code Deployment
Our conclusions:
What we expected:
  – Automatic task code deployment done once and refreshed automatically when the grid system senses a change in a central repository
What we found:
  – Automatic task code deployment every time a job is submitted
Our conclusion:
  – At UBS, Condor causes problems for applications with huge (15Mb+) task codes and short tasks, because the network transmission time impacts job completion time
What this means for the Condor community:
  – To make Condor more acceptable to enterprise users, task code should be cached at the engines and refreshed only when it changes
  – Fortunately, this is being worked on by the Condor Project!
  – We’ve watched commercial grid vendors implement this…it is not an easy feature!
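The "cache at the engines, refresh only on change" behavior can be sketched with a content hash as the cache key, so an engine fetches task code over the network only when it sees a digest it does not already hold. This is an illustration of the idea under that assumption, not how Condor or any commercial grid product implements it. `EngineCodeCache`, `publish`, and the in-memory `repository` are hypothetical names.

```python
import hashlib


class EngineCodeCache:
    """Engine-side cache keyed by the SHA-256 digest of the task code."""

    def __init__(self, fetch):
        self._fetch = fetch   # callable: digest -> bytes (the network transfer)
        self._store = {}      # digest -> code bytes held locally
        self.transfers = 0    # number of times code was actually fetched

    def get(self, digest):
        if digest not in self._store:
            code = self._fetch(digest)
            # Verify that what arrived is what the job asked for.
            if hashlib.sha256(code).hexdigest() != digest:
                raise ValueError("task code does not match its digest")
            self._store[digest] = code
            self.transfers += 1
        return self._store[digest]


# Stand-in for the central repository (hypothetical).
repository = {}


def publish(code: bytes) -> str:
    """Store a code version centrally and return its digest."""
    digest = hashlib.sha256(code).hexdigest()
    repository[digest] = code
    return digest


cache = EngineCodeCache(fetch=repository.__getitem__)

v1 = publish(b"pricing library v1")
for _ in range(1000):      # 1000 short tasks using the same code
    cache.get(v1)

v2 = publish(b"pricing library v2")
cache.get(v2)              # the code changed, so one more transfer

print(cache.transfers)     # 2
```

With per-submission deployment the same workload would move the code 1001 times. Keying on the digest means a change in the repository is detected without a separate invalidation protocol, though a real implementation would also need eviction and concurrent-fetch handling.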
Scheduling Latency
Our conclusions:
What we expected:
  – Negligibly small latency that’s deterministic enough for us to predict job completion times
What we found:
  – Latencies that depend on configuration settings and complexity of ClassAds
Our conclusion:
  – At UBS, Condor cannot be used for tasks that require less than 3.5 min to complete or where the total job completion time must be easily predictable
  – However, even though our highest-value applications require short deterministic scheduling latencies, there are many more lower-value applications that aren’t sensitive to scheduling latency
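Why short tasks suffer most can be seen from a simple ratio: the share of elapsed time that a task spends waiting to be scheduled. The sketch below assumes a fixed 60 s latency, the low end of the "1 min or more" reported earlier. The slides do not say how the 3.5 min threshold was derived, so this is only an illustration of the trade-off, and `latency_overhead` is a hypothetical helper.

```python
def latency_overhead(task_seconds: float, latency_seconds: float) -> float:
    """Fraction of elapsed time (latency + run time) spent on latency."""
    return latency_seconds / (latency_seconds + task_seconds)


# Assumed latency of 60 s; task durations chosen for illustration.
for task_seconds in (5, 60, 210, 3600):
    share = latency_overhead(task_seconds, latency_seconds=60)
    print(f"{task_seconds:>5} s task: {share:.0%} of elapsed time "
          f"is scheduling latency")
```

Under this assumption a 5 s task spends about 92% of its elapsed time waiting, a 210 s (3.5 min) task about 22%, and a one-hour task under 2%, which is consistent with the slide's point that long batch work is insensitive to latency.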
Application Programmer’s Interface
Our conclusions:
What we expected:
  – Nice, well-designed APIs for all our favorite languages
What we found:
  – A command line interface and a maturing SOAP interface
Our conclusion:
  – Once the SOAP interface matures, UBS programmers will be more amenable to using Condor
What this means for the Condor community:
  – Full speed ahead on the SOAP interface!
  – Make sure all of the functionality available in the command-line interface is available in the SOAP interface
Condor at UBS
We will continue to use Condor for:
Teaching new teams how to grid their applications
  – Condor is an excellent exploration and learning environment
  – Has already accelerated at least one team
A functional benchmark for all things grid
  – Condor is a crucible where new and innovative grid ideas get tried and refined
  – Many of these features will prove valuable for commercial vendors to embrace
      – Check-pointing & task migration
      – Expression-based scheduling policy
      – User-centric cycle scavenging
Non-critical batch-oriented applications with standalone or SOAP-enabled service code, with operations teams that don’t mind a command line administration interface
  – There are lots and lots of non-critical batch-oriented apps with standalone services
  – There are not a lot of operations teams that will tolerate a command line interface…