Large, Fast, and Out of Control: Tuning Condor for Film Production
Jason A. Stowe, Software Engineer Lead - Condor
CORE Feature Animation
Submitter
Session Manager
FAM DB
Condor View CORE
User Facing / Back End
CORE's Farm & Middleware
1000 2.8 GHz Processors, Linux, 4 GB RAM
70-100 Terabytes, Several Filers
50 Million Renders so far (Vanilla Universe)
Condor_startd starter
Condor_render
Condor_schedd
64 Mac Procs
4 Managing Machines
Goals and Software
Goals
● High Throughput & Efficiency
● Easy Condor Submission and Integration
Priority Management – Key to Throughput
Initial Configuration
Software/Policies
● User Priority
● Behavior Flags - STARTD
Issues
● NFS issues
● Out of Order Execution
● Priority Management
320 Procs, 1 Main Filer
RenderMan Schedd Server
Workstation Schedds (Sched Everything Else)
Middleware / Central Mgr
How CG Productions Work
Traditionally,
Movie scripts = Group of Sequences
Movie's Sequences ~ Play's Scenes
Sequence = Group of Shots
Assets = Sets/Characters/Props/...
Prioritize work-units instead of users?
Design
Model
Texture
Surfacing
Assets Design
Layout
Animation
Lighting
Composite
Shots
Two Pipelines
Accounting Groups: Take 1
Software/Policies
● Contracted Wisconsin: Accounting Groups (AG)
● Job = unique AG
● Added Filers, Fix drivers
Issues
● Accountant Overload
● Slow Finishing...
360 Procs, Many Filers
General Schedd Server
Workstation Schedds (Sched Certain Jobs)
Middleware / Central Mgr
16 Mac Procs
Accounting Groups: Take 1
Every job got some resources, but not enough to finish fast for Production.
Moved quickly to Take 2...
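Why one accounting group per job finishes slowly can be shown with a little arithmetic. The sketch below is an idealized model, not Condor's scheduling logic: it assumes work divides perfectly across processors, and the job count, frames per job, and minutes per frame are hypothetical. Only the 360 processors come from the slide.

```python
# Idealized "fluid" model: work divides perfectly across processors.
PROCS = 360             # farm size from the Take 1 slide
N_JOBS = 36             # hypothetical number of queued jobs
FRAMES_PER_JOB = 100    # hypothetical
MINUTES_PER_FRAME = 30  # hypothetical


def equal_share_finish_times(n_jobs, frames, minutes_per_frame, procs):
    """Each job holds procs / n_jobs processors, so all jobs finish together."""
    total_work = n_jobs * frames * minutes_per_frame  # processor-minutes
    return [total_work / procs] * n_jobs


def one_at_a_time_finish_times(n_jobs, frames, minutes_per_frame, procs):
    """The whole farm works on one job at a time, in queue order."""
    per_job = frames * minutes_per_frame / procs
    return [per_job * (i + 1) for i in range(n_jobs)]


shared = equal_share_finish_times(N_JOBS, FRAMES_PER_JOB, MINUTES_PER_FRAME, PROCS)
serial = one_at_a_time_finish_times(N_JOBS, FRAMES_PER_JOB, MINUTES_PER_FRAME, PROCS)

print("equal share, first job done after (min):", shared[0])
print("one at a time, first job done after (min):", round(serial[0], 1))
print("mean finish, equal share (min):", sum(shared) / len(shared))
print("mean finish, one at a time (min):", round(sum(serial) / len(serial), 1))
```

Under this model the farm finishes all the work at the same moment either way, but with equal shares no single job is complete until the very end, which matches the "Slow Finishing" issue on the slide.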
Accounting Groups: Take 2
Software/Policies
● Shots Get Unique AG
● Unify Schedds to fix out of order cases
Issues
● Wanted: Farm % Priority
● Classic Schedd Overload: “Claimed Idle”s
360 Procs, Many Filers
General Schedd Server
Fewer Workstation Schedds (Sched Certain Jobs)
Middleware / Central Mgr
32 Mac Procs
Accounting Groups: Final?
Software/Policies
● “Priority User” - p1 p2 p3
● Multiple Server & Schedds
● ASAP & Department Flags
Issues
● Department “Pools”
● Preemption = Bad
500 Procs, Many Filers
3 Schedd Servers
Middleware / Central Mgr
32 Mac Procs
Accounting Groups: Final?
Sharing Power is a difficult task for anyone, especially users with deadlines.
Need a Quality of Service guarantee: resources will always be available without preemptive department pools...
Group Quotas save the day
1000 Procs, Many Filers
3 Schedd Servers
Middleware / Central Mgr
64 Mac Procs
Software/Policies
● Department Groups: g_lfx, g_mdl, g_chr, etc.
● Quality Of Service
● Nighttime Priority
Issues
● Long negotiation cycles
● Total cycle: 6 minutes
● Server loads > 6
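The quality-of-service idea behind department group quotas can be sketched as a two-pass allocation. This is a simplified model, not the Condor negotiator's algorithm. The group names and the 1000-processor farm come from the slide; the quota and demand numbers are hypothetical, and handing spare slots to groups with unmet demand is an assumption about how surplus might be shared.

```python
def allocate(quotas, demand, total_slots):
    """Give each group up to its quota, then share any spare slots.

    quotas and demand map group name -> number of slots.
    No running job is preempted in this model: a group only
    receives slots that are currently unclaimed.
    """
    if sum(quotas.values()) > total_slots:
        raise ValueError("quotas exceed farm size")

    # Pass 1: guaranteed share, capped by what the group actually wants.
    alloc = {g: min(quotas[g], demand.get(g, 0)) for g in quotas}
    spare = total_slots - sum(alloc.values())

    # Pass 2 (assumption): spare slots go to groups with unmet demand,
    # visited in name order for determinism.
    for g in sorted(quotas):
        if spare <= 0:
            break
        extra = min(spare, demand.get(g, 0) - alloc[g])
        alloc[g] += extra
        spare -= extra
    return alloc


# Hypothetical quotas and demand for the departments named on the slide.
quotas = {"g_lfx": 400, "g_mdl": 300, "g_chr": 300}
demand = {"g_lfx": 900, "g_mdl": 100, "g_chr": 300}
result = allocate(quotas, demand, total_slots=1000)
print(result)
```

In this example g_mdl and g_chr get everything they ask for, and g_lfx picks up the slots g_mdl leaves idle, so each department keeps its guarantee without a preemptive department pool.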
Middleware
Performance Optimization
2 Schedd Servers
CentralMgr
64 Mac Procs
Goal: Speed Negotiator
● Remove Many Groups
● Significant Attributes (SIGNIFICANT_ATTRIBUTES)
● Schedd Submit Algorithm
● Separate Middleware & Central Manager Servers
● Negotiator Cycle 20 sec delay => 3 sec (NEGOTIATOR_CYCLE_DELAY)
1000 Procs, Many Filers
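The benefit of restricting the significant attributes can be illustrated by grouping jobs on just those attributes, so that a matchmaker only has to evaluate one representative per group. The sketch below illustrates that general idea and is not Condor's implementation. The job records and the choice of attribute names are hypothetical.

```python
from collections import defaultdict


def cluster_jobs(jobs, significant):
    """Group job ids by the values of their significant attributes.

    Jobs that agree on every significant attribute match the same
    machines, so only one job per cluster needs a full evaluation.
    """
    clusters = defaultdict(list)
    for job in jobs:
        key = tuple(job.get(attr) for attr in significant)
        clusters[key].append(job["id"])
    return dict(clusters)


# Hypothetical queue: the frame number differs per job but does not
# affect matching, so it is left out of the significant set.
jobs = [
    {"id": 1, "group": "g_lfx", "memory_mb": 2048, "frame": 101},
    {"id": 2, "group": "g_lfx", "memory_mb": 2048, "frame": 102},
    {"id": 3, "group": "g_chr", "memory_mb": 4096, "frame": 7},
    {"id": 4, "group": "g_lfx", "memory_mb": 2048, "frame": 103},
]

clusters = cluster_jobs(jobs, significant=["group", "memory_mb"])
print(len(jobs), "jobs collapse into", len(clusters), "clusters")
```

The fewer attributes that count as significant, the fewer distinct clusters there are to consider in each negotiation cycle.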
Optimization Results
Performance Before => After:
● Removed Groups: 6 => 5.5 min
● Significant Attributes: 5.5 => 3 min
● Schedd Algorithm: 3 => 1.5 min
● Separate Servers: 1.5 => 0.6 min
● Cycle delay: 0.6 => 0.33 min
● Server Loads: <1 Middleware, <2 Central Manager
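The per-step and overall speedups implied by these figures can be worked out directly. The sketch below uses only the before and after cycle times from the slide; the function names are illustrative.

```python
# (step, minutes before, minutes after), as listed on the slide.
STEPS = [
    ("Removed Groups", 6.0, 5.5),
    ("Significant Attributes", 5.5, 3.0),
    ("Schedd Algorithm", 3.0, 1.5),
    ("Separate Servers", 1.5, 0.6),
    ("Cycle delay", 0.6, 0.33),
]


def step_speedups(steps):
    """Speedup factor contributed by each individual change."""
    return {name: before / after for name, before, after in steps}


def overall_speedup(steps):
    """Ratio of the original cycle time to the final cycle time."""
    return steps[0][1] / steps[-1][2]


speedups = step_speedups(STEPS)
overall = overall_speedup(STEPS)

for name, factor in speedups.items():
    print(f"{name}: {factor:.2f}x")
print(f"Overall: {overall:.1f}x")
```

Taken together, the negotiation cycle drops from 6 minutes to roughly 20 seconds, about an 18x improvement. Separating the servers gives the largest single factor (2.5x); removing groups gives the smallest.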
Lessons Learned
● Remove pre-emption where possible
● Simplify Startd/Negotiator (Control) policies:
  ● Make consistent/remove special cases
  ● Understandable farm behavior
● Keep Server Functions Simple
● Use Accounting Groups to guarantee relative percentage allocation of resources
● Use Group Quotas instead of machine-specific RANK policies for better throughput