Large, Fast, and Out of Control: Tuning Condor for Film Production Jason A. Stowe Software Engineer Lead - Condor CORE Feature Animation


Page 1

Large, Fast, and Out of Control: Tuning Condor for Film Production

Jason A. Stowe
Software Engineer Lead - Condor

CORE Feature Animation

Page 2

CORE's Farm & Middleware

Diagram labels: User Facing, Back End, Submitter, Session Manager, FAM DB, Condor View CORE, Condor_schedd, Condor_startd starter, Condor_render

● 1000 2.8 GHz Processors, Linux, 4 GB RAM
● 64 Mac Procs
● 4 Managing Machines
● 70-100 Terabytes, Several Filers
● 50 Million Renders so far (Vanilla Universe)

Page 3

Goals and Software

Goals
● High Throughput & Efficiency
● Easy Condor Submission and Integration

Priority Management – Key to Throughput

Page 4

Initial Configuration

Software/Policies
● User Priority
● Behavior Flags - STARTD

Issues
● NFS issues
● Out of Order Execution
● Priority Management

Farm diagram
● 320 Procs, 1 Main Filer
● RenderMan Schedd Server
● Workstation Schedds (Sched Everything Else)
● Middleware / Central Mgr

Page 5

How CG Productions Work

Traditionally,

Movie scripts = Group of Sequences

Movie's Sequences ~ Play's Scenes

Sequence = Group of Shots

Assets = Sets/Characters/Props/...

Prioritize work-units instead of users?

Two Pipelines
● Assets: Design → Model → Texture → Surfacing
● Shots: Design → Layout → Animation → Lighting → Composite

Page 6

Accounting Groups: Take 1

Software/Policies
● Contracted Wisconsin: Accounting Groups (AG)
● Job = unique AG
● Added Filers, Fix drivers

Issues
● Accountant Overload
● Slow Finishing...

Farm diagram
● 360 Procs, Many Filers
● 16 Mac Procs
● General Schedd Server
● Workstation Schedds (Sched Certain Jobs)
● Middleware / Central Mgr

Page 7

Accounting Groups: Take 1

Every job got some resources, but not enough to finish fast for Production.

Moved quickly to Take 2...
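The Take 1 observation (every job gets some resources, but none finishes fast) can be illustrated with idealized arithmetic. The sketch below is not Condor's negotiator logic. It assumes identical, perfectly divisible jobs with no overhead, and the job count and per-job work are hypothetical. Only the 360-processor farm size comes from the slides.

```python
def finish_times_equal_share(num_jobs, work_per_job, procs):
    """All jobs run at once, each on an equal slice of the farm."""
    share = procs / num_jobs
    return [work_per_job / share] * num_jobs


def finish_times_one_at_a_time(num_jobs, work_per_job, procs):
    """Jobs run one after another, each using the whole farm."""
    per_job = work_per_job / procs
    return [per_job * (i + 1) for i in range(num_jobs)]


procs = 360          # farm size from the Take 1 slide
num_jobs = 10        # hypothetical number of jobs, one accounting group each
work_per_job = 720   # hypothetical processor-hours per job

shared = finish_times_equal_share(num_jobs, work_per_job, procs)
serial = finish_times_one_at_a_time(num_jobs, work_per_job, procs)

print("equal share: first job done after", shared[0], "hours")
print("one at a time: first job done after", serial[0], "hours")
print("mean finish time:", sum(shared) / num_jobs, "vs", sum(serial) / num_jobs)
```

Under these assumptions the last job finishes at the same time either way, but with an equal share nothing is delivered until the very end, which matches the "Slow Finishing" issue listed for Take 1.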

Page 8

Accounting Groups: Take 2

Software/Policies
● Shots Get Unique AG
● Unify Schedds to fix out of order cases

Issues
● Wanted: Farm % Priority
● Classic Schedd Overload: “Claimed Idle”s

Farm diagram
● 360 Procs, Many Filers
● 32 Mac Procs
● General Schedd Server
● Fewer Workstation Schedds (Sched Certain Jobs)
● Middleware / Central Mgr

Page 9

Accounting Groups: Final?

Software/Policies
● “Priority User” - p1 p2 p3
● Multiple Server & Schedds
● ASAP & Department Flags

Issues
● Department “Pools”
● Preemption = Bad

Farm diagram
● 500 Procs, Many Filers
● 32 Mac Procs
● 3 Schedd Servers
● Middleware / Central Mgr

Page 10

Accounting Groups: Final?

Sharing Power is a difficult task for anyone, especially users with deadlines.

Need a Quality of Service guarantee: resources will always be available without preemptive department pools...

Page 11

Group Quotas save the day

Software/Policies
● Department Groups: g_lfx, g_mdl, g_chr, etc.
● Quality Of Service
● Nighttime Priority

Issues
● Long negotiation cycles
● Total Cycle: 6 minutes
● Server loads >6

Farm diagram
● 1000 Procs, Many Filers
● 64 Mac Procs
● 3 Schedd Servers
● Middleware / Central Mgr
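As a rough illustration of the Quality of Service idea, the toy model below treats each department group's quota as a guaranteed floor and hands any spare processors to groups that still have work queued. It is a sketch, not Condor's negotiator algorithm. The quota and demand numbers are hypothetical, the function name `allocate` is made up for this example, and whether spare capacity is shared at all depends on pool configuration that the slides do not describe. It also assumes the quotas sum to no more than the farm size.

```python
def allocate(total_slots, quotas, demand):
    """Toy quota model: guarantee min(demand, quota), then share spare slots.

    NOT Condor's negotiator; only an illustration of a quota acting as a
    guaranteed floor without preempting running jobs.
    """
    # Step 1: every group is guaranteed up to its quota.
    alloc = {g: min(demand.get(g, 0), q) for g, q in quotas.items()}
    spare = total_slots - sum(alloc.values())

    # Step 2: hand leftover slots, one at a time, to groups wanting more.
    while spare > 0:
        hungry = [g for g in quotas if demand.get(g, 0) > alloc[g]]
        if not hungry:
            break
        for g in hungry:
            if spare == 0:
                break
            alloc[g] += 1
            spare -= 1
    return alloc


# Group names come from the slide; the numbers are hypothetical.
quotas = {"g_lfx": 400, "g_mdl": 300, "g_chr": 300}
demand = {"g_lfx": 900, "g_mdl": 100, "g_chr": 300}

result = allocate(1000, quotas, demand)
print(result)
```

In this example g_mdl and g_chr get everything they ask for, and g_lfx runs above its quota only on processors nobody else is using, so no group has to be preempted to honor another group's guarantee.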

Page 12

Performance Optimization

Goal: Speed Negotiator
● Remove Many Groups
● Significant Attributes (SIGNIFICANT_ATTRIBUTES)
● Schedd Submit Algorithm
● Separate Middleware & Central Manager Servers
● Negotiator Cycle 20 sec delay => 3 sec (NEGOTIATOR_CYCLE_DELAY)

Farm diagram
● 1000 Procs, Many Filers
● 64 Mac Procs
● 2 Schedd Servers
● Middleware
● Central Mgr

Page 13

Optimization Results

Performance Before => After:
● Removed Groups: 6 => 5.5 min
● Significant Attributes: 5.5 => 3 min
● Schedd Algorithm: 3 => 1.5 min
● Separate Servers: 1.5 => 0.6 min
● Cycle delay: 0.6 => 0.33 min
● Server Loads: <1 Middleware, <2 Central Manager
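The before and after figures imply a speedup for each step and for the whole effort. A short sketch of that arithmetic follows. The cycle times are the ones listed on the slide, while the variable and function names are illustrative.

```python
# Negotiation cycle time in minutes after each optimization step,
# taken from the "Performance Before => After" list on the slide.
steps = [
    ("Baseline", 6.0),
    ("Removed Groups", 5.5),
    ("Significant Attributes", 3.0),
    ("Schedd Algorithm", 1.5),
    ("Separate Servers", 0.6),
    ("Cycle delay", 0.33),
]


def speedups(steps):
    """Return (name, step_speedup, cumulative_speedup) per step after baseline."""
    baseline = steps[0][1]
    out = []
    for (_, before), (name, after) in zip(steps, steps[1:]):
        out.append((name, before / after, baseline / after))
    return out


results = speedups(steps)
for name, step, total in results:
    print(f"{name:<24} x{step:.2f} this step, x{total:.2f} overall")
```

Taken together, the cycle drops from 6 minutes to about 20 seconds, roughly an 18x reduction. The largest single gain in the list is separating the servers (1.5 to 0.6 minutes, a factor of 2.5).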

Page 14

Lessons Learned
● Remove pre-emption where possible
● Simplify Startd/Negotiator (Control) policies:
    ● Make Consistent/remove special cases
    ● Understandable farm behavior
● Keep Server Functions Simple
● Use Accounting Groups to guarantee relative percentage allocation of resources
● Use Group Quotas instead of machine-specific RANK policies for better throughput

Page 15

Thank you

Condor Team University of Wisconsin

CORE

Any Questions?

[email protected]