Real-life experiences with grids: It’s not as easy as it looks

Alain Roy, University of Wisconsin-Madison, Condor Team



Page 1: Real-life experiences with grids: It’s not as easy as it looks

Grid Experiences 1

Real-life experiences with grids: It’s not as easy as it looks

Alain Roy

University of Wisconsin-Madison
Condor Team

Page 2: Real-life experiences with grids: It’s not as easy as it looks

Who Am I?

• Member of Condor Team
  – Experience with Condor
  – Experience with grid deployment
• Developer of Virtual Data Toolkit
  – Used by GriPhyN, EDG, LCG…
  – Packaging of Globus, Condor, etc.
• Collaborator with INFN
  – Working with Paolo Mazzanti
  – In Bologna for four weeks

Page 3: Real-life experiences with grids: It’s not as easy as it looks

Italy

• Italy is beautiful
• The food is wonderful
• The people are friendly

Page 4: Real-life experiences with grids: It’s not as easy as it looks

Background

• Condor’s environment is a little like a grid
  – Not all computers (grid sites) are under Condor’s control
  – Computers (grid sites) disappear at the owner’s whim
  – Everything changes constantly
• Condor was built to deal with this dynamic environment
• Grid software needs to do the same

Page 5: Real-life experiences with grids: It’s not as easy as it looks

Background

• Late 1980s until today
  – Condor developed and deployed on hundreds of sites
  – Condor built to deal with failures
• Recently
  – Condor-G: your window to the grid
  – Condor team has helped deploy grid technology for real use, not just experiments

Page 6: Real-life experiences with grids: It’s not as easy as it looks

Background: Condor

• Condor is a batch job system
• Goal: High-throughput computing
  – Different from high-performance computing
• Goal: High reliability
• Goal: Support distributed ownership

Page 7: Real-life experiences with grids: It’s not as easy as it looks

High-Throughput Computing

• Worry about FLOPS/year, not FLOPS/second
• Use all resources effectively
  – Dedicated clusters
  – Non-dedicated computers (desktop)
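A rough way to see the difference between FLOPS/second and FLOPS/year is to multiply a machine's rate by the fraction of the year it is actually usable. The sketch below does only that arithmetic. The two machines and their figures are made-up values chosen for illustration, not measurements.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def flop_per_year(rate_flops: float, availability: float) -> float:
    """Total floating-point operations delivered in one year.

    rate_flops   -- rate while the machine is actually running jobs
    availability -- fraction of the year the machine is usable (0.0 to 1.0)
    """
    return rate_flops * availability * SECONDS_PER_YEAR


# Hypothetical machines: one fast but often unavailable, one slower but steady.
fast_but_flaky = flop_per_year(rate_flops=2e9, availability=0.30)
slow_but_steady = flop_per_year(rate_flops=1e9, availability=0.90)

# The slower machine delivers more work over the year.
print(slow_but_steady > fast_but_flaky)
```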

Page 8: Real-life experiences with grids: It’s not as easy as it looks

Effective Resource Use

• Requires high reliability
  Computers come and go, your jobs shouldn’t.
  – Checkpointing
  – Be prepared for everything breaking
• Requires distributed ownership

Page 9: Real-life experiences with grids: It’s not as easy as it looks

Condor-G

• Condor-G submits Globus jobs
• Jobs are in persistent queue
  – Unlike globus-job-run
• Jobs are retried on system failures
• Jobs are held on some failures
• Condor-G makes it easy to submit grid jobs

Page 10: Real-life experiences with grids: It’s not as easy as it looks

Background: USCMS

• CMS:
  – Detector online in 2007
  – Needs to simulate & reconstruct millions of events
• USCMS testbed
  – Joint PPDG/GriPhyN effort
  – Integrate CMS tools with grid tools
    • Globus
    • Condor-G
  – Contribute real work to CMS

Page 11: Real-life experiences with grids: It’s not as easy as it looks

Background: USCMS

• 7 sites, 250+ CPUs
• Spring 2002: Deploy & test
• Fall 2002
  – Last minute production
  – 150,000 events in two weeks
  – Successful, but lots of work
• Today:
  – Wider deployment & use

Page 12: Real-life experiences with grids: It’s not as easy as it looks

Background: DØ

• Experiment at Fermilab
• Already doing real production, real analysis
• Deploying on grid sites today
  – Condor-G
  – Globus
  – SAM

Page 13: Real-life experiences with grids: It’s not as easy as it looks

DØ: Condor-G

• They liked Condor-G:
• Condor-G missing a feature:
  – Deciding which grid-site to use
• SAM (data handling software) knows where data is located
• SAMGrid:
  – Condor-G asks SAM for advice
  – Condor-G decides where to run jobs

Page 14: Real-life experiences with grids: It’s not as easy as it looks

DØ: deployment

• Spring: Beginning of deployment
• Late summer: production
• Early results:
  – It looks good
  – We have more work to do
    • Better error reporting
    • Better matchmaking
• What will we learn later?

Page 15: Real-life experiences with grids: It’s not as easy as it looks

Problems & Lessons

• During our experiences, we’ve:
  – Encountered many problems
  – Developed solutions to these problems
  – Learned many lessons about grids
• This talk:
  – Shares some interesting problems
  – Gives some advice & solutions

Page 16: Real-life experiences with grids: It’s not as easy as it looks

Taking a taxi

• How do you take a taxi in Paestum, Italy?
  – We don’t need to: walk 4km there
  – The ruins were lovely
  – The ruins were outside
  – It was about 35°C
  – Wife is pregnant

Page 17: Real-life experiences with grids: It’s not as easy as it looks

Use all your resources

• Walk up to storekeeper
• Ask: Dovay Ooon Taxi? (Dov’è un taxi?)
• Be patient: Wait ten minutes
• Take taxi
• I assumed my resources (local knowledge, Italian) were insufficient, but they saved me time when I used them

Page 18: Real-life experiences with grids: It’s not as easy as it looks

Use all your resources

• Condor:
  – Uses dedicated machines (I can walk)
  – Uses non-dedicated machines (I can sometimes ask for help)
• Grids:
  – Connect your machine rooms
  – Can you take advantage of other resources?
  – Avoid mentality “I must control all resources”, and you will prosper

Page 19: Real-life experiences with grids: It’s not as easy as it looks

Grid: distributed machine room?

• You can have good control
• You can pre-install applications
• You know how everything works

BUT…

• You lose flexibility
  – How quickly can you upgrade sites?
  – Did they install everything correctly?
  – Can you use new grid sites easily?

Page 20: Real-life experiences with grids: It’s not as easy as it looks

Grid: Use all resources

• Assume: basic grid software is installed
• Assume: nothing else is installed
  – Bring your software with you
• Submit one job: install software
• Submit N jobs: use software
  – You control software
  – You ensure correct installation
• Easy to use any grid site
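The "one install job, then N work jobs" pattern can be sketched as a small dependency plan. This is only an illustration of the ordering, not how any particular tool represents it. The site name and the script names `install_software.sh` and `run_task.sh` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class PlannedJob:
    name: str
    command: str
    depends_on: List[str] = field(default_factory=list)


def build_plan(site: str, n_workers: int) -> List[PlannedJob]:
    """One install job followed by n_workers jobs that depend on it."""
    install = PlannedJob(name=f"install@{site}", command="./install_software.sh")
    workers = [
        PlannedJob(
            name=f"work{i}@{site}",
            command=f"./run_task.sh {i}",
            depends_on=[install.name],
        )
        for i in range(n_workers)
    ]
    return [install] + workers


def runnable(plan: List[PlannedJob], finished: Set[str]) -> List[str]:
    """Names of jobs whose dependencies have all finished."""
    return [
        job.name
        for job in plan
        if job.name not in finished
        and all(dep in finished for dep in job.depends_on)
    ]


plan = build_plan("siteA", 3)
print(runnable(plan, set()))              # only the install job
print(runnable(plan, {"install@siteA"}))  # now the work jobs
```

Because the install step travels with the jobs, a new site needs nothing beyond the basic grid software before it can be used.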

Page 21: Real-life experiences with grids: It’s not as easy as it looks

Long-running programs

• Long-running programs crash
  – Condor has daemons on each machine:
    • User (job) agent
    • Machine agent
    • Matchmaker
  – They crash:
    • Programming errors
    • Network failures
    • Disk failures
    • …

Page 22: Real-life experiences with grids: It’s not as easy as it looks

Watch programs

• Condor master
  – Small program, rarely changed
  – Runs Condor daemons
  – When daemon crashes:
    • Restart daemon, send email
    • If it crashes again, restart after backoff
• Result:
  – Many errors are silently fixed
  – Yet we don’t just ignore crashes
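The restart-and-notify behaviour described above can be sketched as a small supervisor loop. This is a minimal illustration of the idea, not the condor_master implementation. The function names, the doubling schedule, and the restart limit are assumptions made for the example.

```python
import time
from typing import Callable


def supervise(run_daemon: Callable[[], int],
              notify: Callable[[str], None],
              max_restarts: int = 5,
              base_delay: float = 1.0,
              sleep: Callable[[float], None] = time.sleep) -> int:
    """Run a daemon and restart it when it exits abnormally.

    run_daemon blocks until the daemon exits and returns its exit code.
    notify stands in for "send email". Returns the number of restarts.
    """
    restarts = 0
    delay = 0.0  # the first restart is immediate
    while True:
        exit_code = run_daemon()
        if exit_code == 0:
            return restarts  # clean shutdown, nothing to fix
        notify(f"daemon exited with code {exit_code}")  # never ignore a crash
        if restarts >= max_restarts:
            return restarts  # give up and leave it to a human
        sleep(delay)
        # Later restarts wait longer: base_delay, then double each time.
        delay = base_delay if delay == 0.0 else delay * 2
        restarts += 1
```

Keeping the supervisor this small is the point: a program that does little and rarely changes is less likely to crash than the daemons it watches.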

Page 23: Real-life experiences with grids: It’s not as easy as it looks

Short-running programs

• Short-running programs crash/hang
• Example: globus-url-copy
  – USCMS testbed: staging data
  – Some fraction of copies hang or fail
  – Programming error + delicate network
  – Hard to reproduce and fix

Page 24: Real-life experiences with grids: It’s not as easy as it looks

Watch programs

• When copy exceeds timeout, kill and retry
• Possible to do in shell scripting languages, but not easy
• Use Fault Tolerant Shell to watch programs
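For readers without FTSH at hand, the "kill on timeout and retry" idea can be sketched with Python's standard `subprocess` module. This is an illustration of the technique, not a replacement for FTSH. The command run at the end is a harmless stand-in for a real copy such as globus-url-copy.

```python
import subprocess
import sys
from typing import List


def run_with_timeout(cmd: List[str], timeout_s: float, attempts: int) -> bool:
    """Run cmd up to `attempts` times; True as soon as one run succeeds."""
    for _ in range(attempts):
        try:
            # subprocess.run kills the child and raises TimeoutExpired
            # if it has not finished within timeout_s seconds.
            result = subprocess.run(cmd, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            continue  # the command hung and was killed: try again
        if result.returncode == 0:
            return True
        # non-zero exit code: the command failed, so try again
    return False


# Stand-in command; a real script would run the file copy here.
ok = run_with_timeout([sys.executable, "-c", "pass"], timeout_s=30.0, attempts=3)
```

This version retries immediately. In practice a delay between attempts is worth adding.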

Page 25: Real-life experiences with grids: It’s not as easy as it looks

Fault Tolerant Shell

• Shell language built for coping with errors

    try for 30 minutes
        wget http://www.example.com/file.tar.gz
        gunzip file.tar.gz
        tar xf file.tar
    end

Exponential backoff on failure: Wait {1, 2, 4…} seconds * rand in [1,2]

Page 26: Real-life experiences with grids: It’s not as easy as it looks

FTSH: exponential backoff

• Why exponential backoff?
  – What if 100 ftsh scripts are executing?
  – Avoid synchronization: reduce load, increase chance of success
  – Similar to Ethernet
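The delay rule "wait {1, 2, 4…} seconds * rand in [1,2]" can be written out directly. The sketch below only computes the delays; the function name is made up for the example, and FTSH's own implementation may differ in detail.

```python
import random


def backoff_delay(attempt: int, rng: random.Random) -> float:
    """Seconds to wait before retry number `attempt` (0-based).

    The base delay doubles each attempt (1, 2, 4, ...) and is multiplied
    by a random factor in [1, 2] so that many clients that failed at the
    same moment do not all retry at the same moment.
    """
    return (2 ** attempt) * rng.uniform(1.0, 2.0)


rng = random.Random()
delays = [backoff_delay(i, rng) for i in range(5)]
print(delays)  # five increasing-on-average delays, different on every run
```

With 100 scripts retrying against the same server, the random factor spreads the retries out instead of producing 100 simultaneous requests after every failure.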

Page 27: Real-life experiences with grids: It’s not as easy as it looks

Fault Tolerant Shell

• Easier to cope with failures:

    try 5 times
        wget http://www.example.com/file.tar.gz
    catch
        rm -f file.tar.gz
        failure
    end

Cleanup partially downloaded file, if it exists

Page 28: Real-life experiences with grids: It’s not as easy as it looks

Fault Tolerant Shell

• Flexible

    try for 30 minutes
        try for 5 minutes
            wget http://example.com/file.tar.gz
        end
        try for 1 minute or 3 times
            gunzip file.tar.gz
            tar xf file.tar
        catch
            rm -rf file.tar
        end
    end

Callouts: the wget block copes with network failure; the gunzip/tar block copes with disk failure.

Page 29: Real-life experiences with grids: It’s not as easy as it looks

FTSH: More information

• Work of Doug Thain
• Excellent paper:
  – The Ethernet Approach to Grid Computing, by Doug Thain
  – Available from: http://www.cs.wisc.edu/~thain
• Even if you don’t use FTSH, read this paper!

Page 30: Real-life experiences with grids: It’s not as easy as it looks


Whose error is it?

• The source of an error is not always obvious

• The source of an error influences how you react to the error

• Example: Java universe in Condor

Page 31: Real-life experiences with grids: It’s not as easy as it looks

Java Universe

• Users submit Java jobs to Condor
• Whose error is it? Check result code:
  – 1: Program dereferenced NULL pointer → Job shouldn’t run again
  – 1: Job’s image is corrupt → Job shouldn’t run again
  – 1: VM doesn’t have enough memory to run program → Try another machine with more memory
  – 1: Java installation is misconfigured → Don’t use this machine for Java

Page 32: Real-life experiences with grids: It’s not as easy as it looks

Don’t trust configuration

• User tells Condor: “Java is installed”
  – This is just a hint!
• Condor verifies Java configuration
  – Run simple job, verify output
• If Java works, Condor advertises that Java can be used
• If Java fails, error is reported, Java can’t be used
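The "treat configuration as a hint, then verify" step can be sketched as a probe that runs a trivial job and compares its output. This is an illustration only, not Condor's code. The capability name `java` and the probe command are assumptions, and the Python interpreter stands in for a real Java test program so that the example runs anywhere.

```python
import subprocess
import sys
from typing import Dict, List, Tuple


def probe_works(cmd: List[str], expected_output: str, timeout_s: float = 30.0) -> bool:
    """Run a trivial job and check that it prints exactly what we expect."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except (OSError, subprocess.TimeoutExpired):
        return False  # could not start, or hung: treat as not working
    return result.returncode == 0 and result.stdout.strip() == expected_output


def advertise(configured: Dict[str, bool],
              probes: Dict[str, Tuple[List[str], str]]) -> Dict[str, bool]:
    """Advertise a capability only if it is configured AND its probe passes."""
    ad = {}
    for capability, claimed in configured.items():
        if not claimed:
            ad[capability] = False
            continue
        cmd, expected = probes[capability]
        ad[capability] = probe_works(cmd, expected)
    return ad


# Stand-in probe: a real deployment would run a small Java class instead.
probes = {"java": ([sys.executable, "-c", "print('hello')"], "hello")}
machine_ad = advertise({"java": True}, probes)
```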

Page 33: Real-life experiences with grids: It’s not as easy as it looks

Look for error scope

• Add Java wrapper to all Java jobs
  – Run program
  – Examine return code/exception
  – Write all details to file
• Examine output of wrapper, or exception from JVM
  – We know if job is bad
  – We know if JVM is insufficient for job
  – We know if JVM is bad
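Once the wrapper records what happened, choosing a reaction becomes a lookup on the scope of the error. The sketch below is one hypothetical way to express that. The report format (a dict with an `exception` key, or `None` when the wrapper produced nothing) and the mapping from exception names to scopes are assumptions for illustration, not Condor's actual rules.

```python
from enum import Enum
from typing import Optional


class Scope(Enum):
    PROGRAM = "program"    # the user's code failed
    JOB = "job"            # the submitted job itself is unusable
    RESOURCE = "resource"  # this machine is insufficient for this job
    MACHINE = "machine"    # Java on this machine is broken


ACTIONS = {
    Scope.PROGRAM: "return result to user; do not run again",
    Scope.JOB: "return error to user; do not run again",
    Scope.RESOURCE: "try another machine with more memory",
    Scope.MACHINE: "stop using this machine for Java",
}


def classify(report: Optional[dict]) -> Optional[Scope]:
    """Map a wrapper report to an error scope (None means no error)."""
    if report is None:
        return Scope.MACHINE  # wrapper wrote nothing: the JVM never ran it
    exception = report.get("exception")
    if exception is None:
        return None  # program ran to completion
    if exception == "java.lang.OutOfMemoryError":
        return Scope.RESOURCE
    if exception == "java.lang.ClassFormatError":
        return Scope.JOB  # corrupt class file
    return Scope.PROGRAM  # e.g. java.lang.NullPointerException


scope = classify({"exception": "java.lang.OutOfMemoryError"})
print(ACTIONS[scope])
```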

Page 34: Real-life experiences with grids: It’s not as easy as it looks


Error Scope

• We could have an entire talk on error scope

• Excellent paper: Error Scope on a Computational Grid: Theory and Practice, by Doug Thain

• Useful paper even if you don’t use Condor or Java

Page 35: Real-life experiences with grids: It’s not as easy as it looks

Many layers in a grid

[Figure: diagram of the layers a grid job passes through. Labels in the figure: condor_submit; Condor job agent; Condor matchmaker; Execution computer; Condor-G job agent; condor_submit; Globus jobmanager; Globus gatekeeper; Globus GRAM; inetd]

Page 36: Real-life experiences with grids: It’s not as easy as it looks

We forgot inetd

• We submitted 300 jobs at once
• Inetd noticed many connections per second
• Inetd presumed there was a denial of service attack and refused connections for five minutes
• Lots of debugging!
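One possible way to stay under a per-second connection limit is to pace submissions rather than send them all at once. The slide does not say how the problem was resolved, so this is only a sketch of that idea. The limit of 10 connections per second is a made-up figure; the real threshold depends on how inetd is configured at each site.

```python
from typing import List


def submission_schedule(n_jobs: int, max_per_second: float) -> List[float]:
    """Offsets in seconds at which to submit each job, evenly spaced."""
    if max_per_second <= 0:
        raise ValueError("max_per_second must be positive")
    interval = 1.0 / max_per_second
    return [i * interval for i in range(n_jobs)]


# 300 jobs at a hypothetical limit of 10 connections per second.
schedule = submission_schedule(300, max_per_second=10)
print(len(schedule), schedule[0], schedule[-1])
```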

Page 37: Real-life experiences with grids: It’s not as easy as it looks

There are more layers!

[Figure: USCMS Testbed Architecture (a bit dated). Labels in the figure: Master Site; Impala; MOP; Condor-G; Worker; Globus; Batch System (Condor, PBS); Real Work; DAGMan]

Page 38: Real-life experiences with grids: It’s not as easy as it looks

More layers than that!

1. MCRunJob
2. Impala
3. MOP
4. condor_schedd
5. DAGMan
6. Condor-G condor_schedd
7. condor_gridmanager
8. gahp_server
9. globus-gatekeeper
10. globus-job-manager
11. globus-job-manager-script.pl
12. local batch system submit
13. local batch system execute
14. MOP wrapper
15. Impala wrapper
16. actual job

This disregards inetd, network, file servers, file transfers…

USCMS Testbed Architecture (A bit dated)

Page 39: Real-life experiences with grids: It’s not as easy as it looks

Recovery at multiple levels

• Fault tolerance and recovery are built in at many levels:
  – condor_master: restart daemons
  – condor_schedd: job queue
  – DAGMan: checkpoint DAG of jobs
  – gahp_server: isolate Globus libraries
  – And others…

Page 40: Real-life experiences with grids: It’s not as easy as it looks

Allocate debugging time

• Allocate lots of debugging time
• It is very hard to propagate errors
• How does a user find a remote error?
  – Call system administrator
  – Admin looks through log files for each layer (not accessible to user)
• We need better debugging methods

Page 41: Real-life experiences with grids: It’s not as easy as it looks

Everything will fail (Everything)

• In the USCMS testbed production:
  – Power outage for several hours
  – Network outages: from a few minutes to 11 hours
  – Failed configuration change
  – Site upgraded
  – Jobs accidentally removed
  – Software bugs everywhere

Page 42: Real-life experiences with grids: It’s not as easy as it looks

How do you cope?

• Condor-G:
  – “Error: job cannot run” is not good enough
  – Resubmit jobs that can be resubmitted, perhaps after a delay
  – Put jobs on hold in queue:
    • User examines hold reason (proxy is expired)
    • User fixes error
    • User restarts job
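The resubmit-or-hold policy can be sketched as a small state change on a queued job. This illustrates the idea and is not Condor-G's implementation. The error names, the set of errors treated as transient, and the 60-second delay are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical error names that are worth retrying without bothering the user.
TRANSIENT_ERRORS = {"network_timeout", "gatekeeper_unreachable"}


@dataclass
class QueuedJob:
    name: str
    state: str = "idle"  # "idle" (waiting to run) or "held"
    hold_reason: Optional[str] = None
    retry_after_s: float = 0.0


def handle_failure(job: QueuedJob, error: str, delay_s: float = 60.0) -> None:
    """Resubmit after a delay if the error is transient, otherwise hold."""
    if error in TRANSIENT_ERRORS:
        job.state = "idle"
        job.retry_after_s = delay_s
    else:
        job.state = "held"
        job.hold_reason = error  # the user can read this and fix the cause


def release(job: QueuedJob) -> None:
    """The user has fixed the problem: put the job back in the queue."""
    if job.state != "held":
        raise ValueError(f"{job.name} is not held")
    job.state = "idle"
    job.hold_reason = None


job = QueuedJob("sim-001")
handle_failure(job, "proxy_expired")
print(job.state, job.hold_reason)  # prints: held proxy_expired
```

The job stays in the queue in both cases, so nothing is lost while the user renews the proxy.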

Page 43: Real-life experiences with grids: It’s not as easy as it looks

Everything will fail (Even the little things)

• Condor Matchmaker:
  – Collects descriptions of machines & jobs
  – Soft state in matchmaker (push smarts to edge, like Internet)
• UDP packets to advertise machines
  – Less overhead than many TCP connections
  – Works great in a LAN
• But…

Page 44: Real-life experiences with grids: It’s not as easy as it looks

Everything will fail: UDP

• But you lose some UDP packets
  – Send packets every five minutes
  – Keep stale information for 15 minutes
  – Be prepared to cope with stale information
  – This has worked for years in Condor
• DØ: matchmaking on grid
  – UDP packets from Korea to Chicago were completely lost on weekdays
  – Added TCP option
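The soft-state scheme (advertise every five minutes, keep entries for 15) can be sketched as a table whose entries expire unless refreshed. This is a minimal illustration, not the Condor matchmaker. The class and method names are made up, and time is passed in explicitly so the behaviour is easy to follow.

```python
from typing import Dict, Tuple


class SoftStateTable:
    """Machine advertisements that expire unless they are refreshed."""

    def __init__(self, ttl_s: float = 15 * 60):
        self.ttl_s = ttl_s
        self._ads: Dict[str, Tuple[dict, float]] = {}  # machine -> (ad, received_at)

    def update(self, machine: str, ad: dict, now: float) -> None:
        """Record an advertisement received at time `now` (seconds)."""
        self._ads[machine] = (ad, now)

    def live(self, now: float) -> Dict[str, dict]:
        """Drop entries older than the TTL and return what remains."""
        self._ads = {
            machine: (ad, received)
            for machine, (ad, received) in self._ads.items()
            if now - received <= self.ttl_s
        }
        return {machine: ad for machine, (ad, _) in self._ads.items()}


table = SoftStateTable()
table.update("node-a", {"memory_mb": 512}, now=0)
table.update("node-b", {"memory_mb": 1024}, now=300)
print(sorted(table.live(now=600)))  # both present, even if a packet was lost
print(sorted(table.live(now=901)))  # node-a expired after 15 minutes of silence
```

With a five-minute send interval, an entry survives a couple of consecutive lost packets. When nearly every packet is lost, as on the Korea to Chicago path, every entry expires and a reliable transport is needed instead.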

Page 45: Real-life experiences with grids: It’s not as easy as it looks

Be prepared

• Assume everything will fail
  – Have recovery at multiple levels
  – Understand scope of errors
  – Don’t trust configuration:
    • Verify it
    • Install & configure software “on the fly”

• Assume bugs are everywhere• Build software to cope with errors