Error Scope on a Computational Grid Douglas Thain University of Wisconsin 4 March 2002

Page 1

Error Scope on a Computational Grid

Douglas Thain

University of Wisconsin

4 March 2002

Page 2

Overview

We have added a Java Universe to Condor. (More from Todd.)

Adding this code forced us to think about the fundamental problem of coupling systems and representing errors.

A lesson: One must consider the scope of an error as well as its detail.

Page 3

Java for Scientific Computing

Java is emerging as a tool for large scale (Grande) scientific computing.
• More accessible to domain scientists.
• Simplified porting.
• Faster development, debugging.

User communities are forming:
• ACM Java Grande Conference
• The Java Grande Forum

Page 4

The Hype:

Java:
• “Write once, run anywhere!”

Condor:
• “Submit once, run everywhere!”

The Grid:
• Uniform, dependable, consistent, pervasive, and inexpensive computing.

Page 5

The Reality:

Coupling systems is not trivial!

The easy part:
• Putting java in front of the program name.

The tricky parts:
• Dealing with unexpected events!
  – Bad java installation.
  – Unavailable file system.
  – Temporary resource exhaustion.

Page 6

Architecture

Execution:
• User just specifies “java” universe.
• Execution site gives details of JVM.

I/O:
• Know all of your files?
  – Condor transfers whole files for you.
• Need online I/O?
  – Link program with Chirp I/O Library.
  – Execution site provides proxy to home site.

Page 7

[Diagram. Labels: starter, shadow, Home File System, Execution Site, Submission Site.]

Page 8

[Diagram. Labels: starter, shadow, Home File System, Execution Site, Submission Site, JVM, Fork.]

Page 9

[Diagram. Labels: starter, shadow, Home File System, Execution Site, Submission Site, JVM, Fork, The Job.]

Page 10

[Diagram. Labels: JVM, Fork, starter, shadow, Home File System, I/O Library, The Job, I/O Server, I/O Proxy, Secure Remote I/O, Local System Calls, Local I/O (Chirp), Execution Site, Submission Site.]

Page 11

Initial Experience

Bad news: Nearly any unexpected failure would cause the job to be returned to the user:
• Out of memory at execution site.
• Java misconfigured at execution site.
• I/O proxy can’t initialize.
• Home file system offline.

Page 12

What do Users Want?

This was correct in a certain sense:
• The information was true.
• But, still frustrating.

Users want to know when their program fails by design (NullPointerException), but not when it fails due to the environment.

Page 13

What Did We Do Wrong?

We thought that we were very careful to propagate errors:
• I/O errors: server->proxy->library->job
• JVM exit code: JVM->starter->home

But, we failed to draw a distinction:
• Errors that are a natural property of the program.
• Errors that are an incidental result of the environment.

Page 14

Scope and Detail

The scope of an error is the portion of the system that it invalidates.

The detail of an error describes its philosophical cause.

An error must be delivered to the handler that manages its scope.
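
The three statements on this slide can be sketched in a few lines of Python. This is only an illustration, not Condor's implementation: the names `Scope`, `GridError`, `HANDLERS`, and `route` are hypothetical, and the scope-to-handler pairs are taken from the examples given later in these slides.

```python
from dataclasses import dataclass
from enum import Enum


class Scope(Enum):
    """Portion of the system that an error invalidates (names from the slides)."""
    PROGRAM = "program"
    JOB = "job"
    REMOTE_MACHINE = "remote machine"
    HOME_MACHINE = "home machine"


@dataclass(frozen=True)
class GridError:
    detail: str   # what went wrong
    scope: Scope  # what the error invalidates


# Hypothetical routing table: which party manages each scope.
HANDLERS = {
    Scope.PROGRAM: "user",
    Scope.JOB: "user",
    Scope.REMOTE_MACHINE: "condor",
    Scope.HOME_MACHINE: "condor",
}


def route(error: GridError) -> str:
    """Return the handler that manages the error's scope, ignoring its detail."""
    return HANDLERS[error.scope]
```

The point of the sketch is that `route` never inspects `detail`: two errors with very different descriptions go to the same handler whenever they invalidate the same part of the system.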

Page 15

Examples

Detail | Scope | Handler
Program exited normally. | Program | User
Null pointer exception. | Program | User
Out of memory. | Remote Machine | Condor
Home file system offline. | Home Machine | Condor
Program image corrupt. | Job | User

Page 16

An Example

With this understanding, we reconsidered many elements of the Java Universe.

One example:
• The JVM exit code is not a useful result.
• It gives results that ignore error scope.

Solution:
• Trap the program exit at a higher level.
• Report the result and scope on a separate channel.
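
A rough Python analogue of this solution is sketched below. It is not the Java Universe wrapper itself: `run_wrapped`, `EnvironmentFailure`, and the JSON result-file layout are assumptions made for illustration. The wrapper traps every way the program can end and writes the result together with its scope to a file, so the process exit code is no longer the only channel.

```python
import json


class EnvironmentFailure(Exception):
    """Hypothetical error raised by the I/O layer when the environment fails."""

    def __init__(self, detail, scope):
        super().__init__(detail)
        self.scope = scope


def run_wrapped(program, result_path):
    """Run `program` (a callable returning an exit code) and record its
    result and scope in a result file, separate from the exit code."""
    try:
        code = program()
        result = {"scope": "program", "detail": "exited normally", "exit_code": code}
    except SystemExit as e:
        # The program asked to exit; that is still a program-scope result.
        result = {"scope": "program", "detail": "exited via sys.exit", "exit_code": e.code}
    except MemoryError:
        result = {"scope": "remote machine", "detail": "out of memory"}
    except EnvironmentFailure as e:
        result = {"scope": e.scope, "detail": str(e)}
    except Exception as e:
        # Anything else the program raised on its own is within program scope.
        result = {"scope": "program", "detail": f"{type(e).__name__}: {e}"}
    with open(result_path, "w") as f:
        json.dump(result, f)
    return result
```

Whoever reads the result file can then decide, from the recorded scope alone, whether to hand the outcome to the user or to retry the job elsewhere.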

Page 17

JVM Exit Code

Detail | Scope | Handler | Exit Code
Program exited normally. | Program | User | (x)
Null pointer exception. | Program | User | 1
Out of memory. | Remote Machine | Condor | 1
Home file system offline. | Home Machine | Condor | 1
Program image corrupt. | Job | User | 1
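
To see why the exit code alone cannot drive routing, the failure rows of this table can be grouped by exit code. The snippet below is an illustrative sketch; `ROWS` and `scopes_by_exit_code` are hypothetical names, and the data is copied from the table.

```python
from collections import defaultdict

# (detail, scope, handler, JVM exit code) rows from the table; the normal-exit
# row is omitted because its code (x) is chosen by the program itself.
ROWS = [
    ("Null pointer exception.", "Program", "User", 1),
    ("Out of memory.", "Remote Machine", "Condor", 1),
    ("Home file system offline.", "Home Machine", "Condor", 1),
    ("Program image corrupt.", "Job", "User", 1),
]


def scopes_by_exit_code(rows):
    """Group the scopes that share each exit code."""
    groups = defaultdict(set)
    for _detail, scope, _handler, code in rows:
        groups[code].add(scope)
    return dict(groups)


ambiguous = scopes_by_exit_code(ROWS)
```

Exit code 1 maps to four different scopes and two different handlers, so a component that sees only the code cannot tell whether to return the job to the user or let Condor deal with it.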

Page 18

[Diagram. Labels: JVM, starter, shadow, Home File System, Wrapper, I/O Library, The Job, Result File, JVM Result, Program Result or Error and Scope, Starter Result + Program Result.]

Page 19

[Diagram. Labels: JVM, starter, shadow, Home File System, Wrapper, I/O Library, The Job, Result File, JVM Result, I/O Proxy, Errors of Larger Scope, Errors Inside Program Scope.]

Page 20

Conclusion

We started building the Java Universe with some naive assumptions about errors.

On encountering practical difficulties, we thought more abstractly about errors and developed the notion of scope and detail.

By routing errors according to their scope, we made the system more robust and usable.

Details in an upcoming paper.

Page 21

Deeper Problems

Systems have deep semantic differences that cross multiple functions.

Consider this self-cleaning program:
• Open a file.
• Delete the file.
• Close the file.

Works on UNIX, fails on WinNT.

Can we really provide a uniform interface?
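
The self-cleaning program can be written out as a short Python sketch; the function name `self_cleaning` and the outcome strings are made up for illustration. On POSIX systems the delete succeeds while the file is open. On Windows, with Python's default way of opening files, the delete is typically refused with `PermissionError`, which the sketch catches so that it can report which behavior it saw.

```python
import os
import tempfile


def self_cleaning():
    """Open a file, delete it while it is still open, then close it."""
    fd, path = tempfile.mkstemp()
    f = os.fdopen(fd, "w")
    try:
        os.remove(path)          # delete while the file is still open
        outcome = "deleted while open"
    except PermissionError:
        outcome = "delete refused while open"
    f.close()
    if os.path.exists(path):
        os.remove(path)          # clean up when the early delete was refused
    return outcome
```

The same three calls give different results depending on the platform, which is the semantic gap the slide is pointing at.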

Page 22

More Info:

Demo on Wednesday Morning
• Room 3381 CS anytime

The Condor Project:
• http://www.cs.wisc.edu/condor

These slides:
• http://www.cs.wisc.edu/~thain

Douglas Thain
• [email protected]

Questions now?