
Carnegie Mellon

Towards Fingerpointing in the Emulab Dynamic Distributed System

Michael P. Kasick, Priya Narasimhan
Carnegie Mellon University

Kevin Atkinson, Jay Lepreau
University of Utah

November 5, 2006

Michael P. Kasick (Carnegie Mellon) Towards Fingerpointing in Emulab November 5, 2006 1 / 37


Introduction to Emulab Classic

University of Utah: Flux Research Group
Network emulation testbed
1300 users
430 local nodes
740 distributed nodes
In service for 6 years


Emulab’s Experiments

Users upload an experiment configuration (NS file)
Configuration specifies virtual node topology
Users granted full, exclusive access to nodes
Nodes automatically redelegated when experiments go idle


Emulab Software Infrastructure

Off-the-shelf components
    Database, OS, etc.

Custom developed components
    Web interface
    Testbed setup & management
    490,000 lines of code

Swap-* procedures
    swap-in, swap-out, swap-modify
    40+ script invocations


Emulab’s Difficulties

System errors reported to operator mailing list
    Average of 82 failure emails per day (April 2006)
    Swap-* procedures are largest sources of errors

Each mail is 100+ lines long
    Problem is not always obvious
    Many underlying causes
    Only a few errors require operator attention


Example Swap-* Failure

    TIMESTAMP: 11:12:38:659154 assign started
    assign foo-bar-5987.ptop foo-bar-5987.top
    ASSIGN FAILED:
    Type precheck passed.
    Node mapping precheck:
    Node mapping precheck succeeded
    Annealing.
    Fixed node: Could not map srs-101 to pc106
    Trying assign on an empty testbed.
    TIMESTAMP: 11:12:40:663660 ptopgen started
    ptopargs -p foo -e bar -a
    TIMESTAMP: 11:12:41:576498 ptopgen finished
    TIMESTAMP: 11:12:41:576719 assign started
    assign -n foo-bar-5987.ptop foo-bar-5987.top
    Precheck succeeded.
    Assign succeeded on an empty testbed.
    *** /usr/testbed/libexec/assign_wrapper:
    Unretriable error. Giving up.
    *** Failed (65) to map to reality.
    Recovering virtual and physical state.
    Doing a recovery swap-in of old state.

236 line email
111 lines of log
4 errors
1 root cause


Problem Summary

Need to automate swap-* failure analysis
    Isolate errors
    Determine error relevance

Can be done with post-processing analysis
    This problem solved by concurrent work
        Emulab’s new tblog logging mechanism

Can we do better?


Local vs. Global Analysis

Analysis of a single swap-* failure:
    Considers a single, local, error domain
    Scope limits precision of fingerpointing

Concurrent analysis of many swap-* failures:
    Considers the entire, global, error domain
    Error correlation increases precision of fingerpointing


The Bigger Problem

Current error reporting does not facilitate global analysis

We propose a new structured error reporting mechanism that does facilitate global analysis


Outline

1 Introduction

2 Initial Attempt at Fingerpointing
    tblog Logging Mechanism
    Lessons Learned

3 Structured Error Reporting
    Ingredients of a Solution
    Development & Deployment
    Initial Results

4 Summary


tblog Logging Mechanism

Perl module interfaces scripts with an error-log database
    Automatically logs diagnostic messages
        stdout, stderr, die(), warn(), etc.
    Provides an API for scripts to write messages
    Records script execution context
        Time stamp, script invocation #, parent script #, etc.


tblog Analysis

Reconstructs script call-chain
Ascertains most recent error at greatest depth
Flags script and its errors as relevant
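The selection rule above can be sketched in a few lines. This is an illustrative Python sketch, not tblog's Perl implementation: the record fields (`invocation`, `parent`, `time`), the function names, and the attribution of each sample message to a script are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LogEntry:
    invocation: int          # script invocation number (assumed field name)
    parent: Optional[int]    # parent invocation, None for the root script
    script: str
    time: float              # time stamp of the message
    message: str


def depth(entry, parents):
    """Count the ancestors of an invocation in the reconstructed call chain."""
    d, inv = 0, entry.invocation
    while parents.get(inv) is not None:
        inv = parents[inv]
        d += 1
    return d


def relevant_error(errors):
    """Pick the error at the greatest depth; break ties by most recent time."""
    parents = {e.invocation: e.parent for e in errors}
    if not errors:
        return None
    return max(errors, key=lambda e: (depth(e, parents), e.time))


# Sample error records; which script logged which message is assumed here.
log = [
    LogEntry(1, None, "swapexp", 4.0, "Update aborted; old state restored."),
    LogEntry(2, 1, "tbswap", 3.0, "Failed (65) to map to reality."),
    LogEntry(3, 2, "assign_wrapper", 2.0, "Unretriable error. Giving up."),
    LogEntry(4, 3, "assign", 1.0, "Fixed node: Could not map srs-101 to pc106"),
]
culprit = relevant_error(log)
print(culprit.script, "-", culprit.message)
```

Sorting on the pair (depth, time) means depth dominates, so an older error deep in the call chain outranks a newer one reported by a parent script.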


tblog Example

swapexp
    tbswap
        assign_wrapper
            ptopgen
            assign
            ptopgen
            assign

Fixed node: Could not map srs-101 to pc106

*** /usr/testbed/libexec/assign_wrapper:
    Unretriable error. Giving up.

*** Failed (65) to map to reality.

*** /usr/testbed/bin/swapexp:
    Update aborted; old state restored.


Opaque Failure Messages

Designed to be human interpretable
    Vague and lacking in context details
        Need context for spatial correlation
        “Unretriable error. Giving up.”

Messages are cumbersome to parse
    Different messages may describe the same error
        “Invalid OS FOO in project bar!”
        “[tb-set-node-os] Invalid osid FOO”
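To see why parsing free-form messages is cumbersome, consider mapping the two messages above to a single error type. The sketch below is a hypothetical Python illustration: the error-type name `invalid_os`, the patterns, and the `classify` helper are assumptions, and every additional phrasing of the same error would need its own hand-written pattern.

```python
import re

# Hypothetical patterns: two phrasings of the same underlying error.
PATTERNS = [
    (re.compile(r"^Invalid OS (?P<osid>\S+) in project (?P<project>\S+)!$"),
     "invalid_os"),
    (re.compile(r"^\[tb-set-node-os\] Invalid osid (?P<osid>\S+)$"),
     "invalid_os"),
]


def classify(message):
    """Return (error type, extracted context), or (None, {}) if unrecognized."""
    for pattern, error_type in PATTERNS:
        match = pattern.match(message)
        if match:
            return error_type, match.groupdict()
    return None, {}


a = classify("Invalid OS FOO in project bar!")
b = classify("[tb-set-node-os] Invalid osid FOO")
c = classify("Unretriable error. Giving up.")
print(a)
print(b)
print(c)
```

The third message matches no pattern and carries no context to extract, which is the situation a structured report is meant to avoid.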


Absence of Error Context

tblog only captures a general context
    time stamp, script name, etc.

No provision for capturing error-specific context
    nodes, OS images, configuration, etc.

Reporting must preserve the error-specific context
    Required for error correlation
    Facilitates global analysis


Discrete Error Types

Identify distinct errors
    Defined by a specification for each error type
        named error type, definition, associated specific context

Node-boot failure example:
    Error type: node boot failed
    Definition: Node failed to boot twice
    Context: node, type, osid
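One way to picture such a specification is as a small record type that names the error, states its definition, and lists the context keys every report must supply. This is a Python sketch under assumptions: the class names, the identifier `node_boot_failed`, and the sample context values are illustrative and do not reproduce the prototype's Perl interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorSpec:
    name: str             # named error type
    definition: str       # human-readable definition
    context_keys: tuple   # error-specific context each report must carry


@dataclass
class ErrorReport:
    spec: ErrorSpec
    context: dict

    def __post_init__(self):
        # Reject reports that omit context required by the specification.
        missing = set(self.spec.context_keys) - set(self.context)
        if missing:
            raise ValueError(f"missing context: {sorted(missing)}")


# The node-boot failure example from the slide, with an assumed identifier.
NODE_BOOT_FAILED = ErrorSpec(
    name="node_boot_failed",
    definition="Node failed to boot twice",
    context_keys=("node", "type", "osid"),
)

# Sample values are made up for illustration.
report = ErrorReport(
    NODE_BOOT_FAILED,
    {"node": "pc297", "type": "pc3000", "osid": "cust_os1"},
)
print(report.spec.name, report.context)
```

Validating the context at report time keeps every stored report comparable with others of the same type, which later correlation depends on.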


Error Context & Propagation

Context distinguishes between errors of the same type
    Node boot failures across different nodes
    Node boot failures with different OSes

Propagation centers focus on relevant errors
    Nested scripts should propagate the primary error
    Otherwise parent scripts generate “me-too” errors
    Secondary (“me-too”) errors add noise
    Achievable with exceptions (RPC, middleware)
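Exception-based propagation can be illustrated within a single process. The Python sketch below is a language-level illustration of the idea, not Emulab code: `ReportedError`, `boot_node`, `swap_in`, and `run_session` are hypothetical names. The intermediate layer does not catch and re-describe the failure, so the top level receives the primary error with its context intact and no "me-too" error is generated.

```python
class ReportedError(Exception):
    """Carries the primary error's type and context up the call chain."""

    def __init__(self, error_type, **context):
        super().__init__(error_type)
        self.error_type = error_type
        self.context = context


def boot_node(node, osid):
    # Innermost layer: the point where the primary error is detected.
    raise ReportedError("node_boot_failed", node=node, osid=osid)


def swap_in(nodes, osid):
    # Middle layer: no try/except, so the primary error passes through
    # unchanged instead of being replaced by a generic failure message.
    for node in nodes:
        boot_node(node, osid)


def run_session(nodes, osid):
    # Top layer: the only place the error is turned into a report.
    try:
        swap_in(nodes, osid)
    except ReportedError as err:
        return err.error_type, err.context
    return None, {}


result = run_session(["pc297"], "cust_os1")
print(result)
```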


Research Phase

Used tblog to identify a set of target errors
    Goal was not to obtain 100% coverage
    System functionality is always expanding
    Small portion of possible errors actually observed

Drafted error specifications and error types
    Required significant knowledge of errors and meaning
    Eliminated error ambiguities
    Identified relevant error specific context


Development Phase

Developed a prototype Perl reporting module
    Structured error reporting function
    Error parsers for C++ & TCL language components

Added reporting hooks for the target errors
    Problem: Emulab provides no error propagation
    Nested scripts return success or failure only
    Fix: severity-level assignment
    Alternative: tblog post-processing analysis


Testing & Deployment Phases

Tested prototype in elabinelab
Integrated prototype into tblog framework
    New local analysis engine: tbreport

Deployed on the production Emulab testbed
    750 lines of added or changed code


Initial Results

Data collected August 16-24th, 2006
681 swap-* sessions started
    108 (17.3%) reported at least one error

283 total fatal errors reported
    Many errors repeated for each node in a session
    118 unique instances of errors in a given session
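The reduction from raw fatal errors to unique instances can be sketched as deduplication within each session. The Python sketch below assumes that "unique in a session" means one instance per (session, error type) pair; the sample records, identifiers, and the `normalize` helper are hypothetical and are not the collected data.

```python
from collections import Counter

# Hypothetical raw reports: (session id, error type, node).
raw = [
    (1, "node_boot_failed", "pc297"),
    (1, "node_boot_failed", "pc301"),
    (1, "node_boot_failed", "pc305"),
    (2, "ns_parse_failed", None),
    (3, "node_boot_failed", "pc297"),
]


def normalize(reports):
    """Keep one instance per (session, error type), dropping per-node repeats."""
    return sorted({(session, etype) for session, etype, _ in reports})


unique = normalize(raw)
by_type = Counter(etype for _, etype in unique)
print(len(raw), "raw ->", len(unique), "unique")
print(dict(by_type))
```

Here three per-node repeats in session 1 collapse into a single instance, which is the effect that takes the reported total from 283 fatal errors down to 118 unique instances.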


Error Statistics

Occurrences     Error Type
31   26.3%      assign violation/feasible (resource shortage)
24   20.3%      assign type precheck/feasible (node shortage)
22   18.6%      node boot failed
10    8.5%      ns parse failed (bad experiment configuration)
...  ...        ...

Normalized errors (unique in a session) grouped by error type.
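The percentages in the table are consistent with taking the 118 unique error instances from the Initial Results slide as the denominator. The short Python check below recomputes them; the variable names are illustrative.

```python
# Counts per error type as listed in the table; 118 is the number of unique
# error instances reported on the Initial Results slide.
TOTAL_UNIQUE = 118
counts = {
    "assign violation/feasible": 31,
    "assign type precheck/feasible": 24,
    "node boot failed": 22,
    "ns parse failed": 10,
}

shares = {name: round(100 * n / TOTAL_UNIQUE, 1) for name, n in counts.items()}
unlisted = TOTAL_UNIQUE - sum(counts.values())  # instances in the elided rows

for name, n in counts.items():
    print(f"{n:3d}  {shares[name]:4.1f}%  {name}")
print(f"{unlisted:3d}  (error types not listed)")
```

The four listed types account for 87 of the 118 instances, leaving 31 instances spread over the error types elided from the table.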


Node Shortage Failures

Second most common error (20.3%)
    Insufficient free nodes for experiment swap-in
    Current node availability is listed on website

Illustrates user demand
    48% due to lack of pc3000


Other Resource Shortage Failures

Most common error (26.3%)
    Sufficient free nodes to swap-in, but:
        Attempted assignment violated mapping constraints
        Often due to oversubscribed switch bandwidth

Assignment algorithm is non-deterministic
    User cannot predict when these errors might occur
    Later attempts may succeed w/o topology change
        Frequent resubmissions lead to further errors


Node-Boot Failures

Third most common error (18.6%)
    Node status daemon
        Reports boot success
        Timeout results in error

Many underlying causes
    Faulty hardware, broken user contributed OS, etc.

Motivating scenario for our future research


Node-Boot Failure Example (I)

[Figure: node pc297 and OS image cust_os1]

Single node, one session
Unknown culprit


Node-Boot Failure Example (II)

[Figure: nodes pc297 and pc301, each with OS image cust_os1]

Two nodes, two sessions, same OS
Suggests bad OS


Node-Boot Failure Example (III)

[Figure: nodes pc297 and pc301, each shown with OS images cust_os1 and cust_os2]

Same two nodes, four sessions, different OS
Strongly suggests bad OS


Node-Boot Failures: What’s Next?

Cannot diagnose root cause from a single trace
Operator dilemma:
    Assume node is faulty and quarantine?
    Assume OS is faulty and leave node as is?

Motivates global fingerpointing (future work)
    Correlation of multiple error instances
    Reliably fingerpoints the culprit
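Since the global analysis engine is left as future work, the following Python sketch only illustrates what correlating multiple error instances could look like. The `correlate` function and its rule (an OS image that fails on more than one node becomes a suspect, as does a node that fails with more than one OS image) are assumptions for illustration, not the authors' method. It looks at failure reports only and ignores successful boots.

```python
from collections import defaultdict


def correlate(failures):
    """Rank suspects from (session, node, osid) node-boot failure reports."""
    nodes_per_os = defaultdict(set)
    oses_per_node = defaultdict(set)
    for _, node, osid in failures:
        nodes_per_os[osid].add(node)
        oses_per_node[node].add(osid)

    suspects = []
    # An OS image failing on several distinct nodes points at the image.
    for osid, nodes in nodes_per_os.items():
        if len(nodes) > 1:
            suspects.append(("os", osid, len(nodes)))
    # A node failing with several distinct OS images points at the node.
    for node, oses in oses_per_node.items():
        if len(oses) > 1:
            suspects.append(("node", node, len(oses)))
    return sorted(suspects, key=lambda s: -s[2])


single = correlate([(1, "pc297", "cust_os1")])
two_nodes = correlate([(1, "pc297", "cust_os1"), (2, "pc301", "cust_os1")])
print(single)
print(two_nodes)
```

With a single failure no suspect is named, matching the "unknown culprit" case; with the same OS image failing on two nodes in two sessions, the image is flagged.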


Summary

Manual diagnosis of system errors is costly
tblog-style analysis aids in message filtering
Opaque failure messages limit error usefulness
Structured error reports enable global analysis
Global analysis fingerpoints errors with fine granularity
Future work:
    Develop a global analysis engine for Emulab
    Start by targeting the identified node-boot failure scenario
    Target other real-world systems for error analysis


Further Reading

Michael P. Kasick, Priya Narasimhan, Kevin Atkinson, and Jay Lepreau. Towards fingerpointing in the Emulab dynamic distributed system. In Proceedings of the 3rd USENIX Workshop on Real, Large Distributed Systems (WORLDS ’06), Seattle, WA, November 2006.

Brian White, Jay Lepreau, Leigh Stoller, Robert Ricci, Shashi Guruprasad, Mac Newbold, Mike Hibler, Chad Barb, and Abhijeet Joglekar. An integrated experimental environment for distributed systems and networks. In Proceedings of the Fifth Symposium on Operating System Design and Implementation (OSDI ’02), pages 255–270, Boston, MA, December 2002.
