
Fabián E. Bustamante, Winter 2006

Understanding and dealing with operator mistakes in Internet services

K. Nagaraja, F. Oliveira, R. Bianchini, R. Martin, T. Nguyen, Rutgers University

OSDI 2004

Vivo Project http://vivo.cs.rutgers.edu

(based on slides from the authors’ OSDI presentation)

CS 395/495 Autonomic Computing Systems, EECS, Northwestern University

Motivation

Internet services are ubiquitous, e.g., Google, Yahoo!, eBay

– Users expect 24x7 availability, but service outages still happen!

A significant number of outages in Internet services are the result of operator actions

1: Architecture is complex

2: Systems are constantly evolving

3: Lack of tools for operators to reason about the impact of their actions: offline testing, emulation, simulation

Very little detail is available on operator mistakes
– Details are strongly guarded by companies and administrators


This work

Understanding: Gather detailed data on operators’ mistakes
– What categories of mistakes?
– What’s the impact on the service?
– How do mistakes correlate with experience and impact?
– Caveat: this is not a complete study of operator behavior

Approaches to deal with operator mistakes: prevention, recovery, automation

Validation: Allow operators to evaluate the correctness of their actions prior to exposing them to the service
– Like offline testing, but:
• Virtual environment (extension of online environment)
• Real workload
• Migration back and forth with minimal operator involvement


Contributions

Detailed information on operator tasks and mistakes
– 43 experiments: detailed data on operator behavior, including 42 mistakes
– 64% immediately degraded throughput
– 57% were software configuration mistakes
– Human experiments are possible and valuable!

Designed and prototyped a validation infrastructure
– Implemented on 2 cluster-based services: a cooperative Web server (PRESS) and a multi-tier auction service
– 2 techniques to allow operators to validate their actions

Demonstrated that validation is a promising technique for reducing the impact of operator mistakes
– 66% of all mistakes observed in the operator study were caught
– 6 of 9 mistakes caught in live operator experiments with validation
– Successfully tested with synthetically injected mistakes


Talk outline

Approach and contributions

Operator study: Understanding the mistakes
– Representative environment
– Choice of human subjects and experiments
– Results

Validation: Preventing exposure of mistakes

Conclusion and future work


Multi-tiered Internet services

Online auction service, similar to eBay (code from the DynaServer project)

[Figure: three-tier service architecture. Tier 1: Web servers; Tier 2: application servers; Tier 3: database. A client emulator exercises the service.]


Tasks, operators & training

Tasks: two categories
– Scheduled maintenance tasks (proactive), e.g., upgrading software
– Diagnose-and-repair tasks (reactive), e.g., disk failure

Operator composition
– 14 computer science graduate students
– 5 professional programmers (Ask Jeeves)
– 2 sysadmins from our department

Categorization of operators, based on a filled-in questionnaire
– 11 novices: some familiarity with the setup
– 5 intermediates: experience with a similar service
– 5 experts: in charge of a service requiring high uptime

Operator training
– Novice operators given warm-up tasks
– Material describing the service, and detailed steps for tasks


Experimental setup

Service
– 3-tier auction service and client emulator from Rice University’s DynaServer Project
– Loaded at 35% of capacity

Machines
– 2 Web servers (Apache)
– 5 application servers (Tomcat)
– 1 database machine (MySQL)

Operator assistance & data capture
– Monitor service throughput
– Modified bash shell for command and result trace

Manual observation
– Noting anomalies in operator behavior
– Bailing out ‘lost’ operators


Example trace

Task: Add an application server
– Mistake: Apache misconfiguration
– Impact: Degraded throughput

[Figure: example trace, annotated with three events: application server added; first Apache misconfigured and restarted; second Apache misconfigured and restarted]


Sampling of other mistakes

Adding a new application server
– Omission of new application server from backend member list
– Syntax errors, duplicate entries, wrong hostnames
– Launching the wrong version of software

Migrating the database for performance upgrade
– Incorrect privileges for accessing the database
• Security vulnerability
– Database installed on wrong disk
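
Several of the member-list mistakes above (omitted server, duplicate entries, wrong hostnames) are mechanical enough to check automatically. The following is a minimal sketch, assuming a hypothetical plain-text member list with one hostname per line; the file format, the function name `check_member_list`, and the host names are illustrative assumptions, not the configuration format used in the study.

```python
from collections import Counter


def check_member_list(lines, expected_hosts):
    """Return a list of problems found in a backend member list.

    `lines` is the raw config (one hostname per line, '#' starts a comment).
    `expected_hosts` is the set of hosts that should be listed.
    Both the format and the names are hypothetical.
    """
    # Strip comments and surrounding whitespace, then drop empty lines.
    members = [ln.split("#", 1)[0].strip() for ln in lines]
    members = [m for m in members if m]

    problems = []
    for host, count in sorted(Counter(members).items()):
        if count > 1:
            problems.append(f"duplicate entry: {host}")
        if host not in expected_hosts:
            problems.append(f"unknown hostname: {host}")
    for host in sorted(set(expected_hosts) - set(members)):
        problems.append(f"missing member: {host}")
    return problems


config = ["app1", "app2", "app2", "ap3  # typo", ""]
for problem in check_member_list(config, {"app1", "app2", "app3"}):
    print(problem)
```

A check like this only covers the syntactic mistakes; launching the wrong software version or installing the database on the wrong disk would need other checks.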


Operator mistakes: Category vs. impact

64% of all mistakes had an immediate impact on service performance
– 36% resulted in latent faults

Obs. #1: A significant number of mistakes can be caught by testing in a realistic environment

Obs. #2: Undetectable latent errors will still require online recovery techniques

[Figure: bar chart of # of mistakes (y-axis, 0 to 20) by impact category: Degraded throughput, Service inaccessible, Increased MTTR, Incomplete component integration, Security vulnerability, Web server potentially inaccessible, Reduced system capacity, Potential database crash]


Operator mistakes

Misconfigurations account for 57% of all errors
– Config. mistakes spanning multiple components (global misconfigurations) are more likely

Obs. #1: Tools to manipulate & check configs are crucial

Obs. #2: Care is needed when maintaining multiple versions of software

[Figure: bar chart of # of mistakes (y-axis, 0 to 16) by mistake category: Local config, Global config, Incorrect restart, Start of wrong SW version, Unnecessary restart of SW, Unnecessary HW replacement, Wrong choice of HW]


Operator categories

Experts also made mistakes!
– Complexity of tasks executed by experts was higher

[Figure: bar chart of the ratio of mistakes to experiments (y-axis, 0 to 1.2) by mistake category (Local config, Global config, Incorrect restart, Start of wrong SW version, Unnecessary restart of SW, Unnecessary HW replacement, Wrong choice of HW), with one series each for Novice, Intermediate, and Expert operators]


Summary of operator study

43 experiments, 42 mistakes

27 (64%) mistakes caused immediate impact on service performance

24 (57%) were software configuration mistakes

Mistakes were made across all operator categories

Traces of operator commands & service performance for all experiments
– Available at http://vivo.cs.rutgers.edu
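
As a quick consistency check, the percentages quoted in this summary follow from the raw counts on the slide. The short sketch below only redoes that arithmetic; the helper name `percent` is illustrative.

```python
TOTAL_MISTAKES = 42      # from the slide
IMMEDIATE_IMPACT = 27    # from the slide
CONFIG_MISTAKES = 24     # from the slide


def percent(part, whole):
    """Share of `whole` represented by `part`, rounded to a whole percent."""
    return round(100 * part / whole)


immediate = percent(IMMEDIATE_IMPACT, TOTAL_MISTAKES)
latent = percent(TOTAL_MISTAKES - IMMEDIATE_IMPACT, TOTAL_MISTAKES)
config = percent(CONFIG_MISTAKES, TOTAL_MISTAKES)
print(immediate, latent, config)  # prints: 64 36 57
```

The remaining 15 mistakes are the 36% reported earlier as latent faults.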


Talk outline

Approach and contributions

Operator study: Understanding the mistakes

Validation: Preventing exposure of mistakes
– Technique
– Experimental evaluation

Conclusion and future work


Validation of operator’s actions

Validation
– Allow the operator to check the correctness of his/her actions prior to exposing their impact to the service interface (clients)
– Correctness is tested by:
• Migrating the component(s) to a virtual sandbox environment,
• Subjecting them to a real load,
• Comparing their behavior to a known correct one, and
• Migrating them back to the online environment

Types of validation:
– Replica-based: Compare with online replica (real time)
– Trace-based: Compare with logged behavior
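
The difference between the two types of validation comes down to where the "known correct" behavior comes from. The sketch below is a simplified illustration, not the authors' infrastructure: components are modeled as plain Python functions from request to response, and all names (`validate`, `replica_reference`, `trace_reference`, the toy servers) are hypothetical.

```python
def validate(component, requests, reference):
    """Run `requests` through `component` in isolation and return the
    requests whose response differs from the reference behavior."""
    mismatches = []
    for request in requests:
        if component(request) != reference(request):
            mismatches.append(request)
    return mismatches


def replica_reference(online_replica):
    """Replica-based: the reference is a live, known-good replica."""
    return online_replica


def trace_reference(logged_responses):
    """Trace-based: the reference is a log of earlier correct responses."""
    return lambda request: logged_responses[request]


def correct_server(request):
    return f"200 {request}"


def misconfigured_server(request):
    # Toy mistake: image requests fail after a bad configuration change.
    if request.startswith("/img"):
        return "404"
    return f"200 {request}"


requests = ["/home", "/img/logo.png", "/bid?item=7"]
trace = {r: correct_server(r) for r in requests}

print(validate(misconfigured_server, requests, replica_reference(correct_server)))
print(validate(misconfigured_server, requests, trace_reference(trace)))
print(validate(correct_server, requests, trace_reference(trace)))
```

In this toy model a component would be migrated back online only when `validate` returns an empty list.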


Validating a component: Replica-based

[Figure: replica-based validation. Client requests flow through Tier 1 (Web servers), Tier 2 (application servers), and Tier 3 (database). Diagram labels: online slice, validation slice, application server, Web server proxy, database proxy, shunt, compare, application state.]


Validating a component: Trace-based

[Figure: trace-based validation. Client requests flow through Tier 1 (Web servers), Tier 2 (application servers), and Tier 3 (database). Diagram labels: online slice, validation slice, application server, Web server proxy, database proxy, shunt, compare, state.]


Implementation details

Shunting performed in middleware layer
– Each request tagged with a unique ID all along the request path

Component proxies can be constructed with little effort (the MySQL proxy is ~384 NCSL, vs. 402K NCSL for MySQL itself)
– Reuse discovery and communication interfaces, common messaging core

State management requires well-defined export and import API
– Stateful servers often support such API

Comparator functions to detect errors
– Simple throughput, flow, and content comparators
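
Request tagging and comparators fit together as follows: the unique ID lets a response from the validation slice be matched against the corresponding online response. The sketch below shows one possible shape for a content comparator and a throughput comparator; it is an assumption-laden illustration, not the prototype's code. The function names, the dictionary representation of responses, and the 10% tolerance are all hypothetical, and the flow comparator is omitted because the slides do not describe it.

```python
import itertools

_next_id = itertools.count(1)


def tag(request):
    """Attach a unique ID so responses can be matched along the request path."""
    return {"id": next(_next_id), "payload": request}


def content_comparator(online, validation):
    """Return IDs whose validation response differs from, or is missing
    relative to, the online response. Both arguments map ID -> body."""
    return sorted(rid for rid, body in online.items()
                  if validation.get(rid) != body)


def throughput_comparator(online_count, validation_count, tolerance=0.1):
    """True if validation throughput is within `tolerance` (a hypothetical
    10% by default) of the online throughput over the same interval."""
    if online_count == 0:
        return validation_count == 0
    return abs(online_count - validation_count) / online_count <= tolerance


tagged = [tag(p) for p in ["/home", "/bid", "/search"]]
online = {t["id"]: t["payload"].upper() for t in tagged}
validation = dict(online)
validation[tagged[1]["id"]] = "ERROR 500"   # injected content mistake

print(content_comparator(online, validation))
print(throughput_comparator(100, 95), throughput_comparator(100, 60))
```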


Validating our prototype: results

Live operator experiments
– Operator given the choice of validation type and duration, and the option to skip validation
– Validation caught 6 out of 9 mistakes from 8 experiments with validation

Mistake-injection experiments
– Validation caught errors in data content (inaccessible files, corrupted files) and configuration mistakes (incorrect # of workers in Web server degraded throughput)

Operator-emulation experiments
– Operator command scripts derived from the 42 operator mistakes
– Both trace-based and replica-based validation caught 22 mistakes
• Multi-component validation caught 4 latent (component interaction) mistakes


Reduction in impact with validation

[Figure: bar chart of # of mistakes (y-axis, 0 to 20) by impact category (Degraded throughput, Service inaccessible, Increased MTTR, Incomplete component integration, Security vulnerability, Web server potentially inaccessible, Reduced system capacity, Potential database crash), with two series: Mistakes and Mistakes with validation]


Fewer mistakes with validation

[Figure: bar chart of # of mistakes (y-axis, 0 to 18) by mistake category (Local config, Global config, Incorrect restart, Start of wrong SW version, Unnecessary restart of SW, Unnecessary HW replacement, Wrong choice of HW), with two series: Mistakes and Mistakes with validation]


Shunting & buffering overheads

Shunting overhead for replica-based validation: 39% additional CPU
– All requests and responses are captured and forwarded to the validation slice
– Trace-based validation is slightly better: 32% additional CPU
– Overhead is incurred on a single component, and only during validation

Various optimizations can reduce overhead to 13-22%
– Examples: response summary (64 bytes), sampling (session boundaries)

Buffering capacity during state checkpointing and duplication
– Required to buffer only about 150 requests for small state sizes
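
The two optimizations named above can be illustrated with a small sketch. The slide gives only the summary size (64 bytes) and the phrase "session boundaries"; everything else here is an assumption. SHA-512 is used purely because its digest happens to be 64 bytes, and sampling is interpreted as choosing whole sessions so that every request of a sampled session is validated. The function names are hypothetical.

```python
import hashlib


def summarize(response: bytes) -> bytes:
    """Fixed-size 64-byte summary of a response body (SHA-512 digest).
    Forwarding this instead of the full body reduces shunting traffic."""
    return hashlib.sha512(response).digest()


def session_is_sampled(session_id: str, percent: int) -> bool:
    """Decide per session, not per request, whether to shunt traffic.
    Hashing the session ID keeps the decision stable for the whole session."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent


page = b"<html>" + b"x" * 10_000 + b"</html>"
print(len(page), len(summarize(page)))
print(summarize(page) == summarize(page))
print(summarize(page) == summarize(page + b"!"))
```

With summaries, the comparator on the validation slice would compare digests rather than full bodies, so it can detect that a response differs but not where.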


Caveats, limitations & open issues

Non-determinism increases complexity of comparators and proxies
– E.g., choice of back-end server, remote cache vs. local disk, pseudo-random session ID, time stamps

Hard state management may require operator intervention
– Component requires initialization prior to online migration

Bootstrapping the validation
– Validating an intended modification of service behavior: nothing to compare with!

How long to validate? What types of validation?
– Duration spent in validation implies reduced online capacity

Future work: Taking validation further…
– Validate operator actions on databases, network components
– Combine validation with diagnosis for assisting operators
– Other validation techniques: model-based validation
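
The non-determinism caveat on this slide can be made concrete: a naive content comparator would flag two correct responses as different merely because they carry different session IDs or time stamps. One common workaround, sketched here as an assumption rather than as the authors' approach, is to mask such fields before comparing. The regular expressions and response format are hypothetical; a real service would need patterns matched to its own output.

```python
import re

# Hypothetical patterns for fields that legitimately differ between runs.
_TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}")
_SESSION_ID = re.compile(r"(session[-_]?id=)[0-9A-Za-z]+", re.IGNORECASE)


def normalize(response: str) -> str:
    """Mask time stamps and session IDs so they do not affect comparison."""
    response = _TIMESTAMP.sub("<TIME>", response)
    return _SESSION_ID.sub(r"\1<SID>", response)


def responses_match(online: str, validation: str) -> bool:
    """Content comparison that ignores the masked non-deterministic fields."""
    return normalize(online) == normalize(validation)


a = "session-id=ab12 served 2006-01-15 10:00:01 price=10"
b = "session-id=zz99 served 2006-01-15 10:00:07 price=10"
c = "session-id=zz99 served 2006-01-15 10:00:07 price=12"

print(responses_match(a, b))  # differs only in masked fields
print(responses_match(a, c))  # price differs, so a real mismatch
```

Masking does not help with the other sources listed, such as the choice of back-end server, which change the request path rather than the response text.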