Fabián E. Bustamante, Winter 2006
Understanding and dealing with operator mistakes in Internet services
K. Nagaraja, F. Oliveira, R. Bianchini, R. Martin, T. Nguyen, Rutgers University
OSDI 2004
Vivo Project http://vivo.cs.rutgers.edu
(based on slides from the authors’ OSDI presentation)
CS 395/495 Autonomic Computing Systems, EECS, Northwestern University
Motivation
Internet services are ubiquitous, e.g., Google, Yahoo!, eBay, etc.
– Expect 24 x 7 availability, but service outages still happen!
A significant number of outages in Internet services are the result of operator actions
1: Architecture is complex
2: Systems are constantly evolving
3: Lack of tools for operators to reason about the impact of their actions: Offline testing, emulation, simulation
Very little detail on operator mistakes
– Details strongly guarded by companies and administrators
This work
Understanding: Gather detailed data on operators’ mistakes
– What categories of mistakes?
– What’s the impact on the service?
– How do mistakes correlate with experience, impact?
– Caveat: this is not a complete study of operator behavior
Approaches to deal with operator mistakes: prevention, recovery, automation
Validation: Allow operators to evaluate the correctness of their actions prior to exposing them to the service
– Like offline testing, but:
• Virtual environment (extension of online environment)
• Real workload
• Migration back and forth with minimal operator involvement
Contributions
Detailed information on operator tasks and mistakes
– 43 experiments: detailed data on operator behavior, including 42 mistakes
– 64% immediately degraded throughput
– 57% were software configuration mistakes
– Human experiments are possible and valuable!
Designed and prototyped a validation infrastructure
– Implemented on 2 cluster-based services: cooperative Web server (PRESS) and a multi-tier auction service
– 2 techniques to allow operators to validate their actions
Demonstrated validation is a promising technique for reducing impact of operator mistakes
– 66% of all mistakes observed in operator study caught
– 6/9 mistakes caught in live operator experiments with validation
– Successfully tested with synthetically injected mistakes
Talk outline
Approach and contributions
Operator study: Understanding the mistakes
– Representative environment
– Choice of human subjects and experiments
– Results
Validation: Preventing exposure of mistakes
Conclusion and future work
Multi-tiered Internet services
[Figure: three-tier on-line auction service (~ eBay), built from DynaServer project code]
Tier 1: Web servers
Tier 2: Application servers
Tier 3: Database
Client emulator exercises the service
Tasks, operators & training
Tasks: two categories
– Scheduled maintenance tasks (proactive), e.g., software upgrade
– Diagnose-and-repair tasks (reactive), e.g., disk failure
Operator composition
– 14 computer science graduate students
– 5 professional programmers (Ask Jeeves)
– 2 sysadmins from our department
Categorization of operators, based on a filled-in questionnaire
– 11 novices: some familiarity with the setup
– 5 intermediates: experience with a similar service
– 5 experts: in charge of a service requiring high uptime
Operator training
– Novice operators given warm-up tasks
– Material describing service, and detailed steps for tasks
Experimental setup
Service
– 3-tier auction service, and client emulator from Rice University’s DynaServer Project
– Loaded at 35% of capacity
Machines
– 2 Web servers (Apache)
– 5 application servers (Tomcat)
– 1 database machine (MySQL)
Operator assistance & data capture
– Monitor service throughput
– Modified bash shell for command and result trace
Manual observation
– Noting anomalies in operator behavior
– Bailing out ‘lost’ operators
Example trace
Task: Add an application server
– Mistake: Apache misconfiguration
– Impact: Degraded throughput
[Figure: example trace with annotated events: application server added; first Apache misconfigured and restarted; second Apache misconfigured and restarted]
Sampling of other mistakes
Adding a new application server
– Omission of new application server from backend member list
– Syntax errors, duplicate entries, wrong hostnames
– Launching the wrong version of software
Migrating the database for performance upgrade
– Incorrect privileges for accessing the database
• Security vulnerability
– Database installed on wrong disk
Operator mistakes: Category vs. impact
64% of all mistakes had immediate impact on service performance
– 36% resulted in latent faults
Obs. #1: A significant number of mistakes can be checked by testing in a realistic environment
Obs. #2: Undetectable latent errors will still require online recovery techniques
[Figure: bar chart of # of mistakes (y-axis, 0 to 20) per impact category (x-axis): degraded throughput, service inaccessible, increased MTTR, incomplete component integration, security vulnerability, Web server potentially inaccessible, reduced system capacity, potential database crash]
Operator mistakes
[Figure: bar chart of # of mistakes (y-axis, 0 to 16) per mistake category (x-axis): local config, global config, incorrect restart, start of wrong SW version, unnecessary restart of SW, unnecessary HW replacement, wrong choice of HW]
Misconfigurations account for 57% of all errors
– Config. mistakes spanning multiple components (global misconfigurations) are more likely
Obs. #1: Tools to manipulate & check configs are crucial
Obs. #2: Care is needed when maintaining multiple versions of software
Operator categories
Experts also made mistakes!
– Complexity of tasks executed by experts was higher
[Figure: bar chart of the ratio of mistakes/experiments (y-axis, 0 to 1.2) per mistake category (x-axis): local config, global config, incorrect restart, start of wrong SW version, unnecessary restart of SW, unnecessary HW replacement, wrong choice of HW. Series: novice, intermediate, expert]
Summary of operator study
43 experiments, 42 mistakes
27 (64%) mistakes caused immediate impact on service performance
24 (57%) were software configuration mistakes
Mistakes were made across all operator categories
Trace of operator commands & service performance for all experiments
– Available at http://vivo.cs.rutgers.edu
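The percentages on this slide follow from the raw counts. A quick check is sketched below; the variable and function names are illustrative only, and the latent-fault share is taken as the complement of the immediate-impact share, matching the 36% quoted earlier.

```python
# Counts taken from the summary slide above.
total_mistakes = 42
immediate_impact = 27
config_mistakes = 24


def share(part: int, whole: int) -> int:
    """Return part/whole as a percentage rounded to the nearest integer."""
    return round(100 * part / whole)


immediate_pct = share(immediate_impact, total_mistakes)
config_pct = share(config_mistakes, total_mistakes)
# Mistakes without immediate impact are the latent ones.
latent_pct = 100 - immediate_pct

print(immediate_pct, config_pct, latent_pct)
```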
Talk outline
Approach and contributions
Operator study: Understanding the mistakes
Validation: Preventing exposure of mistakes
– Technique
– Experimental evaluation
Conclusion and future work
Validation of operator’s actions
Validation
– Allow operator to check correctness of his/her actions prior to exposing their impact to the service interface (clients)
– Correctness is tested by:
• Migrating the component(s) to a virtual sandbox environment,
• Subjecting them to a real load,
• Comparing behavior to a known correct one, and
• Migrating back to the online environment
Types of validation:
– Replica-based: Compare with online replica (real time)
– Trace-based: Compare with logged behavior
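The two validation types differ only in where the known-correct behavior comes from. The sketch below is a toy illustration under strong assumptions, not the authors' infrastructure: components are modeled as plain functions from request to response, and `validate`, `replica_based`, `trace_based`, and the two toy servers are hypothetical names.

```python
from typing import Callable, Iterable, List, Tuple

Request = str
Response = str
Handler = Callable[[Request], Response]
Mismatch = Tuple[Request, Response, Response]


def validate(candidate: Handler,
             requests: Iterable[Request],
             reference: Iterable[Response]) -> List[Mismatch]:
    """Feed requests to the candidate; return (request, expected, actual) mismatches."""
    mismatches = []
    for req, expected in zip(requests, reference):
        actual = candidate(req)
        if actual != expected:
            mismatches.append((req, expected, actual))
    return mismatches


def replica_based(candidate: Handler, replica: Handler,
                  requests: List[Request]) -> List[Mismatch]:
    """Reference responses come from a known-good replica serving the same requests."""
    return validate(candidate, requests, (replica(r) for r in requests))


def trace_based(candidate: Handler,
                trace: List[Tuple[Request, Response]]) -> List[Mismatch]:
    """Reference responses come from a previously logged (request, response) trace."""
    return validate(candidate,
                    [req for req, _ in trace],
                    [resp for _, resp in trace])


# Toy components (assumptions): a correct server and a misconfigured one.
def good_server(req: Request) -> Response:
    return f"200 {req}"


def misconfigured_server(req: Request) -> Response:
    return "503" if req.startswith("/bid") else f"200 {req}"


workload = ["/browse", "/bid?item=7", "/about"]
trace = [(r, good_server(r)) for r in workload]

replica_result = replica_based(misconfigured_server, good_server, workload)
trace_result = trace_based(misconfigured_server, trace)
clean_result = replica_based(good_server, good_server, workload)
```

Only a component with an empty mismatch list would be migrated back to the online environment in this toy model.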
Validating a component: Replica-based
[Figure: replica-based validation. Client requests enter the online slice (Tier 1: Web servers, Tier 2: application servers, Tier 3: database). A shunt forwards traffic to the validation slice, where the application server under validation sits between a Web server proxy and a database proxy. Compare points check its behavior and application state against an online application server.]
Validating a component: Trace-based
[Figure: trace-based validation. The validation slice holds the application server under validation between a Web server proxy and a database proxy; the online slice holds Tier 1 (Web servers), Tier 2 (application servers), and Tier 3 (database). Client requests pass through a shunt, and the compare points check the validation slice against logged state.]
Implementation details
Shunting performed in middleware layer
– Each request tagged with a unique ID all along the request path
Component proxies can be constructed with little effort (MySQL proxy is ~384 NCSL (402K NCSL))
– Reuse discovery and communication interfaces, common messaging core
State management requires well-defined export and import API
– Stateful servers often support such an API
Comparator functions to detect errors
– Simple throughput, flow, and content comparators
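As a rough illustration of how ID tagging lets comparators pair up responses, the sketch below tags each request in a toy shunt and then runs a content comparator and a throughput comparator. Everything here is an assumption for illustration: the `Shunt` class, the comparator signatures, and the 10% throughput tolerance are hypothetical and are not taken from the prototype.

```python
import itertools
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class TaggedRequest:
    request_id: int
    payload: str


class Shunt:
    """Tags each request with a unique ID and copies it to both slices."""

    def __init__(self, online: Callable[[str], str],
                 validation: Callable[[str], str]) -> None:
        self._ids = itertools.count(1)
        self.online = online
        self.validation = validation
        self.online_responses: Dict[int, str] = {}
        self.validation_responses: Dict[int, str] = {}

    def handle(self, payload: str) -> str:
        req = TaggedRequest(next(self._ids), payload)
        online_resp = self.online(req.payload)
        self.online_responses[req.request_id] = online_resp
        self.validation_responses[req.request_id] = self.validation(req.payload)
        # Clients only ever see the online slice's response.
        return online_resp


def content_comparator(shunt: Shunt) -> List[int]:
    """Return IDs of requests whose responses differ between the two slices."""
    return [rid for rid, resp in shunt.online_responses.items()
            if shunt.validation_responses.get(rid) != resp]


def throughput_comparator(online_rate: float, validation_rate: float,
                          tolerance: float = 0.1) -> bool:
    """True if validation throughput is within `tolerance` of the online rate."""
    return abs(online_rate - validation_rate) <= tolerance * online_rate


# Toy slices: the validation component mishandles "bid" requests.
shunt = Shunt(online=str.upper,
              validation=lambda p: "ERROR" if p == "bid" else p.upper())
replies = [shunt.handle(p) for p in ["browse", "bid", "sell"]]
differing = content_comparator(shunt)
```

Because responses are keyed by request ID, the comparison does not depend on the two slices answering in the same order.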
Validating our prototype: results
Live operator experiments
– Operator given option of type of validation, duration, and to skip validation
– Validation caught 6 out of 9 mistakes from 8 experiments with validation
Mistake-injection experiments
– Validation caught errors in data content (inaccessible files, corrupted files) and configuration mistakes (incorrect # of workers in Web server degraded throughput)
Operator-emulation experiments
– Operator command scripts derived from the 42 operator mistakes
– Both trace-based and replica-based validation caught 22 mistakes
• Multi-component validation caught 4 latent (component interaction) mistakes
Reduction in impact with validation
[Figure: bar chart of # of mistakes (y-axis, 0 to 20) per impact category (x-axis): degraded throughput, service inaccessible, increased MTTR, incomplete component integration, security vulnerability, Web server potentially inaccessible, reduced system capacity, potential database crash. Series: mistakes, mistakes with validation]
Fewer mistakes with validation
[Figure: bar chart of # of mistakes (y-axis, 0 to 18) per mistake category (x-axis): local config, global config, incorrect restart, start of wrong SW version, unnecessary restart of SW, unnecessary HW replacement, wrong choice of HW. Series: mistakes, mistakes with validation]
Shunting & buffering overheads
Shunting overhead for replica-based validation: 39% additional CPU
– All requests and responses are captured and forwarded to validation slice
– Trace-based validation is slightly better: 32% additional CPU
– Overhead is incurred on single component, and only during validation
Various optimizations can reduce overhead to 13-22%
– Examples: response summary (64 bytes), sampling (session boundaries)
Buffering capacity during state checkpointing and duplication
– Required to buffer only about 150 requests for small state sizes
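One way to picture the response-summary optimization is to forward a fixed-size digest instead of the full response body. The slide only gives the summary size (64 bytes); using SHA-512, whose digest happens to be 64 bytes, is an assumption made here for illustration, as are the function and variable names.

```python
import hashlib


def summarize(response: bytes) -> bytes:
    """Condense a response body into a fixed 64-byte summary (SHA-512 digest)."""
    return hashlib.sha512(response).digest()


# Toy response bodies: the validation copy has one corrupted word.
online_body = b"<html>" + b"item listing " * 1000 + b"</html>"
validation_body = online_body.replace(b"item", b"itme", 1)

same = summarize(online_body) == summarize(online_body)
differs = summarize(online_body) != summarize(validation_body)
summary_size = len(summarize(online_body))
```

The shunt would then ship `summary_size` bytes per response regardless of body length, at the cost of no longer being able to say where two responses differ.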
Caveats, limitations & open issues
Non-determinism increases complexity of comparators and proxies
– E.g., choice of back-end server, remote cache vs. local disk, pseudo-random session-id, time stamps
Hard state management may require operator intervention
– Component requires initialization prior to online migration
Bootstrapping the validation
– Validating an intended modification of service behavior: nothing to compare with!
How long to validate? What types of validation?
– Duration spent in validation implies reduced online capacity
Future work: Taking validation further…
– Validate operator actions on databases, network components
– Combine validation with diagnosis for assisting operators
– Other validation techniques: Model-based validation
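The non-determinism caveat above can be made concrete with a content comparator that masks fields expected to differ between runs before comparing. This is a minimal sketch under assumptions: the response format, the regular expressions, and the names `normalize` and `equivalent` are all hypothetical.

```python
import re

# Hypothetical patterns for fields that legitimately differ between runs.
_VOLATILE = [
    (re.compile(r"session-id=[0-9a-f]+"), "session-id=<SID>"),
    (re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}"), "<TIMESTAMP>"),
]


def normalize(response: str) -> str:
    """Replace volatile fields with fixed placeholders."""
    for pattern, placeholder in _VOLATILE:
        response = pattern.sub(placeholder, response)
    return response


def equivalent(a: str, b: str) -> bool:
    """Compare two responses while ignoring the volatile fields."""
    return normalize(a) == normalize(b)


online = "session-id=9f3a time=2006-01-15T10:00:01 price=42"
replica = "session-id=77b1 time=2006-01-15T10:00:03 price=42"
wrong = "session-id=c0de time=2006-01-15T10:00:03 price=41"
```

Here `equivalent(online, replica)` holds even though the session IDs and time stamps differ, while the wrong price in `wrong` is still flagged. Each new source of non-determinism needs its own masking rule, which is the added comparator complexity the slide refers to.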