Characterization of Pathological Behavior
http://www.ices.cmu.edu/ballista
Philip Koopman [email protected] - (412) 268-5225
Dan Siewiorek [email protected] - (412) 268-2570
(and more than a dozen other contributors)
Goals
• Detect pathological patterns for fault prognosis
• Develop fault propagation models
• Develop statistical identification and stochastic characterization of pathological phenomena
Outline
• Definitions
• Digital Hardware Prediction
• Digital Software Characterization
• Research Challenges
Definitions: Cause-Effect Sequence and Duration
FAULT - incorrect state of hardware/software caused by component failure, environment, operator errors, or incorrect design
ERROR - manifestation of a fault within a program or data structure
FAILURE - delivered service deviates from the specified service due to an error
DURATION
• Permanent - continuous and stable, due to hardware failure; repaired by replacement
• Intermittent - occasionally present, due to unstable hardware or varying hardware/software state; repaired by replacement
• Transient - resulting from design errors or temporary environmental conditions; not repairable by replacement
CMU Andrew File Server Study
Configuration
• 13 SUN II Workstations with 68010 processor
• 4 Fujitsu Eagle Disk Drives
Observations
• 21 Workstation Years
Frequency of events
• Permanent Failures: 29
• Intermittent Faults: 610
• Transient Faults: 446
• System Crashes: 298
Mean Time To
• Permanent Failures: 6552 hours
• Intermittent Faults: 58 hours
• Transient Faults: 354 hours
• System Crash: 689 hours
Some Interesting Numbers
Permanent Outages/Total Crashes = 0.1
Intermittent Faults/Permanent Failures = 21
• Thus first symptom appears over 1200 hours prior to repair
(Crashes - Permanent)/Total Faults = 0.255
14/29 failures had three or fewer error log entries
• 8/29 had no error log entries
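The ratios above can be re-derived from the event counts reported for the Andrew File Server study. The sketch below is only a worked check: the function names are hypothetical, and reading the "over 1200 hours" lead time as the intermittent-to-permanent ratio multiplied by the 58-hour mean time to an intermittent fault is an assumption, not something the slides state.

```c
#include <stdio.h>

/* Counts and mean time copied from the study slides. */
#define PERMANENT_FAILURES      29.0
#define INTERMITTENT_FAULTS     610.0
#define TRANSIENT_FAULTS        446.0
#define SYSTEM_CRASHES          298.0
#define HOURS_PER_INTERMITTENT  58.0  /* mean time to an intermittent fault */

/* Permanent outages as a fraction of all system crashes. */
double permanent_per_crash(void)
{
    return PERMANENT_FAILURES / SYSTEM_CRASHES;
}

/* Intermittent faults observed per permanent failure. */
double intermittent_per_permanent(void)
{
    return INTERMITTENT_FAULTS / PERMANENT_FAILURES;
}

/* Assumed reading: lead time = ratio * mean time between intermittent faults. */
double symptom_lead_hours(void)
{
    return intermittent_per_permanent() * HOURS_PER_INTERMITTENT;
}

/* Crashes not explained by permanent failures, per intermittent or transient fault. */
double other_crashes_per_fault(void)
{
    return (SYSTEM_CRASHES - PERMANENT_FAILURES) /
           (INTERMITTENT_FAULTS + TRANSIENT_FAULTS);
}

/* Print all four derived figures. */
void print_study_ratios(void)
{
    printf("permanent outages / total crashes      = %.3f\n", permanent_per_crash());
    printf("intermittent faults / permanent        = %.1f\n", intermittent_per_permanent());
    printf("assumed symptom lead time (hours)      = %.0f\n", symptom_lead_hours());
    printf("(crashes - permanent) / total faults   = %.3f\n", other_crashes_per_fault());
}
```

Under these assumptions the values come out at roughly 0.097, 21.0, 1220 hours, and 0.255, which agrees with the rounded figures on the slide.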
Harbinger Detection of Anomalies
Digital Hardware Prediction
Measurement and Prediction Module
• History Collection -- calculation and reporting of system availability
• Future Prediction -- failure prediction of system devices
[Figure: block diagram of the Measurement & Prediction Module (History Collection, Future Prediction) together with the Operating System and the User Application Program]
History Collection
This module consists of:
• Crash Monitor - monitors system state
• Calculator - computes the average uptime and the average uptime fraction:
$$\text{Availability} = \frac{\text{uptime}}{\text{uptime} + \text{downtime}}$$
[Figure: block diagram with the Operating System, Crash Monitor, files of system state info, Uptime (fraction) Calculator, files of uptime (fraction) information, and the User Application Program]
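Average uptime
Crash Monitor
• Periodically samples the system state to detect state changes (up, crash, down, reboot)
• Sampling interval = 5 min
• uptime’ = t2 - t1 = 600 min
• downtime’ = t3 - t1 = 13 min
• The real uptime is only known to within the 5 min sampling interval of the measured 600 min
[Figure: timeline of the system state (up/down) with crash and reboot events and the sample times t1, t2, t3]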
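As an illustration of how periodic samples could be turned into an availability figure, the following sketch attributes each interval between two consecutive samples to the state seen at the start of that interval and then applies uptime / (uptime + downtime). The type and function names are hypothetical and the attribution rule is an assumption; the slides do not describe the monitor's implementation.

```c
#include <stddef.h>

/* One periodic observation of the system state (hypothetical layout). */
typedef struct {
    double time_min;  /* sample time in minutes */
    int    is_up;     /* nonzero if the system was up at that time */
} state_sample;

/* Accumulated time in each state. */
typedef struct {
    double uptime_min;
    double downtime_min;
} state_totals;

/* Attribute each inter-sample interval to the state observed at its start. */
state_totals accumulate_samples(const state_sample *samples, size_t count)
{
    state_totals totals = { 0.0, 0.0 };
    for (size_t i = 0; i + 1 < count; i++) {
        double span = samples[i + 1].time_min - samples[i].time_min;
        if (samples[i].is_up)
            totals.uptime_min += span;
        else
            totals.downtime_min += span;
    }
    return totals;
}

/* Availability = uptime / (uptime + downtime); 0 if nothing was observed. */
double availability(state_totals totals)
{
    double total = totals.uptime_min + totals.downtime_min;
    return total > 0.0 ? totals.uptime_min / total : 0.0;
}
```

With the measured values from the slide (600 min up, 13 min down), `availability` gives 600 / 613, or about 0.979. Because the state is only sampled every 5 minutes, each state change can be misplaced by up to one sampling interval.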
Preliminary Experiment Data (cont.)
[Figure: cumulative daily availability report for an NT system over a 5-month period; x-axis: time/date, y-axis: availability (axis ticks 0 to 120)]
Future Prediction
This module generates device failure warning information.
• Sys-log Monitor: monitors new entries by checking the system event log periodically.
• DFT Engine: applies the Dispersion Frame Technique (DFT) heuristic and issues the corresponding device failure warning if its rules are satisfied.
[Figure: block diagram with the Operating System, Error Log, Sys-log Monitor, DFT (Dispersion Frame Technique) Engine, files of device failure warnings, and the User Application Program]
Principle from observation
Observed: periods of increasingly unreliable behavior prior to catastrophic failure.
[Figure: error log timeline filtered by event type (disk, mem, CPU), showing the errors of each type and the disk, memory board, and CPU repairs]
Error entry example (the annotated fields are type and time):
DISK:9/180445/563692570/829000:errmsg:xylg:syc:cmd6:reset failed (drive not ready) blk 0
Based on this observation, the DFT heuristic was derived to detect non-monotonically decreasing error inter-arrival times.
How DFT Works
An example rule: if a sliding window of 1/2 of the current error interval successively twice covers 3 errors in the future - issue a warning.
[Figure: timeline t showing the last 5 errors of the same type (disk), labeled i-4 through i, and the point at which the warning is issued]
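The example rule can be read in more than one way, so the sketch below is one possible interpretation and not the published DFT algorithm. It assumes that for each error k the window starts at error k and is half as long as the interval between errors k-1 and k, that "covers 3 errors" means at least three later errors of the same type fall inside the window, and that "successively twice" means this holds for two consecutive errors. The function names and the offline, array-based interface are hypothetical.

```c
#include <stddef.h>

/* Count errors after index `from` whose timestamps lie within `window`
 * time units of the error at `from`. Timestamps must be sorted ascending. */
static size_t errors_in_window(const double *times, size_t count,
                               size_t from, double window)
{
    size_t covered = 0;
    for (size_t j = from + 1; j < count && times[j] - times[from] <= window; j++)
        covered++;
    return covered;
}

/* Assumed reading of the example rule: return the index of the error at
 * which the half-interval window has covered at least 3 later errors for
 * the second consecutive time, or -1 if the rule never fires. */
long dft_example_rule(const double *times, size_t count)
{
    int consecutive_hits = 0;
    for (size_t k = 1; k < count; k++) {
        double half_interval = (times[k] - times[k - 1]) / 2.0;
        if (errors_in_window(times, count, k, half_interval) >= 3) {
            if (++consecutive_hits == 2)
                return (long)k;
        } else {
            consecutive_hits = 0;
        }
    }
    return -1;
}
```

For timestamps 0, 100, 120, 125, 126, 127 the rule fires at index 2, because the windows after the second and third errors (lengths 50 and 10) each cover at least three later errors. Evenly spaced errors never trigger it, which matches the idea of warning only when inter-arrival times shrink quickly.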
Digital Software Characterization
Where We Started: Component Wrapping
• Improve Commercial Off-The-Shelf (COTS) software robustness
Exception Handling: The Basis for Error Detection
Exception handling is an important part of dependable systems
• Responding to unexpected operating conditions
• Tolerating activation of latent design defects
Robustness testing can help evaluate software dependability
• Reaction to exceptional situations (current results)
• Reaction to overloads and software “aging” (future results)
• First big objective: measure exception handling robustness
  – Apply to operating systems
  – Apply to other applications
It’s difficult to improve something you can’t measure … so let’s figure out how to measure robustness!
Measurement Part 1: Software Testing
SW Testing requires:            Ballista uses:
• Test case                     “Bad” value combinations
• Module under test             Module under Test
• Oracle (a “specification”)    Watchdog timer/core dumps
[Figure: an input space (valid inputs, invalid inputs) feeds the module under test; the specified behavior is labeled should work, undefined, or should return error; the response space is labeled robust operation, reproducible failure, or unreproducible failure]
Ballista: Scalable Test Generation
API: write(int filedes, const void *buffer, size_t nbytes)
Testing objects and their test values:
• File descriptor test object: FD_CLOSED, FD_OPEN_READ, FD_OPEN_WRITE, FD_DELETED, FD_NOEXIST, FD_EMPTY_FILE, FD_PAST_END, FD_BEFORE_BEG, FD_PIPE_IN, FD_PIPE_OUT, FD_PIPE_IN_BLOCK, FD_PIPE_OUT_BLOCK, FD_TERM, FD_SHM_READ, FD_SHM_RW, FD_MAXINT, FD_NEG_ONE
• Memory buffer test object: BUF_SMALL_1, BUF_MED_PAGESIZE, BUF_LARGE_512MB, BUF_XLARGE_1GB, BUF_HUGE_2GB, BUF_MAXULONG_SIZE, BUF_64K, BUF_END_MED, BUF_FAR_PAST, BUF_ODD_ADDR, BUF_FREED, BUF_CODE, BUF_16, BUF_NEG_ONE, BUF_NULL
• Size test object: SIZE_1, SIZE_16, SIZE_PAGE, SIZE_PAGEx16, SIZE_PAGEx16plus1, SIZE_MAXINT, SIZE_MININT, SIZE_ZERO, SIZE_NEG
Test case: write(FD_OPEN_RD, BUFF_NULL, SIZE_16)
Ballista combines test values to generate test cases
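Combining test values amounts to enumerating the Cartesian product of the per-parameter value lists. The sketch below shows that idea for write() using a small subset of the value names from the slide; it only builds the symbolic test-case strings, and the function names and callback interface are hypothetical rather than Ballista's own.

```c
#include <stddef.h>
#include <stdio.h>

/* A few test value names per parameter, taken from the slide's lists. */
static const char *const FD_VALUES[]   = { "FD_CLOSED", "FD_OPEN_READ", "FD_NEG_ONE" };
static const char *const BUF_VALUES[]  = { "BUF_NULL", "BUF_16", "BUF_FREED" };
static const char *const SIZE_VALUES[] = { "SIZE_ZERO", "SIZE_16", "SIZE_MAXINT" };

#define COUNT_OF(a) (sizeof(a) / sizeof((a)[0]))

/* Example callback: print one symbolic test case per line. */
void print_test_case(const char *test_case)
{
    puts(test_case);
}

/* Enumerate every combination of test values for write(), pass each one to
 * `emit` (if not NULL), and return the number of test cases generated. */
size_t generate_write_test_cases(void (*emit)(const char *test_case))
{
    char line[128];
    size_t generated = 0;

    for (size_t f = 0; f < COUNT_OF(FD_VALUES); f++) {
        for (size_t b = 0; b < COUNT_OF(BUF_VALUES); b++) {
            for (size_t s = 0; s < COUNT_OF(SIZE_VALUES); s++) {
                snprintf(line, sizeof line, "write(%s, %s, %s)",
                         FD_VALUES[f], BUF_VALUES[b], SIZE_VALUES[s]);
                if (emit)
                    emit(line);
                generated++;
            }
        }
    }
    return generated;
}
```

Three values per parameter yield 27 test cases. Since the count is the product of the list sizes, adding one value to one testing object adds a whole slice of new test cases without any new test code.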
Ballista: “High Level” + “Repeatable”
High level testing is done using the API to perform fault injection
• Send exceptional values into a system through the API
  – Requires no modification to code -- only linkable object files needed
  – Can be used with any function that takes a parameter list
• Direct testing instead of middleware injection simplifies usage
Each test is a specific function call with a specific set of parameters
• System state initialized & cleaned up for each single-call test
• Combinations of valid and invalid parameters tried in turn
• A “simplistic” model, but it does in fact work...
Early results were encouraging:
• Found a significant percentage of functions with robustness failures
• Crashed systems from user mode
The testing object-based approach scales!
CRASH Robustness Testing Result Categories
Catastrophic
• Computer crashes/panics, requiring a reboot
• e.g., Irix 6.2: munmap(malloc((1<<30)+1), ((1<<31)-1));
• e.g., DUNIX 4.0D: mprotect(malloc((1 << 29)+1), 65537, 0);
Restart
• Benchmark process hangs, requiring restart
Abort
• Benchmark process aborts (e.g., “core dump”)
Silent
• No error code generated, when one should have been (e.g., de-referencing null pointer produces no error)
Hindering
• Incorrect error code generated
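To make the Restart and Abort categories concrete, here is a minimal POSIX sketch of how a single call could be run in isolation and classified with a watchdog timer. It is not Ballista's actual harness: the names are hypothetical, and it deliberately covers only two categories. A Catastrophic failure takes down the whole machine, so it cannot be observed from inside a process, and Silent or Hindering failures need an oracle for the expected error code.

```c
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Outcomes this sketch can distinguish (hypothetical names). */
typedef enum {
    OUTCOME_PASS,          /* call returned; no Restart or Abort failure seen */
    OUTCOME_RESTART,       /* call hung and the watchdog killed the child */
    OUTCOME_ABORT,         /* child was terminated by another signal */
    OUTCOME_HARNESS_ERROR  /* fork or waitpid failed */
} test_outcome;

/* Run one call in a child process so a crash or hang cannot affect the
 * harness, then classify how the child ended. */
test_outcome run_single_test(void (*test_call)(void), unsigned timeout_s)
{
    pid_t pid = fork();
    if (pid < 0)
        return OUTCOME_HARNESS_ERROR;

    if (pid == 0) {
        alarm(timeout_s);  /* watchdog: SIGALRM terminates a hung child */
        test_call();
        _exit(0);
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return OUTCOME_HARNESS_ERROR;
    if (WIFSIGNALED(status))
        return WTERMSIG(status) == SIGALRM ? OUTCOME_RESTART : OUTCOME_ABORT;
    return OUTCOME_PASS;
}

/* Stand-in calls for trying the harness; real tests would invoke the
 * function under test with one combination of test values. */
void call_that_returns(void) { }
void call_that_aborts(void)  { abort(); }
void call_that_hangs(void)   { for (;;) pause(); }
```

Forking per test also gives each call a fresh process state, in the spirit of initializing and cleaning up system state for every single-call test.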
Digital Unix 4.0 Results
Comparing Fifteen POSIX Operating Systems
[Figure: "Ballista Robustness Tests for 233 Posix Function Calls"; bar chart of normalized failure rate (0% to 25%), split into Abort failures and Restart failures, for AIX 4.1, QNX 4.22, QNX 4.24, SunOS 4.1.3, SunOS 5.5, OSF1 3.2, OSF1 4.0, FreeBSD 2.2.5, Irix 5.3, Irix 6.2, Linux 2.0.18, LynxOS 2.4.0, NetBSD 1.3, HP-UX 9.05, and HP-UX 10.20; five systems are annotated with Catastrophic failures (four with 1, one with 2)]
Failure Rates By POSIX Fn/Call Category
C Library Is A Potential Robustness Bottleneck
[Figure: "Portions of Failure Rates Due To System/C-Library"; bar chart of normalized failure rate (0% to 25%), split into C library and system call portions, for AIX 4.1, QNX 4.22, QNX 4.24, SunOS 4.1.3, SunOS 5.5, OSF1 3.2, OSF1 4.0, FreeBSD 2.2.5, Irix 5.3, Irix 6.2, Linux 2.0.18, LynxOS 2.4.0, NetBSD 1.3, HP-UX 9.05, and HP-UX 10.20; five systems are annotated with Catastrophic failures (four with 1, one with 2)]
Failure Rates by Function Group
Technology Transfer
Original project sponsor: DARPA
• Sponsored technology transfer projects for:
  – Trident Submarine navigation system (U.S. Navy)
  – Defense Modeling & Simulation Office HLA system
Industrial sponsors are continuing the work
• Cisco – Network switching infrastructure
• ABB – Industrial automation framework
• Emerson – Windows CE testing
• AT&T – CORBA testing
• ADtranz – (defining project)
• Microsoft – Windows 2000 testing
Other users include
• Rockwell, Motorola, and, potentially, some POSIX OS developers
Specifying A Test (web/demo interface)
• Simple demo interface; real interface has a few more steps...
Viewing Results
• Each robustness failure is one test case (one set of parameters)
“Bug Report” program creation
• Reproduces failure in isolation (>99% effective in practice)
/* Ballista single test case Sun Jun 13 14:11:06 1999
* fopen(FNAME_NEG, STR_EMPTY) */
...
const char *str_empty = "";
...
param0 = (char *) -1;
str_ptr = (char *) malloc (strlen (str_empty) + 1);
strcpy (str_ptr, str_empty);
param1 = str_ptr;
...
fopen (param0, param1);
Research Challenges
• Ballista provides a small, discrete state-space for software components
• Challenge is to create models of inter-module relations and workload statistics to create predictions
• Create discrete simulations using model and probabilities as input parameters
• Validation of model at a high level of abstraction through experimentation on testbed
• Optimize cost/performance
Contributors
What does it take to do this sort of research?
• A legacy of 15 years of previous Carnegie Mellon work to build upon
  – But, sometimes it takes that long just to understand the real problems!
• Ballista: 3.5 years and about $1.6 Million spent to date
Students: Meredith Beveridge, John DeVale, Kim Fernsler, David Guttendorf, Geoff Hendrey, Nathan Kropp, Jiantao Pan, Charles Shelton, Ying Shi, Asad Zaidi
Faculty & Staff: Kobey DeVale, Phil Koopman, Roy Maxion, Dan Siewiorek