![Page 1: The Failure Trace Archive: Enabling Comparative Analysis ... · Enabling Comparative Analysis of Diverse Distributed Systems Derrick Kondo1, Bahman Javadi1, ... • Hard to evaluate](https://reader030.vdocuments.us/reader030/viewer/2022040723/5e32d586635929030b763617/html5/thumbnails/1.jpg)
The Failure Trace Archive: Enabling Comparative Analysis of Diverse Distributed Systems
Derrick Kondo¹, Bahman Javadi¹, Alexandru Iosup², Dick Epema²
¹INRIA, France ²TU Delft, The Netherlands
![Page 4: The Failure Trace Archive: Enabling Comparative Analysis ... · Enabling Comparative Analysis of Diverse Distributed Systems Derrick Kondo1, Bahman Javadi1, ... • Hard to evaluate](https://reader030.vdocuments.us/reader030/viewer/2022040723/5e32d586635929030b763617/html5/thumbnails/4.jpg)
Motivation
• Push toward experimental computer science
• Hard to evaluate and compare algorithms and models for fault-tolerance
• Lack of public trace data sets
• Lack of standard trace format
• Lack of parsing and analytical tools
• Failures in distributed systems have increasingly high negative impact and complex dynamics
![Page 5: The Failure Trace Archive: Enabling Comparative Analysis ... · Enabling Comparative Analysis of Diverse Distributed Systems Derrick Kondo1, Bahman Javadi1, ... • Hard to evaluate](https://reader030.vdocuments.us/reader030/viewer/2022040723/5e32d586635929030b763617/html5/thumbnails/5.jpg)
Failure Trace Archive (FTA)
• Availability traces of distributed systems, differing in scale, volatility, and usage
• Standard event-based format for failure traces
• Scripts and tools for parsing and analyzing traces in svn repository
http://fta.inria.fr
![Page 6: The Failure Trace Archive: Enabling Comparative Analysis ... · Enabling Comparative Analysis of Diverse Distributed Systems Derrick Kondo1, Bahman Javadi1, ... • Hard to evaluate](https://reader030.vdocuments.us/reader030/viewer/2022040723/5e32d586635929030b763617/html5/thumbnails/6.jpg)
Related Work

| Resource | Data Sets | Format | Parsing Tools | Analysis Tools |
|---|---|---|---|---|
| Grid Observatory | Emphasis on EGEE | ✗ | ✗ | ✗ |
| Computer Failure Repo. | 12 (mainly clusters) | ✗ | ✗ | ✗ |
| Repo. of Avail. Traces | 5 (mainly P2P) | ✓ | ✓ | ✗ |
| Desktop Grid Archive | 4 Desktop Grids | ✓ | ✗ | ✗ |
| FTA¹ | 22 | ✓ | ✓ | ✓ |

¹ FTA includes data sets of the former three resources, in addition to providing several new data sets
![Page 11: The Failure Trace Archive: Enabling Comparative Analysis ... · Enabling Comparative Analysis of Diverse Distributed Systems Derrick Kondo1, Bahman Javadi1, ... • Hard to evaluate](https://reader030.vdocuments.us/reader030/viewer/2022040723/5e32d586635929030b763617/html5/thumbnails/11.jpg)
Enabled Studies
• Comparing models/algorithms using the identical data sets
• Evaluation of generality/specificity of model/algorithm across different types of systems
• Evaluation of the generality of a system trace
• Analysis of evolution of failures over time
• And many more...
![Page 12: The Failure Trace Archive: Enabling Comparative Analysis ... · Enabling Comparative Analysis of Diverse Distributed Systems Derrick Kondo1, Bahman Javadi1, ... • Hard to evaluate](https://reader030.vdocuments.us/reader030/viewer/2022040723/5e32d586635929030b763617/html5/thumbnails/12.jpg)
Contributions
• Description of FTA, trace format and analysis toolbox
• High-level statistical characterization of failures in each data set
• Show importance of public data sets and methods via characterization of ambiguous data sets
![Page 13: The Failure Trace Archive: Enabling Comparative Analysis ... · Enabling Comparative Analysis of Diverse Distributed Systems Derrick Kondo1, Bahman Javadi1, ... • Hard to evaluate](https://reader030.vdocuments.us/reader030/viewer/2022040723/5e32d586635929030b763617/html5/thumbnails/13.jpg)
Background Definitions
• Failure: observed deviation from correct system state
• Availability (unavailability) interval: continuous period that system is in correct state (incorrect state)
• Error: system state (not externally visible) that leads to failure
• Fault: root cause of an error
![Page 29: The Failure Trace Archive: Enabling Comparative Analysis ... · Enabling Comparative Analysis of Diverse Distributed Systems Derrick Kondo1, Bahman Javadi1, ... • Hard to evaluate](https://reader030.vdocuments.us/reader030/viewer/2022040723/5e32d586635929030b763617/html5/thumbnails/29.jpg)
FTA Schema
Schema diagram entities: platform, node, component, event_trace, creator, node_perf, event_state, component_type codes, event_type codes, event_end reason codes
• Resource (versus job or user) centric
• Formats: raw, tabbed, relational database (MySQL)
• Event-based
• Codes for different components, events, and errors
• Extensibility
• Associated metadata
• Balance between completeness and sparseness
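To make the event-based, resource-centric idea concrete, here is a minimal Python sketch. The field names (`node_id`, `event_type`, `start`, `end`) and the 0/1 state coding are illustrative assumptions, not the FTA's actual column definitions, which are documented with the archive itself.

```python
from dataclasses import dataclass

# Hypothetical state codes; the FTA defines its own event_type codes.
UNAVAILABLE = 0
AVAILABLE = 1

@dataclass(frozen=True)
class Event:
    """One row of a hypothetical event trace: a node stays in one state from start to end."""
    node_id: int
    event_type: int   # AVAILABLE or UNAVAILABLE
    start: float      # seconds since the start of the trace
    end: float

def interval_lengths(events, node_id, event_type):
    """Return the durations of all intervals of the given type for one node."""
    return [e.end - e.start for e in events
            if e.node_id == node_id and e.event_type == event_type]

trace = [
    Event(1, AVAILABLE, 0.0, 3600.0),
    Event(1, UNAVAILABLE, 3600.0, 4200.0),
    Event(1, AVAILABLE, 4200.0, 9000.0),
    Event(2, AVAILABLE, 0.0, 9000.0),
]

print(interval_lengths(trace, 1, AVAILABLE))    # [3600.0, 4800.0]
print(interval_lengths(trace, 1, UNAVAILABLE))  # [600.0]
```

Because each row describes a resource rather than a job or user, availability and unavailability interval lengths fall out of a simple filter over the events.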
![Page 30: The Failure Trace Archive: Enabling Comparative Analysis ... · Enabling Comparative Analysis of Diverse Distributed Systems Derrick Kondo1, Bahman Javadi1, ... • Hard to evaluate](https://reader030.vdocuments.us/reader030/viewer/2022040723/5e32d586635929030b763617/html5/thumbnails/30.jpg)
Data Quality Assessment
• Syntactic: standard format library that checks data types and number of fields (automated)
• Semantic: time moves forward and is non-overlapping, state is valid (automated)
• Visual: look at the distribution for outliers (manual)
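The automated semantic checks could look roughly like the sketch below. It is not the FTA's own checker: the `(start, end, state)` tuple layout, the function name `semantic_errors`, and the set of valid states are assumptions made for illustration.

```python
def semantic_errors(intervals, valid_states=(0, 1)):
    """Check one node's (start, end, state) tuples, given in trace order.

    Returns a list of human-readable problems; an empty list means that
    time moves forward, intervals do not overlap, and every state is valid.
    """
    problems = []
    previous_end = None
    for index, (start, end, state) in enumerate(intervals):
        if state not in valid_states:
            problems.append(f"row {index}: invalid state {state!r}")
        if end <= start:
            problems.append(f"row {index}: end {end} is not after start {start}")
        if previous_end is not None and start < previous_end:
            problems.append(f"row {index}: overlaps an earlier interval")
        # Track the latest end seen so far, so later overlaps are still caught.
        previous_end = end if previous_end is None else max(previous_end, end)
    return problems

good = [(0, 10, 1), (10, 15, 0), (15, 30, 1)]
bad = [(0, 10, 1), (5, 8, 0), (8, 7, 2)]
print(semantic_errors(good))  # []
for problem in semantic_errors(bad):
    print(problem)
```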
![Page 31: The Failure Trace Archive: Enabling Comparative Analysis ... · Enabling Comparative Analysis of Diverse Distributed Systems Derrick Kondo1, Bahman Javadi1, ... • Hard to evaluate](https://reader030.vdocuments.us/reader030/viewer/2022040723/5e32d586635929030b763617/html5/thumbnails/31.jpg)
Data Sets
• Usage (P2P, supercomputers, grids, desktop PCs)
• Type (CPU, network, IO)
• Scale (50-240,000 hosts)
• Volatility (minutes to days)
• Resolution (with respect to failure detection)
![Page 42: The Failure Trace Archive: Enabling Comparative Analysis ... · Enabling Comparative Analysis of Diverse Distributed Systems Derrick Kondo1, Bahman Javadi1, ... • Hard to evaluate](https://reader030.vdocuments.us/reader030/viewer/2022040723/5e32d586635929030b763617/html5/thumbnails/42.jpg)
Statistical Analysis
![Page 43: The Failure Trace Archive: Enabling Comparative Analysis ... · Enabling Comparative Analysis of Diverse Distributed Systems Derrick Kondo1, Bahman Javadi1, ... • Hard to evaluate](https://reader030.vdocuments.us/reader030/viewer/2022040723/5e32d586635929030b763617/html5/thumbnails/43.jpg)
FTA Toolbox
Workflow (from diagram): initialize, query (against the MySQL trace database), process, finalize; output formats: text, HTML, wiki, LaTeX
• Makes it easy to run a set of statistical measures across all the data sets
• Provides library of functions that can be reused and incorporated
• Implemented in Matlab
• svn checkout svn://scm.gforge.inria.fr/svn/fta/toolbox
![Page 44: The Failure Trace Archive: Enabling Comparative Analysis ... · Enabling Comparative Analysis of Diverse Distributed Systems Derrick Kondo1, Bahman Javadi1, ... • Hard to evaluate](https://reader030.vdocuments.us/reader030/viewer/2022040723/5e32d586635929030b763617/html5/thumbnails/44.jpg)
Failure Modelling
• Approach
• Model availability and unavailability intervals, each with a single probability distribution
• Assume availability and unavailability intervals are independent and identically distributed (i.i.d.)
• Descriptive, not prescriptive
![Page 46: The Failure Trace Archive: Enabling Comparative Analysis ... · Enabling Comparative Analysis of Diverse Distributed Systems Derrick Kondo1, Bahman Javadi1, ... • Hard to evaluate](https://reader030.vdocuments.us/reader030/viewer/2022040723/5e32d586635929030b763617/html5/thumbnails/46.jpg)
Distributions of Availability and Unavailability Intervals
Qualitative Description
![Page 47: The Failure Trace Archive: Enabling Comparative Analysis ... · Enabling Comparative Analysis of Diverse Distributed Systems Derrick Kondo1, Bahman Javadi1, ... • Hard to evaluate](https://reader030.vdocuments.us/reader030/viewer/2022040723/5e32d586635929030b763617/html5/thumbnails/47.jpg)
Model Fitting
• For each candidate probability distribution
• Compute parameters that maximize the distribution’s likelihood
• Measure goodness of fit using Kolmogorov-Smirnov (KS) and Anderson-Darling (AD) tests
• Compute each p-value from 30 samples, then take the average of 1000 p-values
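The fitting procedure above can be sketched in plain Python. This is a hedged illustration, not the FTA toolbox (which is implemented in Matlab). It covers only two candidates with closed-form maximum-likelihood estimates (exponential and log-normal), uses the KS test only, and approximates the KS p-value with the standard asymptotic series. Fitting parameters on the full data set and testing on random subsamples of 30 is an assumption about how the 30 samples are drawn. All function names are hypothetical.

```python
import math
import random
import statistics

def fit_exponential(data):
    """Maximum-likelihood exponential fit (rate = 1 / mean); returns the fitted CDF."""
    rate = 1.0 / statistics.fmean(data)
    return lambda x: 1.0 - math.exp(-rate * x)

def fit_lognormal(data):
    """Maximum-likelihood log-normal fit; returns the fitted CDF."""
    logs = [math.log(x) for x in data]
    mu = statistics.fmean(logs)
    sigma = statistics.pstdev(logs)  # the MLE divides by n, not n - 1
    return lambda x: 0.5 * (1.0 + math.erf((math.log(x) - mu) / (sigma * math.sqrt(2.0))))

def ks_pvalue(sample, cdf):
    """Approximate p-value of the one-sample Kolmogorov-Smirnov test."""
    xs = sorted(sample)
    n = len(xs)
    # KS statistic: largest gap between the empirical and the fitted CDF.
    d = max(max((i + 1) / n - cdf(x), cdf(x) - i / n) for i, x in enumerate(xs))
    lam = (math.sqrt(n) + 0.12 + 0.11 / math.sqrt(n)) * d
    if lam < 0.2:
        return 1.0  # the series below is numerically 1 in this range
    series = sum((-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2) for k in range(1, 101))
    return min(1.0, max(0.0, 2.0 * series))

def average_pvalue(data, fit, subsample_size=30, repeats=1000, seed=0):
    """Average the KS p-value over repeated random subsamples of the data."""
    rng = random.Random(seed)
    cdf = fit(data)  # assumption: parameters are fitted once, on the full data set
    total = 0.0
    for _ in range(repeats):
        total += ks_pvalue(rng.sample(data, subsample_size), cdf)
    return total / repeats
```

With `durations` a list of positive interval lengths, `average_pvalue(durations, fit_lognormal)` and `average_pvalue(durations, fit_exponential)` can be compared directly: a low average p-value suggests rejecting the hypothesis that the data came from the fitted distribution.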
![Page 50: The Failure Trace Archive: Enabling Comparative Analysis ... · Enabling Comparative Analysis of Diverse Distributed Systems Derrick Kondo1, Bahman Javadi1, ... • Hard to evaluate](https://reader030.vdocuments.us/reader030/viewer/2022040723/5e32d586635929030b763617/html5/thumbnails/50.jpg)
P-Values for KS & AD Goodness-of-fit Tests
[Panels: Availability, Unavailability]
(Un)availability generally not heavy-tailed
p-value < 0.05 or 0.10 ⇒ reject H0 that data came from fitted distribution
![Page 52: The Failure Trace Archive: Enabling Comparative Analysis ... · Enabling Comparative Analysis of Diverse Distributed Systems Derrick Kondo1, Bahman Javadi1, ... • Hard to evaluate](https://reader030.vdocuments.us/reader030/viewer/2022040723/5e32d586635929030b763617/html5/thumbnails/52.jpg)
P-Values for KS & AD Goodness-of-fit Tests
[Panels: Availability, Unavailability]
Exponential usually not a good fit.
p-value < 0.05 or 0.10 ⇒ reject H0 that data came from fitted distribution
![Page 54: The Failure Trace Archive: Enabling Comparative Analysis ... · Enabling Comparative Analysis of Diverse Distributed Systems Derrick Kondo1, Bahman Javadi1, ... • Hard to evaluate](https://reader030.vdocuments.us/reader030/viewer/2022040723/5e32d586635929030b763617/html5/thumbnails/54.jpg)
P-Values for KS & AD Goodness-of-fit Tests
[Panels: Availability, Unavailability]
Gamma a good fit; amenable to Markov models
p-value < 0.05 or 0.10 ⇒ reject H0 that data came from fitted distribution
![Page 56: The Failure Trace Archive: Enabling Comparative Analysis ... · Enabling Comparative Analysis of Diverse Distributed Systems Derrick Kondo1, Bahman Javadi1, ... • Hard to evaluate](https://reader030.vdocuments.us/reader030/viewer/2022040723/5e32d586635929030b763617/html5/thumbnails/56.jpg)
P-Values for KS & AD Goodness-of-fit Tests
[Panels: Availability, Unavailability]
Weibull and Log-Normal provide best fit
p-value < 0.05 or 0.10 ⇒ reject H0 that data came from fitted distribution
![Page 58: The Failure Trace Archive: Enabling Comparative Analysis ... · Enabling Comparative Analysis of Diverse Distributed Systems Derrick Kondo1, Bahman Javadi1, ... • Hard to evaluate](https://reader030.vdocuments.us/reader030/viewer/2022040723/5e32d586635929030b763617/html5/thumbnails/58.jpg)
Parameters of Distributions
Availability Unavailability
μ: mean, σ: std dev., k: shape, λ: scale
k < 1, ∴ decreasing hazard rate
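The step from k < 1 to a decreasing hazard rate can be written out. The slide's legend does not say which distribution the remark refers to; assuming k and λ are the shape and scale of a Weibull distribution, the hazard rate is:

```latex
h(t) = \frac{f(t)}{1 - F(t)}
     = \frac{k}{\lambda}\left(\frac{t}{\lambda}\right)^{k-1},
\qquad
\frac{dh}{dt} = \frac{k(k-1)}{\lambda^{2}}\left(\frac{t}{\lambda}\right)^{k-2} < 0
\quad \text{for } 0 < k < 1,\ t > 0 .
```

Under this assumption, the longer an interval has already lasted, the less likely it is to end in the next instant.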
![Page 59: The Failure Trace Archive: Enabling Comparative Analysis ... · Enabling Comparative Analysis of Diverse Distributed Systems Derrick Kondo1, Bahman Javadi1, ... • Hard to evaluate](https://reader030.vdocuments.us/reader030/viewer/2022040723/5e32d586635929030b763617/html5/thumbnails/59.jpg)
Can different interpretations of trace data sets affect the model?
![Page 62: The Failure Trace Archive: Enabling Comparative Analysis ... · Enabling Comparative Analysis of Diverse Distributed Systems Derrick Kondo1, Bahman Javadi1, ... • Hard to evaluate](https://reader030.vdocuments.us/reader030/viewer/2022040723/5e32d586635929030b763617/html5/thumbnails/62.jpg)
Ambiguous Data Sets

| Data Set | Ambiguity | Interpretation |
|---|---|---|
| G5K06 | Monitored state is an error or failure | error |
| G5K06B | Monitored state is an error or failure | failure |
| LANL0516 | Overlapping intervals | union |
| LANL0516B | Overlapping intervals | intersection |
| ND07CPU | Definition of idleness | w/o user and CPU load for 15 mins |
| ND07CPUB | Definition of idleness | CPU load < 10% |
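As an illustration of how one ambiguity changes the resulting intervals, the sketch below resolves overlapping intervals under the "union" and "intersection" interpretations. The function name `resolve_overlaps`, the `(start, end)` tuple layout, and the rule of collapsing each cluster of overlapping intervals into one are assumptions for illustration; the actual LANL parsing scripts may differ.

```python
def resolve_overlaps(intervals, mode="union"):
    """Collapse each cluster of overlapping (start, end) intervals into one interval.

    mode="union":        keep the whole span covered by the cluster
    mode="intersection": keep only the part shared by every interval in the
                         cluster (dropped if no common part exists)
    """
    if mode not in ("union", "intersection"):
        raise ValueError("mode must be 'union' or 'intersection'")
    resolved = []

    def close(c):
        # c = [min_start, max_end, max_start, min_end] of the current cluster
        if mode == "union":
            resolved.append((c[0], c[1]))
        elif c[2] < c[3]:
            resolved.append((c[2], c[3]))

    cluster = None
    for start, end in sorted(intervals):
        if cluster is not None and start < cluster[1]:
            # Strict overlap with the current cluster: extend it.
            cluster[1] = max(cluster[1], end)
            cluster[2] = max(cluster[2], start)
            cluster[3] = min(cluster[3], end)
        else:
            if cluster is not None:
                close(cluster)
            cluster = [start, end, start, end]
    if cluster is not None:
        close(cluster)
    return resolved

raw = [(0, 10), (5, 20), (30, 40)]
print(resolve_overlaps(raw, "union"))         # [(0, 20), (30, 40)]
print(resolve_overlaps(raw, "intersection"))  # [(5, 10), (30, 40)]
```

The same raw records yield intervals of length 20 under one interpretation and 5 under the other, which is the kind of difference that then propagates into fitted model parameters.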
![Page 65: The Failure Trace Archive: Enabling Comparative Analysis ... · Enabling Comparative Analysis of Diverse Distributed Systems Derrick Kondo1, Bahman Javadi1, ... • Hard to evaluate](https://reader030.vdocuments.us/reader030/viewer/2022040723/5e32d586635929030b763617/html5/thumbnails/65.jpg)
QQ Plots for Ambiguous Data Sets
[Figure: three QQ plots comparing quantiles of the fitted distributions: g5k06 fit vs. g5k06B fit, lanl0516 fit vs. lanl0516B fit, nd07cpu fit vs. nd07cpuB fit]
![Page 67: The Failure Trace Archive: Enabling Comparative Analysis ... · Enabling Comparative Analysis of Diverse Distributed Systems Derrick Kondo1, Bahman Javadi1, ... • Hard to evaluate](https://reader030.vdocuments.us/reader030/viewer/2022040723/5e32d586635929030b763617/html5/thumbnails/67.jpg)
Distribution Parameters for Ambiguous Data Sets
Mean of G5K06B is 1.5 times greater than that of G5K06
μ: mean, σ: std dev., k: shape, λ: scale
![Page 69: The Failure Trace Archive: Enabling Comparative Analysis ... · Enabling Comparative Analysis of Diverse Distributed Systems Derrick Kondo1, Bahman Javadi1, ... • Hard to evaluate](https://reader030.vdocuments.us/reader030/viewer/2022040723/5e32d586635929030b763617/html5/thumbnails/69.jpg)
Distribution Parameters for Ambiguous Data Sets
Gamma scale parameter often significantly different
μ: mean, σ: std dev., k: shape, λ: scale
![Page 70: The Failure Trace Archive: Enabling Comparative Analysis ... · Enabling Comparative Analysis of Diverse Distributed Systems Derrick Kondo1, Bahman Javadi1, ... • Hard to evaluate](https://reader030.vdocuments.us/reader030/viewer/2022040723/5e32d586635929030b763617/html5/thumbnails/70.jpg)
How to identify interpretation?
• Parsing script is the exact interpretation
• Meaning explained in comments
• Publicly accessible in svn
• Format supports different interpretations of availability
• Can have multiple event_traces corresponding to different definitions of availability
• So each interpretation can be uniquely identified
![Page 71: The Failure Trace Archive: Enabling Comparative Analysis ... · Enabling Comparative Analysis of Diverse Distributed Systems Derrick Kondo1, Bahman Javadi1, ... • Hard to evaluate](https://reader030.vdocuments.us/reader030/viewer/2022040723/5e32d586635929030b763617/html5/thumbnails/71.jpg)
How to resolve differences of interpretation?
• Determine which interpretation affects the application. (E.g. G5K06)
• Determine most common interpretation, or interpretation that is the lowest common denominator (E.g. ND07CPU)
• Exclude period of ambiguity or post-process it so that it is consistent with rest of data set (E.g. LANL05)
![Page 72: The Failure Trace Archive: Enabling Comparative Analysis ... · Enabling Comparative Analysis of Diverse Distributed Systems Derrick Kondo1, Bahman Javadi1, ... • Hard to evaluate](https://reader030.vdocuments.us/reader030/viewer/2022040723/5e32d586635929030b763617/html5/thumbnails/72.jpg)
Future Directions
• Call to arms: trace data exists in many production environments, but is not always accessible
• Include more production systems
• Types of failures
• Causes of failures
• State before failures
• Automated trace collection
• Failure models and algorithms
• Integration of job and resource failures
![Page 73: The Failure Trace Archive: Enabling Comparative Analysis ... · Enabling Comparative Analysis of Diverse Distributed Systems Derrick Kondo1, Bahman Javadi1, ... • Hard to evaluate](https://reader030.vdocuments.us/reader030/viewer/2022040723/5e32d586635929030b763617/html5/thumbnails/73.jpg)
Acknowledgements
• All contributors of trace data to the FTA
• INRIA ALEAE project directed by Emmanuel Jeannot
• Feedback from Cecile Germain, Eric Heien, Artur Andrzejak, anonymous reviewers
![Page 74: The Failure Trace Archive: Enabling Comparative Analysis ... · Enabling Comparative Analysis of Diverse Distributed Systems Derrick Kondo1, Bahman Javadi1, ... • Hard to evaluate](https://reader030.vdocuments.us/reader030/viewer/2022040723/5e32d586635929030b763617/html5/thumbnails/74.jpg)
Summary
• FTA: Data Sets, Format, Tools
• http://fta.inria.fr
• High-level modelling and statistical characterization of 9 data sets
• Slight differences in interpretation make significant difference in model
• Got data? Questions? Please email [email protected] or any other FTA team member
![Page 75: The Failure Trace Archive: Enabling Comparative Analysis ... · Enabling Comparative Analysis of Diverse Distributed Systems Derrick Kondo1, Bahman Javadi1, ... • Hard to evaluate](https://reader030.vdocuments.us/reader030/viewer/2022040723/5e32d586635929030b763617/html5/thumbnails/75.jpg)
Thank you