Analysis and Modeling of Time-Correlated Failures in Large-Scale Distributed Systems

Nezih Yigitbasi (1), Matthieu Gallet (2), Derrick Kondo (3), Alexandru Iosup (1), Dick Epema (1)

(1) TU Delft, (2) École Normale Supérieure de Lyon, (3) INRIA

Delft University of Technology: Challenge the future

The Failure Trace Archive

http://guardg.st.ewi.tudelft.nl/




Failures Do Happen

• … Build a computing system with 10 thousand servers with MTBF of 30 years each, watch one fail per day …
  Jeff Dean, Google Fellow, LADIS’09 Keynote

• … Average worker deaths per MapReduce job is 1.2 …
  MapReduce, OSDI’04

• … 20-45% failures in TeraGrid …
  Khalili et al., GRID’06

• … During the month of March 2005 on one dedicated cluster with 1500 Xeon CPUs, there were 32,580 Sawzall jobs launched, using an average of 220 machines each. While running those jobs, 18,636 failures occurred (application failure, network outage, system crash, etc.) that triggered rerunning some portion of the job ...
  Rob Pike et al., Google
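The arithmetic behind the first quote can be checked with a small sketch. It assumes independent servers with a constant failure rate and 365-day years; the function name is illustrative and not taken from the talk.

```python
def expected_failures_per_day(num_servers, mtbf_years, days_per_year=365.0):
    """Expected failures per day for independent servers with a constant rate.

    Assumption: each server fails on average once per MTBF, so the fleet-wide
    rate is simply num_servers / MTBF (in days).
    """
    mtbf_days = mtbf_years * days_per_year
    return num_servers / mtbf_days


rate = expected_failures_per_day(num_servers=10_000, mtbf_years=30)
print(f"{rate:.2f} failures per day")  # prints "0.91 failures per day"
```

With 10,000 servers and a 30-year MTBF the result is just under one failure per day, which matches the quote.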


Are Failures Independent?

• Common assumption: failures are independent
• Is this realistic for large-scale distributed systems?
  • We already know that space correlations exist
• Time correlations may impact
  • Proactive fault-tolerance solutions
  • Design decisions
  • Checkpointing & scheduling decisions (e.g., migrate computation at the beginning of a predicted peak)

M. Gallet, N. Yigitbasi, B. Javadi, D. Kondo, A. Iosup, D. Epema, A Model for Space-correlated Failures in Large-scale Distributed Systems, Euro-Par 2010.


Our Goals

GOAL 1: Investigate whether failures have time correlations

GOAL 2: Model the time-varying behavior of failures (peaks)


Outline

Background

Our Approach

Analysis of Time-Correlation

Modeling the Peaks of Failures

Conclusions


Why Not Root-Cause Analysis?

• Root-cause analysis is definitely useful

Challenges
• Systems are large and complex
• Not all subsystems provide detailed info
• Little monitoring/debugging support
• Environment-specific or temporary failures
• Huge size of failure data
  • 19 systems


Failure Trace Archive (FTA)

http://fta.inria.fr

Provides
• Availability traces of diverse distributed systems of different scale
• Standard format for failure events
• Tools for parsing & analysis

Enables
• Comparing models/algorithms using identical data sets
• Evaluation of the generality/specificity of models/algorithms across different types of systems
• Analysis of availability evolution across time scales
• And many more …



FTA Schema

• Hierarchical trace format

• Resource centric

• Event-based

• Associated metadata

• Codes for different components and events

• Available in raw, tabbed, and MySQL formats


Sample Trace

Fields annotated in the sample trace:
• Identifiers for the event/component/node/platform
• Node name
• Type of event: unavailability/availability
• Event start/stop time (UNIX time)
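To make the record structure concrete, here is a minimal sketch of reading one event from a tab-separated line. The column order, the field names, and the textual event-type values are hypothetical choices for illustration only; they are not the FTA's actual layout, which uses its own codes for components and events.

```python
from dataclasses import dataclass


@dataclass
class TraceEvent:
    # All field names below are hypothetical, chosen to mirror the slide.
    event_id: int
    component_id: int
    node_id: int
    platform_id: int
    node_name: str
    event_type: str   # "unavailability" or "availability" (illustrative values)
    start_time: int   # UNIX time, seconds
    end_time: int     # UNIX time, seconds

    @property
    def duration_seconds(self):
        return self.end_time - self.start_time


def parse_event(line):
    """Parse one tab-separated record in the assumed 8-column order."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 8:
        raise ValueError(f"expected 8 fields, got {len(fields)}")
    ids = [int(value) for value in fields[:4]]
    return TraceEvent(*ids, fields[4], fields[5], int(fields[6]), int(fields[7]))


sample = "1\t7\t42\t3\tnode-42\tunavailability\t1136073600\t1136077200"
event = parse_event(sample)
print(event.node_name, event.event_type, event.duration_seconds)
```

Keeping start and stop times as UNIX seconds makes event durations and inter-arrival times plain integer subtractions.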


Outline

Background

Our Approach

Analysis of Time-Correlation

Modeling the Peaks of Failures

Conclusions


Our Approach (1): Outline

Traces
• Nineteen failure traces from the FTA
• Mostly production systems

Analysis
• Use the auto-correlation of failure rate time series

Modeling
• Fit well-known probability distributions to the failure data to model failure peaks


Our Approach (2): Traces

• 100K+ hosts
• ~1.2 M failure events
• 15+ years of operation in total

http://fta.inria.fr


Our Approach (3): Analysis

• Auto-Correlation Function (ACF)
  • Similarity between observations as a function of the time lag between them
  • Mathematical tool for finding repeating patterns
• Used for assessing time correlations
  • Values lie in [-1, 1]: near 0 means weak correlation, near ±1 means strong correlation
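As an illustration of the ACF, the sketch below computes the standard sample autocorrelation of a time series in plain Python. The hourly failure counts are synthetic (a 24-hour cycle plus noise), and the function name is an assumption for this example; the talk does not specify which implementation was used.

```python
import math
import random


def autocorrelation(series, max_lag):
    """Sample autocorrelation coefficients of `series` for lags 0..max_lag."""
    n = len(series)
    if not 0 <= max_lag < n:
        raise ValueError("max_lag must be between 0 and len(series) - 1")
    mean = sum(series) / n
    centered = [value - mean for value in series]
    denom = sum(c * c for c in centered)
    if denom == 0.0:
        raise ValueError("series is constant; autocorrelation is undefined")
    return [
        sum(centered[i] * centered[i + lag] for i in range(n - lag)) / denom
        for lag in range(max_lag + 1)
    ]


# Synthetic hourly failure counts (illustrative, not trace data):
# a 24-hour cycle plus Gaussian noise, over four weeks.
rng = random.Random(0)
failures_per_hour = [
    5.0 + 3.0 * math.sin(2.0 * math.pi * hour / 24.0) + rng.gauss(0.0, 1.0)
    for hour in range(24 * 28)
]
acf = autocorrelation(failures_per_hour, max_lag=48)
print(f"lag 12 h: {acf[12]:+.2f}, lag 24 h: {acf[24]:+.2f}")
```

For a series with a daily cycle, the coefficient is strongly positive at a lag of one full period (24 hours) and strongly negative at half a period (12 hours), which is the kind of repeating pattern the ACF is meant to expose.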


Our Approach (4): Modeling

• We fit five probability distributions to the empirical data
  • Exponential, Weibull, Pareto, Log-Normal, and Gamma
• Maximum likelihood estimation + goodness-of-fit tests
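A minimal sketch of the fitting step for two of the five distributions, using the closed-form maximum likelihood estimators for the Exponential and Log-Normal. The data are synthetic, and the Kolmogorov-Smirnov distance is used here only as an assumed stand-in for a goodness-of-fit measure: the slide does not name the tests, and no p-values are computed.

```python
import math
import random


def fit_exponential(samples):
    """MLE for the exponential rate: 1 / sample mean."""
    return len(samples) / sum(samples)


def fit_lognormal(samples):
    """MLE for Log-Normal (mu, sigma): mean and population std of the logs."""
    logs = [math.log(s) for s in samples]
    mu = sum(logs) / len(logs)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / len(logs))
    return mu, sigma


def ks_statistic(samples, cdf):
    """Kolmogorov-Smirnov distance between the empirical CDF and `cdf`."""
    xs = sorted(samples)
    n = len(xs)
    return max(
        max((i + 1) / n - cdf(x), cdf(x) - i / n) for i, x in enumerate(xs)
    )


# Synthetic "failure durations" drawn from a Log-Normal (illustrative only).
rng = random.Random(1)
durations = [rng.lognormvariate(1.0, 1.5) for _ in range(2000)]

rate = fit_exponential(durations)
mu, sigma = fit_lognormal(durations)

ks_exp = ks_statistic(durations, lambda x: 1.0 - math.exp(-rate * x))
ks_logn = ks_statistic(
    durations,
    lambda x: 0.5 * (1.0 + math.erf((math.log(x) - mu) / (sigma * math.sqrt(2.0)))),
)
print(f"KS exponential: {ks_exp:.3f}, KS log-normal: {ks_logn:.3f}")
```

A smaller KS distance indicates a closer fit; Weibull, Pareto, and Gamma need numerical optimization for their MLEs and are left out of this sketch.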


Outline

Background

Our Approach

Analysis of Time-Correlation

Modeling the Peaks of Failures

Conclusions


Analysis (1): Auto-correlation

[Figure panel: WEBSITES]

• Many systems exhibit moderate/strong auto-correlation for moderate/short time lags (GRID5K, LDNS, SKYPE, …)


Analysis (2): Auto-correlation

[Figure panel: TERAGRID]

• A small number of systems exhibit low auto-correlation (TeraGrid, PNNL, NOTRE-DAME)


Analysis (3): Failure Patterns

[Figure panels: MICROSOFT and SKYPE, each annotated "Daily/Weekly Cycles"]

• Systems with similar usage patterns have similar failure patterns


Analysis (4): Workload Intensity vs Failure Rate

[Figure panel: GRID5000]

• There is a strong correlation between the workload intensity and the failure rate in some systems


Outline

Background

Our Approach

Analysis of Time-Correlation

Modeling the Peaks of Failures

Conclusions


Failure Peaks (1): Model

[Figure: peak model, with the mean μ and the threshold μ + kσ marked and four numbered callouts (1 to 4)]


Failure Peaks (2): Identification

Our goal
• Balance between capturing the extreme system behavior and characterizing an important part of the system failures

We use a threshold to isolate peaks
• μ + kσ, where k is a positive constant
• Large k => few periods, explaining only a small fraction of failures
• Small k => more failures, of probably very different characteristics

We use k=1
• Tried k={0.5, 0.9, 1.0, 1.1, 1.25, 1.5, 2.0}
• Over all traces, the average fraction of downtime and the average number of failures are close (see Technical Report)
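A minimal sketch of the thresholding step described above. It assumes that μ and σ are the mean and (population) standard deviation of a failure-count time series, and that a time slot counts as a peak when its count exceeds μ + kσ. The function name and the sample counts are hypothetical.

```python
import statistics


def flag_peaks(failure_counts, k=1.0):
    """Return (threshold, flags), where flags[i] is True if slot i is a peak.

    Assumptions: population standard deviation, strict ">" comparison.
    """
    mu = statistics.mean(failure_counts)
    sigma = statistics.pstdev(failure_counts)
    threshold = mu + k * sigma
    return threshold, [count > threshold for count in failure_counts]


# Hypothetical failures per hour for twelve consecutive hours.
counts = [2, 3, 2, 4, 3, 20, 25, 3, 2, 3, 2, 18]

threshold_k1, flags_k1 = flag_peaks(counts, k=1.0)
threshold_k2, flags_k2 = flag_peaks(counts, k=2.0)

peak_hours_k1 = [hour for hour, is_peak in enumerate(flags_k1) if is_peak]
peak_hours_k2 = [hour for hour, is_peak in enumerate(flags_k2) if is_peak]
print(peak_hours_k1)  # [5, 6, 11]
print(peak_hours_k2)  # [6]
```

Raising k from 1 to 2 leaves only the single most extreme hour, which illustrates the trade-off on the slide: a larger k isolates fewer periods that explain a smaller share of the failures.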


Failure Peaks (3): Modeling Results (1)

1. On average, 50% - 95% of the system downtime is caused by the failures that originate during peaks, but the fraction of peaks is < 10% for all platforms
2. The average peak durations are on the order of 1-2 hours
3. The average time between peaks is on the order of 15-80 hours
4. The average failure inter-arrival time (IAT) over the entire trace is about 9x the IAT during peaks
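Peak durations and times between peaks, as reported above, can be derived from a per-slot peak flag. The sketch below assumes fixed-length time slots and measures the time between peaks from the end of one peak to the start of the next; both are assumptions for illustration, and the function names are hypothetical.

```python
def peak_intervals(is_peak):
    """Return (start, end) slot indices, end exclusive, of consecutive peak slots."""
    intervals = []
    start = None
    for index, flag in enumerate(is_peak):
        if flag and start is None:
            start = index
        elif not flag and start is not None:
            intervals.append((start, index))
            start = None
    if start is not None:
        intervals.append((start, len(is_peak)))
    return intervals


def peak_statistics(is_peak, slot_hours=1.0):
    """Return (peak durations, gaps between consecutive peaks), both in hours."""
    intervals = peak_intervals(is_peak)
    durations = [(end - start) * slot_hours for start, end in intervals]
    gaps = [
        (following[0] - current[1]) * slot_hours
        for current, following in zip(intervals, intervals[1:])
    ]
    return durations, gaps


# Hypothetical hourly peak flags.
flags = [False, False, True, True, False, False, False,
         True, False, False, True, True, True]
durations, gaps = peak_statistics(flags)
print(durations)  # [2.0, 1.0, 3.0]
print(gaps)       # [3.0, 2.0]
```

Averaging `durations` and `gaps` over a whole trace gives quantities comparable in kind to results 2 and 3.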


Failure Peaks (4): Modeling Results (2)

5. Exponential distribution is not a good fit for IAT during peaks, time between peaks, and failure duration during peaks
   • Traditional models are not enough
6. Model parameters do not follow a heavy-tailed distribution
   • Goodness of fit test results (p-values) for the Pareto distribution are very low
7. Weibull and the Log-Normal provide the best fit
   • See the paper for the parameters


Conclusions (1)

Large-Scale Study
• Nineteen traces, most of which are from production systems
• 100K+ hosts, ~1.2 M failure events, 15+ years of operation
• Four new traces available in the FTA (3 CONDOR + 1 TERAGRID)

GOAL 1: Analysis
• Failures exhibit strong periodic behavior & time correlation
• Systems with similar usage patterns have similar failure patterns
• Strong correlation between workload intensity and failure rate


Conclusions (2)

GOAL 2: Modeling

• Model parameters: peak duration, time between peaks, the failure IAT during peaks, and the failure duration during peaks
• On average 50% - 95% of the system downtime is caused by the failures that originate during peaks (fraction of peaks < 10%)
• Weibull & the Log-Normal distributions provide a good fit


Thank you! Questions? Comments?

[email protected]
http://www.st.ewi.tudelft.nl/~nezih/

More Information:
• Guard-g Project: http://guardg.st.ewi.tudelft.nl/
• The Failure Trace Archive: http://fta.inria.fr
• PDS publication database: http://www.pds.twi.tudelft.nl



[Figure: paired plots for three example signals: random, sin + random, and sin]


Autocorrelation Function

[Figure: autocorrelation coefficient (-1 to +1) versus lag k (0 to 100)]

• Significant positive correlation at short lags


Autocorrelation Function

[Figure: autocorrelation coefficient (-1 to +1) versus lag k (0 to 100)]

• No statistically significant correlation beyond this lag


Long-range Dependence

• For most processes (e.g., Poisson, or compound Poisson), the autocorrelation function drops to zero very quickly
  • usually immediately or exponentially fast
• For self-similar processes, the autocorrelation function drops very slowly
  • i.e., hyperbolically, toward zero, but may never reach zero
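The difference between the two decay rates shows up when the autocorrelations are summed over lags. The sketch below uses two illustrative functional forms, a geometric decay 0.5^k for a short-range dependent process and a hyperbolic decay k^(-0.5) for a long-range dependent one; the specific parameters are assumptions chosen for the example, not values from the talk.

```python
def short_range_acf(lag, rho=0.5):
    """Exponentially (geometrically) decaying autocorrelation, illustrative."""
    return rho ** lag


def long_range_acf(lag, beta=0.5):
    """Hyperbolically decaying autocorrelation with 0 < beta < 1, illustrative."""
    return lag ** (-beta)


def partial_acf_sum(acf, max_lag):
    """Sum of the autocorrelation coefficients over lags 1..max_lag."""
    return sum(acf(lag) for lag in range(1, max_lag + 1))


for max_lag in (10, 100, 1000, 10000):
    short_sum = partial_acf_sum(short_range_acf, max_lag)
    long_sum = partial_acf_sum(long_range_acf, max_lag)
    print(f"lags 1..{max_lag}: short-range {short_sum:.3f}, long-range {long_sum:.1f}")
```

The short-range sum settles near 1 after a handful of lags, while the long-range sum keeps growing as more lags are included; this non-summable autocorrelation is the usual signature of long-range dependence.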


Autocorrelation Function

[Figure: autocorrelation coefficient (-1 to +1) versus lag k (0 to 100), comparing a typical long-range dependent process with a typical short-range dependent process]