
Module Prototype for Online Failure Prediction for the IBM Blue Gene/L

Lizandro D. Solano-Quinde, Student Member, IEEE, Brett M. Bode, Member, IEEE

Manuscript received March 10, 2008. This manuscript has been authored by Iowa State University of Science and Technology under Contract No. DE-AC02-07CH11358 with the U.S. Department of Energy. The authors would like to thank IBM for a Faculty Award, granted to B. Bode, which supported this work. L. Solano-Quinde and B. Bode are with the Scalable Computing Laboratory, Ames Laboratory, Ames, IA 50011 USA; phone: 515-294-9192; fax: 515-294-4491; e-mail: [email protected], [email protected].

Abstract— The growing complexity of scientific applications has led to the design and deployment of large-scale parallel systems. The IBM Blue Gene/L can hold in excess of 200K processors and it has been designed for high performance and reliability. However, failures in this large-scale parallel system are a major concern, since it has been demonstrated that a failure will significantly reduce the performance of the system.

Although reactive fault tolerant policies effectively minimize the effects of faults, it has been shown that these techniques drastically reduce the system performance. Proactive fault tolerant policies have emerged as an alternative due to the reduced performance degradation they impose.

Proactive fault tolerant policies are based on the analysis of information about the state of the system. The monitoring system of the IBM Blue Gene/L generates online information about the state of hardware and software of the system and stores that information in the RAS event log.

In this study, we design and implement a module prototype for online failure prediction. This prototype is tested and validated in a realistic scenario using the RAS event log of an IBM Blue Gene/L system. We show that our module prototype for failure prediction predicts up to 70% of the fatal events.

Index Terms—Blue Gene/L, Computer Fault Tolerance, Failure Analysis, Software Fault Tolerance.

I. INTRODUCTION

Large-scale parallel systems with thousands of processors have been deployed in the past several years to support the execution of complex scientific applications. One such system, designed and implemented by IBM, is the Blue Gene/L (BG/L) [1, 2, 3].

Since the probability of faults grows linearly with the size of the system [4, 5], fault tolerant policies are required to minimize the effects of faults on the performance of large-scale parallel systems.

Fault tolerant policies can be classified as reactive or proactive. Reactive fault tolerant policies minimize the effects of a fault on the execution of applications once the fault has occurred. Proactive fault tolerant policies predict the fault and take action to avoid an application crash [6, 7, 8].

Among reactive policies, checkpointing is the most commonly used [9, 10, 11]. Although checkpointing has been shown to successfully recover a system from faults, it is not well suited to large-scale systems because checkpointing can degrade the system performance down to 40% [12].

In terms of performance, the proactive approach has been shown to be more efficient than the reactive approach [13].

In our previous work [14] we addressed the issue of failure prediction for the IBM BG/L as a classification problem. As a classification problem, prediction involves matching a set of learned patterns against the data and if there is a match, an event prediction is made [15]. The Reliability, Availability and Serviceability (RAS) event log produced by an IBM BG/L system was used to find patterns relating non-fatal and fatal events. We showed that 84% of fatal events could be effectively predicted using the classification approach.

Although these theoretical results are important, it is necessary to implement this classification approach on a BG/L system to obtain results in a real-world scenario.

In this paper we design and implement a prototype of a module that applies this classification approach. The module utilizes the filtering and classification algorithms described in [14, 16]. To validate the correct design and implementation of the module, we simulate a realistic scenario using a RAS event log from a BG/L system. The RAS information we used was obtained from a 20-rack, 20,480-node IBM BG/L installed at the IBM T. J. Watson Research Center (BGW) during the period from December 31, 2006 to April 11, 2007. At the time of this writing BGW is ranked as the 8th fastest computer in the world, at 114 teraflops, on the Top500 list [17].

The remainder of this paper is organized as follows: the next section briefly summarizes related work; Section III describes the design of the module, whereas Section IV describes details of the implementation. Subsequently, in Section V the results are presented; and finally in Section VI conclusions and future work are discussed.

II. RELATED WORK

Proactive fault tolerant policies predict faults based on the analysis of RAS information collected by the system. Several research efforts have addressed the analysis of RAS information.


Sahoo et al. [18] analyzed RAS data for a 350-node cluster system and provided three algorithms for event prediction: time-series analysis, rule-based classification, and a Bayesian network model.

Liang et al. [19] investigated the temporal and spatial characteristics of fatal events, as well as the relation between non-fatal and fatal events, in a RAS event log of an IBM BG/L. Based on their findings, they proposed two failure prediction algorithms for the IBM BG/L that exploit the temporal and spatial characteristics of the faults.

In [14] we approached the event prediction problem for the IBM BG/L as a classification problem. We provided an algorithm to build a rule-based prediction model, and showed that up to 80% of faults can be effectively predicted.

All of these works show that a high percentage of faults can be predicted; however, it is necessary to test their validity in a realistic scenario, i.e. to design and implement a module for failure prediction and to deploy it in a large-scale system.

In this work we are interested not only in presenting the design and implementation of a module for fault prediction for the IBM BG/L, but also in obtaining results from a realistic deployment of this module and comparing them with the results obtained in our previous work, in order to gain insight into implementation issues in real scenarios.

III. DESIGN OF THE MODULE FOR FAILURE PREDICTION

The goal of the module for failure prediction (MFFP) is to take single system events, as they are generated by the monitoring system of the IBM BG/L, analyze them, and decide whether or not a fatal event will occur, i.e. perform failure event prediction. For this failure event prediction to be meaningful, it should be online and accurate, to avoid making decisions based on an incorrect prediction. From now on we use the terms entry and event interchangeably to refer to system events.

Failure event prediction is achieved by analyzing the RAS event log of the IBM BG/L. Not every entry in the RAS event log represents a distinct event in the IBM BG/L; there may be multiple entries for a single event. Therefore, a filtering phase, in which redundant and useless entries are eliminated, is necessary before analyzing the events. After this filtering phase the analysis itself is performed, and as a result of this analysis alarms are raised when necessary.

Figure 1 shows the architecture of the MFFP. In this figure the MFFP is divided into two sub-modules and two knowledge bases. The sub-modules are the filtering module and the analysis module. The knowledge bases are the filtering and categorizing rules, and the rules and keywords for event analysis.

A. The Filtering Module

The filtering module filters out the entries in the RAS event log that are not meaningful for event prediction, and also categorizes the remaining entries depending on the hardware they refer to.

There are two types of entries that should be filtered out: invalid entries, i.e. entries that do not have a valid location, and entries related to the file system and user applications, which are caused by hardware or software external to the IBM BG/L system and are therefore not predictable.

After the filtering process, the remaining entries are classified into five categories of related events: memory, network, node card, midplane switch, and service card. Events are categorized according to: i) the values of the ENTRY_DATA and LOCATION fields of the entry, and ii) the rules defined in the knowledge base. The structure of the knowledge base of filtering and categorizing rules, together with some example rules, is shown in Table I; an illustrative code sketch of this rule matching is given after the table.

Since filtering out and categorizing entries is a process that is performed over single entries and does not depend on previous entries, it is performed every time a new entry is inserted in the RAS event log.

TABLE I
FILTERING AND CATEGORIZING RULES

Category        | Rule
Memory          | ENTRY_DATA contains any of: EDRAM, ddr, parity, CE sym.
Network         | ENTRY_DATA contains any of: torus, tree, collective.
Node Card       | ENTRY_DATA contains NodeCard AND FACILITY <> DISCOVERY AND SEVERITY = ERROR.
Midplane Switch | LOCATION ends in Lx
Service Card    | ENTRY_DATA contains Service Card OR LOCATION ends in S
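To make the rule format concrete, the following is a minimal sketch of how the Table I rules could be encoded and applied. The RAS field names (ENTRY_DATA, LOCATION, FACILITY, SEVERITY) follow the log schema referred to above; the dictionary layout, the helper name, and the reading of "ends in Lx" as "L followed by digits" are illustrative assumptions, not the actual implementation.

```python
import re

# Illustrative encoding of the Table I categorizing rules as Python predicates.
# The interpretation of "ends in Lx" as "ends in L followed by digits" is an
# assumption made for this sketch.

CATEGORY_RULES = {
    "memory": lambda e: any(k in e["ENTRY_DATA"]
                            for k in ("EDRAM", "ddr", "parity", "CE sym")),
    "network": lambda e: any(k in e["ENTRY_DATA"]
                             for k in ("torus", "tree", "collective")),
    "node card": lambda e: ("NodeCard" in e["ENTRY_DATA"]
                            and e["FACILITY"] != "DISCOVERY"
                            and e["SEVERITY"] == "ERROR"),
    "midplane switch": lambda e: re.search(r"L\d+$", e["LOCATION"]) is not None,
    "service card": lambda e: ("Service Card" in e["ENTRY_DATA"]
                               or e["LOCATION"].endswith("S")),
}

def categorize(entry):
    """Return the first category whose rule matches the entry, or None."""
    for category, rule in CATEGORY_RULES.items():
        if rule(entry):
            return category
    return None
```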

Fig. 1. Architecture of the module for failure prediction: the RAS database feeds the Filtering Module (driven by the filtering rules), whose filtered events feed the Analysis Module (driven by the rules and keywords), which raises alarms.

471

Page 3: [IEEE 2008 IEEE International Conference on Electro/Information Technology (EIT 2008) - Ames, IA, USA (2008.05.18-2008.05.20)] 2008 IEEE International Conference on Electro/Information

152

3

B. The Analysis Module

The analysis module takes categorized events and, utilizing the knowledge base of rules and keywords, determines whether or not it is necessary to raise an alarm, i.e. whether a fatal event is predicted.

The rules were defined in our previous work [14]. The structure of the knowledge base of rules and keywords, together with some example rules, is shown in Table II.

To avoid raising several alarms for the same fatal event, a clustering scheme is required. In this clustering scheme, events related to the same location within a time window, referred to as the cluster window (TC), are grouped into a single cluster. After clustering, each cluster raises only one alarm.

Instead of raising alarms, this module could communicate with the IBM BG/L system to take some action. Possible actions include checkpointing the process, moving the process to another block [20], or performing fault-aware scheduling [21].

IV. IMPLEMENTATION OF THE MODULE FOR FAILURE PREDICTION

In this section the implementation details and algorithms of the modules described in the previous section are presented.

All the information needed for the operation of the module is stored in a MySQL database.

A. The Filtering Module

One of the assumptions made in the design of this module is that right after a new entry is recorded in the RAS event log, the filtering and categorization process is performed on that entry. To guarantee this requirement, the filtering module is implemented as an after-insert trigger [22].

For each entry, the algorithm followed is (a code sketch of this logic is given after the list):
1. Determine whether the new entry is meaningful for fatal event prediction, according to the criteria described in part A of Section III.
2. If the entry is meaningful, categorize it using the rules defined in the knowledge base of filtering and categorizing rules.
3. Insert the categorized entry into a table that contains only categorized entries. This new table is used to avoid: i) modifications to the data that could cause bottlenecks in the RAS event log, and ii) modifications to the structure of the RAS event log that could cause errors in the system.

B. The Analysis Module

The goal of this module is to raise alarms when a fatal event is predicted to occur. However, it is important not to raise multiple alarms for the same predicted event. In part B of Section III a clustering scheme was proposed for avoiding unnecessary alarms. This scheme requires working with groups of entries, which differs from the filtering module, which works with single entries. The clustering requirement imposes two constraints on the implementation: i) the module should be implemented outside the database manager, and ii) the module should run periodically after a fixed sleeping time (TZ). The selection of the sleeping time is not trivial.

The sleeping time and the cluster window (TC) affect the efficiency of the prediction. On the one hand, if TZ < TC then the number of unnecessarily raised alarms increases. On the other hand, if TZ >> TC then, theoretically, no unnecessary alarms are raised; however, when an alarm is raised it may be too late to take action to minimize the effects of the fatal event. Ideally, the relationship should be TZ = TC; however, since the beginning of an event cluster is not synchronized with the beginning of the sleeping window, it is best to define TZ ≥ TC.

Next, a sketch of the algorithm is presented. In this algorithm we assume that every cluster has the following fields: location, category, severity, initial_time, ending_time, entry_data, and frequency. The frequency field indicates the number of errors per second; the other fields are self-explanatory. A code sketch of these steps is given after Table II.
1. Take an entry and test whether it fits any of the rules defined in the knowledge base of rules and keywords.
2. If the event fits a rule, check whether a cluster for the location of the event already exists. If not, create a new cluster and initialize initial_time and ending_time with the time stamp of the current entry.
3. If a cluster exists for that location, test whether the time stamp of the event exceeds the cluster window of its cluster. If not, update ending_time with the time stamp of the current entry; otherwise create a new cluster as defined in the previous step.
4. Repeat steps 1, 2 and 3 for every entry collected during the last sleeping period.
5. Raise alarms for all clusters defined.

TABLE II
EXAMPLES OF RULES FOR EVENT ANALYSIS

Fatal Event | Predicting Event
external input interrupt: uncorrectable torus error | torus receiver x+ input pipe error(s). torus non-crc error(s).
L3 ecc status register: 05000000 uncorrectable error detected in EDRAM bank 1...1 | ## L3 EDRAM error(s) detected and corrected over ### seconds.
Power Good signal deactivated (Node Card)* | Node card is not fully functional. Cannot get assembly information for node card. PGOOD IS NOT ASSERTED. PGOOD ERROR LATCH IS ACTIVE. MPGOOD IS NOT OK. MPGOOD ERROR LATCH IS ACTIVE.
Power Good signal deactivated (Link Card)* | Most of the time a link card error is accompanied by a node card error, because the PGOOD signal is deactivated for the whole midplane.
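The following is a minimal Python sketch of steps 1-5, under the assumptions that rule matching reduces to keyword containment in ENTRY_DATA, that the cluster window is measured from a cluster's initial_time, and that entries carry EVENT_TIME timestamps; the rule contents and all names are illustrative, not the production implementation.

```python
from datetime import timedelta

# Illustrative sketch of the clustering/alarm procedure (steps 1-5 above).
# Rule matching, field names, and window handling are assumptions.

TC = timedelta(minutes=5)   # cluster window used in the experiments

# Hypothetical "rules and keywords" knowledge base: keyword -> predicted category
RULES = {"uncorrectable torus error": "network",
         "uncorrectable error detected": "memory"}

def match_rule(entry):
    """Step 1: return the predicted category if any rule keyword matches."""
    for keyword, category in RULES.items():
        if keyword in entry["ENTRY_DATA"]:
            return category
    return None

def analyze(entries):
    """Steps 2-5: group matching entries per location into clusters bounded by
    the cluster window TC, then raise one alarm per cluster."""
    clusters = {}                                  # location -> list of clusters
    for entry in entries:                          # step 4: every entry of the period
        category = match_rule(entry)
        if category is None:
            continue
        loc, ts = entry["LOCATION"], entry["EVENT_TIME"]
        last = clusters.get(loc, [])[-1:] or [None]
        cluster = last[0]
        if cluster is None or ts - cluster["initial_time"] > TC:
            # step 2 / step 3 (window exceeded): start a new cluster
            cluster = {"location": loc, "category": category,
                       "initial_time": ts, "ending_time": ts, "count": 1}
            clusters.setdefault(loc, []).append(cluster)
        else:
            # step 3: extend the existing cluster
            cluster["ending_time"] = ts
            cluster["count"] += 1
    # step 5: one alarm per cluster
    return [f"ALARM {c['category']} at {c['location']} "
            f"({c['initial_time']} - {c['ending_time']})"
            for per_loc in clusters.values() for c in per_loc]
```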

V. EXPERIMENTAL RESULTS

The ideal scenario for testing the MFFP is an actual implementation on an IBM BG/L. The results obtained from this implementation are to be compared with the results obtained in our previous work. In our previous work we used the RAS event log produced by the BGW, with information collected during a period of three months. To be consistent with our previous work, the testing period should be three months and the implementation should preferably be on the BGW. Instead of testing our module prototype for three months on the BGW, we decided to simulate a realistic scenario.

The simulated realistic scenario uses the same RAS event log as our previous work, i.e. the RAS event log of the BGW for the period from December 31, 2006 to April 11, 2007. During this period 2,142,149 entries were recorded, of which only 140 unique fatal events are relevant to our analysis. The entries of this RAS event log are fed into the MFFP one at a time, in chronological order.
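As an illustration only, this replay can be sketched as below, under the assumptions that the archived log has been exported to CSV with an EVENT_TIME column in ISO format and that the two sub-modules are exposed as callables; in reality the log resides in a database and the filtering module runs inside it as a trigger.

```python
import csv
from datetime import datetime

# Sketch of a replay driver for the simulation: feed archived RAS entries to
# the MFFP one at a time, in chronological order. The CSV export, its column
# names, and the callable hooks are assumptions made for illustration.

def replay(log_path, filter_and_categorize, analyze):
    with open(log_path, newline="") as f:
        entries = list(csv.DictReader(f))
    # parse timestamps and enforce chronological order
    for e in entries:
        e["EVENT_TIME"] = datetime.fromisoformat(e["EVENT_TIME"])
    entries.sort(key=lambda e: e["EVENT_TIME"])

    categorized = []
    for e in entries:
        filter_and_categorize(e, categorized)   # filtering module, per entry
    return analyze(categorized)                 # analysis module (offline test,
                                                # so no sleeping time is used)
```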

To be consistent with our previous work, we use the same cluster window of 5 minutes.

The sleeping time parameter is not used in this simulation, because our test is not running online.

The results of our previous work, for a prediction window of 24 hours, are used for comparison. This is a practical choice because the online analysis is most comparable to the analysis of our previous work with a large prediction window.

Table III presents the results obtained in our simulation.

TABLE III
COMPARISON OF PREDICTION CAPABILITIES OF THE MFFP WITH PREDICTION IN OUR PREVIOUS WORK

Category        | Fatal Events | Previous Work | MFFP | %
Memory          | 4   | 4   | 4  | 100
Network         | 67  | 55  | 51 | 93
Node Card       | 27  | 23  | 19 | 83
Midplane Switch | 42  | 28  | 23 | 82
Service Card    | 0   | 0   | 0  | 0
TOTAL           | 140 | 110 | 97 | 88

According to Table III, up to 88% of the fatal events predicted in our earlier work can be predicted by the MFFP, which corresponds to 70% of all the fatal events recorded on the BGW.

VI. CONCLUSIONS AND FUTURE WORK

For comparison purposes, the results of our previous work are considered the best possible results, since they were obtained under ideal circumstances, i.e. with complete information about past, present and future events, whereas the MFFP was evaluated in an almost real-life scenario, i.e. with only the events collected during the last sleeping period. Therefore, the results obtained by the MFFP cannot be better than the results of our previous work.

We showed that the MFFP predicts 88% of the events predicted in the best scenario. The difference can be explained by the fact that job log information was not used in the clustering scheme. Therefore, an enhancement to the MFFP is to use job information in the clustering scheme, i.e. to group entries that belong to the same job.

In our experiments we did not consider the sleeping time parameter, because in our testing scenario the MFFP did not operate strictly online. Therefore, i) no unnecessary alarms are raised, because no clusters are broken, and ii) all raised alarms are on time. Obviously, when operating online this parameter will affect the prediction results, so it is important to take it into consideration in future work.

Finally, to be fully operational on an IBM BG/L, the database utilized should be DB2 instead of MySQL, because the RAS event log of the BG/L system is stored in a DB2 database manager.

ACKNOWLEDGMENT

The authors would like to thank Mr. Sam Miller and IBM Corporation for their assistance in obtaining and interpreting the BG/L RAS and job log data.

REFERENCES

[1] IBM Redbooks, Unfolding the IBM eServer Blue Gene Solution, IBM Corp., February 2006.

[2] A. Gara, M. A. Blumrich, D. Chen, G. Chiu, P. Coteus, M. E. Giampapa, R. A. Haring, P. Heidelberger, D. Hoenicke, G. V. Kopcsay, T. A. Liebsch, M. Ohmacht, B. D. Steinmacher-Burow, T. Takken, P. Vranas, "Overview of the BlueGene/L architecture," IBM J. Res. & Dev. 49, IBM Corp., New York, pp. 195-212, May 2005.

[3] J. E. Moreira, G. Almasi, C. Archer, R. Bellofatto, P. Bergner, J. R. Brunheroto, M. Brutman, J. G. Castanos, P. G. Crumley, M. Gupta, T. Inglett, D. Lieber, D. Limpert, P. McCarthy, M. Megerian, M. Mendell, M. Mundy, D. Reed, R. K. Sahoo, A. Sanomiya, R. Shok, B. Smith, G. G. Stewart, “Blue Gene/L programming and operating environment,” IBM J. Res. & Dev. 49, IBM Corp., New York, 367-376, May 2005.

[4] P. Shivakumar, M. Kistler, S. Keckler, D. Burger, L. Alvisi, “Modeling the effect of technology trends on soft error rate of combinational logic,” in Proceedings of the 2002 International Conference on Dependable Systems and Networks, pp. 389-398.

[5] J. Srinivasan, S. Adve, P. Bose, J. Rivers, “The Impact of Technology Scaling on Processor Lifetime Reliability,” in Proceedings of the International Conference on Dependable Systems and Networks (DSN-2004), June 2004.

[6] R. Sahoo, A. Oliner, I. Rish, M. Gupta, J. Moreira, S. Ma, R. Vilalta, "Critical Event Prediction for Proactive Management in Large-scale Computer Clusters," in Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 2003.

[7] R. Sahoo, M. Bae, R. Vilalta, J. Moreira, S. Ma, M. Gupta, "Providing Persistent and Consistent Resources through Event Log Analysis and Predictions for Large-scale Computing Systems," in Workshop on Self-Healing, Adaptive and Self-MANaged Systems (SHAMAN), 2002.

[8] M. L. Fair, C. R. Conklin, S. B. Swaney, P. J. Meaney, W. J. Clarke, L. C. Alves, I. N. Modi, F. Freier, W. Fischer, N. E. Weber, "Reliability, Availability and Serviceability (RAS) of the IBM Server z990," IBM J. Res. & Dev. 48, IBM Corp., New York, 2004, pp. 519-534.


[9] A. Oliner, R. Sahoo, J. Moreira, M. Gupta, “Performance Implications of Periodic Checkpointing on Large-scale Cluster Systems”, in Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS’05), 2005.

[10] C. Krishna, K. Shing, Y. Lee, “Optimization Criteria for Checkpoint Placement,” in Communications of the ACM, Vol. 27 NO. 10, October 1984.

[11] Y. Ling, J. Mi, X. Lin, “A Variational Calculus Approach to Optimal Checkpoint Placement,” IEEE Transaction on Computers, Vol. 50 NO. 7, July 2001.

[12] I. R. Philip, “Software failures and the road to a petaflop machine,” in 1st Workshop on High Performance Computing Issues (HPCRI), February 2005.

[13] Y. Zhang, M. Squillante, A. Sivasubramaniam, R. Sahoo, "Performance Implications of Failures in Large-Scale Cluster Scheduling," 2005.

[14] L. Solano-Quinde, B. Bode, "RAS and Job Log Data Analysis for Failure Prediction for the IBM Blue Gene/L."

[15] C. Domeniconi, C. Perng, R. Vilalta, S. Ma, “A Classification Approach for Prediction of Target Events in Temporal Sequences,” in Proceedings of the 6th European Conference on Principles of Data Mining and Knowledge Discovery, 2002.

[16] Y. Liang, Y. Zhang, A. Sivasubramaniam, R. Sahoo, J. Moreira, M. Gupta, “Filtering Failure Logs for a BlueGene/L Prototype,” in Proceedings of the 2005 International Conference on Dependable Systems and Networks (DSN’05), 2005, pp. 476-485.

[17] Top 500 supercomputer sites [online], Available: http://www.top500.org/list/2007/11

[18] R. K. Sahoo, A. J. Oliner, “Critical Event Prediction for Proactive Management in Large-Scale Computer clusters,” in Proceedings of the ACM SIGKDD, International Conference on Knowledge Discovery and Data Mining, August 2003.

[19] Y. Liang, Y. Zhang, A. Sivasubramaniam, M. Jette, R. Sahoo, "BlueGene/L Failure Analysis and Prediction Models," in International Conference on Dependable Systems and Networks (DSN'06), 2006, pp. 425-434.

[20] S. Chakravorty, C. Mendes, L. Kale, “Proactive Fault Tolerance in Large Systems,” in HPCRI: 1st Workshop on High Performance Computing Reliability Issues, in Proceedings of the 11th International Symposium on High Performance Computer Architecture (HPCA-11). IEEE Computer Society, 2005.

[21] A. J. Oliner, R. K. Sahoo, J. E. Moreira, M. Gupta, A. Sivasubramaniam, "Fault-aware Job Scheduling for Blue Gene/L Systems," in Proceedings of the International Parallel and Distributed Processing Symposium, April 2004.

[22] P. Dubois, S. Hinz, C. Pedersen, MySQL Certification Study Guide, MySQL Press, 2004.
