2006: system management by exception, part 6 · pdf file · 2017-09-01system...

(c) Copyright IBM Corp. 2006 All Rights Reserved.

SYSTEM MANAGEMENT BY EXCEPTION, PART 6

Igor Trubin, PhD, IBM Global Services

Statistical Exception Detection System (SEDS) has been successfully used for more than six years to automatically produce web-based exception reports against the performance data warehouse for a large, multi-platform environment. Adding some application-specific metrics, including middleware traffic and response times, made SEDS an excellent tool for application performance management. This paper also describes how to create statistical control charts using a spreadsheet in order to capture a performance issue without using expensive tools with built-in SPC procedures.

1. Introduction

According to the Baseline magazine survey “Top 10 Projects in 2006” [baselinemag.com], one of the major technology initiatives ranked by expected spending is “Application Performance Management”, which is defined as “software that monitors applications and can proactively find potential problems”. Indeed, one of the biggest challenges in providing computer systems performance management for a large IT organization is proactively identifying upcoming performance issues. One efficient way of doing so is SEDS, the Statistical Exception Detection System, which was developed and implemented in 2000 to support the IT Capacity Management process in a large U.S.-based financial services company.

SEDS is based on the Statistical Process Control (SPC) concept. This concept is becoming popular for detecting statistically significant exceptions in a computer system’s behavior. The approach came from the mechanical engineering discipline [1] and was successfully adapted to computer systems through the development of the Multivariate Adaptive Statistical Filtering (MASF) technique [2]. Basically, SEDS is used for automatically scanning through large volumes of performance data and identifying global metric measurements that differ significantly from their expected values. The main concept was presented at the CMG 2001 [3] and 2002 [4] conferences.

To increase the efficiency of the product, in addition to producing smart alerts and control charts, the following new parts were added or suggested to be added based on some successfully tested prototypes:

- Tree-map (heat-chart) reporting based on exceptions [5];

- Mainframe workload level exception detection [6];

- Exceptions history charts [5];

- Workload pathologies recognition [7];

- Response time related metrics: number of active processes, middleware functions response time and traffic volume;

- Integrated metric (system health index [5]);

- A seven-day weekly profile control chart, in addition to the 24-hour profile control chart, to detect a bad day of the week if only daily data is available.

All those additions to the system made SEDS a very efficient tool for system and application performance management. The last three bullets above are new for SEDS and are discussed in this paper.

2. SEDS Structure

Figure 1 – SEDS Structure

SEDS is a subsystem of the Capacity Management system and uses inputs from the Performance Database (PDB). The structure is presented in Figure 1 and consists of the following main parts:

• exception detectors for the most important metrics including the CPU queue;

• SEDS Database with a history of exceptions;

• statistical process control daily (or weekly) profile chart generator;

• exception server name list generator for smart alerting.

Every day, the exception detector scans a six-month history of hourly performance data for each server. The full “7 days X 24 hours” adaptive filtering policy is applied to calculate the average, upper, and lower statistical limits of a particular metric for each weekday over the previous six months. A simple example of this calculation is presented at the end of this paper. To provide detailed information about the servers’ behavior for the previous day, the system publishes a control chart on an Intranet web site for each exception, as shown in Figure 2, for which Saturday is the example.

Expanding on the MASF method, the author acted on a suggestion to use a new derived metric such as “amount of exceptions per day” (or “ExtraVolume” as introduced in [4]) and to keep the history of exceptions in a separate exception database to produce advanced capacity planning analyses. Figure 2 shows this additional interesting metric as the “Extra Queue”.
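The “7 days X 24 hours” filtering described above can be illustrated with a minimal Python sketch. It is not the SEDS implementation: the function names, the three-sigma multiplier (borrowed from the spreadsheet example later in this paper) and the zero floor on the lower limit are assumptions made for illustration only.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, stdev


def masf_limits(samples, sigmas=3.0):
    """Group hourly samples into 7 x 24 weekday-hour buckets and compute limits.

    samples: iterable of (timestamp, value) pairs covering the history window.
    Returns {(weekday, hour): (mean, lower, upper)}, with weekday 0 = Monday.
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[(ts.weekday(), ts.hour)].append(value)

    limits = {}
    for key, values in buckets.items():
        if len(values) < 2:
            continue  # a standard deviation needs at least two points
        avg = mean(values)
        sd = stdev(values)
        lower = max(0.0, avg - sigmas * sd)  # assumed floor for queue-like metrics
        limits[key] = (avg, lower, avg + sigmas * sd)
    return limits


def find_exceptions(day_samples, limits):
    """Return the (timestamp, value) pairs of one day that fall outside the limits."""
    exceptions = []
    for ts, value in day_samples:
        key = (ts.weekday(), ts.hour)
        if key not in limits:
            continue  # no history for this weekday-hour bucket
        _, lower, upper = limits[key]
        if value < lower or value > upper:
            exceptions.append((ts, value))
    return exceptions
```

In this sketch the history passed to masf_limits would hold roughly six months of hourly points per server, and find_exceptions would be run once a day against the previous day's data.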

Figure 2 – Typical Exception - Statistical Process Control Chart for RUN Queue Metric

Figure 3 - Performance Status Web Report Based on SEDS


In this case, if the parent metric is the CPU Run Queue, ExtraVolume is the daily extra queue length on top of the usual queue and can be used to filter out the most severe exceptions from a long list of servers. That is extremely helpful, as SEDS works for a large computer farm. SEDS uses a bar chart generator to report the top 10 servers with the most significant exceptions according to this special metric. For each major application (or business area), SEDS publishes and automatically color-codes a web table similar to the one shown in Figure 3 and also sends smart alerts via e-mail with the list of servers that had exceptions for each of the main SEDS metrics, including CPU Queue.
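As a rough illustration of how an ExtraVolume-style metric could drive the top-10 report, the following sketch sums the hourly amounts by which a metric exceeds its upper limit and ranks servers by that sum. The exact definition of ExtraVolume is given in [4] and is not reproduced here; treating it as the excess above the upper limit, as well as the function names and the input layout, are assumptions.

```python
def extra_volume(actual, upper):
    """Sum the hourly amounts by which the actual values exceed the upper limits."""
    return sum(max(0.0, a - u) for a, u in zip(actual, upper))


def top_exceptions(servers, n=10):
    """Rank servers by extra volume.

    servers: {name: (actual_hourly_values, upper_hourly_limits)}
    Returns up to n (name, extra_volume) pairs, most severe first,
    skipping servers that never exceeded their limits.
    """
    scored = [(name, extra_volume(a, u)) for name, (a, u) in servers.items()]
    scored = [(name, ev) for name, ev in scored if ev > 0]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:n]
```

Ranking by a single number per server is what makes it practical to pick the most severe cases out of a long exception list.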

3. CPU Queue Related Metric Exception Analysis

Regarding the CPU Queue metric, SEDS currently uses the following MeasureWare metric: CPU Run Queue, which is the average number of “run-able” processes during the interval. That metric differs from Priority Queue, which is more useful for detecting possible disk subsystem issues. Run Queue is more useful for capturing CPU bottlenecks. In the HP MeasureWare tool, Run Queue is measured for HP UNIX servers across all CPUs as an average, while for Sun and AIX servers it is captured per processor.

If a CPU Queue exception is detected, it is a good idea to check for a CPU utilization exception in the same hour. If one is found, especially if CPU utilization is close to 100%, there is a high probability of a CPU capacity issue. Sometimes SEDS captures a CPU Queue exception only (no other exceptions), which might indicate an application or system configuration issue (e.g. a resource manager is used incorrectly).

When a global exception occurs, the workload level data can be scanned to identify which particular application on the server was responsible for the exception. As shown in Figure 1, SEDS has a subroutine that scans the workload level metrics for exceptions exactly for this purpose. The most appropriate (correlative) metric for the CPU Queue global exception is the number of active processes in an application group that are competing for the CPU. (An active process is one that exists and consumes some CPU time.)

In the HP MeasureWare (now called the OVPM tool) metric set there is APP_ACTIVE_PROC, which is the sum of the alive-process-time/interval-time ratios of every process belonging to an application that is active (uses any CPU time) during an interval. Most typical CPU Queue exceptions are accompanied by severe CPU Utilization exceptions and capture real CPU performance exceptions. The following two cases are of special interest:

A. CPU Queue exceptions without any CPU utilization exceptions, and

B. The opposite case, CPU utilization exceptions without any CPU Queue exceptions.

Let us take a look at those cases using some real data.

CASE A: “Server1” is a Sun Fire V880 4-way box. SEDS detected only a CPU Queue exception, shown in Figure 4; there were no CPU Utilization exceptions (the daily maximum was about 75%).

Figure 4 - Typical CPU Queue Exception for SUN Server


An additional scan against the application level data showed that Application5 had a similar exception, with its daily profile correlated with the global one. This means that an unusual number of active processes is the cause of the global CPU Queue exceptions, which were not severe enough to cause a global server issue but indicate a potential application performance problem.

CASE B: “Server2” is an HP rp7410 1-way box. SEDS detected a severe CPU utilization exception, as shown in Figure 5.

Figure 5 - CPU Utilization Global and Applications Level Data Exceptions

In spite of 100% CPU utilization for about 7 hours during the morning, SEDS did not capture any CPU Queue exceptions. This means that some process that does not belong to any application (it was captured in the “other-user-root” bucket, as seen in the middle control chart of Figure 5) was a run-away low-priority process that consumed all free CPU resources but did not much disturb other applications, such as “Application8” in the last chart of the same figure. To prove that, SEDS has a very useful utility that creates a control chart for any SEDS metric on an ad-hoc basis, even if an exception did not occur. Indeed, this utility shows a CPU Queue control chart with a pretty healthy profile in Figure 6.

Figure 6 - Healthy CPU Queue Control Chart

By the way, the side effect of SEDS capturing some common workload pathologies, like this run-away process, was discussed during the previous CMG conference [7].
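The reasoning in this section about combinations of CPU Queue and CPU utilization exceptions can be summarized as a small decision function. This is only a sketch of the analyst's rule of thumb, not a SEDS component; the function name, the returned messages and the 95% saturation threshold standing in for “close to 100%” are assumptions.

```python
def triage(queue_exception, cpu_exception, cpu_util_peak=0.0, saturation=95.0):
    """Suggest a first interpretation for one server-hour.

    queue_exception / cpu_exception: whether a CPU Queue or CPU utilization
    exception was detected; cpu_util_peak: peak CPU utilization in percent.
    """
    if queue_exception and cpu_exception:
        if cpu_util_peak >= saturation:
            return "likely CPU capacity issue"
        return "queue and utilization exceptions: review CPU capacity"
    if queue_exception:
        # Case A: check the number of active processes per application
        return "possible application or configuration issue"
    if cpu_exception:
        # Case B: check application-level CPU data for a run-away process
        return "possible run-away low-priority process"
    return "no exception"
```

For example, Case A above would correspond to triage(True, False, 75.0) and Case B to triage(False, True, 100.0).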

4. Applying SEDS to a Multi-Tier Environment

To ensure an adequate performance status of a particular application, it is not enough to publish reports about servers’ exceptions. Any modern complex application has a lot of devices involved, with not only servers but also network appliances such as workload balancers, firewalls and general network devices like switches and routers. All of them are usually combined in a multi-tier architecture with a web interface in front and with business logic components distributed across multiple middleware servers with access to databases residing on data management servers.


SEDS is currently being extended to cover this complex environment. Most network appliances that are technically UNIX boxes can easily be added to the exception detectors, but some more specific metrics should be added to SEDS, such as the number of concurrent IP connections. For routers and switches, the traffic volume and bandwidth utilization metrics (SNMP-based) are good candidates to be analyzed by SEDS.

To see the performance status of a complex application at a glance, in addition to the web exception report shown in Figure 3, the tree-map (or heat-chart) report can be used for some of the most important metrics. An example of the tree-map approach was presented in one of the CMG papers [5]. That was a Disk Subsystems performance status tree-map report, which reports the Disk I/O rate metric. CPU utilization is a good metric for server tree-mapping. For network devices, the bandwidth utilization can be tree-mapped. Figure 7 shows an example of a Network tree-map.

Figure 7 - Network Devices Bandwidth Utilization Tree-Map Report.

Color coding in this report could be based on exceeding constant thresholds or statistical control limits (SEDS based). Each small box represents a device (size could be indicative of relative capacity, e.g. 1 GB or 100 MB network) and a big outline box could represent a particular application or site (e.g. building). Some good news is that the latest version of SAS has a built-in procedure to create color tree-maps. The upper management of a company usually likes this type of report very much, as it gives them a snapshot of proactive information about which exact area (or application) can potentially generate a problem and requires immediate attention. One problem is that the tree-map cannot report more than one metric at a time.
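The two color-coding options mentioned above (constant thresholds or statistical control limits) can be contrasted in a short sketch. The function name, the color names and the 70% and 90% thresholds are purely illustrative assumptions and are not taken from SEDS or SAS.

```python
def box_color(value, upper_limit=None, warn=70.0, crit=90.0):
    """Pick a tree-map box color for one device.

    If upper_limit (a statistical control limit for this device) is given,
    it takes precedence over the constant warn/crit thresholds.
    """
    if upper_limit is not None:
        return "red" if value > upper_limit else "green"
    if value >= crit:
        return "red"
    if value >= warn:
        return "yellow"
    return "green"
```

With a statistical limit, a device running at 50% can still be flagged if its usual level is much lower, while a device that is always busy is not flagged merely for crossing a fixed threshold.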

To avoid producing multiple tree-maps for one application, an integrated performance metric can be used instead. The aforementioned CMG paper [5] discussed one such metric from the Concord tool, the Health Index, which has several components describing the performance status of each main subsystem, such as CPU, memory, storage and network (see Figure 10). The best metric for managing a multi-tier OLTP type of application is the end-to-end response time. One approach that was successfully tested applied SEDS to response time and some other application-oriented metrics. The data was collected from Middleware servers to monitor the performance of functions that control APIs triggered by some OLTP applications once a customer places a request from the front end.

Figure 8 - Middleware Transaction Data Collection/Reporting Process

Based on the data collected this way, SEDS could capture exceptions of Application Response Time (ART) and Calls Volume of particular functions (API Calls) within the Middleware tier. Figure 9 shows an example of these statistical exceptions. Exceptions of these metrics are much more meaningful for application performance management but still do not cover the end-to-end transaction flow. In any case, when this type of exception occurs, a performance analyst can put together server subsystem exceptions (CPU, memory and disk, as shown in Figure 3) and network subsystem exceptions in order to discover the root cause of the application performance degradation.


Figure 9 - Applications Response Time and APIs Call Volume Control Charts

Finally, for SEDS to cover the entire multi-tier environment, the total end-to-end response time should be collected and sent to SEDS for exception detection. There is a potential to use the Mercury TOPAZ tool for this purpose. The tool simulates user activities from one front-end workstation by executing typical functions performed by a program-robot. For a web application, the number of web sessions is another very suitable metric to use in SEDS. If all those additional metrics were incorporated, SEDS could proactively report a statistically unusually long transaction even if it is still acceptable under the SLA and, together with other exception reports, could give performance analysts or capacity planners good “heads-up” information to act on before a real incident happens.

5. Control Chart Calculation Technique

At every presentation that covers SEDS, there is always a question about how SEDS works. Thus, it would be logical to demonstrate some statistical techniques used for creating control charts. That could help readers perform a simple MASF type of analysis, at least on an ad-hoc basis.

This technique (Control or Quality charts) is implemented in numerous tools such as SAS/QC, JMP, BMC Visualizer or BEZsystems (for Oracle or DB2 performance management) and some others, but the user does not need to have expensive software to produce a simple control chart for capturing a system performance issue. A simple spreadsheet can be used for that!

SEDS uses two types of control charts: 24-hour profile charts as shown on Figures 2-5 and a week profile chart as shown on Figures 8 and 9. The first one is preferable but requires hourly data points in the historical data. If only daily points are available, the weekly control chart can be built. Let us take daily Health Index data for some Wintel servers as an example. Actually, the exact same metric is used in SEDS to capture any unusual weekday in terms of a particular Wintel server's performance.

Figure 10 - Health Index Monthly View


The Health Index is a sum of five components (variables). Each of them may have a value in the range from 0 (excellent condition) to 8 or more (poor condition) and is an indication of the following problems:

• SYSTEM, which reports a CPU imbalance problem;

• MEMORY, which means exceeding some memory utilization threshold or reflects some paging and/or swapping problems;

• CPU, which means exceeding some utilization threshold;

• COMM., which reports network errors or exceeding some network volume thresholds;

• And STORAGE, which might be a combination of

a. Exceeding user or system partition utilization thresholds;

b. File cache miss rate, allocation failures and Disk I/O faults problem that can add additional points to this Health Index component.

Let us look at an example. SEDS captured an exception during the last week and generated a Control Chart just like the one shown inside Table 2. Indeed, Figure 10 shows a Health Index chart, where the unusual usage of the CPU subsystem can be seen for the last week. The CPU utilization chart on Figure 10 also shows a CPU usage imbalance, which is reflected as a blue “system” component of the Health Index. How did SEDS calculate data for this control chart? That can be done by a generic spreadsheet calculation shown in the following Tables 1 and 2.

DATE        HEALTH INDEX   WEEKDAY
6-Dec-05    2.3            3
7-Dec-05    1.5            4
8-Dec-05    0.0            5
9-Dec-05    1.1            6
…           …              …
23-May-06   4.4            3
24-May-06   6.0            4
25-May-06   0.3            5
26-May-06   1.0            6

Table 1 - Raw Performance Data (ServerW Health Index)

Table 2 - Weekly Health Index Control Chart Builder


The raw data was captured by an SNMP-based Concord tool and summarized by a SAS job as a daily stamped table, shown in Table 1. A weekday is added by applying the WEEKDAY(DATE) spreadsheet function to the date in the 1st column. This data can easily be re-sorted by weekday, as shown on the left side of Table 2. Finally, to calculate data for the control chart, the following spreadsheet formulas are used.

For SUNday (Column “B”):

Mean = AVERAGE(B2:B25)
Upperlimit = AVERAGE(B2:B25)+3*STDEV(B2:B25)
Lowerlimit = IF(AVERAGE(B2:B25)-3*STDEV(B2:B25)<0,0, AVERAGE(B2:B25)-3*STDEV(B2:B25))
StdDeviation = STDEV(B2:B25)

For other columns, “B” should be replaced with the corresponding column letter (e.g. MONday is “C”, and so on).

Note that some special conditions should be added to that calculation to reflect the natural features of a particular metric. In this case it is done for Lowerlimit, as the Health Index cannot be less than zero. For utilization types of metrics, Upperlimit should also be capped at 100% in case the calculation exceeds that natural threshold.
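For readers who prefer a script to a spreadsheet, the same weekly control chart calculation can be sketched in Python. This is a hedged equivalent of the formulas above, not SEDS code: the function names and the floor and ceiling parameters are assumptions. statistics.stdev is the sample standard deviation, which corresponds to the spreadsheet STDEV function.

```python
from collections import defaultdict
from datetime import date
from statistics import mean, stdev


def spreadsheet_weekday(d):
    """Match the spreadsheet WEEKDAY() default: 1 = Sunday ... 7 = Saturday."""
    return d.isoweekday() % 7 + 1


def weekly_control_chart(rows, sigmas=3.0, floor=0.0, ceiling=None):
    """Compute mean and control limits per weekday from daily data.

    rows: iterable of (date, value) pairs, e.g. daily Health Index values.
    floor / ceiling: natural bounds of the metric (0 for the Health Index,
    0 and 100 for utilization metrics); None disables a bound.
    """
    by_day = defaultdict(list)
    for d, value in rows:
        by_day[spreadsheet_weekday(d)].append(value)

    chart = {}
    for weekday, values in sorted(by_day.items()):
        if len(values) < 2:
            continue  # a standard deviation needs at least two points
        avg, sd = mean(values), stdev(values)
        lower, upper = avg - sigmas * sd, avg + sigmas * sd
        if floor is not None:
            lower = max(floor, lower)
        if ceiling is not None:
            upper = min(ceiling, upper)
        chart[weekday] = {"mean": avg, "stdev": sd, "lower": lower, "upper": upper}
    return chart
```

A new daily value is then an exception for its weekday if it falls above "upper" or below "lower" in the returned dictionary.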

Figure 11 – Frequency Histogram vs. Normal Distribution

Another frequently asked question [8] is how close the data is to the normal distribution, as this is one of the requirements of the SPC technique. Figure 11 shows how well the histogram fits the normal distribution for the 1st two columns in the data from Table 2. Sometimes we see a good fit, but sometimes it is not so good. The main reason is that there are some natural thresholds which cut the histogram. In this case it is the “0” value. For utilization types of metrics it is also 100%. Also, a larger sample (more than a six-month history) might show a better fit to the normal distribution for stable production servers.

This example is very simple but demonstrates the method well. In more complex situations (e.g. the 24-hour profile of metrics like I/O or paging rates), some empirical rules should be added to suppress exception detection in order to avoid “false positive” alerts. Some of these rules can be found in the first MASF CMG paper [2] as well as in other SEDS CMG papers [3, 4]. With a perfectly tuned SEDS, a few capacity management analysts receive only dozens of smart alerts, which lets them focus on the real exceptions even when the overall number of servers they have to deal with is in the thousands.
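The histogram-versus-normal comparison discussed in this section can also be approximated without a charting tool. The sketch below bins the data and, for each bin, reports the observed count next to the count expected from a normal distribution fitted with the sample mean and standard deviation. The function name and the default bin count are assumptions; statistics.NormalDist requires Python 3.8 or later.

```python
from statistics import NormalDist, mean, stdev


def histogram_vs_normal(values, bins=8):
    """Return (bin_start, bin_end, observed, expected) tuples for equal-width bins.

    expected is the count predicted by a normal distribution fitted to values.
    """
    sd = stdev(values)
    if sd == 0:
        raise ValueError("values must vary to fit a normal distribution")
    fitted = NormalDist(mean(values), sd)

    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # the maximum value is placed in the last bin
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1

    n = len(values)
    rows = []
    for i, observed in enumerate(counts):
        start = lo + i * width
        end = start + width
        expected = n * (fitted.cdf(end) - fitted.cdf(start))
        rows.append((start, end, observed, expected))
    return rows
```

For a metric with a natural floor at zero, the lowest bin will typically hold more observations than the fitted curve predicts, which is the cut-off effect described above.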

6. Summary

The Statistical Exception Detection System has been successfully used as a capacity management tool in a large computer farm for about six years. The addition of application-specific metrics to the system (response time, transaction volumes, network traffic and others) allows SEDS to be used as a good tool for application management purposes. The CPU Queue metric and some related application level metrics (the number of active processes, the middleware API metrics and others) can be efficiently used within SEDS to capture some application and system performance issues. The web sessions or even end-to-end response time metrics are recommended for use within SEDS in order to provide a complete, daily updated picture of what is going on inside complex applications, systems or entire business areas. This paper also shows how to build a simple Control Chart using an integrative performance metric (System Health Index). This demonstrates the main statistical method used by SEDS.


7. References

[1] Krajewski, Ritzman: “Operation Management”, 1990, Addison-Wesley Publishing Company, Inc.

[2] Jeffrey Buzen and Annie Shum: “MASF -- Multivariate Adaptive Statistical Filtering”, Proceedings of the Computer Measurement Group, 1995, pp. 1-10.

[3] Kevin McLaughlin, Igor Trubin: “Exception Detection System, Based on the Statistical Process Control Concept”, Proceedings of the Computer Measurement Group, 2001.

[4] Igor Trubin: “Global and Application Levels Exception Detection System, Based on MASF Technique”, Proceedings of the Computer Measurement Group, 2002.

[5] Linwood Merritt, Igor Trubin: “Disk Subsystem Capacity Management Based on Business Drivers I/O Performance Metrics and MASF”, Proceedings of the Computer Measurement Group, 2003.

[6] Igor Trubin: “Mainframe Global and Workload Level Statistical Exception Detection System, Based on MASF”, Proceedings of the Computer Measurement Group, 2004.

[7] Igor Trubin: “Capturing Workload Pathology by Statistical Exception Detection System”, Proceedings of the Computer Measurement Group, 2005.

[8] Mazda Marvasti: “Rethinking IT Measurement to Find the Source of Performance Problems in Distributed Applications”, CMG MeasureIT issue 4.08, August, 2006.

