HP Service Health Analyzer: Decoding the DNA of IT performance problems

Technical white paper

Table of contents

Introduction

HP unique approach: HP SHA driven by the HP Run-time Service Model

HP SHA: runtime predictive analytics

Product capabilities

Get started with zero configuration and zero maintenance

Return on investment

Conclusion

Introduction

Complete visibility into the health of your business services, together with the ability to adapt and even survive in today’s cloud and virtualized IT environments, isn’t just a “nice-to-have.” It is mandatory. Managing dynamic infrastructure and applications takes more than reacting to business service problems when they occur, or manually updating static thresholds that are difficult to set accurately and problematic to maintain.

In today’s world, you need advance notification of problems so you can solve them before the business is impacted. You need better visibility into how your applications and business services are correlated with your dynamic infrastructure, so you can track anomalies across the complete IT stack, including the network, servers, middleware, applications, and business processes. You need an easier way of determining acceptable thresholds as a basis for identifying events that might impact the business. You need automation that applies the knowledge gained from past events to address new events more efficiently and to suppress extraneous events, so that IT can focus on just the business-impacting events.

While IT organizations have the methods to collect massive amounts of data, what has been lacking is the analytic tool set and automated intelligence to correlate these disparate metrics from both an application and a topology perspective, and so help these organizations anticipate or forecast potential problems on the horizon. IT managers are looking into the world of predictive analytics, one of the notable business intelligence trends of 2011, to help them improve service uptime and performance, thereby increasing business-generated revenue and decreasing maintenance and support costs.

HP Service Health Analyzer (SHA) is a predictive analytics tool built on top of a real-time, dynamic service model so you can understand the relationship of metric abnormalities with the application and its underlying infrastructure.

HP unique approach: HP SHA driven by the HP Run-time Service Model

Monitoring systems provide measurements and events from all layers of the IT stack: hardware, network, OS, middleware, application, business services, and processes. Configuration management databases (CMDBs) provide the model that links all of the different components. But given the ever-changing nature of IT systems, CMDBs need to be constantly updated, as is the case with the HP Run-time Service Model (RtSM). The combination of the monitors and the real-time CMDB provides all the necessary data to meet the challenges above. However, all of this data needs to be transformed into actionable information. HP SHA uses advanced algorithms that combine multiple disciplines (topology, data analytics, graph theory, and statistics) in the Run-time Anomaly Detection (RAD) Engine.

HP’s solution to the outdated service model is the RtSM. The RtSM synchronizes with the HP Universal Configuration Management Database (UCMDB) to leverage the service modeling in this “external” UCMDB. The RtSM then leverages the data collectors of the HP Business Service Management (BSM) portfolio, which monitor performance, availability, fault, and topology, to share “real-time” topology so that the RtSM has the most current understanding of topology and relationships. The RtSM is a core foundation of SHA.

For more information on how the RtSM works with the UCMDB, see the “RtSM best practices guide”:

http://support.openview.hp.com/selfsolve/document/KM1230398/binary/BSM9.10_RTSM_BestPractices.pdf?searchIdentifier=-308a48b7%3a13432d0f9eb%3a-4081&resultType=document

Figure 1. Solution template

Figure 1 outlines the components of SHA that we determined are required for an accurate solution to decode IT performance problems. We now outline the components and their requirements.

Baselining is the first component. It takes every metric collected by monitoring systems and learns its normal behavior. Deviations of a metric from its normal behavior serve as the first step for detecting, predicting, and decoding performance problems. However, learning the normal behavior of metrics accurately is a challenging task. Factors such as seasonal behavior, trends, and changes due to an ever-evolving IT system require the learning algorithm that estimates the baseline to be adaptive and aware of these factors. Figure 2 shows the distribution of the season for over 17,000 performance metrics collected from a real IT system. These are a combination of system, application, and user-level monitors. As can be seen, over two thirds of the metrics display some seasonal behavior, and these represent a range of seasons, not just the typically assumed daily or weekly seasonality. To be accurate, a baseline algorithm must first estimate the season. For example, if a metric has a seasonal behavior of five hours, and a baseline algorithm ignores the season or uses a predetermined season that is incorrect (for example, 24 hours), it will produce a poor baseline. The baseline will either be too sensitive, flagging many deviations that are actually normal, or too indiscriminate, failing to detect deviations from normal behavior when they exist.

Figure 2. Distribution of seasonal behavior for over 17,000 metrics collected from an IT environment

Similarly, estimating trend and being adaptive to changes are important for estimating a good baseline.
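The effect of estimating the season before baselining can be illustrated with a minimal Python sketch. It is not the SHA algorithm, which this paper does not disclose. It assumes a simple autocorrelation test over a handful of candidate season lengths, and the function names, the candidate list, and the sample data are all hypothetical.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of the series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    variance = sum((x - mean) ** 2 for x in series)
    if variance == 0:
        return 0.0
    covariance = sum(
        (series[i] - mean) * (series[i + lag] - mean) for i in range(n - lag)
    )
    return covariance / variance


def estimate_season(series, candidate_lags, min_strength=0.5):
    """Return the candidate lag with the strongest autocorrelation,
    or None when no candidate exceeds min_strength."""
    best_lag, best_score = None, min_strength
    for lag in candidate_lags:
        score = autocorrelation(series, lag)
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def seasonal_baseline(series, season):
    """Average the samples that share the same position within the season."""
    sums = [0.0] * season
    counts = [0] * season
    for i, value in enumerate(series):
        sums[i % season] += value
        counts[i % season] += 1
    return [s / c for s, c in zip(sums, counts)]


# Hypothetical hourly metric with a five-hour cycle, observed for 240 hours.
pattern = [10.0, 40.0, 80.0, 40.0, 10.0]
samples = pattern * 48

season = estimate_season(samples, candidate_lags=[5, 12, 24, 168])
learned = seasonal_baseline(samples, season)   # follows the five-hour cycle
assumed_daily = seasonal_baseline(samples, 24)  # every slot averages to 36.0
```

With the estimated season the baseline reproduces the cycle, while the predetermined 24-hour season averages the cycle away into a flat line, so every peak and trough of the normal pattern would look like a deviation.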

While understanding the normal behavior of individual metrics is important, it is not sufficient to detect and predict real problems. By definition, a small fraction of the deviations from the baseline will not be related to any problem; in a large IT environment with millions of metrics, even this small fraction can lead to too many false alerts if each deviation is treated individually as a problem. In addition, problems typically do not manifest themselves on a single metric in the environment.
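A short calculation shows how quickly harmless deviations add up. The figures below are assumptions chosen only for illustration; they are not measurements from the paper.

```python
def expected_false_breaches(metric_count, samples_per_day, false_one_in):
    """Expected number of harmless baseline deviations per day.

    false_one_in: one sample in this many deviates from the baseline
    without indicating a real problem (an assumed rate).
    """
    return metric_count * samples_per_day // false_one_in


# Assumed figures: one million metrics sampled every five minutes
# (288 samples per day), with 1 in 1,000 samples deviating harmlessly.
per_day = expected_false_breaches(1_000_000, 288, 1_000)
print(per_day)  # 288000
```

Even at this low assumed rate, treating every deviation as its own alert would produce hundreds of thousands of alerts per day.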

Temporal analysis: This is one of the widespread approaches for combining metrics into a single anomaly. Temporal analysis methods include metric-to-metric correlations, where metrics are grouped together based on the similarity of their time-series measurements, and multivariate temporal analysis or prediction, which combines multiple metrics through a (typically linear) multivariate mathematical model, such as multivariate regression, neural network, and Bayesian models.

These methods are powerful but have their limitations. First, they scale poorly with the number of metrics. Second, given their statistical nature, they can find misleading correlations if they are provided a very large number of metrics that have no real relationship between them; the chance of finding such wrong correlations increases with the number of metrics.
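Both limitations can be reproduced with a small experiment. The sketch below generates metrics that are unrelated by construction (independent random samples with a fixed seed) and reports the strongest pairwise Pearson correlation found as more metrics are added. The data, sizes, and function names are assumptions made for illustration.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def pair_count(metric_count):
    """Number of metric pairs an all-against-all comparison must examine."""
    return metric_count * (metric_count - 1) // 2


def strongest_correlation(metrics):
    """Largest absolute pairwise correlation among the given metrics."""
    best = 0.0
    for i in range(len(metrics)):
        for j in range(i + 1, len(metrics)):
            best = max(best, abs(pearson(metrics[i], metrics[j])))
    return best


rng = random.Random(7)
# 200 metrics with no real relationship: 30 independent samples each.
metrics = [[rng.gauss(0.0, 1.0) for _ in range(30)] for _ in range(200)]

strongest = {}
for count in (10, 50, 200):
    strongest[count] = strongest_correlation(metrics[:count])
    print(count, pair_count(count), round(strongest[count], 2))
```

The number of pairs grows quadratically with the number of metrics, and because each larger set contains the smaller ones, the strongest correlation found among unrelated metrics can only stay the same or increase as metrics are added.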

Topology analysis: What helps temporal methods overcome their limitations is domain-related context. In particular, in IT environments, the set of metrics being analyzed should be limited to a logical set of related metrics. If CPU utilization on two completely unrelated servers becomes high at the same time, the two metrics should not be considered correlated even if statistically they appear to be. Such context is provided by the topology of IT systems, through CMDBs. A CMDB is essentially a graph that models the relationships between all components making up IT systems: the physical, middleware, software, application, business service, and process layers. Topology analysis, in the form of advanced graph algorithms, is therefore required to extract the contextual information within the CMDB and to help detect real problems and correlations between metrics while filtering noise.
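As a minimal sketch of topology filtering, the example below treats the CMDB as an undirected graph and keeps a candidate correlation only when the two configuration items lie within a few hops of each other. The breadth-first search, the hop limit, and the CI names are assumptions for illustration; the paper does not describe the graph algorithms SHA actually uses.

```python
from collections import deque


def hops_between(graph, start, goal, max_hops):
    """Breadth-first search; returns the hop count from start to goal,
    or None when goal is not reachable within max_hops."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, distance = queue.popleft()
        if distance == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor == goal:
                return distance + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, distance + 1))
    return None


def topologically_related(graph, ci_a, ci_b, max_hops=3):
    """True when the two CIs are connected within max_hops."""
    return hops_between(graph, ci_a, ci_b, max_hops) is not None


# Hypothetical topology: two services that share no components.
graph = {
    "online_store": ["web_app"],
    "web_app": ["online_store", "server_a"],
    "server_a": ["web_app"],
    "payroll": ["server_b"],
    "server_b": ["payroll"],
}

# Candidate pairs whose metrics look statistically correlated.
candidates = [("server_a", "server_b"), ("server_a", "online_store")]
kept = [pair for pair in candidates if topologically_related(graph, *pair)]
print(kept)  # [('server_a', 'online_store')]
```

The two unrelated servers are dropped regardless of how similar their CPU curves look, which also shrinks the number of pairs a temporal method has to examine.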

Therefore, detecting a real problem requires detecting patterns of deviations from normalcy across multiple metrics that span time and are filtered by topology. This leads to statistical learning methods that analyze both temporal and topological data.

Historical analysis: Beyond detection and prediction of a problem, the topology provides the ability to scope the problem and separate root cause from symptoms; both are important for resolving problems quickly. With a problem detected and analyzed, its DNA pattern is finally decoded, and it can be stored in a knowledgebase. To leverage the knowledgebase, algorithms that perform historical analysis are required. These include algorithms for matching and comparing different problem DNA patterns, clustering them, and classifying them. With the knowledgebase and algorithms in place, past problems can be leveraged quickly and automatically to help find the root cause of and resolution to new problems.

RAD Engine: This complete set of algorithms defines the RAD Engine. The algorithms within the RAD Engine are the subject of 10 separate patent applications. The output of the RAD Engine is a critical key performance indicator (KPI) in the HP BSM dashboard, and the engine sends an event into the BSM event subsystem, HP Operations Manager i (OMi). The event from SHA contains a wealth of contextual information gathered by the RAD Engine, including the lead suspects, location information, business impact information, a list of the configuration items (CIs) involved in the anomaly, and any similar anomaly information. This information helps customers isolate and resolve the event quickly, before the business is impacted.

HP SHA: runtime predictive analytics

In SHA we developed statistical learning algorithms, coupled with graph algorithms, for analyzing the full spectrum of data collected by BSM systems:

• Monitoring data (synthetic and real user)

• Events

• Changes

• Topology from the RtSM

These algorithms accurately detect anomalies, decode their DNA structure and business impact, and match them to previously decoded anomalies collected in our Anomaly DNA Knowledgebase.

SHA can be described in the following steps:

• Metric behavior learning

Learning the normal behavior, also known as baselining, of the metrics collected from all levels of the service (system, middleware, application, and others) is a necessary first step. It removes the need to set static thresholds and enables early detection of deviations from normalcy. Key strengths of our algorithms are:

– Automatic learning of the metric seasonal behavior and its trend

– Adaptive to behavioral changes over time, a must in virtualized environments

– Configuration free: no administrative effort to set or maintain thresholds is required

• Anomaly DNA Technology: detection

As a holistic problem evolves in an IT service, numerous metrics and components related to that service begin to experience deviations from normal behavior. However, there are constant momentary deviations from normalcy by various components that do not represent any meaningful problem. Selecting the meaningful problems and discovering the DNA of true problems is the challenge of any anomaly detection system. Our anomaly DNA detection algorithm accomplishes this using a unique statistical algorithm, which combines three types of information necessary for achieving accurate detection:

– Topological information: logical links between monitors and the components they monitor

– Temporal information: the duration and temporal correlation of the monitors being in an abnormal state

– Statistical confidence information: the probability that the monitor is truly in an abnormal state, as learned by the baseline over time

Key strengths of our anomaly detection algorithm are:

– Clutter reduction: Provides an automatic method to group metrics that breached their baseline, using both temporal and topological information. This in turn reduces the number of baseline-breach events that an operator has to look at, without having to set any rules.

– Event reduction: The algorithms of SHA combine multiple abnormal metrics into a single event, reducing the total number of events presented to an operator. The entry point for this type of event is multiple metrics breaching their dynamic thresholds. SHA then correlates these metrics by time and topology to generate a single event, allowing the operator to focus on the real issue.

– False alarm reduction: Reduces the number of false alerts by computing the significance of an anomaly in the system using a statistical algorithm. In addition, known anomalies that were marked as noise in the past are matched against current anomalies to suppress the anomaly event.
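The way topological, temporal, and statistical confidence information can combine into a single event is sketched below. This is only an illustration under stated assumptions: breaches are grouped per service through a hypothetical CI-to-service mapping, groups are split on gaps longer than a fixed window, and significance is computed as one minus the probability that every breach is a false alarm, assuming independent breaches. None of these choices are taken from the SHA implementation.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Breach:
    metric: str
    ci: str
    minute: int        # minutes since midnight
    confidence: float  # assumed probability that the breach is truly abnormal


def group_breaches(breaches, service_of, window=30):
    """Group breaches per service (topology), then split each group
    wherever consecutive breaches are more than `window` minutes apart."""
    by_service = defaultdict(list)
    for breach in breaches:
        by_service[service_of[breach.ci]].append(breach)
    groups = []
    for service, items in by_service.items():
        items.sort(key=lambda b: b.minute)
        current = [items[0]]
        for breach in items[1:]:
            if breach.minute - current[-1].minute <= window:
                current.append(breach)
            else:
                groups.append((service, current))
                current = [breach]
        groups.append((service, current))
    return groups


def significance(group):
    """1 - P(every breach is a false alarm), assuming independence."""
    all_false = 1.0
    for breach in group:
        all_false *= 1.0 - breach.confidence
    return 1.0 - all_false


def anomaly_events(breaches, service_of, min_metrics=2, min_significance=0.9):
    """One event per group with enough distinct metrics and significance."""
    events = []
    for service, group in group_breaches(breaches, service_of):
        metrics = {b.metric for b in group}
        if len(metrics) >= min_metrics and significance(group) >= min_significance:
            events.append((service, sorted(metrics)))
    return events


# Hypothetical mapping from CIs to the services they support.
service_of = {"web_app": "online_store", "db": "online_store", "mail_server": "email"}
breaches = [
    Breach("response_time", "web_app", 390, 0.8),
    Breach("cpu_load", "db", 395, 0.7),
    Breach("paging", "db", 400, 0.6),
    Breach("queue_length", "mail_server", 600, 0.6),
]
events = anomaly_events(breaches, service_of)
print(events)  # [('online_store', ['cpu_load', 'paging', 'response_time'])]
```

Four baseline breaches become one event for the online store, and the isolated breach on the mail server is not raised at all.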

• Anomaly DNA Technology: decoding

The next step following the detection of the anomaly and its structure is decoding its DNA. Decoding the anomaly DNA is done by analyzing and classifying it based on the topology (CIs and their topological structure), the metrics, and additional information. In particular, the decoding achieves:

– Separation of suspects, thus providing actionable information

– Identification of business impact using business-related information (user volume, service-level agreements (SLAs), and affected geographical areas), thus allowing the anomaly to be prioritized according to its impact

– Identification of related changes that may have affected the system behavior

• Anomaly DNA Technology: matching

With the anomaly DNA structure decoded, the current anomaly is matched with past anomalies. The matching is performed with a unique graph similarity algorithm, which compares abstract anomaly structures, thus allowing matching between anomalies that were detected on different services that have a similar architecture. The advantages of our matching are:

– Enables reuse of discovered solutions of past events

– Matches to anomalies of known issues that are yet to be resolved, reducing the need to reinvestigate

– Reduces false alarms when the past similar anomaly was classified as a noisy DNA structure, for example an anomaly that is caused by normal maintenance actions on the service
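The idea of comparing abstract anomaly structures can be sketched as follows. The paper does not describe its graph similarity algorithm, so this example swaps in a much simpler stand-in: each anomaly is reduced to the set of CI-type pairs along its topology links, and two anomalies are compared with Jaccard similarity. All CI names, types, and knowledgebase labels are hypothetical.

```python
def abstract_structure(edges, ci_type):
    """Replace concrete CI names with their types so that anomalies
    detected on different services become comparable."""
    return frozenset((ci_type[a], ci_type[b]) for a, b in edges)


def jaccard(a, b):
    """Share of abstract links the two structures have in common."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def best_match(current, knowledgebase):
    """Return (label, score) of the most similar past anomaly."""
    scored = ((label, jaccard(current, past)) for label, past in knowledgebase.items())
    return max(scored, key=lambda item: item[1])


# Hypothetical CI types, as a CMDB might record them.
ci_type = {
    "store_app": "application", "store_db": "database", "host1": "server",
    "hr_app": "application", "hr_db": "database", "host9": "server",
    "mail_gw": "gateway",
}

# Past anomalies, stored under the label an operator gave them.
knowledgebase = {
    "memory shortage on database host": abstract_structure(
        [("store_app", "store_db"), ("store_db", "host1")], ci_type),
    "planned maintenance (noise)": abstract_structure(
        [("mail_gw", "host1")], ci_type),
}

# A new anomaly on a different service with the same architecture.
current = abstract_structure([("hr_app", "hr_db"), ("hr_db", "host9")], ci_type)
print(best_match(current, knowledgebase))
# ('memory shortage on database host', 1.0)
```

Because the comparison works on CI types rather than CI names, an anomaly on the HR application matches one previously seen on the store application. A match against an entry labeled as noise could be used to suppress the event instead.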

• Anomaly DNA Knowledgebase

As the knowledgebase of past anomalies and their resolutions grows, advanced data mining methods analyze and generate the relationships between all anomalies, thus creating a map of the entire Anomaly DNA Knowledgebase. Our anomaly DNA matching algorithm defines the metric space required by data mining methods such as clustering and classification. These are applied to provide the following benefits:

– Proactive problem resolution: identification of recurrent problems through classification of anomaly DNA into problem and resolution types, reducing the time to diagnose and resolve these types in the future

– Leveraging knowledge collected from various services that display similar behavior

Product capabilities

Built upon the HP RtSM, HP SHA analyzes historical norms and trends of both applications and infrastructure, and compares that data against real-time performance metrics. Leveraging a run-time service model is crucial for your dynamic environment so you can:

• Correlate anomalies to topology changes and past issues

• Understand the business impact of each issue and prioritize resolution

• Identify the suspects of the issue and use that knowledge to prevent similar issues in the future

SHA automatically learns the dynamic thresholds in your environment so you don’t have to invest the labor to set and maintain static thresholds. SHA works on metrics from the following BSM data sources:

• HP Business Process Monitor

• HP Diagnostics

• HP Network Node Manager i

• HP Operations Manager, Performance Agent

• HP Real User Monitor

• HP SiteScope

SHA identifies anomalies based on abnormal metric behavior related to the RtSM, sets a KPI, and generates an event with context to help identify the business priority of the issue. Additionally, SHA uses Anomaly DNA Technology to analyze the structural makeup of an anomaly and compares it with the known DNA of other anomalies. Matches provide known remediation actions without further investigation, and matches marked as noise are suppressed. If you have abnormalities related to a specific service, you can see the SLAs and know the impact that the anomaly could cause. Finally, SHA incorporates remediation capabilities from the HP Closed Loop Incident Process (CLIP) solution and provides direct integration with HP Operations Orchestration, so you can fuse analytics and automation together to remediate issues quickly. When SHA sends an event into OMi, an operator can use the CLIP process to take action before service is impaired. This quick remediation solution simplifies the complexities of virtualized and cloud computing environments.

Get started with zero configuration and zero maintenance

After you install the product, you select the applications that you want to monitor, and SHA starts collecting data and learning your system behavior. SHA gathers data from the application, infrastructure, database, network, and middleware, as well as topology information from the RtSM, and learns the baseline. The baseline defines the normal behavior of an individual metric over time, including its seasonal characteristics. For example, normal behavior for a metric may include a very busy Monday morning and a very quiet Friday afternoon.

Figure 3. Example of a dynamic baseline sleeve in gray band with actual metric data in purple.
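A baseline sleeve of this kind can be approximated with a few lines of Python. The sketch assumes the sleeve is the mean plus or minus three standard deviations of past values in the same weekly time slot; the paper does not state how SHA computes its sleeve, and the slot keys and sample values are hypothetical.

```python
import statistics
from collections import defaultdict


def learn_sleeve(history, width=3.0):
    """history: iterable of (slot, value) pairs, where slot identifies a
    position in the week. Returns {slot: (low, high)} using
    mean +/- width * standard deviation of the values seen in that slot."""
    by_slot = defaultdict(list)
    for slot, value in history:
        by_slot[slot].append(value)
    sleeve = {}
    for slot, values in by_slot.items():
        mean = statistics.fmean(values)
        spread = statistics.pstdev(values)
        sleeve[slot] = (mean - width * spread, mean + width * spread)
    return sleeve


def breaches_baseline(sleeve, slot, value):
    """True when the value falls outside the sleeve for its slot."""
    low, high = sleeve[slot]
    return value < low or value > high


# Hypothetical request counts from five past weeks.
history = (
    [(("Mon", 9), v) for v in (900, 950, 1000, 1050, 1100)]
    + [(("Fri", 16), v) for v in (90, 100, 110, 95, 105)]
)
sleeve = learn_sleeve(history)

print(breaches_baseline(sleeve, ("Mon", 9), 1000))   # False
print(breaches_baseline(sleeve, ("Fri", 16), 1000))  # True
```

The same value is normal on a busy Monday morning and abnormal on a quiet Friday afternoon, which a single static threshold cannot express.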

After the dynamic baselines are established for all the application metrics, the SHA RAD Engine starts looking for anomalies in application behavior. The entry point into the RAD Engine is a baseline breach indicating that a metric is exhibiting abnormal behavior. To define an anomaly, the RAD Engine takes the abnormal metric information gathered from all monitored metrics and couples it with topology information from the RtSM to determine whether there are multiple breaches, from different metrics, affecting the same service. If an anomaly is detected, an event is generated and sent to the event subsystem. Additionally, when an anomaly is detected, SHA automatically captures the current topology of the CIs involved in the event. This preserves the topology as it was at the time of the anomaly, which is especially valuable when reviewing anomalies that occurred overnight or when there were no on-call operators to address the issues. SHA also collects and presents discovered changes for the relevant CIs so this information can be used as part of the root cause analysis. This correlation means faster troubleshooting and reduced mean time to repair (MTTR).

When SHA discovers an anomaly in the application behavior, it changes the status of the Predictive Health KPI and triggers an event that is sent to the BSM event browser. From this point you can start to drill down, isolate the problem, and understand its impact on the business.

SHA provides a page with anomaly highlights that contains all you need to know about the problem and its impact on your business, as well as advanced isolation capabilities if you need to drill down and investigate it further.

Figure 4. An anomaly highlights page

At the top of the anomaly highlights page shown in figure 4, you can find the “suspects list.” The suspects are CIs (applications, transactions, infrastructure elements) that SHA found to be the possible cause of the anomaly. Suspects can be CIs whose metrics breached their baseline, anomaly patterns that the user previously identified as abnormal, and CIs that failed verification with a user-provided verification tool.

The highlights page also provides the business impact of the anomaly by presenting which SLAs were breached because of the anomaly, the services and applications that were affected, and a breakdown of the locations that were impacted.

SHA also offers relevant reports that you can run to drill down and get a better view of the problem. The similar anomalies section is generated using Anomaly DNA Technology, and it provides more confidence in the occurrence of the problem by showing a list of similar patterns, along with additional information about how they were handled.

SHA provides a problem investigation and isolation tool, the Subject Matter Expert User Interface (SME UI), to drill down into the anomaly and isolate a possible root cause of the problem. The investigation tool allows you to “travel in time” through the anomaly and get a detailed view of the turn of events that led to the problem, as reflected in the application topology.

The figure below shows an example of an anomaly and its turn of events over time.

Figure 5. SME UI showing topology of anomaly

The lower part of the screen shows the events in the system as they occurred and were captured by SHA before and during the anomaly.

• At 06:15 a.m. SHA recorded a discovered change in the system.

• At 06:30 a.m. SHA triggered an anomaly. This means that it detected abnormal metrics that had breached their baselines before SiteScope and OM, which were monitoring the system, discovered the problem. At this point in time SHA had already triggered an event that was sent to the operations personnel.

• Between 08:00 and 08:20 a.m. SiteScope and OM triggered events on high CPU usage. SiteScope and OM discovered the problem later than SHA because their thresholds were set higher than SHA’s dynamic baseline in order to reduce noise and false positive alerts.

• At 08:30 a.m. the first real user experienced the performance problem and opened an incident.

As you can see, SHA discovered the problem and alerted on it two hours before any user complained about it, giving the operations personnel advance notice to handle and resolve it.

SHA provides you with a powerful tool to correlate and find out which of the metrics can be the possible root cause of the problem in your system.

In the figure below you can see the SHA metric view, which is part of the SME UI.

Figure 6. SME UI in metric view

The metric view allows you to preview your application’s metrics as they were captured during the anomaly time frame, within the “envelope” of their baseline. It also allows you to find out which of the metrics was the root cause of the problem by correlating it to the other metrics related to the same service using sophisticated statistical algorithms.

In this example, the user decided to correlate a Real User Monitor (RUM) metric with all the others. This metric was picked because it best represents the response time that actual users experience while using the application. The rest of the metrics belong to infrastructure and middleware components, and the metric view provides a point-and-click mechanism to present the correlation between each of them and the poor response time. The metric with the highest correlation value (81 percent) was “Sitescope_paging File Usage”, which indicates that the root cause is most likely insufficient memory allocation.
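The ranking step in this example can be illustrated with plain Pearson correlation. This is a sketch, not the statistical algorithm behind the metric view, which the paper does not specify; the metric names and sample values are invented for illustration and do not reproduce the 81 percent figure above.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def rank_by_correlation(target, candidates):
    """Sort candidate metrics by the strength of their correlation
    with the target metric, strongest first."""
    scores = {name: pearson(target, series) for name, series in candidates.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical samples taken during the anomaly time frame.
response_time = [1.0, 1.1, 1.0, 1.4, 2.0, 2.9, 3.5]  # seconds, from RUM
candidates = {
    "paging_file_usage": [20, 22, 21, 35, 55, 80, 95],  # percent
    "cpu_utilization": [40, 55, 35, 50, 45, 60, 42],    # percent
    "disk_queue_length": [2, 1, 2, 1, 2, 1, 2],
}

ranking = rank_by_correlation(response_time, candidates)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

With this data the paging file metric ranks first, which mirrors the kind of result the operator reads off the metric view. A high correlation points to a likely suspect; it does not by itself prove causation.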

Return on investment

SHA calculates a return on investment (ROI) using information gathered from the deployment environment. The metric management section looks at the ROI from reducing the administrative labor of setting and maintaining thresholds, thanks to the self-learned dynamic thresholds that SHA delivers. The events and anomaly section looks at ROI from an event reduction perspective, comparing the current OMi event stream to the anomaly events generated by SHA. This information is rolled up into the overall efficiency.

Figure 7. SHA ROI view

Conclusion

SHA is HP’s next-generation run-time predictive analytics solution that can anticipate IT problems before they occur by analyzing abnormal service behavior and alerting IT managers to real service degradation before the issue impacts their business. SHA delivers tight integration with the HP BSM solutions for event remediation to reduce the MTTR.

Additionally, SHA is simple to use, requires minimal configuration and settings, and has a small learning curve. With SHA you no longer have to maintain your monitoring thresholds, because it constantly learns the behavior of your applications and adjusts the thresholds accordingly. It reduces your application MTTR because you get fewer events in your system, each of which represents a real problem and is focused on the root cause. And because it is powered by the dynamic HP RtSM, SHA can help IT operations identify potential issues across both topology and services and solve them before the problem is even experienced by end users.

HP SHA is the new era of analytics in IT. For more information, visit www.hp.com/go/sha.

© Copyright 2011 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. The only warranties for HP products and services are set forth in the express warranty statements accompanying such products and services. Nothing herein should be construed as constituting an additional warranty. HP shall not be liable for technical or editorial errors or omissions contained herein.

4AA3-8672ENW, Created December 2011

http://www.hp.com/go/sha

http://www.hp.com/go/getconnected
