The Business - IT Alignment Challenge
Being precise with uncertainty - a Fuzzy-Intuitionistic Approach
Roland Schütze
23.09.2014
Jean Hennebert
Agenda
1. Problem Statement and Background
2. SLA Definition and Measurements
3. Multi-Layered SLA Translations
4. Practical Considerations
5. Fuzzy Mapping of Quality Aspects

1. Background and Problem Statement
"SMART"ness of requirements

Engineering education and process-improvement initiatives always stress the importance of the "SMART"ness of requirements. The acronym was originally used by George T. Doran in an article about management goals and objectives; today it is mostly used to stress specificity and measurability.

The meaning of SMART:
• Specific
• Measurable
• Assignable (Achievable, Attainable, Action-oriented, Acceptable, Agreed-upon, Accountable)
• Realistic (Relevant, Result-oriented)
• Time-related (Timely, Time-bound, Tangible, Traceable)
The "Fuzzy" needs of the User

Customer needs are often ill-defined or fuzzy, so the need for specific and verifiable user requirements is obvious. The "smartening" of fuzzy requirements often significantly increases the understanding of the requirements, mostly because everything has to be articulated explicitly.

Example: Gerrit Muller, Embedded Systems Institute, Netherlands
2. SLA Definition and Measurements
For "Service Quality", several measurement and delivery criteria can be defined.

Example SLA Critical Performance Indicator for a User Help Desk: x% of urgent incidents found and fixed, or a workaround established, within 4 service hours. Measured from the time the incident is raised to the time of incident resolution as logged in the workflow system (closing the ticket).
SLA Measurements and Metrics

A central concept of quality-of-service management is the adaptive penalization of individual requests according to the current degree of SLA conformance.

Conformance is monitored per service class, that is, for each transaction type invoked by an individual customer and the associated SLA. We define conformance c as

c = number of timely transaction invocations / total number of invocations of the transaction

In practice, so-called step-wise SLAs are commonly used to specify the QoS requirements of a service class. They consist of one or more percentile constraints and an optional deadline constraint. Percentile constraints require, e.g., n% of all service requests to be processed within x seconds.
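As an illustration (not from the original slides), the conformance measure and a step-wise percentile check could look as follows in Python; the function names and the treatment of an idle service class are assumptions.

```python
# Sketch (assumed names): SLA conformance per service class and a
# step-wise SLA check with percentile and optional deadline constraints.

def conformance(timely: int, total: int) -> float:
    """c = number of timely transaction invocations / total invocations."""
    if total == 0:
        return 1.0  # assumption: an idle service class counts as conformant
    return timely / total

def meets_stepwise_sla(response_times, percentile_constraints, deadline=None):
    """percentile_constraints: (fraction, limit_sec) pairs, e.g. (0.95, 2.0)
    means 95% of requests must complete within 2 seconds."""
    n = len(response_times)
    for fraction, limit in percentile_constraints:
        within = sum(1 for t in response_times if t <= limit)
        if within / n < fraction:
            return False
    if deadline is not None and max(response_times) > deadline:
        return False  # optional hard deadline on every single request
    return True

print(conformance(97, 100))  # 0.97
print(meets_stepwise_sla([0.4, 1.2, 2.6, 0.9], [(0.75, 2.0)], deadline=4.0))  # True
```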
Jean Hennebert
SLA, an example

Online Services Availability: minutes of service unavailability
Period 1 definition: Mon-Fri, 08:00-18:00
Period 2 definition: all other times

Observation interval 1 YEAR:
• "Inappropriate" SL: more than 523 min/year in period 1, more than 680 in period 2
• "Insufficient" SL: more than 756 min/year in period 1, more than 983 in period 2
• "Unsuitable" SL: more than 1,047 min/year in period 1, more than 1,361 in period 2

Observation interval 1 MONTH:
• "Inappropriate" SL: n/a
• "Insufficient" SL: n/a
• "Unsuitable" SL: more than 209 min/month in period 1, more than 272 in period 2
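A hedged sketch of how the yearly thresholds above could drive an automated classification for period 1; the "Compliant" label for the non-breach case is an assumption, since the SLA only names the breach levels.

```python
# Hypothetical helper: classify observed yearly unavailability for period 1
# (Mon-Fri 8-18) against the thresholds stated above.

def classify_service_level(unavailable_min_per_year: float) -> str:
    if unavailable_min_per_year > 1047:
        return "Unsuitable"
    if unavailable_min_per_year > 756:
        return "Insufficient"
    if unavailable_min_per_year > 523:
        return "Inappropriate"
    return "Compliant"  # label assumed; the SLA names only the breach levels

print(classify_service_level(800))  # Insufficient
```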
SLA, more examples

Online Services Performance:
• Transactions mean response time ≤ 2.5 sec
• Maximum percentage of transactions ending in more than 1 sec: 5%

DR Service:
• RTO (Recovery Time Objective): applications A, B, C, ... restarting within 2 hours after the formal disaster statement; applications X, Y, Z, ... restarting within 24 hours after the formal disaster statement.
• RPO (Recovery Point Objective): no data loss for applications A, B, C, ...; maximum data loss for applications X, Y, Z, ...: the updates in the last hour before the disaster.
Efficient Service Level Targets based on Business Impacts

Enforcement of SLAs should be closely related to the estimated business impact caused by an SLA breach. Different outage durations lead to significantly different business costs.

Source: KSRI, "Leveraging Service Incident Analytics to Determine Cost-Optimal Service Offers"; SRII 2011, "Towards Service Level Engineering for IT Services - Defining IT Services from a Line of Business Perspective", A. Kieninger.
IT Services deliver to Availability Contracts
Availability
Security (and Compliance)
Failure Management
Reliability
Recoverability
Scalability
Maintainability
Operability
Performance
…
Availability % Downtime / yr Downtime / mon* Downtime / wk
90% ("one nine") 36.5 days 72 hours 16.8 hours
95% 18.25 days 36 hours 8.4 hours
98% 7.30 days 14.4 hours 3.36 hours
99% ("two nines") 3.65 days 7.20 hours 1.68 hours
99.50% 1.83 days 3.60 hours 50.4 minutes
99.80% 17.52 hours 86.23 minutes 20.16 minutes
99.9% ("three nines") 8.76 hours 43.2 minutes 10.1 minutes
99.95% 4.38 hours 21.56 minutes 5.04 minutes
99.99% ("four nines") 52.56 minutes 4.32 minutes 1.01 minutes
99.999% ("five nines") 5.26 minutes 25.9 seconds 6.05 seconds
99.9999% ("six nines") 31.5 seconds 2.59 seconds 0.605 seconds
* based on 30 day month
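The table can be reproduced from the availability percentage alone; a minimal sketch (month approximated as 30 days, as in the table):

```python
# Sketch: derive the downtime budget from an availability percentage,
# reproducing the table above (month approximated as 30 days).

def downtime_hours(availability_pct: float, period_hours: float) -> float:
    """Allowed downtime = (1 - availability) * period length."""
    return (1.0 - availability_pct / 100.0) * period_hours

for pct in (99.0, 99.9, 99.99):
    print(f"{pct}%: {downtime_hours(pct, 365 * 24):.2f} h/yr, "
          f"{downtime_hours(pct, 30 * 24) * 60:.1f} min/mon")
# 99.0%: 87.60 h/yr, 432.0 min/mon
# 99.9%: 8.76 h/yr, 43.2 min/mon
# 99.99%: 0.88 h/yr, 4.3 min/mon
```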
Groups of KPIs
• Business Activity: 1. Ticket resolution time, 2. ..., 3. ...
• Technical: 1. Throughput, 2. ..., 3. ...
• Operational / Application: 1. Response time, 2. Availability, 3. ...
• Finance: Revenue
• Environmental KPI
• HR KPI
The quality objective is defined and measured via KPIs, which can be negotiated within an SLA.

The quality objectives are defined within the KPIs (Key Performance Indicators). A KPI is derived from a number of sources, including performance metrics of the service or of underlying support services as PIs (Performance Indicators). As a service or application is supported by a number of service elements, a number of different PIs may need to be determined to calculate a particular KQI.

Examples of objective service KPIs with quantitative measurements:
• Availability: the service is available for use at the time required. As a KQI, this includes all aspects of the service: physical terminal availability, network, etc.
• Response time: how quickly the service responds to an internal or external stimulus.
• Transaction rate: the rate at which the system or service can service requests.
• Throughput: the total amount of information that is offered to the system, including all processed information such as retries and replications.

Examples of objective service KPIs with qualitative measurements:
• Authorization: the system is only available to authorized resources (information and personnel) at the times allowed.
• Confidentiality: information can only be seen by those intended to see it.
• Integrity: information is available as required and has not been changed from the original.

Examples of subjective (perceptive) KPIs:
• terms like "80% of respondents should respond Satisfied or Higher"
• verbal/linguistic expressions: "acceptable", "good", "excellent"

Being subjective, KQI parameters can be hard to include as a contractual requirement.
3. Multi-Layered SLA Translations
User Experience: Interaction or Irritation (Component and System View)

Guaranteeing business-focused SLAs results in an optimization problem spanning multiple domains. The landscape of today's IT service providers is inherently integrated: it consists of all kinds of elements, namely networks, servers, storage, and software stacks.
KQI/PI Association Hierarchy Graph

The automated process of translating and correlating high-level requirements and policies of all kinds down to the infrastructure level creates a set of related PIs, which we now term a Key Quality/Performance Indicator (KQI/PI) Hierarchy.

The KQI/PI Association Graph, or KQI/PI Hierarchy for short, is a directed graph representing the association relationships between sets of KQI/PIs within (or across) tiers in a multi-tier architecture, as well as across multi-stakeholder domains. It introduces service quality parameters (KQI) rather than individual component performance indicators (PI). This concept is described in the Open Group "SLA Management Handbook, Volume 4: Enterprise Perspective".
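A minimal sketch of such an association graph as a plain directed graph; the indicator names and the dependency edges below are invented for illustration.

```python
# Sketch (invented names): a KQI/PI Association Graph as a directed graph
# mapping each indicator to the lower-level indicators it is derived from.

kqi_pi_graph = {
    "end_user_response_time_kqi": ["app_server_response_pi", "db_query_time_pi"],
    "app_server_response_pi": ["cpu_utilisation_pi", "heap_usage_pi"],
    "db_query_time_pi": ["san_bandwidth_pi", "buffer_hit_ratio_pi"],
}

def leaf_pis(indicator, graph):
    """All infrastructure-level PIs contributing to a given KQI."""
    children = graph.get(indicator, [])
    if not children:
        return {indicator}  # a leaf: a plain PI with no further decomposition
    leaves = set()
    for child in children:
        leaves |= leaf_pis(child, graph)
    return leaves

print(sorted(leaf_pis("end_user_response_time_kqi", kqi_pi_graph)))
```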
The concept of Key Quality and Performance Indicators

In most cases the KQI/PI relationship cannot be described mathematically. For instance, extending the response time of a DB query by 1 second may lead to an additional delay of half a second in the business service response time to the end user.

Having multiple PI parameters Pn, a formula f(P1, P2, ..., Pn) = F(Q1, Q2) may in theory be determined to calculate KQI parameters Qn [The Open Group 04].

The KQI is derived from a number of information sources, including metrics for calculating the performance of the service, or derived from metrics of underlying services as PIs. In general, a KQI is defined from a set of PIs, and each PI or KQI will have upper and lower thresholds for warnings ("Lower Warning" / "Upper Warning") and errors ("Lower Error" / "Upper Error").
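A small sketch of the threshold bands just described; the concrete numbers for the DB query time indicator are assumptions.

```python
# Sketch of the warning/error bands above; threshold values are assumptions.
from dataclasses import dataclass

@dataclass
class IndicatorThresholds:
    lower_error: float
    lower_warning: float
    upper_warning: float
    upper_error: float

    def classify(self, value: float) -> str:
        if value < self.lower_error or value > self.upper_error:
            return "error"    # quality violation
        if value < self.lower_warning or value > self.upper_warning:
            return "warning"  # degraded, but not yet violating
        return "ok"

db_query_time = IndicatorThresholds(0.0, 0.1, 0.5, 2.0)  # seconds, assumed
print(db_query_time.classify(0.3), db_query_time.classify(0.8),
      db_query_time.classify(2.5))  # ok warning error
```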
ITSM – Service Delivery
Service Level Management - Terminology and Definitions
• Service Level Requirements (SLR): a listing of the customer's service requirements (e.g. availability, capacity, financial, criticality, service restoration, etc.).
• Service Level Agreement (SLA): a written agreement with a customer defining the service targets and responsibilities of both parties.
• Operational Level Agreement (OLA): a written agreement between two internal IT areas (e.g. Networks and Service Desk).
• Underpinning Contract (UC): a contract with a third-party vendor/supplier that documents the delivery of services that support IT in their delivery of service.
ITSM – Service Delivery
Service Level Management
• The process responsible for maintaining and improving IT service quality through a constant cycle of agreeing, monitoring, and reporting to meet customers' objectives.
• Provides us and our customers a clear and consistent understanding and expectation of the level of service required to provide a quality product.
• Through these methods, a better relationship between IT and the customers can be developed.
4. Practical Considerations
Approach for loosely coupled services

[Diagram: a fulfilment business process - Receive PO, Check inventory, Process invoice, Apply for credit, Approve/reject credit, Collect A/R, Check for outstanding A/R, Check print history, Check order status, Ship goods, Fulfill order - composed of loosely coupled services connecting customers, suppliers, and business partners through open standards (WSDL, XML, SOAP, UDDI).]
Today's IT systems behave like "Complex Systems" [1]

"Complex systems" are systems whose behaviour is perceived [2] to be complicated. They typically consist of:
• Many elements (enterprise IT will have 2-10M Configuration Items)
• Many relationships between elements
• Nonlinear and discontinuous relationships
• Incomplete information about elements and relationships

[1] Complexity science: http://informatics.indiana.edu/rocha/complex/csm.html, http://en.wikipedia.org/wiki/Complex_system
[2] Complexity is perceived because apparent complexity can decrease with learning; helicopters can be flown with training/practice.
[Diagram, repeated for several releases over time: each application change triggers a cascade of dependent activities - hardware upgrades, OS upgrades, DBMS upgrade, app server and middleware upgrades, re-integration, re-testing, cutover planning, problem fix/recover, and back-out - illustrating the operations effort required over time.]
Tight interdependencies between components result in a cascading impact of change, and therefore complexity, cost, risk, and the unexpected, including failure.

Example: a large bank spends 80% of its IT budget managing its existing systems and infrastructure.

Source: Increasing Client Capacity for Change, TT Assessment, Jenny Choy, Jan 2007
Unscheduled Outages

Causes of unscheduled outages (Source: Gartner Group): Application 40%, Process 40%, Hardware 10%, Operating Systems 10%.

"An average of 80 percent of mission-critical application service downtime is directly caused by people or process failures. The other 20 percent is caused by technology failure, environmental failure or a disaster." (Source: Gartner)

Change - a new app, workload, technology, people, procedure, fewer people - is most often the trigger.
Managing systems is costly, complex and remains labour intensive

• Enterprise IT systems are complex.
• They require (a lot of) "managing systems" and infrastructure, which themselves also have to be managed.
• Support organisations are split by discipline; coordination is essential to make or resolve anything other than a simple change or problem. E.g. altering a tablespace to use a new container affects the DBA, storage, potentially network support, and operations.
• Labour - dominated by the effort to fix problems and make changes - dominates IT cost. Support cost is related to the number of moving parts, i.e. things to manage: OSes, instances, subsystems. This partially explains the drive to "as a Service" and Cloud delivery models.
TCO Model
[Chart: cost by phase over time (months 6-90) - Design, Code, Build, then repeated Run phases - plotted against client £ value.]

Server Mgt and Admin Costs
[Chart, Source: IDC, May 2006: worldwide spending (US$B, $0-$300), 1996-2010, on new servers, server management and administration, and power and cooling, plotted against the physical and virtual+physical server installed base (millions); the divergence is labelled the "Virtualization Management Gap".]
• Server virtualization will result in a significant increase in the number of servers (physical + virtual) to be managed.
• The projected increase is not yet reflected in their forecast of server management costs.
There is a lot of focus on TCA and time to value, but TCO - and releasing the IT budget "lost" to supporting the "as is" - is increasingly important. IT is an inhibitor to business change:
• for clients, in allowing IT to support the rate of business change needed;
• for IBM, whoever addresses the TCO of clients' systems and software will allow those clients to release IT budget for something more valuable than supporting current systems.
Complexity of multi-layered SLA translations
• M2C (Metric to Configuration) translates, in the example, the end-user objective "response time" to the underlying application server topology, which is needed to ensure enough capacity to handle the expected number of requests in time.
• C2C (Configuration to Configuration) is used here to translate the deployment option of a web application server to the supporting DB configuration. A clustered application server for high-availability topologies needs a corresponding database configuration to support the clustered processing of Java 2 Entity Beans.
• M2M (Metric to Metric) correlates the high-level metric with lower-level metrics, here for example the service objective for application response time with the required average database query execution time. For instance, a sub-second end-user application response time requires an average DB query execution time of at most half a second (see the sketch after this list).
• C2M (Configuration to Metric) is used to translate the requested DB configuration and cluster setup to the lower-level system parameters of the Storage Area Network (SAN) infrastructure with the required bandwidth capacity.
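As a hedged illustration of the M2M case, an end-to-end response-time objective might be decomposed into per-tier budgets; the split ratios below are invented, chosen so the database receives the half-second budget from the example.

```python
# Hypothetical M2M translation: split an end-to-end response-time objective
# into per-tier budgets. The share values are invented; they give the
# database the "max half-a-second" budget from the example above.

def m2m_budgets(objective_sec: float, tier_shares: dict) -> dict:
    assert abs(sum(tier_shares.values()) - 1.0) < 1e-9
    return {tier: round(share * objective_sec, 3)
            for tier, share in tier_shares.items()}

print(m2m_budgets(1.0, {"web": 0.2, "app_server": 0.3, "database": 0.5}))
# {'web': 0.2, 'app_server': 0.3, 'database': 0.5}
```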
5. Fuzzy Mapping of Quality Aspects
Example of Fuzzy Logic for quality parameters with trapezoidal membership functions

A fuzzy set A is defined as
A = {(x, µA(x)) | x ∈ X}
where µA(x) is called the membership function for the fuzzy set A, and X is referred to as the universe of discourse. The membership function associates each element x ∈ X with a value in the interval [0, 1].

[Figure: trapezoidal membership functions µ (0.0 to 1.0) over response time for the linguistic terms "good", "acceptable", and "bad".]

If response time is a linguistic variable, then its term set is
T(response time) = {good, not good, very good, not very good, ..., acceptable, sufficient, ..., bad, not too bad, very bad, more or less bad, not very bad, ..., not very good and not very bad, ...}.
Performance metrics mapped into fuzzy variables

For this example, we fuzzify the "response time" performance metric into the fuzzy variables HIGH, MEDIUM, and LOW. For the collaboration tool service, response-time performance is LOW if the response time is greater than 10 seconds, MEDIUM if it lies between 3 and 10 seconds, and HIGH if it is less than 3 seconds.
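A minimal sketch of this fuzzification using trapezoidal membership functions; the corner points around the 3 s and 10 s boundaries are assumptions about how the neighbouring terms overlap.

```python
# Sketch: trapezoidal membership functions and the LOW/MEDIUM/HIGH
# fuzzification of response time. Corner points are assumptions chosen
# around the 3 s and 10 s boundaries named above.

def trapezoid(x, a, b, c, d):
    """0 outside [a, d], 1 on [b, c], linear on the slopes."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    return (x - a) / (b - a) if x < b else (d - x) / (d - c)

def right_shoulder(x, a, b):
    """0 below a, 1 above b, linear in between (open to the right)."""
    return 0.0 if x <= a else 1.0 if x >= b else (x - a) / (b - a)

def fuzzify_response_time(seconds):
    return {
        "HIGH":   trapezoid(seconds, -1.0, 0.0, 2.0, 4.0),  # fast responses
        "MEDIUM": trapezoid(seconds, 2.0, 4.0, 9.0, 11.0),
        "LOW":    right_shoulder(seconds, 9.0, 11.0),       # slow responses
    }

print(fuzzify_response_time(2.5))
# {'HIGH': 0.75, 'MEDIUM': 0.25, 'LOW': 0.0}
```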
Fuzzification of Performance Parameters

Thresholds serve as natural boundaries for fuzzy attributes. For instance, a set of Performance Indicator values indicating warnings can degrade a service until it provokes an interruption; at that point it would have to be considered an error indicating a quality violation.
Quality of Service parameters: types of membership functions

Examples of the type of membership function, depending on the property of the QoS parameter:
• A QoS parameter should have a Gaussian membership function when missing it might cause a drastic loss of perception.
• A QoS parameter can have a trapezoidal function when quality remains the same until a threshold is reached (usually referred to as the JND - Just Noticeable Difference), after which the quality starts decaying.
• Psychological measures are often best modelled with a linear triangular membership function, as they are linearly distributed based on the user.
• User satisfaction is again a Gaussian membership function because of the normal distribution of human satisfaction measures.
• Quality of perception can be a simple triangular membership function when linearly distributed.

Source: A. Hamam et al., "A Fuzzy Logic System for Evaluating Quality of Experience of Haptic-Based Applications", Distributed & Collaborative Virtual Environments Research Laboratory, University of Ottawa, Canada, 2008
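For completeness, a hedged sketch of the Gaussian and triangular shapes mentioned above; all parameter values are illustrative, not taken from the cited paper.

```python
# Sketch of the Gaussian and triangular shapes named above; parameter
# values are illustrative, not taken from the cited paper.
import math

def gaussian_mf(x, mean, sigma):
    """e.g. user satisfaction, normally distributed around a mean rating."""
    return math.exp(-((x - mean) ** 2) / (2.0 * sigma ** 2))

def triangular_mf(x, a, b, c):
    """e.g. linearly distributed quality-of-perception measures."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

print(round(gaussian_mf(7.0, mean=8.0, sigma=1.5), 3))  # 0.801
print(round(triangular_mf(0.6, 0.0, 0.5, 1.0), 3))      # 0.8
```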
Coming next Friday:

The key contribution is the concept of a new framework that enables the translation of backstage metrics into frontstage metrics. It captures the dependency of a service on other services, or on backend applications and resources. Linguistic rules are then used to define how quality measures of a service at the frontstage relate to those of its resources or of the other services it calls. Fuzzy logic is used to reason over such rules, moving from the known hard metrics at the backstage to the soft metrics at the front.
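A minimal sketch of that idea, assuming a tiny invented rule base and the common min/max fuzzy operators; this is not the framework itself, only an illustration of reasoning over linguistic rules.

```python
# Minimal sketch (rules and operators invented): linguistic rules relate
# backstage (resource) conditions to frontstage (service) quality;
# fuzzy min/max reasoning combines them.

def frontstage_quality(db_slow: float, net_congested: float) -> dict:
    """Inputs are membership degrees in [0, 1] of backstage conditions."""
    # Rule 1: IF db is slow OR network is congested THEN frontstage is BAD
    bad = max(db_slow, net_congested)
    # Rule 2: IF db is not slow AND network is not congested THEN GOOD
    good = min(1.0 - db_slow, 1.0 - net_congested)
    return {"GOOD": round(good, 3), "BAD": round(bad, 3)}

print(frontstage_quality(db_slow=0.7, net_congested=0.2))
# {'GOOD': 0.3, 'BAD': 0.7}
```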