EX0-100 ITIL Foundation User Guide

Uploaded by pradeep on 30-Mar-2015

Page 1: EX0-100 ITILF User Guide

ITIL Foundation User Guide

Page 2

Printed and Published by:
Key Skills
George House, Princes Court, Beam Heath Way, Nantwich, Cheshire CW5 6GD

The company has endeavoured to make sure that the information contained within this User Guide is correct at the time of its release. The information given here must not be taken as forming part of, or establishing, any contractual or other commitment by Key Skills, and no warranty or representation concerning the information is given.

All rights reserved. This publication, in part or whole, may not be reproduced, stored in a retrieval system, or transmitted in any form or by any means – electronic, electrostatic, magnetic disc or tape, optical disk, photocopying, recording or otherwise – without the express written permission of the publishers, Key Skills.

© Key Skills 2000

Page 3

Contents

Foreword  4
Section 1  Hardware/Software Pre-requisites  5
Section 2  Installation Procedure  6
Section 3  Operating the Software  7
    Sign-on procedure  7
    The user interface  7
Section 4  Course Notes
    Topic 1 – Overview of IT Service Management and ITIL
        Lesson 1a – Introduction  9
    Topic 2 – Supporting the User of IT Services
        Lesson 2a – Service Desk  13
        Lesson 2b – Incident Management  20
        Lesson 2c – Problem Management  24
    Topic 3 – Control Processes
        Lesson 3a – Configuration Management  30
        Lesson 3b – Change Management  37
        Lesson 3c – Release Management  46
    Topic 4 – Service Delivery Building Blocks
        Lesson 4a – Availability Management  51
        Lesson 4b – Capacity Management  59
    Topic 5 – Getting the Right Service Quality at the Right Price
        Lesson 5a – Service Level Management  65
        Lesson 5b – Financial Management for IT Services  72
    Topic 6 – Protecting Business and IT Services
        Lesson 6a – Continuity Management  77
    Topic 7 – Exam Technique
        Lesson 7a – Passing the ITIL Foundation Exam  82
Acronyms  84
Glossary of Terms  90

Page 4

Foreword

Projects are essentially about change, and because managing change is an increasingly significant fact of business life, project management is an essential key skill in today's working environment. Many people are involved in project work, either directly or in a supporting role, and yet they have never received formal training in the basic techniques which can make the difference between a successful project and an expensive failure. For exactly the same reason, the introduction of computer-based project management tools can lead to disappointing results. Training someone simply to operate a computerised project planning tool does not make them a project manager, any more than teaching them to use a calculator would make them an accountant!

Key Skills in Project Management (Fundamentals) is the first step in bridging this skills gap. For many people it is all the training they will need to enable them to operate more effectively in a project environment and to make more effective use of their planning software. Professional project managers will find that this course lays a solid foundation on which the other modules in the Key Skills PM Portfolio will build to provide a career-enhancing programme of learning and development.

Page 5

Section 1: Hardware/Software Pre-requisites

For optimum performance, you should operate this multimedia course on a computer with the following minimum specification:
• Pentium P100 processor
• 16 MB RAM
• 8x CD-ROM drive
• Sound card and speakers
• SVGA monitor (NB: the course uses 800x600 resolution)
• Mouse/pointing device

A 32-bit Windows® operating system is also needed.

Page 6

Section 2: Installation Procedure

2.1 From CD-ROM (Single User)
Place the CD in your CD drive and run START.EXE. START.EXE will run the course directly from your CD-ROM drive, and no runtime files will be copied to your hard disk drive. The first time you run the course you will be required to register it with Key Skills. Please follow the accompanying registration instructions carefully.

2.2 Network Instructions
Subject to bandwidth and licensing terms, this multimedia training course can be installed and operated over a local area network or a corporate intranet. There are a number of ways in which installation and operation can be effected, and you should contact the Key Skills Technical Support Section for advice. If you have any problems, please call us on 01270 611600.

Page 7

Section 3: Operating the Software

3.1 Sign-On Procedure
To start the course, double-click on the course icon and the program will commence with music and an introductory title screen. Once you have passed the title screen and copyright notices you will be asked to identify yourself to the system. If you are new to the course you must enter your name/identification and then confirm this to the system. If you have used the course previously, be sure to use the same name, otherwise your bookmarks within the course will be invalid.

3.2 The User Interface
Once sign-on is completed you will be presented with the main menu. [Screenshot of the main menu, showing the "Start at First Pages" and "Go to Bookmark" options.]

Each topic is represented by one of the "panes" on the menu screen. Each of these lesson panes is divided into two distinct areas. If you click on the lesson title text you will be taken to the start of the corresponding lesson.

The left side of the pane is the bookmark area, and a pink bar will appear in this area to show whether you have partly or fully completed the corresponding lesson. By clicking in the bookmark area you will be taken to your last point of study within the corresponding lesson.

Note: The bookmarking system is switched off as soon as you move around the course using either the Index or the Contents buttons at the bottom of each page.

Page 8

Throughout the course, the main user controls are located at the bottom of the screen. [Diagram of the user controls and their functions.]

Newcomers to the course will gain most benefit from starting at the beginning of the first lesson and working their way through, sequentially, to the end. However, the package is also a valuable source of reference and it is possible to re-visit specific lessons, or parts of a lesson, at any time. The Contents and Index facilities are particularly useful for browsing in this way.

Page 9

Lesson 1a – Introduction

Welcome to this computer-based training course in IT Service Management. This course has been designed to provide you with sufficient knowledge to pass the ISEB and EXIN Foundation level exams. People who will benefit from this course include:
• Individuals currently working in an organisation's IT department
• Those wishing to develop skills in IT Service Management
• Organisations and their employees who have implemented, or intend to implement, an IT Service Management structure.

In this introductory lesson we will:
• Discover what ITIL is, and how ITIL fits into a quality environment.
• Examine Service Management and the Organisation, the ICT infrastructure, and how we define a service in IT terms.
• Finally, examine the functions that make up the core ITIL processes.

So what is ITIL?

ITIL is an acronym for Information Technology Infrastructure Library. It consists of a library of reference books outlining good practice guidelines for IT Service Management. It was conceived by the UK government, who approached various organisations and subject matter experts to write the books in the library, and it was originally published in the late 1980s. The ITIL library is published by the Office of Government Commerce, or OGC, and in 2001 revised versions of the ITIL manuals were published to include, amongst other things, recent technological developments such as the internet and e-commerce. Further updates to the manuals were published late in 2002.

Since its inception ITIL has expanded from a library of books into a whole industry, with many organisations offering related products including training, consultancy and management tools.

The ITIL Library consists of seven volumes. The central core of the library consists of Service Delivery, Service Support, Business Perspective and Infrastructure Management, and at its centre, Application Management. Application Management holds this central position as it is the only volume in the library which deals with both Development and Service Delivery issues. There are two further ancillary volumes, which provide additional guidance. They are: 'Planning to Implement Service Management', used by project managers who are implementing ITIL, and 'Security Management', which offers additional information on securing the infrastructure.

For the purposes of this course we are interested in what's known as 'Core ITIL'. This core consists of two major volumes, 'Service Support' and 'Service Delivery'. In addition to the two main manuals we will also refer to a guidance overview booklet known as 'little ITIL'. This overview booklet is published by the IT Service Management Forum, or ITSMF, an independent user organisation dedicated to IT Service Management. This course forms an introductory overview of the content of both books, and you will find that much of the material is also covered in the 'little ITIL' book. This overview will provide you with enough knowledge to sit the Foundation Certificate in Service Management.

ITIL and ISO9000

Today's businesses need to concentrate on providing a 'Quality Service' and to adopt a more customer-focussed approach. ITIL provides a best practice framework focusing on the provision of high quality services, and it places particular importance on customer-supplier relationships. For example, areas within 'Service Delivery' address customer agreements and monitor targets within these agreements. On an operational level, 'Service Support' processes address any changes or failings outlined in these agreements. In both cases, there is a strong link between ITIL and recognised quality systems, such as ISO9000. ITIL's non-prescriptive nature allows the tailoring of 'Service Management' implementation, allowing it to sit comfortably alongside a recognised quality system. Many companies require their suppliers to become registered to ISO 9001 and, because of this, registered companies find that their market opportunities have increased. In addition, a company's compliance with ISO 9001 ensures that it has a sound Quality Assurance system. Registered companies have had dramatic reductions in customer complaints, significant reductions in operating costs, and increased demand for their products and services.

Section 4: Course Notes

Page 10

Service Management and the Organisation

In any organisation, managing IT services is a fundamental part of day-to-day operations. As well as maintaining and servicing these ongoing business functions, an organisation develops new applications. Each new application might be made up of a number of projects, or a group of projects known as a programme. The relationship between these different projects needs to be understood and documented in order to monitor progress, change and so on. As these projects develop they approach a transition point. A transition point is defined as the point at which responsibility for the project passes from the development team to the team responsible for end-user delivery and support. This transition point is also known as the implementation point, and it can vary depending on organisational structure and policy. For example, a development team might retain project responsibility until the end of a warranty period, at the end of which they hand over the completed project, and associated ownership, to service management staff.

ITIL defines a major process to handle the complex relationships which affect projects, and this is known as Application Management. Application Management considers the whole 'cradle to grave' lifecycle of an application, considering issues from feasibility through productive life to final retirement of the application. It considers applications as 'strategic resources' that need to be managed throughout their life, understanding the implications that decisions made at one stage have on later stages. Although this process isn't examined in detail in this course, it is important to understand the relationship between Service Management guidance and the IT business as a whole.

The ICT Infrastructure

If service provision to business is to be effective, then its implementation should be as transparent as possible. It should be assumed that end-users have no Information & Communications Technology knowledge. IT Service Management staff must take a customer-focused view and concentrate on providing high quality services that are available when users want them, that respond quickly to demand, and that are easily maintainable. As IT management staff, you will be working alongside technical specialists, helping to maintain the ICT infrastructure and ensuring that delivered services are cost effective.

The ICT infrastructure is divided into three areas: Hardware, Software and Peopleware.
• Hardware consists of all the ICT and environmental infrastructure, including mainframe computers, network equipment, workstations etc.
• Software consists of network and mainframe operating systems, database management systems, development tools, general applications, and computer data itself. The inclusion of data here is contentious, as some suggest that a fourth infrastructure category should exist, handling data as a separate corporate resource.
• Peopleware includes skill sets, details of training products, documentation of both products and services, working practices and general procedures.

To deliver effective services to business, all three infrastructure components should be managed and controlled efficiently. The management of Hardware and Software is dealt with in a separate ITIL guidance volume called 'ICT Infrastructure Management'. Our focus in this course is the management of 'Peopleware', its documents and procedures, and how it relates to Service Support and Service Delivery.

Page 11

What does ITIL regard as a Service?

We all encounter business services in our everyday lives. Placing an order for goods or services, for example, or checking into a hotel, we are being offered a business service. In most cases businesses are underpinned by IT services. An IT service consists of a set of related functions provided by IT systems in support of the business, and is seen by the customer as a coherent and self-contained entity.

A key phrase in the definition of IT services is 'end to end'. Broadly speaking, 'end to end' means that we deal with all aspects of the service: its documentation, its support, the application software, its networks, hardware and so on. Obvious examples of IT services might include e-mail, payroll and order processing. However, there are other less obvious IT services, and these could include a wide area network, a UNIX server, or a customer database forming part of a service support IT system.

The ITSMF's 'little ITIL' book defines a Service as: 'An integrated composite that consists of a number of components, such as management processes, hardware, software, facilities and people, that provides a capability to satisfy a stated management need or objective.'

The core ITIL processes are made up of eleven disciplines.

Five of these disciplines relate to Service Delivery. These are:
• Service Level Management
• IT Financial Management
• Availability Management
• Capacity Management
• IT Service Continuity Management

Day-to-day Service Delivery functions might consist of technical support and proactive long-term planning of services. The remaining six disciplines make up the Service Support function. These are:
• Service Desk
• Incident Management
• Problem Management
• Change Management
• Release Management
• Configuration Management

All six disciplines relate to the day-to-day maintenance of a quality service. Ten of the eleven disciplines relate to processes; the exception is the Service Desk, which is a function. Every organisation will have this function in place, operating a Service Desk, employing service desk staff, and managed by a service desk manager. The remaining ten disciplines all relate to processes. For example, we might have an Incident Management process in place but not an Incident Manager; our Incident Management function might be managed by a member of the Technical Support or Service Desk team.

ITIL does not mandate the creation of specific functional areas. So, for example, a Problem Management team need not be separate from a Capacity Management team, and so on. In practice many organisations do follow this model, but ITIL guidance allows you to form your own structures. However, ITIL does suggest one good practice: that Configuration, Change and Release Management should 'share' staff and be managed by one individual. This shared management is known as the CCRM, or Configuration, Change and Release Management, function.
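As a quick aide-memoire, the grouping described above can be written down as a simple data structure. This is purely illustrative (ITIL prescribes no particular representation); a minimal Python sketch:

```python
# Purely illustrative: the eleven core ITIL disciplines grouped as in the
# lesson text. ITIL itself prescribes no particular representation.
CORE_DISCIPLINES = {
    "Service Delivery": [
        "Service Level Management",
        "IT Financial Management",
        "Availability Management",
        "Capacity Management",
        "IT Service Continuity Management",
    ],
    "Service Support": [
        "Service Desk",  # the one function; all the others are processes
        "Incident Management",
        "Problem Management",
        "Change Management",
        "Release Management",
        "Configuration Management",
    ],
}

# Ten of the eleven disciplines are processes; the Service Desk is a function.
PROCESSES = [d for group in CORE_DISCIPLINES.values()
             for d in group if d != "Service Desk"]

assert sum(len(v) for v in CORE_DISCIPLINES.values()) == 11
assert len(PROCESSES) == 10
```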

Page 12

Although we have represented each function here as a separate entity, a great deal of interactivity exists between them, and each function communicates with others in the group. In fact there is a great deal of relationship management within IT Service Management. For example, Service Level Management deals with the provision of high quality services at the right cost levels; consequently it interacts frequently with IT Financial Management. Interaction between other functional departments might be less frequent. For example, Capacity Management and IT Service Continuity Management might work together to develop a cost-effective and workable strategy to handle a major disaster, such as a flood. In this scenario, information on available capacity at a remote site or location would be provided by Capacity Management, while the pre-determined level of support required for ongoing business functions would be managed by IT Service Continuity Management.

These eleven disciplines, and the relationships between them, form the basis of this course and are the subject of the ISEB and EXIN examinations, leading to certificates in Foundation IT Service Management.

Summary

In this introductory lesson we have:
• Briefly examined the history of the ITIL library, its make-up, and how Service Delivery and Service Support sit at its core.
• Discussed how ITIL's flexibility allows easy integration with a recognised quality system, such as ISO9000.
• Looked at the relationship between service management and the business organisation, and how ITIL defines Application Management as a major process designed to handle these complex relationships.
• Looked at the ICT infrastructure and its three constituent components: Hardware, Software and Peopleware. We highlighted Peopleware, its documents and procedures, as a primary focus of this course.
• Defined what a service is in IT terms, and examined some less obvious examples of IT services.
• Finally, looked at the eleven disciplines which form the core ITIL processes, and the interactivity which exists between them within IT Service Management.

Page 13

Lesson 2a – Service Desk

In this lesson we will be examining the IT Service Desk, which is described in Chapter 4 of the Service Support book of the IT Infrastructure Library. When you have completed this lesson you will be able to:
• List the main reasons why the establishment of a service desk can have major benefits for the organisation, the end-user and the IT provider alike.
• Describe the importance of the Service Desk as a single point of contact for IT users.
• Identify three of the main approaches to structuring a service desk.
• Explain what is meant by "escalation" in a service desk context and identify two different types of escalation procedure.
• Name at least six technological aids that can be employed to improve the efficiency of a service desk.

Introduction

One of the most important considerations when delivering IT services is to ensure the provision of proper support for the users, so that when a problem or a query comes up they can contact someone who will provide a solution or an answer. Often time is of the essence, and what the users want is either a rapid resolution or a work-around to their problem that will enable them to carry on with their work with a minimum of interruption. In order to support users in this way, ITIL has three closely related chapters, namely:
• Service Desk
• Incident Management
• Problem Management

The Service Desk is meant to be the focal point for the reporting of incidents, requests for change, or any queries that a user may have about the service. On the other hand it also provides a channel for the IT provider to communicate information to users. The Incident Management process enables the recording, tracking, monitoring and resolution of events that are a threat to “normal service”.

Problem Management addresses the underlying reasons for such incidents and seeks to implement permanent resolutions in order to prevent a recurrence. We will be looking in more detail at both Incident Management and Problem Management in the remaining two lessons of this topic. For the rest of this lesson we will be examining the Service Desk function.

When a Customer or User has a problem, complaint or question, they want answers quickly. More importantly they want a result: their problem solved. Nothing is more frustrating than calling an organisation and getting passed around until you find the right person to speak to – provided, of course, that they are not out at lunch or on holiday, or it's just after five o'clock. ITIL Best Practice demands a single point of contact for users in their communication with the IT service provider. Such a facility is known by various names in different organisations, some common ones being Help Desk, Call Centre or Customer Hotline. The name used by ITIL, and hence during this course, is "Service Desk".

Obviously, what ITIL is referring to in this context is an IT Service Desk, but the principle can be, and often is, applied to many areas of a company's business. So, in addition to an IT Service Desk there may be a Service Desk where customers for the company's products can call to get support. Another Service Desk may exist so that employees can get answers to queries relating to company policies, personnel issues and so on. For the purposes of this course we will be making the assumption that the term Service Desk refers to an IT and Communications Technology, or ICT, Service Desk. The integration between IT and communications technology is so close these days that it makes sense to handle them via the same Service Desk.
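The record/track/resolve lifecycle that Incident Management applies to reported events can be sketched as a minimal data structure. The field names here are illustrative assumptions for this course, not an ITIL-mandated schema:

```python
# A minimal sketch of an incident record supporting the record/track/resolve
# lifecycle described in the text. Field names are illustrative assumptions,
# not an ITIL-mandated schema.
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass
class Incident:
    reference: str           # unique incident number assigned by the desk
    description: str
    reported_by: str         # the user who contacted the Service Desk
    status: str = "open"     # open -> in progress -> resolved
    opened_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))
    resolved_at: Optional[datetime] = None
    resolution: Optional[str] = None

    def resolve(self, note: str) -> None:
        """Close the incident, recording how normal service was restored."""
        self.resolution = note
        self.status = "resolved"
        self.resolved_at = datetime.now(timezone.utc)

# Usage: the Service Desk logs an incident, then marks it resolved.
inc = Incident("INC-0001", "User cannot print", reported_by="j.smith")
inc.resolve("Print spooler restarted")
assert inc.status == "resolved"
```

Keeping the resolution note on the record is what later lets the desk match new calls against previous incidents and their solutions.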

Page 14

Why Have A Service Desk?

The establishment and operation of an effective Service Desk is a relatively expensive proposition, so it is important to understand why such a facility might be needed and the benefits that it should provide.

The principle of a "single point of contact" that we have already mentioned is considered an essential element of ITIL Best Practice. The users of our IT services and their managers are customers in every sense of the word. Like all customers, they would quickly become frustrated and unhappy if they were unable to find somebody who could help them when they had problems with the systems on which they depend. So customer satisfaction and retention can also be listed as an important benefit.

Another guiding principle of ITIL is that IT should maintain a focus on the support of business goals. IT does not exist to provide ICT components or technology just for the sheer joy of playing with new equipment. It is there to help the organisation achieve its business objectives. A well-staffed and efficient Service Desk is a critical element in proving to the business that IT is listening and responding to its needs.

An efficient Service Desk can help to reduce the overall cost of ownership of the IT department, and it can do this in a number of ways. The alternative to a Service Desk is for each group of users to have their own "super-user" to whom they can turn when things go wrong. ITIL strongly suggests that IT costs can be reduced by not requiring high levels of IT skills within the business community, and by making it obvious to all how support can be obtained very quickly via a centralised Service Desk. Making better use of skilled and expensive IT staff can also reduce costs: straightforward issues can be resolved immediately by the Service Desk, leaving skilled network technicians or database experts, for example, to concentrate only on complex problems or on improving the quality of the infrastructure.

It will usually be the case that the users or customers are performing a valuable function for the organisation, so any time that they are unable to operate at full efficiency as a result of a problem with the IT services that they use will be both disruptive and costly. An effective Service Desk will significantly reduce the likelihood of such problems. A further consequence of this is that the IT users will in turn be able to offer a better level of service to the external customers of the business. This factor becomes even more crucial in an e-business context, where lack of service will directly impact end-customers and certainly lead to loss of business.

Finally, another major benefit that a Service Desk brings is its contribution to the principle of continuous improvement of the services offered by IT. The Service Desk will keep records of types of enquiry, the issues that are raised, the particular services, or aspects of a service, that seem to cause most problems, and so on. Identifying the most commonly occurring problems and feeding this information back quickly to the IT Service Management structure is a critical aspect of the Service Desk. In this way, the Service Desk is the thermometer by which we can monitor the health of the IT services being provided. Additionally, the Service Desk can also operate as a "shop window", adding value to the business by making users aware of facilities that they may not know exist, or of how to make better use, in a business sense, of the facilities that they are already using.

Points of Contact

There is often some confusion about the terms "user" and "customer". So far in this course we have used the words interchangeably, and for many people they mean pretty much the same thing. ITIL, however, does draw a distinction between the two terms. A User, or End-User, is taken to mean the person who actually uses the product or service under discussion: a machine operator, for example. A Customer is the person who negotiates for the provision of the product or service, what the specification should be, any changes that may be needed, and possibly the payment arrangements.

Page 15

It may well be that the User and the Customer are the same person, but in many cases, for operational systems, they will be different groups of people: customers normally being managers, and users being the operators. These definitions are relevant here because, whilst the Service Desk is the main point of contact between the User and the IT service provider, the Service Level Management process is the main point of contact between the paying customer and the provider. In both cases the key point of reference is the IT Service itself, as defined in the Service Level Agreements, which will contain statements about hours of availability, time to resolve issues, response times and so on.

The importance of this to the Service Desk is that they must be aware of what Service Level Agreements are in place and how these match up with the questions, complaints and issues that may be raised by users. It may well be, for example, that a user calls in complaining of a 2-second response time, when in fact the Service Level Agreement specifies that 95% of responses should be within 4 seconds. Such an incident would be given a much lower priority than had the figures been reversed. So the general point is that Service Level Agreements provide the link between the Customer, User and Service Level Management relationships, and that the Service Desk has a responsibility to act on behalf of the User within the IT infrastructure.

Service Desk as a Single Point of Contact

As we have already seen, the idea of the Service Desk as a single point of contact is an important one in ITIL. Some organisations will take this principle to its ultimate conclusion and have a single Service Desk as the point of contact for everything to do with the ability of the business to continue to function properly. So staff within such an organisation could call the Service Desk if the lift broke down, or a light bulb in their area failed, or if they had a query on their pension arrangements. This kind of Service Desk has the disadvantage of demanding that a very wide range of skills be available, which normally implies a referral system being used, which in turn reduces the chances of a problem being resolved directly and immediately at the desk. Here, we'll assume a Service Desk is the single point of contact for just Information and Communications Technology issues.

So, as the single contact point, the first duty of the Service Desk is to act as the IT user's "friend" within the IT department. This particularly relates to the role of the Service Desk in:
• Monitoring progress on incidents and queries.
• Reporting this progress back to the user.
• Chasing any experts that have been assigned responsibility for resolving an issue.
• Keeping an eye on any Service Level Agreements that may specify maximum acceptable response times for resolving user issues.

As the user's friend, the Service Desk has the responsibility of communicating with the user both reactively and proactively: reactively in response to issues, problems and queries raised by the users, and proactively where the Service Desk goes out to make users aware of issues that might affect them. It is not uncommon, for example, for the Service Desk to publish regular electronic newsletters to the user community informing them of new facilities, changes to services and so on.

In order to operate effectively as a single point of contact and the user's friend, the Service Desk should have the following ingredients:
• Well-trained staff with good interpersonal skills.
• Well-organised systems and processes for recording and tracking incidents and matching them against previous incidents and solutions.
• Appropriate technology, such as automatic call distribution equipment and knowledge-based systems that assist in identifying solutions to problems.
• Enough technical competence to address users' problems directly or to interface with technical experts if necessary.
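The kind of Service Level Agreement check discussed above, for instance the example of 95% of responses within 4 seconds, can be sketched in a few lines. The function name and default thresholds are illustrative assumptions, not taken from any real SLA:

```python
# A sketch of an SLA response-time check: for example, an agreement might
# state that 95% of responses must complete within 4 seconds. The function
# name and default thresholds are illustrative assumptions.
def sla_met(response_times, target_secs=4.0, required_fraction=0.95):
    """Return True if enough responses fell within the agreed target."""
    if not response_times:
        return True  # no measurements, so nothing has been breached
    within = sum(1 for t in response_times if t <= target_secs)
    return within / len(response_times) >= required_fraction

# 19 of 20 responses (95%) are within 4 seconds, so the agreement is met
# and a complaint about the one slow response would get a low priority.
times = [1.2] * 19 + [6.0]
assert sla_met(times) is True
assert sla_met([6.0] * 10) is False
```

A desk tool built on a check like this can flag potential SLA breaches to Service Level Management before they become complaints.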

Page 16

In addition, the Service Desk must have all the necessary linkages with other ITIL disciplines. For example, there must be continuous communication with the Problem Management process, particularly when a major problem has cropped up. There will need to be liaison with Service Level Management so that potential breaches of Service Level Agreements can be recognised. Configuration Management records will need to be readily accessible so that, for example, a caller's IT equipment can be easily identified. Conversely, the Availability Management process will be keen to look at Service Desk records of incidents for conducting its own analyses and as part of its role in improving service availability.

Service Desk Structure

A debate that always occurs early in the implementation of a service desk is how the desk should be structured from a geographical perspective. There are a number of strategies that will usually be considered.

In the local service desk approach, each distinct site or region of the organisation has its own service desk, and hence can provide local expertise to solve local problems. There are a couple of obvious disadvantages to this approach, such as duplication of resources and the difficulty of maintaining organisation-wide standards and consistency. Also, lessons learned in one area may not be passed on to the others. Such problems can be minimised by the use of centralised logging of incidents and results, and by establishing a central configuration management database that is accessible by all the local service desks. The big advantage of this approach, local knowledge, will obviously become more important the more geographically and functionally dispersed the organisation's sites become. In these situations the issue of language alone may favour local service desks.

The opposite extreme of the local service desk is the central service desk, where all incidents and queries are reported to, and handled by, a single centralised structure.

Centralisation has the benefit of consolidating management information and improving the utilisation of resources – and can therefore reduce operating costs. There are dangers, however, in that a perceived loss of local knowledge may tempt local sites to set up their own super-users or unofficial help desks. Another major issue with this centralised approach is the cost of communications. Particularly in an international context, careful planning will be needed, otherwise long-distance telephone calls could easily drive up the cost of providing the service to unacceptable levels.

The Virtual Service Desk is based on the concept that physical location is not relevant and that, whilst the Service Desk may be perceived as a centralised point, it may actually consist of several local service desks. As far as the local users are concerned they are contacting a local service desk – but in reality their calls may be automatically routed to the most appropriate desk, based on proximity, time of day, staffing or whatever criteria apply. This option is obviously much more demanding on the use of technology, particularly telephony re-routing equipment, in order to ensure that the whole process appears transparent to the end user.

The logical extension of the virtual service desk is what is sometimes called the “follow the sun” option. This is widely used by multi-national companies – or even, these days, by local companies who want to take advantage of cheaper labour rates in other parts of the world. So a typical “follow the sun” strategy might consist of a service desk in Australia, operating between the hours of 6am and 6pm local time, and a second desk in London operating the same hours local time there. The aim is to provide as close to 24-hour coverage as possible for users in each hemisphere, with the European service desk coming on line just as the Australian one is closing down for the night – and vice versa. So people in Europe requiring support during the night will have their calls automatically re-routed to Australia.
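The time-based routing just described can be illustrated with a small sketch. Everything here – the desk names, the UTC offsets and the 6am-to-6pm working window – is invented for illustration and is not part of any ITIL guidance:

```python
# Hypothetical follow-the-sun router. Desk names, UTC offsets and the
# 06:00-18:00 working window are illustrative assumptions only.
from datetime import timedelta

DESKS = {"Sydney": 10, "London": 0}  # invented desk -> UTC offset (hours)

def open_desks(utc_now):
    """Return the desks whose local time falls inside 06:00-18:00."""
    result = []
    for name, offset in DESKS.items():
        local_hour = (utc_now + timedelta(hours=offset)).hour
        if 6 <= local_hour < 18:
            result.append(name)
    return result

def route_call(utc_now):
    """Route to an open desk; fall back to the first listed desk."""
    desks = open_desks(utc_now)
    return desks[0] if desks else next(iter(DESKS))
```

With these invented parameters, a call placed at 08:00 UTC would route to London, while one placed at 22:00 UTC would route to Sydney, where the working day has just begun.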


This is in fact a major advantage of this approach, in that each local desk will tend to be handling local calls during its period of peak demand – so overnight re-routing, and hence long-distance traffic, should be relatively minimal – but it is there if needed. Of course, “follow the sun” may well involve more than two service desks, depending on the location of users, time differences and the coverage required. To make this work effectively it is imperative that information about incidents is replicated or shared between the different sites, so that the European desk, for example, can continue to support a user with a query that may have been raised with the Australian desk a few hours earlier. Although there are some complexities with this approach, it clearly has many advantages and it is becoming a very common arrangement for multi-national organisations offering 24-hour, 7-day-a-week coverage – particularly those in the e-commerce field.

Communicating with the Service Desk

There are many mechanisms by which problems and incidents can be communicated to the service desk. These can be categorised into two sorts: human generated and machine generated. Human users can communicate using a whole range of options such as telephone, fax, voice mail, e-mail, browser-based web forms and so on. Machine-generated communications could come from some form of system monitoring tool. For example, the loss of a particular communications link in a network would usually be reported via network monitoring software. Such incidents are often referred to as Operational Events. So when a service desk is established, the different inputs that will be encountered must be anticipated and catered for. Clearly, some of these inputs allow potential for some form of automated response. If something comes in via e-mail then at least an acknowledgement of receipt can be generated automatically.
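As a toy illustration of separating these two sorts of input, the sketch below labels each inbound contact and decides whether an automatic acknowledgement is appropriate. The channel names and the auto-acknowledgement rule are invented assumptions, not ITIL guidance:

```python
# Toy classifier for inbound service desk contacts: human channels
# versus machine-generated operational events, with an automatic
# acknowledgement for e-mail and web forms. Channel names invented.

HUMAN_CHANNELS = {"telephone", "fax", "voice mail", "email", "web form"}
AUTO_ACK_CHANNELS = {"email", "web form"}

def classify_contact(channel):
    """Label a contact's source and decide whether to auto-acknowledge."""
    source = "human" if channel in HUMAN_CHANNELS else "machine"
    return {"source": source, "auto_ack": channel in AUTO_ACK_CHANNELS}
```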

It may even be possible to introduce a degree of self-service, where users register and track their own incidents without the need for inter-personal communication with service desk staff. Be careful with this one though. It can all too easily be used as an excuse for the service desk not playing its role in monitoring and processing incidents on behalf of the user, as the user's friend. Also, be careful with telephone calls. If they are not handled properly it is possible that the user will hang up in frustration and not re-dial. Hence the information that would have been gained about a particular incident or query will be lost. All that would be recorded is that a call had been dropped, which in turn will be used as a key measure of service desk performance. Lost calls of this kind are often referred to as “fugitives”. There is a problem out there that cannot be investigated because it has not been recorded – and although the user could have been more persistent, the fault lies with the service desk staff and/or their technology for not making it easier to report the incident. Finally, the service desk needs the automatically generated notifications about operational events so that it can inform users about a possible degradation in performance caused by the fault, or about the actions necessary to repair it. The role of the service desk is simply to act on these reports and to ensure that they are handled in the same way as user-reported incidents as far as recording and classification are concerned.

Escalation

Escalation Management is an important part of running an effective service desk. Escalation is the process of moving an incident or query to the point where it is most ably resolved. So, if the initial recipient of the call is not able to deal with the incident or query – who should it be passed to so that resolution can be achieved as quickly as possible? ITIL distinguishes between Functional and Hierarchical escalation.


Here, for example, in a generic rather than just an ICT service desk, calls that cannot be handled directly by the service desk will be directed to experts in the relevant functional area. The percentage of calls that get passed upwards will be determined by the skill levels and training of the service desk staff. So functional escalation is the handing over of responsibility to a functionally more competent area in order to tackle a particular issue. Hierarchical escalation is where problems are passed up the management chain – either because they are very serious or because they need higher-level authority to sanction the resources required to provide a solution. The first level of hierarchical escalation would normally be to the service desk manager, who is usually the owner of the incident management process. More serious issues may then go to the problem manager, with a remit to call together the necessary specialists to resolve the incident as quickly as possible. Very explicit parameters need to be established to govern hierarchical escalation, otherwise it is very easy for it to become the norm rather than the exception, which would clearly be unacceptable.

Service Desk Capability

Related to the escalation procedures is the general debate about how skilled, and how capable of resolving issues, the service desk staff should be. ITIL does not make any recommendations in this respect because there is no absolute answer – every case must be considered on its merits. The factors normally weighed are the increased costs of employing more highly skilled staff against the improved service to end-users that will almost certainly result. This may also be a dynamic situation, with the optimum skill level changing over time. Immediately following the introduction of a new service, for example, it may be desirable to have some experts available on the service desk to handle the initial rush of calls about the new system.

Once things have bedded down it may be possible to relocate them to more productive areas. So at one end of the scale we may have an unskilled service desk, merely logging and routing calls – and at the other an expert desk capable of handling most, if not all, conceivable issues at the first point of call. In between these would be what is often called the skilled or semi-skilled service desk – and this is considered by many to be the optimal solution.

Achieving this optimal balance is an interesting and difficult task. As we have said, there are no hard and fast rules. One school of thought says a good target is to have about 70% of all issues resolved at the service desk, without further referral. But this will vary considerably depending on the service being offered and the maturity level of that service. Whatever skill level is adopted, the use of diagnostic scripts will increase the rate of resolution at first call, as will access to knowledge databases, change schedules and so on. Service Level Agreements must also be accessible so that work can be prioritised according to the SLA clauses.

Regardless of the technical skills that are put in place on the Service Desk, all operators must have certain basic attributes to make them suitable for the job. These will include:

• A customer-focussed attitude – where helping the customer is far more important and satisfying than playing with the latest technology.

• An articulate nature – in particular the ability to translate technical information into something that is meaningful to the business user. This can be particularly challenging when dealing with customers who are slow to catch on or who become frustrated, irate or even abusive.

• A methodical approach to questioning and the recording of facts – and the ability to maintain that approach when under severe pressure or when handling a difficult customer.


• A good business perspective and an understanding of which services are business critical. This business culture is often helped by recruiting service desk staff from within the business itself.

• And finally, multi-lingual capability is becoming an increasingly important attribute for some service desk staff. This is particularly true in the case of the virtual service desk, as discussed earlier, or in multi-national organisations.

Service Desk Technology

For the service desk to work effectively, some investment in modern technology will be needed. Relevant technology can be categorised into two types: telephony and software.

Examples of telephony technology are Automatic Call Distribution systems, which ensure that a bank of service desk operators is used in an optimal order and that work is smoothed out as evenly as possible. Conference call facilities can be useful in allowing a second-line expert, for example, to be included in the conversation with the end-user. Computer-Telephony Integration can achieve major gains in efficiency. An example of this would be the identification of an incoming caller based on their telephone number and the linkage of this with Configuration Management. This would allow all the details of the user, their facilities and equipment, and possibly their service history, to be brought to the operator's screen before the call is even answered.

Useful software technology includes intelligent knowledge-based systems that record incidents, learn from them, identify patterns over time and are able to suggest probable causes and solutions. Database access provides fast identification of known errors, problems or anything else that would help to provide a better answer. Automatic referral or escalation tools divert an issue to a pre-determined list of second-line support staff after a certain period of time. And finally, automatic tracking and alert tools can be used to monitor the status of an incident as it progresses through the various stages towards a resolution.
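The Computer-Telephony Integration "screen pop" described above can be sketched as a simple lookup against a configuration management database. Everything here – the phone numbers, user names and record fields – is invented for illustration:

```python
# Hypothetical CTI screen pop: identify the caller from their number
# and pull their details from a toy, in-memory configuration
# management database before the call is answered. All data invented.

CMDB = {
    "01270-555-0101": {
        "user": "J. Smith",
        "equipment": ["laptop", "desk printer"],
        "recent_incidents": ["INC-1042"],
    },
}

def screen_pop(caller_number):
    """Return the record to display on the operator's screen."""
    return CMDB.get(
        caller_number,
        {"user": "unknown caller", "equipment": [], "recent_incidents": []},
    )
```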

As with all business investments, the costs of introducing this kind of technology must be carefully weighed against the benefits it brings in terms of service improvements and operational efficiency.

Benefits & Problems

The benefits of and potential difficulties with the Service Desk are listed on Page 14 of the little ITIL book and in Section 4.1.8 of the Service Support Manual.

Summary

In this lesson we have been looking at the reasons for and functions of an ICT Service Desk. We have seen how the Service Desk's role is to act as a single point of contact and the user's friend in IT. We have examined different strategies for structuring and resourcing a service desk, and we have seen the skills and attributes that service desk staff must have if they are to operate effectively. Finally, we have seen some of the new technology that can be employed to improve the efficiency of operation of the service desk.


Lesson 2b Incident Management

Objectives

In this lesson we will be examining Incident Management, which is described in Chapter 5 of the Service Support book of the IT Infrastructure Library. When you have completed this lesson you will be able to:

• Define the term Incident Management according to ITIL Best Practice.

• Understand the difference between Incident Management and Problem Management.

• Identify the key stages in an Incident's Lifecycle.

• Assess the priority of Incidents based on a number of factors.

Incident Management – Introduction

ITIL defines an incident as “Any event which is not part of the standard operation of a service and which causes, or may cause, an interruption to, or a reduction in quality of, that service”. Historically, incidents were handled by a fragmented set of processes: users faced with a problem would contact IT staff direct, and any resolutions would not be documented. Alternatively, system monitoring tools may have alerted technical specialists who would rectify the problem, but again with no central recording or control. This approach led to poor use of expensive resources – the IT experts – and to a failure to learn lessons from previous incidents. ITIL Best Practice processes aim to resolve both of these issues.

One of the main goals of Incident Management is to restore normal service as quickly as possible, with a minimum of disruption to the business. This has to be balanced against the efficient use of resources – and the prioritisation of different incidents that can occur simultaneously. It is important to distinguish between Incident Management and Problem Management – which is the subject of the next lesson.

Incident Management is aimed more at a “quick fix” or a workaround than at a longer-term structural resolution of any fault. The priority for Incident Management is recovery of service as quickly and painlessly as possible. Problem Management is more about identifying the underlying cause of faults and finding ways of engineering out these faults in the longer term. This can of course lead to some conflict between the two disciplines, when Incident Management staff are driven to get a system back up and running quickly. Their colleagues in Problem Management, on the other hand, would like to have the system down for longer so that they can conduct analyses and identify strategies for designing out any problems that may exist.

The Scope of Incident Management

As we mentioned in the previous lesson, the Service Desk often plays a key role in Incident Management: recording incidents, monitoring their progress and retaining ownership on behalf of the user for as long as the incident is still “open”. It is considered good practice to record all enquiries as incidents, because they are often evidence of poor quality training and/or inadequate documentation. It may be that following the initial logging, a distinction is made between simple queries and an incident that relates to a failure or degradation of a system. A request for a new product or service is usually regarded as a Request for Change rather than an Incident. However, because the processes are essentially similar, many organisations include Requests for Change within the scope of Incident Management. Automatically registered events, such as the failure of a disk drive or a network connection, are often regarded as part of normal operations. They are still included in the definition of Incidents though – albeit that the service to end-users may never be affected.


Incident Lifecycle

It is very important to understand the process that an incident goes through, from its initial detection right through to its point of closure.

The first step is the detection and recording of the incident. It is vital that every incident is logged with a unique ID reference – even if we know that the problem has already been reported and a fix is being produced. Apart from the basic details about the incident, the log will normally include details of how the incident was reported and the services and Configuration Items that are affected. Incidents can also be classified into different types for use in subsequent analysis. The example classifications given in ITIL are Hardware, Software and Service Requests – but what is sensible here will obviously depend on circumstances. Also included in this part of the process will be the matching of the details against previously reported incidents to check for known errors, and then the assigning of a priority to the incident. We will be returning to the subject of priority in a few minutes.

Initial Support may involve the application of a work-around – some sort of temporary solution that we know about from the existing problem or incident database. Alternatively, a work-around may come from the expertise of the Service Desk staff – in which case it should be recorded for future use. In the event that the incident cannot be immediately resolved at the Service Desk, one of the vital jobs at this point of the lifecycle is to identify the correct second-line support group to whom the incident should be functionally escalated.

Investigation and Diagnosis may result in a direct resolution or in the incident being routed to the identified second-line support. This activity may be iterative, in that several attempts may be required in order to find the best resources to tackle the problem. This shuttling of an incident backwards and forwards between different support groups is one of the major issues for Incident Management. If the total process is taking too long then hierarchical escalation procedures may end up being used, as we discussed in the previous lesson.

Resolution and Recovery may involve raising a Request for Change and getting that change implemented. Recovery itself may entail further actions by the business, such as re-entering or verifying data. For example, if a disk has crashed, the problem may have been resolved by replacing the disk drive, based on an official request for change. But the service has not been recovered until the data is brought up to date from the backup or archive copies.

Incident Closure should involve some confirmation by the originating user and, where appropriate, a revised classification. It is quite likely, for example, that an initial report of a printer problem was classified as a hardware fault – but subsequent analysis determined that the fault was actually with the software. It is important that such corrections are made to the incident classifications so that an accurate record is maintained. It is possible for an incident to be closed whilst the underlying problem is still under investigation. This would be true where a work-around is available, for example.
Some organisations have an extra category which is “Incident Closed and Underlying Cause Resolved”, which they don’t use until the final resolution of the underlying problem.
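The lifecycle stages above can be sketched as a small state machine. The stage names paraphrase the text, and the transition rules are illustrative assumptions (for instance, real practice may allow a closed incident to be re-opened):

```python
# Minimal sketch of the incident lifecycle described above, with the
# allowed transitions enforced. Stage names and transitions are
# illustrative assumptions, not a formal ITIL definition.

TRANSITIONS = {
    "detected": ["recorded"],
    "recorded": ["classified"],
    "classified": ["initial_support"],
    "initial_support": ["resolved", "investigation"],  # work-around or escalate
    "investigation": ["investigation", "resolved"],    # may iterate between groups
    "resolved": ["closed"],
    "closed": [],
}

class Incident:
    def __init__(self, ref):
        self.ref = ref           # unique ID reference for every incident
        self.state = "detected"
        self.history = ["detected"]

    def advance(self, new_state):
        """Move to new_state if the lifecycle allows it."""
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(f"cannot go from {self.state} to {new_state}")
        self.state = new_state
        self.history.append(new_state)
```

Keeping the full history in each record mirrors the guidance that the status of the incident is constantly updated as it moves through the lifecycle.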


Whilst all this is going on, there are the issues of ownership, monitoring, tracking and communication to be maintained. Additionally, there will be constant updating of the status of the incident as it moves through the various points of its lifecycle. All of these are proactive activities carried out by the incident management staff – usually the Service Desk, acting on the user's behalf. They involve generating reports, keeping users informed and managing escalations. ITIL standard practice guidance says that all these activities remain with the Service Desk, and that the use of tools to help with automatic status tracking is very important in the incident lifecycle.

Finally, remember that everything should be logged as an incident – even if it is a Service Request, i.e. a request for a standard operational item, such as a password reset. If the Classification and Initial Support process determines that the incident is in fact a Service Request then the Service Request procedure will be invoked. Because the request was raised as an incident, however, it will eventually have to be brought back into the incident lifecycle at incident closure, in order to achieve the close-down of that request procedure.

In understanding the full lifecycle of an incident it is important to know what further records and processes may be generated as a result of an incident. When an infrastructure fault is first reported it is recorded as an incident, either by the Service Desk or direct to the incident management process by automated support tools. Incidents can spawn problems if they are recurring incidents, or if the Service Desk or second- or third-line support cannot ascertain the underlying cause. Some problems will justify the generation of a “known error” – a statement that we are aware of the problem and have a resolution to it.

In other cases, it may well be that a work-around is an adequate solution – at both the incident and problem levels. A good example of this might be ahead of a major infrastructure change, where making significant changes now would not be worthwhile.

If a “known error” is generated then in most cases this will lead to a Request for Change, in order for the underlying fault to be corrected – unless, as we have just said, there are good reasons why we should live with the problem for now because the cost of a fix is not justified. Once a Request for Change has been through the Change Management process as defined by ITIL, this will lead to the release of a structural solution to the problem. This will be a permanent fix to the underlying fault, not just a work-around.

Whilst all this is going on, the Configuration Management Database should be updated with information about the incident, about any problems and their links to incidents, about any “known errors” and their links to problems, and about requests for change and their links to known errors. So an integrated Configuration Management Database contains not only configuration item information but also related support records, such as incidents, problems, known errors, requests for change and release records. The absence of a Configuration Management Database will make it very difficult to harmonise separate incident recording, problem recording and change recording systems. We will be looking in more detail at the Configuration Management process in Lesson 3.

Assessing Priorities

Assessing the priority of an incident is a very important process that needs to be carried out early in the incident's lifecycle, since it determines what effort is going to be put into its resolution.

Priority is determined mainly by the impact and urgency of the incident or enquiry. However, other things can also come into play. Pragmatically, resource availability will have a bearing: if nobody with the right skills to solve the fault is immediately available, the incident may have to be put down the list a little. Another factor affecting priority may be the existence of a specific statement in a Service Level Agreement that is threatened by the incident.

Impact, in this definition, is the measure of the effect of the incident on the business. This could be measured in terms of the number of users affected or the financial loss, for example. So it is important to work very closely with the business in order to understand the factors that are considered high or low impact. Urgency concerns the timescale in which the incident needs to be resolved. For example, a fault with a payroll system that occurs on the 2nd of the month may well be considered less urgent than the same fault occurring on the 20th. These two factors together dominate the ITIL model for determining priority. So a high urgency does not always mean a high priority – if the impact is considered to be relatively low. For something to be high priority, both the impact and the urgency must be high.

As we have already mentioned, Service Level Agreements can also influence priority. Let's say that Incident A occurs and that this is the fourth incident relating to a particular service in the current month. On the other hand, Incident B occurs on a different service and this is only the second incident to have occurred so far during the month. In both cases, the Service Level Agreement for the service states that only four incidents per month are permissible. In these circumstances – all other things being equal – it would be reasonable to give Incident A the higher priority.

The resources available are also likely to affect the priority given to an incident – although if both the impact and urgency are high then resources will probably just have to be made available from whatever sources can be found. Where there are a number of medium-priority incidents to resolve, clearly the ones that have suitable resources immediately available will be tackled first.

Note that when a major incident occurs – in other words one with a high impact, high urgency and an SLA threat – Problem Management staff must be informed so that they can provide extra support to the Service Desk team.
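One possible encoding of this impact/urgency model is sketched below. The three-level scale, the additive scoring and the SLA adjustment are illustrative assumptions, not the ITIL definition; the one rule taken directly from the text is that an incident is high priority only when both impact and urgency are high:

```python
# Illustrative priority model: high priority requires both high
# impact and high urgency; the scoring and SLA bump are invented.

LEVELS = {"low": 1, "medium": 2, "high": 3}

def priority(impact, urgency, sla_threatened=False):
    """Return 'high', 'medium' or 'low' for an incident."""
    if LEVELS[impact] == 3 and LEVELS[urgency] == 3:
        return "high"
    score = LEVELS[impact] + LEVELS[urgency]
    if sla_threatened:  # an SLA clause under threat raises the priority
        score += 1
    return "medium" if score >= 4 else "low"
```

Note how a high-urgency, low-impact incident comes out as medium rather than high, matching the point made above that high urgency alone does not mean high priority.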

Benefits & Problems of Incident Management

The benefits of and potential difficulties with Incident Management are listed on Page 18 of the little ITIL book and in Section 5.4 of the Service Support Manual.

Summary

In this lesson we have been examining Chapter 5 of the Service Support Manual – Incident Management. We have seen how Incident Management is defined, the scope of Incident Management, and the differences between Incident Management and Problem Management, which is the subject of the next lesson. We have followed the main stages through which an Incident passes during its lifecycle, and looked at the records that must be kept and the need for an integrated Configuration Management Database. We have also examined the different factors that must be considered in determining the priority of different incidents, which may be competing for limited resources.


Lesson 2c Problem Management

Objectives

In this lesson we will be examining Problem Management, which is described in Chapter 6 of the Service Support book of the IT Infrastructure Library. When you have completed this lesson you will be able to:

• Define the term Problem Management according to ITIL best practice.

• Identify Problem Management's reactive and proactive activities.

• Recognise the standard set of activities for problem control and error control.

• List the benefits gained from this process.

The final component in the IT Infrastructure Library guidance for supporting the user of IT services is Problem Management. ITIL defines a problem as “the unknown underlying cause of one or more incidents”. It goes on to define the goal of Problem Management: to minimise the adverse effect on the business of incidents and problems caused by errors in the infrastructure, and to proactively prevent the occurrence of incidents, problems and errors.

Broadly speaking, Problem Management exists to ensure that a process is in place which identifies once and for all the root causes of problems. It also helps minimise their effects, as well as preventing potential problems occurring in the future. Problem Management processes are usually carried out by teams of technically focused specialists who work closely with Service Desk and Incident Management staff, and with other internal and external suppliers.

As is common to other ITIL processes, Problem Management responds to incidents in a reactive way, but also has a proactive element. The proactive response adopts a forward-looking approach, trying to prevent issues occurring by providing intelligent analysis of problem trends and statistics; the team may even get involved in making decisions about purchasing and IT provision.

As the term suggests a proactive response is an ongoing and methodical process. The intention is to minimise occurrences of incidents by identifying and resolving problems and known errors. We will define the difference between problems and known errors a little later in this lesson. The ‘reactive’ requirement of problem management is to resolve Problems quickly, effectively and permanently. It should identify the underlying problems, which are causing related incidents, and find an immediate workaround. Any workaround should allow the smooth continuation of business. When a resolution is implemented via the change management process, it should be a permanent solution that will resolve the problem and the related incidents. Once a problem has been identified, and a satisfactory resolution found to that problem, then the change will normally be implemented through change management procedures. Whether problem management acts reactively or proactively, it is important that resources to deal with them are prioritised on a ‘business needs’ basis. This prioritisation is sometimes referred to as ‘prioritising in pain factor order’. The pain factor relates to the number of people affected by incidents, and the related problem, and the seriousness of the impact on the business. Remember that we said that Problem Management processes are normally carried out by technical staff, and with a combination of Service Desk, Incident Management and Problem Management, we aim to use skilled, technical specialists in the most effective way possible, allowing them to concentrate on major incidents, where they support the incident management process and the service desk, and more of their time on resolving underlying causes of those incidents through problem management processes. As is common to other ITIL processes, the communication of management information between IT Service Management roles is very important. 
This information is used both internally, within the Problem Management team itself, and distributed to other IT Service Management roles, such as Availability Management. For example, if IT users were encountering lots of problems caused by poor quality software delivered and supported by a third party supplier, then information gained from Problem Management would be very useful to the Contract Management team. They could use it to help the supplier make improvements, or in the evaluation or analysis of the software or supplied service. In some instances they could also revoke the contract.

So how do we define the responsibilities of staff working in Problem Management? These responsibilities can be broken down into a number of focused areas. These are:

• Problem Control
• Error Control
• Assistance with handling major incidents
• Proactive prevention of problems
• Providing management information from problem data
• Conducting major problem reviews

Problem Control focuses on transforming Problems into Known Errors. It does this by identifying the root cause of the problem and providing a temporary workaround where possible. This process redefines a Problem as a Known Error.

Error Control focuses on resolving Known Errors under the control of the Change Management process. The objective of Error Control is to be aware of errors, to monitor them, and to eliminate them when feasible and financially justifiable. Error Control has become a common process in both the applications development, enhancement and maintenance environment and the live environment. Normally a service and its configuration items are introduced to the live environment with some Known Errors. It is important that these are recorded in a 'Known Error Database', so that when related incidents are reported in the live environment they can easily be identified.

Proactive Prevention of Problems, and Providing Management Information from Problem Data, include techniques such as trend analysis, targeting support action, and providing support to the organisation. Typically 80% of incidents are caused by 20% of the IT infrastructure components, and this configuration item information can prove useful when attempting to identify the underlying cause of incidents. The provision of management information from problem data to Availability Management, for example, can provide vital information on expected levels of availability, and as a consequence, influence

statements made about availability in Service Level Agreements. Ultimately, by redirecting the efforts of an organisation from reacting to large numbers of incidents to preventing future incidents, you provide a better overall service to your customers and make better use of the IT support organisation's resources.

Finally, conducting Major Problem Reviews. These reviews take place after a problem causing a major incident, or multiple related incidents, has been successfully resolved. It is the responsibility of the Problem Management process to review the problem and identify how to prevent it recurring in the future. Additionally, information from these reviews can identify weaknesses in Problem Management and Incident Management processes. These review procedures form part of a 'Service Improvement Programme', a key task for any ITIL-conformant organisation which aims to improve value and quality.

So let's look at some Problem Management definitions in more detail. Firstly, the definition of a problem, which is 'The unknown underlying cause of one or more incidents'. We defined how Problem Control focuses on transforming Problems into Known Errors. A problem only exists from the point of identification to the point when we have found the reason for the problem occurring. Once this point is reached the Problem becomes a 'Known Error'. New problem identification occurs when we are unable to find a match amongst the definitions of existing problems, or existing Known Error records. A Problem Record is then raised.

One of the most effective Problem Management techniques is to match a number of multiple related incidents and realise that they have a common underlying cause. These multiple related incidents are of particular concern to Service Managers, as they can threaten reliability clauses within Service Level Agreements or contracts. For example, an SLA might specify that in any rolling month there will be no more than two breaks in service provision, and that the duration of these breaks will be no greater than two minutes. So any train of events causing us to approach these parameters is a major concern. Hence Problem Management plays a very important role in the ITIL Service Management structure, by providing early identification of problems, and communicating this information to relevant management areas.
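A rolling-month clause like the one above can be checked mechanically against a log of service breaks. A sketch, assuming a 30-day rolling window and the limits quoted in the example; the record layout is invented for illustration:

```python
from datetime import datetime, timedelta

# Hypothetical SLA clause: in any rolling month, at most 2 breaks in service,
# none lasting longer than 2 minutes.
MAX_BREAKS = 2
MAX_DURATION = timedelta(minutes=2)
WINDOW = timedelta(days=30)

def sla_breached(breaks, now):
    """breaks: list of (start_time, duration) tuples for service interruptions."""
    recent = [(t, d) for (t, d) in breaks if now - t <= WINDOW]
    too_many = len(recent) > MAX_BREAKS
    too_long = any(d > MAX_DURATION for (_, d) in recent)
    return too_many or too_long

now = datetime(2024, 6, 30)
breaks = [
    (datetime(2024, 6, 5),  timedelta(seconds=45)),
    (datetime(2024, 6, 20), timedelta(seconds=90)),
]
print(sla_breached(breaks, now))   # False: two short breaks is still within limits
breaks.append((datetime(2024, 6, 28), timedelta(seconds=30)))
print(sla_breached(breaks, now))   # True: a third break in the rolling month
```

This is the sense in which a 'train of events' matters: no single short break is a breach, but their accumulation within the window is.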


The Problem Control process consists of a standard set of control activities. These are:

• Identification
• Recording
• Classification
• Investigation
• Diagnosis
• Review & Closure

Each reported problem passes through this process set, so let's take a few moments to define each of these in more detail.

Identification
Problems can be generated from many sources. An incident might be completely new, with no matching characteristics among records in either the existing Problem or Known Error databases. It may be a recurring incident for which a problem has already been identified. Or a problem might come about as a result of Problem Management's proactive work, where a trend has been identified and a problem raised as a result.

Recording
Once a problem has been identified, a record is created with a unique identifier, and links are generated to any associated records, such as the incidents that caused it, and to any Known Errors to which it might relate. It's likely that the problem will pass through the change process, and at this point it will be linked to requests for change. Throughout this process, records will also be linked to related configuration items within the Configuration Management Database.
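The recording step described here (a unique identifier plus links to incidents, Known Errors, requests for change and configuration items) might be modelled as follows; the field names and reference formats are illustrative assumptions, not a prescribed schema:

```python
from dataclasses import dataclass, field
from itertools import count

_seq = count(1)   # simple source of unique identifiers for the sketch

@dataclass
class ProblemRecord:
    # Unique identifier, plus links to associated records.
    ref: str = field(default_factory=lambda: f"PRB-{next(_seq):05d}")
    incident_refs: list = field(default_factory=list)    # incidents that raised it
    known_error_refs: list = field(default_factory=list) # related known errors
    rfc_refs: list = field(default_factory=list)         # requests for change, added later
    ci_refs: list = field(default_factory=list)          # related CMDB configuration items

p = ProblemRecord()
p.incident_refs += ["INC-00412", "INC-00417"]
p.ci_refs.append("CI-LAN-SWITCH-03")
print(p.ref)   # PRB-00001
```

The point of the linkage fields is traceability: from any problem you can walk to the incidents that caused it, and later to the change that resolved it.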

Classification
Problem classification is often an extension of incident classification, and is used mainly to determine an appropriate allocation of resources. For example, a problem might be identified in the Local Area Network, leading to the creation of a team of problem solvers drawn mainly from network specialists. We will discuss this classification process in more detail later in the course.

Investigation and Diagnosis
These two stages are defined separately because they form an iterative process: initial investigation results in an initial diagnosis, which leads to further investigation, and so on. Ultimately the outcome from this process should be a Known Error.

These two stages are complex, and require good technical knowledge supported by problem-solving and diagnostic skills. ITIL recommends, amongst others, two techniques to help this process: Kepner-Tregoe analysis and Ishikawa 'fishbone' diagrams. Both are important mechanisms that allow those working in Problem Management to take a structured approach to problem diagnosis.

In general it is important to record everything, and to be able to track back. ITIL's good practice guidance suggests that, regardless of the type of fault, Known Error records are kept, although there is no statement on how to do so. Problem Management is unlikely to implement the resolution of an error itself. Once a Known Error has been identified, it is handed to Error Control. Although Error Control remains part of the Problem Management process set, any resolution is likely to require some level of agreed change, hence responsibility for the resolution will transfer to Change Management. However, for particular types of problems, there are occasions when Change Management may devolve authority to the Problem Management team. Importantly, Problem Management must still raise the necessary change records in order to do this.

Review and Closure
On resolution of every major problem, Problem Management should complete a major problem review. The appropriate people involved in the resolution should be called to the review to determine:

• What was done right?
• What was done wrong?
• What could be done better next time?
• And finally, how can we prevent the problem from happening again?

Problem closure is the last of the Problem Control activities, and is often carried out automatically when a resolution to a Known Error is implemented. However, we should point out that an interim closure status can exist. For example, when a Known Error has been identified and a solution put in place, a status of 'Closed pending Post Implementation Review' could be assigned to it in the Incident, Known Error or Problem records. 'Closed pending PIR' allows us to confirm the effectiveness of the solution prior to final closure. For incidents, this may involve nothing more than a telephone call to the user to ensure that they are now content. For more serious problems or Known Errors, a formal review may be required.

Finally, remember that an important part of Problem Management is to continually monitor its own progress, and the progress of those technical support staff who are called in when problem diagnosis, investigation and resolution is necessary. This can be particularly important when problem resolution is 'time constrained' by a Service Level Agreement.

Problem Classification
When a problem is identified, the amount of effort required to detect and recover the failing configuration item has to be determined. It is also important to be aware of the impact of the problem on existing service levels. This process is known as 'classification'. One of the main reasons for problem classification is to ensure that any group of specialists we bring together to solve a problem is the most appropriate. If a problem is generated by the local area network, then it's important that we assemble LAN and desktop specialists. Problem classification is also used to prioritise the sequence in which problems are addressed. If we are experiencing a large number of incidents related to several different areas of the business, then priority must be assigned appropriately.
Every incident, problem or change will have both an impact on the business services and an urgency.

Impact describes how badly the business might be affected: for example, a life-threatening situation, or merely a small inconvenience. Urgency indicates the time that is available to avert, or at least reduce, this impact. A problem's classification may well change as a consequence of the diagnosis activity. The first classification of a problem is described as the 'initial classification'. For example, what at first appeared to be a problem with a network might actually be the result of a database problem; the problem is then reclassified. However, it is usual to retain both the initial and final classifications, so that resource allocation to problem areas can be improved.

Sources of Problem and Error Identification
We discussed earlier in this lesson how Problem Management works reactively to identify problems, by checking knowledge bases for records of problems, Known Errors, changes and so on. A proactive activity involves the analysis of past incidents, and of the IT infrastructure as a whole. For example, analysis might identify that a pre-existing problem at one site might recur at another site which has a similar server, hardware and software configuration. Also involved is the broader analysis of the IT infrastructure itself. The examination of overly complex relationships, or single points of failure, can identify vulnerable points that are a potential threat to the business. This analysis might indicate that a particular network route is more heavily used than expected, and as a consequence is a potential future risk. Often this work is carried out in conjunction with Availability Management staff, and involves careful analysis of the paths through the component infrastructure that make up the various services. For example, a customer using on-line banking to read their balance may involve hundreds of different paths.
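The impact/urgency combination described above is commonly tabulated as a priority matrix. A minimal sketch; the specific ratings and priority codes below are illustrative assumptions, not values prescribed by ITIL:

```python
# Illustrative priority matrix: impact and urgency each rated 1 (high) to 3 (low).
# The combination determines the priority assigned to the record.
PRIORITY = {
    (1, 1): "P1 - critical",
    (1, 2): "P2 - high",   (2, 1): "P2 - high",
    (1, 3): "P3 - medium", (2, 2): "P3 - medium", (3, 1): "P3 - medium",
    (2, 3): "P4 - low",    (3, 2): "P4 - low",
    (3, 3): "P5 - planning",
}

def classify(impact: int, urgency: int) -> str:
    """Look up the priority for a given impact/urgency pair."""
    return PRIORITY[(impact, urgency)]

# High impact but moderate urgency: serious, yet there is time to respond.
print(classify(impact=1, urgency=2))   # P2 - high
```

Reclassification after diagnosis then amounts to re-running the lookup with revised impact or urgency ratings, while retaining the initial result for later analysis.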
Another element of proactive Problem Management involves working with third party suppliers, and with our own internal staff, to ensure all procedures are adequate: for example, testing procedures, release procedures and so on. Internal staff can be encouraged to take part in system reviews during development, ensuring that a higher level of maintainability is designed into the system. And finally, providing access to 'knowledge bases': Service Desk staff will be able to link recently occurring incidents to Known Errors and Problems in these bases, resulting in a better understanding of the underlying problems and Known Errors in the organisation.

Error Control consists of four defined processes. These are:

• Error Identification and Recording
• Error Assessment
• Recording Error Resolution
• Error Closure

Error Identification and Recording only comes about when a root cause and, if possible, a temporary workaround have been found. Error Assessment involves deciding how to resolve the error and, if this is valid, raising a request for change to achieve it. Recording Error Resolution documents that the problem has actually been resolved; here Problem Management works closely with the Change Management and Release Management process teams, and with the end user. And finally, Error Closure: closure only occurs when the relevant change has led to the business finding a satisfactory resolution to the underlying errors, problems and related incidents.

It's worth noting that Problem Management is responsible for recording errors discovered in both the live and development environments. A situation might arise where, due to time or cost constraints, a product is released which contains Known Errors. For the Service Desk to match incidents in the faulty software to Known Errors, it is vitally important that the pre-existing Known Errors are recorded in a Known Error knowledge base or database.

All four of these processes are classified as reactive. Error Control also has a proactive element. This proactive activity includes analysing and maintaining the Known Error knowledge base, in order to provide support to the Service Desk, and identifying underlying trends in Known Errors.

Assisting Incident Management is a fundamental responsibility of Problem Management. To identify incidents, and to assign actions to them, Incident Management moves each incident through an incident matching process model.

Let's look at some example incidents and follow their path through the model. The first example is defined as a routine incident, and exits the model at the routine procedures level.

The second example is defined as a non-routine incident, in other words one which isn't recognised at the Service Desk. Initially we will attempt to match it against our Known Error database. If a match is found, then the incident moves to 'inform user of workaround' status, and if a workaround exists the user is informed immediately. The incident process then moves on to:

• Increase by one the incident count on the Known Error record.
• Update the category data in the incident. This could involve reclassification: an incident might have been initially identified as a network error, but be recognised in the Known Error database as a database-related error.
• Extract any permanent resolution or circumvention knowledge from the Known Error database. If a permanent resolution exists, then the Service Desk can execute this, often with the support of Change Management.

The third incident example has no match in the Known Error database. However, as it relates to a pre-existing problem, it does have a match in the Problem database. In this case the incident follows a similar route to our Known Error example.
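The matching model these examples walk through can be sketched as a lookup cascade: Known Error database first, then Problem database, and failing both, raise a new problem record. All field names and reference formats below are invented for illustration:

```python
def match_incident(incident, known_errors, problems, new_problem_db):
    """Sketch of the incident matching model; structures are illustrative only."""
    ke = known_errors.get(incident["symptom"])
    if ke:
        ke["incident_count"] += 1               # increase the count on the known error
        incident["category"] = ke["category"]   # update/reclassify the incident
        return ("workaround", ke.get("workaround"))
    pr = problems.get(incident["symptom"])
    if pr:
        pr["incident_count"] += 1               # link a further incident to the problem
        return ("linked_to_problem", pr["ref"])
    # No match anywhere: raise a new problem record and pass the incident on
    # to the Problem Management team for further support.
    ref = f"PRB-{len(new_problem_db) + 1:04d}"
    new_problem_db[ref] = {"ref": ref, "incidents": [incident["ref"]]}
    return ("new_problem", ref)

known_errors = {"db timeout": {"category": "database", "workaround": "restart pool",
                               "incident_count": 3}}
problems, new_db = {}, {}
print(match_incident({"ref": "INC-1", "symptom": "db timeout"},
                     known_errors, problems, new_db))
# ('workaround', 'restart pool')
```

Note the side effects mirror the bulleted steps above: the known error's incident count rises and the incident's category is corrected before the workaround is returned.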


Finally, the fourth example has no matches in either the Known Error or Problem databases. This incident is identified as being caused by a new problem, and a new record is raised in the Problem database. The incident is then forwarded for further support to the Problem Management team.

To achieve tangible benefits in an ITIL-compliant organisation, Problem Management cannot operate in isolation. To work effectively, it must coexist with a structured Incident Management process. If ITIL implementation resources are scarce, then it's best to focus on the reactive elements of problem and error control, leaving implementation of the proactive management areas until later, ideally once service monitoring activities are in place and usable knowledge bases exist. It also pays to focus on the key problems causing the greatest 'pain' to the business. Remember Pareto? 20% of problems may cause 80% of service degradation.
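The Pareto observation suggests a simple piece of proactive analysis: count incidents per configuration item and find the small set of CIs responsible for most of the pain. A sketch with invented CI names and counts:

```python
from collections import Counter

# Hypothetical incident log: the configuration item each incident was traced to.
incidents = (["CI-LAN-SW-02"] * 40 + ["CI-DB-01"] * 35 +
             ["CI-MAIL-01"] * 20 + ["CI-WS-07"] * 4 + ["CI-PRN-01"] * 1)

counts = Counter(incidents)
total = sum(counts.values())

# Walk CIs from worst to best and find how few account for 80% of incidents.
cumulative, culprits = 0, []
for ci, n in counts.most_common():
    cumulative += n
    culprits.append(ci)
    if cumulative / total >= 0.8:
        break

print(culprits)   # ['CI-LAN-SW-02', 'CI-DB-01', 'CI-MAIL-01']
```

Here three of the five CIs account for 95% of the incidents, so proactive effort targeted at those components gives the best return.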

Benefits & Problems
The benefits of, and potential difficulties with, Problem Management are listed on Page 22 of the little ITIL book and in Section 6.4 of the Service Support Manual.

Summary
In this lesson we have been examining Chapter 6 of the Service Support Manual: Problem Management. We have examined in detail the standard set of control activities, problem classification, and the problem and error identification processes. We finished by defining the four Error Control processes, and by outlining the benefits, and some possible drawbacks, of Problem Management implementation. We've looked at three interrelated areas, the Incident Management, Problem Management and Service Desk functions, and the reasons in favour of implementing Problem Management.


Lesson 3A Configuration Management

Objectives
In this lesson we will be examining the first of the three ITIL control processes, Configuration Management, which is described in Chapter 7 of the Service Support book of the IT Infrastructure Library. In this lesson we will:

• Examine the relationship between Configuration Management and the Service Delivery and Service Support functions
• Define a Configuration Item in ITIL terms
• Look at the Configuration Management Database, and the type of information and records it contains
• Describe the five Configuration Management sub-processes: Planning, Identification, Control, Status Accounting and Verification

Configuration Management sits at the centre of the three ITIL control processes. The objective of these processes is to:

• Ensure that the organisation has accurate records of its ICT assets
• Ensure that changes to the IT services are executed quickly and with the minimum of business risk
• Ensure that an integrated set of data exists, recording details about services, their ICT components and any related support records

ITIL guidance considers this process the foundation on which a stable organisation is built. In any organisation, knowing what assets we have and their current status is fundamental to business stability. After all, how can we build something without knowing what we are building on, and what we have to build with?

This is how ITIL defines the four major Configuration Management goals:

To account for all IT assets and configurations within the organisation and its services. That is, to know the total cost of the IT infrastructure, where it was sourced from, who is responsible for maintaining it, and what dependencies exist between different assets.

To provide accurate information on configurations and their documentation to support all other service management processes. This can be very useful for cost accounting of IT services: knowing what we have, how much it cost, and what depreciation model we are applying. It is also critical for Configuration Management to support IT Service Continuity Management, because without a thorough understanding of what a 'live site' contains, we can't know what any 'fallback site' should contain. In the same way, effective Capacity Management and Availability Management planning can only take place if those processes are fully aware of all ICT components and their relationships to each other.

To provide a sound basis for Incident Management, Problem Management, Change Management and Release Management. Information provided by Configuration Management is very useful to other processes. For example, configuration information about a fault in one type of workstation could help Problem Management rectify future problems before they occur.

To verify configuration records against the infrastructure and correct any exceptions. Configuration Management provides organisational confidence, maintaining records that relate exactly to the real physical situation.
So let's start by looking at how Configuration Management relates to Service Delivery and Service Support as a whole.


ITIL places Service Level Management at the very top of our objectives because it represents service delivery's 'shop window' to customers and users alike. It's also a service to which guarantees are applied, in the form of Service Level Agreements. Service Level Management is supported by several Support and Delivery processes which, amongst other things, enable it to negotiate and comply with SLAs. This whole support structure is underpinned by the Configuration Management process. ITIL guidance is explicit on this point, and states that without effective Configuration Management we are not likely to implement the other ITIL processes effectively, and this will lead to a failure to deliver a quality service.

In ITIL terms, Configuration Management can be defined as asset management plus relationships. By definition this statement broadens the scope of Configuration Management. Most organisations have some sort of asset management system in place, where they know the cost of equipment, where it was purchased, and its current status. Such a system may only cover hardware and bought-in software, and is unlikely to cover the 'relationships' or linkages between those assets. This linkage is very important, because making changes to one asset can have a knock-on effect on several others, so ITIL clearly focuses on assets and their relationships. Because Configuration Management's remit is wider than pure asset management, we tend to refer to the information that Configuration Management maintains as Configuration Items, or CIs, rather than IT assets.

We have established that Configuration Management underpins all the Delivery and Support processes, that it defines IT assets and services as Configuration Items, and that it monitors the inter-relationships or linkages between CIs. So how does Configuration Management store, manage and update this information? It does so by entering it all into a Configuration Management Database, or CMDB.
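These relationships are what distinguish a CMDB from a plain asset register: they let us ask what a service is assembled from, and therefore what a change to one CI might knock on to. A minimal sketch with invented CI identifiers; a real CMDB would of course hold far richer records:

```python
# A toy CMDB fragment: each CI lists the CIs it depends on.
depends_on = {
    "SVC-PERSONNEL": ["APP-HR", "DOC-HR-MANUAL"],
    "APP-HR":        ["SRV-APP-01", "DB-HR"],
    "DB-HR":         ["SRV-DB-01"],
    "SRV-APP-01":    ["NET-LAN-A"],
    "SRV-DB-01":     ["NET-LAN-A"],
}

def components(ci, graph):
    """All CIs a service is assembled from: the 'map' a CMDB lets us draw."""
    seen = []
    for child in graph.get(ci, []):
        if child not in seen:
            seen.append(child)
            seen += [c for c in components(child, graph) if c not in seen]
    return seen

print(components("SVC-PERSONNEL", depends_on))
# ['APP-HR', 'SRV-APP-01', 'NET-LAN-A', 'DB-HR', 'SRV-DB-01', 'DOC-HR-MANUAL']
```

Note that the shared network segment NET-LAN-A appears once in the map even though two servers depend on it; spotting such shared dependencies is exactly the kind of impact analysis an asset register alone cannot support.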

A typical CMDB should contain information on:

• Hardware, software, peopleware, and related documentation
• Services, and the relationships between Configuration Items
• Incidents, problems and known errors
• Changes and releases

Records at the highest level contain information about the organisation's hardware, including servers, workstations, communications equipment and networks; information relating to software, including operating systems, application or script software, and any custom-designed software; details about peopleware, including information related to IT service staff and their skills; and finally, information related to documentation, including procedures, contracts and so on.

The second level holds records related to IT services. A service might be made up of several CIs. For example, a service for the personnel department might consist of hardware, software and related documentation, all of which are individual Configuration Items. Together these items provide a service, and the service itself can also be defined as a Configuration Item. ITIL suggests that we should be able to draw a map of how a service is assembled from its constituent components. This graphical representation can help us understand the impact of any change we make to a CI on the service as a whole.

The CMDB is also the ideal place to hold incident records, problem records and known error records. If these are held on separate systems, ITIL guidance suggests trying to link the databases, so that we can link a record to any related Configuration Items. By doing so, future searches on a particular CI will return information relating to outstanding incident, problem or known error records.

In the change and release section of the CMDB, we may hold requests for change, change records and so on. This information is used for tracking the progress of change and release records. A release record will contain information about a number of related CIs which make up a new release, and will describe how to achieve a change defined in the change records.

A CMDB can offer great benefits to an organisation. However, the benefits might not be immediately obvious to senior management, who might suggest that a simple asset management system would be sufficient. Asset management, though, only addresses the higher value items in the infrastructure and doesn't examine it to the same level of detail. Perhaps more importantly, asset management systems wouldn't contain the linkages to incident, problem or known error records, or to change and release management records, and critically wouldn't document the relationships between CIs that a CMDB would.

We briefly defined earlier in this lesson what constitutes a CI. ITIL defines a Configuration Item as 'any component of an IT Infrastructure, including a documentary item such as a Service Level Agreement or Request for Change, which is, or is to be, under the control of Configuration Management and therefore subject to formal Change Control'. CIs will vary in type, distinguishing between hardware, software and documentation, and in some circumstances will be sub-divided into lower level Configuration Item records. For example, the hardware type might be made up of workstations, servers, network equipment and so on. Whatever the CI type, it will require a unique form of identification: firstly, a unique identifier, which should comply with a predefined configuration policy; and also an ID type, which categorises the item as hardware, software, peopleware and so on. Other common CI attributes might include a manufacturer's or developer's ID, the item's location, its purchase date, and so forth.

In addition to the CMDB, Configuration Management has linkages to two other information repositories. These are the Definitive Software Library, or DSL, and the Definitive Hardware Store, or DHS. The DSL is the safe storage area for trusted software, and is managed by the Release Management process.
The DHS houses spare parts for critical equipment, and replicas of configuration models in the IT infrastructure. For example, the DHS might contain a fully configured standard server and workstation.

Again, records relating to the contents of both the DSL and DHS are held in the Configuration Management Database. Also worth noting here is the management of software licences. This has become a major issue for many organisations, and the repercussions of illegal software use can be severe, so it's considered good practice for Configuration Management and Release Management to work jointly on this process. In a fully ITIL-implemented organisation, the Configuration Management team would be expected to hold information about each licence, and what it covers, as a CI in the CMDB. However, as with the DHS and DSL, the physical licences might be held in a separate repository.

ITIL suggests that Configuration Management is made up of five sub-processes. These are:

• Planning
• Identification
• Control
• Status Accounting
• Verification

Planning is carried out at the beginning of any process to establish a configuration management plan, and should be revisited regularly. The processes of Identification, Control, Status Accounting and Verification are ongoing. Let's look at each of these processes in a little more detail.

The first of the Configuration Management sub-processes is Planning. ITIL suggests five key points which should be addressed in planning, and these are:

• Strategy, policy, scope and objectives
• Processes, procedures, guidelines and responsibilities
• The relationships with other ITIL processes
• The relationships with other parties carrying out Configuration Management
• And finally, tools and other resource requirements

We start by defining a strategy. For example, an organisation might want to establish a Configuration Management system, but for its 'live systems' only.


Another policy may define that all new bought-in or internally developed systems or services are to be brought under Configuration Management control at the point of handover, but that existing live systems will not be within scope. The scope might encompass desktop services, workstations and data centres, but not the communication network. Accurate definition of the scope is important in order to understand the amount of work involved, and the resources required. Once the strategy, policy and scope are defined, the objectives can be outlined, along with a timeframe in which to achieve them. Remember that the objectives should be 'SMART' objectives, in other words Specific, Measurable, Achievable, Realistic and Timely.

Having dealt with strategy, policy, scope and objectives, our next action is to examine the processes, procedures, guidelines and responsibilities. The organisation might already have processes in place to control assets, or change management processes. Although these may not be formally identified as a Configuration Management process, they could be adapted and improved upon. Planning procedures should be created and maintained along with other related guidelines. We will discuss this in more detail later in this lesson. And finally, responsibility has to be allocated; after all, these plans, processes and changes have to be carried out. So work should be allocated to staff in either a Configuration Management group, or a wider configuration, change and release management group if necessary.

If, in this example scenario, Configuration Management is being introduced into the organisation after other ITIL processes, then it is important to define how those other processes will have to change to accommodate the new Configuration Management process. Alternatively, if Configuration Management is implemented ahead of other processes, future inter-process relationships will need to be considered.
Relationships with other parties who carry out Configuration management also requires particular attention. Suppliers, external software vendors, and developers might have their own CMDB with which we want to exchange information.

The final point on planning is the use of tools, and other resource requirements. Careful consideration needs to be given to CMDB implementation: whether to design and build a CMDB from scratch, or to purchase an off-the-shelf product. Vitally, it should be possible to link the CMDB to system and network management tools, with the benefit of automatic CI recording to the CMDB via these tools.

The second of the five Configuration Management processes is Identification. The primary focus of the identification process is the establishment of the 'Configuration Item level'. When defining a Configuration Item we need to establish what level of detail is appropriate. For example, a complete workstation might be considered as a single Configuration Item, or it could be further broken down into its component parts, with each of these made a CI. The same logic applies to software: we could define a CI as a program as a whole, or as a module or sub-module of that program. Generally speaking, select the Configuration Item level which is most beneficial to the Configuration Management process. Within any organisation, greater levels of CI detail will exist in some areas than in others: the greater the level of control required over an area or service, the greater the detail of the configuration management records. Be careful to choose the most appropriate level, balancing information availability and the level of independent control against the resources and effort needed to support the CMDB at that level. The key target is 'maximum control with minimum records'. It's also worth noting that the depth of the configuration hierarchy could be restricted by the support tools available. For example, breaking a workstation down into its major units, and then further down into its motherboard, CPU and other component parts, may be impossible if the depth of our CMDB system hierarchy is limited to two levels.
A configuration item record may well contain information about candidate configuration items below it in its hierarchy. For example, in the event of a workstation failure, the policy might be to replace the whole workstation rather than the failed component; CI information about that component could then be held within the CI record for the workstation. Also consider that a candidate CI might have linkages to CIs other than its immediate parent. In these circumstances the CI

Page 34: EX0-100 ITILF User Guide

Lesson 3a Configuration Management

34

information would show its linkage to its parent, and also a 'used by' relationship to other CIs. It would not be helpful to lose this level of detail by incorporating it into the parent CI. Documenting these linkages in the CMDB can have a huge impact on database size: each new CI added might introduce three or four linkages. It's good practice to establish in advance the required levels of CIs in the database, even if we don't initially populate the database to that level. With most CMDB tools it's far easier to leave empty elements in the database than to restructure the database at a later date.

Successfully building and maintaining a CMDB depends on accurately identifying and labelling its configuration structures, CI versions and types, and their linkages with other CIs. This is termed defining its scope. Defining scope identifies which items of hardware, software, peopleware and documentation are to be included. Part of this process involves identifying the number of 'configuration types' required, and what benefits their identification will bring.

When identifying and refining CI types, we might come across candidate CIs which are generally very similar but have subtle differences. For instance, two workstations might be identical except for having monitors of different sizes. This slight difference wouldn't justify a new CI type; to accommodate such anomalies we can record one as a 'CI variant'.

Version identification needs to address the full lifecycle of the configuration item, so in addition to items already in the live environment, items in development and awaiting release are also included. At the same time, version numbers are assigned, and these should be monitored carefully. If, for example, the development department assigns its own version numbers, it's important that this information is transferred to the CMDB at the point of handover.
In defining the inter-relationships between CIs, there are a number of typical relationship 'types' which can be used. The most frequently used in ITIL good practice are Composition, Connection and Usage. 'Composition' is the simple parent-child relationship: a workstation being the parent, with the monitor, keyboard or system box as the child.

'Connection' describes the relationship between hardware items, for example the relationship between a LAN and a server. 'Usage' describes an interdependency, such as several applications using a common software module, or a linkage from one category to another.

Finally, having identified and documented information about CIs, the items should be labelled. Labels might exist in electronic format, or might be printed labels which we apply to identify the relevant CIs.

During development we might want to capture information about CIs and their relationships to reflect the position at a particular time. This is known as 'baselining'. It can be a very useful process: a baseline can provide a rollback point if things go wrong, a specification from which copies can be built, and valuable review information after the implementation of a request for change. During the baselining process we should include the relevant related items, including documentation, procedures, peopleware and so on. Baselines should be established at formally agreed points in time, for example before making a significant change to the infrastructure. At any point, the current configuration consists of the most recent baseline plus any approved changes that have been implemented. It's very common to take baselines of standard workstation configurations to provide a rollback position if recent changes prove unsatisfactory.

The third Configuration Management activity is Control. The control of configuration items consists of three sub-processes: Register, Update and Archive. An additional function of the control process is to protect the integrity of configurations. CIs are registered as they fall into the remit of IT service management. If we receive new equipment from an external supplier, at the point of handover we should establish that the information received from the supplier is accurate.
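Baselining as described above, capturing CIs and their relationships at a point in time so that we can later roll back, can be sketched with a toy in-memory CMDB. All class and field names here are invented for illustration; real CMDB tools work very differently.

```python
import copy

class Cmdb:
    """Toy in-memory CMDB: CIs plus typed relationships between them."""
    def __init__(self):
        self.cis = {}            # CI name -> attribute dict
        self.relationships = []  # (parent, type, child): composition/connection/usage
        self._baselines = {}

    def baseline(self, label: str):
        """Freeze the current configuration under a label, e.g. before a big change."""
        self._baselines[label] = (copy.deepcopy(self.cis), list(self.relationships))

    def rollback(self, label: str):
        """Retreat to a previously agreed baseline."""
        self.cis, self.relationships = copy.deepcopy(self._baselines[label])

db = Cmdb()
db.cis["WS-0042"] = {"status": "live", "os_version": "1.0"}
db.relationships.append(("WS-0042", "composition", "monitor-17"))
db.baseline("pre-upgrade")

db.cis["WS-0042"]["os_version"] = "2.0"  # the change proves unsatisfactory...
db.rollback("pre-upgrade")
print(db.cis["WS-0042"]["os_version"])   # back to "1.0"
```

Note that the current configuration is always the most recent baseline plus approved changes since; the rollback simply discards the latter.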
In many organisations this activity has a direct link with procurement. There are many reasons for updating a configuration item's status: for example, a change in the CI's status from 'testing' to 'live', a change of financial asset value, a change of ownership, or changes brought about by incidents, problems or known errors. All these


updates have to happen under the authority of the configuration management process.

Archiving decommissioned CIs takes place when a component is no longer in use. The definition of what constitutes a redundant CI, together with decommissioning and timing details, would usually be specified in a predefined policy document. Archiving involves the removal of CIs from the CMDB and their transfer to secure storage, not necessarily the destruction of the record.

The protection process safeguards against illegal changes to CIs, and procedures are maintained so that the CMDB and the information it contains remain secure. Protecting the integrity of the configurations includes security against theft; protection against unauthorised change or corruption; enforcing access control procedures; guarding against environmental damage; protection against viruses; and making back-up copies of the CMDB information, with secure storage of those back-ups. Configuration control must extend to 'bought in' CIs, such as commercial off-the-shelf software, sometimes known as 'COTS' packages. By definition this will involve software licence issues, which we will examine in more detail in the Release Management lesson. Importantly, the protection procedures should also cover the Definitive Software Library and the Definitive Hardware Store.

The fourth Configuration Management activity is Status Accounting. ITIL defines status accounting as 'the reporting of all current and historical data concerned with each CI throughout its lifecycle'. Status accounting allows us to reveal a CI's past status (what has happened to it up to this point?), its present status (what state is the CI in now?) and its future status (what plans are there for this CI?). This accounting procedure enables changes to CIs and their records to be tracked, and documents changes in a CI's status, for example the change from 'live' to 'withdrawn'. It can also help us establish baselines.
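Status accounting, reporting a CI's past, present and planned status, amounts to keeping a dated history per CI. A minimal sketch (names and dates are illustrative, not from any real tool):

```python
from datetime import date

class StatusAccount:
    """Record every status a CI passes through, so its full history is reportable."""
    def __init__(self):
        self.history = {}  # ci_name -> list of (date, status), in order

    def record(self, ci: str, status: str, when: date):
        self.history.setdefault(ci, []).append((when, status))

    def current(self, ci: str) -> str:
        """The CI's present status is simply the most recent entry."""
        return self.history[ci][-1][1]

log = StatusAccount()
log.record("payroll-app", "in development", date(2000, 1, 10))
log.record("payroll-app", "under test", date(2000, 2, 1))
log.record("payroll-app", "live", date(2000, 3, 1))

print(log.current("payroll-app"))                   # live
print([s for _, s in log.history["payroll-app"]])   # the lifecycle to date
```

Because the history is never overwritten, the past status ("what has happened up to this point?") stays answerable alongside the present one.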
By declaring a status of ‘trusted’ we save all the configuration items and relationships as a baseline. If we encounter problems at a later date, we can then retreat to this ‘baselined’ point. Status accounting can also be used to monitor organisational procedures, for instance,

that a request for change on a configuration item was properly authorised.

The fifth and final Configuration Management activity is Verification. The primary function of verification, or 'verification and audit' as it is sometimes known, is to establish that the information in the CMDB exactly matches the real-life environment. Configuration management offers little benefit if the information it provides is out of date or inaccurate. This verification and audit procedure should be carried out regularly but randomly: deliberate avoidance of the change and configuration management processes is most likely to be revealed by this 'spot check' approach. These audits involve checking the physical whereabouts of equipment and installed software. In addition to the regular spot checks, verification and audit would usually be carried out at the following times:

• Before a new release, or before the preparation of a baseline.

• After a disaster, to establish that our records are accurate following a major failure in the IT infrastructure.

• Following detection of unauthorised changes to the infrastructure. A single unauthorised change might be concealing many others, with the result that the CMDB would not reflect the real-life situation.

• Before the live implementation of a new Configuration Management database.

Carrying out a manual verification and audit can be a time-consuming and expensive procedure. ITIL recommends the use, where possible, of automated verification tools. These tools are able to roam networks and servers, reporting on installed hardware and software. Interestingly, many manufacturers are building automated management functions into their PCs. It's also worth remembering that some verification can be carried out by service desk staff: during calls from users, they can ascertain what hardware and software are being used, and whether this matches current configuration item records. Finally, it's worth noting that in many large organisations, responsibility for the verification and audit process rests with a Configuration Librarian.
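The core of an automated spot check is a comparison of CMDB records against what a scan of the environment actually found. A minimal sketch, assuming some hypothetical discovery tool has already produced the set of scanned item names:

```python
def audit(cmdb_records: set[str], discovered: set[str]):
    """Compare CMDB contents against what an automated scan actually found."""
    missing = cmdb_records - discovered       # recorded but not found: moved or retired?
    unregistered = discovered - cmdb_records  # found but not recorded: unauthorised change?
    return missing, unregistered

recorded = {"WS-0042", "WS-0043", "server-01"}   # what the CMDB says exists
scanned = {"WS-0042", "server-01", "WS-0099"}    # what the scan tool reported

missing, unregistered = audit(recorded, scanned)
print(missing)       # {'WS-0043'}
print(unregistered)  # {'WS-0099'}
```

Each discrepancy then becomes a lead to follow up: a missing item may conceal theft or an unrecorded decommissioning, while an unregistered one may reveal a change made outside the change management process.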


As we discussed earlier in this lesson, configuration management is closely linked with the overall Service Support and Service Delivery processes, both supporting and depending on them. When an incident is identified it passes through these processes, and it's important to realise how the CMDB, and configuration management as a whole, support this.

The CMDB is used to read and write information by each of the service support processes throughout the incident's lifecycle. For example, when an incident occurs we record it in the CMDB; at the same time we can examine the CIs which might be causing the incident. When the incident moves into the problem process, we record the problem information in the CMDB, and also look in the CMDB for related incidents. The known error process will have links in the database to problem records, which in turn are linked back to the 'underlying cause' configuration items. When executing a request for change, the configuration items and their inter-relationships will be examined in order to assess the impact of the change. Change records will be stored, and their status updated as the change moves through the build, test and implementation stages of the change process.

In this integrated environment we can see the fundamental role of the configuration management database, and of configuration management as a whole. The ultimate update authority always lies with the configuration management process, but this authority can be delegated in the case of incident and problem records. Configuration management also remains responsible for updating the CMDB during the change and release processes, often acting on behalf of the change and release management processes.

Benefits & Problems

The benefits of and potential difficulties with Configuration Management are listed on Page 26 of the little ITIL book and in Section 7.4 of the Service Support Manual.

Summary

In this lesson we have been looking at the configuration management process. We have seen how configuration management forms the foundation on which service delivery and service support functions are built, and how all of these processes support service level management. In ITIL terms, configuration management can be defined as asset management plus relationships, and we looked at how these assets are defined as configuration items, or CIs. We went on to examine the configuration management database, or CMDB, its structure, and the type of information and records it should contain. We also looked at how the CMDB links to the Definitive Software Library and the Definitive Hardware Store. We discussed in detail the five configuration management sub-processes (Planning, Identification, Control, Status Accounting and Verification), and we went on to look at the relationship between Service Support, Service Delivery and the CMDB. Finally, we looked at the potential benefits and pitfalls when implementing configuration management.


Lesson 3b Change Management

Objectives

In this lesson we will be examining the second of the ITIL control processes, Change Management, which is described in Chapter 8 of the Service Support book of the IT Infrastructure Library. In this lesson we will:

• Define what change is in ITIL terms, and the goal of Change Management.

• Examine the relationships between Change Management and other ITIL processes.

• Define a Request for Change, or RFC, and examine some of its potential sources.

• Look at the role of the Change Advisory Board, and the Change Advisory Board Emergency Committee.

• Examine the Change Management process in detail.

The second control process within ITIL guidance is Change Management. So what is Change Management? Let's start by more accurately defining the term 'change'. It has many definitions, but possibly the simplest is the most apt: 'Change is the process of moving from one defined state to another.' ITIL defines the goal of change management as follows: 'To ensure that standardised methods and procedures are used for efficient and prompt handling of all changes, in order to minimise the impact of any related incidents upon service.'

Change Management can either be restricted to changes to the ICT infrastructure and the current ICT services offered in the live environment, or it can be expanded to cover all changes, including those in development areas, or changes which are the result of strategic decisions. There are a number of key points here which highlight why the change management process is critical to a well-run IT services organisation.

The first of these is the ability to handle changes promptly and efficiently. When the need for a simple and routine change arises, Change Management should handle it in a streamlined and pre-planned manner. Where more significant and complex changes arise, they should be dealt with efficiently, but to an appropriate level of detail.

Change Management is responsible for implementing changes in the organisation with the minimum of disruption. Historically, making changes to the IT infrastructure has resulted in lost business and lost production time. ITIL guidance addresses the potential impact of proposed changes by suggesting the use of fixed change slots in what's termed a 'forward schedule of change'. As a result, users are informed about upcoming changes, what each change entails, and when it will take place. As a further safety net, change management carries out impact analysis on proposed changes and produces a back-out plan, giving the organisation a point to which it can retreat if a change proves unsatisfactory. Finally, Change Management must balance the need for change against the risks to the IT infrastructure of implementing it.

Change Management Relationships

For change management to be effective it must work very closely with several other IT service management disciplines: Release, Capacity, Availability and Configuration Management. We mentioned earlier in the course that it's quite common for the change, configuration and release management processes to be staffed and managed as a single team. So let's look at some of these relationships in more detail.

Change Management has overall responsibility for assessing the potential impact of any changes on the ICT infrastructure. It's supported in this role by Capacity and Availability Management. Capacity Management will assess the impact of any proposed change on business performance, while Availability Management will be concerned with any impact the change has on service availability. Both should be involved as early as possible in the change process in order to judge the impact of proposed changes.


Any change to the infrastructure involving software, hardware, services and so on will result in changes to configuration items. As a consequence, Change Management must work closely with Configuration Management. As we said earlier, part of Change Management's responsibility is the analysis of any proposed change. To do this effectively it must understand which CIs will be affected by the change, the way in which constituent CIs are linked, and, if linked, how they make up one or more services. So Configuration Management identifies the CIs which are likely to be affected, on behalf of Change Management.

By exchanging information with Capacity, Availability and Configuration Management, Change Management is able to assess the overall impact of the change. Once assessed, we should be able to state that the impact is manageable, the cost of change is reasonable, and the business benefits are worthwhile. At this point Change Management authorises the change. In many cases this authorisation is made with the help of other experts who form a body known as the Change Advisory Board. In some cases, where the change is a simple one, authorisation can be devolved; it is then common for the change management process to be devolved to Problem Management, or even to operational staff.

Throughout the change management process there is an ongoing update of information within the configuration management database. For example, a CI's status can be moved to 'under change', or a new CI is created if we replace one piece of software with another, and so on. Finally, when a change is ready for release to the wider user community, whether it affects software, hardware, documentation or related infrastructure components, it falls to Release Management to manage the actual physical implementation. Remember, however, that overall responsibility for any change remains in the hands of Change Management.
The trigger for the change management process is the receipt of a Request for Change, or RFC. ITIL defines a number of sources from which an RFC can be received. The most common and well-documented are those that form part of the incident resolution lifecycle: for example, where a user identifies an incident and reports it to the service desk staff, who in turn generate an RFC; or from Problem Management, which generates an RFC after investigation of multiple incidents

has led to a known error and a proposed 'structural' resolution.

Another source of RFCs is the need to introduce new or upgraded CIs. For example, if your organisation has recently purchased new workstations, then their installation, addition to the network, recognition by the server, and the provision of help and user documentation will all generate RFCs.

We may have a new or changed business requirement for an IT service, often identified by the service level review process. Again this will generate a request for change, which is passed on to the change management process.

An RFC might arise because of customer or user dissatisfaction with a current service. This may not have been reported via incident or problem management, and it might not fall outside our current Service Level Agreements; however, it's important, where financially viable, to meet customers' requests.

Implementation of new or changed legislation might bring about an RFC. Particular examples include legislative changes relating to privacy, intellectual property rights, security and so on.

A major change in business requirements may generate a significant request for change. Such a request may already have passed through a conventional investment appraisal process, and enters the ITIL service management process for a second review. The role of service management here is to ensure full impact analysis of the effects on existing services, and on the infrastructure as a whole.

Typically, a request for change will contain information such as the sponsor, the requested date for implementation, an initial list of configuration items affected, the services affected, the reason for the change, and initial costing information. The exact content will vary depending on the origins of the RFC.

One of the main responsibilities of the change management process is to establish a 'Change Advisory Board', or CAB.
The role of the CAB is to consider RFCs and, in the light of the business need, make recommendations as to whether they should be accepted and implemented, or rejected. It also ensures that any RFCs which don't merit detailed consideration by the CAB are recorded. The CAB will also advise on the grouping of changes into 'releases', to minimise disruption to the organisation and maximise benefits.
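The typical RFC contents listed above can be pictured as a simple record. This is an illustrative sketch only; field names are invented, and a real RFC form would carry far more detail.

```python
from dataclasses import dataclass, field

@dataclass
class RequestForChange:
    """Illustrative record of the typical contents of an RFC."""
    sponsor: str
    reason: str
    requested_date: str                # requested implementation date
    cis_affected: list[str] = field(default_factory=list)
    services_affected: list[str] = field(default_factory=list)
    initial_cost_estimate: float = 0.0
    status: str = "registered"         # later redefined as a change record

rfc = RequestForChange(
    sponsor="Service Desk",
    reason="Recurring disk failures on finance workstations",
    requested_date="2000-06-01",
    cis_affected=["WS-0042"],
    services_affected=["Finance reporting"],
)
print(rfc.status)  # registered
```

The exact fields would vary with the RFC's origin, exactly as the text notes; an RFC raised by Problem Management, for instance, might also reference the related known error record.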


Typically a CAB is made up of a Change Manager, who will usually chair the meeting, plus representatives of the customer, users, developers, other experts, consultants, outside contractors and, of course, IT service management staff. A different combination of staff may attend any given CAB meeting; however, the core members should be the chairperson and the customer, user and ITSM representatives.

In general a CAB is regarded as an advisory body, although in some organisations it is defined as an approval board. Its role is considered advisory because the ultimate responsibility for change lies with the change management process, and hence with the change management staff. This provides a definitive mechanism for change approval, and makes changes traceable.

When making decisions about a proposed change, the CAB should consider the business, financial, technical and risk implications. It should also consider the repercussions of not implementing the change at all. One other area for consideration when deciding whether or not to implement a change is its likely impact on IT continuity plans: making changes to the IT infrastructure without making corresponding changes to any fall-back sites can be very dangerous.

The CAB Emergency Committee

In many large organisations IT provision is now 24 hours a day, seven days a week. In such environments the need for an RFC could occur at any time, so it is usual to have a Change Advisory Board Emergency Committee (CABEC) in place. The CABEC is usually called together at short notice to analyse the impact of an RFC and authorise any corrective work. The committee would usually consist of the Change Manager, who acts as chairperson, a senior business representative, and a senior IT representative.

A word of caution here about CABEC activities: often, due to time and business pressures, comprehensive testing of changes isn't possible, nor are configuration items updated with status or change information. Ultimately the CAB is responsible, through the emergency committee, for ensuring that the change management and configuration management processes work together to update the relevant records and logs as soon as possible.

Change Procedures

We established earlier in this lesson that the trigger for the change management process is the receipt of a request for change. To address these RFCs, ITIL defines a comprehensive change management process, and we will spend the next few minutes looking at this process in some detail.

Let's start with an incoming request for change, remembering that RFCs can come from many sources, including the business, other service management staff, or as a direct result of incidents or problems. The initial recipient of the RFC is the Change Manager. At this point RFCs are filtered, with the Change Manager rejecting those which, for example, have been incorrectly requested, are requests for service modification rather than changes, or are repeats of earlier requests. It's usual for RFCs to be logged in the CMDB ahead of this filtering process; after filtering, however, it's common for RFCs to change status and be redefined as change records.

If the change is accepted it moves to the next process, and the Change Manager allocates a priority to the change. This involves assessing the change for business impact and urgency. There are two possible outcomes from this assessment: 'urgent change' or 'standard change'. Whether changes are standard or urgent, the principles for processing them remain the same; however, urgent changes pass through a streamlined version of the change management process, which we will look at later in this lesson.

In this example the change is considered non-urgent, and so passes on to the categorisation process. Change categorisation involves an initial assessment of the actions and resources required to make the change. There are four possible outcomes from this process: Standard, Minor, Significant and Major. A 'standard' categorisation is assigned when a frequently occurring change is identified; it can then be dealt with via a pre-existing set of processes and authorisations. These change types are usually considered low risk, and don't require consideration by the CAB. An example might be a hard disk replacement or upgrade on a user workstation.


The definitions of minor, significant and major will vary between organisations, and will depend on the current status of the IT infrastructure and the IT service management personnel's current appetite for risk. A 'minor' change would usually be authorised by the Change Manager, who will report their actions to the CAB after completion of the change. The aim here is to reduce the number of RFCs forwarded to the CAB by filtering out low-risk changes.

If the change is defined as either significant or major, then the CAB has a significant role. In both cases, the first action is for the Change Manager to circulate the RFC to the CAB or, in the case of a major change, to the company board or other senior management. As we saw earlier in this lesson, the CAB's role is to give advice, provide estimates of required resources and timescales, and put forward schedules for change based on priority and resource availability. The CAB will also perform detailed impact analysis, and this often requires input from ITSM specialists, for example the Capacity Manager.

Eventually, implementation dates and a schedule are decided upon. This information is contained in a 'forward schedule of change', which is passed to the relevant service management staff and to the business as a whole. If changes are likely to cause disruption to the business, this will be formally documented in a 'Projected Service Availability' report. Remember, not all RFCs considered by the CAB will be accepted: after investigation, the potential risk or financial implications might be considered too high, outweighing any potential benefits the change might bring. The CAB activities of estimating and scheduling may well be iterative, and the process continues until an approved change status is reached, or the RFC is rejected, in which case it might re-enter the process at the beginning.

At the point of approval, the Configuration Manager updates the configuration management database. The change has now reached the change building sub-process. The 'Change Builder' may actually consist of several groups of internal or external staff, covering hardware, software, operating systems, documentation and so on. Change Builders are not normally permanent members of a change management team, but are drawn from areas of technical expertise.

Note that a failure during the change building process will almost certainly result in the change returning to the CAB, possibly with a request to modify the scope of the change. It's important that all changes have a back-out plan, so that if an error occurs during implementation, the change can be reversed and the service restored. At this point the failed change re-enters the process at the CAB level.

Once the change build is complete it moves to an independent tester, where the change is tested and quality checks are carried out. If a failure occurs at this point, the change is returned to the Change Builder. If the change is tested successfully, it moves on to the Change Manager, who coordinates the implementation of the change. Remember that the Change Manager has overall responsibility for the change, but that Release Management normally has control at the detailed physical implementation level.

Note that throughout the cycle of building and testing, and during implementation, the configuration management process is updating the status of change records. Typical statuses include 'accepted', 'in build', 'under test' and so on. A change record will typically contain details of the back-out plan, when the change was built, CAB recommendations and scheduled implementation dates; as a consequence, the change record changes frequently. It's important to manage the change record system within the CMDB accurately, so that we can carry out traceability tests. Change records are usually linked to the affected infrastructure configuration item records, and also to any related incident, problem or known error records.

If the change fails at the point of live implementation, the Change Builder instigates the back-out plan. If, however, the change is implemented successfully, it's important that the Change Manager reviews the change. The review process can provide valuable information about our change management process, and can also identify vulnerable areas in the IT infrastructure. A successful review triggers the 'closed' status, and the request for change or change record is updated in the CMDB. Note that the CAB itself might be involved in the review process. A failure at the review stage would identify shortcomings in the implemented change; this in turn would result in new requests for change entering the process.
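The status progression described above, with failures looping back to the CAB or the Change Builder, can be sketched as a small state machine. The exact status names and transitions vary by organisation; this is an illustrative sketch, not a prescribed ITIL scheme.

```python
# Allowed transitions for a change record; failures at build or test loop back.
TRANSITIONS = {
    "accepted": {"in build"},
    "in build": {"under test", "accepted"},      # build failure -> back to the CAB
    "under test": {"implemented", "in build"},   # test failure -> back to the builder
    "implemented": {"closed", "backed out"},     # review closes it, or back-out plan runs
}

def advance(current: str, new: str) -> str:
    """Move a change record to a new status, refusing illegal jumps."""
    if new not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current!r} -> {new!r}")
    return new

status = "accepted"
for step in ["in build", "under test", "implemented", "closed"]:
    status = advance(status, step)
print(status)  # closed
```

Enforcing transitions like this is one way a CMDB can support the traceability tests the text mentions: every status a change record reaches is one the process actually allows.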


In the previous few pages we have seen how the change management process deals with a standard change. We will spend the next few minutes looking at how change management deals with an RFC which has been given an urgent priority by the Change Manager.


The first action is for the Change Manager to call either a CAB meeting, or in an emergency situation, the CABEC. The aim of this meeting is to quickly evaluate the request for change, by assessing its impact, the resources required and its urgency. The meeting should establish whether it’s urgent status is justified. If the outcome suggests that the RFC status isn’t urgent, then it will be rejected, and will be dealt with as a standard RFC. If, on the other hand, the RFC status is confirmed as urgent, then it passes on to the next process and in to the hands of the Change Building Team. The Change Building Team then build the change and where technically possible, prepares a back out plan. When the change is complete, as much testing as possible should be carried out. Completely untested Changes should not be implemented if at all avoidable. In this case, the Change Manager then coordinates the implementation of the change into the live environment. If the implemented change fails, the Change Manager implements the back out plan. If the change is successful, then the Change Manager firstly ensures that records are brought up to date, carries out testing in the live environment, and at a later date, reviews the change. If after the review, the change is considered successful, then it is closed, and the Configuration Manager closes the RFC and updates the CMDB. Lets take a few steps back, and look again at the process, assuming this time we have time to test the change. This time our built change passes from the Change Builder to the Independent Tester who carries out testing as quickly as possible. If tests are successful, then the change is forwarded to the Change Manager for coordination of implementation. If the change fails during testing, then it returns to the Change Builder process. The Change Management process deals with Requests For Change from many areas of the organisation, and with different levels of authorisation. 
Where RFCs are frequent and repetitive, they can be dealt with via pre-existing and authorised processes. These processes are known as a 'standard model for change'. Standard models needn't be confined to simple changes; often complex operations can have standard models. In general, once an RFC is regularly repeated, we can create a standard model for that change.

We saw earlier in this lesson how the Change Manager examines RFCs and categorises them as either standard (using a standard change model), minor, significant or major. To assign one of these categories, the Change Manager examines the RFC and considers the following:

Impact: The impact the request for change will have on the business, considering such factors as the number of users affected.

Novelty: Is the change familiar? Has it occurred before? Together, impact and novelty can give us some idea of the level of risk involved with the RFC. An RFC with high impact and high novelty is certainly a higher risk.

Devolved Authorisation: Has the responsibility for change been devolved from the CAB to the Change Manager, or further devolved to, say, the Service Desk?

Standard Model: Can the request for change be dealt with via a standard model, with a pre-established implementation process?

So let's add some content to our table, starting with column 1. This RFC is regarded as low impact to the business, and is a well-known change, so the novelty is also low. Authorisation has been devolved to the Change Manager, and a standard model exists. This is a high-frequency RFC, handled as a standard change. Column 2 is slightly different: again the RFC is regarded as low impact, but it hasn't been done before, so its novelty is high and, as a consequence, no standard model exists. Again authorisation is devolved, and it's categorised as a minor RFC. This type of RFC could act as a trigger to build a new standard model. In our third example the results are slightly different. Our RFC has a high degree of novelty, and no standard model exists. It will be forwarded to the CAB, so authorisation isn't devolved to the Change Manager. This RFC falls into the significant category.
The RFC in our fourth example has a standard model, however, business impact is considered high, so devolution to the Change Manager won’t take place, and it must be examined by the CAB before the standard model processes are implemented. Hence this is regarded as a significant RFC.
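The categorisation walk-through above can be condensed into a small decision function. This is only an illustrative sketch: the function name, argument names and string values are assumptions, not ITIL terminology, and novelty is treated as feeding into the impact and devolution judgements rather than as a separate branch.

```python
def categorise_rfc(impact, devolved, standard_model):
    """Illustrative sketch of the RFC categorisation described above.

    impact:         'low', 'high' or 'very high' -- the Change Manager's
                    judgement (novelty feeds into this risk assessment, and
                    into whether authorisation was devolved at all).
    devolved:       has authorisation been devolved to the Change Manager?
    standard_model: does a pre-authorised standard change model exist?
    """
    if impact == "very high":
        return "major"          # authorised at a higher level than the CAB
    if not devolved:
        return "significant"    # must be examined by the CAB
    if standard_model:
        return "standard"       # handled via the pre-authorised model
    return "minor"              # devolved, but no standard model yet

# The first two table columns discussed in the text:
print(categorise_rfc("low", devolved=True, standard_model=True))   # standard
print(categorise_rfc("low", devolved=True, standard_model=False))  # minor
```

Columns 3 to 5 fall out as "significant" because authorisation is not devolved, and column 6 as "major" because of its very high business impact.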


As both the impact and novelty are high, the RFC in our fifth example must also be considered by the CAB. This is also a 'significant' RFC. In example six, we are considering a change which has very high business impact: for example, changing from an ISDN-based telephony system to ADSL. Changes of this magnitude would normally be authorised at a higher level than the CAB. It is categorised as a major RFC.

Finally, let's examine a couple of examples which, in general, should be avoided. Firstly, a change which is regarded as high impact but which has devolved authority: this is likely to be considered very risky. Secondly, a change which has no standard model but is low novelty should, by definition, have a standard model in place, and shouldn't be re-submitted to the CAB. Over time, we should expect the number of standard models, and the changes passing through them, to increase. This should result in a reduction in the number of changes forwarded to the CAB, and reduce the number of ad-hoc change requests devolved to the Change Management process.

Metrics & Audit for the Change Management Process

We've seen in this lesson how Change Management improves the way in which an organisation implements changes. To clearly identify these improvements, Change Management measures process performance, and this is carried out in accordance with our own standards. Measuring performance usually takes place over time to show, for example, that the number of urgent changes is reducing. So that the results can be clearly understood at all levels in the organisation, this data is usually represented in graphical form. Regular summaries of the change process should be provided to service, customer and user management. Different management levels are likely to require different levels of information, ranging from the Service Manager, who may require a detailed weekly report, to senior management committees, who may only require a quarterly management summary.
Typical metrics for measuring the change management process are:

• The number of changes implemented during the measured period

• Number of changes backed out by reason

• Number of staff training records up to date

• Cost per change versus estimated cost

• Number of urgent changes

By auditing the change management process we can check for compliance with procedures. In general a change management audit should investigate:

All new software releases: checking that they have been through a proper authorisation process.

Incident records: usually selected at random, and tracked through the change process.

Minutes of CAB meetings: not only to check that CAB meetings have taken place, but also to see if identified action points have been followed through.

Forward schedule of change: to see if it has been accurately defined and, importantly, that it has been published to the user community and is being adhered to.

And finally, that change review records are in place for all changes.

Efficient Change Management requires an ability to change things in an orderly way, without making errors and wrong decisions. Effective Change Management is indispensable to the satisfactory provision of services, and requires an ability to absorb a high level of change.

Benefits & Problems

The benefits of and potential difficulties with Change Management are listed on Page 33 of the little ITIL book and in Section 8.4 of the Service Support Manual.
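Several of the metrics listed above are simple counts over change records, so they are easy to derive automatically. The record fields and figures below are invented purely for illustration; a real service management tool would supply its own schema.

```python
from collections import Counter
from datetime import date

# Hypothetical change records for one measurement period.
changes = [
    {"id": 1, "implemented": date(2024, 5, 2),  "urgent": False, "backed_out": None},
    {"id": 2, "implemented": date(2024, 5, 9),  "urgent": True,  "backed_out": "failed testing"},
    {"id": 3, "implemented": date(2024, 5, 20), "urgent": False, "backed_out": None},
]

# Number of changes implemented during the measured period.
implemented = len(changes)

# Number of urgent changes.
urgent = sum(1 for c in changes if c["urgent"])

# Number of changes backed out, grouped by reason.
backouts_by_reason = Counter(c["backed_out"] for c in changes if c["backed_out"])

print(f"Changes implemented: {implemented}")            # 3
print(f"Urgent changes:      {urgent}")                 # 1
print(f"Back-outs by reason: {dict(backouts_by_reason)}")
```

Tracking these figures period by period gives the trend data (for example, urgent changes reducing over time) that the text recommends reporting graphically.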

Summary

In this lesson we have been looking at Change Management, the second ITIL control process. We began the lesson by defining what change is, and the goal of Change Management, in ITIL terms.


We looked closely at the relationships between Change Management and other ITIL processes, particularly Release, Capacity, Availability and Configuration Management. We established that the trigger for the Change Management process is the receipt of a Request For Change, and we looked in detail at some of the sources of these requests. We examined the role of the Change Advisory Board or CAB, its makeup, and the role it takes in the Change Management process. We went on to look at the role of the Change Advisory Board Emergency Committee. We studied in some detail the Change Management process for normal, standard and urgent RFCs, and defined the standard, minor, significant and major RFC categories. Finally we discussed the use of metrics and auditing in order to evaluate the change process, and highlighted the benefits, and potential pitfalls, of the Change Management process.

Page 46: EX0-100 ITILF User Guide

Lesson 3c Release Management

46

Lesson 3c Release Management

Objectives

In this final lesson on the ITIL control processes we will be looking at Release Management, which is described in Chapter 9 of the Service Support book of the IT Infrastructure Library. When you have completed this lesson you will be able to:

• Describe why Release Management is needed

• List the major benefits, costs and possible problems of this process

• Understand how the Release Management process functions, and its relationship with other IT and Service Management processes

• Describe what is meant by a Definitive Software Library (DSL), a Definitive Hardware Store (DHS), a Release Schedule, a release policy and a release metric

Introduction

The third and final ITIL control process is Release Management. ITIL defines the goal of this process as: 'To take a holistic view of a change to an IT Service and ensure that all aspects of a Release, both technical and non-technical, are considered together.' Release Management implements new software or hardware releases into the operational environment using the controlling processes of Configuration Management and Change Management.

So why do we need Release Management? Well, in simple terms it's the control process which ensures that all aspects of a release are handled properly, including the software, hardware and documentation required. It focuses on protecting the live environment and its services through the use of formal procedures and checks. This process requires technical competence, and its sub-processes are often performed by technical staff under the overall authority of the Change Manager. A release is defined in ITIL as a collection of authorised changes to an IT service.

Releases are often divided into:

Major software releases and hardware upgrades: these would usually contain large amounts of new functionality, some of which may make intervening fixes to Problems redundant. A major upgrade or release usually supersedes all preceding minor upgrades, releases and emergency fixes.

Minor software releases and hardware upgrades: usually containing small enhancements and fixes, some of which may have already been issued as emergency fixes. A minor upgrade or release usually supersedes all preceding emergency fixes.

And finally, emergency software and hardware fixes: normally containing the corrections to a small number of known Problems.

Release Management's holistic approach to IT service change ensures that the business as a whole, and any relevant technical areas, are ready to accept, implement and use a release. It is the responsibility of the Release Management process to plan and oversee the 'roll out' of these changes. 'Roll out' includes distributing all the configuration items to wherever they are used. This could be done in a number of ways: via the internet, by email, or by simply posting CDs. In general, use whatever means best suits the business. This all sounds very simple; however, the process becomes much more complex when hundreds of servers need to be upgraded simultaneously throughout a large geographic and cultural area. To ensure successful distribution, clear and repeatable processes, as well as technical and business skills, will be required. As part of the roll out activities, it is likely that you will need to provide scripts to help install the release, as well as passwords to activate the release when needed.

Release Management is also tasked with ensuring that only correct, authorised and tested versions are installed into the 'live' infrastructure. Additionally, Release Management ensures that we can trace where a particular version comes from, and the related changes it has undergone.
This is especially important for "due diligence and governance". To make this possible, software needs to be kept securely before, during and after the move to the 'live' environment. Release Management also agrees the exact contents of any release, and a detailed roll out plan, with Change Management.

The Release Management process encompasses three defined areas of the organisation: the development area, its own area of pre-production, and finally the production area, or live environment. Migration from one area to the next is only permitted subject to satisfactory results from reviews, tests and other appropriate quality checks. Release Management has full responsibility for the pre-production environment, which contains both the Definitive Hardware Store, or DHS, and the Definitive Software Library, or DSL. Although we show the DHS and DSL within the pre-production area, it is important that they remain detached from the development, pre-production and live environments. Remember, it's just as important to control a hardware change and release as it is to manage the software equivalent.

Independent testing might include customer acceptance testing, operational acceptance tests and so on. It may well be that significant customer acceptance testing has already been carried out. However, operational acceptance tests are very important: they ensure that anything that goes into the live environment is supportable, maintainable and robust.

Also worth noting is that any back-out plans which have been prepared should also be tested. Part of Change Management's role is to decide on the particular contents of the release, and it is very important that the Release Management team are fully aware of the decisions that have been made by other organisational elements.

Within the actual production environment we will have to deal with the distribution, potential rebuild and implementation of software and hardware releases. There may be three separate stages: firstly, to distribute the software; secondly, to build or rebuild it in the live environment; and finally, implementation. Each of these three stages should be verified as accurate. For example, before we attempt implementation, we should be absolutely certain that the rebuild process has been completed correctly. Note that ITIL refers to specific steps called 'Roll Out Management', which may take place after independent testing to manage in more detail the actual implementation stages that follow. Roll out management usually comes into play when we're dealing with very large and complex implementations or 'roll outs'. Throughout this process it is very important to update the CMDB, which holds information on Release Records, and to document any status changes to those records.


Definitive Software Library and the Definitive Hardware Store

Release Management has responsibility for two critical repositories: the Definitive Software Library, or DSL, and the Definitive Hardware Store, or DHS. Information related to the contents of the DSL and the DHS is held in the Configuration Management Database, and responsibility for keeping these records up to date belongs to Configuration Management.

The DSL contains only trusted versions of software, for example software which has been developed from valid earlier versions via correct Change Management processes. The DSL may consist of one disk containing all bought-in and created software held in a single format. Commonly the DSL consists of separate disk volumes or servers containing software for individual environments. Additionally, the DSL could contain other software media, such as diskettes, CDs and so on, which might be stored in a separate cabinet.

Software assets are particularly vulnerable to unintended loss or corruption, so it's important to take very good care of the DSL, for example by employing adequate security and access controls. Appropriate protection against other threats, such as fire or flood, should also be in place. Backup copies of critical elements of the DSL would usually be kept, often at another location. Finally, the DSL should be protected against virus infection by running regular virus checks on any item entering the library.

The Definitive Hardware Store should be protected in a similar way, and should have specific protection against physical removal. The contents of the DHS should be updated as quickly as possible to reflect the live environment. Storing older versions of hardware can be useful: if the organisation encounters significant problems with new configurations and software, it's possible to revert by cloning these older versions.
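One simple way to support the 'trusted versions only' requirement is to record a cryptographic fingerprint of each item as it enters the DSL, and re-check it before the item is used in a release. ITIL does not prescribe any particular mechanism; this sketch, including the file name and the in-memory register, is purely illustrative.

```python
import hashlib
from pathlib import Path

def fingerprint(path):
    """SHA-256 digest of a file, usable as a simple DSL integrity check.

    Recording the digest when an item enters the Definitive Software
    Library lets us confirm later that the stored copy is unchanged.
    """
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

# Hypothetical register, kept alongside the CMDB records for the DSL:
register = {}

item = Path("payroll_v1.1.zip")          # hypothetical DSL item
item.write_bytes(b"release contents")    # stand-in for the real media
register[item.name] = fingerprint(item)

# Later, before the item is used in a release, re-check it:
assert fingerprint(item) == register[item.name], "DSL item has changed!"
```

The same digests can double as the record that a distributed copy at a remote site matches the master held in the DSL.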
Remember, responsibility for maintaining the contents of the DSL and the DHS is shared between Release Management and Configuration Management.

One of the key activities of Release Management is deciding on the correct 'release type'. Firstly, it defines the 'release unit', which is defined as 'that set of Configuration Items within the infrastructure which is normally released together'. The general aim is to decide the most appropriate release-unit level for each software item or type of software. This can be set at system, application suite, program or module level. Different release units will exist in different parts of the infrastructure. For example, an organisation may decide that the normal release unit for its order processing service should always be at system level, and as such a change to a CI which forms part of that system will result in a full release of the whole of that system. The same organisation may decide that a more appropriate release unit for PC software is at suite level, and so on.

Once the release unit is defined, Release Management moves on to address the question of release type. Release types fall into three categories: full release, delta release and package release.

A full release is where all components of the release unit are built, tested, distributed and released together. For example, if the release unit is at program level, then the whole program would have to be rebuilt. If it's at suite level, then the whole suite, which might include many applications, would have to be rebuilt. Consequently, full releases are expensive to build, distribute and install; however, they do give confidence that all the elements of a service work together successfully. They are most appropriate for major changes, and are usually scheduled over longer periods of time.

A delta release involves distributing only the components that have changed since the last release. Consequently this is a less expensive option. Delta releases are most appropriate for fixes and urgent or emergency changes, and as such form the most frequent type of release. To reduce the frequency of delta and full releases, and to provide longer periods of stability, 'package releases' can be used.
A 'Package Release' might consist of groups of delta or full releases, or a combination of the two.

Defining release type involves deciding on a form of release identification. It's normal to use a numbering structure which applies to two or three levels. For example, a new payroll system might be assigned a release Id of V:1.0. An additional minor release which involves changes to some of its applications would generate a release Id of V:1.1. An emergency fix to a small element of a module within that system might have a release Id of V:1.1.1. Remember, there is no absolute limit to the number of levels used.

Definitions of release type and release unit should be documented in a Release Policy. This policy should also clarify roles and responsibilities, and give information on release frequency. The policy content is usually determined by the Release Manager, in conjunction with the Change Manager and the CAB. A Release Policy might also contain:

• Guidance on the level in the IT infrastructure to be controlled

• Details on release identification and numbering conventions

• A definition of major and minor releases, plus a policy on issuing emergency fixes

• Expected deliverables for each type of release

We mentioned earlier in the lesson that Release Management is responsible for the detailed planning of releases. Amongst other things, release planning involves:

• Gaining agreement on release content

• Producing a high-level release schedule

• Planning resource requirements

Release planning is responsible for verifying that all of the hardware and software in use is standard and has been derived from the Definitive Software Library and Definitive Hardware Store. In addition, the Release Planner develops a Release Quality Plan, to ensure all aspects of the release are quality managed, and produces a back-out plan.

Where a release is going to be particularly complex it may require a specific planning phase. To facilitate this, the Release Plan is extended to rollout planning. This expands the Release Plan produced thus far, adding details of the exact installation process developed and the agreed implementation plan.
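The two-to-three-level numbering scheme described earlier (V:1.0 for a new system, V:1.1 for a minor release, V:1.1.1 for an emergency fix) can be sketched as a small helper. The 'V:' prefix and the function itself are illustrative assumptions, not an ITIL convention.

```python
def next_release_id(current, level):
    """Bump the requested level of a release Id like 'V:1.1.1'.

    level 0 = major release, 1 = minor release, 2 = emergency fix.
    Lower levels are reset (dropped) when a higher level is bumped,
    reflecting the idea that a new release supersedes preceding fixes.
    """
    parts = [int(p) for p in current.removeprefix("V:").split(".")]
    parts += [0] * (level + 1 - len(parts))   # extend to the needed depth
    parts[level] += 1
    parts = parts[: level + 1]                # drop the superseded lower levels
    return "V:" + ".".join(str(p) for p in parts)

print(next_release_id("V:1.0", 1))    # V:1.1   (minor release)
print(next_release_id("V:1.1", 2))    # V:1.1.1 (emergency fix)
print(next_release_id("V:1.1.1", 0))  # V:2     (major release supersedes the rest)
```

Keeping the identifier derivable like this makes it easy to check that release records in the CMDB follow the documented numbering convention.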

Roll out planning involves:

• Producing a detailed timetable of events

• Listing all the CIs to be installed and decommissioned

• Producing release notes and communications to end users

• Planning communication

Roll out planning, together with Release Management, decides on the type of rollout approach. This might be a 'big bang', phased or pilot approach. A big bang approach involves all sites receiving all functionality simultaneously. The benefit of this approach is that it offers consistency of use across the organisation; however, achieving a simultaneous upgrade can be problematic. In a phased approach, all sites could receive some functionality at the same time, with more coming later. In a pilot approach, a single site receives all functionality ahead of other sites. Note, however, that combinations are possible, for example a 'phased pilot' approach.

Compliance with software licence agreements has become critical to businesses. Ensuring these obligations are met is the joint responsibility of Release and Configuration Management. For example, when moving software to the DSL, it is important to check that what has been purchased has arrived, that it has been virus checked, and that the licence agreement has been checked. Remember, penalties for breaching the laws on software theft are applicable to any responsible officer of the company, including those at the highest level. There are many legal precedents for holders of software intellectual property rights arriving unannounced at premises and impounding any equipment which they believe contains unlicensed copies of their software.

Benefits & Problems

The benefits of and potential difficulties with Release Management are listed on Page 39 of the little ITIL book and in Section 9.4 of the Service Support Manual.


Summary

In this third and final lesson on the ITIL control processes, we have been examining Release Management. We started the lesson by defining ITIL's goal for Release Management, and why Release Management is necessary. We saw how releases can be divided into major, minor and emergency releases, and discussed Release Management's holistic approach to IT service change and how, as part of this approach, it produces detailed release or rollout plans. We examined the Release Management process, and the linkages to its critical repositories, the Definitive Software Library and Definitive Hardware Store, as well as the Configuration Management Database. We looked in some detail at release types, release units and release identification, and we concluded the lesson by identifying some of the benefits, and potential problems, of the Release Management process.

Page 51: EX0-100 ITILF User Guide

Lesson 4a Availability Management

51

Lesson 4a Availability Management

Objectives

The topic for this lesson is Availability Management, which is described in Chapter 8 of the Service Delivery book. Once you have completed this lesson you will be able to define Availability Management and describe how it relates to other ITSM components. You will be able to recognise the main elements of the Availability lifecycle and understand the terms MTBF, MTTR and MTBSI. You will appreciate the main responsibilities of the Availability Management process and be able to recognise several techniques which are of use in this area.

Introduction

Despite the fact that the IT Infrastructure is becoming ever more reliable, and hence availability levels are generally better than they have ever been, Availability Management is nonetheless a critical support process for Service Level Management. Availability is now regarded as one of the most important issues for IT service management because of the growing dependence of businesses on their IT services. Availability Management supports Service Level Management by actively managing the availability of services; for example, it assists the Service Level Manager in negotiating and monitoring service levels.

The Service Delivery book states that: The goal of the Availability Management process is to optimise the capability of the IT Infrastructure, services and supporting organisation to deliver a cost effective and sustained level of Availability that enables the business to satisfy its business objectives.

The critical words here are 'cost effective'. The business can have almost any availability it likes provided it is prepared to pay for it. One only has to look at the expenditure on safety-critical systems and on general aeronautical systems to understand this. For most commercial and organisational systems there is a limit to the benefit in extra availability that the business can afford by using more and more advanced techniques and equipment.

The business, of course, is interested in the availability of its services, such as e-mail, personnel records and so on, and is not directly concerned with the availability of the individual components that make up those services. In general, the availability of a service is influenced by the complexity of that service and the systems that it is based on, by the reliability of the items in the infrastructure, by both corrective and preventive maintenance procedures, and also by our incident, problem and change management procedures. It is important for all staff involved to understand that if a business service is unavailable because of an IT problem there will be a loss of business productivity. This may also lead to a loss of revenue, customer dissatisfaction and extra costs in having to pay staff overtime for the work they couldn't do when the system was unavailable.

Availability Management: Relationships and Definitions

We will now explore the relationships that exist between availability and the various elements of the support organisation, such as Service Level Agreements, IT Services and their customers. A customer will negotiate a Service Level Agreement with IT Services, and within the SLA there will be statements about service availability. These statements might say that we expect 99% availability from a service measured over a one-month period, or that we expect no more than one hour's lost service over a four-weekly period, or that we expect no more than three breaks of service totalling one hour over a monthly period. The definition of availability, and the way we phrase it, will be subject to local discussion. The current best practice view is to make this statement as business focused as possible, and to think in terms of unavailability rather than availability.
The generic definition of availability is: “The ability of an IT service or component to perform its required function at a stated instant or over a stated period of time.” (SD Manual 8.2.3)


Related terms, which are also defined in the same section of the Service Delivery manual, are Reliability, Maintainability and Serviceability. In Service Level Agreements, and in clauses with suppliers through underpinning contracts, availability is often expressed as a percentage: the percentage of the agreed service hours for which the component or service is available. That percentage is often used as a measure of how good or bad the availability is; to say that we require 99% availability of the service over a given period is a fairly common way of defining what is needed by the business.

So, customers negotiate the SLA availability clauses with the IT service through Service Level Management processes and then, as we will be seeing in later lessons, Service Level Management processes require underpinning support. There are broadly two types of underpinning support: one through Operational Level Agreements with internal suppliers, the other through underpinning contracts with external providers. In the case of internal support, such as application support, hardware support and so on, we'll expect to find statements in the OLA on the availability, reliability and maintainability of the components that the group is responsible for. When we are talking about underpinning contracts, the word 'serviceability' is often used as a contractual term, covering availability, reliability and maintainability when applied to components supported by external suppliers. The word serviceability, in ITIL, is reserved for use where support is provided by external parties, and will incorporate statements about the availability, maintainability and reliability of their managed components and services.
Again, measuring the way the third party suppliers are achieving availability would be of value to the organisation and should be part of the role of availability management.
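The percentage calculation behind a clause such as "99% availability over the agreed service hours" is straightforward. The service hours and downtime figures below are invented purely for illustration.

```python
def availability_pct(agreed_service_hours, downtime_hours):
    """Percentage availability over the agreed service hours.

    This is the conventional SLA-style calculation: available time
    divided by the total agreed service time, expressed as a percentage.
    """
    return 100.0 * (agreed_service_hours - downtime_hours) / agreed_service_hours

# A service agreed for 12 hours a day, 5 days a week, over 4 weeks = 240 hours,
# with 2.4 hours of recorded downtime in the period:
print(f"{availability_pct(240, 2.4):.1f}%")   # 99.0%
```

Note that the result depends entirely on what counts as "agreed service hours": downtime outside the agreed window does not reduce the reported figure.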

Availability Lifecycle

It is useful to think of availability as having a lifecycle. So imagine that we have a timeline with time running from left to right. Now, for a particular component, let's say that a failure occurs at time X1. This will be recorded in ITIL as an Incident. There will then be a period of time that it takes to repair the faulty component; this is usually referred to as the Mean Time To Recover or MTTR. Be very careful here, as the R in this acronym can have a number of alternate meanings. We have defined it as 'Recover', but it is also commonly taken to mean 'Respond', 'Repair' or 'Restore'. Imagine, for example, that the failure is a crashed hard disk. There will be a period of time that it takes to 'Respond' to the incident, to get an engineer on site. Then there will be a further period during which the disk is being repaired or, more likely, replaced. Typically, it will then take some time to 'Restore' the data to the point where normal business can be resumed. In this course we will be using the term 'Recover' to encompass all of this, and the Mean Time To Recover is the average length of time that all of this takes to achieve. Be aware, though, that it may be useful to understand these other measures, as they are often captured by service management organisations to check on various aspects of the availability management process.

Once normal service has been recovered there will then be a hopefully long period of time before the component fails again at time X2. The period of time between the fault being recovered and the next failure is known as the Mean Time Between Failures or MTBF. Hence it is easy to see that the sum of the MTTR and MTBF will give what is called the Mean Time Between System Incidents or MTBSI.
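The arithmetic relating these measures can be shown in a few lines. The figures are invented, and the formula availability = MTBF / (MTBF + MTTR) is the conventional reliability-engineering identity rather than an ITIL-specific definition.

```python
# Invented averages for a single component, in hours.
mttr = 4.0     # Mean Time To Recover: failure to full recovery
mtbf = 508.0   # Mean Time Between Failures: recovery to the next failure

# MTBSI is the average time from one incident to the next, i.e. one
# full down-and-up cycle: recovery time plus the following uptime.
mtbsi = mttr + mtbf

# Fraction of time the component is up: uptime over the whole cycle.
availability = mtbf / (mtbf + mttr)

print(mtbsi, f"{availability:.4f}")   # 512.0 0.9922
```

The formula makes the trade-off in the text explicit: availability improves either by raising MTBF (more reliable components) or by cutting MTTR (better maintainability and support procedures).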


MTBF, MTTR and MTBSI

We can now consider the relationships that exist between each of these three parameters and the terms Availability, Reliability and Maintainability that we have already discussed. It is obvious from the diagram that a high Mean Time Between System Incidents implies high reliability: if components don't fail very often, then the services which are based on them will be reliable. So a high MTBSI is obviously a good thing. On the other hand, a low Mean Time To Recover is good news, since this implies high maintainability. This can be achieved not only by technical means, but by having good support procedures within the IT service management team, so that there are no delays between an incident being detected and repair work starting. As you might expect, a high Mean Time Between Failures is very desirable and directly equates to high availability.

So, typically, we can see that if we want to achieve higher availability, then either increasing the Mean Time Between Failures or reducing the Mean Time To Recover, or a combination of the two, can achieve this. All of these measures, MTBF, MTTR and MTBSI, can be applied at both the component and the overall service level. Typically, if we want to increase the overall availability either of a service or of an assembly of components, then this can be done either by increasing the reliability of each component or the resilience of the assembly, or by improving maintainability and the procedural aspects. If an e-mail service is dependent on two servers and each has an MTBF of 5000 hours, what will be the MTBF of the e-mail service?

Increasing the MTBSI and MTBF figures and reducing the MTTR will all cost money. There will be a limit to how much we can spend to achieve high reliability and high resilience, and a limit to how much we can spend to achieve instantaneous reporting and repair.
As we said at the start of this lesson, the business can have almost whatever availability it wants – provided that it is prepared to pay for it.
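As a hedged illustration (not part of the ITIL material itself), the relationships above can be sketched in a few lines of Python. The availability ratio MTBF/(MTBF + MTTR) is a standard steady-state approximation rather than an ITIL definition:

```python
def mtbsi(mtbf_hours, mttr_hours):
    """Mean Time Between System Incidents = MTBF + MTTR (as in the text)."""
    return mtbf_hours + mttr_hours

def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: the fraction of time the component is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# A component that runs 5000 hours between failures and takes 4 hours to recover:
print(mtbsi(5000, 4))        # 5004
print(availability(5000, 4))
```

On the lesson's two-server question: if both servers are needed for the service (a series arrangement) and failures are assumed independent and exponentially distributed, failures arrive twice as often, so the service MTBF would be roughly 2500 hours.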

The Business View of Availability

All businesses rely on their IT services – but some services, or parts of services, will be more important to the business than others. For example, in an EPOS service, the critical requirement is that we are able to take payments. Other functions, such as the automatic updating of stock levels, are important but not as vital as serving the immediate customers. Therefore it may be necessary to aim for higher availability for the first part of the service than for the second. ITIL refers to such business-critical functions as Vital Business Functions or VBFs.

The concept of Vital Business Functions is widely used in IT Service Continuity Management and Availability Management within ITIL, and is a way of highlighting the services of which the business requires almost 100% availability. Understanding each Vital Business Function allows the Cost of Unavailability of a service to be measured and reported. Such costs may be incurred through revenue loss, overtime payments and so on, as we discussed earlier. Cost of Unavailability is a more effective way of reporting than percentage availability, because it relates the loss of service directly to its true cost to the business.

It is important to report on trends and to agree on the measurement period. For example, "Service was available for more than 98% of the agreed service hours during the last month" may be very useful when we're reporting against service levels in Service Level Agreements, which are often expressed in the same way. Trends are very important in the whole of service management. Service improvement programmes, for example, set out to move things forward, and that relies on having some baseline against which to measure. So, for example, we might want to say that we've moved forward in terms of the number of breaches of Availability Agreements from last year to this, with the number decreasing from, say, 10 to 5.
Section 8.7.7 of the Service Delivery manual uses what it calls an IT Availability Metrics Model (ITAMM) as a framework for deciding on the sort of reporting that needs to be done. Because it covers such a wide range, from details of component availability right through to services, it is a basis for all reporting, both internal and external. It is beyond the scope of a Foundation course to cover much more about the ITAMM; just the fact that it exists and is a basis for important reporting is what we need to know.

Responsibilities of Availability Management

Page 64 of the Little ITIL Book gives a useful listing of the responsibilities of the Availability Management process. The first of these, concerning the optimisation of availability, is self-evident, and much of this lesson concerns that particular point.

The second point is about determining availability requirements in business terms. It is very important that we are able to work with the service level manager and the customer so that their requirements for availability can be expressed in terms with which they feel comfortable. They are often much more comfortable discussing lost business and downtime caused by loss of IT services than percentages and fractions. Hence we must be able to gather these requirements in the relevant terms and translate them into meaningful technical terms for discussion with suppliers of underpinning services, both internal and external. Conversely, if we are producing technical information about availability – MTBFs, MTBSIs and so on – it is our responsibility to help the service level manager turn these figures back into meaningful business terms for the customer.

The third point, predicting and designing for expected levels of availability and security, implies that availability management staff are involved in the systems development process right from the very beginning. It is an ITIL recommendation that Availability Management staff should be involved when the business case is being created for a new or extended service, and that they remain involved all the way through the analysis and design process. The aim is to ensure that the needs of availability management, including maintainability and reliability, are built in along with security elements. This implies that availability management staff have some familiarity with system development processes.

The fourth point is the Availability Plan. This should be a long-term plan for the proactive improvement of IT service availability within the imposed cost constraints. A good plan should have goals, objectives and deliverables, and should look at all the issues of people, processes, tools and techniques as well as at the technology. In many ways the Availability Plan is analogous to the Capacity Plan, and should take account of current levels of availability against the service level requirements, trends in availability, new technological options, and knowledge of the way the business is developing. There is no absolute guideline on how far ahead the plan should look but, following the capacity management analogy, it would be reasonable to think in terms of one year at a time with a review at least every three months.

The fifth item on the list of responsibilities is all about the collection, analysis and maintenance of availability data. Monitoring the various availability parameters can generate a large amount of data, and because of this it is not unusual to find an Availability Management Database being created – either as a separate entity or by adding extra information to the Configuration Management Database.

Item six is arguably one of the most important areas and defines the role of the availability manager. This is all about monitoring service availability against the Service Level Agreements, for the benefit of the service level manager. The performance of internal and external suppliers against the serviceability requirements in any underpinning contracts, and against targets defined in Operational Level Agreements, must also be monitored as part of this process.

The final point refers to the need for the Availability Management process to be continually looking for improvements on a proactive basis. In other words, not waiting for targets to be threatened before taking action, but constantly reviewing current status and looking for cost-effective ways of improving availability. As with many of the other ITIL processes, this proactive work is critical but may be the last part of the process to be implemented.


There is an additional responsibility on the process owner: to monitor the effectiveness and efficiency of the availability management process itself. This can often be done by looking at how many SLAs have been breached because of availability issues, and at how many components have measurement in place.

The Availability Management Process

Section 8.3 of the Service Delivery manual describes the Availability Management process in some detail. The inputs to the process include:

• The Availability Requirements of the business, which are critical.
• A Business Impact Assessment, so that the Vital Business Functions and the consequences of loss of availability are fully understood. This will help in determining priorities when setting up the Availability Management process for the first time.
• The availability, reliability and maintainability requirements determined from the business as part of the service level negotiation process. Some of these will be for existing services, while others will be for services that are still in conception.
• Incident and Problem data. Part of the proactive work will be to investigate incidents and problems, to see which of them were caused by unavailable equipment, and what their impact was on availability measures.
• Configuration data. This will be very important, since it shows the relationships between configuration items and the chain of configuration items that makes up a typical service. This will enable us to look for sensible places where we might decide to replace equipment with higher quality, more reliable equipment; or to mitigate a possible single point of failure – or SPOF in ITIL terms – by providing alternative routing in a network, or perhaps by duplicating discs or processors.
• Service level achievements. Remembering that one of the jobs of availability management is to ensure we achieve service levels in the area of availability, we will be constantly looking at records of service level achievement and of service level breaches or potential breaches.

Now let's look at the key outputs from the process, which are:

• Availability and Recovery Design criteria for each new or enhanced IT Service. These are intended to help the development teams decide how to achieve high availability.
• Details of the Availability techniques that will be deployed to provide additional infrastructure resilience, to prevent or minimise the impact of component failure on the IT Service.
• Agreed targets of availability, reliability and maintainability for the IT infrastructure components that underpin the IT Services.
• Reporting of availability, reliability and maintainability to reflect the business, User and IT support organisation perspectives.
• The monitoring requirements for IT components, to ensure that deviations in availability, reliability and maintainability are detected and reported.
• And finally, an Availability Plan for the proactive improvement of the IT infrastructure.

Security

It can be argued that the most valuable assets of IT services are the data and the ability to process that data. This is why security is such an important part of IT service management. The basic logic behind managing these assets is:

• Make sure that access is denied to unauthorised people. In other words, maintain Confidentiality.
• Make sure that the assets are trustworthy. That is, maintain Integrity.
• And make sure that assets are available to authorised people when they need them. Or, maintain Availability.

This may lead to some conflict and possible trade-offs. For example, high availability is not necessarily good if it compromises confidentiality or integrity. Within ITIL, availability aspects are the responsibility of availability management, while the confidentiality and integrity issues are shared responsibilities with security management. Within an organisation, it may well be that the whole responsibility for CIA is devolved to the availability management team. It is very important that such responsibilities are clarified.

Techniques for Availability Management

One of the most basic techniques used in Availability Management is the calculation of availability as a percentage. The basic calculation is straightforward: the availability of a service, of an individual component or of a grouping of components is given by the Agreed Service Time (AST) minus the Down Time (DT), divided by the Agreed Service Time – all times 100 to obtain a percentage value. Note that component availability is very often expressed as a decimal value – always less than one – rather than as a percentage.

In order to take account of the fact that one user losing access to the system is significantly less serious than 100 users all losing access, a weighted calculation can sometimes be more meaningful. This is calculated by replacing the variables AST and DT with End User Processing Time (EUPT) and End User Down Time (EUDT). End User Processing Time is defined as the Agreed Service Time multiplied by the total number of users (Nt). End User Down Time is found by multiplying the Down Time by the number of users affected.

So, if a system is meant to be available for 40 hours in a week and there are 10 users of the system, EUPT will be 400. If just one of the users is affected for four hours but the other 9 users are not affected at all over that period of measurement, then End User Down Time will be four hours of downtime times 1, giving a value of 4. The overall availability is therefore 400 minus 4, divided by 400, all times 100 – giving a weighted availability of 99 per cent.
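As an illustrative sketch (Python is used here purely as a calculator; it is not part of the ITIL material), the basic and weighted calculations work out as follows:

```python
def availability_pct(ast_hours, downtime_hours):
    """Basic availability: (AST - DT) / AST x 100."""
    return (ast_hours - downtime_hours) / ast_hours * 100

def weighted_availability_pct(ast_hours, total_users,
                              user_downtime_hours, users_affected):
    """Weighted availability using End User Processing Time and Down Time."""
    eupt = ast_hours * total_users               # e.g. 40 x 10 = 400
    eudt = user_downtime_hours * users_affected  # e.g. 4 x 1 = 4
    return (eupt - eudt) / eupt * 100

# The example from the text: 40 agreed hours, 10 users, one user down for 4 hours.
print(availability_pct(40, 4))                  # 90.0 (simple calculation)
print(weighted_availability_pct(40, 10, 4, 1))  # 99.0 (weighted calculation)
```

The gap between the two figures (90% versus 99%) is exactly why the calculation method must be agreed with the users in advance.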

Contrast this with the value given by the simpler basic calculation, which would be only 90%. It's important to note that whichever way of calculating availability is chosen, it has to be agreed with the users before it can be used as the mechanism by which we measure and report.

Percentage availability may not always be the most useful measure from a business point of view. Absolute figures of up-time and down-time over an agreed period might be more appropriate and more acceptable to the business. So, for example, we could say that there were four hours of downtime out of 400 potential service hours in the last week, and that may be a more useful measure than turning it into a percentage value. This is all about agreement and trust between customer and supplier, and whichever figures are chosen should be those most meaningful to the business.

It is very important to understand and be consistent in the use of reporting periods. For example, an availability of 99% to be achieved on each and every day is much more demanding than the same percentage averaged over a year-long reporting period. It is possible to achieve 99% availability over a year whilst losing service for perhaps two whole days; to achieve 99% on a daily basis, the allowable downtime on any one day would have to be reduced to just a few minutes.

Great care must be taken over the definition of Agreed Service Time. For example, does it include downtime for maintenance? Is that already factored in? In most cases we would not want to be penalised for agreed downtime for maintenance or upgrades. In 24/7 systems, however, where the requirement is for very high availability, the figures often do – and are meant to – include any time for maintenance, which will need to be reduced to an absolute minimum.

The pattern of downtime may also be critical and will need to be understood. For example, depending on business circumstances, 10 losses of service each of 10 minutes' duration may be more damaging than a single loss of service of 100 minutes over the same period. The reporting requirements to cover such differences will need to be closely examined and agreed with the business.

In reporting and discussing availability with end users and customers, the main areas of interest will nearly always be based around services and not around components. However, internal reporting for service improvement purposes and for supplier management will often require reporting at the component level.

Calculating the Availability of Multiple CIs

A very common requirement is to be able to understand and calculate how the availability of an assembly of configuration items is governed by the individual component availabilities. An assembly is a grouping of more than one configuration item.

The formula for calculating End-to-End availability for items arranged in series is fairly simple: the overall availability AT is the product of the availabilities of each of the individual components. So if we have two components, each of which is capable of delivering 90% availability, the End-to-End availability of the assembly will be 0.9 times 0.9 – or 81%. In other words, significantly less than each of the components making up the assembly. It is easy to see from this formula that the more items that are put in series, the lower the End-to-End availability figure will be.

Calculating End-to-End availability for items arranged in parallel is a little more complicated: the overall availability is 1 minus the product of the component unavailabilities, that is AT = 1 - (1 - A1)(1 - A2). So for the same two components now arranged in parallel, the resulting End-to-End availability will be 99%. Again it is easy to see that, unlike components arranged in series, the more CIs that are put in parallel, the higher the overall availability will be – but such duplication of components, or duplexing, will necessarily increase costs.
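A minimal sketch of the two formulas (both assume independent component failures, as the formulas themselves do):

```python
from math import prod

def series_availability(availabilities):
    """End-to-End availability of CIs in series: the product of availabilities."""
    return prod(availabilities)

def parallel_availability(availabilities):
    """End-to-End availability of CIs in parallel: 1 minus the product
    of the component unavailabilities."""
    return 1 - prod(1 - a for a in availabilities)

# Two components, each 90% available:
print(round(series_availability([0.9, 0.9]), 2))    # 0.81 - series is worse than either CI
print(round(parallel_availability([0.9, 0.9]), 2))  # 0.99 - parallel is better than either CI
```

Adding a third 0.9 component to the list makes the series figure fall further and the parallel figure rise further, which is the general principle the text describes.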

There may also be some technical limitations in terms of how easy it is to switch from one component to another when one fails, but the general principle is that a significant improvement to assembly availability can be achieved in this way.

One difficulty in both cases is finding good values for A1 and A2. Assuming they are hardware components, these could be derived from a combination of manufacturers' engineering specifications (not their sales literature), other similar installations, and your own experience gained during testing or development. Using a combination of those three sources will tend to give realistic values for the availability of individual components. Once an initial base of figures has been established, monitoring availability over a period of time – using monitoring tools and records of incidents from the service desk – allows an iterative improvement in the component availability figures.

Finally, there is a range of techniques designed to aid understanding of why availability problems are occurring in particular parts of the infrastructure, and to find corrective ways of working. The first of these techniques is Component Failure Impact Analysis (CFIA). This is normally represented as a matrix showing configuration items against the services they support. For example, we might see that service 'B' is dependent on all four of the CIs 1 to 4 being available, whilst service 'D' only requires items 3 and 4. Looking at it another way, we might see that item 3 is essential to all four services: none of them can function without it.

It is important to realise that the CFIA matrix can be read either down the columns or across the rows to give us different information. If 'B' is a service which has Vital Business Functions within it, then it becomes critical to understand, at a more detailed level, how those VBFs depend on the components. As a first-pass analysis of dependency, and for understanding where single points of failure could be critical, CFIA is very useful.
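The CFIA matrix described above can be represented with a simple structure. The particular CI-to-service mappings below are hypothetical, chosen only to match the pattern in the text (service 'B' needs all four CIs, 'D' needs only 3 and 4, and CI3 underpins every service):

```python
# Hypothetical CFIA matrix: True means the service depends on that CI.
cfia = {
    "CI1": {"A": True,  "B": True, "C": False, "D": False},
    "CI2": {"A": False, "B": True, "C": True,  "D": False},
    "CI3": {"A": True,  "B": True, "C": True,  "D": True},
    "CI4": {"A": False, "B": True, "C": False, "D": True},
}

# Reading across a row: which services fail if this CI fails?
impact_of_ci3 = [svc for svc, needed in cfia["CI3"].items() if needed]

# Reading down a column: which CIs does service 'B' depend on?
deps_of_b = [ci for ci, row in cfia.items() if row["B"]]

# CIs needed by every service are prime single-point-of-failure candidates.
spof_candidates = [ci for ci, row in cfia.items() if all(row.values())]
print(spof_candidates)   # ['CI3']
```

Reading rows gives the impact of a component failure; reading columns gives the dependencies of a service – the two directions the text describes.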


So in such an example, CI3 is a very good candidate for attention, such as replacement with a more reliable item, or duplication by the addition of a parallel assembly as a replacement for the single component CI3. More sophisticated information can be put into the CFIA, such as recording that for service 'B' to run, either component 3 or component 4 needs to be there, but not necessarily both. This may require some extension to the notation, which is often home-grown or company-specific and is beyond the scope of this course.

Another useful technique is called Fault Tree Analysis or FTA. This is a diagrammatic technique, drawn originally from the world of engineering, which identifies the chain of events leading to service failure. It is part of a family of techniques generally referred to as Failure Mode & Effect Analysis or FMEA, which is covered in more detail in the lesson on Problem Management.

Risk analysis can be done in a variety of ways. The way that is favoured in ITIL, because it originally comes from the same development source, is known as CRAMM – the CCTA Risk Analysis and Management Method. The CCTA, or Central Computer and Telecommunications Agency, was the original name for the OGC or Office of Government Commerce; the name was changed in 2001. We'll talk a bit more about CRAMM in the IT Service Continuity Management lesson.

One of the key requirements of availability management is to be able to achieve an understanding of why a particular lack of availability is occurring and what to do about it. There are a couple of techniques that can help us here: System Outage Analysis (SOA) and Technical Observation Posts (TOPs).

SOA involves a detailed analysis of service interruptions. It is really a post-mortem on some of the more major incidents that have occurred in the infrastructure, trying to find some common underlying theme or cause for the availability losses. It requires significant inter-disciplinary work between different teams, and tends to be managed as a small project with a particular budget and reporting period.

Setting up a Technical Observation Post, or TOP, is an expensive exercise because it involves bringing together a team of people to look at a service at a vulnerable period of its life. If, for example, we know that there are availability problems every month while data is being assembled for end-of-month financial work, then a Technical Observation Post might be set up to look at this particular process. In effect, the TOP watches the process go wrong in order to understand more accurately what is happening. This is particularly useful in cases where it proves difficult to simulate, under test conditions, the fault that is causing the loss of availability. It requires an inter-disciplinary team and an acceptance from the business that the only way of finding and resolving the issue is by allowing some availability losses to occur.

It is worth noting that, in addition to the techniques discussed in this section, the Availability Management process will support and work closely with proactive Problem Management. Many of the techniques used in Problem Management may also help with identifying the underlying reasons for lost availability.

Benefits and Problems of Availability Management

The benefits of, and potential difficulties with, Availability Management are listed on Page 68 of the Little ITIL Book and in Section 8.3.5 of the Service Delivery manual. They are also summarised here for your convenience.

Summary

In this lesson we have been examining the Availability Management process. Once you have completed this lesson you will be able to define Availability Management and describe how it relates to other ITSM components. You will be able to recognise the main elements of the Availability lifecycle and understand the terms MTBF, MTTR and MTBSI. You will appreciate the main responsibilities of the Availability Management process and be able to recognise several techniques that are of use in this area.


Lesson 4b Capacity Management


Objectives

In this lesson we will be examining Capacity Management, which is covered in Chapter 6 of the Service Delivery book in the IT Infrastructure Library. Once you have completed this lesson you will be able to:

• Define Capacity Management and its three sub-processes of Business, Service and Resource Capacity Management
• Identify Capacity Management's ongoing, ad hoc and regular activities
• Describe the contents of the Capacity Database and the Capacity Plan

What is Capacity Management?

In order that Service Level Agreements are met, it is critical that sufficient capacity is available at all times to meet the agreed business requirements. Capacity Management ensures that IT processing and storage capacity provision match the evolving demands of the business in a cost-effective and timely manner. Of all the ITIL processes, this can be regarded as one of the most proactive.

ITIL defines Capacity Management's goal as: 'To understand the future business requirements (the required service delivery), the organisation's operation (the current service delivery), the IT infrastructure (the means of service delivery), and ensure that all current and future capacity and performance aspects of the business requirements are provided cost effectively.'

The Capacity Management process incorporates Performance Management, Capacity Planning, and monitoring and tuning activities. In a large organisation there may be many people working in a Capacity Management team under the leadership of a specialist. In smaller organisations it might be the role of a single individual, supported by technical specialists from networking, the desktop and so on.

The Capacity Manager role requires excellent technical and business capabilities. The day-to-day activities include dealing with technical specialists and service level managers. It is not usual for the Capacity Manager to communicate with customers, or to be responsible for the procurement of new equipment; however, Capacity Management will have a significant input to purchasing decisions.

Capacity Management – a balancing act

The Capacity Management process can be regarded as something of a balancing act. The organisation must provide enough capacity to meet justified business demands, balanced against the costs that the organisation can afford to pay.

There are two 'laws' associated with Capacity Management which offer an insight into the demands placed on this process. The first is Moore's Law, which suggests that processing capacity doubles every 12 to 18 months. The second is a variation on Parkinson's Law, which states that data expands to fit the space available for storage. This highlights a second capacity problem, that of supply and demand: as greater capacity becomes available, users will make use of it. There is continual pressure from the business and customers to increase capacity, but in doing so costs are incurred by the business. Ultimately, a decision has to be made over whether the cost of capacity provision delivers enough business benefit, and Capacity Management must justify the cost of any capacity increases. Broadly speaking, the objective is to provide the:

• Right capacity – enough, but not too much
• At the right cost
• And, critically, at the right time
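As a rough illustration of the Moore's Law rule of thumb mentioned above (doubling every 12 to 18 months is an approximation, not a guarantee), a simple projection might look like this:

```python
def projected_capacity(current, months_ahead, doubling_months=18):
    """Project capacity assuming it doubles every `doubling_months` months."""
    return current * 2 ** (months_ahead / doubling_months)

# 100 units of processing capacity, projected three years out,
# with an assumed 18-month doubling period:
print(projected_capacity(100, 36))   # 400.0
```

A Capacity Plan making forecasts on this kind of basis would, of course, still need validating against the business's own growth predictions.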

In theory, if the Capacity Management processes are running well, providing the right level of capacity at the right time, then they should be invisible to the business, and to most aspects of Service Level Management.

In any organisation there can be a huge number of capacity elements to be managed, any of which could impact on the business. Those shown in the question represent just a few of the IT components which Capacity Management must address. Interestingly, people are not usually thought of in capacity terms, except where a shortage of people leads to other capacity problems – for example, if we don't have enough service desk staff to fulfil commitments made in Service Level Agreements.

As we mentioned earlier in this lesson, providing capacity to the business at the right time is critical. If capacity upgrades are too late, then the infrastructure could fail. Failures might already be occurring, for example through incidents and complaints reported to the service desk, or internal monitoring tools might indicate that we are operating close to capacity. Buying in extra capacity at short notice leaves little negotiating power with external suppliers and, as such, is likely to be very expensive. Conversely, upgrading the infrastructure to increase capacity, only to find it is under-used, could in itself lead to financial problems.

Capacity Management is also involved in the reduction of capacity – sometimes known as 'managing shrinkage'. In any organisation, the capacity of certain components is being reduced whilst the capacity of others is being increased. An example of this might be where a mainframe-based environment is gradually being replaced by a distributed service: the capacity requirements on the mainframe will be falling while the capacity requirements on the servers will be increasing rapidly.

Capacity Management Structure

Capacity Management consists of three inter-related sub-processes, each working at a different level in the organisational structure. The three sub-processes are Business Capacity Management (BCM), Service Capacity Management (SCM) and Resource Capacity Management (RCM).

Business Capacity Management (BCM) focuses on the future services required by the business and tries to predict future capacity needs. This sub-process is responsible for the production of the Capacity Plan, which forecasts the future resource requirements to support the IT Services that underpin the business activities.
To work effectively, BCM requires an insight into the business as a whole, and should be able to gather medium-term plans and predictions about growth or shrinkage.

Service Capacity Management (SCM) is concerned with the services currently in place to support the business. It tries to ensure that SLAs aren't breached because of capacity problems, and tries to improve the utilisation of scarce resources through the use of Demand Management.

Finally, Resource Capacity Management (RCM) concentrates on the underpinning technology resources that enable the business services, and ensures that these resources, or Configuration Items, are not over-used. This sub-process is also responsible for monitoring the future development and capacity of technical components, and for reporting these findings back to the business so that they can be integrated into future plans.

The Capacity Management process has a number of ongoing, iterative activities: monitoring, analysis, tuning and implementation. These are carried out in Resource Capacity Management and Service Capacity Management. They are not normally used in Business Capacity Management, except during business reporting – for example, to show, through analysis of the data gathered by these activities, that transaction responses are slowing down.

The monitoring activity should include the monitoring of thresholds, and of baselines or profiles of the normal operating levels. Thresholds and baselines are set from the analysis of previously recorded data; they are the 'yardstick' by which Capacity Management can measure the utilisation of IT infrastructure configuration items. All thresholds should be set below the level at which a resource is over-utilised, or below the targets in an SLA. For example, a threshold might specify that the usage of any individual CPU must not exceed 80% for a sustained period of one hour. If these thresholds are exceeded, alarms should be raised and exception reports produced.

In addition to exception reports, monitoring will also produce trend reports on a daily, weekly or monthly basis. Trend reports are intended to help predict future threshold breaches.

Monitoring leads on to the analysis activity, where the monitoring data is analysed to try to identify problems and what type of problems they are. Analysis then leads on to reporting, and then to tuning, where the problems are addressed and the technical parameters of the system are fine-tuned to improve efficiency. Once a tuning decision has been made, it is implemented through the Change Management process. Finally, the activity returns to monitoring, and the iteration begins again.

Note that tuning is an optional activity: if no problems are identified in analysis, then tuning will be unnecessary. Tuning is an expensive activity, as it involves a high level of skill.
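The CPU-threshold example above could be checked, in a hypothetical monitoring script, along these lines (the five-minute sample interval and the 80%-for-one-hour rule are assumptions taken from the example, not ITIL requirements):

```python
def sustained_breach(samples, threshold=0.80, window=12):
    """Return True if utilisation exceeds `threshold` for `window`
    consecutive samples (e.g. 12 five-minute samples = one sustained hour)."""
    run = 0
    for utilisation in samples:
        run = run + 1 if utilisation > threshold else 0
        if run >= window:
            return True
    return False

print(sustained_breach([0.85] * 12))                       # True: a full hour over 80%
print(sustained_breach([0.85] * 6 + [0.50] + [0.85] * 5))  # False: the dip resets the count
```

In practice, a breach detected this way would raise an alarm and feed an exception report, as the text describes.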


Tuning can improve service delivery without incurring costs associated with equipment purchase. However, using skilled resources will incur costs, particularly if they are sourced from outside the business. Tuning at service level can ensure that services don’t clash at times of peak demand. Any excess demand can be controlled by Demand Management, an activity that we will look at later in this lesson, or by sharing capacity, in a multi-server environment, across several servers. Importantly, tuning should be carried out initially in a test environment. Only when we are confident that the change will be a benefit to the business, should it be implemented through the conventional change management processes. Activities in Capacity Management – What does the Capacity Manager do? In the next few pages we will look at all of the capacity management activities in more detail, and how they relate to each of the Capacity Management sub-processes of Business Capacity Management, Service Capacity Management and Resource Capacity Management. Remember Business Capacity Management is concerned with future business requirements for IT services, its planning and timely implementation. Service Capacity Management is responsible for ensuring the performance of all services detailed in SLRs and SLA targets are monitored, measured, recorded, analysed and reported. Resource Capacity Management monitors and measures the individual components in the IT infrastructure. The Capacity Management activities can be sub divided in to three groups based on their frequency, and these are: Ongoing, the day-to-day activities, Ad hoc, carried out as a result of a particular need, and Regular, which are carried out at fixed intervals. Amongst the ongoing iterative activities, are those of Monitoring, Analysis, Tune and Implement, which we looked at earlier in the lesson. Remember this group of activities are mainly carried out at the Service and Resource sub-process level. 
Also note that these activities are used in Business Capacity Management's reporting activity.

Another ongoing Capacity Management activity is Demand Management. The main objective of Demand Management is to influence the demand for computing resource, and the use of that resource. This activity can be carried out as a short-term measure, because there is insufficient current capacity to support the work being run, or as a deliberate policy of IT management to limit the required IT capacity in the long term. Short-term Demand Management might be needed if there is a partial failure of a critical resource in the IT infrastructure; service provision might have to be modified until a replacement or fix is found. Long-term Demand Management might be used when an expensive upgrade to the IT infrastructure can't be cost-justified. The aim in this case is to influence patterns of use, through mechanisms such as physical and financial constraints. Physical constraints might involve restricting the number of concurrent users of a specific resource, a network router for example. Financial constraints might involve differential charging; an example might be charging customers a premium to use network bandwidth during peak hours of demand. Demand Management must be carried out sensitively, without causing damage to the business, the customers, or the reputation of the IT organisation. It is essential that customers are kept informed of all the actions being taken.

Another ongoing Capacity Management activity is providing data to the Capacity Management Database, or CDB. As you can see in the diagram, all of the other ongoing and ad hoc Capacity Management activities provide information to the CDB. The CDB provides valuable information on who has used which resource, and when. This data can be extremely useful for other ITIL processes, particularly IT Services Financial Management. The CDB is the cornerstone of a successful Capacity Management process. Data in the CDB is stored and used by all the sub-processes of Capacity Management, because it is the repository that holds a number of different types of data, including business, service, technical, financial and utilisation data.
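As a minimal sketch of the idea, a single CDB utilisation record might combine the data categories listed above, and support the "who has used which resource, and when" query that makes the CDB so useful to IT Services Financial Management. All field names and figures here are illustrative assumptions, not part of the ITIL guidance:

```python
from dataclasses import dataclass
from datetime import datetime

# One hypothetical utilisation record, combining the CDB data categories.
@dataclass
class UtilisationRecord:
    resource: str        # technical data: the component used
    service: str         # service data: the service the usage supports
    customer: str        # business data: who consumed the resource
    cpu_seconds: float   # utilisation data
    cost: float          # financial data, of interest to Financial Management
    timestamp: datetime

cdb = [
    UtilisationRecord("server-01", "Payroll", "HR", 420.0, 1.05,
                      datetime(2000, 3, 1, 9, 30)),
    UtilisationRecord("server-01", "Invoicing", "Finance", 900.0, 2.25,
                      datetime(2000, 3, 1, 14, 0)),
]

def usage_by_customer(records, customer):
    """Answer 'who has used which resource, and when?' for one customer."""
    return [(r.resource, r.timestamp) for r in records if r.customer == customer]

print(usage_by_customer(cdb, "HR"))
```

In practice this data would be spread across several physical stores, as the lesson goes on to explain.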


However, the CDB is unlikely to be a single database, and probably exists in several physical locations. We will look at the make-up of the CDB later in this lesson.

Ad hoc activities

Modelling is an example of an ad hoc activity, and is used in all Capacity Management sub-processes. Modelling tries to predict the behaviour of components and services under a given volume of work, particularly at peak times, and tries to understand the way in which current services and resources are used, and the impact of that usage on the IT infrastructure. It attempts to predict the future from our knowledge of the past. In order to do this we establish a 'baseline' model, which accurately reflects the performance currently being achieved. Once a baseline is created, predictive modelling can be done: we can ask 'what if?' questions about planned changes to the IT infrastructure. If the baseline model is accurate, then the results of the predicted changes should be accurate too. The major modelling types used by Capacity Management are:

• Trend Analysis
• Analytical Modelling
• Discrete Simulation
• Benchmarking

These modelling techniques vary in complexity, and consequently in cost: Trend Analysis is the simplest and cheapest, while Benchmarking is the most complex and expensive. Let's look briefly at each of these modelling types. Trend Analysis looks at various data over a period of time, attempts to draw a smooth curve through the figures, and extrapolates the graph forward as a way of predicting future trends. Analytical Modelling uses mathematics to represent computer system behaviour. Typically a model is built using a software package which can recreate a virtual version of a computer system. When the software is executed, queuing theory is used to calculate response times, and if the virtual response times are sufficiently close to those recorded in the 'real life' IT infrastructure, the model can be regarded as accurate.
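Trend Analysis, as described above, can be sketched as a simple least-squares line extrapolated forward. The utilisation figures and the 75% threshold are invented for illustration:

```python
# Monthly CPU utilisation figures (%) - illustrative data only.
history = [52.0, 54.5, 57.0, 59.0, 61.5, 64.0]

def linear_trend(values):
    """Fit a least-squares straight line through the data points."""
    n = len(values)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(values) / n
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, values)) \
            / sum((x - x_mean) ** 2 for x in xs)
    intercept = y_mean - slope * x_mean
    return slope, intercept

slope, intercept = linear_trend(history)

# Extrapolate forward: when does utilisation cross an assumed 75% threshold?
month = 0
while intercept + slope * month < 75.0:
    month += 1
print(f"Trend: +{slope:.1f}% per month; 75% threshold reached at month {month}")
```

A real implementation would of course smooth and validate the data first; the point is that Trend Analysis is cheap precisely because the arithmetic is this simple.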

Although Analytical Modelling requires less time and effort than other modelling types, the end results are typically less accurate. Simulation Modelling involves the modelling of discrete events, in other words what actually happens millisecond by millisecond as a transaction passes from the local PC, through the local area network, to the server and so on. This type of modelling can be very accurate in predicting the effect of changes, but it is time-consuming, and therefore costly, as it can involve significant numbers of staff in producing physical event simulations. However, Simulation Modelling can be cost-justified in organisations with very large systems, where the cost and associated business implications are critical. Finally, Benchmarking involves physically building a replica of part of the IT infrastructure, measuring such things as its response to a reduced workload, and extrapolating the results to see how it would perform under the 'real' workload. Because Benchmarking involves purchasing equipment, building software and simulating significant workloads, it is the most expensive modelling option; however, it does give the most accurate predictive figures.

Another ad hoc Capacity Management activity is Application Sizing. The primary objective of Application Sizing is to estimate the resource requirements needed to support a modified or new application, and to ensure that it meets its required service levels. Application Sizing has a finite lifespan: it is initiated at the beginning of a new application, or when there is likely to be a major change to an existing one, and is complete when the finished application is accepted into the operational environment. This activity is performed together with colleagues in system and service development, to ensure that we are fully aware of the likely impact of services being developed, designed or purchased, before they are implemented.
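The kind of estimate an Application Sizing exercise produces can be sketched with back-of-the-envelope arithmetic. Every figure below – transaction rate, CPU cost per transaction, target utilisation – is an assumption invented for this sketch, not a value from the ITIL guidance:

```python
# Illustrative application-sizing estimate; all figures are assumptions.
peak_transactions_per_sec = 40
cpu_ms_per_transaction = 12.0      # measured in the test environment
target_utilisation = 0.70          # keep each server below 70% busy
server_cpu_capacity_ms = 1000.0    # one CPU-second available per second

# Total CPU demand at peak, in milliseconds of CPU per elapsed second.
cpu_demand_ms = peak_transactions_per_sec * cpu_ms_per_transaction

# Ceiling division: how many servers keep us under the utilisation target?
servers_needed = -(-cpu_demand_ms // (server_cpu_capacity_ms * target_utilisation))

print(f"Peak demand {cpu_demand_ms:.0f} ms/s -> {int(servers_needed)} server(s)")
```

Headroom (the 70% target here) is what keeps the sizing honest: running components at full utilisation leaves nothing for growth or failure.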
This provides Capacity Management with important data on future resource requirements, which can be integrated into the Capacity Plan, as well as providing valuable information for purchasing and the development team. How to make programming, database design and architecture design more resource-efficient is also covered in the 'Best Practice' guidance.

Finally, a 'regular' Capacity Management activity is the production of a Capacity Plan, which is typically created annually. Information


gained from the activities of monitoring, demand management, modelling and application sizing will contribute to the production of a Capacity Plan. We will be looking at the Capacity Plan in more detail later in this lesson.

Inputs and Outputs of the Capacity Management Process

To fully appreciate the scope of Capacity Management, we will spend the next few minutes looking at the major inputs and outputs of the process, and at how these relate to the sub-processes of Business, Service and Resource Capacity Management. Inputs to the BCM sub-process include external suppliers of new technology, existing service levels and current SLAs, along with proposed future services and related SLRs. Other important inputs to BCM include the business plans and any strategic plans, together with IS and ICT plans. Finally, BCM requires the Capacity Plan as an input, if one exists. The important inputs to the Service Capacity Management sub-process are the service levels and SLAs; current information from monitoring tools related to systems, networks and services; the service review results, including any issues raised; and incidents, problems and SLA breaches related to capacity. RCM's key inputs include incidents or problems related to a particular component, and monitoring information related to component utilisation – it is considered important to keep utilisation below certain industry-standard levels for each component type. Financial plans and budgets are a major input to all three sub-processes.

Outputs from the sub-processes include the Capacity Database, and the baselines and thresholds information which we looked at earlier in this lesson. Capacity reports will be produced by all three sub-processes, including trend, ad hoc and exception reports. Other outputs include recommendations for SLAs and SLRs, as Capacity Management activity will turn initial SLRs into achievable and cost-effective service level quality clauses. Charging and costing recommendations are also produced.
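The link between RCM's utilisation monitoring, the baseline thresholds, and the exception reports listed as outputs can be sketched as follows. The component names and threshold values are assumptions for illustration, not industry-standard figures:

```python
# Hypothetical baseline thresholds per component type (% utilisation).
thresholds = {"cpu": 75.0, "memory": 80.0, "network": 60.0}

# Latest monitored utilisation samples: (component, type, utilisation %).
monitored = [
    ("server-01", "cpu", 82.0),
    ("server-01", "memory", 64.0),
    ("router-03", "network", 71.0),
]

def exception_report(samples, limits):
    """Flag any component whose utilisation breaches its baseline threshold."""
    return [(name, kind, value)
            for name, kind, value in samples
            if value > limits[kind]]

for name, kind, value in exception_report(monitored, thresholds):
    print(f"EXCEPTION: {name} {kind} at {value}% (threshold {limits_str})"
          if False else
          f"EXCEPTION: {name} {kind} at {value}% (threshold {thresholds[kind]}%)")
```

An exception report of this kind is what prompts the tuning and Demand Management activities discussed earlier in the lesson.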
SCM and RCM will be suggesting 'proactive changes' and 'service improvements' to improve levels of capacity, or reduce costs – preferably both! Carrying out 'effectiveness reviews' and creating 'audit reports' forms a basis for checking that business benefits are being achieved, and that the process users are following the 'rules'.

Contents of the Capacity Management Database and the Capacity Plan

Although the Capacity Management Database is represented in the ITIL guidance as a single entity, it is unlikely to exist in this form in many organisations. The main reason for this is that much of the data held in a CDB is common to that in a fully integrated Configuration Management Database; there is therefore an argument for making the CDB part of a 'super' integrated CMDB. Software tools used by Capacity Management may have partial CDB functionality designed into them. If this information is accessible by other software, then a 'virtual' CDB can easily be created. Remember, the data contributors to the CDB are the key to its success. Input from the business includes the business strategy and the business plan. Service Management will provide information about SLAs and a full definition of the quality processes in place. Data about manufacturers' specifications for existing and new technology will be provided by the technical teams. And finally, the IT Financial Management team will provide fiscal data. Additional financial information will be provided from the CMDB, in its role as a 'super' asset register.

The Capacity Plan

The Capacity Plan is a major output of the Capacity Management process. It has a standard structure and includes:

• Assumptions – about levels of growth
• A Management Summary
• Business Scenarios
• A Summary of Existing Services, problems or issues with current services, and current levels of utilisation
• A Resource Summary – which will show what has happened to particular components over the last year and since the last Capacity Plan


• Suggestions for cost-effective service improvements
• A Cost Model, illustrating some costed recommendations
• Recommendations for the business – Capacity Management usually provides a number of alternatives for the business, and the plan should be produced in a timescale which allows the recommendations to be considered as part of the budget planning lifecycle

One final note: remember that the Capacity Plan should be updated regularly, in line with any revised business plan or unexpected changes in the IT infrastructure, as new business is won or lost.

Critical Success Factors in Capacity Management

Managing the capacity of large distributed networks is becoming increasingly complex, and the financial commitment from business to IT continues to increase. A corporate Capacity Management process ensures that the entire organisation's capacity requirements are catered for. However, making the process work successfully depends on several critical factors. These include:

• Accurate business forecasts
• An understanding of current and future technologies
• A cost-effective Capacity Management process
• Working closely with other effective Service Management processes, for example Problem and Change Management
• Effective financial management
• Links to Service Level Management, to ensure that any business commitments are realistic
• And finally, the ability to plan and implement the appropriate IT capacity to match business needs – this provides a longer-term proactive view

There is a further list of potential benefits and problems associated with the Capacity Management process on page 51 of the ITSMF's little ITIL book.

Benefits & Problems

The benefits of, and potential difficulties with, Capacity Management are listed on page 57 of the little ITIL book and in Section 6.4 of the Service Delivery manual.

Summary

In this lesson we have been looking at the ITIL process of Capacity Management. We defined the goal of Capacity Management in ITIL terms, and looked in detail at the three Capacity Management sub-processes of Business, Service and Resource Capacity Management. We went on to examine the iterative Capacity Management activities of Monitoring, Analysis, Tuning and Implementation, and the ad hoc and regular activities of Demand Management, Modelling and Application Sizing. We highlighted the major inputs and outputs of the Capacity Management process, and defined the contents of the Capacity Database and the Capacity Plan. We concluded the lesson by defining the critical factors for successful Capacity Management implementation.


Lesson 5a Service Level Management

Objectives

In this lesson we will be examining Service Level Management, which is covered in Chapter 4 of the Service Delivery book in the IT Infrastructure Library. When you have completed this lesson you will be able to:

• Define Service Level Management according to ITIL best practice
• Identify the core Service Level Management sub-processes and activities
• Understand the relationships between SLAs, OLAs and UPCs, and recognise the main sections of a Service Level Agreement
• List the benefits gained from the Service Level Management process

What is Service Level Management?

Service Level Management is considered by many to be the heart of ITIL-driven service management. ITIL defines its goal as: "To maintain and gradually improve business aligned IT service quality, through a constant cycle of agreeing, monitoring, reporting and reviewing IT service achievements and through instigating actions to eradicate unacceptable levels of service." Service Level Management exists to ensure that service targets, such as availability of services, response times and so on, are agreed and documented in a way that the business understands. It is also there to ensure that service achievements are monitored and reviewed on a regular basis. Service Level Agreements, which are managed through the Service Level Management process, provide specific targets against which the performance of the IT provider can be judged. The Service Level Management process is responsible for ensuring that Service Level Agreements, and the underlying Operational Level Agreements or underpinning contracts, are met.

Why do we need SLM?

Customers have become more aware of their dependency on IT for successful business operation. Hence they feel an increased need to formalise the contractual basis on which IT services are provided, and this is where Service Level Management can help. Often, Service Level Management is a driver for a CSIP or SIP – a Continuous Service Improvement Programme. Such programmes are aimed at achieving cost-effective improvements to the services offered by the IT service provider, in a rapidly changing technical environment, without necessarily being driven by customer demand. An example of this might be taking advantage of dramatically reduced networking costs to provide better response times than the customer originally specified, or alternatively providing the same response times but at a much lower cost. It's the responsibility of Service Level Management to be aware of service improvement opportunities before the customers themselves begin to ask about them.

Alternative Approaches to Service Provision

There are a number of ways in which IT services can be provided, each having its merits and drawbacks. In the simplest scenario there is just the external provider of the IT service and the customer organisation. Services will be provided on the basis of a contract between these two parties. Whilst this has the benefit of simplicity, it is a risky strategy, and one that generally leads to poor support for the users and poor value for money for the corporate customer. The next approach is often said to involve an "intelligent customer" role: somebody who negotiates service delivery with suppliers on behalf of the customer. That customer has a Service Level Agreement with the Service Level Management process, and the service is underpinned by an 'underpinning contract' with the suppliers. In this situation, the internal IT department adds little or no value.
Such arrangements are common where an ‘off-the-shelf’ package solution is being provided by the supplier.


Probably the most common arrangement is where the customer has a Service Level Agreement with the Service Level Management team. In order for that service to be provided, it is necessary for the Service Level Management team to establish 'Operational Level Agreements' with their own internal IT departments, who in turn may have an 'underpinning contract' with the external suppliers of the various components. Note that for any one service there may be several Operational Level Agreements and several underpinning contracts. Finally, although it is much less common, the whole process can be purely internal, with no external contracts required: the customer has a Service Level Agreement with Service Level Management, who have an OLA with the internal IT department – and that's it. This last arrangement is fairly unusual, because most systems will depend on some external supply. It is, on the other hand, quite common for a total service to be provided on the basis of a combination of two or more of these strategies.

The SLA Structure

One of the early decisions that has to be made is the structure of the SLAs, which is a major determinant of how many SLAs will end up being produced. For example, if we had 1000 customers and 50 services we could theoretically produce 50,000 Service Level Agreements. This would clearly be impractical. Fortunately, most businesses don't have 1000 customers who are entirely independent of each other, so there is usually commonality of service requirements amongst groups of customers. As an example, let's suggest 10 major groups of customers, each of which has a common set of service requirements. By producing SLAs at the customer group level, the number required could be reduced to 500 – more manageable, but still excessive. There are a number of ways in which this problem can be overcome, perhaps the most common being the mapping of services onto customer groups.
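The arithmetic behind these SLA counts is worth making explicit; this small sketch simply restates the figures from the example above:

```python
# The worked example from the text: 1000 customers, 50 services,
# 10 customer groups with common requirements.
customers = 1000
services = 50
customer_groups = 10

# One SLA per customer/service pair - theoretically possible, clearly impractical.
per_pair = customers * services          # 50,000 agreements

# One SLA per customer-group/service pair - more manageable, still excessive.
per_group = customer_groups * services   # 500 agreements

print(f"Per customer: {per_pair} SLAs; per customer group: {per_group} SLAs")
```

The service-based and customer-based structures discussed next reduce these counts much further, to one SLA per service or one per customer group.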

Service-Based Approach

So a particular service, say Service A, will be provided in a generalised format to customer groups 2 and 4. In a similar way, Service D will be provided to customer groups 1 and 2. This allows us to have just one SLA per service – so 50 in our previous example. The drawback of this approach is that it tends to make each SLA more complicated, since it may have to cater for the fact that not all groups covered by a service have exactly the same requirements. If there are geographical differences between the groups as well, this will add further complexity. Despite this problem, this is the most common approach that you're likely to encounter.

Customer-Based Approach

An alternative approach is to turn the previous model on its head and map customer groups onto services. An SLA is created for each customer group, describing all of the services that the group will receive. Here, for example, customer group 3 receives three services; they would have just one SLA, admittedly quite a complex one, detailing how they would receive Services A, B and C. There are a couple of advantages to this approach. One is that the number of SLAs can be dramatically reduced – in our previous example, with 10 customer groups and 50 services, we would end up with only 10 SLAs. Also, it becomes relatively straightforward to introduce variances on standard services between the different customer groups. The disadvantage is that the SLAs can be long and complex, and can contain a great deal of duplication from one to the other.

Multi-Level SLAs

A third approach to structuring SLAs is to have a multi-level or hierarchical structure. ITIL suggests three levels, namely Corporate, Customer and Service. Corporate is the highest level and contains any common features that are true of all services across all customer groups. This might cover things like service desk hours, escalation


procedures, contact points, roles and responsibilities, and so on. The next level down is the Customer level. Each of the SLAs produced at this level is a description of the services for a particular group of customers, so in our previous example there would be 10 SLAs at this level. An SLA at this level contains everything that is common for that particular group of customers, but different from the generic services that appear at the higher Corporate level. Finally, the Service level sits at the bottom of the structure. Here we have a document representing each service used by a customer group, and relevant to that particular group. It only contains information which differs from the corporate or customer level clauses. Consequently we would have a larger number of SLAs, but each would be relatively short. This in itself makes change management easier: if, for example, we decided to change the standard hours of the service desk from 9am until 7pm to 9am until 9pm, that change would only appear in the corporate level SLA. It's important when using the hierarchical structure that the correct level of authority is assigned to each level. For example, at Corporate level the document would be authorised at the highest management level liaising with IT. Customer level documents might be authorised by department heads – Finance, Planning, HR and so on. Individual Service Level Agreements would be authorised at the next management level down in each of these departments. The general principle is that SLAs are authorised by paying customers on behalf of the users in their part of the organisation.
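The hierarchical resolution described above – a clause is taken from the Service level if present, otherwise from the Customer level, otherwise from the Corporate level – can be sketched like this. The group, service and clause names are invented for illustration:

```python
# Three-level SLA structure. Lower levels only record what differs
# from the levels above them, so each document stays short.
corporate = {"service_desk_hours": "09:00-19:00",
             "escalation_contact": "duty manager"}
customer = {"Finance": {"availability_target": "99.5%"}}
service = {("Finance", "Invoicing"): {"response_time": "2s"}}

def sla_clause(group, svc, clause):
    """Resolve a clause: most specific level first, corporate default last."""
    for level in (service.get((group, svc), {}),
                  customer.get(group, {}),
                  corporate):
        if clause in level:
            return level[clause]
    raise KeyError(clause)

print(sla_clause("Finance", "Invoicing", "service_desk_hours"))
print(sla_clause("Finance", "Invoicing", "response_time"))
```

Note how a change to the standard service desk hours would be made once, in the corporate dictionary, and would apply everywhere that no lower level overrides it – exactly the change management benefit the text describes.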

So what exactly is an SLA? Well, in structure SLAs are rather like contracts, but they are not in themselves legal documents. However, they can be included in a legal contract, particularly when establishing SLAs directly with external suppliers; in such cases an SLA would be included in the contract as a schedule. An SLA used internally between departments has no legal weight – it's simply a document that has a contractual structure to it.

The purpose of an SLA is to document an agreement, and as such it shouldn't be an imposition on either the business or IT. Importantly, it must always be written in unambiguous business language, and shouldn't contain any technical references which make its intention unclear and leave the business feeling uncomfortable about authorising the agreement.

So we have established what constitutes an SLA. What exactly, then, is an OLA, or Operational Level Agreement? In simple terms, OLAs are agreements that define the internal IT arrangements that support SLAs; they are also known as back-to-back agreements. The most common use of an OLA is to define the relationship between the Service Desk and internal support groups. OLAs are required to ensure that the SLA targets agreed between customer and IT provider can be delivered in practice. They describe each of the separate components of the overall service delivered to the customer, often with one OLA for each support group and a contract for each supplier. A further contract exists to ensure that SLAs are supported, and this is the 'underpinning contract'. Underpinning contracts are put in place with external suppliers or vendors. It's important that all targets contained within both SLAs and OLAs that rely on these external suppliers are 'underpinned' by the appropriate level of maintenance and support contracts. For example, an internal software development team might have an OLA in place between themselves and Service Level Management. This OLA offers, amongst other things, a guaranteed response time to serious problems of no more than 2 hours. In order to guarantee these service levels, the software development team might have an underpinning contract in place with their development software vendor, ensuring that problems can be resolved well within this 2-hour time frame. A word of warning here: it's critical that any commitments made in an OLA are directly supported by the underpinning contract.
For example, committing to a 4 hour fix time in an OLA would be useless if our underpinning contract only commits our supplier to a 6 hour fix time!
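This consistency check – every OLA commitment must be no tighter than the supplier's underpinning contract allows – is simple enough to express directly, using the 4-hour and 6-hour figures from the example above:

```python
# OLA vs underpinning contract (UC) fix-time check, in hours.
# The 4-hour OLA / 6-hour UC figures come from the example in the text.
ola_fix_time = 4.0
uc_fix_time = 6.0

def ola_is_underpinned(ola_hours, uc_hours):
    """The supplier's contracted fix time must not exceed the OLA's."""
    return uc_hours <= ola_hours

# False: the supplier's 6-hour commitment cannot support a 4-hour OLA promise.
print(ola_is_underpinned(ola_fix_time, uc_fix_time))
```

A Service Level Management team might run a check of this shape across every OLA clause whenever an underpinning contract is renegotiated.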


In the last few pages we have been looking at the agreements and contracts which form an important part of Service Level Management. But how do we establish which services are available for inclusion in these agreements and contracts, and which ones our customers or users would like? There are two other important documents in Service Level Management which can help us with this decision: the Service Catalogue, and Service Level Requirements, or SLRs. A Service Catalogue contains a list of all services used by each customer group. A Service Catalogue could be used internally by the service provider – for example, the Service Desk might use it to help them identify those customers entitled to a higher level of service. It can also be used externally as a marketing tool, providing a shop window showing all the services on offer to the business. Commonly, organisations now make this available on their intranet, as a form of advertising and of generating 'buy-in' to the services. Service Catalogues exist in a number of forms. They are often created as an internal document listing existing services when Service Level Management is initially established; at a later stage, the catalogue might be published to potential customers, and the wider business as a whole, in a more 'glossy' format. In order to establish their exact requirements, the customer develops a Service Level Requirements document. When doing so, the customer should be realistic about potential levels of service, and the related costs. Remember, this is not a wish list, and sensible advice should be offered by the Service Level Management team. There is no specific format for SLRs, and each organisation will document them in its own way. It's important to remember that these documents, along with SLAs, OLAs and UPCs, are all subject to the ITIL Change Management process.
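The Service Desk use of the catalogue described above – checking which customer groups are entitled to a service, and at what level – might look like this in miniature. The service names, groups and support levels are all invented for the sketch:

```python
# A toy Service Catalogue: each service lists the customer groups
# entitled to it, and the level of support they receive.
catalogue = {
    "Email":   {"groups": {"HR", "Finance", "Sales"}, "support": "standard"},
    "Trading": {"groups": {"Sales"},                  "support": "premium"},
}

def entitlement(group, service_name):
    """Service Desk check: is this group entitled, and at what support level?"""
    entry = catalogue.get(service_name)
    if entry and group in entry["groups"]:
        return entry["support"]
    return None

print(entitlement("Sales", "Trading"))
print(entitlement("HR", "Trading"))
```

In a real organisation each catalogue entry would also carry descriptions, owners and SLA references, and – as the text notes – every change to it would go through Change Management.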

In the next few pages we will look in some detail at the Service Level Management sub-processes. These sub-processes can be grouped into four stages:

• Initial Generic
• Initial Per Service
• On-going Per Service
• On-going Generic

So let's look at these four stages individually, and see how they fit together to form a complete Service Level Management process. The first stage is Initial Generic. The first activity at this stage, assuming that a Service Level Management team is in place, is to build the initial Service Catalogue. As we mentioned on the previous page, this activity documents all currently available services, and which customers or users are using them. It also records whether they are formally documented in any SLAs, and whether each is a service which needs to continue. It isn't possible to document every possible SLA clause in the catalogue; it's more important to understand the scope of the catalogue and the services within it, along with any major problems with services and any suggested changes to them. The second, related sub-process is planning the SLA structure and establishing which SLAs we need to create. This activity involves prioritising the modification of pre-existing SLAs, in order to rework them into standard formats. Ask yourself – are there any new services being developed, or purchased from a software provider, that might provide a better starting point?

Assuming we've built the Service Catalogue, agreed the SLA structure and prioritised the work, we can move on to the second stage, Initial Per Service, and its related sub-processes, where we address customer-specific issues. The first point is to establish Service Level Requirements, or SLRs. Find out what users would really like from the service, and what customers are prepared to pay. We should try to establish SLRs by checking the requirements documents that exist for new services in development. It's not uncommon for organisations to arrange training programmes for senior customers, to help them understand what SLRs are, how they should be specified, and what is a realistic request in service level terms.
The second sub-process uses those SLRs to review the underpinning contracts and OLAs already in place with internal and external service providers. This might involve discussions about upgrading current statements on service level and provision. Once we are happy with both our OLAs and UPCs, we can create a draft SLA. The intention is to put actual metrics against the various service


quality clauses, including fix times for problems, transaction response times and so on. These statements should be supported by ITSM colleagues, such as Service Desk, Capacity, Availability and Problem Management, amongst others. When the draft SLA is available, agreement should be sought from customers and users that it represents an adequate specification of the service. This is a process of negotiation, and might involve talking to external and internal suppliers about the cost of improving service quality parameters to the customer. It might require several iterations of the process before agreement can be reached. Usually, the cost of providing certain levels of service becomes apparent to customers fairly quickly, resulting in more realistic negotiations. Once the agreement is formally signed, the SLA must be implemented. This involves informing all parties constrained by the SLA that it is in place – for example service desk staff, third-party suppliers, users and so on.

The third stage in the SLM process includes the on-going per-service activities of monitoring, reporting, and review and modify. Monitoring involves using the technical tools available to those working in Service Management to monitor the users' important SLA clauses, such as response times for enquiries at the Service Desk. SLM isn't responsible for the technical implementation of monitors; however, SLM takes responsibility for ensuring that the necessary monitors are in place. Monitors can provide useful reporting information to IT and the business, and we will be looking at reporting in more detail later in the lesson. Review and modification takes place via service review meetings. These meetings are held at regular intervals – weekly isn't uncommon, but monthly is most likely. The objective of these meetings is to produce short reports on the way the SLA is working, debate any problems or issues, and discuss any changes to the SLA which might be needed.
These reports should be written in simple business language, and should state whether we have met the SLA or not, describe where we failed, and explain how we are going to prevent the failure occurring again. Remember, however, that any suggested changes to SLAs should be authorised by the Change Management process. The fourth and final Service Level Management process stage is defined as 'ongoing generic'. It involves sub-processes which relate to SLAs

and Service Level Management as a whole. These processes include maintaining the Service Catalogue and updating it with new services. Some organisations have automated document links from the Service Catalogue to individual SLAs, so that when an SLA is changed, the change is reflected in the catalogue. Remember, the Service Catalogue falls under the Change Management control process. A further activity is to review the Service Level Management process itself. By establishing Critical Success Factors (CSFs) we can measure performance; we can also set KPIs, or Key Performance Indicators, for what is considered a successful service. The final activity is to consider a Service Improvement Programme, or SIP. Service Level Management should look at all provided services and their associated quality requirements, to see how we can improve service levels without significant increases in cost to the business. This proactive SLM activity involves talking with colleagues in Availability and Capacity Management, and IT Infrastructure Management, to identify ways of improving response times and improving availability to the business. This activity uses SLA contents as a trigger for service improvement.

Reporting on Service Level Achievements
We briefly mentioned the activity of reporting earlier in this lesson. Reporting can be subdivided into either external or internal reporting. Internal reporting involves monitoring service quality in SLAs and related OLAs and UPCs. This detailed monitoring of service quality is normally set up by the Capacity and Availability Management processes. They will be interested in all activity which affects service clauses, including breaks in service, time to repair, response time to users and so on. Monitoring OLAs and UPCs will help us understand why SLA breaches are occurring, and also help us identify future trends and possible future SLA breaches. Remember: you can't control things that you can't monitor.
External reporting should be written in a simple and clear way. An exception report is a typical example of external reporting, and it should simply point out when, where and why SLA breaches or near breaches occurred. It should also explain how we intend to prevent things from getting worse.
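As an illustrative sketch only, not ITIL guidance, the exception-report logic described above can be expressed as a simple breach/near-breach classification. The function names and the 90% "amber" margin are invented for the example:

```python
# Hypothetical sketch: classify each SLA measurement as Red (breach),
# Amber (near breach) or Green, as raw input for an exception report.
# The 90% amber margin is an illustrative assumption, not ITIL guidance.

def rag_status(measured: float, target: float, amber_margin: float = 0.9) -> str:
    """Return 'Red', 'Amber' or 'Green' for a clause where higher
    measured values are better (e.g. percentage availability)."""
    if measured < target * amber_margin:
        return "Red"      # clear SLA breach
    if measured < target:
        return "Amber"    # near breach, worth an exception report
    return "Green"        # clause met

def exception_report(measurements: dict) -> list:
    """List only the clauses that are Red or Amber; the when/where/why
    narrative is left to the service review meeting."""
    return [f"{clause}: {rag_status(m, t)} (measured {m}, target {t})"
            for clause, (m, t) in measurements.items()
            if rag_status(m, t) != "Green"]
```

Clauses that are Green are deliberately omitted, which is exactly the point of exception reporting: the business only sees what needs attention.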


A Service Level Agreement Monitoring chart, or SLAM chart, is a popular mechanism for external reporting, as are RAG, or Red, Amber, Green charts. Both devices offer simple-to-understand graphical representations of service level parameters, and show where breaches or potential breaches have occurred. Another important monitoring tool is the trend graph. Businesses are very interested in consistency of service as well as quality. For example, trend graphs can show that over a three-month rolling period the trend is for greater throughput of activity and for fewer breaks in service. By displaying these trends to customers, we can convince them that we are achieving service level targets, and are likely to continue to do so.

Typical SLA Contents
So what does a typical Service Level Agreement consist of? Broadly speaking, its contents can be broken down into three sections: an introduction, agreed service levels, and general extra statements. The SLA introduction describes the service, its scope, the intended customer group, the commencement date and its duration. It should be written in clear and concise business terms, and it should be authorised at an appropriate level by both parties. 'Agreed Service Levels' will define a number of measurable clauses, for example normal hours of service, and availability and reliability of the service. Clauses related to 'throughput' are also common, detailing the number of transactions the service is expected to support in a defined period. SLAs frequently contain clauses covering transaction response times. This is often broken

down into several response types, including system responses, a request via mouse click on a PC for example, or an incident response, detailing the maximum time allowable in responding to an incident report. There may be as many as 20 different measurable clauses in an SLA against which customers will want us to report. The third section in our SLA deals with additional statements, such as service charges and how they are structured. Mechanisms for change should also be outlined in this section. Remember, however, that changes to SLA clauses should be handled via the Change Management process. Statements on provision of service in case of a disaster are also important. It is the role of IT Service Continuity Management to create cost-effective plans to deal with potential disasters, such as fire and flood. It's common to state in SLAs at what level, and how quickly, service will be available after a disaster. Also included are statements of User and Customer responsibilities. Customer statements might include defining the maximum number of Users at any one time, or a commitment to provide data to the IT supplier in the event of weekend working, for example. This can be a lengthy section of the SLA, and it's important to remember that an SLA is an agreement between the business and IT, with responsibilities on BOTH sides. If a request is received to amend an SLA clause, it is important that the proposed change undergoes a thorough impact analysis. Changes in one SLA can impact on others; for example, changing one SLA to allow more users on a network might have an adverse effect on other customers using the same network. This is where a Service Level Management process benefits the organisation, because each service isn't treated in isolation, and the whole Service Level Management team works together to ensure the quality of ALL services.

Reviews

In order to establish customers' perceptions of the service, Service Level Management should carry out regular service review meetings. Typically these meetings involve customers rather than users, and consequently shouldn't be used as a substitute for user questionnaires and so on. Ahead of these meetings, Service Level Management staff should review customer-related incident records from the service desk,


so that they are able to answer any questions about these incidents. Review meetings can lead to suggestions for change; remember, however, they are not the place where changes are authorised. The Service Level Management process can carry out its own internal review. This review should be carried out by the head of the Service Level Management team, or process owner. A key activity in the review process is to review KPIs. Some typical example KPIs might include customer perception ratings, the number of service reviews held, and how many are held at the right time. ITIL suggests that these reviews are held on an annual basis, although many organisations hold them more frequently.
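To make the KPI idea concrete, here is a minimal, hypothetical sketch of one of the example KPIs above, the proportion of service reviews held at the right time. The figures and the 90% target are invented:

```python
# Illustrative sketch of one SLM process KPI: the percentage of service
# review meetings held on time. Data and target are invented examples.

def reviews_on_time_pct(held_on_time: int, scheduled: int) -> float:
    """Percentage of scheduled reviews that were held at the right time."""
    if scheduled == 0:
        return 0.0
    return 100.0 * held_on_time / scheduled

target_pct = 90.0                                     # invented target
actual = reviews_on_time_pct(held_on_time=11, scheduled=12)
kpi_met = actual >= target_pct                        # feeds the annual review
```

A process review would track this figure over time rather than as a one-off snapshot.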

The role of the Service Level Manager
The SLM process must be 'owned' in order to be effective and to achieve the benefits of implementation. This isn't meant to imply that this should be a single post, unless that's appropriate to your organisation. The Service Level Manager must be at an appropriate level to be able to negotiate with Customers on behalf of the organisation, and to initiate and follow through actions required to improve or maintain agreed service levels. This requires adequate seniority within the organisation and/or clearly visible management support. It's important that the role acts as a conduit between IT specialists and the customer, interpreting technical language from the IT groups into understandable business language, and vice versa. In summary, we could define the characteristics of a Service Level Manager as being:
• A good negotiator, firm but fair
• A good communicator, both written and oral
• Business orientated, customer focused and technically aware
• Good under pressure. This can be a stressful role, as it interfaces between two very strong-minded communities.

Benefits & Problems
The benefits of and potential difficulties with Service Level Management are listed on Page 45 of the little ITIL book and in Section 4.2.1 of the Service Delivery Manual.

Summary
In this lesson we have been looking at Service Level Management. We have seen how ITIL defines the goal of Service Level Management, how it's often driven by a Service Improvement Programme, and why it's regarded as essential to the ITIL structure as a whole. We examined the relationships between the customer, the IT provider and external suppliers, and the different ways in which we can tailor service provision to customer needs. We went on to look at the structure of Service Level Agreements and their relationships with Operational Level Agreements and Underpinning Contracts, and discussed how, by producing a Service Catalogue and a Service Level Requirements document, we can better satisfy customers' requirements. We examined the Service Level Management sub-processes in detail, including planning an SLA structure and the monitor, report, review and modify activities. We listed the key characteristics of the Service Level Manager role, and highlighted some of the potential benefits and possible problems associated with implementing a Service Level Management process.


Lesson 5b Financial Management for IT Services
When you have completed this lesson you will have a broad appreciation of Financial Management in an IT services context.
• You will be able to explain the main reasons why financial management is necessary, and you will be able to recognise the three main elements that define the scope of financial management.
• You will be able to identify six types of cost that are commonly encountered, and then classify these into one of six accounting cost categories.
• Finally, you will be able to describe seven different charging policies that can be applied to IT services.

Introduction
The goal of Financial Management for IT services is to provide cost-effective stewardship of the IT assets and the financial resources used in providing IT services. In the vast majority of situations, the Financial Management of IT Services will have to operate within boundaries, and adhere to policies, that are set by the higher-level financial management authorities within the organisation. For this reason the topic of Financial Management is often regarded as a 'cross-over' area for IT staff, because it requires some knowledge of accounting processes, but it is very difficult to make a success of the process without the full commitment of management from the corporate accounting functions. Having said that, there are some special considerations in the management of IT services' finances, and that is where the ITIL guidance helps.

Why Financial Management?
The major reason for having financial management for the IT services area is to ensure the provision of value-for-money (VFM) IT services; that is, providing maximum business value for the minimum financial outlay.

The mechanisms by which we achieve value-for-money services are as follows.

Facilitating decision making. For example, ITSM decision making will include evaluating suggested changes and formulating business cases. This work might include calculating return on investment, and is done by the IT Services Financial Management team on behalf of the IT service management group. Financial forecasting is also a critical element of the decision-making process, and can help avoid cost over-runs or resource shortages.

Containing costs. This includes those costs incurred internally and externally, through any supply contracts we may have. It is very important that we know about ALL our costs and are able to manage them. Through this mechanism we take into account total lifecycle costs, sometimes called Total Cost of Ownership (TCO), where we look at the cost of both developing something and then supporting it over its lifetime. If, as an IT organisation, we don't understand the costs of support, we won't be able to make correct project decisions about balancing rapid development against a high cost of support. We can also contain costs through demand management. For example, 'differential charging' may, subject to agreement with the business, be used to persuade people to use resources at different times.

Optimising service value. This is all about helping the business balance the quality of the services it receives against the cost of providing that service quality. For example, if users insisted on 99.99% availability of a particular service during early SLA negotiations, IT financial management can help service level management by producing an accurate cost/benefit analysis of providing that level of availability as opposed to a reduced level. It is often important to demonstrate achievement, perhaps through some form of 'benchmarking', where costs are compared with similar organisations.
Such information can be important when the IT function has to defend itself against claims that it is spending more money than it should be for the level of service being provided.
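The cost/benefit arithmetic behind availability targets can be illustrated with a short sketch. This is not an ITIL formula, simply the standard conversion of an availability percentage into allowed downtime, which shows why 99.99% is much more expensive to provide than 99.9%:

```python
# Illustrative sketch: convert an availability percentage into the
# minutes of downtime it permits over a period (default: a 30-day month).

def allowed_downtime_minutes(availability_pct: float,
                             period_hours: float = 30 * 24) -> float:
    """Minutes of permitted downtime in the period for a given
    availability target, e.g. 99.9 for 'three nines'."""
    return period_hours * 60 * (1 - availability_pct / 100)

# 99.9% over a 30-day month allows roughly 43.2 minutes of downtime;
# 99.99% allows only about 4.3 minutes, a tenfold tightening, which is
# why the last 'nine' is usually the expensive one to negotiate.
```

Putting a cost against each of those downtime budgets is exactly the cost/benefit comparison the lesson describes.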


Finally, the recovery of costs from the users of the resources is also an important element in the value-for-money equation. Any decision to recover IT costs, either totally or partially, is a high-level business decision usually taken at Board level, and is not considered mandatory within ITIL.

The Scope of Financial Management for IT Services
IT financial management for services is normally considered as having three main areas, each of which has a number of sub-processes. Those areas are Budgeting, Accounting and Charging. Budgeting is concerned with:
• Predicting the money needed to deliver the IT service
• Seeking to secure that money from the business, and
• Monitoring and controlling IT spend against that budget over the given period.
Accounting, on the other hand, is the set of processes that allows the IT service provider to demonstrate where, within IT, the money from that budget has gone.

Together, the budgeting and accounting processes identify all the costs incurred by IT service management, and enable us to understand where that money is going in terms of business support. If we decide to use charging for IT services, then what we are attempting to do is to recover money from the customers of the services. These charges must be demonstrated to be equitable between IT and the business. As well as being equitable, charges must also bear some relationship to the costs. How close that relationship is, is usually a matter of debate in organisations. The closer we want the charges to relate to the cost to the IT organisation of service provision, the more complex the charging process will be, and the greater the overhead in gathering the necessary data. Once they've been agreed between the customers and the service level management team, charges must be documented in the SLA for each service that is charged for.
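As a hypothetical illustration of the budgeting activity of monitoring and controlling IT spend against the agreed budget, the sketch below compares actual spend with a budget and reports the variances. The cost-type names and figures are invented:

```python
# Illustrative sketch: budget-versus-actual variance reporting for IT
# spend. All names and figures are invented for the example.

budget = {"Hardware": 120_000, "Software": 80_000, "People": 300_000}
actual = {"Hardware": 130_000, "Software": 70_000, "People": 290_000}

def variances(budget: dict, actual: dict) -> dict:
    """Positive variance = overspend against budget for that cost type."""
    return {cost_type: actual.get(cost_type, 0) - planned
            for cost_type, planned in budget.items()}

# Cost types currently running over budget, the ones management
# attention (and corrective action) should focus on first.
overspends = {ct: v for ct, v in variances(budget, actual).items() if v > 0}
```

In practice this comparison would be run monthly against a phased budget rather than once against the annual total.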

There is a hierarchical dependency between Budgeting, Accounting and Charging, which will often develop as an organisation's financial policies become more mature and increase in both scope and complexity. The starting point might be to introduce budgeting on a one-year-ahead basis, for example. This would tell us how much IT is costing, but shed no light on how that figure is arrived at. Also, at this stage, we have done nothing to recoup the costs from the business. Knowing in detail where the money, and the resources it buys, are being used only becomes possible once we introduce the accounting processes. The ability to recoup this money only becomes possible once we move into charging processes. The ITIL guidance is that we should implement, at the very least, budgeting and accounting. Charging, as we have said, is optional as far as ITIL is concerned, and we should almost certainly have budgeting and accounting in place before we attempt it. It is theoretically possible to charge without understanding who is using what resource, but it is unlikely to be acceptable to the business. Accounting without budgetary control makes little sense, and in general charging without accounting is not a good option. Once accounting is in place, we have a vehicle for performing cost/benefit analysis and return-on-investment calculations. IT financial management will expect such calculations to be done whenever there are proposals for significant changes, or for the creation of new services. The exact models for this will be dependent on the standards for accounting within the organisation.

Types of Cost
It is very useful when creating a budget to understand all of the resources that we have, by breaking them down into various cost types. The suggested high-level cost types that ITIL recommends are: Hardware, such as computers, networking equipment, data storage devices and so on. Software, which would include operating system software and applications.


People, in other words salaries, taxes, expenses, benefits and other costs of employment. Accommodation, for example offices, machine rooms, utilities, storage space and so on. External Service, covering items which might be outsourced, such as development work, ISPs, disaster recovery facilities and the like. And finally Transfer, which is used to account for the cross-charges that can take place between different parts of the business. For example, if it was necessary for an Excel expert in Finance to give two days' training to someone in Human Resources, because IT lacked the resource to do this, then IT would expect a cross-charge for that person's time to come from the Finance Department. A useful aide-mémoire for these cost types is the acronym HAS PET, as you can see.

Cost Classification
Once the cost elements have been identified and their types understood, they will need to be classified for accounting and financial purposes. As a minimum, ITIL recommends that costs be classified as either Capital or Operational costs. Capital expenditure is assumed to increase the total value of the company, while Operational expenditure does not. So capital costs relate to outright purchases of fixed assets, and may apply to accommodation, computers and workstations, for example. Operational costs, on the other hand, can be thought of as day-to-day running costs. Once money is spent on these it is no longer available to the company as an asset. Operational costs include salaries, rental of equipment or buildings, and licences for software. It is sometimes the case that organisations make capital purchases but want to represent in their accounts the fact that this capital loses value over time. So if £10,000, say, is spent on an item of equipment which is expected to last for three years, the assets of the company will be immediately increased by £10,000.
But in each of the next three years £3,333 will be taken out of the operational expenditure and the assets will be decreased by that amount, until at the end of three years the asset value stands at zero and the full £10,000 has been recovered from operational costs. This is the process that accountants call depreciation. Conversely, some companies try to roll up some operational costs and classify those as capital, so that they too can be written off over a number of years. A good example of this is software development. A company may decide that if it has spent £100,000 on salaries to develop a software application then, once it is completed, that application becomes an asset of the company with a value of £100,000, and then depreciate that asset over the years of its life. Capitalisation and depreciation policies are very much a concern for the central accounting functions and are in many respects governed by laws to prevent fraud and tax evasion. ITIL suggests that we take advice from the main accounting section on the use of depreciation.

Direct and Indirect Costs
Costs can further be classified into direct and indirect costs. Direct costs refer to a cost that is directly attributable to a customer or a group of customers. For example, if we are asked to buy a package and a server for the use of Human Resources only, then we could regard the package and server costs as a direct cost that can be 'charged back' to the HR function. Indirect costs cannot be allocated simply to one customer or group. They are costs that are shared amongst groups. There are commonly two types of indirect cost: absorbed costs, where the costs can be apportioned across a number of different groups based on their respective usage of the resource concerned, and unabsorbed costs, where it is too difficult to determine who is using how much of the resource and so the cost is allocated as a simple percentage uplift to all costs, in other words an overhead.
An example of this might be the cost of the service desk where, rather than attempt to work out which group was behind every call and how much time that took, we take the cost of the service desk in total and distribute it across all of the customer groups, based on their usage of other resources.
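The two worked examples above, straight-line depreciation of a £10,000 asset over three years and the apportionment of a shared service desk cost by usage, can be sketched as follows. The function names and usage figures are invented for illustration:

```python
# Illustrative sketches of the accounting ideas discussed in the text.
# Figures and group names are invented; real policies come from the
# central accounting function.

def straight_line_depreciation(cost: float, years: int) -> list:
    """Equal annual depreciation charges; any rounding difference
    (e.g. the £10,000 / 3 years example) lands in the final year."""
    annual = round(cost / years, 2)
    return [annual] * (years - 1) + [round(cost - annual * (years - 1), 2)]

def apportion(total_cost: float, usage: dict) -> dict:
    """Split a shared (absorbed) cost across groups in proportion to
    each group's usage of the resource."""
    total_usage = sum(usage.values())
    return {group: total_cost * share / total_usage
            for group, share in usage.items()}
```

So the £10,000 asset yields three annual charges summing back to £10,000, and a shared service desk cost can be divided by, say, call volume per customer group.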


Finally, there are fixed and variable costs. Fixed costs remain constant regardless of usage, whereas variable costs increase in proportion to the usage made of a resource. An example of a fixed cost might be a leased communication line, the price of which does not change regardless of how much or how little it is used. On the other hand, an ISDN line might be an example of a variable cost, because it may be charged for on the basis of the amount of traffic that uses it. The concept of fixed and variable can also be applied to charging, but there are potential pitfalls here. If a service that is charged for on a fixed-price basis is built on cost elements that are variable, then a dramatic increase in workload may mean the cost of providing the service ends up being greater than the money being recouped. The converse is also true: if charges are variable but costs are fixed, difficulties can arise if the volumes end up being less than predicted. Section 5.3 of the Service Delivery manual, or Page 50 of the little ITIL book, contains a useful illustration of how the different cost types and categories that we have discussed can combine to build a cost model for arriving at the total cost figure for a given customer. It is worth spending some time studying this cost model.

Charging Policies
Once budgeting and accounting procedures have been well established, possible charging policies can be considered. The decision on whether to implement charging, and if so on what basis, is not normally a decision for IT financial management; those kinds of high-level business decisions are almost always made at very senior management levels within the business. There are a number of general charging policies which are usually considered. It is quite valid for the organisation to decide that it is not going to charge for IT services. One of the reasons for deciding on this policy might be that there are costs involved in charging.
There will need to be a mechanism for setting the charges, for sending out bills and

invoices, and for resolving disputes. All of that requires the gathering and processing of data, and a mix of financial and IT skills, in order to be effective. What is important is that costs are understood and that there is budgetary control. People are then aware of how much their business is spending on IT services, but they are not charged. The problem with a no-charging policy is that it does not provide a means of managing customer expectations or manipulating demand. If it is decided to charge for services, then 'cost recovery', attempting to get back from the other business units just the cost of providing IT to them, is known as the 'zero-balance' policy. Alternatively, a 'cost-plus' policy is where IT expects to recoup more than it spends, perhaps as a mechanism for dealing with potential variation in demand over a number of years, or possibly as a basis for funding investment in new infrastructure components which will benefit the business as a whole. It is also possible to subsidise the service and to go for a 'cost-minus' policy. Here, we are not attempting to recoup all of the costs from the individual business units, but do want to achieve some element of cost consciousness. The degree of subsidy from the business as a whole will be a high-level management decision. A 'going rate' approach, in ITIL terms, allows the charges to be based on what other internal departments charge for their services, or what other IT departments in similar organisations charge their internal customers. 'Market rate' charging uses an external cost comparison, where we see what external providers would charge the business for the sort of services we're offering, and use that figure as our charge. This is often a useful policy when outsourcing is being considered. Some organisations allow their IT departments to sell their services externally to the company, in other words they become a profit centre in their own right.
This will tend to militate in favour of market-rate pricing, and the business will need to decide how the extra money generated will be used. Finally, we might decide on a negotiated 'fixed price' policy, where the actual price we charge


is a result of an agreement between ourselves and the customer group. Clearly it is very important to get those prices right, otherwise an over-recovery might discourage users from using our services. Conversely, an under-recovery would mean that IT would have to be rescued at the end of the year by the business as a whole. Whatever charging policy is decided upon, when it comes to actual pricing for services, ITIL best practice advises that charges should be fair, understandable by the business, and subject to control by the business.

Benefits and Problems of ITFM
The benefits of and potential difficulties with Financial Management for IT Services are listed on Page 51 of the little ITIL book and in Sections 5.1.7 and 5.1.9 of the Service Delivery Manual.
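As a hedged sketch, some of the charging policies above can be expressed as a simple price function. The 10% margin and the policy spellings below are invented for the example; real margins and subsidy levels are senior management decisions:

```python
# Illustrative sketch of four of the charging policies discussed:
# zero-balance recovers exactly the cost, cost-plus adds a margin,
# cost-minus subsidises, and market rate ignores cost altogether.
from typing import Optional

def price(cost: float, policy: str, margin: float = 0.10,
          market_rate: Optional[float] = None) -> float:
    if policy == "zero-balance":
        return cost                      # recover cost, no more, no less
    if policy == "cost-plus":
        return cost * (1 + margin)       # fund future investment
    if policy == "cost-minus":
        return cost * (1 - margin)       # subsidised, but cost-conscious
    if policy == "market-rate":
        if market_rate is None:
            raise ValueError("market-rate pricing needs an external comparison figure")
        return market_rate               # what an external provider would charge
    raise ValueError(f"unknown charging policy: {policy}")
```

Going-rate and negotiated fixed-price policies are omitted because their prices come from comparison or negotiation rather than a formula.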

Summary
In this lesson we have learned what is meant by Financial Management in an IT services context, and why it is a necessary process within ITIL. We have explored in some detail the three main elements that define the scope of financial management, namely Budgeting, Accounting and Charging. We have considered six types of cost that are commonly encountered, and have seen how these can be classified into one of six accounting cost categories. Finally, we have evaluated seven different charging policies that can be applied to IT services.


Lesson 6a Continuity Management
Objectives
The subject of this lesson is IT Service Continuity Management, which is described in Chapter 7 of the ITIL Service Delivery book. Once you have completed this lesson you will:
• Understand the terms Business Continuity Management and IT Service Continuity Management, and appreciate the relationship between these two processes
• Be able to identify what a Vital Business Function is, in ITIL terms, and be aware of IT Service Continuity Management's links to other ITIL processes
• Have an understanding of the Business Continuity Lifecycle, and have seen ITIL's risk analysis techniques and recovery options

Let's start this lesson by posing a question: what do you think would happen to your business immediately after a 'disaster', for example if your offices burnt down, or a local river flooded them? Your answer might be: 'Well, the business is insured, so the insurance company will sort everything out.' But what happens at the start of business tomorrow morning? Firstly, and pretty obviously, day-to-day business operations are going to stop: no office means no staff accommodation, and no staff means no ongoing business activity. As a result you can't service customer accounts, take or despatch orders, collect payment and so on. It's likely that you will lose existing and new customers, sales and revenue. Ultimately the business could fail. This all seems pretty unlikely, but if we consider some other scenarios, such as a computer virus infecting the servers via email, or a disgruntled ex-employee deleting critical data, these potential threats seem more likely, particularly when you consider that statistics suggest that 80% of businesses that suffer a 'disaster' go out of business within six months of it happening. So how does a business prepare for such eventualities? Well, one very good way is to implement Business Continuity Management, or BCM.

The BCM activity incorporates two elements: a business-focused element (Business Continuity Planning) and a technology element (IT Service Continuity Management Planning). Which of these processes is a subset of the other depends on the nature of the individual business, and the extent to which the business depends on IT. In the ITIL guidance, it is assumed that IT Service Continuity Management is a subset of Business Continuity Management, so we'll follow that example in this lesson. Business Continuity Management is concerned with evaluating business processes, and considering the impact, if any, if these processes can't be performed. Amongst other things, BCM will need to look at cost-effective ways of:
• Reducing the likelihood of a threat occurring
• Minimising the impact on the business if the threat does occur, and
• Having a 'disaster recovery' mechanism in place to deal with any threat that does materialise and prevents 'business as usual' activities.

IT Service Continuity Management, or ITSCM, focuses on the IT services that support the business, and it's this process which the ITIL guidance concentrates on. Remember, however, that there is no point in making huge efforts to maintain IT services under disaster conditions if the business has no Business Continuity Management process in place. So if staff don't know where they should go after a disaster, or the alternative office location hasn't any chairs or desks, then there's little point in having an ITSCM process in place. Put simply, it's important that IT service management staff point out the critical need for the 'business' to have a Business Continuity Plan.

Vital Business Functions - VBFs
Business Continuity Management, and so by association ITSCM, are primarily concerned with Vital Business Functions, or VBFs. VBFs are the critical parts or components of a service, and as such must be 'reinstated' as quickly as possible. For example, your bank has a network of ATMs, which dispense cash and offer a selection of other services, including printing or


displaying a balance. The bank might consider that the only VBF performed by the ATM is the dispensing of cash, and not the other services. The role of ITSCM is to identify the IT VBFs and services, and agree with the business how quickly those VBFs and services need to be recovered. Sometimes a service which is reinstated quickly might have components missing, or the throughput performance of the network might be reduced. It's important that agreement is sought from the business that a 'reduced service' is better than no service at all. Not all aspects of the IT services will require contingency plans in the event of a disaster. The business may be prepared to live without certain aspects of the IT infrastructure in the short term. So the focus of ITSCM is directed at the Vital Business Functions, and a relevant amount of the available budget is assigned to each; the amount of money assigned to a VBF is proportional to its business importance. ITSCM has to have strong linkages with other ITIL disciplines, in particular Availability Management and Service Level Management. For example, statements in SLAs should define what service levels are likely to be available under 'disaster' as well as 'normal' operations. Other linkages include:
• Availability Management, delivering risk reduction measures to maintain 'business as usual'
• Configuration Management, defining the core infrastructure
• Capacity Management, ensuring that business requirements are fully supported through appropriate IT hardware resources
• Change Management, ensuring the currency and accuracy of the Continuity Plans through established processes and regular reviews
• And finally, the use of statistics provided by the Service Desk and the Incident Management process.

The ITSCM Processes
It's not possible to develop an effective ITSCM plan in isolation; it's important that it supports the requirements of the business. In the next few pages we will be looking at the Business

Continuity lifecycle and its four stages, which are; • Initiation • Requirements and Strategy • Implementation • Operational Management The initiation stage The activities to be considered during the initiation process will depend on the level of contingency facilities already in place with the organisation. Some parts of the business may have already established individual continuity plans based on manual workarounds, and IT may have developed contingency plans for systems perceived to be critical. This can provide a worthwhile starting point for the process, however, effective ITSCM is dependent on supporting vital business functions, and ensuring that the available budget is applied in the most appropriate way. The initiation process covers the whole of the organisation and consists of the following activities: • Policy Setting • Specifying terms of reference and scope • The allocation of resources • Defining the project organisation and control

structure • And finally, agreeing the project and quality

plans The Requirements and Strategy Stage, is, as the title suggests, split into two sections. Requirements, performs Business Impact Analysis and Risk Assessment, and Strategy determines and agrees risk reduction measures and recovery options to support the requirements. ‘How much the organisation stands to lose, as a result of a disaster or other service disruption, is a key driver in determining ITSCM requirements. Risk analysis techniques such as CRAMM the CCTA’s Risk Analysis and Management Methodology, and Business Impact Analysis, or BIA, are performed in the Requirements and Strategy Stage. From this, the business can establish the level of criticality of its services. We will discuss Risk Analysis in more detail later in the lesson. The implementation stage includes the detailed planning required to create the disaster recovery plan. This includes putting in place risk reduction and risk mitigation measures.


An example of this might be a smoking ban, and the introduction of an automated sprinkler system. Implementing counter measures can be very costly, so a business case might be required to justify the level of investment. Also during the implementation phase, contracts will be signed with third party standby facility providers, if they are required.

The final stage, Operational Management, is responsible for educating all users and IT about the service continuity processes, and specifically about what will happen in the event of a disaster. Also remember that people will need to be trained in their 'disaster recovery plan' roles. For example, somebody will have to liaise with the press in a public relations role, and training might be needed for this.

Risk Assessment and Counter Measures
There are a number of other approaches to assessing risk; perhaps the simplest looks at the probability of something occurring, and the impact if it did. This approach can be represented in a matrix format as shown here, the highest risk status being one with both a high probability and a high impact. Conversely, a low impact and a low probability would mean a 'low risk' category. A business response to this low level of risk might be to 'just deal with it if it happens'.

We mentioned the CCTA's Risk Analysis and Management Method, or CRAMM, earlier. This involves the identification of risks, any associated threats, vulnerabilities and impacts, together with the subsequent implementation of cost-justifiable counter measures. CRAMM is a very useful method for looking at threats that might affect the availability of service, as it focuses on asset values. Assets could be hardware, software, people, buildings, telecommunications and so on. It then examines the various threats that could exist, and how vulnerable the assets are to these threats. The results can provide a 'risk rating' which is very useful to the business. For example, we are generally aware there is a threat of flood. We might then find that our mainframe computer systems are vulnerable to this threat, because they are housed in a site that is below the water level of a tidal river. The asset would be significant in terms of the computer equipment and the services based on it, and therefore this would give us a 'high risk' rating. The business then has to take measures to deal with that risk.

In order to do this type of risk analysis, it is useful to have a Service Catalogue available. You'll remember that the service catalogue featured in the lesson on Service Level Management, and it contains a list of services available to customers or users. We can use information from this document to help assess the risk levels on different IT services. Risk Analysis can also be applied at component level, by looking at Configuration Items and judging what risks they are subject to. This analysis could identify a component failure risk in a particular service. We could mitigate the risk by sharing it with another service that is made up of the same components. Any component within the IT infrastructure that has no backup capability, and can cause impact to the business and users when it fails, is known as a SPOF, or Single Point Of Failure. A particular concern of ITSCM is 'hidden SPOFs'. An example of a hidden SPOF might be the point where multiple data cables enter an office via an underground duct. A significant failure would occur if the cable were severed during building works.

Contingent Risk Countermeasures
ITIL suggests a number of possible options when dealing with a 'disaster recovery' situation. The first option is 'do nothing'. Surprisingly this can be a valid response, if the business has decided that the complete loss of some service in a disaster is acceptable. For example, the business might have insurance in place to cover any potential 'loss of business'.

'Manual back up' can be an effective interim measure until the IT service is resumed. Any procedures should be well documented and understood. This is possibly the most unlikely option suggested by the ITIL guidance. Would it be possible, for example, to go back to manual ordering for a short period of time, rather than a computerised system?

The third option is a 'Reciprocal Arrangement', where organisations agree to back each other up in an emergency. This is rarely used now except for off-site storage, as it assumes that both organisations have enough spare capacity to fully support the other.
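The probability/impact matrix described above can be sketched as a small lookup. This is an illustrative sketch only: the lesson names the 'high risk' and 'low risk' corners, while the 'medium risk' label for the mixed cells is our own assumption.

```python
# Probability/impact risk matrix, as described in the lesson.
# The 'medium risk' label for the mixed cells is an illustrative
# assumption; the text only names the high/high and low/low corners.

def risk_rating(probability: str, impact: str) -> str:
    """Combine a probability and an impact level into a risk category."""
    levels = ("low", "high")
    if probability not in levels or impact not in levels:
        raise ValueError("probability and impact must be 'low' or 'high'")
    if probability == "high" and impact == "high":
        return "high risk"   # e.g. a riverside machine room below water level
    if probability == "low" and impact == "low":
        return "low risk"    # response might be 'just deal with it if it happens'
    return "medium risk"     # assumed label for the mixed cells

print(risk_rating("high", "high"))  # -> high risk
```

A fuller CRAMM-style analysis would weight this by asset value, but the matrix alone is often enough to decide which risks to address first.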


There are examples of Reciprocal Arrangements working effectively on an international basis, where there are significant time differences. This relies on a network switch to allow communication to an alternative processing environment 'out of hours'.

The next three options are Gradual, Intermediate or Immediate recovery, any of which can be provided either internally, by the business itself, or externally by a contracted third party. In all cases the alternative environment that the business moves to can be either fixed or portable. If it is fixed, then the business goes to a particular location to make use of the services. If it is portable, then the services may be brought to the business premises. An example of a portable solution might consist of a 'mobile computer room' which is placed adjacent to the business's existing building. The main difference between these options is the timescale of recovery.

Gradual recovery is also known as 'Cold Standby'. This option involves providing only the essential services such as power, air conditioning, network wiring and so on. The facility doesn't contain any computer equipment. This option is used when a business can function for a period of 72 hours or more without IT services.

Intermediate recovery is also known as 'Warm Standby' and is used where recovery is needed between 24 and 72 hours. A 'Warm Standby' facility will have the required computer equipment in place, but it wouldn't be configured or loaded with current operational software.

Immediate recovery is also known as 'Hot Standby'. This would normally involve the use of an alternative site with continuous mirroring of the live environment and data. Recovery could be almost instantaneous, but the general definition of immediate recovery is to allow up to 24 hours for full recovery.

There are potential risks from having a 'hot standby' site in very close proximity to the business's main site. Although it reduces logistical and network issues, the whole site, including the 'hot standby', could be at risk from a disaster. So combinations of these options are sometimes used, and might include the use of a 'hot standby' third party site for two or three days, while the internal intermediate site is configured. This would reduce third party costs, but would involve moving site twice.
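The recovery timescales above amount to a simple selection rule. As a minimal sketch (the function name and the exact boundary handling are our own assumptions; the lesson gives the bands as 72 hours or more, 24 to 72 hours, and up to 24 hours):

```python
# Map the business's maximum tolerable outage, in hours, to one of the
# ITIL recovery options described in the lesson. Boundary handling
# (>= versus >) is an assumption made for illustration.

def recovery_option(max_outage_hours: float) -> str:
    if max_outage_hours >= 72:
        return "Gradual recovery (cold standby)"
    if max_outage_hours >= 24:
        return "Intermediate recovery (warm standby)"
    return "Immediate recovery (hot standby)"

print(recovery_option(96))  # -> Gradual recovery (cold standby)
print(recovery_option(8))   # -> Immediate recovery (hot standby)
```

In practice the choice is also constrained by cost and by whether a fixed or portable facility is available, not by timescale alone.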

Testing IT Service Continuity Plans
We have seen so far in this lesson how ITSCM activities prepare the business against any internal or external threats, and document recovery procedures in an IT Service Continuity Plan. The question now is, how can we be sure that the plan will work successfully when we actually use it? Well, the obvious answer is to test it! The ITIL guidance suggests that plans should be tested:
• After the plan has been written
• After any major changes, either in the plan or to the IT infrastructure itself
• At least annually
• And after we've actually had to use the plan and have restored 'normal' service. This sounds a bit odd, but this is a very good time to perform a test, to make sure that any lessons learned from the disaster, and from the way the plan worked in practice, have been put into place.

The plans can be tested all together, as a 'big bang' approach, or on a service-by-service basis. Test types include 'dry runs', where we walk through the stages of the plan, and each staff member 'plays' their designated role. Next, we can plan a test on an agreed future date, which might involve visiting the remote site and trying to run critical services. The most difficult and expensive test type is the full, unannounced test. This can be the most effective way of finding flaws in the plan, but it's the most disruptive to the day-to-day business activities.

Key ITSCM Decisions
There are a number of critical decisions which must be made by the ITSCM process. An important one is deciding how many copies of the Continuity Plan we should have, and where they are going to be kept. For example, it would be very risky to have just one copy of the plan stored at the site it provides contingency for! Many organisations keep plans at the alternative site, or at a local bank. The IT Service Continuity Manager may well keep copies at home. Remember that all copies should be kept in 'sync' to reflect any changes to the infrastructure or the plan. Another key decision is how and when to invoke the contingency plan. How long are we going to wait before we act after a major failure?


Invoking a disaster recovery plan is an expensive and complex process, so the temptation is to wait and hope that it's an availability issue rather than a Continuity Management issue. Deciding how long to wait before invoking the plan is difficult. Ultimately it will be driven by the business, and by the criticality of the services that are being disrupted.

Who does what during the disaster recovery period should also be documented. Questions like which team members go to the alternative site, and who books hotel accommodation, should be addressed. Importantly, everyone should understand their role. The process of leaving the recovery site and returning to normal working at the original site should also be documented. Make sure that all the necessary work has been done at the home site before returning, and that clearly defined processes are in place for the move. Errors are easily made at this point, so particular attention should be paid to removing all commercial or confidential data from the backup site before departure.

A comprehensive list of all third party infrastructure suppliers should also be drawn up, including those for operational and recovery systems. It's also important to tell them to visit the backup site if they are called out. Similarly, the details of third party contractors, particularly those who are providing recovery services, should also be to hand. ITIL helps here by providing a pro forma disaster recovery plan that can be used as a basis for creating our own version. This pro forma contains an annex for all the contact details.

And finally, ask the question: does our contingency supplier have in place their own contingency plan? There have been several recent examples where third party recovery service providers have been literally 'deluged' by demand. For example, serious flooding has resulted in them receiving multiple requests for help, leaving them unable to fulfil their contractual obligations. Some Service Recovery Organisations have in place a switching facility, where they can transfer demand to other sites in other countries. Ultimately this makes their service more robust.

Benefits and Possible Problems with ITSCM
The key benefits of the ITSCM process include:
• The management of risk and, as a consequence, a reduced impact from failures in the IT infrastructure
• Potentially lower insurance premiums as a result of implementing good counter-measures
• And the fulfilment of mandatory and regulatory requirements
You can see a comprehensive list of benefits and potential problems associated with ITSCM on pages 62 and 63 of the itSMF's IT Service Management guide, or the 'little ITIL' book for short.

Lesson Summary
In this lesson we have been examining the Business Continuity Management and IT Service Continuity Management processes. We looked at the relationship between the two processes, and have seen how ITSCM defines key IT activities as Vital Business Functions. We saw how ITSCM links to other ITIL processes, and went on to look in some detail at the Business Continuity lifecycle. We have seen some of the Risk Analysis techniques used in the Requirements and Strategy stage of the lifecycle, and listed all of the ITIL recovery options. And finally, we looked at how to test IT Service Continuity Plans, and at some of the key decisions required by the IT Service Continuity Management process.


Lesson 7a Passing The ITIL Foundation Examination

Objectives
So far in this course, we have concentrated on your knowledge of ITIL – what it is, what it contains and how it works in practice. Well, do you remember at school that there were always some kids who may not have been that bright, and may not have worked that hard – but they always got through the exams OK? Every class had at least one of them! The reason that they did get through was not down to luck – they were just good at taking exams. They had the right mental approach, and they worked out how to stack the odds in their favour. In this session we will be looking at how you too can increase your chances of getting a good result in the Foundation exam – not just by knowing ITIL well, but by approaching the examination in an objective and systematic manner.

Introduction & Background
The ITIL examinations are administered, on behalf of the OGC, by the ISEB, who are based in Swindon in England, and EXIN, who are a Dutch organisation. There are three internationally recognised ITIL certificates and associated qualifications: Foundation, Practitioner and Manager. This course only addresses the first of these, which is, as you would expect, the entry-level qualification. It is a prerequisite for going on to take the more advanced certificates. The objective of the Foundation exam is to confirm a very broad-brush knowledge across the whole of ITIL, and it therefore does not demand a very detailed knowledge within any specific area. In simple terms this is a test that you are broadly familiar with the contents of the Service Delivery and Service Support manuals.

You have already seen questions which are typical of those asked in the Foundation exam as we have worked through this course. The exam itself consists of 40 such multiple choice questions, and one hour is allowed. In order to achieve a pass, at least 26 questions must be answered correctly – in other words, 65% or more of the questions asked. The examination is “closed book” – in other words, you can take no notes or documentation of any kind into the exam room with you. The first tip for doing well at Foundation level is therefore to do your homework. Study this course material, read the manuals and the “little ITIL” books, and practise in the exam simulator. Once you are regularly scoring in excess of 30 out of 40 in the exam simulator, you can be reasonably sure that you will pass the real Foundation exam.

The Foundation Exam
Assuming that you have done all your preparation and that you have all the required knowledge, the next step is to focus on the examination itself. As we have seen, you are allowed 60 minutes to answer 40 questions. The vast majority of people finish the exam well within this time – so, Tip 1: don’t feel under time pressure. Remain cool and stick to your game plan. You have plenty of time. The Foundation questions can be categorised into three types:
• Those that you find really easy and can answer without too much thought. Although do be careful with the exact semantics of some of the questions, and make sure that you have properly read the question.
• Those that you probably know the answer to, but the wording of the question needs some digesting. There are a lot of “negative” type questions, so do be careful over these.
• Those where, even though you understand the question, you are not 100% sure of the answer.
A good strategy is therefore to do the exam paper in three passes. This is something that you have not had the luxury of doing in the exam simulator.


When you are first presented with the paper, work your way through, answering all the questions where the right answer is immediately obvious to you. Avoid any temptation to deliberate too long over any question; if in doubt, move on to the next one. This first pass will ensure that – in the unlikely event that you do run out of time – at least you will have answered all the easy questions. For anybody who has done the right level of background study and preparation, this alone will probably be enough to secure a pass.

Now go back to the beginning of the paper and start work on the second category of questions. Once you have worked out what a question means, if you know the answer then answer the question; otherwise move on to the next unanswered question. This time when you get to the end of the paper you should have answered all the questions that you understand and whose answers you are confident of – hopefully by now you will have answered the majority of the 40 questions.

Now it’s time for the third and final pass. Go back to the start of the paper again and consider each of the questions that you have not yet answered. At this stage you may need to be careful over the timing – what you don’t want to do is run out of time and leave any questions unanswered, even if you have to guess the answers. Marks don’t get subtracted for wrong answers, so if you have 4 or 5 questions that you just don’t know the answer to, make guesses – by random chance you will probably get at least one of them right. So, count up how many questions still remain unanswered and allocate a maximum time for each one, so that you will just get them all answered. If you have 5 unanswered questions and 5 minutes left, don’t spend more than one minute on any one question. Again – never submit a paper unless all the questions have been answered.

One final tip – be very careful about changing any of your answers. Experience has shown that about two thirds of the changes that candidates make to their answers are in fact changing a correct answer to an incorrect one. Often your first instincts are the right ones.
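The exam arithmetic above comes down to two small calculations. A minimal sketch (the function names are ours; the figures are from the lesson):

```python
# Foundation exam arithmetic from the lesson: 40 questions, pass mark
# 26 (i.e. 65%), and a per-question time budget for the final pass.

TOTAL_QUESTIONS = 40
PASS_MARK = 26

def passed(correct_answers: int) -> bool:
    """At least 26 correct answers secures a pass."""
    return correct_answers >= PASS_MARK

def minutes_per_question(unanswered: int, minutes_left: float) -> float:
    """Maximum time to spend on each remaining question."""
    return minutes_left / unanswered

print(PASS_MARK / TOTAL_QUESTIONS)   # 0.65 -> the 65% pass mark
print(minutes_per_question(5, 5.0))  # 1.0 minute each, as in the example
```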

Now one last exercise. ITIL is nothing if not full of acronyms – and many of the questions in the Foundation exam assume that you are familiar with most of them. So it is worthwhile running through the list of acronyms given in the little ITIL books and the manuals themselves and memorising the less obvious ones. In the meantime try this little test. Use your mouse to drag and drop the right words into position to correctly interpret these acronyms.


Acronyms

A

ACD Automatic Call Distribution

AMDB Availability Management Database

ASP Application Service Provider

AST Agreed Service Time

ATM Automatic Teller Machine

B

BCM Business Continuity Management

BCP Business Continuity Plan(ning)

BIA Business Impact Analysis

BITA Business IT Alignment

BQF British Quality Foundation

BRM Business Relationship Management

BSC Balanced Scorecard

BSi British Standard Institution

C

C&CM Configuration and Change Management

CAB Change Advisory Board

CAB/EC Change Advisory Board/Emergency Committee

CASE Computer-Aided Systems Engineering

CCTA Central Computer and Telecommunications Agency

CDB Capacity Database

CFIA Component Failure Impact Analysis

CI Configuration Item

CIA Confidentiality, Integrity and Availability

CMDB Configuration Management Database

CMM Capability Maturity Model

COP Code of Practice



CRAMM CCTA Risk Analysis & Management Method

CRM Customer Relationship Management

CSBC Computer Services Business Code

CSF Critical Success Factor

CSS Customer Satisfaction Survey

CTI Computer Telephony Integration

D

DBMS DataBase Management System

DHS Definitive Hardware Store

DHL Definitive Hardware Library

DISC Delivering Information Systems to Customers

DR Disaster Recovery

DRP Disaster Recovery Plan(ning)

DSL Definitive Software Library

DT Down Time

E

EDI Electronic Data Interchange

EFQM European Foundation for Quality Management

EUA End User Availability

EUDT End User Down Time

EUPT End User Processing Time

EXIN Examen Instituut (Dutch Examination Board)

F

FSC Forward Schedule of Change

FTA Fault Tree Analysis

G

GUI Graphical User Interface


H

HD Help Desk

I

ICAM Integrated Computer-Aided Manufacturing

ICT Information and Communication Technology(ies)

ID Identification

IDEF ICAM Definition

IP Internet Protocol

IR Incident Report

IS Information System(s) / Information Service(s)

ISEB Information Systems Examination Board

ISO International Standards Organisation

ISP Internet Service Provider

IT Information Technology

ITAMM IT Availability Metrics Model

ITIL Information Technology Infrastructure Library

ITSC IT Service Continuity

ITSCM IT Service Continuity Management

ITSM IT Service Management

itSMF IT Service Management Forum

IVR Interactive Voice Response

J

JD Job Description

K

KE Known Error

KEL Known Error Log

KER Known Error Record


KPI Key Performance Indicator

KSF Key Success Factor

L

LAN Local Area Network

M

MBNQA Malcolm Baldrige National Quality Award

MIM Major Incident Management

MTBF Mean Time Between Failures

MTBSI Mean Time Between System Incidents

MTTR Mean Time To Repair

O

OGC Office of Government Commerce

OLA Operational Level Agreement

OLTP On-line Transaction Processing

P

PAD Package Assembly/Disassembly device

PC Personal Computer

PER Project Evaluation Review

PIR Post-Implementation Review

PM Problem Management

PKI Public Key Infrastructure

PR Problem Record

PRINCE2 Projects IN Controlled Environments

PSA Projected Service Availability

Q

QA Quality Assurance

QMS Quality Management System


R

RAG Red-Amber-Green

RAID Redundant Array of Inexpensive Disks

RCM Resource Capacity Management

RFC Request For Change

RFS Request For Service (Service Request)

ROCE Return On Capital Employed

ROI Return On Investment

RWO Real World Object

S

SAC/D Service Acceptance Certificate/Document

SCI Software Configuration Item

SCM Software Configuration Management

SIP Service Improvement Programme

SLA Service Level Agreement

SLAM SLA Monitoring

SLM Service Level Management

SLO Service Level Objective

SLR Service Level Requirement

SMO Service Maintenance Objectives

SMT Service Management Team

SOA System Outage Analysis

SPICE Software Process Improvement Capability dEtermination

SPOF Single Point of Failure

SQP Service Quality Plan

SSADM Structured Systems Analysis and Design Method


T

TCO Total Cost of Ownership

TOP Technical Observation Post

TOR Terms of Reference

TP Transaction Processing

TQM Total Quality Management

U

UPS Uninterruptible Power Supply

V

VBF Vital Business Function

VOIP Voice Over Internet Protocol

VSI Virtual Storage Interrupt

W

WAN Wide Area Network

WFD Work Flow Diagram

WIP Work in Progress


Glossary of Terms

A

Absorbed overhead

Overhead which, by means of absorption rates, is included in the costs of specific products or saleable services, in a given period of time. Under- or over-absorbed overhead is the difference between overhead cost incurred and overhead cost absorbed; it may be split into its two constituent parts for control purposes.

Absorption costing A principle whereby fixed as well as variable costs are allotted to cost units and total overheads are absorbed according to activity level. The term may be applied where production costs only, or costs of all functions are so allotted.

Action lists Defined actions, allocated to recovery teams and individuals, within a phase of a plan. These are supported by reference data.

Alert phase The first phase of a business continuity plan in which initial emergency procedures and damage assessments are activated.

Allocated cost A cost that can be directly identified with a business unit.

Apportioned cost A cost that is shared by a number of business units (an indirect cost). This cost must be shared out between these units on an equitable basis.
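Apportioning an indirect cost can be sketched as a small calculation. This is illustrative only: the glossary says 'an equitable basis' without prescribing one, so the proportional headcount basis and the function name here are our own assumptions.

```python
# Share an indirect (apportioned) cost between business units in
# proportion to a chosen basis, here headcount. The basis is an
# illustrative assumption; any equitable basis would serve.

def apportion(total_cost: float, basis: dict[str, float]) -> dict[str, float]:
    """Return each unit's share of total_cost, proportional to its basis value."""
    total_basis = sum(basis.values())
    return {unit: total_cost * share / total_basis for unit, share in basis.items()}

print(apportion(9000.0, {"Sales": 30, "Finance": 15, "IT": 45}))
# -> {'Sales': 3000.0, 'Finance': 1500.0, 'IT': 4500.0}
```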

Asset Component of a business process. Assets can include people, accommodation, computer systems, networks, paper records, fax machines, etc.

Asynchronous/synchronous

In a communications sense, the ability to transmit each character as a self-contained unit of information, without additional timing information. This method of transmitting data is sometimes called start/stop. Synchronous working involves the use of timing information to allow transmission of data, which is normally done in blocks. Synchronous transmission is usually more efficient than the asynchronous method.

Availability Ability of a component or service to perform its required function at a stated instant or over a stated period of time. It is usually expressed as the availability ratio, i.e. the proportion of time that the service is actually available for use by the Customers within the agreed service hours.
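The availability ratio defined here can be computed directly; a minimal sketch with illustrative figures (the variable names are ours):

```python
# Availability ratio from the glossary definition: the proportion of
# agreed service time that the service was actually available.

def availability_ratio(agreed_service_hours: float, downtime_hours: float) -> float:
    """Percentage of agreed service hours the service was available."""
    uptime = agreed_service_hours - downtime_hours
    return uptime / agreed_service_hours * 100

print(availability_ratio(100.0, 2.0))  # 2 hours down in 100 agreed hours
```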

B

Balanced Scorecard

An aid to organisational performance management. It helps to focus, not only on the financial targets but also on the internal processes, Customers and learning and growth issues.

Baseline A snapshot or a position which is recorded. Although the position may be updated later, the baseline remains unchanged and available as a reference of the original state and as a comparison against the current position (PRINCE 2).

Bridge Equipment and techniques used to match circuits to each other ensuring minimum transmission impairment.



BS7799 The British standard for Information Security Management. This standard provides a comprehensive set of controls comprising best practices in information security.

Build The final stage in producing a usable configuration. The process involves taking one or more input Configuration Items and processing them (building them) to create one or more output Configuration Items, e.g. software compile and load.

Business function A business unit within an organisation, e.g. a department, division, branch.

Business process A group of business activities undertaken by an organisation in pursuit of a common goal. Typical business processes include receiving orders, marketing services, selling products, delivering services, distributing products, invoicing for services, accounting for money received. A business process usually depends upon several business functions for support, e.g. IT, personnel, accommodation. A business process rarely operates in isolation, i.e. other business processes will depend on it and it will depend on other processes.

Business recovery objective

The desired time within which business processes should be recovered, and the minimum staff, assets and services required within this time.

Business recovery plan framework

A template business recovery plan (or set of plans) produced to allow the structure and proposed contents to be agreed before the detailed business recovery plan is produced.

Business recovery plans

Documents describing the roles, responsibilities and actions necessary to resume business processes following a business disruption.

Business recovery team

A defined group of personnel with a defined role and subordinate range of actions to facilitate recovery of a business function or process.

Business unit A segment of the business entity by which both revenues are received and expenditure is caused or controlled, such revenues and expenditure being used to evaluate segmental performance.

C

Capital Costs Typically those costs applying to the physical (substantial) assets of the organisation. Traditionally this was the accommodation and machinery necessary to produce the enterprise's product. Capital Costs are the purchase or major enhancement of fixed assets, for example computer equipment (building and plant) and are often also referred to as 'one-off' costs.

Capital investment appraisal

The process of evaluating proposed investment in specific fixed assets and the benefits to be obtained from their acquisition. The techniques used in the evaluation can be summarised as non-discounting methods (i.e. simple pay-back), return on capital employed and discounted cash flow methods (i.e. yield, net present value and discounted pay-back).

Capitalisation The process of identifying major expenditure as Capital, whether there is a substantial asset or not, to reduce the impact on the current financial year of such expenditure. The most common item for this to be applied to is software, whether developed in-house or purchased.


Category Classification of a group of Configuration Items, Change documents or problems.

Change The addition, modification or removal of approved, supported or baselined hardware, network, software, application, environment, system, desktop build or associated documentation.

Change Advisory Board

A group of people who can give expert advice to Change Management on the implementation of Changes. This board is likely to be made up of representatives from all areas within IT and representatives from business units.

Change authority A group that is given the authority to approve Change, e.g. by the project board. Sometimes referred to as the Configuration Board.

Change control The procedure to ensure that all Changes are controlled, including the submission, analysis, decision making, approval, implementation and post-implementation of the Change.

Change document Request for Change, Change control form, Change order, Change record.

Change history Auditable information that records, for example, what was done, when it was done, by whom and why.

Change log A log of Requests for Change raised during the project, showing information on each Change, its evaluation, what decisions have been made and its current status, e.g. Raised, Reviewed, Approved, Implemented, Closed.

Change Management

Process of controlling Changes to the infrastructure or any aspect of services, enabling approved Changes to be implemented with minimum disruption.

Change record A record containing details of which CIs are affected by an authorised Change (planned or implemented) and how.

Charging The process of establishing charges in respect of business units, and raising the relevant invoices for recovery from customers.

Classification Process of formally grouping Configuration Items by type e.g. software, hardware, documentation, environment, application. Process of formally identifying Changes by type e.g. project scope change request, validation change request, infrastructure change request. Process of formally identifying incidents, problems and known errors by origin, symptoms and cause.

Closure When the Customer is satisfied that an Incident has been resolved.

Cold stand-by See 'Gradual Recovery'.

Command, control and communications

The processes by which an organisation retains overall co-ordination of its recovery effort during invocation of business recovery plans.

Computer Aided Systems Engineering

A software tool for programmers. It provides help in the planning, analysis, design and documentation of computer software.

Configuration Baseline (see also Baseline)

Configuration of a product or system established at a specific point in time, which captures both the structure and details of the product or system, and enables that product or system to be rebuilt at a later date.

Configuration control

Activities comprising the control of Changes to Configuration Items after formally establishing their configuration documents. It includes the evaluation, co-ordination, approval or rejection of Changes. The implementation of Changes includes changes, deviations and waivers that impact on the configuration.

Configuration documentation

Documents that define requirements, system design, build, production, and verification for a configuration item.

Configuration identification

Activities that determine the product structure, the selection of Configuration Items, and the documentation of the Configuration Item's physical and functional characteristics including interfaces and subsequent Changes. It includes the allocation of identification characters or numbers to the Configuration Items and their documents. It also includes the unique numbering of configuration control forms associated with Changes and Problems.

Configuration Item (CI)

Component of an infrastructure - or an item, such as a Request for Change, associated with an infrastructure - which is (or is to be) under the control of Configuration Management. CIs may vary widely in complexity, size and type - from an entire system (including all hardware, software and documentation) to a single module or a minor hardware component.

Configuration Management

The process of identifying and defining the Configuration Items in a system, recording and reporting the status of Configuration Items and Requests for Change, and verifying the completeness and correctness of configuration items.

Configuration Management Database

A database which contains all relevant details of each CI and details of the important relationships between CIs.

Configuration Management plan

A document setting out the organisation and procedures for the Configuration Management of a specific product, project, system, support group or service.

Configuration Management Tool (CM Tool)

A software product providing automatic support for Change, Configuration or version control.

Configuration Structure

A hierarchy of all the CIs that comprise a configuration.

Contingency Planning

Planning to address unwanted occurrences that may happen at a later time. Traditionally, the term has been used to refer to planning for the recovery of IT systems rather than entire business processes.

Cost The amount of expenditure (actual or notional) incurred on, or attributable to, a specific activity or business unit.

Cost effectiveness Ensuring that there is a proper balance between the quality of service on the one side and expenditure on the other. Any investment that increases the costs of providing IT services should always result in enhancement to service quality or quantity.

Cost Management All the procedures, tasks and deliverables that are needed to fulfil an organisation's costing and charging requirements.

Cost unit In the context of CSBC the cost unit is a functional cost unit which establishes standard cost per workload element of activity, based on calculated activity ratios converted to cost ratios.

Costing The process of identifying the costs of the business and of breaking them down and relating them to the various activities of the organisation.

Countermeasure A check or restraint on the service designed to enhance security by reducing the risk of an attack (by reducing either the threat or the vulnerability), reducing the Impact of an attack, detecting the occurrence of an attack and/or assisting in the recovery from an attack.

Crisis management

The processes by which an organisation manages the wider impact of a disaster, such as adverse media coverage.

Customer Owner of the service; usually the Customer has responsibility for the cost of the service, either directly through charging or indirectly in terms of demonstrable business need. It is the Customer who will define the service requirements.

D

Data transfer time The length of time taken for a block or sector of data to be read from or written to an I/O device, such as a disk or tape.

Definitive Software Library (DSL)

The library in which the definitive authorised versions of all software CIs are stored and protected. It is a physical library or storage repository where master copies of software versions are placed. This one logical storage area may in reality consist of one or more physical software libraries or filestores. They should be separate from development and test filestore areas. The DSL may also include a physical store to hold master copies of bought-in software, e.g. fire-proof safe. Only authorised software should be accepted into the DSL, strictly controlled by Change and Release Management. The DSL exists not directly because of the needs of the Configuration Management process, but as a common base for the Release Management and Configuration Management processes.

Delta Release A release that includes only those CIs within the Release unit that have actually changed or are new since the last full or Delta Release. For example, if the Release unit is the program, a Delta Release contains only those modules that have changed, or are new, since the last full release of the program or the last Delta Release of the modules - see also 'Full Release'.

Dependency The reliance, either direct or indirect, of one process or activity upon another.

Depreciation The loss in value of an asset due to its use and/or the passage of time. The annual depreciation charge in accounts represents the amount of capital assets used up in the accounting period. It is charged in the cost accounts to ensure that the cost of capital equipment is reflected in the unit costs of the services provided using the equipment. There are various methods of calculating depreciation for the period, but the Treasury usually recommends the use of current cost asset valuation as the basis for the depreciation charge.
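Of the 'various methods of calculating depreciation' mentioned above, the simplest is straight-line depreciation, sketched below. This is an illustration with assumed figures, not the current cost asset valuation method the Treasury recommends.

```python
def straight_line_depreciation(cost, residual_value, useful_life_years):
    """Annual straight-line depreciation charge: the asset's cost,
    less any residual value, spread evenly over its useful life."""
    return (cost - residual_value) / useful_life_years

# A £12,000 server with a £2,000 residual value, written off over 4 years
print(straight_line_depreciation(12_000, 2_000, 4))  # → 2500.0
```

The £2,500 annual charge is what the cost accounts would carry so that the equipment's cost is reflected in the unit costs of the services it supports.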

Differential charging

Charging business customers different rates for the same work, typically to dampen demand or to generate revenue for spare capacity. This can also be used to encourage off-peak or night time running.

Direct cost A cost that is incurred for, and can be traced in full to a product, service, cost centre or department. This is an allocated cost. Direct costs are direct materials, direct wages and direct expenses.

Disaster recovery planning

A series of processes that focus only upon the recovery processes, principally in response to physical disasters, that are contained within BCM.

Discounted cash flow

An evaluation of the future net cash flows generated by a capital project by discounting them to their present-day value. The two methods most commonly used are:

• Yield method, for which the calculation determines the internal rate of return (IRR) in the form of a percentage

• Net present value (NPV) method, in which the discount rate is chosen and the answer is a sum of money.
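The NPV method above reduces to a short calculation, sketched here as a minimal illustration (the function name and figures are assumptions, not from the guide). The yield method's IRR is simply the discount rate at which this NPV reaches zero.

```python
def net_present_value(discount_rate, cash_flows):
    """NPV method: discount each future net cash flow back to its
    present-day value and sum. cash_flows[0] is the immediate
    (year-0) flow, typically the negative initial outlay."""
    return sum(cf / (1 + discount_rate) ** t
               for t, cf in enumerate(cash_flows))

# £10,000 outlay returning £4,000 a year for three years, discounted at 10%
npv = net_present_value(0.10, [-10_000, 4_000, 4_000, 4_000])
print(round(npv, 2))
```

Here the undiscounted inflows (£12,000) exceed the outlay, yet the NPV is slightly negative, illustrating why discounting matters for capital investment appraisal.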

Discounting The offering to business customers of reduced rates for the use of off-peak resources (see also Surcharging).

Disk cache controller

Memory that is used to store blocks of data that have been read from the disk devices connected to them. If a subsequent I/O requires a record that is still resident in the cache memory, it will be picked up from there, thus saving another physical I/O.

Duplex (full and half)

Full duplex line/channel allows simultaneous transmission in both directions. Half duplex line/channel is capable of transmitting in both directions, but only in one direction at a time.

E

Echoing A reflection of the transmitted signal from the receiving end, a visual method of error detection in which the signal from the originating device is looped back to that device so that it can be displayed.

Elements of cost The constituent parts of costs according to the factors upon which expenditure is incurred viz., materials, labour and expenses.

End User The person who uses the service on a day-to-day basis.

Environment A collection of hardware, software, network communications and procedures that work together to provide a discrete type of computer service. There may be one or more environments on a physical platform, e.g. test, production. Each environment has unique features and characteristics that dictate how it is administered.

Expert User In some organisations it is common to use 'Super' Users (commonly known as Super or Expert Users) to deal with first-line support problems and queries. This is typically in specific application areas, or geographical locations, where there is not the requirement for full-time support staff. This valuable resource however needs to be carefully co-ordinated and utilised.

External Target One of the measures, against which a delivered IT service is compared, expressed in terms of the customer's business.

F

Financial year An accounting period covering 12 consecutive months. In the public sector this financial year generally coincides with the fiscal year which runs from 1 April to 31 March.

Forward Schedule of Changes

Contains details of all the Changes approved for implementation and their proposed implementation dates. It should be agreed with the Customers and the business, Service Level Management, the Service Desk and Availability Management. Once agreed, the Service Desk should communicate to the User community at large any planned additional downtime arising from implementing the Changes, using the most effective methods available.

Full cost The total cost of all the resources used in supplying a service i.e. the sum of the direct costs of producing the output, a proportional share of overhead costs and any selling and distribution expenses. Both cash costs and notional (non-cash) costs should be included, including the cost of capital. See also 'Total Cost of Ownership'

Full Release All components of the Release unit are built, tested, distributed and implemented together - see also 'Delta Release'.

G

Gateway Equipment which is used to interface networks so that a terminal on one network can communicate with services or a terminal on another.

Gradual Recovery Previously called 'Cold stand-by', this is applicable to organisations that do not need immediate restoration of business processes and can function for a period of up to 72 hours, or longer, without a re-establishment of full IT facilities. This may include the provision of empty accommodation fully equipped with power, environmental controls, local network cabling infrastructure and telecommunications connections, available in a disaster situation for an organisation to install its own computer equipment.

H

Hard charging Descriptive of a situation where, within an organisation, actual funds are transferred from the customer to the IT organisation in payment for the delivery of IT services.

Hard fault The situation in a virtual memory system when the required page of code or data, which a program was using, has been redeployed by the operating system for some other purpose. This means that another piece of memory must be found to accommodate the code or data, and will involve physical reading/writing of pages to the page file.

Host A host computer comprises the central hardware and software resources of a computer complex, e.g. CPU, memory, channels, disk and magnetic tape I/O subsystems plus operating and applications software. The term is used to denote all non-network items.

Hot stand-by See 'Immediate Recovery'.

I

ICT The convergence of Information Technology, Telecommunications and Data Networking Technologies into a single technology.

Immediate Recovery

Previously called 'Hot stand-by', provides for the immediate restoration of services following any irrecoverable incident. It is important to distinguish between the previous definition of 'hot stand-by' and 'immediate recovery'. Hot stand-by typically referred to availability of services within a short timescale such as 2 or 4 hours whereas immediate recovery implies the instant availability of services.

Impact Measure of the business criticality of an Incident, Problem or Request for Change. Often equal to the extent of a distortion of agreed or expected Service Levels.

Impact analysis The identification of critical business processes, and the potential damage or loss that may be caused to the organisation resulting from a disruption to those processes. Business impact analysis identifies:

• the form the loss or damage will take

• how that degree of damage or loss is likely to escalate with time following an incident

• the minimum staffing, facilities and services needed to enable business processes to continue to operate at a minimum acceptable level

• the time within which they should be recovered.

The time within which full recovery of the business processes is to be achieved is also identified.

Impact scenario Description of the type of impact on the business that could follow a business disruption. Usually related to a business process and will always refer to a period of time, e.g. customer services will be unable to operate for two days.

Incident Any event which is not part of the standard operation of a service and which causes, or may cause, an interruption to, or a reduction in, the quality of that service.

Indirect cost A cost incurred in the course of making a product, providing a service or running a cost centre or department, but which cannot be traced directly and in full to the product, service or department, because it has been incurred for a number of cost centres or cost units. These costs are apportioned to cost centres/cost units. Indirect costs are also referred to as overheads.

Informed Customer

An individual, team or group with functional responsibility within an organisation for ensuring that spend on IS/IT is directed to best effect, i.e. that the business is receiving value for money and continues to achieve the most beneficial outcome. In order to fulfil its role the 'Informed' Customer function must gain clarity of vision in relation to the business plans and assure that suitable strategies are devised and maintained for achieving business goals. The 'Informed' Customer function ensures that the needs of the business are effectively translated into a business requirements specification, that IT investment is both efficiently and economically directed, and that progress towards effective business solutions is monitored. The 'Informed' Customer should play an active role in the procurement process, e.g. in relation to business case development, and also in ensuring that the services and solutions obtained are used effectively within the organisation to achieve maximum business benefits. The term is often used in relation to the outsourcing of IT/IS. Sometimes also called 'Intelligent Customer'.

Interface Physical or functional interaction at the boundary between Configuration Items.

Intermediate Recovery

Previously called 'Warm stand-by', typically involves the re-establishment of the critical systems and services within a 24 to 72 hour period, and is used by organisations that need to recover IT facilities within a predetermined time to prevent impacts to the business process.

Internal target One of the measures against which supporting processes for the IT service are compared. Usually expressed in technical terms relating directly to the underpinning service being measured.

Invocation (of business recovery plans)

Putting business recovery plans into operation after a business disruption.

Invocation (of stand-by arrangements)

Putting stand-by arrangements into operation as part of business recovery activities.

Invocation and recovery phase

The second phase of a business recovery plan.

ISO9001 The internationally accepted set of standards concerning quality management systems.

ITIL The OGC IT Infrastructure Library - a set of guides on the management and provision of operational IT services.

K

Known Error An Incident or Problem for which the root cause is known and for which a temporary Work-around or a permanent alternative has been identified. If a business case exists, an RFC will be raised, but, in any event, it remains a Known Error unless it is permanently fixed by a Change.

L

Latency The elapsed time from the moment when a seek was completed on a disk device to the point when the required data is positioned under the read/write heads. It is normally defined by manufacturers as being half the disk rotation time.
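The manufacturers' convention above, latency as half the disk rotation time, amounts to a one-line calculation. A minimal sketch with assumed figures:

```python
def rotational_latency_ms(rpm):
    """Average rotational latency: half of one full disk rotation,
    per the manufacturers' convention of half the rotation time."""
    rotation_time_ms = 60_000 / rpm   # one revolution, in milliseconds
    return rotation_time_ms / 2

# A 7,200 RPM drive rotates once every ~8.33 ms, so average latency ≈ 4.17 ms
print(round(rotational_latency_ms(7_200), 2))  # → 4.17
```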

Lifecycle A series of states, connected by allowable transitions. The lifecycle represents an approval process for Configuration Items, Problem Reports and Change documents.

Logical I/O A read or write request by a program. That request may, or may not, necessitate a physical I/O. For example, on a read request the required record may already be in a memory buffer and therefore a physical I/O is not necessary.

M

Marginal Cost The cost of providing the service now, based upon the investment already made.

Maturity level/Milestone

The degree to which BCM activities and processes have become standard business practice within an organisation.

Metric Measurable element of a service process or function.

O

Operational Costs Those costs resulting from the day-to-day running of the IT Services section, e.g. staff costs, hardware maintenance and electricity, and relating to repeating payments whose effects can be measured within a short timeframe, usually less than the 12-month financial year.

Operational Level Agreement

An internal agreement covering the delivery of services which support the IT organisation in their delivery of services.

Opportunity cost (or true cost)

The value of a benefit sacrificed in favour of an alternative course of action. That is the cost of using resources in a particular operation expressed in terms of foregoing the benefit that could be derived from the best alternative use of those resources.

Outsourcing The process by which functions performed by the organisation are contracted out for operation, on the organisation's behalf, by third parties.

Overheads The total of indirect materials, wages and expenses.

P

Package assembly /disassembly device

A device that permits terminals, which do not have an interface suitable for direct connection to a packet switched network, to access such a network. A PAD converts data to/from packets and handles call set-up and addressing.

Page fault A program interruption that occurs when a page that is marked 'not in real memory' is referred to by an active page.

Paging The I/O necessary to read and write to and from the paging disks: real (not virtual) memory is needed to process data. With insufficient real memory, the operating system writes old pages to disk, and reads new pages from disk, so that the required data and instructions are in real memory.

PD0005 Alternative title for the BSI publication 'A Code of Practice for IT Service Management'.

Percentage utilisation

The amount of time that a hardware device is busy over a given period of time. For example, if the CPU is busy for 1800 seconds in a one hour period, its utilisation is said to be 50%.
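The worked example in the definition (a CPU busy for 1,800 seconds in a one-hour period) reduces to a single ratio; a minimal sketch, with an illustrative function name:

```python
def percentage_utilisation(busy_seconds, period_seconds):
    """Utilisation: time a device is busy, as a percentage of the period."""
    return 100.0 * busy_seconds / period_seconds

# The glossary's example: 1,800 busy seconds in a 3,600-second hour
print(percentage_utilisation(1_800, 3_600))  # → 50.0
```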

Phantom line error

A communications error reported by a computer system that is not detected by network monitoring equipment. It is often caused by changes to the circuits and network equipment (e.g. re-routing circuits at the physical level on a backbone network) while data communications is in progress.

Physical I/O A read or write request from a program has necessitated a physical read or write operation on an I/O device.

Prime cost The total cost of direct materials, direct labour and direct expenses. The term prime cost is commonly restricted to direct production costs only and so does not customarily include direct costs of marketing or research and development.

PRINCE2 The standard UK government method for project management.

Priority Sequence in which an Incident or Problem needs to be resolved, based on impact and urgency.

Problem Unknown underlying cause of one or more Incidents.

Process A connected series of actions, activities, Changes etc, performed by agents with the intent of satisfying a purpose or achieving a goal.

Process Control The process of planning and regulating, with the objective of performing the process in an effective and efficient way.

Programme A collection of activities and projects that collectively implement a new corporate requirement or function.

Q

Queuing time Queuing time is incurred when the device, which a program wishes to use, is already busy. The program therefore has to wait in a queue to obtain service from that device.

R

RAID Redundant Array of Inexpensive Disks - a mechanism for providing data resilience for computer systems using mirrored arrays of magnetic disks. Different levels of RAID can be applied to provide for greater resilience.

Reference data Information that supports the plans and action lists, such as names and addresses or inventories, which is indexed within the plan.

Release A collection of new and/or changed CIs which are tested and introduced into the live environment together.

Request for Change (RFC)

Form, or screen, used to record details of a request for a change to any CI within an infrastructure or to procedures and items associated with the infrastructure.

Resolution Action which will resolve an Incident. This may be a Work-around.

Resource cost The amount of machine resource that a given task consumes. This resource is usually expressed in seconds for the CPU or the number of I/Os for a disk or tape device.

Resource profile The total resource costs that are consumed by an individual online transaction, batch job or program. It is usually expressed in terms of CPU seconds, number of I/Os and memory usage.

Resource unit costs

Resource units may be calculated on a standard cost basis to identify the expected (standard) cost for using a particular resource. Because computer resources come in many shapes and forms, units have to be established by logical groupings. Examples are: a) CPU time or instructions b) disk I/Os c) print lines d) communication transactions.

Resources The IT Services section needs to provide the customers with the required services. The resources are typically computer and related equipment, software, facilities or organisational (people).

Return to normal phase

The phase within a business recovery plan which re-establishes normal operations.

Risk A measure of the exposure to which an organisation may be subjected. This is a combination of the likelihood of a business disruption occurring and the possible loss that may result from such business disruption.

Risk Analysis The identification and assessment of the level (measure) of the risks calculated from the assessed values of assets and the assessed levels of threats to, and vulnerabilities of, those assets.

Risk Management The identification, selection and adoption of countermeasures justified by the identified risks to assets in terms of their potential impact upon services if failure occurs, and the reduction of those risks to an acceptable level.

Risk reduction measure

Measures taken to reduce the likelihood or consequences of a business disruption occurring (as opposed to planning to recover after a disruption).

Role A set of responsibilities, activities and authorisations.

Roll in roll out (RIRO)

Used on some systems to describe swapping.

Rotational Position Sensing

A facility which is employed on most mainframes and some minicomputers. When a seek has been initiated the system can free the path from a disk drive to a controller for use by another disk drive, while it is waiting for the required data to come under the read/write heads (latency). This facility usually improves the overall performance of the I/O subsystem.

S

Seek Time Occurs when the disk read/write heads are not positioned on the required track. It describes the elapsed time taken to move heads to the right track.

Self-insurance A decision to bear the losses that could result from a disruption to the business as opposed to taking insurance cover on the risk.

Service One or more IT systems which enable a business process.

Service achievement

The actual service levels delivered by the IT organisation to a customer within a defined life-span.

Service Catalogue Written statement of IT services, default levels and options.

Service Desk The single point of contact within the IT organisation for users of IT services.

Service Improvement Programme

A formal project undertaken within an organisation to identify and introduce measurable improvements within a specified work area or work process.

Service Level Agreement

Written agreement between a service provider and the Customer(s), that documents agreed Service Levels for a Service.

Service Level Management

The process of defining, agreeing, documenting and managing the levels of customer IT service, that are required and cost justified.

Service Management

Management of Services to meet the Customer's requirements.

Service provider Third-party organisation supplying services or products to customers.

Service quality plan

The written plan and specification of internal targets designed to guarantee the agreed service levels.

Service Request An Incident that is not a failure in the IT Infrastructure.

Services The deliverables of the IT Services organisation as perceived by the Customers; the services do not consist merely of making computer resources available for customers to use.

Simulation modelling

Using a program to simulate computer processing by describing in detail the path of a job or transaction. It can give extremely accurate results. Unfortunately, it demands a great deal of time and effort from the modeller. It is most beneficial in extremely large or time-critical systems where the margin for error is very small.

Soft fault The situation in a virtual memory system when the operating system has detected that a page of code or data was due to be reused, i.e. it is on a list of 'free' pages, but it is still actually in memory. It is now rescued and put back into service.

Software Configuration Item (SCI)

As 'Configuration Item', excluding hardware and services.

Software Environment

Software used to support the application such as operating system, database management system, development tools, compilers, and application software.

Software Library A controlled collection of SCIs designated to keep those with like status and type together and distinctly segregated, to aid in development, operation and maintenance.

Software work unit

Software work is a generic term devised to represent a common base on which all calculations for workload usage and IT resource capacity are then based. A unit of software work for I/O type equipment equals the number of bytes transferred; and for central processors it is based on the product of power and CPU-time.

Solid state devices Memory devices that are made to appear as if they are disk devices. The advantages of such devices are that the service times are much faster than real disks since there is no seek time or latency. The main disadvantage is that they are much more expensive.

Specsheet Specifies in detail what the customer wants (external) and what consequences this has for the service provider (internal) such as required resources and skills.

Standard cost A pre-determined calculation of how much costs should be under specified working conditions. It is built up from an assessment of the value of cost elements and correlates technical specifications and the quantification of materials, labour and other costs to the prices and/or wages expected to apply during the period in which the standard cost is intended to be used. Its main purposes are to provide bases for control through variance accounting, for the valuation of work in progress and for fixing selling prices.

Standard costing A technique which uses standards for costs and revenues for the purposes of control through variance analysis.

Stand-by arrangements

Arrangements to have available assets which have been identified as replacements should primary assets be unavailable following a business disruption. Typically, these include accommodation, IT systems and networks, telecommunications and sometimes people.

Storage occupancy

A defined measurement unit that is used for storage type equipment to measure usage. The unit value equals the number of bytes stored.

Super User In some organisations it is common to use 'expert' Users (commonly known as Super or Expert Users) to deal with first-line support problems and queries. This is typically in specific application areas, or geographical locations, where there is not the requirement for full-time support staff. This valuable resource however needs to be carefully co-ordinated and utilised.

Surcharging Charging business users a premium rate for using resources at peak times.

Swapping The reaction of the operating system to insufficient real memory: swapping occurs when too many tasks are perceived to be competing for limited resources. It is the physical movement of an entire task (e.g. all real memory pages of an address space may be moved at one time from main storage to auxiliary storage).

System An integrated composite that consists of one or more of the processes, hardware, software, facilities and people, that provides a capability to satisfy a stated need or objective.

T

Terminal emulation

Software running on an intelligent device, typically a PC or workstation, which allows that device to function as an interactive terminal connected to a host system. Examples of such emulation software include IBM 3270 BSC or SNA, ICL C03, or Digital VT100.

Terminal I/O A read from, or a write to, an online device such as a VDU or remote printer.

Third-party supplier

An enterprise or group, external to the Customer's enterprise, which provides services and/or products to that Customer's enterprise.

Thrashing A condition in a virtual storage system where an excessive proportion of CPU time is spent moving data between main and auxiliary storage.

Total Cost Of Ownership

The full cost of owning an asset over its life, calculated to include depreciation, maintenance, staff costs, accommodation, and planned renewal.

Tree structures In data structures, a series of connected nodes without cycles. One node is termed the root and is the starting point of all paths, other nodes termed leaves terminate the paths.
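The tree structure described above can be sketched in a few lines of code. This is an illustrative example only; the class name, node names, and helper function are invented for the sketch.

```python
# A minimal sketch of a tree structure: connected nodes without cycles.
# One node (the root) starts every path; leaves terminate them.

class Node:
    def __init__(self, name):
        self.name = name
        self.children = []  # a node with no children is a leaf

    def add(self, child):
        self.children.append(child)
        return child

# Build a small tree rooted at 'root'.
root = Node("root")
branch = root.add(Node("branch"))
branch.add(Node("leaf1"))
root.add(Node("leaf2"))

def leaves(node):
    """Return the names of all leaf nodes reachable from node."""
    if not node.children:
        return [node.name]
    result = []
    for child in node.children:
        result.extend(leaves(child))
    return result

print(leaves(root))  # ['leaf1', 'leaf2']
```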

U

Underpinning contract

A contract with an external supplier covering delivery of services that support the IT organisation in their delivery of services.

Unit costs Costs distributed over individual component usage. For example, it can be assumed that, if a box of paper with 1000 sheets costs £10, then each sheet costs 1p. Similarly, if a CPU costs £1m a year and it is used to process 1,000 jobs that year, each job costs on average £1,000.
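The unit-cost arithmetic above is simply the total cost divided evenly over the units consumed. A minimal sketch, using the glossary's own figures (the function name is illustrative):

```python
def unit_cost(total_cost, units):
    """Distribute a total cost evenly over individual units of usage."""
    return total_cost / units

# A box of 1,000 sheets at £10 -> 1p per sheet.
sheet_cost = unit_cost(10.0, 1000)      # 0.01, i.e. 1p
# A CPU costing £1m a year running 1,000 jobs -> £1,000 per job.
job_cost = unit_cost(1_000_000, 1000)   # 1000.0
print(sheet_cost, job_cost)
```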

Urgency Measure of the business criticality of an Incident or Problem based on the impact and on the business needs of the Customer.

User The person who uses the service on a day-to-day basis.

Utility cost centre (UCC)

A cost centre for the provision of support services to other cost centres.


V

Variance analysis A variance is the difference between planned, budgeted or standard cost and actual cost (or revenues). Variance analysis is an analysis of the factors that have caused the difference between the pre-determined standards and the actual results. Variances can also be developed for specific operations carried out, in addition to those mentioned above.
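The calculation behind variance accounting is the difference between actual and standard cost per cost element. A minimal sketch, with invented figures (all names and values are illustrative, not from ITIL):

```python
def variance(standard, actual):
    """Difference between actual and pre-determined standard cost.
    Positive = overspend (adverse); negative = underspend (favourable)."""
    return actual - standard

# Hypothetical standard vs actual costs per cost element, in £.
costs = {"labour": (5000, 5400), "materials": (3000, 2850)}
for element, (std, act) in costs.items():
    v = variance(std, act)
    label = "adverse" if v > 0 else "favourable"
    print(f"{element}: variance £{v} ({label})")
```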

Version An identified instance of a Configuration Item within a product breakdown structure or configuration structure for the purpose of tracking and auditing change history. Also used for software Configuration Items to define a specific identification released in development for drafting, review or modification, test or production.

Version Identifier A version number; version date; or version date and time stamp.

Virtual memory system

A system that extends the apparent size of main memory by adding an auxiliary storage layer residing on the hard disk.

Virtual storage interrupt (VSI)

An ICL VME term for a page fault.

Vulnerability A weakness of the system and its assets, which could be exploited by threats.

W

Warm stand-by See 'Intermediate Recovery'.

Waterline The lowest level of detail relevant to the customer.

Work-around Method of avoiding an Incident or Problem, either by a temporary fix or by a technique that means the Customer is not reliant on a particular aspect of the service that is known to have a problem.

Workloads In the context of Capacity Management Modelling, a set of forecasts which detail the estimated resource usage over an agreed planning horizon. Workloads generally represent discrete business applications and can be further sub-divided into types of work (interactive, timesharing, batch).

WORM (Device) Write Once Read Many: an optical disk that can be written to once but read many times.
