TRANSCRIPT
Lessons Learned from a LIMS Outage
Chris Wilson
Microbiology IT and Automation Coordinator
Leeds Teaching Hospitals NHS Trust
• Leeds Teaching Hospitals has an annual budget of £789m
• Provides local and specialist services for an immediate population of 780,000 and regional specialist care for up to 5.4 million people
• The trust employs more than 14,000 people across 6 sites with around 2,500 beds
Background
• Leeds Pathology acquired Telepath in the 1980s
• Started small but now supports all of
– Microbiology
– Blood Sciences
– Blood Bank
– Specialist Laboratory Medicine
– Immunology
– Anticoagulation clinic work
– Transplant Immunology
• Across all sites: in-patient, out-patient, 106 GP practices, and a tertiary service for specialist tests. ~17,000 samples per day.
Background
• What happened?
• It was approximately 12:15 on Friday 16th September 2016. I was proceeding along the high street (the Automation Lab) when a cry attracted my attention
• “I can’t log into Telepath”
• “Neither can I”
Sequence of Events
• It became clear this was a crash; no big deal, this happens from time to time. Probably the database memory had filled with some corrupted file.
• Ascertained a couple of things and logged a job with DXC
• Something didn’t quite add up
• Sent a colleague into the server room to check the server was still alive
• Server is still alive
• There are three orange lights below the server
• Not sure what this means
• Go over to the server room
• Looks like the disk array
• Phone call from DXC, “you have a disk failure”
• From remote health-check data DXC could see that
– Disk 9 failed at 16:57 on 2nd August 2016
– Disk 6 failed at 07:43 on 11th August 2016
– Disk 7 failed at 10:40 on 16th September 2016
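Three disks died over six weeks, and nobody noticed until the array finally gave up. A periodic health poll that alerts on any degraded member would have surfaced the first failure back in August. What follows is a minimal sketch of such a poll, assuming a Linux host that exposes software-RAID state via /proc/mdstat; the array here was vendor hardware, which would need its own status tool in place of the file read, and all addresses are placeholders.

#!/usr/bin/env python3
"""Poll RAID health and shout when any member disk has dropped out.

A sketch only: assumes Linux software RAID (/proc/mdstat); a hardware
array would need its vendor's status tool in place of the file read.
"""
import re
import smtplib
from email.message import EmailMessage

ALERT_TO = "pathology-it@example.nhs.uk"  # placeholder address
SMTP_HOST = "localhost"                   # placeholder mail relay

def degraded_arrays(mdstat_text):
    """Return (array, member_status) pairs for arrays with a dead member.

    /proc/mdstat marks each member U (up) or _ (down) inside brackets,
    e.g. [UU_U] means the third disk in a four-disk array is gone.
    """
    problems, current = [], None
    for line in mdstat_text.splitlines():
        m = re.match(r"(md\d+) :", line)
        if m:
            current = m.group(1)
            continue
        status = re.search(r"\[([U_]+)\]", line)
        if current and status and "_" in status.group(1):
            problems.append((current, status.group(1)))
    return problems

def main():
    with open("/proc/mdstat") as f:
        problems = degraded_arrays(f.read())
    if not problems:
        return  # healthy: stay quiet (or log a heartbeat)
    msg = EmailMessage()
    msg["Subject"] = "RAID DEGRADED: " + ", ".join(n for n, _ in problems)
    msg["From"] = "raid-monitor@example.nhs.uk"  # placeholder sender
    msg["To"] = ALERT_TO
    msg.set_content("\n".join(f"{n}: members [{s}]" for n, s in problems))
    with smtplib.SMTP(SMTP_HOST) as smtp:
        smtp.send_message(msg)

if __name__ == "__main__":
    main()

Run from cron every few minutes, something like this turns a silent orange light into an email. The deeper fix, covered later, was agreeing whose job it was to run it at all.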
• Now heading into the weekend
• An engineer is on the way to assess the situation
• Put out comms
• Implement continuity plan
• Disks were not easy to come by since the system was so old
• Disks eventually arrived in the evening, after being sourced from the back of a warehouse somewhere in the South of England
• Engineer arrived late Friday night
• Replaced one disk and waited to see if the array would rebuild itself
• This did not happen
• On-call engineer from DXC, on-call support from Trust IT and, since there is no provision for on-call within Pathology IT, the idiot who found the problem
• Replaced the other two disks
• Now needed to restore the database from a backup, which should have been run in the early hours of the morning
• The backup which was run on the disk array was obviously destroyed
• But, we had a backup of this backup on a remote server
• On-Call IT staff did not know where this backup was kept
• Contacted the only person who knew the location, who was on annual leave
• Agreed to reconvene Saturday morning and restore the database from the backup
• The backup restore ran for 18 hours and did not work
• It became apparent that the backup was not a complete file
• In fact, we couldn't find any backup that was a complete file
• Later discovered that there was a reason for this.
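A check along the following lines would have exposed the problem long before crunch time: a nightly job that reads each backup end to end fails loudly on exactly the sort of truncated files we were left with. This is a minimal sketch, assuming the backups are gzipped tar archives; Telepath's real backup format may well differ, but the principle, a full read rather than a glance at the file listing, carries over.

#!/usr/bin/env python3
"""Prove a backup archive is complete by reading every byte of it.

A sketch only: assumes gzipped tar backups. A truncated or corrupt
archive fails partway through the read; a bare "does the file exist"
check, which is effectively what we had, never would.
"""
import hashlib
import sys
import tarfile

def verify(path):
    # 1. Stream every member to its end; truncation or corruption
    #    raises an error before the loop can finish.
    try:
        with tarfile.open(path, "r:gz") as tar:
            for member in tar:
                extracted = tar.extractfile(member)
                if extracted is None:  # directories, links, etc.
                    continue
                while extracted.read(1 << 20):
                    pass
    except (tarfile.TarError, EOFError, OSError) as exc:
        return False, f"unreadable: {exc}"
    # 2. Record a checksum so the off-site copy can be compared
    #    with the original before anyone has to rely on it.
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return True, digest.hexdigest()

if __name__ == "__main__":
    ok, detail = verify(sys.argv[1])
    print(("OK " if ok else "FAIL ") + detail)
    sys.exit(0 if ok else 1)

Even this is weaker than the real test, a periodic trial restore, which is what the Trust eventually put in place.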
• After stitching back together pieces of backed-up files, we ended up with
– All of the archive
– All of Blood Bank
– Nearly all of Chemistry
– Nothing else
• We had stopped using Haematology as a separate database some time ago
• When the disciplines were merged, the data was backed up
• No Haematology was lost
• Some work to do to get Blood Bank up and running and all of the interfaces working
• Some more work to do to get Blood Sciences back up and running
• Testing and assurance took some time
• Blood Bank was back up in just over two weeks
• Blood Sciences was up shortly after that
• We had a number of issues with the database crashing repeatedly whilst we performed this work
• The only Microbiology backup we could find was from May 2010, when we upgraded the operating system
• Using this fragment, it took three months to rebuild Microbiology
• Because of the specialist nature of the work and the lack of staff within Pathology IT, work did not start on the Microbiology rebuild until Blood Bank was up and running and most of Blood Sciences was functional
• We had a fallback process for IT downtime
• Some of the process was developed on the fly
• The longest previous outage was probably about four days
• We had no process designed to last for more than a week
• Not three weeks
• Certainly not three months
Summary – what went wrong?
• An assumption was made that DXC were monitoring the system
• Disk failures could have been detected early enough to avoid a crash
• The hardware was not updated; it was nearly 7 years old at the time of the crash
• The backup was not valid
• Only one person knew where the backup was kept
• We did not have a valid backup, and we didn’t know this until crunch time
Summary – what went wrong?
• “The Telepath system … is predominantly managed by a dedicated team of Pathology-based IT staff.”
• “Traditionally those colleagues now managing the Telepath system began their professional careers as Medical Laboratory Scientific Officers (MLSOs) … developing their interest in computer science alongside the automation of laboratory services from the mid 1970s …”
• The Pathology IT team lost several key staff over the course of several years, leading to a loss of knowledge and a loss of manpower
Summary – what went wrong?
• The Trust may have underestimated the importance of pathology until it didn’t have pathology
How did we deal with it?
• Reverted to a system of manually reporting results
• Outsourced some of our work, mostly blood sciences, to other labs within our STP region and Sheffield
• This wasn’t without difficulty, since different labs have different capacities and different methodologies
How did we deal with it?
• Had a lot of planning meetings and risk assessments, with a focus on patient safety
• Kept a constant stream of communication going out to users
• Trust IT took control of everything hardware related
How did we deal with it?
• Sent the disks away for data recovery
• (No useful data was recoverable)
• Attempted to write some code to help us utilise the functionality of our EPR and automate some of the process
• (So time-consuming that the rebuild was quicker)
• Commissioned an independent enquiry
What has happened as a result?
• Detailed SLA between Pathology and Trust IT
• Remote monitoring of the server and disk array
• Physical checks on the server and disk array
• Backup schedule altered
• Roles and responsibilities defined
• Testing of backups (finally) carried out (see the sketch below)
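As a hedged illustration of how the monitoring, schedule and testing above can be wired together, here is a sketch of a daily assurance job; every path, name and address in it is a placeholder. The design choice worth copying is that it reports every day, OK or FAIL, so silence itself becomes an alarm rather than a comfort.

#!/usr/bin/env python3
"""Send daily assurance that last night's backup exists and was verified.

A sketch only: assumes the verification step leaves a ".sha256" sidecar
next to each archive it has passed. Directory, addresses and relay are
placeholders.
"""
import datetime
import pathlib
import smtplib
from email.message import EmailMessage

BACKUP_DIR = pathlib.Path("/backups/telepath")  # placeholder path
MAX_AGE = datetime.timedelta(hours=26)          # nightly run plus slack
ALERT_TO = "pathology-it@example.nhs.uk"        # placeholder address

def status():
    backups = sorted(BACKUP_DIR.glob("*.tar.gz"),
                     key=lambda p: p.stat().st_mtime)
    if not backups:
        return "FAIL: no backup files present at all"
    newest = backups[-1]
    age = datetime.datetime.now() - datetime.datetime.fromtimestamp(
        newest.stat().st_mtime)
    if age > MAX_AGE:
        return f"FAIL: newest backup {newest.name} is {age} old"
    if not (newest.parent / (newest.name + ".sha256")).exists():
        return f"FAIL: {newest.name} was never verified"
    return f"OK: {newest.name} is present, fresh and verified"

def main():
    body = status()
    msg = EmailMessage()
    msg["Subject"] = "Telepath backup assurance: " + body.split(":")[0]
    msg["From"] = "backup-assurance@example.nhs.uk"  # placeholder sender
    msg["To"] = ALERT_TO
    msg.set_content(body)
    with smtplib.SMTP("localhost") as smtp:  # placeholder relay
        smtp.send_message(msg)

if __name__ == "__main__":
    main()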
What has happened as a result?
• Comprehensive disaster recovery and business continuity policy drawn up
What (still) hasn’t happened?
• The hardware has still not been upgraded
• It is now approaching 10 years old
• We still don’t have enough staff in IT
• We still don’t have a new LIMS
What were the positives?
• We did have some kind of business continuity plan
• And now we have a tried and tested one
• We had three years’ worth of detailed change logs for Telepath
• People now know a lot more about Pathology’s role in patient care in our Trust
• We served as a warning for other Trusts and for other departments within our Trust
• Other Trusts’ IT business cases were mysteriously accepted
• DXC were extremely helpful and supportive
• Other Trusts were incredibly helpful and supportive
Recommendations from independent investigation
• R1 The Service Level Agreement that describes the managed service currently provided by CSC/SCC should be reviewed and strengthened where required. The Trust must be assured that the level of service covered by this and similar contracts is adequate and that they do not expose the Trust to any unnecessary risks.
• R2 All contracts with external service providers must contain a statement of compliance with regard to adherence with Standard Operating Instructions covering the major aspects of work for which the third party is responsible/liable.
• R3 The Trust should work closely with CSC to seek confirmation that the successive failure of three disks in the RAID 10 array was indeed coincidental and not something more sinister that could reoccur in the future.
• R4 Where possible, system back-ups should be part of a centrally managed process. Regular (daily) checks must be in place to ensure that back-ups are complete and are capable of running a full restore. This process should be extended to all critical systems across the Trust or at the very least major departmental applications.
• R5 A clean image of the current version of the Telepath operating system and application across all laboratory disciplines must be retained and stored off site. This process should be extended to all systems across the Trust or at the very least major departmental applications.
• R6 A system must be in place with immediate effect to accurately record the names of those entering any of the Trust’s data centres as well as all activities that took place during access periods.
• R7 Electronic alerting and warning systems must be in place to ensure that all those who have responsibility for data centres, system back-ups and electronic storage receive automated assurance that back-ups are intact and complete.
• R8 Large departmental systems, their associated networks and data centres should be managed within the structure of the corporate Information Management & Technology Service. Current members of the Pathology IT team who spend more than 50% of their time managing information systems should equally move to the corporate service.
• R9 The Trust corporate IT team should work with the Pathology IT team and Primary Care system suppliers (EMIS and SYSTM ONE) to understand how technology can provide a more cohesive solution to recognise the relationship between Practices and the hospital providers of Pathology Services. NPeX (X-Lab) may offer a Lab-to-Lab connectivity option that the Trust would find useful.
• R10 In relation to R9, the Trust should also re-examine any Memoranda of Understanding or reciprocal arrangements that it has in place with other NHS providers to take account of critical system and service failures in the future. Included in this review should be any system or service that relies heavily on technology.