TRANSCRIPT
Lessons Learned from a LIMS Outage
Chris Wilson
Microbiology IT and Automation Coordinator
Leeds Teaching Hospitals NHS Trust
• Leeds Teaching Hospitals has an annual budget of £789m
• Provides local and specialist services for an immediate population of 780,000 and regional specialist care for up to 5.4 million people
• The trust employs more than 14,000 people across 6 sites with around 2,500 beds
Background
• Leeds Pathology acquired Telepath in the 1980s
• Started small but now supports all of
– Microbiology
– Blood Sciences
– Blood Bank
– Specialist Laboratory Medicine
– Immunology
– Anticoagulation clinic work
– Transplant Immunology
• Across all sites: in-patient, out-patient, 106 GP practices, and a tertiary service for specialist tests. ~17,000 samples per day.
Background
• What happened?
• It was approximately 12:15 on Friday 16th September 2016. I was proceeding along the high street (the Automation Lab) when a cry attracted my attention
• “I can’t log into Telepath”
• “Neither can I”
Sequence of Events
• It became clear this was a crash; no big deal, this happens from time to time. Probably the database memory had filled with some corrupted file.
• Ascertained a couple of things and logged a job with DXC
• Something didn’t quite add up
• Sent a colleague into the server room to check the server was still alive
• Server is still alive
• There are three orange lights below the server
• Not sure what this means
• Go over to the server room
• Looks like the disk array
• Phone call from DXC, “you have a disk failure”
• From remote health-check data DXC could see that
– Disk 9 failed at 16:57 on 2nd August 2016
– Disk 6 failed at 07:43 on 11th August 2016
– Disk 7 failed at 10:40 on 16th September 2016
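Three disks died over six weeks, and nobody noticed until the array finally gave up. A periodic health poll that alerts on any degraded member would have surfaced the first failure back in August. What follows is a minimal sketch of such a poll, assuming a Linux host that exposes software-RAID state via /proc/mdstat; the array here was vendor hardware, which would need its own status tool in place of the file read, and all addresses are placeholders.

#!/usr/bin/env python3
"""Poll RAID health and shout when any member disk has dropped out.

A sketch only: assumes Linux software RAID (/proc/mdstat); a hardware
array would need its vendor's status tool in place of the file read.
"""
import re
import smtplib
from email.message import EmailMessage

ALERT_TO = "pathology-it@example.nhs.uk"  # placeholder address
SMTP_HOST = "localhost"                   # placeholder mail relay

def degraded_arrays(mdstat_text):
    """Return (array, member_status) pairs for arrays with a dead member.

    /proc/mdstat marks each member U (up) or _ (down) inside brackets,
    e.g. [UU_U] means the third disk in a four-disk array is gone.
    """
    problems, current = [], None
    for line in mdstat_text.splitlines():
        m = re.match(r"(md\d+) :", line)
        if m:
            current = m.group(1)
            continue
        status = re.search(r"\[([U_]+)\]", line)
        if current and status and "_" in status.group(1):
            problems.append((current, status.group(1)))
    return problems

def main():
    with open("/proc/mdstat") as f:
        problems = degraded_arrays(f.read())
    if not problems:
        return  # healthy: stay quiet (or log a heartbeat)
    msg = EmailMessage()
    msg["Subject"] = "RAID DEGRADED: " + ", ".join(n for n, _ in problems)
    msg["From"] = "raid-monitor@example.nhs.uk"  # placeholder sender
    msg["To"] = ALERT_TO
    msg.set_content("\n".join(f"{n}: members [{s}]" for n, s in problems))
    with smtplib.SMTP(SMTP_HOST) as smtp:
        smtp.send_message(msg)

if __name__ == "__main__":
    main()

Run from cron every few minutes, something like this turns a silent orange light into an email. The deeper fix, covered later, was agreeing whose job it was to run it at all.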
• Now heading into the weekend
• An engineer is on the way to assess the situation
• Put out comms
• Implement continuity plan
• Disks were not easy to come by since the system was so old
• Disks eventually arrived in the evening, after being sourced from the back of a warehouse somewhere in the South of England
• Engineer arrived late Friday night
• Replaced one disk and waited to see if the array would rebuild itself
• This did not happen
• On-call engineer from DXC, on-call support from Trust IT and, since there is no provision for on-call within Pathology IT, the idiot who found the problem
• Replaced the other two disks
• Now needed to restore the database from a backup, which should have been run in the early hours of the morning
• The backup which was run on the disk array was obviously destroyed
• But, we had a backup of this backup on a remote server
• On-Call IT staff did not know where this backup was kept
• Contacted the only person who knew the location, who was on annual leave
• Agreed to reconvene Saturday morning and restore the database from the backup
• The backup restore ran for 18 hours and did not work
• It became apparent that the backup was not a complete file
• In fact, we couldn't find any backup that was a complete file
• Later discovered that there was a reason for this.
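A check along the following lines would have exposed the problem long before crunch time: a nightly job that reads each backup end to end fails loudly on exactly the sort of truncated files we were left with. This is a minimal sketch, assuming the backups are gzipped tar archives; Telepath's real backup format may well differ, but the principle, a full read rather than a glance at the file listing, carries over.

#!/usr/bin/env python3
"""Prove a backup archive is complete by reading every byte of it.

A sketch only: assumes gzipped tar backups. A truncated or corrupt
archive fails partway through the read; a bare "does the file exist"
check, which is effectively what we had, never would.
"""
import hashlib
import sys
import tarfile

def verify(path):
    # 1. Stream every member to its end; truncation or corruption
    #    raises an error before the loop can finish.
    try:
        with tarfile.open(path, "r:gz") as tar:
            for member in tar:
                extracted = tar.extractfile(member)
                if extracted is None:  # directories, links, etc.
                    continue
                while extracted.read(1 << 20):
                    pass
    except (tarfile.TarError, EOFError, OSError) as exc:
        return False, f"unreadable: {exc}"
    # 2. Record a checksum so the off-site copy can be compared
    #    with the original before anyone has to rely on it.
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return True, digest.hexdigest()

if __name__ == "__main__":
    ok, detail = verify(sys.argv[1])
    print(("OK " if ok else "FAIL ") + detail)
    sys.exit(0 if ok else 1)

Even this is weaker than the real test, a periodic trial restore, which is what the Trust eventually put in place.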
• After stitching back together pieces of backed-up files, we ended up with
– All of the archive
– All of Blood Bank
– Nearly all of Chemistry
– Nothing else
• We had stopped using Haematology as a separate database some time ago
• When the disciplines were merged, the data was backed up
• No Haematology was lost
• Some work to do to get Blood Bank up and running and all of the interfaces working
• Some more work to do to get Blood Sciences back up and running
• Testing and assurance took some time
• Blood Bank was back up in just over two weeks
• Blood Sciences was up shortly after that
• We had a number of issues with the database crashing repeatedly whilst we performed this work
• The only Microbiology backup we could find was from May 2010, when we upgraded the operating system
• Using this fragment, it took three months to rebuild Microbiology
• Because of the specialist nature of the work and the lack of staff within Pathology IT, work did not start on the Microbiology rebuild until Blood Bank was up and running and most of Blood Sciences was functional
• We had a fallback process for IT downtime
• Some of the process was developed on the fly
• The longest previous outage was probably about four days
• We had no process designed to last for more than a week
• Not three weeks
• Certainly not three months
Summary – what went wrong?
• An assumption was made that DXC were monitoring the system
• Disk failures could have been detected early enough to avoid a crash
• The hardware was not updated; it was nearly 7 years old at the time of the crash
• The backup was not valid
• Only one person knew where the backup was kept
• We did not have a valid backup, and we didn’t know this until crunch time
Summary – what went wrong?
• “The Telepath system … is predominantly managed by a dedicated team of Pathology-based IT staff.”
• “Traditionally those colleagues now managing the Telepath system began their professional careers as Medical Laboratory Scientific Officers (MLSOs) … developing their interest in computer science alongside the automation of laboratory services from the mid 1970s …”
• The Pathology IT team lost several key staff over the course of several years, leading to a loss of knowledge and a loss of manpower
Summary – what went wrong?
• The Trust may have underestimated the importance of pathology until it didn’t have pathology
How did we deal with it?
• Reverted to a system of manually reporting results
• Outsourced some of our work, mostly blood sciences, to other labs within our STP region and Sheffield
• This wasn’t without difficulty, since different labs have different capacities and different methodologies
How did we deal with it?
• Had a lot of planning meetings and risk assessments, with a focus on patient safety
• Kept a constant stream of communication going out to users
• Trust IT took control of everything hardware related
How did we deal with it?
• Sent the disks away for data recovery
• (No useful data was recoverable)
• Attempted to write some code to help us utilise the functionality of our EPR and automate some of the process
• (So time-consuming that the rebuild was quicker)
• Commissioned an independent enquiry
What has happened as a result?
• Detailed SLA between Pathology and Trust IT
• Remote monitoring of the server and disk array
• Physical checks on the server and disk array
• Backup schedule altered
• Roles and responsibilities defined
• Testing of backups (finally) carried out (see the sketch below)
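As a hedged illustration of how the monitoring, schedule and testing above can be wired together, here is a sketch of a daily assurance job; every path, name and address in it is a placeholder. The design choice worth copying is that it reports every day, OK or FAIL, so silence itself becomes an alarm rather than a comfort.

#!/usr/bin/env python3
"""Send daily assurance that last night's backup exists and was verified.

A sketch only: assumes the verification step leaves a ".sha256" sidecar
next to each archive it has passed. Directory, addresses and relay are
placeholders.
"""
import datetime
import pathlib
import smtplib
from email.message import EmailMessage

BACKUP_DIR = pathlib.Path("/backups/telepath")  # placeholder path
MAX_AGE = datetime.timedelta(hours=26)          # nightly run plus slack
ALERT_TO = "pathology-it@example.nhs.uk"        # placeholder address

def status():
    backups = sorted(BACKUP_DIR.glob("*.tar.gz"),
                     key=lambda p: p.stat().st_mtime)
    if not backups:
        return "FAIL: no backup files present at all"
    newest = backups[-1]
    age = datetime.datetime.now() - datetime.datetime.fromtimestamp(
        newest.stat().st_mtime)
    if age > MAX_AGE:
        return f"FAIL: newest backup {newest.name} is {age} old"
    if not (newest.parent / (newest.name + ".sha256")).exists():
        return f"FAIL: {newest.name} was never verified"
    return f"OK: {newest.name} is present, fresh and verified"

def main():
    body = status()
    msg = EmailMessage()
    msg["Subject"] = "Telepath backup assurance: " + body.split(":")[0]
    msg["From"] = "backup-assurance@example.nhs.uk"  # placeholder sender
    msg["To"] = ALERT_TO
    msg.set_content(body)
    with smtplib.SMTP("localhost") as smtp:  # placeholder relay
        smtp.send_message(msg)

if __name__ == "__main__":
    main()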
What has happened as a result?
• Comprehensive disaster recovery and business continuity policy drawn up
What (still) hasn’t happened?
• The hardware has still not been upgraded
• It is now approaching 10 years old
• We still don’t have enough staff in IT
• We still don’t have a new LIMS
What were the positives?
• We did have some kind of business continuity plan
• And now we have a tried and tested one
• We had three years’ worth of detailed change logs for Telepath
• People now know a lot more about Pathology’s role in patient care in our Trust
• We served as a warning for other Trusts and for other departments within our Trust
• Other Trusts’ IT business cases were mysteriously accepted
• DXC were extremely helpful and supportive
• Other Trusts were incredibly helpful and supportive
Recommendations from independent investigation
• R1 The Service Level Agreement that describes the managed service currently provided by CSC/SCC should be reviewed and strengthened where required. The Trust must be assured that the level of service covered by this and similar contracts is adequate and that they do not expose the Trust to any unnecessary risks.
• R2 All contracts with external service providers must contain a statement of compliance with regard to adherence with Standard Operating Instructions covering the major aspects of work for which the third party is responsible/liable.
• R3 The Trust should work closely with CSC to seek confirmation that the successive failure of three disks in the RAID 10 array was indeed coincidental and not something more sinister that could reoccur in the future.
• R4 Where possible, system back-ups should be part of a centrally managed process. Regular (daily) checks must be in place to ensure that back-ups are complete and are capable of running a full restore. This process should be extended to all critical systems across the Trust or at the very least major departmental applications.
• R5 A clean image of the current version of the Telepath operating system and application across all laboratory disciplines must be retained and stored off site. This process should be extended to all systems across the Trust or at the very least major departmental applications.
• R6 A system must be in place with immediate effect to accurately record the names of those entering any of the Trust’s data centres as well as all activities that took place during access periods.
• R7 Electronic alerting and warning systems must be in place to ensure that all those who have responsibility for data centres, system back-ups and electronic storage receive automated assurance that back-ups are intact and complete.
• R8 Large departmental systems, their associated networks and data centres should be managed within the structure of the corporate Information Management & Technology Service. Current members of the Pathology IT team who spend more than 50% of their time managing information systems should equally move to the corporate service.
• R9 The Trust corporate IT team should work with the Pathology IT team and Primary Care system suppliers (EMIS and SYSTM ONE) to understand how technology can provide a more cohesive solution to recognise the relationship between Practices and the hospital providers of Pathology Services. NPeX (X-Lab) may offer a Lab-to-Lab connectivity option that the Trust would find useful.
• R10 In relation to R9, the Trust should also re-examine any Memoranda of Understanding or reciprocal arrangements that it has in place with other NHS providers to take account of critical system and service failures in the future. Included in this review should be any system or service that relies heavily on technology.