Incident Command for IT: What We Can Learn from the Fire Department
Brent Chapman
[email protected]
Great Circle Associates, Inc.
http://www.greatcircle.com
Slide 2 Incident Command for IT — Brent Chapman — [email protected] — USENIX/SAGE LISA — 8 Dec 2005 — © 2005 Great Circle Associates, Inc.
IT managers often need to manage incidents
- Security incidents
- Service outages
- Infrastructure failures
  - Power failures
  - Cooling failures
  - Connectivity failures
- … and so forth
Slide 3
Who manages emergencies daily?
- Public safety agencies
  - Fire departments
    - Urban & suburban
    - Forest & wildland
  - Police departments
  - Coast Guard
  - … etc.
Slide 4
How do public safety agencies…
- Organize themselves on the fly to deal with a major incident?
- Quickly and effectively coordinate the efforts of multiple agencies?
- Evolve the organization as the incident changes in scope, scale, or focus?
What can IT professionals learn from that?
Slide 5
For example…
- A car hits a fire hydrant
- Occupants are trapped and injured
- Water from hydrant floods an underground electrical transformer, causing a short circuit & an outage
- Who might be involved in response?
  - Fire department – rescue trapped occupants
  - Ambulance service – treat & transport victims
  - Police department – direct traffic & investigate
  - Water department – shut off hydrant
  - Electric company – deal with flooded transformer
- How to coordinate all that?
Slide 6
What needs to get done?
- Ambulance crew needs to treat & transport victims
- But first, fire department crew needs to extricate them from wreckage
- But before they can do that, water company needs to shut off water
- Which they can’t do until electric company safes the flooded transformer
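The chain of prerequisites on this slide is a small dependency graph. As a minimal sketch (the task labels and the `prerequisites` mapping are illustrative and not part of the talk), Python's standard `graphlib` module (Python 3.9+) can turn such a graph into a workable order:

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it can start.
# The labels are illustrative names for the steps listed on the slide.
prerequisites = {
    "treat and transport victims": {"extricate victims"},
    "extricate victims": {"shut off water"},
    "shut off water": {"make transformer safe"},
    "make transformer safe": set(),
}


def work_order(deps):
    """Return the tasks in an order that respects every prerequisite."""
    return list(TopologicalSorter(deps).static_order())


order = work_order(prerequisites)
for step, task in enumerate(order, start=1):
    print(step, task)
```

The same idea applies to the "figure out startup order" problem in a data center cold start: write down what depends on what, then let the ordering fall out of the graph.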
Slide 7
How do you organize this?
- Who is in charge?
- How do they figure out what needs to be done, and who can do it?
- How do assignments get made, so that
  - Everything necessary gets done
  - No effort gets duplicated
  - Everything is done safely
Slide 8
An even bigger example: Southern California Wildfires
- Fast-changing situation
  - Fire grows and moves as weather and winds shift
  - Plan evolves as situation & resources change
- Many agencies involved
  - Firefighters from dozens of cities, plus CDF, USFS, BLM, and military
  - Airborne water drop, transport, & scouting
  - Law enforcement to deal with residents
  - Support units (medical, kitchens, camps, fuel, etc.)
Slide 9
How about an IT example?
- Data center outage: total power failure
  - Utility service dropped, UPS didn’t take load, generator didn’t start in time
  - All systems went down hard (no shutdown)
- Need to
  - Ensure services transferred to alternate data center
  - Cold-start everything; figure out startup order
  - Check/fix systems as they’re brought back up
  - Diagnose and permanently fix power problem
  - Transfer services back from alternate data center
- Might take days, involve dozens of people
Slide 10
Other IT examples
- Service outages
- Security incidents
  - DoS attacks
  - Virus/worm outbreaks
  - Break-ins
- Adversarial terminations; layoffs
- Not just emergencies
  - Facility moves
  - Service deployments
  - Major upgrades
Slide 11
What do these types of incidents have in common?
- Timing might be a surprise
- Situation not perfectly understood at start
  - Learn as you go, and adjust on the fly
- Resources change over time
  - People come and go; not all together at start
  - Need ways to bring newcomers up to speed
  - Need ways to transfer responsibilities
Slide 12
So, what is ICS?
- Incident Command System
- Standardized organizational structure and set of operating principles
- Tools for command, control, and coordination of a response to an incident
- Provides means to coordinate efforts of multiple parties toward common goals
- Uses principles that have been proven to improve efficiency and effectiveness
Slide 13
History of ICS
- Developed in the 1970s to coordinate agencies dealing with yearly SoCal wildfires
- Has evolved since into national standard
- Now used by nearly all US public safety agencies
- Often required, to obtain state/Federal funding
Slide 14
Key ICS Principles
1. Modular & scalable organization structure
2. Manageable span of control
3. Unity of command
4. Explicit transfers of responsibility
5. Clear communications
6. Consolidated incident action plans
7. Management by objective
8. Comprehensive resource management
9. Designated incident facilities
Slide 15
Key Principle #1: Modular & scalable organization structure
Org chart:
  Command
    Operations
    Logistics
    Planning/Status
    Admin/Finance
- Functions are activated as needed for a particular incident
  - All incidents will have a Command Section
  - Almost all will have an Operations Section
  - Rest of sections are only used on larger/longer incidents
- On small incidents, multiple functions often handled by single person
Slide 16
Command Section
- Incident Commander (IC) responsible for overall management of incident
- IC initially also performs all 4 section chief roles (Operations, Logistics, Planning/Status, Admin/Finance), until delegated to somebody else
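One way to picture "IC performs every section chief role until delegated" is as a lookup with a default. The following is a hedged sketch only: the `Incident` class and its method names are hypothetical, invented here for illustration, and the names Bryan and Jonathan are borrowed from the worked scenario later in the deck.

```python
from dataclasses import dataclass, field

# The four section chief roles named on the slide.
SECTIONS = ("Operations", "Logistics", "Planning/Status", "Admin/Finance")


@dataclass
class Incident:
    """Hypothetical model: unfilled section chief roles fall back to the IC."""
    commander: str
    chiefs: dict = field(default_factory=dict)

    def delegate(self, section, person):
        """Explicitly hand one section chief role to somebody else."""
        if section not in SECTIONS:
            raise ValueError(f"unknown section: {section}")
        self.chiefs[section] = person

    def chief_of(self, section):
        """Return the delegated chief, or the IC if nobody was delegated."""
        if section not in SECTIONS:
            raise ValueError(f"unknown section: {section}")
        return self.chiefs.get(section, self.commander)


incident = Incident(commander="Bryan")
incident.delegate("Logistics", "Jonathan")
for section in SECTIONS:
    print(f"{section}: {incident.chief_of(section)}")
```

The point of the default is that no role is ever vacant: asking who runs a section always yields a name, even on a one-person incident.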
Slide 17
Operations Section
- This is where the real work happens
- Operations develops and executes plans to achieve the objectives set by Command
- Assists Command in development of Consolidated Incident Action Plan
- Typically the biggest section, by number of people
- Ops focus is now; Planning worries about later
Slide 18
Planning/Status Section
- Collects & evaluates info needed to prepare action plan
- Forecasts probable course of incident
- Plans for next day, next week, etc.
- Keeps track of what has been done, and what still needs to be done
- Keeps “current status & plans” info up to date, so that new arrivals can brief selves
Slide 19
Logistics Section
- Responsible for obtaining all resources, services, and support required to deal with the incident
- Responsibilities include facilities, transportation, supply, equipment maintenance & fueling, feeding & medical care of incident response personnel, etc.
- Is more important on big, long-running incidents; may not be needed on small or short incidents
Slide 20
Admin/Finance Section
- Responsible for tracking incident-related costs (including time & materials, if necessary for reimbursement)
- Also administers procurements arranged by Logistics
- Usually only activated on the very largest and longest-running incidents
Slide 21
Growing the ICS organization
- Initially, the senior-most first responder is the Incident Commander (IC)
  - IC responsibility may transfer to somebody else later, as incident grows, but that isn’t automatic
  - Generally better to keep the same IC, if feasible
    - Stuff gets lost during handoffs
  - If IC transfer does happen, it needs to be explicit
- One person often fills multiple slots on org chart
  - Initially, IC also heads other sections (Ops, etc.)
  - Delegates to others as necessary and possible
Slide 22
Key Principle #2: Manageable span of control
- Each supervisor should have 3-7 subordinates
  - 5 is ideal
- When necessary, as org grows, create new levels
- Division might be
  - Functional
  - Geographic
Org chart:
  Command: Bryan
    Operations: Bryan
      Networking: Rich (Colleen, Holt, Mark, John)
      Servers: Paul (Jeff, Mary, Todd)
    Logistics: Jonathan
    Planning/Status: Bryan
    Admin/Finance: Bryan
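The span-of-control rule lends itself to a quick mechanical check. This is a rough sketch under stated assumptions: the roster of nine responders is invented, `overloaded` and `split_team` are hypothetical helper names, and only the upper limit of 7 and the ideal of 5 come from the slide.

```python
MAX_SPAN = 7    # upper end of the 3-7 range on the slide
IDEAL_SPAN = 5  # "5 is ideal"


def overloaded(reports, max_span=MAX_SPAN):
    """Return, sorted, the supervisors with more direct reports than max_span."""
    return sorted(boss for boss, team in reports.items() if len(team) > max_span)


def split_team(team, size=IDEAL_SPAN):
    """Chunk a flat list of people into groups of at most `size`."""
    return [team[i:i + size] for i in range(0, len(team), size)]


# Hypothetical roster: nine responders all reporting straight to one person.
reports = {"Bryan": [f"responder{n}" for n in range(1, 10)]}

print(overloaded(reports))
groups = split_team(reports["Bryan"])
for number, group in enumerate(groups, start=1):
    print(f"group {number}: {', '.join(group)}")
```

Each resulting group would then get its own supervisor, which is the "create new levels" step. How to divide people (by function or by geography) remains a judgment call that a chunking helper like this does not make.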
Slide 23
Key Principle #3: Unity of command
- On incident, each person has 1 boss
  - Strict tree structure, all the way to the top
  - Everybody knows who they work for
  - Every supervisor knows who works for them
- Works better than matrix in an emergency
  - Doesn’t assume folks normally work together, or even know each other
- Makes communication & coordination easier, up/down tree, as organization grows & changes
- Reduces freelancing
Slide 24
Key Principle #4: Explicit transfers of responsibility
- Changes to organization are made explicitly
- More senior person doesn’t automatically take over upon arrival
  - Might, but only after briefing on status/plans from person they’re replacing, and explicit turnover (including notifying subordinates and superiors)
  - Person already in place is often better suited to handle current situation, and certainly is more up to speed
- Planning/Status keeps overall org chart updated
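The handoff rule can be read as a guard: no briefing and no notifications, no transfer. A minimal sketch follows, in which the `Role` class, its `transfer` method, and the `briefed` and `notified` parameters are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Role:
    """Hypothetical record of who currently holds a role, plus a handoff log."""
    name: str
    holder: str
    log: list = field(default_factory=list)

    def transfer(self, new_holder, briefed, notified):
        """Hand the role over only after a briefing, and record who was told."""
        if not briefed:
            raise ValueError(
                f"{new_holder} must be briefed before taking over {self.name}"
            )
        if not notified:
            raise ValueError(
                "subordinates and superiors must be notified of the handoff"
            )
        self.log.append((self.holder, new_holder, tuple(notified)))
        self.holder = new_holder


ic = Role("Incident Commander", "Bryan")
try:
    # A more senior arrival does not take over automatically.
    ic.transfer("Chris", briefed=False, notified=[])
except ValueError as error:
    print("refused:", error)
print("current IC:", ic.holder)
```

The log doubles as the record that Planning/Status would need to keep the overall org chart current.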
Slide 25
Key Principle #5: Clear communications
- Communicate clearly and completely, not in code
  - Reduces potential for confusion
  - Reduces time spent clarifying
  - Lets other people (including management) monitor
- Talk directly to resources, when possible
  - Use the tree to find, then work with them directly
  - Using tree also helps keep management informed
Slide 26
Key Principle #6: Consolidated action plans
- Command communicates top-level action plan for current operational period (hour, shift, day, etc.)
  - Plan states, at a high level, what organization is trying to accomplish right now
  - Section chiefs (Ops, Logistics, etc.) help develop plan
- Written plan is best
  - Makes it easier to keep everybody on target
  - Makes it easier for new arrivals to brief selves
- Rule of thumb: if it crosses organizational or specialty boundaries, write it down
Slide 27
Key Principle #7: Management by objective
- Tell people what you want to accomplish, not how
  - Let them figure out how to get it done
  - Gives them room to flexibly and creatively cope with changing circumstances
- For example, say “get a public web server back online with an ‘out of service’ notice for our customers”, not “take host xyz123, reload it with RedHat and Apache, move it to rack 7, …”
- Is generally faster to communicate, and the folks doing the work may know a better way than you
Slide 28
Key Principle #8: Comprehensive resource management
- All assets & personnel need to be tracked
  - So new resources can be used most effectively
  - So existing resources can be relieved
- Folks should “sign in” through Admin function, then wait for assignment
  - Helps ensure they’re put to best use
  - Might want to designate a “report to” site
  - Also simplifies briefing new arrivals
Slide 29
Key Principle #9: Designated incident facilities
- Command Post (CP) is key facility to identify – that’s where everybody can expect to find IC
  - If IC needs to leave CP, needs to transfer IC responsibility (temporarily or permanently) to someone who’ll still be there
- Also useful to designate “staging area” for new resources to report to upon arrival, for sign-in and assignment; may be at CP
Slide 30
ICS for IT in action…
- It’s a Tuesday morning, and everything is normal
- The company’s load is split 50/50 between its two data centers, in Sunnyvale and Mesa
- At about 9:30am, the NOC loses all monitoring of Sunnyvale, and the load doubles in Mesa
- The NOC suspects a network outage, begins to troubleshoot, and pages all NetOps managers, per their SOP
- Bryan, the Director of Operations, happens to be nearby, and diverts to the Sunnyvale DC
Slide 31
9:45am
- Bryan arrives at the DC at about the same time as Joe and Tom, two of the company’s installers
- In the parking lot, they notice that the facility’s generator is running
- Inside, they find that the lights are on, but all of the UPS-powered equipment (servers, network, etc.) is without power
Slide 32
First steps…
- Bryan calls the NOC:
  1) Informs them he’s activating ICS plan [clear communication]
  2) Asks them to page all NetOps personnel to report to DC conference room for assignment [staging area]
- Bryan directs Joe and Tom to switch off all systems, then investigate power problems. [serving in multiple roles; management by objective]
Org chart:
  Command: Bryan
    Operations: Bryan (Tom, Joe)
    Logistics: unassigned
    Planning/Status: unassigned
    Admin/Finance: unassigned
Slide 33
10:15am
- Cary, the facilities manager, arrives. Bryan asks him to take charge of investigating the UPS failure, while Joe and Tom continue to switch off systems to prevent unplanned restarts.
- Paul (the server team manager), Dave, and Karl (server sysadmins) arrive. Bryan asks Paul to direct them in preparing to bring servers back online. [span of control]
Org chart:
  Command: Bryan
    Operations: Bryan (Joe, Tom, Cary)
      Servers: Paul (Dave, Karl)
    Logistics: unassigned
    Planning/Status: unassigned
    Admin/Finance: unassigned
Slide 34
10:30am
- Chris, the NetOps VP, arrives. After a brief discussion with Bryan, they decide it makes most sense for Bryan to remain as IC, and for Chris to serve as liaison to rest of company. [explicit transfer of responsibility, not automatic upon arrival of more senior personnel]
Org chart:
  Command: Bryan
    Liaison: Chris
    Operations: Bryan (Joe, Tom, Cary)
      Servers: Paul (Dave, Karl)
    Logistics: unassigned
    Planning/Status: unassigned
    Admin/Finance: unassigned
Slide 35
10:45am
- Rich, Colleen, and John from the Networking team, and Bob (the Networking team manager) arrive.
- Bryan asks Rich to take charge of Colleen and John as the Networking team on this incident, and asks Bob to handle Planning/Status for the overall incident. [comprehensive resource management, using folks where most needed]
Org chart:
  Command: Bryan
    Liaison: Chris
    Operations: Bryan (Joe, Tom, Cary)
      Servers: Paul (Dave, Karl)
      Networking: Rich (John, Colleen)
    Logistics: unassigned
    Planning/Status: Bob
    Admin/Finance: unassigned
Slide 36
11:00am
- Paul needs more help with servers, so Bryan reassigns Tom and Joe to Paul’s team. [comprehensive resource management]
- Cary determines that they may need to run on generator power for several days, but that the fuel tank isn’t big enough for that. Bryan calls Jonathan, the group’s purchasing agent, and asks him to take on the Logistics role and arrange for refueling (& lunch!). [modular, expandable organization]
Org chart:
  Command: Bryan
    Liaison: Chris
    Operations: Bryan (Cary)
      Servers: Paul (Dave, Karl, Joe, Tom)
      Networking: Rich (John, Colleen)
    Logistics: Jonathan
    Planning/Status: Bob
    Admin/Finance: unassigned
Slide 37
And so forth…
- The organization changes, as the situation and resources change
- Following the ICS principles gives you a way to keep it all under control
- Could keep this going indefinitely, if needed
Slide 38
ICS Tips
- Establish ICS early in an incident
  - If you get off to a disorganized start, you’ll be playing catch-up forever
- Think of ICS as a toolbox full of tools
  - Choose the tools you need for the incident at hand
  - Keep it simple
- Practice ICS at every opportunity
  - If you use it for “routine” and pre-planned events like moves, upgrades, and deployments, your team will be more comfortable using it for “surprise” events like outages and security incidents
Slide 39
Learning more about ICS
- Free materials and online courses (FEMA): http://training.fema.gov/EMIWeb/IS/ICSResource
- Wikipedia entry describing ICS: http://en.wikipedia.org/wiki/Incident_Command_System
- UC Davis introduction to ICS: http://planit.ucdavis.edu/howto/incidentCmd.html
- Amateur Radio (ARRL) perspective on ICS: http://ema.arrl.org/fd/ICS_TM.htm
- Please support disaster relief groups such as Radio Response (http://www.radioresponse.org)
- These slides available in my blog: http://www.greatcircle.com/blog/
Brent Chapman
[email protected]
Great Circle Associates, Inc.
http://www.greatcircle.com