Event Analysis: ERO Event Analysis Data Base
August 1, 2012
Sam Holeman, Duke Energy, Chair of EA Subcommittee
Stuff Happens
2 RELIABILITY | ACCOUNTABILITY
Stuff Happens
My Daughter’s Wedding
Goals of the EA Process
• Promoting Reliability
• Developing a Culture of Reliability Excellence
• Collaboration
• Being a Learning Organization
• HELPING SYSTEM OPERATORS IMPROVE
• IT IS NOT COMPLIANCE
History
• EA Field Test began in October 2010
• Phase II Field Test began May 2011
• Approved ERO Process in February 2012
Categorization of Events
CAT 1 CAT 2 CAT 3 CAT 4 CAT 5
• Based on varying levels of significance
• Impacts on the interconnected system
Brief Report Template
• Draft within 5 business days
• Final Report in 10 days
• Quality steadily improving
• Focus on collaboration with Regions
Event Analysis Report
• CAT 3 and above require more information
• Timeline established but negotiable
Events by Category
Event Analysis Report (EAR) Submittals
(October 25, 2010 ‐ February 25, 2012); 28 EARs submitted since end of Field Trial = 121 total
[Bar chart: EAR submittals by Region (FRCC, MRO, NPCC, RFC, SERC, SPP, TRE, WECC)]
Candidate Lessons Learned
Not every event on the bulk power system (BPS) has a quality “Lesson” to share
• NERC looked at 230 qualifying events (Category 1 and above) and received 119 “candidates” for Lessons Learned
  - 55 of these came from the Cold Snap event of 2011
• Excluding the Cold Snap event, there were 64 other events which resulted in a Lesson Learned being submitted for consideration
• Twenty‐two Lessons Learned published in 2011, and thirteen to date in 2012
Lessons Learned – Published (2012)
Region | Lessons Learned Brief Description | Date
TRE | TRE-LL-05 – Plant Onsite Material and Personnel Needed for a Winter Weather Event | 1/06/2012
TRE | TRE-LL-06 – Plant Operator Training to Prepare for a Winter Weather Event | 1/06/2012
TRE | TRE-LL-07 – Transmission Facilities and Winter Weather Operations | 1/06/2012
NPCC | LL-54 – DC Supply and AC Transients | 3/06/2012
WECC | LL-58 – Saturated Bus Auxiliary Current Transformer causes Bus Differential Operations during Line Fault | 3/06/2012
TRE | TRE-LL-34 – Rotational Load Shed | 3/06/2012
WECC | LL-59 – Auxiliary Relay Contact Contamination | 6/19/2012
WECC | LL-60 – Remote Terminal Units not on DC Sources | 6/19/2012
WECC | LL-61 – EMS Database Corruption Problem | 6/19/2012
WECC | LL-62 – Unmanned Forklift contact with Energized Bus | 6/19/2012
RFC | LL-65 – Excessive Resource Utilization | 6/19/2012
TRE | LL-66 – Alarm Interpretation Leads to Generator Stator Coil Failure | 6/19/2012
NPCC | LL-67 – Protective Relaying Digital Input Board Loading | 6/19/2012
Event Trending *
[Control chart: Qualified events per month (October 25, 2010 ‐ June 25, 2012). Monthly average = 14.05 events; limit lines labeled 19.80 and 8.30]
* Control chart of monthly events, with control limits calculated by using 3‐month Moving Average method
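The footnote does not spell out the formula behind the control limits. One common textbook reading, shown here only as a sketch, is a moving-average control chart: the center line is the overall mean and the limits sit at mean ± 3σ/√w for a window of w = 3 months. The monthly counts, the function names, and the use of the sample standard deviation for σ are all assumptions made for illustration; they are not NERC's data or NERC's documented method.

```python
from math import sqrt
from statistics import mean, stdev


def moving_average(values, window=3):
    """Return the trailing moving average for each full window."""
    return [
        mean(values[i - window + 1 : i + 1])
        for i in range(window - 1, len(values))
    ]


def control_limits(values, window=3, k=3.0):
    """Center line and symmetric limits for a moving-average chart.

    Assumption: sigma is the sample standard deviation of the raw
    monthly counts, and the limits are mean +/- k * sigma / sqrt(window).
    """
    center = mean(values)
    half_width = k * stdev(values) / sqrt(window)
    return center - half_width, center, center + half_width


def out_of_control(values, window=3, k=3.0):
    """Indices into the moving-average series that fall outside the limits."""
    lcl, _, ucl = control_limits(values, window, k)
    return [
        i
        for i, avg in enumerate(moving_average(values, window))
        if avg < lcl or avg > ucl
    ]


# Hypothetical monthly event counts, NOT the values behind the slide.
monthly_events = [12, 15, 9, 18, 14, 16, 11, 13, 17, 15]
lcl, center, ucl = control_limits(monthly_events)
print(f"LCL={lcl:.2f}  center={center:.2f}  UCL={ucl:.2f}")
print("3-month moving averages:", moving_average(monthly_events))
print("out-of-control windows:", out_of_control(monthly_events))
```

The slide's own figures are symmetric about the monthly average (14.05 ± 5.75 gives 19.80 and 8.30), which is at least consistent with limits of this general form.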
Cause Code Definitions
Short Title | Definition
Design/Engineering Problem | An event or condition that can be traced to a defect in design or other factors related to configuration, engineering, layout, tolerances, calculations, etc.
Equipment/Material Problem | Is defined as an event or condition resulting from the failure, malfunction, or deterioration of equipment or parts, including instruments or material.
Individual Human Performance LTA | An event or condition resulting from the failure, malfunction, or deterioration of the individual human performance associated with the process.
Management Problem | An event or condition that could be directly traced to managerial actions, or methodology (or lack thereof).
Communications LTA | Inadequate presentation or exchange of information.
Other Problem | The problem was caused by factors beyond the control of the organization
LTA = Less Than Adequate
A‐Level Cause Code
Root Cause determinations (of 127 Total "Qualified" events with CC "entered")
[Pie chart. Legend: Design/Engineering Problem; Equipment/Material Problem; Individual Human Performance LTA; Management Problem; Communication LTA; Other Problem; No Causes Found; Information to determine cause LTA]
37% of the reports did not contain sufficient information to determine causal factors.
NERC has “Cause Coded” 174 Qualified Events (as of 6‐25‐2012). Of these events, we were able to assign some type of “Root Cause” coding for 127 events (~72%).
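The counts quoted on this slide and on the Identified Root Causes slide can be cross-checked with a few lines of arithmetic. The sketch below assumes that the events without an identified root cause (127 minus 80) are the same ones described as lacking sufficient information; that reading is an inference, although it reproduces the 37% figure. The ratio 127/174 comes out just under 73 percent, which the slide quotes as ~72%. Variable and function names are illustrative.

```python
# Counts quoted on the slides.
cause_coded = 174         # qualified events NERC has cause coded (as of 6-25-2012)
with_cause_entered = 127  # events with some type of root cause coding
identified = 80           # events on the "Identified Root Causes" chart

# Assumption: the remainder are the reports without sufficient
# information to determine causal factors.
insufficient_info = with_cause_entered - identified


def share(part, whole):
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


print(f"coded with a cause entered: {share(with_cause_entered, cause_coded):.1f}%")
print(f"insufficient information:   {share(insufficient_info, with_cause_entered):.1f}%")
print(f"identified root cause:      {share(identified, with_cause_entered):.1f}%")
```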
Identified Root Causes
[Pie chart: Identified Root Causes (80 events). Legend: Design/Engineering Problem; Equipment/Material Problem; Individual Human Performance LTA; Management Problem; Communication LTA; Other Problem. Two slices carry the callout "See Deeper dive Chart"]
Root cause for 80 events.
Deeper Dive into Management
"Management Problem" Cause Factors
A4B3C08 = Job Scoping did not identify special circumstances or conditions
A4B5C04 = Risks/consequences associated with change not adequately reviewed
A4B1C04 = Management follow‐up did not identify problems
A4B1C05 = Management assessment did not determine cause of previous event or known problem
A4B1C06 = Previous Industry or in‐house experience was not effectively used to prevent recurrence
A4B5C05 = System interactions not considered
[Bar chart: number of events per cause factor. X‐axis: A4B3C08 A4B5C04 A4B1C04 A4B1C05 A4B1C06 A4B5C05 A4B1C03 A4B1C08 A4B1C09 A4B3C09 A4B5C02 A4B5C03]
Deeper Dive into Equipment
"Equipment/Material Problem" Cause Factors
A2B6C01: Defective or failed part
A2B6C07: Software failure
A2B3C03: Post-maintenance/post-modification Testing LTA
A2B6C04: End-of-life failure
A2B6C06: Contaminant
A2B5C02: Fabricated item did not meet requirements
A2B3C02: Inspection/testing LTA
A2B5C04: Product acceptance requirements LTA
[Bar chart: number of events per cause factor. X‐axis: A2B6C01 A2B6C07 A2B3C03 A2B6C04 A2B6C06 A2B5C02 A2B3C02 A2B5C04]
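The cause factors on the two deeper-dive slides share a three-level pattern (an A-level category, a B-level group, and a C-level factor, as in A4B5C04). A minimal sketch of splitting such a code and looking up its description follows. The lookup tables contain only the entries listed on these slides; the regular expression, the `CauseCode` type, and the `parse_cause_code` helper are hypothetical names introduced for illustration and are not part of any NERC tool.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout: A<digit>B<digit>C<two digits>, as seen on the slides.
CAUSE_CODE = re.compile(r"^A(\d)B(\d)C(\d{2})$")

# Only the A-level categories that the deeper-dive slides name.
A_LEVEL = {
    "A2": "Equipment/Material Problem",
    "A4": "Management Problem",
}

# Only the C-level factors that the slides define.
C_LEVEL = {
    "A4B3C08": "Job Scoping did not identify special circumstances or conditions",
    "A4B5C04": "Risks/consequences associated with change not adequately reviewed",
    "A4B1C04": "Management follow-up did not identify problems",
    "A4B1C05": "Management assessment did not determine cause of previous event or known problem",
    "A4B1C06": "Previous Industry or in-house experience was not effectively used to prevent recurrence",
    "A4B5C05": "System interactions not considered",
    "A2B6C01": "Defective or failed part",
    "A2B6C07": "Software failure",
    "A2B3C03": "Post-maintenance/post-modification Testing LTA",
    "A2B6C04": "End-of-life failure",
    "A2B6C06": "Contaminant",
    "A2B5C02": "Fabricated item did not meet requirements",
    "A2B3C02": "Inspection/testing LTA",
    "A2B5C04": "Product acceptance requirements LTA",
}


class CauseCode(NamedTuple):
    a_level: str
    b_level: str
    c_level: str
    category: Optional[str]     # None if the A-level is not in the table
    description: Optional[str]  # None if the slides give no definition


def parse_cause_code(code: str) -> CauseCode:
    """Split a code such as 'A4B5C04' into its A, B, and C levels."""
    match = CAUSE_CODE.match(code.strip().upper())
    if match is None:
        raise ValueError(f"not a cause code: {code!r}")
    a, b, c = match.groups()
    a_level = f"A{a}"
    b_level = f"{a_level}B{b}"
    c_level = f"{b_level}C{c}"
    return CauseCode(a_level, b_level, c_level,
                     A_LEVEL.get(a_level), C_LEVEL.get(c_level))


print(parse_cause_code("A4B5C04"))
```

Codes that appear on the chart axis without a definition, such as A4B1C03, still parse but return `None` for the description.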
NERC Alert-Advisory
Configuration Control Practices – Advised industry of events resulting from human performance errors during protection system maintenance
Event examples of inadequate control procedures:
1. Relay technician failed to follow proper procedure to return protection system to normal state, resulting in remote trip
2. Construction team failed to use latest construction document, resulting in incorrect calibration of equipment
3. Relay technician leaves work site, returns to resume work but does so at wrong cabinet and trips substation
4. Technician trips a transformer due to opening a wrong current shorting switch
NERC Alert-Advisory
• EMS Alert Advisory Analysis ‐ During the Event Analysis (EA) field trial, 28 Category 2b events have occurred where a complete loss of SCADA/EMS lasted for more than 30 minutes. Analysis is currently being conducted to provide emerging trends for the industry
• Current analysis of these events has shown:
  - Software failure is a major contributing factor in 50 percent of the events
  - Testing of the equipment has been shown to be a factor in over 40 percent of the failures:
    o Test environment did not match the production environment
    o Product design (less than adequate)
  - Change Management has had an impact in over 50 percent of the failures:
    o Risk and consequences associated with change not properly managed
    o Identified changes not implemented in a timely manner
  - Individual operator skill‐based error was involved in 15 percent of the events...
Solving Problems: Untying the Knot
Malcolm K. Sparrow, John F. Kennedy School of Government, Harvard University
The Way Ahead
• Process must continually improve
• Need to combine processes when possible
• Better follow‐up as needed
• Tie in other data sources
• Provide not just data but information to industry
Not every event results in a succinct lesson learned, but we learn from every event.
Safety Check
Peer Check
The Way Ahead
• EAS Focus
• EA Process Document Annual Update
• EMS SCADA Task Force
• Registered Entity Reports to OC
• Summary of current lessons learned for OC
• Human Performance/Cause Code Task Force
Goal – HELP OPERATOR ON SHIFT GET BETTER
Questions and Answers