1 TDTWG Report to RMS SCR 745 ERCOT Unplanned System Outages Wednesday, July 13th

Upload: claire-strickland

Post on 12-Jan-2016


Page 1

TDTWG Report to RMS

SCR 745

ERCOT Unplanned

System Outages

Wednesday, July 13th

Page 2

Motion

SCR745 includes:

(1.) a system evaluation and

(2.) a recommended solution based on a review of the evaluation.

SCR745 will be sent to the TAC and Board for consideration and possible approval.

Page 3

SCR 745 Analysis Approach

• SCR 745 requested that ERCOT perform an in-depth analysis to determine the root causes of unplanned system outages.

• ERCOT’s in-depth analysis indicates that the current architecture supporting the Retail Market contains multiple single points of failure.

• While it is not possible to eliminate every possibility of an ERCOT system outage, it is possible to implement solutions that drastically reduce unplanned system outages by removing these single points of failure.

• This presentation includes the solutions identified.
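
The effect of removing single points of failure can be illustrated with a small availability calculation. The following is a minimal sketch, not part of the ERCOT evaluation: the component names and the 99.5% per-component figures are hypothetical, and the model assumes failures are independent.

```python
from math import prod


def series(availabilities):
    """Availability of a chain in which every component must be up."""
    return prod(availabilities)


def redundant(availability, copies=2):
    """Availability of independent, interchangeable copies where one is enough."""
    return 1 - (1 - availability) ** copies


# Hypothetical per-component availabilities (assumed, not from the source).
chain = {"proxy": 0.995, "file_server": 0.995, "database": 0.995}

single_instances = series(chain.values())
redundant_pairs = series(redundant(a) for a in chain.values())

print(f"single instances: {single_instances:.4%}")
print(f"redundant pairs:  {redundant_pairs:.4%}")
```

Under these assumptions, each unduplicated component lowers the availability of the whole chain, while duplicating each component raises it well above that of any single instance.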

Page 4

Diagram (retail systems architecture). Labeled elements: Market Participant, Internet, Firewall, DMZ, Switch, inbound and outbound Proxies, NAESB, PaperFree Process Servers, PaperFree File Server, TCH, EAI, Siebel, and a Single Retail Database Server (multiple Oracle databases: TCH, Siebel, NAESB, PaperFree). Platform labels: Solaris, W2K, HP. Key: Bi-Directional Data Flow, Outbound Data Flow, Inbound Data Flow, Single Point of Failure.

Retail Systems

• NAESB

• PaperFree

• TCH-EAI (Transaction Clearing House)

• All Retail (Database Server)

Page 5

System Options Included

The following options are presented to assist RMS in reviewing, and eventually approving, the best solutions for resolving unplanned ERCOT system outages.

RMS will make four selections:

1 of 2 options for NAESB Proxy Server improvements

1 of 3 options for NAESB Application (dependent on NAESB Proxy Server option)

1 of 2 options for PaperFree improvements

1 of 3 options for Database Server for All Retail

Page 6

Current NAESB Architecture

Diagram (current NAESB architecture). Labeled elements: Internet, Firewall, DMZ, Switch, inbound and outbound Proxies, NAESB, and the Retail Database Server (Oracle databases: TCH, Siebel, NAESB, PaperFree). Platform labels: Solaris, W2K, HP. Key: Bi-Directional Data Flow, Outbound Data Flow, Inbound Data Flow, Single Point of Failure.

The Retail Transaction communication system uses the North American Energy Standards Board Electronic Delivery Mechanism (NAESB EDM) v1.6, an Internet-based protocol.

The current NAESB architecture includes 2 NAESB Proxy servers in Taylor and 2 NAESB Proxy servers in Austin (to be used for disaster recovery only).

Due to the large quantity of data and critical timing for that data, the current NAESB architecture is insufficient for supporting the Texas Retail Market.

Page 7

NAESB Proxy Server Options

Option 1 – Fully Clustered* V880 Solution – 4 V880 NAESB Proxy Servers

Summary – Maximum reliability solution. This option will provide a fully clustered, fault-tolerant solution. It also offers an opportunity to consolidate the current 18 production proxy servers, including the servers identified in Option 2.

This option virtually eliminates the potential for NAESB proxy outages, unplanned or planned.

This option will provide 99.99% availability for the NAESB proxy servers.

*Cluster: A group of servers that are typically on different physical machines and have the same applications configured within them, but operate as a single logical server.
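
To put the 99.99% figure in context, an availability percentage can be converted into the downtime it permits. The sketch below assumes a 365-day year and continuous (24x7) service; the function name is illustrative only.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def allowed_downtime_minutes(availability_percent):
    """Minutes per year a service may be down at the given availability."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for pct in (99.0, 99.9, 99.99):
    minutes = allowed_downtime_minutes(pct)
    print(f"{pct}% availability -> {minutes:.1f} minutes of downtime per year")
```

At 99.99%, the permitted downtime is about 52.6 minutes per year, covering planned and unplanned outages combined.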

Page 8

NAESB Proxy Server Options

Option 2 – 4 V120 NAESB Proxy Servers.

Summary – Minimum reliability solution.

This option will provide redundancy to address the single point of failure. Two servers will be located in Taylor and two servers will be located in Austin.

This will not be a clustered solution; it will be a load-balanced solution. V120 servers cannot be clustered.

This solution will reduce the frequency and duration of proxy outages. It is less costly than Option 1 but also less robust.

Page 9

NAESB Application Options

Option 3 – Separate Application Server Cluster

This option moves peripheral NAESB processes (data encryption, decryption) to the PaperFree cluster and separates inbound and outbound transmissions to disconnected clusters.

Page 10

NAESB Application Options

Option 4 – Hybrid Application Cluster

This option creates an application cluster for inbound transactions and moves outbound transaction processing to the PaperFree system in order to utilize PaperFree’s load balancing and high availability capabilities.

Page 11

NAESB Application Options

Option 5 – Combined Application Cluster

This option combines inbound and outbound transaction processing into a single application cluster.

Page 12

Summary of NAESB Application Cost

Option 1 – V880 Server Cluster: $370,000

Option 2 – V120 Server Redundancy: $97,000

Option 3 – Separate Application Server Cluster: $175,000

Option 4 – Hybrid Application Cluster: $165,000

Option 5 – Combined Application Cluster: $235,000

Choose either Option 1 or Option 2, and one of Option 3, Option 4, or Option 5.

An additional cost of $66,105 has been identified for training, business process, and monitoring.
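
The selection rule above yields six possible combinations. The following sketch totals each one from the figures on this slide. It assumes that the $66,105 additional cost applies once to any combination and that any proxy option can be paired with any application option; the slides note a dependency between the two but do not spell it out.

```python
from itertools import product

# Costs taken from the slide; labels shortened for display.
PROXY_OPTIONS = {
    "Option 1 (V880 cluster)": 370_000,
    "Option 2 (V120 redundancy)": 97_000,
}
APPLICATION_OPTIONS = {
    "Option 3 (separate)": 175_000,
    "Option 4 (hybrid)": 165_000,
    "Option 5 (combined)": 235_000,
}
ADDITIONAL_COST = 66_105  # training, business process, monitoring (assumed to apply once)


def combination_totals():
    """Return (proxy, application, total) for every pairing, cheapest first."""
    rows = [
        (proxy, app, proxy_cost + app_cost + ADDITIONAL_COST)
        for (proxy, proxy_cost), (app, app_cost) in product(
            PROXY_OPTIONS.items(), APPLICATION_OPTIONS.items()
        )
    ]
    return sorted(rows, key=lambda row: row[2])


for proxy, app, total in combination_totals():
    print(f"{proxy} + {app}: ${total:,}")
```

Under these assumptions the totals range from $328,105 (Option 2 with Option 4) to $671,105 (Option 1 with Option 5).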

Blue highlighting identifies recommended solution

Page 13

PaperFree

Diagram (current PaperFree architecture). Labeled elements: PaperFree Process Servers, PaperFree File Server, PaperFree, and the Retail Database Server (Oracle databases: TCH, Siebel, NAESB, PaperFree). Platform labels: W2K, HP. Key: Bi-Directional Data Flow, Outbound Data Flow, Inbound Data Flow, Single Point of Failure.

PaperFree includes the data validation and transformation system. The current architecture contains a single disk share for multiple load-balanced application servers. This disk share is the single point of failure for this system.

Page 14

PaperFree Options

Option 1 – Clustered File System Server solution

This option represents the maximum availability solution.

Diagram, PaperFree (Option 1). Labeled elements: PaperFree Process Servers, File Server Cluster Virtual IP, PaperFree File Server Cluster, PaperFree, and the Retail Database Server (Oracle databases: TCH, Siebel, NAESB, PaperFree). Platform labels: W2K, HP. Key: Bi-Directional Data Flow, Outbound Data Flow, Inbound Data Flow, Single Point of Failure.

Page 15

PaperFree Options

Option 2 – Local File System Solution

– This option supports the load balancing applications

– The system will still be active after a single server failure; however, server interruptions may result in delays in processing persistent data for the server experiencing the interruption.

Diagram, PaperFree (Option 2). Labeled elements: PaperFree Process Servers, PaperFree, and the Retail Database Server (Oracle databases: TCH, NAESB, PaperFree). Platform label: HP. Key: Bi-Directional Data Flow, Outbound Data Flow, Inbound Data Flow, Single Point of Failure.

Page 16

Summary of PaperFree Costs

• Option 1 – Clustered File System Server Solution – $75,000

• Option 2 – Local File System Solution – $105,000

Blue highlighting identifies recommended solution

Page 17

All Retail System

Diagram (retail systems architecture, highlighting All Retail (Database Server)). Labeled elements: Internet, Firewall, DMZ, Switch, inbound and outbound Proxies, NAESB, PaperFree Process Servers, PaperFree File Server, TCH, EAI, Siebel, and a Single Retail Database Server (multiple Oracle databases: TCH, Siebel, NAESB, PaperFree). Platform labels: Solaris, W2K, HP. Key: Bi-Directional Data Flow, Outbound Data Flow, Inbound Data Flow, Single Point of Failure.

Page 18

All Retail System

Diagram (Retail Database Server). Labeled elements: Retail Database Server (Oracle databases: Siebel, NAESB, PaperFree). Platform label: HP. Key: Bi-Directional Data Flow, Outbound Data Flow, Inbound Data Flow, Single Point of Failure.

The All Retail System is the database server that houses each system’s database (NAESB, PaperFree, Siebel, and TCH-EAI). This database server is a single point of failure for multiple Retail Systems.

All Retail System Goal: Provide high availability for all databases that support the Retail Applications, including NAESB, PaperFree, Siebel, and TCH-EAI. This will allow processing of data to continue in the event of a database server failure.

Page 19

Database Server High Availability Options

Option 1 - All HP-UX Oracle Real Application Cluster (RAC)

Option 2 - All Linux Oracle Real Application Cluster (RAC)

For options 1 and 2:

Provides active redundancy for database connectivity for all retail databases

Complex to implement

Removes single point of failure at the database server level

Page 20

Database Server High Availability Options

Option 3:

– NAESB Linux Oracle RAC and a different standby/cluster solution for the rest of the Retail databases

• Provides active redundancy for database connectivity for the NAESB database

• Less complex to implement, as the NAESB database is small and easier to migrate

• Provides the option to migrate PaperFree and Siebel into this RAC

• Removes single point of failure at the database server level

– Veritas cluster, Oracle Standby, or Oracle RAC for the other databases on HP-UX or Linux, according to their availability requirements

• Phased implementation: NAESB first, other databases next

• Removes single point of failure at the database server level

Page 21

Database Server High Availability Options

• Summary

– All three options provide the highest-availability architecture for the NAESB database.

– Options 1 and 2 provide the highest-availability architecture for all databases; however, they are the most expensive and the most complex to implement and manage.

– Option 3 provides the highest-availability option for the NAESB database and will provide appropriate high-availability solutions for the rest of the retail databases in subsequent phases. It is easier to implement in a phased manner, addressing acute availability needs first.

Page 22

Summary of Database Server High Availability Costs

Cost

– Options 1 & 2: Oracle RAC

• Hardware – $450,000

• Cluster SW – $400,000

• Oracle RAC SW – $400,000

• Cluster Ext Service – $100,000

• Oracle RAC Ext Service – $100,000

• Internal project cost (FTE) – $180,000

• Total: $1,630,000

– Option 3: Partial Oracle RAC + alternate solution for remaining databases

• Hardware – $400,000 - $600,000

• Cluster SW – $100,000 - $400,000

• Oracle RAC SW – $0 - $400,000

• Cluster Ext Service – $0 - $120,000

• Oracle RAC Ext Service – $120,000 - $180,000

• Internal project cost (FTE) – $120,000 - $180,000

• Total: $890,000 - $1,650,000
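
The totals above can be checked against their line items. The sketch below is a simple arithmetic check using only the figures on this slide; the dictionary names are illustrative. The Options 1 & 2 items sum to the stated $1,630,000. For Option 3, summing every low end and every high end gives $740,000 and $1,880,000, which differs from the stated range of $890,000 - $1,650,000. The slide does not explain the difference; one possibility, not confirmed by the source, is that the line items are not independent (for example, a lower cluster software cost may imply a higher Oracle RAC software cost).

```python
# Line items for Options 1 & 2 (Oracle RAC), in dollars.
OPTIONS_1_2 = {
    "Hardware": 450_000,
    "Cluster SW": 400_000,
    "Oracle RAC SW": 400_000,
    "Cluster Ext Service": 100_000,
    "Oracle RAC Ext Service": 100_000,
    "Internal project cost (FTE)": 180_000,
}

# Line items for Option 3 as (low, high) ranges, in dollars.
OPTION_3 = {
    "Hardware": (400_000, 600_000),
    "Cluster SW": (100_000, 400_000),
    "Oracle RAC SW": (0, 400_000),
    "Cluster Ext Service": (0, 120_000),
    "Oracle RAC Ext Service": (120_000, 180_000),
    "Internal project cost (FTE)": (120_000, 180_000),
}

total_1_2 = sum(OPTIONS_1_2.values())
naive_low = sum(low for low, _ in OPTION_3.values())
naive_high = sum(high for _, high in OPTION_3.values())

print(f"Options 1 & 2 total: ${total_1_2:,}")
print(f"Option 3 naive bounds: ${naive_low:,} - ${naive_high:,}")
```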

Page 23

Next Steps

If recommended by RMS today, TDTWG will facilitate a technical workshop to be held before the next RMS meeting.

This workshop is intended to help RMS members and interested Market Participants review the in-depth system evaluation in order to select recommended solution(s) for approval at the August RMS meeting.

Page 24

Questions