CERES AuTomAted job Loading sYSTem (CATALYST) Operational Readiness Review December 19, 2013 1


Page 1: CERES AuTomAted job Loading sYSTem (CATALYST) › documents › intern_docs › pdf › ... · 2014-02-06 · Scope of Review (1 of 2) Scope: Evaluate operational readiness of the

CERES AuTomAted job Loading sYSTem (CATALYST)

Operational Readiness Review

December 19, 2013

1

Page 2

Background & Scope

Background: NPR 7120.7 - Operational Readiness Review (ORR) and GSFC SEL Document 84-101 consulted for review structure

Review Scope: CATALYST – Workflow management tool designed to construct, disposition & submit PGE jobs for CERES data production based on PGE / data range Production Requests (PRs)

2

Page 3

ORR Entrance and Success Criteria

Entrance Criteria

1) All validation testing has been completed.

2) Test failures and anomalies from validation testing have been resolved and the results incorporated into all supporting and enabling operational products.

3) All operational supporting and enabling products (e.g., facilities, equipment, documents, updated databases) that are necessary for the nominal and contingency operations have been tested and delivered/installed at the site(s) necessary to support operations.

4) Operations handbook has been approved.

5) Training has been provided to the users and operators on the correct operational procedures for the system.

6) Operational contingency planning has been accomplished, and all personnel have been trained.

Success Criteria

1) The system, including any enabling products, is determined to be ready to be placed in an operational status.

2) All applicable lessons learned for organizational improvement and systems have been captured.

3) All waivers and anomalies have been closed.

4) Systems hardware, software, personnel, and procedures are in place to support operations.

3

Page 4

Readiness Assessment

Stakeholder Concurrence

Documentation

Operational Concept

Systems

Acceptance Testing

Transitional Readiness

Maintenance

Mostly ready, no major outstanding actions

Not Ready, significant outstanding actions

Ready, no outstanding actions

Page 5

Background & Prior Milestones

Need
●  5 CERES Instruments on Terra, Aqua and S-NPP and 1 planned for JPSS-1 in near future
●  Science SW processing requires individual streams for each unique instrument & platform set → necessitates running multiple reprocessing & forward processing streams concurrently
●  Automation software will enable operations staff to manage many concurrent streams →
    ●  Optimizes throughput
    ●  Enables effective job management during “lights out” operation
    ●  Maximizes cluster utilization

5

Page 6

Background & Prior Milestones
●  Initial Workshop January 2012 to identify approach to streamline and automate CERES production
    ●  Established teams & responsibility for:
        ●  Requirements definition, Operations Concept creation, software development and evaluation and testing
    ●  Steering Group defined and composed of branch chief and PI-level stakeholders
    ●  Identified approach to leverage existing Production Request Database effort and interface with existing production environment
●  Requirements V1.0 baselined March 19, 2012
●  PR Tool transition to operations March 1, 2013
●  Test Readiness Review March 22, 2013

6

Page 7

CATALYST Team Members

Tammy Ayers, Reynold Byrd, Shawn Clark, Angel Cross, Tonya Davenport, Sharon Dukes-Allen, Jonathan Gleason, Chris Harris, Nelson Hillyer, Vertley Hopson, Walter Miller, Lindsay Parker, Pamela Rinsland, Josh Wilkins

7

Page 8

Stakeholders

●  CERES Principal Investigator
●  ASDC SIT Team
●  ASDC Operations Team
●  CERES DMT
●  CERES Science Team
●  Climate Modeling Community
●  All other CERES Data Users

8

Page 9

Scope of Review (1 of 2)
●  Scope: Evaluate operational readiness of the CERES AuTomAted job Loading sYSTem (CATALYST)
    ●  CATALYST Server software
    ●  CATALYST Operator’s Console software
    ●  Server interface to Operator’s Console
    ●  Interfaces between Server and external components including:
        ●  PR Tool web application
        ●  Logging Database
        ●  CERES Epilog Scripts
        ●  Ganglia
        ●  AMI Job Submission Scripts
        ●  Grid Engine
●  CATALYST Build 1.0 only implements CERES Edition 4 Clouds and Inversion processing stream (8 PGEs)

9

Page 10

Scope of Review (2 of 2)

●  CERES PR Tool not within scope of review
●  What is it?
    ●  PR Tool is a database & web interface used to store configuration information used to build one or more job calls to CERES PGEs
    ●  Sends user selected, PR-level information to the CATALYST Server
    ●  Content permission protected
●  Software update to PR Tool required to support CATALYST Go-Live
    ●  All changes successfully tested in PPE – Delivery considered low risk

10

Page 11

Simplified Current State Diagram

●  Current model: Operators view PR Tool output and manually initiate PGE jobs and epilog calls

[Diagram labels: PR Database; PR Web Interface; Human Operator (View PRs, Submit Jobs); AMI-P Production Environment; ASDC Epilogues]

11

Page 12

CATALYST Model

●  New paradigm: operator manages software, PR Tool communicates directly to CATALYST, and CATALYST builds and submits jobs

[Diagram labels: PR Database; PR Web Interface; CATALYST; AMI-P Production Environment; Logging Database; Human Operator (Manage Jobs); CATALYST Ops Console; ASDC Epilogues; XML-RPC links]

12

Page 13

13

CERES Production Environment

Page 14

Documentation

CATALYST Project Documentation

Project Requirements Document: CATALYST_Requirements_Document baseline_v1.1.pdf
Requirements Traceability Matrix: CATALYST_Requirements_Tracability_Matrix_121213.xls
Concept of Operations: CATALYST_CONOPS-Baseline_v1.0.pdf
CATALYST Operator’s Manual: CATALYST_opman_V2-937.pdf
Development Test Plan: CATALYST_test_plan_V2-937.pdf
PPE Test Plan: CATALYST_PPE_TEST_PLAN_baseline_1.0_121312.pdf
PPE Test Cases: http://ceres.larc.nasa.gov/Internal/catalyst_PPE_testcases.php
Test Readiness Review presentation: CATALYST_TRR_03222013.pdf
End-To-End Testing Verification Logs: http://ceres.larc.nasa.gov/Internal/catalyst_PPE_ende2end.php
CATALYST PPE Test Log: http://ceres.larc.nasa.gov/Internal/catalyst_PPE_testcases.php

14

Page 15

Supplemental/Reference Documents

Document Description: Document Name

State diagram of job states in CATALYST software and state transition points: CATALYST_StateFlowDiagram.pdf
Flow Chart of information starting at PR Tool and transitioning into and through the CATALYST Software: PRTool_to_CATALYST_WorkFlowDiagram_withStates.pdf
CATALYST/PR Tool Change Review Board Decision Flow Chart: CAPRCRiB_Workflow.pdf

15

Page 16

SOFTWARE OVERVIEW CATALYST Operational Readiness Review

Page 17

What is CATALYST?

●  PGE Execution/Coordination/Logging Framework
    –  Implements unique processing flow requirements for CERES
    –  Ingests PRs to create a collection of jobs
        ●  Job represents a single data date of a single PGE
        ●  Jobs execute on Univa Grid Engine
    –  CATALYST jobs wait for predecessor jobs to complete
        ●  Predecessor jobs can be:
            –  Internal to CATALYST
            –  External to CATALYST (via a backlogging interface)
    –  CATALYST jobs broadcast completion status to follow-on jobs for rapid follow-on execution
    –  CATALYST jobs store execution state long-term in a job logging database

17
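The wait-then-broadcast behavior described on this slide can be sketched in a few lines of Python. This is only an illustration under assumptions: the class name `JobGraph`, its methods, and the job identifiers are hypothetical and are not taken from the CATALYST source (which the review says is written in Perl, C, and Bash).

```python
from collections import defaultdict


class JobGraph:
    """Tracks which jobs are still waiting on predecessors (illustrative only)."""

    def __init__(self):
        self.pending = {}                   # job id -> set of unfinished predecessor ids
        self.followers = defaultdict(set)   # predecessor id -> ids of follow-on jobs
        self.done = set()                   # ids of completed jobs (internal or external)

    def add_job(self, job_id, predecessors=()):
        # Only predecessors that have not completed yet block the new job.
        unfinished = {p for p in predecessors if p not in self.done}
        self.pending[job_id] = unfinished
        for p in unfinished:
            self.followers[p].add(job_id)

    def ready(self):
        """Return jobs whose predecessors have all completed."""
        return sorted(j for j, waits in self.pending.items() if not waits)

    def complete(self, job_id):
        """Mark a job complete and broadcast that to its follow-on jobs."""
        self.done.add(job_id)
        self.pending.pop(job_id, None)
        for follower in self.followers.pop(job_id, ()):
            if follower in self.pending:
                self.pending[follower].discard(job_id)


# Hypothetical job ids: one PGE per data date, as the slide describes.
graph = JobGraph()
graph.add_job("clouds:2002-07-01")
graph.add_job("inversion:2002-07-01", predecessors=["clouds:2002-07-01"])
first_ready = graph.ready()
graph.complete("clouds:2002-07-01")
next_ready = graph.ready()
```

An external predecessor (one that CATALYST did not run itself) would be reported through the same `complete` call in this sketch; how the real backlogging interface does this is not described in the review.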

Page 18

What is CATALYST? - cont.

●  Application Programming Interface (API)
    –  Externally accessible XML-RPC API
    –  Allows users to inspect and modify job execution status programmatically in any language (with XML-RPC libraries)
    –  User authentication handled through LDAP
    –  Permissions handled with Access Control List

18
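To make "any language (with XML-RPC libraries)" concrete, the sketch below uses Python's standard `xmlrpc.client` module to build and decode an XML-RPC request. The method name `catalyst.getJobStatus` and the job identifier are hypothetical; the review does not list the actual API methods, so no real CATALYST call is shown.

```python
import xmlrpc.client

# Hypothetical method name and argument; the real CATALYST API is not
# specified in this review.
request_xml = xmlrpc.client.dumps(("job-42",), methodname="catalyst.getJobStatus")

# Decoding the payload recovers the parameters and the method name,
# which is what an XML-RPC server does before dispatching the call.
params, method = xmlrpc.client.loads(request_xml)

# Against a live server the same call would go through a proxy object, e.g.
#   proxy = xmlrpc.client.ServerProxy("http://localhost:8000/")  # assumed URL
#   proxy.catalyst.getJobStatus("job-42")
```

Because the wire format is plain XML over HTTP, a client in Java, Perl, or any other language with an XML-RPC library would produce an equivalent request.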

Page 19

CATALYST Implementation

●  CATALYST Server
    –  Job controller
    –  XML-RPC server
    –  Written in Perl, C, and Bash
    –  Multi-threaded
●  Operator's Console
    –  Graphical User Interface
    –  Communicates to CATALYST Server using XML-RPC protocol
    –  Written in Java
    –  Multi-threaded

19

Page 20

CATALYST Server Components

20

Page 21

CATALYST Server Components – XML-RPC Front-End

●  External-facing programming interface

●  Multi-threaded to handle concurrent client connections

●  Customized version of RPC::XML from Perl CPAN

–  Switched to use threads, instead of forking

●  Verifies users using the User Manager

●  Production Requests enter through here from:

–  PR Web Application

–  Standalone Test PR Submission Applications

21

Page 22

CATALYST Server Components – User Manager

●  Authenticates AMI users against AMI LDAP server

●  Maintains an ACL that grants AMI users CATALYST operation permissions

22

Page 23

CATALYST Server Components – PGE Handlers Pool

●  Collection of job generators tailored for each PGE

●  Follows Factory Pattern

●  Inputs – Production Request

●  Outputs – List of PGE jobs

23
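A factory that turns one Production Request into a list of per-date jobs might look like the following Python sketch. Everything here is an assumption for illustration: the class names, the `HANDLERS` registry, and the choice of a daily handler for `CER4.1-4.1P6` (a PGE name taken from the Dependencies slide, whose actual cadence is not stated in this review).

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class ProductionRequest:
    pge: str
    start: date
    end: date  # inclusive


@dataclass(frozen=True)
class Job:
    pge: str
    data_date: date


def daily_jobs(pr):
    """Handler that emits one job per data date in the requested range."""
    days = (pr.end - pr.start).days + 1
    return [Job(pr.pge, pr.start + timedelta(days=i)) for i in range(days)]


# Hypothetical registry: each PGE name maps to the generator tailored for it.
HANDLERS = {"CER4.1-4.1P6": daily_jobs}


def build_jobs(pr):
    """Factory entry point: pick the handler for the PR's PGE and run it."""
    try:
        handler = HANDLERS[pr.pge]
    except KeyError:
        raise ValueError(f"no handler registered for PGE {pr.pge!r}") from None
    return handler(pr)


jobs = build_jobs(ProductionRequest("CER4.1-4.1P6", date(2002, 7, 1), date(2002, 7, 3)))
```

Adding support for another PGE in this sketch means registering one more handler, which is the property the Factory Pattern bullet is pointing at.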

Page 24

CATALYST Server Components – POSIX_SYSCALL Pool

●  Invokes child processes
●  Used to prevent environment variable pollution amongst AJSS & ANGe Epilogue calls
●  Returns exit codes of child processes to the caller (if necessary)
●  Multi-threaded with 8 primary execution slots
    –  To maximize job launching
    –  To minimize overloading

24
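The idea of a bounded pool that launches children with their own environments and hands back exit codes can be sketched in Python as below. This is not the CATALYST implementation; the function names are hypothetical, and the child command is a stand-in that simply exits with the value of a `CODE` variable.

```python
import os
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

MAX_SLOTS = 8  # mirrors the "8 primary execution slots" on the slide


def run_isolated(argv, env):
    """Run one child with its own environment dict and return its exit code."""
    return subprocess.run(argv, env=env, check=False).returncode


def run_all(commands):
    """Run (argv, env) pairs with at most MAX_SLOTS children at a time."""
    with ThreadPoolExecutor(max_workers=MAX_SLOTS) as pool:
        futures = [pool.submit(run_isolated, argv, env) for argv, env in commands]
        return [f.result() for f in futures]


# Each child gets a private copy of the parent's environment plus its own
# variable; the parent's os.environ is never modified.
base_env = dict(os.environ)
child = "import os, sys; sys.exit(int(os.environ['CODE']))"
exit_codes = run_all(
    [([sys.executable, "-c", child], {**base_env, "CODE": str(n)}) for n in (0, 3)]
)
```

Passing an explicit `env` to every child is what keeps one call's variables from leaking into the next, and the fixed worker count caps how many launches run at once.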

Page 25

CATALYST Server Components – Logging Database Interface

●  Handles insertion/modification/ querying for CATALYST job objects

●  Translates CATALYST job objects into series of SQL queries

●  Results returned as CATALYST job objects

25
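A thin object-to-SQL layer of the kind described here can be sketched as follows. The table layout, class names, and field names are assumptions, and Python's built-in `sqlite3` stands in for the PostgreSQL logging database named on the Dependencies slide.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Job:
    job_id: int
    pge: str
    data_date: str
    state: str


class LogDb:
    """Translates Job objects to SQL and query results back to Job objects."""

    def __init__(self, conn):
        self.conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS jobs "
            "(job_id INTEGER PRIMARY KEY, pge TEXT, data_date TEXT, state TEXT)"
        )

    def upsert(self, job):
        # Insert a new row, or replace the row that has the same job_id.
        self.conn.execute(
            "INSERT OR REPLACE INTO jobs VALUES (?, ?, ?, ?)",
            (job.job_id, job.pge, job.data_date, job.state),
        )

    def find_by_state(self, state):
        rows = self.conn.execute(
            "SELECT job_id, pge, data_date, state FROM jobs "
            "WHERE state = ? ORDER BY job_id",
            (state,),
        )
        return [Job(*row) for row in rows]


db = LogDb(sqlite3.connect(":memory:"))
db.upsert(Job(1, "CER4.1-4.1P6", "2002-07-01", "Waiting"))
db.upsert(Job(2, "CER4.1-4.1P6", "2002-07-02", "Waiting"))
db.upsert(Job(1, "CER4.1-4.1P6", "2002-07-01", "Completed"))  # modification
waiting = db.find_by_state("Waiting")
```

Callers in this sketch never see SQL; they pass and receive `Job` objects, which matches the three bullets on the slide.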

Page 26

CATALYST Server Components – Cluster Resource Monitor

●  Assembles SGE and Ganglia information for cluster nodes
●  Presently used for presentation in the CATALYST Operator's Console
●  Future use will be for dynamic load balancing
    –  Help with running I/O intensive PGEs

26

Page 27

CATALYST Server Components – DRMAA Interface

●  Submits CATALYST Jobs to Univa Grid Engine using Distributed Resource Management Application API (DRMAA)
●  Applies:
    –  [Re]starting
    –  Pausing/Resuming
●  Collects:
    –  resource usage data
    –  exit status

27

Page 28

CATALYST Server Components – CATALYST Core

●  Ingests, monitors, executes, and deletes CATALYST jobs:
    –  Leveraging previously mentioned components
●  Broadcasts job completion to follow on PGEs/PRs
●  Job data paged in and out of memory to SQLite3 files on disk
    –  Reduces memory footprint
    –  Resumes to pre-shutdown state when started

28
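The "resume to pre-shutdown state" behavior can be illustrated with a small Python sketch that writes job data to a SQLite3 file and reads it back, as a restarted process would. The table name, JSON payload format, and function names are assumptions; the review does not describe how CATALYST lays out its SQLite3 files.

```python
import json
import os
import sqlite3
import tempfile


def save_jobs(path, jobs):
    """Page job data out of memory into a SQLite3 file."""
    conn = sqlite3.connect(path)
    with conn:  # commits on success
        conn.execute(
            "CREATE TABLE IF NOT EXISTS job_state (job_id TEXT PRIMARY KEY, payload TEXT)"
        )
        conn.executemany(
            "INSERT OR REPLACE INTO job_state VALUES (?, ?)",
            [(job_id, json.dumps(data)) for job_id, data in jobs.items()],
        )
    conn.close()


def load_jobs(path):
    """Page job data back in, as a restarted server would."""
    conn = sqlite3.connect(path)
    rows = conn.execute("SELECT job_id, payload FROM job_state").fetchall()
    conn.close()
    return {job_id: json.loads(payload) for job_id, payload in rows}


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "jobs.sqlite3")
    before = {"job-1": {"state": "Waiting"}, "job-2": {"state": "Executing"}}
    save_jobs(path, before)       # state at shutdown
    restored = load_jobs(path)    # state after restart
```

Keeping only the jobs currently being worked on in memory, with the rest on disk, is what gives the reduced footprint mentioned on the slide.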

Page 29

CATALYST Operator's Console

1.  Server Status Bar

2.  PR List (Subsystem & PGE)

3.  PR Progress Indicator

4.  Job Navigator

5.  Job Viewer

6.  Job Actions

7.  Connection Log

29

Page 30

CATALYST Operator's Console and Server Connection

●  SSH Tunneling:
    –  Uses the JCraft SSH2 library
    –  Bundled with Operator’s Console distribution
    –  Establishes an SSH tunnel to the selected host – users log in with LDAP SSH credentials
    –  All traffic from the console to server is routed over this tunnel
●  XML-RPC Handling:
    –  Uses Apache XML RPC library with 2 minute timeout feature for RPC requests
    –  Manages all synchronous and asynchronous RPC requests and callbacks

[Diagram labels: LaRCnet; SSH Tunnel; XML-RPC; AMI-P Head Node; CATALYST Server]

30

Page 31

CATALYST Job Life-Cycle

●  Uninitialized – Job object created from PR, stored on disk, not yet recorded to logging database

●  Submitted – Job is now recorded to logging database and has a CATALYST Job ID

●  Waiting – Job is waiting for notification of predecessor completion

●  Scheduled – Job is ready to run

●  Executing – Job is actively running

●  Completed – Job has finished, or has been deleted

31
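The life-cycle above can be written down as a small state machine. The state names come from the slide; the transition table is an assumption (each state may move to the next one in the listed order, or straight to Completed to cover deletion), and restarts or pauses mentioned elsewhere in the review are not modeled.

```python
from enum import Enum


class JobState(Enum):
    UNINITIALIZED = "Uninitialized"
    SUBMITTED = "Submitted"
    WAITING = "Waiting"
    SCHEDULED = "Scheduled"
    EXECUTING = "Executing"
    COMPLETED = "Completed"


# Assumed transitions: next state in slide order, or Completed (deletion).
ORDER = list(JobState)
ALLOWED = {
    state: {ORDER[i + 1], JobState.COMPLETED}
    for i, state in enumerate(ORDER[:-1])
}
ALLOWED[JobState.COMPLETED] = set()  # terminal in this sketch


def advance(current, target):
    """Return the new state, or raise if the transition is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target


state = JobState.UNINITIALIZED
for step in (JobState.SUBMITTED, JobState.WAITING, JobState.SCHEDULED,
             JobState.EXECUTING, JobState.COMPLETED):
    state = advance(state, step)
```

The state flow diagram listed under Supplemental/Reference Documents would be the authoritative source for the real transition points.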

Page 32

Dependencies

32

•  Perl 5.16.1

•  CERES Perl_Lib – SCCR 985

•  CERES AMI Job Submission Scripts – for Clouds/Inversion Edition4 Processing Chain PGEs

•  CERES PR Tool – SCCR 962

•  Univa Grid Engine 8.1 – (ops.q)

•  PostgreSQL 9.2 – (dsrvr205)

•  Ganglia 3.0.7

•  LDAP – (ab01.cluster.net)

•  ASDC ANGe Epilogue Scripts

•  CERES Beta2 Edition 4 Clouds and Inversion PGEs – CER4.1-4.1P6, CER4.1-4.2P4, CER4.1-4.2P5, CER4.1-4.3P3, CER4.5-6.1P4, CER4.5-6.1P5, CER4.5-6.2P3, CER4.5-6.4P2

Page 33

TEST APPROACH CATALYST Operational Readiness Review

Page 34

PPE Test Approach

•  Phase 1: Test Cases based on system requirements defined in CATALYST Requirements Document baselined 03/19/12
    •  B1.0.1-B1.0.4 (03/27/13 – 05/29/13)
•  Phase 2: Operational Scenarios (e.g., day-in-life, proof of concept)
    •  B1.0.2-B1.0.4 (06/05/13 – 12/18/13)
    •  Test execution with CATALYST and non-CATALYST PGEs

34

Page 35

PPE Test Approach

•  Phase 3: PGE input and output verification based on existing CERES documentation
    •  B1.0.2 (06/05/13 – 06/19/13)
•  Phase 4: End-To-End Testing: PRDB and Epilogue Interface Testing
    •  B1.0.2-B1.0.4 (08/13/13 – 12/17/13)

35

Page 36

Overview of Test Components ●  The ability to accept electronic production requests

●  The ability to interpret environment variables and special parameters for all instances as defined in the production request (PR) and subsystem Operator’s Manual

●  The ability to gracefully shut down

●  The ability to restart and restore from the state recorded prior to shutdown

●  The ability to submit CERES PGEs to a job scheduler

●  The ability to determine PGE preconditions prior to job submissions

●  The ability to track all job instances

●  The ability to log results of job submissions

●  The ability to interface with the CERES Epilogue scripts

●  Compliance with NASA LaRC IT security requirements

●  Access Controls

●  Usability of Operator’s Console

36

Page 37

PPE Test Assumptions

●  All Production Requests (PRs) associated with PPE testing are entered in the PPE PR database and are accurate.

●  All Clouds and Inversion PGEs for PPE testing have been certified for production

●  Subsystem Operator’s Manuals associated with PGEs are up to date

●  Staff members have been trained with Operator’s Console

●  Epilogue scripts are operational

●  The DPO is accessible

●  PPE PRDB is operational

37

Page 38

PPE TEST RESULTS CATALYST Operational Readiness Review

Page 39

Test Results

●  Requirements Results
    ●  73 Total Requirements
    ●  12 Deferred, 2 Partially Implemented, 2 Failed, 57 Passed
    ●  Requirements Failed Test Case: TC 18, Requirements 13a and 13b
●  Operational Test Results
    ●  CATALYST processed 3 months of Terra & Aqua concurrently over month, year and leap year boundaries
    ●  Successful execution of 8 PGEs from PR submission to ANGe Ingest into DPO (20 PRs)
    ●  CATALYST-generated science data compared successfully with manually-generated data
    ●  System with CATALYST software running successfully passed security scan

39

Page 40

Issues & Resolutions – JIRA Ticketing

Disposition              # Tickets
Closed                   45
In Progress              9
In Queue                 3
Open                     8
Under Review By Board    7
Total                    72

40
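As a quick consistency check, the counts reported on the Test Results and JIRA Ticketing slides can be summed and compared with the stated totals. The short Python sketch below uses only figures from those two slides; the variable names are illustrative.

```python
# Requirement dispositions from the Test Results slide.
requirements = {"Deferred": 12, "Partially Implemented": 2, "Failed": 2, "Passed": 57}
requirements_total = sum(requirements.values())

# Ticket dispositions from the JIRA Ticketing slide.
tickets = {"Closed": 45, "In Progress": 9, "In Queue": 3, "Open": 8,
           "Under Review By Board": 7}
tickets_total = sum(tickets.values())

# Share of requirements that passed, and share of tickets already closed.
pass_fraction = requirements["Passed"] / requirements_total
closed_fraction = tickets["Closed"] / tickets_total
```

Both sums agree with the totals on the slides (73 requirements and 72 tickets).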

Page 41

Test Results Summary

●  Successfully completed the 4-Phased test approach

●  Extended the test period to investigate and document anomalies that create less than optimal operating conditions and overhead for the operations staff
    ●  Documentation of existing workarounds
    ●  Scenarios that cause manual intervention with CATALYST
    ●  System recovery utilities

•  Support V1.0.4 as a go-live candidate upon the resolution/acceptable workaround for:
    •  CER-120 CATALYST shutdown unexpectedly
    •  CER-99 Upon deletion of a PR, delete associated science data created

41

Page 42

TRANSITION TO OPERATIONS CATALYST Operational Readiness Review

Page 43

Operational Support

43

Page 44

Operational Support

●  SIT and CATALYST developers available during normal business hours (9am – 5pm)

●  Support model based on current CERES Science Software process and is documented in Operations Procedures documents

44

Page 45

Early Operations Timeline

●  After promotion, CATALYST will execute 1 month of data as a ValR1 (Aqua, July 2002)
●  DMT staff will verify CATALYST ValR1 SSF output and compare to existing Beta2 Edition 4 SSF output in DPO
●  Run 1 stream (Terra only) for X data months
●  Begin running both streams

45

Page 46

Rollback Plan

●  Proceed with normal manual operations – “Business as usual”

●  Running jobs in CATALYST does not disable existing approach
    ●  Cannot alternate between running a given stream in CATALYST and non-CATALYST without delivering LogDB updates

●  Once a stream begins production in CATALYST, the manual production process will be discontinued unless the Rollback plan has been initiated

46

Page 47

MAINTENANCE CATALYST Operational Readiness Review

Page 48

Maintenance

●  CATALYST & PR Tool Change Review Board (CAPRCRiB) initiated in response to TRR RFA

●  Board currently chaired by CATALYST government project lead with membership representation spanning each key development and testing stakeholder area (Developers for CATALYST Server and OpsConsole, Developers for PR Tool Web and DB, CERES CM, SIT Testing, SIT or Operations Lead, DMT lead & ASDC Govt. Ops lead)

●  Board chair retains approval authority for all CATALYST software modifications and PR Tool updates affecting CATALYST ↔ PR Tool interface

●  Anyone with access to the project “ASDC-CERES” in JIRA can submit a change request – Ticket submitter required to verify success before ticket can be closed

●  CAPRCRiB implementation authority shared across ASDC & CSB – Chair position expected to rotate between respective government leads

48

Page 49

Change Process

●  JIRA leveraged to manage all CATALYST and PR Tool software updates

●  After board approves a change, implementation and test responsibility mirrors CERES science software delivery process
    ●  DMT implements change
    ●  CERES CM tests against test plan
    ●  SIT functionally tests deliveries and promotes to production

●  Due to shared maintenance responsibility, CAPRCRiB requires diverse membership representation

49

Page 50

50

CATALYST & PR Tool Change Review Board (CAPRCRiB) Work Flow

Page 51

CATALYST Versioning

v1.0.4

•  Major – Larger-scale modifications to primary components

•  Minor – Feature enhancements appended to existing framework

•  Patch – Bug fixes to existing features
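The major/minor/patch scheme can be illustrated with a small parser. This sketch is an assumption-laden example, not project tooling: it accepts the `v`, `V`, `b`, and `B` prefixes that appear in this review (for example `v1.0.4` and `B1.0.4`), does not handle suffixed builds such as `b1.0.4b`, and resets lower fields on a bump, which is a common convention rather than something the slide states.

```python
import re
from typing import NamedTuple


class Version(NamedTuple):
    major: int
    minor: int
    patch: int


def parse_version(text):
    """Parse strings such as 'v1.0.4' or 'B1.0.4' into a Version tuple."""
    match = re.fullmatch(r"[vVbB]?(\d+)\.(\d+)\.(\d+)", text.strip())
    if match is None:
        raise ValueError(f"not a major.minor.patch version: {text!r}")
    return Version(*(int(part) for part in match.groups()))


def bump(version, part):
    """Return the next version; lower fields reset to zero (assumed convention)."""
    if part == "major":
        return Version(version.major + 1, 0, 0)
    if part == "minor":
        return Version(version.major, version.minor + 1, 0)
    if part == "patch":
        return Version(version.major, version.minor, version.patch + 1)
    raise ValueError(f"unknown version part: {part!r}")


current = parse_version("v1.0.4")
after_bug_fix = bump(current, "patch")
after_feature = bump(current, "minor")
```

Because `Version` is a tuple, ordinary comparison orders builds correctly, so `after_feature > after_bug_fix` holds.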

Page 52

Next Build Required Change

1)  Add capability to enable/disable individual PGE epilog scripts – Currently all enabled or all disabled

2)  Capture and display epilogue script standard output/error

3)  Implement a more efficient way to view, locate, and summarize unsuccessful jobs in the Operator’s Console.

4)  Implement standalone utility that connects to CATALYST API to update Log DB and provide inspection and admin functionality

5)  Add easy menu-driven access to log files from Operator’s Console

52

Page 53

Planned Deliveries (1 of 3)

●  Build 1.1:
    ●  B1.0.4 required patches (previous chart)
    ●  Deferred Requirement 5j
    ●  New Inversion PGEs
●  Build 1.2:
    ●  API to update Log DB for forward processing streams
    ●  Basic load balancer, deferred requirement 15 (partially)
●  Build 1.3: (PGEs only)
    ●  Instrument Subsystem
    ●  ERBElike Subsystem
    ●  Remaining Clouds Subsystem
    ●  Remaining Inversion Subsystem

53

Page 54

Planned Deliveries (2 of 3)

●  Build 1.3:
    ●  Instrument Subsystem
    ●  ERBElike Subsystem
    ●  Remaining Clouds Subsystem
    ●  Remaining Inversion Subsystem
●  Build 2.0:
    ●  Expanded Operator’s Console search capabilities
    ●  Deferred requirements 15 & 15a - 15g
    ●  Improve “job” data structure → maintainability and search-ability

54

Page 55

Planned Deliveries (3 of 3)

●  Build 2.1:
    ●  TISA Gridding Subsystem
    ●  TISA Averaging Subsystem
●  Build 2.2:
    ●  TISA Gridding & Averaging remaining
●  Build 2.3:
    ●  SARB Subsystem

55

Page 56

RISKS CATALYST Operational Readiness Review

Page 57

Risks (1 of 3)

1) Epilogue Output & Success Tracking – CATALYST captures the exit code from the epilogue scripts but does not display epilogue standard output – Impact: Medium (3), Likelihood: High (3)

Mitigation: Near-term software update to capture output to log file

2) PR Deletion and PGE Data Clean – Deleting a PR does not remove PGE output data from production workspace – Impact: Medium (3), Likelihood: Low (2)

Mitigation: Fix to be implemented via Perl_Lib delivery prior to CATALYST Go-Live

3) Stale or Lost CATALYST Jobs – Once CATALYST jobs are submitted to UGE, the system cannot detect if the jobs are stale or lost – Impact: Low (2), Likelihood: Low (1)

Mitigation: Use CATALYST dedicated user account with restricted permissions. Software update planned for future build.

4) Data Storage – CATALYST expected to submit jobs at higher rates than current and produce more data products than current – Impact: High (4), Likelihood: Moderate (3)

Mitigation: Epilogue script modification to remove intermediate files in progress. Closely monitor disk usage and implement cleaning procedures

57

Page 58

Risks (2 of 3)

5) Epilogue Selection – CATALYST epilogue state management homogeneous for all PGEs – Impact: Low (2), Likelihood: Low (2)

Mitigation: Near-term software update will add functionality to disable/enable epilogues for individual PGEs

6) Logging Database Loss – CATALYST relies on Log DB to track successful jobs and identify subsequent PRs needing those inputs – Impact: Moderate (3), Likelihood: Low (1)

Watch: Database resides on production database server that is backed up multiple times daily. Log DB entries can be manually reconstructed from DPO and delivered, and the DBA can apply them to the Log DB

7) Server Unexpectedly Shuts Down – Operational boundary conditions not handled properly – Impact: High (4), Likelihood: Low-Medium (2) (JIRA Ticket 120)

Mitigation: Development team provided a list of known conditions, workarounds, and recovery procedures

58

Page 59

Risks (3 of 3)

CATALYST Project Risks

1  M  Epilog Output & Success Tracking
2  M  PR Deletion and PGE Data Clean
3  M  Manual Deletion of CATALYST Jobs
4  M  Data Storage
5  M  Epilogue Selection
6  W  Logging Database Loss
7  M  Server Unexpectedly Shuts Down

[Risk matrix figure: placement of risks 1 to 7 not recoverable from the transcript]

59

Page 60

CONCLUSION CATALYST Operational Readiness Review

Page 61

Remaining Pre-Operational Tasks

●  Capture remaining error cases and define avoidance & recovery procedures – Ops Manual
    ●  Replicate error conditions
●  Perl_Lib delivery to remove PGE output files
●  Compile lessons learned document
●  Complete CATALYST Design document – including API specifications

61

Page 62

Success Criteria Summary

ORR Success Criteria and Status

1) The system, including any enabling products, is determined to be ready to be placed in an operational status.
    Status:
    - CATALYST Server b1.0.4 & OpsConsole b1.0.4 ready to proceed with transition into early operations

2) All applicable lessons learned for organizational improvement and systems have been captured.
    Status:
    - In Progress

3) All waivers and anomalies have been closed.
    Status:
    - Operations anomalies documented & path to resolution identified

4) Systems hardware, software, personnel, and procedures are in place to support operations.
    Status:
    - Operations personnel trained
    - Leverage existing CERES environment hardware & software
    - Final OpsManual updates (work around procedures) in progress

62

Page 63

Conclusion & Readiness Assessment

•  Of the known existing deficiencies, a path to resolution or appropriate mitigation is in place for each

•  CATALYST team finds no “show stoppers” to prevent promoting candidate build (CATALYST Server b1.0.4 + Operator’s Console b1.0.4b)

•  Anticipate operational challenges due to known deficiencies

•  Benefit from hands on experience gained only after becoming operational will outweigh the impact of aforementioned mitigated risks.

Therefore, CATALYST build 1.0 is considered ready for transition into operations

63

Page 64

Backup Material

64