CERES AuTomAted job Loading sYSTem (CATALYST)
Operational Readiness Review
December 19, 2013
Background & Scope
Background: NPR 7120.7 - Operational Readiness Review (ORR) and GSFC SEL Document 84-101 consulted for review structure
Review Scope: CATALYST – Workflow management tool designed to construct, disposition & submit PGE jobs for CERES data production based on PGE / data range Production Requests (PRs)
ORR Entrance and Success Criteria
Entrance Criteria
1) All validation testing has been completed.
2) Test failures and anomalies from validation testing have been resolved and the results incorporated into all supporting and enabling operational products.
3) All operational supporting and enabling products (e.g., facilities, equipment, documents, updated databases) that are necessary for the nominal and contingency operations have been tested and delivered/installed at the site(s) necessary to support operations.
4) Operations handbook has been approved.
5) Training has been provided to the users and operators on the correct operational procedures for the system.
6) Operational contingency planning has been accomplished, and all personnel have been trained.
Success Criteria
1) The system, including any enabling products, is determined to be ready to be placed in an operational status.
2) All applicable lessons learned for organizational improvement and systems have been captured.
3) All waivers and anomalies have been closed.
4) Systems hardware, software, personnel, and procedures are in place to support operations.
Readiness Assessment
Assessment areas: Stakeholder Concurrence, Documentation, Operational Concept, Systems, Acceptance Testing, Transitional Readiness, Maintenance
Rating legend: Ready, no outstanding actions; Mostly ready, no major outstanding actions; Not Ready, significant outstanding actions
Background & Prior Milestones
Need
● 5 CERES Instruments on Terra, Aqua and S-NPP and 1 planned for JPSS-1 in near future
● Science SW processing requires individual streams for each unique instrument & platform set → necessitates running multiple reprocessing & forward processing streams concurrently
● Automation software will enable operations staff to manage many concurrent streams →
  ● Optimizes throughput
  ● Enables effective job management during “lights out” operation
  ● Maximizes cluster utilization
Background & Prior Milestones
● Initial Workshop January 2012 to identify approach to streamline and automate CERES production
● Established teams & responsibility for:
  ● Requirements definition, Operations Concept creation, software development and evaluation and testing
● Steering Group defined and composed of branch chief and PI-level stakeholders
● Identified approach to leverage existing Production Request Database effort and interface with existing production environment
● Requirements V1.0 baselined March 19, 2012
● PR Tool transition to operations March 1, 2013
● Test Readiness Review March 22, 2013
CATALYST Team Members
Tammy Ayers, Reynold Byrd, Shawn Clark, Angel Cross, Tonya Davenport, Sharon Dukes-Allen, Jonathan Gleason, Chris Harris, Nelson Hillyer, Vertley Hopson, Walter Miller, Lindsay Parker, Pamela Rinsland, Josh Wilkins
Stakeholders
● CERES Principal Investigator
● ASDC SIT Team
● ASDC Operations Team
● CERES DMT
● CERES Science Team
● Climate Modeling Community
● All other CERES Data Users
Scope of Review (1 of 2)
● Scope: Evaluate operational readiness of the CERES AuTomAted job Loading sYSTem (CATALYST)
  ● CATALYST Server software
  ● CATALYST Operator’s Console software
  ● Server interface to Operator’s Console
  ● Interfaces between Server and external components including:
    ● PR Tool web application
    ● Logging Database
    ● CERES Epilog Scripts
    ● Ganglia
    ● AMI Job Submission Scripts
    ● Grid Engine
● CATALYST Build 1.0 only implements CERES Edition 4 Clouds and Inversion processing stream (8 PGEs)
Scope of Review (2 of 2)
● CERES PR Tool not within scope of review
● What is it?
  ● PR Tool is a database & web interface used to store configuration information used to build one or more job calls to CERES PGEs
  ● Sends user-selected, PR-level information to the CATALYST Server
  ● Content permission protected
● Software update to PR Tool required to support CATALYST Go-Live
  ● All changes successfully tested in PPE – Delivery considered low risk
● Current model: Operators view PR Tool output and manually initiate PGE jobs and epilog calls
Simplified Current State Diagram
[Diagram labels: PR Database; PR Web Interface; Human Operator (View PRs, Submit Jobs); AMI-P Production Environment; ASDC Epilogues]
CATALYST Model
[Diagram labels: PR Database; PR Web Interface; CATALYST; AMI-P Production Environment; Logging Database; Human Operator (Manage Jobs); CATALYST Ops Console; ASDC Epilogues; connections labeled XML-RPC]
● New paradigm: operator manages software, PR Tool communicates directly with CATALYST, and CATALYST builds and submits jobs
CERES Production Environment
Documentation
CATALYST Project Documentation
Project Requirements Document: CATALYST_Requirements_Document baseline_v1.1.pdf
Requirements Traceability Matrix: CATALYST_Requirements_Tracability_Matrix_121213.xls
Concept of Operations: CATALYST_CONOPS-Baseline_v1.0.pdf
CATALYST Operator’s Manual: CATALYST_opman_V2-937.pdf
Development Test Plan: CATALYST_test_plan_V2-937.pdf
PPE Test Plan: CATALYST_PPE_TEST_PLAN_baseline_1.0_121312.pdf
PPE Test Cases: http://ceres.larc.nasa.gov/Internal/catalyst_PPE_testcases.php
Test Readiness Review presentation: CATALYST_TRR_03222013.pdf
End-To-End Testing Verification Logs: http://ceres.larc.nasa.gov/Internal/catalyst_PPE_ende2end.php
CATALYST PPE Test Log: http://ceres.larc.nasa.gov/Internal/catalyst_PPE_testcases.php
Supplemental/Reference Documents
Document Description: Document Name
State diagram of job states in CATALYST software and state transition points: CATALYST_StateFlowDiagram.pdf
Flow Chart of information starting at PR Tool and transitioning into and through the CATALYST Software: PRTool_to_CATALYST_WorkFlowDiagram_withStates.pdf
CATALYST/PR Tool Change Review Board Decision Flow Chart: CAPRCRiB_Workflow.pdf
SOFTWARE OVERVIEW
What is CATALYST?
● PGE Execution/Coordination/Logging Framework
  – Implements unique processing flow requirements for CERES
  – Ingests PRs to create a collection of jobs
    ● Job represents a single data date of a single PGE
    ● Jobs execute on Univa Grid Engine
  – CATALYST jobs wait for predecessor jobs to complete
    ● Predecessor jobs can be:
      – Internal to CATALYST
      – External to CATALYST (via a backlogging interface)
  – CATALYST jobs broadcast completion status to follow-on jobs for rapid follow-on execution
  – CATALYST jobs store execution state long-term in a job logging database
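The wait-for-predecessor and completion-broadcast behavior can be pictured with a small sketch. This is an illustration only, not the CATALYST implementation: the `Job` and `Scheduler` classes, the job IDs, and the state strings used here are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    """Hypothetical job record: one PGE for one data date."""
    job_id: str
    predecessors: set = field(default_factory=set)  # job IDs this job waits on
    state: str = "Waiting"


class Scheduler:
    """Toy dependency tracker (assumed design, not the real server)."""

    def __init__(self, jobs):
        self.jobs = {job.job_id: job for job in jobs}
        # Jobs with no predecessors are ready immediately.
        for job in self.jobs.values():
            if not job.predecessors:
                job.state = "Scheduled"

    def broadcast_completion(self, finished_id):
        """Record a completion and release followers that were waiting on it."""
        finished = self.jobs.get(finished_id)
        if finished is not None:  # external predecessors are not tracked here
            finished.state = "Completed"
        released = []
        for job in self.jobs.values():
            if finished_id in job.predecessors:
                job.predecessors.discard(finished_id)
                if not job.predecessors and job.state == "Waiting":
                    job.state = "Scheduled"
                    released.append(job.job_id)
        return released


jobs = [
    Job("clouds_20020701"),
    Job("inversion_20020701", predecessors={"clouds_20020701"}),
]
scheduler = Scheduler(jobs)
released = scheduler.broadcast_completion("clouds_20020701")
print(released)
```

Because a follower is released as soon as its last predecessor reports completion, no polling interval sits between one PGE finishing and the next becoming eligible to run.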
What is CATALYST? - cont.
● Application Programming Interface (API)
  – Externally accessible XML-RPC API
  – Allows users to inspect and modify job execution status programmatically in any language (with XML-RPC libraries)
  – User authentication handled through LDAP
  – Permissions handled with Access Control List
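To show why any language with an XML-RPC library can talk to such an API, the sketch below encodes and decodes one request using Python's standard `xmlrpc.client` module. The method name `catalyst.listJobs` and its parameters are hypothetical; the actual CATALYST method names are not given in this review.

```python
import xmlrpc.client

# Hypothetical method name and parameters, chosen only for illustration.
request_xml = xmlrpc.client.dumps(
    ("PR-0001", "Executing"),
    methodname="catalyst.listJobs",
)

# The payload is plain XML, so any XML-RPC library can decode it.
params, method = xmlrpc.client.loads(request_xml)
print(method, params)
```

A real client would send this payload over the server connection, for example through `xmlrpc.client.ServerProxy`, instead of decoding it locally.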
CATALYST Implementation
● CATALYST Server
  – Job controller
  – XML-RPC server
  – Written in Perl, C, and Bash
  – Multi-threaded
● Operator's Console
  – Graphical User Interface
  – Communicates with CATALYST Server using XML-RPC protocol
  – Written in Java
  – Multi-threaded
CATALYST Server Components
CATALYST Server Components – XML-RPC Front-End
● External-facing programming interface
● Multi-threaded to handle concurrent client connections
● Customized version of RPC::XML from Perl CPAN
– Switched to use threads, instead of forking
● Verifies users using the User Manager
● Production Requests enter through here from:
– PR Web Application
– Standalone Test PR Submission Applications
CATALYST Server Components – User Manager
● Authenticates AMI users against AMI LDAP server
● Maintains ACL to grant AMI users CATALYST operation permissions
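A minimal sketch of the authenticate-then-authorize split described above. The LDAP bind is replaced by a stand-in function so the example runs anywhere; the class name, user name, and operation names are hypothetical.

```python
class UserManager:
    """Toy user manager: authentication is delegated, authorization uses an ACL."""

    def __init__(self, authenticate, acl):
        self._authenticate = authenticate  # callable(user, password) -> bool
        self._acl = acl                    # {user: set of permitted operations}

    def is_allowed(self, user, password, operation):
        # Reject users who fail authentication before consulting the ACL.
        if not self._authenticate(user, password):
            return False
        return operation in self._acl.get(user, set())


def fake_ldap_bind(user, password):
    """Stand-in for an LDAP bind; a real deployment would query the LDAP server."""
    return (user, password) == ("operator1", "secret")


manager = UserManager(fake_ldap_bind, {"operator1": {"view", "pause"}})
print(manager.is_allowed("operator1", "secret", "pause"))
print(manager.is_allowed("operator1", "secret", "delete"))
```

Keeping the two checks separate means a valid cluster account alone grants nothing; each operation must also appear in that user's ACL entry.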
CATALYST Server Components – PGE Handlers Pool
● Collection of job generators tailored for each PGE
● Follows Factory Pattern
● Inputs – Production Request
● Outputs – List of PGE jobs
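The Factory Pattern mentioned above can be sketched as a registry that maps a PGE name to a handler class, with each handler turning a request into a list of jobs. The class names, the request fields, and the one-job-per-day granularity are assumptions; only the two PGE names come from the dependency list in this review.

```python
from datetime import date, timedelta


class PGEHandler:
    """Base job generator; subclasses tailor job construction to one PGE."""
    pge_name = None

    def build_jobs(self, start, end):
        # Assumed granularity: one job per data date in the inclusive range.
        days = (end - start).days + 1
        return [
            {"pge": self.pge_name, "data_date": start + timedelta(days=i)}
            for i in range(days)
        ]


class CloudsHandler(PGEHandler):
    pge_name = "CER4.1-4.1P6"


class InversionHandler(PGEHandler):
    pge_name = "CER4.5-6.1P4"


# Registry consulted by the factory function below.
HANDLERS = {cls.pge_name: cls for cls in (CloudsHandler, InversionHandler)}


def handler_for(pge_name):
    """Factory: return a handler instance for the requested PGE."""
    try:
        return HANDLERS[pge_name]()
    except KeyError:
        raise ValueError(f"no handler registered for {pge_name}") from None


production_request = {
    "pge": "CER4.1-4.1P6",
    "start": date(2002, 7, 1),
    "end": date(2002, 7, 3),
}
jobs = handler_for(production_request["pge"]).build_jobs(
    production_request["start"], production_request["end"]
)
print(len(jobs))
```

Adding support for another PGE then means adding one handler class to the registry, without touching the code that receives Production Requests.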
CATALYST Server Components – POSIX_SYSCALL Pool
● Invokes child processes
● Used to prevent environment variable pollution amongst AJSS & ANGe Epilogue calls
● Returns exit codes of child processes to the caller (if necessary)
● Multi-threaded with 8 primary execution slots
  – To maximize job launching
  – To minimize overloading
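The idea of a bounded pool of child-process slots with per-call environments can be sketched as follows. This is a Python stand-in rather than the server's Perl/C code; the function names and the `CODE` variable are hypothetical, and only the slot count of 8 comes from the slide.

```python
import os
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

MAX_SLOTS = 8  # the review cites 8 primary execution slots


def run_isolated(command, overrides):
    """Run a child process with its own environment copy and return its exit code."""
    # Each call gets a fresh copy of the parent environment plus its own
    # overrides, so variables set for one call cannot leak into another.
    env = {**os.environ, **overrides}
    completed = subprocess.run(command, env=env, capture_output=True)
    return completed.returncode


def run_all(calls):
    """Run (command, overrides) pairs on a bounded pool; results keep input order."""
    with ThreadPoolExecutor(max_workers=MAX_SLOTS) as pool:
        return list(pool.map(lambda call: run_isolated(*call), calls))


# Toy child that exits with the value of a per-call variable.
child = [sys.executable, "-c", "import os, sys; sys.exit(int(os.environ['CODE']))"]
exit_codes = run_all([(child, {"CODE": "0"}), (child, {"CODE": "3"})])
print(exit_codes)
```

Capping the pool size bounds how many launches run at once, which is one way to trade launch throughput against load on the submitting host.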
CATALYST Server Components – Logging Database Interface
● Handles insertion/modification/ querying for CATALYST job objects
● Translates CATALYST job objects into a series of SQL queries
● Results returned as CATALYST job objects
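A minimal sketch of an object-to-SQL interface of this kind. An in-memory SQLite database stands in for the production logging database so the example is self-contained, and the `JobRecord` fields and `jobs` table are a hypothetical schema, not the real one.

```python
import sqlite3
from dataclasses import astuple, dataclass


@dataclass
class JobRecord:
    """Hypothetical job object; the real CATALYST schema is not given here."""
    job_id: int
    pge: str
    data_date: str
    state: str


class LogDB:
    def __init__(self, connection):
        self.conn = connection
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS jobs "
            "(job_id INTEGER PRIMARY KEY, pge TEXT, data_date TEXT, state TEXT)"
        )

    def insert(self, job):
        # Parameterized query: the job object's fields become bound SQL values.
        self.conn.execute("INSERT INTO jobs VALUES (?, ?, ?, ?)", astuple(job))

    def set_state(self, job_id, state):
        self.conn.execute(
            "UPDATE jobs SET state = ? WHERE job_id = ?", (state, job_id)
        )

    def find_by_state(self, state):
        rows = self.conn.execute(
            "SELECT job_id, pge, data_date, state FROM jobs "
            "WHERE state = ? ORDER BY job_id",
            (state,),
        )
        # Rows are returned to the caller as job objects, not raw tuples.
        return [JobRecord(*row) for row in rows]


db = LogDB(sqlite3.connect(":memory:"))
db.insert(JobRecord(1, "CER4.1-4.1P6", "2002-07-01", "Submitted"))
db.insert(JobRecord(2, "CER4.1-4.1P6", "2002-07-02", "Submitted"))
db.set_state(1, "Completed")
completed = db.find_by_state("Completed")
print(completed)
```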
CATALYST Server Components – Cluster Resource Monitor
● Assembles SGE and Ganglia information for cluster nodes
● Presently used for presentation in the CATALYST Operator's Console
● Future use will be for dynamic load balancing
  – Help with running I/O intensive PGEs
CATALYST Server Components – DRMAA Interface
● Submits CATALYST Jobs to Univa Grid Engine using Distributed Resource Management Application API (DRMAA)
● Applies:
  – [Re]starting
  – Pausing/Resuming
● Collects:
  – Resource usage data
  – Exit status
CATALYST Server Components – CATALYST Core
● Ingests, monitors, executes, and deletes CATALYST jobs:
  – Leveraging previously mentioned components
● Broadcasts job completion to follow-on PGEs/PRs
● Job data paged in and out of memory to SQLite3 files on disk
  – Reduces memory footprint
  – Resumes to pre-shutdown state when started
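The page-out and resume behavior can be illustrated with a small on-disk store. The `JobStore` class, the table layout, and the JSON payload format are assumptions made for this sketch; the review states only that job data is paged to SQLite3 files and restored on startup.

```python
import json
import os
import sqlite3
import tempfile


class JobStore:
    """Keeps job data in an SQLite file so little needs to stay in memory."""

    def __init__(self, path):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS jobs (job_id TEXT PRIMARY KEY, payload TEXT)"
        )

    def page_out(self, job_id, job):
        # Serialize the job and commit so it survives a shutdown.
        self.conn.execute(
            "INSERT OR REPLACE INTO jobs VALUES (?, ?)", (job_id, json.dumps(job))
        )
        self.conn.commit()

    def page_in(self, job_id):
        row = self.conn.execute(
            "SELECT payload FROM jobs WHERE job_id = ?", (job_id,)
        ).fetchone()
        return json.loads(row[0]) if row else None

    def close(self):
        self.conn.close()


path = os.path.join(tempfile.mkdtemp(), "jobs.sqlite3")

store = JobStore(path)
store.page_out("job-1", {"pge": "CER4.1-4.1P6", "state": "Waiting"})
store.close()  # simulate a server shutdown

restarted = JobStore(path)  # simulate a restart against the same file
restored = restarted.page_in("job-1")
restarted.close()
print(restored)
```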
CATALYST Operator's Console
1. Server Status Bar
2. PR List (Subsystem & PGE)
3. PR Progress Indicator
4. Job Navigator
5. Job Viewer
6. Job Actions
7. Connection Log
CATALYST Operator's Console and Server Connection
[Diagram labels: LaRCnet; SSH Tunnel; XML-RPC; CATALYST Server; AMI-P Head Node]
● SSH Tunneling:
  – Uses the JCraft SSH2 library
  – Bundled with Operator’s Console distribution
  – Establishes an SSH tunnel to the selected host – users log in with LDAP SSH credentials
  – All traffic from the console to server is routed over this tunnel
● XML-RPC Handling:
  – Uses Apache XML-RPC library with 2-minute timeout feature for RPC requests
  – Manages all synchronous and asynchronous RPC requests and callbacks
CATALYST Job Life-Cycle
● Uninitialized – Job object created from PR, stored on disk, not yet recorded to logging database
● Submitted – Job is now recorded to logging database and has a CATALYST Job ID
● Waiting – Job is waiting for notification of predecessor completion
● Scheduled – Job is ready to run
● Executing – Job is actively running
● Completed – Job has finished, or has been deleted
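The life-cycle above can be written as a small state machine. The state names come from the slide; the transition rules are assumptions made for this sketch (states advance in the listed order, and deletion can move a job to Completed from any earlier state). The authoritative transitions are in the state flow diagram referenced in the documentation list.

```python
from enum import Enum


class JobState(Enum):
    UNINITIALIZED = "Uninitialized"
    SUBMITTED = "Submitted"
    WAITING = "Waiting"
    SCHEDULED = "Scheduled"
    EXECUTING = "Executing"
    COMPLETED = "Completed"


# Assumed forward path: the order in which the states are listed.
_ORDER = list(JobState)


def allowed_transitions(state):
    """Return the states reachable from `state` under the assumptions above."""
    if state is JobState.COMPLETED:
        return set()
    next_state = _ORDER[_ORDER.index(state) + 1]
    # Deletion moves a job to Completed from any earlier state.
    return {next_state, JobState.COMPLETED}


def advance(state, target):
    """Return `target` if the transition is allowed, otherwise raise ValueError."""
    if target not in allowed_transitions(state):
        raise ValueError(f"illegal transition: {state.value} -> {target.value}")
    return target


state = advance(JobState.WAITING, JobState.SCHEDULED)
print(state.value)
```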
Dependencies
• Perl 5.16.1
• CERES Perl_Lib – SCCR 985
• CERES AMI Job Submission Scripts – for Clouds/Inversion Edition4 Processing Chain PGEs
• CERES PR Tool – SCCR 962
• Univa Grid Engine 8.1 – (ops.q)
• PostgreSQL 9.2 – (dsrvr205)
• Ganglia 3.0.7
• LDAP – (ab01.cluster.net)
• ASDC ANGe Epilogue Scripts
• CERES Beta2 Edition 4 Clouds and Inversion PGEs – CER4.1-4.1P6, CER4.1-4.2P4, CER4.1-4.2P5, CER4.1-4.3P3, CER4.5-6.1P4, CER4.5-6.1P5, CER4.5-6.2P3, CER4.5-6.4P2
TEST APPROACH
PPE Test Approach
• Phase 1: Test Cases based on system requirements defined in CATALYST Requirements Document baselined 03/19/12
  • B1.0.1-B1.0.4 (03/27/13 – 05/29/13)
• Phase 2: Operational Scenarios (e.g., day-in-life, proof of concept)
  • B1.0.2-B1.0.4 (06/05/13 – 12/18/13)
  • Test execution with CATALYST and non-CATALYST PGEs
PPE Test Approach
• Phase 3: PGE input and output verification based on existing CERES documentation
  • B1.0.2 (06/05/13 – 06/19/13)
• Phase 4: End-To-End Testing: PRDB and Epilogue Interface Testing
  • B1.0.2-B1.0.4 (08/13/13 – 12/17/13)
Overview of Test Components
● The ability to accept electronic production requests
● The ability to interpret environment variables and special parameters for all instances as defined in the production request (PR) and subsystem Operator’s Manual
● The ability to gracefully shut down
● The ability to restart and restore from the state recorded prior to shutdown
● The ability to submit CERES PGEs to a job scheduler
● The ability to determine PGE preconditions prior to job submissions
● The ability to track all job instances
● The ability to log results of job submissions
● The ability to interface with the CERES Epilogue scripts
● Compliance with NASA LaRC IT security requirements
● Access Controls
● Usability of Operator’s Console
PPE Test Assumptions
● All Production Requests (PRs) associated with PPE testing are entered in the PPE PR database and are accurate.
● All Clouds and Inversion PGEs for PPE testing have been certified for production
● Subsystem Operator’s Manuals associated with PGEs are up to date
● Staff members have been trained with Operator’s Console
● Epilogue scripts are operational
● The DPO is accessible
● PPE PRDB is operational
PPE TEST RESULTS
Test Results
● Requirements Results
  ● 73 Total Requirements
  ● 12 Deferred, 2 Partially Implemented, 2 Failed, 57 Passed
  ● Failed requirements: 13a and 13b (Test Case TC 18)
● Operational Test Results
  ● CATALYST processed 3 months of Terra & Aqua concurrently over month, year and leap year boundaries
  ● Successful execution of 8 PGEs from PR submission to ANGe Ingest into DPO (20 PRs)
  ● CATALYST-generated science data compared successfully with manually-generated data
  ● System with CATALYST software running successfully passed security scan
Issues & Resolutions – JIRA Ticketing
Disposition # Tickets
Closed 45
In Progress 9
In Queue 3
Open 8
Under Review By Board 7
Total 72
Test Results Summary
● Successfully completed the 4-Phased test approach
● Extended the test period to investigate and document anomalies that create less than optimal operating conditions and overhead for the operations staff
  ● Documentation of existing workarounds
  ● Scenarios that cause manual intervention with CATALYST
  ● System recovery utilities
• Support V1.0.4 as a go-live candidate upon the resolution of, or an acceptable workaround for:
  • CER-120: CATALYST shuts down unexpectedly
  • CER-99: Upon deletion of a PR, delete associated science data created
TRANSITION TO OPERATIONS
Operational Support
● SIT and CATALYST developers available during normal business hours (9am – 5pm)
● Support model is based on the current CERES Science Software process and is documented in Operations Procedures documents
Early Operations Timeline
● After promotion, CATALYST will execute 1 month of data as a ValR1 (Aqua, July 2002)
● DMT staff will verify CATALYST ValR1 SSF output and compare to existing Beta2 Edition 4 SSF output in DPO
● Run 1 stream (Terra only) for X data months
● Begin running both streams
Rollback Plan
● Proceed with normal manual operations – “Business as usual”
● Running jobs in CATALYST does not disable existing approach
  ● Cannot alternate between running a given stream in CATALYST and non-CATALYST without delivering LogDB updates
● Once a stream begins production in CATALYST, the manual production process will be discontinued unless the Rollback plan has been initiated
MAINTENANCE
Maintenance
● CATALYST & PR Tool Change Review Board (CAPRCRiB) initiated in response to TRR RFA
● Board currently chaired by CATALYST government project lead with membership representation spanning each key development and testing stakeholder area: (Developers for CATALYST Server and OpsConsole, Developers for PR Tool Web and DB, CERES CM, SIT Testing, SIT or Operations Lead, DMT lead & ASDC Govt. Ops lead)
● Board chair retains approval authority for all CATALYST software modifications and PR Tool updates affecting CATALYST ↔ PR Tool interface
● Anyone with access to the project “ASDC-CERES” in JIRA can submit a change request – Ticket submitter required to verify success before ticket can be closed
● CAPRCRiB implementation authority shared across ASDC & CSB – Chair position expected to rotate between respective government leads
Change Process
● JIRA leveraged to manage all CATALYST and PR Tool software updates
● After board approves a change, implementation and test responsibility mirrors CERES science software delivery process
  ● DMT implements change
  ● CERES CM tests against test plan
  ● SIT functionally tests deliveries and promotes to production
● Due to shared maintenance responsibility, CAPRCRiB requires diverse membership representation
CATALYST & PR Tool Change Review Board (CAPRCRiB) Work Flow
CATALYST Versioning
v1.0.4
• Major – Larger-scale modifications to primary components
• Minor – Feature enhancements appended to existing framework
• Patch – Bug fixes to existing features
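As a worked example of the Major/Minor/Patch scheme, the sketch below parses a version string such as `v1.0.4` and computes the next version for each kind of change. The `Version` class is hypothetical, and resetting the lower-order fields to zero on a bump is an assumption, not a rule stated in this review.

```python
from typing import NamedTuple


class Version(NamedTuple):
    major: int
    minor: int
    patch: int

    @classmethod
    def parse(cls, text):
        """Parse strings such as 'v1.0.4' into numeric fields."""
        major, minor, patch = (int(part) for part in text.lstrip("vV").split("."))
        return cls(major, minor, patch)

    def bump(self, kind):
        # Assumption: lower-order fields reset to zero when a higher one increases.
        if kind == "major":
            return Version(self.major + 1, 0, 0)
        if kind == "minor":
            return Version(self.major, self.minor + 1, 0)
        if kind == "patch":
            return Version(self.major, self.minor, self.patch + 1)
        raise ValueError(f"unknown change kind: {kind}")

    def __str__(self):
        return f"v{self.major}.{self.minor}.{self.patch}"


current = Version.parse("v1.0.4")
print(current.bump("patch"), current.bump("minor"), current.bump("major"))
```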
Next Build Required Change
1) Add capability to enable/disable individual PGE epilog scripts – Currently all enabled or all disabled
2) Capture and display epilogue script standard output/error
3) Implement a more efficient way to view, locate, and summarize unsuccessful jobs in the Operator’s Console.
4) Implement standalone utility that connects to CATALYST API to update Log DB and provide inspection and admin functionality
5) Add easy menu-driven access to log files from Operator’s Console
Planned Deliveries (1 of 3)
● Build 1.1:
  ● B1.0.4 required patches (previous chart)
  ● Deferred Requirement 5j
  ● New Inversion PGEs
● Build 1.2:
  ● API to update Log DB for forward processing streams
  ● Basic load balancer, deferred requirement 15 (partially)
● Build 1.3: (PGEs only)
  ● Instrument Subsystem
  ● ERBElike Subsystem
  ● Remaining Clouds Subsystem
  ● Remaining Inversion Subsystem
Planned Deliveries (2 of 3)
● Build 1.3:
  ● Instrument Subsystem
  ● ERBElike Subsystem
  ● Remaining Clouds Subsystem
  ● Remaining Inversion Subsystem
● Build 2.0:
  ● Expanded Operator’s Console search capabilities
  ● Deferred requirements 15 & 15a - 15g
  ● Improve “job” data structure → maintainability and search-ability
Planned Deliveries (3 of 3)
● Build 2.1:
  ● TISA Gridding Subsystem
  ● TISA Averaging Subsystem
● Build 2.2:
  ● TISA Gridding & Averaging remaining
● Build 2.3:
  ● SARB Subsystem
RISKS
Risks (1 of 3)
1) Epilogue Output & Success Tracking – CATALYST captures the exit code from the epilogue scripts but does not display epilogue standard output – Impact: Medium (3), Likelihood: High (3)
Mitigation: Near-term software update to capture output to log file
2) PR Deletion and PGE Data Clean – Deleting a PR does not remove PGE output data from production workspace – Impact: Medium (3), Likelihood: Low (2)
Mitigation: Fix to be implemented via Perl_Lib delivery prior to CATALYST Go-Live
3) Stale or Lost CATALYST Jobs – Once CATALYST jobs are submitted to UGE, the system cannot detect if the jobs are stale or lost – Impact: Low (2), Likelihood: Low (1)
Mitigation: Use CATALYST dedicated user account with restricted permissions. Software update planned for future build.
4) Data Storage – CATALYST expected to submit jobs at higher rates than current and produce more data products than current – Impact: High (4), Likelihood: Moderate (3)
Mitigation: Epilogue script modification to remove intermediate files in progress. Closely monitor disk usage and implement cleaning procedures
Risks (2 of 3)
5) Epilogue Selection – CATALYST epilogue state management homogeneous for all PGEs – Impact: Low (2), Likelihood: Low (2)
Mitigation: Near-Term software update will add functionality to disable/enable epilogues for individual PGEs
6) Logging Database Loss – CATALYST relies on Log DB to track successful jobs and identify subsequent PRs needing those inputs – Impact: Moderate (3), Likelihood: Low (1)
Watch: Database resides on production database server that is backed up multiple times daily. Log DB entries can be manually reconstructed from DPO and delivered for the DBA to apply to Log DB
7) Server Unexpectedly Shuts Down – Operational boundary conditions not handled properly – Impact: High (4) , Likelihood: Low-Medium (2) (JIRA Ticket 120)
Mitigation: Development team provided list of known conditions, workarounds, and recovery procedures
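One way to compare the seven risks is to combine the stated impact and likelihood numbers into a single score. The sketch below multiplies the two, which is an assumption made here for illustration; the review does not state a scoring formula. The numeric values are the ones given on the risk slides.

```python
# (id, title, impact, likelihood) as stated on the risk slides
risks = [
    (1, "Epilogue Output & Success Tracking", 3, 3),
    (2, "PR Deletion and PGE Data Clean", 3, 2),
    (3, "Stale or Lost CATALYST Jobs", 2, 1),
    (4, "Data Storage", 4, 3),
    (5, "Epilogue Selection", 2, 2),
    (6, "Logging Database Loss", 3, 1),
    (7, "Server Unexpectedly Shuts Down", 4, 2),
]


def exposure(risk):
    """Assumed score: impact multiplied by likelihood."""
    _, _, impact, likelihood = risk
    return impact * likelihood


ranked = sorted(risks, key=exposure, reverse=True)
ranked_ids = [risk[0] for risk in ranked]

for risk in ranked:
    print(risk[0], risk[1], exposure(risk))
```

Under this assumed scoring, Data Storage ranks first and Stale or Lost CATALYST Jobs ranks last.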
Risks (3 of 3)
CATALYST Project Risks (M = mitigate, W = watch)
1 M Epilogue Output & Success Tracking
2 M PR Deletion and PGE Data Clean
3 M Manual Deletion of CATALYST Jobs
4 M Data Storage
5 M Epilogue Selection
6 W Logging Database Loss
7 M Server Unexpectedly Shuts Down
[Risk matrix figure: risk numbers 1–7 placed on a grid; cell positions are not recoverable from the transcript]
CONCLUSION
Remaining Pre-Operational Tasks
● Capture remaining error cases and define avoidance & recovery procedures – Ops Manual
  ● Replicate error conditions
● Perl_Lib delivery to remove PGE output files
● Compile lessons learned document
● Complete CATALYST Design document – including API specifications
Success Criteria Summary
ORR Success Criteria Status
1) The system, including any enabling products, is determined to be ready to be placed in an operational status.
- CATALYST Server b1.0.4 & OpsConsole b1.0.4 ready to proceed with transition into early operations
2) All applicable lessons learned for organizational improvement and systems have been captured.
- In Progress
3) All waivers and anomalies have been closed.
- Operations anomalies documented & path to resolution identified
4) Systems hardware, software, personnel, and procedures are in place to support operations.
- Operations personnel trained
- Leverage existing CERES environment hardware & software
- Final OpsManual updates (work around procedures) in progress
Conclusion & Readiness Assessment
• Of the known existing deficiencies, a path to resolution or appropriate mitigation is in place for each
• CATALYST team finds no “show stoppers” to prevent promoting candidate build (CATALYST Server b1.0.4 + Operator’s Console b1.0.4b)
• Anticipate operational challenges due to known deficiencies
• Benefit from hands-on experience gained only after becoming operational will outweigh the impact of aforementioned mitigated risks.
Therefore, CATALYST build 1.0 is considered ready for transition into operations
Backup Material