Implementing CEC Concurrent Maintenance
Ron Barker, IBM Power Systems Advanced Technical Support, rfbarker@us.ibm.com
IBM Power Systems, © 2009 IBM Corporation


Page 1:

Implementing CEC Concurrent Maintenance

Ron Barker, IBM Power Systems Advanced Technical Support, rfbarker@us.ibm.com

Page 2:

Overview

CEC Concurrent Maintenance (CCM) offers new capabilities in Reliability, Availability and Serviceability (RAS)

Concurrent add and upgrade functions enable the expansion of the processor, memory, and I/O hub subsystems without a system outage

If prerequisites have been met, repairs can be made on the system processor, memory, I/O hub, and other CEC hardware without a system outage

Accomplishing CCM requires careful advance planning and meeting all prerequisites

If desired, customers can continue to schedule maintenance during planned outages

Page 3:

Terminology

Concurrent Maintenance: An add, upgrade or repair made while the server is running

Some system elements may be unavailable during maintenance, but a re-IPL is NOT required to reintegrate all resources

Concurrent Add/Upgrade: Adding new hardware components or exchanging existing ones while the system is running

Node: A physical group of processor, memory, and I/O hubs in the system (595 processor book, 570 CEC drawer or module)

Node evacuation: Frees up processor and memory resources from the target node and replaces them with CPU and memory from other nodes if available. De-allocates I/O resources so the node can be electrically isolated from the system for concurrent maintenance.

Page 4:

Terminology

Concurrent Cold Repair: Repairs to components electrically isolated from the running system (de-allocated or “garded”) before the current repair action was started

Repairs following a system shutdown and reboot after hardware failure

Reintegration following repair does NOT require a reboot

Concurrent Hot Repair: Repairs on components that will be electrically isolated from the running system during the repair action

Reintegration following repair does NOT require a reboot

Non-Concurrent Repair: Repairs requiring the system be powered off

GX Adapter: An I/O hub which connects I/O expansion units to the processors and memory in the system (e.g., RIO-2, 12X adapters)

Page 5:

Planning and Prerequisites

CCM has both hardware and firmware prerequisites

Power Systems 595 and 570 only

Hardware Management Console V7R3.4.0 MH01163_0401 (SP 1) or later

System firmware EH340_061 and EM340_061 or later (a verification sketch follows this list)

This update has deferred content requiring a re-IPL to activate enhancements

Power System 570 concurrent node add requires that the system cable be connected in advance (cannot be added concurrently)

Adding new GX adapters concurrently requires that sufficient system memory be reserved in advance; the default reservations, which may need to be increased, are:

Power Systems 595: 1 additional per node, 2 maximum, if slots are available

Power Systems 570: 1 additional maximum, if an empty slot is available
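
The HMC and firmware levels above can be checked from the HMC command line before a CCM activity is scheduled. A minimal sketch, assuming a managed system named Server-9119-FHA-SN0000000 (a placeholder); exact lslic options and output fields can vary slightly by HMC release:

```sh
# HMC code level: should report V7R3.4.0 with Service Pack 1 or later
lshmc -V

# Activated/installed system firmware on the managed system:
# should report EH340_061 (595) or EM340_061 (570) or later
lslic -m Server-9119-FHA-SN0000000 -t sys
```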

Page 6:

Planning and Prerequisites

System configurations should allow for free processors, unused system memory and redundant I/O paths

When a processor node is powered off, all of its resources need to be shifted to another node

Unlicensed Capacity on Demand processors and memory will be used by the system during node evacuation

System CPU and memory usage can be reduced through dynamic reallocation of resources from running partitions, or by shutting down partitions that are not needed (see the sketch after this list)

Insufficient processor and memory capacity, or lack of redundant I/O paths, may force shutdown of some or all logical partitions on the system
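
A rough feasibility check for node evacuation is to compare the system-wide available processor and memory against what the target node contributes, and to free resources with dynamic LPAR operations where necessary. A minimal sketch from the HMC command line, assuming a managed system Server-9119-FHA-SN0000000 and a partition prodlpar1 (both placeholders); attribute names may vary slightly by HMC release:

```sh
# System-wide processing units: configurable vs. currently available
lshwres -r proc -m Server-9119-FHA-SN0000000 --level sys \
  -F configurable_sys_proc_units,curr_avail_sys_proc_units

# System-wide memory (MB): configurable vs. currently available
lshwres -r mem -m Server-9119-FHA-SN0000000 --level sys \
  -F configurable_sys_mem,curr_avail_sys_mem

# If more headroom is needed, remove resources dynamically from a
# running partition: here 4 GB of memory and 1 processing unit
# (use --procs instead of --procunits for dedicated-processor partitions)
chhwres -r mem  -m Server-9119-FHA-SN0000000 -o r -p prodlpar1 -q 4096
chhwres -r proc -m Server-9119-FHA-SN0000000 -o r -p prodlpar1 --procunits 1
```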

Page 7:

Planning and Prerequisites

Preparation for concurrent maintenance begins when the system is ordered and configured

1. Customers decide how much system to buy and how to configure it to take advantage of CCM capability

2. Customers decide whether to use concurrent maintenance techniques or schedule planned outages for upgrades and repairs

Page 8:

Planning Guides for CEC Concurrent Maintenance

Follow these guidelines:

The system should have enough unused CPU and memory to allow a node to be taken off-line for repair

All critical I/O resources should be configured using multi-path I/O solutions allowing failover using redundant I/O paths

Redundant physical or virtual I/O paths must be configured through different nodes and GX adapters

Page 9:

Note the Exposure on Node 3

Page 10:

Partition Considerations

CCM is concurrent from the system point of view, but may not be completely transparent to logical partitions

Temporary reduction in CPU, memory and I/O capabilities could impact performance

To take full advantage of concurrent node add or memory upgrade/add, partition profiles should reflect higher maximum processor and memory values than exist before the upgrade

New resources can then be added to partitions dynamically after the add or upgrade (see the sketch after this list)

Note: higher partition maximum memory values will increase system memory set aside for Partition Page Tables
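
Whether a partition can absorb newly added capacity without a profile change depends on its profile maximums, which can be reviewed and raised from the HMC command line well before the upgrade. A minimal sketch, assuming a managed system Server-9119-FHA-SN0000000, partition prodlpar1 and profile normal (all placeholders); a raised max_mem only takes effect the next time the partition is activated with that profile:

```sh
# Review current profile minimums/maximums for memory and processors
lssyscfg -r prof -m Server-9119-FHA-SN0000000 \
  -F lpar_name,name,min_mem,desired_mem,max_mem,min_procs,desired_procs,max_procs

# Raise the memory maximum ahead of a planned memory add
# (higher max_mem also increases the Partition Page Table reservation)
chsyscfg -r prof -m Server-9119-FHA-SN0000000 \
  -i "name=normal,lpar_name=prodlpar1,max_mem=65536"

# After the concurrent add completes, add memory to the running partition
chhwres -r mem -m Server-9119-FHA-SN0000000 -o a -p prodlpar1 -q 8192
```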

Page 11:

Partition Considerations

I/O resource planning

To maintain access to data, multi-path I/O solutions must be utilized (e.g., MPIO, SDDPCM, PowerPath, HDLM); a path-checking sketch follows this list

Redundant I/O adapters must be located in different I/O expansion units that are attached to different GX adapters located in different nodes

This can be either directly attached I/O or virtual I/O provided by dual VIO servers housed in different nodes
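
From an AIX client partition, disk path redundancy can be confirmed before the maintenance window so that isolating one node, GX adapter or expansion unit does not remove the last path to a critical disk. A minimal sketch; the hdisk and adapter names are examples and will differ per system:

```sh
# Each critical hdisk should show at least two Enabled paths,
# each through a different parent adapter
lspath -l hdisk0

# Physical location codes show which expansion unit (and therefore
# which GX adapter and node) each Fibre Channel adapter sits in
lscfg -l fcs0
lscfg -l fcs1
```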

Page 12:

Partition Considerations

Check system settings for the server

If shutting down all partitions becomes necessary, make sure the system is not set to power off automatically once the last partition shuts down, which would needlessly prolong the repair action

Leave the option to power off the system after all logical partitions are powered off unchecked

Page 13:

IBM i Planning Considerations

To allow for a hot node repair/memory upgrade to take place with i partitions running, the following PTFs are also required:

V5R4: MF45678

V6R1: MF45581

If the PTFs are not activated, the IBM i partitions have to be powered off before the CCM operation can proceed.

Page 14:

Rules for Concurrent Maintenance Operations

Guidelines for CCM operations

Only one operation at a time from only one HMC

A second CCM operation cannot be started until the first one has completed successfully

All CCM operations except a 570 GX adapter add must be done by IBM service personnel

On both the 595 and 570, you must have at least two nodes for hot node repair or hot memory add/upgrade

You cannot evacuate a 570 node that has an active system clock

Enable service processor redundancy on a 570 before starting a hot node add, except on a single-node server

Both service processors on a 595 must be functioning

Display Service Effect utility must be run by the system administrator before hot repair or hot memory add/upgrade

Ensure that the system is not in energy savings mode prior to concurrent node add, memory upgrade or concurrent node repair

Page 15:

Guidelines for All Concurrent Maintenance Operations

With proper planning and configuration, enterprise-class Power servers are designed for concurrent add/upgrade or repair

However, changing the hardware configuration or the operational state of electronic equipment may cause unforeseen impacts to the system status or running applications

Some highly recommended precautions to consider:

Schedule concurrent upgrades or repairs during off-peak operational hours

Move business-critical applications to another server using the Live Partition Mobility feature, or stop them (a validation sketch follows this list)

Back up critical application and system state information

Checkpoint databases
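
Where Live Partition Mobility is available, a planned migration of a business-critical partition can be validated, and then performed, from the HMC command line ahead of the maintenance window. A minimal sketch, assuming placeholder source and destination system names and a partition named critical_lpar1; LPM has its own prerequisites (PowerVM Enterprise Edition, fully virtualized I/O) that are outside the scope of this presentation:

```sh
# Validate the move first; the validate operation makes no changes
migrlpar -o v -m Server-9119-FHA-SN1111111 -t Server-9119-FHA-SN2222222 -p critical_lpar1

# If validation completes cleanly, perform the migration
migrlpar -o m -m Server-9119-FHA-SN1111111 -t Server-9119-FHA-SN2222222 -p critical_lpar1
```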

Page 16:

Guidelines for All Concurrent Maintenance Operations

Features and capabilities that don’t support CCM

Systems clustered using RIO-SAN technology (this technology is used only by IBM i clients clustering with switchable towers and virtual OptiConnect technologies)

Systems clustered using InfiniBand technology (This capability is typically used by High Performance Computing clients using an InfiniBand switch)

I/O Processors (IOPs) used by i partitions do not support CCM (any i partition with IOPs assigned must either have the IOPs powered off, or the partition itself must be powered off)

16 GB memory pages, also known as huge pages, do not support memory relocation (Partitions with 16 GB pages must be powered off to allow CCM)

Page 17:

Guidelines for Concurrent Add/Upgrade

For adding or upgrading

All serviceable hardware events must be repaired and closed before starting an upgrade (a command-line check is sketched after this list)

Firmware enforces node and GX adapter plugging order

Only the next node position or GX adapter slot based on plugging rules will be available

For 570 node add, make sure the system cable is in place before starting

If the concurrent add includes a node plus a GX adapter, install the adapter in the node first, then add the entire unit

This way, the 128 MB of memory required by the adapter will come from the new node when it is powered on
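
Open serviceable hardware events can be reviewed from the HMC command line as well as from the Manage Serviceable Events GUI task. A minimal sketch; the -d value is simply how many days back to search:

```sh
# List serviceable hardware events from the last 30 days; anything still
# open must be repaired and closed before the concurrent add or upgrade
lssvcevents -t hardware -d 30
```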

Page 18:

Guidelines for Concurrent Add/Upgrade

For adding or upgrading

For multiple upgrades that include new I/O expansion drawers, as well as node or GX adapter adds, the concurrent node or GX adapter add must be completed first

The I/O drawer can then be added later as a separate concurrent I/O drawer add (i.e., the operations are performed sequentially)

Page 19:

Guidelines for Concurrent Repair

Repair with same FRU type:

The node repair procedure doesn’t allow for any additional action beyond the repair

The same FRU type must be used to replace a failing FRU, and no additional hardware can be added or removed during the procedure

For example, if a 4 GB DIMM fails, it must be replaced with a 4 GB DIMM, not a 2 GB or 8 GB DIMM

A RIO GX adapter must be replaced with a RIO GX adapter, not an InfiniBand GX adapter

Page 20:

Customer Responsibilities

The customer is responsible for deciding whether to do a concurrent upgrade or repair or to schedule a maintenance window

The customer must determine whether all prerequisites have been met and the configuration will support a node evacuation, if necessary

In the case of an upgrade, the World-wide Customized Install Instructions (WCII) for the order will ship assuming a non-concurrent installation

The WCII will tell you how to obtain instructions for a concurrent upgrade

All repairs are the responsibility of IBM service personnel

Customers are responsible for adding new 570 GX adapters

Page 21:

Display Service Effect Utility

The Display Service Effect utility needs to be run by the customer prior to concurrent hot node repair or memory add/upgrade

The utility shows memory, CPU and I/O issues that must be addressed before a node evacuation

The utility runs automatically at the start of a hot repair or upgrade, but it can be run manually ahead of time to determine whether the repair or upgrade will be concurrent

Ideally, this utility should be run by the systems administrator before the arrival of the IBM service representative

The DSE utility is not needed when no node evacuation is required, as with a hot GX adapter add or a hot node add

Page 22:

Starting the Display Service Effect Utility

Note: The Power On/Off Unit dialog box is used only to access the Display Service Effect utility

Page 23:

Select Display Service Effect

Note: The Power On/Off Unit dialog box is used only to access the Display Service Effect utility

Page 24:

Select Yes – Confirm Advanced Power Control Command

This is a misleading message: it does NOT mean you’re about to power off your system!

Page 25:

Display Service Effect Summary Page

Look at the details by clicking the tabs

Page 26:

Tips on How to View Data

When working with the informational and error messages shown on the Node Evacuation Summary Status panel, work with the Platform and Partition messages first (the first and last tabs)

The impacts to the platform and partitions indicated in these messages may lead to the shutdown of partitions on the system, for example because a partition is using I/O resources in the target node

The shutdown of a partition will free up memory and processor resources

If a partition must be shut down, use the Recheck button to re-evaluate the memory and processor resources
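
If the panels show that a partition must be shut down, the shutdown can be requested from the HMC command line and the evaluation repeated with Recheck. A minimal sketch, assuming placeholder managed system and partition names:

```sh
# Request a clean, operating-system-initiated shutdown of the partition
chsysstate -m Server-9119-FHA-SN0000000 -r lpar -n testlpar2 -o osshutdown

# Confirm the partition has reached Not Activated, then press Recheck on
# the Node Evacuation Summary Status panel to re-evaluate resources
lssyscfg -r lpar -m Server-9119-FHA-SN0000000 -F name,state
```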

Page 27:

Platform – Informational Messages

Check both Errors and Informational Messages

Page 28:

Memory Impacts

Page 29:

Processor Impacts

Page 30:

Partition Impacts – I/O Related Conflicts

Page 31:

“White Glove” Tracking Program

During the next several months, IBM will track concurrent maintenance operations

In the US, potential concurrent CEC add MES orders and repairs will be pro-actively tracked during the feedback period

In the NE and SW IOTs and in CEEMA GMT (EMEA), the "Install PMH" or "Repair PMH" process will be used to request feedback from SSRs

For all geographies, SSRs who perform CCM upgrades, adds and repairs are asked to complete the feedback form located at the following URL:

http://w3.rchland.ibm.com/~cuii/CCM/CCMfeedback_WCII.html

Page 32:

Summary

CCM gives customers new options for maintaining availability

Careful advance planning is required to make it work

Prerequisites include creating CPU and memory reserves to allow CCM, as well as configuring redundant I/O paths or preparing for the loss of I/O routes during concurrent maintenance

Customers must run the Display Service Effect utility to determine whether a concurrent repair or memory add/upgrade can be initiated

If concurrent repairs are not possible, a regular maintenance window must be scheduled

Page 33:

Required Reading

Technical white paper, “IBM Power 595 and 570 Servers CEC Concurrent Maintenance Technical Overview”, available at:

ftp://ftp.software.ibm.com/common/ssi/sa/wh/n/pow03023usen/POW03023USEN.PDF

CEC Concurrent Maintenance article in IBM System Hardware Information Center available at:

http://publib.boulder.ibm.com/infocenter/systems/scope/hw/index.jsp?topic=/ared3/ared3kickoff.htm