RAC & ASM Best Practices “You Probably Need More than just RAC”
Kirk McGowan, Technical Director – RAC Pack, Oracle Server Technologies, Cluster and Parallel Storage Development
Agenda
• Operational Best Practices (IT MGMT 101)
• Background
  – Requirements
  – Why RAC Implementations Fail
  – Case Study
  – Criticality of IT Service Management (ITIL) process
• Best Practices
  – People, Process, AND Technology
Why do people buy RAC?
• Low-cost scalability
  – Cost reduction, consolidation, infrastructure that can grow with the business
• High availability
  – Growing expectations for uninterrupted service
Why do RAC implementations fail?
• RAC scale-out clustering is “new” technology
  – Insufficient budget and effort are put toward filling the knowledge gap
• HA is difficult to do, and cannot be done with technology alone
  – Operational processes and discipline are critical success factors, but are not addressed sufficiently
Case Study
• Based on true stories. Any resemblance, in full or in part, to your own experiences is intentional and expected.
• Names have been changed to protect the innocent.
Case Study
• Background
  – 8-12 months spent implementing 2 systems – somewhat different architectures, very different workloads, identical tech stacks
  – Oracle expertise (Development) engaged to help flatten the tech learning curve
  – Non-mission-critical systems, but important elements of a larger enterprise re-architecture effort
  – Many technology issues encountered across the stack, and resolved over the 8-12 month implementation
    • HW, OS, storage, network, RDBMS, cluster, and application
Case Study
• Situation
  – New mission-critical deployment using the same technology stack
  – Distinct architecture, application development teams, and operations teams
  – Large staff turnover
  – Major escalation, post-production
    • CIO: “Oracle products do not meet our business requirements”
    • “RAC is unstable”
    • “DG doesn’t handle the workload”
    • “JDBC connections don’t fail over”
Case Study
• Operational Issues
  – Requirements, a.k.a. SLOs, were not defined
    • e.g. claim of a 20s failover time; application logic included an 80s failover time; cluster failure detection time alone was set to 120s
  – Inadequate test environments
    • Problems were encountered first in production – including the fact that SLOs could not be met
  – Inadequate change control
    • Lessons learned in previous deployments were not applied to the new deployment – rediscovery of the same problems
    • Some changes were implemented in test but never rolled into production – recurring problems (outages) in production
    • No process for confirming that a change actually fixes the problem prior to implementing it in production
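The timeout mismatch above can be sanity-checked with simple arithmetic: a failover SLO is only achievable if the serial timeout layers beneath it sum to less than the target. A minimal sketch using the figures from the case study – the layer names are illustrative, not a complete model of a RAC failover path:

```python
# Worst-case failover budget check: the failover time a client sees is
# bounded below by the sum of the serial timeout layers in the stack.
# Figures are the ones cited in the case study; the layer breakdown is
# an illustrative assumption, not an exhaustive failover model.

FAILOVER_SLO_SECONDS = 20          # what was promised to the business

layers = {
    "cluster failure detection (e.g. CSS misscount)": 120,
    "application-level failover logic":                80,
}

worst_case = sum(layers.values())

print(f"worst-case failover: {worst_case}s vs SLO {FAILOVER_SLO_SECONDS}s")
if worst_case > FAILOVER_SLO_SECONDS:
    gap = worst_case - FAILOVER_SLO_SECONDS
    print(f"SLO cannot be met; over budget by {gap}s -- "
          "fix the timeout stack or renegotiate the SLO")
```

Running this kind of check before go-live makes the conflict between the promised 20s and the configured 200s visible on day one, rather than during a production escalation.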
Case Study
• More Operational Issues
  – Poor knowledge transfer between internal teams
    • Configuration recommendations, patches, and fixes identified in previous deployments were not communicated
    • Evictions are a symptom, not the problem
  – Inadequate system monitoring
    • OS-level statistics (CPU, I/O, memory) were not being captured
    • Impossible to perform root cause analysis on many problems without the ability to correlate cluster/database symptoms with system-level activity
  – Inadequate support procedures
    • Inconsistent data capture
    • No on-site vendor support consistent with the criticality of the system
    • No operations manual
      - Managing and responding to outages
      - Restoring service after outages
Overview of Operational Process Requirements
• What are “ITIL Guidelines”?

“ITIL (the IT Infrastructure Library) is the most widely accepted approach to IT service management in the world. ITIL provides a comprehensive and consistent set of best practices for IT service management, promoting a quality approach to achieving business effectiveness and efficiency in the use of information systems.”
IT Service Management
• IT Service Management = Service Delivery + Service Support
• Service Delivery: partially concerned with setting up agreements and monitoring the targets within those agreements
• Service Support: its processes can be viewed as delivering services as laid down in those agreements
Provisioning of IT Service Mgmt
• In all organizations, IT service provision must be matched to current and rapidly changing business demands. The objective is to continually improve the quality of service, aligned to the business requirements, cost-effectively. To meet this objective, three areas need to be considered:
  – People with the right skills, appropriate training, and the right service culture
  – Effective and efficient Service Management processes
  – Good IT infrastructure in terms of tools and technology
• Unless people, processes, and technology are considered and implemented appropriately within a steering framework, the objectives of Service Management will not be realized.
Service Delivery
• Financial Management
• Service Level Management
  – Severity/priority definitions
    • e.g. Sev1, Sev2, Sev3, Sev4
  – Response time guidelines
  – SLAs
• Capacity Management
• IT Service Continuity Management
• Availability Management
Service Support
• Incident Management
  – Incident documentation and reporting, incident handling, escalation procedures
• Problem Management
  – RCAs, QA, and process improvement
• Configuration Management
  – Standard configs, gold images, CEMLIs
• Change Management
  – Risk assessment, backout, software maintenance, decommissioning
• Release Management
  – New deployments, upgrades, emergency releases, component releases
BP: Set & Manage Expectations
• Why is this important?
  – Expectations with RAC are different at the outset
  – HA is as much (if not more so) about processes and procedures as it is about the technology
    • No matter what technology stack you implement, on its own it is incapable of meeting stringent SLAs
• Must communicate what the technology can AND can’t do
• Must be clear on what else needs to be in place to supplement the technology if HA business requirements are going to be met
  – HA isn’t cheap!
BP: Clearly define SLOs
• Sufficiently granular
  – Cannot architect, design, OR manage a system without clearly understanding the SLOs
  – 24x7 is NOT an SLO
• Define HA/recovery time objectives, throughput, response time, data loss, etc.
  – These need to be established with an understanding of the cost of downtime for the system
  – RTO and RPO are key availability metrics
  – Response time and throughput are key performance metrics
• Must address different failure conditions
  – Planned vs unplanned
  – Localized vs site-wide
• Must be linked to the business requirements
  – Response time and resolution time
• Must be realistic
Manage to the SLOs
• Definitions of problem severity levels
• Documented targets for both incident response time and resolution time, based on severity
• Classification of applications with respect to business criticality
• Establish an SLA with the business
  – Negotiated response and resolution times
  – Definition of metrics
    • e.g. Application Availability shall be measured using the following formula: (Total Minutes in a Calendar Month − Unscheduled Outage Minutes − Scheduled Outage Minutes) ÷ Total Minutes in a Calendar Month
  – Negotiated SLOs
  – Effectively documents expectations between IT and the business
• Incident log: date, time, description, duration, resolution
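The availability metric defined above is straightforward to compute. A minimal sketch, with hypothetical outage figures for illustration:

```python
def availability_pct(total_minutes, unscheduled_outage, scheduled_outage):
    """Application availability as defined in the SLA:
    (total - unscheduled outage - scheduled outage) / total, as a percentage."""
    return 100.0 * (total_minutes - unscheduled_outage - scheduled_outage) / total_minutes

# A 30-day month has 30 * 24 * 60 = 43200 minutes.
# Outage figures below are hypothetical, for illustration only.
month = 30 * 24 * 60
print(f"{availability_pct(month, unscheduled_outage=90, scheduled_outage=240):.3f}%")
# → 99.236%
```

Note that the SLA formula as written counts scheduled outages against availability too, so negotiated maintenance windows directly reduce the reported number.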
Example Resolution Time Matrix

SR class                          Target resolution time
Severity 1 Priority 1 and 2 SRs   < 1 hour
Severity 1 Priority 3 SRs         < 13 hours
Severity 2 Priority 1 SRs         < 14 hours
Severity 2 SRs                    < 132 hrs
Example Response Time Matrix

Status    | Sev3/Sev4 | Sev2    | Sev2/P1 | Sev1/P2 | Sev1/P1
DEV       | 10 days   | 3 days  | 4       | 4       | 4
LMS,CUS   | 4 days    | 2 days  | 4       | 4       | 4
INT       | 3 hrs     | 120 min | 60      | 2       | 60
WIP       | 4 days    | 18 hrs  | 60      | 60      | 60
PCR,RDV   | 3 hrs     | 120     | 60      | N/A     | 60
RVW,1CB   | 120       | 60      | 15      | 60      | 15
IRR,2CB   | 120       | 60      | 15      | 30      | 15
ASG       | 60        | 30      | 15      | 60      | 15
New,XFR   | 60        | 30      | 15      | 30      | 15
BP: TEST, TEST, TEST
• Testing is a shared responsibility
  – Functional, destructive, and stress testing
• Test environments must be representative of production
  – Both in terms of configuration and capacity
  – Separate from production
  – Building a test harness to mimic the production workload is a necessary, but non-trivial, effort
• Ideally, problems would never be encountered first in production
  – If they are, the first question should be: why didn’t we catch the problem in test?
    • Exceeding some threshold
    • Unique timing or race condition
  – What can we do to catch this type of problem in the future?
    • Build a test case that can be reused as part of pre-production testing
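A reusable regression test of the kind described above can be very small. A sketch for the “exceeding some threshold” case – all names here (MAX_CONNECTIONS, acquire_connection, ConnectionPoolExhausted) are hypothetical stand-ins, not real RAC or Oracle APIs:

```python
# Sketch of a reusable pre-production regression test for a problem first
# seen in production: a threshold was exceeded (e.g. a connection limit)
# and the system fell over instead of rejecting work cleanly.

MAX_CONNECTIONS = 100              # hypothetical threshold from the outage

class ConnectionPoolExhausted(Exception):
    """Clean, expected error at the limit -- the desired behaviour."""

def acquire_connection(active_count):
    """Stand-in for the real acquire path; must fail cleanly at the limit."""
    if active_count >= MAX_CONNECTIONS:
        raise ConnectionPoolExhausted("pool limit reached, rejecting cleanly")
    return active_count + 1

def test_threshold_regression():
    # Drive the system past the threshold that caused the original outage
    # and assert it degrades gracefully rather than crashing.
    active = 0
    for _ in range(MAX_CONNECTIONS):
        active = acquire_connection(active)
    try:
        acquire_connection(active)
    except ConnectionPoolExhausted:
        return True                # clean rejection: the regression is gone
    raise AssertionError("limit exceeded without a clean error")

test_threshold_regression()
```

Once written, a test like this runs in every pre-production cycle, which is exactly what prevents the "rediscovery of the same problems" seen in the case study.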
BP: Define, document, and adhere to Change Control processes
• This amounts to self-discipline
• Applies to all changes at all levels of the tech stack
  – HW changes, configuration changes, patches and patchsets, upgrades, and even significant changes in workload
  – If no changes are introduced, the system will reach a steady state and function forever
• A well-designed system will be able to tolerate some fluctuations and faults
• A well-managed system will meet service levels
  – If a problem (that was fixed) is encountered again elsewhere, it is a change management process problem, not a technology problem; i.e. rediscovery should not happen
  – Ensure fixes are applied across all nodes in a cluster, and to all environments to which the fix applies
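One way to enforce "fixes applied across all nodes and environments" is an automated drift check that compares each node's recorded configuration and patch inventory. A sketch – the node names, setting names, and patch label are hypothetical:

```python
# Detect configuration drift across cluster nodes so that a fix applied
# on one node is not silently missing on another.

def drift(nodes):
    """Return {setting: {node: value}} for settings that differ across nodes."""
    keys = set().union(*(cfg for cfg in nodes.values()))
    out = {}
    for k in keys:
        values = {node: cfg.get(k, "<missing>") for node, cfg in nodes.items()}
        if len(set(values.values())) > 1:
            out[k] = values
    return out

# Hypothetical inventory, e.g. gathered by a nightly job on each node:
nodes = {
    "rac1": {"misscount": "120", "patch_ABC": "applied"},
    "rac2": {"misscount": "120", "patch_ABC": "missing"},
}

print(drift(nodes))   # patch_ABC differs -> a change-process gap to close
```

The same comparison works across environments (test vs production), which directly targets the "changes implemented in test but never rolled into production" failure from the case study.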
BP: Plan for, and execute, Knowledge Transfer
• New technology has a learning curve
• 10g, RAC, and ASM cross traditional job boundaries, so knowledge transfer must be executed across all affected groups
  – Architecture, development, and operations
  – Network admin, sysadmin, storage admin, DBA
• Learn how to identify and diagnose problems
  – e.g. evictions are not a problem, they are a symptom
  – Learn how to use the various tools and interpret their output
    • Hanganalyze, system state dumps, truss, etc.
  – Understand behaviour – the distinction between cause and symptom
• Needs to occur pre-production
  – Operational readiness
BP: Monitor your system
• Define key metrics and monitor them actively
  – Establish a (performance) baseline
• Learn how to use Oracle-provided tools
  – RDA (+ RACDDT)
  – AWR/ADDM
  – Active Session History
  – OSWatcher
• Coordinate monitoring and collection of OS-level stats as well as DB-level stats
  – Problems observed at one layer are often just symptoms of problems that exist at a different layer
• Don’t jump to conclusions
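Establishing a baseline and monitoring against it can be reduced to a simple statistical check: flag samples that fall well outside the baseline distribution. A sketch – in practice the numbers would come from OSWatcher, AWR, or sar output rather than hard-coded lists:

```python
# Compare current samples of a key metric against an established baseline
# and flag deviations worth investigating before jumping to conclusions.
import statistics

def flag_deviations(baseline, current, n_sigma=3.0):
    """Flag samples more than n_sigma standard deviations from the baseline mean."""
    mean = statistics.mean(baseline)
    sd = statistics.stdev(baseline)
    return [x for x in current if abs(x - mean) > n_sigma * sd]

# Hypothetical CPU-busy percentages: a baseline from past healthy samples,
# and a current window containing one sample well off the baseline.
cpu_baseline = [22, 25, 24, 23, 26, 24, 25, 23]
cpu_now = [24, 26, 71, 25]

print(flag_deviations(cpu_baseline, cpu_now))   # → [71]
```

Correlating a flagged OS-level spike like this with cluster or database symptoms is exactly the cross-layer analysis the case study found impossible without captured OS statistics.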
BP: Define, document, and communicate Support procedures
• Define corrective procedures for outages
  – Routinely test corrective procedures
• HA process:
  – Prevent → Detect → Capture → Resume → Analyze → Fix
  – Classify high-priority systems, and the steps that need to be taken in each phase
  – Keep an active log of every outage
• If we don’t provide sufficient tools to get to root cause, then shame on us
• If you don’t implement the diagnostic capabilities that are provided to help get to root cause, then shame on you
• Serious outages should never happen more than once
Summary
• Deficiencies in operational processes and procedures are the root cause of the vast majority of escalations
  – Address these, and you dramatically increase your chances of a successful RAC deployment – and save yourself a lot of future pain
• Additional areas of challenge
  – Configuration Management – initial install and config, standardized “gold image” deployment
  – Incident Management – diagnosing cluster-related problems