

A White Paper from the Experts in Business-Critical Continuity

Addressing the Leading Root Causes of Downtime: Technology Investments and Best Practices for Assuring Data Center Availability


    Executive Summary

Today's data center has evolved into a strategic business asset at the core of business performance and customer satisfaction. However, as the criticality of data center operations continues to increase, so too do the financial and intangible costs of a downtime event.

While availability remains a key concern for CIOs, some underestimate the true business impact of unplanned outages, focusing instead on reducing CAPEX/OPEX while simultaneously increasing their data centers' throughput. As a result of this miscalculation, many companies do not have the technology investments and best practices in place to adequately prevent and/or address the leading causes of downtime, according to the 2010 National Survey on Data Center Outages, an independent study conducted by the Ponemon Institute. In addition to providing insight into industry perceptions of data center availability and downtime events, the report uncovers the most frequently cited root causes of preventable outages, ranging from failures of critical data center equipment to human error and accidental emergency power-off (EPO) events.

This white paper will examine the most frequently reported root causes of data center downtime and recommend cost-effective solutions, design strategies and best practices for eliminating these vulnerabilities. These recommendations are designed to significantly reduce the frequency of unplanned data center outages while simultaneously addressing key business concerns for CIOs, including improving energy efficiency, reducing total cost of ownership and maintaining end-user satisfaction. This paper also will explore the value of proactive service, including preventive maintenance and data center assessments, and how these proactive investments can be used to identify subtle weaknesses in a business-critical infrastructure and optimize a facility for performance and efficiency.


    Background

For companies that rely on data centers to deliver business-critical services, downtime always has been a key concern. However, as the effects of the economic downturn began to take their toll on once-thriving enterprises, achieving cost reductions through energy efficiency began to overshadow the importance of availability in the eyes of senior management.

According to a 2009 survey of the Data Center Users Group (DCUG), efficiency was cited as a primary concern by more than 47 percent of data center professionals polled, making energy savings the No. 2 concern overall. Unfortunately, the increased focus on rapid cost-cutting and energy efficiency left many business-critical data centers at an increased risk for equipment failures and unplanned downtime events.

While companies began adopting high-density configurations, virtualization and other strategies intended to boost the capacity of existing IT equipment, many overlooked the need for a robust data center infrastructure to ensure continuity for business-critical applications. As a result, a number of high-profile outages across a variety of industries proved to be more costly than investments that could have prevented them altogether. In addition to an overall disruption of service and (in some cases) loss of customers, these outages translated to hundreds of thousands of dollars in financial losses and lost future customer business.

According to the Ponemon Institute's National Survey on Data Center Outages, 95 percent of companies have experienced an unplanned downtime event within the past two years. Types of outages cited include total data center outages (2.48 events on average), partial data center outages (6.84 events on average) and device-level outages (11.29 events on average). However, even though more than 60 percent of these companies rely on their data center to generate revenue or support e-commerce activity, less than 35 percent believe their data center utilizes critical system design and redundancy best practices to maximize availability.

Furthermore, the majority of data center professionals do not believe they have enough resources to respond quickly to an unplanned outage or failure, with the average downtime event lasting nearly two hours (107 minutes) per outage. When examining the prevalence of downtime in specific industries (Figure 1), data centers serving the healthcare and civic/public sectors accounted for the longest average duration of downtime (with an average of 3 and 2.84 annual downtime events, respectively), followed closely by the industrial and financial sectors. This is particularly alarming considering the criticality of electronic health records, financial services and public information systems dependent on the availability of these systems.

(Chart: extrapolated downtime duration in minutes per outage: Financial Services, 96; Industrial, 107; Public Sector, 110; Healthcare, 119.)

Figure 1. Extrapolated duration of unplanned total data center shutdowns by industry segment.


To gain a better understanding of which vulnerabilities contribute to such a high occurrence of costly downtime events, the survey asked more than 450 data center professionals to cite the root causes of data center outages experienced during the past two years. The vast majority of data center downtime was related to inadequate investments in a high-availability infrastructure. In fact, more than half of data center professionals polled agreed the majority of downtime events could have been prevented.

As shown in Figure 2, the findings uncovered the seven most frequently reported root causes of downtime, ranging from UPS and cooling equipment failures to human error and accidental shutdowns.

The implementation of cost-effective solutions, design strategies and/or best practices can enable data center managers to reduce or eliminate the risk of these root causes while simultaneously improving energy efficiency, flexibility, total cost of ownership and end-user satisfaction.

    UPS Battery Failure

While batteries may be the most low-tech components supporting today's mission-critical data centers, battery failure remains the leading cause of unplanned downtime events. In fact, studies conducted by Emerson Network Power indicate battery-related failures account for more than one-third of all uninterruptible power supply (UPS) system failures over the life of the equipment.

The continuity of critical systems during a power outage typically is dependent on a data center's power equipment, comprised of UPS systems and their respective battery backups. While the vast majority of outages last less than 10 seconds, a single bad cell can cripple a data center's entire backup system, particularly if adequate UPS redundancy has not been implemented.

All batteries have a limited life expectancy, dictated by the frequency of battery discharge and recharge. However, there are a number of factors that can impact the aging process and shorten a battery's useful life, including high ambient temperatures, frequent discharge cycles, overcharging, loose connections and strained battery terminals.

UPS battery failure: 65%
UPS capacity exceeded: 53%
Accidental EPO/human error: 51%
UPS equipment failure: 49%
Water incursion: 35%
Heat-related/CRAC failure: 33%
PDU/circuit breaker failure: 33%

Figure 2. Top root causes of downtime cited by data center professionals. Totals exceed 100 percent because data center professionals cited multiple root causes over a period of two years.

To safeguard backup power systems against unplanned or premature battery failures, data center professionals should take steps to ensure battery maintenance best practices are observed. In addition to standards outlined by the Institute of Electrical and Electronics Engineers (IEEE), manufacturer schedules for maintenance checks should be followed to ensure batteries are properly maintained (correctly installed, fully charged, etc.), serviced and/or replaced before they pose a risk for mission-critical applications.

As outlined in Figure 3, monthly preventive maintenance best practices for battery systems include visual inspections (internal and external), acceptance testing and load testing. Capacity tests also should be performed by a trained service technician at recommended intervals to measure the degradation of a battery over time.

In addition to regular service and preventive maintenance, the installation of an integrated battery monitoring solution enables data center professionals to proactively monitor the performance of individual cells across all UPS systems in the data center 24/7. Monitoring solutions provide comprehensive insight into battery health, including cell voltage, resistance, current and temperature, without requiring a full discharge and recharge cycle. This allows batteries to be utilized to their maximum effectiveness, safeguarding data center professionals against premature replacements as well as unanticipated battery expirations. In addition to robust on-site monitoring capabilities, integrated solutions typically offer predictive analysis and provide web-based remote access to facilitate rapid service response before battery failure occurs.
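To make the idea concrete, the sketch below shows threshold-based evaluation of per-cell readings of the kind such monitoring collects. It is a minimal illustration, not any vendor's product API; the data structure and all thresholds are assumed values for a hypothetical 12 V VRLA block.

    from dataclasses import dataclass

    @dataclass
    class CellReading:
        cell_id: str
        voltage: float       # volts
        resistance: float    # milliohms (internal resistance)
        temperature: float   # degrees Celsius

    # Illustrative alarm limits for a hypothetical 12 V VRLA block; real
    # limits come from the battery manufacturer and IEEE 1188 guidance.
    VOLTAGE_RANGE = (12.8, 14.0)     # acceptable float-voltage window
    RESISTANCE_BASELINE = 3.0        # mOhm when the block was new
    RESISTANCE_ALARM_FACTOR = 1.25   # alarm at 25% above baseline
    TEMPERATURE_ALARM = 30.0         # sustained heat shortens battery life

    def evaluate(r: CellReading) -> list[str]:
        """Return alarm messages for one cell reading."""
        alarms = []
        low, high = VOLTAGE_RANGE
        if not low <= r.voltage <= high:
            alarms.append(f"{r.cell_id}: float voltage {r.voltage:.2f} V out of range")
        if r.resistance > RESISTANCE_BASELINE * RESISTANCE_ALARM_FACTOR:
            alarms.append(f"{r.cell_id}: resistance {r.resistance:.2f} mOhm suggests aging")
        if r.temperature > TEMPERATURE_ALARM:
            alarms.append(f"{r.cell_id}: {r.temperature:.1f} C accelerates aging")
        return alarms

    print(evaluate(CellReading("string1-cell07", 12.95, 3.9, 24.0)))

Flagging rising internal resistance is what lets a monitoring system predict failure without a full discharge test, which is the advantage described above.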

Figure 3. Monthly preventive maintenance best practices for battery systems include visual inspections (internal and external), acceptance testing and load testing.

Recommended tasks (performed at monthly, quarterly or annual intervals per IEEE 450 for flooded cells and IEEE 1188 for VRLA):

- Battery system voltage
- Charger current and voltage
- Ambient temperature
- Visual inspection
- Electrolyte levels
- Pilot cell voltage and specific gravity
- Specific gravity, 10 percent of cells
- All cell voltages
- All cell temperatures, 10 percent of cells
- Detailed internal visual inspection
- AC ripple current and voltage
- Capacity test (every 5 years)

While proactive battery monitoring and maintenance are critical to maximizing UPS availability, data center professionals also should keep charged spares on-site to cover any cells that may have expired between service visits. Depending on the criticality of the data center, data center professionals should have enough batteries on hand to replace between 5 and 10 percent of the batteries in all cabinets. Because these batteries are fully charged and ready for rapid deployment, service visits to repair battery strings can be significantly reduced, cutting costs without compromising availability.
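As a quick worked example of the spares guideline above, with a battery count that is purely illustrative:

    # Quick arithmetic for the 5-10 percent on-site spares guideline.
    # The total battery count is an illustrative assumption.
    TOTAL_BATTERIES = 400  # blocks across all UPS battery cabinets

    low = round(TOTAL_BATTERIES * 0.05)
    high = round(TOTAL_BATTERIES * 0.10)
    print(f"Keep between {low} and {high} charged spares on-site")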

    Exceeding UPS Capacity

High-density configurations have become more common in recent years as data center managers seek to achieve increased throughput from their existing infrastructures at the highest efficiencies possible. In fact, the average power draw for an individual rack can exceed 10 kW in some high-density data centers. With such high densities common during peak hours, the capacity of a single UPS system can be quickly exhausted, leaving critical IT equipment unprotected and the UPS system at risk for overload and failure.

According to the Ponemon Institute, more than half of all downtime events were the result of exceeded UPS capacity, with outages ranging from individual rack rows to total data center shutdowns. Because the UPS is the first line of defense in a data center's power infrastructure, critical IT equipment must be backed by adequate UPS protection to ensure that all systems will be fully supported.

If operating under a single-module UPS configuration, it is important to have a thorough understanding of the typical loads experienced throughout the data center. By measuring output multiple times per day via an integrated monitoring and management solution, data center professionals can gauge the typical power draw of their IT equipment over time to confirm whether their infrastructure is backed by enough UPS capacity. Some solutions, such as the Aperture Integrated Resource Manager, also give data center professionals the ability to scale UPS systems automatically at the rack level and shift loads in real time based on available capacities.
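The sketch below illustrates the load-trending check described above: sample the UPS output over time and flag when peaks approach the rating. The 200 kW rating and 80 percent planning threshold are illustrative assumptions, not figures from this paper.

    from statistics import mean

    UPS_RATED_KW = 200.0        # assumed module rating
    PLANNING_THRESHOLD = 0.80   # flag peaks above 80% of rating

    def assess_ups_headroom(samples_kw: list[float]) -> dict:
        """Summarize periodic UPS output samples (in kW)."""
        peak = max(samples_kw)
        return {
            "average_kw": round(mean(samples_kw), 1),
            "peak_kw": peak,
            "peak_utilization": round(peak / UPS_RATED_KW, 2),
            "needs_attention": peak > UPS_RATED_KW * PLANNING_THRESHOLD,
        }

    # Example: output readings taken several times per day.
    print(assess_ups_headroom([142.0, 155.5, 168.2, 175.9, 160.3, 171.4]))

Here the 175.9 kW peak is 88 percent of the rating, so the check would prompt a closer look at capacity before growth pushes the UPS toward overload.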

Establishing a redundant UPS architecture also enables data center professionals to increase the capacity of their backup power system, with the added benefit of eliminating single points of failure. Parallel (N+1) redundancy remains the most cost-effective option for high-availability data centers and is well-suited for high-density environments where future growth is certain.

In a parallel redundant system (Figure 4), multiple UPS modules are sized so enough modules are available to power the connected equipment (N), plus one additional module for redundancy (+1). In these configurations, all UPS modules remain online and share the load equally. This enables data center professionals to design their redundant UPS systems with additional capacity to accommodate increased, non-typical loads resulting from equipment failures, modules being taken offline for service and rapid power fluctuations common in virtualized data centers. However, in order for this configuration to provide adequate protection, it is critical to ensure that the total IT load does not exceed the total capacity of N UPS modules.
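A worked sizing check for that rule, using illustrative module and load figures rather than numbers from the paper:

    import math

    MODULE_KW = 250.0    # assumed rating of each UPS module
    IT_LOAD_KW = 680.0   # assumed measured peak IT load

    # Modules needed to carry the load (N), plus one for redundancy (+1).
    n = math.ceil(IT_LOAD_KW / MODULE_KW)      # N = 3
    installed = n + 1                          # N+1 = 4 modules
    share = IT_LOAD_KW / installed             # equal share, all online

    print(f"N = {n}; install {installed} modules")
    print(f"Normal operation: {share:.0f} kW per module "
          f"({share / MODULE_KW:.0%} of rating)")
    print(f"One module out: {IT_LOAD_KW / n / MODULE_KW:.0%} of rating per module")

With one module failed or in service, the remaining N modules run at about 91 percent of rating in this example, which still satisfies the rule that the IT load not exceed the capacity of N modules.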

Because N+1 configurations require a dedicated static switch, the facility manager cannot simply add on modules in the future to increase capacity. Therefore, it is a best practice to maximize the size of UPS systems used in an N+1 configuration based on projected capacity needs rather than investing in multiple small UPS systems. In addition to being more cost-effective from a long-term CAPEX perspective, limiting the number of modules in a parallel redundant system minimizes service costs over the life of the equipment and is more energy efficient. An alternative option is a 1+N configuration, in which specialized UPS modules with integrated static switches are paralleled via a paralleling cabinet, enabling data center professionals to add on UPS modules as capacity needs change.

UPS selection also should be a consideration when seeking to minimize the occurrence of UPS capacity overload. Many vendors, including Emerson Network Power, have introduced intelligent UPS systems designed for rapid, seamless deployment in critical IT environments.

In addition to high operating efficiencies (up to 97 percent efficiency through the integration of an always-on inverter), some intelligent UPS solutions have significant overload capacity built in and are able to handle bursts of more than 25 percent above the UPS's total capacity.

Many intelligent UPS systems, including the Liebert NXL, also are capable of achieving superior performance and availability through redundant components, fault tolerances for input currents and integrated battery monitoring capabilities.

(Diagram: three UPS modules, each with rectifier AC input and output breakers CB1/CB2, feeding a common bus through a system control cabinet with a continuous-duty static switch and an optional maintenance bypass.)

    Figure 4. A typical parallel-redundant system configuration.


    UPS Equipment Failure

In addition to exceeding capacity, general UPS equipment failure is another leading cause of downtime cited by surveyed data center professionals, with 49 percent reporting an equipment failure within the past two years. With this in mind, it is important to consider a number of key factors that can affect the reliability and overall lifespan of UPS equipment.

System topology is a significant consideration when evaluating UPS reliability. Generally speaking, two types of UPS are found in today's data centers: line-interactive and double-conversion. While double-conversion UPS systems have emerged as an industry standard in recent years, many legacy and/or small data centers may still be using line-interactive systems in their critical infrastructures. Therefore, when evaluating each topology, data center professionals should consider the criticality of their data center operations.

In addition to distributing backup power to the load, the UPS battery in line-interactive systems conditions the power before it flows to IT equipment. Because of this dual role, line-interactive UPS batteries can drain rapidly, putting the system at an increased risk for failure. While these systems provide adequate protection for some data center applications, they are not recommended for business-critical applications or facilities that experience a high occurrence of utility failures.

Double-conversion UPS systems, on the other hand, condition power through a double-conversion process, in which incoming AC power is converted into DC power and then converted back to AC power before it is delivered to the rack or row. This enables the battery to be dedicated to the load and eliminates the need for a power transfer if the primary utility fails. While double-conversion UPS systems typically require a larger initial capital investment, they have been proven to be more than twice as reliable as their line-interactive counterparts, making them ideal for truly mission-critical environments.

Data center professionals also should consider the overall durability of their UPS system. Some UPSs, such as the Liebert NXL, are designed with integrated fault tolerances for input current and a variety of redundant internal components, including fans, power supplies and communications cards. These features enhance reliability and enable the UPS to maintain availability between service visits, even in the event of an internal component failure.

(Chart: MTBF improvement versus number of annual preventive maintenance visits, relative to no visits: 1 visit, 10x; 2 visits, 23x; 3 visits, 37x; 4 visits, 51x; 5 visits, 67x; 6 visits, 82x.)

Figure 5. An increase in the number of annual preventive maintenance visits can be directly correlated with increases in MTBF.

However, regardless of the type of UPS used, preventive maintenance is necessary to maximize the life of critical equipment and minimize the occurrence of unplanned outages. All electronics contain limited-life components that need to be inspected frequently, serviced and replaced periodically to prevent catastrophic system failures. If not serviced properly, the risk of unplanned UPS failure increases dramatically.

While most UPS systems can achieve a lifecycle of 10 years or more under normal operating conditions, it is not uncommon for a well-maintained system to remain in operation for 20 years or more. In a recent study of 5,000 three-phase UPS units with more than 185 million combined operating hours, the frequency of preventive maintenance visits correlated with an increase in mean time between failures (MTBF). As shown in Figure 5, four visits annually increased MTBF more than 50-fold compared to no visits, and more than doubled it compared to semi-annual maintenance.
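As a rough illustration of what those MTBF multipliers mean for expected failure rates, assuming a hypothetical 50,000-hour baseline MTBF (the baseline is an assumption, not a figure from the study; the multipliers are read from Figure 5):

    # How an MTBF multiplier translates into expected annual failures.
    HOURS_PER_YEAR = 8760
    BASELINE_MTBF_HOURS = 50_000  # assumed MTBF with no preventive maintenance

    figure5_multipliers = {0: 1, 1: 10, 2: 23, 3: 37, 4: 51, 5: 67, 6: 82}

    for visits, mult in figure5_multipliers.items():
        mtbf = BASELINE_MTBF_HOURS * mult
        print(f"{visits} visits/yr: MTBF {mtbf:,} h, "
              f"~{HOURS_PER_YEAR / mtbf:.3f} expected failures per year")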

It is important to note, however, that while preventive maintenance increases the reliability of a data center's UPS systems, redundancy still is required so that a system can be taken offline and serviced. Without adequate UPS system redundancy, unplanned downtime risk can increase dramatically during a routine service visit.

    PDU and Circuit Breaker Failures

As evidenced by the Ponemon Institute's survey findings, UPS-related issues are among the most common root causes of downtime for a data center's power infrastructure. However, it also is important to be mindful of downstream factors that can impact the availability of an entire data center, namely circuit breaker and power distribution unit (PDU) failures.

A third of surveyed data center professionals identified PDU and/or circuit breaker failures as a root cause of downtime. While these failures do not occur as frequently as UPS-related outages, they nonetheless should be a significant area of concern for data center professionals because a single failure can bring down an entire facility. However, through effective capacity monitoring and management, the likelihood of circuit breaker and PDU overload can be reduced significantly.

To maximize visibility into the stress placed on the data center's power infrastructure, data center professionals should consider investing in a PDU that has integrated branch circuit monitoring capabilities. Branch circuit monitoring solutions such as the Liebert Distribution Monitor (LDM) utilize branch circuit sensor modules and individual current transformers (CTs) to monitor current input/output for the main panel board as well as individual branch circuit breakers, reporting the current and alarm conditions for each breaker.
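The sketch below illustrates the kind of alarm logic such monitoring implies: compare each breaker's measured current with its rating and warn as loading approaches the limit. This is not the LDM's actual API; the circuit names and sensor helper are hypothetical, and the 80 percent continuous-load threshold reflects common North American breaker-sizing practice.

    # Hypothetical branch-circuit polling loop; a real system reads CT
    # (current transformer) modules on the panel board instead.
    BREAKER_RATINGS_AMPS = {"panel-A/ckt-01": 20, "panel-A/ckt-02": 30}
    ALARM_FRACTION = 0.80  # warn near the 80% continuous-load mark

    def read_current_amps(circuit_id: str) -> float:
        """Stand-in for a real CT/sensor read."""
        sample = {"panel-A/ckt-01": 17.2, "panel-A/ckt-02": 12.5}
        return sample[circuit_id]

    for circuit, rating in BREAKER_RATINGS_AMPS.items():
        amps = read_current_amps(circuit)
        status = "ALARM" if amps >= rating * ALARM_FRACTION else "OK"
        print(f"{status} {circuit}: {amps:.1f} A ({amps / rating:.0%} of {rating} A)")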

Many branch circuit monitoring systems also are designed to work in concert with the data center's building management systems. By pairing circuit monitoring data with performance information for critical facility equipment, data center professionals gain a comprehensive view of the data center's power infrastructure, from the utility to the rack. Many data center management solutions such as the Aperture Integrated Resource Manager (AIRM) enable loads to be shifted at the rack level in real time. This enhanced functionality allows data center professionals to make precise capacity management decisions based on holistic data across interdependent systems, reducing the likelihood of equipment overload failure downstream.

Beyond enhancing capacity monitoring and management, the addition of static transfer switches to the power system configuration can safeguard data center professionals against downtime.

Most servers in today's market are designed with dual power supplies, capable of either powering the server independently or sharing the load equally. Because of this internal redundancy, these types of servers are less prone to downtime due to faults passed along from the PDU. However, it is easy to overlook that legacy servers in today's data centers are designed with a single power supply and an integrated transfer switch between cords. Because these systems depend on the capacity of a single integrated power supply, there is an increased risk of system shutdown if primary power is disrupted.

The installation of static transfer switches (STS) upstream from the IT load assures that single-cord loads will be powered in the event of a bus failure, maintaining the availability of critical IT equipment. Static transfer switches also provide increased protection during service windows, ensuring constant power is delivered to the servers.

In addition to maintaining system availability, STS also enable faults to be compartmentalized and localized to a single bus, preventing widespread outages and increasing the facility manager's ability to quickly identify and remedy root causes. As shown in Figure 6, if an STS is being used in a system with a load fault, the fault will remain isolated and will not be passed along to the second bus. While all single-corded loads connected to the bus will fail in this scenario, distributing the loads among multiple STS minimizes the overall effect of the fault. (A sketch of this transfer-inhibit logic follows Figure 6.)

(Diagram: two static transfer switches, each fed by a preferred and an alternate source, serving loads on Bus A and Bus B; when a load fault occurs, the STS detects the fault current and prohibits source transfer.)
Figure 6. If an STS is being used in a system with a load fault, the fault will remain isolated and will not be passed along to the second bus.
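The sketch below illustrates the transfer-inhibit decision the figure depicts: the STS normally feeds the load from the preferred source and transfers to the alternate on a source failure, but it blocks the transfer when it detects fault-level current so the fault stays on one bus. The fields and threshold are illustrative assumptions, not any vendor's STS firmware.

    from dataclasses import dataclass

    @dataclass
    class SourceState:
        voltage_ok: bool     # source within voltage tolerance
        current_amps: float  # measured current toward the load

    FAULT_CURRENT_AMPS = 400.0  # illustrative fault-detection threshold

    def select_source(preferred: SourceState, alternate: SourceState) -> str:
        """Decide which source feeds the load for one evaluation cycle."""
        if preferred.current_amps >= FAULT_CURRENT_AMPS:
            # Downstream load fault: inhibit transfer so the fault is not
            # propagated to the alternate bus as well.
            return "preferred (transfer inhibited: load fault)"
        if preferred.voltage_ok:
            return "preferred"
        if alternate.voltage_ok:
            return "alternate"
        return "none (both sources failed)"

    print(select_source(SourceState(True, 62.0), SourceState(True, 0.0)))
    print(select_source(SourceState(False, 18.0), SourceState(True, 0.0)))
    print(select_source(SourceState(True, 480.0), SourceState(True, 0.0)))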


Cooling Issues: Heat-Related CRAC Failures and Water Incursion

While many of the leading root causes of downtime are directly related to the data center's power infrastructure, cooling system failures can be equally detrimental to availability. Cooling-related failures were cited as a root cause of at least one outage by more than a third of data center operators polled. Water incursions and heat-related computer room air conditioner (CRAC) failures were cited as the leading causes of cooling-related downtime (35 and 33 percent, respectively).

As high-density configurations become more common, so too does increased heat density, and the availability of critical systems has suffered as a result. One commonly reported issue is heat-related CRAC failure: many legacy CRAC systems were not designed to accommodate the high heat densities common in today's high-density data centers, so heat-related failure has become a common concern in data centers seeking to get more from their existing infrastructure.

One way to minimize the risk of heat-related CRAC failure is to optimize airflow within the data center by adopting a cold-aisle containment strategy. In a cold-aisle configuration, rack rows are arranged facing each other, creating distinct cold aisles (where cool air from the CRAC enters the rack) and hot aisles (where hot air is exhausted from the rack). Because the cold aisles are isolated and sealed off, hot air expelled from the rack is not able to re-enter the cooling environment. This increases the effectiveness of the CRAC system and ensures that cooling capacity is utilized as efficiently as possible.

Many data center professionals choose to increase cooling system effectiveness by utilizing a row-based cooling solution. In fact, the use of row-based cooling solutions can reduce annual cooling-related power consumption by nearly 30 percent (Figure 7).
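As a rough worked example of that claim, assuming an illustrative 0.40 kW of cooling power per kW of sensible heat for a traditional CRAC (a value within the axis range of Figure 7, not a figure quoted in the text):

    # Rough annual-energy comparison behind the ~30 percent savings claim.
    IT_LOAD_KW = 500.0
    CRAC_KW_PER_KW = 0.40                  # assumed traditional CRAC figure
    ROW_KW_PER_KW = CRAC_KW_PER_KW * 0.70  # roughly 30% lower
    HOURS_PER_YEAR = 8760

    crac_kwh = IT_LOAD_KW * CRAC_KW_PER_KW * HOURS_PER_YEAR
    row_kwh = IT_LOAD_KW * ROW_KW_PER_KW * HOURS_PER_YEAR
    print(f"Traditional CRAC: {crac_kwh:,.0f} kWh/yr")
    print(f"Row-based:        {row_kwh:,.0f} kWh/yr")
    print(f"Savings:          {crac_kwh - row_kwh:,.0f} kWh/yr")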

However, while using row-based cooling modules is a sound strategy for increasing cooling effectiveness and efficiency, some row-based cooling solutions create vulnerabilities that can undermine data center availability, the most significant being water incursion.

Most row-based cooling solutions in high-density environments use fluid to quickly remove high heat from the rack. These solutions typically fall into one of two categories: water-based and refrigerant-based. While chilled-water solutions are extremely cost-effective, they bear an intrinsic risk for leakage. If not detected early, water incursion within the rack can lead to the corrosion and failure of sensitive electronic equipment, resulting in downtime as well as costly equipment repairs and/or replacements.

(Chart: kW of power required to cool 1 kW of sensible heat, on a 0.00 to 0.45 scale, comparing a traditional CRAC with refrigerant modules broken out by fan, chiller and pump (CDU and CW) power; the refrigerant modules total roughly 30 percent lower.)

Figure 7. Row-based cooling can lead to energy savings of up to 30 percent in high-density data centers.

With this in mind, the use of a refrigerant-based row cooling solution is a best practice for minimizing the risk of cooling-related equipment failures. Unlike water-based systems, refrigerant-based cooling does not rely on an electrically conductive cooling element, minimizing the risk of catastrophic system failures in the event of a cooling fluid leak. At room temperature most refrigerants become a vapor, further reducing the risk of damaging fluid leaks.

If refrigerant-based cooling solutions are cost-prohibitive, integrating a comprehensive leak detection system such as the Liebert Liqui-tect into the cooling infrastructure is essential to mitigating the risk of system failures due to water incursion. By employing leak detection modules installed at critical points throughout the data center, leak detection systems signal an alarm when moisture reaches potentially hazardous levels. This enables data center professionals to pinpoint leaks within seconds and take corrective action before sensitive electrical equipment is damaged or destroyed.
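The sketch below illustrates zone-based leak detection polling of the kind described above. The zone names, threshold and sensor helper are hypothetical placeholders, not the Liqui-tect API.

    # Poll moisture sensors placed at critical points and raise a
    # zone-specific alarm so a leak can be located quickly.
    MOISTURE_ALARM_THRESHOLD = 0.60  # normalized reading, illustrative

    def read_moisture(zone: str) -> float:
        """Stand-in for reading a real detection cable or module."""
        sample = {"CRAC-01 supply": 0.08, "row-3 CDU": 0.72, "subfloor NE": 0.05}
        return sample[zone]

    for zone in ("CRAC-01 supply", "row-3 CDU", "subfloor NE"):
        level = read_moisture(zone)
        if level >= MOISTURE_ALARM_THRESHOLD:
            print(f"LEAK ALARM at {zone}: reading {level:.2f}")

Reporting the zone, rather than a single facility-wide alarm, is what makes it possible to pinpoint a leak within seconds as the paragraph above notes.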

    Human Error and Accidental EPO

Often the most cost-effective root causes to address, human error and accidental emergency power-off (EPO) remain leading causes of data center outages. In fact, more than half of all data center professionals responding to the Ponemon survey reported at least one outage as a direct result of an accidental shutdown or user error within the past 24 months.

Because these events represent significant yet wholly preventable threats to the availability of mission-critical data centers, data center professionals should observe and enforce the following rules and policies to minimize the potential for catastrophic errors and accidents:

1. Shielding Emergency OFF Buttons
Emergency OFF buttons are generally located near doorways in the data center. Often, these buttons are not covered or labeled and can be mistakenly pressed during an emergency, shutting down power to the entire data center. This risk can be eliminated by labeling and covering emergency OFF buttons to prevent someone from accidentally pushing the button.

2. Strictly Enforcing Food/Drink Policies
Liquids pose the greatest risk for shorting out critical computer components. The best way to communicate your data center's food/drink policy is to post a sign outside the door that states what the policy is and how strictly it is enforced.

3. Avoiding Contaminants
Poor indoor air quality can allow unwanted dust particles and debris to enter servers and other IT infrastructure. Much of the problem can be alleviated by having all personnel who access the data center wear antistatic booties or by placing a mat outside the data center. This includes packing and unpacking equipment outside the data center: moving packaging inside increases the chances that fibers from boxes and skids will end up in server racks and other IT equipment.

4. Documented Maintenance Procedures
Following a documented, task-oriented procedure can mitigate or eliminate the risk associated with performing maintenance. This step-by-step procedure should apply to all vendors, and data center professionals also should ensure that back-up plans are available in case of unforeseen events.

5. Accurate Component Labeling
The inaccurate labeling of power protection devices, such as circuit breakers, can have a direct impact on data center load availability. To correctly and safely operate a power system, all switching devices and the facility one-line diagram must be labeled correctly to ensure the correct sequence of operation. Procedures also should be in place to regularly double-check device labeling.

6. Consistent Operation of the System
As data center professionals become comfortable with the operation of their data centers, it is common to neglect established procedures, forget or skip steps, or take shortcuts and inadvertently shut down the wrong equipment. This further reinforces the point that it is critical to keep all operational procedures up to date and follow the instructions to operate the system.

7. Ongoing Personnel Training
Data center professionals should ensure that all individuals with access to the data center, including IT, emergency and security personnel, have basic knowledge of the equipment so that it is not shut down by mistake.

8. Secure Access Policies
Organizations without strict sign-in policies run the risk of security breaches. Having a sign-in policy that requires an escort for visitors, such as vendors, will enable data center managers to know who is entering and exiting the facility at all times.


    The Value of Data Center Assessments

As we have highlighted throughout this white paper, a number of technologies and best practices can be adopted to address the leading root causes of downtime outlined by the Ponemon Institute. However, many data center professionals and senior executives have yet to recognize the value of identifying potential weaknesses in their data center before they result in an unplanned downtime event.

According to the survey, only 13 percent of data center professionals have conducted a data center assessment or audit following an unplanned downtime event (Figure 8). Unfortunately, these results represent a significant missed opportunity for data center professionals to identify and address vulnerabilities unique to their data center.

In addition to power and cooling systems, a number of factors can impact the availability and performance of critical systems. These factors, some as in-depth as arc flash vulnerability, others as subtle as obstructed floor tiles, can go undetected by data center professionals, resulting in decreased performance and an increased risk for downtime. Considering the ever-increasing cost of downtime, conducting a comprehensive data center assessment is the most cost-effective way to ensure that all vulnerabilities are identified and remedied before they impact data center operations.

A comprehensive assessment of the facility, as well as all thermal and electrical systems, also can offer detailed insight into how an existing data center can be optimized for efficiency without compromising the availability of critical systems.

For example, by conducting a computational fluid dynamics (CFD) simulation, data center professionals can gain a thorough understanding of how cooling system configuration affects cooling efficiency and facility performance. Based on the findings of this simulation, existing equipment can be reconfigured for optimal performance, increasing availability and efficiency with little to no capital expense.

Repair IT or infrastructure equipment: 60%
Replace IT or infrastructure equipment: 56%
Contact equipment vendors for support: 51%
Purchase new IT or infrastructure equipment: 51%
Hire outside experts to remediate the problem: 26%
Implement/improve monitoring capabilities: 19%
Conduct a data center audit or assessment: 13%

Figure 8. Organizations' responses to fixing or correcting root causes, as reported by the Ponemon Institute.


    Conclusion

As evidenced by the findings of the Ponemon Institute, unplanned data center outages remain a difficult and costly challenge for organizations that have focused their attention on cutting operating costs while maintaining or increasing throughput. Unfortunately, this has led many organizations to overlook actions that must be taken to ensure data center efficiency does not compromise the availability of critical systems.

While the root causes reported by surveyed data center professionals span the entire data center, most can be mitigated cost-effectively by adhering to the best practices outlined in this white paper. However, while the root causes addressed here represent the most common causes of downtime cited by surveyed data center professionals, conducting a comprehensive data center assessment is the only way to adequately ensure that all potential vulnerabilities are identified and addressed before they cause a data center outage.

    References

Emerson Network Power e-book, Avoiding Trap Doors Associated with Purchasing a UPS System, http://www.EmersonNetworkPower.com, 2010.

Emerson Network Power white paper, Balancing Scalability and Reliability in the Critical Power System: When Does N+1 Become Too Many +1?, http://www.EmersonNetworkPower.com, 2004.

Emerson Network Power white paper, Choosing System Architecture and Cooling Fluid for High Heat Density Cooling Solutions, http://www.EmersonNetworkPower.com, 2008.

Emerson Network Power white paper, Data Center Assessment Helps Keep Critical Equipment Operational, http://www.EmersonNetworkPower.com, 2007.

Emerson Network Power white paper, The Effect of Regular, Skilled Preventive Maintenance on Critical Power System Reliability, http://www.EmersonNetworkPower.com, 2007.

Emerson Network Power white paper, High-Availability Power Systems, Part II: Redundancy Options, http://www.EmersonNetworkPower.com, 2003.

Emerson Network Power white paper, Implementing Proactive Battery Management Strategies to Protect Your Critical Power System, http://www.EmersonNetworkPower.com, 2008.

Emerson Network Power white paper, Longevity of Key Components in Uninterruptible Power Systems, http://www.EmersonNetworkPower.com, 2008.

Ponemon Institute study, National Survey on Data Center Outages, http://www.Ponemon.org, 2010.

Emerson Network Power white paper, Protecting Critical Systems during Utility Outages: The Role of UPS Topology, http://www.EmersonNetworkPower.com, 2004.

Emerson Network Power technical note, Using Static Transfer Switches to Enhance Data Center Availability and Maintainability, http://www.EmersonNetworkPower.com, 2010.


Emerson Network Power.
The global leader in enabling Business-Critical Continuity.
EmersonNetworkPower.com

    AC Power

    Connectivity

    DC Power

    Embedded Computing

    Embedded Power

    Infrastructure Management & Monitoring

    Outside Plant

    Power Switching & Controls

    Precision Cooling

    Racks & Integrated Cabinets

    Services

    Surge Protection

    Emerson Network Power

1050 Dearborn Drive
P.O. Box 29186

    Columbus, Ohio 43229

800.877.9222 (U.S. & Canada Only)

    614.888.0246 (Outside U.S.)

    Fax: 614.841.6022

    EmersonNetworkPower.com

    Liebert.com

While every precaution has been taken to ensure accuracy and completeness in this literature, Liebert Corporation assumes no responsibility, and disclaims all liability for damages resulting from use of this information or for any errors or omissions.

© 2010 Liebert Corporation. All rights reserved throughout the world. Specifications subject to change without notice. All names referred to are trademarks or registered trademarks of their respective owners. Liebert and the Liebert logo are registered trademarks of the Liebert Corporation. Business-Critical Continuity, Emerson Network Power and the Emerson Network Power logo are trademarks and service marks of Emerson Electric Co. © 2010 Emerson Electric Co.

    SL-24656-R10-10 Printed in USA