White Paper – System Storage Reliability
Juha Salenius


Storage Systems

As a way of defining the various facets of reliability, let’s start by configuring a hypothetical server using six internal hot-swap disk drives that can be either Serial Attached SCSI (SAS) or Serial Advanced Technology Attachment (SATA) disk drives. If these drives were SAS disks and were set up as Just a Bunch of Drives (JBOD), a non-RAID configuration, then the MTBF would be all we’d need to define the computed failure rate. With an individual drive MTBF_SAS of 1,400,000 hours [1], the combined MTBF for six drives computes to 233,333 hours using the following equation, where N is the number of identical components (in this case disk drives) and the subscripts are tc = total components and c = component:

MTBF_tc = MTBF_c / N   (special case where all components are the same)

In contrast to the SAS MTBF_tc, using individual SATA drives instead, each exhibiting an individual MTBF_SATA of 1,000,000 hours [2], the combined MTBF_tc for six drives in a JBOD configuration is 166,667 hours.
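A minimal sketch of the special-case calculation above (the drive counts and MTBF figures are the ones quoted in the text):

```python
def combined_mtbf(component_mtbf_hours: float, n_components: int) -> float:
    """Combined MTBF of N identical components treated as a series system."""
    return component_mtbf_hours / n_components

# Six-drive JBOD examples from the text
print(combined_mtbf(1_400_000, 6))  # SAS:  ~233,333 hours
print(combined_mtbf(1_000_000, 6))  # SATA: ~166,667 hours
```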

RAID Considerations:

Even with the drives configured as a RAID array (levels 0, 1, 5, and 6), the total MTBF for all the drives remains the same as above, because MTBF does not take redundancy into account. MTBI (Mean Time Between Interruption) can be used to highlight the difference in uptime that a redundant configuration provides. In a JBOD configuration the MTBF and the MTBI are the same. Notice that once we move from a non-redundant system to a system with redundant components, we move from reporting MTBF to reporting MTBI as the more meaningful term from a system perspective.

Consider the following RAID levels:

RAID 0 will not be considered here because it does not provide any failure protection, though it does provide higher throughput by striping the data stream across multiple drives, and it is usually combined with other RAID levels to increase their throughput. Certain RAID levels can be combined, for example RAID 10, RAID 50 and RAID 60; these configurations combine data striping across multiple drives with either mirroring or parity. They increase complexity but improve performance.

RAID 1 mirrors data across two disk drives, which requires doubling the number of data drives.

RAID 5 uses parity to recover from a bad read; the data and parity are written in blocks across all drives in the array, with the parity distributed evenly among all drives. Because of the added parity information, a minimum of three disk drives is needed to implement RAID 5.

RAID 6 is a RAID 5 configuration with an additional parity calculation.

[1] Adaptec Inc., Storage Advisors Weblog, 11/02/2005
[2] Ibid.


With RAID 5 and 6, the ratio of data storage to parity storage increases as the number of spindles increases, so for a system with six drives there could be the equivalent of five data drives and one parity drive for RAID 5, and four data drives and two parity drives for RAID 6. The spindle overhead for RAID 5 with five drives is 20%, and doubling the total number of drives to 10 decreases the overhead to 10%.
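A small sketch of this spindle-overhead arithmetic, using the drive counts from the examples above:

```python
def parity_overhead(total_drives: int, parity_equivalent_drives: int) -> float:
    """Fraction of total spindles consumed by parity in a RAID 5/6 array."""
    return parity_equivalent_drives / total_drives

print(parity_overhead(5, 1))   # RAID 5, 5 drives:  0.20 (20%)
print(parity_overhead(10, 1))  # RAID 5, 10 drives: 0.10 (10%)
print(parity_overhead(6, 2))   # RAID 6, 6 drives:  ~0.33
```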

Why use RAID 6 instead of RAID 5? RAID 5 provides protection against a single failed drive; RAID 6 provides protection against two concurrent failures. Once a drive has failed, the only time the array is exposed to an additional failure is the time it takes to replace and rebuild the failed drive, the MTTR (Mean Time To Repair) interval. With RAID 6, a single additional failure during this window can be tolerated because of the additional parity. If the system has a hot spare drive, the time to repair is significantly reduced: the rebuild can start immediately and the failed drive can be replaced during or after the rebuild. The probability of another hardware failure during the MTTR interval is extremely low. But there is another disk-related issue that can cause a problem during this interval: a hard read error, which is more prevalent in SATA disks.
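To illustrate why a second outright hardware failure during the rebuild window is so unlikely, here is a rough sketch using an exponential failure model; the 24-hour rebuild window and the drive MTBF below are illustrative assumptions, not figures from the text:

```python
import math

def p_second_failure(surviving_drives: int, mttr_hours: float, drive_mtbf_hours: float) -> float:
    """Probability that any surviving drive fails outright during the rebuild window."""
    combined_rate = surviving_drives / drive_mtbf_hours
    return 1 - math.exp(-combined_rate * mttr_hours)

# Assumed: 5 surviving drives, 24-hour rebuild, 1,000,000-hour drive MTBF
print(p_second_failure(5, 24, 1_000_000))  # ~0.00012, i.e. about 0.012%
```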

SAS and SATA Drive Considerations:

Both SAS and SATA drives have well-defined Bit Error Rates (BER). SAS drives are more robust than SATA drives, exhibiting a BER on the order of one error in every 10^15 bits read [3], equating to roughly one error per 100 terabytes (TB) read.

SATA drives are not as robust and exhibit BERs on the order of one error in every 10^14 bits read [4], or roughly one per 10 TB.

What does this mean from a system perspective? To illustrate the issue, we’ll start with a SATA disk array that has failed due to a hardware problem and is in the process of rebuilding. Let’s make some assumptions: the array has 10 drives, the drives are 500 GB each, and each drive has a 10^14-bit read BER. The following calculation determines how often an unrecoverable error can be expected:

10 drives × 500 GB × 8 bits/byte = 4 × 10^13 bits read per full pass over the array
10^14 bits per error ÷ 4 × 10^13 bits per pass ≈ 2.5 passes per unrecoverable error

It’s entirely possible that an array will be rebuilt 2.5 times in its life, and a non-recoverable error may occur during those 2.5 rebuilds. This scenario only addresses 500 GB drives, but the industry has moved on and drive sizes have increased to 1 TB and beyond, which makes this issue more problematic. The larger the drive size or the more drives in the array, the more frequently a non-recoverable read error can occur during a rebuild.
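The same arithmetic as a small sketch, using the 10-drive, 500 GB, 10^14-bit BER assumptions above (decimal gigabytes of 10^9 bytes are assumed):

```python
def passes_per_ure(num_drives: int, drive_gb: float, ber_bits: float = 1e14) -> float:
    """Full passes over the array per expected unrecoverable read error."""
    bits_per_pass = num_drives * drive_gb * 1e9 * 8  # decimal GB converted to bits
    return ber_bits / bits_per_pass

print(passes_per_ure(10, 500))    # ~2.5 passes before a URE is expected
print(passes_per_ure(10, 2000))   # ~0.6, i.e. more than one URE expected per pass
```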

Combining Optimal RAID and Hard Drive Choices (probability) – the math:

A concern with SATA disk technology is the Unrecoverable Read Error (URE) rate, which is currently one error per 10^14 bits. A URE every 10^14 bits equates to an error every 2.4 × 10^10 (512-byte) sectors. This becomes critical as drive sizes increase.

[3] Ibid.
[4] Ibid.


When a drive fails in a 7-drive RAID 5 array made up of 2 TB SATA disks, the 6 remaining good 2 TB drives have to be read completely to recover the missing data. As the RAID controller reconstructs the data, it is very likely to encounter a URE in the remaining media. At that point the RAID reconstruction stops.

Here’s the math: reading the 6 surviving 2 TB drives means reading roughly 12 TB, or about 2.3 × 10^10 sectors. With a URE expected every 2.4 × 10^10 sectors, the probability of completing the rebuild without hitting one is (1 − 1/(2.4 × 10^10))^(2.3 × 10^10) ≈ 38%, which leads to the conclusion:

There is a 62% chance of data loss due to an uncorrectable read error on a 7 drive (2 TB each) RAID 5 array with one failed disk, assuming a 10^14 read error rate and ~23 billion sectors in 12 TB. Feeling lucky? [5]
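A quick numerical check of that figure; this is only a sketch, assuming independent 512-byte sector reads and the 10^14-bit URE rate quoted above:

```python
sector_bytes = 512
ure_per_bit = 1e-14
sectors_to_read = 6 * 2e12 / sector_bytes          # ~2.3e10 sectors in the 12 TB read
p_sector_error = ure_per_bit * sector_bytes * 8    # URE probability per sector read
p_data_loss = 1 - (1 - p_sector_error) ** sectors_to_read
print(f"{p_data_loss:.0%}")                        # ~62%
```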

RAID 6 is a technique that can be used to mitigate this failure mode during the rebuild cycle. This is important because it allows the system to recover from two disk failures: one drive failure and a subsequent hard read error from the surviving disks in the array during the rebuild.

With customers looking to reduce system cost using SATA technology, the additional overhead for RAID 6 parity is becoming acceptable. But there are drawbacks to using RAID 6, which include longer write times due to the additional time required to generate the first (RAID 5-style) parity and then the second RAID 6 parity. When an error occurs during a read, both RAID 5 and RAID 6 arrays suffer reduced read throughput due to the data recovery.

As we mentioned at the beginning of this article, we wanted to constrain this discussion to defining RAS (Reliability, Availability, and Serviceability) and addressing increased reliability with SATA disks in a RAID environment. But there are other areas that should be addressed at the system level that also affect disk drive performance. One such area is rotational vibration. This issue is a systemic problem in rack-mount systems due to the critical thermal constraints in 1U and 2U chassis in a NEBS environment. Rotational vibration effects are mitigated in our mechanical designs, and the techniques used are covered in a separate document.

[5] Robin Harris, “Does RAID 6 stop working in 2019?”, StorageMojo, 27 February 2010 (http://storagemojo.com/2010/02/27/does-raid-6-stops-working-in-2019/)


Reliability – MTTF and MTBF (Mean Time To Failure and Mean Time Between Failure)

With so much data exposed to catastrophic failure, a risk exacerbated in the cloud computing environment, it’s important to maintain data integrity, especially in the medical, telecommunications, and military markets. Systems designed for these markets must address three key areas: Reliability, Availability, and Serviceability (RAS); system uptime must be maximized and mission-critical data must be maintained.

The term Mean Time To Failure (MTTF) is an estimate of the average, or mean, time until the initial failure of a design or component (you may not want to include externally caused failures), or until a disruption in the operation of the product, process, procedure, or design occurs. A failure assumes that the product cannot be repaired, nor can it resume any of its normal operations, without being taken out of service. MTTF is similar to Mean Time Between Failure (MTBF), though MTBF is typically slightly longer than MTTF because MTBF includes the repair time of the design or component: MTBF is the average time between failures including the average repair time, which is known as MTTR (Mean Time To Repair). [6]
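In equation form, the relationship implied above is simply:

MTBF = MTTF + MTTR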

What is Reliability? Per the Six Sigma SPC (Statistical Process Control) Quality Control Dictionary, reliability is the probability that any given design or process will execute within the anticipated operational or design margin for a specified period of time. In addition, the system will work under defined operating conditions with a minimum amount of stoppage due to a design or process error. Some indicators of reliability are MTBF (Mean Time Between Failures) computations, ALT (Accelerated Life Testing using temperature chambers), MTTF (Mean Time To Failure) computations, and Chi-Square [7] (the statistical difference between observed and expected values).

MTBF is a calculated indication of reliability. From a system perspective, any reliable assembly must satisfactorily perform its intended function under defined circumstances that may not be part of the MTBF calculation’s environment, such as operating in varying ambient temperatures. MTBF addresses reliability in a very controlled and limited scope. Traditionally, MTBF calculations are based on the Telcordia Technologies Special Report SR-332, Issue 1, Reliability Prediction Procedure for Electronic Equipment. The results from these calculations can be used to roughly assist customers in the evaluation of individual products, but should not be taken as a representation or guarantee of the reliability or performance of the product. MTBF is only a gross representation of how reliable a system will be under clearly defined conditions, not real-world conditions.

If we can’t use the results of the MTBF calculation to determine when components will wear out in the real world or which product is better than the others, and MTBF does not provide a reliable metric for field failures, then why use it? It allows us to estimate how often a system will fail under steady-state environmental conditions. Early in the design cycle, component MTBF can be used to determine which parts are likely to fail first, enabling engineering to improve design robustness by selecting more robust components or designing with hardened assemblies.

[6] Paraphrasing the Six Sigma SPC’s Quality Control Dictionary and Glossary, http://www.sixsigmaspc.com/dictionary/glossary.html
[7] Ibid.


There are three methods used in MTBF calculations: 1. the black box, 2. the black box plus laboratory inputs, and 3. the black box plus field-failure inputs. While the industry traditionally uses Method 1, Kontron Inc. CBPU/CRMS uses a combination of all three methods – black box with lab inputs, coupled with field data where available. For large aggregated component assemblies, such as computer baseboards, there are typically vendor-calculated MTBF figures; for passive components, there is industry-standard failure-rate data; and for proprietary components, lab or field data is available.

Availability – MTBI (Mean Time Between Interruption)

If MTTF and MTBF address failures measured from initial power-on, or from a previous failure including the repair time (MTTR), what is meant by Mean Time Between Interruption (MTBI)? MTBI addresses designs that provide redundancy, allowing a redundant component to fail without halting (failing) the system. The system may not run at full speed during the time it takes to replace or rebuild the failed component, but it will run. MTBI durations are much larger than MTTF/MTBF intervals, which is better, and they can span multiple failures (as with RAID 6) provided the failed redundant components are replaced.

Serviceability – MTTR (Mean Time To Repair)

This term refers to how quickly and easily a system can be repaired after an MTBF, MTTF, or MTBI event.

One measure of availability is what’s touted as the Five Nines. As we’ve seen, MTBI and MTTR are tightly coupled. There is a significant amount of marketing literature promoting Five Nines availability for systems designed for critical environments. But what is meant by Five Nines? This particular metric is an artifact of the monolithic telecommunications industry, when the incumbent carriers exercised complete control of the equipment installed in their central offices. Five Nines availability was, and in many cases remains, a requirement of Telco-grade Service Level Agreements (SLAs), defined as the ratio of system uptime (MTBI) to uptime plus unplanned downtime (MTTR), not counting scheduled maintenance, planned updates, reboots, etc. Five Nines availability means an uptime of 99.999% per year or, expressed conversely, roughly five and a quarter minutes of unplanned downtime per year, comparable to six sigma, a 99.99966% process capability. With downtime measured in minutes, it is vitally important that the system serviceability duration is minimized and that spare parts are available locally, e.g., hot spares for disk drive arrays.

With Five Nines reflecting the system elements and not the network, we can easily compute the network-level availability. For example, if two non-redundant serial network elements each have 99.999% availability, the total availability of the network is 0.99999 × 0.99999 = 0.99998, or 99.998%, which is Four Nines availability. Notice that with redundant components in a system we use MTBI, not MTBF, as the measure of the interval between system-level failures. By providing redundancy for all high-powered and rotating components we increase the time between system failures (MTBI) but reduce the MTTF/MTBF, because there are more components that can fail.
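A small sketch of this availability arithmetic (the 99.999% figures are those used above; a 365-day year is assumed):

```python
MINUTES_PER_YEAR = 365 * 24 * 60

def downtime_minutes_per_year(availability: float) -> float:
    """Unplanned downtime per year implied by an availability figure."""
    return (1 - availability) * MINUTES_PER_YEAR

def serial_availability(*elements: float) -> float:
    """Availability of non-redundant elements in series is the product."""
    total = 1.0
    for a in elements:
        total *= a
    return total

print(downtime_minutes_per_year(0.99999))     # ~5.26 minutes per year (Five Nines)
print(serial_availability(0.99999, 0.99999))  # ~0.99998, i.e. Four Nines
```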

Computations

When evaluating a non-redundant system, all sub-systems’ MTBF numbers can be viewed as a series sequence, with any single component or assembly failure causing a system failure. The total calculated MTBF will be less than the lowest individual component MTBF, as illustrated in the following formula.

1 / MTBF_tc = 1/MTBF_1 + 1/MTBF_2 + … + 1/MTBF_N   (standard case where all components aren’t the same)
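A sketch of this standard-case series combination; the sub-system MTBF values below are hypothetical placeholders, not figures from the text:

```python
def series_mtbf(mtbf_hours: list[float]) -> float:
    """Series-system MTBF: individual failure rates (1/MTBF) add."""
    return 1 / sum(1 / m for m in mtbf_hours)

# Hypothetical sub-system MTBFs in hours
print(series_mtbf([1_000_000, 500_000, 250_000, 125_000]))  # ~66,667 hours, below the lowest entry
```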

When we add redundant assemblies to the system, these combined components are measured as a single block and the system-level result is no longer MTBF but rather MTBI; the system keeps working even with the failed redundant component. For example, in a system with no redundant fans, the MTBF for the fan group may be 261,669 hours. After we add redundant fans the MTBI is 3,370,238,148 hours, even though the MTBF is reduced because of the added fans. Because this MTBI is such a large number, the fan group is virtually eliminated from the equation for system MTBI. We add redundant components to increase the MTBI of the grouped components so that they no longer adversely affect the system-level MTBI, because their MTBI values are so large. The system’s single points of failure are reduced by taking the assemblies that traditionally exhibit high single-point failure rates, i.e., any component assemblies that move, rotate, or work at the edge of their thermal or electrical envelope, and designing the system in such a way that these assemblies are redundant.

Power supplies are also items that fail because they are usually working at the higher end of their components’ thermal and electrical limits. By adding redundant power supplies, the MTBF can go from 125,000 hours for a single supply to an MTBI of 326,041,999 hours for a redundant pair. Like the earlier example with the fans, this is a substantial change and will have a major positive impact on the system MTBI.
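To illustrate where numbers of this magnitude come from, a commonly used approximation for a repairable one-of-two redundant pair is MTBI ≈ MTBF² / (2 × MTTR). This is only a sketch: the 24-hour repair window below is an assumption, and the figures quoted above were presumably produced with the full reliability model, so the approximation will not match them exactly.

```python
def redundant_pair_mtbi(mtbf_hours: float, mttr_hours: float) -> float:
    """Approximate MTBI of a repairable 1-of-2 redundant pair."""
    return mtbf_hours ** 2 / (2 * mttr_hours)

# Assumed 24-hour repair window (illustrative, not from the text)
print(redundant_pair_mtbi(125_000, 24))  # ~3.3e8 hours, the same order as 326,041,999
```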