

Chapter 7. Disk subsystem

Ultimately, all data must be retrieved from and stored to disk. Disk accesses are usually measured in milliseconds, whereas memory and PCI bus operations are measured in nanoseconds or microseconds. Disk operations are typically thousands of times slower than PCI transfers, memory accesses, and LAN transfers. For this reason the disk subsystem can easily become the major bottleneck for any server configuration.

Disk subsystems are also important because the physical orientation of data stored on disk has a dramatic influence on overall server performance. A detailed understanding of disk subsystem operation is critical for effectively solving many server performance bottlenecks.

A disk subsystem consists of the physical hard disk and the controller. A disk is made up of multiple platters coated with magnetic material to store data. The entire platter assembly mounted on a spindle revolves around the central axis. A head assembly mounted on an arm moves to and fro (linear motion) to read the data stored on the magnetic coating of the platter.

The linear movement of the head is referred to as the seek. The time it takes to move to the exact track where the data is stored is called seek time. The rotational movement of the platter to the correct sector to present the data under the head is called latency. The ability of the disk to transfer the requested data is called the data transfer rate.

The most widely used drive technology today in servers is SCSI (Small Computer System Interface). IBM's flagship SCSI controller is the ServeRAID-4H adapter. Besides SCSI, other storage technologies are available, such as:

• SSA (Serial Storage Architecture)
• FC-AL (Fibre Channel Arbitrated Loop)
• EIDE (Enhanced Integrated Drive Electronics)

Using EIDE in servers

For performance reasons, do not use EIDE disks in your server. The EIDE interface does not handle multiple simultaneous I/O requests very efficiently and so is not suited to a server environment. The EIDE interface uses more server CPU capacity than SCSI. We recommend you limit EIDE use to CD-ROM and tape devices.


In this redbook we will focus only on SCSI and Fibre Channel.

7.1 SCSI bus overview

The SCSI bus has evolved into the predominant server disk connection technology. Several different versions of SCSI exist. Table 11 contains all versions covered by the current SCSI specification.

Table 11. SCSI specifications

SCSI Standard    Bus Clock    Speed, 50-pin Narrow (8-bit) / 68-pin Wide (16-bit)    Maximum Cable Length
SCSI             5 MHz        5 MBps                                                 6 meters
SCSI-2 Fast      10 MHz       10 MBps / 20 MBps                                      3 meters
Ultra SCSI       20 MHz       20 MBps / 40 MBps                                      1.5 meters
Ultra2 SCSI      40 MHz       40 MBps / 80 MBps                                      12 meters (LVD)
Ultra3 SCSI      80 MHz       80 MBps / 160 MBps                                     12 meters (LVD)

7.1.1 SCSI

First implemented as an ANSI standard in 1986, the Small Computer System Interface defines an 8-bit interface with a burst-transfer rate of 5 MBps with a 5 MHz clock (that is, 1 byte transferred per clock cycle). SCSI cable lengths are limited to 6 meters.

7.1.2 SCSI-2

The SCSI-2 standard was released by ANSI in 1996 and allowed for better performance than the original SCSI interface. It defines extensions that allow for 16-bit transfers and twice the data transfer rate due to a 10 MHz clock. The 8-bit interface is called SCSI-2 Fast and the 16-bit interface is called SCSI-2 Fast/Wide.

In addition to the faster speed, SCSI-2 also introduced new command sets to improve performance when multiple requests are issued from the server.

The trade-off with increased speed was shorter cable length. The 10 MHz SCSI-2 interface supported a maximum cable length of 3 meters.



7.1.3 Ultra SCSI

Ultra SCSI is an update to the SCSI-2 interface offering faster data transfer rates and was introduced in 1996. It is a subset of the SCSI-3 parallel interface (SPI) standard currently under development within the X3T10 SCSI committee.

The clock speed was doubled again to 20 MHz and provides a data transfer speed of up to 40 MBps with a 16-bit data width, while maintaining backward compatibility with SCSI and SCSI-2. Although the data transfer can be done at 20 MHz (that is, 40 MBps wide), all SCSI commands are issued at 10 MHz to maintain compatibility. This means that the maximum bandwidth is less than 31 MBps, even with 64 KB blocks.

Once again, with the increased speed, cable lengths were halved to 1.5 meters maximum.
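The effect of issuing commands at the slower rate can be sketched with a simple model. The 0.5 ms per-command overhead below is an illustrative assumption, not a figure from the SCSI specification; it only shows why the effective rate stays below the 40 MBps burst rate even for large blocks.

# Illustrative sketch: fixed per-command overhead keeps effective Ultra SCSI
# throughput below the burst rate. The 0.5 ms overhead is an assumed value.

def effective_mbps(block_kb: float, burst_mbps: float, command_overhead_ms: float) -> float:
    """Effective throughput of one I/O: data transfer time plus command-phase time."""
    data_ms = block_kb / 1024 / burst_mbps * 1000   # time to move the data at burst rate
    total_ms = data_ms + command_overhead_ms        # plus the command phase at the slower rate
    return (block_kb / 1024) / (total_ms / 1000)

print(round(effective_mbps(64, 40.0, 0.5), 1))      # ~30.3 MBps for 64 KB blocks, below the 40 MBps burst rate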

7.1.4 Ultra2 SCSI

Ultra2 SCSI uses Low Voltage Differential (LVD) signalling, which is designed to improve SCSI bus signal quality, enabling faster transfer rates and longer cable lengths. Ultra2 SCSI doubles the clock speed to 40 MHz. It employs the same concept as the older Differential SCSI specification, where two signal lines are used to transmit each of the 8 or 16 bits, one signal the negative of the other. See Figure 34. At the receiver, one signal (A+) is subtracted from the other (A-), that is, the differential is taken, which effectively removes spikes and other noise from the original signal. The result is the difference signal shown in Figure 34.

Figure 34. Differential SCSI

Differential components tend to be more expensive than similar single-ended SCSI components, and differential termination requires a lot of power, generating significant heat levels. Because of the large voltage swings (20 Volts) and high power requirements, current differential transceivers cannot be integrated onto the SCSI chip, but must be additional external components.

LVD has differential's advantages of long cables and noise immunity without the power and integration problems. Because LVD uses a small (1.1 Volts) voltage swing, LVD transceivers can be implemented in CMOS, allowing them to be built into the SCSI chip, reducing cost, board area, power requirements, and heat.

The use of LVD allows cable lengths to be up to 12 meters.
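A toy model of the differential idea: noise picked up along the cable appears equally on both wires, so the subtraction at the receiver cancels it. The signal levels and the noise model below are purely illustrative assumptions.

import random

def differential_receive(bits, noise_amplitude=0.4):
    """Recover bits from a noisy differential pair by subtracting A- from A+."""
    received = []
    for bit in bits:
        level = 1.0 if bit else -1.0
        noise = random.uniform(-noise_amplitude, noise_amplitude)  # common-mode noise on the cable
        a_plus = level + noise                                     # signal plus noise
        a_minus = -level + noise                                   # inverted signal plus the same noise
        received.append(1 if (a_plus - a_minus) > 0 else 0)        # the difference cancels the noise
    return received

print(differential_receive([0, 1, 0, 1, 1, 0]))   # the original bit pattern is recovered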

7.1.5 Ultra3 SCSI

The maximum theoretical throughput of Ultra3 160/m SCSI can reach 160 MBps on each SCSI channel. Ultra3 160/m uses the same clock frequency as Ultra2 SCSI, but data transfers occur on both rising and falling edges of the clock signal, effectively doubling the throughput. This feature is called double transition clocking.

Note: double transition clocking requires LVD signalling. On a single-ended SCSI bus, clocking will revert to Single Transition mode. If you use a mixture of Ultra3 and Ultra2 devices on an LVD-enabled SCSI bus, there is no need for all devices to run at Ultra2 speed: the Ultra3 SCSI devices will still operate at the Ultra3 (160 MBps) speed.
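A quick check of these burst rates, computed from the clock, the bus width, and whether double transition clocking is in use (a small illustrative helper, not part of any SCSI tool):

def burst_mbps(clock_mhz: float, width_bytes: int, double_transition: bool = False) -> float:
    """Burst transfer rate: clock x bus width, doubled when both clock edges carry data."""
    return clock_mhz * width_bytes * (2 if double_transition else 1)

print(burst_mbps(10, 2))                           # SCSI-2 Fast/Wide: 20 MBps
print(burst_mbps(40, 2))                           # Ultra2 SCSI wide: 80 MBps
print(burst_mbps(40, 2, double_transition=True))   # Ultra3 160/m: 160 MBps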

Additionally, Ultra3 160/m SCSI can use CRC to ensure data integrity and is therefore far more reliable than older SCSI implementations, which only support parity control.

Domain validation is another feature of Ultra3 160/m SCSI. It is performed during the SCSI bus initialization and its intent is to ensure that devices on the SCSI bus (the domain) can reliably transfer data at the negotiated speed. Only Ultra3-capable devices can use domain validation.

Note: Ultra3 160/m is a subset of Ultra3 SCSI. It supports double transition clocking, CRC and domain validation, but does not include all Ultra3 SCSI features, like packetization or quick arbitration.

7.1.6 SCSI controllers and devices

There are two basic types of SCSI controller designs — array and non-array. A standard non-array SCSI controller allows connection of SCSI disk drives to the PCI bus. Each drive is presented to the operating system as an individual, physical drive.


Figure 35 shows a typical non-array controller. The SCSI bus (an internal cable typically) is terminated on both ends. The SCSI controller (or host adapter) normally has one of the end terminators integrated within its electronics, so only one physical terminator is required.

The SCSI bus can contain different device types, such as disk, CD-ROM and tape, all on the same bus. However, most non-disk devices conform to the slower SCSI and SCSI-2 Fast standards. So, if I/O to a CD-ROM or tape drive is required, the entire SCSI bus would have to switch to the slower speed during that access, which dramatically affects performance.

This would not be much of a problem if the CD-ROM is not used for production purposes (that is, the CD-ROM is not a LAN resource available to users) and the tape drive is only accessed after hours, when performance is not critical.

If at all possible, we recommend you do not attach CD-ROMs or tape drives to the same SCSI bus as disk devices. Fortunately, most Netfinity servers have the standard CD-ROM on the EIDE bus.

Figure 35. Non-array SCSI configuration

The array controller, a more advanced design, contains hardware designed to combine multiple SCSI disk drives into a larger single logical drive. Combining multiple SCSI drives into a larger logical drive greatly improves I/O performance compared to single-drive performance. Most array controllers employ fault-tolerant techniques to protect valuable data in the event of a disk drive failure. Array controllers are installed in almost all servers because of these advantages.

Note: Although there are many array controller technologies, each possessing unique characteristics, this redbook includes details and tuning information specific to the IBM ServeRAID array controller.


7.2 SCSI IDs

Figure 36. SCSI ID priority

With the introduction of SCSI-2, a total of 16 devices can be connected to a single SCSI bus. To uniquely identify each device, each is assigned a SCSI ID from 0 to 15. One of these is the SCSI controller itself and it is assigned ID 7.

Because the 16 devices share a single data channel, only one device can use the bus at a time. When two SCSI devices attempt to control the bus, the SCSI IDs determine who wins according to a priority scheme, as shown in Figure 36.

The highest priority ID is that of the controller. Next are the low order IDs from 6 to 0 and then the high order IDs from 15 to 8.

Although this priority scheme allows backward compatibility, it can result in negative system performance if your devices are configured incorrectly. Narrow (8-bit) devices with lower IDs will automatically preempt use of the bus by the faster F/W devices with addresses greater than 7. This is especially important when CD-ROMs and tape drives are placed on the same SCSI bus as F/W disk drives.

Note: With the use of hot-swap drives, the SCSI ID is automatically set by the hot-swap backplane. Typically, the only change is whether the backplane assigns high-order IDs or low-order IDs.
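The arbitration order described above can be written out as a small sketch. The priority list itself comes from the text; the helper function is an illustrative assumption, not part of any SCSI software.

SCSI_PRIORITY_ORDER = [7] + list(range(6, -1, -1)) + list(range(15, 7, -1))

def arbitration_winner(requesting_ids):
    """Return the ID that wins arbitration among the devices requesting the bus."""
    return min(requesting_ids, key=SCSI_PRIORITY_ORDER.index)

print(SCSI_PRIORITY_ORDER)             # [7, 6, 5, 4, 3, 2, 1, 0, 15, 14, 13, 12, 11, 10, 9, 8]
print(arbitration_winner([3, 9, 14]))  # 3 -- a lower narrow ID preempts the wide IDs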

7.3 Disk array controller architecture

Almost all server disk controllers implement the SCSI communication between the disk controller and disk drives. SCSI is an intelligent interface that allows simultaneous processing of multiple I/O requests. This is the single most important advantage for using SCSI controllers on servers. Servers must process multiple independent requests for I/O. SCSI's ability to concurrently process many different I/O operations makes it the optimal choice for servers.

SCSI array controllers consist of the following primary components:

• PCI bus interface/controller



• SCSI bus controller(s) and SCSI bus(es)
• Microprocessor
• Memory (microprocessor code and cache buffers)
• Internal bus (connects PCI interface, microprocessor, and SCSI controllers)

Figure 37. Architecture of a disk array controller

7.4 Disk array controller operation

The SCSI-based disk array controller is a PCI busmaster initiator with capability to master the PCI bus to gain direct access to server main memory. The following sequence outlines the fundamental operations that occur when a disk-read operation is performed:

1. The server operating system generates a disk I/O read operation by building an I/O control block command in memory. The I/O control block contains the READ command, a disk address called a Logical Block Address (LBA), a block count or length, and the main memory address where the read data from disk is to be placed (destination address). A sketch of such a control block follows this sequence.

2. The operating system generates an interrupt to tell the disk array controller that it has an I/O operation to perform. This interrupt initiates execution of the disk device driver. The disk device driver (executing on the server's CPU) addresses the disk array controller and sends it the address of the I/O control block and a command instructing the disk array controller to fetch the I/O control block from memory.

3. The disk array controller initiates a PCI bus transfer to copy the I/O control block from server memory into its local adapter memory. The on-board microprocessor executes instructions to decode the I/O control block command, to allocate buffer space in adapter memory to temporarily store the read data, and to program the SCSI controller chip to initiate access to the SCSI disks containing the read data. The SCSI controller chip is also given the address of the adapter memory buffer that will be used to temporarily store the read data.

4. At this point, the SCSI controller arbitrates for the SCSI bus, and when bus access is granted, a read command, along with the length of data to be read, is sent to the SCSI drives that contain the read data. The SCSI controller disconnects from the SCSI bus and waits for the next request.

5. The target SCSI drive begins processing the read command by initiating the disk head to move to the track containing the read data (called a seek operation). The average seek time for current high-performance SCSI drives is about 5 to 7 milliseconds.

This time is derived by measuring the average amount of time it takes to position the head randomly from any track to any other track on the drive. The actual seek time for each operation can be significantly longer or shorter than the average. In practice, the seek time depends upon the distance the disk head must move to reach the track containing the read data.

6. After the seek time elapses, and the head reaches its destination track, the head begins to read a servo track (adjacent to the data track). A servo track is used to direct the disk head to accurately follow the minute variations of the data signal encoded within the disk surface.

The disk head also begins to read the sector address information to identify the rotational position of the disk surface. This allows the head to know when the requested data is about to rotate underneath the head. The time that elapses between the point when the head settles and is able to read the data track, and the point when the read data arrives, is called the rotational latency. Most disk drives have a specified average rotational latency, which is half the time it takes to traverse one complete revolution. It is half the rotational time because, on average, the head will have to wait a half revolution to access any block of data on a track.

The average rotational latency of a 7200 RPM drive is about 4 milliseconds, whereas the average rotational latency of a 10,000 RPM drive is about 3 milliseconds. The actual latency depends upon the angular distance to the read data when the seek operation completes, and the head can begin reading the requested data track.

7. When the read data becomes available to the read head, it is transferred from the head into a buffer contained on the disk drive. Usually this buffer is large enough to contain a complete track of data.


8. The disk drive has the ability to be a SCSI bus initiator or SCSI bus target (similar terminology is used for PCI). Now the controller logic in the disk drive arbitrates to gain access to the SCSI bus, as an initiator. When the bus becomes available, the disk drive begins to burst the read data into buffers on the adapter SCSI controller chip. The adapter SCSI controller chip then initiates a DMA (direct memory access) operation to move the read data into a cache buffer in array controller memory.

9. When the transfer of read data into disk array cache memory is complete, the disk controller becomes an initiator and arbitrates to gain access to the PCI bus. Using the destination address that was supplied in the original I/O control block as the target address, the disk array controller performs a PCI data transfer (memory write operation) of the read data into server main memory.

10. When the entire read transfer to server memory has completed, the disk array controller generates an interrupt to communicate completion status to the disk device driver. This interrupt informs the operating system that the read operation has completed.
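As referenced in step 1, the I/O control block can be pictured as a simple record. The field names and the Python dataclass below are illustrative assumptions only, not the actual ServeRAID data structure.

from dataclasses import dataclass

@dataclass
class IOControlBlock:
    command: str                # e.g. "READ"
    lba: int                    # Logical Block Address of the first block on disk
    block_count: int            # number of blocks to transfer
    destination_address: int    # main memory address where the read data is placed

# Hypothetical example: read 16 blocks starting at LBA 4096 into a buffer at 0x1F400000.
iocb = IOControlBlock(command="READ", lba=4096, block_count=16,
                      destination_address=0x1F400000)
print(iocb)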

7.5 RAID summary

Most of us have heard of RAID (redundant array of independent disks) technology. Unfortunately, there is still significant confusion about how RAID actually works and the performance implications of each RAID strategy. Therefore, this section presents a brief overview of RAID and the performance issues as they relate to commercial server environments.

RAID was created by computer scientists at the University of California at Berkeley, to address the huge gap between computer I/O requirements and single disk drive latency and throughput. RAID is a collection of techniques that treat multiple, inexpensive disk drives as a unit, with the object of improving performance and/or reliability. IBM and the IT industry have also introduced more RAID levels to meet industry demand. The following RAID strategies are defined by the Berkeley scientists, IBM and the IT industry:

Table 12. RAID summary

RAID level   Fault tolerant?   Description
RAID-0       No                All data evenly distributed to all drives.
RAID-1       Yes               A mirrored copy of one drive to another drive (2 disks).
RAID-1E      Yes               Mirrored copies of each drive.
RAID-3       Yes               Single checksum drive. Bits of data are striped across N-1 drives.
RAID-4       Yes               Single checksum drive. Blocks of data are striped across N-1 drives.
RAID-5       Yes               Distributed checksum. Both data and parity are striped across all drives.
RAID-5E      Yes               Distributed checksum and hot-spare. Data, parity and hot-spare are striped across all drives.
RAID-10      Yes               A striped set of mirrored (RAID-1) arrays.

RAID-3 is useful for scientific applications that require increased byte throughput. It has very poor random access characteristics, and is not generally used in commercial applications.

RAID-4 uses a single checksum drive that becomes a significant bottleneck in random commercial applications. It is not likely to be used by a significant number of customers because of its slow performance.

RAID strategies that are supported by the IBM ServeRAID adapter are:

• RAID-0
• RAID-1
• RAID-1E
• RAID-5
• RAID-5E
• Composite RAID levels, such as RAID-10 and RAID-50

7.5.1 RAID-0

RAID-0 is a technique that stripes data evenly across all disk drives in the array. Strictly, it is not a RAID level, as no redundancy is provided. On average, accesses will be random, thus keeping each drive equally busy.


SCSI has the ability to process multiple, simultaneous I/O requests, and I/O performance is improved because all drives can contribute to system I/O throughput. Since RAID-0 has no fault tolerance, when a single drive fails, the entire array becomes unavailable.

RAID-0 offers the fastest performance of any RAID strategy for random commercial workloads. RAID-0 also has the lowest cost of implementation because redundant drives are not required.

Figure 38. RAID-0: All data evenly distributed across all drives but there is no fault tolerance
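A small sketch of the striping shown in Figure 38 (illustrative only, not ServeRAID code): it maps a logical block number to the drive and offset it lands on in a RAID-0 array.

def raid0_location(logical_block: int, num_drives: int, blocks_per_stripe_unit: int):
    """Return (drive_index, block_within_drive) for a block in a RAID-0 array."""
    stripe_unit = logical_block // blocks_per_stripe_unit   # which stripe unit overall
    drive = stripe_unit % num_drives                        # stripe units go round-robin across drives
    row = stripe_unit // num_drives                         # how many full stripes precede it on each drive
    offset = row * blocks_per_stripe_unit + logical_block % blocks_per_stripe_unit
    return drive, offset

# With 3 drives and 1-block stripe units, blocks 1-9 land as in Figure 38:
# 1, 4, 7 on drive 0; 2, 5, 8 on drive 1; 3, 6, 9 on drive 2.
for block in range(1, 10):
    print(block, raid0_location(block - 1, num_drives=3, blocks_per_stripe_unit=1))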

7.5.2 RAID-1

RAID-1 provides fault tolerance by mirroring one drive to another drive. The mirror drive ensures access to data should a drive fail. RAID-1 also has good I/O throughput performance compared to single-drive configurations because read operations can be performed on any data record on any drive contained within the array.

Most array controllers (including the ServeRAID family) do not attempt to optimize read latency by issuing the same read request to both drives in the mirrored pair. The drive in the pair that is least busy is issued the read command, leaving the other drive to perform another read operation. This technique ensures maximum read throughput.

Write performance is somewhat reduced because both drives in the mirrored pair must complete the write operation. For example, two physical write operations must occur for each write command generated by the operating system.


RAID-1 offers significantly better I/O throughput performance than RAID-5. However, RAID-1 is somewhat slower than RAID-0.

Figure 39. RAID-1: Fault-tolerant. A mirrored copy of one drive to another drive.

7.5.3 RAID-1E

RAID-1 Enhanced (which will be referred to as RAID-1E throughout the rest of this document) is only implemented by the IBM ServeRAID adapter and allows a RAID-1 array to consist of three or more disk drives. "Regular" RAID-1 consists of exactly two drives.

The data stripe is spread across all disks in the array to maximize the number of spindles that are involved in an I/O request to achieve maximum performance. RAID-1E is also called mirrored stripe, as a complete stripe of data is mirrored to another stripe within the set of disks. Like RAID-1, only half of the total disk space is usable — the other half is used by the mirror.

Figure 40. RAID-1E: Mirrored copies of each drive

Because you can have more than two drives (up to 16), RAID-1E will outperform RAID-1. The only situation where RAID-1 will perform better than RAID-1E is the reading of sequential data. The reason for this is that when a RAID-1E array reads sequential data off a drive, the data is striped across multiple drives. RAID-1E interleaves data on different drives, therefore seek operations occur more frequently during sequential I/O. In RAID-1, data is not interleaved, so fewer seek operations occur for sequential I/O.


7.5.4 RAID-5

RAID-5 offers an optimal balance between price and performance for most commercial server workloads. RAID-5 provides single-drive fault tolerance by implementing a technique called single equation single unknown. This technique says that if any single term in an equation is unknown, the equation can be solved to exactly one solution.

The RAID-5 controller calculates a checksum (the parity stripe in Figure 41) using a logic function known as an exclusive-or (XOR) operation. The checksum is the XOR of all data elements in a row. The XOR operation can be performed quickly by the RAID controller hardware and is used to solve for the unknown data element.

In Figure 41, addition is used instead of XOR to illustrate the technique: stripe 1 + stripe 2 + stripe 3 = parity stripe 1-3. Should drive one fail, stripe 1 becomes unknown and the equation becomes X + stripe 2 + stripe 3 = parity stripe 1-3. The controller solves for X and returns stripe 1 as the result.
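The same idea in code (a minimal sketch with made-up byte values, not ServeRAID firmware logic): the parity is the XOR of the data stripe units in a row, and any one missing unit can be rebuilt by XORing the survivors with the parity.

from functools import reduce

def xor_blocks(blocks):
    """XOR equally sized byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

stripe1, stripe2, stripe3 = b"\x11\x22", b"\x33\x44", b"\x55\x66"   # example data
parity = xor_blocks([stripe1, stripe2, stripe3])                    # parity stripe 1-3

# Drive one fails: recover stripe 1 from the surviving data plus the parity.
recovered = xor_blocks([stripe2, stripe3, parity])
assert recovered == stripe1
print(recovered.hex())   # 1122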

A significant benefit of RAID-5 is the low cost of implementation, especially for configurations requiring a large number of disk drives. To achieve fault tolerance, only one additional disk is required. The checksum information is evenly distributed over all drives, and checksum update operations are evenly balanced within the array.

Figure 41. RAID-5: Both data and parity are striped across all drives

However, RAID-5 yields lower I/O throughput than RAID-0 and RAID-1. This is due to the additional checksum calculation and write operations required. In general, I/O throughput with RAID-5 is 30-50% lower than with RAID-1. (The actual result depends upon the percentage of write operations.) A workload with a greater percentage of write requests generally has a lower RAID-5 throughput. RAID-5 will provide I/O throughput performance similar to RAID-0 when the workload does not require write operations (read only).

For more information on RAID-5 performance, see 7.6, "ServeRAID RAID-5 algorithms" on page 136.


7.5.5 RAID-5E

Figure 42. RAID-5E: The hot spare is integrated into all disks, instead of a separate disk

RAID-5E was invented by IBM research and is a technique that distributes the hot-spare drive space over the N+1 drives comprising the RAID-5 array plus standard hot-spare drive. It was first implemented in ServeRAID firmware V3.5.

Adding a hot-spare drive to a server protects data by reducing the time spent in the critical state. This technique does not make maximum use of the hot-spare drive because it sits idle until a failure occurs. Often many years can elapse before the hot-spare drive is ever used. IBM invented a method to utilize the hot-spare drive to increase performance of the RAID-5 array during typical processing and preserve the hot-spare recovery technique. This method of incorporating the hot spare into the RAID array is called RAID-5E.

RAID-5E is designed to increase the normal operating performance of a RAID-5 array in two ways:

• The hot-spare drive contains data that can be accessed during normal operation. The RAID-5 array now has an extra drive to contribute to the throughput of read and write operations. Standard 10,000 RPM drives can perform more than 100 I/O operations per second, so the RAID-5 array throughput is increased with this extra I/O capability.

• The data in RAID-5E is distributed over N+1 drives instead of N as is done for RAID-5. As a result, the data occupies fewer tracks on each drive. This has the effect of physically utilizing less space on each drive, keeping the head movement more localized and reducing seek times.

Together, these improvements yield a typical system-level performance gain of about 10-20%.

Another benefit of RAID-5E is the faster rebuild times needed to reconstruct a failed drive. In a standard RAID-5 hot-spare configuration, the rebuild of a failed drive requires serialized write operations to the single hot-spare drive. Using RAID-5E, the hot-spare drive space is evenly distributed across all drives, so the rebuild operations are evenly distributed to all remaining drives in the array. Rebuild times with RAID-5E can be dramatically faster than rebuild times using a standard hot-spare configuration.

The only downside of RAID-5E is that the hot-spare drive cannot be shared across multiple physical arrays as can be done with standard RAID-5 plus hot-spare. This RAID-5 technique is more cost efficient for multiple arrays because it allows a single hot-spare drive to provide coverage for multiple physical arrays. This reduces the cost of using a hot-spare drive, but the sacrifice is the inability to handle separate drive failures within different arrays. IBM ServeRAID adapters offer increased flexibility by providing the choice to use either standard RAID-5 with hot-spare or the newer integrated hot-spare provided with RAID-5E.

While RAID-5E provides a performance improvement for most operating environments, there is a special case where its performance can be slower than RAID-5. Consider a three-drive RAID-5 with hot-spare configuration as shown in Figure 43. This configuration employs a total of four drives, but the hot-spare drive is idle, so for a performance comparison it can be ignored. A four-drive RAID-5E configuration would have data and checksum on four separate drives.

Figure 43. Writing a 16 KB block to a RAID-5 array with an 8 KB stripe size

Referring to Figure 43, whenever a write operation is issued to the controller that is two times the stripe size (for example, a 16 KB I/O request to an array with an 8 KB stripe size), a three-drive RAID-5 configuration would not require any reads, because the write operation would contain all the data needed for each of the two drives. The checksum would be generated by the array controller (step 2) and immediately written to the corresponding drive (step 4) without the need to read any existing data or checksum. This entire series of events would require two data writes, one to each of the drives storing the data stripe (step 3), and one write to the drive storing the checksum (step 4), for a total of three write operations.

Contrast these events to the operation of a comparable RAID-5E array, which contains four drives as shown in Figure 44. In this case, in order to calculate the checksum, a read must be performed of the data stripe on the extra drive (step 2). This extra read was not performed with the three-drive RAID-5 configuration, and it slows the RAID-5E array for write operations that are twice the stripe size.

Figure 44. Writing a 16 KB block to a RAID-5E array with an 8 KB stripe size

This problem with RAID-5E can be avoided with proper stripe size selection. By monitoring the average I/O size in bytes, or knowing the I/O size generated by the application, a large enough stripe size can be selected so that this performance degradation rarely occurs.
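A hedged illustration of that selection step (an assumption, not a ServeRAID utility): pick the smallest supported stripe-unit size that is at least the median observed I/O size, so that the case described above rarely occurs.

import statistics

SUPPORTED_STRIPE_KB = [8, 16, 32, 64]   # ServeRAID stripe-unit sizes (see 7.7.5)

def choose_stripe_kb(observed_io_sizes_kb):
    """Smallest supported stripe-unit size that covers the median I/O size."""
    median = statistics.median(observed_io_sizes_kb)
    for size in SUPPORTED_STRIPE_KB:
        if size >= median:
            return size
    return SUPPORTED_STRIPE_KB[-1]

print(choose_stripe_kb([4, 8, 8, 16, 8, 4, 32]))   # -> 8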

7.5.6 Composite RAID levels

The ServeRAID-4 adapter family supports composite RAID levels. This means that it supports RAID arrays that are joined together to form larger RAID arrays.

For example, RAID-10 is the result of forming a RAID-0 array from two or more RAID-1 arrays. With four SCSI channels each supporting 15 drives, this means you can theoretically have up to 60 drives in one array. With the EXP200, the limit is 40 disks and with the EXP300, the limit is 56 disks.

A ServeRAID RAID-10 array is shown in Figure 45:

Figure 45. RAID-10: A striped set of RAID-1 arrays

Likewise, a striped set of RAID-5 arrays is shown in Figure 46.

Figure 46. RAID-50: A striped set of RAID-5 arrays


The ServeRAID-4 family supports the following combinations:

Table 13. Composite RAID levels supported by ServeRAID-4 adapters

RAID level   The sub-logical array is   ...and the spanned array is
RAID-00      RAID-0                     RAID-0
RAID-10      RAID-1                     RAID-0
RAID-1E0     RAID-1E                    RAID-0
RAID-50      RAID-5                     RAID-0

Table 14 shows a summary of the performance characteristics of the three RAID levels commonly used in array controllers. A comparison is also made between small and large I/O data transfers.

Table 14. Summary of RAID performance characteristics

              Data           Sequential I/O           Random I/O               Data availability (2)
RAID level    capacity (1)   performance (2)          performance (2)
                             Read       Write         Read       Write         With hot spare   Without hot spare
Single Disk   n              6          6             4          4             0                Not applicable
RAID-0        n              10         10            10         10            0                Not applicable
RAID-1        n/2            7          5             6          3             7                8
RAID-1E       n/2            5          4             7          6             8                9
RAID-5        n-1            7          7 (3)         7          4             7                8
RAID-5E       n-2            8          8 (3)         8          5             8                N/A
RAID-10       n/2            10         9             7          6             9                9

Notes:
1 In the data capacity column, n refers to the number of equally sized disks in the array.
2 10 = best, 1 = worst. You should only compare values within each column. Comparisons between columns are not valid for this table.
3 With the write-back setting enabled.

7.6 ServeRAID RAID-5 algorithms

The IBM ServeRAID adapter uses one of two algorithms for the calculation of RAID-5 parity. These algorithms ensure the best performance of RAID-5 write operations in array configurations, regardless of the number of drives in the array:


• Use "read/modify write" for RAID-5 arrays of five drives or more.
• Use "full XOR" for RAID-5 arrays of three or four drives.

This section compares these two algorithms.

7.6.1 Read/modify write algorithm

The read/modify write algorithm is optimized for configurations that use greater than four drives. The RAID-5 read/modify write algorithm is described in Figure 47. This algorithm always requires four disk operations to be performed for each write command, regardless of the number of drives in the RAID-5 array. As per Figure 47, the steps that occur are:

1. Read old data (data1)
2. Read old checksum (CS6)
3. Calculate the new checksum from the old data, the new data and the old checksum
4. Write new data (data4)
5. Write new checksum (CS9)

Figure 47. Read/modify write algorithm — four I/O operations for every write command

Regardless of the number of drives, with the read/modify write algorithm, the write command will always require four I/O operations: two reads and two writes.

The algorithm is called "read/modify write" because it reads the checksum, modifies the checksum, then writes the checksum.
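Step 3 works because XOR is its own inverse, so the new checksum follows from the old data, the new data and the old checksum alone. A tiny sketch with made-up example values (not firmware code):

data1, data2, data3 = 0x0A, 0x1B, 0x2C   # old data elements in one stripe row
cs6 = data1 ^ data2 ^ data3              # old checksum
data4 = 0x3D                             # new data that replaces data1

cs9 = cs6 ^ data1 ^ data4                # step 3: modify the checksum using only what was read
assert cs9 == data4 ^ data2 ^ data3      # identical to recalculating over the whole row
print(hex(cs9))                          # 0xa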


7.6.2 Full XOR algorithm

A different method can be used to generate RAID-5 checksum information for a write operation that modifies data1 to be data4. This method is called the "full exclusive or" algorithm (full XOR algorithm).

It involves disk read operations of data2 and data3. The full XOR algorithm then creates a new checksum from data4 + data2 + data3, writes the modified data (data4), and overwrites the old checksum with the new checksum (CS9). In this case, four disk operations are performed.

The following operations (as per Figure 48) show the steps involved in the full XOR algorithm:

1. Read data2
2. Read data3
3. Calculate the new checksum (CS9) from the new data (data4), data2 and data3
4. Write data4
5. Write checksum (CS9)

Figure 48. Full XOR algorithm

In this case, four disk operations are performed: two reads and two writes.

If the number of disks in the array increases, then the number of read operations also increases:

• Five disks: five I/O operations (three reads and two writes)


• Six disks: six I/O operations (four reads and two writes)
• n disks: n I/O operations (n-2 reads and two writes)

The extra read operations required by this algorithm cause the performance of write commands to degrade as the number of drives increases.

The algorithm is called "full XOR" due to the way the checksum is calculated. The checksum is calculated from all the data and then the calculated checksum is written to disk. The original checksum is not used in the calculation.

However, for three disks, only three I/O operations are required: one read and two writes. Thus the following conclusions can be reached:

• For 3-drive RAID-5 arrays, full XOR is faster.
• For 4-drive RAID-5 arrays, the algorithms are the same.
• For 5+ drive RAID-5 arrays, read/modify write is faster.

Thus, for a four-drive configuration, the full XOR algorithm requires the same number of disk operations as the read/modify write algorithm. A RAID-5 configuration using five drives would require four disk operations for the read/modify write algorithm, but five disk operations for the full XOR algorithm. Consequently, the number of disk operations increases for a full XOR algorithm as the number of drives configured in a RAID-5 array increases. The extra read operations required by the full XOR algorithm cause the performance of write operations to degrade as the number of drives increases.

To take advantage of this, Version 2.3 of the ServeRAID firmware introduced a technique which uses the better of these two algorithms depending on the number of drives in the array. It uses full XOR when the adapter is configured with three or four drives in a RAID-5 array, and read/modify write when the adapter is configured with five or more drives.
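The crossover the firmware exploits can be seen by counting disk operations per write command for each algorithm (a small illustrative calculation, not firmware code):

def read_modify_write_ops(n: int) -> int:
    """Disk operations per write command: always two reads and two writes."""
    return 4

def full_xor_ops(n: int) -> int:
    """Disk operations per write command: read the other n-2 data drives, then write data and checksum."""
    return (n - 2) + 2

for n in range(3, 8):
    print(n, read_modify_write_ops(n), full_xor_ops(n))
# n=3: full XOR wins (3 vs 4); n=4: they tie; n>=5: read/modify write wins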

7.6.3 Sequential write commands

The benefits of these two algorithms also affect sequential write commands.

When the ServeRAID adapter is configured for RAID-5 and the server I/O is sequential write operations (for example, when copying files to the server or when building a database), additional performance benefits can be achieved using a full XOR algorithm and a write-back cache policy. (The benefits of write-back cache are discussed in 7.7.7, "Disk cache write-back versus write-through" on page 153.)


The ServeRAID firmware V2.7 has intelligence to detect this type of I/O, and switches to full XOR. This would cause each data element, data1, data2, data3, and the checksum to be stored in the ServeRAID adapter cache after the first operation to update data1 to data4. In write-back mode, the updates to data2, data3 and the successive updates to the checksum could all be accomplished in cache memory. After the entire group of stripe elements is sequentially updated in cache memory, only three disk operations are required to store the updated data2, data3, and checksum information on disk.

This feature of the ServeRAID can improve database load times in RAID-5 mode by up to eight times over earlier ServeRAID firmware levels.

7.7 Factors affecting disk array controller performance

Many factors affect array controller performance. The most important considerations (in order of importance) for configuring the IBM ServeRAID adapter are:

• RAID strategy
• Number of drives
• Drive performance
• Logical drive configuration
• Stripe size
• SCSI bus organization and speed
• Disk cache write-back versus write-through
• RAID adapter cache size
• Device drivers
• Firmware

7.7.1 RAID strategy

Your RAID strategy should be carefully selected because it significantly affects disk subsystem performance. Figure 49 illustrates the performance differences between RAID-0, RAID-1E and RAID-5 for a server configured with 10,000 RPM Fast/Wide SCSI-2 drives and the IBM ServeRAID-3HB adapter with v3.6 code. The chart shows the RAID-0 configuration delivering about 97% greater throughput than RAID-5 and 35% greater throughput than RAID-1E.

RAID-0 has no fault tolerance and is, therefore, best utilized for read-only environments when downtime for possible backup recovery is acceptable. RAID-1E or RAID-5 should be selected for applications requiring fault tolerance. RAID-1E is usually selected when the number of drives is low (less than six) and the price for purchasing additional drives is acceptable. RAID-1E offers about 45% more throughput than RAID-5. These performance considerations should be understood before selecting a fault-tolerant RAID strategy.

Figure 49. Comparing RAID levels. Measured throughput: RAID-0 = 3575, RAID-1E = 2641, RAID-5 = 1816 I/O operations per second. (Configuration: Windows NT Server 4.0, ServeRAID-3HB with firmware/driver v3.6, maximum number of drives, 10,000 RPM drives, 8 KB I/O size, random I/O mix 67/33 read/write.)

In many cases, RAID-5 is the best choice because it provides the best price and performance combination for configurations requiring the capacity of five or more disk drives. RAID-5 performance approaches RAID-0 performance for workloads where the frequency of write operations is low. Servers executing applications that require fast read access to data and high availability in the event of a drive failure should employ RAID-5.

For more information about RAID-5 performance, see 7.6, "ServeRAID RAID-5 algorithms" on page 136.

7.7.2 Number of drives

The number of disk drives significantly affects performance because each drive contributes to total system throughput. Capacity requirements are often the only consideration used to determine the number of disk drives configured in a server. Throughput requirements are usually not well understood or are completely ignored. Capacity is used because it is easily estimated and is often the only information available.


The result is a server configured with sufficient disk space, but insufficient disk performance to keep users working efficiently. High-capacity drives have the lowest price per byte of available storage and are usually selected to reduce total system price. This often results in disappointing performance, particularly if the total number of drives is insufficient.

It is difficult to accurately specify server application throughput requirements when attempting to determine the disk subsystem configuration. Disk subsystem throughput measurements are complex. To express a user requirement in terms of "bytes per second" would be meaningless because the disk subsystem's byte throughput changes as the database grows and becomes fragmented, and as new applications are added.

The best way to understand disk I/O and users' throughput requirements is to monitor an existing server. Tools such as the Windows 2000 Performance console can be used to examine the logical drive queue depth and disk transfer rate (as described in Chapter 11, "Windows 2000 Performance console" on page 221). Logical drives that have an average queue depth much greater than the number of drives in the array are very busy. This indicates that performance would be improved by adding drives to the array.
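A sketch of that monitoring rule (the threshold factor of 2 is an assumption chosen for illustration; this is not a Performance console API):

def needs_more_drives(avg_queue_depth: float, drives_in_array: int, factor: float = 2.0) -> bool:
    """Flag a logical drive whose average queue depth is well above its drive count."""
    return avg_queue_depth > factor * drives_in_array

print(needs_more_drives(avg_queue_depth=14.0, drives_in_array=4))   # True: the array is very busy
print(needs_more_drives(avg_queue_depth=3.0, drives_in_array=4))    # False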

Measurements show that server throughput for most server application workloads increases as the number of drives configured in the server is increased. As the number of drives is increased, performance is usually improved for all RAID strategies. Server throughput continues to increase each time drives are added to the server. This can be seen in Figure 50.

Adding drives

In general, adding drives is one of the most effective changes that can be made to improve server performance.


Figure 50. Improving performance by adding drives to arrays. Measured throughput: 4 drives RAID-0 = 183 tps, 6 drives RAID-0 = 221 tps, 8 drives RAID-0 = 283 tps. (Configuration: Windows NT 4.0, SQL Server 6.5, ServeRAID II, 4.5 GB 7200 RPM drives.)

This trend will continue until another server component becomes the bottleneck. In general, most servers are configured with an insufficient number of disk drives. Therefore, performance increases as drives are added. Similar gains can be expected for all I/O-intensive server applications such as office-application file serving, Lotus Notes, Oracle, DB2 and Microsoft SQL Server.

If you are using one of the IBM ServeRAID family of RAID adapters, you can use the logical drive migration feature to add drives to existing arrays without disrupting users or losing data.

Rule of thumb

For most server workloads, when the number of drives in the active logical array is doubled, server throughput will improve by about 50%, until other bottlenecks occur.

7.7.3 Drive performance

Drive performance contributes to overall server throughput because faster drives perform disk I/O in less time. There are four major components to the time it takes a disk drive to execute and complete a user request:


• Command overhead

This is the time it takes for the drive's electronics to process the I/O request. The time depends on whether it is a read or write request and whether the command can be satisfied from the drive's buffer. This value is of the order of 0.1 ms for a buffer hit to 0.5 ms for a buffer miss.

• Seek time

This is the time it takes to move the drive head from its current cylinder location to the target cylinder. As the radius of the drives has been decreasing, and drive components have become smaller and lighter, so too has the seek time been decreasing. Average seek time is usually 5-7 ms for most current SCSI-2 drives used in servers today.

• Rotational latency

Once the head is at the target cylinder, the time it takes for the target sector to rotate under the head is called the rotational latency. Average latency is half the time it takes the drive to complete one rotation, so it is inversely proportional to the RPM value of the drive:

- 5400 RPM drives have a 5.6 ms latency
- 7200 RPM drives have a 4.2 ms latency
- 10,000 RPM drives have a 3.0 ms latency

• Data transfer time

This value depends on the media data rate, which is how fast data can be transferred from the magnetic recording media, and the interface data rate, which is how fast data can be transferred between the disk drive and disk controller (that is, the SCSI transfer rate).

The media data rate improves as a result of greater recording density and faster rotational speeds. A typical value is 0.8 ms. The interface data rate for SCSI-2 F/W is 20 MBps. With 4 KB I/O transfers (which are typical for Windows NT Server and Windows 2000), the interface data transfer time is 0.2 ms. Hence the data transfer time is approximately 1 ms.

As you can see, the significant values that affect performance are the seek time and the rotational latency. For random I/O (which is normal for a multi-user server) this is true. Seek time will continue to improve as the physical dimensions of drive components become smaller.
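Putting the four components together gives a rough per-request service-time model (an illustration only; the 0.3 ms command overhead is an assumed midpoint of the range above, the ~1 ms transfer time is the figure quoted above, and the seek times come from Table 15):

def avg_service_time_ms(seek_ms: float, rpm: int,
                        command_ms: float = 0.3, transfer_ms: float = 1.0) -> float:
    """Average time for one random disk I/O: command + seek + rotational latency + transfer."""
    rotational_latency_ms = 0.5 * 60_000 / rpm    # half a revolution, on average
    return command_ms + seek_ms + rotational_latency_ms + transfer_ms

print(round(avg_service_time_ms(seek_ms=4.9, rpm=10_000), 1))   # 10,000 RPM drive: ~9.2 ms
print(round(avg_service_time_ms(seek_ms=6.8, rpm=7_200), 1))    # 7200 RPM drive: ~12.3 ms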

For sequential I/O (such as with servers with small numbers of users requesting large amounts of data) or for I/O requests of large block sizes (for example, 64 KB), the data transfer time does become important when compared to seek and latency, so the use of Ultra SCSI, Ultra2 SCSI or Ultra3 SCSI can have a significant positive effect on overall subsystem performance.


Likewise, when caching and read-ahead are employed on the drives themselves, the time taken to perform the seek and rotation is eliminated, so the data transfer time becomes very significant.

The easiest way to improve disk performance is to increase the number of accesses that can be made simultaneously. This is achieved by using many drives in a RAID array and spreading the data requests across all drives as described in 7.7.2, "Number of drives" on page 141.

Table 15 shows the seek and latency values and buffer sizes for three of IBM's high-end drives.

Table 15. Comparing 10,000 and 7200 RPM drives

Disk             Capacity   RPM    Seek     Latency   Buffer size   Media data transfer rate (MBps)
Ultrastar 36LP   18.3 GB    7200   6.8 ms   4.17 ms   4 MB          248-400
Ultrastar 36LZ   18.3 GB    10K    4.9 ms   2.99 ms   4 MB          280-452
Ultrastar 72ZX   73.4 GB    10K    5.3 ms   2.99 ms   16 MB         280-473

7.7.4 Logical drive configuration

Using multiple logical drives on a single physical array is convenient for managing the location of different file types. However, depending on the configuration, it can significantly reduce server performance.

When you use multiple logical drives, you are physically spreading the data across different sections of the array disks. If I/O is performed to each of the logical drives, the disk heads have to seek further across the disk surface than when the data is stored on one logical drive. Using multiple logical drives greatly increases seek time and can slow performance by as much as 25%.

An example of this is creating two logical drives in the one RAID array and putting a database on one logical drive and the transaction log on the other. Because heavy I/O is being performed on both, the performance will be poor. If the two logical drives are configured with the operating system on one and data on the other, then there should be little I/O to the operating system code once the server has booted, so this type of configuration would be OK.

It is best to put the page file on the same drive as the data when using onelarge physical array. This is counterintuitive: Most think the page file shouldbe put on the operating system drive since the operating system will not seemuch I/O during runtime. However, this causes long seek operations as the

Of course this is not the optimal case, especially for applications with heavy paging. Ideally, the page drive will be a separate device that can be formatted with the correct stripe size to match paging. In general, most applications will not page when given sufficient RAM, so usually this is not a problem.

The fastest configuration is a single logical drive for each physical RAID array. Instead of using logical drives to manage files, you should create directories and store each type of file in a different directory. This will significantly improve disk performance by reducing seek times because the data will be as physically close together as possible.

If you really want or need to partition your data and you have a sufficient number of disks, you should configure multiple RAID arrays instead of configuring multiple logical drives in one RAID array. This will improve disk performance; seek time will be reduced because the data will be physically closer together on each drive.

Note: If you plan to use RAID-5E arrays, you can only have one logical drive per array.

7.7.5 Stripe size

With RAID technology, data is striped across an array of hard disk drives. Striping is the process of storing data across all the disk drives that are grouped in an array.

The granularity at which data from one file is stored on one drive of the array before subsequent data is stored on the next drive of the array is called the stripe unit (also referred to as interleave depth). For the ServeRAID adapter family, the stripe unit can be set to a stripe unit size of 8 KB, 16 KB, 32 KB, or 64 KB. With Netfinity Fibre Channel, a stripe unit is called a segment, and segment sizes can also be 8 KB, 16 KB, 32 KB, or 64 KB.

The collection of these stripe units, from the first drive of the array to the last drive of the array, is called a stripe.
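
To make the layout concrete, the following sketch maps a byte offset within a logical drive to the drive and stripe that hold it (illustrative only; it models simple striping without parity, and the function and parameter names are not from any ServeRAID interface):

    # Map a logical-drive byte offset to (drive index, stripe number, offset in stripe unit).
    def locate(offset_bytes, stripe_unit_kb=16, drives_in_array=3):
        unit = stripe_unit_kb * 1024
        stripe_unit_index = offset_bytes // unit       # which stripe unit overall
        drive = stripe_unit_index % drives_in_array    # SU1 -> drive 0, SU2 -> drive 1, ...
        stripe = stripe_unit_index // drives_in_array  # SU1-SU3 form stripe 0, SU4-SU6 stripe 1
        return drive, stripe, offset_bytes % unit

    # A 48 KB file written with a 16 KB stripe unit lands on three different drives:
    for offset in (0, 16 * 1024, 32 * 1024):
        print(offset, locate(offset))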

The stripe and stripe unit are shown in Figure 51.

Figure 51. RAID stripes and stripe units

Note: The term stripe size should really be called stripe unit size, since it refers to the length of the stripe unit (the piece of space on each drive in the array).

Using stripes of data balances the I/O requests within the “logical drive”. On average, each disk will perform an equal number of I/O operations, thereby contributing to overall server throughput. Stripe size has no effect on the total capacity of the logical disk drive.

7.7.5.1 Selecting the correct stripe size

The selection of stripe size affects performance. In general, the stripe size should be at least as large as the median disk I/O request size generated by server applications.

• Selecting too small a stripe size can reduce performance. In this environment, the server application requests data that is larger than the stripe size, which results in two or more drives being accessed for each I/O request. Ideally, only a single disk I/O occurs for each I/O request.

• Selecting too large a stripe size can reduce performance because a larger than necessary disk operation might constantly slow each request. This is a problem particularly with RAID-5, where the complete stripe must be read from disk to calculate a checksum. Use too large a stripe, and extra data must be read each time the checksum is updated.

Selecting the correct stripe size is a matter of understanding the predominant request size performed by a particular application. Few applications use a single request size for each and every I/O request. Therefore, it is not possible to always have the ideal stripe size. However, there is always a best-compromise stripe size that will result in optimal I/O performance.

There are two ways to determine the best stripe size:

• Use a rule of thumb as per Table 16.

• Monitor the I/O characteristics of an existing server.

The first and simplest way to choose a stripe size is to use Table 16. This table is based on tests performed by the Netfinity Performance Lab.

Table 16. Stripe size setting for various applications

  Applications                                       Stripe size
  Groupware (Lotus Domino, Exchange etc.)            16 KB
  Database server (Oracle, SQL Server, DB2, etc.)    16 KB
  File server (Windows 2000, Windows NT)             16 KB
  Web server                                         8 KB
  Video file server                                  64 KB
  Other                                              8 KB

Notes about Table 16:

• SQL Server 7.0 uses 8 KB I/O blocks, but experiments have shown that performance can usually be improved by using double the I/O block size (that is, 16 KB).

• Oracle uses multiple block sizes: 2 KB, 4 KB or 8 KB. While a 16 KB stripe size is not the optimum for all cases, it is not significantly slower either. Further I/O analysis of specific customer data may determine that 8 KB or 16 KB block sizes produce better performance.

In general, the stripe size only needs to be at least as large as the I/O size. A smaller stripe size implies multiple physical I/O operations for each logical I/O, which will cause a drop in performance. A larger stripe size implies a read-ahead function, which may or may not improve performance. Table 16 offers rule-of-thumb settings; there is no way to offer the precise stripe size that will always give the best performance for every environment without doing extensive analysis on the specific workload.
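
The rule of thumb in Table 16 and the paragraph above can be written as a small helper (a sketch only; the application categories and the supported 8-64 KB sizes are taken from Table 16 and the ServeRAID description earlier in this section):

    # Pick the smallest supported stripe size at least as large as the typical I/O size,
    # falling back to the Table 16 rule of thumb when no measurement is available.
    SUPPORTED_KB = (8, 16, 32, 64)
    TABLE_16_KB = {"groupware": 16, "database": 16, "file server": 16,
                   "web server": 8, "video file server": 64, "other": 8}

    def stripe_size_kb(typical_io_kb=None, application="other"):
        if typical_io_kb is None:
            return TABLE_16_KB.get(application, 8)
        for size in SUPPORTED_KB:
            if size >= typical_io_kb:
                return size
        return SUPPORTED_KB[-1]

    print(stripe_size_kb(typical_io_kb=12))        # -> 16
    print(stripe_size_kb(application="database"))  # -> 16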

The second way to determine the correct stripe size involves observing the application while it is running, using the Windows 2000 Performance console. The key is to determine the average data transfer size being requested by the application and select a stripe size that best matches. Unfortunately, this method requires the system to be running, so it either requires another system running the same application or the reconfiguring of the existing disk array once the measurement has been made (and therefore backup, reformat and restore operations).

The Windows 2000 Performance console or Windows NT 4.0 Performance Monitor can help you determine the proper stripe size. Select:

• Object: PhysicalDisk

• Counter: Avg. Disk Bytes/Transfer

• Instance: the drive that is receiving the majority of the disk I/O

Monitor this value. As an example, the trend value for this counter is shown as the thick line in Figure 52. The running average is shown as indicated.

Figure 52. Average I/O size

Figure 52 represents an actual server application. It can be seen that the application request size (represented by Avg. Disk Bytes/Transfer) varies from a peak of 64 KB to about 20 KB for the two run periods.

As we said at the beginning of this section, in general, the stripe size should be at least as large as the median disk I/O request size generated by the server application.

This particular server was configured with an 8 KB stripe size, which produced very poor performance. Increasing the stripe size to 16 KB would improve performance, and increasing the stripe size to 32 KB would increase performance even more. The simplest technique would be to place the time window around the run period and select a stripe size that is at least as large as the average size shown in the running average counter.
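
That selection can also be automated. The sketch below uses hypothetical counter samples (not the data behind Figure 52) and rounds the running average of Avg. Disk Bytes/Transfer up to the next supported stripe size:

    # Suggest a stripe size at least as large as the running average transfer size.
    samples_bytes = [20480, 32768, 49152, 65536, 24576]   # hypothetical counter samples
    average_kb = sum(samples_bytes) / len(samples_bytes) / 1024

    for candidate_kb in (8, 16, 32, 64):
        if candidate_kb >= average_kb:
            print(f"running average {average_kb:.0f} KB -> use a {candidate_kb} KB stripe")
            break
    else:
        print(f"running average {average_kb:.0f} KB -> use the 64 KB maximum")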

7.7.5.2 Page file drive

Windows NT and Windows 2000 perform page transfers at up to 64 KB per operation, so the paging drive stripe size can be as large as 64 KB. However, in practice, it is usually closer to 32 KB because the application might not make demands for large blocks of memory, which limits the size of the paging I/O.

Monitor average bytes per transfer as described in 7.7.5.1, “Selecting the correct stripe size” on page 147. Setting the stripe size to this average size can significantly increase performance by reducing the amount of physical disk I/O that occurs due to paging.

For example, if the stripe size is 8 KB and the page manager is doing 32 KB I/O transfers, then four physical disk reads or writes must occur for each paging operation you see in the Performance console. If the system is paging 10 pages/sec, then the disk will actually be doing 40 disk transfers per second.
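
The arithmetic in that example is simple enough to write down directly (a trivial sketch of the numbers used above):

    # Physical disk operations generated when the stripe size is smaller than the paging I/O.
    stripe_kb = 8
    paging_io_kb = 32
    paging_ios_per_sec = 10   # as reported by the Performance console in the example above

    physical_ios_per_paging_io = paging_io_kb // stripe_kb     # 4
    print(paging_ios_per_sec * physical_ios_per_paging_io)     # 40 disk transfers per second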

7.7.6 SCSI bus organization and speed

Concern often exists over the performance effects caused by the number of drives on the SCSI bus, or the speed at which the SCSI bus runs. Yet, in almost all modern server configurations, the SCSI bus is rarely the bottleneck. In most cases, optimal performance can be obtained by simply configuring 10 drives per Ultra SCSI bus. If the application is byte-I/O-intensive (as is the case with video or audio), five drives on one Ultra2 SCSI bus can be used for a moderate (10-20%) increase in system performance.

Activating disk performance counters

If you wish to monitor disk activity, you need to enable the physical disk counters. In Windows NT 4.0, physical disk counters are disabled by default. To enable them, issue the command DISKPERF -Y and then restart the computer. In Windows 2000, physical disk counters are enabled by default.

Keeping this setting on all the time costs about 2-3% CPU, but if your CPU is not a bottleneck, this overhead is irrelevant and can be ignored.

Type DISKPERF /? for more help on the DISKPERF command.

In general, it is rare that SCSI bus configuration or increasing SCSI bus speed can significantly improve overall server system performance.

Consider that servers must access data stored on disk for each of the attached users. Each user is requesting access to different data stored in a unique location on the disk drives. Disk accesses are almost always random because the server must multiplex access to disk data for each user. This means that most server disk accesses require a seek and rotational latency before data is transferred across the SCSI bus.

7.7.6.1 SCSI bus speed

As described in 7.7.3, “Drive performance” on page 143, total disk seek and latency times average about 8-12 ms, depending on the speed of the drive. Transferring a 2 KB block over a 40 MBps SCSI bus takes about 0.5 ms, or approximately 1/20th of the total disk access time. It is easy to see that increasing the bus speed to 80 MBps will only improve the 1/20 portion of time to roughly 1/40 of the total time, resulting in a small fractional gain in overall performance.
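
The size of that gain can be bounded with a quick calculation (a sketch using the figures quoted above; the 0.5 ms bus time is the value given in the text):

    # Effect on overall access time of doubling the SCSI bus speed.
    seek_and_latency_ms = 10.0   # middle of the 8-12 ms range quoted above
    bus_ms_at_40 = 0.5           # 2 KB transfer time on a 40 MBps bus, as quoted above
    bus_ms_at_80 = bus_ms_at_40 / 2

    old_total = seek_and_latency_ms + bus_ms_at_40
    new_total = seek_and_latency_ms + bus_ms_at_80
    print(f"overall speedup: {old_total / new_total:.3f}x")   # only a few percent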

Tests have shown that for random I/O, drive throughput usually does not approach the limits of the SCSI bus.

In some cases, Ultra2 SCSI (80 MBps) can be shown to offer measurable performance improvements over Ultra SCSI (40 MBps). This usually occurs when measurements are made with a few drives (four to eight) running server benchmarks that transfer large blocks of data to and from the disk drives. System performance gains can be in the range of 5-10%. Typical examples are file-serving and e-mail benchmarks, where data transfer time becomes a larger component of the total disk access time.

File-serving and e-mail benchmarks (and applications) transfer relatively large blocks of data (12 KB to 64 KB), which increases SCSI bus utilization. More importantly, however, these benchmarks usually build a relatively small set of data files, resulting in artificially reduced disk seek times. In production environments, disk drives are usually filled to at least 30-50% of their capacity, causing longer seek times compared to the benchmark files that might only use 2-3% of the disk capacity. After all, building a 2 GB database for a benchmark might seem like a large data set, but on a disk array containing five 9 GB drives, that database utilizes less than 1/20th of the total space. This greatly reduces seek times, thereby inflating the performance contribution of the SCSI bus.

Most IBM SCSI drive enclosures offer the ability to split the backplane to provide dual SCSI bus capability. This effectively provides the same performance as Ultra2 by using dual Ultra SCSI buses in one drive package. In the case of the EXP200, which supports Ultra2, the backplane can be split to provide throughput benefits similar to Ultra3 SCSI, provided that the other components in the system can handle this level of throughput.

7.7.6.2 PCI bus

Don't forget that all this data must travel through the PCI bus. The peak data transfer rate for 32-bit 33 MHz PCI is 132 MBps, but the maximum sustained rate is only 80-100 MBps. The ServeRAID-2 adapter used three 40 MBps Ultra SCSI buses, which had the potential to sustain peak rates of 120 MBps. This transfer rate was perfectly matched with 32-bit 33 MHz PCI. Using Ultra2 SCSI only provides the possibility of improving maximum sustainable performance when both the adapter and the system support faster data transfer rates.

The ServeRAID-3HB adapter offers Ultra2 SCSI transfer performance and can transfer data over 64-bit PCI, which is better matched to the transfer requirements of its three Ultra2 SCSI buses. The adapter must be plugged into a 64-bit slot to take maximum advantage of its Ultra2 SCSI capabilities.

Ultra3 SCSI's 160 MBps rate has similar issues: even faster PCI-to-memory performance is required before maximum throughput can be achieved. A three-channel Ultra3 SCSI RAID adapter can potentially deliver peak rates of 480 MBps. For the moment, no PCI interface can offer such throughput performance. In fact, most memory subsystems cannot offer that much bandwidth for all of the PCI slots combined.
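
The bandwidth-matching argument reduces to a quick multiplication (a sketch using the figures quoted in this section):

    # Compare aggregate SCSI channel bandwidth against PCI bus capability.
    def aggregate_scsi_mbps(channels, mbps_per_channel):
        return channels * mbps_per_channel

    pci_32bit_33mhz_peak_mbps = 132   # sustained is roughly 80-100 MBps

    print(aggregate_scsi_mbps(3, 40))    # ServeRAID-2, three Ultra SCSI buses -> 120 MBps
    print(aggregate_scsi_mbps(3, 80))    # three Ultra2 buses -> 240 MBps, needs 64-bit PCI
    print(aggregate_scsi_mbps(3, 160))   # three Ultra3 buses -> 480 MBps, beyond any current PCI slot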

All this is not to say that Ultra2 and Ultra3 SCSI performance have no place on servers. Several issues must be addressed. You should simply remember that these are important considerations when configuring a balanced server.

Spec-driven technologies, such as SCSI, are often motivated by desktop environments. In the desktop environment, applications tend to be more sequential, and the system usually has a single SCSI adapter that can monopolize much of the PCI to memory bandwidth. In the desktop environment, SCSI-3 provides significant performance gains. Because of the more random nature of server applications, these benefits often do not translate to server environments. The entire delivery path from memory through the PCI bus, over the adapter, and out to the drive must be optimized before faster SCSI bus speeds will realize any appreciable system performance gains for any workload.

7.7.6.3 Multiple SCSI buses

The SCSI bus organization of drives on a multi-bus controller (such as ServeRAID) does not significantly affect performance for most server workloads.

For example, in a four-drive configuration, it doesn't matter whether you attach all drives to a single SCSI bus or if you attach two drives each to two different SCSI buses. Both configurations will usually have identical disk subsystem performance. This applies to applications such as database transaction processing, which generate random disk operations of 2 KB or 4 KB. The SCSI bus does not contribute significantly to the total time required for each I/O operation. Each I/O operation usually requires drive seek and latency times; therefore, the sustainable number of operations per second is reduced, causing SCSI bus utilization to be low.

For a configuration which runs applications that access image data or large sequential files, performance improvement can be achieved by using a balanced distribution of drives on the three SCSI buses of the ServeRAID.

7.7.7 Disk cache write-back versus write-through

Most people think that write-back mode is always faster because it allows data to be written to the disk controller cache without waiting for disk I/O to complete.

This is usually the case when the server is lightly loaded. However, as the server becomes busy, the cache fills completely, causing data writes to wait for space in the cache before being written to the disk. When this happens, data write operations slow to the speed at which the disk drives empty the cache. If the server remains busy, the cache is flooded by write requests, resulting in a bottleneck. This happens regardless of the size of the adapter's cache.

In write-through mode, write operations do not wait in cache memory that must be managed by the processor on the RAID adapter. When the server is lightly loaded (the green zone on the left in Figure 53), write operations take longer because they cannot be quickly stored in the cache. Instead, they must wait for the actual disk operation to complete. Thus, when the server is lightly loaded, throughput in write-through mode is generally lower than in write-back mode.

Figure 53. Comparing write-through and write-back modes under increasing load (IBM ServeRAID-3HB, RAID-5, 8 KB random I/O; I/Os per second plotted against increasing load)

However, when the server becomes very busy (the pink zone on the right in Figure 53), I/O operations do not have to wait for available cache memory. They go straight to disk, and throughput is usually greater in write-through mode than in write-back mode.

Write-through also gains an advantage when a battery-backup cache is installed, due partly to the fact that the write-back cache is mirrored. Data in the primary cache has to be copied to the memory on the battery-backup cache card. This copy operation eliminates a single point of failure, thereby increasing the reliability of the controller in write-back mode, but it takes time and slows writes, especially when the workload floods the adapter with write operations.

Rule of thumb

Based on Figure 53, the following rule of thumb is appropriate:

• If the disk subsystem is very busy, use write-through mode.

• If the disks are configured correctly, and the server is not heavily loaded, use write-back mode.

7.7.8 RAID adapter cache size

IBM performance tests show that the ServeRAID-3H adapter with 32 MB of cache typically outperforms other RAID adapters with 64 MB for most real-world application workloads. Once the cache size is above the minimum required for the job, the extra cache usually offers little additional performance benefit.

The cache increases performance by providing data that would otherwise be accessed from disk. However, in real-world applications, total data space is so much larger than disk cache size that, for random operations, there is very little statistical chance of finding the requested data in the cache. For example, a 50 GB database would not be considered very large by today's standards. A typical database of this size might be placed on an array consisting of seven or more 9 GB drives. For random accesses to such a database, the probability of finding a record in the cache would be the ratio of 32 MB/50 GB, or approximately 1 in 1,600 operations. Double the cache size, and the odds only improve to about 1 in 800; still a very discouraging hit rate. You can easily see that it would take a very large cache to increase the cache hit rate to the point where caching becomes advantageous for random accesses.
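
The odds quoted above come straight from the ratio of cache size to data size (a trivial sketch of that estimate):

    # Rough probability that a purely random record access hits the adapter cache.
    cache_mb = 32
    database_gb = 50

    database_mb = database_gb * 1024
    print(f"32 MB cache: roughly 1 in {database_mb / cache_mb:.0f} accesses")   # ~1 in 1,600
    print(f"64 MB cache: roughly 1 in {database_mb / 64:.0f} accesses")         # ~1 in 800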

In RAID-5 mode, significant performance gains from write-back mode are derived from the ability of the disk controller to merge multiple write commands into a single disk write operation. In RAID-5 mode, the controller must update the checksum information for each data update. Write-back mode allows the disk controller to keep the checksum data in adapter cache and perform multiple updates before completing the update to the checksum information contained on the disk. In addition, this does not require a large amount of RAM.

In most cases, disk array caches can provide high hit rates only when I/O requests are sequential. In this case, the controller can pre-fetch data into the cache so that on the next sequential I/O request, a cache hit occurs. Pre-fetching for sequential I/O requires only enough buffer space or cache memory to stay a few steps ahead of the sequential I/O requests. This can be done with a small circular buffer.

The cache size needs to increase in proportion to the number of concurrent I/O streams supported by the array controller. The earlier ServeRAID adapters supported up to 32 concurrent I/O streams, so 32 MB of cache was deemed enough to provide a high-performance hit rate for sequential I/O. For newer RAID adapters, the number of outstanding I/O requests can be as high as 128; thus, these adapters will have proportionally larger caches. (Note: it is a coincidence that the number of I/O streams matches the size of the cache in MB.)

A large cache means there is more memory to manage when the workload is heavy, while during light loads very little cache memory is required.

Most people don't invest the time to think about how cache works. Without much thought, it's easy to reach the conclusion that “bigger is always better.” The drawback is that larger caches take longer to search and manage. This can slow I/O performance, especially for random operations, since there is a very low probability of finding data in the cache.

Benchmarks often do not reflect a customer production environment. In general, most “retail” benchmark results are run with very low amounts of data stored on the disk drives. In these environments, a very large cache will have a hit rate that is artificially inflated compared to the hit rate from a production workload.

In a production environment, an overly large cache can actually slow performance as the adapter continuously searches the cache for data that is never found before it starts the required disk I/O. This is the reason that many array controllers turn off the cache when the hit rates fall below an acceptable threshold.

In identical hardware configurations, it takes more CPU overhead to manage 64 MB of cache compared to 32 MB, and even more for 128 MB. The point is that bigger caches do not always translate to better performance. However, ServeRAID-4H has a 266 MHz PowerPC 750 with 1 MB of L2 cache; this CPU is approximately 5-7 times faster than the 80 MHz CPU used on ServeRAID-3H. Therefore ServeRAID-4H can manage the larger cache without running slower than ServeRAID-3H.

Furthermore, the amount of cache must be proportional to the number of drives attached. Typically, cache hits are generated from sequential read-ahead, and you do not need to read ahead very much to have 100% hits. More drives mean more I/O streams to prefetch. ServeRAID-4H has four SCSI buses that support up to 40 drives, compared to 30 for ServeRAID-3H.

7.7.9 Device drivers

Device drivers play a major role in the performance of the subsystem with which the driver is associated. A device driver is software written to recognize a specific device. Most device drivers are vendor specific. These drivers are supplied by the hardware vendor (such as IBM in the case of ServeRAID). Most of the device drivers can be downloaded from the Web.

Choosing the correct device driver for specific hardware is very important. The device drivers are also specific to the operating system. Some of the device drivers are supplied on the Windows NT CD-ROM, and some are supplied with the hardware on diskette. A technically competent person should select a proper driver during installation. Selecting an incorrect device driver for a specific device (if it works at all) can cause poor performance or data loss.

Most of the Netfinity systems have integrated SCSI adapters. We recommend you refer to the technical manual of the specific model of Netfinity system to select the correct driver. Windows NT can automatically detect many SCSI adapters and automatically load the appropriate device driver. For other adapters, such as the ServeRAID, you need to instruct Windows NT to copy the device driver from the supplied diskette. The same applies to Windows 2000.

The Windows 2000 CD-ROM contains a version of the ServeRAID driver that is equivalent to the v3.5-level driver and will allow you to install the operating system onto ServeRAID-attached disks. We recommend you install Windows 2000 using that driver, then, once the installation is complete, upgrade to the latest driver.

It should also be noted that often the latest driver is not the best or correct driver to use. This is especially important with specific hardware configurations that are certified by an application vendor. An example of this is Microsoft's Cluster Server. You must check the certified configuration to determine what driver level is supported.

Before installing the latest version, check the IBM support Web site for the latest hints and tips:

1. Go to URL http://www.pc.ibm.com/support
2. Select Servers.
3. In the Family pull-down list, select ServeRAID.
4. Click on Hints and Tips from the navigation bar on the left side of the page.
5. In the category pull-down list, select Service Bulletins.

You may also want to examine the other hints and tips categories.

7.7.10 Firmware

Version 3.5 provides a significant improvement in performance compared to earlier versions of ServeRAID software. Version 3.5 involved many firmware optimizations that resulted in system-level performance gains of as much as 20-25% for typical server workloads. This firmware version was also used to introduce RAID-5E to the ServeRAID-3 family of adapters.

Version 3.5 of the firmware and device driver introduced automatic read-ahead algorithms that turned the read-ahead function on and off based upon the demands of the active workload. Whenever the adapter firmware detected transfers that would benefit from read-ahead, the option was dynamically turned on. If the I/O workload changed so that read-ahead reduced the overall performance, it was turned off. This feature reduced the complexity of configuring an array for maximum performance by automating the setting of the parameter.

Version 3.5 also improved performance by optimizing I/O for RAID-1E. This feature improved performance by better balancing physical I/O operations between the mirror drive pairs. The net gain in performance for RAID-1E was as much as 66%.

Version 3.6 software for the ServeRAID-3 family of adapters introduced additional performance enhancements, including:

• Refined instruction path length

This feature significantly improves the performance of the ServeRAID-3 family of adapters when executing cache hit operations. Since many customers base purchase decisions on small data size benchmarks, the design lab could not ignore performance obtained while accessing the majority of data from adapter cache. Version 3.6 offers greater performance by restructuring the executed code for a better fit, so that it stays resident in the L1 processor cache. The on-board CPU now runs significantly faster because most key instructions remain resident in the L1 cache, thereby reducing CPU wait times for slower memory accesses.

• Greater concurrent I/O

This feature enables the ServeRAID-3 family of adapters to have up to 128 concurrent outstanding I/O operations. This change increases performance for configurations that utilize large numbers of disk drives. Allowing a larger number of concurrent outstanding I/O operations enables the disk drives to optimize I/O by reordering seek operations.

• Removed the 8-drive limitation for 32 KB and 64 KB stripe sizes

Removing the 8 physical drive limitation for 32 KB and 64 KB stripe sizes lets you have configurations of up to 16 physical disks for applications that require large block transfers. Applications such as video and image serving can now use larger arrays. These larger arrays provide both greater capacity and increased throughput compared to a single physical drive.

In general, customers can expect to see as much as 20-25% improvements in throughput for average business applications from these modifications. Figure 54 shows the gains obtained for typical random I/O server applications. The specific configuration is 8 KB block, 67% read, and 33% write, random transactions.

Figure 54. Maximum ServeRAID family RAID-0 throughput performance (I/Os per second, in thousands, for RAID-0 write-back with the maximum number of drives attached; 8 KB random I/O, 67% read / 33% write; comparing ServeRAID-3L code v3.0, ServeRAID-2 code v2.4, and ServeRAID-3H at code levels v2.7, v3.5 and v3.6)

Firmware levels

Always upgrade the firmware on the ServeRAID card to the latest level.

7.8 Fibre Channel

Fibre Channel introduces new techniques to attach storage to servers and, as a result, it has unique performance issues that affect the overall performance of a server. The purpose of this section is to provide a brief introduction to the motivation behind Fibre Channel, to explain how Fibre Channel affects server performance, and to identify important issues for configuring Fibre Channel for optimal performance.

SCSI has been the standard for server disk attachment for the last ten years. However, SCSI technology has recently been under stress as it attempts to satisfy the I/O demands of current high-performance 4-way and 8-way servers. Some of the fundamental problems with SCSI stem from its parallel cable design, which limits cable length, transfer speed, and the maximum number of drives that can be attached to the cable. Another significant limitation is that a maximum of two systems can share devices attached to one SCSI bus. This is significant when using SCSI for server clustering configurations.

Fibre Channel was designed to be a transport for both network traffic and an I/O channel for attaching storage. In fact, the Fibre Channel specification provides for many protocols such as 802.2, IP (Internet Protocol) and SCSI. Our discussion in this redbook will be limited to its use for disk storage attachment.

Fibre Channel provides low latency and high throughput capabilities. As a result, Fibre Channel is rapidly becoming the next generation I/O technology used to connect servers and high-speed storage. Fibre Channel addresses many of the shortcomings of SCSI with improvements in the following areas:

• Cable distance

• Bandwidth

• Reliability

• Scaleability

The parallel cable used for Ultra, Ultra2, and Ultra3 SCSI limits cable distances to 25 meters or shorter. This is due to electromagnetic effects impacting signal integrity as cable length increases. Parallel cables such as the type used by SCSI tend to have signal interference problems because of electromagnetic coupling that occurs between parallel signals traversing the wires.

Serial technologies use fewer signals, typically two or four, compared to as many as 68 for SCSI. Fewer signal lines mean less electromagnetic energy emitted and less total signal interference from coupling of the electromagnetic energy into adjacent wires. Lower signal interference allows the serial cable to transfer data at much higher rates than is possible using a parallel connection.

Fibre Channel provides the capability to use either a serial copper or fiber optic link to connect the server with storage devices. Fiber optic technology allows for storage to be located a maximum distance of up to 10 kilometers away from the attaching server.

The same electromagnetic noise problems that limit SCSI cable length also limit the speed at which data can traverse the SCSI bus. First-generation Fibre Channel is capable of transmitting data at 1 Gbit per second in both transmit and receive directions. The most popular version of SCSI, Ultra2, is limited to 80 MBps (bytes) or 640 Mbps (bits).

This difference in performance (1 Gb vs. 0.64 Gb) does not appear to be significant; however, Fibre Channel offers a full-duplex communication path while SCSI is half-duplex. This means that Fibre Channel can achieve up to 2 Gbps throughput by transferring data on both send and receive paths at the same time. Therefore, the maximum bandwidth of current Fibre Channel implementations is actually 2 Gb while it is only 640 Mb for Ultra2 SCSI. However, maximum bandwidth is often touted as an important specification, but in actual use, sustainable bandwidth may be much less.

Another significant advantage of Fibre Channel is its ability to connect redundant paths between storage and one or more servers. Redundant Fibre Channel paths improve server availability: a cable or connector failure does not cause server downtime because storage can still be accessed via the redundant path. In addition, both Fibre Channel and SCSI throughput can scale by utilizing multiple channels or buses between the servers and storage.

In addition to a simpler cable scheme, Fibre Channel offers improved scaleability because it offers several very flexible connection topologies. Basic point-to-point connections can be made between a server and storage devices, providing a low-cost, simple stand-alone connection. Fibre Channel can also be used in both loop and switch topologies. These topologies increase server-to-storage connection flexibility. The Fibre Channel loop allows up to 127 devices to be configured to share the same Fibre Channel connection. A device can be a server or a storage subsystem. Fibre Channel switch topologies provide the most flexible configuration scheme by theoretically providing the connection of up to 16 million devices!

The Fibre Channel specification provides many possibilities for how Fibre Channel is configured, but we will confine our discussion to the implementation of the IBM Netfinity Fibre Channel RAID Controller. The IBM Fibre Channel RAID Controller operation can be conceptualized by combining LAN and disk array controller operations.

Figure 55 illustrates the primary components in the IBM Fibre Channel configuration. The key performance considerations arise because the RAID controller and storage are attached to the server by a Fibre Channel link. This introduces two factors that contribute to overall Fibre Channel performance:

• The throughput of the Fibre Channel links, shown as the FC bandwidth arrow

• The aggregate throughput of the RAID controller and link combination, shown as the FC-to-disk bandwidth arrow

Figure 55. IBM Netfinity Fibre Channel RAID organization (a Netfinity server with a Fibre Channel host adapter and optional second adapter, connected over FC-AL to the Netfinity Fibre Channel RAID controller with an optional second controller, driving EXP-15 enclosures with up to 60 (6 x 10) disk drives per RAID controller pair)

In March 2000, IBM introduced Netfinity Fibre Array Storage Technology (FAStT). This new Fibre Channel technology employs second-generation Fibre Channel integrated circuits, which greatly improve throughput performance. In addition, device drivers and firmware are optimized to enhance throughput performance. FAStT utilizes the same Fibre Channel protocols used in first-generation Netfinity Fibre Channel products, but four host-connection and four drive-connection Fibre Channel links are supported per controller pair to significantly improve total available Fibre Channel bandwidth.

7.8.1 Fibre Channel performance issues

Let's look at what happens when a read I/O operation is requested to a Fibre Channel subsystem, and the data requested is not located in the RAID controller disk cache:

1. A read command is generated by the Netfinity server; the read command contains the logical block address of the data being requested.

2. The command is transmitted by the Fibre Channel host adapter to the RAID controller over the Fibre Channel link.

3. The RAID controller parses the read command and uses the logical block address to issue the disk read command to the correct drive.

4. The disk drive performs the read operation and returns the data to the RAID controller.

5. The Fibre Channel electronics within the RAID controller format the data into the Fibre Channel protocol format. The data is transferred to the Netfinity server over the Fibre Channel link.

6. Once in the Fibre Channel adapter, the data is transferred over the PCI bus into the memory of the Netfinity server.

Of course, a large amount of the detail was left out, but this level of observation is sufficient to understand the most important performance implication of Fibre Channel.

The Fibre Channel link, like most network connections, sustains a data transfer rate that is largely determined by the payload of the frame. Or, stated another way, the throughput of Fibre Channel is a function of the disk I/O size being transferred. This is because Fibre Channel frames have a maximum data payload of 2112 bytes. Data transfers for larger data sizes require multiple Fibre Channel frames.

Figure 56 illustrates the effects of disk request size on Fibre Channel throughput. At small disk request sizes such as 2 KB, the maximum Fibre Channel throughput is about 20 MBps, or about 20% of the maximum transfer rate of Fibre Channel. This is critical information, as many people think the maximum 1 Gbps throughput is obtained for all operations.
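
The frame-count arithmetic behind this behavior is straightforward (a simplified sketch; 2112 bytes is the maximum frame payload quoted above, and per-frame protocol overhead is ignored):

    # Number of Fibre Channel frames needed to carry one disk I/O.
    import math

    FRAME_PAYLOAD_BYTES = 2112   # maximum data payload of one Fibre Channel frame

    def frames_for(io_size_kb):
        return math.ceil(io_size_kb * 1024 / FRAME_PAYLOAD_BYTES)

    for size_kb in (2, 8, 64):
        print(f"{size_kb:>3} KB I/O -> {frames_for(size_kb)} frames for one SCSI read command")

A 2 KB request pays the full command overhead for a single frame of data, while a 64 KB request amortizes the same overhead across 32 frames, which is why throughput climbs with I/O size.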

Figure 56. Fibre Channel throughput vs. disk I/O size

Only when the disk I/O size is as large as 64 KB does Fibre Channel reach its maximum sustainable throughput. In this case the maximum throughput is about 82 MBps. But isn't Fibre Channel supposed to have one gigabit of throughput? One gigabit is roughly 100 MBps (taking into account a 2-bit serial encoding overhead for every byte). The difference between this measured result of 82 MBps and the theoretical maximum throughput of 1 Gbps (100 MBps) can be explained by the overhead of command and control bits that accompany each Fibre Channel frame. This is discussed in the following sections.

7.8.1.1 Fibre Channel protocol layers

We can get a better appreciation for this overhead if we take a brief look at the Fibre Channel layers and the Fibre Channel frame composition.

The Fibre Channel specification defines five independent protocol layers. These layers are structured so that each layer has a specific function to enable reliable communications for all of the protocols supported by the Fibre Channel standard.

Figure 57. Fibre Channel functional levels

Figure 57 illustrates the five independent layers:

• FC-0 is the physical layer. This is comprised of the actual wire or optical fibre over which data travels.

• FC-1 is the transmission protocol. The transmission layer is responsible for encoding the bits on the physical medium, for data transmission error detection, and for signal clock generation.

• FC-2 is important from a performance perspective because this is the layer that is responsible for building the data frames that flow over the Fibre Channel link. FC-2 is also responsible for segmenting large transfer requests into multiple Fibre Channel frames.

• FC-3 defines the common services layer. This layer is responsible for defining the common services that are accessible across all Fibre Channel ports. One of these services is the Name Server. The Name Server provides a directory of all the Fibre Channel nodes accessible on the connection. For example, a Fibre Channel switch would be a name server and maintain a directory of all the ports attached to that switch. Other Fibre Channel nodes could query the switch to determine what node addresses are accessible via that switch.

• FC-4 defines the protocol standards that can be used to transport data over Fibre Channel. Some of these protocols include:

  - SCSI (Small Computer Systems Interface)
  - HiPPI (High Performance Parallel Interface)
  - IPI (Intelligent Peripheral Interface)
  - SBCCS (Single Byte Command Code Set) to support ESCON
  - IP (Internet Protocol)
  - 802.2

Our discussion is limited to SCSI because Netfinity Fibre Channel RAID controller products are based upon the SCSI protocol. Fibre Channel allows the SCSI protocol commands to be encapsulated and transmitted over Fibre Channel to SCSI devices connected to the RAID controller unit. This is significant because this technique allows Fibre Channel to be quickly developed and to function with existing SCSI devices and software.

7.8.1.2 The importance of the I/O size

Regarding the shape of the throughput chart in Figure 56 on page 164, the throughput of Fibre Channel is clearly sensitive to the disk access size. Small disk access sizes have low throughput while larger blocks have greater overall throughput. The reason for this can be seen by looking at the read command example we discussed in 7.8.1, “Fibre Channel performance issues” on page 163.

In the case of a 2 KB read operation, the sequence is:

1. A SCSI read command is issued by the device driver to the Fibre Channel host adapter at level FC-4.

2. On the Netfinity host side, the SCSI read command must flow down from FC-4 to FC-0 before it is transferred over the Fibre Channel link to the external RAID controller.

3. The RAID controller also has a Fibre Channel interface that receives the read command at FC-0 and sends it up through FC-1, FC-2 and FC-3 to the SCSI layer at FC-4.

4. The SCSI layer then sends the read command to the Fibre Channel RAID controller.

5. The SCSI read command is issued to the correct disk drive.

6. When the read operation completes, data is transferred from the drive to SCSI layer FC-4 of the Fibre Channel interface within the RAID controller.

7. Now the read data must make the return trip down layers FC-4, FC-3, FC-2, FC-1 on the RAID controller side and onto the Fibre Channel link.

8. When the data arrives on the Fibre Channel link, it is transmitted to the host adapter in the Netfinity server.

9. Again it must travel up the layers to FC-4 on the Netfinity side before the SCSI device driver responds with data to the requesting process.

Contrast the 2 KB read command with a 64 KB read command and the answer becomes clear:

Like the 2 KB read command, the 64 KB read command travels down FC-4, FC-3, FC-2, and to FC-1 on the Netfinity side. It also travels up the same layers on the RAID controller side.

But here is where things are different. After the 64 KB read command completes, the data is sent to FC-4 of the Fibre Channel interface on the RAID controller side. The 64 KB of data travels down from FC-4 and FC-3 to FC-2. At layer FC-2 the 64 KB of data is formatted into 2112-byte payloads to be sent over the link. But 64 KB does not fit into a single 2112-byte payload. Therefore, layer FC-2 performs segmentation and breaks the 64 KB of disk data up into 32 separate Fibre Channel frames to be sent to the Netfinity Fibre Channel controller.

31 of these frames never had to traverse layers FC-4 and FC-3 on the RAID controller side. Furthermore, 31 of these frames never required a separate read command to be generated at all. They were transmitted with one read command.

Thus, reading data in large blocks introduces significant efficiencies because much of the protocol overhead is reduced. Any transfer exceeding the 2112-byte payload is shipped as “low-cost” frames back to the host. This explains why throughput at smaller I/O sizes (Figure 56 on page 164) is so low and why throughput improves as the disk I/O size increases: the overhead of the FC-4 and FC-3 layers and the additional SCSI read or write commands slows throughput.

7.8.1.3 Configuring Fibre Channel for performance

The important point is to understand that this degradation of throughput with smaller I/O sizes occurs, and to use that information to better plan your Fibre Channel configuration.

One way to do this is to profile an existing server to get an idea of the average disk transfer size. This can easily be obtained using Performance Monitor and examining the following physical disk counters:

• Average disk bytes/transfer

This counter can be graphed versus time to tell you the predominant transfer size for the particular application. This value can be compared to Figure 56 on page 164 to determine the maximum level of throughput a single Fibre Channel link can sustain for a particular application.

• Disk bytes/second

This counter tells you what the current disk subsystem is able to sustain for this particular application. This value can also be compared to the maximum throughput obtained from Figure 56 on page 164 to determine whether multiple links should be used to reach the target level of throughput demanded for the target number of users.

As a rule of thumb, all things being equal, doubling the number of users requires double the amount of disk I/O. For example, if the current server is doing 8 KB transfers and supporting 100 users and you are asked to build a Fibre Channel based server configuration to support 300 users, the analysis is fairly straightforward (see the sketch after this list):

• At 8 KB, Fibre Channel can sustain about 52 MBps (from Figure 56 on page 164).

• If the current server with 100 users is sustaining 10 MBps, then a single Fibre Channel link will be sufficient to handle 300 users at 30 MBps (which is less than the 52 MBps maximum).

• If the server were sustaining 20 MBps (a total requirement of 60 MBps), then it would be best to configure dual Fibre Channel adapters connecting the host to the Fibre Channel RAID controller.
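
The sizing rule can be written down directly (a sketch; the 52 MBps default is the 8 KB point read from Figure 56, and the linear per-user scaling is the rule of thumb above):

    # Rule-of-thumb Fibre Channel link sizing.
    import math

    def links_needed(current_mbps, current_users, target_users, sustainable_mbps=52):
        required_mbps = current_mbps * target_users / current_users
        return required_mbps, math.ceil(required_mbps / sustainable_mbps)

    print(links_needed(10, 100, 300))   # (30.0, 1) -> one link is enough
    print(links_needed(20, 100, 300))   # (60.0, 2) -> configure dual host adapters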

As well as adding a PCI host adapter, you can also improve performance by adding a second Fibre Channel controller module to the Netfinity Fibre Channel RAID Controller Unit. With both the first-generation Netfinity Fibre Channel and the FAStT technology, throughput nearly doubles for all transfer sizes when a second controller is added to the system, as shown in Figure 58.

Figure 58. Comparing single vs. dual controller throughputs (Netfinity FAStT; throughput in MBps versus transfer sizes from 0.5 KB to 1024 KB)

The rest of the challenges of optimizing Fibre Channel are similar to configuring a standard RAID controller. Disk layout and organization, such as RAID strategy, stripe size and the number of disks, all affect performance of the IBM Fibre Channel RAID controller in much the same way that they do for ServeRAID. The same techniques used to determine these settings for ServeRAID can be used to optimize the IBM Fibre Channel RAID controller solution.

Figure 59 shows a comparison between ServeRAID and Fibre Channel.

Rules of thumb

Some rules of thumb:

• Doubling the number of users requires double the amount of disk I/O.

• Use Figure 56 on page 164 to determine the maximum sustainable throughput. If your expected throughput exceeds this value, add a second host adapter.

• Adding a second RAID controller module doubles the throughput.

Figure 59. Throughput comparisons (maximum controller-to-disk throughput in I/O operations per second; RAID-5 arrays, 8 KB random I/O, 67/33 read/write mix, arrays 8% full, maximum number of drives attached: 30 disks for ServeRAID, 60 disks for Fibre Channel)

Figure 59 compares:

• ServeRAID-3HB with Version 3.6 of the firmware, BIOS and driver

• ServeRAID-4H with Version 4 of the firmware, BIOS and driver

• Netfinity Fibre Channel with one module in the RAID controller unit

• Netfinity Fibre Channel with two modules in the RAID controller unit

• Netfinity Fibre Array Storage Technology (FAStT) with one RAID controller module

• Netfinity Fibre Array Storage Technology with two RAID controller modules

Fibre Channel offers improved performance over SCSI, as well as the ability to configure a larger number of drives per RAID controller. With ServeRAID-4H, you can have up to 56 drives connected to the adapter (14 drives per channel using the new Netfinity EXP300 enclosure), and with Netfinity FAStT, up to 220 drives can be connected. In addition, Fibre Channel offers benefits related to high availability, such as fault tolerance and the distance allowed between the server and the disk enclosures.

Using a large number of drives in an array is the best way to increase throughput for applications that have high I/O demands. These applications include database transaction processing, decision support, e-commerce, video serving, and groupware such as Lotus Notes and Microsoft Exchange.

7.8.2 Tuning with Netfinity FAStT Storage Manager

Netfinity FAStT Storage Manager Version 7 is the software that lets you manage the Netfinity FAStT RAID Controller. It includes its own performance monitoring tool, the Subsystem Management Performance Monitor, which gives you information about the performance aspects of your Fibre Channel subsystem.

Note: This performance monitor tool is not related to the Windows NT Performance Monitor tool.

Figure 60. Subsystem Management Performance Monitor

This section describes how to use data from the Subsystem Management Performance Monitor and what tuning options are available in the Storage Manager for optimizing the Fibre Channel subsystem's performance.

Use the Subsystem Management Performance Monitor to monitor storage subsystem performance in real time and save performance data to a file for later analysis. You can specify the logical drives and/or controllers to monitor and the polling interval. Also, you can receive storage subsystem totals, which is data that combines the statistics for both controllers in an active-active controller pair.

Table 17 describes the data that is displayed for selected devices.

Table 17. Subsystem management performance monitor parameters

  Total I/Os: Total I/Os performed by this device since the beginning of the polling session. For more information, see 7.8.2.1, “Balancing the I/O load” on page 172.

  Read percentage: The percentage of Total I/Os that are read operations for this device. Write percentage can be calculated as 100 minus this value. For more information, see 7.8.2.3, “Optimizing the I/O request rate” on page 173.

  Cache hit percentage: The percentage of reads that are processed with data from the cache rather than requiring a read from disk. For more information, see 7.8.2.3, “Optimizing the I/O request rate” on page 173.

  Current KB per second: Average transfer rate during the polling session. The transfer rate is the amount of data in kilobytes that can be moved through the I/O data connection in a second (also called throughput). For more information, see 7.8.2.2, “Optimizing the transfer rate” on page 173.

  Maximum KB per second: The maximum transfer rate that was achieved during the Performance Monitor polling session. For more information, see 7.8.2.2, “Optimizing the transfer rate” on page 173.

  Current I/O per second: The average number of I/O requests serviced per second during the current polling interval (also called an I/O request rate). For more information, see 7.8.2.3, “Optimizing the I/O request rate” on page 173.

  Maximum I/O per second: The maximum number of I/O requests serviced during a one-second interval over the entire polling session. For more information, see 7.8.2.3, “Optimizing the I/O request rate” on page 173.

7.8.2.1 Balancing the I/O load

The Total I/O data field is useful for monitoring the I/O activity to a specific controller and a specific logical drive. This field helps you identify possible I/O hot spots.

Identify actual I/O patterns to the individual logical drives and compare those with the expectations based on the application. If a particular controller has considerably more I/O activity than expected, consider moving an array to the other controller in the storage subsystem using the Array > Change Ownership option.

Since I/O loads are constantly changing, it can be difficult to perfectly balance I/O load across controllers and logical drives. The logical drives and data accessed during your polling session depend on which applications and users were active during that time period. It is important to monitor performance during different time periods and gather data at regular intervals so you can identify performance trends. The performance monitor tool allows you to save data to a comma-delimited file so you can import it to a spreadsheet for further analysis.

Data field Description

Total I/Os Total I/Os performed by this device since the beginning of the polling session. Formore information, see 7.8.2.1, “Balancing the I/O load” on page 172.

Read percentage The percentage of Total I/Os that are read operations for this device. Writepercentage can be calculated as 100 minus this value. For more information, see7.8.2.3, “Optimizing the I/O request rate” on page 173.

Cache hitpercentage

The percentage of reads that are processed with data from the cache rather thanrequiring a read from disk. For more information, see 7.8.2.3, “Optimizing the I/Orequest rate” on page 173.

Current K/B persecond

Average transfer rate during the polling session. The transfer rate is the amount ofdata in Kilobytes that can be moved through the I/O Data connection in a second (alsocalled throughput). For more information, see 7.8.2.2, “Optimizing the transfer rate”on page 173.

Maximum K/B persecond

The maximum transfer rate that was achieved during the Performance Monitor pollingsession. For more information, see 7.8.2.2, “Optimizing the transfer rate” on page173.

Current I/O persecond

The average number of I/O requests serviced per second during the current pollinginterval (also called an I/O request rate). For more information, 7.8.2.3, “Optimizingthe I/O request rate” on page 173.

Maximum I/O persecond

The maximum number of I/O requests serviced during a one- second interval over theentire polling session. For more information, see 7.8.2.3, “Optimizing the I/O requestrate” on page 173.

Page 55: Chapter 7. Disk subsystem - pearsoncmg.comptgmedia.pearsoncmg.com/images/0130406120/samplechapter/0130406120.pdfChapter 7. Disk subsystem 121 7.1.3 Ultra SCSI Ultra SCSI is an update

Chapter 7. Disk subsystem 173

performance during different time periods and gather data at regular intervalsso you can identify performance trends. The performance monitor tool allowsyou to save data to a comma-delimited file so you can import it to aspreadsheet for further analysis.

If you notice that the workload across the storage subsystem (the Storage Subsystem Totals Total I/Os statistic) continues to increase over time while application performance decreases, this can indicate the need to add storage subsystems to your enterprise. By doing this, you can continue to meet application needs at an acceptable performance level.
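As a sketch of that kind of offline analysis, the short Python script below reads a saved comma-delimited export and prints average and peak rates per device so that hot spots and long-term growth stand out. The column names (Device, Current IO/s, Current KB/s) and the file name are assumptions for illustration; adjust them to match the file the Performance Monitor actually produces.

import csv
from collections import defaultdict

def load_monitor_export(path):
    # Read a comma-delimited Performance Monitor export and group the
    # rows by device (logical drive or controller).
    rows_by_device = defaultdict(list)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            rows_by_device[row["Device"]].append(row)
    return rows_by_device

def summarize(rows_by_device):
    # Print average and peak rates per device to highlight hot spots
    # and long-term growth in the workload.
    for device, rows in rows_by_device.items():
        io_rates = [float(r["Current IO/s"]) for r in rows]
        kb_rates = [float(r["Current KB/s"]) for r in rows]
        print(f"{device}: avg {sum(io_rates) / len(io_rates):.0f} IO/s, "
              f"peak {max(io_rates):.0f} IO/s, "
              f"avg {sum(kb_rates) / len(kb_rates):.0f} KB/s")

summarize(load_monitor_export("perfmon_export.csv"))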

7.8.2.2 Optimizing the transfer rate

As described in 7.8.1, “Fibre Channel performance issues” on page 163, the transfer rates of the controller are determined by the application I/O size and the I/O request rate. In general, a small application I/O request size results in a lower transfer rate, but provides a faster I/O request rate and a shorter response time. With larger application I/O request sizes, higher throughput rates are possible. Understanding your typical application I/O patterns can give you an idea of the maximum I/O transfer rates that are possible for a given storage subsystem.

Because of the dependency on I/O size and transmission media, the only technique you can use to improve transfer rates is to improve the I/O request rate. Use the Windows 2000 Performance console (or Windows NT Performance Monitor) to gather I/O size data so you understand the maximum transfer rates possible. Then use tuning options available in Storage Manager to optimize the I/O request rate so you can reach the maximum possible transfer rate.
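The relationship is simple enough to state directly: throughput is the product of the I/O size and the I/O request rate, and conversely the average I/O size can be derived from the monitor's KB per second and I/O per second fields. The short Python sketch below makes both calculations explicit; the numeric values are illustrative examples only.

def transfer_rate_kb_per_sec(io_size_kb, io_per_sec):
    # Throughput is the product of the I/O size and the I/O request rate.
    return io_size_kb * io_per_sec

def average_io_size_kb(kb_per_sec, io_per_sec):
    # Derive the average I/O size from the monitor's KB/s and I/O/s fields.
    return kb_per_sec / io_per_sec if io_per_sec else 0.0

# Example values: a small-block workload reaches a high request rate but a
# modest transfer rate, while a large-block workload does the opposite.
print(transfer_rate_kb_per_sec(8, 2000))   # 16000 KB/s from 8 KB I/Os at 2000 I/O/s
print(transfer_rate_kb_per_sec(64, 400))   # 25600 KB/s from 64 KB I/Os at 400 I/O/s
print(average_io_size_kb(4000.0, 500.0))   # 8.0 KB average I/O size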

7.8.2.3 Optimizing the I/O request rate

The factors that affect the I/O request rate include:

• I/O access pattern (random or sequential) and I/O size
• Whether write caching is enabled
• Cache hit percentage
• RAID level
• Segment size
• Number of drives in the arrays or storage subsystem
• Fragmentation of files
• Logical drive modification priority

Note: Fragmentation affects logical drives with sequential I/O access patterns, not random I/O access patterns.


To determine if your I/O has sequential characteristics, try enabling a conservative cache read-ahead multiplier (4, for example) using the Logical drive > Properties option. Then examine the logical drive cache hit percentage to see if it has improved. An improvement indicates your I/O has a sequential pattern.

Use the Windows 2000 Performance console (or Windows NT Performance Monitor) to determine the typical I/O size for a logical drive.

Higher write I/O rates are experienced with write-caching enabled compared to disabled, especially for sequential I/O access patterns. Regardless of your I/O pattern, it is recommended that you enable write-caching to maximize the I/O rate and shorten application response time.

7.8.2.4 Optimizing the cache hit percentage

A higher cache hit percentage is also desirable for optimal application performance and is positively correlated with the I/O request rate.

If the cache hit percentage of all logical drives is low or trending downward, and you do not have the maximum amount of controller cache memory installed, this could indicate the need to install more memory.

If an individual logical drive is experiencing a low cache hit percentage, consider enabling cache read-ahead (or prefetch) for that logical drive. Cache read-ahead can increase the cache hit percentage for a sequential I/O workload. When cache read-ahead is enabled, the controller reads the requested data from disk and, in addition, fetches more data, usually from adjacent data blocks on the drive. This increases the chance that a future request for data can be fulfilled from the cache rather than requiring a disk access.

The cache read-ahead multiplier value determines how many additional data blocks are read into the cache. Choosing a higher cache read-ahead multiplier can increase the cache hit percentage.

If you have determined that your I/O has sequential characteristics, try enabling an aggressive cache read-ahead multiplier (8, for example) using the Logical drive > Properties option. Then examine the logical drive cache hit percentage to see if it has improved. Continue to customize logical drive cache read-ahead to arrive at the optimal multiplier. (In the case of a random I/O pattern, the optimal multiplier is zero.)
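To see why the multiplier helps sequential workloads but not random ones, the toy Python simulation below models a bounded block cache that, on every miss, prefetches a number of adjacent blocks equal to the multiplier. It illustrates the principle only; it is not the controller's actual caching algorithm, and the cache size and workload sizes are arbitrary assumptions.

import random
from collections import deque

def cache_hit_pct(accesses, multiplier, cache_size=1024):
    # Toy model: on a cache miss, fetch the requested block plus `multiplier`
    # adjacent blocks into a bounded cache with oldest-first eviction.
    cache, order, hits = set(), deque(), 0
    for block in accesses:
        if block in cache:
            hits += 1
            continue
        for b in range(block, block + multiplier + 1):
            if b not in cache:
                cache.add(b)
                order.append(b)
        while len(cache) > cache_size:
            cache.discard(order.popleft())
    return 100.0 * hits / len(accesses)

sequential = list(range(50000))
random_io = [random.randrange(10000000) for _ in range(50000)]

for mult in (0, 4, 8):
    print(f"multiplier {mult}: "
          f"sequential {cache_hit_pct(sequential, mult):.0f}% hits, "
          f"random {cache_hit_pct(random_io, mult):.0f}% hits")

A run of this sketch shows the hit percentage for the sequential workload rising sharply as the multiplier grows, while the random workload stays near zero, which is why the optimal multiplier for random I/O is zero.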


7.8.2.5 Choosing an appropriate RAID level

Use the read percentage for a logical drive to determine actual application behavior. Applications with a high read percentage will do very well using RAID-5 logical drives because of the outstanding read performance of the RAID-5 configuration.

However, applications with a low read percentage (write-intensive) do not perform as well on RAID-5 logical drives because of the way a controller writes data and redundancy data to the drives in a RAID-5 array. If there is a low percentage of read activity relative to write activity, you might consider changing the RAID level of an array from RAID-5 to RAID-1 for faster performance.
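A minimal Python sketch of that decision follows; the 50% threshold is an arbitrary illustration, not a value from the Storage Manager, so weigh it against your own workload, capacity and availability requirements.

def suggest_raid_level(read_percentage, threshold=50.0):
    # The threshold is an assumed cut-off for illustration only.
    if read_percentage >= threshold:
        return "RAID-5 (read-intensive: excellent read performance)"
    return "RAID-1 (write-intensive: avoids RAID-5 redundancy-data writes)"

print(suggest_raid_level(85.0))   # mostly reads, RAID-5 is a good fit
print(suggest_raid_level(30.0))   # mostly writes, consider RAID-1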

7.8.2.6 Choosing an optimal logical drive modification priority

The modification priority defines how much processing time is allocated for logical drive modification operations versus system performance. The higher the priority, the faster logical drive modification operations complete, but the slower system I/O is serviced.

Logical drive modification operations include reconstruction, copyback, initialization, media scan, defragmentation, change of RAID level, and change of segment size.

The modification priority is set for each logical drive using a slider bar on the Logical drive > Properties dialog. There are five relative settings on the reconstruction rate slider bar, ranging from Low to Highest. The actual speed of each setting is determined by the controller. Choose the Low setting to maximize the I/O request rate. If the controller is idle (not servicing any I/O), it will ignore the individual logical drive rate settings and process logical drive modification operations as fast as possible.

7.8.2.7 Choosing an optimal segment size

A segment is the amount of data, in kilobytes, that the controller writes on a single drive in a logical drive before writing data on the next drive. With ServeRAID, this is the stripe unit size or stripe size. Data blocks store 512 bytes of data and are the smallest units of storage. The size of a segment determines how many data blocks it contains. For example, an 8 KB segment holds 16 data blocks and a 64 KB segment holds 128 data blocks.

Note: The segment size was expressed in number of data blocks in previous versions of this storage management software. It is now expressed in KB.
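The arithmetic behind those examples is simply the segment size divided by the 512-byte block size, as the following Python sketch shows.

BLOCK_SIZE_BYTES = 512   # data blocks store 512 bytes, the smallest unit of storage

def blocks_per_segment(segment_size_kb):
    # Number of 512-byte data blocks contained in one segment.
    return segment_size_kb * 1024 // BLOCK_SIZE_BYTES

print(blocks_per_segment(8))    # 16 data blocks, as in the example above
print(blocks_per_segment(64))   # 128 data blocks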


When you create a logical drive, the default segment size is a good choice for the expected logical drive usage. The default segment size can be changed using the Logical drive > Change Segment Size option.

If your typical I/O size is larger than your segment size, increase your segment size in order to minimize the number of drives needed to satisfy an I/O request.

If you are using the logical drive in a single-user, large I/O environment such as multimedia application storage, performance is optimized when a single I/O request can be serviced with a single array data stripe (the segment size multiplied by the number of drives in the array used for I/O). In this case, multiple disks are used for the same request, but each disk is only accessed once.
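The stripe arithmetic described above can be written down directly. The Python helper below is an illustrative sketch with example drive counts and I/O sizes; the segment size you actually choose must be one of the sizes Storage Manager offers.

import math

def stripe_size_kb(segment_size_kb, drives_in_array):
    # A full data stripe is the segment size multiplied by the number of
    # drives in the array used for I/O.
    return segment_size_kb * drives_in_array

def segment_for_single_stripe(io_size_kb, drives_in_array):
    # Smallest segment size so that one large I/O request fits in a single
    # stripe; round up to a segment size the Storage Manager actually offers.
    return math.ceil(io_size_kb / drives_in_array)

print(stripe_size_kb(64, 4))              # 256 KB stripe from 64 KB segments on 4 drives
print(segment_for_single_stripe(256, 4))  # 64 KB segments cover a 256 KB request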

7.8.2.8 Minimize disk accesses by defragmentation

Each access of the drive to read or write a file results in spinning of the drive platters and movement of the read/write heads. Make sure the files on your array are defragmented. When the files are defragmented, the data blocks making up the files are next to each other, so the read/write heads do not have to travel all over the disk to retrieve the separate parts of the file. Fragmented files are detrimental to the performance of a logical drive with sequential I/O access patterns.

7.9 Disk subsystem rules of thumb

A performance relationship can be developed for the disk subsystem. This relationship is based upon the RAID strategy, the number of drives, and the disk drive model. The disk subsystem rules of thumb are stated in Table 18.

Table 18. Disk subsystem rules of thumb

• RAID-0: 33-50% more throughput than RAID-1 (same number of drives)
• RAID-1E: 33-50% more throughput than RAID-5 (same number of drives)
• RAID-5E: 10-20% more throughput than RAID-5
• Doubling the number of drives: 50% increase in drive throughput (until the disk controller becomes a bottleneck)
• One 10,000 RPM drive: 10-50% improvement over 7200 RPM drives (50% when considering RPM only, 10% when comparing with 7200 RPM drives with rotational positioning optimization)
• Ultra2 SCSI: 5-10% more throughput than Ultra SCSI for typical server environments
• Ultra3 SCSI: 5-10% more throughput than Ultra2 SCSI for typical server environments
• Single logical drive: 25% increase in throughput compared to a multiple logical drive configuration

A ServeRAID-3HB can support:

• Up to about 30 10K RPM drives before a performance bottleneck occurs.
• Up to about 30 7200 RPM drives before a performance bottleneck occurs.

A ServeRAID-4H can support:

• Up to about 60 10K RPM drives before a performance bottleneck occurs.
• Up to about 60 7200 RPM drives before a performance bottleneck occurs.
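As a rough planning aid, the Python sketch below turns the midpoints of the ranges in Table 18 into multiplicative factors relative to an arbitrary baseline (a four-drive RAID-5 array of 7200 RPM drives on Ultra SCSI, normalized to 1.0). The factor values, the baseline, and the way they are combined are assumptions for illustration; treat the result as a back-of-the-envelope estimate, not a measurement.

import math

# Midpoints of the ranges in Table 18, normalized to an assumed baseline of
# a RAID-5 array of 7200 RPM drives on Ultra SCSI (= 1.0).
RAID_FACTOR = {"RAID-5": 1.00, "RAID-5E": 1.15, "RAID-1E": 1.40}
DRIVE_FACTOR = {"7200": 1.00, "10K": 1.30}
BUS_FACTOR = {"Ultra": 1.000, "Ultra2": 1.075, "Ultra3": 1.075 * 1.075}

def relative_throughput(raid, drives, rpm, bus, baseline_drives=4):
    # Rule of thumb: each doubling of the drive count adds roughly 50%,
    # until the disk controller itself becomes the bottleneck.
    drive_scaling = 1.5 ** math.log2(drives / baseline_drives)
    return RAID_FACTOR[raid] * DRIVE_FACTOR[rpm] * BUS_FACTOR[bus] * drive_scaling

baseline = relative_throughput("RAID-5", 4, "7200", "Ultra")
upgraded = relative_throughput("RAID-1E", 8, "10K", "Ultra2")
print(f"Estimated gain: {upgraded / baseline:.1f}x the baseline configuration")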
