Storage System Architecture


Page 1: Storage System Architecture


Section 2 - Storage Systems Architecture

Welcome to Section 2 of Storage Technology Foundations – Storage Systems Architecture.

Copyright © 2007 EMC Corporation. All rights reserved.

These materials may not be copied without EMC's written consent.

EMC believes the information in this publication is accurate as of its publication date. The information is subject to change without notice.

THE INFORMATION IN THIS PUBLICATION IS PROVIDED “AS IS.” EMC CORPORATION MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND WITH RESPECT TO THE INFORMATION IN THIS PUBLICATION, AND SPECIFICALLY DISCLAIMS IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Use, copying, and distribution of any EMC software described in this publication requires an applicable software license.

EMC2, EMC, Navisphere, CLARiiON, and Symmetrix are registered trademarks and EMC Enterprise Storage, The Enterprise Storage Company, The EMC Effect, Connectrix, EDM, SDMS, SRDF, Timefinder, PowerPath, InfoMover, FarPoint, EMC Enterprise Storage Network, EMC Enterprise Storage Specialist, EMC Storage Logix, Universal Data Tone, E-Infostructure, Access Logix, Celerra, SnapView, and MirrorView are trademarks of EMC Corporation.

All other trademarks used herein are the property of their respective owners.

Page 2: Storage System Architecture


Section Objectives

Upon completion of this section, you will be able to:

Describe the physical and logical components of a host

Describe common connectivity components and protocols

Describe features of intelligent disk storage systems

Describe data flow between the host and the storage array

The objectives for this section are shown here. Please take a moment to read them.

Page 3: Storage System Architecture


In This Section

This section contains the following modules:

1. Components of a Host

2. Connectivity

3. Physical Disks

4. RAID Arrays

5. Disk Storage Systems

Additional Information:

Apply Your Knowledge

Data Flow Exercise (Student Resource Guide ONLY)

Case Studies (Student Resource Guide ONLY)

This section consists of the five modules shown here.

This section also contains Apply Your Knowledge information, a Data Flow Exercise, and two Case Studies.

The Apply Your Knowledge information is presented on-line at the end of Module 5. The Data Flow Exercise and two Case Studies are only available in the Student Resource Guide. Please make sure to download the Student Resource Guide and review these materials prior to taking the on-line assessment.

Page 4: Storage System Architecture


Page 5: Storage System Architecture


Components of a Host

Upon completion of this module, you will be able to:

List the hardware and software components of a host

Describe key protocols and concepts used by each component

In this module, we look at the hardware and software components of a host, as well as the key protocols and concepts that make these components work. This provides the context for how data typically flows within the host, as well as between the hosts and storage systems.

The objectives for this module are shown here. Please take a moment to read them.

Page 6: Storage System Architecture


Examples of Hosts

Laptop

Server

Group of Servers

Mainframe

A host could be something small, like a laptop, or it could be larger, such as a server, a group or cluster of servers, or a mainframe. The host has physical (hardware) and logical (software) components. Let’s look at the physical components first.

Page 7: Storage System Architecture


Physical Components of a Host

[Diagram: CPU, Storage, and I/O Devices connected by a Bus]

The most common physical components found in a host system are the Central Processing Unit (CPU), storage, and Input/Output (I/O) devices.

The CPU performs all the computational processing (number-crunching) for the host. This processing involves running programs, which are a series of instructions that tell the CPU what to do.

Storage can be high-speed, temporary (volatile, meaning that the content is lost when power is removed) storage, or permanent magnetic or optical storage media.

I/O devices allow the host to communicate with the outside world.

Let’s look at each of these elements, starting with the CPU.

Page 8: Storage System Architecture


CPU

[Diagram: CPU containing the ALU, Registers, and L1 Cache, connected to the Bus]

The CPU consists of three major parts: the Arithmetic Logic Unit (ALU), the registers, and the L1 cache.

The Arithmetic Logic Unit (ALU) is the portion of the CPU that performs all the manipulation of data, such as addition of numbers.

The Registers hold data that is being used by the CPU. Because of their proximity to the ALU, registers are very fast. CPUs will typically have only a small number of registers – 4 to 20 is common.

L1 cache is additional memory associated with the CPU. It holds data and program instructions that are likely to be needed by the CPU in the near future. The L1 cache is slower than the registers, but offers more storage space – 16 KB is common. Although L1 cache is optional, it is found on most modern CPUs.

The CPU connects to other components in the host via a bus. Buses will be discussed in the Connectivity module of this Section.

Page 9: Storage System Architecture


Storage

[Diagram: memory addressing, with each address 0 through n holding content Data 0 through Data n; Disk and Memory shown as the two forms of storage]

Storage in a host is comprised of memory modules and magnetic or optical media.

Memory provides access to data at electronic speeds as it is implemented using silicon chips and has no mechanical parts. Generally, there are two types of memory within a host:

Random Access Memory (RAM) - the most common form of memory. It allows direct access to any memory location and can have data written into it or read from it.
Read Only Memory (ROM) - contains data that can be read, but not changed. It is usually used for data needed during internal routines such as system startup.

Modern hosts can have large amounts of memory – 16 GB and upwards. The slide shows a representation of memory addressing. Each memory location is given a unique address which is used for reading/writing data from and to memory.

Examples of media-based host storage include: hard disk, CD-ROM or DVD-ROM, floppy disk, and tape drive.

Page 10: Storage System Architecture


Storage Hierarchy – Speed and Cost

[Diagram: storage hierarchy from slow and low-cost to fast and high-cost: tape, optical disk, magnetic disk, RAM, L2 cache, L1 cache, CPU registers]

In any host, there is a variety of storage types. Each type has different characteristics of speed, cost, and capacity. As a general rule, faster technologies cost more and, as a result, are more scarce.

CPU registers are extremely fast but limited in number to a few tens of locations at most, and are expensive in terms of both cost and power use. As we move down the list, speeds decrease along with cost.

Magnetic disks are generally fixed, whereas optical disk and tape use removable media. The cost of optical and tape media per MB stored is much lower than that of magnetic disk.

Page 11: Storage System Architecture


I/O Devices

Human interface – keyboard, mouse, monitor

Computer-computer interface – Network Interface Card (NIC)

Computer-peripheral interface – USB (Universal Serial Bus) port, Host Bus Adapter (HBA)

I/O devices allow a host to interact with the outside world by sending and receiving data. The basic I/O devices, such as the keyboard, mouse and monitor, allow users to enter data and view the results of operations. Other I/O devices allow hosts to communicate with each other or with peripheral devices, such as printers and cameras.

Page 12: Storage System Architecture


HBAs

[Diagram: host software stack - applications, DBMS, management utilities, volume management, file system, multi-pathing software, device drivers, and the operating system, with multiple HBAs at the bottom]

The host connects to storage devices using special hardware called a Host Bus Adapter (HBA). HBAs are generally implemented as either an add-on card or a chip on the motherboard of the host. The ports on the HBA are used to connect the host to the storage subsystem. There may be multiple HBAs in a host.

The HBA has the processing capability to handle some storage commands, thereby reducing the burden on the host CPU.

Page 13: Storage System Architecture


Logical Components of a Host

[Diagram: host software stack, as above]

Hosts generally include software components such as:

Applications - provide a point of interaction either between the user and the host or between hosts.

Operating system - controls all aspects of the computing environment. It manages the user interface and the internal operations of all hardware components of the system. The Operating System:
− Provides the services required for applications to access data
− Monitors and responds to user actions and the environment
− Organizes and controls the hardware components
− Connects hardware components to the application program layer and the users
− Manages system activities such as storage and communication

File System (and Files) - provides a logical structure for data access and data storage.

Device drivers:
− Allow the operating system to be aware of, and use a standard interface to access and control, a specific device (e.g., printer, speakers, mouse, keyboard, video, storage devices)
− Provide the appropriate protocols to the host to allow access to the device

Page 14: Storage System Architecture


File Systems

[Diagram: host software stack, as above]

The file system is the general name given to the host-based logical structures and software routines used to control access to data storage.

The file system block is the smallest ‘container’ allocated to a file’s data. Each filesystem block is a contiguous area of physical disk capacity.

Blocks can range in size, depending on the type of files being stored and accessed. The block size is fixed (by the operating system) at the time of file system creation. Since most files are larger than the pre-defined filesystem block size, a file's data spans multiple filesystem blocks. However, the filesystem blocks containing all of the file's data may not necessarily be contiguous on a physical disk. Over time, as files grow larger, the file system becomes increasingly fragmented.
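To make the block arithmetic concrete, here is a minimal sketch in Python; the 4 KB block size and the 10,000-byte file are illustrative assumptions, not values from this course.

```python
import math

def blocks_needed(file_size_bytes, block_size_bytes=4096):
    """Number of file system blocks a file of the given size occupies.
    Any data past a block boundary consumes a whole additional block."""
    return math.ceil(file_size_bytes / block_size_bytes)

# A 10,000-byte file on a file system with 4 KB blocks occupies 3 blocks;
# the last block is only partially filled, so 12,288 bytes are allocated.
print(blocks_needed(10_000))          # 3
print(blocks_needed(10_000) * 4096)   # 12288
```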

In multi-user, multi-tasking environments, filesystems manage shared storage resources using:
− Directories, paths, and structures to identify file locations
− Volume Managers to hide the complexity of physical disk structures
− File locking capabilities to control access to files. This is important when multiple users or applications attempt to access the same file simultaneously

Page 15: Storage System Architecture


File System: Metadata Examples

UNIX (UFS):
– File type and permissions
– Number of links
– Owner and group IDs
– Number of bytes in the file
– Last file access
– Last file modification

Windows (NTFS):
– Time stamp and link count
– File name
– Access rights
– File data
– Index information
– Volume information

The number of files created and accessed by a host can be very large. Instead of using a linear or flat structure (similar to having many objects in a single box), a filesystem is divided into directories (smaller boxes), or folders.

Directories:
− Organize file systems into containers which may hold files as well as other (sub)directories
− Hold information about the files they contain

A directory is a special type of file containing a list of filenames and associated metadata (information or data about the file). When a user attempts to access a given file by name, the name is used to look up the appropriate entry in the directory. That entry holds the corresponding metadata.
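As a purely illustrative sketch of that name-to-metadata lookup, a directory can be modeled as a mapping from file names to metadata entries; the field names below are simplified assumptions and do not reflect the actual UFS or NTFS structures.

```python
# A directory modeled as a mapping from file name to that file's metadata.
directory = {
    "syllabus.pdf": {"owner": "teacher", "size_bytes": 52_430, "permissions": "rw-r--r--"},
    "notes.txt":    {"owner": "teacher", "size_bytes": 1_204,  "permissions": "rw-------"},
}

def lookup(name):
    """Resolve a file name to its metadata entry, as a file system does
    when a user accesses a file by name."""
    return directory[name]

print(lookup("notes.txt")["size_bytes"])   # 1204
```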

Page 16: Storage System Architecture


File Systems: Journaling and Logging

Improves data integrity and system restart time over non-journaling file systems

Uses a separate area called a log or journal
– May hold all data to be written
– May hold only metadata

Disadvantage - slower than other file systems
– Each file system update requires at least one extra write, to the log

Non-journaling file systems create a potential for lost files because they may use many separate writes to update their data and metadata. If the system crashes during the write process, metadata or data may be lost or corrupted. When the system reboots, the filesystem attempts to update the metadata structures by examining and repairing them. This operation takes a long time on large file systems. If there is insufficient information to recreate the desired or original structure, files may be misplaced or lost and file systems corrupted.

A journaling file system uses a separate area called a log, or journal. This journal may contain all the data to be written (physical journal), or may contain only the metadata to be updated (logical journal). Before changes are made to the filesystem, they are written to this separate area. Once the journal has been updated, the operation on the filesystem can be performed. If the system crashes during the operation, there is enough information in the log to "replay" the log record and complete the operation.
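The toy sketch below illustrates only this ordering idea (record the intent in the journal, apply the change, then retire the journal entry, replaying anything left in the journal after a crash); it is not how any real journaling file system is implemented.

```python
class ToyJournalingFS:
    """Write-ahead journaling in miniature."""

    def __init__(self):
        self.journal = []   # pending operations (the log)
        self.data = {}      # the "on-disk" state

    def write(self, name, contents):
        self.journal.append((name, contents))   # 1. record the change in the journal
        self.data[name] = contents              # 2. apply it to the file system
        self.journal.pop()                      # 3. retire the journal entry

    def replay(self):
        """After a crash, complete any operation that was journaled but never applied."""
        for name, contents in self.journal:
            self.data[name] = contents
        self.journal.clear()

fs = ToyJournalingFS()
fs.write("fileA", "first version")
# Simulate a crash between steps 1 and 2 by leaving an entry in the journal:
fs.journal.append(("fileB", "new contents"))
fs.replay()
print(fs.data)   # both files end up in a consistent state
```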

Journaling results in a very quick filesystem check by only looking at the active, most recently accessed parts of a large file system. In addition, because information about the pending operation is saved, the risk of files being lost is lessened.

A disadvantage of journaling filesystems is that they are slower than other file systems. This slowdown results from the extra operations that must be performed on the journal each time the filesystem is changed. However, the much shorter file system check time and the integrity provided by journaling far outweigh this disadvantage. Nearly all file system implementations use journaling.

Page 17: Storage System Architecture


Volume Management

[Diagram: host software stack, as above]

The volume manager is an optional intermediate layer between the file system and the physical disks. It 'aggregates' several hard disks to form a large, virtual disk and makes this virtual disk visible to higher-level programs and applications. It optimizes access to storage and simplifies the management of storage resources.

Page 18: Storage System Architecture


How Files are Moved to and from Storage

[Diagram: a teacher configures and manages course files, which reside in file system files; the file system maps these to file system blocks, which the LVM maps to logical extents consisting of disk physical extents residing in disk sectors managed by the disk storage subsystem]

This represents how files are moved to and from Storage:

1. A teacher designs course materials using an application and stores them as files on a filesystem.

2. These files are mapped to units of data called filesystem blocks, which are mapped to disk sectors by the operating system, in the absence of a Logical Volume Manager.

3. When a Logical Volume Manager (LVM) is used, filesystem blocks are mapped to logical extents, which in turn are mapped to disk physical extents. These physical disk extents map to disk sectors.

Page 19: Storage System Architecture


Module Summary

Key points covered in this module:

Hosts typically have:
– Hardware: CPU, memory, buses, disks, ports, and interfaces
– Software: applications, operating systems, file systems, device drivers, volume managers

Journaling enables:
– very fast file system checks in the event of a system crash
– better integrity for the file system structure

HBAs are used to connect hosts to storage devices

These are the key points covered in this module. Please take a moment to review them.

Page 20: Storage System Architecture


Check Your Knowledge

What are some examples of hosts?

Describe the hardware components found in most hosts.

What is the function of the operating system?

What is the function of the file system?

What is volume management?

Check your knowledge of this module by taking some time to answer the questions shown on the slide.

Page 21: Storage System Architecture


Connectivity

Upon completion of this module, you will be able to:

Describe the physical components of a networked storage environment

Describe the logical components (communication protocols) of a networked storage environment

In the previous module, we looked at the host environment. In this module, we discuss how the host is connected to storage, and the protocols used for communication between them.

The objectives for this module are shown here. Please take a moment to read them.

Page 22: Storage System Architecture


Physical Components – Host with Internal Storage

[Diagram: host with internal storage - CPU, HBA, and an internal disk connected by a bus, with ports and a cable]

There are three key connectivity components associated with hosts:
Bus – for example, connecting the CPU to memory
Ports – connections to external devices such as printers, scanners, or storage
Cables – copper or fiber optic "wires" connecting a host to internal or external devices

A host with internal storage may be anything from a laptop to a large enterprise server. All of the components are internal to the host enclosure.

Page 23: Storage System Architecture


Bus Technology

[Diagram: serial, bi-directional serial, and parallel bus types]

A bus is a collection of paths that facilitate data transmission from one part of the computer to another.

Physical components communicate across a bus by sending packages of data between the devices. These packets can travel in a serial path or in parallel paths. In serial communication, the bits travel one behind the other. In parallel communication, the bits can move along multiple paths simultaneously.

A simple analogy to describe buses is a highway:

A Serial Bus is a one-way, single-lane highway where data packets travel in a line in one direction.

A Bi-directional Serial Bus is a two-lane road where data packets travel in a line in both directions simultaneously.

A Parallel Bus is a multi-lane highway. This could be a bi-directional, multi-lane highway where data packets travel in different lanes in both directions simultaneously.

Note: The Parallel Bi-directional Bus is not shown in this slide.


Page 24: Storage System Architecture


Bus Technology

System Bus – connects CPU to Memory

Local (I/O) Bus – carries data to/from peripheral devices

Bus width measured in bits

Bus speed measured in MHz

Throughput measured in MB/S

Generally, there are at least two types of buses in a computer system:
System Bus – carries data from the processor to memory
Local or I/O Bus – carries data to/from peripheral devices such as storage devices. The local bus is a high-speed pathway that connects directly to the processor

The size of a bus, known as its width, is important because it determines how much data can be transmitted at one time. For example, a 16-bit bus can transmit 16 bits of data, whereas a 32-bit bus can transmit 32 bits of data. The width of a bus may be compared to the number of lanes on a highway.

Every bus has a clock speed measured in MHz. A fast bus allows data to be transferred faster, which makes applications run faster.
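As a rough worked example of how width and clock speed combine (peak rates only, ignoring protocol overhead), a classic 32-bit, 33 MHz PCI bus peaks at roughly 133 MB/s:

```python
def peak_throughput_mb_per_s(width_bits, clock_mhz):
    """Peak bus throughput = bytes per transfer x transfers per second."""
    return (width_bits / 8) * clock_mhz

print(peak_throughput_mb_per_s(32, 33.33))   # ~133 MB/s, the classic PCI figure
print(peak_throughput_mb_per_s(64, 33.33))   # ~267 MB/s for a 64-bit PCI bus
```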

Page 25: Storage System Architecture


Connectivity Protocols

Protocol = a defined format for communication – allows the sending and receiving devices to agree on what is being communicated.

[Diagram: tightly connected entities, directly attached entities, and network connected entities]

A protocol is a defined format, in this case for communication between hardware or software components. Communication protocols are defined for systems and components that are:

Tightly connected entities – such as central processor to RAM, or storage buffers to controllers – use standard bus technology (e.g., system bus or local I/O bus)
Directly attached entities, or devices connected at moderate distances – such as host to printer or host to storage
Network connected entities – such as networked hosts, Network Attached Storage (NAS), or Storage Area Networks (SAN)

We will discuss the communication protocols (logical components) found in each of these connectivity models, starting with the tightly connected or bus protocols.

Page 26: Storage System Architecture


Communication Protocols

[Diagram: host stack in which applications and the operating system reach internal storage through PCI and SCSI or IDE/ATA device drivers]

The protocols for the local (I/O) bus and for connections to an internal disk system include PCI, IDE/ATA, and SCSI.

The next few slides examine each of these.

Page 27: Storage System Architecture


Bus Technology - PCI

Peripheral Component Interconnect (PCI) defines the local bus system within a computer

It is an interconnection between microprocessor and attached devices, in which expansion slots are spaced closely for high-speed operation

Has Plug and Play functionality

PCI is 32/64 bit

Throughput is 133 MB/sec

The Peripheral Component Interconnect (PCI) is a specification defining the local bus system within a computer. The specification standardizes how PCI expansion cards, such as network cards or modems, install themselves and exchange information with the central processing unit (CPU).

In more detail, Peripheral Component Interconnect (PCI) includes:
– an interconnection system between a microprocessor and attached devices, in which expansion slots are spaced closely for high-speed operation
– plug and play functionality that makes it easy for a host to recognize a new card
– 32 or 64 bit data
– a throughput of 133 MB/sec

PCI Express is an enhanced PCI bus with increased bandwidth.

Page 28: Storage System Architecture


IDE/ATA

Integrated Device Electronics (IDE) / Advanced Technology Attachment (ATA)

Most popular interface used with modern hard disks

Good performance at low cost

Desktop and laptop systems

Inexpensive storage interconnect

The most popular interface protocol used in modern hard disks is the one most commonly known as IDE. This interface is also known as ATA.

IDE/ATA hard disks are used in most modern PCs, and offer excellent performance at relatively low cost.

Page 29: Storage System Architecture


SCSI - Small Computer System Interface

Most popular hard disk interface for servers

Higher cost than IDE/ATA

Supports multiple simultaneous data access

Currently both parallel and serial forms

Used primarily in “higher end” environments

Small Computer Systems Interface, SCSI, has several advantages over IDE that make it preferable for use in higher-end machines. It is far less commonly used than IDE/ATA in PCs due to its higher cost and the fact that its advantages are not useful for the typical home or business desktop user.

SCSI began as a parallel interface, allowing the connection of devices to a PC, or other servers, with data being transmitted across multiple data lines. SCSI itself, however, has been broadened greatly in terms of its scope, and now includes a wide variety of related technologies and standards.

Page 30: Storage System Architecture


SCSI Model

[Diagram: an initiator issuing a command to a target]

As you can see from the diagram, a SCSI device that ‘starts’ a communication is an “initiator”, and a SCSI device that services a request is a “target”.

You should not necessarily think of initiators as hosts, and targets as storage devices. Storage devices may initiate a command to other storage devices or switches, and hosts may be targets and receive commands from the storage devices.

After initiating a request to the target, the host can process other events without having to wait for a response from the target. After it finishes processing, the target signals a command complete or a status message back to the host.

Page 31: Storage System Architecture


SCSI Model

[Diagram: initiator ID, target ID, and LUNs]

Components of a SCSI communication include:
Initiator ID – uniquely identifies an initiator; used as the "originating address"
Target ID – uniquely identifies a target; used as the address for exchanging commands and status information with initiators
Logical Unit Numbers (LUNs) – identify a specific Logical Unit in a target. A Logical Unit can be more than a single disk

Page 32: Storage System Architecture


SCSI Addressing

Initiator ID - a number from 0 to 15, with the most common value being 7
Target ID - a number from 0 to 15
LUN - a number that specifies a device addressable through a target

[Diagram: address format - Initiator ID | Target ID | LUN]

Initiator ID is the original initiator ID number (used to send responses back to the initiator from the storage device). A SCSI host bus adapter (referred to as a controller) can be implemented in two ways:

– an onboard interface
– an 'add-in' card plugged into the system I/O bus

Target ID is the value for a specific storage device. It is an address that is set on the interface of the device such as a disk, tape or CDROM.

LUN is Logical Unit Number of the device. It reflects the actual address of the device, as seen by the target.

Page 33: Storage System Architecture


Disk Identifier - Addressing

[Diagram: host addressing - controller c0 (the initiator/HBA), target t0 (the peripheral controller), and LUNs d0, d1, d2, giving the device name c0 t0 d0]

For example, a logical device name (used by a host) for a disk drive may be cn|tn|dn, where:
cn is the controller
tn is the target ID of the device, such as t0, t1, t2, and so on
dn is the device number, which reflects the actual address of the device unit. This is usually d0 for most SCSI disks because there is only one disk attached to the target controller.

In intelligent storage systems, discussed later, each target may address many LUNs.
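A hypothetical helper for splitting such a name into its fields might look like the sketch below; it is purely illustrative and not part of any operating system's tooling.

```python
import re

def parse_device_name(name):
    """Split a logical device name of the form c<n>t<n>d<n> into controller,
    target, and device (LUN) numbers."""
    match = re.fullmatch(r"c(\d+)t(\d+)d(\d+)", name)
    if not match:
        raise ValueError(f"not a cntndn device name: {name}")
    controller, target, device = (int(part) for part in match.groups())
    return {"controller": controller, "target": target, "device": device}

print(parse_device_name("c0t0d0"))   # {'controller': 0, 'target': 0, 'device': 0}
print(parse_device_name("c1t3d2"))   # {'controller': 1, 'target': 3, 'device': 2}
```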

Page 34: Storage System Architecture


SCSI - Pros and Cons

Pros:
– Fast transfer speeds, up to 320 megabytes per second
– Reliable, durable components
– Can connect many devices with a single bus, more than just HDs
– SCSI host cards can be put in almost any system
– Full backwards compatibility

Cons:
– Configuration and setup specific to one computer
– Unlike IDE, few BIOS support the standard
– Overwhelming number of variations in the standard, hardware, and connectors
– No common software interfaces and protocol

SCSI has many significant advantages in relation to IDE. They include:
– a faster transfer speed (Note: 320 MB/s refers to parallel SCSI; serial SCSI may be different)
– robust software and hardware
– the ability to connect many devices to a computer
– SCSI Host Adapter cards that can be put into almost any system
– a remarkable level of backwards compatibility

Page 35: Storage System Architecture


Comparison: IDE/ATA vs. SCSI

Feature              IDE/ATA             SCSI
Connectivity Market  Internal storage    Internal and external storage
Speed (MB/sec)       100/133/150         320
Hot Pluggable        No                  Yes
Expandability        Easier to set up    Very good, but very expensive to set up
Cost/Performance     Good                High cost / fast transfer speed

Expandability and number of devices - SCSI is superior to IDE/ATA. This advantage of SCSI only matters if you actually need this much expansion capability as SCSI is more involved and expensive to set up.

Device Type Support – SCSI holds a significant advantage over IDE/ATA in terms of the types of devices each interface supports.

Cost – the IDE/ATA interface is superior to the SCSI interface.

Performance – These factors influence system performance for both interfaces:
Maximum Interface Data Transfer Rate: Both interfaces presently offer very high maximum interface rates, so this is not an issue for most PC users. However, if you are using many hard disks at once, for example in a RAID array, SCSI offers better overall performance.
Device-Mixing Issues: IDE/ATA channels that mix hard disks and CD-ROMs are subject to significant performance hits because these devices operate at different speeds (hard disks read and write relatively quickly compared to CD-ROM drives). Also, because an IDE channel can only support a single device at a time, it must wait for the slower optical drive to complete a task. SCSI does not have this problem.
Device Performance: SCSI can support multiple devices simultaneously, while IDE/ATA can only support a single device at a time.

Configuration and set-up – IDE/ATA is easier to set up, especially if you are using a reasonably new machine and only a few devices. SCSI has a significant advantage over IDE/ATA in terms of hard disk addressing issues.

Page 36: Storage System Architecture


Physical Components – Host with External Storage

[Diagram: host with external storage - CPU, bus, HBA, and ports, connected by a cable to an external disk]

A host with external storage is usually a large enterprise server. Components are identical to those of a host with internal storage. The key difference is in the external storage interfaces used.


Page 37: Storage System Architecture


Fibre Channel

[Diagram: host software stack connected through its HBAs over Fibre Channel to storage arrays]

Fibre Channel is a high–speed interconnect used in networked storage to connect servers to shared storage devices. Fibre Channel components include HBAs, hubs, switches, cabling, and disks.

The term Fibre Channel refers to both the hardware components and the protocol used for communication between nodes.

Page 38: Storage System Architecture


External Storage Interfaces – A Comparison

SCSI
– Limited distance
– Limited device count
– Usually limited to single initiator
– Single-ported drives

Fibre Channel
– Greater distance
– High device count in SANs
– Multiple initiators
– Dual-ported drives

The two most popular interfaces for external storage devices are SCSI and Fibre Channel (FC). SCSI is also commonly used for internal storage in hosts; FC is almost never used internally.

Page 39: Storage System Architecture


Fibre Channel Connectivity

[Diagram: hosts connected through switches to storage]

When computing environments require high speed connectivity, they use sophisticated equipment to connect hosts to storage devices.

Physical connectivity components in networked storage environments include:
HBA (host-side interface) – Host Bus Adapters connect the host to the storage devices
Optical cables – fiber optic cables to increase distance and reduce cable bulk
Switches – used to control access to multiple attached devices
Directors – sophisticated switches with high availability components
Bridges – connections to different parts of a network

Page 40: Storage System Architecture


Module Summary

Key points covered in this module:

The physical components of a networked storage environment

The logical components (communication protocols) of a networked storage environment

These are the key points covered in this module. Please take a moment to review them.

Page 41: Storage System Architecture


Check Your Knowledge

What are the key physical connectivity components of a small systems environment?

What are the key physical connectivity components of networked storage computing environments?

What are the key logical connectivity protocols found in all computing environments?

Check your knowledge of this module by taking some time to answer the questions shown on the slide.

Page 42: Storage System Architecture


Page 43: Storage System Architecture


Physical Disks

After completing this module, you will be able to:

Describe the major physical components of a disk drive and their function

Define the logical constructs of a physical disk

Describe the access characteristics for disk drives and their performance implications

Describe the logical partitioning of physical drives

There are several methods for storing data; however, in this module the focus is on disk drives. Disk drives use many types of technology to perform their job: mechanical, chemical, magnetic, and electrical. Our intent is not to make you an expert on every detail about the drive; rather, you should gain a high-level understanding of how both the physical and logical parts of a drive work. This enables you to see how these parts impact system capacity, reliability, and performance.

The objectives for this module are shown here. Please take a moment to read them.

Page 44: Storage System Architecture


Lesson: Disk Drive Components

Upon completion of this lesson, you will be able to:

Describe the physical components of a disk drive

Describe the physical structure of a disk drive platter

Discuss how the geometry of a disk impacts how data is recorded on a platter

Differentiate between the logical organization of data and the physical organization on a disk drive

The focus of this lesson is on the components of a disk drive and how they work. Additionally, it is important to understand how the data is organized on the disk based on its disk geometry.

Page 45: Storage System Architecture


Disk Drive Components: Platters

[Diagram: platter surfaces storing data as strings of binary 0s and 1s]

A hard drive contains a series of rotating platters within a sealed case. The sealed case is known as Head Disk Assembly, or HDA.

A platter has the following attributes:
– It is a rigid, round disk which is coated with magnetically sensitive material. Data is stored in binary code (0s and 1s). It is encoded by polarizing magnetic areas, or domains, on the disk surface.
– Data can be written to and read from both surfaces of a platter.
– A platter's storage capacity varies across drives. There is an industry trend toward higher capacity as technology improves.
Note: The drive's capacity is determined by the number of platters, the amount of data which can be stored on each platter, and how efficiently data is written to the platter.

Note: These concepts apply to disk drives used in systems of all sizes.

Page 46: Storage System Architecture


Disk Drive Components: Spindle

[Diagram: platters mounted on a spindle]

Multiple platters are connected by a spindle. The spindle is connected to a motor which rotates at a constant speed. The spindle rotates continuously until power is removed from the spindle motor. Many hard drive failures occur when the spindle motor fails.

Disk platters spin at speeds of several thousand revolutions per minute. These speeds increase as technologies improve, though there is a physical limit to the extent to which they can improve.

Page 47: Storage System Architecture


Disk Drive Components: Read/Write Heads

Data is read and written by read/write heads, or R/W heads. Most drives have two R/W heads per platter, one for each surface of the platter.

When reading data, they detect magnetic polarization on the platter surface. When writing data, they change the magnetic polarization on the platter surface.

Since reading and writing data is a magnetic process, the R/W heads never actually touch the surface of the platter. There is a microscopic air gap between the read/write heads and the platter. This is known as the head flying height.

When the spindle rotation has stopped, the air gap is removed and the R/W heads rest on the surface of the platter in a special area near the spindle called a landing zone. The landing zone is coated with a lubricant to reduce head/platter friction. Logic on the disk drive ensures that the heads are moved to the landing zone before they touch the surface.

If the drive malfunctions and a read/write head accidentally touches the surface of the platter outside of the landing zone, it is called a head crash. When a head crash occurs, the magnetic coating on the platter gets scratched and damage may also occur to the R/W head. A head crash generally results in data loss.

Page 48: Storage System Architecture


Disk Drive Components: Actuator

[Diagram: actuator and spindle]

Read/write heads are mounted on the actuator arm assembly, which positions the read/write head at the location on the platter where data needs to be written or read.

Page 49: Storage System Architecture


Physical Disk Structures: Actuator Arm Assembly

[Diagram: actuator arm assembly with a read/write head for each platter surface]

The read/write heads for all of the platters in a drive are attached to one actuator arm assembly and move across the platter simultaneously. Notice there are two read/write heads per platter, one for each surface.

Page 50: Storage System Architecture


Disk Drive Components: Controller

[Diagram: bottom view of a disk drive, showing the HDA, controller, interface, and power connector]

The controller is a printed circuit board, mounted at the bottom of the disk drive. It contains a microprocessor (as well as some internal memory, circuitry, and firmware) that controls:

– power to the spindle motor and control of motor speed
– how the drive communicates with the host CPU
– reads/writes, by moving the actuator arm and switching between R/W heads
– optimization of data access

Page 51: Storage System Architecture


Physical Disk Structures: Sectors and Tracks

[Diagram: platter with tracks and sectors]

– Data is recorded in tracks. A track is a concentric ring around the spindle which contains data. A track can hold a large amount of data. Track density describes how tightly packed the tracks are on a platter. Tracks are numbered from the outer edge of the platter, starting at track zero.
– A track is divided into sectors. A sector is the smallest individually-addressable unit of storage. The number of sectors per track is based upon the specific drive.
– Sectors typically hold 512 bytes of user data. Some disks can be formatted with larger sectors.
– A formatting operation performed by the manufacturer writes the track and sector structure on the platter.

Each sector stores user data as well as other information, including its sector number, head number (or platter number) and track number. This information aids the controller in locating data on the drive, but it also takes up space on the disk. Thus there is a difference between the capacity of an unformatted disk and a formatted one. Drive manufacturers generally advertise the formatted capacity.

The first PC hard disks typically held 17 sectors per track. Today's hard disks can have a much larger number of sectors in a single track. There can be thousands of tracks on a platter, depending on the size of the drive.

Page 52: Storage System Architecture


Platter Geometry and Zoned-Bit Recording

[Diagram: a platter without zones compared to a platter with zones]

Since a platter is made up of concentric tracks, the outer tracks can hold more data than the inner ones because they are physically longer than the inner tracks. However, in older disk drives, the outer tracks had the same number of sectors as the inner tracks, which means that the data density was very low on the outer tracks. This was an inefficient use of the available space.

Zoned-bit recording uses the disk more efficiently. It groups tracks into zones that are based upon their distance from the center of the disk. Each zone is assigned an appropriate number of sectors per track. This means that a zone near the center of the platter has fewer sectors per track than a zone on the outer edge.

In zoned-bit recording:
– outside tracks have more sectors than inside tracks
– zones are numbered, with the outermost zone being Zone 0
– tracks within a given zone have the same number of sectors

Note: The media transfer rate drops as the zones move closer to the center of the platter, meaning that performance is better on the zones created on the outside of the drive. Media transfer rate is covered later in the module.

Page 53: Storage System Architecture


Physical Disk Structures: Cylinders

[Diagram: tracks, cylinders, and sectors]

Tracks and sectors organize data on a single platter. Cylinders help organize data across platters on a drive.

A cylinder is the set of identical tracks on both surfaces of each of the drive’s platters. Often the drive head location is referred to by cylinder number rather than by track number.

Because all of the read-write heads move together, each head is always physically located at the same track number. In other words, one head cannot be on track zero while another is on track 10.


Page 54: Storage System Architecture


Logical Block Addressing

[Diagram: physical CHS addressing (cylinder, head, sector) alongside logical block addressing (Block 0, Block 8 on the lower surface, Block 16, Block 32, Block 48, ...)]

At one time, drives used physical addresses made up of the Cylinder, Head, and Sector number (CHS) to refer to specific locations on the disk. This meant that the host had to be aware of the geometry of each disk that was used.

Logical Block Addressing (LBA) simplifies addressing by using a linear address to access physical blocks of data. The disk controller performs the translation process from LBA to CHS address. The host only needs to know the size of the disk drive (how many blocks).

– Logical blocks are mapped to physical sectors on a 1:1 basis
– Block numbers start at 0 and increment by one until the last block is reached (e.g., 0, 1, 2, 3 … (N-1))
– Block numbering starts at the beginning of a cylinder and continues until the end of that cylinder
– This is the traditional method for accessing peripherals on SCSI, Fibre Channel, and newer ATA disks
As an example, we'll look at a new 500 GB drive. The true capacity of the drive is 465.7 GB, which is in excess of 976,000,000 blocks. Each block will have its own unique address.

In the slide, the drive shows 8 sectors per track, 8 heads, and 4 cylinders. We have a total of 8 x 8 x 4 = 256 blocks. The illustration on the right shows the block numbering, which ranges from 0 to 255.
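Using the slide's illustrative geometry (4 cylinders, 8 heads, 8 sectors per track), a minimal sketch of the translation performed by the disk controller looks like this; it assumes the conventional formula with 1-based sector numbers.

```python
CYLINDERS = 4
HEADS_PER_CYLINDER = 8
SECTORS_PER_TRACK = 8

def chs_to_lba(cylinder, head, sector):
    """Translate a physical (cylinder, head, sector) address to a logical block
    address. Sectors are numbered from 1; cylinders, heads, and blocks from 0."""
    return (cylinder * HEADS_PER_CYLINDER + head) * SECTORS_PER_TRACK + (sector - 1)

def lba_to_chs(lba):
    """The reverse translation, from logical block address back to CHS."""
    cylinder, remainder = divmod(lba, HEADS_PER_CYLINDER * SECTORS_PER_TRACK)
    head, sector_index = divmod(remainder, SECTORS_PER_TRACK)
    return cylinder, head, sector_index + 1

print(chs_to_lba(0, 0, 1))    # 0   -> the first block
print(chs_to_lba(3, 7, 8))    # 255 -> the last of the 8 x 8 x 4 = 256 blocks
print(lba_to_chs(255))        # (3, 7, 8)
```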

Page 55: Storage System Architecture


Drive Partitioning and Concatenation

[Diagram: partitioning divides one physical drive into multiple logical volumes (A, B, C, D); concatenation presents several drives as one logical volume (A)]

Partitioning divides the disk into logical containers (known as volumes), each of which can be used for a particular purpose.

– Partitions are created from groups of contiguous cylinders
– A large physical drive could be partitioned into multiple Logical Volumes (LVs) of smaller capacity
– Because partitions define the disk layout, they are generally created when the hard disk is initially set up on the host
– Partition size impacts disk space utilization
– The host filesystem accesses partitions with no knowledge of the physical structure

Concatenation groups several smaller physical drives and presents them collectively as one large logical drive to the host. This is typically done using the Logical Volume Manager on the host.
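A minimal sketch of the address translation a volume manager performs for a concatenated volume is shown below; the drive sizes are illustrative assumptions, not values from the course.

```python
import itertools

def build_concatenation(disk_sizes_in_blocks):
    """Return a function mapping a logical block on the concatenated volume to
    (disk index, physical block) on the underlying drives."""
    starts = list(itertools.accumulate([0] + disk_sizes_in_blocks[:-1]))
    total_blocks = sum(disk_sizes_in_blocks)

    def to_physical(logical_block):
        if not 0 <= logical_block < total_blocks:
            raise ValueError("logical block out of range")
        for disk_index in reversed(range(len(disk_sizes_in_blocks))):
            if logical_block >= starts[disk_index]:
                return disk_index, logical_block - starts[disk_index]

    return to_physical

# Three small drives of 1000, 2000, and 1500 blocks presented as one 4500-block volume.
to_physical = build_concatenation([1000, 2000, 1500])
print(to_physical(999))     # (0, 999) -> last block of the first drive
print(to_physical(1000))    # (1, 0)   -> first block of the second drive
print(to_physical(3200))    # (2, 200) -> 200 blocks into the third drive
```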

Page 56: Storage System Architecture


Lesson Summary

Key points covered in this lesson:

Physical drives are made up of:
– HDA: platters connected via a spindle, and read/write heads which are positioned by an actuator
– Controller: controls power, communication, positioning, and optimization

Data is structured on a drive using tracks, sectors, and cylinders

The geometry of a disk impacts how data is recorded on a platter

These are the key points covered in this lesson. Please take a moment to review them.

Page 57: Storage System Architecture


Lesson: Disk Drive Performance

Upon completion of this lesson, you will be able to:

Describe the factors that impact the performance of a drive

Describe how drive reliability is measured

The focus of this lesson is on the factors that impact how well a drive works, in particular, the performance and reliability of the drive.

Since a disk drive is a mechanical device, accessing it takes much longer than accessing memory at electronic speeds. The length of time needed to read or write data on the disk depends primarily upon three factors: seek time, rotational delay (also known as latency), and transfer rate.

The objectives for this lesson are shown here. Please take a moment to read them.

Page 58: Storage System Architecture


Disk Drive Performance: Positioning

Seek time is the time for read/write heads to move between tracks

Seek time specifications include:
– Full stroke
– Average
– Track-to-track

Seek times describe the time it takes to position the read/write heads radially across the platter. The following specifications are often published:

Full Stroke – the time it takes to move across the entire width of the disk, from the innermost track to the outermost
Average – the average time it takes to move from one random track to another (normally listed as the time for one-third of a full stroke)
Track-to-Track – the time it takes to move between adjacent tracks

Each of these specifications is measured in milliseconds (ms).

Notes:

Average seek times on modern disks typically are in the range of 3 to 15 ms.

Seek time has more impact on reads of random tracks on the disk rather than on adjacent tracks.

To improve seek time, data is often written only to a subset of the available cylinders (either on the inner or outer tracks), and the drive is treated as though it has a lower capacity than it really has, e.g. a 500 GB drive is set up to use only the first 40 % of the cylinders, and is treated as a 200 GB drive. This is known as short-stroking the drive.

Page 59: Storage System Architecture


Disk Drive Performance: Rotational Speed/Latency

The actuator moves the read/write head over the platter to a particular track, while the platter spins to position a particular sector under the read/write head.

Rotational latency is the time it takes the platter to rotate and position the data under the read/write head.

– Rotational latency is dependent upon the rotation speed of the spindle and is measured in milliseconds (ms)
– The average rotational latency is one-half of the time taken for a full rotation
– Like seek times, rotational latency has more of an impact on reads or writes of random sectors on the disk than on the same operations on adjacent sectors

Since spindle speed contributes to latency, the faster the disk spins, the quicker the correct sector will rotate under the heads—thus leading to a lower latency.

Rotational latency is around 5.5 ms for a 5,400 rpm drive, and around 2.0 ms for a 15,000 rpm drive.
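Those figures follow directly from the spindle speed; as a quick check:

```python
def avg_rotational_latency_ms(rpm):
    """Average rotational latency = half of one full rotation, in milliseconds."""
    ms_per_rotation = 60_000 / rpm
    return ms_per_rotation / 2

print(round(avg_rotational_latency_ms(5_400), 1))    # 5.6 ms (the "around 5.5 ms" above)
print(round(avg_rotational_latency_ms(15_000), 1))   # 2.0 ms
```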

Page 60: Storage System Architecture


Disk Drive Performance: Command Queuing

[Slide diagram: four I/O requests serviced in arrival order (1, 2, 3, 4) without command queuing, versus reordered (1, 3, 2, 4) with command queuing so that requests near each other on the platter are serviced together.]

If commands are processed as they are received, time is wasted if the read/write head passes over data that is needed one or two requests later. To improve drive performance, some drive manufacturers include logic that analyzes where data is stored on the platter relative to the data access requests. Requests are then reordered to make best use of the data’s layout on the disk.

This technique is known as Command Queuing (also known as Multiple Command Reordering, Multiple Command Optimization, Command Queuing and Reordering, Native Command Queuing or Tagged Command Queuing).

In addition to being performed at the physical disk level, command queuing can also be performed by the storage system that uses the disk.
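The reordering itself can be illustrated with a small, purely hypothetical model that services queued requests in the order that minimizes head movement from the current track; real drives and arrays use more sophisticated logic that also accounts for rotational position and fairness. A Python sketch:

def reorder_requests(current_track, requests):
    """requests: list of (request_id, target_track) pairs; returns the service order by id."""
    pending = list(requests)
    order = []
    while pending:
        # pick the pending request closest to the head's current track
        req_id, track = min(pending, key=lambda r: abs(r[1] - current_track))
        order.append(req_id)
        current_track = track
        pending.remove((req_id, track))
    return order

print(reorder_requests(0, [(1, 10), (2, 400), (3, 15), (4, 450)]))
# -> [1, 3, 2, 4]: requests 1 and 3 lie near each other, so they are serviced
#    before the head travels out to requests 2 and 4, as in the slide diagram.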

Disk Drive Performance: Data Transfer Rate

[Slide diagram: the data path runs from the disk platters through the drive's internal buffer and interface to the host HBA; the internal transfer rate is measured between the platters and the buffer, and the external transfer rate between the interface and the HBA.]

The following steps take place when data is read from or written to the drive:

Read
1. Data moves from the disk platters to the heads
2. Data moves from the heads to the drive's internal buffer
3. Data moves from the buffer through the interface to the host HBA

Write
1. Data moves from the HBA to the internal buffer through the drive's interface
2. Data moves from the buffer to the read/write heads
3. Data moves from the heads to the platters

The data transfer rate describes how many MB per second the drive can deliver to the HBA. Because both internal and external factors affect performance, transfer rates are refined into:

Internal transfer rate – the speed of moving data from the disk surface to the read/write heads on a single track of one surface of the disk. This is also known as the burst transfer rate.
− Sustained internal transfer rate takes other factors into account, such as seek times.

External transfer rate – the rate at which data can be moved through the interface to the HBA. The burst transfer rate is generally the advertised speed of the interface (e.g., 133 MB/s for ATA/133).
− Sustained external transfer rates are lower than the interface speed.

Note: Internal transfer rates are almost always lower, sometimes appreciably lower, than the external transfer rate.
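To see how seek time, rotational latency, and transfer rate combine, the rough service time of a single small random I/O can be estimated as seek time + rotational latency + (I/O size / sustained transfer rate). A hedged sketch with purely illustrative numbers:

def random_io_service_time_ms(seek_ms, latency_ms, io_kb, transfer_mb_s):
    """Rough service time for one random I/O: position the heads, wait for the sector, transfer the data."""
    transfer_ms = (io_kb / 1024) / transfer_mb_s * 1000
    return seek_ms + latency_ms + transfer_ms

# Illustrative values: 5 ms average seek, 2 ms latency (15,000 rpm drive),
# an 8 KB I/O at a 40 MB/s sustained internal transfer rate.
print(f"{random_io_service_time_ms(5.0, 2.0, 8, 40):.2f} ms per I/O")
# About 7.2 ms: for small random I/Os the mechanical positioning time dominates,
# which is why transfer rate matters most for large sequential access.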

Drive Reliability: MTBF (Mean Time Between Failure)

Amount of time that one can anticipate a device to work before an incapacitating malfunction occurs
– Based on averages
– Measured in hours

Determined by artificially aging the product

Mean Time Between Failure (MTBF) is the amount of time a device can be expected to operate before an incapacitating malfunction occurs. It is based on averages and therefore provides only estimates. MTBF is measured in hours (e.g., 750,000 hours).

MTBF is based on an aggregate analysis of a huge number of drives, so it does not help to determine how long a given drive will actually last. MTBF is often used along with the service life of the drive, which describes how long you can expect the drive’s components to work before they wear out (e.g., 2 years).

Note: MTBF is a statistical method developed by the U.S. military as a way of estimating maintenance levels required by various devices. It is generally not practical to test a drive before it becomes available for sale (750,000 hours is over 85 years!). Instead, MTBF is tested by artificially aging the drives. This is accomplished by subjecting them to stressful environments such as high temperatures, high humidity, fluctuating voltages, etc.

Lesson Summary

Key points covered in this lesson:

Drive performance is impacted by a number of factors including:
– Seek time
– Rotational latency
– Command queuing
– Data transfer rate

Drive reliability is measured using MTBF

These are the key points covered in this lesson. Please take a moment to review them.

Module Summary

Key points covered in this module:

Physical drives are made up of a number of components
– HDA – houses the platters, spindles, and actuator assemblies (which include the actuator and the read/write heads)
– Controller – controls power, communication, positioning, and optimization

Data is structured on a drive using tracks, sectors, and cylinders

Drive performance is impacted by seek time, rotational latency, command queuing, and data transfer rate

These are the key points covered in this module. Please take a moment to review them.

Check Your Knowledge

Describe the purpose of the actuator, the read/write head, and the controller on a drive.

What is the difference between a track, a sector, and a cylinder?

Why is zoned-bit recording used?

What is the difference between seek time and rotational latency?

What is the difference between internal and external data transfer rates?

What purpose does the MTBF specification serve?

Check your knowledge of this module by taking some time to answer the questions shown on the slide.

RAID Arrays

After completing this module, you will be able to:

Describe what RAID is and the needs it addresses

Describe the concepts upon which RAID is built

Compare and contrast common RAID levels

Recommend the use of the common RAID levels based on performance and availability considerations

In the previous module, we looked at how a disk drive works. Disk drives can be combined into disk arrays to increase capacity.

An individual drive has a certain life expectancy before it fails, as measured by MTBF. Since there can be many drives, potentially hundreds or even thousands, in a disk array, the probability of a drive failure increases significantly. As an example, if the MTBF of a drive is 750,000 hours, and there are 100 drives in the array, then the MTBF of the array becomes 750,000 / 100, or 7,500 hours. RAID (Redundant Array of Independent Disks) was introduced to mitigate this problem.
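The arithmetic in this example generalizes: if drive failures are assumed to be independent, the expected time between drive failures anywhere in the array is the drive MTBF divided by the number of drives. A minimal sketch of that calculation:

def array_mtbf_hours(drive_mtbf_hours, drive_count):
    """Expected time between drive failures in the whole array, assuming independent failures."""
    return drive_mtbf_hours / drive_count

mtbf = array_mtbf_hours(750_000, 100)
print(f"{mtbf:,.0f} hours (~{mtbf / 24 / 365:.1f} years between failures)")
# 7,500 hours: roughly every ten months some drive in the array fails,
# which is why the protection provided by RAID is needed.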

RAID arrays enable you to increase capacity, provide higher availability (in case of a drive failure), and increase performance (through parallel access). In this module, we will look at the concepts that provide a foundation for understanding disk arrays with built-in controllers for performing RAID calculations. Such arrays are commonly referred to as RAID Arrays. We will also learn about a few commonly implemented RAID levels and the type of protection they offer.

RAID - Redundant Array of Independent Disks

RAID (Redundant Array of Independent Disks) combines two or more disk drives into a RAID set, or RAID group. The RAID set appears to the host as a single disk drive. Properly implemented RAID sets provide:

Higher data availability
Improved I/O performance
Streamlined management of storage devices

Historical Note: In 1987, Patterson, Gibson, and Katz at the University of California, Berkeley, published a paper entitled "A Case for Redundant Arrays of Inexpensive Disks (RAID)." This paper described various types of disk arrays, referred to by the acronym RAID. At the time, data was stored largely on large, expensive disk drives (called SLED, or Single Large Expensive Disk). The term inexpensive was used in contrast to the SLED implementation. The term RAID has since been redefined to refer to independent disks, to reflect advances in storage technology.

RAID storage has now grown from an academic concept to an industry standard.

RAID Components

Physical disks inside a RAID array are usually contained in smaller sub-enclosures. These sub-enclosures, or physical arrays, hold a fixed number of physical disks, and may also include other supporting hardware, such as power supplies.

A subset of disks within a RAID array can be grouped to form logical associations called logical arrays, also known as a RAID set or a RAID group. The operating system may see these disk groups as if they were regular disk volumes. Logical arrays facilitate the management of a potentially huge number of disks. Several physical disks can be combined to make large logical volumes.

Generally, the array management software implemented in RAID systems handles:
Management and control of disk aggregations (e.g., volume management)
Translation of I/O requests between the logical disks and the physical disks
Data regeneration if disk failures occur

RAID Levels

0 Striped array with no fault tolerance

1 Disk mirroring

3 Parallel access array with dedicated parity disk

4 Striped array with independent disks and a dedicated parity disk

5 Striped array with independent disks and distributed parity

6 Striped array with independent disks and dual distributed parity

Combinations of levels (e.g., 1+0, 0+1, etc.)

There are some standard RAID configuration levels, each of which has benefits in terms of performance, capacity, data protection, etc.

The discussion centers around the commonly used levels and commonly used combinations of levels.

Data Organization: Strips and Stripes

[Slide diagram: strips on each disk line up across the RAID set to form Stripe 1, Stripe 2, and Stripe 3.]

RAID sets are made up of disks. Within each disk, there are groups of contiguously addressed blocks, called strips. The set of aligned strips that spans across all the disks within the RAID set is called a stripe.

Strip size (also called stripe depth) describes the number of blocks in a strip, and is the maximum amount of data that is written to or read from a single disk in the set before the next disk is accessed (assuming that the accessed data starts at the beginning of the strip).
− All strips in a stripe have the same number of blocks.
− Decreasing the strip size means that data is broken into smaller pieces when spread across the disks.

Stripe size describes the number of data blocks in a stripe.
− To calculate the stripe size, multiply the strip size by the number of data disks.

Stripe width refers to the number of data strips in a stripe (or, put differently, the number of data disks in a stripe).
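These definitions reduce to simple arithmetic. A small sketch (the block size and counts are illustrative only):

def stripe_geometry(strip_blocks, data_disks, block_bytes=512):
    """Return the stripe width and the strip/stripe sizes implied by the definitions above."""
    stripe_blocks = strip_blocks * data_disks           # stripe size = strip size x number of data disks
    return {
        "stripe_width": data_disks,                     # number of data strips (data disks) per stripe
        "strip_kb": strip_blocks * block_bytes / 1024,
        "stripe_kb": stripe_blocks * block_bytes / 1024,
    }

# e.g. 128-block (64 KB) strips across 4 data disks give 256 KB stripes
print(stripe_geometry(strip_blocks=128, data_disks=4))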

RAID 0 – Striped Array with no Fault Tolerance

[Slide diagram: the host writes blocks 0 through 4; the RAID controller distributes them across the disks in the set with no redundant copies.]

RAID 0 stripes the data across the drives in the array without generating redundant data.
Performance – better than JBOD because it uses striping. The I/O rate, called throughput, can be very high when I/O sizes are small. Large I/Os produce high bandwidth (data moved per second) with this RAID type. Performance is further improved when data is striped across multiple controllers with only one drive per controller.
Data Protection – no parity or mirroring means that there is no fault tolerance. Therefore, it is extremely difficult to recover data.
Applications – those that need high bandwidth or high throughput, but where the data is not critical, or can easily be recreated.

Striping improves performance by distributing data across the disks in the array. This use of multiple independent disks allows multiple reads and writes to take place concurrently.

When a large amount of data is written, the first piece is sent to the first drive, the second piece to the second drive, and so on. The pieces are put back together again when the data is read. Striping can occur at the block (or block multiple) level or at the byte level. With software RAID, the stripe size can be specified at the Logical Volume Manager level on the host; with hardware RAID, depending on the vendor, it can be set at the array level.

Notes on striping:
Increasing the number of drives in the array increases performance because more data can be read or written simultaneously. A higher stripe width indicates a higher number of drives and therefore better performance.
Striping is generally handled by the controller and is transparent to the host operating system.
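The round-robin placement described above can be sketched as a simple modulo mapping of strips to disks; real controllers add many refinements, so this is only an illustration:

def raid0_location(block, strip_blocks, disks):
    """Map a logical block to (disk index, block offset on that disk) for a plain RAID 0 layout."""
    strip = block // strip_blocks                        # which strip the block falls in
    disk = strip % disks                                 # strips rotate across the disks
    offset = (strip // disks) * strip_blocks + (block % strip_blocks)
    return disk, offset

# With 4-block strips on 3 disks, logical blocks 0-11 land on disks 0, 1, and 2 in turn:
for blk in range(12):
    print(blk, raid0_location(blk, strip_blocks=4, disks=3))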

RAID 1 – Disk Mirroring

[Slide diagram: each block written by the host (blocks 0 and 1) is written to both disks of the mirrored pair.]

RAID 1 uses mirroring to improve fault tolerance. A RAID 1 group consists of 2 (typically) or more disk modules. Every write to a data disk is also a write to the mirror disk(s). This is transparent to the host. If a disk fails, the disk array controller uses the mirror drive for data recovery and continuous operation. Data on the replaced drive is rebuilt from the mirror drive.

Benefits – high data availability and a high I/O rate (small block size)
Drawbacks – the total number of disks in the array equals two times the number of data (usable) disks. This means that the overhead cost is 100%, while usable storage capacity is 50%.
Performance – improves read performance, but degrades write performance
Data Protection – improved fault tolerance over RAID 0
Disks – at least two disks
Cost – expensive due to the extra capacity required to duplicate data
Maintenance – low complexity
Applications – applications requiring high availability and non-degraded performance in the event of a drive failure
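The 100% overhead and 50% usable-capacity figures can be put alongside parity protection (introduced later in this module) with a simple calculation; a hedged sketch with illustrative disk counts:

def usable_fraction(total_disks, redundancy_disks):
    """Fraction of raw capacity left for data after setting aside redundancy."""
    return (total_disks - redundancy_disks) / total_disks

# RAID 1: half the disks hold mirror copies, so 50% usable (100% overhead).
print(f"RAID 1, 2 disks: {usable_fraction(2, 1):.0%} usable")
# Single-parity RAID (e.g. a 5-disk group): one disk's worth of parity.
print(f"RAID 5, 5 disks: {usable_fraction(5, 1):.0%} usable")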

RAID 0+1 – Striping and Mirroring

[Slide diagram: blocks 0 through 3 are striped across a RAID 0 set, and the entire stripe set is mirrored (RAID 1) to a second set of disks.]

RAID 0+1 is one way of combining the speed of RAID 0 with the redundancy of RAID 1. RAID 0+1 is implemented as a mirrored array whose basic elements are RAID 0 stripes.

Benefits – medium data availability, a high I/O rate (small block size), and the ability to withstand multiple drive failures as long as they occur within the same stripe set
Drawbacks – the total number of disks equals two times the data disks, with overhead cost equaling 100%
Performance – high I/O rates; writes are slower than reads because of mirroring
Data Protection – medium reliability
Disks – an even number of disks (4-disk minimum to allow striping)
Cost – very expensive because of the high overhead
Applications – imaging and general file servers

RAID 0+1 – Striping and Mirroring

[Slide diagram: a single drive failure faults the entire RAID 0 stripe set on that side; processing continues using the mirrored stripe set.]

In the event of a single drive failure, the entire stripe set is faulted. Normal processing can continue with the mirrors. However, rebuilding the failed drive involves copying data from the mirror to the entire stripe set. This results in increased rebuild times compared to a RAID 1+0 solution, and makes RAID 0+1 implementations less common than RAID 1+0.

RAID 1+0 – Mirroring and Striping

[Slide diagram: each block (0 through 3) is written to a mirrored pair of disks (RAID 1), and the mirrored pairs are striped together (RAID 0).]

RAID 1+0 (or RAID 10, RAID 1/0, or RAID A) also combines the speed of RAID 0 with the redundancy of RAID 1, but it is implemented in a different manner than RAID 0+1. RAID 1+0 is a striped array whose individual elements are RAID 1 arrays, that is, mirrors.

Benefits – high data availability, a high I/O rate (small block size), and the ability to withstand multiple drive failures as long as they occur in different mirrored pairs
Drawbacks – the total number of disks equals two times the data disks, with overhead cost equaling 100%
Data Protection – high reliability
Disks – an even number of disks (4-disk minimum to allow striping)
Cost – very expensive because of the high overhead
Performance – high I/O rates achieved using multiple stripe segments; writes are slower than reads because they are mirrored
Applications – databases requiring high I/O rates with random data, and applications requiring maximum data availability

RAID 1+0 – Mirroring and Striping

[Slide diagram: a single drive failure affects only one mirrored pair; its surviving mirror continues to service I/O while the rest of the stripe is unaffected.]

In the event of a drive failure, normal processing can continue with the surviving mirror. Only the data on the failed drive has to be copied over from the mirror for the rebuild, as opposed to rebuilding the entire stripe set in RAID 0+1. This results in faster rebuild times for RAID 1+0 and makes it a more common solution than RAID 0+1.

Note that under normal operating conditions both RAID 0+1 and RAID 1+0 provide the same benefits. These solutions are still aimed at protecting against a single drive failure and not against multiple drive failures.

RAID Redundancy: Parity

[Slide diagram: data blocks 0 through 11 are striped across four data disks (0-4-8, 1-5-9, 2-6-10, 3-7-11), with a dedicated parity disk holding the parity for each stripe.]

Parity is a redundancy check that ensures that the data is protected without using a full set of duplicate drives.

If a single disk in the array fails, the other disks have enough redundant data so that the data from the failed disk can be recovered.
Like striping, parity is generally a function of the RAID controller and is transparent to the host.
Parity information can either be:
− Stored on a separate, dedicated drive (RAID 3)
− Distributed with the data across all the drives in the array (RAID 5)

Parity Calculation

[Slide diagram: four data disks holding the values 5, 3, 4, and 2, and a parity disk holding their sum, 14.]

5 + 3 + 4 + 2 = 14

The middle drive fails:

5 + 3 + ? + 2 = 14
? = 14 - 5 - 3 - 2
? = 4

This example uses arithmetic operations to demonstrate how parity works. It illustrates the concept, but not the actual mechanism.

Think of parity as the sum of the data on the other disks in the RAID set. Each time data is updated, the parity is updated as well, so that it always reflects the current sum of the data on the other disks.
Note: While parity is calculated on a per-stripe basis, the diagram omits this detail for the sake of simplification.
If a disk fails, the value of its data is calculated by using the parity information and the data on the surviving disks. If the parity disk fails, the value of its data is calculated by using the data disks. Parity will only need to be recalculated, and saved, when the failed disk is replaced with a new disk.

In the event of a disk failure, each request for data from the failed disk requires that the data be recalculated before it can be sent to the host. This recalculation is time-consuming, and decreases the performance of the RAID set. Hot spare drives, introduced later, provide a way to minimize the disruption caused by a disk failure.

The actual parity algorithm uses the Boolean exclusive-OR (XOR) operation.
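The same recovery works with the real mechanism: the parity strip is the XOR of the data strips, and XOR-ing the parity with the surviving strips regenerates the missing one. A minimal sketch using the values from the diagram:

from functools import reduce
from operator import xor

def xor_parity(strips):
    """Parity is the bitwise XOR of all data strips (represented here as integers)."""
    return reduce(xor, strips)

data = [5, 3, 4, 2]
parity = xor_parity(data)                  # stored on the parity disk

# The disk holding the value 4 fails; rebuild it from the parity and the survivors.
survivors = [5, 3, 2]
rebuilt = xor_parity(survivors + [parity])
print(rebuilt)                             # 4, the lost value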

RAID 3 – Parallel Transfer with Dedicated Parity Disk

[Slide diagram: the host writes blocks 0 through 3; the controller stripes them across the data disks and writes the generated parity (P 0 1 2 3) to a dedicated parity disk.]

RAID Level 3 stripes data for high performance and uses parity for improved fault tolerance. Data is striped across all but one of the disks in the array. Parity information is stored on the remaining, dedicated drive, so that data can be reconstructed if a drive fails.

RAID 3 always reads and writes complete stripes of data across all the disks. There are no partial writes that update one out of many strips in a stripe.

Benefits – the total number of disks is less than in a mirrored solution (e.g., 1.25 times the data disks for a group of five), and good bandwidth on large data transfers
Drawbacks – poor efficiency in handling small data blocks, which makes it not well suited to transaction processing applications. Data is lost if multiple drives fail within the same RAID 3 group.
Performance – high data read/write transfer rate. Disk failure has a significant impact on throughput. Rebuilds are slow.
Data Protection – uses parity for improved fault tolerance
Striping – byte level to multiple block level, depending on the vendor implementation
Applications – applications where large sequential data accesses are used, such as medical and geographic imaging

RAID 4 – Striping with Dedicated Parity Disk

[Slide diagram: blocks 0 through 7 are striped across four independently accessible data disks; the controller generates parity blocks (P 0 1 2 3 and P 4 5 6 7) and writes them to a dedicated parity disk.]

RAID Level 4 stripes data for high performance and uses parity for improved fault tolerance. Data is striped across all but one of the disks in the array. Parity information is stored on the remaining, dedicated disk so that data can be reconstructed if a drive fails.

The data disks are independently accessible, and multiple reads and writes can occur simultaneously.
Benefits – the total number of disks is less than in a mirrored solution (e.g., 1.25 times the data disks for a group of five), good read throughput, and reasonable write throughput
Drawbacks – the dedicated parity drive can be a bottleneck when handling small data writes. This RAID level is not well suited to transaction processing applications. Data is lost if multiple drives fail within the same RAID 4 group.
Performance – high data read transfer rate; poor to medium write transfer rate. Disk failure has a significant impact on throughput.
Data Protection – uses parity for improved fault tolerance
Striping – usually at the block (or block multiple) level
Applications – general purpose file storage

RAID 4 is much less commonly used than RAID 5, discussed next. The dedicated parity drive is a bottleneck, especially when a disk failure has occurred.

RAID 5 – Independent Disks with Distributed Parity

[Slide diagram: blocks 0 through 7 are striped across the disks, and the generated parity blocks (P 0 1 2 3 and P 4 5 6 7) are distributed across different disks rather than stored on a dedicated parity disk.]

RAID 5 does not read and write data to all disks in parallel like RAID 3. Instead, it performs independent read and write operations. There is no dedicated parity drive; data and parity information is distributed across all drives in the group.

Benefits – the most versatile RAID level. A transfer rate greater than that of a single drive but with a high overall I/O rate. Good for parallel processing (multi-tasking) applications and environments. Cost savings due to the use of parity rather than mirroring.
Drawbacks – slower transfer rate than RAID 3. Small writes are slow, because they require a read-modify-write (RMW) operation: a write to a single block involves two reads (old block and old parity) and two writes (new block and new parity). There is degradation in performance in recovery and reconstruction modes, and data loss if multiple drives within the same group are lost.
Performance – high read data transaction rate, medium write data transaction rate. Low ratio of parity disks to data disks. Good aggregate transfer rate.
Data Protection – a single disk failure puts the volume in degraded mode. Difficult to rebuild (as compared to RAID level 1).
Disks – 5-disk and 9-disk groups are popular. Most implementations allow other RAID set sizes.
Striping – block level, or multiple block level
Applications – file and application servers, database servers, WWW, email, and news servers

Read operations do not involve parity calculations. In the case of a 5-disk RAID 5 group, a maximum of five independent reads can be performed. Because a write operation involves two disks (the disk holding the parity for that stripe and the data disk), a maximum of two independent writes can be performed in this configuration. So a maximum of five independent reads or two independent writes can be performed on a 5-disk RAID 5 group.
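The read-modify-write penalty follows directly from how the parity is updated for a small write: the controller reads the old data and old parity, XORs the old data out and the new data in, then writes both back. A hedged sketch (the integers stand in for whole strips):

def rmw_small_write(old_data, old_parity, new_data):
    """RAID 5 small-write parity update: new parity = old parity XOR old data XOR new data.
    Costs two reads (old data, old parity) and two writes (new data, new parity)."""
    return old_parity ^ old_data ^ new_data

old_data, old_parity, new_data = 0b1010, 0b0110, 0b0001
new_parity = rmw_small_write(old_data, old_parity, new_data)
print(bin(new_parity))   # 0b1101: the parity now reflects the new data
                         # without reading the other strips in the stripe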

RAID 6 – Dual Parity RAID

Two disk failures in a RAID set lead to data unavailability and data loss in single-parity schemes, such as RAID-3, 4, and 5

An increasing number of drives in an array and increasing drive capacities lead to a higher probability of two disks failing in a RAID set

RAID-6 protects against two disk failures by maintaining two parities
– Horizontal parity, which is the same as RAID-5 parity
– Diagonal parity, which is calculated by taking diagonal sets of data blocks from the RAID set members

Even-Odd and Reed-Solomon are two commonly used algorithms for calculating parity in RAID-6

The details of diagonal parity generation and rebuilds are beyond the scope of this foundations course.

RAID Implementations

Hardware (usually a specialized disk controller card)
– Controls all drives attached to it
– Performs all RAID-related functions, including volume management
– Array(s) appear to the host operating system as a regular disk drive
– Dedicated cache to improve performance
– Generally provides some type of administrative software

Software
– Generally runs as part of the operating system
– Volume management performed by the server
– Provides more flexibility for hardware, which can reduce the cost
– Performance is dependent on CPU load
– Has limited functionality

As a broad distinction, hardware RAID is implemented by intelligent storage systems external to the host, or, at minimum, intelligent controllers in the host that offload the RAID management functions from the host.

Software RAID usually describes RAID that is managed by the host, typically implemented via a Logical Volume Manager on the host. The disadvantage of software RAID is that it uses host CPU cycles that would be better utilized to run applications. Software RAID often looks attractive initially because it does not require the purchase of additional hardware. The initial cost savings are soon exceeded by the expense of using a costly server to perform I/O operations that it performs inefficiently at best.

Hot Spares

A hot spare is an idle component (often a drive) in a RAID array that becomes a temporary replacement for a failed component. For example:

The hot spare drive takes the failed drive’s identity in the array.

Data recovery takes place. How this happens is based on the RAID implementation:
If parity was used, data is rebuilt onto the hot spare from the parity and data on the surviving drives.
If mirroring was used, data is rebuilt using the data from the surviving mirror drive.

The failed drive is replaced with a new drive at some time later.

One of the following occurs:
The hot spare permanently replaces the failed drive, meaning that it is no longer a hot spare and a new hot spare must be configured on the system.
When the new drive is added to the system, data from the hot spare is copied to the new drive. The hot spare returns to its idle state, ready to replace the next failed drive.

Note: The hot spare drive needs to be large enough to accommodate the data from the failed drive.

Hot spare replacement can be:
Automatic – when a disk's recoverable error rates exceed a predetermined threshold, the disk subsystem tries to copy data from the failing disk to a spare one. If this task completes before the damaged disk fails, the subsystem switches to the spare and marks the failing disk unusable. (If not, it uses parity or the mirrored disk to recover the data, as appropriate.)
User initiated – the administrator tells the system when to do the rebuild. This gives the administrator control (e.g., rebuild overnight so as not to degrade system performance); however, the system is vulnerable to another failure because the hot spare is now unavailable. Some systems implement multiple hot spares to improve availability.

Hot Swap

Like hot spares, hot swaps enable a system to recover quickly in the event of a failure. With a hot swap, the user can replace the failed hardware (such as a controller) without having to shut down the system.

Module Summary

Key points covered in this module:

What RAID is and the needs it addresses

The concepts upon which RAID is built

Some commonly implemented RAID levels

These are the key points covered in this module. Please take a moment to review them.

Check Your Knowledge

What is a RAID array?

What benefits do RAID arrays provide?

What methods can be used to provide higher data availability in a RAID array?

What is the primary difference between RAID 3 and RAID 5?

What is Read-Modify-Write in RAID 5?

What is the advantage of using RAID 6?

What is a hot spare?

Check your knowledge of this module by taking some time to answer the questions shown on the slide.

Intelligent Storage Systems

After completing this module, you will be able to:

Describe the components of an intelligent storage system

Describe the configuration of a logical disk

Discuss the methods employed to ensure that a host can access a storage volume

Discuss back end volume protection

Discuss front end host configuration

Describe the I/O flow from the back end to the physical disks

At this point, you have learned how disks work and how they can be combined to form RAID arrays. Now we are going to build on those concepts and add intelligence to those arrays, making them even more powerful. Throughout this module we refer to this as an intelligent storage system.

The objectives for this module are shown here. Please take a moment to read them.

Lesson: Intelligent Storage System Overview

After completing this lesson, you will be able to:

List the benefits of intelligent storage systems

Compare and contrast integrated and modular approaches to intelligent storage systems

Describe the I/O flow through the storage system

Describe the logical elements of an intelligent storage system

This module contains two lessons. In this lesson, we take a high level look at the components of a disk storage system as well as two approaches to implementing them: integrated and modular.

The objectives for this lesson are shown here. Please take a moment to read them.

What is an Intelligent Storage System?

Intelligent Storage Systems are RAID arrays that:

Are highly optimized for I/O processing

Have large amounts of cache for improving I/O performance

Have operating environments that provide:
– Intelligence for managing cache
– Array resource allocation
– Host access to array resources
– Connectivity for heterogeneous hosts
– Advanced array-based local and remote replication options

Let’s start by asking the question, “What is an intelligent storage system?” It is a disk storage system which distributes data over several devices, and manages access to that data.

Intelligent storage systems have an operating environment. The operating environment can be viewed as an “operating system” for the array. They also have large amounts of cache. Sophisticated algorithms manage cache to optimize the read/write requests from the hosts. Large capacity drives can be partitioned or “sliced” into smaller units. These smaller units, in turn, can be presented to hosts as individual disk drives. Array management software can also enable multiple hosts to access the array via the same I/O channel. The operating environment ensures that each host can only access the disk resources allocated to it.

Benefits of an Intelligent Storage System

An intelligent storage system provides several benefits over a collection of disks in an array, or even a RAID array:
– Improved performance
– Easier data management
– Improved resource allocation and utilization
– Very high levels of data availability and data protection
– Array-based technologies for local and remote replication
– Optimized backup/restore functionalities
– Improved flexibility and scalability

Intelligent storage systems, a collection of disks in an array, and RAID arrays, all provide increased data storage capacity. However, intelligent storage systems provide more benefits, as listed in the slide.

Monolithic (Integrated) Storage Systems

[Slide diagram: a monolithic storage system with FC ports and port processors at the front end, a large shared cache, and RAID controllers connecting to the disks.]

Intelligent storage systems generally fall into one of two categories, monolithic and modular. Monolithic storage systems are generally aimed at the enterprise level, centralizing data in a powerful system with hundreds of drives. They have the following characteristics:

Large storage capacity
Large amounts of cache to service host I/Os efficiently and optimally
Redundant components for improved data protection and availability
Many built-in features to make them more robust and fault tolerant
Usually connect to mainframe computers or very powerful open systems hosts
Multiple front end ports to provide connectivity to multiple servers
Multiple back end Fibre Channel or SCSI RAID controllers to manage disk processing

This system is contained within a single frame or interconnected frames (for expansion) and can scale to support increases in connectivity, performance, and capacity as required. Monolithic storage systems can handle large amounts of concurrent I/Os from numerous servers and applications. They are quite expensive compared to modular storage systems (discussed in the next slide). Many of their features and functionality might be required only for mission critical applications in large enterprises.

Note: Monolithic arrays are sometimes called integrated arrays, enterprise arrays, or cache centric arrays.

Modular Storage Systems

[Slide diagram: a modular storage system mounted in a rack with servers, FC switches, disk modules, and a control module with disks; the control module contains two controllers (A and B), each with its own host interfaces, cache, and RAID processing.]

Modular storage systems provide storage to a smaller number of (typically) Windows or Unix servers than larger integrated storage systems. Modular storage systems are typically designed with two controllers, each of which contains host interfaces, cache, RAID processors, and disk drive interfaces. They have the following characteristics:

Smaller total storage capacity and less global cache than monolithic arrays
Fewer front end ports for connection to servers
Performance can degrade as the number of connected servers increases
Limited redundancy
Fewer options for array-based local and remote replication

Note: Modular storage systems are sometimes called midrange or departmental storage systems.

It should also be noted that the distinction between monolithic and modular arrays is becoming increasingly blurred. Traditionally, monolithic arrays have been associated with large enterprises and modular arrays with small/medium businesses. With proper classification of application requirements (such as performance, availability, scalability), modular arrays can now be found in several enterprises, providing optimal storage solutions at a lower cost (than monolithic arrays).

Components of an Intelligent Storage System

[Slide diagram: host connectivity into the front end of an intelligent storage system, followed by cache, the back end, and the physical disks.]

At a high level, the components of an intelligent storage system are:
Front end
Cache
Back end
Physical disks

Intelligent Storage System: Front End

The front end controller receives and processes I/O requests from the host. Hosts connect to the storage system via ports on the front end controller.

Ports are the external interfaces for connectivity to the host. Each storage port has processing logic responsible for executing the appropriate transport protocol for storage connections. For example, it could use SCSI, Fibre Channel, or iSCSI.
Behind the storage ports are controllers which communicate with the cache and back end to provide data access.

The number of front-end ports on a modular storage system generally ranges from 1-8; 4 is typical. On a large monolithic array, port counts as high as 64 or 128 are common.

Front End Command Queuing

[Slide diagram: four requests serviced by the front end in arrival order (1, 2, 3, 4) without command queuing, versus reordered (1, 3, 2, 4) with command queuing.]

As seen earlier, command queuing processes multiple concurrent commands based on the organization of the data on disk, regardless of the order in which the commands were received.

The command queuing software reorders commands so as to make the execution more efficient, and assigns each command a tag. This tag identifies when the command will be executed, just as the number you take at the deli determines when you will be served.

Some disk drives, particularly SCSI and Fibre Channel disks, are intelligent enough to manage their own command queuing. Intelligent storage systems can make use of this native disk intelligence, and may supplement it with queuing performed by the controller.

There are several command queuing algorithms that can be used. Here are some of the common ones:
First In, First Out – commands are executed in the order in which they arrive. This is identical to having no queuing, and is therefore inefficient in terms of performance.
Seek Time Optimization – faster than First In, First Out. However, two requests could be on cylinders that are very close to each other, but in very different places within the track. Meanwhile, there might be a third sector that is a few cylinders further away but much closer overall to the location of the first request. Optimizing seek times only, without regard for rotational latency, will not normally produce the best results.
Access Time Optimization – combines seek time optimization with an analysis of rotational latency for optimal performance.

Intelligent Storage System: Cache

Cache improves system performance by isolating the hosts from the mechanical delays associated with physical disks. You have already seen that accessing data from a physical disk usually takes several milliseconds, because of seek times and rotational latency; accessing data from high speed memory typically takes less than a millisecond. The performance of reads as well as writes may be improved by the use of cache. Cache is discussed in more detail in the next lesson.

Intelligent Storage System: Back End

The back end controls the data transfers between cache and the physical disks. Physical disks are connected to ports on the back end.

The back end provides the communication with the disks for read and write operations. The controllers on the back end:

Manage the transfer of data between the I/O bus and the disks in the storage system
Handle addressing for the device, translating logical blocks into physical locations on the disk
Provide additional, but limited, temporary storage for data
Provide error detection and correction, often in conjunction with similar features on the disks

To provide maximum data protection and availability, dual controllers provide an alternative path to physical disks, in case of a controller or a port failure. This reliability is enhanced if the disks used are dual-ported; each disk port can connect to a separate controller. Having multiple controllers also facilitates load balancing. Having more than one port on each controller provides additional protection in the event of port failure. Typically, disks can be accessed via ports on controllers of two different back ends.

Intelligent Storage System: Physical Disks

Physical disks are where data is stored. Drives are connected to the controller with either SCSI (SCSI interface and copper cable) or Fibre Channel (optical or copper) cables. When a storage system is used in environments where performance is not critical, ATA drives may be used. The connection to the drives will then be made via parallel ATA (PATA) or serial ATA (SATA) copper cables. Some storage systems allow a mixture of SCSI or Fibre Channel drives and ATA drives. The higher performing drives are used for application data storage, while the slower ATA drives are used for backup and archiving.

What the Host Sees – Physical Drive Partitioning

[Slide diagram: physical disks inside the intelligent storage system are partitioned into LUN 0, LUN 1, and LUN 2, which are presented to the hosts.]

Since intelligent storage systems have multiple disk drives, they use the disks in various ways to provide optimal performance and capacity. For example:

A large physical drive could be subdivided into multiple virtual disks of smaller capacity. This is similar to drive partitioning discussed in Section 2.
Several physical drives can be combined and presented as one large virtual drive. This is similar to drive concatenation discussed in Section 2.
Typically, physical drives are grouped into RAID sets or RAID groups. LUNs with the desired level of RAID protection are then created from these RAID sets and presented to the hosts.

The mapping of the LUNs to their physical location on the drives is managed by the controller.

What the Host Sees – RAID Sets and LUNs

[Slide diagram: a 5-disk RAID set partitioned into LUN 0 and LUN 1; a portion of each LUN resides on every disk in the set, and the LUNs are presented to two hosts.]

In this example, a RAID set consisting of 5 disks has been sliced, or partitioned, into several LUNs. LUNs 0 and 1 are shown. Note how a portion of each LUN resides on each disk in the RAID set.

Logical Device Names

[Slide diagram: a physical disk divided into LUN 0, LUN 1, and LUN 2 and presented to two hosts; the Unix host's volume manager addresses them as /dev/rdsk/c1t1d0 and /dev/rdsk/c1t1d1, while the Windows host sees \\.\PhysicalDrive0.]

This example shows a single physical disk divided into 3 LUNs: LUN 0, 1 and 2. The LUNs are presented separately to the host or hosts. A host will see a LUN as if it were a single disk device. The host is not aware that this LUN is only a part of a larger physical drive.

The host assigns logical device names to the LUNs; the naming conventions vary by platform. Examples are shown for both Unix and Windows addressing.

Lesson Summary

Key points covered in this lesson:

An intelligent disk storage system:
– Is highly optimized for I/O processing
– Has an operating environment which, among other things, manages cache, controls resource allocation, and provides advanced local and remote replication capabilities
– Has a front end, cache, a back end, and physical disks
– The physical disks can be partitioned into LUNs or grouped into RAID sets, and presented to the hosts

Please take a few moments to review the key points covered in this lesson.

Lesson: Cache – A Closer Look

After completing this lesson, you will be able to:

Describe the benefit of cache in intelligent storage systems

Describe how cache is structured

Describe cache hits and misses

Describe algorithms to manage cache

We already mentioned that cache plays a key role in an intelligent storage system. At this point, let’s take a closer look at what cache is and how it works.

What is Cache in a Storage System?

A memory space used by an intelligent storage system to reduce the time required to service I/O requests from the host

[Slide diagram: read requests are serviced from cache when possible; write requests are held in cache and acknowledged to the host before being committed to disk.]

Physical disks are the slowest components of an intelligent storage system. If the disk has to be accessed for every I/O operation from the host, response times are very high. Cache helps in reducing the I/O response times. Cache can improve I/O response times in the following two ways:

Read cache holds data that is staged into it from the physical disks. As discussed later, data can be staged into cache ahead of time upon detection of read access patterns from hosts.
Write cache holds data written by a host to the array until it can be committed to disk. Holding writes in cache and acknowledging them immediately to the host, prior to committing them to disk, isolates the host from the inherent mechanical delays of the disk (such as rotational and seek latencies). Other benefits of write caching are discussed later in this lesson.

Cache is volatile: loss of power leads to loss of any data resident in cache that has not yet been committed to disk. Storage system vendors solve this problem in various ways. The memory may be powered by a battery until AC power is restored, or battery power may be used to write the content of cache to disk; in the event of an extended power failure, the latter is the better option. Intelligent storage systems can have upwards of 256 GB of cache and hundreds of physical disks, so there could potentially be a large amount of data to be committed to numerous disks. In this case, the batteries may not provide power for a sufficient amount of time to write each piece of data to the appropriate disk. Some vendors therefore use a dedicated set of physical disks to “dump” the content of cache during a power failure. This is usually referred to as vaulting, and the dedicated disks are called vault drives. When power is restored, data from these disks is read and then written to the correct disks.

How Cache is Structured

The amount of user data that the cache can hold is based on the cache size and design. Cache normally consists of two areas:

Data store – the part of the cache that holds the data.
Tag RAM – the part of the cache that tracks the location of the data in the data store. Entries in this area indicate where the data is found in memory, and also where the data belongs on disk. Additional information found here will include a ‘dirty bit’ – a flag that indicates that data in cache has not yet been committed to disk. There may also be time-based information such as the time of last access. This information will be used to determine which cached information has not been accessed for a long period of time, and may be discarded.
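As a rough mental model (not any vendor's actual layout), the tag RAM can be pictured as a lookup table keyed by the data's home location on disk, pointing at a data store slot and carrying the dirty bit and last access time. A minimal Python sketch:

import time
from dataclasses import dataclass, field

@dataclass
class TagEntry:
    """One hypothetical tag RAM entry: where the data sits in the data store and where it belongs on disk."""
    cache_slot: int                    # location in the data store
    disk_block: int                    # where the data belongs on disk
    dirty: bool = False                # True if not yet committed (de-staged) to disk
    last_access: float = field(default_factory=time.time)

tag_ram = {}                           # disk block number -> TagEntry

def record_write(disk_block, cache_slot):
    """Record a host write held in cache; it stays dirty until flushed to disk."""
    tag_ram[disk_block] = TagEntry(cache_slot, disk_block, dirty=True)

record_write(disk_block=1024, cache_slot=7)
print(tag_ram[1024])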

Configuration and implementation of cache varies between vendors. In general, these are the options:
A reserved set of memory addresses for reads and another reserved set of memory addresses for writes. This implementation is known as dedicated cache. Cache management, such as tracking the addresses currently in use, those that are available, and the addresses whose content has to be committed to disk, can become quite complex in this implementation.
In a global cache implementation, both reads and writes can use any of the available memory addresses. Cache management is more efficient in this implementation, as only one global set of addresses has to be managed.
− Some global cache implementations allow the users to specify the percentage of cache that has to be available for reads and the percentage of cache that has to be available for writes. This implementation is common in modular storage arrays.
− In other global cache implementations, the ratio of cache available for reads vs. writes might be fixed, or the array operating environment can dynamically adjust this ratio based on the current workload. These implementations are typically found in integrated storage arrays.

In integrated arrays, all the front end and back end directors have access to all regions of the cache. In modular arrays, each controller (typically two) has access to its own cache on-board. A fault in memory, for example failure of a memory chip, would lead to loss of any uncommitted data held in it. Vendors use different approaches to mitigate this risk:

Pro-actively “scrub” all regions of memory. Faults can be detected ahead of time, and the faulty region can be isolated or fenced, and taken out of use. This is similar to bad block relocation on physical disks.
Mirror all writes within cache. Similar to RAID 1 mirroring of disks, each write can be held in two different memory addresses, well separated from each other; each write would be placed on two independent memory boards, for example. In the event of a fault, the write data is still safe in the mirrored location and can be committed (de-staged) to disk. Since reads are staged from the disk to cache, if there is a fault, an I/O error could be returned to the host, and the data can be staged back into a different location in cache to complete the read request. The read service time would be elongated; however, there is no risk of lost data. As only writes are mirrored, this method leads to better utilization of the available cache for the data store.
A third approach is to mirror all reads and all writes in cache. In this implementation, when data is read from the disk to be staged into cache, it is written to two different locations. Likewise, writes from hosts are held in two different locations. This effectively reduces the amount of usable cache by half. As reads and writes are treated on an equal footing, the management overhead is less than that of mirroring writes alone.

Either of the two mirroring approaches introduces the problem of cache coherency. Cache coherency means that the data at the two different cache addresses is identical at all times. It is the responsibility of the array operating environment to ensure coherency.
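The write-mirroring idea can be sketched as below; the two dictionaries standing in for independent memory boards, and the function names, are assumptions made for the illustration only.

```python
# Two independent memory boards, modeled as dictionaries keyed by disk address.
BOARD_A: dict = {}
BOARD_B: dict = {}

def mirrored_write(disk_address: int, data: bytes) -> None:
    """Hold each write in two well-separated cache locations (coherency: keep both identical)."""
    BOARD_A[disk_address] = data
    BOARD_B[disk_address] = data

def destage(disk_address: int, write_to_disk) -> None:
    """Commit the write to disk using whichever copy survived a board fault."""
    data = BOARD_A.get(disk_address)
    if data is None:
        data = BOARD_B.get(disk_address)
    if data is None:
        raise KeyError("no cached copy left to de-stage")
    write_to_disk(disk_address, data)      # e.g. a callback into the back end
    BOARD_A.pop(disk_address, None)
    BOARD_B.pop(disk_address, None)
```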

Page 109: Storage System Architecture


Read Cache ‘Hits’ and ‘Misses’

[Slide diagram: a host read request is checked against the cache.]

Data found in cache = ‘Hit’

No data found = ‘Miss’

When a host issues a read request, the front end controller accesses the Tag RAM to determine whether the required data is already available in cache.

If the requested data is found in the cache, it is known as a cache hit. The data is sent directly to the host, with no disk operation required. This provides fast response times.

If the data is not found in cache, the operation is known as a cache miss. When there is a cache miss, the data must be read from disk: the back-end controller accesses the appropriate disk and retrieves the requested data. The data is typically placed in cache and then sent to the host.

The read cache hit ratio (or hit rate), usually expressed as a percentage, describes how well the read cache is performing. To determine the hit ratio, divide the number of read cache hits by the total number of read requests.

Cache misses lengthen I/O response times. The response time depends on factors such as rotational latency and seek time, as discussed earlier.

A read cache hit can take about a millisecond, while a read cache miss can take many times longer. Remember that average disk access times for reads are often in the 10 ms range.
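The hit-ratio arithmetic can be checked in a couple of lines; the 1 ms hit and 10 ms miss figures below are the rough numbers quoted above, not measured values.

```python
def read_hit_ratio(hits: int, total_reads: int) -> float:
    """Read cache hit ratio = read cache hits / total read requests."""
    return hits / total_reads

def avg_read_response_ms(hit_ratio: float, hit_ms: float = 1.0, miss_ms: float = 10.0) -> float:
    """Weighted average read response time in milliseconds."""
    return hit_ratio * hit_ms + (1.0 - hit_ratio) * miss_ms

# Example: 800 hits out of 1000 reads gives an 80% hit ratio and ~2.8 ms average response.
ratio = read_hit_ratio(800, 1000)
print(f"hit ratio = {ratio:.0%}, average response = {avg_read_response_ms(ratio):.1f} ms")
```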

Page 110: Storage System Architecture


Algorithms Used to Manage Cache

Least Recently Used (LRU)
– Discards least recently used data

Most Recently Used (MRU)
– Discards most recently used data

Read Ahead (pre-fetch)
– Monitors read requests from hosts to detect sequential access
– If sequential access is detected, then data is read from the disk into cache before it is requested by the host

[Slide diagram: LRU queue, from New Data to Oldest Data.]

Cache is a finite resource. Even though intelligent storage systems can have hundreds of GB of cache, when all cache addresses are used up for data, some addresses have to be freed up to accommodate new data. Waiting until a cache-full condition occurs to free up addresses is inefficient and leads to performance degradation. The array operating environment should proactively maintain a set of free addresses and/or a list of addresses that can potentially be freed up when required. Algorithms used for cache management are:

Least Recently Used (LRU) – access to data in cache is monitored continuously, and addresses that have not been accessed in a “long time” can be freed up immediately or marked as candidates for re-use (a minimal sketch follows below). This algorithm assumes that data not accessed in a while will not be requested by the host. The length of time that an address must be inactive before being freed up is implementation-dependent. If an address contains write data not yet committed to disk, that data is written to disk before the address is re-used.

Most Recently Used (MRU) – the converse of LRU. Addresses that have been accessed most recently are freed up or marked as potential candidates for re-use. This algorithm assumes that data accessed in the immediate past may not be required for a while.

Read Ahead (pre-fetch) – if the read requests are sequential, i.e. a contiguous set of disk blocks, several more blocks not yet requested by the host can be read from disk and placed in cache. When the host subsequently requests these blocks, the reads will be cache hits. In general, there is an upper limit to the amount of data that is pre-fetched. The percentage of pre-fetched data that is actually used is also monitored: a high percentage implies that the algorithm is correctly predicting the sequential access pattern, while a low percentage indicates that effort is being wasted on pre-fetching and that the access pattern from the host is not truly sequential.

Some implementations allow for data to be “pinned” in cache permanently. The pinned addresses will not participate in the LRU or the MRU considerations. Note that the slide shows a depiction of the LRU.
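Here is the minimal LRU sketch referred to above, built on Python's OrderedDict and honoring "pinned" addresses; it illustrates the idea only and is not how any particular array implements cache management.

```python
from collections import OrderedDict

class LRUCache:
    """Toy LRU cache: the least recently used, unpinned address is freed first."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = OrderedDict()   # ordered oldest -> newest access
        self.pinned = set()            # addresses excluded from eviction

    def access(self, address, data=None):
        """Read or insert an address and mark it as most recently used."""
        if address in self.entries:
            self.entries.move_to_end(address)
            if data is not None:
                self.entries[address] = data
            return self.entries[address]
        if len(self.entries) >= self.capacity:
            self._evict_one()
        self.entries[address] = data
        return data

    def _evict_one(self):
        """Free the least recently used address that is not pinned."""
        for address in self.entries:           # iterates from oldest to newest
            if address not in self.pinned:
                # A real array would de-stage dirty write data to disk before re-use.
                del self.entries[address]
                return
        raise RuntimeError("cache is full and every entry is pinned")
```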

Page 111: Storage System Architecture


Write Algorithms

[Slide diagram: a write request passes through cache; write-through acknowledges after the disk write, write-back acknowledges as soon as the data is in cache.]

Write-through Cache

Write-back

Write-through cache – data is placed in cache, immediately written to disk, and an acknowledgement is sent to the host. Because data is committed to disk as it arrives, the risk of data loss is low. Write response times are longer because of the mechanical delays of the disk.

Write-back cache – data is placed in cache and immediately acknowledged to the host. At a later time, data from several writes is committed (de-staged) to disk. Uncommitted data is exposed to the risk of loss in the event of failures. Write response times are much faster, as the write operations are isolated from the mechanical delays of the disk.

Cache can also be bypassed under certain conditions, such as very large write I/O sizes. In this case, writes are sent directly to disk.
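A hedged sketch of the two write policies follows; the dictionaries standing in for cache and disk, and the returned "ack" string modeling the acknowledgement, are simplifications made for illustration.

```python
cache: dict = {}
disk: dict = {}
dirty: set = set()   # addresses holding uncommitted write-back data

def write_through(address: int, data: bytes) -> str:
    """Place data in cache, write it to disk immediately, then acknowledge the host."""
    cache[address] = data
    disk[address] = data            # the mechanical delay happens before the ack
    return "ack"

def write_back(address: int, data: bytes) -> str:
    """Place data in cache and acknowledge at once; commit (de-stage) to disk later."""
    cache[address] = data
    dirty.add(address)              # at risk until de-staged
    return "ack"

def destage() -> None:
    """Later, commit the buffered writes from several I/Os to disk in one pass."""
    for address in sorted(dirty):
        disk[address] = cache[address]
    dirty.clear()
```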

Page 112: Storage System Architecture


Write Cache: Performance

Manage peak I/O request “bursts” through flushing
– Least-recently used pages are flushed from cache to the drives

For maximum performance:
– Provide headroom in write cache for I/O bursts

Coalesce small host writes into larger disk writes
– Improves sequentiality at the disk

Some of the things that improve performance include:

Managing peak I/O requests by absorbing large groups of writes, called bursts, without becoming bottlenecked by the speed of a physical disk. This is known as burst smoothing.

Merging several small writes to the same area into a single larger disk operation (coalescing), as sketched below.

The algorithms that manage cache should adapt to changing data access patterns. The actual algorithms used are vendor-specific.
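The coalescing of small host writes described above can be illustrated with a short sketch that merges adjacent block ranges into larger, more sequential disk writes; the tuple format and block numbers are assumptions made for the example.

```python
def coalesce_writes(writes):
    """Merge small host writes with adjacent or overlapping block ranges into larger disk writes.

    `writes` is a list of (start_block, block_count) tuples; the result is a shorter list of
    contiguous ranges, which improves sequentiality at the disk.
    """
    merged = []
    for start, count in sorted(writes):
        if merged and start <= merged[-1][0] + merged[-1][1]:   # touches or overlaps the previous range
            prev_start, prev_count = merged[-1]
            merged[-1] = (prev_start, max(prev_count, start + count - prev_start))
        else:
            merged.append((start, count))
    return merged

# Example: four small writes become two larger disk writes.
print(coalesce_writes([(100, 8), (108, 8), (116, 8), (500, 8)]))   # [(100, 24), (500, 8)]
```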

Page 113: Storage System Architecture


Lesson Summary

Key points covered in this lesson:

Cache is a memory space used by an intelligent storage system to reduce the time required to service I/O requests from the host

Cache can speed up both read and write operations

Algorithms to manage cache include:
– Least Recently Used (LRU)
– Most Recently Used (MRU)
– Read Ahead (pre-fetch)

Cache write algorithms include:
– Write-through
– Write-back

These are the key points covered in this lesson. Please take a moment to review them.

Page 114: Storage System Architecture


Module Summary

Key points covered in this module:

Intelligent Storage Systems are RAID Arrays that are highly optimized for I/O processing

Monolithic storage systems are generally aimed at the enterprise level, centralizing data in a powerful system with hundreds of drives

Modular storage systems provide storage to a smaller number of (typically) Windows or Unix servers than larger integrated storage systems

Cache in intelligent storage systems accelerates response times for host I/O requests

These are the key points covered in this module. Please take a moment to review them.

Page 115: Storage System Architecture


Check Your Knowledge

What are the parts of an Intelligent Storage System?

What are the differences between a monolithic and a modular array?

What is the difference between a read cache hit and a read cache miss?

What is the difference between Least Recently Used and Most Recently Used algorithms?

What is the difference between Write-through and Write-back cache?

Check your knowledge of this module by taking some time to answer the questions shown on the slide.

Page 116: Storage System Architecture


Apply Your Knowledge

Upon completion of this case study, you will be able to:

Describe the basic architecture of the EMC CLARiiON modular storage array

Describe the basic architecture of the EMC Symmetrix integrated storage array

At this point, we will apply what you learned in this module to some real world examples. In this case, we look at the architecture of the EMC CLARiiON and EMC Symmetrix storage arrays.

Page 117: Storage System Architecture


CLARiiON CX3-80 Architecture

[Slide diagram: two UltraScale Storage Processors, each with CPUs and mirrored cache, connected by the CLARiiON Messaging Interface (CMI) over a multi-lane PCI-Express bridge link; 1/2/4 Gb/s Fibre Channel front end; 2/4 Gb/s Fibre Channel back end with 4 Gb/s Link Control Cards (LCCs); redundant power supplies, fans, and Standby Power Supplies (SPS); up to 480 drives max per storage system (CX3-80).]

The CLARiiON architecture includes fully redundant, hot swappable components—meaning the system can survive the loss of a fan or a power supply, and the failed component can be replaced without powering down the system.

The Standby Power Supplies (SPSs) maintain power to the cache long enough to allow its content to be copied to a dedicated disk area (called the vault) if a power failure should occur.

Storage Processors communicate with each other over the CLARiiON Messaging Interface (CMI) channels, which transport commands, status information, and data for write cache mirroring between the Storage Processors. CMI is used for peer-to-peer communications in the SAN space and may be used for I/O expansion in the NAS space. The CX3-80 uses PCI-Express as the high-speed CMI path. The PCI Express architecture delivers advanced I/O technology with high bandwidth per pin, superior routing characteristics, and improved reliability.

When more capacity is required, additional disk array enclosures containing disk modules can easily be added. Link Control Cards (LCCs) connect shelves of disks.

Page 118: Storage System Architecture


Assigning CLARiiON LUNs to Hosts

CLARiiON disks are grouped into RAID Groups
– Disks from any enclosure may be used in a RAID Group
– All disks in a RAID Group must be either Fibre Channel or ATA
– A RAID Group is the ‘RAID set’ discussed earlier
– A RAID Group may be a single disk, or RAID Level 0, 1, 1/0, 3 or 5

The RAID Group is then partitioned into LUNs
– All LUNs in a RAID Group will be the same RAID Level

The LUNs are then made accessible to hosts
– CLARiiON-resident software ensures that LUNs are seen only by the hosts that own them

Making LUNs available to a host is a 3-step process:

1. Create a RAID Group

Choose which physical disks should be used for the RAID Group and assign those disks to the group. Each physical disk may be part of one RAID Group only.

2. Create LUNs on that RAID Group

LUNs may be created (Note: The CLARiiON term is ‘bound’) on that RAID Group. The first LUN that is bound has a RAID Level selected by the user; all subsequent LUNs must be of the same RAID Level.

3. Assign those LUNs to hosts

When LUNs have been bound, they are assigned to hosts. Normal host procedures, such as partitioning, formatting, and labeling, are then performed to make the LUN usable. The CLARiiON software that controls host access to LUNs, a process known as LUN masking, is called Access Logix.
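The three-step flow above can be summarized in a small illustrative model; this is not Navisphere or Access Logix code, and every class, field, and name below is invented purely for the sketch.

```python
from dataclasses import dataclass, field

@dataclass
class RAIDGroup:
    """Step 1: a RAID Group built from physical disks; all LUNs bound on it share one RAID level."""
    disks: list
    raid_level: str = ""
    luns: list = field(default_factory=list)

    def bind_lun(self, name: str, size_gb: int, raid_level: str) -> dict:
        """Step 2: bind a LUN; the first bind fixes the RAID level for the whole group."""
        if not self.raid_level:
            self.raid_level = raid_level
        elif raid_level != self.raid_level:
            raise ValueError("all LUNs in a RAID Group must use the same RAID level")
        lun = {"name": name, "size_gb": size_gb, "raid_level": self.raid_level, "hosts": set()}
        self.luns.append(lun)
        return lun

def assign_lun_to_host(lun: dict, host: str) -> None:
    """Step 3: LUN masking - only hosts added here are allowed to see the LUN."""
    lun["hosts"].add(host)

# Usage: build a group from five disks, bind a RAID 5 LUN, and mask it to one host.
group = RAIDGroup(disks=["disk_0", "disk_1", "disk_2", "disk_3", "disk_4"])
lun = group.bind_lun("accounting_lun", size_gb=200, raid_level="RAID 5")
assign_lun_to_host(lun, "host_a")
```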

Page 119: Storage System Architecture


EMC Symmetrix DMX Array

Direct Matrix Interconnect

Dynamic Global Memory

Enginuity Operating Environment

Processing Power

Flexible Back-End Configurations

Fault-tolerant Design

The Symmetrix DMX series arrays deliver the highest levels of performance and throughput for high-end storage. They incorporate the following features:

Direct Matrix Interconnect
− Up to 128 direct paths from directors and memory
− Up to 128 GB/s data bandwidth; up to 6.4 GB/s message bandwidth

Dynamic Global Memory
− Up to 512 GB Global Memory
− Intelligent Adaptive Pre-fetch
− Tag-based cache algorithms

Enginuity Operating Environment
− Foundation for powerful storage-based functionality
− Continuous availability and advanced data protection
− Performance optimization and self-tuning
− Advanced management
− Integrated SMI-S compliance

Advanced processing power
− Up to 130 PowerPC Processors
− Four or eight processors per director

High-performance back end
− Up to 64 2 Gb/s Fibre Channel paths (12.8 GB/s maximum bandwidth)
− RAID 0, 1, 1+0, 5
− 73, 146, and 300 GB 10,000 rpm disks; 73 and 146 GB 15,000 rpm disks; 500 GB 7,200 rpm disks

A fully fault-tolerant design
− Nondisruptive upgrades and operations
− Full component-level redundancy with hot-swappable replacements
− Support: dual-ported disks and global-disk hot spares
− Redundant power supplies and integrated battery backups
− Remote support and proactive call-home capabilities

Page 120: Storage System Architecture


Symmetrix DMX Series Direct Matrix Architecture

This shows the logical representation of the Symmetrix DMX architecture. The Front-end (host connectivity directors and ports), Cache (Memory) and the Back-end (directors/ports which connect to the physical disks) are shown.

Front-end: Hosts connect to the DMX via front-end ports (shown as ‘Host Attach’) on front-end directors. The DMX supports ESCON, FICON, Fibre Channel, and iSCSI front-end connectivity.

Back-end: The disk director ports (back end) are connected to Disk Array Enclosures. The DMX back end employs an arbitrated loop design and dual-ported disk drives. I/Os to the physical disks are handled by the back end.

Cache: All front-end I/Os (reads and writes) to the Symmetrix have to pass through the cache; this is unlike some arrays, which allow I/Os to bypass cache altogether. Let us take a look at how the Symmetrix handles front-end read and write operations:

Read: A read is issued by a server. The Symmetrix looks for the data in the cache. If the data is in cache, it is read from cache and sent to the server via the front-end port – this is a read hit. If the data is not in cache, the Symmetrix goes to the physical disks on the back end, fetches the data into cache, and then sends the data from cache to the requesting server – this is a read miss.

Write: A write is issued by a server. The write is received in cache and a write-complete is immediately issued to the server. Data is de-staged from the cache to the back end at a later time.

Enhanced global memory technology supports multiple regions and sixteen connections on each global memory director, one to each director. Each director slot port is hard-wired point-to-point to one port on each global memory director board. If a director is removed from the system, the usable bandwidth is not reduced. If a memory board is removed, the usable bandwidth drops.

Page 121: Storage System Architecture


Symmetrix DMX: Dual-ported Disk and Redundant Directors

[Slide diagram: drives connected between paired directors, Disk Director 1 through Disk Director 16; each drive has a primary (P) and a secondary (S) connection. P = Primary Connection to Drive; S = Secondary Connection for Redundancy.]

The Symmetrix DMX back end employs an arbitrated loop design and dual-ported disk drives. Each drive connects to two paired disk directors through separate Fibre Channel loops. Port Bypass Cards prevent a director failure or replacement from affecting the other drives on the loop. Directors have four primary loops for normal drive communication and four secondary loops to provide an alternate path if the other director fails.
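As a simple illustration of the primary/secondary connections described above (not Enginuity logic), the sketch below picks a loop to reach a dual-ported drive, falling back to the secondary path when the primary disk director is unavailable.

```python
def choose_path(primary_director_ok: bool, secondary_director_ok: bool) -> str:
    """Use the primary (P) connection normally; fail over to the secondary (S) connection."""
    if primary_director_ok:
        return "primary loop via paired disk director (P)"
    if secondary_director_ok:
        return "secondary loop via paired disk director (S)"
    raise RuntimeError("drive unreachable: both paired disk directors are unavailable")

# Example: the primary director has failed, so I/O moves to the secondary connection.
print(choose_path(primary_director_ok=False, secondary_director_ok=True))
```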

Page 122: Storage System Architecture


Configuring Symmetrix Logical Volumes (SLV)

Initial configuration of Symmetrix Logical Volumes is done via the Symmetrix Service Processor and the SymmWin interface/application
– A configuration file (IMPL.BIN) is created and loaded onto the array

Subsequent configuration changes can be performed online using EMC ControlCenter (GUI) or by using Solutions Enabler (CLI)

[Slide diagram: physical disks and the Symmetrix Service Processor running the SymmWin application.]

All Symmetrix arrays have a Service Processor running the SymmWin application. Initial configuration of Symmetrix arrays has to be performed by EMC personnel via the Symmetrix Service Processor.

Physical disks (in the disk array enclosures) are sliced into hypers, or disk slices, and protection schemes (RAID1, RAID5, etc.) are then incorporated, creating the Symmetrix logical volumes (discussed in the next slide). A Symmetrix logical volume is the entity that is presented to a host via a Symmetrix front-end port. The host views the Symmetrix logical volume as a physical drive. Do not confuse Symmetrix logical volumes with host-based logical volumes. Symmetrix logical volumes are defined by the Symmetrix configuration, while host-based logical volumes are configured by Logical Volume Manager software.

EMC ControlCenter and Solutions Enabler are software packages used to monitor and manage the Symmetrix. Solutions Enabler has a command line interface, while ControlCenter provides a Graphical User Interface (GUI). ControlCenter is a very powerful storage management tool; managing the Symmetrix is one of the many things it can do.

Page 123: Storage System Architecture


RAID1 – Symmetrix Logical Volume

RAID1 SLV
– Data is written to two hyper volumes on two different physical disks, which are accessed via two different disk directors

Host is unaware of data protection being applied

[Slide diagram: Logical Volume 04B, presented to the host at Target = 1, LUN = 0, is made up of two hyper volumes, LV 04B M1 and LV 04B M2, on physical drives behind different disk directors.]

Mirroring provides the highest level of performance and availability for all applications. Mirroring maintains a duplicate copy of a logical volume on two physical drives. The Symmetrix maintains these copies internally by writing all modified data to both physical locations. The mirroring function is transparent to attached hosts, as the hosts view the mirrored pair of hypers as a single Symmetrix logical volume.

A RAID1 SLV: two hyper volumes from two different disks on two different disk directors are logically presented as a RAID1 SLV. The hyper volumes are chosen from different disks on different disk directors to provide maximum redundancy. The SLV is given a hexadecimal address. In the example, SLV 04B is a RAID1 SLV whose hyper volumes exist on the physical disks in the back end of the array.

The SLV is then mapped to one or more Symmetrix front-end ports (a target and LUN ID is assigned at this time). The SLV can now be assigned to a server. The server views the SLV as a physical drive.

On a fully configured Symmetrix DMX3 array, one can have up to 64,000 Symmetrix logical volumes. The maximum number of SLVs on a DMX is a function of the number of disks, disk directors, and the protection scheme used.
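As an illustrative model only (the class names and the disk and director labels are assumptions, not Enginuity internals), the sketch below mirrors every write to the two hyper volumes behind a RAID1 SLV and records its mapping to a front-end port as Target 1, LUN 0.

```python
from dataclasses import dataclass, field

@dataclass
class Hyper:
    """A slice (hyper volume) of a physical disk behind a particular disk director."""
    disk: str
    director: str
    blocks: dict = field(default_factory=dict)

@dataclass
class RAID1SLV:
    """A RAID1 logical volume whose two mirrors sit on different disks and directors."""
    address: str                # hexadecimal volume address, e.g. "04B"
    m1: Hyper
    m2: Hyper
    front_end: tuple = None     # (front-end port, target, lun) once mapped

    def write(self, lba: int, data: bytes) -> None:
        """All modified data is written to both physical locations; the host sees one drive."""
        self.m1.blocks[lba] = data
        self.m2.blocks[lba] = data

slv = RAID1SLV("04B",
               m1=Hyper(disk="disk_A", director="disk_director_1"),
               m2=Hyper(disk="disk_B", director="disk_director_16"),
               front_end=("front_end_port_0", 1, 0))   # Target = 1, LUN = 0
slv.write(0, b"data")   # mirroring is transparent to the attached host
```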

Page 124: Storage System Architecture


Data Protection

Mirroring (RAID 1)
– Highest performance, availability and functionality
– Two hyper mirrors form one Symmetrix Logical Volume located on separate physical drives

Parity RAID (not available on DMX3)
– 3+1 (3 data and 1 parity volume) or 7+1 (7 data and 1 parity volume)

RAID 5 Striped RAID volumes
– Data blocks are striped horizontally across the members of the RAID group (4- or 8-member group); parity blocks rotate among the group members

RAID 10 Mirrored Striped Mainframe Volumes

Dynamic Sparing

SRDF (Symmetrix Remote Data Facility)
– Mirror of a Symmetrix Logical Volume maintained in a separate Symmetrix

Data protection options are configured at the volume level and the same Symmetrix can employ a variety of protection schemes.

Dynamic Sparing: disks in the back end of the array are reserved for use when a physical disk fails. When a physical disk fails, the dynamic spare is used as a replacement.

SRDF is a remote replication solution and is discussed later in the Business Continuity section of this course.

Page 125: Storage System Architecture


Assigning Symmetrix Logical Volumes to Hosts

Configure Symmetrix Logical Volumes

Map Symmetrix Logical Volumes to front-end ports
– Performed via EMC ControlCenter or Solutions Enabler

Make Symmetrix Logical Volumes accessible to hosts
– SAN Environment
− Zone hosts to front-end ports
− Perform LUN Masking
− Can be performed via EMC ControlCenter or Solutions Enabler
− LUN Masking information is maintained on the Symmetrix in the VCM Database (VCMDB)
− LUN Masking information is also flashed to all the front-end directors

Assigning Symmetrix Logical Volumes to hosts is a 3 step process:

1. Configure SLVs

2. Map the SLVs to front-end ports
When an SLV is created, it is not assigned to any front-end port, so SLVs must be assigned to front-end ports before a host can access them. Mapping is the task of assigning SLVs to front-end ports. For redundancy, map a device to more than one front-end port.

3. Make SLVs accessible to hosts
SAN environment – zoning and LUN masking have to be performed. Zoning and LUN masking in the SAN are discussed later in this course.

Page 126: Storage System Architecture


Page 127: Storage System Architecture


Data Flow Exercise

Page 128: Storage System Architecture


Q: Architecture Exercise

Identify the components of a data storage environment:

[Slide diagram with components labeled A through G.]

Match the letter in the diagram with the appropriate component:

___ Host

___ Intelligent Storage System

___ RAID Set

___ Cache

___ Front End

___ Back End

___ Connectivity

Page 129: Storage System Architecture


Q: Data Flow Exercise – Write Operation

In this example, the storage system uses write-back cache.

Identify the operations performed when writing data to disk, then list them in the correct order:

[Slide diagram with operations labeled A through F.]

1. Fill in the letter in the diagram that corresponds to the appropriate operation. Hint: Not all of the operations are used.

___ Host sends data to storage system

___ Data is written to physical disk some time later

___ Data is written to cache

___ Data is written to physical disk immediately

___ An acknowledgement is sent to the host

___ Data is returned to the host

___ Data is sent to back end

___ Back end receives status of write operation

2. List the operations in the correct order.

Page 130: Storage System Architecture


Q: Data Flow Exercise – Read Cache Hit

Identify the operations performed for a read request from the host. List the operations in the correct order.

[Slide diagram with operations labeled A through D.]

Fill in the letter in the diagram that corresponds to the appropriate operation. Hint: Not all of the operations are used.

___ Host sends read request to storage system

___ Data is read from physical disk when requested by the LRU algorithm

___ Data is written to cache

___ Data is read from physical disk immediately

___ Read data is sent to the host

___ Status is returned to the host

___ Data is sent to back end

___ Back end receives status of read operation

___ Cache is searched, and data is found there

___ Cache is searched, and data is not found there

___ Data placed in cache by a previous read or write operation

Page 131: Storage System Architecture


Q: Data Flow Exercise – Read Cache Miss

Identify the operations performed for a read request from the host. List the operations in the correct order.

[Slide diagram with operations labeled A through F.]

1. Fill in the letter in the diagram that corresponds to the appropriate operation. Hint: Not all of the operations are used.

___ Host sends read request to storage system

___ Data is read from physical disk when requested by the LRU algorithm

___ Data is written to cache

___ Data is read from physical disk immediately

___ Read data is sent to the host

___ Status is returned to the host

___ Data is sent to back end

___ Back end receives status of read operation

___ Cache is searched, and data is found there

___ Cache is searched, and data is not found there

2. List the operations in the correct order.

Page 132: Storage System Architecture


Page 133: Storage System Architecture

CASE STUDY 1

Business Profile: Acme Telecom is involved in mobile wireless services across the United States and has about 5000 employees worldwide. The company is based in Chicago and has 7 regional offices across the country. Although Acme is doing well financially, they continue to feel competitive pressure. As a result, the company needs to ensure that the IT infrastructure takes advantage of fault-tolerant features.

Current Situation/Issues:
• The company uses a number of different applications for communication, accounting, and management. All the applications are hosted on individual servers with disks configured as RAID 0.
• All financial activity is managed and tracked by a single accounting application. It is very important for the accounting data to be highly available.
• The application performs around 15% write operations, and the remaining 85% are reads.
• The accounting data is currently stored on a 5-disk RAID 0 set. Each disk has an advertised formatted capacity of 200 GB, and the total size of their files is 730 GB.
• The company performs nightly backups and removes old information, so the amount of data is unlikely to change much over the next 6 months.

The company is approaching the end of the financial year and the IT budget is depleted. Buying even one new disk drive will not be possible. How would you suggest that the company restructure their environment? You will need to justify your choice based on cost, performance, and availability of the new solution.

RAID level to use:
Advantages:
Disadvantages:

Page 134: Storage System Architecture

CASE STUDY 2

Business Profile: Acme Telecom is involved in mobile wireless services across the United States and has about 5000 employees worldwide. The company is based in Chicago and has 7 regional offices across the country. Although Acme is doing well financially, they continue to feel competitive pressure. As a result, the company needs to ensure that the IT infrastructure takes advantage of fault-tolerant features.

Current Situation/Issues:
• The company uses a number of different applications for communication, accounting, and management. All the applications were hosted on individual servers with disks configured as RAID 0.
• The company changed the RAID level of their accounting application based on your recommendations 6 months ago.
• It is now the beginning of a new financial year and the IT department has an increased budget. You are called in to recommend changes to their database environment.
• You investigate their database environment closely and observe that the data is stored on a 6-disk RAID 0 set. Each disk has an advertised formatted capacity of 200 GB and the total size of their files is 900 GB. The amount of data is likely to change by 30% over the next 6 months and your solution must accommodate this growth.
• The application performs around 40% write operations, and the remaining 60% are reads. The average size of a read or write is small, at around 2 KB.

How would you suggest that they restructure their environment? A new 200 GB disk drive costs $1000. The controller can handle all commonly used RAID levels, so it will not need to be replaced. What is the cost of the new solution? Justify your choice based on cost, performance, and data availability of the new solution.

RAID level to use:
Advantages:
Disadvantages:

Page 135: Storage System Architecture


Section Summary

Key points covered in this section:

Physical and logical components of a host

Common connectivity components and protocols

Features of intelligent disk storage systems

Data flow between the host and the storage array

Apply Your Knowledge

Data Flow Exercise

Case Studies

These are the key points covered in this section. Please take a moment to review them.

This concludes the training. Please proceed to the Course Completion slide to take the Assessment.