TRANSCRIPT
Storage Systems
CSE 598d, Spring 2007
Lecture 15: Consistency Semantics, Introduction to Network-attached Storage
March 27, 2007
Agenda
• Last class
  – Consistency models: brief overview
• Next
  – More details on consistency models
  – Network storage introduction
    • NAS vs. SAN
    • DAFS
    • Some relevant technology and systems innovations: FC, smart NICs, RDMA, …
  – A variety of topics on file systems (and other storage-related software)
    • Log-structured file systems
    • Databases and file systems compared
    • Mobile/poorly connected systems, highly distributed & P2P storage
    • NFS, Google file system
    • Asynchronous I/O
    • Flash-based storage
    • Active disks, object-based storage devices (OSD)
    • Archival and secure storage
    • Storage virtualization and QoS
  – Reliability, (emerging) miniature storage devices
Problem Background and Definition
• Consistency issues were first studied in the context of shared-memory multi-processors, and we will start our discussion in the same context
  – The ideas generalize to any distributed system with shared storage
• The memory consistency model (MCM) of an SMP provides a formal specification of how the memory system will appear to the programmer
  – It places restrictions on the values that can be returned by a read in a shared-memory program execution
  – An MCM is a contract between the memory and the programmer
• Why different models?
  – Trade-offs between the “strictness” of consistency guarantees, implementation effort (hardware, compiler, programmer), and system performance
Atomic/Strict Consistency
• Most intuitive, naturally appealing
• Any read to a memory location x returns the value stored by the most recent write operation to x
• Defined w.r.t. a “global” clock
  – That is the only way “most recent” can be defined unambiguously
• Uni-processors typically observe such consistency
  – A programmer on a uni-processor naturally assumes this behavior
  – E.g., as a programmer, one would not expect the following code segment to print 1, or any value other than 2:
    A = 1; A = 2; print (A);
  – It is still possible for the compiler and hardware to improve throughput by re-ordering instructions
    • Atomic consistency is preserved as long as data and control dependencies are adhered to
• Often considered the base model (for evaluating the MCMs that we will see next)
Atomic/Strict Consistency (cont’d)
• What happens on a multi-processor?
  – Even on the smallest and fastest multi-processor, global time cannot be achieved!
  – Achieving atomic consistency is not possible
  – But this is not a hindrance, since programmers manage quite well with something weaker than atomic consistency
  – What behavior do we expect when we program on a multi-processor?
    • What we DO NOT expect: a global clock
    • What we expect:
      – Operations from a process will execute sequentially
        » Again: A = 1; A = 2; print (A) should not print 1
    • And then we can use critical-section/mutual-exclusion mechanisms to enforce the desired order among instructions coming from different processors
  – So we expect an MCM less strict than atomic consistency. What is this consistency model, what are its properties, and what does the hardware/software (compiler) have to do to provide it?
Sequential Consistency
• What we typically expect from a shared-memory multi-processor system is captured by sequential consistency
  – Lamport [1979]: A multi-processor is sequentially consistent if the result of any execution is the same as if
    • The operations of all the processors were executed in some sequential order
      – That is, memory accesses occur atomically w.r.t. other memory accesses
    • The operations of each individual processor appear in this sequence in the order specified by its program
  – Equivalently, any valid interleaving is acceptable as long as all processes see the same ordering of memory references
  – Programmer’s view: [figure] processes P1, P2, …, Pn all reading and writing a single shared memory
Example: Sequential Consistency
P1: W(x)1
P2: W(y)2
P3: R(y)2 R(x)0 R(x)1
• Not atomically consistent because:
  – R(y)2 by P3 reads a value that has not been written yet (w.r.t. a global clock)
  – W(x)1 and W(y)2 appear commuted at P3
• But sequentially consistent
  – SC doesn’t have the notion of a global clock
Example: Sequential Consistency (cont’d)
• What about?
P1: W(x)1
P2: W(y)2 R(y)2 R(x)0 R(x)1
P3: R(y)2 R(x)0 R(x)1
Example: Sequential Consistency (cont’d)
• And what about?
P1: W(x)1
P2: W(y)2 R(y)2 R(x)0 R(x)1
P3: R(x)1 R(y)0 R(y)2
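To make these examples concrete, here is a small Python sketch (mine, not from the lecture) that brute-force checks whether a slide-sized history is sequentially consistent: it enumerates every interleaving that preserves each process’s program order and asks whether at least one interleaving explains all the reads (every location starts at 0). The search is exponential, so it is only useful at this scale.

  # ops are ('W', var, val) or ('R', var, val); every variable starts at 0
  def interleavings(procs):
      if all(len(p) == 0 for p in procs):
          yield []
          return
      for i, p in enumerate(procs):
          if p:
              rest = [q[1:] if j == i else q for j, q in enumerate(procs)]
              for tail in interleavings(rest):
                  yield [p[0]] + tail

  def legal(seq):
      mem = {}
      for op, var, val in seq:
          if op == 'W':
              mem[var] = val
          elif mem.get(var, 0) != val:
              return False          # a read this order cannot explain
      return True

  def sequentially_consistent(procs):
      return any(legal(s) for s in interleavings(procs))

  # The first history above: P3 sees W(y)2 before W(x)1 -- SC holds (True)
  print(sequentially_consistent([
      [('W', 'x', 1)],
      [('W', 'y', 2)],
      [('R', 'y', 2), ('R', 'x', 0), ('R', 'x', 1)],
  ]))

Feeding in the two “what about?” histories the same way answers the slide’s questions mechanically.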
Causal Consistency
• Hutto and Ahamad, 1990
• Any two operations are either “causally related” or “concurrent”
  – When a processor performs a read followed later by a write, the two operations are said to be causally related, because the value stored by the write may have depended on the result of the read
  – A read operation is causally related to the earlier write that stored the data retrieved by the read
  – Transitivity applies
  – Operations that are not causally related are said to be concurrent
• A memory is causally consistent if all processors agree on the order of causally related writes
  – Weaker than SC, which requires all writes to be seen in the same order
P1: W(x)1 W(x)3
P2: R(x)1 W(x)2
P3: R(x)1 R(x)3 R(x)2
P4: R(x)1 R(x)2 R(x)3
W(x)1 and W(x)2 are causally related; W(x)2 and W(x)3 are not causally related (concurrent), so P3 and P4 may legally see them in different orders!
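One standard way to decide “causally related vs. concurrent” mechanically is with vector clocks. The sketch below is my illustration (the lecture does not prescribe a mechanism): each write is tagged with the writer’s vector clock, reads merge the writer’s clock into the reader’s, and two writes are causally related iff one clock is componentwise ≤ the other.

  # tag writes with vector clocks; w1 -> w2 iff clock(w1) < clock(w2)
  class Process:
      def __init__(self, pid, nprocs):
          self.pid, self.vc = pid, [0] * nprocs
      def write(self, var, val, store):
          self.vc[self.pid] += 1
          store[(var, val)] = list(self.vc)    # clock at the write
      def read(self, var, val, store):
          self.vc = [max(a, b) for a, b in zip(self.vc, store[(var, val)])]

  def causally_related(c1, c2):
      return all(a <= b for a, b in zip(c1, c2)) and c1 != c2

  store = {}
  P1, P2 = Process(0, 2), Process(1, 2)
  P1.write('x', 1, store)                      # W(x)1
  P2.read('x', 1, store)                       # R(x)1 makes ...
  P2.write('x', 2, store)                      # ... W(x)2 causally after W(x)1
  P1.write('x', 3, store)                      # W(x)3
  print(causally_related(store[('x', 1)], store[('x', 2)]))   # True
  print(causally_related(store[('x', 2)], store[('x', 3)]) or
        causally_related(store[('x', 3)], store[('x', 2)]))   # False: concurrent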
Summary: Uniform MCMs (roughly from strongest to weakest)
• Atomic consistency
• Sequential consistency
• Processor consistency / Causal consistency
• PRAM consistency / Cache consistency
• Slow memory
UNIX and Session Semantics
• UNIX file-sharing semantics on a uni-processor system
  – When a read follows a write, the read returns the value just written
  – When two writes happen in quick succession, followed by a read, the value read is that stored by the last write
• Problematic for a distributed system
  – Theoretically achievable with a single file server and no client caching
• Session semantics
  – Writes are made visible to others only upon the closing of a file
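A minimal sketch of session semantics (my toy model, not from the slides): each open() starts a session on a private copy of the file, and the session’s writes reach the shared server copy only at close().

  class FileServer:
      def __init__(self):
          self.files = {}                       # name -> committed contents
      def open(self, name):
          return Session(self, name, self.files.get(name, b""))

  class Session:
      def __init__(self, server, name, data):
          self.server, self.name = server, name
          self.data = bytearray(data)           # private working copy
      def write(self, data):
          self.data = bytearray(data)           # invisible to other sessions
      def close(self):
          self.server.files[self.name] = bytes(self.data)   # publish on close

  srv = FileServer()
  s1, s2 = srv.open("f"), srv.open("f")
  s1.write(b"hello")
  print(srv.files.get("f"))                     # None: s2 cannot see it yet
  s1.close()
  print(srv.open("f").data)                     # bytearray(b'hello')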
Delta Consistency
• Any write will become visible within at most Delta time units
  – Barring network latency
  – Meanwhile … all bets are off!
  – Push versus pull
  – Compare with sequential, causal, etc. in terms of valid orderings of operations
• Related: mutual consistency with parameter Delta
  – A given set of “objects” are within Delta time units of each other at all times, as seen by a client
  – Note that it is OK to be stale with respect to the server by more than Delta!
  – Generally, specify two parameters
    • Delta1: freshness w.r.t. the server
    • Delta2: mutual consistency of related objects
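A pull-style client cache providing delta consistency might look like the following sketch (my illustration; the names and the 30-second bound are made up): an entry may be served without revalidation until it is Delta old, so a write becomes visible to this client within at most Delta, plus network latency.

  import time

  DELTA = 30.0                          # freshness bound w.r.t. the server

  class DeltaCache:
      def __init__(self, fetch):
          self.fetch = fetch            # key -> value, fetched from the server
          self.entries = {}             # key -> (value, time fetched)
      def get(self, key):
          hit = self.entries.get(key)
          if hit and time.monotonic() - hit[1] < DELTA:
              return hit[0]             # possibly stale, but < DELTA old
          value = self.fetch(key)       # pull: revalidate on expiry
          self.entries[key] = (value, time.monotonic())
          return value

A push variant would instead have the server invalidate or update cached entries within Delta of each write.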
File System Consistency Semantics
• What is involved in providing these semantics?
• UNIX semantics: easy to implement on a uni-processor
• Session semantics: session state at the server
• Delta consistency: timeouts, leases
• Meta-data consistency
  – Some techniques we have seen
    • Journaling, LFS; meta-data journaling: ext3
    • Synchronous writes
    • NVRAM: expensive, not widely available
  – Disk-scheduler-enforced ordering!
    • The file system passes sequencing restrictions to the disk scheduler
    • Problem: the disk scheduler cannot enforce an ordering among requests not yet visible to it
  – Soft updates (sketched below)
    • Dependency information is maintained for meta-data blocks in the write-back cache, at a per-field and/or per-pointer granularity
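The essence of ordering-based meta-data consistency can be seen in a toy like the one below (my sketch; real soft updates track dependencies per field and roll records back and forth to break cycles): dirty blocks carry “must reach disk first” edges, and write-back flushes them in a dependency-respecting order.

  from graphlib import TopologicalSorter        # Python 3.9+

  def flush(dirty, deps):
      # deps: block -> set of blocks that must be on disk before it
      for block in TopologicalSorter(deps).static_order():
          if block in dirty:
              print("writing", block)

  # e.g., an inode must be initialized on disk before the directory
  # entry that points to it becomes durable
  deps = {"dir-entry-block": {"inode-block"}, "inode-block": set()}
  flush({"dir-entry-block", "inode-block"}, deps)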
Network-attached Storage
• Introduction to important ideas and technologies
• Lots of slides; we will cover some in class and post all of them on Angel
• Subsequent classes will cover some topics in depth
Direct Attached Storage
• Problems/shortcomings in enterprise/commercial settings
  – Sharing of data is difficult
  – Programming and client access are inconvenient
  – Wasted capacity
  – More?
“Remote” Storage
• Idea: Separate storage from the clients and application servers and locate it on the other side of a scalable networking infrastructure
  – Variants on this idea that we will see soon
• Advantages
  – Reduction in wasted capacity, by pooling devices and consolidating unused capacity formerly spread over many directly-attached storage devices
  – Reduced time to deploy new storage
    • Client software is designed to tolerate dynamic changes in network resources, but not changes to the local storage configuration while the client is operating
  – Backup made more convenient
    • Application server involvement removed
  – Management simplified by centralizing storage under a consolidated management interface
  – Availability improved (potentially)
    • All software and hardware is specifically developed and tested to run together
• Disadvantages
  – Complexity; more expertise needed
    • Implies more set-up and management cost
Network Attached Storage
• File interface exported to the rest of the network
Storage Area Network (SAN)
• Block interface exported to the rest of the network
SAN versus NAS
Source: Communications of the ACM, Vol. 43, No. 11, November 2000
Differences between NAS and SAN
• NAS
  – TCP/IP or UDP/IP protocols and Ethernet networks
  – High-level requests and responses for files
  – NAS devices translate file requests into operations on disk blocks
  – Cheaper
• SAN
  – Fibre Channel and SCSI
  – More scalable
  – Clients translate file accesses into operations on specific disk blocks
  – Operates at the data-block level
  – Expensive
  – Separates storage traffic from general network traffic
    • Beneficial for security and performance
NAS File Servers
• Pre-configured file servers
• Consist of one or more internal servers with pre-configured capacity
• Have a stripped-down OS; any component not associated with file services is discarded
  – OS stripping makes them more efficient than a general-purpose OS
• Connected via Ethernet to the LAN
• Have plug-and-play functionality
Source: Storage Networks Explained: Basics and Application of Fibre Channel SAN, NAS, iSCSI and InfiniBand, by Ulf Troppens, Rainer Erkens, and Wolfgang Mueller
NAS Network Performance
• NAS and traditional network file systems use IP-based protocols over NIC devices
• A consequence of this deployment is poor network performance
• The main culprits often cited include:
  – Protocol processing in network stacks
  – Memory copying
  – Kernel overhead, including system calls and context switches
NAS Network Performance
[Figure depicting sources of TCP/IP overhead]
NAS Network Performance
Protocol Processing
• Data transmission involves the OS services for memory and process management, the TCP/IP protocol stack, and the network device and its device driver
• The per-packet costs include the overhead to execute the TCP/IP protocol code, allocate and release memory buffers, and handle device interrupts for packet arrival and transmit completion
• The per-byte costs include the overheads to move data within the end-to-end system and to compute checksums to detect data corruption in the network
NAS Network Performance
Memory Copy
[Figure: current implementations of data transmission require the same data to be copied at several stages]
NAS Network Performance
• An NFS client requesting data stored on a NAS server with an internal SCSI disk would involve:
  – A hard disk to RAM transfer, using the SCSI, PCI, and system buses
  – A RAM to NIC transfer, using the system and PCI buses
• For traditional NFS this would further involve a transfer from application memory to the kernel buffer cache of the transmitting computer before forwarding to the network card
Accelerating Performance
• Two starting points to accelerate network file system performance:
  – The underlying communication protocol
    • TCP/IP was designed to provide a reliable framework for data exchange over an unreliable network; the TCP/IP stack is complex and CPU-intensive
    • Example alternative: VIA/RDMA
  – The network file system
    • Develop new network file systems that assume a reliable network connection; network file systems can then be modified to use thinner communication protocols
    • Example alternative: DAFS
Proposed Solutions
TCP/IP Offload Engines (TOEs)
• An increasing number of network adapters are able to compute the Internet checksum
• Some adapters can now perform TCP or UDP protocol processing
Copy Avoidance
• Several buffer management schemes have been proposed to either reduce or eliminate data copying (one example is sketched below)
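As one concrete copy-avoidance example (my sketch, not necessarily one of the schemes the slide alludes to): the sendfile() system call, exposed in Python as os.sendfile() on platforms that support it, lets the kernel move file pages directly to a socket, skipping the usual read()-into-user-buffer / write()-back-to-kernel round trip.

  import os, socket

  def serve_file(conn: socket.socket, path: str) -> None:
      """Send a whole file over a connected socket without copying it
      through a user-space buffer."""
      with open(path, "rb") as f:
          size = os.fstat(f.fileno()).st_size
          sent = 0
          while sent < size:
              # kernel-to-kernel transfer; returns the bytes actually sent
              sent += os.sendfile(conn.fileno(), f.fileno(), sent, size - sent)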
Proposed Solutions
Fibre Channel
• Fibre Channel reduces the communication overhead by offloading transport processing to the NIC instead of using the host processor
• Zero copying is facilitated by direct communication between host memory and the NIC device
Direct-Access Transport
• Requires NIC support for remote DMA
• User-level networking is made possible by a user-mode process interacting directly with the NIC to send or receive messages, with minimal kernel intervention
• Reliable message transport network
Proposed Solutions
NIC Support Mechanism
• The NIC device exposes an array of connection descriptors to the system’s physical address space
• At connection setup time, the network device driver maps a free descriptor into the user virtual address space
• This grants the user process direct and safe access to the NIC’s buffers and registers
• This facilitates user-level networking and copy avoidance
Proposed Solutions
User-Level File System
• Kernel policies for file system caching and prefetching do not favor some applications
• The migration of OS functions into user-level libraries allows user applications more control and specialization
• Clients would run in user mode as libraries linked directly with applications; this reduces the overhead due to system calls
• Clients may evolve independently of the operating system
• Clients could also run on any OS, with no special kernel support except the NIC device driver
Virtual Interface and RDMA
• The Virtual Interface Architecture (VIA) facilitates fast and efficient data exchange between applications running on different machines
• VIA reduces complexity by allowing applications (VI consumers) to communicate directly with the network card (VI NIC) via common memory areas, bypassing the operating system
• The VI provider is the NIC and its device driver
• RDMA is a communication model, supported on VIA, which allows applications to read and write memory areas of processes running on different computers
VI Architecture and RDMA
Source: Storage Networks Explained: Basics and Application of Fibre Channel SAN, NAS, iSCSI and InfiniBand, by Ulf Troppens, Rainer Erkens, and Wolfgang Mueller
Remote DMA (RDMA)
VIA Model
[Figure: two hosts connected by Myrinet NICs (LANai processors). Each host’s user address space holds a send buffer/descriptor and a receive buffer/descriptor; each NIC exposes send and receive doorbells and stages data packets in NIC memory. Numbered arrows trace a transfer from the sender’s buffer to the receiver’s.]
InfiniBand
• “Infinite Bandwidth”
• A switch-based I/O interconnect architecture
• Low-pin-count serial architecture
• The InfiniBand Architecture (IBA) defines a System Area Network (SAN)
  – An IBA SAN is a communications and management infrastructure for I/O and IPC
• IBA defines a switched communications fabric
  – High bandwidth and low latency
• Backed by top companies in the industry: Compaq, Dell, Hewlett-Packard, IBM, Intel, Microsoft, and Sun
Limits of the PCI Bus
• Peripheral Component Interconnect (PCI)
  – Introduced in 1992
  – Has become the standard bus architecture for servers
  – PCI bus
    • 32-bit/33 MHz -> 64-bit/66 MHz
  – PCI-X
    • The latest version is 64 bits wide at: PCI-X 66, PCI-X 133, PCI-X 266, and PCI-X 533 [4.3 GB/s]
  – Other PCI concerns include
    • Bus sharing
    • Bus speed
    • Scalability
    • Fault tolerance
PCI Express
• High-speed point-to-point architecture that is essentially a serialized, packetized version of PCI
• General-purpose serial I/O bus for chip-to-chip communication, USB 2.0 / IEEE 1394b interconnects, and high-end graphics – a viable AGP replacement
• Bandwidth: 4 Gigabit/second full duplex per lane
  – Up to 32 separate lanes – 128 Gigabit/second
• Software-compatible with the PCI device driver model
• Expected to coexist with, and not displace, technologies like PCI-X for the foreseeable future
Benefits of IBA
• Bandwidth
• An open and industry-inclusive standard
• Improved connection flexibility and scalability
• Improved reliability
• Offloads communications processing from the OS and CPU
• Wide access to a variety of storage systems
• Simultaneous device communication
• Built-in security and quality of service (QoS)
• Support for Internet Protocol version 6 (IPv6)
• Fewer and better-managed system interrupts
• Support for up to 64,000 addressable devices
• Support for copper cable and optical fiber
InfiniBand Components
• Host Channel Adapter (HCA)
  – An interface to a host; supports all software verbs
• Target Channel Adapter (TCA)
  – Provides the connection from InfiniBand to an I/O device
• Switch
  – Fundamental component of an IB fabric
  – Allows many HCAs and TCAs to connect to it, and handles network traffic
• Router
  – Forwards data packets from a local network to other external subnets
• Subnet Manager
  – An application responsible for configuring the local subnet and ensuring its continued operation
An IBA SAN
InfiniBand Layers
• Physical Layer

  Link   Pin Count   Signaling Rate   Data Rate   Full-Duplex Data Rate
  1x     4           2.5 Gb/s         2 Gb/s      4 Gb/s (500 MB/s)
  4x     16          10 Gb/s          8 Gb/s      16 Gb/s (2 GB/s)
  12x    48          30 Gb/s          24 Gb/s     48 Gb/s (6 GB/s)
InfiniBand Layers
• Link Layer
  – Central to the IBA; includes packet layout, point-to-point link operation, switching within a local subnet, and data integrity
  – Packets
    • Data and management packets
  – Switching
    • Data forwarding within a local subnet
  – QoS
    • Supported by virtual lanes: a virtual lane is a unique logical communication link that shares a single physical link
    • Up to 15 data virtual lanes per physical link, plus a management lane (VL0 – VL15)
    • Each packet is assigned a priority
  – Credit-based flow control
    • Used to manage data flow across a point-to-point link
  – Integrity checking using CRCs
InfiniBand Layers
• Network Layer
  – Responsible for routing packets from one subnet to another
  – The global route header (GRH) located within a packet includes an IPv6 address for the source and destination of each packet
• Transport Layer
  – Handles the order of packet delivery, as well as partitioning, multiplexing, and the transport services that determine reliable connections
InfiniBand Architecture
• The Queue Pair Abstraction
  – Two queues of communication meta-data (send & receive)
  – Registered buffers to send from / receive into
Source: “Architectural Interactions of I/O Networks and Inter-networks”, Philip Buonadonna, Intel Research & University of California, Berkeley
Direct Access File System
• A new network file system derived from NFS version 4
• Tailored to use remote DMA (RDMA), which requires the virtual interface (VI) framework
• Introduced to combine the low overhead of SAN products with the generality of NAS file servers
• Communication between a DAFS server and client is done through RDMA
• Client-side caching of locks for easier subsequent access to the same file
• Clients can be implemented as a shared library in user space, or in the kernel
DAFS Architecture
Source: Storage Networks Explained: Basics and Application of Fibre Channel SAN, NAS, iSCSI and InfiniBand, by Ulf Troppens, Rainer Erkens, and Wolfgang Mueller
Direct Access File System
DAFS Protocol
• Defined as a set of request and response formats and their semantics
• Defines recommended procedural APIs to access DAFS services from a client program
• Assumes a reliable network transport, and offers server-directed command flow
• Each operation is a separate request, but request chaining is also supported
• Defines features for session recovery, plus locking primitives
Direct Access File System
Direct Access Data Transfer
• Supports direct variants of data transfer operations such as read, write, setattr, etc.
• Direct transfer operations move data to and from client-provided memory using RDMA read and write operations
• The client registers each memory region with the local kernel before requesting direct I/O on that region
• The API defines register and unregister primitives for memory region management; register returns a region descriptor
• Registration issues a system call to pin buffer regions in physical memory, then loads page translations for the region into a lookup table on the NIC
Direct Access File System
RDMA Operations
• RDMA operations for direct I/O are initiated by the server
• A client write request to the server includes a region token for the buffer containing the data
• The server then issues an RDMA read to fetch the data from the client, and responds with a write-request response after RDMA completion
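This flow can be modeled in a few lines (my sketch; the names are invented, not the DAFS API): the client registers a buffer and passes only a small region token in its RPC, and the server pulls the payload itself with an RDMA read keyed by that token.

  import secrets

  class ClientNIC:
      def __init__(self):
          self.regions = {}                    # token -> registered buffer
      def register(self, buf):
          token = secrets.token_hex(8)         # stands in for a region descriptor
          self.regions[token] = buf            # a real NIC pins the pages and
          return token                         # loads their translations
      def rdma_read(self, token):
          return bytes(self.regions[token])    # server-initiated, no client CPU

  nic = ClientNIC()
  buf = bytearray(b"block to be written")
  tok = nic.register(buf)                      # client: register, send token in RPC
  payload = nic.rdma_read(tok)                 # server: RDMA read of the region
  print(payload)                               # ... then send the write response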
Direct Access File System
Asynchronous I/O and Prefetching
• Supports a fully asynchronous API, which enables clients to pipeline I/O operations and overlap them with application processing (sketched below)
• Event notification mechanisms deliver asynchronous completions, and a client may create several completion groups
• DAFS can be implemented as a user library to be linked with applications, or within the kernel
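The pipelining that such an asynchronous API enables looks roughly like this sketch (mine; it uses a thread pool where DAFS would post asynchronous operations and completion events): the next read is issued before the current block is processed, overlapping I/O with computation.

  from concurrent.futures import ThreadPoolExecutor

  def read_block(f, offset, size):
      f.seek(offset)
      return f.read(size)

  def process(block):
      return sum(block)                 # stand-in for application work

  def pipeline(path, block_size=1 << 20):
      total, offset = 0, block_size
      with open(path, "rb") as f, ThreadPoolExecutor(max_workers=1) as pool:
          future = pool.submit(read_block, f, 0, block_size)
          while True:
              block = future.result()
              if not block:
                  break
              # issue the next read before touching this block
              future = pool.submit(read_block, f, offset, block_size)
              offset += block_size
              total += process(block)   # overlaps with the read above
      return total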
Direct Access File System
[Figure depicting DAFS and NFS client architectures]
Source: http://www.eecs.harvard.edu/~vino/fs-perf/dafs.html
Direct Access File System
Server Design and Implementation
• The kernel server design is fashioned as an event-driven state transition diagram
• The main events triggering state transitions are: recv_done, send_done, and bio_done
Figure 1. An event-driven DAFS server
Source: http://www.eecs.harvard.edu/~vino/fs-perf/dafs.html
Direct Access File System
Event Handlers
• Each network or disk event is associated with a handler routine
• recv_done – a client-initiated transfer is complete. This signal is asserted by the NIC, and initiates the processing of an incoming RPC request
• send_done – a server-initiated transfer is complete. The handler for this signal releases all the locks involved in the RDMA operation and returns an RPC response
• bio_done – a block I/O request from disk is complete. This signal is raised by the disk controller, and wakes up any thread that is blocking on a previous disk I/O
Direct Access File System
Server Design and Implementation
• The server performs disk I/O using the zero-copy buffer cache interface
• This interface facilitates locking pages and their mappings
• Buffers involved in RDMA need to be locked for the entire duration of the transfer
• Transfers are initiated by RPC handlers, and processing is asynchronous
• The kernel buffer cache manager registers and de-registers buffer mappings with the NIC on the fly, as physical pages are added to or removed from the buffers
Direct Access File System
Server Design and Implementation
• The server creates multiple kernel threads to facilitate I/O concurrency
• A single listener thread monitors for new transport connections; other worker threads handle data transfer
• Arriving messages generate a recv_done interrupt, which is processed by a single handler for the completion group
• The handler queues up incoming RPC requests and invokes a worker thread to start data processing
• A thread locks all the necessary file pages in the buffer cache, creates RDMA descriptors, and issues RDMA operations
• After RDMA completion, a send_done signal is sent, which initiates the clean-up and release of all resources associated with the completed operation
Communication Alternatives
Source: Storage Networks Explained: Basics and Application of Fibre Channel SAN, NAS, iSCSI and InfiniBand, by Ulf Troppens, Rainer Erkens, and Wolfgang Mueller
Experimental Setup
Source: http://www.eecs.harvard.edu/~vino/fs-perf/dafs.html
Experimental Setup
System Configuration
• Pentium III 800 MHz clients and servers
• Server cache 1 GB, 133 MHz memory bus
• 9 GB disks, 10K RPM Seagate Cheetah, 64-bit/33 MHz PCI bus
• VI over Giganet cLAN 1000 adapter (DAFS)
• UDP/IP over Gigabit Ethernet, Alteon Tigon-II adapters (NFS)
Experimental Setup
• NFS block I/O transfer size is set at mount time
• Packets are sent as fragmented UDP packets
• Interrupt coalescing is set to high on the Tigon-II
• Checksum offloading is enabled on the Tigon-II
• NFS-nocopy required modifying the Tigon-II firmware, IP fragmentation code, file cache code, VM system, and Tigon-II driver, to facilitate header splitting and page remapping
Experimental Results
The table below shows the results for one-byte round-trip latency and bandwidth. The higher latency on the Tigon-II was due to the datapath crossing the kernel UDP/IP stack.
Experimental Results
Bandwidth and Overhead
• Server pre-warmed with a 768 MB dataset
• Designed to stress network data transfer; hence client caching is not considered
Sequential configuration
• The DAFS client utilized the asynchronous I/O API
• NFS had read-ahead enabled
Random configuration
• NFS was tuned for best-case performance at each request size by selecting a matching NFS transfer size
Experimental Results
[Bandwidth and overhead figures omitted]
Experimental Results
TPIE Merge
• The sequential record merge program combines n sorted input files, each consisting of x y-byte records, into a single sorted output file
• Depicts raw sequential I/O performance with varying amounts of processing
• Performance is limited by the client CPU
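The merge pattern itself is a few lines in Python (my sketch, not TPIE, which is a C++ external-memory library): heapq.merge lazily combines n sorted input streams into one sorted output stream, reading each input sequentially.

  import heapq

  def merge_files(in_paths, out_path):
      ins = [open(p) for p in in_paths]     # one sorted record per line
      try:
          with open(out_path, "w") as out:
              out.writelines(heapq.merge(*ins))
      finally:
          for f in ins:
              f.close()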
Experimental Results
[TPIE Merge figures omitted]
Experimental Results
PostMark
• A synthetic benchmark used to measure file system performance over workloads composed of many short-lived, relatively small files
• Creates a pool of files with random sizes, followed by a sequence of file operations (transactions); a toy version is sketched below
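A PostMark-flavored toy (my sketch; the real benchmark has many parameters and a create/delete plus read/append transaction mix) conveys the workload shape:

  import os, random, tempfile, time

  def postmark_lite(nfiles=100, ntrans=1000, lo=512, hi=16384):
      pool = tempfile.mkdtemp()
      paths = []
      for i in range(nfiles):               # initial pool of small files
          p = os.path.join(pool, "f%d" % i)
          with open(p, "wb") as f:
              f.write(os.urandom(random.randint(lo, hi)))
          paths.append(p)
      t0 = time.monotonic()
      for _ in range(ntrans):               # transaction mix
          p, r = random.choice(paths), random.random()
          if r < 0.25:
              with open(p, "wb") as f:      # re-create
                  f.write(os.urandom(random.randint(lo, hi)))
          elif r < 0.5:
              with open(p, "ab") as f:      # append
                  f.write(os.urandom(lo))
          else:
              with open(p, "rb") as f:      # read
                  f.read()
      return ntrans / (time.monotonic() - t0)

  print("%.0f transactions/sec" % postmark_lite())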
Experimental Results
Berkeley DB
• A synthetic workload composed of read-only transactions, each processing one small record chosen at random from a B-tree
Disk Storage Interfaces
• Parallel ATA (IDE, E-IDE)
• Serial ATA (SATA)
• Small Computer System Interface (SCSI)
• Serial Attached SCSI (SAS)
• Fibre Channel (FC)
Source: “It’s More Than the Interface”, Gordy Lutz, Seagate, August 2002
Parallel ATA
• 16-bit bus
• Two bytes per bus transaction
• 40-pin connector
• Master/slave shared bus
• Bandwidth:
  25 MHz strobe
  × 2 for double-data-rate clocking
  × 16 bits per edge
  ÷ 8 bits per byte
  = 100 MBytes/sec
Serial ATA (SATA)
• 7-pin connector
• Point-to-point connections for dedicated bandwidth
• Bit-by-bit transmission
  – One signal path for data transmission
  – The other signal path for acknowledgement
• Bandwidth:
  1500 MHz embedded clock
  × 1 bit per clock
  × 80% for 8b/10b encoding
  ÷ 8 bits per byte
  = 150 MBytes/sec
• 2002 -> 150 MB/sec
• 2004 -> 300 MB/sec
• 2007 -> 600 MB/sec
8b/10b Encoding
• IBM patent
• Used in SATA, SAS, FC, and InfiniBand
• Converts 8 bits of data into 10-bit codes
• Provides better synchronization than Manchester encoding
Small Computer Systems Interface (SCSI)
• SCSI targets the high-performance storage market
• SCSI-1 proposed in 1986
• Parallel interface
• Maximum cabling distance is 12 meters
• Terminators required
• Bus width is 8 bits (narrow)
• 16 devices per bus
• A device with a higher priority wins bus arbitration
SCSI (cont’d)
• Peer-to-peer connection (channel)
• 50/68 pins
• Hot repair not provided
• Multiple buses needed beyond 16 devices
• Low bandwidth
• Distance limitation
SCSI Roadmap
• Wide SCSI (16-bit bus)
• Fast SCSI (double data rate)
Serial Attached SCSI (SAS)
• ANSI standard in 2003
• Interoperability with SATA
• Full-duplex
• Dual-port
• 128 devices
• 10 meters
Dual Port
• ATA, SCSI, and SATA support a single port
  – The controller is a single point of failure
• SAS and FC support dual ports
SAS Roadmap
http://www.scsita.org/aboutscsi/sas/SAS_roadmap2004.html
Fibre Channel (FC)
• Developed as a backbone technology for LANs
• The name is a misnomer
  – Runs on copper also
  – 4-wire cable or fiber optic
• 10 km or less per link
• 126 devices per loop
• No terminators
• Installed base of Fibre Channel devices*
  – $2.45 billion in FC HBAs in 2005
  – $5.4 billion in FC switches in 2005
*Source: Gartner, Dec 13, 2001 (projections)
FC (cont’d)
• Advantages
  – High bandwidth
  – Secure
  – Zero-copy send and receive
  – Low host CPU utilization
  – FCP (Fibre Channel Protocol)
• Disadvantages
  – Not a wide-area network
  – Separate physical network infrastructure
  – Expensive
  – Different management mechanisms
  – Interoperability problems across different vendors
Fibre Channel Topologies
Ulf Troppens, Rainer Erkens and Wolfgang Muller, Storage Networks Explained
Fibre Channel Ports
• N-Port: node port
• F-Port: fabric port
• L-Port: loop port
  – Connects only to an arbitrated loop (AL)
• E-Port: expansion port
  – Connects two switches
• G-Port: generic port
• B-Port: bridge port
  – Bridges to other networks (IP, ATM, etc.)
• NL-Port: node loop port
  – Can connect both to a fabric and to an AL
• FL-Port: fabric loop port
  – Enables a fabric to connect to a loop
Ulf Troppens, Rainer Erkens and Wolfgang Muller, Storage Networks Explained
Arbitrated Loop in FC
Ulf Troppens, Rainer Erkens and Wolfgang Muller, Storage Networks Explained
Routing Mechanisms in a Switch
• Store-and-forward routing
• Cut-through routing
William James Dally and Brian Towles, Principles and Practices of Interconnection Networks, Chapter 13
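The latency difference between the two is easy to estimate with the standard textbook model (the formulas follow Dally & Towles; the example numbers are mine): store-and-forward pays the full packet serialization delay at every hop, while cut-through pays it only once.

  def store_and_forward(hops, pkt_bits, bw_bps, route_s=0.0):
      # the whole packet is received and retransmitted at each hop
      return hops * (route_s + pkt_bits / bw_bps)

  def cut_through(hops, pkt_bits, bw_bps, route_s=0.0, hdr_bits=0):
      # forwarding starts as soon as the header has been routed
      return hops * (route_s + hdr_bits / bw_bps) + pkt_bits / bw_bps

  # 5 hops, 2 KB packet, 1 Gb/s links
  print(store_and_forward(5, 2048 * 8, 1e9))    # ~82 microseconds
  print(cut_through(5, 2048 * 8, 1e9))          # ~16 microseconds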
Fibre Channel Hub and Switch
• Switch
  – Thousands of connections
  – Bandwidth per device is nearly constant
  – Aggregate bandwidth increases with increased connectivity
  – Deterministic latency
• Hub
  – 126 devices
  – Bandwidth per device diminishes with increased connectivity
  – Aggregate bandwidth is constant with increased connectivity
  – Latency increases as the number of devices increases
Fibre Channel Structure
Fibre Channel Bandwidth
• Clock rate is 1.0625 GHz
• 1.0625 [Gbps] × 2048/2168 [payload/(payload+overhead)] × 0.8 [8b/10b] ÷ 8 [bits/byte] = 100.369 MB/s
Cable types in FC
FC Roadmap

  Product Naming   Throughput (MB/s)   T11 Spec Completed (Year)   Market Availability (Year)
  1GFC             200                 1996                        1997
  2GFC             400                 2000                        2001
  4GFC             800                 2003                        2005
  8GFC             1,600               2006                        2008
  16GFC            3,200               2009                        2011
  32GFC            6,400               2012                        Market demand
  64GFC            12,800              2016                        Market demand
  128GFC           25,600              2020                        Market demand

http://www.fibrechannel.org/OVERVIEW/Roadmap.html
Interface Comparison
Market Segments
Source: “It’s More Than the Interface”, Seagate, 2003
Interface Trends – Previous
Source: “It’s More Than the Interface”, Seagate, 2003
Interface Trends – Today and Tomorrow
Source: “It’s More Than the Interface”, Seagate, 2003
IP Storage
IP Storage (cont’d)
• TCP/IP is used as a storage interconnect to transfer block-level data
• IETF IP Storage (IPS) working group
• iSCSI, iFCP, and FCIP protocols
• Cheaper
• Provides one technology for a client to connect to servers and storage devices
• Increases operating distances
• Improves availability of storage systems
• Can utilize network management tools
Source: “It’s More Than the Interface”, Seagate, 2003
iSCSI (Internet SCSI)
• iSCSI is a transport for SCSI commands
  – iSCSI is an end-to-end protocol
  – iSCSI can be implemented on desktops, laptops, and servers
  – iSCSI can be implemented with current TCP/IP stacks
  – iSCSI can be implemented completely in an HBA
• Overcomes the distance limitation
• Cost-effective
Protocol Stack - iSCSI
Packet and Bandwidth – iSCSI
• Per-packet overhead: 78 bytes
  – 14 (Ethernet) + 20 (IP) + 20 (TCP) + 4 (CRC) + 20 (interframe gap)
  – The iSCSI header adds 48 bytes per SCSI command
• 1.25 [Gbps] × 1460/1538 [payload/(payload+overhead)] × 0.8 [8b/10b] ÷ 8 [bits/byte] = 113.16 MB/s
• Bi-directional payload bandwidth: 220.31 MB/s
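A quick script reproduces both effective-bandwidth figures and exposes a unit subtlety (my observation, not stated on the slides): the Fibre Channel slide’s 100.369 matches decimal MB/s (10^6 bytes), while the 113.16 here matches MiB/s (2^20 bytes); in decimal MB/s the iSCSI figure comes to about 118.7.

  def effective_bytes_per_sec(line_rate_bps, payload, frame, coding=0.8):
      # line rate x payload efficiency x 8b/10b coding efficiency, in bytes/s
      return line_rate_bps * (payload / frame) * coding / 8

  fc = effective_bytes_per_sec(1.0625e9, 2048, 2168)    # Fibre Channel
  iscsi = effective_bytes_per_sec(1.25e9, 1460, 1538)   # iSCSI over GigE
  print("FC:    %.3f MB/s" % (fc / 1e6))                # 100.369
  print("iSCSI: %.2f MiB/s = %.2f MB/s"
        % (iscsi / 2**20, iscsi / 1e6))                 # 113.16 = 118.66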
Problems with iSCSI
• Limited performance because of:
  – Protocol overhead in TCP/IP
  – Interrupts generated for each network packet
  – Extra copies when sending and receiving data
iSCSI Adapter Implementations
• Software approach
  – Shows the best performance
  – Very competitive due to fast modern host CPUs
• Hardware approaches
  – Relatively slow CPU compared to the host CPU
  – Development speed is also slower than that of host CPUs
  – Performance improvement is limited without superior advances in embedded CPUs
  – Can show performance improvements in highly-loaded systems
Prasenjit Sarkar, Sandeep Uttamchandani, Kaladhar Voruganti, “Storage over IP: When Does Hardware Support Help?”, FAST 2003
iFCP (Internet Fibre Channel Protocol)
• iFCP is a gateway-to-gateway protocol for implementing a Fibre Channel fabric over a TCP/IP transport
• Allows users to interconnect FC devices over a TCP/IP network at any distance
• Traffic between Fibre Channel devices is routed and switched by the TCP/IP network
• iFCP maps each FC address to an IP address and each FC session to a TCP session
• FC messaging and routing services are terminated at the gateways, so the fabrics are not merged
• Use cases: data backup and replication
• mFCP uses UDP/IP
How does iFCP work?
Types of iFCP communication
FCIP (Fibre Channel over IP)
• A TCP/IP-based tunneling protocol that encapsulates Fibre Channel packets
• Allows users to interconnect FC devices over a TCP/IP network at any distance (same as iFCP)
• Merges the connected SANs into a single FC fabric
• Use cases: data backup and replication
• Gateways
  – Used to interconnect Fibre Channel SANs to the IP network
  – Set up connections between SANs, or between Fibre Channel devices and SANs
FCIP (Fibre Channel over IP)
Comparison between FCIP and iFCP
IP Storage Protocols: iSCSI, iFCP, and FCIP
RAS
• Reliability
  – The basic InfiniBand link connection comprises only four signal wires
  – IBA accommodates multiple ports for each I/O unit
  – IBA provides multiple CRCs
• Availability
  – An IBA fabric is inherently redundant, with multiple paths to sources assuring data delivery
  – IBA allows the network to heal itself if a link fails or is reporting errors
  – IBA has a many-to-many server-to-I/O relationship
• Serviceability
  – Hot-pluggable
  Feature                 InfiniBand         Fibre Channel   1 Gb & 10 Gb Ethernet   PCI-X
  Bandwidth               2.5, 10, 30 Gb/s   1, 2.1 Gb/s     1, 10 Gb/s              8.51 Gb/s
  Bandwidth, full-duplex  5, 20, 60 Gb/s     2.1, 4.2 Gb/s   2, 20 Gb/s              N/A
  Pin count               4, 16, 48          4               4 / 8                   90
  Media                   Copper/fiber       Copper/fiber    Copper/fiber            PCB
  Max length, copper      250 / 125 m        13 m            100 m                   inches
  Max length, fiber       10 km              km              km                      N/A
  Partitioning            X                  X               X                       N/A
  Scalable link width     X                  N/A             N/A                     N/A
  Max payload             4 KB               2 KB            1.5 KB                  No packets
A classification of storage systems
(warning – not comprehensive)
• Isolated
  – E.g., a laptop/PC with a local file system
  – We know how these work
  – File systems were first developed for centralized computer systems, as an OS facility providing a convenient programming interface to (disk) storage
  – They subsequently acquired features, such as access control and file locking, that made them useful for the sharing of data and programs
• Distributed
  – Why?
    • Sharing, scalability, mobility, fault tolerance, …
  – “Basic” distributed file system
    • Gives clients running on multiple computers the illusion of local storage when the data is actually spread across a network (usually a LAN)
    • Supports the sharing of information in the form of files, and of hardware resources in the form of persistent storage, throughout an intranet
  – Enhancements in various domains for “real-time” performance (multimedia), high failure resistance, high scalability (P2P), security, longevity (archival systems), mobility/disconnections, …
  – Remote objects to support distributed object-oriented programming
Storage systems and their properties

  Type                        Sharing   Persistence   Caching/replication   Consistency maintenance   Example
  Main memory                 No        No            No                    Strict one-copy           RAM
  File system                 No        Yes           No                    Strict one-copy           UNIX FS
  Distributed file system     Yes       Yes           Yes                   Yes (approx.)             NFS
  Web                         Yes       Yes           Yes                   Very approx./No           Web server
  Distributed shared memory   Yes       No            Yes                   Yes (approx.)             Ivy
  Remote objects (RMI/ORB)    Yes       No            No                    Strict one-copy           CORBA
  Persistent object store     Yes       Yes           No                    Strict one-copy           CORBA Persistent State Service
  P2P storage system          Yes       Yes           Yes                   Very approx.              OceanStore