TRANSCRIPT
Storage Systems
CSE 598d, Spring 2007
Lecture 15: Consistency Semantics, Introduction to Network-attached Storage
March 27, 2007
Agenda
• Last class
  – Consistency models: brief overview
• Next
  – More details on consistency models
  – Network storage introduction
    • NAS vs. SAN
    • DAFS
    • Some relevant technology and systems innovations: FC, smart NICs, RDMA, …
  – A variety of topics on file systems (and other storage-related software)
    • Log-structured file systems
    • Databases and file systems compared
    • Mobile/poorly connected systems, highly distributed & P2P storage
    • NFS, Google file system
    • Asynchronous I/O
    • Flash-based storage
    • Active disks, object-based storage devices (OSD)
    • Archival and secure storage
    • Storage virtualization and QoS
  – Reliability, (emerging) miniature storage devices
Problem Background and Definition
• Consistency issues were first studied in the context of shared-memory multi-processors, and we will start our discussion in the same context
  – The ideas generalize to any distributed system with shared storage
• The memory consistency model (MCM) of an SMP provides a formal specification of how the memory system will appear to the programmer
  – It places restrictions on the values that can be returned by a read in a shared-memory program execution
  – An MCM is a contract between the memory and the programmer
• Why different models?
  – Trade-offs between the “strictness” of consistency guarantees, implementation effort (hardware, compiler, programmer), and system performance
Atomic/Strict Consistency
• Most intuitive, naturally appealing
• Any read to a memory location x returns the value stored by the most recent write operation to x
• Defined w.r.t. a “global” clock
  – That is the only way “most recent” can be defined unambiguously
• Uni-processors typically observe such consistency
  – A programmer on a uni-processor naturally assumes this behavior
  – E.g., as a programmer, one would not expect the following code segment to print 1, or any value other than 2:
    A = 1; A = 2; print (A);
  – It is still possible for the compiler and hardware to improve throughput by re-ordering instructions
    • Atomic consistency is preserved as long as data and control dependencies are adhered to
• Often considered the base model (for evaluating the MCMs that we will see next)
Atomic/Strict Consistency (cont’d)
• What happens on a multi-processor?
  – Even on the smallest and fastest multi-processor, global time cannot be achieved!
  – Achieving atomic consistency is not possible
  – But this is not a hindrance, since programmers manage quite well with something weaker than atomic consistency
  – What behavior do we expect when we program on a multi-processor?
    • What we DO NOT expect: a global clock
    • What we expect:
      – Operations from a process will execute sequentially
        » Again: A = 1; A = 2; print (A) should not print 1
    • And then we can use critical-section/mutual-exclusion mechanisms to enforce the desired order among instructions coming from different processors
  – So we expect an MCM less strict than atomic consistency. What is this consistency model, what are its properties, and what does the hardware/software (compiler) have to do to provide it?
Sequential Consistency
• What we typically expect from a shared-memory multi-processor system is captured by sequential consistency
  – Lamport [1979]: A multi-processor is sequentially consistent if the result of any execution is the same as if
    • The operations of all the processors were executed in some sequential order
      – That is, memory accesses occur atomically w.r.t. other memory accesses
    • The operations of each individual processor appear in this sequence in the order specified by its program
  – Equivalently, any valid interleaving is acceptable as long as all processes see the same ordering of memory references
  – Programmer’s view: [figure] processes P1, P2, …, Pn all reading and writing a single shared memory
Example: Sequential Consistency
P1: W(x)1
P2: W(y)2
P3: R(y)2 R(x)0 R(x)1
• Not atomically consistent because:
  – R(y)2 by P3 reads a value that has not been written yet (w.r.t. a global clock)
  – W(x)1 and W(y)2 appear commuted at P3
• But sequentially consistent
  – SC doesn’t have the notion of a global clock
Example: Sequential Consistency (cont’d)
• What about?
P1: W(x)1
P2: W(y)2 R(y)2 R(x)0 R(x)1
P3: R(y)2 R(x)0 R(x)1
Example: Sequential Consistency (cont’d)
• And what about?
P1: W(x)1
P2: W(y)2 R(y)2 R(x)0 R(x)1
P3: R(x)1 R(y)0 R(y)2
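To make these examples concrete, here is a small Python sketch (mine, not from the lecture) that brute-force checks whether a slide-sized history is sequentially consistent: it enumerates every interleaving that preserves each process’s program order and asks whether at least one interleaving explains all the reads (every location starts at 0). The search is exponential, so it is only useful at this scale.

  # ops are ('W', var, val) or ('R', var, val); every variable starts at 0
  def interleavings(procs):
      if all(len(p) == 0 for p in procs):
          yield []
          return
      for i, p in enumerate(procs):
          if p:
              rest = [q[1:] if j == i else q for j, q in enumerate(procs)]
              for tail in interleavings(rest):
                  yield [p[0]] + tail

  def legal(seq):
      mem = {}
      for op, var, val in seq:
          if op == 'W':
              mem[var] = val
          elif mem.get(var, 0) != val:
              return False          # a read this order cannot explain
      return True

  def sequentially_consistent(procs):
      return any(legal(s) for s in interleavings(procs))

  # The first history above: P3 sees W(y)2 before W(x)1 -- SC holds (True)
  print(sequentially_consistent([
      [('W', 'x', 1)],
      [('W', 'y', 2)],
      [('R', 'y', 2), ('R', 'x', 0), ('R', 'x', 1)],
  ]))

Feeding in the two “what about?” histories the same way answers the slide’s questions mechanically.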
Causal Consistency
• Hutto and Ahamad, 1990
• Any two operations are either “causally related” or “concurrent”
  – When a processor performs a read followed later by a write, the two operations are said to be causally related, because the value stored by the write may have depended on the result of the read
  – A read operation is causally related to the earlier write that stored the data retrieved by the read
  – Transitivity applies
  – Operations that are not causally related are said to be concurrent
• A memory is causally consistent if all processors agree on the order of causally related writes
  – Weaker than SC, which requires all writes to be seen in the same order
P1: W(x)1 W(x)3
P2: R(x)1 W(x)2
P3: R(x)1 R(x)3 R(x)2
P4: R(x)1 R(x)2 R(x)3
W(x)1 and W(x)2 are causally related; W(x)2 and W(x)3 are not causally related (concurrent), so P3 and P4 may legally see them in different orders!
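One standard way to decide “causally related vs. concurrent” mechanically is with vector clocks. The sketch below is my illustration (the lecture does not prescribe a mechanism): each write is tagged with the writer’s vector clock, reads merge the writer’s clock into the reader’s, and two writes are causally related iff one clock is componentwise ≤ the other.

  # tag writes with vector clocks; w1 -> w2 iff clock(w1) < clock(w2)
  class Process:
      def __init__(self, pid, nprocs):
          self.pid, self.vc = pid, [0] * nprocs
      def write(self, var, val, store):
          self.vc[self.pid] += 1
          store[(var, val)] = list(self.vc)    # clock at the write
      def read(self, var, val, store):
          self.vc = [max(a, b) for a, b in zip(self.vc, store[(var, val)])]

  def causally_related(c1, c2):
      return all(a <= b for a, b in zip(c1, c2)) and c1 != c2

  store = {}
  P1, P2 = Process(0, 2), Process(1, 2)
  P1.write('x', 1, store)                      # W(x)1
  P2.read('x', 1, store)                       # R(x)1 makes ...
  P2.write('x', 2, store)                      # ... W(x)2 causally after W(x)1
  P1.write('x', 3, store)                      # W(x)3
  print(causally_related(store[('x', 1)], store[('x', 2)]))   # True
  print(causally_related(store[('x', 2)], store[('x', 3)]) or
        causally_related(store[('x', 3)], store[('x', 2)]))   # False: concurrent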
Summary: Uniform MCMs (roughly from strongest to weakest)
• Atomic consistency
• Sequential consistency
• Processor consistency / Causal consistency
• PRAM consistency / Cache consistency
• Slow memory
UNIX and Session Semantics
• UNIX file-sharing semantics on a uni-processor system
  – When a read follows a write, the read returns the value just written
  – When two writes happen in quick succession, followed by a read, the value read is that stored by the last write
• Problematic for a distributed system
  – Theoretically achievable with a single file server and no client caching
• Session semantics
  – Writes are made visible to others only upon the closing of a file
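A minimal sketch of session semantics (my toy model, not from the slides): each open() starts a session on a private copy of the file, and the session’s writes reach the shared server copy only at close().

  class FileServer:
      def __init__(self):
          self.files = {}                       # name -> committed contents
      def open(self, name):
          return Session(self, name, self.files.get(name, b""))

  class Session:
      def __init__(self, server, name, data):
          self.server, self.name = server, name
          self.data = bytearray(data)           # private working copy
      def write(self, data):
          self.data = bytearray(data)           # invisible to other sessions
      def close(self):
          self.server.files[self.name] = bytes(self.data)   # publish on close

  srv = FileServer()
  s1, s2 = srv.open("f"), srv.open("f")
  s1.write(b"hello")
  print(srv.files.get("f"))                     # None: s2 cannot see it yet
  s1.close()
  print(srv.open("f").data)                     # bytearray(b'hello')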
Delta Consistency
• Any write will become visible within at most Delta time units
  – Barring network latency
  – Meanwhile … all bets are off!
  – Push versus pull
  – Compare with sequential, causal, etc. in terms of valid orderings of operations
• Related: mutual consistency with parameter Delta
  – A given set of “objects” are within Delta time units of each other at all times, as seen by a client
  – Note that it is OK to be stale with respect to the server by more than Delta!
  – Generally, specify two parameters
    • Delta1: freshness w.r.t. the server
    • Delta2: mutual consistency of related objects
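A pull-style client cache providing delta consistency might look like the following sketch (my illustration; the names and the 30-second bound are made up): an entry may be served without revalidation until it is Delta old, so a write becomes visible to this client within at most Delta, plus network latency.

  import time

  DELTA = 30.0                          # freshness bound w.r.t. the server

  class DeltaCache:
      def __init__(self, fetch):
          self.fetch = fetch            # key -> value, fetched from the server
          self.entries = {}             # key -> (value, time fetched)
      def get(self, key):
          hit = self.entries.get(key)
          if hit and time.monotonic() - hit[1] < DELTA:
              return hit[0]             # possibly stale, but < DELTA old
          value = self.fetch(key)       # pull: revalidate on expiry
          self.entries[key] = (value, time.monotonic())
          return value

A push variant would instead have the server invalidate or update cached entries within Delta of each write.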
File System Consistency Semantics
• What is involved in providing these semantics?
• UNIX semantics: easy to implement on a uni-processor
• Session semantics: session state at the server
• Delta consistency: timeouts, leases
• Meta-data consistency
  – Some techniques we have seen
    • Journaling, LFS; meta-data journaling: ext3
    • Synchronous writes
    • NVRAM: expensive, not widely available
  – Disk-scheduler-enforced ordering!
    • The file system passes sequencing restrictions to the disk scheduler
    • Problem: the disk scheduler cannot enforce an ordering among requests not yet visible to it
  – Soft updates (sketched below)
    • Dependency information is maintained for meta-data blocks in the write-back cache, at a per-field and/or per-pointer granularity
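The essence of ordering-based meta-data consistency can be seen in a toy like the one below (my sketch; real soft updates track dependencies per field and roll records back and forth to break cycles): dirty blocks carry “must reach disk first” edges, and write-back flushes them in a dependency-respecting order.

  from graphlib import TopologicalSorter        # Python 3.9+

  def flush(dirty, deps):
      # deps: block -> set of blocks that must be on disk before it
      for block in TopologicalSorter(deps).static_order():
          if block in dirty:
              print("writing", block)

  # e.g., an inode must be initialized on disk before the directory
  # entry that points to it becomes durable
  deps = {"dir-entry-block": {"inode-block"}, "inode-block": set()}
  flush({"dir-entry-block", "inode-block"}, deps)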
Network-attached Storage
• Introduction to important ideas and technologies
• Lots of slides; we will cover some in class and post all of them on Angel
• Subsequent classes will cover some topics in depth
Direct Attached Storage
• Problems/shortcomings in enterprise/commercial settings
  – Sharing of data is difficult
  – Programming and client access are inconvenient
  – Wasted capacity
  – More?
“Remote” Storage
• Idea: Separate storage from the clients and application servers and locate it on the other side of a scalable networking infrastructure
  – Variants on this idea that we will see soon
• Advantages
  – Reduction in wasted capacity, by pooling devices and consolidating unused capacity formerly spread over many directly-attached storage devices
  – Reduced time to deploy new storage
    • Client software is designed to tolerate dynamic changes in network resources, but not changes to the local storage configuration while the client is operating
  – Backup made more convenient
    • Application server involvement removed
  – Management simplified by centralizing storage under a consolidated management interface
  – Availability improved (potentially)
    • All software and hardware is specifically developed and tested to run together
• Disadvantages
  – Complexity; more expertise needed
    • Implies more set-up and management cost
Network Attached Storage
• File interface exported to the rest of the network
Storage Area Network (SAN)
• Block interface exported to the rest of the network
SAN versus NAS
Source: Communications of the ACM, Vol. 43, No. 11, November 2000
Differences between NAS and SAN
• NAS
  – TCP/IP or UDP/IP protocols and Ethernet networks
  – High-level requests and responses for files
  – NAS devices translate file requests into operations on disk blocks
  – Cheaper
• SAN
  – Fibre Channel and SCSI
  – More scalable
  – Clients translate file accesses into operations on specific disk blocks
  – Operates at the data-block level
  – Expensive
  – Separates storage traffic from general network traffic
    • Beneficial for security and performance
NAS File Servers
• Pre-configured file servers
• Consist of one or more internal servers with pre-configured capacity
• Have a stripped-down OS; any component not associated with file services is discarded
  – OS stripping makes them more efficient than a general-purpose OS
• Connected via Ethernet to the LAN
• Have plug-and-play functionality
Source: Storage Networks Explained: Basics and Application of Fibre Channel SAN, NAS, iSCSI and InfiniBand, by Ulf Troppens, Rainer Erkens, and Wolfgang Mueller
NAS Network Performance
• NAS and traditional network file systems use IP-based protocols over NIC devices
• A consequence of this deployment is poor network performance
• The main culprits often cited include:
  – Protocol processing in network stacks
  – Memory copying
  – Kernel overhead, including system calls and context switches
NAS Network Performance
[Figure depicting sources of TCP/IP overhead]
NAS Network Performance
Protocol Processing
• Data transmission involves the OS services for memory and process management, the TCP/IP protocol stack, and the network device and its device driver
• The per-packet costs include the overhead to execute the TCP/IP protocol code, allocate and release memory buffers, and handle device interrupts for packet arrival and transmit completion
• The per-byte costs include the overheads to move data within the end-to-end system and to compute checksums to detect data corruption in the network
NAS Network Performance
Memory Copy
[Figure: current implementations of data transmission require the same data to be copied at several stages]
NAS Network Performance
• An NFS client requesting data stored on a NAS server with an internal SCSI disk would involve:
  – A hard disk to RAM transfer, using the SCSI, PCI, and system buses
  – A RAM to NIC transfer, using the system and PCI buses
• For traditional NFS this would further involve a transfer from application memory to the kernel buffer cache of the transmitting computer before forwarding to the network card
Accelerating Performance
• Two starting points to accelerate network file system performance:
  – The underlying communication protocol
    • TCP/IP was designed to provide a reliable framework for data exchange over an unreliable network; the TCP/IP stack is complex and CPU-intensive
    • Example alternative: VIA/RDMA
  – The network file system
    • Develop new network file systems that assume a reliable network connection; network file systems can then be modified to use thinner communication protocols
    • Example alternative: DAFS
Proposed Solutions
TCP/IP Offload Engines (TOEs)
• An increasing number of network adapters are able to compute the Internet checksum
• Some adapters can now perform TCP or UDP protocol processing
Copy Avoidance
• Several buffer management schemes have been proposed to either reduce or eliminate data copying (one example is sketched below)
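As one concrete copy-avoidance example (my sketch, not necessarily one of the schemes the slide alludes to): the sendfile() system call, exposed in Python as os.sendfile() on platforms that support it, lets the kernel move file pages directly to a socket, skipping the usual read()-into-user-buffer / write()-back-to-kernel round trip.

  import os, socket

  def serve_file(conn: socket.socket, path: str) -> None:
      """Send a whole file over a connected socket without copying it
      through a user-space buffer."""
      with open(path, "rb") as f:
          size = os.fstat(f.fileno()).st_size
          sent = 0
          while sent < size:
              # kernel-to-kernel transfer; returns the bytes actually sent
              sent += os.sendfile(conn.fileno(), f.fileno(), sent, size - sent)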
Proposed Solutions
Fibre Channel
• Fibre Channel reduces the communication overhead by offloading transport processing to the NIC instead of using the host processor
• Zero copying is facilitated by direct communication between host memory and the NIC device
Direct-Access Transport
• Requires NIC support for remote DMA
• User-level networking is made possible by a user-mode process interacting directly with the NIC to send or receive messages, with minimal kernel intervention
• Reliable message transport network
Proposed Solutions
NIC Support Mechanism
• The NIC device exposes an array of connection descriptors to the system’s physical address space
• At connection setup time, the network device driver maps a free descriptor into the user virtual address space
• This grants the user process direct and safe access to the NIC’s buffers and registers
• This facilitates user-level networking and copy avoidance
Proposed Solutions
User-Level File System
• Kernel policies for file system caching and prefetching do not favor some applications
• The migration of OS functions into user-level libraries allows user applications more control and specialization
• Clients would run in user mode as libraries linked directly with applications; this reduces the overhead due to system calls
• Clients may evolve independently of the operating system
• Clients could also run on any OS, with no special kernel support except the NIC device driver
Virtual Interface and RDMA
• The Virtual Interface Architecture (VIA) facilitates fast and efficient data exchange between applications running on different machines
• VIA reduces complexity by allowing applications (VI consumers) to communicate directly with the network card (VI NIC) via common memory areas, bypassing the operating system
• The VI provider is the NIC and its device driver
• RDMA is a communication model, supported on VIA, which allows applications to read and write memory areas of processes running on different computers
VI Architecture and RDMA
Source: Storage Networks Explained: Basics and Application of Fibre Channel SAN, NAS, iSCSI and InfiniBand, by Ulf Troppens, Rainer Erkens, and Wolfgang Mueller
Remote DMA (RDMA)
VIA Model
[Figure: two hosts connected by Myrinet NICs (LANai processors). Each host’s user address space holds a send buffer/descriptor and a receive buffer/descriptor; each NIC exposes send and receive doorbells and stages data packets in NIC memory. Numbered arrows trace a transfer from the sender’s buffer to the receiver’s.]
InfiniBand
• “Infinite Bandwidth”
• A switch-based I/O interconnect architecture
• Low-pin-count serial architecture
• The InfiniBand Architecture (IBA) defines a System Area Network (SAN)
  – An IBA SAN is a communications and management infrastructure for I/O and IPC
• IBA defines a switched communications fabric
  – High bandwidth and low latency
• Backed by top companies in the industry: Compaq, Dell, Hewlett-Packard, IBM, Intel, Microsoft, and Sun
Limits of the PCI Bus
• Peripheral Component Interconnect (PCI)
  – Introduced in 1992
  – Has become the standard bus architecture for servers
  – PCI bus
    • 32-bit/33 MHz -> 64-bit/66 MHz
  – PCI-X
    • The latest version is 64 bits wide at: PCI-X 66, PCI-X 133, PCI-X 266, and PCI-X 533 [4.3 GB/s]
  – Other PCI concerns include
    • Bus sharing
    • Bus speed
    • Scalability
    • Fault tolerance
PCI Express
• High-speed point-to-point architecture that is essentially a serialized, packetized version of PCI
• General-purpose serial I/O bus for chip-to-chip communication, USB 2.0 / IEEE 1394b interconnects, and high-end graphics – a viable AGP replacement
• Bandwidth: 4 Gigabit/second full duplex per lane
  – Up to 32 separate lanes – 128 Gigabit/second
• Software-compatible with the PCI device driver model
• Expected to coexist with, and not displace, technologies like PCI-X for the foreseeable future
Benefits of IBA
• Bandwidth
• An open and industry-inclusive standard
• Improved connection flexibility and scalability
• Improved reliability
• Offloads communications processing from the OS and CPU
• Wide access to a variety of storage systems
• Simultaneous device communication
• Built-in security and quality of service (QoS)
• Support for Internet Protocol version 6 (IPv6)
• Fewer and better-managed system interrupts
• Support for up to 64,000 addressable devices
• Support for copper cable and optical fiber
InfiniBand Components
• Host Channel Adapter (HCA)
  – An interface to a host; supports all software verbs
• Target Channel Adapter (TCA)
  – Provides the connection from InfiniBand to an I/O device
• Switch
  – Fundamental component of an IB fabric
  – Allows many HCAs and TCAs to connect to it, and handles network traffic
• Router
  – Forwards data packets from a local network to other external subnets
• Subnet Manager
  – An application responsible for configuring the local subnet and ensuring its continued operation
An IBA SAN
InfiniBand Layers
• Physical Layer

  Link   Pin Count   Signaling Rate   Data Rate   Full-Duplex Data Rate
  1x     4           2.5 Gb/s         2 Gb/s      4 Gb/s (500 MB/s)
  4x     16          10 Gb/s          8 Gb/s      16 Gb/s (2 GB/s)
  12x    48          30 Gb/s          24 Gb/s     48 Gb/s (6 GB/s)
InfiniBand Layers
• Link Layer
  – Central to the IBA; includes packet layout, point-to-point link operation, switching within a local subnet, and data integrity
  – Packets
    • Data and management packets
  – Switching
    • Data forwarding within a local subnet
  – QoS
    • Supported by virtual lanes: a virtual lane is a unique logical communication link that shares a single physical link
    • Up to 15 data virtual lanes per physical link, plus a management lane (VL0 – VL15)
    • Each packet is assigned a priority
  – Credit-based flow control
    • Used to manage data flow across a point-to-point link
  – Integrity checking using CRCs
InfiniBand Layers
• Network Layer
  – Responsible for routing packets from one subnet to another
  – The global route header (GRH) located within a packet includes an IPv6 address for the source and destination of each packet
• Transport Layer
  – Handles the order of packet delivery, as well as partitioning, multiplexing, and the transport services that determine reliable connections
InfiniBand Architecture
• The Queue Pair Abstraction
  – Two queues of communication meta-data (send & receive)
  – Registered buffers to send from / receive into
Source: “Architectural Interactions of I/O Networks and Inter-networks”, Philip Buonadonna, Intel Research & University of California, Berkeley
Direct Access File System
• A new network file system derived from NFS version 4
• Tailored to use remote DMA (RDMA), which requires the virtual interface (VI) framework
• Introduced to combine the low overhead of SAN products with the generality of NAS file servers
• Communication between a DAFS server and client is done through RDMA
• Client-side caching of locks for easier subsequent access to the same file
• Clients can be implemented as a shared library in user space, or in the kernel
DAFS Architecture
Source: Storage Networks Explained: Basics and Application of Fibre Channel SAN, NAS, iSCSI and InfiniBand, by Ulf Troppens, Rainer Erkens, and Wolfgang Mueller
Direct Access File System
DAFS Protocol
• Defined as a set of request and response formats and their semantics
• Defines recommended procedural APIs to access DAFS services from a client program
• Assumes a reliable network transport, and offers server-directed command flow
• Each operation is a separate request, but request chaining is also supported
• Defines features for session recovery, plus locking primitives
Direct Access File System
Direct Access Data Transfer
• Supports direct variants of data transfer operations such as read, write, setattr, etc.
• Direct transfer operations move data to and from client-provided memory using RDMA read and write operations
• The client registers each memory region with the local kernel before requesting direct I/O on that region
• The API defines register and unregister primitives for memory region management; register returns a region descriptor
• Registration issues a system call to pin buffer regions in physical memory, then loads page translations for the region into a lookup table on the NIC
Direct Access File System
RDMA Operations
• RDMA operations for direct I/O are initiated by the server
• A client write request to the server includes a region token for the buffer containing the data
• The server then issues an RDMA read to fetch the data from the client, and responds with a write-request response after RDMA completion
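This flow can be modeled in a few lines (my sketch; the names are invented, not the DAFS API): the client registers a buffer and passes only a small region token in its RPC, and the server pulls the payload itself with an RDMA read keyed by that token.

  import secrets

  class ClientNIC:
      def __init__(self):
          self.regions = {}                    # token -> registered buffer
      def register(self, buf):
          token = secrets.token_hex(8)         # stands in for a region descriptor
          self.regions[token] = buf            # a real NIC pins the pages and
          return token                         # loads their translations
      def rdma_read(self, token):
          return bytes(self.regions[token])    # server-initiated, no client CPU

  nic = ClientNIC()
  buf = bytearray(b"block to be written")
  tok = nic.register(buf)                      # client: register, send token in RPC
  payload = nic.rdma_read(tok)                 # server: RDMA read of the region
  print(payload)                               # ... then send the write response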
Direct Access File System
Asynchronous I/O and Prefetching
• Supports a fully asynchronous API, which enables clients to pipeline I/O operations and overlap them with application processing (sketched below)
• Event notification mechanisms deliver asynchronous completions, and a client may create several completion groups
• DAFS can be implemented as a user library to be linked with applications, or within the kernel
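The pipelining that such an asynchronous API enables looks roughly like this sketch (mine; it uses a thread pool where DAFS would post asynchronous operations and completion events): the next read is issued before the current block is processed, overlapping I/O with computation.

  from concurrent.futures import ThreadPoolExecutor

  def read_block(f, offset, size):
      f.seek(offset)
      return f.read(size)

  def process(block):
      return sum(block)                 # stand-in for application work

  def pipeline(path, block_size=1 << 20):
      total, offset = 0, block_size
      with open(path, "rb") as f, ThreadPoolExecutor(max_workers=1) as pool:
          future = pool.submit(read_block, f, 0, block_size)
          while True:
              block = future.result()
              if not block:
                  break
              # issue the next read before touching this block
              future = pool.submit(read_block, f, offset, block_size)
              offset += block_size
              total += process(block)   # overlaps with the read above
      return total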
Direct Access File System
[Figure depicting DAFS and NFS client architectures]
Source: http://www.eecs.harvard.edu/~vino/fs-perf/dafs.html
Direct Access File System
Server Design and Implementation
• The kernel server design is fashioned as an event-driven state transition diagram
• The main events triggering state transitions are: recv_done, send_done, and bio_done
Figure 1. An event-driven DAFS server
Source: http://www.eecs.harvard.edu/~vino/fs-perf/dafs.html
Direct Access File System
Event Handlers
• Each network or disk event is associated with a handler routine
• recv_done – a client-initiated transfer is complete. This signal is asserted by the NIC, and initiates the processing of an incoming RPC request
• send_done – a server-initiated transfer is complete. The handler for this signal releases all the locks involved in the RDMA operation and returns an RPC response
• bio_done – a block I/O request from disk is complete. This signal is raised by the disk controller, and wakes up any thread that is blocking on a previous disk I/O
Direct Access File System
Server Design and Implementation
• The server performs disk I/O using the zero-copy buffer cache interface
• This interface facilitates locking pages and their mappings
• Buffers involved in RDMA need to be locked for the entire duration of the transfer
• Transfers are initiated by RPC handlers, and processing is asynchronous
• The kernel buffer cache manager registers and de-registers buffer mappings with the NIC on the fly, as physical pages are added to or removed from the buffers
Direct Access File System
Server Design and Implementation
• The server creates multiple kernel threads to facilitate I/O concurrency
• A single listener thread monitors for new transport connections; other worker threads handle data transfer
• Arriving messages generate a recv_done interrupt, which is processed by a single handler for the completion group
• The handler queues up incoming RPC requests and invokes a worker thread to start data processing
• A thread locks all the necessary file pages in the buffer cache, creates RDMA descriptors, and issues RDMA operations
• After RDMA completion, a send_done signal is sent, which initiates the clean-up and release of all resources associated with the completed operation
Communication Alternatives
Source: Storage Networks Explained: Basics and Application of Fibre Channel SAN, NAS, iSCSI and InfiniBand, by Ulf Troppens, Rainer Erkens, and Wolfgang Mueller
Experimental Setup
Source: http://www.eecs.harvard.edu/~vino/fs-perf/dafs.html
Experimental Setup
System Configuration
• Pentium III 800 MHz clients and servers
• Server cache 1 GB, 133 MHz memory bus
• 9 GB disks, 10K RPM Seagate Cheetah, 64-bit/33 MHz PCI bus
• VI over Giganet cLAN 1000 adapter (DAFS)
• UDP/IP over Gigabit Ethernet, Alteon Tigon-II adapters (NFS)
Experimental Setup
• NFS block I/O transfer size is set at mount time
• Packets are sent as fragmented UDP packets
• Interrupt coalescing is set to high on the Tigon-II
• Checksum offloading is enabled on the Tigon-II
• NFS-nocopy required modifying the Tigon-II firmware, IP fragmentation code, file cache code, VM system, and Tigon-II driver, to facilitate header splitting and page remapping
Experimental Results
The table below shows the results for one-byte round-trip latency and bandwidth. The higher latency on the Tigon-II was due to the datapath crossing the kernel UDP/IP stack.
Experimental Results
Bandwidth and Overhead
• Server pre-warmed with a 768 MB dataset
• Designed to stress network data transfer; hence client caching is not considered
Sequential configuration
• The DAFS client utilized the asynchronous I/O API
• NFS had read-ahead enabled
Random configuration
• NFS was tuned for best-case performance at each request size by selecting a matching NFS transfer size
Experimental Results
[Bandwidth and overhead figures omitted]
Experimental Results
TPIE Merge
• The sequential record merge program combines n sorted input files, each consisting of x y-byte records, into a single sorted output file
• Depicts raw sequential I/O performance with varying amounts of processing
• Performance is limited by the client CPU
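The merge pattern itself is a few lines in Python (my sketch, not TPIE, which is a C++ external-memory library): heapq.merge lazily combines n sorted input streams into one sorted output stream, reading each input sequentially.

  import heapq

  def merge_files(in_paths, out_path):
      ins = [open(p) for p in in_paths]     # one sorted record per line
      try:
          with open(out_path, "w") as out:
              out.writelines(heapq.merge(*ins))
      finally:
          for f in ins:
              f.close()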
Experimental Results
[TPIE Merge figures omitted]
Experimental Results
PostMark
• A synthetic benchmark used to measure file system performance over workloads composed of many short-lived, relatively small files
• Creates a pool of files with random sizes, followed by a sequence of file operations (transactions); a toy version is sketched below
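A PostMark-flavored toy (my sketch; the real benchmark has many parameters and a create/delete plus read/append transaction mix) conveys the workload shape:

  import os, random, tempfile, time

  def postmark_lite(nfiles=100, ntrans=1000, lo=512, hi=16384):
      pool = tempfile.mkdtemp()
      paths = []
      for i in range(nfiles):               # initial pool of small files
          p = os.path.join(pool, "f%d" % i)
          with open(p, "wb") as f:
              f.write(os.urandom(random.randint(lo, hi)))
          paths.append(p)
      t0 = time.monotonic()
      for _ in range(ntrans):               # transaction mix
          p, r = random.choice(paths), random.random()
          if r < 0.25:
              with open(p, "wb") as f:      # re-create
                  f.write(os.urandom(random.randint(lo, hi)))
          elif r < 0.5:
              with open(p, "ab") as f:      # append
                  f.write(os.urandom(lo))
          else:
              with open(p, "rb") as f:      # read
                  f.read()
      return ntrans / (time.monotonic() - t0)

  print("%.0f transactions/sec" % postmark_lite())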
Experimental Results
Berkeley DB
• A synthetic workload composed of read-only transactions, each processing one small record chosen at random from a B-tree
Disk Storage Interfaces
• Parallel ATA (IDE, E-IDE)
• Serial ATA (SATA)
• Small Computer System Interface (SCSI)
• Serial Attached SCSI (SAS)
• Fibre Channel (FC)
Source: “It’s More Than the Interface”, Gordy Lutz, Seagate, August 2002
Parallel ATA
• 16-bit bus
• Two bytes per bus transaction
• 40-pin connector
• Master/slave shared bus
• Bandwidth:
  25 MHz strobe
  × 2 for double-data-rate clocking
  × 16 bits per edge
  ÷ 8 bits per byte
  = 100 MBytes/sec
Serial ATA (SATA)
• 7-pin connector
• Point-to-point connections for dedicated bandwidth
• Bit-by-bit transmission
  – One signal path for data transmission
  – The other signal path for acknowledgement
• Bandwidth:
  1500 MHz embedded clock
  × 1 bit per clock
  × 80% for 8b/10b encoding
  ÷ 8 bits per byte
  = 150 MBytes/sec
• 2002 -> 150 MB/sec
• 2004 -> 300 MB/sec
• 2007 -> 600 MB/sec
8b/10b Encoding
• IBM patent
• Used in SATA, SAS, FC, and InfiniBand
• Converts 8 bits of data into 10-bit codes
• Provides better synchronization than Manchester encoding
Small Computer Systems Interface (SCSI)
• SCSI targets the high-performance storage market
• SCSI-1 proposed in 1986
• Parallel interface
• Maximum cabling distance is 12 meters
• Terminators required
• Bus width is 8 bits (narrow)
• 16 devices per bus
• A device with a higher priority wins bus arbitration
SCSI (cont’d)
• Peer-to-peer connection (channel)
• 50/68 pins
• Hot repair not provided
• Multiple buses needed beyond 16 devices
• Low bandwidth
• Distance limitation
SCSI Roadmap
• Wide SCSI (16-bit bus)
• Fast SCSI (double data rate)
Serial Attached SCSI (SAS)
• ANSI standard in 2003
• Interoperability with SATA
• Full-duplex
• Dual-port
• 128 devices
• 10 meters
Dual Port
• ATA, SCSI, and SATA support a single port
  – The controller is a single point of failure
• SAS and FC support dual ports
SAS Roadmap
http://www.scsita.org/aboutscsi/sas/SAS_roadmap2004.html
Fibre Channel (FC)
• Developed as a backbone technology for LANs
• The name is a misnomer
  – Runs on copper also
  – 4-wire cable or fiber optic
• 10 km or less per link
• 126 devices per loop
• No terminators
• Installed base of Fibre Channel devices*
  – $2.45 billion in FC HBAs in 2005
  – $5.4 billion in FC switches in 2005
*Source: Gartner, Dec 13, 2001 (projections)
FC (cont’d)
• Advantages
  – High bandwidth
  – Secure
  – Zero-copy send and receive
  – Low host CPU utilization
  – FCP (Fibre Channel Protocol)
• Disadvantages
  – Not a wide-area network
  – Separate physical network infrastructure
  – Expensive
  – Different management mechanisms
  – Interoperability problems across different vendors
Fibre Channel Topologies
Ulf Troppens, Rainer Erkens and Wolfgang Muller, Storage Networks Explained
Fibre Channel Ports
• N-Port: node port
• F-Port: fabric port
• L-Port: loop port
  – Connects only to an arbitrated loop (AL)
• E-Port: expansion port
  – Connects two switches
• G-Port: generic port
• B-Port: bridge port
  – Bridges to other networks (IP, ATM, etc.)
• NL-Port: node loop port
  – Can connect both to a fabric and to an AL
• FL-Port: fabric loop port
  – Enables a fabric to connect to a loop
Ulf Troppens, Rainer Erkens and Wolfgang Muller, Storage Networks Explained
Arbitrated Loop in FC
Ulf Troppens, Rainer Erkens and Wolfgang Muller, Storage Networks Explained
Routing Mechanisms in a Switch
• Store-and-forward routing
• Cut-through routing
William James Dally and Brian Towles, Principles and Practices of Interconnection Networks, Chapter 13
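The latency difference between the two is easy to estimate with the standard textbook model (the formulas follow Dally & Towles; the example numbers are mine): store-and-forward pays the full packet serialization delay at every hop, while cut-through pays it only once.

  def store_and_forward(hops, pkt_bits, bw_bps, route_s=0.0):
      # the whole packet is received and retransmitted at each hop
      return hops * (route_s + pkt_bits / bw_bps)

  def cut_through(hops, pkt_bits, bw_bps, route_s=0.0, hdr_bits=0):
      # forwarding starts as soon as the header has been routed
      return hops * (route_s + hdr_bits / bw_bps) + pkt_bits / bw_bps

  # 5 hops, 2 KB packet, 1 Gb/s links
  print(store_and_forward(5, 2048 * 8, 1e9))    # ~82 microseconds
  print(cut_through(5, 2048 * 8, 1e9))          # ~16 microseconds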
Fibre Channel Hub and Switch
• Switch
  – Thousands of connections
  – Bandwidth per device is nearly constant
  – Aggregate bandwidth increases with increased connectivity
  – Deterministic latency
• Hub
  – 126 devices
  – Bandwidth per device diminishes with increased connectivity
  – Aggregate bandwidth is constant with increased connectivity
  – Latency increases as the number of devices increases
Fibre Channel Structure
Fibre Channel Bandwidth
• Clock rate is 1.0625 GHz
• 1.0625 [Gbps] × 2048/2168 [payload/(payload+overhead)] × 0.8 [8b/10b] ÷ 8 [bits/byte] = 100.369 MB/s
Cable types in FC
FC Roadmap

  Product Naming   Throughput (MB/s)   T11 Spec Completed (Year)   Market Availability (Year)
  1GFC             200                 1996                        1997
  2GFC             400                 2000                        2001
  4GFC             800                 2003                        2005
  8GFC             1,600               2006                        2008
  16GFC            3,200               2009                        2011
  32GFC            6,400               2012                        Market demand
  64GFC            12,800              2016                        Market demand
  128GFC           25,600              2020                        Market demand

http://www.fibrechannel.org/OVERVIEW/Roadmap.html
Interface Comparison
Market Segments
Source: “It’s More Than the Interface”, Seagate, 2003
Interface Trends – Previous
Source: “It’s More Than the Interface”, Seagate, 2003
Interface Trends – Today and Tomorrow
Source: “It’s More Than the Interface”, Seagate, 2003
IP Storage
IP Storage (cont’d)
• TCP/IP is used as a storage interconnect to transfer block-level data
• IETF IP Storage (IPS) working group
• iSCSI, iFCP, and FCIP protocols
• Cheaper
• Provides one technology for a client to connect to servers and storage devices
• Increases operating distances
• Improves availability of storage systems
• Can utilize network management tools
Source: “It’s More Than the Interface”, Seagate, 2003
iSCSI (Internet SCSI)
• iSCSI is a transport for SCSI commands
  – iSCSI is an end-to-end protocol
  – iSCSI can be implemented on desktops, laptops, and servers
  – iSCSI can be implemented with current TCP/IP stacks
  – iSCSI can be implemented completely in an HBA
• Overcomes the distance limitation
• Cost-effective
Protocol Stack - iSCSI
Packet and Bandwidth – iSCSI
• Per-packet overhead: 78 bytes
  – 14 (Ethernet) + 20 (IP) + 20 (TCP) + 4 (CRC) + 20 (interframe gap)
  – The iSCSI header adds 48 bytes per SCSI command
• 1.25 [Gbps] × 1460/1538 [payload/(payload+overhead)] × 0.8 [8b/10b] ÷ 8 [bits/byte] = 113.16 MB/s
• Bi-directional payload bandwidth: 220.31 MB/s
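A quick script reproduces both effective-bandwidth figures and exposes a unit subtlety (my observation, not stated on the slides): the Fibre Channel slide’s 100.369 matches decimal MB/s (10^6 bytes), while the 113.16 here matches MiB/s (2^20 bytes); in decimal MB/s the iSCSI figure comes to about 118.7.

  def effective_bytes_per_sec(line_rate_bps, payload, frame, coding=0.8):
      # line rate x payload efficiency x 8b/10b coding efficiency, in bytes/s
      return line_rate_bps * (payload / frame) * coding / 8

  fc = effective_bytes_per_sec(1.0625e9, 2048, 2168)    # Fibre Channel
  iscsi = effective_bytes_per_sec(1.25e9, 1460, 1538)   # iSCSI over GigE
  print("FC:    %.3f MB/s" % (fc / 1e6))                # 100.369
  print("iSCSI: %.2f MiB/s = %.2f MB/s"
        % (iscsi / 2**20, iscsi / 1e6))                 # 113.16 = 118.66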
Problems with iSCSI
• Limited performance because of:
  – Protocol overhead in TCP/IP
  – Interrupts generated for each network packet
  – Extra copies when sending and receiving data
iSCSI Adapter Implementations
• Software approach
  – Shows the best performance
  – Very competitive due to fast modern host CPUs
• Hardware approaches
  – Relatively slow CPU compared to the host CPU
  – Development speed is also slower than that of host CPUs
  – Performance improvement is limited without superior advances in embedded CPUs
  – Can show performance improvements in highly-loaded systems
Prasenjit Sarkar, Sandeep Uttamchandani, Kaladhar Voruganti, “Storage over IP: When Does Hardware Support Help?”, FAST 2003
iFCP (Internet Fibre Channel Protocol)
• iFCP is a gateway-to-gateway protocol for implementing a Fibre Channel fabric over a TCP/IP transport
• Allows users to interconnect FC devices over a TCP/IP network at any distance
• Traffic between Fibre Channel devices is routed and switched by the TCP/IP network
• iFCP maps each FC address to an IP address and each FC session to a TCP session
• FC messaging and routing services are terminated at the gateways, so the fabrics are not merged
• Use cases: data backup and replication
• mFCP uses UDP/IP
How does iFCP work?
Types of iFCP communication
FCIP (Fibre Channel over IP)
• A TCP/IP-based tunneling protocol that encapsulates Fibre Channel packets
• Allows users to interconnect FC devices over a TCP/IP network at any distance (same as iFCP)
• Merges the connected SANs into a single FC fabric
• Use cases: data backup and replication
• Gateways
  – Used to interconnect Fibre Channel SANs to the IP network
  – Set up connections between SANs, or between Fibre Channel devices and SANs
FCIP (Fibre Channel over IP)
Comparison between FCIP and iFCP
IP Storage Protocols: iSCSI, iFCP, and FCIP
RAS
• Reliability
  – The basic InfiniBand link connection comprises only four signal wires
  – IBA accommodates multiple ports for each I/O unit
  – IBA provides multiple CRCs
• Availability
  – An IBA fabric is inherently redundant, with multiple paths to sources assuring data delivery
  – IBA allows the network to heal itself if a link fails or is reporting errors
  – IBA has a many-to-many server-to-I/O relationship
• Serviceability
  – Hot-pluggable
  Feature                 InfiniBand         Fibre Channel   1 Gb & 10 Gb Ethernet   PCI-X
  Bandwidth               2.5, 10, 30 Gb/s   1, 2.1 Gb/s     1, 10 Gb/s              8.51 Gb/s
  Bandwidth, full-duplex  5, 20, 60 Gb/s     2.1, 4.2 Gb/s   2, 20 Gb/s              N/A
  Pin count               4, 16, 48          4               4 / 8                   90
  Media                   Copper/fiber       Copper/fiber    Copper/fiber            PCB
  Max length, copper      250 / 125 m        13 m            100 m                   inches
  Max length, fiber       10 km              km              km                      N/A
  Partitioning            X                  X               X                       N/A
  Scalable link width     X                  N/A             N/A                     N/A
  Max payload             4 KB               2 KB            1.5 KB                  No packets
A classification of storage systems
(warning – not comprehensive)
• Isolated
  – E.g., a laptop/PC with a local file system
  – We know how these work
  – File systems were first developed for centralized computer systems, as an OS facility providing a convenient programming interface to (disk) storage
  – They subsequently acquired features, such as access control and file locking, that made them useful for the sharing of data and programs
• Distributed
  – Why?
    • Sharing, scalability, mobility, fault tolerance, …
  – “Basic” distributed file system
    • Gives clients running on multiple computers the illusion of local storage when the data is actually spread across a network (usually a LAN)
    • Supports the sharing of information in the form of files, and of hardware resources in the form of persistent storage, throughout an intranet
  – Enhancements in various domains for “real-time” performance (multimedia), high failure resistance, high scalability (P2P), security, longevity (archival systems), mobility/disconnections, …
  – Remote objects to support distributed object-oriented programming
Storage systems and their properties

  Type                        Sharing   Persistence   Caching/replication   Consistency maintenance   Example
  Main memory                 No        No            No                    Strict one-copy           RAM
  File system                 No        Yes           No                    Strict one-copy           UNIX FS
  Distributed file system     Yes       Yes           Yes                   Yes (approx.)             NFS
  Web                         Yes       Yes           Yes                   Very approx./No           Web server
  Distributed shared memory   Yes       No            Yes                   Yes (approx.)             Ivy
  Remote objects (RMI/ORB)    Yes       No            No                    Strict one-copy           CORBA
  Persistent object store     Yes       Yes           No                    Strict one-copy           CORBA Persistent State Service
  P2P storage system          Yes       Yes           Yes                   Very approx.              OceanStore