
This paper is included in the Proceedings of the 12th USENIX Conference on File and Storage Technologies (FAST '14), February 17–20, 2014, Santa Clara, CA, USA.

ISBN 978-1-931971-08-9

Strata: High-Performance Scalable Storage on Virtualized Non-volatile Memory

Brendan Cully, Jake Wires, Dutch Meyer, Kevin Jamieson, Keir Fraser, Tim Deegan, Daniel Stodden, Geoffrey Lefebvre, Daniel Ferstay, and Andrew Warfield, Coho Data

https://www.usenix.org/conference/fast14/technical-sessions/presentation/cully


Strata: Scalable High-Performance Storage on Virtualized Non-volatile Memory

Brendan Cully, Jake Wires, Dutch Meyer, Kevin Jamieson, Keir Fraser, Tim Deegan, Daniel Stodden, Geoffrey Lefebvre, Daniel Ferstay, and Andrew Warfield

Coho Data
{firstname.lastname}@cohodata.com

Abstract

Strata is a commercial storage system designed around the high performance density of PCIe flash storage. We observe a parallel between the challenges introduced by this emerging flash hardware and the problems that were faced with underutilized server hardware about a decade ago. Borrowing ideas from hardware virtualization, we present a novel storage system design that partitions functionality into an address virtualization layer for high performance network-attached flash, and a hosted environment for implementing scalable protocol implementations. Our system targets the storage of virtual machine images for enterprise environments, and we demonstrate dynamic scale to over a million IO operations per second using NFSv3 in 13u of rack space, including switching.

1 Introduction

Flash-based storage devices are fast, expensive and demanding: a single device is capable of saturating a 10Gb/s network link (even for random IO), consuming significant CPU resources in the process. That same device may cost as much as (or more than) the server in which it is installed¹. The cost and performance characteristics of fast, non-volatile media have changed the calculus of storage system design and present new challenges for building efficient and high-performance datacenter storage.

This paper describes the architecture of a commercial flash-based network-attached storage system, built using commodity hardware. In designing the system around PCIe flash, we begin with two observations about the effects of high-performance drives on large-scale storage systems. First, these devices are fast enough that in most environments, many concurrent workloads are needed to fully saturate them, and even a small degree of processing overhead will prevent full utilization. Thus, we must change our approach to the media from aggregation to virtualization. Second, aggregation is still necessary to achieve properties such as redundancy and scale. However, it must avoid the performance bottleneck that would result from the monolithic controller approach of a traditional storage array, which is designed around the obsolete assumption that media is the slowest component in the system. Further, to be practical in existing datacenter environments, we must remain compatible with existing client-side storage interfaces and support standard enterprise features like snapshots and deduplication.

¹Enterprise-class PCIe flash drives in the 1TB capacity range currently carry list prices in the range of $3-5K USD. Large-capacity, high-performance cards are available for list prices of up to $160K.

In this paper we explore the implications of these two observations on the design of a scalable, high-performance NFSv3 implementation for the storage of virtual machine images. Our system is based on the building blocks of PCIe flash in commodity x86 servers connected by 10 gigabit switched Ethernet. We describe two broad technical contributions that form the basis of our design:

1. A delegated mapping and request dispatch interface from client data to physical resources through global data address virtualization, which allows clients to directly address data while still providing the coordination required for online data movement (e.g., in response to failures or for load balancing).

2. SDN-assisted storage protocol virtualization that allows clients to address a single virtual protocol gateway (e.g., NFS server) that is transparently scaled out across multiple real servers. We have built a scalable NFS server using this technique, but it applies to other protocols (such as iSCSI, SMB, and FCoE) as well.

At its core, Strata uses device-level object storage and dynamic, global address-space virtualization to achieve a clean and efficient separation between control and data paths in the storage system.


Protocol Virtualization Layer (§6): Scalable Protocol Presentation.
  Responsibility: allow the transparently scalable implementation of traditional IP- and Ethernet-based storage protocols.
  Implementation in Strata: Scalable NFSv3, which presents a single external NFS IP address and integrates with the SDN switch to transparently scale and manage connections across controller instances hosted on each microArray.

Global Address Space Virtualization Layer (§3, §5): Delegated Data Paths.
  Responsibility: compose device-level objects into richer storage primitives; allow clients to dispatch requests directly to NADs while preserving centralized control over placement, reconfiguration, and failure recovery.
  Implementation in Strata: libDataPath. The NFSv3 instance on each microarray links against it as a dispatch library; data path descriptions are read from a cluster-wide registry and instantiated as dispatch state machines, through which NFS forwards requests, interacting directly with NADs. Central services update data paths in the face of failure, etc.

Device Virtualization Layer (§4): Network Attached Disks (NADs).
  Responsibility: virtualize a PCIe flash device into multiple address spaces and allow direct client access with controlled sharing.
  Implementation in Strata: CLOS (Coho Log-structured Object Store), which implements a flat object store, virtualizing the PCIe flash device's address space, and presents an OSD-like interface to clients.

Figure 1: Strata network storage architecture. Each layer is listed with its name, core abstraction, and responsibility, along with its implementation in Strata.

Flash devices are split into virtual address spaces using an object storage-style interface, and clients are then allowed to directly communicate with these address spaces in a safe, low-overhead manner. In order to compose richer storage abstractions, a global address space virtualization layer allows clients to aggregate multiple per-device address spaces with mappings that achieve properties such as striping and replication. These delegated address space mappings are coordinated in a way that preserves direct client communications with storage devices, while still allowing dynamic and centralized control over data placement, migration, scale, and failure response.

Serving this storage over traditional protocols like NFS imposes a second scalability problem: clients of these protocols typically expect a single server IP address, which must be dynamically balanced over multiple servers to avoid being a performance bottleneck. In order to both scale request processing and to take advantage of full switch bandwidth between clients and storage resources, we developed a scalable protocol presentation layer that acts as a client to the lower layers of our architecture, and that interacts with a software-defined network switch to scale the implementation of the protocol component of a storage controller across arbitrarily many physical servers. By building protocol gateways as clients of the address virtualization layer, we preserve the ability to delegate scale-out access to device storage without requiring interface changes on the end hosts that consume the storage.

2 Architecture

The performance characteristics of emerging storage hardware demand that we completely reconsider storage architecture in order to build scalable, low-latency shared persistent memory. The reality of deployed applications is that interfaces must stay exactly the same in order for a storage system to have relevance. Strata's architecture aims to take a step toward the first of these goals, while keeping a pragmatic focus on the second.

Figure 1 characterizes the three layers of Strata's architecture. The goals and abstractions of each layer of the system are on the left-hand column, and the concrete embodiment of these goals in our implementation is on the right. At the base, we make devices accessible over an object storage interface, which is responsible for virtualizing the device's address space and allowing clients to interact with individual virtual devices. This approach reflects our view that system design for these storage devices today is similar to that of CPU virtualization ten years ago: devices provide greater performance than is required by most individual workloads and so require a lightweight interface for controlled sharing in order to allow multi-tenancy. We implement a per-device object store that allows a device to be virtualized into an address space of 2^128 sparse objects, each of which may be up to 2^64 bytes in size. Our implementation is similar in intention to the OSD specification, itself motivated by network attached secure disks [17]. While not broadly deployed to date, device-level object storage is receiving renewed attention today through pNFS's use of OSD as a backend, the NVMe namespace abstraction, and in emerging hardware such as Seagate's Kinetic drives [37]. Our object storage interface as a whole is not a significant technical contribution, but it does have some notable interface customizations described in Section 4. We refer to this layer as a Network Attached Disk, or NAD.

The middle layer of our architecture provides a global address space that supports the efficient composition of IO processors that translate client requests on a virtual object into operations on a set of NAD-level physical objects. We refer to the graph of IO processors for a particular virtual object as its data path, and we maintain the description of the data path for every object in a global virtual address map. Clients use a dispatch library to instantiate the processing graph described by each data path and perform direct IO on the physical objects at the leaves of the graph. The virtual address map is accessed through a coherence protocol that allows central services to update the data paths for virtual objects while they are in active use by clients. More concretely, data paths allow physical objects to be composed into richer storage primitives, providing properties such as striping and replication. The goal of this layer is to strike a balance between scalability and efficiency: it supports direct client access to device-level objects, without sacrificing central management of data placement, failure recovery, and more advanced storage features such as deduplication and snapshots.
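To make this concrete, the short Python sketch below illustrates the division of labour: the control path resolves a virtual object to its data path description in the virtual address map, after which IO is issued directly to the NADs named at the leaves. All names and the map layout are hypothetical simplifications, not Strata's actual library API.

# Sketch: a dispatch-library client resolves a virtual object once through
# the virtual address map, then performs IO directly against NAD-level
# physical objects. Names and structures are illustrative only.

VIRTUAL_ADDRESS_MAP = {
    # virtual object id -> data path description (greatly simplified)
    "vm-disk-7": {"type": "replicate", "replicas": [("nad-a", 103), ("nad-b", 204)]},
}

class DirectClient:
    def __init__(self, address_map):
        self.address_map = address_map

    def open(self, virtual_id):
        # Control path: read the data path description from the map.
        return self.address_map[virtual_id]

    def write(self, path, offset, data):
        # Data path: requests go straight to the devices named in the map;
        # no central controller sits between the client and the NADs.
        for nad, oid in path["replicas"]:
            self._send(nad, ("WRITE", oid, offset, data))

    def _send(self, nad, request):
        print(f"{nad} <- {request}")  # stand-in for a direct RPC to the NAD

client = DirectClient(VIRTUAL_ADDRESS_MAP)
path = client.open("vm-disk-7")
client.write(path, offset=0, data=b"hello")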

Finally, the top layer performs protocol virtualization to allow clients to access storage over standard protocols (such as NFS) without losing the scalability of direct requests from clients to NADs. The presentation layer is tightly integrated with a 10Gb software-defined Ethernet switching fabric, allowing external clients the illusion of connecting to a single TCP endpoint, while transparently and dynamically balancing traffic to that single IP address across protocol instances on all of the NADs. Each protocol instance is a thin client of the layer below, which may communicate with other protocol instances to perform any additional synchronization required by the protocol (e.g., to maintain NFS namespace consistency).

The mapping of these layers onto the hardware that our system uses is shown in Figure 2. Requests travel from clients into Strata through an OpenFlow-enabled switch, which dispatches them according to load to the appropriate protocol handler running on a MicroArray (µArray), a small host configured with flash devices and enough network and CPU to saturate them, containing the software stack representing a single NAD. For performance, each of the layers is implemented as a library, allowing a single process to handle the flow of requests from client to media. The NFSv3 implementation acts as a client of the underlying dispatch layer, which transforms requests on virtual objects into one or more requests on physical objects, issued through function calls to local physical objects and by RPC to remote objects. While the focus of the rest of this paper is on this concrete implementation of scale-out NFS, it is worth noting that the design is intended to allow applications the opportunity to link directly against the same data path library that the NFS implementation uses, resulting in a multi-tenant, multi-presentation storage system with a minimum of network and device-level overhead.

Figure 2: Hardware view of a Strata deployment. Three VMware ESX hosts connect to a single virtual NFS server address (10.150.1.1) through a 10Gb SDN switch. Each microArray behind the switch runs an NFS instance (protocol virtualization, Scalable NFSv3), libDataDispatch (global address space virtualization), and CLOS (device virtualization). Arrows in the figure show NFS connections and associated requests; the middle host's connection is omitted for clarity.

2.1 Scope of this Work

There are three aspects of our design that are not considered in detail within this presentation. First, we only discuss NFS as a concrete implementation of protocol virtualization. Strata has been designed to host and support multiple protocols and tenants, but our initial product release is specifically NFSv3 for VMware clients, so we focus on this type of deployment in describing the implementation. Second, Strata was initially designed to be a software layer that is co-located on the same physical servers that host virtual machines. We have moved to a separate physical hosting model where we directly build on dedicated hardware, but there is nothing that prevents the system from being deployed in a more co-located (or "converged") manner. Finally, our full implementation incorporates a tier of spinning disks on each of the storage nodes to allow cold data to be stored more economically behind the flash layer. However, in this paper we configure and describe a single-tier, all-flash system to simplify the exposition.

In the next sections we discuss three relevant aspects of Strata (address space virtualization, dynamic reconfiguration, and scalable protocol support) in more detail. We then describe some specifics of how these three components interact in our NFSv3 implementation for VM image storage before providing a performance evaluation of the system as a whole.


3 Data Paths

Strata provides a common library interface to data that underlies the higher-level, client-specific protocols described in Section 6. This library presents a notion of virtual objects, which are available cluster-wide and may comprise multiple physical objects bundled together for parallel data access, fault tolerance, or other reasons (e.g., data deduplication). The library provides a superset of the object storage interface provided by the NADs (Section 4), with additional interfaces to manage the placement of objects (and ranges within objects) across NADs, to maintain data invariants (e.g., replication levels and consistent updates) when object ranges are replicated or striped, and to coordinate both concurrent access to data and concurrent manipulation of the virtual address maps describing their layout.

To avoid IO bottlenecks, users of the data path interface (which may be native clients or protocol gateways such as our NFS server) access data directly. To do so, they map requests from virtual objects to physical objects using the virtual address map. This is not simply a pointer from a virtual object (id, range) pair to a set of physical object (id, range) pairs. Rather, each virtual range is associated with a particular processor for that range, along with processor-specific context. Strata uses a dispatch-oriented programming model in which a pipeline of operations is performed on requests as they are passed from an originating client, through a set of transformations, and eventually to the appropriate storage device(s). Our model borrows ideas from packet processing systems such as X-Kernel [19], Scout [25], and Click [21], but adapts them to a storage context, in which modules along the pipeline perform translations through a set of layered address spaces, and may fork and/or collect requests and responses as they are passed.
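The dispatch-oriented model can be sketched as a small processor hierarchy. The Python below is illustrative only (the class names and interface are assumptions, not the real processor API): each processor transforms a request on the way down and gathers responses from its children.

# Sketch of the dispatch-oriented model: each processor transforms a
# request and forwards it to one or more children; leaves stand in for
# NAD-level objects. Illustrative only.

class Processor:
    def __init__(self, children=()):
        self.children = list(children)

    def submit(self, request):
        # Default behaviour: pass the request through unchanged and
        # gather the responses from every child.
        responses = []
        for child in self.children:
            responses.extend(child.submit(request))
        return responses

class OffsetTranslator(Processor):
    """Example translation step: shift requests into a child address space."""
    def __init__(self, base, children):
        super().__init__(children)
        self.base = base

    def submit(self, request):
        translated = dict(request, offset=request["offset"] + self.base)
        return super().submit(translated)

class DeviceLeaf(Processor):
    """Terminates the pipeline at a (hypothetical) NAD object."""
    def __init__(self, name):
        super().__init__()
        self.name = name

    def submit(self, request):
        return [f"{self.name}: {request['op']} {request['len']}B at offset {request['offset']}"]

pipeline = OffsetTranslator(4096, [DeviceLeaf("nad-a/obj-103")])
print(pipeline.submit({"op": "read", "offset": 0, "len": 4096}))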

The dispatch library provides a collection of request processors, which can stand alone or be combined with other processors. Each processor takes a storage request (e.g., a read or write request) as input and produces one or more requests to its children. NADs expose isolated sparse objects; processors perform translations that allow multiple objects to be combined for some functional purpose, and present them as a single object, which may in turn be used by other processors. The idea of request-based address translation to build storage features has been used in other systems [24, 35, 36], often as the basis for volume management; Strata disentangles it from the underlying storage system and treats it as a first-class dispatch abstraction.

The composition of dispatch modules bears similarity to Click [21], but the application in a storage domain carries a number of differences. First, requests are generally acknowledged at the point that they reach a storage device, and so as a result they differ from packet forwarding logic in that they travel both down and then back up through a dispatch stack; processors contain logic to handle both requests and responses. Second, it is common for requests to be split or merged as they traverse a processor; for example, a replication processor may duplicate a request and issue it to multiple nodes, and then collect all responses before passing a single response back up to its parent. Finally, while processors describe fast, library-based request dispatching logic, they typically depend on additional facilities from the system. Strata allows processor implementations access to APIs for shared, cluster-wide state which may be used on a control path to, for instance, store replica configuration. It additionally provides facilities for background functionality such as NAD failure detection and response. The intention of the processor organization is to allow dispatch decisions to be pushed out to client implementations and be made with minimal performance impact, while still benefiting from common system-wide infrastructure for maintaining the system and responding to failures. The responsibilities of the dispatch library are described in more detail in the following subsections.

3.1 The Virtual Address Map

/objects/112:
    type=regular
    dispatch={object=111 type=dispatch}

/objects/111:
    type=dispatch
    stripe={stripecount=8 chunksize=524288
            0={object=103 type=dispatch}
            1={object=104 type=dispatch}}

/objects/103:
    type=dispatch
    rpl={policy=mirror storecount=2
         {storeid=a98f2... state=in-sync}
         {storeid=fc89f... state=in-sync}}

Figure 3: Virtual object to physical object range mapping

Figure 3 shows the relevant information stored in the virtual address map for a typical object. Each object has an identifier, a type, some type-specific context, and may contain other metadata such as cached size or modification time information (which is not canonical, for reasons discussed below).

The entry point into the virtual address map is a regular object. This contains no location information on its own, but delegates to a top-level dispatch object. In Figure 3, object 112 is a regular object that delegates to a dispatch processor whose context is identified by object 111 (the IDs are in reverse order here because the dispatch graph is created from the bottom up, but traversed from the top down). Thus when a client opens file 112, it instantiates a dispatcher using the data in object 111 as context. This context informs the dispatcher that it will be delegating IO through a striped processor, using 8 stripes for the object and a stripe width of 512K. The dispatcher in turn instantiates 8 processors (one for each stripe), each configured with the information stored in the object associated with each stripe (e.g., stripe 0 uses object 103). Finally, when the stripe dispatcher performs IO on stripe 0, it will use the context in the object descriptor for object 103 to instantiate a replicated processor, which mirrors writes to the NADs listed in its replica set, and issues reads to the nearest in sync replica (where distance is currently simply local or remote).
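The traversal described above can be sketched as a recursive walk over map entries shaped like those in Figure 3. The Python below is a simplified illustration with hypothetical entry layouts and only two stripes listed (the figure's entries are similarly abbreviated); it is not the actual map format.

# Sketch: instantiating a dispatch graph from virtual address map entries
# shaped like those in Figure 3. Entry contents are abbreviated and the
# structures are hypothetical.

ADDRESS_MAP = {
    112: {"type": "regular", "dispatch": 111},
    111: {"type": "stripe", "chunksize": 512 * 1024, "stripes": [103, 104]},
    103: {"type": "replicate", "stores": ["a98f2", "fc89f"]},
    104: {"type": "replicate", "stores": ["17c2e", "b33d1"]},
}

def instantiate(object_id):
    """Walk the map top-down and build the processor graph for object_id."""
    entry = ADDRESS_MAP[object_id]
    if entry["type"] == "regular":
        return instantiate(entry["dispatch"])
    if entry["type"] == "stripe":
        children = [instantiate(s) for s in entry["stripes"]]
        return ("stripe", entry["chunksize"], children)
    if entry["type"] == "replicate":
        # Writes mirror to every store; reads prefer the nearest in-sync copy.
        return ("replicate", entry["stores"])
    raise ValueError(f"unknown entry type: {entry['type']}")

# Opening file 112 yields a striper over replicated stripe objects.
print(instantiate(112))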

In addition to the striping and mirroring processors described here, the map can support other more advanced processors, such as erasure coding, or byte-range mappings to arbitrary objects (which supports among other things data deduplication).

3.2 Dispatch

IO requests are handled by a chain of dispatchers, each of which has some common functionality. Dispatchers may have to fragment requests into pieces if they span the ranges covered by different subprocessors, or clone requests into multiple subrequests (e.g., for replication), and they must collect the results of subrequests and deal with partial failures.

The replication and striping modules included in the standard library are representative of the ways processors transform requests as they traverse a dispatch stack. The replication processor allows a request to be split and issued concurrently to a set of replica objects. The request address remains unchanged within each object, and responses are not returned until all replicas have acknowledged a request as complete. The processor prioritizes reading from local replicas, but forwards requests to remote replicas in the event of a failure (either an error response or a timeout). It imposes a global ordering on write requests and streams them to all replicas in parallel. It also periodically commits a light-weight checkpoint to each replica's log to maintain a persistent record of synchronization points; these checkpoints are used for crash recovery (Section 5.1.3).
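A minimal sketch of this behaviour is shown below, assuming hypothetical replica objects with read/write methods rather than real NAD RPCs: writes fan out to every replica and wait for all acknowledgements, while reads prefer the first (local) replica and fall back on error.

# Sketch of replication-processor request handling. Replica objects are
# in-memory stand-ins for NAD-backed physical objects.

from concurrent.futures import ThreadPoolExecutor

class ReplicationProcessor:
    def __init__(self, replicas):
        self.replicas = replicas                    # ordered: local replicas first
        self.pool = ThreadPoolExecutor(max_workers=len(replicas))

    def write(self, offset, data):
        # Stream the write to every replica in parallel and report success
        # only once all of them have acknowledged it.
        futures = [self.pool.submit(r.write, offset, data) for r in self.replicas]
        return all(f.result() for f in futures)

    def read(self, offset, length):
        # Prefer the local replica; fall back to remote copies on failure.
        for replica in self.replicas:
            try:
                return replica.read(offset, length)
            except IOError:
                continue
        raise IOError("no replica could serve the read")

class InMemoryReplica:
    """Trivial stand-in for a NAD-backed physical object."""
    def __init__(self):
        self.extents = {}

    def write(self, offset, data):
        self.extents[offset] = bytes(data)
        return True

    def read(self, offset, length):
        return self.extents[offset][:length]

local, remote = InMemoryReplica(), InMemoryReplica()
processor = ReplicationProcessor([local, remote])
processor.write(0, b"mirrored block")
print(processor.read(0, 8))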

The striping processor distributes data across a collection of sparse objects. It is parameterized to take a stripe size (in bytes) and a list of objects to act as the ordered stripe set. In the event that a request crosses a stripe boundary, the processor splits that request into a set of per-stripe requests and issues those asynchronously, collecting the responses before returning. Static, address-based striping is a relatively simple load balancing and data distribution mechanism as compared to placement schemes such as consistent hashing [20]. Our experience has been that the approach is effective, because data placement tends to be reasonably uniform within an object address space, and because using a reasonably large stripe size (we default to 512KB) preserves locality well enough to keep request fragmentation overhead low in normal operation.
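The address arithmetic performed by the striping processor can be illustrated with the short sketch below (a simplified round-robin layout; the function name and exact layout are assumptions rather than Strata's implementation).

# Sketch of stripe address arithmetic: a request that crosses a stripe
# boundary is split into per-stripe sub-requests.

def split_by_stripe(offset, length, stripe_size, stripe_count):
    """Yield (stripe_index, stripe_offset, chunk_length) sub-requests."""
    while length > 0:
        stripe_number = offset // stripe_size            # global stripe number
        stripe_index = stripe_number % stripe_count      # which stripe object
        within = offset % stripe_size
        # Address within the chosen stripe object:
        stripe_offset = (stripe_number // stripe_count) * stripe_size + within
        chunk = min(length, stripe_size - within)
        yield stripe_index, stripe_offset, chunk
        offset += chunk
        length -= chunk

# A 1 MB read starting at 256 KB, with 512 KB stripes over 8 objects,
# splits into three sub-requests on three different stripe objects.
for sub in split_by_stripe(256 * 1024, 1024 * 1024, 512 * 1024, 8):
    print(sub)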

3.3 Coherence

Strata clients also participate in a simple coordination protocol in order to allow the virtual address map for a virtual object to be updated even while that object is in use. Online reconfiguration provides a means for recovering from failures, responding to capacity changes, and even moving objects in response to observed or predicted load (on a device basis; this is distinct from client load balancing, which we also support through a switch-based protocol described in Section 6.2).

The virtual address maps are stored in a distributed, synchronized configuration database implemented over Apache Zookeeper, which is also available for any low-bandwidth synchronization required by services elsewhere in the software stack. The coherence protocol is built on top of the configuration database. It is currently optimized for a single writer per object, and works as follows: when a client wishes to write to a virtual object, it first claims a lock for it in the configuration database. If the object is already locked, the client requests that the holder release it so that the client can claim it. If the holder does not voluntarily release it within a reasonable time, the holder is considered unresponsive and fenced from the system using the mechanism described in Section 6.2. This is enough to allow movement of objects, by first creating new, out of sync physical objects at the desired location, then requesting a release of the object's lock holder if there is one. The user of the object will reacquire the lock on the next write, and in the process discover the new out of sync replica and initiate resynchronization. When the new replica is in sync, the same process may be repeated to delete replicas that are at undesirable locations.
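The sketch below illustrates the shape of this single-writer protocol, using an in-memory stand-in for the Zookeeper-backed configuration database and a caller-supplied fencing callback; all names are hypothetical.

# Sketch of the single-writer coherence protocol. ConfigDB is an
# in-memory stand-in for the real configuration database.

import time

class ConfigDB:
    def __init__(self):
        self.locks = {}                 # virtual object id -> holder
        self.release_requests = set()   # holders asked to release voluntarily

    def try_claim(self, obj_id, client):
        holder = self.locks.get(obj_id)
        if holder is None or holder == client:
            self.locks[obj_id] = client
            return True
        self.release_requests.add(obj_id)   # ask the current holder to let go
        return False

    def release(self, obj_id, client):
        if self.locks.get(obj_id) == client:
            del self.locks[obj_id]

def claim_for_write(db, obj_id, client, timeout=1.0, fence=lambda holder: None):
    """Claim the object lock, fencing an unresponsive holder if needed."""
    deadline = time.time() + timeout
    while not db.try_claim(obj_id, client):
        if time.time() > deadline:
            # Holder did not release in time: fence it from the data
            # network (Section 6.2) and take the lock over.
            fence(db.locks.get(obj_id))
            db.locks[obj_id] = client
            return True
        time.sleep(0.05)
    return True

db = ConfigDB()
claim_for_write(db, "vobj-112", "client-A")
claim_for_write(db, "vobj-112", "client-B", timeout=0.2,
                fence=lambda holder: print(f"fencing {holder}"))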

4 Network Attached Disks

The unit of storage in Strata is a Network Attached Disk (NAD), consisting of a balanced combination of CPU, network and storage components. In our current hardware, each NAD has two 10 gigabit Ethernet ports, two PCIe flash cards capable of 10 gigabits of throughput each, and a pair of Xeon processors that can keep up with request load and host additional services alongside the data path. Each NAD provides two distinct services.


First, it efficiently multiplexes the raw storage hardware across multiple concurrent users, using an object storage protocol. Second, it hosts applications that provide higher level services over the cluster. Object rebalancing (Section 5.2.1) and the NFS protocol interface (Section 6.1) are examples of these services.

At the device level, we multiplex the underlying storage into objects, named by 128-bit identifiers and consisting of sparse 2^64 byte data address spaces. These address spaces are currently backed by a garbage-collected log-structured object store, but the implementation of the object store is opaque to the layers above and could be replaced if newer storage technologies made different access patterns more efficient. We also provide increased capacity by allowing each object to flush low priority or infrequently used data to disk, but this is again hidden behind the object interface. The details of disk tiering, garbage collection, and the layout of the file system are beyond the scope of this paper.

The physical object interface is for the most part a traditional object-based storage device [37, 38] with a CRUD interface for sparse objects, as well as a few extensions to assist with our clustering protocol (Section 5.1.2). It is significantly simpler than existing block device interfaces, such as the SCSI command set, but is also intended to be more direct and general purpose than even narrower interfaces such as those of a key-value store. Providing a low-level hardware abstraction layer allows the implementation to be customized to accommodate best practices of individual flash implementations, and also allows more dramatic design changes at the media interface level as new technologies become available.
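As a rough illustration of the interface granularity, the sketch below models a NAD-style object store with a CRUD interface over sparse objects named by 128-bit identifiers. It is a toy in-memory stand-in with hypothetical method names, not the real device protocol (which also includes clustering extensions).

# Toy sketch of a CRUD interface over sparse objects, standing in for the
# NAD-level object store.

class SparseObject:
    def __init__(self):
        self.extents = {}                     # offset -> bytes (sparse)

class NADObjectStore:
    def __init__(self):
        self.objects = {}

    def create(self, object_id: int):
        assert 0 <= object_id < 2**128        # 128-bit object namespace
        self.objects[object_id] = SparseObject()

    def write(self, object_id, offset, data):
        assert 0 <= offset < 2**64            # sparse 2^64-byte address space
        self.objects[object_id].extents[offset] = bytes(data)

    def read(self, object_id, offset, length):
        extent = self.objects[object_id].extents.get(offset, b"")
        return extent[:length]                # simplified: exact-extent reads only

    def delete(self, object_id):
        del self.objects[object_id]

store = NADObjectStore()
store.create(0xC0110)
store.write(0xC0110, offset=1 << 40, data=b"sparse write far into the object")
print(store.read(0xC0110, offset=1 << 40, length=6))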

4.1 Network Integration

As with any distributed system, we must deal with misbehaving nodes. We address this problem by tightly coupling with managed Ethernet switches, which we discuss at more length in Section 6.2. This approach borrows ideas from systems such as Sane [8] and Ethane [7], in which a managed network is used to enforce isolation between independent endpoints. The system integrates with both OpenFlow-based switches and software switching at the VMM to ensure that Strata objects are only addressable by their authorized clients.

Our initial implementation used Ethernet VLANs, because this form of hardware-supported isolation is in common use in enterprise environments. In the current implementation, we have moved to OpenFlow, which provides a more flexible tunneling abstraction for traffic isolation.

We also expose an isolated private virtual network for out-of-band control and management operations internal to the cluster. This allows NADs themselves to access remote objects for peer-wise resynchronization and reorganization under the control of a cluster monitor.

5 Online Reconfiguration

There are two broad categories of events to which Strata must respond in order to maintain its performance and reliability properties. The first category includes faults that occur directly on the data path. The dispatch library recovers from such faults immediately and automatically by reconfiguring the affected virtual objects on behalf of the client. The second category includes events such as device failures and load imbalance. These are handled by a dedicated cluster monitor which performs large-scale reconfiguration tasks to maintain the health of the system as a whole. In all cases, reconfiguration is performed online and has minimal impact on client availability.

5.1 Object Reconfiguration

A number of error recovery mechanisms are built directly into the dispatch library. These mechanisms allow clients to quickly recover from failures by reconfiguring individual virtual objects on the data path.

5.1.1 IO Errors

The replication IO processor responds to read errors in the obvious way: by immediately resubmitting failed requests to different replicas. In addition, clients maintain per-device error counts; if the aggregated error count for a device exceeds a configurable threshold, a background task takes the device offline and coordinates a system-wide reconfiguration (Section 5.2.2).

IO processors respond to write errors by synchronously reconfiguring virtual objects at the time of the failure. This involves three steps. First, the affected replica is marked out of sync in the configuration database. This serves as a global, persistent indication that the replica may not be used to serve reads because it contains potentially stale data. Second, a best-effort attempt is made to inform the NAD of the error so that it can initiate a background task to resynchronize the affected replica. This allows the system to recover from transient failures almost immediately. Finally, the IO processor allocates a special patch object on a separate device and adds this to the replica set. Once a replica has been marked out of sync, no further writes are issued to it until it has been resynchronized; patches prevent device failures from impeding progress by providing a temporary buffer to absorb writes under these degraded conditions. With the patch object allocated, the IO processor can continue to meet the replication requirements for new writes while out of sync replicas are repaired in the background. A replica set remains available as long as an in sync replica or an out of sync replica and all of its patches are available.
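The three-step reaction to a write error can be summarized in the following sketch; the configuration database, NAD notification, and patch allocation calls are hypothetical stand-ins for the real control-path operations.

# Sketch of the write-error response: mark the replica out of sync, nudge
# its NAD to resynchronize it, and allocate a patch object elsewhere to
# absorb writes in the meantime. All helpers are hypothetical.

def handle_write_error(config_db, replica_set, failed_replica, allocate_patch,
                       notify_nad):
    # 1. Persistently record that the replica may hold stale data, so no
    #    client will read from it until it has been resynchronized.
    config_db.mark_out_of_sync(failed_replica)

    # 2. Best effort: tell the hosting NAD so it can start a background
    #    resync task immediately (transient failures recover quickly).
    try:
        notify_nad(failed_replica)
    except ConnectionError:
        pass

    # 3. Allocate a patch object on a different device and add it to the
    #    replica set; new writes go to the in-sync replicas plus the patch.
    patch = allocate_patch(exclude_device=failed_replica.device)
    replica_set.add_patch(failed_replica, patch)
    return patch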

5.1.2 Resynchronization

In addition to providing clients direct access to devices via virtual address maps, Strata provides a number of background services to maintain the health of individual virtual objects and the system as a whole. The most fundamental of these is the resync service, which provides a background task that can resynchronize objects replicated across multiple devices.

Resync is built on top of a special NAD resync API that exposes the underlying log structure of the object stores. NADs maintain a Log Serial Number (LSN) with every physical object in their stores; when a record is appended to an object's log, its LSN is monotonically incremented. The IO processor uses these LSNs to impose a global ordering on the changes made to physical objects that are replicated across stores and to verify that all replicas have received all updates.

If a write failure causes a replica to go out of sync, the client can request the system to resynchronize the replica. It does this by invoking the resync RPC on the NAD which hosts the out of sync replica. The server then starts a background task which streams the missing log records from an in sync replica and applies them to the local out of sync copy, using the LSN to identify which records the local copy is missing.

During resync, the background task has exclusive write access to the out of sync replica because all clients have been reconfigured to use patches. Thus the resync task can chase the tail of the in sync object's log while clients continue to write. When the bulk of the data has been copied, the resync task enters a final stop-and-copy phase in which it acquires exclusive write access to all replicas in the replica set, finalizes the resync, applies any client writes received in the interim, marks the replica as in sync in the configuration database, and removes the patch.
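The LSN-driven resync procedure can be sketched as follows, with in-memory replica logs standing in for the NAD object logs and a trivial context manager standing in for the stop-and-copy exclusion; all names are illustrative.

# Sketch of log-based resynchronization: stream the records the
# out-of-sync replica is missing (identified by LSN), then finish with a
# short stop-and-copy phase.

from contextlib import contextmanager

class Replica:
    def __init__(self, log=None):
        self.log = list(log or [])           # list of (lsn, record)

    def lsn(self):
        return self.log[-1][0] if self.log else 0

def resync(source, target, freeze_writes):
    # Bulk phase: chase the tail of the in-sync log while clients keep
    # writing to the source (their writes land in patches, not the target).
    while target.lsn() < source.lsn():
        missing = [entry for entry in source.log if entry[0] > target.lsn()]
        target.log.extend(missing)

    # Stop-and-copy phase: briefly exclude writers, apply any interim
    # records, after which the replica can be marked in sync and the
    # patch removed.
    with freeze_writes():
        missing = [entry for entry in source.log if entry[0] > target.lsn()]
        target.log.extend(missing)

@contextmanager
def freeze_writes():
    yield                                     # stand-in for exclusive access

in_sync = Replica([(1, "a"), (2, "b"), (3, "c")])
stale = Replica([(1, "a")])
resync(in_sync, stale, freeze_writes)
print(stale.lsn())                            # now matches the source: 3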

It is important to ensure that resync makes timely progress to limit vulnerability to data loss. Very heavy client write loads may interfere with resync tasks and, in the worst case, result in unbounded transfer times. For this reason, when an object is under resync, client writes are throttled and resync requests are prioritized.

5.1.3 Crash Recovery

Special care must be taken in the event of an unclean shutdown. On a clean shutdown, all objects are released by removing their locks from the configuration database. Crashes are detected when replica sets are discovered with stale locks (i.e., locks identifying unresponsive IO processors). When this happens, it is not safe to assume that replicas marked in sync in the configuration database are truly in sync, because a crash might have occurred midway through a configuration database update; instead, all the replicas in the set must be queried directly to determine their states.

In the common case, the IO processor retrieves the LSN for every replica in the set and determines which replicas, if any, are out of sync. If all replicas have the same LSN, then no resynchronization is required. If different LSNs are discovered, then the replica with the highest LSN is designated as the authoritative copy, and all other replicas are marked out of sync and resync tasks are initiated.

If a replica cannot be queried during the recovery procedure, it is marked as diverged in the configuration database and the replica with the highest LSN from the remaining available replicas is chosen as the authoritative copy. In this case, writes may have been committed to the diverged replica that were not committed to any others. If the diverged replica becomes available again some time in the future, these extra writes must be discarded. This is achieved by rolling the replica back to its last checkpoint and starting a resync from that point in its log. Consistency in the face of such rollbacks is guaranteed by ensuring that objects are successfully marked out of sync in the configuration database before writes are acknowledged to clients. Thus write failures are guaranteed to either mark replicas out of sync in the configuration database (and create corresponding patches) or propagate back to the client.
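The recovery decision can be sketched as below: query each reachable replica's LSN, take the highest as authoritative, and mark the rest out of sync, with unreachable replicas marked diverged (and later rolled back to their last checkpoint before resyncing). The helper callbacks are hypothetical.

# Sketch of crash recovery for a replica set found with a stale lock.

def recover_replica_set(replicas, query_lsn, mark_out_of_sync, mark_diverged):
    lsns = {}
    for replica in replicas:
        try:
            lsns[replica] = query_lsn(replica)
        except ConnectionError:
            mark_diverged(replica)       # may hold unacknowledged writes

    authoritative = max(lsns, key=lsns.get)
    for replica, lsn in lsns.items():
        if lsn < lsns[authoritative]:
            mark_out_of_sync(replica)    # a resync task will be started
    return authoritative

# Hypothetical usage with three replicas, one of which is unreachable:
state = {"r1": 42, "r2": 40}

def query(replica):
    if replica not in state:
        raise ConnectionError(replica)
    return state[replica]

winner = recover_replica_set(
    ["r1", "r2", "r3"], query,
    mark_out_of_sync=lambda r: print(f"{r} marked out of sync"),
    mark_diverged=lambda r: print(f"{r} marked diverged"))
print("authoritative copy:", winner)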

5.2 System Reconfiguration

Strata also provides a highly-available monitoring service that watches over the health of the system and coordinates system-wide recovery procedures as necessary. Monitors collect information from clients, SMART diagnostic tools, and NAD RPCs to gauge the status of the system. Monitors build on the per-object reconfiguration mechanisms described above to respond to events that individual clients don't address, such as load imbalance across the system, stores nearing capacity, and device failures.


5.2.1 Rebalance

Strata provides a rebalance facility which is capable of performing system-wide reconfiguration to repair broken replicas, prevent NADs from filling to capacity, and improve load distribution across NADs. This facility is in turn used to recover from device failures and expand onto new hardware.

Rebalance proceeds in two stages. In the first stage, the monitor retrieves the current system configuration, including the status of all NADs and the virtual address map of every virtual object. It then constructs a new layout for the replicas according to a customizable placement policy. This process is scriptable and can be easily tailored to suit specific performance and durability requirements for individual deployments (see Section 7.3 for some analysis of the effects of different placement policies). The default policy uses a greedy algorithm that considers a number of criteria designed to ensure that replicated physical objects do not share fault domains, capacity imbalances are avoided as much as possible, and migration overheads are kept reasonably low. The new layout is formulated as a rebalance plan describing what changes need to be applied to individual replica sets to achieve the desired configuration.

In the second stage, the monitor coordinates the execution of the rebalance plan by initiating resync tasks on individual NADs to effect the necessary data migration. When replicas need to be moved, the migration is performed in three steps:

1. A new replica is added to the destination NAD

2. A resync task is performed to transfer the data

3. The old replica is removed from the source NAD

This requires two reconfiguration events for the replica set, the first to extend it to include the new replica, and the second to prune the original after the resync has completed. The monitor coordinates this procedure across all NADs and clients for all modified virtual objects.
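The sketch below strings the three migration steps together as a single helper; the configuration database and resync service calls are hypothetical stand-ins for the monitor's actual RPCs.

# Sketch of replica migration as used by the rebalance plan: extend the
# replica set with a copy on the destination NAD, resync it, then prune
# the original. Helper objects are hypothetical.

def migrate_replica(config_db, resync_service, virtual_object, src_nad, dst_nad):
    # Reconfiguration 1: extend the replica set to include the new copy.
    new_replica = config_db.add_replica(virtual_object, dst_nad, state="out-of-sync")

    # Transfer the data with the existing resync mechanism (Section 5.1.2);
    # the object stays online for clients while this runs.
    resync_service.resync(source_nad=src_nad, target=new_replica)
    config_db.mark_in_sync(new_replica)

    # Reconfiguration 2: prune the replica on the source NAD.
    config_db.remove_replica(virtual_object, src_nad)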

5.2.2 Device Failure

Strata determines that a NAD has failed either when it receives a hardware failure notification from a responsive NAD (such as a failed flash device or excessive error count) or when it observes that a NAD has stopped responding to requests for more than a configurable timeout. In either case, the monitor responds by taking the NAD offline and initiating a system-wide reconfiguration to repair redundancy.

The first thing the monitor does when taking a NAD offline is to disconnect it from the data path VLAN. This is a strong benefit of integrating directly against an Ethernet switch in our environment: prior to taking corrective action, the NAD is synchronously disconnected from the network for all request traffic, avoiding the distributed systems complexities that stem from things such as overloaded components appearing to fail and then returning long after a timeout in an inconsistent state. Rather than attempting to use completely end-host mechanisms such as watchdogs to trigger reboots, or agreement protocols to inform all clients of a NAD's failure, Strata disables the VLAN and requires that the failed NAD reconnect on the (separate) control VLAN in the event that it returns to life in the future.

From this point, the recovery logic is straightforward. The NAD is marked as failed in the configuration database and a rebalance job is initiated to repair any replica sets containing replicas on the failed NAD.

5.2.3 Elastic Scale Out

Strata responds to the introduction of new hardware much in the same way that it responds to failures. When the monitor observes that new hardware has been installed, it uses the rebalance facility to generate a layout that incorporates the new devices. Because replication is generally configured underneath striping, we can migrate virtual objects at the granularity of individual stripes, allowing a single striped file to exploit the aggregated performance of many devices. Objects, whether whole files or individual stripes, can be moved to another NAD even while the file is online, using the existing resync mechanism. New NADs are populated in a controlled manner to limit the impact of background IO on active client workloads.

6 Storage Protocols

Strata supports legacy protocols by providing an execution runtime for hosting protocol servers. Protocols are built as thin presentation layers on top of the dispatch interfaces; multiple protocol instances can operate side by side. Implementations can also leverage SDN-based protocol scaling to transparently spread multiple clients across the distributed runtime environment.

6.1 Scalable NFS

Strata is designed so that application developers can focus primarily on implementing protocol specifications without worrying much about how to organize data on disk. We expect that many storage protocols can be implemented as thin wrappers around the provided dispatch library. Our NFS implementation, for example, maps very cleanly onto the high-level dispatch APIs, providing only protocol-specific extensions like RPC marshalling and NFS-style access control. It takes advantage of the configuration database to store mappings between the NFS namespace and the backend objects, and it relies exclusively on the striping and replication processors to implement the data path. Moreover, Strata allows NFS servers to be instantiated across multiple backend nodes, automatically distributing the additional processing overhead across backend compute resources.

6.2 SDN Protocol Scaling

Scaling legacy storage protocols can be challenging, especially when the protocols were not originally designed for a distributed back end. Protocol scalability limitations may not pose significant problems for traditional arrays, which already sit behind relatively narrow network interfaces, but they can become a performance bottleneck in Strata's distributed architecture.

A core property that limits scale of access bandwidth of conventional IP storage protocols is the presentation of storage servers behind a single IP address. Fortunately, emerging "software defined" network (SDN) switches provide interfaces that allow applications to take more precise control over packet forwarding through Ethernet switches than has traditionally been possible.

Using the OpenFlow protocol, a software controller is able to interact with the switch by pushing flow-specific rules onto the switch's forwarding path. OpenFlow rules are effectively wild-carded packet filters and associated actions that tell a switch what to do when a matching packet is identified. SDN switches (our implementation currently uses an Arista Networks 7050T-52) interpret these flow rules and push them down onto the switch's TCAM or L2/L3 forwarding tables.

By manipulating traffic through the switch at the granularity of individual flows, Strata protocol implementations are able to present a single logical IP address to multiple clients. Rules are installed on the switch to trigger a fault event whenever a new NFS session is opened, and the resulting exception path determines which protocol instance to forward that session to initially. A service monitors network activity and migrates client connections as necessary to maintain an even workload distribution.

The protocol scaling API wraps and extends the conventional socket API, allowing a protocol implementation to bind to and listen on a shared IP address across all of its instances. The client load balancer then monitors the traffic demands across all of these connections and initiates flow migration in response to overload on any individual physical connection.

In its simplest form, client migration is handled entirely at the transport layer. When the protocol load balancer observes that a specific NAD is overloaded, it updates the routing tables to redirect the busiest client workload to a different NAD. Once the client's traffic is diverted, it receives a TCP RST from the new NAD and establishes a new connection, thereby transparently migrating traffic to the new NAD.
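A simplified version of this transport-level balancing decision is sketched below; the flow accounting and the redirect_rule callback are hypothetical stand-ins for the forwarding-rule updates the real load balancer performs through the switch.

# Sketch of transport-level client migration: when a NAD's aggregate load
# crosses a threshold, steer its busiest client flow to the least loaded
# NAD. The redirected client sees a TCP RST and reconnects transparently.

def rebalance_flows(flows, capacity_threshold, redirect_rule):
    """flows: list of dicts like {"client": ..., "nad": ..., "load": ...}."""
    load_per_nad = {}
    for flow in flows:
        load_per_nad[flow["nad"]] = load_per_nad.get(flow["nad"], 0) + flow["load"]

    for nad, load in sorted(load_per_nad.items(), key=lambda kv: -kv[1]):
        if load <= capacity_threshold:
            break
        # Pick the busiest flow on the overloaded NAD and the coolest target.
        busiest = max((f for f in flows if f["nad"] == nad), key=lambda f: f["load"])
        target = min(load_per_nad, key=load_per_nad.get)
        redirect_rule(busiest["client"], target)      # update the forwarding rule
        load_per_nad[nad] -= busiest["load"]
        load_per_nad[target] += busiest["load"]
        busiest["nad"] = target

flows = [{"client": "10.150.2.11", "nad": "nad-a", "load": 70},
         {"client": "10.150.2.12", "nad": "nad-a", "load": 50},
         {"client": "10.150.2.13", "nad": "nad-b", "load": 20}]
rebalance_flows(flows, capacity_threshold=100,
                redirect_rule=lambda client, nad: print(f"steer {client} -> {nad}"))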

Strata also provides hooks for situations where application layer coordination is required to make migration safe. For example, our NFS implementation registers a pre-migration routine with the load balancer, which allows the source NFS server to flush any pending, non-idempotent requests (such as create or remove) before the connection is redirected to the destination server.

7 Evaluation

In this section we evaluate our system both in terms of effective use of flash resources, and as a scalable, reliable provider of storage for NFS clients. First, we establish baseline performance over a traditional NFS server on the same hardware. Then we evaluate how performance scales as nodes are added and removed from the system, using VM-based workloads over the legacy NFS interface, which is oblivious to cluster changes. In addition, we compare the effects of load balancing and object placement policy on performance. We then test reliability in the face of node failure, which is a crucial feature of any distributed storage system. We also examine the relation between CPU power and performance in our system as a demonstration of the need to balance node power between flash, network and CPU.

7.1 Test environment

Evaluation was performed on a cluster of the maximum size allowed by our 48-port switch: 12 NADs, each of which has two 10 gigabit Ethernet ports, two 800 GB Intel 910 PCIe flash cards, six 3 TB SATA drives, 64 GB of RAM, and 2 Xeon E5-2620 processors at 2 GHz with 6 cores/12 threads each, and 12 clients, in the form of Dell PowerEdge R420 servers running ESXi 5.0, with two 10 gigabit ports each, 64 GB of RAM, and 2 Xeon E5-2470 processors at 2.3 GHz with 8 cores/16 threads each. We configured the deployment to maintain two replicas of every stored object, without striping (since it unnecessarily complicates placement comparisons and has little benefit for symmetric workloads). Garbage collection is active, and the deployment is in its standard configuration with a disk tier enabled, but the workloads have been configured to fit entirely within flash, as the effects of cache misses to magnetic media are not relevant to this paper.

Server    Read IOPS    Write IOPS
Strata    40287        9960
KNFS      23377        5796

Table 1: Random IO performance on Strata versus KNFS.

7.2 Baseline performance

To provide some performance context for our architecture versus a typical NFS implementation, we compare two minimal deployments of NFS over flash. We set Strata to serve a single flash card, with no replication or striping, and mounted it loopback. We ran a fio [34] workload with a 4K IO size, 80/20 read-write mix, at a queue depth of 128 against a fully allocated file. We then formatted the flash card with ext4, exported it with the Linux kernel NFS server, and ran the same test. The results are in Table 1. As the table shows, we offer good NFS performance at the level of individual devices. In the following section we proceed to evaluate scalability.


Figure 4: IOPS over time, read-only workload.

7.3 Scalability

In this section we evaluate how well performance scales as we add NADs to the cluster. We begin each test by deploying 96 VMs (8 per client) into a cluster of 2 NADs. We choose this number of VMs because ESXi limits the queue depth for a VM to 32 outstanding requests, but we do not see maximum performance until a queue depth of 128 per flash card. The VMs are each configured to run the same fio workload for a given test. In Figure 4, fio generates 4K random reads to focus on IOPS scalability. In Figure 5, fio generates an 80/20 mix of reads and writes at 128K block size in a Pareto distribution such that 80% of requests go to 20% of the data. This is meant to be more representative of real VM workloads, but with enough offered load to completely saturate the cluster.


Figure 5: IOPS over time, 80/20 R/W workload.

As the tests run, we periodically add NADs, two at a time, up to a maximum of twelve². When each pair of NADs comes online, a rebalancing process automatically begins to move data across the cluster so that the amount of data on each NAD is balanced. When it completes, we run in a steady state for two minutes and then add the next pair. In both figures, the periods where rebalancing is in progress are reflected by a temporary drop in performance (as the rebalance process competes with client workloads for resources), followed by a rapid increase in overall performance when the new nodes are marked available, triggering the switch to load-balance clients to them. A cluster of 12 NADs achieves over 1 million IOPS in the IOPS test, and 10 NADs achieve 70,000 IOPS (representing more than 9 gigabytes/second of throughput) in the 80/20 test.

We also test the effect of placement and load balancing on overall performance. If the location of a workload source is unpredictable (as in a VM data center with virtual machine migration enabled), we need to be able to migrate clients quickly in response to load. However, if the configuration is more static or can be predicted in advance, we may benefit from attempting to place clients and data together to reduce the network overhead incurred by remote IO requests. As discussed in Section 5.2.1, the load-balancing and data migration features of Strata make both approaches possible. Figure 4 is the result of an aggressive local placement policy, in which data is placed on the same NAD as its clients, and both are moved as the number of devices changes. This achieves the best possible performance at the cost of considerable data movement. In contrast, Figure 6 shows the performance of an otherwise identical test configuration when data is placed randomly (while still satisfying fault tolerance and even distribution constraints), rather than being moved according to client requests. The Pareto workload (Figure 5) is also configured with the default random placement policy, which is the main reason that it does not scale linearly: as the number of nodes increases, so does the probability that a request will need to be forwarded to a remote NAD.

²Ten for the read/write test due to an unfortunate test harness problem.

Figure 6: IOPS over time, read-only workload with random placement

7.4 Node Failure

As a counterpoint to the scalability tests run in the previous section, we also tested the behaviour of the cluster when a node is lost. We configured a 10 NAD cluster with 10 clients hosting 4 VMs each, running the 80/20 Pareto workload described earlier. Figure 7 shows the behaviour of the system during this experiment. After the VMs had been running for a short time, we powered off one of the NADs by IPMI, waited 60 seconds, then powered it back on. During the node outage, the system continued to run uninterrupted but with lower throughput. When the node came back up, it spent some time resynchronizing its objects to restore full replication to the system, and then rejoined the cluster. The client load balancer shifted clients onto it and throughput was restored (within the variance resulting from the client load balancer's placement decisions).

7.5 Protocol overhead

The benchmarks up to this point have all been run inside VMs whose storage is provided by a virtual disk that Strata exports by NFS to ESXi. This configuration requires no changes on the part of the clients to scale across a cluster, but does impose overheads. To quantify these overheads we wrote a custom fio engine that is capable of performing IO directly against our native dispatch interface (that is, the API by which our NFS protocol gateway interacts with the NADs). We then compared the performance of a single VM running a random 4K read fio workload (for maximum possible IOPS) against a VMDK exported by NFS to the same workload run against our native dispatch engine. In this experiment, the VMDK-based experiment produced an average of 50240 IOPS, whereas direct access achieved 54060 IOPS, for an improvement of roughly 8%.

Figure 7: Aggregate bandwidth for 80/20 clients during failover and recovery

CPU          IOPS           Freq (Cores)    Price
E5-2620      127K           2 GHz (6)       $406
E5-2640      153K (+20%)    2.5 GHz (6)     $885
E5-2650v2    188K (+48%)    2.6 GHz (8)     $1166
E5-2660v2    183K (+44%)    2.2 GHz (10)    $1389

Table 2: Achieved IOPS on an 80/20 random 4K workload across 2 MicroArrays

7.6 Effect of CPU on Performance

A workload running at full throttle with small requests completely saturates the CPU. This remains true despite significant development effort in performance debugging, and a great many improvements to minimize data movement and contention. In this section we report the performance improvements resulting from faster CPUs. These results are from random 4K NFS requests in an 80/20 read/write mix at 128 queue depth over four 10Gb links to a cluster of two NADs, each equipped with 2 physical CPUs.

Table 2 shows the results of these tests. In short, it is possible to “buy” additional storage performance under full load by upgrading the CPUs to a more “balanced” configuration. The wins are significant but carry a non-trivial increase in system cost. As a result of this experimentation, we elected to use a higher performance CPU in the shipping version of the product.
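
One way to read Table 2 is in terms of cost per unit of performance; the sketch below recomputes the relative IOPS gains from the table and the implied IOPS per dollar (using the list prices as reported, purely for illustration):

    # Relative IOPS gain and IOPS-per-dollar for the CPUs in Table 2.
    cpus = {
        "E5-2620":   (127_000, 406),
        "E5-2640":   (153_000, 885),
        "E5-2650v2": (188_000, 1166),
        "E5-2660v2": (183_000, 1389),
    }
    base_iops, _ = cpus["E5-2620"]
    for name, (iops, price) in cpus.items():
        gain = iops / base_iops - 1
        print(f"{name:10s} {iops / price:6.1f} IOPS per dollar ({gain:+.0%} vs E5-2620)")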

8 Related Work

Strata applies principles from prior work in server virtualization, both in the form of hypervisor [5, 32] and libOS [14] architectures, to solve the problem of sharing and scaling access to fast non-volatile memories among a heterogeneous set of clients. Our contributions build upon the efforts of existing research in several areas.

Recently, researchers have begun to investigate a broad range of system performance problems posed by storage class memory in single servers [3], including current PCIe flash devices [30], next-generation PCM [1], and byte addressability [13]. Moneta [9] proposed solutions to an extensive set of performance bottlenecks over the PCIe bus interface to storage, and others have investigated improving the performance of storage class memory through polling [33] and by avoiding system call overheads altogether [10]. We draw from this body of work to optimize the performance of our dispatch library, and use this baseline to deliver a high-performance scale-out network storage service. In many cases we would benefit further from these efforts; for example, our implementation could be optimized to offload per-object access control checks, as in Moneta-D [10]. There is also a body of work on efficiently using flash as a caching layer for slower, cheaper storage in the context of large file hosting. For example, S-CAVE [23] optimizes cache utilization on flash for multiple virtual machines on a single VMware host by running as a hypervisor module. This work is largely complementary to ours; we support using flash as a caching layer and would benefit from more effective cache management strategies.

Prior research into scale-out storage systems, such as FAWN [2] and Corfu [4], has considered the impact of a range of NV memory devices on cluster storage performance. However, to date these systems have been designed around lightweight processors paired with simple flash devices. It is not clear that this balance is the correct one, as evidenced by the tendency to evaluate these same designs on significantly more powerful hardware platforms than those they are intended to operate on [4]. Strata is explicitly designed for dense virtualized server clusters backed by performance-dense PCIe-based non-volatile memory. In addition, prior storage systems, like older commodity disk-oriented systems including Petal [22, 29] and FAB [28], have tended to focus on building aggregation features at the lowest level of their designs and then adding a single presentation layer on top. Strata, in contrast, isolates and shares each powerful PCIe-based storage class memory device as its underlying primitive. This has allowed us to present a scalable runtime environment in which multiple protocols can coexist as peers without sacrificing the raw performance that today’s high-performance memory can provide. Many scale-out storage systems, including NV-Heaps [12], Ceph/RADOS [31], and even pNFS [18], are unable to support the legacy formats used in enterprise environments. Our agnosticism to any particular protocol is similar to the approach used by Ursa Minor [16], which also boasted a versatile client library protocol to share access to a cluster of magnetic disks.

Strata does not attempt to provide storage for datacenter-scale environments, unlike systems such as Azure [6], FDS [26], or Bigtable [11]. Storage systems in this space differ significantly in their intended workload, as they emphasize high-throughput linear operations. Strata’s managed network would also need to be extended to support datacenter-sized scale-out. We also differ from in-RAM approaches such as RAMCloud [27] and memcached [15], which offer a different class of durability guarantee and cost.

9 Conclusion

Storage system design faces a sea change resulting from the dramatic increase in the performance density of its component media. Distributed storage systems composed of even a small number of network-attached flash devices are now capable of matching the offered load of traditional systems that would have required multiple racks of spinning disks.

Strata is an enterprise storage architecture that responds to the performance characteristics of PCIe storage devices. By combining building blocks of well-balanced flash, compute, and network resources with SDN-based Ethernet switches, Strata provides an incrementally deployable, dynamically scalable storage system.

Strata’s initial design is specifically targeted at enterprise deployments of VMware ESX, which is one of the dominant drivers of new storage deployments in enterprise environments today. The system achieves high performance and scalability for this specific NFS environment while allowing applications to interact directly with virtualized, network-attached flash hardware over new protocols. This is achieved by cleanly partitioning our storage implementation into an underlying, low-overhead virtualization layer and a scalable framework for implementing storage protocols. Over the next year, we intend to extend the system to provide general-purpose NFS support by layering a scalable and distributed metadata service and small object support above the base layer of coarse-grained storage primitives.


References

[1] AKEL, A., CAULFIELD, A. M., MOLLOV, T. I., GUPTA, R. K., AND SWANSON, S. Onyx: a prototype phase change memory storage array. In Proceedings of the 3rd USENIX Conference on Hot Topics in Storage and File Systems (Berkeley, CA, USA, 2011), HotStorage ’11, USENIX Association, pp. 2–2.

[2] ANDERSEN, D. G., FRANKLIN, J., KAMINSKY, M., PHANISHAYEE, A., TAN, L., AND VASUDEVAN, V. FAWN: a fast array of wimpy nodes. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles (2009), SOSP ’09, pp. 1–14.

[3] BAILEY, K., CEZE, L., GRIBBLE, S. D., AND LEVY, H. M. Operating system implications of fast, cheap, non-volatile memory. In Proceedings of the 13th USENIX Conference on Hot Topics in Operating Systems (Berkeley, CA, USA, 2011), HotOS ’13, USENIX Association, pp. 2–2.

[4] BALAKRISHNAN, M., MALKHI, D., PRABHAKARAN, V., WOBBER, T., WEI, M., AND DAVIS, J. D. Corfu: a shared log design for flash clusters. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation (2012), NSDI ’12.

[5] BARHAM, P., DRAGOVIC, B., FRASER, K., HAND, S., HARRIS, T., HO, A., NEUGEBAUER, R., PRATT, I., AND WARFIELD, A. Xen and the art of virtualization. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles (2003), SOSP ’03, pp. 164–177.

[6] CALDER, B., WANG, J., OGUS, A., NILAKANTAN, N., SKJOLSVOLD, A., MCKELVIE, S., XU, Y., SRIVASTAV, S., WU, J., SIMITCI, H., HARIDAS, J., UDDARAJU, C., KHATRI, H., EDWARDS, A., BEDEKAR, V., MAINALI, S., ABBASI, R., AGARWAL, A., HAQ, M. F. U., HAQ, M. I. U., BHARDWAJ, D., DAYANAND, S., ADUSUMILLI, A., MCNETT, M., SANKARAN, S., MANIVANNAN, K., AND RIGAS, L. Windows Azure Storage: a highly available cloud storage service with strong consistency. In Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles (2011), SOSP ’11, pp. 143–157.

[7] CASADO, M., FREEDMAN, M. J., PETTIT, J., LUO, J., MCKEOWN, N., AND SHENKER, S. Ethane: taking control of the enterprise. In SIGCOMM Computer Communication Review (2007).

[8] CASADO, M., GARFINKEL, T., AKELLA, A., FREEDMAN, M. J., BONEH, D., MCKEOWN, N., AND SHENKER, S. SANE: a protection architecture for enterprise networks. In Proceedings of the 15th Conference on USENIX Security Symposium - Volume 15 (Berkeley, CA, USA, 2006), USENIX-SS ’06, USENIX Association.

[9] CAULFIELD, A. M., DE, A., COBURN, J., MOLLOV, T. I., GUPTA, R. K., AND SWANSON, S. Moneta: a high-performance storage array architecture for next-generation, non-volatile memories. In Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture (2010), MICRO ’43, pp. 385–395.

[10] CAULFIELD, A. M., MOLLOV, T. I., EISNER, L. A., DE, A., COBURN, J., AND SWANSON, S. Providing safe, user space access to fast, solid state disks. In Proceedings of the Seventeenth International Conference on Architectural Support for Programming Languages and Operating Systems (2012), ASPLOS XVII, pp. 387–400.

[11] CHANG, F., DEAN, J., GHEMAWAT, S., HSIEH, W. C., WALLACH, D. A., BURROWS, M., CHANDRA, T., FIKES, A., AND GRUBER, R. E. Bigtable: a distributed storage system for structured data. ACM Trans. Comput. Syst. 26, 2 (June 2008), 4:1–4:26.

[12] COBURN, J., CAULFIELD, A. M., AKEL, A., GRUPP, L. M., GUPTA, R. K., JHALA, R., AND SWANSON, S. NV-Heaps: making persistent objects fast and safe with next-generation, non-volatile memories. In Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems (New York, NY, USA, 2011), ASPLOS XVI, ACM, pp. 105–118.

[13] CONDIT, J., NIGHTINGALE, E. B., FROST, C., IPEK, E., LEE, B., BURGER, D., AND COETZEE, D. Better I/O through byte-addressable, persistent memory. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles (New York, NY, USA, 2009), SOSP ’09, ACM, pp. 133–146.

[14] ENGLER, D. R., KAASHOEK, M. F., AND O’TOOLE, JR., J. Exokernel: an operating system architecture for application-level resource management. In Proceedings of the Fifteenth ACM Symposium on Operating Systems Principles (1995), SOSP ’95, pp. 251–266.


[15] FITZPATRICK, B. Distributed caching with memcached. Linux J. 2004, 124 (Aug. 2004), 5.

[16] GANGER, G. R., ABD-EL-MALEK, M., CRANOR, C., HENDRICKS, J., KLOSTERMAN, A. J., MESNIER, M., PRASAD, M., SALMON, B., SAMBASIVAN, R. R., SINNAMOHIDEEN, S., STRUNK, J. D., THERESKA, E., AND WYLIE, J. J. Ursa Minor: versatile cluster-based storage, 2005.

[17] GIBSON, G. A., AMIRI, K., AND NAGLE, D. F. A case for network-attached secure disks. Tech. Rep. CMU-CS-96-142, Carnegie Mellon University, Pittsburgh, PA, 1996.

[18] HILDEBRAND, D., AND HONEYMAN, P. Exporting storage systems in a scalable manner with pNFS. In Proceedings of the 22nd IEEE/13th NASA Goddard Conference on Mass Storage Systems and Technologies (MSST) (2005).

[19] HUTCHINSON, N. C., AND PETERSON, L. L. The x-kernel: an architecture for implementing network protocols. IEEE Trans. Softw. Eng. 17, 1 (Jan. 1991), 64–76.

[20] KARGER, D., LEHMAN, E., LEIGHTON, T., PANIGRAHY, R., LEVINE, M., AND LEWIN, D. Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the World Wide Web. In Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing (1997), STOC ’97, pp. 654–663.

[21] KOHLER, E., MORRIS, R., CHEN, B., JANNOTTI, J., AND KAASHOEK, M. F. The Click modular router. ACM Trans. Comput. Syst. 18, 3 (Aug. 2000), 263–297.

[22] LEE, E. K., AND THEKKATH, C. A. Petal: distributed virtual disks. In Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems (1996), ASPLOS VII, pp. 84–92.

[23] LUO, T., MA, S., LEE, R., ZHANG, X., LIU, D., AND ZHOU, L. S-CAVE: effective SSD caching to improve virtual machine storage performance. In Parallel Architectures and Compilation Techniques (2013), PACT ’13, pp. 103–112.

[24] MEYER, D. T., CULLY, B., WIRES, J., HUTCHINSON, N. C., AND WARFIELD, A. Block Mason. In Proceedings of the First Conference on I/O Virtualization (2008), WIOV ’08.

[25] MOSBERGER, D., AND PETERSON, L. L. Making paths explicit in the Scout operating system. In Proceedings of the Second USENIX Symposium on Operating Systems Design and Implementation (1996), OSDI ’96, pp. 153–167.

[26] NIGHTINGALE, E. B., ELSON, J., FAN, J., HOFMANN, O., HOWELL, J., AND SUZUE, Y. Flat datacenter storage. In Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation (Berkeley, CA, USA, 2012), OSDI ’12, USENIX Association, pp. 1–15.

[27] OUSTERHOUT, J., AGRAWAL, P., ERICKSON, D., KOZYRAKIS, C., LEVERICH, J., MAZIERES, D., MITRA, S., NARAYANAN, A., ONGARO, D., PARULKAR, G., ROSENBLUM, M., RUMBLE, S. M., STRATMANN, E., AND STUTSMAN, R. The case for RAMCloud. Commun. ACM 54, 7 (July 2011), 121–130.

[28] SAITO, Y., FRØLUND, S., VEITCH, A., MERCHANT, A., AND SPENCE, S. FAB: building distributed enterprise disk arrays from commodity components. In Proceedings of the 11th International Conference on Architectural Support for Programming Languages and Operating Systems (New York, NY, USA, 2004), ASPLOS XI, ACM, pp. 48–58.

[29] THEKKATH, C. A., MANN, T., AND LEE, E. K. Frangipani: a scalable distributed file system. In Proceedings of the Sixteenth ACM Symposium on Operating Systems Principles (1997), SOSP ’97, pp. 224–237.

[30] VASUDEVAN, V., KAMINSKY, M., AND ANDERSEN, D. G. Using vector interfaces to deliver millions of IOPS from a networked key-value storage server. In Proceedings of the Third ACM Symposium on Cloud Computing (New York, NY, USA, 2012), SoCC ’12, ACM, pp. 8:1–8:13.

[31] WEIL, S. A., WANG, F., XIN, Q., BRANDT, S. A., MILLER, E. L., LONG, D. D. E., AND MALTZAHN, C. Ceph: a scalable object-based storage system. Tech. rep., 2006.

[32] WHITAKER, A., SHAW, M., AND GRIBBLE, S. D. Denali: a scalable isolation kernel. In Proceedings of the Tenth ACM SIGOPS European Workshop (2002).

[33] YANG, J., MINTURN, D. B., AND HADY, F. When poll is better than interrupt. In Proceedings of the 10th USENIX Conference on File and Storage Technologies (Berkeley, CA, USA, 2012), FAST ’12, USENIX Association, pp. 3–3.


[34] Flexible I/O tester. http://git.kernel.dk/?p=fio.git;a=summary.

[35] Linux device mapper resource page. http://sourceware.org/dm/.

[36] Linux logical volume manager (LVM2) resource page. http://sourceware.org/lvm2/.

[37] Seagate Kinetic open storage documentation. https://developers.seagate.com/display/KV/Kinetic+Open+Storage+Documentation+Wiki.

[38] SCSI object-based storage device commands - 2, 2011. http://www.incits.org/scopes/1729.htm.
