
Migrating Server Storage to SSDs: Analysis of Tradeoffs

Dushyanth Narayanan, Eno Thereska, Austin Donnelly, Sameh Elnikety, Antony Rowstron
Microsoft Research Cambridge, UK

{dnarayan,etheres,austind,samehe,antr}@microsoft.com

Abstract
Recently, flash-based solid-state drives (SSDs) have become standard options for laptop and desktop storage, but their impact on enterprise server storage has not been studied. Provisioning server storage is challenging. It requires optimizing for the performance, capacity, power and reliability needs of the expected workload, all while minimizing financial costs.

In this paper we analyze a number of workload traces from servers in both large and small data centers, to decide whether and how SSDs should be used to support each. We analyze both complete replacement of disks by SSDs, as well as use of SSDs as an intermediate tier between disks and DRAM. We describe an automated tool that, given device models and a block-level trace of a workload, determines the least-cost storage configuration that will support the workload's performance, capacity, and fault-tolerance requirements.

We found that replacing disks by SSDs is not a cost-effective option for any of our workloads, due to the low capacity per dollar of SSDs. Depending on the workload, the capacity per dollar of SSDs needs to increase by a factor of 3-3000 for an SSD-based solution to break even with a disk-based solution. Thus, without a large increase in SSD capacity per dollar, only the smallest volumes, such as system boot volumes, can be cost-effectively migrated to SSDs. The benefit of using SSDs as an intermediate caching tier is also limited: fewer than 10% of our workloads can reduce provisioning costs by using an SSD tier at today's capacity per dollar, and fewer than 20% can do so at any SSD capacity per dollar. Although SSDs are much more energy-efficient than enterprise disks, the energy savings are outweighed by the hardware costs, and comparable energy savings are achievable with low-power SATA disks.


1. Introduction
Solid-state drives (SSDs) are rapidly making inroads into the laptop market and are likely to encroach on the desktop storage market as well. How should this technology be incorporated into enterprise server storage, which is complex, and different both in scale and requirements from laptop and desktop storage? A server storage system must meet capacity, performance, and reliability requirements, which vary widely across workloads, while minimizing cost.

While solid-state storage offers fast random-access reads and low power consumption, it also has disadvantages such as high cost per gigabyte. There have been many proposals to exploit the characteristics of solid-state storage to redesign storage systems, either replacing disks entirely [Prabhakaran 2008, Woodhouse 2001] or using solid-state storage to augment disk storage, e.g., to store file system metadata [Miller 2001]. However, there has been no cost-benefit analysis of these architectures for real workloads.

In this paper we take the point of view of a storage administrator who wishes to re-provision or upgrade a set of storage volumes. This administrator must answer two questions. First, what are the capacity, performance, and fault-tolerance requirements of the workload? Second, what is the least-cost configuration that will satisfy these requirements? The approach in this paper is to find the least-cost configuration that meets known targets for performance (as well as capacity and fault-tolerance) rather than maximizing performance while keeping within a known cost budget. Since this paper is focused on the tradeoffs between SSDs and mechanical disks, we consider three types of configuration for each workload: disk-only, SSD-only, and two-tiered (with the SSD tier above the disk tier).

Intuitively, SSDs are favored by workloads with a high demand for random-access I/Os, especially reads, and a relatively low capacity requirement. To decide whether a given workload is better served by SSDs than disks, we first define and measure the workload's requirements along several axes: capacity, random-access I/O rate, sequential transfer rate, and fault-tolerance. We also characterize the capabilities of each device type using the same metrics, normalized by dollar cost. Given these two sets of parameters, we compute the total cost of different configurations, and choose the one that meets workload requirements at the lowest cost.


1.1 Contributions
The main contribution of this paper is an analysis of the SSD/disk tradeoff for a range of workloads. Our workload traces are taken from a variety of servers in both large and small data centers. Our analysis is based on matching each workload with the storage configuration that meets its requirements at the lowest dollar cost. We note that storage devices such as disks and SSDs are only part of the cost of supporting enterprise storage; other costs such as networking, enclosures, or cooling are not addressed in this paper.

We find that replacing disks by SSDs is not the optimal solution for any of our workloads. This is due to the very low capacity/dollar offered by SSDs, compared to disks. Depending on the workload, the capacity/dollar will need to improve by a factor of 3-3000 for SSDs to become competitive with disks. While the capacity/dollar of SSDs is improving over time, an improvement of several orders of magnitude seems unlikely [Hetzler 2008]. Thus replacement of disks by SSDs does not seem likely for any but the smallest volumes in our study, such as system boot volumes.

Using SSDs as a caching tier above the disk-based storage also has limited benefits. By absorbing some of the I/O in the SSD tier, such hybrid configurations could reduce the load on the disk tier. This load reduction might then allow reduced provisioning of the disk tier and hence save money overall. However, for 80% of the workloads in our study, the SSD tier did not reduce the cost of the disk tier at all. For a further 10% of the workloads, the SSD tier enabled some reduction in the cost of the disk tier, but the cost of the SSD tier outweighed any savings in the disk tier. Only for 10% of the workloads was the hybrid configuration the optimal configuration. Similarly we found that using SSDs as a write-ahead log, while requiring only modest capacity and bandwidth in the SSD tier and potentially reducing write response times, does not reduce provisioning costs. Additionally, current flash-based SSDs wear out after a certain number of writes per block, and we found this to be a concern when absorbing an entire volume's writes on a small SSD.

We also found that although SSD-based solutions consume less power than enterprise disk based solutions, the resulting dollar savings are 1-3 orders of magnitude lower than the initial investment required. Hence, power savings are unlikely to be the main motivation in moving from disks to SSDs. We found low-speed SATA disks to be competitive with SSDs in terms of performance and capacity per watt. These disks also offer much higher capacity and performance per dollar than either enterprise disks or SSDs. However, they are generally believed to be less reliable and the tradeoff between cost and reliability requires further study.

An additional contribution of this paper is our use of a modeling and optimization approach to answer questions about SSDs. Optimization and solvers combined with performance models have been used previously to automate storage provisioning [Anderson 2005, Strunk 2008], but to our knowledge this is the first work to use them specifically to examine the use of SSDs. Our contribution lies in the novelty and the importance of this new application domain, rather than in our optimization techniques or performance models, which are simple first-order models.

2. Background and related work
This section provides a brief overview of solid-state drives. It then describes the workload traces used in our analysis. It concludes with related work.

2.1 Solid-state drives
Solid-state drives (SSDs) provide durable storage through a standard block I/O interface such as SCSI or SATA. They have no mechanical moving parts and hence no positioning delays. Access latencies are sub-millisecond, compared to many milliseconds for disks. Commercially available SSDs today are based on NAND flash memory (henceforth referred to as "flash"). While other solid-state technologies have been proposed, such as magnetic RAM (M-RAM) or phase change memory (PCM) [Intel News Release 2008], these are not widely used in storage devices today. In this paper we use device models extracted from flash SSDs; our approach would apply equally to these other technologies.

Flash storage has two unusual limitations, but both can be mitigated by using layout remapping techniques at the controller or file system level. The first limitation is that small, in-place updates are inefficient, due to the need to erase the flash in relatively large units (64-128 KB) before it can be rewritten. This results in poor random-access write performance for the current generation of SSDs. However, random-access writes can be made sequential by using a log-structured file system [Rosenblum 1991, Woodhouse 2001] or transparent block-level remapping [Agrawal 2008, Birrell 2007]. Since SSDs do not have any positioning delays, read performance is not affected by these layout changes. Some overhead is added for log cleaning: server workloads in general have sufficient idle time, due to diurnal load patterns, to perform these maintenance tasks without impacting foreground workloads [Golding 1995, Narayanan 2008a].

The second limitation of flash is wear: the reliability of the flash degrades after many repeated write-erase cycles. Most commercially available flash products are rated to withstand 10,000-100,000 write-erase cycles. Using one of a variety of wear-leveling algorithms [Gal 2005], the wear can be spread evenly across the flash memory to maximize the lifetime of the device as a whole. These algorithms impose a small amount of additional wear in the form of additional writes for compaction and defragmentation.

2.2 Workload traces
Our analysis uses real workload traces collected from production servers in both large and small data centers. There are two sets of large data center traces: the first is from an Exchange server that serves 5000 corporate users, one of many responsible for Microsoft employee e-mail.


Data center size  Server    Function            Volumes  Spindles  Total GB
Large             exchange  Corporate mail      9        102       6706
                  msn-befs  MSN storage server  4        28        429
Small             usr       User home dirs      3        16        1367
                  proj      Project dirs        5        44        2094
                  print     Print server        2        6         452
                  hm        H/w monitor         2        6         39
                  resproj   Research projects   3        24        277
                  proxy     Web proxy           2        4         89
                  src1      Source control      3        12        555
                  src2      Source control      3        14        355
                  webstg    Web staging         2        6         113
                  term      Terminal server     1        2         22
                  websql    Web/SQL server      4        17        441
                  media     Media server        2        16        509
                  webdev    Test web server     4        12        136

Table 1. Enterprise servers traced.

The Exchange traces cover 9 volumes over a 24-hour period, starting from 2:39pm PST on the 12th December 2007. The system boot volume has 2 disks in a RAID-1 configuration, and the remaining volumes each have 6-8 disks in a RAID-10 configuration. The Exchange traces contain 61 million requests, of which 43% are reads.

The second set of large data center traces is from 4 RAID-10 volumes on an MSN storage back-end file store. These traces cover a 6-hour period starting from 12:41pm MST on the 10th March 2008. They contain 29 million requests, of which 67% are reads. Both the Exchange and MSN storage traces are described in more detail by Worthington [2008].

The third set of traces was taken from a small data center with about 100 on-site users [Narayanan 2008a], and covers a range of servers typical of small to medium enterprises: file servers, database servers, caching web proxies, etc. Each server was configured with a 2-disk RAID-1 system boot volume and one or more RAID-5 data volumes. These traces cover a period of one week, starting from 5pm GMT on the 22nd February 2007. The number of requests traced was 434 million, of which 70% were reads.

All the traces were collected at the block level, below the file system buffer cache but above the storage tier. In this paper we consider each per-volume trace as a separate workload, and thus a workload corresponds to a single volume: we use the terms "volume" and "workload" interchangeably throughout the paper. Table 1 shows the servers traced, with the number of volumes, total number of spindles, and total storage capacity used by each. All the traces used in this paper are or will be available on the Storage Networking Industry Association's trace repository [SNIA 2009].

2.3 Related work
There has been much research in optimizing solid-state storage systems and also on hybrid systems with both solid-state and disk storage. For example, there have been proposals for optimizing B-trees [Nath 2007, Wu 2004] and hash-based indexes [Zeinalipour-Yazti 2005] for flash-based storage in embedded systems; to partition databases between flash and disks [Koltsidas 2008]; to optimize database layout for flash storage [Lee 2008]; to use flash to replace some of the main memory buffer cache [Kgil 2006]; to use M-RAM to store file system metadata [Miller 2001]; and to build persistent byte-addressable memory [Wu 1994]. These studies aim to improve storage performance but do not consider the dollar cost, i.e., when does the SSD-based solution cost less than adding more disk spindles to achieve the same performance? In this paper we use performance and capacity normalized by dollar cost to answer this question.

In the laptop storage space, vendors are manufacturing hybrid disks [Samsung 2006] that incorporate a small amount of flash storage within the disk. Windows Vista's ReadyDrive technology [Panabaker 2006] makes use of these hybrid drives to speed up boot and application launch, as well as to save energy by spinning the disk down. However, the use of such hybrid or two-tiered technologies in the server space depends on the cost-performance-capacity tradeoff, which is analyzed by this paper.

Hetzler [2008] has recently argued, based on an analysis of the supply-side cost factors behind the two technologies, that flash-based storage will not achieve capacity/dollar comparable to that of disks in the foreseeable future. This is complementary to our "demand-side" work, which shows the capacity/dollar required for flash to "break even" with disk-based solutions, for different workloads. When taken together, the two analyses present a discouraging prognosis for flash in server storage.

Figure 1. Tool steps: workload traces and business objectives are distilled into workload requirements; benchmarks and vendor specs are distilled into device models; the solver combines the two to produce a recommended configuration.

There have also been several cost-benefit analyses of various storage architectures and hierarchies, including those that examine the use of non-traditional storage technologies, i.e., storage not based on mechanical disks. However, to our knowledge, ours is the first study to specifically examine the cost-performance-capacity tradeoff for solid-state devices using real workloads. Here we briefly describe some of the most related previous work.

Uysal [2003] describes a cost-performance analysis of the tradeoffs involved in using microelectromechanical storage (MEMS) in the storage hierarchy. The analysis covers both replacing NVRAM with MEMS, and replacing some or all of the disks with MEMS. It shows that for workloads where performance rather than capacity dominates the provisioning cost, MEMS can be competitive with disk when it comes within 10x of the price per gigabyte of disks. However, MEMS-based storage is not commercially available today and its cost per gigabyte is unknown.

Baker [1992] explored the use of battery-backed non-volatile RAM (NVRAM) in the Sprite distributed file system to reduce disk write traffic. NVRAM is widely adopted in server storage, especially in high-end disk array controllers, but due to high costs is usually deployed in small sizes. The analysis was based on traces collected in 1991 on four file servers running a log-structured file system serving 40 diskless clients in a university research environment. Server storage workloads in 2007, such as those used in this paper, are substantially different: they include web servers, web caches, and database back ends for web services, with orders of magnitude more storage than in 1991. Furthermore, flash memory today is an order of magnitude cheaper than NVRAM and raises the possibility of replacing disks entirely. This paper explores the new space of workloads and technologies.

3. Modeling the solution space
This section outlines the modeling and solver framework that is used to analyze the solution space of having SSDs in a storage architecture. Figure 1 illustrates the basic steps taken in coming up with a solution. First, workload requirements are collected. Some of them may be easily expressed in terms of business objectives. For example, availability is commonly expressed as the number of nines required, where 3 nines, for example, means that the data is available 99.9% of the time. Other requirements, such as performance expectations, can be automatically extracted by our tool from workload I/O traces. Next, device models are created for the hardware under consideration. Such models capture device characteristics like dollar cost, power consumption, device reliability (usually reported by the vendor, and then refined over time as more empirical data is available [Schroeder 2007]), and performance, which our tool automatically extracts using synthetic micro-benchmarks. Finally, a solver component finds a configuration that minimizes cost while meeting the other objectives. The configuration is either a single-tier configuration (an array of devices of some type) or a two-tier configuration where the top and bottom tiers could use different device types (e.g., SSDs and disks).

Requirement                Unit
Capacity                   GB
Random-access read rate    IOPS
Random-access write rate   IOPS
Random-access I/O rate     IOPS
Sequential read rate       MB/s
Sequential write rate      MB/s
Availability               Nines†
Reliability (MTBF‡)        Years

† Nines = -log10(1 - FractionOfTimeAvailable)
‡ Mean Time Between Failures.

Table 2. Workload requirements and units.
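As a quick illustration of the nines formula in Table 2, here is a minimal sketch; the helper names are ours, not part of the paper's tool:

```python
import math

def nines(fraction_available: float) -> float:
    """Availability expressed as 'nines', e.g. 0.999 -> 3.0 nines."""
    return -math.log10(1.0 - fraction_available)

def downtime_hours_per_year(fraction_available: float) -> float:
    """Expected downtime implied by a given availability."""
    return (1.0 - fraction_available) * 365 * 24

for f in (0.99, 0.999, 0.9999):
    print(f"{f:.4%} available: {nines(f):.0f} nines, "
          f"~{downtime_hours_per_year(f):.1f} hours/year of downtime")
```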

3.1 Extracting workload requirements
Workload requirements make up the first set of inputs to the solver. Table 2 lists the requirements used in this paper. Several of them, such as capacity, availability and reliability, can be straightforward to specify in a service level objective. Performance, however, can be more difficult. Our tool helps an administrator understand a workload's inherent performance requirements by extracting historical performance metrics.

A workload could be characterized in terms of its mean and peak request rates; its read/write ratio; and its ratio of sequential to non-sequential requests. However, just using these aggregate measures can hide correlations between these workload attributes, e.g., a workload might have many random-access reads and streaming writes but few streaming reads and random-access writes. Hence, we prefer to consider random-access reads, random-access writes, sequential reads, and sequential writes each as a separate workload requirement. In addition we consider the total random-access I/O rate: for mixed read/write workloads, this could be higher than the individual read or write I/O rates.

Many workloads, including the ones analyzed in this paper, have time-varying load, e.g., due to diurnal patterns. Hence, provisioning storage for the mean request rates will be inadequate. Our models primarily consider percentiles of offered load rather than means. By default we use the peak request rate (the 100th percentile) but other percentiles such as the 95th can also be used. All the analyses in this paper use peak I/O rates averaged over 1 min intervals.

Attribute                   Unit
Capacity                    GB
Sequential performance      MB/s (read, write)
Random-access performance   IOPS (read, write)
Reliability                 MTBF
Maximum write-wear rate     GB/year
Power consumption           W (idle and active)
Purchase cost               $

Table 3. Device model attributes and units.

Computing the peak read (or write) transfer bandwidth from a trace is straightforward: it is the maximum read (or write) rate in MB/s observed over any minute in the trace. Usually, this corresponds to a period of sequential transfer. If the workload has no sequential runs this will reflect a lower rate of transfer, indicating that sequential transfer is not a limiting factor for this workload.

Random-access performance is measured in IOPS, i.e., I/O operations completed per second. When defining this "IOPS requirement" based on a workload trace, we must take care to filter out sequential or near-sequential runs. Since mechanical disks in particular can sustain a much higher rate of sequential I/Os than random-access ones, sequential runs could cause an overestimate of the true IOPS requirement of the workload. In general, the locality pattern of a sequence of I/Os could fall anywhere between completely random and completely sequential. However, to keep the models simple, we classify each I/O in the workload trace as either sequential or non-sequential. We use LBN (logical block number) distance between successively completed I/Os to classify the I/Os: any I/O that is within 512 KB of the preceding I/O is classified as sequential. This threshold is chosen to be large enough to correctly detect sequential readahead, which on Windows file systems results in a small amount of request re-ordering at the block level. The read, write, and total IOPS requirements of the workloads are then based on the non-sequential I/O rates averaged over a 1 min time scale.
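The following minimal sketch illustrates this classification and peak-rate extraction. It is not the paper's tool: the trace format (records of (timestamp, LBN, size) ordered by completion time) and the choice to measure the LBN distance from the end of the preceding request are our own assumptions.

```python
from collections import defaultdict

BLOCK_SIZE = 512              # assumed bytes per logical block
SEQ_THRESHOLD = 512 * 1024    # I/Os within 512 KB of the previous one count as sequential

def peak_random_iops(trace, window_s=60):
    """Peak non-sequential I/O rate (IOPS), averaged over 1-minute windows.

    `trace` is an iterable of (timestamp_seconds, lbn, size_in_blocks)
    records, ordered by completion time.
    """
    random_io_count = defaultdict(int)   # window index -> random I/Os completed
    prev_end = None                      # byte offset just past the previous I/O
    for t, lbn, size_blocks in trace:
        start = lbn * BLOCK_SIZE
        sequential = prev_end is not None and abs(start - prev_end) <= SEQ_THRESHOLD
        if not sequential:
            random_io_count[int(t // window_s)] += 1
        prev_end = start + size_blocks * BLOCK_SIZE
    return max(random_io_count.values()) / window_s if random_io_count else 0.0
```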

3.2 Device models
Device models make up the second set of inputs to the solver. Table 3 shows the metrics our tool uses to characterize a device. The models are empirical rather than analytical. Our tool runs a sequence of synthetic tests to extract all the performance attributes. Sequential performance is measured using sequential transfers. For mechanical disks, we measured both the outermost and innermost tracks; the difference in bandwidth can be as high as 50%. However, we only report on bandwidth on the outermost tracks in this paper: data requiring fast streaming performance is typically placed on these tracks. Random-access performance is based on issuing concurrent 4 KB requests uniformly distributed over the device, with a queue length of 256 requests maintained in the OS kernel. The aim is to measure the number of IOPS that the device can sustain under high load. All synthetic tests are short in duration, around 20 seconds. We found this sufficient to verify what vendors were reporting.

"Log-structured" SSD. Our tool considers read and write performance as separate metrics. In general, mechanical disks have equivalent read and write performance. For the current generation of SSDs, while sequential read and write performance are equivalent, random-access writes are much slower than sequential writes unless a log structure is used (Section 2.1). In our analyses we consider both flavors of SSD. The default SSD model uses the measured random-write performance of the device, without assuming any remapping or log structure. The "log-structured" SSD model uses a random-write performance equal to the measured sequential write performance; it assumes that any cleaning or compaction activity will be done in the background with no impact on foreground performance.

Scaling. Enterprise storage volumes usually consist of multiple homogeneous devices in a RAID configuration. This gives fault-tolerance for a configurable number of disk failures (usually one) as well as additional capacity and performance. Our device models are based on the assumption that both capacity and performance will scale linearly with the number of devices added. This assumption can be validated in practice using the synthetic tests described above. The level of fault-tolerance is a separately configurable parameter, and the model automatically adds in the appropriate number of additional device(s).

Reliability and fault-tolerance. In general, storage device reliability is hard to predict. Recent work in this area has shown that traditional reliability metrics need to be extracted through empirical observation after a number of years in operation [Jiang 2008, Schroeder 2007]. Such empirical numbers are still lacking for SSDs.

It is known that NAND flash memories suffer from wear: the reliability of the memory decreases as it is repeatedly erased and overwritten. This is measured as the number of write cycles that the SSD can tolerate before data retention becomes unreliable, i.e., the number of times that each block in the SSD can be erased and overwritten. Typical enterprise SSDs specify a wear tolerance of 10,000-100,000 write cycles. We convert this metric into a maximum write-wear rate R_wear (expressed in GB/year) as follows: for an SSD with a capacity C (expressed in GB), a wear tolerance of N_wear cycles, and a desired MTBF (mean time between failures) of T_fail (measured in years):

    R_wear = (C · N_wear) / T_fail

In this paper we use a default T_fail of 5 years, to match the typical target lifetime of a storage device before it is upgraded or replaced.
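A minimal sketch of this conversion; the 32 GB / 10,000-cycle figures below are illustrative round numbers, not the rated values of any particular device:

```python
def max_write_wear_rate(capacity_gb: float, wear_cycles: float,
                        target_lifetime_years: float = 5.0) -> float:
    """R_wear = C * N_wear / T_fail, in GB of writes per year."""
    return capacity_gb * wear_cycles / target_lifetime_years

# A 32 GB SSD rated for 10,000 write-erase cycles, amortized over 5 years:
# 32 * 10,000 / 5 = 64,000 GB/year, i.e. roughly 175 GB of writes per day.
print(max_write_wear_rate(32, 10_000))
```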


Figure 2. Two-tier architecture using SSDs and disks. The solid-state tier holds a write-ahead log and a read cache; writes go to the write-ahead log and are later flushed to the disk tier, while reads are served from the read cache on a hit and from the disk tier on a miss.

Since this paper is focused on solid-state storage, and wear is a novel, SSD-specific phenomenon, we include it in our device models. Currently we do not model other failures, such as mechanical failures in disks. Given the widespread use of RAID for fault-tolerance, we do not believe that reliability differences are an important factor for provisioning; however, verifying this remains future work.

3.3 Tiered models
The cost and performance characteristics of solid-state memory are in between those of main memory (DRAM) and traditional storage (disks). Hence, it also makes sense to consider solid-state devices not just as a replacement for disks, but also as an intermediate storage tier between main memory and disks. This tier could cache more data than DRAM (since it is cheaper per gigabyte) and hence improve read performance. Unlike DRAM, solid-state memories also offer persistence. Hence, this tier could also be used to improve write performance, by using it as a write-ahead log which is lazily flushed to the lower tier. Several storage and file system architectures [Miller 2001, Panabaker 2006] have been proposed to use solid-state memory as a cache and/or a write-ahead log. This paper quantifies the benefits of these approaches for server workloads.

Our tool supports hybrid configurations where solid-state devices are used as a transparent block-level intermediate tier. Caching and write-ahead logging could also be done at a higher level, e.g., at the file system, which would allow policies based on semantic knowledge, such as putting metadata in the cache. Such policies, however, would require changes to the file system, and our block-level workload traces do not include such semantic information. We compute the benefits of transparent block-level caching and logging without assuming any changes in the file system or application layer.

Figure 2 shows the architecture we assume. The solid-state tier is divided into a write-ahead log and a larger read cache area. In practice, the write-ahead log is smaller than the read cache; also, the location of the write log would be periodically moved to level the wear across the flash storage device. The write-ahead log could also be replicated over multiple SSDs depending on workload fault-tolerance requirements. The read cache, by contrast, has only clean blocks and does not need such fault tolerance. Note that this is an inclusive caching model, where blocks stored in the SSD cache tier are also stored in the disk tier. Our tool also supports an exclusive, or "partitioning" model where each block is stored on either SSD or disk. However, given the huge disparity in capacities between SSDs and disk, a partitioning model provides negligible capacity savings. In this paper, we only use the more common (and simpler) inclusive model.

Our models use the caching and write-ahead logging policies described below, which are based on our experiences with the traced workloads as well as key first-order properties of SSDs and disks. They can easily be extended to cover other policies.

3.3.1 Read caching
Our workloads are server I/O workloads and hence have already been filtered through large main-memory buffer caches. Hence, there is little short-term temporal locality in the access patterns, and cache replacement policies such as LRU (least recently used) that aim to exploit this locality are unlikely to do well. However, there could be benefit in caching blocks based on long-term access frequency. Also, the benefit of caching is highest for random-access reads: disk subsystems are already efficient for streaming reads.

We have implemented two different caching policies in our models. The first is LRU. The second is a "long-term random-access" (LTR) policy that caches blocks based on their long-term random-access frequency. To do this, we rank the logical blocks in each trace according to the number of random accesses to each; accesses are classified as random or sequential as described previously in Section 3.1. The total number of reads to each block is used as a secondary ranking metric. Note that this policy acts like an "oracle" by computing the most frequently accessed blocks before they are accessed: hence while LRU gives an achievable lower bound on the cache performance, the LTR policy described here gives an upper bound.
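A minimal sketch of the LTR ranking, under our own assumptions about the trace representation (an iterable of (lbn, is_random) pairs for the read requests, with the random/sequential label assigned as in Section 3.1):

```python
from collections import Counter

def ltr_cache_set(read_trace, cache_blocks):
    """Blocks to cache under the long-term-random (LTR) policy.

    Ranks blocks primarily by the number of random-access reads they
    receive over the whole trace, and secondarily by total reads; since
    the ranking uses the full trace, this is an oracle policy.
    """
    random_reads, total_reads = Counter(), Counter()
    for lbn, is_random in read_trace:
        total_reads[lbn] += 1
        if is_random:
            random_reads[lbn] += 1
    ranked = sorted(total_reads,
                    key=lambda b: (random_reads[b], total_reads[b]),
                    reverse=True)
    return set(ranked[:cache_blocks])
```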

For a given cache size of H blocks, the tool splits the workload trace into two, such that accesses to cached blocks go to the top tier, and the remainder to the bottom tier. It then computes the best configuration independently for each tier. In theory, each tier could be provisioned with any device type; in practice, the only solutions generated are those with a solid-state tier on top of a disk tier. The solver then iterates over values of H to find the least-cost configuration.

3.3.2 Write-ahead log
It is straightforward to absorb writes to a storage volume on a much smaller solid-state device; the written data can then be flushed in the background to the underlying volume. For low response times and high throughput, the solid-state device should be laid out using a log structure, for example by using a block-level versioned circular log [Narayanan 2008b]. Writes can be acknowledged as soon as they are persistent in the log. The log space can be reused when the writes become persistent on the underlying volume.

Write-ahead logging can be combined with read caching as shown in Figure 2. In this case all writes, and all reads of cached blocks, are redirected to the solid-state tier. To guarantee sequential performance for the writes, they are sent to a separately allocated log area on the solid-state device. They are then lazily flushed to the disk tier, and updated in the read cache if required. In theory, this could cause double-buffering of blocks in the write-ahead log and in the read cache; however, as we will see in Section 4, the size of the write-ahead log required for our workloads is very small and hence this is not a concern.

Background flushes can take advantage of batching, coalescing, overwriting of blocks, and low-priority I/O to reduce the I/O load on the lower tier. The efficacy of these optimizations depends on the workload, the flushing policy, and the log size. We evaluate both extremes of this spectrum: a "write-through" log where writes are sent simultaneously to the disk tier and to the log, and a "write-back" log where the lazy flushes to the disk tier are assumed to be free. The performance requirements for the write-ahead log and the disk tier are then derived by splitting the original workload trace according to the logging policy: writes are duplicated on the log and the disk tier for the "write-through" policy, and sent only to the log for the "write-back" policy (the eventual flushes to disk are assumed to be free).

To find the cost of adding a write-ahead log (with or without read caching) the tool must estimate both the capacity as well as the performance required from the log. The log capacity is measured as the size of the largest write burst observed in the workload trace: the maximum amount of write data that was ever in flight at one time. This gives us the amount of log space required to absorb all writes when using the "write-through" logging policy. In theory, a larger log could give us better performance when using "write-back", by making background flushing more efficient. However, in practice, we found that "write-back" did not reduce the load on the disk tier sufficiently to have an impact on provisioning: hence we did not explore this tradeoff further, but only considered the log capacity required for "write-through".
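A minimal sketch of the write-burst measurement; the trace format (issue time, completion time, size in bytes per write) and the reading of "in flight" as issued-but-not-yet-completed write data are our assumptions:

```python
def max_write_burst_bytes(write_events):
    """Largest write burst: the peak number of write bytes in flight at once.

    `write_events` is an iterable of (issue_time, complete_time, size_bytes)
    tuples for the write requests in a trace.
    """
    deltas = []
    for issued, completed, size in write_events:
        deltas.append((issued, +size))
        deltas.append((completed, -size))
    in_flight = peak = 0
    # At equal timestamps, apply completions (negative deltas) before issues.
    for _, delta in sorted(deltas, key=lambda e: (e[0], e[1])):
        in_flight += delta
        peak = max(peak, in_flight)
    return peak
```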

3.4 Solver
Given workload requirements, and per-device capabilities as well as costs, the solver finds the least-cost configuration that will satisfy the requirements. Any cost metric can be used: in this paper we use purchase cost in dollars and power consumption in watts. These could be combined into a single dollar value based on the anticipated device lifetime and the cost per watt of powering the device, if known; in our analyses we show the two costs separately. The number of devices required N(d, w) of any particular device type d to satisfy the requirements of workload w is:

    N(d, w) = max_m ⌈ r_m(w) / s_m(d) ⌉ + F(w)    (1)

where m ranges over the different metrics of the workload requirement: capacity, random-access read rate, etc. r_m(w) is the workload's requirement for the metric m (measured in GB, MB/s, IOPS, etc.) and s_m(d) is the device's score on that metric measured in the same units as r_m(w). In other words, the number of devices is determined by the most costly metric to satisfy. The workload's fault-tolerance requirement F(w) is specified separately as the number of redundant devices required. In this case we are assuming "parity-based" fault-tolerance such as RAID-5, where each additional device provides tolerance against one additional device failure. The model can also be extended to replication-based fault-tolerance such as mirroring.

The minimum cost for satisfying the workload requirements is then

    C_min(w) = min_d N(d, w) · C(d)    (2)

where C(d) is the cost per device of device type d.

For two-tiered solutions, the workload trace is split into two traces w_top(w, H) and w_bot(w, H), where H is the size of the top tier. The performance metrics for the two workloads are extracted from the two derived traces. The lower tier has the same capacity and fault-tolerance requirements as the original workload. For the top tier the capacity requirement is simply H; the fault-tolerance is set to 0 for the read cache and to F(w) for the write-ahead log. The cost C_min/tiered(w) of the optimal tiered configuration is then given by:

    C_tiered(w, H) = C_min(w_top(w, H)) + C_min(w_bot(w, H))    (3)

    C_min/tiered(w) = min_H C_tiered(w, H)    (4)

In theory, H is a continuously varying parameter. However, in practice, solid-state memories are sold in discrete sizes, and very large solid-state memories are too expensive to be part of a least-cost tiered solution. Additionally, each value of H requires us to reprocess the input traces: the most expensive step in the tool chain. Hence, our tool restricts the values of H to powers of 2 ranging from 4-128 GB.

The solver is currently implemented as 1580 unoptimized lines of C and Python. The run time is dominated by the time to split each workload trace for each value of H, and to extract the workload requirements: this is linear in the size of the trace file, which uses a text format. The traces used in this paper vary in size from 40 KB-10 GB; a typical trace of size 255 MB took 712 s to process on a 2 GHz AMD Opteron.
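A minimal sketch of the single-tier cost minimization in equations (1) and (2); the data layout and function names are our own, and the capability numbers are simply copied from two rows of Table 4:

```python
import math

# Per-device capabilities (a subset of the Table 3 metrics), values from Table 4.
DEVICES = {
    "Cheetah 10K":   {"cost": 339, "caps": {"gb": 300, "read_iops": 277, "read_mbps": 85}},
    "Memoright SSD": {"cost": 739, "caps": {"gb": 32, "read_iops": 6450, "read_mbps": 121}},
}

def num_devices(dev, workload, fault_tolerance):
    """Equation (1): N(d, w) = max over metrics of ceil(r_m(w) / s_m(d)) + F(w)."""
    return max(math.ceil(req / dev["caps"][m]) for m, req in workload.items()) + fault_tolerance

def least_cost_config(workload, fault_tolerance=1):
    """Equation (2): pick the device type minimizing N(d, w) * C(d)."""
    def cost(dev):
        return num_devices(dev, workload, fault_tolerance) * dev["cost"]
    name, dev = min(DEVICES.items(), key=lambda kv: cost(kv[1]))
    return name, num_devices(dev, workload, fault_tolerance), cost(dev)

# Example: a workload needing 500 GB and a 400 IOPS random-read peak.
print(least_cost_config({"gb": 500, "read_iops": 400}))
# -> ('Cheetah 10K', 3, 1017): capacity and IOPS each need 2 disks, plus 1 for parity.
```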

3.5 Limitations
Our analyses are based on simple first-order models of device performance. These can estimate workload requirements and device throughput in terms of random or sequential I/O rates, but not per-request response times. We believe these first-order estimates of throughput are adequate for coarse-grained decisions such as provisioning. For more detailed characterization of I/O performance, or for workloads with specific response time requirements, more sophisticated performance models [Bucy 2008, Popovici 2003] can be substituted for the simple first-order models used here.

Device                  Price (US$)  Capacity (GB)  Power (W)  Read (MB/s)  Write (MB/s)  Read IOPS  Write IOPS  Max write-wear rate (GB/year)
Memoright MR 25.2 SSD   739          32             1.0        121          126           6450       351         182500
Seagate Cheetah 10K     339          300            10.1       85           84            277        256         n/a
Seagate Cheetah 15K     172          146            12.5       88           85            384        269         n/a
Seagate Momentus 7200*  150          200            0.8        64           54            102        118         n/a

Table 4. Storage device characteristics. *All devices except the Momentus are enterprise-class.

Our analyses are also limited by the workload traces available to us, which are block-level, open-loop traces. Thus, while they present a realistic picture of the offered load to the storage layer, this is only a lower bound on the potential I/O rate of the application: if the storage system were the bottleneck for the application, then a faster storage system would result in a higher I/O rate. The storage systems that we traced are well-provisioned and we do not believe that they were a performance bottleneck, and hence the measured storage load is a good approximation of the workload's "inherent" storage I/O requirement.

Since our traces are taken at the block level, we only analyze system changes below the block level. For example, we do not model the effect of changing the main memory buffer caches, or of using file system specific layout strategies, e.g., putting only metadata on the SSDs.

4. Results
This section has four parts. First, it presents the device characteristics extracted by the tool from a number of representative storage devices, and compares them in terms of capabilities per dollar of purchase cost. Second, it examines for each workload whether disks can be fully replaced by SSDs at a lower dollar cost, while satisfying workload requirements. The third part is a similar analysis for SSDs as an intermediate storage tier used as a cache or write-ahead log. The final part is an analysis and discussion of power considerations, including a comparison of the enterprise-class SSDs and disks with a low-power SATA disk.

4.1 Analysis of device characteristics
Table 4 shows the four representative devices used in this paper. The Memoright MR 25.2 was chosen to represent SSDs since it was the fastest available mid-range (US$1000-5000 per device) SSD at the time of the analysis. The Seagate Cheetah is a widely used high-end enterprise class disk available in two speeds, 10,000 rpm and 15,000 rpm. The Seagate Momentus is a low-power, low-speed drive that is not currently used for enterprise storage; it is included as an interesting additional data point. We defer discussion of the Momentus until Section 4.4.

Figure 3. Device capabilities (GB/$, MB/s/$, and IOPS/$) normalized by dollar cost, for the Memoright MR-25.2, Seagate Cheetah 10K, and Seagate Cheetah 15K.

For each device we considered multiple versions which differed in per-device capacity, since this leads to different cost-performance-capacity-power tradeoffs. One representative device (shown in Table 4) was chosen to extract the performance characteristics using the micro-benchmarks described in Section 3: we assumed that all versions of a given model would have similar performance. The dollar costs are US on-line retail prices as of June 2008 and the power consumptions are based on vendor documentation.

Thus each device is characterized by multiple dimensions: capacity, sequential read bandwidth, etc. Our tool considers all these dimensions as well as all the variants of each device (different capacities, log-structured SSDs, etc.). However, to understand the tradeoffs between the devices it helps to visualize the most important features. Figure 3 shows the three most important axes for this paper, normalized by dollar cost: capacity (measured in GB/$), sequential read performance (MB/s/$), and random-access read performance (IOPS/$). For each device type, we chose the version having the highest value along each axis.
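The normalization itself is just a division by purchase price. The sketch below applies it to the single representative models in Table 4; note that Figure 3 instead picks, for each device type, the capacity variant with the best value on each axis, so the ratios below need not match the figure.

```python
# Per-dollar capability ratios, computed from the Table 4 rows.
devices = {
    "Memoright MR-25.2":   {"price": 739, "gb": 32,  "read_mbps": 121, "read_iops": 6450},
    "Seagate Cheetah 10K": {"price": 339, "gb": 300, "read_mbps": 85,  "read_iops": 277},
    "Seagate Cheetah 15K": {"price": 172, "gb": 146, "read_mbps": 88,  "read_iops": 384},
}

for name, d in devices.items():
    print(f"{name:22s}  {d['gb'] / d['price']:.3f} GB/$  "
          f"{d['read_mbps'] / d['price']:.3f} MB/s/$  "
          f"{d['read_iops'] / d['price']:.2f} IOPS/$")
```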

Several observations can be made at this stage, even without considering workload requirements. SSDs provide more random-access IOPS per dollar whereas the enterprise disks win on capacity and sequential performance. It seems likely that the main factor in choosing SSDs versus disks is the trade-off between the capacity requirement and the IOPS requirement of the workload. Sequential bandwidth also favors disks, but the difference is smaller than the order-of-magnitude differences on the other two axes. Interestingly, the faster-rotating 15K disk has no performance advantage over the 10K disk when performance is normalized by dollar cost. This suggests that there could be room for even slower-rotating disks in the enterprise market.

Figure 4. IOPS/capacity trade-off (log-log scale): each workload is plotted by its random-access read rate (IOPS) against its capacity requirement (GB), with a line separating the regions where SSDs and enterprise disks are the cheaper choice.

4.2 Replacing disks with SSDs
A natural question this section seeks to answer is "what does it take to replace disks with SSDs?" Whole-disk replacement has the appeal of requiring no architectural changes to the storage subsystem. This section answers the question by using the provisioning tool to find the least-cost device for each of our traced volumes, for a single-tiered configuration.

Of the three enterprise-class devices shown in Table 4, the Cheetah 10K disk was the best choice for all 49 volumes. In all cases the provisioning cost was determined by either the capacity or the random-read IOPS requirement. In other words, the sequential transfer and random-access write requirements are never high enough to dominate the provisioning cost. This was the case even for the default (non-log-structured) SSD model, where random-access writes are more expensive than when using a log structure.

For workloads where capacity dominates the provisioning cost, clearly disks are a more cost-effective solution. However, while IOPS dominated the cost for some disk-based solutions, capacity always dominated the cost when using SSDs, due to the low capacity/dollar of SSDs. Thus when both capacity and performance requirements are considered, disks always provided the cheapest solution. Figure 4 shows the different workloads' requirements as points plotted on the two axes of random-read IOPS and capacity. The line separates the two regions where SSDs and disks respectively would be the optimal choice; none of our workloads fall in the "SSD" region.

At today's prices SSDs cannot replace enterprise disks for any of our workloads: the high per-gigabyte price of SSDs today makes them too expensive even for the workloads with the highest IOPS/GB ratio. At what capacity/dollar will SSDs become competitive with enterprise disks? We can expect that disks will continue to provide a higher capacity/dollar than SSDs, even as the latter improve over time. However (keeping the other device parameters fixed) at some point the capacity/dollar will be high enough so that IOPS rather than capacity will dominate the cost, and SSDs will become a better choice than disks.

We define the break-even point SSD_GB/$ as the point at which the cost of an SSD-based volume (dominated by capacity) will equal that of a disk-based volume (dominated by IOPS). This gives us:

    W_GB / SSD_GB/$ = W_IOPS / Disk_IOPS/$    (5)

and hence

    SSD_GB/$ = (W_GB / W_IOPS) · Disk_IOPS/$    (6)

where W_GB and W_IOPS are the workload's capacity and IOPS requirements respectively, and Disk_IOPS/$ is the IOPS/$ of the Cheetah 10K disk. In other words, SSDs will become competitive with disks when the capacity cost of the SSD equals the IOPS cost of the disk, for a given workload. Figure 5 shows this break-even point for each volume, on a log scale. For reference, it also shows the current capacity/dollar for the SSD and the Cheetah 10K disk.

The break-even point varies from 3-3000 times the capacity/dollar of today's SSDs. Some smaller volumes, especially system volumes (numbered 0 in the figure), require only a 2-4x increase in SSD capacity/dollar to consider replacement of disks by SSDs. However, most volumes will require an improvement of 1-3 orders of magnitude. For 21 of 45 volumes, the break-even point lies beyond the current capacity/dollar of the Cheetah 10K disk: this is because capacity dominates the provisioning cost of these volumes today even when using disks. If we assume that disks will retain a capacity/dollar advantage over SSDs in the future, for these workloads the capacity/dollar must first increase to the point where capacity is no longer the dominant cost for disks; beyond this point SSDs will become competitive since they cost less per IOPS.
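Equation (6) is easy to evaluate per volume; the sketch below uses an illustrative workload (500 GB, 400 random-read IOPS) and the Cheetah 10K figures from Table 4 for Disk_IOPS/$:

```python
def breakeven_ssd_gb_per_dollar(workload_gb, workload_iops, disk_iops_per_dollar):
    """Equation (6): the SSD capacity/$ at which an SSD volume priced by
    capacity matches a disk volume priced by IOPS, for one workload."""
    return (workload_gb / workload_iops) * disk_iops_per_dollar

disk_iops_per_dollar = 277 / 339          # Cheetah 10K: ~0.82 IOPS/$ (Table 4)
print(breakeven_ssd_gb_per_dollar(500, 400, disk_iops_per_dollar))  # ~1.02 GB/$
```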

4.3 Two-tiered configurations
This section answers the question "what are the benefits/costs of augmenting the existing storage hierarchy with SSDs?" Specifically, we consider the two-tiered configuration described in Section 3.3, where the SSD tier functions as a write-ahead log as well as a read cache for frequently read, randomly-accessed blocks. Our analysis shows that the amount of storage required for the write-ahead log is small compared to SSD sizes available today. Thus, it is reasonable to allocate a small part of the solid-state memory for use as a write-ahead log and use the rest as a read cache. We first present the analysis for the write-ahead log, and then for a combined write-ahead log/read cache tier.


Figure 5. Break-even point (GB/$) at which SSDs can replace disks, per server/volume, on a log scale. For comparison we show the current capacity/dollar for enterprise SSDs and for the Cheetah 10K disk (2008 prices). Volumes numbered 0 are system boot volumes.

4.3.1 Write-ahead log
Across all workloads, the maximum write burst size, and hence the maximum space required for the write-ahead log, was less than 96 MB per volume. The peak write bandwidth required was less than 55 MB/s per volume: less than half the sequential write bandwidth of the Memoright SSD. Thus, a single SSD can easily host the write-ahead log for a volume using only a fraction of its capacity. Even with fault tolerance added, the capacity and bandwidth requirement is low. We repeated the analysis, but this time sharing a single write-ahead log across all volumes on the same server. The peak capacity requirement increased to 230 MB and the peak bandwidth requirement remained under 55 MB/s. The peak bandwidth requirement did not increase significantly because peaks across volumes are not correlated.

This analysis is based on a "write-through" log, which does not reduce the write traffic to the disk tier. We also re-analyzed all the workloads assuming a "write-back" log that absorbs all write traffic, i.e., assuming that the background flushes are entirely free. We found that this reduction in write traffic did not reduce the provisioning requirement for the disk tier for any of the workloads; this was expected since the limiting factor for the workloads was always capacity or read performance. Thus, while a larger log with a lazy write-back flush policy can reduce load on the disk tier, the load reduction is not enough to reduce the provisioning cost.

Given the small amount of log space required, it seems clear that if there is any solid-state memory in the system, a small partition should be set aside as a write-ahead log for the disk tier. This can improve write response times but will not reduce the cost of provisioning the disk tier.

4.3.2 Write-ahead log with read cache
Capacity is clearly an obstacle to replacing disks entirely with SSDs. However, if a small solid-state cache can absorb a large fraction of a volume's load, especially the random-read load, then we could potentially provision fewer spindles for the volume. This would be cheaper than using (more) DRAM for caching, since flash is significantly cheaper per gigabyte than DRAM. It is important to note that we are not advocating replacing main memory buffer caches with solid-state memory. Our workload traces are taken below the main memory buffer cache: hence we do not know how effective these caches are, or the effect of removing them.

The blocks to store in the read cache are selected according to one of the two policies described in Section 3.3.1: least-recently-used (LRU) and long-term-random (LTR). Reads of these blocks are sent to the top (SSD) tier, and other reads to the disk tier. All writes are sent simultaneously to both tiers, in line with the "write-through" log policy.

We used the solver to test several tiered configurations for each workload, with the capacity of the solid-state tier set to 4, 8, 16, 32, 64 and 128 GB. For each workload we compared the best two-tiered configuration with the single-tiered disk-based configuration. Only for 4 out of 49 volumes was the two-tiered configuration the better choice at today's SSD capacity/dollar. For each workload, we also calculated the break-even capacity/dollar at which the best two-tier solution would equal the cost of a disk-based solution. For 40 out of 49 volumes, this break-even point was the same as the break-even point for replacing all the disks with SSDs. In other words, the cache tier does not absorb enough I/Os for these 40 volumes to enable any reduction of spindles in the disk tier. In some cases, this is because the volume's provisioning is determined by capacity requirements, which are not reduced by caching. In other cases, the cache hit rates are too low to significantly reduce the peak load on the disk tier: we speculate that this is because main-memory buffer caches have already removed much of the access locality.

For the 9 workloads where caching has some benefit, Figure 6 shows the break-even point: the SSD capacity/dollar required for the two-tiered solution to become competitive. The last 4 workloads in the figure have a break-even point below today's SSD capacity/dollar, i.e., they can already benefit from an SSD cache tier.


Figure 6. Break-even points for workloads to become cacheable (log scale): break-even point (GB/$) per server/volume, for the LTR and LRU policies, with the 2008 SSD capacity/dollar shown for reference. Only the LTR policy is shown for the Exchange volumes; with LRU there is no benefit and hence no break-even point for these volumes.

The first 5 workloads appear to be cacheable with a break-even point not far from today's SSD capacity/dollar, but only with the LTR policy. Since the LTR policy is used in "oracle" mode, it is not clear that its benefits are achievable in practice. With LRU these workloads show no benefit from caching. However, we only have 24-hour traces for the Exchange workloads; longer traces are needed to decide if the workloads contain longer-term locality patterns that LRU could exploit.

4.4 Power

The previous sections used purchase cost as the metric to minimize while satisfying workload requirements. Here we look at the implications of minimizing power consumption instead. We made the conscious choice to analyze, in addition to the enterprise disks and SSDs, a non-enterprise SATA disk in this section: the Seagate Momentus 7200. We are aware that SATA and SCSI disks are different, especially in terms of their reliability [Anderson 2003]. However, we chose not to ignore a SATA-based solution for two reasons. First, there is much work on storage clusters built from cheap commodity hardware, where reliability is handled at a higher level. Second, recent empirical research on the reliability of disk drives has shown inconclusive results [Jiang 2008, Schroeder 2007]. However, we caution that the analyses here are based primarily on performance, capacity and power metrics. More empirical evidence is required to make any strong claim about reliability.

Figure 7 shows the four devices compared by capacity, sequential bandwidth, and random-read IOPS, all normalized by the device's power consumption. We use idle power numbers (device ready/disks spinning, but no I/O), since our workloads have considerable idleness over their duration. Using active power does not make much difference in our calculations. The main takeaway is that, when scaled per watt, SSDs have much better performance and comparable capacity to the enterprise disks.

Figure 7. Device capabilities normalized by power consumption (GB/W, MB/s/W and IOPS/W) for the Seagate Momentus 7200, Memoright MR-25.2, Seagate Cheetah 10K and Seagate Cheetah 15K.

Figure 8. IOPS/capacity trade-off when optimizing for power (log-log scale): random-access reads (IOPS) versus capacity (GB), with each workload falling in either the SSD or the low-power SATA disk region.

However, the low-power, low-speed Momentus does far better than the SSD in terms of gigabytes per watt, and better than the enterprise disks on all three power-normalized metrics. We also found that the Momentus significantly outperformed the Cheetahs on capacity and performance per dollar; however, the price advantage might only reflect market forces, whereas power is more a property of the underlying technology.

For our workloads, the SSD was the lowest-power single-tiered solution for 11 out of 49 workloads; the Momentus was the lowest-power choice for the remaining 38. Again, random-read IOPS and capacity were the limiting factors for all workloads. Figure 8 shows the workloads as points on these two axes: the graph is divided according to the device providing the lowest-power solution.
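The selection itself reduces to scaling each requirement by the corresponding per-device rate and summing idle power, roughly as sketched below; the per-device figures in the table are illustrative placeholders, not the measured values used in our analysis.

```python
# Sketch: pick the lowest-power single-tier solution. The number of devices
# is set by whichever requirement (capacity, bandwidth, random-read IOPS) is
# hardest to meet; the per-device figures are placeholders, not the paper's
# measured values.
import math

DEVICES = {
    "Memoright MR-25.2 (SSD)": dict(gb=32,  mbps=100, iops=6000, idle_w=1.0),
    "Seagate Momentus 7200":   dict(gb=200, mbps=60,  iops=100,  idle_w=0.8),
    "Seagate Cheetah 10K":     dict(gb=300, mbps=85,  iops=280,  idle_w=10.0),
}

def lowest_power_solution(req_gb, req_mbps, req_iops):
    best = None
    for name, d in DEVICES.items():
        n = max(math.ceil(req_gb / d["gb"]),
                math.ceil(req_mbps / d["mbps"]),
                math.ceil(req_iops / d["iops"]))
        watts = n * d["idle_w"]
        if best is None or watts < best[1]:
            best = (name, watts, n)
    return best            # (device name, total idle watts, device count)
```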

The analysis so far has been independent of energy prices. In general, however, the cost of power consumption must be balanced against that of provisioning the hardware, by computing the overall cost over the device lifetime or upgrade period (typically 3–5 years). Figure 9 shows the "5-year break-even energy price" (in $/kWh). This is the energy price at which the power savings over 5 years of an SSD-based solution will equal the additional purchase cost. We show the break-even price for the SSD against the Cheetah for all 49 volumes, and for the SSD against the Momentus for the 11 volumes where the SSD was more power-efficient than the Momentus.


Figure 9. CDF of the 5-year break-even energy price for SSDs (log scale): number of workloads versus energy price ($/kWh), shown for break-even against the Cheetah and against the Momentus, with the 2008 US energy price marked for reference.

For reference we show the commercial US energy price as of March 2008 [US Department of Energy 2008].

Note that the break-even points are 1–3 orders of magnitude above current energy prices: even if we allow a 100% overhead in energy costs for cooling and power supply equipment, we need energy prices to increase by a factor of 5 for the power savings of SSDs to justify the initial cost for even the smallest volumes. Thus, perhaps surprisingly, power consumption alone is not a compelling argument for SSDs. However, a combination of falling SSD per-gigabyte prices and rising energy prices could motivate the replacement of disks by SSDs in a few of the smaller volumes.
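In symbols, with purchase costs $C$ (in dollars) and idle power draws $P$ (in kW) for the SSD-based and disk-based solutions, the 5-year break-even energy price defined above can be written as (our notation, not the paper's):

```latex
p^{*} \;=\; \frac{C_{\mathrm{SSD}} - C_{\mathrm{disk}}}
                 {\left(P_{\mathrm{disk}} - P_{\mathrm{SSD}}\right)\cdot 5 \cdot 8760\ \mathrm{h}}
\qquad [\$/\mathrm{kWh}]
```

so a larger purchase-cost gap, or a smaller power gap, pushes the break-even energy price $p^{*}$ up.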

4.5 Reliability and wear

In this paper so far we have considered performance, capacity, dollar cost and power consumption as the metrics of interest for provisioning storage. While reliability is also an important metric, we believe that the pervasive use of RAID for fault-tolerance makes it a less important factor. Moreover, the factors that determine disk reliability are still not conclusively known, and are largely estimated from empirical studies [Jiang 2008, Schroeder 2007]. Such empirical evidence is lacking for SSDs. Flash-based SSD vendors do provide a wear-out metric for their devices, as the number of times any portion of the flash memory can be erased and rewritten before the reliability begins to degrade. Here we show the results of an analysis based on the wear metric: our purpose is to judge whether SSDs deployed in any particular configuration are likely to fail before their nominal lifetime, i.e., before the next hardware upgrade.

Based on the long-term write rate of each workload, we computed the time after which the wear-out limit would be reached for each volume. Figure 10 shows the CDF of this wear-out time in years on a log scale. Note that this assumes that in the long term, the wear is evenly distributed over the flash memory. This relies on using wear-leveling techniques [Birrell 2007, Gal 2005] that avoid in-place updates and hence require some additional background writes for defragmentation.

Figure 10. CDF of wear-out times (log scale): number of workloads versus wear-out time (years), shown for the entire volume and for a 1 GB write-ahead log.

However, even if we conservatively assume a high overhead of 50% (one background write for every two foreground writes), the majority of volumes have wear-out times exceeding 100 years. All volumes with the exception of one small 10 GB volume have wear-out times of 5 years or more. Hence, we do not expect that wear will be a major contributor to the total cost of SSD-based storage.

Wear can be a concern, however, for flash used as a write-ahead log. Here a relatively small flash device absorbs all the writes sent to a volume. The dotted line in Figure 10 shows the CDF of estimated wear-out time for a 1 GB flash used as a write-ahead log for each of our workloads. Here we assume no overhead, since a circular log used for short-term persistence does not need to be defragmented. However, due to the relatively small size of the log, each block in the log gets overwritten frequently, and 28 out of 49 workloads have a wear-out time of less than 5 years. Thus, while flash-based SSDs can easily provide the performance and capacity required for a write-ahead log, wear is a significant concern and will need to be addressed. If a large flash memory is being used as a combined read cache and write log, one potential solution is to periodically rotate the location of the (small) write log on the flash.
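The wear-out estimate follows directly from the device capacity, its rated erase cycles, and the long-term write rate. The sketch below is our illustration: the default cycle rating and the example numbers are assumptions, not figures from our measurements.

```python
# Sketch: estimate flash wear-out time from a long-term average write rate,
# assuming ideal wear-leveling plus a configurable overhead for background
# cleaning writes. The default erase-cycle rating is an assumed value.
SECONDS_PER_YEAR = 365 * 24 * 3600

def wear_out_years(flash_gb, write_mb_per_s, erase_cycles=100_000, overhead=0.5):
    """Years until every cell has been erased erase_cycles times."""
    lifetime_budget_mb = flash_gb * 1024 * erase_cycles   # total writable MB
    effective_rate = write_mb_per_s * (1.0 + overhead)    # foreground + cleaning
    return lifetime_budget_mb / effective_rate / SECONDS_PER_YEAR

# For example, a 1 GB log absorbing 0.5 MB/s of writes with no cleaning
# overhead (a circular log is not defragmented) lasts roughly
# wear_out_years(1, 0.5, overhead=0.0) ~= 6.5 years under these assumptions.
```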

5. Conclusion

Flash-based solid-state drives (SSDs) are a new storage technology for the laptop and desktop markets. Recently "enterprise SSDs" have been targeted at the server storage market, raising the question of whether, and how, SSDs should be used in servers. This paper answers the above question by doing a cost-benefit analysis for a range of workloads. An additional contribution of the paper is a modeling and optimization approach which could be extended to cover other solid-state technologies.

We show that, across a range of different server workloads, replacing disks by SSDs is not a cost-effective option at today's prices. Depending on the workload, the capacity/dollar of SSDs needs to improve by a factor of 3–3000 for SSDs to be able to replace disks. The benefits of SSDs as an intermediate caching tier are also limited, and the cost of provisioning such a tier was justified for fewer than 10% of the examined workloads.

Acknowledgments

We thank Bruce Worthington, Swaroop Kavalanekar, Chris Mitchell and Kushagra Vaid for the Exchange and MSN storage traces used in this paper. We also thank the anonymous reviewers and our shepherd John Wilkes for their feedback.

References

[Agrawal 2008] Nitin Agrawal, Vijayan Prabhakaran, Ted Wobber, John D. Davis, Mark Manasse, and Rina Panigrahy. Design tradeoffs for SSD performance. In Proc. USENIX Annual Technical Conference, pages 57–70, Boston, MA, June 2008.

[Anderson 2003] Dave Anderson, Jim Dykes, and Erik Riedel. More than an interface - SCSI vs. ATA. In Proc. USENIX Conference on File and Storage Technologies (FAST), pages 245–257, San Francisco, CA, March 2003.

[Anderson 2005] Eric Anderson, Susan Spence, Ram Swaminathan, Mahesh Kallahalla, and Qian Wang. Quickly finding near-optimal storage designs. ACM Trans. Comput. Syst., 23(4):337–374, 2005.

[Baker 1992] Mary Baker, Satoshi Asami, Etienne Deprit, John Ousterhout, and Margo Seltzer. Non-volatile memory for fast, reliable file systems. In Proc. International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 10–22, Boston, MA, October 1992.

[Birrell 2007] Andrew Birrell, Michael Isard, Chuck Thacker, and Ted Wobber. A design for high-performance flash disks. Operating Systems Review, 41(2):88–93, 2007.

[Bucy 2008] John S. Bucy, Jiri Schindler, Steven W. Schlosser, and Gregory R. Ganger. The DiskSim simulation environment version 4.0 reference manual. Technical Report CMU-PDL-08-101, Carnegie Mellon University, May 2008.

[Gal 2005] Eran Gal and Sivan Toledo. Algorithms and data structures for flash memories. ACM Computing Surveys, 37(2):138–163, 2005.

[Golding 1995] Richard Golding, Peter Bosch, Carl Staelin, Tim Sullivan, and John Wilkes. Idleness is not sloth. In Proc. USENIX Annual Technical Conference, pages 201–212, New Orleans, LA, January 1995.

[Hetzler 2008] Steven R. Hetzler. The storage chasm: Implications for the future of HDD and solid state storage. http://www.idema.org/, December 2008.

[Intel News Release 2008] Intel News Release. Intel, STMicroelectronics deliver industry's first phase change memory prototypes. http://www.intel.com/pressroom/archive/releases/20080206corp.htm, February 2008.

[Jiang 2008] Weihang Jiang, Chongfeng Hu, Yuanyuan Zhou, and Arkady Kanevsky. Are disks the dominant contributor for storage failures? A comprehensive study of storage subsystem failure characteristics. In Proc. USENIX Conference on File and Storage Technologies (FAST), pages 111–125, San Jose, CA, February 2008.

[Kgil 2006] Taeho Kgil and Trevor N. Mudge. Flashcache: a NAND flash memory file cache for low power web servers. In Proc. International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES), pages 103–112, Seoul, Korea, October 2006.

[Koltsidas 2008] Ioannis Koltsidas and Stratis Viglas. Flashing up the storage layer. In Proc. International Conference on Very Large Data Bases (VLDB), pages 514–525, Auckland, New Zealand, August 2008.

[Lee 2008] Sang-Won Lee, Bongki Moon, Chanik Park, Jae-Myung Kim, and Sang-Woo Kim. A case for flash memory SSD in enterprise database applications. In Proc. ACM SIGMOD International Conference on Management of Data (SIGMOD), pages 1075–1086, Vancouver, BC, June 2008.

[Miller 2001] Ethan Miller, Scott Brandt, and Darrell Long. HeRMES: High-performance reliable MRAM-enabled storage. In Proc. IEEE Workshop on Hot Topics in Operating Systems (HotOS), pages 95–99, Elmau/Oberbayern, Germany, May 2001.

[Narayanan 2008a] Dushyanth Narayanan, Austin Donnelly, and Antony Rowstron. Write off-loading: Practical power management for enterprise storage. In Proc. USENIX Conference on File and Storage Technologies (FAST), pages 256–267, San Jose, CA, February 2008.

[Narayanan 2008b] Dushyanth Narayanan, Austin Donnelly, Eno Thereska, Sameh Elnikety, and Antony Rowstron. Everest: Scaling down peak loads through I/O off-loading. In Proc. Symposium on Operating Systems Design and Implementation (OSDI), pages 15–28, San Diego, CA, December 2008.

[Nath 2007] Suman Nath and Aman Kansal. FlashDB: Dynamic self-tuning database for NAND flash. In Proc. International Conference on Information Processing in Sensor Networks (IPSN), pages 410–419, Cambridge, MA, April 2007.

[Panabaker 2006] Ruston Panabaker. Hybrid hard disk and ReadyDrive technology: Improving performance and power for Windows Vista mobile PCs. http://www.microsoft.com/whdc/winhec/pres06.mspx, May 2006.

[Popovici 2003] Florentina I. Popovici, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. Robust, portable I/O scheduling with the Disk Mimic. In Proc. USENIX Annual Technical Conference, pages 297–310, San Antonio, TX, June 2003.

[Prabhakaran 2008] Vijayan Prabhakaran, Thomas L. Rodeheffer, and Lidong Zhou. Transactional flash. In Proc. Symposium on Operating Systems Design and Implementation (OSDI), pages 147–160, San Diego, CA, December 2008.

[Rosenblum 1991] Mendel Rosenblum and John Ousterhout. The design and implementation of a log-structured file system. In Proc. ACM Symposium on Operating Systems Principles (SOSP), pages 1–15, Pacific Grove, CA, October 1991.

[Samsung 2006] Samsung. MH80 SATA product data sheet, June 2006.

[Schroeder 2007] Bianca Schroeder and Garth A. Gibson. Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you? In Proc. USENIX Conference on File and Storage Technologies (FAST), pages 1–16, San Jose, CA, February 2007.

[SNIA 2009] SNIA. IOTTA repository. http://iotta.snia.org/, January 2009.

[Strunk 2008] John Strunk, Eno Thereska, Christos Faloutsos, and Gregory Ganger. Using utility to provision storage systems. In Proc. USENIX Conference on File and Storage Technologies (FAST), pages 313–328, San Jose, CA, February 2008.

[US Department of Energy 2008] US Department of Energy. Average retail price of electricity to ultimate customers by end-use sector, by state, April 2008 and 2007. http://www.eia.doe.gov/cneaf/electricity/epm/table5_6_a.html, August 2008.

[Uysal 2003] Mustafa Uysal, Arif Merchant, and Guillermo Alvarez. Using MEMS-based storage in disk arrays. In Proc. USENIX Conference on File and Storage Technologies (FAST), pages 89–102, San Francisco, CA, March 2003.

[Woodhouse 2001] David Woodhouse. JFFS: The journalling flash file system. http://sources.redhat.com/jffs2/jffs2.pdf, July 2001.

[Worthington 2008] Bruce Worthington, Swaroop Kavalanekar, Qi Zhang, and Vishal Sharda. Characterization of storage workload traces from production Windows servers. In Proc. IEEE International Symposium on Workload Characterization (IISWC), pages 119–128, Austin, TX, October 2008.

[Wu 2004] Chin-Hsien Wu, Li-Pin Chang, and Tei-Wei Kuo. An efficient B-tree layer for flash-memory storage systems. In Proc. Real-Time and Embedded Computing Systems and Applications (RTCSA), pages 409–430, Gothenburg, Sweden, August 2004.

[Wu 1994] Michael Wu and Willy Zwaenepoel. eNVy: A non-volatile main memory storage system. In Proc. International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 86–97, San Jose, CA, October 1994.

[Zeinalipour-Yazti 2005] Demetrios Zeinalipour-Yazti, Song Lin, Vana Kalogeraki, Dimitrios Gunopulos, and Walid A. Najjar. Microhash: An efficient index structure for flash-based sensor devices. In Proc. USENIX Conference on File and Storage Technologies (FAST), pages 31–44, San Francisco, CA, December 2005.