
RePRAM: Re-cycling PRAM Faulty Blocks for Extended Lifetime

Jie Chen, Guru Venkataramani, H. Howie Huang
Department of Electrical and Computer Engineering,
The George Washington University, Washington DC, USA
{jiec,guruv,howie}@gwu.edu

Abstract—As main memory systems begin to face the scaling challenges of DRAM technology, future computer systems need to adapt to emerging memory technologies like Phase-Change Memory (PCM or PRAM). While these newer technologies offer advantages such as storage density, non-volatility, and low energy consumption, they are constrained by limited write endurance, which becomes more pronounced with process variation. In this paper, we propose a novel PRAM-based main memory system, RePRAM (Recycling PRAM), which leverages a group of faulty pages and recycles them in a managed way to significantly extend the PRAM lifetime while minimizing the performance impact. In particular, we explore two different dimensions, dynamic redundancy levels and group sizes, and design low-cost hardware and software support for RePRAM. Our proposed scheme involves minimal hardware modifications (with less than 1% on-chip and off-chip area overheads). Our schemes can improve the PRAM lifetime by up to 43× over a chip with no error correction capabilities, and outperform prior schemes such as DRM and ECP at a small fraction of the hardware cost. The performance overhead resulting from our scheme is less than 7% on average across 21 applications from the SPEC2006, Splash-2, and PARSEC benchmark suites.

Keywords-Phase Change Memory, Lifetime, Redundancy, Main memory, Performance

I. INTRODUCTION

Modern processor trends toward multi-core systems exert increasing pressure on DRAM systems to retain the working sets of all threads executing on the individual cores. This has forced greater demands for main memory capacity and density in order for computer systems to keep up with performance scalability while operating under limited power budgets. As a result, it becomes necessary to explore alternative emerging memory technologies such as Flash and resistive-memory types such as Phase Change Memory (PCM or PRAM) to reduce the overall system cost and increase the storage density.

PCM, in particular, has been shown to exhibit enormous potential as a viable alternative to DRAM because it can offer up to 4× more density at only a modest (up to 4×) slowdown in performance [22]. PCM is made of chalcogenide glass, which has crystalline and amorphous states corresponding to low (binary 1) and high (binary 0) resistance to electric current. There have been recent proposals that look at using PCM-based hybrid memory for future generations of main memory [21].

A major challenge when using PCM as a DRAM replacement arises from its limited write endurance. PCM-based devices are expected to sustain an average of 10^8 writes per cell, after which the cell's programming element breaks and write operations can no longer change the stored values. Most of the existing solutions focus on wear-leveling [20] and reducing the number of writes to PCM [14], [34]. Some recent studies have looked at resuscitating faulty pages that would normally be discarded as unusable by the memory controller [5], [9], [24], [26], [32]. In this paper, we adopt a similar goal of extending the lifetime of PCM beyond the initial bit failures.

Our objective behind rejuvenating faulty PCM blocks is to put blocks that would otherwise be discarded by the memory controller back into use, i.e., not to retire them prematurely. As PCM-based memory begins to find widespread acceptance in the market, memory manufacturers will need to take system engineering costs into account. Frequent replacement of failed memory modules can be cost prohibitive, especially for large-scale data centers.

In this paper, we propose that instead of completely discarding a PCM page as soon as it becomes faulty, a group of faulty pages can be recycled and utilized in a managed way to significantly extend the PCM lifetime while minimizing the performance impact. To this end, we design RePRAM (Recycling PRAM), which explores dynamic redundancy techniques while minimizing the complexity of finding compatible faulty pages to store redundant information. We explore varying levels of redundancy as bits begin to fail. First, we add a redundant (parity) bit for a given number of data bits; we call this Parity-based Dynamic Redundancy (PDR). As the number of faults per page increases, we add more redundancy, with one redundant (mirror) bit for every data bit; we call this Mirroring-based Dynamic Redundancy (MDR).

We explore the redundancy techniques along two dimensions: redundancy levels and group sizes (defined as the number of faulty PCM pages that together form a group for parity). In the first dimension, we investigate using different redundancy levels to gain extra lifetime in PRAM. We begin with PDR, and as the number of faults per page begins to rise, we switch to MDR to minimize the complexity associated with finding compatible faulty pages. For the second dimension, we explore varying group sizes


within PDR. Specifically, we investigate the design trade-offs of adopting two and three pages for a PDR group, and the resulting effects on storage efficiency, performance, and hardware complexity in the system. While exploring redundancy, we follow two rules: (1) Any two PCM pages are deemed compatible with each other only when the corresponding pages do not have faulty bytes in the same byte position. (2) PCM pages are discarded once they have at least 160 faults, because finding compatible pairs of pages becomes exponentially harder beyond this limit [9].
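
As an illustration of these two rules, the sketch below checks byte-level compatibility between candidate pages using per-page faulty byte vectors (one bit per byte, as described later in Section III-A). The 4 KB page size and the 160-fault limit come from the paper; the function names and the set-based representation are ours.

    PAGE_SIZE = 4096          # bytes per PCM page
    FAULT_LIMIT = 160         # discard a page once it has at least 160 faulty bytes

    def compatible(fault_vec_a, fault_vec_b):
        """Two pages are compatible only if no byte position is faulty in both.
        fault_vec_* are sets of faulty byte offsets (0..PAGE_SIZE-1)."""
        return len(fault_vec_a & fault_vec_b) == 0

    def usable(fault_vec):
        """A page is recycled only while it has fewer than 160 faulty bytes."""
        return len(fault_vec) < FAULT_LIMIT

    def group_compatible(fault_vectors):
        """A PDR group is valid when every byte position is faulty in at most
        one member, so that position can always be reconstructed from parity."""
        seen = set()
        for vec in fault_vectors:
            if seen & vec:
                return False
            seen |= vec
        return True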

Our main motivation behind exploring dynamic redundancy-based techniques to improve the lifetime of PCM is driven by three factors:

• Increase the space efficiency in the usage of faulty pages: In the MDR configuration, we mirror identical data on two compatible faulty PCM pages, which effectively replicates data across both pages. As a result, the two PCM pages together store a single page worth of data, i.e., the storage efficiency is 50%. In the PDR configuration, a group G of n faulty pages has a dedicated block P that stores the parity values for all of the n pages. Therefore, the storage efficiency for the n+1 pages (including the parity) is n/(n+1). For example, with a group size of 3 the storage efficiency is 75% (25% higher than MDR), and with a group size of 2 it is 67% (17% higher than MDR). Higher values of n give better storage efficiency, although finding compatible pages for larger groups becomes increasingly difficult (see the short calculation sketch after this list).

• Utilize off-the-shelf memory components without extensive hardware redesign: Prior techniques such as ECP [24] have to custom-design the PCM chip to integrate their lifetime-enhancing techniques. Such designs increase the hardware cost and, more importantly, reduce the flexibility of switching to other lifetime-enhancing techniques in the future. To counter such drawbacks, our goal is to maximize the use of off-the-shelf components and adopt techniques that enhance lifetime with minimum changes to the hardware design.

• Explore design choices that offer flexibility to the user: The end users can make an informed choice that is most suitable to their needs under a given cost budget.
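
The storage-efficiency figures quoted in the first bullet follow directly from the group definition; a minimal sketch (ours, not the authors') that reproduces them:

    def storage_efficiency(group_size, scheme="PDR"):
        """Fraction of the recycled pages that holds user data.
        PDR: n data pages share one parity page -> n / (n + 1).
        MDR: two pages mirror one page of data  -> 1 / 2."""
        if scheme == "MDR":
            return 0.5
        return group_size / (group_size + 1.0)

    print(storage_efficiency(3))          # 0.75 -> 75% for a PDR group of three
    print(storage_efficiency(2))          # ~0.67 -> 67% for a PDR group of two
    print(storage_efficiency(2, "MDR"))   # 0.50 for mirroring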

To summarize, the main contributions of RePRAM are:

1) We explore dynamic redundancy techniques to resuscitate faulty PCM pages and investigate the merits of different design choices toward improving the lifetime of PCM-based main memory systems.

2) We propose low-cost hardware that can be combined with off-the-shelf PCM memory and show that only minimal hardware modifications are needed to implement the RePRAM schemes. We also study how relatively small DRAM buffers can be used to store parity pages and reduce the performance impact.

3) We evaluate our design and show that we can improve the PCM lifetime by up to 43× over raw PCM without any error correction capabilities, at less than 1% area overhead both on-chip and off-chip. Also, we incur less than 7% performance overhead in the average case, and less than 16% in the stress case, across SPEC2006 [27], PARSEC-1.0 [1], and Splash-2 [30] applications.

II. BACKGROUND AND RELATED WORK

In this section, we present a brief overview of PCM-based main memory (PRAM). We discuss prior research that has studied extending the lifetime of PCM, and give a quick review of redundancy-based techniques to tolerate faults.

A. PCM-based Main Memory and the Wear-out Problem

DRAM, which has served as computer system main memory for decades, is confronting scalability problems as technology limitations prevent scaling cell feature sizes below 32nm [18]. DRAM also faces power-related issues due to high leakage caused by shrinking transistor sizes. All of these factors have led to building main memory with alternative emerging memory technologies such as Phase-Change Random Access Memory (PRAM).

An important consideration is PCM's high operating temperatures for set and reset operations, which directly affect the lifetime of PRAM devices. In particular, repeated reset operations at very high temperatures cause the programming circuit of the phase-change material to break, permanently leaving the PCM cell in a state of high resistance. This limited endurance significantly restrains the use of PCM as a replacement for DRAM-based memory. Additional complications arise in PCM due to process variation effects that can further decrease the write endurance of these devices.

B. Extending PCM Lifetime

To mitigate the effects of PCM's limited endurance, prior works have looked at better wear-leveling algorithms [20], reducing write traffic to PCM memory through partial writes [14], writing select bits [6], [33], randomizing data placement [25], exploring intelligent writes [4], and using DRAM buffers [21]. All these techniques focus on applying their optimizations prior to the first bit failure. We note that these techniques are complementary to RePRAM, and can contribute to even higher PCM lifetime when used in conjunction with our techniques.

To the best of our knowledge, the first scheme to look at reusing faulty pages was Dynamically Replicated Memory (DRM) [9]. DRM forms pairs of faulty PCM pages that do not have faults in the same byte position, so that paired pages can serve as replicas of one another. Redundant data is stored in both pages to make sure that the system can read a non-faulty version of each byte from at least one of the pages. The idea behind pairing is that there is a high probability of finding two compatible faulty pages, and hence one can eventually reclaim what would otherwise be decommissioned memory space. While this is useful, simply replicating the data in both pages can rapidly degrade the effective capacity of the memory system. We note that, by using a replication scheme, DRM merely explores MDR, or mirroring. First, by using MDR for PCM pages that initially have very few errors, DRM can waste many non-faulty bytes in the paired pages and unnecessarily replicate the entire block. Second, writes need to update both mirror copies, which means that both pages endure increased wear, resulting in expedited aging of the pages and contributing to their failures. In RePRAM, we overcome the above disadvantages by exploring more dynamic redundancy techniques like PDR, which reduce unnecessary additional writes to PCM blocks. Further, we combine several PDR configurations to explore design points that are most suitable to users' needs.

Error Correcting Pointers (ECP) [24] is another technique that handles errors by encoding the locations of failed cells in a table and assigning new cells to replace them. The main disadvantage of this approach is the complexity of changing the PCM chip to accommodate the dedicated pointers, as well as the costs involved in hardware redesign. ECP incurs a static storage overhead of 12% to store these pointers. SAFER [26] dynamically partitions blocks into multiple groups, exploiting the fact that failed cells can still be read, and reuses the cells to store data. LLS [10] is a line-level mapping technique that dynamically partitions the overall PCM space into a main space and a backup space. By mapping failed lines to the backup, LLS manages to maintain a contiguous memory space that integrates easily with wear leveling. When accessing a failed line, the request is redirected to the mapped backup line through special address translation logic, which requires PCM chip redesign efforts and also incurs extra latency and energy. LLS also relies on intra-line salvaging, such as ECP, to correct initial cell failures. RDIS [15] incorporates a recursive error correction scheme in hardware to track memory cell locations that incur faults during writes. In RePRAM, we use off-the-shelf PCM storage and make minimal changes to existing hardware, which lowers the cost of redesign and helps us minimize the performance overheads.

FREE-p [32] performs fine-grained remapping of worn-out PCM blocks without requiring dedicated storage for the mappings. Pointers to remapped blocks are stored within the memory block, and the memory controller issues extra memory requests to read these blocks, which results in additional bandwidth and latency overheads. As the number of faults per block increases, this approach incurs higher performance overheads due to sequential memory reads and increased memory bandwidth demands. In contrast, RePRAM incurs significantly lower performance overheads because the memory requests to group blocks can be overlapped and performed simultaneously (as shown in Section IV).

C. Redundancy-based Techniques

Storing redundant data to achieve high availability was explored by Patterson, Gibson, and Katz for Redundant Arrays of Inexpensive Disks (RAID) [17] in 1987, where the original five RAID levels were presented as a low-cost storage system. Since then, RAID has become a multi-billion dollar industry [12], and has inspired many similar concepts such as Redundant Arrays of Independent Memory (RAIM) [8], Redundant Arrays of Independent Filesystems (RAIF) [11], and so on. While our work leverages a number of redundancy techniques, we aim to reuse faulty PCM pages, whereas RAID was designed to recover the data on a failed hard disk, which statistically has a higher availability than the average 10^8 writes per PCM cell. In the context of PCM-based memory, we face a set of new problems, such as mapping management, asymmetric read/write performance, and limited write cycles. At a very high level, our idea has a similar spirit to the HP AutoRAID hierarchical storage system [29], where RAID 1 and RAID 5 are deployed at two different levels, with the former used for high-performance data access and the latter for high storage efficiency. In AutoRAID, the decision is made upon active monitoring of workload access patterns, and data blocks are migrated automatically in the background. In contrast, RePRAM employs a unique redundancy level at different stages of the PCM lifetime, selected to maximize usability and performance.

III. DESIGN OVERVIEW

In this section, we first give an overview of our hardware design and then show how our proposed modifications can be incorporated into multi-core processor hardware. We also show the software support needed for RePRAM.

A. Incorporating Dynamic Redundancy Techniques into PCM

In this paper, we assume that the PCM-based main memory starts without any faults and has wear-leveling algorithms in place to uniformly distribute writes across the pages. For each write to a page, a read-after-write is performed to determine whether the write succeeded. Where Error Correcting Code (ECC) support exists on chip, the first few bit faults can be tolerated using these pre-built schemes. In particular, SECDED (single-error correction, double-error detection) ECC can correct one bit error, while Hi-ECC [28] can correct up to four bit errors; both can be utilized to tolerate errors before our RePRAM schemes come into use. Note that this does not require RePRAM intervention and thus avoids any performance impact on the applications for the first few faults. Beyond the point where the built-in error correction can no longer tolerate additional faults, a PCM-based main memory without any additional lifetime-extension strategy would be forced to discard the faulty pages, leading to quickly diminishing memory capacity. To maximize PCM lifetime with minimal overheads and low cost, we propose RePRAM, which recycles faulty pages and leverages dynamic redundancy techniques in the context of PCM-based main memory. We devise a number of hardware (and supporting software) techniques to realize the full benefit of this new memory architecture.

In RePRAM, when the first fault in a PCM page k occurs (beyond the tolerance limit of the already incorporated error correction schemes), we temporarily decommission this page and place it in a separate pool of PCM pages, POOL_PDR, where faulty pages wait to be matched with other compatible faulty pages. At this point, we disable the ECC computation, an operation supported by most ECC-based memory controllers [19]. We reuse the ECC parity bits to store the faulty byte vector within the already built-in parity support, i.e., for each byte we have a bit that indicates whether it is faulty or not. We start with the PDR redundancy level, in which a dedicated parity block is associated with a group of data pages. Basically, we group multiple faulty PCM pages that are compatible (i.e., do not have faults in the same byte position) and use a separate high-performance DRAM buffer to store the corresponding parity to minimize the performance impact. The number of data pages per group (n) is determined based on the matching complexity; for higher values of n, it becomes exceedingly difficult to find compatible faulty pages.

When we decommission the PCM page k, we copy its contents into one of the reserve PCM pages, which hold the data of faulty pages until matching is done. By default, we assume that there are 10,000 such reserve PCM pages. To start, we randomly pick a group G of compatible faulty pages from POOL_PDR. Alternatively, a low-cost approximate pairing algorithm similar to the one in DRM [9] can be used to assemble the group G of compatible faulty pages from POOL_PDR. The mapping information is stored as tuples MAP_PDR = {k1, k2, ..., kn, P}, where n is the group size and P is the parity page; these tuples are managed by the Operating System (OS), as described in Section III-C. For example, for a PDR group size of three, a tuple of the form {k1, k2, k3, P} is stored, where k1, k2, k3 are the compatible faulty pages and P is the parity page. In the remainder of this paper, we use a default group size of three for the PDR configuration, as it is considerably less complex than higher-order matching while offering a much better data-to-parity storage ratio than PDR with a group size of two.
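
To make the grouping concrete, the sketch below forms a PDR group and computes its parity page. The paper does not spell out the parity function; RAID-style bytewise XOR is assumed here, and the names (xor_parity, form_pdr_group) are ours.

    from functools import reduce

    def xor_parity(pages):
        """Bytewise XOR over the n data pages (assumed RAID-style parity).
        Each page is a bytes object of equal length, e.g., 4096 bytes."""
        return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*pages))

    def form_pdr_group(page_ids, pages):
        """Return a MAP_PDR-style tuple: the member page ids plus a freshly
        computed parity page (which RePRAM stores in the DRAM buffer)."""
        parity = xor_parity(pages)
        return tuple(page_ids), parity

A group of three 4 KB pages thus needs one extra 4 KB parity page, matching the 75% storage efficiency discussed in Section I.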

We note that the matching process (finding compatible pages) is much easier for pages that have a smaller number of faults. This is especially true when using PDR, as the cost of finding three-way-compatible pages increases sharply beyond a certain number of bit faults. To empirically evaluate the mapping process, we simulate the trials needed for two-way and three-way matching under PDR. Figure 1 presents the average number of trials in both cases. These experiments were done assuming a pool of one million 4KB pages, and we averaged over 10,000 samples. We also tried a pool of 10,000 4KB pages with over 1,000 samples and observed similar results. In the two-way matching, we randomly pick two faulty pages and check whether they are compatible. If not, we repeat with another randomly picked page until we find a compatible set of pages, and record the average number of tries needed. We repeat our experiments while bounding the faults to various ranges, as shown in Figure 1. The same set of experiments was done for three-way matching. Beyond 80 faults (which we refer to as the three-way fault threshold), the number of trials needed for three-way matching is at least twice the number of two-way trials and increases steeply from there. Therefore, continuing to operate in PDR with a group size of three beyond the three-way fault threshold becomes expensive for the matching algorithm that forms groups of compatible faulty pages.

Figure 1. Average number of random trials needed for two-way and three-way matching between faulty pages. Each pair of bars shows the number of tries needed for matching within the fault bounds indicated on the x-axis. Faults are assumed to be randomly distributed within the 4 KB PCM pages.
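
A minimal Monte Carlo sketch of the matching experiment described above (our reconstruction, not the authors' code; the default pool size and sample count are scaled down from the paper's, and the restart policy when a pick fails is our assumption):

    import random

    PAGE_SIZE = 4096

    def random_fault_vector(num_faults):
        """Faulty byte offsets drawn uniformly at random within a 4 KB page."""
        return set(random.sample(range(PAGE_SIZE), num_faults))

    def trials_to_match(pool, group_size):
        """Randomly pick pages until `group_size` mutually compatible ones are
        found; return how many candidate picks were needed."""
        trials = 0
        group = [random.choice(pool)]
        while len(group) < group_size:
            trials += 1
            cand = random.choice(pool)
            if all(not (cand & member) for member in group):
                group.append(cand)
            else:
                group = [random.choice(pool)]   # restart with a fresh seed page
        return trials

    def average_trials(max_faults, group_size, pool_size=10000, samples=1000):
        pool = [random_fault_vector(random.randint(1, max_faults))
                for _ in range(pool_size)]
        return sum(trials_to_match(pool, group_size)
                   for _ in range(samples)) / samples

    # e.g., compare two-way vs. three-way matching near the 80-fault threshold
    print(average_trials(80, 2), average_trials(80, 3))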

In order to bound the complexity associated with matching, we explore two design dimensions once a faulty page incurs more errors than the three-way fault threshold:

1) Dim-1: Switch to MDR, which employs the mirroring-based dynamic redundancy technique, where we store two identical copies of the data on two different PCM pages.

2) Dim-2: Reduce the PDR group size to two and continue to operate in parity-based dynamic redundancy mode to tolerate future faults.

We choose to explore the above two dimensions primarily to provide more user-friendly options and allow the users to decide what is in their best interest. Our experiments quantify the resulting lifetime and the associated performance impact. We note that the end users should be able to choose the configuration that suits their needs.

Figure 2. Hardware modifications and structures (shown in gray boxes) needed to incorporate RePRAM into the processor hardware, for (a) Dim-1 and (b) Dim-2. The diagram shows the processor cores with L1/L2 caches, a faulty-page Bloom filter, the map caches (Map Cache PDR, and Map Cache MDR in Dim-1), the PCM memory, the DRAM parity buffer, and the disk. The number of memory requests issued to each memory structure is shown next to the arrows (e.g., PDR always issues 3 PCM requests in Dim-1 for a group size of 3, whereas in Dim-2 the number of requests, 2 or 3, depends on the current group size).

Dim-1: We switch to MDR mode when a newly occurring fault in an already matched PCM page (under PDR) surpasses the three-way fault threshold and renders the group pages incompatible. This new fault is detected by the read-after-write operations in PCM-based memory [31]. When this happens, we reassign the faulty PCM page to a new pool, POOL_MDR, where pages wait to be matched with other compatible faulty pages in the mirroring fashion. We use the MDR technique to tolerate bit faults for the remaining portion of these pages' lifetime, and their corresponding mappings are stored in MAP_MDR, managed by the OS. Similarly, the mappings can be represented as tuples {k1, k2}, where k1 and k2 are compatible faulty pages that mirror each other's content. It is worth pointing out that even after MDR mapping is adopted for some PCM pages, other pages that are currently in MAP_PDR will continue to be in PDR mode until there is a need to remap these faulty pages.

Dim-2: In this case, throughout the lifetime of the device, we continue to operate in PDR mode, where a group of PCM pages has an associated parity page. The only distinction is that when a newly occurring fault surpasses the three-way fault threshold and renders the existing group incompatible, we downgrade the group size to two. This requires configuring the memory controller to track the group size corresponding to each faulty PCM page, and the MAP_PDR tables to hold the relevant group-size information.

On a read to a main memory page k1, we first determine whether the target page is faulty. If k1 is not faulty, the memory request proceeds as usual. If it is faulty, we determine whether k1 is in MAP_PDR or in MAP_MDR; the accesses to MAP_PDR and MAP_MDR can be performed in parallel. As a result, we obtain the information on this group of blocks, k2, k3, and P for PDR, or alternatively the information about the mirror block k2 for MDR. Extra memory requests are issued by the memory controller for the corresponding group blocks depending on the PDR/MDR mapping. Upon reading the blocks, we use the faulty byte vector (reused ECC parity bits) to determine the locations of faulty bytes. Once the memory requests are satisfied, the data block for k1 is reconstructed based on the PDR parity or the MDR mirror. We note that these extra requests could potentially create a performance bottleneck by saturating the bus bandwidth and delaying reads that are on the critical path. To reduce the performance overheads, additional optimizations such as a temporary read buffer that stores reconstructed memory blocks can be used. Such optimizations can offer faster read accesses to otherwise faulty memory blocks; however, care should be taken to ensure that write accesses properly update the actual PCM and parity blocks, keeping them consistent at all times. To ensure data consistency, every write operation should invalidate or update the appropriate read buffer entries. Since writes are off the performance-critical path, lazy invalidate or update schemes can be used for the read buffer entries.
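
A sketch of the read-path reconstruction for a faulty page in a PDR group, under the same bytewise-XOR parity assumption as before (the flow follows the description above; the function and parameter names are ours):

    def reconstruct_read(data_blocks, parity_block, fault_vectors, target_idx):
        """Rebuild the block of faulty page k1 (index target_idx in the group).

        data_blocks   : bytearrays read from the n group pages
        parity_block  : bytearray read from the parity page P in the DRAM buffer
        fault_vectors : per-page sets of faulty byte offsets (from the reused
                        ECC parity bits)
        Only bytes marked faulty in the target page are recomputed; all other
        bytes are returned exactly as read from PCM."""
        result = bytearray(data_blocks[target_idx])
        for off in fault_vectors[target_idx]:
            b = parity_block[off]
            for i, blk in enumerate(data_blocks):
                if i != target_idx:
                    b ^= blk[off]    # compatibility guarantees these bytes are good
            result[off] = b
        return bytes(result)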

On a write to a main memory page k1, we also determine whether the target page is faulty. If it is faulty, we find the corresponding group blocks from the mapping tables. In the case of MDR, we issue two memory requests, to k1 and k2. For PDR, we need to update only k1 and the corresponding parity, P. Therefore, the number of memory requests needed for a write under RePRAM is always two, instead of the (n+1) memory requests needed for a read.
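
For the PDR write path, the parity can be kept consistent with a read-modify-write of only the target block and the parity page, which is what limits a write to two memory requests. The delta-update rule below is the standard RAID-style one; the paper does not state it explicitly, so treat it as an assumption.

    def pdr_write(old_data, new_data, old_parity):
        """Update parity without touching the other group members:
        P_new = P_old XOR D_old XOR D_new (bytewise).
        The caller writes new_data to the PCM page and the returned parity to
        the DRAM parity buffer (two requests in total)."""
        return bytes(p ^ o ^ n for p, o, n in zip(old_parity, old_data, new_data))

    def mdr_write(new_data):
        """Mirroring simply writes the same data to both pages of the pair."""
        return new_data, new_data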

B. Hardware Implementation

The primary goal of the RePRAM implementation is to lower the cost and complexity of the hardware design. Also, the performance impact resulting from our proposed hardware changes should not adversely affect normal processor performance. Figure 2 shows the hardware structures that are needed to implement the RePRAM scheme in a cost-effective manner with low performance overheads.

The first task is to find out whether a page is pristine or has faults. A naive approach to detecting faulty pages is to make the memory controller perform an additional readback after a write to determine whether the write succeeded [9]. However, performing repeated read-after-write requests on every write operation can be time consuming; hence, we use compact structures such as Bloom filters [2] to record information about the faulty pages. Once we determine the faulty/pristine status of a page, we can allow pristine (fault-free) pages to access memory directly as usual, while faulty pages are directed to look up one of the MAP structures. One caveat of structures such as Bloom filters is their potential for false positives, i.e., a page may be reported as faulty when it actually has no faults. Fortunately, our experiments show a negligible rate of false alarms, and such cases can usually be resolved by checking whether the page has an entry in one of the MAP tables or by directly reading the faulty byte vector associated with the PCM page.
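
A minimal Bloom filter for tracking faulty pages, illustrating the lookup that steers accesses between the fast path and the MAP structures. The filter size and the double-hashing scheme are our assumptions; the paper only specifies a filter sized for about 1 million PCM page entries.

    import hashlib

    class FaultyPageFilter:
        """Bloom filter over physical page numbers (false positives possible,
        no false negatives), consulted before any MAP lookup."""

        def __init__(self, m_bits=8 * 1024 * 1024, k_hashes=4):
            self.m = m_bits
            self.k = k_hashes
            self.bits = bytearray(m_bits // 8)

        def _positions(self, page_no):
            h = hashlib.sha256(str(page_no).encode()).digest()
            h1 = int.from_bytes(h[:8], "little")
            h2 = int.from_bytes(h[8:16], "little") | 1
            return [(h1 + i * h2) % self.m for i in range(self.k)]

        def mark_faulty(self, page_no):
            for p in self._positions(page_no):
                self.bits[p // 8] |= 1 << (p % 8)

        def maybe_faulty(self, page_no):
            return all(self.bits[p // 8] & (1 << (p % 8))
                       for p in self._positions(page_no))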

Once a page is determined to be faulty, the next step is to check the associated mapping in MAP_PDR (and also MAP_MDR for Dim-1), either to reconstruct the original data page on a read or to update the necessary parity information on a write to the faulty PCM page. The mapping information is managed by the OS, and frequent invocation of the OS for mapping lookups can result in expensive overheads. Hence, to speed up this process, we use limited-entry caches, CACHE_PDR and CACHE_MDR, to temporarily store MAP_PDR and MAP_MDR entries respectively. In our current implementation, we use two small 1024-entry buffer caches to store the mappings; this organization is similar to a Translation Lookaside Buffer (TLB). Upon a miss in both mapping caches, the OS service routine is invoked, a mapping lookup is performed, and an entry is created in the corresponding mapping cache depending on the dynamic redundancy scheme under which the requested page is mapped. We may remove an entry from the mapping caches for the following reasons: (1) when one of the PCM pages has suffered permanent failure (more than 160 bit errors) and needs decommissioning, or (2) when PDR-to-MDR remapping needs to be done due to the increased matching complexity associated with PDR. Note that under Dim-2 we continue to remain in PDR mode, but with different group sizes; because the group size decreases from three to two, previously formed groups may need to be reorganized, which requires us to flush the corresponding entries from CACHE_PDR.

We use a separate DRAM buffer to store the parity for faster access by the multi-core processor, which we believe is more cost-effective than simply investing in additional PCM capacity for parity. This is motivated by two facts: (1) DRAM provides fast data access and does not suffer from the write endurance problem. Note that the parity information needs to be accurate and cannot have errors if it is to recover the data of a faulty PCM block; the DRAM buffer offers a much better alternative to PCM in this regard. (2) The parity information is much smaller than the data itself: for a group of n data pages, there is only one corresponding parity page, so the DRAM buffer can be much smaller than the actual PCM-based RAM. A caveat, however, is that parity information could be lost during a power failure, as the DRAM buffer offers only volatile storage. To counter this problem, prior techniques such as Brant et al. [3] have proposed a low-cost, battery-powered Flash RAM backup to store the contents of DRAM. As we will observe in our evaluation (Section IV), the fast parity accesses offered by the DRAM buffer easily outweigh its drawbacks (which can be handled through other low-cost mechanisms). In this work, we use off-the-shelf components in our hardware design without extensive modifications to the hardware structures. We note that both the data and parity pages share the lower-level disk for backup storage.

Finally, we discard pages that have more than 160 bit errors and decommission them as permanent failures. To do so, during rematching of pages (when pages become incompatible due to errors in the same byte position), we determine whether the candidate pages have more than 160 bit errors. We permanently mark such pages for removal, and invoke the remapping algorithm to reconstitute the other group pages associated with the failed page under MDR or PDR.

C. Software Support for RePRAM

In RePRAM, the OS is responsible for managing the mapping of faulty PCM pages, as well as parity pages. The OS maintains two lists: a free list for the DRAM parity buffer and a candidate list for faulty PCM pages. In the beginning, the free list contains all the pages in the DRAM buffer. Upon forming a new group of compatible faulty PCM pages, one entry from the DRAM buffer free list is allocated for storing the parity, and this information is maintained in the OS. When a group is disbanded due to incompatibility among its pages, the OS puts the corresponding parity page back into the free list. When the free list is empty, the OS stops the matching process.

The OS invokes the matching algorithm every time the memory controller detects a new faulty page or a page becomes incompatible with its group pages. The new faulty PCM page is inserted into the candidate list, where compatible pages can be paired. The OS performs random selection from the candidate list to quickly form a group of compatible pages. After a page is successfully matched, it is removed from the candidate list. Note that our experiments show that not every fault results in incompatibility between the group pages (see Section IV-D). Remapping happens far less frequently than faults occur in pages, which greatly reduces the OS overhead for page remapping.

The OS manages the mapping between faulty PCM pages using a hash table that is stored in a reserved area of the memory address space. Assuming that we need to maintain mapping information for every page, the worst-case storage overhead for a 4GB PRAM with 4KB pages would be only 2.5MB (20 bits per page for 1M pages). Also, OS interventions for accessing and updating the mapping information should be kept minimal in order to avoid high performance overheads. As we show in Section IV, with the help of the mapping caches and dynamic redundancy levels, RePRAM is able to minimize the performance impact on a wide range of benchmarks.
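
The worst-case figure quoted above follows from a quick calculation; a short sketch (ours) that reproduces it:

    PRAM_BYTES = 4 * 2**30          # 4 GB PCM main memory
    PAGE_BYTES = 4 * 2**10          # 4 KB pages
    BITS_PER_ENTRY = 20             # mapping entry per page, as stated above

    num_pages = PRAM_BYTES // PAGE_BYTES             # 1,048,576 (1M) pages
    overhead_bytes = num_pages * BITS_PER_ENTRY // 8
    print(num_pages, overhead_bytes / 2**20)         # -> 1048576 pages, 2.5 MB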

IV. EVALUATION

We first perform experiments to analyze the extended lifetime that can be achieved through RePRAM, and then study the performance impact arising from our proposed hardware changes. Finally, we present sensitivity studies that show the performance of our system under various configurations. To present a fair comparison with prior schemes, our configuration parameters are similar to, and derived from, earlier works such as DRM [9] and ECP [24].

A. Lifetime

Figure 3 compares the lifetime (measured as the total number of writes that can be performed to the PCM) against the diminishing effective capacity left in the PCM device. We show the results corresponding to both the Dim-1 and Dim-2 design choices, and contrast them with prior schemes: Fail Stop (which has no error correction capabilities and discards a PCM block as soon as the first fault occurs), DRM, and ECP. We assume a baseline of 4GB PCM memory with a 4KB page size, and write operations happen at a granularity of 64-byte blocks.

Figure 3. Effective capacity versus the total number of writes issued to the PCM main memory (in trillions), for FAIL STOP, DRM, Dim-1, ECP, and Dim-2: (a) variance=0.1, (b) variance=0.2, (c) variance=0.3.

In addition, we assume a 50% probability that any single write operation flips a particular bit. We model the PCM cell lifetime to follow a normal distribution with a mean of 10^8 writes and three different process variation factors of 0.1, 0.2, and 0.3 (shown in Figure 3).
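
A compact sketch of this lifetime model (our reconstruction of the stated assumptions: per-cell endurance drawn from a normal distribution with mean 10^8 and a relative variation factor, and a 50% chance that a write flips a given bit; the doubling step below is our own back-of-envelope reasoning, not the authors' simulator):

    import random

    MEAN_ENDURANCE = 1e8

    def cell_endurance(variation):
        """Endurance of one cell: normal with mean 10^8 and the given relative
        standard deviation (0.1, 0.2, or 0.3 in the experiments)."""
        return max(1.0, random.gauss(MEAN_ENDURANCE, variation * MEAN_ENDURANCE))

    def writes_until_first_fault(bits_per_page, variation):
        """Monte Carlo estimate of page writes before the weakest bit wears out,
        assuming each write flips a given bit with probability 0.5, so a bit
        sees roughly half of the page's writes."""
        weakest = min(cell_endurance(variation) for _ in range(bits_per_page))
        return 2 * weakest

    # a 4 KB page has 32768 bits; larger pages make an early-failing bit likelier
    print(writes_until_first_fault(32 * 1024, 0.2))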

Our experiments show that with a 4KB page size, there is a high probability of at least one byte having a lifetime at the tail end of the normal distribution. This makes the Fail Stop scheme quickly decommission all of the pages within a short window of writes before the PCM's effective capacity drops to zero. This effect is even more pronounced at higher levels of process variation, as shown in our experimental results. Although DRM is able to tolerate up to 160 faults in each page before decommissioning the page as a permanent failure, its main disadvantage is that replicated writes to two different pages speed up the aging of both pages. DRM offers an extended lifetime of approximately 0.13× to 22× over the Fail Stop scheme, depending on the process variation factor. The third scheme, ECP, can tolerate up to 6 faults in each 64-byte block, and a 4KB PCM page is decommissioned as soon as the seventh fault occurs in one of its constituent 64-byte blocks. This lets ECP achieve a lifetime improvement of 0.26× (variance=0.1) to 41× (variance=0.3) over Fail Stop.

Our RePRAM schemes achieve lifetime results that are comparable to, or better than, the ECP scheme at a small fraction of ECP's hardware cost. In both dimensions, we first deploy PDR, which stores parity separately from the PCM data pages. Therefore, under PDR, the PRAM capacity shrinks only when PCM pages reach 160 bit failures, letting the capacity degrade gracefully. Under MDR, in contrast, PRAM pages are mirrored onto each other and the capacity degrades more rapidly. In Dim-1, we initially deploy PDR with a group size of 3 to tolerate faults, and after PCM blocks begin to exhibit 80 faults (the three-way fault threshold shown in Figure 1), we switch to MDR for the rest of each PCM block's lifetime. The Dim-1 design yields lifetime results that are slightly worse than ECP (by 1% when the variance is 0.1 to 9.5% when the variance is 0.3). This is a good trade-off from the hardware-cost perspective, as our design mostly uses off-the-shelf components versus the extensive modifications to the main memory design needed by ECP. Under Dim-2, we also deploy PDR with a group size of 3 in the initial stages, and after PCM blocks begin to exhibit 80 faults, we switch to PDR with a group size of 2 for the rest of each PCM block's lifetime. We find that Dim-2 exhibits slightly better lifetime than ECP (by 0.5% for variance=0.1 to 4.4% for variance=0.3). Overall, the proposed RePRAM schemes achieve good lifetime improvements for PCM, ranging from 0.26× (when the variance is 0.1 in Dim-1) to 43× (for a variance of 0.3 in Dim-2) over Fail Stop.

One might ask why low-cost schemes such as RePRAM should be preferred over ECP when both schemes achieve comparable lifetime results. We note that ECP requires a custom PRAM design and offers a hardwired solution at about 12% memory overhead, with little flexibility for users to upgrade to further lifetime-enhancement techniques. In contrast, RePRAM uses off-the-shelf modules and minimizes hardware changes for easier adoption and future upgrades.

B. Performance Impact

We evaluate the performance impact resulting from RePRAM using SESC [23], a cycle-accurate, execution-driven simulator. Our baseline system models an Intel Nehalem-like four-core processor [7] running at 3GHz, with 4-way out-of-order cores, each with a private 32KB, 8-way set-associative L1 cache, and a shared 4MB, 16-way set-associative L2 cache. The L1 caches are kept coherent using the MESI protocol. The block size is 64 bytes in all caches. We model 4GB of PCM main memory with 4KB pages, a read access latency of 50ns, and a write access latency of 1µs [16]. For storing parity, we include an additional 16MB DRAM buffer that can be accessed simultaneously with PRAM accesses. When the DRAM buffer is full, we model a 96,000-cycle (32µs) latency to write parity information to disk.

To model RePRAM effects, every L2 access first queries a Bloom filter, which has a three-cycle latency, to determine whether the page is faulty. We use 1024-entry CACHE_MDR and CACHE_PDR caches in our experiments, where we observe an average miss rate of 3% (8% in the worst case). Note that we do not assume fixed latencies for memory lookups corresponding to MDR or PDR. Our configuration has a common memory bus shared between cores, so memory read/write latencies and bandwidth are fully modeled.

For faulty pages, in Dim-1 we access CACHE_MDR and CACHE_PDR simultaneously, each of which has a 2-cycle latency; in Dim-2 we access only CACHE_PDR. On a mapping cache miss, we model an additional latency of 500 cycles to perform the mapping lookup from lower levels of the memory hierarchy and to store the entry in the mapping cache for future use. Upon receiving all the pages in a group, we charge a 20-cycle penalty for parity reconstruction or for recovering faulty bytes from mirror copies.
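
To make the lookup cost model concrete, a rough sketch (ours) that adds up the fixed latencies above for one faulty-page read; memory-bus contention and the actual PCM/DRAM access times are left out, since the simulator models those dynamically.

    BLOOM_LAT = 3        # cycles per L2 access for the faulty-page Bloom filter
    MAPCACHE_LAT = 2     # cycles for a CACHE_PDR / CACHE_MDR hit
    MAPCACHE_MISS = 500  # extra cycles for an OS mapping lookup on a miss
    RECONSTRUCT = 20     # cycles to rebuild data from parity or a mirror

    def faulty_read_overhead(map_cache_hit=True):
        """Fixed-latency portion of a read to a faulty page (cycles)."""
        lat = BLOOM_LAT + MAPCACHE_LAT
        if not map_cache_hit:
            lat += MAPCACHE_MISS
        return lat + RECONSTRUCT

    # with the observed ~3% average mapping-cache miss rate:
    expected = 0.97 * faulty_read_overhead(True) + 0.03 * faulty_read_overhead(False)
    print(faulty_read_overhead(True), faulty_read_overhead(False), expected)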

We show performance overheads for two scenarios under the Dim-1 and Dim-2 configurations:

• average case: In this scenario, we assume that a randomly picked page has a 50% probability of being faulty; i.e., a randomly selected 50% of the PCM pages are considered faulty. Of these, 50% of the faulty pages (25% of all pages) are mapped under PDR (group size = 3), and the other 50% (25% of all pages) are under MDR in Dim-1 or under PDR (group size = 2) in Dim-2. The remaining 50% of all pages are non-faulty.

• stress case: In this scenario, we assume that every page is faulty. A randomly picked 50% of the PCM pages are mapped under PDR (group size = 3); the remaining 50% are under MDR in Dim-1, or under PDR (group size = 2) in Dim-2.

We use a mix of 21 memory-intensive and CPU-intensive applications from the SPEC2006 [27], PARSEC-1.0 [1], and Splash-2 [30] benchmark suites with reference input sets.


Figure 4. Performance overheads of RePRAM on SPEC2006, PARSEC-1.0, and Splash-2 applications for Dim-1 and Dim-2: (a) average case, (b) stress case.

Figure 5. Lifetime comparison (effective capacity versus writes to PCM memory, in trillions) for various configurations of Dim-1 (at variance=0.2). For each configuration (60, 80, 100, 120, or 140 faults), we use PDR until the specified number of faults, and then switch over to MDR for the rest of the PCM block's lifetime.

For SPEC2006 and PARSEC applications, we fast-forward the first five billion instructions and simulate the next one billion in detail. For Splash-2 applications, we simulate them from start to finish. In addition, we run the SPEC2006 benchmarks as single-core applications, individually one at a time. For the PARSEC and Splash-2 benchmarks, we spawn four parallel threads, one on each core, sharing the L2 and lower-level memory structures.

Figure 4(a) shows the performance impact of our RePRAM schemes in the average case scenario. All 21 of our applications show overheads of less than 7%; the highest performance overhead (6.2%) occurs in cactusADM, followed by 5.9% for gcc in Dim-2. Under Dim-1, we observe average overheads of 1.87%, 0.8%, and 1.5% across SPEC2006, PARSEC, and Splash-2 respectively, whereas the corresponding averages under Dim-2 are 2.2%, 0.8%, and 1.7%. An increased memory access rate in Dim-2 (due to the continuous use of the PDR configuration), along with a higher rate of misses in the mapping caches, contributes to a larger performance impact than in Dim-1. This effect is especially pronounced in benchmarks such as gcc, cactusADM, and mcf. Also, issuing multiple requests for group pages creates contention for memory bus bandwidth and degrades performance in benchmarks such as fft, ocean, and volrend.


Figure 6. Performance overheads for different sizes of DRAM buffer (4MB, 16MB, and 32MB) in PDR (average case), across the SPEC2006, PARSEC, and Splash-2 benchmarks.

Figure 4(b) shows the performance impact of our RePRAM schemes in the stress case scenario. All 21 of our applications show overheads of less than 16%; the highest overhead (15.5%) occurs in gcc, followed by 12.4% for cactusADM in the Dim-2 design. In Dim-1, we observe average overheads of 3.78%, 1.64%, and 2.65% across SPEC2006, PARSEC, and Splash-2 respectively, whereas the corresponding averages in Dim-2 are 4.66%, 2.71%, and 3.16%. Due to multiple memory accesses (depending on the group size) and demand for mapping cache entries, we notice a higher average performance impact in Dim-2 than in Dim-1 across the benchmark suites. This effect is particularly visible in benchmarks such as gcc, mcf, and fluidanimate. Similarly, issuing multiple requests incurs a performance impact in benchmarks such as cactusADM, fft, and ocean.

C. Area Overheads

We use Cacti 5.3 [13], an integrated model for cache and memory access time, cycle time, area, leakage, and dynamic power. We use this tool to model our two 1024-entry mapping caches and a compact Bloom filter that has space to track 1 million PCM page entries inside the processor chip. We assume that these hardware structures are integrated with an on-chip memory controller. We use a 45 nm technology node in our experiments.

Table I shows the area estimates of our proposed RePRAM hardware. We find that our area overheads, both on-chip and off-chip, are less than 1%, and our proposed hardware can be easily integrated into existing processors with minimal changes. We note that this is much cheaper than prior works such as ECP [24], which incur substantial area overheads to store replacement bits along with additional circuitry such as row decoders in order to store the error-correcting pointers.

Processor Die                                     263 mm² [7]
Bloom Filter                                      2.25 mm²
CACHE_PDR                                         0.08 mm²
CACHE_MDR                                         0.08 mm²
On-chip area overhead                             0.92%
16 MB DRAM Buffer overhead (vs. 4GB PRAM)         <0.5%

Table I. Area overheads of on-chip mapping caches, Bloom filter, and off-chip DRAM buffer in RePRAM.

D. Sensitivity Experiments

In this subsection, we present further analysis to show how our RePRAM schemes behave under various parameters and possible configurations. Such analyses offer key insights to the user, along with experimental justification for choosing a desirable set of parameters toward building the RePRAM system.

1) Lifetime vs. Redundancy Levels: In Dim-2, PDR is used throughout the lifetime of the PCM device. An advantage of using just PDR is that every PCM block incurs only the write operations directly intended for it, i.e., the PDR configuration itself does not impose additional writes on the PCM blocks. However, in Dim-1 we switch to MDR beyond a specific point in time, and this is likely to accelerate the wear-out of PCM blocks. Since mirroring duplicates the writes to two different PCM blocks, every write ages two separate PCM blocks, contributing to the diminishing capacity of the PCM device. Therefore, we investigate when to switch to MDR during the lifetime of a PCM block.


Figure 7. Frequency of invocation of the matching algorithm through a PRAM block's lifetime (matching algorithm invocations versus normalized block lifetime): (a) variance=0.1, (b) variance=0.2, (c) variance=0.3.

Figure 5 presents a comparison of lifetime for various configurations of Dim-1. In each case, we begin with PDR with a group size of 3. After we cross the specified number of faults shown for each configuration, we switch to MDR, where data is mirrored onto two PCM blocks. It is important to note that, as we continue to operate in PDR mode at higher fault counts, the cost associated with three-way matching increases sharply (Section III-A). For example, the average number of random trials to complete a three-way matching for PCM blocks with up to 80 faults is only one more than for blocks with up to 60 faults, and the corresponding lifetime improvement is 2.1%; whereas the number of trials for three-way matching of PCM blocks with up to 140 faults is ten-fold that of blocks with up to 60 faults, while the corresponding lifetime improvement is only 5.2%. Our experiments clearly show diminishing returns in PCM lifetime improvement if we try to remain longer under PDR with higher-order group sizes.

2) Sensitivity of PDR to DRAM Buffer Size: A factor that could be critical to minimizing the performance impact in PDR is the size of the DRAM buffer that we use for parity lookup. To investigate this effect, we present additional experiments to determine the size of the DRAM buffer needed for storing the parity in our benchmarks. Due to the small ratio of parity to PCM data pages, we find that relatively small DRAM buffer sizes (compared to PCM main memory) work very well toward minimizing the overall performance impact. Figure 6 shows the overheads experienced by PDR when we experiment with three different DRAM buffer sizes, viz., 4MB, 16MB, and 32MB. Our results show that 16MB is sufficient to keep the performance overheads below 7% in the average case, while the 4MB DRAM buffer incurs performance overheads of up to 23% in the libquantum and xalancbmk benchmarks. A 32MB DRAM buffer shows negligible benefit over 16MB in almost every benchmark, indicating that DRAM capacity (above 16MB) is no longer a bottleneck beyond the initial compulsory misses to parity information. This result shows that the amount of DRAM buffer needed to maintain low performance overheads across the three benchmark suites is just 1/256th of the capacity invested in PCM main memory. Furthermore, this 16MB DRAM buffer shows less than 16% overhead even for the stress case (seen in Figure 4(b)).
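As a quick check of that ratio (our arithmetic, not a tool from the paper): a 16MB buffer that is 1/256th of main memory implies a 4GB PCM main memory in the evaluated setup, and the same rule of thumb scales to other capacities.

# Parity-buffer sizing rule of thumb derived from the reported 1/256 ratio
# (our arithmetic; the 4GB figure is inferred from 16MB x 256).
PARITY_BUFFER_RATIO = 1.0 / 256

def parity_buffer_mb(pcm_capacity_mb):
    return pcm_capacity_mb * PARITY_BUFFER_RATIO

print(parity_buffer_mb(4096))   # 16.0 MB for a 4GB PCM main memory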

3) Frequency of Matching Algorithm Invocations: To understand the OS and software overhead, we perform experiments to quantify the average number of times the matching algorithm needs to be invoked during a PRAM block's lifetime. Figure 7 presents the results for process variation factors of 0.1, 0.2, and 0.3. In this test, we count the number of writes (presented as normalized lifetime) performed to a PRAM block before the block encounters its first bit fault. Once the block is no longer pristine, the matching algorithm is invoked to find a group of compatible blocks. As additional writes are performed to this block and its group pages, they incur more bit faults and eventually become incompatible due to faults in the same byte positions. At this point, the matching algorithm needs to be invoked again to find a new set of compatible pages for this faulty PCM block. We repeat this matching process until the PRAM block has exceeded 160 bit faults (when we discard the block permanently). From Figure 7, we can see that matching algorithm invocations are often clustered towards the end of a PRAM block's lifetime for a process variation of 0.1, while they are more spread out for a process variation of 0.3. In all cases, we find that the average number of matching algorithm invocations is less than 10 throughout the PRAM block's lifetime. Clearly, the matching algorithm accounts for a small fraction of the performance overhead in comparison to the actual writes performed on the PRAM block.
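The bookkeeping behind this experiment can be sketched as follows. This is a toy replay under our own assumptions about the fault-arrival model, page size, and partner-page fault counts; it only illustrates why invocation counts stay small and does not reproduce the paper's process-variation model.

import random

PAGE_BYTES = 4096     # assumed page size
FAULT_LIMIT = 160     # block retired beyond this many bit faults, as in the text

def simulate_invocations(seed=0):
    """Count matching-algorithm invocations for one block: faults arrive at
    random byte offsets, and a collision with a partner page forces a re-match."""
    rng = random.Random(seed)
    block_faults, partner_faults = set(), set()
    invocations = 0
    while len(block_faults) < FAULT_LIMIT:
        offset = rng.randrange(PAGE_BYTES)            # next fault in this block
        block_faults.add(offset)
        if not partner_faults or offset in partner_faults:
            # First fault, or group became incompatible: (re)run matching and
            # model the new partners as having a similar number of faults.
            invocations += 1
            partner_faults = {rng.randrange(PAGE_BYTES)
                              for _ in range(len(block_faults))}
            partner_faults -= block_faults            # matching guarantees no overlap
    return invocations

print(simulate_invocations())   # typically a single-digit count in this toy model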

V. CONCLUSIONS AND FUTURE WORK

In this paper, we explore a number of dynamic redundancy techniques to resuscitate faulty PCM pages and improve the lifetime of PCM-based main memory systems. We explore different design choices along two dimensions, namely 1) switching from PDR to MDR, and 2) reducing the group size in PDR from three to two. We show that, by intelligently combining the use of PDR and MDR schemes, the lifetime of PRAM can be improved by up to 43× over Fail Stop.

As future work, we plan to extend RePRAM to incorporate application-specific characteristics and system energy awareness. We will focus on capturing the key features of memory-intensive applications, and tune the hardware to adjust to their performance demands and energy constraints. Furthermore, we will extend this work by investigating other resistive memory technologies, as well as the system-level effects needed to tolerate failures resulting from the write endurance limitations inherent in some of these devices.

VI. ACKNOWLEDGMENTS

This material is based upon work supported in part by the National Science Foundation under CAREER Award CCF-1149557, and grants CCF-1117243 and OCI-0937875.
