IEEE 21st International Conference on Advanced Information Networking and Applications Workshops (AINAW'07)

An Area-Efficient Approach to Improving Register File Reliability Against Transient Errors

Mallik Kandala, Wei Zhang
Dept. of Electrical and Computer Engineering
Southern Illinois University Carbondale
Carbondale, IL 62901
{kandala,zhang}@engr.siu.edu

Laurence T. Yang
Dept. of Computer Science
St. Francis Xavier University
Antigonish, NS, B2G 2W5, Canada
[email protected]

Abstract

This paper studies approaches to efficiently exploiting the space both within and across registers to improve register file reliability against transient errors. Our approach is based on the fact that a large number of register values are narrow (i.e., less than or equal to 16 bits for a 32-bit architecture); therefore, the upper 16 bits of the registers can be used to replicate the short operands for enhancing register integrity. This paper also adapts a prior register replication approach by selectively copying register values (i.e., long operands only) to the unused physical registers for enhancing reliability without incurring significant hardware cost. Our experiments indicate that on average, 99.3% of register reads (regardless of short or long operands) can find their replicas available, implying a significant improvement of register file integrity against transient errors.

1 Introduction

Following performance and power dissipation, reliability against transient errors (also called soft errors or single-event upsets) has become the latest concern for microprocessor design. With the scaling of technology, lower supply voltages and higher densities will make future microprocessors more vulnerable to energetic particle strikes, leading to more transient errors. If left unprotected, such transient errors can result in silent data corruption or system crashes. Therefore, it is important to employ fault-tolerant techniques to detect and/or correct transient errors for reliable computing.

While it is of great importance to avoid transient errors by careful circuit design and packaging, they can still occur and hence must be addressed [1]. Recently, a large number of research efforts have focused on studying transient errors in the datapath [5, 6, 7, 8], caches [11, 12] and main memory [17, 18]. In contrast, the reliability of registers against transient errors has been largely ignored or sometimes just taken for granted until recently. Unfortunately, registers without protection are not automatically immune to energetic particle strikes. Moreover, the current trend of employing large, multi-ported register files for better performance (e.g., IA-64) can exacerbate the register file reliability problem. Transient errors occurring in registers can easily propagate to the functional units or cache memories, since registers are typically accessed very frequently. Also, on a processor whose functional units and cache memories are well protected, an unprotected or under-protected register file may become the Achilles' heel of system reliability. Therefore, register file reliability against transient errors must be addressed to ensure reliable and dependable computing.

To protect registers against soft errors, some processors use error detection and correction schemes in the register files. For instance, the IBM G5 utilizes an ECC-based scheme [7] to protect the registers. While ECC (i.e., an SEC-DED code) can detect double-bit errors and correct single-bit errors, it cannot correct double-bit or multi-bit errors. Moreover, the ECC scheme is costly in terms of performance and energy consumption. Tremblay et al. [19] found that a simple ECC operation can take up to three times the latency of a simple ALU operation. Although it is possible to perform the ECC computation and verification in the background, the energy consumption of ECC cannot be hidden. In fact, recent work indicates that the energy consumption of ECC is approximately an order of magnitude larger than that of a register access [20]. Compared to ECC, a less expensive technique to protect register files is parity checking. Parity-based schemes, however, can only detect odd-bit errors and cannot correct any errors, which makes them unsuitable for applications that demand high reliability.
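The limitation of parity noted above can be illustrated with a minimal sketch. It assumes a single even-parity bit over a 32-bit word; the function names and the example value are hypothetical and are not taken from the paper.

```python
def parity(word: int, width: int = 32) -> int:
    """Return the even-parity bit (XOR of all bits) of a width-bit word."""
    word &= (1 << width) - 1
    return bin(word).count("1") & 1


def parity_detects(original: int, corrupted: int, width: int = 32) -> bool:
    """True if the parity stored for `original` disagrees with `corrupted`."""
    return parity(original, width) != parity(corrupted, width)


value = 0x00001234                            # hypothetical register content
single_flip = value ^ (1 << 5)                # one upset bit
double_flip = value ^ (1 << 5) ^ (1 << 9)     # two upset bits

print(parity_detects(value, single_flip))     # True: an odd number of flips is caught
print(parity_detects(value, double_flip))     # False: an even number of flips is missed
```

In both cases parity alone offers no way to restore the original value, which is why the schemes studied in this paper pair it with a replica.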

Another straightforward way to protect registers is to duplicate (triplicate) the whole register file so that any register value can have one (two) duplicate(s) in the shadow register file(s). Such an approach is highly reliable yet very costly, because the area overhead of the register file will be more than 100% (200%). In this paper, we study an area-effective strategy to replicate register values within the existing space of logical registers and physical registers for superscalar microprocessors. The idea is based on the observation that about half of the register values are narrow (i.e., 16 bits or less, also called short operands in this paper); therefore, the upper 16 bits of registers can be used to replicate the short operands within the register itself for higher reliability. On the other hand, the long operands (i.e., with more than 16 bits) are duplicated to the unused physical registers to increase register integrity without impacting performance or demanding additional space, which is an extension of previous work proposed by Memik et al. [1]. However, compared with Memik's approach [1], which duplicates every register value into the physical registers (and thus can only achieve a moderate register replica read rate), our scheme can achieve higher reliability by making use of in-register replication for free for short operands and by exploiting unused physical registers more selectively and efficiently.

The rest of the paper is organized as follows. The next section explains the in-register replication for short operands. Section 3 introduces a selective replication scheme for long operands by exploiting the unused physical registers. Section 4 presents the evaluation methodology and Section 5 gives the simulation results. Finally, Section 6 concludes the paper.

2 In-Register Replication (IRR) for Short Operands

It is generally known that a large fraction of register values are narrow, i.e., these operands are less than or equal to 16 bits wide¹. Previous work has explored narrow operands for performance and energy optimizations [2, 13, 4, 3, 15, 14]. Compared with that work, this paper exploits narrow operands for higher reliability against transient errors. Recently, Kumar et al. studied register bits reuse (RBB) [24], which allows the main thread and the redundancy thread to write their results into the same register if the output of the main instruction is of narrow size. In contrast to Kumar's work [24], which aimed at reducing the resource pressure for redundant multi-threading (RMT), this paper focuses on exploring short operands to enhance register file integrity for single-threaded applications. More recently, Hu et al. proposed in-register duplication [16], which also takes advantage of narrow register values for higher register reliability of single-threaded applications. While the approach in [16] is very close to the first scheme studied in this paper, our work was done independently and perhaps concurrently with the work presented in [16]. Nevertheless, there are also some important distinctions between [16] and this paper. First, while Hu et al. studied a 64-bit microprocessor [16], this work focuses on examining a 32-bit processor, which is more challenging in terms of extracting and exploiting short operands. More importantly, this paper explores and evaluates an extension of a previous approach [1] to protect long operands as well. In contrast, only short operands are protected in [16], where long operands can become a severe reliability problem for registers.

¹ By default, we consider a 32-bit processor. For 64-bit processors, the narrow or short operands can contain more bits, such as 32 bits.

In this paper, we focus on studying a 32-bit architecture, where each register can store a 32-bit value. Therefore, the upper 16 bits of the registers, regardless of whether they are logical (architectural) or physical registers, can be used to replicate the short operands for achieving higher reliability against transient errors without utilizing additional registers. Such a scheme is called in-register replication (IRR) in this paper. With the advent of 64-bit architectures, such as Intel Itanium and AMD Hammer, the IRR approach is expected to be more effective at exploiting the in-register space for increasing register file reliability against transient errors.

To distinguish short operands from other register values that are longer than 16 bits, a short operand bit is associated with each register. Initially, all the short operand bits are set to 0. When a value is written into a register (i.e., a physical register at the completion stage, or a logical register at the commit stage), if it can be represented with 16 bits, the short operand bit will be set by the data width detection logic, as shown in Figure 1. The overhead of adding an additional bit for a 32-bit register is just 3.1% (1/32). Since our goal is to improve the register file reliability, we choose to use a non-speculative technique to calculate the width of the data written to the register file, although previous work shows that the widths of register values are highly predictable [4]. We conservatively assume that it takes one clock cycle to determine the width of the data written to the register file. Therefore, the completion stage as well as the commit stage will each be extended into two stages to support in-register replication for physical registers and logical registers respectively, as illustrated in Figure 2. As a result, the clock cycle time of the processor will not be impacted. Moreover, since the data can be forwarded to the dependent instructions before the completion stage², such a modification of the pipeline will not significantly impact performance, although the values will be written back to registers later. To read operands from either the physical registers or the logical registers, the short operand bits will be checked first. If the operand is narrow, it will be sign-extended before it is passed to the ALU, as illustrated in Figure 3.

[Figure 1: 32-bit data passes through the data width determination logic and is written to a register made of two 16-bit halves plus a 1-bit short operand flag (1: short operand, 0: long operand).]

Figure 1. Write a short/long operand to a register.

[Figure 2: original pipeline: Fetch, Dispatch, Execution, Memory, Cmplt, Commit; modified pipeline: Fetch, Dispatch, Execution, Memory, Cmplt1, Cmplt2, Commit1, Commit2.]

Figure 2. Modification of the completion and commit stages to distinguish short and long operands written to registers.
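The write and read paths described in this section can be sketched in software as follows. This is only an illustrative model, not the hardware design: it assumes that a value counts as short when it equals the sign extension of its lower 16 bits (consistent with the sign-extending read path), and the names `is_short`, `irr_write` and `irr_read` are hypothetical.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF


def sign_extend16(half: int) -> int:
    """Sign-extend a 16-bit pattern to a 32-bit pattern."""
    half &= MASK16
    return (half | 0xFFFF0000) if (half & 0x8000) else half


def is_short(value: int) -> bool:
    """Width detection: the value is fully described by its lower 16 bits."""
    value &= MASK32
    return sign_extend16(value) == value


def irr_write(value: int) -> tuple[int, int]:
    """Return (stored_word, short_operand_bit) for a register write."""
    value &= MASK32
    if is_short(value):
        low = value & MASK16
        return (low << 16) | low, 1   # replica of the lower half in the upper half
    return value, 0                   # long operand stored unchanged


def irr_read(stored: int, short_bit: int) -> int:
    """Return the 32-bit operand passed to the ALU."""
    if short_bit:
        return sign_extend16(stored & MASK16)
    return stored & MASK32


stored, short_bit = irr_write(-5)
print(hex(stored), short_bit)             # 0xfffbfffb 1
print(hex(irr_read(stored, short_bit)))   # 0xfffffffb
print(1 / 32)                             # 0.03125, the 3.1% short operand bit overhead
```

Note that under this assumption a value such as 0x8000 is treated as long, because sign-extending its lower half would not reproduce it.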

To improve the data integrity of register files, we assume the existence of a parity bit for every 16 bits in the registers, as shown in Figure 4. This parity bit indicates whether these 16 bits have single-bit (or odd-bit) soft errors or not. When a register is read, the short operand bit will be checked to determine the width of the operand. For a short operand, the parity bit of the lower 16 bits will be read. If there are no soft errors, these 16 bits will be passed to the ALU. Otherwise, the parity bit of the upper 16 bits will be checked. If this parity bit indicates no error, then the upper 16 bits will be transferred to the ALU and also copied to the lower half of the register for future references. If the parity bit of the upper 16 bits also shows a soft error, the error is not recoverable. However, it is very rare for both the upper 16 bits and the lower 16 bits of a register to have soft errors at the same time, since soft errors are generally distributed uniformly [10].

² It should be noted that the data forwarding network needs to be extended to support the additional forwarding paths from the complete2 stage and the commit2 stage to previous pipeline stages.

[Figure 3: each of two source registers supplies its 16-bit halves to a sign-extend unit and a MUX, which deliver 32-bit operands to the ALU.]

Figure 3. Read short/long operands from registers.

[Figure 4: register layout with a 1-bit short operand bit, two 16-bit halves, and a 1-bit parity bit for each half.]

Figure 4. A parity bit is associated with both the upper and lower 16 bits.
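The read-time check and recovery for a short operand can be sketched as below. The sketch assumes even parity per 16-bit half and models a register as a small record; the class, field and function names are hypothetical, and signalling the unrecoverable case with an exception is an illustrative choice rather than something the paper specifies.

```python
from dataclasses import dataclass


def parity16(half: int) -> int:
    """Even-parity bit of a 16-bit half."""
    return bin(half & 0xFFFF).count("1") & 1


@dataclass
class Register:
    lower: int      # primary copy of the short operand
    upper: int      # in-register replica
    p_lower: int    # parity bit of the lower half
    p_upper: int    # parity bit of the upper half


def write_short(value16: int) -> Register:
    """Store a short operand, its replica, and one parity bit per half."""
    half = value16 & 0xFFFF
    p = parity16(half)
    return Register(half, half, p, p)


def read_short(reg: Register) -> int:
    """Return the 16-bit payload, repairing the lower half from the replica if needed."""
    if parity16(reg.lower) == reg.p_lower:
        return reg.lower
    if parity16(reg.upper) == reg.p_upper:
        reg.lower, reg.p_lower = reg.upper, reg.p_upper   # copy back for future reads
        return reg.upper
    raise RuntimeError("both halves fail the parity check: unrecoverable")


reg = write_short(0x1234)
reg.lower ^= 1 << 3               # inject a single-bit upset into the lower half
print(hex(read_short(reg)))       # 0x1234, served from the replica
print(hex(reg.lower))             # 0x1234, the lower half has been repaired
```

The same check-then-fall-back order would apply to a long operand and its copy register, with the replica living in a different physical register instead of the upper half.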

3 Across-Register Replication (ARR): Exploiting Unused Physical Registers for Replicating Long Operands

So far, we have only covered how to protect the short operands by replicating them within the registers themselves. The long operands, however, are only covered by parity bits, which cannot correct any soft errors. In order to enhance the error recovery capability for long operands, we extend a recent approach proposed by Memik et al. [1], which replicates register values into unused physical registers to increase the register file immunity to soft errors. While Memik's approach can exploit the unused physical registers efficiently for higher reliability, it can only improve the register reliability moderately. The main reason is that it attempts to copy every register value, regardless of its width. Consequently, given a limited number of physical registers, it may happen that there is no physical register available to copy the current register value, or that the previous copy register has to be overwritten by the register renaming of another value, both of which can compromise the register file integrity. As the experiments in [1] indicate, only about 78% of register accesses can find their duplicates available in the physical registers. In other words, about 22% of register accesses are not protected by Memik's scheme, which can become a severe reliability bottleneck for applications that demand high reliability. Based on our discussion of in-register replication, all the short operands can be copied within their own registers without requiring any additional unused registers. Therefore, only long operands need to be copied to the unused physical registers for higher reliability. This extension of Memik's approach is called across-register replication in this paper (in contrast to in-register replication for short operands). As with the short operands, we also assume the existence of parity to protect both the original long operands and their replicas. The error detection and correction strategy for long operands is also the same as that for short operands, i.e., the parity bit is checked first. Once a soft error is detected, the replica of the long operand will be used to recover from the error if the parity bit of the replica indicates that it is error-free.

Since both short operands and long operands are generally present in applications, we propose to combine IRR and ARR (also called IRR+ARR) to fully exploit the register space for replicating both short operands and long operands, which can potentially lead to very high register reliability against transient errors without significant hardware cost.
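The selective policy of IRR+ARR can be sketched as a small bookkeeping model. Everything here is an assumption made for illustration: the class `CopyRegisterPool`, its method names, and in particular the choice to reclaim or overwrite the oldest copy register are hypothetical, since the paper does not specify a replacement policy.

```python
class CopyRegisterPool:
    """Toy model of unused physical registers that hold replicas of long operands."""

    def __init__(self, free_regs):
        self.free = list(free_regs)   # unused physical registers
        self.copy_of = {}             # copy register -> primary register (insertion ordered)

    def replicate(self, primary, short):
        """Return the copy register chosen for a written value, or None."""
        if short:
            return None               # IRR already keeps the replica in the register itself
        if self.free:
            copy = self.free.pop()
        elif self.copy_of:
            copy = next(iter(self.copy_of))   # assumed policy: overwrite the oldest replica
            del self.copy_of[copy]
        else:
            return None               # nothing available: the write goes unreplicated
        self.copy_of[copy] = primary
        return copy

    def rename_allocate(self):
        """Renaming has priority: use a free register, else reclaim a copy register."""
        if self.free:
            return self.free.pop()
        if self.copy_of:
            copy = next(iter(self.copy_of))
            del self.copy_of[copy]    # the replica held there is lost
            return copy
        return None                   # no register at all: renaming would stall

    def find_replica(self, primary):
        """Return the copy register holding a replica of `primary`, if any."""
        for copy, owner in self.copy_of.items():
            if owner == primary:
                return copy
        return None


pool = CopyRegisterPool([40, 41])
pool.replicate(1, short=True)     # short operand: no physical register consumed
pool.replicate(2, short=False)    # long operand: takes register 41
pool.replicate(3, short=False)    # long operand: takes register 40
pool.rename_allocate()            # renaming reclaims register 41
print(pool.find_replica(2))       # None: this read would find no replica
print(pool.find_replica(3))       # 40
```

The model shows why filtering out short operands matters: they never enter the pool, so fewer replicas of long operands are displaced before they are read.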

4 Evaluation Methodology

4.1 Experimental Setup

We have implemented the proposed replication strategy for short operands and long operands in SimpleScalar 3.0 [21]. Our simulator models an out-of-order superscalar microprocessor³. The important parameters of the processor and the memory hierarchy are listed in Table 1. We select eleven applications from the SPEC 2000 benchmark suite [22] for this evaluation. For each benchmark, we fast-forward an application-specific number of instructions based on SimPoint [23] and then simulate the next 500 million instructions.

Table 1. Configuration parameters of the simulated microprocessor.

Parameter               Value
Processor
Functional Units        4 integer ALUs, 1 integer multiplier/divider, 4 FP ALUs, 1 FP multiplier/divider
Fetch Width             4 instructions/cycle
Issue Width             4 instructions/cycle
Logical Registers       32 INT, 32 FP
Physical Registers      80 INT, 80 FP
Instruction Window      80-RUU, 40-LSQ
Cache and Memory Hierarchy
L1 Instruction Cache    32KB, 1-way, 32 byte blocks, 1 cycle latency
L1 Data Cache           32KB, 1-way, 32 byte blocks, WB, 1 cycle latency
L2                      1MB unified, 8-way LRU, WB, 64 byte blocks, 6 cycle latency
Memory                  100 cycle latency
TLB Size                128-entry, 30-cycle miss penalty

4.2 Reliability Metrics

To evaluate the effectiveness of our scheme, we use the following two metrics, in addition to the fault injection experiments.

1. replica write rate: the fraction of register writes that can successfully duplicate the data. For short operands, the replica write rate is always 100%. By comparison, the replica write rate for long operands depends on the availability of unused physical registers at the time of the register writes.

2. replica read rate: the fraction of register reads that can find the replicas. Since the replicas of the short operands always co-exist with the short operands themselves, the replica read rate for short operands will be 100%. For long operands, however, the replica read rate depends on the availability of the replicas at the time of the register reads, since copy registers may be allocated to other register values before they are read. In general, the larger the replica read rate, the higher the register file reliability.

³ It should be noted that the IRR scheme proposed in this paper is not limited to superscalar processors; it can also be applied to other architectures, such as statically-issued microprocessors, for improving register reliability in a cost-effective manner.

[Figure 5: bar chart; x-axis: benchmarks (apsi, art, bzip2, fma3d, gcc, gzip, mcf, mesa, sixtrack, swim, wupwise, average); y-axis: Percentage of Short Operands, 0% to 80%.]

Figure 5. Percentage of register values that are less than or equal to 16 bits wide.
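The two metrics defined above amount to simple ratios of event counts. The following sketch shows one way a simulator could accumulate them; the `ReplicaCounters` class and its method names are hypothetical and do not correspond to any SimpleScalar interface.

```python
from dataclasses import dataclass


@dataclass
class ReplicaCounters:
    writes: int = 0
    writes_replicated: int = 0
    reads: int = 0
    reads_with_replica: int = 0

    def record_write(self, replicated: bool) -> None:
        """Count one register write and whether a replica was created for it."""
        self.writes += 1
        self.writes_replicated += int(replicated)

    def record_read(self, replica_available: bool) -> None:
        """Count one register read and whether its replica still existed."""
        self.reads += 1
        self.reads_with_replica += int(replica_available)

    @property
    def replica_write_rate(self) -> float:
        return self.writes_replicated / self.writes if self.writes else 0.0

    @property
    def replica_read_rate(self) -> float:
        return self.reads_with_replica / self.reads if self.reads else 0.0


counters = ReplicaCounters()
for replicated in (True, True, True, True):
    counters.record_write(replicated)
for available in (True, True, True, False):    # one replica was reclaimed before the read
    counters.record_read(available)

print(counters.replica_write_rate)   # 1.0
print(counters.replica_read_rate)    # 0.75
```

The toy trace illustrates the distinction the metrics are meant to capture: every write was replicated, yet one read still found no replica.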

5 Results

Our first experiment studies what percentage of integer register operands are narrow, which is shown in Figure 5. As we can see, most of the benchmarks have a fraction of short operands around 50%. On average, the ratio of short operands is 48%, indicating that almost half of the registers can be used to replicate the short operands for better reliability. These results also imply that only 52% of operands (i.e., long operands) need to use additional physical registers for replication.

Figure 6 depicts the replica write rate and replica read rate for long operands. Interestingly, we find that all the long operands can be successfully replicated across all the benchmarks, mainly because in-register replication can filter out about half of all the register values, which will not compete for physical registers. In addition, a long operand can be copied to an existing copy register of another long operand even if there is no unused physical register available. However, this perfect replica write rate does not directly imply perfect reliability, since register reliability is improved only when the replicas can be found at the time of register reads (not writes). As can be seen from Figure 6, the replica read rate is also very high. On average, 98.5% of long operand reads are able to find their replicas available, which is a result of the perfect replica write rate for long operands.

[Figure 6: bar chart; x-axis: benchmarks (apsi, art, bzip2, fma3d, gcc, gzip, mcf, mesa, sixtrack, swim, wupwise, average); y-axis: rate, 0.92 to 1; series: Replica Write Rate, Replica Read Rate.]

Figure 6. The replica write rate and replica read rate for long operands.

[Figure 7: bar chart; x-axis: benchmarks (apsi, art, bzip2, fma3d, gcc, gzip, mcf, mesa, sixtrack, swim, wupwise, average); y-axis: rate, 0.92 to 1; series: Replica Write Rate, Replica Read Rate.]

Figure 7. The overall replica write rate and replica read rate.

Since both short operands and long operands can be read and written, Figure 7 gives the overall replica write rate and overall replica read rate for the IRR+ARR scheme. Because both the replica write rate and the replica read rate for short operands are always 100%, the overall replica write rate is still 100%. Also, the overall replica read rate, which considers both short and long operands, is expected to be larger than that of the long operands only. As can be seen from Figure 7, the average overall replica read rate across all benchmarks is 99.3%. Therefore, by combining IRR for short operands and ARR for long operands, the overall register reliability can be greatly enhanced without significant hardware overhead.
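As a rough sanity check, the overall replica read rate can be approximated as a weighted mix of the short operand and long operand read rates. The sketch below is a back-of-the-envelope estimate, not the paper's measurement: it assumes the 48% short operand fraction of register values also holds for register reads, and the function name is hypothetical.

```python
def overall_replica_read_rate(short_fraction: float,
                              long_read_rate: float,
                              short_read_rate: float = 1.0) -> float:
    """Weighted mix of short operand and long operand replica read rates."""
    return short_fraction * short_read_rate + (1.0 - short_fraction) * long_read_rate


# Averages reported in this section: 48% short operands, 98.5% long operand read rate.
estimate = overall_replica_read_rate(0.48, 0.985)
print(f"{estimate:.1%}")   # 99.2%
```

The estimate lands close to the reported 99.3%. A small gap is expected, since the mix of short and long operands among reads may differ from the mix among register values, and the reported figure is an average over per-benchmark results.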

6 Concluding Remarks

With shrinking feature sizes, lower supply voltages and higher densities, future microprocessors will be more vulnerable to transient errors. While techniques to protect the datapath and cache memories have been studied extensively, there is relatively little work on improving register file reliability against transient errors. This paper exploits the fact that a large fraction of register values are narrow and can be replicated within registers for enhancing reliability without demanding information or space redundancy. Also, building upon a previous register replication scheme by Memik et al. [1], we propose to replicate only long operands in the unused physical registers for achieving higher reliability. Our experiments indicate that on average, 99.3% of register reads can find their replicas available, implying a significant improvement in register reliability. We believe the approach studied in this paper is especially desirable for systems that demand very high register reliability under cost constraints.

References

[1] G. Memik, M. Kandemir and O. Ozturk. Increasing register file immunity to transient errors. In Proc. of DATE, 2005.

[2] D. Brooks and M. Martonosi. Dynamically exploiting narrow width operands to improve processor power and performance. In Proc. of the Fifth International Symposium on High-Performance Computer Architecture (HPCA-5), January 1999.

[3] M. H. Lipasti, B. R. Mestan and E. Gunadi. Physical register inlining. In Proc. of ISCA’31, June 2004.

[4] G. H. Loh. Exploiting data-width locality to increase superscalar execution bandwidth. In Proc. of MICRO’35, 2002.

[5] P. Shivakumar, M. Kistler, S. Keckler, D. Burger and L. Alvisi. Modeling the effect of technology trends on soft error rate of combinational logic. In Proc. of DSN, 2002.

[6] T. Austin. DIVA: a reliable substrate for deep submicron microarchitecture design. In Proc. of MICRO, 1999.

[7] S. K. Reinhardt and S. S. Mukherjee. Transient fault detection via simultaneous multithreading. In Proc. of ISCA, 2000.

[8] J. Ray, J. C. Hoe and B. Falsafi. Dual use of superscalar datapath for transient-fault detection and recovery. In Proc. of MICRO’34, 2001.

[9] M. A. Gomaa, C. Scarbrough, T. N. Vijaykumar, and I. Pomeranz. Transient-fault recovery for chip multiprocessors. In Proc. of the 30th Annual International Symposium on Computer Architecture (ISCA), June 2003.

[10] S. S. Mukherjee, C. Weaver, J. Emer, S. Reinhardt, and T. Austin. A Systematic Methodology to Compute the Architectural Vulnerability Factors for a High-Performance Microprocessor. In Proc. of MICRO, 2003.

[11] S. Kim and A. K. Somani. Area efficient architectures for information integrity in cache memories. In Proc. of ISCA, 1999.

[12] C. Chen and A. K. Somani. Fault containment in cache memories for TMR redundant processor systems. IEEE Transactions on Computers, March 1999.

[13] O. Ergin, D. Balkan, K. Ghose and D. Ponomarev. Register packing: exploiting narrow-width operands for reducing register file pressure. In Proc. of MICRO’37, 2004.

[14] R. Gonzalez, A. Cristal, A. Veidenbaum, M. Pericas, and M. Valero. An asymmetric clustered processor based on value content. In Proc. of ICS’05, June 2005.

[15] M. Kondo and H. Nakamura. A small, fast and low-power register file by bit-partitioning. In Proc. of HPCA’05, Feb 2005.

[16] J. Hu, S. Wang and S. G. Ziavras. In-register replication: exploiting narrow-width value for improving register file reliability. To appear in Proc. of DSN’06, June 2006.

[17] T. J. Dell. A white paper on the benefits of chipkill-correct ECC for PC server main memory. IBM, Nov 1997.

[18] C. L. Chen and M. Y. Hsiao. Error-correcting codes for semiconductor memory applications: a state of the art review. In Reliable Computer Systems - Design and Evaluation, pages 771-786, Digital Press, 2nd edition, 1992.

[19] M. Tremblay and Y. Tamir. Support for fault tolerance in VLSI processors. ISCS, 1989.

[20] R. Phelan. Addressing soft errors in ARM core-based SoC. ARM White Paper, Dec. 2003.

[21] http://www.simplescalar.com.

[22] http://www.spec.org.

[23] T. Sherwood, E. Perelman, G. Hamerly and B. Calder. Automatically characterizing large scale program behavior. In Proc. of the Tenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2002), October 2002.

[24] S. Kumar and A. Aggarwal. Reduced resource redundancy for concurrent error detection techniques in high performance microprocessors. In Proc. of International Conference on High Performance Computer Architecture, 2006.

21st International Conference on Advanced Information Networking and Applications Workshops (AINAW'07), 0-7695-2847-3/07 $20.00 © 2007
