
An Efficient and Experimentally Tuned Software-Based Hardening Strategy for Matrix Multiplication on GPUs

P. Rech, C. Aguiar, C. Frost, and L. Carro

Abstract—Neutron radiation experiment results on matrix multiplication on graphic processing units (GPUs) show that multiple errors are detected at the output in more than 50% of the cases. In the presence of multiple errors, the available hardening strategies may become ineffective or inefficient. Analyzing radiation-induced error distributions, we developed an optimized and experimentally tuned software-based hardening strategy for GPUs. With fault-injection simulations, we compare the performance and correcting capabilities of the proposed technique with the available ones.

Index Terms—Graphic processing unit (GPU), matrix multiplication, neutron radiation testing, software-based hardening.

I. INTRODUCTION

Graphic processing units (GPUs) are electronic devices first designed to perform high-performance image processing. To achieve this objective, GPUs must be able to rapidly manipulate a large number of memory locations and are typically able to execute several elementary tasks in parallel at very high speed [1]. Due to their highly parallel structure, the latest GPUs may be more effective than general-purpose CPUs for algorithms in which large blocks of data can be processed in parallel. GPUs have thus begun to be preferred to CPUs for extremely parallelized applications such as oil exploration, analysis of air traffic flow, medical image processing (e.g., identifying hidden plaque in arteries), linear algebra, statistics, 3D reconstruction, and stock option pricing [2].

The newest GPU core processors are built with cutting-edge

technology and are thus potentially very prone to experience radiation-induced errors [3], [4], even in terrestrial applications, where neutrons are among the main sources of failures. In fact, neutrons originated by the interaction of cosmic rays with the atmosphere may generate different kinds of errors in

Manuscript received September 26, 2012; revised February 01, 2013; accepted March 08, 2013. Date of publication April 30, 2013; date of current version August 14, 2013. This work was supported by the CAPES Foundation of the Ministry of Education, the CNPq Research Council of the Ministry of Science and Technology, and the FAPERGS Research Agency of the State of Rio Grande do Sul, Brazil. Experiments were performed at ISIS, Rutherford Appleton Laboratories, Didcot, U.K., funded by the Science and Technology Facilities Council (STFC), U.K.

P. Rech, C. Aguiar, and L. Carro are with the Instituto de Informática, Federal University of Rio Grande do Sul (UFRGS), Porto Alegre, RS, Brazil (e-mail: [email protected]; [email protected]; [email protected]).

C. Frost is with STFC, Rutherford Appleton Laboratories, Didcot OX11 0QX, U.K. (e-mail: [email protected]).

Digital Object Identifier 10.1109/TNS.2013.2252625

modern computing systems such as GPUs. The impinging neutron may corrupt one or more bits stored in memory elements like internal registers or caches [single- or multiple-cell upsets (SEUs or MCUs)] or generate temporary voltage pulses in logic resources [single-event transients (SETs)] [5].

It was already demonstrated in [6] that both memory and logic

resources of a GPU may be corrupted by atmospheric neutrons. This paper evaluates the neutron-induced error rate of matrix multiplication, which is an important tool for several GPU applications such as signal and control algorithms, image processing, video/audio editing, weather prediction, and finite element analysis, to name a few [7], [8]. In some applications in which matrix operations are exploited, such as audio, video, graphics, and visualization processing, a certain number of errors in the output may be tolerated [9]. However, if uncorrected or underestimated, radiation errors may drastically lower the quality of the user experience or compromise the media rendering. Moreover, matrix operations on GPUs are widely used in safety-critical applications that require a high level of reliability.

The experimental results we present here show that neutrons

generate multiple output errors when matrix multiplication is executed on modern GPUs. Multiple errors, in most of the cases, affect a single row or column of the resulting matrix but may also be randomly distributed. The observed phenomena may generate unacceptable visible/audible patterns in entertainment applications and are surely a critical issue if reliability is a major concern.

Unfortunately, most of the available hardening techniques

are based on the assumption that just a single radiation-induced output error can occur [10]. As we demonstrate here, this hypothesis is no longer valid for modern GPUs. Thus, the correction capability or performance of the available hardening strategies may be compromised. It is then mandatory to design novel hardening solutions capable of efficiently dealing with multiple errors. We built an optimized software-based hardening strategy for GPUs, designed and tuned taking advantage of experimental data. The proposed technique is more efficient than the available ones, as we will demonstrate through fault-injection simulations.

The remainder of the paper is organized as follows. Section II

summarizes the principal characteristics of the GPU computing units. Section III reports the matrix multiplication neutron sensitivity when executed on a GPU. Section IV lists the hardening strategies we have implemented on the GPU, highlighting the peculiarities of the proposed experimentally tuned


Fig. 1. Simplified internal structure of a GPU.

software-based hardening strategy. Section V evaluates the efficiency of the different hardening techniques, while Section VI concludes the paper.

II. GRAPHIC PROCESSING UNITS INTERNAL STRUCTURE

GPUs are divided into various computing units, named streaming multiprocessors (SMs), each of which is able to execute several threads in parallel. Each basic computing unit (named CUDA core in NVIDIA devices) in the SM executes one thread with dedicated memory locations, avoiding complex resource sharing or the need for long pipelines. A very simple control unit is then sufficient to schedule thread execution [2] (Fig. 1). The structure of a GPU is different from the typical CPU one, where sophisticated control units are needed to schedule complex thread execution, sequentially or in parallel with other executions, possibly taking advantage of the presence of multiple arithmetic logic units. Moreover, on a CPU, large cache memories must be provided to minimize the instruction and data access latencies of large complex applications.

In a GPU, each thread may dispose of up to tens of megabits

of internal registers and has access to a shared memory, which is necessary to avoid multiple accesses to the DRAM performed by threads executing operations on the same data. Usually, on a GPU, threads do not interact with each other, to minimize latency and waiting delays and to take full advantage of the GPU parallelism. Thus, each thread is executed by a small stand-alone computing unit, and the GPU scheduler is needed just to synchronize all threads and to check if execution has been completed. The GPU physical design may then be viewed as composed of several isolated elementary computing units (each executing a thread) whose corruption will generate an output error in the threads it executes but will not affect other units.
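To make this organization concrete, the following minimal CUDA fragment (ours, not from the paper; myKernel, d_A, d_B, d_M, and n are hypothetical names) launches one lightweight thread per output element and lets the hardware distribute the resulting blocks among the SMs:

// Hypothetical launch: one thread per output element of an n x n result.
// The scheduler groups threads into blocks and assigns blocks to the SMs;
// the host only waits for completion, as described above.
dim3 threadsPerBlock(16, 16);                      // 256 threads per block
dim3 blocksPerGrid((n + 15) / 16, (n + 15) / 16);  // enough blocks to cover n x n
myKernel<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_M, n);
cudaDeviceSynchronize();                           // barrier: all threads completed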

III. MATRIX MULTIPLICATION NEUTRON SENSITIVITY

The tested devices are commercial off-the-shelf GeForce GTX480 GPUs designed by NVIDIA in a 40-nm technology node. This GPU can run at a maximum frequency of 1.215 GHz and is divided into 15 SMs, which enables the GPU to instantiate 64M threads in parallel. However, the GTX480 disposes of just 480 CUDA cores, which are the essential GPU computing units, each of which is able to execute one thread at a time. This means that just 480 threads can be truly processed

Fig. 2. ISIS spectrum compared with those of the LANSCE and TRIUMF facilities and with the terrestrial one at sea level multiplied by and [11].

in parallel; the others are pipelined, and their execution is controlled by two schedulers.

The radiation experiments were performed at the ISIS facility in the Rutherford Appleton Laboratories (RAL), Didcot, U.K., with a neutron spectrum (Fig. 2) that has already been demonstrated to be suitable for emulating the atmospheric one [11]. The available neutron flux was of about n/(cm² s) for energies above 10 MeV. The beam was focused on a spot with a diameter of 2 cm plus 1 cm of penumbra, which was enough to cover the whole GPU chip, leaving the on-board DDR and the power control circuitry not irradiated.

During the experiments, the GPU was controlled by a desktop PC through a 2.5-GHz PCI-Express bus. An extension of 20 cm was added to the PCI-Express bus to prevent scattered neutrons from affecting the PC functionalities. The extension was provided with fuses to prevent current spikes from the GPU from reaching the PC motherboard. The only role of the PC is to initialize the board under test, download the results, and check for mismatches when the test is finished. Thanks to the extremely high frequency of both the PC and the PCI-Express bus, test initialization, results readback, and error checking are performed very quickly (on the order of milliseconds in the worst case), making it unlikely for a neutron to generate an error during their execution.

We tested under radiation a benchmark that performs the

multiplication of two 2048 × 2048 random matrices (A and B) executing 2048 × 2048 parallel threads, each in charge of calculating a single element of the resulting matrix following (1) [12]:

M[i,j] = sum_{k=1..n} A[i,k] · B[k,j],   i, j = 1, ..., n.   (1)
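A minimal CUDA rendition of this benchmark kernel, with each thread computing one element of M according to (1), could look as follows (our sketch, not the authors' code; names and the row-major layout are assumptions):

// Each of the n x n threads computes one element of M = A * B (Eq. (1)).
// Matrices are assumed to be stored row-major in global memory.
__global__ void matMul(const float *A, const float *B, float *M, int n)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;   // row of M
    int j = blockIdx.x * blockDim.x + threadIdx.x;   // column of M
    if (i < n && j < n) {
        float acc = 0.0f;
        for (int k = 0; k < n; ++k)
            acc += A[i * n + k] * B[k * n + j];      // n multiplications, n sums
        M[i * n + j] = acc;
    }
}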

The experimentally obtained neutron-induced error rate of matrix multiplication is errors/execution. The application cross section, obtained dividing the number of faulty executions per unit time by the flux, is cm². As the spectrum of neutron energies at ISIS resembles the atmospheric one [11], the cross section obtained at ISIS is also the probability for a neutron, in the natural environment, to corrupt a matrix multiplication execution on the GPU. Multiplying the cross section by the natural neutron flux (which is about 14 n/(cm² h) in New York City [13]), we obtain the radiation-induced error rate of matrix multiplication when executed on a GPU in real applications, which is failures in time (FIT). For single-user utilization, like gaming or video editing, this value may be reasonable. However, the error rate of matrix multiplication is unacceptable for safety-critical applications or supercomputers. A


TABLE I
SINGLE VERSUS MULTIPLE ERRORS

Fig. 3. Percentage of corrupted output matrices affected by single and multiple output errors.

supercomputer composed of 500 GPUs, for instance, will experience about FIT, i.e., almost one error per week, which may be intolerable. It is worth noting that the input matrices were stored in the DDR available on the GPU board, which was not irradiated. Output errors are then produced by the corruption of GPU internal memory and logic resources.

We can further study the experimental data by analyzing the characteristics of the corrupted resulting matrix. Table I reports, and Fig. 3 shows, the percentage of faulty executions in which a single error or multiple errors were detected in the output matrix. As can be seen, single output errors are detected in less than 43% of the cases. This result is of extreme importance, as it demonstrates that, for modern GPUs, the accepted assumption of having just single radiation-induced output errors is no longer valid.

Fig. 3 shows the different error patterns we detected when

multiple errors affect the output matrix. In most of the cases, multiple errors are distributed on a single row or column, while in just 8% of the cases errors are randomly distributed.

As stated in the previous section, the error rate of matrix multiplication irradiated at ISIS is of errors/execution. It is then very unlikely for more than one neutron to corrupt the GPU during the same benchmark execution. Moreover, as demonstrated in [6], the MCU cross section of the GPU memory resources is one order of magnitude lower than the SEU one. Thus, the observed multiple output errors are caused neither by multiple neutrons nor by a single neutron generating multiple upsets. The observed phenomena can be explained by analyzing the benchmark code and the GPU internal structure.

Errors on a single row or column may be due to cache bit corruption. In fact, all the threads in charge of calculating a row of matrix M take the same row of matrix A but different columns of matrix B as input. To improve the code parallelism and to optimize the memory usage, the row of A is stored in the cache of the SM where the considered threads are executed.

TABLE II
RANDOM ERRORS DISTRIBUTION

Fig. 4. Percentage of corrupted output matrices affected by 2, 3, 4, or 5 randomly distributed errors.

Thus, if a bit of that row is corrupted, all the corresponding elements in the row of M will be erroneous. It is worth noticing that the threads in charge of calculating a row of M are not all assigned to the same SM; on the contrary, they are likely to be distributed homogeneously among the 15 SMs to maintain a high level of parallelism. During our experiments, in fact, we never observed a whole row of M corrupted; only some random locations inside the row were found to be erroneous. The same consideration applies to multiple errors on a single column.

Randomly distributed errors are probably caused by scheduler failures. The scheduler is in charge of designating the group of threads to be executed in each SM and of detecting whether all the threads have completed computation after the execution. If so, results are presented at the output and another group of threads is executed in the corresponding SM. In the event of scheduler corruption, the results may be presented even if some threads have not completed computation, leading to wrong results. As reported in Fig. 4 and Table II, just two locations of M were corrupted in the majority of the cases in which randomly distributed errors occurred, and it is very unlikely to have three or four wrong random locations, as this happens in approximately 1.14% and 0.57% of the faulty computations, respectively. As we will detail in the following sections, this information is essential to optimize the proposed hardening strategy and tune its correction capability.

IV. HARDENING STRATEGIES

Hardening techniques may be applied at different levels of abstraction to strengthen the matrix multiplication algorithm and meet the required level of reliability. Here, we focus on software-based hardening strategies, as introducing hardware modifications is normally costly and would require the design and production of custom devices. The analyzed software-based


hardening techniques do not require layout or hardware modifications; however, they demand additional resource utilization and computational effort, which will be evaluated in the following sections. An easy way of detecting and correcting errors during the execution of an algorithm is to execute it more than once and compare the results of the different executions. This strategy introduces costly computing and power overheads. However, its inefficiency is compensated by its generality, as it can be applied to any algorithm.

To harden the matrix multiplication algorithm when executed on a GPU, we implement two known hardening strategies and design a novel one able to efficiently correct the experimentally observed multiple errors.

A. Triple Modular Redundancy

The first considered hardening technique is triple modular redundancy (TMR). On a GPU, where a huge number of parallel threads can be instantiated, the number of computing units executing the algorithm can be tripled, and results can be compared once the calculation is completed.

As detailed in Section II, even if 64M threads can be instantiated on the tested GPU, not all of them can run in parallel due to limited computing resources. Thus, for small matrices, the execution time of the hardened code may be comparable with the Plain version (i.e., not hardened), but significant performance degradation may be introduced for big matrices (greater than 256 × 256 on the available GPU). Moreover, TMR introduces a considerable resource utilization overhead that may limit the number of applications that can be executed in parallel on the GPU and increases the power consumption.

The resource overhead and increased power consumption that derive from the TMR approach are compensated by the generality of the approach. In fact, TMR can be applied to any algorithm in the same manner, without the need of re-engineering the code.

B. Algorithm-Based Fault Tolerance

The second hardening technique we have implemented on the GPU is designed according to the algorithm-based fault tolerance (ABFT) philosophy. The basic idea of ABFT is to analyze the algorithm structure in order to find an efficient and clever way of strengthening it. Typically, the ABFT-hardened algorithm is based on the coding of input data. The algorithm is then redesigned so as to allow it to work with the coded data, and the computation steps are distributed among different calculating units, when available, in order to reduce the occurrence of multiple errors. Finally, after computation, the output data are checked taking advantage of the coding introduced in the inputs.

In the case of matrix multiplication, a clever ABFT strategy

has been proposed in [10], and it is based on the result-checking approach introduced by Freivalds in [14]. Input matrices A and B are coded before computation, adding column and row check-sum vectors (Ac and Br in Fig. 5) following (2) and (3), respectively:

Ac[j] = sum_{i=1..n} A[i,j],   j = 1, ..., n   (2)

Br[i] = sum_{j=1..n} B[i,j],   i = 1, ..., n   (3)

The input matrices A and B then have an additional row and column, respectively. The jth element of the added row of A contains the sum of all the elements in the jth column of A, and the ith element of the added column of B contains the sum of all the elements in the ith row of B.
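As an illustration, the input coding of (2) and (3) might be implemented with one thread per column or row, along the lines of the following CUDA sketch (ours, not the authors' code; the storage layout with the check-sum row/column appended and the kernel names are assumptions):

// A is stored row-major as an (n+1) x n matrix: row n holds the column check-sums.
__global__ void encodeColChecksums(float *A, int n)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per column of A
    if (j < n) {
        float s = 0.0f;
        for (int i = 0; i < n; ++i)
            s += A[i * n + j];
        A[n * n + j] = s;            // Ac[j], Eq. (2)
    }
}

// B is stored row-major as an n x (n+1) matrix: column n holds the row check-sums.
__global__ void encodeRowChecksums(float *B, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per row of B
    if (i < n) {
        float s = 0.0f;
        for (int j = 0; j < n; ++j)
            s += B[i * (n + 1) + j];
        B[i * (n + 1) + n] = s;      // Br[i], Eq. (3)
    }
}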

Fig. 5. Input coding and single error correction on matrix multiplication hardened with the ABFT introduced in [10]. The result of the multiplication of the column check-sum matrix A and the row check-sum matrix B is a fully check-sum matrix M. Single output errors can be corrected using either the row or the column check-sum information of M.

The result of the multiplication of the A column check-sum matrix and the B row check-sum matrix is a fully check-sum matrix M (Fig. 5), where the (n+1)th row and the (n+1)th column contain the column (Mc) and row (Mr) check-sum vectors of M, respectively [10]. This information can be adequately used to detect and correct errors that occur during the matrix multiplication computation. When the multiplication is finished, the M column and row check-sum vectors are recalculated by summing the first n columns and rows of M. The check-sum vectors resulting from the multiplication (Mc and Mr) and those calculated after computation (Mc' and Mr') are then compared. When a mismatch is detected between Mr[i] and Mr'[i], it means that at least one error is present in the ith row of M, and respectively for columns (Fig. 5).

If M[i,j] is identified as the only error in M, it can be corrected using either the row or the column check-sum vectors, following (4) or (5) [10]:

M[i,j] = M[i,j] + (Mr[i] - Mr'[i])   (4)

M[i,j] = M[i,j] + (Mc[j] - Mc'[j])   (5)

On a GPU, row or column check-sums can be calculated in linear time by instantiating a thread for each row and column to be summed. For the same reason, just constant time is required to detect faulty rows and columns and, possibly, to correct the error. ABFT is very efficient in detecting errors but requires the recomputation of the matrix multiplication when multiple errors occur. Unfortunately, as the experimental results highlight that in the majority of the cases the output matrix is affected by multiple errors when executed on a GPU, the available ABFT may become inefficient.

C. ExtABFT

We have designed an optimized ABFT hardening strategy for matrix multiplication on GPUs, named extABFT, which is an extension of the ABFT capable of correcting the experimentally observed multiple output error distributions and of efficiently detecting when recomputation is strictly necessary. Experimental results (Fig. 3) show that there are three output error distributions: single errors, errors on a row/column, and randomly distributed errors.

The error distribution can be easily distinguished by counting the number of mismatches between the M row and column check-sums resulting from the multiplication (Mc and Mr) and those calculated after computation (Mc' and Mr'). The Errors_Count procedure, shown in Fig. 6, counts the number of rows (#Err_row) and columns (#Err_col) of M that contain at least one error and keeps track of them in the vectors Faulty_rows and Faulty_cols.

Fig. 6. Pseudocode of the procedure that counts and labels rows and columns of M that contain at least one error.

Fig. 7. Pseudocode of the procedure that corrects a single error.
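The pseudocode of Fig. 6 is not reproduced here; a possible host-side rendition of Errors_Count, using the names introduced in the text (Mr2 and Mc2 stand for Mr' and Mc'), is the following sketch of ours. Exact check-sum comparison assumes exact arithmetic; a tolerance would be needed with floating point:

#include <vector>

// Counts the rows and columns of the n x n result whose check-sums from the
// multiplication (Mr, Mc) disagree with those recalculated afterwards (Mr2, Mc2),
// and records their indices in Faulty_rows and Faulty_cols.
void Errors_Count(const float *Mr, const float *Mr2,
                  const float *Mc, const float *Mc2, int n,
                  std::vector<int> &Faulty_rows, std::vector<int> &Faulty_cols,
                  int &Err_row, int &Err_col)
{
    Err_row = Err_col = 0;
    Faulty_rows.clear(); Faulty_cols.clear();
    for (int i = 0; i < n; ++i)
        if (Mr[i] != Mr2[i]) { Faulty_rows.push_back(i); ++Err_row; }
    for (int j = 0; j < n; ++j)
        if (Mc[j] != Mc2[j]) { Faulty_cols.push_back(j); ++Err_col; }
}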

It is worth noting that, even if never experimentally observed, the check-sum calculations may be corrupted by radiation. In this unlikely situation, there would be one or more incorrect elements in the (n+1)th row of A (the same consideration applies to the (n+1)th column of B). This will lead the entire (n+1)th row of matrix M, and just that row, to be corrupted. The (n+1)th row of M contains Mc; thus, Errors_Count will classify all the columns of M as faulty but all the rows as correct. The procedure will then identify the anomaly as a check-sum error, and the matrix M can be considered correct. Similarly, an error in the Mc' or Mr' calculations will be identified as a corrupted column without the corresponding corrupted row, and vice versa.

When both #Err_row and #Err_col equal 1, just one location of M is corrupted. The error is located at the intersection of the only row and the only column of M detected as faulty. In all the situations in which a single error is detected, the already developed ABFT for matrix operations can be straightforwardly applied to correct it. The localization and correction of a single error can be done using the Single_Error_Correction procedure described in Fig. 7.
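A host-side sketch of Single_Error_Correction consistent with (4) could be the following (ours; Fig. 7 is not reproduced, and the (n+1)-stride layout is an assumption):

#include <vector>

// Corrects the single error at the intersection of the only faulty row i and
// the only faulty column j by adding the row check-sum difference (Eq. (4)).
// M is the fully check-sum matrix, stored row-major with stride n + 1.
void Single_Error_Correction(float *M, int n,
                             const std::vector<int> &Faulty_rows,
                             const std::vector<int> &Faulty_cols,
                             const float *Mr, const float *Mr2)
{
    int i = Faulty_rows[0];
    int j = Faulty_cols[0];
    M[i * (n + 1) + j] += Mr[i] - Mr2[i];   // Eq. (5) with Mc would work equally
}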

just an error on the th row or th column ofM, respectively. Un-fortunately, the row and column check-sum vectors informationis limited to the identification of rows and columns that containsat least one error but cannot give information about the numberof errors that affect them. Thus, erroneous locations cannot beuniquely identified if more rows and columns are detected asfaulty.When multiple errors are distributed on a single row or

column one, and just one, between #Err_row and #Err_col

Fig. 8. One single mismatch is found on row check-sum vectors (Mr[i]and Mr’[i] in the figure), and various mismatches are detected between thecolumn check-sum vectors. In this case, errors are located on the th row ofM (black squares in the figure) and can be efficiently corrected using (5) for

, and .

Fig. 9. Pseudocode of the procedure that corrects multiple errors on a row orcolumn.

If errors are distributed on a single row or column, the check-sum information is still sufficient to univocally detect, and thus correct, them. In fact, (5) [or (4)] can be used to correct any M[i,j] if it is the only corrupted element in the jth column (ith row) of M. Let us consider, as in Fig. 8, without loss of generality, that multiple errors are detected on a single row. Errors_Count will detect just one mismatch when comparing the row check-sum vectors Mr and Mr' (position i in Fig. 8). On the contrary, various errors will be detected when comparing Mc and Mc' (j1, j2, and j3 in Fig. 8). When this situation occurs, one is sure that one and only one error per identified faulty column is present in M, and this error is located on the ith row. Column check-sums can then be used to correct M[i,j1], M[i,j2], and M[i,j3] through (5). The same considerations can be applied when just one column and various rows are identified as faulty. Fig. 9 shows the pseudocode of the procedure designed for correcting errors on a single row or column.
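A sketch of the row case of this procedure follows (ours, not the pseudocode of Fig. 9; the column case is symmetric):

#include <vector>

// Multiple errors on a single row i, one per faulty column: each error is the
// only one in its column, so Eq. (5) corrects all of them independently.
void Row_Err_Correction(float *M, int n, int i,
                        const std::vector<int> &Faulty_cols,
                        const float *Mc, const float *Mc2)
{
    for (int j : Faulty_cols)
        M[i * (n + 1) + j] += Mc[j] - Mc2[j];   // Eq. (5)
}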

As can be seen in Fig. 3, the probability of occurrence of multiple errors on a row or column is far from negligible. In fact, almost 50% of the faulty computations present errors on a row or column. With the procedure presented above, the efficiency of the ABFT applied to matrix multiplication is increased, avoiding several recomputations.

Fig. 10. Various mismatches are found on both the row and the column check-sum vectors. In this case, neutron-induced errors (black squares in the figure) cannot be unequivocally identified, as with the information provided by the check-sums we can only say that any of the 12 marked positions in M is potentially erroneous.

The last situation that may occur is having more than one mismatch on the row check-sums (#Err_row > 1) and more than one mismatch on the column check-sums (#Err_col > 1), as depicted in Fig. 10. It is easy to see that, in the particular case of Fig. 10, all four radiation-induced errors in M (represented with black squares) could be corrected through the column check-sums using (5), as every column has one and only one error, while (4) would not succeed in correcting the errors in the row that contains two of them. The check-sum information provided by Errors_Count is

not sufficient to uniquely identify those errors, but only to state that at least one error affects each of the three faulty rows and that at least one error affects each of the four faulty columns. Thus, Errors_Count only indicates that any of the 12 marked elements in Fig. 10 is potentially erroneous.

The proposed hardening algorithm tries to correct all the

elements M[i,j] at the intersections of the rows i for which Mr[i] differs from Mr'[i] (stored in Faulty_rows) and the columns j for which Mc[j] differs from Mc'[j] (stored in Faulty_cols), using (5) [the same considerations can be applied to (4)]. It is worth noticing that M[i,j] may not be faulty but identified as a potential error because there is at least another faulty location in row i and in column j. The execution of (5) succeeds in correcting M[i,j] if and only if M[i,j] is effectively an error and it is the only error on the column. In fact, if M[i,j] is not an error, (5) will modify a correct value or, if M[i,j] is an error and there are other faulty locations on column j, (5) will apply the wrong correction factor. In both cases, the faulty situation is not solved. The row check-sum can be used to check if M[i,j] has been effectively corrected after the application of (5): Mr'[i] is recalculated with the new value of M[i,j]. If Mr'[i] still differs from Mr[i], either (5) failed in correcting M[i,j] or at least another error is present on row i. In both cases, the information provided by the check-sums is not sufficient to further analyze M[i,j]. The algorithm restores M[i,j] as it was before the application of (5) and tries to correct the next potential error. On the contrary, if Mr'[i] equals Mr[i], there was effectively an error in M[i,j], and the algorithm succeeded in correcting it. As we have said, this also means that M[i,j] was the only faulty location in row i and in column j. Since row i and column j no longer contain any error, #Err_row and #Err_col have to be updated. If #Err_row and/or #Err_col equal one, we know that just one error remains in M or that the remaining errors are located on a single row or column. In these cases, the proper procedure is executed. Otherwise, the algorithm tries to correct the next error.

Fig. 11. Pseudocode of the procedure that tries to correct randomly distributed multiple errors.

Fig. 11 summarizes the steps to perform in order to try to correct randomly distributed errors, showing the pseudocode of the procedure Random_Err_Correction. When all the neutron-induced errors are located in rows or columns that contain at least another error, Random_Err_Correction does not succeed in correcting any of the faulty locations. In this case, the only solution is the recomputation of the potentially faulty elements of M. In all the other situations,

Random_Err_Correction succeeds in correcting the errors.
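Fig. 11 is not reproduced here; a compact host-side sketch of the attempt-and-verify loop just described could be the following (ours; for brevity it folds the single-error and row/column cases into the same loop, and exact comparisons again assume exact arithmetic):

#include <iterator>
#include <vector>

// Tries to correct randomly distributed errors: every intersection of a faulty
// row and a faulty column is a candidate M[i,j]. Eq. (5) is applied tentatively,
// and the row check-sum decides whether the correction is kept or undone.
bool Random_Err_Correction(float *M, int n,
                           std::vector<int> &Faulty_rows,
                           std::vector<int> &Faulty_cols,
                           const float *Mr, float *Mr2,
                           const float *Mc, float *Mc2)
{
    bool progress = true;
    while (progress && !Faulty_rows.empty() && !Faulty_cols.empty()) {
        progress = false;
        for (auto it = Faulty_rows.begin(); it != Faulty_rows.end(); ) {
            int i = *it;
            bool rowFixed = false;
            for (auto jt = Faulty_cols.begin(); jt != Faulty_cols.end(); ++jt) {
                int j = *jt;
                float old = M[i * (n + 1) + j];
                M[i * (n + 1) + j] += Mc[j] - Mc2[j];       // tentative Eq. (5)
                Mr2[i] += M[i * (n + 1) + j] - old;         // refresh row check-sum
                if (Mr2[i] == Mr[i]) {                      // correction verified
                    Mc2[j] = Mc[j];                         // column j is now clean
                    Faulty_cols.erase(jt);
                    rowFixed = progress = true;
                    break;
                }
                Mr2[i] -= M[i * (n + 1) + j] - old;         // undo check-sum update
                M[i * (n + 1) + j] = old;                   // undo: not the real error
            }
            it = rowFixed ? Faulty_rows.erase(it) : std::next(it);
        }
    }
    // false: the leftover locations must be recomputed (not the whole matrix)
    return Faulty_rows.empty() && Faulty_cols.empty();
}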

Experimental results (Fig. 4) state that 80% of the cases in which randomly distributed errors affect matrix M consist of double errors. The two errors are not on the same row (column); otherwise, they would have been detected as errors on a single row (column). Then, the two rows and the two columns detected as faulty contain one and only one error each, and thus Random_Err_Correction corrects them. Even when 3 or 4 randomly distributed errors occur, the probability of


Random_Err_Correction failing to correct the errors is extremely low.

It is worth noticing that we never observed errors that could not be corrected with the proposed strategy, even after several hours of irradiation. If such a situation occurred, Faulty_rows and Faulty_cols would hold the output matrix locations that Random_Err_Correction did not correct, so only these locations would be recomputed, not the entire matrix.

V. HARDENING STRATEGIES PERFORMANCE EVALUATION

The efficiency and performance of the implemented strategies were evaluated through a fault-injection simulator. Fault injection was performed inside the GPU with a dedicated thread in charge of modifying the GPU resources used during the computation of M. Errors were injected with probabilities and distributions (single error, errors in a row or in a column, random errors) that directly derive from the experimental results.

ABFT always succeeded in correcting single errors, but when multiple errors occurred, recomputation was necessary, and this happened in most of the cases (Fig. 3). TMR and the extABFT algorithm that we propose here succeed in correcting all the injected errors, even if randomly distributed. Since fault injection was tuned with experimentally obtained probabilistic functions, the simulation results are consistent and prove that the applied technique actually increases the reliability of the matrix multiplication.
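The fault injector itself is not listed in the paper; a simplified host-side sketch that reproduces the observed pattern probabilities could look as follows (ours; the bit-flip helper, the 32-bit float assumption, and the rounded percentages approximated from Figs. 3 and 4 are our assumptions):

#include <random>

// Corrupts the result matrix according to the observed error distribution:
// roughly 43% single errors, 49% multiple errors on one row (or, symmetrically,
// one column), and 8% randomly distributed errors (approximate values, Fig. 3).
void injectFaults(float *M, int n, std::mt19937 &rng)
{
    std::uniform_real_distribution<float> u(0.0f, 1.0f);
    std::uniform_int_distribution<int> idx(0, n - 1);
    auto flipBit = [&](int i, int j) {               // flip one random mantissa bit
        unsigned *w = reinterpret_cast<unsigned *>(&M[i * n + j]);
        *w ^= 1u << (rng() % 23);                    // assumes 32-bit float/unsigned
    };
    float p = u(rng);
    if (p < 0.43f) {                                 // single output error
        flipBit(idx(rng), idx(rng));
    } else if (p < 0.92f) {                          // several errors on one row
        int i = idx(rng);
        for (int k = 0; k < 2 + (int)(rng() % 4); ++k)
            flipBit(i, idx(rng));
    } else {                                         // randomly distributed errors
        for (int k = 0; k < 2 + (int)(rng() % 3); ++k)
            flipBit(idx(rng), idx(rng));
    }
}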

Fault-injection simulations were used to evaluate the time needed for detection and correction for each of the presented hardening strategies. Fig. 12 shows the simulation results for different matrix dimensions. As can be seen in Fig. 12(a), for small matrices, TMR requires almost the same computational time as the Plain version. ABFT and extABFT are definitely worse, since, for small matrices, the time required for error detection/correction is similar to the time needed to compute the matrix multiplication. For matrices bigger than 256 × 256, TMR performance is compromised, as it instantiates a number of threads that exceeds the maximum number of threads the GPU can effectively execute in parallel. ExtABFT performs better than ABFT only for matrices bigger than 256 × 256, meaning that, for small matrices, recomputation is more efficient than multiple-error correction. As shown in Fig. 12, for big matrices, the extABFT is always better than the other strategies. In particular, for 2048 × 2048 matrices, the extABFT execution time is less than half of the TMR one and about 65% of the ABFT one.

The performance of the hardening techniques can be further

analyzed by evaluating their computational cost in terms of the number of basic operations executed. To give an approximation of each technique's resource utilization and power consumption, we assumed the cost of addition and comparison operations to be unitary and the multiplication cost to be four times larger.

The Plain matrix multiplication algorithm requires n sums and n multiplications for the computation of every element in M (1). Each of the n² instantiated threads calculates one element in M, with a computational cost of n + 4n = 5n. Thus, on a GPU, the overall cost of the Plain matrix multiplication is 5n³.

The computational cost of TMR is three times the Plain version one, plus the voting (two comparisons for each element of M): 15n³ + 2n².

Fig. 12. (a), (b) Computational time for the different matrix multiplication algorithms. Errors were injected with experimentally obtained probabilities and distributions.

ABFT requires the coding of the A and B matrices, which has a cost of 2n², as we need to perform n sums for every column of A and row of B. The ABFT multiplication costs 5n(n+1)², as A has n+1 rows and B has n+1 columns. The calculation of the M row and column check-sums and their comparison for error detection require 2(n+1)² operations, so the total cost of ABFT when no error occurs is 2n² + 5n(n+1)² + 2(n+1)². Single errors are corrected with two operations [(4) or (5)], while when multiple errors occur, the overall cost is doubled because of the recomputation.

The proposed extABFT has the same cost as ABFT for input

coding, multiplication, error detection, and single-error correction. When multiple errors are on a row or column, each of them is corrected with two operations [(4) or (5)]; thus, in the worst and unlikely case of having a whole row or column corrupted, the correction cost is 2n. When random errors occur, extABFT tries to correct each potentially faulty location with two operations [(4) or (5)] and checks if the correction succeeded with n+1 operations (n to recalculate the check-sum and one to compare it with the check-sum resulting from the multiplication). Therefore, n+3 operations are needed for every faulty location. The number of faulty locations is the product of the number of rows and columns detected as faulty (Fig. 4).
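As a worked instance of this cost model (our reconstruction of the formulas above; the constants follow the stated unit costs), the following snippet evaluates the no-error operation counts for n = 2048:

#include <cstdio>

// Operation counts for n = 2048 under the stated cost model
// (addition/comparison = 1 operation, multiplication = 4 operations).
int main()
{
    double n = 2048.0;
    double plain = 5.0 * n * n * n;                          // 5n^3
    double tmr   = 3.0 * plain + 2.0 * n * n;                // 15n^3 + 2n^2
    double abft  = 2.0 * n * n                               // input coding
                 + 5.0 * n * (n + 1.0) * (n + 1.0)           // check-summed multiply
                 + 2.0 * (n + 1.0) * (n + 1.0);              // detection
    double extabft = abft + 2.0 * n;                         // worst-case row fix
    std::printf("plain %.3e, TMR %.3e, ABFT %.3e, extABFT %.3e\n",
                plain, tmr, abft, extabft);
    return 0;
}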


Fig. 13. Hardening strategies computational costs.

Fig. 14. Hardening strategies computational costs when 2048 × 2048 matrices are multiplied, compared to the Plain version (dashed line).

As reported in Fig. 13, the computational costs of all the hardening strategies are asymptotic with n³, as the Plain version, but different overheads are introduced. Fig. 14 shows the computational cost for 2048 × 2048 matrix multiplications, like the ones tested during the radiation experiment campaign. TMR has the highest cost, more than three times higher than the Plain one (dashed line in Fig. 14). ABFT is efficient when a single error occurs but becomes costly, because of the recomputation, if multiple errors are detected. The proposed extABFT cost is comparable with the Plain one for any of the experimentally observed error distributions.

The proposed strategy was designed starting from an already available efficient technique and extended taking advantage of experimental results. Thanks to both the analytical study and the tests, the obtained technique is very efficient and optimized. As the proposed hardening technique is intended to be optimized and tuned, it is not a generalized solution that can be applied to any algorithm. Nevertheless, the adopted approach, which includes the analytical study of the algorithm, experimental tests, the analysis of the results and of the architecture, and the design of dedicated correcting procedures, can be fruitfully extended to other applications.

VI. CONCLUSION

Experimental results show that neutrons cause different output error patterns on matrix multiplication executed on a GPU. We have implemented on the GPU two known hardening strategies (i.e., TMR and ABFT) and designed an extension of the ABFT technique for matrix multiplication able to correct the experimentally observed multiple error distributions.

We have demonstrated that the proposed extABFT technique is more efficient than TMR and ABFT for matrices bigger than 256 × 256, and its computational cost is close to the Plain version even when multiple errors occur.

REFERENCES

[1] J. D. Owens, M. Houston, D. Luebke, S. Green, J. E. Stone, and J. C. Phillips, "GPU computing," Proc. IEEE, vol. 96, no. 5, pp. 879–899, May 2008.

[2] J. Kruger and R. Westermann, "Linear algebra operators for GPU implementation of numerical algorithms," ACM Trans. Graph., vol. 22, no. 3, pp. 908–916, Jul. 2003.

[3] C. Slayman and O. A. La Carte, "Soft errors - past history and recent discoveries," in Proc. IEEE Int. Integr. Reliab. Workshop (IIRW), 2010, pp. 25–30, invited paper.

[4] A. Dixit and A. Wood, “The impact of new technology on soft errorrates,” in Proc. IEEE Int. Reliab. Phys. Symp. (IRPS), 2011, pp.5B.4.1–5B.4.7.

[5] E. Normand, “Single event upset at ground level,” IEEE Trans. Nucl.Sci., vol. 43, no. 6, pp. 2742–2750, Dec. 1996.

[6] P. Rech, C. Aguiar, R. Ferreira, M. Silvestri, A. Griffoni, C. Frost, and L. Carro, "Neutron-induced soft errors in graphic processing units," presented at the IEEE Radiat. Effects Data Workshop (REDW), Miami, FL, USA, Jul. 2012.

[7] J. Kruger and R. Westermann, "Linear algebra operators for GPU implementation of numerical algorithms," ACM Trans. Graph., vol. 22, no. 3, pp. 908–916, Jul. 2003.

[8] J. Liepe, C. Barnes, E. Cule, K. Erguler, P. Kirk, T. Toni, and M. P. H. Stumpf, "ABC-SysBio - approximate Bayesian computation in Python with GPU support," Bioinformatics, vol. 26, no. 14, pp. 1797–1799, 2010.

[9] M. A. Breuer, S. K. Gupta, and T. M. Mak, "Defect and error tolerance in the presence of massive numbers of defects," IEEE Des. Test Comput., vol. 21, no. 3, pp. 216–227, 2004.

[10] K. H. Huang and J. A. Abraham, "Algorithm-based fault tolerance for matrix operations," IEEE Trans. Comput., vol. C-33, no. 6, pp. 518–528, Jun. 1984.

[11] M. Violante, L. Sterpone, A. Manuzzato, S. Gerardin, P. Rech, M. Bagatin, A. Paccagnella, C. Andreani, G. Gorini, A. Pietropaolo, G. Cardarilli, S. Pontarelli, and C. Frost, "A new hardware/software platform and a new 1/E neutron source for soft error studies: Testing FPGAs at the ISIS facility," IEEE Trans. Nucl. Sci., vol. 54, no. 4, pp. 1184–1189, Aug. 2007.

[12] D. B. Kirk and W. W. Hwu, Programming Massively Parallel Processors. San Mateo, CA, USA: Morgan Kaufmann, 2010.

[13] M. Nicolaidis, Ed., Soft Errors in Modern Electronic Systems. New York, NY, USA: Springer, 2011.

[14] R. Freivalds, "Fast probabilistic algorithms," in Proc. Math. Found. Comput. Sci. (MFCS), Lecture Notes in Computer Science, vol. 74, 1979, pp. 57–69.