Software-Based Hardening Strategies for Neutron Sensitive FFT Algorithms on GPUs

Download Software-Based Hardening Strategies for Neutron Sensitive FFT Algorithms on GPUs

Post on 03-Feb-2017

212 views

Category:

Documents

0 download

Embed Size (px)

TRANSCRIPT

<ul><li><p>1874 IEEE TRANSACTIONS ON NUCLEAR SCIENCE, VOL. 61, NO. 4, AUGUST 2014</p><p>Software-Based Hardening Strategies for NeutronSensitive FFT Algorithms on GPUs</p><p>L. L. Pilla, P. Rech, F. Silvestri, C. Frost, P. O. A. Navaux, M. Sonza Reorda, and L. Carro</p><p>AbstractIn this paper we assess the neutron sensitivity ofGraphics Processing Units (GPUs) when executing a Fast FourierTransform (FFT) algorithm, and propose specific software-basedhardening strategies to reduce its failure rate. Our research ismotivated by experimental results with an unhardened FFT thatdemonstrate a majority of multiple errors in the output in the caseof failures, which are caused by data dependencies. In addition,the use of the built-in error-correction code (ECC) showed a largeoverhead, and proved to be insufficient to provide high reliability.Experimental results with the hardened algorithm show a twoorders of magnitude failure rate improvement over the originalalgorithm (one order of magnitude over ECC) and an overhead64% smaller than ECC.</p><p>Index TermsError correction code (ECC), fast fourier trans-form (FFT), graphics processing unit (GPU), neutron sensitivity,software-based hardening strategies.</p><p>I. INTRODUCTION</p><p>T HE Fast Fourier Transform (FFT) is one of the most rep-resentative algorithms in high performance computing.FFT algorithms are used in several applications such as signalprocessing, vibration and spectrum analysis, speech processing,communication, linear algebra, statistics, 3D reconstruction,and stock options pricing determination [1], [2]. In addition tothis pervasiveness, the FFT algorithm is also highly parallel,which makes it a suitable candidate for acceleration in GraphicsProcessing Units (GPUs).Nowadays, every desktop computer, laptop, or portable de-</p><p>vice includes at least one GPU, mainly used as a support for theCentral Processing Unit (CPU) to accelerate graphics rendering.Due to their highly parallel structure, GPUs are more effectivethan general-purpose CPUs when large blocks of data need tobe processed in parallel, and have recently become popular forhigh performance computing. For instance, TITAN, the second</p><p>Manuscript received September 26, 2013; revised December 09, 2013; ac-cepted January 15, 2014. Date of publication March 07, 2014; date of currentversion August 14, 2014.L. L. Pilla, P. Rech, P. O. A. Navaux, and L. Carro are with the Instituto</p><p>de Informatica, Federal University of Rio Grande do Sul (UFRGS), PortoAlegre, RS91509-900, Brazil (e-mail: llpilla@inf.ufrgs.br; prech@inf.ufrgs.br;navaux@inf.ufrgs.br; carro@inf.ufrgs.br).F. Silvestri is with the Dipartimento di Ingegneria dellInformazione, Uni-</p><p>versity of Padova, Padova 35100, Italy (e-mail: silvest1@dei.unipd.it).C. Frost is with STFC, RutherfordAppleton Laboratories, Didcot OX11 0XQ,</p><p>U.K. (e-mail: christopher.frost@stfc.ac.uk).M. Sonza Reorda is with the Dipartimento di Automatica e Informatica, Po-</p><p>litecnico di Torino, Torino 10129, Italy (e-mail: matteo.sonzareorda@polito.it).Digital Object Identifier 10.1109/TNS.2014.2301768</p><p>most powerful of supercomputers [3] in November 2013, in-cludes 18,000 GPUs to enhance its performance.As we have demonstrated in previous works, radiation-in-</p><p>duced errors, including those generated by the terrestrial neu-tron radiation environment at ground level, are one of the majorissues for the newest GPU cores reliability [4], [5]. Still, thestate of the art lacks studies describing radiation test methodsfor extremely parallel systems and parallel algorithm behavioranalysis in radiation environments. Motivated by these facts, weinvestigate the behavior of a parallel Fast Fourier Transform al-gorithm executed on a GPU irradiated with neutrons.The results of extensive radiation test campaigns attest that</p><p>the FFT algorithm experiences a very high error rate and, in themajority of the cases, its output is affected by multiple errors.As we demonstrate in this paper, based on algorithmic and archi-tectural analysis, multiple errors occur mainly because the FFTcomputation requires sequential iterations which are dependenton data from previous iterations. If in a given iteration a threadis corrupted by radiation, the error is then likely to spread overthe following iterations leading to multiple output errors. Ad-ditionally, we experimentally demonstrate that the error-correc-tion code (ECC) available in the latest GPUs is not sufficient toensure by itself a high reliability as it address only L1 and L2cache memories and registers, leaving logic and scheduling re-sources unprotected.In this scenario, we propose a dedicated software-based</p><p>hardening strategy for the FFT algorithm executed on a GPUbased on the Algorithm-Based Fault Tolerance (ABFT) philos-ophy [6]. The resulting algorithm, named ABFT-FFT, uses ahardening strategy designed to profit from the FFT propertiesdemonstrated by Jou and Abraham [7], and applies them inthe GPU algorithm to detect and correct faulty executions. Theresults of our experimental evaluation with ABFT-FFT underan accelerated neutron beam show a two orders of magnitudefailure rate reduction over the original FFT algorithm with anincrease in execution time of only 18%. We also extend theproposed ABFT approach to achieve prompt error detection andprevent errors propagation, and demonstrate its applicabilitywith fault-injection experiments.The remaining sections of this paper are organized as fol-</p><p>lows: the experimental methodology, including the tested de-vices and algorithms, is detailed in Section II. An evaluation ofthe original FFT algorithm and the use of ECC is discussed inSection III. Our hardening strategies and their evaluations arepresented in Section IV. Concluding remarks and future workare drawn in Section V.</p><p>0018-9499 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.</p></li><li><p>PILLA et al.: SOFTWARE-BASED HARDENING STRATEGIES FOR NEUTRON SENSITIVE FFT ALGORITHMS ON GPUS 1875</p><p>II. EXPERIMENTAL METHODOLOGY</p><p>A. Neutron BeamRadiation experiments were performed at the ISIS facility in</p><p>the Rutherford Appleton Laboratories (RAL) in Didcot, UK [8].The available neutron flux was of about n/cm s. Thebeam was focused on a spot with a diameter of 2 cm plus 1 cmof penumbra, which is enough to fully and homogeneously ir-radiate the GPU chip without directly affecting nearby powercontrol circuitry and DDR memories on the board. Neverthe-less, even if the beam is collimated, scattering neutrons maystill wander from the beam spot. Thus, to ensure that the DDRcontent was consistent during our experiments, we periodicallychecked it during experiments, but no error has been observed.It is worth noting that input and output data were stored in the</p><p>GPUs DDR memory, and no L1 cache memory was employed,so the errors reported in the following sections were only causedby the corruption of the GPU core logic resources or internalflip-flops. Irradiation was performed at room temperature withnormal incidence, nominal voltages and frequency of operation.</p><p>B. Tested DevicesWe tested commercial-off-the-shelf Tesla C2050 GPUs de-</p><p>signed by NVIDIA and manufactured in a 40 nm technologynode. The C2050 model includes 14 Streaming Multiproces-sors (SMs), each of which is divided into 32 Compute UnifiedDevice Architecture (CUDA) cores [9]. In the C2050 GPU, 14blocks of threads can be executed in parallel with a maximumof 32 threads in each block for a total of 448 threads. If morethreads or blocks are instantiated, their execution will be de-layed until they can be scheduled.NVIDIA provides the newest GPUs, such as the C2050</p><p>family, with an internal ECC mechanism that can be activatedby the user. This mechanism was first disabled for the eval-uation of the FFT sensitivity to neutrons, and later enabled.A discussion on the efficiency and drawbacks of the NVIDIAECC mechanism takes place in Section III-B.It is important to emphasize that the input vectors, as well as</p><p>the results of computation, are stored in the GPU board DDR,which was not irradiated. On a realistic application, a highernumber of blocks to be processed may extend the exposure timeof input or output data, increasing the probability of having themcorrupted by neutron-induced Single Event Upsets or SingleEvent Functional Interruptions. However, DDR sensitivity hasbeen proved to decrease with the shrinking of technology nodes[10], and modern DDR memories are provided with efficientECCs that increase in several orders of magnitudes the devicereliability [11]. It is then reasonable to consider negligible, evenon a real application, the probability for GPU input or outputvectors to be corrupted.</p><p>C. Tested FFT AlgorithmWe tested under radiation a benchmark that implements1D-FFTs of 64-point each. The FFT input is composed of</p><p>a double precision floating-point matrix for thereal part and a matrix for the imaginary part.We choose to test relatively small FFTs (64-point) to limit thenumber of iterations and ease the study of error propagation,</p><p>Fig. 1. A basic butterfly module used to update two-by-two all the 64 elementscomposing the FFT.</p><p>while having 1D-FFTs eases the gathering of a sta-tistically significant amount of errors.The implemented algorithm is based on the Fast Fourier</p><p>Transform kernel of the NAS Parallel Benchmarks [12],named FT, implemented in C and ported to the GPU archi-tecture using the NVIDIA CUDA programming model. Each64-point 1D-FFT kernel is composed of 6 sequential iterations( ) of a variant of the Stockham FFT algorithm [13].Threads update the values of complex floating-point elements</p><p>of the matrix in pairs using their previous values as input. Thisbehavior mimics a butterfly module [7], which is illustrated inFig. 1. This process involves reading four floating-point valuesto registers (two real and two imaginary values), using themonce to compute their updated values, and writing them inmemory. Due to the amount of data being processed in parallel,the FFT is not able to benefit from a caching mechanism.The execution of the FFT algorithm in the GPU requires the</p><p>instantiation of parallel threads. They are grouped inblocks of 512 threads each. In this scenario, a thread is in chargeof evaluating the FFT values on its assigned complex vector ofsize 64.</p><p>III. EXPERIMENTAL RESULTS AND DISCUSSIONThis section presents the experimental results obtained with</p><p>the plain FFT algorithm (i.e. the algorithm without any hard-ening solution applied, as described in the previous section) aswell as with the use of ECC.</p><p>A. Sensitivity of the Original FFTTable I reports the experimentally measured cross section and</p><p>failures in time (FIT, or failures in hours) for the testedFFT code under the neutron beam discussed in Section II-A.The cross section is obtained dividing the number of faulty FFTexecutions per unit time by the flux. As the ISIS neutron spec-trum of energies is suitable to mimic the terrestrial one, a FITcan be calculated multiplying the experimentally obtained crosssection by the expected natural neutron flux at sea level (aboutn/cm h [14]). The values reported in Table I confirm that</p><p>GPUs are extremely prone to be corrupted by neutrons.The tested FFT algorithm produces complex double-preci-</p><p>sion floating-point data as output, which can be split into realand imaginary parts. An execution is considered as faulty if at</p></li><li><p>1876 IEEE TRANSACTIONS ON NUCLEAR SCIENCE, VOL. 61, NO. 4, AUGUST 2014</p><p>TABLE I64-POINT FFTS CROSS SECTION AND FIT AT NYC</p><p>TABLE IIREAL AND IMAGINARY PART ERRORS ON 64-POINT FFTS</p><p>Fig. 2. Graphical representation of the FFT algorithm under radiation. At eachiteration, a thread updates two-by-two all the 64 values of the FFT using thebasic butterfly module. Six iterations are necessary to complete the execution.If an operation in one iteration is corrupted by radiation, two (or more) valueswill be wrongly updated, and the number of errors will double in the followingiteration.</p><p>least one difference with respect to the expected value is de-tected in the real, imaginary, or both parts of the output.Table II shows the percentage of faulty executions of the FFT</p><p>algorithm in which the real part or the imaginary part was de-tected as corrupted, as well as the cross section and FIT of eachof these parts. Even if the implemented algorithm is symmetricfor the real and imaginary parts, as illustrated in Fig. 1, some ex-ecutions experienced errors in just one of the parts. These errorsare caused by the corruption of internal registers used by threadsfor storing the intermediate values of the complex number. Asstated in the second column of Table II, more than 90% of theexecutions considered faulty experience errors on both the realand imaginary part of the output. In all other situations, the radi-ation induced error corrupted a resource in a thread and its effectwas confined in the real or imaginary part of the result, withoutpropagating to the other part.The computation steps in the FFT algorithm are not indepen-</p><p>dent, since an iteration uses the output of previously executedones to update the real and imaginary parts of two complex ele-ments (see Fig. 2). It is then very likely that a radiation-inducedevent affecting a thread in the early stages of the FFT executionwill spread, affecting various bits of the output. This fault be-havior is illustrated in Fig. 2, where an error in the first iterationof the FFT corrupts the whole output vector.As a thread is in charge of updating two complex values, a ra-</p><p>diation-induced error that prevents the thread from completingits execution or corrupts the thread input data produces at least</p><p>TABLE IIISINGLE AND MULTIPLE ERRORS ON 64-POINT FFTS</p><p>two output errors. Nevertheless, a single error in a thread can begenerated by the corruption of the internal register that stores thevalue of just one of the two elements to update, or disturbing justone of the operations needed to compute the FFT. A thread canthen complete its execution, allowing the correct calculation ofthe second complex number. Single output errors occur in theFFT only if such single thread error occurs in the last FFT itera-tion. To illustrate that, Table III reports the proportion of singleand multiple errors in each part of the complex values. Singleoutput errors occurred in just 1.63% and 4% of the faulty execu-tions for the real and imaginary parts, respectively. These errorsshould become even less likely with an increase in the numberof points computed by the FFT algorithm, as the number of it-erations increases by the logarithm of the number of points, andso the portion of execution time spent on iterations other thanthe last one also increases.The importance of the occurrence of multiple errors in real-</p><p>istic applications is highlighted in the last column of Table III,in which the FIT values of FFT executions affected by single ormultiple errors in the real and imaginary parts are reported.The experimentally observed multiple e...</p></li></ul>

Recommended

View more >