software-based hardening strategies for neutron sensitive fft algorithms on gpus
Post on 03-Feb-2017
Embed Size (px)
1874 IEEE TRANSACTIONS ON NUCLEAR SCIENCE, VOL. 61, NO. 4, AUGUST 2014
Software-Based Hardening Strategies for NeutronSensitive FFT Algorithms on GPUs
L. L. Pilla, P. Rech, F. Silvestri, C. Frost, P. O. A. Navaux, M. Sonza Reorda, and L. Carro
AbstractIn this paper we assess the neutron sensitivity ofGraphics Processing Units (GPUs) when executing a Fast FourierTransform (FFT) algorithm, and propose specific software-basedhardening strategies to reduce its failure rate. Our research ismotivated by experimental results with an unhardened FFT thatdemonstrate a majority of multiple errors in the output in the caseof failures, which are caused by data dependencies. In addition,the use of the built-in error-correction code (ECC) showed a largeoverhead, and proved to be insufficient to provide high reliability.Experimental results with the hardened algorithm show a twoorders of magnitude failure rate improvement over the originalalgorithm (one order of magnitude over ECC) and an overhead64% smaller than ECC.
Index TermsError correction code (ECC), fast fourier trans-form (FFT), graphics processing unit (GPU), neutron sensitivity,software-based hardening strategies.
T HE Fast Fourier Transform (FFT) is one of the most rep-resentative algorithms in high performance computing.FFT algorithms are used in several applications such as signalprocessing, vibration and spectrum analysis, speech processing,communication, linear algebra, statistics, 3D reconstruction,and stock options pricing determination , . In addition tothis pervasiveness, the FFT algorithm is also highly parallel,which makes it a suitable candidate for acceleration in GraphicsProcessing Units (GPUs).Nowadays, every desktop computer, laptop, or portable de-
vice includes at least one GPU, mainly used as a support for theCentral Processing Unit (CPU) to accelerate graphics rendering.Due to their highly parallel structure, GPUs are more effectivethan general-purpose CPUs when large blocks of data need tobe processed in parallel, and have recently become popular forhigh performance computing. For instance, TITAN, the second
Manuscript received September 26, 2013; revised December 09, 2013; ac-cepted January 15, 2014. Date of publication March 07, 2014; date of currentversion August 14, 2014.L. L. Pilla, P. Rech, P. O. A. Navaux, and L. Carro are with the Instituto
de Informatica, Federal University of Rio Grande do Sul (UFRGS), PortoAlegre, RS91509-900, Brazil (e-mail: firstname.lastname@example.org; email@example.com;firstname.lastname@example.org; email@example.com).F. Silvestri is with the Dipartimento di Ingegneria dellInformazione, Uni-
versity of Padova, Padova 35100, Italy (e-mail: firstname.lastname@example.org).C. Frost is with STFC, RutherfordAppleton Laboratories, Didcot OX11 0XQ,
U.K. (e-mail: email@example.com).M. Sonza Reorda is with the Dipartimento di Automatica e Informatica, Po-
litecnico di Torino, Torino 10129, Italy (e-mail: firstname.lastname@example.org).Digital Object Identifier 10.1109/TNS.2014.2301768
most powerful of supercomputers  in November 2013, in-cludes 18,000 GPUs to enhance its performance.As we have demonstrated in previous works, radiation-in-
duced errors, including those generated by the terrestrial neu-tron radiation environment at ground level, are one of the majorissues for the newest GPU cores reliability , . Still, thestate of the art lacks studies describing radiation test methodsfor extremely parallel systems and parallel algorithm behavioranalysis in radiation environments. Motivated by these facts, weinvestigate the behavior of a parallel Fast Fourier Transform al-gorithm executed on a GPU irradiated with neutrons.The results of extensive radiation test campaigns attest that
the FFT algorithm experiences a very high error rate and, in themajority of the cases, its output is affected by multiple errors.As we demonstrate in this paper, based on algorithmic and archi-tectural analysis, multiple errors occur mainly because the FFTcomputation requires sequential iterations which are dependenton data from previous iterations. If in a given iteration a threadis corrupted by radiation, the error is then likely to spread overthe following iterations leading to multiple output errors. Ad-ditionally, we experimentally demonstrate that the error-correc-tion code (ECC) available in the latest GPUs is not sufficient toensure by itself a high reliability as it address only L1 and L2cache memories and registers, leaving logic and scheduling re-sources unprotected.In this scenario, we propose a dedicated software-based
hardening strategy for the FFT algorithm executed on a GPUbased on the Algorithm-Based Fault Tolerance (ABFT) philos-ophy . The resulting algorithm, named ABFT-FFT, uses ahardening strategy designed to profit from the FFT propertiesdemonstrated by Jou and Abraham , and applies them inthe GPU algorithm to detect and correct faulty executions. Theresults of our experimental evaluation with ABFT-FFT underan accelerated neutron beam show a two orders of magnitudefailure rate reduction over the original FFT algorithm with anincrease in execution time of only 18%. We also extend theproposed ABFT approach to achieve prompt error detection andprevent errors propagation, and demonstrate its applicabilitywith fault-injection experiments.The remaining sections of this paper are organized as fol-
lows: the experimental methodology, including the tested de-vices and algorithms, is detailed in Section II. An evaluation ofthe original FFT algorithm and the use of ECC is discussed inSection III. Our hardening strategies and their evaluations arepresented in Section IV. Concluding remarks and future workare drawn in Section V.
0018-9499 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
PILLA et al.: SOFTWARE-BASED HARDENING STRATEGIES FOR NEUTRON SENSITIVE FFT ALGORITHMS ON GPUS 1875
II. EXPERIMENTAL METHODOLOGY
A. Neutron BeamRadiation experiments were performed at the ISIS facility in
the Rutherford Appleton Laboratories (RAL) in Didcot, UK .The available neutron flux was of about n/cm s. Thebeam was focused on a spot with a diameter of 2 cm plus 1 cmof penumbra, which is enough to fully and homogeneously ir-radiate the GPU chip without directly affecting nearby powercontrol circuitry and DDR memories on the board. Neverthe-less, even if the beam is collimated, scattering neutrons maystill wander from the beam spot. Thus, to ensure that the DDRcontent was consistent during our experiments, we periodicallychecked it during experiments, but no error has been observed.It is worth noting that input and output data were stored in the
GPUs DDR memory, and no L1 cache memory was employed,so the errors reported in the following sections were only causedby the corruption of the GPU core logic resources or internalflip-flops. Irradiation was performed at room temperature withnormal incidence, nominal voltages and frequency of operation.
B. Tested DevicesWe tested commercial-off-the-shelf Tesla C2050 GPUs de-
signed by NVIDIA and manufactured in a 40 nm technologynode. The C2050 model includes 14 Streaming Multiproces-sors (SMs), each of which is divided into 32 Compute UnifiedDevice Architecture (CUDA) cores . In the C2050 GPU, 14blocks of threads can be executed in parallel with a maximumof 32 threads in each block for a total of 448 threads. If morethreads or blocks are instantiated, their execution will be de-layed until they can be scheduled.NVIDIA provides the newest GPUs, such as the C2050
family, with an internal ECC mechanism that can be activatedby the user. This mechanism was first disabled for the eval-uation of the FFT sensitivity to neutrons, and later enabled.A discussion on the efficiency and drawbacks of the NVIDIAECC mechanism takes place in Section III-B.It is important to emphasize that the input vectors, as well as
the results of computation, are stored in the GPU board DDR,which was not irradiated. On a realistic application, a highernumber of blocks to be processed may extend the exposure timeof input or output data, increasing the probability of having themcorrupted by neutron-induced Single Event Upsets or SingleEvent Functional Interruptions. However, DDR sensitivity hasbeen proved to decrease with the shrinking of technology nodes, and modern DDR memories are provided with efficientECCs that increase in several orders of magnitudes the devicereliability . It is then reasonable to consider negligible, evenon a real application, the probability for GPU input or outputvectors to be corrupted.
C. Tested FFT AlgorithmWe tested under radiation a benchmark that implements1D-FFTs of 64-point each. The FFT input is composed of
a double precision floating-point matrix for thereal part and a matrix for the imaginary part.We choose to test relatively small FFTs (64-point) to limit thenumber of iterations and ease the study of error propagation,
Fig. 1. A basic butterfly module used to update two-by-two all the 64 elementscomposing the FFT.
while having 1D-FFTs eases the gathering of a sta-tistically significant amount of errors.The implemented algorithm is based on the Fast Fourier
Transform kernel of the NAS Parallel Benchmarks ,named FT, implemented in C and ported to the GPU archi-tecture using the NVIDIA CUDA programming model. Each64-point 1D-FFT kernel is composed of 6 sequential iterations( ) of a variant of the Stockham FFT algorithm .Threads update the values of complex floating-point elements
of the matrix in pairs using their previous values as input. Thisbehavior mimics a butterfly module , which is illustrated