Neutron Sensitivity and Software Hardening Strategies for Matrix Multiplication and FFT on Graphics Processing Units

June 18th, 2013 – New York City, NY, USA

P. Rech, L. Pilla, F. Silvestri, P. O. Navaux, and Luigi Carro



Page 2: Neutron Sensitivity  and Software Hardening Strategies for  Matrix Multiplication and FFT

Paolo Rech – FTXS 2013, New York City, NY

Outline

Radiation Effects on Graphics Processing Units

Experimental Setup

Matrix Multiplication

- Error Rate at Sea Level

- Hardening Techniques

Fast Fourier Transform

- Error Rate at Sea Level

- Hardening Techniques

Conclusions


Page 4: Neutron Sensitivity  and Software Hardening Strategies for  Matrix Multiplication and FFT

Terrestrial Radiation Environment

Galactic cosmic rays interact with the atmosphere and produce a shower of energetic particles:

- Muons
- Pions
- Protons
- Gamma rays
- Neutrons

Neutron flux at sea level: 13 n/(cm²·h)

Radiation is an issue at sea level!

Page 5: Neutron Sensitivity  and Software Hardening Strategies for  Matrix Multiplication and FFT

GPU Internal Structure

[Figure: GPU block diagram with Streaming Multiprocessors, each holding threads with their registers and a shared memory, connected to DRAM]

- A GPU is an array of Streaming Multiprocessors (SMs)
- The SMs share DRAM
- Each SM executes multiple threads in parallel
- Threads have access to registers and shared memory

Page 6: Neutron Sensitivity  and Software Hardening Strategies for  Matrix Multiplication and FFT

Radiation Effects on a GPU

[Figure: GPU block diagram (threads, registers, shared memory, DRAM) annotated with SEU and SET markers]

- Radiation can corrupt memory resources (Single Event Upset, SEU)...
- ...but also logic (Single Event Transient, SET) and control circuitry: a scheduler failure may have severe repercussions

Page 7: Neutron Sensitivity  and Software Hardening Strategies for  Matrix Multiplication and FFT

Why Radiation Test on GPUs?

- Titan (Oak Ridge National Lab): 18,000 GPUs. High probability of having a GPU corrupted.
- Pedestrian detection* (NVIDIA Tegra): high reliability is required.

*From 2015, Euro NCAP awards its 5-star safety rating only to cars with pedestrian detection.


Page 9: Neutron Sensitivity  and Software Hardening Strategies for  Matrix Multiplication and FFT

Tested Devices

- NVIDIA GeForce GTX 480 (desktop board)
- NVIDIA Tesla C2050 (built-in ECC)

Page 10: Neutron Sensitivity  and Software Hardening Strategies for  Matrix Multiplication and FFT

Radiation Test Facilities

[Figure: radiation test facility, with a proton beam (p+) label]

Page 11: Neutron Sensitivity  and Software Hardening Strategies for  Matrix Multiplication and FFT

Radiation Test Facilities

[Figure: radiation test facility]

Page 12: Neutron Sensitivity  and Software Hardening Strategies for  Matrix Multiplication and FFT

Radiation Test Facilities

[Figure: radiation test facility, labeled "Weapon Nuclear Research"]

Page 13: Neutron Sensitivity  and Software Hardening Strategies for  Matrix Multiplication and FFT

Neutron Spectrum

[Figure: neutron spectrum]

1 s at ISIS ≈ 10^7 s (about 110 days) of natural irradiation in NYC
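The acceleration factor quoted for the beam can be turned into days of natural exposure with a simple unit conversion. The sketch below assumes the factor of 10^7 read from the slide; the function and constant names are illustrative and not part of the original talk.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def natural_exposure_days(beam_seconds, acceleration=1e7):
    """Days of natural irradiation equivalent to a given beam time.

    `acceleration` is the beam-to-natural flux ratio (assumed 1e7 here,
    the order of magnitude quoted on the slide).
    """
    return beam_seconds * acceleration / SECONDS_PER_DAY

days = natural_exposure_days(1.0)
print(f"1 s of beam time is roughly {days:.0f} days of natural exposure")
```

The exact conversion gives about 116 days, the same order as the roughly 110 days quoted on the slide.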

Page 14: Neutron Sensitivity  and Software Hardening Strategies for  Matrix Multiplication and FFT

GPU Radiation Test Setup

[Figure: PC and GPU connected through a 20 cm PCI-E bus extension, with the beam spot on the GPU]

- The PC is inside the room but out of the beam
- A PCI-E bus extension connects the PC and the GPU
- The extension has fuses on the power lines to prevent GPU latch-ups from affecting the PC

Page 15: Neutron Sensitivity  and Software Hardening Strategies for  Matrix Multiplication and FFT

GPU Radiation Test Setup

- The GPU power control circuitry is out of the beam: a failure of the power control circuitry could compromise the experiment and the GPU
- The DDR memories are out of the beam
- The beam spot is 3 cm wide: the GPU is fully irradiated


Page 17: Neutron Sensitivity  and Software Hardening Strategies for  Matrix Multiplication and FFT

Matrix Multiplication

[Figure: A x B = M, where A, B, and M are each 2048 x 2048 elements]

- 2048 x 2048 threads
- Each thread performs 2048 sum & mult operations

Page 18: Neutron Sensitivity  and Software Hardening Strategies for  Matrix Multiplication and FFT

Matrix Multiplication Results

Experimental cross section* at ISIS = 2.01×10^-6 cm²

The neutron spectrum at ISIS resembles the atmospheric one, so the cross section at ISIS resembles the cross section at sea level.

Cross section × particle flux (at sea level) = error rate

2.01×10^-6 cm² × 13 n/(cm²·h) = 2.60×10^4 FIT, i.e. 1 error every 4.5 years

Titan (GTX): 18,000 errors every 4.5 years, i.e. 10 errors per day!

*with double data
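The error-rate arithmetic on this slide can be reproduced in a few lines. This is a minimal sketch using the cross section, flux, and GPU count quoted above; FIT is taken in its usual sense of failures per 10^9 device-hours, and all names are illustrative.

```python
HOURS_PER_YEAR = 24 * 365.25
FIT_HOURS = 1e9  # FIT = failures per 10^9 device-hours

def error_rate_per_hour(cross_section_cm2, flux_n_per_cm2_h):
    """Error rate = cross section x particle flux."""
    return cross_section_cm2 * flux_n_per_cm2_h

sigma = 2.01e-6   # cm^2, measured cross section (from the slide)
flux = 13.0       # n/(cm^2 h), sea-level neutron flux (from the slide)
gpus = 18_000     # number of GPUs in Titan (from the slide)

rate = error_rate_per_hour(sigma, flux)
fit = rate * FIT_HOURS
years_between_errors = 1.0 / rate / HOURS_PER_YEAR
titan_errors_per_day = rate * gpus * 24

print(f"FIT per GPU: {fit:.3g}")
print(f"Years between errors on one GPU: {years_between_errors:.1f}")
print(f"Errors per day on {gpus} GPUs: {titan_errors_per_day:.1f}")
```

The unrounded values are about 2.6×10^4 FIT, about 4.4 years between errors on a single GPU, and about 11 errors per day across 18,000 GPUs, in line with the rounded figures on the slide.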

Page 19: Neutron Sensitivity  and Software Hardening Strategies for  Matrix Multiplication and FFT

Multiple Output Errors

It was commonly believed that a fault corrupts only a single output element.

Experimental results:

- Single: 42.2%
- Multiple: 58.8%

The majority of errors are multiple output errors.

Page 20: Neutron Sensitivity  and Software Hardening Strategies for  Matrix Multiplication and FFT

Multiple Output Errors Analysis

Three different multiple-error patterns are detected:

1) 22.8% on the same row
2) 26.8% on the same column
3) 8% cluster errors

[Figure: bar chart of output errors [%] for Single and for Multiple (Row, Column, RND); diagrams of M with errors (x) along a row, along a column, and at scattered locations]
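The patterns above can be told apart by comparing a faulty output with a known-good one. The sketch below is one possible classifier, assuming exact element comparison against a golden matrix; the function name and labels are illustrative and not taken from the authors' analysis tools.

```python
def classify_output_errors(expected, observed):
    """Label the mismatch pattern between two equally sized matrices."""
    bad = [(i, j)
           for i, row in enumerate(expected)
           for j, value in enumerate(row)
           if observed[i][j] != value]
    if not bad:
        return "none"
    if len(bad) == 1:
        return "single"
    rows = {i for i, _ in bad}
    cols = {j for _, j in bad}
    if len(rows) == 1:
        return "row"      # several errors, all on the same row
    if len(cols) == 1:
        return "column"   # several errors, all on the same column
    return "cluster"      # several errors spread over rows and columns

golden = [[0] * 4 for _ in range(4)]
row_hit = [row[:] for row in golden]
row_hit[2][0] = row_hit[2][3] = 1
print(classify_output_errors(golden, row_hit))
```

For floating-point outputs the exact comparison would normally be replaced by a tolerance, which is left out here to keep the sketch short.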

Page 21: Neutron Sensitivity  and Software Hardening Strategies for  Matrix Multiplication and FFT

Causes of Errors on a Row/Column

[Figure: A, B, and M with the GPU cache; a corrupted cache entry (x) maps to a line of errors in M]

A column of M is calculated using the rows of A and one column of B, which is stored in the GPU cache.

Threads on an SM share the cache, so cache corruption causes errors along a row/column.

Page 22: Neutron Sensitivity  and Software Hardening Strategies for  Matrix Multiplication and FFT

Error Correction

1) ECC on cache memory

- Corrects multiple errors on a row/column, which are almost 50% of the total (tested on C2050)
- Memory availability is reduced by 12.5%*
- Execution time is increased by up to 30%*

*NVIDIA datasheet

2) Algorithm-Based Fault Tolerance (ABFT): a technique specifically designed for an algorithm

[Figure: A x B = M, with checksums appended to A and B and the resulting col-check and row-check appended to M]

*Freivalds '79

Page 23: Neutron Sensitivity  and Software Hardening Strategies for  Matrix Multiplication and FFT

Matrix Multiplication ABFT

[Figure: M with its col-check and row-check next to the recomputed col-sum and row-sum; X marks a single error and the mismatching sums]

Single errors* are detected in O(N) and corrected in O(1).

[Figure: the same layout with several errors (X) on one row or column]

Errors on a row/column** are detected in O(N) and corrected in O(1).

*Huang and Abraham '84
**P. Rech et al., '12
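The checksum idea can be illustrated on a small integer example. This is a sketch in the spirit of checksum-based ABFT, not the authors' GPU implementation: it assumes integer data (so exact comparison is safe), errors confined to a single element or to one row or column of the data part, and intact checksum entries. All function names are hypothetical.

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def with_checksums(a, b):
    """Append a column-sum row to A and a row-sum column to B."""
    a_c = a + [[sum(col) for col in zip(*a)]]
    b_r = [row + [sum(row)] for row in b]
    return a_c, b_r

def locate_mismatches(m_full):
    """Recompute row/column sums and compare them with the stored checks."""
    n_rows, n_cols = len(m_full) - 1, len(m_full[0]) - 1
    bad_rows, bad_cols = {}, {}
    for i in range(n_rows):
        delta = sum(m_full[i][:n_cols]) - m_full[i][n_cols]
        if delta != 0:
            bad_rows[i] = delta
    for j in range(n_cols):
        delta = sum(m_full[i][j] for i in range(n_rows)) - m_full[n_rows][j]
        if delta != 0:
            bad_cols[j] = delta
    return bad_rows, bad_cols

def correct_in_place(m_full):
    """Fix a single error or errors along one row/column; return success."""
    bad_rows, bad_cols = locate_mismatches(m_full)
    if not bad_rows or not bad_cols:
        # Clean result, or a one-sided mismatch that this sketch does not handle.
        return not bad_rows and not bad_cols
    if len(bad_rows) == 1:
        i = next(iter(bad_rows))
        for j, delta in bad_cols.items():   # one error per mismatching column
            m_full[i][j] -= delta
        return True
    if len(bad_cols) == 1:
        j = next(iter(bad_cols))
        for i, delta in bad_rows.items():   # one error per mismatching row
            m_full[i][j] -= delta
        return True
    return False  # pattern not covered by this sketch

a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
b = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
a_c, b_r = with_checksums(a, b)
golden = matmul(a_c, b_r)          # full checksum matrix
faulty = [row[:] for row in golden]
faulty[1][0] += 5                  # two injected errors on row 1
faulty[1][2] -= 3
fixed = correct_in_place(faulty)
print(fixed, faulty == golden)
```

Once an error has been located, each correction is a single subtraction of the checksum difference, which is where the constant-time correction comes from.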

Page 24: Neutron Sensitivity  and Software Hardening Strategies for  Matrix Multiplication and FFT

Cluster Errors Causes

[Figure: M with errors (x) at scattered locations]

A scheduler failure affects the synchronization of some threads or provides incomplete results; random locations of M are then erroneous.

Cluster errors can be caused by:

- Cache cross-talk
- Errors in dirty cache flags
- Pairwise bit flips in cache
- Scheduler failure

Page 25: Neutron Sensitivity  and Software Hardening Strategies for  Matrix Multiplication and FFT

Cluster Errors Criticality

Cluster errors:

- are not corrected by ECC (tested on C2050)
- the scheduler cannot be physically hardened
- scheduler software hardening* is not yet proven on GPUs

*Rossi et al. '10; Karimi et al. '10

[Figure: bar chart of output errors [%] for Single and for Multiple (Row, Column)]

Cluster errors are less likely to occur; however, their FIT is 1.13×10^3, which is not negligible!

Page 26: Neutron Sensitivity  and Software Hardening Strategies for  Matrix Multiplication and FFT

Cluster Errors Correction

[Figure: M with col-check/row-check and col-sum/row-sum; several errors (X) produce various mismatches between row-check and row-sum, and between col-check and col-sum]

The checksum information is not enough to distinguish the errors, but...

...we can try to correct the errors with the row-checksums or the col-checksums and check whether the correction succeeds.

Experimentally observed: the corrupted locations in a cluster are ≤ 4, so at most 16 checks are needed!


Page 28: Neutron Sensitivity  and Software Hardening Strategies for  Matrix Multiplication and FFT

Fast Fourier Transform

[Figure: a 512 x 512 grid of 64-point FFTs]

- 512 x 512 threads, each executing the Stockham algorithm on a 64-point FFT
- log2(64) = 6 iterations are required
- At each iteration a thread updates the 64 elements 2-by-2
- A thread in one iteration uses the output of previous threads as input
- Threads are not independent: errors are likely to spread

FFT cross section = 3.69×10^-6 cm² (5.17×10^5 FIT)
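How an error spreads through the six iterations can be seen by injecting a fault into a small radix-2 Stockham FFT. The sketch below is a plain sequential Python version, not the GPU kernel used in the experiments; corrupting element 0 with a +1 perturbation before a chosen stage is an illustrative assumption about the fault, not a model of the measured failures.

```python
import cmath

def stockham_fft(x, corrupt_before_stage=None):
    """Radix-2 Stockham autosort FFT; optionally corrupt one value before a stage."""
    n = len(x)
    src, dst = [complex(v) for v in x], [0j] * n
    half, stride, stage = n // 2, 1, 0
    while half >= 1:
        if stage == corrupt_before_stage:
            src[0] += 1.0  # injected fault (illustrative)
        for p in range(half):
            w = cmath.exp(-2j * cmath.pi * p / (2 * half))
            for q in range(stride):
                a = src[q + stride * p]
                b = src[q + stride * (p + half)]
                dst[q + stride * (2 * p)] = a + b        # butterfly: sum
                dst[q + stride * (2 * p + 1)] = (a - b) * w  # butterfly: twiddled difference
        src, dst = dst, src
        half, stride, stage = half // 2, stride * 2, stage + 1
    return src

def corrupted_outputs(x, stage):
    """Count outputs that differ when a fault is injected before `stage`."""
    clean = stockham_fft(x)
    faulty = stockham_fft(x, corrupt_before_stage=stage)
    return sum(1 for c, f in zip(clean, faulty) if abs(c - f) > 1e-9)

signal = [float(i % 7) for i in range(64)]
spread = [corrupted_outputs(signal, s) for s in range(6)]
print(spread)
```

With this injection, a fault before the first iteration reaches all 64 outputs and the count halves for each later iteration, down to 2 for the last one. A single corrupted intermediate value therefore never yields a single output error, which fits the observation that errors are likely to spread.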

Page 29: Neutron Sensitivity  and Software Hardening Strategies for  Matrix Multiplication and FFT

FFT Multiple Errors

[Figure: histogram of the percentage of faulty FFTs (y axis, 0 to 10) versus the number of errors per execution (x axis bins: 2, 4, 6, 9-11, 14, 16, 18, 20-21, 24, 26, 28, 30, 32, 34-39, 42, 44, 46-47, 50-51, 54-55, 57-59, 62, 64, 66-126, 128, >130)]

- Less than 4% of executions have single errors
- Few executions have an odd number of errors
- Most executions have fewer than 32 errors, or 64 (a thread failure leads to the wrong update of all the 64 elements in the FFT)

Software hardening idea: prevent error propagation

Page 30: Neutron Sensitivity  and Software Hardening Strategies for  Matrix Multiplication and FFT

FFT Hardening

[Figure: FFT compared with the ABFT-hardened FFT: input coding, output decoding, and checksum generation, followed by a check]

All errors are detected with a well-chosen coding-decoding scheme*...

...but only when all iterations are completed: errors do propagate, and recomputation of the FFT is required.

Divide the N-FFT into N2-FFTs and N1-FFTs (N = N1*N2), performing coding-decoding-checksum on each smaller FFT...

...only the small FFT found corrupted has to be recomputed.

[Figure: error propagation versus computational overhead, with a check after each smaller FFT]

*J.Y. Jou and Abraham '88; P. Rech et al. '13
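The decomposition idea can be sketched with a standard N = N1*N2 split in which every small FFT is checked on its own. The code below is an illustration only. It does not implement the coding-decoding scheme cited on the slide but swaps in a simpler checksum identity of the DFT (the outputs must sum to n times the first input), it uses a naive DFT as the small kernel, and the fault injector is hypothetical.

```python
import cmath

def dft(x):
    """Naive O(n^2) DFT, standing in for a small FFT kernel."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * t * k / n) for t in range(n))
            for k in range(n)]

def passes_check(x, X, tol=1e-6):
    """Checksum identity of the DFT: sum(X) == n * x[0] (illustrative check)."""
    scale = max(1.0, max(abs(v) for v in X))
    return abs(sum(X) - len(x) * x[0]) <= tol * scale

def checked_small_fft(x, kernel, stats):
    """Run one small FFT and recompute only this one if its check fails."""
    X = kernel(x)
    if not passes_check(x, X):
        stats["recomputed"] += 1
        X = kernel(x)
    return X

def hardened_fft(x, n1, n2, kernel, stats):
    """N-point FFT (N = n1*n2) built from individually checked small FFTs."""
    n = n1 * n2
    assert len(x) == n
    # Step 1: n2 FFTs of size n1 over the inputs x[n2*t1 + t2].
    inner = [checked_small_fft([x[n2 * t1 + t2] for t1 in range(n1)], kernel, stats)
             for t2 in range(n2)]
    # Step 2: twiddle factors W_N^(t2*k1).
    for t2 in range(n2):
        for k1 in range(n1):
            inner[t2][k1] *= cmath.exp(-2j * cmath.pi * t2 * k1 / n)
    # Step 3: n1 FFTs of size n2; the output index is k1 + n1*k2.
    out = [0j] * n
    for k1 in range(n1):
        col = checked_small_fft([inner[t2][k1] for t2 in range(n2)], kernel, stats)
        for k2 in range(n2):
            out[k1 + n1 * k2] = col[k2]
    return out

def make_faulty_kernel(fail_on_call):
    """Hypothetical fault injector: corrupts one output on the given call."""
    calls = {"count": 0}
    def kernel(x):
        calls["count"] += 1
        X = dft(x)
        if calls["count"] == fail_on_call:
            X[3] += 1.0  # transient fault in a single small FFT
        return X
    return kernel

signal = [complex(i % 5, 0) for i in range(64)]
stats = {"recomputed": 0}
result = hardened_fft(signal, 8, 8, make_faulty_kernel(fail_on_call=3), stats)
reference = dft(signal)
worst = max(abs(a - b) for a, b in zip(result, reference))
print(stats["recomputed"], worst < 1e-6)
```

In this run the injected fault hits one of the sixteen 8-point transforms, so only that transform is recomputed and the error never reaches the later stage. The price is one extra check per small FFT, which is the trade-off between error propagation and computational overhead.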


Page 32: Neutron Sensitivity  and Software Hardening Strategies for  Matrix Multiplication and FFT

Conclusions

- GPUs are very prone to corruption by neutrons
- The radiation response depends on the executed algorithm
- The corruption of shared and critical resources leads to multiple output errors
- ECC is not sufficient to guarantee high reliability
- Software-based hardening strategies can be built by analyzing the algorithm and the experimental data

Work in Progress:

- Reduce scheduler strain by optimizing thread distributions
- Analyze cache flag corruptions
- Evaluate error criticality (precision of data)