Interval arithmetic on the Cell processor
Stef Graillat, Jean-Luc Lamotte, Siegfried M. Rump, Svetoslav Markov
LIP6/PEQUAN, P. and M. Curie University, Paris
Institute for Reliable Computing, Hamburg University of Technology
Institute of Mathematics and Computer Science, Bulgarian Academy of Sciences
13th GAMM - IMACS International Symposium on Scientific Computing, Computer Arithmetic and Verified Numerical Computations, SCAN’08
El Paso, Texas, USA, September 29 - October 3, 2008
Overview
The Cell processor
Interval Arithmetic on Cell processor
Conclusions
The Cell processor
SP > 200 GFlops, DP=15 Gflops, 25GB/s memory BW, 300 GB/s EIB
Synergistic Processing Element SPE (1/2)
The SPE is a small processor with a vector unit.
- small memory (256 KB) for instructions and data, named “local store” (LS)
- 128 registers of 128 bits
- 1 SPU “Synergistic Processing Unit”
  - 4 units for single precision computation
  - 1 unit for double precision computation
- MFC “Memory Flow Controller”, which manages memory access through DMA
Synergistic Processing Element SPE (2/2)
128-bit registers:
- 16 integers of 8 bits,
- 8 integers of 16 bits,
- 4 integers of 32 bits,
- 4 single precision floating point numbers,
- 2 double precision floating point numbers.

The SIMD processor is based on FMA and is fully pipelined in SP:
Peak performance in SP: 4 × 2 × 3.2 = 25.6 GFLOPS
Not fully pipelined in double precision:
Peak performance in DP: 2 × 2 × 3.2/7 = 1.8 GFLOPS
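As a quick consistency check, the per-SPE peaks can be scaled to the whole chip. This assumes that the aggregate figures quoted earlier (more than 200 GFLOPS in SP, 15 GFLOPS in DP) count only the 8 SPEs, that the factor 2 stands for the two operations of an FMA, and that 3.2 is the clock rate in GHz:

$$8 \times (4 \times 2 \times 3.2) = 204.8\ \text{GFLOPS (SP)}, \qquad 8 \times \frac{2 \times 2 \times 3.2}{7} \approx 14.6\ \text{GFLOPS (DP)}$$

Under these assumptions both values agree with the chip-level numbers.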
[Figure, copyright IBM]
Parallelism on Cell
3 levels of parallelism:
1. Processes run on Cell processors and exchange data through an MPI library.
2. Data distribution and communication across the 8 SPEs:
   - ALF, DaCS
   - POSIX threads, Cell threads
   - mailboxes, exchange through DMA
   - data need to be aligned on quadword boundaries
   - double buffering technique
3. Inside a thread:
   - only 256 KB
   - AltiVec programming
   - code and data dependencies: take care not to break the SIMD pipeline
The performance price on SPE
No division.
1/x and 1/√x: only the first 12 bits are exact.
SPU float arithmetic is not IEEE compliant:
- only one rounding mode: toward zero (truncation).
- The highest exponent (128) is not used for Infinity or NaN, but to extend the range of the floating point numbers.
- Inf and NaN are not recognized by arithmetic operations.
- Overflow results saturate to the largest representable positive or negative values, rather than producing +/- IEEE Infinity.
- No denormalized results: +0 instead.
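A 12-bit reciprocal estimate is usually refined before use. One common approach (not stated on the slide, so treat it as an assumption) is a Newton-Raphson step, y1 = y0 (2 - x y0), which roughly doubles the number of correct bits. The sketch below runs on an ordinary host: `recip_estimate12` is a hypothetical stand-in that merely truncates 1/x to about 12 significant bits, and it is not the SPU estimate instruction.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a hardware estimate:
   1/x with only about 12 significant bits kept. */
static float recip_estimate12(float x) {
    float y = 1.0f / x;
    uint32_t bits;
    memcpy(&bits, &y, sizeof bits);
    bits &= 0xFFFFF000u;          /* clear the low 12 of the 23 stored mantissa bits */
    memcpy(&y, &bits, sizeof bits);
    return y;
}

/* One Newton-Raphson step: if y0 = (1/x)(1 - e), then y1 = (1/x)(1 - e*e). */
static float recip_refined(float x) {
    float y0 = recip_estimate12(x);
    return y0 * (2.0f - x * y0);
}
```

With an initial relative error of about 2^-12, one step brings the error down to roughly single-precision level, up to the rounding errors of the step itself.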
The performance price
SPU double arithmetic is IEEE compliant except:
- FP trapping is not supported.
- Denormalized operands are treated as 0.
- NaN results are always the default QNaN (Quiet NaN).
Reliable computing on Cell processor
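- difficult to implement interval arithmetic.
- possible to “emulate” a rounding mode toward +∞

If $r \in \mathbb{R}$ is non-negative, then $\mathrm{fl}_0(r) \le r \le \mathrm{succ}(\mathrm{fl}_0(r))$ and

$$\mathrm{succ}(f) = \max\{\mathrm{fl}_0((1 + 2u)f),\ \mathrm{fl}_0(f + \underline{u})\},$$

where $u$ is the relative rounding error unit, $\underline{u}$ the underflow unit, and $\mathrm{fl}_0$ denotes rounding toward zero.

On the Cell processor, no underflow:

$$\mathrm{succ}(f) = \begin{cases} \mathrm{fl}_0((1 + 2u)f) & \text{if } f > 0 \\ \tfrac{1}{2}\,\underline{u}\,u^{-1} & \text{if } f = 0 \end{cases}$$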
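The successor emulation can be sketched in portable C. This is only an illustration under assumptions that are not on the slide: IEEE single precision with u = 2^-24 (so 2u equals `FLT_EPSILON`) and an underflow unit of 2^-149 (so the value used for f = 0 equals `FLT_MIN`). The names `toward_zero`, `mul_rz` and `succ_rz` are hypothetical. Because a host CPU rounds to nearest, truncation is emulated through an exact double-precision product instead of relying on SPU hardware.

```c
#include <assert.h>
#include <float.h>
#include <stdint.h>
#include <string.h>

/* Move a nonzero float one representable step toward zero
   (an infinity becomes +/-FLT_MAX, which mimics SPU saturation). */
static float toward_zero(float x) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    bits -= 1u;                       /* shrinks the magnitude, keeps the sign */
    memcpy(&x, &bits, sizeof bits);
    return x;
}

/* fl0(a * b): the product of two floats is exact in double, so round it
   to float and step back whenever the rounding overshot in magnitude. */
static float mul_rz(float a, float b) {
    double exact = (double)a * (double)b;
    float r = (float)exact;           /* host rounding: to nearest */
    double mag_r = r < 0 ? -(double)r : (double)r;
    double mag_e = exact < 0 ? -exact : exact;
    return mag_r > mag_e ? toward_zero(r) : r;
}

/* succ(f) for f >= 0, following the no-underflow formula. */
static float succ_rz(float f) {
    if (f == 0.0f) return FLT_MIN;            /* (1/2) * underflow unit / u */
    return mul_rz(1.0f + FLT_EPSILON, f);     /* fl0((1 + 2u) f) */
}
```

For example, `succ_rz(1.0f)` yields `1.0f + FLT_EPSILON`, the next float above 1.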
Interval with a rounding mode toward zero
Three representations:
- endpoint
- center-radius
- leftpoint-diameter
Endpoint representation — addition
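Let $A = [a_{\inf}, a_{\sup}]$, $B = [b_{\inf}, b_{\sup}]$ and $C = [c_{\inf}, c_{\sup}]$ be three intervals. Let $\oplus$ and $\otimes$ be the floating point addition and multiplication with rounding toward zero. $C = A + B$ is defined by:

$$c_{\inf} = \begin{cases} -\mathrm{succ}(|a_{\inf} \oplus b_{\inf}|) & \text{if } (a_{\inf} \oplus b_{\inf}) < 0 \\ a_{\inf} \oplus b_{\inf} & \text{if } (a_{\inf} \oplus b_{\inf}) > 0 \\ -\tfrac{1}{2}\,\underline{u}\,u^{-1} & \text{if } (a_{\inf} \oplus b_{\inf}) = 0 \end{cases}$$

$$c_{\sup} = \begin{cases} \mathrm{succ}(a_{\sup} \oplus b_{\sup}) & \text{if } (a_{\sup} \oplus b_{\sup}) > 0 \\ a_{\sup} \oplus b_{\sup} & \text{if } (a_{\sup} \oplus b_{\sup}) < 0 \\ \tfrac{1}{2}\,\underline{u}\,u^{-1} & \text{if } (a_{\sup} \oplus b_{\sup}) = 0 \end{cases}$$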
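A minimal host-side sketch of this endpoint addition follows. It assumes IEEE single precision with 2u equal to `FLT_EPSILON` and `FLT_MIN` as the bound used for a zero sum. Truncated addition is emulated with a TwoSum error term, since the host rounds to nearest. The type `interval` and all function names are hypothetical, and overflow is not handled.

```c
#include <assert.h>
#include <float.h>
#include <stdint.h>
#include <string.h>

typedef struct { float inf, sup; } interval;   /* hypothetical endpoint type */

/* One representable step toward zero for a nonzero float. */
static float toward_zero(float x) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    bits -= 1u;
    memcpy(&x, &bits, sizeof bits);
    return x;
}

/* fl0(a * b) via an exact double product. */
static float mul_rz(float a, float b) {
    double exact = (double)a * (double)b;
    float r = (float)exact;
    double mag_r = r < 0 ? -(double)r : (double)r;
    double mag_e = exact < 0 ? -exact : exact;
    return mag_r > mag_e ? toward_zero(r) : r;
}

/* fl0(a + b): TwoSum gives the exact error of the host addition; step back
   if the nearest-rounded sum overshot in magnitude. Assumes no overflow and
   strict float evaluation (no -ffast-math). */
static float add_rz(float a, float b) {
    float s = a + b;
    float bv = s - a;
    float err = (a - (s - bv)) + (b - bv);     /* a + b == s + err exactly */
    if ((s > 0 && err < 0) || (s < 0 && err > 0)) s = toward_zero(s);
    return s;
}

/* succ(f) for f >= 0. */
static float succ_rz(float f) {
    return f == 0.0f ? FLT_MIN : mul_rz(1.0f + FLT_EPSILON, f);
}

/* Lower and upper bounds derived from a truncated result s. */
static float lower_bound(float s) {
    if (s < 0) return -succ_rz(-s);
    return s > 0 ? s : -FLT_MIN;
}
static float upper_bound(float s) {
    if (s > 0) return succ_rz(s);
    return s < 0 ? s : FLT_MIN;
}

static interval iv_add(interval a, interval b) {
    interval c;
    c.inf = lower_bound(add_rz(a.inf, b.inf));
    c.sup = upper_bound(add_rz(a.sup, b.sup));
    return c;
}
```

A truncated positive sum is already a valid lower bound and a truncated negative sum a valid upper bound, so `succ_rz` is only needed on the side where truncation moves the result inward.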
Endpoint representation — multiplication
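Let

$$x = \min(a_{\inf} \otimes b_{\inf},\ a_{\inf} \otimes b_{\sup},\ a_{\sup} \otimes b_{\inf},\ a_{\sup} \otimes b_{\sup})$$

$$y = \max(a_{\inf} \otimes b_{\inf},\ a_{\inf} \otimes b_{\sup},\ a_{\sup} \otimes b_{\inf},\ a_{\sup} \otimes b_{\sup})$$

$C = A \times B$ is defined by:

$$c_{\inf} = \begin{cases} -\mathrm{succ}(|x|) & \text{if } x < 0 \\ -\tfrac{1}{2}\,\underline{u}\,u^{-1} & \text{if } x = 0 \\ x & \text{else} \end{cases}$$

$$c_{\sup} = \begin{cases} \mathrm{succ}(y) & \text{if } y > 0 \\ \tfrac{1}{2}\,\underline{u}\,u^{-1} & \text{if } y = 0 \\ y & \text{if } y < 0 \end{cases}$$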
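The multiplication rule can be sketched in the same spirit. As before, this is a host-side illustration, not SPU code: truncated products are emulated through exact double products, 2u is taken to be `FLT_EPSILON`, `FLT_MIN` is used as the bound for a zero result, and the type `interval` and the function names are hypothetical.

```c
#include <assert.h>
#include <float.h>
#include <stdint.h>
#include <string.h>

typedef struct { float inf, sup; } interval;   /* hypothetical endpoint type */

/* One representable step toward zero for a nonzero float. */
static float toward_zero(float x) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    bits -= 1u;
    memcpy(&x, &bits, sizeof bits);
    return x;
}

/* fl0(a * b) via an exact double product. */
static float mul_rz(float a, float b) {
    double exact = (double)a * (double)b;
    float r = (float)exact;
    double mag_r = r < 0 ? -(double)r : (double)r;
    double mag_e = exact < 0 ? -exact : exact;
    return mag_r > mag_e ? toward_zero(r) : r;
}

/* succ(f) for f >= 0. */
static float succ_rz(float f) {
    return f == 0.0f ? FLT_MIN : mul_rz(1.0f + FLT_EPSILON, f);
}

static interval iv_mul(interval a, interval b) {
    /* The four truncated endpoint products. */
    float p[4];
    float x, y;
    interval c;
    int i;
    p[0] = mul_rz(a.inf, b.inf);
    p[1] = mul_rz(a.inf, b.sup);
    p[2] = mul_rz(a.sup, b.inf);
    p[3] = mul_rz(a.sup, b.sup);
    x = y = p[0];
    for (i = 1; i < 4; ++i) {
        if (p[i] < x) x = p[i];   /* x = min of the products */
        if (p[i] > y) y = p[i];   /* y = max of the products */
    }
    /* Push each bound outward only where truncation moved it inward. */
    c.inf = x < 0 ? -succ_rz(-x) : (x == 0 ? -FLT_MIN : x);
    c.sup = y > 0 ? succ_rz(y) : (y == 0 ? FLT_MIN : y);
    return c;
}
```

Taking the min and max after truncation is safe because truncation is monotone: the smallest truncated product is the truncation of the smallest exact product.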
Center-radius — addition
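Let $A = [a, \alpha]$, $B = [b, \beta]$ and $C = [c, \gamma]$ be three intervals.

Rump’s algorithm ($\square$: rounding to nearest, $\triangle$: rounding toward +∞):

$$c = \square(a + b), \qquad \gamma = \triangle(2u \cdot |c| + \alpha + \beta)$$

With rounding toward zero, $C = A + B$ is defined by:
- $c = a \oplus b$
- $\gamma = \mathrm{succ}(2u \otimes |c| \oplus \mathrm{succ}(\alpha \oplus \beta))$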
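The truncated center-radius addition can be sketched as follows. This is a host-side illustration only: it assumes 2u equals `FLT_EPSILON`, emulates truncated addition with a TwoSum correction and truncated multiplication with an exact double product, and ignores overflow. The helper names are hypothetical; `T_INTERVAL` follows the center/radius struct used in the slides.

```c
#include <assert.h>
#include <float.h>
#include <stdint.h>
#include <string.h>

typedef struct { float center; float radius; } T_INTERVAL;

/* One representable step toward zero for a nonzero float. */
static float toward_zero(float x) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    bits -= 1u;
    memcpy(&x, &bits, sizeof bits);
    return x;
}

/* fl0(a * b) via an exact double product. */
static float mul_rz(float a, float b) {
    double exact = (double)a * (double)b;
    float r = (float)exact;
    double mag_r = r < 0 ? -(double)r : (double)r;
    double mag_e = exact < 0 ? -exact : exact;
    return mag_r > mag_e ? toward_zero(r) : r;
}

/* fl0(a + b) via TwoSum; assumes no overflow and no -ffast-math. */
static float add_rz(float a, float b) {
    float s = a + b;
    float bv = s - a;
    float err = (a - (s - bv)) + (b - bv);     /* a + b == s + err exactly */
    if ((s > 0 && err < 0) || (s < 0 && err > 0)) s = toward_zero(s);
    return s;
}

/* succ(f) for f >= 0. */
static float succ_rz(float f) {
    return f == 0.0f ? FLT_MIN : mul_rz(1.0f + FLT_EPSILON, f);
}

static T_INTERVAL cr_add(T_INTERVAL A, T_INTERVAL B) {
    T_INTERVAL C;
    float abs_c, rad_sum;
    C.center = add_rz(A.center, B.center);                 /* c = a (+) b */
    abs_c = C.center < 0 ? -C.center : C.center;
    rad_sum = succ_rz(add_rz(A.radius, B.radius));         /* >= alpha + beta */
    /* gamma = succ(2u (x) |c| (+) succ(alpha (+) beta)) */
    C.radius = succ_rz(add_rz(mul_rz(FLT_EPSILON, abs_c), rad_sum));
    return C;
}
```

The term 2u|c| absorbs the truncation error of the center, which is smaller than one unit in the last place of c.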
Center-radius — multiplication
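Rump’s algorithm ($\square$: rounding to nearest, $\triangle$: rounding toward +∞):

$$c = \square(a \cdot b), \qquad \gamma = \triangle(\underline{u} + 2u \cdot |c| + (|a| + \alpha)\beta + \alpha|b|)$$

With rounding toward zero, $C = A \times B$ is defined by:
- $c = a \otimes b$
- $\gamma = \mathrm{succ}(\mathrm{succ}(2u \otimes |c| \oplus \mathrm{succ}(\mathrm{succ}(|a| \oplus \alpha) \otimes \beta)) \oplus \mathrm{succ}(\alpha \otimes |b|))$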
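The nested `succ` calls are easier to follow when each partial bound gets its own name. The sketch below is a host-side illustration under the same assumptions as the other examples (2u equal to `FLT_EPSILON`, truncation emulated in software, no overflow). The helper names and the temporaries `t1`, `t2`, `t3` are hypothetical.

```c
#include <assert.h>
#include <float.h>
#include <stdint.h>
#include <string.h>

typedef struct { float center; float radius; } T_INTERVAL;

/* One representable step toward zero for a nonzero float. */
static float toward_zero(float x) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    bits -= 1u;
    memcpy(&x, &bits, sizeof bits);
    return x;
}

/* fl0(a * b) via an exact double product. */
static float mul_rz(float a, float b) {
    double exact = (double)a * (double)b;
    float r = (float)exact;
    double mag_r = r < 0 ? -(double)r : (double)r;
    double mag_e = exact < 0 ? -exact : exact;
    return mag_r > mag_e ? toward_zero(r) : r;
}

/* fl0(a + b) via TwoSum; assumes no overflow and no -ffast-math. */
static float add_rz(float a, float b) {
    float s = a + b;
    float bv = s - a;
    float err = (a - (s - bv)) + (b - bv);     /* a + b == s + err exactly */
    if ((s > 0 && err < 0) || (s < 0 && err > 0)) s = toward_zero(s);
    return s;
}

/* succ(f) for f >= 0. */
static float succ_rz(float f) {
    return f == 0.0f ? FLT_MIN : mul_rz(1.0f + FLT_EPSILON, f);
}

static float abs_f(float v) { return v < 0 ? -v : v; }

static T_INTERVAL cr_mul(T_INTERVAL A, T_INTERVAL B) {
    T_INTERVAL C;
    float t1, t2, t3;
    C.center = mul_rz(A.center, B.center);                      /* c = a (x) b */
    /* t1 >= (|a| + alpha) * beta */
    t1 = succ_rz(mul_rz(succ_rz(add_rz(abs_f(A.center), A.radius)), B.radius));
    /* t2 >= alpha * |b| */
    t2 = succ_rz(mul_rz(A.radius, abs_f(B.center)));
    /* t3 >= 2u|c| + t1 */
    t3 = succ_rz(add_rz(mul_rz(FLT_EPSILON, abs_f(C.center)), t1));
    C.radius = succ_rz(add_rz(t3, t2));                          /* gamma */
    return C;
}
```

Every intermediate quantity is non-negative, so each `succ_rz` call turns a truncated value into an upper bound of the exact one.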
Implementation of interval vectors

Array of structures, layout [crcr]:

    typedef struct {
        float center;
        float radius;
    } T_INTERVAL;

    T_INTERVAL x[...]

    c1 r1 c2 r2 c3 r3 c4 r4

Structure of arrays, layout [ccrr]:

    typedef struct {
        int intervalnumber;
        float *center;
        float *radius;
    } T_INTERVAL;

    c1 c2 c3 c4
    r1 r2 r3 r4

All the SIMD operations use vectors of four 32-bit floating point numbers.
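The difference between the two layouts can be made concrete with a small conversion routine. This is a plain C sketch: the two structs are renamed `T_INTERVAL_AOS` and `T_INTERVAL_SOA` here only so that both can coexist in one file, and `aos_to_soa` is a hypothetical helper. On the SPE the destination arrays would additionally need quadword alignment, which this sketch does not enforce.

```c
#include <assert.h>

/* Interleaved layout: c1 r1 c2 r2 c3 r3 c4 r4 */
typedef struct {
    float center;
    float radius;
} T_INTERVAL_AOS;

/* Separated layout: c1 c2 c3 c4 / r1 r2 r3 r4 */
typedef struct {
    int intervalnumber;
    float *center;
    float *radius;
} T_INTERVAL_SOA;

/* Copy dst->intervalnumber intervals from the interleaved array into the
   separate center and radius arrays, which the caller must have allocated. */
static void aos_to_soa(const T_INTERVAL_AOS *src, T_INTERVAL_SOA *dst) {
    int i;
    for (i = 0; i < dst->intervalnumber; ++i) {
        dst->center[i] = src[i].center;
        dst->radius[i] = src[i].radius;
    }
}
```

In the separated layout, four consecutive centers (or four consecutive radii) fill one 128-bit vector directly, whereas the interleaved layout mixes centers and radii inside each vector.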
Performance on 1 SPE

| Operations | MFLOPs |
|---|---|
| Idle (function call) | 477.6 |
| Add [crcr] | 273.5 |
| Add [ccrr] | 345.9 |
| Mul [crcr] | 244.3 |
| Mul [ccrr] | 255.1 |
| Add inf-sup | 285.7 |
Center-diameter representation
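Seems to be useful with rounding toward zero!

$$A = (a, \alpha) = \begin{cases} \{x \in \mathbb{R} : a \le x \le a + \alpha\} & \text{if } a \ge 0 \\ \{x \in \mathbb{R} : a - \alpha \le x \le a\} & \text{if } a < 0 \end{cases}$$

But very difficult to implement!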
Conclusions
- hard programming job
- necessity to develop complex algorithms to reach a high level of performance.
- to prepare the work for the new Cell:
  - fully pipelined double precision floating point (100 DP GFLOPS)
  - up to now no information on the floating point quality
Rumours on the next generation
- IEEE compliant
- from 8 to 32 SPEs
- over 1 TFLOPS
Acknowledgment
Thanks to the CINES Center for providing IBM Cell Blade access.