(slides) Efficient Evaluation Methods of Elementary Functions Suitable for SIMD Computation
DESCRIPTION
Naoki Shibata: Efficient Evaluation Methods of Elementary Functions Suitable for SIMD Computation. Computer Science - Research and Development (Proceedings of the International Supercomputing Conference ISC'10), Volume 25, Numbers 1-2, pp. 25-32, May 2010. DOI: 10.1007/s00450-010-0108-2
http://www.springerlink.com/content/340228x165742104/
http://freshmeat.net/projects/sleef

Data-parallel architectures like SIMD (Single Instruction Multiple Data) or SIMT (Single Instruction Multiple Thread) have been adopted in many recent CPU and GPU architectures. Although some SIMD and SIMT instruction sets include double-precision arithmetic and bitwise operations, there are no instructions dedicated to evaluating elementary functions like trigonometric functions in double precision. Thus, these functions have to be evaluated one by one using an FPU or a software library. However, traditional algorithms for evaluating these elementary functions involve heavy use of conditional branches and/or table look-ups, which are not suitable for SIMD computation. In this paper, efficient methods are proposed for evaluating the sine, cosine, arc tangent, exponential and logarithmic functions in double precision without table look-ups, scattering from, or gathering into SIMD registers, or conditional branches. We implemented these methods using the Intel SSE2 instruction set to evaluate their accuracy and speed. The results showed that the average error was less than 0.67 ulp, and the maximum error was 6 ulps. The computation speed was faster than the FPUs on Intel Core 2 and Core i7 processors.

TRANSCRIPT
International Supercomputing Conference 2010
Efficient Evaluation Methods of Elementary Functions Suitable for SIMD Computation
Naoki Shibata, Shiga University
Overview
• Methods for evaluating the following functions using SIMD instructions in double precision
  – sin, cos, tan
  – log, exp
  – asin, acos, atan
• Fast
  – Two times as fast as FPU evaluation
• Accurate
  – Maximum error is within 6 ulps of the true value
Overview (contd.)
• Advantages over existing methods
  – No conditional branches
  – No gathering/scattering
  – No table look-ups
[Figure: a register holding x and y passes through a chain of identical SIMD operations, yielding sin x and sin y]
Apply exactly the same series of SIMD operations to x and y, and we get sin x and sin y.
Outline
• Overview
• Background
  – Cares needed for SIMD optimization
  – Related works
• Proposed method
  – Trigonometric functions
  – Inverse trigonometric functions
  – Exponential function
  – Logarithmic function
• Evaluation
  – Accuracy
  – Speed
  – Code size
• Conclusion
Background
• SIMD instructions are now pervasive
  – SSE in x86 processors
  – AltiVec in Power / PowerPC processors
  – NEON in ARM processors
  – Cell Broadband Engine
  – Many GPU models
• The length of SIMD registers is going to be extended
  – 256 bits in Sandy Bridge
  – 512 bits in Larrabee (or Knights Ferry?) GPUs
Background (contd.)
• SIMD instruction sets do not include instructions for evaluating elementary functions
  – sin, cos, tan, log, exp, asin, acos, atan
• Two possibilities
  – FPU
    • Elementary functions are available only on limited architectures
  – Software library
    • Many are not optimized for SIMD calculation
Cares needed in SIMD optimization
• We need to take special care in SIMD optimization
  – Memory accesses and conditional branches are slow
    • on modern processor models with long pipelines
  – Table look-ups are slow
  – Gathering and scattering operations are slow
• Some traditionally slow operations are not slow anymore
  – Division and square root can now be evaluated using one instruction each
  – No extra registers needed
• We need new, specialized algorithms to make efficient use of the SIMD unit
Gathering and scattering operations are slow
• We need to compose/decompose each element of a vector register
  – At least one instruction is needed for each element
  – The SIMD ALU may be idle during this operation
• Register spills may happen
  – This requires execution of extra instructions
  – This may cause memory accesses
Table look-up is slow
• Table look-up is frequently used in traditional implementations for evaluating elementary functions
  – both in HW and SW
[Figure: the elements x and y are scattered out of the SIMD register, looked up in the table one by one, and the results x' and y' gathered back in; this part must be repeated for each element]
Division and SQRT are not too slow
• 69 clocks of latency and throughput for one execution of the SQRTPD instruction[*]
• 69 clocks of latency and throughput for one execution of the DIVPD instruction[*]
• Memory access latency is 165 clocks (Core 2 Quad Q9300, measured by CPU-Z)
• The good point is that they do not require extra instruction execution or registers
[*] Intel 64 and IA-32 Architectures Optimization Reference Manual
Challenge
• Usually, evaluation of elementary functions is performed using calculation with extra precision
  – e.g. the x86 FPU evaluates using 80-bit calculation, and rounds the result to 64 bits
• With extra-precision calculation, obtaining accurate output is not very hard
• But in the proposed method, only 64-bit-precision calculation is available
• Thus, error in each step of evaluation leads to error in the output
→ So, we cannot tolerate error in each step
Related Works
• The GNU C Library and the FPU emulator in the Linux OS utilize conditional branches
  – Thus, not suitable for SIMD computation
• There is much research on evaluating elementary functions in hardware
  – Many of these works utilize table look-ups
  – Not suitable for SIMD computation
• There are several multiple-precision floating-point computation libraries
  – Their design is very different from our implementation
Outline
• Overview
• Background
  – Cares needed for SIMD optimization
  – Related works
• Proposed method
  – Trigonometric functions
  – Inverse trigonometric functions
  – Exponential function
  – Logarithmic function
• Evaluation
  – Accuracy
  – Speed
  – Code size
• Conclusion
Trigonometric functions
• Finds sin x and cos x at the same time
• Consists of two steps
  – Step 1: The argument is reduced to within 0 and π/4, utilizing symmetry and periodicity
  – Step 2: Evaluate the sin and cos functions on the reduced argument
• tan x can be evaluated by simply calculating (sin x / cos x)
Reducing the argument to within 0 and π/4
• Given argument d, find s and an integer q so that s = d − q·(π/4) and 0 <= s < π/4
• Is it as easy as just dividing d by π/4? – No!
• Finding q is easy, but s may be inaccurate because of cancellation error
  – when d is a large number close to a multiple of π/4
• Some implementations, like the FPU on x86 processors, exhibit this problem
Cancellation error
• Cancellation error happens when we subtract one number from another where
  – the two numbers are very close
  – both numbers have limited precision and are already rounded
• Suppose that we are calculating 1 − x with 5 digits of precision
  – The true value of x is 0.999985, which is rounded to 0.99999
  – True value: 1 − 0.999985 = 0.000015
  – Rounded: 1 − 0.99999 = 0.00001
  – In this case, the result has only 1 digit of accuracy; the loss of accuracy is caused by cancellation
Solution for this problem
• Basic idea: calculate this part with extra precision, utilizing properties of IEEE 754
• q is assumed to be expressible in 26 bits, which is half of the mantissa
• π/4 is split into three parts, so that all of the bits in the lower half of the mantissa of each part are zero
Solution for this problem (contd.)
• Multiplying these numbers by q does not produce an error
  – because the lower halves of these numbers are zero
• Subtracting the resulting numbers in descending order from d does not produce cancellation error
  – The results of each step are all accurate
• In this way, we can obtain s accurately
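The reduction above can be sketched in scalar C as follows. The split of π/4 and the way q is obtained are our illustrative choices, not the paper's exact code (the paper's constants carry π/4 to higher precision than a single double):

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Sketch of the reduction idea: pi/4 is represented as three parts
 * a + b + c, where a and b have their lower mantissa halves zero, so
 * multiplying them by a ~26-bit integer q is exact, and the chained
 * subtractions from d do not cancel. */
static double upper_half(double x) {       /* clear the low 26 mantissa bits */
    uint64_t u;
    memcpy(&u, &x, sizeof u);
    u &= 0xFFFFFFFFFC000000ULL;
    memcpy(&x, &u, sizeof u);
    return x;
}

double reduce_pi4(double d, int *qout) {   /* returns s = d - q*(pi/4); assumes d >= 0 */
    const double PI_4   = 0.78539816339744830962; /* pi/4 rounded to double */
    const double PI_4_C = 3.06161699786838e-17;   /* approx. remaining bits of pi/4 */
    double a = upper_half(PI_4);           /* upper half of pi/4 */
    double b = PI_4 - a;                   /* lower half; this subtraction is exact */
    int q = (int)(d / PI_4);
    double s = ((d - q * a) - q * b) - q * PI_4_C;
    *qout = q;
    return s;
}
```

Each product q·a and q·b fits in 53 mantissa bits and is therefore exact, so the remaining error comes only from the tiny tail term.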
Step 2 in evaluating trigonometric functions
• Simply summing up the terms of the Taylor series does not produce an accurate result
  – because of accumulation of errors
• Using another kind of reduction is common
  – utilizing the triple angle formula of sine
  – e.g. first divide x by 3, evaluate the Taylor series, and then apply the triple angle formula
  – This speeds up the calculation by reducing x
Step 2 (contd.)
• This method is not good for our case
  – When sin x is small, the difference between the two terms is so large that it produces rounding error
• So, we use the double angle formula of cosine instead: cos 2x = 2 cos² x − 1
• sin x can then be found by the following formula: sin x = √(1 − cos² x)
Step 2 (contd.)
• But we have another problem
  – When x is close to 0, cos² x gets close to 1, and we have cancellation error
• So, we use the double angle formula of (1 − cos x) instead
• Let f(x) be (1 − cos x), and we get
  f(2x) = 2 f(x) (2 − f(x)),  sin x = √(f(x) (2 − f(x)))
• These do not produce cancellation or rounding errors
Inverse trigonometric functions
• We are going to find atan x, and obtain asin x and acos x from atan x
• Again, just evaluating terms of the Taylor series produces accumulation of errors
• There seems to be no good way of argument reduction for inverse trigonometric functions
  – We propose a new way of argument reduction for atan
New argument reduction for evaluating arc tangent (1/3)
• Suppose x = atan d, thus tan x = d
• If d <= 1, we evaluate the atan function on the argument 1/d instead of d
• Then we subtract the result from π/2
• Now, we can assume that d > 1 and π/4 < x < π/2
New argument reduction for evaluating arc tangent (2/3)
• We use the following formulas:
  cot x = 1 / tan x    (13)
  cot (x/2) = cot x + √(1 + cot² x)    (14)
• We use (13) to calculate the cotangent from the argument d
• Then we repeatedly use (14) to find the cotangent of (x / 2^n)
  – which actually "enlarges" the cotangent
• By (13), enlarging the cotangent corresponds to reducing the tangent
New argument reduction for evaluating arc tangent (3/3)
• Suppose that θ = atan 2, thus tan θ = 2
• Apply (13) to get cot θ = 0.5
• Apply (14) to get cot (θ/2) = 0.5 + √(1 + 0.5²) ≈ 1.618
• Apply (13) to get tan (θ/2) ≈ 0.618
• 0.618 is less than 2, thus we have reduced the argument
After argument reduction, we calculate the Taylor series of atan
asin and acos functions
• We have a problem if we simply use the following formulas:
  asin x = atan (x / √(1 − x²)),  acos x = atan (√(1 − x²) / x)
  – These produce cancellation errors when |x| is close to 1
• This problem can be avoided by small modifications of the formulas
  – e.g. evaluating 1 − x² as (1 + x)(1 − x)
Exponential function
• Consists of two steps
  – similar to the trigonometric functions
• Step 1: The argument is reduced to within 0 and log_e 2
• Step 2: Further reduce the argument, and evaluate the Taylor series
Step 1
• We find s and an integer q so that 0 < s <= log_e 2
• This step is very similar to the first step for the trigonometric functions
• The same problem arises, and we solve it in the same way
Step 2
• We can use the following formula to further reduce the argument: exp x = (exp (x/2))²
• And then evaluate the Taylor series
• But, if x is close to 0, we will have rounding errors
  – since the difference between 1 and the other terms is large
• So, we find (exp x) − 1 instead of exp x
• The remaining part is similar to sin and cos
Logarithmic function
• We use the following series, rather than the Taylor series:
  log x = 2 (t + t³/3 + t⁵/5 + ...),  where t = (x − 1)/(x + 1)
• This series converges faster than the Taylor series
• Just evaluating this series is enough
• This series is well known
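A scalar C sketch of this evaluation, applying the series to the mantissa after splitting x = m · 2^e; the normalization interval and term count are our illustrative choices:

```c
#include <math.h>

/* Sketch of the log evaluation with the well-known series
 * log m = 2 (t + t^3/3 + t^5/5 + ...),  t = (m - 1)/(m + 1).
 * Normalizing m into [sqrt(1/2), sqrt(2)) keeps |t| <= 0.1716,
 * so few terms are needed. */
double log_sketch(double x) {            /* assumes x > 0 and finite */
    const double LN2 = 0.69314718055994530942;
    int e;
    double m = frexp(x, &e);             /* x = m * 2^e, m in [0.5, 1) */
    if (m < 0.70710678118654752440) { m *= 2.0; e -= 1; }
    double t = (m - 1.0) / (m + 1.0);
    double t2 = t * t;
    /* terms through t^9 of the series */
    double s = 2.0 * t * (1.0 + t2 * (1.0/3 + t2 * (1.0/5 + t2 * (1.0/7 + t2/9))));
    return s + e * LN2;                  /* log x = log m + e log 2 */
}
```

Unlike the Taylor series of log(1 + u), every term of this series has the same sign structure in t², and |t| is bounded well away from 1, which is why it converges quickly.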
Outline
• Overview
• Background
  – Cares needed for SIMD optimization
  – Related works
• Proposed method
  – Trigonometric functions
  – Inverse trigonometric functions
  – Exponential function
  – Logarithmic function
• Evaluation
  – Accuracy
  – Speed
  – Code size
• Conclusion
Evaluation
• Accuracy
  – We compared the output of the proposed method with the output of the MPFR library
  – We measured the evaluation accuracy within a few ranges for each function
• Speed
  – We compared the speed of the proposed methods, an FPU, and the MPFR library
  – We used a Core i7 920 (2.66 GHz) and a Core 2 Duo E7200 (2.53 GHz)
Accuracy
[Table: accuracy results for sin, cos and tan]
[Table: accuracy results for atan]
[Table: accuracy results for asin and acos]
Accuracy (contd.)
[Table: accuracy results for exp]
[Table: accuracy results for log]
In all cases, the error does not exceed 6 ulps
Speed
[Table: speed results]
The proposed method is about two times as fast as FPU calculation
Code size
• The total code size is very small
• Suitable for the Cell B.E.
  – which has only 256K bytes of directly accessible scratch pad memory in each SPE
Conclusion
• We proposed efficient methods for evaluating elementary functions in double precision
• They do not include table look-ups, scattering from or gathering into SIMD registers, or conditional branches
• The average and maximum errors were less than 0.67 ulp and 6 ulps, respectively
• The evaluation speed was faster than the FPUs on Intel Core 2 and Core i7 processors
Thank you
• An implementation of the proposed methods is now available as public domain software
• Contact: http://freshmeat.net/projects/sleef